This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. If you haven't given me the homework yet, you can just give it to me at the end of class; that's fine. Also, just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.

What I want to do today is wrap up our discussion on support vector machines. In particular, we'll talk about the idea of kernels, then about the L1 norm soft margin SVM, and then I'll talk about the SMO algorithm, which is an algorithm for solving the optimization problem that we posed last time.

To recap, we wrote down the following convex optimization problem. All of this is assuming that the data is linearly separable, which is an assumption that I'll fix later. With this optimization problem, given a training set, this will find the optimal margin classifier for the data set, the classifier that maximizes the geometric margin on your training examples.

In the previous lecture, we also derived the dual of this problem, which was to maximize this. This is the dual of our primal optimization problem. Here, I'm using these angle brackets to denote the inner product, so this is just xi transpose xj for vectors xi and xj. We also worked out that w would be given by the sum over i of alpha i yi xi. Therefore, when you need to make a prediction at classification time, you need to compute the value of the hypothesis applied to an input x, which is g of w transpose x plus b, where g is the threshold function that outputs plus one or minus one. And so this is g of a sum over i of alpha i yi times an inner product, plus b.
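The formulas written on the board are not captured in the transcript; the following LaTeX sketch reconstructs them, assuming the standard optimal margin formulation from the previous lecture.

```latex
% Primal (assumes linearly separable data):
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,\ \ i=1,\dots,m.

% Dual:
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)}\rangle
\quad\text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0.

% Optimal w and the prediction made at classification time:
w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)},\qquad
h_{w,b}(x) = g\Big(\sum_{i=1}^{m}\alpha_i y^{(i)}\langle x^{(i)}, x\rangle + b\Big).
```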
So the prediction can also be written entirely in terms of inner products between input vectors x. What I want to do now is talk about the idea of kernels, which will make use of this property, because it turns out that the only dependence of the algorithm on x is through these inner products. In fact, you can write the entire algorithm without ever explicitly referring to an x vector, only to inner products between input feature vectors.

The idea of a kernel is as follows. Let's say that you have an input attribute; let's just say for now it's a real number. Maybe this is the living area of a house that you're trying to make a prediction on, like whether it will be sold in the next six months. Quite often, we'll take this feature x and map it to a richer set of features. For example, we might take x and map it to these four polynomial features, and let me call this mapping Phi. So we'll let Phi of x denote the mapping from your original features to some higher dimensional set of features.

If you do this and you want to use the features Phi of x, then all you need to do is go back to the learning algorithm and, everywhere you see the inner product of xi and xj, replace it with the inner product between Phi of xi and Phi of xj. This corresponds to running a support vector machine with the features given by Phi of x rather than with your original one-dimensional input feature x.

In the scenario I want to consider, sometimes Phi of x will be very high dimensional. For example, Phi of x may contain very high degree polynomial features. Sometimes Phi of x will actually even be an infinite dimensional vector of features, and the question is: if Phi of x is extremely high dimensional, then it seems you can't actually compute these inner products very efficiently, because the computer needs to represent an extremely high dimensional feature vector and then take inner products with it, which would be inefficient.
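A minimal sketch of the mapping and substitution just described, assuming the four polynomial features are the usual powers of x:

```latex
\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \\ x^4 \end{bmatrix},
\qquad\text{and in the dual, replace }\ \langle x^{(i)}, x^{(j)}\rangle\ \text{ with }\ \big\langle \phi(x^{(i)}),\, \phi(x^{(j)})\big\rangle .
```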
It turns out that in many important special cases, we can write down what we'll call the kernel function, denoted by K, which will be the inner product between those feature vectors. There will be important special cases where computing Phi of x is computationally very expensive, maybe even impossible; it might be an infinite dimensional vector, and you can't represent infinite dimensional vectors. But nonetheless, you can compute the kernel between xi and xj, the inner product between these two feature vectors, very inexpensively. And so the idea of the support vector machine with kernels is that everywhere in the algorithm you see these inner products, we're going to replace them with a kernel function that you can compute efficiently, and that lets you work in feature spaces Phi of x even if Phi of x is very high dimensional.

Let me now say how that's done. A little bit later today, we'll actually see some concrete examples of Phi of x and of kernels. For now, let's just think about constructing kernels explicitly; this is best illustrated with an example.

Let's say you have two inputs, x and z. Normally I should write those as xi and xj, but I'm just going to write x and z to save on writing. Let's say my kernel is K of x, z equals x transpose z, all squared. This thing here is x transpose z, and this thing is x transpose z, so this is x transpose z squared, and that's equal to a sum over pairs of coordinates. And so this kernel corresponds to the feature mapping where Phi of x is equal to the vector of all pairwise products of the coordinates of x, which I'll write down for the case of n equals three. With this definition of Phi of x, you can verify for yourself that this kernel becomes the inner product between Phi of x and Phi of z, because to get the inner product between two vectors, you just take a sum of the corresponding elements of the vectors.
You multiply them together. So if this is Phi of x, then the inner product between Phi of x and Phi of z will be the sum over all the elements of this vector times the corresponding elements of Phi of z, and what you get is exactly that expression.

And so the cool thing about this is the cost of computing these quantities. If n is the dimension of x and z, then Phi of x is a vector of all pairs xi xj multiplied with each other, and so the length of Phi of x is n squared; you need order n squared time just to compute Phi of x. But to compute K, the kernel function, all you need is order n time, because the kernel function is defined as x transpose z squared, so you just take the inner product between x and z, which takes order n time, and you square that, and you've computed the kernel function. So you've just computed the inner product between two vectors that each have n squared elements, but you did it in order n time.

Student: For any kernel you find for x and z, does Phi exist for x and z?

Instructor (Andrew Ng): Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.

So let me just describe a couple of quick generalizations of this. One is that if you define K of x, z to be equal to x transpose z plus c, squared, then again you can compute this kernel in order n time, and that turns out to correspond to a feature vector where I'm just going to add a few more elements at the bottom: root 2 c x1, root 2 c x2, root 2 c x3, and c.
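A small Python check of this claim (illustrative, not from the lecture): the O(n) kernels (x^T z)^2 and (x^T z + c)^2 agree with the inner products of the corresponding explicit O(n^2)-length feature vectors.

```python
import numpy as np

def phi_sq(x):
    """Explicit feature map for (x'z)^2: all pairwise products x_i * x_j (length n^2)."""
    return np.outer(x, x).ravel()

def phi_sq_c(x, c):
    """Explicit feature map for (x'z + c)^2: pairwise products, sqrt(2c)*x_i terms, and c."""
    return np.concatenate([phi_sq(x), np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
x, z, c = rng.standard_normal(3), rng.standard_normal(3), 2.0

# O(n) kernel evaluations...
k1 = np.dot(x, z) ** 2
k2 = (np.dot(x, z) + c) ** 2
# ...match the O(n^2) explicit inner products.
assert np.isclose(k1, np.dot(phi_sq(x), phi_sq(z)))
assert np.isclose(k2, np.dot(phi_sq_c(x, c), phi_sq_c(z, c)))
print(k1, k2)
```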
And so this is a way of creating a feature vector with both the monomials, meaning the first order terms, as well as the quadratic terms, the products between xi and xj, and the parameter c here allows you to control the relative weighting between the monomial terms, the first order terms, and the quadratic terms. Again, this is still an inner product between vectors of length n squared, computed in order n time.

More generally, here are some other examples of kernels. Actually, a generalization of the one I just derived would be the kernel K of x, z equals x transpose z plus c, raised to the power of d. This corresponds to using all n plus d choose d features, all the monomials. Monomials just mean products like xi xj xk, all the polynomial terms up to degree d. That's on the order of n plus d to the power of d features, so this grows exponentially in d. This is a very high dimensional feature vector, but again, you never construct the feature vector explicitly; you implicitly take inner products between such vectors. It's very computationally efficient, because you just compute the inner product between x and z, add c, and take that real number to the power of d, and by plugging this in as a kernel, you're implicitly working in an extremely high dimensional feature space.

So what I've given is just a few specific examples of how to create kernels. Let me ask more generally: if you're faced with a new machine learning problem, how do you come up with a kernel? There are many ways to think about it, but here's one intuition that's sort of useful. Given a set of attributes x, you're going to use a feature vector Phi of x, and given a set of attributes z, you're going to use a feature vector Phi of z, and so the kernel is computing the inner product between Phi of x and Phi of z. And so one intuition, and this is only a partial intuition, not a rigorous one, is the following.
If x and z are very similar, then Phi of x and Phi of z will be pointing in roughly the same direction, and therefore the inner product will be large. Whereas in contrast, if x and z are very dissimilar, then Phi of x and Phi of z may be pointing in different directions, and so the inner product may be small. That intuition is not a rigorous one, but it's a useful one to think about.

If you're faced with a new learning problem, if I give you some random thing to classify and you want to decide how to come up with a kernel, one way is to try to come up with a function K of x, z that is large if you want the learning algorithm to think of x and z as similar, and small if you want it to think of them as dissimilar. Again, this isn't always true, but this is one of several intuitions. So if you're trying to classify some brand new thing, say [inaudible] or DNA sequences or something, some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think these are similar things and small when they're dissimilar.

And so this raises a question. Let's say I have something I want to classify, and let's say I write down a function that I think is a good measure of how similar or dissimilar x and z are for my specific problem. Let's say I write down K of x, z equals e to the minus this quantity here, and I think this is a good measure of how similar x and z are. The question, then, is whether this is really a valid kernel. In other words, to understand how we can construct kernels: if I write down a function like that, the question is whether there really exists some Phi such that K of x, z is equal to the inner product of Phi of x and Phi of z. That's the question of whether K is a valid kernel. It turns out that there is a result that characterizes necessary and sufficient conditions for when a function K that you might choose is a valid kernel.
Let me go ahead and show part of that result now. Suppose K is a valid kernel, and when I say K is a kernel, what I mean is that there does indeed exist some function Phi for which this holds true. Then let any set of points x1 up to xm be given, and let me define a matrix K. I apologize for overloading notation: I'm going to use K to denote both the kernel function, which is the function of x and z, as well as a matrix; unfortunately, there aren't enough letters in the alphabet. We define the kernel matrix to be the m-by-m matrix such that K subscript ij is equal to the kernel function applied to two of my examples, xi and xj.

Then it turns out that for any vector z that's m-dimensional, I want you to consider z transpose K z. By the definition of matrix multiplication, this is a double sum over i and j of zi Kij zj. Since Kij is the kernel function applied to xi and xj, and since I assumed that K is a valid kernel function, there must exist such a Phi, so Kij must equal the inner product between Phi of xi and Phi of xj. This is the inner product between two feature vectors, so let me just make that inner product explicit: I'm going to sum over the elements of this vector, and I'm going to use Phi of xi subscript k just to denote the k-th element of this vector. Then you just rearrange sums; you get a sum over k, and this next step may look familiar to some of you: it's a sum of squares, and it must therefore be greater than or equal to zero. Do you want to take a minute to look over all the steps and just make sure you buy them all?

Oh, this step is the inner product between the vectors Phi of xi and Phi of xj, and the inner product between two vectors is the sum over all the elements of the vectors of the corresponding elements multiplied together.

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes it is. This is just a transpose b equals the sum over k of ak bk, so that's just this.
This is the sum over k of the k-th elements of those vectors. Take a look at this and make sure it makes sense. Questions about this?

So just to summarize, what we showed was that for any vector z, z transpose K z is greater than or equal to zero, and this is one of the standard definitions of the matrix K being positive semidefinite, which we write as K greater than or equal to zero. So what was shown is that if K is a valid kernel, in other words if K is a function for which there exists some Phi such that K of xi, xj is the inner product between Phi of xi and Phi of xj, then the kernel matrix must be positive semidefinite.

It turns out that the converse is also true, and so this gives you a test for whether a function K is a valid kernel. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels; it means the same thing, it just means it's a valid kernel. Let K of x, z be given. Then K is a valid kernel, in other words a Mercer kernel, i.e. there exists a Phi such that K of x, z equals Phi of x transpose Phi of z, if and only if for any set of m examples, and this really means any set of m points, not necessarily a training set, just any set of m points you may choose, it holds true that the kernel matrix, the capital K that I defined just now, is symmetric positive semidefinite. I proved only one direction of this result: I proved that if K is a valid kernel, then the matrix K is symmetric positive semidefinite, but the converse I didn't show. It turns out that this is a necessary and sufficient condition.
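A small Python sketch (illustrative, not from the lecture) of using Mercer's theorem as a sanity check: build the kernel matrix for a few arbitrary points and verify it is symmetric positive semidefinite. The Gaussian kernel and the bandwidth value here are assumptions made for the demo.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); a valid (Mercer) kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_matrix(K, points):
    """Kernel matrix with entries K_ij = K(x_i, x_j)."""
    m = len(points)
    return np.array([[K(points[i], points[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
points = rng.standard_normal((5, 3))      # any set of m points, not necessarily a training set
Kmat = kernel_matrix(gaussian_kernel, points)

# Mercer's condition: symmetric and positive semidefinite (all eigenvalues >= 0, up to rounding).
print(np.allclose(Kmat, Kmat.T), np.linalg.eigvalsh(Kmat).min() >= -1e-10)
```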
So this gives you a useful test for whether any function that you might want to choose is a valid kernel. A concrete example of something that's clearly not a valid kernel would be a function where you can find an input x such that K of x, x is minus one, for example. That's clearly not a valid kernel, because minus one cannot possibly be equal to Phi of x transpose Phi of x, and so this would be one of many examples of functions that fail to meet the conditions of this theorem, because the inner product of a vector with itself is always nonnegative.

So just to tie this back explicitly to an SVM: to use a support vector machine with a kernel, what you do is choose some function K of x, z. It turns out the function I wrote down just now is indeed a valid kernel; it's called the Gaussian kernel because of its similarity to Gaussians. So you choose some kernel function like this, or you may choose x transpose z plus c to the power of d. To apply a support vector machine with a kernel, you choose one of these functions, and the choice depends on your problem; it depends on what is a good measure of whether two examples are similar or different for your problem. Then you go back to our formulation of the support vector machine, and you have to use the dual formulation, and everywhere you see these inner products, you replace them with K of xi, xj. You then run exactly the same support vector machine algorithm, only with the inner products replaced, and what you've just done is you've taken a support vector machine and implicitly replaced each of your feature vectors x with a very high dimensional feature vector. It turns out that the Gaussian kernel corresponds to a feature vector that's infinite dimensional.
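As a minimal sketch (illustrative only), here is the kernelized prediction being described: the Gaussian kernel, and a decision rule that only ever evaluates K(x_i, x), never the implicit infinite dimensional feature vector. The alphas and b are assumed to come from solving the dual.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def predict(x, X_train, y_train, alphas, b, kernel=gaussian_kernel):
    """g(sum_i alpha_i y_i K(x_i, x) + b), with g the +/-1 threshold function.

    X_train: (m, n) training inputs; y_train: labels in {-1, +1};
    alphas, b: assumed to be the solution of the (kernelized) dual problem.
    """
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, y_train, X_train)) + b
    return 1 if s >= 0 else -1
```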
Nonetheless, you can run a support vector machine in a finite amount of time, even though you're working with infinite dimensional feature vectors, because all you ever need to do is compute these kernel values; you don't ever need to represent the infinite dimensional feature vectors explicitly.

Why is this a good idea? I think I started off talking about support vector machines by saying that we wanted to develop non-linear learning algorithms. So here's one useful picture to keep in mind. Let's say you have one-dimensional input data; you just have one feature x in R. What a kernel does is the following: it takes your original input data and maps it to a very high dimensional feature space. In the case of Gaussian kernels, it's an infinite dimensional feature space, but for pedagogical reasons I'll draw two dimensions here. So it takes all your data in R1 and maps it to R infinity, and then you run a support vector machine in this infinite dimensional, or exponentially high dimensional, space, and you find the optimal margin classifier, in other words the classifier that separates your data in this very high dimensional space with the largest possible geometric margin. In this example, whereas your data was not linearly separable in your original one-dimensional space, when you map it to this much higher dimensional space, it becomes linearly separable, so you can use your linear classifier to separate data that was not separable in your original space. This is how support vector machines output nonlinear decision boundaries, and in the entire process, all you ever need to do is solve convex optimization problems.

Questions about any of this?

Student: [Inaudible] sigma?

Instructor (Andrew Ng): Yeah, so sigma is, let's see. Well, I was going to talk about that later.
One way to choose sigma is to set aside a small amount of your data: train an SVM using, say, two thirds of your data, try different values of sigma, and then see what works best on a separate hold-out cross validation set, a separate set that you test on. Some of the learning algorithms we talked about, like locally weighted linear regression with its bandwidth parameter, also have parameters like this, so there are a number of parameters in these algorithms whose values you can choose by setting aside some data to test on. I'll talk more about model selection later, more explicitly. Are there other questions?

Student: So how do you know that moving it up to a high dimensional space is going to give you that kind of separation?

Instructor (Andrew Ng): Good question. Usually, you don't know. Sometimes you can know, but in most cases you won't know in advance whether the data is actually going to be linearly separable, so the next topic will be soft margin SVMs, which is a version of SVMs that works even though the data is not linearly separable.

Student: If it can't be linearly separated by mapping to a higher dimension, couldn't you also just use an even higher dimension?

Instructor (Andrew Ng): Very right. This is a question about what to do if you can't separate it even in the higher dimensional space. Let me try to address that with the discussion of soft margin SVMs. Okay.

Student: What if you run an SVM algorithm that assumes the data are linearly separable on data that is not actually linearly separable?

Instructor (Andrew Ng): You guys are really giving me a hard time about whether the data's linearly separable. It turns out this algorithm won't work if the data is not linearly separable, but I'll change that in a second and make it work.
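A hedged sketch of the hold-out procedure just described, using scikit-learn's RBF-kernel SVC as the SVM implementation (an assumption; any SVM implementation would do). scikit-learn parameterizes the Gaussian kernel by gamma = 1/(2 sigma^2).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def choose_sigma(X, y, sigmas, C=1.0, seed=0):
    """Train on about 2/3 of the data for each candidate sigma and keep the one
    that scores best on the held-out 1/3."""
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=1/3, random_state=seed)
    scores = {}
    for sigma in sigmas:
        clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C)
        clf.fit(X_tr, y_tr)
        scores[sigma] = clf.score(X_ho, y_ho)   # accuracy on the hold-out set
    return max(scores, key=scores.get), scores

# Example usage on synthetic data with a nonlinear boundary:
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)
best_sigma, scores = choose_sigma(X, y, sigmas=[0.1, 0.3, 1.0, 3.0])
print(best_sigma, scores)
```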
Before I move on to talk about that, let me just say one final word about kernels. I talked about kernels in the context of support vector machines, and the idea of kernels is what really made support vector machines a very powerful learning algorithm. Towards the end of today's lecture, if I have time, I'll actually give a couple more examples of how to choose kernels as well.

It turns out that the idea of kernels is actually more general than support vector machines. In particular, we took this SVM algorithm and derived the dual, and that was what let us write the entire algorithm in terms of inner products. It turns out that you can take many of the other algorithms that you've seen in this class, in fact most of the linear algorithms we talked about, such as linear regression, logistic regression [inaudible], and rewrite them entirely in terms of these inner products. So if you have any algorithm that you can rewrite in terms of inner products, that means you can replace them with K of xi, xj, and that means you can take any of these algorithms, implicitly map the feature vectors into very high dimensional feature spaces, and have the algorithm still work. The idea of kernels is perhaps most widely used with support vector machines, but it is actually more general than that: you can take many of the other algorithms that you've seen, and many of the algorithms that we'll see later this quarter as well, write them in terms of inner products, and thereby kernelize them and apply them to infinite dimensional feature spaces. You'll actually get to play with many of these ideas more in the next problem set.

Let's talk about non-linear decision boundaries, and this is the idea of what's called the L1 norm soft margin SVM. Machine learning people sometimes aren't great at coming up with good names, but here's the idea. Let's say I have a data set.
This is a linearly separable data set, but what do I do if I have a couple of other examples there that make the data non-linearly separable? And in fact, sometimes even if the data is linearly separable, you might not want to separate it perfectly. For example, this is a very nice data set; it looks like there's a great decision boundary that separates the two classes. Well, what if I had just one outlier down here? I could still linearly separate this data set with something like that, but then I'm letting one slightly suspicious example skew my entire decision boundary by a lot. So what I'm going to talk about now is the L1 norm soft margin SVM, which is a slightly modified formulation of the SVM optimization problem that will let us deal with both of these cases: one, where the data is just not linearly separable, and two, where you have some examples that you'd rather not fit perfectly in the training set. Maybe with an outlier like this, you actually prefer to choose the original decision boundary and not try so hard to get that one training example right.

Here's the formulation. Our SVM primal problem was to minimize one half the norm of w squared. That's our original problem, and I'm going to modify it by adding the following: I'm going to add these penalty terms with slack variables ξi, and I'm going to demand that each of my training examples is separated with functional margin greater than or equal to one minus ξi. You remember, was it two lectures ago, that I said that if the functional margin is greater than zero, that implies you classified the example correctly, and if it's less than zero, then you misclassified it. By setting some of the ξi to be larger than one, I can actually have some examples with negative functional margin, and therefore I'm allowing my algorithm to misclassify some of the examples in the training set.
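The modified primal being written on the board is not captured in the transcript; here is a reconstruction, assuming the standard L1 norm soft margin formulation:

```latex
\min_{w,b,\xi}\ \ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1 - \xi_i,\qquad
\xi_i \ge 0,\quad i=1,\dots,m.
```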
However, I'll encourage the algorithm not to do that by adding to the optimization objective this penalty term that penalizes setting the ξi to be large. This is an optimization problem where the parameters are w, b, and all of the ξi, and it is also a convex optimization problem. It turns out that, similar to how we worked out the dual of the support vector machine, we can also work out the dual for this optimization problem. I won't actually do it, but just to show you the steps: you construct the Lagrangian, and I'm going to use alpha and r to denote the Lagrange multipliers corresponding to the set of constraints that we had previously and to this new set of constraints that the ξi are greater than or equal to zero. The Lagrangian will be the optimization objective, minus a sum of alpha i times each of the functional margin constraints, minus a sum of r i times ξi, each of those constraints being greater than or equal to zero.

I won't re-derive the entire dual again; it's really the same. When you derive the dual of this optimization problem and simplify, you find that you get the following: you have to maximize an objective that is actually the same as before. It turns out that when you derive the dual and simplify, the only way the dual changes compared to the previous one is that rather than the constraint that the alpha i are greater than or equal to zero, we now have the constraint that the alphas are between zero and C. This derivation isn't very hard, and you're encouraged to go home and try to do it yourself. It's really essentially the same math, and when you simplify, it turns out you can simplify the r multipliers away, and you end up with just these constraints on the alphas.
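Again the board formulas are not in the transcript; a reconstruction of the resulting dual, assuming the standard L1 soft margin form:

```latex
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)}\rangle
\quad\text{s.t.}\quad
0 \le \alpha_i \le C,\qquad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0.
```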
Just as an aside, here is something else I won't derive. Remember that I wrote down the KKT conditions in the last lecture, the necessary conditions for something to be an optimal solution to a constrained optimization problem. If you use the KKT conditions, it turns out you can actually derive convergence conditions. We want to solve this optimization problem, so when do we know the alphas have converged to the global optimum? It turns out you can use the following conditions. I don't want to say a lot about these; it turns out that from the KKT conditions you can derive these as the convergence conditions for an algorithm that you might choose to use to try to solve the optimization problem in terms of the alphas.

So that's the L1 norm soft margin SVM, and this is the change to the algorithm that lets us handle non-linearly separable data sets, as well as single outliers in data that may still be linearly separable but that you may choose not to separate perfectly. Questions about this? Raise your hand if this stuff makes sense at all. Great.

So the last thing I want to do is talk about an algorithm for actually solving this optimization problem. We wrote down this dual optimization problem with convergence criteria, so let's come up with an efficient algorithm to actually solve it. I want to do this partly to give me an excuse to talk about an algorithm called coordinate ascent, which is a useful algorithm to know about. It will turn out that it won't apply in its simplest form to this problem, but we'll then be able to modify it slightly, and that will give us a very efficient algorithm for solving this dual optimization problem. That was the other reason that I had to derive the dual: not only so that we could use kernels, but also so that we can apply an algorithm like the SMO algorithm. First, let's talk about coordinate ascent, which is another [inaudible] optimization algorithm.
To describe coordinate ascent, I just want you to consider the problem of maximizing some function W, which is a function of alpha one through alpha m, with no constraints. So for now, forget about the constraint that the alpha i must be between zero and C, and forget about the constraint that the sum of yi alpha i must be equal to zero. Then this is the coordinate ascent algorithm: repeat until convergence, and in the inner loop, for i equals one to m, hold all the parameters except alpha i fixed and maximize the function with respect to just that one parameter. Let me write that as: alpha i gets updated as the arg max over alpha i hat of W of alpha one, up through alpha i minus one, then alpha i hat, then alpha i plus one, and so on up to alpha m. This is really just a fancy way of saying: hold everything except alpha i fixed, and optimize the objective W with respect to only alpha i. This is coordinate ascent.

One picture that's kind of useful for coordinate ascent is to imagine you're trying to optimize a quadratic function; it really looks like this. These are the contours of the quadratic function, and the optimum is here. This is what coordinate ascent would look like. I'll call this axis alpha one and this axis alpha two. Let's say I start down here. Then I begin by optimizing with respect to alpha one, so I go there. Then, at my new point, I optimize with respect to alpha two, and I might go to someplace like that. Then I optimize with respect to alpha one again, then alpha two again, and so on. You're always taking these axis-aligned steps to get to the optimum. There are variations of this as well. The way I've described the algorithm, we're always doing this in alternating order: we optimize with respect to alpha one, then alpha two, then alpha one, then alpha two.
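A minimal Python sketch (illustrative, not from the lecture) of coordinate ascent taking exactly these axis-aligned steps on an unconstrained quadratic objective. The particular quadratic and the closed-form inner update are assumptions made for the demo.

```python
import numpy as np

# Objective to maximize: W(alpha) = -1/2 alpha' A alpha + b' alpha  (A positive definite),
# whose unique maximizer is alpha* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def coordinate_ascent(A, b, n_iters=50):
    alpha = np.zeros(len(b))
    for _ in range(n_iters):                 # repeat until (approximate) convergence
        for i in range(len(alpha)):          # inner loop: i = 1, ..., m in order
            # Hold all alpha_j (j != i) fixed; maximize W over alpha_i alone.
            # Setting dW/dalpha_i = 0 gives this closed-form axis-aligned step:
            alpha[i] = (b[i] - A[i, :] @ alpha + A[i, i] * alpha[i]) / A[i, i]
    return alpha

print(coordinate_ascent(A, b), np.linalg.solve(A, b))   # the two should agree closely
```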
What I'm about to say applies only in higher dimensions, but it turns out that if you have a lot of parameters, alpha one through alpha m, you may not choose to always visit them in a fixed order. You may choose which alpha to update next depending on which you think will allow you to make the most progress. If you have only two parameters, it makes sense to alternate between them; if you have higher dimensional parameters, sometimes you may choose to update them in a different order if you think doing so would let you make faster progress towards the maximum.

It turns out that, compared to some of the algorithms we saw previously, say Newton's method, coordinate ascent will usually take a lot more steps. But the chief advantage of coordinate ascent, when it works well, is that sometimes the optimization objective W is very inexpensive to optimize with respect to any one of your parameters, so even though coordinate ascent takes many more iterations than, say, Newton's method in order to converge, each iteration is very cheap. There are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter, and if that's true, then the inner loop of coordinate ascent, optimizing with respect to alpha i, can be done very quickly. It turns out that this will be true when we modify this algorithm to solve the SVM optimization problem. Questions about this?

Okay. Let's go ahead and apply this to our support vector machine dual optimization problem. It turns out that coordinate ascent in its basic form does not work here, for the following reason: we have constraints on the alpha i. Namely, recall from what we worked out previously that we have the constraint that the sum over i of yi alpha i must be equal to zero, and so if you fix all the alphas except for one, then you can't change that one alpha without violating the constraint.
If I just try to change Alpha one, then Alpha one is actually exactly determined as a function of the other Alphas, because the whole thing has to sum to zero.

The SMO algorithm, by the way, is due to John Platt, a colleague at Microsoft. The SMO algorithm, therefore, instead of trying to change one Alpha at a time, will try to change two Alphas at a time. This is called the SMO algorithm, for sequential minimal optimization, and the term minimal refers to the fact that we're choosing the smallest number of Alpha I's to change at a time, which in this case means we need to change at least two at a time.

So let me go ahead and outline the algorithm. Repeat until convergence: select two Alphas to change, some Alpha I and Alpha J, chosen by some heuristic — which just means a rule of thumb. Then hold all the Alpha K's fixed except Alpha I and Alpha J, and re-optimize W of Alpha with respect to Alpha I and Alpha J, subject to all the constraints.

It turns out the key step, which I'm going to work out, is this last one: how do you optimize W of Alpha with respect to the two parameters that you just chose to update, subject to the constraints? I'll do that in a second. You would keep running this algorithm until you have satisfied the convergence criteria up to Epsilon.

What I want to do now is describe how to do this step — how to optimize W of Alpha with respect to Alpha I and Alpha J — and it turns out that it's because you can do this extremely efficiently that the SMO algorithm works well. The inner loop of the SMO algorithm can be done extremely efficiently, so it may take a large number of iterations to converge, but each iteration is very cheap. Let's talk about that.
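Before working out that two-variable step, here is the outer loop just outlined, written as a rough Python skeleton; select_pair and optimize_pair are placeholders for the pair-selection heuristic and for the two-variable update derived next, so this sketch shows only the control flow, not a working solver.

```python
import itertools

def select_pair(alphas):
    # Placeholder rule of thumb: just cycle over all pairs (i, j).
    # Platt's paper uses a much smarter heuristic for choosing the pair.
    return itertools.combinations(range(len(alphas)), 2)

def optimize_pair(i, j, alphas):
    # Placeholder for the key step: re-optimize W(alpha) over
    # (alpha_i, alpha_j) subject to the constraints, and report whether
    # either value changed by more than some tolerance.
    return False

def smo_outer_loop(alphas, max_quiet_passes=3):
    # Keep sweeping until several full passes in a row produce no change,
    # i.e. the convergence criterion is met up to the tolerance.
    quiet_passes = 0
    while quiet_passes < max_quiet_passes:
        changed = sum(optimize_pair(i, j, alphas) for i, j in select_pair(alphas))
        quiet_passes = quiet_passes + 1 if changed == 0 else 0
    return alphas
```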
So in order to derive that update step, I'm going to use Alpha one and Alpha two as my example. I'm going to update Alpha one and Alpha two. In general, this could be any Alpha I and Alpha J, but just to make my notation on the board easier, I'm going to do the derivation for Alpha one and Alpha two, and the general case is completely analogous.

On every step of the algorithm, we have to respect the constraint that the sum over I of Alpha I YI is equal to zero — this is one of the constraints we had previously for our dual optimization problem. This means that Alpha one Y1 plus Alpha two Y2 must be equal to minus the sum of the remaining terms, which I'm going to denote by Zeta. We also have the constraint that the Alpha I's must be between zero and C. We had two constraints on our dual: this was one of them, and that was the other.

In pictures, the constraint that the Alpha I's are between zero and C is often called the box constraint, and so if I draw Alpha one and Alpha two, then I have the constraint that the values of Alpha one and Alpha two must lie within this box that ranges from zero to C.

And so the picture of the algorithm is as follows. I have the constraint that Alpha one Y1 plus Alpha two Y2 must equal Zeta, and so this implies that Alpha one must be equal to Zeta minus Alpha two Y2, over Y1, and what I want to do is optimize the objective with respect to this.
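Written out, the constraints just described for the pair (Alpha one, Alpha two) are, using y_i for the labels as before:

```latex
0 \le \alpha_i \le C \quad\text{(the box constraint)}, \qquad
\sum_{i=1}^{m} \alpha_i y_i = 0,
\qquad
\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{m} \alpha_i y_i = \zeta
\;\Longrightarrow\;
\alpha_1 = \frac{\zeta - \alpha_2 y_2}{y_1}.
```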
What I can do is plug in my definition of Alpha one as a function of Alpha two, and so this becomes W of Zeta minus Alpha two Y2 over Y1, Alpha two, Alpha three, and so on. And it turns out that because W is a quadratic function — if you look back to our earlier definition of W, you'll find it's a quadratic function in all the Alphas — if you look at this expression for W and view it as just a function of Alpha two, you find that it's a one-dimensional quadratic function of Alpha two, as long as you hold Alpha three, Alpha four and so on fixed. So this can be simplified to some expression of the form A Alpha two squared plus B Alpha two plus C. This is a standard quadratic function, and it's really easy to optimize. We know how to optimize quadratic functions like these — when did we learn this? High school, or undergrad, or something. You just do that, and that gives you the optimal value for Alpha two.

The last step, with a box constraint like this — just in pictures, you know your solution must lie on this line, and so there'll be some quadratic function over this line, say. If you optimize the quadratic function, maybe you get a value that lies inside the box, and if so, then you're done. Or, if your quadratic function looks like this, maybe when you optimize it you end up with a value outside the box, so you end up with a solution like that. If that happens, you clip your solution to map it back inside the box. That'll give you the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on this straight line — in other words, subject to the solution lying on the line segment within the box.
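In code, that "optimize a one-dimensional quadratic, then clip back into the box" step might look like the following sketch; the coefficients and interval endpoints here are hypothetical stand-ins for whatever the actual dual objective and box constraint give you.

```python
def maximize_quadratic_on_segment(a, b, c, lo, hi):
    """Maximize a*x^2 + b*x + c over the line segment [lo, hi]."""
    candidates = [lo, hi]
    if a != 0.0:
        vertex = -b / (2.0 * a)
        if lo <= vertex <= hi:
            # The unconstrained optimum already lies inside the box.
            candidates.append(vertex)
    # Otherwise the best value is at an endpoint: this is the "clipping"
    # that maps the solution back inside the box.
    return max(candidates, key=lambda x: a * x * x + b * x + c)

# Concave quadratic (a < 0), as the dual objective is when viewed as a
# function of alpha_2 alone:
print(maximize_quadratic_on_segment(-2.0, 3.0, 1.0, 0.0, 1.0))   # vertex 0.75 lies inside
print(maximize_quadratic_on_segment(-2.0, 3.0, 1.0, 0.0, 0.5))   # clipped to the endpoint 0.5
```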
Having solved for Alpha two this way, you can clip it if necessary to get it back within the box constraint, and then we have Alpha one as a function of Alpha two. This allows you to optimize W with respect to Alpha one and Alpha two quickly, subject to all the constraints, and the key step is really just a one-dimensional quadratic optimization, which you can do very quickly. That's what makes the inner loop of the SMO algorithm very efficient.

Student: You mentioned that in coordinate ascent we can change just one variable at a time, but in the SMO algorithm we change two at a time. How should I understand that?

Instructor (Andrew Ng): Right. As I run the optimization algorithm, I have to respect the constraint that the sum over I of Alpha I YI must be equal to zero, so this is a linear constraint that I didn't have when I was talking about coordinate ascent. Suppose I tried to change just Alpha one. Then I know that Alpha one must be equal to minus the sum from I equals two to M of Alpha I YI, divided by Y1, right? And so Alpha one can actually be written exactly as a function of Alpha two, Alpha three and so on through Alpha M. So if I hold Alpha two, Alpha three, Alpha four through Alpha M fixed, then I can't change Alpha one, because Alpha one is completely determined by the others. Whereas in contrast, if I choose to change Alpha one and Alpha two at the same time, then I still have the constraint, and so I know Alpha one and Alpha two must satisfy that linear constraint, but at least this way I can change Alpha one if I also change Alpha two accordingly to make sure the constraint stays satisfied.

Student: [Inaudible.]

Instructor (Andrew Ng): So how was Zeta defined? On each iteration, I have some setting of the parameters, Alpha one, Alpha two, Alpha three and so on, and I want to change Alpha one and Alpha two, say.
So from the previous iteration, let's say I had not violated the constraint, so it holds true, and so I'm just defining Zeta to be equal to this: Alpha one Y1 plus Alpha two Y2 must be equal to minus the sum from I equals three to M of Alpha I YI, and so I'm just defining that quantity to be Zeta.

Student: [Inaudible.]

Instructor (Andrew Ng): On every iteration, you change maybe a different pair of Alphas to update. Exactly how you choose which pair is something I don't want to go into in detail — I'll say a couple more words about that — but the basic outline of the algorithm is: on every iteration, you're going to select some Alpha I and Alpha J to update, like on this board, via some heuristic, and then you use the procedure I just described to actually update Alpha I and Alpha J. What I just talked about was the procedure to optimize W with respect to Alpha I and Alpha J; I didn't talk about the heuristic for choosing Alpha I and Alpha J.

Student: What is the function W?

Instructor (Andrew Ng): W of Alpha is way up there; I'll just write it again. W of Alpha is that function we had previously. All of this is about solving the optimization problem for the SVM, so W of Alpha is the objective function of the dual problem we had.
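For reference, that objective — the SVM dual from the previous lecture, with the angle brackets again denoting inner products — is:

```latex
\max_{\alpha}\; W(\alpha) \;=\; \sum_{i=1}^{m} \alpha_i
\;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
y_i\, y_j\, \alpha_i\, \alpha_j\, \langle x_i, x_j \rangle
\quad\text{s.t.}\quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0 .
```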
Student: When you change one of the Alphas and optimize it, the other one that you have to change along with it could get worse, right?

Instructor (Andrew Ng): What do you mean by worse?

Student: It will get farther from its optimal value.

Instructor (Andrew Ng): Let me put it differently. What we're trying to do is optimize the objective function W of Alpha, so the metric of progress we care about is whether W of Alpha is getting better on every iteration, and what is true for coordinate ascent and for SMO is that on every iteration, W of Alpha can only increase. It may stay the same or it may increase; it can't get worse. It's true that eventually the Alphas will converge to some value, and it's true that in intervening iterations one of the Alphas may move further away from and then closer to its final value, but what we really care about is that W of Alpha is getting better every time, which is true.

Just a couple more words on SMO before I wrap up this topic. One is that John Platt's original paper describes a heuristic for choosing which pairs, Alpha I and Alpha J, to update next. It's one of those things that's not conceptually complicated, but it's very complicated to explain in words, so I won't talk about it here. If you want to learn about it, go ahead and look up John Platt's paper on the SMO algorithm; the paper is pretty easy to read. And later on, we'll also post a handout on the course homepage with a simplified version of this algorithm that you can use in the problem sets, so you can see some of this worked out in more detail there.

One other thing that I didn't talk about was how to compute the parameter B. Solving the dual gives you all the Alphas, and the Alphas are what allow us to get W, but I didn't talk about how to compute B. It turns out that, again, it's actually not very difficult, and I'll let you read about that yourself in the notes that we'll post along with the next problem set.

To wrap up today's lecture, what I want to do is briefly tell you about a couple of examples of applications of SVMs.

Let's consider the problem of handwritten digit recognition. In handwritten digit recognition, you're given a pixel array with a scanned image of, say, a zip code. This is an array of pixels, and some of these pixels will be on and other pixels will be off, and some combination of pixels being on may represent the character one. The question is: given an input feature vector like this — if you have, say, ten pixels by ten pixels, then you have a hundred-dimensional feature vector — can you classify the digit?
Maybe these 100 features are binary, with each XI in {0, 1}, or maybe the X's are grayscale values corresponding to how dark each of these pixels is.

It turns out that for many years, a neural network was the champion algorithm for handwritten digit recognition. And it turns out that you can apply an SVM with the following kernel: either the polynomial kernel or the Gaussian kernel works fine for this problem, and just by writing down this kernel and throwing an SVM at the problem, the SVM gave performance comparable to the very best neural networks.

This is surprising because the support vector machine doesn't take into account any knowledge about the layout of the pixels. In particular, it doesn't know that this pixel is next to that pixel, because it's just representing the pixel intensity values as a vector, and so the performance of the SVM would be the same even if you were to shuffle all the pixels around. And yet its performance was comparable to the very best neural networks, which had been under very careful development for many years.
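As a rough illustration of "write down a kernel and throw an SVM at it," here is a minimal sketch using scikit-learn — an assumption for illustration, not something used in the lecture — with a Gaussian (RBF) kernel on flattened pixel vectors. The dataset is scikit-learn's small 8x8 digits set, and the hyperparameter values are placeholder guesses rather than tuned choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 grayscale digit images, flattened into 64-dimensional pixel vectors.
# The classifier only ever sees this flat vector, so consistently shuffling
# the pixel order would not change its accuracy.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2).
# A polynomial kernel (kernel="poly") would also be a reasonable choice.
clf = SVC(kernel="rbf", gamma=0.001, C=10.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```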
I want to tell you about one other cool example, which is that SVMs are also used to classify other, fairly esoteric objects. So for example, let's say you want to classify protein sequences into different classes of proteins. Every time I do this, I suspect the biologists in the room cringe, so I apologize for that. There are 20 amino acids, and proteins in our bodies are made up of sequences of amino acids. Even though there are only 20 amino acids and 26 letters in the alphabet, I'm going to denote amino acids by the letters A through Z, with apologies to the biologists. So here's an amino acid sequence, represented by a series of letters.

Suppose I want to assign this protein to one of a few classes, depending on what type of protein it is. The question is: how do I construct my feature vector?

This is challenging for many reasons, one of which is that protein sequences can be of different lengths. There are some very long protein sequences and some very short ones, and so you can't have a feature saying what the amino acid in the 100th position is, because maybe there is no 100th position in this protein sequence. Some are very long; some are very short.

So here's my feature representation: I'm going to write down all possible combinations of four letters — AAAA, AAAB, AAAC, down to AAAZ, and then AABA, and so on. You get the idea: all possible strings of four letters. And my feature representation will be this: I'm going to scan through the sequence of amino acids and count how often each of these length-four substrings occurs. So for example, BAJT occurs twice, so I'll put a two there; none of these other strings occur, so I'll put a zero there; I guess I have a one here and a zero there. This very long vector will be my feature representation for the protein, and this representation applies no matter how long my protein sequence is.

How large is this? Well, it turns out this is going to be a vector in R to the 20 to the fourth, and so you have a 160,000-dimensional feature vector, which is reasonably large, even by modern computer standards. Clearly, we don't want to explicitly represent these high-dimensional feature vectors: imagine you have 1,000 examples and you store all of this as doubles — even on modern-day computers, this is big.

It turns out that there's an efficient dynamic programming algorithm that can compute inner products between these feature vectors, and so we can apply this feature representation, even though it's a ridiculously high-dimensional feature vector, to classify protein sequences. I won't talk about that algorithm here; if any of you have seen dynamic programming algorithms for finding subsequences, it's kind of reminiscent of those. You can look those up if you're interested.
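To make the counting representation concrete, here is a small sketch that computes the inner product of two sequences' length-four substring count vectors without ever building the 160,000-dimensional vectors explicitly. This treats the four-letter patterns as contiguous substrings, which is one reading of the feature map just described; the more general string kernels that allow gaps are what need the dynamic programming machinery alluded to above. The example sequences are made up.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count how often each length-k substring occurs in seq.

    Only substrings that actually occur are stored, so we never build
    the full 20^4-dimensional feature vector."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_kernel(s, t, k=4):
    """Inner product <phi(s), phi(t)> of the substring-count features."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the inner product.
    return sum(count * ct[kmer] for kmer, count in cs.items())

s = "BAJTSARBAJTK"   # made-up sequence; "BAJT" occurs twice
t = "QQBAJTLL"       # made-up sequence; "BAJT" occurs once
print(kmer_counts(s)["BAJT"])   # 2
print(kmer_kernel(s, t))        # number of matching 4-mer pairs across s and t
```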
This is just another example of a cool kernel, and more generally, if you're faced with some new machine learning problem, sometimes you can choose a standard kernel like the Gaussian kernel, and sometimes there are research papers written on how to come up with a new kernel for a new problem.

Two last things I want to say. Where are we now? That wraps up SVMs, which many people would consider one of the most effective off-the-shelf learning algorithms, and so as of today, you've actually seen a lot of learning algorithms. I want to close this class by saying congratulations — you're now well qualified to actually go and apply learning algorithms to a lot of problems. We're still in week four of the quarter, so there's more to come. In particular, what I want to do next is talk about how to really understand learning algorithms — when they work well and when they work poorly — and how to take the tools you now have and use them really well. We'll start to do that in the next lecture. Thanks.