This presentation is delivered by the Stanford Center for Professional
Development. Welcome back. If you haven't given me the homework yet,
you can just give it to me at the end of class. That's fine. Let's see. And also just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.
What I want to do today is wrap up our discussion on support vector
machines
and in particular we'll also talk about the idea of kernels
and then talk about soft margin SVMs,
and then I'll talk about
the SMO algorithm, which is an algorithm for solving the
optimization problem that we posed last time.
To recap, we
wrote down
the following convex optimization problem.
All this is assuming that the data is linearly separable, which is an assumption that I'll
fix later,
and so with this optimization problem, given a training set,
this will find
the optimal margin
classifier for the data set that maximizes this
geometric margin from your training
examples.
And so in the previous lecture, we also derived the dual of this problem,
which was to maximize
this. And this is the dual of our
primal optimization problem.
Here, I'm using these angle brackets to denote inner product,
so
this is just XI transpose XJ
for vectors XI
and XJ. We also worked out that
W would be given by the sum over I of alpha I
YI XI. Therefore, when you need to make a prediction at classification
time, you
need to compute the
value of the hypothesis applied to an input X, which is G of W
transpose X plus B, where G is that threshold function that outputs plus one and minus one.
And so this is G of the
sum over I of alpha
I YI times the inner product between XI and X, plus B. So the hypothesis can also be written
entirely in terms of inner
products between input
vectors X. So what I
want to do now is talk about the idea of kernels, which will make use of this
property, because it turns out
the only dependence of the algorithm on X
is through these inner products.
In fact, you can write the entire algorithm
without ever explicitly referring to
an X vector; you refer only to inner products
between input feature vectors.
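To make that concrete, here is a minimal Python sketch (not something from the lecture) of what prediction looks like when it is written purely in terms of inner products; the names X_train, y, alpha, and b are placeholders for the training inputs, labels, dual variables, and intercept from the dual solution.

```python
import numpy as np

def svm_predict(x, X_train, y, alpha, b):
    """Dual-form prediction: g(sum_i alpha_i * y_i * <x_i, x> + b).

    X_train is (m, n), y is (m,) with labels in {-1, +1},
    alpha is (m,) dual variables, b is the intercept.
    """
    # The only way the new input x enters is through the inner products <x_i, x>.
    inner_products = X_train @ x                 # shape (m,)
    score = np.sum(alpha * y * inner_products) + b
    return 1 if score >= 0 else -1
```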
And the idea of a kernel is as follows:
let's say that you have an input attribute.
Let's just say for now it's a real number. Maybe this is the
living area of a house
that you're trying to make a prediction on, like whether it will be sold in the next six months.
Quite often, we'll take this
feature X and we'll map it
to a richer set of features. So for example, we will take X and map it
to these four polynomial features, and let me actually call this mapping Phi. So we'll
let Phi of X denote
the mapping from your original features to
some higher dimensional set of features.
So if you do this and you want to use the features Phi of X,
then
all you need to do is go back to the learning algorithm
and everywhere you see
XI, XJ, we'll replace it
with
the inner product between Phi of XI and
Phi
of XJ. So this corresponds to running a support vector machine
with the features given by Phi of X rather than with your original
one-dimensional input feature X.
And in a scenario that I want to consider,
sometimes Phi of X
will be very high dimensional,
and in fact, for example, Phi of X may contain
very high degree polynomial features.
Sometimes Phi of X will actually even be an infinite dimensional vector of features,
and the question is, if Phi of X is extremely high dimensional,
then it seems you can't actually compute these inner products very
efficiently, because the computer needs to
represent an extremely high dimensional feature vector and then
take the inner product, which would be inefficient. It
turns out that
in many important special cases, we can write down what we'll call the kernel function,
denoted by K,
of X and Z,
which is the inner product between those feature
vectors Phi of X and Phi of Z. It turns out there will be important special cases where
computing Phi of X is computationally very expensive, maybe even impossible;
if it's an infinite dimensional vector,
you can't compute it explicitly at all.
There will be important special cases where Phi of X is very expensive to represent because it
is so high dimensional,
but nonetheless, you can actually compute a kernel between
XI and XJ. You can compute the inner product between these two vectors
very inexpensively.
And so the idea of the support vector machine is that
everywhere in the algorithm that you see these inner products,
we're going to replace it
with a kernel function that you can compute efficiently,
and that lets you work in feature
spaces Phi of X even if Phi of X
are very high dimensional. Let
me now say how that's done. A
little bit later today, we'll actually see some concrete examples of Phi of X and
of kernels. For now, let's just think about constructing kernels explicitly.
This is best illustrated by an example.
Let's say you have
two inputs, X and Z. Normally I should write those as XI and XJ, but I'm just going to
write X and Z to save on writing.
Let's say my kernel is K of X, Z
equals
X transpose Z squared.
And so this is, right, X transpose Z times X transpose Z, so this is X
transpose Z squared.
And if you expand out that product, that's equal
to the sum over I and J of XI XJ times ZI ZJ.
And so this kernel corresponds
to the feature mapping where Phi
of X is equal to the vector of all the products XI XJ, and I'll write
this down for the case of N equals three, I guess.
And so with this definition of Phi of X,
you can verify for yourself that this thing
becomes the
inner product between Phi of X
and Phi of Z, because to get an inner product between two vectors
is " you can just take a sum of the corresponding elements of the vectors. You
multiply them. So
if this is Phi of X,
then the inner product between Phi of X and Phi of Z will be
the sum
over all the
elements of this vector times the corresponding elements of Phi of Z,
and what you get is this one.
And so the cool thing about this is the time it takes. If N
is the dimension of X and Z, then
Phi of X is the vector of all pairs XI XJ multiplied with
each other,
and so the length of Phi of X is N squared. You need order N
squared time just to compute Phi of X. But
to compute K
" to compute the kernel function,
all you need is order N time,
because the
kernel function is defined as X transpose Z squared, so you just take the inner product
between X and Z, which is order N time and you square that
and you've computed this
kernel function,
and so you just computed the inner product between two vectors where each vector
has N squared elements, but
you did it in order N time.
Student: For any kernel you find for X and Z,
does Phi exist for X and Z? Instructor (Andrew
Ng): Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.
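As a quick numerical check of what was just said, here is a small Python sketch (my own, not from the lecture) comparing the O(n) computation of K(x, z) = (x transpose z) squared with the O(n squared) computation that goes through the explicit feature vector Phi:

```python
import numpy as np
from itertools import product

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2: all n^2 products x_i * x_j."""
    return np.array([x[i] * x[j] for i, j in product(range(len(x)), repeat=2)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

k_fast = (x @ z) ** 2              # O(n): one inner product, then square it
k_slow = phi(x) @ phi(z)           # O(n^2): build the n^2-dimensional vectors first
assert np.isclose(k_fast, k_slow)  # both give the same number
```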
So
let me just describe a couple of quick generalizations to
this.
One is that if
you define K of X, Z to be equal to
X transpose Z plus C, all squared, so again, you can compute this kernel
in order N time,
then that turns out to correspond to a feature vector
where I'm
just going to add a few more elements at the bottom:
root 2C times X1, root 2C times X2, root 2C times X3,
and then C.
And so
this is a way of creating a feature vector with both the monomials,
meaning the first order terms,
as well as the quadratic terms, the products
between XI and XJ,
and the parameter C here
allows you to control the relative weighting between the monomial
terms, so the first order terms,
and the quadratic
terms. Again, this is still an inner
product between vectors of length N squared,
but computed in order N time.
More generally, here are some other examples of kernels.
Actually, a
generalization of the one I just derived right now would be the following kernel: K of X, Z equals X transpose Z plus C, raised to the power of D.
And so this corresponds to using all N plus D choose D
features,
all the
monomials. Monomials just mean products like XI XJ XK,
all the polynomial terms up
to degree D,
and there are on the order of
N plus D to the power of D of them, so this grows exponentially in D.
This is a very high dimensional feature vector,
but again, you can implicitly construct the feature vector and take
inner products between them.
It's very computationally efficient, because you just compute the inner product
between X and Z, add C and you take that real number to the power of D
and by plugging this in as a kernel, you're implicitly working in an extremely
high dimensional feature space.
So what
I've given is just a few specific examples of kernels and how to create them. Let me ask
more generally: if you're faced with a new machine learning problem,
how do you come up with a kernel? There
are many ways to think about it, but here's one intuition that's sort of
useful.
So given a set of attributes of X, you're
going to use a feature vector of Phi of X
and given a set
of attributes Z, you're going to use an input feature vector Phi of Z,
and so the kernel is computing
the inner product between Phi of X and Phi of Z.
And so
one intuition is this. It's
a partial intuition; it
isn't a fully rigorous intuition, but it's one that's useful. It
is that
if X and Z are very similar,
then Phi of X and Phi of Z will be pointing in the same direction,
and therefore the inner product would be large.
Whereas in contrast, if
X and Z are very dissimilar,
then Phi of X and Phi of Z may be pointing different directions,
and so the inner product may be small.
That intuition is not a rigorous one, but it's sort of a useful one to think about.
If you're faced with a new learning problem " if I give you some random
thing to classify
and you want to decide how to come up with
a kernel,
one way is
to try to come up with the function
K of X, Z that
is large
if you want the learning algorithm to think of X and Z as similar,
and small if you want it to think of X and Z as dissimilar.
Again, this isn't always true, but this is one of several intuitions.
So if you're trying to classify some brand new thing, say
DNA sequences or something,
some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think
two things are similar
and small when you want it to think they are dissimilar.
And so this
raises the following question. Let's
say I have something I want to classify, and let's say I write down
a function
that I think is a good measure of how similar or dissimilar X and Z
are for my specific problem.
Let's say I write down K of X, Z equals E to the
minus the norm of X minus Z squared, over two sigma squared.
Let's say I write down this function,
and I think this is a good measure of how similar X and Z are.
The question, then, is whether this is really a valid
kernel. In other words, to understand how we can construct kernels, if I write down a
function like that,
the question is does there really exist some Phi
such that
KXZ
is equal to
the inner product?
And that's the
question of is K a valid kernel.
It turns out that
there is a result that characterizes necessary and sufficient
conditions for
when a function K that you might choose
is a valid kernel.
Let me go ahead and show part of that result now.
Suppose K is a valid kernel,
and when I say K is a kernel, what I mean is there does indeed exist some
function Phi for which this holds true.
Then
let any set of points
X1 up to XM be given. Let
me define a matrix K.
I apologize for overloading notation. K I'm going
to use to denote both the
kernel function, which is the function of X and Z
as well as a matrix. Unfortunately,
there aren't enough letters in the alphabet.
Well, that's not quite true. We define
the kernel matrix to
be an M-by-M matrix such that K subscript IJ is equal to the
kernel function
applied to two of my examples, XI and XJ.
Then it
turns out that for any vector Z that's M-dimensional, I want
you to consider Z transpose
KZ.
By
definition of matrix multiplication, this is
the sum over I and J of ZI KIJ ZJ,
and
KIJ is the kernel function applied to XI and XJ. I assumed that K is
a valid kernel function,
and so KIJ must equal
Phi of XI transpose
Phi of XJ.
This is the inner product between two feature vectors, so let me just make that inner product
explicit. I'm going
to sum over the elements of this vector,
and I'm going to use Phi of XI subscript K just to denote the K-th
element of this vector,
so this
becomes the sum over I, J and K of ZI Phi of XI subscript K, Phi of XJ subscript K, ZJ. Just
rearrange the sums, and you get the
sum over K of the quantity sum over I of ZI Phi of XI subscript K, all squared. This next step may look familiar to some of you.
Therefore,
this is a sum of squares,
and it must therefore be greater than or equal to zero.
Do you want to take a minute to look over all the steps and just make
sure you buy
them all? Oh, this is
the inner product between the vectors Phi of XI and Phi
of XJ,
and the
inner product between two vectors
is the sum over all the elements of the products of the corresponding elements.
Student: [Inaudible]. Instructor (Andrew Ng): Oh, yes it is. This is just A
transpose B equals the sum over K of AK BK, so that's
just this: the sum over K of the
K-th elements of these vectors.
Take a look at this and make sure
it makes sense. Questions about
this? So just to
summarize,
what we showed was that
for any vector Z,
Z transpose KZ
is greater than or equal to zero,
and this is one of the standard definitions of
the matrix K being positive semidefinite,
which is written as K greater than or equal to
zero. Just to summarize, what was shown is that if
K
is a
valid kernel " in other words, if K is a function
for which there exists
some Phi such that
K of XI
XJ is the inner product between Phi of XI and Phi of XJ.
So if K is a valid kernel, we showed, then, that the kernel matrix
must be positive semidefinite.
It turns out that the
converse is also true,
and so this gives you a test for
whether a function K is a valid kernel.
So this is
a theorem due to Mercer, and so kernels are also sometimes called Mercer
kernels. It means the same thing. It just means it's a valid kernel. Let
K of XZ be given.
Then K
is a valid
kernel "
in other
words, it's a Mercer kernel, i.e., there exists a Phi
such that
KXZ
equals Phi of X
transpose
Phi of Z " if and only if for any set of M examples,
and this really means for any set of M points. It's not necessarily a training set.
It's just any set of M points you may choose. It
holds true that the kernel
matrix,
capital K that I defined just now,
is
symmetric positive semidefinite.
And so I proved only one direction of this result. I proved that if it's a
valid kernel, then K
is symmetric
positive semidefinite, but the converse I didn't show. It turns out that this is
a necessary and sufficient condition. And
so this gives you a useful test for
whether any function that you might want to choose
is a
kernel. A concrete example of something that's clearly not a
valid kernel would
be
if you find an input X such that K of X, X
is minus one, for example, then this is an example of something that's clearly
not a valid kernel,
because minus one cannot possibly be equal to Phi of X
transpose Phi of X,
and so this would be one of many examples of functions that
will fail to meet the conditions of this theorem,
because the inner product of a vector with itself is always greater than or equal to zero.
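Here is a small Python sketch of the test this theorem suggests; it builds the kernel matrix on a handful of points and checks that it is symmetric positive semidefinite. Note that this only checks a necessary condition on one finite set of points, so passing it doesn't prove validity; the function and variable names are just illustrative.

```python
import numpy as np

def is_plausible_kernel(K_fn, points, tol=1e-8):
    """Build the kernel matrix on a set of points and test that it is
    symmetric positive semidefinite (Mercer's condition on this set)."""
    m = len(points)
    K = np.array([[K_fn(points[i], points[j]) for j in range(m)] for i in range(m)])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)  # eigenvalues of the symmetric matrix
    return symmetric and psd

# Example: (x^T z)^2 passes on random points; any "kernel" with K(x, x) < 0 would fail.
pts = [np.random.randn(3) for _ in range(5)]
print(is_plausible_kernel(lambda x, z: (x @ z) ** 2, pts))
```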
So just to tie this back explicitly to an
SVM,
let's say to use a support vector machine with a kernel,
what you do is you choose some function K of XZ,
and it turns out that the function I wrote down just now is,
indeed, a valid kernel. It
is called the Gaussian kernel
because of its similarity to
Gaussians.
So you choose some kernel function like this, or you may choose X
transpose Z plus
C,
raised to the power of D. To apply a support vector machine with a kernel, you choose one of these
functions,
and the choice would depend on your problem.
It depends on what is a good measure of when two examples are
similar and when two examples are different for your problem.
Then you go back to our formulation of support vector machine,
and you have to use the dual formulation,
and then everywhere you see
the inner product between XI and XJ, you
replace it with K of XI,
XJ.
And you then run exactly the same support vector machine algorithm, only everywhere
you see these inner products, you replace them with that,
and what you've just done is you've taken a support vector machine
and you've taken each of your feature vectors X
and you've replaced it with implicitly a very high dimensional feature vector.
It turns out that the Gaussian kernel corresponds to a feature vector that's
infinite dimensional.
Nonetheless, you can run a support vector machine in a finite amount of time,
even though you're working with
infinite dimensional feature vectors,
because all you ever need to do is
compute these things, and you don't ever need
to represent these infinite dimensional feature vectors
explicitly.
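For concreteness, here is a minimal Python sketch (not from the lecture) of the Gaussian kernel and of the kernelized decision function, assuming you already have the alphas and b from solving the dual:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); corresponds to an
    infinite dimensional feature map, but costs only O(n) to evaluate."""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def kernel_predict(x, X_train, y, alpha, b, kernel=gaussian_kernel):
    """Same dual-form prediction as before, with <x_i, x> replaced by K(x_i, x)."""
    scores = np.array([kernel(xi, x) for xi in X_train])
    return 1 if np.sum(alpha * y * scores) + b >= 0 else -1
```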
Why
is this a good idea? It turns out "
I think I started off talking about support vector machines. I started
saying that we wanted to start to develop non-linear learning algorithms.
So here's one useful picture to keep in mind, which is that
let's say your original data.
I didn't mean to draw that slanted. Let's say you have one-dimensional input
data. You just have one feature X in R. What a kernel
does is the following. It takes your original input data and maps it to a
very high dimensional feature space. In the case of Gaussian kernels, an infinite dimensional feature space; for pedagogical reasons, I'll draw
two
dimensions here.
So it maps your data to a very high dimensional feature space,
like so. It takes all your data in R1 and maps it to R infinity,
and then you run a
support vector machine in this infinite dimensional space, or in this exponentially
high dimensional space,
and you'll find the optimal margin classifier,
in other words, the
classifier that separates your data in this very high dimensional space
with the largest possible geometric margin.
In this example that I just drew, anyway,
whereas your data was not linearly separable in your original one-dimensional space,
when you map it to this much higher dimensional space, it becomes linearly separable, so you
can use your
linear classifier
to separate data that was not linearly separable in your original space.
This is how support vector machines
output nonlinear decision boundaries,
and in the entire process, all you ever need to do is solve convex optimization
problems.
Questions about any of this?
Student: [Inaudible] sigma? Instructor
(Andrew Ng): Yeah, so sigma. Let's
see. Well, I was going to talk about that
later. One way
to choose sigma is to
set aside a small amount of your data and
train an SVM using,
say, two thirds of your data.
Try different values of sigma, then see what works best on a separate
held-out
cross validation set, a separate set that you're testing on.
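Here is a rough Python sketch of that hold-out procedure; train_fn and error_fn are hypothetical stand-ins for whatever SVM trainer and evaluation routine you are using:

```python
def choose_sigma(train_fn, error_fn, X, y, candidate_sigmas):
    """Hold-out selection: train on roughly 2/3 of the data for each candidate
    sigma and keep the value with the lowest error on the held-out third."""
    m = len(X)
    split = (2 * m) // 3
    X_tr, y_tr = X[:split], y[:split]
    X_ho, y_ho = X[split:], y[split:]
    best_sigma, best_err = None, float("inf")
    for sigma in candidate_sigmas:
        model = train_fn(X_tr, y_tr, sigma)   # e.g. an SVM with a Gaussian kernel
        err = error_fn(model, X_ho, y_ho)     # error on the hold-out set
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma
```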
Something similar comes up for other learning algorithms we've talked about:
locally weighted linear regression has a bandwidth parameter, so there are a
number of parameters in some of these algorithms that you
can choose by setting aside some data to test on. I'll
talk more about model selection later. Are there other questions?
Student: So
how do you know that
moving it up to high dimensional space
is going to give you that kind of separation? Instructor (Andrew
Ng):Good question.
Usually, you don't know in advance. Sometimes you can know, but
in most cases, you won't know whether the data is actually going to be
linearly
separable, so the next topic will be soft margin SVMs, which is a way to make
SVMs work even though the data is not linearly separable.
Student: If it can't be
linearly
separated by mapping to a higher dimension, couldn't you
also just use an even
higher
dimension? Instructor (Andrew Ng): Right. This is a question about what to do if you can't
separate it in the higher dimensional
space. Let me address that with the discussion of soft margin SVMs.
Okay. Student:What
if
you run an SVM algorithm
that assumes the data are linearly
separable on data that
is not actually linearly separable? Instructor
(Andrew Ng):You guys are really giving me a hard time about whether the
data's linearly
separable. It turns out this algorithm won't work if the data is not linearly separable, but
I'll change that in a second and make it work.
Before I move on to talk about that,
let me just say one final word about kernels,
which is that
I talked about kernels
in the context of support vector machines,
and the idea of kernels was what really made support vector machines a very
powerful learning algorithm,
and actually towards the end of today's lecture if I have time, I'll actually
give a couple more concrete examples of how to choose kernels as well.
It turns out that the idea of kernels is actually more general than support vector
machines,
and in particular,
we took this SVM algorithm and we derived a dual, and that was what let us write
the entire algorithm
in terms of these inner products.
It turns out that you can take many of the other algorithms that you've seen
in this class " in fact, it turns
out you can take most of the linear algorithms we talked about, such as linear
regression, logistic regression
and so on,
and it turns out you can take all of these algorithms and rewrite them entirely
in terms of these inner products.
So if you have any algorithm that you can rewrite in terms of inner products, then that
means you can replace it with K of
XI XJ
and that means that you can take any of these algorithms and
implicitly map the feature vectors to these very high dimensional feature spaces
and have the algorithm
still work. The idea of kernels is perhaps most widely used with support vector
machines, but it is actually more general than that,
and you can take many of the other algorithms that you've seen and many of the algorithms that
we'll see later this quarter as well
and write them in terms of inner products and thereby kernelize them and apply them to
infinite dimensional feature spaces.
You'll actually get to play with many of these ideas more in the next problem
set.
Let's talk about non-linear decision boundaries, and this is
the idea of
" it's called the L1 norm soft
margin SVM. Machine only people sometimes aren't great at coming up with good
names, but here's
the idea.
Let's say I have a data
set. This is a linearly separable data
set, but what do I do if I have a couple of other examples there that make the data
nonlinearly separable?
And in fact,
sometimes even if the data is linearly separable, you might not want to separate it perfectly.
So for example, this is a very nice data set. It looks like
there's a great decision boundary that separates the two classes. Well,
what if I had just one outlier
down here? I could
still linearly separate this data set
with something like that,
but I'm somehow letting one
slightly suspicious example
skew my entire decision boundary by a lot,
and so
what I'm going to talk about now is the L1 norm soft margin SVM, which is
a slightly modified formulation of the SVM optimization problem.
This will let us deal with both of these cases: one, where the data
is just not linearly separable,
and two, where you have some examples that you'd rather
not try so hard to classify correctly. Maybe
with an outlier here, maybe you actually prefer to
choose the original decision boundary and not try so hard to get
that training example right. Here's the formulation.
Our
SVM
primal problem was to minimize one-half the norm of W
squared.
So this is our original problem, and I'm
going to modify this by adding the
following.
In other words, I'm going to add these slack variables, the Xi I's, as penalty terms,
and I'm going to demand that each of my training examples is separated with
functional margin greater than or equal to one minus Xi I.
And you remember,
it was
maybe two lectures ago that I said that if the
functional margin is greater than zero,
that implies you classified it correctly,
and if it's less than zero, then you misclassified it.
By setting some of the Xi I's to be larger than one, I can
actually have some examples with
negative functional margin,
and therefore I'm allowing my algorithm to misclassify some of the examples in the
training set.
However, I'll encourage the algorithm not to do that by adding to the
optimization objective
this penalty term that penalizes setting the Xi I's to be large.
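Written out, the formulation being described is the following (a reconstruction of what was on the board, using xi_i for the slack variables):

```latex
\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y^{(i)}\big(w^{\top} x^{(i)} + b\big) \ge 1 - \xi_i, \quad i = 1, \dots, m \\
& \xi_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
```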
This is an optimization problem where the parameters are W, B,
and all of the Xi I's,
and this is also a convex optimization problem. It
turns out that
similar to how we worked on the dual of the support vector machine,
we can also work out the dual for this optimization problem. I won't
actually do it,
but just to show you the steps, what you do is you construct the Lagrangian of Alpha and R, and I'm going
to
use
Alpha and R to denote the Lagrange multipliers, now corresponding to
this set of constraints that we had previously
and this new set of constraints that the Xi I's are greater than or equal to zero. This gives us two sets of
Lagrange multipliers. The Lagrangian will be
the original optimization objective
minus or plus Alpha times each of these constraints,
which are
greater than or equal to
zero, and similarly R times the constraints on the Xi I's. I won't re-derive the entire dual again, but it's really the same,
and when you derive the dual of this optimization problem and when you simplify,
you find that you get the following.
You have to maximize
W of Alpha, which is actually the same objective as before.
So it turns out when you derive the dual
and simplify,
the only way the dual changes compared to the previous one
is that rather than the constraint that the Alpha I's are greater than or equal to zero,
we now have a constraint that the Alphas are between zero and C.
This derivation isn't very hard, and you're encouraged to go home and try to do it yourself.
It's really essentially the same math,
and when you simplify, it turns out
you can simplify the R Lagrange multipliers away
and you end up with
just these constraints on
the
Alphas.
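For reference, the resulting dual, with the constraint changed as just described, has the standard form (again a reconstruction of the board, in the notation of the previous lecture):

```latex
\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j
    \big\langle x^{(i)}, x^{(j)} \big\rangle \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, m \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
\end{aligned}
```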
Just as an aside,
I won't
derive these, either. It turns out that,
remember, I wrote down the KKT conditions in
the last lecture,
the necessary conditions for something to be an optimal solution
to a constrained optimization problem.
So if you use the KKT conditions,
it turns out you can actually derive convergence conditions, so
we want to solve this optimization problem.
When do we know the Alphas have converged to the global optimum?
It turns out you can use the following.
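The conditions themselves aren't captured in the transcript; the standard KKT dual complementarity conditions for this problem, which are presumably what is being referred to, are:

```latex
\begin{aligned}
\alpha_i = 0 \;&\Rightarrow\; y^{(i)}\big(w^{\top} x^{(i)} + b\big) \ge 1 \\
\alpha_i = C \;&\Rightarrow\; y^{(i)}\big(w^{\top} x^{(i)} + b\big) \le 1 \\
0 < \alpha_i < C \;&\Rightarrow\; y^{(i)}\big(w^{\top} x^{(i)} + b\big) = 1
\end{aligned}
```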
I don't want to say a lot about these. It
turns out from the KKT conditions you can derive
these as the convergence conditions for an algorithm that you might choose to
use
to try to solve the optimization problem in terms of the
Alphas. That's the L1 norm soft margin SVM, and this is the
change to the algorithm that lets us handle non-linearly separable data sets, as well as
data sets with single outliers that may still be linearly separable
but that you may choose not to separate perfectly.
Questions about this?
Raise
your hand if this stuff makes sense at all. Great.
So
the last thing I want to do is
talk about an algorithm for actually solving this optimization problem.
We wrote down
this dual optimization problem with convergence criteria,
so let's come up with an efficient algorithm to actually solve this optimization problem.
I want to do this partly to give me an excuse to talk about an algorithm
called coordinate ascent, which is a useful algorithm to
know about.
It'll turn out that it won't apply in its simplest form
to this problem, but we'll then be able to modify it slightly and then it'll
give us a very efficient
algorithm for solving this dual
optimization problem. That was the other reason that I had to derive the dual, not only so that we could
use kernels
but also so that we can apply an algorithm like the
SMO
algorithm. First, let's talk about coordinate ascent, which is another
general optimization
algorithm. To describe coordinate
ascent, I just want you to consider the problem where we
want to maximize some function
W, which is a function of Alpha one through Alpha M, with no constraints. So for
now, forget about the constraint that the Alpha I's must be between zero and C. Forget about the
constraint that the sum of YI Alpha I must be equal to zero.
Then this is the coordinate
ascent
algorithm. It will
repeat until convergence,
and inside will do, for I equals one to M: the inner loop of coordinate ascent essentially
holds all the parameters
except Alpha I fixed
and then just maximizes this function with respect to just that one parameter. Let me
write that
as Alpha I gets
updated as the arg max
over Alpha I hat of
W of Alpha one up to
Alpha I minus one, Alpha I hat, Alpha I plus one, up to Alpha M. This is really a fancy way of saying hold everything
except Alpha I
fixed and just optimize the objective W with
respect to only
Alpha I.
This is just a fancy
way of writing
it. This is coordinate
ascent.
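Here is a toy Python sketch of coordinate ascent on an unconstrained problem; the inner argmax is done by a crude grid search, standing in for whatever closed-form update a real problem would allow:

```python
import numpy as np

def coordinate_ascent(W, alpha, num_passes=50, grid=np.linspace(-10, 10, 2001)):
    """Repeatedly maximize W over one coordinate at a time, holding the others fixed."""
    alpha = alpha.copy()
    for _ in range(num_passes):
        for i in range(len(alpha)):
            candidates = np.tile(alpha, (len(grid), 1))
            candidates[:, i] = grid                     # vary only coordinate i
            values = np.array([W(c) for c in candidates])
            alpha[i] = grid[np.argmax(values)]          # alpha_i := argmax over alpha_i
    return alpha

# Example: maximize the concave quadratic W(a) = -(a1 - 1)^2 - 2*(a2 + 3)^2
W = lambda a: -(a[0] - 1.0) ** 2 - 2.0 * (a[1] + 3.0) ** 2
print(coordinate_ascent(W, np.zeros(2)))  # approaches [1, -3]
```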
One picture
that's kind of useful for coordinate ascent is if you
imagine you're trying to optimize a quadratic function, it
really looks like
that. These are the contours of the quadratic function, and the minimum is
here.
This is what
coordinate ascent would look like. These are my axes; I'll call this Alpha two and I'll call this Alpha one.
So on my Alpha one, Alpha two axes, let's say I start down here. Then I'm going
to begin by minimizing this with respect to Alpha one. I
go there.
And then at my new point, I'll minimize with respect to Alpha two, and so I might go to someplace like
that.
Then, I'll minimize
with respect to Alpha one again, then go back to Alpha two,
and so on.
You're always taking these axis-aligned steps
to get to the minimum.
It turns out that there's a modification to this. There are
variations of this as well. The way I describe
the algorithm, we're always doing this
in alternating order. We always optimize with respect to Alpha
one then Alpha two, then Alpha one, then Alpha two.
What I'm about to say applies only in higher dimensions, but it turns out if you have
a lot of parameters, Alpha one through Alpha M,
you may not choose to always visit them in a fixed order. You may choose
which Alphas update next depending on what you
think will allow you to make the most progress. If you have only two parameters, it makes
sense to alternate between them.
If
you have
higher dimensional parameters, sometimes you may choose to update them in a
different order if you think doing so would let you make faster progress towards the
maximum. It turns out that coordinate ascent,
compared to some of the algorithms we saw previously, compared to, say, Newton's
method, will usually take a lot more steps,
but the chief advantage of coordinate ascent when it works well
is that sometimes
the optimization objective W
is very inexpensive to optimize with respect to any one of your
parameters,
and so even though coordinate ascent may take many more iterations than, say, Newton's
method
in order to
converge, each iteration can be very cheap. It turns out that there are many optimization problems for which it's
particularly easy
to fix
all but one of the parameters and optimize with respect to just that one parameter,
and if that's true, then the inner loop of coordinate ascent, optimizing with
respect to Alpha I, can be done very quickly.
It turns out that
this will be true when we modify this algorithm to solve the SVM
optimization problem.
Questions about this?
Okay. Let's go ahead and apply this
to our support vector machine dual optimization problem. It
turns out that
coordinate ascent in its basic form does not work, for the following reason. The
reason is we have constraints on the Alpha I's. Namely, as we can
recall from what we worked out previously,
we have a constraint that the sum over I of YI
Alpha I must be equal to zero,
and so if you fix all the Alphas except for one,
then you can't change that one Alpha without violating the constraint. If
I just try to change Alpha one,
Alpha one is actually exactly determined as a function of the other
Alphas, because this sums to zero.
The
SMO algorithm, by the way, is due to John Platt, a colleague at
Microsoft. The SMO
algorithm, therefore, instead of trying to change one Alpha at a time,
we will try to change
two
Alphas at a time.
This is
called the SMO algorithm, which stands
for sequential minimal optimization,
and the term minimal refers to the fact that
we're choosing the smallest number of Alpha I's to change at a time, which
in this case means we need to change at least two at a time.
So let me go ahead and outline the algorithm.
We will
select
two Alphas to change, some Alpha I and Alpha J, chosen via some heuristic, which
just means a rule of thumb.
We'll hold all the Alpha Ks fixed
except
Alpha I and Alpha J
and optimize W of Alpha
with respect
to Alpha I
and Alpha J subject to all the constraints.
It turns out the
key step
which I'm going to
work out
is this one, which
is how do you optimize W of Alpha
with respect to the two parameters that you just chose to update
and subject to the constraints? I'll do
that in a second.
You would keep running this algorithm
until you have satisfied
these
convergence criteria up to Epsilon.
What I want to do now is describe how to do this key step,
how to optimize W of Alpha
with respect to
Alpha I and Alpha J,
and it turns out that it's because you can do this extremely efficiently
that the SMO algorithm works well. The inner loop of the SMO
algorithm can be done extremely efficiently, so it may take a large number of
iterations to converge, but each iteration is very cheap. Let's talk about that.
So in order to derive that step where we update with respect to Alpha I and
Alpha J,
I'm actually going to use Alpha one and Alpha two as my example. I'm gonna
update
Alpha one and Alpha two. In general, this could be any Alpha I and Alpha J, but
just to make my notation on the board easier, I'm going to derive the
derivation for Alpha one and Alpha two
and the general case is completely analogous.
On every step of the algorithm, we must respect the constraint that the
sum over I Alpha I YI is equal to zero. This is one of the
constraints we had previously for
our dual optimization problem.
This means that Alpha one Y1 plus Alpha
two Y2 must be equal to
minus the sum from I equals three to M of Alpha I YI,
which I'm going to denote
by Zeta.
So
we also have the constraint
that the Alpha Is
must be between zero and C. We had two constraints on our dual. This was one of the
constraints. This was the other one.
In pictures,
the
constraint that the Alpha I's lie between zero
and C is often called the box constraint,
and so if I draw Alpha one
and Alpha two, then
I
have a constraint that
the values of Alpha one and Alpha two must lie within this box that ranges from zero
to C.
And so the picture of the algorithm is as follows.
I have some constraint
that Alpha one Y1
plus Alpha
two Y2
must equal to Zeta,
and so this implies that
Alpha one must be equal to Zeta minus Alpha two Y2
over Y1,
and so what I want to do is I want to optimize the objective with respect
to this.
What I can do is plug in my definition for Alpha one as a function of Alpha two
and so this becomes W of
Zeta minus Alpha two Y2
over
Y1, as the first argument, followed by Alpha two, Alpha three
and so on,
and it turns out that because W is a quadratic function " if you look back to our
earlier definition of W, you find it's a quadratic function in all the Alphas "
it turns out that if you look at this expression for W
and if you view it as just a function of Alpha two,
you find that this is a one dimensional quadratic function of Alpha two
if you hold Alpha three, Alpha four and so on fixed,
and so
this can be simplified to some expression of the form A Alpha two squared
plus B Alpha two plus
C. This
is a standard quadratic function.
This is really easy to optimize. We know how to optimize
"
when did we learn this? This was high school
or undergrad or something. You know how to optimize quadratic functions like these.
You just do that and that gives you the
optimal value for Alpha two.
The last step, with a
box constraint like this, just in pictures:
you know your solution must lie on this line, and so there'll be some
sort of quadratic function
over this line, say,
and so if you minimize the quadratic function,
maybe you get a value that lies in the box, and if so, then you're done.
Or if your quadratic function looks like this,
maybe when you optimize your quadratic function, you may end up with a value outside, so you end up with
a solution like that. If
that happens, you clip your solution
just to map it back inside the box.
That'll give you the optimal solution of this quadratic optimization problem
subject to
your solution
satisfying this box constraint
and lying on this straight line " in other words, subject to
the solution lying on this line segment within the box.
Having solved for Alpha two this way, you can clip it if necessary
to get it back within the box constraint
and then
we have Alpha one as a function of Alpha two
and this allows you to
optimize W with respect to Alpha one and Alpha two quickly,
subject to all the constraints,
and the key step is really just a one-dimensional quadratic optimization, which
we can do very quickly,
which is what makes the inner loop of the SMO algorithm very efficient.
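Here is a stylized Python sketch of just that inner step, in the form described above: maximize the one-dimensional quadratic, clip the result to the feasible segment, and recover the other alpha from the linear constraint. The coefficients and endpoints are hypothetical inputs; a real SMO implementation computes them from the kernel values and the current alphas.

```python
import numpy as np

def smo_pair_update(a, b_coef, L, H):
    """W restricted to alpha_2 is a quadratic a*alpha_2^2 + b_coef*alpha_2 + const
    with a < 0 (we are maximizing), so take the unconstrained maximizer and clip
    it to the feasible segment [L, H] imposed by the box and line constraints."""
    alpha2_unclipped = -b_coef / (2.0 * a)      # vertex of the parabola
    return float(np.clip(alpha2_unclipped, L, H))

def recover_alpha1(zeta, alpha2, y1, y2):
    """Given alpha_1*y_1 + alpha_2*y_2 = zeta, solve for alpha_1."""
    return (zeta - alpha2 * y2) / y1

# Hypothetical numbers: maximize -2*alpha2^2 + 3*alpha2 on the segment [0, 0.5].
alpha2 = smo_pair_update(a=-2.0, b_coef=3.0, L=0.0, H=0.5)  # vertex at 0.75 -> clipped to 0.5
alpha1 = recover_alpha1(zeta=0.2, alpha2=alpha2, y1=1.0, y2=-1.0)
print(alpha2, alpha1)  # 0.5 0.7
```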
Student: You mentioned here that we can
change
whichever Alphas we want, but in the SMO
algorithm, we can only
change two at a
time,
so how should I understand that? Instructor (Andrew Ng): Right. Let's say I want to change
just one. As I run the optimization algorithm, I
have to respect the constraint that
sum over
I Alpha I YI
must be equal to zero,
so this is a linear constraint that I didn't have when I was talking about coordinate ascent.
Suppose I tried
to change just Alpha one.
Then I know that Alpha one
must be equal to minus the sum from I equals two to M of
Alpha I
YI,
divided by Y1, right,
and so Alpha
one can actually be written exactly as a function of Alpha two, Alpha three and so on
through Alpha M.
And so if I hold Alpha two, Alpha three, Alpha four
through Alpha M fixed, then I can't change Alpha one, because Alpha one is completely
determined by the other Alphas. Whereas in contrast, if I choose to change Alpha one and Alpha two at the same
time,
then I still have a constraint and so I know Alpha one and Alpha two
must satisfy
that linear constraint
but at least this way I can change Alpha one if I also
change Alpha two accordingly to make sure this satisfies
the constraint.
Student:[Inaudible]. Instructor (Andrew Ng):So Zeta was defined [inaudible]. So on
each iteration, I have some
setting of the parameters,
Alpha one, Alpha two, Alpha three and so on,
and I want to change Alpha one and Alpha two, say.
So from the previous iteration,
let's
say I had not violated the constraint, so that holds true,
and so I'm just defining Zeta to be equal to this, because
Alpha one Y1 plus Alpha two Y2 must be equal to minus the sum from
I equals three to M of Alpha I YI,
and so I'm just defining this
to be Zeta.
Student:[Inaudible]. Instructor (Andrew
Ng):On every iteration, you change maybe a different pair of Alphas to update. The
way you do this is something I don't want to talk about. I'll say a couple more words about
that, but
the basic outline of the algorithm is
on every iteration of the algorithm, you're going to select some Alpha I and
Alpha J to update
like on this board. So that's an Alpha I and an Alpha J to update
via some heuristic,
and then you use the procedure I just described to
actually update
Alpha I and Alpha J. What I actually just talked about
was the procedure
to optimize W with respect to Alpha I and Alpha J. I didn't actually talk about the
heuristic for choosing Alpha
I and Alpha J. Student: What is the function W? Instructor (Andrew Ng): W is way up
there. I'll just write it again. W of
Alpha is that function we had previously, the dual objective we're trying to maximize; it was that thing.
All of this is about solving the optimization problem for the SVM, so this
is the objective function we had, so that's W of Alpha.
Student: [Inaudible]? By changing one of the Alphas and optimizing
that one, you
can
make the
other one that you have to change work,
right? Instructor (Andrew Ng): What
do
you mean, work? Student: It will get farther from its optimal value. Instructor (Andrew Ng): Let me translate it differently.
What we're trying to do is we're trying to optimize the objective function W
of Alpha,
so the metric of progress that we care about is whether W of Alpha is getting
better on every iteration,
and so what is true for coordinate ascent and for SMO is that on every iteration,
W of Alpha can only increase. It may stay the same
or it may increase. It can't get worse.
It's true that eventually, the Alphas will converge at some value. It's true that
in intervening iterations, one of the Alphas may move further
away and then closer and further and closer to its final value,
but what we really care about is that W of Alpha is getting better
every time, which is true.
Just a couple more words on
SMO before I wrap up on this.
One is that John Platt's original
algorithm had a heuristic
for choosing which pairs, Alpha I and Alpha J, to update
next.
It's one of those things that's not conceptually complicated but is very
complicated
to explain in words, so I won't talk about that here.
If you want to learn about it,
go ahead and look up John Platt's paper on the
SMO algorithm. The paper is pretty easy to read,
and later on, we'll also be posting a handout on the
course homepage
with
a simplified version of this heuristic that you can use in the
problem sets.
You can see the readings for more details.
One other thing that I didn't talk about was how to update the parameter B. So
this solves for all your Alphas,
and the Alphas allow us to get W.
The other thing I didn't talk about was how to compute the parameter B, and it turns out that again is actually not
very difficult.
I'll let you read about that yourself with the
notes that we'll post
along with the next
problem set. To wrap up today's lecture, what I want to do is just tell
you briefly about a couple of examples of applications
of SVMs.
Let's consider the problem of handwritten digit recognition.
In handwritten digit
recognition,
you're given a pixel array with
a scanned image of,
say, a zip code that someone has handwritten. This is an array of pixels,
and some of these pixels will be on
and other pixels
will be off. This combination of pixels being on maybe represents the
character one.
The question is, given
an input feature vector like this,
how do you classify the digit? If you
have, say, ten pixels by ten pixels, then you have
a hundred dimensional feature vector,
100
features, and maybe these are binary features where each XJ is zero or one, or maybe the X's are
gray scale values
corresponding to how dark each of these pixels was.
It turns out that for many years, a neural network was the champion
algorithm for handwritten digit recognition.
And it turns out that you can apply an SVM with the
following
kernel. It turns out either the polynomial kernel or
the Gaussian
kernel works fine for this problem,
and just by writing down this kernel and throwing an SVM at
it, the SVM gave performance comparable to the very best neural networks.
This is surprising because the support vector machine
doesn't take into account any knowledge about the pixels,
and in particular,
it doesn't know that this pixel is
next to that pixel
because it's just representing the pixel intensity value as a vector.
And so this means the performance of SVM would be the same even if you were to
shuffle all the pixels around.
Its performance was
comparable to the very best neural networks, which had been
under very careful development for many years. I want
to
tell you about one other cool example, which is SVMs are
also used to classify other fairly esoteric objects. So for
example,
let's say you want to classify
protein sequences into different classes of
proteins. Every time I do this, I suspect that biologists in the room cringe, so I apologize
for that. There are
20 amino acids, and proteins in our bodies are made up by
sequences of amino acids.
Even though there are 20 amino acids and 26 letters in the alphabet, I'm going to
denote amino acids by the letters
A through Z, with apologies to the biologists. Here's
an amino acid sequence
represented by a series of letters.
So
suppose I want to assign
this protein into a few classes depending on what
type of protein it is. The
question is how
do I construct my feature
vector? This is challenging for many reasons, one of which is that protein
sequences can be of different lengths. There are some very long protein
sequences and some very short ones,
and so you can't have a feature saying what is
the amino acid in the 100th position, because maybe there is no 100th
position in this protein sequence. Some are very long. Some are very short.
Here's my feature representation,
which is
I'm going to write down
all possible strings of four letters. I'm going to write down AAAA, AAAB, AAAC
down
to
AAAZ and then
AABA
and so on. You get the idea. Write down
all
possible strings of
four letters,
and my feature representation will be
I'm going to scan through this sequence of amino acids
and count how often each of these subsequences occur. So for example,
BAJT occurs twice and so I'll put
a two there, and
none of these sequences occur, so I'll put
a zero there. I guess
I have a one here
and a zero there. This very long vector
will be my feature representation for protein.
This representation applies no matter how long my protein sequence is.
How
large is this? Well, it turns out
this is going to be in R to the 20
to the fourth power,
and so you have a 160,000
dimensional feature vector,
which is reasonably large, even by modern computer standards.
Clearly, we don't want to explicitly represent
these high
dimensional feature vectors. Imagine you have 1,000 examples and you store this as
doubles. Even on modern day computers, this is big.
It turns out that there's an
efficient dynamic programming algorithm that can efficiently
compute inner products between these feature vectors, and so we can apply this
feature representation, even though it's a ridiculously high
dimensional feature vector,
to classify protein sequences.
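Here is a small Python sketch (not the dynamic programming algorithm mentioned next) of the counting representation just described, stored sparsely so the 160,000-dimensional vector is never written out; the example strings are made up.

```python
from collections import Counter

def four_mer_counts(seq):
    """Sparse version of the 20^4-dimensional feature vector:
    count how often each length-4 substring occurs in the sequence."""
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))

def string_kernel(s, t):
    """Inner product of the two count vectors, computed on the sparse
    representations (the naive counting version, not the efficient one)."""
    cs, ct = four_mer_counts(s), four_mer_counts(t)
    return sum(cs[k] * ct[k] for k in cs if k in ct)

print(string_kernel("BAJTXXBAJT", "YBAJTZ"))  # BAJT occurs twice in one, once in the other -> 2
```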
I won't talk about the details of that algorithm here. If any of you have seen
dynamic programming algorithms for finding subsequences, it's kind of
reminiscent of that. You
can look those up if you're interested.
This is just another example of a cool kernel, and
more generally, if you're faced with some new machine-learning problem, sometimes you can
choose a standard kernel like a Gaussian kernel,
and sometimes there are research papers written on how to
come up with a new kernel
for a new problem.
Two last sentences I want
to say. Where are we now? That wraps up SVMs, which many
people would consider one of the most effective off the shelf learning algorithms,
and so as of today, you've actually seen a lot of learning algorithms. I want to
close this class by saying congrats. You're now well qualified to actually go and apply learning algorithms
to a lot of problems.
We're still in week four of the quarter, so there's more to come. In particular, what I
want to do next is talk about how to really understand the learning algorithms and
when they work well and when they work poorly
and to take the tools which you now have and really talk about how you can use
them really well. We'll start to do that in the next lecture. Thanks.