Lecture 8 | Machine Learning (Stanford)
-
0:09 - 0:10
-
0:10 - 0:14This presentation is delivered by the Stanford Center for Professional
-
0:14 - 0:21Development. Welcome back. If you haven't given me the homework yet,
-
0:27 - 0:28you can just give it to me at the end of class. That's fine. Let's see. And also just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.
-
0:28 - 0:32What I want to do today is wrap up our discussion on support vector
-
0:32 - 0:34machines
-
0:34 - 0:37and in particular we'll also talk about the idea of kernels
-
0:37 - 0:39and then talk about
-
0:39 - 0:42[inaudible]
-
0:42 - 0:46and then I'll talk about
-
0:46 - 0:50the SMO algorithm, which is an algorithm for solving the
-
0:50 - 0:54optimization problem that we posed last time.
-
0:54 - 0:56
-
0:56 - 0:58To recap, we
-
0:58 - 1:01wrote down
-
1:01 - 1:08the following convex optimization problem.
-
1:12 - 1:16All this is assuming that the data is linearly separable, which is an assumption that I'll
-
1:16 - 1:18fix later,
-
1:18 - 1:20
-
1:20 - 1:23
-
1:23 - 1:25
-
1:25 - 1:30and so with this optimization problem, given a training set,
-
1:30 - 1:35this will find
-
1:35 - 1:37the optimal margin
-
1:37 - 1:41classifier for the data set that maximizes this
-
1:41 - 1:45geometric margin from your training
-
1:45 - 1:47examples.
-
1:47 - 1:52And so in the previous lecture, we also derived the dual of this problem,
-
1:52 - 1:54which was to maximize
-
1:54 - 2:01
-
2:02 - 2:09
-
2:18 - 2:22this. And this is the dual of our
-
2:22 - 2:24primal [inaudible] optimization problem.
-
2:24 - 2:28Here, I'm using these angle brackets to denote inner product,
-
2:28 - 2:29
-
2:29 - 2:31so
-
2:31 - 2:33this is just XI transpose XJ
-
2:33 - 2:39for vectors XI
-
2:39 - 2:41and XJ. We also worked out the
-
2:41 - 2:47way that W would be given by sum over I alpha I
-
2:47 - 2:51YI XI. Therefore, when you need to make a prediction at classification
-
2:51 - 2:52time, you
-
2:52 - 2:55need to compute the
-
2:55 - 3:01value of the hypothesis applied to an input X, which is G of W
-
3:01 - 3:06transpose X plus B where G is that threshold function that outputs plus one and minus one.
-
3:06 - 3:09And so this is G of
-
3:09 - 3:10sum over I alpha
-
3:10 - 3:12
-
3:12 - 3:19
-
3:20 - 3:22I. So that can also be written
-
3:22 - 3:24in terms of inner
-
3:24 - 3:26products between input
-
3:26 - 3:31vectors X. So what I
-
3:31 - 3:35want to do is now talk about the idea of kernels, which will make use of this
-
3:35 - 3:37property because it turns out
-
3:37 - 3:42that the only dependence of the algorithm on X
-
3:42 - 3:44is through these inner products.
-
3:44 - 3:47In fact, you can write the entire algorithm
-
3:47 - 3:50without ever explicitly referring to
-
3:50 - 3:52an X vector; you only need the inner products
-
3:52 - 3:54
-
3:54 - 3:58between input feature vectors.
-
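For reference, the dual problem and the prediction rule being recapped here, written out in what is presumably the standard form from the previous lecture (the board work itself is not captured in the text):

    \max_\alpha \; W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
    \text{subject to } \alpha_i \ge 0, \qquad \sum_{i=1}^m \alpha_i y^{(i)} = 0

    w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}, \qquad h_{w,b}(x) = g(w^T x + b) = g\Big( \sum_{i=1}^m \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \Big)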
3:58 - 4:01And the idea of a kernel is as follows:
-
4:01 - 4:07let's say that you have an input attribute.
-
4:07 - 4:11Let's just say for now it's a real number. Maybe this is the
-
4:11 - 4:13living area of a house
-
4:13 - 4:20that you're trying to make a prediction on, like whether it will be sold in the next six months.
-
4:20 - 4:21Quite often, we'll take this
-
4:21 - 4:24feature X and we'll map it
-
4:24 - 4:29to a richer set of features. So for example, we will take X and map it
-
4:29 - 4:36to these four polynomial features, and let me call this mapping Phi. So we'll
-
4:37 - 4:39let Phi of X denote
-
4:39 - 4:41the mapping from your original features to
-
4:41 - 4:45some higher dimensional set of features.
-
4:45 - 4:49So if you do this and you want to use the features Phi of X,
-
4:49 - 4:50then
-
4:50 - 4:54all you need to do is go back to the learning algorithm
-
4:54 - 4:57and everywhere you see
-
4:57 - 4:59XI, XJ, we'll replace it
-
4:59 - 5:00with
-
5:00 - 5:05the inner product between Phi of XI and
-
5:05 - 5:08Phi
-
5:08 - 5:11of XJ. So this corresponds to running a support vector machine
-
5:11 - 5:14with the features given by Phi of X rather than with your original
-
5:14 - 5:19one-dimensional input feature X.
-
5:19 - 5:22And in a scenario that I want to consider,
-
5:22 - 5:23sometimes Phi of X
-
5:23 - 5:29will be very high dimensional,
-
5:29 - 5:34and in fact sometimes, for example, Phi of X may contain
-
5:34 - 5:37very high degree polynomial features.
-
5:37 - 5:42Sometimes Phi of X will actually even be an infinite dimensional vector of features,
-
5:42 - 5:45and the question is, if Phi of X is extremely high dimensional,
-
5:45 - 5:49then you can't actually compute these inner products very
-
5:49 - 5:51efficiently, it seems, because computers need to
-
5:51 - 5:55represent an extremely high dimensional feature vector and then
-
5:55 - 5:57take
-
5:57 - 6:00the inner product, which seems inefficient. It
-
6:00 - 6:03turns out that
-
6:03 - 6:08in many important special cases, we can write down what we'll call the kernel function,
-
6:08 - 6:10denoted by K,
-
6:10 - 6:12which will be
-
6:12 - 6:18this,
-
6:18 - 6:21which would be inner product between those feature
-
6:21 - 6:24vectors. It turns out there will be important special cases where
-
6:24 - 6:29computing Phi of X is computationally very expensive, maybe even impossible.
-
6:29 - 6:30Maybe it's an infinite dimensional vector,
-
6:30 - 6:33and you can't compute infinite dimensional vectors.
-
6:33 - 6:37There will be important special cases where Phi of X is very expensive to represent because it
-
6:37 - 6:39is so high dimensional,
-
6:39 - 6:42but nonetheless, you can actually compute a kernel between
-
6:42 - 6:46XI and XJ. You can compute the inner product between these two vectors
-
6:46 - 6:49very inexpensively.
-
6:49 - 6:52And so the idea of the support vector machine is that
-
6:52 - 6:57everywhere in the algorithm that you see these inner products,
-
6:57 - 6:58we're going to replace it
-
6:58 - 7:02with a kernel function that you can compute efficiently,
-
7:02 - 7:04and that lets you work in feature
-
7:04 - 7:08spaces Phi of X even if Phi of X
-
7:08 - 7:12are very high dimensional. Let
-
7:12 - 7:16me now say how that's done. A
-
7:16 - 7:17
-
7:17 - 7:21little bit later today, we'll actually see some concrete examples of Phi of X and
-
7:21 - 7:26of kernels. For now, let's just think about constructing kernels explicitly.
-
7:26 - 7:29This is best illustrated with an example.
-
7:29 - 7:33
-
7:33 - 7:35Let's say you have
-
7:35 - 7:40two inputs, X and Z. Normally I should write those as XI and XJ, but I'm just going to
-
7:40 - 7:46write X and Z to save on writing.
-
7:46 - 7:50Let's say my kernel is K of X, Z
-
7:50 - 7:51equals
-
7:51 - 7:55X transpose Z squared.
-
7:55 - 8:02And so this is
-
8:09 - 8:13X transpose Z times X transpose Z, right? This thing here is X
-
8:13 - 8:15transpose Z and this thing is X transpose Z, so this is X
-
8:15 - 8:17transpose Z squared.
-
8:17 - 8:18
-
8:18 - 8:25And that's equal
-
8:29 - 8:33to that.
-
8:33 - 8:36And so this kernel corresponds
-
8:36 - 8:38to the feature mapping where Phi
-
8:38 - 8:41of X is equal to the following, and I'll write
-
8:41 - 8:48this down for the case of N equals three, I guess.
-
9:03 - 9:07And so with this definition of Phi of X,
-
9:07 - 9:10you can verify for yourself that this thing
-
9:10 - 9:12becomes the
-
9:12 - 9:18inner product between Phi of X
-
9:18 - 9:22and Phi of Z, because to get an inner product between two vectors
-
9:22 - 9:25, you take the corresponding elements of the two vectors,
-
9:25 - 9:27multiply them, and add them up. So
-
9:27 - 9:29if this is Phi of X,
-
9:29 - 9:32then the inner product between Phi of X and Phi of Z will be
-
9:32 - 9:34the sum
-
9:34 - 9:35over all the
-
9:35 - 9:40elements of this vector times the corresponding elements of Phi of Z,
-
9:40 - 9:45and what you get is this one.
-
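As a quick numerical check of this claim, here is a minimal sketch in Python (using NumPy; the names phi and kernel are just for illustration):

    import numpy as np

    def phi(x):
        # explicit feature map for K(x, z) = (x^T z)^2: all n^2 products x_i * x_j
        return np.outer(x, x).ravel()

    def kernel(x, z):
        # the same quantity computed directly from the n-dimensional inputs
        return float(np.dot(x, z)) ** 2

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])
    print(np.dot(phi(x), phi(z)))   # inner product of the two n^2-dimensional feature vectors
    print(kernel(x, z))             # identical value, computed without ever forming phi

Forming phi takes on the order of n squared work, while the kernel call only needs one n-dimensional inner product, which is the point made next.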
9:45 - 9:48And so the cool thing about this is that
-
9:48 - 9:53in order to compute Phi of X, you
-
9:53 - 9:56
-
9:56 - 10:03need order N squared time just to compute
-
10:03 - 10:05Phi of X. If N
-
10:05 - 10:09is the dimension of X and Z, then
-
10:09 - 10:14Phi of X is a vector of all pairs of XI XJ multiplied with
-
10:14 - 10:15each other,
-
10:15 - 10:19and so the length of Phi of X is N squared. You need order N
-
10:19 - 10:24squared time just to compute Phi of X. But
-
10:24 - 10:31to compute K
-
10:37 - 10:39" to compute the kernel function,
-
10:39 - 10:41all you need is order N time,
-
10:41 - 10:43because the
-
10:43 - 10:47kernel function is defined as X transpose Z squared, so you just take the inner product
-
10:47 - 10:51between X and Z, which is order N time and you square that
-
10:51 - 10:52and you've computed this
-
10:52 - 10:54kernel function,
-
10:54 - 10:58and so you just computed the inner product between two vectors where each vector
-
10:58 - 11:00has N squared elements, but
-
11:00 - 11:07you did it in order N time. Student:For any kernel you find
-
11:14 - 11:21for X and Z,
-
11:21 - 11:24does Phi exist for X and Z? Instructor (Andrew
-
11:24 - 11:31Ng):Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.
-
11:32 - 11:34
-
11:34 - 11:35So
-
11:35 - 11:39let me just describe a couple of quick generalizations to
-
11:39 - 11:42this.
-
11:42 - 11:47One is that if
-
11:47 - 11:54you define KXZ to be equal to
-
11:55 - 12:00X transpose Z plus C squared, so again, you can compute this kernel
-
12:00 - 12:03in order N time,
-
12:03 - 12:07then that turns out to correspond to a feature vector
-
12:07 - 12:08where I'm
-
12:08 - 12:11just going to add a few more elements at the bottom
-
12:11 - 12:18where you add root 2.
-
12:24 - 12:28Let me read that. That was root 2 CX1 root 2 CX2 root 2 CX3
-
12:28 - 12:30and C.
-
12:30 - 12:31And so
-
12:31 - 12:35this is a way of creating a feature vector with both the monomials,
-
12:35 - 12:37meaning the first order terms,
-
12:37 - 12:41as well as the quadratic or the inner product
-
12:41 - 12:43terms between XI and XJ,
-
12:43 - 12:45and the parameter C here
-
12:45 - 12:49allows you to control the relative weighting between the monomial
-
12:49 - 12:51terms, so the first order terms,
-
12:51 - 12:54and the quadratic
-
12:54 - 12:57terms. Again, this is still an inner
-
12:57 - 13:01product between vectors of length N squared, computed
-
13:01 - 13:03in order N time.
-
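Assembling the terms being read off the board, for N equals 3 this kernel corresponds to the feature map

    \phi(x) = \big(x_1x_1,\; x_1x_2,\; x_1x_3,\; x_2x_1,\; x_2x_2,\; x_2x_3,\; x_3x_1,\; x_3x_2,\; x_3x_3,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; \sqrt{2c}\,x_3,\; c\big)

and indeed \langle \phi(x), \phi(z) \rangle = (x^T z)^2 + 2c\,(x^T z) + c^2 = (x^T z + c)^2.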
13:03 - 13:05
-
13:05 - 13:12More generally, here are some other examples of kernels.
-
13:14 - 13:18Actually, a
-
13:18 - 13:25generalization of the one I just derived right now would be the following kernel.
-
13:25 - 13:31And so this corresponds to using all N plus D choose D
-
13:31 - 13:34features
-
13:34 - 13:39of all
-
13:39 - 13:43monomials. Monomials just mean the products of XI XJ XK. Just
-
13:43 - 13:47all the polynomial terms up
-
13:47 - 13:52to degree D
-
13:52 - 13:56which is on the order of
-
13:56 - 13:59N plus D to the power of D, so this grows exponentially in D.
-
13:59 - 14:01This is a very high dimensional feature vector,
-
14:01 - 14:05but again, you can implicitly construct the feature vector and take
-
14:05 - 14:07inner products between them.
-
14:07 - 14:11It's very computationally efficient, because you just compute the inner product
-
14:11 - 14:15between X and Z, add C and you take that real number to the power of D
-
14:15 - 14:19and by plugging this in as a kernel, you're implicitly working in an extremely
-
14:19 - 14:26high dimensional feature space.
-
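In symbols, the kernel described here is the polynomial kernel

    K(x, z) = (x^T z + c)^d,

which corresponds to a feature space containing all monomials of the x_i up to degree d, a space of dimension \binom{n+d}{d}, even though evaluating K itself takes only order n time.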
14:26 - 14:28So what
-
14:28 - 14:35I've given is just a few specific examples of how to create kernels. I want
-
14:36 - 14:40to go over just a few specific examples of kernels. So let me ask
-
14:40 - 14:44you more generally if you're faced with a new machine-learning problem,
-
14:44 - 14:46how do you come up with a kernel? There
-
14:46 - 14:50are many ways to think about it, but here's one intuition that's sort of
-
14:50 - 14:51useful.
-
14:51 - 14:52
-
14:52 - 14:54So given a set of attributes of X, you're
-
14:54 - 14:58going to use a feature vector of Phi of X
-
14:58 - 15:00and given a set
-
15:00 - 15:04of attributes Z, you're going to use an input feature vector Phi of Z,
-
15:04 - 15:07and so the kernel is computing
-
15:07 - 15:12the inner product between Phi of X and Phi of Z.
-
15:12 - 15:13And so
-
15:13 - 15:15one intuition "
-
15:15 - 15:17this is a partial intuition. This
-
15:17 - 15:21isn't a rigorous intuition. It
-
15:21 - 15:22is that
-
15:22 - 15:24if X and Z are very similar,
-
15:24 - 15:28then Phi of X and Phi of Z will be pointing in the same direction,
-
15:28 - 15:33and therefore the inner product would be large.
-
15:33 - 15:34Whereas in contrast, if
-
15:34 - 15:37X and Z are very dissimilar,
-
15:37 - 15:40then Phi of X and Phi of Z may be pointing different directions,
-
15:40 - 15:42and so the inner product may be small.
-
15:42 - 15:47That intuition is not a rigorous one, but it's sort of a useful one to think about.
-
15:47 - 15:49
-
15:49 - 15:53If you're faced with a new learning problem " if I give you some random
-
15:53 - 15:55thing to classify
-
15:55 - 15:56and you want to decide how to come up with
-
15:56 - 15:58a kernel,
-
15:58 - 15:59one way is
-
15:59 - 16:03to try to come up with the function
-
16:03 - 16:04K of X, Z that
-
16:04 - 16:06is large,
-
16:06 - 16:11if you want the algorithm to think of X and Z as similar
-
16:11 - 16:11and
-
16:11 - 16:15small if you want it to think of them as dissimilar.
-
16:15 - 16:18
-
16:18 - 16:19
-
16:19 - 16:24Again, this isn't always true, but this is one of several intuitions.
-
16:24 - 16:28So if you're trying to classify some brand new thing, say you're trying to
-
16:28 - 16:32classify [inaudible] or DNA sequences or something,
-
16:32 - 16:33
-
16:33 - 16:37some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think
-
16:37 - 16:39these are similar things
-
16:39 - 16:42and small when you want it to think these are dissimilar.
-
16:42 - 16:44And so this
-
16:44 - 16:47answers the question of let's
-
16:47 - 16:50say I have something I want to classify, and let's say I write down
-
16:50 - 16:53the function
-
16:53 - 16:58that I think is a good measure of how similar or dissimilar X and Z
-
16:58 - 16:59are for my specific problem.
-
16:59 - 17:05Let's say I write down K of XZ equals E to the
-
17:05 - 17:08minus.
-
17:08 - 17:13Let's say I write down this function,
-
17:13 - 17:19and I think this is a good measure of how similar X and Z are.
-
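The function being written on the board here is the Gaussian kernel, identified as such later in the lecture, presumably in its standard form

    K(x, z) = \exp\!\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right),

which is close to one when x and z are near each other and close to zero when they are far apart.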
17:19 - 17:26The question, then, is: is this really a valid
-
17:28 - 17:33kernel? In other words, to understand how we can construct kernels: if I write down a
-
17:33 - 17:34function like that,
-
17:34 - 17:38the question is does there really exist some Phi
-
17:38 - 17:40such that
-
17:40 - 17:41KXZ
-
17:41 - 17:42is equal to
-
17:42 - 17:48the inner product?
-
17:48 - 17:50And that's the
-
17:50 - 17:56question of is K a valid kernel.
-
17:56 - 18:00It turns out that
-
18:00 - 18:04there is a result that characterizes necessary and sufficient
-
18:04 - 18:05conditions for
-
18:05 - 18:08when a function K that you might choose
-
18:08 - 18:10is a valid kernel.
-
18:10 - 18:17I should go ahead and show part of that result now.
-
18:17 - 18:24Suppose K is a valid kernel,
-
18:24 - 18:27and when I say K is a kernel, what I mean is there does indeed exist some
-
18:27 - 18:32function Phi for which this holds true.
-
18:32 - 18:35Then
-
18:35 - 18:39let any set of points
-
18:39 - 18:46X1 up to XM be given. Let
-
18:48 - 18:53me define a matrix K.
-
18:53 - 18:54
-
18:54 - 18:58I apologize for overloading notation. K I'm going
-
18:58 - 19:00to use to denote both the
-
19:00 - 19:03kernel function, which is the function of X and Z
-
19:03 - 19:07as well as a matrix. Unfortunately,
-
19:07 - 19:10there aren't enough letters in the alphabet.
-
19:10 - 19:13Well, that's not true. We're going to define
-
19:13 - 19:15the kernel matrix to
-
19:15 - 19:20be an M-by-M matrix such that K subscript IJ is equal to the
-
19:20 - 19:24kernel function
-
19:24 - 19:31applied to two of my examples.
-
19:31 - 19:38Then it
-
19:39 - 19:45turns out that for any M-dimensional vector Z, I want
-
19:45 - 19:50you to consider Z transpose
-
19:50 - 19:57
-
20:02 - 20:04KZ.
-
20:04 - 20:06
-
20:06 - 20:08By
-
20:08 - 20:13definition of matrix multiplication, this is
-
20:13 - 20:15that,
-
20:15 - 20:18and so
-
20:18 - 20:23KIJ is the kernel function applied to XI and XJ, so that must be equal to this. I assumed that K is
-
20:23 - 20:28a valid kernel function,
-
20:28 - 20:31and so there must exist such a
-
20:31 - 20:33function
-
20:33 - 20:37Phi.
-
20:37 - 20:40This is the inner product between two feature vectors, so let me just make that inner product
-
20:40 - 20:43explicit. I'm going
-
20:43 - 20:46to sum over the elements of this vector,
-
20:46 - 20:50and I'm going to use Phi of XI subscript K just to denote the Kth
-
20:50 - 20:57element of this vector.
-
20:58 - 21:03Just
-
21:03 - 21:10rearrange sums. You get
-
21:25 - 21:30sum over K. This next step may look familiar to some of you,
-
21:30 - 21:37which is just this,
-
21:47 - 21:48right? Therefore,
-
21:48 - 21:52this is the sum of squares
-
21:52 - 21:57and it must therefore be greater than or equal to zero.
-
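Written out, the chain of steps being checked here is

    z^T K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i z_j\, \phi(x^{(i)})^T \phi(x^{(j)}) = \sum_i \sum_j z_i z_j \sum_k \phi_k(x^{(i)})\, \phi_k(x^{(j)}) = \sum_k \Big( \sum_i z_i\, \phi_k(x^{(i)}) \Big)^2 \ge 0,

where \phi_k(x^{(i)}) denotes the Kth element of \phi(x^{(i)}).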
21:57 - 22:00Do you want to take a minute to look for all the steps and just make
-
22:00 - 22:07sure you buy
-
22:09 - 22:11them all? Oh, this is
-
22:11 - 22:15the inner product between the vector of Phi of XI and Phi
-
22:15 - 22:17of XJ,
-
22:17 - 22:19so the
-
22:19 - 22:21inner product between two vectors
-
22:21 - 22:28is the sum over all the elements of the vectors of the corresponding element.
-
22:28 - 22:32Student:[Inaudible]. Instructor (Andrew Ng):Oh, yes it is. This is just A
-
22:32 - 22:37transpose B equals sum over K, AK BK, so that's
-
22:37 - 22:39just this. This is the sum over K of the
-
22:39 - 22:44Kth elements of this vector.
-
22:44 - 22:46Take a look at this and make sure
-
22:46 - 22:50it makes sense. Questions about
-
22:50 - 22:57this? So just to
-
23:00 - 23:00summarize,
-
23:00 - 23:03what we showed was that
-
23:03 - 23:05for any vector Z,
-
23:05 - 23:07Z transpose KZ
-
23:07 - 23:10is greater than or equal to zero,
-
23:10 - 23:14and this is one of the standard definitions of
-
23:14 - 23:18the matrix K being positive semidefinite;
-
23:18 - 23:24when a matrix K is positive semidefinite, we write K greater than or equal to
-
23:24 - 23:28zero. Just to summarize, what was shown is that if
-
23:28 - 23:29K
-
23:29 - 23:31is a
-
23:31 - 23:34valid kernel " in other words, if K is a function
-
23:34 - 23:37for which there exists
-
23:37 - 23:38some Phi such that
-
23:38 - 23:40K of XI
-
23:40 - 23:44XJ is the inner product between Phi of XI and Phi of XJ.
-
23:44 - 23:48So if K is a valid kernel, we showed, then, that the kernel matrix
-
23:48 - 23:51
-
23:51 - 23:55must be positive semidefinite.
-
23:55 - 24:00
-
24:00 - 24:02It turns out that the
-
24:02 - 24:05converse is also true,
-
24:05 - 24:09and so this gives you a test for
-
24:09 - 24:13whether a function K is a valid kernel.
-
24:13 - 24:15So this is
-
24:15 - 24:18a theorem due to Mercer, and so kernels are also sometimes called Mercer
-
24:18 - 24:25kernels. It means the same thing. It just means it's a valid kernel. Let
-
24:26 - 24:31K of XZ be given.
-
24:31 - 24:34Then K
-
24:34 - 24:37is a valid
-
24:37 - 24:43kernel "
-
24:43 - 24:45in other
-
24:45 - 24:48words, it's a Mercer kernel, i.e., there exists a Phi
-
24:48 - 24:50such that
-
24:50 - 24:52KXZ
-
24:52 - 24:54
-
24:54 - 24:59
-
24:59 - 25:01equals Phi of X
-
25:01 - 25:05transpose
-
25:05 - 25:11Phi of Z " if and only if for any set of M examples,
-
25:11 - 25:15and this really means for any set of M points. It's not necessarily a training set.
-
25:15 - 25:21It's just any set of M points you may choose. It
-
25:21 - 25:28holds true that the kernel
-
25:29 - 25:30matrix,
-
25:30 - 25:34capital K that I defined just now,
-
25:34 - 25:37is
-
25:37 - 25:38
-
25:38 - 25:45symmetric positive semidefinite.
-
25:49 - 25:54And so I proved only one direction of this result. I proved that if it's a
-
25:54 - 25:55valid kernel, then K
-
25:55 - 25:57is symmetric
-
25:57 - 26:01positive semidefinite, but the converse I didn't show. It turns out that this is
-
26:01 - 26:04a necessary and sufficient condition. And
-
26:04 - 26:06so this gives you a useful test for
-
26:06 - 26:09whether any function that you might want to choose
-
26:09 - 26:16is a
-
26:23 - 26:27kernel. A concrete example of something that's clearly not a
-
26:27 - 26:28valid kernel would
-
26:28 - 26:31be
-
26:31 - 26:36if you find an input X such that K of X, X
-
26:36 - 26:39is minus one, for example, then this is an example of something that's clearly
-
26:39 - 26:42not a valid kernel,
-
26:42 - 26:48because minus one cannot possibly be equal to Phi of X
-
26:48 - 26:49transpose Phi of X,
-
26:49 - 26:53and so this would be one of many examples of functions that
-
26:53 - 26:57will fail to meet the conditions of this theorem,
-
26:57 - 27:04because the inner product of a vector with itself is always greater than or equal to zero. So
-
27:16 - 27:23
-
27:30 - 27:34just to tie this back explicitly to an
-
27:34 - 27:35SVM,
-
27:35 - 27:37let's say you want to use a support vector machine with a kernel,
-
27:37 - 27:41what you do is you choose some function K of XZ,
-
27:41 - 27:46and it turns out that the function I wrote down just now is,
-
27:46 - 27:49indeed, a valid kernel. It
-
27:49 - 27:51is called the Gaussian kernel
-
27:51 - 27:52because of the similarity to
-
27:52 - 27:54Gaussians.
-
27:54 - 27:59So you choose some kernel function like this, or you may choose X
-
27:59 - 28:02transpose Z plus
-
28:02 - 28:03C
-
28:03 - 28:06to the power of D. To apply a support vector machine with a kernel, you choose one of these
-
28:06 - 28:08functions,
-
28:08 - 28:09and the choice of this would depend on your problem.
-
28:09 - 28:14It depends on what is a good measure of whether two examples are
-
28:14 - 28:17similar and whether two examples are different, for your problem.
-
28:17 - 28:22Then you go back to our formulation of support vector machine,
-
28:22 - 28:25and you have to use the dual formulation,
-
28:25 - 28:29and you then replace everywhere you see
-
28:29 - 28:32these
-
28:32 - 28:37things, you
-
28:37 - 28:40replace it with K of XI,
-
28:40 - 28:42XJ.
-
28:42 - 28:45And you then run exactly the same support vector machine algorithm, only everywhere
-
28:45 - 28:49you see these inner products, you replace them with that,
-
28:49 - 28:52and what you've just done is you've taken a support vector machine
-
28:52 - 28:57and you've taken each of your feature vectors X
-
28:57 - 29:03and you've replaced it with implicitly a very high dimensional feature vector.
-
29:03 - 29:06It turns out that the Gaussian kernel corresponds to a feature vector that's
-
29:06 - 29:09infinite dimensional.
-
29:09 - 29:10
-
29:10 - 29:14Nonetheless, you can run a support vector machine in a finite amount of time,
-
29:14 - 29:15even though you're working with
-
29:15 - 29:19infinite dimensional feature vectors,
-
29:19 - 29:21because all you ever need to do is
-
29:21 - 29:23compute these things, and you don't ever need
-
29:23 - 29:26to represent these infinite dimensional feature vectors
-
29:26 - 29:28explicitly.
-
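To make the recipe concrete, here is a minimal sketch in Python of a kernelized decision rule (NumPy; the array alpha and intercept b are assumed to come from solving the dual, and the choice of Gaussian kernel is just an example):

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def svm_predict(x, X_train, y_train, alpha, b, kernel=gaussian_kernel):
        # g( sum_i alpha_i y_i K(x_i, x) + b ) with g = sign; phi is never formed explicitly
        s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, y_train, X_train))
        return 1 if s + b >= 0 else -1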
29:28 - 29:30Why
-
29:30 - 29:33is this a good idea? It turns out "
-
29:33 - 29:38I think I started off talking about support vector machines. I started
-
29:38 - 29:42saying that we wanted to start to develop non-linear learning algorithms.
-
29:42 - 29:45So here's one useful picture to keep in mind, which is that
-
29:45 - 29:48let's say your original data "
-
29:48 - 29:52I didn't mean to draw that slanted. Let's say you have one-dimensional input
-
29:52 - 29:57data. You just have one feature X and R. What a kernel
-
29:57 - 30:01does is the following. It takes your original input data and maps it to a
-
30:01 - 30:05very high dimensional feature space. In the case of Gaussian kernels, an infinite dimensional feature space; for pedagogical reasons, I'll draw
-
30:05 - 30:09two
-
30:09 - 30:13dimensions here.
-
30:13 - 30:20So say [inaudible] very high dimensional feature space where,
-
30:23 - 30:29like so. So it takes all your data in R1 and maps it to R infinity,
-
30:29 - 30:30and then you run a
-
30:30 - 30:34support vector machine in this infinite dimensional or exponentially
-
30:34 - 30:35high dimensional space,
-
30:35 - 30:38and you'll find the optimal margin classifier,
-
30:38 - 30:40in other words, the
-
30:40 - 30:43classifier that separates your data in this very high dimensional space
-
30:43 - 30:46with the largest possible geometric margin.
-
30:46 - 30:50In this example that I just drew, anyway,
-
30:50 - 30:55whereas your data was not linearly separable in your original one dimensional space,
-
30:55 - 30:58when you map it to this much higher dimensional space, it becomes linearly separable, so you
-
30:58 - 31:00can use your
-
31:00 - 31:01linear classifier
-
31:01 - 31:05to classify data which is not linearly separable in your original space.
-
31:05 - 31:07This is how support vector machines
-
31:07 - 31:10output nonlinear decision boundaries,
-
31:10 - 31:15and in the entire process, all you ever need to do is solve convex optimization
-
31:15 - 31:19problems.
-
31:19 - 31:20
-
31:20 - 31:25Questions about any of this? Student:[Inaudible] sigma? Instructor
-
31:25 - 31:30(Andrew Ng):Yeah, so sigma is, let's
-
31:30 - 31:32see. Well, I was going to talk about [inaudible]
-
31:32 - 31:33later. One way
-
31:33 - 31:36to choose sigma is
-
31:36 - 31:39save aside a small amount of your data and
-
31:39 - 31:42try different values of sigma and train an SVM using,
-
31:42 - 31:45say, two thirds of your data.
-
31:45 - 31:48Try different values of sigma, then see what works best on a separate
-
31:48 - 31:49hold out
-
31:49 - 31:53cross validation set, on a separate set that you're testing on. Some of the other
-
31:53 - 31:56learning algorithms we talked about,
-
31:56 - 31:57like
-
31:57 - 32:00locally weighted linear regression, had a bandwidth parameter, so there are a
-
32:00 - 32:03number of parameters to some of these algorithms that you
-
32:03 - 32:06can choose by saving aside some data to test on. I'll
-
32:06 - 32:10talk more about model selection [inaudible] explicitly. Are there other questions? Student:So
-
32:10 - 32:13how do you know that
-
32:13 - 32:14
-
32:14 - 32:17moving it up to high dimensional space
-
32:17 - 32:20is going to give you that kind of separation? Instructor (Andrew
-
32:20 - 32:21Ng):Good question.
-
32:21 - 32:26Usually, you don't know [inaudible]. Sometimes you can know, but
-
32:26 - 32:29in most cases, you won't know in advance whether the data is actually going to be
-
32:29 - 32:30linearly
-
32:30 - 32:34separable, so the next topic will be the soft margin SVM, which is a version of
-
32:34 - 32:41SVMs that works even though the data is not linearly separable. Student:If you can't get the data
-
32:41 - 32:42linearly
-
32:42 - 32:45separated by mapping to a higher dimension, couldn't you
-
32:45 - 32:47also just use [inaudible]
-
32:47 - 32:49higher
-
32:49 - 32:54dimension? Instructor (Andrew Ng):So very right. This is a question about what to do if you can't
-
32:54 - 32:55separate it in higher dimensional
-
32:55 - 32:59space. Let me try to address that with the discussion of the L1 norm soft margin SVM.
-
32:59 - 33:06Okay. Student:What
-
33:06 - 33:08if
-
33:08 - 33:09you run an SVM algorithm
-
33:09 - 33:12that assumes the data are linearly
-
33:12 - 33:13separable on data that
-
33:13 - 33:15is not actually linearly separable? Instructor
-
33:15 - 33:19(Andrew Ng):You guys are really giving me a hard time about whether the
-
33:19 - 33:20data's linearly
-
33:20 - 33:24separable. It turns out this algorithm won't work if the data is not linearly separable, but
-
33:24 - 33:29I'll change that in a second and make it work.
-
33:29 - 33:31
-
33:31 - 33:35Before I move on to talk about that,
-
33:35 - 33:40let me just say one final word about kernels,
-
33:40 - 33:46which is that
-
33:46 - 33:49
-
33:49 - 33:53
-
33:53 - 33:55I talked about kernels
-
33:55 - 34:01in a context of support vector machines,
-
34:01 - 34:05and the idea of kernels was what really made support vector machines a very
-
34:05 - 34:07powerful learning algorithm,
-
34:07 - 34:10and actually towards the end of today's lecture if I have time, I'll actually
-
34:10 - 34:14give a couple more [inaudible] examples of how to choose kernels as well.
-
34:14 - 34:15
-
34:15 - 34:19It turns out that the idea of kernels is actually more general than support vector
-
34:19 - 34:20machines,
-
34:20 - 34:21and in particular,
-
34:21 - 34:26we took this SVM algorithm and we derived a dual, and that was what let us write
-
34:26 - 34:28the entire algorithm
-
34:28 - 34:31in terms of inner products of these.
-
34:31 - 34:35It turns out that you can take many of the other algorithms that you've seen
-
34:35 - 34:37in this class " in fact, it turns
-
34:37 - 34:40out you can take most of the linear algorithms we talked about, such as linear
-
34:40 - 34:42regression, logistic regression
-
34:42 - 34:44[inaudible]
-
34:44 - 34:48and it turns out you can take all of these algorithms and rewrite them entirely
-
34:48 - 34:51in terms of these inner products.
-
34:51 - 34:55So if you have any algorithm that you can rewrite in terms of inner products, then that
-
34:55 - 35:01means you can replace it with K of
-
35:01 - 35:02XI XJ
-
35:02 - 35:06and that means that you can take any of theses algorithms and
-
35:06 - 35:11implicitly map the feature vectors to these very high dimensional feature spaces
-
35:11 - 35:13and have the algorithm
-
35:13 - 35:14
-
35:14 - 35:17still work. The idea of kernels is perhaps most widely used with support vector
-
35:17 - 35:20machines, but it is actually more general than that,
-
35:20 - 35:23and you can take many of the other algorithms that you've seen and many of the algorithms that
-
35:23 - 35:26we'll see later this quarter as well
-
35:26 - 35:30and write them in terms of inner products and thereby kernelize them and apply them to
-
35:30 - 35:33infinite dimensional feature spaces.
-
35:33 - 35:36You'll actually get to play with many of these ideas more in the next problem
-
35:36 - 35:41
-
35:41 - 35:43set.
-
35:43 - 35:47Let's talk about non-linear decision boundaries, and this is
-
35:47 - 35:50the idea of
-
35:50 - 35:55" it's called the L1 norm soft
-
35:55 - 35:59margin SVM. Machine only people sometimes aren't great at coming up with good
-
35:59 - 36:00names, but here's
-
36:00 - 36:02the idea.
-
36:02 - 36:09Let's say I have a data
-
36:13 - 36:15set. This is a linearly separable data
-
36:15 - 36:19set, but what do I do if I have a couple of other examples there that make the data
-
36:19 - 36:24nonlinearly separable,
-
36:24 - 36:26and in fact,
-
36:26 - 36:30sometimes even if the data is linearly separable, maybe you might not want to.
-
36:30 - 36:33So for example, this is a very nice data set. It looks like
-
36:33 - 36:37there's a great decision boundary that separates the two classes. Well,
-
36:37 - 36:40what if I had just one outlier
-
36:40 - 36:42down here? I could
-
36:42 - 36:46still linearly separate this data set
-
36:46 - 36:48with something like that,
-
36:48 - 36:50but I'm somehow letting one
-
36:50 - 36:53slightly suspicious example
-
36:53 - 36:56skew my entire decision boundary by a lot,
-
36:56 - 36:57and so
-
36:57 - 37:01what I'm going to talk about now is the L1 norm soft margin SVM, which is
-
37:01 - 37:05a slightly modified formulation of the SVM optimization problem.
-
37:05 - 37:09This will let us deal with both of these cases: one where the data's
-
37:09 - 37:11just not linearly separable
-
37:11 - 37:14and two, what if you have some examples that you'd rather
-
37:14 - 37:17not get [inaudible] in a training set. Maybe
-
37:17 - 37:19with an outlier here, maybe you actually prefer to
-
37:19 - 37:23choose that original decision boundary and not try so hard to get
-
37:23 - 37:27that training example. Here's the formulation.
-
37:27 - 37:34Our
-
37:44 - 37:45
-
37:45 - 37:46SVM
-
37:46 - 37:51primal problem was to minimize one-half the norm of W
-
37:51 - 37:52squared.
-
37:52 - 37:59
-
38:03 - 38:05So this is our original problem, and I'm
-
38:05 - 38:10going to modify this by adding the
-
38:10 - 38:17following.
-
38:23 - 38:27In other words, I'm gonna add these penalty terms, the slack variables Xi I,
-
38:27 - 38:32and I'm going to demand that each of my training examples is separated with
-
38:32 - 38:36functional margin greater than or equal to one minus Xi I,
-
38:36 - 38:40and you remember
-
38:40 - 38:45
-
38:45 - 38:47if this is
-
38:47 - 38:50greater than zero, was it
-
38:50 - 38:53two lectures ago that I said that if the
-
38:53 - 38:55functional margin is greater than zero,
-
38:55 - 39:01that implies you classified it correctly.
-
39:01 - 39:05If it's less than zero, then you misclassified it.
-
39:05 - 39:08By setting some of the Xi I to be larger than one, I can
-
39:08 - 39:11actually have some examples with
-
39:11 - 39:13functional margin negative,
-
39:13 - 39:16and therefore I'm allowing my algorithm to misclassify some of the examples of the
-
39:16 - 39:19training set.
-
39:19 - 39:23However, I'll encourage the algorithm not to do that by adding to the
-
39:23 - 39:24optimization objective,
-
39:24 - 39:30this sort of penalty term that penalizes setting the Xi I to be large.
-
39:30 - 39:34This is an optimization problem where the parameters are W, B,
-
39:34 - 39:37and all of the Xi I,
-
39:37 - 39:43and this is also a convex optimization problem. It
-
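In full, the modified primal problem being described, with the slack variables written \xi_i, is

    \min_{w, b, \xi}\;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^m \xi_i
    \text{subject to}\;\; y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, m.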
39:43 - 39:47turns out that
-
39:47 - 39:51similar to how we worked on the dual of the support vector machine,
-
39:51 - 39:54we can also work out the dual for this optimization problem. I won't
-
39:54 - 39:55actually do it,
-
39:55 - 39:59but just to show you the steps, what you do is you construct the Lagrangian, and I'm going
-
39:59 - 40:02to
-
40:02 - 40:03use
-
40:03 - 40:08Alpha and R to denote the Lagrange multipliers corresponding to
-
40:08 - 40:11this set of constraints that we had previously
-
40:11 - 40:15and this new set of constraints that the Xi I are greater than or equal to zero. This gives us
-
40:15 - 40:16two sets of
-
40:16 - 40:23Lagrange multipliers. The Lagrangian will be the
-
40:23 - 40:25optimization objective
-
40:25 - 40:32minus sum from
-
40:36 - 40:39plus Xi I
-
40:39 - 40:44minus
-
40:44 - 40:47and so
-
40:47 - 40:51there's our [inaudible] optimization objective
-
40:51 - 40:55minus or plus Alpha times each of these constraints,
-
40:55 - 40:57
-
40:57 - 41:04which are
-
41:05 - 41:12
-
41:18 - 41:21greater or equal to
-
41:21 - 41:28zero. I won't redivide the entire dual again, but it's really the same,
-
41:28 - 41:34and when you derive the dual of this optimization problem and when you simplify,
-
41:34 - 41:37you find that you get the following.
-
41:37 - 41:41You have to maximize
-
41:41 - 41:48[inaudible], which is actually the same as before.
-
41:51 - 41:58
-
42:03 - 42:10
-
42:20 - 42:23So it turns out when you derive the dual
-
42:23 - 42:25and simplify,
-
42:25 - 42:29it turns out that the only way the dual changes compared to the previous one
-
42:29 - 42:32is that rather than the constraint that the Alpha I's are greater than or equal to zero,
-
42:32 - 42:37we now have a constraint that the Alphas are between zero and C.
-
42:37 - 42:40This derivation isn't very hard, and you're encouraged to go home and try to do it yourself.
-
42:40 - 42:42It's really essentially the same math,
-
42:42 - 42:44and when you simplify, it turns out
-
42:44 - 42:49you can simplify the R Lagrange multipliers away
-
42:49 - 42:49and you end up with
-
42:49 - 42:55just these constraints of
-
42:55 - 42:57
-
42:57 - 43:02the
-
43:02 - 43:05Alphas.
-
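For reference, the resulting dual, identical to before except for the box constraint on the alphas, is

    \max_\alpha\; W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
    \text{subject to}\;\; 0 \le \alpha_i \le C, \qquad \sum_{i=1}^m \alpha_i y^{(i)} = 0.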
43:05 - 43:06Just as an aside,
-
43:06 - 43:09I won't
-
43:09 - 43:12derive these, either. It turns out that,
-
43:12 - 43:16remember, I wrote down the KKT conditions in
-
43:16 - 43:18the last lecture.
-
43:18 - 43:22The necessary conditions for something to be an optimal solution
-
43:22 - 43:24to constrained optimization problems.
-
43:24 - 43:27So if you use the KKT conditions,
-
43:27 - 43:30it turns out you can actually derive convergence conditions, so
-
43:30 - 43:33we want to solve this optimization problem.
-
43:33 - 43:38When do we know the Alphas have converged to the global optimum?
-
43:38 - 43:41It turns out you can use the following.
-
43:41 - 43:48
-
43:48 - 43:55
-
44:05 - 44:12
-
44:14 - 44:17
-
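The conditions written on the board here are presumably the standard KKT dual-complementarity conditions for this problem:

    \alpha_i = 0 \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \ge 1
    \alpha_i = C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \le 1
    0 < \alpha_i < C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) = 1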
44:17 - 44:21I don't want to say a lot about these. It
-
44:21 - 44:24turns out from the KKT conditions you can derive
-
44:24 - 44:27these as the convergence conditions for an algorithm that you might choose to
-
44:27 - 44:28use
-
44:28 - 44:34to try to solve the optimization problem in terms of the
-
44:34 - 44:40Alphas. That's the L1 norm soft margin SVM, and this is the
-
44:40 - 44:44change to the algorithm that lets us handle non-linearly separable data sets as well as
-
44:44 - 44:47single outliers that may still be linearly separable
-
44:47 - 44:54but you may choose not to separate [inaudible].
-
44:56 - 44:58Questions about this?
-
44:58 - 45:03Raise
-
45:03 - 45:10
-
45:18 - 45:25
-
45:40 - 45:47
-
45:52 - 45:57your hand if this stuff makes sense at all. Great.
-
45:57 - 46:00So
-
46:00 - 46:03the last thing I want to do is
-
46:03 - 46:08talk about an algorithm for actually solving this optimization problem.
-
46:08 - 46:10We wrote down
-
46:10 - 46:14this dual optimization problem with convergence criteria,
-
46:14 - 46:19so let's come up with an efficient algorithm to actually solve this optimization problem.
-
46:19 - 46:21
-
46:21 - 46:25I want to do this partly to give me an excuse to talk about an algorithm
-
46:25 - 46:29called coordinate ascent, which is useful to
-
46:29 - 46:34do. What I actually want to do is tell you about an algorithm called coordinate
-
46:34 - 46:37ascent, which is a useful algorithm to know about,
-
46:37 - 46:40and it'll turn out that it won't apply in the simplest form
-
46:40 - 46:44to this problem, but we'll then be able to modify it slightly and then it'll
-
46:44 - 46:46give us a very efficient
-
46:46 - 46:49algorithm for solving this [inaudible]
-
46:49 - 46:52optimization problem. That was the other reason that I had to derive the dual, not only so that we could
-
46:52 - 46:53use kernels
-
46:53 - 46:58but also so that we can apply an algorithm like the
-
46:58 - 47:01SMO
-
47:01 - 47:05algorithm. First, let's talk about coordinate ascent, which is another
-
47:05 - 47:12[inaudible] optimization
-
47:18 - 47:21algorithm. To describe coordinate
-
47:21 - 47:25ascent, I just want you to consider the problem where we
-
47:25 - 47:30want to maximize some function
-
47:30 - 47:34W, which is a function of Alpha one through Alpha M with no constraints. So for
-
47:34 - 47:39
-
47:39 - 47:40
-
47:40 - 47:43now, forget about the constraint that the Alpha I's must be between zero and C. Forget about the
-
47:43 - 47:50constraint that the sum of YI Alpha I must be equal to zero.
-
47:50 - 47:53Then this is the coordinate
-
47:53 - 47:54ascent
-
47:54 - 47:56algorithm. It will
-
47:56 - 48:01repeat until convergence
-
48:01 - 48:04
-
48:04 - 48:07and will do for I equals one to M. The inner loop of coordinate ascent essentially
-
48:07 - 48:10holds all the parameters
-
48:10 - 48:12except Alpha I fixed
-
48:12 - 48:16and then it just maximizes this function with respect to just one of the parameters. Let me
-
48:16 - 48:18write that
-
48:18 - 48:21as Alpha I gets
-
48:21 - 48:24updated as the arg max
-
48:24 - 48:29over Alpha I hat of
-
48:29 - 48:30W
-
48:30 - 48:32Alpha one
-
48:32 - 48:39
-
48:42 - 48:49Alpha I minus one. This is really the fancy way of saying hold everything
-
48:50 - 48:54except Alpha I
-
48:54 - 48:58fixed. Just optimize W, my optimization objective, with
-
48:58 - 49:00respect to only
-
49:00 - 49:01Alpha I.
-
49:01 - 49:02This is just a fancy
-
49:02 - 49:04way of writing
-
49:04 - 49:09it. This is coordinate
-
49:09 - 49:11ascent.
-
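As a minimal sketch of this loop in Python, here is coordinate ascent applied to a small concave quadratic (the particular objective W and its closed-form one-variable update are purely illustrative, not the SVM dual):

    import numpy as np

    # illustrative concave objective: W(a) = b^T a - 0.5 * a^T A a, with A symmetric positive definite
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])

    def W(a):
        return b @ a - 0.5 * a @ A @ a

    a = np.zeros(2)
    for sweep in range(50):                 # repeat until (approximately) converged
        for i in range(len(a)):             # for i = 1, ..., m
            # hold every coordinate except a[i] fixed and maximize W over a[i] alone;
            # for this quadratic, the one-dimensional maximizer has a closed form
            a[i] = (b[i] - A[i] @ a + A[i, i] * a[i]) / A[i, i]

    print(a, W(a))                          # approaches the global maximizer, A^{-1} b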
49:11 - 49:14One picture
-
49:14 - 49:18that's kind of useful for coordinate ascent is if you
-
49:18 - 49:25imagine you're trying to optimize a quadratic function, it
-
49:25 - 49:27really looks like
-
49:27 - 49:31that. These are the contours of the quadratic function, and the minimum is
-
49:31 - 49:32here.
-
49:32 - 49:37This is what
-
49:37 - 49:40coordinate ascent would look like. These are my axes; I'll call this Alpha two and I'll call this Alpha one.
-
49:40 - 49:45My Alpha one Alpha two axis, and so let's say I start down here. Then I'm going
-
49:45 - 49:49to begin by minimizing this with respect to Alpha one. I
-
49:49 - 49:51go there.
-
49:51 - 49:55And then at my new point, I'll minimize with respect to Alpha two, and so I might go to someplace like
-
49:55 - 49:56that.
-
49:56 - 49:58Then, I'll minimize
-
49:58 - 50:02with respect to Alpha one, then go back to Alpha two
-
50:02 - 50:04and so on.
-
50:04 - 50:08You're always taking these axis-aligned steps
-
50:08 - 50:12to get to the minimum.
-
50:12 - 50:16It turns out that there's a modification to this. There are
-
50:16 - 50:18variations of this as well. The way I describe
-
50:18 - 50:21the algorithm, we're always doing this
-
50:21 - 50:24in alternating order. We always optimize with respect to Alpha
-
50:24 - 50:29one then Alpha two, then Alpha one, then Alpha two.
-
50:29 - 50:32What I'm about to say applies only in higher dimensions, but it turns out if you have
-
50:32 - 50:35a lot of parameters, Alpha one through Alpha M,
-
50:35 - 50:39you may not choose to always visit them in a fixed order. You may choose
-
50:39 - 50:43which Alphas update next depending on what you
-
50:43 - 50:46think will allow you to make the most progress. If you have only two parameters, it makes
-
50:46 - 50:50sense to alternate between them.
-
50:50 - 50:51If
-
50:51 - 50:53you have
-
50:53 - 50:57higher dimensional parameters, sometimes you may choose to update them in a
-
50:57 - 51:01different order if you think doing so would let you make faster progress towards the
-
51:01 - 51:02
-
51:02 - 51:05
-
51:05 - 51:09maximum. It turns out that coordinate assent
-
51:09 - 51:11compared to some of the algorithms we saw previously, such as, say, Newton's
-
51:11 - 51:12
-
51:12 - 51:16method, coordinate ascent will usually take a lot more steps,
-
51:16 - 51:21but the chief advantage of coordinate ascent when it works well
-
51:21 - 51:23is that sometimes
-
51:23 - 51:29the optimization objective W
-
51:29 - 51:33is very inexpensive to optimize with respect to any one of your
-
51:33 - 51:34parameters,
-
51:34 - 51:38and so even though coordinate ascent has to take many more iterations than, say, Newton's
-
51:38 - 51:39method
-
51:39 - 51:41in order to
-
51:41 - 51:45converge, each of those iterations can be much cheaper. It turns out that there are many optimization problems for which it's
-
51:45 - 51:47particularly easy
-
51:47 - 51:48to fix
-
51:48 - 51:52all but one of the parameters and optimize with respect to just that one parameter,
-
51:52 - 51:56and if that's true, then the inner loop of coordinate ascent, optimizing with
-
51:56 - 51:58respect to Alpha I can be done very quickly
-
51:58 - 52:01and cause [inaudible].
-
52:01 - 52:02It turns out that
-
52:02 - 52:07this will be true when we modify this algorithm to solve the SVM
-
52:07 - 52:12optimization problem.
-
52:12 - 52:19Questions about this?
-
52:22 - 52:29
-
52:47 - 52:50Okay. Let's go ahead and apply this
-
52:50 - 52:54to our support vector machine dual optimization problem. It
-
52:54 - 52:57turns out that
-
52:57 - 53:02coordinate assent in its basic form does not work for the following reason. The
-
53:02 - 53:07reason is we have constraints on the Alpha I's. Namely, if you
-
53:07 - 53:11recall from what we worked out previously,
-
53:11 - 53:12we have a constraint that the sum over I of YI
-
53:12 - 53:16Alpha I must be equal to zero,
-
53:16 - 53:20and so if you fix all the Alphas except for one,
-
53:20 - 53:24then you can't change one Alpha without violating the constraint. If
-
53:24 - 53:25
-
53:25 - 53:26I just try to change Alpha one,
-
53:26 - 53:27
-
53:27 - 53:31Alpha one is actually exactly determined as a function of the other
-
53:31 - 53:33Alphas, because these have to sum to zero.
-
53:33 - 53:36The
-
53:36 - 53:39SMO algorithm, by the way, is due to John Platt, a colleague at
-
53:39 - 53:43Microsoft. The SMO
-
53:43 - 53:46algorithm, therefore, instead of trying to change one Alpha at a time,
-
53:46 - 53:49we will try to change
-
53:49 - 53:50two
-
53:50 - 53:54Alphas at a time.
-
53:54 - 53:55This is
-
53:55 - 53:58called the SMO algorithm, in a
-
53:58 - 54:01sense the sequential minimal optimization
-
54:01 - 54:03and the term minimal refers to the fact that
-
54:03 - 54:06we're choosing the smallest number of Alpha Is to change at a time, which
-
54:06 - 54:11in this case, we need to change at least two at a time.
-
54:11 - 54:17So let me go ahead and outline the algorithm.
-
54:17 - 54:20We will
-
54:20 - 54:24select
-
54:24 - 54:29two Alphas to change, some Alpha I and Alpha J, via some heuristic, and a heuristic
-
54:29 - 54:34just means a rule of thumb.
-
54:34 - 54:41We'll hold all the Alpha Ks fixed
-
54:41 - 54:45except
-
54:45 - 54:50Alpha I and Alpha J
-
54:50 - 54:54and optimize W of Alpha
-
54:54 - 54:57with respect
-
54:57 - 54:59to Alpha I
-
54:59 - 55:06and Alpha J subject to all the constraints.
-
55:14 - 55:18It turns out the
-
55:18 - 55:20key step
-
55:20 - 55:21which I'm going to
-
55:21 - 55:22work out
-
55:22 - 55:26is this one, which
-
55:26 - 55:28is how do you optimize W of Alpha
-
55:28 - 55:32with respect to the two parameters that you just chose to update
-
55:32 - 55:35and subject to the constraints? I'll do
-
55:35 - 55:37that in a second.
-
55:37 - 55:39
-
55:39 - 55:41You would keep running this algorithm
-
55:41 - 55:43until you have satisfied
-
55:43 - 55:45these
-
55:45 - 55:49convergence criteria up to Epsilon.
-
55:49 - 55:54What I want to do now is describe how to do this key step,
-
55:54 - 55:56how to optimize W of Alpha
-
55:56 - 55:58with respect to
-
55:58 - 56:00Alpha I and Alpha J,
-
56:00 - 56:05and it turns out that it's because you can do this extremely efficiently
-
56:05 - 56:08that the SMO algorithm works well. The inner loop of the SMO
-
56:08 - 56:12algorithm can be done extremely efficiently, so it may take a large number of
-
56:12 - 56:19iterations to converge, but each iteration is very cheap. Let's talk about that.
-
56:33 - 56:40
-
56:42 - 56:46So in order to derive that step where we update in respect to Alpha I and
-
56:46 - 56:47Alpha J,
-
56:47 - 56:51I'm actually going to use Alpha one and Alpha two as my example. I'm gonna
-
56:51 - 56:52update
-
56:52 - 56:57Alpha one and Alpha two. In general, this could be any Alpha I and Alpha J, but
-
56:57 - 57:00just to make my notation on the board easier, I'm going to derive the
-
57:00 - 57:02derivation for Alpha one and Alpha two
-
57:02 - 57:07and the general case is completely analogous.
-
57:07 - 57:09
-
57:09 - 57:13On every step of the algorithm, we have to respect the constraint that the
-
57:13 - 57:17sum over I Alpha I YI is equal to zero. This is one of the
-
57:17 - 57:19constraints we had previously for
-
57:19 - 57:23our dual optimization problem.
-
57:23 - 57:26This means that Alpha one Y1 plus Alpha
-
57:26 - 57:28two Y2 must be equal to this,
-
57:28 - 57:31to
-
57:31 - 57:38
-
57:41 - 57:43which I'm going to denote
-
57:43 - 57:46by Zeta.
-
57:46 - 57:48
-
57:48 - 57:49So
-
57:49 - 57:52we also have the constraint
-
57:52 - 57:53that the Alpha Is
-
57:53 - 57:57must be between zero and C. We had two constraints on our dual. This was one of the
-
57:57 - 58:00constraints. This was the other one.
-
58:00 - 58:07In pictures,
-
58:08 - 58:15the
-
58:29 - 58:33constraint that the Alpha I's are between zero
-
58:33 - 58:35and C, that is often called the box constraint,
-
58:35 - 58:38and so if I draw Alpha one
-
58:38 - 58:43and Alpha two, then
-
58:43 - 58:50I
-
58:50 - 58:53have a constraint that
-
58:53 - 58:57the values of Alpha one and Alpha two must lie within this box that ranges from zero
-
58:57 - 58:59to C.
-
58:59 - 59:02And so the picture of the algorithm is as follows.
-
59:02 - 59:04I have some constraint
-
59:04 - 59:07that Alpha one Y1
-
59:07 - 59:09plus Alpha
-
59:09 - 59:10two Y2
-
59:10 - 59:14must be equal to Zeta,
-
59:14 - 59:18
-
59:18 - 59:21
-
59:21 - 59:22
-
59:22 - 59:26and so this implies that
-
59:26 - 59:32Alpha one must be equal to Zeta minus Alpha two Y2
-
59:32 - 59:37over Y1,
-
59:37 - 59:40and so what I want to do is I want to optimize the objective with respect
-
59:40 - 59:43
-
59:43 - 59:45to this.
-
59:45 - 59:49What I can do is plug in my definition for Alpha one as a function of Alpha two
-
59:49 - 59:51and so this becomes W of
-
59:51 - 59:55Zeta minus Alpha two Y2
-
59:55 - 59:57over
-
59:57 - 60:02Y1, Alpha two, Alpha three
-
60:02 - 60:03and so on,
-
60:03 - 60:07and it turns out that because W is a quadratic function " if you look back to our
-
60:07 - 60:11earlier definition of W, you find it's a quadratic function in all the Alphas "
-
60:11 - 60:13it turns out that if you look at this expression for W
-
60:13 - 60:17and if you view it as just a function of Alpha two,
-
60:17 - 60:22you find that this is a one dimensional quadratic function of Alpha two
-
60:22 - 60:25if you hold Alpha three, Alpha four and so on fixed,
-
60:25 - 60:26and so
-
60:26 - 60:30this can be simplified to some expression of the form A Alpha two squared
-
60:30 - 60:32plus B Alpha two plus
-
60:32 - 60:33C. This
-
60:33 - 60:38is a standard quadratic function.
-
60:38 - 60:40This is really easy to optimize. We know how to optimize
-
60:40 - 60:42"
-
60:42 - 60:44when did we learn this? This was high school
-
60:44 - 60:48or undergrad or something. You know how to optimize quadratic functions like these.
-
60:48 - 60:50You just do that and that gives you the
-
60:50 - 60:52optimal value for Alpha two.
-
60:52 - 60:55The last step with a
-
60:55 - 60:59box constraint like this, just in pictures,
-
60:59 - 61:02you know your solution must lie on this line, and so there'll be some
-
61:02 - 61:05sort of quadratic function
-
61:05 - 61:08over this line, say,
-
61:08 - 61:10and so if you minimize the quadratic function,
-
61:10 - 61:14maybe you get a value that lies in the box, and if so, then you're done.
-
61:14 - 61:17Or if your quadratic function looks like this,
-
61:17 - 61:21maybe when you optimize your quadratic function, you may end up with a value outside, so you end up with
-
61:21 - 61:23a solution like that. If
-
61:23 - 61:26that happens, you clip your solution
-
61:26 - 61:30just to map it back inside the box.
-
61:30 - 61:35That'll give you the optimal solution of this quadratic optimization problem
-
61:35 - 61:36subject to
-
61:36 - 61:37your solution
-
61:37 - 61:40satisfying this box constraint
-
61:40 - 61:43and lying on this straight line " in other words, subject to
-
61:43 - 61:49the solution lying on this line segment within the box.
-
61:49 - 61:53
-
61:53 - 61:58Having solved the Alpha two this way, you can clip it if necessary
-
61:58 - 62:00to get it back within the box constraint
-
62:00 - 62:01and then
-
62:01 - 62:04we have Alpha one as a function of Alpha two
-
62:04 - 62:06and this allows you to
-
62:06 - 62:10optimize W with respect to Alpha one and Alpha two quickly,
-
62:10 - 62:11subject to all the constraints,
-
62:11 - 62:15and the key step is really just sort of a one-dimensional quadratic optimization, which
-
62:15 - 62:16we do very quickly,
-
62:16 - 62:23which is what makes the inner loop of the SMO algorithm very efficient. Student:You mentioned here that
-
62:25 - 62:32we can
-
62:33 - 62:34change
-
62:34 - 62:38whatever, but the SMO
-
62:38 - 62:43algorithm, we can
-
62:43 - 62:45change two at a
-
62:45 - 62:46time,
-
62:46 - 62:49so how is that [inaudible] understand that. Instructor (Andrew Ng):Right. Let's say I want to change
-
62:49 - 62:52" as I run optimization algorithm, I
-
62:52 - 62:55have to respect the constraint that
-
62:55 - 62:56sum over
-
62:56 - 62:58I Alpha I YI
-
62:58 - 63:00must be equal to zero,
-
63:00 - 63:07so this is a linear constraint that I didn't have when I was talking about coordinate ascent.
-
63:07 - 63:09Suppose I tried
-
63:09 - 63:13to change just Alpha one.
-
63:13 - 63:15Then I know that Alpha one
-
63:15 - 63:20must be equal to minus the sum from I equals two to M
-
63:20 - 63:22Alpha I
-
63:22 - 63:23YI
-
63:23 - 63:26divided by Y1, right,
-
63:26 - 63:27and so Alpha
-
63:27 - 63:30one can actually be written exactly as a function of Alpha two, Alpha three and so on
-
63:30 - 63:33through Alpha M.
-
63:33 - 63:36And so if I hold Alpha two, Alpha three, Alpha four
-
63:36 - 63:40through Alpha M fixed, then I can't change Alpha one, because Alpha one is fully
-
63:40 - 63:42
-
63:42 - 63:46determined by the other Alphas. Whereas in contrast, if I choose to change Alpha one and Alpha two at the same
-
63:46 - 63:47time,
-
63:47 - 63:51then I still have a constraint and so I know Alpha one and Alpha two
-
63:51 - 63:54must satisfy
-
63:54 - 63:55that linear constraint
-
63:55 - 63:59but at least this way I can change Alpha one if I also
-
63:59 - 64:05change Alpha two accordingly to make sure this satisfies
-
64:05 - 64:09the constraint.
-
64:09 - 64:14Student:[Inaudible]. Instructor (Andrew Ng):So Zeta was defined [inaudible]. So on
-
64:14 - 64:16each iteration, I have some
-
64:16 - 64:18setting of the parameters,
-
64:18 - 64:21Alpha one, Alpha two, Alpha three and so on,
-
64:21 - 64:25and I want to change Alpha one and Alpha two, say.
-
64:25 - 64:27So from the previous iteration,
-
64:27 - 64:28let's
-
64:28 - 64:32say I had not violated the constraint, so that holds true,
-
64:32 - 64:35and so I'm just defining Zeta to be equal to this, because
-
64:35 - 64:38Alpha one Y1 plus Alpha two Y2 must be equal to sum from
-
64:38 - 64:42I equals three to M of that,
-
64:42 - 64:44and so I'm just defining this
-
64:44 - 64:51to be Zeta.
-
65:00 - 65:01Student:[Inaudible]. Instructor (Andrew
-
65:01 - 65:07Ng):On every iteration, you change maybe a different pair of Alphas to update. The
-
65:07 - 65:11way you do this is something I don't want to talk about. I'll say a couple more words about
-
65:11 - 65:12that, but
-
65:12 - 65:14the basic outline of the algorithm is
-
65:14 - 65:18on every iteration of the algorithm, you're going to select some Alpha I and
-
65:18 - 65:19Alpha J to update
-
65:19 - 65:23like on this board. So that's an Alpha I and an Alpha J to update
-
65:23 - 65:25via some heuristic
-
65:25 - 65:28and then you use the procedure I just described to
-
65:28 - 65:31actually update
-
65:31 - 65:34Alpha I and Alpha J. What I actually just talked about
-
65:34 - 65:35was the procedure
-
65:35 - 65:40to optimize W with respect to Alpha I and Alpha J. I didn't actually talk about the
-
65:40 - 65:47heuristic for choosing Alpha
-
65:49 - 65:52I and Alpha J. Student:What is the function W? Instructor (Andrew Ng):W is way up
-
65:52 - 65:54there. I'll just write it again. W of
-
65:54 - 66:00Alpha is that function we had previously. W of Alpha was the sum over I
-
66:00 - 66:03" this is about solving the "
-
66:03 - 66:10it was that thing.
-
66:10 - 66:13All of this is about solving the optimization problem for the SVM, so this
-
66:13 - 66:19is the objective function we had, so that's W of Alpha. Student:[Inaudible]? Exchanging one of the Alphas, optimizing
-
66:19 - 66:26that one, you
-
66:26 - 66:27can
-
66:27 - 66:29make the
-
66:29 - 66:34other one that you have to change work,
-
66:34 - 66:36right? Instructor (Andrew Ng):What
-
66:36 - 66:40do
-
66:40 - 66:44you mean works? Student:It will get farther from its optimal. Instructor (Andrew Ng):Let me translate it differently.
-
66:44 - 66:47What we're trying to do is we're trying to optimize the objective function W
-
66:47 - 66:48of Alpha,
-
66:48 - 66:52so the metric of progress that we care about is whether W of Alpha is getting
-
66:52 - 66:55better on every iteration,
-
66:55 - 66:59and so what is true for coordinate ascent and for SMO is that on every iteration,
-
66:59 - 67:03W of Alpha can only increase. It may stay the same
-
67:03 - 67:05or it may increase. It can't get worse.
-
67:05 - 67:09It's true that eventually, the Alphas will converge to some value. It's true that
-
67:09 - 67:13in intervening iterations, one of the Alphas may move further
-
67:13 - 67:17away and then closer and further and closer to its final value,
-
67:17 - 67:20but what we really care about is that W of Alpha is getting better
-
67:20 - 67:27every time, which is true.
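For reference, the function W of Alpha being discussed here is the dual objective derived earlier. Written out (a reconstruction, since the board is not captured in the transcript), it is

\[
W(\alpha) \;=\; \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \,\bigl\langle x^{(i)}, x^{(j)} \bigr\rangle ,
\]

maximized subject to \(\sum_{i=1}^{m} \alpha_i y^{(i)} = 0\) and, in the regularized (non-separable) version, \(0 \le \alpha_i \le C\).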
-
67:28 - 67:31Just a couple more words on
-
67:31 - 67:34SMO before I wrap up on this.
-
67:34 - 67:53One is that the heuristic in John Platt's original algorithm, for choosing which pair of values Alpha I and Alpha J to update next, is one of those things that's not conceptually complicated but is very complicated to explain in words, so I won't talk about that here.
-
67:53 - 67:56If you want to learn about it,
-
67:56 - 67:59go ahead and look up John Platt's paper on the
-
67:59 - 68:00
-
68:00 - 68:24SMO algorithm. The paper is pretty easy to read, and later on, we'll also be posting a handout on the course homepage with a simplified version of this algorithm that you can use in the problem sets, so you can see some of these steps in more detail.
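To give a rough sense of what such a simplified version looks like, here is a self-contained sketch (my own reconstruction, not the course handout and not Platt's algorithm: the second Alpha is chosen at random rather than by the heuristic, and there is none of the error caching a serious implementation would use):

```python
import numpy as np

def simplified_smo(K, y, C=1.0, tol=1e-4, max_passes=10):
    """Sketch of SMO: K is the kernel (Gram) matrix, y holds labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    f = lambda k: np.sum(alpha * y * K[:, k]) + b      # decision value on training example k
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # Only try to update Alphas that look like they violate the KKT conditions.
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(m) if k != i])   # stand-in for the heuristic
                E_j = f(j) - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # Feasible range for alpha[j] given 0 <= alpha <= C and the linear constraint.
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # curvature of W along the constraint line
                if eta >= 0:
                    continue
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-7:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])   # keep sum of alpha_i y_i unchanged
                # Standard intercept update for the pair just changed.
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

On every pass, each pairwise update can only increase (or leave unchanged) the dual objective, which is exactly the coordinate ascent property discussed above.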
-
68:24 - 68:40One other thing that I didn't talk about was how to compute the parameter B. So this procedure solves for all your Alphas, and the Alphas also give us W. The remaining piece, then, is the intercept B, and it turns out that again is actually not very difficult.
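As an aside, in the linearly separable case one standard way to recover the intercept once the optimal Alphas (and hence W) are known is

\[
b \;=\; -\,\frac{\displaystyle \max_{i:\,y^{(i)}=-1} w^{T} x^{(i)} \;+\; \min_{i:\,y^{(i)}=1} w^{T} x^{(i)}}{2} ,
\]

which just places the hyperplane halfway between the closest negative and closest positive training examples; take this as a pointer to the idea rather than the derivation the notes give.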
-
68:40 - 68:43I'll let you read about that yourself with the
-
68:43 - 68:45notes that we'll post
-
68:45 - 68:52along with the next
-
68:52 - 68:56problems. To wrap up today's lecture, what I want to do is just tell
-
68:56 - 69:01you briefly about a couple of examples of applications
-
69:01 - 69:03
-
69:03 - 69:04
-
69:04 - 69:07
-
69:07 - 69:14of SVMs.
-
69:26 - 69:33Let's consider the problem of handwritten digit recognition. In handwritten digit recognition,
-
69:33 - 69:37you're given a pixel array with
-
69:37 - 69:41a scanned image of,
-
69:41 - 69:44say, a zip code that someone has written. This is an array of pixels,
-
69:44 - 69:48and some of these pixels will be on
-
69:48 - 69:49and other pixels
-
69:49 - 69:53will be off. This combination of pixels being on maybe represents the
-
69:53 - 69:56character one.
-
69:56 - 70:32The question is: given an input feature vector like this - if you have, say, ten pixels by ten pixels, then you have a hundred-dimensional feature vector, where maybe these are binary features with each X being zero or one, or maybe the Xs are gray-scale values corresponding to how dark each of these pixels was - can you recognize which digit it is?
-
70:32 - 70:40It turns out that for many years, a neural network was the champion algorithm for handwritten digit recognition.
-
70:40 - 70:44And it turns out that you can apply an SVM with the
-
70:44 - 70:47following
-
70:47 - 70:53kernel. It turns out either the polynomial kernel or
-
70:53 - 70:58the Gaussian
-
70:58 - 71:03
-
71:03 - 71:05kernel works fine for this problem,
-
71:05 - 71:10and just by writing down this kernel and throwing an SVM at
-
71:10 - 71:14it, the SVM gave performance comparable to the very best neural networks.
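As a rough modern illustration of that kind of setup, here is a minimal sketch using scikit-learn and its small 8x8 digits dataset (neither of which is part of the lecture; the kernel choice and the value of C are illustrative, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Each row of X is a flattened array of pixel intensities; y is the digit label.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM with a Gaussian (RBF) kernel; a polynomial kernel (kernel="poly") also works here.
clf = SVC(kernel="rbf", gamma="scale", C=10.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Note that the classifier only ever sees each image as a flat vector of pixel values, which is exactly the point made next: it has no built-in notion of which pixels are next to which.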
-
71:14 - 71:15
-
71:15 - 71:20
-
71:20 - 71:22This is surprising because support vector machine
-
71:22 - 71:26doesn't take into account any knowledge about the pixels,
-
71:26 - 71:27and in particular,
-
71:27 - 71:30it doesn't know that this pixel is
-
71:30 - 71:32next to that pixel
-
71:32 - 71:37because it's just representing the pixel intensity value as a vector.
-
71:37 - 71:40And so this means the performance of SVM would be the same even if you were to
-
71:40 - 71:43shuffle all the pixels around.
-
71:43 - 71:46[Inaudible]
-
71:46 - 71:52let's say comparable to the very best neural networks, which had been
-
71:52 - 71:59under very careful development for many years. I want
-
71:59 - 72:00to
-
72:00 - 72:04tell you about one other cool example, which is SVMs are
-
72:04 - 72:09also used to classify other fairly esoteric objects. So for
-
72:09 - 72:09example,
-
72:09 - 72:13let's say you want to classify
-
72:13 - 72:17protein sequences into different classes of
-
72:17 - 72:21proteins. Every time I do this, I suspect that biologists in the room cringe, so I apologize
-
72:21 - 72:25for that. There are
-
72:25 - 72:2920 amino acids, and proteins in our bodies are made up by
-
72:29 - 72:31sequences of amino acids.
-
72:31 - 72:41Even though there are 20 amino acids and 26 letters in the alphabet, I'm going to denote the amino acids by the letters A through Z, with apologies to the biologists. Here's
-
72:41 - 72:45an amino acid sequence
-
72:45 - 72:52
-
72:54 - 73:01
-
73:03 - 73:05
-
73:05 - 73:12represented by a series of letters.
-
73:16 - 73:18
-
73:18 - 73:19So
-
73:19 - 73:21suppose I want to assign
-
73:21 - 73:25this protein into a few classes depending on what
-
73:25 - 73:28type of protein it is. The
-
73:28 - 73:30question is how
-
73:30 - 73:33do I construct my feature
-
73:33 - 73:37vector? This is challenging for many reasons, one of which is that protein
-
73:37 - 73:40sequences can be of different lengths. There are some very long protein
-
73:40 - 73:42sequences and some very short ones,
-
73:42 - 73:45and so you can't have a feature saying what is
-
73:45 - 73:49the amino acid in the 100th position, because maybe there is no 100th
-
73:49 - 73:53position in this protein sequence. Some are very long. Some are very short.
-
73:53 - 73:55Here's my feature representation,
-
73:55 - 73:57which is
-
73:57 - 74:00I'm going to write down
-
74:00 - 74:05all possible combinations of four letters. I'm going to write down AAAA, AAAB, AAAC
-
74:05 - 74:07down
-
74:07 - 74:09to
-
74:09 - 74:11AAAZ and then
-
74:11 - 74:14AABA
-
74:14 - 74:20and so on. You get the idea. Write down
-
74:20 - 74:24all
-
74:24 - 74:27possible combinations of
-
74:27 - 74:28four letters
-
74:28 - 74:31and my feature representation will be
-
74:31 - 74:34I'm going to scan through this sequence of amino acids
-
74:34 - 74:39and count how often each of these subsequences occur. So for example,
-
74:39 - 74:40BAJT occurs twice and so I'll put
-
74:40 - 74:43a two there, and
-
74:43 - 74:45none of these sequences occur, so I'll put
-
74:45 - 74:46a zero there. I guess
-
74:46 - 74:50I have a one here
-
74:50 - 74:54and a zero there. This very long vector
-
74:54 - 74:57will be my feature representation for protein.
-
74:57 - 75:00This representation applies no matter how long my protein sequence is.
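Here is a sketch of that feature map in code (an illustration only: the use of contiguous length-4 substrings and a sparse dictionary of counts is an assumption about the representation being described, and the example sequence is made up):

```python
from collections import Counter

def four_mer_counts(seq):
    """Count how often each length-4 substring occurs in an amino acid sequence.
    Conceptually this is a vector with one coordinate per possible 4-letter string,
    but only the nonzero coordinates are ever stored."""
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))

phi = four_mer_counts("MKVLAAGMKVLA")   # made-up sequence
print(phi)                              # Counter({'MKVL': 2, 'KVLA': 2, 'VLAA': 1, ...})
```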
-
75:00 - 75:03How
-
75:03 - 75:05large is this? Well, it turns out
-
75:05 - 75:18this is going to be in R^(20^4), and so you have a 160,000-dimensional feature vector,
-
75:18 - 75:21which is reasonably large, even by modern computer standards.
-
75:21 - 75:24Clearly, we don't want to explicitly represent
-
75:24 - 75:26these high
-
75:26 - 75:30dimensional feature vectors. Imagine you have 1,000 examples and you store this as double
-
75:30 - 75:34precision floating-point numbers. Even on modern-day computers, this is big.
-
75:34 - 75:38It turns out that there's an
-
75:38 - 75:42efficient dynamic programming algorithm that can efficiently
-
75:42 - 75:45compute inner products between these feature vectors, and so we can apply this
-
75:45 - 75:52feature representation, even though it's a ridiculously high-dimensional feature vector, to classify protein sequences.
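For this contiguous-substring version of the feature map, the inner product can be computed directly from the two count dictionaries, without ever building the full vectors (a sketch under that assumption; the dynamic programming algorithm the lecture refers to is a more general approach, for instance for non-contiguous subsequences, and is not what is shown here):

```python
from collections import Counter

def four_mer_counts(seq):
    # Same helper as in the sketch above: counts of all contiguous length-4 substrings.
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))

def four_mer_kernel(seq_a, seq_b):
    """Inner product <phi(a), phi(b)> of the two count vectors, summing only over
    substrings that actually occur - the 160,000 coordinates are never materialized."""
    ca, cb = four_mer_counts(seq_a), four_mer_counts(seq_b)
    return sum(count * cb.get(kmer, 0) for kmer, count in ca.items())

print(four_mer_kernel("MKVLAAGMKVLA", "GMKVLAAG"))   # made-up sequences
```

The cost scales with the lengths of the two sequences, not with the size of the feature space, which is what makes this kernel practical.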
-
75:52 - 75:57I won't talk about the details of the dynamic programming algorithm. If any of you have seen
-
75:57 - 76:01the [inaudible] algorithm for finding subsequences, it's kind of
-
76:01 - 76:02reminiscent of that. You
-
76:02 - 76:05can look those up if you're interested.
-
76:05 - 76:08This is just another example of a cool kernel, and
-
76:08 - 76:13more generally, if you're faced with some new machine-learning problem, sometimes you can
-
76:13 - 76:16choose a standard kernel like a Gaussian kernel,
-
76:16 - 76:20and sometimes there are research papers written on how to
-
76:20 - 76:21come up with a new kernel
-
76:21 - 76:28for a new problem.
-
76:28 - 76:30Two last sentences I want
-
76:30 - 76:34to say. Where are we now? That wraps up SVMs, which many
-
76:34 - 76:38people would consider one of the most effective off-the-shelf learning algorithms,
-
76:38 - 76:42and so as of today, you've actually seen a lot of learning algorithms. I want to
-
76:42 - 76:46close this class by saying congrats. You're now well qualified to actually go and apply learning algorithms
-
76:46 - 76:48to a lot of problems.
-
76:48 - 76:53We're still in week four of the quarter, so there's more to come. In particular, what I
-
76:53 - 76:57want to do next is talk about how to really understand the learning algorithms and
-
76:57 - 76:59when they work well and when they work poorly
-
76:59 - 77:02and to take the tools which you now have and really talk about how you can use
-
77:02 - 77:04them really well. We'll start to do that in the next lecture. Thanks.