This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. If you haven't given me the homework yet, you can just give it to me at the end of class; that's fine. Also, just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.

What I want to do today is wrap up our discussion on support vector machines. In particular, we'll talk about the idea of kernels, then about the L1 norm soft margin SVM, and then I'll talk about the SMO algorithm, which is an algorithm for solving the optimization problem that we posed last time.

To recap, we wrote down the following convex optimization problem. All of this is assuming that the data is linearly separable, which is an assumption that I'll fix later. With this optimization problem, given a training set, this will find the optimal margin classifier for the data set, the classifier that maximizes the geometric margin on your training examples.

In the previous lecture, we also derived the dual of this problem, which was to maximize this. This is the dual of our primal optimization problem. Here, I'm using these angle brackets to denote the inner product, so this is just xi transpose xj for vectors xi and xj. We also worked out that w would be given by the sum over i of alpha i yi xi. Therefore, when you need to make a prediction at classification time, you need to compute the value of the hypothesis applied to an input x, which is g of w transpose x plus b, where g is the threshold function that outputs plus one or minus one. And so this is g of a sum over i of alpha i yi times an inner product, plus b.
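The formulas written on the board are not captured in the transcript; the following LaTeX sketch reconstructs them, assuming the standard optimal margin formulation from the previous lecture.

```latex
% Primal (assumes linearly separable data):
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,\ \ i=1,\dots,m.

% Dual:
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)}\rangle
\quad\text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0.

% Optimal w and the prediction made at classification time:
w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)},\qquad
h_{w,b}(x) = g\Big(\sum_{i=1}^{m}\alpha_i y^{(i)}\langle x^{(i)}, x\rangle + b\Big).
```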
So the prediction can also be written entirely in terms of inner products between input vectors x. What I want to do now is talk about the idea of kernels, which will make use of this property, because it turns out that the only dependence of the algorithm on x is through these inner products. In fact, you can write the entire algorithm without ever explicitly referring to an x vector, only to inner products between input feature vectors.

The idea of a kernel is as follows. Let's say that you have an input attribute; let's just say for now it's a real number. Maybe this is the living area of a house that you're trying to make a prediction on, like whether it will be sold in the next six months. Quite often, we'll take this feature x and map it to a richer set of features. For example, we might take x and map it to these four polynomial features, and let me call this mapping Phi. So we'll let Phi of x denote the mapping from your original features to some higher dimensional set of features.

If you do this and you want to use the features Phi of x, then all you need to do is go back to the learning algorithm and, everywhere you see the inner product of xi and xj, replace it with the inner product between Phi of xi and Phi of xj. This corresponds to running a support vector machine with the features given by Phi of x rather than with your original one-dimensional input feature x.

In the scenario I want to consider, sometimes Phi of x will be very high dimensional. For example, Phi of x may contain very high degree polynomial features. Sometimes Phi of x will actually even be an infinite dimensional vector of features, and the question is: if Phi of x is extremely high dimensional, then it seems you can't actually compute these inner products very efficiently, because the computer needs to represent an extremely high dimensional feature vector and then take inner products with it, which would be inefficient.
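A minimal sketch of the mapping and substitution just described, assuming the four polynomial features are the usual powers of x:

```latex
\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \\ x^4 \end{bmatrix},
\qquad\text{and in the dual, replace }\ \langle x^{(i)}, x^{(j)}\rangle\ \text{ with }\ \big\langle \phi(x^{(i)}),\, \phi(x^{(j)})\big\rangle .
```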
It turns out that in many important special cases, we can write down what we'll call the kernel function, denoted by K, which will be the inner product between those feature vectors. There will be important special cases where computing Phi of x is computationally very expensive, maybe even impossible; it might be an infinite dimensional vector, and you can't represent infinite dimensional vectors. But nonetheless, you can compute the kernel between xi and xj, the inner product between these two feature vectors, very inexpensively. And so the idea of the support vector machine with kernels is that everywhere in the algorithm you see these inner products, we're going to replace them with a kernel function that you can compute efficiently, and that lets you work in feature spaces Phi of x even if Phi of x is very high dimensional.

Let me now say how that's done. A little bit later today, we'll actually see some concrete examples of Phi of x and of kernels. For now, let's just think about constructing kernels explicitly; this is best illustrated with an example.

Let's say you have two inputs, x and z. Normally I should write those as xi and xj, but I'm just going to write x and z to save on writing. Let's say my kernel is K of x, z equals x transpose z, all squared. This thing here is x transpose z, and this thing is x transpose z, so this is x transpose z squared, and that's equal to a sum over pairs of coordinates. And so this kernel corresponds to the feature mapping where Phi of x is equal to the vector of all pairwise products of the coordinates of x, which I'll write down for the case of n equals three. With this definition of Phi of x, you can verify for yourself that this kernel becomes the inner product between Phi of x and Phi of z, because to get the inner product between two vectors, you just take a sum of the corresponding elements of the vectors.
You multiply them together. So if this is Phi of x, then the inner product between Phi of x and Phi of z will be the sum over all the elements of this vector times the corresponding elements of Phi of z, and what you get is exactly that expression.

And so the cool thing about this is the cost of computing these quantities. If n is the dimension of x and z, then Phi of x is a vector of all pairs xi xj multiplied with each other, and so the length of Phi of x is n squared; you need order n squared time just to compute Phi of x. But to compute K, the kernel function, all you need is order n time, because the kernel function is defined as x transpose z squared, so you just take the inner product between x and z, which takes order n time, and you square that, and you've computed the kernel function. So you've just computed the inner product between two vectors that each have n squared elements, but you did it in order n time.

Student: For any kernel you find for x and z, does Phi exist for x and z?

Instructor (Andrew Ng): Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.

So let me just describe a couple of quick generalizations of this. One is that if you define K of x, z to be equal to x transpose z plus c, squared, then again you can compute this kernel in order n time, and that turns out to correspond to a feature vector where I'm just going to add a few more elements at the bottom: root 2 c x1, root 2 c x2, root 2 c x3, and c.
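A small Python check of this claim (illustrative, not from the lecture): the O(n) kernels (x^T z)^2 and (x^T z + c)^2 agree with the inner products of the corresponding explicit O(n^2)-length feature vectors.

```python
import numpy as np

def phi_sq(x):
    """Explicit feature map for (x'z)^2: all pairwise products x_i * x_j (length n^2)."""
    return np.outer(x, x).ravel()

def phi_sq_c(x, c):
    """Explicit feature map for (x'z + c)^2: pairwise products, sqrt(2c)*x_i terms, and c."""
    return np.concatenate([phi_sq(x), np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
x, z, c = rng.standard_normal(3), rng.standard_normal(3), 2.0

# O(n) kernel evaluations...
k1 = np.dot(x, z) ** 2
k2 = (np.dot(x, z) + c) ** 2
# ...match the O(n^2) explicit inner products.
assert np.isclose(k1, np.dot(phi_sq(x), phi_sq(z)))
assert np.isclose(k2, np.dot(phi_sq_c(x, c), phi_sq_c(z, c)))
print(k1, k2)
```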
And so this is a way of creating a feature vector with both the monomials, meaning the first order terms, as well as the quadratic terms, the products between xi and xj, and the parameter c here allows you to control the relative weighting between the monomial terms, the first order terms, and the quadratic terms. Again, this is still an inner product between vectors of length n squared, computed in order n time.

More generally, here are some other examples of kernels. Actually, a generalization of the one I just derived would be the kernel K of x, z equals x transpose z plus c, raised to the power of d. This corresponds to using all n plus d choose d features, all the monomials. Monomials just mean products like xi xj xk, all the polynomial terms up to degree d. That's on the order of n plus d to the power of d features, so this grows exponentially in d. This is a very high dimensional feature vector, but again, you never construct the feature vector explicitly; you implicitly take inner products between such vectors. It's very computationally efficient, because you just compute the inner product between x and z, add c, and take that real number to the power of d, and by plugging this in as a kernel, you're implicitly working in an extremely high dimensional feature space.

So what I've given is just a few specific examples of how to create kernels. Let me ask more generally: if you're faced with a new machine learning problem, how do you come up with a kernel? There are many ways to think about it, but here's one intuition that's sort of useful. Given a set of attributes x, you're going to use a feature vector Phi of x, and given a set of attributes z, you're going to use a feature vector Phi of z, and so the kernel is computing the inner product between Phi of x and Phi of z. And so one intuition, and this is only a partial intuition, not a rigorous one, is the following.
If x and z are very similar, then Phi of x and Phi of z will be pointing in roughly the same direction, and therefore the inner product will be large. Whereas in contrast, if x and z are very dissimilar, then Phi of x and Phi of z may be pointing in different directions, and so the inner product may be small. That intuition is not a rigorous one, but it's a useful one to think about.

If you're faced with a new learning problem, if I give you some random thing to classify and you want to decide how to come up with a kernel, one way is to try to come up with a function K of x, z that is large if you want the learning algorithm to think of x and z as similar, and small if you want it to think of them as dissimilar. Again, this isn't always true, but this is one of several intuitions. So if you're trying to classify some brand new thing, say [inaudible] or DNA sequences or something, some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think these are similar things and small when they're dissimilar.

And so this raises a question. Let's say I have something I want to classify, and let's say I write down a function that I think is a good measure of how similar or dissimilar x and z are for my specific problem. Let's say I write down K of x, z equals e to the minus this quantity here, and I think this is a good measure of how similar x and z are. The question, then, is whether this is really a valid kernel. In other words, to understand how we can construct kernels: if I write down a function like that, the question is whether there really exists some Phi such that K of x, z is equal to the inner product of Phi of x and Phi of z. That's the question of whether K is a valid kernel. It turns out that there is a result that characterizes necessary and sufficient conditions for when a function K that you might choose is a valid kernel.
Let me go ahead and show part of that result now. Suppose K is a valid kernel, and when I say K is a kernel, what I mean is that there does indeed exist some function Phi for which this holds true. Then let any set of points x1 up to xm be given, and let me define a matrix K. I apologize for overloading notation: I'm going to use K to denote both the kernel function, which is the function of x and z, as well as a matrix; unfortunately, there aren't enough letters in the alphabet. We define the kernel matrix to be the m-by-m matrix such that K subscript ij is equal to the kernel function applied to two of my examples, xi and xj.

Then it turns out that for any vector z that's m-dimensional, I want you to consider z transpose K z. By the definition of matrix multiplication, this is a double sum over i and j of zi Kij zj. Since Kij is the kernel function applied to xi and xj, and since I assumed that K is a valid kernel function, there must exist such a Phi, so Kij must equal the inner product between Phi of xi and Phi of xj. This is the inner product between two feature vectors, so let me just make that inner product explicit: I'm going to sum over the elements of this vector, and I'm going to use Phi of xi subscript k just to denote the k-th element of this vector. Then you just rearrange sums; you get a sum over k, and this next step may look familiar to some of you: it's a sum of squares, and it must therefore be greater than or equal to zero. Do you want to take a minute to look over all the steps and just make sure you buy them all?

Oh, this step is the inner product between the vectors Phi of xi and Phi of xj, and the inner product between two vectors is the sum over all the elements of the vectors of the corresponding elements multiplied together.

Student: [Inaudible].

Instructor (Andrew Ng): Oh, yes it is. This is just a transpose b equals the sum over k of ak bk, so that's just this.
This is the sum over k of the k-th elements of those vectors. Take a look at this and make sure it makes sense. Questions about this?

So just to summarize, what we showed was that for any vector z, z transpose K z is greater than or equal to zero, and this is one of the standard definitions of the matrix K being positive semidefinite, which we write as K greater than or equal to zero. So what was shown is that if K is a valid kernel, in other words if K is a function for which there exists some Phi such that K of xi, xj is the inner product between Phi of xi and Phi of xj, then the kernel matrix must be positive semidefinite.

It turns out that the converse is also true, and so this gives you a test for whether a function K is a valid kernel. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels; it means the same thing, it just means it's a valid kernel. Let K of x, z be given. Then K is a valid kernel, in other words a Mercer kernel, i.e. there exists a Phi such that K of x, z equals Phi of x transpose Phi of z, if and only if for any set of m examples, and this really means any set of m points, not necessarily a training set, just any set of m points you may choose, it holds true that the kernel matrix, the capital K that I defined just now, is symmetric positive semidefinite. I proved only one direction of this result: I proved that if K is a valid kernel, then the matrix K is symmetric positive semidefinite, but the converse I didn't show. It turns out that this is a necessary and sufficient condition.
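A small Python sketch (illustrative, not from the lecture) of using Mercer's theorem as a sanity check: build the kernel matrix for a few arbitrary points and verify it is symmetric positive semidefinite. The Gaussian kernel and the bandwidth value here are assumptions made for the demo.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); a valid (Mercer) kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_matrix(K, points):
    """Kernel matrix with entries K_ij = K(x_i, x_j)."""
    m = len(points)
    return np.array([[K(points[i], points[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
points = rng.standard_normal((5, 3))      # any set of m points, not necessarily a training set
Kmat = kernel_matrix(gaussian_kernel, points)

# Mercer's condition: symmetric and positive semidefinite (all eigenvalues >= 0, up to rounding).
print(np.allclose(Kmat, Kmat.T), np.linalg.eigvalsh(Kmat).min() >= -1e-10)
```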
So this gives you a useful test for whether any function that you might want to choose is a valid kernel. A concrete example of something that's clearly not a valid kernel would be a function where you can find an input x such that K of x, x is minus one, for example. That's clearly not a valid kernel, because minus one cannot possibly be equal to Phi of x transpose Phi of x, and so this would be one of many examples of functions that fail to meet the conditions of this theorem, because the inner product of a vector with itself is always nonnegative.

So just to tie this back explicitly to an SVM: to use a support vector machine with a kernel, what you do is choose some function K of x, z. It turns out the function I wrote down just now is indeed a valid kernel; it's called the Gaussian kernel because of its similarity to Gaussians. So you choose some kernel function like this, or you may choose x transpose z plus c to the power of d. To apply a support vector machine with a kernel, you choose one of these functions, and the choice depends on your problem; it depends on what is a good measure of whether two examples are similar or different for your problem. Then you go back to our formulation of the support vector machine, and you have to use the dual formulation, and everywhere you see these inner products, you replace them with K of xi, xj. You then run exactly the same support vector machine algorithm, only with the inner products replaced, and what you've just done is you've taken a support vector machine and implicitly replaced each of your feature vectors x with a very high dimensional feature vector. It turns out that the Gaussian kernel corresponds to a feature vector that's infinite dimensional.
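As a minimal sketch (illustrative only), here is the kernelized prediction being described: the Gaussian kernel, and a decision rule that only ever evaluates K(x_i, x), never the implicit infinite dimensional feature vector. The alphas and b are assumed to come from solving the dual.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def predict(x, X_train, y_train, alphas, b, kernel=gaussian_kernel):
    """g(sum_i alpha_i y_i K(x_i, x) + b), with g the +/-1 threshold function.

    X_train: (m, n) training inputs; y_train: labels in {-1, +1};
    alphas, b: assumed to be the solution of the (kernelized) dual problem.
    """
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, y_train, X_train)) + b
    return 1 if s >= 0 else -1
```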
Nonetheless, you can run a support vector machine in a finite amount of time, even though you're working with infinite dimensional feature vectors, because all you ever need to do is compute these kernel values; you don't ever need to represent the infinite dimensional feature vectors explicitly.

Why is this a good idea? I think I started off talking about support vector machines by saying that we wanted to develop non-linear learning algorithms. So here's one useful picture to keep in mind. Let's say you have one-dimensional input data; you just have one feature x in R. What a kernel does is the following: it takes your original input data and maps it to a very high dimensional feature space. In the case of Gaussian kernels, it's an infinite dimensional feature space, but for pedagogical reasons I'll draw two dimensions here. So it takes all your data in R1 and maps it to R infinity, and then you run a support vector machine in this infinite dimensional, or exponentially high dimensional, space, and you find the optimal margin classifier, in other words the classifier that separates your data in this very high dimensional space with the largest possible geometric margin. In this example, whereas your data was not linearly separable in your original one-dimensional space, when you map it to this much higher dimensional space, it becomes linearly separable, so you can use your linear classifier to separate data that was not separable in your original space. This is how support vector machines output nonlinear decision boundaries, and in the entire process, all you ever need to do is solve convex optimization problems.

Questions about any of this?

Student: [Inaudible] sigma?

Instructor (Andrew Ng): Yeah, so sigma is, let's see. Well, I was going to talk about that later.
One way to choose sigma is to set aside a small amount of your data: train an SVM using, say, two thirds of your data, try different values of sigma, and then see what works best on a separate hold-out cross validation set, a separate set that you test on. Some of the learning algorithms we talked about, like locally weighted linear regression with its bandwidth parameter, also have parameters like this, so there are a number of parameters in these algorithms whose values you can choose by setting aside some data to test on. I'll talk more about model selection later, more explicitly. Are there other questions?

Student: So how do you know that moving it up to a high dimensional space is going to give you that kind of separation?

Instructor (Andrew Ng): Good question. Usually, you don't know. Sometimes you can know, but in most cases you won't know in advance whether the data is actually going to be linearly separable, so the next topic will be soft margin SVMs, which is a version of SVMs that works even though the data is not linearly separable.

Student: If it can't be linearly separated by mapping to a higher dimension, couldn't you also just use an even higher dimension?

Instructor (Andrew Ng): Very right. This is a question about what to do if you can't separate it even in the higher dimensional space. Let me try to address that with the discussion of soft margin SVMs. Okay.

Student: What if you run an SVM algorithm that assumes the data are linearly separable on data that is not actually linearly separable?

Instructor (Andrew Ng): You guys are really giving me a hard time about whether the data's linearly separable. It turns out this algorithm won't work if the data is not linearly separable, but I'll change that in a second and make it work.
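A hedged sketch of the hold-out procedure just described, using scikit-learn's RBF-kernel SVC as the SVM implementation (an assumption; any SVM implementation would do). scikit-learn parameterizes the Gaussian kernel by gamma = 1/(2 sigma^2).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def choose_sigma(X, y, sigmas, C=1.0, seed=0):
    """Train on about 2/3 of the data for each candidate sigma and keep the one
    that scores best on the held-out 1/3."""
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=1/3, random_state=seed)
    scores = {}
    for sigma in sigmas:
        clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C)
        clf.fit(X_tr, y_tr)
        scores[sigma] = clf.score(X_ho, y_ho)   # accuracy on the hold-out set
    return max(scores, key=scores.get), scores

# Example usage on synthetic data with a nonlinear boundary:
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)
best_sigma, scores = choose_sigma(X, y, sigmas=[0.1, 0.3, 1.0, 3.0])
print(best_sigma, scores)
```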
Before I move on to talk about that, let me just say one final word about kernels. I talked about kernels in the context of support vector machines, and the idea of kernels is what really made support vector machines a very powerful learning algorithm. Towards the end of today's lecture, if I have time, I'll actually give a couple more examples of how to choose kernels as well.

It turns out that the idea of kernels is actually more general than support vector machines. In particular, we took this SVM algorithm and derived the dual, and that was what let us write the entire algorithm in terms of inner products. It turns out that you can take many of the other algorithms that you've seen in this class, in fact most of the linear algorithms we talked about, such as linear regression, logistic regression [inaudible], and rewrite them entirely in terms of these inner products. So if you have any algorithm that you can rewrite in terms of inner products, that means you can replace them with K of xi, xj, and that means you can take any of these algorithms, implicitly map the feature vectors into very high dimensional feature spaces, and have the algorithm still work. The idea of kernels is perhaps most widely used with support vector machines, but it is actually more general than that: you can take many of the other algorithms that you've seen, and many of the algorithms that we'll see later this quarter as well, write them in terms of inner products, and thereby kernelize them and apply them to infinite dimensional feature spaces. You'll actually get to play with many of these ideas more in the next problem set.

Let's talk about non-linear decision boundaries, and this is the idea of what's called the L1 norm soft margin SVM. Machine learning people sometimes aren't great at coming up with good names, but here's the idea. Let's say I have a data set.
This is a linearly separable data set, but what do I do if I have a couple of other examples there that make the data non-linearly separable? And in fact, sometimes even if the data is linearly separable, you might not want to separate it perfectly. For example, this is a very nice data set; it looks like there's a great decision boundary that separates the two classes. Well, what if I had just one outlier down here? I could still linearly separate this data set with something like that, but then I'm letting one slightly suspicious example skew my entire decision boundary by a lot. So what I'm going to talk about now is the L1 norm soft margin SVM, which is a slightly modified formulation of the SVM optimization problem that will let us deal with both of these cases: one, where the data is just not linearly separable, and two, where you have some examples that you'd rather not fit perfectly in the training set. Maybe with an outlier like this, you actually prefer to choose the original decision boundary and not try so hard to get that one training example right.

Here's the formulation. Our SVM primal problem was to minimize one half the norm of w squared. That's our original problem, and I'm going to modify it by adding the following: I'm going to add these penalty terms with slack variables ξi, and I'm going to demand that each of my training examples is separated with functional margin greater than or equal to one minus ξi. You remember, was it two lectures ago, that I said that if the functional margin is greater than zero, that implies you classified the example correctly, and if it's less than zero, then you misclassified it. By setting some of the ξi to be larger than one, I can actually have some examples with negative functional margin, and therefore I'm allowing my algorithm to misclassify some of the examples in the training set.
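The modified primal being written on the board is not captured in the transcript; here is a reconstruction, assuming the standard L1 norm soft margin formulation:

```latex
\min_{w,b,\xi}\ \ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1 - \xi_i,\qquad
\xi_i \ge 0,\quad i=1,\dots,m.
```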
However, I'll encourage the algorithm not to do that by adding to the optimization objective this penalty term that penalizes setting the ξi to be large. This is an optimization problem where the parameters are w, b, and all of the ξi, and it is also a convex optimization problem. It turns out that, similar to how we worked out the dual of the support vector machine, we can also work out the dual for this optimization problem. I won't actually do it, but just to show you the steps: you construct the Lagrangian, and I'm going to use alpha and r to denote the Lagrange multipliers corresponding to the set of constraints that we had previously and to this new set of constraints that the ξi are greater than or equal to zero. The Lagrangian will be the optimization objective, minus a sum of alpha i times each of the functional margin constraints, minus a sum of r i times ξi, each of those constraints being greater than or equal to zero.

I won't re-derive the entire dual again; it's really the same. When you derive the dual of this optimization problem and simplify, you find that you get the following: you have to maximize an objective that is actually the same as before. It turns out that when you derive the dual and simplify, the only way the dual changes compared to the previous one is that rather than the constraint that the alpha i are greater than or equal to zero, we now have the constraint that the alphas are between zero and C. This derivation isn't very hard, and you're encouraged to go home and try to do it yourself. It's really essentially the same math, and when you simplify, it turns out you can simplify the r multipliers away, and you end up with just these constraints on the alphas.
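Again the board formulas are not in the transcript; a reconstruction of the resulting dual, assuming the standard L1 soft margin form:

```latex
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)}\rangle
\quad\text{s.t.}\quad
0 \le \alpha_i \le C,\qquad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0.
```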
Just as an aside, here is something else I won't derive. Remember that I wrote down the KKT conditions in the last lecture, the necessary conditions for something to be an optimal solution to a constrained optimization problem. If you use the KKT conditions, it turns out you can actually derive convergence conditions. We want to solve this optimization problem, so when do we know the alphas have converged to the global optimum? It turns out you can use the following conditions. I don't want to say a lot about these; it turns out that from the KKT conditions you can derive these as the convergence conditions for an algorithm that you might choose to use to try to solve the optimization problem in terms of the alphas.

So that's the L1 norm soft margin SVM, and this is the change to the algorithm that lets us handle non-linearly separable data sets, as well as single outliers in data that may still be linearly separable but that you may choose not to separate perfectly. Questions about this? Raise your hand if this stuff makes sense at all. Great.

So the last thing I want to do is talk about an algorithm for actually solving this optimization problem. We wrote down this dual optimization problem with convergence criteria, so let's come up with an efficient algorithm to actually solve it. I want to do this partly to give me an excuse to talk about an algorithm called coordinate ascent, which is a useful algorithm to know about. It will turn out that it won't apply in its simplest form to this problem, but we'll then be able to modify it slightly, and that will give us a very efficient algorithm for solving this dual optimization problem. That was the other reason that I had to derive the dual: not only so that we could use kernels, but also so that we can apply an algorithm like the SMO algorithm. First, let's talk about coordinate ascent, which is another [inaudible] optimization algorithm.
To describe coordinate ascent, I just want you to consider the problem of maximizing some function W, which is a function of alpha one through alpha m, with no constraints. So for now, forget about the constraint that the alpha i must be between zero and C, and forget about the constraint that the sum of yi alpha i must be equal to zero. Then this is the coordinate ascent algorithm: repeat until convergence, and in the inner loop, for i equals one to m, hold all the parameters except alpha i fixed and maximize the function with respect to just that one parameter. Let me write that as: alpha i gets updated as the arg max over alpha i hat of W of alpha one, up through alpha i minus one, then alpha i hat, then alpha i plus one, and so on up to alpha m. This is really just a fancy way of saying: hold everything except alpha i fixed, and optimize the objective W with respect to only alpha i. This is coordinate ascent.

One picture that's kind of useful for coordinate ascent is to imagine you're trying to optimize a quadratic function; it really looks like this. These are the contours of the quadratic function, and the optimum is here. This is what coordinate ascent would look like. I'll call this axis alpha one and this axis alpha two. Let's say I start down here. Then I begin by optimizing with respect to alpha one, so I go there. Then, at my new point, I optimize with respect to alpha two, and I might go to someplace like that. Then I optimize with respect to alpha one again, then alpha two again, and so on. You're always taking these axis-aligned steps to get to the optimum. There are variations of this as well. The way I've described the algorithm, we're always doing this in alternating order: we optimize with respect to alpha one, then alpha two, then alpha one, then alpha two.
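A minimal Python sketch (illustrative, not from the lecture) of coordinate ascent taking exactly these axis-aligned steps on an unconstrained quadratic objective. The particular quadratic and the closed-form inner update are assumptions made for the demo.

```python
import numpy as np

# Objective to maximize: W(alpha) = -1/2 alpha' A alpha + b' alpha  (A positive definite),
# whose unique maximizer is alpha* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def coordinate_ascent(A, b, n_iters=50):
    alpha = np.zeros(len(b))
    for _ in range(n_iters):                 # repeat until (approximate) convergence
        for i in range(len(alpha)):          # inner loop: i = 1, ..., m in order
            # Hold all alpha_j (j != i) fixed; maximize W over alpha_i alone.
            # Setting dW/dalpha_i = 0 gives this closed-form axis-aligned step:
            alpha[i] = (b[i] - A[i, :] @ alpha + A[i, i] * alpha[i]) / A[i, i]
    return alpha

print(coordinate_ascent(A, b), np.linalg.solve(A, b))   # the two should agree closely
```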
What I'm about to say applies only in higher dimensions, but it turns out that if you have a lot of parameters, alpha one through alpha m, you may not choose to always visit them in a fixed order. You may choose which alpha to update next depending on which you think will allow you to make the most progress. If you have only two parameters, it makes sense to alternate between them; if you have higher dimensional parameters, sometimes you may choose to update them in a different order if you think doing so would let you make faster progress towards the maximum.

It turns out that, compared to some of the algorithms we saw previously, say Newton's method, coordinate ascent will usually take a lot more steps. But the chief advantage of coordinate ascent, when it works well, is that sometimes the optimization objective W is very inexpensive to optimize with respect to any one of your parameters, so even though coordinate ascent takes many more iterations than, say, Newton's method in order to converge, each iteration is very cheap. There are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter, and if that's true, then the inner loop of coordinate ascent, optimizing with respect to alpha i, can be done very quickly. It turns out that this will be true when we modify this algorithm to solve the SVM optimization problem. Questions about this?

Okay. Let's go ahead and apply this to our support vector machine dual optimization problem. It turns out that coordinate ascent in its basic form does not work here, for the following reason: we have constraints on the alpha i. Namely, recall from what we worked out previously that we have the constraint that the sum over i of yi alpha i must be equal to zero, and so if you fix all the alphas except for one, then you can't change that one alpha without violating the constraint.
If I just try to change Alpha one, then Alpha one is actually exactly determined as a function of the other Alphas, because the whole thing has to sum to zero.

The SMO algorithm, by the way, is due to John Platt, a colleague at Microsoft. The SMO algorithm, therefore, instead of trying to change one Alpha at a time, will try to change two Alphas at a time. This is called the SMO algorithm, for sequential minimal optimization, and the term minimal refers to the fact that we're choosing the smallest number of Alpha I's to change at a time, which in this case means we need to change at least two at a time.

So let me go ahead and outline the algorithm. Repeat until convergence: select two Alphas to change, some Alpha I and Alpha J, chosen by some heuristic — which just means a rule of thumb. Then hold all the Alpha K's fixed except Alpha I and Alpha J, and re-optimize W of Alpha with respect to Alpha I and Alpha J, subject to all the constraints.

It turns out the key step, which I'm going to work out, is this last one: how do you optimize W of Alpha with respect to the two parameters that you just chose to update, subject to the constraints? I'll do that in a second. You would keep running this algorithm until you have satisfied the convergence criteria up to Epsilon.

What I want to do now is describe how to do this step — how to optimize W of Alpha with respect to Alpha I and Alpha J — and it turns out that it's because you can do this extremely efficiently that the SMO algorithm works well. The inner loop of the SMO algorithm can be done extremely efficiently, so it may take a large number of iterations to converge, but each iteration is very cheap. Let's talk about that.
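Before working out that two-variable step, here is the outer loop just outlined, written as a rough Python skeleton; select_pair and optimize_pair are placeholders for the pair-selection heuristic and for the two-variable update derived next, so this sketch shows only the control flow, not a working solver.

```python
import itertools

def select_pair(alphas):
    # Placeholder rule of thumb: just cycle over all pairs (i, j).
    # Platt's paper uses a much smarter heuristic for choosing the pair.
    return itertools.combinations(range(len(alphas)), 2)

def optimize_pair(i, j, alphas):
    # Placeholder for the key step: re-optimize W(alpha) over
    # (alpha_i, alpha_j) subject to the constraints, and report whether
    # either value changed by more than some tolerance.
    return False

def smo_outer_loop(alphas, max_quiet_passes=3):
    # Keep sweeping until several full passes in a row produce no change,
    # i.e. the convergence criterion is met up to the tolerance.
    quiet_passes = 0
    while quiet_passes < max_quiet_passes:
        changed = sum(optimize_pair(i, j, alphas) for i, j in select_pair(alphas))
        quiet_passes = quiet_passes + 1 if changed == 0 else 0
    return alphas
```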
So in order to derive that update step, I'm going to use Alpha one and Alpha two as my example. I'm going to update Alpha one and Alpha two. In general, this could be any Alpha I and Alpha J, but just to make my notation on the board easier, I'm going to do the derivation for Alpha one and Alpha two, and the general case is completely analogous.

On every step of the algorithm, we have to respect the constraint that the sum over I of Alpha I YI is equal to zero — this is one of the constraints we had previously for our dual optimization problem. This means that Alpha one Y1 plus Alpha two Y2 must be equal to minus the sum of the remaining terms, which I'm going to denote by Zeta. We also have the constraint that the Alpha I's must be between zero and C. We had two constraints on our dual: this was one of them, and that was the other.

In pictures, the constraint that the Alpha I's are between zero and C is often called the box constraint, and so if I draw Alpha one and Alpha two, then I have the constraint that the values of Alpha one and Alpha two must lie within this box that ranges from zero to C.

And so the picture of the algorithm is as follows. I have the constraint that Alpha one Y1 plus Alpha two Y2 must equal Zeta, and so this implies that Alpha one must be equal to Zeta minus Alpha two Y2, over Y1, and what I want to do is optimize the objective with respect to this.
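Written out, the constraints just described for the pair (Alpha one, Alpha two) are, using y_i for the labels as before:

```latex
0 \le \alpha_i \le C \quad\text{(the box constraint)}, \qquad
\sum_{i=1}^{m} \alpha_i y_i = 0,
\qquad
\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{m} \alpha_i y_i = \zeta
\;\Longrightarrow\;
\alpha_1 = \frac{\zeta - \alpha_2 y_2}{y_1}.
```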
What I can do is plug in my definition of Alpha one as a function of Alpha two, and so this becomes W of Zeta minus Alpha two Y2 over Y1, Alpha two, Alpha three, and so on. And it turns out that because W is a quadratic function — if you look back to our earlier definition of W, you'll find it's a quadratic function in all the Alphas — if you look at this expression for W and view it as just a function of Alpha two, you find that it's a one-dimensional quadratic function of Alpha two, as long as you hold Alpha three, Alpha four and so on fixed. So this can be simplified to some expression of the form A Alpha two squared plus B Alpha two plus C. This is a standard quadratic function, and it's really easy to optimize. We know how to optimize quadratic functions like these — when did we learn this? High school, or undergrad, or something. You just do that, and that gives you the optimal value for Alpha two.

The last step, with a box constraint like this — just in pictures, you know your solution must lie on this line, and so there'll be some quadratic function over this line, say. If you optimize the quadratic function, maybe you get a value that lies inside the box, and if so, then you're done. Or, if your quadratic function looks like this, maybe when you optimize it you end up with a value outside the box, so you end up with a solution like that. If that happens, you clip your solution to map it back inside the box. That'll give you the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on this straight line — in other words, subject to the solution lying on the line segment within the box.
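In code, that "optimize a one-dimensional quadratic, then clip back into the box" step might look like the following sketch; the coefficients and interval endpoints here are hypothetical stand-ins for whatever the actual dual objective and box constraint give you.

```python
def maximize_quadratic_on_segment(a, b, c, lo, hi):
    """Maximize a*x^2 + b*x + c over the line segment [lo, hi]."""
    candidates = [lo, hi]
    if a != 0.0:
        vertex = -b / (2.0 * a)
        if lo <= vertex <= hi:
            # The unconstrained optimum already lies inside the box.
            candidates.append(vertex)
    # Otherwise the best value is at an endpoint: this is the "clipping"
    # that maps the solution back inside the box.
    return max(candidates, key=lambda x: a * x * x + b * x + c)

# Concave quadratic (a < 0), as the dual objective is when viewed as a
# function of alpha_2 alone:
print(maximize_quadratic_on_segment(-2.0, 3.0, 1.0, 0.0, 1.0))   # vertex 0.75 lies inside
print(maximize_quadratic_on_segment(-2.0, 3.0, 1.0, 0.0, 0.5))   # clipped to the endpoint 0.5
```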
Having solved for Alpha two this way, you can clip it if necessary to get it back within the box constraint, and then we have Alpha one as a function of Alpha two. This allows you to optimize W with respect to Alpha one and Alpha two quickly, subject to all the constraints, and the key step is really just a one-dimensional quadratic optimization, which you can do very quickly. That's what makes the inner loop of the SMO algorithm very efficient.

Student: You mentioned that in coordinate ascent we can change just one variable at a time, but in the SMO algorithm we change two at a time. How should I understand that?

Instructor (Andrew Ng): Right. As I run the optimization algorithm, I have to respect the constraint that the sum over I of Alpha I YI must be equal to zero, so this is a linear constraint that I didn't have when I was talking about coordinate ascent. Suppose I tried to change just Alpha one. Then I know that Alpha one must be equal to minus the sum from I equals two to M of Alpha I YI, divided by Y1, right? And so Alpha one can actually be written exactly as a function of Alpha two, Alpha three and so on through Alpha M. So if I hold Alpha two, Alpha three, Alpha four through Alpha M fixed, then I can't change Alpha one, because Alpha one is completely determined by the others. Whereas in contrast, if I choose to change Alpha one and Alpha two at the same time, then I still have the constraint, and so I know Alpha one and Alpha two must satisfy that linear constraint, but at least this way I can change Alpha one if I also change Alpha two accordingly to make sure the constraint stays satisfied.

Student: [Inaudible.]

Instructor (Andrew Ng): So how was Zeta defined? On each iteration, I have some setting of the parameters, Alpha one, Alpha two, Alpha three and so on, and I want to change Alpha one and Alpha two, say.
So from the previous iteration, let's say I had not violated the constraint, so it holds true, and so I'm just defining Zeta to be equal to this: Alpha one Y1 plus Alpha two Y2 must be equal to minus the sum from I equals three to M of Alpha I YI, and so I'm just defining that quantity to be Zeta.

Student: [Inaudible.]

Instructor (Andrew Ng): On every iteration, you change maybe a different pair of Alphas to update. Exactly how you choose which pair is something I don't want to go into in detail — I'll say a couple more words about that — but the basic outline of the algorithm is: on every iteration, you're going to select some Alpha I and Alpha J to update, like on this board, via some heuristic, and then you use the procedure I just described to actually update Alpha I and Alpha J. What I just talked about was the procedure to optimize W with respect to Alpha I and Alpha J; I didn't talk about the heuristic for choosing Alpha I and Alpha J.

Student: What is the function W?

Instructor (Andrew Ng): W of Alpha is way up there; I'll just write it again. W of Alpha is that function we had previously. All of this is about solving the optimization problem for the SVM, so W of Alpha is the objective function of the dual problem we had.
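For reference, that objective — the SVM dual from the previous lecture, with the angle brackets again denoting inner products — is:

```latex
\max_{\alpha}\; W(\alpha) \;=\; \sum_{i=1}^{m} \alpha_i
\;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
y_i\, y_j\, \alpha_i\, \alpha_j\, \langle x_i, x_j \rangle
\quad\text{s.t.}\quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0 .
```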
Student: When you change one of the Alphas and optimize it, the other one that you have to change along with it could get worse, right?

Instructor (Andrew Ng): What do you mean by worse?

Student: It will get farther from its optimal value.

Instructor (Andrew Ng): Let me put it differently. What we're trying to do is optimize the objective function W of Alpha, so the metric of progress we care about is whether W of Alpha is getting better on every iteration, and what is true for coordinate ascent and for SMO is that on every iteration, W of Alpha can only increase. It may stay the same or it may increase; it can't get worse. It's true that eventually the Alphas will converge to some value, and it's true that in intervening iterations one of the Alphas may move further away from and then closer to its final value, but what we really care about is that W of Alpha is getting better every time, which is true.

Just a couple more words on SMO before I wrap up this topic. One is that John Platt's original paper describes a heuristic for choosing which pairs, Alpha I and Alpha J, to update next. It's one of those things that's not conceptually complicated, but it's very complicated to explain in words, so I won't talk about it here. If you want to learn about it, go ahead and look up John Platt's paper on the SMO algorithm; the paper is pretty easy to read. And later on, we'll also post a handout on the course homepage with a simplified version of this algorithm that you can use in the problem sets, so you can see some of this worked out in more detail there.

One other thing that I didn't talk about was how to compute the parameter B. Solving the dual gives you all the Alphas, and the Alphas are what allow us to get W, but I didn't talk about how to compute B. It turns out that, again, it's actually not very difficult, and I'll let you read about that yourself in the notes that we'll post along with the next problem set.

To wrap up today's lecture, what I want to do is briefly tell you about a couple of examples of applications of SVMs.

Let's consider the problem of handwritten digit recognition. In handwritten digit recognition, you're given a pixel array with a scanned image of, say, a zip code. This is an array of pixels, and some of these pixels will be on and other pixels will be off, and some combination of pixels being on may represent the character one. The question is: given an input feature vector like this — if you have, say, ten pixels by ten pixels, then you have a hundred-dimensional feature vector — can you classify the digit?
Maybe these 100 features are binary, with each XI in {0, 1}, or maybe the X's are grayscale values corresponding to how dark each of these pixels is.

It turns out that for many years, a neural network was the champion algorithm for handwritten digit recognition. And it turns out that you can apply an SVM with the following kernel: either the polynomial kernel or the Gaussian kernel works fine for this problem, and just by writing down this kernel and throwing an SVM at the problem, the SVM gave performance comparable to the very best neural networks.

This is surprising because the support vector machine doesn't take into account any knowledge about the layout of the pixels. In particular, it doesn't know that this pixel is next to that pixel, because it's just representing the pixel intensity values as a vector, and so the performance of the SVM would be the same even if you were to shuffle all the pixels around. And yet its performance was comparable to the very best neural networks, which had been under very careful development for many years.
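As a rough illustration of "write down a kernel and throw an SVM at it," here is a minimal sketch using scikit-learn — an assumption for illustration, not something used in the lecture — with a Gaussian (RBF) kernel on flattened pixel vectors. The dataset is scikit-learn's small 8x8 digits set, and the hyperparameter values are placeholder guesses rather than tuned choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 grayscale digit images, flattened into 64-dimensional pixel vectors.
# The classifier only ever sees this flat vector, so consistently shuffling
# the pixel order would not change its accuracy.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2).
# A polynomial kernel (kernel="poly") would also be a reasonable choice.
clf = SVC(kernel="rbf", gamma=0.001, C=10.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```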
I want to tell you about one other cool example, which is that SVMs are also used to classify other, fairly esoteric objects. So for example, let's say you want to classify protein sequences into different classes of proteins. Every time I do this, I suspect the biologists in the room cringe, so I apologize for that. There are 20 amino acids, and proteins in our bodies are made up of sequences of amino acids. Even though there are only 20 amino acids and 26 letters in the alphabet, I'm going to denote amino acids by the letters A through Z, with apologies to the biologists. So here's an amino acid sequence, represented by a series of letters.

Suppose I want to assign this protein to one of a few classes, depending on what type of protein it is. The question is: how do I construct my feature vector?

This is challenging for many reasons, one of which is that protein sequences can be of different lengths. There are some very long protein sequences and some very short ones, and so you can't have a feature saying what the amino acid in the 100th position is, because maybe there is no 100th position in this protein sequence. Some are very long; some are very short.

So here's my feature representation: I'm going to write down all possible combinations of four letters — AAAA, AAAB, AAAC, down to AAAZ, and then AABA, and so on. You get the idea: all possible strings of four letters. And my feature representation will be this: I'm going to scan through the sequence of amino acids and count how often each of these length-four substrings occurs. So for example, BAJT occurs twice, so I'll put a two there; none of these other strings occur, so I'll put a zero there; I guess I have a one here and a zero there. This very long vector will be my feature representation for the protein, and this representation applies no matter how long my protein sequence is.

How large is this? Well, it turns out this is going to be a vector in R to the 20 to the fourth, and so you have a 160,000-dimensional feature vector, which is reasonably large, even by modern computer standards. Clearly, we don't want to explicitly represent these high-dimensional feature vectors: imagine you have 1,000 examples and you store all of this as doubles — even on modern-day computers, this is big.

It turns out that there's an efficient dynamic programming algorithm that can compute inner products between these feature vectors, and so we can apply this feature representation, even though it's a ridiculously high-dimensional feature vector, to classify protein sequences. I won't talk about that algorithm here; if any of you have seen dynamic programming algorithms for finding subsequences, it's kind of reminiscent of those. You can look those up if you're interested.
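To make the counting representation concrete, here is a small sketch that computes the inner product of two sequences' length-four substring count vectors without ever building the 160,000-dimensional vectors explicitly. This treats the four-letter patterns as contiguous substrings, which is one reading of the feature map just described; the more general string kernels that allow gaps are what need the dynamic programming machinery alluded to above. The example sequences are made up.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count how often each length-k substring occurs in seq.

    Only substrings that actually occur are stored, so we never build
    the full 20^4-dimensional feature vector."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_kernel(s, t, k=4):
    """Inner product <phi(s), phi(t)> of the substring-count features."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the inner product.
    return sum(count * ct[kmer] for kmer, count in cs.items())

s = "BAJTSARBAJTK"   # made-up sequence; "BAJT" occurs twice
t = "QQBAJTLL"       # made-up sequence; "BAJT" occurs once
print(kmer_counts(s)["BAJT"])   # 2
print(kmer_kernel(s, t))        # number of matching 4-mer pairs across s and t
```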
This is just another example of a cool kernel, and more generally, if you're faced with some new machine learning problem, sometimes you can choose a standard kernel like the Gaussian kernel, and sometimes there are research papers written on how to come up with a new kernel for a new problem.

Two last things I want to say. Where are we now? That wraps up SVMs, which many people would consider one of the most effective off-the-shelf learning algorithms, and so as of today, you've actually seen a lot of learning algorithms. I want to close this class by saying congratulations — you're now well qualified to actually go and apply learning algorithms to a lot of problems. We're still in week four of the quarter, so there's more to come. In particular, what I want to do next is talk about how to really understand learning algorithms — when they work well and when they work poorly — and how to take the tools you now have and use them really well. We'll start to do that in the next lecture. Thanks.