This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): So welcome back. What I want to do today is continue our discussion of support vector machines, and in particular, I want to talk about the optimal margin classifier. Then I want to take a brief digression and talk about primal and dual optimization problems, and in particular, what are called the KKT conditions. And then we'll derive the dual of the optimization problem that I had posed earlier. And that will lead us into a discussion of kernels, which we'll just get to say a couple of words about today, but which I'll do properly only in the next lecture.

And as part of today's lecture, I'll spend some time talking about optimization problems. And in the little time I have today, I won't really be able to do this topic justice - I want to talk about convex optimization and do that topic justice. And so at this week's discussion session, the TAs will have more time: they'll teach a discussion session focused on convex optimization, a very beautiful and useful theory. So if you want to learn more about that, listen to this Friday's discussion session.

Just to recap what we did in the previous lecture: as we were beginning to develop support vector machines, I said that a hypothesis is represented as h sub w,b of x equals g of (w transpose x + b), where g of z will be +1 or -1, depending on whether z is greater than or equal to 0. And I said that in our development of support vector machines, we'll change conventions and let y be +1 or -1 to denote the class labels.

So last time, we also talked about the functional margin, which was this thing, gamma hat i. And we had the intuition that if the functional margin is a large positive number, then that means that we are classifying a training example correctly and very confidently. So if yi is +1, we would like w transpose xi + b to be very large; and if, excuse me, yi is -1, then we'd like w transpose xi + b to be a large negative number.
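Written out, the definitions being recapped here are, in the notation of the lecture notes (the superscript (i) indexes training examples):

```latex
h_{w,b}(x) = g(w^\top x + b), \qquad
g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{otherwise,} \end{cases}
\qquad
\hat{\gamma}^{(i)} = y^{(i)}\big(w^\top x^{(i)} + b\big).
```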
So we'd sort of like functional margins to be large. We also said that the functional margin has a strange property: you can increase the functional margin just by, say, taking your parameters w and b and multiplying them by 2.

And then we also defined the geometric margin, which was, essentially, the functional margin divided by the norm of w. And so the geometric margin has the interpretation - I'll give you a few examples - the geometric margin, for example, has the interpretation as the distance between a training example and the hyperplane. And it'll actually be a signed distance, so that this distance will be positive if you're classifying the example correctly; and if you misclassify the example, this distance will be the negative of the distance between the point - the training example - and your separating hyperplane, where your separating hyperplane is defined by the equation w transpose x + b = 0.

And I guess I also defined the functional and geometric margins with respect to a training set: the functional or geometric margin with respect to a training set I defined as the worst case, that is, the minimum, functional or geometric margin over the training examples.

So in our development of the optimal margin classifier, our learning algorithm will choose parameters w and b so as to maximize the geometric margin. So our goal is to find the separating hyperplane that separates the positive and negative examples with as large a distance as possible between the hyperplane and the positive and negative examples.

And if you go to choose parameters w and b to maximize this, one property of the geometric margin is that you can actually scale w and b arbitrarily. So if you look at this definition of the geometric margin, I can choose to multiply my parameters w and b by 2 or by 10 or any other constant, and it doesn't change my geometric margin. And one way of interpreting that is in terms of the separating hyperplane itself.
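Here is a minimal NumPy sketch of these two margin definitions and the scaling property just described (the toy data and the particular w, b are made up for illustration):

```python
import numpy as np

# A tiny toy training set: two positive and two negative examples in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def functional_margins(w, b):
    # gamma-hat_i = y_i * (w . x_i + b)
    return y * (X @ w + b)

def geometric_margins(w, b):
    # gamma_i = gamma-hat_i / ||w||: the signed distance to the hyperplane
    return functional_margins(w, b) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -0.5
print(functional_margins(w, b))       # [3.5 3.5 2.5 2.5]
print(geometric_margins(w, b))
# Rescaling (w, b) -> (2w, 2b) doubles every functional margin ...
print(functional_margins(2 * w, 2 * b))
# ... but leaves every geometric margin (and the hyperplane) unchanged.
print(geometric_margins(2 * w, 2 * b))
```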
So look at this line separating the positive and negative training examples. If I scale w and b, that doesn't change the position of this plane, because the equation w transpose x + b = 0 is the same as the equation 2w transpose x + 2b = 0; it defines the same straight line. And what that means is that I can actually choose whatever scaling for w and b is convenient for me.

And in particular - we'll use this in a minute - I can impose a constraint that the norm of w equals 1, because once you find a solution for w and b, then by rescaling the parameters you can easily meet this condition: you rescale w so its norm is 1. And so I can add a condition like this and essentially not change the problem.

Or I can add other conditions. I can actually add a condition that - excuse me - the absolute value of w1 equals 1. We can have only one of these conditions at a time, mind you. That's adding the condition that the absolute value of the first component of w must be 1. And again, you can find the optimal solution and just rescale w to meet this condition.

And you can have other, more esoteric conditions like that, because, again, each of these is a condition such that you can solve for the optimal margin and then, just by scaling w up or down, ensure you meet the condition as well. So again, we'll use just one of these conditions at a time, not all of them. And this ability to choose any scaling condition on w that's convenient to us will be useful again in a second.

All right. So let's go ahead and write down the optimization problem. And again, my goal is to choose parameters w and b so as to maximize the geometric margin. Here's my first attempt at writing down the optimization problem - I actually wrote this one down right at the end of the previous lecture: maximize gamma over the parameters gamma, w, and b, such that yi times (w transpose xi + b) is greater than or equal to gamma for every training example i. And let's say I choose to add this normalization condition: the norm of w equals 1.
So the norm condition - that the norm of w is equal to 1 - just makes the geometric and the functional margin the same. And so I'm saying I want to find a value for gamma as big as possible so that all of my training examples have functional margin greater than or equal to gamma; and with the constraint that the norm of w equals 1, the functional margin and the geometric margin are the same. So it's the same as: find the value for gamma so that all the geometric margins are greater than or equal to gamma.

So if you solve this optimization problem, then you will have derived the optimal margin classifier. But this is not a very nice optimization problem, because this is a nasty, nonconvex constraint. It's asking that you solve for parameters w that lie on the surface of a unit sphere. And so if we can come up with a convex optimization problem, then we'd be guaranteed that gradient descent and other local search algorithms will not get stuck in local optima. And it turns out this is an example of a nonconvex constraint - a nasty constraint that I would like to get rid of.

So let's change the optimization problem one more time. Now, let me pose a slightly different optimization problem: let me maximize the functional margin divided by the norm of w, subject to yi times (w transpose xi + b) being greater than or equal to gamma hat. So in other words, I want to find a number, gamma hat, so that every one of my training examples has functional margin greater than or equal to gamma hat, and my optimization objective is that I want to maximize gamma hat divided by the norm of w. And we saw previously that the functional margin divided by the norm of w is just the geometric margin, and so this is a different way of posing the same optimization problem. Some of you look confused, though. Are there questions about this?
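In symbols, the two problems just posed are, first,

```latex
\max_{\gamma,\, w,\, b}\ \ \gamma
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge \gamma,\ \ i = 1,\dots,m,
\qquad \|w\| = 1,
```

and then, with the nonconvex norm constraint dropped,

```latex
\max_{\hat{\gamma},\, w,\, b}\ \ \frac{\hat{\gamma}}{\|w\|}
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge \hat{\gamma},\ \ i = 1,\dots,m.
```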
Student: For the second statement - where you have the functional margin divided by the norm of w - why don't you just have it be the geometric margin? Why do you [inaudible]?

Instructor (Andrew Ng): [Inaudible] say it again?

Student: For the second statement, where we're saying the functional margin is divided [inaudible].

Instructor (Andrew Ng): Oh, I see, yes.

Student: [Inaudible] is that [inaudible]?

Instructor (Andrew Ng): So let's see, this is the functional margin, right? This is not the geometric margin.

Student: Yeah.

Instructor (Andrew Ng): So - oh, I want to divide by the norm of w in my optimization objective.

Student: I'm just wondering how come you end up also dividing, in the second stage, the functional margin. Why are you dividing there by the norm of w?

Instructor (Andrew Ng): Let's see, I'm not sure I get the question. Let me try saying this again. So here's my goal. The parameters of this optimization problem are gamma hat, w, and b - so the convex optimization software solves this problem for some set of parameters gamma hat, w, and b. And I'm imposing the constraint that, whatever values it comes up with, yi times (w transpose xi + b) must be greater than or equal to gamma hat. And so this means that the functional margin of every example had better be greater than or equal to gamma hat. So there's a constraint tying the functional margin to gamma hat.

But what I care about is not really maximizing the functional margin. What I really care about - in other words, my optimization objective - is maximizing gamma hat divided by the norm of w, which is the geometric margin. So in other words, my optimization objective is: I want to maximize the functional margin divided by the norm of w, subject to the constraint that every example must have functional margin at least gamma hat. Does that make sense now?

Student: [Inaudible] when you said to maximize gamma or gamma hat, with respect to gamma, w, and with respect to gamma hat, so that [inaudible] gamma hat are no longer [inaudible]?
Instructor (Andrew Ng): So it turns out - so this is how I write down an optimization problem in order to solve for the geometric margin. So the question is: is gamma hat a function of w and b? And it turns out that in my previous mathematical definition, it was. But the way I'm going to pose this as an optimization problem is, assuming you have software for solving convex optimization problems, I'm going to pretend that these are independent variables and ask my convex optimization software to find me values for gamma hat, w, and b, to make this value as big as possible, subject to this constraint.

And it'll turn out that when it does that, it will choose - obviously, it will choose - gamma hat to be as big as possible, because the optimization objective is this: you're trying to maximize gamma hat. So for a fixed value of w and b, my software will choose to make gamma hat as big as possible. Well, but how big can we make gamma hat? Well, it's limited by these constraints. They say that every training example must have functional margin greater than or equal to gamma hat. And so the biggest you can make gamma hat will be the value of the smallest functional margin. And so when you solve this optimization problem, the value of gamma hat you get out will be, indeed, the minimum of the functional margins of your training set. Okay, so Justin?

Student: Yeah, I was just wondering, I guess I'm a little confused, because it's like, okay, you have two classes of data. And you can say, "Okay, please draw me a line such that you maximize the smallest distance between the line and the data points." And it seems like that's kind of what we're doing, but it seems like this is more complicated than that. And I guess I'm wondering what is the difference.

Instructor (Andrew Ng): I see. So the question is: given two classes of data, we're trying to find a separating hyperplane, and this seems more complicated than just trying to find a line that separates them.
So I'm just repeating the questions, since I'm not sure how well the audio catches them. So the answer is: this is actually exactly that problem. Given the two classes of data, positive and negative examples, this is exactly the formalization of the problem where my goal is to find a line that separates the positive and negative examples, maximizing the worst-case distance between any point and this line. Okay? Yeah, [inaudible]?

Student: So why do you care about the worst-case distance [inaudible]?

Instructor (Andrew Ng): Yeah - for now, why do we care about the worst-case distance? For now, let's just care about the worst-case distance. We'll come back, and we'll fix that later. Caring about the worst case is just a nice way to formulate this optimization problem. I'll come back and change that later. Okay, raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool. Great.

So let's see - so this is just a different way of posing the same optimization problem. And on the one hand, I've gotten rid of this nasty, nonconvex constraint, while on the other hand, I've now added a nasty, nonconvex objective. In particular, this is not a convex function in the parameters w. And so you don't have the usual guarantees - for example, that a local minimum you find is a global minimum. At least, that does not follow immediately from this, because this is nonconvex.

So what I'm going to do is - earlier, I said that we can pose any of a number of even fairly bizarre scaling constraints on w. You can choose any scaling constraint like this, and things are still fine. And so here's the scaling I'm going to choose to add. Again, for the purposes of today's lecture, I'm going to assume that these examples are linearly separable - that you can actually separate the positive and negative classes - and we'll come back and fix this later as well. But here's the scaling constraint I want to impose on w: I want to impose the constraint that the functional margin is equal to 1.
And another way of writing that is that I want to impose the constraint that the min over i of yi times (w transpose xi + b) - that is, the worst-case functional margin - is 1. And clearly, this is a scaling constraint, because if you solve for w and b and you find that your worst-case functional margin is actually 10 or whatever, then by dividing w and b through by a factor of 10, I can get my functional margin to be 1. So this is a constraint that rescaling w and b can always achieve, and it's just more compactly written as follows: this imposes the constraint that the worst-case functional margin be equal to 1.

And so if we just take what I wrote down as No. 2 of our previous optimization problems and add the scaling constraint, we then get the following optimization problem: min over w, b of one half the norm of w squared. I guess previously we had a maximization over gamma hat divided by the norm of w; with gamma hat scaled to 1, that's maximizing 1 over the norm of w, which is the same as minimizing the norm of w squared. And then these are our constraints: since I've added the scaling constraint, the functional margin of every example is greater than or equal to 1. And this is actually my final formulation of the optimal margin classifier problem, at least for now.

So the picture to keep in mind for this, I guess, is that our optimization objective is to minimize the norm of w, and so our optimization objective is just this quadratic function. And in pictures, if the axes are w1 and w2 and you want to minimize a quadratic function like this - a quadratic function just has contours that look like this. And moreover, you have a number of linear constraints in your parameters: so one linear constraint eliminates this half space, another linear constraint eliminates that half space, and so on. And so the picture is that you have a quadratic function, and you're ruling out various half spaces, one for each of these linear constraints.
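Written out - with the factor of one half that, as mentioned later in the lecture, is just a convention that makes the math a little nicer - this final formulation is:

```latex
\min_{w,\, b}\ \ \frac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,\ \ i = 1,\dots,m.
```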
And I hope - if you can picture this in 3D; I guess I can't quite draw in 3D, but I hope you can convince yourself - that this is a convex problem that has no local optima: if you run gradient descent within the set of points that hasn't been ruled out, then you'll converge to the global optimum. And so that's a convex optimization problem. Questions about this? Actually, just raise your hand if this makes sense. Okay, cool.

So this gives you the optimal margin classifier algorithm. And it turns out that this is a convex optimization problem, so you can actually take this formulation of the problem and throw it at off-the-shelf software - what's called QP, or quadratic programming, software. This class of optimization problem is called a quadratic program: a quadratic convex objective function with linear constraints. So you can actually download software to solve these optimization problems for you. Usually you'd want to use a QP solver because you have constraints like these, although you could actually modify gradient descent to work with this, too.

So we could just declare success and say that we're done with this formulation of the problem. But what I'm going to do now is take a digression to talk about primal and dual optimization problems. And in particular, later, I'm going to come back and derive yet another, very different form of this optimization problem. And the reason we'll do that is because it turns out this optimization problem has certain properties that make it amenable to very efficient algorithms. And moreover, I'll be deriving what's called the dual formulation of this, which allows us to apply the optimal margin classifier even in very high-dimensional feature spaces - even in sometimes infinite-dimensional feature spaces. We'll come back to that later.
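To make the "off-the-shelf QP software" remark concrete, here is a minimal sketch of that route. It assumes the cvxopt package (not something used in the lecture; any solver exposing the standard QP form min (1/2) z'Pz + q'z subject to Gz <= h would do):

```python
import numpy as np
from cvxopt import matrix, solvers  # pip install cvxopt

solvers.options['show_progress'] = False

def fit_max_margin(X, y):
    """Solve min_{w,b} (1/2)||w||^2  s.t.  y_i (w.x_i + b) >= 1
    as a QP in the stacked variable z = (w, b)."""
    m, n = X.shape
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)       # (1/2) z'Pz = (1/2)||w||^2; b is unpenalized,
    q = np.zeros(n + 1)         # so P is only positive semidefinite
    # y_i (w.x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]          # w, b

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = fit_max_margin(X, y)
print(w, b, y * (X @ w + b))    # every functional margin comes out >= 1
```

Each training example contributes one row of linear constraints, which is exactly the "quadratic objective, linear constraints" structure of a QP.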
But let me ask, since I'm talking about convex optimization: how many of you - from, I don't know, calculus - remember the method of Lagrange multipliers for solving an optimization problem - a minimization or maximization problem - subject to some constraint? How many of you remember the method of Lagrange multipliers for that? Oh, okay, cool. Some of you, yeah. So if you don't remember, don't worry; I'll describe that briefly here as well. But what I'm really going to do is talk about the generalization of this method of Lagrange multipliers that you may or may not have seen in some calculus classes. But if you haven't seen it before, don't worry about it.

So the method of Lagrange multipliers is: suppose there's some function you want to minimize - minimize f of w - subject to some set of constraints that hi of w must equal 0, for i = 1 through l. And given this constraint, I'll actually usually write it in vectorial form, in which I write h of w as a vector-valued function, so that it is equal to 0, where the 0 has an arrow on top; I use that to denote the vector of all 0s.

So you want to solve this optimization problem. Some of you have seen the method of Lagrange multipliers, where you construct this thing called the Lagrangian, which is the original optimization objective plus a sum of Lagrange multipliers times the constraints. And these parameters, beta i, we call the Lagrange multipliers.

And so the way you actually solve the optimization problem is you take the partial derivatives of this with respect to the original parameters and set them to 0, and the partial derivatives with respect to your Lagrange multipliers, and set those to 0. And then the theorem, due to, I guess, Lagrange, was that for some value w star to be a solution, it is necessary that - can this be the star? Student: Right. Instructor (Andrew Ng): The backwards E - there exists.
So, there exists a beta star such that those partial derivatives are equal to 0. So the method of Lagrange multipliers, to solve this problem: you construct a Lagrangian, take the derivatives with respect to the original parameters w and with respect to the Lagrange multipliers beta, set the partial derivatives equal to 0, and solve for your solutions. And then you check each of the solutions to see if it is indeed a minimum. Great.

So what I'm going to do is actually write down the generalization of this. And if you haven't seen this before, don't worry about it. So what I'm going to do is write down the generalization of this to solve a slightly more difficult type of constrained optimization problem, which is: suppose you want to minimize f of w subject to the constraints that gi of w, excuse me, is less than or equal to 0, and that hi of w is equal to 0. And again, using my vector notation, I'll write this as: g of w is less than or equal to 0, and h of w is equal to 0. So in this case, we now have inequality constraints as well as equality constraints.

I then have a Lagrangian - it's actually called the generalized Lagrangian - which is now a function of my original optimization parameters w, as well as two sets of Lagrange multipliers, alpha and beta. And so this will be f of w plus the sum of alpha i times gi of w plus the sum of beta i times hi of w.

Now, here's the cool part. I'm going to define theta subscript P of w to be equal to the max over alpha and beta, subject to the constraint that the alpha i are greater than or equal to 0, of the Lagrangian. And so I want you to consider the optimization problem: min over w of max over alpha and beta, such that the alpha i are greater than or equal to 0, of the Lagrangian.
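On the board, the generalized Lagrangian and this theta-P are:

```latex
\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i\, g_i(w) + \sum_{i=1}^{l} \beta_i\, h_i(w),
\qquad
\theta_{\mathcal{P}}(w) = \max_{\alpha,\, \beta\, :\, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta).
```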
And that's just equal to: min over w of theta P of w. And just to give this a name, the subscript P here stands for "primal problem," and that refers to this entire thing. This optimization problem I've written down is called the primal problem; that means it's the original optimization problem that we're solving. Later on, I'll derive another version of this, but that's what P stands for: this is the primal problem.

Now, I want you to consider theta P again. And in particular, I want to consider the problem of what happens if you minimize, as a function of w, this quantity theta P.

So let's look at what theta P of w is. Notice that if gi of w is greater than 0 - so let's pick a value of w and ask: what is theta P of w? If w violates one of your primal problem's constraints, then theta P of w will be infinity. Why is that? Suppose I pick a value of w that violates one of these constraints, so gi of w is positive. Then - well, theta P maximizes this function of alpha and beta, the Lagrangian. So if one of these gi of w's is positive, then by setting the corresponding alpha i to plus infinity, I can make this arbitrarily large. And so if w violates one of my primal problem's constraints in one of the gi's, then the max over alpha of this Lagrangian will be plus infinity.

And in a similar way, if hi of w is not equal to 0, then theta P of w will also be infinity, for a very similar reason: because if hi of w is not equal to 0 for some value of i, then in my Lagrangian, I have a beta i times hi term. And so by setting beta i to be plus infinity or minus infinity, depending on the sign of hi, I can make this plus infinity as well. And otherwise, theta P of w is just equal to f of w.
It turns out that if I have a value of w that satisfies all of the gi and hi constraints, then the max with respect to alpha and beta of all the Lagrange multiplier terms will actually be attained by setting all the Lagrange multipliers to 0, and so theta P is just left with f of w. Thus, theta P of w is equal to f of w if the constraints are satisfied - both the gi and hi constraints - and is equal to plus infinity otherwise.

So the problem I wrote down, minimizing as a function of w theta P of w: that's just exactly the same problem as my original primal problem, because if you choose a value of w that violates the constraints, you get infinity, and if you satisfy the constraints, you get f of w. So this is really just the same as saying, "satisfy the constraints, and minimize f of w." That's really what minimizing theta P of w is. Raise your hand if this makes sense. Yeah, okay, cool. So, all right - I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly the same thing we started with.

So here's the cool part. Let me now define theta D - the D will stand for dual - and this is a function of the Lagrange multipliers alpha and beta; it's not a function of w. To define this, we minimize over w my generalized Lagrangian. And my dual problem is this: max over alpha and beta, such that the alpha i are greater than or equal to 0, of theta D of alpha and beta. So this optimization problem, I guess, is my dual problem.

I want you to compare this to our previous primal optimization problem. The only difference is that I took the max and the min, and I switched the order around of the max and the min.
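Side by side, what's on the board is:

```latex
\theta_{\mathcal{P}}(w) =
\begin{cases}
f(w) & \text{if } w \text{ satisfies the primal constraints}\\
+\infty & \text{otherwise,}
\end{cases}
\qquad
\theta_{\mathcal{D}}(\alpha, \beta) = \min_{w} \mathcal{L}(w, \alpha, \beta),
```

```latex
p^* = \min_{w}\ \max_{\alpha,\,\beta\,:\,\alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)
\ \ \text{(primal)},
\qquad
d^* = \max_{\alpha,\,\beta\,:\,\alpha_i \ge 0}\ \min_{w} \mathcal{L}(w, \alpha, \beta)
\ \ \text{(dual)}.
```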
That's the difference between the primal and the dual optimization problems. And it turns out that it's a fact - it's true, generally - that d star is less than or equal to p star. I think I defined p star previously: p star was the value of the primal optimization problem. In other words, it's just generally true that the max of the min of something is less than or equal to the min of the max of something. And this is a general fact.

And just as a concrete example: the max over y in the set {0, 1} of the min over x in the set {0, 1} of the indicator that x = y is equal to 0, which is less than or equal to 1, which is the min over x of the max over y of that same indicator. So this inequality actually holds true for any function you might put in here, and this is one specific example where the min over x of the indicator that x = y is always equal to 0, because whatever y is, you can choose x to be something different; so this is always 0. Whereas if I exchange the order of the min and the max, the thing here is always equal to 1. So 0 is less than or equal to 1. And more generally, this max/min being less than or equal to the min/max holds true for any function you might put in there.

But it turns out that sometimes, under certain conditions, these two optimization problems have the same value: sometimes, under certain conditions, the primal and the dual problems have the same value. And so you might be able to solve the dual problem rather than the primal problem. And the reason to do that is that sometimes - and we'll see it in the optimal margin classifier problem, the support vector machine problem - the dual problem turns out to be much easier, and often has many useful properties that make it nicer to work with compared to the primal.
So what I'm going to do now is write down formally the certain conditions under which that's true - under which the primal and the dual problems are equivalent. And so our strategy for working out the derivation of the support vector machine algorithm will be: we'll write down the primal optimization problem, which we did previously - the optimal margin classifier - and then we'll derive the dual optimization problem for that. And then we'll solve the dual problem, and by modifying that a little bit, that's how we'll derive the support vector machine.

But for now, let me just first, for the sake of completeness, write down the conditions under which the primal and the dual optimization problems give you the same solution. So: let f be convex. If you're not sure what convex means, for the purposes of this class you can take it to mean that the Hessian, H, is positive semidefinite - so it's just a bowl-shaped function like that. And once you want to learn more about optimization - again, please come to this week's discussion session taught by the TAs.

Then suppose the hi constraints are affine, and what that means is that hi of w equals ai transpose w plus bi. This means almost the same thing as linear: without the constant term bi here, we'd say that hi is linear; when we have a constant intercept as well, this is technically called affine rather than linear.

And let's suppose that the gi's are strictly feasible. And what that means is that there is a value of w such that, for all i, gi of w is less than 0. Don't worry too much about the technical details; I'm writing these things down for the sake of completeness. Strictly feasible just means that there's a value of w such that all of these constraints are satisfied with strict inequality rather than less-than-or-equal-to.
Under these conditions, there exist w star, alpha star, beta star such that w star solves the primal problem, and alpha star and beta star, the Lagrange multipliers, solve the dual problem. And the value of the primal problem will be equal to the value of the dual problem, which will be equal to the value of your generalized Lagrangian evaluated at w star, alpha star, beta star. In other words, you can solve either the primal or the dual problem; you get the same solution.

Further, your parameters will satisfy these conditions: the partial derivatives with respect to the parameters w will be 0. And actually, keep this equation in mind - we'll use it in a second, when we take the Lagrangian of our support vector machine problem and take its derivative with respect to w to derive our dual problem; we'll perform this step ourselves in a second. The partial derivatives with respect to the Lagrange multipliers beta are equal to 0. And it turns out this will hold true, too: alpha star i times gi of w star equals 0. This is called the KKT complementary condition. KKT stands for Karush-Kuhn-Tucker, who were the authors of this theorem; by tradition, these are usually just called the KKT conditions. And the other two conditions are that the alpha i's are greater than or equal to 0, which we had previously, and that your constraints are actually satisfied.

So let's see. All right. So let's take those and apply them to our optimal margin optimization problem that we had previously.
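Collected in one place, these KKT conditions read:

```latex
\frac{\partial}{\partial w_i}\mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \qquad
\frac{\partial}{\partial \beta_i}\mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \qquad
\alpha_i^*\, g_i(w^*) = 0, \qquad
g_i(w^*) \le 0, \qquad
\alpha_i^* \ge 0.
```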
I was going to say one word about this KKT complementary condition, which is that at your solution, you must have that alpha star i times gi of w star is equal to 0. So let's see: the product of two numbers is equal to 0. That means that at least one of these things must be equal to 0; for the product of two things to be equal to 0, well, that's just saying either alpha i or gi is equal to 0.

So what that implies is that the Karush-Kuhn-Tucker - most people just say KKT, but I wanted to show you the right spelling of their names - the KKT complementary condition implies that if alpha i is not 0, that necessarily implies that gi of w star is equal to 0.

And usually, it turns out - so all the KKT condition guarantees is that at least one of them is 0. It may actually be the case that alpha i and gi are both equal to 0. But in practice, when you solve this optimization problem, you find that, to a large extent, alpha star i is not equal to 0 if and only if gi of w star is equal to 0. This is not strictly true, because it's possible that both of these are 0. But in practice, when we solve problems like these, for the most part, usually exactly one of them will be non-0.

And also, when gi of w star is equal to 0, we say that gi is an active constraint, because our constraint was that gi of w must be less than or equal to 0; and so if it is equal to 0, then we say that this is an active constraint. Once we talk about [inaudible], we'll come back and extend this idea a little bit more.
[Inaudible] turn to this board in a second. So let's go back and work out the primal and the dual optimization problems for our optimal margin classifier — for the optimization problem that we worked on just now. As a point of notation: in whatever I've been writing down so far in deriving the KKT conditions, the Lagrange multipliers were α_i and β_i. It turns out that when this is applied to our problem, we only have one set of Lagrange multipliers, the α_i. And also, as I was working out the KKT conditions, I used w to denote the parameters of my primal optimization problem — I wanted to minimize f(w). In my SVM problem, I'm actually going to have two sets of parameters, w and b. So just keep that slight notation change in mind. So the problem we worked out previously was: we want to minimize one-half the norm of w squared — we just add the half there by convention, because it makes the rest of the math work out a little nicer — subject to the constraint that y_i(wᵀx_i + b) must be greater than or equal to 1. And so let me just take this constraint and rewrite it as a constraint g_i(w, b) ≤ 0. Again, previously I had g_i(w), but now I have parameters w and b. So g_i(w, b) is defined as 1 − y_i(wᵀx_i + b). So let's look at the implications of this in terms of the KKT dual complementarity condition again. We have that if α_i is not equal to 0, that necessarily implies that g_i(w, b) is equal to 0 — in other words, this is an active constraint. And what does this mean? It turns out that g_i(w, b) = 0 means exactly that the training example (x_i, y_i) has functional margin equal to 1. Because this constraint was that the functional margin of every example has to be greater than or equal to 1, and so if this is an active constraint, the inequality holds with equality.
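To keep the problem in view, written out the way the course notes write it, the primal problem and the rewritten constraint are:

```latex
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\ i = 1,\dots,m,
\qquad\text{i.e.}\qquad
g_i(w,b) = 1 - y_i\,(w^\top x_i + b) \le 0.
```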
That means that my training example i must have functional margin equal to exactly 1. Actually, I'll do this on a different board, I guess. So in pictures, what that means is this: you have some training set, and you'll have some separating hyperplane. And the examples with functional margin equal to 1 will be exactly those which are closest to my separating hyperplane — that's my equation, wᵀx + b = 0. And so in this cartoon example that I've drawn, it'll be exactly these three examples that have functional margin equal to 1, and all of the other examples, being farther away than these three, will have functional margin that is strictly greater than 1. And the examples with functional margin equal to 1 will usually correspond to the ones whose corresponding Lagrange multipliers are also not equal to 0. And again, this may not hold true — it may be the case that both g_i and α_i are equal to 0 — but usually, when g_i is 0, α_i will be non-zero. And so the examples with functional margin equal to 1 will be the ones where α_i is not equal to 0. One useful property of this, suggested by this picture and true in general as well, is that when you find the solution to this optimization problem, you find that relatively few training examples have functional margin equal to 1. In this picture I've drawn, there are three examples with functional margin equal to 1 — there are just a few examples at this minimum possible distance to your separating hyperplane. And these examples with functional margin equal to 1 are what we're going to call the support vectors. And this gives us the name support vector machine — there'll be these three points with functional margin 1 that we're calling support vectors. And the fact that there are relatively few support vectors also means that usually, most of the α_i's are equal to 0.
So α_i is equal to 0 for the examples that are not support vectors. Let's go ahead and work out the actual optimization problem. So we have our optimal margin optimization problem. We go ahead and write down the Lagrangian, and because we only have inequality constraints — we really have g_i constraints and no h_i constraints — I'll only have Lagrange multipliers of type α; there are no βs in my generalized Lagrangian. My Lagrangian will be one-half the norm of w squared, minus the sum of the α_i times the constraint terms. That's my Lagrangian. And so let's work out what the dual problem is. To do that, I need to figure out θ_D(α) — again, no βs here — which is the minimum with respect to w and b of L(w, b, α). The dual problem is then to maximize θ_D as a function of α. So let's work out what θ_D is, and that'll give us our dual problem. To work out what this is, what do we need to do? We need to take the Lagrangian and minimize it as a function of w and b. And how do you minimize the Lagrangian? We do the usual thing: we take the derivatives of the Lagrangian with respect to w and b, and we set them to 0. That's how we minimize the Lagrangian with respect to w and b. So take the derivative with respect to w of the Lagrangian — I'll just write down the answer; you know how to do calculus like this. I want to minimize this function of w, so I take the derivative, set it to 0, and I get that. And so this implies that w must be Σ_i α_i y_i x_i. And so w, therefore, is actually a linear combination of your input feature vectors x_i — a sum with weights given by the α_i's times the y_i's, where the x_i's are the examples in your training set. And this will be useful later.
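On the board, that step is:

```latex
\mathcal{L}(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{m}\alpha_i\big[y_i\,(w^\top x_i + b) - 1\big],
\qquad
\nabla_w\mathcal{L} = w - \sum_{i=1}^{m}\alpha_i y_i x_i = 0
\;\Rightarrow\;
w = \sum_{i=1}^{m}\alpha_i y_i x_i.
```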
The other equation we have is the partial derivative of the Lagrangian with respect to b, which is equal to minus the sum from i = 1 to m of α_i y_i. And I'll just set that equal to 0 as well. So these are my two conditions. And so what I'm going to do is take these two conditions — I'm going to take what I've worked out to be the value for w, and I'll plug it back in there to figure out what the Lagrangian really is when I minimize with respect to w. I'll deal with b in a second. So let's see. My Lagrangian is ½ wᵀw minus the sum of constraint terms. This first term, wᵀw — this becomes (Σ_i α_i y_i x_i)ᵀ(Σ_j α_j y_j x_j); this is just putting in the value for w that I worked out previously. And when I expand out this quadratic function, and plug in w over there as well, I find that I have W(α) = Σ_i α_i − ½ Σ_i Σ_j y_i y_j α_i α_j ⟨x_i, x_j⟩. Oh, and I'm using these angle brackets to denote inner product, so ⟨x_i, x_j⟩ just means the inner product x_iᵀx_j. And the first and second terms are actually the same except for the minus one-half, so it simplifies to be equal to that. So let me go ahead and call this W(α). My dual problem is, therefore, the following: I want to maximize W(α), which is that formula. And I realize the notation is somewhat unfortunate — I'm using capital W(α) to denote the formula I wrote down earlier, and then we also have our lowercase w.
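Before going on, here is the expansion the lecture glosses over ("when I expand out this quadratic... I find that") — substitute w = Σ_i α_i y_i x_i into the Lagrangian and use Σ_i α_i y_i = 0 to kill the b term:

```latex
\begin{aligned}
w^\top w &= \sum_{i,j=1}^{m} \alpha_i\alpha_j\, y_i y_j\, \langle x_i, x_j\rangle,\\
\mathcal{L} &= \tfrac{1}{2}\,w^\top w \;-\; \underbrace{\sum_i \alpha_i y_i\, w^\top x_i}_{=\;w^\top w} \;-\; b\underbrace{\sum_i \alpha_i y_i}_{=\;0} \;+\; \sum_i \alpha_i\\
&= \sum_{i=1}^{m}\alpha_i \;-\; \tfrac{1}{2}\sum_{i,j=1}^{m} y_i y_j\,\alpha_i\alpha_j\,\langle x_i, x_j\rangle \;=\; W(\alpha).
\end{aligned}
```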
The lowercase w is what appears in the original primal problem — lowercase w transpose x_i. So uppercase and lowercase w are totally different things; unfortunately, the notation is standard as well, as far as I know. So the dual problem is to maximize W(α), subject to the α_i being greater than or equal to 0, and we also have that Σ_i y_i α_i is equal to 0. That last constraint was the constraint I got from the partial derivative with respect to b — the sum of y_i α_i equal to 0; that's where that constraint came from. Let me just — I think in previous years that I taught this, where this constraint comes from was slightly confusing, so let me take two minutes to say what the real interpretation of it is. And if you don't understand it, it's not a big deal, I guess. So when we took the partial derivative of the Lagrangian with respect to b, we ended up with this constraint that Σ_i y_i α_i must be equal to 0. The interpretation of that, it turns out, is that if Σ_i y_i α_i is not equal to 0, then θ_D(α) is equal to minus infinity under the minimization. In other words, it turns out my Lagrangian is actually a linear function of my parameter b. And so the interpretation of the constraint we worked out previously is that if Σ_i y_i α_i is not equal to 0, then θ_D(α) is equal to minus infinity. And so if your goal is to maximize θ_D(α) as a function of α, then you've got to choose values of α for which Σ_i y_i α_i is equal to 0. And then, when Σ_i y_i α_i is equal to 0, θ_D(α) is equal to W(α). And so that's why we ended up deciding to maximize W(α) subject to Σ_i y_i α_i = 0.
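To make the dual concrete, here is a minimal numerical sketch of my own (toy data and a generic optimizer, not anything from the lecture): it maximizes W(α) subject to α_i ≥ 0 and Σ_i y_i α_i = 0, then recovers w = Σ_i α_i y_i x_i.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
K = X @ X.T  # Gram matrix of inner products <x_i, x_j>

def neg_W(a):
    # W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(
    neg_W,
    x0=np.ones(m),
    bounds=[(0.0, None)] * m,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i y_i alpha_i = 0
)
alpha = res.x
w = (alpha * y) @ X  # w = sum_i alpha_i y_i x_i
print("alpha:", alpha.round(3), "w:", w.round(3))
```

A generic solver like this is only for illustration; in practice you'd use a dedicated QP solver or a special-purpose algorithm for this problem. The structure, though, is exactly the dual we just derived.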
Yeah — unfortunately, the fact that b is there adds just a little bit of extra notation to our derivation of the dual. But by the way, all the action in the optimization problem is with w, because b is just one parameter. So let's check — are there any questions about this? Okay, cool. So we've derived a dual optimization problem — and really, don't worry about this if you're not quite sure where it came from; just think of it as: we worked out this constraint when we took the partial derivative with respect to b, and I just copied it over here. So having worked out the dual optimization problem, our approach to deriving the optimal margin classifier, or support vector machine, will be that we'll solve this dual optimization problem for the parameters α*. And then, if you want — this is the equation that we worked out on the previous board — we said that w must be equal to Σ_i α_i y_i x_i. And so once you solve for α, you can then go back and quickly derive w, the parameters of your primal problem; we worked this out earlier. And moreover, once you solve for α and w, it's really easy to solve for b. The interpretation is: given the training set, once you've found the direction of w, you know your separating hyperplane's direction — you know it's got to be one of these parallel hyperplanes, so you know the orientation of the separating hyperplane — and you just have to decide where to place it. And that's what solving for b is. So once you solve for α and w, it's really easy to solve for b: you can plug α and w back into the primal optimization problem and solve for b.
And I just wrote the formula for b down for the sake of completeness: b = −(max_{i: y_i = −1} wᵀx_i + min_{i: y_i = 1} wᵀx_i) / 2. The intuition behind this formula is just: find the worst positive example and the worst negative example — say, this one and this one — and take the point halfway between them. That tells you where you should set the threshold for where to place the separating hyperplane. And then that's the optimal margin classifier. This is also called a support vector machine when you add one more idea, which is called kernels, and I'll say a few words about that. But I hope the process is clear: it's a dual problem. We're going to solve the dual problem for the α_i's. That gives us w, and that gives us b. So there's just one more thing I want to point out as I lead into the next lecture — I'll just write this out again, I guess — which is that it turns out we can take the entire algorithm and express the entire algorithm in terms of inner products. And here's what I mean by that. We said that the parameter w is a sum over your input examples, w = Σ_i α_i y_i x_i. So say we need to make a prediction: someone gives you a new value of x, and you want the value of the hypothesis at x. That's given by g(wᵀx + b), where g was this threshold function that outputs −1 or +1. And so you need to compute wᵀx + b, and that is equal to Σ_i α_i y_i ⟨x_i, x⟩ + b. So it can be expressed as a sum of these inner products between your training examples and this new value of x. And this will lead into our next lecture, which is the idea of kernels. It turns out that in the sorts of feature spaces where we use support vector machines, sometimes your training examples may be very high-dimensional. It may even be the case that the features you want to use are infinite-dimensional feature vectors.
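Continuing the earlier toy sketch (the names alpha, y, X, w are the hypothetical ones from that sketch), both the intercept formula and the prediction step only ever touch the data through inner products, and only the support-vector terms contribute:

```python
import numpy as np

def intercept(w, X, y):
    # b = -( max_{i: y_i = -1} w^T x_i + min_{i: y_i = +1} w^T x_i ) / 2
    scores = X @ w
    return -(scores[y == -1].max() + scores[y == 1].min()) / 2.0

def predict(x_new, alpha, y, X, b, tol=1e-6):
    # h(x) = g(w^T x + b), with w^T x + b = sum_i alpha_i y_i <x_i, x> + b.
    # Only support vectors (alpha_i > tol) contribute to the sum.
    sv = alpha > tol
    s = np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b
    return 1.0 if s >= 0 else -1.0

# Usage with the variables from the previous sketch:
# b = intercept(w, X, y)
# print(predict(np.array([1.0, 1.5]), alpha, y, X, b))
```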
But despite this high dimensionality, it'll turn out that there's an interesting representation that you can use that will allow you to compute inner products like these efficiently. And this holds true only for certain feature spaces; it doesn't hold true for arbitrary sets of features. But when we talk about the idea of kernels in the next lecture, we'll see examples where, even though you have extremely high-dimensional feature vectors, you may never want to represent x_i explicitly — you couldn't even store such an infinite-dimensional feature vector in computer memory — but you will nonetheless be able to compute inner products between different feature vectors very efficiently. And so you can, for example, make predictions by making use of these inner products — this is just x_i transpose x — you compute these inner products very efficiently and, therefore, make predictions. And this points to the other reason we derived the dual: on this board, when we worked out what W(α) is, it has the same property — W(α) is again written in terms of these inner products. And so if you look at the dual optimization problem and all the steps of the algorithm, you'll find that you can actually do everything you want — solve the optimization problem, learn the parameters α — without ever needing to represent x_i directly. All you need to do is compute inner products between feature vectors like these. Well, one last property of this algorithm that's kind of nice is that, as I said previously, the α_i's are non-zero only for the support vectors — only for the vectors with functional margin exactly 1 — and in practice, there are usually fairly few of them.
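As a small preview of next lecture's point (my own illustration, not an example from this lecture): the kernel K(x, z) = (xᵀz)² equals the inner product ⟨φ(x), φ(z)⟩ for the feature map φ(x) listing all n² products x_i x_j, yet it costs O(n) to evaluate instead of the O(n²) it would take to build the feature vectors explicitly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

phi = lambda v: np.outer(v, v).ravel()  # explicit n^2-dimensional feature map
lhs = phi(x) @ phi(z)                   # inner product in feature space: O(n^2)
rhs = (x @ z) ** 2                      # kernel evaluation: O(n)
assert np.isclose(lhs, rhs)             # same number, computed two ways
```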
And so what that sparsity means is that if you're representing w this way, then w is represented as a sum over a fairly small fraction of the training examples, because most of the α_i's are 0. And so when you're computing that sum, you need to compute inner products only with the support vectors, which is usually a small fraction of your training set. So that's another nice property, because most of the α's are 0. And well, much of this will make much more sense when we talk about kernels. Any quick questions before I close? Yeah. Student: It seems that for any of this to work, the data has to be really well behaved, and if any of the points are kind of on the wrong side — Instructor (Andrew Ng): No — oh, yeah. So again, today's lecture assumes that the data is linearly separable — that you can actually classify it perfectly. I'll fix this in the next lecture as well. But excellent question. Yes? Student: So can't we assume that [inaudible] point [inaudible], so [inaudible] have [inaudible]? Instructor (Andrew Ng): Yes — let's just say that there are ways to generalize this to multiple classes that I probably won't [inaudible], but yeah, that's a generalization [inaudible]. Okay. Let's close for today, then. We'll talk about kernels in our next lecture.