This presentation is delivered by the Stanford Center for Professional Development.

So welcome back. What I wanna do today is continue our discussion of support vector machines, and in particular, talk about the optimal margin classifier. Then I wanna take a brief digression and talk about primal and dual optimization problems, and in particular, what are called the KKT conditions. And then we'll derive the dual of the optimization problem that I had posed earlier. And that will lead us into a discussion of kernels, which we'll just get to say a couple of words about, but which I'll probably cover only in the next lecture. As part of today's lecture, I'll spend some time talking about optimization problems, but in the little time I have today, I won't really be able to do this topic justice - I wanna talk about convex optimization and do that topic justice. And so at this week's discussion session, the TAs will have more time - they'll teach a discussion session focused on convex optimization - it's a very beautiful and useful theory. So if you want to learn more about that, listen to this Friday's discussion session.
Just to recap what we did in the previous lecture: as we were beginning to develop support vector machines, I said that a hypothesis is represented as h sub w,b of x equals g of w transpose x + b, where g of z will be +1 or -1, depending on whether z is greater than or equal to 0. And I said that in our development of support vector machines, we'll change conventions and let y be +1 or -1 to denote the class labels.

Last time, we also talked about the functional margin, which was this quantity: gamma hat i equals yi times w transpose xi + b. And we had the intuition that if the functional margin is a large positive number, then that means we are classifying a training example correctly and very confidently. So if yi is +1, we would like w transpose xi + b to be very large; and if yi is -1, then we'd like w transpose xi + b to be a large negative number. So we'd like the functional margins to be large.
We also said the functional margin has a strange property: you can increase it just by, say, taking your parameters w and b and multiplying them by 2. And then we also defined the geometric margin, which was essentially the functional margin divided by the norm of w. And the geometric margin has the interpretation - I'll give you a few examples - of the distance between a training example and the hyperplane. It's actually a signed distance, so this distance will be positive if you're classifying the example correctly, and if you misclassify the example, it'll be the negative of the distance between the point - the training example - and your separating hyperplane, where your separating hyperplane is defined by the equation w transpose x + b = 0.
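The two margin definitions above, and the scaling quirk of the functional margin, can be sketched numerically. This is a minimal illustration only; the example point, label, and parameter values below are made up, not from the lecture:

```python
import numpy as np

# Hypothetical 2D setup: one training example and a hyperplane w^T x + b = 0.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0
x_i = np.array([2.0, 1.0])
y_i = +1                   # class label in {+1, -1}

# Functional margin: gamma_hat_i = y_i (w^T x_i + b).
functional_margin = y_i * (w @ x_i + b)
# Geometric margin: functional margin divided by ||w||,
# i.e. the signed distance from x_i to the hyperplane.
geometric_margin = functional_margin / np.linalg.norm(w)

print(functional_margin)   # 8.0
print(geometric_margin)    # 1.6

# Multiplying (w, b) by 2 doubles the functional margin
# but leaves the geometric margin unchanged.
w2, b2 = 2 * w, 2 * b
assert np.isclose(y_i * (w2 @ x_i + b2), 2 * functional_margin)
assert np.isclose(y_i * (w2 @ x_i + b2) / np.linalg.norm(w2), geometric_margin)
```

The last two assertions are exactly the point made above: the functional margin is sensitive to an arbitrary rescaling of the parameters, while the geometric margin is not.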
Oh, and I guess I also defined the functional and geometric margins with respect to a training set: those I defined as the worst case - that is, the minimum - of the functional or geometric margins over the training examples. So in our development of the optimal margin classifier, our learning algorithm will choose parameters w and b so as to maximize the geometric margin. Our goal is to find the separating hyperplane that separates the positive and negative examples with as large a distance as possible between the hyperplane and the positive and negative examples. And if you go to choose parameters w and b to maximize this, one property of the geometric margin is that you can actually scale w and b arbitrarily. So if you look at this definition of the geometric margin, I can choose to multiply my parameters w and b by 2 or by 10 or any other constant, and it doesn't change my geometric margin. One way of interpreting that is that you're looking at just the separating hyperplane - you look at the line separating your positive and negative training examples.
If I scale w and b, that doesn't change the position of this plane, because the equation w transpose x + b = 0 is the same as the equation 2 w transpose x + 2b = 0 - it's the same straight line. And what that means is that I can actually choose whatever scaling for w and b is convenient for me. In particular - and we'll use this in a minute - I can impose a constraint such as: the norm of w equals 1. That works because you can find any solution for w and b, and then by rescaling the parameters, you can easily meet this condition - the rescaled w has norm 1. And so I can add a condition like this and essentially not change the problem. Or I can add other conditions. I could instead add the condition that the absolute value of w1 equals 1 - you'd have only one of these conditions at a time - that is, a condition that the absolute value of the first component of w must equal 1. And again, you can find an arbitrary solution and just rescale w to meet this condition.
And you can have other, more esoteric conditions like that, because again, for any such condition, you can solve for the optimal margin, and then just by scaling w up and down, you can ensure you meet the condition as well. So again, you'd impose just one of these conditions at a time, not all of them. And so our ability to choose any scaling condition on w that's convenient to us will be useful again in a second.

All right. So let's go ahead and write down the optimization problem. Again, my goal is to choose parameters w and b so as to maximize the geometric margin. Here's my first attempt at writing down the optimization problem - I actually wrote this one down right at the end of the previous lecture: maximize gamma over the parameters gamma, w, and b, subject to yi times w transpose xi + b being greater than or equal to gamma, for each of the training examples. And let's say I choose to add this normalization condition: the condition that the norm of w is equal to 1 just makes the geometric and the functional margin the same.
And so I'm saying I want to find a value for gamma that's as big as possible, such that all of my training examples have functional margin greater than or equal to gamma. And with the constraint that the norm of w equals 1, the functional margin and the geometric margin are the same. So this is the same as: find the value for gamma so that all the geometric margins are greater than or equal to gamma. So if you solve this optimization problem, then you'll have derived the optimal margin classifier. But this is not a very nice optimization problem, because this is a nasty, nonconvex constraint: it's asking that you solve for parameters w that lie on the surface of a unit sphere - a unit circle in two dimensions, a unit sphere in general. And if we can come up with a convex optimization problem, then we'd be guaranteed that gradient descent or other local search procedures will not get stuck in local optima. It turns out this is an example of a nonconvex constraint - it's a nasty constraint that I would like to get rid of.
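To see concretely why the constraint that the norm of w equals 1 is nonconvex, note that the feasible set is the surface of the unit sphere, and a convex set must contain the segment between any two of its points. A two-line sketch (an illustration added here, not from the lecture):

```python
import numpy as np

# Two points satisfying ||w|| = 1 ...
w_a = np.array([1.0, 0.0])
w_b = np.array([0.0, 1.0])

# ... whose midpoint does not: the feasible set is not convex.
midpoint = 0.5 * (w_a + w_b)
print(np.linalg.norm(midpoint))   # ~0.707, not 1
```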
So let's change the optimization problem one more time. Now, let me pose a slightly different optimization problem: maximize the functional margin gamma hat divided by the norm of w, subject to yi times w transpose xi + b being greater than or equal to gamma hat. So in other words, I want to find a number, gamma hat, so that every one of my training examples has functional margin greater than or equal to gamma hat, and my optimization objective is to maximize gamma hat divided by the norm of w. So I wanna maximize the functional margin divided by the norm of w. And we saw previously that the functional margin divided by the norm of w is just the geometric margin, and so this is a different way of posing the same optimization problem. Some of you look confused, though. Are there questions about this?

Student: [Inaudible] the second statement, with the functional margin divided by - why don't you just have it be the geometric margin? Why do you [inaudible]?

Instructor (Andrew Ng): [Inaudible] say it again?

Student: For the second statement, where we're saying the functional margin is divided [inaudible].
Instructor (Andrew Ng): Oh, I see, yes.

Student: [Inaudible] is that [inaudible]?

Instructor (Andrew Ng): So let's see, this is the functional margin, right? This is not the geometric margin.

Student: Yeah.

Instructor (Andrew Ng): So - oh, I want to divide by the norm of w in my optimization objective.

Student: I'm just wondering how come you end up dividing also under the second stage [inaudible] the functional margin. Why are you dividing there by the norm of w?

Instructor (Andrew Ng): Let's see, I'm not sure I get the question. Let me try saying this again. So here's my goal. The parameters of this optimization problem are gamma hat, w, and b - so the convex optimization software solves this problem for some set of parameters gamma hat, w, and b. And I'm imposing the constraint that whatever values it comes up with, yi times w transpose xi + b must be greater than or equal to gamma hat. And so this means that the functional margin of every example had better be greater than or equal to gamma hat. So there's a constraint relating the functional margins and gamma hat. But what I care about is not really maximizing the functional margin.
What I really care about - in other words, my optimization objective - is maximizing gamma hat divided by the norm of w, which is the geometric margin. So in other words, my optimization objective is: I want to maximize the functional margin divided by the norm of w, subject to the constraint that every example must have functional margin at least gamma hat. Does that make sense now?

Student: [Inaudible] when you said to maximize gamma or gamma hat, with respect to gamma, w, and b, or with respect to gamma hat, so that [inaudible] gamma hat are no longer [inaudible]?

Instructor (Andrew Ng): So it turns out - so this is how I write down the optimization problem in order to solve for the geometric margin. So the question is: is gamma hat a function of w and b?
And it turns out that in my previous mathematical definition, it was. But the way I'm going to pose this as an optimization problem is: I'm going to ask the convex optimization solver - assuming you have software for solving convex optimization problems - I'm going to pretend that these are independent variables and ask my convex optimization software to find me values for gamma hat, w, and b, to make this value as big as possible subject to this constraint. And it'll turn out that when it does that, it will obviously choose gamma hat to be as big as possible, because the optimization objective is this: you're trying to maximize gamma hat. So for a fixed value of w and b, my software will choose to make gamma hat as big as possible. Well, but how big can we make gamma hat? Well, it's limited by the constraints, which say that every training example must have functional margin greater than or equal to gamma hat. And so the biggest you can make gamma hat will be the value of the smallest functional margin.
And so when you solve this optimization problem, the value of gamma hat you get out will be, indeed, the minimum of the functional margins of your training set. Okay, so - Justin?

Student: Yeah, I was just wondering - I guess I'm a little confused, because it's like, okay, you have two classes of data. And you can say, "Okay, please draw me a line such that you maximize the smallest distance between the line and the data points." And it seems like that's kind of what we're doing, but it seems like this is more complicated than that. And I guess I'm wondering what the difference is.

Instructor (Andrew Ng): I see. So the question is: we have two classes of data and we're trying to find a separating hyperplane, and this seems more complicated than just trying to find a line that separates them. I'm just repeating the question, since I'm not sure how well the audio catches it. So the answer is: this is actually exactly that problem.
This is exactly that problem: given the two classes of data, positive and negative examples, this is exactly the formalization of the problem where my goal is to find a line that separates the positive and negative examples, maximizing the worst-case distance between any point and this line. Okay? Yeah, [inaudible]?

Student: So why do you care about the worst-case distance [inaudible]?

Instructor (Andrew Ng): Yeah - for now, why do we care about the worst-case distance? For now, let's just care about the worst-case distance. We'll come back, and we'll fix that later. Caring about the worst case is just a nice way to formulate this optimization problem. I'll come back and change that later. Okay, raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool. Great.

So let's see - this is just a different way of posing the same optimization problem. On the one hand, I've gotten rid of the nasty, nonconvex constraint, while on the other hand, I've now added a nasty, nonconvex objective. In particular, this is not a convex function in the parameters w.
And so you don't have the usual guarantees - for example, that if you run gradient descent, you'll converge to the global minimum; at least, that does not follow immediately from this, because this is nonconvex.

So here's what I'm going to do. Earlier, I said that you can impose any of a number of even fairly bizarre scaling constraints on w - you can choose any scaling constraint like this, and things are still fine. And so here's the scaling I'm going to choose to add. Again, for the purposes of today's lecture, I'm gonna assume that these examples are linearly separable - that you can actually separate the positive and negative classes - and we'll come back and fix this later as well. But here's the scaling constraint I want to impose on w: I want to impose the constraint that the worst-case functional margin is equal to 1. Another way of writing that is that the min over i of yi times w transpose xi + b - the functional margin in the worst case - equals 1.
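The rescaling step just described - divide w and b through by the worst-case functional margin so that it becomes exactly 1 - can be checked numerically. A small sketch with invented, linearly separable toy data (none of these numbers are from the lecture):

```python
import numpy as np

# Hypothetical separable toy data; (w, b) is some hyperplane separating it.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([5.0, 5.0]), 0.0

# Functional margins y_i (w^T x_i + b); the worst case is their minimum.
margins = y * (X @ w + b)
worst = margins.min()
print(worst)   # 15.0 for this data

# Divide (w, b) through by the worst-case margin: now min_i y_i (w^T x_i + b) = 1.
w_s, b_s = w / worst, b / worst
assert np.isclose((y * (X @ w_s + b_s)).min(), 1.0)

# The geometric margin (and hence the separating line) is unchanged.
assert np.isclose(worst / np.linalg.norm(w),
                  (y * (X @ w_s + b_s)).min() / np.linalg.norm(w_s))
```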
And clearly, this is just a scaling constraint, because if you solve for w and b and find that your worst-case functional margin is actually 10 or whatever, then by dividing w and b through by a factor of 10, I can get my worst-case functional margin to be 1. So this is a constraint that a rescaling of the parameters can always satisfy. And it's just more compactly written as follows: this imposes the constraint that the worst-case functional margin be equal to 1.

And so if we just take what I wrote down as No. 2 of our previous optimization problems and add the scaling constraint, we then get the following optimization problem: minimize over w and b the norm of w squared, subject to yi times w transpose xi + b being greater than or equal to 1. Previously, we had a maximization of gamma hat divided by the norm of w; with gamma hat fixed at 1, that means maximizing 1 over the norm of w, which is the same as minimizing the norm of w squared. And then these are our constraints: since I've added the scaling constraint, the functional margins are all at least 1. And this is actually my final - well, final formulation of the optimal margin classifier problem, at least for now.
So the picture to keep in mind for this, I guess, is that our optimization objective is just to minimize the norm of w squared, and so our optimization objective is just a quadratic function. If the axes are w1 and w2, and you want to minimize a quadratic function like this - the quadratic function just has contours that look like ellipses. And moreover, you have a number of linear constraints on your parameters: one linear constraint eliminates this half-space, another linear constraint eliminates that half-space, and so on. And so the picture is: you have a quadratic function, and you're ruling out various half-spaces, one for each of these linear constraints. And I hope - if you can picture this in 3D, even though I can't really draw in 3D - that you can convince yourself that this is a convex problem that has no local optima. If you run gradient descent within the set of points that hasn't been ruled out, then you'll converge to the global optimum.
And so that's a convex optimization problem. Questions about this? Actually, just raise your hand if this makes sense. Okay, cool.

So this gives you the optimal margin classifier algorithm. And it turns out that this is a convex optimization problem, so you can actually take this formulation of the problem and throw it at off-the-shelf software - what's called QP, or quadratic programming, software. This class of optimization problem is called a quadratic program: you have a quadratic convex objective function and linear constraints. So you can actually download software to solve these optimization problems for you. Usually, you want to use a QP solver because you have constraints like these, although you could actually modify gradient descent to work with this, too.

So we could just declare success and say that we're done with this formulation of the problem. But what I'm going to do now is take a digression to talk about primal and dual optimization problems.
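As a sketch of "throwing it at off-the-shelf software": dedicated QP packages exist, but a generic constrained solver works too for a tiny problem. Here scipy's SLSQP stands in for a real QP solver, on invented toy data; the 1/2 factor in the objective is conventional and doesn't change the minimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as theta = [w1, w2, b]; objective is (1/2) ||w||^2.
def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w

# One inequality per training example: y_i (w^T x_i + b) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': lambda theta, xi=xi, yi=yi: yi * (theta[:2] @ xi + theta[2]) - 1.0}
               for xi, yi in zip(X, y)]

result = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = result.x[:2], result.x[2]
margins = y * (X @ w + b)

# Every functional margin comes out >= 1 (up to solver tolerance), and the
# worst-case margin is exactly 1: those examples sit on the margin boundary.
assert margins.min() > 1 - 1e-4
assert abs(margins.min() - 1.0) < 1e-3
```

The `xi=xi, yi=yi` default arguments are the standard trick for capturing each example in its own lambda rather than sharing the loop variable.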
And in particular, later I'm going to come back and derive yet another, very different form of this optimization problem. And the reason we'll do that is because it turns out this optimization problem has certain properties that make it amenable to very efficient algorithms. And moreover, I'll be deriving what's called the dual formulation of this, which allows us to apply the optimal margin classifier even in very high-dimensional feature spaces - even in, sometimes, infinite-dimensional feature spaces. So we'll come back to that later.

But let me ask, since I'm talking about convex optimization: how many of you, from, I don't know, calculus, remember the method of Lagrange multipliers for solving an optimization problem - a minimization or maximization problem subject to some constraint? How many of you remember the method of Lagrange multipliers for that? Oh, okay, cool - some of you, yeah. So if you don't remember, don't worry.
I'll describe it briefly here as well, but what I'm really going to do is talk about the generalization of this method of Lagrange multipliers that you may or may not have seen in some calculus classes. But if you haven't seen it before, don't worry about it. So the method of Lagrange multipliers is: suppose there's some function you want to minimize, minimize f of w, subject to some set of constraints that h_i of w must equal 0, for i = 1 through l. And given this constraint, I'll actually usually write it in vectorial form, in which I write h of w as a vector-valued function, so that it is equal to 0, where the 0 has an arrow on top; I use that to denote the vector of all 0s. So you want to solve this optimization problem. Some of you have seen the method of Lagrange multipliers, where you construct the Lagrangian, which is the original optimization objective plus a sum of Lagrange multipliers times the constraints: f of w plus the sum over i of beta i times h_i of w. And these parameters beta i are what we call the Lagrange multipliers.
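As a small worked example (my own numbers, not from the board): minimize f(w) = w1^2 + w2^2 subject to w1 + w2 = 1. Setting the partial derivatives of the Lagrangian L(w, beta) = f(w) + beta (w1 + w2 - 1) to zero gives 2 w1 + beta = 0, 2 w2 + beta = 0, and w1 + w2 = 1, whose solution is w* = (1/2, 1/2) with beta* = -1. A quick check in pure Python:

```python
# Minimize f(w) = w1^2 + w2^2 subject to h(w) = w1 + w2 - 1 = 0.
# Stationarity of L(w, beta) = f(w) + beta * h(w) gives the system:
#   2*w1 + beta = 0,  2*w2 + beta = 0,  w1 + w2 = 1
# Solved by hand: w1 = w2 = 1/2, beta = -1.
w1, w2, beta = 0.5, 0.5, -1.0

# Verify the stationarity and feasibility conditions.
assert abs(2 * w1 + beta) < 1e-12   # dL/dw1 = 0
assert abs(2 * w2 + beta) < 1e-12   # dL/dw2 = 0
assert abs(w1 + w2 - 1) < 1e-12     # constraint h(w) = 0 holds
print(w1, w2, beta)
```
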
And so the way you actually solve the optimization problem is you take the partial derivatives of this with respect to the original parameters w and set those to 0, and the partial derivatives with respect to your Lagrange multipliers beta, and set those to 0. And the theorem, due to Lagrange, was that for some value w star to be a solution, it is necessary that there exists (that's the backwards E, there exists) a beta star such that those partial derivatives are equal to 0. So the method of Lagrange multipliers is: to solve this problem, you construct the Lagrangian, take the derivatives with respect to the original parameters w and with respect to the Lagrange multipliers beta, set the partial derivatives equal to 0, and solve for the solutions. And then you check each of the solutions to see if it is indeed a minimum. Great.
So what I'm going to do is actually write down the generalization of this. And if you haven't seen that before, don't worry about it. So what I'm going to do is write down the generalization of this to solve a slightly more difficult type of constrained optimization problem, which is: suppose you want to minimize f of w subject to the constraints that g_i of w is less than or equal to 0, and that h_i of w is equal to 0. And again, using my vector notation, I'll write this as g of w less than or equal to 0, and h of w equal to 0. So in this case, we now have inequality constraints as well as equality constraints. I then have a Lagrangian, or actually it's called the generalized Lagrangian, which is now a function of my original optimization parameters w, as well as two sets of Lagrange multipliers, alpha and beta. And so this will be f of w, plus a sum of alpha i times g_i of w, plus a sum of beta i times h_i of w. Now, here's the cool part.
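Written out (reconstructing the board equation from the surrounding description), the generalized Lagrangian is

```latex
\mathcal{L}(w, \alpha, \beta) \;=\; f(w) \;+\; \sum_{i=1}^{k} \alpha_i\, g_i(w) \;+\; \sum_{i=1}^{l} \beta_i\, h_i(w),
```

where the $\alpha_i$ multiply the inequality constraints $g_i(w) \le 0$ and the $\beta_i$ multiply the equality constraints $h_i(w) = 0$.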
I'm going to define theta subscript P of w to be equal to the max over alpha and beta, subject to the constraint that the alpha i's are greater than or equal to 0, of the Lagrangian. And so I want you to consider the optimization problem: min over w of max over alpha, beta, such that the alpha i's are greater than or equal to 0, of the Lagrangian. And that's just equal to min over w of theta P of w. And just to give this a name, the subscript P here stands for the primal problem, and it refers to this entire thing. This optimization problem I've written down is called the primal problem. It's the original optimization problem that we're solving. Later on, I'll derive another version of this, but that's what P stands for: this is the primal problem. Now, I want you to consider theta P again. And in particular, I want to consider the problem of what happens if you minimize, as a function of w, this quantity theta P. So let's look at what theta P of w is.
Notice that if g_i of w is greater than 0, so let's pick a value of w and ask what theta P of w is: if w violates one of your primal problem's constraints, then theta P of w will be infinity. Why is that? Suppose I pick a value of w that violates one of these constraints, so g_i of w is positive. Well, theta P maximizes this function of alpha and beta, the Lagrangian. So if one of these g_i of w's is positive, then by setting the corresponding alpha i to plus infinity, I can make this arbitrarily large. And so if w violates one of my primal problem's constraints on one of the g_i's, then the max over alpha of this Lagrangian will be plus infinity. And in a similar way, if h_i of w is not equal to 0, then theta P of w will also be infinity, for a very similar reason: because if h_i of w is not equal to 0 for some value of i, then in my Lagrangian I had a beta i times h_i of w term.
And so by setting beta i to be plus infinity or minus infinity, depending on the sign of h_i, I can make this plus infinity as well. And otherwise, theta P of w is just equal to f of w. It turns out that if I had a value of w that satisfies all of the g_i and the h_i constraints, then the max over alpha and beta of the Lagrangian is actually attained by setting all the Lagrange multiplier terms to 0, and so theta P is just left with f of w. Thus, theta P of w is equal to f of w if w satisfies the constraints, both the g_i and the h_i constraints, and is equal to plus infinity otherwise. So the problem I wrote down, minimizing as a function of w this theta P of w, is just exactly the same problem as my original primal problem, because if you choose a value of w that violates the constraints, you get infinity.
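A tiny numerical illustration of this (my own toy problem, not the lecture's): take f(w) = w^2 with one inequality constraint g(w) = 1 - w <= 0 and no equality constraints. Then theta_P(w) = max over alpha >= 0 of [f(w) + alpha * g(w)] is f(w) when the constraint holds and plus infinity when it is violated:

```python
import math

def f(w):
    return w * w

def g(w):
    return 1.0 - w  # constraint g(w) <= 0, i.e. w >= 1

def theta_p(w):
    # theta_p(w) = max over alpha >= 0 of f(w) + alpha * g(w).
    # If g(w) > 0, the supremum is +inf (drive alpha to infinity);
    # if g(w) <= 0, the maximum is attained at alpha = 0, giving f(w).
    return math.inf if g(w) > 0 else f(w)

print(theta_p(0.5))  # constraint violated (0.5 < 1): inf
print(theta_p(2.0))  # constraint satisfied: f(2) = 4.0
```

So minimizing theta_p over w is exactly the original constrained problem: infeasible points are priced out at infinity.
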
And if you satisfy the constraints, you get f of w. So this is really just the same as saying: satisfy the constraints, and minimize f of w. That's really what minimizing theta P of w is. Raise your hand if this makes sense. Yeah, okay, cool. So all right. I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly the same thing we started with. So here's the cool part. I'm going to define theta D, where D stands for dual, and this is a function of alpha and beta; it's a function of the Lagrange multipliers, not of w. To define this, we minimize over w my generalized Lagrangian. And my dual problem is this: max over alpha and beta, such that the alpha i's are greater than or equal to 0, of theta D of alpha, beta. In other words, this is the max over alpha and beta of the min over w of the Lagrangian. And so this is my dual optimization problem: to maximize over alpha and beta, theta D of alpha and beta. I want you to compare this to our previous primal optimization problem.
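Reconstructing the board, the dual function and the two problems side by side are

```latex
\theta_{\mathcal{D}}(\alpha, \beta) = \min_{w}\, \mathcal{L}(w, \alpha, \beta),
\qquad
d^* = \max_{\alpha \succeq 0,\, \beta}\; \min_{w}\, \mathcal{L}(w, \alpha, \beta),
\qquad
p^* = \min_{w}\; \max_{\alpha \succeq 0,\, \beta}\, \mathcal{L}(w, \alpha, \beta).
```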
The only difference is that I took the max and the min and switched their order around. That's the difference between the primal and the dual optimization problems. And it turns out that it's a general fact that d star is less than or equal to p star. In other words, I think I defined p star previously; p star was the value of the primal optimization problem, and d star is the value of the dual. And it's just generally true that the max of the min of something is less than or equal to the min of the max of something. And this is a general fact. And just as a concrete example: the max over y in the set {0, 1} of the min over x in the set {0, 1} of the indicator that x equals y is less than or equal to the min over x of the max over y of that same indicator.
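Checking that concrete example in code, with the indicator 1{x = y} over x, y in {0, 1}:

```python
def indicator(x, y):
    # 1{x == y}
    return 1 if x == y else 0

vals = (0, 1)

# max over y of (min over x of 1{x == y}): whatever y is, x can be
# chosen to differ, so every inner min is 0, and the outer max is 0.
max_min = max(min(indicator(x, y) for x in vals) for y in vals)

# min over x of (max over y of 1{x == y}): whatever x is, y can be
# chosen to match, so every inner max is 1, and the outer min is 1.
min_max = min(max(indicator(x, y) for y in vals) for x in vals)

print(max_min, min_max)  # 0 1
assert max_min <= min_max  # weak duality: max-min <= min-max
```
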
And this is one specific example where the min over x of the indicator that x equals y is always equal to 0, because whatever y is, you can choose x to be something different. So the max-min is always 0, whereas if I exchange the order to min and max, the inner quantity here is always equal to 1, so the min-max is 1. So 0 is less than or equal to 1. And more generally, this max-min less than or equal to min-max holds true for any function you might put in there. But it turns out that sometimes, under certain conditions, these two optimization problems have the same value. Sometimes, under certain conditions, the primal and the dual problems have the same value. And so you might be able to solve the dual problem rather than the primal problem.
And the reason to do that is that sometimes, as we'll see in the optimal margin classifier problem, the support vector machine problem, the dual problem turns out to be much easier to solve, and often has many useful properties that make it more convenient to work with than the primal. So what I'm going to do now is write down formally the certain conditions under which that's true, where the primal and the dual problems are equivalent. And so our strategy for working out the details of the support vector machine algorithm will be that we'll write down the primal optimization problem, which we did previously, the maximum margin classifier, and then we'll derive the dual optimization problem for that. And then we'll solve the dual problem, and by modifying that a little bit, that's how we'll derive the support vector machine. But for now, let me just first, for the sake of completeness, write down the conditions under which the primal and the dual optimization problems give you the same solution. So let f be convex.
If you're not sure what convex means, for the purposes of this class you can take it to mean that the Hessian H is positive semi-definite, so it's just a bowl-shaped function. And to learn more about optimization, again, please come to this week's discussion session taught by the TAs. Then suppose the h_i constraints are affine, and what that means is that h_i of w equals a_i transpose w plus b_i. This means almost the same thing as linear: without the constant term b_i here, we'd say that h_i is linear; when we have a constant intercept as well, this is technically called affine rather than linear. And let's suppose that the g_i's are strictly feasible. And what that means is that there exists a value of w such that, for all i, g_i of w is strictly less than 0. Don't worry too much about the technical details; I'm writing these things down for the sake of completeness.
Strictly feasible just means that there's a value of w such that all of these constraints are satisfied with strict inequality, rather than less-than-or-equal-to. Under these conditions, there will exist w star, alpha star, beta star such that w star solves the primal problem, and alpha star and beta star, the Lagrange multipliers, solve the dual problem. And the value of the primal problem will be equal to the value of the dual problem, which will be equal to the value of your generalized Lagrangian evaluated at w star, alpha star, beta star. In other words, you can solve either the primal or the dual problem, and you get the same solution. Further, your parameters will satisfy these conditions: the partial derivatives of the Lagrangian with respect to the parameters will be 0.
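To make these conditions concrete, here is a toy problem (mine, small enough to solve by hand, not the lecture's): minimize f(w) = w^2 subject to g(w) = 1 - w <= 0. The solution is w* = 1 with multiplier alpha* = 2, and the KKT conditions, plus equality of the primal and dual values, can all be checked directly:

```python
# Toy problem: minimize f(w) = w^2 subject to g(w) = 1 - w <= 0.
# Hand solution: the constraint is active, so w* = 1 and alpha* = 2
# (from stationarity, dL/dw = 2w - alpha = 0 at w = 1).
w_star, alpha_star = 1.0, 2.0

g = 1.0 - w_star                             # constraint value at the solution

# The four KKT conditions:
assert abs(2 * w_star - alpha_star) < 1e-12  # stationarity: dL/dw = 0
assert g <= 0                                # primal feasibility
assert alpha_star >= 0                       # dual feasibility
assert abs(alpha_star * g) < 1e-12           # complementary slackness

# Strong duality: theta_D(alpha) = min_w [w^2 + alpha*(1 - w)]
# = alpha - alpha^2/4, maximized at alpha = 2 with value 1 = f(w*).
theta_d = alpha_star - alpha_star ** 2 / 4
assert abs(theta_d - w_star ** 2) < 1e-12    # d* == p*
print(w_star, alpha_star, theta_d)
```
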
And actually, keep this equation in mind; we'll actually use it in a second, when we take the Lagrangian of our support vector machine problem and take a derivative with respect to w to derive our dual problem. We'll actually perform this step ourselves in a second. The partial derivative with respect to the Lagrange multipliers beta is also equal to 0. And it turns out this will hold true, too: alpha star i times g_i of w star is equal to 0. This is called the KKT complementarity condition. KKT stands for Karush-Kuhn-Tucker, who were the authors of this theorem. And by tradition, these are usually called the KKT conditions. But the other two conditions are just that the alpha i's are greater than or equal to 0, which we had previously, and that your constraints are actually satisfied. So let's take those and apply them to our optimal margin optimization problem that we had previously.
I was going to say one word about this KKT complementarity condition, which is that at your solution, you must have that alpha star i times g_i of w star is equal to 0. So let's see: the product of two numbers is equal to 0. That means that at least one of them must be equal to 0; for the product of two things to be equal to 0, either alpha i or g_i must be equal to 0. So what that implies is that the Karush-Kuhn-Tucker (most people just say KKT, but I wanted to show you the right spelling of their names) complementarity condition implies that if alpha star i is not 0, that necessarily implies that g_i of w star is equal to 0.
And usually, it turns out, all the KKT condition guarantees is that at least one of them is 0; it may actually be the case that alpha i and g_i are both equal to 0. But in practice, when you solve this optimization problem, you find that, for the most part, alpha star i is not equal to 0 if and only if g_i of w star is equal to 0. This is not strictly true, because it's possible that both of these may be 0. But in practice, when we solve problems like these, for the most part, usually exactly one of them will be non-zero. And also, when g_i of w star is equal to 0, we say that g_i of w is an active constraint, because our constraint was that g_i of w must be less than or equal to 0; and so if it is equal to 0, then we say that this is an active constraint. Later, we'll come back and extend this idea a little bit more.
So let's go back and work out the primal and the dual optimization problems for our optimal margin classifier, for the optimization problem that we worked on just now. As a point of notation: in what I've been writing down so far in deriving the KKT conditions, the Lagrange multipliers were alpha i and beta i. It turns out that when this is applied to the SVM, we only have one set of Lagrange multipliers, alpha i. And also, as I was working out the KKT conditions, I used w to denote the parameters of my primal optimization problem, where I wanted to minimize f of w; my very first optimization problem was over the parameters w. In my SVM problem, I'm actually going to have two sets of parameters, w and b. So just keep that slight notation change in mind. So the problem we worked out previously was: we want to minimize one-half the norm of w squared, with the one-half there by convention because it makes the math work out a little nicer.
And this is subject to the constraints that y i times w transpose x i plus b must be greater than or equal to 1. And so let me just take this constraint, and I'll rewrite it as a constraint g_i of w, b. Again, previously I had g_i of w, but now I have parameters w and b. So g_i of w, b is defined as 1 minus y i times w transpose x i plus b, which must be less than or equal to 0. So let's look at the implications of this in terms of the KKT dual complementarity condition again. So if alpha i is not equal to 0, that necessarily implies that g_i of w, b is equal to 0; in other words, this is an active constraint. And what does this mean? It turns out that g_i of w, b equal to 0 means exactly that the training example x i, y i has functional margin equal to 1. Because this constraint was that the functional margin of every example has to be greater than or equal to 1, and so if this is an active constraint, the inequality holds with equality. That means that my training example i must have functional margin equal to exactly 1. And so, actually, I'll do this on a different board, I guess.
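Reconstructing the board, the constraint rewritten in the standard $g_i \le 0$ form is

```latex
g_i(w, b) \;=\; 1 - y^{(i)}\left(w^\top x^{(i)} + b\right) \;\le\; 0,
```

so $g_i(w, b) = 0$ is exactly the statement that example $i$ has functional margin $\hat{\gamma}^{(i)} = y^{(i)}(w^\top x^{(i)} + b) = 1$.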
So in pictures, what that means is: you have some training set, and you'll have some separating hyperplane. And the examples with functional margin equal to 1 will be exactly those which are closest to my separating hyperplane. So that's my line, w transpose x plus b equals 0. And in this cartoon example that I've drawn, it'll be exactly these three examples that have functional margin equal to 1, and all of the other examples, being further away than these three, will have functional margin that is strictly greater than 1. And the examples with functional margin equal to 1 will usually correspond to the ones where the corresponding Lagrange multipliers are also not equal to 0. And again, it may not always hold true; it may be the case that both g_i and alpha i are equal to 0. But usually, when g_i is 0, alpha i will be non-zero. And so the examples with functional margin equal to 1 will be the ones where alpha i is not equal to 0.
One useful property of this is that, as suggested by this picture and true in general as well, it turns out that when we find a solution to this optimization problem, you find that relatively few training examples have functional margin equal to 1. In this picture I've drawn, there are three examples with functional margin equal to 1. There are just a few examples at this minimum possible distance to your separating hyperplane. And these three examples with functional margin equal to 1 - they are what we're going to call the support vectors. And this gives the name support vector machine: there'll be these three points with functional margin 1 that we're calling support vectors. And the fact that there are relatively few support vectors also means that usually, most of the alpha i's are equal to 0 - alpha i equal to 0 for the examples that are not support vectors. Let's go ahead and work out the actual optimization problem.
So we have the optimal margin optimization problem. So there we go and write down the margin, and because we only have inequality constraints - we really have gi constraints and no hi constraints. We have inequality constraints and no equality constraints, so I'll only have Lagrange multipliers of type alpha - no betas in my generalized Lagrangian. But my Lagrangian will be one-half norm of w squared, minus the sum over i of alpha i times the quantity yi (w transpose xi + b) minus 1. That's my Lagrangian. And so let's work out what the dual problem is. And to do that, I need to figure out what theta d of alpha is - and again, there's no beta here - so theta d of alpha is the min with respect to w, b of L(w, b, alpha). So the dual problem is to maximize theta d as a function of alpha. So let's work out what theta d is, and then that'll give us our dual problem. So to work out what this is, what do we need to do? We need to take the Lagrangian and minimize it as a function of w and b. And how do you minimize the Lagrangian?
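As a reconstruction in standard notation (not verbatim from the board), the primal problem and the Lagrangian just described are:

```latex
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad g_i(w,b) \;=\; -\,y^{(i)}\bigl(w^\top x^{(i)} + b\bigr) + 1 \;\le\; 0,
\qquad i = 1,\dots,m,
```

```latex
\mathcal{L}(w, b, \alpha) \;=\; \tfrac{1}{2}\|w\|^2
\;-\; \sum_{i=1}^{m} \alpha_i \Bigl[\, y^{(i)}\bigl(w^\top x^{(i)} + b\bigr) - 1 \,\Bigr],
\qquad \alpha_i \ge 0.
```

Since there are only inequality constraints, only the alpha multipliers appear, with no beta terms.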
So in order to minimize the Lagrangian as a function of w and b, we do the usual thing. We take the derivatives of the Lagrangian with respect to w and b, and we set them to 0. That's how we minimize the Lagrangian with respect to w and b. So take the derivative with respect to w of the Lagrangian. And I'll just write down the answer - you know how to do calculus like this. So I wanna minimize this function of w, so I take the derivative and set it to 0, and I get that w minus the sum over i of alpha i yi xi equals 0. And so this implies that w must be equal to the sum over i of alpha i yi xi. And so w, therefore, is actually a linear combination of your input feature vectors xi - a sum with weights given by the alpha i's times yi, times the xi's, which are the examples in your training set. And this will be useful later. The other equation we have is this: the partial derivative of the Lagrangian with respect to b is equal to minus the sum from i equals 1 to m of yi alpha i. And so I'll just set that equal to 0. And so these are my two constraints.
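As a minimal sketch of the first condition - with hypothetical toy data and hypothetical alpha values, not a solution to any actual dual problem - recovering w as the alpha-weighted combination of training examples looks like:

```python
import numpy as np

# Hypothetical toy data: four 2-D points with labels y in {-1, +1}.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [-1.0, -1.0],
              [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hypothetical dual variables alpha_i (non-zero only on support vectors).
alpha = np.array([0.25, 0.0, 0.25, 0.0])

# Setting dL/dw = 0 gives w = sum_i alpha_i * y_i * x_i.
w = (alpha * y) @ X
print(w)  # -> [0.5 0.5]
```

Note that these made-up alphas also satisfy the second condition, sum of yi alpha i equal to 0.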
So what I'm going to do is I'm actually going to take these two constraints - well, I'm going to take what I've worked out to be the value for w, and I'll plug it back in there to figure out what the Lagrangian really is when I minimize with respect to w. And I'll deal with b in a second. So let's see. My Lagrangian is one-half w transpose w, minus the sum over i of alpha i times the quantity yi (w transpose xi + b) minus 1. So this first term, w transpose w - this becomes the quantity sum over i of alpha i yi xi, transpose, times the sum over j of alpha j yj xj. This is just putting in the value for w that I worked out previously. And so when I expand out this quadratic function, and when I plug in w over there as well, I find that the Lagrangian collapses to an expression in the alphas and the pairwise inner products of the xi's.
Oh - where I'm using these angle brackets to denote inner product, so this thing here just means the inner product xi transpose xj. And the first and second terms are actually the same except for the minus one-half, so this simplifies to: the sum from i equals 1 to m of alpha i, minus one-half times the sum over i and j of yi yj alpha i alpha j times the inner product of xi and xj. So let me go ahead and call this W of alpha. My dual problem is, therefore, the following: I want to maximize W of alpha, which is that formula. And I realize the notation is somewhat unfortunate - I'm using capital W of alpha to denote the formula I wrote down earlier, and then we also had our lowercase w in the original primal problem, lowercase w transpose xi. So uppercase W and lowercase w are totally different things; unfortunately, the notation is standard as well, as far as I know.
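A small sketch of evaluating this dual objective W of alpha - again with hypothetical toy values, and writing the double sum with a Gram matrix of inner products:

```python
import numpy as np

def W(alpha, X, y):
    """Dual objective: sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>."""
    G = X @ X.T                      # Gram matrix: G[i, j] = <x_i, x_j>
    u = alpha * y                    # elementwise alpha_i * y_i
    return alpha.sum() - 0.5 * u @ G @ u

# Hypothetical toy values (not an optimized solution):
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
print(W(alpha, X, y))  # -> 0.25
```

Notice that X enters only through the Gram matrix of inner products, which is exactly the point made later about kernels.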
So the dual problem is to maximize that, subject to the alpha i's being greater than or equal to 0, and we also have that the sum over i of yi alpha i is equal to 0. That last constraint was the constraint I got from the partial derivative with respect to b - the sum over i of yi alpha i equals 0. That's where that constraint came from. Let me just - I think in previous years that I taught this, where this constraint comes from was slightly confusing. So let me just take two minutes to say what the real interpretation of it is. And if you don't understand it, it's not a big deal, I guess. So when we took the partial derivative of the Lagrangian with respect to b, we ended up with this constraint that the sum over i of yi alpha i must be equal to 0. The interpretation of that, it turns out, is that if the sum over i of yi alpha i is not equal to 0, then - theta d of alpha is - actually, excuse me.
Then theta d of alpha is equal to minus infinity, from the minimization. So in other words, it turns out my Lagrangian is actually a linear function of my parameter b. And so the interpretation of that constraint we worked out previously is that if the sum over i of yi alpha i is not equal to 0, then theta d of alpha is equal to minus infinity. And so if your goal is to maximize theta d of alpha as a function of alpha, then you've gotta choose values of alpha for which the sum of yi alpha i is equal to 0. And then when the sum of yi alpha i is equal to 0, theta d of alpha is equal to W of alpha. And so that's why we ended up deciding to maximize W of alpha subject to the constraint that the sum of yi alpha i is equal to 0. Yeah - unfortunately, the fact that b is there adds just a little bit of extra notation to our derivation of the dual. But by the way, almost all the action in the optimization problem is with w, because b is just one parameter. So let's check - are there any questions about this? Okay, cool.
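To make that step explicit (a reconstruction in standard notation): grouping the b-terms in the Lagrangian shows it is linear in b,

```latex
\mathcal{L}(w, b, \alpha)
\;=\; \tfrac{1}{2}\|w\|^2
\;-\; \sum_{i=1}^{m} \alpha_i \Bigl[\, y^{(i)} w^\top x^{(i)} - 1 \,\Bigr]
\;-\; b \sum_{i=1}^{m} \alpha_i y^{(i)}.
```

So if the coefficient sum of alpha i yi is non-zero, minimizing over b drives the Lagrangian to minus infinity; and when that sum is exactly 0, the b-term vanishes and the minimum over w, b is W of alpha.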
So we've derived the dual optimization problem - and really, don't worry about this if you're not quite sure where that came from. Just think of it as: we worked out this constraint when we took the partial derivative with respect to b, and so I just copied that over here. So we've worked out the dual of the optimization problem, and our approach to deriving the optimal margin classifier, or support vector machine, will be that we'll solve this dual optimization problem for the parameters alpha star. And then, if you want - this is the equation that we worked out on the previous board - we said that w must be equal to the sum over i of alpha i yi xi. And so once you solve for alpha, you can then go back and quickly derive w, the parameters of your primal problem. And we worked this out earlier.
And moreover, once you solve for alpha and w, it's really easy to solve for b. The interpretation is that, given your training set, you've found the direction for w, so you know your separating hyperplane's direction - you know it's got to be one of these things. You know the orientation of the separating hyperplane; you just have to decide where to place this hyperplane. And that's what solving for b is. So once you solve for alpha and w, it's really easy to solve for b: you can plug alpha and w back into the primal optimization problem and solve for b. And I just wrote it down for the sake of completeness - b star equals minus one-half times the quantity max over the negative examples of w star transpose xi, plus min over the positive examples of w star transpose xi. And the intuition behind this formula is just: find the worst positive example and the worst negative example - let's say this one and this one - and split the difference between them.
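A sketch of that intercept formula with hypothetical toy values - assuming w has already been recovered from the dual solution, b is placed halfway between the closest positive and closest negative example:

```python
import numpy as np

# Hypothetical toy values: labels and a w already recovered from the dual.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])

# b* = -( max_{i: y_i = -1} w.x_i  +  min_{i: y_i = +1} w.x_i ) / 2:
# the worst (closest) negative and positive examples set the threshold.
b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2
print(b)  # -> -0.75
```

This places the hyperplane midway between the two closest examples on either side.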
And that tells you where you should set the threshold for where to place the separating hyperplane. And then that's the optimal margin classifier - this, combined with the idea of kernels, is also called a support vector machine, and I'll say a few words about that. But I hope the process is clear: it's a dual problem. We're going to solve the dual problem for the alpha i's; that gives us w, and that gives us b. So there's just one more thing I wanna point out as I lead into the next lecture - I'll just write this out again, I guess - which is that it turns out we can take the entire algorithm, and we can express the entire algorithm in terms of inner products. And here's what I mean by that. So say that the parameter w is this sum over your input examples. And suppose we need to make a prediction: someone gives you a new value of x, and you want the value of the hypothesis at that x.
That's given by g of w transpose x plus b, where g was this threshold function that outputs minus 1 or plus 1. And so you need to compute w transpose x plus b. And that is equal to the sum over i of alpha i yi times the inner product of xi with x, plus b. And so that can be expressed as a sum of these inner products between your training examples and this new value of x. And this will lead into our next lecture, which is the idea of kernels. And it turns out that in the sorts of feature spaces we'll use with support vector machines, sometimes your training examples may be very high-dimensional. It may even be the case that the features that you want to use are infinite-dimensional feature vectors. But despite this, it'll turn out that there'll be an interesting representation that you can use that will allow you to compute inner products like these efficiently. And this holds true only for certain feature spaces - it doesn't hold true for arbitrary sets of features. But that's the idea we'll talk about with kernels.
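The prediction step just described can be sketched as follows - hypothetical toy values again, with the score computed purely from inner products between training examples and the new x:

```python
import numpy as np

def predict(x_new, X, y, alpha, b):
    """g(w^T x + b) where w^T x = sum_i alpha_i y_i <x_i, x_new>."""
    score = np.sum(alpha * y * (X @ x_new)) + b
    return 1 if score >= 0 else -1   # g thresholds at zero

# Hypothetical toy values (not an optimized SVM solution):
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
b = 0.0

print(predict(np.array([2.0, 0.0]), X, y, alpha, b))   # -> 1
print(predict(np.array([-2.0, 0.0]), X, y, alpha, b))  # -> -1
```

Since most alpha i's are 0, in practice the sum only runs over the support vectors.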
In the next lecture, we'll see examples where, even though you have extremely high-dimensional feature vectors, you may never want to represent xi explicitly as an infinite-dimensional feature vector - you couldn't even store it in computer memory. But you will nonetheless be able to compute inner products between these feature vectors very efficiently. And so you can, for example, make predictions by making use of these inner products - this is just xi transpose x. You can compute these inner products very efficiently and, therefore, make predictions. And this point also - the other reason we derived the dual was because, on this board, when we worked out what W of alpha is, W of alpha has the same property - W of alpha is again written in terms of these inner products. And so if you actually look at the dual optimization problem and step through all the steps of the algorithm, you'll find that you can actually do everything you want - learn the parameters alpha. So you solve the dual optimization problem for the parameters alpha, and you can do everything you want without ever needing to represent xi directly.
And all you need to do is compute inner products with your feature vectors like these. Well, one last property of this algorithm that's kinda nice is that, as I said previously, the alpha i's are non-zero only for the support vectors - only for the vectors whose functional margin is 1. And in practice, there are usually fairly few of them. And so what this means is that if you're representing w this way, then w is represented by a fairly small fraction of the training examples, because most of the alpha i's are 0. And so when you're summing up the sum, you need to compute inner products only with the support vectors, which are usually a small fraction of your training set. So that's another nice property, because most of the alpha i's are 0. And well, much of this will make much more sense when we talk about kernels. Any quick questions before I close? Yeah.
Student: It seems that for any of this to work, the point set has to be really well behaved, and if any of the points are kinda on the wrong side - Instructor (Andrew Ng): No - oh, yeah, so again, today's lecture assumes that the data is linearly separable - that you can actually get perfect separation. I'll fix this in the next lecture as well. But excellent observation. Yes? Student: So can we assume that [inaudible] point [inaudible], so [inaudible] have [inaudible]? Instructor (Andrew Ng): Yes - so there are ways to generalize this to multiple classes that I probably won't [inaudible] - but yeah, that's a generalization [inaudible]. Okay, let's close for today, then. We'll talk about kernels in our next lecture.