This subtitle is provided from lecsub.jimdo.com under a Creative Commons license. This presentation is delivered by the Stanford Center for Professional Development.

What I want to do today is continue our discussion of Naive Bayes, the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models for Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something I actually won't spend a lot of time on, and then I want to start to talk about support vector machines, which is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold it, and we'll start discussing that today; this will actually take us a few lectures to complete.

So let's talk about Naive Bayes. To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for Naive Bayes, in which we create feature vectors like these that correspond to words in a dictionary.
And so, based on which words appear in a piece of email, we represented the email as a feature vector with ones and zeros in the corresponding places. Naive Bayes is a generative learning algorithm, and by that I mean it's an algorithm in which we model p(x|y); for Naive Bayes specifically, we modeled it as p(x|y) = ∏_{i=1}^{n} p(x_i|y), and we also model p(y). We then use Bayes' rule to combine these two, and so our prediction, when you're given a new piece of email and want to tell whether it's spam or not spam, is arg max over y of p(y|x), which by Bayes' rule is arg max over y of p(x|y) p(y), okay? So this is Naive Bayes, and just to draw attention to two things: one is that in this model each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary, so it might be on the order of 50,000 words, say. What I want to do now is describe two variations on this algorithm.
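Before moving on to the variations, here is a minimal sketch of that arg max prediction rule in Python. The vocabulary, parameter values, and function name are all made up for illustration; this is not code from the course, just the Bayes-rule computation carried out in log space:

```python
import numpy as np

def predict_bernoulli_nb(x, phi_y, phi_j_y1, phi_j_y0):
    """Return arg max_y p(x|y) p(y) for a binary feature vector x.
    phi_j_y1[j] = p(x_j = 1 | y = 1) and phi_j_y0[j] = p(x_j = 1 | y = 0)."""
    # log p(y=1) + sum_j log p(x_j | y=1), using the Naive Bayes factorization
    log_p1 = np.log(phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_y1), np.log(1 - phi_j_y1)))
    log_p0 = np.log(1 - phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_y0), np.log(1 - phi_j_y0)))
    return 1 if log_p1 > log_p0 else 0

# Made-up parameters for a toy 3-word vocabulary ["buy", "hello", "meeting"]:
phi_y = 0.4                           # p(y = 1), i.e. p(spam)
phi_j_y1 = np.array([0.8, 0.3, 0.1])  # p(word j present | spam)
phi_j_y0 = np.array([0.1, 0.5, 0.6])  # p(word j present | not spam)

print(predict_bernoulli_nb(np.array([1, 0, 0]), phi_y, phi_j_y1, phi_j_y0))
print(predict_bernoulli_nb(np.array([0, 1, 1]), phi_y, phi_j_y1, phi_j_y0))
```

Working in log space avoids numerical underflow when the dictionary, and hence the product over features, is large.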
The first one is the simpler one; it's just a generalization to the case where x_i takes on more values. One thing that's commonly done is to apply Naive Bayes to problems where some of these features x_i take on k values rather than just two, and in that case you build a very similar model where p(x|y) is really the same thing, except that now these are multinomial probabilities rather than Bernoullis, because each x_i can take on up to k values. One situation where this arises very commonly is when you have a feature that's actually continuous-valued and you choose to discretize it, turning the continuous-valued feature into a finite set of k values. A perfect example, if you remember our very first supervised learning problem of predicting the price of houses: suppose you have a classification problem on these houses where, based on features of a house, you want to predict whether or not the house will be sold in the next six months, say.
That's a classification problem, and if you use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do would be to take the continuous-valued living area and just discretize it into a few discrete buckets. So depending on whether the living area of the house is less than 500 square feet, between 1,000 and 1,500 square feet, and so on, or greater than 2,000 square feet, you choose the value of the corresponding feature x_i to be one, two, three, or four, okay? So that was the first variation or generalization of Naive Bayes I wanted to talk about. I should just check; are there questions about this? Okay, cool. And it turns out that in practice it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing. The second and final variation I want to talk about for Naive Bayes is a variation that's specific to classifying text documents, or, more generally, to classifying sequences.
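The bucketing just described can be sketched with `numpy.digitize`; the bucket edges below are illustrative, not the lecture's exact numbers:

```python
import numpy as np

# Illustrative bucket edges for living area (square feet): below 500 is
# bucket 1, 500 to 1,000 is bucket 2, 1,000 to 2,000 is bucket 3,
# above 2,000 is bucket 4.
edges = np.array([500.0, 1000.0, 2000.0])

def discretize(living_area):
    """Map a continuous living area to a discrete feature value x_i in {1, 2, 3, 4}."""
    return int(np.digitize(living_area, edges)) + 1

print(discretize(450.0))
print(discretize(1200.0))
print(discretize(2500.0))
```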
So a text document, like a piece of email, you can think of as a sequence of words, and you can apply the model I'm about to describe to classifying other sequences as well, but let me just focus on text. Here's the idea. In Naive Bayes, given a piece of email, we were representing it using this binary vector-valued representation, and one of the things this loses, for instance, is the number of times that different words appear. For example, suppose some word appears a lot of times: you see the word "buy" a lot of times, or you see the word "Viagra" (it seems to be a common email example) a ton of times in the email. Then it is more likely to be spam than if it appears only once, although even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email. And to give this previous model a name as well: this particular model for text classification is called the multivariate Bernoulli event model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really, don't worry about what the name means.
In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the multinomial event model; it's just called that. And here's what we're going to do: given a piece of email, I'm going to represent my email as a feature vector, and so my i-th training example x^(i) will be a feature vector (x_1^(i), x_2^(i), ..., x_{n_i}^(i)), where n_i is equal to the number of words in this email. So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector, let me just write it as x_j, will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to which word appeared in the j-th position of my email, okay?
So, in other words, I'm going to take all the words in my email and have a feature vector that just says which word in my dictionary each word in the email was, okay? So note the different definition for n_i now: n_i varies and is different for every training example, and this x_j is now an index into the dictionary. The components of the feature vector are no longer binary random variables; they're indices into the dictionary that take on a much larger set of values. And so our generative model for this will be that the joint distribution over x and y is p(x, y) = (∏_{j=1}^{n} p(x_j|y)) p(y), where again n is now the length of the email, all right? The way to think about this formula is to imagine that there was some probability distribution over emails, some random process that generates the emails, and that process proceeds as follows: first y, the class label, is chosen; whether someone is going to send you spam email or not spam email is chosen for us.
So first y, the random variable giving the class label of spam or not spam, is generated, and then, having decided whether to send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates each word from some distribution that depends on whether they chose to send you spam or not spam. So if they sent you spam, they'll tend to generate words like buy, and Viagra, and whatever, discounts, sale, whatever; and if somebody chose to send you not spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful: x_i here has a very different definition from the previous event model, and n has a very different definition from the previous event model. And so the parameters of this model are, let's see, φ_{k|y=1} = p(x_j = k | y = 1), which is the probability that, conditioned on someone deciding to send you spam, the next word they choose to put in the spam email is going to be word k; similarly, φ_{k|y=0} = p(x_j = k | y = 0); and φ_y, just the same as before, okay?
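As a concrete sketch of this representation and generative model, with a made-up five-word dictionary and made-up parameter values (a real dictionary would have around 50,000 entries):

```python
import numpy as np

# Made-up 5-word dictionary; a real one would have ~50,000 entries.
dictionary = {"buy": 0, "now": 1, "hello": 2, "meeting": 3, "viagra": 4}

def email_to_indices(words):
    """Multinomial event model representation: one dictionary index per word
    position, so the length n_i of the vector equals the length of the email."""
    return [dictionary[w] for w in words]

def log_joint(x, y, phi_y, phi_k_y1, phi_k_y0):
    """log p(x, y) = sum_j log p(x_j | y) + log p(y)."""
    phi = phi_k_y1 if y == 1 else phi_k_y0
    log_py = np.log(phi_y if y == 1 else 1.0 - phi_y)
    return log_py + sum(np.log(phi[j]) for j in x)

phi_y = 0.4                                         # p(y = 1), i.e. p(spam)
phi_k_y1 = np.array([0.4, 0.2, 0.1, 0.1, 0.2])      # p(word k | spam)
phi_k_y0 = np.array([0.05, 0.15, 0.35, 0.35, 0.1])  # p(word k | not spam)

x = email_to_indices(["buy", "viagra", "buy"])      # n_i = 3, repeats kept
spammier = log_joint(x, 1, phi_y, phi_k_y1, phi_k_y0) > \
           log_joint(x, 0, phi_y, phi_k_y1, phi_k_y0)
print(x, spammier)
```

Note how the repeated word "buy" contributes twice to the product, which is exactly the count information the multivariate Bernoulli model throws away.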
So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. The maximum likelihood estimate of the parameter will be equal to, and now I'm going to write one of these big indicator-function things again:

φ_{k|y=1} = ( Σ_{i=1}^{m} 1{y^(i) = 1} Σ_{j=1}^{n_i} 1{x_j^(i) = k} ) / ( Σ_{i=1}^{m} 1{y^(i) = 1} n_i )

where n_i is the number of words in email i in your training set, okay? So the numerator says: sum over all your emails, taking into account only the emails that had class label one, only the emails that were spam, because if y^(i) equals zero, then the indicator is zero and the term goes away; and then sum over all the words in that spam email, counting up the number of times you observed the word k. So, in other words, the numerator is: look at all the spam emails in your training set and count up the total number of times the word k appeared in those emails. The denominator then is: sum over i in your training set, and whenever one of your examples is spam, add up the length of that spam email; so the denominator is the total length of all of the spam emails in your training set.
And so the ratio is just: out of all your spam emails, what fraction of the words were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay? At the end of the previous lecture, I talked about Laplace smoothing, and so when you do that as well, you add one to the numerator and k to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you do the same thing for, well, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry, on the right at the top, I was just wondering what the x_i is, and what the n is.

Instructor (Andrew Ng): Right. So in this second event model, the definition for x_i and the definition for n are different, right? So here, this is for one example (x, y). Here, n is the number of words in a given email, right? And if it's the i-th email, we subscript it as n_i, and so n will be different for different training examples; and here x_i will take these values from 1 to 50,000, and x_i is essentially the identity of the i-th word in a given piece of email, okay?
So that's why this is a product over all the different words of your email of the probability of the i-th word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation; I shouldn't have used k here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry, you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be k; this should be 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing, what you add to the denominator is the number of possible values that the random variable x_i can take on. Cool. Raise your hand if this makes sense? Okay. Are there more questions about this?

Student: On Laplace smoothing, the term you add to the denominator, is that the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see. So Laplace smoothing is a method to give you, hopefully, better estimates of a probability distribution over a multinomial, and so, was I using x or y in the previous lecture?
So, in trying to estimate the probability over a multinomial: I think x and y are different. Was it x or y? I think it was x, actually. Well, oh, I see, right, right: I think I was using a different definition for the random variable, because suppose you have a multinomial random variable x which takes on, let's use a different letter, suppose you have a multinomial random variable x which takes on L different values. Then the maximum likelihood estimate for the probability that x equals k, p(x = k), will be equal to the number of observations of x = k divided by the total number of observations of x, okay? So that's the maximum likelihood estimate. And to add Laplace smoothing to this, you add one to the numerator, and you add L to the denominator, where L is the number of possible values that x can take on. So, in this case, this is the probability that x equals k, and x can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but that's why I add 50,000 to the denominator. Are there other questions? Yeah.
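The smoothing recipe just described, in a quick sketch (the observations here are made up):

```python
from collections import Counter

def laplace_mle(observations, L):
    """Laplace-smoothed multinomial estimate over values 1..L:
    p(x = k) = (count(k) + 1) / (n + L)."""
    counts = Counter(observations)
    n = len(observations)
    return {k: (counts[k] + 1) / (n + L) for k in range(1, L + 1)}

# x takes L = 3 values; value 3 is never observed, but smoothing still
# gives it non-zero probability.
probs = laplace_mle([1, 1, 2, 1], L=3)
print(probs)
```

Here value 1 gets probability 4/7, value 2 gets 2/7, the unobserved value 3 gets 1/7, and the smoothed estimates still sum to one.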
Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple of times, and all the examples make sense, but I don't know what the general formula for it is.

Instructor (Andrew Ng): I see, yeah, right. So the question is: what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian discriminant analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work them out is to write down the likelihood. So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters φ_{k|y=1}, φ_{k|y=0}, φ_y, right? And so given a training set, the likelihood, I guess I should be writing the log likelihood, will be the log of the product over i = 1 to m of p(x^(i), y^(i)), parameterized by these things, okay? Where p(x^(i), y^(i)) is given by the product over the n_i words, ∏_{j=1}^{n_i} p(x_j^(i) | y^(i)),
they are parameterized by, well, I'll just drop the parameters to write this more simply, times p(y^(i)), okay? So this is my log likelihood, and so the way you get the maximum likelihood estimates of the parameters is that, given a fixed training set, a fixed set of x^(i), y^(i)'s, you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that I've been writing out. So in a previous section of today's lecture I wrote out some maximum likelihood estimates for the Gaussian discriminant analysis model and for Naive Bayes, and I didn't prove them, but you get to, sort of, play with that yourself in a homework problem as well for one of these models, and you'll be able to verify that when you maximize the likelihood, maximize the log likelihood, you do hopefully get the same formulas as what I was drawing up on the board. But the way these are derived is by maximizing this, okay? Cool. All right.
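Tying the estimation discussion together, here is a sketch of the count-based estimate of φ_{k|y=1} on a toy training set. The emails and vocabulary size are made up, and the smoothed version adds one to the numerator and the vocabulary size to the denominator, as corrected above:

```python
V = 5  # vocabulary size (50,000 in the lecture; tiny here for illustration)

# Toy training set: each email is a list of dictionary indices plus a label.
emails = [([0, 4, 0], 1),        # spam
          ([0, 1], 1),           # spam
          ([2, 3, 3, 2], 0)]     # not spam

def phi_k_spam(k, smooth=False):
    """phi_{k|y=1}: total count of word k across spam emails divided by the
    total number of words in spam emails. With Laplace smoothing, add 1 to
    the numerator and V (the number of values x_j can take) to the denominator."""
    num = sum(x.count(k) for x, y in emails if y == 1)
    den = sum(len(x) for x, y in emails if y == 1)
    return (num + 1) / (den + V) if smooth else num / den

print(phi_k_spam(0))               # 3 of the 5 spam words are word 0
print(phi_k_spam(2, smooth=True))  # unseen word still gets non-zero probability
```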
So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, the last Naive Bayes model, the multinomial event model, almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification. One reason that's hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesn't. I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and researchers are still debating about it a little bit. But if you ever have a text classification problem, you know, the Naive Bayes classifier is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position invariant, right? It doesn't actually care where the words are.
Instructor (Andrew Ng): Yes, that's right.

Student: And if you had a model that did care about word order, does that usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing. So in natural language processing, there's actually another name for it: it's called a unigram model in natural language processing, and there are some other models, say higher-order Markov models, that take into account some of the ordering of the words. It turns out that for text classification, the models like the bigram models or trigram models, I believe, do only very slightly better, if at all, but that's when you're applying them to text classification, okay? All right. So the next thing I want to talk about is to start again our discussion of non-linear classifiers.
So it turns out, well, the very first classification algorithm we talked about was logistic regression, which has this familiar form for the hypothesis, and you can think of this as predicting one when this estimated probability is greater than or equal to 0.5 and predicting zero when it is less than 0.5. And given a training set, logistic regression will maybe do gradient descent or something, or use Newton's method, to find a straight line that reasonably separates the positive and negative classes. But sometimes a data set just can't be separated by a straight line, so is there an algorithm that will let you start to learn these sorts of non-linear decision boundaries? And so how do you go about getting a non-linear classifier? And, by the way, one cool result: remember when we talked about generative learning algorithms, I said that if you assume x given y = 1 is exponential family with natural parameter η₁, and x given y = 0 is exponential family with natural parameter η₀, and you build a generative learning algorithm using this, right?
I think when we talked about Gaussian discriminant analysis, I said that if this holds true, then you end up with a logistic posterior. It actually turns out that the Naive Bayes model falls into this as well; the Naive Bayes model actually falls into this exponential family as well, and, therefore, under the Naive Bayes model, you're actually using a linear classifier as well, okay? So the question is: how can you start to get non-linear classifiers? And I'm going to talk about one method today, which we started to talk about very briefly, which is taking a simpler algorithm, like logistic regression, and using it to build up to more complex non-linear classifiers, okay? So to motivate this discussion, I'm going to use a little picture. Suppose you have features x1, x2, and x3, and so, by convention, I'm going to follow our earlier convention that x0 is set to one, and I'm going to use a little diagram like this to denote our logistic regression unit, okay?
So think of a little picture like that, you know, this little circle, as denoting a computation node that takes as input, you know, several features and then outputs another number, H subscript theta of X, given by a sigmoid function, and so this little computational unit will have parameters theta. Now, in order to get non-linear decision boundaries, all we need to do - well, at least one thing to do - is just come up with a way to represent hypotheses that can output non-linear decision boundaries, right? And so when you put together a bunch of those little pictures that I drew on the previous board, you can then get what's called a Neural Network, in which you think of having my features here, and then I would feed them to, say, a few of these little sigmoidal units, and these together will feed into yet another sigmoidal unit, say, which will output my final output H subscript theta of X, okay? And just to give these things names, let me call the values output by these three intermediate sigmoidal units A1, A2, A3. And let me just be completely concrete about what this formula represents, right?
So each of these units in the middle will have its own associated set of parameters, and so the value A1 will be computed as G of X transpose theta one, for some set of parameters which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta two, and A3 will be G of X transpose theta three, where G is the sigmoid function, all right? So G of Z equals one over one plus E to the negative Z, and then, finally, our hypothesis will output G of A transpose theta four, right? Where, you know, this A vector is a vector of A1, A2, A3. We can append another one to the front of it if you want, okay? Let me just draw up here - I'm sorry about the cluttered board. And so H subscript theta of X, this is a function of all the parameters theta one through theta four, and so one way to learn parameters for this model is to write down a cost function, say, J of theta equals one-half sum from I equals one to M of YI minus H subscript theta of XI, squared, say.
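As a rough sketch, the forward computation just described might look like this in code - assuming the bias term is folded into X, and with illustrative names that are not from the lecture:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(u_j * v_j for u_j, v_j in zip(u, v))

def forward(x, theta1, theta2, theta3, theta4):
    # hidden units: a_j = g(x^T theta_j) for j = 1, 2, 3
    a = [sigmoid(dot(x, th)) for th in (theta1, theta2, theta3)]
    # output unit: h_theta(x) = g(a^T theta4)
    return sigmoid(dot(a, theta4))
```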
Okay, so that's our familiar quadratic cost function, and so one way to learn the parameters of an algorithm like this is to just use gradient descent to minimize J of theta as a function of theta, okay? So you use gradient descent to minimize this squared error, which, stated differently, means you use gradient descent to make the predictions of your neural network as close as possible to what you observed as the labels in your training set, okay? So it turns out gradient descent on this neural network has a specific name: the algorithm that implements gradient descent is called back propagation, and so if you ever hear that, all that means is - it just means gradient descent on a cost function like this, or a variation of this, on a neural network that looks like that, and - well, this algorithm actually has some advantages and disadvantages, but let me actually show you. So, let's see. One of the interesting things about the neural network is that you can look at what these intermediate nodes are computing, right?
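One crude way to sketch that training loop in code is to use a numerical gradient in place of the back propagation equations themselves, which the lecture doesn't derive - again, this is illustrative only:

```python
def cost(params, xs, ys, predict):
    # J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2
    return 0.5 * sum((y - predict(params, x)) ** 2 for x, y in zip(xs, ys))

def gradient_step(params, xs, ys, predict, lr=0.1, eps=1e-5):
    # estimate dJ/dtheta_k by finite differences and take one descent step;
    # back propagation computes the same gradient, just much more efficiently
    base = cost(params, xs, ys, predict)
    grad = []
    for k in range(len(params)):
        bumped = list(params)
        bumped[k] += eps
        grad.append((cost(bumped, xs, ys, predict) - base) / eps)
    return [p - lr * g for p, g in zip(params, grad)]
```

Here `predict` stands for whatever hypothesis you've wired up, such as the network's forward computation; each step nudges the parameters so the network's predictions move closer to the training labels.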
So this neural network has what's called a hidden layer before you then have the output layer, and, more generally, you can actually have inputs feed into these computation units, which feed into more layers of computation units, to even more layers, to more layers, and then finally you have an output layer at the end. And one cool thing you can do is look at all of these intermediate units, look at these units in what's called a hidden layer of the neural network - don't worry about why it's called that - look at the computations of the hidden units, and ask: what are the hidden units of the neural network computing? So to, maybe, get a better sense of what neural networks might be doing, let me show you a video - I'm gonna switch to the laptop - this is made by a friend, Yann LeCun, who's currently a professor at New York University. Can I show a video on the laptop? So let me show you a video from Yann LeCun on a neural network that he developed for handwritten digit recognition. There was one other thing he did in this neural network that I'm not gonna talk about, called a Convolutional Neural Network. Well, his system is called LeNet, and let's see. Would you put on the laptop display?
Hum, actually maybe if - or you can just put it on the screen on the side; that would work too if the big screen isn't working. Let's see. I'm just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come on? Well, let me say a few more things about neural networks. So it turns out that when you write down a quadratic cost function like I wrote down on the chalkboard just now, it turns out that, unlike logistic regression, that will almost always correspond to a non-convex optimization problem, and so whereas for logistic regression, if you run gradient descent or Newton's method or whatever, you converge to the global optimum, this is not true for neural networks. In general, there are lots of local optima, and it's, sort of, a much harder optimization problem. So with neural networks, if you're, sort of, familiar with them, and you're good at making design choices like what learning rate to use, and how many hidden units to use, and so on, you can, sort of, get them to be fairly effective, and there are, sort of, often ongoing debates about, you know, is this learning algorithm better, or is that learning algorithm better?
The vast majority of machine learning researchers today seem to perceive support vector machines, which is what I'll talk about later, to be a much more effective off-the-shelf learning algorithm than neural networks. This point of view is contested a bit, but so neural networks are not something that I personally use a lot right now, because it's a hard optimization problem and it doesn't always converge; it actually, sort of, works - it, sort of, works reasonably well. It's just, because this is fairly complicated, it's not an algorithm that I use commonly or that my friends use all the time. Oh, cool. So let me just go and show you an example of a neural network, which was for many years, you know, the most effective learning algorithm before support vector machines were invented. So here's Yann LeCun's video, and - well, there's actually audio on this too from the soundboard, so I'll just tell you what's happening. What you're seeing is a trained neural network, and this display where my mouse pointer is pointing, right, this big three there, is the input to the neural network. So you're showing the neural network this image, and it's trying to recognize what this is.
The final answer output by the neural network is this number up here, right below where it says LeNet-5, and the neural network correctly recognizes this image as a three, and if you look to the left of this image, what's interesting about this is the display on the left portion is actually showing the intermediate computations of the neural network. In other words, it's showing you what the hidden layers of the neural network are computing. And so, for example, if you look at this one, the third image down from the top, this seems to be computing, you know, certain edges of the digits, right? Maybe edges on the right-hand side or the bottom of the input image, okay? So let me just play this video, and you can see some of the inputs and outputs of the neural network. And those are very different fonts. There's this robustness to noise. All right. Multiple digits, that's, kind of, cool. All right. So, just for fun, let me show you one more video, which was - let's see.
This is another video, from a documentary called The Machine That Changed the World, which was produced by WGBH Television in cooperation with the British Broadcasting Corporation, and it was aired on PBS a few years ago, I think. I want to show you a video describing the NETtalk Neural Network, which was developed by Terry Sejnowski; he's a researcher. And so NETtalk was actually one of the major milestones in the history of neural networks, and this specific application is getting the neural network to read text. So, in other words, can you show a piece of English to a computer and have the computer read - sort of, verbally produce sounds that correspond to the reading of the text? And it turns out that in the history of AI and the history of machine learning, this video created a lot of excitement about neural networks and about machine learning. Part of the reason was that Terry Sejnowski had the foresight to choose to use, in his video, a child-like voice talking about visiting your grandmother's house and so on.
You'll see it in a second, and so this really created the perception - created the impression of the neural network being like a young child learning how to speak, and talking about going to your grandmother's, and so on. So this actually helped generate a lot of excitement within academia and outside academia about neural networks, sort of, early in the history of neural networks. I'm just gonna show you the video. [Begin Video] You're going to hear first what the network sounds like at the very beginning of the training, and it won't sound like words, but it'll sound like attempts that will get better and better with time. [Computer's voice] The network takes the letters, say the phrase, grandmother's house, and makes a random attempt at pronouncing it. [Computer's voice] Grandmother's house. The phonetic difference between the guess and the right pronunciation is sent back through the network. [Computer's voice] Grandmother's house. By adjusting the connection strengths after each attempt, the net slowly improves. And, finally, after letting it train overnight, the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's house. Well, because she gives us candy. Well, and we - NETtalk understands nothing about the language.
It is simply associating letters with sounds. [End Video] All right. So at the time this was done, I mean, this was an amazing piece of work. I should say today there are other text-to-speech systems that work better than what you just saw, and you'll also appreciate that getting candy from your grandmother's house is a little bit less impressive than talking about the Dow Jones falling 15 points, and profit taking, whatever. So - but I wanted to show that just because that was another cool, major landmark in the history of neural networks. Okay. So let's switch back to the chalkboard, and what I want to do next is tell you about Support Vector Machines, okay? That, sort of, wraps up our discussion on neural networks. So I started off talking about neural networks by motivating them as a way to get us to output non-linear classifiers, right? I didn't really prove it, but it turns out that you'd be able to come up with non-linear decision boundaries using a neural network like what I drew on the chalkboard earlier. Support Vector Machines will be another learning algorithm that will give us a way to come up with non-linear classifiers.
It's a very effective, off-the-shelf learning algorithm, but it turns out that in the discussion I'm gonna - in the progression and development I'm gonna pursue, I'm actually going to start off by describing yet another class of linear classifiers with linear decision boundaries, and it'll only be later, sort of, probably in the next lecture or the one after that, that we'll then take the support vector machine idea and, sort of, do some clever things to it to make it work very well to generate non-linear decision boundaries as well, okay? But we'll actually start by talking about linear classifiers a little bit more. And to do that, I want to convey two intuitions about classification. One is, if you think about logistic regression, we have this logistic function that was outputting the probability that Y equals one, and it crosses 0.5 right where theta transpose X equals zero. So when you run logistic regression, I want you to think of it as an algorithm that computes theta transpose X, and then it predicts one, right, if and only if theta transpose X is greater than zero, right? IFF stands for if and only if; it means the same thing as a double implication. And it predicts zero if and only if theta transpose X is less than zero, okay?
So if it's the case that theta transpose X is much greater than zero - the double greater-than sign means much greater than, all right - so if theta transpose X is much greater than zero, then, you know, think of that as a very confident prediction that Y is equal to one, right? If theta transpose X is much greater than zero, then we're gonna predict one, and moreover we're very confident it's one, and the picture for that is, if theta transpose X is way out here, then we're estimating that the probability of Y being equal to one, on the sigmoid function, will be very close to one. And, in the same way, if theta transpose X is much less than zero, then we're very confident that Y is equal to zero. So wouldn't it be nice - so when we fit logistic regression or some other classifier to a training set - wouldn't it be nice if, right, for all I such that YI is equal to one, we have theta transpose XI much greater than zero, and for all I such that YI is equal to zero, we have theta transpose XI much less than zero, okay? So wouldn't it be nice if this is true?
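A quick numeric sketch of this "confident prediction" intuition - the particular values here are just examples, not from the lecture:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# when theta^T x >> 0, the estimated P(y = 1 | x) is close to one,
# and when theta^T x << 0, it is close to zero
very_positive = sigmoid(6.0)    # a confident prediction that y = 1
very_negative = sigmoid(-6.0)   # a confident prediction that y = 0
borderline = sigmoid(0.1)       # barely above 0.5: correct, but an
                                # unconfident prediction of y = 1
```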
That, essentially, on our training set, we can find parameters theta so that our learning algorithm not only makes correct classifications on all the examples in the training set, but, further, is, sort of, very confident about all of those correct classifications. This is the first intuition that I want you to have, and we'll come back to this first intuition in a second when we talk about functional margins, okay? We'll define this later. The second intuition that I want to convey - and it turns out, for the rest of today's lecture, I'm going to assume that the training set is linearly separable, okay? So by that I mean, for the rest of today's lecture, I'm going to assume that there is indeed a straight line that can separate your training set, and we'll remove this assumption later, but just to develop the algorithm, let's assume a linearly separable training set. And so there's a sense that, out of all the straight lines that separate the training set, you know, maybe that straight line isn't such a good one, and that one actually isn't such a great one either, but maybe that line in the middle is a much better linear separator than the others, right? And one reason that, when you and I look at it, this one seems best is because this line is just further from the data, all right?
That is, it separates the data with a greater distance between your positive and your negative examples and the decision boundary, okay? And this second intuition - we'll come back to it shortly - about this final line that I drew being, maybe, the best line: it's this notion of distance from the training examples. This is the second intuition I want to convey, and we'll formalize it later when we talk about geometric margins of our classifiers, okay? So in order to describe the support vector machine, unfortunately, I'm gonna have to pull a notation change, and, sort of, unfortunately, it, sort of, was impossible to do logistic regression, and support vector machines, and all the other algorithms using one completely consistent notation, and so I'm actually gonna change notation slightly for linear classifiers, and that will actually make it much easier for us - that'll make it much easier later today and in next week's lectures to actually talk about the support vector machine.
But the notation that I'm gonna use for the rest of today and for most of next week will be that my labels Y, instead of being zero, one, will be minus one and plus one, and in the development of the support vector machine we will have H, have a hypothesis, output values that are either plus one or minus one, and so we'll let G of Z be equal to one if Z is greater than or equal to zero, and minus one otherwise, right? So, just rather than zero and one, we change everything to plus one and minus one. And, finally, whereas previously I wrote H subscript theta of X equals G of theta transpose X, and we had the convention that X zero is equal to one, right, and so X is in RN plus one - I'm gonna drop this convention of letting X zero equal one and letting X be in RN plus one, and instead I'm going to parameterize my linear classifier as H subscript W, B of X equals G of W transpose X plus B, okay? And so B just now plays the role of theta zero, and W now plays the role of the rest of the parameters, theta one through theta N, okay?
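In code, this change of convention might look like the following sketch (illustrative, not from the lecture):

```python
def g(z):
    # threshold function: +1 if z >= 0, -1 otherwise
    # (rather than the 0/1 outputs we used before)
    return 1 if z >= 0 else -1

def h(w, b, x):
    # h_{w,b}(x) = g(w^T x + b); b plays the role of the old theta_0,
    # so x no longer carries an extra x_0 = 1 coordinate
    return g(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
```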
So, just by separating out the intercept B, rather than lumping it together, it'll make it easier for us to develop support vector machines. So - yes. Student:[Off mic]. Instructor (Andrew Ng):Oh, yes. Right, yes. So W is - right. So W is a vector in RN, and X is now a vector in RN rather than RN plus one, and lowercase B is a real number. Okay. Now, let's formalize the notions of functional margin and geometric margin. Let me make a definition. I'm going to say that the functional margin of the hyperplane W, B with respect to a specific training example, XI, YI - WRT stands for with respect to - the functional margin of a hyperplane W, B with respect to a certain training example XI, YI is defined as gamma hat I equals YI times W transpose XI plus B, okay? And so a set of parameters W, B defines a classifier - it, sort of, defines a linear separating boundary, and so when I say hyperplane, I just mean the decision boundary that's defined by the parameters W, B. You know what, if you're confused by the hyperplane term, just ignore it.
The functional margin of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And the interpretation of this is that if YI is equal to one, then for this to have a large functional margin, you want W transpose XI plus B to be large, right? And if YI is equal to minus one, then in order for the functional margin to be large - we, sort of, want the functional margins to be large, but in order for the functional margin to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins - the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small, and this, sort of, collapses the two cases into one statement: that we'd like the functional margin to be large. And notice also that so long as YI times W transpose XI plus B - so long as this is greater than zero, that means we classified it correctly, okay?
And one more definition: I'm going to say that the functional margin of a hyperplane with respect to an entire training set is - I'm going to define gamma hat to be equal to the min over all your training examples of gamma hat I, right? So if you have a training set, if you have more than one training example, I'm going to define the functional margin with respect to the entire training set as the worst case over all of the functional margins on the entire training set. And so, for now, you should think of the first intuition as saying that we would like the functional margin to be large, and for our purposes, for now, let's just say we would like the worst-case functional margin to be large, okay? And we'll change this a little bit later as well. Now, it turns out that there's one little problem with this intuition that will, sort of, trip us up later, which is that it actually turns out to be very easy to make the functional margin large, all right? So, for example, suppose I have a classifier with parameters W and B. If I take W and multiply it by two, and take B and multiply it by two, then - if you refer to the definition of the functional margin, I guess that was what?
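Both definitions are easy to write down as code - a sketch under the plus one/minus one label convention, with illustrative names:

```python
def functional_margin(w, b, x, y):
    # gamma_hat_i = y_i * (w^T x_i + b); this is positive exactly
    # when the example (x_i, y_i) is classified correctly
    return y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

def functional_margin_of_set(w, b, examples):
    # gamma_hat = min_i gamma_hat_i: the worst case over the training set
    return min(functional_margin(w, b, x, y) for x, y in examples)
```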
Gamma I, gamma hat I equals YI times W transpose XI plus B. If I double W and B, then I can easily double my functional margin. So this goal of making the functional margin large, in and of itself, isn't so useful, because it's easy to make the functional margin arbitrarily large just by scaling up the parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that the norm, the L2 norm, of the parameter W is equal to one, and we'll come back to this in a second. All right. And then so - okay.

Now, let's talk about - let's see how much time we have, 15 minutes. Well, see, I'm trying to decide how much to try to do in the last 15 minutes. Okay. So let's talk about the geometric margin, and so the geometric margin of a training example - [inaudible], right? So the decision boundary of my classifier is going to be given by the plane W transpose X plus B is equal to zero, okay? Right, and these are the X1, X2 axes, say, and we're going to draw relatively few training examples here.
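A tiny numeric check of this scaling problem, reusing the same hypothetical parameters as before: doubling W and B doubles the functional margin without changing the classifier at all, while dividing out the norm of W gives a number that doesn't move.

```python
import numpy as np

# Functional margin gamma_hat_i = y_i * (w^T x_i + b).
def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

w, b = np.array([2.0, -1.0]), 0.5
x, y = np.array([3.0, 1.0]), +1

m1 = functional_margin(w, b, x, y)
m2 = functional_margin(2 * w, 2 * b, x, y)  # rescale the parameters
assert m2 == 2 * m1  # margin doubled, decision boundary unchanged

# One possible fix: impose ||w|| = 1 before measuring the margin.
scale = np.linalg.norm(w)
m_norm = functional_margin(w / scale, b / scale, x, y)
print(m1, m2, m_norm)  # m_norm is invariant to rescaling (w, b)
```

This is exactly why the raw functional margin, by itself, is not a useful objective to maximize.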
Let's say I'm drawing deliberately few training examples so that I can add things to this, okay? And so, assuming we classify an example correctly, I'm going to define the geometric margin as just the geometric distance between the training example XI, YI and this separating line, this separating hyperplane, okay? That's what I'm going to define the geometric margin to be. And so I'm gonna do some algebra fairly quickly; in case it doesn't make sense, read through the lecture notes more carefully for details. Sort of, by standard geometry, the normal, or in other words, the vector that's 90 degrees to the separating hyperplane, is going to be given by W divided by the norm of W; that's just how planes in high dimensions work. If this stuff - for some of this you'll have to take a look at the lecture notes on the website. And so let's say this distance is gamma I, okay? And so I'm going to use the convention that I'll put a hat on top when I'm referring to functional margins, and no hat on top for geometric margins. So let's say the geometric margin of this example is gamma I.
That means that this point here, right, is going to be XI minus gamma I times W over the norm of W, okay? Because W over the norm of W is the unit vector, the length-one vector, that is normal to the separating hyperplane, and so when we subtract gamma I times the unit vector from this point XI - this point here is XI, so XI minus, you know, this little vector here, is going to be this point that I've drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I times W over the norm of W, okay?

And so because this heavy point is on the separating hyperplane, right, this point must satisfy W transpose times that point equals zero, right? Because all points X on the separating hyperplane satisfy the equation W transpose X plus B equals zero, and so this point is on the separating hyperplane, therefore it must satisfy W transpose this point - oh, excuse me - plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, I'm, sort of, going slightly fast through this geometry.
So if you're not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes. And so what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? So this equation I just wrote down - solve this equation for gamma, or gamma I, and you find that - you saw that previous equation for gamma I - well, why don't I just do it? You have W transpose XI plus B equals gamma I times W transpose W over the norm of W; that's just equal to gamma I times the norm of W, because W transpose W is the norm of W squared, and, therefore, gamma I is just W transpose XI plus B over the norm of W, okay? And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyperplane defined by the parameters W and B can be computed by this formula, okay?

So the last thing I want to do is actually take into account the sign of the - the correct classification of the training example. So I've been assuming that we've been classifying an example correctly.
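This distance calculation can be sketched in a few lines; the hyperplane below is a made-up example chosen so the answer is easy to see by eye, and the code also checks the step in the derivation that the projected point XI - gamma * W/||W|| lands exactly on the hyperplane.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Signed distance from x to the hyperplane w^T x + b = 0,
    namely (w^T x + b) / ||w||."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical hyperplane: x1 = 2, i.e. w = [1, 0], b = -2.
w, b = np.array([1.0, 0.0]), -2.0
x = np.array([5.0, 7.0])

gamma = distance_to_hyperplane(w, b, x)
print(gamma)  # 3.0: the point is three units to the right of x1 = 2

# Sanity check from the derivation: stepping gamma units back along
# the unit normal w/||w|| must land on the hyperplane itself.
projected = x - gamma * w / np.linalg.norm(w)
assert abs(np.dot(w, projected) + b) < 1e-9
```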
So, more generally, I'll define the geometric margin of a training example to be gamma I equals YI times that thing on top, okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so as before, you know, this says that we would like the geometric margin to be large, and all that means is that, so long as we're classifying the example correctly, we would ideally hope for the example to be as far as possible from the separating hyperplane, so long as it's on the right side of the separating hyperplane, and that's what multiplying YI into this does.

And so a couple of easy facts: one is that if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily; and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Let's see, okay. And so one final definition: so far I've defined the geometric margin with respect to a single training example, and so as before, I'll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right?
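Putting these definitions side by side makes the two easy facts concrete; the toy training set and parameters below are invented for illustration only.

```python
import numpy as np

def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # Geometric margin = functional margin / ||w||.
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Hypothetical toy training set with labels in {-1, +1}.
X = np.array([[3.0, 1.0], [-1.0, 2.0], [0.5, 0.5]])
Y = np.array([+1, -1, +1])
w, b = np.array([2.0, -1.0]), 0.5

# Margin of the whole training set: the worst case over all examples.
gammas = [geometric_margin(w, b, x, y) for x, y in zip(X, Y)]
gamma = min(gammas)
print(gamma)

# Easy fact: after rescaling so that ||w|| = 1, the functional and
# geometric margins coincide exactly.
scale = np.linalg.norm(w)
assert np.isclose(geometric_margin(w / scale, b / scale, X[0], Y[0]),
                  functional_margin(w / scale, b / scale, X[0], Y[0]))
```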
And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and so let me just write that down. The maximum margin classifier poses the following optimization problem. It says choose gamma, W, and B so as to maximize the geometric margin, subject to the constraint that YI times - well, this is just one way to write it, subject to - actually, do I write it like that? Yeah, fine. There are several ways to write this, and one of the things we'll do next time is actually - I'm trying to figure out if I can do this in five minutes. I'm guessing this could be difficult.

Well, so this maximum margin classifier is the maximization problem over parameters gamma, W, and B, and for now, it turns out that the geometric margin doesn't change depending on the norm of W, right? Because in the definition of the geometric margin, notice that we're dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesn't change the geometric margin. This will actually be important, and we'll come back to this later.
Notice that you can take the parameters W and B, and you can impose any normalization constant on them, or you can change W and B by any scaling factor and replace them by ten W and ten B, whatever, and it does not change the geometric margin, okay? And so in this first formulation, I'm just gonna impose a constraint and say that the norm of W is one, and so then the functional and the geometric margins will be the same, and then we'll say maximize the geometric margin subject to - you maximize gamma subject to the constraint that every training example must have geometric margin at least gamma, and this is a geometric margin because when the norm of W is equal to one, then the functional and the geometric margins are identical, okay?

So this is the maximum margin classifier, and it turns out that if you do this, it'll run, you know, maybe about as well as - maybe comparable to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to allow us to change this algorithm to let it work in infinite-dimensional feature spaces and come up with very efficient non-linear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step.
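This first formulation can be handed directly to a generic constrained solver; the sketch below is not how SVMs are solved in practice (the constraint ||W|| = 1 is non-convex, which is part of why the formulation gets reworked in later lectures), and the toy data set is invented purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
Y = np.array([+1, +1, -1, -1])

# Variables packed as v = [w1, w2, b, gamma].  Maximize gamma subject
# to y_i (w^T x_i + b) >= gamma for every i, and ||w|| = 1, so that
# gamma really is the geometric margin.
def neg_gamma(v):
    return -v[-1]

constraints = [{"type": "ineq",
                "fun": lambda v, x=x, y=y: y * (v[:2] @ x + v[2]) - v[-1]}
               for x, y in zip(X, Y)]
constraints.append({"type": "eq",
                    "fun": lambda v: np.linalg.norm(v[:2]) - 1.0})

res = minimize(neg_gamma, x0=np.array([1.0, 0.0, 0.0, 0.0]),
               constraints=constraints, method="SLSQP")
w, b, gamma = res.x[:2], res.x[2], res.x[3]
print(w, b, gamma)  # gamma: worst-case geometric margin achieved
```

On separable data like this, the solver pushes the hyperplane as far as possible from the closest examples on each side, which is exactly the worst-case margin being maximized.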
So are there questions about this? Yeah.

Student:[Off mic].

Instructor (Andrew Ng):For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything I've said is for a fixed training set, so you can't change the X's, and you can't change the Y's. Are there other questions? Okay. So all right.

Next week we will take this, and we'll talk about optimization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms, and just a final reminder again, the next discussion session will be on Matlab and Octave. So show up for that if you want to see a tutorial. Okay. See you guys in the next class.