This subtitle is provided from lecsub.jimdo.com under a Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

What I want to do today is continue our discussion of Naive Bayes, which is the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models for Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something that I actually won't spend a lot of time on, and then I want to start to talk about support vector machines. The support vector machine is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people who hold it, and we'll start discussing it today; this will actually take us a few lectures to complete.

So let's talk about Naive Bayes.
To recap from the previous lecture, I started off describing spam classification as the [inaudible] example for Naive Bayes, in which we would create feature vectors that correspond to words in a dictionary. And so, based on which words appear in a piece of email, the email is represented as a feature vector with ones and zeros in the corresponding places. Naive Bayes is a generative learning algorithm, and by that I mean it's an algorithm in which we model P(x|y); for Naive Bayes specifically, we modeled it as the product from i equals one to n of P(xi|y), and we also model P(y). We then use Bayes' rule to combine these two together, and so when you're given a new piece of email and want to tell whether it's spam or not spam, you predict arg max over y of P(y|x), which by Bayes' rule is arg max over y of P(x|y) times P(y), okay?
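As an aside (this sketch is not from the lecture itself), the prediction rule just described can be written out in a few lines for the binary-feature model, assuming the parameters — called `phi_y`, `phi_j_y1`, and `phi_j_y0` below, all illustrative names — have already been estimated from a training set:

```python
import math

def predict_bernoulli_nb(x, phi_j_y1, phi_j_y0, phi_y):
    """Predict y in {0, 1} for a binary feature vector x via
    arg max_y P(x|y) P(y), with P(x|y) = prod_j P(x_j|y)."""
    # Work in log space to avoid numerical underflow on long vectors.
    log_p1 = math.log(phi_y)
    log_p0 = math.log(1.0 - phi_y)
    for xj, p1, p0 in zip(x, phi_j_y1, phi_j_y0):
        log_p1 += math.log(p1 if xj == 1 else 1.0 - p1)
        log_p0 += math.log(p0 if xj == 1 else 1.0 - p0)
    return 1 if log_p1 > log_p0 else 0
```

Summing logs rather than multiplying probabilities is a standard trick, since with tens of thousands of features the raw product would underflow to zero.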
So this is Naive Bayes, and just to draw attention to two things: one is that in this model, each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary. So it might be, in this version, on the order of 50,000 words, say.

What I want to do now is describe two variations on this algorithm. The first one is the simpler one; it's just a generalization to the case where xi takes on more values. So one thing that's commonly done is to apply Naive Bayes to problems where some of these features, xi, take on k values rather than just two values, and in that case you build a very similar model, where P(x|y) is really the same thing, except that these are now going to be multinomial probabilities rather than Bernoullis, because the xi's can take on up to k values.
It turns out that one situation where this arises very commonly is when you have a feature that's actually continuous valued, and you choose to discretize it — you take the continuous-valued feature and discretize it into a finite set of k values. A perfect example, if you remember our very first supervised learning problem of predicting the price of houses: suppose you have a classification problem on these houses, so based on features of a house, you want to predict whether or not the house will be sold in the next six months, say. That's a classification problem, and if you use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do would be to take the continuous-valued living area and just discretize it into a few discrete buckets. And so depending on whether the living area of the house is less than 500 square feet, or between 1,000 and 1,500 square feet, and so on, or whether it's greater than 2,000 square feet, you choose the value of the corresponding feature, xi, to be one, two, three, or four, okay?
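As a small sketch of this bucketing (the exact cutoffs here are illustrative, not ones fixed by the lecture):

```python
def discretize(living_area, cutoffs=(500, 1000, 1500, 2000)):
    """Map a continuous living area (square feet) to a discrete bucket
    index: 1 for values below the first cutoff, up through
    len(cutoffs) + 1 for values at or above the last cutoff."""
    for bucket, cutoff in enumerate(cutoffs, start=1):
        if living_area < cutoff:
            return bucket
    return len(cutoffs) + 1
```

The resulting integer can then be treated as a k-valued multinomial feature exactly as above.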
So that was the first variation, or generalization, of Naive Bayes I wanted to talk about. I should just check; are there questions about this? Okay. Cool. And so it turns out that in practice, it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing.

The second and, sort of, final variation that I want to talk about for Naive Bayes is a variation that's specific to classifying text documents or, more generally, to classifying sequences. A text document, like a piece of email, you can think of as a sequence of words, and you can apply the model I'm about to describe to classifying other sequences as well, but let me just focus on text. And here's the idea. So far, given a piece of email, we were representing it using this binary-valued vector representation, and one of the things that this loses, for instance, is the number of times that different words appear. So, for example, suppose some word appears a lot of times — say you see the word "buy" a lot of times, or you see the word "Viagra"; it seems to be a common email example.
If you see the word Viagra a ton of times in the email, it is more likely to be spam than if it appears, I guess, only once — because even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email, and let me give this previous model a name as well: this particular model for text classification is called the multivariate Bernoulli event model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really — don't worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the multinomial event model; it's just called that.
And here's what we're going to do. Given a piece of email, I'm going to represent my email as a feature vector, and so my i-th training example, x(i), will be a feature vector (x(i) sub 1, x(i) sub 2, ..., x(i) sub ni), where ni is equal to the number of words in this email, right? So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector — let me just write this as x subscript j — will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to which word appeared in the j-th position of my email, okay? So, in other words, I'm going to take all the words in my email, and you have a feature vector that just says which word in my dictionary each word in the email was, okay?
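A minimal sketch of this representation (the tiny dictionary below is a toy assumption; a real one would have on the order of 50,000 entries):

```python
def email_to_indices(email_words, dictionary):
    """Represent an email as its sequence of 1-based dictionary indices,
    one per word, so the feature vector's length n_i is the word count.
    Words not found in the dictionary are simply dropped in this sketch."""
    index = {word: k for k, word in enumerate(dictionary, start=1)}
    return [index[w] for w in email_words if w in index]
```

For example, with the dictionary ["a", "buy", "now"], the email "buy now buy" becomes [2, 3, 2] — a length-3 vector, each entry an index into the dictionary.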
So we have a different definition for ni now — ni now varies and is different for every training example — and this xj is now an index into the dictionary. You know, the components of the feature vector are no longer binary random variables; they're these indices into the dictionary that take on a much larger set of values.

And so our generative model for this will be that the joint distribution over x and y is P(x, y) equals the product from j equals one to n of P(xj|y), times P(y), where again n is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probability distribution over emails — some random process that generates the emails — and that process proceeds as follows. First, y is chosen — first the class label, whether someone is going to send you spam email or not, is chosen for us. So first the random variable y, the class label of spam or not spam, is generated, and then, having decided whether they'll send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates words from some distribution that depends on whether they chose to send you spam or not spam.
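That generative story can be sketched directly as code (a toy illustration with made-up parameter names; `phi_k_y1[k-1]` is the probability of dictionary word k given spam):

```python
import random

def generate_email(n_words, phi_y, phi_k_y1, phi_k_y0):
    """Sample one (x, y) pair from the multinomial event model:
    draw the class label y ~ Bernoulli(phi_y), then draw each of the
    n_words positions i.i.d. from the word distribution for that class."""
    y = 1 if random.random() < phi_y else 0
    word_dist = phi_k_y1 if y == 1 else phi_k_y0
    words = random.choices(range(1, len(word_dist) + 1),
                           weights=word_dist, k=n_words)
    return words, y
```

Note that every word position is drawn from the same per-class distribution, which is exactly why this model counts word occurrences but ignores word order.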
So if they sent you spam, they'll tend to generate words like, you know, buy, and Viagra, and whatever — at discounts, sale, whatever. And if somebody chose to send you non-spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful, right? xi here has a very different definition from the previous event model, and n has a very different definition from the previous event model.

And so the parameters of this model are — let's see. Phi subscript k given y equals one, which is the probability that, you know, conditioned on someone deciding to send you spam, the next word they choose to put in the spam email is going to be word k; and similarly, phi subscript k given y equals zero — sort of, the same thing; well, I'll just write it out, I guess — and phi y, just the same as before, okay? So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to — and now I'm going to write one of these, you know, big indicator-function things again.
It'll be a sum over your training set of the indicator that y(i) equals one, times the sum over all the words in that email — where n subscript i is the number of words in email i in your training set — of the indicator that x(i) sub j equals k, all divided by the sum over your training set of the indicator that y(i) equals one, times ni, okay? So the numerator says: sum over all your emails, and take into account only the emails that had class label one — take into account only the emails that were spam, because if y equals zero, then this indicator is zero and the term goes away — and then, summing over all the words in your spam email, it counts up the number of times you observed the word k in your spam emails. So, in other words, the numerator is: look at all the spam emails in your training set and count up the total number of times the word k appeared in these emails. The denominator then is the sum over i in your training set of, whenever one of your examples is spam, you know, the length of that spam email, and so the denominator is the total length of all of the spam emails in your training set.
And so the ratio is just: out of all your spam emails, what is the fraction of words in all your spam emails that were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay?

At the end of the previous lecture, I talked about Laplace smoothing, and so when you do that here as well, you add one to the numerator and k to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry. On the right, at the top — I was just wondering what the x(i) is, and what the n —

Instructor (Andrew Ng): Right. So in this second event model, the definition for xi and the definition for n are different, right? So here — well, this is for one example (x, y). So here, n is the number of words in a given email, right?
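The estimate just described can be sketched as follows (a toy illustration, without the Laplace smoothing the lecture adds afterward; emails are lists of dictionary indices, as above):

```python
def phi_k_given_spam(emails, labels, k):
    """Maximum likelihood estimate of phi_{k|y=1}: out of all word
    positions across the spam emails, the fraction holding word k."""
    count_k = sum(x.count(k) for x, y in zip(emails, labels) if y == 1)
    total_len = sum(len(x) for x, y in zip(emails, labels) if y == 1)
    return count_k / total_len
```

So with one spam email [1, 2, 1] and one non-spam email [3, 3], the estimate for word 1 is 2/3: word 1 fills two of the three spam word positions.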
And if it's the i-th email, we subscript it, so this is n subscript i, and so n will be different for different training examples. And here xi will take, you know, these values from 1 to 50,000, and xi is essentially the identity of the i-th word in a given piece of email, okay? So that's why this is a product over all the different words of your email of the probability of the i-th word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no — actually, you know what, I apologize. I just realized that I overloaded the notation, right; I shouldn't have used k here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry — you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be k. This should be, you know, 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing, right, this is the number of possible values that the random variable xi can take on. Cool. Raise your hand if this makes sense? Okay.
Some of you — are there more questions about this? Yeah.

Student: On Laplace smoothing, the term you add in the denominator — is that the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see. So Laplace smoothing is a method to give you, sort of, hopefully, better estimates of a probability distribution over a multinomial, and so — was I using x or y in the previous lecture? So in trying to estimate the probability over a multinomial — I think x and y are different. I think — was it x or y? I think it was x, actually. Well — oh, I see, right, right. I think I was using a different definition for the random variable, because suppose you have a multinomial random variable x which takes on — let's use a different alphabet — suppose you have a multinomial random variable x which takes on L different values. Then the maximum likelihood estimate for the probability of x, P(x = k), will be equal to, right, the number of observations.
The maximum likelihood estimate for the probability of x being equal to k will be the number of observations of x equals k, divided by the total number of observations of x, okay? So that's the maximum likelihood estimate. And to add Laplace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator, where L is the number of possible values that x can take on. So, in this case, this is a probability that x equals k, and x can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but that's why I add 50,000 to the denominator. Are there other questions? Yeah.

Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple of times, and all the examples make sense, but I don't know what the, like, general formula for it is.

Instructor (Andrew Ng): I see. Yeah, right. So the question is: what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian discriminant analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them.
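As a sketch, the general recipe just stated — add one to the count of each outcome, and L to the total — looks like this:

```python
def laplace_estimate(observations, k, L):
    """Laplace-smoothed estimate of P(x = k) for a multinomial x taking
    values 1..L: (count of k + 1) / (number of observations + L), so
    no outcome ever gets probability exactly zero."""
    return (observations.count(k) + 1) / (len(observations) + L)
```

Even a value never observed in the data — say k = 3 with observations [1, 1, 2] and L = 3 — gets a small nonzero probability of 1/6 rather than zero.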
The way to actually work this out is to actually write down the likelihood. So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters — phi k given y equals one, phi k given y equals zero, and phi y, right? And so, given a training set, the likelihood — I guess I should be writing log likelihood — will be the log of the product from i equals one to m of P(x(i), y(i)), you know, parameterized by these things, okay? Where P(x(i), y(i)), right, is given by the product from j equals one to ni of P(xj(i) given y(i)) — they're parameterized by — well, I'll just drop the parameters to write this more simply — times P(y(i)), okay? So this is my log likelihood, and so the way you get the maximum likelihood estimates of the parameters is, given a fixed training set — given a set of fixed (x(i), y(i)) pairs — you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that I've been writing out.
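Written out (a sketch consistent with the board notation, taking m as the number of training examples and n_i as the length of email i), the log likelihood being maximized is:

```latex
\ell\big(\phi_{k|y=1}, \phi_{k|y=0}, \phi_y\big)
  = \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
  = \sum_{i=1}^{m} \Bigg[ \log p\big(y^{(i)}\big)
      + \sum_{j=1}^{n_i} \log p\big(x_j^{(i)} \mid y^{(i)}\big) \Bigg]
```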
So in a previous section of today's lecture I wrote out some maximum likelihood estimates for the Gaussian discriminant analysis model, and for Naive Bayes, and I didn't prove them, but you get to, sort of, play with that yourself in the homework problem as well, for one of these models, and you'll be able to verify that when you maximize the likelihood — maximize the log likelihood — hopefully you do get the same formulas as what I was drawing up on the board. But the way these are derived is by maximizing this, okay? Cool. All right.

So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, the last Naive Bayes model — the multinomial event model — almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification, and one of the reasons hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesn't.
I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and, sort of, researchers are still debating about it a little bit. But if you ever have a text classification problem — you know, the Naive Bayes classifier is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position-invariant, right? It doesn't actually care where the words are.

Instructor (Andrew Ng): Yes, all right.

Student: And, I mean, [inaudible] — does that usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about, sort of, the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing.
So in natural language processing, there's actually another name for it; it's called a unigram model in natural language processing, and there are some other models — like, sort of, say, higher-order Markov models — that take into account some of the ordering of the words. It turns out that for text classification, the models like the bigram models or trigram models, I believe, do only very slightly better, if at all — but that's when you're applying them to text classification, okay? All right.

So the next thing I want to talk about is to start again our discussion of non-linear classifiers. So it turns out — well, the very first classification algorithm we talked about was logistic regression, which had the following form for the hypothesis, and you can think of this as predicting one when this estimated probability is greater than or equal to 0.5, and predicting zero, right, when this is less than 0.5. And given a training set, right?
00:25:52.760 --> 00:25:59.760 Logistic regression 00:26:02.549 --> 00:26:05.919 will maybe run gradient descent or something or 00:26:05.919 --> 00:26:07.440 use Newton's method 00:26:07.440 --> 00:26:09.370 to find a straight line that 00:26:09.370 --> 00:26:12.130 reasonably separates the positive and negative classes. 00:26:12.130 --> 00:26:15.830 But sometimes a data set just can't be separated by a straight line, so is there 00:26:15.830 --> 00:26:16.950 an algorithm 00:26:16.950 --> 00:26:18.230 that will let you 00:26:18.230 --> 00:26:25.230 start to learn these sorts of non-linear decision boundaries? 00:26:25.540 --> 00:26:29.230 And so how do you go about getting a non-linear classifier? And, by the 00:26:29.230 --> 00:26:34.360 way, one cool result is that remember when I said - when we talked 00:26:34.360 --> 00:26:37.790 about generative learning algorithms, I said that 00:26:37.790 --> 00:26:41.040 if you assume X given Y is 00:26:41.040 --> 00:26:43.710 exponential family, 00:26:43.710 --> 00:26:45.330 00:26:45.330 --> 00:26:46.669 right, with natural parameter 00:26:46.669 --> 00:26:50.730 eta one, if you build a generative learning algorithm assuming this, right, for Y equals 00:26:50.730 --> 00:26:55.850 one, and if X given Y equals 00:26:55.850 --> 00:26:59.710 zero is exponential family 00:26:59.710 --> 00:27:01.750 with natural parameter eta zero, 00:27:01.750 --> 00:27:05.610 right. I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, 00:27:05.610 --> 00:27:06.210 then 00:27:06.210 --> 00:27:08.610 you end up with a logistic posterior. It 00:27:08.610 --> 00:27:11.410 actually turns out that a Naive Bayes model 00:27:11.410 --> 00:27:12.629 actually falls into this 00:27:12.629 --> 00:27:17.440 as well.
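The claim in this passage can be written out as a short derivation (a sketch, using the standard exponential-family notation with sufficient statistic $T(x)$, log-partition function $a(\eta)$, and class prior $\phi = P(y=1)$). If both class-conditionals are exponential family, Bayes rule gives a posterior that is logistic in $T(x)$:

```latex
\begin{align*}
p(x \mid y=1) &= b(x)\exp\!\big(\eta_1^\top T(x) - a(\eta_1)\big), \qquad
p(x \mid y=0) = b(x)\exp\!\big(\eta_0^\top T(x) - a(\eta_0)\big), \\[4pt]
p(y=1 \mid x) &= \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
= \frac{1}{1 + \exp\!\big(-(\eta_1 - \eta_0)^\top T(x) - c\big)},
\end{align*}
```

where $c = a(\eta_0) - a(\eta_1) + \log\frac{\phi}{1-\phi}$ collects the constant terms. So the posterior is a sigmoid of a linear function of $T(x)$, i.e. a linear classifier, which is why Naive Bayes (whose class-conditionals are products of Bernoullis, an exponential-family distribution) also gives a linear decision boundary.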
So the Naive Bayes model actually falls into this exponential family as well, and, 00:27:17.440 --> 00:27:19.370 therefore, 00:27:19.370 --> 00:27:23.000 under the Naive Bayes model, you're actually 00:27:23.000 --> 00:27:27.080 using a linear classifier as well, okay? 00:27:27.080 --> 00:27:29.020 So the question is 00:27:29.020 --> 00:27:32.480 how can you start to get non-linear classifiers? 00:27:32.480 --> 00:27:35.649 And I'm going to talk about one method today 00:27:35.649 --> 00:27:41.840 which is - and 00:27:41.840 --> 00:27:45.300 we started to talk about it very briefly which is 00:27:45.300 --> 00:27:46.650 taking a 00:27:46.650 --> 00:27:48.990 simpler algorithm like 00:27:48.990 --> 00:27:50.720 logistic regression 00:27:50.720 --> 00:27:57.720 and using it to build up to more complex non-linear classifiers, okay? So 00:27:58.390 --> 00:28:00.440 to motivate this discussion, 00:28:00.440 --> 00:28:05.340 I'm going to use the little picture - let's see. So suppose you have features X1, 00:28:05.340 --> 00:28:07.340 X2, and X3, 00:28:07.340 --> 00:28:10.320 and so by convention, I'm gonna follow 00:28:10.320 --> 00:28:13.460 our earlier convention that X0 is set to one, 00:28:13.460 --> 00:28:17.010 and so I'm gonna use a little diagram like this 00:28:17.010 --> 00:28:18.090 to denote 00:28:18.090 --> 00:28:19.690 our 00:28:19.690 --> 00:28:24.320 logistic regression unit, okay? So 00:28:24.320 --> 00:28:28.609 think of a little picture like that, you know, this little circle as denoting a 00:28:28.609 --> 00:28:29.419 computation node 00:28:29.419 --> 00:28:32.630 that takes this input, you know, several features and then it outputs another 00:28:32.630 --> 00:28:36.090 number that's H subscript theta of X, given by 00:28:36.090 --> 00:28:37.800 a sigmoid function, 00:28:37.800 --> 00:28:39.930 and so this little computational unit - 00:28:39.930 --> 00:28:43.460 well, will have parameters theta.
00:28:43.460 --> 00:28:47.700 Now, in order to get non-linear decision boundaries, all we need to do - 00:28:47.700 --> 00:28:50.850 well, at least one thing to do is just 00:28:50.850 --> 00:28:52.780 come up with 00:28:52.780 --> 00:28:54.390 00:28:54.390 --> 00:28:57.020 a way to represent hypotheses 00:28:57.020 --> 00:29:00.640 that can output non-linear decision boundaries, right, 00:29:00.640 --> 00:29:02.700 and so 00:29:02.700 --> 00:29:07.490 this is - 00:29:07.490 --> 00:29:09.750 when you put a bunch of those little pictures 00:29:09.750 --> 00:29:11.390 that I drew 00:29:11.390 --> 00:29:12.790 on the previous board, 00:29:12.790 --> 00:29:14.169 you can then get 00:29:14.169 --> 00:29:18.549 what's called 00:29:18.549 --> 00:29:20.780 a Neural Network in which you 00:29:20.780 --> 00:29:23.710 think of having my features here 00:29:23.710 --> 00:29:30.710 and then I would feed them to 00:29:31.019 --> 00:29:34.530 say a few of these little 00:29:34.530 --> 00:29:38.940 sigmoidal units, 00:29:38.940 --> 00:29:43.559 and these together will feed into yet another sigmoidal unit, say, 00:29:43.559 --> 00:29:46.270 which will output 00:29:46.270 --> 00:29:51.510 my final output H subscript theta of X, okay? And 00:29:51.510 --> 00:29:54.500 just to give these things names, let me call 00:29:54.500 --> 00:29:58.380 the values output by these three intermediate sigmoidal units; let me call them A1, 00:29:58.380 --> 00:30:01.360 A2, A3. 00:30:01.360 --> 00:30:02.580 And let me just be 00:30:02.580 --> 00:30:05.679 completely concrete about what this formula represents, right?
So 00:30:05.679 --> 00:30:06.570 00:30:06.570 --> 00:30:08.309 each of these 00:30:08.309 --> 00:30:12.080 units in the middle will have their own associated set of parameters, 00:30:12.080 --> 00:30:15.820 and so the value A1 will be computed as 00:30:15.820 --> 00:30:17.940 G 00:30:17.940 --> 00:30:21.440 of X transpose, 00:30:21.440 --> 00:30:27.440 and then some set of parameters, which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta 00:30:27.440 --> 00:30:29.090 two, 00:30:29.090 --> 00:30:30.980 and A3 will be 00:30:30.980 --> 00:30:34.490 G of X transpose, 00:30:34.490 --> 00:30:35.700 theta three, 00:30:35.700 --> 00:30:42.700 where G is the sigmoid function, all right? So G of Z, 00:30:43.640 --> 00:30:49.850 and then, finally, our hypothesis will output 00:30:49.850 --> 00:30:52.360 G of 00:30:52.360 --> 00:30:53.950 A 00:30:53.950 --> 00:30:56.620 transpose 00:30:56.620 --> 00:30:59.200 theta four, right? Where, you 00:30:59.200 --> 00:31:01.320 know, this A vector 00:31:01.320 --> 00:31:04.390 is a vector of A1, 00:31:04.390 --> 00:31:05.780 A2, A3. 00:31:05.780 --> 00:31:08.220 We can append another one to it as the first entry 00:31:08.220 --> 00:31:11.470 if you want, okay? Let 00:31:11.470 --> 00:31:14.270 me 00:31:14.270 --> 00:31:16.210 just draw up here this - I'm 00:31:16.210 --> 00:31:18.470 sorry about the cluttered board.
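The forward computation just described can be sketched in a few lines of Python (not from the lecture; the parameter values are arbitrary placeholders, and the inputs carry the X0 = 1 intercept entry by the convention above): each hidden unit computes G of X transpose theta, and the output unit applies G to the hidden activations.

```python
import math

def g(z):
    """Sigmoid activation G(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward(x, theta1, theta2, theta3, theta4):
    """One hidden layer of three sigmoid units feeding a sigmoid output unit.
    x includes the x0 = 1 intercept entry; a gets a leading 1 appended,
    as mentioned in the lecture."""
    a = [g(dot(x, theta1)), g(dot(x, theta2)), g(dot(x, theta3))]
    a = [1.0] + a
    return g(dot(a, theta4))

# Hypothetical parameters, just to exercise the wiring
x = [1.0, 0.5, -1.2, 2.0]            # x0 = 1, then features x1, x2, x3
theta1 = [0.1, -0.4, 0.2, 0.3]
theta2 = [-0.2, 0.5, 0.1, -0.1]
theta3 = [0.0, 0.3, -0.3, 0.2]
theta4 = [0.05, 0.7, -0.6, 0.4]

h = forward(x, theta1, theta2, theta3, theta4)
assert 0.0 < h < 1.0                 # a sigmoid output always lies in (0, 1)
```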
00:31:18.470 --> 00:31:20.950 And so H subscript theta of X, 00:31:20.950 --> 00:31:23.850 this is a function 00:31:23.850 --> 00:31:29.300 of all the parameters theta one 00:31:29.300 --> 00:31:32.420 through theta four, 00:31:32.420 --> 00:31:36.680 and so one way to learn parameters for this model 00:31:36.680 --> 00:31:37.650 is to 00:31:37.650 --> 00:31:39.910 write down the cost function, 00:31:39.910 --> 00:31:41.180 say, J of theta 00:31:41.180 --> 00:31:46.040 equals one-half sum from I equals one to 00:31:46.040 --> 00:31:49.730 M, YI minus H subscript theta of 00:31:49.730 --> 00:31:51.770 XI 00:31:51.770 --> 00:31:54.450 squared, say. Okay, 00:31:54.450 --> 00:31:56.070 so that's our familiar 00:31:56.070 --> 00:31:58.280 quadratic cost function, 00:31:58.280 --> 00:31:59.980 and so 00:31:59.980 --> 00:32:03.990 one way to learn the parameters of an algorithm like this is to just use gradient 00:32:03.990 --> 00:32:04.840 descent 00:32:04.840 --> 00:32:08.809 to minimize J of theta as a function of theta, okay? So you, sort of, run 00:32:08.809 --> 00:32:10.940 gradient descent 00:32:10.940 --> 00:32:14.610 to minimize this squared error, which 00:32:14.610 --> 00:32:18.050 stated differently means you use gradient descent to make the predictions of your neural 00:32:18.050 --> 00:32:20.330 network as close as possible 00:32:20.330 --> 00:32:27.330 to what you observed as the labels in your training 00:32:30.550 --> 00:32:34.520 set, okay?
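Here is a minimal runnable sketch of that idea (not from the lecture): a tiny two-hidden-unit sigmoid network trained on made-up XOR-like data by plain gradient descent on the quadratic cost. For simplicity the gradient is computed by finite differences, a slow numerical stand-in for the analytic derivatives that back propagation would compute.

```python
import math
import random

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w):
    """Tiny network: two sigmoid hidden units feeding one sigmoid output.
    w packs all nine parameters (each unit has an intercept) in a flat list."""
    a1 = g(w[0] + w[1] * x[0] + w[2] * x[1])
    a2 = g(w[3] + w[4] * x[0] + w[5] * x[1])
    return g(w[6] + w[7] * a1 + w[8] * a2)

def J(w, data):
    """Quadratic cost: J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2."""
    return 0.5 * sum((y - predict(x, w)) ** 2 for x, y in data)

def grad(w, data, eps=1e-5):
    """Central finite-difference gradient of J -- a numerical stand-in for
    back propagation, which computes the same derivatives analytically."""
    out = []
    for j in range(len(w)):
        wp, wm = w[:], w[:]
        wp[j] += eps
        wm[j] -= eps
        out.append((J(wp, data) - J(wm, data)) / (2 * eps))
    return out

# XOR-like toy training set: not separable by a straight line
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

random.seed(0)
w = [random.uniform(-1.0, 1.0) for _ in range(9)]
before = J(w, data)
for _ in range(300):  # plain gradient descent with a fixed learning rate
    dw = grad(w, data)
    w = [wj - 0.5 * gj for wj, gj in zip(w, dw)]
assert J(w, data) < before  # the cost decreased (toward some local optimum)
```

As the lecture notes shortly after this point, there is no guarantee of reaching a global optimum here; the assertion only checks that descent reduced the cost from its random starting value.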
So it turns out gradient descent on this neural network has a specific name, the 00:32:34.520 --> 00:32:35.059 algorithm 00:32:35.059 --> 00:32:38.789 that implements gradient descent is called back propagation, and so if you ever hear that, all that 00:32:38.789 --> 00:32:41.240 means is - it just means gradient descent on 00:32:41.240 --> 00:32:45.170 a cost function like this or a variation of this on the neural network 00:32:45.170 --> 00:32:47.909 that looks like that, 00:32:47.909 --> 00:32:50.220 and - 00:32:50.220 --> 00:32:54.360 well, this algorithm actually has some advantages and disadvantages, but let me 00:32:54.360 --> 00:32:57.830 actually show you. So, let's see. 00:32:57.830 --> 00:33:00.729 One of the interesting things about the neural network is that 00:33:00.729 --> 00:33:05.430 you can look at what these intermediate nodes are computing, right? So 00:33:05.430 --> 00:33:08.270 this neural network has what's called 00:33:08.270 --> 00:33:11.610 a hidden layer before you then have the output layer, 00:33:11.610 --> 00:33:14.410 and, more generally, you can actually have inputs feed 00:33:14.410 --> 00:33:16.279 into these 00:33:16.279 --> 00:33:19.690 computation units, feed into more layers of computation units, to even more layers, to 00:33:19.690 --> 00:33:23.370 more layers, and then finally you have an output layer at the end. 00:33:23.370 --> 00:33:26.970 And one cool thing you can do is look at all of these intermediate units, look at 00:33:26.970 --> 00:33:28.280 these 00:33:28.280 --> 00:33:30.980 units in what's called a hidden layer of the neural network. Don't 00:33:30.980 --> 00:33:32.580 worry about why it's called that. Look 00:33:32.580 --> 00:33:34.489 at computations of the hidden unit 00:33:34.489 --> 00:33:39.570 and ask what is the hidden unit of the neural network computing?
So 00:33:39.570 --> 00:33:42.900 to, maybe, get a better sense of what neural networks might be doing, let me show you a 00:33:42.900 --> 00:33:44.580 video - I'm gonna switch to the laptop - this is made 00:33:44.580 --> 00:33:46.250 by a 00:33:46.250 --> 00:33:49.360 friend, Yann LeCun 00:33:49.360 --> 00:33:52.420 who's currently a professor at New York University. 00:33:52.420 --> 00:33:56.289 Can I show a video on the laptop? 00:33:56.289 --> 00:33:58.600 So let me show you a video 00:33:58.600 --> 00:34:00.440 from Yann 00:34:00.440 --> 00:34:06.330 LeCun on a neural network that he developed for handwritten digit recognition. 00:34:06.330 --> 00:34:10.110 There was one other thing he did in this neural network that I'm not gonna talk about called 00:34:10.110 --> 00:34:12.389 a Convolutional Neural Network 00:34:12.389 --> 00:34:16.009 that - well, 00:34:16.009 --> 00:34:18.809 his system is called LeNet, 00:34:18.809 --> 00:34:22.569 and let's see. Would you put 00:34:22.569 --> 00:34:23.360 on the 00:34:23.360 --> 00:34:30.360 laptop display? Hum, 00:34:37.779 --> 00:34:38.789 actually maybe if - 00:34:38.789 --> 00:34:42.339 or you can just put on the screen on the side; that would work too 00:34:42.339 --> 00:34:49.339 if the big screen isn't working. 00:35:12.249 --> 00:35:13.729 Let's see. I'm 00:35:13.729 --> 00:35:19.299 just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come 00:35:19.299 --> 00:35:24.209 on? Well, let me say a few more things about neural network.
00:35:24.209 --> 00:35:25.529 So it turns out that 00:35:25.529 --> 00:35:29.039 when you write a quadratic cost function like I 00:35:29.039 --> 00:35:31.689 wrote down on the chalkboard just now, 00:35:31.689 --> 00:35:34.639 it turns out that unlike logistic regression, 00:35:34.639 --> 00:35:39.029 that will almost always correspond to a non-convex optimization problem, 00:35:39.029 --> 00:35:42.709 and so whereas for logistic regression if you run gradient descent 00:35:42.709 --> 00:35:44.280 or Newton's method or whatever, 00:35:44.280 --> 00:35:46.000 you converge to the global optimum. 00:35:46.000 --> 00:35:50.500 This is not true for neural networks. In general, there are lots of local optima 00:35:50.500 --> 00:35:55.209 and it's, sort of, a much harder optimization problem. 00:35:55.209 --> 00:35:56.159 So 00:35:56.159 --> 00:35:59.699 neural networks, if you're, sort of, familiar with them, and you're good at making design choices 00:35:59.699 --> 00:36:01.599 like what learning rate to use, and 00:36:01.599 --> 00:36:04.319 how many hidden units to use, and so on, you can, 00:36:04.319 --> 00:36:07.700 sort of, get them to be fairly effective, 00:36:07.700 --> 00:36:09.109 and 00:36:09.109 --> 00:36:12.910 there's, sort of, often ongoing debates about, you know, is this learning algorithm 00:36:12.910 --> 00:36:13.819 better, or is that learning 00:36:13.819 --> 00:36:17.869 algorithm better? The vast majority of machine learning researchers today seem to perceive 00:36:17.869 --> 00:36:20.180 support vector machines, which is what I'll talk about later, 00:36:20.180 --> 00:36:24.349 to be a much more effective off-the-shelf learning algorithm than neural networks.
00:36:24.349 --> 00:36:26.509 This point of view is contested 00:36:26.509 --> 00:36:32.039 a bit, but so neural networks are not something that I personally use a lot right 00:36:32.039 --> 00:36:35.670 now because it's a hard optimization problem and you, sort of, run it and it often converges, 00:36:35.670 --> 00:36:39.969 and it actually, sort of, works. It, sort of, works reasonably 00:36:39.969 --> 00:36:41.739 well. It's just 00:36:41.739 --> 00:36:43.700 because this is fairly complicated, 00:36:43.700 --> 00:36:45.269 there's not an algorithm that 00:36:45.269 --> 00:36:52.009 I use commonly or that my friends use all the 00:36:52.009 --> 00:36:56.019 time. Oh, cool. So but let me just go and show you an example of neural network, which was for many years, 00:36:56.019 --> 00:36:56.819 00:36:56.819 --> 00:37:00.400 you know, the most effective learning algorithm before support vector 00:37:00.400 --> 00:37:02.029 machines were invented. 00:37:02.029 --> 00:37:06.349 So here's Yann LeCun's video, and - 00:37:06.349 --> 00:37:09.350 well, there's actually audio on this too, the soundboard. So I'll just tell you 00:37:09.350 --> 00:37:11.209 what's happening. 00:37:11.209 --> 00:37:11.880 00:37:11.880 --> 00:37:14.629 What you're seeing is a trained neural network, 00:37:14.629 --> 00:37:18.549 and this display where my mouse pointer is pointing, right, this big three 00:37:18.549 --> 00:37:19.429 there 00:37:19.429 --> 00:37:21.189 is 00:37:21.189 --> 00:37:24.799 the input to the neural network. So you're showing the neural network this image, and it's 00:37:24.799 --> 00:37:26.859 trying to recognize what is this.
00:37:26.859 --> 00:37:30.769 The final answer output by the neural network is this number up here, right 00:37:30.769 --> 00:37:33.189 below where it says LeNet-5, 00:37:33.189 --> 00:37:34.229 and the 00:37:34.229 --> 00:37:37.759 neural network correctly recognizes this image as a three, 00:37:37.759 --> 00:37:39.890 and if you look to the left of this image, 00:37:39.890 --> 00:37:42.309 what's interesting about this is 00:37:42.309 --> 00:37:46.640 the display on the left portion of this is actually showing the 00:37:46.640 --> 00:37:49.999 intermediate computations of the neural network. In other words, it's showing you 00:37:49.999 --> 00:37:51.900 what are the hidden layers of the 00:37:51.900 --> 00:37:53.829 neural network computing. 00:37:53.829 --> 00:37:57.529 And so, for example, if you look at this one, the third image down from the top, 00:37:57.529 --> 00:38:01.569 this seems to be computing, you know, certain edges of the digits, 00:38:01.569 --> 00:38:05.630 right? Sort of, detecting edges on the 00:38:05.630 --> 00:38:07.489 right-hand side or the bottom or something 00:38:07.489 --> 00:38:09.889 of the input display, 00:38:09.889 --> 00:38:11.479 of the input image, okay? 00:38:11.479 --> 00:38:15.129 So let me just play this video, and you can see 00:38:15.129 --> 00:38:17.019 some 00:38:17.019 --> 00:38:20.619 of the inputs and outputs of the neural network, and 00:38:20.619 --> 00:38:21.529 those 00:38:21.529 --> 00:38:26.309 are very different fonts. There's this robustness to 00:38:26.309 --> 00:38:30.670 noise. 00:38:30.670 --> 00:38:36.929 All 00:38:36.929 --> 00:38:37.880 00:38:37.880 --> 00:38:41.949 00:38:41.949 --> 00:38:43.000 right. Multiple 00:38:43.000 --> 00:38:49.089 digits, 00:38:49.089 --> 00:38:50.739 00:38:50.739 --> 00:38:57.739 that's, kind of, cool. All 00:38:59.259 --> 00:39:06.259 right.
00:39:12.140 --> 00:39:19.140 00:39:30.779 --> 00:39:35.789 00:39:35.789 --> 00:39:38.689 00:39:38.689 --> 00:39:40.459 00:39:40.459 --> 00:39:41.660 00:39:41.660 --> 00:39:48.660 00:39:52.359 --> 00:39:54.809 So, 00:39:54.809 --> 00:39:58.219 just for fun, let me show you one more video, 00:39:58.219 --> 00:40:00.579 which was - let's 00:40:00.579 --> 00:40:06.089 see. This is another video from the series The Machine That Changed the World, which 00:40:06.089 --> 00:40:07.490 was produced by WGBH 00:40:07.490 --> 00:40:09.059 Television 00:40:09.059 --> 00:40:10.800 in 00:40:10.800 --> 00:40:12.599 cooperation with the British Broadcasting Corporation, 00:40:12.599 --> 00:40:16.159 and it was aired on PBS a few years ago, I think. 00:40:16.159 --> 00:40:17.630 I want to show you a 00:40:17.630 --> 00:40:19.589 video describing the NETtalk 00:40:19.589 --> 00:40:23.629 Neural Network, which was developed by Terry Sejnowski; he's a 00:40:23.629 --> 00:40:25.739 researcher. 00:40:25.739 --> 00:40:29.589 And so NETtalk was actually one of the major milestones in the 00:40:29.589 --> 00:40:31.749 history of neural network, 00:40:31.749 --> 00:40:33.490 and this specific application 00:40:33.490 --> 00:40:36.399 is getting the neural network to read text. 00:40:36.399 --> 00:40:38.779 So, in other words, can you show a 00:40:38.779 --> 00:40:41.659 piece of English to a computer 00:40:41.659 --> 00:40:43.419 and have the computer read, 00:40:43.419 --> 00:40:46.410 sort of, verbally produce sounds that correspond 00:40:46.410 --> 00:40:48.319 to the 00:40:48.319 --> 00:40:50.629 reading of the text. 00:40:50.629 --> 00:40:55.279 And it turns out that in the history of AI and the history of machine learning, 00:40:55.279 --> 00:40:56.509 this video 00:40:56.509 --> 00:41:00.849 created a lot of excitement about neural networks and about machine learning.
Part of the 00:41:00.849 --> 00:41:05.429 reason was that Terry Sejnowski had the foresight to choose 00:41:05.429 --> 00:41:06.699 to use, 00:41:06.699 --> 00:41:08.040 in his video, 00:41:08.040 --> 00:41:12.940 a child-like voice talking about visiting your grandmother's house and so on. You'll 00:41:12.940 --> 00:41:14.289 see it in a second, 00:41:14.289 --> 00:41:16.029 and so 00:41:16.029 --> 00:41:19.399 this really created the perception of - created 00:41:19.399 --> 00:41:23.129 the impression of the neural network being like a young child learning how to speak, 00:41:23.129 --> 00:41:25.260 and talking about going to your grandmothers, and so 00:41:25.260 --> 00:41:27.379 on. So this actually 00:41:27.379 --> 00:41:29.289 helped generate a lot of excitement 00:41:29.289 --> 00:41:32.449 within academia and outside academia on neural networks, sort of, early in the history 00:41:32.449 --> 00:41:36.109 of neural networks. I'm just gonna show you the video. 00:41:36.109 --> 00:41:37.910 [Begin Video] You're going to hear first 00:41:37.910 --> 00:41:41.559 what the network sounds like at the very beginning of the training, and it won't 00:41:41.559 --> 00:41:43.589 sound like words, but it'll sound 00:41:43.589 --> 00:41:48.009 like attempts that will get better and better with 00:41:48.009 --> 00:41:54.239 time. [Computer's voice] The network 00:41:54.239 --> 00:41:58.249 takes the letters, say the phrase, grandmother's house, and 00:41:58.249 --> 00:42:02.679 makes a random attempt at pronouncing it. [Computer's 00:42:02.679 --> 00:42:07.219 voice] 00:42:07.219 --> 00:42:09.269 Grandmother's house. The 00:42:09.269 --> 00:42:12.530 phonetic difference between the guess and the right pronunciation 00:42:12.530 --> 00:42:16.539 is sent back through the network. [Computer's 00:42:16.539 --> 00:42:18.989 voice] 00:42:18.989 --> 00:42:20.329 Grandmother's house. 
00:42:20.329 --> 00:42:23.920 By adjusting the connection strengths after each attempt, 00:42:23.920 --> 00:42:26.689 the net slowly improves. And, 00:42:26.689 --> 00:42:30.029 finally, after letting it train overnight, 00:42:30.029 --> 00:42:32.700 the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's 00:42:32.700 --> 00:42:37.759 house. 00:42:37.759 --> 00:42:41.149 Well, because she gives us candy. 00:42:41.149 --> 00:42:46.789 Well, and we - NETtalk understands nothing about the language. It is simply associating letters 00:42:46.789 --> 00:42:49.809 with sounds. 00:42:49.809 --> 00:42:52.359 [End Video] All right. So 00:42:52.359 --> 00:42:54.380 at the time this was done, I mean, this is 00:42:54.380 --> 00:42:57.379 an amazing piece of work. I should say 00:42:57.379 --> 00:42:59.140 today there are other 00:42:59.140 --> 00:43:03.389 text to speech systems that work better than what you just saw, 00:43:03.389 --> 00:43:08.359 and you'll also appreciate getting candy from your grandmother's house is a little bit 00:43:08.359 --> 00:43:11.449 less impressive than talking about the Dow Jones falling 15 points, 00:43:11.449 --> 00:43:16.149 and profit taking, whatever. So 00:43:16.149 --> 00:43:18.789 but I wanted to show that just because that was another cool, 00:43:18.789 --> 00:43:25.789 major landmark in the history of neural networks. Okay. 00:43:27.259 --> 00:43:32.949 So let's switch back to the chalkboard, 00:43:32.949 --> 00:43:35.859 and what I want to do next 00:43:35.859 --> 00:43:39.479 is 00:43:39.479 --> 00:43:42.369 tell you about Support Vector Machines, 00:43:42.369 --> 00:43:49.369 okay? That, sort of, wraps up our discussion on neural 00:44:08.839 --> 00:44:10.850 networks. So I started off talking about neural networks 00:44:10.850 --> 00:44:13.269 by motivating it as a way to get 00:44:13.269 --> 00:44:17.920 us to output non-linear classifiers, right?
It turns out 00:44:17.920 --> 00:44:18.419 that 00:44:18.419 --> 00:44:21.799 you'd be able to come up with non-linear decision boundaries using a neural 00:44:21.799 --> 00:44:22.479 network 00:44:22.479 --> 00:44:27.309 like what I drew on the chalkboard earlier. 00:44:27.309 --> 00:44:30.469 Support Vector Machines will be another learning algorithm that will give us a way 00:44:30.469 --> 00:44:34.529 to come up with non-linear classifiers. It's a very effective, off-the-shelf 00:44:34.529 --> 00:44:36.019 learning algorithm, 00:44:36.019 --> 00:44:39.399 but it turns out that in the discussion I'm gonna - in 00:44:39.399 --> 00:44:43.379 the progression and development I'm gonna pursue, I'm actually going to start 00:44:43.379 --> 00:44:47.799 off by describing yet another class of linear classifiers with linear decision 00:44:47.799 --> 00:44:49.269 boundaries, 00:44:49.269 --> 00:44:54.179 and only later, sort of, in probably the next lecture or the one after that, 00:44:54.179 --> 00:44:55.249 will we then 00:44:55.249 --> 00:44:58.699 take the support vector machine idea and, sort of, do some clever things to it 00:44:58.699 --> 00:45:02.959 to make it work very well to generate non-linear decision boundaries as well, okay? 00:45:02.959 --> 00:45:07.859 But we'll actually start by talking about linear classifiers a little bit more. 00:45:07.859 --> 00:45:12.619 And to do that, I want to 00:45:12.619 --> 00:45:14.559 convey two 00:45:14.559 --> 00:45:16.689 intuitions about classification. 00:45:16.689 --> 00:45:20.700 One is 00:45:20.700 --> 00:45:24.450 you think about logistic regression; we have this logistic function that was outputting 00:45:24.450 --> 00:45:28.669 the probability that Y equals one, 00:45:28.669 --> 00:45:31.049 and it crosses 00:45:31.049 --> 00:45:32.959 one-half at zero.
00:45:32.959 --> 00:45:37.819 So when you run logistic regression, I want you to think of it as 00:45:37.819 --> 00:45:41.179 an algorithm that computes 00:45:41.179 --> 00:45:43.239 theta transpose X, 00:45:43.239 --> 00:45:46.789 and then it predicts 00:45:46.789 --> 00:45:51.170 one, right, 00:45:51.170 --> 00:45:56.359 if and only if, theta transpose X is greater than zero, right? IFF stands for if 00:45:56.359 --> 00:45:59.539 and only if. It means the same thing as a double implication, 00:45:59.539 --> 00:46:04.629 and it predicts zero, 00:46:04.629 --> 00:46:11.629 if and only if, theta transpose X is less than zero, okay? So if 00:46:12.499 --> 00:46:15.019 it's the case that 00:46:15.019 --> 00:46:18.069 theta transpose X is much greater than zero, 00:46:18.069 --> 00:46:22.309 the double greater than sign means these are much greater than, all right. 00:46:22.309 --> 00:46:25.589 So if theta transpose X is much greater than zero, 00:46:25.589 --> 00:46:26.739 then, 00:46:26.739 --> 00:46:33.069 you know, think of that as a very confident 00:46:33.069 --> 00:46:38.779 prediction 00:46:38.779 --> 00:46:40.779 that Y is equal to one, 00:46:40.779 --> 00:46:43.459 right? If theta transpose X is much greater than zero, then we're 00:46:43.459 --> 00:46:47.049 gonna predict one then moreover we're very confident it's one, and the picture for 00:46:47.049 --> 00:46:48.479 that is if theta 00:46:48.479 --> 00:46:51.179 transpose X is way out here, then 00:46:51.179 --> 00:46:54.329 we're estimating that the probability of Y being equal to one 00:46:54.329 --> 00:46:56.089 on the sigmoid function, it will 00:46:56.089 --> 00:46:58.699 be very close to one. 00:46:58.699 --> 00:47:02.779 And, in the same way, if theta transpose X 00:47:02.779 --> 00:47:05.190 is much less than zero, 00:47:05.190 --> 00:47:12.190 then we're very confident that 00:47:12.809 --> 00:47:15.849 Y is equal to zero. 
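This view of logistic regression can be sketched directly (a toy illustration, not from the lecture; the parameter values are made up): predict one if and only if theta transpose X is greater than zero, and treat the sigmoid of theta transpose X as the confidence. When theta transpose X is far from zero in either direction, the estimated probability saturates near one or zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Predict 1 iff theta^T x > 0; the sigmoid of theta^T x is the
    estimated probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return (1 if z > 0 else 0), sigmoid(z)

theta = [0.0, 2.0, -1.0]                      # hypothetical fitted parameters

# theta^T x >> 0: a very confident prediction that y = 1
label, p = predict(theta, [1.0, 5.0, 0.5])    # z = 9.5
assert label == 1 and p > 0.99

# theta^T x << 0: a very confident prediction that y = 0
label, p = predict(theta, [1.0, -4.0, 2.0])   # z = -10.0
assert label == 0 and p < 0.01
```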
00:47:15.849 --> 00:47:19.599 So 00:47:19.599 --> 00:47:23.859 wouldn't it be nice - so when we fit logistic regression or some other 00:47:23.859 --> 00:47:25.309 classifier to your training set, 00:47:25.309 --> 00:47:29.569 then wouldn't it be nice if, 00:47:29.569 --> 00:47:30.430 right, 00:47:30.430 --> 00:47:34.349 for all I 00:47:34.349 --> 00:47:38.739 such that Y is equal to one. 00:47:38.739 --> 00:47:39.699 We have 00:47:39.699 --> 00:47:41.279 theta 00:47:41.279 --> 00:47:45.009 transpose XI is much greater than zero, 00:47:45.009 --> 00:47:47.929 and for all I such that Y is equal 00:47:47.929 --> 00:47:49.869 to 00:47:49.869 --> 00:47:56.869 zero, 00:47:57.589 --> 00:47:59.910 we have theta transpose XI is 00:47:59.910 --> 00:48:00.970 much less than zero, 00:48:00.970 --> 00:48:05.659 okay? So wouldn't it be nice if this is true? That, 00:48:05.659 --> 00:48:06.380 00:48:06.380 --> 00:48:09.929 essentially, for our training set, we can find parameters theta 00:48:09.929 --> 00:48:12.749 so that our learning algorithm not only 00:48:12.749 --> 00:48:16.839 makes correct classifications on all the examples in a training set, but further it, sort 00:48:16.839 --> 00:48:19.880 of, is very confident about all of those correct 00:48:19.880 --> 00:48:22.299 classifications. This is 00:48:22.299 --> 00:48:26.859 the first intuition that I want you to have, and we'll come back to this first intuition 00:48:26.859 --> 00:48:28.889 in a second 00:48:28.889 --> 00:48:33.209 when we talk about functional margins, okay? 00:48:33.209 --> 00:48:40.209 We'll define this later. The second 00:48:43.539 --> 00:48:48.119 intuition that I want to 00:48:48.119 --> 00:48:55.119 convey, 00:48:56.769 --> 00:49:00.209 and it turns out for the rest of today's lecture I'm going to assume 00:49:00.209 --> 00:49:04.049 that a training set is linearly separable, okay?
So by that I mean for the rest of 00:49:04.049 --> 00:49:05.619 00:49:05.619 --> 00:49:08.690 today's lecture, I'm going to assume that 00:49:08.690 --> 00:49:10.900 there is indeed a straight line that can separate 00:49:10.900 --> 00:49:15.030 your training set, and we'll remove this assumption later, but 00:49:15.030 --> 00:49:17.089 just to develop the algorithm, let's start with a 00:49:17.089 --> 00:49:19.409 linearly separable training set. 00:49:19.409 --> 00:49:20.440 And so 00:49:20.440 --> 00:49:24.400 there's a sense that out of all the straight lines that separate the training 00:49:24.400 --> 00:49:28.569 set, you know, maybe that straight line isn't such a good one, 00:49:28.569 --> 00:49:33.139 and that one actually isn't such a great one either, but 00:49:33.139 --> 00:49:35.019 maybe that line in the 00:49:35.019 --> 00:49:38.869 middle is a much better linear separator than the others, right? 00:49:38.869 --> 00:49:41.179 And one reason that 00:49:41.179 --> 00:49:45.759 when you and I look at it this one seems best 00:49:45.759 --> 00:49:48.329 is because this line is just further from the data, all right? 00:49:48.329 --> 00:49:52.469 That is, it separates the data with a greater distance between your positive and your negative 00:49:52.469 --> 00:49:55.779 examples and the decision boundary, okay? 00:49:55.779 --> 00:49:59.390 And this second intuition, we'll come back to this shortly, 00:49:59.390 --> 00:50:02.829 about this final line 00:50:02.829 --> 00:50:06.099 that I drew being, maybe, the best line, 00:50:06.099 --> 00:50:09.999 this notion of distance from the training examples. This is the second intuition I want 00:50:09.999 --> 00:50:11.239 to convey, 00:50:11.239 --> 00:50:13.659 and we'll formalize it later when 00:50:13.659 --> 00:50:17.849 we talk about 00:50:17.849 --> 00:50:24.849 geometric margins of our classifiers, okay?
00:50:41.549 --> 00:50:44.889 So in order to describe support vector machines, unfortunately, I'm gonna 00:50:44.889 --> 00:50:48.239 have to pull a notation change, 00:50:48.239 --> 00:50:50.109 and, sort 00:50:50.109 --> 00:50:53.989 of, unfortunately, it, sort of, was impossible to do logistic regression, 00:50:53.989 --> 00:50:55.809 and support vector machines, 00:50:55.809 --> 00:51:01.029 and all the other algorithms using one completely consistent notation, 00:51:01.029 --> 00:51:04.180 and so I'm actually gonna change notations slightly 00:51:04.180 --> 00:51:07.999 for linear classifiers, and that will actually make it much easier for us - 00:51:07.999 --> 00:51:10.019 that'll make it much easier later today 00:51:10.019 --> 00:51:15.149 and in next week's lectures to actually talk about support vector machines. But the 00:51:15.149 --> 00:51:19.549 notation that I'm gonna use for the rest 00:51:19.549 --> 00:51:21.759 of today and for most of next week 00:51:21.759 --> 00:51:25.769 will be that my labels Y, 00:51:25.769 --> 00:51:30.219 instead of being zero, one, will be minus one and plus one, 00:51:30.219 --> 00:51:33.779 00:51:33.779 --> 00:51:36.759 and in the development of the support vector machine 00:51:36.759 --> 00:51:38.469 we will have H, 00:51:38.469 --> 00:51:43.939 the hypothesis, 00:51:43.939 --> 00:51:49.369 output values that 00:51:49.369 --> 00:51:53.179 are either plus one or minus one, 00:51:53.179 --> 00:51:54.619 and so 00:51:54.619 --> 00:51:58.039 we'll let G of Z be equal to 00:51:58.039 --> 00:52:00.349 one if Z is 00:52:00.349 --> 00:52:04.460 greater or equal to zero, and minus one otherwise, right? So just rather than zero and 00:52:04.460 --> 00:52:07.809 one, we change everything to plus one and minus one.
00:52:07.809 --> 00:52:10.149 And, finally, 00:52:10.149 --> 00:52:13.770 whereas previously I wrote H subscript 00:52:13.770 --> 00:52:16.779 theta of X equals 00:52:16.779 --> 00:52:18.429 G of theta transpose X 00:52:18.429 --> 00:52:20.880 and we had the convention 00:52:20.880 --> 00:52:23.409 that X zero is equal to one, 00:52:23.409 --> 00:52:25.789 right? And so X is in RN plus 00:52:25.789 --> 00:52:28.670 one. 00:52:28.670 --> 00:52:34.499 I'm gonna drop this convention of 00:52:34.499 --> 00:52:38.400 letting X zero equal one, and letting X be in RN plus one, and instead I'm 00:52:38.400 --> 00:52:44.049 going to parameterize my linear classifier as H subscript W, B of X 00:52:44.049 --> 00:52:46.159 equals G of 00:52:46.159 --> 00:52:50.629 W transpose X plus B, okay? 00:52:50.629 --> 00:52:51.349 And so 00:52:51.349 --> 00:52:53.359 B 00:52:53.359 --> 00:52:54.989 just now plays the role of 00:52:54.989 --> 00:52:56.289 theta zero, 00:52:56.289 --> 00:53:03.289 and W now plays the role of the rest of the parameters, theta one 00:53:04.179 --> 00:53:07.059 through theta N, okay? So just by separating out 00:53:07.059 --> 00:53:10.569 the intercept term B rather than lumping it together, it'll make it easier for 00:53:10.569 --> 00:53:11.230 us 00:53:11.230 --> 00:53:17.569 to develop support vector machines. 00:53:17.569 --> 00:53:24.569 So - yes. Student:[Off mic]. Instructor 00:53:27.889 --> 00:53:31.829 (Andrew Ng):Oh, yes. Right, yes. So W is - 00:53:31.829 --> 00:53:33.049 right. So W 00:53:33.049 --> 00:53:35.499 is a vector in RN, and 00:53:35.499 --> 00:53:37.069 X 00:53:37.069 --> 00:53:42.709 is now a vector in RN rather than RN plus one, 00:53:42.709 --> 00:53:49.709 and a lowercase b is a real number. Okay. 00:53:56.429 --> 00:54:02.650 Now, let's formalize the notion of functional margin and geometric margin. Let me make a 00:54:02.650 --> 00:54:03.660 definition.
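The notation change just described can be written out as a tiny sketch (not from the lecture; the parameter values are hypothetical): labels become plus one and minus one, G thresholds at zero, B plays the old role of theta zero, and X no longer carries an X0 = 1 entry.

```python
def g(z):
    """Threshold: +1 if z >= 0, -1 otherwise (replacing the 0/1 convention)."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """h_{w,b}(x) = g(w^T x + b): b plays the old role of theta_0, and
    w plays the role of theta_1 through theta_n; x is in R^n, with no
    x0 = 1 intercept entry."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [2.0, -1.0], -0.5             # hypothetical parameters
assert h(w, b, [1.0, 0.0]) == 1      # w^T x + b = 2.0 - 0.5 = 1.5 >= 0
assert h(w, b, [0.0, 2.0]) == -1     # w^T x + b = -2.0 - 0.5 = -2.5 < 0
```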
I'm going to say that the functional margin of the hyperplane W, B with respect to a specific training example XI, YI - WRT stands for with respect to - the functional margin of a hyperplane W, B with respect to a certain training example XI, YI is defined as gamma hat I equals YI times W transpose XI plus B, okay? And so a set of parameters W, B defines a classifier - it, sort of, defines a linear separating boundary, and so when I say hyperplane, I just mean the decision boundary that's defined by the parameters W, B. You know what, if you're confused by the hyperplane term, just ignore it. The functional margin of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And the interpretation of this is that if YI is equal to one, then for this to have a large functional margin, you want W transpose XI plus B to be large, right?
And if YI is equal to minus one, then in order for the functional margin to be large - we, sort of, want the functional margins to be large, but in order for the functional margin to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins - the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small, and this, sort of, collapses the two cases into one statement: that we'd like the functional margin to be large. And notice also that so long as YI times W transpose XI plus B - so long as this is greater than zero, that means we classified it correctly, okay? And one more definition: I'm going to say that the functional margin of a hyperplane with respect to an entire training set - I'm going to define gamma hat to be equal to the min over all your training examples of gamma hat I, right?
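To make the two definitions concrete, here is a small Python sketch (again my own illustration, with made-up toy numbers, not anything from the lecture):

```python
import numpy as np

def functional_margin(w, b, x, y):
    # gamma_hat_i = y_i * (w'x_i + b); this is positive exactly when
    # the example (x_i, y_i) is classified correctly.
    return y * (np.dot(w, x) + b)

def functional_margin_of_set(w, b, X, Y):
    # gamma_hat = min over i of gamma_hat_i: the worst case over the training set.
    return min(functional_margin(w, b, x, y) for x, y in zip(X, Y))
```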
So if you have a training set, if you have more than one training example, I'm going to define the functional margin with respect to the entire training set as the worst case of all of your functional margins over the entire training set. And so for now we should think of the functional margin as capturing the intuition that we would like the margin to be large, and for our purposes, for now, let's just say we would like the worst-case functional margin to be large, okay? And we'll change this a little bit later as well. Now, it turns out that there's one little problem with this intuition that will, sort of, bite us later, which is that it actually turns out to be very easy to make the functional margin large, all right? So, for example, say I have a classifier with parameters W and B. If I take W and multiply it by two and take B and multiply it by two, then if you refer to the definition of the functional margin - I guess that was, what, gamma I, gamma hat I equals YI times W transpose XI plus B - if I double W and B, then I can easily double my functional margin.
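The scaling problem just described is easy to check numerically; this is just a sanity check with made-up numbers, not anything from the lecture:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 1.0])
y = 1.0

gamma_hat = y * (np.dot(w, x) + b)
# Doubling both w and b doubles the functional margin, even though the
# decision boundary, and hence every prediction, is completely unchanged.
gamma_hat_doubled = y * (np.dot(2 * w, x) + 2 * b)
```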
So this goal of making the functional margin large, in and of itself, isn't so useful because it's easy to make the functional margin arbitrarily large just by scaling up the parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that the norm, the two-norm, of the parameter W is equal to one, and we'll come back to this in a second. All right. And then so - okay. Now, let's talk about - let's see how much time we have, 15 minutes. Well, see, I'm trying to decide how much to try to do in the last 15 minutes. Okay. So let's talk about the geometric margin, and so the geometric margin of a training example - [inaudible], right? So the decision boundary of my classifier is going to be given by the plane W transpose X plus B is equal to zero, okay? Right, and these are the X1, X2 axes, say, and we're going to draw relatively few training examples here. Let's say I'm drawing deliberately few training examples so that I can add things to this, okay?
And so, assuming we classified an example correctly, I'm going to define the geometric margin as just the geometric distance between a point - between the training example XI, YI - and this separating line, given by this separating hyperplane, okay? That's what I'm going to define the geometric margin to be. And so I'm gonna do some algebra fairly quickly. In case it doesn't make sense, read through the lecture notes more carefully for the details. Sort of, by standard geometry, the normal, or in other words, the vector that's 90 degrees to the separating hyperplane, is going to be given by W divided by the norm of W; that's just how planes in high dimensions work. If some of this stuff isn't familiar, take a look at the lecture notes on the website. And so let's say this distance is gamma I, okay? And so I'm going to use the convention that I'll put a hat on top where I'm referring to functional margins, and no hat on top for geometric margins.
So let's say the geometric margin of this example is gamma I. That means that this point here, right, is going to be XI minus gamma I times W over the norm of W, okay? Because W over the norm of W is the unit vector, the length-one vector, that is normal to the separating hyperplane, and so when we subtract gamma I times the unit vector from this point XI - this point here is XI, so XI minus, you know, this little vector here, is going to be this point that I've drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I times W over the norm of W, okay? And so because this heavy point is on the separating hyperplane, right, this point must satisfy W transpose times that point equals zero, right? Because all points X on the separating hyperplane satisfy the equation W transpose X plus B equals zero, and so this point is on the separating hyperplane, therefore it must satisfy W transpose this point - oh, excuse me.
Plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, I'm, sort of, being slightly fast with this geometry. So if you're not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes. And so what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? So this equation I just wrote down - solve this equation for gamma, or gamma I, and you find that - you saw the previous equation for gamma I - well, why don't I just do it? You have W transpose XI plus B equals gamma I times W transpose W over the norm of W; that's just equal to gamma I times the norm of W, because W transpose W is the norm of W squared, and, therefore, gamma I is just W over the norm of W, transpose XI, plus B over the norm of W, okay?
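This derivation can be checked numerically. In the sketch below (assumed toy values, my own illustration), stepping back by gamma times the unit normal from a point lands exactly on the hyperplane:

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
x = np.array([3.0, 3.0])   # a correctly classified positive example

# gamma_i = (w/||w||)'x_i + b/||w||, the distance just derived
gamma = (np.dot(w, x) + b) / np.linalg.norm(w)

# The foot of the perpendicular, x - gamma * w/||w||, lies on w'x + b = 0.
foot = x - gamma * w / np.linalg.norm(w)
```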
And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyperplane defined by the parameters W and B can be computed by this formula, okay? So the last thing I want to do is actually take into account the sign of the - the correct classification of the training example. So I've been assuming that we've been classifying an example correctly. So, more generally, I'll define the geometric margin of a training example to be gamma I equals YI times that thing on top, okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so as before, you know, this says that we would like the geometric margin to be large, and all that means is that, so long as we're classifying the example correctly, we would ideally hope for the example to be as far as possible from the separating hyperplane, so long as it's on the right side of the separating hyperplane, and that's what multiplying YI into this does.
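Putting in the label, the geometric margin of a single example can be sketched as follows (again my own illustration, with made-up numbers). Note that, unlike the functional margin, it does not change when W and B are rescaled:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    # gamma_i = y_i * ((w/||w||)'x_i + b/||w||),
    # i.e. the functional margin divided by the norm of w.
    norm_w = np.linalg.norm(w)
    return y * (np.dot(w, x) + b) / norm_w
```

For instance, replacing (w, b) by (10w, 10b) leaves the geometric margin unchanged, which is exactly the scale invariance discussed below.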
And so a couple of easy facts: one is, if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily; and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Let's see, okay. And so one final definition: so far I've defined the geometric margin with respect to a single training example, and so, as before, I'll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right? And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and so let me just write that down. The maximum margin classifier poses the following optimization problem. It says choose gamma, W, and B so as to maximize the geometric margin, subject to that YI times - well, this is just one way to write it, subject to - actually, do I write it like that? Yeah, fine.
There are several ways to write this, and one of the things we'll do next time is actually - I'm trying to figure out if I can do this in five minutes. I'm guessing this could be difficult. Well, so this maximum margin classifier is a maximization problem over the parameters gamma, W, and B, and for now, it turns out that the geometric margin doesn't change depending on the norm of W, right? Because in the definition of the geometric margin, notice that we're dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesn't change the geometric margin. This will actually be important, and we'll come back to this later. Notice that you can take the parameters W, B, and you can impose any normalization constraint on them, or you can change W and B by any scaling factor and replace them by ten W and ten B, whatever, and it does not change the geometric margin, okay?
And so in this first formulation, I'm just gonna impose a constraint and say that the norm of W is one, and so the functional and the geometric margins will be the same, and then we'll say: maximize the geometric margin - you maximize gamma subject to the constraint that every training example must have functional margin at least gamma, and this is a geometric margin because, when the norm of W is equal to one, the functional and the geometric margins are identical, okay? So this is the maximum margin classifier, and it turns out that if you do this, it'll run, you know, maybe about as well as - maybe comparable to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to allow us to change this algorithm to let it work in infinite-dimensional feature spaces and come up with very efficient non-linear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step. So are there questions about this? Yeah. Student:[Off mic].
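Written out (in the standard notation for this formulation, with m denoting the number of training examples), this first maximization problem is:

```latex
\begin{aligned}
\max_{\gamma,\, w,\, b}\quad & \gamma \\
\text{s.t.}\quad & y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \gamma, \quad i = 1, \dots, m, \\
& \lVert w \rVert = 1.
\end{aligned}
```

The constraint ||w|| = 1 is what makes each functional margin on the left-hand side equal to a geometric margin.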
Instructor (Andrew Ng): For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything I've said is for a fixed training set, so you can't change the X's, and you can't change the Y's. Are there other questions? Okay. So, all right. Next week we will take this, and we'll talk about optimization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms, and just a final reminder again: the next discussion section will be on Matlab and Octave. So show up for that if you want to see a tutorial. Okay. See you guys in the next class.