This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today is talk about a different type of learning algorithm and, in particular, start to talk about generative learning algorithms and a specific algorithm called Gaussian Discriminant Analysis. We'll take a slight digression to talk about Gaussians, I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of Naive Bayes and Laplace smoothing.

So just to motivate our discussion of generative learning algorithms, by way of contrast, the sort of classification algorithms we've been talking about, I think of as algorithms that do this: you're given a training set, and you run an algorithm, say logistic regression, on that training set. The way I think of logistic regression is that it looks at the data and tries to find a straight line to divide the crosses and O's. So it's trying to find a straight line, let me just make the data a bit noisier, that separates out the positive and the negative classes as well as possible. And, in fact, I'll show it on the laptop; maybe just use the screens or the small monitors for this. You can see the data set there with logistic regression, and I've initialized the parameters randomly, so the hypothesis that logistic regression outputs at iteration zero is the straight line shown in the bottom right. After one iteration of gradient descent, the straight line changes a bit; after two iterations, three, four, until logistic regression converges and has found a straight line that, more or less, separates the positive and negative classes, okay? So you can think of this as logistic regression, sort of, searching for a line that separates the positive and the negative classes.
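As a minimal sketch of the iterative search just described (this code is not from the lecture; the synthetic 2D data, the zero initialization, and the learning rate are made-up choices), here is logistic regression fit by batch gradient ascent, with the learned parameters defining the separating line:

```python
import numpy as np

# Minimal sketch: batch gradient ascent on the logistic-regression log-likelihood,
# using a small synthetic 2D dataset (the data here is made up for illustration).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(20, 2))    # the "crosses"
X_neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(20, 2))  # the "O's"
X = np.vstack([X_pos, X_neg])
X = np.hstack([np.ones((X.shape[0], 1)), X])                   # add intercept term
y = np.concatenate([np.ones(20), np.zeros(20)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(3)        # parameters (initialized to zero here rather than randomly)
alpha = 0.1                # learning rate
for it in range(100):      # each iteration nudges the separating line a bit
    h = sigmoid(X @ theta)
    theta += alpha * X.T @ (y - h)   # gradient of the log-likelihood

# theta now defines the line theta0 + theta1*x1 + theta2*x2 = 0 separating the classes.
print(theta)
```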
What I want to do today is talk about an algorithm that does something slightly different, and to motivate it, let's use our old example of classifying tumors as malignant or benign. So a patient comes in with a tumor, and you want to know if it's malignant, a harmful cancer, or benign, meaning a harmless one. Rather than trying to find a straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers, all the positive examples, and build a model for what malignant cancers look like. Then we'll go through our training set again, take all of the examples of benign cancers, and build a model for what benign cancers look like, okay? And then, when you need to classify a new example, when you have a new patient and you want to decide whether the cancer is malignant or benign, you take the new case, match it against your model of malignant cancers, match it against your model of benign cancers, and see which model it matches better; depending on which model it matches better, you then predict whether the new cancer is malignant or benign, okay?

So what I just described, this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers, is called a generative learning algorithm, and let me just, kind of, formalize this. The models we've been talking about previously were actually all discriminative learning algorithms, and stated more formally, a discriminative learning algorithm is one that either learns P(y | x) directly, or learns a hypothesis that outputs values 0 or 1 directly, okay? So logistic regression is an example of a discriminative learning algorithm. In contrast, a generative learning algorithm models P(x | y):
the probability of the features given the class label. As a technical detail, it also models P(y), but that's the less important part. The interpretation is that a generative model builds a probabilistic model of what the features look like, conditioned on the class label, okay? In other words, conditioned on whether a cancer is malignant or benign, it models a probability distribution over what the features of the cancer look like. Then, having built a model for P(x | y) and P(y), by Bayes' rule you can obviously compute P(y = 1 | x). This is just

P(y = 1 \mid x) = \frac{P(x \mid y = 1)\, P(y = 1)}{P(x)},

and, if necessary, you can calculate the denominator using

P(x) = P(x \mid y = 1)\, P(y = 1) + P(x \mid y = 0)\, P(y = 0),

right? And so by modeling P(x | y) and modeling P(y), you can use Bayes' rule to get back to P(y | x), but a generative learning algorithm starts by modeling P(x | y), rather than P(y | x), okay? We'll talk about some of the tradeoffs, and why this may be a better or worse idea than a discriminative model, a bit later.
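Here is a minimal numeric sketch of the Bayes-rule step above (the class-conditional density values and the prior are made up for illustration):

```python
# Minimal sketch of the Bayes-rule step: turn class-conditional densities p(x|y)
# and a class prior p(y) into a posterior p(y=1|x). The numbers here are made up.
p_x_given_y1 = 0.30   # density of the features under the "malignant" model, p(x|y=1)
p_x_given_y0 = 0.05   # density of the features under the "benign" model, p(x|y=0)
p_y1 = 0.2            # class prior p(y=1); p(y=0) = 1 - p_y1

p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)   # denominator p(x)
p_y1_given_x = p_x_given_y1 * p_y1 / p_x                # Bayes' rule
print(p_y1_given_x)   # 0.6: this example "matches" the malignant model better
```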
Let's go through a specific example of a generative learning algorithm, and for this specific motivating example, I'm going to assume that your input features x are in R^n and are continuous values, okay? Under this assumption, let me describe to you a specific algorithm called Gaussian Discriminant Analysis, and the core assumption is that in the Gaussian discriminant analysis model we're going to assume that P(x | y) is Gaussian, okay? So, actually, just raise your hand: how many of you have seen a multivariate Gaussian before, not a 1D Gaussian, but the higher-dimensional one? Okay, cool, maybe half of you, two-thirds of you. So let me just say a few words about Gaussians, and for those of you that have seen it before, it'll be a refresher.

So we say that a random variable Z is distributed multivariate Gaussian, written Z ~ N(μ, Σ), with the script N for normal, with parameters mean μ and covariance matrix Σ, if Z has density

p(z) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right),

okay? That's the formula for the density; it's the generalization of the one-dimensional Gaussian, the familiar bell-shaped curve, to a higher-dimensional, vector-valued random variable Z. Don't worry too much about this formula for the density; you rarely end up needing to use it. The two key quantities are this vector μ, which is the mean of the Gaussian, and this matrix Σ, which is the covariance matrix. And so Σ will be equal to, right, the definition of the covariance of a vector-valued random variable:

\Sigma = E\big[ (Z - \mu)(Z - \mu)^T \big],

okay? And, actually, if this doesn't look familiar to you, you might re-watch the discussion section that the TAs held last Friday, or the one that they'll be holding later this week, on, sort of, a recap of probability, okay?
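A minimal sketch of the density formula above, assuming plain NumPy (the helper name gaussian_density is mine, not from the lecture):

```python
import numpy as np

def gaussian_density(z, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at z (minimal sketch)."""
    n = mu.shape[0]
    diff = z - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Example: standard normal in 2D, evaluated at the mean.
mu = np.zeros(2)
Sigma = np.eye(2)
print(gaussian_density(np.zeros(2), mu, Sigma))   # 1/(2*pi), about 0.159
```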
So the multivariate Gaussian is parameterized by a mean and a covariance, and let me just, can I have the laptop displayed, please? I'll just go ahead and show you graphically the effects of varying the parameters of a Gaussian. So what I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity; the covariance matrix is shown in the upper right-hand corner of the slide, and there's the familiar bell-shaped curve in two dimensions. If I shrink the covariance matrix, instead of the identity, then the Gaussian becomes more peaked, and if I widen the covariance, say to 2 times the identity, then the density becomes more spread out, okay? That was the standard normal: zero mean, identity covariance.

If I increase the off-diagonal entries of the covariance matrix, I make the variables correlated, and the Gaussian becomes flattened out along the x = y direction; increase them even further, and my two variables, Z1 and Z2 on the horizontal axes, become even more correlated. I'll just show the same thing in contours. The standard normal distribution has contours that are actually circles; because of the aspect ratio these look like ellipses, but they should actually be circles. And if you increase the off-diagonal entries of the Gaussian's covariance matrix, the contours become ellipses aligned along the, sort of, 45-degree direction in this example. This is the same thing. Here's an example of a Gaussian density with negative covariances, so now the correlation goes the other way; that's an even stronger negative covariance, and the same thing in contours: a Gaussian with negative off-diagonal entries, and with even larger off-diagonal entries, okay? And the other parameter of the Gaussian is the mean, so starting from μ = 0, as you change the mean parameter, the location of the Gaussian just moves around, okay?
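As a small aside on the covariance demo above (not from the lecture; the off-diagonal values below are arbitrary), sampling from Gaussians with different off-diagonal entries shows how those entries control the correlation between Z1 and Z2:

```python
import numpy as np

# Minimal sketch: the off-diagonal entries of the covariance matrix control how
# correlated the two components Z1, Z2 are (values below are just for illustration).
rng = np.random.default_rng(0)
for off_diag in [0.0, 0.5, 0.8, -0.8]:
    Sigma = np.array([[1.0, off_diag],
                      [off_diag, 1.0]])
    Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)
    corr = np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]
    print(f"off-diagonal {off_diag:+.1f} -> empirical correlation {corr:+.2f}")
```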
All right. So that was a quick primer on what Gaussians look like, and here's a picture to keep in mind as a roadmap when we describe the Gaussian discriminant analysis algorithm; this is what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm, I'm going to look at the positive examples, say the crosses, and, looking only at the positive examples, fit a Gaussian distribution to them, and so maybe I end up with a Gaussian distribution like that, okay? So there's P(x | y = 1). And then I'll look at the negative examples, the O's in this figure, and fit a Gaussian to those, and maybe I get a Gaussian centered over there; that's the contour of my second Gaussian. And together, we'll say how later, these two Gaussian densities will define a separator for these two classes, okay? And it'll turn out that the separator will be a little bit different from what logistic regression gives you: if you run logistic regression, you actually get the decision boundary shown as the green line, whereas Gaussian discriminant analysis gives you the blue line, okay? Switch back to the chalkboard, please.

All right. Here's the Gaussian discriminant analysis model. We're going to model P(y) as a Bernoulli random variable as usual, parameterized by a parameter φ; you've seen this before. And we model P(x | y = 0) as a Gaussian. Oh, you know what? Yes, excuse me, I thought this looked strange: there should be a determinant of Σ raised to the one-half in the denominator there. It's no big deal; I was missing the determinant of Σ to the one-half on a previous board, excuse me. Okay, and so I model P(x | y = 0) as a Gaussian with mean μ0 and covariance Σ, and P(x | y = 1) as a Gaussian with mean μ1 and the same covariance Σ, okay?

And so the parameters of this model are φ, μ0, μ1, and Σ, and I can now write down the likelihood of the parameters, or, excuse me, the log-likelihood of the parameters, as the log of that, right? So, in other words, given the training set, I can write down the log-likelihood of the parameters as the log of the product of the probabilities P(x^(i), y^(i)):

\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} P\big(x^{(i)}, y^{(i)}\big) = \log \prod_{i=1}^{m} P\big(x^{(i)} \mid y^{(i)}\big)\, P\big(y^{(i)}\big),

where each of these terms, P(x^(i) | y^(i)) and P(y^(i)), is given by one of the three equations on top, okay?
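A minimal sketch of this joint log-likelihood, assuming NumPy arrays X (m by n) and y of 0/1 labels; the helper names are mine, not from the lecture:

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """Log of the multivariate Gaussian density N(mu, Sigma) at x."""
    n = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def gda_log_likelihood(X, y, phi, mu0, mu1, Sigma):
    """Joint log-likelihood sum_i [ log P(x_i | y_i) + log P(y_i) ] for the GDA model."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        mu = mu1 if y_i == 1 else mu0
        total += gaussian_log_density(x_i, mu, Sigma)       # log P(x_i | y_i)
        total += np.log(phi if y_i == 1 else 1.0 - phi)     # log P(y_i), Bernoulli(phi)
    return total
```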
And I just want to contrast this again with discriminative learning algorithms, right? So, to give this a name, I guess this is sometimes called the joint data likelihood, the joint likelihood, and let me contrast it with what we had previously when we were talking about logistic regression, where I said the log-likelihood of the parameters θ was

\ell(\theta) = \log \prod_{i=1}^{m} P\big(y^{(i)} \mid x^{(i)}; \theta\big),

right? So back when we were fitting logistic regression models or generalized linear models, we were always modeling P(y^(i) | x^(i)), parameterized by θ, and that was the conditional likelihood, okay, in which we model P(y^(i) | x^(i)), whereas, now, in generative learning algorithms, we're going to look at the joint likelihood, which is P(x^(i), y^(i)), okay?

So let's see. Given the training set, and using the Gaussian discriminant analysis model, to fit the parameters of the model we'll do maximum likelihood estimation as usual, so you maximize ℓ with respect to the parameters φ, μ0, μ1, Σ, and if you find the maximum likelihood estimates of the parameters, you find that φ is, well, the maximum likelihood estimate is actually no surprise, and I'm writing this down mainly as practice with indicator notation. The maximum likelihood estimate for φ is

\phi = \frac{1}{m}\sum_{i=1}^{m} y^{(i)}, \qquad \text{or written alternatively,} \qquad \phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\}.

In other words, the maximum likelihood estimate for the Bernoulli parameter φ is just the fraction of training examples with label one, with y equal to 1. The maximum likelihood estimate for μ0 is this:

\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}},

okay? You should stare at this for a second and see if it makes sense; actually, I'll just write down the analogous one for μ1 while you do that, okay? So what this is: the denominator is the sum over your training set of the indicator that y^(i) = 0.
So for every training example for which y^(i) = 0, this increments the count by one, all right? So the denominator is just the number of examples with label zero, all right? And then the numerator: it's a sum from i = 1 to m, and every time y^(i) is equal to 0 this indicator is a one, and otherwise it's zero, so the indicator function means you're including only the terms for which y^(i) is equal to zero, because for all the terms where y^(i) is equal to one the summand is equal to zero. And then you multiply that by x^(i), and so the numerator is really the sum of the x^(i)'s corresponding to examples where the class label was zero, okay? Raise your hand if this makes sense. Okay, cool.

So just to say this less formally, this just means: look through your training set, find all the examples for which y = 0, and take the average of the value of x over all the examples with y = 0. So take all your negative training examples and average their values of x, and that's μ0, okay? If this notation is still a little bit cryptic, if you're still not sure why this equation translates into what I just said, do go home and stare at it for a while until it just makes sense. This is, sort of, no surprise; it just says that to estimate the mean for the negative examples, you take all your negative examples and average them. So no surprise, but it's useful practice with indicator notation. You can similarly [inaudible] derive the maximum likelihood estimate for Σ; I won't do that here. You can read that in the notes yourself.
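A minimal sketch of these maximum likelihood estimates, assuming NumPy arrays X (m by n) and y of 0/1 labels; the function name fit_gda is mine, and the shared-covariance formula follows the course notes rather than being derived here:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for the GDA parameters (minimal sketch).

    X: (m, n) array of features; y: (m,) array of 0/1 labels.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)                        # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)                 # average x over the negative examples
    mu1 = X[y == 1].mean(axis=0)                 # average x over the positive examples
    mus = np.where((y == 1)[:, None], mu1, mu0)  # mu_{y^(i)} for each example
    diffs = X - mus
    Sigma = diffs.T @ diffs / m                  # shared covariance (see the notes)
    return phi, mu0, mu1, Sigma
```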
And so, having fit the parameters φ, μ0, μ1, and Σ to your data, you now need to make a prediction. You know, when you're given a new value of x, when you're given a new cancer, you need to predict whether it's malignant or benign. Your prediction is then going to be, let's say, the most likely value of y given x; I should really write a semicolon and the parameters there. I'll just write that out: it's the arg max over y of P(y | x), which, by Bayes' rule, is

\arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)},

all right? And that is, in turn, just \arg\max_y P(x \mid y)\, P(y), because the denominator P(x) doesn't depend on y. And if P(y) is uniform, in other words, if each of your classes is equally likely, so if P(y) takes the same value for all values of y, then this is just \arg\max_y P(x \mid y), okay? That happens sometimes, maybe not very often, so usually you end up using the formula where you compute P(x | y) and P(y) using your model, okay?

Student: Can you explain the arg max?

Instructor (Andrew Ng): Oh, let's see. So the arg max means the value of the argument that maximizes (or, for arg min, minimizes) the expression. Just as an example, the min over x of (x - 5)^2 is 0, because by choosing x equal to 5 you can get this to be zero, and the arg min over x of (x - 5)^2 is equal to 5, because 5 is the value of x that minimizes it, okay? Cool. Thanks for asking that. Okay, any other questions about this? Yeah?

Student: Why is it uniform? Why isn't [inaudible]?

Instructor (Andrew Ng): Oh, I see. By uniform I meant, I was being loose here, I meant if P(y = 0) is equal to P(y = 1), that is, if y has the uniform distribution over the set {0, 1}. That's all I mean, see? Anything else? All right. Okay.
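A minimal sketch of this prediction rule, assuming SciPy for the Gaussian density and parameters such as those fit above (the helper name gda_predict is mine, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(x, phi, mu0, mu1, Sigma):
    """Predict y in {0, 1} as arg max_y P(x|y) P(y); the P(x) denominator is
    dropped because it doesn't depend on y (minimal sketch)."""
    score0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma) * (1 - phi)  # P(x|y=0) P(y=0)
    score1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * phi        # P(x|y=1) P(y=1)
    return int(score1 > score0)
```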
So it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression. Let me illustrate that. So let's say you have a training set; actually, let me just go ahead and draw a 1D training set, and that will kind of work, yes, okay. So let's say we have a training set comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis. So I'll fit a Gaussian density to each of these two classes, to my positive and my negative training examples, and maybe my positive examples, the X's, are fit with a Gaussian like this, and my negative examples are fit with a Gaussian that looks like that, okay?

Now, I hope this [inaudible]. Now, let's vary x along the horizontal axis, and what I want to do is overlay on top of this plot a plot of P(y = 1 | x) for a variety of values of x, okay? Actually, I realize what I should have done: I'm going to call the X's the negative examples and the O's the positive examples; it just makes this part come out better.

So let's take a value of x that's fairly small, say x is this value here on the horizontal axis. Then what's the probability of y being equal to one, conditioned on x? Well, the way you calculate that is you write P(y = 1 | x), and then you plug in all these formulas as usual, right? It's P(x | y = 1), which is your Gaussian density, times P(y = 1), which is essentially just going to be equal to φ, and then divided by P(x), and this shows you how you can calculate it. By using these two Gaussians and my φ for P(y), I can actually compute what P(y = 1 | x) is, and in this case, if x is this small, clearly it belongs to the left Gaussian; it's very unlikely to belong to the positive class, and so P(y = 1 | x) will be very small, very close to zero, say, okay? Then we can increment the value of x a bit and look at a different value of x, and plot P(y = 1 | x) again, and, again, it'll be pretty small. Now let's use a point like that, right? At this point the two Gaussian densities have equal value, and if I ask, for the value of x shown by the arrow, what's the probability of y being equal to one? Well, you really can't tell, so maybe it's about 0.5, okay?
And if you fill in a bunch more points, you get a curve like that, and then you can keep going. Let's say for a point like that, you can ask what's the probability of y being one. Well, if x is that far out, then clearly it belongs to the rightmost Gaussian, and so the probability of y being one would be very high, almost one, okay? And so you can repeat this exercise for a bunch of points: compute P(y = 1 | x) for a bunch of values of x, and if you connect up these points, you find that the curve you get takes the form of a sigmoid function, okay?

So, in other words, when you make the assumptions of the Gaussian discriminant analysis model, that P(x | y) is Gaussian, and you go back and compute what P(y = 1 | x) is, you actually get back exactly the same sigmoid functional form that we were using in logistic regression, okay? But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and a different steepness of the sigmoid than logistic regression would. Is there a question?

Student: I'm just wondering, the Gaussian of P of Y [inaudible] you do?

Instructor (Andrew Ng): No, let's see. So this Gaussian is P(x | y = 1), and this Gaussian is P(x | y = 0); does that make sense? Anything else?

Student: When you were drawing all the dots, how did you decide what Y given P of X was?

Instructor (Andrew Ng): What, say that again?

Student: I'm sorry. Could you go over how you figured out where to draw each dot?

Instructor (Andrew Ng): Let's see, okay. So the computation is as follows, right? The steps are: I have the training set, and given my training set, I'm going to fit a Gaussian discriminant analysis model to it, and what that means is I'll build a model for P(x | y = 1), I'll build a model for P(x | y = 0), and I'll also fit a Bernoulli distribution to P(y), okay?
So, in other words, given my training set, I'll fit P(x | y) and P(y) to my data, and now I've chosen my parameters φ, μ0, μ1, and Σ, okay? Then this is the process I went through to plot all these dots: I pick a point on the x axis, and I compute P(y = 1 | x) for that value of x, and P(y = 1 | x) will be some value between zero and one, some real number, and whatever that real number is, I plot it on the vertical axis, okay? And the way I compute P(y = 1 | x) is I use these quantities: I use P(x | y) and P(y) and, sort of, plug them into Bayes' rule, and that allows me to compute P(y | x) from these quantities; does that make sense?

Student: Yeah. And how did you model P of X; is that...

Instructor (Andrew Ng): Oh, okay. Yeah, so, well, we've got this right here. So P(x) can be written as

P(x) = P(x \mid y = 0)\, P(y = 0) + P(x \mid y = 1)\, P(y = 1),

right? And each of these terms, P(x | y) and P(y), is something I can get directly out of my Gaussian discriminant analysis model; each of these terms is something the model gives me directly. So I plug that in as the denominator, and, by doing that, that's how I compute P(y = 1 | x); make sense?

Student: Thank you.

Instructor (Andrew Ng): Okay. Cool.
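Here is a minimal 1D sketch of the curve-plotting exercise just described (the means, variance, and prior below are made up); it reproduces the sigmoid shape numerically:

```python
import numpy as np
from scipy.stats import norm

# Minimal 1D sketch: two Gaussian class-conditionals plus a Bernoulli prior give,
# via Bayes' rule, a sigmoid-shaped curve for P(y=1|x). Parameter values are made up.
mu0, mu1, sigma, phi = -2.0, 2.0, 1.0, 0.5

xs = np.linspace(-6, 6, 13)
p_x_y0 = norm.pdf(xs, loc=mu0, scale=sigma)          # P(x | y = 0)
p_x_y1 = norm.pdf(xs, loc=mu1, scale=sigma)          # P(x | y = 1)
posterior = p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))   # P(y = 1 | x)

for x, p in zip(xs, posterior):
    print(f"x = {x:+5.1f}   P(y=1|x) = {p:.3f}")     # rises from ~0 to ~1 like a sigmoid
```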
So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm, okay? So, in the particular case of Gaussian discriminant analysis, we assume that x conditioned on y is Gaussian, and the argument I showed on the previous chalkboard, which I didn't prove formally, you can go home and prove it yourself, is that if you assume x given y is Gaussian, then that implies that the posterior distribution, the form of P(y = 1 | x), is going to be a logistic function, a logistic posterior.

And it turns out this implication in the opposite direction does not hold true, okay? In particular, and this is actually kind of cool, it turns out that if you assume x given y = 1 is Poisson with parameter λ1, and x given y = 0 is Poisson with parameter λ0, then that also implies that P(y | x) is logistic, okay? So there are lots of assumptions on x given y that will lead to P(y | x) being logistic, and, therefore, the assumption that x given y is Gaussian is a stronger assumption than the assumption that y given x is logistic, okay? Because the first implies the second, right? The Gaussian assumption is the stronger one, because the logistic posterior holds whenever x given y is Gaussian, but not vice versa.
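A short sketch of the algebra behind the forward implication, under the shared-covariance GDA assumptions (the lecture states the result but leaves the proof as an exercise):

```latex
% Why shared-covariance Gaussian class-conditionals give a logistic posterior (sketch).
P(y=1 \mid x)
  = \frac{P(x \mid y=1)\,\phi}{P(x \mid y=1)\,\phi + P(x \mid y=0)\,(1-\phi)}
  = \frac{1}{1 + \frac{1-\phi}{\phi}\cdot\frac{P(x \mid y=0)}{P(x \mid y=1)}}.

% Because both class-conditionals share the same \Sigma, the quadratic terms cancel:
\frac{P(x \mid y=0)}{P(x \mid y=1)}
  = \exp\!\Big( (\mu_0-\mu_1)^T \Sigma^{-1} x
      + \tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 - \tfrac{1}{2}\mu_0^T \Sigma^{-1}\mu_0 \Big),

% which is the exponential of a linear function of x, so
P(y=1 \mid x) = \frac{1}{1 + \exp(-\theta^T x - \theta_0)},
\qquad \theta = \Sigma^{-1}(\mu_1 - \mu_0),

% with \theta_0 collecting the constant terms: exactly a logistic function of x.
```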
And so this leads to some of the tradeoffs between Gaussian discriminant analysis and logistic regression, right? Gaussian discriminant analysis makes a much stronger assumption, that x given y is Gaussian, and so when this assumption is true, or approximately holds, if you plot the data and x given y is indeed approximately Gaussian, then if you make this assumption explicit to the algorithm, the algorithm will do better, because it's as if the algorithm is making use of more information about the data; the algorithm knows that the data is Gaussian, right? And so if the Gaussian assumption holds, or roughly holds, then Gaussian discriminant analysis may do better than logistic regression.

If, conversely, you're actually not sure what x given y is, then logistic regression, the discriminative algorithm, may do better. In particular, suppose you use logistic regression, and maybe you [inaudible] the data was Gaussian, but it turns out the data was actually Poisson, right? Then logistic regression will still do perfectly fine, because if the data were actually Poisson, then P(y = 1 | x) will still be logistic, and it'll do perfectly fine; but if you had assumed it was Gaussian, then the algorithm may go off and do something that's not as good, okay?

So it turns out that, right, it's slightly different. It turns out the real advantage of generative learning algorithms is often that they require less data. In particular, data is never really exactly Gaussian, right? Data is often approximately Gaussian; it's never exactly Gaussian. And it turns out generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But the tradeoff is that by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data in order to fit, like, an okay model, even if there's less training data. Whereas, in contrast, logistic regression, by making fewer assumptions, is more robust to your modeling assumptions, because you're making a weaker assumption, you're making fewer assumptions, but sometimes it takes a slightly larger training set to fit than Gaussian discriminant analysis. Question?

Student: In order to [inaudible] any assumption about the number [inaudible], here we assume that P of Y = 1 equals [inaudible]. Is that true when the number of samples is marginal?

Instructor (Andrew Ng): Okay, so let's see. So there's a question of, is this true, what was that? Let me translate that differently. So the modeling assumptions are made independently of the size of your training set, right?
So, like, in least-squares regression, well, in all of these models, I'm assuming that these are random variables drawn from some distribution, and then, finally, I'm given a single training set, and that's used to fit the parameters of the distribution, right?

Student: So what's the probability of Y = 1?

Instructor (Andrew Ng): The probability of Y = 1?

Student: Yeah, you used the...

Instructor (Andrew Ng): Sort of, this, like, goes back to the philosophy of maximum likelihood estimation, right? I'm assuming that P(y) is equal to φ^y (1 - φ)^(1 - y). So I'm assuming that there's some true value of φ generating all my data, and then, well, when I write this, I guess there are really two values of φ. One is the true underlying value of φ that was used to generate the data, and then there's the maximum likelihood estimate of the value of φ, and so when I was writing those formulas earlier, those formulas for φ, and μ0, and μ1 were really the maximum likelihood estimates of φ, μ0, and μ1, and that's different from the true underlying values of φ, μ0, and μ1, but...

Student: [Off mic].

Instructor (Andrew Ng): Yeah, right. So the maximum likelihood estimate comes from the data, and there's some, sort of, true underlying value of φ that I'm trying to estimate, and my maximum likelihood estimate is my attempt to estimate that true value, but, you know, by notational convention, we often just write the same symbol for both without bothering to distinguish between the maximum likelihood value and the true underlying value that I'm assuming is out there and that I'm only hoping to estimate.

Actually, yeah, so for questions like these about maximum likelihood and so on, I hope to use the Friday discussion section as a good time to ask questions about, sort of, probabilistic definitions like these as well. Are there any other questions? No? Great. Okay.

So, great. Oh, it turns out, just to mention one more thing that's, kind of, cool.
I said that if x given y is Poisson, you also get a logistic posterior; it actually turns out there's a more general version of this. If you assume x given y = 1 is exponential family with natural parameter η1, and you assume x given y = 0 is exponential family with natural parameter η0, then this implies that P(y = 1 | x) is also logistic, okay? And that's kind of cool. It means that x given y could be, I don't know, some strange thing: it could be gamma, because we've seen Gaussian, and right next to that, I don't know, gamma, exponential, there's actually a beta; I'm just rattling off my mental list of exponential family distributions. It could be any one of those things, so if you use [inaudible] the same exponential family distribution for the two classes, with different natural parameters, then the posterior P(y = 1 | x) will be logistic, and so this shows the robustness of logistic regression to the choice of modeling assumptions, because it could be that the data was actually, you know, gamma distributed, and the posterior still turns out to be logistic. So that's the robustness of logistic regression to modeling assumptions.

And this is, I think, the second of the two justifications I promised early on for where I pulled the logistic function out of a hat, right? One was the exponential family derivation we went through last time, and this is, sort of, the second one: all of these modeling assumptions also lead to the logistic function. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, whether P(y = 1 | x) being logistic implies the reverse? No, that's also not true, right? Yeah, so this exponential family assumption implies that P(y = 1 | x) is logistic, but the reverse implication is not true. There are actually all sorts of really bizarre distributions for x that would give rise to a logistic function, okay? Okay.

So, that was our first generative learning algorithm. Let me talk about a second generative learning algorithm; this one is called the Naive Bayes algorithm, and the motivating example that I'm going to use will be spam classification. All right.
So let's say that you want to build a spam classifier to take your incoming stream of email and decide if it's spam or not. So, let's see, y will be 0 or 1, with 1 being spam email and 0 being non-spam, and the first decision we need to make is: given a piece of email, how do you represent it using a feature vector x? Right, email is just a piece of text; email is like a list of words, or a list of ASCII characters. So I can represent email as a feature vector x. We'll use a couple of different representations, but the one I'll use today is the following: we will construct the vector x by going through my dictionary and, sort of, making a listing of all the words in my dictionary, okay? So the first word is, say, "a". The second word in my dictionary is "aardvark", then "aardwolf", okay? You know, and somewhere along the way you see the word "buy", as in the spam email telling you to buy stuff. However you collect your list of words, you won't find CS229, the course number, in a dictionary, but if you collect a list of words from other emails you've gotten, you'd have that somewhere in this list as well, and then the last word in my dictionary is, say, "zymurgy", which pertains to the branch of chemistry that deals with the fermentation process in brewing.

So say I get a piece of email, and what I'll do is scan through this list of words, and wherever a certain word appears in my email, I'll put a 1 there. So if a particular email has the word "a", then that entry is a 1. You know, my email doesn't have the words "aardvark" or "aardwolf", so those get zeros. And again, it's a piece of email where they want me to buy something, CS229 doesn't occur, and so on, okay? So this would be one way of creating a feature vector to represent a piece of email.
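A minimal sketch of this feature construction (the toy dictionary and email below are made up; a real dictionary would have tens of thousands of words):

```python
# Minimal sketch of the binary "bag of words" feature vector described above.
dictionary = ["a", "aardvark", "aardwolf", "buy", "cs229", "zymurgy"]

def email_to_features(email_text):
    words_in_email = set(email_text.lower().split())
    # x_i = 1 if dictionary word i appears in the email, else 0
    return [1 if word in words_in_email else 0 for word in dictionary]

print(email_to_features("Buy now and get a great deal"))   # [1, 0, 0, 1, 0, 0]
```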
And my feature vectors are going to be in {0, 1} to the n — bit vectors, binary-valued vectors, n-dimensional — where n may be on the order of, say, 50,000, if you have 50,000 words in your dictionary, which is not atypical. Vocabulary sizes from, I don't know, the mid-thousands to the tens of thousands are very typical for problems like these. And, therefore, there are 2 to the 50,000 possible values for X, right? Two to the 50,000 possible bit vectors of length 50,000. So one way to model this is with a multinomial distribution, but because there are 2 to the 50,000 possible values for X, I would need 2 to the 50,000 minus 1 parameters — the minus 1 because the probabilities have to sum to 1 — and this is clearly way too many parameters to model using a multinomial distribution over all 2 to the 50,000 possibilities.

So in the Naive Bayes algorithm, we're going to make a very strong assumption on P of X given Y, and, in particular — let me just say what it's called, then I'll write out what it means — I'm going to assume that the X_i's are conditionally independent given Y, okay? Let me say what this means. So I have P of X1, X2, up to X50,000, given Y. By the chain rule of probability, this is P of X1 given Y, times P of X2 given Y, X1, times dot, dot, dot, up to P of X50,000 given Y, X1, through X49,999, okay? So this is just the chain rule of probability; this always holds, and I've not made any assumption yet. And now we're going to make what's called the Naive Bayes assumption — the assumption that the X_i's are conditionally independent given Y. Nothing changes for the first term, but I'm going to assume that P of X2 given Y, X1 is just equal to P of X2 given Y.
I'm going to assume that the next term, P of X3 given Y, X1, X2, is equal to P of X3 given Y, and so on, up to P of X50,000 given Y, okay? Or, written more compactly, this means we assume that P of X1 through X50,000 given Y is the product from i = 1 to 50,000 of P of X_i given Y, okay?

Stating informally what this means: I'm assuming that, so long as you know the class label Y — so long as you know whether this is spam or not spam — then knowing whether the word a appears in the email does not affect the probability of whether the word aardwolf appears in the email, all right? In other words, once you know whether an email is spam or not spam, knowing whether certain words appear in the email won't help you predict whether any other word appears in the email, okay? And, obviously, this assumption is false, right? This assumption can't possibly be true. I mean, if you see the word — I don't know, CS229 — in an email, you're much more likely to see my name in the email, or the TAs' names, or whatever. So this assumption is normally just false for normal written English. But it turns out that, despite this assumption being false in the literal sense, the Naive Bayes algorithm is an extremely effective algorithm for classifying text documents into spam or not spam, for automatically sorting your emails into different folders, for looking at web pages and classifying whether a webpage is trying to sell something, or whatever. It turns out this assumption works very well for classifying text documents, and for other applications too that I'll talk a bit about later.

As a digression that'll make sense only to some of you:
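Written out, here are the chain-rule factorization and the Naive Bayes assumption that replaces it; this is just the board content in symbols.

```latex
\begin{aligned}
p(x_1,\dots,x_{50000} \mid y)
 &= p(x_1 \mid y)\, p(x_2 \mid y, x_1)\,\cdots\, p(x_{50000} \mid y, x_1,\dots,x_{49999})
 &&\text{(chain rule -- always true)}\\
 &= \prod_{i=1}^{50000} p(x_i \mid y)
 &&\text{(Naive Bayes assumption).}
\end{aligned}
```

The parameter-counting payoff: an unrestricted multinomial over {0,1}^n needs 2^n − 1 parameters per class, while the Naive Bayes model needs only n per class plus one for the class prior — 2n + 1, about 100,001 parameters when n = 50,000 (these are the phi parameters defined in a moment).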
Let me just say that if you're familiar with Bayesian networks — with graphical models — the Bayesian network associated with this model looks like this: you have the random variable Y, which then generates X1, X2, through X50,000, okay? If you've not seen a Bayes net before, if you don't know graphical models, just ignore this; it's not important for our purposes. But if you've seen them before, that's what it would look like. Okay.

So the parameters of the model are as follows: phi_{i|y=1}, which is the probability that X_i = 1 given Y = 1; phi_{i|y=0}, which is the probability that X_i = 1 given Y = 0; and phi_y, the probability that Y = 1, okay? So these are the parameters of the model, and therefore, to fit the parameters, you can write down the joint likelihood of the parameters, which is equal to the product over the training examples of P of X^(i), Y^(i), as usual, okay? So given the training set, you can write down the joint likelihood of the parameters, and when you do maximum likelihood estimation, you find that the maximum likelihood estimates of the parameters are really pretty much what you'd expect. The maximum likelihood estimate for phi_{j|y=1} is the sum from i = 1 to m of the indicator that X_j^(i) = 1 and Y^(i) = 1, divided by the sum from i = 1 to m of the indicator that Y^(i) = 1, okay?

Stated more simply, the numerator just says: run through your entire training set of m examples and count up the number of times you saw word j in a piece of email for which the label Y was equal to 1. In other words, look through all your spam emails and count the number of emails in which word j appeared. And the denominator — the sum from i = 1 to m of the indicator Y^(i) = 1 — is just the number of spam emails you got.
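Since the joint likelihood and the full maximum likelihood formulas were mostly written on the board, here they are in symbols; the superscript (i) indexes the m training examples and n is the vocabulary size.

```latex
\begin{aligned}
\mathcal{L}\big(\phi_y,\ \phi_{j|y=0},\ \phi_{j|y=1}\big)
 &= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
  = \prod_{i=1}^{m} p\big(y^{(i)}\big)\prod_{j=1}^{n} p\big(x_j^{(i)} \mid y^{(i)}\big),\\[4pt]
\phi_{j|y=1} &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}},\qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}},\qquad
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}{m}.
\end{aligned}
```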
And so this ratio is: of all the spam emails in your training set, in what fraction of them did word j — the j-th word you wrote in your dictionary — appear? And that's the maximum likelihood estimate for the probability of seeing word j conditioned on the piece of email being spam, okay? And similarly, the maximum likelihood estimates for phi_{j|y=0} and for phi_y are pretty much what you'd expect, right? Okay?

And so, having estimated all these parameters, when you're given a new piece of email that you want to classify, you can then compute P of Y given X using Bayes' rule, right? Same as before, because together these parameters give you a model for P of X given Y and for P of Y, and by using Bayes' rule, given those two terms, you can compute P of Y given X — and there's your spam classifier, okay? It turns out we need one more elaboration to this idea, but let me check if there are questions about this so far.
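Before the questions, here is a minimal sketch of that Bayes'-rule prediction step in code, assuming the three sets of parameters have already been estimated as above; the function and variable names are my own.

```python
def predict_spam_probability(x, phi_y, phi_given_spam, phi_given_ham):
    """Compute p(y=1 | x) by Bayes' rule for a binary feature vector x.

    phi_y            : estimate of p(y=1)
    phi_given_spam[j]: estimate of p(x_j=1 | y=1)
    phi_given_ham[j] : estimate of p(x_j=1 | y=0)
    """
    p_x_given_spam, p_x_given_ham = 1.0, 1.0
    for j, xj in enumerate(x):
        # Naive Bayes: multiply the per-word Bernoulli probabilities.
        p_x_given_spam *= phi_given_spam[j] if xj else (1.0 - phi_given_spam[j])
        p_x_given_ham  *= phi_given_ham[j]  if xj else (1.0 - phi_given_ham[j])
    numerator = p_x_given_spam * phi_y
    denominator = numerator + p_x_given_ham * (1.0 - phi_y)
    # Note: if some parameter is exactly 0, this can become 0/0 -- the problem
    # addressed later in the lecture by Laplace smoothing.
    return numerator / denominator
```

In practice, with tens of thousands of features, you would sum log-probabilities rather than multiply probabilities, to avoid numerical underflow.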
Student: So does this model depend on the number of inputs? Instructor (Andrew Ng): What do you mean by number of inputs — the number of features? Student: No, the number of samples. Instructor (Andrew Ng): Well, m is the number of training examples, so given m training examples, this is the formula for the maximum likelihood estimate of the parameters, right? Other questions — does that make sense? M is the number of training examples, so when you have m training examples, you plug them into this formula, and that's how you compute the maximum likelihood estimates. Student: By training examples, you mean m is the number of emails? Instructor (Andrew Ng): Yeah, right. So it's your training set. I would go through all the email I've gotten in the last two months and label each one as spam or not spam, and so you have, I don't know, a few hundred emails labeled as spam or not spam, and that will comprise your training set X1, Y1 through Xm, Ym, where each X is one of those vectors representing which words appeared in the email and Y is 0 or 1 depending on whether it's spam or not spam, okay?

Student: So you're saying that this model depends on the number of examples, but the previous model didn't — yet your phi is the same for either one? Instructor (Andrew Ng): They're different things, right? There's the model, which is the modeling assumptions being made — I'm making the Naive Bayes assumption, so the probabilistic model is an assumption on the joint distribution of X and Y; that's what the model is. And then I'm given a fixed number of training examples — m training examples — and after I'm given the training set, I go in and write down the maximum likelihood estimates of the parameters, right? So — maybe we should take that offline. Yeah, question?

Student: Then how would you do this, like, if this [inaudible] didn't work? Instructor (Andrew Ng): Say that again. Student: How would you do it, say, with the 50,000 words? Instructor (Andrew Ng): Oh, okay — how to come up with the list of 50,000 words, yeah. It turns out this is a very practical question, really: how do I come up with this list of words? In practice, one common way to do it is to just take all the words that appear in your training set — go through all your emails and take the union of all the words that appear in them. That's one fairly common way to do it. Or, if that turns out to be too many words, you can take all the words that appear at least three times in your training set, so words that you didn't even see three times in the emails you got in the last two months, you discard. I was talking about going through a dictionary, which is a nice way of thinking about it, but in practice you might just go through your training set and take the union of all the words that appear in it. In some of the texts, by the way, you'll see this described as selecting your features, but this is one way to think about creating your feature vector as zero-and-one values, okay?
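Here is a small sketch of the vocabulary construction just described — taking the union of words across the training emails and optionally keeping only words that appear in at least three of them; the threshold of three is the one mentioned above, and the helper name is my own.

```python
from collections import Counter

def build_vocabulary(training_emails, min_count=3):
    """Take the union of words across training emails, optionally dropping rare words.

    training_emails: list of email strings
    min_count      : keep only words appearing in at least this many emails
    """
    document_counts = Counter()
    for email_text in training_emails:
        # Count each word at most once per email (document frequency).
        document_counts.update(set(email_text.lower().split()))
    return sorted(word for word, count in document_counts.items() if count >= min_count)
```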
Moving on — yeah, okay, question? Student: I'm getting kind of confused about how you compute all those parameters. Instructor (Andrew Ng): On how I came up with the parameters? Student: Correct. Instructor (Andrew Ng): Let's see. So in Naive Bayes — the question was how I came up with the parameters, right? In Naive Bayes, I need to build a model for P of X given Y and for P of Y, right? In generative learning algorithms, I need to come up with models for these. So how do I model P of Y? Well, I just chose to model it using a Bernoulli distribution, and so P of Y will be parameterized by phi_y, all right? Student: Okay. Instructor (Andrew Ng): And then how do I model P of X given Y? Well — let me switch pens — my model for P of X given Y, under the Naive Bayes assumption, is that P of X given Y is the product of these per-word probabilities, and so I'm going to need parameters to tell me the probability of each word occurring or not occurring, conditioned on the email being spam or not spam, okay?

Student: How is that Bernoulli? Instructor (Andrew Ng): Oh, because each X_i is either zero or one, right? By the way I defined the feature vectors, X_i is either one or zero depending on whether word i appears in the email. So, by construction, X_i is always zero or one, and if X_i is either zero or one, then it has to follow a Bernoulli distribution, right? If X_i were continuous, then you might model it as a Gaussian, the way we did in Gaussian discriminant analysis. It's just that the way I constructed my features for email, X_i is always binary valued, and so you end up with a Bernoulli here, okay? All right, I should move on.

So it turns out that this idea almost works. Now, here's the problem. Let's say you complete this class and you start to do, maybe, the class project, and you keep working on your class project for a bit, and it becomes really good, and you want to submit your class project to a conference, right? So, you know, around June every year is the submission deadline for the NIPS conference — it's just the name of a conference; it's an acronym.
And so maybe you and your project partners, or senior friends even, say, Hey, let's work on a project and submit it to the NIPS conference. And so you're getting these emails with the word NIPS in them, which you've probably never seen before, and a piece of email comes from your project partner saying, Let's send a paper to the NIPS conference. And then your spam classifier will say — let's say NIPS is the 30,000th word in your dictionary, okay? — P of X30,000 = 1 given Y = 1 is equal to 0. That's the maximum likelihood estimate, right? Because you've never seen the word NIPS before in your training set, the maximum likelihood estimate of the probability of seeing the word NIPS is zero, and, similarly, in non-spam mail, the chance of seeing the word NIPS is also estimated as zero.

So when your spam classifier goes to compute P of Y = 1 given X, it will compute, by Bayes' rule, P of X given Y = 1 times P of Y = 1, divided by P of X given Y = 1 times P of Y = 1 plus P of X given Y = 0 times P of Y = 0, all right? And if you look at those terms — say the product from i = 1 to 50,000 of P of X_i given Y — one of those probabilities will be equal to zero, because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a zero in this product, and so the numerator is zero, and in the same way it turns out the denominator will also be zero — actually, all of these terms end up being zero — so you end up with P of Y = 1 given X equal to 0 over 0 + 0, okay, which is undefined. And the problem with this is that it's just statistically a bad idea to say that P of X30,000 = 1 given Y is 0, right? Just because you haven't seen the word NIPS in your last two months' worth of email, it's statistically not sound to say that, therefore, the chance of ever seeing this word is zero, right?
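Symbolically, the failure looks like this; this writes out the Bayes'-rule computation just described, reconstructing the board content.

```latex
\begin{aligned}
p(y=1 \mid x) &= \frac{\Big(\prod_{i=1}^{50000} p(x_i \mid y=1)\Big)\, p(y=1)}
                     {\Big(\prod_{i=1}^{50000} p(x_i \mid y=1)\Big) p(y=1)
                      + \Big(\prod_{i=1}^{50000} p(x_i \mid y=0)\Big) p(y=0)}\\[4pt]
&= \frac{0}{0 + 0}\qquad\text{since } p(x_{30000}=1 \mid y=1) = p(x_{30000}=1 \mid y=0) = 0.
\end{aligned}
```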
And so the point is that just because you haven't seen something before, that may mean the event is unlikely, but it doesn't mean it's impossible — and saying that because you've never seen the word NIPS before, it's impossible to ever see the word NIPS in future emails, that the chance of that is just zero, is wrong. So we're going to fix this, and to motivate the fix, here's the example I'm going to use. Let's say that you've been following the Stanford basketball team through all of their away games, and have been tracking their wins and losses to gather statistics and, maybe — I don't know — form a betting pool about whether they're likely to win or lose the next game, okay? So these are some of the statistics. On, I guess, the 8th of February last season they played Washington State, and they did not win. On the 11th of February they played Washington, on the 22nd they played USC, then they played UCLA, then they played USC again — and they didn't win any of those away games. And now you want to estimate the chance that they'll win or lose their next away game, against Louisville, right? So they played five away games and didn't win any of them, but it seems awfully harsh to say that there's zero chance that they'll win the next game.

So here's the idea behind Laplace smoothing. We're estimating the probability of Y being equal to one, right? Normally, the maximum likelihood estimate is the number of ones divided by the number of zeros plus the number of ones, okay? I hope this informal notation makes sense: the maximum likelihood estimate for a win-or-loss Bernoulli random variable is just the number of ones you saw divided by the total number of examples, which is the number of zeros you saw plus the number of ones you saw.
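In symbols — with z^(i) denoting the outcome of the i-th away game (1 = win), which is notation I'm introducing here — the maximum likelihood estimate just described is:

```latex
\phi_{\mathrm{MLE}} \;=\; \frac{\sum_{i=1}^{m} 1\{z^{(i)}=1\}}{m}
\;=\; \frac{\#\text{ wins}}{\#\text{ wins} + \#\text{ losses}}
\;=\; \frac{0}{5} \;=\; 0
\qquad\text{for the five away games above.}
```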
So in Laplace smoothing, we're going to take each of these counts — the number of ones and the number of zeros — and add one to each of them. And so in our example, instead of estimating the probability of winning the next game as 0 over 5 + 0, we'll add one to all of these counts, and we say that the chance of their winning the next game is 1/7, okay? Which says that, having seen them lose five away games in a row, we don't think it's terribly likely they'll win the next game, but at least we're not saying it's impossible.

As a historical side note, Laplace actually came up with this method — it's called Laplace smoothing after him — when he was trying to estimate the probability that the sun will rise tomorrow. His rationale was that even though we've seen the sun rise on a great many days, that doesn't mean we can be absolutely certain the sun will rise tomorrow, and he used this method to estimate that probability. It's kind of cool.

More generally, if Y takes on k possible values — if you're trying to estimate the parameters of a multinomial — then the maximum likelihood estimate of P of Y = j is the sum from i = 1 to m of the indicator that Y^(i) = j, divided by m, right? That's the maximum likelihood estimate of the multinomial probability of Y being equal to j. And when you apply Laplace smoothing to that, you add one to the numerator and add k to the denominator, if Y can take on k possible values, okay? So for Naive Bayes, what that gives us is the following.
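Since the final formulas went up on the board, here is a reconstruction in symbols, based on the add-one-to-the-numerator, add-two-to-the-denominator description that follows: the general Laplace-smoothed multinomial estimate, and what it gives for the Naive Bayes word parameters (each x_j is binary, so k = 2).

```latex
\begin{aligned}
\text{Laplace-smoothed multinomial:}\quad
p(y=j) &= \frac{1 + \sum_{i=1}^{m} 1\{y^{(i)}=j\}}{k + m},\qquad j = 1,\dots,k,\\[6pt]
\text{For Naive Bayes:}\quad
\phi_{j|y=1} &= \frac{1 + \sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=1\}}{2 + \sum_{i=1}^{m} 1\{y^{(i)}=1\}},\qquad
\phi_{j|y=0} = \frac{1 + \sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=0\}}{2 + \sum_{i=1}^{m} 1\{y^{(i)}=0\}}.
\end{aligned}
```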
So that was the maximum likelihood estimate, and what you end up doing is adding one to the numerator and adding two to the denominator, and this solves the problem of the zero probabilities: when your friend sends you email about the NIPS conference, your spam filter will still be able to make a meaningful prediction, all right? Okay. Any questions about this? Yeah?

Student: That doesn't quite make sense, because, for instance, if you take the games on the board, it's a reasonable assumption that the probability of winning is very close to zero, so the prediction should essentially be zero. Instructor (Andrew Ng): Right. I would say that in this case the prediction is 1/7, right? If you see somebody lose five games in a row, you may not have a lot of faith in them, but as an extreme example, suppose you saw them lose just one game, right? It's just not reasonable to say that the chance of winning the next game is zero — but that's what the maximum likelihood estimate would say. Student: Yes. Instructor (Andrew Ng): And — Student: In such a case would the learning algorithm [inaudible] or — Instructor (Andrew Ng): So the question is, given just five training examples, what's a reasonable estimate for the chance of winning the next game? And 1/7, I think, is actually pretty reasonable — it's less than 1/5, for instance; we're saying the chance of winning the next game is less than 1/5. It turns out that, under a certain set of Bayesian assumptions about the prior and posterior that I won't go into, this Laplace smoothing actually gives the optimal estimate, in a certain sense I won't go into, of the chance of winning the next game. So under a certain assumption about the Bayesian prior on the parameter, it actually seems like a pretty reasonable estimate to me. Although, I should say — no, I'm just being mean; we actually are a pretty good basketball team, but I chose a losing streak because it's funnier that way. Let's see. Are there other questions about this? No? Okay.
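Putting the pieces of the lecture together, here is a compact, self-contained sketch of fitting Bernoulli Naive Bayes with Laplace smoothing, following the estimates above; the function and variable names are my own, it favors clarity over efficiency, and phi_y is left unsmoothed, as in the lecture.

```python
def fit_naive_bayes(X, y):
    """Fit Bernoulli Naive Bayes with Laplace smoothing.

    X: list of binary feature vectors (lists of 0/1), one per training email
    y: list of labels, 1 = spam, 0 = non-spam
    Returns (phi_y, phi_spam, phi_ham) in the lecture's notation.
    """
    m, n = len(X), len(X[0])
    num_spam = sum(y)
    num_ham = m - num_spam
    phi_y = num_spam / m
    # Laplace smoothing: add 1 to each word count, add 2 to each class count,
    # so no estimated word probability is ever exactly 0 or 1.
    phi_spam = [(1 + sum(X[i][j] for i in range(m) if y[i] == 1)) / (2 + num_spam)
                for j in range(n)]
    phi_ham = [(1 + sum(X[i][j] for i in range(m) if y[i] == 0)) / (2 + num_ham)
               for j in range(n)]
    return phi_y, phi_spam, phi_ham
```

The returned parameters can be plugged directly into the `predict_spam_probability` sketch shown earlier; with the smoothed estimates, the 0/0 failure on previously unseen words can no longer occur.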
So there's more that I want to say about Naive Bayes, but we'll do that in the next lecture. So let's wrap up here.