This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today is talk about a different type of learning algorithm, and, in particular, start to talk about generative learning algorithms and a specific algorithm called Gaussian Discriminant Analysis. We'll take a slight digression to talk about Gaussians, I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of Naive Bayes and Laplace smoothing.

So just to motivate our discussion of generative learning algorithms, by way of contrast, the sort of classification algorithms we've been talking about so far I think of as algorithms that do the following. You're given a training set, and you run an algorithm like logistic regression on that training set. The way I think of logistic regression is that it looks at the data and tries to find a straight line to divide the crosses and O's, right? So it's, sort of, trying to find a straight line, let me just make the data a bit noisier, trying to find a straight line that separates out the positive and the negative classes as well as possible. And, in fact, I can show it on the laptop; maybe just use the screens or the small monitors for this.

In fact, you can see there's the data set with logistic regression, and I've initialized the parameters randomly, so the hypothesis that logistic regression outputs at iteration zero is that straight line shown at the bottom right. After one iteration of gradient descent, the straight line changes a bit. After two iterations, three, four, until logistic regression converges and has found a straight line that, more or less, separates the positive and negative classes, okay? So you can think of this as logistic regression, sort of, searching for a line that separates the positive and the negative classes.

What I want to do today is talk about an algorithm that does something slightly different, and to motivate it, let's use our old example of trying to classify between malignant cancer and benign cancer, right? So a patient comes in and they have a cancer; you want to know if it's a malignant, or harmful, cancer, or if it's a benign, meaning harmless, cancer.
So rather than trying to find a straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers, that is, look through the training set for all the positive examples of malignant cancers, and then build a model for what malignant cancers look like. Then we'll go through our training set again, take out all of the examples of benign cancers, and build a model for what benign cancers look like, okay?

And then, when you need to classify a new example, when you have a new patient and you want to decide whether this cancer is malignant or benign, you take the new cancer, you match it to your model of malignant cancers, you match it to your model of benign cancers, and you see which model it matches better. Depending on which model it matches better, you then predict whether the new cancer is malignant or benign, okay?

So what I just described, this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers, is called a generative learning algorithm, and let me just, kind of, formalize this.

So the models that we've been talking about previously were actually all discriminative learning algorithms. Stated more formally, a discriminative learning algorithm is one that either learns P of Y given X directly, or learns a hypothesis that outputs values 0 or 1 directly, okay? So logistic regression is an example of a discriminative learning algorithm.

In contrast, a generative learning algorithm models P of X given Y, the probability of the features given the class label, and, as a technical detail, it also models P of Y, but that's the less important piece. The interpretation is that a generative model builds a probabilistic model for what the features look like, conditioned on the class label, okay? In other words, conditioned on whether a cancer is malignant or benign, it models a probability distribution over what the features of the cancer look like.

Then, having built this model, having built a model for P of X given Y and for P of Y, by Bayes rule you can compute P of Y = 1 conditioned on X.
This is just P of Y = 1 given X equals P of X given Y = 1 times P of Y = 1, divided by P of X, and, if necessary, you can calculate the denominator as P of X equals P of X given Y = 1 times P of Y = 1, plus P of X given Y = 0 times P of Y = 0, right?

And so by modeling P of X given Y and modeling P of Y, you can actually use Bayes rule to get back to P of Y given X, but a generative learning algorithm starts by modeling P of X given Y, rather than P of Y given X, okay? We'll talk about some of the tradeoffs, and why this may be a better or worse idea than a discriminative model, a bit later.

Let's go through a specific example of a generative learning algorithm, and for this specific motivating example I'm going to assume that your input features X are in R^n and are continuous valued, okay? And under this assumption, let me describe to you a specific algorithm called Gaussian Discriminant Analysis, and the, I guess, core assumption is that in the Gaussian discriminant analysis model we're going to assume that P of X given Y is Gaussian, okay?

So actually, just raise your hand: how many of you have seen a multivariate Gaussian before, not a 1D Gaussian, but the higher-dimensional one? Okay, cool, maybe half of you, two-thirds of you. So let me just say a few words about Gaussians, and for those of you that have seen it before, this will be a refresher.

So we say that a random variable Z is distributed multivariate Gaussian, written with the script N for normal, with parameters mean mu and covariance matrix Sigma, if Z has density one over (2 pi) to the n/2 times the determinant of Sigma to the one-half, times e to the minus one-half (z minus mu) transpose Sigma inverse (z minus mu), okay? That's the formula for the density; it's the generalization of the one-dimensional Gaussian, with its familiar bell-shaped curve, to a higher-dimensional vector-valued random variable Z.

Don't worry too much about this formula for the density; you rarely end up needing to use it. The two key quantities are this vector mu, which is the mean of the Gaussian, and this matrix Sigma, which is the covariance matrix, and so Sigma will be equal to, right, the definition of the covariance of a vector-valued random variable, the expectation of (Z minus mu) times (Z minus mu) transpose, okay?
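For reference, here are the density and the covariance definition just stated, written out in standard notation (the determinant factor |Sigma|^(1/2) is the term the instructor corrects on the board a little later):

```latex
Z \sim \mathcal{N}(\mu,\Sigma)
\quad\text{with density}\quad
p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
       \exp\!\Big(-\tfrac{1}{2}(z-\mu)^{T}\Sigma^{-1}(z-\mu)\Big),
\qquad
\Sigma = \mathrm{E}\big[(Z-\mu)(Z-\mu)^{T}\big].
```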
And, actually, if this doesn't look familiar to you, you might re-watch the discussion section that the TAs held last Friday, or the one they'll be holding later this week, which is, sort of, a recap of probability, okay?

So the multivariate Gaussian is parameterized by a mean and a covariance, and, can I have the laptop displayed, please, I'll just go ahead and show you graphically the effects of varying the parameters of a Gaussian.

So what I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity; the covariance matrix is shown in the upper right-hand corner of the slide, and there's the familiar bell-shaped curve in two dimensions. If I shrink the covariance matrix, instead of the identity covariance, then the Gaussian becomes more peaked, and if I widen the covariance, say to twice the identity, then the density becomes more spread out, okay? That first one was the standard normal, with identity covariance.

If I increase the off-diagonals of the covariance matrix, that is, if I make the variables correlated, the Gaussian becomes flattened out along the X = Y direction, and if I increase them even further, then my two variables, Z1 and Z2 on the horizontal axes, become even more correlated. I'll just show the same thing in contours. The standard normal distribution has contours that are actually circles; because of the aspect ratio they look like ellipses here, but they should actually be circles. And if you increase the off-diagonals of the Gaussian's covariance matrix, the contours become ellipses aligned along the, sort of, 45-degree direction in this example.

This is the same thing. Here's an example of a Gaussian density with negative covariances, so now the correlation goes the other way, and here with even stronger negative covariance, and the same thing in contours: this is a Gaussian with negative entries on the off-diagonals, and with even larger negative entries on the off-diagonals, okay?

The other parameter of the Gaussian is the mean, so this is with mean zero, and as you change the mean parameter, the location of the Gaussian just moves around, okay? All right.
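If you want to reproduce plots like the ones shown on the laptop, here is a minimal sketch (not from the lecture; the specific covariance values are illustrative choices) that evaluates the 2D Gaussian density on a grid for a few covariance matrices and draws the contours:

```python
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

# Grid over which to evaluate the 2D density.
z1, z2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([z1, z2])

# A few illustrative covariance matrices: identity, widened,
# positively correlated, and negatively correlated.
covariances = {
    "identity":             np.array([[1.0, 0.0], [0.0, 1.0]]),
    "wider (2I)":           np.array([[2.0, 0.0], [0.0, 2.0]]),
    "positive correlation": np.array([[1.0, 0.8], [0.8, 1.0]]),
    "negative correlation": np.array([[1.0, -0.8], [-0.8, 1.0]]),
}

fig, axes = plt.subplots(1, len(covariances), figsize=(16, 4))
for ax, (name, sigma) in zip(axes, covariances.items()):
    density = multivariate_normal(mean=[0.0, 0.0], cov=sigma).pdf(grid)
    ax.contour(z1, z2, density)   # circles for the identity, ellipses otherwise
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```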
So that was a quick primer on what Gaussians look like, and here's a roadmap, a picture to keep in mind as we describe the Gaussian discriminant analysis algorithm; this is what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm, what I'm going to do is look at the positive examples, say the crosses, and, looking at only the positive examples, fit a Gaussian distribution to them, and so maybe I end up with a Gaussian distribution like that, okay? So that's P of X given Y = 1. And then I'll look at the negative examples, the O's in this figure, and I'll fit a Gaussian to those, and maybe I get a Gaussian centered over there; that's my second Gaussian. And together, we'll say how later, these two Gaussian densities will define a separator for the two classes, okay?

And it'll turn out that the separator will be a little bit different from what logistic regression gives you. If you run logistic regression, you actually get the decision boundary shown by the green line, whereas Gaussian discriminant analysis gives you the blue line, okay? Switch back to the chalkboard, please. All right.

Here's the Gaussian discriminant analysis model. We're going to model P of Y as a Bernoulli random variable as usual, parameterized by a parameter phi; you've seen this before. And we model P of X given Y = 0 as a Gaussian - oh, you know what? Yes, excuse me, I thought this looked strange: there should be a determinant of Sigma to the one-half in the denominator there. It's no big deal; I was missing the determinant of Sigma to the one-half on a previous board, excuse me. Okay.

And so I model P of X given Y = 0 as a Gaussian with mean mu0 and covariance Sigma, and P of X given Y = 1 as a Gaussian with mean mu1 and that same covariance Sigma, okay?
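Collected in one place, the model just written on the board (the shared covariance Sigma for both classes matches the parameter list that follows):

```latex
y \sim \mathrm{Bernoulli}(\phi), \qquad
x \mid y=0 \;\sim\; \mathcal{N}(\mu_0,\Sigma), \qquad
x \mid y=1 \;\sim\; \mathcal{N}(\mu_1,\Sigma),
```

so that, for example,

```latex
p(x\mid y=0) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
               \exp\!\Big(-\tfrac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)\Big).
```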
And so the parameters of this model are phi, mu0, mu1, and Sigma, and I can now write down the likelihood of the parameters, or, excuse me, actually the log likelihood of the parameters, as the log of that, right? So, in other words, given the training set, I can write down the log likelihood of the parameters as the log of the product from I = 1 to M of the joint probabilities P of XI comma YI, and this is just equal to the log of the product from I = 1 to M of P of XI given YI, times P of YI, where each of these terms, P of XI given YI and P of YI, is given by one of the three equations on top, okay?

And I just want to contrast this again with discriminative learning algorithms, right? To give it a name, this is sometimes called the joint data likelihood, the joint likelihood, and let me just contrast it with what we had previously when we were talking about logistic regression, where I said the log likelihood of the parameter theta was the log of the product from I = 1 to M of P of YI given XI, parameterized by theta, right? So back when we were fitting logistic regression models or generalized linear models, we were always modeling P of YI given XI, parameterized by theta, and that was the conditional likelihood, okay, in which we model P of YI given XI; whereas, now, with generative learning algorithms, we're instead going to look at the joint likelihood, which is P of XI comma YI, okay?

So let's see. Given the training set, and using the Gaussian discriminant analysis model, to fit the parameters of the model we'll do maximum likelihood estimation as usual, and so you maximize this log likelihood with respect to the parameters phi, mu0, mu1, Sigma, and if you find the maximum likelihood estimates of the parameters, you find that phi is - well, the maximum likelihood estimate is actually no surprise, and I'm writing it down mainly as practice with the indicator notation, all right?
So the maximum likelihood estimate for phi would be the sum over I of YI, divided by M, or, written alternatively, the sum over all your training examples of the indicator of YI = 1, divided by M, okay? In other words, the maximum likelihood estimate for the Bernoulli parameter phi is just the fraction of training examples with label one, with Y equal to 1.

The maximum likelihood estimate for mu0 is this: the sum from I = 1 to M of the indicator of YI = 0, times XI, divided by the sum from I = 1 to M of the indicator of YI = 0, okay? You should stare at this for a second and see if it makes sense. Actually, I'll just write the next one, for mu1, while you do that; it's the same thing with indicators of YI = 1, okay?

So what this is: the denominator is a sum over your training set of the indicator of YI = 0, so for every training example for which YI = 0, this increments the count by one, all right? So the denominator is just the number of examples with label zero, all right? And then in the numerator, the sum from I = 1 to M, every time YI is equal to zero this indicator is a one, and otherwise it's a zero, and so the indicator function means you're including only the terms for which YI is equal to zero, because for all the examples where YI is equal to one, the term in the sum is equal to zero. You then multiply that by XI, and so the numerator is really the sum of the XI's corresponding to examples where the class label was zero, okay? Raise your hand if this makes sense. Okay, cool.

So just to say this less formally, it means: look through your training set, find all the examples for which Y = 0, and take the average of the values of X for those examples. So take all your negative training examples and average their values of X, and that's mu0, okay? If this notation is still a little bit cryptic, if you're still not sure why the equation translates into what I just said, do go home and stare at it for a while until it makes sense. This is, sort of, no surprise; it just says that to estimate the mean for the negative examples, you take all your negative examples and average them. So no surprise, but this is useful practice with the indicator notation. Similarly, there's the maximum likelihood estimate for Sigma; I won't derive that here, you can read it in the notes yourself.
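As a concrete illustration of these estimates, and of the prediction rule discussed next, here is a minimal NumPy sketch, assuming a design matrix X of shape (m, n) and a 0/1 label vector y (the array names are mine, not from the lecture). The Sigma update written here is the standard shared-covariance maximum likelihood estimate from the notes the instructor points to, so treat it as stated there rather than derived on the board.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gda(X, y):
    """Maximum likelihood estimates of the GDA parameters phi, mu0, mu1, Sigma."""
    m, n = X.shape
    phi = np.mean(y == 1)              # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)       # average of the negative examples
    mu1 = X[y == 1].mean(axis=0)       # average of the positive examples
    # Shared covariance: average outer product of (x - mu_{y}) over all examples.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Predict arg max over y of p(x | y) p(y); the denominator p(x) cancels."""
    p_x_given_y0 = multivariate_normal(mu0, Sigma).pdf(X)
    p_x_given_y1 = multivariate_normal(mu1, Sigma).pdf(X)
    return (p_x_given_y1 * phi > p_x_given_y0 * (1 - phi)).astype(int)
```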
And so, having fit the parameters phi, mu0, mu1, and Sigma to your data, you now need to make a prediction. You know, when you're given a new value of X, when you're given a new cancer, you need to predict whether it's malignant or benign. Your prediction is then going to be, let's say, the most likely value of Y given X, and I should write a semicolon and the parameters there. So that's the arg max over Y of P of Y given X, which, by Bayes rule, is the arg max over Y of P of X given Y times P of Y, divided by P of X, all right?

And that is, in turn, just the arg max over Y of P of X given Y times P of Y, because the denominator P of X doesn't depend on Y. And if P of Y is uniform, in other words, if each of your classes is equally likely, so if P of Y takes the same value for all values of Y, then this is just the arg max over Y of P of X given Y, okay? That happens sometimes, maybe not very often, so usually you end up using the formula where you compute both P of X given Y and P of Y using your model, okay?

Student: Can you go over arg max?
Instructor (Andrew Ng): Oh, let's see. Arg max means the value of the argument that maximizes the expression. So, just for an example, the min over X of (X - 5) squared is 0, because by choosing X equal to 5 you can get this to be zero, whereas the arg min over X of (X - 5) squared is equal to 5, because 5 is the value of X that minimizes it, okay? Cool. Thanks for asking that.
Instructor (Andrew Ng): Okay. Actually, any other questions about this? Yeah?
Student: Why is the distribution uniform? Why isn't [inaudible] -
Instructor (Andrew Ng): Oh, I see. By uniform I meant, I was being loose here, I meant if P of Y = 0 is equal to P of Y = 1, or if Y has the uniform distribution over the set {0, 1}.
Student: Oh.
Instructor (Andrew Ng): I just meant, yeah, if P of Y = 0 equals P of Y = 1. That's all I mean, see? Anything else? All right. Okay.

So it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression. Let me illustrate that. So let's say you have a training set; actually, let me just go ahead and draw a 1D training set, and that will kind of work, yes, okay.
So let's say we have a training set comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis. So I'll fit a Gaussian density to each of the two classes, to my positive and my negative training examples, and maybe my positive examples, the X's, are fit with a Gaussian like this, and my negative examples are fit with a Gaussian that looks like that, okay?

Now, I hope this [inaudible]. Now, let's vary along the X axis, and what I want to do is overlay on top of this plot a plot of P of Y = 1 given X for a variety of values of X, okay? Actually, I realize what I should have done: I'm going to call the X's the negative examples and the O's the positive examples; it just makes this part come out better.

So let's take a value of X that's fairly small, say X is this value here on the horizontal axis. Then what's the probability of Y being equal to one conditioned on X? Well, the way you calculate that is you write P of Y = 1 given X, and then you plug in all these formulas as usual, right? It's P of X given Y = 1, which is your Gaussian density, times P of Y = 1, which is essentially just going to be equal to phi, and then divided by P of X, and that shows you how you can calculate it. By using these two Gaussians and my estimate of phi for P of Y, I can actually compute what P of Y = 1 given X is, and in this case, if X is this small, clearly it belongs to the left Gaussian; it's very unlikely to belong to the positive class, and so this quantity will be very small, very close to zero, say, okay?

And then we can increment the value of X a bit, study a different value of X, and plot what P of Y = 1 given X is, and, again, it'll be pretty small. Let's use a point like that, right? At this point, the two Gaussian densities have equal value, and if I ask, for X at the value shown by the arrow, what's the probability of Y being equal to one? Well, you really can't tell, so maybe it's about 0.5, okay? And if you fill in a bunch more points, you get a curve like that, and then you can keep going.
Let's say for a point like that, you can ask: what's the probability of Y being one? Well, if X is that far out, then clearly it belongs to the rightmost Gaussian, and so the probability of Y being one would be very high, almost one, okay? And so you can repeat this exercise for a bunch of points, all right, compute P of Y equals one given X for a bunch of points, and if you connect up these points, you find that the curve you get, plotted, takes the form of a sigmoid function, okay?

So, in other words, when you make the assumption of the Gaussian discriminant analysis model, that P of X given Y is Gaussian, and you go back and compute what P of Y given X is, you actually get back exactly the same form of sigmoid function that we were using for logistic regression, okay? But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and a different steepness of the sigmoid than logistic regression would. Is there a question?

Student: I'm just wondering, the Gaussian of P of Y [inaudible] you do?
Instructor (Andrew Ng): No, let's see. This Gaussian is P of X given Y = 1, and this Gaussian is P of X given Y = 0; does that make sense? Anything else?
Student: Okay.
Instructor (Andrew Ng): Yeah?
Student: When you were drawing all the dots, how did you decide what Y given P of X was?
Instructor (Andrew Ng): What - say that again.
Student: I'm sorry. Could you go over how you figured out where to draw each dot?
Instructor (Andrew Ng): Let's see, okay. So the computation is as follows, right? The steps are: I have the training set, and given my training set, I'm going to fit a Gaussian discriminant analysis model to it, and what that means is I'll build a model for P of X given Y = 1, I'll build a model for P of X given Y = 0, and I'll also fit a Bernoulli distribution to P of Y, okay? So, in other words, given my training set, I'll fit P of X given Y and P of Y to my data, and now I've chosen my parameters phi, mu0, mu1, and Sigma, okay? Then this is the process I went through to plot all these dots, right? I just pick a point on the X axis, and then I compute P of Y given X for that value of X, and P of Y = 1 conditioned on X will be some value between zero and one.
It'll be some real number, and whatever that real number is, I then plot it on the vertical axis, okay? And the way I compute P of Y = 1 conditioned on X is I use these quantities: I use P of X given Y and P of Y, and, sort of, plug them into Bayes rule, and that allows me to compute P of Y given X from those three quantities; does that make sense?
Student: Yeah.
Instructor (Andrew Ng): Was there something more that -
Student: And how did you model P of X; is that -
Instructor (Andrew Ng): Oh, okay. Yeah, so, well, we've got that right here. P of X can be written as P of X given Y = 0 times P of Y = 0, plus P of X given Y = 1 times P of Y = 1, right? And each of these terms, P of X given Y and P of Y, is something I can get directly out of my Gaussian discriminant analysis model; each of these terms is something my model gives me directly. So that gets plugged in as the denominator, and by doing that, that's how I compute P of Y = 1 given X; make sense?
Student: Thank you.
Instructor (Andrew Ng): Okay. Cool.
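Here is a small sketch of exactly the computation just described, for a 1D example (my own illustration, not from the lecture; the sample values and the shared standard deviation are illustrative assumptions): fit a Gaussian to each class and a Bernoulli prior, then sweep x and compute P(y = 1 | x) by Bayes rule. Plotting the resulting values traces out the sigmoid-shaped curve drawn on the board.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1D training data: negative examples clustered near -2,
# positive examples clustered near +2 (illustrative values only).
x_neg = np.array([-3.1, -2.4, -2.0, -1.7, -1.2])
x_pos = np.array([ 1.1,  1.8,  2.2,  2.6,  3.0])

# Fit the generative model: one Gaussian per class, Bernoulli prior on y.
phi = len(x_pos) / (len(x_pos) + len(x_neg))               # P(y = 1)
mu0, mu1 = x_neg.mean(), x_pos.mean()
sigma = np.concatenate([x_neg - mu0, x_pos - mu1]).std()   # shared std. dev.

# Sweep x and compute P(y = 1 | x) by Bayes rule, as in the lecture.
xs = np.linspace(-5, 5, 200)
p_x_given_y1 = norm.pdf(xs, mu1, sigma)
p_x_given_y0 = norm.pdf(xs, mu0, sigma)
p_y1_given_x = (p_x_given_y1 * phi) / (p_x_given_y1 * phi + p_x_given_y0 * (1 - phi))
# Plotting p_y1_given_x against xs gives a sigmoid-shaped curve.
```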
So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm, okay? In the particular case of Gaussian discriminant analysis, we assume that X conditioned on Y is Gaussian, and the argument I showed on the previous chalkboard, I didn't prove it formally, but you can actually go back and prove it yourself, is that if you assume X given Y is Gaussian, then that implies that when you plot P of Y given X, you get a logistic posterior. In other words, if you assume X given Y is Gaussian, then the posterior distribution, the form of P of Y = 1 given X, is going to be a logistic function, and it turns out the implication in the opposite direction does not hold true, okay?

In particular, it actually turns out, and this is, kind of, cool, that if you assume X given Y = 1 is Poisson with parameter lambda 1, and X given Y = 0 is Poisson with parameter lambda 0, then that also implies that P of Y given X is logistic, okay? So there are lots of assumptions on X given Y that will lead to P of Y given X being logistic, and, therefore, the assumption that X given Y is Gaussian is a stronger assumption than the assumption that P of Y given X is logistic, okay? Because the former implies the latter, right? The logistic posterior holds whenever X given Y is Gaussian, but not vice versa.
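The instructor leaves the proof as an exercise; here is a sketch of the key step for the Gaussian case with the shared covariance above (the intermediate algebra is mine, not shown in the lecture):

```latex
p(y=1\mid x)
  = \frac{p(x\mid y=1)\,\phi}{p(x\mid y=1)\,\phi + p(x\mid y=0)\,(1-\phi)}
  = \frac{1}{1+\dfrac{1-\phi}{\phi}\cdot\dfrac{p(x\mid y=0)}{p(x\mid y=1)}}
  = \frac{1}{1+\exp(-\theta^{T}x-\theta_{0})}.
```

Because both class-conditional densities share the same Sigma, the quadratic terms in x cancel in the ratio of the two Gaussian densities, leaving an exponent that is linear in x, with theta = Sigma^{-1}(mu1 - mu0) and theta_0 a constant collecting the remaining terms (including log(phi/(1 - phi))); that is exactly a logistic function of x.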
And so this leads to some of the tradeoffs between Gaussian discriminant analysis and logistic regression, right? Gaussian discriminant analysis makes the much stronger assumption that X given Y is Gaussian, and so when that assumption is true, or approximately holds, if you plot the data and X given Y is indeed approximately Gaussian, then if you make the assumption explicit in the algorithm, the algorithm will do better, because it's as if the algorithm is making use of more information about the data; the algorithm knows that the data is Gaussian, right? So if the Gaussian assumption holds, or roughly holds, then Gaussian discriminant analysis may do better than logistic regression.

If, conversely, you're actually not sure what X given Y is, then logistic regression, the discriminative algorithm, may do better. In particular, suppose you use logistic regression, and maybe you assumed beforehand that the data was Gaussian, but it turns out the data was actually Poisson, right? Then logistic regression will still do perfectly fine, because if the data were actually Poisson, then P of Y = 1 given X will be logistic, and it'll do perfectly fine; but if you assumed it was Gaussian, then the algorithm may go off and do something that's not as good, okay?

So it turns out the real advantage of generative learning algorithms is often that they require less data, and, in particular, data is never really exactly Gaussian; data is often approximately Gaussian, but it's never exactly Gaussian. And it turns out generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But the other tradeoff is that, by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data in order to fit, like, an okay model, even if there's less training data; whereas, in contrast, logistic regression, by making fewer assumptions, is more robust to your modeling assumptions, because you're making a weaker assumption, you're making fewer assumptions, but sometimes it takes a slightly larger training set to fit than Gaussian discriminant analysis. Question?

Student: In order to meet any assumption about the number [inaudible], plus here we assume that P of Y = 1 [inaudible]. Is that true when the number of samples is marginal?
Instructor (Andrew Ng): Okay. So let's see. So there's a question of, is this true - what was that? Let me translate that differently. The modeling assumptions are made independently of the size of your training set, right? So, like, in least squares regression, well, in all of these models, I'm assuming that these are random variables drawn from some distribution, and then, finally, I'm given a single training set and that's used to estimate the parameters of the distribution, right?
Student: So what's the probability of Y = 1?
Instructor (Andrew Ng): The probability of Y = 1?
Student: Yeah, you used the -
Instructor (Andrew Ng): Sort of, this, like - back to the philosophy of maximum likelihood estimation, right? I'm assuming that P of Y is equal to phi to the Y, times one minus phi to the one minus Y. So I'm assuming that there's some true value of phi generating all my data, and then - well, when I write this, I guess, maybe what I should write isn't - so when I write this, I guess there are really two values of phi.
One is the true underlying value of phi that was used to generate the data, and then there's the maximum likelihood estimate of phi, and so when I was writing those formulas earlier, those formulas for phi, mu0, and mu1 were really the maximum likelihood estimates of phi, mu0, and mu1, and that's different from the true underlying values of phi, mu0, and mu1, but -
Student: [Off mic].
Instructor (Andrew Ng): Yeah, right. So the maximum likelihood estimate comes from the data, and there's some, sort of, true underlying value of phi that I'm trying to estimate, and my maximum likelihood estimate is my attempt to estimate that true value, but, you know, by notational convention, we often just write it the same way without bothering to distinguish between the maximum likelihood value and the true underlying value that I'm assuming is out there, and that I'm only hoping to estimate.

Actually, yeah, so for questions like these about maximum likelihood and so on, I hope the Friday discussion section will be a good time to ask about, sort of, probabilistic definitions like these as well. Are there any other questions? No? Great. Okay.

So, great. Oh, it turns out, just to mention one more thing that's, kind of, cool: I said that if X given Y is Poisson, you also get a logistic posterior, and it actually turns out there's a more general version of this. If you assume X given Y = 1 is an exponential family distribution with natural parameter eta 1, and you assume X given Y = 0 is an exponential family distribution with natural parameter eta 0, then this implies that P of Y = 1 given X is also logistic, okay? And that's, kind of, cool. It means that X given Y could be, I don't know, some strange thing; it could be gamma, because we've seen Gaussian, gamma, exponential, there's actually a beta; I'm just rattling off my mental list of exponential family distributions.
It could be any one of those things, so if you use the same exponential family distribution for the two classes, with different natural parameters, then the posterior P of Y = 1 given X will be logistic, and so this shows the robustness of logistic regression to the choice of modeling assumptions, because it could be that the data was actually, you know, gamma distributed, and the posterior still turns out to be logistic. So this is the robustness of logistic regression to modeling assumptions.

And that's the second justification. I think early on I promised you two justifications for where I pulled the logistic function out of a hat, right? So one was the exponential family derivation we went through last time, and this is, sort of, the second one: all of these modeling assumptions also lead to the logistic function. Yeah?
Student: [Off mic].
Instructor (Andrew Ng): Oh, whether P of Y = 1 given X being logistic implies this? No, that's also not true, right? This exponential family assumption implies that P of Y = 1 given X is logistic, but the reverse implication is not true. There are actually all sorts of really bizarre distributions for X that would give rise to a logistic function, okay? Okay.

So let's talk about - that was our first generative learning algorithm; now I'll talk about a second generative learning algorithm, called the Naive Bayes algorithm, and the motivating example I'm going to use will be spam classification. All right. So let's say that you want to build a spam classifier to take your incoming stream of email and decide if it's spam or not.

So let's see. Y will be 0 or 1, with 1 being spam email and 0 being non-spam, and the first decision we need to make is: given a piece of email, how do you represent the piece of email using a feature vector X, right? Email is just a piece of text, right? Email is like a list of words, or a list of ASCII characters, so how do I represent it as a feature vector X?

We'll use a couple of different representations, but the one I'll use today is the following. We'll construct the vector X as follows: I'm going to go through my dictionary and, sort of, make a listing of all the words in my dictionary, okay? So the first word is A, the second word in my dictionary is Aardvark, then Ausworth, okay?
You know, and somewhere along the way you see the word "buy", as in the spam emails telling you to buy stuff. And, depending on how you collect your list of words, you won't find CS229, right, the course number, in a dictionary, but if you collect your list of words from other emails you've received, you'll have that in the list somewhere as well, and then the last word in my dictionary is zymurgy, which pertains to the technological chemistry that deals with the fermentation process in brewing.

So say I get a piece of email, and what I'll do is scan through this list of words, and wherever a certain word appears in my email, I'll put a 1 there. So if a particular email has the word "a", then that entry is 1. My email doesn't have the words Ausworth or Aardvark, so those get zeros. And, again, it's a piece of email where they want me to buy something, CS229 doesn't occur, and so on, okay? So this would be one way of creating a feature vector to represent a piece of email.
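As a quick illustration of this representation, here is a sketch with a made-up five-word vocabulary standing in for the 50,000-word dictionary from the lecture (the vocabulary and the example email are assumptions for illustration only):

```python
import numpy as np

# Hypothetical tiny vocabulary standing in for the 50,000-word dictionary.
vocabulary = ["a", "aardvark", "ausworth", "buy", "zymurgy"]
word_index = {word: j for j, word in enumerate(vocabulary)}

def email_to_feature_vector(email_text):
    """Binary vector x with x[j] = 1 iff the j-th dictionary word appears in the email."""
    x = np.zeros(len(vocabulary), dtype=int)
    for word in email_text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

# Example: a spammy email mentioning "buy" sets that coordinate to 1.
print(email_to_feature_vector("Buy a new watch now"))   # -> [1 0 0 1 0]
```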
Now, let's write down a generative model for this. In other words, I want to model P of X given Y, for Y = 0 or Y = 1, all right? And my feature vectors are going to be in {0, 1} to the N; they're binary-valued vectors, N dimensional, where N may be on the order of, say, 50,000, if you have 50,000 words in your dictionary, which is not atypical. So values from, I don't know, the mid-thousands to tens of thousands are very typical for problems like these.

And, therefore, there are two to the 50,000 possible values for X, right? Two to the 50,000 possible bit vectors of length 50,000, and so one way to model this is with a multinomial distribution, but because there are two to the 50,000 possible values for X, I would need two to the 50,000 minus 1 parameters, right? Because they have to sum to 1, hence the minus 1, and this is clearly way too many parameters to model with a multinomial distribution over all two to the 50,000 possibilities.

So in the Naive Bayes algorithm, we're going to make a very strong assumption on P of X given Y, and, in particular, I'm going to assume - let me just say what it's called, then I'll write out what it means - I'm going to assume that the XI's are conditionally independent given Y, okay?

Let me say what this means. So I have P of X1, X2, up to X50,000, given Y. By the chain rule of probability, this is P of X1 given Y, times P of X2 given Y and X1, times P of X3 given Y, X1, and X2, and so on, up to P of X50,000 given Y and all the previous X's, okay? This is the chain rule of probability; it always holds; I've not made any assumption yet. And now we're going to make what's called the Naive Bayes assumption, the assumption that the XI's are conditionally independent given Y. Nothing changes for the first term, but I'm going to assume that P of X2 given Y and X1 is equal to P of X2 given Y, that the next term is equal to P of X3 given Y, and so on, up to P of X50,000 given Y, okay? Or, written more compactly, we assume that P of X1 up to X50,000 given Y is the product from I = 1 to 50,000 of P of XI given Y, okay?

And stating informally what this means: I'm, sort of, assuming that, so long as you know the class label Y, so long as you know whether this is spam or not spam, then knowing whether the word "a" appears in the email does not affect the probability of whether the word Ausworth appears in the email, all right? In other words, once you know whether an email is spam or not spam, then knowing whether other words appear in the email won't help you predict whether any particular other word appears in the email, okay?
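Written out, the chain-rule expansion and the Naive Bayes assumption just described are:

```latex
p(x_1,\dots,x_{50000}\mid y)
  = p(x_1\mid y)\,p(x_2\mid y,x_1)\cdots p(x_{50000}\mid y,x_1,\dots,x_{49999})
  = \prod_{i=1}^{50000} p(x_i\mid y) \quad \text{(Naive Bayes assumption)}.
```

Under this assumption, the class-conditional model needs only the 2 x 50,000 Bernoulli parameters phi_{i|y=1} and phi_{i|y=0} (plus phi_y for the class prior), rather than the 2^50,000 minus 1 parameters of a full multinomial over all bit vectors.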
So this assumption is normally just false for normal written English, but it turns out that despite this assumption being false in the literal sense, the Naive Bayes algorithm is an extremely effective algorithm for classifying text documents into spam or not spam, for automatically sorting your emails into different folders, for looking at web pages and classifying whether a webpage is trying to sell something, or whatever. It turns out this assumption works very well for classifying text documents, and for other applications too that I'll talk a bit about later.

As a digression that will make sense only to some of you: if you're familiar with Bayes nets — say, graphical models — the Bayesian network associated with this model looks like this: there is a random variable Y that then generates X1, X2, through X50,000, okay? If you've not seen a Bayes net before, if you don't know your graphical models, just ignore this; it's not important for our purposes. But if you've seen it before, that's what it would look like. Okay.

So the parameters of the model are as follows: phi I given Y = 1, which is the probability that XI = 1 given Y = 1; phi I given Y = 0; and phi Y, okay? So these are the parameters of the model, and, therefore, to fit the parameters of the model, you can write down the joint likelihood — the product over your M training examples of the joint probability of XI and YI — as usual, okay?

So given the training set, you can write down the joint likelihood of the parameters, and then when you do maximum likelihood estimation, you find that the maximum likelihood estimates of the parameters are pretty much what you'd expect. The maximum likelihood estimate for phi J given Y = 1 is the sum from I = 1 to M of the indicator that XIJ = 1 and YI = 1, divided by the sum from I = 1 to M of the indicator that YI = 1, okay?
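To pin down that notation, the parameters and the joint likelihood are, using m for the number of training examples:

    \phi_{j|y=1} = P(x_j = 1 \mid y = 1), \qquad
    \phi_{j|y=0} = P(x_j = 1 \mid y = 0), \qquad
    \phi_y = P(y = 1)

    \mathcal{L}(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)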
Stated more simply, the numerator just says: run through your entire training set, your M training examples, and count up the number of times you saw word J in a piece of email for which the label Y was equal to 1. In other words, look through all your spam emails and count the number of emails in which word J appeared, out of all your spam emails. And the denominator is the sum from I = 1 to M of the indicator that YI = 1 — the denominator is just the number of spam emails you got. So this ratio is: of all the spam emails in your training set, what fraction did word J — the Jth word in your dictionary — appear in? And that's the maximum likelihood estimate for the probability of seeing word J conditioned on the piece of email being spam, okay? And, similarly, the maximum likelihood estimate for phi Y is pretty much what you'd expect, right? Okay?

And so, having estimated all these parameters, when you're given a new piece of email that you want to classify, you can then compute P of Y given X using Bayes' rule, right? Same as before, because together these parameters give you a model for P of X given Y and for P of Y, and by using Bayes' rule, given these two terms, you can compute P of Y given X, and there's your spam classifier, okay? It turns out we need one more elaboration on this idea, but let me check if there are questions about this so far.

Student: So does this model depend on the number of inputs?

Instructor (Andrew Ng): What do you mean by number of inputs — the number of features?

Student: No, the number of samples.

Instructor (Andrew Ng): Well, M is the number of training examples, so given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? Other questions — does it make sense? M is the number of training examples, so when you have M training examples, you plug them into this formula, and that's how you compute the maximum likelihood estimates.

Student: By training examples, you mean M is the number of emails?

Instructor (Andrew Ng): Yeah, right. So it's, kind of, your training set.
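Putting those maximum likelihood estimates into symbols (with 1{·} denoting the indicator function and m the number of training examples), and writing out the Bayes' rule prediction:

    \phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad
    \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}, \qquad
    \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}

    p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)}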
I would go through all the email I've gotten in the last two months and label them as spam or not spam, and so you have — I don't know — a few hundred emails labeled as spam or not spam, and that will comprise your training set X1, Y1 through XM, YM, where each X is one of those vectors representing which words appeared in the email and Y is 0 or 1 depending on whether the email is spam or not spam, okay?

Student: So you're saying that this model depends on the number of examples, but the last model doesn't depend on the number of examples, but your phi is the same for either one.

Instructor (Andrew Ng): They're different things, right? There's the model, which is the modeling assumptions — I'm making the Naive Bayes assumption. So the probabilistic model is an assumption on the joint distribution of X and Y. That's what the model is. And then I'm given a fixed number of training examples — I'm given M training examples — and then, after I'm given the training set, I go in and write down the maximum likelihood estimate of the parameters, right? So maybe we should take that offline. Yeah, question?

Student: Then how would you do this, like, if this [inaudible] didn't work?

Instructor (Andrew Ng): Say that again.

Student: How would you do it, say, like, with the 50,000 words —

Instructor (Andrew Ng): Oh, okay. How to do this with the 50,000 words, yeah. So it turns out this is, sort of, a very practical question, really: how do I come up with this list of words? In practice, one common way to come up with a list of words is to just take all the words that appear in your training set. That's one fairly common way to do it. Or, if that turns out to be too many words, you can take all the words that appear at least three times in your training set — so words that you didn't even see three times in the emails you got in the last two months, you discard. I was talking about going through a dictionary, which is a nice way of thinking about it, but in practice you might go through your training set and just take the union of all the words that appear in it.
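As a rough sketch of that vocabulary-building step in Python — the data format (a list of raw email strings) and the interpretation of "at least three times" as three total occurrences are assumptions made here for illustration:

    from collections import Counter

    def build_vocabulary(training_emails, min_count=3):
        # training_emails: a list of raw email strings from the training set.
        counts = Counter()
        for email_text in training_emails:
            counts.update(email_text.lower().split())
        # Keep words seen at least `min_count` times; discard the rest.
        return sorted(word for word, c in counts.items() if c >= min_count)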
In some texts, by the way, you'll even see other ways of selecting these features, but this is one way to think about creating your feature vector, right — as zero-and-one values, okay? Moving on — yeah, okay, you have a question?

Student: I'm getting kind of confused about how you compute all those parameters.

Instructor (Andrew Ng): On how I came up with the parameters?

Student: Correct.

Instructor (Andrew Ng): Let's see. So the question was how I came up with the parameters, right? In Naive Bayes, I need to build a model for P of X given Y and for P of Y, right? As with generative learning algorithms in general, I need to come up with models for these. So how did I model P of Y? Well, I just chose to model it using a Bernoulli distribution, and so P of Y will be parameterized by phi Y, all right?

Student: Okay.

Instructor (Andrew Ng): And then how did I model P of X given Y? Well, my model for P of X given Y, under the Naive Bayes assumption, is that P of X given Y is the product of these probabilities, and so I'm going to need parameters to tell me what the probability is of each word occurring or not occurring, conditioned on the email being spam or not spam, okay?

Student: How is that Bernoulli?

Instructor (Andrew Ng): Oh, because X is either zero or one, right? By the way I defined the feature vectors, XI is either one or zero, depending on whether word I appears in the email, right? So, by the way I defined the feature vectors, XI is always zero or one, and so, by definition, it has to be a Bernoulli distribution, right? If XI were continuous, then you might model this as a Gaussian, and you'd end up with something like we did in Gaussian discriminant analysis. It's just that the way I constructed my features for email, XI is always binary-valued, and so you end up with a Bernoulli here, okay? All right.

I should move on. So it turns out that this idea almost works. Now, here's the problem. Let's say you complete this class and you start to do, maybe, the class project, and you keep working on your class project for a bit, and it becomes really good, and you want to submit your class project to a conference, right?
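In equations, the Bernoulli models just described are, roughly:

    p(y) = \phi_y^{\,y}\,(1 - \phi_y)^{1-y}, \qquad
    p(x \mid y) = \prod_{j=1}^{n} p(x_j \mid y), \qquad
    p(x_j \mid y = 1) = \phi_{j|y=1}^{\,x_j}\,(1 - \phi_{j|y=1})^{1 - x_j}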
So, you know, around — I don't know — June every year is the deadline for the NIPS conference; NIPS is just the name of a conference, it's an acronym. And so maybe you email your project partners or senior friends and say, Hey, let's work on a project and submit it to the NIPS conference. And so you're getting these emails with the word NIPS in them, which you've probably never seen before. So a piece of email comes from your project partner saying, Let's send a paper to the NIPS conference, and then your spam classifier will say: P of X — let's say NIPS is the 30,000th word in your dictionary, okay? — so P of X30,000 = 1, given Y = 1, will be equal to 0. That's the maximum likelihood estimate, right? Because you've never seen the word NIPS before in your training set, the maximum likelihood estimate of the probability of seeing the word NIPS is zero, and, similarly, in non-spam mail, the chance of seeing the word NIPS is also estimated as zero.

So when your spam classifier goes to compute P of Y = 1 given X, it will compute this ratio — P of X given Y = 1 times P of Y = 1, divided by the sum of that term and the corresponding term for Y = 0. And if you look at those terms, each of them is a product from I = 1 to 50,000 of P of XI given Y, and one of those probabilities will be equal to zero, because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a zero in the product, and so the numerator is zero, and in the same way it turns out the denominator will also be zero — actually, all of these terms end up being zero — so you end up with P of Y = 1 given X being 0 over 0 + 0, okay, which is undefined.

And the problem with this is that it's just statistically a bad idea to say that P of X30,000 = 1 given Y is 0, right? Just because you haven't seen the word NIPS in your last two months' worth of email, it's statistically not sound to say that, therefore, the chance of ever seeing this word is zero, right?
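Concretely, the ratio being computed breaks down like this once one of the factors is an estimated zero:

    p(y = 1 \mid x)
      = \frac{\prod_{i=1}^{50000} p(x_i \mid y = 1)\; p(y = 1)}
             {\prod_{i=1}^{50000} p(x_i \mid y = 1)\; p(y = 1) \;+\; \prod_{i=1}^{50000} p(x_i \mid y = 0)\; p(y = 0)}
      = \frac{0}{0 + 0}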
And so the problem is this idea that just because you haven't seen something before, the event may be unlikely, but that doesn't mean it's impossible — and saying that because you've never seen the word NIPS before, it is therefore impossible to ever see the word NIPS in future emails, that the chance of that is just zero, is exactly that mistake.

So we're going to fix this, and to motivate the fix, here's the example we're going to use. Let's say that you've been following the Stanford basketball team for all of their away games, and been, sort of, tracking their wins and losses to gather statistics, and maybe — I don't know — forming a betting pool about whether they're likely to win or lose the next game, okay? So here are some of the statistics. On, I guess, the 8th of February last season they played Washington State, and they did not win. On the 11th of February they played Washington; on the 22nd they played USC; then they played UCLA; then they played USC again; and now you want to estimate the chance that they'll win or lose against Louisville, right? So they played five away games last year and didn't win any of them, but it seems awfully harsh to say there's zero chance that they'll win their next game.

So here's the idea behind Laplace smoothing. We're estimating the probability of Y being equal to one, right? Normally, the maximum likelihood estimate is the number of ones divided by the number of zeros plus the number of ones, okay? I hope this informal notation makes sense. The maximum likelihood estimate for, sort of, a win-or-loss Bernoulli random variable is just the number of ones you saw divided by the total number of examples — that is, the number of zeros you saw plus the number of ones you saw.
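With the away-game record above — zero wins and five losses — that maximum likelihood estimate works out to:

    \hat{\phi}_y
      = \frac{\#\{y^{(i)} = 1\}}{\#\{y^{(i)} = 0\} + \#\{y^{(i)} = 1\}}
      = \frac{0}{5 + 0}
      = 0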
So in Laplace smoothing, we're going to take each of these counts — the number of ones and the number of zeros — and just add one to each of them: the number of zeros plus one, and the number of ones plus one. And so in our example, instead of estimating the probability of winning the next game as 0 over 5 + 0, we'll add one to all of these counts, and so we say that the chance of their winning the next game is 1/7th, okay? Which says that, having seen them lose five away games in a row, we don't think it's terribly likely they'll win the next game, but at least we're not saying it's impossible.

As a historical side note, Laplace actually came up with this method — it's called Laplace smoothing after him — when he was trying to estimate the probability that the sun will rise tomorrow, and his rationale was that even though we've seen the sun rise on a great many days, that doesn't mean we can be absolutely certain the sun will rise tomorrow. He was using this to estimate the probability that the sun will rise tomorrow. Which is kind of cool.

So, more generally, if Y takes on K possible values — if you're trying to estimate the parameters of a multinomial — then you estimate P of Y = J. The maximum likelihood estimate would be the sum from I = 1 to M of the indicator that YI = J, divided by M, right? That's the maximum likelihood estimate of the multinomial probability of Y being equal to J. And when you apply Laplace smoothing to that, you add one to the numerator and add K to the denominator, if Y can take on K possible values, okay?

So for Naive Bayes, what that gives us is this: that was the maximum likelihood estimate, and what you end up doing is adding one to the numerator and adding two to the denominator, and this solves the problem of the zero probabilities. And when your friend sends you email about the NIPS conference, your spam filter will still be able to make a meaningful prediction, all right? Okay. Any questions about this? Yeah?
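Here is a minimal sketch in Python of the Laplace-smoothed estimates and the resulting classifier, assuming the binary feature vectors from before; the function and variable names are illustrative only.

    def fit_naive_bayes(X, y):
        # X: list of binary feature vectors (lists of 0/1); y: list of 0/1 labels.
        m, n = len(X), len(X[0])
        spam = [x for x, label in zip(X, y) if label == 1]
        ham = [x for x, label in zip(X, y) if label == 0]
        # Laplace smoothing for each word: add 1 to the numerator, 2 to the denominator.
        phi_spam = [(1 + sum(x[j] for x in spam)) / (2 + len(spam)) for j in range(n)]
        phi_ham = [(1 + sum(x[j] for x in ham)) / (2 + len(ham)) for j in range(n)]
        phi_y = sum(y) / m  # plain maximum likelihood estimate of P(y = 1)
        return phi_spam, phi_ham, phi_y

    def predict_spam_probability(x, phi_spam, phi_ham, phi_y):
        # p(y = 1 | x) by Bayes' rule, using products of per-word Bernoulli probabilities.
        p_x_given_spam = 1.0
        p_x_given_ham = 1.0
        for xj, ps, ph in zip(x, phi_spam, phi_ham):
            p_x_given_spam *= ps if xj == 1 else (1 - ps)
            p_x_given_ham *= ph if xj == 1 else (1 - ph)
        numerator = p_x_given_spam * phi_y
        return numerator / (numerator + p_x_given_ham * (1 - phi_y))

With a 50,000-word dictionary you would, in practice, sum log probabilities rather than multiply them directly, since the raw products underflow to zero in floating point.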
Student: So that still doesn't quite make sense, because — for instance, if you take the games above — it's a reasonable assumption that the probability of winning is very close to zero, so, I mean, the prediction should be close to zero.

Instructor (Andrew Ng): Right. I would say that in this case the prediction is 1/7th, right? If you see somebody lose five games in a row, you may not have a lot of faith in them, but, as an extreme example, suppose you saw them lose just one game, right? It's just not reasonable to say that the chance of winning the next game is zero, but that's what the maximum likelihood estimate will say.

Student: Yes.

Instructor (Andrew Ng): And —

Student: In such a case, would the learning algorithm [inaudible] or —

Instructor (Andrew Ng): So the question is, given just five training examples, what's a reasonable estimate for the chance of winning the next game, and 1/7th, I think, is actually pretty reasonable; it's less than 1/5th, for instance, so we're saying the chance of winning the next game is less than 1/5th. It turns out that, under a certain set of Bayesian assumptions about the prior and posterior that I won't go into, this Laplace smoothing actually gives the optimal estimate, in a certain sense, of the chance of winning the next game — that is, under a certain assumption about the Bayesian prior on the parameter. So it actually seems like a pretty reasonable estimate to me. Although, I should say — no, I'm just being mean; we actually are a pretty good basketball team, but I chose a losing streak because it's funnier that way.

Let's see. Are there other questions about this? No? Okay. So there's more that I want to say about Naive Bayes, but we'll do that in the next lecture. So let's wrap up.