This presentation is delivered by the Stanford Center for Professional Development.

So welcome back, and what I want to do today is continue our discussion of the EM Algorithm. In particular, I want to take the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model - a mixture of naive Bayes model - and then a large part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about some useful properties of Gaussian distributions.

So just to recap where we are: in the previous lecture, I started to talk about unsupervised learning, which is the class of machine learning problems where you're given an unlabeled training set comprising m examples, right? The fact that there are no labels is what makes this unsupervised learning.

So one problem that I talked about last time was: what if you're given a data set that looks like this and you want to model the density P of X from which you think the data has been drawn? With a data set like this, maybe you think it came from a mixture of two Gaussians, and so we started to talk about an algorithm for fitting a mixture of Gaussians model, all right?

So we said that we would model the density of X, P of X, as the sum over Z of P of X given Z times P of Z, where this latent random variable - meaning this hidden random variable - Z indicates which of the two Gaussian distributions each of your data points came from. And so we had that Z was multinomial with parameter phi, and X, conditioned on coming from the j-th Gaussian, was given by a Gaussian with mean mu J and covariance sigma J, all right? So, like I said, at the beginning of the previous lecture I first talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model - phi, mu, and sigma - but then in the second half of the previous lecture I talked about what's called the EM Algorithm, in which our goal is maximum likelihood estimation of the parameters.
So we want to maximize, in terms of theta, the usual log likelihood - well, parameterized by theta. And because we have a latent random variable Z, this is really maximizing, in terms of theta, the sum over I of the log of the sum over ZI of P of XI, ZI, parameterized by theta. Okay?

So using Jensen's inequality, last time we worked out the EM Algorithm, in which in the E step we would choose these probability distributions QI to be the posterior on Z given X, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the two steps we worked out last time.

And the cartoon that I drew was that you have this log likelihood function L of theta that's often hard to maximize, and what the E step does is choose these probability distributions QI. In the cartoon, what that corresponded to was finding a lower bound for the log likelihood - with theta on the horizontal axis - and then in the M step you maximize that lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And this little curve here, this lower bound function, is really the right-hand side of that arg max - this whole thing inside the arg max. If you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to the new maximum of the lower bound.

So why would you even bother with the EM Algorithm? It turns out that very often - and this will be true for all the examples we see today - performing the maximization in the M step will be tractable and can often be done analytically in closed form.
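Just so the two steps are written out in one place, here they are in the notation from last lecture (this is simply a restatement of the formulas on the board):

E-step: for each i, set
\[
Q_i\big(z^{(i)}\big) := p\big(z^{(i)} \mid x^{(i)}; \theta\big).
\]
M-step: set
\[
\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i\big(z^{(i)}\big) \log \frac{p\big(x^{(i)}, z^{(i)}; \theta\big)}{Q_i\big(z^{(i)}\big)}.
\]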
Whereas if you were trying to maximize this objective directly - if you take this formula on the right, this maximum likelihood objective, and set its derivatives to zero and try to solve - you'll find that you're unable to obtain a closed-form solution to this maximization. Okay?

And to give you an example of that: you remember our discussion on exponential family models, right? It turns out that if X and Z are jointly in an exponential family - so if P of X, Z parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution - then the M step here will be tractable, and the E step will also be tractable, and so you can do each of these steps very easily. Whereas trying to perform the original maximum likelihood estimation problem on this one will be computationally very difficult: you're going to set the derivatives to zero and try to solve, and you won't be able to find an analytic solution. Okay?

So what I want to do in a second is actually take this view of the EM Algorithm and apply it to the mixture of Gaussians model. I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM Algorithm.

It turns out there's one other way of thinking about the EM Algorithm, which is the following: I can define an optimization objective J of theta, Q, and define it to be this - this is just the thing inside the arg max in the M step. Okay?

And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. So in other words, we proved last time that for any value of theta and Q, the log likelihood upper-bounds J of theta and Q.

And so, just to relate this back to yet more things that you already know, you can also think of this in terms of coordinate ascent - remember our discussion a while back on the coordinate ascent optimization algorithm.
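Written out, the objective just defined and the bound we proved are (same notation as above):
\[
J(\theta, Q) = \sum_i \sum_{z^{(i)}} Q_i\big(z^{(i)}\big) \log \frac{p\big(x^{(i)}, z^{(i)}; \theta\big)}{Q_i\big(z^{(i)}\big)},
\qquad
\ell(\theta) \ge J(\theta, Q) \;\text{ for all } \theta, Q.
\]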
So we can show - and I won't actually show this here, so just take my word for it and verify it at home if you want - that EM is just coordinate ascent on the function J. So in the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay? So this is another view of the EM Algorithm that shows why it has to converge, for example: you can view it in terms of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to a mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM Algorithm as a whole? Okay, cool.

So let's go ahead and work on the mixture of Gaussians EM, all right? MOG is my abbreviation for mixture of Gaussians. So in the E step we compute those Q distributions, right? In particular - so Q is the probability distribution over the latent random variable Z, and so in the E step I'm going to figure out what Q of ZI equals J is. And you can think of this as my writing P of ZI equals J under the Q distribution - that's what this notation means.

And so the EM Algorithm tells us that, let's see, QI of J is the posterior probability of Z taking the value J given XI and all your parameters. And the way you compute this is by Bayes rule: that is going to be equal to P of XI given ZI equals J, times P of ZI equals J, divided by - right? That's all by Bayes rule.

And this first term you know, because XI given ZI equals J is a Gaussian with mean mu J and covariance sigma J, so to compute this first term you plug in the formula for the Gaussian density with parameters mu J and sigma J. And this second term you know because Z was multinomial with parameters given by phi, so the probability of ZI being equal to J is just phi J, and you can substitute these terms in. Similarly, you do the same thing for the denominator, and that's how you work out what Q is.
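Putting that Bayes rule computation into a single formula, the E-step weight for the mixture of Gaussians comes out to the following (this is the same formula as in the lecture notes, writing the Gaussian density as N(x; mu, Sigma)):
\[
w_j^{(i)} \;=\; Q_i\big(z^{(i)} = j\big)
\;=\;
\frac{\phi_j \,\mathcal{N}\big(x^{(i)}; \mu_j, \Sigma_j\big)}
     {\sum_{l} \phi_l \,\mathcal{N}\big(x^{(i)}; \mu_l, \Sigma_l\big)}.
\]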
Okay? And in the previous lecture this value - the probability that ZI equals J under the Q distribution - was what I denoted as WIJ. So that would be the E step.

And then in the M step we maximize this with respect to all of our parameters - well, I seem to be writing the same formula down a lot today. All right.

And just so we're completely concrete about how you do that: plugging in the quantities that you know, that becomes this, let's see. Right. And so that we're completely concrete about what the M step is doing: in the M step, that sum over ZI becomes a sum over J - the sum over J is the sum over all the possible values of ZI - and then this first term here is my Gaussian density, my P of XI given ZI, and that's P of ZI. Okay?

And so to maximize this with respect to - say you want to maximize this with respect to all of your parameters phi, mu, and sigma. So to maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that that becomes your update to mu J. Okay?

The exact equation is less important - all of these equations are written down in the lecture notes; I'm writing these down just to be completely concrete about what the M step means. So write down that formula, plug in the densities you know, take the derivative, set it to zero, and solve for mu J, and in the same way you set the derivatives equal to zero and solve for your updates for your other parameters phi and sigma as well. Okay?
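For reference, the closed-form updates that this derivative computation gives for the Gaussian parameters are the usual weighted averages (the phi update is discussed next, since it involves a constraint):
\[
\mu_j := \frac{\sum_{i=1}^m w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^m w_j^{(i)}},
\qquad
\Sigma_j := \frac{\sum_{i=1}^m w_j^{(i)}\,\big(x^{(i)} - \mu_j\big)\big(x^{(i)} - \mu_j\big)^T}{\sum_{i=1}^m w_j^{(i)}}.
\]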
Well, let me just point out one little tricky bit here that you haven't seen before - most of you probably already know it, but I'll just mention it - which is that since phi here is a multinomial distribution, when you take this formula and maximize it with respect to phi you actually have an additional constraint: the sum over J of phi J must be equal to one. All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize it with respect to the parameters phi J you need to respect the constraint that the sum over J of phi J equals one.

And you already know how to do constrained maximization, right? So I'll bring back the method of Lagrange multipliers - we generalized the Lagrangian when we talked about support vector machines. And so to actually perform the maximization in terms of phi J, you construct the Lagrangian, which is the objective from above - I'll denote it with the dot dot dot - plus the Lagrange multiplier times that constraint, where this is the Lagrange multiplier and this is your optimization objective.

And so to actually solve for the parameters phi J you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math, you get the appropriate values to update the phi J's to, which I won't do here - all the full derivations are in the lecture notes.

Okay. And so if you actually perform all these computations you can also verify that. So I just wrote down a bunch of formulas for the EM Algorithm. At the beginning of the last lecture I said, for the mixture of Gaussians model, here's the formula for computing the WIJ's and here's a formula for computing the mu's and so on, and this derivation is where all of those formulas actually come from. Okay?

Questions about this? Yeah? Student:[Inaudible] Instructor (Andrew Ng):Oh, I see. So it turns out that, yes, there's also a constraint that the phi J's must be greater than or equal to zero. It turns out that, if you want, you could actually write down the generalized Lagrangian incorporating all of these constraints as well and solve subject to those constraints.
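As a sketch of the constrained maximization just described - keeping only the terms of the M-step objective that depend on phi, ignoring the non-negativity constraint for the moment, and writing beta for the Lagrange multiplier - the Lagrangian and the update it yields are:
\[
\mathcal{L}(\phi) = \sum_{i=1}^m \sum_{j} w_j^{(i)} \log \phi_j \;+\; \beta\Big(\sum_j \phi_j - 1\Big),
\qquad\text{giving}\qquad
\phi_j := \frac{1}{m}\sum_{i=1}^m w_j^{(i)}.
\]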
It turns out that in this particular derivation - and actually this happens very often when we find maximum likelihood estimates of multinomial distribution probabilities - if you ignore those non-negativity constraints and just maximize the formula, luckily you end up with values that are greater than or equal to zero anyway. And if, even ignoring those constraints, you end up with parameters that are greater than or equal to zero, that shows that that must be the correct solution, because adding the constraint won't change anything. So it turns out that if you ignore this constraint and just do what I wrote down, you actually get the right answer. Okay? Great.

So let me just very quickly talk about one more example of a mixture model. And the perfect example for this is: imagine you want to do text clustering, right? So someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com. That's one application of text clustering, where you might want to look at all of the news stories about today - all the news stories written by all the online news websites about whatever happened yesterday - and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay?

So how do you apply the EM Algorithm to text clustering? I want to do this to illustrate an example in which you run the EM Algorithm on discrete-valued inputs, where the training examples XI are discrete-valued. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes, I talked about two event models: one was the multivariate Bernoulli event model, and one was the multinomial event model. Today I'm going to use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. But in this setting we're given a training set X1 through XM. So we're given M text documents, where each XI lies in {0, 1} to the N.
So each of our training examples is an n-dimensional bit vector, right? This was the representation where XIJ indicates whether word J appears in document I, right? And let's say that we're going to model ZI - our latent random variable, meaning our hidden random variable - as taking on two values, zero and one, so this means I'm just going to find two clusters, and you can generalize to however many clusters you want.

So in the mixture of naive Bayes model we assume that ZI is distributed Bernoulli with some value of phi, so there's some probability of each document coming from cluster one or from cluster zero. And we assume that the probability of XI given ZI - right? - makes the same naive Bayes assumption as we did before. Okay?

And more specifically, the probability of XIJ equals one, given ZI equals, say, zero, will be given by a parameter phi subscript J given Z equals zero. So if you take this chalkboard and you take all instances of the letter Z and replace them with Y, then you end up with exactly the same equations as I wrote down for naive Bayes a long time ago. Okay?

And I'm not actually going to work out the mechanics of deriving the EM Algorithm, but it turns out that if you take this joint distribution of X and Z, and you work out what the equations are for the EM algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute these weights WI, which are going to be equal to your posterior probability of ZI being equal to one conditioned on XI, parameterized by your phi's and given your parameters, and in the M step - okay? - and that's the equation you get in the M step. Again, the equations themselves aren't too important. I'll give you a second to finish writing, I guess.
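Since the board equations aren't reproduced in the transcript, here is a sketch of what the updates look like for this two-cluster multivariate Bernoulli model, following the standard derivation (w^(i) denotes the E-step weight just described):
\[
\text{E-step:}\quad w^{(i)} := P\big(z^{(i)} = 1 \mid x^{(i)};\, \phi_{j\mid z},\, \phi\big),
\]
\[
\text{M-step:}\quad
\phi_{j\mid z=1} := \frac{\sum_{i=1}^m w^{(i)} x_j^{(i)}}{\sum_{i=1}^m w^{(i)}},
\qquad
\phi_{j\mid z=0} := \frac{\sum_{i=1}^m \big(1 - w^{(i)}\big) x_j^{(i)}}{\sum_{i=1}^m \big(1 - w^{(i)}\big)},
\qquad
\phi := \frac{1}{m}\sum_{i=1}^m w^{(i)}.
\]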
And when you're done writing, take a look at these equations and see if they make intuitive sense to you - why these three equations sound like they might be the right thing to do. Yeah? Student:[Inaudible] Instructor (Andrew Ng):Say that again. Student:Y - Instructor (Andrew Ng):Oh, yes, thank you. Right. Sorry - everywhere I wrote Y, I meant Z. Yeah? Student:[Inaudible] in the first place? Instructor (Andrew Ng):No. Normally you initialize the phi's to be something else, say randomly. So just like in naive Bayes, where we saw zero probabilities as a bad thing, for the same reason you try to avoid zero probabilities here, yeah. Okay?

And so the intuition behind these equations is: in the E step, the WI's are your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuitions behind the EM Algorithm that we talked about in the previous lecture. So in the E step we compute these weights that tell us: do I think this document came from cluster one or cluster zero?

And then in the M step, the numerator is a sum over all the elements of my training set of - so, informally, WI is one when I think the document came from cluster one, and so this will essentially sum up all the times I saw word J in documents that I think are in cluster one. And these are weighted by the actual probability that I think the document came from cluster one. And then I'll divide by - again, if all of these WI's were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the WI's were either ones or zeros, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word J. Okay?

But in the EM Algorithm you don't make a hard assignment decision about whether a document is in cluster one or cluster zero; you instead represent your uncertainty about cluster membership with the parameters WI. Okay? It actually turns out that when you implement this particular model, by the nature of this computation all the values of the WI's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeros. This is a property of naive Bayes.
If you actually compute these probabilities for all those documents, you find that each WI is either 0.0001 or 0.999 - it'll be amazingly close to either zero or one - and so the M step is pretty much guessing whether each document is in cluster one or cluster zero, and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay? Cool.

And if some of these equations don't look that familiar to you anymore, go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay.

Of course, the way I got these equations was by working through the machinery of the EM Algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and in the M step setting the derivatives equal to zero and solving; that's how you get the E step and the M step.

So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is really two reasons. One is that factor analysis is kind of a useful model. It's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. But the other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular, factor analysis will be an example in which we do EM where the latent random variable - the hidden random variable Z - is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you've seen before. It turns out the full derivation of EM for factor analysis is extremely long and complicated, so I won't inflict all of that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are just a few steps in the factor analysis derivation that I want to explicitly illustrate.

So let me actually motivate the model, and it's really a contrast to the mixture of Gaussians model, all right?
So for the mixture of Gaussians model, which was our first model, I actually motivated it by drawing a data set like this, right? Say you have a data set that looks like this. So this was a problem where the data is two-dimensional and you have, I don't know, maybe 50 or 100 training examples, whatever, right? And I said, given an unlabeled training set like this, maybe you want to model it as a mixture of two Gaussians. Okay?

And so mixture of Gaussians models tend to be applicable where m is larger - and often much larger - than n: where the number of training examples you have is at least as large as, and is usually much larger than, the dimension of the data.

What I want to do now is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples. Okay? So how do you model such very high dimensional data? You will see this sometimes, right? If you run a plant or a factory, maybe you have a thousand measurements taken all through your plant, but you only have, say, 20 days of data. So you can have 1,000-dimensional data, but only 20 examples of it. So given data that has this property - given a training set of m examples - what can you do to try to model the density of X?

So one thing you can do is try to model it just as a single Gaussian. So instead of a mixture of Gaussians, you model it as a single Gaussian and say X is Gaussian with mean mu and covariance sigma, where sigma is going to be an n by n matrix. And if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right? So that makes sense. And the maximum likelihood estimate of the covariance matrix sigma will be this, all right?
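Written out, those maximum likelihood estimates are:
\[
\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)},
\qquad
\Sigma = \frac{1}{m}\sum_{i=1}^m \big(x^{(i)} - \mu\big)\big(x^{(i)} - \mu\big)^T.
\]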
But it turns out that in this regime where the data is much higher dimensional - excuse me, where the data's dimension is much larger than the number of training examples you have - if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or that it has zero eigenvalues - I hope one of those terms makes sense. Another way of saying it is that the matrix sigma will be non-invertible.

And just in pictures, one concrete example is if n equals m equals two: you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions - this is X1 and X2; this is my unlabeled data. If you fit a Gaussian to this data set, you find that - well, you remember I used to draw the contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood Gaussian for this corresponds to a Gaussian whose contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will be infinitely thin and stretch infinitely long in that direction.

And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and the determinant of sigma is zero. So you'd have one over zero times e to the something involving the inverse of a non-invertible matrix - so, not a good model.

So can we do better? Given this sort of data, how do you model P of X? Well, one thing you could do is constrain sigma to be diagonal. So in other words, you constrain sigma to be this matrix, with zeros on the off-diagonals - I hope this makes sense; the zeros I've written down here denote that everything off the diagonal of this matrix is zero. So the maximum likelihood estimate of the parameters will be pretty much what you'd expect, right?
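Concretely, "what you'd expect" here is that each diagonal entry is just the empirical variance of that coordinate:
\[
\Sigma_{jj} = \frac{1}{m}\sum_{i=1}^m \big(x_j^{(i)} - \mu_j\big)^2,
\qquad
\Sigma_{jk} = 0 \;\text{ for } j \ne k.
\]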
And in pictures, what this means is that you're modeling the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal, and here's another example, and here's a third example. Whereas before I'd often drawn examples of Gaussians where the covariance matrix has off-diagonal entries. Okay? And, I don't know, you could do this to model P of X, but this isn't very nice, because you've now thrown away all the correlations between the different variables - the axes are X1 and X2, right? So you're failing to capture any of the correlations or relationships between any pair of variables in your data. Yeah?

Student:Could you say again what the diagonal covariance does? I didn't quite understand what the examples mean.

Instructor (Andrew Ng):Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see - suppose the covariance is diagonal; then you can ask what is P of X, parameterized by mu and sigma, if sigma is diagonal? And this will be some Gaussian bump, right? So - oh, boy, my drawing's really bad, but in two-D the density of a Gaussian is this bump-shaped thing, right? So this is the density of the Gaussian - wow, and this is a really bad drawing - with the axes being X1 and X2 and the height of this being P of X, and so those figures over there are the contours of the density of the Gaussian. So those are the contours of this shape.

Student:No, I don't mean the contours. What's special about these types? What makes them different from a general covariance matrix?

Instructor (Andrew Ng):Oh, I see. Oh, okay, sorry. They're axis-aligned, so - let's see. I'm not drawing a contour like this, right? Because the major axes of that one are not aligned with the X1 and X2 axes; that corresponds to a Gaussian where the off-diagonals are non-zero, right? Cool. Okay.

So you could do this - this sort of works.
It turns out that with just two training examples you can now learn a non-singular covariance matrix, but you've thrown away all of the correlation in the data, so this is not a great model.

It turns out you can do something - well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is to constrain sigma to equal sigma squared times the identity matrix. So in other words, you constrain it to be a diagonal matrix where, moreover, all the diagonal entries must be the same, and so the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is an even harsher constraint to place on your model.

So either of these versions - diagonal sigma, or sigma being a constant value times the identity - are already strong assumptions, all right? So if you have enough data, maybe you'd like to model at least a little bit of the correlation between your different variables. The factor analysis model is one way to attempt to do that. So here's the idea.

This is how the factor analysis model models your data. We're going to assume that there is a latent random variable - which just means a hidden random variable - Z. So Z is distributed Gaussian with mean zero and covariance the identity, where Z will be a d-dimensional vector now, and d will be chosen so that it is smaller than the dimension of your X's. Okay?

And now I'm going to assume that, conditioned on the value of Z, X is given by another Gaussian with mean given by mu plus lambda Z and covariance given by a matrix psi. So, to say that second line in an equivalent form: equivalently, I'm going to model X as mu plus lambda Z plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance psi.
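Collecting the generative story into equations:
\[
z \sim \mathcal{N}(0, I_d),
\qquad
\epsilon \sim \mathcal{N}(0, \Psi),
\qquad
x = \mu + \Lambda z + \epsilon,
\quad\text{equivalently}\quad
x \mid z \sim \mathcal{N}\big(\mu + \Lambda z,\, \Psi\big).
\]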
And so the parameters of this model are going to be a vector mu, which is n-dimensional; a matrix lambda, which is n by d; and a covariance matrix psi, which is n by n. And I'm going to impose an additional constraint on psi: I'm going to impose the constraint that psi is diagonal. Okay?

So that was the formal definition - let me actually give a couple of examples to make this more concrete. So here's one example: suppose Z is one-dimensional and X is two-dimensional, and let's see a specific instance of the factor analysis model and what it gives us in terms of a model over P of X, all right?

So, for this model, let me assume that lambda is (2, 1) and psi - which has to be a diagonal matrix, remember - is this. Okay? So Z is one-dimensional, so let me just draw a typical sample for Z, all right? If I draw the ZI's from a Gaussian, that's a typical sample of what Z might look like, and I'm going to call these Z1, Z2, Z3 and so on. If this really were a typical sample, the order of the Z's would be jumbled up, but I'm just ordering them like this to make the example easier. So, yes, a typical sample of the random variable Z from a Gaussian distribution with mean zero and covariance one.

And for this example let me just set mu equal to zero, just so it's easier to talk about. So lambda times Z: we'll take each of these numbers and multiply them by lambda, and you find that all of the values for lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess. So if this was Z7, then this one here would be lambda times Z7, and now that's a point in R2, because lambda is a two by one matrix.
And so what I've drawn here is a typical sample for lambda times Z, and the final step is what a typical sample for X looks like. Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by psi, right? And so the last step, to draw a typical sample for the random variable X: I'm going to take these points - these are really the same as mu plus lambda Z, because mu is zero in this example - and around each point I'm going to place an axis-aligned ellipse. In other words, I'm going to create a Gaussian distribution centered on this point, and what I've drawn corresponds to one of the contours of my density for epsilon, right? So you can imagine placing a little Gaussian bump here. And I'll draw an example from this little Gaussian, and let's say I get that point; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the orange points I drew will comprise a typical sample for what the distribution of X is under this model. Okay? Yeah?

Student:Would you add, like, the mean? Instructor (Andrew Ng):Oh, say that again. Student:Do you add the mean into that? Instructor (Andrew Ng):Oh, yes, you do. And in this example I set mu to zero zero just to make it easier. If mu were something else, you'd take the whole picture and shift it by whatever the value of mu is. Yeah?

Student:[Inaudible] horizontal line right there, which was Z. What does that Y-axis correspond to? Instructor (Andrew Ng):Oh, so Z is one-dimensional, so here I'm plotting a typical sample for Z, so this is like zero. So this is just the Z axis, right - Z is one-dimensional data, so this line here is a plot of a typical sample of values for Z. Okay? Yeah?

Student:You have two axes, right? And the axis data pertains to samples. Instructor (Andrew Ng):Oh, yes, right. Student:So you're sort of projecting them into that? Instructor (Andrew Ng):Let's not talk about projections yet, but, yeah, right. So these beige points - that's like X1 and that's X2 and so on, right? So the beige points are what I see.
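Just to make the sampling story concrete, here is a small code sketch of the picture just drawn. This is purely illustrative: the values of psi and the number of samples are made up, and only lambda = (2, 1) and mu = 0 come from the example on the board.

```python
import numpy as np

# Factor analysis generative model, matching the 1-D -> 2-D example:
#   z ~ N(0, 1),  x = mu + lambda * z + epsilon,  epsilon ~ N(0, Psi), Psi diagonal.
rng = np.random.default_rng(0)

mu = np.zeros(2)                 # mu = 0, as in the example
lam = np.array([2.0, 1.0])       # lambda = (2, 1): maps a scalar z onto a line in R^2
Psi = np.diag([1.0, 0.5])        # diagonal noise covariance (illustrative values)

m = 100                          # number of samples (illustrative)
z = rng.standard_normal(m)                               # typical sample of z's
line_points = mu + np.outer(z, lam)                      # mu + lambda * z: points on a line
eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)  # axis-aligned Gaussian noise
x = line_points + eps                                    # the "orange points" the algorithm sees
```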
And so in reality, all you ever get to see are the X's. But just like in the mixture of Gaussians model, where I told a story about how the data came from two Gaussians - there was a hidden random variable Z that led to the generation of X's from two Gaussians - in the same way I'm telling a story here. All the algorithm actually sees are the orange points, but we're going to tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?

So one way to see the intuition of this model - just informally, not formally - is that you can think of the factor analysis model as modeling the data as coming from a lower-dimensional subspace, more or less. So the data X here lies approximately on a one-dimensional line, plus a little bit of random noise, so the X's aren't exactly on that one-dimensional line. That's one informal way of thinking about factor analysis.

We're not doing great on time. Well, let's do this. Let me just do one more quick example, which is: in this example, let's say Z is in R2 and X is in R3, right? So in this example your data Z now lies in 2-D, so let me draw this on a sheet of paper. Okay? Let's say the axes of my sheet of paper are the Z1 and Z2 axes, and here is a typical sample of points Z, right? And we'll then take that sample of Z's - well, actually, let me draw this here as well. All right, so this is a typical sample for Z, drawn on the Z1 and Z2 axes, and I guess the origin would be here, so it's centered around zero.

And then we'll take those and map them to mu plus lambda Z, and what that means is: if you imagine the free space of this classroom is R3, we'll take this sample of Z's and map it to a position in free space. So we'll take this sheet of paper and move it to some position and some orientation in 3-D space.
And the last step is, you have X equals mu plus lambda Z plus epsilon, and so you would take the set of points which lie in some plane in our 3-D space and add a little bit of noise to each of these, and the noise will, sort of, come from axis-aligned Gaussians. Okay? So you end up with a data set that's sort of like a fat pancake, or a little bit of fuzz off your pancake.

So that's the model - let's actually talk about how to fit the parameters of the model. Okay?

In order to describe how to fit the model, I first need to rewrite Gaussians in a very slightly different way. So, in particular, let's say I have a vector X, and I'm gonna use this notation to denote partitioned vectors - X = [X1; X2] - where if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector, and X is an r-plus-s dimensional vector. Okay? So I'm gonna use this notation to denote just taking a vector and, sort of, partitioning it into two halves: the first r elements followed by the last s elements.

So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector - so break mu up into two pieces, mu1 and mu2 - and the covariance matrix sigma is now a partitioned matrix. Okay? So what this means is that you take the covariance matrix sigma and I'm going to break it up into four blocks, right? And the dimensions are such that there will be r elements here, there will be s elements here, and there will be r elements here. So, for example, sigma 1,2 will be an r-by-s matrix; it's r elements tall and s elements wide.

So this Gaussian we've written down is really a joint distribution over a lot of variables, right? X is a vector, so this is a joint distribution over X1 through X of r plus s. We can then ask: what are the marginal and conditional distributions of this Gaussian? So, for example, with my Gaussian, I know what P of X is, but can I compute the marginal distribution of X1, right?
And so P of X1 is just equal to, of course, the integral over X2 of P of X1, X2, dX2. And if you actually perform that computation, you find that P of X1, I guess, is Gaussian with mean given by mu1 and covariance sigma 1,1. All right. So this is sort of no surprise: the marginal distribution of a Gaussian is itself a Gaussian, and you just take out the relevant sub-block of the covariance matrix and the relevant sub-vector of the mean vector mu.

You can also compute conditionals. What is P of X1 given a specific value for X2, right? And the way you compute that is, well, the usual way: P of X1, X2 divided by P of X2, right? And you know what both of these formulas are, right? The numerator - well, this is just the usual Gaussian: your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma - and the denominator, by that marginalization operation I talked about, is that. So if you actually plug in the formulas for these two Gaussians and simplify - the simplification step is actually fairly non-trivial; if you haven't seen it before this will actually be somewhat difficult to do - but if you plug in the Gaussians and simplify that expression, you find that, conditioned on the value of X2, the distribution of X1 given X2 is itself going to be Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where - well, skipping the simplification and derivation, the formula for mu of 1 given 2 is given by this, and I think sigma of 1 given 2 is given by that. Okay? So these are just useful formulas to know for how to find the conditional distributions of a Gaussian and the marginal distributions of a Gaussian. I won't actually show the derivation for this.

Student: Could you repeat the [inaudible]?

Instructor (Andrew Ng): Sure.
So this one on the left, mu of 1 given 2, equals mu1 plus sigma 1,2 times sigma 2,2 inverse times X2 minus mu2, and this is sigma 1 given 2 equals sigma 1,1 minus sigma 1,2 sigma 2,2 inverse sigma 2,1. Okay? These are also in the lecture notes.

Shoot. Not as far along as I was hoping to be on time. Well, actually it's okay.

So it turns out - I think I'll skip this in the interest of time. It turns out that - well, let's go back and use these in the factor analysis model, right? It turns out that you can go back and - oh, do I want to do this? I kind of need this, though. So let's go back and figure out just what joint distribution the factor analysis model assumes on Z and X. Okay? So under the factor analysis model, Z and X - the random variables Z and X - have some joint distribution given by a mean vector, which I'll write as mu of ZX, and some covariance matrix sigma. So let's go back and figure out what mu ZX is and what sigma is, and I'll do this so that we'll get a little bit more practice with partitioned vectors and partitioned matrices.

So just to remind you, right? You have Z as Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance Psi. So I'm just writing out the same equations again.

So let's first figure out what this vector mu ZX is. Well, the expected value of Z is zero - and, again, as usual I'll often drop the square brackets here. And the expected value of X is - well, the expected value of mu plus lambda Z plus epsilon. These two terms have zero expectation, and so the expected value of X is just mu. And so that vector mu ZX, right, in my parameterization of the Gaussian, is going to be the expected value of this partitioned vector given by this partition Z and X, and so that would just be zero followed by mu. Okay? And so that's a d-dimensional zero followed by an n-dimensional mu.
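As a quick numerical sketch of the marginal and conditional formulas from a moment ago, here is a small numpy example; the dimensions and parameter values are made up for illustration.

```python
import numpy as np

# Partitioned Gaussian: x = [x1; x2], with x1 in R^r and x2 in R^s.
r, s = 2, 3
rng = np.random.default_rng(1)

mu1, mu2 = rng.normal(size=r), rng.normal(size=s)
A = rng.normal(size=(r + s, r + s))
Sigma = A @ A.T + np.eye(r + s)          # a valid (positive definite) covariance
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

# Marginal of x1: N(mu1, sigma_{1,1}).
marg_mean, marg_cov = mu1, S11

# Conditional of x1 given an observed x2:
#   mu_{1|2}    = mu1 + sigma_{1,2} sigma_{2,2}^{-1} (x2 - mu2)
#   sigma_{1|2} = sigma_{1,1} - sigma_{1,2} sigma_{2,2}^{-1} sigma_{2,1}
x2 = rng.normal(size=s)
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

print(cond_mean)
print(cond_cov)
```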
Now let's work out what the covariance matrix sigma is. So the covariance matrix sigma - if you work it out as a partitioned matrix - will comprise four blocks like that, and the upper-left block, which I write as sigma 1,1 - well, that upper-left block is just the covariance matrix of Z, which we know is the identity.

I was gonna show you briefly how to derive some of the other blocks, right? So sigma 1,2 - that's the upper - oh, actually, excuse me, sigma 2,1, which is the lower-left block - that's E of X minus E[X] times Z minus E[Z] transpose. So X is equal to mu plus lambda Z plus epsilon, and then minus E[X] is minus mu, and then times Z transpose, because the expected value of Z is zero, right, so that E[Z] term is equal to zero. And so if you expand this out, the plus mu and minus mu cancel, and you have the expected value of lambda Z Z transpose plus the expected value of epsilon Z transpose, which is equal to lambda times the expected value of Z Z transpose, which is just lambda times the identity matrix - that is, lambda. Okay? Does that make sense? Because this second term is equal to zero: epsilon and Z are independent and have zero expectation, so the second term is zero.

Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? And I won't do this, but this simplifies to lambda lambda transpose plus Psi. Okay?

So putting all this together, this tells us that the joint distribution of this vector ZX is going to be Gaussian with mean vector given by that, which we worked out previously - so this is the mu ZX that we worked out previously - and covariance matrix given by that. Okay?

So in principle - let's see, so the parameters of our model are mu, lambda, and Psi.
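Putting the blocks together, here is a minimal sketch that builds the joint covariance of (Z, X) - the identity in the upper left, lambda transpose in the upper right (by symmetry with the lower-left block just derived), lambda in the lower left, and lambda lambda transpose plus Psi in the lower right - and sanity-checks it against an empirical covariance estimated from samples. The parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n = 2, 3                          # z in R^d, x in R^n
mu = np.array([1.0, -0.5, 2.0])
Lam = rng.normal(size=(n, d))        # lambda
Psi = np.diag([0.3, 0.5, 0.2])       # diagonal noise covariance

# Joint covariance of [z; x] under the factor analysis model:
#   [[ I,              lambda^T              ],
#    [ lambda,  lambda lambda^T + Psi        ]]
Sigma_joint = np.block([[np.eye(d),      Lam.T],
                        [Lam, Lam @ Lam.T + Psi]])
mu_joint = np.concatenate([np.zeros(d), mu])

# Empirical check: sample z and epsilon, then form x = mu + lambda z + epsilon.
m = 200_000
z = rng.normal(size=(m, d))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
x = mu + z @ Lam.T + eps
zx = np.hstack([z, x])

print(np.allclose(np.cov(zx, rowvar=False), Sigma_joint, atol=0.05))
```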
And so, in order to find the parameters of this model, we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters. And so in principle one thing you could do is actually write down what P of XI is. Right, so P of XI - the distribution of X - if, again, you marginalize this Gaussian, then the distribution of X, which is the lower half of this partitioned vector, is going to have mean mu and covariance given by lambda lambda transpose plus Psi. Right? So that's the distribution we're using to model P of X.

And so in principle one thing you could do is actually write down the log likelihood of your parameters, right? Which is just the sum over I of log P of XI, where P of XI will be given by this Gaussian density - and I'm using theta as shorthand to denote all of my parameters. And you actually know what the density of a Gaussian is, and so you can say P of XI is this Gaussian with mean mu and covariance given by lambda lambda transpose plus Psi. So you can write down the log likelihood of your parameters as follows, and you can try to take derivatives of the log likelihood with respect to your parameters and maximize it. All right. It turns out that if you do that you end up with, sort of, an intractable optimization problem - or, excuse me, you end up with an optimization problem that you will not be able to find analytic, sort of, closed-form solutions to. So if you say my model of X is this and you do maximum likelihood parameter estimation, you won't be able to find the maximum likelihood estimate of the parameters in closed form.

So what I would have liked to do is - well, so in order to fit parameters to this model, what we'll actually do is use the EM algorithm.
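To make that obstacle concrete, here is a small sketch of the marginal log likelihood just described - the sum over i of log N(x_i; mu, lambda lambda^T + Psi) - using scipy; the data and parameter values are invented for illustration. You can evaluate this quantity and hand it to a generic numerical optimizer, but there is no closed-form maximizer in lambda and Psi, which is why the lecture turns to EM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def factor_analysis_log_likelihood(X, mu, Lam, Psi):
    """Marginal log likelihood: sum_i log N(x_i; mu, lambda lambda^T + Psi)."""
    cov = Lam @ Lam.T + Psi
    return multivariate_normal(mean=mu, cov=cov).logpdf(X).sum()

# Illustrative data and parameters (assumptions, not from the lecture).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))        # stand-in for the observed x^(i)'s
mu = X.mean(axis=0)                  # the maximum likelihood estimate of mu is the sample mean
Lam = rng.normal(size=(3, 1))
Psi = np.diag([1.0, 1.0, 1.0])

print(factor_analysis_log_likelihood(X, mu, Lam, Psi))
```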
In the E step, we'll compute Q_i of Z_i equals the posterior P of Z_i given X_i, parameterized by theta. And this formula looks the same as before, except that one difference is that now Z is a continuous random variable, and so in the E step we actually have to find the density Q_i of Z_i - the E step actually requires that we find the posterior distribution, the density of the random variable Z_i. And then the M step will perform the following maximization: set theta to the value that maximizes the sum over i of the integral over Z_i of Q_i of Z_i times log of P of X_i, Z_i parameterized by theta, divided by Q_i of Z_i, d Z_i. Where, again, because Z_i is now continuous, we now have an integral over Z rather than a sum. Okay?

I was hoping to go a little bit further in deriving these things, but I don't have time today, so we'll wrap that up in the next lecture. But before I close, let's check if there are questions about the whole factor analysis model. Okay.

So we'll come back in the next lecture; I will wrap up this model, and I want to go a little bit deeper into the E and M steps, as there are some tricky parts for the factor analysis model specifically. Okay. I'll see you in a couple of days.
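As a preview of what that E step works out to (the next lecture derives it properly), applying the Gaussian conditioning formulas from earlier to the joint distribution of (Z, X) gives a Gaussian posterior over Z_i. Here is a minimal numpy sketch of that posterior under made-up parameter values.

```python
import numpy as np

def e_step_posterior(x, mu, Lam, Psi):
    """Posterior p(z | x) for factor analysis: a Gaussian obtained by applying
    the conditioning formulas to the joint covariance of (z, x).

        mean = lambda^T (lambda lambda^T + Psi)^{-1} (x - mu)
        cov  = I - lambda^T (lambda lambda^T + Psi)^{-1} lambda
    """
    d = Lam.shape[1]
    S22 = Lam @ Lam.T + Psi                        # covariance of x
    mean = Lam.T @ np.linalg.solve(S22, x - mu)
    cov = np.eye(d) - Lam.T @ np.linalg.solve(S22, Lam)
    return mean, cov

# Illustrative values only (assumptions, not from the lecture).
rng = np.random.default_rng(4)
mu = np.zeros(3)
Lam = rng.normal(size=(3, 2))
Psi = np.diag([0.3, 0.4, 0.5])
x = rng.normal(size=3)
print(e_step_posterior(x, mu, Lam, Psi))
```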