This presentation is delivered by the Stanford Center for Professional Development.

So welcome back. What I want to do today is continue our discussion of the EM algorithm, and in particular I want to take the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model, the mixture of naive Bayes model, and then the last part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about some useful properties of Gaussian distributions.

So just to recap where we are. In the previous lecture, I started to talk about unsupervised learning, which is the class of machine learning problems where you're given an unlabeled training set comprising m examples, right? The fact that there are no labels is what makes this unsupervised learning.

So one problem that I talked about last time was: what if you're given a data set that looks like this, and you want to model the density p(x) from which you think the data had been drawn? With a data set like this, maybe you think it was a mixture of two Gaussians, and so we started to talk about an algorithm for fitting a mixture of Gaussians model, all right?

And so we said that we would model the density of x, p(x), as the sum over z of p(x | z) times p(z), where this latent random variable, meaning this hidden random variable z, indicates which of the two Gaussian distributions each of your data points came from. And so we have that z was multinomial with parameter phi, and x conditioned on z = j, that is, coming from the j-th Gaussian, was given by a Gaussian with mean mu_j and covariance sigma_j, all right?
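(As a small illustration of the model just recapped, here is a minimal sketch in Python of that mixture density; the two-component parameters below are made-up numbers for illustration, not values from the lecture.)

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, phi, mus, sigmas):
    """p(x) = sum_j p(x | z = j) p(z = j) = sum_j phi_j * N(x; mu_j, sigma_j)."""
    return sum(phi_j * multivariate_normal.pdf(x, mean=mu_j, cov=sigma_j)
               for phi_j, mu_j, sigma_j in zip(phi, mus, sigmas))

# Two made-up components in two dimensions, purely for illustration.
phi = [0.6, 0.4]                              # p(z = j), the multinomial parameters
mus = [np.zeros(2), np.array([3.0, 3.0])]     # component means
sigmas = [np.eye(2), 0.5 * np.eye(2)]         # component covariances
print(mixture_density(np.array([1.0, 1.0]), phi, mus, sigmas))
```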
So, like I said, at the beginning of the previous lecture I just talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model, namely phi, mu, and sigma. But then in the second half of the previous lecture I talked about what's called the EM algorithm, in which our goal is maximum likelihood estimation of the parameters. So we want to maximize, in terms of theta, the usual log likelihood l(theta), parameterized by theta.

And because we have a latent random variable z, this is really maximizing, in terms of theta, the sum over i of the log of the sum over z_i of p(x_i, z_i; theta). Okay?

So using Jensen's inequality, last time we worked out the EM algorithm, in which in the E step we would choose these probability distributions Q_i to be the posterior on z given x, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the ones we worked out last time.

And the cartoon that I drew was that you have this log likelihood function l(theta) that's often hard to maximize, and what the E step does is choose these probability distributions Q_i. In the cartoon, what that corresponded to was finding a lower bound for the log likelihood, where the horizontal axis is theta, and then in the M step you maximize that lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And so this little curve here, right, this lower bound function here, that's really the right-hand side of that arg max. Okay? So, this whole thing inside the arg max.
If you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to this new maximum of the lower bound.

So it turns out that in the EM algorithm - so why do we bother with the EM algorithm? It turns out that very often, and this will be true for all the examples we see today, it turns out that very often in the EM algorithm performing the maximization in the M step will be tractable and can often be done analytically in closed form. Whereas if you were trying to maximize this objective directly - if we try to take this formula on the right, this maximum likelihood objective, and set its derivatives to zero and try to solve - you'll find that you're unable to obtain a solution to this maximization in closed form. Okay?

And to give you an example of that, you remember our discussion on exponential family models, right? It turns out that if x and z are jointly, I guess, in an exponential family - so if p(x, z) parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution - then it turns out that the M step here will be tractable and the E step will also be tractable, and so you can do each of these steps very easily. Whereas trying to perform the original maximum likelihood estimation problem on this one, right, will be computationally very difficult: you're going to set the derivatives to zero and try to solve, and you won't be able to find an analytic solution to this. Okay?

So what I want to do in a second is actually take this view of the EM algorithm and apply it to the mixture of Gaussians model.
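(Just to put the general recipe in one place, here is a minimal skeleton of the EM loop as described; the e_step and m_step arguments are placeholders for whatever model-specific routines, such as the mixture of Gaussians ones worked out next, you plug in.)

```python
def em(e_step, m_step, theta, x, n_iters=100):
    """Generic EM loop, as a skeleton.

    e_step(x, theta) should return the posteriors Q_i(z_i) = p(z_i | x_i; theta);
    m_step(x, q) should return the theta maximizing the lower bound
    sum_i sum_{z_i} Q_i(z_i) * log( p(x_i, z_i; theta) / Q_i(z_i) ).
    """
    for _ in range(n_iters):
        q = e_step(x, theta)    # E step: compute posteriors, tightening the bound at current theta
        theta = m_step(x, q)    # M step: maximize the lower bound over theta
    return theta
```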
I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM algorithm. It turns out there's one other way of thinking about the EM algorithm, which is the following: I can define an optimization objective J of theta, Q, and define it to be this. This is just the thing inside the arg max in the M step. Okay?

And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. In other words, we proved last time that for any value of theta and Q, the log likelihood upper bounds J of theta, Q.

And so, just to relate this back to yet more things that you already know, you can also think of this in terms of coordinate ascent, right, recalling our discussion a while back on the coordinate ascent optimization algorithm. So we can show, and I won't actually show this here, so just take my word for it and work through it at home if you want, that EM is just coordinate ascent on the function J. In the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay? So this is another view of the EM algorithm that shows why it has to converge, for example: you can view it in terms of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to the mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM algorithm as a whole? Okay, cool.

So let's go ahead and work out the mixture of Gaussians EM, all right? MoG, that's my abbreviation for mixture of Gaussians. So in the E step, recall, we chose those Q distributions, right?
In particular, I want to work out - so Q is the probability distribution over the latent random variable z, and so in the E step I'm gonna figure out how to compute Q_i of z_i = j. And you can think of this as my writing p(z_i = j), right, under the Q distribution; that's what this notation means.

And so the EM algorithm tells us that, let's see, Q_i of j is the posterior probability of z taking the value j given x_i and all your parameters. And the way you compute this is by Bayes' rule, right? So that is going to be equal to p(x_i | z_i = j) times p(z_i = j), divided by - right? That's all by Bayes' rule.

And this you know, because x_i given z_i = j, this was a Gaussian with mean mu_j and covariance sigma_j. So to compute this first term you plug in the formula for the Gaussian density there, with parameters mu_j and sigma_j, and this you'd know because z was multinomial, right, with parameters given by phi, and so the probability of z_i being equal to j is just phi_j, and so you can substitute these terms in. Similarly you do the same thing for the denominator, and that's how you work out what Q is. Okay? And so, as in the previous lecture, this value, the probability that z_i equals j under the Q distribution, that's what I denoted as w_ij.

So that would be the E step, and then in the M step we maximize with respect to all of our parameters. This - well, I seem to be writing the same formula down a lot today. All right. And just so we're completely concrete about how you do that, right?
So if you do that, you end up with - so plugging in the quantities that you know, that becomes this, let's see. Right.

And so that we're completely concrete about what the M step is doing: in the M step, that was, I guess, Q_i of z_i, and the sum over j is the sum over all the possible values of z_i, and then this thing here is my Gaussian density. Sorry, guys - this first term here, right, is my p(x_i | z_i), and that's p(z_i). Okay?

And so, to maximize this - say you want to maximize this with respect to all of your parameters phi, mu, and sigma. To maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that this becomes your update to mu_j. Okay? The exact equation is unimportant; all of these equations are written down in the lecture notes. I'm writing these down just to be completely concrete about what the M step means. So write down that formula, plug in the densities you know, take the derivative, set it to zero, solve for mu_j, and in the same way you set the derivatives equal to zero and solve for the updates of your other parameters phi and sigma as well. Okay?

Well, let me just point out one little tricky bit of this that you haven't seen before, which most of you probably already know, but I'll just mention it: since phi here is a multinomial distribution, when you take this formula and you maximize it with respect to phi, you actually have an additional constraint, right? That the sum over j of phi_j must be equal to one.
All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize this with respect to the parameters phi_j, you need to respect the constraint that the sum over j of phi_j must be equal to one. And, well, you already know how to do constrained maximization, right? So I'll throw out the method of Lagrange multipliers; we saw the generalized Lagrangian when we talked about support vector machines. And so to actually perform the maximization in terms of phi_j you construct the Lagrangian, which is - all right, that's the expression from above, which I'll abbreviate with the dot dot dot, plus beta times that constraint, where beta is sort of the Lagrange multiplier and this is your optimization objective. And to actually solve for the parameters phi_j you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math, you get the appropriate values to update the phi_j's to, which I won't do here; the full derivations are in the lecture notes.

Okay. And so if you actually perform all these computations, you can also verify that - so I just wrote down a bunch of formulas for the EM algorithm. At the beginning of the last lecture I said, for the mixture of Gaussians model, here's the formula for computing the w_ij's and here's a formula for computing the mu's and so on, and this derivation is where all of those formulas actually come from. Okay? Questions about this? Yeah?

Student:[Inaudible]

Instructor (Andrew Ng):Oh, I see. So it turns out that, yes, there's also a constraint that the phi_j's must be greater than or equal to zero. It turns out that, if you wanted to, you could actually write down the generalized Lagrangian incorporating all of these constraints as well and solve subject to those constraints. But it turns out that in this particular derivation - actually, it turns out that very often, when we find maximum likelihood estimates of multinomial distribution probabilities, if you ignore these constraints and just maximize the formula, luckily you end up with values that actually are greater than or equal to zero. And if, even ignoring those constraints, you end up with parameters that are greater than or equal to zero, that shows that this must be the correct solution, because adding the constraint won't change anything. So it turns out that if you ignore this constraint and just do what I wrote down, you actually get the right answer. Okay? Great.
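(To make the E and M steps just derived concrete, here is a minimal sketch of one EM iteration for the mixture of Gaussians; the array names are mine, and the updates follow the closed-form formulas discussed above, so treat it as an illustration rather than a reference implementation.)

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_mog(X, phi, mu, sigma):
    """One EM iteration for a mixture of Gaussians.

    X: (m, n) data; phi: (k,) mixing weights; mu: list of k mean vectors;
    sigma: list of k covariance matrices.
    """
    m, _ = X.shape
    k = len(phi)

    # E step: W[i, j] = Q_i(z_i = j) = p(z_i = j | x_i; phi, mu, sigma), via Bayes' rule.
    W = np.column_stack([
        phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
        for j in range(k)
    ])
    W /= W.sum(axis=1, keepdims=True)

    # M step: closed-form updates obtained by setting derivatives to zero.
    phi_new = W.mean(axis=0)                                  # fraction of (soft) assignments
    mu_new = [W[:, j] @ X / W[:, j].sum() for j in range(k)]  # weighted means
    sigma_new = []
    for j in range(k):
        D = X - mu_new[j]
        sigma_new.append((W[:, j, None] * D).T @ D / W[:, j].sum())  # weighted covariances
    return phi_new, mu_new, sigma_new
```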
So let me just very quickly talk about one more example of a mixture model. The perfect example for this is: imagine you want to do text clustering, right? Someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com; that's one application of text clustering, where you might want to look at all of the news stories about today, all the news stories written by everyone, by all the online news websites, about whatever happened yesterday, and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay?

So how do you apply the EM algorithm to text clustering? I want to do this to illustrate an example in which you run the EM algorithm on discrete-valued inputs, where the training examples x_i are discrete values. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes, I talked about two event models: one was the multivariate Bernoulli event model, and one was the multinomial event model. Today I'm gonna use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. In this setting we're given a training set x_1 through x_m.
So we're given m text documents, where each x_i is in {0, 1}^n. So each of our training examples is an n-dimensional bit vector, right? This was the representation where x_ij indicates whether word j appears in document i, right?

And let's say that we're going to model z_i, our latent random variable, meaning our hidden random variable; z_i will take on two values, zero and one, so this means I'm just gonna find two clusters, and you can generalize this to however many clusters you want.

So in the mixture of naive Bayes model we assume that z_i is distributed Bernoulli with some value of phi, so there's some probability of each document coming from cluster one or from cluster zero. And we assume that, for the probability of x_i given z_i, right, we'll make the same naive Bayes assumption as we did before. Okay?

And more specifically - well, excuse me, right. Okay. And so the probability of x_ij being one given z_i equals, say, zero will be given by a parameter phi subscript j given z = 0. So if you take this chalkboard and you take all instances of the letter z and replace them with y, then you end up with exactly the same equations as I wrote down for naive Bayes a long time ago. Okay?
And I'm not actually going to work out the mechanics of deriving the EM algorithm, but it turns out that if you take this joint distribution over x and z, and you work out what the equations are for the EM algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute, you know, these weights w_i, which are going to be equal to your posterior distribution of z_i being equal to one conditioned on x_i, parameterized by your phi's and given your parameters, and then there's the M step. Okay? And that's the equation you get in the M step.

I mean, again, the equations themselves aren't too important; they're just meant to convey - I'll give you a second to finish writing, I guess. And when you're done writing, take a look at these equations and see if they make intuitive sense to you, why these three equations sort of sound like they might be the right thing to do. Yeah?

Student:[Inaudible]

Instructor (Andrew Ng):Say that again. Y - oh, yes, thank you. Right. Sorry, everywhere I wrote y I meant z. Yeah?

Student:[Inaudible] in the first place?

Instructor (Andrew Ng):No. So - what is it? Normally you initialize the phi's to be something else, say randomly. Just like in naive Bayes, we saw zero probabilities as a bad thing, so for the same reason you try to avoid zero probabilities, yeah. Okay?

And so the intuition behind these equations is: in the E step, the w_i's are your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuition behind the EM algorithm that we talked about in a previous lecture. So in the E step we're going to compute these weights that tell us: do I think this document came from cluster one or cluster zero?
And then in the M step I'm gonna say: this numerator is the sum over all the elements of my training set of - so, informally, right, w_i is one if I think the document came from cluster one, and so this will essentially sum up all the times I saw word j in documents that I think are in cluster one. And these are sort of weighted by the actual probability that I think it came from cluster one. And then I'll divide by - again, if all of these were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the w_i's were either ones or zeros, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word j. Okay?

But in the EM algorithm you don't make a hard assignment decision about whether this is in cluster one or in cluster zero; you instead represent your uncertainty about cluster membership with the parameters w_i. Okay?

It actually turns out that when we implement this particular model, by the nature of this computation, all the values of the w_i's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeros. This is a property of naive Bayes: if you actually compute this probability over all those documents, you find that w_i is either 0.0001 or 0.999; it'll be amazingly close to either zero or one. And so the M step is pretty much guessing whether each document is in cluster one or cluster zero and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay? Cool.

And if some of these equations don't look that familiar to you anymore, sort of go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay.
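(Analogously, here is a minimal sketch of one EM iteration for this two-cluster mixture of naive Bayes with the multivariate Bernoulli event model; the variable names are mine, and I've left out the smoothing you'd add in practice to avoid the zero-probability issue mentioned above.)

```python
import numpy as np

def em_step_mixture_nb(X, phi, phi_j_given_z1, phi_j_given_z0):
    """One EM iteration for a two-cluster mixture of naive Bayes.

    X: (m, n) binary matrix, X[i, j] = 1 if word j appears in document i.
    phi: scalar p(z = 1); phi_j_given_z1, phi_j_given_z0: (n,) vectors of
    p(x_j = 1 | z = 1) and p(x_j = 1 | z = 0).
    """
    # E step: w_i = p(z_i = 1 | x_i; parameters), via Bayes' rule under the
    # naive Bayes assumption p(x | z) = prod_j p(x_j | z).
    log_p1 = np.log(phi) + X @ np.log(phi_j_given_z1) + (1 - X) @ np.log(1 - phi_j_given_z1)
    log_p0 = np.log(1 - phi) + X @ np.log(phi_j_given_z0) + (1 - X) @ np.log(1 - phi_j_given_z0)
    w = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    # M step: weighted versions of the usual naive Bayes maximum likelihood estimates.
    phi = w.mean()                                   # fraction of documents believed to be in cluster 1
    phi_j_given_z1 = (w @ X) / w.sum()               # weighted word frequencies within cluster 1
    phi_j_given_z0 = ((1 - w) @ X) / (1 - w).sum()   # and within cluster 0
    return phi, phi_j_given_z1, phi_j_given_z0, w
```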
Of course, the way I got these equations was by working through the machinery of the EM algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and then in the M step setting the derivatives equal to zero and solving; that's how you get the M step and the E step.

So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is sort of two reasons. One is that factor analysis is kind of a useful model. It's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. The other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular, factor analysis will be an example in which we do EM where the latent random variable, the hidden random variable z, is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you saw before. It turns out the full derivation of EM for factor analysis is extremely long and complicated, so I won't inflict that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are just a few steps in the factor analysis derivation that I'll explicitly illustrate.

So let's actually [inaudible] the model, and it's really a contrast to the mixture of Gaussians model, all right? For the mixture of Gaussians model, which was our first model, I actually motivated it by drawing a data set like this, right? You have a data set that looks like this, right? So this was a problem where n is two, the data is two-dimensional, and you have, I don't know, maybe 50 or 100 training examples, whatever, right?
And I said maybe you're given an unlabeled training set like this, and maybe you want to model this as a mixture of two Gaussians. Okay? And so mixture of Gaussians models tend to be applicable where m is larger, and often much larger, than n; where the number of training examples you have is at least as large as, and usually much larger than, the dimension of the data.

What I want to do is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples you have. Okay? So how do you model such very high-dimensional data? You will see this sometimes, right? If you run a plant or something, you run a factory, maybe you have a thousand measurements all through your plant, but you only have five - you only have 20 days of data. So you can have 1,000-dimensional data but only 20 examples of it.

So given data that has this property, where we're given a training set of m examples, what can you do to try to model the density of x? One thing you can do is try to model it just as a single Gaussian, right? So forget mixtures of Gaussians; you try to model this as a single Gaussian and say x is distributed Gaussian with mean mu and covariance sigma, where sigma is going to be an n-by-n matrix. And if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right? So that makes sense. And the maximum likelihood estimate of the covariance matrix sigma will be this, all right?
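(Written out, those maximum likelihood estimates are mu = (1/m) sum_i x_i and sigma = (1/m) sum_i (x_i - mu)(x_i - mu)^T. Here is a quick numerical sketch of them, with made-up dimensions echoing the plant example, that also reports the rank of the resulting sigma, which is about to matter.)

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 1000                      # many more dimensions than examples (illustrative numbers)
X = rng.normal(size=(m, n))          # stand-in for the m training examples

mu = X.mean(axis=0)                  # MLE of the mean: the empirical mean
D = X - mu
Sigma = D.T @ D / m                  # MLE of the covariance: (1/m) sum (x - mu)(x - mu)^T

print(Sigma.shape)                   # (1000, 1000)
print(np.linalg.matrix_rank(Sigma))  # no larger than m - 1 = 19, far below n = 1000
```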
But it turns out that in this regime, where the data is much higher dimensional - excuse me, where the data's dimension is much larger than the number of training examples you have - if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or that it has a zero eigenvalue; I hope one of those terms makes sense. Another way of saying it is that the matrix sigma will be non-invertible.

And just in pictures, one concrete example is if n equals m equals two: you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions; this is x_1 and this is x_2, and this is my unlabeled data. If you fit a Gaussian to this data set, you find that - well, you remember I used to draw contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood estimate Gaussian for this corresponds to a Gaussian whose contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will be infinitely thin, right, and stretch infinitely long in that direction.

And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't actually get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and the determinant of sigma is zero. So you'd have one over zero times e to the something involving the inverse of a non-invertible matrix; so, not a good model.

So let's see if we can do better, right? Given this sort of data, how do you model p(x)? Well, one thing you could do is constrain sigma to be diagonal. So you have a covariance matrix that is - okay? In other words, you constrain sigma to be this matrix, all right, with zeros on the off-diagonals.
I hope this makes sense. These zeros I've written down here denote that everything off the diagonal of this matrix is zero.

So the maximum likelihood estimates of the parameters will be pretty much what you'd expect, right? And in pictures, what this means is that you're fitting the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal, and here's another example, and here's a third example. Whereas before I had often drawn examples of Gaussians where the covariance matrix has non-zero off-diagonal entries. Okay?

And, I don't know, you could do this to model p(x), but this isn't very nice, because you've now thrown away all the correlations between the different variables. So the axes are x_1 and x_2, right? You're failing to capture any of the correlations or the relationships between any pair of variables in your data. Yeah?

Student:Is it - could you say again, what does that do for the diagonal?

Instructor (Andrew Ng):Say again?

Student:The covariance matrix being diagonal - what does that mean? I didn't quite understand what the examples mean.

Instructor (Andrew Ng):Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see: suppose the covariance matrix is diagonal; then you can ask what is p(x), parameterized by mu and sigma, right, if sigma is diagonal. And this will be some Gaussian bump, right? So - oh, boy, my drawing's really bad, but in two-D the density of a Gaussian is like this bump-shaped thing, right? So this is the density of the Gaussian - wow, and this is a really bad drawing - with the axes x_1 and x_2, and the height of this is p(x). And so those figures over there are the contours of the density of the Gaussian; those are the contours of this shape.
Student:No, I don't mean the contours. What's special about these types? What makes them different from having a general covariance matrix?

Instructor (Andrew Ng):Oh, I see. Oh, okay, sorry. They're axis-aligned. So, let's see: I'm not drawing a contour like this, right, because the main axes of these are not aligned with the x_1 and x_2 axes; that would correspond to a Gaussian where the off-diagonal entries are non-zero, right? Cool. Okay.

So you could do this, and this sort of works. It turns out that with only two training examples you can learn a non-singular covariance matrix, but you've thrown away all of the correlations in the data, so this is not a great model.

It turns out you can do something - well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is you can constrain sigma to equal sigma squared times the identity matrix. In other words, you can constrain it to be a diagonal matrix where, moreover, all the diagonal entries must be the same, and the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is a sort of even harsher constraint to place on your model.

So either of these versions, diagonal sigma or sigma being this sort of constant-value diagonal, are already strong assumptions, all right? So if you have enough data, maybe you'd like to model at least a little bit of the correlation between your different variables. The factor analysis model is one way to attempt to do that. So here's the idea.

This is how the factor analysis model models your data. We're going to assume that there is a latent random variable, okay, which just means a hidden random variable z. So z is distributed Gaussian with mean zero and covariance the identity, where z will be a d-dimensional vector now, and d will be chosen so that it is lower than the dimension of your x's.
Okay? And now I'm going to assume that x is given by - well, let me write this. Each x_i is distributed - actually, sorry. We're going to assume that, conditioned on the value of z, x is given by another Gaussian, with mean given by mu plus lambda z and covariance given by the matrix psi. So, to state that second line in an equivalent form: equivalently, I'm going to model x as mu plus lambda z plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance psi.

And so the parameters of this model are going to be a vector mu, which is n-dimensional; a matrix lambda, which is n by d; and a covariance matrix psi, which is n by n. And I'm going to impose an additional constraint on psi: I'm going to impose the constraint that psi is diagonal. Okay? So that's the formal definition - let me actually give a couple of examples to make this more concrete.

So let's do a kind of example: suppose z is one-dimensional and x is two-dimensional, and let's see a sort of specific instance of the factor analysis model and what this gives us in terms of a model of the distribution over x, of p(x), all right? So let's see. For this model let me assume that lambda is (2, 1), and psi, which has to be a diagonal matrix, remember, is this. Okay? So z is one-dimensional, so let me just draw a typical sample for z, all right?
Let me actually give a couple of examples to make this more concrete. So, as a first example, suppose Z is one-dimensional and X is two-dimensional, and let's see a specific instance of the factor analysis model and what it gives us in terms of a model over P of X, all right? So, for this model, let me assume that lambda is (2, 1) and that psi, which has to be a diagonal matrix, remember, is this. Okay? So Z is one-dimensional, so let me just draw a typical sample for Z, all right?

So if I draw the ZI's from a Gaussian, that's a typical sample of what Z might look like, and I'm gonna call these Z1, Z2, Z3 and so on. If this really were a typical sample, the order of the Z's would be jumbled up, but I'm ordering them like this just to make the example easier. So, yes, that's a typical sample of the random variable Z from a Gaussian distribution with mean zero and variance one.

And for this example, let me just set mu equal to zero - just so that it's easier to talk about. So, lambda times Z, right? We'll take each of these numbers and multiply them by lambda, and you find that all of the values of lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess - so if this was Z7, then this one here would be lambda times Z7, and that's now a point in R2, because lambda is a two-by-one matrix. And so what I've drawn here is a typical sample for lambda times Z.

And the final step is what a typical sample for X looks like. Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by psi, right? And so the last step, to draw a typical sample for the random variable X, is that I take these points - which are really the same as mu plus lambda Z, because mu is zero in this example - and around each point I place an axis-aligned ellipse. Or, in other words, I create a Gaussian distribution centered on that point, and what I've drawn here corresponds to one of the contours of my density for epsilon, right? And so you can imagine placing a little Gaussian bump here.
And so I'll draw an example from this little Gaussian, and let's say I get that point there; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the - whatever you call it - the orange points I drew will comprise a typical sample of what the distribution of X looks like under this model. Okay? Yeah?

Student: Would you add, like, the mean?

Instructor (Andrew Ng): Oh, say that again.

Student: Do you add the mean into that?

Instructor (Andrew Ng): Oh, yes, you do. And in this example I set mu to (zero, zero) just to make it easier. If mu were something else, you'd take the whole picture and sort of shift it over by whatever the value of mu is. Yeah?

Student: [Inaudible] horizontal line right there, which was Z - what does the Y-axis correspond to?

Instructor (Andrew Ng): Oh, so Z is one-dimensional, so here I'm just plotting a typical sample for Z, and this is zero. So this is just the Z axis, right - Z is one-dimensional data, so this line here is a plot of a typical sample of values for Z. Okay? Yeah?

Student: You have two axes there, right, and the points are the samples - so are you sort of projecting them onto that line?

Instructor (Andrew Ng): Let's not talk about projections yet, but, yeah, right. So these beige points - that's like X1, and that's X2, and so on, right? So the beige points are what I see, and in reality all you ever get to see are the X's. But just like in the mixture of Gaussians model, where I told a story in which the data came from two Gaussians and a hidden random variable Z led to the generation of the X's, in the same way I'm telling a story here. All the algorithm actually sees are the orange points, but we're gonna tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?
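Here is a minimal NumPy sketch of exactly this generative story, using the example's values lambda = (2, 1) and mu = 0; the diagonal entries of psi are values I've made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 7                                  # number of examples, like z_1, ..., z_7
d, n = 1, 2                            # z is one-dimensional, x is two-dimensional
mu = np.zeros(n)                       # mu = 0, as in the example
Lam = np.array([[2.0], [1.0]])         # lambda = (2, 1), an n-by-d matrix
Psi = np.diag([0.5, 0.5])              # diagonal psi; these entries are made up

Z = rng.standard_normal((m, d))                           # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)   # epsilon ~ N(0, psi)
X = mu + Z @ Lam.T + eps                                  # x = mu + lambda z + epsilon

# Z @ Lam.T puts every point on the line through the origin in the direction (2, 1);
# adding eps scatters each point inside a little axis-aligned Gaussian bump around it,
# which is what the orange points in the sketch show.
```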
So one way to see the intuition of this model - this is informal, not formal - is that you can think of the factor analysis model as modeling the data as coming from a lower-dimensional subspace, more or less: the data X lies approximately on a one-dimensional line, plus a little bit of random noise, so that X isn't exactly on that line. That's one informal way of thinking about factor analysis.

We're not doing great on time. Well, let's do this - let me just do one more quick example, in which Z is in R2 and X is in R3, right? So in this example your data Z now lies in 2-D, so let me draw it on a sheet of paper. Let's say the axes of my sheet of paper are the Z1 and Z2 axes, and here is a typical sample of points Z, right? So this is a typical sample for Z plotted on the Z1 and Z2 axes, and I guess the origin would be here, so it's centered around zero.

And then we'll take those points and map them to mu plus lambda Z, and what that means is: if you imagine that the free space of this classroom is R3, we'll take this sample of Z's and map it to positions in free space. So we take this sheet of paper and move it to some position and some orientation in 3-D space.

And the last step is that X equals mu plus lambda Z plus epsilon, and so you take this set of points, which lies on some plane in 3-D space, and add a little bit of noise to each one, where the noise comes from Gaussians whose contours are axis-aligned. Okay? So you end up with a data set that's sort of like a fat pancake - a plane with a little bit of fuzz coming off it.
So that's the model - now let's actually talk about how to fit the parameters of the model. Okay?

In order to describe how to fit the model, I first need to re-write Gaussians in a slightly different way. So, in particular, let's say I have a vector X, and I'm gonna use this notation to denote partitioned vectors: X1, X2, where if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector and X is an r-plus-s-dimensional vector. Okay? So I'm using this notation to denote taking a vector and partitioning it into two halves: the first r elements followed by the last s elements.

So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector - break mu up into two pieces, mu1 and mu2 - and the covariance matrix sigma is now a partitioned matrix. Okay? So what this means is that you take the covariance matrix sigma and break it up into four blocks, right? And the dimensions work out so that there are r elements here and s elements here, and r elements here; so, for example, sigma 1,2 will be an r by s matrix - r elements tall and s elements wide.

So this Gaussian is really a joint distribution over a lot of variables, right? X is a vector, so this is a joint distribution over X1 through X of r plus s. We can then ask: what are the marginal and conditional distributions of this Gaussian? So, for example, with my Gaussian I know what P of X is, but can I compute the marginal distribution of X1? And P of X1 is just equal to, of course, the integral over X2 of P of X1 comma X2, dX2.
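Written out, the partition notation and the marginal just described are (my LaTeX reconstruction of the board notation):

```latex
\begin{gather*}
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\qquad x_1 \in \mathbb{R}^{r},\ x_2 \in \mathbb{R}^{s},\ \Sigma_{12} \in \mathbb{R}^{r\times s},\\
p(x_1) = \int p(x_1, x_2)\, dx_2 .
\end{gather*}
```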
And if you actually perform that computation, you find that P of X1 is Gaussian with mean given by mu1 and covariance sigma 1,1. All right, so this is no surprise: the marginal distribution of a Gaussian is itself a Gaussian, and you just pick out the relevant sub-block of the covariance matrix and the relevant sub-vector of the mean vector mu.

You can also compute conditionals: what is P of X1 given a specific value for X2, right? And the way you compute that is the usual way: P of X1 comma X2 divided by P of X2, right? And you know what both of these formulas are: the numerator is just the usual Gaussian - your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma - and the denominator, by the marginalization operation I just talked about, is that.

So if you plug in the formulas for these two Gaussians and simplify - and the simplification step is actually fairly non-trivial; if you haven't seen it before it will actually be somewhat difficult to do - you find that, conditioned on the value of X2, the distribution of X1 given X2 is itself Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where - well, I'm not gonna show the derivation - mu of 1 given 2 is given by this, and I think sigma of 1 given 2 is given by that. Okay?
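Those marginal and conditional formulas can be wrapped up in a small NumPy sketch (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def marginal_and_conditional(mu, Sigma, r, x2):
    """For x ~ N(mu, Sigma) partitioned into the first r coordinates (x1)
    and the remaining s coordinates (x2), return
      - the marginal of x1:              N(mu1, Sigma11)
      - the conditional of x1 given x2:  N(mu_{1|2}, Sigma_{1|2})
    using the formulas quoted in lecture.
    """
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    S22_inv = np.linalg.inv(S22)
    cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)   # mu_{1|2}
    cond_cov = S11 - S12 @ S22_inv @ S21           # Sigma_{1|2}
    return (mu1, S11), (cond_mean, cond_cov)
```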
So these are just useful formulas to know for finding the conditional distributions of a Gaussian and the marginal distributions of a Gaussian. I won't actually show the derivation.

Student: Could you repeat the [inaudible]?

Instructor (Andrew Ng): Sure. So the one on the left, mu of 1 given 2, equals mu1 plus sigma 1,2 times sigma 2,2 inverse times X2 minus mu2; and sigma of 1 given 2 equals sigma 1,1 minus sigma 1,2 times sigma 2,2 inverse times sigma 2,1. Okay? These are also in the lecture notes.

Shoot, I'm not as far along as I was hoping to be on time. Well, actually, it's okay. So it turns out - I thought I might skip this in the interest of time, but I kind of need it - so let's go back and use these in the factor analysis model, and figure out exactly what joint distribution the factor analysis model assumes over Z and X. Okay?

So, under the factor analysis model, the random variables Z and X have some joint Gaussian distribution with some mean vector, which I'll write as mu ZX, and some covariance matrix sigma. So let's go back and figure out what mu ZX is and what sigma is, and I'll do this so that we get a little bit more practice with partitioned vectors and partitioned matrices.

So, just to remind you: you have Z distributed Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance psi - I'm just writing out the same equations again. So let's first figure out what the vector mu ZX is. Well, the expected value of Z is zero - and, again, as usual, I'll often drop the square brackets here.
And the expected value of X is - well, the expected value of mu plus lambda Z plus epsilon. The last two terms have zero expectation, and so the expected value of X is just mu. And so the vector mu ZX - the mean parameter of my joint Gaussian - is going to be the expected value of this partitioned vector, partitioned into Z and X, and so that is just zero followed by mu. Okay? And that's a d-dimensional zero followed by an n-dimensional mu.

Next, let's work out what the covariance matrix sigma is. So the covariance matrix sigma - if you work through the definition - is going to be a partitioned matrix. It will comprise four blocks like that, and the upper-left block, which I write as sigma 1,1, is just the covariance matrix of Z, which we know is the identity.

I was gonna show you briefly how to derive some of the other blocks, right? So sigma 1,2 - that's the upper - oh, actually, excuse me: sigma 2,1, which is the lower-left block, is E of X minus E of X, times Z minus E of Z, transpose. So X is equal to mu plus lambda Z plus epsilon, and then minus E of X is minus mu, and then times Z transpose - because the expected value of Z is zero, the Z minus E of Z term is just Z. And so if you expand this out, the plus mu and minus mu cancel, and you're left with the expected value of lambda -
- lambda Z Z transpose, plus the expected value of epsilon Z transpose, and that is just equal to lambda times the identity matrix - in other words, lambda. Okay? Does that make sense? Because the second term is equal to zero: epsilon and Z are independent and both have zero expectation, so that term vanishes.

Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? I won't do this one out, but it simplifies to lambda lambda transpose plus psi. Okay?

So, putting all this together, this tells us that the joint distribution of this vector Z, X is going to be Gaussian with mean vector given by that, which we worked out previously - this is the mu ZX that we worked out previously - and covariance matrix given by that. Okay?
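Assembling those blocks, the joint distribution the factor analysis model implies is, in standard notation (my reconstruction; the upper-right block is just the transpose of the lower-left block derived above):

```latex
\begin{bmatrix} z \\ x \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \vec{0} \\ \mu \end{bmatrix},\;
\begin{bmatrix} I & \Lambda^{\top} \\ \Lambda & \Lambda\Lambda^{\top} + \Psi \end{bmatrix}
\right),
\qquad\text{and hence, marginally,}\qquad
x \sim \mathcal{N}\!\left(\mu,\ \Lambda\Lambda^{\top} + \Psi\right).
```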
So, in principle - let's see - the parameters of our model are mu, lambda, and psi, and in order to find the parameters of this model we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters. And so, in principle, one thing you could do is actually write down what P of XI is. The distribution of X - if, again, you marginalize this joint Gaussian - is the lower half of this partitioned vector, and so it is going to have mean mu and covariance given by lambda lambda transpose plus psi, right? So that's the distribution that we're using to model P of X.

And so, in principle, one thing you could do is actually write down the log likelihood of your parameters, which is just the sum over I of log P of XI, where P of XI is given by this Gaussian density, right? And I'm using theta as a shorthand to denote all of my parameters. And so, since you know what the density of a Gaussian is, you can say that P of XI is this Gaussian with mean mu and covariance given by lambda lambda transpose plus psi. So you can write down the log likelihood of your parameters as follows, and you can try to take derivatives of the log likelihood with respect to your parameters and maximize it, all right?

It turns out that if you do that, you end up with an optimization problem for which you will not be able to find analytic, closed-form solutions. So if you say my model of X is this and do maximum likelihood parameter estimation, you won't be able to find the maximum likelihood estimate of the parameters in closed form.

So, in order to fit the parameters of this model, what we'll actually do is use the EM Algorithm, in which, in the E step, we'll compute that. And this formula looks the same as before, except for one difference: Z is now a continuous random variable, and so in the E step we actually have to find the density QI of ZI - the E step requires that we find the posterior distribution of the random variable ZI. And then the M step will perform the following maximization where, again, because Z is now continuous, we need to integrate over Z. Okay? So in the M step, because ZI is continuous, we now have an integral over Z rather than a sum.
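Since Z and X are jointly Gaussian under this model, the E-step posterior P of ZI given XI follows directly from the conditional-Gaussian formulas above. Here is a hedged NumPy sketch of that posterior computation - the function and variable names are my own, not code given in the lecture, and the M-step that consumes these quantities is derived in the next lecture:

```python
import numpy as np

def e_step_posterior(X, mu, Lam, Psi):
    """Posterior p(z | x) for the factor analysis model, one row of X at a time.

    Applying the conditional-Gaussian formulas to the joint of (z, x) gives
      mu_{z|x}    = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu)
      Sigma_{z|x} = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda .
    """
    n, d = Lam.shape
    S_xx = Lam @ Lam.T + Psi                       # marginal covariance of x
    S_xx_inv = np.linalg.inv(S_xx)
    post_means = (X - mu) @ S_xx_inv @ Lam         # m-by-d matrix of posterior means
    post_cov = np.eye(d) - Lam.T @ S_xx_inv @ Lam  # shared posterior covariance
    return post_means, post_cov
```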
Okay. I was hoping to go a little bit further in deriving these things, but I don't have time today, so we'll wrap that up in the next lecture. But before I close, let's check whether there are questions about the whole factor analysis model. Okay. So we'll come back in the next lecture; I'll wrap up this model, and I want to go a little bit deeper into the E and M steps, since there are some tricky parts for the factor analysis model specifically. Okay. I'll see you in a couple of days.