This presentation is delivered by the Stanford Center for Professional Development.

So welcome back, and what I want to do today is continue our discussions of the EM Algorithm, and in particular, I want to talk about the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model, the mixture of naive Bayes model, and then the large part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about sort of useful properties of Gaussian distributions.

So just to recap where we are. In the previous lecture, I started to talk about unsupervised learning, which was machine-learning problems where you're given an unlabeled training set comprising m examples, right? And so the fact that there are no labels, that's what makes this an unsupervised learning problem. So one problem that I talked about last time was, what if you're given a data set that looks like this and you want to model the density P of X from which you think the data had been drawn, and so with a data set like this, maybe you think it was a mixture of two Gaussians, and we started to talk about an algorithm for fitting a mixture of Gaussians model, all right?
And so we said that we would model the density of X, P of X, as the sum over Z of P of X given Z times P of Z, where this latent random variable, meaning this hidden random variable Z, indicates which of the two Gaussian distributions each of your data points came from. And so we have, you know, Z was multinomial with parameter phi, and X conditioned on coming from the J-th Gaussian was given by a Gaussian with mean mu J and covariance sigma J, all right? So, like I said at the beginning of the previous lecture, I just talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model, for the parameters phi, mu and sigma, but then in the second half of the previous lecture I talked about what's called the EM Algorithm, in which our goal is maximum likelihood estimation of the parameters. So we want to maximize, in terms of theta, you know, the sort of usual log likelihood, well, parameterized by theta. And because we have a latent random variable Z, this is really maximizing, in terms of theta, the sum over I of the log of the sum over ZI of P of XI, ZI parameterized by theta. Okay?
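Written out (this is the standard form of the objective being described, with the log over the latent sum made explicit; it is not a verbatim copy of the board):

```latex
\ell(\theta) \;=\; \sum_{i=1}^{m} \log p\big(x^{(i)};\theta\big)
            \;=\; \sum_{i=1}^{m} \log \sum_{z^{(i)}} p\big(x^{(i)}, z^{(i)};\theta\big)
```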
So using Jensen's inequality, last time we worked out the EM Algorithm, in which in the E step we would choose these probability distributions QI to be the posterior on Z given X, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the ones we worked out last time, and the cartoon that I drew was that you have this log likelihood function L of theta that's often hard to maximize, and what the E step does is choose these probability distributions QI. And in the cartoon, what that corresponded to was finding a lower bound for the log likelihood. And then the horizontal axis is theta, and in the M step you maximize the lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And so this little curve here, right? This lower bound function here, that's really the right-hand side of that arg max. Okay? So this whole thing in the arg max, if you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to this new maximum of the lower bound.
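For reference, the two steps being pointed at on the board take the following standard form (this is the formulation from the previous lecture's notes, not a transcription of the board itself):

```latex
\text{E-step:}\quad Q_i\big(z^{(i)}\big) := p\big(z^{(i)} \mid x^{(i)};\theta\big)

\text{M-step:}\quad \theta := \arg\max_{\theta}\; \sum_{i} \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
      \log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}
```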
So it turns out that in the EM Algorithm - so why do you bother with the EM Algorithm? It turns out that very often, and this will be true for all the examples we see today, it turns out that very often in the EM Algorithm, maximizing in the M step - so performing the maximization in the M step - will be tractable and can often be done analytically in closed form. Whereas if you were trying to maximize this objective directly - I mean, the way to do maximum likelihood here would be to take this thing on the right, set its derivatives to zero and try to solve, and you'll find that you're unable to obtain a solution to this maximization in closed form. Okay? And to give you an example of that: you remember our discussion on exponential family models, right? It turns out that if X and Z are jointly, I guess, in the exponential family - so if P of X, Z parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution - then it turns out that the M step here will be tractable, and the E step will also be tractable, and so you can do each of these steps very easily. Whereas performing - trying to perform this original maximum likelihood estimation problem on this one, right? Will be computationally very difficult. You're going to set the derivatives to zero and try to solve for that.
Analytically, you won't be able to find an analytic solution to this. Okay? So what I want to do in a second is actually take this view of the EM Algorithm and apply it to the mixture of Gaussians model. I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM Algorithm. It turns out there's one other way of thinking about the EM Algorithm, which is the following: I can define an optimization objective J of theta, Q, and define it to be this. This is just the thing in the arg max in the M step. Okay? And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. So in other words, we proved last time that for any value of theta and Q, the log likelihood upper bounds J of theta and Q. And so just to relate this back to, sort of, yet more things that you already know - you can also think of coordinate ascent, right? Remember our discussion a while back on the coordinate ascent optimization algorithm. So we can show, and I won't actually show this here, so just take my word for it and work through it at home if you want, that EM is just coordinate ascent on the function J.
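Concretely, the objective being defined here is the M-step quantity viewed as a function of both arguments (standard form, using the same notation as above):

```latex
J(Q,\theta) \;=\; \sum_{i} \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
      \log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)},
\qquad \ell(\theta) \;\ge\; J(Q,\theta) \ \text{ for all } Q,\theta.
```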
So in the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay? So this is another view of the EM Algorithm that shows why it has to converge, for example - you can view it in the sense of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to the mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM Algorithm as a whole? Okay, cool. So let's go ahead and work on the mixture of Gaussians EM, all right? MOG - that's my abbreviation for mixture of Gaussians. So in the E step we compute those Q distributions, right? In particular, I want to work out - so Q is the probability distribution over the latent random variable Z, and so in the E step I'm going to figure out how to compute Q of ZI equals J. And you can think of this as my writing P of ZI equals J, right? Under the Q distribution. That's what this notation means. And so the EM Algorithm tells us that, let's see, Q of J is the posterior probability of Z being the value J, given XI and all your parameters. And so, well, the way you compute this is by Bayes' rule, right?
So that is going to be equal to P of XI given ZI equals J, times P of ZI equals J, divided by the sum over all the values L of P of XI given ZI equals L times P of ZI equals L - right? That's all by Bayes' rule. And so this first term you know, because XI given ZI equals J - this was a Gaussian with mean mu J and covariance sigma J, and so to compute this first term you plug in the formula for the Gaussian density there, with parameters mu J and sigma J, and this other term you'd know because Z was multinomial, right? With parameters given by phi, and so the probability of ZI being J is just phi J, and so you can substitute these terms in. Similarly, you do the same thing for the denominator, and that's how you work out what Q is. Okay? And so in the previous lecture, this value - the probability that ZI equals J under the Q distribution - that was what I denoted as WIJ. So that would be the E step, and then in the M step we maximize this with respect to all of our parameters. This, well, I seem to be writing the same formula down a lot today. All right. And just so we're completely concrete about how you do that, right?
So if you do that you end up with - so plugging in the quantities that you know, that becomes this, let's see. Right. And so that we're completely concrete about what the M step is doing: in the M step that was, I guess, QI of ZI equals J. In the summation, the sum over J is the sum over all the possible values of ZI, and then this thing here is my Gaussian density. Sorry, guys - this first term here, right? Is my P of XI given ZI, and that's P of ZI. Okay? And so to maximize this with respect to - say you want to maximize this with respect to all of your parameters phi, mu and sigma. So to maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that that becomes your update to mu J. Okay? Just so - I want to say the exact equation is unimportant. All of these equations are written down in the lecture notes. I'm writing these down just to be completely concrete about what the M step means.
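Filling in the board formula for readers of the transcript: the update that this derivative calculation yields for mu J is the responsibility-weighted mean, as given in the lecture notes (WIJ here are the E-step weights just defined):

```latex
\mu_j := \frac{\sum_{i=1}^{m} w_{ij}\, x^{(i)}}{\sum_{i=1}^{m} w_{ij}}
```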
And so, write down that formula, plug in the densities you know, take the derivative, set it to zero, solve for mu J, and in the same way you set the derivatives equal to zero and solve for the updates for your other parameters phi and sigma as well. Okay? Well, let me just point out one little tricky bit for this that you haven't seen before - most of you probably already know it, but I'll just mention it - which is that, since phi here is a multinomial distribution, when you take this formula and you maximize it with respect to phi you actually have an additional constraint, right? That the sum over J of phi J must be equal to one. All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize this with respect to the parameters phi J you need to respect the constraint that the sum over J of phi J must be equal to one. And so, well, you already know how to do constrained maximization, right? So I'll throw out the method of Lagrange multipliers - and we generalized the Lagrangian when we talked about support vector machines. And so to actually perform the maximization in terms of phi J you construct the Lagrangian, which is - all right? So that's the equation from above, which we'll denote by the dot dot dot, plus beta times that, where beta is sort of the Lagrange multiplier and this is your optimization objective.
And so to actually solve for the parameters phi J, you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math you get the appropriate value to update the phi J's to, which I won't do here, but all the full derivations are in the lecture notes. I won't do that here. Okay. And so if you actually perform all these computations you can also verify that. So I just wrote down a bunch of formulas for the EM Algorithm. At the beginning of the last lecture I said, for the mixture of Gaussians model, here's the formula for computing the WIJ's and here's a formula for computing the mu's and so on, and this derivation is where all of those formulas actually come from. Okay? Questions about this? Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): Oh, I see. So it turns out that, yes, there's also a constraint that the phi J's must be greater than zero. It turns out that, if you want, you could actually write down the generalized Lagrangian incorporating all of these constraints as well, and you can solve to [inaudible] these constraints. It turns out that in this particular derivation - actually, it turns out that very often, when we find maximum likelihood estimates for multinomial distribution probabilities -
it turns out that if you ignore these constraints and you just maximize the formula, luckily you end up with values that actually are greater than or equal to zero, and so, even ignoring those constraints, you end up with parameters that are greater than or equal to zero, and that shows that this must be the correct solution, because adding that constraint won't change anything. So this constraint - it turns out that if you ignore it and just do what I wrote down, you actually get the right answer. Okay? Great.

So let me just very quickly talk about one more example of a mixture model. And the perfect example for this is: imagine you want to do text clustering, right? So someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com. That's one application of text clustering, where you might want to look at all of the news stories about today - all the news stories written by everyone, written by all the online news websites, about whatever happened yesterday - and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay? So how do you apply the EM Algorithm to text clustering?
I want to do this to illustrate an example in which you run the EM Algorithm on discrete-valued inputs, where the input - where the training examples XI - are discrete values. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes, I talked about two event models. One was the multivariate Bernoulli event model; one was the multinomial event model. Today I'm going to use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. But in this setting we're given a training set X1 through XM. So we're given M text documents, where each XI is in {0, 1} to the N. So each of our training examples is an N-dimensional bit vector, right? This was the representation where XIJ indicates whether word J appears in document I, right? And let's say that we're going to model ZI - our latent random variable, meaning our hidden random variable ZI - as taking on two values, zero or one, so this means I'm just going to find two clusters, and you can generalize to as many clusters as you want.
So in the mixture of naive Bayes model, we assume that ZI is distributed Bernoulli with some value of phi, so there's some probability of each document coming from cluster one or from cluster zero. We assume that the probability of XI given ZI, right? Will make the same naive Bayes assumption as we did before. Okay? And more specifically - well, excuse me, right. Okay. And so the probability of XIJ equals one, given ZI equals, say, zero, will be given by a parameter phi subscript J given Z equals zero. So if you take this chalkboard and you take all instances of the letter Z and replace them with Y, then you end up with exactly the same equations as I wrote down for naive Bayes a long time ago. Okay?
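Putting the model just described into symbols (a standard way of writing it; the board itself isn't captured in the transcript):

```latex
z^{(i)} \sim \mathrm{Bernoulli}(\phi), \qquad
p\big(x^{(i)} \mid z^{(i)}\big) = \prod_{j=1}^{n} p\big(x_j^{(i)} \mid z^{(i)}\big), \qquad
p\big(x_j^{(i)} = 1 \mid z^{(i)} = 0\big) = \phi_{j \mid z=0}.
```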
And I'm not actually going to work out the mechanics of deriving the EM Algorithm, but it turns out that if you take this joint distribution of X and Z, and if you work out what the equations are for the EM Algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute, you know, let's say these parameters - these weights WI - which are going to be equal to your posterior distribution of Z being equal to one, conditioned on XI, parameterized by your phi's - given your parameters - and then in the M step. Okay? And that's the equation you get in the M step. I mean, again, the equations themselves aren't too important; they're just meant to convey the idea. I'll give you a second to finish writing, I guess. And when you're done or finished writing, take a look at these equations and see if they make intuitive sense to you - why these three equations, sort of, sound like they might be the right thing to do. Yeah?

Student: [Inaudible]

Instructor (Andrew Ng): Say that again. Y - oh, yes, thank you. Right. Sorry - just, everywhere I wrote Y, I meant Z. Yeah?

Student: [Inaudible] in the first place?

Instructor (Andrew Ng): No. So - what is it? Normally you initialize the phi's to be something else, say randomly. So just like in naive Bayes, we saw zero probabilities as a bad thing, so for the same reason you try to avoid zero probabilities, yeah. Okay?
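For readers of the transcript: the equations being referred to here were written on the board and aren't captured above; in the standard form this model gives (using the notation introduced earlier), they are roughly:

```latex
\text{E-step:}\quad w^{(i)} := p\big(z^{(i)} = 1 \mid x^{(i)};\, \phi,\ \phi_{j\mid z}\big)

\text{M-step:}\quad
\phi_{j \mid z=1} := \frac{\sum_{i} w^{(i)}\, 1\{x_j^{(i)} = 1\}}{\sum_{i} w^{(i)}}, \qquad
\phi_{j \mid z=0} := \frac{\sum_{i} \big(1 - w^{(i)}\big)\, 1\{x_j^{(i)} = 1\}}{\sum_{i} \big(1 - w^{(i)}\big)}, \qquad
\phi := \frac{1}{m}\sum_{i} w^{(i)}.
```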
And so the intuition behind these equations is this. In the E step, the WI's - you're going to take your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuitions behind the EM Algorithm that we talked about in a previous lecture. So in the E step we're going to compute these weights that tell us, do I think this document came from cluster one or cluster zero. And then in the M step - this numerator is the sum over all the elements of my training set of - so, informally, right? WI is one there if I think the document came from cluster one, and so this will essentially sum up all the times I saw word J in documents that I think are in cluster one. And these are sort of weighted by the actual probability that I think the document came from cluster one, and then I'll divide by - again, if all of these were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the WI's were either ones or zeroes, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word J. Okay? But in the EM Algorithm you don't make a hard assignment decision about whether this is in cluster one or in cluster zero. You instead represent your uncertainty about cluster membership with the parameters WI. Okay?
It actually turns out that when we actually implement this particular model, by the nature of this computation, all the values of the WI's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeroes. This is a property of naive Bayes. If you actually compute this probability for all those documents, you find that WI is either 0.0001 or 0.999. It'll be amazingly close to either zero or one, and so the M step - and so this is pretty much guessing whether each document is in cluster one or cluster zero, and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay? Cool. Are there - and if some of these equations don't look that familiar to you anymore, sort of, go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay. Of course, the way I got these equations was by turning through the machinery of the EM Algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and then in the M step setting the derivatives equal to zero and solving from that - so that's how you get the M step and the E step.
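As a rough illustration of the updates just discussed, here is a minimal numpy sketch of my own (not code from the course) of one EM iteration for the two-cluster mixture of naive Bayes; add-one smoothing is included only to avoid the zero-probability issue mentioned above:

```python
import numpy as np

def em_step(X, phi, phi_j1, phi_j0):
    """One EM iteration for a 2-cluster mixture of Bernoulli naive Bayes.

    X       : (m, n) binary matrix, X[i, j] = 1 if word j appears in document i
    phi     : scalar, P(z = 1)
    phi_j1  : (n,) vector, P(x_j = 1 | z = 1)
    phi_j0  : (n,) vector, P(x_j = 1 | z = 0)
    """
    # E-step: log p(x_i | z) + log p(z) for each document, then normalize.
    log_p1 = X @ np.log(phi_j1) + (1 - X) @ np.log(1 - phi_j1) + np.log(phi)
    log_p0 = X @ np.log(phi_j0) + (1 - X) @ np.log(1 - phi_j0) + np.log(1 - phi)
    w = 1.0 / (1.0 + np.exp(log_p0 - log_p1))      # w[i] = P(z_i = 1 | x_i)

    # M-step: responsibility-weighted word counts, with +1/+2 smoothing.
    phi_j1 = (w @ X + 1) / (w.sum() + 2)
    phi_j0 = ((1 - w) @ X + 1) / ((1 - w).sum() + 2)
    phi = w.mean()
    return phi, phi_j1, phi_j0, w
```

Iterating em_step until the weights stop changing gives the clustering; as noted above, the w values typically end up extremely close to zero or one.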
So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is sort of two reasons. One is that factor analysis is kind of a useful model. It's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. But the other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular, for factor analysis, this will be an example in which we'll do EM where the latent random variable - where the hidden random variable Z - is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you saw before - it turns out the full derivation of EM for factor analysis is sort of extremely long and complicated, and so I won't inflict that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are, sort of, just a few steps in the factor analysis derivation that I'll explicitly illustrate. So let's actually [inaudible] the model, and it's really in contrast to the mixture of Gaussians model, all right? So for the mixture of Gaussians model, which was our first model, we had - well, I actually motivated it by drawing the data set like this, right? That is, you have a data set that looks like this, right?
So this was a problem where n is two-dimensional and you have, I don't know, maybe 50 or 100 training examples, whatever, right? And I said, maybe you're given an unlabeled training set like this; maybe you want to model this as a mixture of two Gaussians. Okay? And so mixture of Gaussians models tend to be applicable where m is larger, and often much larger, than n - where the number of training examples you have is at least as large as, and is usually much larger than, the dimension of the data. What I want to do is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples you have. Okay? So how do you model such very high-dimensional data? You will see this sometimes, right? If you run a plant or something - you run a factory - maybe you have a thousand measurements all through your plant, but you only have five - you only have 20 days of data. So you can have 1,000-dimensional data, but only 20 examples of it. So, given data that has this property - that we've been given a training set of m examples - well, what can you do to try to model the density of X? So one thing you can do is try to model it just as a single Gaussian, right?
So instead of a mixture of Gaussians, this is how you'd try to model it as a single Gaussian: you say X is Gaussian with mean mu and covariance sigma, where sigma is going to be an n by n matrix. And so if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right? So that makes sense. And the maximum likelihood estimate of the covariance matrix sigma will be this, all right? But it turns out that in this regime where the data is much higher dimensional - excuse me, where the data's dimension is much larger than the number of training examples you have - if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or it has zero eigenvalues - I hope one of those terms makes sense. And another way of saying it is that the matrix sigma will be non-invertible. And just in pictures, one concrete example is if n equals m equals two: if you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions - this is X1 and X2. This is my unlabeled data.
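As an illustrative aside (an added sketch, not part of the lecture): the maximum likelihood covariance is one over m times the sum of (x minus mu)(x minus mu) transpose, and with fewer examples than dimensions its rank is at most m, so it is singular. A minimal numpy check, with made-up data sizes echoing the factory example:

    import numpy as np

    # Hypothetical sizes echoing the factory example: n = 1000 measurements, m = 20 days.
    rng = np.random.default_rng(0)
    n, m = 1000, 20
    X = rng.normal(size=(m, n))              # m examples, each n-dimensional

    mu = X.mean(axis=0)                      # ML estimate of the mean: the empirical mean
    centered = X - mu
    Sigma = centered.T @ centered / m        # ML estimate of the covariance, an n x n matrix

    # Rank is at most m - 1 after centering, far below n, so Sigma is singular:
    # it has zero eigenvalues and no inverse, and the Gaussian density is undefined.
    print(np.linalg.matrix_rank(Sigma))      # prints 19 here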
If you fit a Gaussian to this data set, you find that - well, you remember I used to draw contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood estimate Gaussian for this corresponds to a Gaussian whose contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will sort of be infinitely thin, right? And stretch infinitely long in that direction. And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't actually get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and this determinant is zero. So you'd have one over zero, times e to the something involving the inverse of a non-invertible matrix - so not a good model. So let's see how to do better, right? So given this sort of data, how do you model P of X? Well, one thing you could do is constrain sigma to be diagonal. So you have a covariance matrix that is - okay? So in other words, you constrain sigma to be this matrix, all right? With zeroes on the off-diagonals. I hope this makes sense. These zeroes I've written down here denote that everything off the diagonal of this matrix is zero. So the maximum likelihood estimates of the parameters will be pretty much what you'd expect, right?
And in pictures, what this means is that you'd be modeling the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal. And here's another example, and here's a third example. But all of these are examples of Gaussians where the covariance matrix is diagonal. Okay? And, I don't know, you could do this to model P of X, but this isn't very nice, because you've now thrown away all the correlations between the different variables - so the axes are X1 and X2, right? So you've thrown away - you're failing to capture any of the correlations or the relationships between any pair of variables in your data. Yeah? Student: Could you say again, what does that do, the diagonal? Say again - the covariance matrix being diagonal, what does that do again? I didn't quite understand what the examples mean. Instructor (Andrew Ng): Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see - suppose the covariance is diagonal; then you can ask, what is P of X parameterized by mu and sigma, right, if sigma is diagonal? And so this will be some Gaussian bump, right? So - oh, boy, my drawing's really bad, but in two-D the density for a Gaussian is like this bump-shaped thing, right? So this is the density of the Gaussian - wow, and this is a really bad drawing.
With those, your axes are X1 and X2 and the height of this is P of X, and so those figures over there are the contours of the density of the Gaussian. So those are the contours of this shape. Student: No, I don't mean the contours. What's special about these types? What makes them different from a general covariance matrix? Instructor (Andrew Ng): Oh, I see. Oh, okay, sorry. They're axis-aligned, so - these, let's see. So I'm not drawing a contour like this, right? Because the main axes of these are not aligned with the X1 and X2 axes; that would correspond to a Gaussian where the off-diagonals are non-zero, right? Cool. Okay. So you could do this; this would sort of work. It turns out that with just two training examples you can learn a non-singular covariance matrix this way, but you've thrown away all of the correlation in the data, so this is not a great model. It turns out you can do something - well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is you can constrain sigma to equal sigma squared times the identity matrix. So in other words, you can constrain it to be a diagonal matrix, and moreover all the diagonal entries must be the same, and so the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is a sort of even harsher constraint to place on your model.
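As an aside (an added sketch, not from the lecture), here is what the two constrained maximum likelihood estimates look like in numpy: the diagonal constraint uses the per-coordinate empirical variances, and the sigma squared times identity constraint averages them. Both stay invertible even with far fewer examples than dimensions, but neither captures any correlations. The data here is made up:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 1000, 20                               # many more dimensions than examples
    X = rng.normal(size=(m, n)) * rng.uniform(0.5, 2.0, size=n)

    mu = X.mean(axis=0)

    # Constraint 1: sigma diagonal. The ML estimate puts the per-coordinate
    # empirical variance on the diagonal and zero everywhere else.
    Sigma_diag = np.diag(X.var(axis=0))

    # Constraint 2: sigma = sigma^2 * I. The ML estimate of sigma^2 averages the
    # per-coordinate variances, giving circular (spherical) contours.
    sigma_sq = X.var(axis=0).mean()
    Sigma_spherical = sigma_sq * np.eye(n)

    # Both are invertible as long as no coordinate is constant, unlike the
    # unconstrained ML covariance in this m < n regime, but every off-diagonal
    # correlation has been forced to zero.
    print(X.var(axis=0).min() > 0, sigma_sq > 0)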
So either of these versions - diagonal sigma, or sigma being, sort of, a constant value times the identity - are already fairly strong assumptions, all right? So if you have enough data, maybe you'd like to model just a little bit of the correlations between your different variables. So the factor analysis model is one way to attempt to do that. So here's the idea. So this is how the factor analysis model models your data. We're going to assume that there is a latent random variable, okay? Which just means a hidden random variable Z. So Z is distributed Gaussian with mean zero and covariance the identity, where Z will be a d-dimensional vector now, and d will be chosen so that it is lower than the dimension of your X's. Okay? And now I'm going to assume that X is given by - well, let me write this. Each XI is distributed - actually, sorry. We have to assume that, conditioned on the value of Z, X is given by another Gaussian with mean given by mu plus lambda Z and covariance given by the matrix Psi.
So, just to say the second line in an equivalent form: equivalently, I'm going to model X as mu plus lambda Z plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance Psi. And so the parameters of this model are going to be a vector mu, which is n-dimensional, and a matrix lambda, which is n by d, and a covariance matrix Psi, which is n by n, and I'm going to impose an additional constraint on Psi. I'm going to impose the constraint that Psi is diagonal. Okay? So that was the formal definition - let me actually, sort of, give a couple of examples to make this more concrete. So let's go through a kind of example: suppose Z is one-dimensional and X is two-dimensional, and let's see what this model - let's see a, sort of, specific instance of the factor analysis model, and how we're modeling the joint - the distribution over X - what this gives us in terms of a model over P of X, all right? So let's see. For this model, let me assume that lambda is (2, 1) and that Psi, which has to be a diagonal matrix, remember, is this.
Okay? So Z is one-dimensional, so let me just draw a typical sample for Z, all right? So if I draw ZI from a Gaussian, then that's a typical sample for what Z might look like, and so I'm gonna - at any rate, I'm gonna call this Z1, Z2, Z3 and so on. If this really were a typical sample, the order of the Z's would be jumbled up, but I'm just ordering them like this to make the example easier. So, yes, a typical sample of the random variable Z from a Gaussian distribution with mean zero and variance one. So - and for this example, let me just set mu equals zero; it's just so that it's easier to talk about. So, lambda times Z, right? We'll take each of these numbers and multiply them by lambda. And so you find that all of the values for lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess. So if this was Z7, then this one here would be lambda times Z7, and that's now a point in R2, because lambda is a two by one matrix. And so what I've drawn here is like a typical sample for lambda times Z, and the final step for this is what a typical sample for X looks like.
Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by Psi, right? And so the last step, to draw a typical sample for the random variables X, is I'm gonna take these points - these are really the same as mu plus lambda Z, because mu is zero in this example - and around each point I'm going to place an axis-aligned ellipse. Or in other words, I'm going to create a Gaussian distribution centered on this point, and this that I've drawn corresponds to one of the contours of my density for epsilon, right? And so you can imagine placing a little Gaussian bump here. And so I'll draw an example from this little Gaussian, and let's say I get that point there; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the - whatever you call it - the orange points I drew will comprise a typical sample for what the distribution of X is under this model. Okay? Yeah? Student: Would you add, like, the mean? Instructor (Andrew Ng): Oh, say that again. Student: Do you add the mean into that? Instructor (Andrew Ng): Oh, yes, you do. And in this example, I set mu to zero, zero just to make it easier. If mu were something else, you'd take the whole picture and you'd sort of shift it to whatever the value of mu is. Yeah? Student: [Inaudible] horizontal line right there, which was Z.
What do the X's - sorry, what does that Y-axis correspond to? Instructor (Andrew Ng): Oh, so this is - Z is one-dimensional, so here I'm plotting a typical sample for Z, so this is like zero. So this is just the Z axis, right? So Z is one-dimensional data. So this line here is like a plot of a typical sample of values for Z. Okay? Yeah? Student: You have two axes, right? And the axis data pertains to samples. Instructor (Andrew Ng): Oh, yes, right. Student: So you're sort of projecting them onto that? Instructor (Andrew Ng): Let's not talk about projections yet, but, yeah, right. So these beige points - so that's like X1, and that's X2, and so on, right? So the beige points are what I see. And so in reality, all you ever get to see are the X's, but just like in the mixture of Gaussians model, I tell a story about what I imagine happened - the data came from two Gaussians; there was a hidden random variable Z that led to the generation of X's from two Gaussians. So in the same way, I'm sort of telling a story here: all the algorithm actually sees are the orange points, but we're gonna tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?
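As an aside (an added sketch, not from the lecture), here is the generative story for the running example in numpy: Z one-dimensional, X two-dimensional, lambda = (2, 1), mu = 0, and a made-up diagonal Psi. The sampled X's land near the line in the direction of lambda, plus a little axis-aligned Gaussian fuzz:

    import numpy as np

    rng = np.random.default_rng(2)
    m = 100
    mu = np.zeros(2)                          # mu = 0, as in the example
    Lam = np.array([[2.0], [1.0]])            # lambda, a 2 x 1 matrix
    Psi = np.diag([0.5, 0.5])                 # a made-up diagonal noise covariance

    Z = rng.normal(size=(m, 1))               # Z ~ N(0, 1), one-dimensional
    eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)   # epsilon ~ N(0, Psi)
    X = mu + Z @ Lam.T + eps                  # X = mu + lambda Z + epsilon, rows in R^2

    # Each row of X is a point near the line through the origin in direction (2, 1);
    # these play the role of the orange points in the lecture's picture.
    print(X[:3])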
So one of the ways to see the intuition of this model - we're going to think of the model as - one way, just informally, not formally, but one way to think about this model is that you can think of the factor analysis model as modeling the data as coming from a lower-dimensional subspace, more or less; so the data X here approximately lies on a one-D line, plus a little bit of noise - plus a little bit of random noise, so the X isn't exactly on this one-D line. That's one informal way of thinking about factor analysis. We're not doing great on time. Well, let's do this. So let me just do one more quick example, which is - in this example, let's say Z is in R2 and X is in R3, right? And so in this example Z - your data Z now lies in 2-D, and so let me draw this on a sheet of paper. Okay? So let's say the axes of my sheet of paper are the Z1 and Z2 axes, and so here is a typical sample of points Z, right? And so we'll then take the sample Z - well, actually, let me draw this here as well. All right. So this is a typical sample for Z drawn on the Z1 and Z2 axes, and I guess the origin would be here. So, centered around zero. And then we'll take those and map them to mu plus lambda Z, and what that means is - imagine the free space of this classroom is R3.
What that means is we'll take this sample of Z's and we'll map it to positions in free space. So we'll take this sheet of paper and move it to some position and orientation in 3-D space. And the last step is, you have X equals mu plus lambda Z plus epsilon, and so you take the set of points, which lie on some plane in our 3-D space, and add a little bit of noise to them, and the noise will, sort of, come from a Gaussian whose contours are axis-aligned. Okay? So you end up with a data set that's sort of like a flat pancake, plus a little bit of fuzz off the pancake. So that's the model - let's actually talk about how to fit the parameters of the model. Okay? In order to describe how to fit the model, I first need to re-write Gaussians in a very slightly different way. So, in particular, let's say I have a vector X, and I'm gonna use this notation to denote partitioned vectors, right? X1, X2 - where, if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector, and X is an r plus s dimensional vector. Okay? So I'm gonna use this notation to denote just taking a vector and, sort of, partitioning the vector into two halves: the first r elements followed by the last s elements.
So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector - so break mu up into two pieces, mu1 and mu2 - and the covariance matrix sigma is now a partitioned matrix. Okay? So what this means is that you take the covariance matrix sigma and I'm going to break it up into four blocks, right? And the dimensions are: there will be r elements here, and there will be s elements here, and there will be r elements here. So, for example, sigma 1,2 will be an r by s matrix - it's r elements tall and s elements wide. So this Gaussian we've written down is really a joint distribution over a whole lot of variables, right? So X is a vector, so this is a joint distribution over X1 through X sub n - over X sub r plus s. We can then ask, what are the marginal and conditional distributions of this Gaussian? So, for example, with my Gaussian, I know what P of X is, but can I compute the marginal distribution of X1, right? And so P of X1 is just equal to, of course, the integral over X2 of P of X1 comma X2, dX2. And if you actually perform that computation, you find that P of X1, I guess, is Gaussian with mean given by mu1 and covariance sigma 1,1. All right.
So this is sort of no surprise. The marginal distribution of a Gaussian is itself a Gaussian, and you just take out the relevant sub-blocks of the covariance matrix and the relevant sub-vector of the mean vector mu. You can also compute conditionals. You can also ask, what is P of X1 given a specific value for X2, right? And the way you compute that is, well, the usual way: P of X1 comma X2 divided by P of X2, right? And you know what both of these formulas are, right? The numerator - well, this is just the usual Gaussian; your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma, and the denominator, by that marginalization operation I talked about, is that. So if you actually plug in the formulas for these two Gaussians and you simplify - the simplification step is actually fairly non-trivial; if you haven't seen it before, this will actually be somewhat difficult to do.
But if you plug in the Gaussians and simplify that expression, you find that, conditioned on the value of X2, X1 is - the distribution of X1 conditioned on X2 is itself going to be Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where - well, I'm not gonna show the simplification and derivation - the formula for mu of 1 given 2 is given by this, and I think sigma of 1 given 2 is given by that. Okay? So these are just useful formulas to know for how to find the conditional distributions of a Gaussian and the marginal distributions of a Gaussian. I won't actually show the derivation for this. Student: Could you repeat the [inaudible]? Instructor (Andrew Ng): Sure. So this one on the left: mu of 1 given 2 equals mu1 plus sigma 1,2 times sigma 2,2 inverse, times X2 minus mu2. And this is sigma 1 given 2 equals sigma 1,1 minus sigma 1,2 times sigma 2,2 inverse times sigma 2,1. Okay? These are also in the lecture notes. Shoot - I'm not where I was hoping to be on time. Well, actually, it's okay. So it turns out - I think I'll skip this in the interest of time. So it turns out that - well, let's go back and use these in the factor analysis model, right?
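As an aside (an added sketch, not from the lecture), here are those marginal and conditional facts evaluated in numpy on a small made-up partitioned Gaussian, with X1 two-dimensional and X2 one-dimensional:

    import numpy as np

    # Made-up partitioned Gaussian: x = [x1; x2] with x1 in R^2 and x2 in R^1.
    mu1 = np.array([1.0, 2.0])
    mu2 = np.array([0.5])
    Sigma11 = np.array([[2.0, 0.3],
                        [0.3, 1.0]])
    Sigma12 = np.array([[0.4],
                        [0.2]])
    Sigma22 = np.array([[1.5]])

    # Marginal: p(x1) = N(mu1, Sigma11) - just read off the relevant sub-blocks.

    # Conditional p(x1 | x2 = 1.2), using the two formulas from the board:
    x2 = np.array([1.2])
    mu_1_given_2 = mu1 + Sigma12 @ np.linalg.solve(Sigma22, x2 - mu2)
    Sigma_1_given_2 = Sigma11 - Sigma12 @ np.linalg.solve(Sigma22, Sigma12.T)

    print(mu_1_given_2)       # mu1 shifted by sigma 1,2 sigma 2,2 inverse (x2 - mu2)
    print(Sigma_1_given_2)    # sigma 1,1 minus sigma 1,2 sigma 2,2 inverse sigma 2,1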
It turns out that you can go back and - oh, do I want to do this? I kind of need this, though. So let's go back and figure out just what joint distribution factor analysis assumes over Z and X. Okay? So, under the factor analysis model, Z and X - the random variables Z and X - have some joint distribution given by - I'll write this vector as mu of ZX - and some covariance matrix sigma. So let's go back and figure out what mu ZX is and what sigma is, and I'll do this so that we get a little bit more practice with partitioned vectors and partitioned matrices. So, just to remind you, right? You have Z as Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance Psi - I'm just writing out the same equations again. So let's first figure out what this vector mu ZX is. Well, the expected value of Z is zero - and, again, as usual, I'll often drop the square brackets around here. And the expected value of X is - well, the expected value of mu plus lambda Z plus epsilon.
So these two terms have zero expectation, and so the expected value of X is just mu, and so that vector mu ZX, right - my mean parameter for the Gaussian - is going to be the expected value of this partitioned vector, given by this partition Z and X, and so that would just be zero followed by mu. Okay? And so that's a d-dimensional zero followed by an n-dimensional mu. Next let's work out what the covariance matrix sigma is. So the covariance matrix sigma - if you work out the definition for a partitioned vector like this - is going to be a partitioned matrix. Okay? So the covariance matrix sigma will comprise four blocks like that, and so the upper-leftmost block, which I write as sigma 1,1 - well, that upper-leftmost block is just the covariance matrix of Z, which we know is the identity. I was gonna show you briefly how to derive some of the other blocks, right? So sigma 1,2, that's the upper - oh, actually, excuse me, sigma 2,1, which is the lower-left block - that's E of X minus E of X, times Z minus E of Z, transpose.
So X is equal to mu plus lambda Z plus epsilon, and then minus E of X is minus mu, and then times Z minus E of Z transpose - and because the expected value of Z is zero, right, that's just times Z transpose. And so if you expand this out, the plus mu and minus mu cancel out, and so you have the expected value of lambda - oh, excuse me - of lambda Z Z transpose, plus the expected value of epsilon Z transpose, which is equal to that, which is just equal to lambda times the identity matrix - so, lambda. Okay? Does that make sense? Because this second term is equal to zero: epsilon and Z are independent and have zero expectation, so the second term is zero. Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? This is equal to - and I won't do this, but this simplifies to lambda lambda transpose plus Psi. Okay? So, putting all this together, this tells us that the joint distribution of this vector Z, X is going to be Gaussian with mean vector given by that, which we worked out previously - so this is the mu ZX that we worked out previously - and covariance matrix given by that. Okay?
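As a sanity check (an added sketch, not from the lecture), you can sample from the generative model with made-up parameters and confirm numerically that the covariance of the stacked vector (Z, X) comes out block-wise as identity, lambda transpose, lambda, and lambda lambda transpose plus Psi, up to Monte Carlo error:

    import numpy as np

    rng = np.random.default_rng(3)
    d, n, m = 2, 3, 200_000                    # made-up sizes: Z in R^2, X in R^3
    mu = np.array([1.0, -1.0, 0.5])
    Lam = rng.normal(size=(n, d))              # an arbitrary n x d lambda
    Psi = np.diag([0.3, 0.5, 0.2])             # diagonal noise covariance

    Z = rng.normal(size=(m, d))                                     # Z ~ N(0, I)
    X = mu + Z @ Lam.T + rng.multivariate_normal(np.zeros(n), Psi, size=m)

    ZX = np.hstack([Z, X])                     # the partitioned vector (Z, X), row-wise
    empirical = np.cov(ZX, rowvar=False)       # empirical (d + n) x (d + n) covariance

    theoretical = np.block([[np.eye(d),              Lam.T],
                            [Lam,       Lam @ Lam.T + Psi]])
    print(np.max(np.abs(empirical - theoretical)))   # small, shrinking as m grows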
So, in principle - let's see, so the parameters of our model are mu, lambda, and Psi. And so in order to find the parameters of this model, we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters. And so in principle, one thing you could do is actually write down what P of XI is - right, so P of XI - the distribution of X, right? If, again, you marginalize this Gaussian, then the distribution of X, which is the lower half of this partitioned vector, is going to have mean mu and covariance given by lambda lambda transpose plus Psi, right? So that's the distribution that we're using to model P of X. And so in principle, one thing you could do is actually write down the log likelihood of your parameters, right? Which is just the product over - well, it's the sum over I of log P of XI, where P of XI will be given by this Gaussian density, right? And I'm using theta as shorthand to denote all of my parameters. And you actually know what the density of a Gaussian is, and so you can say P of XI is this Gaussian with mean mu and covariance given by this lambda lambda transpose plus Psi.
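As an aside (an added sketch, not from the lecture), that marginal density is straightforward to evaluate, so the log likelihood being described is just the sum over the training set of log N(x; mu, lambda lambda transpose plus Psi). A minimal version, assuming numpy and scipy are available and using made-up data:

    import numpy as np
    from scipy.stats import multivariate_normal

    def factor_analysis_log_likelihood(X, mu, Lam, Psi):
        # Log likelihood: sum over i of log p(x^(i)) where x ~ N(mu, Lam Lam^T + Psi).
        Sigma = Lam @ Lam.T + Psi
        return multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

    rng = np.random.default_rng(4)
    X = rng.normal(size=(20, 3))               # made-up data: 20 examples in R^3
    mu = X.mean(axis=0)
    Lam = rng.normal(size=(3, 1))              # d = 1 latent dimension
    Psi = np.diag(X.var(axis=0))

    print(factor_analysis_log_likelihood(X, mu, Lam, Psi))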
Dialogue: 0,1:11:52.96,1:11:54.69,Default,,0000,0000,0000,,So you can write down the log likelihood Dialogue: 0,1:11:54.69,1:11:55.90,Default,,0000,0000,0000,,of your parameters Dialogue: 0,1:11:55.90,1:11:57.38,Default,,0000,0000,0000,,as follows Dialogue: 0,1:11:57.38,1:12:01.05,Default,,0000,0000,0000,,and you can try to take derivatives of your log likelihood with respect to your Dialogue: 0,1:12:01.05,1:12:02.04,Default,,0000,0000,0000,,parameters Dialogue: 0,1:12:02.04,1:12:05.06,Default,,0000,0000,0000,,and maximize the log likelihood, all right. Dialogue: 0,1:12:05.06,1:12:07.49,Default,,0000,0000,0000,,It turns out that if you do that you end up with Dialogue: 0,1:12:07.49,1:12:11.30,Default,,0000,0000,0000,,sort of an intractable optimization problem or at least one Dialogue: 0,1:12:11.30,1:12:13.22,Default,,0000,0000,0000,,that you - excuse me, you end up with an Dialogue: 0,1:12:13.22,1:12:16.86,Default,,0000,0000,0000,,optimization problem that you will not be able to find an analytic, sort of, Dialogue: 0,1:12:16.86,1:12:18.65,Default,,0000,0000,0000,,closed-form solution to. Dialogue: 0,1:12:18.65,1:12:21.68,Default,,0000,0000,0000,,So if you say my model of X is this and perform your Dialogue: 0,1:12:21.68,1:12:23.90,Default,,0000,0000,0000,,maximum likelihood parameter estimation Dialogue: 0,1:12:23.90,1:12:25.59,Default,,0000,0000,0000,,you won't be able to find Dialogue: 0,1:12:25.59,1:12:29.12,Default,,0000,0000,0000,,the maximum likelihood estimate of the parameters in closed form. Dialogue: 0,1:12:29.12,1:12:30.06,Default,,0000,0000,0000,, Dialogue: 0,1:12:30.06,1:12:31.94,Default,,0000,0000,0000,,So what I would like to do Dialogue: 0,1:12:31.94,1:12:38.94,Default,,0000,0000,0000,,is - well, Dialogue: 0,1:12:40.28,1:12:45.58,Default,,0000,0000,0000,,so in order to fit parameters to this model Dialogue: 0,1:12:45.58,1:12:51.87,Default,,0000,0000,0000,,what we'll actually do is use the EM Algorithm Dialogue: 0,1:12:51.87,1:12:55.57,Default,,0000,0000,0000,,in which Dialogue: 0,1:12:55.57,1:12:59.56,Default,,0000,0000,0000,,the E step, right?
Dialogue: 0,1:12:59.56,1:13:06.56,Default,,0000,0000,0000,,We'll compute that Dialogue: 0,1:13:08.32,1:13:10.93,Default,,0000,0000,0000,, Dialogue: 0,1:13:10.93,1:13:15.01,Default,,0000,0000,0000,,and this formula looks the same except that one difference is that now Dialogue: 0,1:13:15.01,1:13:17.68,Default,,0000,0000,0000,,Z is a continuous random variable Dialogue: 0,1:13:17.68,1:13:18.59,Default,,0000,0000,0000,,and so Dialogue: 0,1:13:18.59,1:13:20.12,Default,,0000,0000,0000,,in the E step Dialogue: 0,1:13:20.12,1:13:24.21,Default,,0000,0000,0000,,we actually have to find the density QI of ZI, where the, sort of, E step actually Dialogue: 0,1:13:24.21,1:13:26.05,Default,,0000,0000,0000,,requires that we find Dialogue: 0,1:13:26.05,1:13:28.26,Default,,0000,0000,0000,,the posterior distribution Dialogue: 0,1:13:28.26,1:13:31.51,Default,,0000,0000,0000,,that is - so the density of the random variable ZI Dialogue: 0,1:13:31.51,1:13:34.49,Default,,0000,0000,0000,,and then the M step Dialogue: 0,1:13:34.49,1:13:37.95,Default,,0000,0000,0000,,will then perform Dialogue: 0,1:13:37.95,1:13:40.87,Default,,0000,0000,0000,,the following maximization Dialogue: 0,1:13:40.87,1:13:41.71,Default,,0000,0000,0000,, Dialogue: 0,1:13:41.71,1:13:45.01,Default,,0000,0000,0000,,where, again, because Z is now Dialogue: 0,1:13:45.01,1:13:49.93,Default,,0000,0000,0000,,continuous we Dialogue: 0,1:13:49.93,1:13:56.93,Default,,0000,0000,0000,,now need to integrate over Z. Okay? Where Dialogue: 0,1:13:59.39,1:14:02.38,Default,,0000,0000,0000,,in the M step now, because ZI is continuous, we now have an Dialogue: 0,1:14:02.38,1:14:05.71,Default,,0000,0000,0000,,integral over Z rather than a sum. Dialogue: 0,1:14:05.71,1:14:09.27,Default,,0000,0000,0000,,Okay? I was hoping to go a little bit further in deriving these things, but I don't have Dialogue: 0,1:14:09.27,1:14:11.08,Default,,0000,0000,0000,,time today so we'll wrap that up Dialogue: 0,1:14:11.08,1:14:14.53,Default,,0000,0000,0000,,in the next lecture, but before I close let's check if there are questions Dialogue: 0,1:14:14.53,1:14:15.73,Default,,0000,0000,0000,, Dialogue: 0,1:14:15.73,1:14:22.73,Default,,0000,0000,0000,,about the whole factor analysis model. Okay. Dialogue: 0,1:14:27.57,1:14:30.64,Default,,0000,0000,0000,,So we'll come back in the next lecture; Dialogue: 0,1:14:30.64,1:14:34.69,Default,,0000,0000,0000,,I will wrap up this model, and I want to go a little bit deeper into the E Dialogue: 0,1:14:34.69,1:14:36.71,Default,,0000,0000,0000,,and M steps, as there are some Dialogue: 0,1:14:36.71,1:14:39.76,Default,,0000,0000,0000,,tricky parts for the factor analysis model specifically. Dialogue: 0,1:14:39.76,1:14:41.16,Default,,0000,0000,0000,,Okay. I'll see you in a Dialogue: 0,1:14:41.16,1:14:41.41,Default,,0000,0000,0000,,couple of days.
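The lecture defers the E-step derivation to the next session, but as a hedged sketch of where it is headed: by the standard conditioning formulas for a jointly Gaussian vector, the posterior Q_i(z_i) = p(z_i | x_i; theta) is itself Gaussian, and one plausible NumPy rendering is below. The function name and variable names are mine, not the lecture's, and this is only a sketch under those assumptions.

```python
import numpy as np

# Hedged sketch (derivation deferred to the next lecture): conditioning the
# joint Gaussian of [z; x] on x gives a Gaussian posterior over z with
#   E[z|x]   = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu)
#   Cov[z|x] = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda

def e_step_posterior(X, mu, Lam, Psi):
    """Return posterior means (m, d) and the shared posterior covariance (d, d)."""
    d = Lam.shape[1]
    Sigma_x = Lam @ Lam.T + Psi                  # marginal covariance of x
    K = Lam.T @ np.linalg.inv(Sigma_x)           # Lambda^T (Lambda Lambda^T + Psi)^{-1}
    mu_post = (X - mu) @ K.T                     # E[z_i | x_i] for every example
    Sigma_post = np.eye(d) - K @ Lam             # Cov[z_i | x_i], the same for all i
    return mu_post, Sigma_post
```

Because z is continuous, these posterior means and covariances are what the M step integrates against, rather than summing over a finite set of z values as in the mixture models.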