Lecture 13 | Machine Learning (Stanford)
-
0:09 - 0:10
-
0:10 - 0:14This presentation is delivered by the Stanford Center for Professional
-
0:14 - 0:21Development.
-
0:24 - 0:25So
-
0:25 - 0:30welcome back, and what I want to do today is
-
0:30 - 0:34continue our discussions of the EM Algorithm, and in particular, I want to talk
-
0:34 - 0:35about
-
0:35 - 0:36the EM
-
0:36 - 0:40formulation that we derived in the previous lecture and apply it to the
-
0:40 - 0:42mixture of Gaussians model,
-
0:42 - 0:45apply it to a different model and a mixture of naive Bayes model,
-
0:45 - 0:49and then the last part of today's lecture will be on the factor
-
0:49 - 0:51analysis algorithm, which will also use the EM.
-
0:51 - 0:56And as part of that, we'll actually take a brief digression to talk a little bit about
-
0:56 - 1:01sort of useful properties of Gaussian distributions.
-
1:01 - 1:03So just to recap where we are.
-
1:03 - 1:10In the previous lecture, I started to talk about unsupervised learning, which was
-
1:10 - 1:13machine-learning problems, where you're given
-
1:13 - 1:15an unlabeled training set
-
1:15 - 1:16comprising m examples here, right?
-
1:16 - 1:18And then " so the fact that there are no labels;
-
1:18 - 1:23that's what makes this unsupervised or
-
1:23 - 1:25anything. So
-
1:25 - 1:31one problem that I talked about last time was what if
-
1:31 - 1:34you're given a data set that looks like this
-
1:34 - 1:36and you want to model
-
1:36 - 1:40the density P of X from which you think the data
-
1:40 - 1:42had been drawn,
-
1:42 - 1:46and so with a data set like this, maybe you think it was generated by a mixture of two Gaussians, and we started
-
1:46 - 1:50to talk about an algorithm
-
1:50 - 1:54for fitting a mixture of Gaussians model, all right? And so
-
1:55 - 1:58we said that we would model the
-
1:58 - 1:59density of X, P of X,
-
1:59 - 2:02as sum over Z of
-
2:02 - 2:03P of X
-
2:03 - 2:05given Z
-
2:05 - 2:07times P of Z
-
2:07 - 2:10where this latent random variable, meaning this hidden random
-
2:10 - 2:12variable Z
-
2:12 - 2:13indicates
-
2:13 - 2:16which of the two Gaussian distributions each of your data points came from
-
2:16 - 2:19and so we have,
-
2:19 - 2:24you know, Z was multinomial
-
2:24 - 2:27with parameter phi
-
2:27 - 2:29and X conditioned on
-
2:29 - 2:33Z coming from the J-th
-
2:33 - 2:36Gaussian
-
2:36 - 2:39was given by a Gaussian
-
2:39 - 2:44with mean mu J and covariance sigma J, all right? So, like I said
-
2:44 - 2:48at the beginning of the previous lecture, I just talked about a very specific algorithm that
-
2:48 - 2:51I sort of pulled out of the air
-
2:51 - 2:55for fitting the parameters of this model - for finding the parameters phi, mu and
-
2:55 - 2:56
-
2:56 - 3:00sigma, but then in the second half of the previous lecture I talked about what's called the
-
3:00 - 3:01EM Algorithm in which
-
3:01 - 3:04our
-
3:04 - 3:09goal is maximum likelihood estimation of the parameters. So we want to maximize in terms of
-
3:09 - 3:10theta,
-
3:10 - 3:13you know, the, sort of,
-
3:13 - 3:18usual log likelihood -
-
3:18 - 3:20well, parameterized by theta.
-
3:20 - 3:22And
-
3:22 - 3:25because we have
-
3:25 - 3:28a latent random variable Z this is really
-
3:28 - 3:31maximizing in terms of theta,
-
3:31 - 3:38sum over I, sum over Z, P of XI,
-
3:40 - 3:45ZI parameterized by theta. Okay?
-
3:45 - 3:47So using
-
3:47 - 3:50Jensen's inequality last time
-
3:50 - 3:54we worked out the EM Algorithm in which in the E step
-
3:54 - 3:56we would choose these
-
3:56 - 3:59probability distributions QI to be
-
3:59 - 4:03the posterior
-
4:03 - 4:09on Z given X and parameterized by theta
-
4:09 - 4:14and in the M step we would set
-
4:14 - 4:16theta
-
4:16 - 4:22to be the value that
-
4:22 - 4:28maximizes
-
4:28 - 4:35this.
-
4:36 - 4:39Okay? So these are the ones we worked out last time
-
4:42 - 4:47and the cartoon that I drew was that you have this log likelihood function L of
-
4:47 - 4:49theta that's often hard to maximize
-
4:49 - 4:51and what the E step does
-
4:51 - 4:56is choose these probability distributions QI. And in the cartoon, I drew
-
4:56 - 4:58what that corresponded to
-
4:58 - 5:01was finding a lower bound
-
5:01 - 5:03for the log likelihood.
-
5:03 - 5:05And then
-
5:05 - 5:06the horizontal axis is theta
-
5:06 - 5:11and then in the M step you maximize the lower bound, right? So maybe you were here previously
-
5:11 - 5:12and so you
-
5:12 - 5:15jumped to the new point, the new
-
5:15 - 5:18maximum of this lower bound. Okay?
-
5:18 - 5:22And so this little curve here, right? This lower bound function here
-
5:22 - 5:26that's really
-
5:26 - 5:29the right-hand side of that arg max.
-
5:29 - 5:30Okay? So
-
5:30 - 5:34this whole thing in the arg max. If you view this thing
-
5:34 - 5:37as a function of theta,
-
5:37 - 5:38this function of theta
-
5:38 - 5:39is a lower bound
-
5:39 - 5:42for the log likelihood of theta
-
5:42 - 5:46and so the M step we maximize this lower bound and that corresponds to
-
5:46 - 5:47jumping
-
5:47 - 5:52to the new maximum of this lower bound. So it
-
5:52 - 5:53turns out
-
5:53 - 5:56that
-
5:56 - 5:57in the EM Algorithm -
-
5:57 - 6:02so why do you bother with the EM algorithm? It turns out that very often, and this will be
-
6:02 - 6:05true for all the examples we see
-
6:05 - 6:07today, it turns out
-
6:07 - 6:11that very often in the
-
6:11 - 6:13EM Algorithm
-
6:13 - 6:16maximizing the M Step, so performing the maximization the M Step, will be
-
6:16 - 6:20tractable and can often be done analytically in the closed form.
-
6:20 - 6:21Whereas
-
6:21 - 6:26if you were trying to maximize this objective directly - if you try to take this
-
6:26 - 6:27formula on the right,
-
6:27 - 6:30this maximum likelihood objective, and take this thing on the right
-
6:30 - 6:33and set its derivatives to zero and try to solve -
-
6:33 - 6:35you'll find that you're unable
-
6:35 - 6:37to obtain a solution in closed form
-
6:37 - 6:39to this maximization. Okay?
-
6:39 - 6:41And so to give you an example of that is that
-
6:41 - 6:45you remember our discussion on exponential family models, right?
-
6:45 - 6:47It turns out that
-
6:47 - 6:49if X and Z
-
6:49 - 6:50is
-
6:50 - 6:54jointly, I guess, in the exponential family. So if P of X,
-
6:54 - 6:55Z
-
6:55 - 6:58parameterized by theta is an exponential family distribution,
-
6:58 - 7:02which it turns out to be true for the mixture of Gaussians distribution.
-
7:02 - 7:05Then turns out that the M step here will be tractable
-
7:05 - 7:08and the E step will also be tractable and so you can do each of these steps very
-
7:08 - 7:09easily.
-
7:09 - 7:11Whereas
-
7:11 - 7:13performing - trying to
-
7:13 - 7:16perform this original maximum likelihood estimation problem
-
7:16 - 7:18on this one, right?
-
7:18 - 7:21Will be computationally very difficult. You're going
-
7:21 - 7:25to set the derivatives to zero and try to solve for that. Analytically you won't be able to find an analytic solution to
-
7:25 - 7:28this. Okay?
-
7:28 - 7:33So
-
7:33 - 7:36what I want to do in a second is actually take this view of the EM
-
7:36 - 7:37Algorithm
-
7:37 - 7:41and apply it to the mixture of Gaussians models. I want to take these E steps
-
7:41 - 7:43and M Steps and work
-
7:43 - 7:44them out for
-
7:44 - 7:48the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the
-
7:50 - 7:55EM Algorithm. It turns out there's one other way of thinking about the EM
-
7:55 - 7:59Algorithm, which is the following: I can define
-
7:59 - 8:02an optimization objective
-
8:02 - 8:06J of theta, Q, and define
-
8:06 - 8:11it to be
-
8:11 - 8:15this. This is just the thing in the arg max
-
8:15 - 8:22in the M step. Okay?
-
8:26 - 8:28And so what we proved
-
8:28 - 8:31using Jensen's inequality
-
8:31 - 8:34is that
-
8:34 - 8:38the log likelihood of theta is greater than or equal to J
-
8:38 - 8:42of theta Q. So in other words, we proved last time that
-
8:42 - 8:44for any value of theta and Q
-
8:44 - 8:47the log likelihood upper bounds J of theta and Q.
-
8:47 - 8:51And so just to relate this back to, sort of, yet more things that you already
-
8:51 - 8:52know,
-
8:52 - 8:54you can also think of
-
8:54 - 8:59this as coordinate ascent, right?
-
8:59 - 9:02Remember our discussion awhile back on the coordinate ascent optimization
-
9:02 - 9:03algorithm.
-
9:03 - 9:07So we can show, and I won't actually show this view so just take our word for it
-
9:07 - 9:10and look for that at home if you want,
-
9:10 - 9:12
-
9:12 - 9:13that EM is
-
9:13 - 9:17just coordinate ascent on the
-
9:17 - 9:18function J.
-
9:18 - 9:21So in the E step you maximize
-
9:21 - 9:24with respect to Q
-
9:24 - 9:27and then the M step
-
9:27 - 9:32you maximize with
-
9:32 - 9:34respect to theta. Okay?
-
9:34 - 9:36So this is
-
9:36 - 9:40another view of the EM Algorithm
-
9:41 - 9:45that shows why it has to converge, for example. You can view it in the sense of
-
9:45 - 9:47J of theta, Q
-
9:47 - 9:53having to monotonically increase on every iteration. Okay?
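Just to make that loop structure concrete, here is a minimal sketch of the generic EM iteration in Python. The helper functions e_step, m_step, and log_likelihood are hypothetical placeholders that a specific model (such as the mixture of Gaussians worked out next) would supply; this is only an illustration, not code from the course.

```python
# Minimal sketch of the generic EM loop. The callables e_step, m_step, and
# log_likelihood are assumed to be provided by a specific model.
def run_em(X, theta, e_step, m_step, log_likelihood, max_iters=100, tol=1e-6):
    ll_old = -float("inf")
    for _ in range(max_iters):
        Q = e_step(X, theta)        # E step: Q_i(z) = p(z | x_i; theta)
        theta = m_step(X, Q)        # M step: maximize the lower bound over theta
        ll = log_likelihood(X, theta)
        # By the coordinate-ascent view above, ll never decreases.
        if ll - ll_old < tol:
            break
        ll_old = ll
    return theta
```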
-
9:53 - 9:59So what I want
-
9:59 - 10:02to do next is actually take this general
-
10:02 - 10:06EM machinery that we worked up and apply it to a mixture Gaussians model.
-
10:06 - 10:10Before I do that, let me just check if there are questions about
-
10:10 - 10:17the EM Algorithm as a whole? Okay, cool. So
-
10:24 - 10:26let's go ahead and
-
10:26 - 10:33work on the mixture of Gaussians
-
10:36 - 10:37EM, all right? MOG, that's my
-
10:37 - 10:39abbreviation for Mixture of
-
10:39 - 10:44Gaussians. So in the E step we compute those Q distributions, right?
-
10:44 - 10:51In particular, I want to work out - so Q is the probability distribution
-
10:51 - 10:53over the latent random variable Z
-
10:53 - 10:58and so in the E step I'm gonna figure out what to compute - what is Q of ZI equals J. And
-
10:58 - 11:01you can think of this as my writing
-
11:01 - 11:04P of ZI equals J, right? Under the Q
-
11:04 - 11:08distribution. That's what this notation means.
-
11:08 - 11:13And so the EM Algorithm tells us
-
11:13 - 11:14
-
11:14 - 11:16that, let's see, Q of J
-
11:16 - 11:23is the posterior probability of ZI being the value
-
11:24 - 11:25J and
-
11:25 - 11:29given XI and all your parameters.
-
11:29 - 11:30And so,
-
11:30 - 11:31
-
11:31 - 11:36well, the way you compute this is by Bayes rule, right? So that is going to
-
11:36 - 11:39be equal to P of XI given ZI
-
11:39 - 11:41equals J
-
11:41 - 11:45times P of ZI equals J divided
-
11:45 - 11:52by -
-
11:57 - 12:01right? That's all the - by Bayes rule.
-
12:01 - 12:03
-
12:03 - 12:06And so this
-
12:06 - 12:09you know
-
12:09 - 12:10
-
12:10 - 12:12
-
12:12 - 12:14because XI given ZI equals J - this was a Gaussian
-
12:14 - 12:18with mean mu J and covariance sigma J. And so to compute this first
-
12:18 - 12:21term you plug in the formula for the Gaussian density there
-
12:22 - 12:25with parameters mu J and sigma J
-
12:25 - 12:27and this you'd know
-
12:27 - 12:28because
-
12:28 - 12:36Z was multinomial, right?
-
12:36 - 12:41With parameters given by phi and so the probability of ZI being equal to J is just phi
-
12:41 - 12:44J and so you can substitute these terms in.
-
12:44 - 12:47Similarly do the same thing for the denominator
-
12:47 - 12:50and that's how you work out what Q is. Okay?
-
12:50 - 12:56And so in the previous lecture this value the probability that ZI
-
12:56 - 12:59equals J under the Q
-
12:59 - 13:01distribution that was why I denoted that as WIJ.
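As a rough sketch of that E step in code (the names phi, mu, Sigma follow the lecture, but the code itself is just an illustration, not from the course):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_e_step(X, phi, mu, Sigma):
    """E step for a mixture of Gaussians: w[i, j] = Q_i(z_i = j),
    the posterior p(z_i = j | x_i; phi, mu, Sigma), computed by Bayes rule."""
    m = X.shape[0]
    k = len(phi)
    w = np.zeros((m, k))
    for j in range(k):
        # Numerator of Bayes rule: p(x_i | z_i = j) * p(z_i = j)
        w[:, j] = multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) * phi[j]
    # Denominator: sum over all possible values of z_i.
    w /= w.sum(axis=1, keepdims=True)
    return w
```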
-
13:01 - 13:04So that would be the
-
13:04 - 13:05
-
13:05 - 13:09E
-
13:09 - 13:10step
-
13:10 - 13:14and then in the M step
-
13:14 - 13:18we maximize with respect to all of our parameters.
-
13:18 - 13:19
-
13:19 - 13:21This, well
-
13:21 - 13:28I seem to be writing the same formula down a lot today. All
-
13:28 - 13:35right.
-
13:43 - 13:48And just so
-
13:48 - 13:52we're completely concrete about how you do that, right? So if you do
-
13:52 - 13:55that you end up with -
-
13:55 - 13:57so plugging in the
-
13:57 - 14:00quantities that you know
-
14:00 - 14:01
-
14:01 - 14:04that becomes
-
14:04 - 14:11this, let's see. Right.
-
14:35 - 14:39And so that we're completely concrete about what the M step is doing.
-
14:39 - 14:42So in the
-
14:42 - 14:44M step that was,
-
14:44 - 14:46I guess, QI
-
14:46 - 14:47of ZI, with the sum over
-
14:47 - 14:51J. So in the summation, the sum over J is the sum over all the possible values
-
14:51 - 14:52of ZI
-
14:52 - 14:53and then
-
14:53 - 14:56this thing here is my Gaussian
-
14:56 - 14:59density. Sorry, guys, this thing - well,
-
14:59 - 15:02this first term here,
-
15:02 - 15:06right? Is my P of
-
15:06 - 15:10XI given ZI
-
15:10 - 15:14and that's P of ZI. Okay?
-
15:14 - 15:17And so
-
15:17 - 15:21to maximize this with respect to - say you want to maximize this with respect to all
-
15:21 - 15:24of your parameters phi, mu and sigma.
-
15:24 - 15:28So to maximize with respect to the parameter mu, say,
-
15:28 - 15:32you would take the derivative with respect to mu
-
15:32 - 15:34and set that to zero
-
15:34 - 15:38and you would - and if you actually do that computation
-
15:38 - 15:44you would get, for instance, that
-
15:44 - 15:48
-
15:48 - 15:50that becomes your update
-
15:50 - 15:51to mu J.
-
15:51 - 15:52Okay? Just
-
15:52 - 15:53so I want to -
-
15:53 - 15:57the equation is unimportant. All of these equations are written down in
-
15:57 - 16:00the lecture notes. I'm writing these down just to be
-
16:00 - 16:03completely concrete about what the M step means. And so write down that formula,
-
16:03 - 16:05plug in the densities you know, take
-
16:05 - 16:08the derivative set to zero, solve for mu J
-
16:08 - 16:11and in the same way you set the derivatives equal to zero
-
16:11 - 16:13and solve for your updates
-
16:13 - 16:15for your other parameters phi and
-
16:15 - 16:22sigma as well. Okay?
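Collecting those updates, here is an illustrative sketch (not the lecture's own code) of the closed-form M step for the mixture of Gaussians; the phi update already respects the sum-to-one constraint discussed next:

```python
import numpy as np

def mog_m_step(X, w):
    """M step for a mixture of Gaussians: closed-form updates obtained by
    setting derivatives of the lower bound to zero. w[i, j] are the E-step
    weights computed above."""
    m, n = X.shape
    k = w.shape[1]
    phi = w.mean(axis=0)                          # phi_j = (1/m) sum_i w_ij
    mu = (w.T @ X) / w.sum(axis=0)[:, None]       # weighted means
    Sigma = []
    for j in range(k):
        diff = X - mu[j]                          # (m, n)
        Sigma.append((w[:, j, None] * diff).T @ diff / w[:, j].sum())
    return phi, mu, np.array(Sigma)
```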
-
16:23 - 16:26Well, let me just point out one little tricky bit here that
-
16:26 - 16:31you may not have seen before - most of you probably already know this, but I'll just
-
16:31 - 16:31mention
-
16:31 - 16:33it - which is that
-
16:33 - 16:37since phi here is a multinomial distribution
-
16:37 - 16:40when you take this formula
-
16:40 - 16:42and you maximize it with respect to phi
-
16:42 - 16:45you actually have an additional constraint, right? That the sum of I - let's see, sum
-
16:45 - 16:46
-
16:46 - 16:48over
-
16:48 - 16:50J,
-
16:50 - 16:54phi J must be equal to one. All right? So, again,
-
16:54 - 16:58in the M step I want to take this thing and maximize it with respect to all the parameters
-
16:58 - 17:01and when you maximize this respect to the parameters phi J
-
17:01 - 17:03you need to respect the constraint that
-
17:03 - 17:06sum of J phi J must be equal to one.
-
17:06 - 17:11And so, well, you already know how to do constrained maximization, right? We talked about the method of
-
17:11 - 17:14Lagrange multipliers and the generalized Lagrangian when we talked about support vector
-
17:14 - 17:15machines.
-
17:15 - 17:19And so to actually perform the maximization in terms of phi J you
-
17:19 - 17:21construct the Lagrangian,
-
17:21 - 17:23which is - all right?
-
17:23 - 17:25So that's the
-
17:25 - 17:28equation from above, which we'll denote with the dot dot dot, plus
-
17:28 - 17:34beta times
-
17:34 - 17:41that, where this beta is sort of the Lagrange multiplier
-
17:42 - 17:43and this
-
17:43 - 17:46is your optimization objective.
-
17:46 - 17:48
-
17:48 - 17:51And so to actually solve for the parameters phi J you set
-
17:51 - 17:57the derivatives of this
-
17:57 - 17:59Lagrangian to zero and solve. Okay?
-
17:59 - 18:03And if you then work through the math
-
18:03 - 18:07you get the appropriate value to update the phi J's too,
-
18:07 - 18:11which I won't do here, but all the full derivations are in the lecture
-
18:11 - 18:16notes. I won't do that here.
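For reference, here is a short sketch of how that Lagrangian calculation goes, writing beta for the multiplier (this notation is assumed here, not read off the board):

```latex
% Keep only the terms of the M-step objective that involve phi and add the
% constraint with multiplier beta:
\mathcal{L}(\phi) = \sum_{i=1}^{m} \sum_{j=1}^{k} w_{ij} \log \phi_j
                    + \beta \Big( \sum_{j=1}^{k} \phi_j - 1 \Big)
% Setting the derivative with respect to phi_j to zero:
\frac{\partial \mathcal{L}}{\partial \phi_j}
  = \sum_{i=1}^{m} \frac{w_{ij}}{\phi_j} + \beta = 0
  \quad\Rightarrow\quad
  \phi_j = -\frac{1}{\beta} \sum_{i=1}^{m} w_{ij}
% Using \sum_j \phi_j = 1 and \sum_j w_{ij} = 1 gives -\beta = m, so
\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_{ij}
```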
-
18:16 - 18:18
-
18:18 - 18:22Okay. And so if you actually perform all these computations you can also verify
-
18:22 - 18:23that.
-
18:23 - 18:27So I just wrote down a bunch of formulas for the EM
-
18:27 - 18:30Algorithm. At the beginning of the last lecture I said for the mixture of Gaussian's model - I
-
18:30 - 18:34said for the EM here's the formula for computing the WIJ's and here's a
-
18:34 - 18:36formula for computing the mu's and so on,
-
18:36 - 18:42and this derivation is where all of those formulas actually come from.
-
18:42 - 18:43Okay?
-
18:43 - 18:50Questions about this? Yeah? Student:[Inaudible] Instructor (Andrew Ng):Oh,
-
18:52 - 18:54I see. So
-
18:54 - 18:58it turns out that, yes, there's also a constraint that the phi J's must be greater than
-
18:58 - 19:00zero.
-
19:00 - 19:03It turns out that
-
19:03 - 19:07if you want you could actually write down the generalized Lagrangian
-
19:07 - 19:11incorporating all of these constraints as well and you can solve to [inaudible]
-
19:11 - 19:12these constraints.
-
19:12 - 19:16It turns out that in this particular derivation - actually it turns out that very often
-
19:16 - 19:17we
-
19:17 - 19:20find maximum likelihood estimates for multinomial distribution probabilities.
-
19:20 - 19:23It turns out that if you ignore these constraints and you just maximize the
-
19:23 - 19:24formula
-
19:24 - 19:25
-
19:25 - 19:28luckily you end up with
-
19:28 - 19:31values that actually are greater than or equal to zero
-
19:31 - 19:34and so if even ignoring those constraint you end up with parameters that are greater than or equal to
-
19:34 - 19:38zero that shows that that must be the correct solution
-
19:38 - 19:41because adding that constraint won't change anything.
-
19:41 - 19:45So this constraint - it turns out that if you
-
19:45 - 19:46ignore this constraint and just do
-
19:46 - 19:53what I wrote down you actually get the right answer.
-
19:56 - 19:57Okay?
-
19:57 - 19:58Great.
-
19:58 - 19:59So let me
-
19:59 - 20:01just
-
20:01 - 20:05very quickly talk about one more example of a mixture model. And
-
20:05 - 20:09the perfect example for this is imagine you want to do text clustering, right?
-
20:09 - 20:12So someone gives you a large set of documents
-
20:12 - 20:15and you want to cluster them together into cohesive
-
20:15 - 20:16topics. I
-
20:16 - 20:19think I mentioned the news website news.google.com.
-
20:19 - 20:22That's one application of text clustering
-
20:22 - 20:26where you might want to look at all of the news stories about
-
20:26 - 20:28today, all the news stories
-
20:28 - 20:32written by everyone, written by all the online news websites about whatever happened
-
20:32 - 20:33yesterday
-
20:33 - 20:37and there will be many, many different stories on the same thing, right?
-
20:37 - 20:41And by running a text-clustering algorithm you can group related
-
20:41 - 20:43documents together. Okay?
-
20:43 - 20:50So how do you apply the EM Algorithm to text clustering?
-
20:53 - 20:56I want to do this to illustrate an example
-
20:56 - 20:59in which you run the EM Algorithm
-
20:59 - 21:03on discrete-valued inputs where the input - where the training examples
-
21:03 - 21:04XI
-
21:04 - 21:06are discrete
-
21:06 - 21:09values. So what I want to talk about specifically
-
21:09 - 21:16is the mixture of naive Bayes model
-
21:17 - 21:20and depending on how much you remember about naive Bayes
-
21:20 - 21:24I talked about two event models. One was the multivariate Bernoulli
-
21:24 - 21:25event model. One
-
21:25 - 21:28was the multinomial event model.
-
21:28 - 21:31Today I'm gonna use the multivariate Bernoulli event model. If
-
21:31 - 21:35you don't remember what those terms mean anymore don't worry about it. I
-
21:35 - 21:36think the equation will still make sense. But in
-
21:36 - 21:43this setting we're given a training set X1
-
21:45 - 21:46through XM.
-
21:46 - 21:49So we're given M text documents
-
21:49 - 21:52where each
-
21:52 - 21:56
-
21:56 - 22:00XI is in zero one to the n. So each of our training examples
-
22:00 - 22:04is an n-dimensional bit vector,
-
22:04 - 22:07right? So this was the representation
-
22:07 - 22:11where XIJ was - it indicates whether
-
22:11 - 22:13word J
-
22:13 - 22:17appears in
-
22:17 - 22:23document
-
22:23 - 22:28I, right? And let's say that
-
22:28 - 22:32we're going to model ZI the - our latent random variable meaning
-
22:32 - 22:35our hidden random variable ZI will take on two values zero one
-
22:35 - 22:38so this means I'm just gonna find two clusters and you can generalize to however many
-
22:38 - 22:45clusters you want.
-
22:45 - 22:47So
-
22:47 - 22:52in the mixture of naive Bayes model we assume that
-
22:52 - 22:54ZI is distributed
-
22:54 - 22:57Bernoulli
-
22:57 - 22:58with some
-
22:58 - 23:03parameter phi so there's some probability of each document coming from cluster one
-
23:03 - 23:05or from cluster
-
23:05 - 23:10two. We assume that
-
23:10 - 23:14the probability
-
23:14 - 23:16
-
23:16 - 23:20of XI given ZI,
-
23:20 - 23:22right?
-
23:22 - 23:29We'll make the same naive Bayes assumption as we did before.
-
23:36 - 23:37Okay?
-
23:37 - 23:44And more specifically - well, excuse
-
23:53 - 24:00me,
-
24:01 - 24:02right. Okay.
-
24:02 - 24:07And so the probability of XIJ equals one given ZI equals say zero
-
24:07 - 24:10will be given by a parameter phi
-
24:10 - 24:12subscript J
-
24:12 - 24:18given Z equals
-
24:18 - 24:21zero. So if you take this chalkboard and if you
-
24:21 - 24:23take all instances of
-
24:23 - 24:25the alphabet Z
-
24:25 - 24:27and replace it with Y
-
24:27 - 24:32then you end up with exactly the same equation as I've written down for naive
-
24:32 - 24:39Bayes like a long time ago. Okay?
-
24:40 - 24:41And I'm not
-
24:41 - 24:45actually going to work out the mechanics deriving
-
24:45 - 24:46the EM Algorithm, but it turns out that
-
24:46 - 24:50if you take this joint distribution of X and Z
-
24:50 - 24:54and if you work out what the equations are for the EM algorithm for
-
24:54 - 24:56maximum likelihood estimation of parameters
-
24:56 - 24:59you find that
-
24:59 - 25:01in the E
-
25:01 - 25:03step you compute,
-
25:03 - 25:06you know, let's say these parameters - these weights
-
25:06 - 25:07WI
-
25:07 - 25:09which are going to be equal to
-
25:09 - 25:16the posterior distribution of ZI being equal to one conditioned on XI parameterized
-
25:16 - 25:19by your
-
25:19 - 25:24phi's and given your parameters
-
25:24 - 25:26
-
25:26 - 25:31and in the M step. Okay?
-
25:31 - 25:38And
-
26:17 - 26:21that's the equation you get in the M step.
-
26:21 - 26:22I
-
26:22 - 26:25mean, again, the equations themselves aren't too important. Just
-
26:25 - 26:27sort of convey - I'll
-
26:27 - 26:30give you a second to finish writing, I guess.
-
26:30 - 26:32And when you're done or finished writing
-
26:32 - 26:37take a look at these equations and see if they make intuitive sense to you
-
26:37 - 26:44why these three equations, sort of, sound like they might be the right thing to do. Yeah? [Inaudible] Say that again.
-
26:47 - 26:50Y -
-
26:50 - 26:54Oh, yes, thank you. Right.
-
26:54 - 26:58Sorry,
-
26:58 - 27:05just, everywhere I wrote Y I meant Z. Yeah? [Inaudible] in the
-
27:15 - 27:21first place?
-
27:21 - 27:22No. So
-
27:25 - 27:26what is it?
-
27:26 - 27:33Normally you initialize phi's to be something else, say randomly.
-
27:35 - 27:38So just like in naive Bayes we saw
-
27:38 - 27:41zero probabilities as a bad thing so the same reason you
-
27:41 - 27:48try to avoid zero probabilities, yeah. Okay?
-
27:48 - 27:52And so just the intuition behind these
-
27:52 - 27:54equations is
-
27:54 - 27:56in the E step
-
27:56 - 28:00the WI's - you're gonna take your best guess for whether the document came from
-
28:00 - 28:03cluster one or cluster
-
28:03 - 28:06zero, all right? This is very similar to the
-
28:06 - 28:09intuitions behind the EM Algorithm that we talked about in a previous lecture. So in the
-
28:09 - 28:10E step
-
28:10 - 28:15we're going to compute these weights that tell us do I think this document came from
-
28:15 - 28:19cluster one or cluster zero.
-
28:19 - 28:23And then in the M step I'm gonna say
-
28:23 - 28:27the numerator is the sum over all the elements of my training set
-
28:27 - 28:29of - so then
-
28:29 - 28:33informally, right? WI is one if I think the document came from cluster
-
28:33 - 28:34one
-
28:34 - 28:36and so this will
-
28:36 - 28:41essentially sum up all the times I saw word J
-
28:41 - 28:45in documents that I think are in cluster one.
-
28:45 - 28:49And these are sort of weighted by the actual probability that I think it came from cluster
-
28:49 - 28:50one
-
28:50 - 28:51and then I'll divide by
-
28:51 - 28:55- again, if all of these were ones and zeros then I'd be dividing by
-
28:55 - 28:58the actual number of documents I had in cluster one. So if all the WI's were
-
28:58 - 29:00
-
29:00 - 29:03either ones or zeroes then this would be exactly
-
29:03 - 29:07the fraction of documents that I saw in cluster one
-
29:07 - 29:10in which I also saw word J.
-
29:10 - 29:11Okay? But in the EM Algorithm
-
29:11 - 29:15you don't make a hard assignment decision about is this in cluster one or is this in
-
29:15 - 29:16cluster zero. You
-
29:16 - 29:17instead
-
29:17 - 29:24represent your uncertainty about cluster membership with the parameters WI. Okay? It
-
29:24 - 29:27actually turns out that when we actually implement this particular model it
-
29:27 - 29:29actually turns out that
-
29:29 - 29:32by the nature of this computation all the values of WI's
-
29:32 - 29:35will be very close to either one or zero so they'll
-
29:35 - 29:39be numerically almost indistinguishable from one's and zeroes. This is a property of
-
29:39 - 29:42naive Bayes. If you actually compute this probability
-
29:42 - 29:46from all those documents you find that WI is either
-
29:46 - 29:46
-
29:46 - 29:500.0001 or 0.999. It'll be amazingly close to either zero or one
-
29:50 - 29:53and so the M step - and so this is pretty much guessing whether each document is
-
29:53 - 29:55in cluster one or cluster
-
29:55 - 29:56zero and then
-
29:56 - 30:00using formulas that are very similar to maximum likelihood estimation
-
30:00 - 30:05for naive Bayes.
-
30:05 - 30:06Okay?
-
30:06 - 30:08Cool. Are there - and
-
30:08 - 30:10if
-
30:10 - 30:12some of these equations don't look that familiar to you anymore,
-
30:12 - 30:15sort of, go back and take another look at what you saw in naive
-
30:15 - 30:17Bayes
-
30:17 - 30:19and
-
30:19 - 30:22hopefully you can see the links there as well.
-
30:22 - 30:29Questions about this before I move on? Right, okay.
-
30:33 - 30:37Of course the way I got these equations was by turning through the machinery of
-
30:37 - 30:41the EM Algorithm, right? I didn't just write these out of thin air. The way you do this
-
30:41 - 30:42is by
-
30:42 - 30:45writing down the E step and the M step for this model and then in the M step
-
30:45 - 30:46setting derivatives
-
30:46 - 30:48equal to zero and solving from that so
-
30:48 - 30:50that's how you get the M step and the E
-
30:50 - 30:57step.
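The board equations themselves are not captured in this transcript, so here is a hedged sketch of one EM iteration for this two-cluster mixture of multivariate Bernoulli naive Bayes; the variable names are mine, and the smoothing mentioned above (to avoid zero probabilities) is left out for brevity.

```python
import numpy as np

def mixture_nb_em_step(X, phi, phi_j_given_z1, phi_j_given_z0):
    """One EM iteration for a 2-cluster mixture of multivariate-Bernoulli
    naive Bayes. X is an (m, n) 0/1 matrix, phi = p(z = 1), and
    phi_j_given_zc[j] = p(x_j = 1 | z = c). A sketch, not the board notation."""
    # E step: w[i] = p(z_i = 1 | x_i), by Bayes rule under the naive Bayes
    # conditional-independence assumption; computed in log space for stability.
    log_p1 = np.log(phi) + (X * np.log(phi_j_given_z1)
                            + (1 - X) * np.log(1 - phi_j_given_z1)).sum(axis=1)
    log_p0 = np.log(1 - phi) + (X * np.log(phi_j_given_z0)
                                + (1 - X) * np.log(1 - phi_j_given_z0)).sum(axis=1)
    w = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    # M step: weighted versions of the usual naive Bayes maximum likelihood
    # estimates (Laplace smoothing could be added, as discussed above).
    phi = w.mean()
    phi_j_given_z1 = (w[:, None] * X).sum(axis=0) / w.sum()
    phi_j_given_z0 = ((1 - w)[:, None] * X).sum(axis=0) / (1 - w).sum()
    return w, phi, phi_j_given_z1, phi_j_given_z0
```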
-
31:22 - 31:24So the last thing I want to do today
-
31:24 - 31:31is talk about the factor analysis model
-
31:32 - 31:33
-
31:33 - 31:35and
-
31:35 - 31:38the reason I want to do this is sort of two reasons because one is
-
31:38 - 31:43factor analysis is kind of a useful model. It's
-
31:43 - 31:44not
-
31:44 - 31:48as widely used as mixtures of Gaussian's and mixtures of naive Bayes maybe,
-
31:48 - 31:51but it's sort of useful.
-
31:51 - 31:55But the other reason I want to derive this model is that there are a
-
31:55 - 31:58few steps in the math that are more generally useful.
-
31:58 - 32:03In particular, where this is for factor analysis this would be an example
-
32:03 - 32:05in which we'll do EM
-
32:05 - 32:08where the latent random variable - where the hidden random variable Z -
-
32:08 - 32:11is going to be continuous valued.
-
32:11 - 32:15And so some of the math we'll see in deriving factor analysis will be a little bit
-
32:15 - 32:17different than what you saw before and they're just a -
-
32:17 - 32:22it turns out the full derivation for EM for factor analysis is sort of
-
32:22 - 32:23extremely long and complicated
-
32:23 - 32:25and so I won't
-
32:25 - 32:27inflict that on you in lecture today,
-
32:27 - 32:31but I will still be writing more equations than is - than you'll see me do
-
32:31 - 32:33in other lectures because there are, sort of,
-
32:33 - 32:37just a few steps in the factor analysis derivation so I'll physically
-
32:37 - 32:39
-
32:39 - 32:40illustrate
-
32:40 - 32:41it.
-
32:41 - 32:43So it's actually [inaudible] the model
-
32:43 - 32:48and it's really contrast to the mixture of Gaussians model, all right? So for the mixture of Gaussians model, which is
-
32:48 - 32:51our first model
-
32:51 - 32:53we had,
-
32:53 - 32:58that - well I actually motivated it by drawing the data set like this, right? That one of
-
32:58 - 33:04you has a data set that looks like this,
-
33:04 - 33:06right? So this was a problem where
-
33:06 - 33:09n is two-dimensional
-
33:09 - 33:14and you have, I don't know, maybe 50 or 100 training examples, whatever, right?
-
33:14 - 33:15And I said
-
33:15 - 33:18maybe you're given an unlabeled
-
33:18 - 33:21training set like this. Maybe you want to model this
-
33:21 - 33:24as a mixture of two Gaussians. Okay?
-
33:24 - 33:25And so a
-
33:25 - 33:27mixture of Gaussian models tend to be
-
33:27 - 33:29applicable
-
33:29 - 33:31where m
-
33:31 - 33:33is larger,
-
33:33 - 33:36and often much larger, than n where the number of training examples you
-
33:36 - 33:37have
-
33:37 - 33:41is at least as large as, and is usually much larger than, the
-
33:41 - 33:46dimension of the data. What I
-
33:46 - 33:49want to do is talk about a different problem where I
-
33:49 - 33:52want you to imagine what happens if
-
33:52 - 33:56either the dimension of your data is roughly equal to the number of
-
33:56 - 33:58examples you have
-
33:58 - 34:00or maybe
-
34:00 - 34:01the
-
34:01 - 34:04dimension of your data is maybe even much larger than
-
34:04 - 34:08the number of training examples you have. Okay? So how do
-
34:08 - 34:10you model
-
34:10 - 34:14such very high dimensional data? What you
-
34:14 - 34:16will see sometimes, right? If
-
34:16 - 34:19you run a plant or something, you run a factory, maybe you have
-
34:19 - 34:23a thousand measurements all through your plants, but you only have five - you only have
-
34:23 - 34:2620 days of data. So
-
34:26 - 34:27you can have 1,000 dimensional data,
-
34:27 - 34:31but only 20 examples of it. So
-
34:31 - 34:35given data that has this property - that we're given
-
34:35 - 34:40a training set of m examples. Well,
-
34:40 - 34:41what can you do to try to model
-
34:41 - 34:44the density of X?
-
34:44 - 34:48So one thing you can do is try to model it just as a single Gaussian, right? So unlike my
-
34:48 - 34:52mixtures of Gaussians, this time I'll model it as just a single Gaussian and
-
34:52 - 34:55say X is distributed Gaussian with mean mu
-
34:55 - 34:58and parameter
-
34:58 - 34:59sigma
-
34:59 - 35:00where
-
35:00 - 35:02sigma is going to be
-
35:02 - 35:05an n by n matrix
-
35:05 - 35:10and so if you work out the maximum likelihood estimate of the parameters
-
35:10 - 35:13you find that the maximum likelihood estimate for the mean
-
35:13 - 35:17is just the empirical mean of your training set,
-
35:17 - 35:21right. So that makes sense.
-
35:21 - 35:26And the maximum likelihood of the covariance matrix sigma
-
35:26 - 35:28will be
-
35:28 - 35:30
-
35:30 - 35:37
-
35:40 - 35:41this, all right? But it turns out that
-
35:41 - 35:45in this regime where the data is much higher dimensional - excuse me,
-
35:45 - 35:48where the data's dimension is much larger than the training examples
-
35:48 - 35:50you
-
35:50 - 35:51have if you
-
35:51 - 35:53compute the maximum likelihood estimate
-
35:53 - 35:55of the covariance matrix sigma
-
35:55 - 35:58you find that this matrix is singular.
-
35:58 - 35:59
-
35:59 - 36:03Okay? By singular, I mean that it doesn't have full rank or it has zero eigen-
-
36:03 - 36:05values, so it doesn't have - I hope
-
36:05 - 36:07one of those terms makes sense.
-
36:07 - 36:13And
-
36:13 - 36:20another way of saying this is that the matrix sigma will be non-invertible.
-
36:22 - 36:25And just in pictures,
-
36:25 - 36:29one concrete example is if, say,
-
36:29 - 36:33if N equals M equals two if you have two-dimensional data and
-
36:33 - 36:36you have two examples. So I'd have
-
36:36 - 36:40two training examples in two dimensions - this is
-
36:40 - 36:42X1 and
-
36:42 - 36:44X2. This is my unlabeled data.
-
36:44 - 36:48If you fit a Gaussian to this data set you find that
-
36:48 - 36:49- well
-
36:49 - 36:53you remember I used to draw contours of Gaussians as ellipses, right? So
-
36:53 - 36:56these are examples of different contours of Gaussians.
-
36:56 - 36:59You find that the maximum likely estimate
-
36:59 - 37:00Gaussian for this
-
37:00 - 37:04corresponds to a Gaussian where the contours are sort of infinitely
-
37:04 - 37:07thin and infinitely long in that direction.
-
37:07 - 37:11Okay? So in terms - so the contours will sort of be
-
37:11 - 37:13infinitely thin,
-
37:13 - 37:18right? And stretch infinitely long in that direction.
-
37:18 - 37:21And another way of saying it is that if
-
37:21 - 37:26you actually plug in the formula for the density
-
37:26 - 37:31of the
-
37:31 - 37:36Gaussian, which is
-
37:36 - 37:39this, you won't actually get a nice answer because
-
37:39 - 37:42the matrix sigma is non-invertible so sigma inverse is not
-
37:42 - 37:43defined
-
37:43 - 37:45and this is zero.
-
37:45 - 37:49So you'd have one over zero times E to the something
-
37:49 - 37:56involving the inverse of a non-invertible matrix, so not a good model.
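A quick numerical illustration of that point, as a sketch with made-up dimensions matching the plant example above:

```python
import numpy as np

# With fewer examples than dimensions (m < n), the maximum likelihood
# covariance estimate is singular, so the Gaussian density (which needs
# the inverse of Sigma) is not defined.
m, n = 20, 1000
X = np.random.randn(m, n)
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / m            # n x n, but rank at most m - 1

print(np.linalg.matrix_rank(Sigma))          # far less than n
print(np.linalg.det(Sigma))                  # ~0: Sigma is non-invertible
```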
-
37:58 - 38:02So let's do even better, right? So given this sort of data
-
38:02 - 38:09how do you model P of X?
-
38:28 - 38:35Well, one thing you could do is constrain sigma to be diagonal. So you have a
-
38:42 - 38:46covariance matrix that looks like this - okay?
-
38:46 - 38:49So in other words you get a constraint sigma
-
38:49 - 38:53to be this matrix, all
-
38:53 - 38:56right?
-
38:56 - 39:00With zeroes on the off diagonals. I hope
-
39:00 - 39:03this makes sense. These zeroes I've written down here denote that
-
39:03 - 39:06everything off the diagonal of this matrix is a
-
39:06 - 39:07zero.
-
39:07 - 39:09So
-
39:09 - 39:13the maximum likelihood estimate of the parameters will be pretty much what you'd
-
39:13 - 39:16expect,
-
39:16 - 39:20right?
-
39:20 - 39:21And
-
39:21 - 39:22in pictures
-
39:22 - 39:24what this means is that
-
39:24 - 39:26the [inaudible] the distribution
-
39:26 - 39:29with Gaussians
-
39:29 - 39:32whose contours are axis aligned. So
-
39:32 - 39:37that's one example of a Gaussian where the covariance is diagonal.
-
39:37 - 39:38
-
39:38 - 39:39And
-
39:39 - 39:43here's another example and
-
39:43 - 39:45so here's a third example. But
-
39:45 - 39:49often I've drawn examples of Gaussians where the covariance matrix has non-zero off-diagonals. Okay? And, I
-
39:49 - 39:51don't
-
39:51 - 39:54know,
-
39:54 - 39:57you could do this in model P of X, but this isn't very nice because you've now
-
39:57 - 40:00thrown away all the correlations
-
40:00 - 40:05between the different variables so
-
40:05 - 40:08the axis are X1 and X2, right? So you've thrown away - you're failing to capture
-
40:08 - 40:12any of the correlations or the relationships between
-
40:12 - 40:18any pair of variables in your data. Yeah? Is it - could you say again what does that do for the diagonal? Say again.
-
40:18 - 40:21The covariance matrix the diagonal,
-
40:21 - 40:22what does
-
40:22 - 40:23that again? I didn't quite understand what the examples mean. Instructor (Andrew Ng):Okay.
-
40:23 - 40:27So these are the contours of the Gaussian density that I'm drawing,
-
40:27 - 40:28right? So let's see -
-
40:28 - 40:32so suppose the covariance matrix is diagonal,
-
40:32 - 40:35then you can ask what is P of X
-
40:35 - 40:37parameterized by
-
40:37 - 40:39mu and sigma,
-
40:39 - 40:41right? If sigma is diagonal
-
40:41 - 40:44and so this will be some Gaussian bump,
-
40:44 - 40:46right? So not in - oh, boy. My drawing's really
-
40:46 - 40:49bad, but in two-D
-
40:49 - 40:53the density for Gaussian is like this bump shaped thing, right?
-
40:53 - 40:57So this is the density of the Gaussian
-
40:57 - 41:00- wow, and this is a really bad drawing. With those,
-
41:00 - 41:04your axis X1 and X2 and the height of this is P of X
-
41:04 - 41:08and so those figures over there are the contours of
-
41:08 - 41:09the density of the Gaussian.
-
41:09 - 41:16So those are the contours of this shape. Student:No, I don't
-
41:16 - 41:17mean
-
41:17 - 41:23the contour. What's special about these types? What makes them different than instead of general covariance matrix? Instructor (Andrew Ng):Oh, I see. Oh, okay, sorry. They're axis aligned so
-
41:23 - 41:24the main - these,
-
41:24 - 41:25let's see.
-
41:25 - 41:28So I'm not drawing a contour like this,
-
41:28 - 41:30right? Because the main axes of these
-
41:30 - 41:32are not aligned with the
-
41:32 - 41:34X1 and X2 axes so
-
41:34 - 41:36this corresponds to a
-
41:36 - 41:43Gaussian where the off-diagonals are non-zero, right? Cool. Okay.
-
41:43 - 41:47
-
41:47 - 41:50You could do this, and this sort of works. It turns out that with only two
-
41:50 - 41:52training examples
-
41:52 - 41:55you can learn a non-singular covariance matrix, but you've thrown away all
-
41:55 - 41:56
-
41:56 - 42:03of the correlation in the data so this is not a great model.
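For completeness, here is a small sketch of that diagonal-covariance maximum likelihood fit; the estimates are just the per-coordinate mean and variance (constraining further to sigma squared times the identity, discussed next, would just average the diagonal entries).

```python
import numpy as np

def fit_diagonal_gaussian(X):
    """Maximum likelihood fit of a Gaussian constrained to a diagonal
    covariance: each diagonal entry is the per-coordinate variance."""
    mu = X.mean(axis=0)
    diag = ((X - mu) ** 2).mean(axis=0)   # sigma_jj = (1/m) sum_i (x_ij - mu_j)^2
    return mu, np.diag(diag)
```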
-
42:06 - 42:08It turns out you can do something - well,
-
42:08 - 42:10actually, we'll come back and use this property later.
-
42:10 - 42:17But it turns out you can do something even more restrictive, which
-
42:18 - 42:19is
-
42:19 - 42:24you can constrain sigma to equal to sigma squared times the identity
-
42:24 - 42:28matrix. So in other words, you can constrain it to be diagonal
-
42:28 - 42:29matrix
-
42:29 - 42:32and moreover all the
-
42:32 - 42:35diagonal entries must be the same
-
42:35 - 42:38and so the cartoon for that is that you're
-
42:38 - 42:39constraining
-
42:39 - 42:43the contours of your Gaussian density to be
-
42:43 - 42:47circular. Okay? This is a sort of even harsher constraint to place in your model.
-
42:47 - 42:51
-
42:51 - 42:52
-
42:52 - 42:56So either of these versions, diagonal sigma or sigma being the, sort of, constant
-
42:56 - 42:57value diagonal
-
42:57 - 43:02are already strong assumptions, all right? So if you have enough data
-
43:02 - 43:08maybe you'd rather model at least a little bit of the correlation between your different variables.
-
43:08 - 43:11So the factor analysis model
-
43:11 - 43:14is one way to attempt to do that. So here's
-
43:14 - 43:21the idea.
-
43:35 - 43:36So this is
-
43:36 - 43:39how the factor analysis model
-
43:39 - 43:41models your data.
-
43:41 - 43:42We're going to
-
43:42 - 43:45assume that there is a latent random variable, okay?
-
43:45 - 43:47Which just
-
43:47 - 43:51means random variable Z. So Z is distributed
-
43:51 - 43:53Gaussian with mean zero and
-
43:53 - 43:56covariance identity
-
43:56 - 43:58where Z
-
43:58 - 44:00will be a d-dimensional vector now
-
44:00 - 44:01and
-
44:01 - 44:04D
-
44:04 - 44:11will be chosen so that it is lower than the dimension of your X's. Okay?
-
44:12 - 44:13And now
-
44:13 - 44:15I'm going to assume that
-
44:15 - 44:19X is given by -
-
44:19 - 44:21well let me write this. Each XI
-
44:21 - 44:23is distributed -
-
44:25 - 44:29actually, sorry, I'm just. We
-
44:29 - 44:34have
-
44:34 - 44:39to assume that conditioned on the value of Z,
-
44:39 - 44:42X is given by another Gaussian
-
44:42 - 44:47with mean given by mu plus
-
44:47 - 44:50lambda Z
-
44:50 - 44:54and covariance given by the matrix psi.
-
44:54 - 44:59So just to say the second line in
-
44:59 - 45:01an equivalent form, equivalently
-
45:01 - 45:06I'm going to model X as
-
45:06 - 45:09mu plus lambda Z
-
45:09 - 45:13plus a noise term epsilon where epsilon is
-
45:13 - 45:18Gaussian with mean zero
-
45:18 - 45:22and covariance psi.
-
45:22 - 45:23
-
45:23 - 45:25And so
-
45:25 - 45:28the parameters of this
-
45:28 - 45:30model
-
45:30 - 45:33are going to be a vector
-
45:33 - 45:35mu with its
-
45:35 - 45:38n-dimensional and
-
45:38 - 45:41matrix lambda,
-
45:41 - 45:42which is
-
45:42 - 45:44
-
45:44 - 45:46n by D
-
45:46 - 45:50and a covariance matrix psi,
-
45:50 - 45:52which is n by n,
-
45:52 - 45:56and I'm going to impose an additional constraint on si. I'm going to impose
-
45:56 - 45:57a constraint that psi
-
45:57 - 46:00is
-
46:00 - 46:03diagonal.
-
46:03 - 46:05Okay? So
-
46:05 - 46:08that was a form of definition - let me actually, sort of,
-
46:08 - 46:15give a couple of examples to make this more complete.
-
46:33 - 46:38So let's give a kind of example, suppose Z
-
46:38 - 46:40is one-dimensional
-
46:40 - 46:44and X is two-dimensional
-
46:44 - 46:47so let's see what
-
46:47 - 46:52this model - let's see a, sort of, specific instance of the factor analysis
-
46:52 - 46:53model
-
46:53 - 46:56and how we're modeling the joint - the distribution over X
-
46:56 - 46:58of - what this gives us
-
46:58 - 47:01in terms of a model over P of X, all right?
-
47:01 - 47:03So
-
47:03 - 47:10let's see. From this model to let me assume that
-
47:10 - 47:11lambda is
-
47:11 - 47:122, 1
-
47:12 - 47:16and psi, which has to be a diagonal matrix, remember,
-
47:16 - 47:20is this.
-
47:20 - 47:22Okay? So
-
47:22 - 47:29Z is one-dimensional so let me just draw a typical sample for Z, all right? So
-
47:31 - 47:34if I draw ZI
-
47:34 - 47:36from a Gaussian
-
47:36 - 47:40so that's a typical sample for what Z might look like
-
47:40 - 47:43and so I'm gonna - at any rate I'm gonna call
-
47:43 - 47:45this Z1,
-
47:45 - 47:50Z2, Z3 and so on. If this really were a typical sample the order of the
-
47:50 - 47:53Z's would be jumbled up, but I'm just ordering them like this
-
47:53 - 47:56just to make the example easier.
-
47:56 - 47:58So, yes, typical sample
-
47:58 - 48:05of the random variable Z from a Gaussian distribution with mean zero and covariance one.
-
48:06 - 48:09So -
-
48:09 - 48:14and with this example let me just set mu equals zero. It's to write the -
-
48:14 - 48:18just that it's easier to talk about.
-
48:18 - 48:21So
-
48:21 - 48:23lambda times Z,
-
48:23 - 48:26right? We'll take each of these numbers and multiply them by lambda.
-
48:26 - 48:29And so
-
48:29 - 48:31you find that
-
48:31 - 48:37all of the values for lambda times Z
-
48:37 - 48:38will lie on a straight line,
-
48:38 - 48:40all right? So, for example,
-
48:40 - 48:41
-
48:41 - 48:45this one here would be one, two, three, four, five, six, seven, I guess. So if this was
-
48:45 - 48:46Z7
-
48:46 - 48:49then this one here would be lambda
-
48:49 - 48:52times Z7
-
48:52 - 48:54and that's now a point in R2, because lambda's a two
-
48:54 - 48:57by one matrix.
-
48:57 - 49:00And so what I've drawn here is like a typical sample
-
49:00 - 49:04for lambda times Z
-
49:04 - 49:06and
-
49:06 - 49:10the final step for this is what a typical sample for X looks like. Well X is
-
49:10 - 49:12mu
-
49:12 - 49:13plus
-
49:13 - 49:15
-
49:15 - 49:17lambda Z plus epsilon
-
49:17 - 49:19where epsilon
-
49:19 - 49:24is Gaussian with mean zero and covariance given by psi, right?
-
49:24 - 49:28And so the last step to draw a typical sample for the random variables
-
49:28 - 49:29
-
49:29 - 49:31X
-
49:31 - 49:36I'm gonna take these - these are really the same as mu plus lambda Z because mu is zero in this example
-
49:36 - 49:37and
-
49:37 - 49:38around this point
-
49:38 - 49:42I'm going to place
-
49:42 - 49:44an axis aligned ellipse. Or
-
49:44 - 49:46in other words, I'm going to
-
49:46 - 49:48create a Gaussian distribution
-
49:48 - 49:51centered on this point
-
49:51 - 49:53and this I've drawn
-
49:53 - 49:56corresponds to one of the contours
-
49:56 - 49:59of my density for
-
49:59 - 50:03epsilon, right? And so you can imagine placing a little Gaussian bump here.
-
50:03 - 50:04And so
-
50:04 - 50:08I'll draw an example from this little Gaussian and
-
50:08 - 50:11let's say I get that point going,
-
50:11 - 50:18I do the same here and
-
50:18 - 50:19so on.
-
50:19 - 50:24So I draw a bunch of examples from these Gaussians
-
50:24 - 50:29and the -
-
50:29 - 50:32whatever they call it - the orange points I drew
-
50:32 - 50:35will comprise a typical sample for
-
50:35 - 50:42whether distribution of X is under this model. Okay? Yeah? Student:Would you add, like, mean? Instructor: Oh, say that again. Student:Do you add mean into that?
-
50:44 - 50:46Instructor (Andrew Ng):Oh,
-
50:46 - 50:49yes, you do. And in this example, I said you
-
50:49 - 50:50do a zero zero just to make
-
50:50 - 50:53it easier. If mu were something else you'd take the whole picture and you'd sort of shift
-
50:53 - 51:00it to whatever value of mu is. Yeah? Student:[Inaudible] horizontal line right there, which was Z. What did the X's, of
-
51:01 - 51:03course,
-
51:03 - 51:06what does that Y-axis corresponds to? Instructor (Andrew
-
51:06 - 51:08Ng):Oh, so this is Z
-
51:08 - 51:10is one-dimensional
-
51:10 - 51:13so here I'm plotting the typical sample
-
51:13 - 51:15for Z so this is like zero.
-
51:15 - 51:20So this is just the Z axis, right. So Z is one-dimensional data.
-
51:20 - 51:23So this line here is like a plot
-
51:23 - 51:26of a typical sample of
-
51:26 - 51:32values for Z. Okay?
-
51:32 - 51:33Yeah? Student:You have by axis, right? And
-
51:33 - 51:34the axis data pertains samples. Instructor (Andrew
-
51:34 - 51:36Ng):Oh, yes, right. Student:So sort of projecting
-
51:36 - 51:38them into
-
51:38 - 51:43that? Instructor (Andrew Ng):Let's not talk about projections yet, but, yeah, right. So these
-
51:43 - 51:45beige points - so that's like X1 and that's X2
-
51:45 - 51:47and so on, right? So the beige points are
-
51:47 - 51:51what I
-
51:51 - 51:52see. And so
-
51:52 - 51:55in reality all you ever get to see are the
-
51:55 - 51:56X's, but
-
51:56 - 52:00just like in the mixture of Gaussians model I tell a story about what I would
-
52:00 - 52:04imagine the data came from two Gaussians -
-
52:04 - 52:09there was a random variable Z that led to the generation of the X's from the two Gaussians.
-
52:09 - 52:12So the same way I'm sort of telling the story here, which all the algorithm actually
-
52:12 - 52:15sees are the orange points, but we're
-
52:15 - 52:19gonna tell a story about how the data came about and that story is
-
52:19 - 52:24what comprises the factor analysis model. Okay?
-
52:24 - 52:27So one of the ways to see the intuition of this model is that we're
-
52:27 - 52:31going to think of the model as one way
-
52:31 - 52:35just informally, not formally, but one way to think about this model is
-
52:35 - 52:38you can think of this factor analysis model
-
52:38 - 52:42as modeling the data from coming from a lower dimensional subspace
-
52:42 - 52:43more
-
52:43 - 52:45or less, so the data X here
-
52:45 - 52:48approximately lies on a one-D line
-
52:48 - 52:50and then plus a little bit of noise - plus a little bit
-
52:50 - 52:53of random noise so the X isn't exactly on this one D line.
-
52:53 - 52:57That's one informal way of thinking about factor analysis.
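To connect the pictures to the equations, here is a small sketch that generates data exactly the way the factor analysis story above says it is generated; the example lambda value matches the earlier 2, 1 example, and everything else (names, seed) is just illustrative.

```python
import numpy as np

def sample_factor_analysis(m, mu, Lam, Psi, seed=0):
    """Draw m samples from the factor analysis model described above:
    z ~ N(0, I_d), x = mu + Lambda z + eps, eps ~ N(0, Psi), Psi diagonal.
    For the d = 1, n = 2 example, Lam could be np.array([[2.0], [1.0]])."""
    rng = np.random.default_rng(seed)
    n, d = Lam.shape
    Z = rng.standard_normal((m, d))                            # points on the low-dim subspace
    eps = rng.standard_normal((m, n)) * np.sqrt(np.diag(Psi))  # axis-aligned noise
    X = mu + Z @ Lam.T + eps                                   # observed "pancake" data
    return X, Z
```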
-
52:57 - 53:04We're
-
53:09 - 53:16not doing great on time.
-
53:18 - 53:20Well, let's do this.
-
53:20 - 53:23So let me just do one more quick example,
-
53:23 - 53:24which is,
-
53:24 - 53:26
-
53:26 - 53:29in this example,
-
53:29 - 53:35let's say Z is in R2
-
53:35 - 53:38and X is in R3, right?
-
53:38 - 53:40And so
-
53:40 - 53:45in this example Z, your data Z now lies in 2-D
-
53:45 - 53:49and so let me draw this on a sheet of paper. Okay?
-
53:49 - 53:51So let's say the
-
53:51 - 53:54axis of my paper are the Z1 and Z2 axis
-
53:54 - 53:56and so
-
53:56 - 53:58here is a typical sample
-
53:58 - 54:01of point Z, right?
-
54:01 - 54:05And so we'll then take the sample Z -
-
54:05 - 54:11well, actually let me draw this here as well. All right. So
-
54:11 - 54:13this is a typical sample for Z going on the Z1 and Z2 axis and
-
54:13 - 54:17I guess the origin would be here.
-
54:17 - 54:20So center around zero.
-
54:20 - 54:24And then we'll take those and map it to mu
-
54:24 - 54:25plus
-
54:25 - 54:26lambda Z
-
54:26 - 54:30and what that means is if you imagine the free space of this classroom is R3.
-
54:30 - 54:35What that means is we'll take this sample of Z's and we'll map it to
-
54:35 - 54:39position in free space. So we'll take this sheet of paper and move it somewhere and some
-
54:39 - 54:42orientation in 3-D space.
-
54:42 - 54:44And the last step is
-
54:44 - 54:49you have X equals mu plus lambda Z plus
-
54:49 - 54:52epsilon and so you would take the set of the points which align in some plane
-
54:52 - 54:54in our 3-D space
-
54:54 - 54:56and add a little bit of noise to these
-
54:56 - 54:58and the noise will, sort of, come from
-
54:58 - 55:01Gaussians that are
-
55:01 - 55:04axis
-
55:04 - 55:11aligned. Okay? So you end up with a data set that's sort of like a fat pancake or a little bit of fuzz off your pancake.
-
55:12 - 55:13
-
55:13 - 55:15So that's a model
-
55:15 - 55:22- let's actually talk about how to fit the parameters of the model. Okay?
-
55:28 - 55:32In order to
-
55:32 - 55:34describe how to fit the model I'm going to
-
55:34 - 55:36need to
-
55:36 - 55:40re-write Gaussians in a very slightly different way.
-
55:40 - 55:43So, in particular,
-
55:43 - 55:45let's say I have a vector X and
-
55:45 - 55:51I'm gonna use this notation to denote partition vectors, right? X1, X2
-
55:51 - 55:53where
-
55:53 - 55:57if X1 is say an r-dimensional vector
-
55:57 - 55:59then X2 is
-
55:59 - 56:00an s-dimensional vector
-
56:00 - 56:02and X
-
56:02 - 56:05is an R plus S
-
56:05 - 56:09dimensional vector. Okay? So I'm gonna use this notation to denote just
-
56:09 - 56:13the taking of vector and, sort of, partitioning the vector into two halves. The
-
56:13 - 56:15first R elements followed by
-
56:15 - 56:17the last
-
56:17 - 56:22S elements.
-
56:22 - 56:26So
-
56:26 - 56:29let's say you have X
-
56:29 - 56:35coming from a Gaussian distribution with mean mu and covariance sigma
-
56:35 - 56:37where
-
56:37 - 56:39mu
-
56:39 - 56:42is itself a partition vector.
-
56:42 - 56:47So break mu up into two pieces mu1 and mu2
-
56:47 - 56:50and the covariance matrix sigma
-
56:50 - 56:55is now a partitioned matrix.
-
56:55 - 56:57Okay? So what this means is
-
56:57 - 56:59that you take the covariance matrix sigma
-
56:59 - 57:01and I'm going to break it up into four blocks,
-
57:01 - 57:03right? And so the
-
57:03 - 57:06dimension of this is there will be R elements here
-
57:06 - 57:08and there will be
-
57:08 - 57:10S elements here
-
57:10 - 57:14and there will be R elements here. So, for example, sigma 1, 2 will
-
57:14 - 57:16be an
-
57:16 - 57:20R by S matrix.
-
57:20 - 57:27It's R elements tall and S elements wide.
-
57:32 - 57:36So this Gaussian I've written down is really a joint distribution over a lot of variables, right?
-
57:36 - 57:40So X is a vector so this is a joint distribution
-
57:40 - 57:43over X1 through X of -
-
57:43 - 57:47over XN or over X of R plus S.
-
57:47 - 57:51We can then ask what are the marginal and conditional distributions of this
-
57:51 - 57:53Gaussian?
-
57:53 - 57:54So, for example,
-
57:54 - 57:58with my Gaussian, I know what P of X is, but can
-
57:58 - 58:03I compute the marginal distribution of X1, right. And so P of X1 is just equal to,
-
58:03 - 58:07of course, integrate our X2, P of X1
-
58:07 - 58:09comma X2
-
58:09 - 58:11DX2.
-
58:11 - 58:14And if you actually perform that integration - that computation - you
-
58:14 - 58:15find
-
58:15 - 58:18that P of X1,
-
58:18 - 58:20I guess, is Gaussian
-
58:20 - 58:21with mean
-
58:21 - 58:23given by mu1
-
58:23 - 58:25and sigma 1, 1. All right.
-
58:25 - 58:26So this is sort
-
58:26 - 58:30of no surprise. The marginal distribution of a
-
58:30 - 58:31Gaussian
-
58:31 - 58:33is itself the
-
58:33 - 58:33Gaussian and
-
58:33 - 58:35you just take out the
-
58:35 - 58:36relevant
-
58:36 - 58:37sub-blocks of
-
58:37 - 58:42the covariance matrix and the relevant sub-vector of the
-
58:42 - 58:47mean vector mu.
-
58:47 - 58:50You can also compute conditionals. You can also -
-
58:50 - 58:52what does P of X1
-
58:52 - 58:53given
-
58:53 - 58:57a specific value for X2, right?
-
58:57 - 59:03And so the way you compute that is, well, the usual way P of X1
-
59:03 - 59:05comma X2
-
59:05 - 59:08divided by P of X2, right?
-
59:08 - 59:10And so
-
59:10 - 59:14you know what both of these formulas are, right? The numerator - well,
-
59:14 - 59:17this is just a usual
-
59:17 - 59:20Gaussian that your joint distribution over X1, X2 is a Gaussian with
-
59:20 - 59:22mean mu and covariance sigma
-
59:22 - 59:24and
-
59:24 - 59:27this
-
59:27 - 59:30by that marginalization operation I talked about is that.
-
59:30 - 59:31
-
59:31 - 59:35So if you actually plug in the formulas for these two Gaussians and if you simplify
-
59:35 - 59:36
-
59:36 - 59:39the simplification step is actually fairly non-trivial.
-
59:39 - 59:42If you haven't seen it before this will actually be - this will actually be
-
59:42 - 59:44somewhat difficult to do.
-
59:44 - 59:46But if you
-
59:46 - 59:50plug this in for Gaussian and simplify that
-
59:50 - 59:51expression
-
59:51 - 59:53you find that
-
59:53 - 59:56conditioned on the value of
-
59:56 - 60:01X2, X1 is - the distribution of X1 conditioned on X2 is itself going to be
-
60:01 - 60:06Gaussian
-
60:06 - 60:08and it will have mean mu
-
60:08 - 60:14of 1 given 2 and covariance
-
60:14 - 60:16
-
60:16 - 60:18sigma of 1 given 2 where - well, so about the simplification and derivation I'm not gonna show
-
60:18 - 60:21the formula for mu given - of mu of
-
60:21 - 60:28one given 2 is given by this
-
60:32 - 60:39and I
-
60:40 - 60:44think
-
60:44 - 60:49the sigma of 1 given 2 is given by that. Okay?
-
60:49 - 60:52So these are just
-
60:52 - 60:56useful formulas to know for how to find the conditional distributions
-
60:56 - 60:59of the Gaussian and the marginal distributions of a Gaussian.
-
60:59 - 61:03I won't actually show the derivation for this. Student:Could you
-
61:03 - 61:10repeat the [inaudible]? Instructor
-
61:12 - 61:16(Andrew Ng):Sure. So this one on the left mu of 1 given 2
-
61:16 - 61:17equals
-
61:17 - 61:19mu1 plus
-
61:19 - 61:21sigma 1,2,
-
61:21 - 61:23sigma 2,2 inverse
-
61:23 - 61:25times X2 minus mu2
-
61:25 - 61:30and this is sigma 1 given 2 equals sigma 1,1
-
61:30 - 61:32minus sigma 1,2
-
61:32 - 61:33sigma 2,2 inverse
-
61:33 - 61:34sigma 2,1. Okay?
-
61:34 - 61:40These are also in the
-
61:40 - 61:47lecture
-
61:53 - 62:00notes. Shoot. Not quite where I was hoping to be on time.
-
62:01 - 62:08Well, actually it is. Okay?
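Here is a sketch of those partition formulas in code, so the marginal and conditional can be seen side by side; the names are mine, and numpy's solve is used in place of an explicit matrix inverse.

```python
import numpy as np

def gaussian_marginal_and_conditional(mu, Sigma, r, x2):
    """For x = [x1; x2] ~ N(mu, Sigma), with x1 the first r coordinates:
    returns the marginal of x1 and the conditional of x1 given x2,
    using the partition formulas just stated."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    # Marginal: x1 ~ N(mu1, Sigma_{1,1})
    marginal = (mu1, S11)
    # Conditional: x1 | x2 ~ N(mu_{1|2}, Sigma_{1|2})
    mu_1g2 = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    Sigma_1g2 = S11 - S12 @ np.linalg.solve(S22, S21)
    return marginal, (mu_1g2, Sigma_1g2)
```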
-
62:19 - 62:21So it
-
62:21 - 62:26turns out - I think I'll skip this in the interest of time. So it turns out that -
-
62:26 - 62:30well, so let's go back and use these in the factor analysis model, right?
-
62:30 - 62:32It turns out that
-
62:32 - 62:34you can go back
-
62:34 - 62:35and
-
62:39 - 62:46oh,
-
62:46 - 62:49do I want to do this? I kind of need this
-
62:49 - 62:52though. So let's go back and figure out
-
62:52 - 62:58just what the joint distribution factor analysis assumes on Z and X's. Okay?
-
62:58 - 62:59So
-
62:59 - 63:00
-
63:00 - 63:03under the factor analysis model
-
63:03 - 63:07Z and X, the random variables Z and X
-
63:07 - 63:12have some joint distribution given by - I'll write this
-
63:12 - 63:13vector as mu
-
63:13 - 63:14ZX
-
63:14 - 63:17and some covariance matrix sigma.
-
63:17 - 63:21So let's go back and figure out what mu ZX is and what sigma is and I'll
-
63:21 - 63:23do this so that
-
63:23 - 63:25we'll get a little bit more practice with
-
63:25 - 63:29partition vectors and partition matrixes.
-
63:29 - 63:31So just to remind you, right? You
-
63:31 - 63:35have to have Z as Gaussian with mean zero and covariance identity
-
63:35 - 63:41and X is mu plus lambda Z plus epsilon where epsilon is Gaussian with
-
63:41 - 63:43mean zero
-
63:43 - 63:47covariant si. So I have the - I'm just writing out the same equations again.
-
63:47 - 63:53So let's first figure out what this vector mu ZX is. Well,
-
63:53 - 63:55the expected value of Z
-
63:55 - 63:56is
-
63:56 - 63:57zero
-
63:57 - 63:58
-
63:58 - 64:04and, again, as usual I'll often drop the square brackets around here.
-
64:04 - 64:06And the expected value of X is -
-
64:06 - 64:09well, the expected value of mu
-
64:09 - 64:11plus lambda
-
64:11 - 64:14Z
-
64:14 - 64:17plus epsilon. So these two terms have zero expectation
-
64:17 - 64:20and so the expected value of X
-
64:20 - 64:23is just mu
-
64:23 - 64:24
-
64:24 - 64:27and so that vector
-
64:27 - 64:28mu ZX, right,
-
64:28 - 64:31in my parameter for the Gaussian
-
64:31 - 64:34this is going to be
-
64:34 - 64:37the expected value of this partition vector
-
64:37 - 64:40given by this partition Z and X
-
64:40 - 64:43and so that would just be
-
64:43 - 64:44zero
-
64:44 - 64:47followed by mu. Okay?
-
64:47 - 64:48And so that's
-
64:48 - 64:51a d-dimensional zero
-
64:51 - 64:57followed by an n-dimensional mu.
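[Editor's note: a minimal sketch, in NumPy, of the generative model just described - Z Gaussian with mean zero and identity covariance, epsilon Gaussian with mean zero and covariance Psi, and X = mu + lambda Z + epsilon - together with the stacked mean vector, a d-dimensional zero followed by the n-dimensional mu. The dimensions and parameter values are made up for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 2, 4                                      # latent and observed dimensions (made up)
mu = rng.normal(size=n)                          # mean of X
Lam = rng.normal(size=(n, d))                    # factor loading matrix lambda
Psi = np.diag(rng.uniform(0.1, 0.5, size=n))     # diagonal noise covariance Psi

# Generative model: Z ~ N(0, I_d), epsilon ~ N(0, Psi), X = mu + lambda Z + epsilon
m = 100_000
Z = rng.normal(size=(m, d))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
X = mu + Z @ Lam.T + eps

# Stacked mean of [Z; X]: a d-dimensional zero followed by the n-dimensional mu
mu_zx = np.concatenate([np.zeros(d), mu])
print(mu_zx)
print(np.concatenate([Z.mean(axis=0), X.mean(axis=0)]))   # empirical means, close to mu_zx
```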
-
64:57 - 65:02Now let's work out what the covariance matrix sigma is.
-
65:02 - 65:09So
-
65:21 - 65:24the covariance matrix sigma
-
65:24 - 65:28- if you work out
-
65:28 - 65:35the definition of a partitioned matrix. So this
-
65:44 - 65:46is
-
65:46 - 65:53going to be a partitioned matrix.
-
66:03 - 66:04Okay? It will be -
-
66:04 - 66:06so the covariance matrix sigma
-
66:06 - 66:10will comprise four blocks like that
-
66:10 - 66:14and so the upper left most block, which I write as sigma 1,1 -
-
66:14 - 66:16well, that uppermost left block
-
66:16 - 66:19is just
-
66:19 - 66:21the covariance matrix of Z,
-
66:21 - 66:25which we know is the identity. I was
-
66:25 - 66:28gonna show you briefly how to derive some of the other blocks, right, so
-
66:28 - 66:31sigma 1,2 that's the
-
66:31 - 66:38upper - oh, actually,
-
66:41 - 66:44excuse me. Sigma 2,1
-
66:44 - 66:47which is the lower left block
-
66:47 - 66:49that's E
-
66:49 - 66:51of X
-
66:51 - 66:55minus EX times Z minus EZ transpose.
-
66:55 - 66:59So X is equal to mu
-
66:59 - 67:02plus lambda Z
-
67:02 - 67:03plus
-
67:03 - 67:05epsilon and then minus EX is
-
67:05 - 67:07minus mu and then
-
67:07 - 67:11times Z
-
67:11 - 67:14because the expected value of Z is zero, right,
-
67:14 - 67:19so that's equal to zero.
-
67:19 - 67:24And so if you simplify - or if you expand this out
-
67:24 - 67:27the plus mu and the minus mu cancel out
-
67:27 - 67:30and so you have the expected value
-
67:30 - 67:33of lambda
-
67:33 - 67:35ZZ transpose
-
67:35 - 67:40plus the
-
67:40 - 67:47expected value of epsilon Z transpose, which is
-
67:53 - 67:57equal to
-
67:57 - 68:00that, which is just equal to
-
68:00 - 68:07lambda times the identity matrix. Okay? Does that make
-
68:07 - 68:11sense? Cause
-
68:11 - 68:15this term is equal to zero.
-
68:15 - 68:22Epsilon and Z are independent and both have zero expectation, so the second term is zero. Well,
-
68:48 - 68:53so the final block is sigma 2,2 which is equal to the expected value of
-
68:53 - 68:59mu plus lambda Z
-
68:59 - 69:02plus epsilon minus mu
-
69:02 - 69:04times the same thing transposed, right?
-
69:04 - 69:08Is
-
69:08 - 69:09equal to - and
-
69:09 - 69:16I won't do this, but this simplifies to lambda lambda transpose plus Psi. Okay?
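[Editor's note: a short sketch assembling the blocks just derived - sigma 1,1 = I, sigma 2,1 = lambda (so sigma 1,2 = lambda transpose), and sigma 2,2 = lambda lambda transpose plus Psi - into the full joint covariance, and checking it against the empirical covariance of sampled [Z; X] vectors. All parameter values are made up; with this many samples the check should typically print True.]

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 4                                      # made-up dimensions, as in the earlier sketch
mu = rng.normal(size=n)
Lam = rng.normal(size=(n, d))
Psi = np.diag(rng.uniform(0.1, 0.5, size=n))

# Joint distribution of [Z; X]: mean [0; mu] and block covariance
#   [[ I,        lambda^T              ],
#    [ lambda,   lambda lambda^T + Psi ]]
mu_zx = np.concatenate([np.zeros(d), mu])
Sigma_joint = np.block([
    [np.eye(d), Lam.T],
    [Lam,       Lam @ Lam.T + Psi],
])

# Empirical check by sampling the generative model
m = 200_000
Z = rng.normal(size=(m, d))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
X = mu + Z @ Lam.T + eps
ZX = np.hstack([Z, X])
print(np.allclose(np.cov(ZX, rowvar=False), Sigma_joint, atol=0.05))
```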
-
69:18 - 69:21So
-
69:21 - 69:24putting all this together
-
69:24 - 69:29this tells us that the joint distribution of this vector ZX
-
69:29 - 69:31is going to be Gaussian
-
69:31 - 69:36with mean vector given by that,
-
69:36 - 69:37which we worked out previously.
-
69:37 - 69:39So
-
69:39 - 69:42this is the mu ZX that we worked out previously,
-
69:42 - 69:44and covariance matrix
-
69:44 - 69:51given by
-
69:54 - 70:01that. Okay? So
-
70:05 - 70:07in principle - let's
-
70:07 - 70:10see, so the parameters of our model
-
70:10 - 70:12are mu,
-
70:12 - 70:12lambda,
-
70:12 - 70:14and Psi.
-
70:14 - 70:15And so
-
70:15 - 70:19in order to find the parameters of this model
-
70:19 - 70:25we're given a training set of m examples
-
70:25 - 70:29and so we'd like to do maximum likelihood estimation of the parameters.
-
70:29 - 70:34And so in principle one thing you could do is you can actually write down
-
70:34 - 70:37what P of XI is and,
-
70:37 - 70:40right, so P of XI
-
70:40 - 70:43is actually -
-
70:43 - 70:46the distribution of X, right? If, again,
-
70:46 - 70:49you can marginalize this Gaussian
-
70:49 - 70:55and so the distribution of X, which is the lower half of this partition vector
-
70:55 - 70:57is going to have
-
70:57 - 70:58mean mu
-
70:58 - 71:04and covariance given by lambda lambda transpose plus Psi. Right?
-
71:04 - 71:05So that's
-
71:05 - 71:07the distribution
-
71:07 - 71:12that we're using to model P of X.
-
71:12 - 71:17And so in principle one thing you could do is actually write down the log likelihood of
-
71:17 - 71:20your parameters,
-
71:20 - 71:24right? Which is just the product over I of - well, it
-
71:24 - 71:25is the sum over
-
71:25 - 71:26I
-
71:26 - 71:30log P of XI
-
71:30 - 71:31where P of XI
-
71:31 - 71:32will be given
-
71:32 - 71:36by this Gaussian density, right. And
-
71:36 - 71:40I'm using theta as a shorthand to denote all of my parameters.
-
71:40 - 71:43And so you actually know what the density for Gaussian is
-
71:43 - 71:50and so you can say P of XI is this Gaussian with mean mu and covariance
-
71:50 - 71:53given by this lambda lambda transpose plus Psi.
-
71:53 - 71:55So you can write down the log likelihood
-
71:55 - 71:56of your parameters
-
71:56 - 71:57as follows
-
71:57 - 72:01and you can try to take derivatives of your log likelihood with respect to your
-
72:01 - 72:02parameters
-
72:02 - 72:05and maximize the log likelihood, all right.
-
72:05 - 72:07It turns out that if you do that you end up with
-
72:07 - 72:11sort of an intractable optimization problem - or at least one
-
72:11 - 72:13that you - excuse me, you end up with an
-
72:13 - 72:17optimization problem that you will not be able to find analytic, sort of,
-
72:17 - 72:19closed form solutions to.
-
72:19 - 72:22So if you say my model of X is this and you want to find your
-
72:22 - 72:24maximum likelihood parameter estimate,
-
72:24 - 72:26you won't be able to find
-
72:26 - 72:29the maximum likelihood estimate of the parameters in closed form.
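[Editor's note: even though there is no closed-form maximizer, the log likelihood itself is easy to write down and evaluate - it is the sum over i of log N(x(i); mu, lambda lambda^T + Psi). A minimal sketch using SciPy follows; the function name, the parameter values, and the data are all made up for illustration.]

```python
import numpy as np
from scipy.stats import multivariate_normal

def factor_analysis_log_likelihood(X, mu, Lam, Psi):
    """Log likelihood of the data under the marginal  x ~ N(mu, lambda lambda^T + Psi)."""
    cov = Lam @ Lam.T + Psi
    return multivariate_normal(mean=mu, cov=cov).logpdf(X).sum()

# Made-up parameters and data, just to show the call.
rng = np.random.default_rng(0)
n, d, m = 4, 2, 500
mu = np.zeros(n)
Lam = rng.normal(size=(n, d))
Psi = np.diag(rng.uniform(0.1, 0.5, size=n))
X = rng.multivariate_normal(mu, Lam @ Lam.T + Psi, size=m)

print(factor_analysis_log_likelihood(X, mu, Lam, Psi))
```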
-
72:29 - 72:30
-
72:30 - 72:32So what I would have liked to do
-
72:32 - 72:39is - well,
-
72:40 - 72:46so in order to fit parameters to this model
-
72:46 - 72:52what we'll actually do is use the EM Algorithm
-
72:52 - 72:56where in
-
72:56 - 73:00the E step, right?
-
73:00 - 73:07We'll compute that
-
73:08 - 73:11
-
73:11 - 73:15and this formula looks the same except that one difference is that now
-
73:15 - 73:18Z is a continuous random variable
-
73:18 - 73:19and so
-
73:19 - 73:20in the E step
-
73:20 - 73:24we actually have to find the density QI of ZI - where the, sort of, E step actually
-
73:24 - 73:26requires that we find
-
73:26 - 73:28the posterior distribution
-
73:28 - 73:32- that is, the density of the random variable ZI -
-
73:32 - 73:34and then the M step
-
73:34 - 73:38will then perform
-
73:38 - 73:41the following maximization
-
73:41 - 73:42
-
73:42 - 73:45where, again, because Z is now
-
73:45 - 73:50continuous we
-
73:50 - 73:57now need to integrate over Z. Okay? Where
-
73:59 - 74:02in the M step now because ZI was continuous we now have an
-
74:02 - 74:06integral over Z rather than a sum.
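[Editor's note: the M-step for factor analysis is worked out in the next lecture, but the E-step posterior can already be written down by applying the conditional-Gaussian formulas from earlier in this lecture to the joint of [Z; X] derived above (sigma 1,1 = I, sigma 1,2 = lambda transpose, sigma 2,2 = lambda lambda transpose plus Psi). A minimal sketch of just that E-step computation, with made-up parameter values:]

```python
import numpy as np

def e_step_posterior(x, mu, Lam, Psi):
    """Posterior p(z | x) under factor analysis: a Gaussian with

         mu_{z|x}    = lambda^T (lambda lambda^T + Psi)^{-1} (x - mu)
         Sigma_{z|x} = I - lambda^T (lambda lambda^T + Psi)^{-1} lambda

    obtained from the conditional-Gaussian formulas applied to the joint of [Z; X].
    """
    d = Lam.shape[1]
    S22_inv = np.linalg.inv(Lam @ Lam.T + Psi)
    mu_z_given_x = Lam.T @ S22_inv @ (x - mu)
    Sigma_z_given_x = np.eye(d) - Lam.T @ S22_inv @ Lam
    return mu_z_given_x, Sigma_z_given_x

# Made-up parameters and a single observation, just to show the call.
rng = np.random.default_rng(0)
n, d = 4, 2
mu = np.zeros(n)
Lam = rng.normal(size=(n, d))
Psi = np.diag(rng.uniform(0.1, 0.5, size=n))
x = rng.normal(size=n)

mu_z, Sigma_z = e_step_posterior(x, mu, Lam, Psi)
print(mu_z)        # posterior mean of z given x
print(Sigma_z)     # posterior covariance of z given x
```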
-
74:06 - 74:09Okay? I was hoping to go a little bit further in deriving these things, but I don't have
-
74:09 - 74:11time today so we'll wrap that up
-
74:11 - 74:15in the next lecture, but before I close let's check if there are questions
-
74:15 - 74:16
-
74:16 - 74:23about the whole factor analysis model. Okay.
-
74:28 - 74:31So we'll come back in the next lecture;
-
74:31 - 74:35I will wrap up this model because I want to go a little bit deeper into the E
-
74:35 - 74:37and M steps, as there are some
-
74:37 - 74:40tricky parts for the factor analysis model specifically.
-
74:40 - 74:41Okay. I'll see you in a
-
74:41 - 74:41couple of days.