
Lecture 13 | Machine Learning (Stanford)

  • 0:09 - 0:10
  • 0:10 - 0:14
    This presentation is delivered by the Stanford Center for Professional
  • 0:14 - 0:21
    Development.
  • 0:24 - 0:25
    So
  • 0:25 - 0:30
    welcome back, and what I want to do today is
  • 0:30 - 0:34
    continue our discussions of the EM Algorithm, and in particular, I want to talk
  • 0:34 - 0:35
    about
  • 0:35 - 0:36
    the EM
  • 0:36 - 0:40
    formulation that we derived in the previous lecture and apply it to the
  • 0:40 - 0:42
    mixture of Gaussians model,
  • 0:42 - 0:45
    apply it to a different model and a mixture of naive Bayes model,
  • 0:45 - 0:49
    and then the launch part of today's lecture will be on the factor
  • 0:49 - 0:51
    analysis algorithm, which will also use the EM.
  • 0:51 - 0:56
    And as part of that, we'll actually take a brief digression to talk a little bit about
  • 0:56 - 1:01
    sort of useful properties of Gaussian distributions.
  • 1:01 - 1:03
    So just to recap where we are.
  • 1:03 - 1:10
    In the previous lecture, I started to talk about unsupervised learning, which was
  • 1:10 - 1:13
    machine-learning problems, where you're given
  • 1:13 - 1:15
    an unlabeled training set
  • 1:15 - 1:16
    comprising m examples here, right?
  • 1:16 - 1:18
    And then " so the fact that there are no labels;
  • 1:18 - 1:23
that's what makes this unsupervised
  • 1:23 - 1:25
learning. So
  • 1:25 - 1:31
    one problem that I talked about last time was what if
  • 1:31 - 1:34
    you're given a data set that looks like this
  • 1:34 - 1:36
    and you want to model
  • 1:36 - 1:40
the density P of X from which you think the data
  • 1:40 - 1:42
    had been drawn,
  • 1:42 - 1:46
and so with a data set like this, maybe you think it came from a mixture of two Gaussians, and we started
  • 1:46 - 1:50
    to talk about an algorithm
  • 1:50 - 1:54
    for fitting a mixture of Gaussians model, all right? And so
  • 1:55 - 1:58
    we said that we would model the
  • 1:58 - 1:59
density of X, P of X,
  • 1:59 - 2:02
    as sum over Z
  • 2:02 - 2:03
P of X
  • 2:03 - 2:05
    given Z
  • 2:05 - 2:07
    times P of Z
  • 2:07 - 2:10
where this latent random variable, meaning this hidden random
  • 2:10 - 2:12
    variable Z
  • 2:12 - 2:13
    indicates
  • 2:13 - 2:16
    which of the two Gaussian distributions each of your data points came from
  • 2:16 - 2:19
    and so we have,
  • 2:19 - 2:24
you know, Z was
  • 2:24 - 2:27
multinomial with parameter phi
  • 2:27 - 2:29
and X conditioned on
  • 2:29 - 2:33
Z coming from the j-th
  • 2:33 - 2:36
    Gaussian
  • 2:36 - 2:39
    was given by Gaussian
  • 2:39 - 2:44
with mean mu J and covariance sigma J, all right? So, like I said
  • 2:44 - 2:48
    at the beginning of the previous lecture, I just talked about a very specific algorithm that
  • 2:48 - 2:51
    I sort of pulled out of the air
  • 2:51 - 2:55
for fitting the parameters of this model - namely phi, mu and
  • 2:55 - 2:56
  • 2:56 - 3:00
    sigma, but then in the second half of the previous lecture I talked about what's called the
  • 3:00 - 3:01
    EM Algorithm in which
  • 3:01 - 3:04
    our
  • 3:04 - 3:09
goal is maximum likelihood estimation of the parameters. So we want to maximize in terms of
  • 3:09 - 3:10
    theta,
  • 3:10 - 3:13
    you know, the, sort of,
  • 3:13 - 3:18
usual log likelihood -
  • 3:18 - 3:20
    well, parameterized by theta.
  • 3:20 - 3:22
    And
  • 3:22 - 3:25
    because we have
  • 3:25 - 3:28
a latent random variable Z, this is really
  • 3:28 - 3:31
    maximizing in terms of theta,
  • 3:31 - 3:38
sum over I of the log of the sum over ZI of P of XI,
  • 3:40 - 3:45
    ZI parameterized by theta. Okay?
  • 3:45 - 3:47
    So using
  • 3:47 - 3:50
    Jensen's inequality last time
  • 3:50 - 3:54
    we worked out the EM Algorithm in which in the E step
  • 3:54 - 3:56
we would choose these
  • 3:56 - 3:59
probability distributions QI to be the
  • 3:59 - 4:03
posterior
  • 4:03 - 4:09
    on Z given X and parameterized by theta
  • 4:09 - 4:14
    and in the M step we would set
  • 4:14 - 4:16
    theta
  • 4:16 - 4:22
    to be the value that
  • 4:22 - 4:28
    maximizes
  • 4:28 - 4:35
    this.
  • 4:36 - 4:39
    Okay? So these are the ones we worked out last time
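    Written out, the two steps recapped here are the following (this is the same formulation as in the lecture notes):

    ```latex
    \text{E-step:}\quad Q_i(z^{(i)}) := p\!\left(z^{(i)} \mid x^{(i)}; \theta\right)

    \text{M-step:}\quad \theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i(z^{(i)})}
    ```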
  • 4:42 - 4:47
and the cartoon that I drew was that you have this log likelihood function L of
  • 4:47 - 4:49
    theta that's often hard to maximize
  • 4:49 - 4:51
    and what the E step does
  • 4:51 - 4:56
is choose these probability distributions QI. And in the cartoon, I drew
  • 4:56 - 4:58
    what that corresponded to
  • 4:58 - 5:01
was finding a lower bound
  • 5:01 - 5:03
    for the log likelihood.
  • 5:03 - 5:05
    And then
  • 5:05 - 5:06
the horizontal axis is theta
  • 5:06 - 5:11
and then in the M step you maximize the lower bound, right? So maybe you were here previously
  • 5:11 - 5:12
    and so you
  • 5:12 - 5:15
    jumped to the new point, the new
  • 5:15 - 5:18
    maximum of this lower bound. Okay?
  • 5:18 - 5:22
    And so this little curve here, right? This lower bound function here
  • 5:22 - 5:26
    that's really
  • 5:26 - 5:29
the right-hand side of that arg max.
  • 5:29 - 5:30
    Okay? So
  • 5:30 - 5:34
this whole thing is in the arg max. If you view this thing
  • 5:34 - 5:37
    as a function of theta,
  • 5:37 - 5:38
    this function of theta
  • 5:38 - 5:39
is a lower bound
  • 5:39 - 5:42
    for the log likelihood of theta
  • 5:42 - 5:46
and so in the M step we maximize this lower bound and that corresponds to
  • 5:46 - 5:47
    jumping
  • 5:47 - 5:52
to this new maximum of the lower bound. So it
  • 5:52 - 5:53
    turns out
  • 5:53 - 5:56
    that
  • 5:56 - 5:57
in the EM Algorithm -
  • 5:57 - 6:02
so why do we even bother with the EM Algorithm? It turns out that very often, and this will be
  • 6:02 - 6:05
    true for all the examples we see
  • 6:05 - 6:07
    today, it turns out
  • 6:07 - 6:11
    that very often in the
  • 6:11 - 6:13
    EM Algorithm
  • 6:13 - 6:16
maximizing in the M step - so performing the maximization in the M step - will be
  • 6:16 - 6:20
tractable and can often be done analytically in closed form.
  • 6:20 - 6:21
    Whereas
  • 6:21 - 6:26
if you were trying to maximize this objective directly - if you try to take this
  • 6:26 - 6:27
formula on the right,
  • 6:27 - 6:30
this maximum likelihood objective, that is, and take this thing on the right
  • 6:30 - 6:33
    and set its derivatives to zero and try to solve and
  • 6:33 - 6:35
    you'll find that you're unable
  • 6:35 - 6:37
    to obtain a solution to this in closed form
  • 6:37 - 6:39
    this maximization. Okay?
  • 6:39 - 6:41
    And so to give you an example of that is that
  • 6:41 - 6:45
you remember our discussion on exponential family models, right?
  • 6:45 - 6:47
    It turns out that
  • 6:47 - 6:49
    if X and Z
  • 6:49 - 6:50
are
  • 6:50 - 6:54
jointly, I guess, in the exponential family. So if P of X,
  • 6:54 - 6:55
    Z
  • 6:55 - 6:58
parameterized by theta is an exponential family distribution,
  • 6:58 - 7:02
    which it turns out to be true for the mixture of Gaussians distribution.
  • 7:02 - 7:05
Then it turns out that the M step here will be tractable
  • 7:05 - 7:08
    and the E step will also be tractable and so you can do each of these steps very
  • 7:08 - 7:09
    easily.
  • 7:09 - 7:11
    Whereas
  • 7:11 - 7:13
    performing " trying to
  • 7:13 - 7:16
    perform this original maximum likelihood estimation problem
  • 7:16 - 7:18
    on this one, right?
  • 7:18 - 7:21
    Will be computationally very difficult. You're going
  • 7:21 - 7:25
    to set the derivatives to zero and try to solve for that. Analytically you won't be able to find an analytic solution to
  • 7:25 - 7:28
    this. Okay?
  • 7:28 - 7:33
    So
  • 7:33 - 7:36
    what I want to do in a second is actually take this view of the EM
  • 7:36 - 7:37
    Algorithm
  • 7:37 - 7:41
    and apply it to the mixture of Gaussians models. I want to take these E steps
  • 7:41 - 7:43
    and M Steps and work
  • 7:43 - 7:44
    them out for
  • 7:44 - 7:48
    the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the
  • 7:50 - 7:55
    EM Algorithm. It turns out there's one other way of thinking about the EM
  • 7:55 - 7:59
    Algorithm, which is the following: I can define
  • 7:59 - 8:02
    an optimization objective
  • 8:02 - 8:06
J of theta, Q, and define
  • 8:06 - 8:11
    it to be
  • 8:11 - 8:15
this. This is just the thing in the arg max
  • 8:15 - 8:22
    in the M step. Okay?
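    Written out, that objective - the quantity inside the M-step arg max - is:

    ```latex
    J(\theta, Q) = \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i(z^{(i)})}
    ```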
  • 8:26 - 8:28
    And so what we proved
  • 8:28 - 8:31
    using Jensen's inequality
  • 8:31 - 8:34
    is that
  • 8:34 - 8:38
the log likelihood of theta is greater than or equal to J
  • 8:38 - 8:42
    of theta Q. So in other words, we proved last time that
  • 8:42 - 8:44
    for any value of theta and Q
  • 8:44 - 8:47
    the log likelihood upper bounds J of theta and Q.
  • 8:47 - 8:51
And so just to relate this back to, sort of, yet more things that you already
  • 8:51 - 8:52
    know,
  • 8:52 - 8:54
    you can also think of
  • 8:54 - 8:59
coordinate ascent, in a sense, right?
  • 8:59 - 9:02
Remember our discussion a while back on the coordinate ascent optimization
  • 9:02 - 9:03
    algorithm.
  • 9:03 - 9:07
So we can show - and I won't actually show this here, so just take my word for it
  • 9:07 - 9:10
    and look for that at home if you want,
  • 9:10 - 9:12
  • 9:12 - 9:13
    that EM is
  • 9:13 - 9:17
just coordinate ascent on the
  • 9:17 - 9:18
    function J.
  • 9:18 - 9:21
    So in the E step you maximize
  • 9:21 - 9:24
    with respect to Q
  • 9:24 - 9:27
    and then the M step
  • 9:27 - 9:32
    you maximize with
  • 9:32 - 9:34
    respect to theta. Okay?
  • 9:34 - 9:36
    So this is
  • 9:36 - 9:40
    another view of the EM Algorithm
  • 9:41 - 9:45
that shows why it has to converge, for example. You can view it in the sense of
  • 9:45 - 9:47
    J of theta, Q
  • 9:47 - 9:53
    having to monotonically increase on every iteration. Okay?
  • 9:53 - 9:59
    So what I want
  • 9:59 - 10:02
    to do next is actually take this general
  • 10:02 - 10:06
    EM machinery that we worked up and apply it to a mixture Gaussians model.
  • 10:06 - 10:10
    Before I do that, let me just check if there are questions about
  • 10:10 - 10:17
    the EM Algorithm as a whole? Okay, cool. So
  • 10:24 - 10:26
    let's go ahead and
  • 10:26 - 10:33
work out the mixture of Gaussians
  • 10:36 - 10:37
    EM, all right? MOG, and that's my
  • 10:37 - 10:39
    abbreviation for Mixture of
  • 10:39 - 10:44
Gaussians. So in the E step we compute those Q distributions, right?
  • 10:44 - 10:51
    In particular, I want to work out - so Q is the probability distribution
  • 10:51 - 10:53
over the latent random variable Z
  • 10:53 - 10:58
and so in the E step I'm gonna figure out how to compute - what is Q of ZI equals J. And
  • 10:58 - 11:01
    you can think of this as my writing
  • 11:01 - 11:04
    P of ZI equals J, right? Under the Q
  • 11:04 - 11:08
    distribution. That's what this notation means.
  • 11:08 - 11:13
    And so the EM Algorithm tells us
  • 11:13 - 11:14
  • 11:14 - 11:16
    that, let's see, Q of J
  • 11:16 - 11:23
is the posterior probability of ZI being the value
  • 11:24 - 11:25
    J and
  • 11:25 - 11:29
    given XI and all your parameters.
  • 11:29 - 11:30
    And so,
  • 11:30 - 11:31
  • 11:31 - 11:36
    well, the way you compute this is by Bayes rule, right? So that is going to
  • 11:36 - 11:39
    be equal to P of XI given ZI
  • 11:39 - 11:41
    equals J
  • 11:41 - 11:45
times P of ZI equals J, divided
  • 11:45 - 11:52
    by -
  • 11:57 - 12:01
    right? That's all the - by Bayes rule.
  • 12:01 - 12:03
  • 12:03 - 12:06
    And so this
  • 12:06 - 12:09
    you know
  • 12:09 - 12:10
  • 12:10 - 12:12
  • 12:12 - 12:14
    because XI given ZI equals J. This was a Gaussian
  • 12:14 - 12:18
with mean mu J and covariance sigma J. And so to compute this first
  • 12:18 - 12:21
    term you plug in the formula for the Gaussian density there
  • 12:22 - 12:25
    with parameters mu J and sigma J
  • 12:25 - 12:27
    and this you'd know
  • 12:27 - 12:28
    because
  • 12:28 - 12:36
Z was multinomial, right?
  • 12:36 - 12:41
with parameters given by phi, and so the probability of ZI being equal to J is just phi
  • 12:41 - 12:44
    J and so you can substitute these terms in.
  • 12:44 - 12:47
    Similarly do the same thing for the denominator
  • 12:47 - 12:50
    and that's how you work out what Q is. Okay?
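    As a minimal sketch of that E-step computation in code (the function and variable names here are illustrative, not from the lecture, and it assumes scipy's multivariate normal density):

    ```python
    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, phi, mu, Sigma):
        """w[i, j] = Q_i(z_i = j) = P(z_i = j | x_i; phi, mu, Sigma), via Bayes' rule."""
        m = X.shape[0]
        k = len(phi)
        w = np.zeros((m, k))
        for j in range(k):
            # numerator of Bayes' rule: P(x_i | z_i = j) * P(z_i = j)
            w[:, j] = multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) * phi[j]
        # denominator: sum over all possible values of z_i
        w /= w.sum(axis=1, keepdims=True)
        return w
    ```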
  • 12:50 - 12:56
    And so in the previous lecture this value the probability that ZI
  • 12:56 - 12:59
    equals J under the Q
  • 12:59 - 13:01
distribution - that was what I denoted as WIJ.
  • 13:01 - 13:04
    So that would be the
  • 13:04 - 13:05
  • 13:05 - 13:09
    E
  • 13:09 - 13:10
    step
  • 13:10 - 13:14
    and then in the M step
  • 13:14 - 13:18
    we maximize with respect to all of our parameters.
  • 13:18 - 13:19
  • 13:19 - 13:21
    This, well
  • 13:21 - 13:28
    I seem to be writing the same formula down a lot today. All
  • 13:28 - 13:35
    right.
  • 13:43 - 13:48
    And just so
  • 13:48 - 13:52
    we're completely concrete about how you do that, right? So if you do
  • 13:52 - 13:55
    that you end up with -
  • 13:55 - 13:57
    so plugging in the
  • 13:57 - 14:00
    quantities that you know
  • 14:00 - 14:01
  • 14:01 - 14:04
    that becomes
  • 14:04 - 14:11
    this, let's see. Right.
  • 14:35 - 14:39
    And so that we're completely concrete about what the M step is doing.
  • 14:39 - 14:42
    So in the
  • 14:42 - 14:44
    M step that was,
  • 14:44 - 14:46
    I guess, QI
  • 14:46 - 14:47
of ZI being equal to
  • 14:47 - 14:51
J. And in the summation, the sum over J is the sum over all the possible values
  • 14:51 - 14:52
    of ZI
  • 14:52 - 14:53
    and then
  • 14:53 - 14:56
    this thing here is my Gaussian
  • 14:56 - 14:59
    density. Sorry, guys, this thing - well,
  • 14:59 - 15:02
    this first term here,
  • 15:02 - 15:06
    right? Is my P of
  • 15:06 - 15:10
    XI given ZI
  • 15:10 - 15:14
    and that's P of ZI. Okay?
  • 15:14 - 15:17
    And so
  • 15:17 - 15:21
    to maximize this with respect to - say you want to maximize this with respect to all
  • 15:21 - 15:24
    of your parameters phi, mu and sigma.
  • 15:24 - 15:28
    So to maximize with respect to the parameter mu, say,
  • 15:28 - 15:32
you would take the derivative with respect to mu
  • 15:32 - 15:34
    and set that to zero
  • 15:34 - 15:38
    and you would - and if you actually do that computation
  • 15:38 - 15:44
    you would get, for instance, that
  • 15:44 - 15:48
  • 15:48 - 15:50
    that becomes your update
  • 15:50 - 15:51
    to mu J.
  • 15:51 - 15:52
    Okay? Just
  • 15:52 - 15:53
    so I want to -
  • 15:53 - 15:57
    the equation is unimportant. All of these equations are written down in
  • 15:57 - 16:00
    the lecture notes. I'm writing these down just to be
  • 16:00 - 16:03
    completely concrete about what the M step means. And so write down that formula,
  • 16:03 - 16:05
    plug in the densities you know, take
  • 16:05 - 16:08
the derivative, set it to zero, and solve for mu J
  • 16:08 - 16:11
    and in the same way you set the derivatives equal to zero
  • 16:11 - 16:13
    and solve for your updates
  • 16:13 - 16:15
    for your other parameters phi and
  • 16:15 - 16:22
sigma as well. Okay?
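    And as a corresponding sketch of the M-step updates just described (again the names are illustrative; these are the updates you get by setting the derivatives to zero):

    ```python
    import numpy as np

    def m_step(X, w):
        """Update phi, mu, Sigma from the E-step weights w[i, j]."""
        m, n = X.shape
        k = w.shape[1]
        phi = w.mean(axis=0)                      # phi_j = (1/m) * sum_i w_ij
        mu = (w.T @ X) / w.sum(axis=0)[:, None]   # mu_j = sum_i w_ij x_i / sum_i w_ij
        Sigma = np.zeros((k, n, n))
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
        return phi, mu, Sigma
    ```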
  • 16:23 - 16:26
Well, let me just point out one little tricky bit for this that
  • 16:26 - 16:31
you haven't seen before - that most of you probably already know, but I'll just
  • 16:31 - 16:31
    mention
  • 16:31 - 16:33
    is that
  • 16:33 - 16:37
    since phi here is a multinomial distribution
  • 16:37 - 16:40
    when you take this formula
  • 16:40 - 16:42
    and you maximize it with respect to phi
  • 16:42 - 16:45
    you actually have an additional constraint, right? That the sum of I - let's see, sum
  • 16:45 - 16:46
  • 16:46 - 16:48
    over
  • 16:48 - 16:50
    J,
  • 16:50 - 16:54
    phi J must be equal to one. All right? So, again,
  • 16:54 - 16:58
    in the M step I want to take this thing and maximize it with respect to all the parameters
  • 16:58 - 17:01
    and when you maximize this respect to the parameters phi J
  • 17:01 - 17:03
    you need to respect the constraint that
  • 17:03 - 17:06
    sum of J phi J must be equal to one.
  • 17:06 - 17:11
And so, well, you already know how to do constrained maximization, right? We worked out the method of
  • 17:11 - 17:14
Lagrange multipliers and the generalized Lagrangian when we talked about support vector
  • 17:14 - 17:15
    machines.
  • 17:15 - 17:19
    And so to actually perform the maximization in terms of phi J you
  • 17:19 - 17:21
construct the Lagrangian,
  • 17:21 - 17:23
    which is - all right?
  • 17:23 - 17:25
    So that's the
  • 17:25 - 17:28
equation from above, which we denote by the dot dot dot, plus
  • 17:28 - 17:34
beta times
  • 17:34 - 17:41
that, where this is sort of the Lagrange multiplier
  • 17:42 - 17:43
    and this
  • 17:43 - 17:46
    is your optimization objective.
  • 17:46 - 17:48
  • 17:48 - 17:51
And so to actually solve for the parameters phi J you set
  • 17:51 - 17:57
the derivatives of this
  • 17:57 - 17:59
Lagrangian to zero and solve. Okay?
  • 17:59 - 18:03
    And if you then work through the math
  • 18:03 - 18:07
    you get the appropriate value to update the phi J's too,
  • 18:07 - 18:11
which I won't do, but all the full derivations are in the lecture
  • 18:11 - 18:16
    notes. I won't do that here.
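    For reference, the Lagrangian step being described - keeping only the terms of the M-step objective that depend on phi, with beta as the Lagrange multiplier - works out to the update given in the lecture notes:

    ```latex
    \mathcal{L}(\phi) = \sum_{i=1}^{m} \sum_{j=1}^{k} w_{ij} \log \phi_j \;+\; \beta \left( \sum_{j=1}^{k} \phi_j - 1 \right)
    \quad\Longrightarrow\quad
    \phi_j := \frac{1}{m} \sum_{i=1}^{m} w_{ij}
    ```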
  • 18:16 - 18:18
  • 18:18 - 18:22
    Okay. And so if you actually perform all these computations you can also verify
  • 18:22 - 18:23
    that.
  • 18:23 - 18:27
    So I just wrote down a bunch of formulas for the EM
  • 18:27 - 18:30
Algorithm. At the beginning of the last lecture I said for the mixture of Gaussians model - I
  • 18:30 - 18:34
    said for the EM here's the formula for computing the WIJ's and here's a
  • 18:34 - 18:36
formula for computing the mu's and so on,
  • 18:36 - 18:42
and this derivation is where all of those formulas actually come from.
  • 18:42 - 18:43
    Okay?
  • 18:43 - 18:50
    Questions about this? Yeah? Student:[Inaudible] Instructor (Andrew Ng):Oh,
  • 18:52 - 18:54
    I see. So
  • 18:54 - 18:58
it turns out that, yes, there's also a constraint that the phi J's must be greater than
  • 18:58 - 19:00
    zero.
  • 19:00 - 19:03
    It turns out that
  • 19:03 - 19:07
if you want you could actually write down the generalized Lagrangian
  • 19:07 - 19:11
    incorporating all of these constraints as well and you can solve to [inaudible]
  • 19:11 - 19:12
    these constraints.
  • 19:12 - 19:16
    It turns out that in this particular derivation - actually it turns out that very often
  • 19:16 - 19:17
    we
  • 19:17 - 19:20
find maximum likelihood estimates for multinomial distribution probabilities.
  • 19:20 - 19:23
    It turns out that if you ignore these constraints and you just maximize the
  • 19:23 - 19:24
    formula
  • 19:24 - 19:25
  • 19:25 - 19:28
    luckily you end up with
  • 19:28 - 19:31
    values that actually are greater than or equal to zero
  • 19:31 - 19:34
and so if, even ignoring those constraints, you end up with parameters that are greater than or equal to
  • 19:34 - 19:38
    zero that shows that that must be the correct solution
  • 19:38 - 19:41
    because adding that constraint won't change anything.
  • 19:41 - 19:45
So this constraint - it turns out that if you
  • 19:45 - 19:46
    ignore this and just do
  • 19:46 - 19:53
what I wrote down you actually get the right answer.
  • 19:56 - 19:57
    Okay?
  • 19:57 - 19:58
    Great.
  • 19:58 - 19:59
    So let me
  • 19:59 - 20:01
    just
  • 20:01 - 20:05
    very quickly talk about one more example of a mixture model. And
  • 20:05 - 20:09
    the perfect example for this is imagine you want to do text clustering, right?
  • 20:09 - 20:12
    So someone gives you a large set of documents
  • 20:12 - 20:15
    and you want to cluster them together into cohesive
  • 20:15 - 20:16
    topics. I
  • 20:16 - 20:19
    think I mentioned the news website news.google.com.
  • 20:19 - 20:22
    That's one application of text clustering
  • 20:22 - 20:26
    where you might want to look at all of the news stories about
  • 20:26 - 20:28
    today, all the news stories
  • 20:28 - 20:32
    written by everyone, written by all the online news websites about whatever happened
  • 20:32 - 20:33
    yesterday
  • 20:33 - 20:37
    and there will be many, many different stories on the same thing, right?
  • 20:37 - 20:41
    And by running a text-clustering algorithm you can group related
  • 20:41 - 20:43
    documents together. Okay?
  • 20:43 - 20:50
    So how do you apply the EM Algorithm to text clustering?
  • 20:53 - 20:56
    I want to do this to illustrate an example
  • 20:56 - 20:59
    in which you run the EM Algorithm
  • 20:59 - 21:03
on discrete-valued inputs where the input - where the training examples
  • 21:03 - 21:04
    XI
  • 21:04 - 21:06
are discrete
  • 21:06 - 21:09
    values. So what I want to talk about specifically
  • 21:09 - 21:16
    is the mixture of naive Bayes model
  • 21:17 - 21:20
    and depending on how much you remember about naive Bayes
  • 21:20 - 21:24
I talked about two event models. One was the multivariate Bernoulli
  • 21:24 - 21:25
    event model. One
  • 21:25 - 21:28
    was the multinomial event model.
  • 21:28 - 21:31
Today I'm gonna use the multivariate Bernoulli event model. If
  • 21:31 - 21:35
    you don't remember what those terms mean anymore don't worry about it. I
  • 21:35 - 21:36
    think the equation will still make sense. But in
  • 21:36 - 21:43
    this setting we're given a training set X1
  • 21:45 - 21:46
through XM.
  • 21:46 - 21:49
    So we're given M text documents
  • 21:49 - 21:52
    where each
  • 21:52 - 21:56
  • 21:56 - 22:00
XI is in {0, 1} to the n. So each of our training examples
  • 22:00 - 22:04
is an n-dimensional bit vector,
  • 22:04 - 22:07
right? So this was the representation
  • 22:07 - 22:11
    where XIJ was - it indicates whether
  • 22:11 - 22:13
    word J
  • 22:13 - 22:17
    appears in
  • 22:17 - 22:23
    document
  • 22:23 - 22:28
    I, right? And let's say that
  • 22:28 - 22:32
    we're going to model ZI the - our latent random variable meaning
  • 22:32 - 22:35
    our hidden random variable ZI will take on two values zero one
  • 22:35 - 22:38
so this means I'm just gonna find two clusters, and you can generalize to as many
  • 22:38 - 22:45
clusters as you want.
  • 22:45 - 22:47
    So
  • 22:47 - 22:52
    in the mixture of naive Bayes model we assume that
  • 22:52 - 22:54
ZI is distributed
  • 22:54 - 22:57
Bernoulli
  • 22:57 - 22:58
    with some
  • 22:58 - 23:03
    value of phi so there's some probability of each document coming from cluster one
  • 23:03 - 23:05
    or from cluster
  • 23:05 - 23:10
    two. We assume that
  • 23:10 - 23:14
    the probability
  • 23:14 - 23:16
  • 23:16 - 23:20
    of XI given ZI,
  • 23:20 - 23:22
    right?
  • 23:22 - 23:29
    Will make the same naive Bayes assumption as we did before.
  • 23:36 - 23:37
    Okay?
  • 23:37 - 23:44
    And more specifically - well, excuse
  • 23:53 - 24:00
    me,
  • 24:01 - 24:02
    right. Okay.
  • 24:02 - 24:07
And so the probability of XIJ equals one given ZI equals, say, zero
  • 24:07 - 24:10
    will be given by a parameter phi
  • 24:10 - 24:12
subscript J
  • 24:12 - 24:18
    given Z equals
  • 24:18 - 24:21
    zero. So if you take this chalkboard and if you
  • 24:21 - 24:23
    take all instances of
  • 24:23 - 24:25
    the alphabet Z
  • 24:25 - 24:27
    and replace it with Y
  • 24:27 - 24:32
    then you end up with exactly the same equation as I've written down for naive
  • 24:32 - 24:39
    Bayes like a long time ago. Okay?
  • 24:40 - 24:41
    And I'm not
  • 24:41 - 24:45
    actually going to work out the mechanics deriving
  • 24:45 - 24:46
    the EM Algorithm, but it turns out that
  • 24:46 - 24:50
    if you take this joint distribution of X and Z
  • 24:50 - 24:54
    and if you work out what the equations are for the EM algorithm for
  • 24:54 - 24:56
    maximum likelihood estimation of parameters
  • 24:56 - 24:59
    you find that
  • 24:59 - 25:01
    in the E
  • 25:01 - 25:03
    step you compute,
  • 25:03 - 25:06
    you know, let's say these parameters - these weights
  • 25:06 - 25:07
    WI
  • 25:07 - 25:09
    which are going to be equal to
  • 25:09 - 25:16
the posterior probability of Z being equal to one conditioned on XI, parameterized
  • 25:16 - 25:19
    by your
  • 25:19 - 25:24
    phi's and given your parameters
  • 25:24 - 25:26
  • 25:26 - 25:31
    and in the M step. Okay?
  • 25:31 - 25:38
    And
  • 26:17 - 26:21
    that's the equation you get in the M step.
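    As a reconstruction of what those board equations look like for the two-cluster mixture of naive Bayes (the exact board notation isn't captured in the transcript, but this is the standard form):

    ```latex
    \text{E-step:}\quad w^{(i)} = P\!\left(z^{(i)} = 1 \mid x^{(i)}; \phi_{j|z}, \phi\right)

    \text{M-step:}\quad
    \phi_{j|z=1} = \frac{\sum_{i=1}^{m} w^{(i)} 1\{x_j^{(i)} = 1\}}{\sum_{i=1}^{m} w^{(i)}},\qquad
    \phi_{j|z=0} = \frac{\sum_{i=1}^{m} \left(1 - w^{(i)}\right) 1\{x_j^{(i)} = 1\}}{\sum_{i=1}^{m} \left(1 - w^{(i)}\right)},\qquad
    \phi = \frac{1}{m} \sum_{i=1}^{m} w^{(i)}
    ```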
  • 26:21 - 26:22
    I
  • 26:22 - 26:25
    mean, again, the equations themselves aren't too important. Just
  • 26:25 - 26:27
    sort of convey - I'll
  • 26:27 - 26:30
    give you a second to finish writing, I guess.
  • 26:30 - 26:32
    And when you're done or finished writing
  • 26:32 - 26:37
    take a look at these equations and see if they make intuitive sense to you
  • 26:37 - 26:44
    why these three equations, sort of, sound like they might be the right thing to do. Yeah? [Inaudible] Say that again.
  • 26:47 - 26:50
    Y -
  • 26:50 - 26:54
    Oh, yes, thank you. Right.
  • 26:54 - 26:58
    Sorry,
  • 26:58 - 27:05
just, everywhere I wrote Y I meant Z. Yeah? [Inaudible] in the
  • 27:15 - 27:21
    first place?
  • 27:21 - 27:22
    No. So
  • 27:25 - 27:26
    what is it?
  • 27:26 - 27:33
    Normally you initialize phi's to be something else, say randomly.
  • 27:35 - 27:38
    So just like in naive Bayes we saw
  • 27:38 - 27:41
zero probabilities are a bad thing, so for the same reason you
  • 27:41 - 27:48
    try to avoid zero probabilities, yeah. Okay?
  • 27:48 - 27:52
    And so just the intuition behind these
  • 27:52 - 27:54
    equations is
  • 27:54 - 27:56
    in the E step
  • 27:56 - 28:00
the WI's are your best guess for whether the document came from
  • 28:00 - 28:03
    cluster one or cluster
  • 28:03 - 28:06
    zero, all right? This is very similar to the
  • 28:06 - 28:09
    intuitions behind the EM Algorithm that we talked about in a previous lecture. So in the
  • 28:09 - 28:10
    E step
  • 28:10 - 28:15
    we're going to compute these weights that tell us do I think this document came from
  • 28:15 - 28:19
    cluster one or cluster zero.
  • 28:19 - 28:23
    And then in the M step I'm gonna say
  • 28:23 - 28:27
this numerator is the sum over all the elements of my training set
  • 28:27 - 28:29
    of - so then
  • 28:29 - 28:33
informally, right? WI is one if I think the document came from cluster
  • 28:33 - 28:34
    one
  • 28:34 - 28:36
    and so this will
  • 28:36 - 28:41
essentially sum up all the times I saw word J
  • 28:41 - 28:45
    in documents that I think are in cluster one.
  • 28:45 - 28:49
    And these are sort of weighted by the actual probability. I think it came from cluster
  • 28:49 - 28:50
    one
  • 28:50 - 28:51
    and then I'll divide by
  • 28:51 - 28:55
    - again, if all of these were ones and zeros then I'd be dividing by
  • 28:55 - 28:58
    the actual number of documents I had in cluster one. So if all the WI's were
  • 28:58 - 29:00
  • 29:00 - 29:03
    either ones or zeroes then this would be exactly
  • 29:03 - 29:07
    the fraction of documents that I saw in cluster one
  • 29:07 - 29:10
in which I also saw word J.
  • 29:10 - 29:11
    Okay? But in the EM Algorithm
  • 29:11 - 29:15
    you don't make a hard assignment decision about is this in cluster one or is this in
  • 29:15 - 29:16
    cluster zero. You
  • 29:16 - 29:17
    instead
  • 29:17 - 29:24
    represent your uncertainty about cluster membership with the parameters WI. Okay? It
  • 29:24 - 29:27
    actually turns out that when we actually implement this particular model it
  • 29:27 - 29:29
    actually turns out that
  • 29:29 - 29:32
    by the nature of this computation all the values of WI's
  • 29:32 - 29:35
    will be very close to either one or zero so they'll
  • 29:35 - 29:39
be numerically almost indistinguishable from ones and zeros. This is a property of
  • 29:39 - 29:42
    naive Bayes. If you actually compute this probability
  • 29:42 - 29:46
    from all those documents you find that WI is either
  • 29:46 - 29:46
  • 29:46 - 29:50
    0.0001 or 0.999. It'll be amazingly close to either zero or one
  • 29:50 - 29:53
    and so the M step - and so this is pretty much guessing whether each document is
  • 29:53 - 29:55
    in cluster one or cluster
  • 29:55 - 29:56
    zero and then
  • 29:56 - 30:00
using formulas that are very similar to maximum likelihood estimation
  • 30:00 - 30:05
    for naive Bayes.
  • 30:05 - 30:06
    Okay?
  • 30:06 - 30:08
    Cool. Are there - and
  • 30:08 - 30:10
    if
  • 30:10 - 30:12
    some of these equations don't look that familiar to you anymore,
  • 30:12 - 30:15
    sort of, go back and take another look at what you saw in naive
  • 30:15 - 30:17
    Bayes
  • 30:17 - 30:19
    and
  • 30:19 - 30:22
    hopefully you can see the links there as well.
  • 30:22 - 30:29
    Questions about this before I move on? Right, okay.
  • 30:33 - 30:37
Of course the way I got these equations was by churning through the machinery of
  • 30:37 - 30:41
    the EM Algorithm, right? I didn't just write these out of thin air. The way you do this
  • 30:41 - 30:42
    is by
  • 30:42 - 30:45
writing down the E step and the M step for this model, and then in the M step
  • 30:45 - 30:46
setting derivatives
  • 30:46 - 30:48
equal to zero and solving, and so
  • 30:48 - 30:50
    that's how you get the M step and the E
  • 30:50 - 30:57
    step.
  • 31:22 - 31:24
    So the last thing I want to do today
  • 31:24 - 31:31
    is talk about the factor analysis model
  • 31:32 - 31:33
  • 31:33 - 31:35
    and
  • 31:35 - 31:38
    the reason I want to do this is sort of two reasons because one is
  • 31:38 - 31:43
    factor analysis is kind of a useful model. It's
  • 31:43 - 31:44
    not
  • 31:44 - 31:48
as widely used as mixtures of Gaussians and mixtures of naive Bayes maybe,
  • 31:48 - 31:51
    but it's sort of useful.
  • 31:51 - 31:55
    But the other reason I want to derive this model is that there are a
  • 31:55 - 31:58
    few steps in the math that are more generally useful.
  • 31:58 - 32:03
    In particular, where this is for factor analysis this would be an example
  • 32:03 - 32:05
    in which we'll do EM
  • 32:05 - 32:08
where the latent random variable - where the hidden random variable Z
  • 32:08 - 32:11
is going to be continuous-valued.
  • 32:11 - 32:15
    And so some of the math we'll see in deriving factor analysis will be a little bit
  • 32:15 - 32:17
    different than what you saw before and they're just a -
  • 32:17 - 32:22
    it turns out the full derivation for EM for factor analysis is sort of
  • 32:22 - 32:23
    extremely long and complicated
  • 32:23 - 32:25
    and so I won't
  • 32:25 - 32:27
inflict that on you in lecture today,
  • 32:27 - 32:31
    but I will still be writing more equations than is - than you'll see me do
  • 32:31 - 32:33
    in other lectures because there are, sort of,
  • 32:33 - 32:37
    just a few steps in the factor analysis derivation so I'll physically
  • 32:37 - 32:39
  • 32:39 - 32:40
    illustrate
  • 32:40 - 32:41
    it.
  • 32:41 - 32:43
    So it's actually [inaudible] the model
  • 32:43 - 32:48
and it's really in contrast to the mixture of Gaussians model, all right? So for the mixture of Gaussians model, which was
  • 32:48 - 32:51
    our first model
  • 32:51 - 32:53
    we had,
  • 32:53 - 32:58
that - well, I actually motivated it by drawing a data set like this, right? Where, you know,
  • 32:58 - 33:04
you have a data set that looks like this,
  • 33:04 - 33:06
    right? So this was a problem where
  • 33:06 - 33:09
the data is two-dimensional
  • 33:09 - 33:14
    and you have, I don't know, maybe 50 or 100 training examples, whatever, right?
  • 33:14 - 33:15
    And I said
  • 33:15 - 33:18
maybe you're given an unlabeled
  • 33:18 - 33:21
    training set like this. Maybe you want to model this
  • 33:21 - 33:24
    as a mixture of two Gaussians. Okay?
  • 33:24 - 33:25
    And so a
  • 33:25 - 33:27
    mixture of Gaussian models tend to be
  • 33:27 - 33:29
    applicable
  • 33:29 - 33:31
    where m
  • 33:31 - 33:33
    is larger,
  • 33:33 - 33:36
    and often much larger, than n where the number of training examples you
  • 33:36 - 33:37
    have
  • 33:37 - 33:41
    is at least as large as, and is usually much larger than, the
  • 33:41 - 33:46
    dimension of the data. What I
  • 33:46 - 33:49
    want to do is talk about a different problem where I
  • 33:49 - 33:52
    want you to imagine what happens if
  • 33:52 - 33:56
    either the dimension of your data is roughly equal to the number of
  • 33:56 - 33:58
    examples you have
  • 33:58 - 34:00
    or maybe
  • 34:00 - 34:01
    the
  • 34:01 - 34:04
    dimension of your data is maybe even much larger than
  • 34:04 - 34:08
    the number of training examples you have. Okay? So how do
  • 34:08 - 34:10
    you model
  • 34:10 - 34:14
such very high dimensional data? Which, you know,
  • 34:14 - 34:16
you do see sometimes, right? If
  • 34:16 - 34:19
    you run a plant or something, you run a factory, maybe you have
  • 34:19 - 34:23
    a thousand measurements all through your plants, but you only have five - you only have
  • 34:23 - 34:26
    20 days of data. So
  • 34:26 - 34:27
    you can have 1,000 dimensional data,
  • 34:27 - 34:31
but 20 examples of it already. So
  • 34:31 - 34:35
given data that has this property - that we're given
  • 34:35 - 34:40
a training set of m examples - well,
  • 34:40 - 34:41
    what can you do to try to model
  • 34:41 - 34:44
    the density of X?
  • 34:44 - 34:48
So one thing you can do is try to model it just as a single Gaussian, right? So despite my
  • 34:48 - 34:52
mixtures of Gaussians, this is how you'd model it as a single Gaussian, and
  • 34:52 - 34:55
say X is distributed Gaussian with mean mu
  • 34:55 - 34:58
    and parameter
  • 34:58 - 34:59
    sigma
  • 34:59 - 35:00
    where
  • 35:00 - 35:02
    sigma is going to be
  • 35:02 - 35:05
an n by n matrix
  • 35:05 - 35:10
    and so if you work out the maximum likelihood estimate of the parameters
  • 35:10 - 35:13
    you find that the maximum likelihood estimate for the mean
  • 35:13 - 35:17
    is just the empirical mean of your training set,
  • 35:17 - 35:21
    right. So that makes sense.
  • 35:21 - 35:26
    And the maximum likelihood of the covariance matrix sigma
  • 35:26 - 35:28
    will be
  • 35:28 - 35:30
  • 35:30 - 35:37
  • 35:40 - 35:41
this, all right?
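    The maximum likelihood estimates being referred to are:

    ```latex
    \mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)},\qquad
    \Sigma = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^{T}
    ```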
  • 35:41 - 35:45
But it turns out that in this regime where the data is much higher dimensional - excuse me,
  • 35:45 - 35:48
    where the data's dimension is much larger than the training examples
  • 35:48 - 35:50
    you
  • 35:50 - 35:51
    have if you
  • 35:51 - 35:53
compute the maximum likelihood estimate
  • 35:53 - 35:55
    of the covariance matrix sigma
  • 35:55 - 35:58
    you find that this matrix is singular.
  • 35:58 - 35:59
  • 35:59 - 36:03
Okay? By singular, I mean that it doesn't have full rank, or it has a zero eigen-
  • 36:03 - 36:05
value - I hope
  • 36:05 - 36:07
    one of those terms makes sense.
  • 36:07 - 36:13
    And
  • 36:13 - 36:20
another way of saying it is that the matrix sigma will be non-invertible.
  • 36:22 - 36:25
    And just in pictures,
  • 36:25 - 36:29
one concrete example is this -
  • 36:29 - 36:33
if n equals m equals two - if you have two-dimensional data and
  • 36:33 - 36:36
    you have two examples. So I'd have
  • 36:36 - 36:40
two training examples in two dimensions - this is
  • 36:40 - 36:42
    X1 and
  • 36:42 - 36:44
    X2. This is my unlabeled data.
  • 36:44 - 36:48
    If you fit a Gaussian to this data set you find that
  • 36:48 - 36:49
    - well
  • 36:49 - 36:53
you remember I used to draw contours of Gaussians as ellipses, right? So
  • 36:53 - 36:56
these are examples of different contours of Gaussians.
  • 36:56 - 36:59
You find that the maximum likelihood estimate
  • 36:59 - 37:00
    Gaussian for this
  • 37:00 - 37:04
corresponds to a Gaussian where the contours are sort of infinitely
  • 37:04 - 37:07
    thin and infinitely long in that direction.
  • 37:07 - 37:11
    Okay? So in terms - so the contours will sort of be
  • 37:11 - 37:13
    infinitely thin,
  • 37:13 - 37:18
    right? And stretch infinitely long in that direction.
  • 37:18 - 37:21
    And another way of saying it is that if
  • 37:21 - 37:26
    you actually plug in the formula for the density
  • 37:26 - 37:31
    of the
  • 37:31 - 37:36
    Gaussian, which is
  • 37:36 - 37:39
    this, you won't actually get a nice answer because
  • 37:39 - 37:42
    the matrix sigma is non-invertible so sigma inverse is not
  • 37:42 - 37:43
    defined
  • 37:43 - 37:45
    and this is zero.
  • 37:45 - 37:49
So you'd also have one over zero, times E to the something involving the
  • 37:49 - 37:56
inverse of a non-invertible matrix, so it's not a good model.
  • 37:58 - 38:02
    So let's do even better, right? So given this sort of data
  • 38:02 - 38:09
    how do you model P of X?
  • 38:28 - 38:35
    Well, one thing you could do is constrain sigma to be diagonal. So you have a
  • 38:42 - 38:46
covariance matrix that is - okay?
  • 38:46 - 38:49
    So in other words you get a constraint sigma
  • 38:49 - 38:53
    to be this matrix, all
  • 38:53 - 38:56
    right?
  • 38:56 - 39:00
    With zeroes on the off diagonals. I hope
  • 39:00 - 39:03
    this makes sense. These zeroes I've written down here denote that
  • 39:03 - 39:06
everything off the diagonal of this matrix is
  • 39:06 - 39:07
    zero.
  • 39:07 - 39:09
    So
  • 39:09 - 39:13
the maximum likelihood estimate of the parameters will be pretty much what you'd
  • 39:13 - 39:16
    expect,
  • 39:16 - 39:20
    right?
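    Concretely, with sigma constrained to be diagonal, the maximum likelihood estimate of each diagonal entry is just the empirical variance of that coordinate:

    ```latex
    \Sigma_{jj} = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^{2}
    ```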
  • 39:20 - 39:21
    And
  • 39:21 - 39:22
    in pictures
  • 39:22 - 39:24
    what this means is that
  • 39:24 - 39:26
    the [inaudible] the distribution
  • 39:26 - 39:29
    with Gaussians
  • 39:29 - 39:32
whose contours are axis-aligned. So
  • 39:32 - 39:37
    that's one example of a Gaussian where the covariance is diagonal.
  • 39:37 - 39:38
  • 39:38 - 39:39
    And
  • 39:39 - 39:43
    here's another example and
  • 39:43 - 39:45
    so here's a third example. But
  • 39:45 - 39:49
often I've used examples of Gaussians where the covariance matrix has non-zero off-diagonal entries. Okay? And, I
  • 39:49 - 39:51
    don't
  • 39:51 - 39:54
    know,
  • 39:54 - 39:57
you could do this to model P of X, but this isn't very nice because you've now
  • 39:57 - 40:00
    thrown away all the correlations
  • 40:00 - 40:05
    between the different variables so
  • 40:05 - 40:08
the axes are X1 and X2, right? So you've thrown away - you're failing to capture
  • 40:08 - 40:12
    any of the correlations or the relationships between
  • 40:12 - 40:18
    any pair of variables in your data. Yeah? Is it - could you say again what does that do for the diagonal? Say again.
  • 40:18 - 40:21
    The covariance matrix the diagonal,
  • 40:21 - 40:22
    what does
  • 40:22 - 40:23
    that again? I didn't quite understand what the examples mean. Instructor (Andrew Ng):Okay.
  • 40:23 - 40:27
    So these are the contours of the Gaussian density that I'm drawing,
  • 40:27 - 40:28
    right? So let's see -
  • 40:28 - 40:32
so suppose the covariance sigma is diagonal,
  • 40:32 - 40:35
    then you can ask what is P of X
  • 40:35 - 40:37
    parameterized by
  • 40:37 - 40:39
    mu and sigma,
  • 40:39 - 40:41
    right? If sigma is diagonal
  • 40:41 - 40:44
and so this will be some Gaussian bump,
  • 40:44 - 40:46
    right? So not in - oh, boy. My drawing's really
  • 40:46 - 40:49
    bad, but in two-D
  • 40:49 - 40:53
    the density for Gaussian is like this bump shaped thing, right?
  • 40:53 - 40:57
    So this is the density of the Gaussian
  • 40:57 - 41:00
    - wow, and this is a really bad drawing. With those,
  • 41:00 - 41:04
    your axis X1 and X2 and the height of this is P of X
  • 41:04 - 41:08
    and so those figures over there are the contours of
  • 41:08 - 41:09
    the density of the Gaussian.
  • 41:09 - 41:16
    So those are the contours of this shape. Student:No, I don't
  • 41:16 - 41:17
    mean
  • 41:17 - 41:23
the contour. What's special about these types? What makes them different from a general covariance matrix? Instructor (Andrew Ng):Oh, I see. Oh, okay, sorry. They're axis-aligned so
  • 41:23 - 41:24
    the main - these,
  • 41:24 - 41:25
    let's see.
  • 41:25 - 41:28
    So I'm not drawing a contour like this,
  • 41:28 - 41:30
    right? Because the main axes of these
  • 41:30 - 41:32
    are not aligned with the
  • 41:32 - 41:34
    X1 and X-axis so
  • 41:34 - 41:36
this corresponds to a
  • 41:36 - 41:43
    Gaussian where the off-diagonals are non-zero, right? Cool. Okay.
  • 41:43 - 41:47
  • 41:47 - 41:50
You could do this; this sort of works. It turns out that with just two
  • 41:50 - 41:52
    training examples
  • 41:52 - 41:55
you can learn a non-singular covariance matrix, but you've thrown away all
  • 41:55 - 41:56
  • 41:56 - 42:03
    of the correlation in the data so this is not a great model.
  • 42:06 - 42:08
    It turns out you can do something - well,
  • 42:08 - 42:10
    actually, we'll come back and use this property later.
  • 42:10 - 42:17
    But it turns out you can do something even more restrictive, which
  • 42:18 - 42:19
    is
  • 42:19 - 42:24
you can constrain sigma to be equal to sigma squared times the identity
  • 42:24 - 42:28
    matrix. So in other words, you can constrain it to be diagonal
  • 42:28 - 42:29
    matrix
  • 42:29 - 42:32
    and moreover all the
  • 42:32 - 42:35
    diagonal entries must be the same
  • 42:35 - 42:38
    and so the cartoon for that is that you're
  • 42:38 - 42:39
    constraining
  • 42:39 - 42:43
    the contours of your Gaussian density to be
  • 42:43 - 42:47
    circular. Okay? This is a sort of even harsher constraint to place in your model.
  • 42:47 - 42:51
  • 42:51 - 42:52
  • 42:52 - 42:56
    So either of these versions, diagonal sigma or sigma being the, sort of, constant
  • 42:56 - 42:57
    value diagonal
  • 42:57 - 43:02
are already pretty strong assumptions, all right? So if you have enough data
  • 43:02 - 43:08
maybe you'd like to model at least a little bit of the correlation between your different variables.
  • 43:08 - 43:11
    So the factor analysis model
  • 43:11 - 43:14
    is one way to attempt to do that. So here's
  • 43:14 - 43:21
    the idea.
  • 43:35 - 43:36
    So this is
  • 43:36 - 43:39
    how the factor analysis model
  • 43:39 - 43:41
    models your data.
  • 43:41 - 43:42
    We're going to
  • 43:42 - 43:45
    assume that there is a latent random variable, okay?
  • 43:45 - 43:47
    Which just
  • 43:47 - 43:51
    means random variable Z. So Z is distributed
  • 43:51 - 43:53
    Gaussian with mean zero and
  • 43:53 - 43:56
    covariance identity
  • 43:56 - 43:58
    where Z
  • 43:58 - 44:00
will be a D-dimensional vector now
  • 44:00 - 44:01
    and
  • 44:01 - 44:04
    D
  • 44:04 - 44:11
    will be chosen so that it is lower than the dimension of your X's. Okay?
  • 44:12 - 44:13
    And now
  • 44:13 - 44:15
    I'm going to assume that
  • 44:15 - 44:19
    X is given by -
  • 44:19 - 44:21
    well let me write this. Each XI
  • 44:21 - 44:23
    is distributed -
  • 44:25 - 44:29
actually, sorry. We
  • 44:29 - 44:34
    have
  • 44:34 - 44:39
to assume that, conditioned on the value of Z,
  • 44:39 - 44:42
    X is given by another Gaussian
  • 44:42 - 44:47
    with mean given by mu plus
  • 44:47 - 44:50
    lambda Z
  • 44:50 - 44:54
and covariance given by a matrix psi.
  • 44:54 - 44:59
    So just to say the second line in
  • 44:59 - 45:01
    an equivalent form, equivalently
  • 45:01 - 45:06
    I'm going to model X as
  • 45:06 - 45:09
    mu plus lambda Z
  • 45:09 - 45:13
    plus a noise term epsilon where epsilon is
  • 45:13 - 45:18
    Gaussian with mean zero
  • 45:18 - 45:22
and covariance psi.
  • 45:22 - 45:23
  • 45:23 - 45:25
    And so
  • 45:25 - 45:28
    the parameters of this
  • 45:28 - 45:30
    model
  • 45:30 - 45:33
    are going to be a vector
  • 45:33 - 45:35
    mu with its
  • 45:35 - 45:38
    n-dimensional and
  • 45:38 - 45:41
    matrix lambda,
  • 45:41 - 45:42
    which is
  • 45:42 - 45:44
  • 45:44 - 45:46
    n by D
  • 45:46 - 45:50
and a covariance matrix psi,
  • 45:50 - 45:52
    which is n by n,
  • 45:52 - 45:56
and I'm going to impose an additional constraint on psi. I'm going to impose
  • 45:56 - 45:57
a constraint that psi
  • 45:57 - 46:00
    is
  • 46:00 - 46:03
    diagonal.
  • 46:03 - 46:05
    Okay? So
  • 46:05 - 46:08
    that was a form of definition - let me actually, sort of,
  • 46:08 - 46:15
    give a couple of examples to make this more complete.
  • 46:33 - 46:38
    So let's give a kind of example, suppose Z
  • 46:38 - 46:40
    is one-dimensional
  • 46:40 - 46:44
    and X is twodimensional
  • 46:44 - 46:47
    so let's see what
  • 46:47 - 46:52
    this model - let's see a, sort of, specific instance of the factor analysis
  • 46:52 - 46:53
    model
  • 46:53 - 46:56
    and how we're modeling the joint - the distribution over X
  • 46:56 - 46:58
    of - what this gives us
  • 46:58 - 47:01
    in terms of a model over P of X, all right?
  • 47:01 - 47:03
    So
  • 47:03 - 47:10
    let's see. From this model to let me assume that
  • 47:10 - 47:11
    lambda is
  • 47:11 - 47:12
    2, 1
  • 47:12 - 47:16
and psi, which has to be a diagonal matrix, remember,
  • 47:16 - 47:20
    is this.
  • 47:20 - 47:22
    Okay? So
  • 47:22 - 47:29
    Z is one-dimensional so let me just draw a typical sample for Z, all right? So
  • 47:31 - 47:34
    if I draw ZI
  • 47:34 - 47:36
    from a Gaussian
  • 47:36 - 47:40
    so that's a typical sample for what Z might look like
  • 47:40 - 47:43
    and so I'm gonna - at any rate I'm gonna call
  • 47:43 - 47:45
    this Z1,
  • 47:45 - 47:50
    Z2, Z3 and so on. If this really were a typical sample the order of the
  • 47:50 - 47:53
    Z's would be jumbled up, but I'm just ordering them like this
  • 47:53 - 47:56
    just to make the example easier.
  • 47:56 - 47:58
    So, yes, typical sample
  • 47:58 - 48:05
of the random variable Z from a Gaussian distribution with mean zero and covariance one.
  • 48:06 - 48:09
    So -
  • 48:09 - 48:14
and with this example let me just set mu equal to zero -
  • 48:14 - 48:18
    just that it's easier to talk about.
  • 48:18 - 48:21
    So
  • 48:21 - 48:23
    lambda times Z,
  • 48:23 - 48:26
    right? We'll take each of these numbers and multiply them by lambda.
  • 48:26 - 48:29
    And so
  • 48:29 - 48:31
    you find that
  • 48:31 - 48:37
    all of the values for lambda times Z
  • 48:37 - 48:38
    will lie on a straight line,
  • 48:38 - 48:40
    all right? So, for example,
  • 48:40 - 48:41
  • 48:41 - 48:45
    this one here would be one, two, three, four, five, six, seven, I guess. So if this was
  • 48:45 - 48:46
    Z7
  • 48:46 - 48:49
    then this one here would be lambda
  • 48:49 - 48:52
    times Z7
  • 48:52 - 48:54
and now that's a point in R2, because lambda's a two
  • 48:54 - 48:57
    by one matrix.
  • 48:57 - 49:00
    And so what I've drawn here is like a typical sample
  • 49:00 - 49:04
    for lambda times Z
  • 49:04 - 49:06
    and
  • 49:06 - 49:10
    the final step for this is what a typical sample for X looks like. Well X is
  • 49:10 - 49:12
    mu
  • 49:12 - 49:13
    plus
  • 49:13 - 49:15
  • 49:15 - 49:17
    lambda Z plus epsilon
  • 49:17 - 49:19
    where epsilon
  • 49:19 - 49:24
is Gaussian with mean zero and covariance given by psi, right?
  • 49:24 - 49:28
    And so the last step to draw a typical sample for the random variables
  • 49:28 - 49:29
  • 49:29 - 49:31
    X
  • 49:31 - 49:36
I'm gonna take these points - these are really the same as mu plus lambda Z because mu is zero in this example
  • 49:36 - 49:37
    and
  • 49:37 - 49:38
    around this point
  • 49:38 - 49:42
    I'm going to place
  • 49:42 - 49:44
    an axis aligned ellipse. Or
  • 49:44 - 49:46
    in other words, I'm going to
  • 49:46 - 49:48
    create a Gaussian distribution
  • 49:48 - 49:51
    centered on this point
  • 49:51 - 49:53
    and this I've drawn
  • 49:53 - 49:56
    corresponds to one of the contours
  • 49:56 - 49:59
    of my density for
  • 49:59 - 50:03
    epsilon, right? And so you can imagine placing a little Gaussian bump here.
  • 50:03 - 50:04
    And so
  • 50:04 - 50:08
    I'll draw an example from this little Gaussian and
  • 50:08 - 50:11
    let's say I get that point going,
  • 50:11 - 50:18
    I do the same here and
  • 50:18 - 50:19
    so on.
  • 50:19 - 50:24
    So I draw a bunch of examples from these Gaussians
  • 50:24 - 50:29
    and the -
  • 50:29 - 50:32
    whatever they call it - the orange points I drew
  • 50:32 - 50:35
    will comprise a typical sample for
  • 50:35 - 50:42
what the distribution of X is under this model. Okay? Yeah? Student:Would you add, like, mean? Instructor: Oh, say that again. Student:Do you add mean into that?
  • 50:44 - 50:46
    Instructor (Andrew Ng):Oh,
  • 50:46 - 50:49
yes, you do. And in this example, I said mu was
  • 50:49 - 50:50
zero, zero just to make
  • 50:50 - 50:53
    it easier. If mu were something else you'd take the whole picture and you'd sort of shift
  • 50:53 - 51:00
    it to whatever value of mu is. Yeah? Student:[Inaudible] horizontal line right there, which was Z. What did the X's, of
  • 51:01 - 51:03
    course,
  • 51:03 - 51:06
    what does that Y-axis corresponds to? Instructor (Andrew
  • 51:06 - 51:08
    Ng):Oh, so this is Z
  • 51:08 - 51:10
    is one-dimensional
  • 51:10 - 51:13
    so here I'm plotting the typical sample
  • 51:13 - 51:15
    for Z so this is like zero.
  • 51:15 - 51:20
So this is just the Z axis, right. So Z is one-dimensional data.
  • 51:20 - 51:23
    So this line here is like a plot
  • 51:23 - 51:26
    of a typical sample of
  • 51:26 - 51:32
    values for Z. Okay?
  • 51:32 - 51:33
    Yeah? Student:You have by axis, right? And
  • 51:33 - 51:34
    the axis data pertains samples. Instructor (Andrew
  • 51:34 - 51:36
    Ng):Oh, yes, right. Student:So sort of projecting
  • 51:36 - 51:38
    them into
  • 51:38 - 51:43
    that? Instructor (Andrew Ng):Let's not talk about projections yet, but, yeah, right. So these
  • 51:43 - 51:45
    beige points - so that's like X1 and that's X2
  • 51:45 - 51:47
    and so on, right? So the beige points are
  • 51:47 - 51:51
    what I
  • 51:51 - 51:52
    see. And so
  • 51:52 - 51:55
    in reality all you ever get to see are the
  • 51:55 - 51:56
    X's, but
  • 51:56 - 52:00
    just like in the mixture of Gaussians model I tell a story about what I would
  • 52:00 - 52:04
imagine the Gauss... - the data came from two Gaussians -
  • 52:04 - 52:09
that there was a random variable Z that led to the generation of X's from two Gaussians.
  • 52:09 - 52:12
So in the same way I'm sort of telling a story here, where all the algorithm actually
  • 52:12 - 52:15
    sees are the orange points, but we're
  • 52:15 - 52:19
    gonna tell a story about how the data came about and that story is
  • 52:19 - 52:24
    what comprises the factor analysis model. Okay?
  • 52:24 - 52:27
So one of the ways to see the intuition of this model is that we're
  • 52:27 - 52:31
    going to think of the model as one way
  • 52:31 - 52:35
    just informally, not formally, but one way to think about this model is
  • 52:35 - 52:38
    you can think of this factor analysis model
  • 52:38 - 52:42
    as modeling the data from coming from a lower dimensional subspace
  • 52:42 - 52:43
    more
  • 52:43 - 52:45
or less, so the data X here
  • 52:45 - 52:48
is approximately on a one-D line
  • 52:48 - 52:50
    and then plus a little bit of noise - plus a little bit
  • 52:50 - 52:53
    of random noise so the X isn't exactly on this one D line.
  • 52:53 - 52:57
    That's one informal way of thinking about factor analysis.
  • 52:57 - 53:04
    We're
  • 53:09 - 53:16
    not doing great on time.
  • 53:18 - 53:20
    Well, let's do this.
  • 53:20 - 53:23
    So let me just do one more quick example,
  • 53:23 - 53:24
    which is,
  • 53:24 - 53:26
  • 53:26 - 53:29
    in this example,
  • 53:29 - 53:35
    let's say Z is in R2
  • 53:35 - 53:38
    and X is in R3, right?
  • 53:38 - 53:40
    And so
  • 53:40 - 53:45
    in this example Z, your data Z now lies in 2-D
  • 53:45 - 53:49
    and so let me draw this on a sheet of paper. Okay?
  • 53:49 - 53:51
    So let's say the
  • 53:51 - 53:54
    axis of my paper are the Z1 and Z2 axis
  • 53:54 - 53:56
    and so
  • 53:56 - 53:58
    here is a typical sample
  • 53:58 - 54:01
    of point Z, right?
  • 54:01 - 54:05
    And so we'll then take the sample Z -
  • 54:05 - 54:11
    well, actually let me draw this here as well. All right. So
  • 54:11 - 54:13
    this is a typical sample for Z going on the Z1 and Z2 axis and
  • 54:13 - 54:17
    I guess the origin would be here.
  • 54:17 - 54:20
    So center around zero.
  • 54:20 - 54:24
    And then we'll take those and map it to mu
  • 54:24 - 54:25
    plus
  • 54:25 - 54:26
    lambda Z
  • 54:26 - 54:30
    and what that means is if you imagine the free space of this classroom is R3.
  • 54:30 - 54:35
    What that means is we'll take this sample of Z's and we'll map it to
  • 54:35 - 54:39
positions in free space. So we'll take this sheet of paper and move it somewhere, in some
  • 54:39 - 54:42
    orientation in 3-D space.
  • 54:42 - 54:44
    And the last step is
  • 54:44 - 54:49
    you have X equals mu plus lambda Z plus
  • 54:49 - 54:52
epsilon, and so you would take this set of points which lie in some plane
  • 54:52 - 54:54
    in our 3-D space
  • 54:54 - 54:56
and add a little bit of noise to these,
  • 54:56 - 54:58
    and the noise will, sort of, come from
  • 54:58 - 55:01
Gaussians that are
  • 55:01 - 55:04
axis-
  • 55:04 - 55:11
    aligned. Okay? So you end up with a data set that's sort of like a fat pancake or a little bit of fuzz off your pancake.
  • 55:12 - 55:13
  • 55:13 - 55:15
    So that's a model
  • 55:15 - 55:22
    - let's actually talk about how to fit the parameters of the model. Okay?
  • 55:28 - 55:32
    In order to
  • 55:32 - 55:34
describe how to fit the model, I first
  • 55:34 - 55:36
    we need to
  • 55:36 - 55:40
re-write Gaussians in a very slightly different way.
  • 55:40 - 55:43
    So, in particular,
  • 55:43 - 55:45
    let's say I have a vector X and
  • 55:45 - 55:51
    I'm gonna use this notation to denote partition vectors, right? X1, X2
  • 55:51 - 55:53
    where
  • 55:53 - 55:57
if X1 is, say, an r-dimensional vector
  • 55:57 - 55:59
    then X2 is
  • 55:59 - 56:00
an s-dimensional vector
  • 56:00 - 56:02
    and X
  • 56:02 - 56:05
    is an R plus S
  • 56:05 - 56:09
    dimensional vector. Okay? So I'm gonna use this notation to denote just
  • 56:09 - 56:13
    the taking of vector and, sort of, partitioning the vector into two halves. The
  • 56:13 - 56:15
    first R elements followed by
  • 56:15 - 56:17
    the last
  • 56:17 - 56:22
    S elements.
  • 56:22 - 56:26
    So
  • 56:26 - 56:29
    let's say you have X
  • 56:29 - 56:35
    coming from a Gaussian distribution with mean mu and covariance sigma
  • 56:35 - 56:37
    where
  • 56:37 - 56:39
    mu
  • 56:39 - 56:42
    is itself a partition vector.
  • 56:42 - 56:47
    So break mu up into two pieces mu1 and mu2
  • 56:47 - 56:50
    and the covariance matrix sigma
  • 56:50 - 56:55
    is now a partitioned matrix.
  • 56:55 - 56:57
    Okay? So what this means is
  • 56:57 - 56:59
    that you take the covariance matrix sigma
  • 56:59 - 57:01
    and I'm going to break it up into four blocks,
  • 57:01 - 57:03
    right? And so the
  • 57:03 - 57:06
    dimension of this is there will be R elements here
  • 57:06 - 57:08
    and there will be
  • 57:08 - 57:10
    S elements here
  • 57:10 - 57:14
    and there will be R elements here. So, for example, sigma 1, 2 will
  • 57:14 - 57:16
    be an
  • 57:16 - 57:20
    R by S matrix.
  • 57:20 - 57:27
    It's R elements tall and S elements wide.
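Written out as an equation, the partitioned notation just described looks like this (a reconstruction of the board notation):

```latex
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^{r+s},
\qquad
x \sim \mathcal{N}(\mu, \Sigma),
\qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\qquad
\Sigma_{12} \in \mathbb{R}^{r \times s}.
```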
  • 57:32 - 57:36
    So this Gaussian I've written down is really a joint distribution over a whole lot of variables, right?
  • 57:36 - 57:40
    So X is a vector, so P of X is a joint distribution
  • 57:40 - 57:43
    over X1 through
  • 57:43 - 57:47
    X of R plus S.
  • 57:47 - 57:51
    We can then ask what are the marginal and conditional distributions of this
  • 57:51 - 57:53
    Gaussian?
  • 57:53 - 57:54
    So, for example,
  • 57:54 - 57:58
    with my Gaussian, I know what P of X is, but can
  • 57:58 - 58:03
    I compute the marginal distribution of X1, right? And so P of X1 is just equal to,
  • 58:03 - 58:07
    of course, the integral over X2 of P of X1
  • 58:07 - 58:09
    comma X2
  • 58:09 - 58:11
    DX2.
  • 58:11 - 58:14
    And if you actually perform that computation you
  • 58:14 - 58:15
    find
  • 58:15 - 58:18
    that P of X1,
  • 58:18 - 58:20
    I guess, is Gaussian
  • 58:20 - 58:21
    with mean
  • 58:21 - 58:23
    given by mu1
  • 58:23 - 58:25
    and sigma 1, 1. All right.
  • 58:25 - 58:26
    So this is sort
  • 58:26 - 58:30
    of no surprise. The marginal distribution of a
  • 58:30 - 58:31
    Gaussian
  • 58:31 - 58:33
    is itself the
  • 58:33 - 58:33
    Gaussian and
  • 58:33 - 58:35
    you just take out the
  • 58:35 - 58:36
    relevant
  • 58:36 - 58:37
    sub-blocks of
  • 58:37 - 58:42
    the covariance matrix and the relevant sub-vector of the
  • 58:42 - 58:47
    mean vector mu.
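As an equation, the marginalization just described is:

```latex
p(x_1) = \int_{x_2} p(x_1, x_2)\, dx_2,
\qquad
x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}).
```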
  • 58:47 - 58:50
    You can also compute conditionals. You can ask -
  • 58:50 - 58:52
    what is P of X1
  • 58:52 - 58:53
    given
  • 58:53 - 58:57
    a specific value for X2, right?
  • 58:57 - 59:03
    And so the way you compute that is, well, the usual way P of X1
  • 59:03 - 59:05
    comma X2
  • 59:05 - 59:08
    divided by P of X2, right?
  • 59:08 - 59:10
    And so
  • 59:10 - 59:14
    you know what both of these formulas are, right? The numerator - well,
  • 59:14 - 59:17
    this is just a usual
  • 59:17 - 59:20
    Gaussian that your joint distribution over X1, X2 is a Gaussian with
  • 59:20 - 59:22
    mean mu and covariance sigma
  • 59:22 - 59:24
    and
  • 59:24 - 59:27
    this
  • 59:27 - 59:30
    by that marginalization operation I talked about, is that.
  • 59:30 - 59:31
  • 59:31 - 59:35
    So if you actually plug in the formulas for these two Gaussians and if you simplify
  • 59:35 - 59:36
  • 59:36 - 59:39
    the simplification step is actually fairly non-trivial.
  • 59:39 - 59:42
    If you haven't seen it before this will actually be - this will actually be
  • 59:42 - 59:44
    somewhat difficult to do.
  • 59:44 - 59:46
    But if you
  • 59:46 - 59:50
    plug this in for Gaussian and simplify that
  • 59:50 - 59:51
    expression
  • 59:51 - 59:53
    you find that
  • 59:53 - 59:56
    conditioned on the value of
  • 59:56 - 60:01
    X2, X1 is - the distribution of X1 conditioned on X2 is itself going to be
  • 60:01 - 60:06
    Gaussian
  • 60:06 - 60:08
    and it will have mean mu
  • 60:08 - 60:14
    of 1 given 2 and covariance
  • 60:14 - 60:16
  • 60:16 - 60:18
    sigma of 1 given 2, where - well, the simplification and derivation I'm not gonna show, but
  • 60:18 - 60:21
    the formula for mu of
  • 60:21 - 60:28
    1 given 2 is given by this
  • 60:32 - 60:39
    and I
  • 60:40 - 60:44
    think
  • 60:44 - 60:49
    the sigma of 1 given 2 is given by that. Okay?
  • 60:49 - 60:52
    So these are just
  • 60:52 - 60:56
    useful formulas to know for how to find the conditional distributions
  • 60:56 - 60:59
    of the Gaussian and the marginal distributions of a Gaussian.
  • 60:59 - 61:03
    I won't actually show the derivation for this. Student: Could you
  • 61:03 - 61:10
    repeat the [inaudible]? Instructor
  • 61:12 - 61:16
    (Andrew Ng): Sure. So this one on the left, mu of 1 given 2
  • 61:16 - 61:17
    equals
  • 61:17 - 61:19
    mu1 plus
  • 61:19 - 61:21
    sigma 1,2,
  • 61:21 - 61:23
    sigma 2,2 inverse
  • 61:23 - 61:25
    times X2 minus mu2
  • 61:25 - 61:30
    and this is sigma 1 given 2 equals sigma 1,1
  • 61:30 - 61:32
    minus sigma 1,2
  • 61:32 - 61:33
    sigma 2,2 inverse
  • 61:33 - 61:34
    sigma 2,1. Okay?
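Writing out the two formulas just read back:

```latex
x_1 \mid x_2 \;\sim\; \mathcal{N}\big(\mu_{1|2},\, \Sigma_{1|2}\big),
\qquad
\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),
\qquad
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
```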
  • 61:34 - 61:40
    These are also in the
  • 61:40 - 61:47
    lecture
  • 61:53 - 62:00
    notes. Shoot. Not doing as well as I was hoping to on time.
  • 62:01 - 62:08
    Well, actually it is. Okay?
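A small numpy sketch that applies the marginal and conditional formulas above by slicing out the blocks; the function name and the example numbers are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def partitioned_gaussian(mu, Sigma, r, x2):
    """Marginal of x1 and conditional x1 | x2 for x = [x1; x2] ~ N(mu, Sigma),
    where x1 is the first r components of x."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    marginal = (mu1, S11)                 # x1 ~ N(mu1, Sigma_11)

    K = S12 @ np.linalg.inv(S22)          # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + K @ (x2 - mu2)        # mu_{1|2}
    Sigma_cond = S11 - K @ S21            # Sigma_{1|2}
    return marginal, (mu_cond, Sigma_cond)

# Example with made-up numbers (r = 1, s = 1):
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
(marg_mu, marg_S), (cond_mu, cond_S) = partitioned_gaussian(mu, Sigma, r=1, x2=np.array([2.0]))
```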
  • 62:19 - 62:21
    So it
  • 62:21 - 62:26
    turns out - I think I'll skip this in the interest of time. So it turns out that -
  • 62:26 - 62:30
    well, so let's go back and use these in the factor analysis model, right?
  • 62:30 - 62:32
    It turns out that
  • 62:32 - 62:34
    you can go back
  • 62:34 - 62:35
    and
  • 62:39 - 62:46
    oh,
  • 62:46 - 62:49
    do I want to do this? I kind of need this
  • 62:49 - 62:52
    though. So let's go back and figure out
  • 62:52 - 62:58
    just what the joint distribution factor analysis assumes on Z and X's. Okay?
  • 62:58 - 62:59
    So
  • 62:59 - 63:00
  • 63:00 - 63:03
    under the factor analysis model
  • 63:03 - 63:07
    Z and X, the random variables Z and X
  • 63:07 - 63:12
    have some joint distribution given by - I'll write this
  • 63:12 - 63:13
    vector as mu
  • 63:13 - 63:14
    ZX
  • 63:14 - 63:17
    and some covariance matrix sigma.
  • 63:17 - 63:21
    So let's go back and figure out what mu ZX is and what sigma is and I'll
  • 63:21 - 63:23
    do this so that
  • 63:23 - 63:25
    we'll get a little bit more practice with
  • 63:25 - 63:29
    partition vectors and partition matrices.
  • 63:29 - 63:31
    So just to remind you, right? You
  • 63:31 - 63:35
    have Z as Gaussian with mean zero and covariance the identity
  • 63:35 - 63:41
    and X is mu plus lambda Z plus epsilon where epsilon is Gaussian with
  • 63:41 - 63:43
    mean zero
  • 63:43 - 63:47
    covariance Psi. So I'm just writing out the same equations again.
  • 63:47 - 63:53
    So let's first figure out what this vector mu ZX is. Well,
  • 63:53 - 63:55
    the expected value of Z
  • 63:55 - 63:56
    is
  • 63:56 - 63:57
    zero
  • 63:57 - 63:58
  • 63:58 - 64:04
    and, again, as usual I'll often drop the square brackets around here.
  • 64:04 - 64:06
    And the expected value of X is -
  • 64:06 - 64:09
    well, the expected value of mu
  • 64:09 - 64:11
    plus lambda
  • 64:11 - 64:14
    Z
  • 64:14 - 64:17
    plus epsilon. So these two terms have zero expectation
  • 64:17 - 64:20
    and so the expected value of X
  • 64:20 - 64:23
    is just mu
  • 64:23 - 64:24
  • 64:24 - 64:27
    and so that vector
  • 64:27 - 64:28
    mu ZX, right,
  • 64:28 - 64:31
    in my parameter for the Gaussian
  • 64:31 - 64:34
    this is going to be
  • 64:34 - 64:37
    the expected value of this partition vector
  • 64:37 - 64:40
    given by this partition Z and X
  • 64:40 - 64:43
    and so that would just be
  • 64:43 - 64:44
    zero
  • 64:44 - 64:47
    followed by mu. Okay?
  • 64:47 - 64:48
    And so that's
  • 64:48 - 64:51
    a d-dimensional zero
  • 64:51 - 64:57
    followed by an n-dimensional mu.
  • 64:57 - 65:02
    Now let's work out what the covariance matrix sigma is.
  • 65:02 - 65:09
    So
  • 65:21 - 65:24
    the covariance matrix sigma
  • 65:24 - 65:28
    - if you work it out from the
  • 65:28 - 65:35
    definition - so this
  • 65:44 - 65:46
    is
  • 65:46 - 65:53
    again a partitioned matrix.
  • 66:03 - 66:04
    Okay? Will be -
  • 66:04 - 66:06
    so the covariance matrix sigma
  • 66:06 - 66:10
    will comprise four blocks like that
  • 66:10 - 66:14
    and so the upper left most block, which I write as sigma 1,1 -
  • 66:14 - 66:16
    well, that uppermost left block
  • 66:16 - 66:19
    is just
  • 66:19 - 66:21
    the covariance matrix of Z,
  • 66:21 - 66:25
    which we know is the identity. I was
  • 66:25 - 66:28
    gonna show you briefly how to derive some of the other blocks, right, so
  • 66:28 - 66:31
    sigma 1,2 that's the
  • 66:31 - 66:38
    upper - oh, actually,
  • 66:41 - 66:44
    excuse me. Sigma 2,1
  • 66:44 - 66:47
    which is the lower left block
  • 66:47 - 66:49
    that's E
  • 66:49 - 66:51
    of X
  • 66:51 - 66:55
    minus EX times Z minus EZ transpose.
  • 66:55 - 66:59
    So X is equal to mu
  • 66:59 - 67:02
    plus lambda Z
  • 67:02 - 67:03
    plus
  • 67:03 - 67:05
    epsilon, and then minus EX is
  • 67:05 - 67:07
    minus mu, and then
  • 67:07 - 67:11
    times Z transpose,
  • 67:11 - 67:14
    because the expected value of Z is zero, right,
  • 67:14 - 67:19
    so Z minus EZ is just Z.
  • 67:19 - 67:24
    And so if you simplify - or if you expand this out,
  • 67:24 - 67:27
    the plus mu and minus mu cancel out,
  • 67:27 - 67:30
    and so you have lambda times the expected value
  • 67:30 - 67:33
    of
  • 67:33 - 67:35
    ZZ transpose
  • 67:35 - 67:40
    plus the
  • 67:40 - 67:47
    expected value of epsilon Z transpose, which is
  • 67:53 - 67:57
    equal to
  • 67:57 - 68:00
    that, which is just equal to
  • 68:00 - 68:07
    lambda times the identity matrix. Okay? Does that make
  • 68:07 - 68:11
    sense? Cause
  • 68:11 - 68:15
    this term is equal to zero.
  • 68:15 - 68:22
    Both epsilon and Z are independent and have zero expectation, so the second term is zero.
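Putting those steps together, the lower-left block works out as:

```latex
\Sigma_{21}
= \mathbb{E}\big[(x - \mathbb{E}[x])(z - \mathbb{E}[z])^{T}\big]
= \mathbb{E}\big[(\Lambda z + \epsilon)\, z^{T}\big]
= \Lambda\,\mathbb{E}[z z^{T}] + \mathbb{E}[\epsilon z^{T}]
= \Lambda I + 0
= \Lambda .
```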
  • 68:48 - 68:53
    Well, so the final block is sigma 2,2, which is equal to the expected value of
  • 68:53 - 68:59
    mu plus lambda Z
  • 68:59 - 69:02
    plus epsilon minus mu
  • 69:02 - 69:04
    times the same thing transposed, right?
  • 69:04 - 69:08
    This is
  • 69:08 - 69:09
    equal to - and
  • 69:09 - 69:16
    I won't do this, but this simplifies to lambda lambda transpose plus Psi. Okay?
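The simplification that was skipped for this last block goes like this:

```latex
\Sigma_{22}
= \mathbb{E}\big[(\Lambda z + \epsilon)(\Lambda z + \epsilon)^{T}\big]
= \Lambda\,\mathbb{E}[z z^{T}]\,\Lambda^{T}
  + \Lambda\,\mathbb{E}[z \epsilon^{T}]
  + \mathbb{E}[\epsilon z^{T}]\,\Lambda^{T}
  + \mathbb{E}[\epsilon \epsilon^{T}]
= \Lambda\Lambda^{T} + \Psi .
```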
  • 69:18 - 69:21
    So
  • 69:21 - 69:24
    putting all this together
  • 69:24 - 69:29
    this tells us that the joint distribution of this vector ZX
  • 69:29 - 69:31
    is going to be Gaussian
  • 69:31 - 69:36
    with mean vector given by that,
  • 69:36 - 69:37
    which we worked out previously.
  • 69:37 - 69:39
    So
  • 69:39 - 69:42
    this is the mu ZX that we worked out previously,
  • 69:42 - 69:44
    and covariance matrix
  • 69:44 - 69:51
    given by
  • 69:54 - 70:01
    that. Okay?
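Collecting the blocks, the joint distribution being described is:

```latex
\begin{bmatrix} z \\ x \end{bmatrix}
\sim
\mathcal{N}\!\left(
\begin{bmatrix} \vec{0} \\ \mu \end{bmatrix},
\begin{bmatrix} I & \Lambda^{T} \\ \Lambda & \Lambda\Lambda^{T} + \Psi \end{bmatrix}
\right).
```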
  • 70:05 - 70:07
    So in principle - let's
  • 70:07 - 70:10
    see, so the parameters of our model
  • 70:10 - 70:12
    are mu,
  • 70:12 - 70:12
    lambda,
  • 70:12 - 70:14
    and Psi.
  • 70:14 - 70:15
    And so
  • 70:15 - 70:19
    in order to find the parameters of this model
  • 70:19 - 70:25
    we're given a training set of m examples
  • 70:25 - 70:29
    and so we'd like to do maximum likelihood estimation of the parameters.
  • 70:29 - 70:34
    And so in principle one thing you could do is you can actually write down
  • 70:34 - 70:37
    what P of XI is and,
  • 70:37 - 70:40
    right, so P of XI
  • 70:40 - 70:43
    is actually -
  • 70:43 - 70:46
    the distribution of X, right? If, again,
  • 70:46 - 70:49
    you can marginalize this Gaussian
  • 70:49 - 70:55
    and so the distribution of X, which is the lower half of this partition vector
  • 70:55 - 70:57
    is going to have
  • 70:57 - 70:58
    mean mu
  • 70:58 - 71:04
    and covariance given by lambda lambda transpose plus Psi. Right?
  • 71:04 - 71:05
    So that's
  • 71:05 - 71:07
    the distribution
  • 71:07 - 71:12
    that we're using to model P of X.
  • 71:12 - 71:17
    And so in principle one thing you could do is actually write down the log likelihood of
  • 71:17 - 71:20
    your parameters,
  • 71:20 - 71:24
    right? Which is just the log of the product over I - that
  • 71:24 - 71:25
    is, the sum over
  • 71:25 - 71:26
    I
  • 71:26 - 71:30
    log P of XI
  • 71:30 - 71:31
    where P of XI
  • 71:31 - 71:32
    will be given
  • 71:32 - 71:36
    by this Gaussian density, right. And
  • 71:36 - 71:40
    I'm using theta as a shorthand to denote all of my parameters.
  • 71:40 - 71:43
    And so you actually know what the density for Gaussian is
  • 71:43 - 71:50
    and so you can say P of XI is this Gaussian with mean mu and covariance
  • 71:50 - 71:53
    given by this lambda lambda transpose plus Psi.
  • 71:53 - 71:55
    So you can write down the log likelihood
  • 71:55 - 71:56
    of your parameters
  • 71:56 - 71:57
    as follows
  • 71:57 - 72:01
    and you can try to take derivatives of your log likelihood with respect to your
  • 72:01 - 72:02
    parameters
  • 72:02 - 72:05
    and maximize the log likelihood, all right.
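Written out, the log likelihood in question, using the marginal of X just noted, is:

```latex
\ell(\mu, \Lambda, \Psi)
= \sum_{i=1}^{m} \log p\big(x^{(i)}; \mu, \Lambda, \Psi\big)
= \sum_{i=1}^{m} \log \frac{1}{(2\pi)^{n/2}\,\big|\Lambda\Lambda^{T} + \Psi\big|^{1/2}}
  \exp\!\Big(-\tfrac{1}{2}\,(x^{(i)} - \mu)^{T}\big(\Lambda\Lambda^{T} + \Psi\big)^{-1}(x^{(i)} - \mu)\Big).
```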
  • 72:05 - 72:07
    It turns out that if you do that you end up with
  • 72:07 - 72:11
    sort of an intractable optimization problem, or at least one
  • 72:11 - 72:13
    that you - excuse me, you end up with an
  • 72:13 - 72:17
    optimization problem that you will not be able to find analytic, sort of,
  • 72:17 - 72:19
    closed-form solutions to.
  • 72:19 - 72:22
    So if you say my model of X is this and you want to find the
  • 72:22 - 72:24
    maximum likelihood parameter estimates,
  • 72:24 - 72:26
    you won't be able to find
  • 72:26 - 72:29
    the maximum likelihood estimate of the parameters in closed form.
  • 72:29 - 72:30
  • 72:30 - 72:32
    So what I would have liked to do
  • 72:32 - 72:39
    is - well,
  • 72:40 - 72:46
    so in order to fit parameters to this model
  • 72:46 - 72:52
    what we'll actually do is use the EM Algorithm
  • 72:52 - 72:56
    in which, in
  • 72:56 - 73:00
    the E step, right?
  • 73:00 - 73:07
    We'll compute that
  • 73:08 - 73:11
  • 73:11 - 73:15
    and this formula looks the same except that one difference is that now
  • 73:15 - 73:18
    Z is a continuous random variable
  • 73:18 - 73:19
    and so
  • 73:19 - 73:20
    in the E step
  • 73:20 - 73:24
    we actually have to find the density QI of ZI, where the, sort of, E step actually
  • 73:24 - 73:26
    requires that we find
  • 73:26 - 73:28
    the posterior distribution
  • 73:28 - 73:32
    of - so the density of the random variable ZI
  • 73:32 - 73:34
    and then the M step
  • 73:34 - 73:38
    will then perform
  • 73:38 - 73:41
    the following maximization
  • 73:41 - 73:42
  • 73:42 - 73:45
    where, again, because Z is now
  • 73:45 - 73:50
    continuous we
  • 73:50 - 73:57
    now need to integrate over Z. Okay? Where
  • 73:59 - 74:02
    in the M step now because ZI was continuous we now have an
  • 74:02 - 74:06
    integral over Z rather than a sum.
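In symbols, the E and M steps being described, with an integral over the continuous Z, are:

```latex
\text{E-step:}\quad
Q_i\big(z^{(i)}\big) = p\big(z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi\big),
\qquad
\text{M-step:}\quad
\max_{\mu, \Lambda, \Psi}\;
\sum_{i=1}^{m} \int_{z^{(i)}}
Q_i\big(z^{(i)}\big)\,
\log \frac{p\big(x^{(i)}, z^{(i)}; \mu, \Lambda, \Psi\big)}{Q_i\big(z^{(i)}\big)}\;
dz^{(i)}.
```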
  • 74:06 - 74:09
    Okay? I was hoping to go a little bit further in deriving these things, but I don't have
  • 74:09 - 74:11
    time today so we'll wrap that up
  • 74:11 - 74:15
    in the next lecture, but before I close let's check if there are questions
  • 74:15 - 74:16
  • 74:16 - 74:23
    about the whole factor analysis model. Okay.
  • 74:28 - 74:31
    So we'll come back in the next lecture;
  • 74:31 - 74:35
    I will wrap up this model, because I want to go a little bit deeper into the E
  • 74:35 - 74:37
    and M steps, as there are some
  • 74:37 - 74:40
    tricky parts for the factor analysis model specifically.
  • 74:40 - 74:41
    Okay. I'll see you in a
  • 74:41 - 74:41
    couple of days.
Title:
Lecture 13 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on expectation-maximization in the context of the mixture of Gaussian and naive Bayes models, as well as factor analysis and digression.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:14:57