This presentation is delivered by the Stanford Center for Professional Development.

So welcome back, and what I want to do today is continue our discussion of the EM Algorithm. In particular, I want to take the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model - a mixture of naive Bayes model - and then a large part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about some useful properties of Gaussian distributions.

So just to recap where we are: in the previous lecture, I started to talk about unsupervised learning, which is the class of machine learning problems where you're given an unlabeled training set comprising m examples, right? The fact that there are no labels is what makes this unsupervised learning.

So one problem that I talked about last time was: what if you're given a data set that looks like this and you want to model the density P of X from which you think the data has been drawn? With a data set like this, maybe you think it came from a mixture of two Gaussians, and so we started to talk about an algorithm for fitting a mixture of Gaussians model, all right?

So we said that we would model the density of X, P of X, as the sum over Z of P of X given Z times P of Z, where this latent random variable - meaning this hidden random variable - Z indicates which of the two Gaussian distributions each of your data points came from. And so we had that Z was multinomial with parameter phi, and X, conditioned on coming from the j-th Gaussian, was given by a Gaussian with mean mu J and covariance sigma J, all right? So, like I said, at the beginning of the previous lecture I first talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model - phi, mu, and sigma - but then in the second half of the previous lecture I talked about what's called the EM Algorithm, in which our goal is maximum likelihood estimation of the parameters.
So we want to maximize, in terms of theta, the usual log likelihood - well, parameterized by theta. And because we have a latent random variable Z, this is really maximizing, in terms of theta, the sum over I of the log of the sum over ZI of P of XI, ZI, parameterized by theta. Okay?

So using Jensen's inequality, last time we worked out the EM Algorithm, in which in the E step we would choose these probability distributions QI to be the posterior on Z given X, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the two steps we worked out last time.

And the cartoon that I drew was that you have this log likelihood function L of theta that's often hard to maximize, and what the E step does is choose these probability distributions QI. In the cartoon, what that corresponded to was finding a lower bound for the log likelihood - with theta on the horizontal axis - and then in the M step you maximize that lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And this little curve here, this lower bound function, is really the right-hand side of that arg max - this whole thing inside the arg max. If you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to the new maximum of the lower bound.

So why would you even bother with the EM Algorithm? It turns out that very often - and this will be true for all the examples we see today - performing the maximization in the M step will be tractable and can often be done analytically in closed form.
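Just so the two steps are written out in one place, here they are in the notation from last lecture (this is simply a restatement of the formulas on the board):

E-step: for each i, set
\[
Q_i\big(z^{(i)}\big) := p\big(z^{(i)} \mid x^{(i)}; \theta\big).
\]
M-step: set
\[
\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i\big(z^{(i)}\big) \log \frac{p\big(x^{(i)}, z^{(i)}; \theta\big)}{Q_i\big(z^{(i)}\big)}.
\]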
Whereas if you were trying to maximize this objective directly - if you take this formula on the right, this maximum likelihood objective, and set its derivatives to zero and try to solve - you'll find that you're unable to obtain a closed-form solution to this maximization. Okay?

And to give you an example of that: you remember our discussion on exponential family models, right? It turns out that if X and Z are jointly in an exponential family - so if P of X, Z parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution - then the M step here will be tractable, and the E step will also be tractable, and so you can do each of these steps very easily. Whereas trying to perform the original maximum likelihood estimation problem on this one will be computationally very difficult: you're going to set the derivatives to zero and try to solve, and you won't be able to find an analytic solution. Okay?

So what I want to do in a second is actually take this view of the EM Algorithm and apply it to the mixture of Gaussians model. I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM Algorithm.

It turns out there's one other way of thinking about the EM Algorithm, which is the following: I can define an optimization objective J of theta, Q, and define it to be this - this is just the thing inside the arg max in the M step. Okay?

And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. So in other words, we proved last time that for any value of theta and Q, the log likelihood upper-bounds J of theta and Q.

And so, just to relate this back to yet more things that you already know, you can also think of this in terms of coordinate ascent - remember our discussion a while back on the coordinate ascent optimization algorithm.
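Written out, the objective just defined and the bound we proved are (same notation as above):
\[
J(\theta, Q) = \sum_i \sum_{z^{(i)}} Q_i\big(z^{(i)}\big) \log \frac{p\big(x^{(i)}, z^{(i)}; \theta\big)}{Q_i\big(z^{(i)}\big)},
\qquad
\ell(\theta) \ge J(\theta, Q) \;\text{ for all } \theta, Q.
\]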
So we can show - and I won't actually show this here, so just take my word for it and verify it at home if you want - that EM is just coordinate ascent on the function J. So in the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay? So this is another view of the EM Algorithm that shows why it has to converge, for example: you can view it in terms of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to a mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM Algorithm as a whole? Okay, cool.

So let's go ahead and work on the mixture of Gaussians EM, all right? MOG is my abbreviation for mixture of Gaussians. So in the E step we compute those Q distributions, right? In particular - so Q is the probability distribution over the latent random variable Z, and so in the E step I'm going to figure out what Q of ZI equals J is. And you can think of this as my writing P of ZI equals J under the Q distribution - that's what this notation means.

And so the EM Algorithm tells us that, let's see, QI of J is the posterior probability of Z taking the value J given XI and all your parameters. And the way you compute this is by Bayes rule: that is going to be equal to P of XI given ZI equals J, times P of ZI equals J, divided by - right? That's all by Bayes rule.

And this first term you know, because XI given ZI equals J is a Gaussian with mean mu J and covariance sigma J, so to compute this first term you plug in the formula for the Gaussian density with parameters mu J and sigma J. And this second term you know because Z was multinomial with parameters given by phi, so the probability of ZI being equal to J is just phi J, and you can substitute these terms in. Similarly, you do the same thing for the denominator, and that's how you work out what Q is.
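Putting that Bayes rule computation into a single formula, the E-step weight for the mixture of Gaussians comes out to the following (this is the same formula as in the lecture notes, writing the Gaussian density as N(x; mu, Sigma)):
\[
w_j^{(i)} \;=\; Q_i\big(z^{(i)} = j\big)
\;=\;
\frac{\phi_j \,\mathcal{N}\big(x^{(i)}; \mu_j, \Sigma_j\big)}
     {\sum_{l} \phi_l \,\mathcal{N}\big(x^{(i)}; \mu_l, \Sigma_l\big)}.
\]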
Okay? And in the previous lecture this value - the probability that ZI equals J under the Q distribution - was what I denoted as WIJ. So that would be the E step.

And then in the M step we maximize this with respect to all of our parameters - well, I seem to be writing the same formula down a lot today. All right.

And just so we're completely concrete about how you do that: plugging in the quantities that you know, that becomes this, let's see. Right. And so that we're completely concrete about what the M step is doing: in the M step, that sum over ZI becomes a sum over J - the sum over J is the sum over all the possible values of ZI - and then this first term here is my Gaussian density, my P of XI given ZI, and that's P of ZI. Okay?

And so to maximize this with respect to - say you want to maximize this with respect to all of your parameters phi, mu, and sigma. So to maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that that becomes your update to mu J. Okay?

The exact equation is less important - all of these equations are written down in the lecture notes; I'm writing these down just to be completely concrete about what the M step means. So write down that formula, plug in the densities you know, take the derivative, set it to zero, and solve for mu J, and in the same way you set the derivatives equal to zero and solve for your updates for your other parameters phi and sigma as well. Okay?
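For reference, the closed-form updates that this derivative computation gives for the Gaussian parameters are the usual weighted averages (the phi update is discussed next, since it involves a constraint):
\[
\mu_j := \frac{\sum_{i=1}^m w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^m w_j^{(i)}},
\qquad
\Sigma_j := \frac{\sum_{i=1}^m w_j^{(i)}\,\big(x^{(i)} - \mu_j\big)\big(x^{(i)} - \mu_j\big)^T}{\sum_{i=1}^m w_j^{(i)}}.
\]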
Well, let me just point out one little tricky bit here that you haven't seen before - most of you probably already know it, but I'll just mention it - which is that since phi here is a multinomial distribution, when you take this formula and maximize it with respect to phi you actually have an additional constraint: the sum over J of phi J must be equal to one. All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize it with respect to the parameters phi J you need to respect the constraint that the sum over J of phi J equals one.

And you already know how to do constrained maximization, right? So I'll bring back the method of Lagrange multipliers - we generalized the Lagrangian when we talked about support vector machines. And so to actually perform the maximization in terms of phi J, you construct the Lagrangian, which is the objective from above - I'll denote it with the dot dot dot - plus the Lagrange multiplier times that constraint, where this is the Lagrange multiplier and this is your optimization objective.

And so to actually solve for the parameters phi J you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math, you get the appropriate values to update the phi J's to, which I won't do here - all the full derivations are in the lecture notes.

Okay. And so if you actually perform all these computations you can also verify that. So I just wrote down a bunch of formulas for the EM Algorithm. At the beginning of the last lecture I said, for the mixture of Gaussians model, here's the formula for computing the WIJ's and here's a formula for computing the mu's and so on, and this derivation is where all of those formulas actually come from. Okay?

Questions about this? Yeah? Student:[Inaudible] Instructor (Andrew Ng):Oh, I see. So it turns out that, yes, there's also a constraint that the phi J's must be greater than or equal to zero. It turns out that, if you want, you could actually write down the generalized Lagrangian incorporating all of these constraints as well and solve subject to those constraints.
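As a sketch of the constrained maximization just described - keeping only the terms of the M-step objective that depend on phi, ignoring the non-negativity constraint for the moment, and writing beta for the Lagrange multiplier - the Lagrangian and the update it yields are:
\[
\mathcal{L}(\phi) = \sum_{i=1}^m \sum_{j} w_j^{(i)} \log \phi_j \;+\; \beta\Big(\sum_j \phi_j - 1\Big),
\qquad\text{giving}\qquad
\phi_j := \frac{1}{m}\sum_{i=1}^m w_j^{(i)}.
\]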
It turns out that in this particular derivation - and actually this happens very often when we find maximum likelihood estimates of multinomial distribution probabilities - if you ignore those non-negativity constraints and just maximize the formula, luckily you end up with values that are greater than or equal to zero anyway. And if, even ignoring those constraints, you end up with parameters that are greater than or equal to zero, that shows that that must be the correct solution, because adding the constraint won't change anything. So it turns out that if you ignore this constraint and just do what I wrote down, you actually get the right answer. Okay? Great.

So let me just very quickly talk about one more example of a mixture model. And the perfect example for this is: imagine you want to do text clustering, right? So someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com. That's one application of text clustering, where you might want to look at all of the news stories about today - all the news stories written by all the online news websites about whatever happened yesterday - and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay?

So how do you apply the EM Algorithm to text clustering? I want to do this to illustrate an example in which you run the EM Algorithm on discrete-valued inputs, where the training examples XI are discrete-valued. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes, I talked about two event models: one was the multivariate Bernoulli event model, and one was the multinomial event model. Today I'm going to use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. But in this setting we're given a training set X1 through XM. So we're given M text documents, where each XI lies in {0, 1} to the N.
So each of our training examples is an n-dimensional bit vector, right? This was the representation where XIJ indicates whether word J appears in document I, right? And let's say that we're going to model ZI - our latent random variable, meaning our hidden random variable - as taking on two values, zero and one, so this means I'm just going to find two clusters, and you can generalize to however many clusters you want.

So in the mixture of naive Bayes model we assume that ZI is distributed Bernoulli with some value of phi, so there's some probability of each document coming from cluster one or from cluster zero. And we assume that the probability of XI given ZI - right? - makes the same naive Bayes assumption as we did before. Okay?

And more specifically, the probability of XIJ equals one, given ZI equals, say, zero, will be given by a parameter phi subscript J given Z equals zero. So if you take this chalkboard and you take all instances of the letter Z and replace them with Y, then you end up with exactly the same equations as I wrote down for naive Bayes a long time ago. Okay?

And I'm not actually going to work out the mechanics of deriving the EM Algorithm, but it turns out that if you take this joint distribution of X and Z, and you work out what the equations are for the EM algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute these weights WI, which are going to be equal to your posterior probability of ZI being equal to one conditioned on XI, parameterized by your phi's and given your parameters, and in the M step - okay? - and that's the equation you get in the M step. Again, the equations themselves aren't too important. I'll give you a second to finish writing, I guess.
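Since the board equations aren't reproduced in the transcript, here is a sketch of what the updates look like for this two-cluster multivariate Bernoulli model, following the standard derivation (w^(i) denotes the E-step weight just described):
\[
\text{E-step:}\quad w^{(i)} := P\big(z^{(i)} = 1 \mid x^{(i)};\, \phi_{j\mid z},\, \phi\big),
\]
\[
\text{M-step:}\quad
\phi_{j\mid z=1} := \frac{\sum_{i=1}^m w^{(i)} x_j^{(i)}}{\sum_{i=1}^m w^{(i)}},
\qquad
\phi_{j\mid z=0} := \frac{\sum_{i=1}^m \big(1 - w^{(i)}\big) x_j^{(i)}}{\sum_{i=1}^m \big(1 - w^{(i)}\big)},
\qquad
\phi := \frac{1}{m}\sum_{i=1}^m w^{(i)}.
\]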
And when you're done writing, take a look at these equations and see if they make intuitive sense to you - why these three equations sound like they might be the right thing to do. Yeah? Student:[Inaudible] Instructor (Andrew Ng):Say that again. Student:Y - Instructor (Andrew Ng):Oh, yes, thank you. Right. Sorry - everywhere I wrote Y, I meant Z. Yeah? Student:[Inaudible] in the first place? Instructor (Andrew Ng):No. Normally you initialize the phi's to be something else, say randomly. So just like in naive Bayes, where we saw zero probabilities as a bad thing, for the same reason you try to avoid zero probabilities here, yeah. Okay?

And so the intuition behind these equations is: in the E step, the WI's are your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuitions behind the EM Algorithm that we talked about in the previous lecture. So in the E step we compute these weights that tell us: do I think this document came from cluster one or cluster zero?

And then in the M step, the numerator is a sum over all the elements of my training set of - so, informally, WI is one when I think the document came from cluster one, and so this will essentially sum up all the times I saw word J in documents that I think are in cluster one. And these are weighted by the actual probability that I think the document came from cluster one. And then I'll divide by - again, if all of these WI's were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the WI's were either ones or zeros, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word J. Okay?

But in the EM Algorithm you don't make a hard assignment decision about whether a document is in cluster one or cluster zero; you instead represent your uncertainty about cluster membership with the parameters WI. Okay? It actually turns out that when you implement this particular model, by the nature of this computation all the values of the WI's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeros. This is a property of naive Bayes.
If you actually compute these probabilities for all those documents, you find that each WI is either 0.0001 or 0.999 - it'll be amazingly close to either zero or one - and so the M step is pretty much guessing whether each document is in cluster one or cluster zero, and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay? Cool.

And if some of these equations don't look that familiar to you anymore, go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay.

Of course, the way I got these equations was by working through the machinery of the EM Algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and in the M step setting the derivatives equal to zero and solving; that's how you get the E step and the M step.

So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is really two reasons. One is that factor analysis is kind of a useful model. It's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. But the other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular, factor analysis will be an example in which we do EM where the latent random variable - the hidden random variable Z - is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you've seen before. It turns out the full derivation of EM for factor analysis is extremely long and complicated, so I won't inflict all of that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are just a few steps in the factor analysis derivation that I want to explicitly illustrate.

So let me actually motivate the model, and it's really a contrast to the mixture of Gaussians model, all right?
So for the mixture of Gaussians model, which was our first model, I actually motivated it by drawing a data set like this, right? Say you have a data set that looks like this. So this was a problem where the data is two-dimensional and you have, I don't know, maybe 50 or 100 training examples, whatever, right? And I said, given an unlabeled training set like this, maybe you want to model it as a mixture of two Gaussians. Okay?

And so mixture of Gaussians models tend to be applicable where m is larger - and often much larger - than n: where the number of training examples you have is at least as large as, and is usually much larger than, the dimension of the data.

What I want to do now is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples. Okay? So how do you model such very high dimensional data? You will see this sometimes, right? If you run a plant or a factory, maybe you have a thousand measurements taken all through your plant, but you only have, say, 20 days of data. So you can have 1,000-dimensional data, but only 20 examples of it. So given data that has this property - given a training set of m examples - what can you do to try to model the density of X?

So one thing you can do is try to model it just as a single Gaussian. So instead of a mixture of Gaussians, you model it as a single Gaussian and say X is Gaussian with mean mu and covariance sigma, where sigma is going to be an n by n matrix. And if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right? So that makes sense. And the maximum likelihood estimate of the covariance matrix sigma will be this, all right?
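Written out, those maximum likelihood estimates are:
\[
\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)},
\qquad
\Sigma = \frac{1}{m}\sum_{i=1}^m \big(x^{(i)} - \mu\big)\big(x^{(i)} - \mu\big)^T.
\]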
But it turns out that in this regime where the data is much higher dimensional - excuse me, where the data's dimension is much larger than the number of training examples you have - if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or that it has zero eigenvalues - I hope one of those terms makes sense. Another way of saying it is that the matrix sigma will be non-invertible.

And just in pictures, one concrete example is if n equals m equals two: you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions - this is X1 and X2; this is my unlabeled data. If you fit a Gaussian to this data set, you find that - well, you remember I used to draw the contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood Gaussian for this corresponds to a Gaussian whose contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will be infinitely thin and stretch infinitely long in that direction.

And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and the determinant of sigma is zero. So you'd have one over zero times e to the something involving the inverse of a non-invertible matrix - so, not a good model.

So can we do better? Given this sort of data, how do you model P of X? Well, one thing you could do is constrain sigma to be diagonal. So in other words, you constrain sigma to be this matrix, with zeros on the off-diagonals - I hope this makes sense; the zeros I've written down here denote that everything off the diagonal of this matrix is zero. So the maximum likelihood estimate of the parameters will be pretty much what you'd expect, right?
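Concretely, "what you'd expect" here is that each diagonal entry is just the empirical variance of that coordinate:
\[
\Sigma_{jj} = \frac{1}{m}\sum_{i=1}^m \big(x_j^{(i)} - \mu_j\big)^2,
\qquad
\Sigma_{jk} = 0 \;\text{ for } j \ne k.
\]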
And in pictures, what this means is that you're modeling the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal, and here's another example, and here's a third example. Whereas before I'd often drawn examples of Gaussians where the covariance matrix has off-diagonal entries. Okay? And, I don't know, you could do this to model P of X, but this isn't very nice, because you've now thrown away all the correlations between the different variables - the axes are X1 and X2, right? So you're failing to capture any of the correlations or relationships between any pair of variables in your data. Yeah?

Student:Could you say again what the diagonal covariance does? I didn't quite understand what the examples mean.

Instructor (Andrew Ng):Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see - suppose the covariance is diagonal; then you can ask what is P of X, parameterized by mu and sigma, if sigma is diagonal? And this will be some Gaussian bump, right? So - oh, boy, my drawing's really bad, but in two-D the density of a Gaussian is this bump-shaped thing, right? So this is the density of the Gaussian - wow, and this is a really bad drawing - with the axes being X1 and X2 and the height of this being P of X, and so those figures over there are the contours of the density of the Gaussian. So those are the contours of this shape.

Student:No, I don't mean the contours. What's special about these types? What makes them different from a general covariance matrix?

Instructor (Andrew Ng):Oh, I see. Oh, okay, sorry. They're axis-aligned, so - let's see. I'm not drawing a contour like this, right? Because the major axes of that one are not aligned with the X1 and X2 axes; that corresponds to a Gaussian where the off-diagonals are non-zero, right? Cool. Okay.

So you could do this - this sort of works.
It turns out that with just two training examples you can now learn a non-singular covariance matrix, but you've thrown away all of the correlation in the data, so this is not a great model.

It turns out you can do something - well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is to constrain sigma to equal sigma squared times the identity matrix. So in other words, you constrain it to be a diagonal matrix where, moreover, all the diagonal entries must be the same, and so the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is an even harsher constraint to place on your model.

So either of these versions - diagonal sigma, or sigma being a constant value times the identity - are already strong assumptions, all right? So if you have enough data, maybe you'd like to model at least a little bit of the correlation between your different variables. The factor analysis model is one way to attempt to do that. So here's the idea.

This is how the factor analysis model models your data. We're going to assume that there is a latent random variable - which just means a hidden random variable - Z. So Z is distributed Gaussian with mean zero and covariance the identity, where Z will be a d-dimensional vector now, and d will be chosen so that it is smaller than the dimension of your X's. Okay?

And now I'm going to assume that, conditioned on the value of Z, X is given by another Gaussian with mean given by mu plus lambda Z and covariance given by a matrix psi. So, to say that second line in an equivalent form: equivalently, I'm going to model X as mu plus lambda Z plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance psi.
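Collecting the generative story into equations:
\[
z \sim \mathcal{N}(0, I_d),
\qquad
\epsilon \sim \mathcal{N}(0, \Psi),
\qquad
x = \mu + \Lambda z + \epsilon,
\quad\text{equivalently}\quad
x \mid z \sim \mathcal{N}\big(\mu + \Lambda z,\, \Psi\big).
\]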
And so the parameters of this model are going to be a vector mu, which is n-dimensional; a matrix lambda, which is n by d; and a covariance matrix psi, which is n by n. And I'm going to impose an additional constraint on psi: I'm going to impose the constraint that psi is diagonal. Okay?

So that was the formal definition - let me actually give a couple of examples to make this more concrete. So here's one example: suppose Z is one-dimensional and X is two-dimensional, and let's see a specific instance of the factor analysis model and what it gives us in terms of a model over P of X, all right?

So, for this model, let me assume that lambda is (2, 1) and psi - which has to be a diagonal matrix, remember - is this. Okay? So Z is one-dimensional, so let me just draw a typical sample for Z, all right? If I draw the ZI's from a Gaussian, that's a typical sample of what Z might look like, and I'm going to call these Z1, Z2, Z3 and so on. If this really were a typical sample, the order of the Z's would be jumbled up, but I'm just ordering them like this to make the example easier. So, yes, a typical sample of the random variable Z from a Gaussian distribution with mean zero and covariance one.

And for this example let me just set mu equal to zero, just so it's easier to talk about. So lambda times Z: we'll take each of these numbers and multiply them by lambda, and you find that all of the values for lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess. So if this was Z7, then this one here would be lambda times Z7, and now that's a point in R2, because lambda is a two by one matrix.
And so what I've drawn here is a typical sample for lambda times Z, and the final step is what a typical sample for X looks like. Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by psi, right? And so the last step, to draw a typical sample for the random variable X: I'm going to take these points - these are really the same as mu plus lambda Z, because mu is zero in this example - and around each point I'm going to place an axis-aligned ellipse. In other words, I'm going to create a Gaussian distribution centered on this point, and what I've drawn corresponds to one of the contours of my density for epsilon, right? So you can imagine placing a little Gaussian bump here. And I'll draw an example from this little Gaussian, and let's say I get that point; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the orange points I drew will comprise a typical sample for what the distribution of X is under this model. Okay? Yeah?

Student:Would you add, like, the mean? Instructor (Andrew Ng):Oh, say that again. Student:Do you add the mean into that? Instructor (Andrew Ng):Oh, yes, you do. And in this example I set mu to zero zero just to make it easier. If mu were something else, you'd take the whole picture and shift it by whatever the value of mu is. Yeah?

Student:[Inaudible] horizontal line right there, which was Z. What does that Y-axis correspond to? Instructor (Andrew Ng):Oh, so Z is one-dimensional, so here I'm plotting a typical sample for Z, so this is like zero. So this is just the Z axis, right - Z is one-dimensional data, so this line here is a plot of a typical sample of values for Z. Okay? Yeah?

Student:You have two axes, right? And the axis data pertains to samples. Instructor (Andrew Ng):Oh, yes, right. Student:So you're sort of projecting them into that? Instructor (Andrew Ng):Let's not talk about projections yet, but, yeah, right. So these beige points - that's like X1 and that's X2 and so on, right? So the beige points are what I see.
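Just to make the sampling story concrete, here is a small code sketch of the picture just drawn. This is purely illustrative: the values of psi and the number of samples are made up, and only lambda = (2, 1) and mu = 0 come from the example on the board.

```python
import numpy as np

# Factor analysis generative model, matching the 1-D -> 2-D example:
#   z ~ N(0, 1),  x = mu + lambda * z + epsilon,  epsilon ~ N(0, Psi), Psi diagonal.
rng = np.random.default_rng(0)

mu = np.zeros(2)                 # mu = 0, as in the example
lam = np.array([2.0, 1.0])       # lambda = (2, 1): maps a scalar z onto a line in R^2
Psi = np.diag([1.0, 0.5])        # diagonal noise covariance (illustrative values)

m = 100                          # number of samples (illustrative)
z = rng.standard_normal(m)                               # typical sample of z's
line_points = mu + np.outer(z, lam)                      # mu + lambda * z: points on a line
eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)  # axis-aligned Gaussian noise
x = line_points + eps                                    # the "orange points" the algorithm sees
```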
And so in reality, all you ever get to see are the X's. But just like in the mixture of Gaussians model, where I told a story about how the data came from two Gaussians - there was a hidden random variable Z that led to the generation of X's from two Gaussians - in the same way I'm telling a story here. All the algorithm actually sees are the orange points, but we're going to tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?

So one way to see the intuition of this model - just informally, not formally - is that you can think of the factor analysis model as modeling the data as coming from a lower-dimensional subspace, more or less. So the data X here lies approximately on a one-dimensional line, plus a little bit of random noise, so the X's aren't exactly on that one-dimensional line. That's one informal way of thinking about factor analysis.

We're not doing great on time. Well, let's do this. Let me just do one more quick example, which is: in this example, let's say Z is in R2 and X is in R3, right? So in this example your data Z now lies in 2-D, so let me draw this on a sheet of paper. Okay? Let's say the axes of my sheet of paper are the Z1 and Z2 axes, and here is a typical sample of points Z, right? And we'll then take that sample of Z's - well, actually, let me draw this here as well. All right, so this is a typical sample for Z, drawn on the Z1 and Z2 axes, and I guess the origin would be here, so it's centered around zero.

And then we'll take those and map them to mu plus lambda Z, and what that means is: if you imagine the free space of this classroom is R3, we'll take this sample of Z's and map it to a position in free space. So we'll take this sheet of paper and move it to some position and some orientation in 3-D space.
And the last step is, you have X equals mu plus lambda Z plus epsilon, and so you would take the set of points which lie in some plane in our 3-D space and add a little bit of noise to each of these, and the noise will, sort of, come from axis-aligned Gaussians. Okay? So you end up with a data set that's sort of like a fat pancake, or a little bit of fuzz off your pancake.

So that's the model - let's actually talk about how to fit the parameters of the model. Okay?

In order to describe how to fit the model, I first need to rewrite Gaussians in a very slightly different way. So, in particular, let's say I have a vector X, and I'm gonna use this notation to denote partitioned vectors - X = [X1; X2] - where if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector, and X is an r-plus-s dimensional vector. Okay? So I'm gonna use this notation to denote just taking a vector and, sort of, partitioning it into two halves: the first r elements followed by the last s elements.

So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector - so break mu up into two pieces, mu1 and mu2 - and the covariance matrix sigma is now a partitioned matrix. Okay? So what this means is that you take the covariance matrix sigma and I'm going to break it up into four blocks, right? And the dimensions are such that there will be r elements here, there will be s elements here, and there will be r elements here. So, for example, sigma 1,2 will be an r-by-s matrix; it's r elements tall and s elements wide.

So this Gaussian we've written down is really a joint distribution over a lot of variables, right? X is a vector, so this is a joint distribution over X1 through X of r plus s. We can then ask: what are the marginal and conditional distributions of this Gaussian? So, for example, with my Gaussian, I know what P of X is, but can I compute the marginal distribution of X1, right?
And so P of X1 is just equal to, of course, the integral over X2 of P of X1, X2, dX2. And if you actually perform that computation, you find that P of X1, I guess, is Gaussian with mean given by mu1 and covariance sigma 1,1. All right. So this is sort of no surprise: the marginal distribution of a Gaussian is itself a Gaussian, and you just take out the relevant sub-block of the covariance matrix and the relevant sub-vector of the mean vector mu.

You can also compute conditionals. What is P of X1 given a specific value for X2, right? And the way you compute that is, well, the usual way: P of X1, X2 divided by P of X2, right? And you know what both of these formulas are, right? The numerator - well, this is just the usual Gaussian: your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma - and the denominator, by that marginalization operation I talked about, is that. So if you actually plug in the formulas for these two Gaussians and simplify - the simplification step is actually fairly non-trivial; if you haven't seen it before this will actually be somewhat difficult to do - but if you plug in the Gaussians and simplify that expression, you find that, conditioned on the value of X2, the distribution of X1 given X2 is itself going to be Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where - well, skipping the simplification and derivation, the formula for mu of 1 given 2 is given by this, and I think sigma of 1 given 2 is given by that. Okay? So these are just useful formulas to know for how to find the conditional distributions of a Gaussian and the marginal distributions of a Gaussian. I won't actually show the derivation for this.

Student: Could you repeat the [inaudible]?

Instructor (Andrew Ng): Sure.
So this one on the left, mu of 1 given 2, equals mu1 plus sigma 1,2 times sigma 2,2 inverse times X2 minus mu2, and this is sigma 1 given 2 equals sigma 1,1 minus sigma 1,2 sigma 2,2 inverse sigma 2,1. Okay? These are also in the lecture notes.

Shoot. Not as far along as I was hoping to be on time. Well, actually it's okay.

So it turns out - I think I'll skip this in the interest of time. It turns out that - well, let's go back and use these in the factor analysis model, right? It turns out that you can go back and - oh, do I want to do this? I kind of need this, though. So let's go back and figure out just what joint distribution the factor analysis model assumes on Z and X. Okay? So under the factor analysis model, Z and X - the random variables Z and X - have some joint distribution given by a mean vector, which I'll write as mu of ZX, and some covariance matrix sigma. So let's go back and figure out what mu ZX is and what sigma is, and I'll do this so that we'll get a little bit more practice with partitioned vectors and partitioned matrices.

So just to remind you, right? You have Z as Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance Psi. So I'm just writing out the same equations again.

So let's first figure out what this vector mu ZX is. Well, the expected value of Z is zero - and, again, as usual I'll often drop the square brackets here. And the expected value of X is - well, the expected value of mu plus lambda Z plus epsilon. These two terms have zero expectation, and so the expected value of X is just mu. And so that vector mu ZX, right, in my parameterization of the Gaussian, is going to be the expected value of this partitioned vector given by this partition Z and X, and so that would just be zero followed by mu. Okay? And so that's a d-dimensional zero followed by an n-dimensional mu.
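As a quick numerical sketch of the marginal and conditional formulas from a moment ago, here is a small numpy example; the dimensions and parameter values are made up for illustration.

```python
import numpy as np

# Partitioned Gaussian: x = [x1; x2], with x1 in R^r and x2 in R^s.
r, s = 2, 3
rng = np.random.default_rng(1)

mu1, mu2 = rng.normal(size=r), rng.normal(size=s)
A = rng.normal(size=(r + s, r + s))
Sigma = A @ A.T + np.eye(r + s)          # a valid (positive definite) covariance
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

# Marginal of x1: N(mu1, sigma_{1,1}).
marg_mean, marg_cov = mu1, S11

# Conditional of x1 given an observed x2:
#   mu_{1|2}    = mu1 + sigma_{1,2} sigma_{2,2}^{-1} (x2 - mu2)
#   sigma_{1|2} = sigma_{1,1} - sigma_{1,2} sigma_{2,2}^{-1} sigma_{2,1}
x2 = rng.normal(size=s)
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

print(cond_mean)
print(cond_cov)
```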
Now let's work out what the covariance matrix sigma is. So the covariance matrix sigma - if you work it out as a partitioned matrix - will comprise four blocks like that, and the upper-left block, which I write as sigma 1,1 - well, that upper-left block is just the covariance matrix of Z, which we know is the identity.

I was gonna show you briefly how to derive some of the other blocks, right? So sigma 1,2 - that's the upper - oh, actually, excuse me, sigma 2,1, which is the lower-left block - that's E of X minus E[X] times Z minus E[Z] transpose. So X is equal to mu plus lambda Z plus epsilon, and then minus E[X] is minus mu, and then times Z transpose, because the expected value of Z is zero, right, so that E[Z] term is equal to zero. And so if you expand this out, the plus mu and minus mu cancel, and you have the expected value of lambda Z Z transpose plus the expected value of epsilon Z transpose, which is equal to lambda times the expected value of Z Z transpose, which is just lambda times the identity matrix - that is, lambda. Okay? Does that make sense? Because this second term is equal to zero: epsilon and Z are independent and have zero expectation, so the second term is zero.

Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? And I won't do this, but this simplifies to lambda lambda transpose plus Psi. Okay?

So putting all this together, this tells us that the joint distribution of this vector ZX is going to be Gaussian with mean vector given by that, which we worked out previously - so this is the mu ZX that we worked out previously - and covariance matrix given by that. Okay?

So in principle - let's see, so the parameters of our model are mu, lambda, and Psi.
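Putting the blocks together, here is a minimal sketch that builds the joint covariance of (Z, X) - the identity in the upper left, lambda transpose in the upper right (by symmetry with the lower-left block just derived), lambda in the lower left, and lambda lambda transpose plus Psi in the lower right - and sanity-checks it against an empirical covariance estimated from samples. The parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n = 2, 3                          # z in R^d, x in R^n
mu = np.array([1.0, -0.5, 2.0])
Lam = rng.normal(size=(n, d))        # lambda
Psi = np.diag([0.3, 0.5, 0.2])       # diagonal noise covariance

# Joint covariance of [z; x] under the factor analysis model:
#   [[ I,              lambda^T              ],
#    [ lambda,  lambda lambda^T + Psi        ]]
Sigma_joint = np.block([[np.eye(d),      Lam.T],
                        [Lam, Lam @ Lam.T + Psi]])
mu_joint = np.concatenate([np.zeros(d), mu])

# Empirical check: sample z and epsilon, then form x = mu + lambda z + epsilon.
m = 200_000
z = rng.normal(size=(m, d))
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)
x = mu + z @ Lam.T + eps
zx = np.hstack([z, x])

print(np.allclose(np.cov(zx, rowvar=False), Sigma_joint, atol=0.05))
```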
And so, in order to find the parameters of this model, we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters. And so in principle one thing you could do is actually write down what P of XI is. Right, so P of XI - the distribution of X - if, again, you marginalize this Gaussian, then the distribution of X, which is the lower half of this partitioned vector, is going to have mean mu and covariance given by lambda lambda transpose plus Psi. Right? So that's the distribution we're using to model P of X.

And so in principle one thing you could do is actually write down the log likelihood of your parameters, right? Which is just the sum over I of log P of XI, where P of XI will be given by this Gaussian density - and I'm using theta as shorthand to denote all of my parameters. And you actually know what the density of a Gaussian is, and so you can say P of XI is this Gaussian with mean mu and covariance given by lambda lambda transpose plus Psi. So you can write down the log likelihood of your parameters as follows, and you can try to take derivatives of the log likelihood with respect to your parameters and maximize it. All right. It turns out that if you do that you end up with, sort of, an intractable optimization problem - or, excuse me, you end up with an optimization problem that you will not be able to find analytic, sort of, closed-form solutions to. So if you say my model of X is this and you do maximum likelihood parameter estimation, you won't be able to find the maximum likelihood estimate of the parameters in closed form.

So what I would have liked to do is - well, so in order to fit parameters to this model, what we'll actually do is use the EM algorithm.
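To make that obstacle concrete, here is a small sketch of the marginal log likelihood just described - the sum over i of log N(x_i; mu, lambda lambda^T + Psi) - using scipy; the data and parameter values are invented for illustration. You can evaluate this quantity and hand it to a generic numerical optimizer, but there is no closed-form maximizer in lambda and Psi, which is why the lecture turns to EM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def factor_analysis_log_likelihood(X, mu, Lam, Psi):
    """Marginal log likelihood: sum_i log N(x_i; mu, lambda lambda^T + Psi)."""
    cov = Lam @ Lam.T + Psi
    return multivariate_normal(mean=mu, cov=cov).logpdf(X).sum()

# Illustrative data and parameters (assumptions, not from the lecture).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))        # stand-in for the observed x^(i)'s
mu = X.mean(axis=0)                  # the maximum likelihood estimate of mu is the sample mean
Lam = rng.normal(size=(3, 1))
Psi = np.diag([1.0, 1.0, 1.0])

print(factor_analysis_log_likelihood(X, mu, Lam, Psi))
```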
In the E step, we'll compute Q_i of Z_i equals the posterior P of Z_i given X_i, parameterized by theta. And this formula looks the same as before, except that one difference is that now Z is a continuous random variable, and so in the E step we actually have to find the density Q_i of Z_i - the E step actually requires that we find the posterior distribution, the density of the random variable Z_i. And then the M step will perform the following maximization: set theta to the value that maximizes the sum over i of the integral over Z_i of Q_i of Z_i times log of P of X_i, Z_i parameterized by theta, divided by Q_i of Z_i, d Z_i. Where, again, because Z_i is now continuous, we now have an integral over Z rather than a sum. Okay?

I was hoping to go a little bit further in deriving these things, but I don't have time today, so we'll wrap that up in the next lecture. But before I close, let's check if there are questions about the whole factor analysis model. Okay.

So we'll come back in the next lecture; I will wrap up this model, and I want to go a little bit deeper into the E and M steps, as there are some tricky parts for the factor analysis model specifically. Okay. I'll see you in a couple of days.
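As a preview of what that E step works out to (the next lecture derives it properly), applying the Gaussian conditioning formulas from earlier to the joint distribution of (Z, X) gives a Gaussian posterior over Z_i. Here is a minimal numpy sketch of that posterior under made-up parameter values.

```python
import numpy as np

def e_step_posterior(x, mu, Lam, Psi):
    """Posterior p(z | x) for factor analysis: a Gaussian obtained by applying
    the conditioning formulas to the joint covariance of (z, x).

        mean = lambda^T (lambda lambda^T + Psi)^{-1} (x - mu)
        cov  = I - lambda^T (lambda lambda^T + Psi)^{-1} lambda
    """
    d = Lam.shape[1]
    S22 = Lam @ Lam.T + Psi                        # covariance of x
    mean = Lam.T @ np.linalg.solve(S22, x - mu)
    cov = np.eye(d) - Lam.T @ np.linalg.solve(S22, Lam)
    return mean, cov

# Illustrative values only (assumptions, not from the lecture).
rng = np.random.default_rng(4)
mu = np.zeros(3)
Lam = rng.normal(size=(3, 2))
Psi = np.diag([0.3, 0.4, 0.5])
x = rng.normal(size=3)
print(e_step_posterior(x, mu, Lam, Psi))
```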