This presentation is delivered by the Stanford Center for Professional Development.

So welcome back. What I want to do today is continue our discussion of the EM algorithm, and in particular I want to take the EM formulation that we derived in the previous lecture and apply it to the mixture of Gaussians model, apply it to a different model, the mixture of naive Bayes model, and then the last part of today's lecture will be on the factor analysis algorithm, which will also use EM. And as part of that, we'll actually take a brief digression to talk a little bit about some useful properties of Gaussian distributions.

So just to recap where we are. In the previous lecture, I started to talk about unsupervised learning, which is the class of machine learning problems where you're given an unlabeled training set comprising m examples, right? The fact that there are no labels is what makes this unsupervised learning.

So one problem that I talked about last time was: what if you're given a data set that looks like this, and you want to model the density p(x) from which you think the data had been drawn? With a data set like this, maybe you think it was a mixture of two Gaussians, and so we started to talk about an algorithm for fitting a mixture of Gaussians model, all right?

And so we said that we would model the density of x, p(x), as the sum over z of p(x | z) times p(z), where this latent random variable, meaning this hidden random variable z, indicates which of the two Gaussian distributions each of your data points came from. And so we have that z was multinomial with parameter phi, and x conditioned on z = j, that is, coming from the j-th Gaussian, was given by a Gaussian with mean mu_j and covariance sigma_j, all right?
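(As a small illustration of the model just recapped, here is a minimal sketch in Python of that mixture density; the two-component parameters below are made-up numbers for illustration, not values from the lecture.)

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, phi, mus, sigmas):
    """p(x) = sum_j p(x | z = j) p(z = j) = sum_j phi_j * N(x; mu_j, sigma_j)."""
    return sum(phi_j * multivariate_normal.pdf(x, mean=mu_j, cov=sigma_j)
               for phi_j, mu_j, sigma_j in zip(phi, mus, sigmas))

# Two made-up components in two dimensions, purely for illustration.
phi = [0.6, 0.4]                              # p(z = j), the multinomial parameters
mus = [np.zeros(2), np.array([3.0, 3.0])]     # component means
sigmas = [np.eye(2), 0.5 * np.eye(2)]         # component covariances
print(mixture_density(np.array([1.0, 1.0]), phi, mus, sigmas))
```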
So, like I said, at the beginning of the previous lecture I just talked about a very specific algorithm that I sort of pulled out of the air for fitting the parameters of this model, namely phi, mu, and sigma. But then in the second half of the previous lecture I talked about what's called the EM algorithm, in which our goal is maximum likelihood estimation of the parameters. So we want to maximize, in terms of theta, the usual log likelihood l(theta), parameterized by theta.

And because we have a latent random variable z, this is really maximizing, in terms of theta, the sum over i of the log of the sum over z_i of p(x_i, z_i; theta). Okay?

So using Jensen's inequality, last time we worked out the EM algorithm, in which in the E step we would choose these probability distributions Q_i to be the posterior on z given x, parameterized by theta, and in the M step we would set theta to be the value that maximizes this. Okay? So these are the ones we worked out last time.

And the cartoon that I drew was that you have this log likelihood function l(theta) that's often hard to maximize, and what the E step does is choose these probability distributions Q_i. In the cartoon, what that corresponded to was finding a lower bound for the log likelihood, where the horizontal axis is theta, and then in the M step you maximize that lower bound, right? So maybe you were here previously, and so you jump to the new point, the new maximum of this lower bound. Okay? And so this little curve here, right, this lower bound function here, that's really the right-hand side of that arg max. Okay? So, this whole thing inside the arg max.
If you view this thing as a function of theta, this function of theta is a lower bound for the log likelihood of theta, and so in the M step we maximize this lower bound, and that corresponds to jumping to this new maximum of the lower bound.

So it turns out that in the EM algorithm - so why do we bother with the EM algorithm? It turns out that very often, and this will be true for all the examples we see today, it turns out that very often in the EM algorithm performing the maximization in the M step will be tractable and can often be done analytically in closed form. Whereas if you were trying to maximize this objective directly - if we try to take this formula on the right, this maximum likelihood objective, and set its derivatives to zero and try to solve - you'll find that you're unable to obtain a solution to this maximization in closed form. Okay?

And to give you an example of that, you remember our discussion on exponential family models, right? It turns out that if x and z are jointly, I guess, in an exponential family - so if p(x, z) parameterized by theta is an exponential family distribution, which turns out to be true for the mixture of Gaussians distribution - then it turns out that the M step here will be tractable and the E step will also be tractable, and so you can do each of these steps very easily. Whereas trying to perform the original maximum likelihood estimation problem on this one, right, will be computationally very difficult: you're going to set the derivatives to zero and try to solve, and you won't be able to find an analytic solution to this. Okay?

So what I want to do in a second is actually take this view of the EM algorithm and apply it to the mixture of Gaussians model.
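(Just to put the general recipe in one place, here is a minimal skeleton of the EM loop as described; the e_step and m_step arguments are placeholders for whatever model-specific routines, such as the mixture of Gaussians ones worked out next, you plug in.)

```python
def em(e_step, m_step, theta, x, n_iters=100):
    """Generic EM loop, as a skeleton.

    e_step(x, theta) should return the posteriors Q_i(z_i) = p(z_i | x_i; theta);
    m_step(x, q) should return the theta maximizing the lower bound
    sum_i sum_{z_i} Q_i(z_i) * log( p(x_i, z_i; theta) / Q_i(z_i) ).
    """
    for _ in range(n_iters):
        q = e_step(x, theta)    # E step: compute posteriors, tightening the bound at current theta
        theta = m_step(x, q)    # M step: maximize the lower bound over theta
    return theta
```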
I want to take these E steps and M steps and work them out for the mixture of Gaussians model, but before I do that, I just want to say one more thing about this other view of the EM algorithm. It turns out there's one other way of thinking about the EM algorithm, which is the following: I can define an optimization objective J of theta, Q, and define it to be this. This is just the thing inside the arg max in the M step. Okay?

And so what we proved using Jensen's inequality is that the log likelihood of theta is greater than or equal to J of theta, Q. In other words, we proved last time that for any value of theta and Q, the log likelihood upper bounds J of theta, Q.

And so, just to relate this back to yet more things that you already know, you can also think of this in terms of coordinate ascent, right, recalling our discussion a while back on the coordinate ascent optimization algorithm. So we can show, and I won't actually show this here, so just take my word for it and work through it at home if you want, that EM is just coordinate ascent on the function J. In the E step you maximize with respect to Q, and then in the M step you maximize with respect to theta. Okay? So this is another view of the EM algorithm that shows why it has to converge, for example: you can view it in terms of J of theta, Q having to monotonically increase on every iteration. Okay?

So what I want to do next is actually take this general EM machinery that we worked up and apply it to the mixture of Gaussians model. Before I do that, let me just check if there are questions about the EM algorithm as a whole? Okay, cool.

So let's go ahead and work out the mixture of Gaussians EM, all right? MoG, that's my abbreviation for mixture of Gaussians. So in the E step, recall, we chose those Q distributions, right?
In particular, I want to work out - so Q is the probability distribution over the latent random variable z, and so in the E step I'm gonna figure out how to compute Q_i of z_i = j. And you can think of this as my writing p(z_i = j), right, under the Q distribution; that's what this notation means.

And so the EM algorithm tells us that, let's see, Q_i of j is the posterior probability of z taking the value j given x_i and all your parameters. And the way you compute this is by Bayes' rule, right? So that is going to be equal to p(x_i | z_i = j) times p(z_i = j), divided by - right? That's all by Bayes' rule.

And this you know, because x_i given z_i = j, this was a Gaussian with mean mu_j and covariance sigma_j. So to compute this first term you plug in the formula for the Gaussian density there, with parameters mu_j and sigma_j, and this you'd know because z was multinomial, right, with parameters given by phi, and so the probability of z_i being equal to j is just phi_j, and so you can substitute these terms in. Similarly you do the same thing for the denominator, and that's how you work out what Q is. Okay? And so, as in the previous lecture, this value, the probability that z_i equals j under the Q distribution, that's what I denoted as w_ij.

So that would be the E step, and then in the M step we maximize with respect to all of our parameters. This - well, I seem to be writing the same formula down a lot today. All right. And just so we're completely concrete about how you do that, right?
So if you do that, you end up with - so plugging in the quantities that you know, that becomes this, let's see. Right.

And so that we're completely concrete about what the M step is doing: in the M step, that was, I guess, Q_i of z_i, and the sum over j is the sum over all the possible values of z_i, and then this thing here is my Gaussian density. Sorry, guys - this first term here, right, is my p(x_i | z_i), and that's p(z_i). Okay?

And so, to maximize this - say you want to maximize this with respect to all of your parameters phi, mu, and sigma. To maximize with respect to the parameter mu, say, you would take the derivative with respect to mu and set that to zero, and if you actually do that computation you would get, for instance, that this becomes your update to mu_j. Okay? The exact equation is unimportant; all of these equations are written down in the lecture notes. I'm writing these down just to be completely concrete about what the M step means. So write down that formula, plug in the densities you know, take the derivative, set it to zero, solve for mu_j, and in the same way you set the derivatives equal to zero and solve for the updates of your other parameters phi and sigma as well. Okay?

Well, let me just point out one little tricky bit of this that you haven't seen before, which most of you probably already know, but I'll just mention it: since phi here is a multinomial distribution, when you take this formula and you maximize it with respect to phi, you actually have an additional constraint, right? That the sum over j of phi_j must be equal to one.
All right? So, again, in the M step I want to take this thing and maximize it with respect to all the parameters, and when you maximize this with respect to the parameters phi_j, you need to respect the constraint that the sum over j of phi_j must be equal to one. And, well, you already know how to do constrained maximization, right? So I'll throw out the method of Lagrange multipliers; we saw the generalized Lagrangian when we talked about support vector machines. And so to actually perform the maximization in terms of phi_j you construct the Lagrangian, which is - all right, that's the expression from above, which I'll abbreviate with the dot dot dot, plus beta times that constraint, where beta is sort of the Lagrange multiplier and this is your optimization objective. And to actually solve for the parameters phi_j you set the derivatives of this Lagrangian to zero and solve. Okay? And if you then work through the math, you get the appropriate values to update the phi_j's to, which I won't do here; the full derivations are in the lecture notes.

Okay. And so if you actually perform all these computations, you can also verify that - so I just wrote down a bunch of formulas for the EM algorithm. At the beginning of the last lecture I said, for the mixture of Gaussians model, here's the formula for computing the w_ij's and here's a formula for computing the mu's and so on, and this derivation is where all of those formulas actually come from. Okay? Questions about this? Yeah?

Student:[Inaudible]

Instructor (Andrew Ng):Oh, I see. So it turns out that, yes, there's also a constraint that the phi_j's must be greater than or equal to zero. It turns out that, if you wanted to, you could actually write down the generalized Lagrangian incorporating all of these constraints as well and solve subject to those constraints. But it turns out that in this particular derivation - actually, it turns out that very often, when we find maximum likelihood estimates of multinomial distribution probabilities, if you ignore these constraints and just maximize the formula, luckily you end up with values that actually are greater than or equal to zero. And if, even ignoring those constraints, you end up with parameters that are greater than or equal to zero, that shows that this must be the correct solution, because adding the constraint won't change anything. So it turns out that if you ignore this constraint and just do what I wrote down, you actually get the right answer. Okay? Great.
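(To make the E and M steps just derived concrete, here is a minimal sketch of one EM iteration for the mixture of Gaussians; the array names are mine, and the updates follow the closed-form formulas discussed above, so treat it as an illustration rather than a reference implementation.)

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_mog(X, phi, mu, sigma):
    """One EM iteration for a mixture of Gaussians.

    X: (m, n) data; phi: (k,) mixing weights; mu: list of k mean vectors;
    sigma: list of k covariance matrices.
    """
    m, _ = X.shape
    k = len(phi)

    # E step: W[i, j] = Q_i(z_i = j) = p(z_i = j | x_i; phi, mu, sigma), via Bayes' rule.
    W = np.column_stack([
        phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
        for j in range(k)
    ])
    W /= W.sum(axis=1, keepdims=True)

    # M step: closed-form updates obtained by setting derivatives to zero.
    phi_new = W.mean(axis=0)                                  # fraction of (soft) assignments
    mu_new = [W[:, j] @ X / W[:, j].sum() for j in range(k)]  # weighted means
    sigma_new = []
    for j in range(k):
        D = X - mu_new[j]
        sigma_new.append((W[:, j, None] * D).T @ D / W[:, j].sum())  # weighted covariances
    return phi_new, mu_new, sigma_new
```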
So let me just very quickly talk about one more example of a mixture model. The perfect example for this is: imagine you want to do text clustering, right? Someone gives you a large set of documents and you want to cluster them together into cohesive topics. I think I mentioned the news website news.google.com; that's one application of text clustering, where you might want to look at all of the news stories about today, all the news stories written by everyone, by all the online news websites, about whatever happened yesterday, and there will be many, many different stories on the same thing, right? And by running a text-clustering algorithm you can group related documents together. Okay?

So how do you apply the EM algorithm to text clustering? I want to do this to illustrate an example in which you run the EM algorithm on discrete-valued inputs, where the training examples x_i are discrete values. So what I want to talk about specifically is the mixture of naive Bayes model, and depending on how much you remember about naive Bayes, I talked about two event models: one was the multivariate Bernoulli event model, and one was the multinomial event model. Today I'm gonna use the multivariate Bernoulli event model. If you don't remember what those terms mean anymore, don't worry about it; I think the equations will still make sense. In this setting we're given a training set x_1 through x_m.
So we're given m text documents, where each x_i is in {0, 1}^n. So each of our training examples is an n-dimensional bit vector, right? This was the representation where x_ij indicates whether word j appears in document i, right?

And let's say that we're going to model z_i, our latent random variable, meaning our hidden random variable; z_i will take on two values, zero and one, so this means I'm just gonna find two clusters, and you can generalize this to however many clusters you want.

So in the mixture of naive Bayes model we assume that z_i is distributed Bernoulli with some value of phi, so there's some probability of each document coming from cluster one or from cluster zero. And we assume that, for the probability of x_i given z_i, right, we'll make the same naive Bayes assumption as we did before. Okay?

And more specifically - well, excuse me, right. Okay. And so the probability of x_ij being one given z_i equals, say, zero will be given by a parameter phi subscript j given z = 0. So if you take this chalkboard and you take all instances of the letter z and replace them with y, then you end up with exactly the same equations as I wrote down for naive Bayes a long time ago. Okay?
And I'm not actually going to work out the mechanics of deriving the EM algorithm, but it turns out that if you take this joint distribution over x and z, and you work out what the equations are for the EM algorithm for maximum likelihood estimation of the parameters, you find that in the E step you compute, you know, these weights w_i, which are going to be equal to your posterior distribution of z_i being equal to one conditioned on x_i, parameterized by your phi's and given your parameters, and then there's the M step. Okay? And that's the equation you get in the M step.

I mean, again, the equations themselves aren't too important; they're just meant to convey - I'll give you a second to finish writing, I guess. And when you're done writing, take a look at these equations and see if they make intuitive sense to you, why these three equations sort of sound like they might be the right thing to do. Yeah?

Student:[Inaudible]

Instructor (Andrew Ng):Say that again. Y - oh, yes, thank you. Right. Sorry, everywhere I wrote y I meant z. Yeah?

Student:[Inaudible] in the first place?

Instructor (Andrew Ng):No. So - what is it? Normally you initialize the phi's to be something else, say randomly. Just like in naive Bayes, we saw zero probabilities as a bad thing, so for the same reason you try to avoid zero probabilities, yeah. Okay?

And so the intuition behind these equations is: in the E step, the w_i's are your best guess for whether the document came from cluster one or cluster zero, all right? This is very similar to the intuition behind the EM algorithm that we talked about in a previous lecture. So in the E step we're going to compute these weights that tell us: do I think this document came from cluster one or cluster zero?
And then in the M step I'm gonna say: this numerator is the sum over all the elements of my training set of - so, informally, right, w_i is one if I think the document came from cluster one, and so this will essentially sum up all the times I saw word j in documents that I think are in cluster one. And these are sort of weighted by the actual probability that I think it came from cluster one. And then I'll divide by - again, if all of these were ones and zeros, then I'd be dividing by the actual number of documents I had in cluster one. So if all the w_i's were either ones or zeros, then this would be exactly the fraction of documents that I saw in cluster one in which I also saw word j. Okay?

But in the EM algorithm you don't make a hard assignment decision about whether this is in cluster one or in cluster zero; you instead represent your uncertainty about cluster membership with the parameters w_i. Okay?

It actually turns out that when we implement this particular model, by the nature of this computation, all the values of the w_i's will be very close to either one or zero, so they'll be numerically almost indistinguishable from ones and zeros. This is a property of naive Bayes: if you actually compute this probability over all those documents, you find that w_i is either 0.0001 or 0.999; it'll be amazingly close to either zero or one. And so the M step is pretty much guessing whether each document is in cluster one or cluster zero and then using formulas that are very similar to maximum likelihood estimation for naive Bayes. Okay? Cool.

And if some of these equations don't look that familiar to you anymore, sort of go back and take another look at what you saw in naive Bayes, and hopefully you can see the links there as well. Questions about this before I move on? Right, okay.
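(Analogously, here is a minimal sketch of one EM iteration for this two-cluster mixture of naive Bayes with the multivariate Bernoulli event model; the variable names are mine, and I've left out the smoothing you'd add in practice to avoid the zero-probability issue mentioned above.)

```python
import numpy as np

def em_step_mixture_nb(X, phi, phi_j_given_z1, phi_j_given_z0):
    """One EM iteration for a two-cluster mixture of naive Bayes.

    X: (m, n) binary matrix, X[i, j] = 1 if word j appears in document i.
    phi: scalar p(z = 1); phi_j_given_z1, phi_j_given_z0: (n,) vectors of
    p(x_j = 1 | z = 1) and p(x_j = 1 | z = 0).
    """
    # E step: w_i = p(z_i = 1 | x_i; parameters), via Bayes' rule under the
    # naive Bayes assumption p(x | z) = prod_j p(x_j | z).
    log_p1 = np.log(phi) + X @ np.log(phi_j_given_z1) + (1 - X) @ np.log(1 - phi_j_given_z1)
    log_p0 = np.log(1 - phi) + X @ np.log(phi_j_given_z0) + (1 - X) @ np.log(1 - phi_j_given_z0)
    w = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    # M step: weighted versions of the usual naive Bayes maximum likelihood estimates.
    phi = w.mean()                                   # fraction of documents believed to be in cluster 1
    phi_j_given_z1 = (w @ X) / w.sum()               # weighted word frequencies within cluster 1
    phi_j_given_z0 = ((1 - w) @ X) / (1 - w).sum()   # and within cluster 0
    return phi, phi_j_given_z1, phi_j_given_z0, w
```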
Of course, the way I got these equations was by working through the machinery of the EM algorithm, right? I didn't just write these out of thin air. The way you do this is by writing down the E step and the M step for this model, and then in the M step setting the derivatives equal to zero and solving; that's how you get the M step and the E step.

So the last thing I want to do today is talk about the factor analysis model, and the reason I want to do this is sort of two reasons. One is that factor analysis is kind of a useful model. It's not as widely used as mixtures of Gaussians and mixtures of naive Bayes, maybe, but it's sort of useful. The other reason I want to derive this model is that there are a few steps in the math that are more generally useful. In particular, factor analysis will be an example in which we do EM where the latent random variable, the hidden random variable z, is going to be continuous-valued. And so some of the math we'll see in deriving factor analysis will be a little bit different from what you saw before. It turns out the full derivation of EM for factor analysis is extremely long and complicated, so I won't inflict that on you in lecture today, but I will still be writing more equations than you'll see me do in other lectures, because there are just a few steps in the factor analysis derivation that I'll explicitly illustrate.

So let's actually [inaudible] the model, and it's really a contrast to the mixture of Gaussians model, all right? For the mixture of Gaussians model, which was our first model, I actually motivated it by drawing a data set like this, right? You have a data set that looks like this, right? So this was a problem where n is two, the data is two-dimensional, and you have, I don't know, maybe 50 or 100 training examples, whatever, right?
And I said maybe you're given an unlabeled training set like this, and maybe you want to model this as a mixture of two Gaussians. Okay? And so mixture of Gaussians models tend to be applicable where m is larger, and often much larger, than n; where the number of training examples you have is at least as large as, and usually much larger than, the dimension of the data.

What I want to do is talk about a different problem, where I want you to imagine what happens if either the dimension of your data is roughly equal to the number of examples you have, or maybe the dimension of your data is even much larger than the number of training examples you have. Okay? So how do you model such very high-dimensional data? You will see this sometimes, right? If you run a plant or something, you run a factory, maybe you have a thousand measurements all through your plant, but you only have five - you only have 20 days of data. So you can have 1,000-dimensional data but only 20 examples of it.

So given data that has this property, where we're given a training set of m examples, what can you do to try to model the density of x? One thing you can do is try to model it just as a single Gaussian, right? So forget mixtures of Gaussians; you try to model this as a single Gaussian and say x is distributed Gaussian with mean mu and covariance sigma, where sigma is going to be an n-by-n matrix. And if you work out the maximum likelihood estimates of the parameters, you find that the maximum likelihood estimate for the mean is just the empirical mean of your training set, right? So that makes sense. And the maximum likelihood estimate of the covariance matrix sigma will be this, all right?
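(Written out, those maximum likelihood estimates are mu = (1/m) sum_i x_i and sigma = (1/m) sum_i (x_i - mu)(x_i - mu)^T. Here is a quick numerical sketch of them, with made-up dimensions echoing the plant example, that also reports the rank of the resulting sigma, which is about to matter.)

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 1000                      # many more dimensions than examples (illustrative numbers)
X = rng.normal(size=(m, n))          # stand-in for the m training examples

mu = X.mean(axis=0)                  # MLE of the mean: the empirical mean
D = X - mu
Sigma = D.T @ D / m                  # MLE of the covariance: (1/m) sum (x - mu)(x - mu)^T

print(Sigma.shape)                   # (1000, 1000)
print(np.linalg.matrix_rank(Sigma))  # no larger than m - 1 = 19, far below n = 1000
```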
But it turns out that in this regime, where the data is much higher dimensional - excuse me, where the data's dimension is much larger than the number of training examples you have - if you compute the maximum likelihood estimate of the covariance matrix sigma, you find that this matrix is singular. Okay? By singular, I mean that it doesn't have full rank, or that it has a zero eigenvalue; I hope one of those terms makes sense. Another way of saying it is that the matrix sigma will be non-invertible.

And just in pictures, one concrete example is if n equals m equals two: you have two-dimensional data and you have two examples. So I'd have two training examples in two dimensions; this is x_1 and this is x_2, and this is my unlabeled data. If you fit a Gaussian to this data set, you find that - well, you remember I used to draw contours of Gaussians as ellipses, right? So these are examples of different contours of Gaussians. You find that the maximum likelihood estimate Gaussian for this corresponds to a Gaussian whose contours are sort of infinitely thin and infinitely long in that direction. Okay? So the contours will be infinitely thin, right, and stretch infinitely long in that direction.

And another way of saying it is that if you actually plug in the formula for the density of the Gaussian, which is this, you won't actually get a nice answer, because the matrix sigma is non-invertible, so sigma inverse is not defined, and the determinant of sigma is zero. So you'd have one over zero times e to the something involving the inverse of a non-invertible matrix; so, not a good model.

So let's see if we can do better, right? Given this sort of data, how do you model p(x)? Well, one thing you could do is constrain sigma to be diagonal. So you have a covariance matrix that is - okay? In other words, you constrain sigma to be this matrix, all right, with zeros on the off-diagonals.
I hope this makes sense. These zeros I've written down here denote that everything off the diagonal of this matrix is zero.

So the maximum likelihood estimates of the parameters will be pretty much what you'd expect, right? And in pictures, what this means is that you're fitting the distribution with Gaussians whose contours are axis-aligned. So that's one example of a Gaussian where the covariance is diagonal, and here's another example, and here's a third example. Whereas before I had often drawn examples of Gaussians where the covariance matrix has non-zero off-diagonal entries. Okay?

And, I don't know, you could do this to model p(x), but this isn't very nice, because you've now thrown away all the correlations between the different variables. So the axes are x_1 and x_2, right? You're failing to capture any of the correlations or the relationships between any pair of variables in your data. Yeah?

Student:Is it - could you say again, what does that do for the diagonal?

Instructor (Andrew Ng):Say again?

Student:The covariance matrix being diagonal - what does that mean? I didn't quite understand what the examples mean.

Instructor (Andrew Ng):Okay. So these are the contours of the Gaussian density that I'm drawing, right? So let's see: suppose the covariance matrix is diagonal; then you can ask what is p(x), parameterized by mu and sigma, right, if sigma is diagonal. And this will be some Gaussian bump, right? So - oh, boy, my drawing's really bad, but in two-D the density of a Gaussian is like this bump-shaped thing, right? So this is the density of the Gaussian - wow, and this is a really bad drawing - with the axes x_1 and x_2, and the height of this is p(x). And so those figures over there are the contours of the density of the Gaussian; those are the contours of this shape.
Student:No, I don't mean the contours. What's special about these types? What makes them different from having a general covariance matrix?

Instructor (Andrew Ng):Oh, I see. Oh, okay, sorry. They're axis-aligned. So, let's see: I'm not drawing a contour like this, right, because the main axes of these are not aligned with the x_1 and x_2 axes; that would correspond to a Gaussian where the off-diagonal entries are non-zero, right? Cool. Okay.

So you could do this, and this sort of works. It turns out that with only two training examples you can learn a non-singular covariance matrix, but you've thrown away all of the correlations in the data, so this is not a great model.

It turns out you can do something - well, actually, we'll come back and use this property later. But it turns out you can do something even more restrictive, which is you can constrain sigma to equal sigma squared times the identity matrix. In other words, you can constrain it to be a diagonal matrix where, moreover, all the diagonal entries must be the same, and the cartoon for that is that you're constraining the contours of your Gaussian density to be circular. Okay? This is a sort of even harsher constraint to place on your model.

So either of these versions, diagonal sigma or sigma being this sort of constant-value diagonal, are already strong assumptions, all right? So if you have enough data, maybe you'd like to model at least a little bit of the correlation between your different variables. The factor analysis model is one way to attempt to do that. So here's the idea.

This is how the factor analysis model models your data. We're going to assume that there is a latent random variable, okay, which just means a hidden random variable z. So z is distributed Gaussian with mean zero and covariance the identity, where z will be a d-dimensional vector now, and d will be chosen so that it is lower than the dimension of your x's.
Okay? And now I'm going to assume that x is given by - well, let me write this. Each x_i is distributed - actually, sorry. We're going to assume that, conditioned on the value of z, x is given by another Gaussian, with mean given by mu plus lambda z and covariance given by the matrix psi. So, to state that second line in an equivalent form: equivalently, I'm going to model x as mu plus lambda z plus a noise term epsilon, where epsilon is Gaussian with mean zero and covariance psi.

And so the parameters of this model are going to be a vector mu, which is n-dimensional; a matrix lambda, which is n by d; and a covariance matrix psi, which is n by n. And I'm going to impose an additional constraint on psi: I'm going to impose the constraint that psi is diagonal. Okay? So that's the formal definition - let me actually give a couple of examples to make this more concrete.

So let's do a kind of example: suppose z is one-dimensional and x is two-dimensional, and let's see a sort of specific instance of the factor analysis model and what this gives us in terms of a model of the distribution over x, of p(x), all right? So let's see. For this model let me assume that lambda is (2, 1), and psi, which has to be a diagonal matrix, remember, is this. Okay? So z is one-dimensional, so let me just draw a typical sample for z, all right?
Let me actually give a couple of examples to make this more concrete. So, as a first example, suppose Z is one-dimensional and X is two-dimensional, and let's see a specific instance of the factor analysis model and what it gives us in terms of a model over P of X, all right? So, for this model, let me assume that lambda is (2, 1) and that psi, which has to be a diagonal matrix, remember, is this. Okay? So Z is one-dimensional, so let me just draw a typical sample for Z, all right?

So if I draw the ZI's from a Gaussian, that's a typical sample of what Z might look like, and I'm gonna call these Z1, Z2, Z3 and so on. If this really were a typical sample, the order of the Z's would be jumbled up, but I'm ordering them like this just to make the example easier. So, yes, that's a typical sample of the random variable Z from a Gaussian distribution with mean zero and variance one.

And for this example, let me just set mu equal to zero - just so that it's easier to talk about. So, lambda times Z, right? We'll take each of these numbers and multiply them by lambda, and you find that all of the values of lambda times Z will lie on a straight line, all right? So, for example, this one here would be one, two, three, four, five, six, seven, I guess - so if this was Z7, then this one here would be lambda times Z7, and that's now a point in R2, because lambda is a two-by-one matrix. And so what I've drawn here is a typical sample for lambda times Z.

And the final step is what a typical sample for X looks like. Well, X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance given by psi, right? And so the last step, to draw a typical sample for the random variable X, is that I take these points - which are really the same as mu plus lambda Z, because mu is zero in this example - and around each point I place an axis-aligned ellipse. Or, in other words, I create a Gaussian distribution centered on that point, and what I've drawn here corresponds to one of the contours of my density for epsilon, right? And so you can imagine placing a little Gaussian bump here.
And so I'll draw an example from this little Gaussian, and let's say I get that point there; I do the same here, and so on. So I draw a bunch of examples from these Gaussians, and the - whatever you call it - the orange points I drew will comprise a typical sample of what the distribution of X looks like under this model. Okay? Yeah?

Student: Would you add, like, the mean?

Instructor (Andrew Ng): Oh, say that again.

Student: Do you add the mean into that?

Instructor (Andrew Ng): Oh, yes, you do. And in this example I set mu to (zero, zero) just to make it easier. If mu were something else, you'd take the whole picture and sort of shift it over by whatever the value of mu is. Yeah?

Student: [Inaudible] horizontal line right there, which was Z - what does the Y-axis correspond to?

Instructor (Andrew Ng): Oh, so Z is one-dimensional, so here I'm just plotting a typical sample for Z, and this is zero. So this is just the Z axis, right - Z is one-dimensional data, so this line here is a plot of a typical sample of values for Z. Okay? Yeah?

Student: You have two axes there, right, and the points are the samples - so are you sort of projecting them onto that line?

Instructor (Andrew Ng): Let's not talk about projections yet, but, yeah, right. So these beige points - that's like X1, and that's X2, and so on, right? So the beige points are what I see, and in reality all you ever get to see are the X's. But just like in the mixture of Gaussians model, where I told a story in which the data came from two Gaussians and a hidden random variable Z led to the generation of the X's, in the same way I'm telling a story here. All the algorithm actually sees are the orange points, but we're gonna tell a story about how the data came about, and that story is what comprises the factor analysis model. Okay?
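Here is a minimal NumPy sketch of exactly this generative story, using the example's values lambda = (2, 1) and mu = 0; the diagonal entries of psi are values I've made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 7                                  # number of examples, like z_1, ..., z_7
d, n = 1, 2                            # z is one-dimensional, x is two-dimensional
mu = np.zeros(n)                       # mu = 0, as in the example
Lam = np.array([[2.0], [1.0]])         # lambda = (2, 1), an n-by-d matrix
Psi = np.diag([0.5, 0.5])              # diagonal psi; these entries are made up

Z = rng.standard_normal((m, d))                           # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)   # epsilon ~ N(0, psi)
X = mu + Z @ Lam.T + eps                                  # x = mu + lambda z + epsilon

# Z @ Lam.T puts every point on the line through the origin in the direction (2, 1);
# adding eps scatters each point inside a little axis-aligned Gaussian bump around it,
# which is what the orange points in the sketch show.
```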
So one way to see the intuition of this model - this is informal, not formal - is that you can think of the factor analysis model as modeling the data as coming from a lower-dimensional subspace, more or less: the data X lies approximately on a one-dimensional line, plus a little bit of random noise, so that X isn't exactly on that line. That's one informal way of thinking about factor analysis.

We're not doing great on time. Well, let's do this - let me just do one more quick example, in which Z is in R2 and X is in R3, right? So in this example your data Z now lies in 2-D, so let me draw it on a sheet of paper. Let's say the axes of my sheet of paper are the Z1 and Z2 axes, and here is a typical sample of points Z, right? So this is a typical sample for Z plotted on the Z1 and Z2 axes, and I guess the origin would be here, so it's centered around zero.

And then we'll take those points and map them to mu plus lambda Z, and what that means is: if you imagine that the free space of this classroom is R3, we'll take this sample of Z's and map it to positions in free space. So we take this sheet of paper and move it to some position and some orientation in 3-D space.

And the last step is that X equals mu plus lambda Z plus epsilon, and so you take this set of points, which lies on some plane in 3-D space, and add a little bit of noise to each one, where the noise comes from Gaussians whose contours are axis-aligned. Okay? So you end up with a data set that's sort of like a fat pancake - a plane with a little bit of fuzz coming off it.
So that's the model - now let's actually talk about how to fit the parameters of the model. Okay?

In order to describe how to fit the model, I first need to re-write Gaussians in a slightly different way. So, in particular, let's say I have a vector X, and I'm gonna use this notation to denote partitioned vectors: X1, X2, where if X1 is, say, an r-dimensional vector, then X2 is an s-dimensional vector and X is an r-plus-s-dimensional vector. Okay? So I'm using this notation to denote taking a vector and partitioning it into two halves: the first r elements followed by the last s elements.

So let's say you have X coming from a Gaussian distribution with mean mu and covariance sigma, where mu is itself a partitioned vector - break mu up into two pieces, mu1 and mu2 - and the covariance matrix sigma is now a partitioned matrix. Okay? So what this means is that you take the covariance matrix sigma and break it up into four blocks, right? And the dimensions work out so that there are r elements here and s elements here, and r elements here; so, for example, sigma 1,2 will be an r by s matrix - r elements tall and s elements wide.

So this Gaussian is really a joint distribution over a lot of variables, right? X is a vector, so this is a joint distribution over X1 through X of r plus s. We can then ask: what are the marginal and conditional distributions of this Gaussian? So, for example, with my Gaussian I know what P of X is, but can I compute the marginal distribution of X1? And P of X1 is just equal to, of course, the integral over X2 of P of X1 comma X2, dX2.
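Written out, the partition notation and the marginal just described are (my LaTeX reconstruction of the board notation):

```latex
\begin{gather*}
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\qquad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\qquad x_1 \in \mathbb{R}^{r},\ x_2 \in \mathbb{R}^{s},\ \Sigma_{12} \in \mathbb{R}^{r\times s},\\
p(x_1) = \int p(x_1, x_2)\, dx_2 .
\end{gather*}
```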
And if you actually perform that computation, you find that P of X1 is Gaussian with mean given by mu1 and covariance sigma 1,1. All right, so this is no surprise: the marginal distribution of a Gaussian is itself a Gaussian, and you just pick out the relevant sub-block of the covariance matrix and the relevant sub-vector of the mean vector mu.

You can also compute conditionals: what is P of X1 given a specific value for X2, right? And the way you compute that is the usual way: P of X1 comma X2 divided by P of X2, right? And you know what both of these formulas are: the numerator is just the usual Gaussian - your joint distribution over X1, X2 is a Gaussian with mean mu and covariance sigma - and the denominator, by the marginalization operation I just talked about, is that.

So if you plug in the formulas for these two Gaussians and simplify - and the simplification step is actually fairly non-trivial; if you haven't seen it before it will actually be somewhat difficult to do - you find that, conditioned on the value of X2, the distribution of X1 given X2 is itself Gaussian, and it will have mean mu of 1 given 2 and covariance sigma of 1 given 2, where - well, I'm not gonna show the derivation - mu of 1 given 2 is given by this, and I think sigma of 1 given 2 is given by that. Okay?
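Those marginal and conditional formulas can be wrapped up in a small NumPy sketch (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def marginal_and_conditional(mu, Sigma, r, x2):
    """For x ~ N(mu, Sigma) partitioned into the first r coordinates (x1)
    and the remaining s coordinates (x2), return
      - the marginal of x1:              N(mu1, Sigma11)
      - the conditional of x1 given x2:  N(mu_{1|2}, Sigma_{1|2})
    using the formulas quoted in lecture.
    """
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

    S22_inv = np.linalg.inv(S22)
    cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)   # mu_{1|2}
    cond_cov = S11 - S12 @ S22_inv @ S21           # Sigma_{1|2}
    return (mu1, S11), (cond_mean, cond_cov)
```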
So these are just useful formulas to know for finding the conditional distributions of a Gaussian and the marginal distributions of a Gaussian. I won't actually show the derivation.

Student: Could you repeat the [inaudible]?

Instructor (Andrew Ng): Sure. So the one on the left, mu of 1 given 2, equals mu1 plus sigma 1,2 times sigma 2,2 inverse times X2 minus mu2; and sigma of 1 given 2 equals sigma 1,1 minus sigma 1,2 times sigma 2,2 inverse times sigma 2,1. Okay? These are also in the lecture notes.

Shoot, I'm not as far along as I was hoping to be on time. Well, actually, it's okay. So it turns out - I thought I might skip this in the interest of time, but I kind of need it - so let's go back and use these in the factor analysis model, and figure out exactly what joint distribution the factor analysis model assumes over Z and X. Okay?

So, under the factor analysis model, the random variables Z and X have some joint Gaussian distribution with some mean vector, which I'll write as mu ZX, and some covariance matrix sigma. So let's go back and figure out what mu ZX is and what sigma is, and I'll do this so that we get a little bit more practice with partitioned vectors and partitioned matrices.

So, just to remind you: you have Z distributed Gaussian with mean zero and covariance the identity, and X is mu plus lambda Z plus epsilon, where epsilon is Gaussian with mean zero and covariance psi - I'm just writing out the same equations again. So let's first figure out what the vector mu ZX is. Well, the expected value of Z is zero - and, again, as usual, I'll often drop the square brackets here.
And the expected value of X is - well, the expected value of mu plus lambda Z plus epsilon. The last two terms have zero expectation, and so the expected value of X is just mu. And so the vector mu ZX - the mean parameter of my joint Gaussian - is going to be the expected value of this partitioned vector, partitioned into Z and X, and so that is just zero followed by mu. Okay? And that's a d-dimensional zero followed by an n-dimensional mu.

Next, let's work out what the covariance matrix sigma is. So the covariance matrix sigma - if you work through the definition - is going to be a partitioned matrix. It will comprise four blocks like that, and the upper-left block, which I write as sigma 1,1, is just the covariance matrix of Z, which we know is the identity.

I was gonna show you briefly how to derive some of the other blocks, right? So sigma 1,2 - that's the upper - oh, actually, excuse me: sigma 2,1, which is the lower-left block, is E of X minus E of X, times Z minus E of Z, transpose. So X is equal to mu plus lambda Z plus epsilon, and then minus E of X is minus mu, and then times Z transpose - because the expected value of Z is zero, the Z minus E of Z term is just Z. And so if you expand this out, the plus mu and minus mu cancel, and you're left with the expected value of lambda -
- lambda Z Z transpose, plus the expected value of epsilon Z transpose, and that is just equal to lambda times the identity matrix - in other words, lambda. Okay? Does that make sense? Because the second term is equal to zero: epsilon and Z are independent and both have zero expectation, so that term vanishes.

Well, so the final block is sigma 2,2, which is equal to the expected value of mu plus lambda Z plus epsilon minus mu, times the same thing transposed, right? I won't do this one out, but it simplifies to lambda lambda transpose plus psi. Okay?

So, putting all this together, this tells us that the joint distribution of this vector Z, X is going to be Gaussian with mean vector given by that, which we worked out previously - this is the mu ZX that we worked out previously - and covariance matrix given by that. Okay?
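Assembling those blocks, the joint distribution the factor analysis model implies is, in standard notation (my reconstruction; the upper-right block is just the transpose of the lower-left block derived above):

```latex
\begin{bmatrix} z \\ x \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \vec{0} \\ \mu \end{bmatrix},\;
\begin{bmatrix} I & \Lambda^{\top} \\ \Lambda & \Lambda\Lambda^{\top} + \Psi \end{bmatrix}
\right),
\qquad\text{and hence, marginally,}\qquad
x \sim \mathcal{N}\!\left(\mu,\ \Lambda\Lambda^{\top} + \Psi\right).
```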
So, in principle - let's see - the parameters of our model are mu, lambda, and psi, and in order to find the parameters of this model we're given a training set of m examples, and we'd like to do maximum likelihood estimation of the parameters. And so, in principle, one thing you could do is actually write down what P of XI is. The distribution of X - if, again, you marginalize this joint Gaussian - is the lower half of this partitioned vector, and so it is going to have mean mu and covariance given by lambda lambda transpose plus psi, right? So that's the distribution that we're using to model P of X.

And so, in principle, one thing you could do is actually write down the log likelihood of your parameters, which is just the sum over I of log P of XI, where P of XI is given by this Gaussian density, right? And I'm using theta as a shorthand to denote all of my parameters. And so, since you know what the density of a Gaussian is, you can say that P of XI is this Gaussian with mean mu and covariance given by lambda lambda transpose plus psi. So you can write down the log likelihood of your parameters as follows, and you can try to take derivatives of the log likelihood with respect to your parameters and maximize it, all right?

It turns out that if you do that, you end up with an optimization problem for which you will not be able to find analytic, closed-form solutions. So if you say my model of X is this and do maximum likelihood parameter estimation, you won't be able to find the maximum likelihood estimate of the parameters in closed form.

So, in order to fit the parameters of this model, what we'll actually do is use the EM Algorithm, in which, in the E step, we'll compute that. And this formula looks the same as before, except for one difference: Z is now a continuous random variable, and so in the E step we actually have to find the density QI of ZI - the E step requires that we find the posterior distribution of the random variable ZI. And then the M step will perform the following maximization where, again, because Z is now continuous, we need to integrate over Z. Okay? So in the M step, because ZI is continuous, we now have an integral over Z rather than a sum.
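Since Z and X are jointly Gaussian under this model, the E-step posterior P of ZI given XI follows directly from the conditional-Gaussian formulas above. Here is a hedged NumPy sketch of that posterior computation - the function and variable names are my own, not code given in the lecture, and the M-step that consumes these quantities is derived in the next lecture:

```python
import numpy as np

def e_step_posterior(X, mu, Lam, Psi):
    """Posterior p(z | x) for the factor analysis model, one row of X at a time.

    Applying the conditional-Gaussian formulas to the joint of (z, x) gives
      mu_{z|x}    = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu)
      Sigma_{z|x} = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda .
    """
    n, d = Lam.shape
    S_xx = Lam @ Lam.T + Psi                       # marginal covariance of x
    S_xx_inv = np.linalg.inv(S_xx)
    post_means = (X - mu) @ S_xx_inv @ Lam         # m-by-d matrix of posterior means
    post_cov = np.eye(d) - Lam.T @ S_xx_inv @ Lam  # shared posterior covariance
    return post_means, post_cov
```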
Okay. I was hoping to go a little bit further in deriving these things, but I don't have time today, so we'll wrap that up in the next lecture. But before I close, let's check whether there are questions about the whole factor analysis model. Okay. So we'll come back in the next lecture; I'll wrap up this model, and I want to go a little bit deeper into the E and M steps, since there are some tricky parts for the factor analysis model specifically. Okay. I'll see you in a couple of days.