This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue the discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, LSI. Then I want to spend just a little time talking about how to implement PCA, especially for very large problems; in particular, I'll spend a bit of time on the singular value decomposition, or SVD, implementation of principal components analysis. Then in the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways also manages to accomplish very different things than PCA. So this lecture will actually wrap up our discussion of unsupervised learning, and in the next lecture we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal components analysis: I said that in PCA, we imagine that we have some very high-dimensional data that perhaps lies approximately on some low-dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA was to normalize the data to zero mean and unit variance: you subtract out the mean of your training examples, so the data now has zero mean, and then you normalize each of your features so that the variance of each feature is now one. The next step was to compute the covariance matrix Sigma of your zero-mean data; you compute it as the sum of all the outer products x^(i) x^(i) transpose. Then you find the top k eigenvectors of Sigma.
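As a concrete sketch of those three steps (my illustration, not code from the lecture; Python/NumPy with made-up variable names):

```python
import numpy as np

def pca(X, k):
    """Toy PCA sketch. X is an (m, n) matrix with one training example per row."""
    # Step 1: normalize to zero mean and unit variance, feature by feature.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix Sigma (the 1/m scaling doesn't change the eigenvectors).
    Sigma = (X.T @ X) / X.shape[0]
    # Step 3: top-k eigenvectors of Sigma (np.linalg.eigh sorts eigenvalues ascending).
    _, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, ::-1][:, :k]        # columns are the top-k principal components
    return X @ U                       # data projected onto the k-dimensional subspace
```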
So last time we saw applications of this. For example, one of the applications was eigenfaces, where each of your training examples, x^(i), is an image. If you have 100-by-100 images, that is, if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values in each of your 100-by-100 images. So the eigenfaces application was one where the training examples comprised pictures of people's faces. We ran PCA, and then to measure the distance between, say, a face here and a face there, we would project both of the face images onto the subspace and then measure the distance along the subspace. In eigenfaces, you use something like 50 principal components.

The difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000-by-10,000 matrix, which has 100 million entries; that's huge. So we'd like to apply PCA to very, very high-dimensional data like this, with the goal of reducing its dimension, but step two of the algorithm requires constructing this extremely large matrix Sigma, which you can't really do. We'll come back to this in a second.

It turns out that one of the other frequently used applications of PCA is actually to text data. Here's what I mean. Remember our vectorial representation of emails? This is way back from when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or given a text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary.
Somewhere you have the word learn, somewhere you have the word study, and so on. Depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used in, I think, lecture five or lecture six for representing text documents when we were building Naive Bayes classifiers for spam classification. It turns out one of the common applications of PCA is to this text-data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name: latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing step. For various reasons, in LSI you usually don't normalize the data to zero mean, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so it does something very similar to PCA. Normalizing the variance to one for text data would actually be a bad idea, because that would have the effect of dramatically scaling up the weight of rarely occurring words. For example, the word aardvark hardly ever appears in any document, so to normalize the variance of that feature to one, you'd end up dramatically scaling up the weight of the word aardvark, which isn't what you want.

So let's see. In natural language applications, something that we want to do quite often is: given two documents, x^(i) and x^(j), measure how similar they are. For example, I may give you a document and ask you to find me more documents like this one. Maybe you're reading an article about some news event of today and want to find out what other news articles there are about it. So I give you a document and ask you to look through this large set of documents and find the ones similar to it.
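As a small sketch of that binary bag-of-words representation (a toy vocabulary I'm making up for illustration, not from the lecture):

```python
import numpy as np

vocabulary = ["aardvark", "learn", "study", "zygote"]   # toy dictionary

def bag_of_words(document, vocabulary):
    """Binary vector: entry k is 1 if word k appears in the document, else 0."""
    words = set(document.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocabulary])

x_i = bag_of_words("notes on effective study habits", vocabulary)  # [0, 0, 1, 0]
x_j = bag_of_words("good ways to learn", vocabulary)               # [0, 1, 0, 0]
```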
So this is a typical text application. To measure the similarity between two documents x^(i) and x^(j), each document is represented as one of these high-dimensional vectors. One common way to do this is to view each of your documents as a very high-dimensional vector whose dimension is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary. One measure of the similarity between two documents that's often used is the angle between them. In particular, if the angle between the two vectors is small, then we'll consider the two documents to be similar; if the angle between the two vectors is large, then we consider the documents to be dissimilar.

More formally, one commonly used heuristic in natural language processing is to say that the similarity between two documents is the cosine of the angle theta between them. For the range of values we care about, anyway, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product x^(i) transpose x^(j), divided by the product of the norms of the two vectors. That's just the standard linear algebra, or geometry, definition of the cosine of the angle between two vectors.
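Here's a minimal sketch of that cosine similarity measure (again just an illustration in NumPy):

```python
import numpy as np

def cosine_similarity(x_i, x_j):
    """cos(theta) = <x_i, x_j> / (||x_i|| * ||x_j||)."""
    return (x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

x_i = np.array([0., 0., 1., 0.])     # "study" document from the toy sketch above
x_j = np.array([0., 1., 0., 0.])     # "learn" document
print(cosine_similarity(x_i, x_j))   # 0.0: no words in common
```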
Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace, that is, by running PCA on your text data this way, the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing.

Look further at the definition of the cosine similarity measure. The numerator, the similarity between the two documents, was this inner product, which is the sum over k of x^(i)_k times x^(j)_k. So this inner product will be equal to zero if the two documents have no words in common. It's really the sum over k of an indicator of whether documents i and j both contain word k, because x^(i)_k indicates whether document i contains word k, and x^(j)_k indicates whether document j contains word k. So the product is one only if the word k appears in both documents, and therefore the similarity between the two documents is zero if they have no words in common.

For example, suppose your document x^(i) has the word study and the document x^(j) has the word learn. Then these two documents may be considered entirely dissimilar. Maybe you're reading about effective study strategies; sometimes you read a news article about that, and you ask, what other documents are similar to this? Even if there are a bunch of other documents about good methods to learn, there are no words in common, so the similarity comes out as zero.

So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word learn, and on the vertical axis I plot the word study. The values take on either zero or one. If a document contains the word learn but not study, then it'll plot that document there; if a document contains neither the word study nor learn, then it'll plot that at zero, zero. So here's the cartoon behind what PCA is doing: we identify a lower-dimensional subspace, which would be some eigenvector we get out of PCA.
Now, suppose we have a document about learning and a document about studying. The document about learning points to the right; the document about studying points up. So the inner product, or the cosine of the angle, between these two documents would be, excuse me, the inner product between these two documents will be zero. So these two documents are entirely unrelated, which is not what we want: documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then the two documents become much closer together, and the algorithm will recognize that when you take the inner product between them, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

So that's just intuition about what PCA may be doing to text data. The same thing goes for other examples beyond the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians: that will also bring the documents closer together. Any related topics end up with their points closer together in this lower-dimensional space.
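To make that concrete, here's a hedged sketch (my own illustration, not code from the lecture) of comparing two documents after projecting them onto the top directions found by PCA/LSI:

```python
import numpy as np

def lsi_similarity(x_i, x_j, U_k):
    """Cosine similarity in the reduced space.

    U_k is an (n, k) matrix whose columns are the top-k eigenvectors
    (principal directions) found by PCA on the document collection.
    """
    z_i, z_j = U_k.T @ x_i, U_k.T @ x_j      # project each document onto the subspace
    return (z_i @ z_j) / (np.linalg.norm(z_i) * np.linalg.norm(z_j))
```

If study and learn tend to co-occur across the collection, they load on the same principal direction, so the study document and the learn document end up with overlapping projections and a positive similarity, even though their raw inner product was zero.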
Any questions about this? Interviewee: [Inaudible]. Which ones? This one? No, the line. Oh, this one. Oh, yes. Thank you. [Inaudible].

So let's talk about how to actually implement this now. Okay, how many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you, a lot more than I thought. Curious: did you guys learn it as undergrads or as graduate students? All right, let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about this too.

So I'll say a little bit about how to implement PCA. The problem I was alluding to just now was that when you have these very high-dimensional vectors, Sigma is a large matrix. In particular, for our text example, if the vectors x^(i) are 50,000-dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway. It turns out there's another way to implement PCA, which uses the following fact: if A is any m-by-n matrix, then one of the most remarkable results of linear algebra is that A can be decomposed into a singular value decomposition. What that means is that the matrix A, which is m by n, can always be decomposed into a product of three matrices: U, which is m by n; D, a square matrix, which is n by n; and V transpose, which is also n by n. D is going to be diagonal, with zeros on the off-diagonals, and the values sigma_i on its diagonal are called the singular values of the matrix A.

Almost all of you said you learned this as graduate students rather than as undergrads, and it turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions: you usually learn about the QR decomposition, maybe the LU factorization of matrices. Most undergrad courses don't get to talk about singular value decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used an SVD dozens of times in the last year, but for LU and QR decompositions, I think I used the QR decomposition once and an LU decomposition once in the last year. So let's see.
I'll say a bit more about this; I'm going to draw the picture, I guess. For example, if A is an m-by-n matrix, it can be decomposed into another matrix U, which is also m by n (it's the same size as A); D, which is n by n; and another square matrix, V transpose, which is also n by n. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.

To compute it, you just use the svd command in Matlab or Octave. The state of the art in numerical linear algebra is such that SVDs, singular value decompositions of matrices, can be computed extremely reliably. We've been using packages like Matlab or Octave to compute, say, the eigenvectors of a matrix; SVD routines are, if anything, even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed: you can compute the square root of something and just not worry about it, because you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands-by-thousands matrices, I think of the SVD routine like a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is implemented very well today.
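Here's a small check of the properties just stated, written in NumPy rather than Matlab/Octave (illustrative only):

```python
import numpy as np

A = np.random.randn(100, 5)                       # an m-by-n matrix, m > n here
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt, with U m-by-n

# The decomposition reconstructs A.
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Columns of U are eigenvectors of A A^T, columns of V are eigenvectors of A^T A,
# in both cases with eigenvalues equal to the squared singular values.
assert np.allclose(A @ A.T @ U, U * s**2)
assert np.allclose(A.T @ A @ Vt.T, Vt.T * s**2)
```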
[Student:] What's the complexity of SVD? That's a good question. I actually don't know; I want to guess it's roughly on the order of n cubed, but I'm not sure. I don't know exactly what's known about the convergence of these algorithms.

The example I drew out was for a fat matrix, one that's wider than it is tall. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it will decompose it into a product of three matrices like that.

The nice thing about this is that we can use it to compute eigenvectors, and PCA, very efficiently. In particular, the covariance matrix Sigma was this: it was the sum of all the outer products. So if you go back and recall the definition of the design matrix (I think I described this in lecture two, when we derived the closed-form solution to least squares), the design matrix X was the matrix where I took my training examples and stacked them in rows. If you construct the design matrix, then the covariance matrix Sigma can be written as just X transpose X. I hope you see why X transpose X gives you the sum of the outer products of the vectors; if you aren't seeing this right now, just go home and convince yourself that it's true.

So to get the top k eigenvectors of Sigma, you would take Sigma and decompose it using the, excuse me, you would take the matrix X, and you would compute its SVD, so you get U D V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.
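Here's a hedged sketch of that trick in NumPy (not the lecture's code): take the SVD of the design matrix instead of ever forming the 50,000-by-50,000 covariance matrix. Note that with examples stacked as rows, the principal directions come out of the right singular vectors, V, which is exactly the U-versus-V question that comes up in a moment.

```python
import numpy as np

def pca_via_svd(X, k):
    """Top-k principal directions of Sigma = X^T X without forming Sigma.

    X is the (m, n) design matrix, one zero-mean example per row; n can be huge
    (say 50,000) while m is small (say 100).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # compact SVD: Vt has shape (m, n)
    return Vt[:k].T                                   # (n, k): top-k eigenvectors of X^T X
```

The cost here scales with the small dimension m rather than with the 50,000-dimensional feature space, which is the whole point of the trick.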
So in our example, if you have 50,000 words in your dictionary, the design matrix would be in R^(m by 50,000), say 100 by 50,000 if you have 100 examples. So X is quite tractable to represent and to compute the SVD of, whereas the matrix Sigma, which is 50,000 by 50,000, would be much harder to represent. So this gives you an efficient way to implement PCA.

The reason I want to talk about this is that in previous years, when I didn't, I found a number of students in the class projects trying to implement this on huge problems and running into trouble. So this is a much better way to implement PCA if you have extremely high-dimensional data. If you have low-dimensional data, say 50- or 100-dimensional data, then computing Sigma is no problem and you can do it the old way; but otherwise, use the SVD to implement this.

Questions about this?

The last thing I want to say is that in practice, when you want to implement this, I want to add a note of caution. It turns out that for many applications of, let's see, when you apply SVD to these wide, yeah?

[Student:] Just a quick question: are those the top k columns of U or of V? Because X transpose X is V D D V transpose, right?

Let's see. Oh, yes, I think you're right. I think you're right. Let's see, is it the top k columns of U or the top k of V? Yeah, I think you're right. Is that right? Something bothers me about that, but I think you're right: X is U D V transpose, so X transpose X would be V D U transpose U D V transpose, which is V D-squared V transpose. If anyone thinks about this and has another opinion, let me know, but I think you're right; I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know? That sounds right. Okay. Cool.

So, just one last note, a note of caution.
It turns out that in this example, I was implementing SVD with a wide matrix: the matrix X was m by n with n much bigger than m. It turns out that when you find the SVD decomposition of this (and let's see, yeah, I think you're definitely right), the right-most portion of the U matrix in this picture would be all zeros, and when you compute the matrix D, a large part of that matrix would be zeros as well. So depending on what convention you use, for example, I think Matlab actually uses the convention of just cutting off those zero elements: chopping off the right-most portion of the U matrix and chopping off the bottom portion of the D matrix. I'm not sure if this even depends on the version of Matlab, but when you call svd in Matlab or in some other numerical algebra package, there are slightly different conventions for how to define the SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses back onto the original computation. Whether you're calling Matlab or writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices, so just make sure you figure this out for the package that you use.
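As a small illustration of the convention issue (NumPy shown here; Matlab's svd and svd(X,'econ') behave analogously, but check your own library's documentation):

```python
import numpy as np

X = np.random.randn(100, 50000)      # wide matrix: far more columns than rows

# "Economy" / compact SVD: the blocks that would be all zeros are chopped off.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)    # (100, 100) (100,) (100, 50000)

# With full_matrices=True, the returned V factor would instead be 50,000 x 50,000,
# far too large to want, and the extra columns aren't needed for PCA.
```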
Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two-by-two matrix, a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I had a bunch of points z on a line and then these ellipses that I drew around them. That was the factor analysis model, which models the density of x. And then there was PCA, just now. So the difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes and finds the subspace.

So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA. Factor analysis also sort of assumes the data lies on a subspace; let me write "subspace" here. So both of these algorithms assume the data maybe lies close to, or on, some low-dimensional subspace, but fundamentally, factor analysis is a density estimation algorithm: if you have some very high-dimensional distribution and you want to model p(x), then factor analysis is the algorithm I'd be more inclined to use. So even though you could in theory, I would tend to avoid using factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection, if you want to model p(x) so that when an example has very low probability you can flag it as an anomaly, then I would tend to use factor analysis to do that density estimation. So factor analysis and PCA are both algorithms that assume your data lies on or near a subspace.
The other class of algorithms we talked about was algorithms that assume the data lies in clumps, that the data has a few coherent groups. So let me just fill in the rest of this picture. If you think your data lies in clumps or groups and your goal is density estimation, then I would tend to use a mixture-of-Gaussians algorithm. But again, maybe you don't want to endow your data with any probabilistic semantics; if you just want to find the clumps or the groups, then I'd be inclined to use a k-means algorithm. I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain. Hopefully this helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion of PCA. What I want to do next is talk about independent component analysis, or ICA. Yeah?

Interviewee: I have a question about the upper right [inaudible]. So once you have all of the eigenvectors, [inaudible] how similar feature i is to feature j: you pick some eigenvector and take some dot products between feature i and feature j and the eigenvector. But there are a lot of eigenvectors to choose from.

Instructor (Andrew Ng): Right. So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? I'm going to start this up. So the answer is really that, in this cartoon, I would avoid thinking about eigenvectors one at a time. A better way to view this cartoon is that if I decide to choose 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector; this arrow in the cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors.
So what I actually do is project my data onto the span, the linear span of the eigenvectors, and then I measure distances, or take inner products, between the projections of the two points onto that subspace. Okay.

So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA we're going to try to find the independent components of variation in the data. So switch to the laptop there, please; we'll just take a second to motivate that.

This is actually a slide that I showed in lecture one, the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping versions of the two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this. [Recording]

So these are really two speakers, speaking independently of each other. Each speaker is outputting a series of sound signals independent of the other conversation going on in the room. This being an unsupervised learning problem, the question is: can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording] This is the other one. [Recording]

Just for fun (these are audio clips I got from [inaudible]), let me play the other ones as well. This is overlapping microphone one. [Recording] Here's microphone two. [Recording] So given this as input, here's output one. [Recording] It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording] Okay, switch back to the chalkboard, please.
So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you of what they are.

Let's say you have a one-dimensional random variable S, and suppose it has a probability density function p_S(s). Then the CDF is defined as the function F, which is the probability that the random variable S is less than the value given by that lower-case value s. For example, if this is your Gaussian density (the density of the Gaussian is usually denoted lower-case phi; that's roughly a bell-shaped density), then the CDF of the Gaussian will look something like this; it's the capital function Phi. So if I pick a value s like that, then the height of this function is the probability that my Gaussian random variable is less than that value there. In other words, the height of the function at that point equals the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian. So another way to write F of s is as the integral from minus infinity to s of the density, p(t) dt.

Something that will come up later is this: suppose I have a random variable S, and I want to model the distribution of the random variable S. One thing I could do is specify what I think the density is. Or I can specify what the CDF is. These are related by this equation: F is the integral of the density p.
You can also recover the density by taking the CDF and taking its derivative: F prime, the derivative of the CDF, gives you back the density. This will come up in the middle of when I derive ICA: there'll be a step where we need to assume a distribution for the random variable S. I can either specify the density for S directly, or I can specify the CDF, and I choose to specify the CDF. It has to be some function increasing from zero to one, so you can choose any function that looks like that; in particular, pulling a function out of a hat that looks like that, you can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution, the density, of the random variable S. So this will come up later.

Just raise your hand if that is familiar to you, if you've seen that before. Great.
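As a tiny numerical sketch of that relationship (my illustration, not from the lecture): pick a sigmoid as the CDF F, and differentiating it gives back the corresponding density p.

```python
import numpy as np

def sigmoid_cdf(s):
    """A CDF pulled 'out of a hat': it increases from 0 to 1, so it is a valid F."""
    return 1.0 / (1.0 + np.exp(-s))

def density_from_cdf(F, s, eps=1e-5):
    """p(s) = F'(s); here approximated with a centered finite difference."""
    return (F(s + eps) - F(s - eps)) / (2 * eps)

s = 0.7
print(density_from_cdf(sigmoid_cdf, s))       # numerical p(s)
print(sigmoid_cdf(s) * (1 - sigmoid_cdf(s)))  # exact derivative of the sigmoid, for comparison
```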
So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources; say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s^(i)_j to denote the signal from speaker j at time i.

Here's what I mean. What is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in the air. The way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and your ear hears my voice by detecting those changes in air pressure. So what a microphone records, what my mouth is generating, is a pattern of changes in air pressure; I'm going to draw a cartoon of it, I guess. You look at a microphone recording and you see these roughly periodic signals comprising changes in air pressure over time, as the air pressure goes above and below some baseline. So this is what the speech signal looks like, say; this is speaker one. What I'm saying is that at some time t, the value at that point I'm going to denote as s, superscript t, subscript 1. Similarly, speaker two is outputting some sound wave; if we played it, it would actually sound like a single tone, I guess. So in the same way, at the same time t, the value of the air pressure generated by speaker two I'll denote as s^(t)_2.

So we observe x^(i) = A s^(i), where these x^(i)'s are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying; each microphone records some overlapping combination of what the speakers are saying. For example, x^(i)_j, which is what microphone j records at time i, is, by the definition of matrix multiplication, the sum over k of A_jk s^(i)_k. So what my j-th microphone is recording is some linear combination of all of the speakers: at time i, what microphone j records is some linear combination of what all the speakers are saying at time i, where k here indexes over the n speakers.

So our goal is to find the matrix W = A inverse (we're just defining W that way), so that we can recover the original sources as a linear combination of our microphone recordings, s^(i) = W x^(i).
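Here's a hedged sketch of that mixing model and notation (NumPy, with a made-up mixing matrix A purely for illustration): each row of S holds what the n speakers say at one time step, A mixes them into the microphone recordings, and if we knew A, then W = A inverse would recover the sources.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 2                        # m time steps, n speakers (= n microphones)

S = rng.uniform(-1, 1, size=(m, n))   # S[i, j] = s^(i)_j: speaker j's signal at time i
A = np.array([[1.0, 0.5],             # hypothetical mixing matrix (unknown in practice)
              [0.3, 1.0]])

X = S @ A.T                           # x^(i) = A s^(i): what the microphones record

W = np.linalg.inv(A)                  # the unmixing matrix we'd like ICA to find
assert np.allclose(X @ W.T, S)        # s^(i) = W x^(i) recovers the original sources
```

Of course, ICA has to estimate W from X alone, without knowing A; that's what the rest of the derivation is about.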
Just as a point of notation, I'm going to write the matrix W this way: I'm going to use lowercase w subscript one, subscript two, and so on to denote the rows of this matrix W.

Let's see. So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example, let's say that each of my speakers outputs uniform white noise; this will sound like white noise. Can I switch to the laptop display, please? If that's the case, these are my axes, s1 and s2, and this is what my two speakers would be uttering: a plot of what they're uttering will fill in a square box if the two speakers are independently outputting uniform random variables on [minus one, one]. So this is a plot of s1 and s2, my original sources, and this would be a typical sample of what my microphones record; here the axes are x1 and x2. These are images I got from [inaudible] on ICA. Given a picture like this, you can sort of look at this box and tell what the axes of the parallelogram are, and you can figure out what linear transformation would transform the parallelogram back into a box.
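Here is a minimal sketch of that picture in code; the mixing matrix A below is an arbitrary made-up example, chosen only so that the recordings form a visible parallelogram:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two independent sources, each uniform on [-1, 1]: their samples fill a square box.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 2))   # rows are time steps, columns are speakers

# An arbitrary (made-up) mixing matrix A; each microphone hears a linear combination.
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])
X = S @ A.T                                   # x^(i) = A s^(i), written row-wise

# The mixed data fills a parallelogram whose edges hint at the columns of A.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(S[:, 0], S[:, 1], s=2); ax1.set_title("sources s1, s2 (square)")
ax2.scatter(X[:, 0], X[:, 1], s=2); ax2.set_title("recordings x1, x2 (parallelogram)")
plt.show()
```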
So it turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and you may end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: if you reflect the picture across a 45-degree axis, you still get a box. So there's no way for the algorithm to tell which was speaker No. 1 and which was speaker No. 2; the numbering, or the ordering, of the speakers is ambiguous.

The other source of ambiguity, and these are the only ambiguities in this example, is the sign of the sources. Given my microphone recordings, you can't tell whether you got back a positive s^(i) or a negative s^(i). In this picture, what that corresponds to is that if you reflect the picture along the vertical axis, or along the horizontal axis, you still get a box, so it could equally well have been the speakers with their signs flipped. So in this example, you can't guarantee that you've recovered a positive s^(i) rather than a negative s^(i).

It turns out these are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. For the permutation of the speakers, there's not much you can do about that. And it turns out that if you take an audio source, flip its sign, and play back negative s, it sounds indistinguishable. So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about it.
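To state the two ambiguities as equations (this just restates the argument above): for any permutation matrix $P$ (so $P^{-1} = P^{\top}$) and any diagonal matrix $D$ with entries $\pm 1$ (so $D^{-1} = D$),

$$x^{(i)} = A\,s^{(i)} = (A P^{\top})(P\,s^{(i)}) = (A D)(D\,s^{(i)}),$$

so the permuted or sign-flipped sources, paired with the correspondingly modified mixing matrix, explain the recordings exactly as well as the originals.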
Let's switch back to the chalkboard, please. It turns out, and I don't want to spend too much time on this, but I do want to say it briefly, that the reason those are the only sources of ambiguity, the permutation of the speakers and the signs, is that the s_j^(i)'s were non-Gaussian. I'll say this briefly. Suppose my original sources, s1 and s2, were Gaussian; so suppose s^(i) is Gaussian with mean zero and identity covariance. That just means that each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data.

You will recall that the contours of a Gaussian distribution with identity covariance look like this, right? The Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables, then if I observe a linear combination of them, there's actually no way to recover the original distribution, because there's no way for me to tell whether the axes are at this angle or at that angle and so on. The Gaussian is a rotationally symmetric distribution, so I would not be able to recover the orientation, the rotation, of this. I don't want to dwell on this too much, but it turns out that if your sources are Gaussian, then it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have, even if you had infinitely large amounts of data, you would not be able to recover the matrix A or W.

Let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Let S be a vector-valued random variable in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of S is p subscript S. My microphone recording is X equals A S, which equals W inverse S; equivalently, S equals W times X. So let's think about what the density of X is. I have p of S, I know the density of S, and X is a linear transformation of S, so let's figure out the density of X. One thing we could try is to just figure out what S is and apply the density of S to it: S is W times X, so you might guess that the density of X is p_S applied to W times X. But this is wrong.
It turns out you can do this for probability mass functions, but not for continuous densities. In particular, it's not correct to say that the density of X is obtained by just figuring out what S is and then applying the density of S to that value; you can't do that with densities, because p_S is a probability density function. The right formula is the density of S applied to W times X, times the determinant of the matrix W.

Let me illustrate that with an example. Let's say the density for S is that: in this example, S is uniform over the unit interval, so the density for S is just the density of the uniform distribution on [0, 1]. Now let X be equal to two times S, so this means A equals two and W equals one half. If S is uniform over [0, 1], then X, which is two times that, will be uniform over the range from zero to two. So the density for X will be, that's one, that's two, that's one half, and that's one: the density for X is the indicator that x lies between zero and two, times W, which is one half. Does that make sense? That is the correct density for X, because X is now spread out across a wider range, so the density of X is smaller, and that's why the density of X picks up this one half term here. Okay? This is an illustration for the case of one-dimensional random variables, S and X in one dimension. I'm not going to show it, but the generalization to vector-valued random variables is that the density of X is given by the density of S evaluated at W times X, times the determinant of the matrix W; over here I showed the one-dimensional case, and that's the generalization.
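Written out, the transformation rule and the worked example are (using the absolute value of the determinant to be careful about sign):

$$p_X(x) = p_S(Wx)\,\lvert \det W \rvert, \qquad W = A^{-1}.$$

In the one-dimensional example, $p_S(s) = \mathbf{1}\{0 \le s \le 1\}$, $x = 2s$, and $W = \tfrac{1}{2}$, so

$$p_X(x) = \mathbf{1}\{0 \le \tfrac{x}{2} \le 1\}\cdot\tfrac{1}{2} = \mathbf{1}\{0 \le x \le 2\}\cdot\tfrac{1}{2}.$$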
So we're nearly there. Here's how I can implement ICA. For my distribution on S, I'm going to assume that the density on S is given by a product over the n speakers of the density of speaker i emitting a certain sound. This is a product of densities, a product of distributions, because I'm going to assume the speakers are having independent conversations, so the s_i's are independent for different values of i. By the formula we just worked out, the density for X would then be equal to that. I'll just remind you, W was A inverse; it was the matrix I defined previously, so that s_i equals w_i transpose times x. So that's what's in there.
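Putting the independence assumption together with the transformation rule from before, the density written on the board has this form, where $w_i^{\top}$ denotes the i-th row of W:

$$p_S(s) = \prod_{i=1}^{n} p_s(s_i), \qquad p_X(x) = \left(\prod_{i=1}^{n} p_s\!\left(w_i^{\top} x\right)\right)\lvert \det W \rvert.$$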
To complete my formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. Following the discussion from earlier, one thing I could do is choose the density for S, or equivalently, I could choose the CDF, the cumulative distribution function, for S. In this case I'm going to choose a CDF, probably for historical reasons and for convenience. To choose the CDF for S, I just need to choose some function that increases from zero to one. I know I can't choose a Gaussian, because we know you can't do ICA on Gaussian data, so I need some function increasing from zero to one that is not the cumulative distribution function of a Gaussian distribution. So what other functions do I know that increase from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice, made for convenience; there's actually no great reason for choosing the sigmoid function. It's just a convenient function that we all know and are familiar with, and that happens to increase from zero to one. When you take the derivative of the sigmoid, that gives you back your density. This is just not Gaussian, and that's the main virtue of choosing the sigmoid. So there's really no deep rationale for the choice of the sigmoid; lots of other things will work fine, too. It's just a common, reasonable default.

It turns out that one reason the sigmoid works well for a lot of data sources is that, if this is the Gaussian, and you take the sigmoid and take its derivative, you find that the resulting density has heavier tails than the Gaussian. By this I mean that the density you get from the sigmoid dies down to zero much more slowly than the Gaussian: the magnitude of the Gaussian's tails dies down as e to the minus s squared, whereas for the sigmoid the tails look like e to the minus s. So the tails die down as e to the minus s rather than e to the minus s squared. It turns out that most distributions with this property, with heavy tails, where the distribution decays to zero relatively slowly compared to the Gaussian, will work fine for your data. Actually, one other choice you can sometimes use is what's called the Laplacian distribution, which is that. This will work fine, too, for many data sources.
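Concretely, the density obtained by differentiating the sigmoid and the Laplacian density both have exponential tails, while the Gaussian falls off much faster:

$$g'(s) = g(s)\,\bigl(1 - g(s)\bigr) \sim e^{-\lvert s\rvert} \ \text{as } \lvert s\rvert \to \infty, \qquad p_{\text{Laplace}}(s) = \tfrac{1}{2} e^{-\lvert s\rvert}, \qquad p_{\text{Gauss}}(s) = \tfrac{1}{\sqrt{2\pi}}\, e^{-s^2/2}.$$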
Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set, and as you can see this is an unlabeled training set, I can write down the log likelihood of my parameters: that's a sum over my training examples of the log of that quantity. So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent, which is just that. It turns out, if you work through the math, if p of s is equal to the derivative of the sigmoid, then when you work through the math to compute the gradient there, and you've all done this kind of calculation a lot of times, I won't bother to show the details; you find that it is equal to this. Okay? You can work that out yourself; it's just math, computing the derivative of this with respect to W. So to summarize: given the training set, here's my stochastic gradient ascent update rule, so you run stochastic gradient ascent to learn the parameters W. After you're done, you then output s^(i) equals W x^(i), and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity.
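Here is a minimal sketch of those two steps in code, assuming the sigmoid CDF choice above; the learning rate, iteration count, and identity initialization are arbitrary illustrative choices, and the gradient expression in the comment is what you get by differentiating the log likelihood for this model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_iters=100, seed=0):
    """Sketch of ICA by stochastic gradient ascent, assuming the sigmoid CDF g.

    Log likelihood:  l(W) = sum_i [ sum_j log g'(w_j^T x^(i)) + log |det W| ].
    X : (m, n) array with one microphone recording x^(i) per row.
    Returns the unmixing matrix W and the recovered sources S (rows are s^(i)).
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                      # initialize the unmixing matrix

    for _ in range(n_iters):
        for i in rng.permutation(m):   # visit training examples in random order
            x = X[i].reshape(n, 1)     # column vector x^(i)
            # Gradient of log p(x^(i); W) under the sigmoid choice:
            #   (1 - 2 g(W x)) x^T + (W^T)^{-1}
            grad = (1.0 - 2.0 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T)
            W = W + alpha * grad       # stochastic gradient ascent step

    S = X @ W.T                        # recovered sources, s^(i) = W x^(i), row-wise
    return W, S
```

After training, the rows of W play the role of the w_i's above, and the recovered sources are determined only up to the permutation and sign ambiguities discussed earlier.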
Okay? So just switch back to the laptop, please. We'll wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of neural recording data, such as EEG data. This is a picture of an EEG cap, in which there are a number of electrodes you place, in this case, on Katie's scalp, where each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures just changes in voltage over time; the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to show these very synchronized recordings. It turns out that when we look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats and from human eye blinks and so on. So the cartoon is: if you imagine placing the electrodes, or microphones, on my scalp, then each microphone is recording some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body has various processes going on as well, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find the independent components, the independent processes going on in my brain. This is an example of running ICA. You find that a small number of components, like those shown up there where the arrows are, correspond to the heartbeat; those are very periodic signals that come on occasionally and correspond to components of the heartbeat. You also find things like an eye blink component, corresponding to a signal generated when you blink your eyes. By doing this, you can then subtract out the heartbeat and the eye blink artifacts from the data, and you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step and a common application of ICA.

Another example is an application, again from [inaudible], of running ICA on small natural image patches. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the bases you get; this is a plot of the sources you get. The algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or a linear combination, of independent sources of the things that make up images. So this models natural images as generated by independent objects that generate different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out this is similar, in many ways, to computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges come out as the independent components of images.
Are there any quick questions, because I'm running late; quick questions before I close? Student: [Inaudible] square matrix? Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assume A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. It turns out that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low. If you have two male speakers or two female speakers, then it's beyond the state of the art right now to separate them with one microphone. It's a great research problem. Okay. Sorry about running late again. Let's close now, and we'll continue with reinforcement learning next time.