This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue the discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, LSI. Then I want to spend just a little time talking about how to implement PCA, especially for very large problems; in particular, I'll spend a bit of time on the singular value decomposition, or SVD, implementation of principal components analysis. Then in the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways also manages to accomplish very different things than PCA. So this lecture will actually wrap up our discussion of unsupervised learning, and in the next lecture we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal components analysis: I said that in PCA, we imagine that we have some very high-dimensional data that perhaps lies approximately on some low-dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA was to normalize the data to zero mean and unit variance: you subtract out the mean of your training examples, so the data now has zero mean, and then you normalize each of your features so that the variance of each feature is now one. The next step was to compute the covariance matrix Sigma of your zero-mean data; you compute it as the sum of all the outer products x^(i) x^(i) transpose. Then you find the top k eigenvectors of Sigma.
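As a concrete sketch of those three steps (my illustration, not code from the lecture; Python/NumPy with made-up variable names):

```python
import numpy as np

def pca(X, k):
    """Toy PCA sketch. X is an (m, n) matrix with one training example per row."""
    # Step 1: normalize to zero mean and unit variance, feature by feature.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix Sigma (the 1/m scaling doesn't change the eigenvectors).
    Sigma = (X.T @ X) / X.shape[0]
    # Step 3: top-k eigenvectors of Sigma (np.linalg.eigh sorts eigenvalues ascending).
    _, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, ::-1][:, :k]        # columns are the top-k principal components
    return X @ U                       # data projected onto the k-dimensional subspace
```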
So last time we saw applications of this. For example, one of the applications was eigenfaces, where each of your training examples, x^(i), is an image. If you have 100-by-100 images, that is, if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values in each of your 100-by-100 images. So the eigenfaces application was one where the training examples comprised pictures of people's faces. We ran PCA, and then to measure the distance between, say, a face here and a face there, we would project both of the face images onto the subspace and then measure the distance along the subspace. In eigenfaces, you use something like 50 principal components.

The difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000-by-10,000 matrix, which has 100 million entries; that's huge. So we'd like to apply PCA to very, very high-dimensional data like this, with the goal of reducing its dimension, but step two of the algorithm requires constructing this extremely large matrix Sigma, which you can't really do. We'll come back to this in a second.

It turns out that one of the other frequently used applications of PCA is actually to text data. Here's what I mean. Remember our vectorial representation of emails? This is way back from when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or given a text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary.
Somewhere you have the word learn, somewhere you have the word study, and so on. Depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used in, I think, lecture five or lecture six for representing text documents when we were building Naive Bayes classifiers for spam classification. It turns out one of the common applications of PCA is to this text-data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name: latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing step. For various reasons, in LSI you usually don't normalize the data to zero mean, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so it does something very similar to PCA. Normalizing the variance to one for text data would actually be a bad idea, because that would have the effect of dramatically scaling up the weight of rarely occurring words. For example, the word aardvark hardly ever appears in any document, so to normalize the variance of that feature to one, you'd end up dramatically scaling up the weight of the word aardvark, which isn't what you want.

So let's see. In natural language applications, something that we want to do quite often is: given two documents, x^(i) and x^(j), measure how similar they are. For example, I may give you a document and ask you to find me more documents like this one. Maybe you're reading an article about some news event of today and want to find out what other news articles there are about it. So I give you a document and ask you to look through this large set of documents and find the ones similar to it.
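As a small sketch of that binary bag-of-words representation (a toy vocabulary I'm making up for illustration, not from the lecture):

```python
import numpy as np

vocabulary = ["aardvark", "learn", "study", "zygote"]   # toy dictionary

def bag_of_words(document, vocabulary):
    """Binary vector: entry k is 1 if word k appears in the document, else 0."""
    words = set(document.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocabulary])

x_i = bag_of_words("notes on effective study habits", vocabulary)  # [0, 0, 1, 0]
x_j = bag_of_words("good ways to learn", vocabulary)               # [0, 1, 0, 0]
```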
So this is a typical text application. To measure the similarity between two documents x^(i) and x^(j), each document is represented as one of these high-dimensional vectors. One common way to do this is to view each of your documents as a very high-dimensional vector whose dimension is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary. One measure of the similarity between two documents that's often used is the angle between them. In particular, if the angle between the two vectors is small, then we'll consider the two documents to be similar; if the angle between the two vectors is large, then we consider the documents to be dissimilar.

More formally, one commonly used heuristic in natural language processing is to say that the similarity between two documents is the cosine of the angle theta between them. For the range of values we care about, anyway, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product x^(i) transpose x^(j), divided by the product of the norms of the two vectors. That's just the standard linear algebra, or geometry, definition of the cosine of the angle between two vectors.
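Here's a minimal sketch of that cosine similarity measure (again just an illustration in NumPy):

```python
import numpy as np

def cosine_similarity(x_i, x_j):
    """cos(theta) = <x_i, x_j> / (||x_i|| * ||x_j||)."""
    return (x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

x_i = np.array([0., 0., 1., 0.])     # "study" document from the toy sketch above
x_j = np.array([0., 1., 0., 0.])     # "learn" document
print(cosine_similarity(x_i, x_j))   # 0.0: no words in common
```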
Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace, that is, by running PCA on your text data this way, the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing.

Look further at the definition of the cosine similarity measure. The numerator, the similarity between the two documents, was this inner product, which is the sum over k of x^(i)_k times x^(j)_k. So this inner product will be equal to zero if the two documents have no words in common. It's really the sum over k of an indicator of whether documents i and j both contain word k, because x^(i)_k indicates whether document i contains word k, and x^(j)_k indicates whether document j contains word k. So the product is one only if the word k appears in both documents, and therefore the similarity between the two documents is zero if they have no words in common.

For example, suppose your document x^(i) has the word study and the document x^(j) has the word learn. Then these two documents may be considered entirely dissimilar. Maybe you're reading about effective study strategies; sometimes you read a news article about that, and you ask, what other documents are similar to this? Even if there are a bunch of other documents about good methods to learn, there are no words in common, so the similarity comes out as zero.

So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word learn, and on the vertical axis I plot the word study. The values take on either zero or one. If a document contains the word learn but not study, then it'll plot that document there; if a document contains neither the word study nor learn, then it'll plot that at zero, zero. So here's the cartoon behind what PCA is doing: we identify a lower-dimensional subspace, which would be some eigenvector we get out of PCA.
Now, suppose we have a document about learning and a document about studying. The document about learning points to the right; the document about studying points up. So the inner product, or the cosine of the angle, between these two documents would be, excuse me, the inner product between these two documents will be zero. So these two documents are entirely unrelated, which is not what we want: documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then the two documents become much closer together, and the algorithm will recognize that when you take the inner product between them, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

So that's just intuition about what PCA may be doing to text data. The same thing goes for other examples beyond the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians: that will also bring the documents closer together. Any related topics end up with their points closer together in this lower-dimensional space.
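To make that concrete, here's a hedged sketch (my own illustration, not code from the lecture) of comparing two documents after projecting them onto the top directions found by PCA/LSI:

```python
import numpy as np

def lsi_similarity(x_i, x_j, U_k):
    """Cosine similarity in the reduced space.

    U_k is an (n, k) matrix whose columns are the top-k eigenvectors
    (principal directions) found by PCA on the document collection.
    """
    z_i, z_j = U_k.T @ x_i, U_k.T @ x_j      # project each document onto the subspace
    return (z_i @ z_j) / (np.linalg.norm(z_i) * np.linalg.norm(z_j))
```

If study and learn tend to co-occur across the collection, they load on the same principal direction, so the study document and the learn document end up with overlapping projections and a positive similarity, even though their raw inner product was zero.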
Any questions about this? Interviewee: [Inaudible]. Which ones? This one? No, the line. Oh, this one. Oh, yes. Thank you. [Inaudible].

So let's talk about how to actually implement this now. Okay, how many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you, a lot more than I thought. Curious: did you guys learn it as undergrads or as graduate students? All right, let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about this too.

So I'll say a little bit about how to implement PCA. The problem I was alluding to just now was that when you have these very high-dimensional vectors, Sigma is a large matrix. In particular, for our text example, if the vectors x^(i) are 50,000-dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway. It turns out there's another way to implement PCA, which uses the following fact: if A is any m-by-n matrix, then one of the most remarkable results of linear algebra is that A can be decomposed into a singular value decomposition. What that means is that the matrix A, which is m by n, can always be decomposed into a product of three matrices: U, which is m by n; D, a square matrix, which is n by n; and V transpose, which is also n by n. D is going to be diagonal, with zeros on the off-diagonals, and the values sigma_i on its diagonal are called the singular values of the matrix A.

Almost all of you said you learned this as graduate students rather than as undergrads, and it turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions: you usually learn about the QR decomposition, maybe the LU factorization of matrices. Most undergrad courses don't get to talk about singular value decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used an SVD dozens of times in the last year, but for LU and QR decompositions, I think I used the QR decomposition once and an LU decomposition once in the last year. So let's see.
I'll say a bit more about this; I'm going to draw the picture, I guess. For example, if A is an m-by-n matrix, it can be decomposed into another matrix U, which is also m by n (it's the same size as A); D, which is n by n; and another square matrix, V transpose, which is also n by n. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.

To compute it, you just use the svd command in Matlab or Octave. The state of the art in numerical linear algebra is such that SVDs, singular value decompositions of matrices, can be computed extremely reliably. We've been using packages like Matlab or Octave to compute, say, the eigenvectors of a matrix; SVD routines are, if anything, even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed: you can compute the square root of something and just not worry about it, because you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands-by-thousands matrices, I think of the SVD routine like a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is implemented very well today.
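Here's a small check of the properties just stated, written in NumPy rather than Matlab/Octave (illustrative only):

```python
import numpy as np

A = np.random.randn(100, 5)                       # an m-by-n matrix, m > n here
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt, with U m-by-n

# The decomposition reconstructs A.
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Columns of U are eigenvectors of A A^T, columns of V are eigenvectors of A^T A,
# in both cases with eigenvalues equal to the squared singular values.
assert np.allclose(A @ A.T @ U, U * s**2)
assert np.allclose(A.T @ A @ Vt.T, Vt.T * s**2)
```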
[Student:] What's the complexity of SVD? That's a good question. I actually don't know; I want to guess it's roughly on the order of n cubed, but I'm not sure. I don't know exactly what's known about the convergence of these algorithms.

The example I drew out was for a fat matrix, one that's wider than it is tall. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it will decompose it into a product of three matrices like that.

The nice thing about this is that we can use it to compute eigenvectors, and PCA, very efficiently. In particular, the covariance matrix Sigma was this: it was the sum of all the outer products. So if you go back and recall the definition of the design matrix (I think I described this in lecture two, when we derived the closed-form solution to least squares), the design matrix X was the matrix where I took my training examples and stacked them in rows. If you construct the design matrix, then the covariance matrix Sigma can be written as just X transpose X. I hope you see why X transpose X gives you the sum of the outer products of the vectors; if you aren't seeing this right now, just go home and convince yourself that it's true.

So to get the top k eigenvectors of Sigma, you would take Sigma and decompose it using the, excuse me, you would take the matrix X, and you would compute its SVD, so you get U D V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.
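Here's a hedged sketch of that trick in NumPy (not the lecture's code): take the SVD of the design matrix instead of ever forming the 50,000-by-50,000 covariance matrix. Note that with examples stacked as rows, the principal directions come out of the right singular vectors, V, which is exactly the U-versus-V question that comes up in a moment.

```python
import numpy as np

def pca_via_svd(X, k):
    """Top-k principal directions of Sigma = X^T X without forming Sigma.

    X is the (m, n) design matrix, one zero-mean example per row; n can be huge
    (say 50,000) while m is small (say 100).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # compact SVD: Vt has shape (m, n)
    return Vt[:k].T                                   # (n, k): top-k eigenvectors of X^T X
```

The cost here scales with the small dimension m rather than with the 50,000-dimensional feature space, which is the whole point of the trick.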
So in our example, if you have 50,000 words in your dictionary, the design matrix would be in R^(m by 50,000), say 100 by 50,000 if you have 100 examples. So X is quite tractable to represent and to compute the SVD of, whereas the matrix Sigma, which is 50,000 by 50,000, would be much harder to represent. So this gives you an efficient way to implement PCA.

The reason I want to talk about this is that in previous years, when I didn't, I found a number of students in the class projects trying to implement this on huge problems and running into trouble. So this is a much better way to implement PCA if you have extremely high-dimensional data. If you have low-dimensional data, say 50- or 100-dimensional data, then computing Sigma is no problem and you can do it the old way; but otherwise, use the SVD to implement this.

Questions about this?

The last thing I want to say is that in practice, when you want to implement this, I want to add a note of caution. It turns out that for many applications of, let's see, when you apply SVD to these wide, yeah?

[Student:] Just a quick question: are those the top k columns of U or of V? Because X transpose X is V D D V transpose, right?

Let's see. Oh, yes, I think you're right. I think you're right. Let's see, is it the top k columns of U or the top k of V? Yeah, I think you're right. Is that right? Something bothers me about that, but I think you're right: X is U D V transpose, so X transpose X would be V D U transpose U D V transpose, which is V D-squared V transpose. If anyone thinks about this and has another opinion, let me know, but I think you're right; I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know? That sounds right. Okay. Cool.

So, just one last note, a note of caution.
It turns out that in this example, I was implementing SVD with a wide matrix: the matrix X was m by n with n much bigger than m. It turns out that when you find the SVD decomposition of this (and let's see, yeah, I think you're definitely right), the right-most portion of the U matrix in this picture would be all zeros, and when you compute the matrix D, a large part of that matrix would be zeros as well. So depending on what convention you use, for example, I think Matlab actually uses the convention of just cutting off those zero elements: chopping off the right-most portion of the U matrix and chopping off the bottom portion of the D matrix. I'm not sure if this even depends on the version of Matlab, but when you call svd in Matlab or in some other numerical algebra package, there are slightly different conventions for how to define the SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses back onto the original computation. Whether you're calling Matlab or writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices, so just make sure you figure this out for the package that you use.
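As a small illustration of the convention issue (NumPy shown here; Matlab's svd and svd(X,'econ') behave analogously, but check your own library's documentation):

```python
import numpy as np

X = np.random.randn(100, 50000)      # wide matrix: far more columns than rows

# "Economy" / compact SVD: the blocks that would be all zeros are chopped off.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)    # (100, 100) (100,) (100, 50000)

# With full_matrices=True, the returned V factor would instead be 50,000 x 50,000,
# far too large to want, and the extra columns aren't needed for PCA.
```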
Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two-by-two matrix, a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I had a bunch of points z on a line and then these ellipses that I drew around them. That was the factor analysis model, which models the density of x. And then there was PCA, just now. So the difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes and finds the subspace.

So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA. Factor analysis also sort of assumes the data lies on a subspace; let me write "subspace" here. So both of these algorithms assume the data maybe lies close to, or on, some low-dimensional subspace, but fundamentally, factor analysis is a density estimation algorithm: if you have some very high-dimensional distribution and you want to model p(x), then factor analysis is the algorithm I'd be more inclined to use. So even though you could in theory, I would tend to avoid using factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection, if you want to model p(x) so that when an example has very low probability you can flag it as an anomaly, then I would tend to use factor analysis to do that density estimation. So factor analysis and PCA are both algorithms that assume your data lies on or near a subspace.
The other class of algorithms we talked about was algorithms that assume the data lies in clumps, that the data has a few coherent groups. So let me just fill in the rest of this picture. If you think your data lies in clumps or groups and your goal is density estimation, then I would tend to use a mixture-of-Gaussians algorithm. But again, maybe you don't want to endow your data with any probabilistic semantics; if you just want to find the clumps or the groups, then I'd be inclined to use a k-means algorithm. I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain. Hopefully this helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion of PCA. What I want to do next is talk about independent component analysis, or ICA. Yeah?

Interviewee: I have a question about the upper right [inaudible]. So once you have all of the eigenvectors, [inaudible] how similar feature i is to feature j: you pick some eigenvector and take some dot products between feature i and feature j and the eigenvector. But there are a lot of eigenvectors to choose from.

Instructor (Andrew Ng): Right. So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? I'm going to start this up. So the answer is really that, in this cartoon, I would avoid thinking about eigenvectors one at a time. A better way to view this cartoon is that if I decide to choose 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector; this arrow in the cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors.
So what I actually do is project my data onto the span, the linear span of the eigenvectors, and then I measure distances, or take inner products, between the projections of the two points onto that subspace. Okay.

So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA we're going to try to find the independent components of variation in the data. So switch to the laptop there, please; we'll just take a second to motivate that.

This is actually a slide that I showed in lecture one, the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping versions of the two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this. [Recording]

So these are really two speakers, speaking independently of each other. Each speaker is outputting a series of sound signals independent of the other conversation going on in the room. This being an unsupervised learning problem, the question is: can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording] This is the other one. [Recording]

Just for fun (these are audio clips I got from [inaudible]), let me play the other ones as well. This is overlapping microphone one. [Recording] Here's microphone two. [Recording] So given this as input, here's output one. [Recording] It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording] Okay, switch back to the chalkboard, please.
So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you of what they are.

Let's say you have a one-dimensional random variable S, and suppose it has a probability density function p_S(s). Then the CDF is defined as the function F, which is the probability that the random variable S is less than the value given by that lower-case value s. For example, if this is your Gaussian density (the density of the Gaussian is usually denoted lower-case phi; that's roughly a bell-shaped density), then the CDF of the Gaussian will look something like this; it's the capital function Phi. So if I pick a value s like that, then the height of this function is the probability that my Gaussian random variable is less than that value there. In other words, the height of the function at that point equals the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian. So another way to write F of s is as the integral from minus infinity to s of the density, p(t) dt.

Something that will come up later is this: suppose I have a random variable S, and I want to model the distribution of the random variable S. One thing I could do is specify what I think the density is. Or I can specify what the CDF is. These are related by this equation: F is the integral of the density p.
You can also recover the density by taking the CDF and taking its derivative: F prime, the derivative of the CDF, gives you back the density. This will come up in the middle of when I derive ICA: there'll be a step where we need to assume a distribution for the random variable S. I can either specify the density for S directly, or I can specify the CDF, and I choose to specify the CDF. It has to be some function increasing from zero to one, so you can choose any function that looks like that; in particular, pulling a function out of a hat that looks like that, you can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution, the density, of the random variable S. So this will come up later.

Just raise your hand if that is familiar to you, if you've seen that before. Great.
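As a tiny numerical sketch of that relationship (my illustration, not from the lecture): pick a sigmoid as the CDF F, and differentiating it gives back the corresponding density p.

```python
import numpy as np

def sigmoid_cdf(s):
    """A CDF pulled 'out of a hat': it increases from 0 to 1, so it is a valid F."""
    return 1.0 / (1.0 + np.exp(-s))

def density_from_cdf(F, s, eps=1e-5):
    """p(s) = F'(s); here approximated with a centered finite difference."""
    return (F(s + eps) - F(s - eps)) / (2 * eps)

s = 0.7
print(density_from_cdf(sigmoid_cdf, s))       # numerical p(s)
print(sigmoid_cdf(s) * (1 - sigmoid_cdf(s)))  # exact derivative of the sigmoid, for comparison
```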
So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources; say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s^(i)_j to denote the signal from speaker j at time i.

Here's what I mean. What is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in the air. The way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and your ear hears my voice by detecting those changes in air pressure. So what a microphone records, what my mouth is generating, is a pattern of changes in air pressure; I'm going to draw a cartoon of it, I guess. You look at a microphone recording and you see these roughly periodic signals comprising changes in air pressure over time, as the air pressure goes above and below some baseline. So this is what the speech signal looks like, say; this is speaker one. What I'm saying is that at some time t, the value at that point I'm going to denote as s, superscript t, subscript 1. Similarly, speaker two is outputting some sound wave; if we played it, it would actually sound like a single tone, I guess. So in the same way, at the same time t, the value of the air pressure generated by speaker two I'll denote as s^(t)_2.

So we observe x^(i) = A s^(i), where these x^(i)'s are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying; each microphone records some overlapping combination of what the speakers are saying. For example, x^(i)_j, which is what microphone j records at time i, is, by the definition of matrix multiplication, the sum over k of A_jk s^(i)_k. So what my j-th microphone is recording is some linear combination of all of the speakers: at time i, what microphone j records is some linear combination of what all the speakers are saying at time i, where k here indexes over the n speakers.

So our goal is to find the matrix W = A inverse (we're just defining W that way), so that we can recover the original sources as a linear combination of our microphone recordings, s^(i) = W x^(i).
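Here's a hedged sketch of that mixing model and notation (NumPy, with a made-up mixing matrix A purely for illustration): each row of S holds what the n speakers say at one time step, A mixes them into the microphone recordings, and if we knew A, then W = A inverse would recover the sources.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 2                        # m time steps, n speakers (= n microphones)

S = rng.uniform(-1, 1, size=(m, n))   # S[i, j] = s^(i)_j: speaker j's signal at time i
A = np.array([[1.0, 0.5],             # hypothetical mixing matrix (unknown in practice)
              [0.3, 1.0]])

X = S @ A.T                           # x^(i) = A s^(i): what the microphones record

W = np.linalg.inv(A)                  # the unmixing matrix we'd like ICA to find
assert np.allclose(X @ W.T, S)        # s^(i) = W x^(i) recovers the original sources
```

Of course, ICA has to estimate W from X alone, without knowing A; that's what the rest of the derivation is about.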
Just as a point of notation, I'm going to write the matrix W this way: I'm going to use lowercase w subscript one, subscript two, and so on to denote the rows of this matrix W.

Let's see. So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example, let's say that each of my speakers outputs uniform white noise; this will sound like white noise. Can I switch to the laptop display, please? If that's the case, these are my axes, s1 and s2, and this is what my two speakers would be uttering: a plot of what they're uttering will fill in a square box if the two speakers are independently outputting uniform random variables on [minus one, one]. So this is a plot of s1 and s2, my original sources, and this would be a typical sample of what my microphones record; here the axes are x1 and x2. These are images I got from [inaudible] on ICA. Given a picture like this, you can sort of look at this box and tell what the axes of the parallelogram are, and you can figure out what linear transformation would transform the parallelogram back into a box.
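Here is a minimal sketch of that picture in code; the mixing matrix A below is an arbitrary made-up example, chosen only so that the recordings form a visible parallelogram:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two independent sources, each uniform on [-1, 1]: their samples fill a square box.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 2))   # rows are time steps, columns are speakers

# An arbitrary (made-up) mixing matrix A; each microphone hears a linear combination.
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])
X = S @ A.T                                   # x^(i) = A s^(i), written row-wise

# The mixed data fills a parallelogram whose edges hint at the columns of A.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(S[:, 0], S[:, 1], s=2); ax1.set_title("sources s1, s2 (square)")
ax2.scatter(X[:, 0], X[:, 1], s=2); ax2.set_title("recordings x1, x2 (parallelogram)")
plt.show()
```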
So it turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and you may end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: if you reflect the picture across a 45-degree axis, you still get a box. So there's no way for the algorithm to tell which was speaker No. 1 and which was speaker No. 2; the numbering, or the ordering, of the speakers is ambiguous.

The other source of ambiguity, and these are the only ambiguities in this example, is the sign of the sources. Given my microphone recordings, you can't tell whether you got back a positive s^(i) or a negative s^(i). In this picture, what that corresponds to is that if you reflect the picture along the vertical axis, or along the horizontal axis, you still get a box, so it could equally well have been the speakers with their signs flipped. So in this example, you can't guarantee that you've recovered a positive s^(i) rather than a negative s^(i).

It turns out these are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. For the permutation of the speakers, there's not much you can do about that. And it turns out that if you take an audio source, flip its sign, and play back negative s, it sounds indistinguishable. So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about it.
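To state the two ambiguities as equations (this just restates the argument above): for any permutation matrix $P$ (so $P^{-1} = P^{\top}$) and any diagonal matrix $D$ with entries $\pm 1$ (so $D^{-1} = D$),

$$x^{(i)} = A\,s^{(i)} = (A P^{\top})(P\,s^{(i)}) = (A D)(D\,s^{(i)}),$$

so the permuted or sign-flipped sources, paired with the correspondingly modified mixing matrix, explain the recordings exactly as well as the originals.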
Let's switch back to the chalkboard, please. It turns out, and I don't want to spend too much time on this, but I do want to say it briefly, that the reason those are the only sources of ambiguity, the permutation of the speakers and the signs, is that the s_j^(i)'s were non-Gaussian. I'll say this briefly. Suppose my original sources, s1 and s2, were Gaussian; so suppose s^(i) is Gaussian with mean zero and identity covariance. That just means that each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data.

You will recall that the contours of a Gaussian distribution with identity covariance look like this, right? The Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables, then if I observe a linear combination of them, there's actually no way to recover the original distribution, because there's no way for me to tell whether the axes are at this angle or at that angle and so on. The Gaussian is a rotationally symmetric distribution, so I would not be able to recover the orientation, the rotation, of this. I don't want to dwell on this too much, but it turns out that if your sources are Gaussian, then it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have, even if you had infinitely large amounts of data, you would not be able to recover the matrix A or W.

Let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Let S be a vector-valued random variable in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of S is p subscript S. My microphone recording is X equals A S, which equals W inverse S; equivalently, S equals W times X. So let's think about what the density of X is. I have p of S, I know the density of S, and X is a linear transformation of S, so let's figure out the density of X. One thing we could try is to just figure out what S is and apply the density of S to it: S is W times X, so you might guess that the density of X is p_S applied to W times X. But this is wrong.
It turns out you can do this for probability mass functions, but not for continuous densities. In particular, it's not correct to say that the density of X is obtained by just figuring out what S is and then applying the density of S to that value; you can't do that with densities, because p_S is a probability density function. The right formula is the density of S applied to W times X, times the determinant of the matrix W.

Let me illustrate that with an example. Let's say the density for S is that: in this example, S is uniform over the unit interval, so the density for S is just the density of the uniform distribution on [0, 1]. Now let X be equal to two times S, so this means A equals two and W equals one half. If S is uniform over [0, 1], then X, which is two times that, will be uniform over the range from zero to two. So the density for X will be, that's one, that's two, that's one half, and that's one: the density for X is the indicator that x lies between zero and two, times W, which is one half. Does that make sense? That is the correct density for X, because X is now spread out across a wider range, so the density of X is smaller, and that's why the density of X picks up this one half term here. Okay? This is an illustration for the case of one-dimensional random variables, S and X in one dimension. I'm not going to show it, but the generalization to vector-valued random variables is that the density of X is given by the density of S evaluated at W times X, times the determinant of the matrix W; over here I showed the one-dimensional case, and that's the generalization.
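Written out, the transformation rule and the worked example are (using the absolute value of the determinant to be careful about sign):

$$p_X(x) = p_S(Wx)\,\lvert \det W \rvert, \qquad W = A^{-1}.$$

In the one-dimensional example, $p_S(s) = \mathbf{1}\{0 \le s \le 1\}$, $x = 2s$, and $W = \tfrac{1}{2}$, so

$$p_X(x) = \mathbf{1}\{0 \le \tfrac{x}{2} \le 1\}\cdot\tfrac{1}{2} = \mathbf{1}\{0 \le x \le 2\}\cdot\tfrac{1}{2}.$$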
So we're nearly there. Here's how I can implement ICA. For my distribution on S, I'm going to assume that the density on S is given by a product over the n speakers of the density of speaker i emitting a certain sound. This is a product of densities, a product of distributions, because I'm going to assume the speakers are having independent conversations, so the s_i's are independent for different values of i. By the formula we just worked out, the density for X would then be equal to that. I'll just remind you, W was A inverse; it was the matrix I defined previously, so that s_i equals w_i transpose times x. So that's what's in there.
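Putting the independence assumption together with the transformation rule from before, the density written on the board has this form, where $w_i^{\top}$ denotes the i-th row of W:

$$p_S(s) = \prod_{i=1}^{n} p_s(s_i), \qquad p_X(x) = \left(\prod_{i=1}^{n} p_s\!\left(w_i^{\top} x\right)\right)\lvert \det W \rvert.$$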
To complete my formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. Following the discussion from earlier, one thing I could do is choose the density for S, or equivalently, I could choose the CDF, the cumulative distribution function, for S. In this case I'm going to choose a CDF, probably for historical reasons and for convenience. To choose the CDF for S, I just need to choose some function that increases from zero to one. I know I can't choose a Gaussian, because we know you can't do ICA on Gaussian data, so I need some function increasing from zero to one that is not the cumulative distribution function of a Gaussian distribution. So what other functions do I know that increase from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice, made for convenience; there's actually no great reason for choosing the sigmoid function. It's just a convenient function that we all know and are familiar with, and that happens to increase from zero to one. When you take the derivative of the sigmoid, that gives you back your density. This is just not Gaussian, and that's the main virtue of choosing the sigmoid. So there's really no deep rationale for the choice of the sigmoid; lots of other things will work fine, too. It's just a common, reasonable default.

It turns out that one reason the sigmoid works well for a lot of data sources is that, if this is the Gaussian, and you take the sigmoid and take its derivative, you find that the resulting density has heavier tails than the Gaussian. By this I mean that the density you get from the sigmoid dies down to zero much more slowly than the Gaussian: the magnitude of the Gaussian's tails dies down as e to the minus s squared, whereas for the sigmoid the tails look like e to the minus s. So the tails die down as e to the minus s rather than e to the minus s squared. It turns out that most distributions with this property, with heavy tails, where the distribution decays to zero relatively slowly compared to the Gaussian, will work fine for your data. Actually, one other choice you can sometimes use is what's called the Laplacian distribution, which is that. This will work fine, too, for many data sources.
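Concretely, the density obtained by differentiating the sigmoid and the Laplacian density both have exponential tails, while the Gaussian falls off much faster:

$$g'(s) = g(s)\,\bigl(1 - g(s)\bigr) \sim e^{-\lvert s\rvert} \ \text{as } \lvert s\rvert \to \infty, \qquad p_{\text{Laplace}}(s) = \tfrac{1}{2} e^{-\lvert s\rvert}, \qquad p_{\text{Gauss}}(s) = \tfrac{1}{\sqrt{2\pi}}\, e^{-s^2/2}.$$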
Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set, and as you can see this is an unlabeled training set, I can write down the log likelihood of my parameters: that's a sum over my training examples of the log of that quantity. So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent, which is just that. It turns out, if you work through the math, if p of s is equal to the derivative of the sigmoid, then when you work through the math to compute the gradient there, and you've all done this kind of calculation a lot of times, I won't bother to show the details; you find that it is equal to this. Okay? You can work that out yourself; it's just math, computing the derivative of this with respect to W. So to summarize: given the training set, here's my stochastic gradient ascent update rule, so you run stochastic gradient ascent to learn the parameters W. After you're done, you then output s^(i) equals W x^(i), and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity.
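Here is a minimal sketch of those two steps in code, assuming the sigmoid CDF choice above; the learning rate, iteration count, and identity initialization are arbitrary illustrative choices, and the gradient expression in the comment is what you get by differentiating the log likelihood for this model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_iters=100, seed=0):
    """Sketch of ICA by stochastic gradient ascent, assuming the sigmoid CDF g.

    Log likelihood:  l(W) = sum_i [ sum_j log g'(w_j^T x^(i)) + log |det W| ].
    X : (m, n) array with one microphone recording x^(i) per row.
    Returns the unmixing matrix W and the recovered sources S (rows are s^(i)).
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                      # initialize the unmixing matrix

    for _ in range(n_iters):
        for i in rng.permutation(m):   # visit training examples in random order
            x = X[i].reshape(n, 1)     # column vector x^(i)
            # Gradient of log p(x^(i); W) under the sigmoid choice:
            #   (1 - 2 g(W x)) x^T + (W^T)^{-1}
            grad = (1.0 - 2.0 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T)
            W = W + alpha * grad       # stochastic gradient ascent step

    S = X @ W.T                        # recovered sources, s^(i) = W x^(i), row-wise
    return W, S
```

After training, the rows of W play the role of the w_i's above, and the recovered sources are determined only up to the permutation and sign ambiguities discussed earlier.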
Okay? So just switch back to the laptop, please. We'll wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of neural recording data, such as EEG data. This is a picture of an EEG cap, in which there are a number of electrodes you place, in this case, on Katie's scalp, where each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures just changes in voltage over time; the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to show these very synchronized recordings. It turns out that when we look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats and from human eye blinks and so on. So the cartoon is: if you imagine placing the electrodes, or microphones, on my scalp, then each microphone is recording some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body has various processes going on as well, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find the independent components, the independent processes going on in my brain. This is an example of running ICA. You find that a small number of components, like those shown up there where the arrows are, correspond to the heartbeat; those are very periodic signals that come on occasionally and correspond to components of the heartbeat. You also find things like an eye blink component, corresponding to a signal generated when you blink your eyes. By doing this, you can then subtract out the heartbeat and the eye blink artifacts from the data, and you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step and a common application of ICA.

Another example is an application, again from [inaudible], of running ICA on small natural image patches. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the bases you get; this is a plot of the sources you get. The algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or a linear combination, of independent sources of the things that make up images. So this models natural images as generated by independent objects that generate different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out this is similar, in many ways, to computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges come out as the independent components of images.
Are there any quick questions, because I'm running late; quick questions before I close? Student: [Inaudible] square matrix? Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assume A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. It turns out that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low. If you have two male speakers or two female speakers, then it's beyond the state of the art right now to separate them with one microphone. It's a great research problem. Okay. Sorry about running late again. Let's close now, and we'll continue with reinforcement learning next time.