This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue the discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, LSI. Then I want to spend just a little time talking about how to implement PCA, especially for very large problems. In particular, I'll spend a little bit of time talking about singular value decomposition, or the SVD implementation of principal component analysis. Then in the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways accomplishes very different things than PCA. So with this lecture, we'll actually wrap up our discussion of unsupervised learning. In the next lecture, we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal component analysis: I said that in PCA, we imagine that we have some very high dimensional data that perhaps lies approximately on some low dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second principal component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA was to normalize the data to zero mean and unit variance: you subtract out the mean of your training examples, so the data now has zero mean, and then normalize each of your features so that the variance of each feature is one. The next step was to compute the covariance matrix of your zero-mean data, Sigma = sum over i of x^(i) x^(i)^T, the sum of all the outer products. Then you find the top k eigenvectors of Sigma.

Last time we saw applications of this. For example, one of the applications was to eigenfaces, where each of your training examples x^(i) is an image.
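Before going on with the eigenfaces example, here is a minimal NumPy sketch of those three steps (my own illustrative code, not from the lecture; the function name and variables are made up):

import numpy as np

def pca_top_k(X_raw, k):
    # Step 1: normalize each feature to zero mean and unit variance.
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
    # Step 2: covariance matrix Sigma = sum_i x^(i) x^(i)^T, an n-by-n matrix.
    # (This is the naive version; for very large n it becomes too big, which
    # is exactly the problem discussed next and solved later with the SVD.)
    Sigma = X.T @ X
    # Step 3: top-k eigenvectors of Sigma (eigh returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    return eigvecs[:, ::-1][:, :k]   # columns are the top-k principal directions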
So if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values in each of your 100 by 100 images. The eigenfaces application was one where the training examples comprised pictures of faces of people. Then we ran PCA, and to measure the distance between, say, a face here and a face there, we would project both of the face images onto the subspace and then measure the distance along the subspace. In eigenfaces, you might use something like 50 principal components.

The difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000 by 10,000 matrix, which is huge; it has 100 million entries. So we'd like to apply PCA to very, very high dimensional data, with the goal of reducing its dimension, but step two of this algorithm requires constructing this extremely large matrix, which you can't really do. I'll come back to this in a second.

It turns out one of the other frequently used applications of PCA is to text data. Here's what I mean. Remember our vectorial representation of emails? This is from way back when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or a piece of text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary - somewhere you have the word "learn", somewhere you have the word "study", and so on - and depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used back in lecture five or six for representing text documents when we were building Naive Bayes-based classifiers for spam.
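As a tiny illustration of that representation (the five-word dictionary and the document here are made up; a real dictionary would have tens of thousands of entries):

# Hypothetical 5-word dictionary; real ones have ~50,000 entries.
vocab = ["aardvark", "learn", "study", "spam", "zygote"]
doc = "students study hard and learn quickly"
words = set(doc.split())
x = [1 if w in words else 0 for w in vocab]
print(x)   # [0, 1, 1, 0, 0]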
So it turns out one of the common applications of PCA is to this text data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name: latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing step. For various reasons, in LSI you usually don't normalize the mean of the data to zero, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so it does something very similar to PCA. Normalizing the variance to one for text data would actually be a bad idea, because that would have the effect of dramatically scaling up the weight of rarely occurring words. For example, the word "aardvark" hardly ever appears in any document, so normalizing the variance of that feature to one would scale up the weight of the word aardvark dramatically, which is probably not what you want. (A small numerical illustration of this follows below.)

So let's see. In natural language applications, something that we want to do quite often is, given two documents x^(i) and x^(j), measure how similar they are. For example, I may give you a document and ask you to find me more documents like this one - you're reading some article about some news event of today and want to find what other related news articles there are. So I give you a document and ask you to look at all the other documents you have in this large set of documents and find the ones similar to it.

This is a typical text application. So to measure the similarity between two documents x^(i) and x^(j), each of these documents is represented as one of these high-dimensional vectors. One common way to do this is to view each of your documents as a very high-dimensional vector whose dimension is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary.
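Here is the small numerical illustration promised above of why variance normalization would blow up rare words (the document frequencies are made up): a binary word indicator that equals one in a fraction p of documents has standard deviation sqrt(p(1-p)), so dividing by it multiplies the feature by roughly 1/sqrt(p).

import numpy as np

p_aardvark, p_learn = 0.0001, 0.2       # hypothetical document frequencies
for p in (p_aardvark, p_learn):
    std = np.sqrt(p * (1 - p))
    print(f"p={p}: feature gets multiplied by {1/std:.1f}")
# The rare word's weight gets scaled up by ~100x, the common one's by only ~2.5x.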
One natural measure of the similarity between these two documents that's often used is the angle between them. In particular, if the angle between these two vectors is small, then we'll consider the two documents to be similar; if the angle between these two vectors is large, then we consider the documents to be dissimilar. So more formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. For the relevant range of values, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product x^(i)T x^(j) divided by the product of the norms ||x^(i)|| ||x^(j)||; that's just the standard linear algebra or geometry definition of the cosine between two vectors.

Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace - by running PCA on your text data this way - the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing.

Look further at the definition of the cosine similarity measure. The numerator of the similarity between the two documents was this inner product, which is the sum over k of x^(i)_k x^(j)_k. This inner product will be equal to zero if the two documents have no words in common. It's really the sum over k of an indicator of whether documents i and j both contain the word k, because x^(i)_k indicates whether document i contains word k, and x^(j)_k indicates whether document j contains word k, so the product is one only if the word k appears in both documents. Therefore, the similarity between these two documents would be zero if the two documents have no words in common.
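Here is a minimal sketch of that cosine similarity on two made-up binary document vectors:

import numpy as np

def cosine_sim(xi, xj):
    # cos(theta) = xi . xj / (||xi|| ||xj||)
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

xi = np.array([1, 0, 1, 1, 0], dtype=float)   # words present in document i
xj = np.array([1, 1, 0, 1, 0], dtype=float)   # words present in document j
print(cosine_sim(xi, xj))   # 2 shared words out of 3 each -> 2/3 ≈ 0.667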
For example, suppose your document x^(i) has the word "study" and your document x^(j) has the word "learn". Then these two documents may be considered entirely dissimilar. Say x^(i) is about effective study strategies - sometimes you read a news article about that - and you ask, what other documents are similar to this? Even if there are a bunch of other documents about good methods to learn, they may share no words with it, so the measured similarity would be zero.

So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word "learn", and on the vertical axis I plot the word "study". The values take on either zero or one. So if a document contains the word learn but not study, I'll plot that document there; if a document contains neither the word study nor learn, I'll plot it at (0, 0).

Here's the cartoon behind what PCA is doing: we identify a lower-dimensional subspace - that would be some eigenvector we get out of PCA. Now, suppose we have a document about learning and a document about studying. The document about learning points to the right; the document about studying points up. So the inner product - the cosine of the angle - between these two documents will be zero. So these two documents are entirely unrelated, which is not what we want: documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then these two documents become much closer together, and the algorithm will recognize that: when you take the inner product between the two projected documents, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

So that's just intuition about what PCA may be doing to text data. The same thing goes for examples other than the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians.
That will also bring the documents closer together. For any related topics, their points end up closer together in the lower-dimensional space.

Questions about this? Student: [Inaudible]. Which ones? This one? No, the line. Oh, this one. Oh, yes. Thank you. [Inaudible].

So let's talk about how to actually implement this now. Okay, how many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you - a lot more than I thought. Curious: did you guys learn it as undergrads or as graduate students? All right. Let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about this, too.

So I'll say a little bit about how to implement PCA. The problem I was alluding to just now was that when you have these very high-dimensional vectors, Sigma is a large matrix. In particular, for our text example, if the vectors x^(i) are 50,000 dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway: it turns out there's another way to implement PCA. If A is any n-by-n matrix, then one of the most remarkable results of linear algebra is that A can be decomposed into a singular value decomposition. What that means is that the matrix A, which is n by n, can always be decomposed into a product of three matrices, A = U D V^T, where U is n by n, D is a square matrix which is n by n, and V is also n by n. D is diagonal, with zeros on the off-diagonals, and the values sigma_i on its diagonal are called the singular values of the matrix A.

Almost all of you said you learned this as a graduate student rather than as an undergrad. It turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions - you usually learn about the QR decomposition, maybe the LU factorization of matrices.
Most undergrad courses don't get to talk about singular value decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used an SVD dozens of times in the last year, but for LU and QR decompositions, I think I used a QR decomposition once and an LU decomposition maybe once in the last year.

So let me say a bit more about this. I'm going to draw the picture, I guess. For example, if A is an n-by-n matrix, it can be decomposed into another matrix U, which is also n by n - the same size - times D, which is n by n, times another square matrix V transpose, which is also n by n. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A^T, and the columns of V will be the eigenvectors of A^T A.

To compute it, you just use the SVD command in Matlab or Octave. Today, the state of the art in numerical linear algebra is such that singular value decompositions of matrices can be computed extremely reliably and efficiently. We've used packages like Matlab or Octave to compute, say, the eigenvectors of a matrix; SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed: you compute the square root of something and just don't worry about it, because you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands by thousands, the SVD routine - I think of it like a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is implemented very well today.

Student: [Inaudible] What's the complexity of SVD? Instructor (Andrew Ng): That's a good question.
I actually don't know. I want to guess it's roughly on the order of n-cubed, but I'm not sure; these are numerical algorithms, and I don't know exactly what's known about their convergence.

The example I drew out was for a fat matrix - one that's wider than it is tall. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it decomposes into a product of three matrices like that.

The nice thing about this is that we can use it to compute eigenvectors and PCA very efficiently. In particular, the covariance matrix Sigma was the sum of all the outer products, Sigma = sum over i of x^(i) x^(i)^T. If you go back and recall the definition of the design matrix - I think I described this in lecture two, when we derived the closed-form solution to least squares - the design matrix X was the matrix where I took my examples x^(i) and stacked them up in rows. If you construct the design matrix that way, then the covariance matrix Sigma can be written as just X transpose X. I hope you see why X transpose X gives you the sum of outer products of vectors; if you aren't seeing it right now, just go home and convince yourself that it's true.

So to get the top k eigenvectors of Sigma, you would take the matrix X and compute its SVD, so you get U S V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.

In our example, the design matrix is quite manageable: if you have 50,000 words in your dictionary, then the design matrix would be in R^(m by 50,000) - say 100 by 50,000, if you have 100 examples. So X would be quite tractable to represent and to compute the SVD of, whereas the matrix Sigma would be much harder to represent.
Sigma is 50,000 by 50,000. So this gives you an efficient way to implement PCA. The reason I want to talk about this is that in previous years I didn't, and in the class projects I found a number of students trying to implement this directly on huge problems and running into trouble. So this is a much better way to implement PCA if you have extremely high dimensional data. If you have low dimensional data - 50 or 100 dimensional data - then computing Sigma is no problem, and you can do it the old way; but otherwise, use the SVD to implement this.

Questions about this?

The last thing I want to say is that in practice, when you want to implement this, I want to give a note of caution. It turns out that for many applications - let's see. When you apply SVD to these wide - yeah?

Student: Just a quick question: is it the top k columns of U or of V? Because X transpose X works out in terms of V, right?

Instructor (Andrew Ng): Let's see. Oh, yes, I think you're right. Is it the top k columns of U or the top k of V? Yeah, I think you're right. Something bothers me about that, but I think you're right: X is U D V^T, so X transpose X would be V D^T U^T U D V^T, which is V D^2 V^T, so the eigenvectors of X transpose X should be the columns of V. If anyone thinks about this and has another opinion, let me know, but I think you're right. I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know? That sounds right. Okay, cool.

So just one last note, a note of caution. In this example, I was computing the SVD of a wide matrix: the matrix X was m by n with m much smaller than n. It turns out that when you find the SVD decomposition of such a matrix, a large part of the matrix D would be zeros - the right-most portion of that block is all zeros. So depending on what convention you use - for example, I think Matlab actually uses a convention of just cutting off the zero elements.
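Putting this together, here is a minimal NumPy sketch of PCA via the SVD of the design matrix (the sizes and the random data are illustrative stand-ins). Consistent with the correction above, the principal directions come out as the columns of V, which NumPy returns as the rows of Vt; the full_matrices flag is NumPy's version of the convention issue just mentioned:

import numpy as np

m, n, k = 100, 50_000, 50          # 100 documents, 50,000-word dictionary, top 50 directions
X = np.random.randn(m, n)          # stand-in for the (normalized) design matrix, examples in rows

# Economy SVD: U is m x m, s holds the m singular values, Vt is m x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
top_k_directions = Vt[:k].T        # n x k: top-k eigenvectors of Sigma = X^T X

# With full_matrices=True, Vt would instead be n x n, padded with directions
# corresponding to zero singular values; other packages make other choices,
# so check your library's convention before relying on the shapes.
print(U.shape, s.shape, Vt.shape)  # (100, 100) (100,) (100, 50000)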
So Matlab, for example, uses a convention of just dropping those all-zero blocks, returning a smaller D along with correspondingly fewer columns of U or V. I'm not even sure whether this depends on the version of Matlab, but when you call SVD in Matlab or some other numerical algebra package, there are slightly different conventions for how to define the SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses onto the original computations. Whether you're calling Matlab or you're writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices, so just make sure you figure this out for the package that you use.

Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to the questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two-by-two matrix - a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I would have a bunch of points z on a line, and then I had these ellipses that I drew. That was the factor analysis model, which models the density of x. And then there was PCA, which we just talked about. So the difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes and finds the subspace.

So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA.
Factor analysis sort of assumes the data lies on a subspace - let me write "subspace" here. So both of these algorithms assume the data lies close to or on some low dimensional subspace. But fundamentally, factor analysis - I think of it as a density estimation algorithm. So if I have some very high dimensional distribution and I want to model P(x), then factor analysis is the algorithm I'm more inclined to use. Even though you could in theory, I would tend to avoid using factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection - you want to model P(x) so that an example with very low probability can be flagged as an anomaly - then I would tend to use factor analysis to do that density estimation. So factor analysis and PCA are both algorithms that assume your data lies near a subspace.

The other class of algorithms we talked about was algorithms that assume the data lies in clumps, or that the data falls into a few coherent groups. So let me just fill in the rest of this picture. If you think your data lies in clumps or groups and your goal is density estimation, then I would tend to use a mixture of Gaussians algorithm. But again, if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use a k-means type algorithm. I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain, and hopefully it helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion of PCA. What I want to do next is talk about independent component analysis, or ICA. Yes?

Student: I have a question about the upper right [inaudible]. So once you have all of the eigenvectors, and you want to know how similar feature i is to feature j, you pick some eigenvector and you take dot products between feature i and feature j and that eigenvector. But there's a lot of eigenvectors to choose from.

Instructor (Andrew Ng): Right.
So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? The answer is that in this cartoon, I would avoid thinking about eigenvectors one at a time. A better way to view this cartoon is that if I decide to choose, say, 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector; this arrow in the cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors. What I actually do is project my data onto the span - the linear span of the eigenvectors - and then I measure distances or take inner products between the projections of the two points onto that subspace. Okay?

So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA we're going to try to find the independent components of variation in the data. Could we switch to the laptop display, please? We'll just take a second to motivate that.

This is actually a slide that I showed in lecture one, of the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping versions of the two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this. [Recording]

So these are really two speakers, speaking independently of each other - each speaker is outputting a series of sound signals independent of the other conversation going on in the room. This being an unsupervised learning problem, the question is: can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording] This is the other one. [Recording]

Just for fun - these are audio clips I got from [inaudible]. Just for fun, let me play the other ones as well.
This is overlapping microphone one. [Recording] Here's microphone two. [Recording] So given this as input, here's output one. [Recording] It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording]

Okay, switch back to the chalkboard, please. So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you.

Let's say you have a one-dimensional random variable S, and suppose it has a probability density function p_S(s). Then the CDF is defined as the function F(s), which is the probability that the random variable S is less than the value given by that lowercase s. For example, if this is your Gaussian density - the density of the Gaussian is usually denoted lowercase phi, and it's roughly a bell-shaped density - then the CDF of the Gaussian will look something like this; it's a function usually written as capital Phi. So if I pick a value s like that, then the height of the CDF at that point is the probability that my Gaussian random variable is less than that value; in other words, the height of the function at that point equals the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian.

So another way to write F(s) is as the integral of the density: F(s) = integral from minus infinity to s of p(t) dt.

Something that will come up later is: suppose I have a random variable S, and I want to model the distribution of S. One thing I could do is specify what I think the density is, or I can specify what the CDF is. These are related by this equation: F is the integral of p_S.
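As a quick numerical check of that relationship (purely illustrative): the closed-form Gaussian CDF value Phi(s) matches a numerical integral of the Gaussian density up to s.

import numpy as np
from math import erf, sqrt, pi

s = 0.7
t = np.linspace(-8.0, s, 200_001)
density = np.exp(-t**2 / 2) / sqrt(2 * pi)       # standard Gaussian density phi(t)
area = float(np.sum(density) * (t[1] - t[0]))    # numerical integral of phi up to s
Phi = 0.5 * (1 + erf(s / sqrt(2)))               # closed-form Gaussian CDF at s
print(round(area, 4), round(Phi, 4))             # both ≈ 0.758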
You can also recover the density from the CDF by taking the derivative: F prime, the derivative of the CDF, gives you back the density. The reason this will come up in the middle of the ICA derivation is that there'll be a step where we need to assume a distribution for a random variable S. I can either specify the density for S directly, or I can specify the CDF; I'll choose to specify the CDF. It has to be some function increasing from zero to one, so you can choose any function that looks like that - and in particular, pulling functions out of a hat, you can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution of the random variable S. So this will come up later.

Just raise your hand if that's familiar to you, if you've seen it before. Great.

So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources - say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s^(i)_j to denote the signal from speaker j at time i. Here's what I mean. What is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in air. The way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and your ear is hearing my voice by detecting those changes in air pressure. So what a microphone records - what my mouth is generating - is a pattern of changes in air pressure. I'm going to draw a cartoon, I guess. If you look at a microphone recording, you see these roughly periodic signals made up of changes in air pressure over time, as the air pressure goes above and below some baseline. So this is what a speech signal looks like, say. This is speaker one, and here is some time t.
The value at that point I'm going to denote as s superscript (t), subscript 1 - that is, s^(t)_1. Similarly, speaker two is outputting some other sound wave; if you were to play it, it might actually sound like a single tone, I guess. In the same way, at the same time t, the value of the air pressure generated by speaker two I'll denote as s^(t)_2.

So we observe x^(i) = A s^(i), where these x^(i) are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying - some overlapping combination of what the speakers are saying. For example, x^(i)_j, which is what microphone j records at time i, is, by the definition of matrix multiplication, the sum over k of A_jk s^(i)_k. So what my j-th microphone is recording at time i is some linear combination of what all the speakers are saying at time i, where k here indexes over the n speakers.

So our goal is to find the matrix W = A inverse - we just define W that way - so that we can recover the original sources as a linear transformation of our microphone recordings: s^(i) = W x^(i). Just as a point of notation, I'm going to write the matrix W in terms of its rows: I'll use lowercase w_1^T, w_2^T and so on to denote the rows of this matrix W.

Let's see. So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example - can I switch to the laptop display, please? - let's say that each of my speakers outputs uniform white noise. If that's the case, these are my axes, s1 and s2, and this is what my two speakers would be uttering.
A plot of what they're uttering will fill out a square box if the two speakers are independently outputting uniform [-1, +1] random variables. So this is a plot of s1 and s2, my original sources, and this would be a typical sample of what my microphones record; here the axes are x1 and x2. (These are images I got from [inaudible] on ICA.)

Given a picture like this, you can sort of look at it and tell what the axes of the parallelogram are, and you can figure out what linear transformation would transform the parallelogram back into a box.

It turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and you may end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: reflect the picture across the 45-degree line and you still get a box. So there's no way for the algorithm to tell which was speaker number one and which was speaker number two; the numbering, or the ordering, of the speakers is ambiguous.

The other source of ambiguity - and these are the only ambiguities in this example - is the sign of the sources. Given my microphone recordings, you can't tell whether you recovered a positive s_i or a negative s_i. In this picture, that corresponds to reflecting the picture along the vertical axis or along the horizontal axis: you still get a box, and you still get back a valid pair of speakers. So in this example, you can't guarantee that you've recovered positive s_i rather than negative s_i.

It turns out these are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. There's not much you can do about the permutation of the speakers, and it turns out that if you take an audio source, flip its sign to get negative s, and play that back, it'll sound indistinguishable.
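To make the mixing model and these two ambiguities concrete, here is a small sketch (the mixing matrix A and the uniform sources are made up for illustration): unmixing with W = A inverse recovers the sources exactly, and permuting or sign-flipping the rows of W yields sources that look just as valid.

import numpy as np

rng = np.random.default_rng(0)
m = 1000
S = rng.uniform(-1.0, 1.0, size=(m, 2))        # two independent uniform sources (rows = times)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                     # hypothetical mixing matrix
X = S @ A.T                                    # microphone recordings: x^(i) = A s^(i)

W = np.linalg.inv(A)
S_rec = X @ W.T                                # recovers the sources exactly
print(np.allclose(S_rec, S))                   # True

# Permutation and sign ambiguity: this W explains the data just as well,
# but returns the speakers swapped and negated.
W_flipped = np.array([[0.0, -1.0], [-1.0, 0.0]]) @ W
S_alt = X @ W_flipped.T
print(np.allclose(S_alt, -S[:, ::-1]))         # True

Since swapped, negated independent uniform sources are statistically identical to the originals, no algorithm looking only at X could prefer W over W_flipped.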
So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about that. Let's switch back to the chalkboard, please.

It turns out - and I don't want to spend too much time on this, but I do want to say it briefly - that the reason those are the only sources of ambiguity (the permutation of the speakers and the signs) is that the s^(i)_j were non-Gaussian. Suppose instead my original sources, s1 and s2, were Gaussian: suppose s^(i) is Gaussian with mean zero and identity covariance. That just means that each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data. You'll recall that the contours of a Gaussian distribution with identity covariance look like circles - the Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables and I observe a linear combination of them, there's actually no way to recover the original distribution, because there's no way for me to tell whether the axes are at this angle or at that angle, and so on. The Gaussian is a rotationally symmetric distribution, so I would not be able to recover the orientation, the rotation, of the data. I don't want to dwell on this too much, but it turns out that if your sources are Gaussian, then it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have - even if you had infinitely large amounts of data - you would not be able to recover the matrix A or W.

So let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Suppose I have a random variable S in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of S is p_S.
So my microphone recordings are X equals AS, where A equals W inverse; equivalently, S equals W times X. So let's think about what the density of X is. I have P of S; I know the density of S, and X is a linear combination of the S's, so let's figure out the density of X. One thing we could try is to just figure out what S is and apply the density of S to that. Let's see: this would be the probability of S, so we just figure out what S is. S is W times X, so you might guess that the density of X is P of S applied to W times X. But this is wrong. It turns out you can do this for probability mass functions, but not for continuous densities. In particular, it's not correct to say that the density of X is just: you figure out what S is, and then apply the density of S to that. You can't do this with densities, because P of S is a probability density function. The right formula is the density of S applied to W times X, times the determinant of the matrix W.

Let me just illustrate that with an example. Let's say the density for S is the uniform density on the unit interval, so S is uniform over [0, 1]; the density for S is just the density of the uniform distribution on [0, 1]. Now let X be equal to two times S, so this means A equals two and W equals one half. If S is a uniform distribution over [0, 1], then X, which is two times that, will be the uniform distribution over the range from zero to two. So the density for X will be one half on the interval from zero to two: it's the indicator of zero less than or equal to X less than or equal to two, times W, which is one half. Does that make sense? When you compute the density for X, because X is now spread out across a wider range, the density of X is smaller, and that's why the density of X has this one half term here. Okay? This was an illustration for the case of one-dimensional random variables, where S and X are one-dimensional.
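Written out, the change-of-variables rule used here, together with the one-dimensional example, is the following (a restatement of the board work, using W = A inverse):

```latex
\[
p_X(x) = p_S(Wx)\,\lvert \det W \rvert .
\]
\[
\text{Example: } s \sim \mathrm{Uniform}[0,1],\; x = 2s,\; W = \tfrac{1}{2}:
\qquad
p_X(x) = p_S\!\bigl(\tfrac{x}{2}\bigr)\cdot\tfrac{1}{2}
       = \mathbf{1}\{0 \le x \le 2\}\cdot\tfrac{1}{2},
\]
```

which is exactly the Uniform[0, 2] density from the example.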
I'm not going to show it, but the generalization of this to vector-valued random variables is that the density of X is given by the density of S evaluated at W times X, times the determinant of the matrix W. What I showed over here was the one-dimensional version of this generalization.

So we're nearly there. Here's how I can implement ICA. For my distribution on S, I'm going to assume that the density on S is given by a product over the n speakers of the density of speaker i emitting a certain sound. This is a product of densities, a product of distributions, because I'm going to assume that the speakers are having independent conversations, so the SI's are independent for different values of i. So by the formula we just worked out, the density for X would be equal to that. I'll just remind you, W was A inverse; it was the matrix I defined previously so that SI equals WI transpose X. So that's what's in there.

To complete my formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. So following the discussion I had right when I introduced ICA, one thing I could do is choose the density for S, or equivalently, I could choose the CDF, the cumulative distribution function, for S. In this case, I'm going to choose a CDF, probably for historical reasons and probably for convenience. I need to choose the CDF for S, so what that means is I just need to choose some function that increases from zero to one. I know I can't choose a Gaussian, because we know you can't do ICA on Gaussian data. So I need some function increasing from zero to one that is not the cumulative distribution function of a Gaussian distribution. So what other functions do I know that increase from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice that is made for convenience. There's actually no great reason for why you choose a sigmoid function; it's just a convenient function that we all know and are familiar with that happens to increase from zero to one.
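Putting the pieces together, the model on the board is the following (restated here; g is the sigmoid CDF chosen above, and w_j transpose denotes the j-th row of W):

```latex
\[
p_S(s) = \prod_{j=1}^{n} p_s(s_j),
\qquad
p_X(x) = \Bigl(\,\prod_{j=1}^{n} p_s\!\bigl(w_j^{\top} x\bigr)\Bigr)\,\lvert \det W \rvert,
\qquad
g(s) = \frac{1}{1 + e^{-s}},
\quad
p_s(s) = g'(s) = g(s)\bigl(1 - g(s)\bigr).
\]
```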
When you take the derivative of the sigmoid, that gives you back your density. This density is just not Gaussian, and that's the main virtue of choosing the sigmoid. So there's really no rationale for the choice of the sigmoid; lots of other things will work fine, too. It's just a common, reasonable default.

It turns out that one reason the sigmoid works well for a lot of data sources is the following. If this is the Gaussian, and you take the sigmoid and take its derivative, you find that the sigmoid has fatter tails than the Gaussian. By this I mean the density you get from the sigmoid dies down to zero much more slowly than the Gaussian: the magnitude of the Gaussian's tails dies down as e to the minus s squared, whereas for the sigmoid the tails look like e to the minus s. So the tails die down as e to the minus s rather than e to the minus s squared. It turns out that most distributions with this property, with heavier tails, where the distribution decays to zero relatively slowly compared to a Gaussian, will work fine for your data. Actually, one other choice you can sometimes use is what's called the Laplacian distribution, whose density is proportional to e to the minus the absolute value of s. This will work fine, too, for many data sources.

Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set, and as you saw, this is an unlabeled training set, I can write down the log likelihood of my parameters: it's the sum over my training examples of the sum over the sources of the log of the source density applied to WI transpose X, plus the log of the determinant of W. So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent, which is just that. It turns out, if you work through the math, let's see: if P of S is equal to the derivative of the sigmoid, then if you just work through the math to compute the gradient there (you've all done this a lot of times, so I won't bother to show the details), you find that the update is equal to this. Okay? You can work that out yourself; it's just math to compute the derivative of this with respect to W. So to summarize: given the training set, here's my stochastic gradient ascent update rule, and you run stochastic gradient ascent to learn the parameters W.
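Here is a minimal sketch of that procedure, not code from the lecture. Working out the gradient of the log likelihood above gives the per-example update W := W + alpha * ((1 - 2 g(W x)) x^T + (W^T)^{-1}). The function name, learning rate, epoch count, and the toy mixing matrix in the demo at the end are all made-up choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd(X, alpha=0.001, n_epochs=30, seed=0):
    """Maximum-likelihood ICA with a sigmoid CDF on the sources.

    X: (m, n) array of unlabeled examples x^(i) (the microphone recordings).
    Returns the unmixing matrix W (n x n), so that s^(i) is roughly W x^(i).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)                        # start from the identity unmixing matrix
    for _ in range(n_epochs):
        for i in rng.permutation(m):     # stochastic gradient ascent, one example at a time
            x = X[i].reshape(n, 1)
            # gradient of sum_j log g'(w_j^T x):  (1 - 2 g(W x)) x^T
            grad = (1.0 - 2.0 * sigmoid(W @ x)) @ x.T
            # gradient of log |det W|:  (W^T)^{-1}
            grad += np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Hypothetical end-to-end check: mix two uniform sources, then unmix them.
rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))    # independent, non-Gaussian sources
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])                     # made-up square mixing matrix
X = S @ A.T                                    # recordings x^(i) = A s^(i)
W = ica_sgd(X)
S_hat = X @ W.T                                # recovered sources s^(i) = W x^(i)
# S_hat should match S up to permutation of the columns, sign, and scale.
```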
After you're done, you then output SI equals W times XI, and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity.

Okay? So just switch back to the laptop, please. We'll just wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of physiological recording data. This is a picture of an EEG cap, in which there are a number of electrodes you place on, in this case, Katie's scalp, where each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures changes in voltage over time: the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to measure these very synchronized recordings. It turns out that when we look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats and from human eye blinks and so on. So, in cartoon terms, if you imagine placing the electrodes, or microphones, on my scalp, then each microphone is recording some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body does as well, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find out the independent components, what independent processes are going on in my brain. This is an example of running ICA. You find that a small number of components, like those shown up there, correspond to the heartbeat, where the arrows are; those are very periodic signals that come on occasionally and correspond to components of the heartbeat. You also find things like an eye blink component, corresponding to the signal generated when you blink your eyes.
By doing this, you can then subtract out the heartbeat and eye blink artifacts from the data, and now you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step and a common application of ICA.

The last example is an application, again from [inaudible], of running ICA on small patches of natural images. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the basis vectors you get; this is a plot of the sources you get. So this algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or a linear combination, of independent sources of the things that make up images. So this models natural images as generated by independent components that produce different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out this is similar, in many ways, to the computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges are the independent components of images.

Are there quick questions, because I'm running late? Quick questions before I close?

Interviewee: [Inaudible] square matrix?

Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assumed A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. The odds are that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low. If you have two male speakers or two female speakers, then it's beyond the state of the art right now to separate them with one microphone. It's a great research problem. Okay, sorry about running late again. Let's close now, and we'll continue with reinforcement learning next time.