This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue the discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, or LSI. Then I want to spend a little time talking about how to implement PCA, especially for very large problems; in particular, I'll spend a bit of time on the singular value decomposition, or SVD, implementation of principal components analysis. In the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways accomplishes very different things. With this lecture we'll wrap up our discussion of unsupervised learning; in the next lecture, we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal components analysis: in PCA, we imagine that we have some very high-dimensional data that perhaps lies approximately on some low-dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second principal component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA is to normalize the data to zero mean and unit variance: you subtract out the mean of your training examples, so they now have zero mean, and then you normalize each of your features so that the variance of each feature is one.

The next step is to compute the covariance matrix of your zero-mean data, Sigma = (1/m) times the sum over i of x^(i) x^(i) transpose, the sum of all the outer products. Then you find the top k eigenvectors of Sigma.

So last time we saw applications of this.
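To make the recap concrete, here is a minimal NumPy sketch of the three steps just listed (the function and variable names are my own, and this is the naive version that forms Sigma explicitly; the SVD alternative comes later in the lecture):

```python
import numpy as np

def pca_top_k(X, k):
    """Naive PCA: X is an m x n design matrix (one example per row)."""
    # Step 1: normalize to zero mean and unit variance per feature.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the zero-mean data, Sigma = (1/m) X^T X.
    m = X.shape[0]
    Sigma = X.T @ X / m
    # Step 3: top-k eigenvectors of Sigma; eigh returns ascending eigenvalues,
    # so reverse the columns to put the largest ones first.
    _, eigvecs = np.linalg.eigh(Sigma)
    return eigvecs[:, ::-1][:, :k]  # n x k matrix of principal components
```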
For example, one of the applications was eigenfaces, where each of your training examples x^(i) is an image. So if you have 100 by 100 images, that is, if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values in each of your 100 by 100 images. So the eigenfaces application was one where the training examples comprised pictures of people's faces. We ran PCA, and then, to measure the distance between, say, a face here and a face there, we would project both face images onto the subspace and measure the distance along the subspace. In eigenfaces, you use something like 50 principal components.

The difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000 by 10,000 matrix, which is huge: it has 100 million entries.

So we'd like to apply PCA to very, very high-dimensional data with the goal of reducing its dimension, but step two of this algorithm has this step where you construct Sigma, this extremely large matrix, which you can't really do. We'll come back to this in a second.

It turns out that one of the other frequently used applications of PCA is to text data. Here's what I mean. Remember our vectorial representation of emails? This is from way back when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or given a text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary: somewhere you have the word "learn", somewhere you have the word "study", and so on.
Depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used back in lecture five or lecture six for representing text documents when we were building [inaudible]-based classifiers for spam. It turns out that one of the common applications of PCA is to exactly this sort of text data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name: latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing step. For various reasons, in LSI you usually don't normalize the mean of the data to zero, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so LSI does something very similar to PCA.

Normalizing the variance of each feature to one would actually be a bad idea for text data, because it would have the effect of dramatically scaling up the weight of rarely occurring words. For example, the word "aardvark" hardly ever appears in any document, so normalizing the variance of that feature to one would scale up the weight of the word aardvark dramatically, which is not what you want.

So in natural language, something we want to do quite often is, given two documents x^(i) and x^(j), measure how similar they are. For example, I may give you a document and ask you to find me more documents like this one. Say we're reading some article about some news event of today and want to find what other news articles there are. So I give you a document and ask you to look at all the other documents you have in this large set of documents and find the ones similar to this.
So this is a typical text application. To measure the similarity between two documents x^(i) and x^(j), where each document is represented as one of these high-dimensional vectors, one common way to do it is as follows. View each of your documents as a vector in a very high-dimensional space, where the dimension of the vector is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary. One natural measure of the similarity between two documents that's often used is the angle between them.

In particular, if the angle between the two vectors is small, then we consider the two documents to be similar; if the angle between the two vectors is large, then we consider the documents to be dissimilar. More formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. Over the relevant range of values, anyway, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product x^(i) transpose x^(j) divided by the product of the norms of x^(i) and x^(j). That's just the standard linear algebra or geometry definition of the cosine between two vectors.

Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace, that is, by running PCA on your text data this way, the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing. So look further at the definition of the cosine similarity measure.
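As a quick illustration of the cosine measure just defined, here is a small sketch (my own toy numbers) computing it for two 0/1 word-presence vectors:

```python
import numpy as np

def cosine_similarity(x_i, x_j):
    """Cosine of the angle between two bag-of-words vectors."""
    return float(x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# Toy 0/1 word-presence vectors over a five-word dictionary.
doc_a = np.array([1, 0, 1, 0, 0])
doc_b = np.array([1, 1, 0, 0, 0])
print(cosine_similarity(doc_a, doc_b))  # 0.5, since one word is shared
```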
So the numerator of the similarity between the two documents was this inner product, which is the sum over k of x^(i)_k times x^(j)_k. This inner product will be equal to zero if the two documents have no words in common. It's really the sum over k of an indicator of whether documents i and j both contain the word k, because x^(i)_k indicates whether document i contains word k, and x^(j)_k indicates whether document j contains word k. So the product is one only if the word k appears in both documents. Therefore, the similarity between two documents will be zero if they have no words in common.

For example, suppose your document x^(i) has the word "study" and the document x^(j) has the word "learn". Then these two documents may be considered entirely dissimilar. Say you read a news article about effective study strategies, and you ask what other documents are similar to this one. If there are a bunch of other documents about good methods to learn, but they have no words in common with the first one, the similarity comes out as zero.

So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word "learn", and on the vertical axis I plot the word "study". The values take on either zero or one. So if a document contains the word learn but not study, it'll be plotted over there, and if a document contains neither the word study nor the word learn, it'll be plotted at (0, 0).

Here's the cartoon behind what PCA is doing: we identify a lower-dimensional subspace, which would be the eigenvector we get out of PCA. Now suppose we have a document about learning and a document about studying. The document about learning points to the right; the document about studying points up.
So the inner product, or the cosine of the angle, between these two documents, excuse me, the inner product between these two documents, will be zero. So these two documents are treated as entirely unrelated, which is not what we want: documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then the two documents become much closer together, and the algorithm will recognize that when you take the inner product between the two projected documents, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

That's the intuition about what PCA may be doing to text data. The same thing goes for examples other than the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians: that will also bring the documents closer together. For any related topics, the points end up closer together in this lower-dimensional space.
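Here is a tiny numerical sketch of the cartoon just described (my own toy numbers): two documents with no words in common have inner product zero, but after projecting onto a one-dimensional subspace along the learn/study direction, their inner product becomes positive.

```python
import numpy as np

# Axes: [learn, study]. One document contains only "learn", the other only "study".
doc_learn = np.array([1.0, 0.0])
doc_study = np.array([0.0, 1.0])
print(doc_learn @ doc_study)          # 0.0, no words in common

# A hypothetical principal direction that "learn" and "study" both load onto.
u = np.array([1.0, 1.0]) / np.sqrt(2)

# Project each document onto the subspace spanned by u.
proj_learn = (doc_learn @ u) * u
proj_study = (doc_study @ u) * u
print(proj_learn @ proj_study)        # 0.5, positive similarity after projection
```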
Any questions about this? [Student question, inaudible.] Which one? This one? No, the line. Oh, this one. Oh, yes. Thank you.

So let's talk about how to actually implement this now. Okay, how many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you, a lot more than I thought. Curious: did you guys learn it as undergrads or as graduate students?

All right, let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about it, too. So I'll say a little bit about how to implement PCA. The problem I was alluding to just now is that when you have these very high-dimensional vectors, Sigma is a very large matrix. In particular, for our text example, if the vectors x^(i) are 50,000-dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway. It turns out there's another way to implement PCA. If A is any m-by-n matrix, then one of the most remarkable results of linear algebra is that A can always be decomposed into a singular value decomposition, a product of three matrices: A = U D V^T, where U is m by m, V is n by n, and D is diagonal, with zeros on the off-diagonals. The values sigma_i on the diagonal of D are called the singular values of the matrix A.

Almost all of you said you learned this as graduate students rather than as undergrads. It turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions, so you usually learn about the QR decomposition, maybe the LU factorization of matrices, but most undergrad courses don't get to talk about singular value decompositions. Yet in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used an SVD dozens of times in the last year, whereas I think I've used a QR decomposition once and an LU decomposition once in the last year.

So let me say a bit more about this; I'm going to draw the picture, I guess. For example, a matrix A, which is m by n, can be decomposed into a matrix U, which is m by m, times a diagonal matrix D,
times another square matrix, V transpose, which is n by n. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.

To compute it, you just use the SVD command in Matlab or Octave. The state of the art in numerical linear algebra today is that singular value decompositions of matrices can be computed extremely reliably. We've used a package like Matlab or Octave to compute, say, the eigenvectors of a matrix; SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed: you compute the square root of something and just don't worry about it, because you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands-by-thousands matrices, I think of the SVD routine like a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand, this is implemented very well today.

[Student:] What's the complexity of SVD? [Instructor:] That's a good question. I actually don't know; I want to guess it's roughly on the order of n cubed, but I'm not sure. I don't know what's known about the convergence of these algorithms.

The example I drew out was for a fat matrix, a matrix that's wider than it is tall. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it decomposes into a product of three matrices like that. The nice thing about this is that we can use it to compute eigenvectors and PCA very efficiently.
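Here is a small NumPy check (my own, not from the lecture) of the property just stated: the columns of U and V from the SVD of a random matrix are eigenvectors of A A^T and A^T A respectively, with the squared singular values as eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # a fat (wider-than-tall) matrix

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 4x4, Vt is 6x6

# Columns of U are eigenvectors of A A^T, with eigenvalues s**2.
print(np.allclose(A @ A.T @ U, U * s**2))          # True

# Columns of V (rows of Vt) for the nonzero singular values are
# eigenvectors of A^T A, again with eigenvalues s**2.
V4 = Vt[:4].T
print(np.allclose(A.T @ A @ V4, V4 * s**2))        # True
```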
In particular, the covariance matrix Sigma was this: it was the sum of all the outer products. So go back and recall the definition of the design matrix. I think I described this in lecture two, when we derived the closed-form solution to least squares. The design matrix X was the matrix where I took my examples x^(i) and stacked them in rows; we call this the design matrix. If you construct the design matrix, then the covariance matrix Sigma can be written as just X transpose X (up to the 1/m scaling, which doesn't affect the eigenvectors). I hope you see why X transpose X gives you the sum of outer products of vectors; if you aren't seeing it right now, just go home and convince yourself that it's true.

To get the top k eigenvectors of Sigma, you would take the matrix X and compute its SVD, so you get U S V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.

So in our example, if you have 50,000 words in your dictionary, the design matrix would be in R^(m by 50,000), say 100 by 50,000 if you have 100 examples. So X is quite tractable to represent and to compute the SVD of, whereas the matrix Sigma, which is 50,000 by 50,000, would be much harder to represent. So this gives you an efficient way to implement PCA. The reason I want to talk about this is that in previous years I didn't, and
in the class projects I found a number of students trying to implement this on huge problems and [inaudible]. So this is a much better way to implement PCA if you have extremely high-dimensional data. If you have low-dimensional data, if you have 50- or 100-dimensional data, then computing Sigma is no problem and you can do it the old way; but otherwise, use the SVD to implement this.

Questions about this?

The last thing I want to say is a note of caution about implementing this in practice. It turns out that for many applications, when you apply SVD to these wide matrices... yes? [Student:] Just a quick question: are those the top k columns of U or of V? Because X transpose X involves V transpose, right? [Instructor:] Let's see. Oh, yes, I think you're right. Is it the top k columns of U or the top k of V? Yeah, I think you're right. [Student:] So then X transpose X should be V D D ... X is U D V transpose, so X transpose X would be... [Instructor:] If anyone thinks about this and has another opinion, let me know, but I think you're right. I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know? That sounds right. Okay, cool.
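Pulling together the design-matrix trick and the correction just arrived at (the top k eigenvectors of X^T X are the first k columns of V, not U), here is a hedged NumPy sketch of PCA via the SVD; the function name is my own.

```python
import numpy as np

def pca_via_svd(X, k):
    """PCA for high-dimensional data without ever forming the n x n
    covariance matrix: X is the m x n design matrix (examples in rows,
    already normalized), and we take its SVD instead."""
    # full_matrices=False is NumPy's "economy" convention; other packages
    # make slightly different choices here, as discussed just below.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # X = U diag(s) Vt, so X^T X = V diag(s)^2 V^T: the rows of Vt
    # (columns of V) are the eigenvectors of Sigma, in decreasing order.
    return Vt[:k].T          # n x k matrix of the top-k principal components
```

For the text example above, this works directly on the 100 by 50,000 matrix X rather than on a 50,000 by 50,000 Sigma.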
So just one last note, a note of caution. It turns out that in this example, I was implementing SVD with a wide matrix: the matrix X had many more columns than rows. It turns out that when you find the SVD decomposition of such a matrix, the right-most block of the diagonal matrix D is all zeros; a large part of that matrix is zeros. So depending on what convention you use, for example, I think Matlab actually uses a convention of just cutting off the zero elements, chopping off the zero portion of the D matrix and the corresponding portion of the other factors. I'm not sure if this even depends on the version of Matlab, but when you call SVD in Matlab or some other numerical algebra package, there are slightly different conventions for how to define your SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses onto the original computations. Whether you're using Matlab or writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices, so just make sure you figure this out for the package that you use.

Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to the questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two-by-two matrix; this is a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I had a bunch of points z on a line and then drew these ellipses. That was the factor analysis model, which models the density of x. And then there was PCA, just now. The difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes to find the subspace.
So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA. Factor analysis sort of assumes the data lies on a subspace; let me write "subspace" here. So both of these algorithms assume the data maybe lies close to or on some low-dimensional subspace, but fundamentally, I think of factor analysis as a density estimation algorithm: if you have some very high-dimensional distribution and want to model p(x), then factor analysis is the algorithm I'm more inclined to use. Even though you could in theory, I would tend to avoid using factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection, if you want to model p(x) so that when an example has very low probability you can flag it as an anomaly, then I would tend to use factor analysis for that density estimation.

So factor analysis and PCA are both algorithms that assume your data lies on or near a subspace. The other class of algorithms we talked about assumes the data lies in clumps, or that the data has a few coherent groups. So let me fill in the rest of this picture. If you think your data lies in clumps or groups, and you want to do density estimation, then I would tend to use a mixture of Gaussians algorithm. But if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use the k-means algorithm.

I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain. Hopefully this helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion on PCA.
What I want to do next is talk about independent component analysis, or ICA. Yes? [Student:] I have a question about the upper right [inaudible]. Once you have all of the eigenvectors, [inaudible] how similar feature i is to feature j: you pick some eigenvector, and you take dot products between feature i and feature j and that eigenvector. But there are a lot of eigenvectors to choose from. [Instructor (Andrew Ng):] Right. So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? The answer is really that in this cartoon, I would avoid thinking about individual eigenvectors. A better way to view this cartoon is that if I decide to choose 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector; this arrow in the cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors. What I actually do is project my data onto the span, the linear span, of the eigenvectors, and then I measure distances, or take inner products, between the projections of the two points onto that subspace. Okay.
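Here is a small sketch (my own illustration, with made-up dimensions and a stand-in basis) of the procedure just described: project two examples onto the span of the top-k eigenvectors and measure the distance, or inner product, between the projections.

```python
import numpy as np

def project_onto_span(x, V_k):
    """Project x onto the subspace spanned by the columns of V_k,
    an n x k matrix of orthonormal eigenvectors (principal components)."""
    return V_k @ (V_k.T @ x)

# Hypothetical setup: n = 10,000-dimensional examples, k = 50 components.
rng = np.random.default_rng(0)
V_k, _ = np.linalg.qr(rng.standard_normal((10_000, 50)))  # stand-in orthonormal basis
x_i = rng.standard_normal(10_000)
x_j = rng.standard_normal(10_000)

p_i = project_onto_span(x_i, V_k)
p_j = project_onto_span(x_j, V_k)
distance = np.linalg.norm(p_i - p_j)   # distance measured along the subspace
similarity = p_i @ p_j                 # or an inner product between projections
```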
So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA we're going to try to find the independent components of variation in the data. Switch to the laptop display there, please; we'll just take a second to motivate that. This is actually a slide that I showed in lecture one, of the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping versions of the two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this.

[Recording]

So these really are two speakers, speaking independently of each other: each speaker is outputting a series of sound signals independent of the other conversation going on in the room. This being an unsupervised learning problem, the question is, can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording] This is the other one. [Recording]

Just for fun, these are audio clips I got from [inaudible]. Let me play the other ones as well. This is overlapping microphone one. [Recording] Here's microphone two. [Recording] So given this as input, here's output one. [Recording] It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording] Okay, switch back to the chalkboard, please.

So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you. Suppose you have a one-dimensional random variable S, and suppose it has a probability density function p_S(s). Then the CDF is defined as the function F(s), which is the probability that the random variable S is less than the value given by that lowercase s. For example, if this is your Gaussian density, usually denoted lowercase phi, that's roughly a bell-shaped density, then the CDF of the Gaussian will look something like this; it's the function usually denoted capital Phi. If I pick a value s like that, then the height of the CDF at that point is the probability that my Gaussian random variable is less than that value.
In other words, the height of the function at that point is the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian. So another way to write F(s) is as the integral from minus infinity to s of the density: F(s) = integral from minus infinity to s of p(t) dt.

Something that will come up later: suppose I have a random variable S and I want to model its distribution. One thing I can do is specify what I think the density is; or I can specify what the CDF is. These are related by this equation: F is the integral of the density of S. You can also recover the density from the CDF by taking the derivative: F prime, the derivative of the CDF, gives you back the density.

This will come up in the middle of the ICA derivation: there'll be a step where we need to assume a distribution for the random variable S. I can either specify the density for S directly, or I can specify the CDF; I'll choose to specify the CDF. It has to be some function increasing from zero to one, and you can choose any function that looks like that. In particular, pulling a function out of a hat that looks like that, you can, for instance, choose the sigmoid function as the CDF. That would be one way of specifying the distribution of the random variable S. So this will come up later. Just raise your hand if this is familiar to you, if you've seen it before. Great.
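As a small sketch of the point just made (my own example, not the lecture's notation): if you pick the sigmoid as the CDF of S, differentiating it gives you the density you have implicitly assumed.

```python
import numpy as np

def sigmoid_cdf(s):
    """A choice of CDF for S: increasing from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-s))

def implied_density(s, eps=1e-6):
    """Recover the density as the derivative of the CDF, p(s) = F'(s),
    here via a simple finite difference."""
    return (sigmoid_cdf(s + eps) - sigmoid_cdf(s - eps)) / (2 * eps)

s = np.linspace(-5, 5, 1001)
p = implied_density(s)
print(p.sum() * (s[1] - s[0]))   # roughly 1: the derivative of a CDF integrates to 1
```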
So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources; say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s_j^(i) to denote the signal from speaker j at time i.

Here's what I mean. What is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in air. The way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and your ear detects those changes in air pressure. So what a microphone records, what my mouth is generating, is a pattern of changes in air pressure; I'm going to draw a cartoon, I guess. That's what sound is: you look at a microphone recording and you see these roughly periodic signals that comprise changes in air pressure over time, as the air pressure goes above and below some baseline. So this is what the speech signal looks like, say. This is speaker one, and this is some time i. What I'm saying is that I'm going to denote the value at that point as s, superscript i, subscript one. Similarly, speaker two is outputting some sound wave; if you played that one by itself it would actually sound like a single tone, I guess. In the same way, at the same time i, the value of the air pressure generated by speaker two I'll denote as s_2^(i).

So we observe x^(i) = A s^(i), where these x^(i) are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying; each microphone records some overlapping combination of what the speakers are saying. For example, x_j^(i) is what microphone j records at time i, and by the definition of matrix multiplication, this is the sum over k of A_jk times s_k^(i).
So what my j-th microphone is recording is some linear combination of all of the speakers: at time i, what microphone j records is some linear combination of what all the speakers are saying at time i, where k indexes over the n speakers.

Our goal is to find the matrix W = A inverse; we just define W that way. Then we can recover the original sources as a linear combination of our microphone recordings x^(i), namely s^(i) = W x^(i). Just as a point of notation, I'm going to write the matrix W this way: I'll use lowercase w_1 transpose, w_2 transpose, and so on to denote the rows of this matrix W.

So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example, let's say that each of my speakers outputs uniform white noise; this will sound like static. Can I switch to the laptop display, please? If that's the case, then with these axes, s_1 and s_2, this is what my two speakers would be uttering: a plot of what they utter fills a square box if the two speakers are independently outputting uniform [-1, 1] random variables. So this is a plot of s_1 and s_2, my original sources, and this would be a typical sample of what my microphones record, with x_1 and x_2 on the axes. These are images I got from [inaudible] on ICA. Given a picture like this, you can sort of look at the parallelogram and tell what its axes are, and you can figure out what linear transformation would transform the parallelogram back into a box.
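Here is a minimal simulation (my own toy numbers, including the mixing matrix) of the setup just described: two independent uniform sources mixed by a matrix A produce the parallelogram-shaped cloud the microphones see, and, if A were known, applying W = A inverse would map it back to the square.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                                   # number of time samples

# Two independent speakers, each uniform on [-1, 1]: the "square box".
S = rng.uniform(-1.0, 1.0, size=(m, 2))

# A hypothetical mixing matrix (unknown to the algorithm in practice).
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])

X = S @ A.T                                # microphone recordings x^(i) = A s^(i)
W = np.linalg.inv(A)                       # the unmixing matrix ICA tries to estimate
S_rec = X @ W.T                            # recovers the sources exactly when A is known
print(np.allclose(S_rec, S))               # True
```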
So it turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: if you reflect the picture across a 45-degree axis, you still get a box. So there's no way for the algorithm to tell which was speaker number one and which was speaker number two; the numbering, or ordering, of the speakers is ambiguous.

The other source of ambiguity, and these are the only ambiguities in this example, is the sign of the sources. Given my microphone recordings, you can't tell whether you got back a positive s^(i) or a negative s^(i). In this picture, what that corresponds to is that if you take the picture and reflect it along the vertical axis, or along the horizontal axis, you still get a box. You still get back [inaudible] speakers. So in this example, you can't guarantee that you've recovered positive s^(i) rather than negative s^(i).

So these are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. The permutation of the speakers, there's not much you can do about that. And it turns out that if you take an audio source, flip its sign to get negative s, and play it back, it will sound indistinguishable. So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about it. Let's switch back to the chalkboard, please.
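A small continuation of the same sketch, illustrating the two ambiguities just described: relabeling the speakers and flipping a sign, with a compensating change to the (made-up) mixing matrix, produces exactly the same recordings, so no algorithm looking only at x could distinguish the two explanations.

# Permutation / sign ambiguity: relabeling the speakers and flipping the
# sign of a source, while adjusting A accordingly, leaves the microphone
# recordings x unchanged, so no algorithm can undo either choice.
P = np.array([[0.0, 1.0],    # swap the two speakers
              [1.0, 0.0]])
D = np.diag([1.0, -1.0])     # flip the sign of one source

s_alt = s @ (P @ D).T              # relabeled / sign-flipped sources
A_alt = A @ np.linalg.inv(P @ D)   # compensating mixing matrix

assert np.allclose(s_alt @ A_alt.T, x)   # identical recordings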
It turns out, and I don't want to spend too much time on this, but I do want to say it briefly, that the reason those are the only sources of ambiguity, the permutation of the speakers and the signs, is that the s_j^(i)'s were non-Gaussian. Suppose instead my original sources, s1 and s2, were Gaussian. So suppose s^(i) is Gaussian with mean zero and identity covariance. That just means each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data. You will recall that the contours of a Gaussian distribution with identity covariance look like this, right? The Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables, then if I observe a linear combination of them, there's actually no way to recover the original sources, because there's no way for me to tell whether the axes are at this angle or at that angle, and so on. The Gaussian is a rotationally symmetric distribution, so I would not be able to recover the orientation, the rotation, of the data. I don't want to dwell on this too long, but it turns out that if your sources are Gaussian, it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have, even with infinitely large amounts of data, you would not be able to recover the matrix A or W.

Let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Let s be a vector in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of s is p subscript s, so p_s(s). My microphones record x = A s, which equals W^{-1} s; equivalently, s = W x. So let's think about what the density of x is. I know the density of s, and x is a linear combination of the s's, so let's figure out the density of x. One thing we could try is to figure out what s is and just apply the density of s to W x. So let's see.
The naive reasoning would be: this is the probability of s, so we just figure out what s is; s is W times x, so the density of x must be p_s(W x). But this is wrong. It turns out you can do this for probability mass functions, but not for continuous densities. In particular, it's not correct to say that the density of x is obtained by just figuring out what s is and then applying the density of s to that. You can't do this with densities, because p_s is a probability density function. The right formula is the density of s applied to W x, times the determinant of the matrix W: p_X(x) = p_s(W x) |W|.

Let me just illustrate that with an example. Let's say the density of s is the uniform density over the unit interval, p_s(s) = 1{0 <= s <= 1}; it's just the density of the uniform distribution on [0, 1]. Now let x be equal to two times s, so A = 2 and W = 1/2. If s is uniform over [0, 1], then x, which is two times that, is uniformly distributed over the range from zero to two. So the density of x looks like this: between zero and two its height is one half. In other words, the density of x is the indicator 1{0 <= x <= 2} times one half, which is exactly p_s(W x) times W. Does that make sense? When you compute the density of x, because x is now spread out across a wider range, the density of x is smaller, and that is why the density of x has this one-half term here, okay?

This was an illustration for the case of one-dimensional random variables, where s and x are one-dimensional. I'm not going to show it, but the generalization to vector-valued random variables is that the density of x is given by p_X(x) = p_s(W x) times the determinant of the matrix W; over here I showed the one-dimensional special case. So we're nearly there.
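Here is a tiny numerical check of the one-dimensional example above; the test point 1.5 is an arbitrary choice.

# Change-of-variables check: s ~ Uniform[0, 1], x = 2s, so A = 2, W = 1/2.
# The correct density is p_X(x) = p_s(W x) * |W|, not p_s(W x) alone.
def p_s(s):
    return 1.0 if 0.0 <= s <= 1.0 else 0.0   # density of Uniform[0, 1]

A_scalar = 2.0
W_scalar = 1.0 / A_scalar

def p_x_wrong(x):
    return p_s(W_scalar * x)                 # forgets the |W| factor

def p_x_right(x):
    return p_s(W_scalar * x) * abs(W_scalar)

# x is Uniform[0, 2], so its true density on [0, 2] is 1/2:
print(p_x_wrong(1.5))    # 1.0, wrong; this "density" would integrate to 2
print(p_x_right(1.5))    # 0.5, correct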
Here's how I can implement ICA. For my distribution on s, I'm going to assume that the density of s is given by a product over the n speakers of the density of each speaker emitting a certain sound: p(s) = product over j of p_s(s_j). This is a product of densities because I'm going to assume the speakers are having independent conversations, so the s_j's are independent for different values of j. So by the formula we just worked out, the density for x is p(x) = product over j of p_s(w_j^T x), times |W|. I'll just remind you, W is A inverse, the matrix I defined previously, so that s_j = w_j^T x; that's what's inside there.

To complete the formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. So following the discussion I had right when we [inaudible] ICA, one thing I could do is choose the density for s, or equivalently, I could choose the CDF, the cumulative distribution function, for s. In this case I'm going to choose a CDF, probably for historical reasons and for convenience. What that means is I just need to choose some function that increases from zero to one. I know I can't choose the Gaussian, because we know you can't do ICA on Gaussian data. So I need some function increasing from zero to one that is not the cumulative distribution function of a Gaussian distribution. So what other function do I know that increases from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice, made for convenience; there's actually no great reason for choosing the sigmoid. It's just a convenient function that we all know and are familiar with, and that happens to increase from zero to one.
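For concreteness, a small sketch of this modeling choice: the sigmoid as the assumed source CDF, with its derivative (used in the next step) as the corresponding density.

import numpy as np

def g(s):
    # sigmoid CDF: increases from 0 to 1, and is not the Gaussian CDF
    return 1.0 / (1.0 + np.exp(-s))

def g_prime(s):
    # its derivative, g'(s) = g(s) (1 - g(s)), is the implied source density
    return g(s) * (1.0 - g(s))

print(g(-10.0), g(0.0), g(10.0))   # roughly 0, then 0.5, then roughly 1
print(g_prime(0.0))                # 0.25, the peak of the density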
When you take the derivative of the sigmoid, that gives you back your density. This is simply not Gaussian, and that is the main virtue of choosing the sigmoid. So there's really no deep rationale for the choice of the sigmoid; lots of other things will work fine, too. It's just a common, reasonable default.

It turns out that one reason the sigmoid works well for a lot of data sources is the following. Say this is the Gaussian. If you take the sigmoid and take its derivative, you find that the resulting density has heavier tails than the Gaussian; by this I mean the sigmoid-derived density dies down to zero much more slowly than the Gaussian. For the Gaussian, the magnitude of the tails dies down as e to the minus s squared; for the sigmoid, the tails look like e to the minus s. So the tails die down as e^{-s} rather than e^{-s^2}. It turns out that most distributions with this property, with heavy tails, where the distribution decays to zero relatively slowly compared to a Gaussian, will work fine for your data. Actually, one other choice you can sometimes use is the Laplacian distribution, p(s) = (1/2) e^{-|s|}. This will work fine, too, for many data sources.

Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set {x^(1), ..., x^(m)}, and as you see, this is an unlabeled training set, I can write down the log likelihood of my parameters, summed over my training examples:

l(W) = sum over i of ( sum over j of log g'(w_j^T x^(i)) + log |W| ).

So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent. It turns out, if you work through the math, with p_s equal to the derivative of the sigmoid, when you compute the gradient, and you've all done this sort of calculation many times so I won't show the details, you find the update is

W := W + alpha ( (1 - 2 g(W x^(i))) x^(i)T + (W^T)^{-1} ),

where 1 - 2 g(W x^(i)) is applied elementwise. Okay? You can work that out yourself; it's just calculus to compute the derivative of the log likelihood with respect to W. So to summarize, given the training set, that's my stochastic gradient ascent update rule.
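Below is a minimal sketch of the stochastic gradient ascent update just summarized, under the sigmoid-CDF source model. The learning rate, initialization, and number of passes are arbitrary choices, not values from the lecture; the call at the end assumes x is the (time by microphones) recording array from the earlier mixing sketch.

import numpy as np

def g(s):
    # sigmoid CDF (the source model chosen above)
    return 1.0 / (1.0 + np.exp(-s))

def ica_sga(X, alpha=0.01, n_passes=20, seed=0):
    # X: array of shape (T, n), one recording x^(i) per row
    T, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # arbitrary init
    for _ in range(n_passes):
        for i in rng.permutation(T):
            xi = X[i]
            # gradient of sum_j log g'(w_j^T x^(i)):  (1 - 2 g(W x)) x^T
            grad = np.outer(1.0 - 2.0 * g(W @ xi), xi)
            # gradient of log |det W|:  (W^T)^{-1}
            grad += np.linalg.inv(W.T)
            W += alpha * grad
    return W

W_hat = ica_sga(x)        # x: the microphone recordings from the sketch above
s_hat = x @ W_hat.T       # recovered sources, up to the ambiguities discussed

In practice you would tune or anneal the learning rate and check convergence; this sketch keeps only the two gradient terms from the update rule.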
So you run stochastic gradient ascent to learn the parameters W. After you're done, you output s^(i) = W x^(i), and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity. Okay? So just switch back to the laptop, please.

We'll wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of [inaudible] recording data, such as EEG data. This is a picture of an EEG cap, in which there are a number of electrodes you place, in this case, on Katie's scalp, and each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures changes in voltage over time; the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to measure these very synchronized recordings. It turns out that when you look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats, from human eye blinks, and so on. So the cartoon is this: if you imagine placing the electrodes, or microphones, on my scalp, then each microphone records some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body has other things going on, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find out what the independent components, the independent processes, going on in my brain are. This is an example of running ICA.
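As an aside, here is a sketch of the kind of artifact-removal preprocessing described next, using scikit-learn's FastICA, which is a different ICA estimator than the gradient-ascent rule derived above. The "electrode" signals here are synthetic and the choice of which component to drop is made up; with real EEG you would identify the artifact components by inspection.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 2000)

# Synthetic stand-ins for brain activity plus heartbeat- and blink-like
# artifacts, mixed into three "electrode" channels (all made up).
sources = np.c_[np.sin(7.0 * t),                     # ongoing rhythm
                np.sign(np.sin(1.3 * t)),            # periodic "heartbeat"
                (rng.random(t.size) < 0.01) * 5.0]   # sparse "blinks"
A_mix = rng.random((3, 3)) + 0.5
X = sources @ A_mix.T

ica = FastICA(n_components=3, random_state=0)
S_hat = ica.fit_transform(X)          # estimated independent components

# Suppose inspection shows component 2 is the blink artifact: zero it out
# and map back to electrode space to get cleaned recordings.
S_clean = S_hat.copy()
S_clean[:, 2] = 0.0
X_clean = ica.inverse_transform(S_clean)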
So you find that a small number of components, like those shown up there, correspond to the heartbeat; those are very periodic signals that come on occasionally and correspond to the heartbeat component. You also find things like an eye-blink component, corresponding to the signal generated when you blink your eyes. By doing this, you can then subtract out the heartbeat and eye-blink artifacts from the data, and you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step and a common application of ICA.

Another example is this application, again from [inaudible], of running ICA on small natural image patches. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the bases you get; this is a plot of the sources you get. The algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or linear combination, of independent sources of the things that make up images. So this models natural images as generated by independent objects that generate different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out that this is similar, in many ways, to computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges are the independent components of images.

Are there quick questions, because I'm running late. Quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assumed A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. It turns out that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low.
If you have two male speakers or two female speakers, then it's beyond the current state of the art to separate them with one microphone. It's a great research problem. Okay. Sorry about running late again. Let's close now, and we'll pick up with reinforcement learning in the next lecture.