Lecture 15 | Machine Learning (Stanford)
-
0:10 - 0:13This presentation is delivered by the Stanford Center for Professional
-
0:13 - 0:20Development.
-
0:29 - 0:30Welcome back.
-
0:30 - 0:33What I want to do today is
-
0:33 - 0:37continue a discussion of principal components analysis, or PCA.
-
0:37 - 0:39In particular, there's
-
0:39 - 0:42one more application that I didn't get to in the last lecture on
-
0:42 - 0:44latent semantic indexing,
-
0:44 - 0:47LSI. Then I want to spend just a little time talking about
-
0:47 - 0:50how to implement PCA,
-
0:50 - 0:54especially for very large problems. In particular, I'll spend just a little bit of time talking
-
0:54 - 0:56about singular value decomposition,
-
0:56 - 0:58or the SVD implementation
-
0:58 - 1:00of principal component
-
1:00 - 1:02analysis. So the
-
1:02 - 1:07second half of today's lecture, I want to talk about a different algorithm called
-
1:07 - 1:09independent component analysis,
-
1:09 - 1:13which is, in some ways, related to PCA, but in many other ways, it
-
1:13 - 1:16also manages to accomplish
-
1:16 - 1:18very different things than PCA.
-
1:18 - 1:22So with this lecture, this will actually wrap up our discussion on
-
1:22 - 1:26unsupervised learning. The next lecture, we'll start to talk about
-
1:26 - 1:31reinforcement learning algorithms.
-
1:31 - 1:32Just to
-
1:32 - 1:37recap where we were with PCA, principal
-
1:37 - 1:39component analysis,
-
1:39 - 1:45I said that
-
1:45 - 1:49in PCA, we imagine that we have some very high dimensional data
-
1:49 - 1:51that perhaps
-
1:51 - 1:55lies approximately on some low dimensional subspace. So if you had the data set like
-
1:55 - 1:56this,
-
1:56 - 1:57you might find that
-
1:57 - 2:01that's the first principal component of the data,
-
2:01 - 2:03and that's the second
-
2:03 - 2:06component of this 2-D data.
-
2:07 - 2:09To
-
2:09 - 2:13summarize the algorithm, we have three steps. The first step of PCA
-
2:15 - 2:17was to
-
2:17 - 2:28normalize the data to zero mean and unit variance. So you subtract out the means of
-
2:28 - 2:33your training examples. So it now has zero means, and then normalize each of your features so
-
2:33 - 2:36that the variance of each feature is now one.
-
2:38 - 2:44The next step was to compute the covariance matrix of your zero-mean data. So
-
2:44 - 2:49you compute it as follows.
-
2:49 - 2:51The sum of the outer products of your training examples,
-
2:52 - 2:56and then you find the
-
2:56 - 3:03top K eigenvectors of
-
3:04 - 3:07sigma.
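(To make those three steps concrete, here is a minimal sketch in Python/NumPy, not shown in the lecture; it assumes the training examples are stacked as the rows of a matrix X, and the function and variable names are illustrative.)

```python
import numpy as np

def pca(X, K):
    """Minimal PCA sketch. X is m x n, one training example per row."""
    # Step 1: normalize each feature to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the zero-mean data.
    m = X.shape[0]
    Sigma = (X.T @ X) / m
    # Step 3: top-K eigenvectors of Sigma (largest eigenvalues first).
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
    order = np.argsort(eigvals)[::-1][:K]
    return eigvecs[:, order]                   # n x K matrix of principal components
```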
-
3:07 - 3:10So
-
3:10 - 3:13last time we saw the applications of this. For example,
-
3:13 - 3:18one of the applications was to eigenfaces, where
-
3:18 - 3:23each of your training examples, XI, is an image.
-
3:23 - 3:24So
-
3:24 - 3:27if you have
-
3:27 - 3:31100 by 100 images, if your pictures of faces are
-
3:31 - 3:33100 pixels by 100 pixels,
-
3:33 - 3:38then each of your training examples, XI,
-
3:38 - 3:40will be a 10,000 dimensional vector,
-
3:40 - 3:43corresponding to the
-
3:43 - 3:4810,000 grayscale intensity pixel values. There are 10,000 pixel values in
-
3:48 - 3:50each of your 100 by 100 images.
-
3:50 - 3:56So the eigenfaces application was where the training examples comprised
-
3:56 - 3:58pictures of faces of people.
-
3:58 - 4:00Then we ran PCA,
-
4:00 - 4:01and then
-
4:01 - 4:03to measure the distance between say
-
4:03 - 4:04a face here
-
4:04 - 4:07and a face there, we would project both
-
4:07 - 4:10of the face images onto the subspace and then
-
4:10 - 4:11measure
-
4:11 - 4:17the distance along the subspace. So in eigenfaces, you use something like 50 principal components.
-
4:17 - 4:20So
-
4:20 - 4:24the difficulty of working with problems like these is that
-
4:24 - 4:27in step two of the algorithm,
-
4:27 - 4:31we construct the covariance matrix sigma.
-
4:31 - 4:38The covariance matrix now becomes
-
4:38 - 4:43a 10,000 by 10,000 dimensional matrix, which is huge. That has
-
4:43 - 4:46100 million
-
4:46 - 4:48entries, which is huge.
-
4:49 - 5:04So we'd like to apply PCA to very, very high dimensional data with the goal of reducing the dimension. But step two of this algorithm requires constructing sigma, this extremely large matrix, which you can't do.
-
5:04 - 5:07Come back to this in a second. It turns out one of
-
5:07 - 5:09the other
-
5:09 - 5:11frequently-used applications of
-
5:11 - 5:12PCA
-
5:12 - 5:14is actually to text data.
-
5:14 - 5:16So here's what I
-
5:16 - 5:21mean. Remember our vectorial representation of emails?
-
5:21 - 5:23So this is way back
-
5:23 - 5:27when we were talking about supervised learning algorithms for a
-
5:27 - 5:29spam classification. You remember I said that
-
5:29 - 5:33given a piece of email or given a piece of text document, you
-
5:33 - 5:35can represent it using a very high-dimensional vector
-
5:35 - 5:37by taking
-
5:39 - 5:44- writing down a list of all the words in your dictionary. Somewhere you had the word
-
5:44 - 5:47learn, somewhere you have the word
-
5:47 - 5:50study
-
5:51 - 5:55and so on.
-
5:55 - 5:58Depending on whether each word appears or does not appear in your text document, you put either
-
5:58 - 5:59a one or a zero
-
5:59 - 6:14there. This is the representation we used in lecture five or lecture six for representing text documents when we were building Naive Bayes classifiers for spam classification. So it turns
-
6:14 - 6:18out one of the common applications of
-
6:18 - 6:22PCA is actually to this sort of text data representation as well.
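(As an aside, here is a tiny sketch of that 0/1 word-presence representation; the four-word dictionary and the document are made up for illustration, not taken from the lecture.)

```python
dictionary = ["aardvark", "learn", "study", "zygote"]   # ~50,000 words in practice
doc = "students study hard and learn quickly"
present = set(doc.lower().split())
x = [1 if word in present else 0 for word in dictionary]
print(x)   # [0, 1, 1, 0]
```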
-
6:22 - 6:24When you apply PCA
-
6:24 - 6:26to this sort of data,
-
6:26 - 6:28the resulting
-
6:28 - 6:35algorithm often just goes by a different name: latent semantic indexing.
-
6:41 - 6:44For the sake of completeness, I should say that
-
6:44 - 6:48in LSI, you usually skip the preprocessing step.
-
7:06 - 7:11For various reasons, in LSI, you usually don't normalize the mean of the data to zero,
-
7:11 - 7:14and you usually don't normalize the variance of the features to one.
-
7:14 - 7:18These are relatively minor
-
7:18 - 7:21differences, it turns out, so it does something very
-
7:21 - 7:25similar to PCA.
-
7:25 - 7:26
-
7:26 - 7:27Normalizing the variance to one
-
7:27 - 7:37for text data would actually be a bad idea, because that would have the effect of
-
7:37 - 7:39dramatically scaling up the
-
7:39 - 7:44weight of rarely occurring words. So for example, the word aardvark hardly ever
-
7:44 - 7:46appears in any document.
-
7:46 - 7:49So to normalize the variance
-
7:49 - 7:58of that feature to one, you'd end up scaling up the weight of the word aardvark dramatically. I don't understand why [inaudible].
-
7:58 - 8:01So let's
-
8:01 - 8:11see. In natural language processing, something that we want to do quite often is, given two documents,
-
8:13 - 8:20XI and XJ, to measure how similar they are.
-
8:20 - 8:22So for example,
-
8:22 - 8:25I may give you a document and ask
-
8:25 - 8:29you to find me more documents like this one. We're reading some
-
8:29 - 8:31article about some news event of today
-
8:31 - 8:34and want to find out what other news articles there are. So I give you a document and
-
8:35 - 8:37ask you to look at all the other documents you have
-
8:37 - 8:41in this large set of documents and find the documents similar to
-
8:41 - 8:42this.
-
8:44 - 8:45So
-
8:45 - 8:57this is a typical text application. So to measure the similarity between two documents XI and XJ, remember that
-
8:57 - 9:00each of these documents is represented
-
9:00 - 9:03as one of these high-dimensional vectors.
-
9:03 - 9:09One common way to do this is to view each of your documents
-
9:09 - 9:13as some sort of very high-dimensional vector.
-
9:13 - 9:14So these
-
9:14 - 9:18are vectors in the very high-dimensional space where
-
9:18 - 9:20the dimension of the vector is equal to
-
9:20 - 9:27the number of words in your dictionary.
-
9:27 - 9:31So maybe each of these documents lives in some
-
9:31 - 9:3450,000-dimension space, if you have 50,000 words in your
-
9:34 - 9:37dictionary. So one measure of the similarity between these two documents that's
-
9:37 - 9:40often used is
-
9:40 - 9:41what's the angle
-
9:41 - 9:43between these two
-
9:43 - 9:50documents.
-
9:51 - 9:53In particular,
-
9:53 - 9:56if the angle between these two vectors is small, then
-
9:56 - 10:00the two documents, we'll consider them to be similar. If the angle between
-
10:00 - 10:03these two vectors is large, then we consider the documents to be dissimilar.
-
10:03 - 10:06So
-
10:06 - 10:23more formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. For the values we care about, the cosine is a decreasing function of theta.
-
10:23 - 10:24So the
-
10:24 - 10:30smaller the angle between them, the larger the similarity.
-
10:30 - 10:47The cosine between two vectors is, of course, just the inner product, XI transpose XJ, divided by the product of the norms of the two vectors - okay?
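(A small sketch of that similarity measure in Python/NumPy; the two toy bag-of-words vectors are made up for illustration.)

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two document vectors."""
    return float(xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

xi = np.array([1.0, 0.0, 1.0, 0.0])   # toy document i
xj = np.array([1.0, 1.0, 0.0, 0.0])   # toy document j, one word in common
print(cosine_similarity(xi, xj))      # 0.5
```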
-
10:47 - 10:56That's just the standard linear algebra or geometry definition of the cosine between two vectors. Here's the
-
11:04 - 11:11intuition behind what LSI is doing. The hope, as usual, is
-
11:18 - 11:21that there
-
11:21 - 11:25may be some interesting axis of variations in the data,
-
11:25 - 11:28and there maybe some other axis that
-
11:28 - 11:29are just
-
11:29 - 11:34noise. So by projecting all of your data onto a lower-dimensional subspace, the hope is that by
-
11:34 - 11:38running PCA on your text data this way, you can remove some of the noise in the data and
-
11:38 - 11:42get better measures of the similarity between pairs of
-
11:42 - 11:45documents. Let's just delve a little deeper into an example to convey more intuition about what LSI
-
11:45 - 11:47is doing.
-
11:47 - 11:48So
-
11:48 - 11:55look further at the definition of the cosine similarity measure. So
-
11:56 - 12:29the numerator of the similarity between the two documents was this inner product, XI transpose XJ, which is therefore the sum over K of XIK times XJK. So
-
12:30 - 12:34this inner product would be equal to zero if
-
12:34 - 12:37the two documents have no words in common. So
-
12:37 - 12:58this is really the sum over K of the indicator of whether documents I and J both contain the word K, because
-
12:58 - 13:03I guess XIK indicates whether document I contains the word
-
13:03 - 13:04K, and XJK
-
13:04 - 13:08indicates whether document J contains the word, K.
-
13:08 - 13:10So the product would be one only
-
13:10 - 13:12if the word K
-
13:12 - 13:15appears in both documents.
-
13:15 - 13:18Therefore, the similarity between these two documents would be
-
13:18 - 13:23zero if the two documents have no words in common.
-
13:23 - 13:30For example,
-
13:31 - 13:35suppose your document,
-
13:35 - 13:43XI, has the word study, and the document XJ has the word learn.
-
13:43 - 13:48Then these two documents may be considered
-
13:48 - 13:49entirely dissimilar.
-
13:50 - 14:04Say one document is about effective study strategies - sometimes you read a news article about that. So you ask, what other documents are similar to this? If there are a bunch of other documents about good methods to learn, then there are no words in common, so the similarity measure is zero.
-
14:04 - 14:07So here's
-
14:07 - 14:09a cartoon
-
14:09 - 14:11of what we hope
-
14:11 - 14:13[inaudible] PCA will do,
-
14:13 - 14:14which is
-
14:14 - 14:17suppose that on the horizontal axis, I plot
-
14:17 - 14:21the word
-
14:21 - 14:25learn, and on the vertical axis, I plot the word study.
-
14:25 - 14:30So the values take on either the value of zero or one. So if a document
-
14:30 - 14:33contains the words learn but not study, then
-
14:33 - 14:35it'll plot that document there. If
-
14:35 - 14:38a document contains neither the word study nor learn, then it'll plot that
-
14:38 - 14:41at zero, zero.
-
14:41 - 14:44So here's a cartoon behind what PCA
-
14:44 - 14:47is doing, which is
-
14:47 - 14:58we identify a lower dimensional subspace. That would be, say, the eigenvector we get out of PCA.
-
14:58 - 15:03Now, suppose we have a document about learning. We have a document about studying.
-
15:03 - 15:05The document about learning
-
15:05 - 15:19points to the right. The document about studying points up. So the inner product, or the cosine of the angle, between these two documents will be zero.
-
15:19 - 15:20So these two
-
15:20 - 15:22documents are entirely unrelated,
-
15:22 - 15:25which is not what we want.
-
15:25 - 15:33Documents about studying and documents about learning are related. But if we take these two documents and project them
-
15:33 - 15:36onto this subspace.
-
15:38 - 15:41Then these two documents now become much
-
15:41 - 15:43closer together,
-
15:43 - 15:45and the algorithm will recognize that
-
15:45 - 15:48when you say the inner product between these two documents,
-
15:48 - 15:51you actually end up with a positive number. So
-
15:51 - 15:52LSI enables
-
15:52 - 15:56our algorithm to recognize that these two documents have some positive similarity
-
15:56 - 15:59between them.
-
15:59 - 16:01So that's just intuition
-
16:01 - 16:02about what
-
16:02 - 16:05PCA may be doing to text data.
-
16:05 - 16:16The same thing goes for other examples beyond the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians.
-
16:18 - 16:21That will also bring the documents closer together,
-
16:21 - 16:32or just any related topics - they end up mapped to points closer together in this lower dimensional space.
-
16:33 - 16:38Question about this? Interviewee: [Inaudible].
-
16:38 - 16:44Which ones?
-
16:44 - 16:51This one? No, the line. Oh, this one. Oh,
-
16:54 - 16:54yes.
-
16:54 - 17:01Thank you.
-
17:02 - 17:06[Inaudible].
-
17:06 - 17:07So
-
17:07 - 17:14let's talk about how to actually implement this now.
-
17:17 - 17:21Okay. How many of you know what
-
17:21 - 17:25an SVD or singular value decomposition is? Wow,
-
17:25 - 17:26that's a lot of you. That's a
-
17:26 - 17:28lot more than I thought.
-
17:28 - 17:35Curious. Did you guys learn it as undergrads or as graduate students?
-
17:37 - 17:38All
-
17:38 - 17:40right. Let
-
17:40 - 17:44me talk about it anyway. I
-
17:44 - 17:48wasn't expecting so many of you to know what SVD is, but I want to get this
-
17:48 - 17:51on tape, just so everyone else can learn
-
17:51 - 17:55about this, too.
-
17:55 - 17:59So I'll say a little bit about how to implement
-
17:59 - 18:02PCA. The problem I
-
18:02 - 18:09was alluding to just now was that when you have these very high-dimensional vectors, then sigma is a large matrix. In particular, for
-
18:09 - 18:12our
-
18:12 - 18:13text example,
-
18:13 - 18:18if the vectors XI are 50,000 dimensional,
-
18:18 - 18:24then
-
18:24 - 18:27the covariance matrix will be 50,000 dimensional by 50,000
-
18:27 - 18:33dimensional, which is much too big to represent explicitly.
-
18:33 - 18:36I guess many
-
18:36 - 18:41of you already know this, but I'll just say it anyway. It
-
18:41 - 18:48turns out there's another way to implement PCA, which is
-
18:49 - 18:49if
-
18:49 - 18:53A is any N by N matrix,
-
18:53 - 18:57then one of the most remarkable results of linear algebra
-
18:57 - 19:01is that the matrix, A,
-
19:01 - 19:04can be decomposed into
-
19:04 - 19:07a singular value
-
19:07 - 19:10decomposition. What that means is that the matrix, A, which
-
19:10 - 19:12is
-
19:12 - 19:13N by N,
-
19:13 - 19:15can always be decomposed into a product of
-
19:15 - 19:18three matrices. U is N by N,
-
19:18 - 19:21D is a square matrix, which is N by N, and V is
-
19:24 - 19:27also N by N.
-
19:27 - 19:31D
-
19:31 - 19:35is
-
19:35 - 19:37going to be diagonal.
-
19:43 - 19:45Zeros are on the off-diagonals,
-
19:45 - 19:52and the values sigma I are called the singular values of
-
19:54 - 19:58the matrix A.
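(A quick numerical check of that statement, as a sketch in Python/NumPy rather than the Matlab call mentioned below; the 5-by-5 size is arbitrary.)

```python
import numpy as np

A = np.random.randn(5, 5)
U, d, Vt = np.linalg.svd(A)        # d holds the singular values, in decreasing order
D = np.diag(d)
print(np.allclose(A, U @ D @ Vt))  # True: U D V^T reconstructs A
```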
-
20:02 - 20:05Almost all of you said you learned this as graduate students rather than as undergrads, so
-
20:05 - 20:07it turns out that
-
20:07 - 20:20when you take a class in undergraduate linear algebra, usually you learn a bunch of decompositions. So you usually learn about the QR decomposition, maybe the LU factorization of matrices. Most undergrad courses don't get to talk about singular value
-
20:20 - 20:36decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class.
-
20:36 - 20:50So personally, I've computed an SVD dozens of times in the last year, but as for LU and QR decompositions, I think I used the QR decomposition once and an LU decomposition once in the last year. So let's see. I'll
-
20:50 - 20:54say
-
20:54 - 20:55a
-
20:55 - 21:02bit
-
21:02 - 21:05more about this. So I'm going to draw the picture, I guess.
-
21:05 - 21:08For example,
-
21:08 - 21:11if A is an N by N matrix,
-
21:11 - 21:18it can be decomposed into another matrix, U, which is also N by N. It's the same
-
21:22 - 21:25size, D, which is
-
21:25 - 21:30N by
-
21:30 - 21:35N. Another square matrix, V transpose, which is also
-
21:35 - 21:37N by
-
21:37 - 21:43N. Furthermore, in a singular value decomposition, the
-
21:43 - 22:08columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.
-
22:08 - 22:12
-
22:12 - 22:17To compute it, you just use the SVD commands
-
22:17 - 22:21in
-
22:21 - 22:22Matlab
-
22:22 - 22:23or
-
22:23 - 22:34Octave. Today, the state of the art in numerical linear algebra is that SVDs, singular value decompositions, of matrices can be computed extremely reliably. We've
-
22:34 - 22:36used a
-
22:36 - 22:38package like Matlab or Octave
-
22:38 - 22:40to compute, say, the eigenvectors of a matrix.
-
22:40 - 22:58It turns out SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed. You can compute the square
-
22:58 - 23:04root of something and just not worry about it. You know the computer will give you the right
-
23:04 - 23:11answer. For most reasonably-sized matrices, even up to thousands-by-thousands matrices, the SVD routine - I think of it as a square root function. If
-
23:11 - 23:15you call it, it'll give you back the right answer. You don't have to worry
-
23:15 - 23:17too much about
-
23:17 - 23:26it. If you have extremely large matrices, like a million by a million matrices, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is
-
23:26 - 23:29implemented very well today.
-
23:29 - 23:31[Inaudible].
-
23:31 - 23:35What's the complexity of SVD? That's a good question. I actually don't know. I want to guess it's roughly on the
-
23:35 - 23:36order of N-cubed.
-
23:36 - 24:00I'm not sure. [Inaudible] algorithms, so I think - I don't know what's known about the convergence of these algorithms. The example I drew out was for a fat matrix, a matrix that's wider than it is tall.
-
24:00 - 24:03In the same way, you can also
-
24:03 - 24:08call SVD on a tall matrix, one that's taller than it is wide.
-
24:08 - 24:15It would decompose it into - okay? A
-
24:21 - 24:24product of three matrices like
-
24:24 - 24:28that.
-
24:28 - 24:31The nice thing about this is that
-
24:31 - 24:33we can use it to compute
-
24:33 - 24:40eigenvectors and PCA very efficiently.
-
24:42 - 24:47In particular, a
-
24:47 - 24:53covariance matrix sigma was
-
24:53 - 24:55this. It was the sum of the outer products of the examples,
-
24:55 - 25:01so if you go back and recall the definition of the
-
25:01 - 25:02design matrix -
-
25:02 - 25:05I think I described this in
-
25:05 - 25:08lecture two when
-
25:08 - 25:14we derived the closed-form solution to least squares. The
-
25:14 - 25:18design matrix was this matrix where I took my
-
25:18 - 25:21examples
-
25:21 - 25:26and stacked them in
-
25:26 - 25:28rows.
-
25:28 - 25:33We call this the design matrix, X. So if
-
25:33 - 25:39you construct the design matrix, then
-
25:39 - 25:43the covariance matrix sigma
-
25:43 - 26:08can be written as just X transpose X. That's X transpose times X.
-
26:16 - 26:17Okay?
-
26:17 - 26:19I hope you see why the X transpose X gives you
-
26:22 - 26:40the sum of outer products of vectors. If you aren't seeing this right now, just go home and convince yourself that it's true.
-
26:40 - 26:47To get the top K eigenvectors of
-
26:51 - 26:51sigma,
-
26:51 - 26:56you would take sigma
-
26:56 - 26:57and decompose it
-
26:57 - 27:00using
-
27:00 - 27:02the -
-
27:02 - 27:07excuse me.
-
27:07 - 27:12You would take the matrix X, and you would compute its SVD. So you get USV transpose.
-
27:12 - 27:27Then the top K columns of U are the top K eigenvectors
-
27:27 - 27:29of
-
27:29 - 27:32X transpose
-
27:32 - 27:34X,
-
27:34 - 27:38which is therefore, the top K eigen vectors of your
-
27:38 - 27:40covariance matrix
-
27:40 - 27:42sigma. So
-
27:42 - 28:02in our example, if you have 50,000 words in your dictionary, then the design matrix would be in R, M by 50,000 - say 100 by 50,000, if you have 100 examples.
-
28:02 - 28:03So X would be
-
28:03 - 28:07quite tractable to represent and compute the
-
28:07 - 28:10SVD, whereas the matrix sigma would be much harder to represent. This is
-
28:10 - 28:1250,000 by 50,000.
-
28:12 - 28:16So this gives you an efficient way to implement
-
28:16 - 28:18PCA.
-
28:18 - 28:21The reason I want to talk about this is
-
28:21 - 28:30in previous years, I didn't talk about this. In the class projects, I found a number of students trying to implement SVD on huge problems and running into trouble,
-
28:30 - 28:32so this is
-
28:32 - 28:35a much better way to implement PCA
-
28:35 - 28:38if you have extremely high dimensional data. If you have
-
28:38 - 28:40low dimensional data, if
-
28:40 - 28:44you have 50 or 100 dimensional data, then
-
28:44 - 28:45computing sigma's no problem. You can
-
28:45 - 28:51do it the old way, but otherwise, use the SVD to implement this.
-
28:53 - 29:00Questions about this?
-
29:26 - 29:31The last thing I want to say is that in practice, when you want to implement this, I want to say a note
-
29:31 - 29:32of caution.
-
29:32 - 29:35It turns out that
-
29:35 - 29:39for many applications of - let's see.
-
29:39 - 29:42When you apply SVD to these wide - yeah.
-
29:42 - 29:49Just a quick question. Are they the top K columns of U or of V? Because X transpose X is V D squared V transpose, right? Let's see. Oh,
-
29:49 - 29:53yes.
-
29:53 - 29:55I think you're
-
29:55 - 29:56right. I
-
29:56 - 30:03think you're right.
-
30:04 - 30:06Let's
-
30:06 - 30:13see. Is it top K columns of U or top K of V?
-
30:15 - 30:22Yeah, I think you're right. Is that right? Something
-
30:32 - 30:39bothers me about that, but I think you're right.
-
30:40 - 30:53[Inaudible], so then X transpose X should be V D D V transpose. X is U D V transpose, so X transpose X would be V D U transpose U D V transpose - [inaudible].
-
31:01 - 31:03If anyone thinks about this and
-
31:03 - 31:06has another opinion, let me know, but I think you're
-
31:06 - 31:11right. I'll make sure I get the details and let you
-
31:11 - 31:17know.
-
31:17 - 31:22Everyone's still looking at that.
-
31:22 - 31:23Tom, can you figure out
-
31:23 - 31:25the right answer and let me know? That sounds right. Okay. Cool. Okay.
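(Consistent with the correction just discussed - with the examples stacked as rows of X, we have X transpose X = V D squared V transpose, so the principal directions come out as the columns of V - here is a sketch of this SVD-based implementation in Python/NumPy; the sizes are illustrative, not values from the lecture.)

```python
import numpy as np

def pca_via_svd(X, K):
    """PCA without ever forming the huge matrix Sigma = X^T X.
    X is m x n with one (already preprocessed) example per row."""
    X = X - X.mean(axis=0)                        # zero-mean the data
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    # X = U D V^T  =>  X^T X = V D^2 V^T, so the top-K eigenvectors of
    # X^T X (and hence of the covariance matrix) are the first K columns of V.
    return Vt[:K].T                               # n x K principal directions

X = np.random.randn(100, 5000)                    # e.g. 100 documents, 5,000 words
V_K = pca_via_svd(X, K=50)
projections = (X - X.mean(axis=0)) @ V_K          # 100 x 50 reduced representation
```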
-
31:25 - 31:30So just
-
31:30 - 31:34one last note, a note of caution. It turns out that
-
31:34 - 31:37in this example, I was implementing SVD
-
31:37 - 31:44with a wide matrix. So the matrix X was M by N, with N much larger than M.
-
31:44 - 31:47It turns out when you
-
31:47 - 31:52find the SVD decomposition of this,
-
31:52 - 31:58it turns out that -
-
31:58 - 32:01let's see. Yeah, I think you're definitely
-
32:01 - 32:15right. So it turns out that when we find the SVD of this, the right-most portion of this block of the matrix would be all zeros.
-
32:15 - 32:18Also, when you compute the
-
32:18 - 32:22matrix, D, a large part of this matrix would be zeros.
-
32:22 - 32:24You have the matrix D
-
32:24 - 32:30transpose. So depending on what convention you use,
-
32:30 - 32:36for example, I think Matlab actually uses a convention of just
-
32:36 - 32:43cutting off the zero elements.
-
32:48 - 32:51So the Matlab uses the convention
-
32:51 - 32:54of chopping off the right-most half of the U matrix and chopping off the bottom
-
32:54 - 32:57portion of the D matrix. I'm not sure
-
32:57 - 33:01if this even depends on the version of Matlab,
-
33:01 - 33:04but when you call SVD on Matlab or some other numerical algebra packages,
-
33:04 - 33:08there's slightly different conventions of how to define your SVD when
-
33:08 - 33:10the matrix is wider than it is tall.
-
33:10 - 33:13So just watch out for this and make sure you map
-
33:13 - 33:15whatever convention
-
33:15 - 33:18your numerical algebra library uses
-
33:18 - 33:23to the original computations.
-
33:23 - 33:30It turns out, whether you're using Matlab or you're writing C code, there are many scientific libraries that can
-
33:30 - 33:32
-
33:32 - 33:36compute SVDs for you, but they're just slightly different in
-
33:36 - 33:39conventions for the dimensions of these matrices. So just make sure you figure this out for the
-
33:39 - 33:44package that you use.
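(For example, in NumPy the analogous convention is controlled by a flag; a small sketch with made-up sizes, just to show how the returned shapes differ between the two conventions.)

```python
import numpy as np

X = np.random.randn(3, 7)                              # a wide matrix
U, d, Vt = np.linalg.svd(X)                            # "full" convention
Ue, de, Vte = np.linalg.svd(X, full_matrices=False)    # "economy" convention
print(U.shape, Vt.shape)     # (3, 3) (7, 7)  - only the first 3 rows of V^T get used
print(Ue.shape, Vte.shape)   # (3, 3) (3, 7)  - the unused part chopped off
```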
-
33:44 - 33:46Finally, I just want to
-
33:46 - 33:50take the unsupervised learning algorithms we talked about and just put a little bit
-
33:50 - 33:52of broader context.
-
33:52 - 33:56This is partly in response to the questions I've gotten from students in
-
33:56 - 33:58office hours and elsewhere
-
33:58 - 34:01about when to use each of these
-
34:01 - 34:01algorithms. So
-
34:01 - 34:06I'm going to draw a two by two matrix. This is a little cartoon that
-
34:08 - 34:11I find useful.
-
34:15 - 34:20One of the algorithms we talked about earlier, right
-
34:20 - 34:21before this, was
-
34:21 - 34:24factor analysis, which was - it
-
34:24 - 34:28was - I hope you remember that picture I drew where I would have a bunch of points Z on the
-
34:28 - 34:29line.
-
34:29 - 34:33Then I had these ellipses that I drew. I hope you
-
34:33 - 34:35remember that
-
34:35 - 34:38picture. This was a factor analysis model
-
34:38 - 34:39which models
-
34:39 - 34:45the density of X, right? Then there was also PCA, just now.
-
34:45 - 34:46So the
-
34:46 - 34:49difference between factor analysis and PCA, the
-
34:49 - 34:52way I think about it, is that factor analysis
-
34:52 - 34:54is a density estimation algorithm.
-
34:54 - 34:58It tries to model the density of the training example's X.
-
34:58 - 35:01Whereas PCA
-
35:02 - 35:06is not a probabilistic
-
35:06 - 35:07algorithm. In particular,
-
35:07 - 35:14it does not endow your training examples with any probabilistic semantics; it directly goes and finds the subspace.
-
35:14 - 35:19So in terms of when to use factor analysis and when to use PCA,
-
35:19 - 35:22if your goal is to reduce the dimension of the data,
-
35:22 - 35:26if your goal is to find the subspace that the data lies on,
-
35:26 - 35:27then PCA
-
35:27 - 35:28directly
-
35:28 - 35:32tries to find the subspace. I think I would
-
35:32 - 35:35tend to use PCA.
-
35:35 - 35:40Factor analysis, it sort of assumes the data lies on a subspace.
-
35:40 - 35:45Let me write a subspace here.
-
35:45 - 35:50So both of these algorithms sort of assume the data maybe lies close
-
35:50 - 35:52or on some low dimensional subspace,
-
35:52 - 35:56but fundamentally, factor analysis, I think of it as a density estimation algorithm.
-
35:56 - 36:05So if I have some very high dimensional data and I want to model P of X, then factor analysis is the algorithm I'm more inclined to use. So
-
36:05 - 36:07even though you could in theory,
-
36:07 - 36:12I would tend to avoid trying to use factor analysis to
-
36:12 - 36:14identify a subspace the
-
36:14 - 36:16data
-
36:16 - 36:33set lies on. So, for instance, if you want to do anomaly detection, if you want to model P of X so that if an example has very low probability you can flag it as an anomaly, then I would tend to use factor analysis to do that density estimation. So factor
-
36:33 - 36:38analysis and PCA are both algorithms that assume that your data lies in the subspace.
-
36:38 - 36:42The other class of algorithms we talked about was
-
36:42 - 36:58algorithms that assume the data lies in clumps, or that the data has a few coherent groups. So let me just fill in the rest of this picture.
-
37:09 - 37:10So if you think your data lies
-
37:10 - 37:20in clumps or lies in groups, and if your goal is density estimation, then I would tend to use a mixture of Gaussians
-
37:20 - 37:29algorithm. But if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use a K-means algorithm. So
-
37:30 - 37:33I haven't seen anyone else draw this picture before, but I tend to organize these things
-
37:33 - 37:35this way in my brain.
-
37:35 - 37:36Hopefully this helps guide
-
37:36 - 37:40when you might use each of these algorithms as well, depending
-
37:40 - 37:45on whether you believe the data might lie on a subspace or whether it might lie in
-
37:45 - 37:48clumps or groups.
-
37:51 - 37:54All right.
-
37:54 - 38:01That wraps up the discussion on
-
38:03 - 38:09PCA. What I want to do next is talk about
-
38:09 - 38:15independent component analysis, or ICA. Yeah. Interviewee: I have
-
38:15 - 38:18a
-
38:18 - 38:40question about the upper right [inaudible]. So once you have all of the eigenvectors, [inaudible] how similar is feature I to feature J. You pick some eigenvector, and you take some dot products between feature I and feature J and the eigenvector. But there's a lot of eigenvectors to choose from. Instructor
-
38:40 - 38:42(Andrew Ng):Right. So Justin's question was
-
38:42 - 38:46having found my eigenvectors, how do I choose which eigenvector to use to
-
38:46 - 38:48measure distance. I'm
-
38:48 - 38:49going to
-
38:49 - 38:51start
-
38:51 - 38:53this up.
-
38:53 - 38:57So the
-
38:57 - 38:58answer is really
-
38:58 - 39:02- in this cartoon, I would avoid thinking about
-
39:02 - 39:04eigenvectors one at a time.
-
39:04 - 39:08A better way to view this cartoon is that this is actually -
-
39:08 - 39:12if I decide to choose 100 eigenvectors, this is really a 100-D
-
39:12 - 39:19subspace.
-
39:19 - 39:20So
-
39:20 - 39:25I'm not actually projecting my data onto one eigenvector.
-
39:25 - 39:30This arrow, this cartoon, this denotes the 100-dimensional
-
39:30 - 39:32subspace spanned by all my eigenvectors.
-
39:32 - 39:36So what I actually do is project my data onto
-
39:36 - 39:50the span, the linear span of my eigenvectors. Then I measure distances or take inner products between the projections of the two points onto that subspace. Okay.
-
39:50 - 39:54So let's talk about ICA,
-
39:54 - 39:59independent component analysis.
-
39:59 - 40:01So whereas PCA
-
40:01 - 40:03was an algorithm for finding
-
40:03 - 40:12what I call the main axes of variation in the data, in ICA we're going to try to find the independent components of variation in the data.
-
40:12 - 40:15So switch it to the laptop there, please.
-
40:15 - 40:16We'll just
-
40:16 - 40:22take a second to motivate that. I'm
-
40:22 - 40:27going to do so by
-
40:27 - 40:32- although if you put on the - okay. This is
-
40:32 - 40:37actually a slide that I showed in
-
40:37 - 40:40lecture one of the cocktail party problem.
-
40:40 - 40:43Suppose you have two speakers at a cocktail party,
-
40:43 - 40:48and you have two microphones in the room recording overlapping sets of the two conversations.
-
40:48 - 40:52Then can you separate out the two original speaker sources?
-
40:52 - 40:56So I actually played this audio as well in the very first lecture, which is
-
40:56 - 40:59suppose microphone one records this.
-
40:59 - 41:05[Recording]
-
41:13 - 41:17So the question is, these are really two speakers,
-
41:17 - 41:21speaking independently of each other. So each speaker is outputting
-
41:21 - 41:25a series of sound signals as independent of the other conversation
-
41:25 - 41:26going on in the room.
-
41:26 - 41:28So
-
41:28 - 41:32this being an unsupervised learning problem, the question is, can we take these two microphone
-
41:32 - 41:34recordings and feed it to
-
41:34 - 41:37an algorithm to find the independent components in
-
41:37 - 41:38this
-
41:38 - 41:41data? This is the output
-
41:42 - 41:49when we do so.
-
41:49 - 41:55[Recording] This is the other one. [Recording]
-
41:56 - 42:00Just for fun. [Inaudible]. These are audio clips I got
-
42:01 - 42:04from [inaudible]. Just for fun, let me play the other ones as well. This
-
42:04 - 42:11is overlapping microphone one. [Recording]
-
42:13 - 42:20Here's microphone two. [Recording]
-
42:22 - 42:24So given this as input, here's output one.
-
42:24 - 42:27
-
42:27 - 42:31[Recording]
-
42:31 - 42:34It's not perfect, but it's largely cleaned up the music.
-
42:34 - 42:41Here's number two. [Recording] Okay. Switch back to
-
42:43 - 42:45the chalkboard, please.
-
42:45 - 42:47So
-
42:47 - 42:54what I want to do now is describe an algorithm that does that.
-
42:55 - 42:58Before
-
42:58 - 43:03I actually jump into the algorithm, I want to say two minutes
-
43:03 - 43:04of
-
43:04 - 43:11CDFs, cumulative distribution functions. I know most
-
43:19 - 43:21of you know what these are, but I'm
-
43:21 - 43:24just going to remind you of what they are.
-
43:25 - 43:30Let's say you have a one-D random variable S. So suppose you have
-
43:30 - 43:36a random variable, S,
-
43:36 - 43:41and suppose it has a probability density function, P sub S.
-
43:41 - 43:43Then
-
43:43 - 43:46the CDF
-
43:46 - 43:50is defined as a function, or rather as F,
-
43:50 - 43:54which is the probability that the random variable,
-
43:54 - 43:56S, is less than the value
-
43:56 - 43:59given by that lower-case
-
43:59 - 44:00value,
-
44:00 - 44:02S.
-
44:02 - 44:03For example,
-
44:03 - 44:20if this is your Gaussian density - the density of the Gaussian is usually denoted lower-case phi; that's roughly a bell-shaped density - then the CDF of the Gaussian
-
44:20 - 44:22will look something like this.
-
44:22 - 44:27It's usually denoted by the capital letter Phi. So if I pick a value
-
44:27 - 44:29S like that, then the
-
44:29 - 44:30height of this -
-
44:30 - 44:33this is [inaudible] probability that
-
44:33 - 44:35my Gaussian random variable is less than
-
44:35 - 44:37that value there. In other words,
-
44:37 - 44:41the height of the function at that point is
-
44:41 - 44:44less
-
44:44 - 44:46than the area of the Gaussian density,
-
44:46 - 44:48up to the point S.
-
44:48 - 44:49As you
-
44:49 - 44:53move further and further to the right, this function will approach one, as
-
44:53 - 45:00you integrate more and more of this area of the Gaussian. So another way to write
-
45:05 - 45:42F of S is the integral from minus infinity to S of the density, P of T, DT.
-
45:42 - 45:44So something that'll come later is
-
45:44 - 45:48suppose I have a random variable, S, and I want to model the distribution of the random
-
45:48 - 45:49variable, S.
-
45:49 - 45:53So one thing I could do is I can specify
-
45:53 - 45:57what I think the density
-
45:57 - 45:58is.
-
45:58 - 46:03Or I can specify
-
46:03 - 46:04what the
-
46:04 - 46:08CDF
-
46:08 - 46:11is. These are related by this equation. F is the integral of P of S. You
-
46:11 - 46:14can also
-
46:14 - 46:16recover the density
-
46:16 - 46:20by taking the CDF and taking the derivative. So F prime, take the derivative
-
46:20 - 46:22of the CDF,
-
46:22 - 46:23you get back the
-
46:23 - 46:25density. So this has come up
-
46:25 - 46:28in the middle of when I derive ICA, which is that
-
46:28 - 46:32there'll be a step where they need to assume a distribution for random variable, S.
-
46:32 - 46:36I can either specify the density for S directly, or I can specify the CDF. I
-
46:36 - 46:39choose to specify the
-
46:40 - 46:42CDF.
-
46:42 - 46:47It has to be some function increasing from zero to one.
-
46:47 - 46:48So you can
-
46:48 - 46:51choose any function that looks like that, and in particular,
-
46:52 - 47:04pulling functions out of a hat that look like that. You can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution, or the density, for the random variable S. So
-
47:04 - 47:05this
-
47:05 - 47:12will come up later.
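(Written out in symbols, the CDF-density relationship and the sigmoid choice just mentioned take the following form; the notation g for the sigmoid is the one used later in the derivation.)

```latex
F(s) = P(S \le s) = \int_{-\infty}^{s} p_S(t)\,dt, \qquad p_S(s) = F'(s),
```
```latex
\text{sigmoid choice:}\quad F(s) = g(s) = \frac{1}{1 + e^{-s}}, \qquad
p_S(s) = g'(s) = g(s)\bigl(1 - g(s)\bigr).
```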
-
47:30 - 47:34Just as a check, raise your hand if that is familiar to you, if you've seen
-
47:34 - 47:41that before. Great. So
-
47:42 - 47:43let's
-
47:43 - 47:49start to derive our ICA, or independent component analysis,
-
47:49 - 47:50algorithm.
-
47:50 - 47:54Let's assume that the
-
47:56 - 48:00data comes from
-
48:00 - 48:02N original
-
48:02 - 48:03sources.
-
48:03 - 48:07So let's say there are N speakers in a cocktail party.
-
48:07 - 48:10So the original sources, I'm
-
48:10 - 48:11going to write as a vector, S
-
48:11 - 48:14as in RN.
-
48:14 - 48:17So just to be concrete about what I mean about that, I'm going to use
-
48:17 - 48:22SIJ to denote the signal
-
48:22 - 48:26from speaker
-
48:27 - 48:30J
-
48:30 - 48:33at time
-
48:33 - 48:34I. Here's what I mean.
-
48:34 - 48:38So what is sound? When you hear sound waves, sound is created
-
48:38 - 48:39by a pattern
-
48:39 - 48:43of expansions and compressions in air. So the way you're hearing my voice is
-
48:43 - 48:45my
-
48:45 - 48:48mouth is causing certain
-
48:48 - 48:51changes in the air pressure, and then your ear is hearing my voice as
-
48:51 - 48:54detecting those changes in air
-
48:54 - 48:58pressure. So what a microphone records, what my mouth is generating, is
-
48:58 - 48:59a pattern.
-
48:59 - 49:01I'm going to draw a cartoon,
-
49:01 - 49:05I guess.
-
49:05 - 49:06Changes in
-
49:06 - 49:07air pressure. So
-
49:07 - 49:11this is what sound is. You look at a microphone recording, you see these roughly periodic
-
49:11 - 49:13signals that comprise of
-
49:13 - 49:16changes in air pressure over time as the air pressure goes
-
49:16 - 49:19above and below some baseline air pressure.
-
49:19 - 49:20So this
-
49:20 - 49:22is what the speech signal looks like, say.
-
49:22 - 49:26So this is speaker one.
-
49:26 - 49:29Then what I'm saying is that
-
49:29 - 49:31- this is some time, T.
-
49:31 - 49:34What I'm saying is that the value of that point,
-
49:34 - 49:37I'm going to denote as S, super
-
49:37 - 49:40script T, sub script one.
-
49:40 - 49:42Similarly,
-
49:42 - 49:45speaker two, it's
-
49:45 - 49:47outputting some sound wave. Speaker voice
-
49:47 - 49:50will play that. It'll actually sound like
-
49:50 - 49:53a single tone, I guess.
-
49:53 - 49:56So in the same way, at the same time, T,
-
49:56 - 49:59the value of the air
-
49:59 - 50:03pressure generated by speaker two, I'll denote as
-
50:03 - 50:10ST
-
50:17 - 50:242.
-
50:30 - 50:37So we observe
-
50:38 - 50:40XI equals A times SI, where
-
50:40 - 50:43these XIs
-
50:43 - 50:46are vectors in RN.
-
50:46 - 50:50So I'm going to assume
-
50:50 - 50:53that I have N microphones,
-
50:53 - 50:54and
-
50:54 - 50:58each of my microphones records some linear combination
-
50:58 - 51:02of what the speakers are saying. So each microphone records some overlapping
-
51:02 - 51:04combination of what the speakers are saying.
-
51:04 - 51:08For
-
51:10 - 51:13example, XIJ, which is - this
-
51:13 - 51:16is what microphone J records at time, I. So
-
51:16 - 51:17by definition of
-
51:17 - 51:36the matrix multiplication, this is the sum over K of AJK times SIK. Okay?
-
51:37 - 51:41So what my J microphone is recording is
-
51:42 - 51:44some linear combination of
-
51:44 - 51:46all of the speakers. So
-
51:46 - 51:50at time I, what microphone J is recording is some linear combination of
-
51:50 - 51:53what all the speakers are saying at time I.
-
51:53 - 51:54So K here
-
51:54 - 51:58indexes over the N speakers.
-
51:58 - 52:01So our goal
-
52:03 - 52:06is to find the matrix, W, equals A inverse, and
-
52:06 - 52:10just defining W that way.
-
52:10 - 52:17So
-
52:18 - 52:21we can recover the original sources
-
52:21 - 52:23as a linear combination of
-
52:23 - 52:24our
-
52:24 - 52:31microphone recordings, XI.
-
52:33 - 52:35Just as a point of notation,
-
52:35 - 52:42I'm going to write the matrix W this way. I'm going to use
-
52:51 - 52:55lower case W subscript one, subscript two and so on to denote the rows
-
52:55 - 53:02of this matrix, W.
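(In symbols, with the notation just introduced, the model and the recovery step are:)

```latex
x^{(i)} = A\,s^{(i)}, \qquad
W = A^{-1} =
\begin{bmatrix} \text{--- } w_1^T \text{ ---} \\ \vdots \\ \text{--- } w_n^T \text{ ---} \end{bmatrix}, \qquad
s^{(i)} = W x^{(i)}, \quad s^{(i)}_j = w_j^T x^{(i)}.
```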
-
53:14 - 53:15Let's
-
53:15 - 53:19see.
-
53:19 - 53:24So let's look at why ICA is possible. Given these overlapping voices,
-
53:24 - 53:28let's think briefly why it might be possible
-
53:28 - 53:31to recover the original sources.
-
53:31 - 53:33So for the next example, I want
-
53:33 - 53:37to say
-
53:43 - 53:47- let's say that each of my speakers
-
53:47 - 53:50outputs - this will sound like white noise. Can I switch
-
53:50 - 53:53the laptop display,
-
53:53 - 53:57please? For this example, let's say that
-
53:57 - 54:01each of my speakers outputs uniform white noise. So
-
54:01 - 54:05if that's the case, these are my axis, S1 and S2.
-
54:05 - 54:09This is what my two speakers would be uttering.
-
54:09 - 54:18The plots of what they're uttering will lie in a square box if the two speakers are independently outputting uniform minus one to plus one random variables.
-
54:18 - 54:20So this is a plot of
-
54:20 - 54:24S1 and S2, my original sources.
-
54:24 - 54:29This would be a typical sample of what my microphones record. Here, at
-
54:29 - 54:31the axis, are X1 and X2.
-
54:31 - 54:35So these are images I got from [inaudible] on
-
54:35 - 54:37ICA.
-
54:39 - 54:44Given a picture like this, you can sort of look at this box, and you can sort of tell what the axis of
-
54:44 - 54:45this
-
54:45 - 54:46parallelogram
-
54:46 - 54:48are. You can figure out
-
54:48 - 54:52what linear transformation would transform the parallelogram back
-
54:52 - 54:54to a box.
-
54:54 - 54:59So it turns out there are some inherent ambiguities in ICA.
-
54:59 - 55:01I'll just say what they are.
-
55:01 - 55:02One is that
-
55:02 - 55:06you can't recover the original indexing of the sources. In particular,
-
55:06 - 55:07if
-
55:07 - 55:11I generated the data for speaker one and speaker two,
-
55:11 - 55:14you can run ICA, and then you may end up with the order of the speakers
-
55:14 - 55:18reversed. What that corresponds to is if you take this
-
55:18 - 55:22picture and you flip this picture along a 45-degree
-
55:22 - 55:26axis. You take a 45-degree axis and reflect this picture across the 45-degree axis, you'll still
-
55:26 - 55:28get a box. So
-
55:28 - 55:31there's no way for the algorithms to tell which was speaker No. 1 and
-
55:31 - 55:33which
-
55:33 - 55:38was speaker No. 2. The numbering or the ordering of the speakers is
-
55:38 - 55:41ambiguous. The other source of ambiguity, and these are the only ambiguities
-
55:41 - 55:42in this example,
-
55:42 - 55:44is the sign of the sources. So
-
55:44 - 55:49given my speakers' recordings,
-
55:49 - 55:53you can't tell whether you got a positive SI or whether you got
-
55:53 - 55:56back a negative SI.
-
55:56 - 55:58In this picture, what that corresponds to
-
55:58 - 56:02is if you take this picture, and you reflect it along the vertical axis, if
-
56:02 - 56:05you reflect it along the horizontal axis,
-
56:05 - 56:06you still get a box.
-
56:06 - 56:09You still get back [inaudible] speakers.
-
56:09 - 56:10So
-
56:10 - 56:12it turns out that in this example,
-
56:12 - 56:17you can't guarantee that you've recovered positive SI rather
-
56:17 - 56:20than negative SI.
-
56:20 - 56:22So it turns out that these are the only
-
56:22 - 56:26two ambiguities in this example. What is the permutation of the speakers, and the
-
56:26 - 56:28other is the sign of the speakers.
-
56:28 - 56:31Permutation of the speakers, there's not much you can do about that.
-
56:31 - 56:35It turns out that if you take the audio
-
56:35 - 56:36source
-
56:36 - 56:39and if you flip the sign, and you take negative S, and if you play that through a
-
56:39 - 56:44microphone it'll sound indistinguishable.
-
56:44 - 56:45So
-
56:45 - 56:48for many of the applications we care about, the sign
-
56:48 - 56:51as well as the permutation
-
56:51 - 56:55is ambiguous, but you don't really care
-
56:55 - 57:02about it. Let's switch back
-
57:04 - 57:09to
-
57:09 - 57:11chalk board, please.
-
57:11 - 57:16It turns out, and I don't want to spend too much time on this, but I do want to say it briefly.
-
57:16 - 57:17It turns out the
-
57:17 - 57:19reason why those are the only
-
57:19 - 57:26sources of ambiguity - so the ambiguities were the
-
57:26 - 57:30permutation of the speakers
-
57:30 - 57:32and the signs.
-
57:32 - 57:35It turns out that
-
57:35 - 57:40the reason these were the only ambiguities was because
-
57:40 - 57:45the SIJs were
-
57:45 - 57:47
-
57:47 - 57:51non-Gaussian. I don't want to spend too much time on this, but I'll say it briefly.
-
57:51 - 57:54Suppose my original sources, S1 and S2, were Gaussian.
-
57:54 - 57:56So
-
57:58 - 58:02suppose SI is
-
58:02 - 58:04Gaussian, with mean zero
-
58:04 - 58:07and identity covariance.
-
58:07 - 58:11That just means that each of my speakers outputs a Gaussian random variable. Here's a typical
-
58:11 - 58:13example of Gaussian
-
58:13 - 58:18data.
-
58:18 - 58:23You will recall the contours of a Gaussian distribution with identity covariance
-
58:23 - 58:25looks like
-
58:25 - 58:28this, right? The Gaussian is a
-
58:28 - 58:31spherically symmetric distribution.
-
58:31 - 58:35So if my speakers were outputting Gaussian random variables, then if
-
58:35 - 58:38I observe a linear combination of this,
-
58:38 - 58:40there's actually no way to recover the
-
58:40 - 58:43original distribution because there's no way for me to tell
-
58:43 - 58:46if the axes are at this angle or if they're at
-
58:46 - 58:48that angle and so
-
58:48 - 58:52on. The Gaussian is a rotationally symmetric
-
58:52 - 58:57distribution, so I would not be able to recover the orientation or the
-
58:57 - 58:59rotation
-
58:59 - 59:02of this. So I don't want to prove this too much. I don't want to spend too much time dwelling on this, but it turns
-
59:02 - 59:03out
-
59:03 - 59:05if your source is a Gaussian,
-
59:05 - 59:08then it's actually impossible to do
-
59:08 - 59:12ICA. ICA relies critically on your data being non-Gaussian because if the data
-
59:12 - 59:17were Gaussian, then the rotation of the data would be ambiguous. So
-
59:17 - 59:19regardless of how much data you have,
-
59:19 - 59:24even if you had infinitely large amounts of data, you would not be able to recover
-
59:24 - 59:27the matrix A or W.
-
59:33 - 59:40Let's go ahead and derive the algorithm.
-
59:57 - 60:01To do this, I need just one more result, and then the derivation will be
-
60:03 - 60:11three lines. Let's say S is a random variable in RN, which is the joint vector of the sounds that all of my speakers are emitting at any time.
-
60:11 - 60:12So
-
60:12 - 60:16let's say the density of S is
-
60:16 - 60:17P subscript S,
-
60:17 - 60:20capital S.
-
60:20 - 60:31So my microphone records X equals A S, which equals W inverse S. Equivalently, S equals W times X.
-
60:31 - 60:35So let's think about what is the density of
-
60:35 - 60:38X. So I have P of S. I know the density of
-
60:38 - 60:41S, and X is a linear combination of the S's.
-
60:41 - 60:45So let's figure out what is the density of X.
-
60:45 - 60:49One thing we could do is
-
60:49 - 61:10figure out what S is and just apply the density of S to that. So let's see: S is W times X, so you might guess that the probability of X is just P sub S applied to W times X.
-
61:10 - 61:12So this is wrong.
-
61:12 - 61:15It turns out you can do this for probably mass functions but not for
-
61:15 - 61:17continuous density. So in particular,
-
61:17 - 61:21it's not correct to say that the probability of X is - well, you just figure out what
-
61:21 - 61:22S is.
-
61:22 - 61:26Then you say the probability of S is applied to that. This is wrong. You
-
61:26 - 61:28can't do this with densities.
-
61:28 - 61:31You can't say the probability of X is that, because P sub S is a probability density
-
61:31 - 61:33function.
-
61:33 - 61:34In particular,
-
61:34 - 61:36the
-
61:36 - 61:38right formula is the
-
61:38 - 61:40density of S applied to W times X,
-
61:40 - 61:42times the determinant
-
61:42 - 61:44of the matrix, W.
-
61:44 - 61:47Let me just illustrate that with an example.
-
61:47 - 61:50Let's say
-
61:50 - 61:52the
-
61:52 - 61:58density for S is that. In
-
61:58 - 62:03this example, S is uniform
-
62:03 - 62:06over the unit interval.
-
62:08 - 62:15So the density for S looks like that. It's
-
62:15 - 62:18just density for the uniform
-
62:18 - 62:21distribution of zero one.
-
62:21 - 62:24So let me let X be equal to two times
-
62:24 - 62:30S. So this means A equals two.
-
62:30 - 62:34W equals one half. So if
-
62:34 - 62:37S is a uniform distribution over zero, one,
-
62:37 - 62:40then X, which is two times that, will be the uniform distribution over the
-
62:40 - 62:43range from zero to two.
-
62:43 - 63:16So the density for X will be - that's one, that's two, and the height there is one half. Okay? The density for X will be the indicator of zero less than or equal to X less than or equal to two, times W, which is one half.
-
63:16 - 63:20So
-
63:20 - 63:22does that make
-
63:22 - 63:25sense? When you compute the density for X, because X is now spread out
-
63:25 - 63:29across a wider range. The density of X is now smaller,
-
63:29 - 63:36and therefore, the density of X has this one half
-
63:38 - 63:39term
-
63:39 - 63:43here. Okay? This is an illustration for the case of one-dimensional random variables,
-
63:43 - 63:44
-
63:44 - 64:01of S and X being one-dimensional. I'm not going to show it, but the generalization of this to vector-valued random variables is that the density of X is given by P sub S of W X times the determinant of the matrix W. Over here, I showed the one-dimensional version of this generalization.
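(In symbols, the vector-valued rule being stated, with the one-dimensional example above as a sanity check:)

```latex
p_X(x) = p_S(Wx)\,\lvert\det W\rvert,
\qquad\text{1-D example: } x = 2s,\ W = \tfrac{1}{2},\quad
p_X(x) = \mathbf{1}\{0 \le x \le 2\}\cdot\tfrac{1}{2}.
```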
-
64:21 - 64:28So we're nearly there. Here's
-
64:29 - 64:34how I can implement ICA.
-
64:34 - 64:37So my distribution on
-
64:37 - 64:44S,
-
64:50 - 64:53so I'm going to assume that my density on S
-
64:53 - 65:04is given by a product over the N speakers of the density of speaker I emitting a certain sound. This is a product of densities.
-
65:04 - 65:08This is a product of distributions because I'm going to assume that the
-
65:08 - 65:11speakers are having independent conversations. So the SI's independent
-
65:11 - 65:16for different values of I.
-
65:16 - 65:18So by the formula we just worked out,
-
65:18 - 65:22the density for X would be equal to that.
-
65:37 - 65:39I'll just remind you, W was A
-
65:39 - 65:43inverse. It was
-
65:43 - 65:44this matrix
-
65:44 - 65:48I defined previously
-
65:48 - 65:50so that SI
-
65:50 - 65:53equals WI transpose
-
65:53 - 65:59X. So that's what's in
-
65:59 - 66:02there. To complete my formulation for this model,
-
66:02 - 66:06the final thing I need to do is
-
66:06 - 66:10choose
-
66:10 - 66:12a density
-
66:12 - 66:14for what I think each speaker is
-
66:14 - 66:18saying. I need to assume some density over
-
66:18 - 66:22the sounds emitted by an individual speaker.
-
66:22 - 66:28So following the discussion I had right before we started to derive ICA,
-
66:28 - 66:31one thing I could do is I could choose
-
66:31 - 66:32the density for S,
-
66:32 - 66:36or equivalently, I could choose the CDF, the cumulative distribution
-
66:36 - 66:37function for
-
66:37 - 66:38S.
-
66:38 - 66:41In this case, I'm going to choose
-
66:41 - 66:45a CDF, probably for historical reasons and probably for
-
66:45 - 66:47convenience.
-
66:47 - 66:50I need to choose the CDF for S, so
-
66:50 - 66:55what that means is I just need to choose some function that increases from zero to
-
66:55 - 66:59one. I know I can't choose a Gaussian because we know you can't
-
66:59 - 67:02do ICA on Gaussian data.
-
67:02 - 67:05So I need some function increasing from zero to one
-
67:05 - 67:09that is not the cumulative distribution function for a
-
67:09 - 67:10Gaussian distribution.
-
67:10 - 67:14So what other functions do I know that increase from zero to one? I
-
67:14 - 67:16just choose the
-
67:16 - 67:18CDF to be
-
67:18 - 67:22the
-
67:22 - 67:23sigmoid function.
-
67:23 - 67:25This is a
-
67:25 - 67:27commonly-made choice that
-
67:27 - 67:31is made for convenience. There is actually no great reason for why you
-
67:31 - 67:34choose a sigmoid function. It's just a convenient function that we all know
-
67:34 - 67:35and are familiar with
-
67:35 - 67:38that happens to increase from zero to one.
-
67:38 - 67:45When you take the derivative
-
67:46 - 67:49of the sigmoid, and that will give you back
-
67:49 - 67:50your
-
67:50 - 67:55density. This is just not Gaussian. This is the main virtue of choosing the sigmoid.
-
67:55 - 68:02So
-
68:19 - 68:22there's really no rationale for the choice of sigmoid. Lots of other things will
-
68:22 - 68:24work fine, too.
-
68:24 - 68:27It's just a common, reasonable default.
-
68:38 - 68:40It turns out that
-
68:40 - 69:07one reason the sigmoid works well for a lot of data sources is that, if this is the Gaussian, and you actually take the sigmoid and take its derivative, you find that the sigmoid has heavier tails than the Gaussian. By this I mean
-
69:07 - 69:11the density of the sigmoid dies down to zero much more slowly than
-
69:11 - 69:12the
-
69:12 - 69:13Gaussian.
-
69:13 - 69:18The magnitude of the Gaussian's tails dies down as E to the minus S squared.
-
69:18 - 69:22For the sigmoid, the tails look like E to the minus
-
69:22 - 69:40S. So the tails die down as E to the minus S, rather than E to the minus S squared. It turns out that most distributions with this property of heavy tails, where the distribution decays to zero relatively slowly compared to a Gaussian, will work fine for your data.
-
69:40 - 69:44Actually, one other choice you can sometimes use is what's called the Laplacian
-
69:44 - 69:46distribution, which is
-
69:46 - 69:53that. This will work fine, too, for many data sources.
-
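The Laplacian density referred to here is presumably the standard double-exponential one (shown with unit scale; a scale parameter can be added):

    p_s(s) = \tfrac{1}{2}\, e^{-|s|}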
70:07 - 70:08Sticking with the sigmoid for now, I'll just
-
70:08 - 70:09write
-
70:09 - 70:14down the algorithm in two steps. So given
-
70:14 - 70:17my training set, and
-
70:17 - 70:21as you can see, this is an unlabeled training set, I can
-
70:21 - 70:26write down the log likelihood of my parameters. So that's - summed over my training
-
70:26 - 70:27examples, log of - times
-
70:27 - 70:34that.
-
70:43 - 70:45So that's my log
-
70:45 - 70:52likelihood.
-
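As a reference for what's on the board, in the standard sigmoid-based formulation the log likelihood is presumably

    \ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'\!\left(w_j^\top x^{(i)}\right) + \log |W| \right)

where g is the sigmoid CDF chosen above and the outer sum runs over the m training examples.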
70:53 - 70:59To learn the parameters, W, of this model, I can use stochastic gradient ascent,
-
70:59 - 71:06which is
-
71:07 - 71:09just that.
-
71:09 - 71:11It turns out, if you work through the math,
-
71:11 - 71:14let's see. If P of S
-
71:14 - 71:20is equal to the derivative of the
-
71:20 - 71:24sigmoid, then if you just work through the math to compute the derivative there. You've all
-
71:24 - 71:27done this a lot of times. I won't bother to show
-
71:27 - 71:34the details. You find that it is equal to this.
-
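For reference, the resulting stochastic gradient ascent update in the standard derivation, with 1 - 2g(W x^{(i)}) taken element-wise as a column vector and alpha the learning rate, is presumably

    W := W + \alpha \left( \bigl(1 - 2\, g\!\left(W x^{(i)}\right)\bigr)\, x^{(i)\top} + \bigl(W^\top\bigr)^{-1} \right)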
71:47 - 71:50Okay? That's just - you can work those out yourself. It's just math to
-
71:50 - 71:54compute the derivative of this with respect to
-
71:54 - 71:59W. So to summarize, given the training set,
-
71:59 - 72:02here's my stochastic gradient ascent update rule. So you run
-
72:02 - 72:06stochastic gradient ascent to learn the parameters W.
-
72:06 - 72:08After you're
-
72:08 - 72:10done, you then
-
72:12 - 72:14output SI equals
-
72:14 - 72:17WXI, and you've separated your sources
-
72:17 - 72:18of your
-
72:18 - 72:22data back out into the original independent sources.
-
72:22 - 72:26Hopefully up to only a permutation and a plus/minus
-
72:26 - 72:31sign ambiguity.
-
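To make the two steps concrete, here is a minimal NumPy sketch of the procedure just summarized - stochastic gradient ascent on the sigmoid-based log likelihood, then recovering the sources as S^(i) = W X^(i). The names (ica, X, alpha, n_epochs) are illustrative, not from the lecture; in practice you would typically center (and often whiten) the data first.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def ica(X, alpha=0.001, n_epochs=20):
        """Sketch of ICA by stochastic gradient ascent.
        X is an m x n array: m mixed recordings x^(i), each n-dimensional."""
        m, n = X.shape
        W = np.eye(n)                          # initial guess for the unmixing matrix
        for _ in range(n_epochs):
            for i in np.random.permutation(m):
                x = X[i].reshape(n, 1)         # one training example, as a column vector
                # gradient of the log likelihood for this example:
                # (1 - 2 g(W x)) x^T + (W^T)^{-1}
                grad = (1 - 2 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T)
                W = W + alpha * grad           # stochastic gradient ascent step
        S = X @ W.T                            # recovered sources, s^(i) = W x^(i)
        return W, S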
72:31 - 72:35Okay? So just switch back to the laptop, please?
-
72:35 - 72:42So we'll just wrap up with a couple of examples of applications of ICA.
-
72:42 - 72:43This is
-
72:43 - 72:47actually a picture of our TA, Katie.
-
72:47 - 72:50So one of the applications of ICA is
-
72:50 - 72:52to process
-
72:52 - 72:57various types of [inaudible] recording data, such as EEG. This
-
72:57 - 72:59is a picture of
-
72:59 - 73:02an EEG cap, in which there are a number of electrodes
-
73:02 - 73:05you place
-
73:05 - 73:08on the - in this case, on Katie's brain, on Katie's scalp.
-
73:08 - 73:13So where each electrode measures changes in voltage over time
-
73:13 - 73:15on the scalp.
-
73:15 - 73:18On the right is a typical example of EEG data
-
73:18 - 73:23where each electrode measures - just changes in voltage over
-
73:23 - 73:24time. So
-
73:24 - 73:28the horizontal axis is time, and the vertical axis is voltage. So here's the same thing,
-
73:28 - 73:30blown up a little bit.
-
73:30 - 73:33You notice there are artifacts in this
-
73:33 - 73:36data. Where the data is circled, all
-
73:36 - 73:38the
-
73:38 - 73:41electrodes seem to measure these very synchronized recordings.
-
73:41 - 73:45It turns out that when we look at EEG data as well as a number of other
-
73:45 - 73:47types of data, there are
-
73:47 - 73:52artifacts from heartbeats and from human eye blinks and so on. So the
-
73:52 - 73:55cartoon is, if you imagine placing the
-
73:55 - 73:57electrodes, or
-
73:57 - 73:58microphones, on my scalp,
-
73:58 - 74:02then each microphone is recording some overlapping combination of all the
-
74:02 - 74:05things happening in my brain or in my body.
-
74:05 - 74:08My brain has a number of different processes going on. My body's [inaudible]
-
74:08 - 74:11going on, and
-
74:11 - 74:13each electrode measures a sum
-
74:13 - 74:16of the different voices in my brain.
-
74:16 - 74:20That didn't quite come out the way I wanted it to.
-
74:20 - 74:22So we can just take this data
-
74:22 - 74:25and run ICA on it and find out the independent components - what the
-
74:25 - 74:26independent
-
74:26 - 74:30processes are going on in my brain. This is an example of running ICA.
-
74:30 - 74:33So you find that a small number of components, like those shown up there,
-
74:33 - 74:38they correspond to the heartbeat, where the arrows are - those are very periodic
-
74:38 - 74:42signals. They come on occasionally and correspond to [inaudible] components of
-
74:42 - 74:43heartbeat.
-
74:43 - 74:47You also find things like an eye blink component, corresponding to a
-
74:47 - 74:50signal generated when you blink your eyes.
-
74:50 - 74:54By doing this, you can then subtract out the heartbeat and the eye blink
-
74:54 - 74:56artifacts from the data, and now
-
74:56 - 75:01you get much cleaner data - much cleaner EEG readings. You can
-
75:01 - 75:04do further scientific studies. So this is a
-
75:04 - 75:06pretty commonly used preprocessing step
-
75:06 - 75:10that is a common application of ICA.
-
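As a hypothetical sketch of that preprocessing step (the names remove_artifacts and artifact_components are illustrative): estimate the sources with the learned unmixing matrix W, zero out the components identified as heartbeat or eye-blink artifacts, and map back into the electrode space using A = W^{-1}.

    import numpy as np

    def remove_artifacts(X, W, artifact_components):
        """X: m x n array of recordings (m time samples, n electrodes).
        W: n x n unmixing matrix learned by ICA.
        artifact_components: indices of the sources to discard."""
        S = X @ W.T                       # estimated sources, one column per component
        S[:, artifact_components] = 0.0   # drop the heartbeat / eye-blink components
        A = np.linalg.inv(W)              # mixing matrix, since W = A^{-1}
        return S @ A.T                    # cleaned recordings back in electrode space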
75:10 - 75:13The last example is
-
75:13 - 75:16an application, again, from [inaudible], as
-
75:16 - 75:21a result of running ICA on small natural image patches. Suppose I take
-
75:21 - 75:22natural images
-
75:22 - 75:26and run ICA on the data and ask what are the independent components of the data.
-
75:26 - 75:30It turns out that these are the bases you get. So this is a plot of the
-
75:30 - 75:33sources you get.
-
75:33 - 75:36This algorithm is saying that a natural image patch
-
75:36 - 75:38shown
-
75:38 - 75:40on the left
-
75:40 - 75:45is often expressed as a sum, or a linear combination, of
-
75:45 - 75:47independent sources of
-
75:47 - 75:48things that make up images.
-
75:48 - 75:53So this models natural images as generated by independent objects
-
75:53 - 75:55that generate different edges in the image.
-
75:55 - 76:01One of the fascinating things about this is that, similar to neuroscience, this has also been
-
76:01 - 76:05hypothesized as a method for how the human brain processes image
-
76:05 - 76:06data. It
-
76:06 - 76:10turns out, this is similar, in many ways, to computations
-
76:10 - 76:15happening in early visual processing in the human brain,
-
76:15 - 76:18in the mammalian
-
76:18 - 76:20brain. It's just
-
76:20 - 76:25interesting to see that edges are the independent components of images.
-
76:25 - 76:31Are there quick questions, because I'm running late? Quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew
-
76:31 - 76:32Ng): Oh,
-
76:32 - 76:35yes. For the algorithms I describe, I assume A is a square matrix.
-
76:35 - 76:39It turns out if you have more microphones than speakers, you can also apply very
-
76:39 - 76:40similar algorithms. If
-
76:40 - 76:44you have fewer microphones than speakers, it's sort of an open research problem. The odds
-
76:44 - 76:48are that if you have one male and one female speaker, but one microphone, you can
-
76:48 - 76:52sometimes sort of separate them because one is high, one is low. If you have two
-
76:52 - 76:55male speakers or two female speakers, then it's beyond the state of the art now to separate them
-
76:55 - 76:57with one
-
76:57 - 77:00microphone. It's a great research problem. Okay.
-
77:00 - 77:05Sorry about running late again. Let's close now, and we'll
-
77:05 - 77:06continue with reinforcement learning.