Lecture 15 | Machine Learning (Stanford)
-
0:10 - 0:13This presentation is delivered by the Stanford Center for Professional
-
0:13 - 0:20Development.
-
0:29 - 0:30Welcome back.
-
0:30 - 0:33What I want to do today is
-
0:33 - 0:37continue a discussion of principal components analysis, or PCA.
-
0:37 - 0:39In particular, there's
-
0:39 - 0:42one more application that I didn't get to in the last lecture on
-
0:42 - 0:44latent semantic indexing,
-
0:44 - 0:47LSI. Then I want to spend just a little time talking about
-
0:47 - 0:50how to implement PCA,
-
0:50 - 0:54especially for very large problems. In particular, I'll spend just a little bit of time talking
-
0:54 - 0:56about singular value decomposition,
-
0:56 - 0:58or the SVD implementation
-
0:58 - 1:00of principal component
-
1:00 - 1:02analysis. So the
-
1:02 - 1:07second half of today's lecture, I want to talk about a different algorithm called
-
1:07 - 1:09independent component analysis,
-
1:09 - 1:13which is, in some ways, related to PCA, but in many other ways, it
-
1:13 - 1:16also manages to accomplish
-
1:16 - 1:18very different things than PCA.
-
1:18 - 1:22So with this lecture, this will actually wrap up our discussion on
-
1:22 - 1:26unsupervised learning. The next lecture, we'll start to talk about
-
1:26 - 1:31reinforcement learning algorithms.
-
1:31 - 1:32Just to
-
1:32 - 1:37recap where we were with PCA, principal
-
1:37 - 1:39component analysis,
-
1:39 - 1:45I said that
-
1:45 - 1:49in PCA, we imagine that we have some very high dimensional data
-
1:49 - 1:51that perhaps
-
1:51 - 1:55lies approximately on some low dimensional subspace. So if you had the data set like
-
1:55 - 1:56this,
-
1:56 - 1:57you might find that
-
1:57 - 2:01that's the first principal component of the data,
-
2:01 - 2:03and that's the second
-
2:03 - 2:06component of this 2-D data.
-
2:07 - 2:09To
-
2:09 - 2:13summarize the algorithm, we have three steps. The first step of PCA
-
2:15 - 2:17was to
-
2:17 - 2:28normalize the data to zero mean and unit variance. So you subtract out the means of
-
2:28 - 2:33your training examples. So it now has zero means, and then normalize each of your features so
-
2:33 - 2:36that the variance of each feature is now one.
-
2:38 - 2:44The next step was to compute the covariance matrix of your zero-mean data. So
-
2:44 - 2:49you compute it as follows.
-
2:49 - 2:51The sum of the outer products of your training examples,
-
2:52 - 2:56and then you find the
-
2:56 - 3:03top K eigenvectors of
-
3:04 - 3:07sigma.
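(To make those three steps concrete, here is a minimal sketch in Python/NumPy, not shown in the lecture; it assumes the training examples are stacked as the rows of a matrix X, and the function and variable names are illustrative.)

```python
import numpy as np

def pca(X, K):
    """Minimal PCA sketch. X is m x n, one training example per row."""
    # Step 1: normalize each feature to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the zero-mean data.
    m = X.shape[0]
    Sigma = (X.T @ X) / m
    # Step 3: top-K eigenvectors of Sigma (largest eigenvalues first).
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
    order = np.argsort(eigvals)[::-1][:K]
    return eigvecs[:, order]                   # n x K matrix of principal components
```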
-
3:07 - 3:10So
-
3:10 - 3:13last time we saw the applications of this. For example,
-
3:13 - 3:18one of the applications was to eigenfaces, where
-
3:18 - 3:23each of your training examples, XI, is an image.
-
3:23 - 3:24So
-
3:24 - 3:27if you have
-
3:27 - 3:31100 by 100 images, if your pictures of faces are
-
3:31 - 3:33100 pixels by 100 pixels,
-
3:33 - 3:38then each of your training examples, XI,
-
3:38 - 3:40will be a 10,000 dimensional vector,
-
3:40 - 3:43corresponding to the
-
3:43 - 3:4810,000 grayscale intensity pixel values. There are 10,000 pixel values in
-
3:48 - 3:50each of your 100 by 100 images.
-
3:50 - 3:56So the eigenfaces application was where the training examples comprised
-
3:56 - 3:58pictures of faces of people.
-
3:58 - 4:00Then we ran PCA,
-
4:00 - 4:01and then
-
4:01 - 4:03to measure the distance between say
-
4:03 - 4:04a face here
-
4:04 - 4:07and a face there, we would project both
-
4:07 - 4:10of the face images onto the subspace and then
-
4:10 - 4:11measure
-
4:11 - 4:17the distance along the subspace. So in eigenfaces, you use something like 50 principal components.
-
4:17 - 4:20So
-
4:20 - 4:24the difficulty of working with problems like these is that
-
4:24 - 4:27in step two of the algorithm,
-
4:27 - 4:31we construct the covariance matrix sigma.
-
4:31 - 4:38The covariance matrix now becomes
-
4:38 - 4:43a 10,000 by 10,000 dimensional matrix, which is huge. That has
-
4:43 - 4:46100 million
-
4:46 - 4:48entries, which is huge.
-
4:49 - 5:04So we'd like to apply PCA to very, very high dimensional data with the goal of reducing the dimension. But step two of this algorithm requires constructing sigma, this extremely large matrix, which you can't do.
-
5:04 - 5:07Come back to this in a second. It turns out one of
-
5:07 - 5:09the other
-
5:09 - 5:11frequently-used applications of
-
5:11 - 5:12PCA
-
5:12 - 5:14is actually to text data.
-
5:14 - 5:16So here's what I
-
5:16 - 5:21mean. Remember our vectorial representation of emails?
-
5:21 - 5:23So this is way back
-
5:23 - 5:27when we were talking about supervised learning algorithms for a
-
5:27 - 5:29spam classification. You remember I said that
-
5:29 - 5:33given a piece of email or given a piece of text document, you
-
5:33 - 5:35can represent it using a very high-dimensional vector
-
5:35 - 5:37by taking
-
5:39 - 5:44- writing down a list of all the words in your dictionary. Somewhere you had the word
-
5:44 - 5:47learn, somewhere you have the word
-
5:47 - 5:50study
-
5:51 - 5:55and so on.
-
5:55 - 5:58Depending on whether each word appears or does not appear in your text document, you put either
-
5:58 - 5:59a one or a zero
-
5:59 - 6:14there. This is the representation we used in lecture five or lecture six for representing text documents when we were building Naive Bayes classifiers for spam classification. So it turns
-
6:14 - 6:18out one of the common applications of
-
6:18 - 6:22PCA is actually to this sort of text data representation as well.
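(As an aside, here is a tiny sketch of that 0/1 word-presence representation; the four-word dictionary and the document are made up for illustration, not taken from the lecture.)

```python
dictionary = ["aardvark", "learn", "study", "zygote"]   # ~50,000 words in practice
doc = "students study hard and learn quickly"
present = set(doc.lower().split())
x = [1 if word in present else 0 for word in dictionary]
print(x)   # [0, 1, 1, 0]
```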
-
6:22 - 6:24When you apply PCA
-
6:24 - 6:26to this sort of data,
-
6:26 - 6:28the resulting
-
6:28 - 6:35algorithm often just goes by a different name: latent semantic indexing.
-
6:41 - 6:44For the sake of completeness, I should say that
-
6:44 - 6:48in LSI, you usually skip the preprocessing step.
-
7:06 - 7:11For various reasons, in LSI, you usually don't normalize the mean of the data to zero,
-
7:11 - 7:14and you usually don't normalize the variance of the features to one.
-
7:14 - 7:18These are relatively minor
-
7:18 - 7:21differences, it turns out, so it does something very
-
7:21 - 7:25similar to PCA.
-
7:25 - 7:26
-
7:26 - 7:27Normalizing the variance to one
-
7:27 - 7:37for text data would actually be a bad idea, because that would have the effect of
-
7:37 - 7:39dramatically scaling up the
-
7:39 - 7:44weight of rarely occurring words. So for example, the word aardvark hardly ever
-
7:44 - 7:46appears in any document.
-
7:46 - 7:49So to normalize the variance
-
7:49 - 7:58of that feature to one, you'd end up scaling up the weight of the word aardvark dramatically. I don't understand why [inaudible].
-
7:58 - 8:01So let's
-
8:01 - 8:11see. In natural language processing, something that we want to do quite often is, given two documents,
-
8:13 - 8:20XI and XJ, to measure how similar they are.
-
8:20 - 8:22So for example,
-
8:22 - 8:25I may give you a document and ask
-
8:25 - 8:29you to find me more documents like this one. We're reading some
-
8:29 - 8:31article about some news event of today
-
8:31 - 8:34and want to find out what other news articles there are. So I give you a document and
-
8:35 - 8:37ask you to look at all the other documents you have
-
8:37 - 8:41in this large set of documents and find the documents similar to
-
8:41 - 8:42this.
-
8:44 - 8:45So
-
8:45 - 8:57this is a typical text application. So to measure the similarity between two documents XI and XJ, remember that
-
8:57 - 9:00each of these documents is represented
-
9:00 - 9:03as one of these high-dimensional vectors.
-
9:03 - 9:09One common way to do this is to view each of your documents
-
9:09 - 9:13as some sort of very high-dimensional vector.
-
9:13 - 9:14So these
-
9:14 - 9:18are vectors in the very high-dimensional space where
-
9:18 - 9:20the dimension of the vector is equal to
-
9:20 - 9:27the number of words in your dictionary.
-
9:27 - 9:31So maybe each of these documents lives in some
-
9:31 - 9:3450,000-dimension space, if you have 50,000 words in your
-
9:34 - 9:37dictionary. So one measure of the similarity between these two documents that's
-
9:37 - 9:40often used is
-
9:40 - 9:41what's the angle
-
9:41 - 9:43between these two
-
9:43 - 9:50documents.
-
9:51 - 9:53In particular,
-
9:53 - 9:56if the angle between these two vectors is small, then
-
9:56 - 10:00the two documents, we'll consider them to be similar. If the angle between
-
10:00 - 10:03these two vectors is large, then we consider the documents to be dissimilar.
-
10:03 - 10:06So
-
10:06 - 10:23more formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. For the values we care about, the cosine is a decreasing function of theta.
-
10:23 - 10:24So the
-
10:24 - 10:30smaller the angle between them, the larger the similarity.
-
10:30 - 10:47The cosine between two vectors is, of course, just the inner product, XI transpose XJ, divided by the product of the norms of the two vectors - okay?
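(A small sketch of that similarity measure in Python/NumPy; the two toy bag-of-words vectors are made up for illustration.)

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two document vectors."""
    return float(xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

xi = np.array([1.0, 0.0, 1.0, 0.0])   # toy document i
xj = np.array([1.0, 1.0, 0.0, 0.0])   # toy document j, one word in common
print(cosine_similarity(xi, xj))      # 0.5
```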
-
10:47 - 10:56That's just the standard linear algebra or geometry definition of the cosine between two vectors. Here's the
-
11:04 - 11:11intuition behind what LSI is doing. The hope, as usual, is
-
11:18 - 11:21that there
-
11:21 - 11:25may be some interesting axis of variations in the data,
-
11:25 - 11:28and there maybe some other axis that
-
11:28 - 11:29are just
-
11:29 - 11:34noise. So by projecting all of your data onto a lower-dimensional subspace, the hope is that by
-
11:34 - 11:38running PCA on your text data this way, you can remove some of the noise in the data and
-
11:38 - 11:42get better measures of the similarity between pairs of
-
11:42 - 11:45documents. Let's just delve a little deeper into an example to convey more intuition about what LSI
-
11:45 - 11:47is doing.
-
11:47 - 11:48So
-
11:48 - 11:55look further at the definition of the cosine similarity measure. So
-
11:56 - 12:29the numerator of the similarity between the two documents was this inner product, XI transpose XJ, which is therefore the sum over K of XIK times XJK. So
-
12:30 - 12:34this inner product would be equal to zero if
-
12:34 - 12:37the two documents have no words in common. So
-
12:37 - 12:58this is really the sum over K of the indicator of whether documents I and J both contain the word K, because
-
12:58 - 13:03I guess XIK indicates whether document I contains the word
-
13:03 - 13:04K, and XJK
-
13:04 - 13:08indicates whether document J contains the word, K.
-
13:08 - 13:10So the product would be one only
-
13:10 - 13:12if the word K
-
13:12 - 13:15appears in both documents.
-
13:15 - 13:18Therefore, the similarity between these two documents would be
-
13:18 - 13:23zero if the two documents have no words in common.
-
13:23 - 13:30For example,
-
13:31 - 13:35suppose your document,
-
13:35 - 13:43XI, has the word study, and the document XJ has the word learn.
-
13:43 - 13:48Then these two documents may be considered
-
13:48 - 13:49entirely dissimilar.
-
13:50 - 14:04Say one document is about effective study strategies - sometimes you read a news article about that. So you ask, what other documents are similar to this? If there are a bunch of other documents about good methods to learn, then there are no words in common, so the similarity measure is zero.
-
14:04 - 14:07So here's
-
14:07 - 14:09a cartoon
-
14:09 - 14:11of what we hope
-
14:11 - 14:13[inaudible] PCA will do,
-
14:13 - 14:14which is
-
14:14 - 14:17suppose that on the horizontal axis, I plot
-
14:17 - 14:21the word
-
14:21 - 14:25learn, and on the vertical axis, I plot the word study.
-
14:25 - 14:30So the values take on either the value of zero or one. So if a document
-
14:30 - 14:33contains the words learn but not study, then
-
14:33 - 14:35it'll plot that document there. If
-
14:35 - 14:38a document contains neither the word study nor learn, then it'll plot that
-
14:38 - 14:41at zero, zero.
-
14:41 - 14:44So here's a cartoon behind what PCA
-
14:44 - 14:47is doing, which is
-
14:47 - 14:58we identify a lower dimensional subspace. That would be, say, the eigenvector we get out of PCA.
-
14:58 - 15:03Now, suppose we have a document about learning. We have a document about studying.
-
15:03 - 15:05The document about learning
-
15:05 - 15:19points to the right. The document about studying points up. So the inner product, or the cosine of the angle, between these two documents will be zero.
-
15:19 - 15:20So these two
-
15:20 - 15:22documents are entirely unrelated,
-
15:22 - 15:25which is not what we want.
-
15:25 - 15:33Documents about studying and documents about learning are related. But if we take these two documents and project them
-
15:33 - 15:36onto this subspace.
-
15:38 - 15:41Then these two documents now become much
-
15:41 - 15:43closer together,
-
15:43 - 15:45and the algorithm will recognize that
-
15:45 - 15:48when you say the inner product between these two documents,
-
15:48 - 15:51you actually end up with a positive number. So
-
15:51 - 15:52LSI enables
-
15:52 - 15:56our algorithm to recognize that these two documents have some positive similarity
-
15:56 - 15:59between them.
-
15:59 - 16:01So that's just intuition
-
16:01 - 16:02about what
-
16:02 - 16:05PCA may be doing to text data.
-
16:05 - 16:16The same thing goes for other examples beyond the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians.
-
16:18 - 16:21That will also bring the documents closer together,
-
16:21 - 16:32or just any related topics - they end up mapped to points closer together in this lower dimensional space.
-
16:33 - 16:38Question about this? Interviewee: [Inaudible].
-
16:38 - 16:44Which ones?
-
16:44 - 16:51This one? No, the line. Oh, this one. Oh,
-
16:54 - 16:54yes.
-
16:54 - 17:01Thank you.
-
17:02 - 17:06[Inaudible].
-
17:06 - 17:07So
-
17:07 - 17:14let's talk about how to actually implement this now.
-
17:17 - 17:21Okay. How many of you know what
-
17:21 - 17:25an SVD or singular value decomposition is? Wow,
-
17:25 - 17:26that's a lot of you. That's a
-
17:26 - 17:28lot more than I thought.
-
17:28 - 17:35Curious. Did you guys learn it as undergrads or as graduate students?
-
17:37 - 17:38All
-
17:38 - 17:40right. Let
-
17:40 - 17:44me talk about it anyway. I
-
17:44 - 17:48wasn't expecting so many of you to know what SVD is, but I want to get this
-
17:48 - 17:51on tape, just so everyone else can learn
-
17:51 - 17:55about this, too.
-
17:55 - 17:59So I'll say a little bit about how to implement
-
17:59 - 18:02PCA. The problem I
-
18:02 - 18:09was alluding to just now was that when you have these very high-dimensional vectors, then sigma is a large matrix. In particular, for
-
18:09 - 18:12our
-
18:12 - 18:13text example,
-
18:13 - 18:18if the vectors XI are 50,000 dimensional,
-
18:18 - 18:24then
-
18:24 - 18:27the covariance matrix will be 50,000 dimensional by 50,000
-
18:27 - 18:33dimensional, which is much too big to represent explicitly.
-
18:33 - 18:36I guess many
-
18:36 - 18:41of you already know this, but I'll just say it anyway. It
-
18:41 - 18:48turns out there's another way to implement PCA, which is
-
18:49 - 18:49if
-
18:49 - 18:53A is any N by N matrix,
-
18:53 - 18:57then one of the most remarkable results of linear algebra
-
18:57 - 19:01is that the matrix, A,
-
19:01 - 19:04can be decomposed into
-
19:04 - 19:07a singular value
-
19:07 - 19:10decomposition. What that means is that the matrix, A, which
-
19:10 - 19:12is
-
19:12 - 19:13N by N,
-
19:13 - 19:15can always be decomposed into a product of
-
19:15 - 19:18three matrices. U is N by N,
-
19:18 - 19:21D is a square matrix, which is N by N, and V is
-
19:24 - 19:27also N by N.
-
19:27 - 19:31D
-
19:31 - 19:35is
-
19:35 - 19:37going to be diagonal.
-
19:43 - 19:45Zeros are on the off-diagonals,
-
19:45 - 19:52and the values sigma I are called the singular values of
-
19:54 - 19:58the matrix A.
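(A quick numerical check of that statement, as a sketch in Python/NumPy rather than the Matlab call mentioned below; the 5-by-5 size is arbitrary.)

```python
import numpy as np

A = np.random.randn(5, 5)
U, d, Vt = np.linalg.svd(A)        # d holds the singular values, in decreasing order
D = np.diag(d)
print(np.allclose(A, U @ D @ Vt))  # True: U D V^T reconstructs A
```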
-
20:02 - 20:05Almost all of you said you learned this as graduate students rather than as undergrads, so
-
20:05 - 20:07it turns out that
-
20:07 - 20:20when you take a class in undergraduate linear algebra, usually you learn a bunch of decompositions. So you usually learn about the QR decomposition, maybe the LU factorization of matrices. Most undergrad courses don't get to talk about singular value
-
20:20 - 20:36decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class.
-
20:36 - 20:50So personally, I've computed an SVD dozens of times in the last year, but as for LU and QR decompositions, I think I used the QR decomposition once and an LU decomposition once in the last year. So let's see. I'll
-
20:50 - 20:54say
-
20:54 - 20:55a
-
20:55 - 21:02bit
-
21:02 - 21:05more about this. So I'm going to draw the picture, I guess.
-
21:05 - 21:08For example,
-
21:08 - 21:11if A is an N by N matrix,
-
21:11 - 21:18it can be decomposed into another matrix, U, which is also N by N. It's the same
-
21:22 - 21:25size, D, which is
-
21:25 - 21:30N by
-
21:30 - 21:35N. Another square matrix, V transpose, which is also
-
21:35 - 21:37N by
-
21:37 - 21:43N. Furthermore, in a singular value decomposition, the
-
21:43 - 22:08columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.
-
22:08 - 22:12
-
22:12 - 22:17To compute it, you just use the SVD commands
-
22:17 - 22:21in
-
22:21 - 22:22Matlab
-
22:22 - 22:23or
-
22:23 - 22:34Octave. Today, the state of the art in numerical linear algebra is that SVDs, singular value decompositions, of matrices can be computed extremely reliably. We've
-
22:34 - 22:36used a
-
22:36 - 22:38package like Matlab or Octave
-
22:38 - 22:40to compute, say, the eigenvectors of a matrix.
-
22:40 - 22:58It turns out SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed. You can compute the square
-
22:58 - 23:04root of something and just not worry about it. You know the computer will give you the right
-
23:04 - 23:11answer. For most reasonably-sized matrices, even up to thousands-by-thousands matrices, the SVD routine - I think of it as a square root function. If
-
23:11 - 23:15you call it, it'll give you back the right answer. You don't have to worry
-
23:15 - 23:17too much about
-
23:17 - 23:26it. If you have extremely large matrices, like a million by a million matrices, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is
-
23:26 - 23:29implemented very well today.
-
23:29 - 23:31[Inaudible].
-
23:31 - 23:35What's the complexity of SVD? That's a good question. I actually don't know. I want to guess it's roughly on the
-
23:35 - 23:36order of N-cubed.
-
23:36 - 24:00I'm not sure. [Inaudible] algorithms, so I think - I don't know what's known about the convergence of these algorithms. The example I drew out was for a fat matrix, a matrix that's wider than it is tall.
-
24:00 - 24:03In the same way, you can also
-
24:03 - 24:08call SVD on a tall matrix, one that's taller than it is wide.
-
24:08 - 24:15It would decompose it into - okay? A
-
24:21 - 24:24product of three matrices like
-
24:24 - 24:28that.
-
24:28 - 24:31The nice thing about this is that
-
24:31 - 24:33we can use it to compute
-
24:33 - 24:40eigenvectors and PCA very efficiently.
-
24:42 - 24:47In particular, a
-
24:47 - 24:53covariance matrix sigma was
-
24:53 - 24:55this. It was the sum of the outer products of the examples,
-
24:55 - 25:01so if you go back and recall the definition of the
-
25:01 - 25:02design matrix -
-
25:02 - 25:05I think I described this in
-
25:05 - 25:08lecture two when
-
25:08 - 25:14we derived the closed-form solution to least squares. The
-
25:14 - 25:18design matrix was this matrix where I took my
-
25:18 - 25:21examples
-
25:21 - 25:26and stacked them in
-
25:26 - 25:28rows.
-
25:28 - 25:33We call this the design matrix, X. So if
-
25:33 - 25:39you construct the design matrix, then
-
25:39 - 25:43the covariance matrix sigma
-
25:43 - 26:08can be written as just X transpose X. That's X transpose times X.
-
26:16 - 26:17Okay?
-
26:17 - 26:19I hope you see why the X transpose X gives you
-
26:22 - 26:40the sum of outer products of vectors. If you aren't seeing this right now, just go home and convince yourself that it's true.
-
26:40 - 26:47To get the top K eigenvectors of
-
26:51 - 26:51sigma,
-
26:51 - 26:56you would take sigma
-
26:56 - 26:57and decompose it
-
26:57 - 27:00using
-
27:00 - 27:02the -
-
27:02 - 27:07excuse me.
-
27:07 - 27:12You would take the matrix X, and you would compute its SVD. So you get USV transpose.
-
27:12 - 27:27Then the top K columns of U are the top K eigenvectors
-
27:27 - 27:29of
-
27:29 - 27:32X transpose
-
27:32 - 27:34X,
-
27:34 - 27:38which is therefore, the top K eigen vectors of your
-
27:38 - 27:40covariance matrix
-
27:40 - 27:42sigma. So
-
27:42 - 28:02in our example, if you have 50,000 words in your dictionary, then the design matrix would be in R, M by 50,000 - say 100 by 50,000, if you have 100 examples.
-
28:02 - 28:03So X would be
-
28:03 - 28:07quite tractable to represent and compute the
-
28:07 - 28:10SVD, whereas the matrix sigma would be much harder to represent. This is
-
28:10 - 28:1250,000 by 50,000.
-
28:12 - 28:16So this gives you an efficient way to implement
-
28:16 - 28:18PCA.
-
28:18 - 28:21The reason I want to talk about this is
-
28:21 - 28:30in previous years, I didn't talk about this. In the class projects, I found a number of students trying to implement SVD on huge problems and running into trouble,
-
28:30 - 28:32so this is
-
28:32 - 28:35a much better way to implement PCA
-
28:35 - 28:38if you have extremely high dimensional data. If you have
-
28:38 - 28:40low dimensional data, if
-
28:40 - 28:44you have 50 or 100 dimensional data, then
-
28:44 - 28:45computing sigma's no problem. You can
-
28:45 - 28:51do it the old way, but otherwise, use the SVD to implement this.
-
28:53 - 29:00Questions about this?
-
29:26 - 29:31The last thing I want to say is that in practice, when you want to implement this, I want to say a note
-
29:31 - 29:32of caution.
-
29:32 - 29:35It turns out that
-
29:35 - 29:39for many applications of - let's see.
-
29:39 - 29:42When you apply SVD to these wide - yeah.
-
29:42 - 29:49Just a quick question. Are they the top K columns of U or of V? Because X transpose X is V D squared V transpose, right? Let's see. Oh,
-
29:49 - 29:53yes.
-
29:53 - 29:55I think you're
-
29:55 - 29:56right. I
-
29:56 - 30:03think you're right.
-
30:04 - 30:06Let's
-
30:06 - 30:13see. Is it top K columns of U or top K of V?
-
30:15 - 30:22Yeah, I think you're right. Is that right? Something
-
30:32 - 30:39bothers me about that, but I think you're right.
-
30:40 - 30:53[Inaudible], so then X transpose X should be V D D V transpose. X is U D V transpose, so X transpose X would be V D U transpose U D V transpose - [inaudible].
-
31:01 - 31:03If anyone thinks about this and
-
31:03 - 31:06has another opinion, let me know, but I think you're
-
31:06 - 31:11right. I'll make sure I get the details and let you
-
31:11 - 31:17know.
-
31:17 - 31:22Everyone's still looking at that.
-
31:22 - 31:23Tom, can you figure out
-
31:23 - 31:25the right answer and let me know? That sounds right. Okay. Cool. Okay.
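(Consistent with the correction just discussed - with the examples stacked as rows of X, we have X transpose X = V D squared V transpose, so the principal directions come out as the columns of V - here is a sketch of this SVD-based implementation in Python/NumPy; the sizes are illustrative, not values from the lecture.)

```python
import numpy as np

def pca_via_svd(X, K):
    """PCA without ever forming the huge matrix Sigma = X^T X.
    X is m x n with one (already preprocessed) example per row."""
    X = X - X.mean(axis=0)                        # zero-mean the data
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    # X = U D V^T  =>  X^T X = V D^2 V^T, so the top-K eigenvectors of
    # X^T X (and hence of the covariance matrix) are the first K columns of V.
    return Vt[:K].T                               # n x K principal directions

X = np.random.randn(100, 5000)                    # e.g. 100 documents, 5,000 words
V_K = pca_via_svd(X, K=50)
projections = (X - X.mean(axis=0)) @ V_K          # 100 x 50 reduced representation
```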
-
31:25 - 31:30So just
-
31:30 - 31:34one last note, a note of caution. It turns out that
-
31:34 - 31:37in this example, I was implementing SVD
-
31:37 - 31:44with a wide matrix. So the matrix X was M by N, with N much larger than M.
-
31:44 - 31:47It turns out when you
-
31:47 - 31:52find the SVD decomposition of this,
-
31:52 - 31:58it turns out that -
-
31:58 - 32:01let's see. Yeah, I think you're definitely
-
32:01 - 32:15right. So it turns out that when we find the SVD of this, the right-most portion of this block of the matrix would be all zeros.
-
32:15 - 32:18Also, when you compute the
-
32:18 - 32:22matrix, D, a large part of this matrix would be zeros.
-
32:22 - 32:24You have the matrix D
-
32:24 - 32:30transpose. So depending on what convention you use,
-
32:30 - 32:36for example, I think Matlab actually uses a convention of just
-
32:36 - 32:43cutting off the zero elements.
-
32:48 - 32:51So the Matlab uses the convention
-
32:51 - 32:54of chopping off the right-most half of the U matrix and chopping off the bottom
-
32:54 - 32:57portion of the D matrix. I'm not sure
-
32:57 - 33:01if this even depends on the version of Matlab,
-
33:01 - 33:04but when you call SVD on Matlab or some other numerical algebra packages,
-
33:04 - 33:08there's slightly different conventions of how to define your SVD when
-
33:08 - 33:10the matrix is wider than it is tall.
-
33:10 - 33:13So just watch out for this and make sure you map
-
33:13 - 33:15whatever convention
-
33:15 - 33:18your numerical algebra library uses
-
33:18 - 33:23to the original computations.
-
33:23 - 33:30It turns out, whether you're using Matlab or you're writing C code, there are many scientific libraries that can
-
33:30 - 33:32
-
33:32 - 33:36compute SVDs for you, but they're just slightly different in
-
33:36 - 33:39conventions for the dimensions of these matrices. So just make sure you figure this out for the
-
33:39 - 33:44package that you use.
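(For example, in NumPy the analogous convention is controlled by a flag; a small sketch with made-up sizes, just to show how the returned shapes differ between the two conventions.)

```python
import numpy as np

X = np.random.randn(3, 7)                              # a wide matrix
U, d, Vt = np.linalg.svd(X)                            # "full" convention
Ue, de, Vte = np.linalg.svd(X, full_matrices=False)    # "economy" convention
print(U.shape, Vt.shape)     # (3, 3) (7, 7)  - only the first 3 rows of V^T get used
print(Ue.shape, Vte.shape)   # (3, 3) (3, 7)  - the unused part chopped off
```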
-
33:44 - 33:46Finally, I just want to
-
33:46 - 33:50take the unsupervised learning algorithms we talked about and just put a little bit
-
33:50 - 33:52of broader context.
-
33:52 - 33:56This is partly in response to the questions I've gotten from students in
-
33:56 - 33:58office hours and elsewhere
-
33:58 - 34:01about when to use each of these
-
34:01 - 34:01algorithms. So
-
34:01 - 34:06I'm going to draw a two by two matrix. This is a little cartoon that
-
34:08 - 34:11I find useful.
-
34:15 - 34:20One of the algorithms we talked about earlier, right
-
34:20 - 34:21before this, was
-
34:21 - 34:24factor analysis, which was - it
-
34:24 - 34:28was - I hope you remember that picture I drew where I would have a bunch of points Z on the
-
34:28 - 34:29line.
-
34:29 - 34:33Then I had these ellipses that I drew. I hope you
-
34:33 - 34:35remember that
-
34:35 - 34:38picture. This was a factor analysis model
-
34:38 - 34:39which models
-
34:39 - 34:45the density of X, right? Then there was also PCA, just now.
-
34:45 - 34:46So the
-
34:46 - 34:49difference between factor analysis and PCA, the
-
34:49 - 34:52way I think about it, is that factor analysis
-
34:52 - 34:54is a density estimation algorithm.
-
34:54 - 34:58It tries to model the density of the training example's X.
-
34:58 - 35:01Whereas PCA
-
35:02 - 35:06is not a probabilistic
-
35:06 - 35:07algorithm. In particular,
-
35:07 - 35:14it does not endow your training examples with any probabilistic semantics; it directly goes and finds the subspace.
-
35:14 - 35:19So in terms of when to use factor analysis and when to use PCA,
-
35:19 - 35:22if your goal is to reduce the dimension of the data,
-
35:22 - 35:26if your goal is to find the subspace that the data lies on,
-
35:26 - 35:27then PCA
-
35:27 - 35:28directly
-
35:28 - 35:32tries to find the subspace. I think I would
-
35:32 - 35:35tend to use PCA.
-
35:35 - 35:40Factor analysis, it sort of assumes the data lies on a subspace.
-
35:40 - 35:45Let me write a subspace here.
-
35:45 - 35:50So both of these algorithms sort of assume the data maybe lies close
-
35:50 - 35:52or on some low dimensional subspace,
-
35:52 - 35:56but fundamentally, factor analysis, I think of it as a density estimation algorithm.
-
35:56 - 36:05So if I have some very high dimensional data and I want to model P of X, then factor analysis is the algorithm I'm more inclined to use. So
-
36:05 - 36:07even though you could in theory,
-
36:07 - 36:12I would tend to avoid trying to use factor analysis to
-
36:12 - 36:14identify a subspace the
-
36:14 - 36:16data
-
36:16 - 36:33set lies on. So, for instance, if you want to do anomaly detection, if you want to model P of X so that if an example has very low probability you can flag it as an anomaly, then I would tend to use factor analysis to do that density estimation. So factor
-
36:33 - 36:38analysis and PCA are both algorithms that assume that your data lies in the subspace.
-
36:38 - 36:42The other class of algorithms we talked about was
-
36:42 - 36:58algorithms that assume the data lies in clumps, or that the data has a few coherent groups. So let me just fill in the rest of this picture.
-
37:09 - 37:10So if you think your data lies
-
37:10 - 37:20in clumps or lies in groups, and if your goal is density estimation, then I would tend to use a mixture of Gaussians
-
37:20 - 37:29algorithm. But if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use a K-means algorithm. So
-
37:30 - 37:33I haven't seen anyone else draw this picture before, but I tend to organize these things
-
37:33 - 37:35this way in my brain.
-
37:35 - 37:36Hopefully this helps guide
-
37:36 - 37:40when you might use each of these algorithms as well, depending
-
37:40 - 37:45on whether you believe the data might lie on a subspace or whether it might lie in
-
37:45 - 37:48clumps or groups.
-
37:51 - 37:54All right.
-
37:54 - 38:01That wraps up the discussion on
-
38:03 - 38:09PCA. What I want to do next is talk about
-
38:09 - 38:15independent component analysis, or ICA. Yeah. Interviewee: I have
-
38:15 - 38:18a
-
38:18 - 38:40question about the upper right [inaudible]. So once you have all of the eigenvectors, [inaudible] how similar is feature I to feature J. You pick some eigenvector, and you take some dot products between feature I and feature J and the eigenvector. But there's a lot of eigenvectors to choose from. Instructor
-
38:40 - 38:42(Andrew Ng):Right. So Justin's question was
-
38:42 - 38:46having found my eigenvectors, how do I choose which eigenvector to use to
-
38:46 - 38:48measure distance. I'm
-
38:48 - 38:49going to
-
38:49 - 38:51start
-
38:51 - 38:53this up.
-
38:53 - 38:57So the
-
38:57 - 38:58answer is really
-
38:58 - 39:02- in this cartoon, I would avoid thinking about
-
39:02 - 39:04eigenvectors one at a time.
-
39:04 - 39:08A better way to view this cartoon is that this is actually -
-
39:08 - 39:12if I decide to choose 100 eigenvectors, this is really a 100-D
-
39:12 - 39:19subspace.
-
39:19 - 39:20So
-
39:20 - 39:25I'm not actually projecting my data onto one eigenvector.
-
39:25 - 39:30This arrow, this cartoon, this denotes the 100-dimensional
-
39:30 - 39:32subspace spanned by all my eigenvectors.
-
39:32 - 39:36So what I actually do is project my data onto
-
39:36 - 39:50the span, the linear span of my eigenvectors. Then I measure distances or take inner products between the projections of the two points onto that subspace. Okay.
-
39:50 - 39:54So let's talk about ICA,
-
39:54 - 39:59independent component analysis.
-
39:59 - 40:01So whereas PCA
-
40:01 - 40:03was an algorithm for finding
-
40:03 - 40:12what I call the main axes of variation in the data, in ICA we're going to try to find the independent components of variation in the data.
-
40:12 - 40:15So switch it to the laptop there, please.
-
40:15 - 40:16We'll just
-
40:16 - 40:22take a second to motivate that. I'm
-
40:22 - 40:27going to do so by
-
40:27 - 40:32- although if you put on the - okay. This is
-
40:32 - 40:37actually a slide that I showed in
-
40:37 - 40:40lecture one of the cocktail party problem.
-
40:40 - 40:43Suppose you have two speakers at a cocktail party,
-
40:43 - 40:48and you have two microphones in the room recording overlapping sets of the two conversations.
-
40:48 - 40:52Then can you separate out the two original speaker sources?
-
40:52 - 40:56So I actually played this audio as well in the very first lecture, which is
-
40:56 - 40:59suppose microphone one records this.
-
40:59 - 41:05[Recording]
-
41:13 - 41:17So the question is, these are really two speakers,
-
41:17 - 41:21speaking independently of each other. So each speaker is outputting
-
41:21 - 41:25a series of sound signals as independent of the other conversation
-
41:25 - 41:26going on in the room.
-
41:26 - 41:28So
-
41:28 - 41:32this being an unsupervised learning problem, the question is, can we take these two microphone
-
41:32 - 41:34recordings and feed it to
-
41:34 - 41:37an algorithm to find the independent components in
-
41:37 - 41:38this
-
41:38 - 41:41data? This is the output
-
41:42 - 41:49when we do so.
-
41:49 - 41:55[Recording] This is the other one. [Recording]
-
41:56 - 42:00Just for fun. [Inaudible]. These are audio clips I got
-
42:01 - 42:04from [inaudible]. Just for fun, let me play the other ones as well. This
-
42:04 - 42:11is overlapping microphone one. [Recording]
-
42:13 - 42:20Here's microphone two. [Recording]
-
42:22 - 42:24So given this as input, here's output one.
-
42:24 - 42:27
-
42:27 - 42:31[Recording]
-
42:31 - 42:34It's not perfect, but it's largely cleaned up the music.
-
42:34 - 42:41Here's number two. [Recording] Okay. Switch back to
-
42:43 - 42:45the chalkboard, please.
-
42:45 - 42:47So
-
42:47 - 42:54what I want to do now is describe an algorithm that does that.
-
42:55 - 42:58Before
-
42:58 - 43:03I actually jump into the algorithm, I want to say two minutes
-
43:03 - 43:04of
-
43:04 - 43:11CDFs, cumulative distribution functions. I know most
-
43:19 - 43:21of you know what these are, but I'm
-
43:21 - 43:24just going to remind you of what they are.
-
43:25 - 43:30Let's say you have a one-D random variable S. So suppose you have
-
43:30 - 43:36a random variable, S,
-
43:36 - 43:41and suppose it has a probability density function, P sub S.
-
43:41 - 43:43Then
-
43:43 - 43:46the CDF
-
43:46 - 43:50is defined as a function, or rather as F,
-
43:50 - 43:54which is the probability that the random variable,
-
43:54 - 43:56S, is less than the value
-
43:56 - 43:59given by that lower-case
-
43:59 - 44:00value,
-
44:00 - 44:02S.
-
44:02 - 44:03For example,
-
44:03 - 44:20if this is your Gaussian density - the density of the Gaussian is usually denoted lower-case phi; that's roughly a bell-shaped density - then the CDF of the Gaussian
-
44:20 - 44:22will look something like this.
-
44:22 - 44:27It's usually denoted by the capital letter Phi. So if I pick a value
-
44:27 - 44:29S like that, then the
-
44:29 - 44:30height of this -
-
44:30 - 44:33this is [inaudible] probability that
-
44:33 - 44:35my Gaussian random variable is less than
-
44:35 - 44:37that value there. In other words,
-
44:37 - 44:41the height of the function at that point is
-
44:41 - 44:44less
-
44:44 - 44:46than the area of the Gaussian density,
-
44:46 - 44:48up to the point S.
-
44:48 - 44:49As you
-
44:49 - 44:53move further and further to the right, this function will approach one, as
-
44:53 - 45:00you integrate more and more of this area of the Gaussian. So another way to write
-
45:05 - 45:42F of S is the integral from minus infinity to S of the density, P of T, DT.
-
45:42 - 45:44So something that'll come later is
-
45:44 - 45:48suppose I have a random variable, S, and I want to model the distribution of the random
-
45:48 - 45:49variable, S.
-
45:49 - 45:53So one thing I could do is I can specify
-
45:53 - 45:57what I think the density
-
45:57 - 45:58is.
-
45:58 - 46:03Or I can specify
-
46:03 - 46:04what the
-
46:04 - 46:08CDF
-
46:08 - 46:11is. These are related by this equation. F is the integral of P of S. You
-
46:11 - 46:14can also
-
46:14 - 46:16recover the density
-
46:16 - 46:20by taking the CDF and taking the derivative. So F prime, take the derivative
-
46:20 - 46:22of the CDF,
-
46:22 - 46:23you get back the
-
46:23 - 46:25density. So this has come up
-
46:25 - 46:28in the middle of when I derive ICA, which is that
-
46:28 - 46:32there'll be a step where they need to assume a distribution for random variable, S.
-
46:32 - 46:36I can either specify the density for S directly, or I can specify the CDF. I
-
46:36 - 46:39choose to specify the
-
46:40 - 46:42CDF.
-
46:42 - 46:47It has to be some function increasing from zero to one.
-
46:47 - 46:48So you can
-
46:48 - 46:51choose any function that looks like that, and in particular,
-
46:52 - 47:04pulling functions out of a hat that look like that. You can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution, or the density, for the random variable S. So
-
47:04 - 47:05this
-
47:05 - 47:12will come up later.
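(Written out in symbols, the CDF-density relationship and the sigmoid choice just mentioned take the following form; the notation g for the sigmoid is the one used later in the derivation.)

```latex
F(s) = P(S \le s) = \int_{-\infty}^{s} p_S(t)\,dt, \qquad p_S(s) = F'(s),
```
```latex
\text{sigmoid choice:}\quad F(s) = g(s) = \frac{1}{1 + e^{-s}}, \qquad
p_S(s) = g'(s) = g(s)\bigl(1 - g(s)\bigr).
```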
-
47:30 - 47:34Just as a check, raise your hand if that is familiar to you, if you've seen
-
47:34 - 47:41that before. Great. So
-
47:42 - 47:43let's
-
47:43 - 47:49start to derive our ICA, or independent component analysis,
-
47:49 - 47:50algorithm.
-
47:50 - 47:54Let's assume that the
-
47:56 - 48:00data comes from
-
48:00 - 48:02N original
-
48:02 - 48:03sources.
-
48:03 - 48:07So let's say there are N speakers in a cocktail party.
-
48:07 - 48:10So the original sources, I'm
-
48:10 - 48:11going to write as a vector, S
-
48:11 - 48:14as in RN.
-
48:14 - 48:17So just to be concrete about what I mean about that, I'm going to use
-
48:17 - 48:22SIJ to denote the signal
-
48:22 - 48:26from speaker
-
48:27 - 48:30J
-
48:30 - 48:33at time
-
48:33 - 48:34I. Here's what I mean.
-
48:34 - 48:38So what is sound? When you hear sound waves, sound is created
-
48:38 - 48:39by a pattern
-
48:39 - 48:43of expansions and compressions in air. So the way you're hearing my voice is
-
48:43 - 48:45my
-
48:45 - 48:48mouth is causing certain
-
48:48 - 48:51changes in the air pressure, and then your ear is hearing my voice as
-
48:51 - 48:54detecting those changes in air
-
48:54 - 48:58pressure. So what a microphone records, what my mouth is generating, is
-
48:58 - 48:59a pattern.
-
48:59 - 49:01I'm going to draw a cartoon,
-
49:01 - 49:05I guess.
-
49:05 - 49:06Changes in
-
49:06 - 49:07air pressure. So
-
49:07 - 49:11this is what sound is. You look at a microphone recording, you see these roughly periodic
-
49:11 - 49:13signals that comprise of
-
49:13 - 49:16changes in air pressure over time as the air pressure goes
-
49:16 - 49:19above and below some baseline air pressure.
-
49:19 - 49:20So this
-
49:20 - 49:22is what the speech signal looks like, say.
-
49:22 - 49:26So this is speaker one.
-
49:26 - 49:29Then what I'm saying is that
-
49:29 - 49:31- this is some time, T.
-
49:31 - 49:34What I'm saying is that the value of that point,
-
49:34 - 49:37I'm going to denote as S, super
-
49:37 - 49:40script T, sub script one.
-
49:40 - 49:42Similarly,
-
49:42 - 49:45speaker two, it's
-
49:45 - 49:47outputting some sound wave. Speaker voice
-
49:47 - 49:50will play that. It'll actually sound like
-
49:50 - 49:53a single tone, I guess.
-
49:53 - 49:56So in the same way, at the same time, T,
-
49:56 - 49:59the value of the air
-
49:59 - 50:03pressure generated by speaker two, I'll denote as
-
50:03 - 50:10ST
-
50:17 - 50:242.
-
50:30 - 50:37So we observe
-
50:38 - 50:40XI equals A times SI, where
-
50:40 - 50:43these XIs
-
50:43 - 50:46are vectors in RN.
-
50:46 - 50:50So I'm going to assume
-
50:50 - 50:53that I have N microphones,
-
50:53 - 50:54and
-
50:54 - 50:58each of my microphones records some linear combination
-
50:58 - 51:02of what the speakers are saying. So each microphone records some overlapping
-
51:02 - 51:04combination of what the speakers are saying.
-
51:04 - 51:08For
-
51:10 - 51:13example, XIJ, which is - this
-
51:13 - 51:16is what microphone J records at time, I. So
-
51:16 - 51:17by definition of
-
51:17 - 51:36the matrix multiplication, this is the sum over K of AJK times SIK. Okay?
-
51:37 - 51:41So what my J microphone is recording is
-
51:42 - 51:44some linear combination of
-
51:44 - 51:46all of the speakers. So
-
51:46 - 51:50at time I, what microphone J is recording is some linear combination of
-
51:50 - 51:53what all the speakers are saying at time I.
-
51:53 - 51:54So K here
-
51:54 - 51:58indexes over the N speakers.
-
51:58 - 52:01So our goal
-
52:03 - 52:06is to find the matrix, W, equals A inverse, and
-
52:06 - 52:10just defining W that way.
-
52:10 - 52:17So
-
52:18 - 52:21we can recover the original sources
-
52:21 - 52:23as a linear combination of
-
52:23 - 52:24our
-
52:24 - 52:31microphone recordings, XI.
-
52:33 - 52:35Just as a point of notation,
-
52:35 - 52:42I'm going to write the matrix W this way. I'm going to use
-
52:51 - 52:55lower case W subscript one, subscript two and so on to denote the rows
-
52:55 - 53:02of this matrix, W.
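(In symbols, with the notation just introduced, the model and the recovery step are:)

```latex
x^{(i)} = A\,s^{(i)}, \qquad
W = A^{-1} =
\begin{bmatrix} \text{--- } w_1^T \text{ ---} \\ \vdots \\ \text{--- } w_n^T \text{ ---} \end{bmatrix}, \qquad
s^{(i)} = W x^{(i)}, \quad s^{(i)}_j = w_j^T x^{(i)}.
```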
-
53:14 - 53:15Let's
-
53:15 - 53:19see.
-
53:19 - 53:24So let's look at why ICA is possible. Given these overlapping voices,
-
53:24 - 53:28let's think briefly why it might be possible
-
53:28 - 53:31to recover the original sources.
-
53:31 - 53:33So for the next example, I want
-
53:33 - 53:37to say
-
53:43 - 53:47- let's say that each of my speakers
-
53:47 - 53:50outputs - this will sound like white noise. Can I switch
-
53:50 - 53:53the laptop display,
-
53:53 - 53:57please? For this example, let's say that
-
53:57 - 54:01each of my speakers outputs uniform white noise. So
-
54:01 - 54:05if that's the case, these are my axis, S1 and S2.
-
54:05 - 54:09This is what my two speakers would be uttering.
-
54:09 - 54:18The plots of what they're uttering will lie in a square box if the two speakers are independently outputting uniform minus one to plus one random variables.
-
54:18 - 54:20So this is a plot of
-
54:20 - 54:24S1 and S2, my original sources.
-
54:24 - 54:29This would be a typical sample of what my microphones record. Here, at
-
54:29 - 54:31the axis, are X1 and X2.
-
54:31 - 54:35So these are images I got from [inaudible] on
-
54:35 - 54:37ICA.
-
54:39 - 54:44Given a picture like this, you can sort of look at this box, and you can sort of tell what the axis of
-
54:44 - 54:45this
-
54:45 - 54:46parallelogram
-
54:46 - 54:48are. You can figure out
-
54:48 - 54:52what linear transformation would transform the parallelogram back
-
54:52 - 54:54to a box.
-
54:54 - 54:59So it turns out there are some inherent ambiguities in ICA.
-
54:59 - 55:01I'll just say what they are.
-
55:01 - 55:02One is that
-
55:02 - 55:06you can't recover the original indexing of the sources. In particular,
-
55:06 - 55:07if
-
55:07 - 55:11I generated the data for speaker one and speaker two,
-
55:11 - 55:14you can run ICA, and then you may end up with the order of the speakers
-
55:14 - 55:18reversed. What that corresponds to is if you take this
-
55:18 - 55:22picture and you flip this picture along a 45-degree
-
55:22 - 55:26axis. You take a 45-degree axis and reflect this picture across the 45-degree axis, you'll still
-
55:26 - 55:28get a box. So
-
55:28 - 55:31there's no way for the algorithms to tell which was speaker No. 1 and
-
55:31 - 55:33which
-
55:33 - 55:38was speaker No. 2. The numbering or the ordering of the speakers is
-
55:38 - 55:41ambiguous. The other source of ambiguity, and these are the only ambiguities
-
55:41 - 55:42in this example,
-
55:42 - 55:44is the sign of the sources. So
-
55:44 - 55:49given my speakers' recordings,
-
55:49 - 55:53you can't tell whether you got a positive SI or whether you got
-
55:53 - 55:56back a negative SI.
-
55:56 - 55:58In this picture, what that corresponds to
-
55:58 - 56:02is if you take this picture, and you reflect it along the vertical axis, if
-
56:02 - 56:05you reflect it along the horizontal axis,
-
56:05 - 56:06you still get a box.
-
56:06 - 56:09You still get back [inaudible] speakers.
-
56:09 - 56:10So
-
56:10 - 56:12it turns out that in this example,
-
56:12 - 56:17you can't guarantee that you've recovered positive SI rather
-
56:17 - 56:20than negative SI.
-
56:20 - 56:22So it turns out that these are the only
-
56:22 - 56:26two ambiguities in this example. What is the permutation of the speakers, and the
-
56:26 - 56:28other is the sign of the speakers.
-
56:28 - 56:31Permutation of the speakers, there's not much you can do about that.
-
56:31 - 56:35It turns out that if you take the audio
-
56:35 - 56:36source
-
56:36 - 56:39and if you flip the sign, and you take negative S, and if you play that through a
-
56:39 - 56:44microphone it'll sound indistinguishable.
-
56:44 - 56:45So
-
56:45 - 56:48for many of the applications we care about, the sign
-
56:48 - 56:51as well as the permutation
-
56:51 - 56:55is ambiguous, but you don't really care
-
56:55 - 57:02about it. Let's switch back
-
57:04 - 57:09to
-
57:09 - 57:11chalk board, please.
-
57:11 - 57:16It turns out, and I don't want to spend too much time on this, but I do want to say it briefly.
-
57:16 - 57:17It turns out the
-
57:17 - 57:19reason why those are the only
-
57:19 - 57:26sources of ambiguity - so the ambiguities were the
-
57:26 - 57:30permutation of the speakers
-
57:30 - 57:32and the signs.
-
57:32 - 57:35It turns out that
-
57:35 - 57:40the reason these were the only ambiguities was because
-
57:40 - 57:45the SIJs were
-
57:45 - 57:47
-
57:47 - 57:51non-Gaussian. I don't want to spend too much time on this, but I'll say it briefly.
-
57:51 - 57:54Suppose my original sources, S1 and S2, were Gaussian.
-
57:54 - 57:56So
-
57:58 - 58:02suppose SI is
-
58:02 - 58:04Gaussian, with mean zero
-
58:04 - 58:07and identity covariance.
-
58:07 - 58:11That just means that each of my speakers outputs a Gaussian random variable. Here's a typical
-
58:11 - 58:13example of Gaussian
-
58:13 - 58:18data.
-
58:18 - 58:23You will recall the contours of a Gaussian distribution with identity covariance
-
58:23 - 58:25looks like
-
58:25 - 58:28this, right? The Gaussian is a
-
58:28 - 58:31spherically symmetric distribution.
-
58:31 - 58:35So if my speakers were outputting Gaussian random variables, then if
-
58:35 - 58:38I observe a linear combination of this,
-
58:38 - 58:40there's actually no way to recover the
-
58:40 - 58:43original distribution because there's no way for me to tell
-
58:43 - 58:46if the axes are at this angle or if they're at
-
58:46 - 58:48that angle and so
-
58:48 - 58:52on. The Gaussian is a rotationally symmetric
-
58:52 - 58:57distribution, so I would not be able to recover the orientation or the
-
58:57 - 58:59rotation
-
58:59 - 59:02of this. So I don't want to prove this too much. I don't want to spend too much time dwelling on this, but it turns
-
59:02 - 59:03out
-
59:03 - 59:05if your source is a Gaussian,
-
59:05 - 59:08then it's actually impossible to do
-
59:08 - 59:12ICA. ICA relies critically on your data being non-Gaussian because if the data
-
59:12 - 59:17were Gaussian, then the rotation of the data would be ambiguous. So
-
59:17 - 59:19regardless of how much data you have,
-
59:19 - 59:24even if you had infinitely large amounts of data, you would not be able to recover
-
59:24 - 59:27the matrix A or W.
-
59:33 - 59:40Let's go ahead and derive the algorithm.
-
59:57 - 60:01To do this, I need just one more result, and then the derivation will be
-
60:03 - 60:11three lines. Let's say S is a random variable in RN, which is the joint vector of the sounds that all of my speakers are emitting at any time.
-
60:11 - 60:12So
-
60:12 - 60:16let's say the density of S is
-
60:16 - 60:17P subscript S,
-
60:17 - 60:20capital S.
-
60:20 - 60:31So my microphone records X equals A S, which equals W inverse S. Equivalently, S equals W times X.
-
60:31 - 60:35So let's think about what is the density of
-
60:35 - 60:38X. So I have P of S. I know the density of
-
60:38 - 60:41S, and X is a linear combination of the S's.
-
60:41 - 60:45So let's figure out what is the density of X.
-
60:45 - 60:49One thing we could do is
-
60:49 - 61:10figure out what S is and just apply the density of S to that. So let's see: S is W times X, so you might guess that the probability of X is just P sub S applied to W times X.
-
61:10 - 61:12So this is wrong.
-
61:12 - 61:15It turns out you can do this for probably mass functions but not for
-
61:15 - 61:17continuous density. So in particular,
-
61:17 - 61:21it's not correct to say that the probability of X is - well, you just figure out what
-
61:21 - 61:22S is.
-
61:22 - 61:26Then you say the probability of S is applied to that. This is wrong. You
-
61:26 - 61:28can't do this with densities.
-
61:28 - 61:31You can't say the probability of X is that, because P sub S is a probability density
-
61:31 - 61:33function.
-
61:33 - 61:34In particular,
-
61:34 - 61:36the
-
61:36 - 61:38right formula is the
-
61:38 - 61:40density of S applied to W times X,
-
61:40 - 61:42times the determinant
-
61:42 - 61:44of the matrix, W.
-
61:44 - 61:47Let me just illustrate that with an example.
-
61:47 - 61:50Let's say
-
61:50 - 61:52the
-
61:52 - 61:58density for S is that. In
-
61:58 - 62:03this example, S is uniform
-
62:03 - 62:06over the unit interval.
-
62:08 - 62:15So the density for S looks like that. It's
-
62:15 - 62:18just density for the uniform
-
62:18 - 62:21distribution of zero one.
-
62:21 - 62:24So let me let X be equal to two times
-
62:24 - 62:30S. So this means A equals two.
-
62:30 - 62:34W equals one half. So if
-
62:34 - 62:37S is a uniform distribution over zero, one,
-
62:37 - 62:40then X, which is two times that, will be the uniform distribution over the
-
62:40 - 62:43range from zero to two.
-
62:43 - 63:16So the density for X will be - that's one, that's two, and the height there is one half. Okay? The density for X will be the indicator of zero less than or equal to X less than or equal to two, times W, which is one half.
-
63:16 - 63:20So
-
63:20 - 63:22does that make
-
63:22 - 63:25sense? When you compute the density for X, because X is now spread out
-
63:25 - 63:29across a wider range. The density of X is now smaller,
-
63:29 - 63:36and therefore, the density of X has this one half
-
63:38 - 63:39term
-
63:39 - 63:43here. Okay? This is an illustration for the case of one-dimensional random variables,
-
63:43 - 63:44
-
63:44 - 64:01of S and X being one-dimensional. I'm not going to show it, but the generalization of this to vector-valued random variables is that the density of X is given by P sub S of W X times the determinant of the matrix W. Over here, I showed the one-dimensional version of this generalization.
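(In symbols, the vector-valued rule being stated, with the one-dimensional example above as a sanity check:)

```latex
p_X(x) = p_S(Wx)\,\lvert\det W\rvert,
\qquad\text{1-D example: } x = 2s,\ W = \tfrac{1}{2},\quad
p_X(x) = \mathbf{1}\{0 \le x \le 2\}\cdot\tfrac{1}{2}.
```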
-
64:21 - 64:28So we're nearly there. Here's
-
64:29 - 64:34how I can implement ICA.
-
64:34 - 64:37So my distribution on
-
64:37 - 64:44S,
-
64:50 - 64:53so I'm going to assume that my density on S
-
64:53 - 65:04is given by a product over the N speakers of the density of speaker I emitting a certain sound. This is a product of densities.
-
65:04 - 65:08This is a product of distributions because I'm going to assume that the
-
65:08 - 65:11speakers are having independent conversations. So the SI's independent
-
65:11 - 65:16for different values of I.
-
65:16 - 65:18So by the formula we just worked out,
-
65:18 - 65:22the density for X would be equal to that.
-
65:37 - 65:39I'll just remind you, W was A
-
65:39 - 65:43inverse. It was
-
65:43 - 65:44this matrix
-
65:44 - 65:48I defined previously
-
65:48 - 65:50so that SI
-
65:50 - 65:53equals WI transpose
-
65:53 - 65:59X. So that's what's in
-
65:59 - 66:02there. To complete my formulation for this model,
-
66:02 - 66:06the final thing I need to do is
-
66:06 - 66:10choose
-
66:10 - 66:12a density
-
66:12 - 66:14for what I think each speaker is
-
66:14 - 66:18saying. I need to assume some density over
-
66:18 - 66:22the sounds emitted by an individual speaker.
-
66:22 - 66:28So following the discussion I had right before we started to derive ICA,
-
66:28 - 66:31one thing I could do is I could choose
-
66:31 - 66:32the density for S,
-
66:32 - 66:36or equivalently, I could choose the CDF, the cumulative distribution
-
66:36 - 66:37function for
-
66:37 - 66:38S.
-
66:38 - 66:41In this case, I'm going to choose
-
66:41 - 66:45a CDF, probably for historical reasons and probably for
-
66:45 - 66:47convenience.
-
66:47 - 66:50I need to choose the CDF for S, so
-
66:50 - 66:55what that means is I just need to choose some function that increases from zero to
-
66:55 - 66:59one. I know I can't choose a Gaussian because we know you can't
-
66:59 - 67:02do ICA on Gaussian data.
-
67:02 - 67:05So I need some function increasing from zero to one
-
67:05 - 67:09that is not the cumulative distribution function for a
-
67:09 - 67:10Gaussian distribution.
-
67:10 - 67:14So what other functions do I know that increase from zero to one? I
-
67:14 - 67:16just choose the
-
67:16 - 67:18CDF to be
-
67:18 - 67:22the
-
67:22 - 67:23sigmoid function.
-
67:23 - 67:25This is a
-
67:25 - 67:27commonly-made choice that
-
67:27 - 67:31is made for convenience. There is actually no great reason for why you
-
67:31 - 67:34choose a sigmoid function. It's just a convenient function that we all know
-
67:34 - 67:35and are familiar with
-
67:35 - 67:38that happens to increase from zero to one.
-
67:38 - 67:45When you take the derivative
-
67:46 - 67:49of the sigmoid, and that will give you back
-
67:49 - 67:50your
-
67:50 - 67:55density. This is just not Gaussian. This is the main virtue of choosing the sigmoid.
-
67:55 - 68:02So
-
68:19 - 68:22there's really no rationale for the choice of sigmoid. Lots of other things will
-
68:22 - 68:24work fine, too.
-
68:24 - 68:27It's just a common, reasonable default.
-
68:38 - 68:40It turns out that
-
68:40 - 69:07one reason the sigmoid works well for a lot of data sources is that, if this is the Gaussian, and you actually take the sigmoid and take its derivative, you find that the sigmoid has heavier tails than the Gaussian. By this I mean
-
69:07 - 69:11the density of the sigmoid dies down to zero much more slowly than
-
69:11 - 69:12the
-
69:12 - 69:13Gaussian.
-
69:13 - 69:18The magnitude of the Gaussian's tails dies down as E to the minus S squared.
-
69:18 - 69:22For the sigmoid, the tails look like E to the minus
-
69:22 - 69:40S. So the tails die down as E to the minus S, rather than E to the minus S squared. It turns out that most distributions with this property of heavy tails, where the distribution decays to zero relatively slowly compared to a Gaussian, will work fine for your data.
-
69:40 - 69:44Actually, one other choice you can sometimes use is what's called the Laplacian
-
69:44 - 69:46distribution, which is
-
69:46 - 69:53that. This will work fine, too, for many data sources.
-
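The Laplacian density referred to here is presumably the standard double-exponential one (shown with unit scale; a scale parameter can be added):

    p_s(s) = \tfrac{1}{2}\, e^{-|s|}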
70:07 - 70:08Sticking with the sigmoid for now, I'll just
-
70:08 - 70:09write
-
70:09 - 70:14down the algorithm in two steps. So given
-
70:14 - 70:17my training set, and
-
70:17 - 70:21as you can see, this is an unlabeled training set, I can
-
70:21 - 70:26write down the log likelihood of my parameters. So that's - summed over my training
-
70:26 - 70:27examples, log of - times
-
70:27 - 70:34that.
-
70:43 - 70:45So that's my log
-
70:45 - 70:52likelihood.
-
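As a reference for what's on the board, in the standard sigmoid-based formulation the log likelihood is presumably

    \ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'\!\left(w_j^\top x^{(i)}\right) + \log |W| \right)

where g is the sigmoid CDF chosen above and the outer sum runs over the m training examples.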
70:53 - 70:59To learn the parameters, W, of this model, I can use stochastic gradient ascent,
-
70:59 - 71:06which is
-
71:07 - 71:09just that.
-
71:09 - 71:11It turns out, if you work through the math,
-
71:11 - 71:14let's see. If P of S
-
71:14 - 71:20is equal to the derivative of the
-
71:20 - 71:24sigmoid, then if you just work through the math to compute the derivative there. You've all
-
71:24 - 71:27done this a lot of times. I won't bother to show
-
71:27 - 71:34the details. You find that it is equal to this.
-
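For reference, the resulting stochastic gradient ascent update in the standard derivation, with 1 - 2g(W x^{(i)}) taken element-wise as a column vector and alpha the learning rate, is presumably

    W := W + \alpha \left( \bigl(1 - 2\, g\!\left(W x^{(i)}\right)\bigr)\, x^{(i)\top} + \bigl(W^\top\bigr)^{-1} \right)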
71:47 - 71:50Okay? That's just - you can work those out yourself. It's just math to
-
71:50 - 71:54compute the derivative of this with respect to
-
71:54 - 71:59W. So to summarize, given the training set,
-
71:59 - 72:02here's my stochastic gradient ascent update rule. So you run
-
72:02 - 72:06stochastic gradient ascent to learn the parameters W.
-
72:06 - 72:08After you're
-
72:08 - 72:10done, you then
-
72:12 - 72:14output SI equals
-
72:14 - 72:17WXI, and you've separated your sources
-
72:17 - 72:18of your
-
72:18 - 72:22data back out into the original independent sources.
-
72:22 - 72:26Hopefully up to only a permutation and a plus/minus
-
72:26 - 72:31sign ambiguity.
-
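To make the two steps concrete, here is a minimal NumPy sketch of the procedure just summarized - stochastic gradient ascent on the sigmoid-based log likelihood, then recovering the sources as S^(i) = W X^(i). The names (ica, X, alpha, n_epochs) are illustrative, not from the lecture; in practice you would typically center (and often whiten) the data first.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def ica(X, alpha=0.001, n_epochs=20):
        """Sketch of ICA by stochastic gradient ascent.
        X is an m x n array: m mixed recordings x^(i), each n-dimensional."""
        m, n = X.shape
        W = np.eye(n)                          # initial guess for the unmixing matrix
        for _ in range(n_epochs):
            for i in np.random.permutation(m):
                x = X[i].reshape(n, 1)         # one training example, as a column vector
                # gradient of the log likelihood for this example:
                # (1 - 2 g(W x)) x^T + (W^T)^{-1}
                grad = (1 - 2 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T)
                W = W + alpha * grad           # stochastic gradient ascent step
        S = X @ W.T                            # recovered sources, s^(i) = W x^(i)
        return W, S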
72:31 - 72:35Okay? So just switch back to the laptop, please?
-
72:35 - 72:42So we'll just wrap up with a couple of examples of applications of ICA.
-
72:42 - 72:43This is
-
72:43 - 72:47actually a picture of our TA, Katie.
-
72:47 - 72:50So one of the applications of ICA is
-
72:50 - 72:52to process
-
72:52 - 72:57various types of [inaudible] recording data, such as EEG. This
-
72:57 - 72:59is a picture of
-
72:59 - 73:02an EEG cap, in which there are a number of electrodes
-
73:02 - 73:05you place
-
73:05 - 73:08on the - in this case, on Katie's brain, on Katie's scalp.
-
73:08 - 73:13So where each electrode measures changes in voltage over time
-
73:13 - 73:15on the scalp.
-
73:15 - 73:18On the right is a typical example of EEG data
-
73:18 - 73:23where each electrode measures - just changes in voltage over
-
73:23 - 73:24time. So
-
73:24 - 73:28the horizontal axis is time, and the vertical axis is voltage. So here's the same thing,
-
73:28 - 73:30blown up a little bit.
-
73:30 - 73:33You notice there are artifacts in this
-
73:33 - 73:36data. Where the data is circled, all
-
73:36 - 73:38the
-
73:38 - 73:41electrodes seem to measure these very synchronized recordings.
-
73:41 - 73:45It turns out that when we look at EEG data as well as a number of other
-
73:45 - 73:47types of data, there are
-
73:47 - 73:52artifacts from heartbeats and from human eye blinks and so on. So the
-
73:52 - 73:55cartoon is, if you imagine placing the
-
73:55 - 73:57electrodes, or
-
73:57 - 73:58microphones, on my scalp,
-
73:58 - 74:02then each microphone is recording some overlapping combination of all the
-
74:02 - 74:05things happening in my brain or in my body.
-
74:05 - 74:08My brain has a number of different processes going on. My body's [inaudible]
-
74:08 - 74:11going on, and
-
74:11 - 74:13each electrode measures a sum
-
74:13 - 74:16of the different voices in my brain.
-
74:16 - 74:20That didn't quite come out the way I wanted it to.
-
74:20 - 74:22So we can just take this data
-
74:22 - 74:25and run ICA on it and find out the independent components - what the
-
74:25 - 74:26independent
-
74:26 - 74:30processes are going on in my brain. This is an example of running ICA.
-
74:30 - 74:33So you find that a small number of components, like those shown up there,
-
74:33 - 74:38they correspond to the heartbeat, where the arrows are - those are very periodic
-
74:38 - 74:42signals. They come on occasionally and correspond to [inaudible] components of
-
74:42 - 74:43heartbeat.
-
74:43 - 74:47You also find things like an eye blink component, corresponding to a
-
74:47 - 74:50signal generated when you blink your eyes.
-
74:50 - 74:54By doing this, you can then subtract out the heartbeat and the eye blink
-
74:54 - 74:56artifacts from the data, and now
-
74:56 - 75:01you get much cleaner data - much cleaner EEG readings. You can
-
75:01 - 75:04do further scientific studies. So this is a
-
75:04 - 75:06pretty commonly used preprocessing step
-
75:06 - 75:10that is a common application of ICA.
-
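As a hypothetical sketch of that preprocessing step (the names remove_artifacts and artifact_components are illustrative): estimate the sources with the learned unmixing matrix W, zero out the components identified as heartbeat or eye-blink artifacts, and map back into the electrode space using A = W^{-1}.

    import numpy as np

    def remove_artifacts(X, W, artifact_components):
        """X: m x n array of recordings (m time samples, n electrodes).
        W: n x n unmixing matrix learned by ICA.
        artifact_components: indices of the sources to discard."""
        S = X @ W.T                       # estimated sources, one column per component
        S[:, artifact_components] = 0.0   # drop the heartbeat / eye-blink components
        A = np.linalg.inv(W)              # mixing matrix, since W = A^{-1}
        return S @ A.T                    # cleaned recordings back in electrode space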
75:10 - 75:13The last example is
-
75:13 - 75:16an application, again, from [inaudible], as
-
75:16 - 75:21a result of running ICA on small natural image patches. Suppose I take
-
75:21 - 75:22natural images
-
75:22 - 75:26and run ICA on the data and ask what are the independent components of the data.
-
75:26 - 75:30It turns out that these are the bases you get. So this is a plot of the
-
75:30 - 75:33sources you get.
-
75:33 - 75:36This algorithm is saying that a natural image patch
-
75:36 - 75:38shown
-
75:38 - 75:40on the left
-
75:40 - 75:45is often expressed as a sum, or a linear combination, of
-
75:45 - 75:47independent sources of
-
75:47 - 75:48things that make up images.
-
75:48 - 75:53So this models natural images as generated by independent objects
-
75:53 - 75:55that generate different edges in the image.
-
75:55 - 76:01One of the fascinating things about this is that, similar to neuroscience, this has also been
-
76:01 - 76:05hypothesized as a method for how the human brain processes image
-
76:05 - 76:06data. It
-
76:06 - 76:10turns out, this is similar, in many ways, to computations
-
76:10 - 76:15happening in early visual processing in the human brain,
-
76:15 - 76:18in the mammalian
-
76:18 - 76:20brain. It's just
-
76:20 - 76:25interesting to see that edges are the independent components of images.
-
76:25 - 76:31Are there quick questions, because I'm running late? Quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew
-
76:31 - 76:32Ng): Oh,
-
76:32 - 76:35yes. For the algorithms I describe, I assume A is a square matrix.
-
76:35 - 76:39It turns out if you have more microphones than speakers, you can also apply very
-
76:39 - 76:40similar algorithms. If
-
76:40 - 76:44you have fewer microphones than speakers, it's sort of an open research problem. The odds
-
76:44 - 76:48are that if you have one male and one female speaker, but one microphone, you can
-
76:48 - 76:52sometimes sort of separate them because one is high, one is low. If you have two
-
76:52 - 76:55male speakers or two female speakers, then it's beyond the state of the art now to separate them
-
76:55 - 76:57with one
-
76:57 - 77:00microphone. It's a great research problem. Okay.
-
77:00 - 77:05Sorry about running late again. Let's close now, and we'll
-
77:05 - 77:06continue with reinforcement learning.