This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue the discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, LSI. Then I want to spend just a little time talking about how to implement PCA, especially for very large problems. In particular, I'll spend a little bit of time talking about singular value decomposition, or the SVD implementation of principal component analysis. Then in the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways accomplishes very different things than PCA. So with this lecture, we'll actually wrap up our discussion of unsupervised learning. In the next lecture, we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal component analysis: I said that in PCA, we imagine that we have some very high dimensional data that perhaps lies approximately on some low dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second principal component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA was to normalize the data to zero mean and unit variance: you subtract out the mean of your training examples, so the data now has zero mean, and then normalize each of your features so that the variance of each feature is one. The next step was to compute the covariance matrix of your zero-mean data, Sigma = sum over i of x^(i) x^(i)^T, the sum of all the outer products. Then you find the top k eigenvectors of Sigma.

Last time we saw applications of this. For example, one of the applications was to eigenfaces, where each of your training examples x^(i) is an image.
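Before going on with the eigenfaces example, here is a minimal NumPy sketch of those three steps (my own illustrative code, not from the lecture; the function name and variables are made up):

import numpy as np

def pca_top_k(X_raw, k):
    # Step 1: normalize each feature to zero mean and unit variance.
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
    # Step 2: covariance matrix Sigma = sum_i x^(i) x^(i)^T, an n-by-n matrix.
    # (This is the naive version; for very large n it becomes too big, which
    # is exactly the problem discussed next and solved later with the SVD.)
    Sigma = X.T @ X
    # Step 3: top-k eigenvectors of Sigma (eigh returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    return eigvecs[:, ::-1][:, :k]   # columns are the top-k principal directions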
So if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values in each of your 100 by 100 images. The eigenfaces application was one where the training examples comprised pictures of faces of people. Then we ran PCA, and to measure the distance between, say, a face here and a face there, we would project both of the face images onto the subspace and then measure the distance along the subspace. In eigenfaces, you might use something like 50 principal components.

The difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000 by 10,000 matrix, which is huge; it has 100 million entries. So we'd like to apply PCA to very, very high dimensional data, with the goal of reducing its dimension, but step two of this algorithm requires constructing this extremely large matrix, which you can't really do. I'll come back to this in a second.

It turns out one of the other frequently used applications of PCA is to text data. Here's what I mean. Remember our vectorial representation of emails? This is from way back when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or a piece of text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary - somewhere you have the word "learn", somewhere you have the word "study", and so on - and depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used back in lecture five or six for representing text documents when we were building Naive Bayes-based classifiers for spam.
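As a tiny illustration of that representation (the five-word dictionary and the document here are made up; a real dictionary would have tens of thousands of entries):

# Hypothetical 5-word dictionary; real ones have ~50,000 entries.
vocab = ["aardvark", "learn", "study", "spam", "zygote"]
doc = "students study hard and learn quickly"
words = set(doc.split())
x = [1 if w in words else 0 for w in vocab]
print(x)   # [0, 1, 1, 0, 0]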
So it turns out one of the common applications of PCA is to this text data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name: latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing step. For various reasons, in LSI you usually don't normalize the mean of the data to zero, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so it does something very similar to PCA. Normalizing the variance to one for text data would actually be a bad idea, because that would have the effect of dramatically scaling up the weight of rarely occurring words. For example, the word "aardvark" hardly ever appears in any document, so normalizing the variance of that feature to one would scale up the weight of the word aardvark dramatically, which is probably not what you want. (A small numerical illustration of this follows below.)

So let's see. In natural language applications, something that we want to do quite often is, given two documents x^(i) and x^(j), measure how similar they are. For example, I may give you a document and ask you to find me more documents like this one - you're reading some article about some news event of today and want to find what other related news articles there are. So I give you a document and ask you to look at all the other documents you have in this large set of documents and find the ones similar to it.

This is a typical text application. So to measure the similarity between two documents x^(i) and x^(j), each of these documents is represented as one of these high-dimensional vectors. One common way to do this is to view each of your documents as a very high-dimensional vector whose dimension is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary.
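Here is the small numerical illustration promised above of why variance normalization would blow up rare words (the document frequencies are made up): a binary word indicator that equals one in a fraction p of documents has standard deviation sqrt(p(1-p)), so dividing by it multiplies the feature by roughly 1/sqrt(p).

import numpy as np

p_aardvark, p_learn = 0.0001, 0.2       # hypothetical document frequencies
for p in (p_aardvark, p_learn):
    std = np.sqrt(p * (1 - p))
    print(f"p={p}: feature gets multiplied by {1/std:.1f}")
# The rare word's weight gets scaled up by ~100x, the common one's by only ~2.5x.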
One natural measure of the similarity between these two documents that's often used is the angle between them. In particular, if the angle between these two vectors is small, then we'll consider the two documents to be similar; if the angle between these two vectors is large, then we consider the documents to be dissimilar. So more formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. For the relevant range of values, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product x^(i)T x^(j) divided by the product of the norms ||x^(i)|| ||x^(j)||; that's just the standard linear algebra or geometry definition of the cosine between two vectors.

Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace - by running PCA on your text data this way - the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing.

Look further at the definition of the cosine similarity measure. The numerator of the similarity between the two documents was this inner product, which is the sum over k of x^(i)_k x^(j)_k. This inner product will be equal to zero if the two documents have no words in common. It's really the sum over k of an indicator of whether documents i and j both contain the word k, because x^(i)_k indicates whether document i contains word k, and x^(j)_k indicates whether document j contains word k, so the product is one only if the word k appears in both documents. Therefore, the similarity between these two documents would be zero if the two documents have no words in common.
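Here is a minimal sketch of that cosine similarity on two made-up binary document vectors:

import numpy as np

def cosine_sim(xi, xj):
    # cos(theta) = xi . xj / (||xi|| ||xj||)
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

xi = np.array([1, 0, 1, 1, 0], dtype=float)   # words present in document i
xj = np.array([1, 1, 0, 1, 0], dtype=float)   # words present in document j
print(cosine_sim(xi, xj))   # 2 shared words out of 3 each -> 2/3 ≈ 0.667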
For example, suppose your document x^(i) has the word "study" and your document x^(j) has the word "learn". Then these two documents may be considered entirely dissimilar. Say x^(i) is about effective study strategies - sometimes you read a news article about that - and you ask, what other documents are similar to this? Even if there are a bunch of other documents about good methods to learn, they may share no words with it, so the measured similarity would be zero.

So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word "learn", and on the vertical axis I plot the word "study". The values take on either zero or one. So if a document contains the word learn but not study, I'll plot that document there; if a document contains neither the word study nor learn, I'll plot it at (0, 0).

Here's the cartoon behind what PCA is doing: we identify a lower-dimensional subspace - that would be some eigenvector we get out of PCA. Now, suppose we have a document about learning and a document about studying. The document about learning points to the right; the document about studying points up. So the inner product - the cosine of the angle - between these two documents will be zero. So these two documents are entirely unrelated, which is not what we want: documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then these two documents become much closer together, and the algorithm will recognize that: when you take the inner product between the two projected documents, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

So that's just intuition about what PCA may be doing to text data. The same thing goes for examples other than the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians.
That will also bring the documents closer together. For any related topics, their points end up closer together in the lower-dimensional space.

Questions about this? Student: [Inaudible]. Which ones? This one? No, the line. Oh, this one. Oh, yes. Thank you. [Inaudible].

So let's talk about how to actually implement this now. Okay, how many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you - a lot more than I thought. Curious: did you guys learn it as undergrads or as graduate students? All right. Let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about this, too.

So I'll say a little bit about how to implement PCA. The problem I was alluding to just now was that when you have these very high-dimensional vectors, Sigma is a large matrix. In particular, for our text example, if the vectors x^(i) are 50,000 dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway: it turns out there's another way to implement PCA. If A is any n-by-n matrix, then one of the most remarkable results of linear algebra is that A can be decomposed into a singular value decomposition. What that means is that the matrix A, which is n by n, can always be decomposed into a product of three matrices, A = U D V^T, where U is n by n, D is a square matrix which is n by n, and V is also n by n. D is diagonal, with zeros on the off-diagonals, and the values sigma_i on its diagonal are called the singular values of the matrix A.

Almost all of you said you learned this as a graduate student rather than as an undergrad. It turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions - you usually learn about the QR decomposition, maybe the LU factorization of matrices.
Most undergrad courses don't get to talk about singular value decompositions, but in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used an SVD dozens of times in the last year, but for LU and QR decompositions, I think I used a QR decomposition once and an LU decomposition maybe once in the last year.

So let me say a bit more about this. I'm going to draw the picture, I guess. For example, if A is an n-by-n matrix, it can be decomposed into another matrix U, which is also n by n - the same size - times D, which is n by n, times another square matrix V transpose, which is also n by n. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A^T, and the columns of V will be the eigenvectors of A^T A.

To compute it, you just use the SVD command in Matlab or Octave. Today, the state of the art in numerical linear algebra is such that singular value decompositions of matrices can be computed extremely reliably and efficiently. We've used packages like Matlab or Octave to compute, say, the eigenvectors of a matrix; SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed: you compute the square root of something and just don't worry about it, because you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands by thousands, the SVD routine - I think of it like a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand matrices, this is implemented very well today.

Student: [Inaudible] What's the complexity of SVD? Instructor (Andrew Ng): That's a good question.
I actually don't know. I want to guess it's roughly on the order of n-cubed, but I'm not sure; these are numerical algorithms, and I don't know exactly what's known about their convergence.

The example I drew out was for a fat matrix - one that's wider than it is tall. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it decomposes into a product of three matrices like that.

The nice thing about this is that we can use it to compute eigenvectors and PCA very efficiently. In particular, the covariance matrix Sigma was the sum of all the outer products, Sigma = sum over i of x^(i) x^(i)^T. If you go back and recall the definition of the design matrix - I think I described this in lecture two, when we derived the closed-form solution to least squares - the design matrix X was the matrix where I took my examples x^(i) and stacked them up in rows. If you construct the design matrix that way, then the covariance matrix Sigma can be written as just X transpose X. I hope you see why X transpose X gives you the sum of outer products of vectors; if you aren't seeing it right now, just go home and convince yourself that it's true.

So to get the top k eigenvectors of Sigma, you would take the matrix X and compute its SVD, so you get U S V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.

In our example, the design matrix is quite manageable: if you have 50,000 words in your dictionary, then the design matrix would be in R^(m by 50,000) - say 100 by 50,000, if you have 100 examples. So X would be quite tractable to represent and to compute the SVD of, whereas the matrix Sigma would be much harder to represent.
Sigma is 50,000 by 50,000. So this gives you an efficient way to implement PCA. The reason I want to talk about this is that in previous years I didn't, and in the class projects I found a number of students trying to implement this directly on huge problems and running into trouble. So this is a much better way to implement PCA if you have extremely high dimensional data. If you have low dimensional data - 50 or 100 dimensional data - then computing Sigma is no problem, and you can do it the old way; but otherwise, use the SVD to implement this.

Questions about this?

The last thing I want to say is that in practice, when you want to implement this, I want to give a note of caution. It turns out that for many applications - let's see. When you apply SVD to these wide - yeah?

Student: Just a quick question: is it the top k columns of U or of V? Because X transpose X works out in terms of V, right?

Instructor (Andrew Ng): Let's see. Oh, yes, I think you're right. Is it the top k columns of U or the top k of V? Yeah, I think you're right. Something bothers me about that, but I think you're right: X is U D V^T, so X transpose X would be V D^T U^T U D V^T, which is V D^2 V^T, so the eigenvectors of X transpose X should be the columns of V. If anyone thinks about this and has another opinion, let me know, but I think you're right. I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know? That sounds right. Okay, cool.

So just one last note, a note of caution. In this example, I was computing the SVD of a wide matrix: the matrix X was m by n with m much smaller than n. It turns out that when you find the SVD decomposition of such a matrix, a large part of the matrix D would be zeros - the right-most portion of that block is all zeros. So depending on what convention you use - for example, I think Matlab actually uses a convention of just cutting off the zero elements.
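Putting this together, here is a minimal NumPy sketch of PCA via the SVD of the design matrix (the sizes and the random data are illustrative stand-ins). Consistent with the correction above, the principal directions come out as the columns of V, which NumPy returns as the rows of Vt; the full_matrices flag is NumPy's version of the convention issue just mentioned:

import numpy as np

m, n, k = 100, 50_000, 50          # 100 documents, 50,000-word dictionary, top 50 directions
X = np.random.randn(m, n)          # stand-in for the (normalized) design matrix, examples in rows

# Economy SVD: U is m x m, s holds the m singular values, Vt is m x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
top_k_directions = Vt[:k].T        # n x k: top-k eigenvectors of Sigma = X^T X

# With full_matrices=True, Vt would instead be n x n, padded with directions
# corresponding to zero singular values; other packages make other choices,
# so check your library's convention before relying on the shapes.
print(U.shape, s.shape, Vt.shape)  # (100, 100) (100,) (100, 50000)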
So Matlab, for example, uses a convention of just dropping those all-zero blocks, returning a smaller D along with correspondingly fewer columns of U or V. I'm not even sure whether this depends on the version of Matlab, but when you call SVD in Matlab or some other numerical algebra package, there are slightly different conventions for how to define the SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses onto the original computations. Whether you're calling Matlab or you're writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices, so just make sure you figure this out for the package that you use.

Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to the questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two-by-two matrix - a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I would have a bunch of points z on a line, and then I had these ellipses that I drew. That was the factor analysis model, which models the density of x. And then there was PCA, which we just talked about. So the difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes and finds the subspace.

So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA.
Factor analysis sort of assumes the data lies on a subspace - let me write "subspace" here. So both of these algorithms assume the data lies close to or on some low dimensional subspace. But fundamentally, factor analysis - I think of it as a density estimation algorithm. So if I have some very high dimensional distribution and I want to model P(x), then factor analysis is the algorithm I'm more inclined to use. Even though you could in theory, I would tend to avoid using factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection - you want to model P(x) so that an example with very low probability can be flagged as an anomaly - then I would tend to use factor analysis to do that density estimation. So factor analysis and PCA are both algorithms that assume your data lies near a subspace.

The other class of algorithms we talked about was algorithms that assume the data lies in clumps, or that the data falls into a few coherent groups. So let me just fill in the rest of this picture. If you think your data lies in clumps or groups and your goal is density estimation, then I would tend to use a mixture of Gaussians algorithm. But again, if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use a k-means type algorithm. I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain, and hopefully it helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion of PCA. What I want to do next is talk about independent component analysis, or ICA. Yes?

Student: I have a question about the upper right [inaudible]. So once you have all of the eigenvectors, and you want to know how similar feature i is to feature j, you pick some eigenvector and you take dot products between feature i and feature j and that eigenvector. But there's a lot of eigenvectors to choose from.

Instructor (Andrew Ng): Right.
So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? The answer is that in this cartoon, I would avoid thinking about eigenvectors one at a time. A better way to view this cartoon is that if I decide to choose, say, 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector; this arrow in the cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors. What I actually do is project my data onto the span - the linear span of the eigenvectors - and then I measure distances or take inner products between the projections of the two points onto that subspace. Okay?

So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA we're going to try to find the independent components of variation in the data. Could we switch to the laptop display, please? We'll just take a second to motivate that.

This is actually a slide that I showed in lecture one, of the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping versions of the two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this. [Recording]

So these are really two speakers, speaking independently of each other - each speaker is outputting a series of sound signals independent of the other conversation going on in the room. This being an unsupervised learning problem, the question is: can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording] This is the other one. [Recording]

Just for fun - these are audio clips I got from [inaudible]. Just for fun, let me play the other ones as well.
This is overlapping microphone one. [Recording] Here's microphone two. [Recording] So given this as input, here's output one. [Recording] It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording]

Okay, switch back to the chalkboard, please. So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you.

Let's say you have a one-dimensional random variable S, and suppose it has a probability density function p_S(s). Then the CDF is defined as the function F(s), which is the probability that the random variable S is less than the value given by that lowercase s. For example, if this is your Gaussian density - the density of the Gaussian is usually denoted lowercase phi, and it's roughly a bell-shaped density - then the CDF of the Gaussian will look something like this; it's a function usually written as capital Phi. So if I pick a value s like that, then the height of the CDF at that point is the probability that my Gaussian random variable is less than that value; in other words, the height of the function at that point equals the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian.

So another way to write F(s) is as the integral of the density: F(s) = integral from minus infinity to s of p(t) dt.

Something that will come up later is: suppose I have a random variable S, and I want to model the distribution of S. One thing I could do is specify what I think the density is, or I can specify what the CDF is. These are related by this equation: F is the integral of p_S.
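As a quick numerical check of that relationship (purely illustrative): the closed-form Gaussian CDF value Phi(s) matches a numerical integral of the Gaussian density up to s.

import numpy as np
from math import erf, sqrt, pi

s = 0.7
t = np.linspace(-8.0, s, 200_001)
density = np.exp(-t**2 / 2) / sqrt(2 * pi)       # standard Gaussian density phi(t)
area = float(np.sum(density) * (t[1] - t[0]))    # numerical integral of phi up to s
Phi = 0.5 * (1 + erf(s / sqrt(2)))               # closed-form Gaussian CDF at s
print(round(area, 4), round(Phi, 4))             # both ≈ 0.758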
You can also recover the density from the CDF by taking the derivative: F prime, the derivative of the CDF, gives you back the density. The reason this will come up in the middle of the ICA derivation is that there'll be a step where we need to assume a distribution for a random variable S. I can either specify the density for S directly, or I can specify the CDF; I'll choose to specify the CDF. It has to be some function increasing from zero to one, so you can choose any function that looks like that - and in particular, pulling functions out of a hat, you can, for instance, choose a sigmoid function as the CDF. That would be one way of specifying the distribution of the random variable S. So this will come up later.

Just raise your hand if that's familiar to you, if you've seen it before. Great.

So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources - say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s^(i)_j to denote the signal from speaker j at time i. Here's what I mean. What is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in air. The way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and your ear is hearing my voice by detecting those changes in air pressure. So what a microphone records - what my mouth is generating - is a pattern of changes in air pressure. I'm going to draw a cartoon, I guess. If you look at a microphone recording, you see these roughly periodic signals made up of changes in air pressure over time, as the air pressure goes above and below some baseline. So this is what a speech signal looks like, say. This is speaker one, and here is some time t.
The value at that point I'm going to denote as s superscript (t), subscript 1 - that is, s^(t)_1. Similarly, speaker two is outputting some other sound wave; if you were to play it, it might actually sound like a single tone, I guess. In the same way, at the same time t, the value of the air pressure generated by speaker two I'll denote as s^(t)_2.

So we observe x^(i) = A s^(i), where these x^(i) are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying - some overlapping combination of what the speakers are saying. For example, x^(i)_j, which is what microphone j records at time i, is, by the definition of matrix multiplication, the sum over k of A_jk s^(i)_k. So what my j-th microphone is recording at time i is some linear combination of what all the speakers are saying at time i, where k here indexes over the n speakers.

So our goal is to find the matrix W = A inverse - we just define W that way - so that we can recover the original sources as a linear transformation of our microphone recordings: s^(i) = W x^(i). Just as a point of notation, I'm going to write the matrix W in terms of its rows: I'll use lowercase w_1^T, w_2^T and so on to denote the rows of this matrix W.

Let's see. So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example - can I switch to the laptop display, please? - let's say that each of my speakers outputs uniform white noise. If that's the case, these are my axes, s1 and s2, and this is what my two speakers would be uttering.
A plot of what they're uttering will fill out a square box if the two speakers are independently outputting uniform [-1, +1] random variables. So this is a plot of s1 and s2, my original sources, and this would be a typical sample of what my microphones record; here the axes are x1 and x2. (These are images I got from [inaudible] on ICA.)

Given a picture like this, you can sort of look at it and tell what the axes of the parallelogram are, and you can figure out what linear transformation would transform the parallelogram back into a box.

It turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and you may end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: reflect the picture across the 45-degree line and you still get a box. So there's no way for the algorithm to tell which was speaker number one and which was speaker number two; the numbering, or the ordering, of the speakers is ambiguous.

The other source of ambiguity - and these are the only ambiguities in this example - is the sign of the sources. Given my microphone recordings, you can't tell whether you recovered a positive s_i or a negative s_i. In this picture, that corresponds to reflecting the picture along the vertical axis or along the horizontal axis: you still get a box, and you still get back a valid pair of speakers. So in this example, you can't guarantee that you've recovered positive s_i rather than negative s_i.

It turns out these are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. There's not much you can do about the permutation of the speakers, and it turns out that if you take an audio source, flip its sign to get negative s, and play that back, it'll sound indistinguishable.
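To make the mixing model and these two ambiguities concrete, here is a small sketch (the mixing matrix A and the uniform sources are made up for illustration): unmixing with W = A inverse recovers the sources exactly, and permuting or sign-flipping the rows of W yields sources that look just as valid.

import numpy as np

rng = np.random.default_rng(0)
m = 1000
S = rng.uniform(-1.0, 1.0, size=(m, 2))        # two independent uniform sources (rows = times)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                     # hypothetical mixing matrix
X = S @ A.T                                    # microphone recordings: x^(i) = A s^(i)

W = np.linalg.inv(A)
S_rec = X @ W.T                                # recovers the sources exactly
print(np.allclose(S_rec, S))                   # True

# Permutation and sign ambiguity: this W explains the data just as well,
# but returns the speakers swapped and negated.
W_flipped = np.array([[0.0, -1.0], [-1.0, 0.0]]) @ W
S_alt = X @ W_flipped.T
print(np.allclose(S_alt, -S[:, ::-1]))         # True

Since swapped, negated independent uniform sources are statistically identical to the originals, no algorithm looking only at X could prefer W over W_flipped.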
So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about that. Let's switch back to the chalkboard, please.

It turns out - and I don't want to spend too much time on this, but I do want to say it briefly - that the reason those are the only sources of ambiguity (the permutation of the speakers and the signs) is that the s^(i)_j were non-Gaussian. Suppose instead my original sources, s1 and s2, were Gaussian: suppose s^(i) is Gaussian with mean zero and identity covariance. That just means that each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data. You'll recall that the contours of a Gaussian distribution with identity covariance look like circles - the Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables and I observe a linear combination of them, there's actually no way to recover the original distribution, because there's no way for me to tell whether the axes are at this angle or at that angle, and so on. The Gaussian is a rotationally symmetric distribution, so I would not be able to recover the orientation, the rotation, of the data. I don't want to dwell on this too much, but it turns out that if your sources are Gaussian, then it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have - even if you had infinitely large amounts of data - you would not be able to recover the matrix A or W.

So let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Suppose I have a random variable S in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of S is p_S.
So my microphone recordings are X equals AS, where A equals W inverse; equivalently, S equals W times X. So let's think about what the density of X is. I have P of S; I know the density of S, and X is a linear combination of the S's, so let's figure out the density of X. One thing we could try is to just figure out what S is and apply the density of S to that. Let's see: this would be the probability of S, so we just figure out what S is. S is W times X, so you might guess that the density of X is P of S applied to W times X. But this is wrong. It turns out you can do this for probability mass functions, but not for continuous densities. In particular, it's not correct to say that the density of X is just: you figure out what S is, and then apply the density of S to that. You can't do this with densities, because P of S is a probability density function. The right formula is the density of S applied to W times X, times the determinant of the matrix W.

Let me just illustrate that with an example. Let's say the density for S is the uniform density on the unit interval, so S is uniform over [0, 1]; the density for S is just the density of the uniform distribution on [0, 1]. Now let X be equal to two times S, so this means A equals two and W equals one half. If S is a uniform distribution over [0, 1], then X, which is two times that, will be the uniform distribution over the range from zero to two. So the density for X will be one half on the interval from zero to two: it's the indicator of zero less than or equal to X less than or equal to two, times W, which is one half. Does that make sense? When you compute the density for X, because X is now spread out across a wider range, the density of X is smaller, and that's why the density of X has this one half term here. Okay? This was an illustration for the case of one-dimensional random variables, where S and X are one-dimensional.
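Written out, the change-of-variables rule used here, together with the one-dimensional example, is the following (a restatement of the board work, using W = A inverse):

```latex
\[
p_X(x) = p_S(Wx)\,\lvert \det W \rvert .
\]
\[
\text{Example: } s \sim \mathrm{Uniform}[0,1],\; x = 2s,\; W = \tfrac{1}{2}:
\qquad
p_X(x) = p_S\!\bigl(\tfrac{x}{2}\bigr)\cdot\tfrac{1}{2}
       = \mathbf{1}\{0 \le x \le 2\}\cdot\tfrac{1}{2},
\]
```

which is exactly the Uniform[0, 2] density from the example.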
I'm not going to show it, but the generalization of this to vector-valued random variables is that the density of X is given by the density of S evaluated at W times X, times the determinant of the matrix W. What I showed over here was the one-dimensional version of this generalization.

So we're nearly there. Here's how I can implement ICA. For my distribution on S, I'm going to assume that the density on S is given by a product over the n speakers of the density of speaker i emitting a certain sound. This is a product of densities, a product of distributions, because I'm going to assume that the speakers are having independent conversations, so the SI's are independent for different values of i. So by the formula we just worked out, the density for X would be equal to that. I'll just remind you, W was A inverse; it was the matrix I defined previously so that SI equals WI transpose X. So that's what's in there.

To complete my formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. So following the discussion I had right when I introduced ICA, one thing I could do is choose the density for S, or equivalently, I could choose the CDF, the cumulative distribution function, for S. In this case, I'm going to choose a CDF, probably for historical reasons and probably for convenience. I need to choose the CDF for S, so what that means is I just need to choose some function that increases from zero to one. I know I can't choose a Gaussian, because we know you can't do ICA on Gaussian data. So I need some function increasing from zero to one that is not the cumulative distribution function of a Gaussian distribution. So what other functions do I know that increase from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice that is made for convenience. There's actually no great reason for why you choose a sigmoid function; it's just a convenient function that we all know and are familiar with that happens to increase from zero to one.
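Putting the pieces together, the model on the board is the following (restated here; g is the sigmoid CDF chosen above, and w_j transpose denotes the j-th row of W):

```latex
\[
p_S(s) = \prod_{j=1}^{n} p_s(s_j),
\qquad
p_X(x) = \Bigl(\,\prod_{j=1}^{n} p_s\!\bigl(w_j^{\top} x\bigr)\Bigr)\,\lvert \det W \rvert,
\qquad
g(s) = \frac{1}{1 + e^{-s}},
\quad
p_s(s) = g'(s) = g(s)\bigl(1 - g(s)\bigr).
\]
```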
When you take the derivative of the sigmoid, that gives you back your density. This density is just not Gaussian, and that's the main virtue of choosing the sigmoid. So there's really no rationale for the choice of the sigmoid; lots of other things will work fine, too. It's just a common, reasonable default.

It turns out that one reason the sigmoid works well for a lot of data sources is the following. If this is the Gaussian, and you take the sigmoid and take its derivative, you find that the sigmoid has fatter tails than the Gaussian. By this I mean the density you get from the sigmoid dies down to zero much more slowly than the Gaussian: the magnitude of the Gaussian's tails dies down as e to the minus s squared, whereas for the sigmoid the tails look like e to the minus s. So the tails die down as e to the minus s rather than e to the minus s squared. It turns out that most distributions with this property, with heavier tails, where the distribution decays to zero relatively slowly compared to a Gaussian, will work fine for your data. Actually, one other choice you can sometimes use is what's called the Laplacian distribution, whose density is proportional to e to the minus the absolute value of s. This will work fine, too, for many data sources.

Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set, and as you saw, this is an unlabeled training set, I can write down the log likelihood of my parameters: it's the sum over my training examples of the sum over the sources of the log of the source density applied to WI transpose X, plus the log of the determinant of W. So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent, which is just that. It turns out, if you work through the math, let's see: if P of S is equal to the derivative of the sigmoid, then if you just work through the math to compute the gradient there (you've all done this a lot of times, so I won't bother to show the details), you find that the update is equal to this. Okay? You can work that out yourself; it's just math to compute the derivative of this with respect to W. So to summarize: given the training set, here's my stochastic gradient ascent update rule, and you run stochastic gradient ascent to learn the parameters W.
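Here is a minimal sketch of that procedure, not code from the lecture. Working out the gradient of the log likelihood above gives the per-example update W := W + alpha * ((1 - 2 g(W x)) x^T + (W^T)^{-1}). The function name, learning rate, epoch count, and the toy mixing matrix in the demo at the end are all made-up choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd(X, alpha=0.001, n_epochs=30, seed=0):
    """Maximum-likelihood ICA with a sigmoid CDF on the sources.

    X: (m, n) array of unlabeled examples x^(i) (the microphone recordings).
    Returns the unmixing matrix W (n x n), so that s^(i) is roughly W x^(i).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)                        # start from the identity unmixing matrix
    for _ in range(n_epochs):
        for i in rng.permutation(m):     # stochastic gradient ascent, one example at a time
            x = X[i].reshape(n, 1)
            # gradient of sum_j log g'(w_j^T x):  (1 - 2 g(W x)) x^T
            grad = (1.0 - 2.0 * sigmoid(W @ x)) @ x.T
            # gradient of log |det W|:  (W^T)^{-1}
            grad += np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Hypothetical end-to-end check: mix two uniform sources, then unmix them.
rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))    # independent, non-Gaussian sources
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])                     # made-up square mixing matrix
X = S @ A.T                                    # recordings x^(i) = A s^(i)
W = ica_sgd(X)
S_hat = X @ W.T                                # recovered sources s^(i) = W x^(i)
# S_hat should match S up to permutation of the columns, sign, and scale.
```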
After you're done, you then output SI equals W times XI, and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity.

Okay? So just switch back to the laptop, please. We'll just wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of physiological recording data. This is a picture of an EEG cap, in which there are a number of electrodes you place on, in this case, Katie's scalp, where each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures changes in voltage over time: the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to measure these very synchronized recordings. It turns out that when we look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats and from human eye blinks and so on. So, in cartoon terms, if you imagine placing the electrodes, or microphones, on my scalp, then each microphone is recording some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body does as well, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find out the independent components, what independent processes are going on in my brain. This is an example of running ICA. You find that a small number of components, like those shown up there, correspond to the heartbeat, where the arrows are; those are very periodic signals that come on occasionally and correspond to components of the heartbeat. You also find things like an eye blink component, corresponding to the signal generated when you blink your eyes.
By doing this, you can then subtract out the heartbeat and eye blink artifacts from the data, and now you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step and a common application of ICA.

The last example is an application, again from [inaudible], of running ICA on small patches of natural images. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the basis vectors you get; this is a plot of the sources you get. So this algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or a linear combination, of independent sources of the things that make up images. So this models natural images as generated by independent components that produce different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out this is similar, in many ways, to the computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges are the independent components of images.

Are there quick questions, because I'm running late? Quick questions before I close?

Interviewee: [Inaudible] square matrix?

Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assumed A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. The odds are that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low. If you have two male speakers or two female speakers, then it's beyond the state of the art right now to separate them with one microphone. It's a great research problem. Okay, sorry about running late again. Let's close now, and we'll continue with reinforcement learning next time.