This presentation is delivered by the Stanford Center for Professional Development.

Welcome back. What I want to do today is continue the discussion of principal components analysis, or PCA. In particular, there's one more application that I didn't get to in the last lecture, on latent semantic indexing, or LSI. Then I want to spend a little time talking about how to implement PCA, especially for very large problems; in particular, I'll spend a bit of time on the singular value decomposition, or SVD, implementation of principal components analysis. In the second half of today's lecture, I want to talk about a different algorithm called independent component analysis, which is in some ways related to PCA, but in many other ways accomplishes very different things. With this lecture we'll wrap up our discussion of unsupervised learning; in the next lecture, we'll start to talk about reinforcement learning algorithms.

Just to recap where we were with PCA, principal components analysis: in PCA, we imagine that we have some very high-dimensional data that perhaps lies approximately on some low-dimensional subspace. So if you had a data set like this, you might find that that's the first principal component of the data, and that's the second principal component of this 2-D data.

To summarize the algorithm, we have three steps. The first step of PCA is to normalize the data to zero mean and unit variance: you subtract out the mean of your training examples, so they now have zero mean, and then you normalize each of your features so that the variance of each feature is one.

The next step is to compute the covariance matrix of your zero-mean data, Sigma = (1/m) times the sum over i of x^(i) x^(i) transpose, the sum of all the outer products. Then you find the top k eigenvectors of Sigma.

So last time we saw applications of this.
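To make the recap concrete, here is a minimal NumPy sketch of the three steps just listed (the function and variable names are my own, and this is the naive version that forms Sigma explicitly; the SVD alternative comes later in the lecture):

```python
import numpy as np

def pca_top_k(X, k):
    """Naive PCA: X is an m x n design matrix (one example per row)."""
    # Step 1: normalize to zero mean and unit variance per feature.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the zero-mean data, Sigma = (1/m) X^T X.
    m = X.shape[0]
    Sigma = X.T @ X / m
    # Step 3: top-k eigenvectors of Sigma; eigh returns ascending eigenvalues,
    # so reverse the columns to put the largest ones first.
    _, eigvecs = np.linalg.eigh(Sigma)
    return eigvecs[:, ::-1][:, :k]  # n x k matrix of principal components
```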
For example, one of the applications was eigenfaces, where each of your training examples x^(i) is an image. So if you have 100 by 100 images, that is, if your pictures of faces are 100 pixels by 100 pixels, then each of your training examples x^(i) will be a 10,000-dimensional vector, corresponding to the 10,000 grayscale intensity pixel values in each of your 100 by 100 images. So the eigenfaces application was one where the training examples comprised pictures of people's faces. We ran PCA, and then, to measure the distance between, say, a face here and a face there, we would project both face images onto the subspace and measure the distance along the subspace. In eigenfaces, you use something like 50 principal components.

The difficulty of working with problems like these is that in step two of the algorithm, we construct the covariance matrix Sigma. The covariance matrix now becomes a 10,000 by 10,000 matrix, which is huge: it has 100 million entries.

So we'd like to apply PCA to very, very high-dimensional data with the goal of reducing its dimension, but step two of this algorithm has this step where you construct Sigma, this extremely large matrix, which you can't really do. We'll come back to this in a second.

It turns out that one of the other frequently used applications of PCA is to text data. Here's what I mean. Remember our vectorial representation of emails? This is from way back when we were talking about supervised learning algorithms for spam classification. You remember I said that given a piece of email, or given a text document, you can represent it using a very high-dimensional vector by writing down a list of all the words in your dictionary: somewhere you have the word "learn", somewhere you have the word "study", and so on.
Depending on whether each word appears or does not appear in your text document, you put either a one or a zero there. This is the representation we used back in lecture five or lecture six for representing text documents when we were building [inaudible]-based classifiers for spam. It turns out that one of the common applications of PCA is to exactly this sort of text data representation as well. When you apply PCA to this sort of data, the resulting algorithm often just goes by a different name: latent semantic indexing.

For the sake of completeness, I should say that in LSI, you usually skip the preprocessing step. For various reasons, in LSI you usually don't normalize the mean of the data to zero, and you usually don't normalize the variance of the features to one. These are relatively minor differences, it turns out, so LSI does something very similar to PCA.

Normalizing the variance of each feature to one would actually be a bad idea for text data, because it would have the effect of dramatically scaling up the weight of rarely occurring words. For example, the word "aardvark" hardly ever appears in any document, so normalizing the variance of that feature to one would scale up the weight of the word aardvark dramatically, which is not what you want.

So in natural language, something we want to do quite often is, given two documents x^(i) and x^(j), measure how similar they are. For example, I may give you a document and ask you to find me more documents like this one. Say we're reading some article about some news event of today and want to find what other news articles there are. So I give you a document and ask you to look at all the other documents you have in this large set of documents and find the ones similar to this.
So this is a typical text application. To measure the similarity between two documents x^(i) and x^(j), where each document is represented as one of these high-dimensional vectors, one common way to do it is as follows. View each of your documents as a vector in a very high-dimensional space, where the dimension of the vector is equal to the number of words in your dictionary. So maybe each of these documents lives in some 50,000-dimensional space, if you have 50,000 words in your dictionary. One natural measure of the similarity between two documents that's often used is the angle between them.

In particular, if the angle between the two vectors is small, then we consider the two documents to be similar; if the angle between the two vectors is large, then we consider the documents to be dissimilar. More formally, one commonly used heuristic in natural language processing is to say that the similarity between the two documents is the cosine of the angle theta between them. Over the relevant range of values, anyway, the cosine is a decreasing function of theta, so the smaller the angle between them, the larger the similarity. The cosine between two vectors is, of course, just the inner product x^(i) transpose x^(j) divided by the product of the norms of x^(i) and x^(j). That's just the standard linear algebra or geometry definition of the cosine between two vectors.

Here's the intuition behind what LSI is doing. The hope, as usual, is that there may be some interesting axes of variation in the data, and there may be some other axes that are just noise. So by projecting all of your data onto a lower-dimensional subspace, that is, by running PCA on your text data this way, the hope is that you can remove some of the noise in the data and get better measures of the similarity between pairs of documents. Let's delve a little deeper into an example to convey more intuition about what LSI is doing. So look further at the definition of the cosine similarity measure.
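As a quick illustration of the cosine measure just defined, here is a small sketch (my own toy numbers) computing it for two 0/1 word-presence vectors:

```python
import numpy as np

def cosine_similarity(x_i, x_j):
    """Cosine of the angle between two bag-of-words vectors."""
    return float(x_i @ x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# Toy 0/1 word-presence vectors over a five-word dictionary.
doc_a = np.array([1, 0, 1, 0, 0])
doc_b = np.array([1, 1, 0, 0, 0])
print(cosine_similarity(doc_a, doc_b))  # 0.5, since one word is shared
```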
So the numerator of the similarity between the two documents was this inner product, which is the sum over k of x^(i)_k times x^(j)_k. This inner product will be equal to zero if the two documents have no words in common. It's really the sum over k of an indicator of whether documents i and j both contain the word k, because x^(i)_k indicates whether document i contains word k, and x^(j)_k indicates whether document j contains word k. So the product is one only if the word k appears in both documents. Therefore, the similarity between two documents will be zero if they have no words in common.

For example, suppose your document x^(i) has the word "study" and the document x^(j) has the word "learn". Then these two documents may be considered entirely dissimilar. Say you read a news article about effective study strategies, and you ask what other documents are similar to this one. If there are a bunch of other documents about good methods to learn, but they have no words in common with the first one, the similarity comes out as zero.

So here's a cartoon of what we hope PCA will do. Suppose that on the horizontal axis I plot the word "learn", and on the vertical axis I plot the word "study". The values take on either zero or one. So if a document contains the word learn but not study, it'll be plotted over there, and if a document contains neither the word study nor the word learn, it'll be plotted at (0, 0).

Here's the cartoon behind what PCA is doing: we identify a lower-dimensional subspace, which would be the eigenvector we get out of PCA. Now suppose we have a document about learning and a document about studying. The document about learning points to the right; the document about studying points up.
So the inner product, or the cosine of the angle, between these two documents, excuse me, the inner product between these two documents, will be zero. So these two documents are treated as entirely unrelated, which is not what we want: documents about studying and documents about learning are related. But if we take these two documents and project them onto this subspace, then the two documents become much closer together, and the algorithm will recognize that when you take the inner product between the two projected documents, you actually end up with a positive number. So LSI enables our algorithm to recognize that these two documents have some positive similarity between them.

That's the intuition about what PCA may be doing to text data. The same thing goes for examples other than the words study and learn. Say you find a document about politicians and a document with the names of prominent politicians: that will also bring the documents closer together. For any related topics, the points end up closer together in this lower-dimensional space.
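Here is a tiny numerical sketch of the cartoon just described (my own toy numbers): two documents with no words in common have inner product zero, but after projecting onto a one-dimensional subspace along the learn/study direction, their inner product becomes positive.

```python
import numpy as np

# Axes: [learn, study]. One document contains only "learn", the other only "study".
doc_learn = np.array([1.0, 0.0])
doc_study = np.array([0.0, 1.0])
print(doc_learn @ doc_study)          # 0.0, no words in common

# A hypothetical principal direction that "learn" and "study" both load onto.
u = np.array([1.0, 1.0]) / np.sqrt(2)

# Project each document onto the subspace spanned by u.
proj_learn = (doc_learn @ u) * u
proj_study = (doc_study @ u) * u
print(proj_learn @ proj_study)        # 0.5, positive similarity after projection
```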
Any questions about this? [Student question, inaudible.] Which one? This one? No, the line. Oh, this one. Oh, yes. Thank you.

So let's talk about how to actually implement this now. Okay, how many of you know what an SVD, or singular value decomposition, is? Wow, that's a lot of you, a lot more than I thought. Curious: did you guys learn it as undergrads or as graduate students?

All right, let me talk about it anyway. I wasn't expecting so many of you to know what SVD is, but I want to get this on tape, just so everyone else can learn about it, too. So I'll say a little bit about how to implement PCA. The problem I was alluding to just now is that when you have these very high-dimensional vectors, Sigma is a very large matrix. In particular, for our text example, if the vectors x^(i) are 50,000-dimensional, then the covariance matrix will be 50,000 by 50,000, which is much too big to represent explicitly.

I guess many of you already know this, but I'll just say it anyway. It turns out there's another way to implement PCA. If A is any m-by-n matrix, then one of the most remarkable results of linear algebra is that A can always be decomposed into a singular value decomposition, a product of three matrices: A = U D V^T, where U is m by m, V is n by n, and D is diagonal, with zeros on the off-diagonals. The values sigma_i on the diagonal of D are called the singular values of the matrix A.

Almost all of you said you learned this as graduate students rather than as undergrads. It turns out that when you take a class in undergraduate linear algebra, you usually learn a bunch of decompositions, so you usually learn about the QR decomposition, maybe the LU factorization of matrices, but most undergrad courses don't get to talk about singular value decompositions. Yet in almost everything I do in machine learning, you actually find that you end up using SVDs much more than any of the decompositions you learned in a typical undergrad linear algebra class. Personally, I've used an SVD dozens of times in the last year, whereas I think I've used a QR decomposition once and an LU decomposition once in the last year.

So let me say a bit more about this; I'm going to draw the picture, I guess. For example, a matrix A, which is m by n, can be decomposed into a matrix U, which is m by m, times a diagonal matrix D,
times another square matrix, V transpose, which is n by n. Furthermore, in a singular value decomposition, the columns of the matrix U will be the eigenvectors of A A transpose, and the columns of V will be the eigenvectors of A transpose A.

To compute it, you just use the SVD command in Matlab or Octave. The state of the art in numerical linear algebra today is that singular value decompositions of matrices can be computed extremely reliably. We've used a package like Matlab or Octave to compute, say, the eigenvectors of a matrix; SVD routines are even more numerically stable than eigenvector routines for finding the eigenvectors of a matrix. So you can safely use a routine like this, similar to the way you use a square root command without thinking about how it's computed: you compute the square root of something and just don't worry about it, because you know the computer will give you the right answer. For most reasonably sized matrices, even up to thousands-by-thousands matrices, I think of the SVD routine like a square root function: if you call it, it'll give you back the right answer, and you don't have to worry too much about it. If you have extremely large matrices, like a million by a million, I might start to worry a bit, but for a few thousand by a few thousand, this is implemented very well today.

[Student:] What's the complexity of SVD? [Instructor:] That's a good question. I actually don't know; I want to guess it's roughly on the order of n cubed, but I'm not sure. I don't know what's known about the convergence of these algorithms.

The example I drew out was for a fat matrix, a matrix that's wider than it is tall. In the same way, you can also call SVD on a tall matrix, one that's taller than it is wide, and it decomposes into a product of three matrices like that. The nice thing about this is that we can use it to compute eigenvectors and PCA very efficiently.
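Here is a small NumPy check (my own, not from the lecture) of the property just stated: the columns of U and V from the SVD of a random matrix are eigenvectors of A A^T and A^T A respectively, with the squared singular values as eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # a fat (wider-than-tall) matrix

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 4x4, Vt is 6x6

# Columns of U are eigenvectors of A A^T, with eigenvalues s**2.
print(np.allclose(A @ A.T @ U, U * s**2))          # True

# Columns of V (rows of Vt) for the nonzero singular values are
# eigenvectors of A^T A, again with eigenvalues s**2.
V4 = Vt[:4].T
print(np.allclose(A.T @ A @ V4, V4 * s**2))        # True
```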
In particular, the covariance matrix Sigma was this: it was the sum of all the outer products. So go back and recall the definition of the design matrix. I think I described this in lecture two, when we derived the closed-form solution to least squares. The design matrix X was the matrix where I took my examples x^(i) and stacked them in rows; we call this the design matrix. If you construct the design matrix, then the covariance matrix Sigma can be written as just X transpose X (up to the 1/m scaling, which doesn't affect the eigenvectors). I hope you see why X transpose X gives you the sum of outer products of vectors; if you aren't seeing it right now, just go home and convince yourself that it's true.

To get the top k eigenvectors of Sigma, you would take the matrix X and compute its SVD, so you get U S V transpose. Then the top k columns of U are the top k eigenvectors of X transpose X, which are therefore the top k eigenvectors of your covariance matrix Sigma.

So in our example, if you have 50,000 words in your dictionary, the design matrix would be in R^(m by 50,000), say 100 by 50,000 if you have 100 examples. So X is quite tractable to represent and to compute the SVD of, whereas the matrix Sigma, which is 50,000 by 50,000, would be much harder to represent. So this gives you an efficient way to implement PCA. The reason I want to talk about this is that in previous years I didn't, and
in the class projects I found a number of students trying to implement this on huge problems and [inaudible]. So this is a much better way to implement PCA if you have extremely high-dimensional data. If you have low-dimensional data, if you have 50- or 100-dimensional data, then computing Sigma is no problem and you can do it the old way; but otherwise, use the SVD to implement this.

Questions about this?

The last thing I want to say is a note of caution about implementing this in practice. It turns out that for many applications, when you apply SVD to these wide matrices... yes? [Student:] Just a quick question: are those the top k columns of U or of V? Because X transpose X involves V transpose, right? [Instructor:] Let's see. Oh, yes, I think you're right. Is it the top k columns of U or the top k of V? Yeah, I think you're right. [Student:] So then X transpose X should be V D D ... X is U D V transpose, so X transpose X would be... [Instructor:] If anyone thinks about this and has another opinion, let me know, but I think you're right. I'll make sure I get the details and let you know. Everyone's still looking at that. Tom, can you figure out the right answer and let me know? That sounds right. Okay, cool.
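Pulling together the design-matrix trick and the correction just arrived at (the top k eigenvectors of X^T X are the first k columns of V, not U), here is a hedged NumPy sketch of PCA via the SVD; the function name is my own.

```python
import numpy as np

def pca_via_svd(X, k):
    """PCA for high-dimensional data without ever forming the n x n
    covariance matrix: X is the m x n design matrix (examples in rows,
    already normalized), and we take its SVD instead."""
    # full_matrices=False is NumPy's "economy" convention; other packages
    # make slightly different choices here, as discussed just below.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # X = U diag(s) Vt, so X^T X = V diag(s)^2 V^T: the rows of Vt
    # (columns of V) are the eigenvectors of Sigma, in decreasing order.
    return Vt[:k].T          # n x k matrix of the top-k principal components
```

For the text example above, this works directly on the 100 by 50,000 matrix X rather than on a 50,000 by 50,000 Sigma.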
So just one last note, a note of caution. It turns out that in this example, I was implementing SVD with a wide matrix: the matrix X had many more columns than rows. It turns out that when you find the SVD decomposition of such a matrix, the right-most block of the diagonal matrix D is all zeros; a large part of that matrix is zeros. So depending on what convention you use, for example, I think Matlab actually uses a convention of just cutting off the zero elements, chopping off the zero portion of the D matrix and the corresponding portion of the other factors. I'm not sure if this even depends on the version of Matlab, but when you call SVD in Matlab or some other numerical algebra package, there are slightly different conventions for how to define your SVD when the matrix is wider than it is tall. So just watch out for this, and make sure you map whatever convention your numerical algebra library uses onto the original computations. Whether you're using Matlab or writing C code, there are many scientific libraries that can compute SVDs for you, but they differ slightly in their conventions for the dimensions of these matrices, so just make sure you figure this out for the package that you use.

Finally, I just want to take the unsupervised learning algorithms we've talked about and put them in a little bit of broader context. This is partly in response to the questions I've gotten from students in office hours and elsewhere about when to use each of these algorithms. So I'm going to draw a two-by-two matrix; this is a little cartoon that I find useful.

One of the algorithms we talked about earlier, right before this, was factor analysis. I hope you remember that picture I drew, where I had a bunch of points z on a line and then drew these ellipses. That was the factor analysis model, which models the density of x. And then there was PCA, just now. The difference between factor analysis and PCA, the way I think about it, is that factor analysis is a density estimation algorithm: it tries to model the density of the training examples x. Whereas PCA is not a probabilistic algorithm; in particular, it does not endow your training examples with any probabilistic distribution, and it directly goes to find the subspace.
So in terms of when to use factor analysis and when to use PCA: if your goal is to reduce the dimension of the data, if your goal is to find the subspace that the data lies on, then PCA directly tries to find that subspace, and I would tend to use PCA. Factor analysis sort of assumes the data lies on a subspace; let me write "subspace" here. So both of these algorithms assume the data maybe lies close to or on some low-dimensional subspace, but fundamentally, I think of factor analysis as a density estimation algorithm: if you have some very high-dimensional distribution and want to model p(x), then factor analysis is the algorithm I'm more inclined to use. Even though you could in theory, I would tend to avoid using factor analysis to identify the subspace the data set lies on. Conversely, if you want to do anomaly detection, if you want to model p(x) so that when an example has very low probability you can flag it as an anomaly, then I would tend to use factor analysis for that density estimation.

So factor analysis and PCA are both algorithms that assume your data lies on or near a subspace. The other class of algorithms we talked about assumes the data lies in clumps, or that the data has a few coherent groups. So let me fill in the rest of this picture. If you think your data lies in clumps or groups, and you want to do density estimation, then I would tend to use a mixture of Gaussians algorithm. But if you don't necessarily want to endow your data with any probabilistic semantics, and you just want to find the clumps or the groups, then I'd be inclined to use the k-means algorithm.

I haven't seen anyone else draw this picture before, but I tend to organize these things this way in my brain. Hopefully this helps guide when you might use each of these algorithms, depending on whether you believe the data might lie on a subspace or whether it might lie in clumps or groups.

All right. That wraps up the discussion on PCA.
What I want to do next is talk about independent component analysis, or ICA. Yes? [Student:] I have a question about the upper right [inaudible]. Once you have all of the eigenvectors, [inaudible] how similar feature i is to feature j: you pick some eigenvector, and you take dot products between feature i and feature j and that eigenvector. But there are a lot of eigenvectors to choose from. [Instructor (Andrew Ng):] Right. So Justin's question was: having found my eigenvectors, how do I choose which eigenvector to use to measure distance? The answer is really that in this cartoon, I would avoid thinking about individual eigenvectors. A better way to view this cartoon is that if I decide to choose 100 eigenvectors, this is really a 100-dimensional subspace. So I'm not actually projecting my data onto one eigenvector; this arrow in the cartoon denotes the 100-dimensional subspace spanned by all my eigenvectors. What I actually do is project my data onto the span, the linear span, of the eigenvectors, and then I measure distances, or take inner products, between the projections of the two points onto that subspace. Okay.
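Here is a small sketch (my own illustration, with made-up dimensions and a stand-in basis) of the procedure just described: project two examples onto the span of the top-k eigenvectors and measure the distance, or inner product, between the projections.

```python
import numpy as np

def project_onto_span(x, V_k):
    """Project x onto the subspace spanned by the columns of V_k,
    an n x k matrix of orthonormal eigenvectors (principal components)."""
    return V_k @ (V_k.T @ x)

# Hypothetical setup: n = 10,000-dimensional examples, k = 50 components.
rng = np.random.default_rng(0)
V_k, _ = np.linalg.qr(rng.standard_normal((10_000, 50)))  # stand-in orthonormal basis
x_i = rng.standard_normal(10_000)
x_j = rng.standard_normal(10_000)

p_i = project_onto_span(x_i, V_k)
p_j = project_onto_span(x_j, V_k)
distance = np.linalg.norm(p_i - p_j)   # distance measured along the subspace
similarity = p_i @ p_j                 # or an inner product between projections
```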
So let's talk about ICA, independent component analysis. Whereas PCA was an algorithm for finding what I call the main axes of variation of the data, in ICA we're going to try to find the independent components of variation in the data. Switch to the laptop display there, please; we'll just take a second to motivate that. This is actually a slide that I showed in lecture one, of the cocktail party problem. Suppose you have two speakers at a cocktail party, and you have two microphones in the room recording overlapping versions of the two conversations. Can you separate out the two original speaker sources? I actually played this audio in the very first lecture as well. Suppose microphone one records this.

[Recording]

So these really are two speakers, speaking independently of each other: each speaker is outputting a series of sound signals independent of the other conversation going on in the room. This being an unsupervised learning problem, the question is, can we take these two microphone recordings and feed them to an algorithm to find the independent components in this data? This is the output when we do so. [Recording] This is the other one. [Recording]

Just for fun, these are audio clips I got from [inaudible]. Let me play the other ones as well. This is overlapping microphone one. [Recording] Here's microphone two. [Recording] So given this as input, here's output one. [Recording] It's not perfect, but it's largely cleaned up the music. Here's number two. [Recording] Okay, switch back to the chalkboard, please.

So what I want to do now is describe an algorithm that does that. Before I actually jump into the algorithm, I want to spend two minutes on CDFs, cumulative distribution functions. I know most of you know what these are, but I'm just going to remind you. Suppose you have a one-dimensional random variable S, and suppose it has a probability density function p_S(s). Then the CDF is defined as the function F(s), which is the probability that the random variable S is less than the value given by that lowercase s. For example, if this is your Gaussian density, usually denoted lowercase phi, that's roughly a bell-shaped density, then the CDF of the Gaussian will look something like this; it's the function usually denoted capital Phi. If I pick a value s like that, then the height of the CDF at that point is the probability that my Gaussian random variable is less than that value.
In other words, the height of the function at that point is the area under the Gaussian density up to the point s. As you move further and further to the right, this function approaches one, as you integrate more and more of the area under the Gaussian. So another way to write F(s) is as the integral from minus infinity to s of the density: F(s) = integral from minus infinity to s of p(t) dt.

Something that will come up later: suppose I have a random variable S and I want to model its distribution. One thing I can do is specify what I think the density is; or I can specify what the CDF is. These are related by this equation: F is the integral of the density of S. You can also recover the density from the CDF by taking the derivative: F prime, the derivative of the CDF, gives you back the density.

This will come up in the middle of the ICA derivation: there'll be a step where we need to assume a distribution for the random variable S. I can either specify the density for S directly, or I can specify the CDF; I'll choose to specify the CDF. It has to be some function increasing from zero to one, and you can choose any function that looks like that. In particular, pulling a function out of a hat that looks like that, you can, for instance, choose the sigmoid function as the CDF. That would be one way of specifying the distribution of the random variable S. So this will come up later. Just raise your hand if this is familiar to you, if you've seen it before. Great.
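As a small sketch of the point just made (my own example, not the lecture's notation): if you pick the sigmoid as the CDF of S, differentiating it gives you the density you have implicitly assumed.

```python
import numpy as np

def sigmoid_cdf(s):
    """A choice of CDF for S: increasing from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-s))

def implied_density(s, eps=1e-6):
    """Recover the density as the derivative of the CDF, p(s) = F'(s),
    here via a simple finite difference."""
    return (sigmoid_cdf(s + eps) - sigmoid_cdf(s - eps)) / (2 * eps)

s = np.linspace(-5, 5, 1001)
p = implied_density(s)
print(p.sum() * (s[1] - s[0]))   # roughly 1: the derivative of a CDF integrates to 1
```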
So let's start to derive our ICA, or independent component analysis, algorithm. Let's assume that the data comes from n original sources; say there are n speakers at a cocktail party. The original sources I'm going to write as a vector s in R^n. Just to be concrete about what I mean by that, I'm going to use s_j^(i) to denote the signal from speaker j at time i.

Here's what I mean. What is sound? When you hear sound waves, sound is created by a pattern of expansions and compressions in air. The way you're hearing my voice is that my mouth is causing certain changes in the air pressure, and your ear detects those changes in air pressure. So what a microphone records, what my mouth is generating, is a pattern of changes in air pressure; I'm going to draw a cartoon, I guess. That's what sound is: you look at a microphone recording and you see these roughly periodic signals that comprise changes in air pressure over time, as the air pressure goes above and below some baseline. So this is what the speech signal looks like, say. This is speaker one, and this is some time i. What I'm saying is that I'm going to denote the value at that point as s, superscript i, subscript one. Similarly, speaker two is outputting some sound wave; if you played that one by itself it would actually sound like a single tone, I guess. In the same way, at the same time i, the value of the air pressure generated by speaker two I'll denote as s_2^(i).

So we observe x^(i) = A s^(i), where these x^(i) are vectors in R^n. I'm going to assume that I have n microphones, and each of my microphones records some linear combination of what the speakers are saying; each microphone records some overlapping combination of what the speakers are saying. For example, x_j^(i) is what microphone j records at time i, and by the definition of matrix multiplication, this is the sum over k of A_jk times s_k^(i).
So what my j-th microphone is recording is some linear combination of all of the speakers: at time i, what microphone j records is some linear combination of what all the speakers are saying at time i, where k indexes over the n speakers.

Our goal is to find the matrix W = A inverse; we just define W that way. Then we can recover the original sources as a linear combination of our microphone recordings x^(i), namely s^(i) = W x^(i). Just as a point of notation, I'm going to write the matrix W this way: I'll use lowercase w_1 transpose, w_2 transpose, and so on to denote the rows of this matrix W.

So let's look at why ICA is possible. Given these overlapping voices, let's think briefly about why it might be possible to recover the original sources. For the next example, let's say that each of my speakers outputs uniform white noise; this will sound like static. Can I switch to the laptop display, please? If that's the case, then with these axes, s_1 and s_2, this is what my two speakers would be uttering: a plot of what they utter fills a square box if the two speakers are independently outputting uniform [-1, 1] random variables. So this is a plot of s_1 and s_2, my original sources, and this would be a typical sample of what my microphones record, with x_1 and x_2 on the axes. These are images I got from [inaudible] on ICA. Given a picture like this, you can sort of look at the parallelogram and tell what its axes are, and you can figure out what linear transformation would transform the parallelogram back into a box.
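Here is a minimal simulation (my own toy numbers, including the mixing matrix) of the setup just described: two independent uniform sources mixed by a matrix A produce the parallelogram-shaped cloud the microphones see, and, if A were known, applying W = A inverse would map it back to the square.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                                   # number of time samples

# Two independent speakers, each uniform on [-1, 1]: the "square box".
S = rng.uniform(-1.0, 1.0, size=(m, 2))

# A hypothetical mixing matrix (unknown to the algorithm in practice).
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])

X = S @ A.T                                # microphone recordings x^(i) = A s^(i)
W = np.linalg.inv(A)                       # the unmixing matrix ICA tries to estimate
S_rec = X @ W.T                            # recovers the sources exactly when A is known
print(np.allclose(S_rec, S))               # True
```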
So it turns out there are some inherent ambiguities in ICA; I'll just say what they are. One is that you can't recover the original indexing of the sources. In particular, if I generated the data for speaker one and speaker two, you can run ICA and end up with the order of the speakers reversed. What that corresponds to is taking this picture and flipping it along a 45-degree axis: if you reflect the picture across a 45-degree axis, you still get a box. So there's no way for the algorithm to tell which was speaker number one and which was speaker number two; the numbering, or ordering, of the speakers is ambiguous.

The other source of ambiguity, and these are the only ambiguities in this example, is the sign of the sources. Given my microphone recordings, you can't tell whether you got back a positive s^(i) or a negative s^(i). In this picture, what that corresponds to is that if you take the picture and reflect it along the vertical axis, or along the horizontal axis, you still get a box. You still get back [inaudible] speakers. So in this example, you can't guarantee that you've recovered positive s^(i) rather than negative s^(i).

So these are the only two ambiguities in this example: one is the permutation of the speakers, and the other is the sign of the speakers. The permutation of the speakers, there's not much you can do about that. And it turns out that if you take an audio source, flip its sign to get negative s, and play it back, it will sound indistinguishable. So for many of the applications we care about, the sign as well as the permutation is ambiguous, but you don't really care about it. Let's switch back to the chalkboard, please.
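A small continuation of the same sketch, illustrating the two ambiguities just described: relabeling the speakers and flipping a sign, with a compensating change to the (made-up) mixing matrix, produces exactly the same recordings, so no algorithm looking only at x could distinguish the two explanations.

# Permutation / sign ambiguity: relabeling the speakers and flipping the
# sign of a source, while adjusting A accordingly, leaves the microphone
# recordings x unchanged, so no algorithm can undo either choice.
P = np.array([[0.0, 1.0],    # swap the two speakers
              [1.0, 0.0]])
D = np.diag([1.0, -1.0])     # flip the sign of one source

s_alt = s @ (P @ D).T              # relabeled / sign-flipped sources
A_alt = A @ np.linalg.inv(P @ D)   # compensating mixing matrix

assert np.allclose(s_alt @ A_alt.T, x)   # identical recordings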
It turns out, and I don't want to spend too much time on this, but I do want to say it briefly, that the reason those are the only sources of ambiguity, the permutation of the speakers and the signs, is that the s_j^(i)'s were non-Gaussian. Suppose instead my original sources, s1 and s2, were Gaussian. So suppose s^(i) is Gaussian with mean zero and identity covariance. That just means each of my speakers outputs a Gaussian random variable. Here's a typical example of Gaussian data. You will recall that the contours of a Gaussian distribution with identity covariance look like this, right? The Gaussian is a spherically symmetric distribution. So if my speakers were outputting Gaussian random variables, then if I observe a linear combination of them, there's actually no way to recover the original sources, because there's no way for me to tell whether the axes are at this angle or at that angle, and so on. The Gaussian is a rotationally symmetric distribution, so I would not be able to recover the orientation, the rotation, of the data. I don't want to dwell on this too long, but it turns out that if your sources are Gaussian, it's actually impossible to do ICA. ICA relies critically on your data being non-Gaussian, because if the data were Gaussian, then the rotation of the data would be ambiguous. So regardless of how much data you have, even with infinitely large amounts of data, you would not be able to recover the matrix A or W.

Let's go ahead and derive the algorithm. To do this, I need just one more result, and then the derivation will be three lines. Let s be a vector in R^n, the joint vector of the sounds that all of my speakers are emitting at any time, and let's say the density of s is p subscript s, so p_s(s). My microphones record x = A s, which equals W^{-1} s; equivalently, s = W x. So let's think about what the density of x is. I know the density of s, and x is a linear combination of the s's, so let's figure out the density of x. One thing we could try is to figure out what s is and just apply the density of s to W x. So let's see.
The naive reasoning would be: this is the probability of s, so we just figure out what s is; s is W times x, so the density of x must be p_s(W x). But this is wrong. It turns out you can do this for probability mass functions, but not for continuous densities. In particular, it's not correct to say that the density of x is obtained by just figuring out what s is and then applying the density of s to that. You can't do this with densities, because p_s is a probability density function. The right formula is the density of s applied to W x, times the determinant of the matrix W: p_X(x) = p_s(W x) |W|.

Let me just illustrate that with an example. Let's say the density of s is the uniform density over the unit interval, p_s(s) = 1{0 <= s <= 1}; it's just the density of the uniform distribution on [0, 1]. Now let x be equal to two times s, so A = 2 and W = 1/2. If s is uniform over [0, 1], then x, which is two times that, is uniformly distributed over the range from zero to two. So the density of x looks like this: between zero and two its height is one half. In other words, the density of x is the indicator 1{0 <= x <= 2} times one half, which is exactly p_s(W x) times W. Does that make sense? When you compute the density of x, because x is now spread out across a wider range, the density of x is smaller, and that is why the density of x has this one-half term here, okay?

This was an illustration for the case of one-dimensional random variables, where s and x are one-dimensional. I'm not going to show it, but the generalization to vector-valued random variables is that the density of x is given by p_X(x) = p_s(W x) times the determinant of the matrix W; over here I showed the one-dimensional special case. So we're nearly there.
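Here is a tiny numerical check of the one-dimensional example above; the test point 1.5 is an arbitrary choice.

# Change-of-variables check: s ~ Uniform[0, 1], x = 2s, so A = 2, W = 1/2.
# The correct density is p_X(x) = p_s(W x) * |W|, not p_s(W x) alone.
def p_s(s):
    return 1.0 if 0.0 <= s <= 1.0 else 0.0   # density of Uniform[0, 1]

A_scalar = 2.0
W_scalar = 1.0 / A_scalar

def p_x_wrong(x):
    return p_s(W_scalar * x)                 # forgets the |W| factor

def p_x_right(x):
    return p_s(W_scalar * x) * abs(W_scalar)

# x is Uniform[0, 2], so its true density on [0, 2] is 1/2:
print(p_x_wrong(1.5))    # 1.0, wrong; this "density" would integrate to 2
print(p_x_right(1.5))    # 0.5, correct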
Here's how I can implement ICA. For my distribution on s, I'm going to assume that the density of s is given by a product over the n speakers of the density of each speaker emitting a certain sound: p(s) = product over j of p_s(s_j). This is a product of densities because I'm going to assume the speakers are having independent conversations, so the s_j's are independent for different values of j. So by the formula we just worked out, the density for x is p(x) = product over j of p_s(w_j^T x), times |W|. I'll just remind you, W is A inverse, the matrix I defined previously, so that s_j = w_j^T x; that's what's inside there.

To complete the formulation of this model, the final thing I need to do is choose a density for what I think each speaker is saying; I need to assume some density over the sounds emitted by an individual speaker. So following the discussion I had right when we [inaudible] ICA, one thing I could do is choose the density for s, or equivalently, I could choose the CDF, the cumulative distribution function, for s. In this case I'm going to choose a CDF, probably for historical reasons and for convenience. What that means is I just need to choose some function that increases from zero to one. I know I can't choose the Gaussian, because we know you can't do ICA on Gaussian data. So I need some function increasing from zero to one that is not the cumulative distribution function of a Gaussian distribution. So what other function do I know that increases from zero to one? I'll just choose the CDF to be the sigmoid function. This is a commonly made choice, made for convenience; there's actually no great reason for choosing the sigmoid. It's just a convenient function that we all know and are familiar with, and that happens to increase from zero to one.
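For concreteness, a small sketch of this modeling choice: the sigmoid as the assumed source CDF, with its derivative (used in the next step) as the corresponding density.

import numpy as np

def g(s):
    # sigmoid CDF: increases from 0 to 1, and is not the Gaussian CDF
    return 1.0 / (1.0 + np.exp(-s))

def g_prime(s):
    # its derivative, g'(s) = g(s) (1 - g(s)), is the implied source density
    return g(s) * (1.0 - g(s))

print(g(-10.0), g(0.0), g(10.0))   # roughly 0, then 0.5, then roughly 1
print(g_prime(0.0))                # 0.25, the peak of the density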
When you take the derivative of the sigmoid, that gives you back your density. This is simply not Gaussian, and that is the main virtue of choosing the sigmoid. So there's really no deep rationale for the choice of the sigmoid; lots of other things will work fine, too. It's just a common, reasonable default.

It turns out that one reason the sigmoid works well for a lot of data sources is the following. Say this is the Gaussian. If you take the sigmoid and take its derivative, you find that the resulting density has heavier tails than the Gaussian; by this I mean the sigmoid-derived density dies down to zero much more slowly than the Gaussian. For the Gaussian, the magnitude of the tails dies down as e to the minus s squared; for the sigmoid, the tails look like e to the minus s. So the tails die down as e^{-s} rather than e^{-s^2}. It turns out that most distributions with this property, with heavy tails, where the distribution decays to zero relatively slowly compared to a Gaussian, will work fine for your data. Actually, one other choice you can sometimes use is the Laplacian distribution, p(s) = (1/2) e^{-|s|}. This will work fine, too, for many data sources.

Sticking with the sigmoid for now, I'll just write down the algorithm in two steps. Given my training set {x^(1), ..., x^(m)}, and as you see, this is an unlabeled training set, I can write down the log likelihood of my parameters, summed over my training examples:

l(W) = sum over i of ( sum over j of log g'(w_j^T x^(i)) + log |W| ).

So that's my log likelihood. To learn the parameters W of this model, I can use stochastic gradient ascent. It turns out, if you work through the math, with p_s equal to the derivative of the sigmoid, when you compute the gradient, and you've all done this sort of calculation many times so I won't show the details, you find the update is

W := W + alpha ( (1 - 2 g(W x^(i))) x^(i)T + (W^T)^{-1} ),

where 1 - 2 g(W x^(i)) is applied elementwise. Okay? You can work that out yourself; it's just calculus to compute the derivative of the log likelihood with respect to W. So to summarize, given the training set, that's my stochastic gradient ascent update rule.
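Below is a minimal sketch of the stochastic gradient ascent update just summarized, under the sigmoid-CDF source model. The learning rate, initialization, and number of passes are arbitrary choices, not values from the lecture; the call at the end assumes x is the (time by microphones) recording array from the earlier mixing sketch.

import numpy as np

def g(s):
    # sigmoid CDF (the source model chosen above)
    return 1.0 / (1.0 + np.exp(-s))

def ica_sga(X, alpha=0.01, n_passes=20, seed=0):
    # X: array of shape (T, n), one recording x^(i) per row
    T, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # arbitrary init
    for _ in range(n_passes):
        for i in rng.permutation(T):
            xi = X[i]
            # gradient of sum_j log g'(w_j^T x^(i)):  (1 - 2 g(W x)) x^T
            grad = np.outer(1.0 - 2.0 * g(W @ xi), xi)
            # gradient of log |det W|:  (W^T)^{-1}
            grad += np.linalg.inv(W.T)
            W += alpha * grad
    return W

W_hat = ica_sga(x)        # x: the microphone recordings from the sketch above
s_hat = x @ W_hat.T       # recovered sources, up to the ambiguities discussed

In practice you would tune or anneal the learning rate and check convergence; this sketch keeps only the two gradient terms from the update rule.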
So you run stochastic gradient ascent to learn the parameters W. After you're done, you output s^(i) = W x^(i), and you've separated your data back out into the original independent sources, hopefully up to only a permutation and a plus/minus sign ambiguity. Okay? So just switch back to the laptop, please.

We'll wrap up with a couple of examples of applications of ICA. This is actually a picture of our TA, Katie. One of the applications of ICA is to process various types of [inaudible] recording data, such as EEG data. This is a picture of an EEG cap, in which there are a number of electrodes you place, in this case, on Katie's scalp, and each electrode measures changes in voltage over time on the scalp. On the right is a typical example of EEG data, where each electrode measures changes in voltage over time; the horizontal axis is time, and the vertical axis is voltage. Here's the same thing, blown up a little bit. You notice there are artifacts in this data: where the data is circled, all the electrodes seem to measure these very synchronized recordings. It turns out that when you look at EEG data, as well as a number of other types of data, there are artifacts from heartbeats, from human eye blinks, and so on. So the cartoon is this: if you imagine placing the electrodes, or microphones, on my scalp, then each microphone records some overlapping combination of all the things happening in my brain or in my body. My brain has a number of different processes going on, my body has other things going on, and each electrode measures a sum of the different voices in my brain. That didn't quite come out the way I wanted it to. So we can just take this data, run ICA on it, and find out what the independent components, the independent processes, going on in my brain are. This is an example of running ICA.
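As an aside, here is a sketch of the kind of artifact-removal preprocessing described next, using scikit-learn's FastICA, which is a different ICA estimator than the gradient-ascent rule derived above. The "electrode" signals here are synthetic and the choice of which component to drop is made up; with real EEG you would identify the artifact components by inspection.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 2000)

# Synthetic stand-ins for brain activity plus heartbeat- and blink-like
# artifacts, mixed into three "electrode" channels (all made up).
sources = np.c_[np.sin(7.0 * t),                     # ongoing rhythm
                np.sign(np.sin(1.3 * t)),            # periodic "heartbeat"
                (rng.random(t.size) < 0.01) * 5.0]   # sparse "blinks"
A_mix = rng.random((3, 3)) + 0.5
X = sources @ A_mix.T

ica = FastICA(n_components=3, random_state=0)
S_hat = ica.fit_transform(X)          # estimated independent components

# Suppose inspection shows component 2 is the blink artifact: zero it out
# and map back to electrode space to get cleaned recordings.
S_clean = S_hat.copy()
S_clean[:, 2] = 0.0
X_clean = ica.inverse_transform(S_clean)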
So you find that a small number of components, like those shown up there, correspond to the heartbeat; those are very periodic signals that come on occasionally and correspond to the heartbeat component. You also find things like an eye-blink component, corresponding to the signal generated when you blink your eyes. By doing this, you can then subtract out the heartbeat and eye-blink artifacts from the data, and you get much cleaner EEG readings on which you can do further scientific studies. So this is a pretty commonly used preprocessing step and a common application of ICA.

Another example is this application, again from [inaudible], of running ICA on small natural image patches. Suppose I take natural images, run ICA on the data, and ask what the independent components of the data are. It turns out that these are the bases you get; this is a plot of the sources you get. The algorithm is saying that a natural image patch, shown on the left, is often expressed as a sum, or linear combination, of independent sources of the things that make up images. So this models natural images as generated by independent objects that generate different edges in the image. One of the fascinating things about this is that, in neuroscience, this has also been hypothesized as a model for how the human brain processes image data. It turns out that this is similar, in many ways, to computations happening in early visual processing in the human brain, in the mammalian brain. It's just interesting to see that edges are the independent components of images.

Are there quick questions, because I'm running late. Quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew Ng): Oh, yes. For the algorithms I described, I assumed A is a square matrix. It turns out that if you have more microphones than speakers, you can also apply very similar algorithms. If you have fewer microphones than speakers, that's sort of an open research problem. It turns out that if you have one male and one female speaker but only one microphone, you can sometimes sort of separate them, because one voice is high and one is low.
If you have two male speakers or two female speakers, then it's beyond the current state of the art to separate them with one microphone. It's a great research problem. Okay. Sorry about running late again. Let's close now, and we'll pick up with reinforcement learning in the next lecture.