
Lecture 15 | Machine Learning (Stanford)

  • 0:10 - 0:13
    This presentation is delivered by the Stanford Center for Professional
  • 0:13 - 0:20
    Development.
  • 0:29 - 0:30
    Welcome back.
  • 0:30 - 0:33
    What I want to do today is
  • 0:33 - 0:37
    continue a discussion of principal components analysis, or PCA.
  • 0:37 - 0:39
    In particular, there's
  • 0:39 - 0:42
    one more application that I didn't get to in the last lecture on
  • 0:42 - 0:44
    latent semantic indexing,
  • 0:44 - 0:47
    LSI. Then I want to spend just a little time talking about
  • 0:47 - 0:50
    how to implement PCA,
  • 0:50 - 0:54
    especially for very large problems. In particular, I'll spend just a little bit of time talking
  • 0:54 - 0:56
    about singular value decomposition,
  • 0:56 - 0:58
    or the SVD implementation
  • 0:58 - 1:00
    of principal component
  • 1:00 - 1:02
    analysis. So the
  • 1:02 - 1:07
    second half of today's lecture, I want to talk about a different algorithm called
  • 1:07 - 1:09
    independent component analysis,
  • 1:09 - 1:13
    which is, in some ways, related to PCA, but in many other ways, it
  • 1:13 - 1:16
    also manages to accomplish
  • 1:16 - 1:18
    very different things than PCA.
  • 1:18 - 1:22
    So with this lecture, this will actually wrap up our discussion on
  • 1:22 - 1:26
    unsupervised learning. The next lecture, we'll start to talk about
  • 1:26 - 1:31
    reinforcement learning algorithms.
  • 1:31 - 1:32
    Just to
  • 1:32 - 1:37
    recap where we were with PCA, principal
  • 1:37 - 1:39
    component analysis,
  • 1:39 - 1:45
    I said that
  • 1:45 - 1:49
    in PCA, we imagine that we have some very high dimensional data
  • 1:49 - 1:51
    that perhaps
  • 1:51 - 1:55
    lies approximately on some low dimensional subspace. So if you had the data set like
  • 1:55 - 1:56
    this,
  • 1:56 - 1:57
    you might find that
  • 1:57 - 2:01
    that's the first principal component of the data,
  • 2:01 - 2:03
    and that's the second
  • 2:03 - 2:06
    component of this 2-D data.
  • 2:07 - 2:09
    To
  • 2:09 - 2:13
    summarize the algorithm, we have three steps. The first step of PCA
  • 2:15 - 2:17
    was to
  • 2:17 - 2:20
    normalize the data to zero mean and
  • 2:20 - 2:26
    unit variance. So
  • 2:26 - 2:28
    subtract out the means of
  • 2:28 - 2:33
    your training examples, so they now have zero mean, and then normalize each of your features so
  • 2:33 - 2:36
    that the variance of each feature is now one.
  • 2:38 - 2:41
    The next step was to
  • 2:41 - 2:44
    compute the covariance matrix of your zero-mean data. So
  • 2:44 - 2:49
    you compute it as follows.
  • 2:49 - 2:51
    It's the sum of the outer products of your training examples,
  • 2:52 - 2:56
    and then you find the
  • 2:56 - 3:03
    top K eigenvectors of
  • 3:04 - 3:07
    sigma.
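(As a rough illustration of those three steps, here is a minimal NumPy sketch of plain PCA. This is my own example rather than code from the lecture; the names X, k, and pca are mine, and it assumes the data is low-dimensional enough that the covariance matrix fits in memory, with no constant features.)

```python
import numpy as np

def pca(X, k):
    """Plain PCA on an m-by-n data matrix X (one training example per row).
    Returns the top-k principal directions and the projected data."""
    # Step 1: normalize each feature to zero mean and unit variance
    # (assumes no feature has zero variance).
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix, sigma = (1/m) * sum of x^(i) x^(i)^T = (1/m) X^T X.
    m = Xn.shape[0]
    sigma = (Xn.T @ Xn) / m
    # Step 3: top-k eigenvectors of sigma (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(sigma)
    U = eigvecs[:, ::-1][:, :k]   # n-by-k matrix whose columns are the principal directions
    Z = Xn @ U                    # m-by-k: each example projected onto the subspace
    return U, Z
```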
  • 3:07 - 3:10
    So
  • 3:10 - 3:13
    last time we saw the applications of this. For example,
  • 3:13 - 3:18
    one of the applications was to eigenfaces, where
  • 3:18 - 3:23
    each of your training examples, XI, is an image.
  • 3:23 - 3:24
    So
  • 3:24 - 3:27
    if you have
  • 3:27 - 3:31
    100 by 100 images, if your pictures of faces are
  • 3:31 - 3:33
    100 pixels by 100 pixels,
  • 3:33 - 3:38
    then each of your training examples, XI,
  • 3:38 - 3:40
    will be a 10,000 dimensional vector,
  • 3:40 - 3:43
    corresponding to the
  • 3:43 - 3:48
    10,000 grayscale intensity pixel values. There are 10,000 pixel values in
  • 3:48 - 3:50
    each of your 100 by 100 images.
  • 3:50 - 3:53
    So the eigenfaces application was where
  • 3:53 - 3:56
    the training examples comprised
  • 3:56 - 3:58
    pictures of faces of people.
  • 3:58 - 4:00
    Then we ran PCA,
  • 4:00 - 4:01
    and then
  • 4:01 - 4:03
    to measure the distance between say
  • 4:03 - 4:04
    a face here
  • 4:04 - 4:07
    and a face there, we would project both
  • 4:07 - 4:10
    of the face images onto the subspace and then
  • 4:10 - 4:11
    measure
  • 4:11 - 4:14
    the distance along the subspace. So in eigenfaces, you use something
  • 4:14 - 4:17
    like 50 principal components.
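(A hedged sketch of that projection-and-distance step, again mine rather than the lecture's: it assumes face_a and face_b are flattened 10,000-dimensional image vectors, preprocessed the same way as the training set, and that U is the 10,000-by-50 matrix of top eigenvectors, e.g. as returned by the pca sketch above.)

```python
import numpy as np

def eigenface_distance(face_a, face_b, U):
    """Distance between two preprocessed face vectors, measured after projecting
    both onto the subspace spanned by the columns of U (the top eigenvectors)."""
    za = U.T @ face_a   # 50-dimensional representation of face a
    zb = U.T @ face_b   # 50-dimensional representation of face b
    return np.linalg.norm(za - zb)
```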
  • 4:17 - 4:20
    So
  • 4:20 - 4:24
    the difficulty of working with problems like these is that
  • 4:24 - 4:27
    in step two of the algorithm,
  • 4:27 - 4:31
    we construct the covariance matrix sigma.
  • 4:31 - 4:38
    The covariance matrix now becomes
  • 4:38 - 4:43
    a 10,000 by 10,000 dimensional matrix, which is huge. That has
  • 4:43 - 4:46
    100 million
  • 4:46 - 4:48
    entries, which is huge.
  • 4:49 - 4:51
    So we'd like to apply PCA to
  • 4:51 - 4:54
    very, very high dimensional data, with the goal of reducing the
  • 4:54 - 4:55
    dimension. But
  • 4:55 - 5:00
    step two of this algorithm has this step where you construct the covariance matrix sigma,
  • 5:00 - 5:04
    this extremely large matrix, which you can't do.
  • 5:04 - 5:07
    Come back to this in a second. It turns out one of
  • 5:07 - 5:09
    the other
  • 5:09 - 5:11
    frequently-used applications of
  • 5:11 - 5:12
    PCA
  • 5:12 - 5:14
    is actually to text data.
  • 5:14 - 5:16
    So here's what I
  • 5:16 - 5:21
    mean. Remember our vectorial representation of emails?
  • 5:21 - 5:23
    So this is way back
  • 5:23 - 5:27
    when we were talking about supervised learning algorithms for a
  • 5:27 - 5:29
    spam classification. You remember I said that
  • 5:29 - 5:33
    given a piece of email or given a piece of text document, you
  • 5:33 - 5:35
    can represent it using a very high-dimensional vector
  • 5:35 - 5:37
    by taking
  • 5:39 - 5:44
    - writing down a list of all the words in your dictionary. Somewhere you had the word
  • 5:44 - 5:47
    learn, somewhere you have the word
  • 5:47 - 5:50
    study
  • 5:51 - 5:55
    and so on.
  • 5:55 - 5:58
    Depending on whether each word appears or does not appear in your text document, you put either
  • 5:58 - 5:59
    a one or a zero
  • 5:59 - 6:04
    there. This is the representation we used in lecture five or lecture six
  • 6:04 - 6:07
    for representing text documents
  • 6:07 - 6:09
    when we were building
  • 6:09 - 6:12
    Naive Bayes-based classifiers for
  • 6:12 - 6:14
    spam classification. So it turns
  • 6:14 - 6:18
    out one of the common applications of
  • 6:18 - 6:22
    PCA is actually to this sort of text data representation as well.
  • 6:22 - 6:24
    When you apply PCA
  • 6:24 - 6:26
    to this sort of data,
  • 6:26 - 6:28
    the resulting
  • 6:28 - 6:35
    algorithm often just goes by a different name: latent semantic indexing.
  • 6:41 - 6:44
    For the sake of completeness, I should say that
  • 6:44 - 6:48
    in LSI, you usually skip the preprocessing step.
  • 7:06 - 7:10
    For various reasons, in LSI, you usually don't normalize the mean of the data to
  • 7:10 - 7:11
    zero,
  • 7:11 - 7:14
    and you usually don't normalize the variance of the features to one.
  • 7:14 - 7:18
    These are relatively minor
  • 7:18 - 7:21
    differences, it turns out, so it does something very
  • 7:21 - 7:25
    similar to PCA.
  • 7:25 - 7:26
  • 7:26 - 7:27
    Normalizing the variance to one
  • 7:27 - 7:33
    for text data would actually be a bad idea because all the words are -
  • 7:33 - 7:34
    because that
  • 7:34 - 7:37
    would have the effect of
  • 7:37 - 7:39
    dramatically scaling up the
  • 7:39 - 7:44
    weight of rarely occurring words. So for example, the word aardvark hardly ever
  • 7:44 - 7:46
    appears in any document.
  • 7:46 - 7:49
    So to normalize the variance
  • 7:49 - 7:51
    of the second feature to one, you end up -
  • 7:51 - 7:55
    you're scaling up the weight of the word aardvark
  • 7:55 - 7:58
    dramatically. I don't understand why [inaudible].
  • 7:58 - 8:01
    So let's
  • 8:01 - 8:05
    see. In natural language processing,
  • 8:05 - 8:11
    something that we want to do quite often is, given two documents,
  • 8:13 - 8:20
    XI and XJ, to measure how similar they are.
  • 8:20 - 8:22
    So for example,
  • 8:22 - 8:25
    I may give you a document and ask
  • 8:25 - 8:29
    you to find me more documents like this one. We're reading some
  • 8:29 - 8:31
    article about some news event of today
  • 8:31 - 8:34
    and want to find out what other news articles there are. So I give you a document and
  • 8:35 - 8:37
    ask you to look at all the other documents you have
  • 8:37 - 8:41
    in this large set of documents and find the documents similar to
  • 8:41 - 8:42
    this.
  • 8:44 - 8:45
    So
  • 8:45 - 8:48
    this is a typical text application, so
  • 8:48 - 8:51
    to measure the similarity
  • 8:52 - 8:57
    between two documents XI and XJ - remember,
  • 8:57 - 9:00
    each of these documents is represented
  • 9:00 - 9:03
    as one of these high-dimensional vectors.
  • 9:03 - 9:09
    One common way to do this is to view each of your documents
  • 9:09 - 9:13
    as some sort of very high-dimensional vector.
  • 9:13 - 9:14
    So these
  • 9:14 - 9:18
    are vectors in a very high-dimensional space where
  • 9:18 - 9:20
    the dimension of the vector is equal to
  • 9:20 - 9:27
    the number of words in your dictionary.
  • 9:27 - 9:31
    So maybe each of these documents lives in some
  • 9:31 - 9:34
    50,000-dimensional space, if you have 50,000 words in your
  • 9:34 - 9:37
    dictionary. So one measure of the similarity between these two documents that's
  • 9:37 - 9:40
    often used is
  • 9:40 - 9:41
    what's the angle
  • 9:41 - 9:43
    between these two
  • 9:43 - 9:50
    documents.
  • 9:51 - 9:53
    In particular,
  • 9:53 - 9:56
    if the angle between these two vectors is small, then
  • 9:56 - 10:00
    the two documents, we'll consider them to be similar. If the angle between
  • 10:00 - 10:03
    these two vectors is large, then we consider the documents to be dissimilar.
  • 10:03 - 10:06
    So
  • 10:06 - 10:10
    more formally, one commonly used heuristic in natural language processing
  • 10:10 - 10:14
    is to say that the similarity between the two documents is the cosine of the angle theta between them.
  • 10:14 - 10:17
    For the relevant
  • 10:17 - 10:19
    range of values, anyway, the cosine
  • 10:19 - 10:23
    is a decreasing function of theta.
  • 10:23 - 10:24
    So the
  • 10:24 - 10:30
    smaller the angle between them, the larger the similarity.
  • 10:30 - 10:31
    The cosine
  • 10:31 - 10:36
    between two vectors is, of course, just the inner product of the two vectors
  • 10:36 - 10:43
    divided
  • 10:44 - 10:47
    by the product of their norms - okay?
  • 10:47 - 10:51
    That's just the linear algebra or the standard
  • 10:51 - 10:56
    geometry definition of the cosine between two vectors. Here's the
  • 11:04 - 11:11
    intuition behind what LSI is doing. The hope, as usual, is
  • 11:18 - 11:21
    that there
  • 11:21 - 11:25
    may be some interesting axis of variations in the data,
  • 11:25 - 11:28
    and there maybe some other axis that
  • 11:28 - 11:29
    are just
  • 11:29 - 11:34
    noise. So by projecting all of your data onto a lower-dimensional subspace, the hope is that by
  • 11:34 - 11:38
    running PCA on your text data this way, you can remove some of the noise in the data and
  • 11:38 - 11:42
    get better measures of the similarity between pairs of
  • 11:42 - 11:45
    documents. Let's just delve a little deeper into those examples to convey more intuition about what LSI
  • 11:45 - 11:47
    is doing.
  • 11:47 - 11:48
    So
  • 11:48 - 11:55
    look further at the definition of the cosine similarity measure. So
  • 11:56 - 12:00
    the numerator
  • 12:00 - 12:07
    or
  • 12:10 - 12:17
    the similarity between the two documents was this inner product,
  • 12:18 - 12:24
    which is therefore the sum over K of
  • 12:24 - 12:28
    XIK times
  • 12:28 - 12:29
    XJK. So
  • 12:30 - 12:34
    this inner product would be equal to zero if
  • 12:34 - 12:37
    the two documents have no words in common. So
  • 12:37 - 12:40
    this is really - sum over K -
  • 12:41 - 12:43
    indicator of
  • 12:43 - 12:45
    whether
  • 12:45 - 12:47
    documents, I and
  • 12:47 - 12:54
    J,
  • 12:55 - 12:58
    both contain the word, K, because
  • 12:58 - 13:03
    I guess XIK indicates whether document I contains the word
  • 13:03 - 13:04
    K, and XJK
  • 13:04 - 13:08
    indicates whether document J contains the word, K.
  • 13:08 - 13:10
    So the product would be one only
  • 13:10 - 13:12
    if the word K
  • 13:12 - 13:15
    appears in both documents.
  • 13:15 - 13:18
    Therefore, the similarity between these two documents would be
  • 13:18 - 13:23
    zero if the two documents have no words in common.
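(A small sketch of this similarity measure, with a made-up three-word vocabulary of my own: for binary bag-of-words vectors, the numerator counts the words the two documents share, and dividing by the norms gives the cosine of the angle between them.)

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two document vectors."""
    return (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

# Tiny vocabulary: ["aardvark", "learn", "study"]
doc_i = np.array([0.0, 1.0, 0.0])   # contains only "learn"
doc_j = np.array([0.0, 0.0, 1.0])   # contains only "study"
doc_k = np.array([0.0, 1.0, 1.0])   # contains both "learn" and "study"

print(cosine_similarity(doc_i, doc_j))   # 0.0   -- no words in common
print(cosine_similarity(doc_i, doc_k))   # ~0.71 -- one word in common
```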
  • 13:23 - 13:30
    For example,
  • 13:31 - 13:35
    suppose your document,
  • 13:35 - 13:40
    XI, has the word study and the document
  • 13:40 - 13:42
    XJ,
  • 13:42 - 13:43
    has the word learn.
  • 13:43 - 13:48
    Then these two documents may be considered
  • 13:48 - 13:49
    entirely dissimilar.
  • 13:50 - 13:53
    Say you have a document about effective study strategies. Sometimes you read a
  • 13:53 - 13:57
    news article about that. So you ask, what other documents are similar to this? If
  • 13:57 - 14:01
    there are a bunch of other documents about good methods to
  • 14:01 - 14:04
    learn, then there are no words in common. So the similarity [inaudible] is zero.
  • 14:04 - 14:07
    So here's
  • 14:07 - 14:09
    a cartoon
  • 14:09 - 14:11
    of what we hope
  • 14:11 - 14:13
    that PCA will do,
  • 14:13 - 14:14
    which is
  • 14:14 - 14:17
    suppose that on the horizontal axis, I plot
  • 14:17 - 14:21
    the word
  • 14:21 - 14:25
    learn, and on the vertical axis, I plot the word study.
  • 14:25 - 14:30
    So the values take on either the value of zero or one. So if a document
  • 14:30 - 14:33
    contains the words learn but not study, then
  • 14:33 - 14:35
    it'll plot that document there. If
  • 14:35 - 14:38
    a document contains neither the word study nor learn, then it'll plot that
  • 14:38 - 14:41
    at zero, zero.
  • 14:41 - 14:44
    So here's a cartoon behind what PCA
  • 14:44 - 14:47
    is doing, which is
  • 14:47 - 14:51
    we identify a lower-dimensional subspace. That would be, say, the
  • 14:51 - 14:58
    eigenvector we get out of PCA.
  • 14:58 - 15:03
    Now, suppose we have a document about learning. We have a document about studying.
  • 15:03 - 15:05
    The document about learning
  • 15:05 - 15:08
    points to the right. Document about studying points
  • 15:08 - 15:11
    up. So the inner product, or the cosine of the angle between these two documents, would be
  • 15:11 - 15:13
    - excuse
  • 15:13 - 15:15
    me. The inner product between
  • 15:15 - 15:19
    these two documents will be zero.
  • 15:19 - 15:20
    So these two
  • 15:20 - 15:22
    documents are entirely unrelated,
  • 15:22 - 15:25
    which is not what we want.
  • 15:25 - 15:28
    Documents about study, documents about learning, they are related. But
  • 15:28 - 15:33
    we take these two documents, and we project them
  • 15:33 - 15:36
    onto this subspace.
  • 15:38 - 15:41
    Then these two documents now become much
  • 15:41 - 15:43
    closer together,
  • 15:43 - 15:45
    and the algorithm will recognize that
  • 15:45 - 15:48
    when you say the inner product between these two documents,
  • 15:48 - 15:51
    you actually end up with a positive number. So
  • 15:51 - 15:52
    LSI enables
  • 15:52 - 15:56
    our algorithm to recognize that these two documents have some positive similarity
  • 15:56 - 15:59
    between them.
  • 15:59 - 16:01
    So that's just intuition
  • 16:01 - 16:02
    about what
  • 16:02 - 16:05
    PCA may be doing to text data.
  • 16:05 - 16:09
    The same thing goes for other examples beyond the words study and learn. So you have
  • 16:09 - 16:11
    - you find a document about
  • 16:11 - 16:15
    politicians and a document with the names of prominent
  • 16:15 - 16:16
    politicians.
  • 16:18 - 16:21
    That will also bring the documents closer together,
  • 16:21 - 16:25
    or just any related topics, they end up
  • 16:25 - 16:26
    with their
  • 16:26 - 16:32
    points closer together in this lower dimensional space.
  • 16:33 - 16:38
    Question about this? Interviewee: [Inaudible].
  • 16:38 - 16:44
    Which ones?
  • 16:44 - 16:51
    This one? No, the line. Oh, this one. Oh,
  • 16:54 - 16:54
    yes.
  • 16:54 - 17:01
    Thank you.
  • 17:02 - 17:06
    [Inaudible].
  • 17:06 - 17:07
    So
  • 17:07 - 17:14
    let's talk about how to actually implement this now.
  • 17:17 - 17:21
    Okay. How many of you know what
  • 17:21 - 17:25
    an SVD or singular value decomposition is? Wow,
  • 17:25 - 17:26
    that's a lot of you. That's a
  • 17:26 - 17:28
    lot more than I thought.
  • 17:28 - 17:35
    Curious. Did you guys learn it as undergrads or as graduate students?
  • 17:37 - 17:38
    All
  • 17:38 - 17:40
    right. Let
  • 17:40 - 17:44
    me talk about it anyway. I
  • 17:44 - 17:48
    wasn't expecting so many of you to know what SVD is, but I want to get this
  • 17:48 - 17:51
    on tape, just so everyone else can learn
  • 17:51 - 17:55
    about this, too.
  • 17:55 - 17:59
    So I'll say a little bit about how to implement
  • 17:59 - 18:02
    PCA. The problem I
  • 18:02 - 18:05
    was alluding to just now was that
  • 18:05 - 18:09
    when you have these very high-dimensional vectors, then sigma is a large matrix. In particular, for
  • 18:09 - 18:12
    our
  • 18:12 - 18:13
    text example,
  • 18:13 - 18:18
    if the vectors XI are 50,000 dimensional,
  • 18:18 - 18:24
    then
  • 18:24 - 18:27
    the covariance matrix will be 50,000 dimensional by 50,000
  • 18:27 - 18:33
    dimensional, which is much too big to represent explicitly.
  • 18:33 - 18:36
    I guess many
  • 18:36 - 18:41
    of you already know this, but I'll just say it anyway. It
  • 18:41 - 18:48
    turns out there's another way to implement PCA, which is
  • 18:49 - 18:49
    if
  • 18:49 - 18:53
    A is any m by n matrix,
  • 18:53 - 18:57
    then one of the most remarkable results of linear algebra
  • 18:57 - 19:01
    is that the matrix, A,
  • 19:01 - 19:04
    can be decomposed into
  • 19:04 - 19:07
    a singular value
  • 19:07 - 19:10
    decomposition. What that means is that the matrix, A, which
  • 19:10 - 19:12
    is
  • 19:12 - 19:13
    m by n,
  • 19:13 - 19:15
    can always be decomposed into a product of
  • 19:15 - 19:18
    three matrices. U is m by m,
  • 19:18 - 19:21
    D is a diagonal matrix, which is m by n, and V is
  • 19:24 - 19:27
    also n by n.
  • 19:27 - 19:31
    D
  • 19:31 - 19:35
    is
  • 19:35 - 19:37
    going to be diagonal.
  • 19:43 - 19:45
    Zeros are on the off-diagonals,
  • 19:45 - 19:52
    and the values sigma I are called the singular values of
  • 19:54 - 19:58
    the matrix A.
  • 20:02 - 20:05
    Almost all of you said you learned this as a graduate student, rather than as an undergrad, so
  • 20:05 - 20:07
    it turns out that
  • 20:07 - 20:11
    when you take a class in undergraduate linear algebra, usually you learn a bunch of
  • 20:11 - 20:14
    decompositions. So you usually learn about the QR decomposition, maybe
  • 20:14 - 20:17
    the LU factorization of matrices.
  • 20:17 - 20:20
    Most undergrad courses don't get to talk about singular value
  • 20:20 - 20:21
    decompositions, but at
  • 20:21 - 20:23
    least in - almost
  • 20:23 - 20:25
    everything I
  • 20:25 - 20:27
    do in machine learning,
  • 20:27 - 20:31
    you actually find that you end up using SVDs much more than any of the
  • 20:31 - 20:33
    decompositions
  • 20:33 - 20:36
    you learned in a typical undergrad linear algebra class.
  • 20:36 - 20:40
    So personally, I've used an SVD dozens of times in the last
  • 20:40 - 20:43
    year, but LU and QR decompositions,
  • 20:43 - 20:45
    I think I used the QR decomposition once and an
  • 20:45 - 20:50
    LU decomposition in the last year. So let's see. I'll
  • 20:50 - 20:54
    say
  • 20:54 - 20:55
    a
  • 20:55 - 21:02
    bit
  • 21:02 - 21:05
    more about this. So I'm going to draw the picture, I guess.
  • 21:05 - 21:08
    For example,
  • 21:08 - 21:11
    if A is an m by n matrix,
  • 21:11 - 21:18
    it can be decomposed into another matrix, U, which is m by m,
  • 21:22 - 21:25
    times D, which is
  • 21:25 - 21:30
    m by
  • 21:30 - 21:35
    n and diagonal, times another square matrix, V transpose, which is
  • 21:35 - 21:37
    n by
  • 21:37 - 21:43
    n. Furthermore, in a singular value decomposition, the
  • 21:43 - 21:46
    columns of the matrix, U, will be the
  • 21:46 - 21:51
    eigenvectors
  • 21:51 - 21:56
    of A A transpose, and the
  • 21:56 - 22:02
    columns of V will be the eigenvectors
  • 22:02 - 22:06
    of A
  • 22:06 - 22:08
    transpose A.
  • 22:08 - 22:12
  • 22:12 - 22:17
    To compute it, you just use the SVD commands
  • 22:17 - 22:21
    in
  • 22:21 - 22:22
    Matlab
  • 22:22 - 22:23
    or
  • 22:23 - 22:27
    Octave. Today, the state of the art in numerical linear algebra is that
  • 22:27 - 22:31
    SVDs, singular value decompositions, of matrices can be computed
  • 22:31 - 22:34
    extremely reliably. We've
  • 22:34 - 22:36
    used a
  • 22:36 - 22:38
    package like Matlab or Octave
  • 22:38 - 22:40
    to compute, say, the eigenvectors of a matrix.
  • 22:40 - 22:44
    The SVD
  • 22:44 - 22:48
    routines are even more numerically stable than eigenvector routines for finding eigenvectors of a
  • 22:48 - 22:49
    matrix. So you can
  • 22:49 - 22:50
    safely
  • 22:50 - 22:53
    use a routine like this, similar to the way you use
  • 22:53 - 22:56
    a square root command without thinking about
  • 22:56 - 22:58
    how it's computed. You can compute the square
  • 22:58 - 23:04
    root of something and just not worry about it. You know the computer will give you the right
  • 23:04 - 23:08
    answer. For most reasonably sized matrices, even up to thousands-by-thousands
  • 23:08 - 23:11
    matrices, the SVD routine, I think of it as a square root function. If
  • 23:11 - 23:15
    you call it, it'll give you back the right answer. You don't have to worry
  • 23:15 - 23:17
    too much about
  • 23:17 - 23:21
    it. If you have extremely large matrices, like a million by a million matrices, I
  • 23:21 - 23:23
    might start to worry a bit, but a few thousand by a few
  • 23:23 - 23:26
    thousand matrixes, this is
  • 23:26 - 23:29
    implemented very well today.
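(For completeness, the same call in NumPy rather than Matlab/Octave - a sketch of my own using only numpy.linalg.svd and numpy.linalg.eigh on an arbitrary random matrix.)

```python
import numpy as np

A = np.random.randn(5, 8)            # an m-by-n matrix, here 5-by-8

U, s, Vt = np.linalg.svd(A)          # A = U @ D @ Vt, with the singular values in s
D = np.zeros((5, 8))
D[:5, :5] = np.diag(s)               # rebuild the rectangular diagonal matrix D
print(np.allclose(A, U @ D @ Vt))    # True: the factorization reproduces A

# Columns of U are eigenvectors of A A^T (and columns of V of A^T A);
# the eigenvalues of A A^T are the squared singular values.
eigvals = np.linalg.eigh(A @ A.T)[0]
print(np.allclose(np.sort(eigvals)[::-1], s**2))   # True
```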
  • 23:29 - 23:31
    [Inaudible].
  • 23:31 - 23:35
    What's the complexity of SVD? That's a good question. I actually don't know. I want to guess it's roughly on the
  • 23:35 - 23:36
    order of N-cubed.
  • 23:36 - 23:42
    I'm not sure. [Inaudible]
  • 23:42 - 23:45
    algorithms, so I think - I don't know what's
  • 23:45 - 23:50
    known about the convergence
  • 23:50 - 23:54
    of
  • 23:54 - 23:58
    these algorithms. The example I drew out was for a fat matrix, a matrix that is
  • 23:58 - 24:00
    wider than it is tall.
  • 24:00 - 24:03
    In the same way, you can also
  • 24:03 - 24:08
    call SVD on a tall matrix, one that's taller than it is wide.
  • 24:08 - 24:15
    It would decompose it into - okay? A
  • 24:21 - 24:24
    product of three matrices like
  • 24:24 - 24:28
    that.
  • 24:28 - 24:31
    The nice thing about this is that
  • 24:31 - 24:33
    we can use it to compute
  • 24:33 - 24:40
    eigenvectors and PCA very efficiently.
  • 24:42 - 24:47
    In particular, a
  • 24:47 - 24:53
    covariance matrix sigma was
  • 24:53 - 24:55
    this. It was the sum of the outer products of the training examples,
  • 24:55 - 25:01
    so if you go back and recall the definition of the
  • 25:01 - 25:02
    design matrix -
  • 25:02 - 25:05
    I think I described this in
  • 25:05 - 25:08
    lecture two when
  • 25:08 - 25:14
    we derived the closed-form solution to least squares. The
  • 25:14 - 25:18
    design matrix was this matrix where I took my
  • 25:18 - 25:21
    examples
  • 25:21 - 25:26
    and stacked them in
  • 25:26 - 25:28
    rows.
  • 25:28 - 25:33
    We call this the design matrix, X. So if
  • 25:33 - 25:39
    you construct the design matrix, then
  • 25:39 - 25:43
    the covariance matrix sigma
  • 25:43 - 25:48
    can be written as just X transpose X.
  • 26:01 - 26:08
    That's X transposed, times X.
  • 26:16 - 26:17
    Okay?
  • 26:17 - 26:19
    I hope you see why the X transpose X gives you
  • 26:19 - 26:22
    the sum of products of
  • 26:22 - 26:26
    vectors. If you aren't seeing this right now, just go home and convince yourself
  • 26:26 - 26:33
    that it's
  • 26:38 - 26:40
    true.
  • 26:40 - 26:47
    To get the top K eigenvectors of
  • 26:51 - 26:51
    sigma,
  • 26:51 - 26:56
    you would take sigma
  • 26:56 - 26:57
    and decompose it
  • 26:57 - 27:00
    using
  • 27:00 - 27:02
    the -
  • 27:02 - 27:07
    excuse me.
  • 27:07 - 27:12
    You would take the matrix X, and you would compute its SVD. So you get U D V transpose.
  • 27:12 - 27:15
    Then the top K
  • 27:15 - 27:19
    columns
  • 27:19 - 27:22
    of U
  • 27:22 - 27:24
    are the top K eigen
  • 27:24 - 27:27
    vectors
  • 27:27 - 27:29
    of
  • 27:29 - 27:32
    X transpose
  • 27:32 - 27:34
    X,
  • 27:34 - 27:38
    which are therefore the top K eigenvectors of your
  • 27:38 - 27:40
    covariance matrix
  • 27:40 - 27:42
    sigma. So
  • 27:42 - 27:46
    in our examples, the
  • 27:46 - 27:49
    design matrix lives in, say, R m by n. If you have
  • 27:49 - 27:52
    50,000 words in your dictionary, then the
  • 27:52 - 27:55
    design matrix would be
  • 27:55 - 27:58
    in R m by 50,000.
  • 27:58 - 28:02
    It might be, say, 100 by 50,000, if you have 100 examples.
  • 28:02 - 28:03
    So X would be
  • 28:03 - 28:07
    quite tractable to represent and compute the
  • 28:07 - 28:10
    SVD, whereas the matrix sigma would be much harder to represent. This is
  • 28:10 - 28:12
    50,000 by 50,000.
  • 28:12 - 28:16
    So this gives you an efficient way to implement
  • 28:16 - 28:18
    PCA.
  • 28:18 - 28:21
    The reason I want to talk about this is
  • 28:21 - 28:25
    in previous years, I didn't talk about this.
  • 28:25 - 28:29
    In the class projects, I found a number of students trying to implement PCA on huge
  • 28:29 - 28:30
    problems and running into trouble,
  • 28:30 - 28:32
    so this is
  • 28:32 - 28:35
    a much better way to implement PCA
  • 28:35 - 28:38
    if you have extremely high dimensional data. If you have
  • 28:38 - 28:40
    low dimensional data, if
  • 28:40 - 28:44
    you have 50 or 100 dimensional data, then
  • 28:44 - 28:45
    computing sigma's no problem. You can
  • 28:45 - 28:51
    do it the old way, but otherwise, use the SVD to implement this.
  • 28:53 - 29:00
    Questions about this?
  • 29:26 - 29:31
    The last thing I want to say is that in practice, when you want to implement this, I want to say a note
  • 29:31 - 29:32
    of caution.
  • 29:32 - 29:35
    It turns out that
  • 29:35 - 29:39
    for many applications of - let's see.
  • 29:39 - 29:42
    When you apply SVD to these wide - yeah.
  • 29:42 - 29:43
    Interviewee: Just a quick question. Are
  • 29:43 - 29:49
    the top K columns of U or of V? Because X transpose X is V D D V transpose, right? Instructor (Andrew Ng): Let's see. Oh,
  • 29:49 - 29:53
    yes.
  • 29:53 - 29:55
    I think you're
  • 29:55 - 29:56
    right. I
  • 29:56 - 30:03
    think you're right.
  • 30:04 - 30:06
    Let's
  • 30:06 - 30:13
    see. Is it top K columns of U or top K of V?
  • 30:15 - 30:22
    Yeah, I think you're right. Is that right? Something
  • 30:32 - 30:39
    bothers me about that, but I think you're right.
  • 30:40 - 30:46
    [Inaudible], so then X transpose X should be V D D V transpose. X
  • 30:46 - 30:53
    is U D V transpose, so X transpose X would be V D transpose U transpose U D V transpose, which is V D squared V transpose. [Inaudible].
  • 31:01 - 31:03
    If anyone thinks about this and
  • 31:03 - 31:06
    has another opinion, let me know, but I think you're
  • 31:06 - 31:11
    right. I'll make sure I get the details and let you
  • 31:11 - 31:17
    know.
  • 31:17 - 31:22
    Everyone's still looking at that.
  • 31:22 - 31:23
    Tom, can you figure out
  • 31:23 - 31:25
    the right answer and let me know? That sounds right. Okay. Cool. Okay.
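(Here is a minimal sketch of that trick, my own code rather than the lecture's, using the convention just settled on above: write the design matrix X = U D V^T, and then the top k eigenvectors of sigma = X^T X are the first k columns of V, i.e. the first k rows of V^T. It assumes X has already had whatever mean/variance normalization you want, and it never forms the n-by-n covariance matrix.)

```python
import numpy as np

def pca_via_svd(X, k):
    """PCA for high-dimensional data: X is m-by-n with one (preprocessed) example
    per row. Returns the top-k principal directions without ever building sigma."""
    # Economy SVD: cheap when m << n, e.g. a 100-by-50,000 design matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rows of Vt (columns of V) are eigenvectors of X^T X, ordered by singular value.
    return Vt[:k].T          # n-by-k matrix of the top-k eigenvectors of sigma

# Example shapes: 100 documents, 50,000-word vocabulary, 50 components:
#   X.shape == (100, 50000)  ->  pca_via_svd(X, 50).shape == (50000, 50)
```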
  • 31:25 - 31:30
    So just
  • 31:30 - 31:34
    one last note, a note of caution. It turns out that
  • 31:34 - 31:37
    in this example, I was implementing SVD
  • 31:37 - 31:44
    with a wide matrix. So the matrix X was m by n, with n much bigger than m.
  • 31:44 - 31:47
    It turns out when you
  • 31:47 - 31:52
    find the SVD decomposition of this,
  • 31:52 - 31:58
    it turns out that -
  • 31:58 - 32:01
    let's see. Yeah, I think you're definitely
  • 32:01 - 32:04
    right. So it turns out that we find the SVD
  • 32:04 - 32:08
    of this, the right-most portion of this block of this matrix would be all
  • 32:08 - 32:15
    zeros.
  • 32:15 - 32:18
    Also, when you compute the
  • 32:18 - 32:22
    matrix, D, a large part of this matrix would be zeros.
  • 32:22 - 32:24
    You have the matrix D
  • 32:24 - 32:30
    transpose. So depending on what convention you use,
  • 32:30 - 32:36
    for example, I think Matlab actually uses a convention of just
  • 32:36 - 32:43
    cutting off the zero elements.
  • 32:48 - 32:51
    So the Matlab uses the convention
  • 32:51 - 32:54
    of chopping off the right-most half of the U matrix and chopping off the bottom
  • 32:54 - 32:57
    portion of the D matrix. I'm not sure
  • 32:57 - 33:01
    if this even depends on the version of Matlab,
  • 33:01 - 33:04
    but when you call SVD on Matlab or some other numerical algebra packages,
  • 33:04 - 33:08
    there's slightly different conventions of how to define your SVD when
  • 33:08 - 33:10
    the matrix is wider than it is tall.
  • 33:10 - 33:13
    So just watch out for this and make sure you map
  • 33:13 - 33:15
    whatever convention
  • 33:15 - 33:18
    your numerical algebra library uses
  • 33:18 - 33:23
    to the original computations.
  • 33:23 - 33:24
    It turns out if
  • 33:24 - 33:26
    you're using Matlab,
  • 33:26 - 33:30
    or you're writing C code, there are many scientific libraries that can
  • 33:30 - 33:32
  • 33:32 - 33:36
    compute SVDs for you, but they're just slightly different in
  • 33:36 - 33:39
    conventions for the dimensions of these matrices. So just make sure you figure this out for the
  • 33:39 - 33:44
    package that you use.
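(As a concrete illustration of the convention issue in one particular package - NumPy here, as my own example rather than a statement about Matlab - the full_matrices flag controls whether the zero blocks are kept or chopped off.)

```python
import numpy as np

X = np.random.randn(100, 2000)                    # a wide design matrix, m << n

# "Full" SVD keeps an n-by-n V^T, most of which only ever multiplies zeros in D.
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=True)
print(U_full.shape, s_full.shape, Vt_full.shape)  # (100, 100) (100,) (2000, 2000)

# "Economy" SVD chops the zero blocks off.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)                 # (100, 100) (100,) (100, 2000)
```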
  • 33:44 - 33:46
    Finally, I just want to
  • 33:46 - 33:50
    take the unsupervised learning algorithms we talked about and just put them in a little bit
  • 33:50 - 33:52
    of broader context.
  • 33:52 - 33:56
    This is partly in response to the questions I've gotten from students in
  • 33:56 - 33:58
    office hours and elsewhere
  • 33:58 - 34:01
    about when to use each of these
  • 34:01 - 34:01
    algorithms. So
  • 34:01 - 34:06
    I'm going to draw a two by two matrix. This is a little cartoon that
  • 34:08 - 34:11
    I find useful.
  • 34:15 - 34:20
    One of the algorithms we talked about earlier, right
  • 34:20 - 34:21
    before this, was
  • 34:21 - 34:24
    factor analysis, which was - it
  • 34:24 - 34:28
    was - I hope you remember that picture I drew where I would have a bunch of points Z on the
  • 34:28 - 34:29
    line.
  • 34:29 - 34:33
    Then I had these ellipses that I drew. I hope you
  • 34:33 - 34:35
    remember that
  • 34:35 - 34:38
    picture. This was a factor analysis model
  • 34:38 - 34:39
    which models
  • 34:39 - 34:42
    the density of X, right?
  • 34:42 - 34:45
    Then there was also PCA, just now.
  • 34:45 - 34:46
    So the
  • 34:46 - 34:49
    difference between factor analysis and PCA, the
  • 34:49 - 34:52
    way I think about it, is that factor analysis
  • 34:52 - 34:54
    is a density estimation algorithm.
  • 34:54 - 34:58
    It tries to model the density of the training examples X.
  • 34:58 - 35:01
    Whereas PCA
  • 35:02 - 35:06
    is not a probabilistic
  • 35:06 - 35:07
    algorithm. In particular,
  • 35:07 - 35:11
    it does not endow your training examples with any probabilistic
  • 35:11 - 35:14
    distribution; it directly goes to find the subspace.
  • 35:14 - 35:19
    So in terms of when to use factor analysis and when to use PCA,
  • 35:19 - 35:22
    if your goal is to reduce the dimension of the data,
  • 35:22 - 35:26
    if your goal is to find the subspace that the data lies on,
  • 35:26 - 35:27
    then PCA
  • 35:27 - 35:28
    directly
  • 35:28 - 35:32
    tries to find the subspace. I think I would
  • 35:32 - 35:35
    tend to use PCA.
  • 35:35 - 35:40
    Factor analysis, it sort of assumes the data lies on a subspace.
  • 35:40 - 35:45
    Let me write a subspace here.
  • 35:45 - 35:50
    So both of these algorithms sort of assume the data maybe lies close
  • 35:50 - 35:52
    or on some low dimensional subspace,
  • 35:52 - 35:56
    but fundamentally, factor analysis, I think of it as a density estimation algorithm.
  • 35:56 - 35:59
    So if I have some very high dimensional distribution and I
  • 35:59 - 36:01
    want to model P of X, then
  • 36:01 - 36:05
    factor analysis is the algorithm I'm more inclined
  • 36:05 - 36:05
    to use. So
  • 36:05 - 36:07
    even though you could in theory,
  • 36:07 - 36:12
    I would tend to avoid trying to use factor analysis to
  • 36:12 - 36:14
    identify a subspace the
  • 36:14 - 36:16
    data
  • 36:16 - 36:18
    set lies on. So for instance, if you want to do
  • 36:18 - 36:22
    anomaly detection, if you want to model P of X
  • 36:22 - 36:26
    so that if an example has a very low probability, you can flag it as an anomaly,
  • 36:26 - 36:33
    then I would tend to use factor analysis to do that density estimation. So factor
  • 36:33 - 36:38
    analysis and PCA are both algorithms that assume that your data lies close to some subspace.
  • 36:38 - 36:42
    The other class of algorithms we talked about was
  • 36:42 - 36:45
    algorithms that assumes the data
  • 36:45 - 36:47
    lies in
  • 36:47 - 36:51
    clumps or that the
  • 36:51 - 36:52
    data
  • 36:52 - 36:58
    has a few coherent groups. So let me just fill in the rest of this picture.
  • 37:09 - 37:10
    So if you think your data lies
  • 37:10 - 37:14
    in clumps or lies in groups, and if your goal is
  • 37:14 - 37:16
    density estimation, then I would
  • 37:16 - 37:20
    tend to use a mixture of Gaussians
  • 37:20 - 37:23
    algorithm. But again, if you don't necessarily want to endow your data with any probabilistic
  • 37:23 - 37:27
    semantics, and you just want to find the clumps or the groups, then I'd be inclined
  • 37:27 - 37:29
    to use a k-means algorithm. So
  • 37:30 - 37:33
    I haven't seen anyone else draw this picture before, but I tend to organize these things
  • 37:33 - 37:35
    this way in my brain.
  • 37:35 - 37:36
    Hopefully this helps guide
  • 37:36 - 37:40
    when you might use each of these algorithms as well, depending
  • 37:40 - 37:45
    on whether you believe the data might lie on a subspace or whether it might lie in
  • 37:45 - 37:48
    clumps or groups.
  • 37:51 - 37:54
    All right.
  • 37:54 - 38:01
    That wraps up the discussion on
  • 38:03 - 38:09
    PCA. What I want to do next is talk about
  • 38:09 - 38:15
    independent component analysis, or ICA. Yeah. Interviewee: I have
  • 38:15 - 38:18
    a
  • 38:18 - 38:22
    question about the upper right [inaudible]. So once you have all of the eigenvectors,
  • 38:22 - 38:26
    [inaudible] how similar is feature I to
  • 38:26 - 38:30
    feature J. You pick some eigenvector, and you take some dot products between the
  • 38:30 - 38:32
    feature I and
  • 38:32 - 38:35
    feature J and the eigenvector. But
  • 38:35 - 38:40
    there's a lot of eigenvectors to choose from. Instructor
  • 38:40 - 38:42
    (Andrew Ng): Right. So Justin's question was
  • 38:42 - 38:46
    having found my eigenvectors, how do I choose which eigenvector to use to
  • 38:46 - 38:48
    measure distance. I'm
  • 38:48 - 38:49
    going to
  • 38:49 - 38:51
    start
  • 38:51 - 38:53
    this up.
  • 38:53 - 38:57
    So the
  • 38:57 - 38:58
    answer is really
  • 38:58 - 39:02
    - in this cartoon, I would avoid thinking about
  • 39:02 - 39:04
    eigenvectors one at a time.
  • 39:04 - 39:08
    A better way to view this cartoon is that this is actually -
  • 39:08 - 39:12
    if I decide to choose 100 eigenvectors, this is really a 100-dimensional
  • 39:12 - 39:19
    subspace.
  • 39:19 - 39:20
    So
  • 39:20 - 39:25
    I'm not actually projecting my data onto one eigenvector.
  • 39:25 - 39:30
    This arrow, this cartoon, this denotes the 100-dimensional
  • 39:30 - 39:32
    subspace spanned by all my eigenvectors.
  • 39:32 - 39:36
    So what I actually do is project my data onto
  • 39:36 - 39:40
    the span, the linear span of the eigenvectors. Then I
  • 39:40 - 39:41
    measure distance or take
  • 39:41 - 39:43
    inner products between
  • 39:43 - 39:50
    the projections of the two points onto the span of the eigenvectors. Okay.
  • 39:50 - 39:54
    So let's talk about ICA,
  • 39:54 - 39:59
    independent component analysis.
  • 39:59 - 40:01
    So whereas PCA
  • 40:01 - 40:03
    was an algorithm for finding
  • 40:03 - 40:07
    what I call the main axes of variation of the data,
  • 40:07 - 40:11
    in ICA, we're going to try to find the independent components of variation in the
  • 40:11 - 40:12
    data.
  • 40:12 - 40:15
    So switch it to the laptop there, please.
  • 40:15 - 40:16
    We'll just
  • 40:16 - 40:22
    take a second to motivate that. I'm
  • 40:22 - 40:27
    going to do so by
  • 40:27 - 40:32
    - although if you put on the - okay. This is
  • 40:32 - 40:37
    actually a slide that I showed in
  • 40:37 - 40:40
    lecture one of the cocktail party problem.
  • 40:40 - 40:43
    Suppose you have two speakers at a cocktail party,
  • 40:43 - 40:45
    and you have two microphones in the
  • 40:45 - 40:46
    room, overlapping
  • 40:46 - 40:48
    sets of two conversations.
  • 40:48 - 40:52
    Then can you separate out the two original speaker sources?
  • 40:52 - 40:56
    So I actually played this audio as well in the very first lecture, which is
  • 40:56 - 40:59
    suppose microphone one records this.
  • 40:59 - 41:05
    [Recording]
  • 41:13 - 41:17
    So the question is, these are really two speakers,
  • 41:17 - 41:21
    speaking independently of each other. So each speaker is outputting
  • 41:21 - 41:25
    a series of sound signals as independent of the other conversation
  • 41:25 - 41:26
    going on in the room.
  • 41:26 - 41:28
    So
  • 41:28 - 41:32
    this being an unsupervised learning problem, the question is, can we take these two microphone
  • 41:32 - 41:34
    recordings and feed it to
  • 41:34 - 41:37
    an algorithm to find the independent components in
  • 41:37 - 41:38
    this
  • 41:38 - 41:41
    data? This is the output
  • 41:42 - 41:49
    when we do so.
  • 41:49 - 41:55
    [Recording] This is the other one. [Recording]
  • 41:56 - 42:00
    Just for fun. [Inaudible]. These are audio clips I got
  • 42:01 - 42:04
    from [inaudible]. Just for fun, let me play the other ones as well. This
  • 42:04 - 42:11
    is overlapping microphone one. [Recording]
  • 42:13 - 42:20
    Here's microphone two. [Recording]
  • 42:22 - 42:24
    So given this as input, here's output one.
  • 42:24 - 42:27
  • 42:27 - 42:31
    [Recording]
  • 42:31 - 42:34
    It's not perfect, but it's largely cleaned up the music.
  • 42:34 - 42:41
    Here's number two. [Recording] Okay. Switch back to
  • 42:43 - 42:45
    [inaudible], please.
  • 42:45 - 42:47
    So
  • 42:47 - 42:54
    what I want to do now is describe an algorithm that does that.
  • 42:55 - 42:58
    Before
  • 42:58 - 43:03
    I actually jump into the algorithm, I want to spend two minutes
  • 43:03 - 43:04
    on
  • 43:04 - 43:11
    CDFs, cumulative distribution functions. I know most
  • 43:19 - 43:21
    of you know what these are, but I'm
  • 43:21 - 43:24
    just going to remind you of what they are.
  • 43:25 - 43:30
    Let's say you have a one-D random variable S. So suppose you have
  • 43:30 - 43:36
    a random variable, S,
  • 43:36 - 43:41
    and suppose it has a probability density function P subscript S.
  • 43:41 - 43:43
    Then
  • 43:43 - 43:46
    the CDF
  • 43:46 - 43:50
    is defined as a function, or rather as F,
  • 43:50 - 43:54
    which is the probability that the random variable,
  • 43:54 - 43:56
    S, is less than the value
  • 43:56 - 43:59
    given by that lower-case
  • 43:59 - 44:00
    value,
  • 44:00 - 44:02
    S.
  • 44:02 - 44:03
    For example,
  • 44:03 - 44:06
    if this is your Gaussian density,
  • 44:06 - 44:10
    then the density of the Gaussian is usually
  • 44:10 - 44:15
    denoted lower-case phi. That's roughly a bell-shaped density. Then
  • 44:15 - 44:20
    the CDF of the Gaussian
  • 44:20 - 44:22
    will look something like this.
  • 44:22 - 44:25
    It'll be a function denoted capital
  • 44:25 - 44:27
    Phi. So if I pick a value
  • 44:27 - 44:29
    S like that, then the
  • 44:29 - 44:30
    height of this -
  • 44:30 - 44:33
    this is the probability that
  • 44:33 - 44:35
    my Gaussian random variable is less than
  • 44:35 - 44:37
    that value there. In other words,
  • 44:37 - 44:41
    the height of the function at that point is
  • 44:41 - 44:44
    equal to
  • 44:44 - 44:46
    the area under the Gaussian density,
  • 44:46 - 44:48
    up to the point S.
  • 44:48 - 44:49
    As you
  • 44:49 - 44:53
    move further and further to the right, this function will approach one, as
  • 44:53 - 45:00
    you integrate more and more of this area of the Gaussian. So another way to write
  • 45:05 - 45:12
    F
  • 45:21 - 45:28
    of
  • 45:31 - 45:35
    S is the integral, the minus infinity
  • 45:35 - 45:36
    to S of
  • 45:36 - 45:42
    the density, DT.
  • 45:42 - 45:44
    So something that'll come later is
  • 45:44 - 45:48
    suppose I have a random variable, S, and I want to model the distribution of the random
  • 45:48 - 45:49
    variable, S.
  • 45:49 - 45:53
    So one thing I could do is I can specify
  • 45:53 - 45:57
    what I think the density
  • 45:57 - 45:58
    is.
  • 45:58 - 46:03
    Or I can specify
  • 46:03 - 46:04
    what the
  • 46:04 - 46:08
    CDF
  • 46:08 - 46:11
    is. These are related by this equation. F is the integral of P of S. You
  • 46:11 - 46:14
    can also
  • 46:14 - 46:16
    recover the density
  • 46:16 - 46:20
    by taking the CDF and taking the derivative. So F prime, take the derivative
  • 46:20 - 46:22
    of the CDF,
  • 46:22 - 46:23
    you get back the
  • 46:23 - 46:25
    density. So this has come up
  • 46:25 - 46:28
    in the middle of when I derive ICA, which is that
  • 46:28 - 46:32
    there'll be a step where they need to assume a distribution for random variable, S.
  • 46:32 - 46:36
    I can either specify the density for S directly, or I can specify the CDF. I
  • 46:36 - 46:39
    choose to specify the
  • 46:40 - 46:42
    CDF.
  • 46:42 - 46:47
    It has to be some function increasing from zero to one.
  • 46:47 - 46:48
    So you can
  • 46:48 - 46:51
    choose any function that looks like that, and in particular,
  • 46:52 - 46:55
    pulling functions out of a hat that look like that. You can, for instance, choose a
  • 46:55 - 46:59
    sigmoid function as the
  • 46:59 - 47:04
    CDF. That would be one way of specifying the density for the random variable S. So
  • 47:04 - 47:05
    this
  • 47:05 - 47:12
    will come up later.
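(A quick numerical sketch of that relationship, my own illustration: take the sigmoid g as the CDF F; its derivative g(s)(1 - g(s)) is then the bell-shaped density it implies, and that density integrates to one.)

```python
import numpy as np

def g(s):
    """Sigmoid, used here as a CDF: increases monotonically from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-s))

def p(s):
    """Density implied by that choice of CDF: p(s) = F'(s) = g(s) * (1 - g(s))."""
    return g(s) * (1.0 - g(s))

s = np.linspace(-10.0, 10.0, 100001)
print(np.sum(p(s)) * (s[1] - s[0]))     # ~1.0: the density integrates to one

# Differentiating the CDF numerically recovers the density, p = F'.
ds = 1e-6
print(np.max(np.abs((g(s + ds) - g(s - ds)) / (2 * ds) - p(s))))   # ~0
```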
  • 47:30 - 47:34
    Just [inaudible], just raise your hand if that is familiar to you, if you've seen
  • 47:34 - 47:41
    that before. Great. So
  • 47:42 - 47:43
    let's
  • 47:43 - 47:49
    start to derive our ICA, or independent component analysis
  • 47:49 - 47:50
    algorithm.
  • 47:50 - 47:54
    Let's assume that the
  • 47:56 - 48:00
    data comes from
  • 48:00 - 48:02
    N original
  • 48:02 - 48:03
    sources.
  • 48:03 - 48:07
    So let's say there are N speakers in a cocktail party.
  • 48:07 - 48:10
    So the original sources, I'm
  • 48:10 - 48:11
    going to write as a vector, S
  • 48:11 - 48:14
    as in RN.
  • 48:14 - 48:17
    So just to be concrete about what I mean about that, I'm going to use
  • 48:17 - 48:22
    SIJ to denote the signal
  • 48:22 - 48:26
    from speaker
  • 48:27 - 48:30
    J
  • 48:30 - 48:33
    at time
  • 48:33 - 48:34
    I. Here's what I mean.
  • 48:34 - 48:38
    So what is sound? When you hear sound waves, sound is created
  • 48:38 - 48:39
    by a pattern
  • 48:39 - 48:43
    of expansions and compressions in air. So the way you're hearing my voice is
  • 48:43 - 48:45
    my
  • 48:45 - 48:48
    mouth is causing certain
  • 48:48 - 48:51
    changes in the air pressure, and then your ear is hearing my voice as
  • 48:51 - 48:54
    detecting those changes in air
  • 48:54 - 48:58
    pressure. So what a microphone records, what my mouth is generating, is
  • 48:58 - 48:59
    a pattern.
  • 48:59 - 49:01
    I'm going to draw a cartoon,
  • 49:01 - 49:05
    I guess.
  • 49:05 - 49:06
    Changes in
  • 49:06 - 49:07
    air pressure. So
  • 49:07 - 49:11
    this is what sound is. You look at a microphone recording, you see these roughly periodic
  • 49:11 - 49:13
    signals that comprise of
  • 49:13 - 49:16
    changes in air pressure over time as the air pressure goes
  • 49:16 - 49:19
    above and below some baseline air pressure.
  • 49:19 - 49:20
    So this
  • 49:20 - 49:22
    is what the speech signal looks like, say.
  • 49:22 - 49:26
    So this is speaker one.
  • 49:26 - 49:29
    Then what I'm saying is that
  • 49:29 - 49:31
    - this is some time, T.
  • 49:31 - 49:34
    What I'm saying is that the value of that point,
  • 49:34 - 49:37
    I'm going to denote as S, super
  • 49:37 - 49:40
    script T, sub script one.
  • 49:40 - 49:42
    Similarly,
  • 49:42 - 49:45
    speaker two, it's
  • 49:45 - 49:47
    outputting some sound wave. If I were
  • 49:47 - 49:50
    to play that, it'll actually sound like
  • 49:50 - 49:53
    a single tone, I guess.
  • 49:53 - 49:56
    So in the same way, at the same time, T,
  • 49:56 - 49:59
    the value of the air
  • 49:59 - 50:03
    pressure generated by speaker two, I'll denote as
  • 50:03 - 50:10
    ST
  • 50:17 - 50:24
    2.
  • 50:30 - 50:37
    So we observe
  • 50:38 - 50:40
    XI equals A times SI, where
  • 50:40 - 50:43
    these XIs
  • 50:43 - 50:46
    are vectors in RN.
  • 50:46 - 50:50
    So I'm going to assume
  • 50:50 - 50:53
    that I have N microphones,
  • 50:53 - 50:54
    and
  • 50:54 - 50:58
    each of my microphones records some linear combination
  • 50:58 - 51:02
    of what the speakers are saying. So each microphone records some overlapping
  • 51:02 - 51:04
    combination of what the speakers are saying.
  • 51:04 - 51:08
    For
  • 51:10 - 51:13
    example, XIJ, which is - this
  • 51:13 - 51:16
    is what microphone J records at time, I. So
  • 51:16 - 51:17
    by definition of
  • 51:17 - 51:22
    the matrix multiplication, this is sum
  • 51:22 - 51:24
    over K of AJK times SIK.
  • 51:24 - 51:29
    Oh, excuse me.
  • 51:29 - 51:36
    Okay? So what my J - sorry.
  • 51:37 - 51:41
    So what my J microphone is recording is
  • 51:42 - 51:44
    some linear combination of
  • 51:44 - 51:46
    all of the speakers. So
  • 51:46 - 51:50
    at time I, what microphone J is recording is some linear combination of
  • 51:50 - 51:53
    what all the speakers are saying at time I.
  • 51:53 - 51:54
    So K here
  • 51:54 - 51:58
    indexes over the N speakers.
  • 51:58 - 52:01
    So our goal
  • 52:03 - 52:06
    is to find the matrix, W, equals A inverse, and
  • 52:06 - 52:10
    just defining W that way.
  • 52:10 - 52:17
    So
  • 52:18 - 52:21
    we can recover the original sources
  • 52:21 - 52:23
    as a linear combination of
  • 52:23 - 52:24
    our
  • 52:24 - 52:31
    microphone recordings, XI.
  • 52:33 - 52:35
    Just as a point of notation,
  • 52:35 - 52:42
    I'm going to write the matrix W this way. I'm going to use
  • 52:51 - 52:55
    lower case W subscript one, subscript two and so on to denote the rows
  • 52:55 - 53:02
    of this matrix, W.
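(To make the setup concrete, here is a small simulation sketch of my own, not an algorithm from the lecture: n independent, non-Gaussian sources are mixed by a square matrix A to give the microphone signals x^(i) = A s^(i). If A were known, W = A^{-1} would unmix the recordings exactly; ICA's job is to estimate W from the x^(i) alone, up to the permutation and sign ambiguities discussed next.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 1000                              # 2 speakers, 1000 time samples

S = rng.uniform(-1.0, 1.0, size=(T, n))     # row i is s^(i): independent, non-Gaussian sources
A = np.array([[1.0, 0.6],                   # mixing matrix (unknown to the ICA algorithm)
              [0.4, 1.0]])
X = S @ A.T                                 # row i is x^(i) = A s^(i): the microphone recordings

W = np.linalg.inv(A)                        # with A known, W = A^{-1} unmixes perfectly
S_recovered = X @ W.T
print(np.allclose(S_recovered, S))          # True
```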
  • 53:14 - 53:15
    Let's
  • 53:15 - 53:19
    see.
  • 53:19 - 53:24
    So let's look at why ICA is possible. Given these overlapping voices,
  • 53:24 - 53:28
    let's think briefly why it might be possible
  • 53:28 - 53:31
    to recover the original sources.
  • 53:31 - 53:33
    So for the next example, I want
  • 53:33 - 53:37
    to say
  • 53:43 - 53:47
    - let's say that each of my speakers
  • 53:47 - 53:50
    outputs - this will sound like white noise. Can I switch
  • 53:50 - 53:53
    the laptop display,
  • 53:53 - 53:57
    please? For this example, let's say that
  • 53:57 - 54:01
    each of my speakers outputs uniform white noise. So
  • 54:01 - 54:05
    if that's the case, these are my axis, S1 and S2.
  • 54:05 - 54:09
    This is what my two speakers would be uttering.
  • 54:09 - 54:11
    The plot of what they're
  • 54:11 - 54:15
    uttering will look like points filling a square box if the two speakers are independently
  • 54:15 - 54:16
    outputting
  • 54:16 - 54:18
    uniform minus one to plus one random variables.
  • 54:18 - 54:20
    So this is a plot of
  • 54:20 - 54:24
    S1 and S2, my original sources.
  • 54:24 - 54:29
    This would be a typical sample of what my microphones record. Here, at
  • 54:29 - 54:31
    the axis, are X1 and X2.
  • 54:31 - 54:35
    So these are images I got from [inaudible] on
  • 54:35 - 54:37
    ICA.
  • 54:39 - 54:44
    Given a picture like this, you can sort of look at this box, and you can sort of tell what the axis of
  • 54:44 - 54:45
    this
  • 54:45 - 54:46
    parallelogram
  • 54:46 - 54:48
    are. You can figure out
  • 54:48 - 54:52
    what linear transformation would transform the parallelogram back
  • 54:52 - 54:54
    to a box.
  • 54:54 - 54:59
    So it turns out there are some inherent ambiguities in ICA.
  • 54:59 - 55:01
    I'll just say what they are.
  • 55:01 - 55:02
    One is that
  • 55:02 - 55:06
    you can't recover the original indexing of the sources. In particular,
  • 55:06 - 55:07
    if
  • 55:07 - 55:11
    I generated the data for speaker one and speaker two,
  • 55:11 - 55:14
    you can run ICA, and then you may end up with the order of the speakers
  • 55:14 - 55:18
    reversed. What that corresponds to is if you take this
  • 55:18 - 55:22
    picture and you flip this picture along a 45-degree
  • 55:22 - 55:26
    axis. You take a 45-degree axis and reflect this picture across the 45-degree axis, you'll still
  • 55:26 - 55:28
    get a box. So
  • 55:28 - 55:31
    there's no way for the algorithms to tell which was speaker No. 1 and
  • 55:31 - 55:33
    which
  • 55:33 - 55:38
    was speaker No. 2. The numbering or the ordering of the speakers is
  • 55:38 - 55:41
    ambiguous. The other source of ambiguity, and these are the only ambiguities
  • 55:41 - 55:42
    in this example,
  • 55:42 - 55:44
    is the sign of the sources. So
  • 55:44 - 55:49
    given my speakers' recordings,
  • 55:49 - 55:53
    you can't tell whether you got a positive SI or whether you got
  • 55:53 - 55:56
    back a negative SI.
  • 55:56 - 55:58
    In this picture, what that corresponds to
  • 55:58 - 56:02
    is if you take this picture and you reflect it along the vertical axis, or if
  • 56:02 - 56:05
    you reflect it along the horizontal axis,
  • 56:05 - 56:06
    you still get a box.
  • 56:06 - 56:09
    You still get back [inaudible] speakers.
  • 56:09 - 56:10
    So
  • 56:10 - 56:12
    it turns out that in this example,
  • 56:12 - 56:17
    you can't guarantee that you've recovered positive SI rather
  • 56:17 - 56:20
    than negative SI.
  • 56:20 - 56:22
    So it turns out that these are the only
  • 56:22 - 56:26
    two ambiguities in this example. What is the permutation of the speakers, and the
  • 56:26 - 56:28
    other is the sign of the speakers.
  • 56:28 - 56:31
    Permutation of the speakers, there's not much you can do about that.
  • 56:31 - 56:35
    It turns out that if you take the audio
  • 56:35 - 56:36
    source
  • 56:36 - 56:39
    and if you flip the sign, and you take negative S, and if you play that through a
  • 56:39 - 56:44
    microphone it'll sound indistinguishable.
  • 56:44 - 56:45
    So
  • 56:45 - 56:48
    for many of the applications we care about, the sign
  • 56:48 - 56:51
    as well as the permutation
  • 56:51 - 56:55
    is ambiguous, but you don't really care
  • 56:55 - 57:02
    about it. Let's switch back
  • 57:04 - 57:09
    to
  • 57:09 - 57:11
    chalk board, please.
  • 57:11 - 57:16
    It turns out, and I don't want to spend too much time on this, but I do want to say it briefly.
  • 57:16 - 57:17
    It turns out the
  • 57:17 - 57:19
    reason why those are the only
  • 57:19 - 57:26
    sources of ambiguity - so the ambiguities were the
  • 57:26 - 57:30
    permutation of the speakers
  • 57:30 - 57:32
    and the signs.
  • 57:32 - 57:35
    It turns out that
  • 57:35 - 57:40
    the reason these were the only ambiguities was because
  • 57:40 - 57:45
    the SIJs were
  • 57:45 - 57:47
  • 57:47 - 57:51
    non-Gaussian. I don't want to spend too much time on this, but I'll say it briefly.
  • 57:51 - 57:54
    Suppose my original sources, S1 and S2, were Gaussian.
  • 57:54 - 57:56
    So
  • 57:58 - 58:02
    suppose SI is
  • 58:02 - 58:04
    Gaussian, with mean zero
  • 58:04 - 58:07
    and identity covariance.
  • 58:07 - 58:11
    That just means that each of my speakers outputs a Gaussian random variable. Here's a typical
  • 58:11 - 58:13
    example of Gaussian
  • 58:13 - 58:18
    data.
  • 58:18 - 58:23
    You will recall the contours of a Gaussian distribution with identity covariance
  • 58:23 - 58:25
    looks like
  • 58:25 - 58:28
    this, right? The Gaussian is a
  • 58:28 - 58:31
    spherically symmetric distribution.
  • 58:31 - 58:35
    So if my speakers were outputting Gaussian random variables, then if
  • 58:35 - 58:38
    I observe a linear combination of this,
  • 58:38 - 58:40
    there's actually no way to recover the
  • 58:40 - 58:43
    original distribution because there's no way for me to tell
  • 58:43 - 58:46
    if the axes are at this angle or if they're at
  • 58:46 - 58:48
    that angle and so
  • 58:48 - 58:52
    on. The Gaussian is a rotationally symmetric
  • 58:52 - 58:57
    distribution, so I would not be able to recover the orientation and the
  • 58:57 - 58:59
    rotation
  • 58:59 - 59:02
    of this. So I don't want to prove this too much. I don't want to spend too much time dwelling on this, but it turns
  • 59:02 - 59:03
    out
  • 59:03 - 59:05
    if your source is a Gaussian,
  • 59:05 - 59:08
    then it's actually impossible to do
  • 59:08 - 59:12
    ICA. ICA relies critically on your data being non-Gaussian because if the data
  • 59:12 - 59:17
    were Gaussian, then the rotation of the data would be ambiguous. So
  • 59:17 - 59:19
    regardless of how much data you have,
  • 59:19 - 59:24
    even if you had infinitely large amounts of data, you would not be able to recover
  • 59:24 - 59:27
    the matrix A or W.
  • 59:33 - 59:40
    Let's go ahead and derive the algorithm.
  • 59:57 - 60:01
    To do this, I need just one more result, and then the derivation will be
  • 60:03 - 60:08
    three lines. Let's say S is a random variable in R N, which is the joint vector of the sounds that all of my
  • 60:08 - 60:11
    speakers are emitting at any time.
  • 60:11 - 60:12
    So
  • 60:12 - 60:16
    let's say the density of S is
  • 60:16 - 60:17
    P subscript S,
  • 60:17 - 60:20
    capital S.
  • 60:20 - 60:23
    So my microphone recording records X equals A S,
  • 60:23 - 60:25
    which equals W inverse
  • 60:25 - 60:31
    S. Equivalently, S equals W times X.
  • 60:31 - 60:35
    So let's think about what is the density of
  • 60:35 - 60:38
    X. So I have P of S. I know the density of
  • 60:38 - 60:41
    S, and X is a linear combination of the S's.
  • 60:41 - 60:45
    So let's figure out what is the density of X.
  • 60:45 - 60:49
    One thing we could do is
  • 60:49 - 60:51
    figure out what S is. So this is just -
  • 60:51 - 60:56
    apply the density of
  • 60:56 - 60:58
    S to W times X. So let's
  • 60:58 - 61:02
    see. This is the probability of S, so we just
  • 61:03 - 61:07
    figure out what S is. S is W times X, so you might guess that
  • 61:07 - 61:10
    the probability of X must be P of S applied to W times X.
  • 61:10 - 61:12
    So this is wrong.
  • 61:12 - 61:15
    It turns out you can do this for probability mass functions but not for
  • 61:15 - 61:17
    continuous densities. So in particular,
  • 61:17 - 61:21
    it's not correct to say that the probability of X is - well, you just figure out what
  • 61:21 - 61:22
    S is.
  • 61:22 - 61:26
    Then you take the density of S applied to that. This is wrong. You
  • 61:26 - 61:28
    can't do this with densities.
  • 61:28 - 61:31
    You can't say the probability of S is that because it's a probability density
  • 61:31 - 61:33
    function.
  • 61:33 - 61:34
    In particular,
  • 61:34 - 61:36
    the
  • 61:36 - 61:38
    right formula is the
  • 61:38 - 61:40
    density of S applied to W times X,
  • 61:40 - 61:42
    times the determinant
  • 61:42 - 61:44
    of the matrix, W.
  • 61:44 - 61:47
    Let me just illustrate that with an example.
  • 61:47 - 61:50
    Let's say
  • 61:50 - 61:52
    the
  • 61:52 - 61:58
    density for S is that. In
  • 61:58 - 62:03
    this example, S is uniform
  • 62:03 - 62:06
    over the unit interval.
  • 62:08 - 62:15
    So the density for S looks like that. It's
  • 62:15 - 62:18
    just the density for the uniform
  • 62:18 - 62:21
    distribution over zero, one.
  • 62:21 - 62:24
    So let me let X be equal to two times
  • 62:24 - 62:30
    S. So this means A equals two.
  • 62:30 - 62:34
    W equals one half. So if
  • 62:34 - 62:37
    S is a uniform distribution over zero, one,
  • 62:37 - 62:40
    then X, which is two times that, will be the uniform distribution over the
  • 62:40 - 62:43
    range from zero to two.
  • 62:43 - 62:50
    So the density for X will be -
  • 62:54 - 62:57
    that's one, that's two,
  • 62:57 - 63:01
    that's one half,
  • 63:03 - 63:05
    and
  • 63:05 - 63:08
    that's one. Okay? The density for X will be the indicator of
  • 63:08 - 63:13
    zero less than or equal to X less than or equal to two,
  • 63:13 - 63:16
    times W, which is one half.
  • 63:16 - 63:20
    So
  • 63:20 - 63:22
    does that make
  • 63:22 - 63:25
    sense? This is the correct density for X, because X is now spread out
  • 63:25 - 63:29
    across a wider range. The density of X is now smaller,
  • 63:29 - 63:36
    and therefore, the density of X has this one half
  • 63:38 - 63:39
    term
  • 63:39 - 63:43
    here. Okay? This is an illustration for the case of one-dimensional random variables,
  • 63:43 - 63:44
  • 63:44 - 63:45
    or S
  • 63:45 - 63:49
    and X in one dimension. I'm not going to show it, but the generalization of this to vector-valued random variables is that the
  • 63:49 - 63:52
    density of X is given by this
  • 63:52 - 63:54
    times the determinant of the matrix, W. Over here,
  • 63:54 - 64:01
    I showed the one-dimensional version of this more general formula.
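Writing out the rule being used here (my notation for what was stated verbally and drawn on the board):

```latex
% Change of variables: if x = A s with A invertible and W = A^{-1}, then
\[
  p_X(x) \;=\; p_S(Wx)\cdot |W| ,
\]
% where |W| is the absolute value of the determinant of W.
% The one-dimensional example above: s ~ Uniform[0,1], x = 2s, W = 1/2, so
\[
  p_X(x) \;=\; \mathbf{1}\{0 \le x \le 2\}\cdot \frac{1}{2}.
\]
```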
  • 64:21 - 64:28
    So we're nearly there. Here's
  • 64:29 - 64:34
    how I can implement ICA.
  • 64:34 - 64:37
    So my distribution on
  • 64:37 - 64:44
    S,
  • 64:50 - 64:53
    so I'm going to assume that my density on S
  • 64:53 - 64:55
    is given by this as a product over the
  • 64:55 - 65:00
    N speakers of the density of speaker
  • 65:00 - 65:01
    I
  • 65:01 - 65:04
    emitting a certain sound. This is a product of densities.
  • 65:04 - 65:08
    This is a product of distributions because I'm going to assume that the
  • 65:08 - 65:11
    speakers are having independent conversations. So the SI's are independent
  • 65:11 - 65:16
    for different values of I.
  • 65:16 - 65:18
    So by the formula we just worked out,
  • 65:18 - 65:22
    the density for X would be equal to that.
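In symbols, the density on the board (writing w_j^T for the j-th row of W, so s_j = w_j^T x) follows from the independence assumption combined with the change-of-variables rule:

```latex
\[
  p(x) \;=\; \Bigg(\prod_{j=1}^{n} p_S\!\big(w_j^{\top} x\big)\Bigg)\cdot |W| .
\]
```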
  • 65:37 - 65:39
    I'll just remind you, W was A
  • 65:39 - 65:43
    inverse. It was
  • 65:43 - 65:44
    this matrix
  • 65:44 - 65:48
    I defined previously
  • 65:48 - 65:50
    so that SI
  • 65:50 - 65:53
    equals WI transpose
  • 65:53 - 65:59
    X. So that's what's in
  • 65:59 - 66:02
    there. To complete my formulation for this model,
  • 66:02 - 66:06
    the final thing I need to do is
  • 66:06 - 66:10
    choose
  • 66:10 - 66:12
    a density
  • 66:12 - 66:14
    for what I think each speaker is
  • 66:14 - 66:18
    saying. I need to assume some density over
  • 66:18 - 66:22
    the sounds emitted by an individual speaker.
  • 66:22 - 66:26
    So following the discussion I had right when I introduced
  • 66:26 - 66:28
    ICA,
  • 66:28 - 66:31
    one thing I could do is I could choose
  • 66:31 - 66:32
    the density for S,
  • 66:32 - 66:36
    or equivalently, I could choose the CDF, the cumulative distribution
  • 66:36 - 66:37
    function for
  • 66:37 - 66:38
    S.
  • 66:38 - 66:41
    In this case, I'm going to choose
  • 66:41 - 66:45
    a CDF, probably for historical reasons and probably for
  • 66:45 - 66:47
    convenience.
  • 66:47 - 66:50
    I need to choose the CDF for S, so
  • 66:50 - 66:55
    what that means is I just need to choose some function that increases from zero to
  • 66:55 - 66:59
    one. I know I can't choose a Gaussian because we know you can't
  • 66:59 - 67:02
    do ICA on Gaussian data.
  • 67:02 - 67:05
    So I need some function increasing from zero to one
  • 67:05 - 67:09
    that is not the cumulative distribution function for a
  • 67:09 - 67:10
    Gaussian distribution.
  • 67:10 - 67:14
    So what other functions do I know that increase from zero to one? I
  • 67:14 - 67:16
    just choose the
  • 67:16 - 67:18
    CDF to be
  • 67:18 - 67:22
    the
  • 67:22 - 67:23
    sigmoid function.
  • 67:23 - 67:25
    This is a
  • 67:25 - 67:27
    commonly-made choice that
  • 67:27 - 67:31
    is made for convenience. There is actually no great reason for why you
  • 67:31 - 67:34
    choose a sigmoid function. It's just a convenient function that we all know
  • 67:34 - 67:35
    and are familiar with
  • 67:35 - 67:38
    that happens to increase from zero to one.
  • 67:38 - 67:45
    When you take the derivative
  • 67:46 - 67:49
    of the sigmoid, that will give you back
  • 67:49 - 67:50
    your
  • 67:50 - 67:55
    density. This is just not Gaussian. This is the main virtue of choosing the sigmoid.
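Concretely, the choice being described (writing g for the sigmoid) is the following; the density you get by differentiating it is the logistic density:

```latex
% Assumed CDF for each source:
\[
  g(s) = \frac{1}{1 + e^{-s}},
\]
% so the assumed source density is its derivative:
\[
  p_S(s) = g'(s) = g(s)\bigl(1 - g(s)\bigr) = \frac{e^{-s}}{(1 + e^{-s})^{2}} .
\]
```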
  • 67:55 - 68:02
    So
  • 68:19 - 68:22
    there's really no rationale for the choice of the sigmoid. Lots of other things will
  • 68:22 - 68:24
    work fine, too.
  • 68:24 - 68:27
    It's just a common, reasonable default.
  • 68:38 - 68:40
    It turns out that
  • 68:40 - 68:45
    one reason the sigmoid works well for a lot of data sources is that
  • 68:45 - 68:49
    if this is the Gaussian -
  • 68:49 - 68:52
    If you actually take the sigmoid and you take its derivative,
  • 69:02 - 69:07
    you find that the sigmoid has fatter tails than the Gaussian. By this I mean
  • 69:07 - 69:11
    the density of the sigmoid dies down to zero much more slowly than
  • 69:11 - 69:12
    the
  • 69:12 - 69:13
    Gaussian.
  • 69:13 - 69:18
    For the Gaussian, the magnitude of the tails dies down as E to the minus S squared.
  • 69:18 - 69:22
    For the sigmoid, the tails look like E to the minus
  • 69:22 - 69:27
    S. So the tails die down as E to the minus S, rather than E
  • 69:27 - 69:30
    to the minus S squared. It turns out that most distributions with this property
  • 69:30 - 69:34
    of heavy tails, where the distribution decays to zero relatively slowly
  • 69:34 - 69:38
    compared to the Gaussian, will
  • 69:38 - 69:40
    work fine for your data.
  • 69:40 - 69:44
    Actually, one other choice you can sometimes use is what's called the Laplacian
  • 69:44 - 69:46
    distribution, which is
  • 69:46 - 69:53
    that. This will work fine, too, for many data sources.
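To make the tail comparison concrete (up to normalizing constants; this is my summary): the Gaussian density decays like e^{-s^2/2}, while the logistic density above decays only like e^{-|s|} for large |s|. The Laplacian alternative mentioned here is

```latex
\[
  p_S(s) = \frac{1}{2}\,e^{-|s|},
\]
% which also has heavier-than-Gaussian tails and works for many data sources.
```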
  • 70:07 - 70:08
    Sticking with the sigmoid for now, I'll just
  • 70:08 - 70:09
    write
  • 70:09 - 70:14
    down the algorithm in two steps. So given
  • 70:14 - 70:17
    my training set, and
  • 70:17 - 70:21
    as usual, this is an unlabeled training set, I can
  • 70:21 - 70:26
    write down the log likelihood of my parameters. So that's a sum over my training
  • 70:26 - 70:27
    examples of the log of - this times
  • 70:27 - 70:34
    that.
  • 70:43 - 70:45
    So that's my log
  • 70:45 - 70:52
    likelihood.
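Written out (this is the standard form of the log likelihood for this model; I'm reconstructing the board expression):

```latex
% Log likelihood of the unmixing matrix W over m training examples x^{(i)}:
\[
  \ell(W) \;=\; \sum_{i=1}^{m}
    \left( \sum_{j=1}^{n} \log g'\!\left(w_j^{\top} x^{(i)}\right)
           \;+\; \log |W| \right),
\]
% where g is the sigmoid CDF chosen above, g' its derivative (the assumed
% source density), and |W| the absolute value of the determinant of W.
```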
  • 70:53 - 70:59
    To learn the parameters, W, of this model, I can use stochastic gradient ascent,
  • 70:59 - 71:06
    which is
  • 71:07 - 71:09
    just that.
  • 71:09 - 71:11
    It turns out, if you work through the math,
  • 71:11 - 71:14
    let's see. If P of S
  • 71:14 - 71:20
    is equal to the derivative of the
  • 71:20 - 71:24
    sigmoid, then if you just work through the math to compute the derivative there - you've all
  • 71:24 - 71:27
    done this a lot of times. I won't bother to show
  • 71:27 - 71:34
    the details. You find that it's equal to this.
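For reference, the update that this derivative gives with the sigmoid choice (a standard way of writing it; here g is applied elementwise and the bold 1 denotes the all-ones vector):

```latex
% Stochastic gradient ascent on ell(W), one example x^{(i)} at a time,
% with learning rate alpha:
\[
  W := W + \alpha\left(
      \bigl(\mathbf{1} - 2\,g(W x^{(i)})\bigr)\,{x^{(i)}}^{\top}
      \;+\; \bigl(W^{\top}\bigr)^{-1}
  \right).
\]
```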
  • 71:47 - 71:50
    Okay? That's just - you can work those out yourself. It's just math to
  • 71:50 - 71:54
    compute the derivative of this with respect to
  • 71:54 - 71:59
    W. So to summarize, given the training set,
  • 71:59 - 72:02
    here's my stochastic gradient ascent update rule. So you run
  • 72:02 - 72:06
    stochastic gradient ascent to learn the parameters W.
  • 72:06 - 72:08
    After you're
  • 72:08 - 72:10
    done, you then
  • 72:12 - 72:14
    output SI equals
  • 72:14 - 72:17
    WXI, and you've separated your sources
  • 72:17 - 72:18
    of your
  • 72:18 - 72:22
    data back out into the original independent sources.
  • 72:22 - 72:26
    Hopefully up to only a permutation and a plus/minus
  • 72:26 - 72:31
    sign ambiguity.
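To tie the steps together, here is a minimal sketch of this procedure in code, under the sigmoid-CDF assumption and using the stochastic gradient ascent update above. The function names, learning rate, pass count, and toy mixing matrix are illustrative choices of mine, not values from the lecture; in practice you would also center the data first and typically anneal the step size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.001, n_passes=50):
    """X: (m, n) array, one mixed recording x^(i) per row (assumed zero-mean).
    Returns W such that S = X @ W.T recovers the sources, up to a
    permutation and per-source sign."""
    m, n = X.shape
    W = np.eye(n)                      # initialize the unmixing matrix
    rng = np.random.default_rng(0)
    for _ in range(n_passes):
        for i in rng.permutation(m):   # stochastic: one example at a time
            x = X[i].reshape(n, 1)     # column vector x^(i)
            # gradient of the log likelihood for this example:
            #   (1 - 2 g(W x)) x^T + (W^T)^{-1}
            grad = (1.0 - 2.0 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Usage sketch on a toy two-speaker mixture:
if __name__ == "__main__":
    m = 2000
    rng = np.random.default_rng(1)
    S = rng.laplace(size=(m, 2))       # non-Gaussian (heavy-tailed) sources
    A = np.array([[1.0, 0.5],
                  [0.6, 1.0]])         # "unknown" mixing matrix
    X = S @ A.T                        # microphone recordings x = A s
    W = ica(X)
    S_hat = X @ W.T                    # recovered sources, up to perm/sign
```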
  • 72:31 - 72:35
    Okay? So just switch back to the laptop, please?
  • 72:35 - 72:42
    So we'll just wrap up with a couple of examples of applications of ICA.
  • 72:42 - 72:43
    This is
  • 72:43 - 72:47
    actually a picture of our TA, Katie.
  • 72:47 - 72:50
    So one of the applications of ICA is
  • 72:50 - 72:52
    to process
  • 72:52 - 72:57
    various types of physiological recording data, such as EEG data. This
  • 72:57 - 72:59
    is a picture of
  • 72:59 - 73:02
    an EEG cap, in which there are a number of electrodes
  • 73:02 - 73:05
    you place
  • 73:05 - 73:08
    on the - in this case, on Katie's brain, on Katie's scalp.
  • 73:08 - 73:13
    Each electrode measures changes in voltage over time
  • 73:13 - 73:15
    on the scalp.
  • 73:15 - 73:18
    On the right is a typical example of EEG data
  • 73:18 - 73:23
    where each electrode measures - just changes in voltage over
  • 73:23 - 73:24
    time. So
  • 73:24 - 73:28
    the horizontal axis is time, and the vertical axis is voltage. So here's the same thing,
  • 73:28 - 73:30
    blown up a little bit.
  • 73:30 - 73:33
    You notice there are artifacts in this
  • 73:33 - 73:36
    data. Where the circle is, where the data is circled, all
  • 73:36 - 73:38
    the
  • 73:38 - 73:41
    electrodes seem to show these very synchronized readings.
  • 73:41 - 73:45
    It turns out that when we look at EEG data as well as a number of other
  • 73:45 - 73:47
    types of data, there are
  • 73:47 - 73:52
    artifacts from heartbeats and from human eye blinks and so on. So the
  • 73:52 - 73:55
    cartoon version is, if you imagine placing the
  • 73:55 - 73:57
    electrodes, or
  • 73:57 - 73:58
    microphones, on my scalp,
  • 73:58 - 74:02
    then each microphone is recording some overlapping combination of all the
  • 74:02 - 74:05
    things happening in my brain or in my body.
  • 74:05 - 74:08
    My brain has a number of different processes going on. My body's [inaudible]
  • 74:08 - 74:11
    going on, and
  • 74:11 - 74:13
    each electrode measures a sum
  • 74:13 - 74:16
    of the different voices in my brain.
  • 74:16 - 74:20
    That didn't quite come out the way I wanted it to.
  • 74:20 - 74:22
    So we can just take this data
  • 74:22 - 74:25
    and run ICA on it and find the independent components - what
  • 74:25 - 74:26
    independent
  • 74:26 - 74:30
    processes are going on in my brain. This is an example of running ICA.
  • 74:30 - 74:33
    So you find that a small number of components, like those shown up there,
  • 74:33 - 74:38
    they correspond to the heartbeat, where the arrows are - those are very periodic
  • 74:38 - 74:42
    signals. They come on occasionally and correspond to components of the
  • 74:42 - 74:43
    heartbeat.
  • 74:43 - 74:47
    You also find things like an eye blink component, corresponding to a
  • 74:47 - 74:50
    signal generated when you blink your eyes.
  • 74:50 - 74:54
    By doing this, you can then subtract out the heartbeat and the eye blink
  • 74:54 - 74:56
    artifacts from the data, and now
  • 74:56 - 75:01
    you get much cleaner EEG readings. You can
  • 75:01 - 75:04
    do further scientific studies. So this is a
  • 75:04 - 75:06
    pretty commonly used preprocessing step
  • 75:06 - 75:10
    that is a common application of ICA.
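As a practical aside (this is not something done in the lecture), this kind of artifact removal is usually carried out with an off-the-shelf ICA implementation. A hedged sketch using scikit-learn's FastICA - a different ICA estimator than the maximum-likelihood derivation above, but serving the same separation purpose - might look like the following; the channel count and the artifact component indices are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

def remove_artifact_components(recordings, artifact_idx, n_components=None):
    """recordings: (n_samples, n_channels) array of voltages.
    artifact_idx: indices of independent components judged to be artifacts
    (e.g. by inspecting their time courses), chosen by hand here."""
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(recordings)   # estimated independent components
    sources[:, artifact_idx] = 0.0            # zero out eye-blink / heartbeat components
    return ica.inverse_transform(sources)     # map back to the channel space

# Usage sketch with fake data: 10 channels, components 0 and 3 flagged as artifacts.
if __name__ == "__main__":
    fake = np.random.default_rng(0).standard_normal((5000, 10))
    cleaned = remove_artifact_components(fake, artifact_idx=[0, 3])
```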
  • 75:10 - 75:13
    The last example is
  • 75:13 - 75:16
    an application to image data. This is
  • 75:16 - 75:21
    the result of running ICA on small natural image patches. Suppose I take
  • 75:21 - 75:22
    natural images
  • 75:22 - 75:26
    and run ICA on the data and ask what are the independent components of data.
  • 75:26 - 75:30
    It turns out that these are the bases you get. So this is a plot of the
  • 75:30 - 75:33
    sources you get.
  • 75:33 - 75:36
    This algorithm is saying that a natural image patch
  • 75:36 - 75:38
    shown
  • 75:38 - 75:40
    on the left
  • 75:40 - 75:45
    is often expressed as a sum, or a linear combination, of
  • 75:45 - 75:47
    independent sources of
  • 75:47 - 75:48
    things that make up images.
  • 75:48 - 75:53
    So this models natural images as being generated by independent objects
  • 75:53 - 75:55
    that generate different edges in the image.
  • 75:55 - 76:01
    One of the fascinating things about this is that, similarly, in neuroscience, this has also been
  • 76:01 - 76:05
    hypothesized as a method for how the human brain processes image
  • 76:05 - 76:06
    data. It
  • 76:06 - 76:10
    turns out, this is similar, in many ways, to computations
  • 76:10 - 76:15
    happening in early visual processing in the human brain,
  • 76:15 - 76:18
    in the mammalian
  • 76:18 - 76:20
    brain. It's just
  • 76:20 - 76:25
    interesting to see that edges are the independent components of images.
  • 76:25 - 76:31
    Are there quick questions? Because I'm running late - quick questions before I close? Interviewee: [Inaudible] square matrix? Instructor (Andrew
  • 76:31 - 76:32
    Ng): Oh,
  • 76:32 - 76:35
    yes. For the algorithms I describe, I assume A is a square matrix.
  • 76:35 - 76:39
    It turns out if you have more microphones than speakers, you can also apply very
  • 76:39 - 76:40
    similar algorithms. If
  • 76:40 - 76:44
    you have fewer microphones than speakers, there's sort of an open research problem. The odds
  • 76:44 - 76:48
    are that if you have one male and one female speaker, but one microphone, you can
  • 76:48 - 76:52
    sometimes sort of separate them because one is high, one is low. If you have two
  • 76:52 - 76:55
    male speakers or two female speakers, then it's beyond the state of the art now to separate them
  • 76:55 - 76:57
    with one
  • 76:57 - 77:00
    microphone. It's a great research program. Okay.
  • 77:00 - 77:05
    Sorry about running late again. Let's close now, and we'll
  • 77:05 - 77:06
    continue with reinforcement learning.
Title:
Lecture 15 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on principal component analysis (PCA) and independent component analysis (ICA) in relation to unsupervised machine learning.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:17:18