
Lecture 3 | Machine Learning (Stanford)

  • 0:00 - 0:09
    (music)
  • 0:10 - 0:13
    This presentation is delivered by
    the Stanford Center for
    the Stanford Center for
  • 0:13 - 0:15
    Professional Development.
  • 0:23 - 0:27
    Okay. Good morning and
    welcome back to the third
  • 0:27 - 0:31
    lecture of this class. So here's
  • 0:31 - 0:33
    what I want to do today,
  • 0:33 - 0:34
    and some of
  • 0:34 - 0:38
    the topics I do today may seem a little bit like I'm jumping, sort of,
  • 0:38 - 0:44
    from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture, we
  • 0:44 - 0:45
    talked about
  • 0:45 - 0:48
    linear regression and today I want to talk about
  • 0:48 - 0:52
    sort of an adaptation of that called locally weighted regression. It's a very powerful
  • 0:52 - 0:57
    algorithm that's actually probably my former mentor's favorite machine
  • 0:57 - 0:59
    learning algorithm. We'll then
  • 0:59 - 1:03
    talk about a probabilistic interpretation of linear regression
  • 1:03 - 1:08
    and use that to move onto our first classification algorithm, which
  • 1:08 - 1:09
    is logistic regression; take
  • 1:09 - 1:13
    a brief digression to tell you about something called the perceptron algorithm,
  • 1:13 - 1:16
    which is something we'll come back to, again, later this quarter;
  • 1:16 - 1:17
    and
  • 1:17 - 1:21
    time allowing I hope to get to Newton's method, which is an algorithm for
  • 1:21 - 1:23
    fitting logistic regression models.
  • 1:24 - 1:31
    So, let's just recap what we were talking about in the previous lecture,
  • 1:31 - 1:34
    remember the notation that I defined was that
  • 1:34 - 1:35
    I used this
  • 1:35 - 1:38
    X superscript i,
  • 1:38 - 1:41
    Y superscript i to denote the ith training example.
  • 1:47 - 1:48
    And
  • 1:49 - 1:52
    when we're talking about linear regression
  • 1:53 - 1:54
    or ordinary least squares,
  • 1:54 - 1:56
    we use this to denote
  • 1:56 - 1:58
    the predicted value
  • 1:58 - 2:01
    output by my hypothesis H on
  • 2:01 - 2:02
    the input X^i.
  • 2:02 - 2:04
    And my hypothesis
  • 2:04 - 2:06
    was parameterized by the vector
  • 2:06 - 2:08
    parameters theta
  • 2:08 - 2:14
    and so we said that this was equal to the sum over j of theta j
  • 2:14 - 2:20
    times X subscript j superscript i,
  • 2:20 - 2:22
    written more simply as theta transpose X.
  • 2:22 - 2:25
    And we had the convention that X
  • 2:25 - 2:29
    subscript zero is equal to one so
    this accounts for the intercept term in our
  • 2:29 - 2:31
    linear regression model.
  • 2:31 - 2:33
    And lowercase n here
  • 2:33 - 2:37
    was the notation I was using for the
  • 2:37 - 2:40
    number of features in my training set. Okay? So in the
  • 2:40 - 2:44
    example trying to predict housing prices, we had two features, the size of
  • 2:44 - 2:46
    the house and the number of bedrooms.
  • 2:46 - 2:51
    We had two features and so little n was equal to two.
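    As a minimal sketch of this notation (the numbers below are made up purely for illustration), the hypothesis is just an inner product once we prepend X0 = 1:

        import numpy as np

        # Made-up parameter vector [theta0, theta1, theta2]:
        # intercept, weight on size, weight on number of bedrooms.
        theta = np.array([50.0, 0.1, 20.0])

        # One training example with the convention x0 = 1: (intercept, size, bedrooms).
        # Here n = 2 features, not counting the intercept term.
        x = np.array([1.0, 2104.0, 3.0])

        # h_theta(x) = sum over j of theta_j * x_j = theta transpose x.
        print(theta @ x)
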
  • 2:51 - 2:52
  • 2:52 - 2:55
    So just to
  • 2:55 - 2:57
    finish recapping the previous lecture, we
  • 2:57 - 3:01
    defined this quadratic cost function J of theta equals one-half,
  • 3:01 - 3:10
    sum over i equals one to m, of H theta of X i minus Y i,
  • 3:10 - 3:12
    squared,
  • 3:12 - 3:17
    where this is the sum over the m training examples in my training set. So lowercase
  • 3:17 - 3:18
    m
  • 3:18 - 3:21
    was the notation I've been using to denote the number of training examples I have and the
  • 3:21 - 3:23
    size of my training set.
  • 3:23 - 3:25
    And at the end of the last lecture,
  • 3:25 - 3:27
    we derived
  • 3:27 - 3:31
    the value of theta that minimizes this in closed form, which was X
  • 3:31 - 3:32
    transpose X
  • 3:32 - 3:35
    inverse X
  • 3:35 - 3:39
    transpose Y. Okay?
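    As a quick sketch of that closed-form solution (the data here is made up, and np.linalg.solve is used in place of an explicit matrix inverse, which is equivalent but numerically preferable):

        import numpy as np

        # Made-up training set: m = 5 examples, one feature (living area).
        area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
        price = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

        # Design matrix X with the intercept column x0 = 1.
        X = np.column_stack([np.ones_like(area), area])

        # Normal equations: theta = (X^T X)^{-1} X^T y.
        theta = np.linalg.solve(X.T @ X, X.T @ price)

        # Prediction for a new house: h(x) = theta^T x.
        print(theta, theta @ np.array([1.0, 2000.0]))
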
  • 3:39 - 3:42
    So
  • 3:42 - 3:46
    as we move on in today's lecture, I'll continue to use this notation and, again,
  • 3:46 - 3:51
    I realize this is a fair amount of notation to remember,
  • 3:51 - 3:55
    so if partway through this lecture you forgot - if you're having trouble remembering
  • 3:55 - 4:02
    what lowercase m is or what lowercase n is or something please raise your hand and ask. When
  • 4:05 - 4:08
    we talked about linear regression last time
  • 4:08 - 4:10
    we used two features. One of the features was
  • 4:10 - 4:14
    the size of the houses in square feet, so the living area of the house,
  • 4:14 - 4:18
    and the other feature was the number of bedrooms in the house.
  • 4:18 - 4:22
    In general, when you apply a machine-learning algorithm to some problem that you care
  • 4:22 - 4:23
    about,
  • 4:23 - 4:28
    the choice of the features will very much be up to you, right? And
  • 4:28 - 4:32
    the way you choose your features to give the learning algorithm will often have a
  • 4:32 - 4:34
    large impact on how it actually does.
  • 4:34 - 4:40
    So just for example,
  • 4:40 - 4:44
    the choice we made last time was X1 equals the size, and let's leave aside
  • 4:44 - 4:47
    the feature of the number of bedrooms for now, let's say we don't have data
  • 4:47 - 4:51
    that tells us how many bedrooms are in these houses.
  • 4:51 - 4:54
    One thing you could do is actually define - oh, let's
  • 4:54 - 4:56
    draw this out.
  • 4:56 - 5:03
    And so, right? So say that
  • 5:05 - 5:08
    was the size of the house and that's the price of the house. So
  • 5:08 - 5:10
    if you use
  • 5:10 - 5:15
    this as a feature maybe you get theta zero plus theta 1,
  • 5:15 - 5:20
    X1, this, sort of, linear model.
  • 5:20 - 5:22
    If you choose - let me just copy the
  • 5:22 - 5:27
    same data set over, right?
  • 5:27 - 5:30
    You can define the set of features where X1 is equal to the size of the house
  • 5:30 - 5:35
    and X2 is the
  • 5:35 - 5:36
    square
  • 5:36 - 5:37
    of the size
  • 5:37 - 5:38
    of the house. Okay?
  • 5:38 - 5:43
    So X1 is the size of the house in say square footage and X2 is
  • 5:43 - 5:46
    just take whatever the square footage of the house is and just
  • 5:46 - 5:49
    square that number, and this would be another way to come up with a feature,
  • 5:49 - 5:51
    and if you do that then
  • 5:51 - 5:56
    the same algorithm will end up fitting a
  • 5:56 - 5:59
    quadratic function for you.
  • 5:59 - 6:01
    Theta two, X1 squared.
  • 6:01 - 6:06
    Okay? Because this X1 squared
  • 6:06 - 6:10
    is actually X2. And
  • 6:10 - 6:13
    depending on what the data looks like, maybe this is a slightly
  • 6:13 - 6:17
    better fit to the data. You can actually
  • 6:17 - 6:24
    take this even further, right?
  • 6:25 - 6:27
    Which is - let's see.
  • 6:27 - 6:30
    I have seven training examples here, so you can actually
  • 6:30 - 6:34
    maybe fit up to a sixth-order polynomial. You can actually fit a model
  • 6:34 - 6:36
    theta zero plus
  • 6:36 - 6:38
    theta one, X1 plus theta two,
  • 6:38 - 6:42
    X squared plus up to
  • 6:42 - 6:49
    theta six, X to the
  • 6:49 - 6:53
    power of six, and fit a sixth-order polynomial
  • 6:53 - 6:56
    to these seven data points.
  • 6:56 - 6:58
    And if you do that you find that
  • 6:58 - 7:02
    you come up with a model that fits your data exactly. This is where, I guess,
  • 7:02 - 7:06
    in this example I drew, we have seven data points, so if you fit a
  • 7:06 - 7:08
    sixth-order polynomial you can, sort of, fit a line
  • 7:08 - 7:12
    that passes through these seven points perfectly.
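    A small sketch of this effect, with seven made-up points (the numbers are illustrative only): a sixth-order polynomial drives the training error to essentially zero, which is exactly the fit-through-every-point picture described here.

        import numpy as np

        # Seven made-up (size, price) training points.
        x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
        y = np.array([2.0, 2.6, 3.1, 3.2, 3.8, 4.7, 5.0])

        for degree in (1, 2, 6):
            # polyfit does least squares on the features x, x^2, ..., x^degree.
            coeffs = np.polyfit(x, y, degree)
            residuals = y - np.polyval(coeffs, x)
            # With degree 6 and seven points, the sum of squared residuals
            # is ~0: the curve passes through every training point.
            print(degree, np.sum(residuals ** 2))
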
  • 7:12 - 7:14
    And you probably find that the curve
  • 7:14 - 7:17
    you get will look something
  • 7:17 - 7:20
    like that.
  • 7:20 - 7:23
    And on the one hand, this is a great model in a sense that it
  • 7:23 - 7:25
    fits your training data perfectly.
  • 7:25 - 7:27
    On the other hand, this is probably not a
  • 7:27 - 7:29
    very good model in the sense that
  • 7:29 - 7:32
    none of us seriously think that this is a very good predictor of housing
  • 7:32 - 7:36
    prices as a function of the size of the house, right? So
  • 7:36 - 7:39
    we'll actually come back to this later. It turns
  • 7:39 - 7:42
    out, of the models we have here,
  • 7:42 - 7:45
    I feel like maybe the quadratic model fits the data best.
  • 7:45 - 7:46
    Whereas
  • 7:46 - 7:48
  • 7:48 - 7:52
    for the linear model, it looks like there's actually a bit of a quadratic component in this
  • 7:52 - 7:53
    data
  • 7:53 - 7:57
    that the linear function is not capturing.
  • 7:57 - 8:00
    So we'll actually come back to this a little bit later and talk about the problems
  • 8:00 - 8:04
    associated with fitting models that are either too simple, use too small a set of
  • 8:04 - 8:05
    features, or
  • 8:05 - 8:08
    or models that are too complex and maybe
  • 8:08 - 8:11
    use too large a set of features.
  • 8:11 - 8:12
    Just to give these a
  • 8:12 - 8:13
    name,
  • 8:13 - 8:14
    we call this
  • 8:14 - 8:20
    the problem of underfitting
  • 8:20 - 8:23
    and, very informally, this refers to a setting where
  • 8:23 - 8:27
    there are obvious patterns that - where there are patterns in the data that the
  • 8:27 - 8:29
    algorithm is just failing to fit.
  • 8:29 - 8:33
    And this problem here we refer to as
  • 8:33 - 8:35
    overfitting
  • 8:35 - 8:36
    and, again, very informally,
  • 8:36 - 8:41
    this is when the algorithm is fitting the idiosyncrasies of this specific data set,
  • 8:41 - 8:43
    right? It just so happens that
  • 8:43 - 8:48
    of the seven houses we sampled in Portland, or wherever you collect data from,
  • 8:48 - 8:52
    that house happens to be a bit more expensive, that house happened to be a bit less
  • 8:52 - 8:54
    expensive, and by
  • 8:54 - 8:57
    fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties
  • 8:57 - 8:59
    of this data set,
  • 8:59 - 9:01
    rather than the true underlying trends
  • 9:01 - 9:05
    of how housing prices vary as a function of the size of the house. Okay?
  • 9:05 - 9:08
    So these are two very different problems. We'll define them more formally later
  • 9:08 - 9:12
    and talk about how to address each of these problems,
  • 9:12 - 9:14
    but for now I
  • 9:14 - 9:21
    hope you appreciate that there is this issue of selecting features. So if
  • 9:22 - 9:24
    you want to address
  • 9:24 - 9:26
    this problem of choosing features there are a few ways to do
  • 9:26 - 9:27
    so.
  • 9:27 - 9:29
    We'll talk about
  • 9:29 - 9:33
    feature selection algorithms later this quarter as well. So automatic algorithms
  • 9:33 - 9:34
    for choosing
  • 9:34 - 9:36
    what features you use in a
  • 9:36 - 9:38
    regression problem like this.
  • 9:38 - 9:42
    What I want to do today is talk about a class of algorithms
  • 9:42 - 9:45
    called non-parametric learning algorithms that will help
  • 9:45 - 9:50
    to alleviate the need somewhat for you to choose features very carefully. Okay?
  • 9:50 - 9:57
    And this leads us into our discussion of locally weighted regression.
  • 9:57 - 10:04
    And just to define the term,
  • 10:12 - 10:17
    linear regression, as we've defined it so far, is an example of a parametric learning
  • 10:17 - 10:18
  • 10:18 - 10:19
    algorithm. A parametric
  • 10:19 - 10:22
    learning algorithm is one that's defined as
  • 10:22 - 10:25
    an algorithm that has a fixed number of parameters
  • 10:25 - 10:27
    that are fit to the data. Okay? So
  • 10:27 - 10:29
    in linear regression
  • 10:29 - 10:33
    we have a fixed set of parameters theta, right? That must
  • 10:33 - 10:39
    be fit to the data.
  • 10:39 - 10:46
    In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The
  • 10:59 - 11:03
    formal definition is not very intuitive, so I'll also give you a
  • 11:03 - 11:04
    second, say, more
  • 11:04 - 11:06
    intuitive one.
  • 11:06 - 11:10
    The, sort of, formal definition of the non-parametric learning algorithm is that it's an algorithm
  • 11:10 - 11:17
    where the number of parameters
  • 11:18 - 11:22
    grows
  • 11:22 - 11:25
    with M, with the size of the training set. And usually it's
  • 11:25 - 11:31
    defined as the number of parameters growing linearly with the size of the training set.
  • 11:31 - 11:33
    This is the formal definition.
  • 11:33 - 11:33
    A
  • 11:33 - 11:36
    slightly less formal definition is that
  • 11:36 - 11:38
    the amount of stuff that your learning algorithm needs
  • 11:38 - 11:41
    to keep around
  • 11:41 - 11:45
    will grow linearly with the training set. Or another way of saying it is that this is an
  • 11:45 - 11:46
    algorithm that
  • 11:46 - 11:52
    will need to keep around the entire training set, even after learning. Okay? So
  • 11:52 - 11:54
    don't worry too much about this definition. But
  • 11:54 - 11:56
    what I want to do now is
  • 11:56 - 11:59
    describe a specific non-parametric learning algorithm
  • 11:59 - 12:06
    called locally weighted regression.
  • 12:10 - 12:17
    Which also goes by a couple of other names -
  • 12:17 - 12:21
    which also goes by the name of Loess for, sort of, historical reasons. Loess
  • 12:21 - 12:23
    is usually spelled L-O-E-S-S,
  • 12:23 - 12:24
    sometimes spelled like that,
  • 12:24 - 12:27
    too. I just call it locally weighted regression.
  • 12:27 - 12:34
    So here's
  • 12:35 - 12:38
    the idea. This will be an algorithm that allows us
  • 12:38 - 12:43
    to worry a little bit less about having to choose features very carefully.
  • 12:43 - 12:48
    So
  • 12:48 - 12:55
    for my motivating example, let's say that I
  • 12:55 - 12:59
    have a
  • 12:59 - 13:01
    training set that looks like this, okay?
  • 13:01 - 13:04
    So this is X and that's Y.
  • 13:04 - 13:07
    If you run
  • 13:07 - 13:11
    linear regression on this and you fit maybe a linear function to this
  • 13:11 - 13:12
    and you end up with a
  • 13:12 - 13:13
    more or less
  • 13:13 - 13:17
    flat, straight line, which is not a very good fit to this data. You
  • 13:17 - 13:20
    can sit around and stare at this and try to decide whether the features you used are right.
  • 13:20 - 13:23
    So maybe you want to toss in a quadratic function,
  • 13:23 - 13:26
    but this isn't really quadratic either. So maybe you want to
  • 13:26 - 13:28
    model this as X
  • 13:28 - 13:31
    plus X squared plus maybe some function of sin of X or something.
  • 13:31 - 13:34
    You actually sit around and fiddle with features.
  • 13:34 - 13:37
    And after a while you can probably come up with a set of features for which the model is
  • 13:37 - 13:40
    okay, but let's talk about an algorithm that
  • 13:40 - 13:47
    you can use without needing to do that.
  • 13:50 - 13:53
    So
  • 13:53 - 13:54
    if - well,
  • 13:54 - 13:56
    suppose you want to evaluate
  • 13:56 - 13:59
    your hypothesis H
  • 13:59 - 14:04
    at a certain point
  • 14:04 - 14:06
    with a certain query point, lowercase X. Okay? And
  • 14:06 - 14:07
    let's say
  • 14:07 - 14:11
    you want to know what's the predicted value of
  • 14:11 - 14:12
    Y
  • 14:12 - 14:17
    at this position of X, right? So
  • 14:17 - 14:19
    for linear regression,
  • 14:19 - 14:22
    what we were doing was we would fit
  • 14:22 - 14:25
    theta
  • 14:25 - 14:28
    to minimize
  • 14:28 - 14:30
    sum
  • 14:30 - 14:35
    over I of YI minus theta transpose XI,
  • 14:35 - 14:39
    squared,
  • 14:39 - 14:41
    and return theta
  • 14:41 - 14:46
    transpose X. Okay? So that was linear regression.
  • 14:46 - 14:50
    In contrast, in locally weighted linear regression you're going to do things slightly
  • 14:50 - 14:51
    differently. You're
  • 14:51 - 14:54
    going to look at this point X
  • 14:54 - 14:59
    and then I'm going to look in my data set and take into account
  • 14:59 - 15:03
    only the data points that are, sort of, in the little vicinity of X. Okay?
  • 15:03 - 15:07
    So we'll look at where I want to evaluate my hypothesis. I'm going to look
  • 15:07 - 15:10
    only in the vicinity of
  • 15:10 - 15:14
    this point where I want to evaluate my hypothesis,
  • 15:14 - 15:16
    and then I'm going to take,
  • 15:16 - 15:20
    let's say, just these few points,
  • 15:20 - 15:21
    and I will
  • 15:21 - 15:23
    apply linear regression
  • 15:23 - 15:26
    to fit a straight line just to this sub-set of the data. Okay? I'm
  • 15:26 - 15:30
    using this term sub-set loosely - well, let's come back to that later.
  • 15:30 - 15:32
    So we take this data set and I fit a
  • 15:32 - 15:37
    straight line to it and maybe I get a straight line like that.
  • 15:37 - 15:41
    And what I'll do is then
  • 15:41 - 15:45
    evaluate this straight line at this particular value, and
  • 15:45 - 15:47
    that will be the value I return for my algorithm.
  • 15:47 - 15:50
    I think this would be the predicted value
  • 15:50 - 15:53
    for -
  • 15:53 - 15:57
    this would be the value that my hypothesis outputs
  • 15:57 - 16:04
    in locally weighted regression. Okay? So
  • 16:05 - 16:10
    let me go ahead and formalize that. In
  • 16:10 - 16:15
    locally weighted regression, we're going to fit theta to
  • 16:15 - 16:18
    minimize
  • 16:18 - 16:25
    sum over I of W superscript I times
  • 16:27 - 16:33
    YI minus theta transpose XI, squared,
  • 16:33 - 16:37
    where these terms W superscript I are called weights.
  • 16:37 - 16:38
  • 16:38 - 16:41
    There are many possible choices for the weights; I'm just gonna write one down. So this is E to the
  • 16:41 - 16:43
    minus, XI minus
  • 16:43 - 16:46
    X, squared,
  • 16:46 - 16:49
    over
  • 16:49 - 16:55
    two. So let's look at what these weights really are, right? So notice that -
  • 16:55 - 16:58
    suppose you have a training example XI.
  • 16:58 - 17:05
    So that XI is very close to X. So this is small,
  • 17:07 - 17:11
    right? Then if XI minus X is small, so if XI minus X is close to zero, then
  • 17:11 - 17:14
    this is E to the minus zero, and E to the zero
  • 17:14 - 17:16
    is one.
  • 17:16 - 17:20
    So if XI is close to X, then
  • 17:20 - 17:21
    WI
  • 17:21 - 17:25
    will be close to one. In other words, the weight associated with the ith training
  • 17:25 - 17:27
    example will be close to one
  • 17:27 - 17:30
    if XI and X are close to each
  • 17:30 - 17:31
    other.
  • 17:31 - 17:34
    Conversely if XI minus X
  • 17:34 - 17:41
    is large
  • 17:42 - 17:43
    then - I don't
  • 17:43 - 17:48
    know, what would WI be? Zero.
  • 17:48 - 17:51
    Zero, right. Close to zero. Right.
  • 17:51 - 17:52
    So
  • 17:52 - 17:57
    if XI is very far from X then this is E to the minus of some large number
  • 17:57 - 18:02
    and E to the minus some large number will be close to zero.
  • 18:02 - 18:05
    Okay?
  • 18:05 - 18:12
    So the picture is, if I'm
  • 18:12 - 18:13
    querying
  • 18:13 - 18:17
    at a certain point X, shown on the X axis,
  • 18:17 - 18:22
    and if my data
  • 18:22 - 18:24
    set, say, looks like that,
  • 18:24 - 18:29
    then I'm going to give the points close to this a large weight and give the points
  • 18:29 - 18:33
    far away a small weight.
  • 18:33 - 18:35
    So
  • 18:35 - 18:38
    for the points that are far away,
  • 18:38 - 18:40
    WI will be close to zero.
  • 18:40 - 18:44
    And so for the points that are far away,
  • 18:44 - 18:48
    they will not contribute much at all to this summation, right? So I think this is
  • 18:48 - 18:49
    sum over I
  • 18:49 - 18:54
    of one times this quadratic term for nearby points plus zero times this quadratic term for faraway
  • 18:54 - 18:56
    points.
  • 18:56 - 18:59
    And so the effect of using this weighting is that
  • 18:59 - 19:00
    locally weighted linear regression
  • 19:00 - 19:03
    fits a set of parameters theta,
  • 19:03 - 19:05
    paying much more attention to fitting the
  • 19:05 - 19:07
    points close by
  • 19:07 - 19:11
    accurately, while ignoring the contribution from faraway points. Okay? Yeah? Why is the weighting
  • 19:11 - 19:18
    exponential [inaudible]?
  • 19:18 - 19:23
    Yeah. Let's see. So it turns out there are many other weighting functions you can use. It
  • 19:23 - 19:27
    turns out that there are definitely different communities of researchers that tend to
  • 19:27 - 19:29
    choose different choices by default.
  • 19:29 - 19:34
    There is somewhat of a literature on debating what point -
  • 19:34 - 19:36
    exactly what function to use.
  • 19:36 - 19:38
    This, sort of, exponential decay function is -
  • 19:38 - 19:41
    this happens to be a reasonably common one that seems to be a more reasonable choice on many problems,
  • 19:41 - 19:42
  • 19:42 - 19:45
    but you can actually plug in other functions as well. Did
  • 19:45 - 19:48
    I mention what [inaudible] is it at?
  • 19:48 - 19:50
    For those of you that are familiar with
  • 19:50 - 19:52
    the
  • 19:52 - 19:54
    normal distribution, or the Gaussian distribution,
  • 19:54 - 19:55
    say this -
  • 19:55 - 19:59
    this formula I've written out here cosmetically
  • 19:59 - 20:03
    looks a bit like a Gaussian distribution. Okay? But this actually has
  • 20:03 - 20:06
    absolutely nothing to do with Gaussian distribution.
  • 20:06 - 20:08
    So this
  • 20:08 - 20:12
    is not saying that XI is Gaussian or whatever. There's no such
  • 20:12 - 20:13
    interpretation.
  • 20:13 - 20:17
    This is just a convenient function that happens to be a bell-shaped function, but
  • 20:17 - 20:21
    don't endow this with any Gaussian semantics. Okay?
  • 20:21 - 20:23
    So, in fact - well,
  • 20:23 - 20:29
    if you remember the familiar bell-shaped Gaussian, again, it's just
  • 20:29 - 20:32
    the way of associating weights with these points is that if you
  • 20:32 - 20:34
    imagine
  • 20:34 - 20:36
    putting a bell-shaped bump
  • 20:36 - 20:40
    centered around the position where you want to evaluate your hypothesis H,
  • 20:40 - 20:43
    then what we're saying is that this point here I'll give
  • 20:43 - 20:44
    a weight that's proportional
  • 20:44 - 20:49
    to the height of the Gaussian - excuse me, to the height of the bell-shaped function
  • 20:49 - 20:51
    evaluated at this point. And the
  • 20:51 - 20:53
    weight given to this point will be,
  • 20:53 - 20:55
    to this training example,
  • 20:55 - 20:57
    proportional to that height
  • 20:57 - 20:59
    and so on. Okay?
  • 20:59 - 21:01
    And so training examples that are really far away
  • 21:01 - 21:05
    get a very small weight.
  • 21:05 - 21:07
  • 21:07 - 21:11
    One last small generalization to this is that
  • 21:11 - 21:12
    normally
  • 21:12 - 21:15
    there's one other parameter to this
  • 21:15 - 21:18
    algorithm, which I'll denote as tau.
  • 21:18 - 21:21
    Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian.
  • 21:21 - 21:25
    This is just a convenient form of function.
  • 21:25 - 21:27
    This parameter tau
  • 21:27 - 21:33
    is called the bandwidth
  • 21:33 - 21:34
    parameter
  • 21:34 - 21:41
    and
  • 21:41 - 21:48
    informally it controls how fast the weights fall off with distance. Okay? So let me just
  • 21:48 - 21:49
    copy my diagram from
  • 21:49 - 21:54
    the other side, I guess.
  • 21:54 - 21:56
    So if
  • 21:56 - 21:58
    tau is very small,
  • 21:58 - 22:00
  • 22:00 - 22:05
    if that's a query X, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape,
  • 22:05 - 22:08
    so that the weights of the points that are far away fall off rapidly.
  • 22:08 - 22:10
    Whereas if
  • 22:10 - 22:17
    tau
  • 22:17 - 22:21
    is large then you'd end
  • 22:21 - 22:25
    up choosing a weighting function that falls off relatively slowly with distance from your
  • 22:25 - 22:28
    query.
  • 22:28 - 22:31
    Okay?
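    Putting the pieces together, here's a sketch of locally weighted regression at a single query point, on made-up data. The weighted normal equations, theta = (X^T W X)^{-1} X^T W y, are a standard way of solving this weighted least squares problem, though the lecture doesn't derive them here:

        import numpy as np

        def lwr_predict(X, y, x_query, tau):
            # Bell-shaped weights: w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)),
            # computed on the non-intercept feature. tau is the bandwidth.
            w = np.exp(-((X[:, 1] - x_query[1]) ** 2) / (2.0 * tau ** 2))
            W = np.diag(w)
            # Weighted least squares fit, refit from scratch for every query:
            # theta = (X^T W X)^{-1} X^T W y.
            theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
            return theta @ x_query

        # Made-up wiggly 1-D data set, for illustration only.
        x = np.linspace(0.0, 10.0, 50)
        y = np.sin(x) + 0.1 * np.random.randn(50)
        X = np.column_stack([np.ones_like(x), x])

        # Small tau: weights fall off fast, so the fit hugs nearby points.
        # Large tau: weights fall off slowly, approaching ordinary regression.
        print(lwr_predict(X, y, np.array([1.0, 5.0]), tau=0.5))
        print(lwr_predict(X, y, np.array([1.0, 5.0]), tau=10.0))
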
  • 22:31 - 22:33
    So
  • 22:33 - 22:40
    I hope you can, therefore, see that if you
  • 22:42 - 22:44
    apply locally weighted linear regression to a data set that looks like
  • 22:44 - 22:46
    this,
  • 22:46 - 22:47
    then to
  • 22:47 - 22:50
    ask what your hypothesis output is at a point like this you end up having a straight line
  • 22:50 - 22:51
    making
  • 22:51 - 22:53
    that prediction. To
  • 22:53 - 22:55
    ask what kind of class this [inaudible] at
  • 22:55 - 22:56
    that value
  • 22:56 - 22:59
    you put a straight line there
  • 22:59 - 23:00
    and you predict that value.
  • 23:00 - 23:04
    It turns out that every time you want to evaluate your hypothesis, every time you
  • 23:04 - 23:07
    ask your learning algorithm to make a prediction for
  • 23:07 - 23:09
    how much a new house costs or whatever,
  • 23:09 - 23:10
  • 23:10 - 23:14
    you need to run a new fitting procedure and then
  • 23:14 - 23:16
    evaluate this line that you fit
  • 23:16 - 23:18
    just at the position of
  • 23:18 - 23:22
    the value of X. So the position of the query where you're trying to make a prediction. Okay? But if you
  • 23:22 - 23:26
    do this for every point along the X-axis then
  • 23:26 - 23:27
    you find that
  • 23:27 - 23:30
    locally weighted regression is able to trace out this, sort of, very
  • 23:30 - 23:32
    non-linear curve
  • 23:32 - 23:38
    for a data set like this. Okay? So
  • 23:38 - 23:42
    in the problem set we're actually gonna let you play around more with this algorithm. So I won't say
  • 23:42 - 23:44
    too much more about it here.
  • 23:44 - 23:49
    But to finally move on to the next topic let me check the questions you have. Yeah? It seems like you
  • 23:49 - 23:50
    still have
  • 23:50 - 23:51
    the same problem of overfitting
  • 23:51 - 23:55
    and underfitting, like when
  • 23:55 - 23:58
    you choose your tau. Like you make it too small
  • 23:58 - 23:59
    in your -
  • 23:59 - 24:02
    Yes, absolutely. Yes. So locally
  • 24:02 - 24:04
    weighted regression can run into
  • 24:04 - 24:07
    - locally weighted regression is not a panacea for the problem of
  • 24:07 - 24:10
    overfitting or underfitting.
  • 24:10 - 24:15
    You can still run into the same problems with locally weighted regression. What you just
  • 24:15 - 24:17
    said about
  • 24:17 - 24:20
    - and so some of these things I'll leave you to discover for yourself in the
  • 24:20 - 24:21
    homework problem.
  • 24:21 - 24:25
    You'll actually see what you just mentioned.
  • 24:25 - 24:29
    Yeah? It almost seems like
  • 24:29 - 24:31
    you're
  • 24:31 - 24:37
    not even thoroughly [inaudible] with this locally weighted, you had all the
  • 24:37 - 24:42
    data that you originally had anyway. Yeah. I'm just trying to think of [inaudible] the original data points. Right. So the question is, sort of, this - it's almost as if you're not building a
  • 24:42 - 24:45
    model, because you need the entire data set.
  • 24:45 - 24:45
    And
  • 24:45 - 24:50
    the other way of saying that is that this is a non-parametric learning
  • 24:50 - 24:51
    algorithm. So this
  • 24:51 - 24:52
  • 24:52 - 24:54
    - I don't know. I won't
  • 24:54 - 24:58
    debate whether, you know, are we really building a model or not. But
  • 24:58 - 25:00
    this is a perfectly fine - so
  • 25:00 - 25:02
    if I think
  • 25:02 - 25:04
    when you write a code implementing
  • 25:04 - 25:07
    locally weighted linear regression on the
  • 25:07 - 25:09
    data set I think of that code
  • 25:09 - 25:13
    as a whole - as building your model.
  • 25:13 - 25:15
    So it actually uses -
  • 25:15 - 25:19
    we've actually used this quite successfully to model, sort of, the dynamics of this
  • 25:19 - 25:23
    autonomous helicopter we have. Yeah? Can this algorithm learn the weights
  • 25:23 - 25:24
    based
  • 25:24 - 25:28
    on
  • 25:28 - 25:33
    the data? Learn what weights? Oh, the weights WI. Instead of using [inaudible]. I see,
  • 25:33 - 25:34
    yes. So
  • 25:34 - 25:37
    it turns out there are a few things you can do. One thing that is quite common is
  • 25:37 - 25:40
    how to choose this bandwidth parameter tau,
  • 25:40 - 25:46
    right? Using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question. I used [inaudible]
  • 25:46 - 25:53
    Gaussian
  • 25:56 - 25:58
    sometimes if you [inaudible] Gaussian and then - Oh, I guess. Let's
  • 25:58 - 26:00
    see. Boy.
  • 26:00 - 26:02
    The weights are not
  • 26:02 - 26:06
    random variables and it's not, for the purpose of this algorithm, it is not useful to
  • 26:06 - 26:09
    endow it with probabilistic semantics. So you could
  • 26:09 - 26:15
    choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In
  • 26:15 - 26:16
    fact,
  • 26:16 - 26:17
    it turns out that
  • 26:17 - 26:22
    I happened to choose this, sort of, bell-shaped function
  • 26:22 - 26:23
    to define my weights.
  • 26:23 - 26:27
    It's actually fine to choose a function that doesn't even integrate to one, that integrates to
  • 26:27 - 26:30
    infinity, say, as your weighting function. So
  • 26:30 - 26:32
    in that sense,
  • 26:32 - 26:36
    I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful. Especially
  • 26:36 - 26:42
    since you use other functions that integrate to infinity and don't integrate to one. Okay?
  • 26:42 - 26:44
    That's the last question and let's move on. Assume
  • 26:44 - 26:51
    that we have a very huge [inaudible], for example. A very huge set of houses and want to
  • 26:51 - 26:54
    predict the price for each house
  • 26:54 - 26:58
    and so should the end result for each input - I'm
  • 26:58 - 27:00
    seeing this very constantly for
  • 27:00 - 27:02
    - Yes, you're right. So
  • 27:02 - 27:05
    because locally weighted regression is a
  • 27:05 - 27:07
    non-parametric algorithm
  • 27:07 - 27:11
    every time you make a prediction you need to fit theta to your entire training set again.
  • 27:11 - 27:13
    So you're actually right.
  • 27:13 - 27:18
    If you have a very large training set then this is a somewhat expensive algorithm to
  • 27:18 - 27:20
    use. Because every time you want to make a prediction
  • 27:20 - 27:21
    you need to fit
  • 27:21 - 27:23
    a straight line
  • 27:23 - 27:26
    to a huge data set again.
  • 27:26 - 27:29
    Turns out there are algorithms that - turns
  • 27:29 - 27:32
    out there are ways to make this much more efficient for large data sets as well.
  • 27:32 - 27:35
    So I don't want to talk about that here. If you're interested, look
  • 27:35 - 27:36
    up the work of Andrew Moore
  • 27:36 - 27:38
    on KD-trees. He,
  • 27:38 - 27:40
    sort
  • 27:40 - 27:42
    of, figured out ways to fit these models much more efficiently. That's not something I want to go
  • 27:42 - 27:45
    into today. Okay? Let me
  • 27:45 - 27:52
    move on. Let's take more questions later. So, okay.
  • 28:01 - 28:06
    So that's locally weighted regression.
  • 28:06 - 28:07
    Remember the
  • 28:07 - 28:11
    outline I had, I guess, at the beginning of this lecture. What I want to do now is
  • 28:11 - 28:11
  • 28:11 - 28:16
    talk about a probabilistic interpretation of linear regression, all right?
  • 28:16 - 28:19
    And in particular of the - it'll be this probabilistic interpretation
  • 28:19 - 28:22
    that lets us move on to talk
  • 28:22 - 28:29
    about logistic regression, which will be our first classification algorithm. So
  • 28:40 - 28:43
    let's put aside locally weighted regression for now. We'll just talk about
  • 28:43 - 28:47
    ordinary unweighted linear regression. Let's
  • 28:47 - 28:51
    ask the question of why least squares, right? Of all the things
  • 28:51 - 28:52
    we could optimize
  • 28:52 - 28:56
    how do we come up with this criterion of minimizing the square of the error
  • 28:56 - 28:59
    between the predictions of the hypothesis
  • 28:59 - 29:02
    and the observed values Y? So why not minimize
  • 29:02 - 29:08
    the absolute value of the errors, or the errors to the power of four or something?
  • 29:08 - 29:10
    What I'm going to do now is present
  • 29:10 - 29:12
    one set of assumptions that
  • 29:12 - 29:15
    will serve to "justify"
  • 29:15 - 29:19
    why we're minimizing the sum of squared errors. Okay?
  • 29:19 - 29:22
    It turns out that there are many assumptions that are sufficient
  • 29:22 - 29:26
    to justify why we do least squares and this is just one of them.
  • 29:26 - 29:27
    So
  • 29:27 - 29:30
    I'm going to present one set of assumptions under which least
  • 29:30 - 29:32
    squares regression makes sense,
  • 29:32 - 29:35
    but this is not the only set of assumptions. So even if the assumptions
  • 29:35 - 29:39
    I describe don't hold, least squares actually still makes sense in many
  • 29:39 - 29:40
    circumstances. But this
  • 29:40 - 29:41
    will, sort of, help, you know,
  • 29:41 - 29:48
    give one rationalization, like, one reason for doing least squares regression.
  • 29:48 - 29:51
    And, in particular, what I'm going to do is
  • 29:51 - 29:58
    endow the least squares model with probabilistic semantics. So
  • 29:58 - 30:02
    let's assume in our example of predicting housing prices,
  • 30:02 - 30:03
    that
  • 30:03 - 30:08
    the price of the house it's sold for
  • 30:08 - 30:12
    is going to be some linear function of the features,
  • 30:12 - 30:14
    plus
  • 30:14 - 30:17
    some term epsilon I. Okay?
  • 30:17 - 30:20
    And epsilon I will be
  • 30:20 - 30:22
    our error term.
  • 30:22 - 30:23
    You can think of
  • 30:23 - 30:28
    the error term as capturing unmodeled effects, like, that maybe
  • 30:28 - 30:31
    there's some other features of a house, like, maybe how many fireplaces it has or whether
  • 30:31 - 30:33
    there's a garden or whatever,
  • 30:33 - 30:34
    that there
  • 30:34 - 30:37
    are additional features that we just failed to capture, or
  • 30:37 - 30:40
    you can think of epsilon as random noise.
  • 30:40 - 30:42
    Epsilon is our error term that captures both these
  • 30:42 - 30:46
    unmodeled effects. Just things we forgot to model. Maybe the function isn't quite
  • 30:46 - 30:48
    linear or something.
  • 30:48 - 30:50
  • 30:50 - 30:53
    As well as random noise, like maybe
  • 30:53 - 30:57
    that day the seller was in a really bad mood and so he just
  • 30:57 - 31:00
    refused to sell at a reasonable price or something.
  • 31:00 - 31:03
    And now
  • 31:03 - 31:08
    I will assume that the errors have a
  • 31:08 - 31:09
    probabilistic
  • 31:09 - 31:13
    - have a probability distribution. I'll assume that the errors epsilon I
  • 31:13 - 31:15
    are distributed
  • 31:15 - 31:16
    - just, this tilde
  • 31:16 - 31:19
    denotes epsilon I
  • 31:19 - 31:23
    is distributed according to a probability distribution. That's
  • 31:23 - 31:27
    a Gaussian distribution with mean zero
  • 31:27 - 31:28
    and variance sigma squared. Okay? So
  • 31:28 - 31:31
    let me just say - this script N here,
  • 31:31 - 31:35
    n stands for normal, right? To denote a normal distribution, also known as the
  • 31:35 - 31:36
    Gaussian distribution,
  • 31:36 - 31:37
    with mean
  • 31:37 - 31:41
    zero and variance sigma squared.
  • 31:41 - 31:47
    Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you.
  • 31:47 - 31:49
    Great. Almost everyone.
  • 31:49 - 31:51
    So,
  • 31:51 - 31:55
    in other words, the density for Gaussian is what you've seen before.
  • 31:55 - 31:58
    The density for epsilon I would be
  • 31:58 - 32:00
    one over root 2 pi sigma, E to the
  • 32:00 - 32:02
    negative,
  • 32:02 - 32:04
    epsilon I
  • 32:04 - 32:06
  • 32:06 - 32:09
    squared over 2 sigma squared, right?
  • 32:09 - 32:15
    And the
  • 32:15 - 32:22
    density of our epsilon I will be this bell-shaped curve
  • 32:23 - 32:26
    with one standard deviation
  • 32:26 - 32:31
    being, sort of, sigma. Okay? This
  • 32:31 - 32:32
    is
  • 32:32 - 32:34
    the form for that bell-shaped curve.
  • 32:34 - 32:41
    So, let's see. I can erase that. Can I
  • 32:41 - 32:47
    erase the board?
  • 32:47 - 32:54
  • 32:59 - 33:05
    So this implies that the
  • 33:05 - 33:10
    probability distribution of a price of a house
  • 33:10 - 33:12
    given XI
  • 33:12 - 33:14
    and the parameters theta,
  • 33:14 - 33:21
    that this is going to be Gaussian
  • 33:30 - 33:35
    with that density. Okay?
  • 33:35 - 33:38
    In other words, this is the same as saying that the
  • 33:38 - 33:41
    price of
  • 33:41 - 33:47
    a house given the features of the house and my parameters theta,
  • 33:47 - 33:51
    this is going to be a random variable
  • 33:51 - 33:54
    that's distributed Gaussian with
  • 33:54 - 33:57
    mean theta transpose XI
  • 33:57 - 33:58
    and variance sigma squared.
  • 33:58 - 34:01
    Right? Because we imagine that
  • 34:01 - 34:05
    the way the housing prices are generated is that the price of a house
  • 34:05 - 34:09
    is equal to theta transpose XI and then plus some random Gaussian noise with variance sigma
  • 34:09 - 34:12
    squared. So
  • 34:12 - 34:14
    the price of a house is going to
  • 34:14 - 34:16
    have mean theta transpose XI and, again, variance sigma squared, right? Does
  • 34:16 - 34:20
    this make
  • 34:20 - 34:24
    sense? Raise your hand if this makes sense. Yeah,
  • 34:24 - 34:31
    okay. Lots of you. One
  • 34:38 - 34:44
    point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume
  • 34:44 - 34:47
    here the error is a
  • 34:47 - 34:50
    Gaussian?
  • 34:50 - 34:54
    Right. So, boy.
  • 34:54 - 34:56
    Why do I assume the error is Gaussian?
  • 34:56 - 35:01
    Two reasons, right? One is that it turns out to be mathematically convenient to do so
  • 35:01 - 35:03
    and the other is, I don't
  • 35:03 - 35:07
    know, I can also mumble about justifications, such as appealing to the
  • 35:07 - 35:09
    central limit theorem. It turns out that if you,
  • 35:09 - 35:12
    for the vast majority of problems, if you apply a linear regression model like
  • 35:12 - 35:16
    this and try to measure the distribution of the errors,
  • 35:16 - 35:20
    not all the time, but very often you find that the errors really are Gaussian. That
  • 35:20 - 35:22
    this Gaussian model is a good
  • 35:22 - 35:24
    assumption for the error
  • 35:24 - 35:26
    in regression problems like these.
  • 35:26 - 35:29
    Some of you may have heard of the central limit theorem, which says that
  • 35:29 - 35:33
    the sum of many independent random variables will tend towards a Gaussian.
  • 35:33 - 35:37
    So if the error is caused by many effects, like the mood of the
  • 35:37 - 35:39
    seller, the mood of the buyer,
  • 35:39 - 35:43
    some other features that we miss, whether the place has a garden or not, and
  • 35:43 - 35:46
    if all of these effects are independent, then
  • 35:46 - 35:49
    by the central limit theorem you might be inclined to believe that
  • 35:49 - 35:52
    the sum of all these effects will be approximately Gaussian. If
  • 35:52 - 35:54
    in practice, I guess, the
  • 35:54 - 35:58
    two real answers are that, 1.) In practice this is actually a reasonably accurate
  • 35:58 - 36:05
    assumption, and 2.) It turns out to be mathematically convenient to do so. Okay? Yeah? It seems like we're
  • 36:07 - 36:08
    saying
  • 36:08 - 36:11
    if we assume that the error around our model
  • 36:11 - 36:13
    has zero mean, then
  • 36:13 - 36:16
    the error is centered around our model. Which
  • 36:16 - 36:18
    it seems almost like we're trying to assume
  • 36:18 - 36:20
    what we're trying to prove.
  • 36:20 - 36:24
    That's the [inaudible] but, yes. You are assuming that
  • 36:24 - 36:28
    the error has zero mean. Which is, yeah, right.
  • 36:28 - 36:31
    I think later this quarter we get to some of the other
  • 36:31 - 36:35
    things, but for now just think of this as a mathematically - it's actually not
  • 36:35 - 36:38
    an unreasonable assumption.
  • 36:38 - 36:40
    I guess,
  • 36:40 - 36:45
    in machine learning all the assumptions we make are almost never
  • 36:45 - 36:49
    true in the absolute sense, right? Because, for instance,
  • 36:49 - 36:54
    housing prices are priced in dollars and cents, so the error will be -
  • 36:54 - 36:55
    errors
  • 36:55 - 36:59
    in prices are not continuous-valued random variables, because
  • 36:59 - 37:02
    houses can only be priced at a certain number of dollars and a certain number of
  • 37:02 - 37:06
    cents and you never have fractions of cents in housing prices.
  • 37:06 - 37:10
    Whereas a Gaussian random variable would. So in that sense, assumptions we make are never
  • 37:10 - 37:15
    "absolutely true," but for practical purposes this is a
  • 37:15 - 37:19
    accurate enough assumption that it'll be useful to
  • 37:19 - 37:24
    make. Okay? I think in a week or two, we'll actually come back to
  • 37:24 - 37:27
    say more about the assumptions we make and when they help our learning
  • 37:27 - 37:29
    algorithms and when they hurt our learning
  • 37:29 - 37:31
    algorithms. We'll say a bit more about it
  • 37:31 - 37:38
    when we talk about generative and discriminative learning algorithms, like, in a week or
  • 37:40 - 37:45
    two. Okay? So let's point out one bit of notation, which is that when I
  • 37:45 - 37:48
    wrote this down I actually wrote P of YI given XI and then semicolon
  • 37:48 - 37:50
    theta
  • 37:50 - 37:55
    and I'm going to use this notation when we are not thinking of theta as a
  • 37:55 - 37:57
    random variable. So
  • 37:57 - 38:01
    in statistics, though, sometimes it's called the frequentist's point of view,
  • 38:01 - 38:02
    where you think of there as being some,
  • 38:02 - 38:06
    sort of, true value of theta that's out there that's generating the data say,
  • 38:06 - 38:08
    but
  • 38:08 - 38:11
    we don't know what theta is, but theta is not a random
  • 38:11 - 38:13
    variable, right? So it's not like there's some
  • 38:13 - 38:16
    random value of theta out there. It's that theta is -
  • 38:16 - 38:19
    there's some true value of theta out there. It's just that we don't
  • 38:19 - 38:22
    know what the true value of theta is. So
  • 38:22 - 38:26
    if theta is not a random variable, then I'm
  • 38:26 - 38:27
    going to avoid
  • 38:27 - 38:31
    writing P of YI given XI comma theta, because this would mean
  • 38:31 - 38:35
    the probability of YI conditioned on X and theta,
  • 38:35 - 38:40
    and you can only condition on random variables.
  • 38:40 - 38:43
    So at this part of the class where we're taking
  • 38:43 - 38:46
    sort of the frequentist viewpoint rather than the Bayesian viewpoint, in this part of the class
  • 38:46 - 38:49
    we're thinking of theta not as a random variable, but just as something
  • 38:49 - 38:50
    we're trying to estimate
  • 38:50 - 38:53
    and use the semicolon
  • 38:53 - 38:58
    notation. So the way to read this is this is the probability of YI given XI
  • 38:58 - 39:00
    and parameterized by theta. Okay? So
  • 39:00 - 39:04
    you read the semicolon as parameterized by.
  • 39:04 - 39:08
    And in the same way here, I'll say YI given XI parameterized by
  • 39:08 - 39:09
    theta is distributed
  • 39:09 - 39:16
    Gaussian with that. All right.
  • 39:36 - 39:38
    So we're gonna make one more assumption.
  • 39:38 - 39:41
    Let's assume that the
  • 39:41 - 39:44
    error terms are
  • 39:44 - 39:48
  • 39:48 - 39:51
    IID, okay?
  • 39:51 - 39:54
    Which stands for Independently and Identically Distributed. So I'm
  • 39:54 - 39:57
    going to assume that the error terms are
  • 39:57 - 40:04
    independent of each other,
  • 40:04 - 40:11
    right?
  • 40:11 - 40:15
    The identically distributed part just means that I'm assuming they all come from the
  • 40:15 - 40:18
    same Gaussian distribution, with the same variance,
  • 40:18 - 40:22
    but the more important part of this is that I'm assuming that the epsilon I's are
  • 40:22 - 40:26
    independent of each other.
  • 40:26 - 40:29
    Now, let's talk about how to fit a model.
  • 40:29 - 40:32
    The probability of Y given
  • 40:32 - 40:36
    X parameterized by theta - I'm actually going to give
  • 40:36 - 40:39
    this another name. I'm going to write this down
  • 40:39 - 40:43
    and we'll call this the likelihood of theta
  • 40:43 - 40:46
    as the probability of Y given X parameterized by theta.
  • 40:46 - 40:49
    And so this is going to be
  • 40:49 - 40:50
    the product
  • 40:50 - 40:57
    over my training set like that.
  • 41:00 - 41:04
    Which is, in turn, going to be a product of
  • 41:04 - 41:11
    those Gaussian densities that I wrote down just now,
  • 41:11 - 41:15
    right?
  • 41:15 - 41:20
    Okay?
  • 41:20 - 41:25
    Then in terms of notation, I guess, I define this term here to be the
  • 41:25 - 41:26
    likelihood of theta.
  • 41:26 - 41:30
    And the likelihood of theta is just the probability of the data Y, right? Given X
  • 41:30 - 41:33
    and parameterized by theta.
  • 41:33 - 41:37
    The terms likelihood and probability are often confused.
  • 41:37 - 41:41
    So the likelihood of theta is the same thing as the
  • 41:41 - 41:46
    probability of the data you saw. So likelihood and probability are, sort of, the same thing.
  • 41:46 - 41:48
    Except that when I use the term likelihood
  • 41:48 - 41:52
    I'm trying to emphasize that I'm taking this thing
  • 41:52 - 41:55
    and viewing it as a function of theta.
  • 41:55 - 41:57
    Okay?
  • 41:57 - 42:01
    So likelihood and probability are really the same thing, except that
  • 42:01 - 42:02
    when I want to view this thing
  • 42:02 - 42:06
    as a function of theta, holding X and Y fixed, it's
  • 42:06 - 42:10
    then called the likelihood. Okay? So
  • 42:10 - 42:13
    hopefully you'll hear me say the likelihood of the parameters and the probability
  • 42:13 - 42:15
    of the data,
  • 42:15 - 42:18
    right? Rather than the likelihood of the data or probability of parameters. So try
  • 42:18 - 42:25
    to be consistent in that terminology.
  • 42:31 - 42:32
    So given that
  • 42:32 - 42:34
    the probability of the data is this and this
  • 42:34 - 42:37
    is also the likelihood of the parameters,
  • 42:37 - 42:38
    how do you estimate
  • 42:38 - 42:40
    the parameters theta? So given a training set,
  • 42:40 - 42:46
    what parameters theta do you want to choose for your model?
  • 42:46 - 42:53
  • 42:59 - 43:02
    Well, the principle of maximum likelihood
  • 43:02 - 43:03
    estimation
  • 43:03 - 43:09
    says that,
  • 43:09 - 43:13
    right? You can choose the value of theta that makes the data
  • 43:13 - 43:20
    as probable as possible, right? So choose theta
  • 43:20 - 43:27
    to maximize the likelihood. Or
  • 43:27 - 43:30
    in other words choose the parameters that make
  • 43:30 - 43:33
    the data as probable as possible, right? So this is
  • 43:33 - 43:37
    maximum likelihood estimation. So it's: choose the parameters that make
  • 43:37 - 43:40
    it as likely, as probable, as possible
  • 43:40 - 43:43
    for me to have seen the data I just
  • 43:43 - 43:47
    did. So
  • 43:47 - 43:53
    for mathematical convenience, let me define lower case l of theta.
  • 43:53 - 43:58
    This is called the log likelihood function and it's just log
  • 43:58 - 44:01
    of capital L of theta.
  • 44:01 - 44:06
    So this is log of a product over I
  • 44:06 - 44:10
    of one over root two pi
  • 44:10 - 44:14
    sigma, E to that. I won't bother to write out what's in the exponent for now. It's just this expression
  • 44:14 - 44:17
    from the previous board.
  • 44:17 - 44:24
    The log of a product is the same as the sum of the logs, right? So it's a sum
  • 44:25 - 44:32
    of
  • 44:35 - 44:38
    the logs of - which simplifies to m times log of
  • 44:38 - 44:39
    one over root
  • 44:39 - 44:44
    two pi
  • 44:44 - 44:44
    sigma
  • 44:44 - 44:47
    plus
  • 44:47 - 44:52
    and then log and exponentiation cancel each other, right? So log of E to
  • 44:52 - 44:53
    something is just
  • 44:53 - 45:00
    whatever's inside the exponent. So, you know what,
  • 45:01 - 45:08
    let me write this on the
  • 45:12 - 45:16
    next
  • 45:16 - 45:21
    board.
  • 45:21 - 45:28
    Okay.
  • 45:33 - 45:40
  • 45:46 - 45:51
    So
  • 45:51 - 45:53
    maximizing the likelihood or maximizing the log
  • 45:53 - 45:58
    likelihood is the same
  • 45:58 - 46:03
    as minimizing
  • 46:03 - 46:10
    that term over there. Well, you get it, right?
  • 46:22 - 46:26
    Because there's a minus sign. So maximizing this because of the minus sign is the same as
  • 46:26 - 46:27
    minimizing
  • 46:27 - 46:33
    this as a function of theta. And
  • 46:33 - 46:36
    this is, of course, just
  • 46:36 - 46:43
    the same quadratic cost function that we had last time, J of theta,
  • 46:43 - 46:44
    right? So what
  • 46:44 - 46:48
    we've just shown is that the ordinary least squares algorithm,
  • 46:48 - 46:51
    that we worked on the previous lecture,
  • 46:51 - 46:55
    is just maximum likelihood
  • 46:55 - 46:56
    assuming
  • 46:56 - 46:58
    this probabilistic model,
  • 46:58 - 47:05
    assuming IID Gaussian errors on our data.
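    Here is a small numeric check of this equivalence, on made-up data: minimizing the negative Gaussian log-likelihood over theta picks out (up to grid resolution) the same theta as the normal-equations least squares solution, for any fixed sigma.

        import numpy as np

        # Made-up data: linear trend plus IID Gaussian noise.
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 10.0, size=30)
        y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=30)
        X = np.column_stack([np.ones_like(x), x])

        def neg_log_likelihood(theta, sigma=1.0):
            # -log L(theta) under the IID Gaussian error model. The first,
            # sigma-only term doesn't depend on theta, so any sigma > 0
            # gives the same minimizer: the least squares theta.
            r = y - X @ theta
            return 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) \
                + np.sum(r ** 2) / (2 * sigma ** 2)

        # Least squares via the normal equations.
        theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

        # Crude grid search for the maximum likelihood theta.
        grid = [np.array([a, b]) for a in np.linspace(1.0, 3.0, 81)
                for b in np.linspace(0.0, 1.0, 81)]
        theta_ml = min(grid, key=neg_log_likelihood)
        print(theta_ls, theta_ml)
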
  • 47:06 - 47:10
  • 47:10 - 47:11
    Okay? One thing that we'll
  • 47:11 - 47:12
    actually use
  • 47:12 - 47:14
    in the next lecture is that
  • 47:14 - 47:17
    the value of sigma squared doesn't matter,
  • 47:17 - 47:18
    right? That somehow
  • 47:18 - 47:21
    no matter what the value of sigma squared is, I mean, sigma squared has to be a positive number. It's a
  • 47:21 - 47:22
    variance
  • 47:22 - 47:26
    of a Gaussian. So that no matter what sigma
  • 47:26 - 47:30
    squared is since it's a positive number the value of theta we end up with
  • 47:30 - 47:34
    will be the same, right? So because
  • 47:34 - 47:35
    minimizing this
  • 47:35 - 47:39
    you get the same value of theta no matter what sigma squared is. So it's as if
  • 47:39 - 47:43
    in this model the value of sigma squared doesn't really matter.
  • 47:43 - 47:46
    Just remember that for the next lecture. We'll come back
  • 47:46 - 47:48
    to this again.
  • 47:48 - 47:51
    Any questions about this?
  • 47:51 - 47:53
    Actually, let me clean up
  • 47:53 - 48:00
    another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah? I think here you try to measure the likelihood of your choice of
  • 48:44 - 48:51
    theta by
  • 48:51 - 48:52
    a fraction
  • 48:52 - 48:54
    of error,
  • 48:54 - 48:55
    but I think it's that you
  • 48:55 - 48:57
    measure
  • 48:57 - 49:01
    because it depends on the family of theta too, for example. If
  • 49:01 - 49:05
  • 49:05 - 49:09
    you have a lot of parameters [inaudible] or fitting in? Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think
  • 49:09 - 49:13
    let's - the things you're mentioning are
  • 49:13 - 49:15
    maybe deeper questions about
  • 49:15 - 49:18
    learning algorithms that we'll just come back to later, so don't really want to get into
  • 49:18 - 49:19
    that right
  • 49:19 - 49:22
    now. Any more
  • 49:22 - 49:29
    questions? Okay. So
  • 49:33 - 49:39
    this endows linear regression with a probabilistic interpretation.
  • 49:39 - 49:43
    I'm actually going to use this probabil - use this, sort of, probabilistic
  • 49:43 - 49:44
    interpretation
  • 49:44 - 49:46
    in order to derive our next learning algorithm,
  • 49:46 - 49:50
    which will be our first classification algorithm. Okay?
  • 49:50 - 49:54
    So
  • 49:54 - 49:58
    you'll recall that I said that regression problems are where the variable Y
  • 49:58 - 50:01
    that you're trying to predict is continuous-valued.
  • 50:01 - 50:04
    Now I'm actually gonna talk about our first classification problem,
  • 50:04 - 50:08
    where the value Y you're trying to predict
  • 50:08 - 50:11
    will be discrete-valued. It can take on only a small number of discrete values
  • 50:11 - 50:14
    and in this case I'll talk about binary classification
  • 50:14 - 50:16
    where
  • 50:16 - 50:19
    Y takes on only two values, right? So
  • 50:19 - 50:22
    you come up with classification problems if you're trying to do,
  • 50:22 - 50:26
    say, a medical diagnosis and try to decide based on some features
  • 50:26 - 50:30
    that the patient has a disease or does not have a disease.
  • 50:30 - 50:34
    Or if in the housing example, maybe you're trying to decide will this house sell in the
  • 50:34 - 50:38
    next six months or not and the answer is either yes or no. It'll either be sold in the
  • 50:38 - 50:41
    next six months or it won't be.
  • 50:41 - 50:45
    Another standard example: if you want to build a spam filter. Is this e-mail spam
  • 50:45 - 50:51
    or not? It's yes or no. Or, you know, some of my colleagues work on predicting
  • 50:51 - 50:55
    whether a computer system will crash. So you have a learning algorithm to predict will
  • 50:55 - 50:59
    this computing cluster crash over the next 24 hours? And, again, it's a yes
  • 50:59 - 51:06
    or no answer. So
  • 51:06 - 51:08
    there's X, there's Y.
  • 51:08 - 51:14
    And in a classification problem
  • 51:14 - 51:15
    Y takes on
  • 51:15 - 51:19
    two values, zero and one. That's it in binary classification.
  • 51:19 - 51:22
    So what can you do? Well, one thing you could do is
  • 51:22 - 51:25
    take linear regression, as we've described it so far, and apply it to this problem,
  • 51:25 - 51:26
    right? So you,
  • 51:26 - 51:30
    you know, given this data set you can fit a straight line to it. Maybe
  • 51:30 - 51:32
    you get that straight line, right?
  • 51:32 - 51:32
    But
  • 51:32 - 51:37
    this data set I've drawn, right? This is an amazingly easy classification problem. It's
  • 51:37 - 51:38
    pretty obvious
  • 51:38 - 51:41
    to all of us that, right? The relationship between X and Y is -
  • 51:41 - 51:48
    well, you just look at a value around here, and to the right of it Y is one, to
  • 51:48 - 51:52
    the left Y is zero. So you apply linear regression to this data set and you get a reasonable fit and you can then
  • 51:52 - 51:54
    maybe take your linear regression
  • 51:54 - 51:56
    hypothesis to this straight line
  • 51:56 - 51:58
    and threshold it at 0.5.
  • 51:58 - 52:02
    If you do that you'll certainly get the right answer. You predict that
  • 52:02 - 52:03
  • 52:03 - 52:04
    if X is to the right of, sort
  • 52:04 - 52:06
    of, the mid-point here
  • 52:06 - 52:12
    then Y is one, and if it's to the left of that mid-point then Y is zero.
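    Here is a sketch of this setup, and of the failure mode described just below, on a made-up 1-D data set: one extra, entirely unsurprising positive example far to the right moves the 0.5 threshold, even though it adds no real information.

        import numpy as np

        # Made-up binary labels: small x -> 0, large x -> 1.
        x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
        y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

        def boundary_from_regression(x, y):
            # Ordinary least squares fit, then threshold h(x) at 0.5.
            X = np.column_stack([np.ones_like(x), x])
            theta = np.linalg.solve(X.T @ X, X.T @ y)
            # Decision boundary: theta0 + theta1 * x = 0.5.
            return (0.5 - theta[0]) / theta[1]

        print(boundary_from_regression(x, y))

        # Append one far-away positive example; the boundary shifts.
        print(boundary_from_regression(np.append(x, 30.0), np.append(y, 1.0)))
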
  • 52:12 - 52:16
    So some people actually do this. Apply linear regression to classification problems
  • 52:16 - 52:18
    and sometimes it'll
  • 52:18 - 52:19
    work okay,
  • 52:19 - 52:22
    but in general it's actually a pretty bad idea to
  • 52:22 - 52:26
    apply linear regression to
  • 52:26 - 52:32
    classification problems like these and here's why. Let's say I
  • 52:32 - 52:34
    change my training set
  • 52:34 - 52:40
    by giving you just one more training example all the way up there, right?
  • 52:40 - 52:43
    Imagine that given this training set it is actually still entirely obvious what the
  • 52:43 - 52:46
    relationship between X and Y is, right? It's just -
  • 52:46 - 52:51
    if X is greater than this value then Y is one, and if it's less then Y
  • 52:51 - 52:52
    is zero.
  • 52:52 - 52:55
    By giving you this additional training example it really shouldn't
  • 52:55 - 52:56
    change anything. I mean,
  • 52:56 - 52:59
    I didn't really convey much new information. There's no surprise that this
  • 52:59 - 53:02
    corresponds to Y equals one.
  • 53:02 - 53:05
    But if you now fit linear regression to this data
  • 53:05 - 53:07
    set you end up with a line that, I
  • 53:07 - 53:10
    don't know, maybe looks like that, right?
  • 53:10 - 53:13
    And now the predictions of your
  • 53:13 - 53:16
    hypothesis have changed completely if
  • 53:16 - 53:23
    you threshold your hypothesis at 0.5. Okay? So - In between there might be an interval where it's zero, right? For that far-off point? Oh, you mean, like that?
  • 53:27 - 53:30
    Right.
  • 53:30 - 53:31
    Yeah, yeah, fine. Yeah, sure. A theta
  • 53:31 - 53:37
    set like that so. So, I
  • 53:37 - 53:39
    guess,
  • 53:39 - 53:42
    these just - yes, you're right, but this is an example and this example works. This - [Inaudible] that will
  • 53:42 - 53:48
    change it even more if you gave it
  • 53:48 - 53:50
    all -
  • 53:50 - 53:52
    Yeah. Then I think this actually would make it even worse. You
  • 53:52 - 53:54
    would actually get a line that pulls out even further, right? So
  • 53:54 - 53:58
    this is my example. I get to make it whatever I want, right? But
  • 53:58 - 54:01
    the point of this is that there's not a deep meaning to this. The point of this is
  • 54:01 - 54:02
    just that
  • 54:02 - 54:05
    it could be a really bad idea to apply linear regression to classification
  • 54:05 - 54:06
  • 54:06 - 54:12
    problems. Sometimes it works fine, but usually I wouldn't do it.
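(A quick numeric sketch of that effect, with made-up numbers: adding one extreme positive example drags the least-squares line, which moves the point where the line crosses 0.5, even though the classification problem got no harder.)

```python
import numpy as np

def boundary_after_ols(x, y):
    """Fit least squares, return the x where theta^T [1, x] = 0.5."""
    X = np.column_stack([np.ones_like(x), x])
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return (0.5 - theta[0]) / theta[1]

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(boundary_after_ols(x, y))  # 4.5, right in the middle

# Add one unsurprising positive example far to the right:
print(boundary_after_ols(np.append(x, 30.0), np.append(y, 1.0)))  # shifts to ~5.8
```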
  • 54:12 - 54:14
    So a couple of problems with this. One is that,
  • 54:14 - 54:17
    well - so what do you want to do
  • 54:17 - 54:21
    for classification? If you know the value of Y lies between zero and
  • 54:21 - 54:25
    one then to kind of fix this problem
  • 54:25 - 54:27
    let's just start by
  • 54:27 - 54:29
    changing the form
  • 54:29 - 54:34
    of our hypothesis so that my hypothesis
  • 54:34 - 54:39
    always lies in the unit interval between zero and one. Okay?
  • 54:39 - 54:42
    So if I know Y is either
  • 54:42 - 54:44
    zero or one then
  • 54:44 - 54:47
    let's at least not have my hypothesis predict values much larger than one and much
  • 54:47 - 54:51
    smaller than zero.
  • 54:51 - 54:52
    And so
  • 54:52 - 54:56
    I'm going to - instead of choosing a linear function for my hypothesis I'm going
  • 54:56 - 55:00
    to choose something slightly different. And,
  • 55:00 - 55:03
    in particular, I'm going to choose
  • 55:03 - 55:08
    this function: H subscript theta of X is going to be equal to G of
  • 55:08 - 55:11
    theta transpose X
  • 55:11 - 55:13
    where
  • 55:13 - 55:14
    G
  • 55:14 - 55:17
    is going to be this
  • 55:17 - 55:18
    function and so
  • 55:18 - 55:21
    this becomes one over one plus e to the
  • 55:21 - 55:23
    negative theta
  • 55:23 - 55:25
    transpose X.
  • 55:25 - 55:28
    And G of Z is called the sigmoid
  • 55:28 - 55:33
    function and it
  • 55:33 - 55:39
    is often also called the logistic function. It
  • 55:39 - 55:41
    goes by either of these
  • 55:41 - 55:48
    names. And what G of Z looks like is the following. So on your
  • 55:48 - 55:51
    horizontal axis I'm going to plot Z
  • 55:51 - 55:57
    and so G of Z
  • 55:57 - 55:59
    will look like this.
  • 55:59 - 56:05
    Okay? I didn't draw that very well. Okay.
  • 56:05 - 56:06
    So G of Z
  • 56:06 - 56:08
    tends towards zero
  • 56:08 - 56:10
    as Z becomes very small
  • 56:10 - 56:12
    and G of Z will ascend
  • 56:12 - 56:15
    towards one as Z becomes large and it crosses the
  • 56:15 - 56:17
    vertical
  • 56:17 - 56:20
    axis at 0.5.
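(In Python, a minimal sketch of this function; the probe values are arbitrary, chosen only to show the two limits and the 0.5 crossing.)

```python
import numpy as np

def sigmoid(z):
    """The sigmoid / logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-6.0, 0.0, 6.0):
    print(z, sigmoid(z))  # near 0, exactly 0.5, near 1
```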
  • 56:20 - 56:24
    So this is what the sigmoid function, also called the logistic function, looks like. Yeah? Question? Why a
  • 56:24 - 56:27
    sigmoid rather than a
  • 56:27 - 56:30
    step function? Say that again. Why can we not choose a step function, since, like, that's
  • 56:30 - 56:35
    better for binary? Yeah. Let me come back to that later. So it turns out that - where did I get this function from,
  • 56:35 - 56:37
    right? I just
  • 56:37 - 56:39
    wrote down this function. It actually
  • 56:39 - 56:43
    turns out that there are two reasons for using this function that we'll come to.
  • 56:43 - 56:44
    One is -
  • 56:44 - 56:47
    we'll talk about generalized linear models. We'll see that this falls out naturally
  • 56:47 - 56:50
    as part of the broader class of models.
  • 56:50 - 56:51
    And another reason
  • 56:51 - 56:52
    that we'll talk about
  • 56:52 - 56:54
    next week, it turns out
  • 56:54 - 56:55
    there are a couple of,
  • 56:55 - 56:57
    I think, very beautiful reasons for why
  • 56:57 - 56:59
    we choose logistic
  • 56:59 - 57:00
    functions. We'll see
  • 57:00 - 57:02
    that in a little bit. But for now let me just
  • 57:02 - 57:07
    define it and just take my word for it for now that this is a reasonable choice.
  • 57:07 - 57:09
    Okay? But notice now that
  • 57:09 - 57:16
    my - the values output by my hypothesis will always be between zero
  • 57:16 - 57:17
    and one. Furthermore,
  • 57:17 - 57:21
    just like we did for linear regression, I'm going to endow
  • 57:21 - 57:26
    the outputs of my hypothesis with a probabilistic interpretation, right? So
  • 57:26 - 57:31
    I'm going to assume that the probability that Y is equal to one
  • 57:31 - 57:34
    given X and parameterized by theta
  • 57:34 - 57:36
    that's equal to
  • 57:36 - 57:38
    H subscript theta of X, all right?
  • 57:38 - 57:40
    So in other words
  • 57:40 - 57:43
    I'm going to imagine that my hypothesis is outputting all these
  • 57:43 - 57:45
    numbers that lie between zero and one.
  • 57:45 - 57:48
    I'm going to think of my hypothesis
  • 57:48 - 57:55
    as trying to estimate the probability that Y is equal to one. Okay?
  • 57:56 - 58:00
    And because
  • 58:00 - 58:02
    Y has to be either zero or one
  • 58:02 - 58:05
    then the probability of Y equals zero is going
  • 58:05 - 58:12
    to be one minus that. All right?
  • 58:12 - 58:15
    So more simply, it turns out we can actually take these two equations
  • 58:15 - 58:19
    and write them more compactly.
  • 58:19 - 58:22
    Write P of Y given X
  • 58:22 - 58:24
    parameterized by theta.
  • 58:24 - 58:26
    This is going to be H
  • 58:26 - 58:28
    subscript theta of X to
  • 58:28 - 58:32
    the power of Y times
  • 58:32 - 58:33
    one minus
  • 58:33 - 58:35
    H of X to the power of
  • 58:35 - 58:37
    one minus Y. Okay? So I know this
  • 58:37 - 58:43
    looks somewhat bizarre, but this actually makes the derivation much nicer.
  • 58:43 - 58:44
    So if Y is equal to one
  • 58:44 - 58:48
    then this equation is H of X to the power of one
  • 58:48 - 58:51
    times something to the power of zero.
  • 58:51 - 58:54
    So anything to the power of zero is just one,
  • 58:54 - 58:56
    right? So if Y equals one then
  • 58:56 - 58:59
    this is something to the power of zero and so this is just one.
  • 58:59 - 59:03
    So if Y equals one this is just saying P of Y equals one is equal to H subscript
  • 59:03 - 59:06
    theta of X. Okay?
  • 59:06 - 59:08
    And in the same way, if Y is equal
  • 59:08 - 59:10
    to zero then this is P
  • 59:10 - 59:14
    of Y equals zero equals this thing to the power of zero and so this disappears. This is
  • 59:14 - 59:16
    just one
  • 59:16 - 59:18
    times this thing to the power of one. Okay? So this is
  • 59:18 - 59:19
    a
  • 59:19 - 59:20
    compact way of writing
  • 59:20 - 59:23
    both of these equations
  • 59:23 - 59:30
    together in one line.
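(Here's a minimal sketch of that compact form in Python; theta and x are made-up values, and x is assumed to already carry the intercept term x_0 = 1.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_y_given_x(theta, x, y):
    """P(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y) for y in {0, 1}."""
    h = sigmoid(theta @ x)  # h_theta(x) = g(theta^T x)
    return h ** y * (1.0 - h) ** (1 - y)

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
# With y = 1 this is just h(x); with y = 0 it is 1 - h(x), the two cases above.
print(prob_y_given_x(theta, x, 1), prob_y_given_x(theta, x, 0))
```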
  • 59:31 - 59:36
    So let's move on to parameter fitting, right? And, again, you can ask -
  • 59:36 - 59:38
    well, given this model for my data, how do I fit
  • 59:38 - 59:41
    the parameters theta of my
  • 59:41 - 59:46
    model? So the likelihood of the parameters is, as before, just the probability
  • 59:46 - 59:49
  • 59:49 - 59:50
  • 59:50 - 59:54
    of the data, right? Which is the product over I of P of YI
  • 59:54 - 59:57
    given XI
  • 59:57 - 59:59
    parameterized by theta.
  • 59:59 - 60:02
    Which is - just plugging those
  • 60:02 - 60:09
    in. Okay? I
  • 60:09 - 60:16
    dropped this theta subscript just so I can write a little bit less. Oh,
  • 60:17 - 60:20
    excuse me. These
  • 60:20 - 60:27
    should be
  • 60:29 - 60:36
    XI's
  • 60:36 - 60:43
    and YI's. Okay?
  • 60:51 - 60:53
    So,
  • 60:53 - 60:57
    as before, let's say we want to find a maximum likelihood estimate of the parameters theta. So
  • 60:57 - 60:59
    we want
  • 60:59 - 61:04
    to find the setting of the parameters theta that maximizes the likelihood L
  • 61:04 - 61:08
    of theta. It
  • 61:08 - 61:10
    turns out
  • 61:10 - 61:11
    that very often
  • 61:11 - 61:14
    - just when you work with the derivations, it turns out that it is often
  • 61:14 - 61:18
    much easier to maximize the log of the likelihood rather than maximize the
  • 61:18 - 61:20
    likelihood.
  • 61:20 - 61:20
    So
  • 61:20 - 61:23
    the log
  • 61:23 - 61:25
    likelihood, lowercase l of theta, is just the log of capital L of theta.
  • 61:25 - 61:28
    This will, therefore,
  • 61:28 - 61:35
    be the sum over I of YI times log H of XI, plus one minus YI times log of one minus H of XI. Okay?
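(A minimal sketch of that log likelihood in Python, assuming a design matrix X whose first column is the intercept term and a vector y of 0/1 labels; the tiny data set is made up.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
print(log_likelihood(np.zeros(2), X, y))  # 3 * log(0.5), since h = 0.5 everywhere at theta = 0
```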
  • 61:58 - 61:59
    And so
  • 61:59 - 62:04
    to fit the parameters theta of our model we'll
  • 62:04 - 62:06
    find the value of
  • 62:06 - 62:12
    theta that maximizes this log likelihood. Yeah? [Inaudible] Say that again. YI is [inaudible]. Oh, yes.
  • 62:12 - 62:19
    Thanks.
  • 62:22 - 62:27
    So how do we maximize this function? Well, it turns out we can actually apply
  • 62:27 - 62:29
    the same gradient
  • 62:29 - 62:33
    descent algorithm that we learned. That was the first algorithm we used
  • 62:33 - 62:35
    to minimize the quadratic function.
  • 62:35 - 62:37
    And you remember, when we talked about least squares,
  • 62:37 - 62:40
    the first algorithm we used to minimize the quadratic
  • 62:40 - 62:41
    error function
  • 62:41 - 62:43
    was gradient descent.
  • 62:43 - 62:45
    So we can actually use exactly the same algorithm
  • 62:45 - 62:48
    to maximize the log likelihood.
  • 62:48 - 62:49
  • 62:49 - 62:51
    And you remember, that algorithm was just
  • 62:51 - 62:55
    repeatedly take the value of theta
  • 62:55 - 62:56
    and you replace it with
  • 62:56 - 62:59
    the previous value of theta plus
  • 62:59 - 63:02
    a learning rate alpha
  • 63:02 - 63:04
    times
  • 63:04 - 63:08
    the gradient of the objective - the gradient of the log likelihood with respect to
  • 63:08 - 63:10
    theta. Okay?
  • 63:10 - 63:14
    One small change is that previously we were trying to minimize
  • 63:14 - 63:15
  • 63:15 - 63:17
    the quadratic error term.
  • 63:17 - 63:20
    Today we're trying to maximize rather than minimize. So rather than having a minus
  • 63:20 - 63:23
    sign we have a plus sign. So this is
  • 63:23 - 63:24
    just gradient ascent,
  • 63:24 - 63:25
    but for the
  • 63:25 - 63:28
    maximization rather than the minimization.
  • 63:28 - 63:32
    So we actually call this gradient ascent and it's really the same
  • 63:32 - 63:35
    algorithm.
  • 63:35 - 63:37
    So to figure out
  • 63:37 - 63:42
    what this gradient is - so in order to derive the gradient ascent rule,
  • 63:42 - 63:44
    what you need to do is
  • 63:44 - 63:48
    compute the partial derivatives of your objective function with respect to
  • 63:48 - 63:53
    each of your parameters theta I, right?
  • 63:53 - 63:58
    It turns out that
  • 63:58 - 64:00
    if you actually
  • 64:00 - 64:03
    compute this partial derivative -
  • 64:03 - 64:10
    so you take this formula, this L of theta, which is - oh, got that wrong too. If
  • 64:10 - 64:14
    you take this lower case l theta, if you take the log likelihood of theta,
  • 64:14 - 64:17
    and if you take its partial derivative with
  • 64:17 - 64:19
    respect to theta I
  • 64:19 - 64:22
    you find that
  • 64:22 - 64:27
    this is equal to -
  • 64:27 - 64:30
    let's see. Okay? And,
  • 64:30 - 64:37
    I
  • 64:46 - 64:47
    don't
  • 64:47 - 64:50
    know, the derivation isn't terribly complicated, but in
  • 64:50 - 64:54
    the interest of saving you watching me write down a couple of
  • 64:54 - 64:56
    blackboards full of math I'll just write
  • 64:56 - 64:57
    down the final answer. But
  • 64:57 - 64:59
    the way you get this is you
  • 64:59 - 65:00
    just take those, plug
  • 65:00 - 65:04
    in the definition for H subscript theta as a function of XI, and take derivatives,
  • 65:04 - 65:06
    and work through the algebra
  • 65:06 - 65:08
    it turns out it'll simplify
  • 65:08 - 65:13
    down to this formula. Okay?
  • 65:13 - 65:15
    And so
  • 65:15 - 65:19
    what that gives you is that gradient ascent
  • 65:19 - 65:22
    is the following
  • 65:22 - 65:24
    rule. Theta J gets updated as theta
  • 65:24 - 65:26
    J
  • 65:26 - 65:28
    plus alpha
  • 65:28 - 65:35
    times YI minus H subscript theta of XI, times XIJ. Okay?
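(That update rule, as a hedged batch-form sketch in Python - the learning rate, iteration count, and data set are all made up, and the vectorized line updates every theta_j at once rather than one example at a time.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, alpha=0.1, iters=1000):
    """theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)          # h for all m examples at once
        theta += alpha * X.T @ (y - h)  # one gradient ascent step
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 6.0], [1.0, 7.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_ascent(X, y)
print(sigmoid(X @ theta))  # near 0 for the first two examples, near 1 for the rest
```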
  • 65:47 - 65:50
    Does this look familiar to anyone? Do you
  • 65:50 - 65:56
    remember seeing this formula in the last lecture? Right.
  • 65:56 - 65:59
    So when I worked out batch gradient descent
  • 65:59 - 66:02
    for least squares regression I,
  • 66:02 - 66:06
    actually, wrote down
  • 66:06 - 66:08
    exactly the same thing, or maybe
  • 66:08 - 66:11
    there's a minus sign difference, something like that. But I, actually, had
  • 66:11 - 66:14
    exactly the same learning rule last time
  • 66:14 - 66:20
    for least squares regression,
  • 66:20 - 66:24
    right? Is this the same learning algorithm then? So what's different? How come I was making
  • 66:24 - 66:25
    all that noise earlier about
  • 66:25 - 66:30
    least squares regression being a bad idea for classification problems and then I did
  • 66:30 - 66:34
    a bunch of math and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Inaudible] constants?
  • 66:34 - 66:41
    Say that again. [Inaudible] Oh,
  • 66:44 - 66:47
    right. Okay, cool. No, exactly. Right. So it looks the same, but
  • 66:47 - 66:48
    this is not the same, right? And the
  • 66:48 - 66:49
    reason is,
  • 66:49 - 66:52
    in logistic regression
  • 66:52 - 66:54
    this is different from before, right?
  • 66:54 - 66:56
    The definition
  • 66:56 - 66:58
    of this H subscript theta of XI
  • 66:58 - 66:59
    is not
  • 66:59 - 67:03
    the same as the definition I was using in the previous lecture.
  • 67:03 - 67:07
    And in particular this is no longer theta transpose XI. This is not
  • 67:07 - 67:10
    a linear function anymore.
  • 67:10 - 67:12
    This is a logistic function of theta
  • 67:12 - 67:14
    transpose XI. Okay?
  • 67:14 - 67:18
    So even though this looks cosmetically similar,
  • 67:18 - 67:20
    even though this is similar on the surface,
  • 67:20 - 67:24
    to the batch gradient descent rule I derived last time for
  • 67:24 - 67:25
    least squares regression
  • 67:25 - 67:29
    this is actually a totally different learning algorithm. Okay?
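(One way to see the "cosmetically similar but actually different" point, as a sketch: the update line can be literally the same code for both algorithms; only the definition of the hypothesis h changes. Everything below is made up for illustration.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_step(theta, X, y, alpha, h_fn):
    """One batch step of theta := theta + alpha * X^T (y - h)."""
    h = h_fn(X @ theta)
    return theta + alpha * X.T @ (y - h)

linear = lambda z: z   # least squares: h_theta(x) = theta^T x
logistic = sigmoid     # logistic regression: h_theta(x) = g(theta^T x)

X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0])
print(one_step(np.zeros(2), X, y, 0.1, linear))
print(one_step(np.zeros(2), X, y, 0.1, logistic))  # same formula, different numbers
```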
  • 67:29 - 67:32
    And it turns out that there's actually no coincidence that you ended up with the
  • 67:32 - 67:35
    same learning rule. We'll actually
  • 67:35 - 67:35
  • 67:35 - 67:40
    talk a bit more about this later when we talk about generalized linear models.
  • 67:40 - 67:43
    But this is one of the most elegant things about generalized linear models
  • 67:43 - 67:45
    that we'll see later. That
  • 67:45 - 67:48
    even though we're using a different model, you actually ended up with
  • 67:48 - 67:51
    what looks like the same learning algorithm and it's actually no
  • 67:51 - 67:56
    coincidence. Cool.
  • 67:56 - 67:59
    One last comment as
  • 67:59 - 68:01
    part of a sort of learning process,
  • 68:01 - 68:02
    over here
  • 68:02 - 68:05
    I said I take the derivatives and I
  • 68:05 - 68:07
    ended up with this line.
  • 68:07 - 68:09
    I didn't want to
  • 68:09 - 68:13
    make you sit through a long algebraic derivation, but
  • 68:13 - 68:15
    later today or later this week,
  • 68:15 - 68:19
    please, do go home and look at our lecture notes, where I wrote out
  • 68:19 - 68:21
    the entirety of this derivation in full,
  • 68:21 - 68:23
    and make sure you can follow every single step of
  • 68:23 - 68:27
    how we take partial derivatives of this log likelihood
  • 68:27 - 68:32
    to get this formula over here. Okay? By the way, for those who are
  • 68:32 - 68:36
    interested in seriously mastering the machine learning material,
  • 68:36 - 68:40
    when you go home and look at the lecture notes it will actually be very easy for most
  • 68:40 - 68:42
    of you to look through
  • 68:42 - 68:45
    the lecture notes and read through every line and go yep, that makes sense, that makes sense, that makes sense,
  • 68:45 - 68:45
    and,
  • 68:45 - 68:50
    sort of, say cool. I see how you get this line.
  • 68:50 - 68:53
    You want to make sure you really understand the material. My concrete
  • 68:53 - 68:55
    suggestion to you would be to go home,
  • 68:55 - 68:58
    read through the lecture notes, check every line,
  • 68:58 - 69:02
    and then to cover up the derivation and see if you can derive it yourself, right? So
  • 69:02 - 69:07
    in general, that's usually good advice for studying technical
  • 69:07 - 69:09
    material like machine learning. Which is if you work through a proof
  • 69:09 - 69:11
    and you think you understood every line,
  • 69:11 - 69:14
    the way to make sure you really understood it is to cover it up and see
  • 69:14 - 69:17
    if you can rederive the entire thing yourself. This is actually a great way because I
  • 69:17 - 69:20
    did this a lot when I was trying to study
  • 69:20 - 69:22
    various pieces of machine learning
  • 69:22 - 69:26
    theory and various proofs. And this is actually a great way to study because you cover up
  • 69:26 - 69:28
    the derivations and see if you can do it yourself
  • 69:28 - 69:33
    without looking at the original derivation. All right.
  • 69:33 - 69:37
  • 69:37 - 69:40
    I probably won't get to Newton's Method today. I just
  • 69:40 - 69:47
    want to say
  • 69:55 - 69:58
    - take one quick digression to talk about
  • 69:58 - 69:59
    one more algorithm,
  • 69:59 - 70:01
    which the discussion was sort
  • 70:01 - 70:08
    of alluding to earlier,
  • 70:09 - 70:12
    which is the perceptron
  • 70:12 - 70:13
    algorithm, right? So
  • 70:13 - 70:14
    I'm
  • 70:14 - 70:17
    not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come
  • 70:17 - 70:20
    back to later. Later this quarter
  • 70:20 - 70:24
    we'll talk about learning theory.
  • 70:24 - 70:27
    So in logistic regression we said that G of Z was, sort
  • 70:27 - 70:28
    of,
  • 70:28 - 70:30
    my hypothesis output values
  • 70:30 - 70:33
    that were real numbers between zero and one.
  • 70:33 - 70:37
    The question is, what if you want to force G of Z to output
  • 70:37 - 70:39
    a value that is
  • 70:39 - 70:39
    either
  • 70:39 - 70:41
    zero or one?
  • 70:41 - 70:44
    So the
  • 70:44 - 70:46
    perceptron algorithm defines G of Z
  • 70:46 - 70:48
    to be this: one if Z is greater than or equal to zero, and zero otherwise.
  • 70:48 - 70:53
  • 70:53 - 70:54
    So the picture is - or
  • 70:54 - 71:01
    the cartoon is, rather than this sigmoid function, G of
  • 71:02 - 71:08
    Z now looks like this step function that you were asking about earlier.
  • 71:08 - 71:14
    And same as before, we can use H subscript theta of X equals G of theta transpose X. Okay? So
  • 71:14 - 71:14
    this
  • 71:14 - 71:18
    is actually - everything is exactly the same as before,
  • 71:18 - 71:20
    except that G of Z is now the step function.
  • 71:20 - 71:22
    It
  • 71:22 - 71:25
    turns out there's this learning rule called the perceptron learning rule that's actually
  • 71:25 - 71:28
    cosmetically the same as the classic gradient ascent rule
  • 71:28 - 71:31
    for logistic regression.
  • 71:31 - 71:33
    And the learning rule is
  • 71:33 - 71:40
    given by this: theta J gets updated as theta J plus alpha times YI minus H subscript theta of XI, times XIJ. Okay?
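(A minimal sketch of that rule in Python - the step-function hypothesis and one-example-at-a-time updates are the only changes from the logistic version; the data set, learning rate, and epoch count are made up.)

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=10):
    """theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij,
    where h_theta(x) = 1 if theta^T x >= 0 else 0 (the step function g)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = 1.0 if theta @ x_i >= 0 else 0.0  # thresholded prediction
            theta += alpha * (y_i - h) * x_i      # nonzero only on mistakes
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 6.0], [1.0, 7.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = perceptron_train(X, y)
print([1.0 if theta @ x >= 0 else 0.0 for x in X])  # matches y on this separable data
```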
  • 71:44 - 71:50
    So it looks just like the
  • 71:50 - 71:51
    classic gradient ascent rule
  • 71:51 - 71:54
  • 71:54 - 71:56
    for logistic regression.
  • 71:56 - 72:01
    So this is a very different flavor of algorithm than least squares regression and logistic
  • 72:01 - 72:02
    regression,
  • 72:02 - 72:06
    and, in particular, because it only outputs values that are either zero or one it
  • 72:06 - 72:09
    turns out it's very difficult to endow this algorithm with
  • 72:09 - 72:12
    probabilistic semantics. And this
  • 72:12 - 72:19
    is, again, even though - oh, excuse me. Right there. Okay.
  • 72:20 - 72:24
    And even though this learning rule, again, looks cosmetically very similar to
  • 72:24 - 72:26
    what we have in logistic regression this is actually
  • 72:26 - 72:28
    a very different type of learning rule
  • 72:28 - 72:31
    than the others that we've seen
  • 72:31 - 72:34
    in this class. So
  • 72:34 - 72:37
    because this is such a simple learning algorithm, right? It just
  • 72:37 - 72:41
    computes theta transpose X and then you threshold and then your output is zero or one.
  • 72:41 - 72:43
    This is -
  • 72:43 - 72:46
    right. So this is a simpler algorithm than logistic regression, I think.
  • 72:46 - 72:49
    When we talk about learning theory later in this class,
  • 72:49 - 72:55
    the simplicity of this algorithm will let us come back and use it as a building block. Okay?
  • 72:55 - 72:57
    But that's all I want to say about this algorithm for now.