Lecture 3 | Machine Learning (Stanford)
(music)

This presentation is delivered by the Stanford Center for Professional Development.
Okay. Good morning, and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem a little bit like I'm jumping, sort of, from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture, we talked about linear regression, and today I want to talk about, sort of, an adaptation of that called locally weighted regression. It's a very powerful algorithm that's actually one of my former mentor's probably favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to, again, later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.
So, let's just recap what we were talking about in the previous lecture. Remember the notation that I defined was that I used this x^(i), y^(i) to denote the ith training example. And when we're talking about linear regression, or ordinary least squares, we use h(x^(i)) to denote the predicted value output by my hypothesis h on the input x^(i). And my hypothesis was parameterized by the vector of parameters θ, and so we said that this was equal to the sum over j of θ_j x_j^(i), written more simply as θᵀx. And we had the convention that x₀ is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay? So in the example of trying to predict housing prices, we had two features, the size of the house and the number of bedrooms. We had two features, and so little n was equal to two.
So just to finish recapping the previous lecture, we defined this quadratic cost function J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))², where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture, we derived the value of θ that minimizes this in closed form, which was θ = (XᵀX)⁻¹Xᵀy. Okay?
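As a quick sketch of the recap above - the x₀ = 1 convention, the cost J(θ), and the closed-form solution θ = (XᵀX)⁻¹Xᵀy - here is a minimal implementation; the toy housing-style data values are made up for illustration:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Fit theta minimizing J(theta) = (1/2) * sum_i (theta^T x_i - y_i)^2."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend x0 = 1 intercept term
    # Solve (X^T X) theta = X^T y; solve() is preferred over an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(theta, x):
    """Evaluate the hypothesis h_theta(x) = theta^T x on a single input."""
    return theta @ np.concatenate([[1.0], np.atleast_1d(x)])

# Made-up data: feature = size of house, target = price, exactly price = 2 * size.
sizes = np.array([1.0, 2.0, 3.0, 4.0])
prices = np.array([2.0, 4.0, 6.0, 8.0])
theta = fit_linear_regression(sizes.reshape(-1, 1), prices)
```

On this noiseless data the fit recovers the generating line, and `predict` then evaluates θᵀx at a new input.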
So as we move on in today's lecture, I'll continue to use this notation, and, again, I realize this is a fair amount of notation to all remember, so if partway through this lecture you're having trouble remembering what lowercase m is or what lowercase n is or something, please raise your hand and ask.

When we talked about linear regression last time, we used two features. One of the features was the size of the houses in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how it actually does. So just for example, the choice we made last time was x₁ equals the size. And let's leave this idea of the number-of-bedrooms feature aside for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually define - oh, let me draw this out.
And so, right? So say that axis is the size of the house and that's the price of the house. So if you use this as a feature, maybe you get θ₀ + θ₁x₁, this, sort of, linear model. If you choose - let me just copy the same data set over, right? You can define a set of features where x₁ is equal to the size of the house and x₂ is the square of the size of the house. Okay? So x₁ is the size of the house in, say, square footage, and x₂ is just take whatever the square footage of the house is and square that number, and this would be another way to come up with a feature. And if you do that, then the same algorithm will end up fitting a quadratic function for you, θ₀ + θ₁x₁ + θ₂x₁². Okay? Because x₁² is actually x₂. And depending on what the data looks like, maybe this is a slightly better fit to the data. You can actually take this even further, right?
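As an aside, the quadratic-via-features trick just described can be sketched with made-up data - the same normal-equation solver fits a quadratic once x₂ = x₁² is added as a feature:

```python
import numpy as np

# Made-up data with a truly quadratic relationship: y = 3 + 2 * x1^2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x1**2

# Feature matrix [x0 = 1, x1, x2 = x1^2]: linear regression in these
# features fits a quadratic function of the original input.
X = np.column_stack([np.ones_like(x1), x1, x1**2])
theta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation
```

Since the data really is quadratic, the recovered parameters are (3, 0, 2) up to floating-point error.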
Which is - let's see, I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model θ₀ + θ₁x + θ₂x² plus up to θ₆x⁶, a sixth-order polynomial, to these seven data points. And if you do that, you find that you come up with a model that fits your data exactly. This is where, I guess, in this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly. And you'll probably find that the curve you get will look something like that.
And on the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously thinks that this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best. Whereas for the linear model, it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing. So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features.
Just to give these a name, we call this the problem of underfitting, and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit. And this problem here we refer to as overfitting, and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of this specific data set, right? It just so happens that, of the seven houses we sampled in Portland, or wherever you collect data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set, rather than the true underlying trends of how housing prices vary as a function of the size of the house. Okay? So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features.

So if you want to address this issue of selecting features, there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate somewhat the need for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.
And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is defined as an algorithm that has a fixed number of parameters that you fit to the data. Okay? So in linear regression we have a fixed set of parameters θ, right, that we must fit to the data.
In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I'll also give a second, say, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as: the number of parameters grows linearly with the size of the training set. That's the formal definition. A slightly less formal definition is that the amount of stuff that your learning algorithm needs to keep around will grow linearly with the training set - or, another way of saying it is that this is an algorithm where we'll need to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition, but what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression.
This also goes by a couple of other names - it also goes by the name of Loess, for, sort of, historical reasons. Loess is usually spelled L-O-E-S-S; it's sometimes spelled Lowess, too. I'll just call it locally weighted regression.
So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So for my motivating example, let's say that I have a training set that looks like this, okay? So this is x and that's y. If you run linear regression on this and you fit maybe a linear function to this, then you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether the features are right. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as a·x plus x squared plus maybe some function of sin of x or something. You actually sit around and fiddle with features, and after a while you can probably come up with a set of features that models the data okay, but let's talk about an algorithm that you can use without needing to do that.
So - well, suppose you want to evaluate your hypothesis h at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of y at this position of x, right? So for linear regression, what we were doing was we would fit θ to minimize Σ_i (y^(i) − θᵀx^(i))², and return θᵀx. Okay? So that was linear regression. In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point x, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of x. Okay? So I'll look at where I want to evaluate my hypothesis; I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis, and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using this term subset - well, let's come back to that later. So I take this subset of the data and I fit a straight line to it, and maybe I get a straight line like that. And what I'll do is then evaluate this straight line at this particular value of x, and that will be the value I return from my algorithm. This would be the predicted value - this would be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit θ to minimize Σ_i w^(i) (y^(i) − θᵀx^(i))², where these terms w^(i) are called weights. There are many possible choices for the weights; I'm just gonna write one down: w^(i) = exp(−(x^(i) − x)² / 2). So let's look at what these weights really are, right? So notice that - suppose you have a training example x^(i) such that x^(i) is very close to x, so that this quantity is small, right? Then if x^(i) − x is small, so if x^(i) − x is close to zero, then this is e to the minus zero, and e to the zero is one. So if x^(i) is close to x, then w^(i) will be close to one. In other words, the weight associated with the ith training example will be close to one if x^(i) and x are close to each other. Conversely, if x^(i) − x is large, then - I don't know, what would w^(i) be? [Student:] Zero. [Instructor:] Zero, right. Close to zero. Right. So if x^(i) is very far from x, then this is e to the minus of some large number, and e to the minus of some large number will be close to zero. Okay?
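As a quick numerical check of the weight behavior just described, using w^(i) = exp(−(x^(i) − x)²/2) with made-up x values:

```python
import numpy as np

def weight(x_i, x_query):
    # Bell-shaped weight from the lecture: close to 1 when x_i is near
    # the query point, close to 0 when x_i is far away.
    return np.exp(-(x_i - x_query) ** 2 / 2.0)

w_near = weight(5.05, 5.0)   # training example right next to the query
w_far = weight(12.0, 5.0)    # training example far from the query
```

Here `w_near` comes out just under one and `w_far` is vanishingly small, so faraway examples contribute essentially nothing to the weighted sum.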
So the picture is: if I'm querying at a certain point x, shown on the x-axis, and if my data set, say, looks like that, then I'm going to give the points close to x a large weight and give the points far away a small weight. So for the points that are far away, w^(i) will be close to zero, and so the points that are far away will not contribute much at all to this summation. Right? So you can think of this as a sum over i of one times the quadratic term for nearby points plus zero times the quadratic term for faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters θ, paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah? [Student:] Why is the weight exponential [inaudible]?
Yeah, let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to choose different choices by default. There is somewhat of a literature debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible] is?
For those of you that are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution. Okay? But this actually has absolutely nothing to do with Gaussian distributions. So this is not saying that the x^(i)'s are Gaussian or whatever; there is no such interpretation. This is just a convenient function that happens to be a bell-shaped function, but don't endow this with any Gaussian semantics. Okay? So, in fact - well, if you remember the familiar bell-shaped Gaussian - again, the way of associating weights with these points is that if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis h, then this point here gets a weight that's proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point. And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.
One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as τ (tau), so that the weights become w^(i) = exp(−(x^(i) − x)² / (2τ²)). Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function. This parameter τ is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess. So if τ is very small - if that's a query x - then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points that are far away fall off rapidly. Whereas if τ is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?
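Putting the pieces together, here is a minimal sketch of locally weighted regression in one feature dimension, using the bell-shaped weights with bandwidth τ from above; the sine-shaped data and the choice τ = 0.3 are illustrative:

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau):
    """Fit theta to the weighted least-squares objective, return theta^T x_query."""
    # Bell-shaped weights w_i = exp(-(x_i - x)^2 / (2 tau^2)).
    w = np.exp(-(x_train - x_query) ** 2 / (2.0 * tau ** 2))
    X = np.column_stack([np.ones_like(x_train), x_train])  # x0 = 1 intercept
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return theta[0] + theta[1] * x_query

# Made-up non-linear data; a single global straight line fits this poorly,
# but refitting locally at each query traces out the curve.
x_train = np.linspace(0.0, 10.0, 50)
y_train = np.sin(x_train)
y_hat = lwr_predict(x_train, y_train, x_query=5.0, tau=0.3)
```

Note that, as discussed in the questions below, the whole training set is needed at prediction time: each query refits θ from scratch.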
So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis outputs at a point like this, you end up having a straight line making that prediction. To ask what the prediction is at that other value, you fit a straight line there and you predict that value. It turns out that every time you evaluate your hypothesis - every time you ask your learning algorithm to make a prediction for how much a new house costs or whatever - you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of x, so the position of the query where you're trying to make a prediction. Okay? But if you do this for every point along the x-axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay? So in the problem set we're actually gonna let you play around more with this algorithm, so I won't say too much more about it here.
But before finally moving on to the next topic, let me take the questions you have. Yeah? [Student:] It seems like you still have the same problem of overfitting and underfitting, like when you choose your τ - like if you make it too small in your - [Instructor:] Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - some of these things I'll leave you to discover for yourself in the homework problem. You'll actually see what you just mentioned.
Yeah? [Student:] It almost seems like you're not even really [inaudible] with this locally weighted - you have all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points. [Instructor:] Right. So the question is, sort of - it's almost as if you're not building a model, because you need the entire data set. And the other way of saying that is that this is a non-parametric learning algorithm. So - I don't know, I won't debate whether, you know, we're really building a model or not, but this is perfectly fine. So when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. And we've actually used this quite successfully to model, sort of, the dynamics of an autonomous helicopter. Yeah? [Student:] I wanted to ask if this algorithm can learn the weights based on the data. [Instructor:] Learn what weights? [Student:] Oh, the weights w^(i). Instead of using [inaudible]. [Instructor:] I see, yes. So it turns out there are a few things you can do. One thing that is quite common is to choose this bandwidth parameter τ using the data, right? We'll actually talk about that a bit later when we talk about model selection. Yes? One last question. [Student:] I used [inaudible] Gaussian - sometimes if you [inaudible] Gaussian and then - [Instructor:] Oh, I see. Let's see. Boy. The weights are not random variables, and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights, but it's actually fine to choose a function that doesn't even integrate to one - that integrates to infinity, say - as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay?
That's the last question, and then let's move on. [Student:] Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict the price for each house, and so should the end result for each input - I'm seeing this very constantly for - [Instructor:] Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit θ to your entire training set again. So you're actually right: if you have a very large training set, then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well. So I don't want to talk about that here; if you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on. Let's take more questions later. So, okay.
So that's locally weighted regression. Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And in particular, it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm.

So let's put aside locally weighted regression for now; we'll just talk about ordinary unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the squared error between the predictions of the hypothesis and the values y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay? It turns out that there are many assumptions that are sufficient to justify why we do least squares, and this is just one of them. So I'm just presenting one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions. So even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will hopefully, you know, give one rationalization - like, one reason - for doing least squares regression.
And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics. So let's assume, in our example of predicting housing prices, that the price a house is sold for is going to be some linear function of the features, plus some term ε^(i); that is, y^(i) = θᵀx^(i) + ε^(i). Okay? And ε^(i) (epsilon) will be our error term. You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever - additional features that we just fail to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both of these: unmodeled effects - just things we forgot to model, maybe the function isn't quite linear or something - as well as random noise, like maybe that day the seller was in a really bad mood and just refused to sell for a reasonable price or something.
-
31:00 - 31:03And now
-
31:03 - 31:08I will assume that the errors have a
-
31:08 - 31:09probabilistic
-
31:09 - 31:13- have a probability distribution. I'll assume that the errors epsilon I
-
31:13 - 31:15are distributed
-
31:15 - 31:16just till
-
31:16 - 31:19they denote epsilon I
-
31:19 - 31:23is distributive according to a probability distribution. That's
-
31:23 - 31:27a Gaussian distribution with mean zero
-
31:27 - 31:28and variance sigma squared. Okay? So
-
31:28 - 31:31let me just scripts in here,
-
31:31 - 31:35n stands for normal, right? To denote a normal distribution, also known as the
-
31:35 - 31:36Gaussian distribution,
-
31:36 - 31:37with mean
-
31:37 - 31:41zero and covariance sigma squared.
-
31:41 - 31:47Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you.
-
31:47 - 31:49Great. Almost everyone.
-
31:49 - 31:51So,
-
31:51 - 31:55in other words, the density for Gaussian is what you've seen before.
-
31:55 - 31:58The density for epsilon I would be
-
31:58 - 32:00one over root 2 pi sigma, E to the
-
32:00 - 32:02negative,
-
32:02 - 32:04epsilon I
-
32:04 - 32:06
-
32:06 - 32:09squared over 2 sigma squared, right?
-
32:09 - 32:15And the
-
32:15 - 32:22density of our epsilon I will be this bell-shaped curve
-
32:23 - 32:26with one standard deviation
-
32:26 - 32:31being, sort of, sigma. Okay? This
-
32:31 - 32:32is
-
32:32 - 32:34the formula for that bell-shaped curve.
-
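The Gaussian density just written down - one over root two pi sigma, times e to the negative epsilon squared over two sigma squared - can be checked numerically. A minimal Python sketch (the helper name `gaussian_density` is mine, not the lecture's):

```python
import math

def gaussian_density(eps, sigma=1.0):
    """Density of a zero-mean Gaussian N(0, sigma^2) evaluated at eps."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-eps ** 2 / (2 * sigma ** 2))

# The bell curve peaks at zero (at height 1/sqrt(2*pi) for sigma = 1)
# and is symmetric about zero:
peak = gaussian_density(0.0)
assert abs(peak - 1.0 / math.sqrt(2 * math.pi)) < 1e-12
assert abs(gaussian_density(1.0) - gaussian_density(-1.0)) < 1e-12
```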
32:34 - 32:41So, let's see. I can erase that. Can I
-
32:41 - 32:47erase the board?
-
32:47 - 32:54
-
32:59 - 33:05So this implies that the
-
33:05 - 33:10probability distribution of a price of a house
-
33:10 - 33:12given XI
-
33:12 - 33:14and the parameters theta,
-
33:14 - 33:21that this is going to be Gaussian
-
33:30 - 33:35with that density. Okay?
-
33:35 - 33:38In other words, what this is saying is that the
-
33:38 - 33:41price of
-
33:41 - 33:47a house given the features of the house and my parameters theta,
-
33:47 - 33:51this is going to be a random variable
-
33:51 - 33:54that's distributed Gaussian with
-
33:54 - 33:57mean theta transpose XI
-
33:57 - 33:58and variance sigma squared.
-
33:58 - 34:01Right? Because we imagine that
-
34:01 - 34:05the way the housing prices are generated is that the price of a house
-
34:05 - 34:09is equal to theta transpose XI and then plus some random Gaussian noise with variance sigma
-
34:09 - 34:12squared. So
-
34:12 - 34:14the price of a house is going to
-
34:14 - 34:16have mean theta transpose XI and, again, variance sigma squared, right? Does
-
34:16 - 34:20this make
-
34:20 - 34:24sense? Raise your hand if this makes sense. Yeah,
-
34:24 - 34:31okay. Lots of you. As a
-
34:38 - 34:44point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume
-
34:44 - 34:47here the error is a
-
34:47 - 34:50Gaussian?
-
34:50 - 34:54Right. So, boy.
-
34:54 - 34:56Why do I assume the error is Gaussian?
-
34:56 - 35:01Two reasons, right? One is that it turns out to be mathematically convenient to do so
-
35:01 - 35:03and the other is, I don't
-
35:03 - 35:07know, I can also mumble about justifications, such as appealing to the
-
35:07 - 35:09central limit theorem. It turns out that if you,
-
35:09 - 35:12for the vast majority of problems, if you apply a linear regression model like
-
35:12 - 35:16this and try to measure the distribution of the errors,
-
35:16 - 35:20not all the time, but very often you find that the errors really are Gaussian. That
-
35:20 - 35:22this Gaussian model is a good
-
35:22 - 35:24assumption for the error
-
35:24 - 35:26in regression problems like these.
-
35:26 - 35:29Some of you may have heard of the central limit theorem, which says that
-
35:29 - 35:33the sum of many independent random variables will tend towards a Gaussian.
-
35:33 - 35:37So if the error is caused by many effects, like the mood of the
-
35:37 - 35:39seller, the mood of the buyer,
-
35:39 - 35:43some other features that we miss, whether the place has a garden or not, and
-
35:43 - 35:46if all of these effects are independent, then
-
35:46 - 35:49by the central limit theorem you might be inclined to believe that
-
35:49 - 35:52the sum of all these effects will be approximately Gaussian. But
-
35:52 - 35:54in practice, I guess, the
-
35:54 - 35:58two real answers are that, 1.) In practice this is actually a reasonably accurate
-
35:58 - 36:05assumption, and 2.) It turns out to be mathematically convenient to do so. Okay? Yeah? It seems like we're
-
36:07 - 36:08saying
-
36:08 - 36:11if we assume that the error around our model
-
36:11 - 36:13has zero mean, then
-
36:13 - 36:16the error is centered around our model. Which
-
36:16 - 36:18it seems almost like we're trying to assume
-
36:18 - 36:20what we're trying to prove. Instructor?
-
36:20 - 36:24That's the [inaudible] but, yes. You are assuming that
-
36:24 - 36:28the error has zero mean. Which is, yeah, right.
-
36:28 - 36:31I think later this quarter we get to some of the other
-
36:31 - 36:35things, but for now just think of this as a mathematically - it's actually not
-
36:35 - 36:38an unreasonable assumption.
-
36:38 - 36:40I guess,
-
36:40 - 36:45in machine learning all the assumptions we make are almost never
-
36:45 - 36:49true in the absolute sense, right? Because, for instance,
-
36:49 - 36:54housing prices are priced in dollars and cents, so the error will be -
-
36:54 - 36:55errors
-
36:55 - 36:59in prices are not continuous-valued random variables, because
-
36:59 - 37:02houses can only be priced at a certain number of dollars and a certain number of
-
37:02 - 37:06cents and you never have fractions of cents in housing prices.
-
37:06 - 37:10Whereas a Gaussian random variable would. So in that sense, assumptions we make are never
-
37:10 - 37:15"absolutely true," but for practical purposes this is an
-
37:15 - 37:19accurate enough assumption that it'll be useful to
-
37:19 - 37:24make. Okay? I think in a week or two, we'll actually come back to
-
37:24 - 37:27say a little more about the assumptions we make and when they help our learning
-
37:27 - 37:29algorithms and when they hurt our learning
-
37:29 - 37:31algorithms. We'll say a bit more about it
-
37:31 - 37:38when we talk about generative and discriminative learning algorithms, like, in a week or
-
37:40 - 37:45two. Okay? So let's point out one bit of notation, which is that when I
-
37:45 - 37:48wrote this down I actually wrote P of YI given XI and then semicolon
-
37:48 - 37:50theta
-
37:50 - 37:55and I'm going to use this notation when we are not thinking of theta as a
-
37:55 - 37:57random variable. So
-
37:57 - 38:01in statistics, though, sometimes it's called the frequentist's point of view,
-
38:01 - 38:02where you think of there as being some,
-
38:02 - 38:06sort of, true value of theta that's out there that's generating the data say,
-
38:06 - 38:08but
-
38:08 - 38:11we don't know what theta is, but theta is not a random
-
38:11 - 38:13variable, right? So it's not like there's some
-
38:13 - 38:16random value of theta out there. It's that theta is -
-
38:16 - 38:19there's some true value of theta out there. It's just that we don't
-
38:19 - 38:22know what the true value of theta is. So
-
38:22 - 38:26if theta is not a random variable, then I'm
-
38:26 - 38:27going to avoid
-
38:27 - 38:31writing P of YI given XI comma theta, because this would mean
-
38:31 - 38:35the probability of YI conditioned on X and theta
-
38:35 - 38:40and you can only condition on random variables.
-
38:40 - 38:43So at this part of the class where we're taking
-
38:43 - 38:46sort of frequentist's viewpoint rather than the Bayesian viewpoint, in this part of the class
-
38:46 - 38:49we're thinking of theta not as a random variable, but just as something
-
38:49 - 38:50we're trying to estimate
-
38:50 - 38:53and use the semicolon
-
38:53 - 38:58notation. So the way to read this is this is the probability of YI given XI
-
38:58 - 39:00and parameterized by theta. Okay? So
-
39:00 - 39:04you read the semicolon as parameterized by.
-
39:04 - 39:08And in the same way here, I'll say YI given XI parameterized by
-
39:08 - 39:09theta is distributed
-
39:09 - 39:16Gaussian with that. All right.
-
39:36 - 39:38So we're gonna make one more assumption.
-
39:38 - 39:41Let's assume that the
-
39:41 - 39:44error terms are
-
39:44 - 39:48
-
39:48 - 39:51IID, okay?
-
39:51 - 39:54Which stands for Independently and Identically Distributed. So I'm
-
39:54 - 39:57going to assume that the error terms are
-
39:57 - 40:04independent of each other,
-
40:04 - 40:11right?
-
40:11 - 40:15The identically distributed part just means that I'm assuming they all come from the
-
40:15 - 40:18same Gaussian distribution with the same variance,
-
40:18 - 40:22but the more important part of this is that I'm assuming that the epsilon I's are
-
40:22 - 40:26independent of each other.
-
40:26 - 40:29Now, let's talk about how to fit a model.
-
40:29 - 40:32The probability of Y given
-
40:32 - 40:36X parameterized by theta - I'm actually going to give
-
40:36 - 40:39this another name. I'm going to write this down
-
40:39 - 40:43and we'll call this the likelihood of theta
-
40:43 - 40:46as the probability of Y given X parameterized by theta.
-
40:46 - 40:49And so this is going to be
-
40:49 - 40:50the product
-
40:50 - 40:57over my training set like that.
-
41:00 - 41:04Which is, in turn, going to be a product of
-
41:04 - 41:11those Gaussian densities that I wrote down just now,
-
41:11 - 41:15right?
-
41:15 - 41:20Okay?
-
41:20 - 41:25So in terms of notation, I guess, I define this term here to be the
-
41:25 - 41:26likelihood of theta.
-
41:26 - 41:30And the likelihood of theta is just the probability of the data Y, right? Given X
-
41:30 - 41:33and parameterized by theta.
-
41:33 - 41:37Notice that the terms likelihood and probability are often confused.
-
41:37 - 41:41So the likelihood of theta is the same thing as the
-
41:41 - 41:46probability of the data you saw. So likelihood and probability are, sort of, the same thing.
-
41:46 - 41:48Except that when I use the term likelihood
-
41:48 - 41:52I'm trying to emphasize that I'm taking this thing
-
41:52 - 41:55and viewing it as a function of theta.
-
41:55 - 41:57Okay?
-
41:57 - 42:01So likelihood and probability are really the same thing except that
-
42:01 - 42:02when I want to view this thing
-
42:02 - 42:06as a function of theta holding X and Y fixed, it's
-
42:06 - 42:10then called the likelihood. Okay? So
-
42:10 - 42:13hopefully you'll hear me say the likelihood of the parameters and the probability
-
42:13 - 42:15of the data,
-
42:15 - 42:18right? Rather than the likelihood of the data or probability of parameters. So try
-
42:18 - 42:25to be consistent in that terminology.
-
42:31 - 42:32So given that
-
42:32 - 42:34the probability of the data is this and this
-
42:34 - 42:37is also the likelihood of the parameters,
-
42:37 - 42:38how do you estimate
-
42:38 - 42:40the parameters theta? So given a training set,
-
42:40 - 42:46what parameters theta do you want to choose for your model?
-
42:46 - 42:53
-
42:59 - 43:02Well, the principle of maximum likelihood
-
43:02 - 43:03estimation
-
43:03 - 43:09says that,
-
43:09 - 43:13right? You can choose the value of theta that makes the data
-
43:13 - 43:20as probable as possible, right? So choose theta
-
43:20 - 43:27to maximize the likelihood. Or
-
43:27 - 43:30in other words choose the parameters that make
-
43:30 - 43:33the data as probable as possible, right? So this is
-
43:33 - 43:37maximum likelihood estimation. So it's choose the parameters that make
-
43:37 - 43:40it as likely - as probable as possible
-
43:40 - 43:43for me to have seen the data I just
-
43:43 - 43:47did. So
-
43:47 - 43:53for mathematical convenience, let me define lower case l of theta.
-
43:53 - 43:58This is called the log likelihood function and it's just log
-
43:58 - 44:01of capital L of theta.
-
44:01 - 44:06So this is log of a product over I
-
44:06 - 44:10of one over root two pi
-
44:10 - 44:14sigma, E to that. I won't bother to write out what's in the exponent for now. It's just saying this
-
44:14 - 44:17from the previous board.
-
44:17 - 44:24Log and a product is the same as the sum of over logs, right? So it's a sum
-
44:25 - 44:32of
-
44:35 - 44:38the logs of - which simplifies to m times log of
-
44:38 - 44:39one over root
-
44:39 - 44:44two pi
-
44:44 - 44:44sigma
-
44:44 - 44:47plus
-
44:47 - 44:52and then log and exponentiation cancel each other, right? So log of E to
-
44:52 - 44:53something is just
-
44:53 - 45:00whatever's inside the exponent. So, you know what,
-
45:01 - 45:08let me write this on the
-
45:12 - 45:16next
-
45:16 - 45:21board.
-
45:21 - 45:28Okay.
-
45:33 - 45:40
-
45:46 - 45:51So
-
45:51 - 45:53maximizing the likelihood or maximizing the log
-
45:53 - 45:58likelihood is the same
-
45:58 - 46:03as minimizing
-
46:03 - 46:10that term over there. Well, you get it, right?
-
46:22 - 46:26Because there's a minus sign. So maximizing this because of the minus sign is the same as
-
46:26 - 46:27minimizing
-
46:27 - 46:33this as a function of theta. And
-
46:33 - 46:36this is, of course, just
-
46:36 - 46:43the same quadratic cost function that we had last time, J of theta,
-
46:43 - 46:44right? So what
-
46:44 - 46:48we've just shown is that the ordinary least squares algorithm,
-
46:48 - 46:51that we worked out in the previous lecture,
-
46:51 - 46:55is just maximum likelihood
-
46:55 - 46:56assuming
-
46:56 - 46:58this probabilistic model,
-
46:58 - 47:05assuming IID Gaussian errors on our data.
-
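The equivalence just shown - least squares equals maximum likelihood under IID Gaussian errors - is easy to verify numerically: under the model y = theta * x + Gaussian noise, the theta that maximizes the log likelihood coincides with the least squares solution. A sketch under those assumptions (the data and helper names here are illustrative, not from the lecture):

```python
import math
import random

random.seed(0)
true_theta = 2.0
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [true_theta * x + random.gauss(0, 1.0) for x in xs]

# Least squares for a single feature with no intercept: theta = sum(x*y) / sum(x*x).
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def log_likelihood(theta, sigma=1.0):
    """Gaussian log likelihood of the data under y = theta * x + epsilon."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (y - theta * x) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))

# Maximize the log likelihood by brute force over a fine grid of theta values:
grid = [i / 1000.0 for i in range(1500, 2500)]
theta_mle = max(grid, key=log_likelihood)

# The maximum likelihood estimate agrees with least squares (up to the grid step),
# and the answer does not depend on the value of sigma.
assert abs(theta_mle - theta_ls) < 1e-3
```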
47:06 - 47:10
-
47:10 - 47:11Okay? One thing that we'll
-
47:11 - 47:12actually use in the next lecture is,
-
47:12 - 47:14notice that
-
47:14 - 47:17the value of sigma squared doesn't matter,
-
47:17 - 47:18right? That somehow
-
47:18 - 47:21no matter what the value of sigma squared is, I mean, sigma squared has to be a positive number. It's a
-
47:21 - 47:22variance
-
47:22 - 47:26of a Gaussian. So that no matter what sigma
-
47:26 - 47:30squared is since it's a positive number the value of theta we end up with
-
47:30 - 47:34will be the same, right? So because
-
47:34 - 47:35minimizing this
-
47:35 - 47:39you get the same value of theta no matter what sigma squared is. So it's as if
-
47:39 - 47:43in this model the value of sigma squared doesn't really matter.
-
47:43 - 47:46Just remember that for the next lecture. We'll come back
-
47:46 - 47:48to this again.
-
47:48 - 47:51Any questions about this?
-
47:51 - 47:53Actually, let me clean up
-
47:53 - 48:00another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah? You are, I think here you try to measure the likelihood of your choice of
-
48:44 - 48:51theta by
-
48:51 - 48:52a fraction
-
48:52 - 48:54of error,
-
48:54 - 48:55but I think it's that you
-
48:55 - 48:57measure
-
48:57 - 49:01because it depends on the family of theta too, for example. If
-
49:01 - 49:05
-
49:05 - 49:09you have a lot of parameters [inaudible] or fitting in? Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think
-
49:09 - 49:13let's - the thing's you're mentioning are
-
49:13 - 49:15maybe deeper questions about
-
49:15 - 49:18learning algorithms that we'll just come back to later, so don't really want to get into
-
49:18 - 49:19that right
-
49:19 - 49:22now. Any more
-
49:22 - 49:29questions? Okay. So
-
49:33 - 49:39this endows linear regression with a probabilistic interpretation.
-
49:39 - 49:43I'm actually going to use this probabil - use this, sort of, probabilistic
-
49:43 - 49:44interpretation
-
49:44 - 49:46in order to derive our next learning algorithm,
-
49:46 - 49:50which will be our first classification algorithm. Okay?
-
49:50 - 49:54So
-
49:54 - 49:58you'll recall that I said that regression problems are where the variable Y
-
49:58 - 50:01that you're trying to predict is continuous values.
-
50:01 - 50:04Now I'm actually gonna talk about our first classification problem,
-
50:04 - 50:08where the value Y you're trying to predict
-
50:08 - 50:11will be discrete-valued. It can take on only a small number of discrete values
-
50:11 - 50:14and in this case I'll talk about binary classification
-
50:14 - 50:16where
-
50:16 - 50:19Y takes on only two values, right? So
-
50:19 - 50:22you come up with classification problems if you're trying to do,
-
50:22 - 50:26say, a medical diagnosis and try to decide based on some features
-
50:26 - 50:30that the patient has a disease or does not have a disease.
-
50:30 - 50:34Or if in the housing example, maybe you're trying to decide will this house sell in the
-
50:34 - 50:38next six months or not and the answer is either yes or no. It'll either be sold in the
-
50:38 - 50:41next six months or it won't be.
-
50:41 - 50:45Other standard examples: if you want to build a spam filter, is this e-mail spam
-
50:45 - 50:51or not? It's yes or no. Or, you know, some of my colleagues work on predicting
-
50:51 - 50:55whether a computer system will crash. So you have a learning algorithm to predict will
-
50:55 - 50:59this computing cluster crash over the next 24 hours? And, again, it's a yes
-
50:59 - 51:06or no answer. So
-
51:06 - 51:08there's X, there's Y.
-
51:08 - 51:14And in a classification problem
-
51:14 - 51:15Y takes on
-
51:15 - 51:19two values, zero and one. That's binary classification.
-
51:19 - 51:22So what can you do? Well, one thing you could do is
-
51:22 - 51:25take linear regression, as we've described it so far, and apply it to this problem,
-
51:25 - 51:26right? So you,
-
51:26 - 51:30you know, given this data set you can fit a straight line to it. Maybe
-
51:30 - 51:32you get that straight line, right?
-
51:32 - 51:32But
-
51:32 - 51:37this data set I've drawn, right? This is an amazingly easy classification problem. It's
-
51:37 - 51:38pretty obvious
-
51:38 - 51:41to all of us that, right? The relationship between X and Y is -
-
51:41 - 51:48well, you just look at a value around here, and to the right Y is one, to
-
51:48 - 51:52the left Y is zero. So you apply linear regression to this data set and you get a reasonable fit and you can then
-
51:52 - 51:54maybe take your linear regression
-
51:54 - 51:56hypothesis to this straight line
-
51:56 - 51:58and threshold it at 0.5.
-
51:58 - 52:02If you do that you'll certainly get the right answer. You predict that
-
52:02 - 52:03
-
52:03 - 52:04if X is to the right of, sort
-
52:04 - 52:06of, the mid-point here
-
52:06 - 52:12then Y is one, and if X is to the left of that mid-point then Y is zero.
-
52:12 - 52:16So some people actually do this. Apply linear regression to classification problems
-
52:16 - 52:18and sometimes it'll
-
52:18 - 52:19work okay,
-
52:19 - 52:22but in general it's actually a pretty bad idea to
-
52:22 - 52:26apply linear regression to
-
52:26 - 52:32classification problems like these and here's why. Let's say I
-
52:32 - 52:34change my training set
-
52:34 - 52:40by giving you just one more training example all the way up there, right?
-
52:40 - 52:43Imagine that given this training set it's actually still entirely obvious what the
-
52:43 - 52:46relationship between X and Y is, right? It's just -
-
52:46 - 52:51if X is greater than this value then Y is one, and if it's less then Y
-
52:51 - 52:52is zero.
-
52:52 - 52:55By giving you this additional training example it really shouldn't
-
52:55 - 52:56change anything. I mean,
-
52:56 - 52:59I didn't really convey much new information. There's no surprise that this
-
52:59 - 53:02corresponds to Y equals one.
-
53:02 - 53:05But if you now fit linear regression to this data
-
53:05 - 53:07set you end up with a line that, I
-
53:07 - 53:10don't know, maybe looks like that, right?
-
53:10 - 53:13And now the predictions of your
-
53:13 - 53:16hypothesis have changed completely if
-
53:16 - 53:23your threshold - your hypothesis at Y equals 0.5. Okay? So - In between there might be an interval where it's zero, right? For that far off point? Oh, you mean, like that?
-
53:27 - 53:30Right.
-
53:30 - 53:31Yeah, yeah, fine. Yeah, sure. A theta
-
53:31 - 53:37set like that so. So, I
-
53:37 - 53:39guess,
-
53:39 - 53:42these just - yes, you're right, but this is an example and this example works. This - [Inaudible] that will
-
53:42 - 53:48change it even more if you gave it
-
53:48 - 53:50all -
-
53:50 - 53:52Yeah. Then I think this actually would make it even worse. You
-
53:52 - 53:54would actually get a line that pulls out even further, right? So
-
53:54 - 53:58this is my example. I get to make it whatever I want, right? But
-
53:58 - 54:01the point of this is that there's not a deep meaning to this. The point of this is
-
54:01 - 54:02just that
-
54:02 - 54:05it could be a really bad idea to apply linear regression to classification
-
54:05 - 54:06
-
54:06 - 54:12problems. Sometimes it works fine, but usually I wouldn't do it.
-
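The failure mode just described is easy to reproduce numerically: fit least squares to labels in {0, 1}, threshold the fitted line at 0.5, then add one far-away positive example and watch the decision boundary move. A small sketch with made-up data (the helper names are mine):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x (closed form, one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def threshold(a, b):
    """The x at which the fitted line crosses 0.5, i.e. the decision boundary."""
    return (0.5 - a) / b

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]                  # obvious boundary between x = 2 and x = 3
t_before = threshold(*fit_line(xs, ys))  # 2.5, by symmetry

# One more positive example far to the right, which adds no real information...
t_after = threshold(*fit_line(xs + [20], ys + [1]))

# ...yet the decision boundary moves substantially.
assert abs(t_after - t_before) > 0.5
```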
54:12 - 54:14So a couple of problems with this. One is that,
-
54:14 - 54:17well - so what do you want to do
-
54:17 - 54:21for classification? If you know the value of Y lies between zero and
-
54:21 - 54:25one then to kind of fix this problem
-
54:25 - 54:27let's just start by
-
54:27 - 54:29changing the form
-
54:29 - 54:34of our hypothesis so that my hypothesis
-
54:34 - 54:39always lies in the unit interval between zero and one. Okay?
-
54:39 - 54:42So if I know Y is either
-
54:42 - 54:44zero or one then
-
54:44 - 54:47let's at least not have my hypothesis predict values much larger than one and much
-
54:47 - 54:51smaller than zero.
-
54:51 - 54:52And so
-
54:52 - 54:56I'm going to - instead of choosing a linear function for my hypothesis I'm going
-
54:56 - 55:00to choose something slightly different. And,
-
55:00 - 55:03in particular, I'm going to choose
-
55:03 - 55:08this function, H subscript theta of X is going to equal to G of
-
55:08 - 55:11theta transpose X
-
55:11 - 55:13where
-
55:13 - 55:14G
-
55:14 - 55:17is going to be this
-
55:17 - 55:18function and so
-
55:18 - 55:21this becomes one over one plus E to the
-
55:21 - 55:23negative theta
-
55:23 - 55:25transpose X.
-
55:25 - 55:28And G of Z is called the sigmoid
-
55:28 - 55:33function and it
-
55:33 - 55:39is often also called the logistic function. It
-
55:39 - 55:41goes by either of these
-
55:41 - 55:48names. And what G of Z looks like is the following. So when you have your
-
55:48 - 55:51horizontal axis I'm going to plot Z
-
55:51 - 55:57and so G of Z
-
55:57 - 55:59will look like this.
-
55:59 - 56:05Okay? I didn't draw that very well. Okay.
-
56:05 - 56:06So G of Z
-
56:06 - 56:08tends towards zero
-
56:08 - 56:10as Z becomes very small
-
56:10 - 56:12and G of Z will ascend
-
56:12 - 56:15towards one as Z becomes large and it crosses the
-
56:15 - 56:17vertical
-
56:17 - 56:20axis at 0.5.
-
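A minimal sketch of the sigmoid and the three properties just described - it tends towards zero for very small Z, towards one for large Z, and crosses the vertical axis at 0.5 (the function name is just the standard one, nothing specific to the lecture):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

# Crosses the vertical axis at 0.5 and tends to 0 / 1 in the limits:
assert sigmoid(0.0) == 0.5
assert sigmoid(-10.0) < 0.001
assert sigmoid(10.0) > 0.999
```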
56:20 - 56:24So this is what the sigmoid function, also called the logistic function, looks like. Yeah? Question? Why use the
-
56:24 - 56:27sigmoid rather than a
-
56:27 - 56:30step function? Say that again. Why can't we choose a step function, like, that's
-
56:30 - 56:35better for binary? Yeah. Let me come back to that later. So it turns out that - where did I get this function from,
-
56:35 - 56:37right? I just
-
56:37 - 56:39wrote down this function. It actually
-
56:39 - 56:43turns out that there are two reasons for using this function that we'll come to.
-
56:43 - 56:44One is -
-
56:44 - 56:47we talked about generalized linear models. We'll see that this falls out naturally
-
56:47 - 56:50as part of the broader class of models.
-
56:50 - 56:51And another reason
-
56:51 - 56:52that we'll talk about
-
56:52 - 56:54next week, it turns out
-
56:54 - 56:55there are a couple of,
-
56:55 - 56:57I think, very beautiful reasons for why
-
56:57 - 56:59we choose logistic
-
56:59 - 57:00functions. We'll see
-
57:00 - 57:02that in a little bit. But for now let me just
-
57:02 - 57:07define it and just take my word for it for now that this is a reasonable choice.
-
57:07 - 57:09Okay? But notice now that
-
57:09 - 57:16my - the values output by my hypothesis will always be between zero
-
57:16 - 57:17and one. Furthermore,
-
57:17 - 57:21just like we did for linear regression, I'm going to endow
-
57:21 - 57:26the outputs and my hypothesis with a probabilistic interpretation, right? So
-
57:26 - 57:31I'm going to assume that the probability that Y is equal to one
-
57:31 - 57:34given X and parameterized by theta
-
57:34 - 57:36that's equal to
-
57:36 - 57:38H subscript theta of X, all right?
-
57:38 - 57:40So in other words
-
57:40 - 57:43I'm going to imagine that my hypothesis is outputting all these
-
57:43 - 57:45numbers that lie between zero and one.
-
57:45 - 57:48I'm going to think of my hypothesis
-
57:48 - 57:55as trying to estimate the probability that Y is equal to one. Okay?
-
57:56 - 58:00And because
-
58:00 - 58:02Y has to be either zero or one
-
58:02 - 58:05then the probability of Y equals zero is going
-
58:05 - 58:12to be that. All right?
-
58:12 - 58:15So more simply, it turns out we can actually take these two equations
-
58:15 - 58:19and write them more compactly.
-
58:19 - 58:22Write P of Y given X
-
58:22 - 58:24parameterized by theta.
-
58:24 - 58:26This is going to be H
-
58:26 - 58:28subscript theta of X to
-
58:28 - 58:32the power of Y times
-
58:32 - 58:33one minus
-
58:33 - 58:35H of X to the power of
-
58:35 - 58:37one minus Y. Okay? So I know this
-
58:37 - 58:43looks somewhat bizarre, but this actually makes the derivation much nicer.
-
58:43 - 58:44So Y is equal to one
-
58:44 - 58:48then this equation is H of X to the power of one
-
58:48 - 58:51times something to the power of zero.
-
58:51 - 58:54So anything to the power of zero is just one,
-
58:54 - 58:56right? So if Y equals one then
-
58:56 - 58:59this is something to the power of zero and so this is just one.
-
58:59 - 59:03So if Y equals one this is just saying P of Y equals one is equal to H subscript
-
59:03 - 59:06theta of X. Okay?
-
59:06 - 59:08And in the same way, if Y is equal
-
59:08 - 59:10to zero then this is P
-
59:10 - 59:14of Y equals zero equals this thing to the power of zero and so this disappears. This is
-
59:14 - 59:16just one
-
59:16 - 59:18times this thing power of one. Okay? So this is
-
59:18 - 59:19a
-
59:19 - 59:20compact way of writing
-
59:20 - 59:23both of these equations to
-
59:23 - 59:30gather them to one line.
-
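The compact form just written, h^y times (1 - h)^(1 - y), can be sanity-checked in a couple of lines; here `h` is a hypothetical output of the hypothesis h theta of X, i.e. the estimated P(Y = 1 | X):

```python
def bernoulli_pmf(y, h):
    """Compact form p(y | x; theta) = h^y * (1 - h)^(1 - y), for y in {0, 1}."""
    return h ** y * (1 - h) ** (1 - y)

h = 0.7  # hypothetical value of h_theta(x)
assert bernoulli_pmf(1, h) == h       # y = 1: the (1 - h) factor is to the power zero
assert bernoulli_pmf(0, h) == 1 - h   # y = 0: the h factor is to the power zero
```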
59:31 - 59:36So let's talk about parameter fitting, right? And, again, you can ask -
-
59:36 - 59:38well, given this model by data, how do I fit
-
59:38 - 59:41the parameters theta of my
-
59:41 - 59:46model? So the likelihood of the parameters is, as before, it's just the probability
-
59:46 - 59:49
-
59:49 - 59:50
-
59:50 - 59:54of theta, right? Which is the product over I of P of YI
-
59:54 - 59:57given XI
-
59:57 - 59:59parameterized by theta.
-
59:59 - 60:02Which is - just plugging those
-
60:02 - 60:09in. Okay? I
-
60:09 - 60:16dropped this theta subscript just so I can write a little bit less. Oh,
-
60:17 - 60:20excuse me. These
-
60:20 - 60:27should be
-
60:29 - 60:36XI's
-
60:36 - 60:43and YI's. Okay?
-
60:51 - 60:53So,
-
60:53 - 60:57as before, let's say we want to find a maximum likelihood estimate of the parameters theta. So
-
60:57 - 60:59we want
-
60:59 - 61:04to find the setting of the parameters theta that maximizes the likelihood L
-
61:04 - 61:08of theta. It
-
61:08 - 61:10turns out
-
61:10 - 61:11that very often
-
61:11 - 61:14- just when you work with the derivations, it turns out that it is often
-
61:14 - 61:18much easier to maximize the log of the likelihood rather than maximize the
-
61:18 - 61:20likelihood.
-
61:20 - 61:20So
-
61:20 - 61:23the log
-
61:23 - 61:25likelihood L of theta is just log of capital L.
-
61:25 - 61:28This will, therefore,
-
61:28 - 61:35be sum of this. Okay?
-
61:58 - 61:59And so
-
61:59 - 62:04to fit the parameters theta of our model we'll
-
62:04 - 62:06find the value of
-
62:06 - 62:12theta that maximizes this log likelihood. Yeah? [Inaudible] Say that again. YI is [inaudible]. Oh, yes.
-
62:12 - 62:19Thanks.
-
62:22 - 62:27So having maximized this function - well, it turns out we can actually apply
-
62:27 - 62:29the same gradient
-
62:29 - 62:33descent algorithm that we learned. That was the first algorithm we used
-
62:33 - 62:35to minimize the quadratic function.
-
62:35 - 62:37And you remember, when we talked about least squares,
-
62:37 - 62:40the first algorithm we used to minimize the quadratic
-
62:40 - 62:41error function
-
62:41 - 62:43was gradient descent.
-
62:43 - 62:45So we can actually use exactly the same algorithm
-
62:45 - 62:48to maximize the log likelihood.
-
62:48 - 62:49
-
62:49 - 62:51And you remember, that algorithm was just
-
62:51 - 62:55repeatedly take the value of theta
-
62:55 - 62:56and you replace it with
-
62:56 - 62:59the previous value of theta plus
-
62:59 - 63:02a learning rate alpha
-
63:02 - 63:04times
-
63:04 - 63:08the gradient of the cost function - the log likelihood with respect to
-
63:08 - 63:10theta. Okay?
-
63:10 - 63:14One small change is that because previously we were trying to minimize
-
63:14 - 63:15
-
63:15 - 63:17the quadratic error term.
-
63:17 - 63:20Today we're trying to maximize rather than minimize. So rather than having a minus
-
63:20 - 63:23sign we have a plus sign. So this is
-
63:23 - 63:24just gradient ascent,
-
63:24 - 63:25but for the
-
63:25 - 63:28maximization rather than the minimization.
-
63:28 - 63:32So we actually call this gradient ascent and it's really the same
-
63:32 - 63:35algorithm.
-
63:35 - 63:37So to figure out
-
63:37 - 63:42what this gradient - so in order to derive gradient descent,
-
63:42 - 63:44what you need to do is
-
63:44 - 63:48compute the partial derivatives of your objective function with respect to
-
63:48 - 63:53each of your parameters theta I, right?
-
63:53 - 63:58It turns out that
-
63:58 - 64:00if you actually
-
64:00 - 64:03compute this partial derivative -
-
64:03 - 64:10so you take this formula, this L of theta, which is - oh, got that wrong too. If
-
64:10 - 64:14you take this lower case l theta, if you take the log likelihood of theta,
-
64:14 - 64:17and if you take it's partial derivative with
-
64:17 - 64:19respect to theta I
-
64:19 - 64:22you find that
-
64:22 - 64:27this is equal to -
-
64:27 - 64:30let's see. Okay? And,
-
64:30 - 64:37I
-
64:46 - 64:47don't
-
64:47 - 64:50know, the derivation isn't terribly complicated, but in
-
64:50 - 64:54the interest of saving you watching me write down a couple of
-
64:54 - 64:56blackboards full of math I'll just write
-
64:56 - 64:57down the final answer. But
-
64:57 - 64:59the way you get this is you
-
64:59 - 65:00just take those, plug
-
65:00 - 65:04in the definition for H subscript theta as a function of XI, and take derivatives,
-
65:04 - 65:06and work through the algebra
-
65:06 - 65:08it turns out it'll simplify
-
65:08 - 65:13down to this formula. Okay?
-
65:13 - 65:15And so
-
65:15 - 65:19what that gives you is that gradient ascent
-
65:19 - 65:22is the following
-
65:22 - 65:24rule. Theta J gets updated as theta
-
65:24 - 65:26J
-
65:26 - 65:28plus alpha
-
65:28 - 65:35gives this. Okay?
-
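The update rule just written - theta J gets theta J plus alpha times (YI minus h theta of XI) times XIJ - can be sketched as follows. This is a per-example (stochastic) version on toy data; all names and data here are illustrative, not from the lecture:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(xs, ys, alpha=0.1, iters=2000):
    """Maximize the log likelihood of logistic regression by repeatedly applying
    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij for each example."""
    n = len(xs[0])
    theta = [0.0] * n
    for _ in range(iters):
        for x, y in zip(xs, ys):
            h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
            for j in range(n):
                theta[j] += alpha * (y - h) * x[j]
    return theta

# Toy 1-D data with an intercept feature x_0 = 1; labels flip from 0 to 1 around x = 2.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]
theta = gradient_ascent(xs, ys)
predict = lambda x: sigmoid(theta[0] + theta[1] * x)
assert predict(0.0) < 0.5 and predict(4.0) > 0.5
```

Note that this is ascent, with a plus sign, because we are maximizing the log likelihood rather than minimizing a cost.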
65:47 - 65:50Does this look familiar to anyone? Did you
-
65:50 - 65:56remember seeing this formula at the last lecture? Right.
-
65:56 - 65:59So when I worked out batch gradient descent
-
65:59 - 66:02for least squares regression I,
-
66:02 - 66:06actually, wrote down
-
66:06 - 66:08exactly the same thing, or maybe
-
66:08 - 66:11there's a minus sign somewhere. But I, actually, had
-
66:11 - 66:14exactly the same learning rule last time
-
66:14 - 66:20for least squares regression,
-
66:20 - 66:24right? Is this the same learning algorithm then? So what's different? How come I was making
-
66:24 - 66:25all that noise earlier about
-
66:25 - 66:30least squares regression being a bad idea for classification problems and then I did
-
66:30 - 66:34a bunch of math and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Inaudible] constants?
-
66:34 - 66:41Say that again. [Inaudible] Oh,
-
66:44 - 66:47right. Okay, cool. It's the - no, exactly. Right. So these look the same, but
-
66:47 - 66:48this is not the same, right? And the
-
66:48 - 66:49reason is,
-
66:49 - 66:52in logistic regression
-
66:52 - 66:54this is different from before, right?
-
66:54 - 66:56The definition
-
66:56 - 66:58of this H subscript theta of XI
-
66:58 - 66:59is not
-
66:59 - 67:03the same as the definition I was using in the previous lecture.
-
67:03 - 67:07And in particular this is no longer theta transpose XI. This is not
-
67:07 - 67:10a linear function anymore.
-
67:10 - 67:12This is a logistic function of theta
-
67:12 - 67:14transpose XI. Okay?
-
67:14 - 67:18So even though this looks cosmetically similar,
-
67:18 - 67:20even though this is similar on the surface,
-
67:20 - 67:24to the gradient descent rule I derived last time for
-
67:24 - 67:25least squares regression
-
67:25 - 67:29this is actually a totally different learning algorithm. Okay?
-
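As an aside, here is a minimal sketch of the update just described: one stochastic gradient ascent step for logistic regression, where the hypothesis is the sigmoid of theta transpose X. The function and variable names here are illustrative, not from the lecture.

```python
import math

def sigmoid(z):
    # the logistic function g(z) = 1 / (1 + e^(-z)), with values in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    # hypothesis h_theta(x) = g(theta^T x) -- the piece that differs from
    # least squares regression, where h_theta(x) was just theta^T x
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def ascent_step(theta, x, y, alpha):
    # one stochastic gradient ascent step on the log likelihood:
    # theta_j := theta_j + alpha * (y - h_theta(x)) * x_j
    err = y - h(theta, x)
    return [t + alpha * err * xj for t, xj in zip(theta, x)]
```

Note that the update line is textually identical to the least squares rule; only the definition of `h` changes.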
67:29 - 67:32And it turns out that there's actually no coincidence that you ended up with the
-
67:32 - 67:35same learning rule. We'll actually
-
67:35 - 67:35
-
67:35 - 67:40talk a bit more about this later when we talk about generalized linear models.
-
67:40 - 67:43But this is one of the most elegant aspects of the generalized linear models
-
67:43 - 67:45that we'll see later. That
-
67:45 - 67:48even though we're using a different model, you actually ended up with
-
67:48 - 67:51what looks like the same learning algorithm and it's actually no
-
67:51 - 67:56coincidence. Cool.
-
67:56 - 67:59One last comment as
-
67:59 - 68:01part of a sort of learning process,
-
68:01 - 68:02over here
-
68:02 - 68:05I said I take the derivatives and I
-
68:05 - 68:07ended up with this line.
-
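For reference, a sketch of that result, consistent with the notation above (the full derivation is in the lecture notes): with the log likelihood

```latex
\ell(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
             + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right)
```

and using the sigmoid identity \( g'(z) = g(z)\bigl(1 - g(z)\bigr) \), the partial derivatives simplify to

```latex
\frac{\partial}{\partial \theta_j}\,\ell(\theta)
  = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
```

which is exactly the quantity that appears in the gradient ascent update.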
68:07 - 68:09I didn't want to
-
68:09 - 68:13make you sit through a long algebraic derivation, but
-
68:13 - 68:15later today or later this week,
-
68:15 - 68:19please, do go home and look at our lecture notes, where I wrote out
-
68:19 - 68:21the entirety of this derivation in full,
-
68:21 - 68:23and make sure you can follow every single step of
-
68:23 - 68:27how we take partial derivatives of this log likelihood
-
68:27 - 68:32to get this formula over here. Okay? By the way, for those who are
-
interested in seriously mastering the machine learning material,
-
68:36 - 68:40when you go home and look at the lecture notes it will actually be very easy for most
-
68:40 - 68:42of you to look through
-
68:42 - 68:45the lecture notes and read through every line and go yep, that makes sense, that makes sense, that makes sense,
-
68:45 - 68:45and,
-
68:45 - 68:50sort of, say cool. I see how you get this line.
-
68:50 - 68:53You want to make sure you really understand the material. My concrete
-
suggestion to you would be to go home,
-
68:55 - 68:58read through the lecture notes, check every line,
-
and then to cover up the derivation and see if you can rederive it yourself, right? So
-
69:02 - 69:07in general, that's usually good advice for studying technical
-
69:07 - 69:09material like machine learning. Which is if you work through a proof
-
69:09 - 69:11and you think you understood every line,
-
69:11 - 69:14the way to make sure you really understood it is to cover it up and see
-
if you can rederive the entire thing yourself. This is actually a great way because I
-
69:17 - 69:20did this a lot when I was trying to study
-
69:20 - 69:22various pieces of machine learning
-
theory and various proofs. And this is actually a great way to study: cover up
-
69:26 - 69:28the derivations and see if you can do it yourself
-
69:28 - 69:33without looking at the original derivation. All right.
-
69:33 - 69:37
-
69:37 - 69:40I probably won't get to Newton's Method today. I just
-
69:40 - 69:47want to say
-
69:55 - 69:58- take one quick digression to talk about
-
69:58 - 69:59one more algorithm,
-
69:59 - 70:01which the discussion was sort
-
70:01 - 70:08of alluding to earlier,
-
70:09 - 70:12which is the perceptron
-
70:12 - 70:13algorithm, right? So
-
70:13 - 70:14I'm
-
70:14 - 70:17not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come
-
70:17 - 70:20back to later. Later this quarter
-
70:20 - 70:24we'll talk about learning theory.
-
70:24 - 70:27So in logistic regression we said that G of Z is, sort
-
70:27 - 70:28of,
-
70:28 - 70:30my hypothesis, outputting values
-
70:30 - 70:33that are real numbers between zero and one.
-
70:33 - 70:37The question is what if you want to force G of Z to output
-
70:37 - 70:39a value that is
-
70:39 - 70:39either
-
70:39 - 70:41zero or one?
-
70:41 - 70:44So the
-
70:44 - 70:46perceptron algorithm defines G of Z
-
70:46 - 70:48to be this.
-
70:48 - 70:53
-
70:53 - 70:54So the picture is - or
-
70:54 - 71:01the cartoon is, rather than this sigmoid function, G of
-
71:02 - 71:08Z now looks like this step function that you were asking about earlier.
-
71:08 - 71:14As before, we can use H subscript theta of X equals G of theta transpose X. Okay? So
-
71:14 - 71:14this
-
71:14 - 71:18is actually - everything is exactly the same as before,
-
71:18 - 71:20except that G of Z is now the step function.
-
71:20 - 71:22It
-
71:22 - 71:25turns out there's this learning rule called the perceptron learning rule that's actually
-
71:25 - 71:28written the same as the classic gradient ascent
-
71:28 - 71:31for logistic regression.
-
71:31 - 71:33And the learning rule is
-
71:33 - 71:40given by this. Okay?
-
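A minimal sketch of the perceptron version of the rule, assuming the common convention that the step function outputs one for Z greater than or equal to zero (names here are illustrative):

```python
def g_step(z):
    # threshold function: outputs exactly 0 or 1, unlike the sigmoid,
    # which outputs values strictly between 0 and 1
    return 1.0 if z >= 0 else 0.0

def predict(theta, x):
    # hypothesis h_theta(x) = g(theta^T x) with the step function as g
    return g_step(sum(t * xi for t, xi in zip(theta, x)))

def perceptron_step(theta, x, y, alpha):
    # theta_j := theta_j + alpha * (y - h_theta(x)) * x_j
    # identical in form to the logistic rule, but h is the step function,
    # so a correctly classified example leaves theta unchanged
    err = y - predict(theta, x)
    return [t + alpha * err * xj for t, xj in zip(theta, x)]
```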
71:44 - 71:50So it looks just like the
-
71:50 - 71:51classic gradient ascent rule
-
71:51 - 71:54
-
71:54 - 71:56for logistic regression.
-
71:56 - 72:01So this is very different flavor of algorithm than least squares regression and logistic
-
72:01 - 72:02regression,
-
72:02 - 72:06and, in particular, because it only outputs values that are either zero or one it
-
72:06 - 72:09turns out it's very difficult to endow this algorithm with
-
72:09 - 72:12probabilistic semantics. And this
-
72:12 - 72:19is, again, even though - oh, excuse me. Right there. Okay.
-
72:20 - 72:24And even though this learning rule looks, again, looks cosmetically very similar to
-
72:24 - 72:26what we have in logistic regression this is actually
-
72:26 - 72:28a very different type of learning rule
-
72:28 - 72:31than the others that we've seen
-
72:31 - 72:34in this class. So
-
72:34 - 72:37because this is such a simple learning algorithm, right? It just
-
72:37 - 72:41computes theta transpose X and then you threshold and then your output is zero or one.
-
72:41 - 72:43This is -
-
72:43 - 72:46right. So this is a simpler algorithm than logistic regression, I think.
-
72:46 - 72:49When we talk about learning theory later in this class,
-
72:49 - 72:55the simplicity of this algorithm will let us come back and use it as a building block. Okay?
-
72:55 - 72:57But that's all I want to say about this algorithm for now.
- Title:
- Lecture 3 | Machine Learning (Stanford)
- Description:
-
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng delves into locally weighted regression, the probabilistic interpretation of linear regression, and logistic regression, and how they relate to machine learning.
This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599
CS 229 Course Website:
http://www.stanford.edu/class/cs229/
Stanford University:
http://www.stanford.edu/
Stanford University Channel on YouTube:
http://www.youtube.com/stanford - Video Language:
- English
- Duration:
- 01:13:14