
Lecture 3 | Machine Learning (Stanford)

  • 0:00 - 0:09
    (music)
  • 0:10 - 0:13
    This presentation is delivered by
    the Stanford Center for
    the Stanford Center for
  • 0:13 - 0:15
    Professional Development.
  • 0:23 - 0:27
    Okay. Good morning and
    welcome back to the third
  • 0:27 - 0:31
    lecture of this class. So here's
  • 0:31 - 0:33
    what I want to do today,
  • 0:33 - 0:34
    and some of
  • 0:34 - 0:38
    the topics I do today may seem a little bit like I'm jumping, sort of,
  • 0:38 - 0:44
    from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture, we
  • 0:44 - 0:45
    talked about
  • 0:45 - 0:48
    linear regression and today I want to talk about
  • 0:48 - 0:52
    sort of an adaptation of that called locally weighted regression. It's a very powerful
  • 0:52 - 0:57
    algorithm that's actually probably my former mentor's favorite machine
  • 0:57 - 0:59
    learning algorithm. We'll then
  • 0:59 - 1:03
    talk about a probabilistic interpretation of linear regression
  • 1:03 - 1:08
    and use that to move onto our first classification algorithm, which
  • 1:08 - 1:09
    is logistic regression; take
  • 1:09 - 1:13
    a brief digression to tell you about something called the perceptron algorithm,
  • 1:13 - 1:16
    which is something we'll come back to, again, later this quarter;
  • 1:16 - 1:17
    and
  • 1:17 - 1:21
    time allowing I hope to get to Newton's method, which is an algorithm for
  • 1:21 - 1:23
    fitting logistic regression models.
  • 1:24 - 1:31
    So, let's just recap what we were talking about in the previous lecture,
  • 1:31 - 1:34
    remember the notation that I defined was that
  • 1:34 - 1:35
    I used this
  • 1:35 - 1:38
    X superscript i,
  • 1:38 - 1:41
    Y superscript i to denote the ith training example.
  • 1:47 - 1:48
    And
  • 1:49 - 1:52
    when we're talking about linear regression
  • 1:53 - 1:54
    or ordinary least squares,
  • 1:54 - 1:56
    we use this to denote
  • 1:56 - 1:58
    the predicted value
  • 1:58 - 2:01
    output by my hypothesis H on
  • 2:01 - 2:02
    the input X^i.
  • 2:02 - 2:04
    And my hypothesis
  • 2:04 - 2:06
    was parameterized by the vector
  • 2:06 - 2:08
    parameters theta
  • 2:08 - 2:14
    and so we said that this was equal to the sum over j of theta j
  • 2:14 - 2:20
    times X subscript j superscript i,
  • 2:20 - 2:22
    written more simply as theta transpose X.
  • 2:22 - 2:25
    And we had the convention that X
  • 2:25 - 2:29
    subscript zero is equal to one so
    this accounts for the intercept term in our
  • 2:29 - 2:31
    linear regression model.
  • 2:31 - 2:33
    And lowercase n here
  • 2:33 - 2:37
    was the notation I was using for the
  • 2:37 - 2:40
    number of features in my training set. Okay? So in the
  • 2:40 - 2:44
    example trying to predict housing prices, we had two features, the size of
  • 2:44 - 2:46
    the house and the number of bedrooms.
  • 2:46 - 2:51
    We had two features and so little n was equal to two.
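    As a minimal sketch of this notation (the numbers below are made up purely for illustration), the hypothesis is just an inner product once we prepend X0 = 1:

        import numpy as np

        # Made-up parameter vector [theta0, theta1, theta2]:
        # intercept, weight on size, weight on number of bedrooms.
        theta = np.array([50.0, 0.1, 20.0])

        # One training example with the convention x0 = 1: (intercept, size, bedrooms).
        # Here n = 2 features, not counting the intercept term.
        x = np.array([1.0, 2104.0, 3.0])

        # h_theta(x) = sum over j of theta_j * x_j = theta transpose x.
        print(theta @ x)
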
  • 2:51 - 2:52
  • 2:52 - 2:55
    So just to
  • 2:55 - 2:57
    finish recapping the previous lecture, we
  • 2:57 - 3:01
    defined this quadratic cost function J of theta equals one-half,
  • 3:01 - 3:10
    sum over i equals one to m, of H theta of X i minus Y i,
  • 3:10 - 3:12
    squared,
  • 3:12 - 3:17
    where this is the sum over the m training examples in my training set. So lowercase
  • 3:17 - 3:18
    m
  • 3:18 - 3:21
    was the notation I've been using to denote the number of training examples I have and the
  • 3:21 - 3:23
    size of my training set.
  • 3:23 - 3:25
    And at the end of the last lecture,
  • 3:25 - 3:27
    we derived
  • 3:27 - 3:31
    the value of theta that minimizes this in closed form, which was X
  • 3:31 - 3:32
    transpose X
  • 3:32 - 3:35
    inverse X
  • 3:35 - 3:39
    transpose Y. Okay?
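    As a quick sketch of that closed-form solution (the data here is made up, and np.linalg.solve is used in place of an explicit matrix inverse, which is equivalent but numerically preferable):

        import numpy as np

        # Made-up training set: m = 5 examples, one feature (living area).
        area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
        price = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

        # Design matrix X with the intercept column x0 = 1.
        X = np.column_stack([np.ones_like(area), area])

        # Normal equations: theta = (X^T X)^{-1} X^T y.
        theta = np.linalg.solve(X.T @ X, X.T @ price)

        # Prediction for a new house: h(x) = theta^T x.
        print(theta, theta @ np.array([1.0, 2000.0]))
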
  • 3:39 - 3:42
    So
  • 3:42 - 3:46
    as we move on in today's lecture, I'll continue to use this notation and, again,
  • 3:46 - 3:51
    I realize this is a fair amount of notation to remember,
  • 3:51 - 3:55
    so if partway through this lecture you forgot - if you're having trouble remembering
  • 3:55 - 4:02
    what lowercase m is or what lowercase n is or something please raise your hand and ask. When
  • 4:05 - 4:08
    we talked about linear regression last time
  • 4:08 - 4:10
    we used two features. One of the features was
  • 4:10 - 4:14
    the size of the houses in square feet, so the living area of the house,
  • 4:14 - 4:18
    and the other feature was the number of bedrooms in the house.
  • 4:18 - 4:22
    In general, when you apply a machine-learning algorithm to some problem that you care
  • 4:22 - 4:23
    about,
  • 4:23 - 4:28
    the choice of the features will very much be up to you, right? And
  • 4:28 - 4:32
    the way you choose your features to give the learning algorithm will often have a
  • 4:32 - 4:34
    large impact on how it actually does.
  • 4:34 - 4:40
    So just for example,
  • 4:40 - 4:44
    the choice we made last time was X1 equals the size, and let's leave aside
  • 4:44 - 4:47
    the feature of the number of bedrooms for now, let's say we don't have data
  • 4:47 - 4:51
    that tells us how many bedrooms are in these houses.
  • 4:51 - 4:54
    One thing you could do is actually define - oh, let's
  • 4:54 - 4:56
    draw this out.
  • 4:56 - 5:03
    And so, right? So say that
  • 5:05 - 5:08
    was the size of the house and that's the price of the house. So
  • 5:08 - 5:10
    if you use
  • 5:10 - 5:15
    this as a feature maybe you get theta zero plus theta 1,
  • 5:15 - 5:20
    X1, this, sort of, linear model.
  • 5:20 - 5:22
    If you choose - let me just copy the
  • 5:22 - 5:27
    same data set over, right?
  • 5:27 - 5:30
    You can define the set of features where X1 is equal to the size of the house
  • 5:30 - 5:35
    and X2 is the
  • 5:35 - 5:36
    square
  • 5:36 - 5:37
    of the size
  • 5:37 - 5:38
    of the house. Okay?
  • 5:38 - 5:43
    So X1 is the size of the house in say square footage and X2 is
  • 5:43 - 5:46
    just take whatever the square footage of the house is and just
  • 5:46 - 5:49
    square that number, and this would be another way to come up with a feature,
  • 5:49 - 5:51
    and if you do that then
  • 5:51 - 5:56
    the same algorithm will end up fitting a
  • 5:56 - 5:59
    quadratic function for you.
  • 5:59 - 6:01
    Theta two, X1 squared.
  • 6:01 - 6:06
    Okay? Because this X1 squared
  • 6:06 - 6:10
    is actually X2. And
  • 6:10 - 6:13
    depending on what the data looks like, maybe this is a slightly
  • 6:13 - 6:17
    better fit to the data. You can actually
  • 6:17 - 6:24
    take this even further, right?
  • 6:25 - 6:27
    Which is - let's see.
  • 6:27 - 6:30
    I have seven training examples here, so you can actually
  • 6:30 - 6:34
    maybe fit up to a sixth-order polynomial. You can actually fit a model
  • 6:34 - 6:36
    theta zero plus
  • 6:36 - 6:38
    theta one, X1 plus theta two,
  • 6:38 - 6:42
    X squared plus up to
  • 6:42 - 6:49
    theta six, X to the
  • 6:49 - 6:53
    power of six, and fit a sixth-order polynomial
  • 6:53 - 6:56
    to these seven data points.
  • 6:56 - 6:58
    And if you do that you find that
  • 6:58 - 7:02
    you come up with a model that fits your data exactly. This is where, I guess,
  • 7:02 - 7:06
    in this example I drew, we have seven data points, so if you fit a
  • 7:06 - 7:08
    sixth-order polynomial you can, sort of, fit a line
  • 7:08 - 7:12
    that passes through these seven points perfectly.
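    A small sketch of this effect, with seven made-up points (the numbers are illustrative only): a sixth-order polynomial drives the training error to essentially zero, which is exactly the fit-through-every-point picture described here.

        import numpy as np

        # Seven made-up (size, price) training points.
        x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
        y = np.array([2.0, 2.6, 3.1, 3.2, 3.8, 4.7, 5.0])

        for degree in (1, 2, 6):
            # polyfit does least squares on the features x, x^2, ..., x^degree.
            coeffs = np.polyfit(x, y, degree)
            residuals = y - np.polyval(coeffs, x)
            # With degree 6 and seven points, the sum of squared residuals
            # is ~0: the curve passes through every training point.
            print(degree, np.sum(residuals ** 2))
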
  • 7:12 - 7:14
    And you probably find that the curve
  • 7:14 - 7:17
    you get will look something
  • 7:17 - 7:20
    like that.
  • 7:20 - 7:23
    And on the one hand, this is a great model in a sense that it
  • 7:23 - 7:25
    fits your training data perfectly.
  • 7:25 - 7:27
    On the other hand, this is probably not a
  • 7:27 - 7:29
    very good model in the sense that
  • 7:29 - 7:32
    none of us seriously think that this is a very good predictor of housing
  • 7:32 - 7:36
    prices as a function of the size of the house, right? So
  • 7:36 - 7:39
    we'll actually come back to this later. It turns
  • 7:39 - 7:42
    out, of the models we have here,
  • 7:42 - 7:45
    I feel like maybe the quadratic model fits the data best.
  • 7:45 - 7:46
    Whereas
  • 7:46 - 7:48
  • 7:48 - 7:52
    for the linear model, it looks like there's actually a bit of a quadratic component in this
  • 7:52 - 7:53
    data
  • 7:53 - 7:57
    that the linear function is not capturing.
  • 7:57 - 8:00
    So we'll actually come back to this a little bit later and talk about the problems
  • 8:00 - 8:04
    associated with fitting models that are either too simple, use too small a set of
  • 8:04 - 8:05
    features, or
  • 8:05 - 8:08
    or models that are too complex and maybe
  • 8:08 - 8:11
    use too large a set of features.
  • 8:11 - 8:12
    Just to give these a
  • 8:12 - 8:13
    name,
  • 8:13 - 8:14
    we call this
  • 8:14 - 8:20
    the problem of underfitting
  • 8:20 - 8:23
    and, very informally, this refers to a setting where
  • 8:23 - 8:27
    there are obvious patterns that - where there are patterns in the data that the
  • 8:27 - 8:29
    algorithm is just failing to fit.
  • 8:29 - 8:33
    And this problem here we refer to as
  • 8:33 - 8:35
    overfitting
  • 8:35 - 8:36
    and, again, very informally,
  • 8:36 - 8:41
    this is when the algorithm is fitting the idiosyncrasies of this specific data set,
  • 8:41 - 8:43
    right? It just so happens that
  • 8:43 - 8:48
    of the seven houses we sampled in Portland, or wherever you collect data from,
  • 8:48 - 8:52
    that house happens to be a bit more expensive, that house happened to be a bit less
  • 8:52 - 8:54
    expensive, and by
  • 8:54 - 8:57
    fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties
  • 8:57 - 8:59
    of this data set,
  • 8:59 - 9:01
    rather than the true underlying trends
  • 9:01 - 9:05
    of how housing prices vary as a function of the size of the house. Okay?
  • 9:05 - 9:08
    So these are two very different problems. We'll define them more formally later
  • 9:08 - 9:12
    and talk about how to address each of these problems,
  • 9:12 - 9:14
    but for now I
  • 9:14 - 9:21
    hope you appreciate that there is this issue of selecting features. So if
  • 9:22 - 9:24
    you want to address
  • 9:24 - 9:26
    this problem of choosing features there are a few ways to do
  • 9:26 - 9:27
    so.
  • 9:27 - 9:29
    We'll talk about
  • 9:29 - 9:33
    feature selection algorithms later this quarter as well. So automatic algorithms
  • 9:33 - 9:34
    for choosing
  • 9:34 - 9:36
    what features you use in a
  • 9:36 - 9:38
    regression problem like this.
  • 9:38 - 9:42
    What I want to do today is talk about a class of algorithms
  • 9:42 - 9:45
    called non-parametric learning algorithms that will help
  • 9:45 - 9:50
    to alleviate the need somewhat for you to choose features very carefully. Okay?
  • 9:50 - 9:57
    And this leads us into our discussion of locally weighted regression.
  • 9:57 - 10:04
    And just to define the term,
  • 10:12 - 10:17
    linear regression, as we've defined it so far, is an example of a parametric learning
  • 10:17 - 10:18
  • 10:18 - 10:19
    algorithm. A parametric
  • 10:19 - 10:22
    learning algorithm is one that's defined as
  • 10:22 - 10:25
    an algorithm that has a fixed number of parameters
  • 10:25 - 10:27
    that are fit to the data. Okay? So
  • 10:27 - 10:29
    in linear regression
  • 10:29 - 10:33
    we have a fixed set of parameters theta, right? That must
  • 10:33 - 10:39
    be fit to the data.
  • 10:39 - 10:46
    In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The
  • 10:59 - 11:03
    formal definition is not very intuitive, so I'll also give you a
  • 11:03 - 11:04
    second, say, more
  • 11:04 - 11:06
    intuitive one.
  • 11:06 - 11:10
    The, sort of, formal definition of the non-parametric learning algorithm is that it's an algorithm
  • 11:10 - 11:17
    where the number of parameters
  • 11:18 - 11:22
    grows
  • 11:22 - 11:25
    with M, with the size of the training set. And usually it's
  • 11:25 - 11:31
    defined as the number of parameters growing linearly with the size of the training set.
  • 11:31 - 11:33
    This is the formal definition.
  • 11:33 - 11:33
    A
  • 11:33 - 11:36
    slightly less formal definition is that
  • 11:36 - 11:38
    the amount of stuff that your learning algorithm needs
  • 11:38 - 11:41
    to keep around
  • 11:41 - 11:45
    will grow linearly with the training set. Or another way of saying it is that this is an
  • 11:45 - 11:46
    algorithm that
  • 11:46 - 11:52
    will need to keep around the entire training set, even after learning. Okay? So
  • 11:52 - 11:54
    don't worry too much about this definition. But
  • 11:54 - 11:56
    what I want to do now is
  • 11:56 - 11:59
    describe a specific non-parametric learning algorithm
  • 11:59 - 12:06
    called locally weighted regression.
  • 12:10 - 12:17
    Which also goes by a couple of other names -
  • 12:17 - 12:21
    which also goes by the name of Loess for, sort of, historical reasons. Loess
  • 12:21 - 12:23
    is usually spelled L-O-E-S-S,
  • 12:23 - 12:24
    sometimes spelled like that,
  • 12:24 - 12:27
    too. I just call it locally weighted regression.
  • 12:27 - 12:34
    So here's
  • 12:35 - 12:38
    the idea. This will be an algorithm that allows us
  • 12:38 - 12:43
    to worry a little bit less about having to choose features very carefully.
  • 12:43 - 12:48
    So
  • 12:48 - 12:55
    for my motivating example, let's say that I
  • 12:55 - 12:59
    have a
  • 12:59 - 13:01
    training set that looks like this, okay?
  • 13:01 - 13:04
    So this is X and that's Y.
  • 13:04 - 13:07
    If you run
  • 13:07 - 13:11
    linear regression on this and you fit maybe a linear function to this
  • 13:11 - 13:12
    and you end up with a
  • 13:12 - 13:13
    more or less
  • 13:13 - 13:17
    flat, straight line, which is not a very good fit to this data. You
  • 13:17 - 13:20
    can sit around and stare at this and try to decide whether the features you used are right.
  • 13:20 - 13:23
    So maybe you want to toss in a quadratic function,
  • 13:23 - 13:26
    but this isn't really quadratic either. So maybe you want to
  • 13:26 - 13:28
    model this as X
  • 13:28 - 13:31
    plus X squared plus maybe some function of sin of X or something.
  • 13:31 - 13:34
    You actually sit around and fiddle with features.
  • 13:34 - 13:37
    And after a while you can probably come up with a set of features for which the model is
  • 13:37 - 13:40
    okay, but let's talk about an algorithm that
  • 13:40 - 13:47
    you can use without needing to do that.
  • 13:50 - 13:53
    So
  • 13:53 - 13:54
    if - well,
  • 13:54 - 13:56
    suppose you want to evaluate
  • 13:56 - 13:59
    your hypothesis H
  • 13:59 - 14:04
    at a certain point
  • 14:04 - 14:06
    with a certain query point, lowercase X. Okay? And
  • 14:06 - 14:07
    let's say
  • 14:07 - 14:11
    you want to know what's the predicted value of
  • 14:11 - 14:12
    Y
  • 14:12 - 14:17
    at this position of X, right? So
  • 14:17 - 14:19
    for linear regression,
  • 14:19 - 14:22
    what we were doing was we would fit
  • 14:22 - 14:25
    theta
  • 14:25 - 14:28
    to minimize
  • 14:28 - 14:30
    sum
  • 14:30 - 14:35
    over I of YI minus theta transpose XI,
  • 14:35 - 14:39
    squared,
  • 14:39 - 14:41
    and return theta
  • 14:41 - 14:46
    transpose X. Okay? So that was linear regression.
  • 14:46 - 14:50
    In contrast, in locally weighted linear regression you're going to do things slightly
  • 14:50 - 14:51
    differently. You're
  • 14:51 - 14:54
    going to look at this point X
  • 14:54 - 14:59
    and then I'm going to look in my data set and take into account
  • 14:59 - 15:03
    only the data points that are, sort of, in the little vicinity of X. Okay?
  • 15:03 - 15:07
    So we'll look at where I want to evaluate my hypothesis. I'm going to look
  • 15:07 - 15:10
    only in the vicinity of
  • 15:10 - 15:14
    this point where I want to evaluate my hypothesis,
  • 15:14 - 15:16
    and then I'm going to take,
  • 15:16 - 15:20
    let's say, just these few points,
  • 15:20 - 15:21
    and I will
  • 15:21 - 15:23
    apply linear regression
  • 15:23 - 15:26
    to fit a straight line just to this sub-set of the data. Okay? I'm
  • 15:26 - 15:30
    using this term sub-set loosely - well, let's come back to that later.
  • 15:30 - 15:32
    So we take this data set and I fit a
  • 15:32 - 15:37
    straight line to it and maybe I get a straight line like that.
  • 15:37 - 15:41
    And what I'll do is then
  • 15:41 - 15:45
    evaluate this straight line at this particular value, and
  • 15:45 - 15:47
    that will be the value I return for my algorithm.
  • 15:47 - 15:50
    I think this would be the predicted value
  • 15:50 - 15:53
    for -
  • 15:53 - 15:57
    this would be the value that my hypothesis outputs
  • 15:57 - 16:04
    in locally weighted regression. Okay? So
  • 16:05 - 16:10
    let me go ahead and formalize that. In
  • 16:10 - 16:15
    locally weighted regression, we're going to fit theta to
  • 16:15 - 16:18
    minimize
  • 16:18 - 16:25
    sum over I of W superscript I times
  • 16:27 - 16:33
    YI minus theta transpose XI, squared,
  • 16:33 - 16:37
    where these terms W superscript I are called weights.
  • 16:37 - 16:38
  • 16:38 - 16:41
    There are many possible choices for the weights; I'm just gonna write one down. So this is E to the
  • 16:41 - 16:43
    minus, XI minus
  • 16:43 - 16:46
    X, squared,
  • 16:46 - 16:49
    over
  • 16:49 - 16:55
    two. So let's look at what these weights really are, right? So notice that -
  • 16:55 - 16:58
    suppose you have a training example XI.
  • 16:58 - 17:05
    So that XI is very close to X. So this is small,
  • 17:07 - 17:11
    right? Then if XI minus X is small, so if XI minus X is close to zero, then
  • 17:11 - 17:14
    this is E to the minus zero, and E to the zero
  • 17:14 - 17:16
    is one.
  • 17:16 - 17:20
    So if XI is close to X, then
  • 17:20 - 17:21
    WI
  • 17:21 - 17:25
    will be close to one. In other words, the weight associated with the ith training
  • 17:25 - 17:27
    example will be close to one
  • 17:27 - 17:30
    if XI and X are close to each
  • 17:30 - 17:31
    other.
  • 17:31 - 17:34
    Conversely if XI minus X
  • 17:34 - 17:41
    is large
  • 17:42 - 17:43
    then - I don't
  • 17:43 - 17:48
    know, what would WI be? Zero.
  • 17:48 - 17:51
    Zero, right. Close to zero. Right.
  • 17:51 - 17:52
    So
  • 17:52 - 17:57
    if XI is very far from X then this is E to the minus of some large number
  • 17:57 - 18:02
    and E to the minus some large number will be close to zero.
  • 18:02 - 18:05
    Okay?
  • 18:05 - 18:12
    So the picture is, if I'm
  • 18:12 - 18:13
    querying
  • 18:13 - 18:17
    at a certain point X, shown on the X axis,
  • 18:17 - 18:22
    and if my data
  • 18:22 - 18:24
    set, say, looks like that,
  • 18:24 - 18:29
    then I'm going to give the points close to this a large weight and give the points
  • 18:29 - 18:33
    far away a small weight.
  • 18:33 - 18:35
    So
  • 18:35 - 18:38
    for the points that are far away,
  • 18:38 - 18:40
    WI will be close to zero.
  • 18:40 - 18:44
    And so for the points that are far away,
  • 18:44 - 18:48
    they will not contribute much at all to this summation, right? So I think this is
  • 18:48 - 18:49
    sum over I
  • 18:49 - 18:54
    of one times this quadratic term for nearby points plus zero times this quadratic term for faraway
  • 18:54 - 18:56
    points.
  • 18:56 - 18:59
    And so the effect of using this weighting is that
  • 18:59 - 19:00
    locally weighted linear regression
  • 19:00 - 19:03
    fits a set of parameters theta,
  • 19:03 - 19:05
    paying much more attention to fitting the
  • 19:05 - 19:07
    points close by
  • 19:07 - 19:11
    accurately, while ignoring the contribution from faraway points. Okay? Yeah? Why is the weighting
  • 19:11 - 19:18
    exponential [inaudible]?
  • 19:18 - 19:23
    Yeah. Let's see. So it turns out there are many other weighting functions you can use. It
  • 19:23 - 19:27
    turns out that there are definitely different communities of researchers that tend to
  • 19:27 - 19:29
    choose different choices by default.
  • 19:29 - 19:34
    There is somewhat of a literature on debating what point -
  • 19:34 - 19:36
    exactly what function to use.
  • 19:36 - 19:38
    This, sort of, exponential decay function is -
  • 19:38 - 19:41
    this happens to be a reasonably common one that seems to be a more reasonable choice on many problems,
  • 19:41 - 19:42
  • 19:42 - 19:45
    but you can actually plug in other functions as well. Did
  • 19:45 - 19:48
    I mention what [inaudible] is it at?
  • 19:48 - 19:50
    For those of you that are familiar with
  • 19:50 - 19:52
    the
  • 19:52 - 19:54
    normal distribution, or the Gaussian distribution,
  • 19:54 - 19:55
    say this -
  • 19:55 - 19:59
    this formula I've written out here cosmetically
  • 19:59 - 20:03
    looks a bit like a Gaussian distribution. Okay? But this actually has
  • 20:03 - 20:06
    absolutely nothing to do with Gaussian distribution.
  • 20:06 - 20:08
    So this
  • 20:08 - 20:12
    is not saying that XI is Gaussian or whatever. There's no such
  • 20:12 - 20:13
    interpretation.
  • 20:13 - 20:17
    This is just a convenient function that happens to be a bell-shaped function, but
  • 20:17 - 20:21
    don't endow this with any Gaussian semantics. Okay?
  • 20:21 - 20:23
    So, in fact - well,
  • 20:23 - 20:29
    if you remember the familiar bell-shaped Gaussian, again, it's just
  • 20:29 - 20:32
    the way of associating weights with these points is that if you
  • 20:32 - 20:34
    imagine
  • 20:34 - 20:36
    putting a bell-shaped bump
  • 20:36 - 20:40
    centered around the position where you want to evaluate your hypothesis H,
  • 20:40 - 20:43
    then what we're saying is that this point here I'll give
  • 20:43 - 20:44
    a weight that's proportional
  • 20:44 - 20:49
    to the height of the Gaussian - excuse me, to the height of the bell-shaped function
  • 20:49 - 20:51
    evaluated at this point. And the
  • 20:51 - 20:53
    weight given to this point will be,
  • 20:53 - 20:55
    to this training example,
  • 20:55 - 20:57
    proportional to that height
  • 20:57 - 20:59
    and so on. Okay?
  • 20:59 - 21:01
    And so training examples that are really far away
  • 21:01 - 21:05
    get a very small weight.
  • 21:05 - 21:07
  • 21:07 - 21:11
    One last small generalization to this is that
  • 21:11 - 21:12
    normally
  • 21:12 - 21:15
    there's one other parameter to this
  • 21:15 - 21:18
    algorithm, which I'll denote as tau.
  • 21:18 - 21:21
    Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian.
  • 21:21 - 21:25
    This is just a convenient form of function.
  • 21:25 - 21:27
    This parameter tau
  • 21:27 - 21:33
    is called the bandwidth
  • 21:33 - 21:34
    parameter
  • 21:34 - 21:41
    and
  • 21:41 - 21:48
    informally it controls how fast the weights fall off with distance. Okay? So let me just
  • 21:48 - 21:49
    copy my diagram from
  • 21:49 - 21:54
    the other side, I guess.
  • 21:54 - 21:56
    So if
  • 21:56 - 21:58
    tau is very small,
  • 21:58 - 22:00
  • 22:00 - 22:05
    if that's a query X, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape,
  • 22:05 - 22:08
    so that the weights of the points that are far away fall off rapidly.
  • 22:08 - 22:10
    Whereas if
  • 22:10 - 22:17
    tau
  • 22:17 - 22:21
    is large then you'd end
  • 22:21 - 22:25
    up choosing a weighting function that falls off relatively slowly with distance from your
  • 22:25 - 22:28
    query.
  • 22:28 - 22:31
    Okay?
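    Putting the pieces together, here's a sketch of locally weighted regression at a single query point, on made-up data. The weighted normal equations, theta = (X^T W X)^{-1} X^T W y, are a standard way of solving this weighted least squares problem, though the lecture doesn't derive them here:

        import numpy as np

        def lwr_predict(X, y, x_query, tau):
            # Bell-shaped weights: w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)),
            # computed on the non-intercept feature. tau is the bandwidth.
            w = np.exp(-((X[:, 1] - x_query[1]) ** 2) / (2.0 * tau ** 2))
            W = np.diag(w)
            # Weighted least squares fit, refit from scratch for every query:
            # theta = (X^T W X)^{-1} X^T W y.
            theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
            return theta @ x_query

        # Made-up wiggly 1-D data set, for illustration only.
        x = np.linspace(0.0, 10.0, 50)
        y = np.sin(x) + 0.1 * np.random.randn(50)
        X = np.column_stack([np.ones_like(x), x])

        # Small tau: weights fall off fast, so the fit hugs nearby points.
        # Large tau: weights fall off slowly, approaching ordinary regression.
        print(lwr_predict(X, y, np.array([1.0, 5.0]), tau=0.5))
        print(lwr_predict(X, y, np.array([1.0, 5.0]), tau=10.0))
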
  • 22:31 - 22:33
    So
  • 22:33 - 22:40
    I hope you can, therefore, see that if you
  • 22:42 - 22:44
    apply locally weighted linear regression to a data set that looks like
  • 22:44 - 22:46
    this,
  • 22:46 - 22:47
    then to
  • 22:47 - 22:50
    ask what your hypothesis output is at a point like this you end up having a straight line
  • 22:50 - 22:51
    making
  • 22:51 - 22:53
    that prediction. To
  • 22:53 - 22:55
    ask what kind of class this [inaudible] at
  • 22:55 - 22:56
    that value
  • 22:56 - 22:59
    you put a straight line there
  • 22:59 - 23:00
    and you predict that value.
  • 23:00 - 23:04
    It turns out that every time you want to evaluate your hypothesis, every time you
  • 23:04 - 23:07
    ask your learning algorithm to make a prediction for
  • 23:07 - 23:09
    how much a new house costs or whatever,
  • 23:09 - 23:10
  • 23:10 - 23:14
    you need to run a new fitting procedure and then
  • 23:14 - 23:16
    evaluate this line that you fit
  • 23:16 - 23:18
    just at the position of
  • 23:18 - 23:22
    the value of X. So the position of the query where you're trying to make a prediction. Okay? But if you
  • 23:22 - 23:26
    do this for every point along the X-axis then
  • 23:26 - 23:27
    you find that
  • 23:27 - 23:30
    locally weighted regression is able to trace out this, sort of, very
  • 23:30 - 23:32
    non-linear curve
  • 23:32 - 23:38
    for a data set like this. Okay? So
  • 23:38 - 23:42
    in the problem set we're actually gonna let you play around more with this algorithm. So I won't say
  • 23:42 - 23:44
    too much more about it here.
  • 23:44 - 23:49
    But to finally move on to the next topic let me check the questions you have. Yeah? It seems like you
  • 23:49 - 23:50
    still have
  • 23:50 - 23:51
    the same problem of overfitting
  • 23:51 - 23:55
    and underfitting, like when
  • 23:55 - 23:58
    you choose your tau. Like you make it too small
  • 23:58 - 23:59
    in your -
  • 23:59 - 24:02
    Yes, absolutely. Yes. So locally
  • 24:02 - 24:04
    weighted regression can run into
  • 24:04 - 24:07
    - locally weighted regression is not a panacea for the problem of
  • 24:07 - 24:10
    overfitting or underfitting.
  • 24:10 - 24:15
    You can still run into the same problems with locally weighted regression. What you just
  • 24:15 - 24:17
    said about
  • 24:17 - 24:20
    - and so some of these things I'll leave you to discover for yourself in the
  • 24:20 - 24:21
    homework problem.
  • 24:21 - 24:25
    You'll actually see what you just mentioned.
  • 24:25 - 24:29
    Yeah? It almost seems like
  • 24:29 - 24:31
    you're
  • 24:31 - 24:37
    not even thoroughly [inaudible] with this locally weighted, you had all the
  • 24:37 - 24:42
    data that you originally had anyway. Yeah. I'm just trying to think of [inaudible] the original data points. Right. So the question is, sort of, this - it's almost as if you're not building a
  • 24:42 - 24:45
    model, because you need the entire data set.
  • 24:45 - 24:45
    And
  • 24:45 - 24:50
    the other way of saying that is that this is a non-parametric learning
  • 24:50 - 24:51
    algorithm. So this
  • 24:51 - 24:52
  • 24:52 - 24:54
    - I don't know. I won't
  • 24:54 - 24:58
    debate whether, you know, are we really building a model or not. But
  • 24:58 - 25:00
    this is a perfectly fine - so
  • 25:00 - 25:02
    if I think
  • 25:02 - 25:04
    when you write a code implementing
  • 25:04 - 25:07
    locally weighted linear regression on the
  • 25:07 - 25:09
    data set I think of that code
  • 25:09 - 25:13
    as a whole - as building your model.
  • 25:13 - 25:15
    So it actually uses -
  • 25:15 - 25:19
    we've actually used this quite successfully to model, sort of, the dynamics of this
  • 25:19 - 25:23
    autonomous helicopter we have. Yeah? Can this algorithm learn the weights
  • 25:23 - 25:24
    based
  • 25:24 - 25:28
    on
  • 25:28 - 25:33
    the data? Learn what weights? Oh, the weights WI. Instead of using [inaudible]. I see,
  • 25:33 - 25:34
    yes. So
  • 25:34 - 25:37
    it turns out there are a few things you can do. One thing that is quite common is
  • 25:37 - 25:40
    how to choose this bandwidth parameter tau,
  • 25:40 - 25:46
    right? Using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question. I used [inaudible]
  • 25:46 - 25:53
    Gaussian
  • 25:56 - 25:58
    sometimes if you [inaudible] Gaussian and then - Oh, I guess. Let's
  • 25:58 - 26:00
    see. Boy.
  • 26:00 - 26:02
    The weights are not
  • 26:02 - 26:06
    random variables and it's not, for the purpose of this algorithm, it is not useful to
  • 26:06 - 26:09
    endow it with probabilistic semantics. So you could
  • 26:09 - 26:15
    choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In
  • 26:15 - 26:16
    fact,
  • 26:16 - 26:17
    it turns out that
  • 26:17 - 26:22
    I happened to choose this, sort of, bell-shaped function
  • 26:22 - 26:23
    to define my weights.
  • 26:23 - 26:27
    It's actually fine to choose a function that doesn't even integrate to one, that integrates to
  • 26:27 - 26:30
    infinity, say, as your weighting function. So
  • 26:30 - 26:32
    in that sense,
  • 26:32 - 26:36
    I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful. Especially
  • 26:36 - 26:42
    since you use other functions that integrate to infinity and don't integrate to one. Okay?
  • 26:42 - 26:44
    That's the last question and let's move on. Assume
  • 26:44 - 26:51
    that we have a very huge [inaudible], for example. A very huge set of houses and want to
  • 26:51 - 26:54
    predict the price for each house
  • 26:54 - 26:58
    and so should the end result for each input - I'm
  • 26:58 - 27:00
    seeing this very constantly for
  • 27:00 - 27:02
    - Yes, you're right. So
  • 27:02 - 27:05
    because locally weighted regression is a
  • 27:05 - 27:07
    non-parametric algorithm
  • 27:07 - 27:11
    every time you make a prediction you need to fit theta to your entire training set again.
  • 27:11 - 27:13
    So you're actually right.
  • 27:13 - 27:18
    If you have a very large training set then this is a somewhat expensive algorithm to
  • 27:18 - 27:20
    use. Because every time you want to make a prediction
  • 27:20 - 27:21
    you need to fit
  • 27:21 - 27:23
    a straight line
  • 27:23 - 27:26
    to a huge data set again.
  • 27:26 - 27:29
    Turns out there are algorithms that - turns
  • 27:29 - 27:32
    out there are ways to make this much more efficient for large data sets as well.
  • 27:32 - 27:35
    So I don't want to talk about that here. If you're interested, look
  • 27:35 - 27:36
    up the work of Andrew Moore
  • 27:36 - 27:38
    on KD-trees. He,
  • 27:38 - 27:40
    sort
  • 27:40 - 27:42
    of, figured out ways to fit these models much more efficiently. That's not something I want to go
  • 27:42 - 27:45
    into today. Okay? Let me
  • 27:45 - 27:52
    move on. Let's take more questions later. So, okay.
  • 28:01 - 28:06
    So that's locally weighted regression.
  • 28:06 - 28:07
    Remember the
  • 28:07 - 28:11
    outline I had, I guess, at the beginning of this lecture. What I want to do now is
  • 28:11 - 28:11
  • 28:11 - 28:16
    talk about a probabilistic interpretation of linear regression, all right?
  • 28:16 - 28:19
    And in particular of the - it'll be this probabilistic interpretation
  • 28:19 - 28:22
    that lets us move on to talk
  • 28:22 - 28:29
    about logistic regression, which will be our first classification algorithm. So
  • 28:40 - 28:43
    let's put aside locally weighted regression for now. We'll just talk about
  • 28:43 - 28:47
    ordinary unweighted linear regression. Let's
  • 28:47 - 28:51
    ask the question of why least squares, right? Of all the things
  • 28:51 - 28:52
    we could optimize
  • 28:52 - 28:56
    how do we come up with this criterion of minimizing the square of the error
  • 28:56 - 28:59
    between the predictions of the hypothesis
  • 28:59 - 29:02
    and the observed values Y? So why not minimize
  • 29:02 - 29:08
    the absolute value of the errors, or the errors to the power of four or something?
  • 29:08 - 29:10
    What I'm going to do now is present
  • 29:10 - 29:12
    one set of assumptions that
  • 29:12 - 29:15
    will serve to "justify"
  • 29:15 - 29:19
    why we're minimizing the sum of squared errors. Okay?
  • 29:19 - 29:22
    It turns out that there are many assumptions that are sufficient
  • 29:22 - 29:26
    to justify why we do least squares and this is just one of them.
  • 29:26 - 29:27
    So
  • 29:27 - 29:30
    I'm going to present one set of assumptions under which least
  • 29:30 - 29:32
    squares regression makes sense,
  • 29:32 - 29:35
    but this is not the only set of assumptions. So even if the assumptions
  • 29:35 - 29:39
    I describe don't hold, least squares actually still makes sense in many
  • 29:39 - 29:40
    circumstances. But this
  • 29:40 - 29:41
    will, sort of, help, you know,
  • 29:41 - 29:48
    give one rationalization, like, one reason for doing least squares regression.
  • 29:48 - 29:51
    And, in particular, what I'm going to do is
  • 29:51 - 29:58
    endow the least squares model with probabilistic semantics. So
  • 29:58 - 30:02
    let's assume in our example of predicting housing prices,
  • 30:02 - 30:03
    that
  • 30:03 - 30:08
    the price of the house it's sold for
  • 30:08 - 30:12
    is going to be some linear function of the features,
  • 30:12 - 30:14
    plus
  • 30:14 - 30:17
    some term epsilon I. Okay?
  • 30:17 - 30:20
    And epsilon I will be
  • 30:20 - 30:22
    our error term.
  • 30:22 - 30:23
    You can think of
  • 30:23 - 30:28
    the error term as capturing unmodeled effects, like, that maybe
  • 30:28 - 30:31
    there's some other features of a house, like, maybe how many fireplaces it has or whether
  • 30:31 - 30:33
    there's a garden or whatever,
  • 30:33 - 30:34
    that there
  • 30:34 - 30:37
    are additional features that we just failed to capture, or
  • 30:37 - 30:40
    you can think of epsilon as random noise.
  • 30:40 - 30:42
    Epsilon is our error term that captures both these
  • 30:42 - 30:46
    unmodeled effects. Just things we forgot to model. Maybe the function isn't quite
  • 30:46 - 30:48
    linear or something.
  • 30:48 - 30:50
  • 30:50 - 30:53
    As well as random noise, like maybe
  • 30:53 - 30:57
    that day the seller was in a really bad mood and so he just
  • 30:57 - 31:00
    refused to sell at a reasonable price or something.
  • 31:00 - 31:03
    And now
  • 31:03 - 31:08
    I will assume that the errors have a
  • 31:08 - 31:09
    probabilistic
  • 31:09 - 31:13
    - have a probability distribution. I'll assume that the errors epsilon I
  • 31:13 - 31:15
    are distributed
  • 31:15 - 31:16
    - just, this tilde
  • 31:16 - 31:19
    denotes epsilon I
  • 31:19 - 31:23
    is distributed according to a probability distribution. That's
  • 31:23 - 31:27
    a Gaussian distribution with mean zero
  • 31:27 - 31:28
    and variance sigma squared. Okay? So
  • 31:28 - 31:31
    let me just say - this script N here,
  • 31:31 - 31:35
    n stands for normal, right? To denote a normal distribution, also known as the
  • 31:35 - 31:36
    Gaussian distribution,
  • 31:36 - 31:37
    with mean
  • 31:37 - 31:41
    zero and variance sigma squared.
  • 31:41 - 31:47
    Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you.
  • 31:47 - 31:49
    Great. Almost everyone.
  • 31:49 - 31:51
    So,
  • 31:51 - 31:55
    in other words, the density for Gaussian is what you've seen before.
  • 31:55 - 31:58
    The density for epsilon I would be
  • 31:58 - 32:00
    one over root 2 pi sigma, E to the
  • 32:00 - 32:02
    negative,
  • 32:02 - 32:04
    epsilon I
  • 32:04 - 32:06
  • 32:06 - 32:09
    squared over 2 sigma squared, right?
  • 32:09 - 32:15
    And the
  • 32:15 - 32:22
    density of our epsilon I will be this bell-shaped curve
  • 32:23 - 32:26
    with one standard deviation
  • 32:26 - 32:31
    being, sort of, sigma. Okay? This
  • 32:31 - 32:32
    is
  • 32:32 - 32:34
    the form for that bell-shaped curve.
  • 32:34 - 32:41
    So, let's see. I can erase that. Can I
  • 32:41 - 32:47
    erase the board?
  • 32:47 - 32:54
  • 32:59 - 33:05
    So this implies that the
  • 33:05 - 33:10
    probability distribution of a price of a house
  • 33:10 - 33:12
    given XI
  • 33:12 - 33:14
    and the parameters theta,
  • 33:14 - 33:21
    that this is going to be Gaussian
  • 33:30 - 33:35
    with that density. Okay?
  • 33:35 - 33:38
    In other words, this is the same as saying that the
  • 33:38 - 33:41
    price of
  • 33:41 - 33:47
    a house given the features of the house and my parameters theta,
  • 33:47 - 33:51
    this is going to be a random variable
  • 33:51 - 33:54
    that's distributed Gaussian with
  • 33:54 - 33:57
    mean theta transpose XI
  • 33:57 - 33:58
    and variance sigma squared.
  • 33:58 - 34:01
    Right? Because we imagine that
  • 34:01 - 34:05
    the way the housing prices are generated is that the price of a house
  • 34:05 - 34:09
    is equal to theta transpose XI and then plus some random Gaussian noise with variance sigma
  • 34:09 - 34:12
    squared. So
  • 34:12 - 34:14
    the price of a house is going to
  • 34:14 - 34:16
    have mean theta transpose XI and, again, variance sigma squared, right? Does
  • 34:16 - 34:20
    this make
  • 34:20 - 34:24
    sense? Raise your hand if this makes sense. Yeah,
  • 34:24 - 34:31
    okay. Lots of you. One
  • 34:38 - 34:44
    point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume
  • 34:44 - 34:47
    here the error is a
  • 34:47 - 34:50
    Gaussian?
  • 34:50 - 34:54
    Right. So, boy.
  • 34:54 - 34:56
    Why do I assume the error is Gaussian?
  • 34:56 - 35:01
    Two reasons, right? One is that it turns out to be mathematically convenient to do so
  • 35:01 - 35:03
    and the other is, I don't
  • 35:03 - 35:07
    know, I can also mumble about justifications, such as appealing to the
  • 35:07 - 35:09
    central limit theorem. It turns out that if you,
  • 35:09 - 35:12
    for the vast majority of problems, if you apply a linear regression model like
  • 35:12 - 35:16
    this and try to measure the distribution of the errors,
  • 35:16 - 35:20
    not all the time, but very often you find that the errors really are Gaussian. That
  • 35:20 - 35:22
    this Gaussian model is a good
  • 35:22 - 35:24
    assumption for the error
  • 35:24 - 35:26
    in regression problems like these.
  • 35:26 - 35:29
    Some of you may have heard of the central limit theorem, which says that
  • 35:29 - 35:33
    the sum of many independent random variables will tend towards a Gaussian.
  • 35:33 - 35:37
    So if the error is caused by many effects, like the mood of the
  • 35:37 - 35:39
    seller, the mood of the buyer,
  • 35:39 - 35:43
    some other features that we miss, whether the place has a garden or not, and
  • 35:43 - 35:46
    if all of these effects are independent, then
  • 35:46 - 35:49
    by the central limit theorem you might be inclined to believe that
  • 35:49 - 35:52
    the sum of all these effects will be approximately Gaussian. If
  • 35:52 - 35:54
    in practice, I guess, the
  • 35:54 - 35:58
    two real answers are that, 1.) In practice this is actually a reasonably accurate
  • 35:58 - 36:05
    assumption, and 2.) It turns out to be mathematically convenient to do so. Okay? Yeah? It seems like we're
  • 36:07 - 36:08
    saying
  • 36:08 - 36:11
    if we assume that the error around our model
  • 36:11 - 36:13
    has zero mean, then
  • 36:13 - 36:16
    the error is centered around our model. Which
  • 36:16 - 36:18
    it seems almost like we're trying to assume
  • 36:18 - 36:20
    what we're trying to prove.
  • 36:20 - 36:24
    That's the [inaudible] but, yes. You are assuming that
  • 36:24 - 36:28
    the error has zero mean. Which is, yeah, right.
  • 36:28 - 36:31
    I think later this quarter we get to some of the other
  • 36:31 - 36:35
    things, but for now just think of this as a mathematically - it's actually not
  • 36:35 - 36:38
    an unreasonable assumption.
  • 36:38 - 36:40
    I guess,
  • 36:40 - 36:45
    in machine learning all the assumptions we make are almost never
  • 36:45 - 36:49
    true in the absolute sense, right? Because, for instance,
  • 36:49 - 36:54
    housing prices are priced in dollars and cents, so the error will be -
  • 36:54 - 36:55
    errors
  • 36:55 - 36:59
    in prices are not continuous-valued random variables, because
  • 36:59 - 37:02
    houses can only be priced at a certain number of dollars and a certain number of
  • 37:02 - 37:06
    cents and you never have fractions of cents in housing prices.
  • 37:06 - 37:10
    Whereas a Gaussian random variable would. So in that sense, assumptions we make are never
  • 37:10 - 37:15
    "absolutely true," but for practical purposes this is a
  • 37:15 - 37:19
    accurate enough assumption that it'll be useful to
  • 37:19 - 37:24
    make. Okay? I think in a week or two, we'll actually come back to
  • 37:24 - 37:27
    say more about the assumptions we make and when they help our learning
  • 37:27 - 37:29
    algorithms and when they hurt our learning
  • 37:29 - 37:31
    algorithms. We'll say a bit more about it
  • 37:31 - 37:38
    when we talk about generative and discriminative learning algorithms, like, in a week or
  • 37:40 - 37:45
    two. Okay? So let's point out one bit of notation, which is that when I
  • 37:45 - 37:48
    wrote this down I actually wrote P of YI given XI and then semicolon
  • 37:48 - 37:50
    theta
  • 37:50 - 37:55
    and I'm going to use this notation when we are not thinking of theta as a
  • 37:55 - 37:57
    random variable. So
  • 37:57 - 38:01
    in statistics, though, sometimes it's called the frequentist's point of view,
  • 38:01 - 38:02
    where you think of there as being some,
  • 38:02 - 38:06
    sort of, true value of theta that's out there that's generating the data say,
  • 38:06 - 38:08
    but
  • 38:08 - 38:11
    we don't know what theta is, but theta is not a random
  • 38:11 - 38:13
    variable, right? So it's not like there's some
  • 38:13 - 38:16
    random value of theta out there. It's that theta is -
  • 38:16 - 38:19
    there's some true value of theta out there. It's just that we don't
  • 38:19 - 38:22
    know what the true value of theta is. So
  • 38:22 - 38:26
    if theta is not a random variable, then I'm
  • 38:26 - 38:27
    going to avoid
  • 38:27 - 38:31
    writing P of YI given XI comma theta, because this would mean
  • 38:31 - 38:35
    the probability of YI conditioned on X and theta,
  • 38:35 - 38:40
    and you can only condition on random variables.
  • 38:40 - 38:43
    So at this part of the class where we're taking
  • 38:43 - 38:46
    sort of the frequentist viewpoint rather than the Bayesian viewpoint, in this part of the class
  • 38:46 - 38:49
    we're thinking of theta not as a random variable, but just as something
  • 38:49 - 38:50
    we're trying to estimate
  • 38:50 - 38:53
    and use the semicolon
  • 38:53 - 38:58
    notation. So the way to read this is this is the probability of YI given XI
  • 38:58 - 39:00
    and parameterized by theta. Okay? So
  • 39:00 - 39:04
    you read the semicolon as parameterized by.
  • 39:04 - 39:08
    And in the same way here, I'll say YI given XI parameterized by
  • 39:08 - 39:09
    theta is distributed
  • 39:09 - 39:16
    Gaussian with that. All right.
  • 39:36 - 39:38
    So we're gonna make one more assumption.
  • 39:38 - 39:41
    Let's assume that the
  • 39:41 - 39:44
    error terms are
  • 39:44 - 39:48
  • 39:48 - 39:51
    IID, okay?
  • 39:51 - 39:54
    Which stands for Independently and Identically Distributed. So I'm
  • 39:54 - 39:57
    going to assume that the error terms are
  • 39:57 - 40:04
    independent of each other,
  • 40:04 - 40:11
    right?
  • 40:11 - 40:15
    The identically distributed part just means that I'm assuming they all come from the
  • 40:15 - 40:18
    same Gaussian distribution, with the same variance,
  • 40:18 - 40:22
    but the more important part of this is that I'm assuming that the epsilon I's are
  • 40:22 - 40:26
    independent of each other.
  • 40:26 - 40:29
    Now, let's talk about how to fit a model.
  • 40:29 - 40:32
    The probability of Y given
  • 40:32 - 40:36
    X parameterized by theta - I'm actually going to give
  • 40:36 - 40:39
    this another name. I'm going to write this down
  • 40:39 - 40:43
    and we'll call this the likelihood of theta
  • 40:43 - 40:46
    as the probability of Y given X parameterized by theta.
  • 40:46 - 40:49
    And so this is going to be
  • 40:49 - 40:50
    the product
  • 40:50 - 40:57
    over my training set like that.
  • 41:00 - 41:04
    Which is, in turn, going to be a product of
  • 41:04 - 41:11
    those Gaussian densities that I wrote down just now,
  • 41:11 - 41:15
    right?
  • 41:15 - 41:20
    Okay?
  • 41:20 - 41:25
    Then in terms of notation, I guess, I define this term here to be the
  • 41:25 - 41:26
    likelihood of theta.
  • 41:26 - 41:30
    And the likelihood of theta is just the probability of the data Y, right? Given X
  • 41:30 - 41:33
    and parameterized by theta.
  • 41:33 - 41:37
    The terms likelihood and probability are often confused.
  • 41:37 - 41:41
    So the likelihood of theta is the same thing as the
  • 41:41 - 41:46
    probability of the data you saw. So likelihood and probability are, sort of, the same thing.
  • 41:46 - 41:48
    Except that when I use the term likelihood
  • 41:48 - 41:52
    I'm trying to emphasize that I'm taking this thing
  • 41:52 - 41:55
    and viewing it as a function of theta.
  • 41:55 - 41:57
    Okay?
  • 41:57 - 42:01
    So likelihood and probability are really the same thing, except that
  • 42:01 - 42:02
    when I want to view this thing
  • 42:02 - 42:06
    as a function of theta, holding X and Y fixed, it's
  • 42:06 - 42:10
    then called the likelihood. Okay? So
  • 42:10 - 42:13
    hopefully you'll hear me say the likelihood of the parameters and the probability
  • 42:13 - 42:15
    of the data,
  • 42:15 - 42:18
    right? Rather than the likelihood of the data or probability of parameters. So try
  • 42:18 - 42:25
    to be consistent in that terminology.
  • 42:31 - 42:32
    So given that
  • 42:32 - 42:34
    the probability of the data is this and this
  • 42:34 - 42:37
    is also the likelihood of the parameters,
  • 42:37 - 42:38
    how do you estimate
  • 42:38 - 42:40
    the parameters theta? So given a training set,
  • 42:40 - 42:46
    what parameters theta do you want to choose for your model?
  • 42:46 - 42:53
  • 42:59 - 43:02
    Well, the principle of maximum likelihood
  • 43:02 - 43:03
    estimation
  • 43:03 - 43:09
    says that,
  • 43:09 - 43:13
    right? You can choose the value of theta that makes the data
  • 43:13 - 43:20
    as probable as possible, right? So choose theta
  • 43:20 - 43:27
    to maximize the likelihood. Or
  • 43:27 - 43:30
    in other words choose the parameters that make
  • 43:30 - 43:33
    the data as probable as possible, right? So this is
  • 43:33 - 43:37
    maximum likelihood estimation. So it's: choose the parameters that make
  • 43:37 - 43:40
    it as likely, as probable, as possible
  • 43:40 - 43:43
    for me to have seen the data I just
  • 43:43 - 43:47
    did. So
  • 43:47 - 43:53
    for mathematical convenience, let me define lower case l of theta.
  • 43:53 - 43:58
    This is called the log likelihood function and it's just log
  • 43:58 - 44:01
    of capital L of theta.
  • 44:01 - 44:06
    So this is log of a product over I
  • 44:06 - 44:10
    of one over root two pi
  • 44:10 - 44:14
    sigma, E to that. I won't bother to write out what's in the exponent for now. It's just this expression
  • 44:14 - 44:17
    from the previous board.
  • 44:17 - 44:24
    The log of a product is the same as the sum of the logs, right? So it's a sum
  • 44:25 - 44:32
    of
  • 44:35 - 44:38
    the logs of - which simplifies to m times log of
  • 44:38 - 44:39
    one over root
  • 44:39 - 44:44
    two pi
  • 44:44 - 44:44
    sigma
  • 44:44 - 44:47
    plus
  • 44:47 - 44:52
    and then log and exponentiation cancel each other, right? So log of E to
  • 44:52 - 44:53
    something is just
  • 44:53 - 45:00
    whatever's inside the exponent. So, you know what,
  • 45:01 - 45:08
    let me write this on the
  • 45:12 - 45:16
    next
  • 45:16 - 45:21
    board.
  • 45:21 - 45:28
    Okay.
  • 45:33 - 45:40
  • 45:46 - 45:51
    So
  • 45:51 - 45:53
    maximizing the likelihood or maximizing the log
  • 45:53 - 45:58
    likelihood is the same
  • 45:58 - 46:03
    as minimizing
  • 46:03 - 46:10
    that term over there. Well, you get it, right?
  • 46:22 - 46:26
    Because there's a minus sign. So maximizing this because of the minus sign is the same as
  • 46:26 - 46:27
    minimizing
  • 46:27 - 46:33
    this as a function of theta. And
  • 46:33 - 46:36
    this is, of course, just
  • 46:36 - 46:43
    the same quadratic cost function that we had last time, J of theta,
  • 46:43 - 46:44
    right? So what
  • 46:44 - 46:48
    we've just shown is that the ordinary least squares algorithm,
  • 46:48 - 46:51
    that we worked on the previous lecture,
  • 46:51 - 46:55
    is just maximum likelihood
  • 46:55 - 46:56
    assuming
  • 46:56 - 46:58
    this probabilistic model,
  • 46:58 - 47:05
    assuming IID Gaussian errors on our data.
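    Here is a small numeric check of this equivalence, on made-up data: minimizing the negative Gaussian log-likelihood over theta picks out (up to grid resolution) the same theta as the normal-equations least squares solution, for any fixed sigma.

        import numpy as np

        # Made-up data: linear trend plus IID Gaussian noise.
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 10.0, size=30)
        y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=30)
        X = np.column_stack([np.ones_like(x), x])

        def neg_log_likelihood(theta, sigma=1.0):
            # -log L(theta) under the IID Gaussian error model. The first,
            # sigma-only term doesn't depend on theta, so any sigma > 0
            # gives the same minimizer: the least squares theta.
            r = y - X @ theta
            return 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) \
                + np.sum(r ** 2) / (2 * sigma ** 2)

        # Least squares via the normal equations.
        theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

        # Crude grid search for the maximum likelihood theta.
        grid = [np.array([a, b]) for a in np.linspace(1.0, 3.0, 81)
                for b in np.linspace(0.0, 1.0, 81)]
        theta_ml = min(grid, key=neg_log_likelihood)
        print(theta_ls, theta_ml)
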
  • 47:06 - 47:10
  • 47:10 - 47:11
    Okay? One thing that we'll
  • 47:11 - 47:12
    actually use
  • 47:12 - 47:14
    in the next lecture is that
  • 47:14 - 47:17
    the value of sigma squared doesn't matter,
  • 47:17 - 47:18
    right? That somehow
  • 47:18 - 47:21
    no matter what the value of sigma squared is, I mean, sigma squared has to be a positive number. It's a
  • 47:21 - 47:22
    variance
  • 47:22 - 47:26
    of a Gaussian. So that no matter what sigma
  • 47:26 - 47:30
    squared is since it's a positive number the value of theta we end up with
  • 47:30 - 47:34
    will be the same, right? So because
  • 47:34 - 47:35
    minimizing this
  • 47:35 - 47:39
    you get the same value of theta no matter what sigma squared is. So it's as if
  • 47:39 - 47:43
    in this model the value of sigma squared doesn't really matter.
  • 47:43 - 47:46
    Just remember that for the next lecture. We'll come back
  • 47:46 - 47:48
    to this again.
  • 47:48 - 47:51
    Any questions about this?
  • 47:51 - 47:53
    Actually, let me clean up
  • 47:53 - 48:00
    another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah? I think here you try to measure the likelihood of your choice of
  • 48:44 - 48:51
    theta by
  • 48:51 - 48:52
    a fraction
  • 48:52 - 48:54
    of error,
  • 48:54 - 48:55
    but I think it's that you
  • 48:55 - 48:57
    measure
  • 48:57 - 49:01
    because it depends on the family of theta too, for example. If
  • 49:01 - 49:05
  • 49:05 - 49:09
    you have a lot of parameters [inaudible] or fitting in? Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think
  • 49:09 - 49:13
    let's - the things you're mentioning are
  • 49:13 - 49:15
    maybe deeper questions about
  • 49:15 - 49:18
    learning algorithms that we'll just come back to later, so don't really want to get into
  • 49:18 - 49:19
    that right
  • 49:19 - 49:22
    now. Any more
  • 49:22 - 49:29
    questions? Okay. So
  • 49:33 - 49:39
    this endows linear regression with a probabilistic interpretation.
  • 49:39 - 49:43
    I'm actually going to use this probabil - use this, sort of, probabilistic
  • 49:43 - 49:44
    interpretation
  • 49:44 - 49:46
    in order to derive our next learning algorithm,
  • 49:46 - 49:50
    which will be our first classification algorithm. Okay?
  • 49:50 - 49:54
    So
  • 49:54 - 49:58
    you'll recall that I said that regression problems are where the variable Y
  • 49:58 - 50:01
    that you're trying to predict is continuous-valued.
  • 50:01 - 50:04
    Now I'm actually gonna talk about our first classification problem,
  • 50:04 - 50:08
    where the value Y you're trying to predict
  • 50:08 - 50:11
    will be discrete-valued. It can take on only a small number of discrete values
  • 50:11 - 50:14
    and in this case I'll talk about binary classification
  • 50:14 - 50:16
    where
  • 50:16 - 50:19
    Y takes on only two values, right? So
  • 50:19 - 50:22
    you come up with classification problems if you're trying to do,
  • 50:22 - 50:26
    say, a medical diagnosis and try to decide based on some features
  • 50:26 - 50:30
    that the patient has a disease or does not have a disease.
  • 50:30 - 50:34
    Or if in the housing example, maybe you're trying to decide will this house sell in the
  • 50:34 - 50:38
    next six months or not and the answer is either yes or no. It'll either be sold in the
  • 50:38 - 50:41
    next six months or it won't be.
  • 50:41 - 50:45
    Another standard example: if you want to build a spam filter. Is this e-mail spam
  • 50:45 - 50:51
    or not? It's yes or no. Or, you know, some of my colleagues work on predicting
  • 50:51 - 50:55
    whether a computer system will crash. So you have a learning algorithm to predict will
  • 50:55 - 50:59
    this computing cluster crash over the next 24 hours? And, again, it's a yes
  • 50:59 - 51:06
    or no answer. So
  • 51:06 - 51:08
    there's X, there's Y.
  • 51:08 - 51:14
    And in a classification problem
  • 51:14 - 51:15
    Y takes on
  • 51:15 - 51:19
    two values, zero and one. That's it in binary classification.
  • 51:19 - 51:22
    So what can you do? Well, one thing you could do is
  • 51:22 - 51:25
    take linear regression, as we've described it so far, and apply it to this problem,
  • 51:25 - 51:26
    right? So you,
  • 51:26 - 51:30
    you know, given this data set you can fit a straight line to it. Maybe
  • 51:30 - 51:32
    you get that straight line, right?
  • 51:32 - 51:32
    But
  • 51:32 - 51:37
    this data set I've drawn, right? This is an amazingly easy classification problem. It's
  • 51:37 - 51:38
    pretty obvious
  • 51:38 - 51:41
    to all of us that, right? The relationship between X and Y is -
  • 51:41 - 51:48
    well, you just look at a value around here, and to the right of it Y is one, to
  • 51:48 - 51:52
    the left Y is zero. So you apply linear regression to this data set and you get a reasonable fit and you can then
  • 51:52 - 51:54
    maybe take your linear regression
  • 51:54 - 51:56
    hypothesis to this straight line
  • 51:56 - 51:58
    and threshold it at 0.5.
  • 51:58 - 52:02
    If you do that you'll certainly get the right answer. You predict that
  • 52:02 - 52:03
  • 52:03 - 52:04
    if X is to the right of, sort
  • 52:04 - 52:06
    of, the mid-point here
  • 52:06 - 52:12
    then Y is one, and if it's to the left of that mid-point then Y is zero.
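    Here is a sketch of this setup, and of the failure mode described just below, on a made-up 1-D data set: one extra, entirely unsurprising positive example far to the right moves the 0.5 threshold, even though it adds no real information.

        import numpy as np

        # Made-up binary labels: small x -> 0, large x -> 1.
        x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
        y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

        def boundary_from_regression(x, y):
            # Ordinary least squares fit, then threshold h(x) at 0.5.
            X = np.column_stack([np.ones_like(x), x])
            theta = np.linalg.solve(X.T @ X, X.T @ y)
            # Decision boundary: theta0 + theta1 * x = 0.5.
            return (0.5 - theta[0]) / theta[1]

        print(boundary_from_regression(x, y))

        # Append one far-away positive example; the boundary shifts.
        print(boundary_from_regression(np.append(x, 30.0), np.append(y, 1.0)))
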
  • 52:12 - 52:16
    So some people actually do this. Apply linear regression to classification problems
  • 52:16 - 52:18
    and sometimes it'll
  • 52:18 - 52:19
    work okay,
  • 52:19 - 52:22
    but in general it's actually a pretty bad idea to
  • 52:22 - 52:26
    apply linear regression to
  • 52:26 - 52:32
    classification problems like these and here's why. Let's say I
  • 52:32 - 52:34
    change my training set
  • 52:34 - 52:40
    by giving you just one more training example all the way up there, right?
  • 52:40 - 52:43
    Imagine that given this training set it is actually still entirely obvious what the
  • 52:43 - 52:46
    relationship between X and Y is, right? It's just -
  • 52:46 - 52:51
    if X is greater than this value then Y is one, and if it's less then Y
  • 52:51 - 52:52
    is zero.
  • 52:52 - 52:55
    By giving you this additional training example it really shouldn't
  • 52:55 - 52:56
    change anything. I mean,
  • 52:56 - 52:59
    I didn't really convey much new information. There's no surprise that this
  • 52:59 - 53:02
    corresponds to Y equals one.
  • 53:02 - 53:05
    But if you now fit linear regression to this data
  • 53:05 - 53:07
    set you end up with a line that, I
  • 53:07 - 53:10
    don't know, maybe looks like that, right?
  • 53:10 - 53:13
    And now the predictions of your
  • 53:13 - 53:16
    hypothesis have changed completely if
  • 53:16 - 53:23
    you threshold your hypothesis at 0.5. Okay? So - In between there might be an interval where it's zero, right? For that far-off point? Oh, you mean, like that?
  • 53:27 - 53:30
    Right.
  • 53:30 - 53:31
    Yeah, yeah, fine. Yeah, sure. A theta
  • 53:31 - 53:37
    set like that so. So, I
  • 53:37 - 53:39
    guess,
  • 53:39 - 53:42
    these just - yes, you're right, but this is an example and this example works. This - [Inaudible] that will
  • 53:42 - 53:48
    change it even more if you gave it
  • 53:48 - 53:50
    all -
  • 53:50 - 53:52
    Yeah. Then I think this actually would make it even worse. You
  • 53:52 - 53:54
    would actually get a line that pulls out even further, right? So
  • 53:54 - 53:58
    this is my example. I get to make it whatever I want, right? But
  • 53:58 - 54:01
    the point of this is that there's not a deep meaning to this. The point of this is
  • 54:01 - 54:02
    just that
  • 54:02 - 54:05
    it could be a really bad idea to apply linear regression to classification
  • 54:05 - 54:06
  • 54:06 - 54:12
    problems. Sometimes it works fine, but usually I wouldn't do it.
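(A quick numeric sketch of that effect, with made-up numbers: adding one extreme positive example drags the least-squares line, which moves the point where the line crosses 0.5, even though the classification problem got no harder.)

```python
import numpy as np

def boundary_after_ols(x, y):
    """Fit least squares, return the x where theta^T [1, x] = 0.5."""
    X = np.column_stack([np.ones_like(x), x])
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return (0.5 - theta[0]) / theta[1]

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(boundary_after_ols(x, y))  # 4.5, right in the middle

# Add one unsurprising positive example far to the right:
print(boundary_after_ols(np.append(x, 30.0), np.append(y, 1.0)))  # shifts to ~5.8
```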
  • 54:12 - 54:14
    So a couple of problems with this. One is that,
  • 54:14 - 54:17
    well - so what do you want to do
  • 54:17 - 54:21
    for classification? If you know the value of Y lies between zero and
  • 54:21 - 54:25
    one then to kind of fix this problem
  • 54:25 - 54:27
    let's just start by
  • 54:27 - 54:29
    changing the form
  • 54:29 - 54:34
    of our hypothesis so that my hypothesis
  • 54:34 - 54:39
    always lies in the unit interval between zero and one. Okay?
  • 54:39 - 54:42
    So if I know Y is either
  • 54:42 - 54:44
    zero or one then
  • 54:44 - 54:47
    let's at least not have my hypothesis predict values much larger than one and much
  • 54:47 - 54:51
    smaller than zero.
  • 54:51 - 54:52
    And so
  • 54:52 - 54:56
    I'm going to - instead of choosing a linear function for my hypothesis I'm going
  • 54:56 - 55:00
    to choose something slightly different. And,
  • 55:00 - 55:03
    in particular, I'm going to choose
  • 55:03 - 55:08
    this function: H subscript theta of X is going to be equal to G of
  • 55:08 - 55:11
    theta transpose X
  • 55:11 - 55:13
    where
  • 55:13 - 55:14
    G
  • 55:14 - 55:17
    is going to be this
  • 55:17 - 55:18
    function and so
  • 55:18 - 55:21
    this becomes one over one plus e to the
  • 55:21 - 55:23
    negative theta
  • 55:23 - 55:25
    transpose X.
  • 55:25 - 55:28
    And G of Z is called the sigmoid
  • 55:28 - 55:33
    function and it
  • 55:33 - 55:39
    is often also called the logistic function. It
  • 55:39 - 55:41
    goes by either of these
  • 55:41 - 55:48
    names. And what G of Z looks like is the following. So on your
  • 55:48 - 55:51
    horizontal axis I'm going to plot Z
  • 55:51 - 55:57
    and so G of Z
  • 55:57 - 55:59
    will look like this.
  • 55:59 - 56:05
    Okay? I didn't draw that very well. Okay.
  • 56:05 - 56:06
    So G of Z
  • 56:06 - 56:08
    tends towards zero
  • 56:08 - 56:10
    as Z becomes very small
  • 56:10 - 56:12
    and G of Z will ascend
  • 56:12 - 56:15
    towards one as Z becomes large and it crosses the
  • 56:15 - 56:17
    vertical
  • 56:17 - 56:20
    axis at 0.5.
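(In Python, a minimal sketch of this function; the probe values are arbitrary, chosen only to show the two limits and the 0.5 crossing.)

```python
import numpy as np

def sigmoid(z):
    """The sigmoid / logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-6.0, 0.0, 6.0):
    print(z, sigmoid(z))  # near 0, exactly 0.5, near 1
```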
  • 56:20 - 56:24
    So this is what the sigmoid function, also called the logistic function, looks like. Yeah? Question? Why a
  • 56:24 - 56:27
    sigmoid rather than a
  • 56:27 - 56:30
    step function? Say that again. Why can we not choose a step function, since, like, that's
  • 56:30 - 56:35
    better for binary? Yeah. Let me come back to that later. So it turns out that - where did I get this function from,
  • 56:35 - 56:37
    right? I just
  • 56:37 - 56:39
    wrote down this function. It actually
  • 56:39 - 56:43
    turns out that there are two reasons for using this function that we'll come to.
  • 56:43 - 56:44
    One is -
  • 56:44 - 56:47
    we'll talk about generalized linear models. We'll see that this falls out naturally
  • 56:47 - 56:50
    as part of the broader class of models.
  • 56:50 - 56:51
    And another reason
  • 56:51 - 56:52
    that we'll talk about
  • 56:52 - 56:54
    next week, it turns out
  • 56:54 - 56:55
    there are a couple of,
  • 56:55 - 56:57
    I think, very beautiful reasons for why
  • 56:57 - 56:59
    we choose logistic
  • 56:59 - 57:00
    functions. We'll see
  • 57:00 - 57:02
    that in a little bit. But for now let me just
  • 57:02 - 57:07
    define it and just take my word for it for now that this is a reasonable choice.
  • 57:07 - 57:09
    Okay? But notice now that
  • 57:09 - 57:16
    my - the values output by my hypothesis will always be between zero
  • 57:16 - 57:17
    and one. Furthermore,
  • 57:17 - 57:21
    just like we did for linear regression, I'm going to endow
  • 57:21 - 57:26
    the outputs of my hypothesis with a probabilistic interpretation, right? So
  • 57:26 - 57:31
    I'm going to assume that the probability that Y is equal to one
  • 57:31 - 57:34
    given X and parameterized by theta
  • 57:34 - 57:36
    that's equal to
  • 57:36 - 57:38
    H subscript theta of X, all right?
  • 57:38 - 57:40
    So in other words
  • 57:40 - 57:43
    I'm going to imagine that my hypothesis is outputting all these
  • 57:43 - 57:45
    numbers that lie between zero and one.
  • 57:45 - 57:48
    I'm going to think of my hypothesis
  • 57:48 - 57:55
    as trying to estimate the probability that Y is equal to one. Okay?
  • 57:56 - 58:00
    And because
  • 58:00 - 58:02
    Y has to be either zero or one
  • 58:02 - 58:05
    then the probability of Y equals zero is going
  • 58:05 - 58:12
    to be one minus that. All right?
  • 58:12 - 58:15
    So more simply, it turns out we can actually take these two equations
  • 58:15 - 58:19
    and write them more compactly.
  • 58:19 - 58:22
    Write P of Y given X
  • 58:22 - 58:24
    parameterized by theta.
  • 58:24 - 58:26
    This is going to be H
  • 58:26 - 58:28
    subscript theta of X to
  • 58:28 - 58:32
    the power of Y times
  • 58:32 - 58:33
    one minus
  • 58:33 - 58:35
    H of X to the power of
  • 58:35 - 58:37
    one minus Y. Okay? So I know this
  • 58:37 - 58:43
    looks somewhat bizarre, but this actually makes the derivation much nicer.
  • 58:43 - 58:44
    So if Y is equal to one
  • 58:44 - 58:48
    then this equation is H of X to the power of one
  • 58:48 - 58:51
    times something to the power of zero.
  • 58:51 - 58:54
    So anything to the power of zero is just one,
  • 58:54 - 58:56
    right? So if Y equals one then
  • 58:56 - 58:59
    this is something to the power of zero and so this is just one.
  • 58:59 - 59:03
    So if Y equals one this is just saying P of Y equals one is equal to H subscript
  • 59:03 - 59:06
    theta of X. Okay?
  • 59:06 - 59:08
    And in the same way, if Y is equal
  • 59:08 - 59:10
    to zero then this is P
  • 59:10 - 59:14
    of Y equals zero equals this thing to the power of zero and so this disappears. This is
  • 59:14 - 59:16
    just one
  • 59:16 - 59:18
    times this thing to the power of one. Okay? So this is
  • 59:18 - 59:19
    a
  • 59:19 - 59:20
    compact way of writing
  • 59:20 - 59:23
    both of these equations
  • 59:23 - 59:30
    together in one line.
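(Here's a minimal sketch of that compact form in Python; theta and x are made-up values, and x is assumed to already carry the intercept term x_0 = 1.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_y_given_x(theta, x, y):
    """P(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y) for y in {0, 1}."""
    h = sigmoid(theta @ x)  # h_theta(x) = g(theta^T x)
    return h ** y * (1.0 - h) ** (1 - y)

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
# With y = 1 this is just h(x); with y = 0 it is 1 - h(x), the two cases above.
print(prob_y_given_x(theta, x, 1), prob_y_given_x(theta, x, 0))
```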
  • 59:31 - 59:36
    So let's move on to parameter fitting, right? And, again, you can ask -
  • 59:36 - 59:38
    well, given this model for my data, how do I fit
  • 59:38 - 59:41
    the parameters theta of my
  • 59:41 - 59:46
    model? So the likelihood of the parameters is, as before, just the probability
  • 59:46 - 59:49
  • 59:49 - 59:50
  • 59:50 - 59:54
    of the data, right? Which is the product over I of P of YI
  • 59:54 - 59:57
    given XI
  • 59:57 - 59:59
    parameterized by theta.
  • 59:59 - 60:02
    Which is - just plugging those
  • 60:02 - 60:09
    in. Okay? I
  • 60:09 - 60:16
    dropped this theta subscript just so I can write a little bit less. Oh,
  • 60:17 - 60:20
    excuse me. These
  • 60:20 - 60:27
    should be
  • 60:29 - 60:36
    XI's
  • 60:36 - 60:43
    and YI's. Okay?
  • 60:51 - 60:53
    So,
  • 60:53 - 60:57
    as before, let's say we want to find a maximum likelihood estimate of the parameters theta. So
  • 60:57 - 60:59
    we want
  • 60:59 - 61:04
    to find the setting of the parameters theta that maximizes the likelihood L
  • 61:04 - 61:08
    of theta. It
  • 61:08 - 61:10
    turns out
  • 61:10 - 61:11
    that very often
  • 61:11 - 61:14
    - just when you work with the derivations, it turns out that it is often
  • 61:14 - 61:18
    much easier to maximize the log of the likelihood rather than maximize the
  • 61:18 - 61:20
    likelihood.
  • 61:20 - 61:20
    So
  • 61:20 - 61:23
    the log
  • 61:23 - 61:25
    likelihood, lowercase l of theta, is just the log of capital L of theta.
  • 61:25 - 61:28
    This will, therefore,
  • 61:28 - 61:35
    be the sum over I of YI times log H of XI, plus one minus YI times log of one minus H of XI. Okay?
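(A minimal sketch of that log likelihood in Python, assuming a design matrix X whose first column is the intercept term and a vector y of 0/1 labels; the tiny data set is made up.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
print(log_likelihood(np.zeros(2), X, y))  # 3 * log(0.5), since h = 0.5 everywhere at theta = 0
```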
  • 61:58 - 61:59
    And so
  • 61:59 - 62:04
    to fit the parameters theta of our model we'll
  • 62:04 - 62:06
    find the value of
  • 62:06 - 62:12
    theta that maximizes this log likelihood. Yeah? [Inaudible] Say that again. YI is [inaudible]. Oh, yes.
  • 62:12 - 62:19
    Thanks.
  • 62:22 - 62:27
    So how do we maximize this function? Well, it turns out we can actually apply
  • 62:27 - 62:29
    the same gradient
  • 62:29 - 62:33
    descent algorithm that we learned. That was the first algorithm we used
  • 62:33 - 62:35
    to minimize the quadratic function.
  • 62:35 - 62:37
    And you remember, when we talked about least squares,
  • 62:37 - 62:40
    the first algorithm we used to minimize the quadratic
  • 62:40 - 62:41
    error function
  • 62:41 - 62:43
    was gradient descent.
  • 62:43 - 62:45
    So we can actually use exactly the same algorithm
  • 62:45 - 62:48
    to maximize the log likelihood.
  • 62:48 - 62:49
  • 62:49 - 62:51
    And you remember, that algorithm was just
  • 62:51 - 62:55
    repeatedly take the value of theta
  • 62:55 - 62:56
    and you replace it with
  • 62:56 - 62:59
    the previous value of theta plus
  • 62:59 - 63:02
    a learning rate alpha
  • 63:02 - 63:04
    times
  • 63:04 - 63:08
    the gradient of the objective - the gradient of the log likelihood with respect to
  • 63:08 - 63:10
    theta. Okay?
  • 63:10 - 63:14
    One small change is that previously we were trying to minimize
  • 63:14 - 63:15
  • 63:15 - 63:17
    the quadratic error term.
  • 63:17 - 63:20
    Today we're trying to maximize rather than minimize. So rather than having a minus
  • 63:20 - 63:23
    sign we have a plus sign. So this is
  • 63:23 - 63:24
    just gradient ascent,
  • 63:24 - 63:25
    but for the
  • 63:25 - 63:28
    maximization rather than the minimization.
  • 63:28 - 63:32
    So we actually call this gradient ascent and it's really the same
  • 63:32 - 63:35
    algorithm.
  • 63:35 - 63:37
    So to figure out
  • 63:37 - 63:42
    what this gradient is - so in order to derive the gradient ascent rule,
  • 63:42 - 63:44
    what you need to do is
  • 63:44 - 63:48
    compute the partial derivatives of your objective function with respect to
  • 63:48 - 63:53
    each of your parameters theta I, right?
  • 63:53 - 63:58
    It turns out that
  • 63:58 - 64:00
    if you actually
  • 64:00 - 64:03
    compute this partial derivative -
  • 64:03 - 64:10
    so you take this formula, this L of theta, which is - oh, got that wrong too. If
  • 64:10 - 64:14
    you take this lower case l theta, if you take the log likelihood of theta,
  • 64:14 - 64:17
    and if you take its partial derivative with
  • 64:17 - 64:19
    respect to theta I
  • 64:19 - 64:22
    you find that
  • 64:22 - 64:27
    this is equal to -
  • 64:27 - 64:30
    let's see. Okay? And,
  • 64:30 - 64:37
    I
  • 64:46 - 64:47
    don't
  • 64:47 - 64:50
    know, the derivation isn't terribly complicated, but in
  • 64:50 - 64:54
    the interest of saving you watching me write down a couple of
  • 64:54 - 64:56
    blackboards full of math I'll just write
  • 64:56 - 64:57
    down the final answer. But
  • 64:57 - 64:59
    the way you get this is you
  • 64:59 - 65:00
    just take those, plug
  • 65:00 - 65:04
    in the definition for H subscript theta as a function of XI, and take derivatives,
  • 65:04 - 65:06
    and work through the algebra
  • 65:06 - 65:08
    it turns out it'll simplify
  • 65:08 - 65:13
    down to this formula. Okay?
  • 65:13 - 65:15
    And so
  • 65:15 - 65:19
    what that gives you is that gradient ascent
  • 65:19 - 65:22
    is the following
  • 65:22 - 65:24
    rule. Theta J gets updated as theta
  • 65:24 - 65:26
    J
  • 65:26 - 65:28
    plus alpha
  • 65:28 - 65:35
    times YI minus H subscript theta of XI, times XIJ. Okay?
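(That update rule, as a hedged batch-form sketch in Python - the learning rate, iteration count, and data set are all made up, and the vectorized line updates every theta_j at once rather than one example at a time.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, alpha=0.1, iters=1000):
    """theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)          # h for all m examples at once
        theta += alpha * X.T @ (y - h)  # one gradient ascent step
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 6.0], [1.0, 7.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_ascent(X, y)
print(sigmoid(X @ theta))  # near 0 for the first two examples, near 1 for the rest
```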
  • 65:47 - 65:50
    Does this look familiar to anyone? Do you
  • 65:50 - 65:56
    remember seeing this formula in the last lecture? Right.
  • 65:56 - 65:59
    So when I worked out batch gradient descent
  • 65:59 - 66:02
    for least squares regression I,
  • 66:02 - 66:06
    actually, wrote down
  • 66:06 - 66:08
    exactly the same thing, or maybe
  • 66:08 - 66:11
    there's a minus sign difference, something like that. But I, actually, had
  • 66:11 - 66:14
    exactly the same learning rule last time
  • 66:14 - 66:20
    for least squares regression,
  • 66:20 - 66:24
    right? Is this the same learning algorithm then? So what's different? How come I was making
  • 66:24 - 66:25
    all that noise earlier about
  • 66:25 - 66:30
    least squares regression being a bad idea for classification problems and then I did
  • 66:30 - 66:34
    a bunch of math and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Inaudible] constants?
  • 66:34 - 66:41
    Say that again. [Inaudible] Oh,
  • 66:44 - 66:47
    right. Okay, cool. No, exactly. Right. So it looks the same, but
  • 66:47 - 66:48
    this is not the same, right? And the
  • 66:48 - 66:49
    reason is,
  • 66:49 - 66:52
    in logistic regression
  • 66:52 - 66:54
    this is different from before, right?
  • 66:54 - 66:56
    The definition
  • 66:56 - 66:58
    of this H subscript theta of XI
  • 66:58 - 66:59
    is not
  • 66:59 - 67:03
    the same as the definition I was using in the previous lecture.
  • 67:03 - 67:07
    And in particular this is no longer theta transpose XI. This is not
  • 67:07 - 67:10
    a linear function anymore.
  • 67:10 - 67:12
    This is a logistic function of theta
  • 67:12 - 67:14
    transpose XI. Okay?
  • 67:14 - 67:18
    So even though this looks cosmetically similar,
  • 67:18 - 67:20
    even though this is similar on the surface,
  • 67:20 - 67:24
    to the batch gradient descent rule I derived last time for
  • 67:24 - 67:25
    least squares regression
  • 67:25 - 67:29
    this is actually a totally different learning algorithm. Okay?
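(One way to see the "cosmetically similar but actually different" point, as a sketch: the update line can be literally the same code for both algorithms; only the definition of the hypothesis h changes. Everything below is made up for illustration.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_step(theta, X, y, alpha, h_fn):
    """One batch step of theta := theta + alpha * X^T (y - h)."""
    h = h_fn(X @ theta)
    return theta + alpha * X.T @ (y - h)

linear = lambda z: z   # least squares: h_theta(x) = theta^T x
logistic = sigmoid     # logistic regression: h_theta(x) = g(theta^T x)

X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0])
print(one_step(np.zeros(2), X, y, 0.1, linear))
print(one_step(np.zeros(2), X, y, 0.1, logistic))  # same formula, different numbers
```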
  • 67:29 - 67:32
    And it turns out that there's actually no coincidence that you ended up with the
  • 67:32 - 67:35
    same learning rule. We'll actually
  • 67:35 - 67:35
  • 67:35 - 67:40
    talk a bit more about this later when we talk about generalized linear models.
  • 67:40 - 67:43
    But this is one of the most elegant things about generalized linear models
  • 67:43 - 67:45
    that we'll see later. That
  • 67:45 - 67:48
    even though we're using a different model, you actually ended up with
  • 67:48 - 67:51
    what looks like the same learning algorithm and it's actually no
  • 67:51 - 67:56
    coincidence. Cool.
  • 67:56 - 67:59
    One last comment as
  • 67:59 - 68:01
    part of a sort of learning process,
  • 68:01 - 68:02
    over here
  • 68:02 - 68:05
    I said I take the derivatives and I
  • 68:05 - 68:07
    ended up with this line.
  • 68:07 - 68:09
    I didn't want to
  • 68:09 - 68:13
    make you sit through a long algebraic derivation, but
  • 68:13 - 68:15
    later today or later this week,
  • 68:15 - 68:19
    please, do go home and look at our lecture notes, where I wrote out
  • 68:19 - 68:21
    the entirety of this derivation in full,
  • 68:21 - 68:23
    and make sure you can follow every single step of
  • 68:23 - 68:27
    how we take partial derivatives of this log likelihood
  • 68:27 - 68:32
    to get this formula over here. Okay? By the way, for those who are
  • 68:32 - 68:36
    interested in seriously mastering the machine learning material,
  • 68:36 - 68:40
    when you go home and look at the lecture notes it will actually be very easy for most
  • 68:40 - 68:42
    of you to look through
  • 68:42 - 68:45
    the lecture notes and read through every line and go yep, that makes sense, that makes sense, that makes sense,
  • 68:45 - 68:45
    and,
  • 68:45 - 68:50
    sort of, say cool. I see how you get this line.
  • 68:50 - 68:53
    You want to make sure you really understand the material. My concrete
  • 68:53 - 68:55
    suggestion to you would be to go home,
  • 68:55 - 68:58
    read through the lecture notes, check every line,
  • 68:58 - 69:02
    and then to cover up the derivation and see if you can derive it yourself, right? So
  • 69:02 - 69:07
    in general, that's usually good advice for studying technical
  • 69:07 - 69:09
    material like machine learning. Which is if you work through a proof
  • 69:09 - 69:11
    and you think you understood every line,
  • 69:11 - 69:14
    the way to make sure you really understood it is to cover it up and see
  • 69:14 - 69:17
    if you can rederive the entire thing yourself. This is actually a great way because I
  • 69:17 - 69:20
    did this a lot when I was trying to study
  • 69:20 - 69:22
    various pieces of machine learning
  • 69:22 - 69:26
    theory and various proofs. And this is actually a great way to study because you cover up
  • 69:26 - 69:28
    the derivations and see if you can do it yourself
  • 69:28 - 69:33
    without looking at the original derivation. All right.
  • 69:33 - 69:37
  • 69:37 - 69:40
    I probably won't get to Newton's Method today. I just
  • 69:40 - 69:47
    want to say
  • 69:55 - 69:58
    - take one quick digression to talk about
  • 69:58 - 69:59
    one more algorithm,
  • 69:59 - 70:01
    which the discussion was sort
  • 70:01 - 70:08
    of alluding to earlier,
  • 70:09 - 70:12
    which is the perceptron
  • 70:12 - 70:13
    algorithm, right? So
  • 70:13 - 70:14
    I'm
  • 70:14 - 70:17
    not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come
  • 70:17 - 70:20
    back to later. Later this quarter
  • 70:20 - 70:24
    we'll talk about learning theory.
  • 70:24 - 70:27
    So in logistic regression we said that G of Z was, sort
  • 70:27 - 70:28
    of,
  • 70:28 - 70:30
    my hypothesis output values
  • 70:30 - 70:33
    that were real numbers between zero and one.
  • 70:33 - 70:37
    The question is, what if you want to force G of Z to output
  • 70:37 - 70:39
    a value that is
  • 70:39 - 70:39
    either
  • 70:39 - 70:41
    zero or one?
  • 70:41 - 70:44
    So the
  • 70:44 - 70:46
    perceptron algorithm defines G of Z
  • 70:46 - 70:48
    to be this: one if Z is greater than or equal to zero, and zero otherwise.
  • 70:48 - 70:53
  • 70:53 - 70:54
    So the picture is - or
  • 70:54 - 71:01
    the cartoon is, rather than this sigmoid function, G of
  • 71:02 - 71:08
    Z now looks like this step function that you were asking about earlier.
  • 71:08 - 71:14
    And same as before, we can use H subscript theta of X equals G of theta transpose X. Okay? So
  • 71:14 - 71:14
    this
  • 71:14 - 71:18
    is actually - everything is exactly the same as before,
  • 71:18 - 71:20
    except that G of Z is now the step function.
  • 71:20 - 71:22
    It
  • 71:22 - 71:25
    turns out there's this learning rule called the perceptron learning rule that's actually
  • 71:25 - 71:28
    cosmetically the same as the classic gradient ascent rule
  • 71:28 - 71:31
    for logistic regression.
  • 71:31 - 71:33
    And the learning rule is
  • 71:33 - 71:40
    given by this: theta J gets updated as theta J plus alpha times YI minus H subscript theta of XI, times XIJ. Okay?
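(A minimal sketch of that rule in Python - the step-function hypothesis and one-example-at-a-time updates are the only changes from the logistic version; the data set, learning rate, and epoch count are made up.)

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=10):
    """theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij,
    where h_theta(x) = 1 if theta^T x >= 0 else 0 (the step function g)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = 1.0 if theta @ x_i >= 0 else 0.0  # thresholded prediction
            theta += alpha * (y_i - h) * x_i      # nonzero only on mistakes
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 6.0], [1.0, 7.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = perceptron_train(X, y)
print([1.0 if theta @ x >= 0 else 0.0 for x in X])  # matches y on this separable data
```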
  • 71:44 - 71:50
    So it looks just like the
  • 71:50 - 71:51
    classic gradient ascent rule
  • 71:51 - 71:54
  • 71:54 - 71:56
    for logistic regression.
  • 71:56 - 72:01
    So this is a very different flavor of algorithm than least squares regression and logistic
  • 72:01 - 72:02
    regression,
  • 72:02 - 72:06
    and, in particular, because it only outputs values that are either zero or one it
  • 72:06 - 72:09
    turns out it's very difficult to endow this algorithm with
  • 72:09 - 72:12
    probabilistic semantics. And this
  • 72:12 - 72:19
    is, again, even though - oh, excuse me. Right there. Okay.
  • 72:20 - 72:24
    And even though this learning rule, again, looks cosmetically very similar to
  • 72:24 - 72:26
    what we have in logistic regression this is actually
  • 72:26 - 72:28
    a very different type of learning rule
  • 72:28 - 72:31
    than the others that we've seen
  • 72:31 - 72:34
    in this class. So
  • 72:34 - 72:37
    because this is such a simple learning algorithm, right? It just
  • 72:37 - 72:41
    computes theta transpose X and then you threshold and then your output is zero or one.
  • 72:41 - 72:43
    This is -
  • 72:43 - 72:46
    right. So this is a simpler algorithm than logistic regression, I think.
  • 72:46 - 72:49
    When we talk about learning theory later in this class,
  • 72:49 - 72:55
    the simplicity of this algorithm will let us come back and use it as a building block. Okay?
  • 72:55 - 72:57
    But that's all I want to say about this algorithm for now.