This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem a little bit like I'm jumping, sort of, from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture, we talked about linear regression, and today I want to talk about an adaptation of that called locally weighted regression. It's a very powerful algorithm that's actually probably one of my former mentor's favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to, again, later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So let's just recap what we were talking about in the previous lecture. Remember the notation that I defined: I used X superscript i, Y superscript i to denote the ith training example. And when we're talking about linear regression, or ordinary least squares, we use this to denote the predicted value output by my hypothesis H on the input X^i. My hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over j of theta_j times X_j^i, written more simply as theta transpose X. And we had the convention that X subscript zero is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay? So in the example of trying to predict housing prices, we had two features, the size of the house and the number of bedrooms. We had two features, and little n was equal to two.
So just to finish recapping the previous lecture, we defined this quadratic cost function J of theta equals one-half times the sum from i equals one to m of H theta of X^i minus Y^i, squared, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture, we derived the value of theta that minimizes this in closed form, which was (X transpose X) inverse, X transpose Y. Okay?

So as we move on in today's lecture, I'll continue to use this notation and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you forget - if you're having trouble remembering what lowercase m is or what lowercase n is or something - please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the houses in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how it actually does.

So just for example, the choice we made last time was X1 equals the size. And let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually define - oh, let's draw this out. And so, right? So say that was the size of the house and that's the price of the house. So if you use this as a feature maybe you get theta zero plus theta one X1, this, sort of, linear model. If you choose - let me just copy the same data set over, right? You can define the set of features where X1 is equal to the size of the house and X2 is the square of the size of the house. Okay?
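As a transcription aid, here is a minimal sketch of the closed-form least squares fit just recapped, assuming a NumPy design matrix whose first column is all ones. The function name and the toy numbers are my own, not from the lecture.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form ordinary least squares: theta = (X^T X)^{-1} X^T y.

    X is an (m, n+1) design matrix whose first column is all ones
    (the intercept convention x_0 = 1); y is an (m,) vector of targets.
    """
    # Solving the normal equations is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: living area (sq ft) with an intercept column.
X = np.array([[1.0, 2104.0], [1.0, 1600.0], [1.0, 2400.0]])
y = np.array([400.0, 330.0, 369.0])
theta = fit_least_squares(X, y)
print(theta)  # [theta_0, theta_1]
```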
So X1 is the size of the house in, say, square footage, and X2 is - just take whatever the square footage of the house is and square that number. And this would be another way to come up with a feature, and if you do that then the same algorithm will end up fitting a quadratic function for you: theta zero plus theta one X1 plus theta two X1 squared. Okay? Because this X1 squared is actually X2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Which is - let's see. I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model theta zero plus theta one X plus theta two X squared plus, up to, theta six X to the power of six, and fit a sixth-order polynomial to these seven data points. And if you do that you find that you come up with a model that fits your data exactly. This is where, I guess, in this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly. And you'll probably find that the curve you get will look something like that.

On the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously think that this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best, whereas the linear model - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing.

So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features. Just to give these a name, we call this the problem of underfitting and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit.
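As a transcription aid, here is a minimal sketch of the feature-engineering idea just described: adding powers of the size as extra features lets the same least squares machinery fit a polynomial. The helper name and the toy sizes and prices are hypothetical, not from the lecture.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Build columns [1, x, x^2, ..., x^degree] from a 1-D feature x,
    so ordinary least squares ends up fitting a degree-`degree` polynomial."""
    return np.vander(x, N=degree + 1, increasing=True)

# Hypothetical sizes (thousands of sq ft, scaled to keep the fit well conditioned)
# and prices. With 7 points a degree-6 fit passes through them almost exactly
# (the overfitting case), while degree 1 or 2 leaves residual but generalizes better.
x = np.array([1.0, 1.5, 1.8, 2.1, 2.3, 2.7, 3.0])
y = np.array([200., 260., 280., 330., 340., 390., 400.])

for degree in (1, 2, 6):
    X = polynomial_design_matrix(x, degree)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = np.sum((X @ theta - y) ** 2)
    print(degree, residual)
```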
And this problem here we refer to as overfitting and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of this specific data set, right? It just so happens that, of the seven houses we sampled in Portland, or wherever you collect data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set rather than the true underlying trends of how housing prices vary as a function of the size of the house. Okay?

So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features. If you want to address this in your learning problem, there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate the need somewhat for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.

And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is one that's defined as an algorithm that has a fixed number of parameters that are fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must be fit to the data.

In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I'll also give a second, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as the number of parameters growing linearly with the size of the training set. This is the formal definition.
A slightly less formal definition is that the amount of stuff your learning algorithm needs to keep around will grow linearly with the training set size. Or, another way of saying it is that this is an algorithm that needs to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. But what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression. It also goes by a couple of other names - it also goes by the name of Loess, for historical reasons. Loess is usually spelled L-O-E-S-S, sometimes spelled like that, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So for my motivating example, let's say that I have a training set that looks like this, okay? So this is X and that's Y. If you run linear regression on this and you fit maybe a linear function to this, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether you're using the right features. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as X plus X squared plus maybe some function of sine of X or something. You actually sit around and fiddle with features, and after a while you can probably come up with a set of features for which the model is okay, but let's talk about an algorithm that you can use without needing to do that.

So if - well, suppose you want to evaluate your hypothesis H at a certain point, at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of Y at this position of X, right? So for linear regression, what we were doing was we would fit theta to minimize the sum over i of Y^i minus theta transpose X^i, squared, and return theta transpose X. Okay? So that was linear regression.
In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point X, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of X. Okay? So we'll look at where I want to evaluate my hypothesis - I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis - and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using the term subset loosely - well, let's come back to that later. So we take this data set and I fit a straight line to it and maybe I get a straight line like that. And what I'll do is then evaluate this straight line at this particular value of X, and that will be the value my algorithm returns. This would be the predicted value - this would be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over i of W^i times Y^i minus theta transpose X^i, squared, where these terms W superscript i are called weights. There are many possible choices for the weights; I'm just gonna write one down: W^i equals e to the minus X^i minus X, squared, over two.

So let's look at what these weights really are, right? So notice that - suppose you have a training example X^i such that X^i is very close to X, so this is small, right? Then if X^i minus X is small, so if X^i minus X is close to zero, then this is e to the minus zero, and e to the zero is one. So if X^i is close to X, then W^i will be close to one. In other words, the weight associated with the ith training example will be close to one if X^i and X are close to each other. Conversely, if X^i minus X is large, then - I don't know, what would W^i be? Zero. Zero, right. Close to zero. Right. So if X^i is very far from X, then this is e to the minus of some large number, and e to the minus of some large number will be close to zero. Okay?
So the picture is: if I'm querying at a certain point X, shown on the X axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, W^i will be close to zero, and so it's as if the points that are far away will not contribute much at all to this summation, right? So it's as if this is a sum over i of one times this quadratic term for the nearby points, plus zero times this quadratic term for the faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah?

Why is it exponential [inaudible]?

Yeah. Let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to make different choices by default. There is somewhat of a literature debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible]?

For those of you that are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution. Okay? But this actually has absolutely nothing to do with the Gaussian distribution. This is not saying that X^i is Gaussian or whatever; there is no such interpretation. This is just a convenient function that happens to be bell-shaped, but don't endow it with any Gaussian semantics. Okay?
So, in fact - well, if you remember the familiar bell-shaped Gaussian - again, the way the weights are associated with these points is that if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis H, then the weight I give this point here is proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point. And the weight given to this other point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function. This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? Let me just copy my diagram from the other side, I guess. So if tau is very small, if that's a query X, then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points that are far away fall off rapidly. Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?

So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis output is at a point like this, you end up having a straight line making that prediction. To ask what it outputs [inaudible] at that value, you put a straight line there and you predict that value. It turns out that every time you try to evaluate your hypothesis, every time you ask your learning algorithm to make a prediction for how much a new house costs or whatever, you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of X, at the position of the query where you're trying to make a prediction. Okay?
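A minimal sketch of the prediction procedure just described - weight each training example by the bell-shaped function with bandwidth tau, fit a weighted least squares line, and read off the prediction at the query point. This assumes a NumPy design matrix with a leading column of ones; the function name and toy numbers are my own, not from the lecture.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau):
    """Locally weighted linear regression prediction at a single query point.

    X is an (m, n+1) design matrix with a leading column of ones,
    y is an (m,) target vector, x_query is an (n+1,) query (also with the 1),
    and tau is the bandwidth controlling how fast the weights fall off.
    """
    # Bell-shaped weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2)).
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))

    # Solve the weighted normal equations: (X^T W X) theta = X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # The prediction is theta^T x_query for the locally fit line.
    return x_query @ theta

# Hypothetical 1-D example: sizes (thousands of sq ft) with an intercept column.
X = np.column_stack([np.ones(5), np.array([1.0, 1.5, 2.0, 2.5, 3.0])])
y = np.array([200.0, 260.0, 310.0, 340.0, 400.0])
print(lwr_predict(np.array([1.0, 1.8]), X, y, tau=0.3))
```

Note that theta is refit from scratch for every query, which is exactly why the entire training set has to be kept around, as mentioned earlier.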
But if you do this for every point along the X-axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay?

So in the problem set we're actually gonna let you play around more with this algorithm, so I won't say too much more about it here. But to finally move on to the next topic, let me check what questions you have. Yeah?

It seems like you still have the same problem of overfitting and underfitting, like when you choose tau. Like you make it too small in your -

Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - well, some of these things I'll leave you to discover for yourself in the homework problem. You'll actually see what you just mentioned. Yeah?

It almost seems like you're not even thoroughly [inaudible] with this locally weighted, you had all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Right. So the question is, sort of, this: it's almost as if you're not building a model, because you need the entire data set. And the other way of saying that is that this is a non-parametric learning algorithm. So - I don't know, I won't debate whether we're really building a model or not. But this is a perfectly fine - I think when you write code implementing locally weighted linear regression on the data set, I think of that code as a whole as building your model. We've actually used this quite successfully to model, sort of, the dynamics of an autonomous helicopter. Yeah?

Can this algorithm learn the weights based on the data? Learn what weights? Oh, the weights W^i, instead of using [inaudible]. I see, yes. So it turns out there are a few things you can do. One thing that is quite common is to choose this bandwidth parameter tau using the data, right? We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.
I used [inaudible] Gaussian, sometimes if you [inaudible] Gaussian and then -

Oh, I guess. Let's see. Boy. The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights; it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question and then let's move on.

Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict the price for each house, and so should the end result for each input - I'm seeing this very constantly for -

Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again. So you're actually right: if you have a very large training set then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well. I don't want to talk about that today; if you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on. Let's take more questions later. So, okay.

So that's locally weighted regression. Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And in particular, it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm. So let's put aside locally weighted regression for now.
We'll just talk about ordinary unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the values Y observed? Why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay?

It turns out that there are many assumptions that are sufficient to justify why we do least squares, and this is just one of them. So I'm presenting one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions. Even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will hopefully give one rationalization, like, one reason, for doing least squares regression. And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics.

So let's assume, in our example of predicting housing prices, that the price a house is sold for is going to be some linear function of the features, plus some term epsilon^i. Okay? And epsilon^i will be our error term. You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, additional features that we just fail to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both of these: unmodeled effects, just things we forgot to model, maybe the function isn't quite linear or something, as well as random noise, like maybe that day the seller was in a really bad mood and so he sold it - just refused to go for a reasonable price or something.

And now I will assume that the errors have a probabilistic - have a probability distribution. I'll assume that the errors epsilon^i are distributed - I'll write it like this to denote that epsilon^i is distributed according to a probability distribution.
That's a Gaussian distribution with mean zero and variance sigma squared. Okay? So the script N here stands for normal, right, to denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared. Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you. Great. Almost everyone.

So, in other words, the density for the Gaussian is what you've seen before: the density for epsilon^i would be one over root two pi sigma, times e to the negative epsilon^i squared over two sigma squared, right? And the density of our epsilon^i will be this bell-shaped curve, with one standard deviation being, sort of, sigma. Okay? This is the formula for that bell-shaped curve.

So, let's see. I can erase that. Can I erase the board? So this implies that the probability distribution of the price of a house, given X^i and the parameters theta, is going to be Gaussian with that density. Okay? In other words, this is saying that the price of a house, given the features of the house and my parameters theta, is going to be a random variable that's distributed Gaussian with mean theta transpose X^i and variance sigma squared, right? Because we imagine that the way the housing prices are generated is that the price of a house is equal to theta transpose X^i plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose X^i and variance sigma squared, right? Does this make sense? Raise your hand if this makes sense. Yeah, okay. Lots of you.

One point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume here that the error is Gaussian?

Right. So, boy. Why do I assume the error is Gaussian? Two reasons, right? One is that it turns out to be mathematically convenient to do so, and the other is, I don't know, I can also mumble about justifications, such as appealing to the central limit theorem.
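As a transcription aid, here are the two densities being read out above, written in the lecture's notation (nothing new is added):

```latex
\epsilon^{(i)} \sim \mathcal{N}(0,\sigma^2), \qquad
p\bigl(\epsilon^{(i)}\bigr) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\bigl(\epsilon^{(i)}\bigr)^2}{2\sigma^2}\right),

p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^2}{2\sigma^2}\right),
\qquad\text{i.e.}\qquad
y^{(i)} \mid x^{(i)}; \theta \;\sim\; \mathcal{N}\bigl(\theta^{T}x^{(i)},\,\sigma^2\bigr).
```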
It turns out that, for the vast majority of problems, if you apply a linear regression model like this and try to measure the distribution of the errors - not all the time, but very often - you find that the errors really are Gaussian, that this Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables will tend towards a Gaussian. So if the error is caused by many effects, like the mood of the seller, the mood of the buyer, some other features that we missed, whether the place has a garden or not, and if all of these effects are independent, then by the central limit theorem you might be inclined to believe that the sum of all these effects will be approximately Gaussian. But in practice, I guess, the two real answers are: one, in practice this is actually a reasonably accurate assumption, and two, it turns out to be mathematically convenient to do so. Okay? Yeah?

It seems like we're saying that if we assume the error around the model has zero mean, then the error is centered around our model, which seems almost like we're trying to assume what we're trying to prove.

That's the [inaudible], but yes, you are assuming that the error has zero mean. Which is, yeah, right. I think later this quarter we'll get to some of the other things, but for now just think of this as a mathematically - it's actually not an unreasonable assumption. I guess in machine learning all the assumptions we make are almost never true in the absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents and you never have fractions of cents in housing prices, whereas a Gaussian random variable would. So in that sense, the assumptions we make are never "absolutely true," but for practical purposes this is an accurate enough assumption that it'll be useful to make. Okay? I think in a week or two, we'll actually come back and say a bit more about the assumptions we make and when they help our learning algorithms and when they hurt our learning algorithms. We'll say a bit more about it when we talk about generative and discriminative learning algorithms, like, in a week or two. Okay?
So let me point out one bit of notation, which is that when I wrote this down I actually wrote P of Y^i given X^i and then semicolon theta, and I'm going to use this notation when we are not thinking of theta as a random variable. So in statistics this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta out there that's generating the data, say, but we don't know what theta is. Theta is not a random variable, right? So it's not like there's some random value of theta out there; it's that there's some true value of theta out there, and it's just that we don't know what the true value of theta is. So if theta is not a random variable, then I'm going to avoid writing P of Y^i given X^i comma theta, because that would mean the probability of Y^i conditioned on X and theta, and you can only condition on random variables. So in this part of the class, where we're taking the sort of frequentist viewpoint rather than the Bayesian viewpoint, we're thinking of theta not as a random variable, but just as something we're trying to estimate, and we use the semicolon notation. So the way to read this is: this is the probability of Y^i given X^i and parameterized by theta. Okay? You read the semicolon as "parameterized by." And in the same way here, I'll say Y^i given X^i, parameterized by theta, is distributed Gaussian with that. All right.

So we're gonna make one more assumption. Let's assume that the error terms are IID, okay? Which stands for Independently and Identically Distributed. So I'm going to assume that the error terms are independent of each other, right? The identically distributed part just means that I'm assuming they all come from the same Gaussian distribution with the same variance, but the more important part of this is that I'm assuming that the epsilon^i's are independent of each other.

Now, let's talk about how to fit a model. The probability of Y given X, parameterized by theta - I'm actually going to give this another name. I'm going to write this down and we'll call this the likelihood of theta: the likelihood of theta is the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, like that.
Which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?

Then, as a point of notation, I guess, I define this term here to be the likelihood of theta. And the likelihood of theta is just the probability of the data Y, right, given X and parameterized by theta. The terms likelihood and probability are often confused. The likelihood of theta is the same thing as the probability of the data you saw, so likelihood and probability are, sort of, the same thing, except that when I use the term likelihood I'm trying to emphasize that I'm taking this thing and viewing it as a function of theta. Okay? So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, it's then called the likelihood. Okay? So hopefully you'll hear me say the likelihood of the parameters and the probability of the data, right, rather than the likelihood of the data or the probability of the parameters. Try to be consistent in that terminology.

So given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says that, right, you should choose the value of theta that makes the data as probable as possible. So choose theta to maximize the likelihood. Or, in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation: choose the parameters that make it as likely, as probable, as possible for me to have seen the data I just did.

So for mathematical convenience, let me define lowercase l of theta. This is called the log likelihood function and it's just the log of capital L of theta. So this is the log of the product over i of one over root two pi sigma, times e to that exponent - I won't bother to write out what's in the exponent for now; it's just the same as on the previous board. The log of a product is the same as the sum of the logs, right?
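As a transcription aid, here is the derivation being spoken here and over the next few lines, written out in the lecture's notation (no new content is added):

```latex
\ell(\theta) = \log L(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^2}{2\sigma^2}\right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^{2}.

% Maximizing \ell(\theta) is therefore the same as minimizing
% J(\theta) = \tfrac{1}{2}\sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^{2},
% for any fixed \sigma^{2} > 0.
```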
So it's a sum of the logs, which simplifies to m times the log of one over root two pi sigma, plus - and then the log and the exponential cancel each other, right? Because the log of e to something is just whatever's inside the exponent. So, you know what, let me write this on the next board. Okay.

So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. Well, you get it, right? Because there's a minus sign. So maximizing this, because of the minus sign, is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta, right? So what we've just shown is that the ordinary least squares algorithm that we worked out in the previous lecture is just maximum likelihood, assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?

One thing that we'll actually use in the next lecture: notice that the value of sigma squared doesn't matter, right? No matter what the value of sigma squared is - I mean, sigma squared has to be a positive number, it's the variance of a Gaussian - so no matter what sigma squared is, since it's a positive number, the value of theta we end up with will be the same, right? Because in minimizing this you get the same value of theta no matter what sigma squared is. So it's as if, in this model, the value of sigma squared doesn't really matter. Just remember that for the next lecture; we'll come back to this again.

Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah?

I think here you try to measure the likelihood of your choice of theta by a fraction of error, but I think that measure depends on the family of theta too, for example. If you have a lot of parameters [inaudible] overfitting?

Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model.
I think - the things you're mentioning are maybe deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into that right now. Any more questions? Okay.

So this endows linear regression with a probabilistic interpretation. I'm actually going to use this, sort of, probabilistic interpretation in order to derive our next learning algorithm, which will be our first classification algorithm. Okay? So you'll recall that I said that regression problems are where the variable Y that you're trying to predict is continuous-valued. Now I'm actually gonna talk about our first classification problem, where the value Y you're trying to predict is a discrete value - it can take on only a small number of discrete values - and in this case I'll talk about binary classification, where Y takes on only two values, right?

So you come up with classification problems if you're trying to do, say, a medical diagnosis, and try to decide, based on some features, whether the patient has a disease or does not have a disease. Or, in the housing example, maybe you're trying to decide will this house sell in the next six months or not, and the answer is either yes or no - it'll either be sold in the next six months or it won't be. Other standard examples: if you want to build a spam filter, is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash, so you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes or no answer.

So there's X, there's Y, and in a classification problem Y takes on two values, zero and one. That's it in binary classification. So what can you do? Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem, right? So, you know, given this data set you can fit a straight line to it. Maybe you get that straight line, right? But this data set I've drawn, right, this is an amazingly easy classification problem. It's pretty obvious to all of us that the relationship between X and Y is - well, you just look at a value around here, and to the right of it Y is one, to the left Y is zero.
So you apply linear regression to this data set, you get a reasonable fit, and you can then maybe take your linear regression hypothesis - this straight line - and threshold it at 0.5. If you do that, you'll certainly get the right answer. You predict that if X is to the right of, sort of, the mid-point here then Y is one, and if it's to the left of that mid-point then Y is zero. So some people actually do this - apply linear regression to classification problems - and sometimes it'll work okay, but in general it's actually a pretty bad idea to apply linear regression to classification problems like these, and here's why. Let's say I change my training set by giving you just one more training example, all the way up there, right? Given this training set it's actually still entirely obvious what the relationship between X and Y is, right? It's just: if the value is greater than some threshold, Y is one, and if it's less, Y is zero. Giving you this additional training example really shouldn't change anything - I didn't really convey much new information; there's no surprise that this corresponds to Y equals one. But if you now fit linear regression to this data set, you end up with a line that, I don't know, maybe looks like that, right? And now the predictions of your hypothesis have changed completely if you threshold your hypothesis at 0.5. Okay? So - [Student:] In between there might be an interval where it's zero, right? For that far-off point? [Instructor:] Oh, you mean, like that? Right. Yeah, yeah, fine. Yeah, sure. A theta set like that, so. So, I guess - yes, you're right, but this is an example, and this example works. [Student:] [Inaudible] that will change it even more if you gave it all - [Instructor:] Yeah, then I think this actually would make it even worse. You would actually get a line that pulls out even further, right? But this is my example; I get to make it whatever I want, right? There's not a deep meaning to this. The point is just that it could be a really bad idea to apply linear regression to a classification problem. Sometimes it works fine, but usually I wouldn't do it. So there are a couple of problems with this. One is that - well, so what do you want to do for classification?
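To make that point concrete, here is a minimal numerical sketch, not from the lecture, using made-up one-dimensional data and the closed-form least-squares fit; the data values and function names are purely illustrative.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form ordinary least squares: theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def classify(theta, X):
    """Threshold the linear hypothesis theta^T x at 0.5 to get 0/1 labels."""
    return (X @ theta >= 0.5).astype(int)

# Hypothetical 1-D data: x is the feature, y is the 0/1 class label.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])      # add intercept term x0 = 1
print(classify(fit_least_squares(X, y), X))    # [0 0 0 1 1 1]: all correct

# Add one extreme positive example far to the right. The right labels are no
# less obvious, but the fitted line tilts and the 0.5 crossing point moves,
# so an earlier positive example now falls on the wrong side.
x2 = np.append(x, 50.0)
y2 = np.append(y, 1.0)
X2 = np.column_stack([np.ones_like(x2), x2])
print(classify(fit_least_squares(X2, y2), X2)) # the example at x = 6 flips to 0
```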
You know the value of Y lies between zero and one, so to kind of fix this problem, let's just start by changing the form of our hypothesis so that my hypothesis always lies in the unit interval between zero and one. Okay? So if I know Y is either zero or one, then let's at least not have my hypothesis predict values much larger than one or much smaller than zero. And so, instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose this function: H subscript theta of X is going to equal G of theta transpose X, where G is going to be this function, and so this becomes one over one plus e to the negative theta transpose X. And G of Z is called the sigmoid function, and it is often also called the logistic function. It goes by either of these names. And what G of Z looks like is the following: on the horizontal axis I'm going to plot Z, and G of Z will look like this. Okay? I didn't draw that very well. Okay. So G of Z tends towards zero as Z becomes very small, and G of Z ascends towards one as Z becomes large, and it crosses the vertical axis at 0.5. So this is the sigmoid function, also called the logistic function. Yeah? Question? [Student:] Why a sigmoid rather than a step function? [Instructor:] Say that again. [Student:] Why can't we choose a step function instead, since the output is binary? [Instructor:] Yeah, let me come back to that later. So it turns out that - where did I get this function from, right? I just wrote down this function. It actually turns out that there are two reasons for using this function that we'll come to. One is, when we talk about generalized linear models, we'll see that this falls out naturally as part of that broader class of models. And another reason, which we'll talk about next week: it turns out there are a couple of, I think, very beautiful reasons for why we choose logistic functions. We'll see that in a little bit. But for now let me just define it, and just take my word for it for now that this is a reasonable choice. Okay?
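As a small sketch of the function just defined (the NumPy usage and names are illustrative, not part of the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Hypothesis for logistic regression: g(theta^T x), always in (0, 1)."""
    return sigmoid(theta @ x)

# g(z) -> 0 as z -> -infinity, g(z) -> 1 as z -> +infinity, and g(0) = 0.5.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[4.5e-05, 0.5, 0.99995]
```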
But notice now that the values output by my hypothesis will always be between zero and one. Furthermore, just like we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation, right? So I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X, all right? So in other words, I'm going to imagine that my hypothesis is outputting all these numbers that lie between zero and one, and I'm going to think of my hypothesis as trying to estimate the probability that Y is equal to one. Okay? And because Y has to be either zero or one, the probability that Y equals zero is going to be one minus that. All right? So, more simply, it turns out you can actually take these two equations and write them more compactly. Write P of Y given X, parameterized by theta: this is going to be H subscript theta of X to the power of Y, times one minus H of X to the power of one minus Y. Okay? I know this looks somewhat bizarre, but it actually makes the derivation much nicer. If Y is equal to one, then this equation is H of X to the power of one times something to the power of zero, and anything to the power of zero is just one, right? So if Y equals one, this is just saying P of Y equals one is equal to H subscript theta of X. Okay? And in the same way, if Y is equal to zero, then this is P of Y equals zero equals this thing to the power of zero - so that disappears - times this thing to the power of one. Okay? So this is a compact way of writing both of these equations together in one line. So let's talk about parameter fitting, right? And, again, you can ask: given this model of my data, how do I fit the parameters theta of my model? So the likelihood of the parameters is, as before, just the probability of the data, right? Which is the product over i of P of Y^i given X^i, parameterized by theta - which is just plugging those in. Okay? I dropped the theta subscript just so we can write a little bit less. Oh, excuse me.
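Here is a rough sketch of how that compact form and the resulting likelihood read in code; the function names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_y_given_x(theta, x, y):
    """Compact form of the model:
    p(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y),  with h(x) = g(theta^T x).
    It reduces to h(x) when y = 1 and to 1 - h(x) when y = 0."""
    h = sigmoid(theta @ x)
    return h**y * (1.0 - h)**(1 - y)

def likelihood(theta, X, Y):
    """L(theta): product of p(y_i | x_i; theta) over the m training examples."""
    return np.prod([prob_y_given_x(theta, x, y) for x, y in zip(X, Y)])
```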
These should be X^i's and Y^i's. Okay? So, as before, let's say we want to find the maximum likelihood estimate of the parameters theta. So we want to find the setting of the parameters theta that maximizes the likelihood L of theta. It turns out that very often, when you work through the derivations, it is much easier to maximize the log of the likelihood rather than the likelihood itself. So the log likelihood, lowercase l of theta, is just the log of capital L, and this will therefore be the sum of this. Okay? And so, to fit the parameters theta of our model, we'll find the value of theta that maximizes this log likelihood. Yeah? [Student:] [Inaudible] [Instructor:] Say that again. [Student:] Y^i is [inaudible]. [Instructor:] Oh, yes, thanks. So to maximize this function - well, it turns out we can actually apply the same gradient descent algorithm that we learned before. You remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. So we can actually use exactly the same algorithm to maximize the log likelihood. And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta plus a learning rate alpha times the gradient of the objective - here, the log likelihood with respect to theta. Okay? One small change: previously we were trying to minimize the quadratic error term, whereas today we're trying to maximize rather than minimize. So rather than having a minus sign we have a plus sign. So this is just gradient ascent - the same update, but for maximization rather than minimization. So we actually call this gradient ascent, and it's really the same algorithm. So to figure out what this gradient is - in order to derive gradient ascent, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta j, right?
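Written out, as a sketch of what goes on the board, the log likelihood under the compact Bernoulli form above is:

```latex
\ell(\theta) \;=\; \log L(\theta)
  \;=\; \sum_{i=1}^{m} \Bigl[ \, y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr)
        \;+\; \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr) \Bigr]
```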
It turns out that if you actually compute this partial derivative - so you take this formula, this L of theta - oh, I got that wrong too; if you take this lowercase l of theta, the log likelihood of theta, and you take its partial derivative with respect to theta j, you find that this is equal to - let's see. Okay? And, I don't know, the derivation isn't terribly complicated, but in the interest of saving you from watching me write down a couple of blackboards full of math, I'll just write down the final answer. The way you get this is you take those, plug in the definition of H subscript theta as a function of X^i, take derivatives, and work through the algebra, and it turns out it'll simplify down to this formula. Okay? And so what that gives you is that gradient ascent is the following rule: theta j gets updated as theta j plus alpha times this. Okay? Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? Right. So when I worked out batch gradient descent for least squares regression, I actually wrote down exactly the same thing - or maybe there's a minus sign where this has a plus sign - but I had exactly the same learning rule last time for least squares regression, right? So is this the same learning algorithm then? What's different? How come I was making all that noise earlier about least squares regression being a bad idea for classification problems, and then I did a bunch of math, and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Student:] [Inaudible] constants? [Instructor:] Say that again. [Student:] [Inaudible] [Instructor:] Oh, right. Okay, cool. No, exactly, right - it looks the same, but this is not the same, right? And the reason is that in logistic regression this is different from before, right? The definition of this H subscript theta of X^i is not the same as the definition I was using in the previous lecture. In particular, this is no longer theta transpose X^i; this is not a linear function anymore. This is a logistic function of theta transpose X^i. Okay?
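Putting the pieces together, here is a rough sketch of batch gradient ascent for logistic regression using the update just described; the data, step size, and iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=2000):
    """Batch gradient ascent on the logistic-regression log likelihood.
    Each pass applies theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    n = X.shape[1]
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # hypothesis values for all m examples
        gradient = X.T @ (y - h)      # vector of partial derivatives of l(theta)
        theta += alpha * gradient     # plus sign: ascent, not descent
    return theta

# Hypothetical tiny data set; first column is the intercept term x0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
print(sigmoid(X @ theta))             # estimated probabilities that y = 1
```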
So even though this looks cosmetically similar - even though this is similar on the surface to the batch gradient descent rule I derived last time for least squares regression - this is actually a totally different learning algorithm. Okay? And it turns out that it's actually no coincidence that you ended up with the same learning rule. We'll talk a bit more about this later when we talk about generalized linear models; this is one of the most elegant things about generalized linear models that we'll see later - that even though you're using a different model, you end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool. One last comment, as part of, sort of, the learning process: over here I said I take the derivatives and I ended up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at our lecture notes, where I wrote out the entirety of this derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get this formula over here. Okay? By the way, for those who are interested in seriously mastering the machine learning material: when you go home and look at the lecture notes, it will actually be very easy for most of you to read through every line and go, yep, that makes sense, that makes sense, that makes sense, and, sort of, say, cool, I see how you get this line. But you want to make sure you really understand the material. My concrete suggestion to you would be to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can derive it yourself, right? In general, that's usually good advice for studying technical material like machine learning: if you work through a proof and you think you understood every line, the way to make sure you really understood it is to cover it up and see if you can rederive the entire thing yourself. I did this a lot when I was trying to study various pieces of machine learning theory and various proofs, and it's actually a great way to study: cover up the derivations and see if you can do them yourself without looking at the original derivation. All right.
I probably won't get to Newton's Method today. I just want to take one quick digression to talk about one more algorithm, which the discussion was sort of alluding to earlier, which is the perceptron algorithm. So I'm not going to say a whole lot about the perceptron algorithm, but this is something that we'll come back to later - later this quarter, when we talk about learning theory. So in logistic regression we said that G of Z gives, sort of, hypothesis output values that are real numbers between zero and one. The question is, what if you want to force the output to be either zero or one? So the perceptron algorithm defines G of Z to be this. So the picture - or the cartoon - is that, rather than the sigmoid function, G of Z now looks like the step function that you were asking about earlier. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So everything is exactly the same as before, except that G of Z is now the step function. It turns out there's a learning rule, called the perceptron learning rule, that looks just the same as the classic gradient ascent rule for logistic regression. And the learning rule is given by this. Okay? So it looks just like the classic gradient ascent rule for logistic regression. But this is a very different flavor of algorithm than least squares regression and logistic regression, and, in particular, because it outputs only values that are either zero or one, it turns out it's very difficult to endow this algorithm with probabilistic semantics. And, again - oh, excuse me, right there, okay - even though this learning rule looks cosmetically very similar to what we have in logistic regression, this is actually a very different type of learning rule than the others we've seen in this class. But this is such a simple learning algorithm, right? It just computes theta transpose X, you threshold, and your output is zero or one. So this is a simpler algorithm than logistic regression, I think.
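Here is a minimal sketch of the perceptron's threshold and its update on a single example, assuming 0/1 labels and the same update form as above; the example values are hypothetical.

```python
import numpy as np

def step(z):
    """Hard threshold used by the perceptron: g(z) = 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def perceptron_update(theta, x, y, alpha=0.1):
    """One perceptron update on a single example:
    theta_j := theta_j + alpha * (y - h_theta(x)) * x_j,
    where h_theta(x) = step(theta^T x). Same shape as the logistic-regression
    rule, but with the hard threshold in place of the sigmoid."""
    h = step(theta @ x)
    return theta + alpha * (y - h) * x

# Hypothetical single example with intercept term x0 = 1; labels are 0 or 1.
theta = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
y = 0                                  # currently misclassified: step(0) = 1
theta = perceptron_update(theta, x, y)
print(theta)                           # [-0.1, -0.2, 0.1]: moves away from x
```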
When we talk about learning theory later in this class, the simplicity of this algorithm will let us come back and use it as a building block. Okay? But that's all I want to say about this algorithm for now.