This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning and welcome back to the third lecture of this class. So here's what I want to do today, and some of the topics I cover today may seem like I'm jumping from topic to topic, but here's the outline for today and the logical flow of ideas. In the last lecture we talked about linear regression, and today I want to talk about an adaptation of that called locally weighted regression. It's a very powerful algorithm that's actually probably one of my former mentor's favorite machine learning algorithms. We'll then talk about a probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression; take a brief digression to tell you about something called the perceptron algorithm, which is something we'll come back to again later this quarter; and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So let's just recap what we were talking about in the previous lecture. Remember, the notation I defined was that I used x superscript i, y superscript i to denote the ith training example. And when we're talking about linear regression, or ordinary least squares, we use h of x^(i) to denote the predicted value output by my hypothesis h on the input x^(i). And my hypothesis was parameterized by the vector of parameters theta, and so we said that this was equal to the sum over j of theta_j times x_j^(i), written more simply as theta transpose x. And we had the convention that x subscript zero is equal to one, so this accounts for the intercept term in our linear regression model. And lowercase n here was the notation I was using for the number of features in my training set. Okay? So in the example of trying to predict housing prices, we had two features, the size of the house and the number of bedrooms. We had two features, and so little n was equal to two.
So, just to finish recapping the previous lecture, we defined this quadratic cost function J of theta equals one-half times the sum from i equals one to m of h_theta of x^(i) minus y^(i), squared, where this is the sum over the m training examples in my training set. So lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture we derived, in closed form, the value of theta that minimizes this, which was X transpose X, inverse, times X transpose Y. Okay?

So as we move on in today's lecture, I'll continue to use this notation and, again, I realize this is a fair amount of notation to remember, so if partway through this lecture you're having trouble remembering what lowercase m is or what lowercase n is or something, please raise your hand and ask.

When we talked about linear regression last time we used two features. One of the features was the size of the house in square feet, so the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you apply a machine learning algorithm to some problem that you care about, the choice of the features will very much be up to you, right? And the way you choose the features to give the learning algorithm will often have a large impact on how well it actually does. So, just for example, the choice we made last time was x1 equals the size, and let's leave aside the feature of the number of bedrooms for now; let's say we don't have data that tells us how many bedrooms are in these houses. One thing you could do is actually define - oh, let's draw this out. So say that axis is the size of the house and that axis is the price of the house. So if you use this as a feature, maybe you get theta zero plus theta one x1, this, sort of, linear model. If you choose - let me just copy the same data set over, right? You can define a set of features where x1 is equal to the size of the house and x2 is the square of the size of the house. Okay?
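As a concrete sketch of both the closed-form fit recapped above and the squared-size feature just defined, here is a minimal NumPy example. The house sizes and prices are made up purely for illustration and are not data from the lecture.

```python
import numpy as np

# Made-up training set: living area in thousands of square feet, price in $1000s.
sizes  = np.array([1.416, 1.600, 2.104, 2.400, 3.000])
prices = np.array([232.0, 330.0, 400.0, 369.0, 540.0])

# Features: x0 = 1 (intercept convention), x1 = size, x2 = size squared.
X = np.column_stack([np.ones_like(sizes), sizes, sizes ** 2])
y = prices

# Closed-form least squares: theta = (X^T X)^{-1} X^T y.
# np.linalg.solve is used rather than forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Predict h_theta(x) = theta^T x for a hypothetical 2,500 sq ft house.
x_query = np.array([1.0, 2.5, 2.5 ** 2])
print("theta:", theta, "predicted price:", x_query @ theta)
```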
So x1 is the size of the house in, say, square footage, and x2 is - just take whatever the square footage of the house is and square that number, and this would be another way to come up with a feature. And if you do that, then the same algorithm will end up fitting a quadratic function for you: theta zero plus theta one x plus theta two x squared. Okay? Because this x squared is actually x2. And depending on what the data looks like, maybe this is a slightly better fit to the data.

You can actually take this even further, right? Which is - let's see, I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model theta zero plus theta one x plus theta two x squared, plus up to theta six x to the power of six - a sixth-order polynomial - to these seven data points. And if you do that, you find that you come up with a model that fits your data exactly. This is where, I guess, in this example I drew, we have seven data points, so if you fit a sixth-order polynomial you can, sort of, fit a curve that passes through these seven points perfectly, and you'll probably find that the curve you get will look something like that. And on the one hand, this is a great model in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model in the sense that none of us seriously think that this is a very good predictor of housing prices as a function of the size of the house, right? So we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best. Whereas the linear model - it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing. So we'll actually come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, that use too small a set of features, or models that are too complex and maybe use too large a set of features.
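Here is a small sketch, again with made-up numbers, of the point just made: with seven training points, a sixth-order polynomial drives the training error to zero by passing through every point, while lower-order fits do not. The helper fit_poly and the data are hypothetical, chosen only to illustrate the behavior described.

```python
import numpy as np

# Seven made-up (size, price) points, standing in for the seven houses drawn on the board.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.2, 1.9, 2.1, 2.6, 3.4, 3.1, 4.2])

def fit_poly(x, y, degree):
    """Least-squares fit of theta for features [1, x, x^2, ..., x^degree]."""
    X = np.vander(x, degree + 1, increasing=True)   # columns are x^0 .. x^degree
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, theta

for degree in (1, 2, 6):
    X, theta = fit_poly(x, y, degree)
    print(f"degree {degree}: max training error = {np.max(np.abs(X @ theta - y)):.4f}")
# With 7 points, the degree-6 fit makes the training error (numerically) zero:
# it interpolates every point, which is the overfitting behavior described above.
```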
Just to give these a name, we call this first one the problem of underfitting and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit. And this problem here we refer to as overfitting and, again, very informally, this is when the algorithm is fitting the idiosyncrasies of this specific data set. It just so happens that, of the seven houses we sampled in Portland, or wherever you collected data from, that house happens to be a bit more expensive, that house happened to be a bit less expensive, and by fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties of this data set, rather than the true underlying trend of how housing prices vary as a function of the size of the house. Okay? So these are two very different problems. We'll define them more formally later and talk about how to address each of these problems, but for now I hope you appreciate that there is this issue of selecting features.

So if you want to choose features for a learning problem like this, there are a few ways to do so. We'll talk about feature selection algorithms later this quarter as well - so, automatic algorithms for choosing what features you use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate the need somewhat for you to choose features very carefully. Okay? And this leads us into our discussion of locally weighted regression.

And just to define the term: linear regression, as we've defined it so far, is an example of a parametric learning algorithm. A parametric learning algorithm is one that's defined as an algorithm that has a fixed number of parameters that are fit to the data. Okay? So in linear regression we have a fixed set of parameters theta, right, that must be fit to the data. In contrast, what I'm going to talk about now is our first non-parametric learning algorithm.
The formal definition is not very intuitive, so I'll also give a second, say, more intuitive one. The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as: the number of parameters grows linearly with the size of the training set. That's the formal definition. A slightly less formal definition is that the amount of stuff that your learning algorithm needs to keep around will grow linearly with the training set or, another way of saying it, this is an algorithm where we'll need to keep around the entire training set, even after learning. Okay? So don't worry too much about this definition. But what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression. Which also goes by a couple of other names - which also goes by the name of LOESS, for historical reasons. LOESS is usually spelled L-O-E-S-S, sometimes spelled lowess, too. I just call it locally weighted regression.

So here's the idea. This will be an algorithm that allows us to worry a little bit less about having to choose features very carefully. So, for my motivating example, let's say that I have a training set that looks like this, okay? So this is x and that's y. If you run linear regression on this and you fit maybe a linear function to this, you end up with a more or less flat, straight line, which is not a very good fit to this data. You can sit around and stare at this and try to decide whether the features are right. So maybe you want to toss in a quadratic function, but this isn't really quadratic either. So maybe you want to model this as x plus x squared plus maybe some function of sine of x or something. You can actually sit around and fiddle with features, and after a while you can probably come up with a set of features that models the data okay, but let's talk about an algorithm that you can use without needing to do that.
So - well, suppose you want to evaluate your hypothesis h at a certain query point, lowercase x. Okay? And let's say you want to know what's the predicted value of y at this position of x, right? So for linear regression, what we were doing was: we would fit theta to minimize the sum over i of y^(i) minus theta transpose x^(i), squared, and return theta transpose x. Okay? So that was linear regression.

In contrast, in locally weighted linear regression you're going to do things slightly differently. You're going to look at this point x, and then I'm going to look in my data set and take into account only the data points that are, sort of, in the little vicinity of x. Okay? So I'll look at where I want to evaluate my hypothesis. I'm going to look only in the vicinity of this point where I want to evaluate my hypothesis, and then I'm going to take, let's say, just these few points, and I will apply linear regression to fit a straight line just to this subset of the data. Okay? I'm using this term subset - well, let's come back to that later. So we take this subset of the data and I fit a straight line to it, and maybe I get a straight line like that. And what I'll do is then evaluate this particular straight line at the query point, and that will be the value my algorithm returns. This would be the predicted value - this would be the value that my hypothesis outputs in locally weighted regression. Okay?

So let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize the sum over i of w^(i) times y^(i) minus theta transpose x^(i), squared, where these terms w superscript i are called weights. There are many possible choices for the weights; I'm just going to write one down: w^(i) equals e to the minus, x^(i) minus x, squared, over two. So let's look at what these weights really are, right? So notice that - suppose you have a training example x^(i)
such that x^(i) is very close to x, so this term is small, right? Then if x^(i) minus x is small, so if x^(i) minus x is close to zero, then this is e to the minus zero, and e to the zero is one. So if x^(i) is close to x, then w^(i) will be close to one. In other words, the weight associated with the ith training example will be close to one if x^(i) and x are close to each other. Conversely, if x^(i) minus x is large, then - I don't know, what would w^(i) be?

Student: Zero.

Instructor: Zero, right. Close to zero. Right. So if x^(i) is very far from x, then this is e to the minus of some large number, and e to the minus of some large number will be close to zero. Okay?

So the picture is: if I'm querying at a certain point x, shown on the x-axis, and if my data set, say, looks like that, then I'm going to give the points close to this a large weight and give the points far away a small weight. So for the points that are far away, w^(i) will be close to zero, and so the points that are far away will not contribute much at all to this summation, right? So you can think of this as a sum over i of one times this quadratic term for nearby points, plus zero times this quadratic term for faraway points. And so the effect of using this weighting is that locally weighted linear regression fits a set of parameters theta paying much more attention to fitting the points close by accurately, while ignoring the contribution from faraway points. Okay? Yeah?

Student: Why is the weight exponential [inaudible]?

Instructor: Yeah, let's see. So it turns out there are many other weighting functions you can use. It turns out that there are definitely different communities of researchers that tend to choose different choices by default. There is somewhat of a literature on debating exactly what function to use. This, sort of, exponential decay function happens to be a reasonably common one that seems to be a reasonable choice on many problems, but you can actually plug in other functions as well. Did I mention what [inaudible] is yet?
For those of you that are familiar with the normal distribution, or the Gaussian distribution - this formula I've written out here cosmetically looks a bit like a Gaussian distribution. Okay? But this actually has absolutely nothing to do with the Gaussian distribution. So this is not saying that x^(i) is Gaussian or whatever; there is no such interpretation. This is just a convenient function that happens to be a bell-shaped function, but don't endow this with any Gaussian semantics. Okay?

So, in fact - well, if you remember the familiar bell-shaped Gaussian - again, the way the weights are associated with these points is that if you imagine putting a bell-shaped bump centered around the position where you want to evaluate your hypothesis h, then this point here gets a weight that's proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at that point. And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay? And so training examples that are really far away get a very small weight.

One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function. This parameter tau is called the bandwidth parameter and, informally, it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess. So if tau is very small - if that's a query point x - then you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points far away fall off rapidly. Whereas if tau is large, then you end up choosing a weighting function that falls off relatively slowly with distance from your query. Okay?
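As a sketch of the procedure described so far, here is one way locally weighted linear regression could be implemented for a single input feature, assuming the bell-shaped weights with bandwidth tau and a weighted least-squares solve. The function name, the made-up data, and the choice tau = 0.8 are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau):
    """Locally weighted linear regression prediction at a single query point.

    X : (m, n+1) design matrix whose first column is the intercept x0 = 1.
    y : (m,) targets.
    tau : bandwidth; smaller tau means the weights fall off faster with distance.
    """
    # Bell-shaped weights w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)),
    # computed on the non-intercept feature(s).
    dists = np.sum((X[:, 1:] - x_query[1:]) ** 2, axis=1)
    w = np.exp(-dists / (2.0 * tau ** 2))

    # Weighted least squares: theta = (X^T W X)^{-1} X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Made-up, wiggly one-dimensional data set for illustration.
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.5 * x + 0.1 * np.random.randn(50)
X = np.column_stack([np.ones_like(x), x])

x_q = np.array([1.0, 4.0])                 # query point, with the intercept term
print(lwr_predict(x_q, X, y, tau=0.8))     # a new fit is run for each query point
```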
So I hope you can, therefore, see that if you apply locally weighted linear regression to a data set that looks like this, then to ask what your hypothesis outputs at a point like this, you end up having a straight line making that prediction. To ask what this [inaudible] at that value, you put a straight line there and you predict that value. It turns out that every time you try to evaluate your hypothesis - every time you ask your learning algorithm to make a prediction for how much a new house costs, or whatever - you need to run a new fitting procedure and then evaluate the line that you fit just at the position of the value of x, the position of the query where you're trying to make a prediction. Okay? But if you do this for every point along the x-axis, then you find that locally weighted regression is able to trace out this, sort of, very non-linear curve for a data set like this. Okay?

So in the problem set we're actually going to let you play around more with this algorithm, so I won't say too much more about it here. But to finally move on to the next topic, let me check what questions you have. Yeah?

Student: It seems like you still have the same problem of overfitting and underfitting, like when you choose your tau, like if you make it too small in your -

Instructor: Yes, absolutely. Yes. So locally weighted regression can run into - locally weighted regression is not a panacea for the problem of overfitting or underfitting. You can still run into the same problems with locally weighted regression. What you just said about - some of these things I'll leave you to discover for yourself in the homework problem; you'll actually see what you just mentioned. Yeah?

Student: It almost seems like you're not even thoroughly [inaudible] with this locally weighted, you had all the data that you originally had anyway. I'm just trying to think of [inaudible] the original data points.

Instructor: Right. So the question is, sort of - it's almost as if you're not building a model, because you need the entire data set.
And the other way of saying that is that this is a non-parametric learning algorithm. So this - I don't know, I won't debate whether, you know, we're really building a model or not, but this is perfectly fine. So, I think, when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. So it actually works - we've actually used this quite successfully to model, sort of, the dynamics of this autonomous helicopter we have. Yeah?

Student: I wanted to ask, does this algorithm learn the weights based on the data?

Instructor: Learn what weights?

Student: Oh, the weights w^(i), instead of using [inaudible].

Instructor: I see, yes. So it turns out there are a few things you can do. One thing that is quite common is choosing this bandwidth parameter tau, right, using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question.

Student: I used [inaudible] Gaussian sometimes if you [inaudible] Gaussian and then -

Instructor: Oh, I guess. Let's see. Boy. The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In fact, it turns out that I happened to choose this, sort of, bell-shaped function to define my weights, but it's actually fine to choose a function that doesn't even integrate to one, that integrates to infinity, say, as your weighting function. So in that sense, I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful, especially since you can use other functions that integrate to infinity and don't integrate to one. Okay? That's the last question, and let's move on.

Student: Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict the price for each house, and so for each input - you'd be redoing this constantly for -

Instructor: Yes, you're right. So because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again.
So you're actually right. If you have a very large training set, then this is a somewhat expensive algorithm to use, because every time you want to make a prediction you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets as well. I don't want to talk about that today; if you're interested, look up the work of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much more efficiently. That's not something I want to go into today. Okay? Let me move on; let's take more questions later.

So, okay. So that's locally weighted regression. Remember the outline I had, I guess, at the beginning of this lecture. What I want to do now is talk about a probabilistic interpretation of linear regression, all right? And, in particular, it'll be this probabilistic interpretation that lets us move on to talk about logistic regression, which will be our first classification algorithm.

So let's put aside locally weighted regression for now; we'll just talk about ordinary, unweighted linear regression. Let's ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the values y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to "justify" why we're minimizing the sum of squared errors. Okay? It turns out that there are many sets of assumptions that are sufficient to justify why we do least squares, and this is just one of them. So keep in mind that I'm presenting just one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions; so even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will, sort of, help give one rationalization, like, one reason, for doing least squares regression.
And, in particular, what I'm going to do is endow the least squares model with probabilistic semantics. So let's assume, in our example of predicting housing prices, that the price the house is sold for is some linear function of the features, plus some term epsilon^(i). Okay? And epsilon^(i) will be our error term. You can think of the error term as capturing unmodeled effects - like maybe there are some other features of a house, like how many fireplaces it has or whether there's a garden or whatever, additional features that we just failed to capture - or you can think of epsilon as random noise. Epsilon is our error term that captures both these unmodeled effects, just things we forgot to model, maybe the function isn't quite linear or something, as well as random noise, like maybe that day the seller was in a really bad mood and so just refused to sell for a reasonable price or something.

And now I will assume that the errors have a probability distribution. I'll assume that the errors epsilon^(i) are distributed - I'll write this as: epsilon^(i) is distributed according to a Gaussian distribution with mean zero and variance sigma squared. Okay? So let me just note, the script N here stands for normal, right? To denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared.

Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool, most of you. Great, almost everyone. So, in other words, the density for the Gaussian is what you've seen before: the density for epsilon^(i) would be one over root two pi sigma, times e to the negative epsilon^(i) squared over two sigma squared, right? And the density of our epsilon^(i) will be this bell-shaped curve with one standard deviation being, sort of, sigma. Okay?
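For reference, the model and the density just described read, in standard notation (a reconstruction of what is presumably on the board, not a verbatim copy):

```latex
y^{(i)} = \theta^{T} x^{(i)} + \epsilon^{(i)}, \qquad
\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2}), \qquad
p\!\left(\epsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\left(\epsilon^{(i)}\right)^{2}}{2\sigma^{2}}\right)
```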
This is the form of that bell-shaped curve. So, let's see, I can erase that. Can I erase the board?

So this implies that the probability distribution of the price of a house, given x^(i) and the parameters theta, is going to be Gaussian with that density. Okay? In other words, what this is saying is that the price of a house, given the features of the house and my parameters theta, is going to be a random variable that's distributed Gaussian with mean theta transpose x^(i) and variance sigma squared. Right? Because we imagine that the way the housing prices are generated is that the price of a house is equal to theta transpose x^(i) plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose x^(i) and variance sigma squared, right? Does this make sense? Raise your hand if this makes sense. Yeah, okay, lots of you.

One point of notation - oh, yes?

Student: Assuming we don't know anything about the error, why do you assume here that the error is Gaussian?

Instructor: Right. So, boy, why do I assume the error is Gaussian? Two reasons, right? One is that it turns out to be mathematically convenient to do so, and the other is, I don't know, I can also mumble about justifications, such as appealing to the central limit theorem. It turns out that, for the vast majority of problems, if you apply a linear regression model like this and try to measure the distribution of the errors - not all the time, but very often - you find that the errors really are Gaussian, that this Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables will tend towards a Gaussian.
So if the error is caused by many effects, like the mood of the seller, the mood of the buyer, some other features that we missed, whether the place has a garden or not, and if all of these effects are independent, then by the central limit theorem you might be inclined to believe that the sum of all these effects will be approximately Gaussian. In practice, I guess, the two real answers are that, one, in practice this is actually a reasonably accurate assumption, and, two, it turns out to be mathematically convenient to do so. Okay? Yeah?

Student: It seems like we're saying that if we assume the error around our model has zero mean, then the data is centered around our model, which seems almost like we're trying to assume what we're trying to prove.

Instructor: That's the [inaudible] but, yes, you are assuming that the error has zero mean. Which is, yeah, right. I think later this quarter we'll get to some of the other things, but for now just think of this as a mathematically - it's actually not an unreasonable assumption. I guess, in machine learning, almost none of the assumptions we make are ever true in the absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents, and you never have fractions of cents in housing prices, whereas a Gaussian random variable would allow that. So in that sense, the assumptions we make are never "absolutely true," but for practical purposes this is an accurate enough assumption that it'll be useful to make. Okay? I think in a week or two we'll actually come back and say a bit more about the assumptions we make and when they help our learning algorithms and when they hurt our learning algorithms. We'll say a bit more about it when we talk about generative and discriminative learning algorithms, like, in a week or two. Okay?

So let's point out one bit of notation, which is that when I wrote this down I actually wrote P of y^(i) given x^(i) and then semicolon theta, and I'm going to use this notation when we are not thinking of theta as a random variable.
So in statistics this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta that's out there generating the data, but we don't know what theta is, and theta is not a random variable, right? So it's not like there's some random value of theta out there; it's that there's some true value of theta out there, it's just that we don't know what the true value of theta is. So if theta is not a random variable, then I'm going to avoid writing P of y^(i) given x^(i) comma theta, because this would mean the probability of y^(i) conditioned on x and theta, and you can only condition on random variables. So in this part of the class, where we're taking, sort of, the frequentist viewpoint rather than the Bayesian viewpoint - in this part of the class we're thinking of theta not as a random variable, but just as something we're trying to estimate - we use the semicolon notation. So the way to read this is: this is the probability of y^(i) given x^(i) and parameterized by theta. Okay? So you read the semicolon as "parameterized by." And in the same way here, I'll say y^(i), given x^(i) and parameterized by theta, is distributed Gaussian with that density. All right.

So we're going to make one more assumption. Let's assume that the error terms are IID, okay? Which stands for independently and identically distributed. So I'm going to assume that the error terms are independent of each other, right? The identically distributed part just means that they all come from the same Gaussian distribution with the same variance, but the more important part is that I'm assuming the epsilon^(i)'s are independent of each other.

Now, let's talk about how to fit a model. The probability of Y given X, parameterized by theta - I'm actually going to give this another name. I'm going to write this down and we'll call this the likelihood of theta: the likelihood of theta is the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, like that.
Which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?

In terms of notation, I guess, I define this term here to be the likelihood of theta. And the likelihood of theta is just the probability of the data Y, right, given X and parameterized by theta. The terms likelihood and probability are often confused. So the likelihood of theta is the same thing as the probability of the data you saw - so likelihood and probability are, sort of, the same thing - except that when I use the term likelihood, I'm trying to emphasize that I'm taking this thing and viewing it as a function of theta. Okay? So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, it's then called the likelihood. Okay? So hopefully you'll hear me say "the likelihood of the parameters" and "the probability of the data," right, rather than "the likelihood of the data" or "the probability of the parameters." So try to be consistent in that terminology.

So, given that the probability of the data is this, and this is also the likelihood of the parameters, how do you estimate the parameters theta? So, given a training set, what parameters theta do you want to choose for your model? Well, the principle of maximum likelihood estimation says, right, that you choose the value of theta that makes the data as probable as possible. So: choose theta to maximize the likelihood. Or, in other words, choose the parameters that make the data as probable as possible, right? So this is maximum likelihood estimation: choose the parameters that make it as likely, as probable, as possible for me to have seen the data I just did.

So, for mathematical convenience, let me define lowercase l of theta. This is called the log likelihood function, and it's just the log of capital L of theta. So this is the log of the product over i of one over root two pi sigma, times e to that exponent; I won't bother to write out what's in the exponent for now, it's just the same expression from the previous board.
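For reference, the board expressions being referred to are presumably the following, written out in standard notation and using the independence assumption across training examples (a reconstruction, not a verbatim copy of the board):

```latex
L(\theta) = p(\vec{y} \mid X; \theta)
          = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
          = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\!\left(-\frac{\left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right),
```

```latex
\ell(\theta) = \log L(\theta)
             = m \log\frac{1}{\sqrt{2\pi}\,\sigma}
               - \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
                 \sum_{i=1}^{m} \left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}.
```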
The log of a product is the same as the sum of the logs, right? So it's the sum of the logs of these terms, which simplifies to m times the log of one over root two pi sigma, plus - and then the log and the exponentiation cancel each other, right? So the log of e to something is just whatever is inside the exponent. So, you know what, let me write this on the next board. Okay.

So maximizing the likelihood, or maximizing the log likelihood, is the same as minimizing that term over there. Well, you get it, right? Because there's a minus sign, maximizing this is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta, right? So what we've just shown is that the ordinary least squares algorithm, that we worked on in the previous lecture, is just maximum likelihood, assuming this probabilistic model, assuming IID Gaussian errors on our data. Okay?

One thing that we'll actually use in the next lecture: notice that the value of sigma squared doesn't matter, right? Somehow, no matter what the value of sigma squared is - I mean, sigma squared has to be a positive number, it's the variance of a Gaussian - so no matter what sigma squared is, since it's a positive number, the value of theta we end up with will be the same, right? Because minimizing this, you get the same value of theta no matter what sigma squared is. So it's as if, in this model, the value of sigma squared doesn't really matter. Just remember that for the next lecture; we'll come back to this again.

Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah?
Any questions about this? Actually, let me clean up another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah?

[Student] I think here you're trying to measure the likelihood of your hypothesis theta by a fraction of error, but I think what you measure depends on the family of theta too - for example, if you have a lot of parameters [inaudible] or overfitting?

[Instructor] Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think the things you're mentioning are maybe deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into that right now. Any more questions? Okay.

So this endows linear regression with a probabilistic interpretation. I'm actually going to use this probabilistic interpretation in order to derive our next learning algorithm, which will be our first classification algorithm. Okay? So you'll recall that I said that regression problems are where the variable Y that you're trying to predict is continuous-valued. Now I'm actually going to talk about our first classification problem, where the value Y you're trying to predict is discrete-valued - it can take on only a small number of discrete values - and in this case I'll talk about binary classification, where Y takes on only two values, right? So you come up with classification problems if you're trying to do, say, a medical diagnosis, and you try to decide based on some features whether the patient has a disease or does not have a disease. Or, in the housing example, maybe you're trying to decide: will this house sell in the next six months or not? And the answer is either yes or no; it'll either be sold in the next six months or it won't be. Other standard examples: if you want to build a spam filter - is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash, so you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes-or-no answer. So there's X, there's Y, and in a classification problem Y takes on two values, zero and one. That's it in binary classification. So what can you do?
Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem, right? So, you know, given this data set, you can fit a straight line to it. Maybe you get that straight line, right? But this data set I've drawn is an amazingly easy classification problem. It's pretty obvious to all of us what the relationship between X and Y is: you just look at a value around here, and to the right Y is one, to the left Y is zero. So you apply linear regression to this data set, you get a reasonable fit, and you can then maybe take your linear regression hypothesis - this straight line - and threshold it at 0.5. If you do that, you'll certainly get the right answer: you predict that if X is to the right of, sort of, the midpoint here then Y is one, and if X is to the left of that midpoint then Y is zero.

So some people actually do this - apply linear regression to classification problems - and sometimes it'll work okay, but in general it's actually a pretty bad idea to apply linear regression to classification problems like these, and here's why. Let's say I change my training set by giving you just one more training example, all the way up there, right? Given this training set, it's actually still entirely obvious what the relationship between X and Y is: if X is greater than this value, Y is one, and if it's less, Y is zero. Giving you this additional training example really shouldn't change anything - I didn't really convey much new information; there's no surprise that this point corresponds to Y equals one. But if you now fit linear regression to this data set, you end up with a line that, I don't know, maybe looks like that, right? And now the predictions of your hypothesis have changed completely, if you threshold your hypothesis at 0.5. Okay?

[Student] In between there might be an interval where it's zero, right? For that far-off point?

[Instructor] Oh, you mean, like that? Right. Yeah, yeah, fine. Yeah, sure - a theta set like that. So, I guess - yes, you're right, but this is an example, and this example works.

[Student] [Inaudible] That would change it even more if you gave it all -

[Instructor] Yeah. Then I think this actually would make it even worse; you would actually get a line that pulls out even further, right? So this is my example - I get to make it whatever I want, right? But there's not a deep meaning to this. The point is just that it could be a really bad idea to apply linear regression to classification problems. Sometimes it works fine, but usually I wouldn't do it.
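As a quick numerical sketch of this point - my own illustration, not something from the lecture, using made-up one-dimensional data - the snippet below fits least squares to binary labels, thresholds the fitted line at 0.5, and shows how one additional far-off positive example shifts the decision point even though it carries no surprise:

    import numpy as np

    def lstsq_decision_point(x, y):
        """Fit y ~ theta0 + theta1 * x by least squares and return the x
        where the fitted line crosses 0.5 (the implied decision boundary)."""
        X = np.column_stack([np.ones_like(x), x])       # x0 = 1 intercept column
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form least squares
        return (0.5 - theta[0]) / theta[1]

    # Hypothetical data: negatives on the left, positives on the right.
    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    print(lstsq_decision_point(x, y))                   # about 4.5, between the classes

    # Add one unsurprising but far-away positive example at x = 30.
    x2, y2 = np.append(x, 30.0), np.append(y, 1.0)
    print(lstsq_decision_point(x2, y2))                 # moves right, to roughly 5.8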
So a couple of problems with this. One is that - well, what do you want to do for classification? If you know the value of Y lies between zero and one, then to kind of fix this problem let's just start by changing the form of our hypothesis, so that my hypothesis always lies in the unit interval between zero and one. Okay? So if I know Y is either zero or one, then let's at least not have my hypothesis predict values much larger than one or much smaller than zero.

And so, instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose this function: H subscript theta of X is going to equal G of theta transpose X, where G is going to be this function, and so this becomes one over one plus e to the negative theta transpose X. And G of Z is called the sigmoid function, and it is often also called the logistic function. It goes by either of these names. And what G of Z looks like is the following: on the horizontal axis I'm going to plot Z, and so G of Z will look like this. Okay? I didn't draw that very well. Okay. So G of Z tends towards zero as Z becomes very small, and G of Z will ascend towards one as Z becomes large, and it crosses the vertical axis at 0.5. So this is the sigmoid function, also called the logistic function.
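As a small sketch in code - my own notation, not anything written on the board - the hypothesis just described is:

    import numpy as np

    def sigmoid(z):
        """Sigmoid / logistic function g(z) = 1 / (1 + e^(-z));
        tends to 0 for very negative z, to 1 for large z, and g(0) = 0.5."""
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        """Logistic-regression hypothesis h_theta(x) = g(theta^T x),
        which always lies strictly between 0 and 1."""
        return sigmoid(np.dot(theta, x))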
Yeah? Question?

[Student] Why the sigmoid, rather than a step [inaudible]?

[Instructor] Say that again.

[Student] Why can't we choose a step function instead? Like, wouldn't that be better for binary?

[Instructor] Yeah, let me come back to that later. So it turns out - where did I get this function from, right? I just wrote down this function. It actually turns out that there are two reasons for using this function that we'll come to. One is that when we talk about generalized linear models, we'll see that this falls out naturally as part of that broader class of models. And another reason, which we'll talk about next week - it turns out there are a couple of, I think, very beautiful reasons for why we choose logistic functions. We'll see that in a little bit. But for now let me just define it, and just take my word for it for now that this is a reasonable choice. Okay?

But notice now that the values output by my hypothesis will always be between zero and one. Furthermore, just like we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation, right? So I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X, all right? So in other words, I'm going to imagine that my hypothesis is outputting all these numbers that lie between zero and one, and I'm going to think of my hypothesis as trying to estimate the probability that Y is equal to one. Okay? And because Y has to be either zero or one, the probability of Y equals zero is going to be one minus that. All right?

So, more simply, it turns out you can actually take these two equations and write them more compactly. Write P of Y given X, parameterized by theta: this is going to be H subscript theta of X to the power of Y, times one minus H of X to the power of one minus Y. Okay? I know this looks somewhat bizarre, but it actually makes the derivation much nicer. If Y is equal to one, then this equation is H of X to the power of one, times something to the power of zero, and anything to the power of zero is just one, right?
So if Y equals one, then this is something to the power of zero, and so it's just one. So if Y equals one, this is just saying P of Y equals one is equal to H subscript theta of X. Okay? And in the same way, if Y is equal to zero, then this is P of Y equals zero equals this thing to the power of zero - so this disappears, it's just one - times this thing to the power of one. Okay? So this is a compact way of writing both of these equations together in one line.

So let's go on to parameter fitting, right? And, again, you can ask: given this model and my data, how do I fit the parameters theta of my model? So the likelihood of the parameters is, as before, just the probability of the data, right? Which is the product over i of P of Y^(i) given X^(i), parameterized by theta. Which is - just plugging those in. Okay? I dropped the theta subscript just so I can write a little bit less. Oh, excuse me - these should be X^(i)'s and Y^(i)'s. Okay?

So, as before, let's say we want to find the maximum likelihood estimate of the parameters theta. So we want to find the setting of the parameters theta that maximizes the likelihood L of theta. It turns out, very often - just when you work with the derivations - it is often much easier to maximize the log of the likelihood rather than maximize the likelihood itself. So the log likelihood, lowercase l of theta, is just the log of capital L. This will, therefore, be the sum of this. Okay? And so, to fit the parameters theta of our model, we'll find the value of theta that maximizes this log likelihood. Yeah?

[Student] [Inaudible]

[Instructor] Say that again.

[Student] Y^(i) is [inaudible].

[Instructor] Oh, yes. Thanks.
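Written out from the description above (a reconstruction of the board, using the compact form from a moment ago), the likelihood and log likelihood are:

\[
L(\theta) = \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
= \prod_{i=1}^{m} h_\theta\big(x^{(i)}\big)^{\,y^{(i)}}\,\big(1 - h_\theta(x^{(i)})\big)^{\,1 - y^{(i)}},
\]
\[
\ell(\theta) = \log L(\theta)
= \sum_{i=1}^{m} \Big[\, y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big].
\]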
So, having maximized this function - well, it turns out we can actually apply the same gradient descent algorithm that we learned; that was the first algorithm we used to minimize the quadratic function. And you remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. So we can actually use exactly the same algorithm to maximize the log likelihood. And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta, plus a learning rate alpha times the gradient of the cost function - here, of the log likelihood with respect to theta. Okay? One small change is that previously we were trying to minimize the quadratic error term, and today we're trying to maximize rather than minimize, so rather than having a minus sign we have a plus sign. So this is just gradient ascent, for maximization rather than minimization. We actually call this gradient ascent, and it's really the same algorithm.

So to figure out what this gradient is - in order to derive the gradient ascent update, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta j, right? It turns out that if you actually compute this partial derivative - so you take this formula, this L of theta, which is - oh, got that wrong too - if you take this lowercase l of theta, if you take the log likelihood of theta, and you take its partial derivative with respect to theta j, you find that this is equal to - let's see. Okay? And, I don't know, the derivation isn't terribly complicated, but in the interest of saving you watching me write down a couple of blackboards full of math, I'll just write down the final answer. The way you get it is you just take this, plug in the definition of H subscript theta as a function of X^(i), take derivatives, and work through the algebra, and it turns out it'll simplify down to this formula. Okay?
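The final answer referred to here, written for the log likelihood summed over the whole training set, works out to:

\[
\frac{\partial}{\partial \theta_j}\,\ell(\theta)
= \sum_{i=1}^{m} \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x_j^{(i)}.
\]

(Dropping the sum gives the per-example version used in a stochastic update.)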
And so what that gives you is that gradient ascent is the following rule: theta j gets updated as theta j plus alpha times this. Okay?

Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? Right. So when I worked out batch gradient descent for least squares regression, I actually wrote down exactly the same thing - or maybe there's a minus sign, and that's about it - but I actually had exactly the same learning rule last time for least squares regression, right? So is this the same learning algorithm, then? What's different? How come I was making all that noise earlier about least squares regression being a bad idea for classification problems, and then I did a bunch of math - and I skipped some steps - but I'm, sort of, claiming at the end they're really the same learning algorithm?

[Student] [Inaudible] constants?

[Instructor] Say that again.

[Student] [Inaudible]

[Instructor] Oh, right. Okay, cool. No, exactly, right - this is not the same. And the reason is, in logistic regression this is different from before, right? The definition of this H subscript theta of X^(i) is not the same as the definition I was using in the previous lecture. In particular, this is no longer theta transpose X^(i); this is not a linear function anymore. This is the logistic function of theta transpose X^(i). Okay? So even though this looks cosmetically similar - even though this is similar on the surface to the batch gradient descent rule I derived last time for least squares regression - this is actually a totally different learning algorithm. Okay? And it turns out it's actually no coincidence that you ended up with the same learning rule. We'll talk a bit more about this later, when we talk about generalized linear models; this is one of the most elegant properties of generalized linear models that we'll see later. Even though we're using a different model, you actually end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool.
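A minimal sketch of this update in code - my own illustration under made-up conventions (a design matrix X whose first column is the x0 = 1 intercept feature, labels y in {0, 1}), not the course's code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=1000):
        """Batch gradient ASCENT on the log likelihood l(theta).

        X: (m, d) array with a leading column of ones; y: (m,) array of 0/1 labels.
        """
        m, d = X.shape
        theta = np.zeros(d)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)             # h_theta(x^(i)) for every example
            gradient = X.T @ (y - h)           # sum_i (y^(i) - h_theta(x^(i))) x^(i)
            theta += (alpha / m) * gradient    # plus sign: ascent; dividing by m just rescales alpha
        return theta

Swapping the sigmoid for the identity function gives back exactly the least squares update from the previous lecture - same surface form, different hypothesis - which is the point being made above.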
One last comment, as part of, sort of, the learning process. Over here I said I take the derivatives and I ended up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at our lecture notes, where I wrote out the entirety of this derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get this formula over here. Okay?

By the way, for those who are interested in seriously mastering the machine learning material: when you go home and look at the lecture notes, it will actually be very easy for most of you to read through every line and go "yep, that makes sense, that makes sense, that makes sense," and, sort of, say "cool, I see how you get this line." But you want to make sure you really understand the material. My concrete suggestion to you would be to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can rederive it yourself, right? In general, that's usually good advice for studying technical material like machine learning: if you work through a proof and you think you understood every line, the way to make sure you really understood it is to cover it up and see if you can rederive the entire thing yourself. This is actually a great way to study - I did this a lot when I was trying to study various pieces of machine learning theory and various proofs: cover up the derivations and see if you can do them yourself without looking at the original.

All right. I probably won't get to Newton's Method today. I just want to take one quick digression to talk about one more algorithm, which the discussion was sort of alluding to earlier, which is the perceptron algorithm, right? I'm not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come back to later - later this quarter we'll talk about learning theory.
So in logistic regression, we said that G of Z - sort of, my hypothesis - output values that were numbers between zero and one. The question is: what if you want to force G of Z to output a value that is either zero or one? So the perceptron algorithm defines G of Z to be this. So the picture is - or the cartoon is - rather than the sigmoid function, G of Z now looks like the step function that you were asking about earlier. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So everything is exactly the same as before, except that G of Z is now the step function.

It turns out there's a learning rule, called the perceptron learning rule, that actually looks just the same as the gradient ascent rule for logistic regression. And the learning rule is given by this. Okay? So it looks just like the gradient ascent rule for logistic regression.

But this is a very different flavor of algorithm than least squares regression and logistic regression, and, in particular, because it outputs only values that are either zero or one, it turns out to be very difficult to endow this algorithm with probabilistic semantics. And this is - again, even though - oh, excuse me. Right there. Okay. Even though this learning rule, again, looks cosmetically very similar to what we have in logistic regression, this is actually a very different type of learning rule than the others that we'll see in this class. And this is such a simple learning algorithm, right? It just computes theta transpose X, then you threshold, and then your output is zero or one. So it's a simpler algorithm than logistic regression, I think. When we talk about learning theory later in this class, the simplicity of this algorithm will let us come back and use it as a building block. Okay? But that's all I want to say about this algorithm for now.
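As a sketch - again my own illustration, with the same made-up conventions as the earlier snippets (X has an x0 = 1 column, y holds 0/1 labels):

    import numpy as np

    def perceptron(X, y, alpha=0.1, num_epochs=10):
        """Perceptron learning rule: the update has the same form as before,
        but the hypothesis g(theta^T x) is a hard 0/1 threshold."""
        m, d = X.shape
        theta = np.zeros(d)
        for _ in range(num_epochs):
            for i in range(m):                            # one example at a time
                h_i = 1.0 if X[i] @ theta >= 0 else 0.0   # step-function hypothesis
                theta += alpha * (y[i] - h_i) * X[i]      # theta_j += alpha * (y - h) * x_j
        return theta

Because the output is thresholded to exactly zero or one, there is no natural probabilistic reading of the hypothesis here, which is the contrast with logistic regression noted above.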