(music)
this presentation is delivered by
the Stanford Center for
Professional Development.
Okay. Good morning and
welcome back to the third
lecture of this class. So here's
what I want to do today,
and some of
the topics I do today may seem a little bit like I'm jumping, sort of,
from topic to topic, but here's, sort of, the outline for today and the logical flow of ideas. In the last lecture, we
talked about
linear regression and today I want to talk about
sort of an adaptation of that called locally weighted regression. It's a very powerful
algorithm that's actually probably one of my former mentor's favorite machine
learning algorithms. We'll then
talk about a probabilistic interpretation of linear regression
and use that to move onto our first classification algorithm, which
is logistic regression; take
a brief digression to tell you about something called the perceptron algorithm,
which is something we'll come back to, again, later this quarter;
and
time allowing I hope to get to Newton's method, which is an algorithm for
fitting logistic regression models.
So, let's just recap what we talked about in the previous lecture.
remember the notation that I defined was that
I used this
X superscript i,
Y superscript i to denote the ith training example.
And
when we're talking about linear regression
or ordinary least squares,
we use this to denote
the predicted value
output by my hypothesis H on
the input X^i.
And my hypothesis
was parameterized by the vector of parameters theta, and so we said that this was equal to $\sum_{j=0}^{n} \theta_j x_j^{(i)}$, written more simply as $\theta^T x$.
And we had the convention that X
subscript zero is equal to one so
this accounts for the intercept term in our
linear regression model.
And lowercase n here
was the notation I was using for the
number of features in my training set. Okay? So in the
example trying to predict housing prices, we had two features, the size of
the house and the number of bedrooms.
We had two features and there was - little n was equal to two.
So just to
finish recapping the previous lecture, we
defined this quadratic cost function $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, where this is the sum over the m training examples in my training set. So lowercase
m
was the notation I've been using to denote the number of training examples I have and the
size of my training set.
And at the end of the last lecture, we derived the value of theta that minimizes this in closed form, which was $\theta = (X^T X)^{-1} X^T y$. Okay?
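Just to make that recap concrete, here is a minimal sketch in Python/NumPy of that closed-form solution; the housing numbers are made up purely for illustration and are not from the lecture.

```python
import numpy as np

# Toy training set: m = 4 houses. Columns are [x0 = 1 (intercept),
# size in square feet, number of bedrooms], so n = 2 features plus intercept.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 3.0],
              [1.0, 1416.0, 2.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])  # made-up prices (in $1000s)

# Normal equations: theta = (X^T X)^{-1} X^T y. Solving the linear system
# is numerically preferable to explicitly inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# The hypothesis h(x) = theta^T x, evaluated on a new house.
x_new = np.array([1.0, 1800.0, 3.0])
print(theta, x_new @ theta)
```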
So
as we move on in today's lecture, I'll continue to use this notation and, again,
I realize this is a fair amount of notation to all remember,
so if partway through this lecture you forgot - if you're having trouble remembering
what lowercase m is or what lowercase n is or something please raise your hand and ask. When
we talked about linear regression last time
we used two features. One of the features was
the size of the houses in square feet, so the living area of the house,
and the other feature was the number of bedrooms in the house.
In general, when you apply a machine learning algorithm to some problem that you care
about,
the choice of the features will very much be up to you, right? And
the way you choose your features to give the learning algorithm will often have a
large impact on how it actually does.
So just for example,
the choice we made last time was X1 equals the size. And let's leave aside the number-of-bedrooms feature for now; let's say we don't have data that tells us how many bedrooms are in these houses.
One thing you could do is actually define - oh, let's
draw this out.
And so, right? So say that
was the size of the house and that's the price of the house. So
if you use
this as a feature maybe you get $\theta_0 + \theta_1 x_1$ - this, sort of, linear model.
If you choose - let me just copy the
same data set over, right?
You can define the set of features where X1 is equal to the size of the house
and X2 is the
square
of the size
of the house. Okay?
So X1 is the size of the house in say square footage and X2 is
just take whatever the square footage of the house is and just
square that number, and this would be another way to come up with a feature,
and if you do that then the same algorithm will end up fitting a quadratic function for you, $\theta_0 + \theta_1 x_1 + \theta_2 x_1^2$. Okay? Because this $x_1^2$ is actually $x_2$. And
depending on what the data looks like, maybe this is a slightly
better fit to the data. You can actually
take this even further, right?
Which is - let's see.
I have seven training examples here, so you can actually fit up to a sixth-order polynomial. You can actually fit a model $\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_6 x^6$ - a sixth-order polynomial - to these seven data points.
And if you do that you find that
you come up with a model that fits your data exactly. This is where, I guess,
in this example I drew, we have seven data points, so if you fit a
sixth-order polynomial you can, sort of, fit a curve
that passes through these seven points perfectly.
And you probably find that the curve
you get will look something
like that.
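To see this concretely, here is a small sketch in Python/NumPy with seven made-up (size, price) points, fitting first-, second-, and sixth-order polynomials using the same least-squares machinery, just with hand-constructed features:

```python
import numpy as np

# Seven made-up (size, price) pairs with a roughly quadratic trend
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.1, 1.9, 3.2, 4.1, 5.8, 8.1, 10.2])

for degree in (1, 2, 6):
    # Features [1, x, x^2, ..., x^degree]: same linear regression
    # algorithm, just with extra hand-constructed polynomial features.
    X = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sq_err = np.sum((X @ theta - y) ** 2)
    # Degree 6 drives the training error to (essentially) zero:
    # the curve passes through all seven points exactly.
    print(degree, sq_err)
```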
And on the one hand, this is a great model in a sense that it
fits your training data perfectly.
On the other hand, this is probably not a
very good model in the sense that
none of us seriously think that this is a very good predictor of housing
prices as a function of the size of the house, right? So
we'll actually come back to this later. It turns out, of the models we have here, I feel like maybe the quadratic model fits the data best.
Whereas
the linear model looks like there's actually a bit of a quadratic component in this
data
that the linear function is not capturing.
So we'll actually come back to this a little bit later and talk about the problems
associated with fitting models that are either too simple, use too small a set of
features, or
models that are too complex and maybe
use too large a set of features.
Just to give these a
name,
we call this
the problem of underfitting
and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is just failing to fit.
And this problem here we refer to as
overfitting
and, again, very informally,
this is when the algorithm is fitting the idiosyncrasies of this specific data set,
right? It just so happens that
of the seven houses we sampled in Portland, or wherever you collect data from,
that house happens to be a bit more expensive, that house happened to be a bit less
expensive, and by
fitting a sixth-order polynomial we're, sort of, fitting the idiosyncratic properties
of this data set,
rather than the true underlying trends
of how housing prices vary as a function of the size of the house. Okay?
So these are two very different problems. We'll define them more formally later
and talk about how to address each of these problems,
but for now I
hope you appreciate that there is this issue of selecting features. So if you want to address this in the learning problems you work on, there are a few ways to do so.
We'll talk about
feature selection algorithms later this quarter as well. So automatic algorithms
for choosing
what features you use in a
regression problem like this.
What I want to do today is talk about a class of algorithms
called non-parametric learning algorithms that will help
to alleviate the need somewhat for you to choose features very carefully. Okay?
And this leads us into our discussion of locally weighted regression.
And just to define the term,
linear regression, as we've defined it so far, is an example of a parametric learning
algorithm. A parametric learning algorithm is one that's defined as an algorithm that has a fixed number of parameters that we fit to the data. Okay? So
in linear regression
we have a fixed set of parameters theta, right, that we fit to the data.
In contrast, what I'm gonna talk about now is our first non-parametric learning algorithm. The formal definition is not very intuitive, so I've supplemented it with a second, say, more intuitive one.
The, sort of, formal definition of a non-parametric learning algorithm is that it's an algorithm where the number of parameters grows with m, with the size of the training set. And usually it's defined as: the number of parameters grows linearly with the size of the training set. This is the formal definition.
A
slightly less formal definition is that
the amount of stuff that your learning algorithm needs
to keep around
will grow linearly with the training set - or, another way of saying it, this is an algorithm that will need to keep around the entire training set, even after learning. Okay? So
don't worry too much about this definition. But
what I want to do now is
describe a specific non-parametric learning algorithm
called locally weighted regression.
Which also goes by a couple of other names -
which also goes by the name of Loess, for, sort of, historical reasons. Loess is usually spelled L-O-E-S-S - sometimes spelled Lowess, too. I just call it locally weighted regression.
So here's
the idea. This will be an algorithm that allows us
to worry a little bit less about having to choose features very carefully.
So
for my motivating example, let's say that I have a training set that looks like this, okay?
So this is X and that's Y.
If you run
linear regression on this and you fit maybe a linear function to this
and you end up with a
more or less
flat, straight line, which is not a very good fit to this data. You
can sit around and stare at this and try to decide whether the features you're using are right.
So maybe you want to toss in a quadratic function,
but this isn't really quadratic either. So maybe you want to
model this as $\theta_1 x + \theta_2 x^2$ plus maybe some function of $\sin(x)$ or something.
You actually sit around and fiddle with features.
And after a while you can probably come up with a set of features for which the model is
okay, but let's talk about an algorithm that
you can use without needing to do that.
So
if - well, suppose you want to evaluate your hypothesis H at a certain point - at a certain query point, lowercase x. Okay? And
let's say
you want to know what's the predicted value of
Y
at this position of X, right? So
for linear regression,
what we were doing was we would fit theta to minimize $\sum_i \left(y^{(i)} - \theta^T x^{(i)}\right)^2$, and return $\theta^T x$. Okay? So that was linear regression.
In contrast, in locally weighted linear regression you're going to do things slightly differently. You're
going to look at this point X
and then I'm going to look in my data set and take into account
only the data points that are, sort of, in the little vicinity of X. Okay?
So we'll look at where I want to evaluate my hypothesis. I'm going to look
only in the vicinity of
this point where I want to evaluate my hypothesis,
and then I'm going to take,
let's say, just these few points,
and I will
apply linear regression
to fit a straight line just to this sub-set of the data. Okay? I'm
using this term sub-set - well, let's come back to that later.
So we take this data set and I fit a
straight line to it and maybe I get a straight line like that.
And what I'll do is then evaluate this straight line at this particular value of X, and that will be the value I return for my algorithm. This would be the predicted value - this would be the value that my hypothesis outputs in locally weighted regression. Okay? So
let me go ahead and formalize that. In
locally weighted regression, we're going to fit theta to minimize $\sum_i w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$, where these terms $w^{(i)}$ are called weights. There are many possible choices for the weights; I'm just gonna write one down: $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2}\right)$. So let's look at what these weights really are, right? So notice that -
suppose you have a training example XI such that XI is very close to X. So this is small, right? Then if XI minus X is small, so if XI minus X is close to zero, then this is e to the minus zero, and e to the zero is one.
So if XI is close to X, then
WI
will be close to one. In other words, the weight associated with the ith training
example will be close to one
if XI and X are close to each
other.
Conversely if XI minus X
is large
then - I don't
know, what would WI be? Zero.
Zero, right. Close to zero. Right.
So
if XI is very far from X then this is E to the minus of some large number
and E to the minus some large number will be close to zero.
Okay?
So the picture is, if I'm querying at a certain point X, shown on the X axis, and if my data set, say, looks like that,
then I'm going to give the points close to this a large weight and give the points
far away a small weight.
So
for the points that are far away,
WI will be close to zero.
And so, for the points that are far away, they will not contribute much at all to this summation, right? So it's as if this is a sum over I of one times this quadratic term for nearby points, plus zero times this quadratic term for faraway points.
And so the effect of using this weighting is that
locally weighted linear regression
fits a set of parameters theta,
paying much more attention to fitting the
points close by
accurately, while ignoring the contribution from faraway points. Okay? Yeah? Why is it exponential [inaudible]?
Yeah. Let's see. So it turns out there are many other weighting functions you can use. It
turns out that there are definitely different communities of researchers that tend to
choose different choices by default.
There is somewhat of a literature on debating what point -
exactly what function to use.
This, sort of, exponential decay function is -
this happens to be a reasonably common one that seems to be a more reasonable choice on many problems,
but you can actually plug in other functions as well. Did
I mention what [inaudible] is it at?
For those of you that are familiar with
the
normal distribution, or the Gaussian distribution,
say this -
what this formula I've written out here, it cosmetically
looks a bit like a Gaussian distribution. Okay? But this actually has
absolutely nothing to do with Gaussian distribution.
So it's not that we're assuming XI is distributed Gaussian or whatever. There is no such
interpretation.
This is just a convenient function that happens to be a bell-shaped function, but
don't endow this with any Gaussian semantics. Okay?
So, in fact - well,
if you remember the familiar bell-shaped Gaussian - again, the way of associating weights with these points is that if you imagine putting this bell-shaped bump centered around the position of where you want to evaluate your hypothesis H, then this is saying that this point here will get a weight that's proportional to the height of the Gaussian - excuse me, to the height of the bell-shaped function - evaluated at this point. And the weight given to this point, to this training example, will be proportional to that height, and so on. Okay?
And so training examples that are really far away
get a very small weight.
One last small generalization to this is that normally there's one other parameter to this algorithm, which I'll denote as tau. Again, this looks suspiciously like the variance of a Gaussian, but this is not a Gaussian; this is just a convenient form of function: $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$. This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Okay? So let me just copy my diagram from the other side, I guess.
So if
tau is very small, then if that's a query X, you end up choosing a fairly narrow Gaussian - excuse me, a fairly narrow bell shape - so that the weights of the points that are far away fall off rapidly. Whereas if tau is large, then you'd end up choosing a weighting function that falls off relatively slowly with distance from your query.
Okay?
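Here is a minimal sketch of this procedure in Python/NumPy for a single input feature. The bell-shaped weights and the bandwidth tau are as described above; the wavy data is made up just for illustration.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau):
    """Locally weighted linear regression prediction at one query point.

    X is an (m, 2) design matrix whose first column is the intercept x0 = 1.
    """
    # Bell-shaped weights: w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2))
    w = np.exp(-(X[:, 1] - x_query) ** 2 / (2.0 * tau ** 2))
    W = np.diag(w)
    # Fit theta to minimize sum_i w^(i) (y^(i) - theta^T x^(i))^2; the
    # weighted normal equations give theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Made-up wavy data that a single straight line fits poorly
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + 0.5 * x
X = np.column_stack([np.ones_like(x), x])

# A fresh fit is run for every query; a small tau weights only nearby points.
print(lwr_predict(3.0, X, y, tau=0.5))
```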
So
I hope you can, therefore, see that if you
apply locally weighted linear regression to a data set that looks like
this,
then to
ask what your hypothesis output is at a point like this you end up having a straight line
making
that prediction. To ask what this outputs [inaudible] at that value, you put a straight line there and you predict that value.
It turns out that every time you evaluate your hypothesis - every time you
ask your learning algorithm to make a prediction for
how much a new house costs or whatever,
you need to run a new fitting procedure and then
evaluate this line that you fit
just at the position of the value of X - so at the position of the query where you're trying to make a prediction. Okay? But if you
do this for every point along the X-axis then
you find that
locally weighted regression is able to trace out this, sort of, very
non-linear curve
for a data set like this. Okay? So
in the problem set we're actually gonna let you play around more with this algorithm. So I won't say
too much more about it here.
But to finally move on to the next topic let me check the questions you have. Yeah? It seems like you
still have
the same problem of overfitting
and underfitting, like with your choice of tau. Like if you make it too small
in your -
Yes, absolutely. Yes. So locally
weighted regression can run into
- locally weighted regression is not a panacea for the problem of
overfitting or underfitting.
You can still run into the same problems with locally weighted regression. What you just
said about
- and so some of these things I'll leave you to discover for yourself in the
homework problem.
You'll actually see what you just mentioned.
Yeah? It almost seems like
you're
not even thoroughly [inaudible] with this locally weighted, you had all the
data that you originally had anyway. Yeah. I'm just trying to think of [inaudible] the original data points. Right. So the question is, sort of, this - it's almost as if you're not building a
model, because you need the entire data set.
And
the other way of saying that is that this is a non-parametric learning
algorithm. So this
-I don't know. I won't
debate whether, you know, are we really building a model or not. But
this is a perfectly fine - so
if I think - when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model.
So it actually uses -
we've actually used this quite successfully to model, sort of, the dynamics of an autonomous helicopter. Yeah? Does this algorithm learn the weights based on the data? Learn what weights? Oh, the weights WI. Instead of using [inaudible]. I see,
yes. So
it turns out there are a few things you can do. One thing that is quite common is
how to choose this bandwidth parameter tau, right? Using the data. We'll actually talk about that a bit later when we talk about model selection. Yes? One last question. I used [inaudible]
Gaussian
sometimes if you [inaudible] Gaussian and then - Oh, I guess. Let's
see. Boy.
The weights are not random variables and, for the purpose of this algorithm, it is not useful to endow them with probabilistic semantics. So you could choose to define things as Gaussian, but it, sort of, doesn't lead anywhere. In
fact,
it turns out that
I happened to choose this, sort of, bell-shaped function
to define my weights.
It's actually fine to choose a function that doesn't even integrate to one, that integrates to
infinity, say, as your weighting function. So
in that sense,
I mean, you could force in the definition of a Gaussian, but it's, sort of, not useful. Especially
since you use other functions that integrate to infinity and don't integrate to one. Okay?
Okay, last question and then let's move on. Assume that we have a very huge [inaudible], for example, a very huge set of houses, and we want to predict the price for each house, and so for each input - you'd be re-running this constantly for - Yes, you're right. So
because locally weighted regression is a
non-parametric algorithm
every time you make a prediction you need to fit theta to your entire training set again.
So you're actually right.
If you have a very large training set then this is a somewhat expensive algorithm to
use. Because every time you want to make a prediction
you need to fit
a straight line
to a huge data set again.
Turns out there are algorithms that - turns
out there are ways to make this much more efficient for large data sets as well.
use. So I don't want to talk about that here. If you're interested, look
up the work of Andrew Moore
on KD-trees. He,
sort
of, figured out ways to fit these models much more efficiently. That's not something I want to go
into today. Okay? Let me move on. Let's take more questions later. So, okay.
So that's locally weighted regression.
Remember the
outline I had, I guess, at the beginning of this lecture. What I want to do now is
talk about a probabilistic interpretation of linear regression, all right?
And in particular, it'll be this probabilistic interpretation
that lets us move on to talk
about logistic regression, which will be our first classification algorithm. So
let's put aside locally weighted regression for now. We'll just talk about
ordinary unweighted linear regression. Let's
ask the question of why least squares, right? Of all the things we could optimize, how do we come up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the observed values Y? So why not minimize the absolute value of the errors, or the errors to the power of four, or something?
What I'm going to do now is present
one set of assumptions that
will serve to "justify"
why we're minimizing the sum of square zero. Okay?
It turns out that there are many assumptions that are sufficient
to justify why we do least squares and this is just one of them.
So
I'm just presenting one set of assumptions under which least squares regression makes sense, but this is not the only set of assumptions. So even if the assumptions I describe don't hold, least squares actually still makes sense in many circumstances. But this will, sort of, help, you know, give one rationalization, like, one reason for doing least squares regression.
And, in particular, what I'm going to do is
endow the least squares model with probabilistic semantics. So
let's assume, in our example of predicting housing prices, that the price of the house it's sold for is going to be some linear function of the features plus some term epsilon I: $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$. Okay?
And epsilon I will be
our error term.
You can think of
the error term as capturing unmodeled effects, like, that maybe
there's some other features of a house, like, maybe how many fireplaces it has or whether
there's a garden or whatever,
that there are additional features that we just fail to capture, or
you can think of epsilon as random noise.
Epsilon is our error term that captures both these
unmodeled effects. Just things we forgot to model. Maybe the function isn't quite
linear or something.
As well as random noise, like maybe
that day the seller was in a really bad mood and so he sold it, just
refused to go for a reasonable price or something.
And now
I will assume that the errors have a
probabilistic
- have a probability distribution. I'll assume that the errors epsilon I are distributed - just to denote it, epsilon I is distributed according to a probability distribution that's a Gaussian distribution with mean zero and variance sigma squared: $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Okay? So
let me just note the script N here: it stands for normal, right? To denote a normal distribution, also known as the Gaussian distribution, with mean zero and variance sigma squared.
Actually, just quickly raise your hand if you've seen a Gaussian distribution before. Okay, cool. Most of you.
Great. Almost everyone.
So,
in other words, the density for a Gaussian is what you've seen before. The density for epsilon I would be $p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$, right? And the density of our epsilon I will be this bell-shaped curve with one standard deviation being, sort of, sigma. Okay? This is the formula for that bell-shaped curve.
So, let's see. I can erase that. Can I
erase the board?
So this implies that the probability distribution of the price of a house, given $x^{(i)}$ and the parameters theta, is going to be Gaussian with that density: $p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$. Okay?
In other words, this is saying that the
price of
a house given the features of the house and my parameters theta,
this is going to be a random variable
that's distributed Gaussian with
mean theta transpose XI
and variance sigma squared.
Right? Because we imagine that
the way the housing prices are generated is that the price of a house
is equal to theta transpose XI plus some random Gaussian noise with variance sigma squared. So the price of a house is going to have mean theta transpose XI and variance sigma squared, right? Does
this make
sense? Raise your hand if this makes sense. Yeah,
okay. Lots of you. In
point of notation - oh, yes? Assuming we don't know anything about the error, why do you assume
here the error is a
Gaussian?
Right. So, boy.
Why do I assume the error is Gaussian?
Two reasons, right? One is that it turns out to be mathematically convenient to do so
and the other is, I don't
know, I can also mumble about justifications, such as appealing to the
central limit theorem. It turns out that if you,
for the vast majority of problems, if you apply a linear regression model like
this and try to measure the distribution of the errors,
not all the time, but very often you find that the errors really are Gaussian. That
this Gaussian model is a good
assumption for the error
in regression problems like these.
Some of you may have heard of the central limit theorem, which says that
the sum of many independent random variables will tend towards a Gaussian.
So if the error is caused by many effects, like the mood of the
seller, the mood of the buyer,
some other features that we miss, whether the place has a garden or not, and
if all of these effects are independent, then
by the central limit theorem you might be inclined to believe that
the sum of all these effects will be approximately Gaussian. In practice, I guess, the two real answers are that, 1.) In practice this is actually a reasonably accurate assumption, and 2.) It turns out to be mathematically convenient to do so. Okay? Yeah? It seems like we're
saying
if we assume that the error around our model has zero mean, then the error is centered around our model. Which it seems almost like we're trying to assume what we're trying to prove. That's [inaudible], but, yes. You are assuming that the error has zero mean. Which is, yeah, right.
I think later this quarter we get to some of the other
things, but for now just think of this as a mathematically - it's actually not
an unreasonable assumption.
I guess,
in machine learning all the assumptions we make are almost never
true in the absolute sense, right? Because, for instance, housing prices are priced in dollars and cents, so the errors in prices are not continuous-valued random variables, because houses can only be priced at a certain number of dollars and a certain number of cents and you never have fractions of cents in housing prices. Whereas a Gaussian random variable would.
"absolutely true," but for practical purposes this is a
accurate enough assumption that it'll be useful to
make. Okay? I think in a week or two, we'll actually come back to
selected more about the assumptions we make and when they help our learning
algorithms and when they hurt our learning
algorithms. We'll say a bit more about it
when we talk about generative and discriminative learning algorithms, like, in a week or
two. Okay? So let's point out one bit of notation, which is that when I
wrote this down I actually wrote P of YI given XI and then semicolon
theta
and I'm going to use this notation when we are not thinking of theta as a
random variable. So
in statistics, this is sometimes called the frequentist point of view, where you think of there as being some, sort of, true value of theta that's out there, that's generating the data, but we don't know what theta is; theta is not a random variable, right? So it's not like there's some random value of theta out there. It's that there's some true value of theta out there; it's just that we don't know what the true value of theta is. So
if theta is not a random variable, then I'm
going to avoid
writing P of YI given XI comma theta, because that would mean the probability of YI conditioned on X and theta, and you can only condition on random variables.
So at this part of the class where we're taking
sort of, the frequentist viewpoint rather than the Bayesian viewpoint, in this part of the class
we're thinking of theta not as a random variable, but just as something
we're trying to estimate
and use the semicolon
notation. So the way to read this is this is the probability of YI given XI
and parameterized by theta. Okay? So
you read the semicolon as parameterized by.
And in the same way here, I'll say YI given XI parameterized by
theta is distributed
Gaussian with that. All right.
So we're gonna make one more assumption.
Let's assume that the
error terms are
IID, okay?
Which stands for Independently and Identically Distributed. So I'm
going to assume that the error terms are
independent of each other,
right?
The identically distributed part just means that I'm assuming they all come from the same Gaussian distribution with the same variance, but the more important part of this is that I'm assuming that the epsilon I's are independent of each other.
Now, let's talk about how to fit a model.
The probability of Y given
X parameterized by theta - I'm actually going to give
this another name. I'm going to write this down
and we'll call this the likelihood of theta: the probability of Y given X, parameterized by theta. And so this is going to be the product over my training set, $L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$, which is, in turn, going to be a product of those Gaussian densities that I wrote down just now, right? Okay?
Then, in terms of notation, I guess, I define this term here to be the likelihood of theta. And the likelihood of theta is just the probability of the data Y, right? Given X and parameterized by theta. The terms likelihood and probability are often confused. So the likelihood of theta is the same thing as the probability of the data you saw. So likelihood and probability are, sort of, the same thing.
Except that when I use the term likelihood
I'm trying to emphasize that I'm taking this thing
and viewing it as a function of theta.
Okay?
So likelihood and probability are really the same thing, except that when I want to view this thing as a function of theta, holding X and Y fixed, it's then called the likelihood. Okay? So
hopefully you'll hear me say the likelihood of the parameters and the probability of the data, right? Rather than the likelihood of the data or the probability of the parameters. So try to be consistent in that terminology.
So given that
the probability of the data is this and this
is also the likelihood of the parameters,
how do you estimate
the parameters theta? So given a training set,
what parameters theta do you want to choose for your model?
Well, the principle of maximum likelihood
estimation
says, right, that you should choose the value of theta that makes the data as probable as possible. So choose theta to maximize the likelihood. Or, in other words, choose the parameters that make the data as probable as possible. So this is maximum likelihood estimation, stated succinctly: choose the parameters that make it as probable as possible for me to have seen the data I just did. So
for mathematical convenience, let me define lower case l of theta.
This is called the log likelihood function, and it's just the log of capital L of theta. So this is $\ell(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$; I'm just copying this from the previous board. The log of a product is the same as the sum of the logs, right? So this is a sum of logs, which simplifies to $m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$, because the log and the exponentiation cancel each other, right? The log of $e$ to something is just whatever's inside the exponent. You know what, let me write this on the next board.
Okay.
So
maximizing the likelihood or maximizing the log likelihood is the same as minimizing that term over there. Well, you get it, right? Because of the minus sign, maximizing this is the same as minimizing this as a function of theta. And this is, of course, just the same quadratic cost function that we had last time, J of theta, right? So what
we've just shown is that the ordinary least squares algorithm,
that we worked out in the previous lecture,
is just maximum likelihood
assuming
this probabilistic model,
assuming IID Gaussian errors on our data.
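To check this equivalence numerically, here is a small sketch with made-up simulated data; SciPy's general-purpose minimizer is used purely as a convenience, and it also previews the point below that the answer doesn't depend on sigma.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0.0, 10.0, m)])
true_theta = np.array([2.0, 0.7])
y = X @ true_theta + rng.normal(0.0, 1.5, m)  # y = theta^T x + Gaussian noise

# Ordinary least squares via the normal equations
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Negative log likelihood under y ~ N(theta^T x, sigma^2), dropping constants:
# m*log(sigma) + (1 / (2 sigma^2)) * sum_i (y^(i) - theta^T x^(i))^2
def neg_log_likelihood(theta, sigma):
    r = y - X @ theta
    return m * np.log(sigma) + 0.5 * np.sum(r ** 2) / sigma ** 2

# The minimizing theta is the same for any sigma > 0
for sigma in (0.5, 5.0):
    res = minimize(lambda t: neg_log_likelihood(t, sigma), x0=np.zeros(2))
    print(sigma, res.x, theta_ols)  # res.x matches theta_ols both times
```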
Okay? One thing that we'll actually use in the next lecture is to notice that the value of sigma squared doesn't matter,
right? That somehow
no matter what the value of sigma squared is, I mean, sigma squared has to be a positive number. It's a
variance
of a Gaussian. So that no matter what sigma
squared is since it's a positive number the value of theta we end up with
will be the same, right? So because
minimizing this
you get the same value of theta no matter what sigma squared is. So it's as if
in this model the value of sigma squared doesn't really matter.
Just remember that for the next lecture. We'll come back
to this again.
Any questions about this?
Actually, let me clean up
another couple of boards and then I'll see what questions you have. Okay. Any questions? Yeah? I think here you try to measure the likelihood of your choice of theta by a fraction of error, but I think what you measure also depends on the family of theta, too; for example, if you have a lot of parameters [inaudible] or fitting in? Yeah, yeah. I mean, you're asking about overfitting, whether this is a good model. I think
let's - the thing's you're mentioning are
maybe deeper questions about
learning algorithms that we'll just come back to later, so don't really want to get into
that right
now. Any more
questions? Okay. So
this endows linear regression with a probabilistic interpretation.
I'm actually going to use this, sort of, probabilistic
interpretation
in order to derive our next learning algorithm,
which will be our first classification algorithm. Okay?
So
you'll recall that I said that regression problems are where the variable Y that you're trying to predict is continuous-valued. Now I'm actually gonna talk about our first classification problem, where the value Y you're trying to predict is discrete-valued - it can take on only a small number of discrete values - and in this case I'll talk about binary classification,
where
Y takes on only two values, right? So
you come up with classification problems if you're trying to do,
say, a medical diagnosis, and trying to decide, based on some features, whether the patient has a disease or does not have a disease.
Or if in the housing example, maybe you're trying to decide will this house sell in the
next six months or not and the answer is either yes or no. It'll either be sold in the
next six months or it won't be.
Other standard examples: if you want to build a spam filter - is this e-mail spam or not? It's yes or no. Or, you know, some of my colleagues work on predicting whether a computer system will crash. So you have a learning algorithm to predict: will this computing cluster crash over the next 24 hours? And, again, it's a yes or no answer. So
there's X, there's Y.
And in a classification problem
Y takes on
two values, zero and one. That's it in binary classification.
So what can you do? Well, one thing you could do is
take linear regression, as we've described it so far, and apply it to this problem,
right? So you,
you know, given this data set you can fit a straight line to it. Maybe
you get that straight line, right?
But
this data set I've drawn, right? This is an amazingly easy classification problem. It's pretty obvious to all of us what the relationship between X and Y is, right? You just look at a value around here: to the right, Y is one; to the left, Y is zero. So you apply linear regression to this data set and you get a reasonable fit and you can then
maybe take your linear regression
hypothesis to this straight line
and threshold it at 0.5.
If you do that you'll certainly get the right answer. You predict that
if X is to the right of, sort
of, the mid-point here
then Y is one, and if it's to the left of that mid-point then Y is zero.
So some people actually do this. Apply linear regression to classification problems
and sometimes it'll
work okay,
but in general it's actually a pretty bad idea to
apply linear regression to
classification problems like these and here's why. Let's say I
change my training set
by giving you just one more training example all the way up there, right?
Even given this training set, it's actually still entirely obvious what the relationship between X and Y is, right? It's just: if the value is greater than this, Y is one, and if it's less, then Y is zero.
Adding this additional training example really shouldn't
change anything. I mean,
I didn't really convey much new information. There's no surprise that this
corresponds to Y equals one.
But if you now fit linear regression to this data
set you end up with a line that, I
don't know, maybe looks like that, right?
And now the predictions of your hypothesis have changed completely if you threshold your hypothesis at 0.5. Okay? So - In between there might be an interval where it's zero, right? For that far off point? Oh, you mean, like that?
Right.
Yeah, yeah, fine. Yeah, sure. A data set like that, so. So, I
guess,
these just - yes, you're right, but this is an example and this example works. This - [Inaudible] that will
change it even more if you gave it
all -
Yeah. Then I think this actually would make it even worse. You
would actually get a line that pulls out even further, right? So
this is my example. I get to make it whatever I want, right? But
the point of this is that there's not a deep meaning to this. The point of this is
just that
it could be a really bad idea to apply linear regression to classification problems. Sometimes it works fine, but usually I wouldn't do it.
So a couple of problems with this. One is that,
well - so what do you want to do
for classification? If you know the value of Y lies between zero and
one then to kind of fix this problem
let's just start by
changing the form
of our hypothesis so that my hypothesis
always lies in the unit interval between zero and one. Okay?
So if I know Y is either
zero or one then
let's at least not have my hypothesis predict values much larger than one and much
smaller than zero.
And so
I'm going to - instead of choosing a linear function for my hypothesis I'm going
to choose something slightly different. And,
in particular, I'm going to choose
this function: H subscript theta of X is going to be equal to G of theta transpose X, where G is going to be this function, $g(z) = \frac{1}{1 + e^{-z}}$, and so this becomes $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.
And G of Z is called the sigmoid
function and it
is often also called the logistic function. It
goes by either of these
names. And what G of Z looks like is the following. So when you have your
horizontal axis I'm going to plot Z
and so G of Z
will look like this.
Okay? I didn't draw that very well. Okay.
So G of Z
tends towards zero
as Z becomes very small
and G of Z tends towards one as Z becomes large, and it crosses the
vertical
axis at 0.5.
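As a quick numeric check of those properties - a tiny Python sketch, with the evaluation points chosen arbitrarily:

```python
import numpy as np

def g(z):
    """Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Tends to 0 for very negative z, crosses 0.5 at z = 0, tends to 1 for large z
print(g(-10.0), g(0.0), g(10.0))  # ~4.5e-05, 0.5, ~0.99995
```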
So this is the sigmoid function, also called the logistic function. Yeah? Question? Why the sigmoid and not a step function? Say that again. Why can we not choose a step function instead, like, since that seems better for binary? Yeah. Let me come back to that later. So it turns out - where did I get this function from,
right? I just
wrote down this function. It actually
turns out that there are two reasons for using this function that we'll come to.
One is -
we talked about generalized linear models. We'll see that this falls out naturally
as part of the broader class of models.
And another reason
that we'll talk about
next week, it turns out
there are a couple of,
I think, very beautiful reasons for why
we choose logistic
functions. We'll see
that in a little bit. But for now let me just
define it and just take my word for it for now that this is a reasonable choice.
Okay? But notice now that
my - the values output by my hypothesis will always be between zero
and one. Furthermore,
just like we did for linear regression, I'm going to endow
the outputs and my hypothesis with a probabilistic interpretation, right? So
I'm going to assume that the probability that Y is equal to one, given X and parameterized by theta, is equal to H subscript theta of X: $P(y = 1 \mid x; \theta) = h_\theta(x)$. All right?
So in other words
I'm going to imagine that my hypothesis is outputting all these
numbers that lie between zero and one.
I'm going to think of my hypothesis
as trying to estimate the probability that Y is equal to one. Okay?
And because Y has to be either zero or one, the probability of Y equals zero is going to be one minus that: $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$. All right?
So, more simply, it turns out you can actually take these two equations and write them more compactly. Write P of Y given X, parameterized by theta, as $p(y \mid x; \theta) = h_\theta(x)^{y} \left(1 - h_\theta(x)\right)^{1 - y}$. Okay? So I know this
looks somewhat bizarre, but this actually makes the derivation much nicer.
So if Y is equal to one, then this equation is H of X to the power of one times something to the power of zero. And anything to the power of zero is just one, right? So if Y equals one, then this second factor is something to the power of zero and so it's just one.
So if Y equals one this is just saying P of Y equals one is equal to H subscript
theta of X. Okay?
And in the same way, if Y is equal to zero, then this is P of Y equals zero equals this first thing to the power of zero, and so it disappears; it's just one times this second thing to the power of one. Okay? So this is
a
compact way of writing both of these equations together in one line.
So let's talk about parameter fitting, right? And, again, you can ask - well, given this model of my data, how do I fit the parameters theta of my model? So the likelihood of the parameters is, as before, just the probability of the data, right? Which is the product over I of P of YI given XI, parameterized by theta. Which is - just plugging those in - $L(\theta) = \prod_{i=1}^{m} h\left(x^{(i)}\right)^{y^{(i)}} \left(1 - h\left(x^{(i)}\right)\right)^{1 - y^{(i)}}$. Okay? I
dropped the theta subscript just so I can write a little bit less. Okay?
So,
as before, let's say we want to find a maximum likelihood estimate of the parameters theta. So
we want
to find the setting of the parameters theta that maximizes the likelihood L
of theta. It
turns out
that very often
- just when you work with the derivations, it turns out that it is often
much easier to maximize the log of the likelihood rather than maximize the
likelihood.
So the log likelihood $\ell(\theta)$ is just the log of capital L. This will, therefore, be $\ell(\theta) = \sum_{i=1}^{m} y^{(i)} \log h\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h\left(x^{(i)}\right)\right)$. Okay?
And so
to fit the parameters theta of our model we'll
find the value of
theta that maximizes this log likelihood. Yeah? [Inaudible] Say that again. YI is [inaudible]. Oh, yes.
Thanks.
So to maximize this function - well, it turns out we can actually apply the same gradient descent algorithm that we learned. That was the first algorithm we used to minimize the quadratic cost function.
And you remember, when we talked about least squares, the first algorithm we used to minimize the quadratic error function was gradient descent. So we can actually use exactly the same algorithm to maximize the log likelihood.
And you remember, that algorithm was just: repeatedly take the value of theta and replace it with the previous value of theta plus a learning rate alpha times the gradient of the cost function - here, the gradient of the log likelihood with respect to theta. Okay?
One small change is that previously we were trying to minimize the quadratic error term, whereas today we're trying to maximize rather than minimize. So rather than having a minus sign we have a plus sign. So this is just gradient ascent, but for the maximization rather than the minimization.
So we actually call this gradient ascent and it's really the same
algorithm.
So to figure out what this gradient is - so in order to derive gradient ascent, what you need to do is compute the partial derivatives of your objective function with respect to each of your parameters theta J, right?
It turns out that
if you actually
compute this partial derivative -
so you take this formula, this L of theta - oh, got that wrong too. If you take this lowercase l of theta, if you take the log likelihood of theta, and if you take its partial derivative with respect to theta J, you find that this is equal to - let's see. Okay? And,
I
don't
know, the derivation isn't terribly complicated, but in
the interest of saving you watching me write down a couple of
blackboards full of math I'll just write
down the final answer. But
the way you get this is you just take this, plug in the definition for H subscript theta as a function of XI, take derivatives, and work through the algebra, and it turns out it'll simplify down to this formula. Okay?
And so what that gives you is that gradient ascent is the following rule: $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta\left(x^{(i)}\right)\right) x_j^{(i)}$. Okay?
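Here is a minimal sketch of this rule in Python/NumPy, on made-up binary labels; the learning rate alpha and the iteration count are arbitrary illustrative choices, not anything from the lecture.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # intercept + 1 feature
# Made-up noisy binary labels that correlate with the feature
y = (X[:, 1] + 0.3 * rng.normal(size=m) > 0).astype(float)

theta = np.zeros(2)
alpha = 0.01
for _ in range(2000):
    h = g(X @ theta)  # h_theta(x^(i)) for every example
    # Batch gradient ascent on the log likelihood:
    # theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) x_j^(i)
    theta += alpha * (X.T @ (y - h))
print(theta)
```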
Does this look familiar to anyone? Did you
remember seeing this formula in the last lecture? Right. So when I worked out batch gradient descent for least squares regression, I, actually, wrote down exactly the same thing - or maybe there's a minus sign difference, but otherwise I, actually, had exactly the same learning rule last time for least squares regression,
right? Is this the same learning algorithm then? So what's different? How come I was making
all that noise earlier about
least squares regression being a bad idea for classification problems and then I did
a bunch of math and I skipped some steps, but I'm, sort of, claiming at the end they're really the same learning algorithm? [Inaudible] constants?
Say that again. [Inaudible] Oh, right. Okay, cool. No, exactly, right - these are not the same, right? And the
reason is,
in logistic regression
this is different from before, right?
The definition
of this H subscript theta of XI
is not
the same as the definition I was using in the previous lecture.
And in particular this is no longer theta transpose XI. This is not
a linear function anymore.
This is a logistic function of theta
transpose XI. Okay?
So even though this looks cosmetically similar,
even though this is similar on the surface,
to the batch gradient descent rule I derived last time for
least squares regression,
this is actually a totally different learning algorithm. Okay?
And it turns out that there's actually no coincidence that you ended up with the
same learning rule. We'll actually
talk a bit more about this later when we talk about generalized linear models.
But this is one of the more elegant aspects of generalized linear models that we'll see later: that even though we're using a different model, you actually end up with what looks like the same learning algorithm, and it's actually no coincidence. Cool.
One last comment, as part of, sort of, the learning process:
over here
I said I take the derivatives and I
ended up with this line.
I didn't want to
make you sit through a long algebraic derivation, but
later today or later this week,
please, do go home and look at our lecture notes, where I wrote out
the entirety of this derivation in full,
and make sure you can follow every single step of
how we take partial derivatives of this log likelihood
to get this formula over here. Okay? By the way, for those who are interested in seriously mastering the machine learning material,
when you go home and look at the lecture notes it will actually be very easy for most
of you to look through
the lecture notes and read through every line and go yep, that makes sense, that makes sense, that makes sense,
and,
sort of, say cool. I see how you get this line.
You want to make sure you really understand the material. My concrete suggestion would be for you to go home, read through the lecture notes, check every line, and then cover up the derivation and see if you can rederive it yourself, right? So
in general, that's usually good advice for studying technical
material like machine learning. Which is if you work through a proof
and you think you understood every line,
the way to make sure you really understood it is to cover it up and see
if you can rederive the entire thing yourself. This is actually a great way because I
did this a lot when I was trying to study
various pieces of machine learning
theory and various proofs. And this is actually a great way to study: cover up the derivations and see if you can do them yourself
without looking at the original derivation. All right.
I probably won't get to Newton's Method today. I just want to take one quick digression to talk about one more algorithm - which I was, sort of, alluding to earlier - which is the perceptron algorithm, right? So
I'm
not gonna say a whole lot about the perceptron algorithm, but this is something that we'll come
back to later. Later this quarter
we'll talk about learning theory.
So in logistic regression we said that G of Z, and hence my hypothesis, outputs, sort of, values that are real numbers between zero and one. The question is: what if you want to force G of Z to output a value that is either zero or one?
So the perceptron algorithm defines G of Z to be a threshold: $g(z) = 1$ if $z \geq 0$, and $g(z) = 0$ otherwise. So the picture - or the cartoon, rather - is that, instead of the sigmoid function, G of Z now looks like the step function that you were asking about earlier. As before, we use H subscript theta of X equals G of theta transpose X. Okay? So
this
is actually - everything is exactly the same as before,
except that G of Z is now the step function.
It turns out there's a learning rule called the perceptron learning rule that looks, actually, just the same as gradient ascent for logistic regression. The learning rule is given by $\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta\left(x^{(i)}\right)\right) x_j^{(i)}$. Okay? So it looks just like the classic gradient ascent rule for logistic regression.
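And here is a minimal sketch of the perceptron in the same Python/NumPy style, again on made-up separable data with an arbitrary alpha:

```python
import numpy as np

def g_step(z):
    """Perceptron threshold: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

rng = np.random.default_rng(2)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = g_step(X @ np.array([-0.2, 1.0, -0.5]))  # made-up separable labels

theta = np.zeros(3)
alpha = 0.1
for _ in range(25):      # a few passes over the training set
    for i in range(m):   # the rule updates one example at a time
        h = g_step(X[i] @ theta)
        theta += alpha * (y[i] - h) * X[i]
print(np.mean(g_step(X @ theta) == y))  # training accuracy; 1.0 once converged
```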
So this is very different flavor of algorithm than least squares regression and logistic
regression,
and, in particular, because it outputs only values that are either zero or one, it turns out it's very difficult to endow this algorithm with probabilistic semantics. And this
is, again, even though - oh, excuse me. Right there. Okay.
And even though this learning rule, again, looks cosmetically very similar to what we have in logistic regression, this is actually a very different type of learning rule than the others that we've seen in this class. So
because this is such a simple learning algorithm, right? It just
computes theta transpose X, and then you threshold, and then your output is zero or one. This is - right. So this is a simpler algorithm than logistic regression, I think.
When we talk about learning theory later in this class,
the simplicity of this algorithm will let us come back and use it as a building block. Okay?
But that's all I want to say about this algorithm for now.