This presentation is delivered by the Stanford Center for Professional
Development.
So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of the bias-variance tradeoff and of underfitting and overfitting that we've been seeing in the previous lecture, and that we'll see in this one.
I then want to talk about model selection algorithms for automatically making decisions on this bias-variance tradeoff that we started to talk about in the previous lecture.
And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.
To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class, if script H is a set of k hypotheses, and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay?
And using big-O notation, just dropping the constants, I can also write this as that; okay?
So just to quickly remind you of what all of the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set.
And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h [inaudible] misclassifying a new example drawn from the same distribution as the training set.
And so this says that in order to guarantee that the generalization error of the hypothesis h [inaudible] output by empirical risk minimization is less than or equal to the best possible generalization error attained in your hypothesis class, plus two times gamma, two times this error threshold, and to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this; okay? One over two gamma squared, times log of two k over delta, where again, k is the size of your hypothesis class.
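As a rough illustration of this bound, here is a minimal Python sketch that just plugs numbers into m >= (1 / (2 gamma^2)) log(2k / delta); the particular values of k, gamma, and delta below are made-up examples, not from the lecture.

    import math

    def sample_complexity_finite(k, gamma, delta):
        """Training set size sufficient for a finite hypothesis class of size k,
        error tolerance gamma, and failure probability delta:
        m >= (1 / (2 * gamma^2)) * log(2k / delta)."""
        return math.ceil((1.0 / (2.0 * gamma ** 2)) * math.log(2.0 * k / delta))

    # Example: k = 1000 hypotheses, gamma = 0.1, delta = 0.05 -> about 530 examples.
    print(sample_complexity_finite(k=1000, gamma=0.1, delta=0.05))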
And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay?
So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. Here, we said that the set script H is just k specific functions, whereas you may want to use a model like logistic regression, which is actually parameterized by real numbers.
So I'm actually first going to give an argument that's sort of formally broken, just technically somewhat broken, but conveys useful intuition. And then I'll give the more correct argument, but without proving it, since the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression.
So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script H is parameterized by d real numbers; okay? So for example, if you're applying logistic regression over [inaudible], then d would be n plus one: with logistic regression you find a linear decision boundary parameterized by n plus one real numbers.
Now think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And if you use a normal, standard computer, it will normally represent real numbers by what's called double-precision floating point numbers. And what that means is that each real number is represented by a 64-bit representation; right? You know what floating point is in a computer; 64-bit floating point is what almost all of us use routinely.
And so this class, parameterized by d real numbers, is really as if it's parameterized by 64 times d bits. Computers can't represent real numbers exactly; they only use a fixed number of bits.
And so for the size of your hypothesis class, in your computer representation you have 64 times d bits that you can flip, and so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits.
And so this is why it's important that we had log k there; right? So k is, therefore, two to the 64 d.
And if I plug it into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared times log of two to the 64 d over delta, which is that; okay?
So just to be clear, this is the same sort of sample complexity result as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, you know, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this. In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay?
And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class, that m has to be [inaudible] on the order of something linear [inaudible]. That intuition is actually roughly right.
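To see how this counting argument gives the linear-in-d intuition, here is a small sketch in the same spirit (hypothetical, building on the sample-complexity formula above); since k = 2^(64 d) is astronomically large, the log is expanded analytically rather than computed on the huge integer.

    import math

    def sample_complexity_64bit_params(d, gamma, delta):
        """Plug k = 2^(64 d) into (1 / (2 gamma^2)) * log(2k / delta);
        log(2 * 2^(64 d) / delta) = (64 d + 1) * log(2) - log(delta)."""
        log_term = (64 * d + 1) * math.log(2.0) - math.log(delta)
        return math.ceil(log_term / (2.0 * gamma ** 2))

    # The required m grows roughly linearly in d, the number of parameters.
    for d in (1, 10, 100):
        print(d, sample_complexity_64bit_params(d, gamma=0.1, delta=0.05))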
I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers.
So let me actually go ahead and outline the "right way" to show this more formally; all right? It turns out the "right way" to show this more formally involves a much longer argument; the proof is extremely involved, so I'm just actually going to state the result and not prove it. This is the route by which one obtains learning theory bounds for infinite hypothesis classes.
So here's a definition: given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay? And what I mean by realizing any labeling on it, the informal way of thinking about this is: if a hypothesis class shatters the set S, what that means is that I can take these d points and associate these d points with any set of labels y; right? So choose any set of labels y for each of these d points. And if your hypothesis class shatters S, then that means that there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means. So let me just illustrate this with an example.
So let's say H is the class of all linear classifiers in 2-D, and let's say that S is this [inaudible] comprising two points; okay? So there are four possible labelings for these two points: you can label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative.
And if the hypothesis class H is the class of all linear classifiers in 2-D, then for each of these training sets I can find a linear classifier that attains zero training error on each of these, on all possible labelings of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?
Let me show you one more, larger example. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again, I'd say, by definition, that my hypothesis class also shatters this set S. Student: Right. Instructor (Andrew Ng): And that's the terminology: H can realize any labeling on S. That's obviously [inaudible]. Give it any set of labels and you can find a hypothesis that perfectly separates the positive and negative examples; okay?
So how about this set? Suppose S is now this set of four points. Then, you know, there are lots of labelings; there are now 16 labelings we can choose on this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings, but there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay?
And I'm not really going to prove it here, but it turns out that you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?
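Here is a hedged sketch of the shattering demonstration in code: it brute-forces every labeling of a small 2-D point set and checks linear separability with a feasibility linear program (this assumes scipy is available; the particular point sets, and the use of a strict-separation LP as the test, are my own illustrative choices).

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def linearly_separable(points, labels):
        """True if some w, b satisfy labels[i] * (w . points[i] + b) >= 1 for all i."""
        n, d = points.shape
        # Variables are [w_1, ..., w_d, b]; constraints: -y_i * (w . x_i + b) <= -1.
        A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                      bounds=[(None, None)] * (d + 1), method="highs")
        return res.status == 0  # status 0 means a feasible (w, b) was found

    def is_shattered(points):
        """True if every +1/-1 labeling of the points is linearly separable."""
        return all(linearly_separable(points, np.array(lab))
                   for lab in itertools.product([1, -1], repeat=len(points)))

    three_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # not collinear
    four_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(is_shattered(three_pts))  # expected: True
    print(is_shattered(four_pts))   # expected: False (the XOR labeling fails)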
So here's another definition. It's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC of script H, is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.
So just as a kind of good example: if H is the class of all linear classifiers in 2-D, then the VC dimension of H is equal to three, because we saw just now that there was a set S of size three that it could shatter, and, though I won't really prove it, it turns out there is no set of size four that it can shatter. And therefore, the VC dimension of this class is three. Yeah? Student: But there are sets of size three that it cannot shatter; right? [Inaudible] was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this as my set S, then there are labelings on it that H cannot realize, and so H cannot shatter this set. But that's okay, because there definitely exists some other set of size three that can be shattered. So the VC dimension is three. And there is no set of size four that can be shattered. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition. No.
Right. So again, let's see, I can choose my set S to be a set of three points that are all overlapping, three points in exactly the same place. And clearly, I can't shatter this set, but that's okay. And I can't shatter this set either, but that's okay, because there are some other sets of size three that I can shatter.
And it turns out this result holds true more generally: in n dimensions, the VC dimension of the class of linear classifiers is equal to n plus one. Okay? So this is in [inaudible]: if you have linear classifiers over an n-dimensional feature space, the VC dimension is n plus one.
So maybe you wanna write down what is arguably the best-known result in all of learning theory, I guess, which is this. Let a hypothesis class be given, and let the VC dimension of H be equal to d. Then, with probability at least one minus delta, we have the following. The formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later.
The key to this result is that if you have a hypothesis class with VC dimension d, and now this can be an infinite hypothesis class, what Vapnik and Chervonenkis showed is that with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that for all hypotheses h, for all the hypotheses in your hypothesis class, the generalization error of h minus the training error of h, the difference between these two things, is bounded above by some complicated formula like this; okay? And thus, with probability one minus delta, we also have the same thing; okay?
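For reference, the bound being alluded to, in its usual statement (this is a reconstruction; the constants on the board may differ), is: with probability at least 1 - delta, for every h in the hypothesis class H with VC(H) = d,

    | \varepsilon(h) - \hat{\varepsilon}(h) |
        \le O\left( \sqrt{ \frac{d}{m} \log\frac{m}{d} + \frac{1}{m} \log\frac{1}{\delta} } \right),

and consequently

    \varepsilon(\hat{h}) \le \varepsilon(h^{*})
        + O\left( \sqrt{ \frac{d}{m} \log\frac{m}{d} + \frac{1}{m} \log\frac{1}{\delta} } \right).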
And going from this step to this step is actually something that you saw yourself, that we actually proved earlier. Because, you remember, in the previous lecture we proved that if you have uniform convergence, then that implies, well, actually we showed that if generalization error and training error are close to each other, within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error.
So this really says that the generalization error of h [inaudible] the best possible generalization error plus two times gamma, and it's just the constant two in front here that I've absorbed into the big-O notation.
So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that in order to guarantee that this holds with probability of one minus delta, with probability at least one minus delta, I should say, it suffices that, and I'm gonna write this this way: I'm gonna write m equals big-O of d, and I'm going to put gamma and delta in as a subscript there to denote that. Let's see, if we treat gamma and delta as constants, so that they allow me to absorb terms that depend on gamma and delta into the big-O notation, then in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay?
So let's see. What we conclude from this is that for empirical risk minimization algorithms, or in other words, less formally, for learning algorithms that try to minimize training error, the intuition to take away is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?
It turns out that for most reasonable hypothesis classes, the VC dimension is sort of very similar, I guess, to the number of parameters of your model. So for example, take logistic regression for linear classification: in n dimensions, logistic regression has n plus one parameters, and the VC dimension of the class of linear classifiers is also n plus one.
So it turns out that for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or at most some low-order polynomial in the number of parameters of your model. And so the takeaway intuition from this is that the number of training examples you need to fit such models is going to be, let's say, roughly linear in the number of parameters of your model; okay?
There are some somewhat strange examples where what I just said is not true; there are some strange examples where you have very few parameters, but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate, somewhat unusual, and not the sorts of learning algorithms you usually use.
Let's see, just a couple of other things. So this result shows that sample complexity is upper bounded by the VC dimension. It also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, say, then if the number of training examples you have is less than on the order of the VC dimension, it is not possible to achieve this bound. So I guess in the worst case, sample complexity, the number of training examples you need, is both upper bounded and lower bounded by the VC dimension.
Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness, like finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng): Let's see. The proof does not, no. I've actually stated the entirety of the theorem; this is true. It turns out that in the proof, somewhere, there's a construction called an epsilon net, which is a very clever [inaudible]. But it's part of the proof; it is not an assumption that you need. In some ways the proof has one step that uses a very clever [inaudible] to prove this, but that's not needed as an assumption. I'd like to say, back when I was a Ph.D. student, when I was working through this proof, there was sort of a solid week where I would wake up and go to the office at 9:00 a.m. Then I'd start reading the book that led up to this proof, and I'd read from 9:00 a.m. to 6:00 p.m. And then I'd go home, and the next day I'd pick up where I left off. And it sort of took me a whole week that way to understand this proof, so I thought I wouldn't inflict that on you.
Just to tie up a couple of loose ends: what I'm about to do is just mention a few things that will maybe feel a little bit like random facts, but I'm just gonna tie up a couple of loose ends.
And so, let's see, it turns out that, just so this will connect better with what you know, this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you ask is: how about support vector machines; right? How come SVMs don't overfit?
And recall that in our discussion on support vector machines we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: n plus one, and n is infinite.
So it turns out that the class of linear separators with large margin actually has low VC dimension.
I wanna say this very quickly and informally. It's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin [inaudible]. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin, say with a margin of at least gamma; okay?
And so I won't allow a line that comes closer; like, I won't allow that line, because it comes too close to one of my points.
It turns out that if my data points all lie within some sphere of radius r, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to r squared over four gamma squared, plus one; okay?
So this funny symbol here just means rounding up. This is a ceiling symbol; it means rounding up x.
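Here is a tiny sketch of plugging numbers into the bound as stated here, VC <= ceil(r^2 / (4 gamma^2)) + 1; the values of r and gamma are arbitrary illustrations.

    import math

    def margin_vc_bound(r, gamma):
        """VC dimension bound, as stated in the lecture, for linear separators
        with margin at least gamma on data lying in a ball of radius r."""
        return math.ceil(r ** 2 / (4.0 * gamma ** 2)) + 1

    # Note the bound depends only on r and gamma, not on the dimension of x.
    print(margin_vc_bound(r=1.0, gamma=0.1))   # 26
    print(margin_vc_bound(r=1.0, gamma=0.25))  # 5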
And it turns out, and there are some strange things about this result that I'm deliberately not gonna talk about, but you can prove that the VC dimension of the class of linear classifiers with large margin is actually bounded. The surprising thing about this is that this is a bound on VC dimension that has no dependence on the dimension of the points x. So in other words, your data points x can lie in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded.
And so, in trying to find a large margin separator, in trying to find the line that separates your positive and your negative examples with large margin, it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, these are countably infinite-dimensional vectors, not [inaudible] infinite-dimensional vectors. Normally, the norm of x squared is equal to the sum over i of xi squared, so if x is infinite dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again. Student: [Inaudible]. Instructor (Andrew Ng): Yes, although I assume that this is bounded by r. Oh, yeah, so this ensures that it converges.
So that's just something people sometimes wonder about. And lastly, let me actually tie empirical risk minimization back a little more strongly to the sorts of algorithms we've talked about. It turns out that the theory so far was really about empirical risk minimization. So let's focus on just one training example. Let me draw a function, you know, that's zero here and jumps to one, and it looks like that. And so this, for this one training example, this may be the indicator that h [inaudible], where z equals theta transpose x; okay? For one training example, your training example will be positive or negative, and depending on what the value of theta transpose x is, you either get it right or wrong. And so, you know, I guess if you have a positive example, then when z is positive, you get it right.
Suppose you have a negative example, so y equals 0; right? Then if z, which is theta transpose x, if this is positive, then you will get this example wrong; whereas, if z is negative, then you'd get this example right.
And so this is a plot of the indicator that h subscript theta of x is not equal to y; okay? You know, as a function of z equals theta transpose x; okay? And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on this training example, and so you'd like the indicator that h of x is not equal to y, you'd like this indicator function, to be 0.
It turns out this step function is clearly a non-convex function, and so it turns out that, even for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem.
And in particular, let me draw a function like that. It turns out that logistic regression is trying to maximize likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out it'll be a function that looks like this. And this curve that I just drew, you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, since you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above to approximate it, so you have a convex optimization problem and you can find the maximum likelihood estimate of the parameters for logistic regression.
And it turns out the support vector machine can also be viewed as approximating this function, only a little bit differently. Let's see, the support vector machine, it turns out, can be viewed as trying to approximate this step function too, with a different approximation that's piecewise linear: it's flat [inaudible] there, and then it goes up as a linear function there. And that is called the hinge loss. And so you can think of logistic regression and the support vector machine as different approximations to try to minimize this step function; okay?
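Here is a minimal sketch comparing the 0-1 loss with the two convex surrogates just mentioned, written as functions of the margin z = y * theta^T x with labels taken in {-1, +1} (a common convention; the lecture uses y in {0, 1} for logistic regression, but the shapes of the curves are the point here).

    import numpy as np

    def zero_one_loss(z):
        """1 if the example is misclassified, i.e. the margin z = y * theta^T x is <= 0."""
        return (z <= 0).astype(float)

    def logistic_loss(z):
        """Negative log likelihood surrogate minimized by logistic regression."""
        return np.log1p(np.exp(-z))

    def hinge_loss(z):
        """Hinge loss surrogate minimized by the support vector machine."""
        return np.maximum(0.0, 1.0 - z)

    z = np.linspace(-2.0, 2.0, 9)
    for name, loss in [("0-1", zero_one_loss), ("logistic", logistic_loss),
                       ("hinge", hinge_loss)]:
        print(name, np.round(loss(z), 2))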
And that's why, I guess, even though SVMs and logistic regression aren't exactly doing empirical risk minimization, all the theory we developed often gives completely appropriate intuitions for SVMs and logistic regression; okay. So that was the last of the loose ends. And if you didn't get this, don't worry too much about it. It's a high-level message; it's just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.
What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool. Okay. So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a tradeoff between bias and variance. In particular, it is important not to choose a hypothesis that's either too simple or too complex. So if your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit, so you'd have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, then you have high variance: you would overfit the data, and you'd also fail to generalize well.
So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? So remember the cartoon I drew last time of generalization error? I drew this last time, where on the x-axis was model complexity, meaning, say, the degree of the polynomial, the [inaudible] of the regression function, or whatever. And if you have too simple a model, you have high generalization error; that's underfitting. And if you have too complex a model, like a 14th or 15th-degree polynomial fit to five data points, then you also have high generalization error, and you're overfitting.
So what I wanna do now is actually just talk about model selection in the abstract; all right? Some examples of model selection problems will include, well, I'll run with the example of, let's say, trying to choose the degree of a polynomial; right? What degree polynomial do you want to choose?
Another example of a model selection problem would be if you're trying to choose the parameter [inaudible], which was the bandwidth parameter in locally weighted linear regression, or in some sort of locally weighted regression.
Yet another model selection problem is if you're trying to choose the parameter C [inaudible] in the [inaudible]; right? In the l1 norm soft margin SVM we had this optimization objective; right? And the parameter C controls the tradeoff between how much you want a large margin versus how much you want to penalize misclassified examples. So these are three specific examples of model selection problems.
And let's come up with a method for automatically choosing among them. Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay? Or you may also take the bandwidth parameter [inaudible] and discretize it into a range of values, and you're trying to choose from among those discrete values.
So let's talk about how you would select an appropriate model; all right? Well, one thing you could do is you can take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea, and why's that? Right. Cool. Because of overfitting; right. And so some of you are laughing that I asked that. So it'd be a terrible idea to choose a model by looking at your training error, because, well, obviously, you'd end up choosing the most complex model; right? You'd choose the 10th-degree polynomial, because that's what fits the training set best.
So when it comes to model selection, there are several standard procedures for doing this. One is hold-out cross validation. In hold-out cross validation, we take the training set and randomly split it into two subsets: take all the data you have and randomly split it into two subsets, and we'll call them the training subset and the hold-out cross validation subset.
And then, you know, you train each model on just the training subset of it, and test it on your hold-out cross validation set, and you pick the model with the lowest error on the hold-out cross validation subset; okay? So this is a relatively straightforward procedure, and it's commonly used: you train on 70 percent of the data, then test all of your models on the other 30 percent, and you take whichever has the smallest hold-out cross validation error.
And after this you actually have a choice. Having taken all of these hypotheses trained on 70 percent of the data, you can actually just output the hypothesis that has the lowest error on your hold-out cross validation set. Or, optionally, you can take the model that you selected and go back and retrain it on all 100 percent of the data; okay? Both versions are actually done and used really often. You can either, you know, just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can use this to, say, having chosen the degree of the polynomial you want to fit, go back and retrain the model on the entire 100 percent of your data. And both of these are commonly done.
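Here is a hedged sketch of hold-out cross validation with a 70-30 split, using scikit-learn purely for convenience (an assumption; any models would do), where the candidate models are polynomial fits of different degrees and the data are synthetic.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(60, 1))
    y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=60)  # quadratic signal plus noise

    # Randomly split into a 70 percent training subset and a 30 percent hold-out subset.
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

    best_degree, best_err = None, np.inf
    for degree in (1, 2, 5, 10):                       # the candidate models
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_tr, y_tr)
        err = mean_squared_error(y_ho, model.predict(X_ho))
        if err < best_err:
            best_degree, best_err = degree, err

    print("selected degree:", best_degree)
    # Optionally retrain the selected model on 100 percent of the data.
    final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
    final_model.fit(X, y)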
So hold-out cross validation sort of works straightforwardly. But sometimes, say we're working with a company or on an application or something, in many machine-learning applications we have very little data, or every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired by medical experiments, and each training example represents a certain amount of physical human pain or something. So if we come along and say, well, I'm going to hold out 30 percent of your data set just to select my model, sometimes that causes unhappiness, and so maybe you want to not have to leave out 30 percent of your data just to do model selection.
So there are a couple of other variations on hold-out cross validation that make sometimes slightly more efficient use of the data. One is called k-fold cross validation.
And here's the idea: I'm gonna take all of my data S; so imagine I draw this box S to denote the entirety of all the data I have. I'll then divide it into k pieces, and this is five pieces in what I've drawn. Then what I'll do is I will repeatedly train on k minus one pieces and test on the remaining piece, I guess; right? And then you average over the k results.
So in other words, I will hold out, say, just 1/5 of my data, train on the remaining 4/5, and test on the first piece. Then I'll go and hold out the second 1/5 from my [inaudible], train on the remaining pieces, and test on this; then remove the third piece, train on the other 4/5; and I'm gonna do this five times. And then I'll take the five error measures I have and average them, and this then gives me an estimate of the generalization error of my model; okay?
And then, again, when you do k-fold cross validation, usually you then go back and retrain the model you selected on the entirety of your training set.
So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say k equals 10 is the fairly common choice; people do 10-fold cross validation.
And the advantage of this over hold-out cross validation is that you split the data into ten pieces, so each time you're only holding out 1/10 of your data, rather than, you know, say, 30 percent of your data. I should say, in standard, simple hold-out cross validation, a 70-30 split is fairly common, or sometimes like 2/3-1/3. And if you use k-fold cross validation, k equals 5 or, more commonly, k equals 10 is the most common choice.
The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model, you now need to train your model ten times instead of just once, and so you need to fit, say, logistic regression ten times per model, rather than just once. And so this is computationally more expensive. But k equals ten works great.
And then, finally, there's actually a version of this that you can take even further, which is when you set k equal to m. And so that's when you take your training set and split it into as many pieces as you have training examples, and this procedure is called leave-one-out cross validation.
And what you do is you take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but your third example, test on the third example, and so on.
And so with this many pieces you are now making maybe even more effective use of your data than k-fold cross validation. But leave-one-out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and then run your learning algorithm on m minus one training examples. You need to do this a lot of times, and so this is computationally very expensive.
And typically, this is done only when you're extremely data-scarce. So if you have a learning problem where you have, say, 15 training examples or something, if you have very few training examples, leave-one-out cross validation is maybe preferred.
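And here is a minimal sketch of k-fold cross validation written by hand (only numpy is assumed); setting k equal to the number of examples gives leave-one-out as a special case. The train_and_eval argument is a hypothetical placeholder for whatever learning algorithm you are validating.

    import numpy as np

    def k_fold_cv_error(X, y, train_and_eval, k=10, seed=0):
        """Average validation error over k folds.

        train_and_eval(X_tr, y_tr, X_va, y_va) should fit a model on the training
        split and return its error on the validation split. Setting k = len(y)
        gives leave-one-out cross validation."""
        idx = np.random.RandomState(seed).permutation(len(y))
        folds = np.array_split(idx, k)
        errors = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            errors.append(train_and_eval(X[tr], y[tr], X[va], y[va]))
        return float(np.mean(errors))

You would call this once per candidate model, pick the model with the lowest averaged error, and then typically retrain that model on the entire training set.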
Yeah? Student: You know, earlier you proved that the difference between the generalization [inaudible] is bounded in terms of the number of examples in your training set and the VC dimension. So maybe [inaudible] examples into different groups, we can use that for [inaudible]. Instructor (Andrew Ng): Yeah, I mean... Student: ...compute the training error, and use that for computing [inaudible] for the generalization error. Instructor (Andrew Ng): Yeah, that's done, but, yeah, in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds, and so there are people in structural risk minimization who propose that you do that, but I personally tend not to do that, though. Questions about cross validation?
Yeah. Student: This is kind of related, because we spent all this time [inaudible], but how many data points do you sort of need [inaudible]? Right. So it seems like I'd be able to use that instead of doing this; more analytically, I guess. I mean... yeah. [Inaudible]. Instructor (Andrew Ng): No, okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're sort of proving the worst-case upper bound that holds true even for very bad, what is it, so the bounds that I proved just now, right, hold true for absolutely any probability distribution over training examples; right? We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any probability distribution script D. And chances are, whatever real-life distribution you get over, you know, houses and their prices or whatever, is probably not as bad as the very worst one you could've gotten; okay?
And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability; how many training examples do I need? If you actually plug actual constants into these learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need; you end up with some ridiculously large numbers, like you would need 10,000 training examples to fit ten parameters.
So a good way to think of these learning theory bounds is as follows, and this is why, also, when I write papers on learning theory bounds, I quite often use big-O notation to just absolutely ignore the constant factors: the bounds seem to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on. But I personally tend to use the bounds, again, for intuition about, for example, whether the number of training examples you need grows linearly in the number of parameters, or grows exponentially in the number of parameters, or whether it grows quadratically in the parameters.
So it's quite often the shape of the bounds, the fact that the sample complexity is linear in the VC dimension, that's the sort of useful intuition you can get from these theorems. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you are working on. So did that answer your question? Student: Uh-huh. Instructor (Andrew Ng): Yeah. And it turns out, by the way, for myself, a rule of thumb that I often use is: if you're trying to fit a logistic regression model, and you have n parameters or n plus one parameters, then if the number of training examples is ten times your number of parameters, you're probably in good shape; with that many training examples relative to the number of parameters, you're probably perfectly fine fitting that model. So those are the sorts of intuitions that you can get from these bounds. Student: In cross validation, do we assume these examples are split randomly? Instructor (Andrew Ng): Yes. So by convention we usually split the training and test sets randomly.
One more thing I want to talk about for model selection is that there's actually a special case of model selection called the feature selection problem. And so here's the intuition: for many machine-learning problems you may have a very high-dimensional feature space, very high-dimensional feature vectors x.
So for example, for text classification, and I wanna talk about this text classification example of spam versus non-spam, you may easily have on the order of 30,000 or 50,000 features; I think I used 50,000 in my earlier examples. So if you have that many features, you have 50,000 features, then depending on what learning algorithm you use, there may be a real risk of overfitting.
And so if you can reduce the number of features, maybe you can reduce the variance of your learning algorithm and reduce the risk of overfitting. For the specific case of text classification, you can imagine that maybe there's a small number of relevant features. There are all these English words, and many of these English words probably don't tell you anything at all about whether the email is spam or non-spam. If they're, you know, English function words like "the", "of", "a", "and", these are probably words that don't tell you anything about whether the email is spam or non-spam. In contrast, there will be a much smaller number of features that are truly relevant to the learning problem.
So for example, if you see the word "buy" or "Viagra", those are words that are very useful for telling spam from non-spam. If you see the word "Stanford" or "machine learning" or your own personal name, these are other words that are useful for telling you whether something is spam or non-spam. So in feature selection, we would like to select a subset of the features, hopefully the most relevant ones for a specific learning problem, so as to give ourselves a simpler hypothesis class to choose from, and therefore reduce the risk of overfitting, even when we may have had 50,000 features originally.
So how do you do this? Well, if you have n features, then there are two to the n possible subsets; right? Because, you know, each of your n features can either be included or excluded, so there are two to the n possibilities, and this is a huge space. So in feature selection, what we most commonly do is use various search heuristics, sort of simple search algorithms, to try to search through this space of two to the n possible subsets of features and find a good subset of features, since this is too large a number to enumerate all possible feature subsets.
As a concrete example, this is the forward search algorithm; it's also called the forward selection algorithm. It's actually pretty simple, but I'll just write it out; my writing it out will make it look more complicated than it really is. It starts with: initialize the set script F to be the empty set. Then repeat: for i equals one to n, try adding feature i to the set script F and evaluate the model using cross validation. And by cross validation, I mean any of the three flavors, be it simple hold-out cross validation or k-fold cross validation or leave-one-out cross validation. And then set F to be equal to F union the best feature found, I guess; okay? And finally, you would... okay?
So forward selection's procedure is: start with the empty set of features. Then on each iteration, take each of your features that isn't already in your set script F and try adding that feature to your set. You train a model with each, evaluate them all using cross validation, and basically figure out what is the best single feature to add to your set script F. In step two here, you go ahead and add that feature to your set script F, and you repeat. And when I say best feature or best model here, by best I really mean the best model according to hold-out cross validation; by best, I really mean the single feature addition that results in the lowest hold-out cross validation error, or the lowest cross validation error.
So you do this, adding one feature at a time. You can terminate this when you've added all the features to F, so that F is now the entire set of features. Or, if by some rule of thumb you know that you probably don't ever want more than k features, you can also terminate this if F has already exceeded some threshold number of features. So maybe if you have 100 training examples and you're fitting logistic regression, you probably know you won't want more than 100 features, and so you stop after you have 100 features added to the set F; okay?
And then finally, having done this, output the best hypothesis found. Again, by best I mean: when running this algorithm, you'd be seeing lots of hypotheses; you'd be training lots of hypotheses and testing them using cross validation. So when I say output the best hypothesis found, I mean, of all of the hypotheses you've seen during this entire procedure, pick the one with the lowest cross validation error that you saw; okay?
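Here is a hedged sketch of forward search, assuming a cv_error(subset) callback that trains a model using only the given feature indices and returns its cross validation error (using any of the flavors above); the callback and the max_features cutoff are illustrative conveniences, not part of the lecture's notation.

    def forward_selection(n_features, cv_error, max_features=None):
        """Greedy forward search over feature subsets.

        cv_error(subset) -> cross validation error of a model trained using only
        the feature indices in `subset` (lower is better)."""
        selected = []                              # script F, starting from the empty set
        best_subset, best_err = (), float("inf")
        limit = min(max_features or n_features, n_features)
        while len(selected) < limit:
            candidates = [f for f in range(n_features) if f not in selected]
            # Try adding each remaining feature; keep the single best addition.
            err, f = min((cv_error(tuple(selected) + (f,)), f) for f in candidates)
            selected.append(f)
            if err < best_err:
                best_subset, best_err = tuple(selected), err
        return best_subset, best_err               # the best subset seen during the search

Backward search, mentioned next, would be the mirror image: start from all the features and greedily delete one at a time.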
So that's forward selection. Let's see, just to give this a name: this is an instance of what's called wrapper feature selection. The term wrapper comes from the fact that this feature selection algorithm that I just described, forward selection or forward search, is a piece of software that you write that wraps around your learning algorithm, in the sense that to perform forward selection, you need to repeatedly make calls to your learning algorithm to train your model using different subsets of features; okay? So this is called wrapper model feature selection.
And it tends to be somewhat computationally expensive, because as you're performing the search process, you're repeatedly training your learning algorithm over and over and over on all of these different subsets of features. Let's just mention also that there is a variation of this called backward search or backward selection, which is where you start with F equal to the entire set of features, and you delete features one at a time; okay? So that's backward search or backward selection.
And this is another feature selection algorithm that you might use. Part of whether this makes sense is that there will be problems where it really doesn't even make sense to initialize F to be the set of all features. So if you have 100 training examples and 10,000 features, which may well happen, 100 emails and 10,000 features, then depending on the learning algorithm you're using, it may or may not make sense to initialize the set F to be all features and train a model using all the features. And if it doesn't make sense to train using all the features, then forward selection would be more common.
So let's see. Wrapper feature selection algorithms tend to work well, and in particular, they actually often work better than a different class of algorithms I'm gonna talk about now. But their main disadvantage is that they're computationally very expensive.
Do you have any questions about this before I talk about the other class? Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Yeah, yes, you're actually right. So forward search and backward search, both of these are search heuristics, and for either of these you cannot guarantee they'll find the best subset of features. It actually turns out that under many formalizations of the feature selection problem, it is actually NP-hard to find the best subset of features. But in practice, forward selection and backward selection work fine, and you can also envision other search algorithms, where you have other methods to search through the space of two to the n possible feature subsets.
So let's see. Wrapper feature selection tends to work well when you can afford to do it computationally. But for problems such as text classification, for text classification specifically, because you have so many features and easily have 50,000 features, forward selection would be very, very expensive.
So there's a different class of algorithms that tends not to do as well in the sense of generalization error, so you tend to learn a hypothesis that works slightly less well, but that is computationally much less expensive. These are called the filter feature selection methods.
And the basic idea is that for each feature i, we will compute some measure of how informative xi is about y; okay? And to do this, we'll use some simple heuristics: for every feature we'll just try to compute some rough estimate, some measure, of how informative xi is about y. There are many ways you can do this. One way is to just compute the correlation between xi and y, and for each of your features just see how correlated it is with your class label y, and then you just pick the top k most correlated features.
Another way to do this, for the case of text classification, there's one other measure which, especially for these kinds of features, I guess, is used very commonly as an informativeness measure, and it's called mutual information. You'll see some of these ideas in the problem sets, but I'll just say this very briefly.
So the mutual information between feature xi and y, I'll just write out the definition, I guess. Let's say this is text classification, so xi can take on two values, 0 and 1. The mutual information between xi and y is defined as a sum over all possible values of xi, and a sum over all possible values of y, of p of xi comma y, times log of p of xi comma y over p of xi times p of y.
Where all of these distributions, so the joint distribution over xi and y, you would estimate from your training data, and all of these other things you would estimate from the training data as well: what is the probability that xi is 0, what's the probability that xi is 1, what's the probability that xi is 0 and y is 0, xi is 1 and y is 0, and so on.
So it turns out there's a standard information-theoretic measure of how different probability distributions are, and I'm not gonna prove this here, but it turns out that this mutual information is actually an instance of that standard measure of how different distributions are, which is called the K-L divergence. If you've taken a class in information theory, you will have seen the concepts of mutual information and the K-L divergence, but if you haven't, don't worry about it. The intuition is that there's something called the K-L divergence that's a formal measure of how different two probability distributions are, and mutual information is a measure of how different the joint distribution of x and y is from the distribution you would get if you were to assume they were independent; okay?
So if x and y were independent, then p of x comma y would be equal to p of x times p of y, and so, you know, this distribution and this distribution would be identical, and the K-L divergence would be 0. In contrast, if x and y are very non-independent, in other words, if x and y are very informative about each other, then this K-L divergence will be large. And so mutual information is a formal measure of how non-independent x and y are. If x and y are highly non-independent, then that means that x will presumably tell you something about y, and so they'll have large mutual information, and this measure of mutual information will tell you that xi might be a good feature. And you get to play with some of these ideas more in the problem sets, so I won't say much more about it.
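Here is a hedged sketch of the filter approach for binary features and labels: estimate the joint and marginal distributions from counts, compute the mutual information score for each feature, and rank. The small epsilon added to the estimated probabilities (to avoid taking the log of zero) is my own addition, not something from the lecture.

    import numpy as np

    def mutual_information(xi, y, eps=1e-9):
        """MI(xi; y) = sum over a, b in {0, 1} of p(xi=a, y=b) * log(p(a, b) / (p(a) * p(b))),
        with the probabilities estimated from the training data."""
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_ab = np.mean((xi == a) & (y == b)) + eps
                p_a = np.mean(xi == a) + eps
                p_b = np.mean(y == b) + eps
                mi += p_ab * np.log(p_ab / (p_a * p_b))
        return mi

    def top_k_features(X, y, k):
        """Rank the columns of a binary design matrix X by mutual information with y."""
        scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
        return np.argsort(-np.array(scores))[:k]   # indices of the k most informative features

You would then sweep over the number of retained features (top one, top two, and so on) and pick that number by cross validation, as described next.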
And what you do then is, having chosen some measure like correlation or mutual information or something else, you then pick the top k features; meaning that you compute the correlation between xi and y for all the features, or the mutual information between xi and y for all the features, and then you include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, whatever. And to choose k, you can actually use cross validation as well; okay?
So you would take all your features and sort them in decreasing order of mutual information, and then you'd try using just the top one feature, the top two features, the top three features, and so on, and you decide how many features to include using cross validation; okay? Or sometimes you can just choose this by hand as well. Okay. Questions about this? Okay.
Cool. Great. So next lecture I'll continue; I'll wrap up with Bayesian model selection, but let's close here.