This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of bias-variance tradeoffs — of underfitting and overfitting — that we've been seeing in the previous lecture and will see again in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time we have, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class — if H is a set of k hypotheses — and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation — just dropping constants — I can also write this as that; okay?

So just to quickly remind you of what all the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h misclassifying a new example drawn from the same distribution as the training set. And so this says: in order to guarantee that the generalization error of the hypothesis h hat output by empirical risk minimization is less than or equal to the best possible generalization error attainable in your hypothesis class, plus two times gamma — two times this error threshold — we want to guarantee that this holds with probability at least one minus delta.
We showed that it suffices for your training set size m to be greater than or equal to this: one over two gamma squared, times log of two k over delta, where again k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay? So this is a sample complexity result.
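Written out, the recap above (reconstructed here from the spoken description; h-hat denotes the ERM output, epsilon the generalization error, and $k = |\mathcal{H}|$) says that with probability at least $1 - \delta$,

$$\varepsilon(\hat h) \;\le\; \Big(\min_{h \in \mathcal{H}} \varepsilon(h)\Big) + 2\gamma \qquad \text{provided} \qquad m \;\ge\; \frac{1}{2\gamma^2}\,\log\frac{2k}{\delta} \;=\; O\!\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big).$$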
So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. Here we assumed that the set script H is just k specific functions, but often you want to use a model like logistic regression, which is actually parameterized by real numbers.

So I'm actually first going to give an argument that's sort of formally broken — just technically somewhat broken — but that conveys useful intuition. And then I'll give the more correct argument, but without proving it, since the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression.

So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script H is parameterized by d real numbers; okay? For example, if you're applying logistic regression with n features, then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

Now think about how your hypothesis class is really represented in a computer. Computers use zero-one bits to represent real numbers. If you use a normal standard computer, it will usually represent real numbers by what's called double-precision floating point, and what that means is that each real number is represented by a 64-bit representation; right? A 64-bit floating point number is what almost all of us use routinely. And so a class parameterized by d real numbers is really, as it were, parameterized by 64 times d bits. Computers can't represent real numbers exactly; they only represent them to finite precision.

And so for the size of your hypothesis class, in your computer representation, you have 64 times d bits that you can flip, and the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits. And this is why it was important that we had log k there; right? So k is therefore two to the 64 d, and if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared times log of two to the 64 d over delta — which is that; okay?

So just to be clear, this is the same sort of sample complexity result as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this. In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay?

And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class — that m has to be on the order of something linear in the number of parameters. That intuition is actually roughly right. I'll say more about this later.
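As a quick numerical illustration of this "broken" argument (my own sketch, not from the lecture), here is the finite-class bound m ≥ (1/(2γ²)) log(2k/δ) evaluated with k = 2^(64d); the particular values of γ and δ are arbitrary, and the point is simply that the bound grows roughly linearly in d.

```python
import math

def sample_complexity(log2_k, gamma, delta):
    """m >= 1/(2 gamma^2) * log(2 k / delta), with k supplied via its base-2 logarithm
    so that k = 2**(64*d) stays representable for large d."""
    log_k = log2_k * math.log(2)  # natural log of k
    return (math.log(2) + log_k - math.log(delta)) / (2 * gamma ** 2)

gamma, delta = 0.1, 0.05
for d in [1, 10, 100]:                              # number of real-valued parameters
    m = sample_complexity(64 * d, gamma, delta)     # treat the class as 2**(64 d) hypotheses
    print(f"d = {d:4d}  ->  m >= {m:,.0f}")         # grows roughly linearly in d
```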
This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers. So let me actually go ahead and outline the "right way" to show this more formally; all right? It turns out the "right way" to show this more formally involves a much longer argument — the proof is extremely involved — so I'm just going to state the result and not prove it. The full proof draws on some of the deeper learning theory machinery for bounding infinite hypothesis classes.

Here's a definition. Given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay?

And what I mean by realizing any labeling on it — the informal way of thinking about this is: if a hypothesis class shatters the set S, that means I can take these d points and associate them with any set of labels y; right? Choose any set of labels y for these d points. And if your hypothesis class shatters S, then that means there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means.

So let me just illustrate this with an example. Let's say H is the class of all linear classifiers in two dimensions, and let's say that S is this set comprising two points; okay? So there are four possible labelings for these two points: you can label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative. And if the hypothesis class H is the class of all linear classifiers in 2-D, then for each of these labelings I can find a linear classifier that attains zero training error on it — on every possible labeling of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?

One more example — let me show you a larger example. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again I say that, by definition, my hypothesis class also shatters this set S. Student: Right. Instructor (Andrew Ng): And that's the terminology — H can realize any labeling on S. Given any set of labels, you can find a hypothesis that perfectly separates the positive and negative examples; okay?

So how about this set?
Suppose S is now this set of four points. Then there are lots of labelings — there are now 16 labelings we can choose for this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings. But there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?

So here's another definition — it's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC(H), is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.

So, as a good example: if H is the class of all linear classifiers in 2-D, then the VC dimension of H is equal to three, because we saw just now that there was a set S of size three that it could shatter, and — I won't really prove it, but — it turns out there is no set of size four that it can shatter. And therefore the VC dimension of this class is three. Yeah? Student: But there are sets of size three that it cannot shatter; right? That was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this — call it S — then there are labelings on it that H cannot realize, and so H cannot shatter this set. But that's okay, because there definitely does exist some other set of size three that is shattered. So the VC dimension is three. And then there is no set of size four that can be shattered. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition, no. Right. So again, I can choose my set S to be a set of three points that are all overlapping — three points in exactly the same place. And clearly, I can't shatter this set, but that's okay, because there are some other sets of size three that I can shatter.
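To make the definition concrete, here is a small brute-force check — my own illustration, not something from the lecture — of whether linear classifiers in 2-D shatter a given point set. It enumerates every labeling and asks whether some linear decision boundary realizes it, using scikit-learn's linear SVC with a very large C as a stand-in for "find a separating line if one exists," which is adequate for tiny point sets like these.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def is_shattered(points):
    """True if linear classifiers (with intercept) can realize every labeling of `points`."""
    X = np.asarray(points, dtype=float)
    for labels in itertools.product([0, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # all-positive / all-negative is always realizable by a line off to one side
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C: effectively a hard separator if one exists
        if clf.score(X, y) < 1.0:
            return False  # found a labeling no linear decision boundary realizes
    return True

print(is_shattered([(0, 0), (1, 0), (0, 1)]))          # True: three points in general position
print(is_shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR labeling is not realizable
print(is_shattered([(0, 0), (1, 1), (2, 2)]))          # False: three collinear points (a degenerate set)
```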
And it turns out this result holds true more generally: in n dimensions, the VC dimension of the class of linear classifiers is equal to n plus one. Okay? So if you have linear classifiers over an n-dimensional feature space, the VC dimension is n plus one.

So maybe you want to write this down — what is arguably the best-known result in all of learning theory, I guess — which is that: let a hypothesis class H be given, and let the VC dimension of H be equal to d. Then with probability at least one minus delta, we have that — the formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later. The key to this result is that if you have a hypothesis class with VC dimension d — and now this can be an infinite hypothesis class — what Vapnik and Chervonenkis showed is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that, for all hypotheses h — for all the hypotheses in your hypothesis class — the generalization error of h minus the training error of h, the difference between these two things, is bounded above by some complicated formula like this; okay?

And thus, with probability at least one minus delta, we also have the same sort of thing for the hypothesis we pick; okay? And going from this step to this step is actually something you've seen yourself — that we actually proved earlier. Because, remember, in the previous lecture we proved that if you have uniform convergence — if generalization error and training error are within gamma of each other for every hypothesis — then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really: generalization error of h hat is less than or equal to the best possible generalization error plus two times gamma, and it's just the constant two in front here that I've absorbed into the big-O notation.
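The "complicated formula" on the board isn't read out in the transcript; as this Vapnik-Chervonenkis bound is commonly stated (take this as a reconstruction, not a quote from the lecture), it says that with probability at least $1-\delta$, for all $h \in \mathcal{H}$,

$$\big|\varepsilon(h) - \hat\varepsilon(h)\big| \;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right),$$

and consequently, with probability at least $1-\delta$,

$$\varepsilon(\hat h) \;\le\; \Big(\min_{h\in\mathcal{H}}\varepsilon(h)\Big) \;+\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right).$$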
So that formula is slightly more complicated, but let me just rewrite it as a corollary, which is that in order to guarantee that this holds with probability at least one minus delta, it suffices — and I'm going to write this this way: I'm going to write m equals big-O of d, and I'm going to put gamma and delta in as subscripts to denote that I'm treating gamma and delta as constants, which allows me to absorb terms that depend on gamma and delta into the big-O notation — it suffices that m is on the order of the VC dimension of the hypothesis class; okay?

So what we conclude from this is that, for empirical risk minimization algorithms — in other words, less formally, for learning algorithms that try to minimize training error — the intuition to take away is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?

It turns out that for most reasonable hypothesis classes, the VC dimension is very similar, I guess, to the number of parameters in your model. So for example, take logistic regression for linear classification: logistic regression in n dimensions has n plus one parameters, and the VC dimension of the class of linear classifiers in n dimensions is also n plus one. So it turns out that for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model — or at worst, some low-order polynomial in the number of parameters of your model. And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be, let's say, roughly linear in the number of parameters in your model; okay?
There are some somewhat strange examples where what I just said is not true — where you have very few parameters but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate — somewhat unusual, and not the sort of learning algorithms you usually use.

Let's see, just a few other things. This result shows that sample complexity is upper bounded by the VC dimension. It also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, and the number of training examples you have is less than on the order of the VC dimension, then it is not possible to prove a bound like this. So I guess, in the worst case, sample complexity — the number of training examples you need — is both upper bounded and lower bounded by the VC dimension.

Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness — like, do you have to discretize the real numbers or [inaudible]? Instructor (Andrew Ng): Let's see. No, the proof does not. I've actually stated the entirety of the theorem; this is true as stated. It turns out that somewhere in the proof there's a construction called an epsilon net, which is a very clever [inaudible] — there's one step of the proof that uses a very clever construction to prove this. But that's part of the proof; it is not an assumption that you need.

I'd like to say: back when I was a Ph.D. student, when I was working through this proof, there was a solid week where I would wake up, go to the office at 9:00 a.m., and start reading the book that led up to this proof. And I'd read from 9:00 a.m. to 6:00 p.m., go home, and then the next day pick up where I left off. It sort of took me a whole week that way to understand this proof, so I thought I would inflict that on you.

Just to tie up a couple of loose ends: what I'm about to do is mention a few things that will maybe feel a little bit like random facts, but I'm just going to tie up a couple of loose ends.
And so, let's see. It turns out — just so this will be more concrete for you — this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you have asked is: how about support vector machines; right? How come SVMs don't overfit? Because remember, in our discussion on support vector machines, we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: it's n plus one, and n is infinite.

So it turns out that the class of linear separators with large margin actually has low VC dimension. I want to say this very quickly and informally — it's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin — say with a margin of at least gamma; okay? And so I won't allow a line that comes closer than that — I won't allow that line, because it comes too close to one of my points.

It turns out that if I assume my data points all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, plus one; okay? And this funny symbol here just means rounding up — it's a ceiling symbol; it means rounding up x. And it turns out you can prove — there are some strange things about this result that I'm deliberately not going to talk about — but they can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded. The surprising thing about this is that this is a bound on the VC dimension that has no dependence on the dimension of the points x.
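Written out, the bound just stated (reconstructed from the spoken description; $R$ is the radius of the ball containing the data and $\gamma$ is the margin) is

$$\mathrm{VC}\big(\mathcal{H}_{\gamma}\big) \;\le\; \left\lceil \frac{R^2}{4\gamma^2} \right\rceil + 1,$$

where $\mathcal{H}_\gamma$ denotes the class of linear separators that attain margin at least $\gamma$ on the data.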
So in other words, your data points x can live in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large-margin separator — in trying to find the line that separates your positive and your negative examples with a large margin — it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, suppose you have infinite — actually, these are countably infinite-dimensional vectors, not [inaudible] infinite-dimensional vectors. The two-norm of x, squared, is equal to the sum over i equals one to infinity of xi squared, so if x is infinite-dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again. [Inaudible]. Yes — although I assume that this is bounded by R. Oh — yeah, so this ensures that that converges. So that's just something people sometimes wonder about.

And lastly, to actually tie empirical risk minimization back a little more strongly to the sorts of algorithms we've talked about: it turns out that the theory so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function, you know — it's zero here, jumps to one, and it looks like that. And so, for this one training example, this may be the indicator that h subscript theta of x is not equal to y, plotted as a function of z, where z equals theta transpose x; okay? For one training example — your training example will be positive or negative, and depending on what the value of theta transpose x is, you either get it right or wrong. And so, you know, I guess if you have a positive example, then when z is positive, you get it right.
Suppose you have a negative example, so y equals zero; right? Then if z — which is theta transpose x — is positive, you will get this example wrong; whereas if z is negative, you'd get this example right. And so this is a plot of the indicator that h subscript theta of x is not equal to y, where the hypothesis, you know, is g of theta transpose x; okay?

And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on your training example, and so you'd like this indicator — the indicator of h of x not equal to y — to be zero. It turns out this step function is clearly a non-convex function, and so it turns out that, for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem.

In particular — let me draw a function like that — it turns out that logistic regression is trying to maximize the likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out to be a function that looks like this. And this curve that I just drew — you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, since you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization: instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above it, so you have a convex optimization problem, and you can find the maximum likelihood estimate of the parameters for logistic regression.

And it turns out the support vector machine can also be viewed as approximating this function, only with a slightly different approximation. Let's see — the support vector machine, it turns out, can be viewed as trying to approximate this step function with a different approximation that is flat out here, and then goes up as a linear function there. And that is called the hinge loss.
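As a small numerical illustration of this picture (my own sketch, not from the lecture), here are the three losses evaluated as a function of the signed margin z — read z as theta transpose x for a positive example, so z > 0 means the example is classified correctly:

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)              # signed margin: positive means correctly classified

zero_one = (z < 0).astype(float)           # the non-convex step function ERM would like to minimize
logistic = np.log1p(np.exp(-z))            # negative log-likelihood term minimized by logistic regression
hinge    = np.maximum(0.0, 1.0 - z)        # hinge loss minimized by the support vector machine

for zi, a, b, c in zip(z, zero_one, logistic, hinge):
    print(f"z = {zi:+.1f}   0-1: {a:.0f}   logistic: {b:.3f}   hinge: {c:.3f}")
```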
And so you can think of logistic regression and the support vector machine as different convex approximations for trying to minimize this step function; okay? And that's why, I guess, all the theory we developed — even though SVMs and logistic regression aren't exactly doing empirical risk minimization — the theory we've developed often gives completely appropriate intuitions for SVMs and logistic regression; okay?

So that was the last of the loose ends. And if you didn't get this, don't worry too much about it. The high-level message is just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.

What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool.

Okay. So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a tradeoff between bias and variance. And in particular, it is important not to choose a hypothesis that's either too simple or too complex. So if your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit — you'd have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, then you have high variance: you would overfit the data, and you'd also fail to generalize well.

So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? Remember the cartoon I drew last time of generalization error? I drew this last time, where on the x-axis was model complexity, meaning the degree of the polynomial, the [inaudible] of the regression function, or whatever. And if you have too simple a model, you have high generalization error — that's underfitting. And if you have too complex a model — like fitting a 14th- or 15th-degree polynomial to five data points — then you also have high generalization error, and you're overfitting.

So what I want to do now is actually just talk about model selection in the abstract; all right?
Some examples of model selection problems include — well, I'll run with the example of, let's say, trying to choose the degree of a polynomial; right? What degree polynomial do you want to fit? Another example of a model selection problem would be trying to choose the parameter tau, which was the bandwidth parameter in locally weighted linear regression, or in some other sort of locally weighted regression. Yet another model selection problem is choosing the parameter C in the SVM; right? In the soft-margin SVM we had this optimization objective, and the parameter C controls the tradeoff between how much weight you put on having a large margin versus how much you want to penalize misclassified examples. So these are three specific examples of model selection problems. And let's come up with a method for automatically choosing among the candidates.

Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay? Or you may also take the bandwidth parameter tau and discretize it into a range of values, and you're trying to choose from that discrete set of values.

So let's talk about how you would select an appropriate model; all right? Well, one thing you could do is take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea — and why is that? Right. Cool. Because of the overfitting; right. And some of you are laughing that I even asked. It would be a terrible idea to choose a model by looking at training error, because then, obviously, you'd end up choosing the most complex model; right? You'd choose the 10th-degree polynomial, because that's what fits the training set best.

So there are several standard procedures for doing model selection. One is hold-out cross validation, and in hold-out cross validation, we take the training set and we randomly split it into two subsets.
We take all the data you have and randomly split it into two subsets, and we'll call them the training subset and the hold-out cross validation set. And then, you know, you train each model on just the training subset, and test it on your hold-out cross validation set. And you pick the model with the lowest error on the hold-out cross validation subset; okay? So this is a relatively straightforward procedure, and it's commonly used: you train on, say, 70 percent of the data, then test all of your models on the remaining 30 percent, and you pick whichever has the smallest hold-out cross validation error.

And after this, you actually have a choice. Having trained all of these hypotheses on 70 percent of the data, you can just output the hypothesis that has the lowest error on your hold-out cross validation set. Or, optionally, you can take the model that you selected and go back and retrain it on all 100 percent of the data; okay? Both versions are done and used really often: you can either just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can use this procedure to, say, choose the degree of the polynomial you want to fit, and then go back and retrain that model on the entire 100 percent of your data. Both of these are commonly done.

Now, hold-out cross validation does sort of work. But sometimes we're working with a company or an application where — in many machine learning applications we have very little data, or every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired through medical experiments, and each training example represents a sick person and some amount of physical human pain or something. So if we then say, well, I'm going to hold out 30 percent of your data set just to select my model, sometimes that causes unhappiness, and so maybe you don't want to have to leave out 30 percent of your data just to do model selection.
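Here is a minimal sketch of the hold-out procedure described above, using scikit-learn; the candidate models (polynomial-feature logistic regressions of increasing degree), the synthetic data, and the 70/30 split are illustrative choices of mine, not something specified in the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Randomly split into a 70% training subset and a 30% hold-out cross validation set.
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate models M1, M2, ...: here, increasing polynomial degree.
models = {deg: make_pipeline(PolynomialFeatures(deg), LogisticRegression(max_iter=5000))
          for deg in [1, 2, 3]}

# Train each model on the training subset, measure its error on the hold-out set.
holdout_error = {deg: 1.0 - m.fit(X_tr, y_tr).score(X_cv, y_cv) for deg, m in models.items()}
best = min(holdout_error, key=holdout_error.get)
print(holdout_error, "-> selected degree", best)

# Optionally, retrain the selected model on 100% of the data before using it.
final_model = models[best].fit(X, y)
```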
So there are a couple of other variations on hold-out cross validation that make somewhat more efficient use of the data. One is called k-fold cross validation, and here's the idea: I'm going to take all of my data S — so imagine I draw this box S to denote the entirety of all the data I have — and I'll then divide it into k pieces; that's five pieces in what I've drawn. Then what I'll do is repeatedly train on k minus one pieces and test on the remaining piece, and then average over the k results.

In other words: I will hold out, say, just one fifth of my data, train on the remaining four fifths, and test on the first piece. Then I'll hold out the second fifth, train on the remaining pieces, and test on that. Then remove the third piece, train on the other four fifths, and so on — I'm going to do this five times. And then I'll take the five error measures I have and average them, and this then gives me an estimate of the generalization error of my model; okay? And then, again, when you do k-fold cross validation, usually you go back and retrain the model you selected on the entirety of your training set.

So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say k equals 10 is the fairly standard choice — you do 10-fold cross validation. And the advantage over hold-out cross validation is that you split the data into ten pieces, so each time you're only holding out one tenth of your data, rather than, you know, say, 30 percent of your data. I should say, in simple hold-out cross validation, a 70-30 split is fairly common; sometimes 2/3-1/3, but a 70-30 split is fairly common. And if you use k-fold cross validation, k equals 5 or, more commonly, k equals 10 is the most common choice.

The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model you now need to train it ten times instead of just once, and so you need to run logistic regression ten times per model, rather than just once. So this is computationally more expensive. But k equals ten works great.
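Here is the k-fold version of the same selection loop — again a sketch with illustrative choices on my part; scikit-learn's cross_val_score does the split-train-test-average bookkeeping with cv=10 folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
models = {deg: make_pipeline(PolynomialFeatures(deg), LogisticRegression(max_iter=5000))
          for deg in [1, 2, 3]}

# 10-fold cross validation: train on 9/10 of the data, test on the held-out tenth,
# rotate through all ten folds, and average the ten error estimates.
kfold_error = {deg: 1.0 - cross_val_score(m, X, y, cv=10).mean() for deg, m in models.items()}
best = min(kfold_error, key=kfold_error.get)
print(kfold_error, "-> selected degree", best)

# Retrain the selected model on the entire data set.
final_model = models[best].fit(X, y)
```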
And then, finally, there's a version of this that you can take even further, which is when you set k equal to m. So that's when you take your training set and split it into as many pieces as you have training examples, and this procedure is called leave-one-out cross validation. And what you do is you take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but your third example, test on the third example, and so on.

And so, with this many pieces, you are now making maybe even more effective use of your data than k-fold cross validation. But leave-one-out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and then run your learning algorithm on m minus one training examples. You need to do this a lot of times, and so this is computationally very expensive. And typically, this is done only when you're extremely data scarce. So if you have a learning problem where you have, say, 15 training examples or something — if you have very few training examples — then leave-one-out cross validation is maybe preferred.
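A leave-one-out sketch looks the same as the k-fold one except that every fold holds out a single example; scikit-learn's LeaveOneOut iterator (used here purely as an illustration, on a deliberately tiny synthetic data set) makes the m-fold structure explicit:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A tiny data set: leave-one-out is usually reserved for data-scarce problems like this.
X, y = make_classification(n_samples=15, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

model = LogisticRegression(max_iter=5000)
# One fold per training example: train on m - 1 examples, test on the single one left out.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("estimated generalization error:", 1.0 - scores.mean())
```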
Yeah? Student: You know, earlier you proved a bound on the difference between generalization error and training error in terms of the number of examples in your training set and the VC dimension. So maybe, instead of splitting the examples into different groups, we could use that — compute the training error, and use the bound to estimate the generalization error? Instructor (Andrew Ng): Yeah, that's done, but in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds. There are people in structural risk minimization who propose doing what you describe, but I personally tend not to do that, though.

Questions for cross validation? Yeah. Student: This is kind of related — we spent all this time [inaudible], but how many data points do you sort of need to get a certain margin [inaudible]? It seems like I'd be able to use that instead of doing this — more analytically, I guess. Instructor (Andrew Ng): Okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're proving a worst-case upper bound that holds true even for very bad — what is it — so the bounds that I proved just now, right, hold true for absolutely any probability distribution over training examples. We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any probability distribution script D. And chances are, whatever real-life distribution you get — over, you know, houses and their prices or whatever — is probably not as bad as the very worst one you could have gotten; okay?

And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability — how many training examples do I need? If you actually plug actual constants into these learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large numbers — you would need 10,000 training examples to fit ten parameters.

So a good way to think of these learning theory bounds is — and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just absolutely ignore the constant factors, because the bounds seem to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on, but I personally tend to use the bounds for, again, intuition: for example, does the number of training examples you need grow linearly in the number of parameters, or does it grow quadratically in the number of parameters? So it's quite often the shape of the bounds — the fact that the sample complexity is linear in the VC dimension — that's the sort of useful intuition you can get from these theories. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you are working on. So did that answer your question? Student: Uh-huh. Instructor (Andrew Ng): Yeah.
And it turns out, by the 0:53:28.409,0:53:31.509 way, for myself, a rule of thumb that I often use is if 0:53:31.509,0:53:34.969 you're trying to fit a logistic regression model, 0:53:34.969,0:53:36.179 if you have 0:53:36.179,0:53:38.749 n parameters or n plus one parameters, 0:53:38.749,0:53:42.429 if the number of training examples is ten times your number of 0:53:42.429,0:53:44.709 parameters, then you're probably in good shape. 0:53:44.709,0:53:49.289 And if your number of training examples is many times the number of parameters, then 0:53:49.289,0:53:52.299 you're probably perfectly fine fitting that model. 0:53:52.299,0:53:54.880 So those are the sorts of intuitions 0:53:54.880,0:54:01.880 that you can get from these bounds. Student:In cross validation, do 0:54:02.939,0:54:06.739 we assign these examples randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the train 0:54:06.739,0:54:13.739 and test sets randomly. 0:54:22.989,0:54:26.839 One more thing I want to talk about for model selection is there's actually a special case of model selection, 0:54:26.839,0:54:33.839 called the feature selection problem. 0:54:37.809,0:54:41.339 And so here's the intuition: 0:54:41.339,0:54:45.479 for many machine-learning problems you may have a very high dimensional 0:54:45.479,0:54:48.119 feature space with very high dimensional 0:54:48.119,0:54:51.640 input features x. 0:54:51.640,0:54:56.289 So for example, for text classification and I wanna talk about this text classification example of 0:54:56.289,0:54:58.299 spam versus 0:54:58.299,0:55:00.430 non-spam you may easily have on 0:55:00.430,0:55:04.130 the order of 30,000 or 50,000 features. I think I used 50,000 in 0:55:04.130,0:55:07.150 my earlier examples. So if you have 0:55:07.150,0:55:10.659 so many features you have 50,000 features 0:55:10.659,0:55:13.759 then, depending on what learning algorithm you use, there may be a real 0:55:13.759,0:55:15.529 risk of over fitting. 0:55:15.529,0:55:17.219 And so 0:55:17.219,0:55:20.079 if you can reduce the number of features, maybe 0:55:20.079,0:55:23.179 you can reduce the variance of your learning algorithm, and reduce the risk of 0:55:23.179,0:55:25.309 over fitting. And for 0:55:25.309,0:55:28.259 the specific case of text classification, 0:55:28.259,0:55:31.919 you might imagine that maybe there's only a small number of relevant features. So 0:55:31.919,0:55:33.679 there are all these English words, 0:55:33.679,0:55:36.969 and many of these English words probably don't tell you anything at all about 0:55:36.969,0:55:39.880 whether the email is spam or non-spam. You 0:55:39.880,0:55:43.989 know, English function words like the, of, a, and: 0:55:43.989,0:55:48.049 these are probably words that don't tell you anything about whether the email is spam or non-spam. In 0:55:48.049,0:55:50.569 contrast, there will be a much smaller number of 0:55:50.569,0:55:54.319 features that are truly relevant to the learning problem. 0:55:54.319,0:55:57.369 So for example, if you see the word buy or Viagra, those are words that 0:55:57.369,0:56:01.669 are very useful for telling you whether something is spam or non-spam. If you see the word 0:56:01.669,0:56:02.829 Stanford or 0:56:02.829,0:56:06.270 machine learning or your own personal name, these are other words that are useful for 0:56:06.270,0:56:10.529 telling you whether something is spam or non-spam.
So 0:56:10.529,0:56:15.659 in feature selection, we would like to select a subset of the features 0:56:15.659,0:56:19.519 that are maybe, or hopefully, the most relevant ones for a specific learning 0:56:19.519,0:56:20.369 problem, so 0:56:20.369,0:56:25.009 as to give ourselves a simpler hypothesis class to choose 0:56:25.009,0:56:28.499 from, and therefore reduce the risk of over fitting, 0:56:28.499,0:56:29.699 even when we 0:56:29.699,0:56:36.619 may have had 50,000 features originally. So 0:56:36.619,0:56:41.209 how do you do this? Well, if you 0:56:41.209,0:56:44.009 have n 0:56:44.009,0:56:47.329 features, then there are 0:56:47.329,0:56:49.529 two to the n possible subsets; 0:56:49.529,0:56:54.429 0:56:54.429,0:56:55.900 right? Because, you know, 0:56:55.900,0:56:59.880 each of your n features can either be included or excluded. So there are two to the n 0:56:59.880,0:57:00.909 possibilities. 0:57:00.909,0:57:04.200 And this is a huge space. 0:57:04.200,0:57:08.859 So in feature selection, what we most commonly do is use various search heuristics 0:57:08.859,0:57:09.510 sort of 0:57:09.510,0:57:10.799 simple search algorithms 0:57:10.799,0:57:15.329 to try to search through this space of two to the n possible subsets of features, 0:57:15.329,0:57:16.710 to try to find a good 0:57:16.710,0:57:18.290 subset of features. This is 0:57:18.290,0:57:22.569 too large a number to enumerate all possible feature subsets. 0:57:22.569,0:57:25.519 And as a concrete example, 0:57:25.519,0:57:26.759 0:57:26.759,0:57:31.159 this is the forward search algorithm; it's also called the forward selection algorithm. 0:57:31.159,0:57:32.689 0:57:32.689,0:57:34.849 It's actually pretty simple, but I'll just 0:57:34.849,0:57:36.080 write it out. 0:57:36.080,0:57:42.059 My writing it out will make it look more complicated than it really 0:57:42.059,0:57:44.649 is, but 0:57:44.649,0:57:46.040 it starts with: 0:57:46.040,0:57:48.149 initialize the set script f 0:57:48.149,0:57:51.649 to be the empty set, 0:57:51.649,0:57:53.239 and then 0:57:53.239,0:57:55.019 repeat: 0:57:55.019,0:57:56.619 for 0:57:56.619,0:57:58.859 i equals one 0:57:58.859,0:58:00.439 to n, 0:58:04.849,0:58:08.239 try adding 0:58:08.239,0:58:11.459 feature i 0:58:11.459,0:58:13.849 to the set script 0:58:13.849,0:58:16.039 f, and evaluate the 0:58:16.039,0:58:19.739 model 0:58:19.739,0:58:25.719 using cross validation. 0:58:25.719,0:58:30.399 And by cross validation, I mean any of the three flavors, be it simple hold out cross 0:58:30.399,0:58:34.629 validation or k-fold cross validation or leave one out cross 0:58:34.629,0:58:35.739 validation. 0:58:35.739,0:58:39.949 And then, you know, set script f to be 0:58:39.949,0:58:41.239 equal to 0:58:41.239,0:58:43.249 script f union 0:58:43.249,0:58:48.579 the best feature 0:58:48.579,0:58:55.579 found, I guess; okay? 0:58:57.619,0:59:00.059 And finally, you would 0:59:00.059,0:59:06.920 output the best hypothesis found; okay? So 0:59:06.920,0:59:13.920 0:59:16.699,0:59:20.319 the forward selection procedure is: start with the empty set of features. 0:59:20.319,0:59:22.319 And then on each 0:59:22.319,0:59:25.689 iteration, take each of 0:59:25.689,0:59:28.829 your features that isn't already in your set script f and you try adding that 0:59:28.829,0:59:30.449 feature to your set. Then 0:59:30.449,0:59:33.870 you train a model and evaluate it using 0:59:33.870,0:59:35.039 cross validation.
0:59:35.039,0:59:38.719 And basically, figure out what is the best single feature to add to your 0:59:38.719,0:59:40.719 set 0:59:40.719,0:59:43.709 script f. In step two here, you go ahead and add 0:59:43.709,0:59:46.509 that feature to your set script f, and you repeat; right? 0:59:46.509,0:59:50.559 And when I say best feature or best model here by best, I really mean the 0:59:50.559,0:59:53.669 best model according to hold out cross validation. 0:59:53.669,0:59:56.049 By best, I really mean 0:59:56.049,1:00:00.639 the single feature addition that results in the lowest hold out 1:00:00.639,1:00:01.509 cross validation error or the 1:00:01.509,1:00:03.209 lowest cross validation error. 1:00:03.209,1:00:04.499 So you do this, 1:00:04.499,1:00:07.959 adding one feature at a time. 1:00:07.959,1:00:12.019 When do you terminate this? Well, if you've 1:00:12.019,1:00:15.079 added all the features to f, so f is now 1:00:15.079,1:00:17.919 the entire set of features, you can terminate this. 1:00:17.919,1:00:19.309 Or if, 1:00:19.309,1:00:22.039 by some rule of thumb, you know that you 1:00:22.039,1:00:23.769 probably don't ever want more than 1:00:23.769,1:00:26.729 k features, you can also terminate 1:00:26.729,1:00:30.239 this if f has already exceeded some threshold number of features. So maybe 1:00:30.239,1:00:30.979 if 1:00:30.979,1:00:34.139 you have 100 training examples, and you're fitting logistic regression, 1:00:34.139,1:00:39.359 you probably know you won't want more than 100 features. And so 1:00:39.359,1:00:41.229 you stop after you have 1:00:41.229,1:00:45.709 100 features added to set f; okay? 1:00:45.709,1:00:50.029 And then finally, having done this, output the best hypothesis found; again, by 1:00:50.029,1:00:51.669 best, I mean, 1:00:51.669,1:00:55.459 when running this algorithm, you'd be seeing lots of hypotheses. You'd be training lots of 1:00:55.459,1:00:58.419 hypotheses, and testing them using cross validation. 1:00:58.419,1:01:01.659 So when I say output the best hypothesis found, I mean 1:01:01.659,1:01:04.650 of all of the hypotheses you've seen during this entire procedure, 1:01:04.650,1:01:07.149 pick the one with the lowest 1:01:07.149,1:01:10.309 cross validation error that you saw; okay? 1:01:10.309,1:01:17.309 So that's forward selection. So 1:01:35.239,1:01:39.599 let's see, just to give this a name, this is an instance of what's called 1:01:39.599,1:01:43.189 1:01:43.189,1:01:46.439 wrapper feature 1:01:46.439,1:01:47.879 selection. 1:01:47.879,1:01:50.579 And the term wrapper comes 1:01:50.579,1:01:52.049 from the fact that 1:01:52.049,1:01:55.489 this feature selection algorithm that I just described, forward selection or forward 1:01:55.489,1:01:56.819 search, 1:01:56.819,1:01:58.680 is a piece of software that 1:01:58.680,1:02:02.149 you write that wraps around your learning algorithm, in the sense that 1:02:02.149,1:02:04.249 to perform forward selection, 1:02:04.249,1:02:07.579 you need to repeatedly make calls to your learning algorithm 1:02:07.579,1:02:08.829 to train 1:02:08.829,1:02:11.789 your model 1:02:11.789,1:02:15.289 using different subsets of features; okay? So this is called wrapper model 1:02:15.289,1:02:17.319 feature selection.
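Here is a minimal sketch of the forward search procedure just described, written as a wrapper around a learning algorithm. The function name, the use of scikit-learn's cross_val_score with logistic regression, the five folds, and the synthetic data are all illustrative assumptions, not part of the lecture.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=None):
    n = X.shape[1]
    selected = []                  # the set script f, initially empty
    best_overall = (-np.inf, [])   # best cross validation score seen, and the subset that achieved it
    while len(selected) < (max_features or n):
        candidates = [j for j in range(n) if j not in selected]
        if not candidates:
            break
        # Try adding each remaining feature and evaluate the model with cross validation.
        scores = {}
        for j in candidates:
            cols = selected + [j]
            scores[j] = cross_val_score(LogisticRegression(max_iter=1000),
                                        X[:, cols], y, cv=5).mean()
        best_j = max(scores, key=scores.get)
        selected.append(best_j)    # f := f union {best single feature found}
        if scores[best_j] > best_overall[0]:
            best_overall = (scores[best_j], list(selected))
    # Output the best feature subset (hypothesis) seen during the entire search.
    return best_overall

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)   # only features 0 and 3 actually matter here
score, subset = forward_select(X, y, max_features=4)
print("selected features:", subset, "cross validation accuracy:", round(score, 3))

Note that every iteration retrains the learner once per candidate feature, which is exactly the computational cost discussed next.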
1:02:17.319,1:02:20.509 And it tends to be somewhat computationally expensive because as you're 1:02:20.509,1:02:22.529 performing the search process, 1:02:22.529,1:02:26.029 you're repeatedly training your learning algorithm over and over and over on all of 1:02:26.029,1:02:30.329 these different subsets of features. Let's 1:02:30.329,1:02:37.329 just mention also, there is a variation of this called backward search or 1:02:38.899,1:02:40.829 backward selection, 1:02:40.829,1:02:43.709 which is where you start with 1:02:43.709,1:02:46.989 f equals 1:02:46.989,1:02:53.989 the entire set of features, and you delete features one at a time; 1:02:59.469,1:03:01.369 okay? 1:03:01.369,1:03:03.979 1:03:03.979,1:03:06.759 So that's backward search or backward selection. 1:03:06.759,1:03:08.079 And 1:03:08.079,1:03:13.369 this is another feature selection algorithm that you might use. Part of 1:03:13.369,1:03:15.400 whether this makes sense is 1:03:15.400,1:03:17.409 really that there will be problems where 1:03:17.409,1:03:21.319 it really doesn't even make sense to initialize f to be the set of all features. 1:03:21.319,1:03:23.859 So if you have 100 training examples 1:03:23.859,1:03:25.569 and 10,000 features, 1:03:25.569,1:03:28.829 which may well happen 1:03:28.829,1:03:30.759 100 emails and 1:03:30.759,1:03:35.289 10,000 features per email, with 100 training examples then, 1:03:35.289,1:03:38.279 depending on the learning algorithm you're using, it may or may not make sense to 1:03:38.279,1:03:38.989 initialize 1:03:38.989,1:03:42.789 the set f to be all features, and train a model using all the features. And if it 1:03:42.789,1:03:46.130 doesn't make sense to train a model using all the features, then 1:03:46.130,1:03:51.529 forward selection would be more common. 1:03:51.529,1:03:53.609 So 1:03:53.609,1:03:56.960 let's see. Wrapper model feature selection algorithms tend to work 1:03:56.960,1:03:58.459 well. 1:03:58.459,1:04:02.179 And in particular, they actually often work better than a different class of algorithms I'm gonna talk 1:04:02.179,1:04:05.789 about now. But their main disadvantage is that they're computationally very 1:04:05.789,1:04:08.709 expensive. 1:04:08.709,1:04:15.709 Do you have any questions about this 1:04:16.289,1:04:19.109 before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes, 1:04:19.109,1:04:23.459 you're actually right. So forward search and backward search, both of these are search heuristics, 1:04:23.459,1:04:24.360 and 1:04:24.360,1:04:28.090 for either of these you cannot guarantee they'll find the best 1:04:28.090,1:04:29.119 subset of features. It 1:04:29.119,1:04:30.729 actually turns out that 1:04:30.729,1:04:34.999 under many formalizations of the feature selection problem, it actually turns out to be an NP-hard 1:04:34.999,1:04:40.029 problem to find the best subset of features. 1:04:40.029,1:04:43.999 But in practice, forward selection and backward selection work fine, 1:04:43.999,1:04:47.170 and you can also envision other search algorithms where you sort of 1:04:47.170,1:04:50.489 have other methods to search through the space of two to the n possible feature 1:04:50.489,1:04:57.489 subsets. So 1:05:01.079,1:05:03.700 let's see. 1:05:03.700,1:05:06.339 Wrapper feature selection 1:05:06.339,1:05:09.829 tends to work well when you can afford to do it 1:05:09.829,1:05:11.539 computationally.
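For completeness, here is the corresponding sketch of backward search, under the same illustrative assumptions as the forward selection sketch above (scikit-learn, logistic regression, five-fold cross validation); it would be called the same way as forward_select.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_select(X, y, min_features=1):
    selected = list(range(X.shape[1]))   # start with f = the entire set of features
    best_overall = (cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean(),
                    list(selected))
    while len(selected) > min_features:
        # Try deleting each remaining feature; keep the deletion whose model cross validates best.
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, [c for c in selected if c != j]], y, cv=5).mean()
                  for j in selected}
        drop = max(scores, key=scores.get)
        selected.remove(drop)
        if scores[drop] > best_overall[0]:
            best_overall = (scores[drop], list(selected))
    # Return the best cross validation score seen and the feature subset that achieved it.
    return best_overall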
1:05:11.539,1:05:13.800 But for problems 1:05:13.800,1:05:18.279 such as text classification it turns out for text classification specifically, 1:05:18.279,1:05:21.949 because you have so many features, and easily have 50,000 features 1:05:21.949,1:05:25.789 forward selection would be very, very expensive. 1:05:25.789,1:05:28.240 So there's a different class of algorithms 1:05:28.240,1:05:31.189 that 1:05:31.189,1:05:35.319 tends not to do as well in the sense of generalization error, so you tend to learn a 1:05:35.319,1:05:37.329 hypothesis that works less well, 1:05:37.329,1:05:40.289 but is computationally much less expensive. 1:05:40.289,1:05:44.189 And these are called 1:05:44.189,1:05:46.669 the 1:05:46.669,1:05:50.279 filter feature selection methods. 1:05:50.279,1:05:53.119 And the basic idea is 1:05:53.119,1:05:53.679 1:05:53.679,1:06:00.539 that for each feature i, we will 1:06:00.539,1:06:03.939 compute 1:06:03.939,1:06:10.539 some measure 1:06:10.539,1:06:17.219 of how informative 1:06:17.219,1:06:19.909 xi 1:06:19.909,1:06:25.479 is about y; okay? 1:06:25.479,1:06:28.779 And to do this, we'll use some simple 1:06:28.779,1:06:33.489 heuristics; for every feature we'll just try to compute some rough estimate or 1:06:33.489,1:06:40.489 compute 1:06:41.279,1:06:48.279 some measure of how informative 1:06:50.179,1:06:54.489 xi is about 1:06:54.489,1:06:59.139 y. So there are many ways you can do this. One way you can choose is to just compute the 1:06:59.139,1:07:01.279 correlation between xi and y. And just for each of 1:07:01.279,1:07:04.999 your features just see how correlated this is with 1:07:04.999,1:07:06.719 your class label y. 1:07:06.719,1:07:12.809 And then you just pick the top k most correlated features. 1:07:12.809,1:07:19.809 Another way 1:07:46.429,1:07:51.549 to do this, 1:07:51.549,1:07:54.930 for the case of text classification: there's one other 1:07:54.930,1:07:56.779 informativeness measure which, 1:07:56.779,1:07:58.749 especially for these kinds of features, I guess, 1:07:58.749,1:08:00.709 is used very commonly, 1:08:00.709,1:08:04.269 and it's called mutual information. I'm going to 1:08:04.269,1:08:11.229 let you 1:08:11.229,1:08:15.399 play with some of these ideas in the problem sets, but I'll 1:08:15.399,1:08:17.900 just say this very briefly. 1:08:17.900,1:08:21.599 So the mutual information between feature xi and y 1:08:21.599,1:08:27.670 I'll just write out the definition, I guess. 1:08:27.670,1:08:32.229 Let's say this is text classification, so x can take on two values, 0, 1; 1:08:32.229,1:08:36.179 the mutual information between xi and y is defined as a sum over all possible values of 1:08:36.179,1:08:37.359 x, 1:08:37.359,1:08:40.929 and a sum over all possible values of 1:08:40.929,1:08:45.609 y, of the joint distribution p of xi, y, 1:08:45.609,1:08:52.049 times log of p of xi, y over p of xi 1:08:52.049,1:08:53.089 times p of y. 1:08:53.089,1:08:54.379 Where 1:08:59.569,1:09:04.039 all of these distributions so the joint distribution over xi and y 1:09:04.039,1:09:11.039 you would estimate from your training data, 1:09:12.559,1:09:14.220 and all of the marginals you would estimate 1:09:14.220,1:09:15.259 from 1:09:15.259,1:09:20.159 the training data as well: what is the probability that x is 0, what's the probability that x is one, what's the 1:09:20.159,1:09:27.159 probability that x is 0 and y is 0, x is one and y is 0, and so on.
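The definition above translates directly into a plug-in estimate from counts. The sketch below is illustrative: it assumes a binary feature and binary label, and the add-one (Laplace) smoothing is an extra assumption added only to avoid taking the log of zero; it is not part of the definition given in the lecture.

import numpy as np

def mutual_information(xi, y):
    # Plug-in estimate of MI(xi; y) for binary xi and y, computed from empirical counts.
    m = len(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = (np.sum((xi == a) & (y == b)) + 1) / (m + 4)   # smoothed joint p(xi=a, y=b)
            p_a = (np.sum(xi == a) + 2) / (m + 4)                 # smoothed marginal p(xi=a)
            p_b = (np.sum(y == b) + 2) / (m + 4)                  # smoothed marginal p(y=b)
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# A feature that tracks the label has high mutual information; an unrelated one is near zero.
y  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x1 = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # mostly agrees with y
x2 = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # independent of y
print(mutual_information(x1, y), mutual_information(x2, y))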
So it 1:09:27.849,1:09:31.190 turns out 1:09:31.190,1:09:35.000 there's a standard information theoretic measure of how different 1:09:35.000,1:09:36.750 two probability distributions are. 1:09:36.750,1:09:39.989 And I'm not gonna prove this here, but it turns out that 1:09:39.989,1:09:45.609 this mutual information is 1:09:45.609,1:09:51.749 actually an instance of 1:09:51.749,1:09:54.309 the standard 1:09:54.309,1:09:58.099 measure of how different distributions are, called the K-L divergence. 1:09:58.099,1:10:00.359 If you've taken a class in information theory, 1:10:00.359,1:10:03.419 you will have seen the concepts of mutual information and the K-L divergence, but if you 1:10:03.419,1:10:06.019 haven't, don't worry about it. Just 1:10:06.019,1:10:09.239 the intuition is there's something called K-L divergence that's a formal measure 1:10:09.239,1:10:12.949 of how different two probability distributions 1:10:12.949,1:10:16.609 are. And mutual information is a measure of how different 1:10:16.609,1:10:19.829 the joint distribution is 1:10:19.829,1:10:21.569 of x and y 1:10:21.569,1:10:23.400 from the distribution 1:10:23.400,1:10:26.159 you would get if you were to assume they were independent; 1:10:26.159,1:10:27.210 okay? So 1:10:27.210,1:10:29.449 if x and y were independent, then 1:10:29.449,1:10:33.979 p of x, y would be equal to p of x times p of y. 1:10:33.979,1:10:37.340 And so you know, this distribution and this distribution would be identical, and the 1:10:37.340,1:10:40.369 K-L divergence would be 0. 1:10:40.369,1:10:43.579 In contrast, if x and y were very non-independent, in 1:10:43.579,1:10:46.900 other words, if x and y are very informative about each other, 1:10:46.900,1:10:48.449 then this K-L divergence 1:10:48.449,1:10:49.719 will be large. 1:10:49.719,1:10:52.379 And so mutual information is a formal measure of 1:10:52.379,1:10:55.460 how non-independent x and y are. 1:10:55.460,1:10:58.299 And if x and y are highly non-independent, 1:10:58.299,1:10:59.260 then that means that 1:10:59.260,1:11:02.209 x will presumably tell you something about y, 1:11:02.209,1:11:05.989 and so they'll have large mutual information. And 1:11:05.989,1:11:09.519 a large mutual information will tell you that xi might be a good feature. And you 1:11:09.519,1:11:11.989 get to play with some of these ideas 1:11:11.989,1:11:15.309 more in the problem sets. So I won't say much more about it. 1:11:16.579,1:11:19.420 And what you do then is, having chosen 1:11:19.420,1:11:23.780 some measure like correlation or mutual information or something else, you 1:11:23.780,1:11:28.889 then pick the top 1:11:28.889,1:11:33.659 k features; meaning that you compute the correlation between xi and y for all the 1:11:33.659,1:11:36.979 features, or the mutual information between xi and y for all the features. 1:11:36.979,1:11:40.080 And then you include in your learning algorithm the 1:11:40.080,1:11:43.369 k features with the largest correlation with the label or 1:11:43.369,1:11:46.489 the largest mutual information with the label, whatever. 1:11:46.489,1:11:51.559 And to choose 1:11:51.559,1:11:54.809 k, 1:11:54.809,1:11:57.799 you can actually use cross validation, as well; okay? 1:11:57.799,1:11:59.360 So you would 1:11:59.360,1:12:03.059 take all your features, and sort them in decreasing order of mutual 1:12:03.059,1:12:04.109 information. 1:12:04.109,1:12:07.529 And then you'd try using just the top one feature, the top two features, the top 1:12:07.529,1:12:09.559 three features, and so on.
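Putting the pieces together, here is a sketch of filter feature selection: score every feature, sort the features in decreasing order of the score, then use cross validation to decide how many of the top ranked features to keep. It reuses the hypothetical mutual_information helper from the previous sketch (so it assumes binary features), and the scikit-learn calls and five folds are again illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def filter_select(X, y, score_fn):
    n = X.shape[1]
    # Rank feature indices in decreasing order of the per-feature informativeness score.
    ranked = sorted(range(n), key=lambda j: score_fn(X[:, j], y), reverse=True)
    best_cv, best_k = -np.inf, 1
    for k in range(1, n + 1):
        # Try the top 1 feature, the top 2 features, the top 3 features, and so on.
        cv = cross_val_score(LogisticRegression(max_iter=1000),
                             X[:, ranked[:k]], y, cv=5).mean()
        if cv > best_cv:
            best_cv, best_k = cv, k
    return ranked[:best_k], best_cv

# Usage (assuming binary features and the mutual_information helper defined earlier):
# chosen, cv_accuracy = filter_select(X, y, mutual_information)

The scoring pass costs only one cheap statistic per feature, which is why filter methods scale to the 50,000-feature text problems where wrapper search is impractical.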
1:12:09.559,1:12:13.869 And you decide how many features to include 1:12:13.869,1:12:17.279 using cross validation; okay? Or 1:12:17.279,1:12:24.279 sometimes you can just choose this by hand, as well. Okay. Questions about 1:12:24.279,1:12:31.279 this? Okay. 1:12:34.559,1:12:36.479 Cool. Great. So 1:12:36.479,1:12:40.629 next lecture I'll continue I'll wrap up the Bayesian model selection but let's close 1:12:40.629,1:12:40.879 here for today.