This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of the bias-variance tradeoff and of underfitting and overfitting that we've been seeing in the previous lecture and that we'll see in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class — if H is a set of k hypotheses — and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation, just dropping constants, I can also write this as that; okay?

So just to quickly remind you of what all of the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class of script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And so this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h [inaudible] misclassifying a new example drawn from the same distribution as the training set.
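For reference — the formulas referred to here as "this" and "that" were written on the board and aren't captured in the transcript. The following is the standard statement of the result being recapped, matching the quantities read out below:

```latex
\text{With probability at least } 1-\delta:\qquad
\varepsilon(\hat{h}) \;\le\; \Big(\min_{h\in\mathcal{H}} \varepsilon(h)\Big) + 2\gamma,
\quad\text{provided}\quad
m \;\ge\; \frac{1}{2\gamma^{2}}\log\frac{2k}{\delta}
\;=\; O\!\Big(\frac{1}{\gamma^{2}}\log\frac{k}{\delta}\Big).
```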
And so this says that in order to guarantee that the generalization error of the hypothesis h [inaudible] output by empirical risk minimization is less than or equal to the best possible generalization error attainable in your hypothesis class, plus two times gamma — two times this error threshold — and to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this; okay? One over two gamma squared, log two k over delta, where again, k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on something — on the error; okay? So this is a sample complexity result.

So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. So here, we said that the set script H is just k specific functions, whereas you often want to use a model like logistic regression, which is actually parameterized by real numbers. So I'm actually first going to give an argument that's sort of formally broken — just technically somewhat broken — but conveys useful intuition. And then I'll give the more correct argument, but without proving it, as the full proof is somewhat involved. So here's a somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression.

So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script H is parameterized by d real numbers; okay? So for example, if you're applying logistic regression over [inaudible], then d would be n plus one — logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

When you think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And so if you use, like, a normal standard computer, it normally will represent real numbers by what's called double-precision floating point numbers.
And what that means is that each real number is represented by a 64-bit representation; right? So really — you know what floating point is in a computer — a 64-bit floating point number is what almost all of us use routinely. And so this, parameterized by d real numbers, is really as if it's parameterized by 64 times d bits. Computers can't represent real numbers exactly; they can only represent these bit strings. And so the size of your hypothesis class, in your computer representation — you have 64 times d bits that you can flip. And so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits.

And so this is why it's important that we had log k there; right? So k is, therefore, two to the 64 d. And if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared, log of two to the 64 d over delta — which is that; okay?

So just to be clear, this is the same sort of sample complexity question as we had before: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, you know, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this. In other words, your number of training examples has to be on the order of d over gamma squared, log one over delta; okay? And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class — that m is [inaudible] on the order of something linear [inaudible]. That intuition is actually roughly right.
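Written out — a reconstruction of the board formula, obtained by plugging k equal to two to the 64 d into the finite-class bound above:

```latex
m \;\ge\; O\!\Big(\frac{1}{\gamma^{2}}\log\frac{2^{64d}}{\delta}\Big)
  \;=\; O\!\Big(\frac{d}{\gamma^{2}}\log\frac{1}{\delta}\Big)
  \;=\; O_{\gamma,\delta}(d).
```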
I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers. So let me actually go ahead and outline the "right way" to show this more formally; all right? And it turns out the "right way" to show this more formally involves a much longer argument — the proof is extremely involved — so I'm just actually going to state the result and not prove it. Full proofs can be found in sources on learning theory bounds for infinite hypothesis classes.

Here's the definition: given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay? And what I mean by realizing any labeling on it — the informal way of thinking about this is: if a hypothesis class shatters the set S, what that means is that I can take these d points, and I can associate these d points with any set of labels y; right? So choose any set of labels y for each of these d points. And if your hypothesis class shatters S, then that means there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means.

So let me just illustrate this with an example. Let's say H is the class of all linear classifiers in 2D, and let's say that S is this [inaudible] comprising two points; okay? So there are four possible labelings for these two points: you can choose to label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative. And if the hypothesis class H is the class of all linear classifiers in 2D, then for each of these training sets I can find a linear classifier that attains zero training error on each of these — on all possible labelings of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?
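As a quick aside (this sketch is not from the lecture): one way to check shattering by brute force is to enumerate every labeling of the points and test each for linear separability, for example with a linear-programming feasibility check. The use of numpy/scipy here is just an assumption made for illustration.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """True if labels y in {-1, +1} on points X can be realized by some linear
    classifier sign(w.x + b). Feasibility LP: y_i * (w.x_i + b) >= 1 for all i."""
    m, n = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])   # rows: -y_i * [x_i, 1]
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.success

def shattered_by_linear_classifiers(X):
    """True if every possible labeling of the points in X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=X.shape[0]))

# A generic pair (or triple) of points in 2D is shattered; this square of four is not.
print(shattered_by_linear_classifiers(np.array([[0.0, 0.0], [1.0, 1.0]])))            # True
print(shattered_by_linear_classifiers(np.array([[0.0, 0], [1, 0], [0, 1], [1, 1]])))  # False
```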
One more example — let me show you a larger one. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And so for these three points, I have eight possible labelings, and once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again, I'd say, by definition, that my hypothesis class also shatters this set S. Student: Right. Instructor (Andrew Ng): And that's that terminology — H can realize any labeling on S. That's obviously [inaudible]. Give it any set of labels, and you can find a hypothesis that perfectly separates the positive and negative examples; okay?

So how about this set? Suppose S is now this set of four points. Then, you know, there are lots of labelings — there are now 16 labelings we can choose for this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings. But there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out that you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?

So here's another definition. It's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC of script H, is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.

So just as a kind of good example: if H is the class of all linear classifiers in 2D, then the VC dimension of this class is equal to three, because we saw just now that there was a set S of size three that it could shatter — and I won't really prove it,
but it turns out there is no set of size four that it can shatter. And therefore, the VC dimension of this class is three. Yeah? Student: But there are sets of size three that it cannot shatter, right? [Inaudible] was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this — it's actually a set S — then there are labelings on this it cannot realize, and so H cannot shatter this set. But that's okay, because there definitely exists some other set of size three that is shattered. So the VC dimension is three. And there is no set of size four that can be shattered. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition, no. Right. So again, let's see, I can choose my set S to be a set of three points that are all overlapping — three points in exactly the same place. And clearly, I can't shatter this set, but that's okay. And I can't shatter this set, either, but that's okay, because there are some other sets of size three that I can shatter.

And it turns out this result holds true more generally: in n dimensions, the VC dimension of the class of linear classifiers is equal to n plus one. Okay? So this is in [inaudible] — if you have linear classifiers over an n-dimensional feature space, the VC dimension is equal to n plus one.

So maybe you want to write this down — what is arguably the best-known result in all of learning theory, I guess — which is that: let a hypothesis class be given, and let the VC dimension of H be equal to d. Then, with probability at least one minus delta, we have the following. The formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later.
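For reference, the formula being pointed at on the board, in the usual statement of this theorem, is:

```latex
\text{For all } h \in \mathcal{H}:\qquad
\big|\varepsilon(h) - \hat{\varepsilon}(h)\big|
\;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right),
\qquad\text{and hence}\qquad
\varepsilon(\hat{h}) \;\le\; \varepsilon(h^{*})
\;+\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right).
```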
But the key to this result is that if you have a hypothesis class with VC dimension d — and now this can be an infinite hypothesis class — what Vapnik and Chervonenkis show is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that for all hypotheses h — for all the hypotheses in your hypothesis class — the generalization error of h minus the training error of h, the difference between these two things, is bounded above by some complicated formula like this; okay? And thus, with probability at least one minus delta, we also have the same sort of thing; okay?

And going from this step to this step is actually something that you saw yourself — that we actually proved earlier. Because, you remember, in the previous lecture we proved that if you have uniform convergence, then that implies — actually, we showed that if generalization error and training error are close to each other, within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really: generalization error of h [inaudible] best possible generalization error plus two times gamma. And there are just two constants in front here that I've absorbed into the big-O notation.

So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that in order to guarantee that this holds with probability at least one minus delta — with probability at least one minus delta, I should say — it suffices that, and I'm going to write it this way: m equals big-O of d, and I'm going to put gamma and delta in as a subscript there to denote that.
Let's see — if we treat gamma and delta as constants, so that I'm allowed to absorb terms that depend on gamma and delta into the big-O notation, then in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay?

So let's see. What we conclude from this is that for empirical risk minimization algorithms — in other words, less formally, for learning algorithms that try to minimize training error — the intuition to take away is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?

It turns out that for most reasonable hypothesis classes, the VC dimension is sort of very similar, I guess, to the number of parameters of your model. So for example, take logistic regression for linear classification: in n dimensions, logistic regression has n plus one parameters, and the VC dimension of the class of linear classifiers is also n plus one. So it turns out that for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or at worst some low-order polynomial in the number of parameters of your model. And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be, let's say, roughly linear in the number of parameters of your model; okay?

There are some somewhat strange examples where what I just said is not true — some strange examples where you have very few parameters, but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate — somewhat unusual, and not the sort of learning algorithms you usually use.

Let's see, just a couple of other things.
It turns out that — so this result shows that sample complexity is upper bounded by the VC dimension. But it also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, say, then if the number of training examples you have is less than on the order of the VC dimension, it is not possible to prove this bound. So I guess, in the worst case, sample complexity — the number of training examples you need — is both upper bounded and lower bounded by the VC dimension.

Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness — like, finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng): Let's see — the proof does not, no. I've actually stated the entirety of the theorem; this is true. It turns out, in the proof — well, somewhere in the proof there's a construction called an epsilon net, which is a very clever [inaudible]. But regardless, it is not an assumption that you need. In some way, the proof has one step that uses a very clever [inaudible] to prove this. But that's not needed as an assumption.

I'd like to say: back when I was a Ph.D. student, when I was working through this proof, there was sort of a solid week where I would wake up and go to the office at 9:00 a.m.; then I'd start reading the book that led up to this proof, and I'd read from 9:00 a.m. to 6:00 p.m.; and then I'd go home, and the next day I'd pick up where I left off. And it sort of took me a whole week that way to understand this proof, so I thought I would inflict that on you.

Just to tie up a couple of loose ends: what I'm about to do is mention a few things that will maybe feel a little bit like random facts, but I'm just going to tie up a couple of loose ends.
And so, let's see — it turns out that, just so it registers more strongly with you: this bound was proved for an algorithm that uses empirical risk minimization — for an algorithm that minimizes 0-1 training error. So one question that some of you asked is: how about support vector machines; right? How come SVMs don't overfit? Recall, in our discussion of support vector machines, we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: n plus one, and n is infinite.

So it turns out that the class of linear separators with large margin actually has low VC dimension. I want to say this very quickly and informally — it's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin [inaudible]. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin — say, with a margin of at least gamma; okay? And so I won't allow a line that comes too close — like, I won't allow that line, because it comes too close to one of my points.

It turns out that if I consider my data points, which all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, plus one; okay? So this funny symbol here just means rounding up — this is a ceiling symbol; it means rounding up x. And it turns out you can prove — and there are some strange things about this result that I'm deliberately not going to talk about — but it turns out they can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded.
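Written out, the bound just described (with the ceiling applied to the first term, as read out above):

```latex
\mathrm{VC}(\mathcal{H}_{\gamma})
\;\le\; \left\lceil \frac{R^{2}}{4\gamma^{2}} \right\rceil + 1,
\qquad \mathcal{H}_{\gamma} = \{\text{linear separators with margin at least } \gamma
\text{ on points lying in a ball of radius } R\}.
```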
The surprising thing about this is that this is a bound on VC dimension that has no dependence on the dimension of the points x. So in other words, your data points x can live in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large-margin separator — in trying to find the line that separates your positive and your negative examples with a large margin — it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension. And therefore, it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, suppose you have infinite — actually, these are countably infinite-dimensional vectors, not [inaudible] infinite-dimensional vectors. The norm of x squared is equal to the sum over i of xi squared, so if x is infinite-dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again. [Inaudible]. Yes — although I assume that this is bounded by R. Oh, yeah — so this ensures that it converges. So that's just something people sometimes wonder about.

And lastly, let me actually tie empirical risk minimization back a little more strongly to the sort of algorithms we've talked about. It turns out that — so the theory so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function — you know, it's zero here, jumps to one, and it looks like that. And so, for this one training example, this may be the indicator of h [inaudible] of x not equal to y, plotted against z, where z equals theta transpose x; okay?
But for one training example — your training example will be positive or negative, and depending on what the value of this theta transpose x is, you either get it right or wrong. And so, you know, I guess if you have a positive example, then when z is positive, you get it right. Suppose you have a negative example, so y equals 0; right? Then if z, which is theta transpose x — if this is positive, then you will get this example wrong; whereas if z is negative, then you'd get this example right. And so this is a plot of the indicator of h subscript [inaudible] of x not equal to y — where, you know, h of x is equal to g of theta transpose x; okay?

And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on your training example, and so you'd like this indicator — h of x not equal to y — you'd like this indicator function to be 0. It turns out this step function is clearly a non-convex function, and so it turns out that, even just for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem.

And in particular — let me draw a function like that — it turns out that logistic regression is trying to maximize likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out it'll be a function that looks like this. And this line that I just drew — you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, since you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where, instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve lying above it.
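As a small numeric sketch (not from the lecture, and assuming the usual convention of plotting the per-example losses against z = theta transpose x for a positive example): the 0-1 step function, the logistic loss that logistic regression effectively minimizes, and the hinge loss used by the SVM, which comes up next.

```python
import numpy as np
import matplotlib.pyplot as plt

# z = theta^T x for a positive (y = 1) training example.
z = np.linspace(-3, 3, 601)

zero_one = (z <= 0).astype(float)       # step function: 1{h(x) != y}
logistic = np.log(1.0 + np.exp(-z))     # minus log-likelihood for y = 1
hinge    = np.maximum(0.0, 1.0 - z)     # hinge loss used by the SVM

plt.plot(z, zero_one, label="0-1 loss (step)")
plt.plot(z, logistic, label="logistic loss")
plt.plot(z, hinge, label="hinge loss")
plt.xlabel(r"$z = \theta^T x$")
plt.ylabel("loss on one example")
plt.legend()
plt.show()
```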
So by approximating it this way, you have a convex optimization problem, and you can find the maximum-likelihood estimate of the parameters for logistic regression. And it turns out the support vector machine can also be viewed as approximating this function, too — only a little bit differently. Let's see — the support vector machine, it turns out, can be viewed as trying to approximate this step function with a different approximation that's sort of flat [inaudible] there, and then goes up as a linear function there. And that is called the hinge loss. And so you can think of logistic regression and the support vector machine as different approximations to try to minimize this step function; okay?

And that's why, I guess, all the theory we developed — even though SVMs and logistic regression aren't exactly doing empirical risk minimization — the theory we developed often gives completely appropriate intuitions for SVMs and logistic regression; okay. So that was the last of the loose ends. And if you didn't get this, don't worry too much about it; it's a high-level message. It's just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.

What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool.

Okay. So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a trade-off between bias and variance. And in particular, it is important not to choose a hypothesis that's either too simple or too complex. So if your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit — you have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, you have high variance; you would overfit the data, and you'd also fail to generalize well.
So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? So remember the cartoon I drew last time of generalization error? I drew this last time, where on the x-axis was model complexity, meaning the degree of the polynomial, the [inaudible] of the regression function, or whatever. And if you have too simple a model, you have high generalization error — that's underfitting. And if you have too complex a model, like a 14th or 15th-degree polynomial fit to five data points, then you also have high generalization error, and you're overfitting.

So what I want to do now is actually just talk about model selection in the abstract; all right? Some examples of model selection problems will include — well, I'll run with the example of, let's say, you're trying to choose the degree of a polynomial; right? What degree polynomial do you want to choose? Another example of a model selection problem would be if you're trying to choose the parameter [inaudible], which was the bandwidth parameter in locally weighted linear regression, or in some sort of locally weighted regression. Yet another model selection problem is if you're trying to choose the parameter C in the [inaudible] SVM; right? In the l1-norm soft margin SVM, we had this optimization objective; right? And the parameter C controls the tradeoff between how much you want a large margin versus how much you want to penalize misclassified [inaudible] examples. So these are three specific examples of model selection problems. And let's come up with a method for automatically choosing among them.

Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay?
Or you may also take the bandwidth parameter [inaudible] and discretize it into a range of values, and you're trying to choose among that discrete set of values. So let's talk about how you would select an appropriate model; all right?

Well, one thing you could do is: you can take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea — and why's that? Right. Cool. Because it would overfit; right. And some of you are laughing that I asked that. So that'd be a terrible idea — to choose a model by looking at your training error — because, well, obviously, you'd end up choosing the most complex model; right? You'd choose a 10th-degree polynomial because that's what fits the training set.

So when it comes to model selection on a training set, there are several standard procedures to do this. One is hold-out cross validation. In hold-out cross validation, we take the training set and we randomly split it into two subsets — take all the data you have and randomly split it into two subsets — and we'll call them the training subset and the hold-out cross validation subset. And then, you know, you train each model on just the training subset, and test it on your hold-out cross validation set. And you pick the model with the lowest error on the hold-out cross validation subset; okay?

So this is a relatively straightforward procedure, and it's commonly used: you train on 70 percent of the data, then test all of your models on the other 30 percent, and you can take whichever has the smallest hold-out cross validation error. And after this, you actually have a choice. Having taken all of these hypotheses trained on 70 percent of the data, you can actually just output the hypothesis that has the lowest error on your hold-out cross validation set. And optionally, you can actually take the model that you selected and go back and retrain it on all 100 percent of the data; okay? So both versions are actually done and used really often.
You can either, you know, just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can use this to, say, choose the degree of the polynomial you want to fit, and then go back and retrain the model on the entire 100 percent of your data. And both of these are commonly done.

So hold-out cross validation sort of works straightforwardly. But sometimes — we're often working with a company or an application or something — in many machine learning applications we have very little data, or, you know, every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired by medical experiments, and each training example represents a patient and some amount of physical human pain or something. So if you go and say, well, I'm going to hold out 30 percent of your data set just to select my model, that sometimes causes unhappiness, and so maybe you want to not have to leave out 30 percent of your data just to do model selection.

So there are a couple of other variations on hold-out cross validation that sometimes make slightly more efficient use of the data. One is called k-fold cross validation. And here's the idea: I'm going to take all of my data S — so imagine I'm going to draw this box S to denote the entirety of all the data I have — and I'll then divide it into k pieces; that's five pieces in what I've drawn. Then what I'll do is I will repeatedly train on k minus one pieces and test on the remaining one — test on the remaining piece, I guess; right? And then you average over the k results.

In other words, I will hold out, say, just 1/5 of my data, and I'll train on the remaining 4/5 and test on the first piece. Then I'll go and hold out the second 1/5 from my [inaudible], train on the remaining pieces, and test on this one. Then you remove the third piece, train on the remaining 4/5, and so on — I'm going to do this five times.
And then I'll take the five error measures I have and I'll average them. And this then gives me an estimate of the generalization error of my model; okay? And then, again, when you do k-fold cross validation, usually you then go back and retrain the model you selected on the entirety of your training set.

So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say, k equals 10 is the fairly common choice — to do 10-fold cross validation. And the advantage of this over hold-out cross validation is that you split the data into ten pieces, so each time you're only holding out 1/10 of your data, rather than, you know, say, 30 percent of your data. I should say, in standard hold-out — in simple hold-out cross validation — a 70-30 split is fairly common; sometimes, like, a 2/3-1/3 split. And if you use k-fold cross validation, k equals 5, or more commonly k equals 10, is the most common choice.

The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model, you now need to train your model ten times instead of just once, and so you need to run logistic regression ten times per model, rather than just once. And so this is computationally more expensive. But k equals ten works great.

And then, finally, there's actually a version of this that you can take even further, which is when you set k equals m. And so that's when you take your training set and you split it into as many pieces as you have training examples. And this procedure is called leave-one-out cross validation. And what you do is: you take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but your third example, test on the third example, and so on.
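As a rough sketch of the three procedures just described — hold-out, k-fold, and leave-one-out cross validation — applied to choosing the degree of a polynomial. The synthetic data, the scikit-learn calls, and the specific candidate degrees are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.model_selection import (train_test_split, KFold,
                                     LeaveOneOut, cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=40)  # roughly quadratic data

# Candidate models: polynomial fits of degree 1 through 7.
models = {d: make_pipeline(PolynomialFeatures(d), LinearRegression()) for d in range(1, 8)}

# Simple hold-out cross validation: 70/30 split, pick the lowest hold-out error.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_err = {d: np.mean((m.fit(X_tr, y_tr).predict(X_ho) - y_ho) ** 2)
               for d, m in models.items()}
best_d = min(holdout_err, key=holdout_err.get)
final_model = models[best_d].fit(X, y)   # optionally retrain on 100% of the data

# k-fold (k = 10) and leave-one-out estimates of generalization error per model.
kfold_err = {d: -cross_val_score(m, X, y, cv=KFold(10, shuffle=True, random_state=0),
                                 scoring="neg_mean_squared_error").mean()
             for d, m in models.items()}
loo_err = {d: -cross_val_score(m, X, y, cv=LeaveOneOut(),
                               scoring="neg_mean_squared_error").mean()
           for d, m in models.items()}
print(best_d, min(kfold_err, key=kfold_err.get), min(loo_err, key=loo_err.get))
```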
And so with this many pieces, you're now making maybe even more effective use of your data than with k-fold cross validation. But leave one out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and run your learning algorithm on the other m minus one training examples, and you need to do this many times. Typically this is done only when you're extremely data scarce. So if you have a learning problem where you have, say, 15 training examples, very few training examples, then leave one out cross validation may be preferred.

Yeah? Student: Last time you proved that the difference between the generalization error and the training error is bounded in terms of the number of examples in your training set and the VC dimension. So maybe instead of splitting the examples into different groups, we could compute the training error and use that bound to estimate the generalization error? Instructor (Andrew Ng): Yeah, that's done, but in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds, and so, there are people in structural risk minimization who propose what you describe, but I personally tend not to do that. Questions on cross validation? Yeah. Student: This is kind of unrelated, but we spent all this time on how many data points you need to get within a certain margin of the best error. It seems like you'd be able to use that instead of doing this; more analytically, I guess. Instructor (Andrew Ng): Okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're proving a worst case upper bound that holds true even for very bad distributions. The bounds I proved just now, right, hold true for absolutely any probability distribution over training examples; right?
We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any probability distribution script D. And chances are, whatever real life distribution you get over, say, houses and their prices, is probably not as bad as the very worst one you could have gotten; okay? So it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability; how many training examples do you need? If you plug actual constants into the learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large numbers, like needing 10,000 training examples to fit ten parameters.

So a good way to think of these learning theory bounds, and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just ignore the constant factors, is that the bounds tend to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on, but I personally tend to use the bounds for intuition: for example, does the number of training examples you need grow linearly in the number of parameters, or in the VC dimension, or does it grow quadratically in the number of parameters? It's the shape of the bounds, the fact that the sample complexity is linear in the VC dimension, that's the sort of useful intuition you can get from these theorems. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you're working on. So did that answer your question? Student: Uh-huh. Instructor (Andrew Ng): Yeah.
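For reference, the VC-dimension sample complexity bound being alluded to here has roughly the following shape once the constants are dropped (stated here from the standard result, with big-O hiding the constants):

$$
m \;=\; O\!\left(\frac{1}{\gamma^{2}}\left(d\,\log\frac{d}{\gamma} \;+\; \log\frac{1}{\delta}\right)\right)
$$

where d is the VC dimension of the hypothesis class, gamma is the error tolerance, and 1 - delta is the confidence. The useful part is the roughly linear dependence on d, not the magnitude of the constants.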
And it turns out, by the way, a rule of thumb that I often use myself is: if you're trying to fit a logistic regression model with n, or n plus one, parameters, and the number of training examples is ten times your number of parameters, then you're probably in good shape; with that many training examples relative to the number of parameters, you're probably perfectly fine fitting that model. So those are the sorts of intuitions you can get from these bounds. Student: In cross validation, do we split the examples randomly? Instructor (Andrew Ng): Yes. By convention we usually make the train/test splits randomly.

One more thing I want to talk about for model selection is a special case of model selection called the feature selection problem. Here's the intuition: for many machine learning problems you may have a very high dimensional feature space, very high dimensional feature vectors x. For example, for text classification, and I want to use that text classification example of spam versus non-spam, you may easily have on the order of 30,000 or 50,000 features; I think I used 50,000 in my earlier examples. If you have that many features, then depending on what learning algorithm you use there may be a real risk of over fitting. And so if you can reduce the number of features, maybe you can reduce the variance of your learning algorithm and reduce the risk of over fitting. For the specific case of text classification, you can imagine that there's only a small number of truly relevant features. There are all these English words, and many of them probably don't tell you anything at all about whether the email is spam or non-spam: English function words like the, of, a, and are probably words that tell you nothing about whether an email is spam or non-spam. In contrast, there will be a much smaller number of features that are truly relevant to the learning problem. For example, if you see the word buy or Viagra, those are words that are very useful for telling spam from non-spam.
Or you see the word Stanford or machine learning or your own personal name; these are other words that are useful for telling you whether something is spam or non-spam. So in feature selection, we would like to select a subset of the features, hopefully the most relevant ones for a specific learning problem, so as to give ourselves a simpler hypothesis class to choose from, and therefore reduce the risk of over fitting, even when we may have had 50,000 features originally.

So how do you do this? Well, if you have n features, then there are two to the n possible subsets; right? Because each of your n features can either be included or excluded, so there are two to the n possibilities, and this is a huge space. So in feature selection, what we most commonly do is use various search heuristics, sort of simple search algorithms, to try to search through this space of two to the n possible subsets of features and find a good subset, since it's far too large a space to enumerate all possible feature subsets.

As a concrete example, here is the forward search algorithm; it's also called the forward selection algorithm. It's actually pretty simple, but I'll write it out; writing it out will make it look more complicated than it really is. It starts with: initialize the set script F to be the empty set. Then repeat: for i equals one to n, try adding feature i to the set script F and evaluate the model using cross validation; and by cross validation I mean any of the three flavors, be it simple hold out cross validation, k-fold cross validation, or leave one out cross validation. And then set F to be equal to F union the best feature found; okay?
And finally, you would output the best hypothesis found; okay? So the forward selection procedure is: start with the empty set of features. Then, on each iteration, take each of your features that isn't already in your set script F and try adding that feature to your set; you train and evaluate each of these candidates using cross validation, and figure out what is the best single feature to add to your set script F. In step two here, you go ahead and add that feature to your set script F. And when I say best feature or best model here, by best I really mean the best model according to cross validation: the single feature addition that results in the lowest hold out cross validation error, or the lowest cross validation error generally. So you do this adding one feature at a time.

As for when you terminate this: if you've added all the features to F, so F is now the entire set of features, you can terminate. Or, if by some rule of thumb you know you probably don't ever want more than k features, you can also terminate once F exceeds some threshold number of features. So maybe if you have 100 training examples and you're fitting logistic regression, you probably know you won't want more than 100 features, and so you stop after 100 features have been added to the set F; okay?

And then finally, having done this, output the best hypothesis found. Again, by best I mean: while running this algorithm you'll be training lots of hypotheses and testing them using cross validation; so when I say output the best hypothesis found, I mean of all the hypotheses you've seen during this entire procedure, pick the one with the lowest cross validation error that you saw; okay? So that's forward selection.
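As one possible sketch of forward selection in Python, where cv_error_fn is a placeholder that trains your model on a given feature subset and returns its cross validation error, using any of the three flavors of cross validation:

```python
def forward_selection(n_features, cv_error_fn, max_features=None):
    """Greedy forward search over feature subsets.

    cv_error_fn(features) -> cross validation error of the model trained
    using only the feature indices in `features` (a Python set).
    Returns the best feature subset seen anywhere during the search.
    """
    selected = set()                              # script F starts as the empty set
    best_err, best_subset = float("inf"), set()
    limit = max_features if max_features is not None else n_features
    while len(selected) < limit:
        # try adding each feature not already in F, evaluating by cross validation
        candidates = [(cv_error_fn(selected | {i}), i)
                      for i in range(n_features) if i not in selected]
        if not candidates:
            break
        err, best_i = min(candidates)             # best single feature to add
        selected.add(best_i)                      # F <- F union {that feature}
        if err < best_err:
            best_err, best_subset = err, set(selected)
    return best_subset
```

Note that every call to cv_error_fn retrains the model, which is what makes this kind of search computationally expensive.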
Just to give this a name, this is an instance of what's called wrapper feature selection. The term wrapper comes from the fact that this feature selection algorithm I just described, forward selection or forward search, is a piece of software you write that wraps around your learning algorithm, in the sense that to perform forward selection you need to repeatedly make calls to your learning algorithm to train your model using different subsets of features; okay? So this is called wrapper model feature selection, and it tends to be somewhat computationally expensive, because as you're performing the search process you're repeatedly training your learning algorithm over and over on all of these different subsets of features.

Let me also mention that there's a variation of this called backward search or backward selection, which is where you start with F equal to the entire set of features, and you delete features one at a time; okay? So that's backward search or backward selection, another feature selection algorithm you might use. Part of whether this makes sense is that there will be problems where it doesn't even make sense to initialize F to be the set of all features. If you have 100 training examples and 10,000 features, which may well happen, say 100 emails with 10,000 features per email, then depending on the learning algorithm you're using it may or may not make sense to initialize the set F to be all features and train a model using all of them. And if it doesn't make sense to train using all the features, then forward selection would be more common.

So, wrapper model feature selection algorithms tend to work well; in particular, they often work better than a different class of algorithms I'm going to talk about now. But their main disadvantage is that they're computationally very expensive. Do you have any questions about this before I talk about the other class? Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Yeah, yes, you're actually right.
So forward search and backward search, both of these are search heuristics, and for either of them you cannot guarantee they'll find the best subset of features. It actually turns out that under many formalizations of the feature selection problem, finding the best subset of features is an NP-hard problem. But in practice, forward selection and backward selection work fine, and you can also envision other search algorithms, other methods to search through this space of two to the n possible feature subsets.

So, wrapper feature selection tends to work well when you can afford to do it computationally. But for problems such as text classification specifically, because you have so many features, easily 50,000 features, forward selection would be very, very expensive. So there's a different class of algorithms that tends not to do as well in the sense of generalization error, so you tend to learn a hypothesis that works less well, but that is computationally much less expensive. These are called the filter feature selection methods. And the basic idea is that for each feature i, we'll compute some measure of how informative x_i is about y; okay? And to do this we'll use some simple heuristics: for every feature we'll just compute some rough estimate, some measure, of how informative x_i is about y. There are many ways you can do this. One choice is to just compute the correlation between x_i and y; for each of your features, see how correlated it is with your class label y, and then you just pick the top k most correlated features. Another way to do this, especially for the case of text classification, there's one other informativeness measure that's used very commonly, which is called mutual information.
You're going to play with some of these ideas in the problem sets, but I'll just say this very briefly. So the mutual information between feature x_i and y, I'll just write out the definition, I guess. Let's say this is text classification, so x_i can take on two values, 0 and 1. The mutual information between x_i and y is defined as a sum over the possible values of x_i and over the possible values of y:

MI(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}

where all of these distributions, the joint distribution over x_i and y as well as the marginals, you would estimate from your training data. You would estimate from the training data what's the probability that x_i is 0, what's the probability that x_i is 1, what's the probability that x_i is 0 and y is 0, that x_i is 1 and y is 0, and so on.

So it turns out there's a standard information theoretic measure of how different two probability distributions are, and I'm not going to prove this here, but it turns out this mutual information is actually that standard measure of how different two distributions are, which is called the K-L divergence. If you've taken a class in information theory, you'll have seen the concepts of mutual information and the K-L divergence; if you haven't, don't worry about it. The intuition is just that there's something called the K-L divergence that's a formal measure of how different two probability distributions are, and mutual information is a measure of how different the joint distribution of x and y is from the distribution you would get if you were to assume they were independent; okay? So if x and y were independent, then p(x, y) would be equal to p(x) times p(y); these two distributions would be identical, and the K-L divergence would be 0. In contrast, if x and y are very non-independent, in other words if x and y are very informative about each other, then this K-L divergence will be large. And so mutual information is a formal measure of how non-independent x and y are.
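Here's one possible sketch of estimating this from a training set for binary features and labels; the probabilities in the definition above are estimated by simple counting, and the small smoothing constant eps is an added assumption here, just to avoid taking the log of zero:

```python
import numpy as np

def mutual_information(x_col, y, eps=1e-12):
    """Estimate MI(x_i, y) for a binary feature column and binary labels."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x_col == a) & (y == b)) + eps   # estimate of p(x_i = a, y = b)
            p_x = np.mean(x_col == a) + eps                 # estimate of p(x_i = a)
            p_y = np.mean(y == b) + eps                     # estimate of p(y = b)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def rank_features_by_mi(X, y):
    """Return feature indices sorted from most to least informative about y."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]
```

You would then keep only the top k indices from this ranking, with k chosen by cross validation as described below.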
And if x and y are highly non-independent, that means x will presumably tell you something about y, and so they'll have large mutual information; this measure will tell you that x_i might be a good feature. You'll get to play with some of these ideas more in the problem sets, so I won't say much more about it.

What you do then is, having chosen some measure like correlation or mutual information or something else, you pick the top k features; meaning you compute the correlation between x_i and y, or the mutual information between x_i and y, for all the features, and then you include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, whichever measure you chose. And to choose k, you can actually use cross validation as well; okay? So you would take all your features and sort them in decreasing order of mutual information, and then you'd try using just the top one feature, the top two features, the top three features, and so on, and decide how many features to include using cross validation; okay? Or sometimes you can just choose this by hand as well. Okay. Questions about this? Okay. Cool. Great. So next lecture I'll continue; I'll wrap up Bayesian model selection, but let's stop here since we're close to the end.