< Return to Video

Lecture 10 | Machine Learning (Stanford)

  • 0:12 - 0:16
    This presentation is delivered by the Stanford Center for Professional
  • 0:16 - 0:23
    Development.
  • 0:24 - 0:28
    So what I want to do today in this lecture is
  • 0:28 - 0:31
    talk a little bit more about learning theory. In particular, I'll
  • 0:31 - 0:33
    talk about VC dimension
  • 0:33 - 0:36
    and
  • 0:36 - 0:40
    building on the issues of bias variance tradeoffs of under
  • 0:40 - 0:43
    fitting and over fitting; that we've been seeing in the previous lecture, and then we'll see in
  • 0:43 - 0:44
    this one.
  • 0:44 - 0:49
    I then want to talk about model selection algorithms for automatically
  • 0:49 - 0:50
    making
  • 0:50 - 0:52
    decisions for this
  • 0:52 - 0:55
    bias variance tradeoff, that we started to talk about in the previous lecture.
  • 0:55 - 0:59
    And depending on how much time, I actually may not get to Bayesian, [inaudible]. But if
  • 0:59 - 1:01
    I don't get to this today,
  • 1:01 - 1:07
    I'll get to this in next week's lecture.
  • 1:07 - 1:10
    To recap:
  • 1:10 - 1:10
    the result we
  • 1:10 - 1:14
    proved in the previous lecture was that
  • 1:14 - 1:15
    if you have a finite
  • 1:15 - 1:16
    hypothesis class if
  • 1:16 - 1:20
    h is a set of k hypotheses,
  • 1:20 - 1:24
  • 1:24 - 1:26
    and suppose you have some
  • 1:26 - 1:33
    fixed parameters, gamma and delta,
  • 1:33 - 1:34
    then
  • 1:34 - 1:37
  • 1:37 - 1:41
    in order to guarantee
  • 1:41 - 1:48
    that this holds,
  • 1:48 - 1:51
    with probability at least one minus delta.
  • 1:51 - 1:56
    It suffices
  • 1:56 - 2:03
    that m is greater than or equal to that; okay?
  • 2:05 - 2:07
    And using big-O notations,
  • 2:07 - 2:14
    just dropping the constants, I can also write this as that; okay?
  • 2:14 - 2:18
    So just to quickly remind you of what all of the notation means,
  • 2:18 - 2:21
    we talked about empirical risk minimization,
  • 2:21 - 2:22
    which was
  • 2:22 - 2:25
    the simplified model of machine learning
  • 2:25 - 2:26
    that
  • 2:26 - 2:29
    has a hypothesis class of script h.
  • 2:29 - 2:33
    And what the empirical risk minimization-learning algorithm does is it just chooses the
  • 2:33 - 2:34
    hypothesis
  • 2:34 - 2:38
    that attains the smallest error on the training set.
  • 2:38 - 2:39
    And so
  • 2:39 - 2:43
    this symbol, epsilon, just denoted generalization error; right? This is the
  • 2:43 - 2:44
    probability
  • 2:44 - 2:49
    of a hypothesis h [inaudible] misclassifying a new example drawn from the same
  • 2:49 - 2:51
    distribution as the training set.
  • 2:51 - 2:53
    And so this says that
  • 3:02 - 3:06
    in order to guarantee that the generalization error of the hypothesis h
  • 3:06 - 3:09
    [inaudible] output by empirical risk minimization
  • 3:09 - 3:11
    that this is less than or equal to
  • 3:11 - 3:14
    the best possible generalization error
  • 3:15 - 3:20
    you could get in your hypothesis class, plus two times gamma, two times this error threshold.
  • 3:20 - 3:24
    We want to guarantee that this holds with probability at least one minus delta.
  • 3:24 - 3:26
    We show that
  • 3:26 - 3:29
    it suffices for your training set size m
  • 3:29 - 3:33
    to be greater than or equal to this; okay? One over two gamma squared, log two k over
  • 3:33 - 3:34
    delta;
  • 3:34 - 3:39
    where again, k is the size of your hypothesis class.
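    [The board equations aren't captured in the transcript; reconstructing them from the surrounding description, the recap is:]

\[
\varepsilon(\hat{h}) \;\le\; \Big(\min_{h \in \mathcal{H}} \varepsilon(h)\Big) + 2\gamma
\quad\text{with probability at least } 1-\delta,
\qquad\text{provided}\quad
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2k}{\delta} \;=\; O\!\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big).
\]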
  • 3:39 - 3:43
    And so this is a sample complexity result because it gives us a bound on the
  • 3:43 - 3:45
    number of training examples we need
  • 3:45 - 3:47
    in order to give a guarantee
  • 3:47 - 3:49
    on something on the error; okay? So
  • 3:49 - 3:55
    this is a sample complexity result. So
  • 3:55 - 3:59
    what I want to do now is take this result, and try to generalize it to the
  • 3:59 - 4:03
    case of infinite hypothesis classes. So here,
  • 4:03 - 4:07
    we said that the set script h is sort of just k
  • 4:07 - 4:09
    specific functions,
  • 4:09 - 4:12
    whereas in practice you may want to use a model like logistic regression, which is actually
  • 4:12 - 4:14
    parameterized by real numbers.
  • 4:14 - 4:15
    So
  • 4:15 - 4:17
    I'm actually first going to give an argument that's sort of
  • 4:17 - 4:21
    formally broken just sort of technically somewhat broken, but conveys
  • 4:21 - 4:25
    useful intuition. And then I'll give the more correct argument, but
  • 4:25 - 4:30
    without proving it, because the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want
  • 4:30 - 4:31
    to
  • 4:31 - 4:35
    apply this result to analyze logistic regression.
  • 4:36 - 4:37
    So let's say
  • 4:39 - 4:45
    your hypothesis class is the class of all linear decision boundaries; right? So say
  • 4:45 - 4:47
    script h is
  • 4:47 - 4:52
    parameterized
  • 4:52 - 4:58
    by d
  • 4:58 - 5:00
    real numbers; okay? So for example, if
  • 5:00 - 5:03
    you're applying logistic regression with
  • 5:03 - 5:07
    over [inaudible], then d would be n plus one; with logistic regression you'd find the
  • 5:07 - 5:09
    linear decision boundary,
  • 5:09 - 5:10
    parameterized by
  • 5:10 - 5:14
    n plus one real numbers.
  • 5:14 - 5:18
    When you think about how your hypothesis class is really represented in a computer
  • 5:18 - 5:19
    computers use
  • 5:19 - 5:22
    zero one bits to represent real numbers.
  • 5:22 - 5:23
    And so if you
  • 5:23 - 5:24
    use like
  • 5:24 - 5:29
    a normal standard computer, it normally will represent real numbers by what's
  • 5:29 - 5:31
    called double precision floating point numbers.
  • 5:31 - 5:33
    And what that means is that each real number
  • 5:33 - 5:37
    is represented by a 64-bit representation;
  • 5:37 - 5:38
    right? So really
  • 5:38 - 5:42
    if you know what floating point is in a computer, a 64-bit floating point representation
  • 5:42 - 5:43
    is what almost all of
  • 5:43 - 5:45
    us use routinely.
  • 5:45 - 5:48
    And so this parameterized by d real numbers, that's really
  • 5:48 - 5:51
    as if it's parameterized by 64
  • 5:51 - 5:53
    times d
  • 5:53 - 5:54
    bits.
  • 5:54 - 5:57
    Computers can't represent real numbers. They only represent
  • 5:57 - 5:59
    things using a discrete set of bits.
  • 5:59 - 6:01
    And so the size of
  • 6:01 - 6:03
    your
  • 6:03 - 6:07
    hypothesis class in your computer representation
  • 6:07 - 6:11
    you have 64 times d bits that you can flip.
  • 6:11 - 6:15
    And so the number of possible values for your 64 d bits is
  • 6:15 - 6:18
    really just
  • 6:18 - 6:21
    two to the power of 64 d; okay?
  • 6:21 - 6:23
    Because that's the number of ways you can
  • 6:23 - 6:26
    flip the 64 d bits.
  • 6:26 - 6:29
    And so
  • 6:29 - 6:34
    this is why it's important that we had log k there; right? So k is
  • 6:34 - 6:37
    therefore, two to the 64 d.
  • 6:37 - 6:40
    And if I plug it into this equation over here,
  • 6:40 - 6:43
    what you find is that
  • 6:43 - 6:47
    in order to get this sort of guarantee,
  • 6:47 - 6:51
    it suffices that m
  • 6:51 - 6:53
    is
  • 6:53 - 6:56
    greater than or equal to
  • 6:56 - 6:59
    on the order of one over
  • 6:59 - 7:02
    gamma squared, log of
  • 7:02 - 7:03
    two to the 64 d
  • 7:03 - 7:05
    over delta,
  • 7:05 - 7:12
    which is that; okay? So
  • 7:15 - 7:18
    just to be clear, in order to guarantee that
  • 7:18 - 7:24
    this holds, this is really just the same complexity result as we had before, so the question is:
  • 7:24 - 7:27
    suppose you want a guarantee that the hypothesis returned by empirical risk
  • 7:27 - 7:29
    minimization will
  • 7:29 - 7:34
    have a generalization error that's within two gamma of the best hypothesis in your hypothesis
  • 7:34 - 7:36
    class.
  • 7:36 - 7:39
    Then what this result suggests is that, you know, in order to give that sort of
  • 7:39 - 7:41
    error bound guarantee,
  • 7:41 - 7:43
    it suffices that m is greater than or equal to this.
  • 7:43 - 7:45
    In other words, that
  • 7:45 - 7:47
    your number of training examples has to be
  • 7:47 - 7:52
    on the order of d over gamma squared, times log one over delta. Okay? And the
  • 7:53 - 7:54
    intuition
  • 7:54 - 7:58
    that this conveys is actually, roughly right. This says, that the number of
  • 7:58 - 7:59
    training examples you need
  • 7:59 - 8:01
    is roughly linear
  • 8:01 - 8:05
    in the number of parameters of your hypothesis class. That
  • 8:05 - 8:09
    m has [inaudible] on the order of something linear, [inaudible]. That intuition is actually, roughly right.
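    [To spell out the arithmetic of this admittedly broken argument, plugging k = 2^{64d} into the earlier bound gives:]

\[
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2\cdot 2^{64d}}{\delta}
\;=\; O\!\Big(\frac{1}{\gamma^2}\big(d + \log\tfrac{1}{\delta}\big)\Big)
\;=\; O_{\gamma,\delta}(d).
\]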
  • 8:09 - 8:10
  • 8:10 - 8:12
    I'll say more about
  • 8:17 - 8:21
    this
  • 8:22 - 8:26
    later. This result is clearly, slightly broken, in the sense that it relies on a
  • 8:26 - 8:29
    64-bit representation of floating-point numbers.
  • 8:29 - 8:32
    So let me actually go ahead and outline
  • 8:32 - 8:35
    the "right way" to show this more formally; all
  • 8:35 - 8:39
    right? And it turns out the "right way" to show this more formally
  • 8:39 - 8:43
    involves a much longer argument, because the
  • 8:43 - 8:46
    proof is extremely involved, so I'm just actually going to state the result, and not
  • 8:46 - 8:50
    prove it. This is the core result behind learning theory bounds for infinite hypothesis
  • 8:50 - 8:54
    classes.
  • 8:56 - 9:03
    So here's a definition: given a set s of d points,
  • 9:19 - 9:23
    we say,
  • 9:23 - 9:27
    a hypothesis class h
  • 9:27 - 9:29
    shatters
  • 9:29 - 9:35
    the set s, if
  • 9:35 - 9:42
    h can realize any labeling on it; okay?
  • 9:47 - 9:47
    And what
  • 9:47 - 9:51
    I mean by realizing any labeling on it the informal way of thinking about this
  • 9:51 - 9:53
    is:
  • 9:53 - 9:57
    if a hypothesis class has shattered the set s, what that means is that I can take these
  • 9:57 - 9:59
    d points,
  • 9:59 - 10:02
    and I can associate these d points with any
  • 10:02 - 10:04
    set of labels y; right?
  • 10:04 - 10:09
    So choose any set of labeling y for each of these d
  • 10:09 - 10:10
    points.
  • 10:10 - 10:14
    And if your hypothesis class shatters s, then that means that
  • 10:14 - 10:15
    there will be
  • 10:15 - 10:19
    a hypothesis that labels those d examples perfectly;
  • 10:19 - 10:24
    okay? That's what shattering means. So let me just illustrate those in an example.
  • 10:24 - 10:31
    So let's say h is the class of all linear classifiers
  • 10:31 - 10:36
    in 2-D,
  • 10:36 - 10:38
    and let's say that
  • 10:38 - 10:40
    s is this
  • 10:40 - 10:44
    [inaudible] comprising two points; okay?
  • 10:44 - 10:46
    So there are
  • 10:46 - 10:50
    four possible labelings
  • 10:50 - 10:53
    for these two points. You can choose
  • 10:53 - 10:56
    to label both positive; one positive, one negative; one negative, one positive; or you can
  • 10:56 - 10:59
    label both of them negative.
  • 10:59 - 11:01
    And if
  • 11:01 - 11:05
    the hypothesis class h is the class of all linear classifiers in 2-D, then for each
  • 11:05 - 11:08
    of these training sets, I can sort of find a
  • 11:08 - 11:10
    linear classifier
  • 11:10 - 11:14
    that attains zero training error on
  • 11:14 - 11:18
    each of these, on all possible labelings of this set of two points.
  • 11:18 - 11:20
    And so I'll say
  • 11:20 - 11:23
    that the hypothesis class script h shatters
  • 11:23 - 11:24
    this set s
  • 11:24 - 11:29
    of two points; okay? One
  • 11:29 - 11:36
    more example; let me show
  • 11:37 - 11:39
    you
  • 11:39 - 11:42
    a larger example. Suppose my set s is now
  • 11:42 - 11:45
    this set of three points;
  • 11:45 - 11:52
    right? Then, I now have eight possible labelings for these three points;
  • 12:11 - 12:15
    okay? And so for these three points, I now have eight possible labelings.
  • 12:15 - 12:17
    And once again, I can for each
  • 12:17 - 12:22
    of these labelings, I can find the hypothesis in the hypothesis class
  • 12:22 - 12:23
    that labels
  • 12:23 - 12:26
    these examples correctly.
  • 12:26 - 12:30
    And so once again, I see that by definition, say, that my hypothesis
  • 12:30 - 12:34
    class also shatters this set s. Student:Right. Instructor (Andrew Ng):And
  • 12:34 - 12:38
    then that that terminology h can realize any labeling on s. That's
  • 12:38 - 12:40
    obviously [inaudible].
  • 12:40 - 12:44
    Give it any set of labels and you can find a hypothesis that perfectly
  • 12:44 - 12:48
    separates the positive and negative examples;
  • 12:48 - 12:49
    okay?
  • 12:49 - 12:51
    So
  • 12:51 - 12:55
    how about this
  • 12:55 - 12:57
    set?
  • 12:57 - 13:01
    Suppose s is now this set of four points, then,
  • 13:01 - 13:05
    you know, there are lots of labels. There are now 16 labelings we can choose on this; right?
  • 13:05 - 13:06
    That's one for instance,
  • 13:06 - 13:09
    and this is another one;
  • 13:09 - 13:13
    right? And so I can realize some labelings. But there's no linear decision
  • 13:13 - 13:16
    boundary that can realize this labeling,
  • 13:16 - 13:18
    and so h does not shatter
  • 13:18 - 13:20
    this set of four points; okay?
  • 13:20 - 13:22
    And
  • 13:22 - 13:25
    I'm not really going to prove it here, but it turns out that you can show that
  • 13:25 - 13:30
    in two dimensions, there is no set of four points that the
  • 13:30 - 13:37
    class of all linear classifiers can shatter; okay?
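    [A rough numerical illustration of shattering, not from the lecture; it assumes numpy and scikit-learn are available and uses a large-C linear SVM as a practical separability check rather than a proof:]

```python
# Sketch: check numerically whether the class of linear classifiers in 2-D
# shatters a given set of points, by trying every possible labeling.
import itertools
import numpy as np
from sklearn.svm import SVC

def shatters(points):
    """Return True if some linear classifier realizes every labeling of `points`."""
    points = np.asarray(points, dtype=float)
    for bits in itertools.product([0, 1], repeat=len(points)):
        y = np.array(bits)
        if y.min() == y.max():
            continue  # an all-positive or all-negative labeling is always realizable
        clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard-margin separator
        clf.fit(points, y)
        if (clf.predict(points) != y).any():
            return False  # found a labeling no linear classifier can realize
    return True

print(shatters([(0, 0), (1, 0), (0, 1)]))          # True: three points in general position
print(shatters([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR-style labeling is not realizable
```

    This mirrors the board examples: three points in general position are shattered, while no set of four points in 2-D is.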
  • 13:46 - 13:49
    So here's another definition. When I say that the well,
  • 13:49 - 13:50
    it's called the
  • 13:50 - 13:53
    VC dimension. These
  • 13:53 - 14:00
    two people, Vapnik and Chervonenkis
  • 14:07 - 14:10
    so given a hypothesis class,
  • 14:10 - 14:14
    the Vapnik and Chervonenkis
  • 14:14 - 14:18
    dimension of h, which we usually write as VC of script h,
  • 14:18 - 14:25
    is the size of the
  • 14:34 - 14:37
    largest
  • 14:37 - 14:40
    set that is shattered by h.
  • 14:40 - 14:41
    And
  • 14:41 - 14:45
    if a hypothesis class can shatter arbitrarily large sets, then the VC dimension
  • 14:45 - 14:48
    is infinite.
  • 14:48 - 14:51
    So just as a kind of good example: if h
  • 14:51 - 14:52
    is the class of all linear
  • 14:52 - 14:58
    classifiers
  • 14:58 - 15:03
    in 2-D, then the
  • 15:03 - 15:04
    VC dimension of this class
  • 15:04 - 15:05
    is equal to
  • 15:05 - 15:09
    three because we saw just now that there is a size of
  • 15:09 - 15:13
    there was a set s of size three that it could shatter,
  • 15:13 - 15:17
    and I won't really prove it, but it turns out there is no set of size four
  • 15:17 - 15:18
    that it can
  • 15:18 - 15:20
    shatter. And therefore,
  • 15:20 - 15:21
    the VC dimension of this is three. Yeah? Student:But there are sets of size three that it cannot shatter; right? [Inaudible] was your
  • 15:21 - 15:23
    point. Instructor
  • 15:23 - 15:28
    (Andrew
  • 15:28 - 15:30
    Ng):Yes, absolutely. So it turns out that
  • 15:30 - 15:32
    if
  • 15:32 - 15:36
    I choose a set like this it's actually set s,
  • 15:36 - 15:39
    then there are labelings on this that it cannot realize. And so,
  • 15:39 - 15:41
    h cannot shatter this set.
  • 15:41 - 15:44
    But that's okay because
  • 15:44 - 15:47
    there exists some other set of size three that can be shattered. So the VC
  • 15:47 - 15:48
    dimension is
  • 15:48 - 15:55
    three. And then there is no set of size four that can shatter. Yeah? Student:[Inaudible]. Instructor
  • 15:56 - 15:58
    (Andrew Ng):Not according to this definition. No.
  • 15:58 - 16:01
    Right. So again, let's see,
  • 16:01 - 16:04
    I can choose my set s to be
  • 16:04 - 16:08
    to be a set of three points that are all overlapping. Three points in exactly
  • 16:08 - 16:10
    the same place.
  • 16:10 - 16:12
    And clearly, I can't shatter this set,
  • 16:12 - 16:15
    but that's okay. And I can't shatter this set, either,
  • 16:15 - 16:19
    but that's okay because there are some other sets of size three that I can
  • 16:19 - 16:25
    shatter.
  • 16:25 - 16:28
    And it turns out this result holds true
  • 16:28 - 16:32
    more generally,
  • 16:32 - 16:34
  • 16:34 - 16:41
    in
  • 16:41 - 16:48
    n dimensions the VC dimension of the class of linear classifiers in
  • 16:51 - 16:54
    n dimensions is equal to n plus one.
  • 16:54 - 16:59
    Okay? So this is in [inaudible], and if you have linear classifiers over an n-dimensional feature space, the VC
  • 16:59 - 17:02
    dimension of linear classifiers in n dimensions
  • 17:02 - 17:04
    is equal to n plus one.
  • 17:04 - 17:11
  • 17:12 - 17:14
    So maybe you wanna write it down:
  • 17:14 - 17:16
  • 17:16 - 17:21
    what is arguably the best-known result in all of learning theory, I guess;
  • 17:21 - 17:26
  • 17:26 - 17:29
    which is that.
  • 17:29 - 17:36
    Let a hypothesis class be given,
  • 17:38 - 17:41
    and let the VC dimension of h be
  • 17:41 - 17:46
    equal to d. Then we have,
  • 17:46 - 17:48
    with probability at least
  • 17:48 - 17:50
    one minus delta.
  • 17:50 - 17:57
    We have that
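    [The board formula here is not captured in the transcript; in the usual statement of this uniform convergence theorem, what follows "we have that" is, for all h in the hypothesis class H:]

\[
\big|\varepsilon(h) - \hat{\varepsilon}(h)\big|
\;\le\;
O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right).
\]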
  • 18:32 - 18:35
    the formula on the right looks a bit complicated, but don't worry about it. I'll point
  • 18:35 - 18:37
    out the essential aspects of it later.
  • 18:37 - 18:41
    But the key to this result is that if you have a hypothesis class with VC dimension d, and
  • 18:41 - 18:45
    now this can be an infinite hypothesis
  • 18:45 - 18:47
    class,
  • 18:47 - 18:49
    what
  • 18:49 - 18:51
    Vapnik and Chervonenkis show is that
  • 18:51 - 18:54
    with probability at least one minus delta,
  • 18:54 - 18:55
    You enjoy this
  • 18:55 - 18:58
    sort of uniform convergence result;
  • 18:58 - 19:00
    okay? We have
  • 19:00 - 19:03
  • 19:03 - 19:07
    that for all hypotheses h that for all the hypotheses in your
  • 19:07 - 19:08
    hypothesis class,
  • 19:08 - 19:13
    you have that the generalization error of h
  • 19:13 - 19:14
    minus
  • 19:14 - 19:19
    the training error of h. So the difference between these two things is bounded above
  • 19:19 - 19:24
    by some complicated formula like this; okay?
  • 19:24 - 19:30
    And thus,
  • 19:30 - 19:37
    with probability one minus delta, we also have the
  • 19:45 - 19:52
    same thing; okay?
  • 19:57 - 20:01
    And going from this step to this step; right? Going from this step
  • 20:01 - 20:03
    to this step is actually something that you saw yourself;
  • 20:03 - 20:06
    that we actually proved earlier. Because
  • 20:06 - 20:09
    you remember, in the previous lecture we proved that if you have uniform
  • 20:09 - 20:11
    convergence,
  • 20:11 - 20:13
    then that implies that
  • 20:13 - 20:18
    it appears actually that we showed that if generalization error and training error are close to
  • 20:18 - 20:21
    each other; within gamma of each other,
  • 20:21 - 20:24
    then the generalization error of the hypotheses you pick will be within
  • 20:24 - 20:26
    two gamma
  • 20:26 - 20:28
    of the best generalization error.
  • 20:28 - 20:30
    So this is really
  • 20:30 - 20:31
    generalization error of
  • 20:31 - 20:35
    h [inaudible] best possible generalization error plus two times gamma. And just the
  • 20:35 - 20:37
    two constants in front here
  • 20:37 - 20:44
    that I've absorbed into the big-O notation.
  • 20:44 - 20:47
    So that formula is slightly more complicated. Let
  • 20:47 - 20:54
    me just rewrite this as a corollary, which is that
  • 20:55 - 20:59
    in order to guarantee that
  • 20:59 - 21:04
    this holds,
  • 21:04 - 21:09
    with probability of one minus delta.
  • 21:09 - 21:11
    With probability at least one minus delta, I should say. It
  • 21:11 - 21:17
    suffices
  • 21:17 - 21:20
    that I'm gonna write this this way: I'm gonna write m equals
  • 21:20 - 21:22
    big-O of
  • 21:22 - 21:23
    d,
  • 21:23 - 21:26
    and I'm going to put
  • 21:26 - 21:33
    gamma and delta in as a subscript there to denote that.
  • 21:33 - 21:37
    Let's see, if we treat gamma and delta as constants, that allows me to absorb terms that
  • 21:37 - 21:40
    depend on gamma and delta into the big-O notation,
  • 21:40 - 21:42
    then
  • 21:42 - 21:45
    in order to guarantee this holds, it suffices that m
  • 21:45 - 21:46
    is on the order of the
  • 21:46 - 21:53
    VC dimension of the hypothesis class; okay?
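    [Written out, the corollary being stated is roughly:]

\[
\varepsilon(\hat{h}) \;\le\; \min_{h \in \mathcal{H}} \varepsilon(h) + 2\gamma
\quad\text{holds with probability at least } 1-\delta
\quad\text{provided}\quad m = O_{\gamma,\delta}(d), \text{ where } d = \mathrm{VC}(\mathcal{H}).
\]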
  • 21:53 - 21:56
    So
  • 21:56 - 21:59
    let's see.
  • 21:59 - 22:02
    So what we conclude from this is that
  • 22:02 - 22:06
    if you have a learning algorithm that tries to for empirical risk minimization
  • 22:06 - 22:09
    algorithms in other words, less formally,
  • 22:09 - 22:13
    for learning algorithms that try to minimize training error the
  • 22:13 - 22:15
    intuition to take away from this is that
  • 22:15 - 22:18
    the number of training examples you need is therefore,
  • 22:18 - 22:19
    roughly, linear
  • 22:19 - 22:24
    in the VC dimension of the hypotheses class.
  • 22:24 - 22:28
    And more formally, this shows that sample complexity
  • 22:28 - 22:30
    is upper bounded
  • 22:30 - 22:34
    by the VC dimension; okay?
  • 22:34 - 22:38
    It turns out that for most reasonable hypothesis classes, it turns out that
  • 22:38 - 22:41
    the VC dimension is sort
  • 22:41 - 22:42
  • 22:42 - 22:45
    of very similar, I guess, to
  • 22:45 - 22:47
    the number of parameters of your model.
  • 22:47 - 22:49
    So for example,
  • 22:49 - 22:52
    say you model with logistic regression, a linear classifier.
  • 22:52 - 22:56
    In n dimensions, logistic regression has n plus
  • 22:56 - 22:57
    one parameters.
  • 22:57 - 23:02
    And the VC dimension of the class of linear classifiers is also n plus one.
  • 23:02 - 23:04
    So it turns out that
  • 23:04 - 23:07
    for most reasonable hypothesis classes, the
  • 23:07 - 23:11
    VC dimension is usually linear in the number of parameters of your model.
  • 23:11 - 23:15
    or, at most, some low-order polynomial
  • 23:15 - 23:17
    in the number of parameters of your model.
  • 23:17 - 23:19
    And so this
  • 23:19 - 23:21
    the takeaway intuition from this is that
  • 23:21 - 23:25
    the number of training examples you need to fit those models is going to be, let's say,
  • 23:25 - 23:30
    roughly, linear in the number of parameters in your model; okay?
  • 23:30 - 23:33
    There are some somewhat strange examples where what I just said is not true. There are some
  • 23:33 - 23:37
    strange examples where you have very few parameters, but the VC dimension is
  • 23:37 - 23:37
    enormous.
  • 23:37 - 23:39
    But actually,
  • 23:39 - 23:42
    all of the examples I know of that fall into that regime are somewhat
  • 23:42 - 23:44
    strange and
  • 23:44 - 23:45
    degenerate. So somewhat unusual, and
  • 23:45 - 23:50
    not the sort of learning algorithms you usually use.
  • 23:50 - 23:52
    Let's see,
  • 23:52 - 23:55
    just other things. It turns out that
  • 23:55 - 23:58
    so this result shows the sample complexity is upper bounded by VC
  • 23:58 - 24:01
    dimension. But if you have a number of training examples that
  • 24:01 - 24:05
    are on the order of the VC dimension, then you find
  • 24:05 - 24:08
    it turns out that in the worst case
  • 24:08 - 24:10
  • 24:10 - 24:12
  • 24:12 - 24:16
    sample complexity is also lower bounded by the VC dimension.
  • 24:16 - 24:20
    And what that means is that if you have a perfectly nasty learning
  • 24:20 - 24:23
    problem, say, then if
  • 24:23 - 24:27
    the number of training examples you have is less than on the order of the VC
  • 24:27 - 24:27
    dimension;
  • 24:27 - 24:30
    then
  • 24:30 - 24:34
    it is not possible to prove this bound. So I guess in the worst case,
  • 24:34 - 24:37
    sample complexity, the number of training examples you need, is upper bounded and lower bounded
  • 24:37 - 24:42
    by the VC dimension. Let's
  • 24:42 - 24:48
    see,
  • 24:48 - 24:55
    questions about this? Student:Does
  • 24:56 - 25:00
  • 25:00 - 25:00
    the
  • 25:00 - 25:05
    proof of this assume any sort of finiteness, like finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng):Let's see. The proof does not, no. I've actually stated the
  • 25:05 - 25:07
    entirety of the theorem. This is true. It
  • 25:07 - 25:10
    turns out in the proof well,
  • 25:10 - 25:13
    somewhere in the proof there's a construction called an
  • 25:13 - 25:16
    epsilon net, which is a very clever [inaudible]. It' sort of in regardless of the
  • 25:16 - 25:20
    proof, it is not an assumption that you need.
  • 25:20 - 25:24
    In someway that sort of proof that's one-step that uses a very
  • 25:24 - 25:27
    clever [inaudible] to prove
  • 25:27 - 25:32
    this. But that's not needed; it's an assumption. I'd like to say, back when
  • 25:32 - 25:36
    I was a Ph.D. student, when I was working through this proof,
  • 25:36 - 25:38
    there was sort of a solid week where I
  • 25:38 - 25:42
    would wake up, and go to the office at 9:00 a.m. Then I'd start reading
  • 25:42 - 25:45
    the book that led up to this proof. And then
  • 25:45 - 25:49
    I'd read from 9:00 a.m. to 6:00 p.m. And then I'd go home, and then the next day, I'd pick up where I left
  • 25:49 - 25:52
    off. And it sort of took me a whole week that way,
  • 25:52 - 25:53
    to understand this proof, so
  • 25:53 - 26:00
    I thought I would inflict that on you.
  • 26:03 - 26:07
    Just to tie a couple of loose ends:
  • 26:07 - 26:13
    what I'm about to do is, I'm about to just mention a few things that will
  • 26:13 - 26:18
    maybe, feel a little bit like random facts. But I'm just gonna tie up just a couple of loose ends.
  • 26:18 - 26:21
    And so
  • 26:21 - 26:24
    let's see,
  • 26:24 - 26:27
    it turns out that
  • 26:27 - 26:30
    just so this will resonate more strongly with you: this bound was
  • 26:30 - 26:31
    proved
  • 26:31 - 26:35
    for an algorithm that uses empirical risk minimization, for an algorithm that
  • 26:35 - 26:36
    minimizes
  • 26:36 - 26:40
    0-1 training error. So
  • 26:40 - 26:46
    one
  • 26:46 - 26:49
    question that some of
  • 26:49 - 26:52
    you ask is
  • 26:52 - 26:55
    how about support vector machines; right? How come SVM's don't over
  • 26:55 - 26:57
    fit?
  • 26:57 - 27:01
    And remember, in our discussion on support vector machines, we said that
  • 27:01 - 27:06
    you use kernels, and map the features to an infinite dimensional feature space.
  • 27:06 - 27:10
    And so it seems like the VC dimension should be infinite; n plus
  • 27:10 - 27:12
    one and n is infinite.
  • 27:12 - 27:15
    So it turns out that
  • 27:15 - 27:20
    the class of linear separators with large margin actually has low VC dimension.
  • 27:20 - 27:23
    I wanna say this very quickly, and informally.
  • 27:23 - 27:28
    It's actually, not very important for you to understand the details, but I'm going
  • 27:28 - 27:32
    to say it very informally. It turns out that I will give you a set of points.
  • 27:32 - 27:33
    And
  • 27:33 - 27:38
    if I ask you to consider only the class of lines that separate these points with a
  • 27:38 - 27:39
    large margin [inaudible],
  • 27:39 - 27:40
    so
  • 27:40 - 27:44
    my hypothesis class will comprise only
  • 27:44 - 27:46
    the linear decision boundaries
  • 27:46 - 27:49
    that separate the points with a large margin. Say with a
  • 27:49 - 27:50
    margin,
  • 27:50 - 27:52
    at least gamma; okay.
  • 27:52 - 27:53
    And so
  • 27:53 - 27:56
    I won't allow a point that comes closer. Like,
  • 27:56 - 28:00
    I won't allow that line because it comes too close to one of my points.
  • 28:00 - 28:04
    It turns out that
  • 28:04 - 28:11
    if I consider my
  • 28:11 - 28:15
    data points all
  • 28:15 - 28:19
    lie within some sphere of radius r,
  • 28:19 - 28:23
    and if I consider only the class of linear separators that separate the data
  • 28:23 - 28:27
    with a margin of at least gamma,
  • 28:27 - 28:30
    then the VC dimension of this
  • 28:30 - 28:34
    class is less than or equal to r squared over four
  • 28:34 - 28:36
    gamma squared
  • 28:36 - 28:38
    plus one; okay?
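    [In symbols, the bound being described, with r the radius of the sphere containing the data and gamma the margin, is:]

\[
\mathrm{VC}(\mathcal{H}_{\text{margin-}\gamma}) \;\le\; \left\lceil \frac{r^2}{4\gamma^2} \right\rceil + 1.
\]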
  • 28:38 - 28:40
    So
  • 28:40 - 28:43
    this funny symbol here, that just means rounding up.
  • 28:43 - 28:47
    This is a ceiling symbol; it means rounding up x.
  • 28:47 - 28:48
    And it
  • 28:48 - 28:51
    turns out you can prove and there are some strange things
  • 28:51 - 28:53
    about this result that I'm deliberately not
  • 28:53 - 28:55
    gonna talk about but it turns out you
  • 28:55 - 28:59
    can prove that the VC dimension of the class of linear classifiers with large
  • 28:59 - 29:01
    margins is actually bounded.
  • 29:01 - 29:03
    The surprising thing about this is that
  • 29:03 - 29:06
    this is a bound on the VC dimension that has no dependence
  • 29:06 - 29:08
    on the dimension of the points x. So
  • 29:08 - 29:09
    in other words,
  • 29:09 - 29:13
    your data points x can be in an infinite dimensional space,
  • 29:13 - 29:16
    but so long as you restrict attention to the class of
  • 29:16 - 29:18
    linear separators with large margin, the
  • 29:18 - 29:20
    VC dimension is bounded.
  • 29:20 - 29:22
    And so
  • 29:22 - 29:26
    in trying to find a large margin separator
  • 29:26 - 29:29
    in trying to find the line that separates your positive and
  • 29:29 - 29:30
    your negative examples
  • 29:30 - 29:32
    with large margin,
  • 29:32 - 29:35
    it turns out therefore, that
  • 29:35 - 29:38
    the support vector machine is automatically trying to find
  • 29:38 - 29:40
    a hypothesis class
  • 29:40 - 29:42
    with small VC dimension. And therefore, it does not over fit. Alex? Student:What is
  • 29:42 - 29:49
    the [inaudible]? Instructor (Andrew
  • 29:49 - 29:50
    Ng):It
  • 29:50 - 29:52
    is actually defined the
  • 29:52 - 29:56
    same way as in finite dimensional spaces. So you
  • 29:56 - 29:57
    know, suppose you have infinite actually,
  • 29:57 - 30:01
    these are constantly infinite dimensional vectors; not [inaudible] to the infinite
  • 30:01 - 30:04
    dimensional
  • 30:04 - 30:06
    vectors. Normally, the norm of x, the two norm,
  • 30:06 - 30:07
    squared
  • 30:07 - 30:10
    squared is equal to the sum from i equals 1 to infinity of
  • 30:10 - 30:12
    xi squared, so if x
  • 30:12 - 30:17
    is infinite dimensional, you just define it like
  • 30:17 - 30:20
    that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again.
  • 30:20 - 30:23
    [Inaudible].
  • 30:23 - 30:27
    Yes. Although, I assume that this is bounded by r. Oh. It's
  • 30:27 - 30:31
    a yeah so this ensures that it converges.
  • 30:31 - 30:32
  • 30:32 - 30:35
    So just
  • 30:35 - 30:38
    something people sometimes wonder about. And
  • 30:38 - 30:40
    lastly, to actually
  • 30:40 - 30:44
    tie empirical risk minimization back a little more strongly to the sorts of algorithms
  • 30:44 - 30:47
    we've talked about.
  • 30:47 - 30:48
    It turns out
  • 30:48 - 30:55
    that
  • 31:10 - 31:15
    so the theory so far was really for empirical risk minimization. So with that in mind,
  • 31:15 - 31:18
  • 31:18 - 31:21
    so we focus on just one training example.
  • 31:21 - 31:23
    Let me draw a
  • 31:23 - 31:24
    function, you know, a
  • 31:24 - 31:25
    zero here jumps
  • 31:25 - 31:26
    to one,
  • 31:26 - 31:28
    and it looks like that.
  • 31:28 - 31:30
    And so this for
  • 31:30 - 31:32
    once, this training example,
  • 31:32 - 31:36
    this may be
  • 31:36 - 31:43
    indicator
  • 31:43 - 31:47
    h where [inaudible] is d
  • 31:47 - 31:49
    equals theta transpose x;
  • 31:49 - 31:53
    okay? But one
  • 31:53 - 31:56
    training example your training example will be positive or negative. And
  • 31:56 - 31:59
    depending on what the value of this theta transpose x is, you either get it right
  • 31:59 - 32:02
    or wrong. And so
  • 32:02 - 32:06
    you know, I guess if your training example if you have a positive example,
  • 32:06 - 32:08
    then when z is positive,
  • 32:08 - 32:10
    you get it
  • 32:10 - 32:11
    right.
  • 32:11 - 32:18
  • 32:19 - 32:23
    Suppose you have a negative example, so y equals 0; right?
  • 32:23 - 32:27
    Then if z, which is theta transpose x if this is positive,
  • 32:27 - 32:30
    then you will get this example wrong;
  • 32:30 - 32:33
    whereas, if z is negative then you'd get this example right.
  • 32:33 - 32:37
    And so this is a plot of indicator h subscript [inaudible] x not equals y; okay? You
  • 32:37 - 32:43
    know, it's equal
  • 32:43 - 32:50
    to g of theta transpose
  • 32:50 - 32:51
  • 32:51 - 32:54
    x; okay? And so it turns out that
  • 32:54 - 32:57
    so what you'd really like to do is choose parameters theta so as to
  • 32:57 - 33:00
    minimize this step function; right?
  • 33:00 - 33:02
    You'd like to choose parameters theta, so that
  • 33:02 - 33:05
    you end up with a
  • 33:05 - 33:08
    correct classification on setting your training example, and so you'd like
  • 33:08 - 33:09
    indicator h
  • 33:09 - 33:11
    of x not equal y.
  • 33:11 - 33:15
    You'd like this indicator function to be 0. It
  • 33:15 - 33:19
    turns out this step function is clearly a non-convex function. And so
  • 33:19 - 33:22
    it turns out that even just for linear classifiers
  • 33:22 - 33:25
    minimizing the training error is an NP-hard problem. It
  • 33:25 - 33:30
    turns out that both logistic regression, and support vector machines can be viewed as
  • 33:30 - 33:31
    using a convex
  • 33:31 - 33:33
    approximation for this problem.
  • 33:33 - 33:36
    And in particular
  • 33:36 - 33:40
    and draw a function like that
  • 33:41 - 33:48
    it turns out that
  • 33:51 - 33:55
    logistic regression is trying to maximize likelihood. And so it's trying to minimize
  • 33:55 - 33:57
    the minus of the log likelihood.
  • 33:57 - 34:00
    And if you plot the minus of the log likelihood, it actually turns out it'll be a function that
  • 34:00 - 34:03
    looks like this. And
  • 34:03 - 34:07
    this line that I just drew, you can think of it as a rough approximation to this step function; which is
  • 34:07 - 34:11
    maybe what you're really trying to minimize,
  • 34:11 - 34:15
    so you want to minimize training error. So you can actually think of logistic regression as trying to approximate
  • 34:15 - 34:16
    empirical risk minimization.
  • 34:16 - 34:20
    Where instead of using this step function, which is non-convex, and gives you a hard
  • 34:20 - 34:21
    optimization problem, it
  • 34:21 - 34:25
    uses this line above this curve above. So approximate it,
  • 34:25 - 34:28
    so you have a convex optimization problem you can
  • 34:28 - 34:30
    find the
  • 34:30 - 34:34
    maximum likelihood estimate of the parameters for logistic regression.
  • 34:34 - 34:39
    And it turns out, the support vector machine also can be viewed as approximating this function,
  • 34:39 - 34:41
    too, only a little bit differently. Let's
  • 34:41 - 34:45
    see, support vector machine turns out, can be viewed as trying to approximate this
  • 34:45 - 34:46
    step function too,
  • 34:46 - 34:48
    over different
  • 34:48 - 34:51
    approximation that's linear, and then
  • 34:51 - 34:54
    that sort of [inaudible] linear that
  • 34:54 - 34:56
    our results goes this [inaudible] there, and then it goes up as a
  • 34:56 - 34:58
    linear function there. And that's that
  • 34:58 - 35:00
    is called the hinge loss. And so you
  • 35:00 - 35:04
    can think of logistic regression and the support vector machine as
  • 35:04 - 35:07
    different approximations to try to minimize this
  • 35:07 - 35:11
    step function;
  • 35:11 - 35:12
    okay?
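    [A small sketch, not from the lecture and assuming numpy, comparing the 0-1 loss with the two convex surrogates just described, written as functions of the margin z = y * theta^T x with labels y in {-1, +1}:]

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)     # the step function: 1 if the example is misclassified

def logistic_loss(z):
    return np.log1p(np.exp(-z))       # minus the log likelihood for logistic regression (up to scaling)

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)   # the SVM's hinge loss

z = np.linspace(-2, 2, 5)
for name, loss in [("0-1", zero_one_loss), ("logistic", logistic_loss), ("hinge", hinge_loss)]:
    print(name, np.round(loss(z), 2))
```

    Both surrogates upper-bound the step function and are convex, which is why the resulting optimization problems are tractable.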
  • 35:12 - 35:13
    And that's why I guess,
  • 35:13 - 35:16
    all the theory we developed
  • 35:16 - 35:19
    even though SVM's and logistic regression aren't exactly doing
  • 35:19 - 35:22
    empirical risk minimization,
  • 35:22 - 35:25
    the theory we develop often gives the
  • 35:25 - 35:32
    completely appropriate intuitions for SVM's, and logistic regression; okay.
  • 35:33 - 35:36
    So that was the last of the loose ends.
  • 35:36 - 35:39
    And if you didn't get this, don't worry too much about it. It's a high-level message. It's
  • 35:39 - 35:40
    just that
  • 35:40 - 35:44
    SVM's and logistic regression are reasonable to think of as approximations to
  • 35:44 - 35:48
    empirical risk minimization algorithms.
  • 35:48 - 35:52
    What I want to do next is move on to talk about model selection. Before I do that, let me just
  • 35:52 - 35:59
    check for questions about
  • 36:49 - 36:53
    this. Okay. Cool. Okay. So in the theory that we started to develop in the previous lecture, and that
  • 36:53 - 36:53
    we
  • 36:53 - 36:55
    sort of wrapped up with a
  • 36:55 - 36:57
    discussion on VC dimension,
  • 36:57 - 37:01
    we saw that there's often a trade-off between bias and variance. And in
  • 37:01 - 37:02
    particular, so
  • 37:02 - 37:07
    it is important not to choose a hypothesis that's either too simple or too
  • 37:07 - 37:08
    complex. So
  • 37:08 - 37:11
    if your data has sort of a quadratic structure to it,
  • 37:11 - 37:14
    then if you choose a linear
  • 37:14 - 37:18
    function to try to approximate it, then you would under fit. So you have a
  • 37:18 - 37:20
    hypothesis with high bias.
  • 37:20 - 37:24
    And conversely, we choose a hypothesis that's too complex, and you have high variance.
  • 37:24 - 37:26
    And you'll also
  • 37:26 - 37:32
    fail to fit. Then you would over fit the data, and you'd also fail to generalize well.
  • 37:32 - 37:36
    So model selection algorithms
  • 37:36 - 37:40
    provide a class of methods to automatically make these tradeoffs
  • 37:40 - 37:42
    between bias
  • 37:42 - 37:43
    and variance; right? So remember the
  • 37:43 - 37:45
    cartoon I drew last time
  • 37:45 - 37:52
    of generalization error?
  • 37:54 - 37:57
    I drew this last time. Where on the x-axis
  • 37:57 - 38:03
    was model complexity, meaning the number of
  • 38:03 - 38:08
    the degree of the polynomial; the [inaudible] regression function
  • 38:08 - 38:09
    or whatever.
  • 38:09 - 38:12
    And if you have too simple a model, you have high generalization error, those under
  • 38:12 - 38:14
    fitting.
  • 38:14 - 38:16
    And if you have too complex a model,
  • 38:16 - 38:20
    like a 15th or 14th-degree polynomial fit to five data points, then you also have high
  • 38:20 - 38:24
    generalization error, and you're over fitting.
  • 38:24 - 38:29
    So what I wanna do now is actually just talk about model selection in the abstract;
  • 38:29 - 38:30
    all right?
  • 38:30 - 38:33
    Some examples of model selection problems will include
  • 38:33 - 38:34
    well, I'll run the example
  • 38:34 - 38:36
    of
  • 38:36 - 38:37
    let's say you're
  • 38:37 - 38:44
    trying to choose the degree of a polynomial;
  • 38:45 - 38:49
    right? What degree polynomial do you want to choose?
  • 38:49 - 38:52
    Another example of a model selection problem would be if you're trying to choose
  • 38:52 - 38:54
    the parameter [inaudible],
  • 38:54 - 39:00
    which was the bandwidth parameter
  • 39:00 - 39:04
    in locally weighted linear regression
  • 39:04 - 39:08
    or in some sort of locally weighted
  • 39:08 - 39:08
    regression.
  • 39:08 - 39:10
    Yet, another model selection problem
  • 39:10 - 39:14
    is if you're trying to choose the parameter c [inaudible] and as the
  • 39:14 - 39:16
    [inaudible];
  • 39:16 - 39:22
    right? And so in the l1 norm soft margin SVM,
  • 39:22 - 39:29
    we had this optimization objective; right?
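    [For reference, the l1-norm soft margin SVM objective from the earlier lectures, which is what's being referred to here, is:]

\[
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{m}\xi_i
\qquad\text{s.t.}\quad y^{(i)}\big(w^{\top}x^{(i)} + b\big) \ge 1 - \xi_i,\;\; \xi_i \ge 0.
\]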
  • 39:29 - 39:31
    And the parameter c
  • 39:31 - 39:33
    controls the tradeoff between
  • 39:33 - 39:34
    how much you want
  • 39:34 - 39:38
    a large margin versus how much you want to penalize
  • 39:38 - 39:40
  • 39:40 - 39:44
    misclassified examples. So these are three specific examples of model selection
  • 39:44 - 39:45
    problems.
  • 39:45 - 39:48
    And
  • 39:48 - 39:55
    let's come up with a method for automatically choosing among them.
  • 40:14 - 40:18
    Let's say you have some finite set of models, and let's write these as m1, m2, m3,
  • 40:18 - 40:20
    and
  • 40:20 - 40:25
    so on. For example, this may be the linear
  • 40:25 - 40:30
    classifier or this may be the quadratic classifier,
  • 40:30 - 40:30
    and so on;
  • 40:30 - 40:32
    okay? Or this
  • 40:32 - 40:34
    may also be
  • 40:34 - 40:37
    you may also take the bandwidth parameter [inaudible]
  • 40:37 - 40:41
    and discretize it into a range of values, and you're trying to choose from this
  • 40:41 - 40:43
    discrete set of values.
  • 40:43 - 40:48
    So let's talk about how you would select an appropriate model; all right? Well,
  • 40:48 - 40:52
    one thing you could do is you can pick all of these models, and train them on
  • 40:52 - 40:54
    your training set.
  • 40:54 - 40:57
    And then see which model has the lowest training
  • 40:57 - 41:04
    error. So that's a terrible idea, and why's that?
  • 41:05 - 41:11
    Right. Cool. Because of over fitting; right. And that's why some of you are laughing that
  • 41:11 - 41:14
    I asked that. So that'd be a terrible idea to choose a model
  • 41:14 - 41:17
    by looking at your training set because well, obviously, you end up choosing the most
  • 41:17 - 41:19
    complex model; right? And you
  • 41:19 - 41:26
    choose a 10th degree polynomial because that's what fits the training set. So we
  • 41:26 - 41:29
    come to model selection. There are
  • 41:29 - 41:32
    several standard procedures to do this. One is hold out cross
  • 41:32 - 41:39
    validation,
  • 41:39 - 41:41
    and in hold out cross validation,
  • 41:41 - 41:46
    we take a training set. And we randomly split
  • 41:46 - 41:51
    the training set into two subsets. We
  • 41:51 - 41:51
    call it
  • 41:51 - 41:54
    subset take all the data you have
  • 41:54 - 42:01
    and randomly split it into two subsets. And we'll call it the training set, and the
  • 42:01 - 42:05
    hold out cross validation subset.
  • 42:05 - 42:07
    And then,
  • 42:07 - 42:12
    you know, you train each model
  • 42:12 - 42:19
    on just the training subset of it, and test it
  • 42:19 - 42:22
    on your hold out cross validation set.
  • 42:22 - 42:25
    And you pick the model
  • 42:25 - 42:32
    with the lowest error
  • 42:33 - 42:35
    on the hold out cross validation
  • 42:35 - 42:37
    subset; okay? So this is sort of a
  • 42:37 - 42:40
    relatively straightforward procedure, and it's commonly used
  • 42:40 - 42:42
    where you train on 70 percent of the data.
  • 42:42 - 42:45
    Then test all of your models on the 30 percent, and you can take
  • 42:45 - 42:46
    whatever has the smallest
  • 42:46 - 42:51
    hold out cross validation error.
  • 42:51 - 42:55
    And after this you actually have a choice. You can actually
  • 42:55 - 42:59
    having taken all of these hypotheses trained on 70 percent of the data,
  • 42:59 - 43:03
    you can actually just output the hypothesis that has the lowest error on your hold out
  • 43:03 - 43:04
    cross validation set.
  • 43:04 - 43:09
    And optionally, you can actually take the model that you selected
  • 43:09 - 43:13
    and go back, and retrain it on all 100 percent of the data;
  • 43:13 - 43:16
    okay? So both versions are actually done and used really often. You can
  • 43:16 - 43:16
    either,
  • 43:16 - 43:20
    you know, just take the best hypothesis that was trained on 70 percent of the data,
  • 43:20 - 43:22
    and just output that as your final hypothesis
  • 43:22 - 43:24
    or you can use this to
  • 43:24 - 43:27
    say, having chosen the degree of the polynomial you want to fit, you can then go
  • 43:27 - 43:31
    back and retrain the model on the entire 100 percent of your data.
  • 43:31 - 43:38
    And both of these are commonly done.
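    [A minimal sketch of hold out cross validation for choosing a polynomial degree; not from the lecture, and it assumes numpy with x and y as equal-length 1-D numpy arrays:]

```python
import numpy as np

def holdout_select_degree(x, y, degrees, train_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    split = int(train_frac * len(x))
    train_idx, cv_idx = idx[:split], idx[split:]
    best_degree, best_err = None, np.inf
    for d in degrees:
        coeffs = np.polyfit(x[train_idx], y[train_idx], d)               # train on ~70% of the data
        err = np.mean((np.polyval(coeffs, x[cv_idx]) - y[cv_idx]) ** 2)  # error on the ~30% hold out set
        if err < best_err:
            best_degree, best_err = d, err
    final_coeffs = np.polyfit(x, y, best_degree)  # optionally retrain the chosen model on all the data
    return best_degree, final_coeffs
```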
  • 43:40 - 43:47
  • 43:50 - 43:54
    So hold out cross validation does sort of work straightforwardly.
  • 43:54 - 43:56
    And
  • 43:56 - 43:58
    sometimes we're working
  • 43:58 - 43:59
  • 43:59 - 44:04
    with a company or application or something. In
  • 44:04 - 44:06
    many machine-learning applications we have
  • 44:06 - 44:08
    very little data or where, you
  • 44:08 - 44:10
    know, every training example you have was
  • 44:10 - 44:15
    painfully acquired at great cost; right? Sometimes your data is
  • 44:15 - 44:16
    acquired by
  • 44:16 - 44:19
    medical experiments, and each of these each
  • 44:19 - 44:24
    training example represents a significant amount of physical human pain or something. So
  • 44:24 - 44:27
    we talk and say, well, I'm going to hold out 30 percent of your data set, just to select
  • 44:27 - 44:28
    my model.
  • 44:28 - 44:33
    sometimes that causes unhappiness, and so maybe you wanna
  • 44:33 - 44:37
    not have to leave out 30 percent of your data just to do model
  • 44:37 - 44:39
    selection.
  • 44:39 - 44:43
    So there are a couple of other variations on hold out cross validation that make
  • 44:43 - 44:48
    sometimes, slightly more efficient use of the data.
  • 44:48 - 44:52
    And one is called k-fold cross validation.
  • 44:52 - 44:54
    And here's the idea: I'm gonna
  • 44:54 - 44:56
    take all of
  • 44:56 - 44:57
    my data s;
  • 44:57 - 45:01
    so imagine, I'm gonna draw this
  • 45:01 - 45:04
    box s, to denote the entirety of all the data I have.
  • 45:04 - 45:07
    And I'll then divide it
  • 45:07 - 45:13
    into k pieces, and this is five pieces in what I've drawn.
  • 45:13 - 45:17
    Then what'll I'll do is I will repeatedly
  • 45:17 - 45:23
    train on k minus one pieces.
  • 45:23 - 45:28
    Test on the remaining one
  • 45:28 - 45:31
    test on the remaining piece, I guess;
  • 45:31 - 45:36
    right? And then you average
  • 45:36 - 45:41
    over the k result.
  • 45:41 - 45:44
    So another way, we'll just hold out I will
  • 45:44 - 45:48
    hold
  • 45:48 - 45:51
    out say, just 1/5 of my data
  • 45:51 - 45:55
    and I'll train on the remaining 4/5, and I'll test on the first one.
  • 45:55 - 45:58
    And then I'll then go and hold out the second 1/5 from my [inaudible] for the
  • 45:58 - 46:03
    remaining pieces test on this, you remove the third piece,
  • 46:03 - 46:06
    train on the 4/5; I'm gonna do this five times.
  • 46:06 - 46:09
    And then I'll take the five error measures I
  • 46:09 - 46:11
    have and I'll
  • 46:11 - 46:16
    average them. And this then gives me an estimate of the generalization error of my model; okay?
  • 46:16 - 46:18
    And then, again, when you do
  • 46:18 - 46:20
    k-fold cross validation,
  • 46:20 - 46:22
    usually you then go back and
  • 46:22 - 46:24
    retrain the model you selected
  • 46:24 - 46:28
    on the entirety of your training set.
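    [A minimal sketch of the k-fold procedure just described; not from the lecture, and it assumes numpy with x and y as equal-length 1-D numpy arrays and, for concreteness, a polynomial model:]

```python
import numpy as np

def kfold_error(x, y, degree, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)  # train on the other k - 1 pieces
        errors.append(np.mean((np.polyval(coeffs, x[test_idx]) - y[test_idx]) ** 2))  # test on the held out piece
    return np.mean(errors)  # average the k error estimates

# Leave one out cross validation is the special case k = len(x).
```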
  • 46:28 - 46:34
    So I drew five pieces here because that was easier for me to draw, but k equals 10 is
  • 46:34 - 46:39
    very
  • 46:39 - 46:42
    common; okay? I should say
  • 46:42 - 46:48
    k equals 10 is the fairly common choice to do 10 fold cross validation.
  • 46:48 - 46:52
    And the advantage over hold out cross validation is that you split the data into ten pieces.
  • 46:52 - 46:54
    Then each time you're only holding out
  • 46:54 - 46:56
    1/10 of your data, rather than,
  • 46:56 - 46:59
    you know, say, 30 percent of your data. I
  • 46:59 - 47:03
    must say, in standard hold out in simple hold out cross validation,
  • 47:03 - 47:06
    a 30 70 split is fairly common.
  • 47:06 - 47:08
    Sometimes like 2/3 1/3 or
  • 47:08 - 47:10
    a 70 30 split is fairly common.
  • 47:10 - 47:14
    And if you use k-fold cross validation, k equals 5 or, more commonly, k equals 10 is the most
  • 47:14 - 47:16
    common choice.
  • 47:16 - 47:20
    The disadvantage of k-fold cross validation is that it can be much more
  • 47:20 - 47:21
    computationally expensive.
  • 47:21 - 47:23
    In particular,
  • 47:23 - 47:26
    to validate your model, you now need to train your model ten times,
  • 47:26 - 47:29
    instead of just once. And so you need
  • 47:29 - 47:33
    to, say, fit logistic regression ten times per model, rather than just once. And so this is
  • 47:33 - 47:35
    computationally more expensive.
  • 47:35 - 47:38
    But k equals ten works great.
  • 47:38 - 47:40
    And then,
  • 47:40 - 47:44
    finally, in
  • 47:44 - 47:48
    there's actually a version of this that you can take even further,
  • 47:48 - 47:50
    which is when your set k
  • 47:50 - 47:52
    equals m.
  • 47:52 - 47:57
    And so that's when you take your training set, and you split it into as many pieces as you have training
  • 47:57 - 47:59
    examples.
  • 47:59 - 48:06
    And this procedure is called leave one out cross validation.
  • 48:06 - 48:07
    And
  • 48:07 - 48:09
    what you do is you then
  • 48:09 - 48:12
    take out the first training example, train on the rest, and test on the first
  • 48:12 - 48:13
    example.
  • 48:13 - 48:16
    Then you take out the second training example,
  • 48:16 - 48:19
    train on the rest, and test on the second example. Then
  • 48:19 - 48:22
    you take out the third example, train on everything, but your third example. Test on
  • 48:22 - 48:24
    the third example, and so on.
  • 48:24 - 48:25
    And so
  • 48:25 - 48:28
    with this many pieces you are now making,
  • 48:28 - 48:31
    maybe even more effective use of your data than k-fold cross
  • 48:31 - 48:32
    validation. But
  • 48:32 - 48:36
    leave one out cross validation is computationally very expensive
  • 48:36 - 48:37
    because now
  • 48:37 - 48:42
    you need to repeatedly leave one example out, and then run your learning
  • 48:42 - 48:46
    algorithm on m minus one training examples. You need to do this a lot of times, and so
  • 48:46 - 48:49
    this is computationally very expensive.
  • 48:49 - 48:55
    And typically, this is done only when you're extremely data scarce. So if you
  • 48:55 - 48:59
    have a learning problem where you have, say, 15 training examples or something,
  • 48:59 - 49:01
    then if you have very few training examples, leave one
  • 49:01 - 49:04
    out cross validation
  • 49:04 - 49:04
    is
  • 49:04 - 49:11
    maybe preferred.
  • 49:16 - 49:18
    Yeah?
  • 49:18 - 49:23
    Student:You know, that time you proved that the difference between the generalized [inaudible] by
  • 49:23 - 49:27
    number of examples in your training
  • 49:27 - 49:28
    set and VC
  • 49:28 - 49:31
    dimension. So
  • 49:31 - 49:33
    maybe [inaudible] examples into
  • 49:33 - 49:35
    different groups, we can use that for [inaudible]. Instructor (Andrew Ng):Yeah, I
  • 49:35 - 49:37
    mean Student:- compute the training error, and use that for computing [inaudible] for a generalized error. Instructor (Andrew Ng):Yeah, that's done, but
  • 49:37 - 49:40
    yeah, in practice, I personally tend not to do that. It
  • 49:40 - 49:43
    tends not to be the VC
  • 49:43 - 49:47
    dimension bounds are somewhat loose bounds. And so
  • 49:47 - 49:48
    there are people
  • 49:48 - 49:51
    in structural risk minimization who propose that you do that, but I personally tend not to do
  • 49:51 - 49:55
    that,
  • 50:09 - 50:16
    though. Questions for cross validation?
  • 50:34 - 50:36
    Yeah. This
  • 50:36 - 50:39
    is kind of far from there because when we spend all this time [inaudible] but how many data points do you sort of need to go into your certain marginal [inaudible]?
  • 50:39 - 50:40
    Right.
  • 50:40 - 50:42
    So it seems like when I'd be
  • 50:42 - 50:44
    able to use that
  • 50:44 - 50:47
  • 50:47 - 50:48
    instead
  • 50:48 - 50:50
    of do this; more analytically, I guess. I mean Yeah. [Inaudible]. Instructor
  • 50:50 - 50:53
    (Andrew Ng):No, okay. So it turns out that when you're proving learning theory bounds, very often
  • 50:53 - 50:58
    the bounds will be extremely loose because you're sort of proving the worst case upper bound that
  • 50:58 - 50:59
    holds true
  • 50:59 - 51:03
    even for very bad what is
  • 51:03 - 51:06
    it
  • 51:06 - 51:10
    so the bounds that I proved just now; right? That holds true for absolutely any
  • 51:10 - 51:12
    probability distribution over training
  • 51:12 - 51:13
    examples; right?
  • 51:13 - 51:17
    So we just assume the training examples were drawn iid
  • 51:17 - 51:19
    from some distribution script d,
  • 51:19 - 51:23
    and the bounds I proved hold true for absolutely any probability
  • 51:23 - 51:27
    distribution over script
  • 51:27 - 51:29
    d. And chances are
  • 51:29 - 51:33
    whatever real life distribution you get over, you know, houses and their
  • 51:33 - 51:36
    prices or whatever, is probably not as bad as
  • 51:36 - 51:39
    the very worst one you
  • 51:39 - 51:41
    could've gotten; okay?
  • 51:41 - 51:42
    And so it turns out that if you
  • 51:42 - 51:46
    actually plug in the constants of learning theory bounds, you often get
  • 51:46 - 51:51
    extremely large numbers. Take logistic regression logistic
  • 51:51 - 51:53
    regression you have ten parameters
  • 51:53 - 51:57
    and 0.01 error, and with 95 percent probability. How many training
  • 51:57 - 51:58
    examples do I
  • 51:58 - 52:02
    need? If you actually plug in actual constants into the text for learning theory bounds,
  • 52:02 - 52:06
    you often get extremely pessimistic estimates with the number of examples you need. You end
  • 52:06 - 52:08
    up with
  • 52:08 - 52:12
    some ridiculously large numbers. You would need 10,000 training examples to fit
  • 52:12 - 52:14
    ten parameters.
  • 52:14 - 52:17
    So
  • 52:17 - 52:20
    a good way to think of these learning theory bounds is
  • 52:20 - 52:25
    and this is why, also, when I write papers on learning theory bounds, I
  • 52:25 - 52:26
    quite often
  • 52:26 - 52:30
    use big-O notation to just absolutely just ignore the constant factors because
  • 52:30 - 52:33
    the bounds seem to be very loose.
  • 52:33 - 52:39
    There are some attempts to use these bounds to give guidelines as to
  • 52:39 - 52:40
    what model
  • 52:40 - 52:42
    to choose, and so on.
  • 52:42 - 52:47
    But I personally tend to use the bounds again,
  • 52:47 - 52:47
    intuition
  • 52:47 - 52:49
    about
  • 52:49 - 52:53
    for example, what are the number of training examples you need
  • 52:53 - 52:56
    grows linearly in the number of parameters, or whether it grows exponentially in the number of parameters; whether it goes
  • 52:56 - 53:00
    quadratic parameters?
  • 53:00 - 53:02
    So it's quite often the shape of the bounds. The fact that
  • 53:02 - 53:07
    the number of training examples the fact that sample complexity is linear in the VC dimension,
  • 53:07 - 53:09
    that's sort of a useful intuition you can get from
  • 53:09 - 53:11
    these theories. But the
  • 53:11 - 53:14
    actual magnitude of the bound will tend to be much looser than what
  • 53:14 - 53:18
    will hold true for a particular problem you are working
  • 53:18 - 53:25
    on. So did that
  • 53:26 - 53:28
    answer your question? Student:Uh-huh. Instructor (Andrew Ng):Yeah. And it turns out, by the
  • 53:28 - 53:32
    way, for myself, a rule of thumb that I often use is if
  • 53:32 - 53:35
    you're trying to fit a logistic regression model,
  • 53:35 - 53:36
    if you have
  • 53:36 - 53:39
    n parameters or n plus one parameters;
  • 53:39 - 53:42
    if the number of training examples is ten times your number of
  • 53:42 - 53:45
    parameters, then you're probably in good shape.
  • 53:45 - 53:49
    And if your number of training examples is many times the number of parameters, then
  • 53:49 - 53:52
    you're probably perfectly fine fitting that model.
  • 53:52 - 53:55
    So those are the sorts of intuitions
  • 53:55 - 54:02
    that you can get from these bounds. Student:In cross validation do
  • 54:03 - 54:07
    we assume these examples are chosen randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the training
  • 54:07 - 54:14
    and test sets randomly.
  • 54:23 - 54:27
    One more thing I want to talk about for model selection is that there's actually a special case of model selection,
  • 54:27 - 54:34
    called the feature selection problem.
  • 54:38 - 54:41
    And so here's the intuition:
  • 54:41 - 54:45
    for many machine-learning problems you may have a very high dimensional
  • 54:45 - 54:48
    feature space with very high dimensional
  • 54:48 - 54:52
    you have these [inaudible] feature vectors x.
  • 54:52 - 54:56
    So for example, for text classification, and I wanna talk about this text classification example of
  • 54:56 - 54:58
    spam versus
  • 54:58 - 55:00
    non-spam. You may easily have on
  • 55:00 - 55:04
    the order of 30,000 or 50,000 features. I think I used 50,000 in
  • 55:04 - 55:07
    my early examples. So if you have
  • 55:07 - 55:11
    so many features you have 50,000 features,
  • 55:11 - 55:14
    depending on what learning algorithm you use, there may be a real
  • 55:14 - 55:16
    risk of over fitting.
  • 55:16 - 55:17
    And so
  • 55:17 - 55:20
    if you can reduce the number of features, maybe
  • 55:20 - 55:23
    you can reduce the variance of your learning algorithm, and reduce the risk of
  • 55:23 - 55:25
    over fitting. And for
  • 55:25 - 55:28
    the specific case of text classification, if
  • 55:28 - 55:32
    you imagine that maybe there's a small number of relevant features, so
  • 55:32 - 55:34
    there are all these English words.
  • 55:34 - 55:37
    And many of these English words probably don't tell you anything at all about
  • 55:37 - 55:40
    whether the email is spam or non-spam. You
  • 55:40 - 55:44
    know, English function words like the, of, a, and;
  • 55:44 - 55:48
    these are probably words that don't tell you anything about whether the email is spam or non-spam. So
  • 55:48 - 55:51
    in contrast, there will be a much smaller number of
  • 55:51 - 55:54
    features that are truly relevant to the learning problem.
  • 55:54 - 55:57
    So for example, if you see the word buy or Viagra, those are words that
  • 55:57 - 56:02
    are very useful signals of spam versus non-spam. Or you see the word
  • 56:02 - 56:03
    Stanford or
  • 56:03 - 56:06
    machine learning or your own personal name. These are other words that are useful for
  • 56:06 - 56:11
    telling you whether something is spam or non-spam. So
  • 56:11 - 56:16
    in feature selection, we would like to select a subset of the features
  • 56:16 - 56:20
    that are maybe, or hopefully, the most relevant ones for a specific learning
  • 56:20 - 56:20
    problem, so
  • 56:20 - 56:25
    as to give ourselves a simpler hypothesis class to choose
  • 56:25 - 56:28
    from. And then therefore, reduce the risk of over fitting.
  • 56:28 - 56:30
    Even when we
  • 56:30 - 56:37
    may have had 50,000 features originally. So
  • 56:37 - 56:41
    how do you do this? Well, if you
  • 56:41 - 56:44
    have n
  • 56:44 - 56:47
    features, then there are
  • 56:47 - 56:50
    two to the n possible subsets;
  • 56:50 - 56:54
  • 56:54 - 56:56
    right? Because, you know,
  • 56:56 - 57:00
    each of your n features can either be included or excluded. So there are two to the n
  • 57:00 - 57:01
    possibilities.
  • 57:01 - 57:04
    And this is a huge space.
  • 57:04 - 57:09
    So in feature selection, what we most commonly do is use various search heuristics
  • 57:09 - 57:10
    sort of
  • 57:10 - 57:11
    simple search algorithms
  • 57:11 - 57:15
    to try to search through this space of two to the n possible subsets of features;
  • 57:15 - 57:17
    to try to find a good
  • 57:17 - 57:18
    subset of features. This is
  • 57:18 - 57:23
    too large a number to enumerate all possible feature subsets.
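To get a feel for why enumerating subsets is hopeless, here is a tiny illustrative calculation (the 50,000-feature figure matches the spam example above; working in log space is just to avoid printing a 15,000-digit number):

```python
import math

# The number of possible feature subsets is 2^n; print its order of magnitude for a few n.
for n in [10, 100, 1_000, 50_000]:
    digits = n * math.log10(2)                     # log10(2^n)
    print(f"n = {n:>6} features -> about 10^{digits:,.0f} possible subsets")
```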
  • 57:23 - 57:26
    And as a concrete example,
  • 57:26 - 57:27
  • 57:27 - 57:31
    this is the forward search algorithm; it's also called the forward selection algorithm.
  • 57:31 - 57:33
  • 57:33 - 57:35
    It's actually pretty simple, but I'll just
  • 57:35 - 57:36
    write it out.
  • 57:36 - 57:42
    My writing it out will make it look more complicated than it really
  • 57:42 - 57:45
    is, but
  • 57:45 - 57:46
    it starts with
  • 57:46 - 57:48
    initialize the set script f
  • 57:48 - 57:52
    to be the empty set,
  • 57:52 - 57:53
    and then
  • 57:53 - 57:55
    repeat
  • 57:55 - 57:57
    for
  • 57:57 - 57:59
    i equals one
  • 57:59 - 58:00
    to n;
  • 58:05 - 58:08
    try adding
  • 58:08 - 58:11
    feature i
  • 58:11 - 58:14
    to the set script
  • 58:14 - 58:16
    f, and evaluate the
  • 58:16 - 58:20
    model
  • 58:20 - 58:26
    using cross validation.
  • 58:26 - 58:30
    And by cross validation, I mean any of the three flavors, be it simple hold out cross
  • 58:30 - 58:35
    validation or k-fold cross validation or leave one out cross
  • 58:35 - 58:36
    validation.
  • 58:36 - 58:40
    And then, you know, set f to be
  • 58:40 - 58:41
    equal to
  • 58:41 - 58:43
    f union
  • 58:43 - 58:49
    the best single feature
  • 58:49 - 58:56
    found in that step; okay?
  • 58:58 - 59:00
    And finally, you would
  • 59:00 - 59:07
    output the best hypothesis found; okay? So
  • 59:07 - 59:14
  • 59:17 - 59:20
    the forward selection procedure is: start with the empty set of features.
  • 59:20 - 59:22
    And then on each
  • 59:22 - 59:26
    iteration, you take each of
  • 59:26 - 59:29
    your features that isn't already in your set script f and you try adding that
  • 59:29 - 59:30
    feature to your set. Then
  • 59:30 - 59:34
    you train them all, and evaluate them all, using
  • 59:34 - 59:35
    cross validation.
  • 59:35 - 59:39
    And basically, figure out what is the best single feature to add to your
  • 59:39 - 59:41
    set
  • 59:41 - 59:44
    script f. In step two here, you go ahead and add
  • 59:44 - 59:47
    that feature to your set script f; right?
  • 59:47 - 59:51
    And when I say best feature or best model here by best, I really mean the
  • 59:51 - 59:54
    best model according to hold out cross validation.
  • 59:54 - 59:56
    By best, I really mean
  • 59:56 - 60:01
    the single feature addition that results in the lowest hold out
  • 60:01 - 60:02
    cross validation error or the
  • 60:02 - 60:03
    lowest cross validation error.
  • 60:03 - 60:04
    So you do this
  • 60:04 - 60:08
    adding one feature at a time.
  • 60:08 - 60:12
    As for when you terminate this: let's see, if you've
  • 60:12 - 60:15
    added all the features to f, so f is now
  • 60:15 - 60:18
    the entire set of features; you can terminate this.
  • 60:18 - 60:19
    Or if
  • 60:19 - 60:22
    by some rule of thumb, you know that you
  • 60:22 - 60:24
    probably don't ever want more than
  • 60:24 - 60:27
    k features, you can also terminate
  • 60:27 - 60:30
    this if f has already exceeded some threshold number of features. So maybe
  • 60:30 - 60:31
    if
  • 60:31 - 60:34
    you have 100 training examples, and you're fitting logistic regression,
  • 60:34 - 60:39
    you probably know you won't want more than 100 features. And so
  • 60:39 - 60:41
    you stop after you have
  • 60:41 - 60:46
    100 features added to set f; okay?
  • 60:46 - 60:50
    And then finally, having done this, output the best hypothesis found; again, by
  • 60:50 - 60:52
    best, I mean,
  • 60:52 - 60:55
    when running this algorithm, you'd be seeing lots of hypotheses. You'd be training lots of
  • 60:55 - 60:58
    hypotheses, and testing them using cross validation.
  • 60:58 - 61:02
    So when I say output best hypothesis found, I mean
  • 61:02 - 61:05
    of all of the hypotheses you've seen during this entire procedure,
  • 61:05 - 61:07
    pick the one with the lowest
  • 61:07 - 61:10
    cross validation error that you saw; okay?
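Here is a minimal sketch of that forward-search loop in Python. The use of scikit-learn's LogisticRegression and cross_val_score, and the optional cap on the number of features, are my own illustrative choices; any learning algorithm and any of the cross-validation flavors mentioned above could be swapped in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None, cv=5):
    """Greedy forward search over feature subsets, scored by k-fold cross validation."""
    n_features = X.shape[1]
    if max_features is None:
        max_features = n_features
    selected = []                        # the set "script F" from the lecture
    best_overall = (-np.inf, [])         # (CV score, feature subset) of best model seen

    while len(selected) < max_features:
        candidates = []
        for i in range(n_features):
            if i in selected:
                continue
            trial = selected + [i]       # try adding feature i to F
            model = LogisticRegression(max_iter=1000)
            score = cross_val_score(model, X[:, trial], y, cv=cv).mean()
            candidates.append((score, i))
        if not candidates:
            break
        best_score, best_feature = max(candidates)   # best single feature to add
        selected.append(best_feature)                # F := F union {that feature}
        if best_score > best_overall[0]:
            best_overall = (best_score, list(selected))

    # output the best feature subset (hypothesis) seen during the entire search
    return best_overall[1], best_overall[0]
```

For instance, forward_selection(X, y, max_features=100) would implement the stop-after-100-features rule of thumb mentioned above.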
  • 61:10 - 61:17
    So that's forward selection. So
  • 61:35 - 61:40
    let's see, just to give this a name, this is an instance of what's called
  • 61:40 - 61:43
  • 61:43 - 61:46
    wrapper feature
  • 61:46 - 61:48
    selection.
  • 61:48 - 61:51
    And the term wrapper comes
  • 61:51 - 61:52
    from the fact that
  • 61:52 - 61:55
    this feature selection algorithm that I just described, forward selection or forward
  • 61:55 - 61:57
    search,
  • 61:57 - 61:59
    is a piece of software that
  • 61:59 - 62:02
    you write that wraps around your learning algorithm. In the sense that
  • 62:02 - 62:04
    to perform forward selection,
  • 62:04 - 62:08
    you need to repeatedly make calls to your learning algorithm
  • 62:08 - 62:09
    to train
  • 62:09 - 62:12
    your model,
  • 62:12 - 62:15
    using different subsets of features; okay? So this is called wrapper model
  • 62:15 - 62:17
    feature selection.
  • 62:17 - 62:21
    And it tends to be somewhat computationally expensive because as you're
  • 62:21 - 62:23
    performing the search process,
  • 62:23 - 62:26
    you're repeatedly training your learning algorithm over and over and over on all of
  • 62:26 - 62:30
    these different subsets of features. Let's
  • 62:30 - 62:37
    just mention also, there is a variation of this called backward search or
  • 62:39 - 62:41
    backward selection,
  • 62:41 - 62:44
    which is where you start with
  • 62:44 - 62:47
    f equals
  • 62:47 - 62:54
    the entire set of features, and you delete features one at a time;
  • 62:59 - 63:01
    okay?
  • 63:01 - 63:04
  • 63:04 - 63:07
    So that's backward search or backward selection.
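And a correspondingly minimal sketch of backward search, under the same illustrative assumptions (scikit-learn, accuracy scored by k-fold cross validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_selection(X, y, min_features=1, cv=5):
    """Start with all features and greedily delete one feature at a time."""
    selected = list(range(X.shape[1]))
    start_score = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, selected], y, cv=cv).mean()
    best_overall = (start_score, list(selected))

    while len(selected) > min_features:
        candidates = []
        for i in selected:
            trial = [j for j in selected if j != i]   # try deleting feature i
            model = LogisticRegression(max_iter=1000)
            candidates.append((cross_val_score(model, X[:, trial], y, cv=cv).mean(), i))
        best_score, feature_to_drop = max(candidates)
        selected.remove(feature_to_drop)              # F := F minus {that feature}
        if best_score > best_overall[0]:
            best_overall = (best_score, list(selected))

    return best_overall[1], best_overall[0]
```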
  • 63:07 - 63:08
    And
  • 63:08 - 63:13
    this is another feature selection algorithm that you might use. Part of
  • 63:13 - 63:15
    whether this makes sense is
  • 63:15 - 63:17
    really there will be problems where
  • 63:17 - 63:21
    it really doesn't even make sense to initialize f to be the set of all features.
  • 63:21 - 63:24
    So if you have 100 training examples
  • 63:24 - 63:26
    and 10,000 features,
  • 63:26 - 63:29
    which may well happen
  • 63:29 - 63:31
    100 emails and 10,000 features
  • 63:31 - 63:35
    per email, and only 100 training examples, then
  • 63:35 - 63:38
    depending on the learning algorithm you're using, it may or may not make sense to
  • 63:38 - 63:39
    initialize
  • 63:39 - 63:43
    the set f to be all features, and train a model by using all the features. And if it
  • 63:43 - 63:46
    doesn't make sense to train a model by using all the features, then
  • 63:46 - 63:52
    forward selection would be more common.
  • 63:52 - 63:54
    So
  • 63:54 - 63:57
    let's see. Wrapper model feature selection algorithms tend to work
  • 63:57 - 63:58
    well.
  • 63:58 - 64:02
    And in particular, they actually often work better than a different class of algorithms I'm gonna talk
  • 64:02 - 64:06
    about now. But their main disadvantage is that they're computationally very
  • 64:06 - 64:09
    expensive.
  • 64:09 - 64:16
    Do you have any questions about this
  • 64:16 - 64:19
    before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes,
  • 64:19 - 64:23
    you're actually right. So forward search and backward search, both of these are search heuristics,
  • 64:23 - 64:24
    and
  • 64:24 - 64:28
    for either of these, you cannot guarantee they'll find the best
  • 64:28 - 64:29
    subset of features. It
  • 64:29 - 64:31
    actually turns out that
  • 64:31 - 64:35
    under many formalizations of the feature selection problem, it actually turns out to be an NP-
  • 64:35 - 64:40
    hard problem to find the best subset of features.
  • 64:40 - 64:44
    But in practice, forward selection and backward selection work fine,
  • 64:44 - 64:47
    and you can also envision other search algorithms where you sort of
  • 64:47 - 64:50
    have other methods to search through the space of two to the n possible feature
  • 64:50 - 64:57
    subsets. So
  • 65:01 - 65:04
    let's see.
  • 65:04 - 65:06
    Wrapper feature selection
  • 65:06 - 65:10
    tends to work well when you can afford to do it
  • 65:10 - 65:12
    computationally.
  • 65:12 - 65:14
    But for problems
  • 65:14 - 65:18
    such as text classification, it turns out, for text classification specifically,
  • 65:18 - 65:22
    because you have so many features, and can easily have 50,000 features,
  • 65:22 - 65:26
    forward selection would be very, very expensive.
  • 65:26 - 65:28
    So there's a different class of algorithms that
  • 65:28 - 65:31
  • 65:31 - 65:35
    tends not to do as well in the sense of generalization error. So you tend to learn the
  • 65:35 - 65:37
    hypothesis that works less well,
  • 65:37 - 65:40
    but is computationally much less expensive.
  • 65:40 - 65:44
    And these are called
  • 65:44 - 65:47
    the
  • 65:47 - 65:50
    filter feature selection methods.
  • 65:50 - 65:53
    And the basic idea is
  • 65:53 - 65:54
  • 65:54 - 66:01
    that for each feature i, we will
  • 66:01 - 66:04
    compute
  • 66:04 - 66:11
    some measure
  • 66:11 - 66:17
    of how informative
  • 66:17 - 66:20
    xi
  • 66:20 - 66:25
    is about y; okay?
  • 66:25 - 66:29
    And to do this, we'll use some simple
  • 66:29 - 66:33
    heuristics; for every feature we'll just try to compute some rough estimate or
  • 66:33 - 66:40
    compute
  • 66:41 - 66:48
    some measure of how informative
  • 66:50 - 66:54
    xi is about
  • 66:54 - 66:59
    y. So there are many ways you can do this. One way you can choose is to just compute the
  • 66:59 - 67:01
    correlation between xi and y. And just for each of
  • 67:01 - 67:05
    your features just see how correlated this is with
  • 67:05 - 67:07
    your class label y.
  • 67:07 - 67:13
    And then you just pick the top k most correlated features.
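A minimal sketch of that correlation-based filter; scoring binary labels by absolute Pearson correlation is one reasonable choice of heuristic, not the only one:

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Filter feature selection: rank features by |correlation with the label|, keep top k."""
    Xc = X - X.mean(axis=0)                 # center each feature column
    yc = y - y.mean()                       # center the labels
    # Pearson correlation of each column of X with y (epsilon guards constant columns)
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]    # indices of the k most correlated features
```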
  • 67:13 - 67:20
    Another way
  • 67:46 - 67:52
    to do this
  • 67:52 - 67:55
    for the case of text classification, there's one other method, which especially
  • 67:55 - 67:57
    for these kinds of features, I guess
  • 67:57 - 67:59
    there's one other
  • 67:59 - 68:01
    informative measure
  • 68:01 - 68:04
    that's used very commonly, which is called mutual information. I'm going to
  • 68:04 - 68:11
    tell you
  • 68:11 - 68:15
    some of these ideas in problem sets, but I'll
  • 68:15 - 68:18
    just say this very briefly.
  • 68:18 - 68:22
    So the mutual information between feature xi and y
  • 68:22 - 68:28
    I'll just write out the definition, I guess.
  • 68:28 - 68:32
    Let's say this is text classification, so x can take on two values, 0, 1;
  • 68:32 - 68:36
    the mutual information between xi and y is defined as a sum over all the possible values of
    xi, and a sum over all the possible values of y, of the joint distribution times a log ratio:
    MI(xi, y) = sum over xi, sum over y of p(xi, y) log [ p(xi, y) / ( p(xi) p(y) ) ]
  • 68:46 - 68:52
  • 68:52 - 68:53
    okay?
  • 68:53 - 68:54
    Where
  • 69:00 - 69:04
    all of these distributions, so the joint distribution over xi and y,
  • 69:04 - 69:11
    you would estimate from your training data
  • 69:13 - 69:14
    all of these things you would use, as well. You would estimate
  • 69:14 - 69:15
    from
  • 69:15 - 69:20
    the training data what is the probability that x is 0, what's the probability that x is one, what's the
  • 69:20 - 69:27
    probability that x is 0, y is 0, x is one; y is 0, and so on. So it
  • 69:28 - 69:31
    turns out
  • 69:31 - 69:35
    there's a standard information theoretic measure of how different
  • 69:35 - 69:37
    probability distributions are.
  • 69:37 - 69:40
    And I'm not gonna prove this here. But it turns out that
  • 69:40 - 69:46
    this mutual information is
  • 69:46 - 69:52
    actually
  • 69:52 - 69:54
    so the standard
  • 69:54 - 69:58
    measure of how different distributions are, which is called the K-L divergence.
  • 69:58 - 70:00
    When you take a class in information theory,
  • 70:00 - 70:03
    you may have seen the concepts of mutual information and the K-L divergence, but if you
  • 70:03 - 70:06
    haven't, don't worry about it. Just
  • 70:06 - 70:09
    the intuition is there's something called K-L divergence that's a formal measure
  • 70:09 - 70:13
    of how different two probability distributions
  • 70:13 - 70:17
    are. And mutual information is a measure for how different
  • 70:17 - 70:20
    the joint distribution is
  • 70:20 - 70:22
    of x and y;
  • 70:22 - 70:23
    from the distribution
  • 70:23 - 70:26
    you would get if you were to assume they were independent;
  • 70:26 - 70:27
    okay? So
  • 70:27 - 70:29
    if x and y were independent, then
  • 70:29 - 70:34
    p of x, y would be equal to p of x times p of y.
  • 70:34 - 70:37
    And so you know, this distribution and this distribution would be identical, and the
  • 70:37 - 70:40
    K-L divergence would be 0.
  • 70:40 - 70:44
    In contrast, if x and y were very non-independent in
  • 70:44 - 70:47
    other words, if x and y are very informative about each other,
  • 70:47 - 70:48
    then this K-L divergence
  • 70:48 - 70:50
    will be large.
  • 70:50 - 70:52
    And so mutual information is a formal measure of
  • 70:52 - 70:55
    how non-independent x and y are.
  • 70:55 - 70:58
    And if x and y are highly non-independent
  • 70:58 - 70:59
    then that means that
  • 70:59 - 71:02
    x will presumably tell you something about y,
  • 71:02 - 71:06
    and so they'll have large mutual information. And
  • 71:06 - 71:10
    this measure of information will tell you x might be a good feature. And you
  • 71:10 - 71:12
    get to play with some of these ideas
  • 71:12 - 71:15
    more in the problem sets. So I won't say much more about it.
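As a concrete sketch, here is one way to estimate that mutual information score for a single binary feature from training data, using empirical counts for all the probabilities; the small additive smoothing constant is an assumption to avoid taking the log of zero:

```python
import numpy as np

def mutual_information(xi, y, eps=1e-12):
    """MI(xi, y) = sum_{a,b} p(xi=a, y=b) * log( p(xi=a, y=b) / (p(xi=a) * p(y=b)) ),
    with every probability estimated by counting over the training set (xi, y in {0, 1})."""
    xi = np.asarray(xi)
    y = np.asarray(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (y == b)) + eps   # joint probability estimate
            p_a = np.mean(xi == a) + eps                 # marginal of the feature
            p_b = np.mean(y == b) + eps                  # marginal of the label
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi
```

Ranking all the features by this score and keeping the top k is exactly the filter procedure being described.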
  • 71:17 - 71:19
    And what you do then is having chosen
  • 71:19 - 71:24
    some measure like correlation or mutual information or something else, you
  • 71:24 - 71:29
    then pick the top
  • 71:29 - 71:34
    k features; meaning that you compute correlation between xi and y for all the
  • 71:34 - 71:37
    features, or the mutual information between xi and y for all the features.
  • 71:37 - 71:40
    And then you include in your learning algorithm the
  • 71:40 - 71:43
    k features with the largest correlation with the label, or
  • 71:43 - 71:46
    the largest mutual information with the label, whatever.
  • 71:46 - 71:52
    And to choose
  • 71:52 - 71:55
    k,
  • 71:55 - 71:58
    you can actually use cross validation, as well; okay?
  • 71:58 - 71:59
    So you would
  • 71:59 - 72:03
    take all your features, and sort them in decreasing order of mutual
  • 72:03 - 72:04
    information.
  • 72:04 - 72:08
    And then you'd try using just the top one feature, the top two features, the top
  • 72:08 - 72:10
    three features, and so on.
  • 72:10 - 72:14
    And you decide how many features to include
  • 72:14 - 72:17
    using cross validation; okay? Or you
  • 72:17 - 72:24
    can sometimes you can just choose this by hand, as well. Okay. Questions about
  • 72:24 - 72:31
    this? Okay.
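Putting the pieces together, here is a minimal sketch of choosing k by cross validation over the mutual-information ranking. It reuses the illustrative mutual_information helper sketched above, and simple hold-out splitting via scikit-learn's train_test_split is just one of the cross-validation flavors you could use here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def choose_k_by_cv(X, y, candidate_ks=(1, 2, 5, 10, 50, 100)):
    """Sort features by mutual information with y, then pick k by hold-out cross validation."""
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    ranking = np.argsort(-scores)                  # features in decreasing order of MI
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    best_k, best_acc = None, -np.inf
    for k in candidate_ks:
        top = ranking[:k]                          # keep only the top k features
        model = LogisticRegression(max_iter=1000).fit(X_train[:, top], y_train)
        acc = model.score(X_val[:, top], y_val)    # hold-out validation accuracy
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, ranking[:best_k]
```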
  • 72:35 - 72:36
    Cool. Great. So
  • 72:36 - 72:41
    next lecture I'll continue - I'll wrap up with Bayesian model selection - but let's close
  • 72:41 - 72:41
    here; we're at the end.
Title:
Lecture 10 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng continues his lecture on learning theory by discussing VC dimension and model selection.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:12:56