Lecture 10 | Machine Learning (Stanford)
-
0:12 - 0:16This presentation is delivered by the Stanford Center for Professional
-
0:16 - 0:23Development.
-
0:24 - 0:28So what I want to do today in this lecture is
-
0:28 - 0:31talk a little bit more about learning theory. In particular, I'll
-
0:31 - 0:33talk about VC dimension
-
0:33 - 0:36and
-
0:36 - 0:40building on the issues of bias variance tradeoffs of under
-
0:40 - 0:43fitting and over fitting; that we've been seeing in the previous lecture, and then we'll see in
-
0:43 - 0:44this one.
-
0:44 - 0:49I then want to talk about model selection algorithms for automatically
-
0:49 - 0:50making
-
0:50 - 0:52decisions for this
-
0:52 - 0:55bias variance tradeoff, that we started to talk about in the previous lecture.
-
0:55 - 0:59And depending on how much time, I actually may not get to Bayesian, [inaudible]. But if
-
0:59 - 1:01I don't get to this today,
-
1:01 - 1:07I'll get to this in next week's lecture.
-
1:07 - 1:10To recap:
-
1:10 - 1:10the result we
-
1:10 - 1:14proved at the previous lecture was that
-
1:14 - 1:15if you have a finite
-
1:15 - 1:16hypothesis class if
-
1:16 - 1:20h is a set of k hypotheses,
-
1:20 - 1:24
-
1:24 - 1:26and suppose you have some
-
1:26 - 1:33fixed parameters, gamma and delta,
-
1:33 - 1:34then
-
1:34 - 1:37
-
1:37 - 1:41in order to guarantee
-
1:41 - 1:48that this holds,
-
1:48 - 1:51with probability at least one minus delta.
-
1:51 - 1:56It suffices
-
1:56 - 2:03that m is greater than or equal to that; okay?
-
2:05 - 2:07And using big-O notations,
-
2:07 - 2:14just dropping the constants, I can also write this as that; okay?
-
2:14 - 2:18So just to quickly remind you of what all of the notation means,
-
2:18 - 2:21we talked about empirical risk minimization,
-
2:21 - 2:22which was
-
2:22 - 2:25the simplified model of machine learning
-
2:25 - 2:26that
-
2:26 - 2:29has a hypothesis class of script h.
-
2:29 - 2:33And what the empirical risk minimization-learning algorithm does is it just chooses the
-
2:33 - 2:34hypothesis
-
2:34 - 2:38that attains the smallest error on the training set.
-
2:38 - 2:39And so
-
2:39 - 2:43this symbol, epsilon, just denoted generalization error; right? This is the
-
2:43 - 2:44probability
-
2:44 - 2:49of a hypothesis h [inaudible] misclassifying a new example drawn from the same
-
2:49 - 2:51distribution as the training set.
-
2:51 - 2:53And so this says that
-
3:02 - 3:06in order to guarantee that the generalization error of the hypothesis h
-
3:06 - 3:09[inaudible] output by empirical risk minimization
-
3:09 - 3:11that this is less than or equal to
-
3:11 - 3:14the best possible generalization error
-
3:15 - 3:20of any hypothesis in your hypothesis class, plus two times gamma, two times this error threshold.
-
3:20 - 3:24We want to guarantee that this holds with probability at least one minus delta.
-
3:24 - 3:26We show that
-
3:26 - 3:29it suffices for your training set size m
-
3:29 - 3:33to be greater than or equal to this; okay? One over two gamma squared, log two k over
-
3:33 - 3:34delta;
-
3:34 - 3:39where again, k is the size of your hypothesis class.
-
3:39 - 3:43And so this is a sample complexity result because it gives us a bound on the
-
3:43 - 3:45number of training examples we need
-
3:45 - 3:47in order to give a guarantee
-
3:47 - 3:49on something on the error; okay? So
-
3:49 - 3:55this is a sample complexity result.
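(For reference, a minimal restatement of the finite-class result on the board, in LaTeX, with m the training set size and k the number of hypotheses; the constants are as proved in the previous lecture.)

```latex
% Finite hypothesis class ERM guarantee: with probability at least 1 - \delta,
% the ERM output \hat{h} satisfies
%   \varepsilon(\hat{h}) \le \min_{h \in \mathcal{H}} \varepsilon(h) + 2\gamma ,
% provided the training set size m satisfies
m \;\ge\; \frac{1}{2\gamma^{2}} \log\frac{2k}{\delta}
  \;=\; O\!\left(\frac{1}{\gamma^{2}} \log\frac{k}{\delta}\right).
```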
-
3:55 - 3:59So what I want to do now is take this result, and try to generalize it to the
-
3:59 - 4:03case of infinite hypothesis classes. So here,
-
4:03 - 4:07we said that the set script h is sort of just k
-
4:07 - 4:09specific functions,
-
4:09 - 4:12whereas you may want to use a model like logistic regression, which is actually
-
4:12 - 4:14parameterized by real numbers.
-
4:14 - 4:15So
-
4:15 - 4:17I'm actually first going to give an argument that's sort of
-
4:17 - 4:21formally broken just sort of technically somewhat broken, but conveys
-
4:21 - 4:25useful intuition. And then I'll give the more correct argument, but
-
4:25 - 4:30without proving it, as the full proof is somewhat involved. So here's a somewhat broken argument. Let's say I want
-
4:30 - 4:31to
-
4:31 - 4:35apply this result analyzing logistic regression.
-
4:36 - 4:37So let's say
-
4:39 - 4:45your hypothesis class is the class of all linear decision boundaries; right? So say
-
4:45 - 4:47script h is
-
4:47 - 4:52parameterized
-
4:52 - 4:58by d
-
4:58 - 5:00real numbers; okay? So for example, if
-
5:00 - 5:03you're applying logistic regression with
-
5:03 - 5:07n input features, then d would be n plus one: with logistic regression you find the
-
5:07 - 5:09linear decision boundary,
-
5:09 - 5:10parameterized by
-
5:10 - 5:14n plus one real numbers.
-
5:14 - 5:18When you think about how your hypothesis class is really represented in a computer
-
5:18 - 5:19computers use
-
5:19 - 5:22zero one bits to represent real numbers.
-
5:22 - 5:23And so if you
-
5:23 - 5:24use like
-
5:24 - 5:29a normal standard computer, it normally will represent real numbers by what's
-
5:29 - 5:31called double-precision floating point numbers.
-
5:31 - 5:33And what that means is that each real number
-
5:33 - 5:37is represented by a 64-bit representation;
-
5:37 - 5:38right? So really
-
5:38 - 5:42you know what floating point is in a computer. So a 64-bit floating point
-
5:42 - 5:43is what almost all of
-
5:43 - 5:45us use routinely.
-
5:45 - 5:48And so this parameterized by d real numbers, that's really
-
5:48 - 5:51as if it's parameterized by 64
-
5:51 - 5:53times d
-
5:53 - 5:54bits.
-
5:54 - 5:57Computers can't represent real numbers exactly. They only represent
-
5:57 - 5:59them using these discrete bits.
-
5:59 - 6:01And so the size of
-
6:01 - 6:03your
-
6:03 - 6:07hypothesis class in your computer representation
-
6:07 - 6:11you have 64 times d bits that you can flip.
-
6:11 - 6:15And so the number of possible values for your 64 d bits is
-
6:15 - 6:18really just
-
6:18 - 6:21two to the power of 64 d; okay?
-
6:21 - 6:23Because that's the number of ways you can
-
6:23 - 6:26flip the 64 d bits.
-
6:26 - 6:29And so
-
6:29 - 6:34this is why it's important that we had log k there; right? So k is
-
6:34 - 6:37therefore two to the 64 d.
-
6:37 - 6:40And if I plug it into this equation over here,
-
6:40 - 6:43what you find is that
-
6:43 - 6:47in order to get this sort of guarantee,
-
6:47 - 6:51it suffices that m
-
6:51 - 6:53is
-
6:53 - 6:56greater than or equal to
-
6:56 - 6:59on the order of one over
-
6:59 - 7:02gamma squared, log of
-
7:02 - 7:03two to the 64 d
-
7:03 - 7:05over delta,
-
7:05 - 7:12which is that; okay? So
-
7:15 - 7:18just to be clear, this is
-
7:18 - 7:24just an instance of the same sample complexity result as we had before. So the question is:
-
7:24 - 7:27suppose, you want a guarantee that a hypotheses returned by empirical risk
-
7:27 - 7:29minimization will
-
7:29 - 7:34have a generalization error that's within two gamma of the best hypothesis in your hypothesis
-
7:34 - 7:36class.
-
7:36 - 7:39Then what this result suggests is that, you know, in order to give that sort of
-
7:39 - 7:41error bound guarantee,
-
7:41 - 7:43it suffices that m is greater than or equal to this.
-
7:43 - 7:45In other words, that
-
7:45 - 7:47your number of training examples has to be
-
7:47 - 7:52on the order of d over gamma squared, times log one over delta. Okay? And the
-
7:53 - 7:54intuition
-
7:54 - 7:58that this conveys is actually, roughly right. This says, that the number of
-
7:58 - 7:59training examples you need
-
7:59 - 8:01is roughly linear
-
8:01 - 8:05in the number of parameters of your hypothesis class. That
-
8:05 - 8:09m has [inaudible] on the order of something linear, [inaudible]. That intuition is actually, roughly right.
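(Spelled out, the informal substitution being made in this admittedly broken argument is k equal to two to the 64 d, plugged into the earlier bound; a sketch of that arithmetic in LaTeX:)

```latex
% Informal "64-bit" argument: k = 2^{64 d} possible parameter settings, so
m \;\ge\; O\!\left(\frac{1}{\gamma^{2}} \log\frac{2^{64 d}}{\delta}\right)
  \;=\; O\!\left(\frac{d}{\gamma^{2}} \log\frac{1}{\delta}\right)
  \;=\; O_{\gamma,\delta}(d).
```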
-
8:09 - 8:10
-
8:10 - 8:12I'll say more about
-
8:17 - 8:21this
-
8:22 - 8:26later. This result is clearly, slightly broken, in the sense that it relies on a
-
8:26 - 8:2964-bit representation of floating-point numbers.
-
8:29 - 8:32So let me actually go ahead and outline
-
8:32 - 8:35the "right way" to show this more formally; all
-
8:35 - 8:39right? And it turns out the "right way" to show this more formally
-
8:39 - 8:43involves a much longer argument; because the
-
8:43 - 8:46proof is extremely involved, I'm just actually going to state the result, and not
-
8:46 - 8:50prove it. The full proof is the source of learning theory bounds for infinite hypothesis
-
8:50 - 8:54classes.
-
8:56 - 9:03Here's the definition: given a set of d points,
-
9:19 - 9:23we say,
-
9:23 - 9:27a hypothesis class h
-
9:27 - 9:29shatters
-
9:29 - 9:35the set s, if
-
9:35 - 9:42h can realize any labeling on it; okay?
-
9:47 - 9:47And what
-
9:47 - 9:51I mean by realizing any labeling on it the informal way of thinking about this
-
9:51 - 9:53is:
-
9:53 - 9:57if a hypothesis class has shattered the set s, what that means is that I can take these
-
9:57 - 9:59d points,
-
9:59 - 10:02and I can associate these d points with any
-
10:02 - 10:04set of labels y; right?
-
10:04 - 10:09So choose any set of labeling y for each of these d
-
10:09 - 10:10points.
-
10:10 - 10:14And if your hypothesis class shatters s, then that means that
-
10:14 - 10:15there will be
-
10:15 - 10:19a hypothesis that labels those d examples perfectly;
-
10:19 - 10:24okay? That's what shattering means. So let me just illustrate those in an example.
-
10:24 - 10:31So let's say h is the class of all linear classifiers
-
10:31 - 10:36in 2D,
-
10:36 - 10:38and let's say that
-
10:38 - 10:40s is this
-
10:40 - 10:44set comprising two points; okay?
-
10:44 - 10:46So there are
-
10:46 - 10:50four possible labelings that you can
-
10:50 - 10:53put on these two points. You can choose
-
10:53 - 10:56to label both positive; one positive, one negative, one negative, one positive or you can
-
10:56 - 10:59label both of them negative.
-
10:59 - 11:01And if
-
11:01 - 11:05the hypothesis class h is the class of all linear classifiers in 2D, then for each
-
11:05 - 11:08of these training sets, I can sort of find a
-
11:08 - 11:10linear classifier
-
11:10 - 11:14that attains zero training error on
-
11:14 - 11:18each of these. Then on all possible labelings of this set of two points.
-
11:18 - 11:20And so I'll say
-
11:20 - 11:23that the hypothesis class script h shatters
-
11:23 - 11:24this set s
-
11:24 - 11:29of two points; okay? One
-
11:29 - 11:36more example show
-
11:37 - 11:39you
-
11:39 - 11:42a larger example. Suppose my set s is now
-
11:42 - 11:45this set of three points;
-
11:45 - 11:52right? Then, I now have eight possible labelings for these three points;
-
12:11 - 12:15okay? And so for these three points, I now have eight possible labelings.
-
12:15 - 12:17And once again, I can for each
-
12:17 - 12:22of these labelings, I can find the hypothesis in the hypothesis class
-
12:22 - 12:23that labels
-
12:23 - 12:26these examples correctly.
-
12:26 - 12:30And so once again, I see that by definition, say, that my hypothesis
-
12:30 - 12:34class also shatters this set s. Student:Right. Instructor (Andrew Ng):And
-
12:34 - 12:38then that that terminology h can realize any labeling on s. That's
-
12:38 - 12:40obviously [inaudible].
-
12:40 - 12:44Give it any set of labels and you can find a hypothesis that perfectly
-
12:44 - 12:48separates the positive and negative examples;
-
12:48 - 12:49okay?
-
12:49 - 12:51So
-
12:51 - 12:55how about this
-
12:55 - 12:57set?
-
12:57 - 13:01Suppose s is now this set of four points, then,
-
13:01 - 13:05you know, there are lots of labels. There are now 16 labelings we can choose on this; right?
-
13:05 - 13:06That's one for instance,
-
13:06 - 13:09and this is another one;
-
13:09 - 13:13right? And so I can realize some labelings. But there's no linear decision
-
13:13 - 13:16boundary that can realize this labeling,
-
13:16 - 13:18and so h does not shatter
-
13:18 - 13:20this set of four points; okay?
-
13:20 - 13:22And
-
13:22 - 13:25I'm not really going to prove it here, but it turns out that you can show that
-
13:25 - 13:30in two dimensions, there is no set of four points that the
-
13:30 - 13:37class of all linear classifiers can shatter; okay?
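(For anyone who wants to check the two-, three-, and four-point examples concretely, here is a small sketch of my own, not part of the lecture: it brute-forces every labeling of a point set and tests each labeling for strict linear separability using a feasibility linear program. Running it prints True for three non-collinear points and False for the four corners of a square, matching the discussion above.)

```python
# Illustrative check of "shattering" by linear classifiers in 2D.
from itertools import product

import numpy as np
from scipy.optimize import linprog


def linearly_separable(X, y):
    """Return True if some hyperplane w.x + b strictly separates labels y in {-1,+1}."""
    n, d = X.shape
    c = np.zeros(d + 1)                                   # feasibility LP, no objective
    # Constraint -y_i * (w . x_i + b) <= -1  <=>  y_i * (w . x_i + b) >= 1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return res.success


def shatters(X):
    """True if linear classifiers can realize every labeling of the rows of X."""
    n = X.shape[0]
    return all(
        linearly_separable(X, np.array(labels))
        for labels in product([-1, 1], repeat=n)
    )


print(shatters(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))              # True
print(shatters(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])))  # False
```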
-
13:46 - 13:49So here's another definition; well,
-
13:49 - 13:50it's called the
-
13:50 - 13:53VC dimension, after these
-
13:53 - 14:00two people, Vapnik and Chervonenkis.
-
14:07 - 14:10so given a hypothesis class,
-
14:10 - 14:14the Vapnik and Chervonenkis
-
14:14 - 14:18dimension of h, which we usually write as VC of script h,
-
14:18 - 14:25is the size of the
-
14:34 - 14:37largest
-
14:37 - 14:40set that is shattered by h.
-
14:40 - 14:41And
-
14:41 - 14:45if a hypothesis class can shatter arbitrarily large sets, then the VC dimension
-
14:45 - 14:48is infinite.
-
14:48 - 14:51So just as a kind of good example: if h
-
14:51 - 14:52is the class of all linear
-
14:52 - 14:58classifiers
-
14:58 - 15:03in 2D, then the
-
15:03 - 15:04VC dimension of this class
-
15:04 - 15:05is equal to
-
15:05 - 15:09three, because we saw just now that there is a set of size,
-
15:09 - 15:13there was a set s of size three, that it could shatter,
-
15:13 - 15:17and I won't really prove it, but it turns out there is no set of size four
-
15:17 - 15:18that it can
-
15:18 - 15:20shatter. And therefore,
-
15:20 - 15:21the VC dimension of this is three. Yeah? Student:But there are sets of size three that cannot shatter; right? [Inaudible] was your
-
15:21 - 15:23point. Instructor
-
15:23 - 15:28(Andrew
-
15:28 - 15:30Ng):Yes, absolutely. So it turns out that
-
15:30 - 15:32if
-
15:32 - 15:36I choose a set like this as my set s,
-
15:36 - 15:39then there are labelings on this that it cannot realize. And so,
-
15:39 - 15:41h cannot shatter this set.
-
15:41 - 15:44But that's okay because there definitely is,
-
15:44 - 15:47there exists, some other set of size three that can be shattered. So the VC
-
15:47 - 15:48dimension is
-
15:48 - 15:55three. And then there is no set of size four that can shatter. Yeah? Student:[Inaudible]. Instructor
-
15:56 - 15:58(Andrew Ng):Not according to this definition. No.
-
15:58 - 16:01Right. So again, let's see,
-
16:01 - 16:04I can choose my set s to be
-
16:04 - 16:08to be a set of three points that are all overlapping. Three points in exactly
-
16:08 - 16:10the same place.
-
16:10 - 16:12And clearly, I can't shatter this set,
-
16:12 - 16:15but that's okay. And I can't shatter this set, either,
-
16:15 - 16:19but that's okay because there are some other sets of size three that I can
-
16:19 - 16:25shatter.
-
16:25 - 16:28And it turns out this result holds true
-
16:28 - 16:32more generally,
-
16:32 - 16:34
-
16:34 - 16:41in
-
16:41 - 16:48n dimensions: the VC dimension of the class of linear classifiers in
-
16:51 - 16:54n dimensions is equal to n plus one.
-
16:54 - 16:59Okay? So if you have linear classifiers over an n dimensional feature space, the VC
-
16:59 - 17:02dimension of the class of linear classifiers in n dimensions
-
17:02 - 17:04is equal to n plus one.
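(Stated as a formula, this is the general fact being quoted, written here in LaTeX with n the input dimension and a bias term included:)

```latex
% VC dimension of linear classifiers (halfspaces with bias) on R^n:
\mathrm{VC}\!\left(\left\{\, x \mapsto \mathbf{1}\{\theta^{T}x + \theta_{0} > 0\} \,\right\}\right)
  \;=\; n + 1 .
```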
-
17:04 - 17:11
-
17:12 - 17:14So maybe you wanna write it down:
-
17:14 - 17:16
-
17:16 - 17:21what is arguably the best-known result in all of learning theory, I guess;
-
17:21 - 17:26
-
17:26 - 17:29which is that.
-
17:29 - 17:36Let a hypothesis class be given,
-
17:38 - 17:41and let the VC dimension of h be
-
17:41 - 17:46equal to d. Then, with
-
17:46 - 17:48probability at least
-
17:48 - 17:50one minus delta,
-
17:50 - 17:57we have that
-
18:32 - 18:35the formula on the right looks a bit complicated, but don't worry about it. I'll point
-
18:35 - 18:37out the essential aspects of it later.
-
18:37 - 18:41But the key to this result is that if you have a hypothesis class with VC dimension d, and
-
18:41 - 18:45now this can be an infinite hypothesis
-
18:45 - 18:47class,
-
18:47 - 18:49what
-
18:49 - 18:51Vapnik and Chervonenkis show is that
-
18:51 - 18:54with probability at least one minus delta,
-
18:54 - 18:55you enjoy this
-
18:55 - 18:58sort of uniform convergence result;
-
18:58 - 19:00okay? We have
-
19:00 - 19:03
-
19:03 - 19:07that for all hypotheses h that for all the hypotheses in your
-
19:07 - 19:08hypothesis class,
-
19:08 - 19:13you have that the generalization error of h
-
19:13 - 19:14minus
-
19:14 - 19:19the training error of h. So the difference between these two things is bounded above
-
19:19 - 19:24by some complicated formula like this; okay?
-
19:24 - 19:30And thus,
-
19:30 - 19:37with probability one minus delta, we also have the
-
19:45 - 19:52same thing; okay?
-
19:57 - 20:01And going from this step to this step; right? Going from this step
-
20:01 - 20:03to this step is actually something that you saw yourself;
-
20:03 - 20:06that we actually proved earlier. Because
-
20:06 - 20:09you remember, in the previous lecture we proved that if you have uniform
-
20:09 - 20:11convergence,
-
20:11 - 20:13then that implies that
-
20:13 - 20:18it appears actually that we showed that if generalization error and training error are close to
-
20:18 - 20:21each other; within gamma of each other,
-
20:21 - 20:24then the generalization error of the hypotheses you pick will be within
-
20:24 - 20:26two gamma
-
20:26 - 20:28of the best generalization error.
-
20:28 - 20:30So this is really
-
20:30 - 20:31generalization error of
-
20:31 - 20:35h is at most the best possible generalization error plus two times gamma. And it's just the
-
20:35 - 20:37two constants in front here
-
20:37 - 20:44that I've absorbed into the big-O notation.
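(For reference, the theorem and its consequence that the board formulas express can be written out roughly as follows, in the notation of the lecture notes; the exact constants are not important here.)

```latex
% Vapnik-Chervonenkis uniform convergence theorem (stated, not proved).
% Let VC(H) = d. Then with probability at least 1 - \delta, for all h in H:
\left|\varepsilon(h) - \hat{\varepsilon}(h)\right|
  \;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right),
% and consequently, for the ERM output \hat{h},
\varepsilon(\hat{h})
  \;\le\; \min_{h \in \mathcal{H}} \varepsilon(h)
  \;+\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right).
```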
-
20:44 - 20:47So that formula is slightly more complicated. Let
-
20:47 - 20:54me just rewrite this as a corollary, which is that
-
20:55 - 20:59in order to guarantee that
-
20:59 - 21:04this holds,
-
21:04 - 21:09with probability one minus delta.
-
21:09 - 21:11With probability at least one minus delta, I should say. It
-
21:11 - 21:17suffices
-
21:17 - 21:20that I'm gonna write this this way: I'm gonna write m equals
-
21:20 - 21:22big-O of
-
21:22 - 21:23d,
-
21:23 - 21:26and I'm going to put
-
21:26 - 21:33gamma and delta in as a subscript there to denote that.
-
21:33 - 21:37Let's see, if we treat gamma and delta as constants, and allow me to absorb terms that
-
21:37 - 21:40depend on gamma and delta into the big-O notation,
-
21:40 - 21:42then
-
21:42 - 21:45in order to guarantee this holds, it suffices that m
-
21:45 - 21:46is on the order of the
-
21:46 - 21:53VC dimension of the hypothesis class; okay?
-
21:53 - 21:56So
-
21:56 - 21:59let's see.
-
21:59 - 22:02So what we conclude from this is that
-
22:02 - 22:06if you have a learning algorithm that does empirical risk minimization
-
22:06 - 22:09or, in other words, less formally,
-
22:09 - 22:13a learning algorithm that tries to minimize training error, the
-
22:13 - 22:15intuition to take away from this is that
-
22:15 - 22:18the number of training examples you need is therefore,
-
22:18 - 22:19roughly, linear
-
22:19 - 22:24in the VC dimension of the hypotheses class.
-
22:24 - 22:28And more formally, this shows that sample complexity
-
22:28 - 22:30is upper bounded
-
22:30 - 22:34by the VC dimension; okay?
-
22:34 - 22:38It turns out that for most reasonable hypothesis classes, it turns out that
-
22:38 - 22:41the VC dimension is sort
-
22:41 - 22:42
-
22:42 - 22:45of very similar, I guess, to
-
22:45 - 22:47the number of parameters of your model.
-
22:47 - 22:49So for example,
-
22:49 - 22:52you have a model like logistic regression for linear classification.
-
22:52 - 22:56Logistic regression in n dimensions has n plus
-
22:56 - 22:57one parameters.
-
22:57 - 23:02And the VC dimension of the class of linear classifiers is also n plus one.
-
23:02 - 23:04So it turns out that
-
23:04 - 23:07for most reasonable hypothesis classes, the
-
23:07 - 23:11VC dimension is usually linear in the number of parameters of your model.
-
23:11 - 23:15or at most some low-order polynomial,
-
23:15 - 23:17in the number of parameters of your model.
-
23:17 - 23:19And so this
-
23:19 - 23:21the takeaway intuition from this is that
-
23:21 - 23:25the number of training examples you need to fit in those models is going to be let's say,
-
23:25 - 23:30roughly, linear in the number of parameters in your model; okay?
-
23:30 - 23:33There are some somewhat strange examples where what I just said is not true. There are some
-
23:33 - 23:37strange examples where you have very few parameters, but the VC dimension is
-
23:37 - 23:37enormous.
-
23:37 - 23:39But actually,
-
23:39 - 23:42all of the examples I know of that fall into that regime are somewhat
-
23:42 - 23:44strange and
-
23:44 - 23:45degenerate. So somewhat unusual, and
-
23:45 - 23:50not the sort of learning algorithms you usually use.
-
23:50 - 23:52Let's see,
-
23:52 - 23:55just other things. It turns out that
-
23:55 - 23:58so this result shows the sample complexity is upper bounded by VC
-
23:58 - 24:01dimension. But if you have a number of training examples that
-
24:01 - 24:05are on the order of the VC dimension, then you find
-
24:05 - 24:08it turns out that in the worst case
-
24:08 - 24:10
-
24:10 - 24:12
-
24:12 - 24:16sample complexity is also lower bounded by the VC dimension.
-
24:16 - 24:20And what that means is that if you have a perfectly nasty learning
-
24:20 - 24:23problem, say, then if
-
24:23 - 24:27the number of training examples you have is less than on the order of the VC
-
24:27 - 24:27dimension;
-
24:27 - 24:30then
-
24:30 - 24:34it is not possible to prove this bound. So I guess in the worst case,
-
24:34 - 24:37sample complexity, the number of training examples you need, is upper bounded and lower bounded
-
24:37 - 24:42by the VC dimension. Let's
-
24:42 - 24:48see,
-
24:48 - 24:55questions about this? Does
-
24:56 - 25:00
-
25:00 - 25:00the
-
25:00 - 25:05proof of this assume any sort of finiteness, like, finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Let's see. No, the proof does not. I've actually stated the
-
25:05 - 25:07entirety of the theorem. This is true. It
-
25:07 - 25:10turns out in the proof, well,
-
25:10 - 25:13somewhere inside the proof there's a construction called an
-
25:13 - 25:16epsilon net, which is a very clever [inaudible]. It's sort of internal to the
-
25:16 - 25:20proof; it is not an assumption that you need.
-
25:20 - 25:24In some way, the proof has one step that uses a very
-
25:24 - 25:27clever [inaudible] to prove
-
25:27 - 25:32this. But it's not needed as an assumption. I'd like to say, back when
-
25:32 - 25:36I was a Ph.D. student, when I was working through this proof,
-
25:36 - 25:38there was sort of a solid week where I
-
25:38 - 25:42would wake up, and go to the office at 9:00 a.m. Then I'd start reading
-
25:42 - 25:45the book that led up to this proof. And then
-
25:45 - 25:49I'd read from 9:00 a.m. to 6:00 p.m. And then I'd go home, and then the next day, I'd pick up where I left
-
25:49 - 25:52off. And it sort of took me a whole week that way,
-
25:52 - 25:53to understand this proof, so
-
25:53 - 26:00I thought I would inflict that on you.
-
26:03 - 26:07Just to tie a couple of loose ends:
-
26:07 - 26:13what I'm about to do is, I'm about to just mention a few things that will
-
26:13 - 26:18maybe, feel a little bit like random facts. But I'm just gonna tie up just a couple of loose ends.
-
26:18 - 26:21And so
-
26:21 - 26:24let's see,
-
26:24 - 26:27it turns out that
-
26:27 - 26:30just so this will be more concrete for you: this bound was
-
26:30 - 26:31proved
-
26:31 - 26:35for an algorithm that uses empirical risk minimization, for an algorithm that
-
26:35 - 26:36minimizes
-
26:36 - 26:400-1 training error. So
-
26:40 - 26:46one
-
26:46 - 26:49question that some of
-
26:49 - 26:52you ask is
-
26:52 - 26:55how about support vector machines; right? How come SVM's don't over
-
26:55 - 26:57fit?
-
26:57 - 27:01And, you know, recall in our discussion on support vector machines we said that
-
27:01 - 27:06you use kernels, and map the features into an infinite dimensional feature space.
-
27:06 - 27:10And so it seems like the VC dimension should be infinite; n plus
-
27:10 - 27:12one and n is infinite.
-
27:12 - 27:15So it turns out that
-
27:15 - 27:20the class of linear separators with large margin actually has low VC dimension.
-
27:20 - 27:23I wanna say this very quickly, and informally.
-
27:23 - 27:28It's actually, not very important for you to understand the details, but I'm going
-
27:28 - 27:32to say it very informally. It turns out that I will give you a set of points.
-
27:32 - 27:33And
-
27:33 - 27:38if I ask you to consider only the class of lines that separate these points with a
-
27:38 - 27:39large margin [inaudible],
-
27:39 - 27:40so
-
27:40 - 27:44my hypothesis class will comprise only
-
27:44 - 27:46the linear decision boundaries
-
27:46 - 27:49that separate the points with a large margin. Say with a
-
27:49 - 27:50margin,
-
27:50 - 27:52at least gamma; okay.
-
27:52 - 27:53And so
-
27:53 - 27:56I won't allow a line that comes closer. Like,
-
27:56 - 28:00I won't allow that line because it comes too close to one of my points.
-
28:00 - 28:04It turns out that
-
28:04 - 28:11if I consider my
-
28:11 - 28:15data points all
-
28:15 - 28:19lie within some sphere of radius r,
-
28:19 - 28:23and if I consider only the class of linear separators that separate the data
-
28:23 - 28:27with a margin of at least gamma,
-
28:27 - 28:30then the VC dimension of this
-
28:30 - 28:34class is less than or equal to r squared over four
-
28:34 - 28:36gamma squared
-
28:36 - 28:38plus one; okay?
-
28:38 - 28:40So
-
28:40 - 28:43this funny symbol here, that just means rounding up.
-
28:43 - 28:47This is a ceiling symbol; it means rounding up x.
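(Written out in LaTeX, the bound just quoted, with the ceiling symbol that was just explained, and R the radius of the ball containing the data:)

```latex
% VC dimension of large-margin linear separators, as quoted in the lecture:
% data in a ball of radius R, separators with margin at least \gamma.
\mathrm{VC}\!\left(\mathcal{H}_{\gamma}\right)
  \;\le\; \left\lceil \frac{R^{2}}{4\gamma^{2}} \right\rceil + 1 .
```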
-
28:47 - 28:48And it
-
28:48 - 28:51turns out you can prove (and there are some strange things
-
28:51 - 28:53about this result that I'm deliberately not
-
28:53 - 28:55gonna talk about) but it turns out you
-
28:55 - 28:59can prove that the VC dimension of the class of linear classifiers with large
-
28:59 - 29:01margins is actually bounded.
-
29:01 - 29:03The surprising thing about this is that
-
29:03 - 29:06this is a bound on VC dimension that has no dependence
-
29:06 - 29:08on the dimension of the points x. So
-
29:08 - 29:09in other words,
-
29:09 - 29:13your data points x can lie in an infinite dimensional space,
-
29:13 - 29:16but so long as you restrict attention to the class of
-
29:16 - 29:18your separators with large margin, the
-
29:18 - 29:20VC dimension is bounded.
-
29:20 - 29:22And so
-
29:22 - 29:26in trying to find a large margin separator
-
29:26 - 29:29in trying to find the line that separates your positive and
-
29:29 - 29:30your negative examples
-
29:30 - 29:32with large margin,
-
29:32 - 29:35it turns out therefore, that
-
29:35 - 29:38the support vector machine is automatically trying to find
-
29:38 - 29:40a hypothesis class
-
29:40 - 29:42with small VC dimension. And therefore, it does not over fit. Alex? Student:What is
-
29:42 - 29:49the [inaudible]? Instructor (Andrew
-
29:49 - 29:50Ng):It
-
29:50 - 29:52is actually defined the
-
29:52 - 29:56same way as in finite dimensional spaces. So you
-
29:56 - 29:57know, suppose you have, actually,
-
29:57 - 30:01countably infinite dimensional vectors; let's say countably infinite
-
30:01 - 30:04dimensional
-
30:04 - 30:06vectors. Normally, the norm of x,
-
30:06 - 30:07squared,
-
30:07 - 30:10is equal to the sum over i of
-
30:10 - 30:12xi squared, so if x
-
30:12 - 30:17is infinite dimensional, you just define it like
-
30:17 - 30:20that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again.
-
30:20 - 30:23[Inaudible].
-
30:23 - 30:27Yes. Although, I assume that this is bounded by r. Oh, it's
-
30:27 - 30:31yeah, so this ensures that this converges.
-
30:31 - 30:32
-
30:32 - 30:35So just
-
30:35 - 30:38something people sometimes wonder about. And
-
30:38 - 30:40lastly, let me actually
-
30:40 - 30:44tie empirical risk minimization back a little more strongly to the sorts of algorithms
-
30:44 - 30:47we've talked about.
-
30:47 - 30:48It turns out
-
30:48 - 30:55that
-
31:10 - 31:15so the theory so far was really about empirical risk minimization. So to see the connection,
-
31:15 - 31:18
-
31:18 - 31:21let's focus on just one training example.
-
31:21 - 31:23Let me draw a
-
31:23 - 31:24function, you know, a
-
31:24 - 31:25zero here jumps
-
31:25 - 31:26to one,
-
31:26 - 31:28and it looks like that.
-
31:28 - 31:30And so for
-
31:30 - 31:32this one training example,
-
31:32 - 31:36this may be
-
31:36 - 31:43the indicator of
-
31:43 - 31:47h of x not equal to y, plotted as a function of z, where z
-
31:47 - 31:49equals theta transpose x;
-
31:49 - 31:53okay? But one
-
31:53 - 31:56training example your training example will be positive or negative. And
-
31:56 - 31:59depending on what the value of this theta transpose x is, you either get it right
-
31:59 - 32:02or wrong. And so
-
32:02 - 32:06you know, I guess if your training example if you have a positive example,
-
32:06 - 32:08then when z is positive,
-
32:08 - 32:10you get it
-
32:10 - 32:11right.
-
32:11 - 32:18
-
32:19 - 32:23Suppose you have a negative example, so y equals 0; right?
-
32:23 - 32:27Then if z, which is theta transpose x, if this is positive,
-
32:27 - 32:30then you will get this example wrong;
-
32:30 - 32:33whereas, if z is negative then you'd get this example right.
-
32:33 - 32:37And so this is a plot of the indicator of h subscript theta of x not equal to y; okay? You
-
32:37 - 32:43know, where h is equal
-
32:43 - 32:50to g of theta transpose
-
32:50 - 32:51
-
32:51 - 32:54x; okay? And so it turns out that
-
32:54 - 32:57so what you'd really like to do is choose parameters theta so as to
-
32:57 - 33:00minimize this step function; right?
-
33:00 - 33:02You'd like to choose parameters theta, so that
-
33:02 - 33:05you end up with a
-
33:05 - 33:08correct classification on this training example, and so you'd like
-
33:08 - 33:09indicator h
-
33:09 - 33:11of x not equal y.
-
33:11 - 33:15You'd like this indicator function to be 0. It
-
33:15 - 33:19turns out this step function is clearly a non-convex function. And so
-
33:19 - 33:22it turns out that even just for linear classifiers,
-
33:22 - 33:25minimizing the training error is an NP-hard problem. It
-
33:25 - 33:30turns out that both logistic regression, and support vector machines can be viewed as
-
33:30 - 33:31using a convex
-
33:31 - 33:33approximation for this problem.
-
33:33 - 33:36And in particular
-
33:36 - 33:40and draw a function like that
-
33:41 - 33:48it turns out that
-
33:51 - 33:55logistic regression is trying to maximize likelihood. And so it's trying to minimize
-
33:55 - 33:57the negative of the log likelihood.
-
33:57 - 34:00And if you plot the negative of the log likelihood, it actually turns out it'll be a function that
-
34:00 - 34:03looks like this. And
-
34:03 - 34:07this line that I just drew, you can think of it as a rough approximation to this step function; which is
-
34:07 - 34:11maybe what you're really trying to minimize,
-
34:11 - 34:15so you want to minimize training error. So you can actually think of logistic regression as trying to approximate
-
34:15 - 34:16empirical risk minimization.
-
34:16 - 34:20Where instead of using this step function, which is non-convex, and gives you a hard
-
34:20 - 34:21optimization problem, it
-
34:21 - 34:25uses this curve above as the approximation,
-
34:25 - 34:28so you have a convex optimization problem, and you can
-
34:28 - 34:30find the
-
34:30 - 34:34maximum likelihood estimate of the parameters for logistic regression.
-
34:34 - 34:39And it turns out, the support vector machine also can be viewed as approximating this function,
-
34:39 - 34:41too, only a little bit differently. Let's
-
34:41 - 34:45see, the support vector machine, it turns out, can be viewed as trying to approximate this
-
34:45 - 34:46step function too,
-
34:46 - 34:48with a different
-
34:48 - 34:51approximation that's piecewise linear, and then
-
34:51 - 34:54that sort of piecewise linear curve
-
34:54 - 34:56is flat there, and then it goes up as a
-
34:56 - 34:58linear function there. And that
-
34:58 - 35:00is called the hinge loss. And so you
-
35:00 - 35:04can think of logistic regression and the support vector machine as
-
35:04 - 35:07different approximations to try to minimize this
-
35:07 - 35:11step function;
-
35:11 - 35:12okay?
-
35:12 - 35:13And that's why I guess,
-
35:13 - 35:16all the theory we developed
-
35:16 - 35:19even though SVM's and logistic regression aren't exactly doing
-
35:19 - 35:22empirical risk minimization,
-
35:22 - 35:25the theory we develop often gives the
-
35:25 - 35:32completely appropriate intuitions for SVM's, and logistic regression; okay.
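(Here is a minimal sketch, not from the lecture, that compares the three losses just discussed. It uses a slight change of convention: labels y in minus one and plus one, with z equal to y times theta transpose x, so a correct classification means z greater than zero.)

```python
# Compare the non-convex 0-1 loss with its convex surrogates: the logistic
# loss (logistic regression) and the hinge loss (SVM), as functions of z.
import numpy as np


def zero_one_loss(z):
    # 1 if the example is misclassified (z <= 0), else 0: the step function.
    return (z <= 0).astype(float)


def logistic_loss(z):
    # Negative log likelihood of logistic regression: log(1 + e^{-z}).
    return np.log1p(np.exp(-z))


def hinge_loss(z):
    # SVM hinge loss: zero once the margin exceeds 1, linear below that.
    return np.maximum(0.0, 1.0 - z)


z = np.linspace(-3, 3, 7)
for name, loss in [("0-1", zero_one_loss), ("logistic", logistic_loss), ("hinge", hinge_loss)]:
    print(name, np.round(loss(z), 3))
```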
-
35:33 - 35:36So that was the last of the loose ends.
-
35:36 - 35:39And if you didn't get this, don't worry too much about it. It's a high-level message. It's
-
35:39 - 35:40just that
-
35:40 - 35:44SVM's and logistic regression are reasonable to think of as approximations
-
35:44 - 35:48to empirical risk minimization algorithms.
-
35:48 - 35:52What I want to do next is move on to talk about model selection. Before I do that, let me just
-
35:52 - 35:59check for questions about
-
36:49 - 36:53this. Okay. Cool. Okay. So in the theory that we started to develop in the previous lecture, and that
-
36:53 - 36:53we
-
36:53 - 36:55sort of wrapped up with a
-
36:55 - 36:57discussion on VC dimension,
-
36:57 - 37:01we saw that there's often a trade-off between bias and variance. And in
-
37:01 - 37:02particular, so
-
37:02 - 37:07it is important not to choose a hypothesis that's either too simple or too
-
37:07 - 37:08complex. So
-
37:08 - 37:11if your data has sort of a quadratic structure to it,
-
37:11 - 37:14then if you choose a linear
-
37:14 - 37:18function to try to approximate it, then you would under fit. So you have a
-
37:18 - 37:20hypothesis with high bias.
-
37:20 - 37:24And conversely, if we choose a hypothesis that's too complex, then you have high variance.
-
37:24 - 37:26And you'll also
-
37:26 - 37:32fail: you would over fit the data, and you'd also fail to generalize well.
-
37:32 - 37:36So model selection algorithms
-
37:36 - 37:40provide a class of methods to automatically make these tradeoffs
-
37:40 - 37:42between bias
-
37:42 - 37:43and variance; right? So remember the
-
37:43 - 37:45cartoon I drew last time
-
37:45 - 37:52of generalization error?
-
37:54 - 37:57I drew this last time. Where on the x-axis
-
37:57 - 38:03was model complexity, meaning the number of
-
38:03 - 38:08the degree of the polynomial; the [inaudible] regression function
-
38:08 - 38:09or whatever.
-
38:09 - 38:12And if you have too simple a model, you have high generalization error; that's under
-
38:12 - 38:14fitting.
-
38:14 - 38:16And you if have too complex a model,
-
38:16 - 38:20like 15 or 14-degree polynomial to five data points, then you also have high
-
38:20 - 38:24generalization error, and you're over fitting.
-
38:24 - 38:29So what I wanna do now is actually just talk about model selection in the abstract;
-
38:29 - 38:30all right?
-
38:30 - 38:33Some examples of model selection problems will include
-
38:33 - 38:34well, I'll run the example
-
38:34 - 38:36of
-
38:36 - 38:37let's say you're
-
38:37 - 38:44trying to choose the degree of a polynomial;
-
38:45 - 38:49right? What degree polynomial do you want to choose?
-
38:49 - 38:52Another example of a model selection problem would be if you're trying to choose
-
38:52 - 38:54the parameter tau,
-
38:54 - 39:00which was the bandwidth parameter
-
39:00 - 39:04in locally weighted linear regression
-
39:04 - 39:08or in some sort of locally weighted
-
39:08 - 39:08regression.
-
39:08 - 39:10Yet, another model selection problem
-
39:10 - 39:14is if you're trying to choose the parameter c in the
-
39:14 - 39:16SVM;
-
39:16 - 39:22right? And so, you know, in the soft margin SVM
-
39:22 - 39:29we had this optimization objective; right?
-
39:29 - 39:31And the parameter c
-
39:31 - 39:33controls the tradeoff between
-
39:33 - 39:34how much you want
-
39:34 - 39:38to have a large margin, versus how much you want to penalize
-
39:38 - 39:40
-
39:40 - 39:44misclassified examples. So these are three specific examples of model selection
-
39:44 - 39:45problems.
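(For reference, the soft margin optimization objective being referred to, as it was written in the earlier SVM lectures, is roughly the following; C is the tradeoff parameter just described.)

```latex
% Soft margin SVM primal: C trades off margin size against margin violations.
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}\xi_{i}
\quad\text{s.t.}\quad y^{(i)}\!\left(w^{T}x^{(i)} + b\right) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0 .
```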
-
39:45 - 39:48And
-
39:48 - 39:55let's come up with a method for automatically choosing among them.
-
40:14 - 40:18Let's say you have some finite set of models, and let's write these as m1, m2, m3,
-
40:18 - 40:20and
-
40:20 - 40:25so on. For example, this may be the linear
-
40:25 - 40:30classifier or this may be the quadratic classifier,
-
40:30 - 40:30and so on;
-
40:30 - 40:32okay? Or this
-
40:32 - 40:34may also be
-
40:34 - 40:37you may also take the bandwidth parameter tau
-
40:37 - 40:41and discretize it into a range of values, and you're trying to choose from among those
-
40:41 - 40:43discretized values.
-
40:43 - 40:48So let's talk about how you would select an appropriate model; all right? Well,
-
40:48 - 40:52one thing you could do is you can pick all of these models, and train them on
-
40:52 - 40:54your training set.
-
40:54 - 40:57And then see which model has the lowest training
-
40:57 - 41:04error. So that's a terrible idea, and why's that?
-
41:05 - 41:11Right. Cool. Because of over fitting; right. And some of you are laughing that
-
41:11 - 41:14I asked that. So that'd be a terrible idea to choose a model
-
41:14 - 41:17by looking at your training set because well, obviously, you end up choosing the most
-
41:17 - 41:19complex model; right? And you
-
41:19 - 41:26choose a 10th degree polynomial because that's what fits the training set. So when we
-
41:26 - 41:29come to model selection, there are
-
41:29 - 41:32several standard procedures to do this. One is hold out cross
-
41:32 - 41:39validation,
-
41:39 - 41:41and in hold out cross validation,
-
41:41 - 41:46we take the training set, and we randomly split
-
41:46 - 41:51the training set into two subsets. We
-
41:51 - 41:51take
-
41:51 - 41:54all the data you have
-
41:54 - 42:01and randomly split it into two subsets. And we'll call them the training subset, and the
-
42:01 - 42:05hold out cross validation subset.
-
42:05 - 42:07And then,
-
42:07 - 42:12you know, you train each model
-
42:12 - 42:19on just the training subset of it, and test it
-
42:19 - 42:22on your hold out cross validation set.
-
42:22 - 42:25And you pick the model
-
42:25 - 42:32with the lowest error
-
42:33 - 42:35on the hold out cross validation
-
42:35 - 42:37subset; okay? So this is sort of a
-
42:37 - 42:40relatively straightforward procedure, and it's commonly used
-
42:40 - 42:42where you train on 70 percent of the data.
-
42:42 - 42:45Then test all of your models. And 30 percent, you can take
-
42:45 - 42:46whatever has the smallest
-
42:46 - 42:51hold out cross validation error.
-
42:51 - 42:55And after this you actually have a choice. You can actually,
-
42:55 - 42:59having taken all of these hypotheses trained on 70 percent of the data,
-
42:59 - 43:03you can actually just output the hypothesis that has the lowest error on your hold out
-
43:03 - 43:04cross validation set.
-
43:04 - 43:09And optionally, you can actually take the model that you selected
-
43:09 - 43:13and go back, and retrain it on all 100 percent of the data;
-
43:13 - 43:16okay? So both versions are actually done and used really often. You can
-
43:16 - 43:16either,
-
43:16 - 43:20you know, just take the best hypothesis that was trained on 70 percent of the data,
-
43:20 - 43:22and just output that as your final hypothesis
-
43:22 - 43:24or you can use this to
-
43:24 - 43:27say, having chosen the degree of the polynomial you want to fit, you can then go
-
43:27 - 43:31back and retrain the model on the entire 100 percent of your data.
-
43:31 - 43:38And both of these are commonly done.
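(A minimal sketch of simple hold out cross validation, using my own made-up data and candidate polynomial degrees; the 70/30 split and the optional retraining on all the data match what was just described.)

```python
# Hold out cross validation for choosing a polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=60)    # quadratic data plus noise

# Randomly split into a training subset and a hold out cross validation subset.
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(1, 9):                                  # candidate models M_1, ..., M_8
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)                             # train on 70 percent of the data
    err = mean_squared_error(y_cv, model.predict(X_cv))     # test on the held out 30 percent
    if err < best_err:
        best_degree, best_err = degree, err

print("selected degree:", best_degree)
# Optionally retrain the selected model on 100 percent of the data.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X, y)
```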
-
43:40 - 43:47
-
43:50 - 43:54So hold out cross validation sort of works straightforwardly.
-
43:54 - 43:56And
-
43:56 - 43:58sometimes we're working
-
43:58 - 43:59
-
43:59 - 44:04with a company or application or something, and in
-
44:04 - 44:06many machine-learning applications we have
-
44:06 - 44:08very little data or where, you
-
44:08 - 44:10know, every training example you have was
-
44:10 - 44:15painfully acquired at great cost; right? Sometimes your data is
-
44:15 - 44:16acquired by
-
44:16 - 44:19medical experiments, and each of these
-
44:19 - 44:24training examples represents a certain amount of physical human pain or something. So
-
44:24 - 44:27if we come and say, well, I'm going to hold out 30 percent of your data set, just to select
-
44:27 - 44:28my model.
-
44:28 - 44:33well, that sometimes causes unhappiness, and so maybe you wanna
-
44:33 - 44:37not have to leave out 30 percent of your data just to do model
-
44:37 - 44:39selection.
-
44:39 - 44:43So there are a couple of other variations on hold out cross validation that makes
-
44:43 - 44:48sometimes, slightly more efficient use of the data.
-
44:48 - 44:52And one is called k-fold cross validation.
-
44:52 - 44:54And here's the idea: I'm gonna
-
44:54 - 44:56take all of
-
44:56 - 44:57my data s;
-
44:57 - 45:01so imagine, I'm gonna draw this
-
45:01 - 45:04box s, to denote the entirety of all the data I have.
-
45:04 - 45:07And I'll then divide it
-
45:07 - 45:13into k pieces, and this is five pieces in what I've drawn.
-
45:13 - 45:17Then what'll I'll do is I will repeatedly
-
45:17 - 45:23train on k minus one pieces.
-
45:23 - 45:28Test on the remaining one
-
45:28 - 45:31test on the remaining piece, I guess;
-
45:31 - 45:36right? And then you average
-
45:36 - 45:41over the k result.
-
45:41 - 45:44So in other words, I will
-
45:44 - 45:48hold
-
45:48 - 45:51out say, just 1/5 of my data
-
45:51 - 45:55and I'll train on the remaining 4/5, and I'll test on the first piece.
-
45:55 - 45:58And then I'll go and hold out the second 1/5, train on the
-
45:58 - 46:03remaining pieces and test on this one; then remove the third piece,
-
46:03 - 46:06train on the remaining 4/5; I'm gonna do this five times.
-
46:06 - 46:09And then I'll take the five error measures I
-
46:09 - 46:11have and I'll
-
46:11 - 46:16average them. And this then gives me an estimate of the generalization error of my model; okay?
-
46:16 - 46:18And then, again, when you do
-
46:18 - 46:20k-fold cross validation,
-
46:20 - 46:22usually you then go back and
-
46:22 - 46:24retrain the model you selected
-
46:24 - 46:28on the entirety of your training set.
-
46:28 - 46:34So I drew five pieces here because that was easier for me to draw, but k equals 10 is
-
46:34 - 46:39very
-
46:39 - 46:42common; okay? I should say
-
46:42 - 46:48k equals 10 is the fairly common choice to do 10 fold cross validation.
-
46:48 - 46:52And the advantage over hold out cross validation is that you split the data into ten pieces.
-
46:52 - 46:54Then each time you're only holding out
-
46:54 - 46:561/10 of your data, rather than,
-
46:56 - 46:59you know, say, 30 percent of your data. I
-
46:59 - 47:03must say, in standard hold out in simple hold out cross validation,
-
47:03 - 47:06a 30 70 split is fairly common.
-
47:06 - 47:08Sometimes like 2/3 1/3 or
-
47:08 - 47:10a 70 30 split is fairly common.
-
47:10 - 47:14And if you use k-fold cross validation, k equals 5 or, more commonly, k equals 10 is the most
-
47:14 - 47:16common choice.
-
47:16 - 47:20The disadvantage of k-fold cross validation is that it can be much more
-
47:20 - 47:21computationally expensive.
-
47:21 - 47:23In particular,
-
47:23 - 47:26to validate your model, you now need to train your model ten times,
-
47:26 - 47:29instead of just once. And so you need
-
47:29 - 47:33to train, say, logistic regression ten times per model, rather than just once. And so this is
-
47:33 - 47:35computationally more expensive.
-
47:35 - 47:38But k equals ten works great.
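(A minimal sketch of 10-fold cross validation for the same kind of choice; the data generation and candidate degrees are the same made-up setup as in the previous sketch.)

```python
# k-fold cross validation (k = 10) for choosing a polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=60)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_errors = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Trains on k-1 folds and tests on the held out fold, k times over.
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    cv_errors[degree] = -scores.mean()                      # average the k error measures

best_degree = min(cv_errors, key=cv_errors.get)
# As in the lecture, retrain the selected model on the entire training set.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X, y)
print("selected degree:", best_degree)
```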
-
47:38 - 47:40And then,
-
47:40 - 47:44finally, in
-
47:44 - 47:48there's actually a version of this that you can take even further,
-
47:48 - 47:50which is when your set k
-
47:50 - 47:52equals m.
-
47:52 - 47:57And so that's when you take your training set, and you split it into as many pieces as you have training
-
47:57 - 47:59examples.
-
47:59 - 48:06And this procedure is called leave one out cross validation.
-
48:06 - 48:07And
-
48:07 - 48:09what you do is you then
-
48:09 - 48:12take out the first training example, train on the rest, and test on the first
-
48:12 - 48:13example.
-
48:13 - 48:16Then you take out the second training example,
-
48:16 - 48:19train on the rest, and test on the second example. Then
-
48:19 - 48:22you take out the third example, train on everything, but your third example. Test on
-
48:22 - 48:24the third example, and so on.
-
48:24 - 48:25And so
-
48:25 - 48:28with this many pieces you are now making,
-
48:28 - 48:31maybe even more effective use of your data than k-fold cross
-
48:31 - 48:32validation. But
-
48:32 - 48:36you know, leave one out cross validation is computationally very expensive
-
48:36 - 48:37because now
-
48:37 - 48:42you need to repeatedly leave one example out, and then run your learning
-
48:42 - 48:46algorithm on m minus one training examples. You need to do this a lot of times, and so
-
48:46 - 48:49this is computationally very expensive.
-
48:49 - 48:55And typically, this is done only when you're extremely data scarce. So if you
-
48:55 - 48:59have a learning problem where you have, say, 15 training examples or something,
-
48:59 - 49:01then if you have very few training examples, leave one
-
49:01 - 49:04out cross validation
-
49:04 - 49:04is
-
49:04 - 49:11maybe preferred.
-
49:16 - 49:18Yeah?
-
49:18 - 49:23Student:You know, that theorem you proved bounds the difference between the generalization error and the training error by the
-
49:23 - 49:27number of examples in your training
-
49:27 - 49:28set and VC
-
49:28 - 49:31dimension. So
-
49:31 - 49:33maybe instead of splitting the examples into
-
49:33 - 49:35different groups, we can use that for [inaudible]. Instructor (Andrew Ng):Yeah, I
-
49:35 - 49:37mean Student:- compute the training error, and use that for computing [inaudible] for a generalized error. Instructor (Andrew Ng):Yeah, that's done, but
-
49:37 - 49:40yeah, in practice, I personally tend not to do that. It
-
49:40 - 49:43tends not to work well; the VC
-
49:43 - 49:47dimension bounds are somewhat loose bounds. And so
-
49:47 - 49:48there are people
-
49:48 - 49:51in structural risk minimization that propose doing what you describe, but I personally tend not to do
-
49:51 - 49:55that,
-
50:09 - 50:16though. Questions for cross validation?
-
50:34 - 50:36Yeah. Student:This
-
50:36 - 50:39is kind of related, because we spent all this time [inaudible], but how many data points do you sort of need to get a certain margin [inaudible]?
-
50:39 - 50:40Right.
-
50:40 - 50:42So it seems like when I'd be
-
50:42 - 50:44able to use that
-
50:44 - 50:47
-
50:47 - 50:48instead
-
50:48 - 50:50of do this; more analytically, I guess. I mean Yeah. [Inaudible]. Instructor
-
50:50 - 50:53No okay. So it turns out that when you're proving learning theory bounds, very often
-
50:53 - 50:58the bounds will be extremely loose because you're sort of proving the worse case upper bound that
-
50:58 - 50:59holds true
-
50:59 - 51:03even for very bad cases; what is
-
51:03 - 51:06it,
-
51:06 - 51:10so the bounds that I proved just now; right? That holds true for absolutely any
-
51:10 - 51:12probability distribution over training
-
51:12 - 51:13examples; right?
-
51:13 - 51:17So we just assume the training examples are drawn iid
-
51:17 - 51:19from some distribution script d,
-
51:19 - 51:23and the bounds I proved hold true for absolutely any probability
-
51:23 - 51:27distribution over script
-
51:27 - 51:29d. And chances are
-
51:29 - 51:33whatever real life distribution you get over, you know, houses and their
-
51:33 - 51:36prices or whatever, is probably not as bad as
-
51:36 - 51:39the very worst one you
-
51:39 - 51:41could've gotten; okay?
-
51:41 - 51:42And so it turns out that if you
-
51:42 - 51:46actually plug in the constants of learning theory bounds, you often get
-
51:46 - 51:51extremely large numbers. Take logistic regression: say in logistic
-
51:51 - 51:53regression you have ten parameters
-
51:53 - 51:57and want 0.01 error with 95 percent probability. How many training
-
51:57 - 51:58examples do I
-
51:58 - 52:02need? If you actually plug in actual constants into the text for learning theory bounds,
-
52:02 - 52:06you often get extremely pessimistic estimates of the number of examples you need. You end
-
52:06 - 52:08up with
-
52:08 - 52:12some ridiculously large numbers. You would need 10,000 training examples to fit
-
52:12 - 52:14ten parameters.
-
52:14 - 52:17So
-
52:17 - 52:20a good way to think of these learning theory bounds is
-
52:20 - 52:25and this is why, also, when I write papers on learning theory bounds, I
-
52:25 - 52:26quite often
-
52:26 - 52:30use big-O notation to just absolutely just ignore the constant factors because
-
52:30 - 52:33the bounds seem to be very loose.
-
52:33 - 52:39There are some attempts to use these bounds to give guidelines as to
-
52:39 - 52:40what model
-
52:40 - 52:42to choose, and so on.
-
52:42 - 52:47But I personally tend to use the bounds again,
-
52:47 - 52:47intuition
-
52:47 - 52:49about
-
52:49 - 52:53for example, whether the number of training examples you need
-
52:53 - 52:56grows linearly in the number of parameters, or whether it grows
-
52:56 - 53:00quadratically in the number of parameters.
-
53:00 - 53:02So it's quite often the shape of the bounds that matters. The fact that
-
53:02 - 53:07the number of training examples, the fact that the sample complexity, is linear in the VC dimension,
-
53:07 - 53:09that's sort of a useful intuition you can get from
-
53:09 - 53:11these theories. But the
-
53:11 - 53:14actual magnitude of the bound will tend to be much looser than
-
53:14 - 53:18will hold true for a particular problem you are working
-
53:18 - 53:25on. So did that
-
53:26 - 53:28answer your question? Student:Uh-huh. Instructor (Andrew Ng):Yeah. And it turns out, by the
-
53:28 - 53:32way, for myself, a rule of thumb that I often use is if
-
53:32 - 53:35you're trying to fit a logistic regression model,
-
53:35 - 53:36if you have
-
53:36 - 53:39n parameters or n plus one parameters;
-
53:39 - 53:42if the number of training examples is ten times your number of
-
53:42 - 53:45parameters, then you're probably in good shape.
-
53:45 - 53:49And if your number of training examples is like many times the number of parameters, then
-
53:49 - 53:52you're probably perfectly fine fitting that model.
-
53:52 - 53:55So those are the sorts of intuitions
-
53:55 - 54:02that you can get from these bounds. Student:In cross validation do
-
54:03 - 54:07we assign these examples randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the training
-
54:07 - 54:14and test sets randomly.
-
54:23 - 54:27One more thing I want to talk about for model selection is there's actually a special case of model selections,
-
54:27 - 54:34called the feature selection problem.
-
54:38 - 54:41And so here's the intuition:
-
54:41 - 54:45for many machine-learning problems you may have a very high dimensional
-
54:45 - 54:48feature space, with very high dimensional
-
54:48 - 54:52feature vectors x.
-
54:52 - 54:56So for example, for text classification and I wanna talk about this text classification example that
-
54:56 - 54:58spam versus
-
54:58 - 55:00non-spam. You may easily have on
-
55:00 - 55:04the order of 30,000 or 50,000 features. I think I used 50,000 in
-
55:04 - 55:07my early examples. So if you have
-
55:07 - 55:11so many features you have 50,000 features,
-
55:11 - 55:14depending on what learning algorithm you use, there may be a real
-
55:14 - 55:16risk of over fitting.
-
55:16 - 55:17And so
-
55:17 - 55:20if you can reduce the number of features, maybe
-
55:20 - 55:23you can reduce the variance of your learning algorithm, and reduce the risk of
-
55:23 - 55:25over fitting. And for
-
55:25 - 55:28the specific case of text classification, if
-
55:28 - 55:32you imagine that maybe there's a small number of relevant features, so
-
55:32 - 55:34there are all these English words.
-
55:34 - 55:37And many of these English words probably don't tell you anything at all about
-
55:37 - 55:40whether the email is spam or non-spam. These are, you
-
55:40 - 55:44know, English function words like the, of, a, and;
-
55:44 - 55:48these are probably words that don't tell you anything about whether the email is spam or non-spam. So
-
55:48 - 55:51in contrast, there will be a much smaller number of
-
55:51 - 55:54features that are truly relevant to the learning problem.
-
55:54 - 55:57So for example, you see the word buy or Viagra, those are words that
-
55:57 - 56:02are very useful for telling you whether something is spam or non-spam. You see the word
-
56:02 - 56:03Stanford or
-
56:03 - 56:06machinelearning or your own personal name. These are other words that are useful for
-
56:06 - 56:11telling you whether something is spam or non-spam. So
-
56:11 - 56:16in feature selection, we would like to select a subset of the features
-
56:16 - 56:20that may be or hopefully the most relevant ones for a specific learning
-
56:20 - 56:20problem, so
-
56:20 - 56:25as to give ourselves a simpler learning problem, a simpler hypothesis class to choose
-
56:25 - 56:28from. And then therefore, reduce the risk of over fitting.
-
56:28 - 56:30Even when we
-
56:30 - 56:37may have had 50,000 features originally. So
-
56:37 - 56:41how do you do this? Well, if you
-
56:41 - 56:44have n
-
56:44 - 56:47features, then there are
-
56:47 - 56:50two to the n possible subsets;
-
56:50 - 56:54
-
56:54 - 56:56right? Because, you know,
-
56:56 - 57:00each of your n features can either be included or excluded. So there are two to the n
-
57:00 - 57:01possibilities.
-
57:01 - 57:04And this is a huge space.
-
57:04 - 57:09So in feature selection, what we most commonly do is use various search heuristics,
-
57:09 - 57:10sort of
-
57:10 - 57:11simple search algorithms
-
57:11 - 57:15to try to search through this space of two to the n possible subsets of features;
-
57:15 - 57:17to try to find a good
-
57:17 - 57:18subset of features. This is
-
57:18 - 57:23too large a number to enumerate all possible feature subsets.
-
57:23 - 57:26And as a complete example,
-
57:26 - 57:27
-
57:27 - 57:31this is the forward search algorithm; it's also called the forward selection algorithm.
-
57:31 - 57:33
-
57:33 - 57:35It's actually pretty simple, but I'll just
-
57:35 - 57:36write it out.
-
57:36 - 57:42My writing it out will make it look more complicated than it really
-
57:42 - 57:45is, but
-
57:45 - 57:46it starts with
-
57:46 - 57:48initialize the sets script f
-
57:48 - 57:52to be the empty set,
-
57:52 - 57:53and then
-
57:53 - 57:55repeat
-
57:55 - 57:57for
-
57:57 - 57:59i equals one
-
57:59 - 58:00to n;
-
58:05 - 58:08try adding
-
58:08 - 58:11feature i
-
58:11 - 58:14to the set scripts
-
58:14 - 58:16f, and evaluate the
-
58:16 - 58:20model
-
58:20 - 58:26using cross validation.
-
58:26 - 58:30And by cross validation, I mean any of the three flavors, be it simple hold out cross
-
58:30 - 58:35validation or k-fold cross validation or leave one out cross
-
58:35 - 58:36validation.
-
58:36 - 58:40And then, you know, set f to be
-
58:40 - 58:41equal to
-
58:41 - 58:43f union
-
58:43 - 58:49the single best feature
-
58:49 - 58:56found in that step; okay?
-
58:58 - 59:00And finally, you would
-
59:00 - 59:07output the best hypothesis found; okay? So the
-
59:07 - 59:14
-
59:17 - 59:20forward selection procedure is: you start with the empty set of features.
-
59:20 - 59:22And then on each
-
59:22 - 59:26iteration, take each of
-
59:26 - 59:29your features that isn't already in your set script f and you try adding that
-
59:29 - 59:30feature to your set. Then
-
59:30 - 59:34you train them all, and evaluate them all, using
-
59:34 - 59:35cross validation.
-
59:35 - 59:39And basically, figure out what is the best single feature to add to your
-
59:39 - 59:41set
-
59:41 - 59:44script f. In step two here, you go ahead and add
-
59:44 - 59:47that feature to your set script f, and you repeat; right?
-
59:47 - 59:51And when I say best feature or best model here by best, I really mean the
-
59:51 - 59:54best model according to hold out cross validation.
-
59:54 - 59:56By best, I really mean
-
59:56 - 60:01the single feature addition that results in the lowest hold out
-
60:01 - 60:02cross validation error or the
-
60:02 - 60:03lowest cross validation error.
-
60:03 - 60:04So you do this
-
60:04 - 60:08adding one feature at a time.
-
60:08 - 60:12When do you terminate this? Well, if you've
-
60:12 - 60:15added all the features to f, so f is now
-
60:15 - 60:18the entire set of features; you can terminate this.
-
60:18 - 60:19Or if
-
60:19 - 60:22by some rule of thumb, you know that you
-
60:22 - 60:24probably don't ever want more than
-
60:24 - 60:27k features, you can also terminate
-
60:27 - 60:30this if f has already exceeded some threshold number of features. So maybe
-
60:30 - 60:31if
-
60:31 - 60:34you have 100 training examples, and you're fitting logistic regression,
-
60:34 - 60:39you probably know you won't want more than 100 features. And so
-
60:39 - 60:41you stop after you have
-
60:41 - 60:46100 features added to set f; okay?
-
60:46 - 60:50And then finally, having done this, output the best hypothesis found; again, by
-
60:50 - 60:52best, I mean,
-
60:52 - 60:55when running this algorithm, you'd be seeing lots of hypotheses. You'd be training lots of
-
60:55 - 60:58hypotheses, and testing them using cross validation.
-
60:58 - 61:02So when I say output best hypothesis found, I mean
-
61:02 - 61:05of all of the hypotheses you've seen during this entire procedure,
-
61:05 - 61:07pick the one with the lowest
-
61:07 - 61:10cross validation error that you saw; okay?
-
61:10 - 61:17So that's forward selection. So
-
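For reference, here is a rough sketch of forward selection in Python (my own illustration, not code from the lecture), assuming a numpy feature matrix X, labels y, and scikit-learn's cross_val_score with logistic regression standing in for the learning algorithm:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features=None):
        """Greedy forward search over feature subsets, scored by k-fold CV."""
        n = X.shape[1]
        max_features = max_features or n
        selected = []                     # the set "script f" from the lecture
        best_overall = (-np.inf, [])      # best (CV score, feature subset) seen so far

        while len(selected) < max_features:
            candidates = [i for i in range(n) if i not in selected]
            scores = []
            for i in candidates:
                cols = selected + [i]     # try adding feature i to f
                score = cross_val_score(LogisticRegression(max_iter=1000),
                                        X[:, cols], y, cv=5).mean()
                scores.append((score, i))
            best_score, best_i = max(scores)
            selected.append(best_i)       # f := f union {best single feature}
            if best_score > best_overall[0]:
                best_overall = (best_score, list(selected))

        # Return the best subset seen anywhere during the search.
        return best_overall

The greedy loop adds one feature per pass and remembers the best cross-validation score seen anywhere, which is what "output the best hypothesis found" refers to.
-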
61:35 - 61:40let's see, just to give this a name, this is an instance of what's called
-
61:40 - 61:43
-
61:43 - 61:46wrapper feature
-
61:46 - 61:48selection.
-
61:48 - 61:51And the term wrapper comes
-
61:51 - 61:52from the fact that
-
61:52 - 61:55this feature selection algorithm that I just described is a forward selection or forward
-
61:55 - 61:57search.
-
61:57 - 61:59It's a piece of software that
-
61:59 - 62:02you write that wraps around your learning algorithm. In the sense that
-
62:02 - 62:04to perform forward selection,
-
62:04 - 62:08you need to repeatedly make calls to your learning algorithm
-
62:08 - 62:09to train
-
62:09 - 62:12your model,
-
62:12 - 62:15using different subsets of features; okay? So this is called wrapper model
-
62:15 - 62:17feature selection.
-
62:17 - 62:21And it tends to be somewhat computationally expensive because as you're
-
62:21 - 62:23performing the search process,
-
62:23 - 62:26you're repeatedly training your learning algorithm over and over and over on all of
-
62:26 - 62:30these different subsets of features. Let's
-
62:30 - 62:37just mention also, there is a variation of this called backward search or
-
62:39 - 62:41backward selection,
-
62:41 - 62:44which is where you start with
-
62:44 - 62:47f equals
-
62:47 - 62:54the entire set of features, and you delete features one at a time;
-
62:59 - 63:01okay?
-
63:01 - 63:04
-
63:04 - 63:07So that's backward search or backward selection.
-
63:07 - 63:08And
-
63:08 - 63:13this is another feature selection algorithm that you might use. Part of
-
63:13 - 63:15whether this makes sense is
-
63:15 - 63:17really there will be problems where
-
63:17 - 63:21it really doesn't even make sense to initialize f to be the set of all features.
-
63:21 - 63:24So if you have 100 training examples
-
63:24 - 63:26and 10,000 features,
-
63:26 - 63:29which may well happen
-
63:29 - 63:31say, 100 emails and
-
63:31 - 63:3510,000 features per email then
-
63:35 - 63:38depending on the learning algorithm you're using, it may or may not make sense to
-
63:38 - 63:39initialize
-
63:39 - 63:43the set f to be all features, and train a model using all the features. And if it
-
63:43 - 63:46doesn't make sense to train a model using all the features, then
-
63:46 - 63:52forward selection would be more common.
-
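And a matching sketch of backward selection (again my own illustration, under the same assumptions as the forward-selection sketch above): start from the full feature set and greedily delete the feature whose removal hurts cross-validation the least.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def backward_selection(X, y, min_features=1):
        """Greedy backward search: start with all features, delete one at a time."""
        selected = list(range(X.shape[1]))        # script f starts as all features
        best_overall = (cross_val_score(LogisticRegression(max_iter=1000),
                                        X, y, cv=5).mean(), list(selected))
        while len(selected) > min_features:
            scores = []
            for i in selected:
                cols = [j for j in selected if j != i]    # try deleting feature i
                score = cross_val_score(LogisticRegression(max_iter=1000),
                                        X[:, cols], y, cv=5).mean()
                scores.append((score, i))
            best_score, drop_i = max(scores)
            selected.remove(drop_i)
            if best_score > best_overall[0]:
                best_overall = (best_score, list(selected))
        return best_overall                        # (best CV score, best subset seen)

-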
63:52 - 63:54So
-
63:54 - 63:57let's see. Wrapper model feature selection algorithms tend to work
-
63:57 - 63:58well.
-
63:58 - 64:02And in particular, they actually often work better than a different class of algorithms I'm gonna talk
-
64:02 - 64:06about now. But their main disadvantage is that they're computationally very
-
64:06 - 64:09expensive.
-
64:09 - 64:16Do you have any questions about this
-
64:16 - 64:19before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes,
-
64:19 - 64:23you're actually right. So forward search and backward search, both of these are search heuristics,
-
64:23 - 64:24and
-
64:24 - 64:28but for either of these you cannot guarantee they'll find the best
-
64:28 - 64:29subset of features. It
-
64:29 - 64:31actually turns out that
-
64:31 - 64:35under many formalizations of the feature selection problem, it actually turns out to be an NP-
-
64:35 - 64:40hard problem to find the best subset of features.
-
64:40 - 64:44But in practice, forward selection and backward selection work fine,
-
64:44 - 64:47and you can also envision other search algorithms where you sort of
-
64:47 - 64:50have other methods to search through the space of two to the n possible feature
-
64:50 - 64:57subsets. So
-
65:01 - 65:04let's see.
-
65:04 - 65:06Wrapper feature selection
-
65:06 - 65:10tends to work well when you can afford to do it
-
65:10 - 65:12computationally.
-
65:12 - 65:14But for problems
-
65:14 - 65:18such as text classification, it turns out that specifically
-
65:18 - 65:22because you have so many features, and can easily have 50,000 features,
-
65:22 - 65:26forward selection would be very, very expensive.
-
65:26 - 65:28So there's a different class of algorithms that
-
65:28 - 65:31will give you something that
-
65:31 - 65:35tends not to do as well in the sense of generalization error. So you tend to learn the
-
65:35 - 65:37hypothesis that works less well,
-
65:37 - 65:40but is computationally much less expensive.
-
65:40 - 65:44And these are called
-
65:44 - 65:47the
-
65:47 - 65:50filter feature selection methods.
-
65:50 - 65:53And the basic idea is
-
65:53 - 65:54
-
65:54 - 66:01that for each feature i, we will
-
66:01 - 66:04compute
-
66:04 - 66:11some measure
-
66:11 - 66:17of how informative
-
66:17 - 66:20xi
-
66:20 - 66:25is about y; okay?
-
66:25 - 66:29And to do this, we'll use some simple
-
66:29 - 66:33heuristics; for every feature we'll just try to compute some rough estimate or
-
66:33 - 66:40compute
-
66:41 - 66:48some measure of how informative
-
66:50 - 66:54xi is about
-
66:54 - 66:59y. So there are many ways you can do this. One way you can choose is to just compute the
-
66:59 - 67:01correlation between xi and y. And just for each of
-
67:01 - 67:05your features just see how correlated this is with
-
67:05 - 67:07your class label y.
-
67:07 - 67:13And then you just pick the top k most correlated features.
-
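As a minimal sketch of that idea (my own, not from the lecture), assuming X is a numpy feature matrix and y the label vector:

    import numpy as np

    def top_k_by_correlation(X, y, k):
        """Filter feature selection: rank features by |corr(x_i, y)| and keep the top k."""
        scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1])   # correlation of feature i with y
                           for i in range(X.shape[1])])
        return np.argsort(scores)[::-1][:k]                     # indices of the k highest scores

-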
67:13 - 67:20Another way
-
67:46 - 67:52to do this
-
67:52 - 67:55for the case of text classification, there's one other method, which especially
-
67:55 - 67:57for this kind of feature, I guess
-
67:57 - 67:59there's one other
-
67:59 - 68:01informative measure
-
68:01 - 68:04that's used very commonly, which is called mutual information. I'm going to
-
68:04 - 68:11tell you
-
68:11 - 68:15some of these ideas in problem sets, but I'll
-
68:15 - 68:18just say this very briefly.
-
68:18 - 68:22So the mutual information between feature xi and y
-
68:22 - 68:28I'll just write out the definition, I guess.
-
68:28 - 68:32Let's say this is text classification, so xi can take on two values, 0, 1;
-
68:32 - 68:36the mutual information between xi and y is defined as a sum over all possible values of
-
68:36 - 68:37xi,
-
68:37 - 68:41and a sum over all possible values of
-
68:41 - 68:46y, of the joint distribution p(xi, y)
-
68:46 - 68:52
-
68:52 - 68:53times log of p(xi, y) over p(xi) times p(y).
-
68:53 - 68:54Where
-
69:00 - 69:04all of these distributions, so the joint distribution over xi and y,
-
69:04 - 69:11you would estimate from your training data
-
69:13 - 69:14all of these things you would use, as well. You would estimate
-
69:14 - 69:15from
-
69:15 - 69:20the training data what's the probability that xi is 0, what's the probability that xi is one, what's the
-
69:20 - 69:27probability that xi is 0 and y is 0, that xi is one and y is 0, and so on. So it
-
69:28 - 69:31turns out
-
69:31 - 69:35there's a standard information theoretic measure of how different
-
69:35 - 69:37probability distributions are.
-
69:37 - 69:40And I'm not gonna prove this here. But it turns out that
-
69:40 - 69:46this mutual information is
-
69:46 - 69:52actually
-
69:52 - 69:54an instance of the standard
-
69:54 - 69:58measure of how different two distributions are, called the K-L divergence.
-
69:58 - 70:00When you take a class in information theory,
-
70:00 - 70:03you'll have seen the concepts of mutual information and the K-L divergence, but if you
-
70:03 - 70:06haven't, don't worry about it. Just
-
70:06 - 70:09the intuition is there's something called K-L divergence that's a formal measure
-
70:09 - 70:13of how different two probability distributions
-
70:13 - 70:17are. And mutual information is a measure for how different
-
70:17 - 70:20the joint distribution is
-
70:20 - 70:22of x and y;
-
70:22 - 70:23from the distribution
-
70:23 - 70:26you would get if you were to assume they were independent;
-
70:26 - 70:27okay? So
-
70:27 - 70:29if x and y were independent, then
-
70:29 - 70:34p of x, y would be equal to p of x times p of y.
-
70:34 - 70:37And so, you know, the joint distribution and the product distribution would be identical, and the
-
70:37 - 70:40K-L divergence would be 0.
-
70:40 - 70:44In contrast, if x and y were very non-independent in
-
70:44 - 70:47other words, if x and y are very informative about each other,
-
70:47 - 70:48then this K-L divergence
-
70:48 - 70:50will be large.
-
70:50 - 70:52And so mutual information is a formal measure of
-
70:52 - 70:55how non-independent x and y are.
-
70:55 - 70:58And if x and y are highly non-independent
-
70:58 - 70:59then that means that
-
70:59 - 71:02x will presumably tell you something about y,
-
71:02 - 71:06and so they'll have large mutual information. And
-
71:06 - 71:10this measure of information will tell you x might be a good feature. And you
-
71:10 - 71:12get to play with some of these ideas
-
71:12 - 71:15more in the problem sets. So I won't say much more about it.
-
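To make that concrete, here is a small sketch (my own illustration) of the mutual information score for one binary feature, with all of the probabilities estimated by simple counting from the training set, exactly as described above:

    import numpy as np

    def mutual_information(xi, y):
        """MI(x_i; y) for binary x_i and y, with p(.) estimated from empirical counts."""
        xi, y = np.asarray(xi), np.asarray(y)
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_ab = np.mean((xi == a) & (y == b))    # joint p(x_i = a, y = b)
                p_a = np.mean(xi == a)                  # marginal p(x_i = a)
                p_b = np.mean(y == b)                   # marginal p(y = b)
                if p_ab > 0:                            # treat 0 * log 0 as 0
                    mi += p_ab * np.log(p_ab / (p_a * p_b))
        return mi

-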
71:17 - 71:19And what you do then is having chosen
-
71:19 - 71:24some measure like correlation or mutual information or something else, you
-
71:24 - 71:29then pick the top
-
71:29 - 71:34k features; meaning that you compute the correlation between xi and y for all the
-
71:34 - 71:37features, or the mutual information between xi and y for all the features.
-
71:37 - 71:40And then you include in your learning algorithm the
-
71:40 - 71:43k features with the largest correlation with the label, or
-
71:43 - 71:46the largest mutual information with the label, whatever.
-
71:46 - 71:52And to choose
-
71:52 - 71:55k,
-
71:55 - 71:58you can actually use cross validation, as well; okay?
-
71:58 - 71:59So you would
-
71:59 - 72:03take all your features, and sort them in decreasing order of mutual
-
72:03 - 72:04information.
-
72:04 - 72:08And then you'd try using just the top one feature, the top two features, the top
-
72:08 - 72:10three features, and so on.
-
72:10 - 72:14And you decide how many features to include
-
72:14 - 72:17using cross validation; okay? Or you
-
72:17 - 72:24can sometimes you can just choose this by hand, as well. Okay. Questions about
-
72:24 - 72:31this? Okay.
-
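And a sketch of that last step (my own illustration, reusing scikit-learn's cross_val_score and a vector of per-feature scores such as the mutual information above): sort the features by score and let cross validation pick how many of the top-ranked ones to keep.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def choose_k_by_cv(X, y, scores):
        """Given one score per feature, pick how many top-ranked features to keep via CV."""
        order = np.argsort(scores)[::-1]          # features in decreasing order of score
        best = (-np.inf, 1)
        for k in range(1, len(order) + 1):
            cols = order[:k]                      # the top-k ranked features
            cv = cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, cols], y, cv=5).mean()
            if cv > best[0]:
                best = (cv, k)
        return best                               # (best CV score, chosen k)

-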
72:35 - 72:36Cool. Great. So
-
72:36 - 72:41next lecture I'll continue - I'll wrap up with Bayesian model selection, but let's close
-
72:41 - 72:41here for today.