This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of the bias-variance tradeoff and of underfitting and overfitting that we've been seeing in the previous lecture and that we'll see in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class — if H is a set of k hypotheses — and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation, just dropping constants, I can also write this as that; okay?

So just to quickly remind you of what all of the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class of script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And so this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h [inaudible] misclassifying a new example drawn from the same distribution as the training set.
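For reference — the formulas referred to here as "this" and "that" were written on the board and aren't captured in the transcript. The following is the standard statement of the result being recapped, matching the quantities read out below:

```latex
\text{With probability at least } 1-\delta:\qquad
\varepsilon(\hat{h}) \;\le\; \Big(\min_{h\in\mathcal{H}} \varepsilon(h)\Big) + 2\gamma,
\quad\text{provided}\quad
m \;\ge\; \frac{1}{2\gamma^{2}}\log\frac{2k}{\delta}
\;=\; O\!\Big(\frac{1}{\gamma^{2}}\log\frac{k}{\delta}\Big).
```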
And so this says that in order to guarantee that the generalization error of the hypothesis h [inaudible] output by empirical risk minimization is less than or equal to the best possible generalization error attainable in your hypothesis class, plus two times gamma — two times this error threshold — and to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this; okay? One over two gamma squared, log two k over delta, where again, k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on something — on the error; okay? So this is a sample complexity result.

So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. So here, we said that the set script H is just k specific functions, whereas you often want to use a model like logistic regression, which is actually parameterized by real numbers. So I'm actually first going to give an argument that's sort of formally broken — just technically somewhat broken — but conveys useful intuition. And then I'll give the more correct argument, but without proving it, as the full proof is somewhat involved. So here's a somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression.

So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script H is parameterized by d real numbers; okay? So for example, if you're applying logistic regression over [inaudible], then d would be n plus one — logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

When you think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And so if you use, like, a normal standard computer, it normally will represent real numbers by what's called double-precision floating point numbers.
And what that means is that each real number is represented by a 64-bit representation; right? So really — you know what floating point is in a computer — a 64-bit floating point number is what almost all of us use routinely. And so this, parameterized by d real numbers, is really as if it's parameterized by 64 times d bits. Computers can't represent real numbers exactly; they can only represent these bit strings. And so the size of your hypothesis class, in your computer representation — you have 64 times d bits that you can flip. And so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits.

And so this is why it's important that we had log k there; right? So k is, therefore, two to the 64 d. And if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared, log of two to the 64 d over delta — which is that; okay?

So just to be clear, this is the same sort of sample complexity question as we had before: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, you know, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this. In other words, your number of training examples has to be on the order of d over gamma squared, log one over delta; okay? And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class — that m is [inaudible] on the order of something linear [inaudible]. That intuition is actually roughly right.
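Written out — a reconstruction of the board formula, obtained by plugging k equal to two to the 64 d into the finite-class bound above:

```latex
m \;\ge\; O\!\Big(\frac{1}{\gamma^{2}}\log\frac{2^{64d}}{\delta}\Big)
  \;=\; O\!\Big(\frac{d}{\gamma^{2}}\log\frac{1}{\delta}\Big)
  \;=\; O_{\gamma,\delta}(d).
```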
I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers. So let me actually go ahead and outline the "right way" to show this more formally; all right? And it turns out the "right way" to show this more formally involves a much longer argument — the proof is extremely involved — so I'm just actually going to state the result and not prove it. Full proofs can be found in sources on learning theory bounds for infinite hypothesis classes.

Here's the definition: given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay? And what I mean by realizing any labeling on it — the informal way of thinking about this is: if a hypothesis class shatters the set S, what that means is that I can take these d points, and I can associate these d points with any set of labels y; right? So choose any set of labels y for each of these d points. And if your hypothesis class shatters S, then that means there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means.

So let me just illustrate this with an example. Let's say H is the class of all linear classifiers in 2D, and let's say that S is this [inaudible] comprising two points; okay? So there are four possible labelings for these two points: you can choose to label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative. And if the hypothesis class H is the class of all linear classifiers in 2D, then for each of these training sets I can find a linear classifier that attains zero training error on each of these — on all possible labelings of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?
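As a quick aside (this sketch is not from the lecture): one way to check shattering by brute force is to enumerate every labeling of the points and test each for linear separability, for example with a linear-programming feasibility check. The use of numpy/scipy here is just an assumption made for illustration.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """True if labels y in {-1, +1} on points X can be realized by some linear
    classifier sign(w.x + b). Feasibility LP: y_i * (w.x_i + b) >= 1 for all i."""
    m, n = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])   # rows: -y_i * [x_i, 1]
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.success

def shattered_by_linear_classifiers(X):
    """True if every possible labeling of the points in X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=X.shape[0]))

# A generic pair (or triple) of points in 2D is shattered; this square of four is not.
print(shattered_by_linear_classifiers(np.array([[0.0, 0.0], [1.0, 1.0]])))            # True
print(shattered_by_linear_classifiers(np.array([[0.0, 0], [1, 0], [0, 1], [1, 1]])))  # False
```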
One more example — let me show you a larger one. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And so for these three points, I have eight possible labelings, and once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again, I'd say, by definition, that my hypothesis class also shatters this set S. Student: Right. Instructor (Andrew Ng): And that's that terminology — H can realize any labeling on S. That's obviously [inaudible]. Give it any set of labels, and you can find a hypothesis that perfectly separates the positive and negative examples; okay?

So how about this set? Suppose S is now this set of four points. Then, you know, there are lots of labelings — there are now 16 labelings we can choose for this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings. But there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out that you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?

So here's another definition. It's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC of script H, is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.

So just as a kind of good example: if H is the class of all linear classifiers in 2D, then the VC dimension of this class is equal to three, because we saw just now that there was a set S of size three that it could shatter — and I won't really prove it,
but it turns out there is no set of size four that it can shatter. And therefore, the VC dimension of this class is three. Yeah? Student: But there are sets of size three that it cannot shatter, right? [Inaudible] was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this — it's actually a set S — then there are labelings on this it cannot realize, and so H cannot shatter this set. But that's okay, because there definitely exists some other set of size three that is shattered. So the VC dimension is three. And there is no set of size four that can be shattered. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition, no. Right. So again, let's see, I can choose my set S to be a set of three points that are all overlapping — three points in exactly the same place. And clearly, I can't shatter this set, but that's okay. And I can't shatter this set, either, but that's okay, because there are some other sets of size three that I can shatter.

And it turns out this result holds true more generally: in n dimensions, the VC dimension of the class of linear classifiers is equal to n plus one. Okay? So this is in [inaudible] — if you have linear classifiers over an n-dimensional feature space, the VC dimension is equal to n plus one.

So maybe you want to write this down — what is arguably the best-known result in all of learning theory, I guess — which is that: let a hypothesis class be given, and let the VC dimension of H be equal to d. Then, with probability at least one minus delta, we have the following. The formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later.
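For reference, the formula being pointed at on the board, in the usual statement of this theorem, is:

```latex
\text{For all } h \in \mathcal{H}:\qquad
\big|\varepsilon(h) - \hat{\varepsilon}(h)\big|
\;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right),
\qquad\text{and hence}\qquad
\varepsilon(\hat{h}) \;\le\; \varepsilon(h^{*})
\;+\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right).
```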
But the key to this result is that if you have a hypothesis class with VC dimension d — and now this can be an infinite hypothesis class — what Vapnik and Chervonenkis show is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that for all hypotheses h — for all the hypotheses in your hypothesis class — the generalization error of h minus the training error of h, the difference between these two things, is bounded above by some complicated formula like this; okay? And thus, with probability at least one minus delta, we also have the same sort of thing; okay?

And going from this step to this step is actually something that you saw yourself — that we actually proved earlier. Because, you remember, in the previous lecture we proved that if you have uniform convergence, then that implies — actually, we showed that if generalization error and training error are close to each other, within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really: generalization error of h [inaudible] best possible generalization error plus two times gamma. And there are just two constants in front here that I've absorbed into the big-O notation.

So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that in order to guarantee that this holds with probability at least one minus delta — with probability at least one minus delta, I should say — it suffices that, and I'm going to write it this way: m equals big-O of d, and I'm going to put gamma and delta in as a subscript there to denote that.
Let's see — if we treat gamma and delta as constants, so that I'm allowed to absorb terms that depend on gamma and delta into the big-O notation, then in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay?

So let's see. What we conclude from this is that for empirical risk minimization algorithms — in other words, less formally, for learning algorithms that try to minimize training error — the intuition to take away is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?

It turns out that for most reasonable hypothesis classes, the VC dimension is sort of very similar, I guess, to the number of parameters of your model. So for example, take logistic regression for linear classification: in n dimensions, logistic regression has n plus one parameters, and the VC dimension of the class of linear classifiers is also n plus one. So it turns out that for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or at worst some low-order polynomial in the number of parameters of your model. And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be, let's say, roughly linear in the number of parameters of your model; okay?

There are some somewhat strange examples where what I just said is not true — some strange examples where you have very few parameters, but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate — somewhat unusual, and not the sort of learning algorithms you usually use.

Let's see, just a couple of other things.
It turns out that — so this result shows that sample complexity is upper bounded by the VC dimension. But it also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, say, then if the number of training examples you have is less than on the order of the VC dimension, it is not possible to prove this bound. So I guess, in the worst case, sample complexity — the number of training examples you need — is both upper bounded and lower bounded by the VC dimension.

Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness — like, finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng): Let's see — the proof does not, no. I've actually stated the entirety of the theorem; this is true. It turns out, in the proof — well, somewhere in the proof there's a construction called an epsilon net, which is a very clever [inaudible]. But regardless, it is not an assumption that you need. In some way, the proof has one step that uses a very clever [inaudible] to prove this. But that's not needed as an assumption.

I'd like to say: back when I was a Ph.D. student, when I was working through this proof, there was sort of a solid week where I would wake up and go to the office at 9:00 a.m.; then I'd start reading the book that led up to this proof, and I'd read from 9:00 a.m. to 6:00 p.m.; and then I'd go home, and the next day I'd pick up where I left off. And it sort of took me a whole week that way to understand this proof, so I thought I would inflict that on you.

Just to tie up a couple of loose ends: what I'm about to do is mention a few things that will maybe feel a little bit like random facts, but I'm just going to tie up a couple of loose ends.
And so, let's see — it turns out that, just so it registers more strongly with you: this bound was proved for an algorithm that uses empirical risk minimization — for an algorithm that minimizes 0-1 training error. So one question that some of you asked is: how about support vector machines; right? How come SVMs don't overfit? Recall, in our discussion of support vector machines, we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: n plus one, and n is infinite.

So it turns out that the class of linear separators with large margin actually has low VC dimension. I want to say this very quickly and informally — it's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin [inaudible]. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin — say, with a margin of at least gamma; okay? And so I won't allow a line that comes too close — like, I won't allow that line, because it comes too close to one of my points.

It turns out that if I consider my data points, which all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, plus one; okay? So this funny symbol here just means rounding up — this is a ceiling symbol; it means rounding up x. And it turns out you can prove — and there are some strange things about this result that I'm deliberately not going to talk about — but it turns out they can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded.
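Written out, the bound just described (with the ceiling applied to the first term, as read out above):

```latex
\mathrm{VC}(\mathcal{H}_{\gamma})
\;\le\; \left\lceil \frac{R^{2}}{4\gamma^{2}} \right\rceil + 1,
\qquad \mathcal{H}_{\gamma} = \{\text{linear separators with margin at least } \gamma
\text{ on points lying in a ball of radius } R\}.
```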
The surprising thing about this is that this is a bound on VC dimension that has no dependence on the dimension of the points x. So in other words, your data points x can live in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large-margin separator — in trying to find the line that separates your positive and your negative examples with a large margin — it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension. And therefore, it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, suppose you have infinite — actually, these are countably infinite-dimensional vectors, not [inaudible] infinite-dimensional vectors. The norm of x squared is equal to the sum over i of xi squared, so if x is infinite-dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again. [Inaudible]. Yes — although I assume that this is bounded by R. Oh, yeah — so this ensures that it converges. So that's just something people sometimes wonder about.

And lastly, let me actually tie empirical risk minimization back a little more strongly to the sort of algorithms we've talked about. It turns out that — so the theory so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function — you know, it's zero here, jumps to one, and it looks like that. And so, for this one training example, this may be the indicator of h [inaudible] of x not equal to y, plotted against z, where z equals theta transpose x; okay?
But for one training example — your training example will be positive or negative, and depending on what the value of this theta transpose x is, you either get it right or wrong. And so, you know, I guess if you have a positive example, then when z is positive, you get it right. Suppose you have a negative example, so y equals 0; right? Then if z, which is theta transpose x — if this is positive, then you will get this example wrong; whereas if z is negative, then you'd get this example right. And so this is a plot of the indicator of h subscript [inaudible] of x not equal to y — where, you know, h of x is equal to g of theta transpose x; okay?

And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on your training example, and so you'd like this indicator — h of x not equal to y — you'd like this indicator function to be 0. It turns out this step function is clearly a non-convex function, and so it turns out that, even just for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem.

And in particular — let me draw a function like that — it turns out that logistic regression is trying to maximize likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out it'll be a function that looks like this. And this line that I just drew — you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, since you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where, instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve lying above it.
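As a small numeric sketch (not from the lecture, and assuming the usual convention of plotting the per-example losses against z = theta transpose x for a positive example): the 0-1 step function, the logistic loss that logistic regression effectively minimizes, and the hinge loss used by the SVM, which comes up next.

```python
import numpy as np
import matplotlib.pyplot as plt

# z = theta^T x for a positive (y = 1) training example.
z = np.linspace(-3, 3, 601)

zero_one = (z <= 0).astype(float)       # step function: 1{h(x) != y}
logistic = np.log(1.0 + np.exp(-z))     # minus log-likelihood for y = 1
hinge    = np.maximum(0.0, 1.0 - z)     # hinge loss used by the SVM

plt.plot(z, zero_one, label="0-1 loss (step)")
plt.plot(z, logistic, label="logistic loss")
plt.plot(z, hinge, label="hinge loss")
plt.xlabel(r"$z = \theta^T x$")
plt.ylabel("loss on one example")
plt.legend()
plt.show()
```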
So by approximating it this way, you have a convex optimization problem, and you can find the maximum-likelihood estimate of the parameters for logistic regression. And it turns out the support vector machine can also be viewed as approximating this function, too — only a little bit differently. Let's see — the support vector machine, it turns out, can be viewed as trying to approximate this step function with a different approximation that's sort of flat [inaudible] there, and then goes up as a linear function there. And that is called the hinge loss. And so you can think of logistic regression and the support vector machine as different approximations to try to minimize this step function; okay?

And that's why, I guess, all the theory we developed — even though SVMs and logistic regression aren't exactly doing empirical risk minimization — the theory we developed often gives completely appropriate intuitions for SVMs and logistic regression; okay. So that was the last of the loose ends. And if you didn't get this, don't worry too much about it; it's a high-level message. It's just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.

What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool.

Okay. So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a trade-off between bias and variance. And in particular, it is important not to choose a hypothesis that's either too simple or too complex. So if your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit — you have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, you have high variance; you would overfit the data, and you'd also fail to generalize well.
So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? So remember the cartoon I drew last time of generalization error? I drew this last time, where on the x-axis was model complexity, meaning the degree of the polynomial, the [inaudible] of the regression function, or whatever. And if you have too simple a model, you have high generalization error — that's underfitting. And if you have too complex a model, like a 14th or 15th-degree polynomial fit to five data points, then you also have high generalization error, and you're overfitting.

So what I want to do now is actually just talk about model selection in the abstract; all right? Some examples of model selection problems will include — well, I'll run with the example of, let's say, you're trying to choose the degree of a polynomial; right? What degree polynomial do you want to choose? Another example of a model selection problem would be if you're trying to choose the parameter [inaudible], which was the bandwidth parameter in locally weighted linear regression, or in some sort of locally weighted regression. Yet another model selection problem is if you're trying to choose the parameter C in the [inaudible] SVM; right? In the l1-norm soft margin SVM, we had this optimization objective; right? And the parameter C controls the tradeoff between how much you want a large margin versus how much you want to penalize misclassified [inaudible] examples. So these are three specific examples of model selection problems. And let's come up with a method for automatically choosing among them.

Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay?
Or you may also take the bandwidth parameter [inaudible] and discretize it into a range of values, and you're trying to choose among that discrete set of values. So let's talk about how you would select an appropriate model; all right?

Well, one thing you could do is: you can take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea — and why's that? Right. Cool. Because it would overfit; right. And some of you are laughing that I asked that. So that'd be a terrible idea — to choose a model by looking at your training error — because, well, obviously, you'd end up choosing the most complex model; right? You'd choose a 10th-degree polynomial because that's what fits the training set.

So when it comes to model selection on a training set, there are several standard procedures to do this. One is hold-out cross validation. In hold-out cross validation, we take the training set and we randomly split it into two subsets — take all the data you have and randomly split it into two subsets — and we'll call them the training subset and the hold-out cross validation subset. And then, you know, you train each model on just the training subset, and test it on your hold-out cross validation set. And you pick the model with the lowest error on the hold-out cross validation subset; okay?

So this is a relatively straightforward procedure, and it's commonly used: you train on 70 percent of the data, then test all of your models on the other 30 percent, and you can take whichever has the smallest hold-out cross validation error. And after this, you actually have a choice. Having taken all of these hypotheses trained on 70 percent of the data, you can actually just output the hypothesis that has the lowest error on your hold-out cross validation set. And optionally, you can actually take the model that you selected and go back and retrain it on all 100 percent of the data; okay? So both versions are actually done and used really often.
You can either, you know, just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can use this to, say, choose the degree of the polynomial you want to fit, and then go back and retrain the model on the entire 100 percent of your data. And both of these are commonly done.

So hold-out cross validation sort of works straightforwardly. But sometimes — we're often working with a company or an application or something — in many machine learning applications we have very little data, or, you know, every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired by medical experiments, and each training example represents a patient and some amount of physical human pain or something. So if you go and say, well, I'm going to hold out 30 percent of your data set just to select my model, that sometimes causes unhappiness, and so maybe you want to not have to leave out 30 percent of your data just to do model selection.

So there are a couple of other variations on hold-out cross validation that sometimes make slightly more efficient use of the data. One is called k-fold cross validation. And here's the idea: I'm going to take all of my data S — so imagine I'm going to draw this box S to denote the entirety of all the data I have — and I'll then divide it into k pieces; that's five pieces in what I've drawn. Then what I'll do is I will repeatedly train on k minus one pieces and test on the remaining one — test on the remaining piece, I guess; right? And then you average over the k results.

In other words, I will hold out, say, just 1/5 of my data, and I'll train on the remaining 4/5 and test on the first piece. Then I'll go and hold out the second 1/5 from my [inaudible], train on the remaining pieces, and test on this one. Then you remove the third piece, train on the remaining 4/5, and so on — I'm going to do this five times.
And then I'll take the five error measures I have and I'll average them. And this then gives me an estimate of the generalization error of my model; okay? And then, again, when you do k-fold cross validation, usually you then go back and retrain the model you selected on the entirety of your training set.

So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say, k equals 10 is the fairly common choice — to do 10-fold cross validation. And the advantage of this over hold-out cross validation is that you split the data into ten pieces, so each time you're only holding out 1/10 of your data, rather than, you know, say, 30 percent of your data. I should say, in standard hold-out — in simple hold-out cross validation — a 70-30 split is fairly common; sometimes, like, a 2/3-1/3 split. And if you use k-fold cross validation, k equals 5, or more commonly k equals 10, is the most common choice.

The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model, you now need to train your model ten times instead of just once, and so you need to run logistic regression ten times per model, rather than just once. And so this is computationally more expensive. But k equals ten works great.

And then, finally, there's actually a version of this that you can take even further, which is when you set k equals m. And so that's when you take your training set and you split it into as many pieces as you have training examples. And this procedure is called leave-one-out cross validation. And what you do is: you take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but your third example, test on the third example, and so on.
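As a rough sketch of the three procedures just described — hold-out, k-fold, and leave-one-out cross validation — applied to choosing the degree of a polynomial. The synthetic data, the scikit-learn calls, and the specific candidate degrees are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.model_selection import (train_test_split, KFold,
                                     LeaveOneOut, cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=40)  # roughly quadratic data

# Candidate models: polynomial fits of degree 1 through 7.
models = {d: make_pipeline(PolynomialFeatures(d), LinearRegression()) for d in range(1, 8)}

# Simple hold-out cross validation: 70/30 split, pick the lowest hold-out error.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_err = {d: np.mean((m.fit(X_tr, y_tr).predict(X_ho) - y_ho) ** 2)
               for d, m in models.items()}
best_d = min(holdout_err, key=holdout_err.get)
final_model = models[best_d].fit(X, y)   # optionally retrain on 100% of the data

# k-fold (k = 10) and leave-one-out estimates of generalization error per model.
kfold_err = {d: -cross_val_score(m, X, y, cv=KFold(10, shuffle=True, random_state=0),
                                 scoring="neg_mean_squared_error").mean()
             for d, m in models.items()}
loo_err = {d: -cross_val_score(m, X, y, cv=LeaveOneOut(),
                               scoring="neg_mean_squared_error").mean()
           for d, m in models.items()}
print(best_d, min(kfold_err, key=kfold_err.get), min(loo_err, key=loo_err.get))
```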
And so with this many pieces, you're now making maybe even more effective use of your data than with k-fold cross validation. But leave one out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and run your learning algorithm on the other m minus one training examples, and you need to do this many times. Typically this is done only when you're extremely data scarce. So if you have a learning problem where you have, say, 15 training examples, very few training examples, then leave one out cross validation may be preferred.

Yeah? Student: Last time you proved that the difference between the generalization error and the training error is bounded in terms of the number of examples in your training set and the VC dimension. So maybe instead of splitting the examples into different groups, we could compute the training error and use that bound to estimate the generalization error? Instructor (Andrew Ng): Yeah, that's done, but in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds, and so, there are people in structural risk minimization who propose what you describe, but I personally tend not to do that. Questions on cross validation? Yeah. Student: This is kind of unrelated, but we spent all this time on how many data points you need to get within a certain margin of the best error. It seems like you'd be able to use that instead of doing this; more analytically, I guess. Instructor (Andrew Ng): Okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're proving a worst case upper bound that holds true even for very bad distributions. The bounds I proved just now, right, hold true for absolutely any probability distribution over training examples; right?
We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any probability distribution script D. And chances are, whatever real life distribution you get over, say, houses and their prices, is probably not as bad as the very worst one you could have gotten; okay? So it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability; how many training examples do you need? If you plug actual constants into the learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large numbers, like needing 10,000 training examples to fit ten parameters.

So a good way to think of these learning theory bounds, and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just ignore the constant factors, is that the bounds tend to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on, but I personally tend to use the bounds for intuition: for example, does the number of training examples you need grow linearly in the number of parameters, or in the VC dimension, or does it grow quadratically in the number of parameters? It's the shape of the bounds, the fact that the sample complexity is linear in the VC dimension, that's the sort of useful intuition you can get from these theorems. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you're working on. So did that answer your question? Student: Uh-huh. Instructor (Andrew Ng): Yeah.
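For reference, the VC-dimension sample complexity bound being alluded to here has roughly the following shape once the constants are dropped (stated here from the standard result, with big-O hiding the constants):

$$
m \;=\; O\!\left(\frac{1}{\gamma^{2}}\left(d\,\log\frac{d}{\gamma} \;+\; \log\frac{1}{\delta}\right)\right)
$$

where d is the VC dimension of the hypothesis class, gamma is the error tolerance, and 1 - delta is the confidence. The useful part is the roughly linear dependence on d, not the magnitude of the constants.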
And it turns out, by the way, a rule of thumb that I often use myself is: if you're trying to fit a logistic regression model with n, or n plus one, parameters, and the number of training examples is ten times your number of parameters, then you're probably in good shape; with that many training examples relative to the number of parameters, you're probably perfectly fine fitting that model. So those are the sorts of intuitions you can get from these bounds. Student: In cross validation, do we split the examples randomly? Instructor (Andrew Ng): Yes. By convention we usually make the train/test splits randomly.

One more thing I want to talk about for model selection is a special case of model selection called the feature selection problem. Here's the intuition: for many machine learning problems you may have a very high dimensional feature space, very high dimensional feature vectors x. For example, for text classification, and I want to use that text classification example of spam versus non-spam, you may easily have on the order of 30,000 or 50,000 features; I think I used 50,000 in my earlier examples. If you have that many features, then depending on what learning algorithm you use there may be a real risk of over fitting. And so if you can reduce the number of features, maybe you can reduce the variance of your learning algorithm and reduce the risk of over fitting. For the specific case of text classification, you can imagine that there's only a small number of truly relevant features. There are all these English words, and many of them probably don't tell you anything at all about whether the email is spam or non-spam: English function words like the, of, a, and are probably words that tell you nothing about whether an email is spam or non-spam. In contrast, there will be a much smaller number of features that are truly relevant to the learning problem. For example, if you see the word buy or Viagra, those are words that are very useful for telling spam from non-spam.
Or you see the word Stanford or machine learning or your own personal name; these are other words that are useful for telling you whether something is spam or non-spam. So in feature selection, we would like to select a subset of the features, hopefully the most relevant ones for a specific learning problem, so as to give ourselves a simpler hypothesis class to choose from, and therefore reduce the risk of over fitting, even when we may have had 50,000 features originally.

So how do you do this? Well, if you have n features, then there are two to the n possible subsets; right? Because each of your n features can either be included or excluded, so there are two to the n possibilities, and this is a huge space. So in feature selection, what we most commonly do is use various search heuristics, sort of simple search algorithms, to try to search through this space of two to the n possible subsets of features and find a good subset, since it's far too large a space to enumerate all possible feature subsets.

As a concrete example, here is the forward search algorithm; it's also called the forward selection algorithm. It's actually pretty simple, but I'll write it out; writing it out will make it look more complicated than it really is. It starts with: initialize the set script F to be the empty set. Then repeat: for i equals one to n, try adding feature i to the set script F and evaluate the model using cross validation; and by cross validation I mean any of the three flavors, be it simple hold out cross validation, k-fold cross validation, or leave one out cross validation. And then set F to be equal to F union the best feature found; okay?
And finally, you would output the best hypothesis found; okay? So the forward selection procedure is: start with the empty set of features. Then, on each iteration, take each of your features that isn't already in your set script F and try adding that feature to your set; you train and evaluate each of these candidates using cross validation, and figure out what is the best single feature to add to your set script F. In step two here, you go ahead and add that feature to your set script F. And when I say best feature or best model here, by best I really mean the best model according to cross validation: the single feature addition that results in the lowest hold out cross validation error, or the lowest cross validation error generally. So you do this adding one feature at a time.

As for when you terminate this: if you've added all the features to F, so F is now the entire set of features, you can terminate. Or, if by some rule of thumb you know you probably don't ever want more than k features, you can also terminate once F exceeds some threshold number of features. So maybe if you have 100 training examples and you're fitting logistic regression, you probably know you won't want more than 100 features, and so you stop after 100 features have been added to the set F; okay?

And then finally, having done this, output the best hypothesis found. Again, by best I mean: while running this algorithm you'll be training lots of hypotheses and testing them using cross validation; so when I say output the best hypothesis found, I mean of all the hypotheses you've seen during this entire procedure, pick the one with the lowest cross validation error that you saw; okay? So that's forward selection.
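As one possible sketch of forward selection in Python, where cv_error_fn is a placeholder that trains your model on a given feature subset and returns its cross validation error, using any of the three flavors of cross validation:

```python
def forward_selection(n_features, cv_error_fn, max_features=None):
    """Greedy forward search over feature subsets.

    cv_error_fn(features) -> cross validation error of the model trained
    using only the feature indices in `features` (a Python set).
    Returns the best feature subset seen anywhere during the search.
    """
    selected = set()                              # script F starts as the empty set
    best_err, best_subset = float("inf"), set()
    limit = max_features if max_features is not None else n_features
    while len(selected) < limit:
        # try adding each feature not already in F, evaluating by cross validation
        candidates = [(cv_error_fn(selected | {i}), i)
                      for i in range(n_features) if i not in selected]
        if not candidates:
            break
        err, best_i = min(candidates)             # best single feature to add
        selected.add(best_i)                      # F <- F union {that feature}
        if err < best_err:
            best_err, best_subset = err, set(selected)
    return best_subset
```

Note that every call to cv_error_fn retrains the model, which is what makes this kind of search computationally expensive.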
Just to give this a name, this is an instance of what's called wrapper feature selection. The term wrapper comes from the fact that this feature selection algorithm I just described, forward selection or forward search, is a piece of software you write that wraps around your learning algorithm, in the sense that to perform forward selection you need to repeatedly make calls to your learning algorithm to train your model using different subsets of features; okay? So this is called wrapper model feature selection, and it tends to be somewhat computationally expensive, because as you're performing the search process you're repeatedly training your learning algorithm over and over on all of these different subsets of features.

Let me also mention that there's a variation of this called backward search or backward selection, which is where you start with F equal to the entire set of features, and you delete features one at a time; okay? So that's backward search or backward selection, another feature selection algorithm you might use. Part of whether this makes sense is that there will be problems where it doesn't even make sense to initialize F to be the set of all features. If you have 100 training examples and 10,000 features, which may well happen, say 100 emails with 10,000 features per email, then depending on the learning algorithm you're using it may or may not make sense to initialize the set F to be all features and train a model using all of them. And if it doesn't make sense to train using all the features, then forward selection would be more common.

So, wrapper model feature selection algorithms tend to work well; in particular, they often work better than a different class of algorithms I'm going to talk about now. But their main disadvantage is that they're computationally very expensive. Do you have any questions about this before I talk about the other class? Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Yeah, yes, you're actually right.
So forward search and backward search, both of these are search heuristics, and for either of them you cannot guarantee they'll find the best subset of features. It actually turns out that under many formalizations of the feature selection problem, finding the best subset of features is an NP-hard problem. But in practice, forward selection and backward selection work fine, and you can also envision other search algorithms, other methods to search through this space of two to the n possible feature subsets.

So, wrapper feature selection tends to work well when you can afford to do it computationally. But for problems such as text classification specifically, because you have so many features, easily 50,000 features, forward selection would be very, very expensive. So there's a different class of algorithms that tends not to do as well in the sense of generalization error, so you tend to learn a hypothesis that works less well, but that is computationally much less expensive. These are called the filter feature selection methods. And the basic idea is that for each feature i, we'll compute some measure of how informative x_i is about y; okay? And to do this we'll use some simple heuristics: for every feature we'll just compute some rough estimate, some measure, of how informative x_i is about y. There are many ways you can do this. One choice is to just compute the correlation between x_i and y; for each of your features, see how correlated it is with your class label y, and then you just pick the top k most correlated features. Another way to do this, especially for the case of text classification, there's one other informativeness measure that's used very commonly, which is called mutual information.
You're going to play with some of these ideas in the problem sets, but I'll just say this very briefly. So the mutual information between feature x_i and y, I'll just write out the definition, I guess. Let's say this is text classification, so x_i can take on two values, 0 and 1. The mutual information between x_i and y is defined as a sum over the possible values of x_i and over the possible values of y:

MI(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}

where all of these distributions, the joint distribution over x_i and y as well as the marginals, you would estimate from your training data. You would estimate from the training data what's the probability that x_i is 0, what's the probability that x_i is 1, what's the probability that x_i is 0 and y is 0, that x_i is 1 and y is 0, and so on.

So it turns out there's a standard information theoretic measure of how different two probability distributions are, and I'm not going to prove this here, but it turns out this mutual information is actually that standard measure of how different two distributions are, which is called the K-L divergence. If you've taken a class in information theory, you'll have seen the concepts of mutual information and the K-L divergence; if you haven't, don't worry about it. The intuition is just that there's something called the K-L divergence that's a formal measure of how different two probability distributions are, and mutual information is a measure of how different the joint distribution of x and y is from the distribution you would get if you were to assume they were independent; okay? So if x and y were independent, then p(x, y) would be equal to p(x) times p(y); these two distributions would be identical, and the K-L divergence would be 0. In contrast, if x and y are very non-independent, in other words if x and y are very informative about each other, then this K-L divergence will be large. And so mutual information is a formal measure of how non-independent x and y are.
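Here's one possible sketch of estimating this from a training set for binary features and labels; the probabilities in the definition above are estimated by simple counting, and the small smoothing constant eps is an added assumption here, just to avoid taking the log of zero:

```python
import numpy as np

def mutual_information(x_col, y, eps=1e-12):
    """Estimate MI(x_i, y) for a binary feature column and binary labels."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x_col == a) & (y == b)) + eps   # estimate of p(x_i = a, y = b)
            p_x = np.mean(x_col == a) + eps                 # estimate of p(x_i = a)
            p_y = np.mean(y == b) + eps                     # estimate of p(y = b)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def rank_features_by_mi(X, y):
    """Return feature indices sorted from most to least informative about y."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]
```

You would then keep only the top k indices from this ranking, with k chosen by cross validation as described below.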
And if x and y are highly non-independent, that means x will presumably tell you something about y, and so they'll have large mutual information; this measure will tell you that x_i might be a good feature. You'll get to play with some of these ideas more in the problem sets, so I won't say much more about it.

What you do then is, having chosen some measure like correlation or mutual information or something else, you pick the top k features; meaning you compute the correlation between x_i and y, or the mutual information between x_i and y, for all the features, and then you include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, whichever measure you chose. And to choose k, you can actually use cross validation as well; okay? So you would take all your features and sort them in decreasing order of mutual information, and then you'd try using just the top one feature, the top two features, the top three features, and so on, and decide how many features to include using cross validation; okay? Or sometimes you can just choose this by hand as well. Okay. Questions about this? Okay. Cool. Great. So next lecture I'll continue; I'll wrap up Bayesian model selection, but let's stop here since we're close to the end.