This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of bias-variance tradeoffs — of underfitting and overfitting — that we've been seeing in the previous lecture and will see again in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time we have, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class — if H is a set of k hypotheses — and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation — just dropping constants — I can also write this as that; okay?

So just to quickly remind you of what all the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class script H. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h misclassifying a new example drawn from the same distribution as the training set. And so this says: in order to guarantee that the generalization error of the hypothesis h hat output by empirical risk minimization is less than or equal to the best possible generalization error attainable in your hypothesis class, plus two times gamma — two times this error threshold — we want to guarantee that this holds with probability at least one minus delta.
We showed that it suffices for your training set size m to be greater than or equal to this: one over two gamma squared, times log of two k over delta, where again k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay? So this is a sample complexity result.
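Written out, the recap above (reconstructed here from the spoken description; h-hat denotes the ERM output, epsilon the generalization error, and $k = |\mathcal{H}|$) says that with probability at least $1 - \delta$,

$$\varepsilon(\hat h) \;\le\; \Big(\min_{h \in \mathcal{H}} \varepsilon(h)\Big) + 2\gamma \qquad \text{provided} \qquad m \;\ge\; \frac{1}{2\gamma^2}\,\log\frac{2k}{\delta} \;=\; O\!\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big).$$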
So what I want to do now is take this result and try to generalize it to the case of infinite hypothesis classes. Here we assumed that the set script H is just k specific functions, but often you want to use a model like logistic regression, which is actually parameterized by real numbers.

So I'm actually first going to give an argument that's sort of formally broken — just technically somewhat broken — but that conveys useful intuition. And then I'll give the more correct argument, but without proving it, since the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression.

So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script H is parameterized by d real numbers; okay? For example, if you're applying logistic regression with n features, then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

Now think about how your hypothesis class is really represented in a computer. Computers use zero-one bits to represent real numbers. If you use a normal standard computer, it will usually represent real numbers by what's called double-precision floating point, and what that means is that each real number is represented by a 64-bit representation; right? A 64-bit floating point number is what almost all of us use routinely. And so a class parameterized by d real numbers is really, as it were, parameterized by 64 times d bits. Computers can't represent real numbers exactly; they only represent them to finite precision.

And so for the size of your hypothesis class, in your computer representation, you have 64 times d bits that you can flip, and the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits. And this is why it was important that we had log k there; right? So k is therefore two to the 64 d, and if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared times log of two to the 64 d over delta — which is that; okay?

So just to be clear, this is the same sort of sample complexity result as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this. In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay?

And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class — that m has to be on the order of something linear in the number of parameters. That intuition is actually roughly right. I'll say more about this later.
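As a quick numerical illustration of this "broken" argument (my own sketch, not from the lecture), here is the finite-class bound m ≥ (1/(2γ²)) log(2k/δ) evaluated with k = 2^(64d); the particular values of γ and δ are arbitrary, and the point is simply that the bound grows roughly linearly in d.

```python
import math

def sample_complexity(log2_k, gamma, delta):
    """m >= 1/(2 gamma^2) * log(2 k / delta), with k supplied via its base-2 logarithm
    so that k = 2**(64*d) stays representable for large d."""
    log_k = log2_k * math.log(2)  # natural log of k
    return (math.log(2) + log_k - math.log(delta)) / (2 * gamma ** 2)

gamma, delta = 0.1, 0.05
for d in [1, 10, 100]:                              # number of real-valued parameters
    m = sample_complexity(64 * d, gamma, delta)     # treat the class as 2**(64 d) hypotheses
    print(f"d = {d:4d}  ->  m >= {m:,.0f}")         # grows roughly linearly in d
```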
This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers. So let me actually go ahead and outline the "right way" to show this more formally; all right? It turns out the "right way" to show this more formally involves a much longer argument — the proof is extremely involved — so I'm just going to state the result and not prove it. The full proof draws on some of the deeper learning theory machinery for bounding infinite hypothesis classes.

Here's a definition. Given a set of d points, we say a hypothesis class H shatters the set S if H can realize any labeling on it; okay?

And what I mean by realizing any labeling on it — the informal way of thinking about this is: if a hypothesis class shatters the set S, that means I can take these d points and associate them with any set of labels y; right? Choose any set of labels y for these d points. And if your hypothesis class shatters S, then that means there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means.

So let me just illustrate this with an example. Let's say H is the class of all linear classifiers in two dimensions, and let's say that S is this set comprising two points; okay? So there are four possible labelings for these two points: you can label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative. And if the hypothesis class H is the class of all linear classifiers in 2-D, then for each of these labelings I can find a linear classifier that attains zero training error on it — on every possible labeling of this set of two points. And so I'll say that the hypothesis class script H shatters this set S of two points; okay?

One more example — let me show you a larger example. Suppose my set S is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again I say that, by definition, my hypothesis class also shatters this set S. Student: Right. Instructor (Andrew Ng): And that's the terminology — H can realize any labeling on S. Given any set of labels, you can find a hypothesis that perfectly separates the positive and negative examples; okay?

So how about this set?
Suppose S is now this set of four points. Then there are lots of labelings — there are now 16 labelings we can choose for this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings. But there's no linear decision boundary that can realize this labeling, and so H does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay?

So here's another definition — it's called the VC dimension, after these two people, Vapnik and Chervonenkis. Given a hypothesis class, the Vapnik-Chervonenkis dimension of H, which we usually write as VC(H), is the size of the largest set that is shattered by H. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.

So, as a good example: if H is the class of all linear classifiers in 2-D, then the VC dimension of H is equal to three, because we saw just now that there was a set S of size three that it could shatter, and — I won't really prove it, but — it turns out there is no set of size four that it can shatter. And therefore the VC dimension of this class is three. Yeah? Student: But there are sets of size three that it cannot shatter; right? That was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this — call it S — then there are labelings on it that H cannot realize, and so H cannot shatter this set. But that's okay, because there definitely does exist some other set of size three that is shattered. So the VC dimension is three. And then there is no set of size four that can be shattered. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition, no. Right. So again, I can choose my set S to be a set of three points that are all overlapping — three points in exactly the same place. And clearly, I can't shatter this set, but that's okay, because there are some other sets of size three that I can shatter.
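To make the definition concrete, here is a small brute-force check — my own illustration, not something from the lecture — of whether linear classifiers in 2-D shatter a given point set. It enumerates every labeling and asks whether some linear decision boundary realizes it, using scikit-learn's linear SVC with a very large C as a stand-in for "find a separating line if one exists," which is adequate for tiny point sets like these.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def is_shattered(points):
    """True if linear classifiers (with intercept) can realize every labeling of `points`."""
    X = np.asarray(points, dtype=float)
    for labels in itertools.product([0, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # all-positive / all-negative is always realizable by a line off to one side
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C: effectively a hard separator if one exists
        if clf.score(X, y) < 1.0:
            return False  # found a labeling no linear decision boundary realizes
    return True

print(is_shattered([(0, 0), (1, 0), (0, 1)]))          # True: three points in general position
print(is_shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR labeling is not realizable
print(is_shattered([(0, 0), (1, 1), (2, 2)]))          # False: three collinear points (a degenerate set)
```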
And it turns out this result holds true more generally: in n dimensions, the VC dimension of the class of linear classifiers is equal to n plus one. Okay? So if you have linear classifiers over an n-dimensional feature space, the VC dimension is n plus one.

So maybe you want to write this down — what is arguably the best-known result in all of learning theory, I guess — which is that: let a hypothesis class H be given, and let the VC dimension of H be equal to d. Then with probability at least one minus delta, we have that — the formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later. The key to this result is that if you have a hypothesis class with VC dimension d — and now this can be an infinite hypothesis class — what Vapnik and Chervonenkis showed is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that, for all hypotheses h — for all the hypotheses in your hypothesis class — the generalization error of h minus the training error of h, the difference between these two things, is bounded above by some complicated formula like this; okay?

And thus, with probability at least one minus delta, we also have the same sort of thing for the hypothesis we pick; okay? And going from this step to this step is actually something you've seen yourself — that we actually proved earlier. Because, remember, in the previous lecture we proved that if you have uniform convergence — if generalization error and training error are within gamma of each other for every hypothesis — then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really: generalization error of h hat is less than or equal to the best possible generalization error plus two times gamma, and it's just the constant two in front here that I've absorbed into the big-O notation.
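The "complicated formula" on the board isn't read out in the transcript; as this Vapnik-Chervonenkis bound is commonly stated (take this as a reconstruction, not a quote from the lecture), it says that with probability at least $1-\delta$, for all $h \in \mathcal{H}$,

$$\big|\varepsilon(h) - \hat\varepsilon(h)\big| \;\le\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right),$$

and consequently, with probability at least $1-\delta$,

$$\varepsilon(\hat h) \;\le\; \Big(\min_{h\in\mathcal{H}}\varepsilon(h)\Big) \;+\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right).$$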
So that formula is slightly more complicated, but let me just rewrite it as a corollary, which is that in order to guarantee that this holds with probability at least one minus delta, it suffices — and I'm going to write this this way: I'm going to write m equals big-O of d, and I'm going to put gamma and delta in as subscripts to denote that I'm treating gamma and delta as constants, which allows me to absorb terms that depend on gamma and delta into the big-O notation — it suffices that m is on the order of the VC dimension of the hypothesis class; okay?

So what we conclude from this is that, for empirical risk minimization algorithms — in other words, less formally, for learning algorithms that try to minimize training error — the intuition to take away is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay?

It turns out that for most reasonable hypothesis classes, the VC dimension is very similar, I guess, to the number of parameters in your model. So for example, take logistic regression for linear classification: logistic regression in n dimensions has n plus one parameters, and the VC dimension of the class of linear classifiers in n dimensions is also n plus one. So it turns out that for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model — or at worst, some low-order polynomial in the number of parameters of your model. And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be, let's say, roughly linear in the number of parameters in your model; okay?
There are some somewhat strange examples where what I just said is not true — where you have very few parameters but the VC dimension is enormous. But all of the examples I know of that fall into that regime are somewhat strange and degenerate — somewhat unusual, and not the sort of learning algorithms you usually use.

Let's see, just a few other things. This result shows that sample complexity is upper bounded by the VC dimension. It also turns out that, in the worst case, sample complexity is lower bounded by the VC dimension. And what that means is that if you have a perfectly nasty learning problem, and the number of training examples you have is less than on the order of the VC dimension, then it is not possible to prove a bound like this. So I guess, in the worst case, sample complexity — the number of training examples you need — is both upper bounded and lower bounded by the VC dimension.

Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness — like, do you have to discretize the real numbers or [inaudible]? Instructor (Andrew Ng): Let's see. No, the proof does not. I've actually stated the entirety of the theorem; this is true as stated. It turns out that somewhere in the proof there's a construction called an epsilon net, which is a very clever [inaudible] — there's one step of the proof that uses a very clever construction to prove this. But that's part of the proof; it is not an assumption that you need.

I'd like to say: back when I was a Ph.D. student, when I was working through this proof, there was a solid week where I would wake up, go to the office at 9:00 a.m., and start reading the book that led up to this proof. And I'd read from 9:00 a.m. to 6:00 p.m., go home, and then the next day pick up where I left off. It sort of took me a whole week that way to understand this proof, so I thought I would inflict that on you.

Just to tie up a couple of loose ends: what I'm about to do is mention a few things that will maybe feel a little bit like random facts, but I'm just going to tie up a couple of loose ends.
And so, let's see. It turns out — just so this will be more concrete for you — this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you have asked is: how about support vector machines; right? How come SVMs don't overfit? Because remember, in our discussion on support vector machines, we said that you use kernels and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: it's n plus one, and n is infinite.

So it turns out that the class of linear separators with large margin actually has low VC dimension. I want to say this very quickly and informally — it's actually not very important for you to understand the details, but I'm going to say it very informally. Suppose I give you a set of points, and I ask you to consider only the class of lines that separate these points with a large margin. So my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin — say with a margin of at least gamma; okay? And so I won't allow a line that comes closer than that — I won't allow that line, because it comes too close to one of my points.

It turns out that if I assume my data points all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, plus one; okay? And this funny symbol here just means rounding up — it's a ceiling symbol; it means rounding up x. And it turns out you can prove — there are some strange things about this result that I'm deliberately not going to talk about — but they can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded. The surprising thing about this is that this is a bound on the VC dimension that has no dependence on the dimension of the points x.
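Written out, the bound just stated (reconstructed from the spoken description; $R$ is the radius of the ball containing the data and $\gamma$ is the margin) is

$$\mathrm{VC}\big(\mathcal{H}_{\gamma}\big) \;\le\; \left\lceil \frac{R^2}{4\gamma^2} \right\rceil + 1,$$

where $\mathcal{H}_\gamma$ denotes the class of linear separators that attain margin at least $\gamma$ on the data.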
So in other words, your data points x can live in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large-margin separator — in trying to find the line that separates your positive and your negative examples with a large margin — it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, suppose you have infinite — actually, these are countably infinite-dimensional vectors, not [inaudible] infinite-dimensional vectors. The two-norm of x, squared, is equal to the sum over i equals one to infinity of xi squared, so if x is infinite-dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again. [Inaudible]. Yes — although I assume that this is bounded by R. Oh — yeah, so this ensures that that converges. So that's just something people sometimes wonder about.

And lastly, to actually tie empirical risk minimization back a little more strongly to the sorts of algorithms we've talked about: it turns out that the theory so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function, you know — it's zero here, jumps to one, and it looks like that. And so, for this one training example, this may be the indicator that h subscript theta of x is not equal to y, plotted as a function of z, where z equals theta transpose x; okay? For one training example — your training example will be positive or negative, and depending on what the value of theta transpose x is, you either get it right or wrong. And so, you know, I guess if you have a positive example, then when z is positive, you get it right.
Suppose you have a negative example, so y equals zero; right? Then if z — which is theta transpose x — is positive, you will get this example wrong; whereas if z is negative, you'd get this example right. And so this is a plot of the indicator that h subscript theta of x is not equal to y, where the hypothesis, you know, is g of theta transpose x; okay?

And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on your training example, and so you'd like this indicator — the indicator of h of x not equal to y — to be zero. It turns out this step function is clearly a non-convex function, and so it turns out that, for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem.

In particular — let me draw a function like that — it turns out that logistic regression is trying to maximize the likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out to be a function that looks like this. And this curve that I just drew — you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, since you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization: instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above it, so you have a convex optimization problem, and you can find the maximum likelihood estimate of the parameters for logistic regression.

And it turns out the support vector machine can also be viewed as approximating this function, only with a slightly different approximation. Let's see — the support vector machine, it turns out, can be viewed as trying to approximate this step function with a different approximation that is flat out here, and then goes up as a linear function there. And that is called the hinge loss.
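As a small numerical illustration of this picture (my own sketch, not from the lecture), here are the three losses evaluated as a function of the signed margin z — read z as theta transpose x for a positive example, so z > 0 means the example is classified correctly:

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)              # signed margin: positive means correctly classified

zero_one = (z < 0).astype(float)           # the non-convex step function ERM would like to minimize
logistic = np.log1p(np.exp(-z))            # negative log-likelihood term minimized by logistic regression
hinge    = np.maximum(0.0, 1.0 - z)        # hinge loss minimized by the support vector machine

for zi, a, b, c in zip(z, zero_one, logistic, hinge):
    print(f"z = {zi:+.1f}   0-1: {a:.0f}   logistic: {b:.3f}   hinge: {c:.3f}")
```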
And so you can think of logistic regression and the support vector machine as different convex approximations for trying to minimize this step function; okay? And that's why, I guess, all the theory we developed — even though SVMs and logistic regression aren't exactly doing empirical risk minimization — the theory we've developed often gives completely appropriate intuitions for SVMs and logistic regression; okay?

So that was the last of the loose ends. And if you didn't get this, don't worry too much about it. The high-level message is just that SVMs and logistic regression are reasonable to think of as approximations to empirical risk minimization algorithms.

What I want to do next is move on to talk about model selection. Before I do that, let me just check for questions about this. Okay. Cool.

Okay. So in the theory that we started to develop in the previous lecture, and that we sort of wrapped up with a discussion of VC dimension, we saw that there's often a tradeoff between bias and variance. And in particular, it is important not to choose a hypothesis that's either too simple or too complex. So if your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit — you'd have a hypothesis with high bias. And conversely, if you choose a hypothesis that's too complex, then you have high variance: you would overfit the data, and you'd also fail to generalize well.

So model selection algorithms provide a class of methods to automatically make these tradeoffs between bias and variance; right? Remember the cartoon I drew last time of generalization error? I drew this last time, where on the x-axis was model complexity, meaning the degree of the polynomial, the [inaudible] of the regression function, or whatever. And if you have too simple a model, you have high generalization error — that's underfitting. And if you have too complex a model — like fitting a 14th- or 15th-degree polynomial to five data points — then you also have high generalization error, and you're overfitting.

So what I want to do now is actually just talk about model selection in the abstract; all right?
Some examples of model selection problems include — well, I'll run with the example of, let's say, trying to choose the degree of a polynomial; right? What degree polynomial do you want to fit? Another example of a model selection problem would be trying to choose the parameter tau, which was the bandwidth parameter in locally weighted linear regression, or in some other sort of locally weighted regression. Yet another model selection problem is choosing the parameter C in the SVM; right? In the soft-margin SVM we had this optimization objective, and the parameter C controls the tradeoff between how much weight you put on having a large margin versus how much you want to penalize misclassified examples. So these are three specific examples of model selection problems. And let's come up with a method for automatically choosing among the candidates.

Let's say you have some finite set of models, and let's write these as M1, M2, M3, and so on. For example, this may be the linear classifier, or this may be the quadratic classifier, and so on; okay? Or you may also take the bandwidth parameter tau and discretize it into a range of values, and you're trying to choose from that discrete set of values.

So let's talk about how you would select an appropriate model; all right? Well, one thing you could do is take all of these models, train them on your training set, and then see which model has the lowest training error. So that's a terrible idea — and why is that? Right. Cool. Because of the overfitting; right. And some of you are laughing that I even asked. It would be a terrible idea to choose a model by looking at training error, because then, obviously, you'd end up choosing the most complex model; right? You'd choose the 10th-degree polynomial, because that's what fits the training set best.

So there are several standard procedures for doing model selection. One is hold-out cross validation, and in hold-out cross validation, we take the training set and we randomly split it into two subsets.
We take all the data you have and randomly split it into two subsets, and we'll call them the training subset and the hold-out cross validation set. And then, you know, you train each model on just the training subset, and test it on your hold-out cross validation set. And you pick the model with the lowest error on the hold-out cross validation subset; okay? So this is a relatively straightforward procedure, and it's commonly used: you train on, say, 70 percent of the data, then test all of your models on the remaining 30 percent, and you pick whichever has the smallest hold-out cross validation error.

And after this, you actually have a choice. Having trained all of these hypotheses on 70 percent of the data, you can just output the hypothesis that has the lowest error on your hold-out cross validation set. Or, optionally, you can take the model that you selected and go back and retrain it on all 100 percent of the data; okay? Both versions are done and used really often: you can either just take the best hypothesis that was trained on 70 percent of the data and output that as your final hypothesis, or you can use this procedure to, say, choose the degree of the polynomial you want to fit, and then go back and retrain that model on the entire 100 percent of your data. Both of these are commonly done.

Now, hold-out cross validation does sort of work. But sometimes we're working with a company or an application where — in many machine learning applications we have very little data, or every training example you have was painfully acquired at great cost; right? Sometimes your data is acquired through medical experiments, and each training example represents a sick person and some amount of physical human pain or something. So if we then say, well, I'm going to hold out 30 percent of your data set just to select my model, sometimes that causes unhappiness, and so maybe you don't want to have to leave out 30 percent of your data just to do model selection.
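Here is a minimal sketch of the hold-out procedure described above, using scikit-learn; the candidate models (polynomial-feature logistic regressions of increasing degree), the synthetic data, and the 70/30 split are illustrative choices of mine, not something specified in the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Randomly split into a 70% training subset and a 30% hold-out cross validation set.
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate models M1, M2, ...: here, increasing polynomial degree.
models = {deg: make_pipeline(PolynomialFeatures(deg), LogisticRegression(max_iter=5000))
          for deg in [1, 2, 3]}

# Train each model on the training subset, measure its error on the hold-out set.
holdout_error = {deg: 1.0 - m.fit(X_tr, y_tr).score(X_cv, y_cv) for deg, m in models.items()}
best = min(holdout_error, key=holdout_error.get)
print(holdout_error, "-> selected degree", best)

# Optionally, retrain the selected model on 100% of the data before using it.
final_model = models[best].fit(X, y)
```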
So there are a couple of other variations on hold-out cross validation that make somewhat more efficient use of the data. One is called k-fold cross validation, and here's the idea: I'm going to take all of my data S — so imagine I draw this box S to denote the entirety of all the data I have — and I'll then divide it into k pieces; that's five pieces in what I've drawn. Then what I'll do is repeatedly train on k minus one pieces and test on the remaining piece, and then average over the k results.

In other words: I will hold out, say, just one fifth of my data, train on the remaining four fifths, and test on the first piece. Then I'll hold out the second fifth, train on the remaining pieces, and test on that. Then remove the third piece, train on the other four fifths, and so on — I'm going to do this five times. And then I'll take the five error measures I have and average them, and this then gives me an estimate of the generalization error of my model; okay? And then, again, when you do k-fold cross validation, usually you go back and retrain the model you selected on the entirety of your training set.

So I drew five pieces here because that was easier for me to draw, but k equals 10 is very common; okay? I should say k equals 10 is the fairly standard choice — you do 10-fold cross validation. And the advantage over hold-out cross validation is that you split the data into ten pieces, so each time you're only holding out one tenth of your data, rather than, you know, say, 30 percent of your data. I should say, in simple hold-out cross validation, a 70-30 split is fairly common; sometimes 2/3-1/3, but a 70-30 split is fairly common. And if you use k-fold cross validation, k equals 5 or, more commonly, k equals 10 is the most common choice.

The disadvantage of k-fold cross validation is that it can be much more computationally expensive. In particular, to validate your model you now need to train it ten times instead of just once, and so you need to run logistic regression ten times per model, rather than just once. So this is computationally more expensive. But k equals ten works great.
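Here is the k-fold version of the same selection loop — again a sketch with illustrative choices on my part; scikit-learn's cross_val_score does the split-train-test-average bookkeeping with cv=10 folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
models = {deg: make_pipeline(PolynomialFeatures(deg), LogisticRegression(max_iter=5000))
          for deg in [1, 2, 3]}

# 10-fold cross validation: train on 9/10 of the data, test on the held-out tenth,
# rotate through all ten folds, and average the ten error estimates.
kfold_error = {deg: 1.0 - cross_val_score(m, X, y, cv=10).mean() for deg, m in models.items()}
best = min(kfold_error, key=kfold_error.get)
print(kfold_error, "-> selected degree", best)

# Retrain the selected model on the entire data set.
final_model = models[best].fit(X, y)
```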
And then, finally, there's a version of this that you can take even further, which is when you set k equal to m. So that's when you take your training set and split it into as many pieces as you have training examples, and this procedure is called leave-one-out cross validation. And what you do is you take out the first training example, train on the rest, and test on the first example. Then you take out the second training example, train on the rest, and test on the second example. Then you take out the third example, train on everything but your third example, test on the third example, and so on.

And so, with this many pieces, you are now making maybe even more effective use of your data than k-fold cross validation. But leave-one-out cross validation is computationally very expensive, because now you need to repeatedly leave one example out and then run your learning algorithm on m minus one training examples. You need to do this a lot of times, and so this is computationally very expensive. And typically, this is done only when you're extremely data scarce. So if you have a learning problem where you have, say, 15 training examples or something — if you have very few training examples — then leave-one-out cross validation is maybe preferred.
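A leave-one-out sketch looks the same as the k-fold one except that every fold holds out a single example; scikit-learn's LeaveOneOut iterator (used here purely as an illustration, on a deliberately tiny synthetic data set) makes the m-fold structure explicit:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A tiny data set: leave-one-out is usually reserved for data-scarce problems like this.
X, y = make_classification(n_samples=15, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

model = LogisticRegression(max_iter=5000)
# One fold per training example: train on m - 1 examples, test on the single one left out.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("estimated generalization error:", 1.0 - scores.mean())
```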
Yeah? Student: You know, earlier you proved a bound on the difference between generalization error and training error in terms of the number of examples in your training set and the VC dimension. So maybe, instead of splitting the examples into different groups, we could use that — compute the training error, and use the bound to estimate the generalization error? Instructor (Andrew Ng): Yeah, that's done, but in practice I personally tend not to do that. The VC dimension bounds are somewhat loose bounds. There are people in structural risk minimization who propose doing what you describe, but I personally tend not to do that, though.

Questions for cross validation? Yeah. Student: This is kind of related — we spent all this time [inaudible], but how many data points do you sort of need to get a certain margin [inaudible]? It seems like I'd be able to use that instead of doing this — more analytically, I guess. Instructor (Andrew Ng): Okay. So it turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose, because you're proving a worst-case upper bound that holds true even for very bad — what is it — so the bounds that I proved just now, right, hold true for absolutely any probability distribution over training examples. We just assume the training examples are drawn iid from some distribution script D, and the bounds I proved hold true for absolutely any probability distribution script D. And chances are, whatever real-life distribution you get — over, you know, houses and their prices or whatever — is probably not as bad as the very worst one you could have gotten; okay?

And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers. Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability — how many training examples do I need? If you actually plug actual constants into these learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large numbers — you would need 10,000 training examples to fit ten parameters.

So a good way to think of these learning theory bounds is — and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just absolutely ignore the constant factors, because the bounds seem to be very loose. There are some attempts to use these bounds to give guidelines as to what model to choose, and so on, but I personally tend to use the bounds for, again, intuition: for example, does the number of training examples you need grow linearly in the number of parameters, or does it grow quadratically in the number of parameters? So it's quite often the shape of the bounds — the fact that the sample complexity is linear in the VC dimension — that's the sort of useful intuition you can get from these theories. But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you are working on. So did that answer your question? Student: Uh-huh. Instructor (Andrew Ng): Yeah.
And it turns out, by the 0:53:28.409,0:53:31.509 way, for myself, a rule of thumb that I often use is if 0:53:31.509,0:53:34.969 you're trying to fit a logistic regression model, 0:53:34.969,0:53:36.179 if you have 0:53:36.179,0:53:38.749 n parameters or n plus one parameters, 0:53:38.749,0:53:42.429 if the number of training examples is ten times your number of 0:53:42.429,0:53:44.709 parameters, then you're probably in good shape. 0:53:44.709,0:53:49.289 And if your number of training examples is many times the number of parameters, then 0:53:49.289,0:53:52.299 you're probably perfectly fine fitting that model. 0:53:52.299,0:53:54.880 So those are the sorts of intuitions 0:53:54.880,0:54:01.880 that you can get from these bounds. Student:In cross validation, do 0:54:02.939,0:54:06.739 we assign these examples randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the train 0:54:06.739,0:54:13.739 and test sets randomly. 0:54:22.989,0:54:26.839 One more thing I want to talk about for model selection is there's actually a special case of model selection, 0:54:26.839,0:54:33.839 called the feature selection problem. 0:54:37.809,0:54:41.339 And so here's the intuition: 0:54:41.339,0:54:45.479 for many machine-learning problems you may have a very high dimensional 0:54:45.479,0:54:48.119 feature space with very high dimensional 0:54:48.119,0:54:51.640 input features x. 0:54:51.640,0:54:56.289 So for example, for text classification and I wanna talk about this text classification example of 0:54:56.289,0:54:58.299 spam versus 0:54:58.299,0:55:00.430 non-spam you may easily have on 0:55:00.430,0:55:04.130 the order of 30,000 or 50,000 features. I think I used 50,000 in 0:55:04.130,0:55:07.150 my earlier examples. So if you have 0:55:07.150,0:55:10.659 so many features you have 50,000 features 0:55:10.659,0:55:13.759 then, depending on what learning algorithm you use, there may be a real 0:55:13.759,0:55:15.529 risk of over fitting. 0:55:15.529,0:55:17.219 And so 0:55:17.219,0:55:20.079 if you can reduce the number of features, maybe 0:55:20.079,0:55:23.179 you can reduce the variance of your learning algorithm, and reduce the risk of 0:55:23.179,0:55:25.309 over fitting. And for 0:55:25.309,0:55:28.259 the specific case of text classification, 0:55:28.259,0:55:31.919 you might imagine that maybe there's only a small number of relevant features. So 0:55:31.919,0:55:33.679 there are all these English words, 0:55:33.679,0:55:36.969 and many of these English words probably don't tell you anything at all about 0:55:36.969,0:55:39.880 whether the email is spam or non-spam. You 0:55:39.880,0:55:43.989 know, English function words like the, of, a, and: 0:55:43.989,0:55:48.049 these are probably words that don't tell you anything about whether the email is spam or non-spam. In 0:55:48.049,0:55:50.569 contrast, there will be a much smaller number of 0:55:50.569,0:55:54.319 features that are truly relevant to the learning problem. 0:55:54.319,0:55:57.369 So for example, if you see the word buy or Viagra, those are words that 0:55:57.369,0:56:01.669 are very useful for telling you whether something is spam or non-spam. If you see the word 0:56:01.669,0:56:02.829 Stanford or 0:56:02.829,0:56:06.270 machine learning or your own personal name, these are other words that are useful for 0:56:06.270,0:56:10.529 telling you whether something is spam or non-spam.
So 0:56:10.529,0:56:15.659 in feature selection, we would like to select a subset of the features 0:56:15.659,0:56:19.519 that are maybe, or hopefully, the most relevant ones for a specific learning 0:56:19.519,0:56:20.369 problem, so 0:56:20.369,0:56:25.009 as to give ourselves a simpler hypothesis class to choose 0:56:25.009,0:56:28.499 from, and therefore reduce the risk of over fitting, 0:56:28.499,0:56:29.699 even when we 0:56:29.699,0:56:36.619 may have had 50,000 features originally. So 0:56:36.619,0:56:41.209 how do you do this? Well, if you 0:56:41.209,0:56:44.009 have n 0:56:44.009,0:56:47.329 features, then there are 0:56:47.329,0:56:49.529 two to the n possible subsets; 0:56:49.529,0:56:54.429 0:56:54.429,0:56:55.900 right? Because, you know, 0:56:55.900,0:56:59.880 each of your n features can either be included or excluded. So there are two to the n 0:56:59.880,0:57:00.909 possibilities. 0:57:00.909,0:57:04.200 And this is a huge space. 0:57:04.200,0:57:08.859 So in feature selection, what we most commonly do is use various search heuristics 0:57:08.859,0:57:09.510 sort of 0:57:09.510,0:57:10.799 simple search algorithms 0:57:10.799,0:57:15.329 to try to search through this space of two to the n possible subsets of features, 0:57:15.329,0:57:16.710 to try to find a good 0:57:16.710,0:57:18.290 subset of features. This is 0:57:18.290,0:57:22.569 too large a number to enumerate all possible feature subsets. 0:57:22.569,0:57:25.519 And as a concrete example, 0:57:25.519,0:57:26.759 0:57:26.759,0:57:31.159 this is the forward search algorithm; it's also called the forward selection algorithm. 0:57:31.159,0:57:32.689 0:57:32.689,0:57:34.849 It's actually pretty simple, but I'll just 0:57:34.849,0:57:36.080 write it out. 0:57:36.080,0:57:42.059 My writing it out will make it look more complicated than it really 0:57:42.059,0:57:44.649 is, but 0:57:44.649,0:57:46.040 it starts with: 0:57:46.040,0:57:48.149 initialize the set script f 0:57:48.149,0:57:51.649 to be the empty set, 0:57:51.649,0:57:53.239 and then 0:57:53.239,0:57:55.019 repeat: 0:57:55.019,0:57:56.619 for 0:57:56.619,0:57:58.859 i equals one 0:57:58.859,0:58:00.439 to n, 0:58:04.849,0:58:08.239 try adding 0:58:08.239,0:58:11.459 feature i 0:58:11.459,0:58:13.849 to the set script 0:58:13.849,0:58:16.039 f, and evaluate the 0:58:16.039,0:58:19.739 model 0:58:19.739,0:58:25.719 using cross validation. 0:58:25.719,0:58:30.399 And by cross validation, I mean any of the three flavors, be it simple hold out cross 0:58:30.399,0:58:34.629 validation or k-fold cross validation or leave one out cross 0:58:34.629,0:58:35.739 validation. 0:58:35.739,0:58:39.949 And then, you know, set script f to be 0:58:39.949,0:58:41.239 equal to 0:58:41.239,0:58:43.249 script f union 0:58:43.249,0:58:48.579 the best feature 0:58:48.579,0:58:55.579 found, I guess; okay? 0:58:57.619,0:59:00.059 And finally, you would 0:59:00.059,0:59:06.920 output the best hypothesis found; okay? So 0:59:06.920,0:59:13.920 0:59:16.699,0:59:20.319 the forward selection procedure is: start with the empty set of features. 0:59:20.319,0:59:22.319 And then on each 0:59:22.319,0:59:25.689 iteration, take each of 0:59:25.689,0:59:28.829 your features that isn't already in your set script f and you try adding that 0:59:28.829,0:59:30.449 feature to your set. Then 0:59:30.449,0:59:33.870 you train a model and evaluate it using 0:59:33.870,0:59:35.039 cross validation.
0:59:35.039,0:59:38.719 And basically, figure out what is the best single feature to add to your 0:59:38.719,0:59:40.719 set 0:59:40.719,0:59:43.709 script f. In step two here, you go ahead and add 0:59:43.709,0:59:46.509 that feature to your set script f, and you repeat; right? 0:59:46.509,0:59:50.559 And when I say best feature or best model here by best, I really mean the 0:59:50.559,0:59:53.669 best model according to hold out cross validation. 0:59:53.669,0:59:56.049 By best, I really mean 0:59:56.049,1:00:00.639 the single feature addition that results in the lowest hold out 1:00:00.639,1:00:01.509 cross validation error or the 1:00:01.509,1:00:03.209 lowest cross validation error. 1:00:03.209,1:00:04.499 So you do this, 1:00:04.499,1:00:07.959 adding one feature at a time. 1:00:07.959,1:00:12.019 When do you terminate this? Well, if you've 1:00:12.019,1:00:15.079 added all the features to f, so f is now 1:00:15.079,1:00:17.919 the entire set of features, you can terminate this. 1:00:17.919,1:00:19.309 Or if, 1:00:19.309,1:00:22.039 by some rule of thumb, you know that you 1:00:22.039,1:00:23.769 probably don't ever want more than 1:00:23.769,1:00:26.729 k features, you can also terminate 1:00:26.729,1:00:30.239 this if f has already exceeded some threshold number of features. So maybe 1:00:30.239,1:00:30.979 if 1:00:30.979,1:00:34.139 you have 100 training examples, and you're fitting logistic regression, 1:00:34.139,1:00:39.359 you probably know you won't want more than 100 features. And so 1:00:39.359,1:00:41.229 you stop after you have 1:00:41.229,1:00:45.709 100 features added to set f; okay? 1:00:45.709,1:00:50.029 And then finally, having done this, output the best hypothesis found; again, by 1:00:50.029,1:00:51.669 best, I mean, 1:00:51.669,1:00:55.459 when running this algorithm, you'd be seeing lots of hypotheses. You'd be training lots of 1:00:55.459,1:00:58.419 hypotheses, and testing them using cross validation. 1:00:58.419,1:01:01.659 So when I say output the best hypothesis found, I mean 1:01:01.659,1:01:04.650 of all of the hypotheses you've seen during this entire procedure, 1:01:04.650,1:01:07.149 pick the one with the lowest 1:01:07.149,1:01:10.309 cross validation error that you saw; okay? 1:01:10.309,1:01:17.309 So that's forward selection. So 1:01:35.239,1:01:39.599 let's see, just to give this a name, this is an instance of what's called 1:01:39.599,1:01:43.189 1:01:43.189,1:01:46.439 wrapper feature 1:01:46.439,1:01:47.879 selection. 1:01:47.879,1:01:50.579 And the term wrapper comes 1:01:50.579,1:01:52.049 from the fact that 1:01:52.049,1:01:55.489 this feature selection algorithm that I just described, forward selection or forward 1:01:55.489,1:01:56.819 search, 1:01:56.819,1:01:58.680 is a piece of software that 1:01:58.680,1:02:02.149 you write that wraps around your learning algorithm, in the sense that 1:02:02.149,1:02:04.249 to perform forward selection, 1:02:04.249,1:02:07.579 you need to repeatedly make calls to your learning algorithm 1:02:07.579,1:02:08.829 to train 1:02:08.829,1:02:11.789 your model 1:02:11.789,1:02:15.289 using different subsets of features; okay? So this is called wrapper model 1:02:15.289,1:02:17.319 feature selection.
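Here is a minimal sketch of the forward search procedure just described, written as a wrapper around a learning algorithm. The function name, the use of scikit-learn's cross_val_score with logistic regression, the five folds, and the synthetic data are all illustrative assumptions, not part of the lecture.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=None):
    n = X.shape[1]
    selected = []                  # the set script f, initially empty
    best_overall = (-np.inf, [])   # best cross validation score seen, and the subset that achieved it
    while len(selected) < (max_features or n):
        candidates = [j for j in range(n) if j not in selected]
        if not candidates:
            break
        # Try adding each remaining feature and evaluate the model with cross validation.
        scores = {}
        for j in candidates:
            cols = selected + [j]
            scores[j] = cross_val_score(LogisticRegression(max_iter=1000),
                                        X[:, cols], y, cv=5).mean()
        best_j = max(scores, key=scores.get)
        selected.append(best_j)    # f := f union {best single feature found}
        if scores[best_j] > best_overall[0]:
            best_overall = (scores[best_j], list(selected))
    # Output the best feature subset (hypothesis) seen during the entire search.
    return best_overall

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)   # only features 0 and 3 actually matter here
score, subset = forward_select(X, y, max_features=4)
print("selected features:", subset, "cross validation accuracy:", round(score, 3))

Note that every iteration retrains the learner once per candidate feature, which is exactly the computational cost discussed next.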
1:02:17.319,1:02:20.509 And it tends to be somewhat computationally expensive because as you're 1:02:20.509,1:02:22.529 performing the search process, 1:02:22.529,1:02:26.029 you're repeatedly training your learning algorithm over and over and over on all of 1:02:26.029,1:02:30.329 these different subsets of features. Let's 1:02:30.329,1:02:37.329 just mention also, there is a variation of this called backward search or 1:02:38.899,1:02:40.829 backward selection, 1:02:40.829,1:02:43.709 which is where you start with 1:02:43.709,1:02:46.989 f equals 1:02:46.989,1:02:53.989 the entire set of features, and you delete features one at a time; 1:02:59.469,1:03:01.369 okay? 1:03:01.369,1:03:03.979 1:03:03.979,1:03:06.759 So that's backward search or backward selection. 1:03:06.759,1:03:08.079 And 1:03:08.079,1:03:13.369 this is another feature selection algorithm that you might use. Part of 1:03:13.369,1:03:15.400 whether this makes sense is 1:03:15.400,1:03:17.409 really that there will be problems where 1:03:17.409,1:03:21.319 it really doesn't even make sense to initialize f to be the set of all features. 1:03:21.319,1:03:23.859 So if you have 100 training examples 1:03:23.859,1:03:25.569 and 10,000 features, 1:03:25.569,1:03:28.829 which may well happen 1:03:28.829,1:03:30.759 100 emails and 1:03:30.759,1:03:35.289 10,000 features per email, with 100 training examples then, 1:03:35.289,1:03:38.279 depending on the learning algorithm you're using, it may or may not make sense to 1:03:38.279,1:03:38.989 initialize 1:03:38.989,1:03:42.789 the set f to be all features, and train a model using all the features. And if it 1:03:42.789,1:03:46.130 doesn't make sense to train a model using all the features, then 1:03:46.130,1:03:51.529 forward selection would be more common. 1:03:51.529,1:03:53.609 So 1:03:53.609,1:03:56.960 let's see. Wrapper model feature selection algorithms tend to work 1:03:56.960,1:03:58.459 well. 1:03:58.459,1:04:02.179 And in particular, they actually often work better than a different class of algorithms I'm gonna talk 1:04:02.179,1:04:05.789 about now. But their main disadvantage is that they're computationally very 1:04:05.789,1:04:08.709 expensive. 1:04:08.709,1:04:15.709 Do you have any questions about this 1:04:16.289,1:04:19.109 before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes, 1:04:19.109,1:04:23.459 you're actually right. So forward search and backward search, both of these are search heuristics, 1:04:23.459,1:04:24.360 and 1:04:24.360,1:04:28.090 for either of these you cannot guarantee they'll find the best 1:04:28.090,1:04:29.119 subset of features. It 1:04:29.119,1:04:30.729 actually turns out that 1:04:30.729,1:04:34.999 under many formalizations of the feature selection problem, it actually turns out to be an NP-hard 1:04:34.999,1:04:40.029 problem to find the best subset of features. 1:04:40.029,1:04:43.999 But in practice, forward selection and backward selection work fine, 1:04:43.999,1:04:47.170 and you can also envision other search algorithms where you sort of 1:04:47.170,1:04:50.489 have other methods to search through the space of two to the n possible feature 1:04:50.489,1:04:57.489 subsets. So 1:05:01.079,1:05:03.700 let's see. 1:05:03.700,1:05:06.339 Wrapper feature selection 1:05:06.339,1:05:09.829 tends to work well when you can afford to do it 1:05:09.829,1:05:11.539 computationally.
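For completeness, here is the corresponding sketch of backward search, under the same illustrative assumptions as the forward selection sketch above (scikit-learn, logistic regression, five-fold cross validation); it would be called the same way as forward_select.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_select(X, y, min_features=1):
    selected = list(range(X.shape[1]))   # start with f = the entire set of features
    best_overall = (cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean(),
                    list(selected))
    while len(selected) > min_features:
        # Try deleting each remaining feature; keep the deletion whose model cross validates best.
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, [c for c in selected if c != j]], y, cv=5).mean()
                  for j in selected}
        drop = max(scores, key=scores.get)
        selected.remove(drop)
        if scores[drop] > best_overall[0]:
            best_overall = (scores[drop], list(selected))
    # Return the best cross validation score seen and the feature subset that achieved it.
    return best_overall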
1:05:11.539,1:05:13.800 But for problems 1:05:13.800,1:05:18.279 such as text classification it turns out for text classification specifically, 1:05:18.279,1:05:21.949 because you have so many features, and easily have 50,000 features 1:05:21.949,1:05:25.789 forward selection would be very, very expensive. 1:05:25.789,1:05:28.240 So there's a different class of algorithms 1:05:28.240,1:05:31.189 that 1:05:31.189,1:05:35.319 tends not to do as well in the sense of generalization error, so you tend to learn a 1:05:35.319,1:05:37.329 hypothesis that works less well, 1:05:37.329,1:05:40.289 but is computationally much less expensive. 1:05:40.289,1:05:44.189 And these are called 1:05:44.189,1:05:46.669 the 1:05:46.669,1:05:50.279 filter feature selection methods. 1:05:50.279,1:05:53.119 And the basic idea is 1:05:53.119,1:05:53.679 1:05:53.679,1:06:00.539 that for each feature i, we will 1:06:00.539,1:06:03.939 compute 1:06:03.939,1:06:10.539 some measure 1:06:10.539,1:06:17.219 of how informative 1:06:17.219,1:06:19.909 xi 1:06:19.909,1:06:25.479 is about y; okay? 1:06:25.479,1:06:28.779 And to do this, we'll use some simple 1:06:28.779,1:06:33.489 heuristics; for every feature we'll just try to compute some rough estimate or 1:06:33.489,1:06:40.489 compute 1:06:41.279,1:06:48.279 some measure of how informative 1:06:50.179,1:06:54.489 xi is about 1:06:54.489,1:06:59.139 y. So there are many ways you can do this. One way you can choose is to just compute the 1:06:59.139,1:07:01.279 correlation between xi and y. And just for each of 1:07:01.279,1:07:04.999 your features just see how correlated this is with 1:07:04.999,1:07:06.719 your class label y. 1:07:06.719,1:07:12.809 And then you just pick the top k most correlated features. 1:07:12.809,1:07:19.809 Another way 1:07:46.429,1:07:51.549 to do this, 1:07:51.549,1:07:54.930 for the case of text classification: there's one other 1:07:54.930,1:07:56.779 informativeness measure which, 1:07:56.779,1:07:58.749 especially for these kinds of features, I guess, 1:07:58.749,1:08:00.709 is used very commonly, 1:08:00.709,1:08:04.269 and it's called mutual information. I'm going to 1:08:04.269,1:08:11.229 let you 1:08:11.229,1:08:15.399 play with some of these ideas in the problem sets, but I'll 1:08:15.399,1:08:17.900 just say this very briefly. 1:08:17.900,1:08:21.599 So the mutual information between feature xi and y 1:08:21.599,1:08:27.670 I'll just write out the definition, I guess. 1:08:27.670,1:08:32.229 Let's say this is text classification, so x can take on two values, 0, 1; 1:08:32.229,1:08:36.179 the mutual information between xi and y is defined as a sum over all possible values of 1:08:36.179,1:08:37.359 x, 1:08:37.359,1:08:40.929 and a sum over all possible values of 1:08:40.929,1:08:45.609 y, of the joint distribution p of xi, y, 1:08:45.609,1:08:52.049 times log of p of xi, y over p of xi 1:08:52.049,1:08:53.089 times p of y. 1:08:53.089,1:08:54.379 Where 1:08:59.569,1:09:04.039 all of these distributions so the joint distribution over xi and y 1:09:04.039,1:09:11.039 you would estimate from your training data, 1:09:12.559,1:09:14.220 and all of the marginals you would estimate 1:09:14.220,1:09:15.259 from 1:09:15.259,1:09:20.159 the training data as well: what is the probability that x is 0, what's the probability that x is one, what's the 1:09:20.159,1:09:27.159 probability that x is 0 and y is 0, x is one and y is 0, and so on.
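The definition above translates directly into a plug-in estimate from counts. The sketch below is illustrative: it assumes a binary feature and binary label, and the add-one (Laplace) smoothing is an extra assumption added only to avoid taking the log of zero; it is not part of the definition given in the lecture.

import numpy as np

def mutual_information(xi, y):
    # Plug-in estimate of MI(xi; y) for binary xi and y, computed from empirical counts.
    m = len(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = (np.sum((xi == a) & (y == b)) + 1) / (m + 4)   # smoothed joint p(xi=a, y=b)
            p_a = (np.sum(xi == a) + 2) / (m + 4)                 # smoothed marginal p(xi=a)
            p_b = (np.sum(y == b) + 2) / (m + 4)                  # smoothed marginal p(y=b)
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# A feature that tracks the label has high mutual information; an unrelated one is near zero.
y  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x1 = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # mostly agrees with y
x2 = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # independent of y
print(mutual_information(x1, y), mutual_information(x2, y))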
So it 1:09:27.849,1:09:31.190 turns out 1:09:31.190,1:09:35.000 there's a standard information theoretic measure of how different 1:09:35.000,1:09:36.750 two probability distributions are. 1:09:36.750,1:09:39.989 And I'm not gonna prove this here, but it turns out that 1:09:39.989,1:09:45.609 this mutual information is 1:09:45.609,1:09:51.749 actually an instance of 1:09:51.749,1:09:54.309 the standard 1:09:54.309,1:09:58.099 measure of how different distributions are, called the K-L divergence. 1:09:58.099,1:10:00.359 If you've taken a class in information theory, 1:10:00.359,1:10:03.419 you will have seen the concepts of mutual information and the K-L divergence, but if you 1:10:03.419,1:10:06.019 haven't, don't worry about it. Just 1:10:06.019,1:10:09.239 the intuition is there's something called K-L divergence that's a formal measure 1:10:09.239,1:10:12.949 of how different two probability distributions 1:10:12.949,1:10:16.609 are. And mutual information is a measure of how different 1:10:16.609,1:10:19.829 the joint distribution is 1:10:19.829,1:10:21.569 of x and y 1:10:21.569,1:10:23.400 from the distribution 1:10:23.400,1:10:26.159 you would get if you were to assume they were independent; 1:10:26.159,1:10:27.210 okay? So 1:10:27.210,1:10:29.449 if x and y were independent, then 1:10:29.449,1:10:33.979 p of x, y would be equal to p of x times p of y. 1:10:33.979,1:10:37.340 And so you know, this distribution and this distribution would be identical, and the 1:10:37.340,1:10:40.369 K-L divergence would be 0. 1:10:40.369,1:10:43.579 In contrast, if x and y were very non-independent, in 1:10:43.579,1:10:46.900 other words, if x and y are very informative about each other, 1:10:46.900,1:10:48.449 then this K-L divergence 1:10:48.449,1:10:49.719 will be large. 1:10:49.719,1:10:52.379 And so mutual information is a formal measure of 1:10:52.379,1:10:55.460 how non-independent x and y are. 1:10:55.460,1:10:58.299 And if x and y are highly non-independent, 1:10:58.299,1:10:59.260 then that means that 1:10:59.260,1:11:02.209 x will presumably tell you something about y, 1:11:02.209,1:11:05.989 and so they'll have large mutual information. And 1:11:05.989,1:11:09.519 a large mutual information will tell you that xi might be a good feature. And you 1:11:09.519,1:11:11.989 get to play with some of these ideas 1:11:11.989,1:11:15.309 more in the problem sets. So I won't say much more about it. 1:11:16.579,1:11:19.420 And what you do then is, having chosen 1:11:19.420,1:11:23.780 some measure like correlation or mutual information or something else, you 1:11:23.780,1:11:28.889 then pick the top 1:11:28.889,1:11:33.659 k features; meaning that you compute the correlation between xi and y for all the 1:11:33.659,1:11:36.979 features, or the mutual information between xi and y for all the features. 1:11:36.979,1:11:40.080 And then you include in your learning algorithm the 1:11:40.080,1:11:43.369 k features with the largest correlation with the label or 1:11:43.369,1:11:46.489 the largest mutual information with the label, whatever. 1:11:46.489,1:11:51.559 And to choose 1:11:51.559,1:11:54.809 k, 1:11:54.809,1:11:57.799 you can actually use cross validation, as well; okay? 1:11:57.799,1:11:59.360 So you would 1:11:59.360,1:12:03.059 take all your features, and sort them in decreasing order of mutual 1:12:03.059,1:12:04.109 information. 1:12:04.109,1:12:07.529 And then you'd try using just the top one feature, the top two features, the top 1:12:07.529,1:12:09.559 three features, and so on.
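Putting the pieces together, here is a sketch of filter feature selection: score every feature, sort the features in decreasing order of the score, then use cross validation to decide how many of the top ranked features to keep. It reuses the hypothetical mutual_information helper from the previous sketch (so it assumes binary features), and the scikit-learn calls and five folds are again illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def filter_select(X, y, score_fn):
    n = X.shape[1]
    # Rank feature indices in decreasing order of the per-feature informativeness score.
    ranked = sorted(range(n), key=lambda j: score_fn(X[:, j], y), reverse=True)
    best_cv, best_k = -np.inf, 1
    for k in range(1, n + 1):
        # Try the top 1 feature, the top 2 features, the top 3 features, and so on.
        cv = cross_val_score(LogisticRegression(max_iter=1000),
                             X[:, ranked[:k]], y, cv=5).mean()
        if cv > best_cv:
            best_cv, best_k = cv, k
    return ranked[:best_k], best_cv

# Usage (assuming binary features and the mutual_information helper defined earlier):
# chosen, cv_accuracy = filter_select(X, y, mutual_information)

The scoring pass costs only one cheap statistic per feature, which is why filter methods scale to the 50,000-feature text problems where wrapper search is impractical.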
1:12:09.559,1:12:13.869 And you decide how many features to include 1:12:13.869,1:12:17.279 using cross validation; okay? Or 1:12:17.279,1:12:24.279 sometimes you can just choose this by hand, as well. Okay. Questions about 1:12:24.279,1:12:31.279 this? Okay. 1:12:34.559,1:12:36.479 Cool. Great. So 1:12:36.479,1:12:40.629 next lecture I'll continue I'll wrap up the Bayesian model selection but let's close 1:12:40.629,1:12:40.879 here for today.