This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of bias-variance tradeoffs, of underfitting and overfitting, that we've been seeing in the previous lecture and that we'll see in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class, if h is a set of k hypotheses, and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation, just dropping constants, I can also write this as that; okay?
So just to quickly remind you of what all of the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class of script h. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h [inaudible] misclassifying a new example drawn from the same distribution as the training set. And so this says that, in order to guarantee that the generalization error of the hypothesis h hat output by empirical risk minimization is less than or equal to the best possible generalization error attained in your hypothesis class plus two times gamma, two times this error threshold, and we want to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this; okay? One over two gamma squared, log two k over delta; where again, k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay? So this is a sample complexity result. So what I want to do now is take this result, and try to generalize it to the case of infinite hypothesis classes.
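Spelled out, the result being recapped is the finite-hypothesis-class bound from the previous lecture; this is a reconstruction of what the "this" and "that" on the board refer to, in the notation used above:

\[
\varepsilon(\hat h)\;\le\;\Big(\min_{h\in\mathcal H}\varepsilon(h)\Big)+2\gamma
\quad\text{with probability at least }1-\delta,
\qquad\text{provided}\qquad
m\;\ge\;\frac{1}{2\gamma^{2}}\log\frac{2k}{\delta}\;=\;O\!\Big(\frac{1}{\gamma^{2}}\log\frac{k}{\delta}\Big).
\]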
So here, we said that the set script h is sort of just k specific functions, but often you want to use a model like logistic regression, which is actually parameterized by real numbers. So I'm actually first going to give an argument that's sort of formally broken, just sort of technically somewhat broken, but conveys useful intuition. And then I'll give the more correct argument, but without proving it, because the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression. So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script h is parameterized by d real numbers; okay? So for example, if you're applying logistic regression over [inaudible], then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers. Now think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And so if you use, like, a normal standard computer, it normally will represent real numbers by what's called double precision floating point numbers. And what that means is that each real number is represented by a 64-bit representation; right? So, really, you know what floating point is in a computer; a 64-bit floating point number is what almost all of us use routinely.
And so this, parameterized by d real numbers, is really as if it's parameterized by 64 times d bits. Computers can't represent real numbers; they can only represent discrete things. And so, for the size of your hypothesis class in your computer representation, you have 64 times d bits that you can flip. And so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits. And so this is why it's important that we had log k there; right? So k is, therefore, two to the 64 d. And if I plug it into this equation over here, what you find is that, in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared, log two to the 64 d over delta, which is that; okay? So just to be clear: this is the same sort of sample complexity question as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, you know, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this.
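As a rough reconstruction of the plug-in computation just described (substituting k = 2^{64d} into the finite-class bound; the exact constants are not the point of the argument):

\[
m\;\ge\;O\!\Big(\frac{1}{\gamma^{2}}\log\frac{2^{64d}}{\delta}\Big)
\;=\;O\!\Big(\frac{64\,d}{\gamma^{2}}+\frac{1}{\gamma^{2}}\log\frac{1}{\delta}\Big)
\;=\;O_{\gamma,\delta}(d).
\]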
In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay? And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class; that m has to be on the order of something linear in [inaudible]. That intuition is actually roughly right. I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers. So let me actually go ahead and outline the "right way" to show this more formally; all right? And it turns out the "right way" to show this more formally involves a much longer argument; the proof is extremely involved, so I'm just actually going to state the result and not prove it. The full proof is the source of learning theory bounds for infinite hypothesis classes. So here's a definition: given a set of d points, we say a hypothesis class h shatters the set s if h can realize any labeling on it; okay? And what I mean by realizing any labeling on it, the informal way of thinking about this, is: if a hypothesis class shatters the set s, what that means is that I can take these d points, and I can associate these d points with any set of labels y; right? So choose any set of labels y for each of these d points.
And if your hypothesis class shatters s, then that means that there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means. So let me just illustrate that with an example. So let's say h is the class of all linear classifiers in 2D, and let's say that s is this [inaudible] comprising two points; okay? So there are four possible labelings that you can have for these two points. You can choose to label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative. And if the hypothesis class h is the class of all linear classifiers in 2D, then, for each of these training sets, I can sort of find a linear classifier that attains zero training error on each of these, on all possible labelings of this set of two points. And so I'll say that the hypothesis class script h shatters this set s of two points; okay? One more example; let me show you a larger example. Suppose my set s is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And so for these three points, I now have eight possible labelings. And once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again, I say that, by definition, my hypothesis class also shatters this set s. Student: Right.
Instructor (Andrew Ng): And that's the terminology: h can realize any labeling on s. That's obviously [inaudible]: give it any set of labels, and you can find a hypothesis that perfectly separates the positive and negative examples; okay? So how about this set? Suppose s is now this set of four points. Then, you know, there are lots of labelings; there are now 16 labelings we can choose on this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings, but there's no linear decision boundary that can realize this labeling, and so h does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out that you can show that, in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay? So here's another definition. When I say that the, well, it's called the VC dimension, after these two people, Vapnik and Chervonenkis. So given a hypothesis class, the Vapnik-Chervonenkis dimension of h, which we usually write as VC of script h, is the size of the largest set that is shattered by h. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.
So just as a kind of good example: if h is the class of all linear classifiers in 2D, then the VC dimension of this is equal to three, because we saw just now that there was a set s of size three that it could shatter, and, I won't really prove it, but it turns out there is no set of size four that it can shatter. And therefore, the VC dimension of this is three. Yeah? Student: But there are sets of size three that it cannot shatter; right? [Inaudible] was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this, that's actually a set s, then there are labelings on this it cannot realize. And so h cannot shatter this set. But that's okay, because there definitely exists some other set of size three that is shattered. So the VC dimension is three. And then there is no set of size four that it can shatter. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition, no. Right. So again, let's see, I can choose my set s to be a set of three points that are all overlapping: three points in exactly the same place. And clearly, I can't shatter this set, but that's okay. And I can't shatter this set, either, but that's okay, because there are some other sets of size three that I can shatter.
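None of the following is from the lecture; it is a minimal brute-force sketch of the shattering examples above, assuming Python with numpy and scipy available. The helper names (is_linearly_realizable, shatters) are made up for this note; realizability of a labeling by a linear classifier is tested with a small feasibility linear program.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def is_linearly_realizable(X, y):
        """Can labels y in {-1, +1} on the rows of X be produced by sign(w.x + b)?
        Solved as a small LP: maximize t subject to y_i * (w.x_i + b) >= t, t <= 1."""
        if len(set(y)) == 1:
            return True  # an all-positive or all-negative labeling is always realizable
        n, d = X.shape
        # variables z = [w_1, ..., w_d, b, t]; minimizing -t maximizes the margin proxy t
        c = np.zeros(d + 2)
        c[-1] = -1.0
        # constraint rows: -y_i * (w.x_i + b) + t <= 0
        A_ub = np.hstack([-y[:, None] * X, -y[:, None], np.ones((n, 1))])
        b_ub = np.zeros(n)
        bounds = [(-100.0, 100.0)] * (d + 1) + [(None, 1.0)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.status == 0 and -res.fun > 1e-9  # t > 0 means strictly separable

    def shatters(X):
        """True if linear classifiers realize every labeling of the rows of X."""
        n = X.shape[0]
        return all(
            is_linearly_realizable(X, np.array(labels))
            for labels in itertools.product([-1, 1], repeat=n)
        )

    three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(shatters(three_points))  # expected: True
    print(shatters(four_points))   # expected: False

Run as is, it should report True for the three-point set and False for the four corners of a square (the XOR labeling has no linear separator), consistent with the claim that the VC dimension of linear classifiers in the plane is three.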
And it turns out this result holds true more generally: in any dimension, the VC dimension of the class of linear classifiers in n dimensions is equal to n plus one. Okay? So this is in [inaudible]: if you have linear classifiers over an n-dimensional feature space, the VC dimension of the class of linear classifiers in n dimensions is equal to n plus one. So maybe you wanna write down what is arguably the best-known result in all of learning theory, I guess, which is this. Let a hypothesis class be given, and let the VC dimension of h be equal to d. Then, with probability one minus delta, we have that... the formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later. But the key to this result is that if you have a hypothesis class with VC dimension d, and now this can be an infinite hypothesis class, what Vapnik and Chervonenkis show is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that, for all hypotheses h, for all the hypotheses in your hypothesis class, you have that the generalization error of h minus the training error of h.
So the difference between these two things is bounded above by some complicated formula like this; okay? And thus, with probability one minus delta, we also have the same kind of thing; okay? And going from this step to this step; right? Going from this step to this step is actually something that you saw yourself, that we actually proved earlier. Because, you remember, in the previous lecture we proved that if you have uniform convergence, then that implies, actually we showed, that if generalization error and training error are close to each other, within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really the generalization error of h hat being at most the best possible generalization error plus two times gamma, and it's just the two constants in front here that I've absorbed into the big-O notation. So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that, in order to guarantee that this holds with probability one minus delta, with probability at least one minus delta, I should say, it suffices that, and I'm gonna write this this way: I'm gonna write m equals big-O of d, and I'm going to put gamma and delta in as subscripts there to denote that.
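For reference, here is one way to write out the "complicated formula" being referred to, following the form in the course notes; this is a reconstruction of the board, not a verbatim copy:

\[
\text{With probability at least } 1-\delta,\ \text{for all } h\in\mathcal H:\qquad
\big|\varepsilon(h)-\hat\varepsilon(h)\big|\;\le\;O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}\right),
\]
and hence, with probability at least \(1-\delta\),
\[
\varepsilon(\hat h)\;\le\;\varepsilon(h^{*})+O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}\right),
\qquad d=\mathrm{VC}(\mathcal H).
\]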
Let's see, if we treat gamma and delta as constants, so they allow me to absorb terms that depend on gamma and delta into the big-O notation, then, in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay? So, let's see. So what we conclude from this is that, for empirical risk minimization algorithms, in other words, less formally, for learning algorithms that try to minimize training error, the intuition to take away from this is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay? It turns out that, for most reasonable hypothesis classes, the VC dimension is sort of very similar, I guess, to the number of parameters of your model. So, for example, take logistic regression, linear classification: logistic regression in n dimensions has n plus one parameters, and the VC dimension of the class of linear classifiers is also n plus one. So it turns out that, for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or at worst some low-order polynomial in the number of parameters of your model.
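And the corollary, written out under the same caveat (a reconstruction of the board, in the notation above):

\[
\text{For } \varepsilon(\hat h)\le \min_{h\in\mathcal H}\varepsilon(h)+2\gamma \text{ to hold with probability at least } 1-\delta,
\text{ it suffices that } m=O_{\gamma,\delta}\!\big(\mathrm{VC}(\mathcal H)\big),
\]
and for the class of linear classifiers in n dimensions, \(\mathrm{VC}(\mathcal H)=n+1\), matching the n plus one parameters of logistic regression.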
And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be, let's say, roughly linear in the number of parameters in your model; okay? There are some somewhat strange examples where what I just said is not true; there are some strange examples where you have very few parameters, but the VC dimension is enormous. But actually, all of the examples I know of that fall into that regime are somewhat strange and degenerate, somewhat unusual, and not the sorts of learning algorithms you usually use. Let's see, just a few other things. It turns out that, so this result shows the sample complexity is upper bounded by the VC dimension, but it turns out that, in the worst case, sample complexity is also lower bounded by the VC dimension. And what that means is that, if you have a perfectly nasty learning problem, say, then if the number of training examples you have is less than on the order of the VC dimension, then it is not possible to prove this bound. So, I guess, in the worst case, sample complexity, the number of training examples you need, is upper bounded and lower bounded by the VC dimension. Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness of, like, finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng): Let's see. The proof does not, no.
I've actually stated the entirety of the theorem; this is true. It turns out that, somewhere in the proof, there's a step with a construction called an epsilon net, which is a very clever [inaudible]. But regardless of the proof, it is not an assumption that you need; it's just that the proof has one step that uses a very clever [inaudible] to prove this. But that's not needed as an assumption. I'd like to say, back when I was a Ph.D. student, when I was working through this proof, there was sort of a solid week where I would wake up, and go to the office at 9:00 a.m. Then I'd start reading the book that led up to this proof. And then I'd read from 9:00 a.m. to 6:00 p.m. And then I'd go home, and then the next day, I'd pick up where I left off. And it sort of took me a whole week that way to understand this proof, so I thought I wouldn't inflict that on you. Just to tie up a couple of loose ends: what I'm about to do is just mention a few things that will maybe feel a little bit like random facts, but I'm just gonna tie up a couple of loose ends. And so, let's see, it turns out, just so this is more concrete: this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you ask is: how about support vector machines; right? How come SVMs don't overfit?
And if you remember our discussion on support vector machines, we said that you use kernels, and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: n plus one, and n is infinite. So it turns out that the class of linear separators with large margin actually has low VC dimension. I wanna say this very quickly, and informally. It's actually not very important for you to understand the details, but I'm going to say it very informally. It turns out that, if I give you a set of points, and if I ask you to consider only the class of lines that separate these points with a large margin [inaudible], so my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin, say with a margin of at least gamma; okay? And so I won't allow a line that comes closer; like, I won't allow that line, because it comes too close to one of my points. It turns out that, if my data points all lie within some sphere of radius r, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to r squared over four gamma squared, plus one; okay? So this funny symbol here just means rounding up. This is a ceiling symbol; it means rounding up x.
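Written as an equation, the bound just stated (with R the radius of the sphere containing the data, gamma the margin, and the restricted class written here as \(\mathcal H_{\gamma}\), a label introduced only for this note):

\[
\mathrm{VC}(\mathcal H_{\gamma})\;\le\;\left\lceil \frac{R^{2}}{4\gamma^{2}} \right\rceil+1.
\]

Note that, as discussed below, this bound does not depend on the dimension of the points x.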
And it turns out, you can prove, and there are some strange things about this result that I'm deliberately not gonna talk about, but it turns out you can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded. The surprising thing about this is that this is a bound on VC dimension that has no dependence on the dimension of the points x. So, in other words, your data points x can lie in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large margin separator, in trying to find the line that separates your positive and your negative examples with large margin, it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, suppose you have, actually, these are countably infinite dimensional vectors, not [inaudible] to the infinite dimensional vectors. The norm, the two-norm squared, is equal to the sum from i equals one to infinity of xi squared, so if x is infinite dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again.
Student: [Inaudible]. Instructor (Andrew Ng): Yes. Although, I assume that this is bounded by r. So, yeah, this ensures that it converges. So that's just something people sometimes wonder about. And lastly, I want to actually tie empirical risk minimization back a little more strongly to the sorts of algorithms we've talked about. It turns out that the theory we've talked about so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function, you know, that's zero here, jumps to one, and looks like that. And so this, for one training example, this may be the indicator for h where, say, z equals theta transpose x; okay? For one training example, your training example will be positive or negative. And depending on what the value of this theta transpose x is, you either get it right or wrong. And so, you know, I guess, if you have a positive example, then when z is positive, you get it right. Suppose you have a negative example, so y equals zero; right? Then if z, which is theta transpose x, if this is positive, then you will get this example wrong; whereas, if z is negative, then you'd get this example right. And so this is a plot of the indicator of h subscript theta of x not equal to y; okay?
You know, it's equal to g of theta transpose x; okay? And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on your training example, and so you'd like this indicator, h of x not equal to y, you'd like this indicator function to be zero. It turns out this step function is clearly a non-convex function. And so it turns out that, just for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem. And in particular, and I'll draw a function like that, it turns out that logistic regression is trying to maximize likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out it'll be a function that looks like this. And this line that I just drew, you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, so you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where, instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above it.
So approximate it, Dialogue: 0,0:34:25.19,0:34:28.23,Default,,0000,0000,0000,,so you have a convex optimization problem you can Dialogue: 0,0:34:28.23,0:34:29.93,Default,,0000,0000,0000,,find the Dialogue: 0,0:34:29.93,0:34:34.41,Default,,0000,0000,0000,,maximum likelihood estimates of the parameters for logistic regression. Dialogue: 0,0:34:34.41,0:34:38.69,Default,,0000,0000,0000,,And it turns out, the support vector machine also can be viewed as approximating this function, Dialogue: 0,0:34:38.69,0:34:41.14,Default,,0000,0000,0000,,too only a little bit differently let's Dialogue: 0,0:34:41.14,0:34:44.72,Default,,0000,0000,0000,,see, support vector machine turns out, can be viewed as trying to approximate this Dialogue: 0,0:34:44.72,0:34:46.21,Default,,0000,0000,0000,,step function too, Dialogue: 0,0:34:46.21,0:34:48.29,Default,,0000,0000,0000,,with a different Dialogue: 0,0:34:48.29,0:34:50.93,Default,,0000,0000,0000,,approximation that's linear, and then Dialogue: 0,0:34:50.93,0:34:53.59,Default,,0000,0000,0000,,that sort of [inaudible] linear that Dialogue: 0,0:34:53.59,0:34:55.71,Default,,0000,0000,0000,,goes flat [inaudible] there, and then it goes up as a Dialogue: 0,0:34:55.71,0:34:58.15,Default,,0000,0000,0000,,linear function there. And that Dialogue: 0,0:34:58.15,0:35:00.17,Default,,0000,0000,0000,,is called the hinge loss. And so you Dialogue: 0,0:35:00.17,0:35:03.80,Default,,0000,0000,0000,,can think of logistic regression and the support vector machine as Dialogue: 0,0:35:03.80,0:35:07.26,Default,,0000,0000,0000,,different approximations to try to minimize this Dialogue: 0,0:35:07.26,0:35:10.86,Default,,0000,0000,0000,,step function; Dialogue: 0,0:35:10.86,0:35:11.94,Default,,0000,0000,0000,,okay? Dialogue: 0,0:35:11.94,0:35:13.38,Default,,0000,0000,0000,,And that's why I guess, Dialogue: 0,0:35:13.38,0:35:16.11,Default,,0000,0000,0000,,all the theory we developed Dialogue: 0,0:35:16.11,0:35:19.24,Default,,0000,0000,0000,,even though SVM's and logistic regression aren't exactly doing Dialogue: 0,0:35:19.24,0:35:21.74,Default,,0000,0000,0000,,empirical risk minimization, Dialogue: 0,0:35:21.74,0:35:24.79,Default,,0000,0000,0000,,the theory we develop often gives the Dialogue: 0,0:35:24.79,0:35:31.79,Default,,0000,0000,0000,,completely appropriate intuitions for SVM's, and logistic regression; okay. Dialogue: 0,0:35:32.88,0:35:35.78,Default,,0000,0000,0000,,So that was the last of the loose ends. Dialogue: 0,0:35:35.78,0:35:39.32,Default,,0000,0000,0000,,And if you didn't get this, don't worry too much about it. It's a high-level message. It's Dialogue: 0,0:35:39.32,0:35:40.31,Default,,0000,0000,0000,,just that Dialogue: 0,0:35:40.31,0:35:44.48,Default,,0000,0000,0000,,SVM's and logistic regression are reasonable to think of as approximations to Dialogue: 0,0:35:44.48,0:35:47.77,Default,,0000,0000,0000,,empirical risk minimization algorithms. Dialogue: 0,0:35:47.77,0:35:51.61,Default,,0000,0000,0000,,What I want to do next is move on to talk about model selection. Before I do that, let me just Dialogue: 0,0:35:51.61,0:35:58.61,Default,,0000,0000,0000,,check for questions about Dialogue: 0,0:36:49.19,0:36:52.87,Default,,0000,0000,0000,,this. Okay. Cool. Okay.
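As a rough, hypothetical illustration of the picture described above (not code from the lecture), here is a minimal numpy sketch comparing the 0-1 step function with the logistic and hinge surrogates, written in terms of a margin z with labels taken in {-1, +1} for convenience:

```python
import numpy as np

# Margin z = y * theta^T x with labels in {-1, +1}; the lecture's 0/1-label
# picture gives the same shapes up to a shift and reflection.
z = np.linspace(-3.0, 3.0, 13)

zero_one = (z <= 0).astype(float)   # step function: 1 if the example is misclassified
logistic = np.log1p(np.exp(-z))     # negative log likelihood term for logistic regression
hinge = np.maximum(0.0, 1.0 - z)    # hinge loss used by the support vector machine

for zi, a, b, c in zip(z, zero_one, logistic, hinge):
    print(f"z={zi:5.2f}  0-1={a:.0f}  logistic={b:.3f}  hinge={c:.3f}")
```

Both surrogates are convex and roughly track the step function, which is why minimizing them is a tractable stand-in for minimizing training error.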
So in the theory that we started to develop in the previous lecture, and that Dialogue: 0,0:36:52.87,0:36:53.33,Default,,0000,0000,0000,,we Dialogue: 0,0:36:53.33,0:36:55.05,Default,,0000,0000,0000,,sort of wrapped up with a Dialogue: 0,0:36:55.05,0:36:56.80,Default,,0000,0000,0000,,discussion on VC dimension, Dialogue: 0,0:36:56.80,0:37:01.02,Default,,0000,0000,0000,,we saw that there's often a trade-off between bias and variance. And in Dialogue: 0,0:37:01.02,0:37:02.49,Default,,0000,0000,0000,,particular, so Dialogue: 0,0:37:02.49,0:37:06.76,Default,,0000,0000,0000,,it is important not to choose a hypothesis that's either too simple or too Dialogue: 0,0:37:06.76,0:37:07.85,Default,,0000,0000,0000,,complex. So Dialogue: 0,0:37:07.85,0:37:11.25,Default,,0000,0000,0000,,if your data has sort of a quadratic structure to it, Dialogue: 0,0:37:11.25,0:37:13.71,Default,,0000,0000,0000,,then if you choose a linear Dialogue: 0,0:37:13.71,0:37:17.89,Default,,0000,0000,0000,,function to try to approximate it, then you would under fit. So you have a Dialogue: 0,0:37:17.89,0:37:19.94,Default,,0000,0000,0000,,hypothesis with high bias. Dialogue: 0,0:37:19.94,0:37:24.18,Default,,0000,0000,0000,,And conversely, we choose a hypothesis that's too complex, and you have high variance. Dialogue: 0,0:37:24.18,0:37:26.13,Default,,0000,0000,0000,,And you'll also Dialogue: 0,0:37:26.13,0:37:31.90,Default,,0000,0000,0000,,fail to fit. Then you would over fit the data, and you'd also fail to generalize well. Dialogue: 0,0:37:31.90,0:37:35.65,Default,,0000,0000,0000,,So model selection algorithms Dialogue: 0,0:37:35.65,0:37:39.63,Default,,0000,0000,0000,,provide a class of methods to automatically trade make these tradeoffs Dialogue: 0,0:37:39.63,0:37:41.62,Default,,0000,0000,0000,,between bias Dialogue: 0,0:37:41.62,0:37:43.07,Default,,0000,0000,0000,,and variance; right? So remember the Dialogue: 0,0:37:43.07,0:37:44.99,Default,,0000,0000,0000,,cartoon I drew last time Dialogue: 0,0:37:44.99,0:37:51.99,Default,,0000,0000,0000,,of generalization error? Dialogue: 0,0:37:54.31,0:37:57.41,Default,,0000,0000,0000,,I drew this last time. Where on the x-axis Dialogue: 0,0:37:57.41,0:38:03.32,Default,,0000,0000,0000,,was model complexity, meaning the number of Dialogue: 0,0:38:03.32,0:38:07.60,Default,,0000,0000,0000,,the degree of the polynomial; the [inaudible] regression function Dialogue: 0,0:38:07.60,0:38:08.72,Default,,0000,0000,0000,,or whatever. Dialogue: 0,0:38:08.72,0:38:12.47,Default,,0000,0000,0000,,And if you have too simple a model, you have high generalization error, those under Dialogue: 0,0:38:12.47,0:38:13.53,Default,,0000,0000,0000,,fitting. Dialogue: 0,0:38:13.53,0:38:15.63,Default,,0000,0000,0000,,And you if have too complex a model, Dialogue: 0,0:38:15.63,0:38:19.73,Default,,0000,0000,0000,,like 15 or 14-degree polynomial to five data points, then you also have high Dialogue: 0,0:38:19.73,0:38:24.07,Default,,0000,0000,0000,,generalization error, and you're over fitting. Dialogue: 0,0:38:24.07,0:38:28.56,Default,,0000,0000,0000,,So what I wanna do now is actually just talk about model selection in the abstract; Dialogue: 0,0:38:28.56,0:38:30.02,Default,,0000,0000,0000,,all right? 
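To make that cartoon concrete, here is a small made-up numpy example (the data, noise level, and degrees are invented for illustration, not taken from the lecture): it fits polynomials of increasing degree and prints the training error next to the error on points held aside from the fit, which stands in for generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with quadratic structure plus noise (made-up numbers, purely illustrative).
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.3 * rng.standard_normal(x.size)

fit_x, fit_y = x[:30], y[:30]   # points used to fit each polynomial
new_x, new_y = x[30:], y[30:]   # points not used for fitting, standing in for new examples

for degree in (1, 2, 5, 12):
    coeffs = np.polyfit(fit_x, fit_y, degree)
    fit_err = np.mean((np.polyval(coeffs, fit_x) - fit_y) ** 2)
    new_err = np.mean((np.polyval(coeffs, new_x) - new_y) ** 2)
    print(f"degree {degree:2d}: training MSE {fit_err:.3f}, MSE on new points {new_err:.3f}")
```

Degree 1 underfits (high error everywhere), while a very high degree drives the training error down but the error on new points back up.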
Dialogue: 0,0:38:30.02,0:38:32.78,Default,,0000,0000,0000,,Some examples of model selection problems will include Dialogue: 0,0:38:32.78,0:38:34.37,Default,,0000,0000,0000,,well, I'll run the example Dialogue: 0,0:38:34.37,0:38:36.14,Default,,0000,0000,0000,,of Dialogue: 0,0:38:36.14,0:38:36.79,Default,,0000,0000,0000,,let's say you're Dialogue: 0,0:38:36.79,0:38:43.79,Default,,0000,0000,0000,,trying to choose the degree of a polynomial; Dialogue: 0,0:38:45.16,0:38:48.91,Default,,0000,0000,0000,,right? What degree polynomial do you want to choose? Dialogue: 0,0:38:48.91,0:38:51.90,Default,,0000,0000,0000,,Another example of a model selection problem would be if you're trying to choose Dialogue: 0,0:38:51.90,0:38:53.87,Default,,0000,0000,0000,,the parameter [inaudible], Dialogue: 0,0:38:53.87,0:38:59.84,Default,,0000,0000,0000,,which was the bandwidth parameter Dialogue: 0,0:38:59.84,0:39:04.14,Default,,0000,0000,0000,,in locally weighted linear regression Dialogue: 0,0:39:04.14,0:39:07.71,Default,,0000,0000,0000,,or in some sort of locally weighted Dialogue: 0,0:39:07.71,0:39:08.44,Default,,0000,0000,0000,,regression. Dialogue: 0,0:39:08.44,0:39:10.48,Default,,0000,0000,0000,,Yet, another model selection problem Dialogue: 0,0:39:10.48,0:39:13.94,Default,,0000,0000,0000,,is if you're trying to choose the parameter c [inaudible] in the Dialogue: 0,0:39:13.94,0:39:16.05,Default,,0000,0000,0000,,[inaudible]; Dialogue: 0,0:39:16.05,0:39:21.68,Default,,0000,0000,0000,,right? And so in the l1 norm soft margin SVM, Dialogue: 0,0:39:21.68,0:39:28.54,Default,,0000,0000,0000,,we had this optimization objective; right? Dialogue: 0,0:39:28.54,0:39:31.04,Default,,0000,0000,0000,,And the parameter c Dialogue: 0,0:39:31.04,0:39:32.97,Default,,0000,0000,0000,,controls the tradeoff between Dialogue: 0,0:39:32.97,0:39:34.02,Default,,0000,0000,0000,,how much you want Dialogue: 0,0:39:34.02,0:39:38.24,Default,,0000,0000,0000,,to have a large margin versus how much you want to penalize Dialogue: 0,0:39:38.24,0:39:40.50,Default,,0000,0000,0000,, Dialogue: 0,0:39:40.50,0:39:44.28,Default,,0000,0000,0000,,misclassifying examples. So these are three specific examples of model selection Dialogue: 0,0:39:44.28,0:39:45.34,Default,,0000,0000,0000,,problems. Dialogue: 0,0:39:45.34,0:39:47.57,Default,,0000,0000,0000,,And Dialogue: 0,0:39:47.57,0:39:54.57,Default,,0000,0000,0000,,let's come up with a method for automatically choosing them. Dialogue: 0,0:40:13.68,0:40:17.85,Default,,0000,0000,0000,,Let's say you have some finite set of models, and let's write these as m1, m2, m3, Dialogue: 0,0:40:17.85,0:40:20.13,Default,,0000,0000,0000,,and Dialogue: 0,0:40:20.13,0:40:25.22,Default,,0000,0000,0000,,so on. For example, this may be the linear Dialogue: 0,0:40:25.22,0:40:29.66,Default,,0000,0000,0000,,classifier or this may be the quadratic classifier, Dialogue: 0,0:40:29.66,0:40:30.41,Default,,0000,0000,0000,,and so on; Dialogue: 0,0:40:30.41,0:40:32.17,Default,,0000,0000,0000,,okay? Or this Dialogue: 0,0:40:32.17,0:40:33.70,Default,,0000,0000,0000,,may also be Dialogue: 0,0:40:33.70,0:40:37.24,Default,,0000,0000,0000,,you may also take the bandwidth parameter [inaudible] Dialogue: 0,0:40:37.24,0:40:41.13,Default,,0000,0000,0000,,and discretize it into a range of values, and you're trying to choose from among this Dialogue: 0,0:40:41.13,0:40:43.47,Default,,0000,0000,0000,,discrete set of values.
Dialogue: 0,0:40:43.47,0:40:48.24,Default,,0000,0000,0000,,So let's talk about how you would select an appropriate model; all right? Well, Dialogue: 0,0:40:48.24,0:40:52.03,Default,,0000,0000,0000,,one thing you could do is you can pick all of these models, and train them on Dialogue: 0,0:40:52.03,0:40:53.87,Default,,0000,0000,0000,,your training set. Dialogue: 0,0:40:53.87,0:40:56.99,Default,,0000,0000,0000,,And then see which model has the lowest training Dialogue: 0,0:40:56.99,0:41:03.99,Default,,0000,0000,0000,,error. So that's a terrible idea, and why's that? Dialogue: 0,0:41:05.45,0:41:11.10,Default,,0000,0000,0000,,Right. Cool. Because of the over fit; right. And so some of you are laughing that Dialogue: 0,0:41:11.10,0:41:13.94,Default,,0000,0000,0000,,I asked that. So that'd be a terrible idea to choose a model Dialogue: 0,0:41:13.94,0:41:17.02,Default,,0000,0000,0000,,by looking at your training set because well, obviously, you end up choosing the most Dialogue: 0,0:41:17.02,0:41:18.84,Default,,0000,0000,0000,,complex model; right? And you Dialogue: 0,0:41:18.84,0:41:25.55,Default,,0000,0000,0000,,choose a 10th degree polynomial because that's what fits the training set. So we Dialogue: 0,0:41:25.55,0:41:28.57,Default,,0000,0000,0000,,come to model selection on a training set, there are Dialogue: 0,0:41:28.57,0:41:31.94,Default,,0000,0000,0000,,several standard procedures to do this. One is hold out cross Dialogue: 0,0:41:31.94,0:41:38.87,Default,,0000,0000,0000,,validation, Dialogue: 0,0:41:38.87,0:41:41.41,Default,,0000,0000,0000,,and in hold out cross validation, Dialogue: 0,0:41:41.41,0:41:45.51,Default,,0000,0000,0000,,we take a training set. And we randomly split Dialogue: 0,0:41:45.51,0:41:50.53,Default,,0000,0000,0000,,the training set into two subsets. We Dialogue: 0,0:41:50.53,0:41:51.49,Default,,0000,0000,0000,,call it Dialogue: 0,0:41:51.49,0:41:53.59,Default,,0000,0000,0000,,subset take all the data you have Dialogue: 0,0:41:53.59,0:42:00.59,Default,,0000,0000,0000,,and randomly split it into two subsets. And we'll call it the training set, and the Dialogue: 0,0:42:00.80,0:42:05.16,Default,,0000,0000,0000,,hold out cross validation subset. Dialogue: 0,0:42:05.16,0:42:07.12,Default,,0000,0000,0000,,And then, Dialogue: 0,0:42:07.12,0:42:11.99,Default,,0000,0000,0000,,you know, you train each model Dialogue: 0,0:42:11.99,0:42:18.74,Default,,0000,0000,0000,,on just the training subset of it, and test it Dialogue: 0,0:42:18.74,0:42:22.07,Default,,0000,0000,0000,,on your hold out cross validation set. Dialogue: 0,0:42:22.07,0:42:25.29,Default,,0000,0000,0000,,And you pick the model Dialogue: 0,0:42:25.29,0:42:32.29,Default,,0000,0000,0000,,with the lowest error Dialogue: 0,0:42:32.95,0:42:34.89,Default,,0000,0000,0000,,on the hold out cross validation Dialogue: 0,0:42:34.89,0:42:36.96,Default,,0000,0000,0000,,subset; okay? So this is sort of a Dialogue: 0,0:42:36.96,0:42:39.80,Default,,0000,0000,0000,,relatively straightforward procedure, and it's commonly used Dialogue: 0,0:42:39.80,0:42:42.04,Default,,0000,0000,0000,,where you train on 70 percent of the data. Dialogue: 0,0:42:42.04,0:42:44.51,Default,,0000,0000,0000,,Then test all of your models on the 30 percent, and you can take Dialogue: 0,0:42:44.51,0:42:45.90,Default,,0000,0000,0000,,whatever has the smallest Dialogue: 0,0:42:45.90,0:42:51.24,Default,,0000,0000,0000,,hold out cross validation error. Dialogue: 0,0:42:51.24,0:42:54.83,Default,,0000,0000,0000,,And after this you actually have a choice.
You can actually Dialogue: 0,0:42:54.83,0:42:58.59,Default,,0000,0000,0000,,having taken all of these hypothesis trained on 70 percent of the data, Dialogue: 0,0:42:58.59,0:43:02.65,Default,,0000,0000,0000,,you can actually just output the hypothesis that has the lowest error on your hold out Dialogue: 0,0:43:02.65,0:43:04.47,Default,,0000,0000,0000,,cross validation set. Dialogue: 0,0:43:04.47,0:43:08.62,Default,,0000,0000,0000,,And optionally, you can actually take the model that you selected Dialogue: 0,0:43:08.62,0:43:13.08,Default,,0000,0000,0000,,and go back, and retrain it on all 100 percent of the data; Dialogue: 0,0:43:13.08,0:43:15.53,Default,,0000,0000,0000,,okay? So both versions are actually done and used really often. You can Dialogue: 0,0:43:15.53,0:43:16.41,Default,,0000,0000,0000,,either, Dialogue: 0,0:43:16.41,0:43:20.04,Default,,0000,0000,0000,,you know, just take the best hypothesis that was trained on 70 percent of the data, Dialogue: 0,0:43:20.04,0:43:22.42,Default,,0000,0000,0000,,and just output that as you find the hypothesis Dialogue: 0,0:43:22.42,0:43:23.96,Default,,0000,0000,0000,,or you can use this to Dialogue: 0,0:43:23.96,0:43:27.29,Default,,0000,0000,0000,,say, having chosen the degree of the polynomial you want to fit, you can then go Dialogue: 0,0:43:27.29,0:43:31.33,Default,,0000,0000,0000,,back and retrain the model on the entire 100 percent of your data. Dialogue: 0,0:43:31.33,0:43:38.33,Default,,0000,0000,0000,,And both of these are commonly done. Dialogue: 0,0:43:39.75,0:43:46.75,Default,,0000,0000,0000,, Dialogue: 0,0:43:50.49,0:43:54.46,Default,,0000,0000,0000,,How about a cross validation does sort of work straight? Dialogue: 0,0:43:54.46,0:43:55.76,Default,,0000,0000,0000,,And Dialogue: 0,0:43:55.76,0:43:58.25,Default,,0000,0000,0000,,sometimes we're working Dialogue: 0,0:43:58.25,0:43:58.95,Default,,0000,0000,0000,, Dialogue: 0,0:43:58.95,0:44:03.67,Default,,0000,0000,0000,,with a company or application or something. The Dialogue: 0,0:44:03.67,0:44:05.93,Default,,0000,0000,0000,,many machine-learning applications we have Dialogue: 0,0:44:05.93,0:44:07.58,Default,,0000,0000,0000,,very little data or where, you Dialogue: 0,0:44:07.58,0:44:10.10,Default,,0000,0000,0000,,know, every training example you have was Dialogue: 0,0:44:10.10,0:44:14.94,Default,,0000,0000,0000,,painfully acquired at great cost; right? Sometimes your data is Dialogue: 0,0:44:14.94,0:44:16.04,Default,,0000,0000,0000,,acquired by Dialogue: 0,0:44:16.04,0:44:18.78,Default,,0000,0000,0000,,medical experiments, and each of these each Dialogue: 0,0:44:18.78,0:44:23.83,Default,,0000,0000,0000,,training example represents a sick man in amounts of physical human pain or something. So Dialogue: 0,0:44:23.83,0:44:27.28,Default,,0000,0000,0000,,we talk and say, well, I'm going to hold out 30 percent of your data set, just to select Dialogue: 0,0:44:27.28,0:44:28.11,Default,,0000,0000,0000,,my model. Dialogue: 0,0:44:28.11,0:44:32.81,Default,,0000,0000,0000,,If people were who sometimes that causes unhappiness, and so maybe you wanna Dialogue: 0,0:44:32.81,0:44:37.10,Default,,0000,0000,0000,,use not have to leave out 30 percent of your data just to do model Dialogue: 0,0:44:37.10,0:44:38.72,Default,,0000,0000,0000,,selection. Dialogue: 0,0:44:38.72,0:44:43.40,Default,,0000,0000,0000,,So there are a couple of other variations on hold out cross validation that makes Dialogue: 0,0:44:43.40,0:44:47.81,Default,,0000,0000,0000,,sometimes, slightly more efficient use of the data. 
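Before moving on, here is a minimal sketch of the hold-out procedure just described, assuming each candidate model exposes hypothetical fit(X, y) and error(X, y) methods (placeholder names, not an interface from the lecture):

```python
import numpy as np

def holdout_select(models, X, y, train_frac=0.7, seed=0):
    """Hold-out cross validation sketch. `models` is a list of candidate models,
    each assumed to have hypothetical fit(X, y) and error(X, y) methods."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, hold = idx[:n_train], idx[n_train:]

    hold_errors = []
    for model in models:
        model.fit(X[train], y[train])                      # train on ~70 percent of the data
        hold_errors.append(model.error(X[hold], y[hold]))  # test on the held-out ~30 percent

    best = models[int(np.argmin(hold_errors))]
    best.fit(X, y)   # optionally retrain the selected model on 100 percent of the data
    return best
```

The final fit(X, y) call corresponds to the optional retraining step described above; you could equally return the hypothesis that was trained on only the 70 percent.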
Dialogue: 0,0:44:47.81,0:44:51.91,Default,,0000,0000,0000,,And one is called k-fold cross validation. Dialogue: 0,0:44:51.91,0:44:54.19,Default,,0000,0000,0000,,And here's the idea: I'm gonna Dialogue: 0,0:44:54.19,0:44:55.95,Default,,0000,0000,0000,,take all of Dialogue: 0,0:44:55.95,0:44:56.82,Default,,0000,0000,0000,,my data s; Dialogue: 0,0:44:56.82,0:45:00.62,Default,,0000,0000,0000,,so imagine, I'm gonna draw this Dialogue: 0,0:45:00.62,0:45:04.49,Default,,0000,0000,0000,,box s, as to note the entirety of all the data I have. Dialogue: 0,0:45:04.49,0:45:07.49,Default,,0000,0000,0000,,And I'll then divide it Dialogue: 0,0:45:07.49,0:45:12.100,Default,,0000,0000,0000,,into k pieces, and this is five pieces in what I've drawn. Dialogue: 0,0:45:12.100,0:45:16.81,Default,,0000,0000,0000,,Then what'll I'll do is I will repeatedly Dialogue: 0,0:45:16.81,0:45:23.41,Default,,0000,0000,0000,,train on k minus one pieces. Dialogue: 0,0:45:23.41,0:45:27.67,Default,,0000,0000,0000,,Test on the remaining one Dialogue: 0,0:45:27.67,0:45:30.65,Default,,0000,0000,0000,,test on the remaining piece, I guess; Dialogue: 0,0:45:30.65,0:45:35.70,Default,,0000,0000,0000,,right? And then you average Dialogue: 0,0:45:35.70,0:45:41.37,Default,,0000,0000,0000,,over the k result. Dialogue: 0,0:45:41.37,0:45:43.81,Default,,0000,0000,0000,,So another way, we'll just hold out I will Dialogue: 0,0:45:43.81,0:45:48.33,Default,,0000,0000,0000,,hold Dialogue: 0,0:45:48.33,0:45:50.62,Default,,0000,0000,0000,,out say, just 1/5 of my data Dialogue: 0,0:45:50.62,0:45:54.65,Default,,0000,0000,0000,,and I'll train on the remaining 4/5, and I'll test on the first one. Dialogue: 0,0:45:54.65,0:45:58.40,Default,,0000,0000,0000,,And then I'll then go and hold out the second 1/5 from my [inaudible] for the Dialogue: 0,0:45:58.40,0:46:03.10,Default,,0000,0000,0000,,remaining pieces test on this, you remove the third piece, Dialogue: 0,0:46:03.10,0:46:05.91,Default,,0000,0000,0000,,train on the 4/5; I'm gonna do this five times. Dialogue: 0,0:46:05.91,0:46:09.09,Default,,0000,0000,0000,,And then I'll take the five error measures I Dialogue: 0,0:46:09.09,0:46:10.57,Default,,0000,0000,0000,,have and I'll Dialogue: 0,0:46:10.57,0:46:16.37,Default,,0000,0000,0000,,average them. And this then gives me an estimate of the generalization error of my model; okay? Dialogue: 0,0:46:16.37,0:46:18.09,Default,,0000,0000,0000,,And then, again, when you do Dialogue: 0,0:46:18.09,0:46:19.93,Default,,0000,0000,0000,,k-fold cross validation, Dialogue: 0,0:46:19.93,0:46:21.83,Default,,0000,0000,0000,,usually you then go back and Dialogue: 0,0:46:21.83,0:46:24.03,Default,,0000,0000,0000,,retrain the model you selected Dialogue: 0,0:46:24.03,0:46:27.67,Default,,0000,0000,0000,,on the entirety of your training set. Dialogue: 0,0:46:27.67,0:46:33.93,Default,,0000,0000,0000,,So I drew five pieces here because that was easier for me to draw, but k equals 10 is Dialogue: 0,0:46:33.93,0:46:39.34,Default,,0000,0000,0000,,very Dialogue: 0,0:46:39.34,0:46:41.71,Default,,0000,0000,0000,,common; okay? I should say Dialogue: 0,0:46:41.71,0:46:47.51,Default,,0000,0000,0000,,k equals 10 is the fairly common choice to do 10 fold cross validation. Dialogue: 0,0:46:47.51,0:46:52.30,Default,,0000,0000,0000,,And the advantage of the over hold out cross option is that you switch the data into ten pieces. 
Dialogue: 0,0:46:52.30,0:46:54.40,Default,,0000,0000,0000,,Then each time you're only holding out Dialogue: 0,0:46:54.40,0:46:56.38,Default,,0000,0000,0000,,1/10 of your data, rather than, Dialogue: 0,0:46:56.38,0:46:59.12,Default,,0000,0000,0000,,you know, say, 30 percent of your data. I Dialogue: 0,0:46:59.12,0:47:03.31,Default,,0000,0000,0000,,must say, in standard hold out in simple hold out cross validation, Dialogue: 0,0:47:03.31,0:47:05.62,Default,,0000,0000,0000,,a 30 70 split is fairly common. Dialogue: 0,0:47:05.62,0:47:07.85,Default,,0000,0000,0000,,Sometimes like 2/3 1/3 or Dialogue: 0,0:47:07.85,0:47:09.92,Default,,0000,0000,0000,,a 70 30 split is fairly common. Dialogue: 0,0:47:09.92,0:47:14.30,Default,,0000,0000,0000,,And if you use k-fold cross validation, k equals 5 or more commonly k equals 10, and is the most Dialogue: 0,0:47:14.30,0:47:16.40,Default,,0000,0000,0000,,common choice. Dialogue: 0,0:47:16.40,0:47:19.85,Default,,0000,0000,0000,,The disadvantage of k-fold cross validation is that it can be much more Dialogue: 0,0:47:19.85,0:47:21.43,Default,,0000,0000,0000,,computationally expensive. Dialogue: 0,0:47:21.43,0:47:22.100,Default,,0000,0000,0000,,In particular, Dialogue: 0,0:47:22.100,0:47:26.16,Default,,0000,0000,0000,,to validate your model, you now need to train your model ten times, Dialogue: 0,0:47:26.16,0:47:29.09,Default,,0000,0000,0000,,instead of just once. And so you need Dialogue: 0,0:47:29.09,0:47:33.14,Default,,0000,0000,0000,,to: from logistic regression, ten times per model, rather than just once. And so this is Dialogue: 0,0:47:33.14,0:47:34.57,Default,,0000,0000,0000,,computationally more expensive. Dialogue: 0,0:47:34.57,0:47:38.33,Default,,0000,0000,0000,,But k equals ten works great. Dialogue: 0,0:47:38.33,0:47:40.25,Default,,0000,0000,0000,,And then, Dialogue: 0,0:47:40.25,0:47:44.27,Default,,0000,0000,0000,,finally, in Dialogue: 0,0:47:44.27,0:47:47.77,Default,,0000,0000,0000,,there's actually a version of this that you can take even further, Dialogue: 0,0:47:47.77,0:47:50.06,Default,,0000,0000,0000,,which is when your set k Dialogue: 0,0:47:50.06,0:47:51.66,Default,,0000,0000,0000,,equals m. Dialogue: 0,0:47:51.66,0:47:56.69,Default,,0000,0000,0000,,And so that's when you take your training set, and you split it into as many pieces as you have training Dialogue: 0,0:47:56.69,0:47:59.19,Default,,0000,0000,0000,,examples. Dialogue: 0,0:47:59.19,0:48:06.19,Default,,0000,0000,0000,,And this procedure is called leave one out cross validation. Dialogue: 0,0:48:06.43,0:48:07.48,Default,,0000,0000,0000,,And Dialogue: 0,0:48:07.48,0:48:09.07,Default,,0000,0000,0000,,what you do is you then Dialogue: 0,0:48:09.07,0:48:12.48,Default,,0000,0000,0000,,take out the first training example, train on the rest, and test on the first Dialogue: 0,0:48:12.48,0:48:13.36,Default,,0000,0000,0000,,example. Dialogue: 0,0:48:13.36,0:48:16.48,Default,,0000,0000,0000,,Then you take out the second training example, Dialogue: 0,0:48:16.48,0:48:18.56,Default,,0000,0000,0000,,train on the rest, and test on the second example. Then Dialogue: 0,0:48:18.56,0:48:22.18,Default,,0000,0000,0000,,you take out the third example, train on everything, but your third example. Test on Dialogue: 0,0:48:22.18,0:48:24.08,Default,,0000,0000,0000,,the third example, and so on. 
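Here is a corresponding sketch of k-fold cross validation under the same hypothetical fit/error interface as in the hold-out sketch; setting k equal to the number of training examples gives the leave-one-out variant just described.

```python
import numpy as np

def kfold_error(model, X, y, k=10, seed=0):
    """k-fold cross validation sketch; k = len(y) gives leave-one-out."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)

    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                 # train on the other k - 1 pieces
        errors.append(model.error(X[test], y[test]))  # test on the held-out piece
    return float(np.mean(errors))                     # average the k error measures
```

You would compute this estimate for each candidate model, pick the model with the lowest average error, and then, as described above, usually retrain that model on the entire training set.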
Dialogue: 0,0:48:24.08,0:48:25.41,Default,,0000,0000,0000,,And so Dialogue: 0,0:48:25.41,0:48:27.90,Default,,0000,0000,0000,,with this many pieces you are now making, Dialogue: 0,0:48:27.90,0:48:31.17,Default,,0000,0000,0000,,maybe even more effective use of your data than k-fold cross Dialogue: 0,0:48:31.17,0:48:32.46,Default,,0000,0000,0000,,validation. But Dialogue: 0,0:48:32.46,0:48:36.27,Default,,0000,0000,0000,,you could leave leave one out cross validation is computationally very expensive Dialogue: 0,0:48:36.27,0:48:37.46,Default,,0000,0000,0000,,because now Dialogue: 0,0:48:37.46,0:48:41.97,Default,,0000,0000,0000,,you need to repeatedly leave one example out, and then run your learning Dialogue: 0,0:48:41.97,0:48:46.46,Default,,0000,0000,0000,,algorithm on m minus one training examples. You need to do this a lot of times, and so Dialogue: 0,0:48:46.46,0:48:49.12,Default,,0000,0000,0000,,this is computationally very expensive. Dialogue: 0,0:48:49.12,0:48:54.83,Default,,0000,0000,0000,,And typically, this is done only when you're extremely data scarce. So if you Dialogue: 0,0:48:54.83,0:48:58.78,Default,,0000,0000,0000,,have a learning problem where you have, say, 15 training examples or something, Dialogue: 0,0:48:58.78,0:49:01.11,Default,,0000,0000,0000,,then if you have very few training examples, leave one Dialogue: 0,0:49:01.11,0:49:03.63,Default,,0000,0000,0000,,out cross validation Dialogue: 0,0:49:03.63,0:49:04.46,Default,,0000,0000,0000,,is Dialogue: 0,0:49:04.46,0:49:11.46,Default,,0000,0000,0000,,maybe preferred. Dialogue: 0,0:49:15.71,0:49:18.37,Default,,0000,0000,0000,,Yeah? Dialogue: 0,0:49:18.37,0:49:22.89,Default,,0000,0000,0000,,Student:You know, that time you proved that the difference between the generalized [inaudible] by Dialogue: 0,0:49:22.89,0:49:26.99,Default,,0000,0000,0000,,number of examples in your training Dialogue: 0,0:49:26.99,0:49:28.09,Default,,0000,0000,0000,,set and VC Dialogue: 0,0:49:28.09,0:49:30.61,Default,,0000,0000,0000,,dimension. So Dialogue: 0,0:49:30.61,0:49:32.68,Default,,0000,0000,0000,,maybe [inaudible] examples into Dialogue: 0,0:49:32.68,0:49:34.60,Default,,0000,0000,0000,,different groups, we can use that for [inaudible]. Instructor (Andrew Ng):Yeah, I Dialogue: 0,0:49:34.60,0:49:36.83,Default,,0000,0000,0000,,mean Student:- compute the training error, and use that for computing [inaudible] for a generalized error. Instructor (Andrew Ng):Yeah, that's done, but Dialogue: 0,0:49:36.83,0:49:40.10,Default,,0000,0000,0000,,yeah, in practice, I personally tend not to do that. It Dialogue: 0,0:49:40.10,0:49:42.73,Default,,0000,0000,0000,,tends not to be the VC Dialogue: 0,0:49:42.73,0:49:46.53,Default,,0000,0000,0000,,dimension bounds are somewhat loose bounds. And so Dialogue: 0,0:49:46.53,0:49:47.96,Default,,0000,0000,0000,,there are people Dialogue: 0,0:49:47.96,0:49:51.06,Default,,0000,0000,0000,,in structure risk minimization that propose what you do, but I personally tend not to do Dialogue: 0,0:49:51.06,0:49:54.54,Default,,0000,0000,0000,,that, Dialogue: 0,0:50:09.46,0:50:16.46,Default,,0000,0000,0000,,though. Questions for cross validation? Dialogue: 0,0:50:33.52,0:50:35.67,Default,,0000,0000,0000,,Yeah. This Dialogue: 0,0:50:35.67,0:50:38.90,Default,,0000,0000,0000,,is kind of far from there because when we spend all this time [inaudible] but how many data points do you sort of need to go into your certain marginal [inaudible]? Dialogue: 0,0:50:38.90,0:50:39.94,Default,,0000,0000,0000,,Right. 
Dialogue: 0,0:50:39.94,0:50:42.24,Default,,0000,0000,0000,,So it seems like when I'd be Dialogue: 0,0:50:42.24,0:50:44.13,Default,,0000,0000,0000,,able to use that Dialogue: 0,0:50:44.13,0:50:46.63,Default,,0000,0000,0000,, Dialogue: 0,0:50:46.63,0:50:47.77,Default,,0000,0000,0000,,instead Dialogue: 0,0:50:47.77,0:50:49.98,Default,,0000,0000,0000,,of do this; more analytically, I guess. I mean Yeah. [Inaudible]. Instructor Dialogue: 0,0:50:49.98,0:50:53.35,Default,,0000,0000,0000,,No okay. So it turns out that when you're proving learning theory bounds, very often Dialogue: 0,0:50:53.35,0:50:58.11,Default,,0000,0000,0000,,the bounds will be extremely loose because you're sort of proving the worse case upper bound that Dialogue: 0,0:50:58.11,0:50:59.16,Default,,0000,0000,0000,,holds true Dialogue: 0,0:50:59.16,0:51:03.30,Default,,0000,0000,0000,,even for very bad what is Dialogue: 0,0:51:03.30,0:51:06.43,Default,,0000,0000,0000,,it Dialogue: 0,0:51:06.43,0:51:10.48,Default,,0000,0000,0000,,so the bounds that I proved just now; right? That holds true for absolutely any Dialogue: 0,0:51:10.48,0:51:12.21,Default,,0000,0000,0000,,probability distribution over training Dialogue: 0,0:51:12.21,0:51:13.48,Default,,0000,0000,0000,,examples; right? Dialogue: 0,0:51:13.48,0:51:17.07,Default,,0000,0000,0000,,So just assume the training examples we've drawn, iid Dialogue: 0,0:51:17.07,0:51:19.03,Default,,0000,0000,0000,,from some distribution script d, Dialogue: 0,0:51:19.03,0:51:23.21,Default,,0000,0000,0000,,and the bounds I proved hold true for absolutely any probability Dialogue: 0,0:51:23.21,0:51:27.13,Default,,0000,0000,0000,,distribution over script Dialogue: 0,0:51:27.13,0:51:28.94,Default,,0000,0000,0000,,d. And chances are Dialogue: 0,0:51:28.94,0:51:33.05,Default,,0000,0000,0000,,whatever real life distribution you get over, you know, houses and their Dialogue: 0,0:51:33.05,0:51:36.14,Default,,0000,0000,0000,,prices or whatever, is probably not as bad as Dialogue: 0,0:51:36.14,0:51:39.12,Default,,0000,0000,0000,,the very worse one you Dialogue: 0,0:51:39.12,0:51:41.25,Default,,0000,0000,0000,,could've gotten; okay? Dialogue: 0,0:51:41.25,0:51:42.19,Default,,0000,0000,0000,,And so it turns out that if you Dialogue: 0,0:51:42.19,0:51:46.30,Default,,0000,0000,0000,,actually plug in the constants of learning theory bounds, you often get Dialogue: 0,0:51:46.30,0:51:50.84,Default,,0000,0000,0000,,extremely large numbers. Take logistic regression logistic Dialogue: 0,0:51:50.84,0:51:53.23,Default,,0000,0000,0000,,regression you have ten parameters Dialogue: 0,0:51:53.23,0:51:56.97,Default,,0000,0000,0000,,and 0.01 error, and with 95 percent probability. How many training Dialogue: 0,0:51:56.97,0:51:58.24,Default,,0000,0000,0000,,examples do I Dialogue: 0,0:51:58.24,0:52:02.11,Default,,0000,0000,0000,,need? If you actually plug in actual constants into the text for learning theory bounds, Dialogue: 0,0:52:02.11,0:52:06.33,Default,,0000,0000,0000,,you often get extremely pessimistic estimates with the number of examples you need. You end Dialogue: 0,0:52:06.33,0:52:07.53,Default,,0000,0000,0000,,up with Dialogue: 0,0:52:07.53,0:52:11.75,Default,,0000,0000,0000,,some ridiculously large numbers. You would need 10,000 training examples to fit Dialogue: 0,0:52:11.75,0:52:14.35,Default,,0000,0000,0000,,ten parameters. 
Dialogue: 0,0:52:14.35,0:52:17.49,Default,,0000,0000,0000,,So Dialogue: 0,0:52:17.49,0:52:20.32,Default,,0000,0000,0000,,a good way to think of these learning theory bounds is Dialogue: 0,0:52:20.32,0:52:24.60,Default,,0000,0000,0000,,and this is why, also, when I write papers on learning theory bounds, I Dialogue: 0,0:52:24.60,0:52:26.39,Default,,0000,0000,0000,,quite often Dialogue: 0,0:52:26.39,0:52:30.16,Default,,0000,0000,0000,,use big-O notation to just absolutely just ignore the constant factors because Dialogue: 0,0:52:30.16,0:52:33.34,Default,,0000,0000,0000,,the bounds seem to be very loose. Dialogue: 0,0:52:33.34,0:52:38.67,Default,,0000,0000,0000,,There are some attempts to use these bounds to give guidelines as to Dialogue: 0,0:52:38.67,0:52:40.02,Default,,0000,0000,0000,,what model Dialogue: 0,0:52:40.02,0:52:41.56,Default,,0000,0000,0000,,to choose, and so on. Dialogue: 0,0:52:41.56,0:52:46.59,Default,,0000,0000,0000,,But I personally tend to use the bounds again, Dialogue: 0,0:52:46.59,0:52:47.49,Default,,0000,0000,0000,,intuition Dialogue: 0,0:52:47.49,0:52:49.24,Default,,0000,0000,0000,,about Dialogue: 0,0:52:49.24,0:52:53.09,Default,,0000,0000,0000,,for example, what are the number of training examples you need Dialogue: 0,0:52:53.09,0:52:56.09,Default,,0000,0000,0000,,gross linearly in the number of parameters or what are your gross x dimension in number of parameters; whether it goes Dialogue: 0,0:52:56.09,0:52:59.67,Default,,0000,0000,0000,,quadratic parameters? Dialogue: 0,0:52:59.67,0:53:02.40,Default,,0000,0000,0000,,So it's quite often the shape of the bounds. The fact that Dialogue: 0,0:53:02.40,0:53:06.75,Default,,0000,0000,0000,,the number of training examples the fact that some complexity is linear in the VC dimension, Dialogue: 0,0:53:06.75,0:53:09.45,Default,,0000,0000,0000,,that's sort of a useful intuition you can get from Dialogue: 0,0:53:09.45,0:53:11.33,Default,,0000,0000,0000,,these theories. But the Dialogue: 0,0:53:11.33,0:53:14.44,Default,,0000,0000,0000,,actual magnitude of the bound will tend to be much looser than Dialogue: 0,0:53:14.44,0:53:18.01,Default,,0000,0000,0000,,will hold true for a particular problem you are working Dialogue: 0,0:53:18.01,0:53:25.01,Default,,0000,0000,0000,,on. So did that Dialogue: 0,0:53:25.84,0:53:28.41,Default,,0000,0000,0000,,answer your question? Student:Uh-huh. Instructor (Andrew Ng):Yeah. And it turns out, by the Dialogue: 0,0:53:28.41,0:53:31.51,Default,,0000,0000,0000,,way, for myself, a rule of thumb that I often use is if Dialogue: 0,0:53:31.51,0:53:34.97,Default,,0000,0000,0000,,you're trying to fit a logistic regression model, Dialogue: 0,0:53:34.97,0:53:36.18,Default,,0000,0000,0000,,if you have Dialogue: 0,0:53:36.18,0:53:38.75,Default,,0000,0000,0000,,n parameters or n plus one parameters; Dialogue: 0,0:53:38.75,0:53:42.43,Default,,0000,0000,0000,,if the number of training examples is ten times your number of Dialogue: 0,0:53:42.43,0:53:44.71,Default,,0000,0000,0000,,parameters, then you're probably in good shape. Dialogue: 0,0:53:44.71,0:53:49.29,Default,,0000,0000,0000,,And if your number of training examples is like tiny times the number of parameters, then Dialogue: 0,0:53:49.29,0:53:52.30,Default,,0000,0000,0000,,you're probably perfectly fine fitting that model. Dialogue: 0,0:53:52.30,0:53:54.88,Default,,0000,0000,0000,,So those are the sorts of intuitions Dialogue: 0,0:53:54.88,0:54:01.88,Default,,0000,0000,0000,,that you can get from these bounds. 
Student:In cross validation do Dialogue: 0,0:54:02.94,0:54:06.74,Default,,0000,0000,0000,,we assume these examples randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the train Dialogue: 0,0:54:06.74,0:54:13.74,Default,,0000,0000,0000,,testers randomly. Dialogue: 0,0:54:22.99,0:54:26.84,Default,,0000,0000,0000,,One more thing I want to talk about for model selection is there's actually a special case of model selections, Dialogue: 0,0:54:26.84,0:54:33.84,Default,,0000,0000,0000,,called the feature selection problem. Dialogue: 0,0:54:37.81,0:54:41.34,Default,,0000,0000,0000,,And so here's the intuition: Dialogue: 0,0:54:41.34,0:54:45.48,Default,,0000,0000,0000,,for many machine-learning problems you may have a very high dimensional Dialogue: 0,0:54:45.48,0:54:48.12,Default,,0000,0000,0000,,feature space with very high dimensional Dialogue: 0,0:54:48.12,0:54:51.64,Default,,0000,0000,0000,,you have x's [inaudible] feature x's. Dialogue: 0,0:54:51.64,0:54:56.29,Default,,0000,0000,0000,,So for example, for text classification and I wanna talk about this text classification example that Dialogue: 0,0:54:56.29,0:54:58.30,Default,,0000,0000,0000,,spam versus Dialogue: 0,0:54:58.30,0:55:00.43,Default,,0000,0000,0000,,non-spam. You may easily have on Dialogue: 0,0:55:00.43,0:55:04.13,Default,,0000,0000,0000,,the order of 30,000 or 50,000 features. I think I used 50,000 in Dialogue: 0,0:55:04.13,0:55:07.15,Default,,0000,0000,0000,,my early examples. So if you have Dialogue: 0,0:55:07.15,0:55:10.66,Default,,0000,0000,0000,,so many features you have 50,000 features, Dialogue: 0,0:55:10.66,0:55:13.76,Default,,0000,0000,0000,,depending on what learning algorithm you use, there may be a real Dialogue: 0,0:55:13.76,0:55:15.53,Default,,0000,0000,0000,,risk of over fitting. Dialogue: 0,0:55:15.53,0:55:17.22,Default,,0000,0000,0000,,And so Dialogue: 0,0:55:17.22,0:55:20.08,Default,,0000,0000,0000,,if you can reduce the number of features, maybe Dialogue: 0,0:55:20.08,0:55:23.18,Default,,0000,0000,0000,,you can reduce the variance of your learning algorithm, and reduce the risk of Dialogue: 0,0:55:23.18,0:55:25.31,Default,,0000,0000,0000,,over fitting. And for Dialogue: 0,0:55:25.31,0:55:28.26,Default,,0000,0000,0000,,the specific case of text classification, if Dialogue: 0,0:55:28.26,0:55:31.92,Default,,0000,0000,0000,,you imagine that maybe there's a small number of relevant features, so Dialogue: 0,0:55:31.92,0:55:33.68,Default,,0000,0000,0000,,there are all these English words. Dialogue: 0,0:55:33.68,0:55:36.97,Default,,0000,0000,0000,,And many of these English words probably don't tell you anything at all about Dialogue: 0,0:55:36.97,0:55:39.88,Default,,0000,0000,0000,,whether the email is spam or non-spam. If it were, you Dialogue: 0,0:55:39.88,0:55:43.99,Default,,0000,0000,0000,,know, English function words like, the, of, a, and; Dialogue: 0,0:55:43.99,0:55:48.05,Default,,0000,0000,0000,,these are probably words that don't tell you anything about whether the email is spam or non-spam. So Dialogue: 0,0:55:48.05,0:55:50.57,Default,,0000,0000,0000,,words in contrast will be a much smaller number of Dialogue: 0,0:55:50.57,0:55:54.32,Default,,0000,0000,0000,,features that are truly relevant to the learning problem. Dialogue: 0,0:55:54.32,0:55:57.37,Default,,0000,0000,0000,,So for example, you see the word buy or Viagra, those are words that Dialogue: 0,0:55:57.37,0:56:01.67,Default,,0000,0000,0000,,are very useful. So you words, some you spam and non-spam. 
You see the word Dialogue: 0,0:56:01.67,0:56:02.83,Default,,0000,0000,0000,,Stanford or Dialogue: 0,0:56:02.83,0:56:06.27,Default,,0000,0000,0000,,machine learning or your own personal name. These are other words that are useful for Dialogue: 0,0:56:06.27,0:56:10.53,Default,,0000,0000,0000,,telling you whether something is spam or non-spam. So Dialogue: 0,0:56:10.53,0:56:15.66,Default,,0000,0000,0000,,in feature selection, we would like to select a subset of the features Dialogue: 0,0:56:15.66,0:56:19.52,Default,,0000,0000,0000,,that are maybe or hopefully the most relevant ones for a specific learning Dialogue: 0,0:56:19.52,0:56:20.37,Default,,0000,0000,0000,,problem, so Dialogue: 0,0:56:20.37,0:56:25.01,Default,,0000,0000,0000,,as to give ourselves a simpler learning a simpler hypothesis class to choose Dialogue: 0,0:56:25.01,0:56:28.50,Default,,0000,0000,0000,,from. And then therefore, reduce the risk of over fitting. Dialogue: 0,0:56:28.50,0:56:29.70,Default,,0000,0000,0000,,Even when we Dialogue: 0,0:56:29.70,0:56:36.62,Default,,0000,0000,0000,,may have had 50,000 features originally. So Dialogue: 0,0:56:36.62,0:56:41.21,Default,,0000,0000,0000,,how do you do this? Well, if you Dialogue: 0,0:56:41.21,0:56:44.01,Default,,0000,0000,0000,,have n Dialogue: 0,0:56:44.01,0:56:47.33,Default,,0000,0000,0000,,features, then there are Dialogue: 0,0:56:47.33,0:56:49.53,Default,,0000,0000,0000,,two to the n possible subsets; Dialogue: 0,0:56:49.53,0:56:54.43,Default,,0000,0000,0000,, Dialogue: 0,0:56:54.43,0:56:55.90,Default,,0000,0000,0000,,right? Because, you know, Dialogue: 0,0:56:55.90,0:56:59.88,Default,,0000,0000,0000,,each of your n features can either be included or excluded. So there are two to the n Dialogue: 0,0:56:59.88,0:57:00.91,Default,,0000,0000,0000,,possibilities. Dialogue: 0,0:57:00.91,0:57:04.20,Default,,0000,0000,0000,,And this is a huge space. Dialogue: 0,0:57:04.20,0:57:08.86,Default,,0000,0000,0000,,So in feature selection, what we most commonly do is use various search heuristics Dialogue: 0,0:57:08.86,0:57:09.51,Default,,0000,0000,0000,,sort of Dialogue: 0,0:57:09.51,0:57:10.80,Default,,0000,0000,0000,,simple search algorithms Dialogue: 0,0:57:10.80,0:57:15.33,Default,,0000,0000,0000,,to try to search through this space of two to the n possible subsets of features; Dialogue: 0,0:57:15.33,0:57:16.71,Default,,0000,0000,0000,,to try to find a good Dialogue: 0,0:57:16.71,0:57:18.29,Default,,0000,0000,0000,,subset of features. This is Dialogue: 0,0:57:18.29,0:57:22.57,Default,,0000,0000,0000,,too large a number to enumerate all possible feature subsets. Dialogue: 0,0:57:22.57,0:57:25.52,Default,,0000,0000,0000,,And as a concrete example, Dialogue: 0,0:57:25.52,0:57:26.76,Default,,0000,0000,0000,, Dialogue: 0,0:57:26.76,0:57:31.16,Default,,0000,0000,0000,,this is the forward search algorithm; it's also called the forward selection algorithm. Dialogue: 0,0:57:31.16,0:57:32.69,Default,,0000,0000,0000,, Dialogue: 0,0:57:32.69,0:57:34.85,Default,,0000,0000,0000,,It's actually pretty simple, but I'll just Dialogue: 0,0:57:34.85,0:57:36.08,Default,,0000,0000,0000,,write it out.
Dialogue: 0,0:57:36.08,0:57:42.06,Default,,0000,0000,0000,,My writing it out will make it look more complicated than it really Dialogue: 0,0:57:42.06,0:57:44.65,Default,,0000,0000,0000,,is, but Dialogue: 0,0:57:44.65,0:57:46.04,Default,,0000,0000,0000,,it starts with Dialogue: 0,0:57:46.04,0:57:48.15,Default,,0000,0000,0000,,initialize the sets script f Dialogue: 0,0:57:48.15,0:57:51.65,Default,,0000,0000,0000,,to be the empty set, Dialogue: 0,0:57:51.65,0:57:53.24,Default,,0000,0000,0000,,and then Dialogue: 0,0:57:53.24,0:57:55.02,Default,,0000,0000,0000,,repeat Dialogue: 0,0:57:55.02,0:57:56.62,Default,,0000,0000,0000,,for Dialogue: 0,0:57:56.62,0:57:58.86,Default,,0000,0000,0000,,i equals one Dialogue: 0,0:57:58.86,0:58:00.44,Default,,0000,0000,0000,,to n; Dialogue: 0,0:58:04.85,0:58:08.24,Default,,0000,0000,0000,,try adding Dialogue: 0,0:58:08.24,0:58:11.46,Default,,0000,0000,0000,,feature i Dialogue: 0,0:58:11.46,0:58:13.85,Default,,0000,0000,0000,,to the set scripts Dialogue: 0,0:58:13.85,0:58:16.04,Default,,0000,0000,0000,,f, and evaluate the Dialogue: 0,0:58:16.04,0:58:19.74,Default,,0000,0000,0000,,model Dialogue: 0,0:58:19.74,0:58:25.72,Default,,0000,0000,0000,,using cross validation. Dialogue: 0,0:58:25.72,0:58:30.40,Default,,0000,0000,0000,,And by cross validation, I mean any of the three flavors, be it simple hold out cross Dialogue: 0,0:58:30.40,0:58:34.63,Default,,0000,0000,0000,,validation or k-fold cross validation or leave one out cross Dialogue: 0,0:58:34.63,0:58:35.74,Default,,0000,0000,0000,,validation. Dialogue: 0,0:58:35.74,0:58:39.95,Default,,0000,0000,0000,,And then, you know, set f to be Dialogue: 0,0:58:39.95,0:58:41.24,Default,,0000,0000,0000,,equal to Dialogue: 0,0:58:41.24,0:58:43.25,Default,,0000,0000,0000,,f union, I guess. Dialogue: 0,0:58:43.25,0:58:48.58,Default,,0000,0000,0000,,And then the best feature Dialogue: 0,0:58:48.58,0:58:55.58,Default,,0000,0000,0000,,found is f 1, I guess; okay? Dialogue: 0,0:58:57.62,0:59:00.06,Default,,0000,0000,0000,,And finally, you would Dialogue: 0,0:59:00.06,0:59:06.92,Default,,0000,0000,0000,,okay? So Dialogue: 0,0:59:06.92,0:59:13.92,Default,,0000,0000,0000,, Dialogue: 0,0:59:16.70,0:59:20.32,Default,,0000,0000,0000,,forward selections, procedure is: follow through the empty set of features. Dialogue: 0,0:59:20.32,0:59:22.32,Default,,0000,0000,0000,,And then on each Dialogue: 0,0:59:22.32,0:59:25.69,Default,,0000,0000,0000,,generation, take each of Dialogue: 0,0:59:25.69,0:59:28.83,Default,,0000,0000,0000,,your features that isn't already in your set script f and you try adding that Dialogue: 0,0:59:28.83,0:59:30.45,Default,,0000,0000,0000,,feature to your set. Then Dialogue: 0,0:59:30.45,0:59:33.87,Default,,0000,0000,0000,,you train them all, though, and evaluate them all, though, using Dialogue: 0,0:59:33.87,0:59:35.04,Default,,0000,0000,0000,,cross validation. Dialogue: 0,0:59:35.04,0:59:38.72,Default,,0000,0000,0000,,And basically, figure out what is the best single feature to add to your Dialogue: 0,0:59:38.72,0:59:40.72,Default,,0000,0000,0000,,set Dialogue: 0,0:59:40.72,0:59:43.71,Default,,0000,0000,0000,,script f. In step two here, you go ahead and add Dialogue: 0,0:59:43.71,0:59:46.51,Default,,0000,0000,0000,,that feature to your set script f, and you get it right. Dialogue: 0,0:59:46.51,0:59:50.56,Default,,0000,0000,0000,,And when I say best feature or best model here by best, I really mean the Dialogue: 0,0:59:50.56,0:59:53.67,Default,,0000,0000,0000,,best model according to hold out cross validation. 
Dialogue: 0,0:59:53.67,0:59:56.05,Default,,0000,0000,0000,,By best, I really mean Dialogue: 0,0:59:56.05,1:00:00.64,Default,,0000,0000,0000,,the single feature addition that results in the lowest hold out Dialogue: 0,1:00:00.64,1:00:01.51,Default,,0000,0000,0000,,cross validation error or the Dialogue: 0,1:00:01.51,1:00:03.21,Default,,0000,0000,0000,,lowest cross validation error. Dialogue: 0,1:00:03.21,1:00:04.50,Default,,0000,0000,0000,,So you do this Dialogue: 0,1:00:04.50,1:00:07.96,Default,,0000,0000,0000,,adding one feature at a time. Dialogue: 0,1:00:07.96,1:00:12.02,Default,,0000,0000,0000,,When you terminate this a little bit, as if you've Dialogue: 0,1:00:12.02,1:00:15.08,Default,,0000,0000,0000,,added all the features to f, so f is now Dialogue: 0,1:00:15.08,1:00:17.92,Default,,0000,0000,0000,,the entire set of features; you can terminate this. Dialogue: 0,1:00:17.92,1:00:19.31,Default,,0000,0000,0000,,Or if Dialogue: 0,1:00:19.31,1:00:22.04,Default,,0000,0000,0000,,by some rule of thumb, you know that you Dialogue: 0,1:00:22.04,1:00:23.77,Default,,0000,0000,0000,,probably don't ever want more than Dialogue: 0,1:00:23.77,1:00:26.73,Default,,0000,0000,0000,,k features, you can also terminate Dialogue: 0,1:00:26.73,1:00:30.24,Default,,0000,0000,0000,,this if f is already exceeded some threshold number of features. So maybe Dialogue: 0,1:00:30.24,1:00:30.98,Default,,0000,0000,0000,,if Dialogue: 0,1:00:30.98,1:00:34.14,Default,,0000,0000,0000,,you have 100 training examples, and you're fitting logistic regression, Dialogue: 0,1:00:34.14,1:00:39.36,Default,,0000,0000,0000,,you probably know you won't want more than 100 features. And so Dialogue: 0,1:00:39.36,1:00:41.23,Default,,0000,0000,0000,,you stop after you have Dialogue: 0,1:00:41.23,1:00:45.71,Default,,0000,0000,0000,,100 features added to set f; okay? Dialogue: 0,1:00:45.71,1:00:50.03,Default,,0000,0000,0000,,And then finally, having done this, output of best hypothesis found; again, by Dialogue: 0,1:00:50.03,1:00:51.67,Default,,0000,0000,0000,,best, I mean, Dialogue: 0,1:00:51.67,1:00:55.46,Default,,0000,0000,0000,,when learning this algorithm, you'd be seeing lots of hypothesis. You'd be training lots of Dialogue: 0,1:00:55.46,1:00:58.42,Default,,0000,0000,0000,,hypothesis, and testing them using cross validation. Dialogue: 0,1:00:58.42,1:01:01.66,Default,,0000,0000,0000,,So when I say output best hypothesis found, I mean Dialogue: 0,1:01:01.66,1:01:04.65,Default,,0000,0000,0000,,of all of the hypothesis you've seen during this entire procedure, Dialogue: 0,1:01:04.65,1:01:07.15,Default,,0000,0000,0000,,pick the one with the lowest Dialogue: 0,1:01:07.15,1:01:10.31,Default,,0000,0000,0000,,cross validation error that you saw; okay? Dialogue: 0,1:01:10.31,1:01:17.31,Default,,0000,0000,0000,,So that's forward selection. So Dialogue: 0,1:01:35.24,1:01:39.60,Default,,0000,0000,0000,,let's see, just to give this a name, this is an incidence of what's called Dialogue: 0,1:01:39.60,1:01:43.19,Default,,0000,0000,0000,, Dialogue: 0,1:01:43.19,1:01:46.44,Default,,0000,0000,0000,,wrapper feature Dialogue: 0,1:01:46.44,1:01:47.88,Default,,0000,0000,0000,,selection. 
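A compact sketch of the forward search loop described above; cv_error is a hypothetical callback (not something defined in the lecture) that trains the model using only the given feature subset and returns its cross validation error, whether hold-out, k-fold, or leave-one-out:

```python
def forward_select(n_features, cv_error, max_features=None):
    """Forward search sketch over subsets of {0, ..., n_features - 1}."""
    selected = set()                               # F := empty set
    best_subset, best_err = set(), float("inf")
    limit = max_features if max_features is not None else n_features

    while len(selected) < limit:
        remaining = [i for i in range(n_features) if i not in selected]
        if not remaining:
            break
        # try adding each feature not already in F, and keep the single best addition
        errs = {i: cv_error(selected | {i}) for i in remaining}
        best_i = min(errs, key=errs.get)
        selected = selected | {best_i}             # F := F union {best feature}
        if errs[best_i] < best_err:
            best_subset, best_err = set(selected), errs[best_i]

    # output the best feature subset (and hence hypothesis) seen over the whole search
    return best_subset, best_err
```

The max_features argument plays the role of the rule-of-thumb stopping threshold mentioned above, for example never keeping more features than you have training examples.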
Dialogue: 0,1:01:47.88,1:01:50.58,Default,,0000,0000,0000,,And the term wrapper comes Dialogue: 0,1:01:50.58,1:01:52.05,Default,,0000,0000,0000,,from the fact that Dialogue: 0,1:01:52.05,1:01:55.49,Default,,0000,0000,0000,,this feature selection algorithm that I just described is a forward selection or forward Dialogue: 0,1:01:55.49,1:01:56.82,Default,,0000,0000,0000,,search. Dialogue: 0,1:01:56.82,1:01:58.68,Default,,0000,0000,0000,,It's a piece of software that Dialogue: 0,1:01:58.68,1:02:02.15,Default,,0000,0000,0000,,you write that wraps around your learning algorithm. In the sense that Dialogue: 0,1:02:02.15,1:02:04.25,Default,,0000,0000,0000,,to perform forward selection, Dialogue: 0,1:02:04.25,1:02:07.58,Default,,0000,0000,0000,,you need to repeatedly make cause to your learning algorithm Dialogue: 0,1:02:07.58,1:02:08.83,Default,,0000,0000,0000,,to train Dialogue: 0,1:02:08.83,1:02:11.79,Default,,0000,0000,0000,,your model, Dialogue: 0,1:02:11.79,1:02:15.29,Default,,0000,0000,0000,,using different subsets of features; okay? So this is called wrapper model Dialogue: 0,1:02:15.29,1:02:17.32,Default,,0000,0000,0000,,feature selection. Dialogue: 0,1:02:17.32,1:02:20.51,Default,,0000,0000,0000,,And it tends to be somewhat computationally expensive because as you're Dialogue: 0,1:02:20.51,1:02:22.53,Default,,0000,0000,0000,,performing the search process, Dialogue: 0,1:02:22.53,1:02:26.03,Default,,0000,0000,0000,,you're repeatedly training your learning algorithm over and over and over on all of Dialogue: 0,1:02:26.03,1:02:30.33,Default,,0000,0000,0000,,these different subsets of features. Let's Dialogue: 0,1:02:30.33,1:02:37.33,Default,,0000,0000,0000,,just mention also, there is a variation of this called backward search or Dialogue: 0,1:02:38.90,1:02:40.83,Default,,0000,0000,0000,,backward selection, Dialogue: 0,1:02:40.83,1:02:43.71,Default,,0000,0000,0000,,which is where you start with Dialogue: 0,1:02:43.71,1:02:46.99,Default,,0000,0000,0000,,f equals Dialogue: 0,1:02:46.99,1:02:53.99,Default,,0000,0000,0000,,the entire set of features, and you delete features one at a time; Dialogue: 0,1:02:59.47,1:03:01.37,Default,,0000,0000,0000,,okay? Dialogue: 0,1:03:01.37,1:03:03.98,Default,,0000,0000,0000,, Dialogue: 0,1:03:03.98,1:03:06.76,Default,,0000,0000,0000,,So that's backward search or backward selection. Dialogue: 0,1:03:06.76,1:03:08.08,Default,,0000,0000,0000,,And Dialogue: 0,1:03:08.08,1:03:13.37,Default,,0000,0000,0000,,this is another feature selection algorithm that you might use. Part of Dialogue: 0,1:03:13.37,1:03:15.40,Default,,0000,0000,0000,,whether this makes sense is Dialogue: 0,1:03:15.40,1:03:17.41,Default,,0000,0000,0000,,really there will be problems where Dialogue: 0,1:03:17.41,1:03:21.32,Default,,0000,0000,0000,,it really doesn't even make sense to initialize f to be the set of all features. 
Dialogue: 0,1:03:21.32,1:03:23.86,Default,,0000,0000,0000,,So if you have 100 training examples Dialogue: 0,1:03:23.86,1:03:25.57,Default,,0000,0000,0000,,and 10,000 features, Dialogue: 0,1:03:25.57,1:03:28.83,Default,,0000,0000,0000,,which may well happen Dialogue: 0,1:03:28.83,1:03:30.76,Default,,0000,0000,0000,,100 emails and 10,000 training Dialogue: 0,1:03:30.76,1:03:35.29,Default,,0000,0000,0000,,10,000 features in email, then 100 training examples then Dialogue: 0,1:03:35.29,1:03:38.28,Default,,0000,0000,0000,,depending on the learning algorithm you're using, it may or may not make sense to Dialogue: 0,1:03:38.28,1:03:38.99,Default,,0000,0000,0000,,initialize Dialogue: 0,1:03:38.99,1:03:42.79,Default,,0000,0000,0000,,the set f to be all features, and train the model using all features. And if it Dialogue: 0,1:03:42.79,1:03:46.13,Default,,0000,0000,0000,,doesn't make sense to train the model using all the features, then Dialogue: 0,1:03:46.13,1:03:51.53,Default,,0000,0000,0000,,forward selection would be more common. Dialogue: 0,1:03:51.53,1:03:53.61,Default,,0000,0000,0000,,So Dialogue: 0,1:03:53.61,1:03:56.96,Default,,0000,0000,0000,,let's see. Wrapper model feature selection algorithms tend to work Dialogue: 0,1:03:56.96,1:03:58.46,Default,,0000,0000,0000,,well. Dialogue: 0,1:03:58.46,1:04:02.18,Default,,0000,0000,0000,,And in particular, they actually often work better than a different class of algorithms I'm gonna talk Dialogue: 0,1:04:02.18,1:04:05.79,Default,,0000,0000,0000,,about now. But their main disadvantage is that they're computationally very Dialogue: 0,1:04:05.79,1:04:08.71,Default,,0000,0000,0000,,expensive. Dialogue: 0,1:04:08.71,1:04:15.71,Default,,0000,0000,0000,,Do you have any questions about this Dialogue: 0,1:04:16.29,1:04:19.11,Default,,0000,0000,0000,,before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes, Dialogue: 0,1:04:19.11,1:04:23.46,Default,,0000,0000,0000,,you're actually right. So forward search and backward search, both of these are search heuristics, Dialogue: 0,1:04:23.46,1:04:24.36,Default,,0000,0000,0000,,and Dialogue: 0,1:04:24.36,1:04:28.09,Default,,0000,0000,0000,,you cannot but for either of these you cannot guarantee they'll find the best Dialogue: 0,1:04:28.09,1:04:29.12,Default,,0000,0000,0000,,subset of features. It Dialogue: 0,1:04:29.12,1:04:30.73,Default,,0000,0000,0000,,actually turns out that Dialogue: 0,1:04:30.73,1:04:34.100,Default,,0000,0000,0000,,under many formalizations of the feature selection problem it actually turns out to be an NP- Dialogue: 0,1:04:34.100,1:04:40.03,Default,,0000,0000,0000,,hard problem, to find the best subset of features. Dialogue: 0,1:04:40.03,1:04:43.100,Default,,0000,0000,0000,,But in practice, forward selection and backward selection work fine, Dialogue: 0,1:04:43.100,1:04:47.17,Default,,0000,0000,0000,,and you can also envision other search algorithms where you sort of Dialogue: 0,1:04:47.17,1:04:50.49,Default,,0000,0000,0000,,have other methods to search through the space of two to the n possible feature Dialogue: 0,1:04:50.49,1:04:57.49,Default,,0000,0000,0000,,subsets. So Dialogue: 0,1:05:01.08,1:05:03.70,Default,,0000,0000,0000,,let's see. Dialogue: 0,1:05:03.70,1:05:06.34,Default,,0000,0000,0000,,Wrapper feature selection Dialogue: 0,1:05:06.34,1:05:09.83,Default,,0000,0000,0000,,tends to work well when you can afford to do it Dialogue: 0,1:05:09.83,1:05:11.54,Default,,0000,0000,0000,,computationally.
Dialogue: 0,1:05:11.54,1:05:13.80,Default,,0000,0000,0000,,But for problems Dialogue: 0,1:05:13.80,1:05:18.28,Default,,0000,0000,0000,,such as text classification it turns out for text classification specifically Dialogue: 0,1:05:18.28,1:05:21.95,Default,,0000,0000,0000,,because you have so many features, and easily have 50,000 features. Dialogue: 0,1:05:21.95,1:05:25.79,Default,,0000,0000,0000,,Forward selection would be very, very expensive. Dialogue: 0,1:05:25.79,1:05:28.24,Default,,0000,0000,0000,,So there's a different class of algorithms that Dialogue: 0,1:05:28.24,1:05:31.19,Default,,0000,0000,0000,,I'll give you that Dialogue: 0,1:05:31.19,1:05:35.32,Default,,0000,0000,0000,,tends not to do as well in the sense of generalization error. So you tend to learn the Dialogue: 0,1:05:35.32,1:05:37.33,Default,,0000,0000,0000,,hypothesis that works less well, Dialogue: 0,1:05:37.33,1:05:40.29,Default,,0000,0000,0000,,but is computationally much less expensive. Dialogue: 0,1:05:40.29,1:05:44.19,Default,,0000,0000,0000,,And these are called Dialogue: 0,1:05:44.19,1:05:46.67,Default,,0000,0000,0000,,the Dialogue: 0,1:05:46.67,1:05:50.28,Default,,0000,0000,0000,,filter feature selection methods. Dialogue: 0,1:05:50.28,1:05:53.12,Default,,0000,0000,0000,,And the basic idea is Dialogue: 0,1:05:53.12,1:05:53.68,Default,,0000,0000,0000,, Dialogue: 0,1:05:53.68,1:06:00.54,Default,,0000,0000,0000,,that for each feature i, we'll Dialogue: 0,1:06:00.54,1:06:03.94,Default,,0000,0000,0000,,compute Dialogue: 0,1:06:03.94,1:06:10.54,Default,,0000,0000,0000,,some measure Dialogue: 0,1:06:10.54,1:06:17.22,Default,,0000,0000,0000,,of how informative Dialogue: 0,1:06:17.22,1:06:19.91,Default,,0000,0000,0000,,xi Dialogue: 0,1:06:19.91,1:06:25.48,Default,,0000,0000,0000,,is about y; okay? Dialogue: 0,1:06:25.48,1:06:28.78,Default,,0000,0000,0000,,And to do this, we'll use some simple Dialogue: 0,1:06:28.78,1:06:33.49,Default,,0000,0000,0000,,heuristics; for every feature we'll just try to compute some rough estimate or Dialogue: 0,1:06:33.49,1:06:40.49,Default,,0000,0000,0000,,compute Dialogue: 0,1:06:41.28,1:06:48.28,Default,,0000,0000,0000,,some measure of how informative Dialogue: 0,1:06:50.18,1:06:54.49,Default,,0000,0000,0000,,xi is about Dialogue: 0,1:06:54.49,1:06:59.14,Default,,0000,0000,0000,,y. So there are many ways you can do this. One way you can choose is to just compute the Dialogue: 0,1:06:59.14,1:07:01.28,Default,,0000,0000,0000,,correlation between xi and y. And just for each of Dialogue: 0,1:07:01.28,1:07:04.100,Default,,0000,0000,0000,,your features just see how correlated this is with Dialogue: 0,1:07:04.100,1:07:06.72,Default,,0000,0000,0000,,your class label y. Dialogue: 0,1:07:06.72,1:07:12.81,Default,,0000,0000,0000,,And then you just pick the top k most correlated features. Dialogue: 0,1:07:12.81,1:07:19.81,Default,,0000,0000,0000,,Another way Dialogue: 0,1:07:46.43,1:07:51.55,Default,,0000,0000,0000,,to do this Dialogue: 0,1:07:51.55,1:07:54.93,Default,,0000,0000,0000,,for the case of text classification, there's one other method, which especially Dialogue: 0,1:07:54.93,1:07:56.78,Default,,0000,0000,0000,,for this k features I guess Dialogue: 0,1:07:56.78,1:07:58.75,Default,,0000,0000,0000,,there's one other Dialogue: 0,1:07:58.75,1:08:00.71,Default,,0000,0000,0000,,informative measure Dialogue: 0,1:08:00.71,1:08:04.27,Default,,0000,0000,0000,,that's used very commonly, which is called mutual information.
I'm going to let you Dialogue: 0,1:08:04.27,1:08:11.23,Default,,0000,0000,0000,,play with Dialogue: 0,1:08:11.23,1:08:15.40,Default,,0000,0000,0000,,some of these ideas in the problem sets, but I'll Dialogue: 0,1:08:15.40,1:08:17.90,Default,,0000,0000,0000,,just say this very briefly. Dialogue: 0,1:08:17.90,1:08:21.60,Default,,0000,0000,0000,,So the mutual information between feature xi and y, Dialogue: 0,1:08:21.60,1:08:27.67,Default,,0000,0000,0000,,I'll just write out the definition, I guess. Dialogue: 0,1:08:27.67,1:08:32.23,Default,,0000,0000,0000,,Let's say this is text classification, so xi can take on two values, 0 and 1; Dialogue: 0,1:08:32.23,1:08:36.18,Default,,0000,0000,0000,,the mutual information between xi and y is defined as a sum over all possible values of Dialogue: 0,1:08:36.18,1:08:37.36,Default,,0000,0000,0000,,xi, Dialogue: 0,1:08:37.36,1:08:40.93,Default,,0000,0000,0000,,and a sum over all possible values of Dialogue: 0,1:08:40.93,1:08:45.61,Default,,0000,0000,0000,,y, of the joint probability p(xi, y) Dialogue: 0,1:08:45.61,1:08:52.05,Default,,0000,0000,0000,,times log of p(xi, y) Dialogue: 0,1:08:52.05,1:08:53.09,Default,,0000,0000,0000,,over p(xi) times p(y). Dialogue: 0,1:08:53.09,1:08:54.38,Default,,0000,0000,0000,,Where Dialogue: 0,1:08:59.57,1:09:04.04,Default,,0000,0000,0000,,all of these distributions, so the joint distribution over xi and y, Dialogue: 0,1:09:04.04,1:09:11.04,Default,,0000,0000,0000,,you would estimate from your training data; Dialogue: 0,1:09:12.56,1:09:14.22,Default,,0000,0000,0000,,all of these things you would estimate, as well. You would estimate Dialogue: 0,1:09:14.22,1:09:15.26,Default,,0000,0000,0000,,from Dialogue: 0,1:09:15.26,1:09:20.16,Default,,0000,0000,0000,,the training data what is the probability that xi is 0, what's the probability that xi is 1, what's the Dialogue: 0,1:09:20.16,1:09:27.16,Default,,0000,0000,0000,,probability that xi is 0 and y is 0, that xi is 1 and y is 0, and so on. So it Dialogue: 0,1:09:27.85,1:09:31.19,Default,,0000,0000,0000,,turns out Dialogue: 0,1:09:31.19,1:09:35.00,Default,,0000,0000,0000,,there's a standard information theoretic measure of how different Dialogue: 0,1:09:35.00,1:09:36.75,Default,,0000,0000,0000,,probability distributions are. Dialogue: 0,1:09:36.75,1:09:39.99,Default,,0000,0000,0000,,And I'm not gonna prove this here. But it turns out that Dialogue: 0,1:09:39.99,1:09:45.61,Default,,0000,0000,0000,,this mutual information is Dialogue: 0,1:09:45.61,1:09:51.75,Default,,0000,0000,0000,,actually Dialogue: 0,1:09:51.75,1:09:54.31,Default,,0000,0000,0000,,the K-L divergence between p(xi, y) and p(xi) times p(y); so the standard Dialogue: 0,1:09:54.31,1:09:58.10,Default,,0000,0000,0000,,measure of how different two distributions are is called the K-L divergence. Dialogue: 0,1:09:58.10,1:10:00.36,Default,,0000,0000,0000,,If you've taken a class in information theory, Dialogue: 0,1:10:00.36,1:10:03.42,Default,,0000,0000,0000,,you'll have seen the concepts of mutual information and the K-L divergence, but if you Dialogue: 0,1:10:03.42,1:10:06.02,Default,,0000,0000,0000,,haven't, don't worry about it. Just Dialogue: 0,1:10:06.02,1:10:09.24,Default,,0000,0000,0000,,the intuition is there's something called the K-L divergence that's a formal measure Dialogue: 0,1:10:09.24,1:10:12.95,Default,,0000,0000,0000,,of how different two probability distributions Dialogue: 0,1:10:12.95,1:10:16.61,Default,,0000,0000,0000,,are.
And mutual information is a measure of how different Dialogue: 0,1:10:16.61,1:10:19.83,Default,,0000,0000,0000,,the joint distribution is Dialogue: 0,1:10:19.83,1:10:21.57,Default,,0000,0000,0000,,of x and y Dialogue: 0,1:10:21.57,1:10:23.40,Default,,0000,0000,0000,,from the distribution Dialogue: 0,1:10:23.40,1:10:26.16,Default,,0000,0000,0000,,you would get if you were to assume they were independent; Dialogue: 0,1:10:26.16,1:10:27.21,Default,,0000,0000,0000,,okay? So Dialogue: 0,1:10:27.21,1:10:29.45,Default,,0000,0000,0000,,if x and y were independent, then Dialogue: 0,1:10:29.45,1:10:33.98,Default,,0000,0000,0000,,p of x, y would be equal to p of x times p of y. Dialogue: 0,1:10:33.98,1:10:37.34,Default,,0000,0000,0000,,And so, you know, these two distributions would be identical, and the Dialogue: 0,1:10:37.34,1:10:40.37,Default,,0000,0000,0000,,K-L divergence would be 0. Dialogue: 0,1:10:40.37,1:10:43.58,Default,,0000,0000,0000,,In contrast, if x and y were very non-independent, in Dialogue: 0,1:10:43.58,1:10:46.90,Default,,0000,0000,0000,,other words, if x and y are very informative about each other, Dialogue: 0,1:10:46.90,1:10:48.45,Default,,0000,0000,0000,,then this K-L divergence Dialogue: 0,1:10:48.45,1:10:49.72,Default,,0000,0000,0000,,will be large. Dialogue: 0,1:10:49.72,1:10:52.38,Default,,0000,0000,0000,,And so mutual information is a formal measure of Dialogue: 0,1:10:52.38,1:10:55.46,Default,,0000,0000,0000,,how non-independent x and y are. Dialogue: 0,1:10:55.46,1:10:58.30,Default,,0000,0000,0000,,And if x and y are highly non-independent, Dialogue: 0,1:10:58.30,1:10:59.26,Default,,0000,0000,0000,,then that means that Dialogue: 0,1:10:59.26,1:11:02.21,Default,,0000,0000,0000,,x will presumably tell you something about y, Dialogue: 0,1:11:02.21,1:11:05.99,Default,,0000,0000,0000,,and so they'll have large mutual information. And Dialogue: 0,1:11:05.99,1:11:09.52,Default,,0000,0000,0000,,this measure of mutual information will tell you that xi might be a good feature. And you Dialogue: 0,1:11:09.52,1:11:11.99,Default,,0000,0000,0000,,get to play with some of these ideas Dialogue: 0,1:11:11.99,1:11:15.31,Default,,0000,0000,0000,,more in the problem sets. So I won't say much more about it. Dialogue: 0,1:11:16.58,1:11:19.42,Default,,0000,0000,0000,,And what you do then is, having chosen Dialogue: 0,1:11:19.42,1:11:23.78,Default,,0000,0000,0000,,some measure like correlation or mutual information or something else, you Dialogue: 0,1:11:23.78,1:11:28.89,Default,,0000,0000,0000,,then pick the top Dialogue: 0,1:11:28.89,1:11:33.66,Default,,0000,0000,0000,,k features; meaning that you compute the correlation between xi and y for all the Dialogue: 0,1:11:33.66,1:11:36.98,Default,,0000,0000,0000,,features, or the mutual information between xi and y for all the features. Dialogue: 0,1:11:36.98,1:11:40.08,Default,,0000,0000,0000,,And then you include in your learning algorithm the Dialogue: 0,1:11:40.08,1:11:43.37,Default,,0000,0000,0000,,k features with the largest correlation with the label, or Dialogue: 0,1:11:43.37,1:11:46.49,Default,,0000,0000,0000,,the largest mutual information with the label, whatever. Dialogue: 0,1:11:46.49,1:11:51.56,Default,,0000,0000,0000,,And to choose Dialogue: 0,1:11:51.56,1:11:54.81,Default,,0000,0000,0000,,k, Dialogue: 0,1:11:54.81,1:11:57.80,Default,,0000,0000,0000,,you can actually use cross validation, as well; okay?
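Here is a minimal Python sketch of the mutual information computation for a binary feature and a binary label, with every probability estimated from training-data counts as described above; the function name and the convention that 0 log 0 is treated as 0 are illustrative. The lecture's recipe for choosing how many features to keep continues right after this sketch.

import numpy as np

def mutual_information(xi, y):
    # MI(x_i, y) = sum over a in {0,1} and b in {0,1} of
    #   p(x_i=a, y=b) * log( p(x_i=a, y=b) / (p(x_i=a) * p(y=b)) ),
    # which is the K-L divergence between the joint p(x_i, y) and the product
    # p(x_i) * p(y); it is zero exactly when x_i and y are independent.
    xi, y = np.asarray(xi), np.asarray(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((xi == a) & (y == b))          # joint, from training counts
            p_x, p_y = np.mean(xi == a), np.mean(y == b)  # marginals, from training counts
            if p_xy > 0:                                  # treat 0 * log 0 as 0
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi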
Dialogue: 0,1:11:57.80,1:11:59.36,Default,,0000,0000,0000,,So you would Dialogue: 0,1:11:59.36,1:12:03.06,Default,,0000,0000,0000,,take all your features, and sort them in decreasing order of mutual Dialogue: 0,1:12:03.06,1:12:04.11,Default,,0000,0000,0000,,information. Dialogue: 0,1:12:04.11,1:12:07.53,Default,,0000,0000,0000,,And then you'd try using just the top one feature, the top two features, the top Dialogue: 0,1:12:07.53,1:12:09.56,Default,,0000,0000,0000,,three features, and so on. Dialogue: 0,1:12:09.56,1:12:13.87,Default,,0000,0000,0000,,And you decide how many features to include Dialogue: 0,1:12:13.87,1:12:17.28,Default,,0000,0000,0000,,using cross validation; okay? Or you Dialogue: 0,1:12:17.28,1:12:24.28,Default,,0000,0000,0000,,can sometimes just choose this by hand, as well. Okay. Questions about Dialogue: 0,1:12:24.28,1:12:31.28,Default,,0000,0000,0000,,this? Okay. Dialogue: 0,1:12:34.56,1:12:36.48,Default,,0000,0000,0000,,Cool. Great. So Dialogue: 0,1:12:36.48,1:12:40.63,Default,,0000,0000,0000,,next lecture I'll continue; I'll wrap up with Bayesian model selection, but let's close Dialogue: 0,1:12:40.63,1:12:40.88,Default,,0000,0000,0000,,here; we're at the end.
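Finally, a minimal Python sketch of that last recipe: sort the features by mutual information with the label, then pick how many to keep with cross validation. It assumes scikit-learn is available and uses its mutual_info_classif as a stand-in estimate of the mutual information; the candidate values of k and the logistic regression model are illustrative choices, not from the lecture.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def choose_k_by_cv(X, y, candidate_ks=(1, 2, 5, 10, 20, 50), cv=5):
    # Sort features in decreasing order of estimated mutual information with y,
    # then use cross validation to decide how many of the top features to keep.
    order = np.argsort(mutual_info_classif(X, y))[::-1]
    best_k, best_score = None, -np.inf
    for k in candidate_ks:
        k = min(k, X.shape[1])
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, order[:k]], y, cv=cv).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k, order[:best_k]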