This presentation is delivered by the Stanford Center for Professional Development.

Okay. So welcome back. What I want to do today is start a new chapter. In particular, I want to talk about learning theory. So in the previous, I guess, eight lectures so far, you've learned about a lot of learning algorithms, and I hope you now understand a little about some of the best and most powerful tools of machine learning. And all of you are now sort of well qualified to go into industry and apply those powerful learning algorithms, really the most powerful learning algorithms we know, to all sorts of problems, and in fact, I hope you start to do that on your projects right away as well.

You might remember, I think it was in the very first lecture, that I made an analogy to learning to be a carpenter: if you imagine you're going to carpentry school to learn to be a carpenter, then only a small part of what you need to do is to acquire a set of tools. If you learn to be a carpenter, you don't just walk in, pick up a toolbox, and be done; knowing, when you need to cut a piece of wood, whether to use a rip saw, or a jigsaw, or a keyhole saw, really mastering the tools, is also an essential part of becoming a good carpenter. And what I want to do in the next few lectures is actually give you a sense of that mastery of the machine learning tools all of you have. Okay?
And so in particular, in the next few lectures what I want to do is talk more deeply about the properties of different machine learning algorithms, so that you can get a sense of when it's most appropriate to use each one. And it turns out that one of the most common scenarios in machine learning is that someday you'll be doing research, or working at a company, and you'll apply one of the learning algorithms you learned about. You may apply logistic regression, or support vector machines, or Naive Bayes or something, and for whatever bizarre reason, it won't work as well as you were hoping, or it won't quite do what you were hoping it to.

To me, what really separates the people that really understand and really get machine learning, compared to the people that maybe read the textbook and work through the math, will be what you do next. It will be in your decisions: when you apply a support vector machine and it doesn't quite do what you wanted, do you really understand enough about support vector machines to know what to do next and how to modify the algorithm? And to me that's often what really separates the great people in machine learning from the people that read the textbook, work through the math, and have just understood that. Okay?

So, what I want to do today: today's lecture will mainly be on learning theory, and we'll start to talk about some of the theoretical results of machine learning. The next lecture, later this week, will be on algorithms for, sort of, correcting or fixing some of the problems that the learning theory will point out to us and help us understand.
And then two lectures from now, that lecture will be almost entirely focused on practical advice for how to apply learning algorithms. Okay? So do you have any questions about this before I start? Okay.

So the very first thing we're going to talk about is something that you've probably already seen on the first homework, and something that I alluded to previously, which is the bias-variance trade-off. So take ordinary least squares, the first learning algorithm we learned about. If you fit a straight line through this data, this is not a very good model. Right. And if this happens, we say it has underfit the data, or we say that this is a learning algorithm with very high bias, because it is failing to fit the evident quadratic structure in the data. And for the purposes of this class, you can informally think of the bias of the learning algorithm as representing the fact that even if you had an infinite amount of training data, even if you had tons of training data, this algorithm would still fail to fit the quadratic function, the quadratic structure in the data. And so we think of this as a learning algorithm with high bias.

Then there's the opposite problem; so that's the same dataset. If you fit a fourth-order polynomial to this dataset, then you'll be able to interpolate the five data points exactly, but clearly this is also not a great model for the structure that you and I probably see in the data.
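To make that concrete, here is a minimal sketch, not from the lecture: the five data points and the noise level are made up, and numpy's polynomial fitting stands in for least squares over polynomial features.

```python
import numpy as np

# Five made-up points drawn from a noisy quadratic, y = x^2 + noise.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x**2 + rng.normal(scale=0.5, size=x.shape)

for degree in (1, 2, 4):
    # Least-squares fit of a polynomial of the given degree.
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
```

The degree-1 fit underfits (high bias), the degree-4 fit drives the training error to essentially zero by interpolating all five points (high variance), and the degree-2 fit is the happy medium described next.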
And we say that this algorithm has a problem, excuse me, is overfitting the data, or alternatively that this algorithm has high variance. Okay? And the intuition behind overfitting and high variance is that the algorithm is fitting spurious patterns in the data, or is fitting idiosyncratic properties of this specific dataset, be it the dataset of housing prices or whatever. And quite often, there'll be some happy medium, like fitting a quadratic function, that maybe won't interpolate your data points perfectly, but captures more structure in your data than a simple model which underfits.

I'd say that you can have exactly the same picture for classification problems as well. So let's say this is my training set, right, of positive and negative examples, and so you can fit logistic regression with very high-order polynomial features, so h of x equals the sigmoid function of, whatever, the sigmoid function applied to a tenth-order polynomial. And if you do that, maybe you get a decision boundary like this. Right. That does indeed perfectly separate the positive and negative classes, but this is another example of overfitting. And in contrast, if you fit logistic regression to this data with just the linear features, with none of the quadratic features, then maybe you get a decision boundary like that, which can also underfit. Okay.
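The classification version can be sketched the same way; this assumes scikit-learn is available, and the two Gaussian blobs are made up, so treat it as an illustration rather than the lecture's example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up 2-D training set: two slightly overlapping Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

for degree in (1, 10):
    Phi = PolynomialFeatures(degree=degree).fit_transform(X)
    # Large C means essentially unregularized logistic regression.
    clf = LogisticRegression(C=1e6, max_iter=10_000).fit(Phi, y)
    print(f"degree {degree}: training accuracy = {clf.score(Phi, y):.2f}")
```

The degree-10 boundary can contort itself to classify nearly every training point correctly, while the linear boundary cannot; on fresh draws from the same blobs, the contorted boundary is the one that tends to do worse.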
So what I want to do now is understand this problem of overfitting versus underfitting, of high bias versus high variance, more explicitly. I will do that by posing a more formal model of machine learning, and then trying to prove when each of these two twin problems comes up.

And as the motivating example for our initial foray into learning theory, I want to talk about learning a classifier in which h of x is equal to g of theta transpose x. Okay? So, a linear classifier. And for this class I'm going to use, z, excuse me, I'm going to use g of z equals the indicator of z being greater than or equal to zero.

With apologies in advance for changing the notation yet again: for the support vector machine lectures we used y equals minus one or plus one. For the learning theory lectures, it turns out it'll be a bit cleaner if I switch back to y equals zero-one again, so I'm going to switch back to my original notation. And so you can think of this model as, say, similar to logistic regression, except that now we're going to force the algorithm to output labels that are either zero or one. Okay? So you can think of this as a classifier that outputs the labels zero or one directly, rather than outputting probabilities.

And so, as usual, let's say we're given a training set of m examples; that's just my notation for writing a set of m examples ranging from i equals one through m.
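Written out directly, this hypothesis is just a thresholded linear function; a minimal sketch, where the parameter values are arbitrary:

```python
import numpy as np

def h(theta, x):
    """h_theta(x) = g(theta^T x), where g(z) = 1 if z >= 0, else 0."""
    return 1 if theta @ x >= 0 else 0

theta = np.array([-1.5, 1.0])           # arbitrary example parameters
print(h(theta, np.array([1.0, 2.0])))   # -1.5 + 2.0 >= 0, so prints 1
print(h(theta, np.array([1.0, 1.0])))   # -1.5 + 1.0 <  0, so prints 0
```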
And I'm going to assume that the training examples (x^(i), y^(i)) are drawn i.i.d. from some distribution, script D. Okay? Independently and identically distributed. And if you're running a classification problem on houses, like features of the house, comma, whether the house will be sold in the next six months, then this is just the probability distribution over features of houses and whether or not they'll be sold. Okay? So I'm going to assume that the training examples were drawn i.i.d. from some probability distribution, script D. Well, same thing for spam: if you're trying to build a spam classifier, then this would be the distribution over what emails look like, comma, whether they are spam or not.

And in particular, to understand, or to simplify, to understand the phenomena of bias and variance, I'm actually going to use a simplified model of machine learning. In particular, logistic regression fits the parameters of a model like this by maximizing the log likelihood. But in order to understand learning algorithms more deeply, I'm just going to assume a simplified model of machine learning; let me just write that down.

So I'm going to define training error as, so this is the training error of a hypothesis h subscript theta, and I'll write this as epsilon hat of h subscript theta. If I want to make the dependence on a training set explicit, I'll write this with a subscript S there, where S is the training set. And I'll define this as: epsilon hat of h theta equals one over m, times the sum from i equals one through m, of the indicator that h theta of x^(i) is not equal to y^(i). Let's see. Okay. I hope the notation is clear: this is a sum of indicator functions for whether your hypothesis misclassifies the i-th example.
And so, when you divide by m, this is just, in your training set, what's the fraction of training examples your hypothesis misclassifies; that's defined as the training error. And training error is also called risk.

The simplified model of machine learning I'm going to talk about is called empirical risk minimization. In particular, I'm going to assume that the way my learning algorithm works is that it will choose parameters theta that minimize my training error. Okay? And it will be this learning algorithm that we'll prove properties about.

And it turns out that you can think of this as the most basic learning algorithm, the algorithm that minimizes your training error. It turns out that logistic regression and support vector machines can be formally viewed as approximations to this. If you actually want to do this, this is a nonconvex optimization problem; it's actually NP-hard to solve this optimization problem. And logistic regression and support vector machines can both be viewed as approximations to this nonconvex optimization problem, by finding a convex approximation to it. So think of this as similar to what algorithms like logistic regression are doing.

So let me take that definition of empirical risk minimization and actually just rewrite it in a different, equivalent way.
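Before the rewrite, here is a sketch of the two definitions so far, with random search standing in for the exact minimization (which, as just noted, is intractable in general); the search procedure is mine, purely for illustration:

```python
import numpy as np

def training_error(theta, X, Y):
    """epsilon_hat(h_theta) = (1/m) * sum_i 1{h_theta(x_i) != y_i}."""
    predictions = (X @ theta >= 0).astype(int)
    return float(np.mean(predictions != Y))

def erm_random_search(X, Y, n_candidates=10_000, seed=0):
    """Crude stand-in for empirical risk minimization: since exactly
    minimizing the 0-1 training error is a nonconvex problem, just try
    many random parameter vectors and keep the best one found."""
    rng = np.random.default_rng(seed)
    candidates = rng.normal(size=(n_candidates, X.shape[1]))
    errors = [training_error(theta, X, Y) for theta in candidates]
    return candidates[int(np.argmin(errors))]
```

Logistic regression and the SVM sidestep the intractability by replacing the 0-1 indicator inside the sum with a convex surrogate (the log loss and the hinge loss, respectively), which is the sense in which they approximate this problem.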
For the results I want to prove today, it turns out that it will be useful to think of our learning algorithm not as choosing a set of parameters, but as choosing a function. So let me say what I mean by that. Let me define the hypothesis class, script H, as the class of all hypotheses, in other words, as the class of all linear classifiers, that your learning algorithm is choosing from. Okay?

So h subscript theta is a specific linear classifier, and each of these is a function mapping from the input domain X to the class labels zero-one. Each of these is a function, and as you vary the parameters theta, you get different functions. And so let me define the hypothesis class script H to be the class of all functions that, say, logistic regression can choose from. Okay? So this is the class of all linear classifiers.

And so I'm going to define, or maybe redefine, empirical risk minimization as, instead of writing this as choosing a set of parameters, I want to think of it as choosing a function in the hypothesis class script H that minimizes my training error. Okay?

So actually, can you raise your hand if it makes sense to you why this is equivalent to the previous formulation? Okay, cool. Thanks.
So for the development that follows, it'll be useful to think of the algorithm as choosing a function from the class instead, because in the more general case this set, script H, can be some other class of functions. Maybe it's the class of all functions represented by a neural network, or some other class of functions the learning algorithm wants to choose from. And this definition of empirical risk minimization will still apply. Okay?

So what we'd like to do is understand whether empirical risk minimization is a reasonable algorithm. Alex?

Student: [Inaudible] a function that's defined by g of theta transpose x, or is it now more general?

Instructor (Andrew Ng): I see, right, so let's see, I guess the question is: is h theta still defined by g of theta transpose x, or is this more general? So...

Student: [Inaudible]

Instructor (Andrew Ng): Oh, yeah, so there are two answers to that. One is, this framework is general. So for the purpose of this lecture, it may be useful to you to keep in mind the model of the example where h subscript theta is in the class of all linear classifiers, such as those used by, like, the perceptron algorithm or logistic regression. Everything on this board, however, is actually more general. H can be any set of functions mapping from the input domain to the set of class labels zero and one, and then you can perform empirical risk minimization over any hypothesis class.
For the purpose of today's lecture, I am going to restrict myself to talking about binary classification, but it turns out everything I say generalizes to regression and other problems as well. Does that answer your question? Yes. Cool. All right.

So I want to understand if empirical risk minimization is a reasonable algorithm. In particular, what are the things we can prove about it? So clearly we don't actually care about training error; we don't really care about making accurate predictions on the training set, or at least that's not the ultimate goal. The ultimate goal is how well it generalizes, how well it makes predictions on examples that we haven't seen before. How well it predicts prices, or sale or no-sale outcomes, of houses you haven't seen before.

So what we really care about is generalization error, which I write as epsilon of h. And this is defined as the probability that, if I sample a new example (x, y) from that distribution script D, my hypothesis mislabels that example.

And in terms of notational convention: usually, if I place a hat on top of something, it usually means, not always, but it usually means, that it is an attempt to estimate the quantity without the hat. So, for example, epsilon hat here: think of epsilon hat, the training error, as an attempt to approximate the generalization error.
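Since epsilon of h is a probability under the distribution D, you can approximate it by sampling fresh examples; a small sketch, where the distribution D (a noisy threshold on a uniform x) is entirely made up:

```python
import numpy as np

def generalization_error(h, sample_from_D, n=100_000, seed=0):
    """Monte Carlo estimate of epsilon(h) = P_{(x,y)~D}(h(x) != y)."""
    rng = np.random.default_rng(seed)
    mistakes = 0
    for _ in range(n):
        x, y = sample_from_D(rng)
        mistakes += (h(x) != y)
    return mistakes / n

def sample_from_D(rng):
    """Made-up D: x ~ Uniform[-1, 1], label 1{x >= 0} flipped 10% of the time."""
    x = rng.uniform(-1.0, 1.0)
    y = int(x >= 0)
    if rng.random() < 0.1:
        y = 1 - y
    return x, y

# Even the best possible hypothesis here still suffers the 10% label noise.
print(generalization_error(lambda x: int(x >= 0), sample_from_D))  # ~0.10
```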
Okay, so the notational convention is: usually the things with the hats on top are things we're using to estimate other quantities. And h hat is the hypothesis output by the learning algorithm, our attempt to estimate a good function from X to Y.

So let's actually prove some things about when empirical risk minimization will do well, in the sense of giving us low generalization error, which is what we really care about.

In order to prove our first learning theory result, I'm going to have to state two lemmas. The first is the union bound, which is the following. Let A1 through AK be K events. And when I say events, I mean events in the sense of a probabilistic event that either happens or not. And these are not necessarily independent; so there's some joint distribution over the events A1 through AK, and maybe they're independent, maybe not, no assumption on that. Then the probability of A1 or A2 or, dot dot dot, up to AK, where this union symbol is just set notation for probability and just means "or", so the probability of at least one of these events occurring, of A1 or A2 or up to AK, is less than or equal to the probability of A1, plus the probability of A2, plus dot dot dot, plus the probability of AK. Okay?
So the intuition behind this is just that, well, I'm not sure if you've seen Venn diagram depictions of probability before; if you haven't, what I'm about to do may be a little cryptic, so just ignore what I'm about to do if you haven't seen it before. But if you have seen it before, then this is really great: the probability of A1, union A2, union A3, is less than or equal to P of A1, plus P of A2, plus P of A3. Right. So the total mass in the union of these three things is at most the sum of the masses in the three individual sets; it's not very surprising. It turns out that, depending on how you define your axioms of probability, this is actually one of the axioms of probability, so I won't actually try to prove this; it's usually written as an axiom. Sigma sub-additivity of probability measures is what this is sometimes called as well. But in learning theory it's commonly called the union bound, so I'll just call it that.

The other lemma I need is called the Hoeffding inequality. And again, I won't actually prove this, I'll just state it. Let Z1 up to Zm be m i.i.d. Bernoulli random variables with mean phi; so the probability that Zi equals one is equal to phi. So let's say you observe m i.i.d. Bernoulli random variables and you want to estimate their mean.
So let me define phi hat, and this is again that notational convention: phi hat means it's an estimate of something else. So we define phi hat to be one over m, times the sum from i equals one through m of Zi. Okay? So this is our attempt to estimate the mean of these Bernoulli random variables by, sort of, taking their average.

And let any gamma be fixed. Then the Hoeffding inequality is that the probability that your estimate of phi is more than gamma away from the true value of phi is bounded by two e to the negative two gamma squared m. Okay?

So, just in pictures: this theorem holds, this lemma, the Hoeffding inequality; this is just a statement of fact, this just holds true. But let me now draw a cartoon to describe some of the intuition behind this, I guess. So let's say this is the real number line from zero to one, and phi is the mean of your Bernoulli random variables. You will remember from, you know, whatever undergraduate probability or statistics class, the central limit theorem, which says that when you average a lot of things together, you tend to get a Gaussian distribution. And so when you toss m coins with bias phi, we observe these m Bernoulli random variables, and we average them, then the probability distribution of phi hat will roughly be a Gaussian, let's say. Okay? It turns out, if you haven't seen this before, that the cumulative distribution function of phi hat will converge to that of the Gaussian.
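The inequality is easy to check numerically; a small simulation sketch, where phi, m, and gamma are arbitrary choices of mine:

```python
import numpy as np

phi, m, gamma, trials = 0.3, 100, 0.1, 100_000
rng = np.random.default_rng(0)

# Each row is one experiment: m i.i.d. Bernoulli(phi) coin tosses.
Z = rng.random((trials, m)) < phi
phi_hat = Z.mean(axis=1)  # the averaged estimate, one per experiment

tail_prob = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"P(|phi_hat - phi| > gamma) ~= {tail_prob:.4f} <= bound {bound:.4f}")
```

With these numbers the empirical tail probability comes out around 0.03, comfortably below the Hoeffding bound of roughly 0.27; the bound is loose, but, crucially, it holds for every finite m.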
Technically, phi hat can only take on a discrete set of values, because these are fractions, one over m, so it doesn't really have a density; but just as a cartoon, think of it as converging roughly to a Gaussian.

So what the Hoeffding inequality says is that if you pick a value of gamma, let me put one interval gamma there and another interval gamma there, then it's saying that the probability mass in the tails, in other words the probability that my value of phi hat is more than gamma away from the true value, that the total probability mass in these tails, is at most two e to the negative two gamma squared m. Okay? That's what the Hoeffding inequality says, so if you can't read that, this is just the right-hand side of the bound, two e to the negative two gamma squared m. So this bounds the probability that you make a mistake in estimating the mean of a Bernoulli random variable.

And the cool thing about this bound, the interesting thing behind this bound, is that it decreases exponentially in m. So it says that for a fixed value of gamma, as you increase the size of your training set, as you toss a coin more and more, then the width of this Gaussian will shrink. The width of this Gaussian will actually shrink like one over root m, and that will cause the probability mass left in the tails to decrease exponentially, quickly, as a function of m. And this will be important later. Yeah?

Student: Does this come from the central limit theorem? [Inaudible]

Instructor (Andrew Ng): No, it doesn't.
This is proved by a different argument. So, the central limit theorem: there may be a version of the central limit theorem along these lines, but the versions I'm familiar with tend to be asymptotic, whereas this works for any finite value of m. Oh, and this bound holds even if m is equal to two, or if m is very small; the central limit theorem approximation is not going to hold then, but this theorem holds regardless. Okay? I'm drawing this just as a cartoon to help explain the intuition, but this theorem just holds true, without reference to the central limit theorem.

All right. So let's start to understand empirical risk minimization, and what I want to do is begin with studying empirical risk minimization for a simpler case than logistic regression, and in particular I want to start with studying the case of finite hypothesis classes. So let's say script H is a class of K hypotheses. Right. So this is K functions, where each of these is just a function mapping from inputs to outputs; there are no parameters in this. And so what empirical risk minimization would do is it would take the training set, and it'll then look at each of these K functions, and it'll pick whichever of these functions has the lowest training error. Okay?
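For a finite class, this algorithm is literally an argmin over K functions; a minimal sketch, where the three parameter-free hypotheses and the tiny training set are made up:

```python
import numpy as np

# A made-up finite hypothesis class, K = 3 fixed functions (no parameters).
H = [
    lambda x: int(x[0] >= 0),          # threshold on the first feature
    lambda x: int(x[1] >= 0),          # threshold on the second feature
    lambda x: int(x[0] + x[1] >= 0),   # threshold on their sum
]

def training_error(h, S):
    """Fraction of the training set S = [(x, y), ...] that h misclassifies."""
    return np.mean([h(x) != y for x, y in S])

def erm(H, S):
    """Empirical risk minimization over a finite class: return the
    hypothesis h_hat in H with the lowest training error."""
    return min(H, key=lambda h: training_error(h, S))

# Tiny made-up training set, labeled by the sign of the second feature.
S = [(np.array([1.0, -2.0]), 0), (np.array([-1.0, 3.0]), 1),
     (np.array([2.0, 1.0]), 1), (np.array([-2.0, -1.0]), 0)]
h_hat = erm(H, S)
print(H.index(h_hat), training_error(h_hat, S))  # picks hypothesis 1, error 0.0
```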
Now, logistic regression uses an infinitely large, a continuous, infinitely large class of hypotheses, script H, but to prove the first result, I actually want to just describe our first learning theorem for the case of when you have a finite hypothesis class, and then we'll later generalize to infinite hypothesis classes.

So empirical risk minimization takes the hypothesis with the lowest training error, and what I'd like to do is prove a bound on the generalization error of h hat. All right? So in other words, I'm going to prove that minimizing training error somehow allows me to do well on generalization error. And here's the strategy. The first step in this proof: I'm going to show that training error is a good approximation to generalization error, and then I'm going to show that this implies a bound on the generalization error of the hypothesis output by empirical risk minimization.

And I just realized, this class, I guess, is also maybe a slightly notation-heavy class, since I'm introducing a reasonably large set of new symbols. So if, in the course of today's lecture, you're looking at some symbol and you don't quite remember what it is, please raise your hand and ask: "What's that? What was that? Was that the generalization error, or was it something else?" So raise your hand and ask if you don't understand the notation I'm defining.

Okay. So let me introduce this in two steps.
So the strategy is: I'm going to show that training error is a good approximation to generalization error, and this will imply that minimizing training error will also do pretty well in terms of minimizing generalization error. And this will give us a bound on the generalization error of the hypothesis output by empirical risk minimization. Okay?

So here's the idea. Let's not even consider all the hypotheses at once; let's pick any hypothesis, h_j, in the class script H, and so, until further notice, let's just consider that one fixed hypothesis. So pick any one hypothesis and let's talk about that one.

Let me define Z_i to be the indicator function for whether this hypothesis misclassifies the i-th example, or Z subscript i. Okay? So Z_i would be zero or one, depending on whether this one hypothesis, which is the only one I'm going to consider for now, misclassifies that example. And so my training set is drawn randomly from some distribution script D, and depending on what training examples I've got, these Z_i's would be either zero or one. So let's figure out what the probability distribution of Z_i is. Well, Z_i takes on the value of either zero or one, so clearly it's a Bernoulli random variable; it can only take on these values. Well, what's the probability that Z_i is equal to one?
In other words, what's the probability that, for a fixed hypothesis h_j, when I sample my training set i.i.d. from the distribution D, my hypothesis will misclassify it? Well, by definition, that's just the generalization error of my hypothesis h_j. So Z_i is a Bernoulli random variable with mean given by the generalization error of this hypothesis. Raise your hand if that made sense. Oh, cool. Great.

And moreover, all the Z_i's have the same probability of being one, and all the training examples I've drawn are i.i.d., and so the Z_i's are also independent, and therefore the Z_i's themselves are i.i.d. random variables. Okay? Because my training examples were drawn independently of each other, by assumption.

If you look at the definition of training error, the training error of my hypothesis h_j is just that: it's just the average of my Z_i's, which is, well, how I previously defined it. Okay? And so epsilon hat of h_j is exactly the average of m i.i.d. Bernoulli random variables, drawn from a Bernoulli distribution with mean given by the generalization error; so this is, well, this is the average of m i.i.d. Bernoulli random variables, each of which has mean given by the generalization error of h_j.
Dialogue: 0,0:35:58.69,0:36:03.77,Default,,0000,0000,0000,,And therefore, by Dialogue: 0,0:36:03.77,0:36:07.03,Default,,0000,0000,0000,,the Hoeffding inequality Dialogue: 0,0:36:07.03,0:36:11.07,Default,,0000,0000,0000,,we have that the Dialogue: 0,0:36:11.07,0:36:15.70,Default,,0000,0000,0000,,probability Dialogue: 0,0:36:15.70,0:36:20.03,Default,,0000,0000,0000,,that the difference between training and generalization error, the probability that this is greater than gamma, is Dialogue: 0,0:36:20.03,0:36:21.54,Default,,0000,0000,0000,,less than or equal to two E to Dialogue: 0,0:36:21.54,0:36:24.72,Default,,0000,0000,0000,,the negative two Dialogue: 0,0:36:24.72,0:36:26.95,Default,,0000,0000,0000,,gamma squared M. Okay? Dialogue: 0,0:36:26.95,0:36:32.42,Default,,0000,0000,0000,,Exactly by the Hoeffding inequality. Dialogue: 0,0:36:32.42,0:36:34.58,Default,,0000,0000,0000,,And what this proves is that, Dialogue: 0,0:36:34.58,0:36:37.29,Default,,0000,0000,0000,,for my fixed hypothesis HJ, Dialogue: 0,0:36:37.29,0:36:40.53,Default,,0000,0000,0000,,my training error, epsilon hat, will be close to generalization error Dialogue: 0,0:36:40.53,0:36:43.90,Default,,0000,0000,0000,,with high probability, assuming M is large. If Dialogue: 0,0:36:43.90,0:36:46.89,Default,,0000,0000,0000,,M is large then this thing on the right hand side will be small, because Dialogue: 0,0:36:46.89,0:36:49.24,Default,,0000,0000,0000,,this is two E to the negative two gamma squared M. Dialogue: 0,0:36:49.24,0:36:52.56,Default,,0000,0000,0000,,So this says that if my training set is large enough, Dialogue: 0,0:36:52.56,0:36:54.39,Default,,0000,0000,0000,,then the probability Dialogue: 0,0:36:54.39,0:36:57.30,Default,,0000,0000,0000,,my training error is far from generalization error, Dialogue: 0,0:36:57.30,0:36:59.09,Default,,0000,0000,0000,,meaning that it is more than gamma, Dialogue: 0,0:36:59.09,0:37:06.09,Default,,0000,0000,0000,,will be small, will be bounded by this thing on the right hand side. Okay? Now, Dialogue: 0,0:37:09.12,0:37:11.57,Default,,0000,0000,0000,,here's the [inaudible] tricky part, Dialogue: 0,0:37:11.57,0:37:16.57,Default,,0000,0000,0000,,what we've done is prove this bound for one fixed hypothesis, for HJ. Dialogue: 0,0:37:16.57,0:37:20.43,Default,,0000,0000,0000,,What I want to prove is that training error will be a good estimate for generalization Dialogue: 0,0:37:20.43,0:37:21.28,Default,,0000,0000,0000,,error, Dialogue: 0,0:37:21.28,0:37:26.25,Default,,0000,0000,0000,,not just for this one hypothesis HJ, but actually for all K hypotheses in my Dialogue: 0,0:37:26.25,0:37:27.83,Default,,0000,0000,0000,,hypothesis class Dialogue: 0,0:37:27.83,0:37:29.65,Default,,0000,0000,0000,,script Dialogue: 0,0:37:29.65,0:37:30.77,Default,,0000,0000,0000,,H. Dialogue: 0,0:37:30.77,0:37:33.07,Default,,0000,0000,0000,,So let's do it Dialogue: 0,0:37:35.72,0:37:42.72,Default,,0000,0000,0000,,well, Dialogue: 0,0:37:54.66,0:37:57.86,Default,,0000,0000,0000,,better do it on a new board.
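As a numerical sanity check of the Hoeffding step (again a sketch, with assumed values for eps, M, and gamma), one can estimate the left-hand probability by Monte Carlo and compare it against the bound two e to the negative two gamma squared M:

```python
import math
import random

# Assumed values, for illustration only.
eps, M, gamma = 0.3, 200, 0.05
trials = 20000

# Monte Carlo estimate of P(|eps_hat - eps| > gamma) for one fixed
# hypothesis, by redrawing the training set many times.
count = 0
for _ in range(trials):
    eps_hat = sum(random.random() < eps for _ in range(M)) / M
    if abs(eps_hat - eps) > gamma:
        count += 1

empirical = count / trials
bound = 2 * math.exp(-2 * gamma**2 * M)   # Hoeffding bound
print(f"empirical probability: {empirical:.4f}")
print(f"Hoeffding bound:       {bound:.4f}")   # bound >= empirical
```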
So in order to show that, let me define Dialogue: 0,0:37:57.86,0:38:01.23,Default,,0000,0000,0000,,a random event, let me define AJ to be the event Dialogue: 0,0:38:01.23,0:38:05.27,Default,,0000,0000,0000,,that Dialogue: 0,0:38:05.27,0:38:12.27,Default,,0000,0000,0000,,let Dialogue: 0,0:38:21.47,0:38:24.48,Default,,0000,0000,0000,,me define AJ to be the event that Dialogue: 0,0:38:24.48,0:38:28.04,Default,,0000,0000,0000,,you know, the difference between training and generalization error is more than gamma on a Dialogue: 0,0:38:28.04,0:38:30.44,Default,,0000,0000,0000,,hypothesis HJ. Dialogue: 0,0:38:30.44,0:38:32.09,Default,,0000,0000,0000,,And so what we Dialogue: 0,0:38:32.09,0:38:36.45,Default,,0000,0000,0000,,put on the previous board was that the probability of AJ is less than or equal to two E Dialogue: 0,0:38:36.45,0:38:43.45,Default,,0000,0000,0000,,to the negative two gamma squared M, and this is pretty small. Now, Dialogue: 0,0:38:44.06,0:38:49.73,Default,,0000,0000,0000,,what I want to bound is the probability that Dialogue: 0,0:38:49.73,0:38:53.88,Default,,0000,0000,0000,,there exists some hypothesis in my class Dialogue: 0,0:38:53.88,0:39:00.88,Default,,0000,0000,0000,,script H, Dialogue: 0,0:39:02.84,0:39:08.11,Default,,0000,0000,0000,,such that I make a large error in my estimate of generalization error. Okay? Dialogue: 0,0:39:08.11,0:39:10.27,Default,,0000,0000,0000,,Such that this holds true. Dialogue: 0,0:39:12.18,0:39:14.29,Default,,0000,0000,0000,,So this is really just Dialogue: 0,0:39:14.29,0:39:17.89,Default,,0000,0000,0000,,the probability that there exists a hypothesis for which this Dialogue: 0,0:39:17.89,0:39:20.65,Default,,0000,0000,0000,,holds. This is really the probability that Dialogue: 0,0:39:20.65,0:39:25.30,Default,,0000,0000,0000,,A one or A two, or up to AK holds. Dialogue: 0,0:39:25.30,0:39:28.66,Default,,0000,0000,0000,,The chance Dialogue: 0,0:39:28.66,0:39:33.18,Default,,0000,0000,0000,,there exists such a hypothesis is just well the probability Dialogue: 0,0:39:33.18,0:39:37.48,Default,,0000,0000,0000,,that for hypothesis one I make a large error in estimating the generalization error, Dialogue: 0,0:39:37.48,0:39:41.07,Default,,0000,0000,0000,,or for hypothesis two I make a large error in estimating generalization error, Dialogue: 0,0:39:41.07,0:39:43.56,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:39:43.56,0:39:50.56,Default,,0000,0000,0000,,And so by the union bound, this is less than or equal to Dialogue: 0,0:39:50.82,0:39:56.23,Default,,0000,0000,0000,,that, Dialogue: 0,0:39:56.23,0:39:59.33,Default,,0000,0000,0000,,which is therefore less than or equal to Dialogue: 0,0:40:06.52,0:40:13.52,Default,,0000,0000,0000,,which Dialogue: 0,0:40:13.99,0:40:20.99,Default,,0000,0000,0000,,is equal to that. Okay? Dialogue: 0,0:40:38.95,0:40:40.44,Default,,0000,0000,0000,,So let Dialogue: 0,0:40:40.44,0:40:42.36,Default,,0000,0000,0000,,me just take Dialogue: 0,0:40:42.36,0:40:47.64,Default,,0000,0000,0000,,one minus both sides of the equation Dialogue: 0,0:40:47.64,0:40:50.94,Default,,0000,0000,0000,,on the previous board let me take one minus both sides, so Dialogue: 0,0:40:50.94,0:40:56.90,Default,,0000,0000,0000,,the probability that there does not exist a Dialogue: 0,0:40:56.90,0:40:59.01,Default,,0000,0000,0000,,hypothesis Dialogue: 0,0:40:59.01,0:41:06.01,Default,,0000,0000,0000,,such that Dialogue: 0,0:41:07.65,0:41:10.84,Default,,0000,0000,0000,,that.
The probability that there does not exist a hypothesis on which I make a large Dialogue: 0,0:41:10.84,0:41:13.19,Default,,0000,0000,0000,,error in this estimate Dialogue: 0,0:41:13.19,0:41:20.19,Default,,0000,0000,0000,,well, this is equal to the probability that for all hypotheses, Dialogue: 0,0:41:22.68,0:41:24.76,Default,,0000,0000,0000,,I make a Dialogue: 0,0:41:24.76,0:41:26.57,Default,,0000,0000,0000,,small error, or at Dialogue: 0,0:41:26.57,0:41:31.61,Default,,0000,0000,0000,,most gamma, in my estimate of generalization error. Dialogue: 0,0:41:31.61,0:41:35.52,Default,,0000,0000,0000,,And taking one minus on the right hand side I get one minus two K E Dialogue: 0,0:41:35.52,0:41:39.15,Default,,0000,0000,0000,,to the negative two Dialogue: 0,0:41:39.15,0:41:44.11,Default,,0000,0000,0000,,gamma squared M. Okay? Dialogue: 0,0:41:44.11,0:41:45.71,Default,,0000,0000,0000,,And so Dialogue: 0,0:41:45.71,0:41:49.52,Default,,0000,0000,0000,,the sign of the inequality flipped because I took one minus both Dialogue: 0,0:41:49.52,0:41:53.80,Default,,0000,0000,0000,,sides. The minus sign flips the sign of the Dialogue: 0,0:41:53.80,0:41:55.82,Default,,0000,0000,0000,,inequality. Dialogue: 0,0:41:55.82,0:41:59.80,Default,,0000,0000,0000,,So what we've shown is that Dialogue: 0,0:41:59.80,0:42:04.68,Default,,0000,0000,0000,,with probability which I abbreviate to WP with probability one minus Dialogue: 0,0:42:04.68,0:42:08.53,Default,,0000,0000,0000,,two K E to the negative two gamma squared M, Dialogue: 0,0:42:08.53,0:42:11.89,Default,,0000,0000,0000,,we have that epsilon hat Dialogue: 0,0:42:11.89,0:42:16.48,Default,,0000,0000,0000,,of H Dialogue: 0,0:42:16.48,0:42:21.15,Default,,0000,0000,0000,,will be Dialogue: 0,0:42:21.15,0:42:28.15,Default,,0000,0000,0000,,within gamma of epsilon of H, Dialogue: 0,0:42:31.31,0:42:33.06,Default,,0000,0000,0000,,simultaneously for all Dialogue: 0,0:42:33.06,0:42:35.45,Default,,0000,0000,0000,,hypotheses in our Dialogue: 0,0:42:35.45,0:42:42.45,Default,,0000,0000,0000,,class script H. Dialogue: 0,0:42:43.27,0:42:47.21,Default,,0000,0000,0000,,And so Dialogue: 0,0:42:47.21,0:42:54.21,Default,,0000,0000,0000,,just to give this result a name, this is called Dialogue: 0,0:42:55.53,0:43:00.09,Default,,0000,0000,0000,,this is one instance of what's called a uniform convergence result, Dialogue: 0,0:43:00.09,0:43:01.13,Default,,0000,0000,0000,,and Dialogue: 0,0:43:01.13,0:43:03.76,Default,,0000,0000,0000,,the term uniform convergence Dialogue: 0,0:43:03.76,0:43:05.72,Default,,0000,0000,0000,,sort of alludes to the fact that Dialogue: 0,0:43:05.72,0:43:09.66,Default,,0000,0000,0000,,this shows that as M becomes large, Dialogue: 0,0:43:09.66,0:43:12.03,Default,,0000,0000,0000,,then these epsilon hats Dialogue: 0,0:43:12.03,0:43:16.19,Default,,0000,0000,0000,,will all simultaneously converge to epsilon of H. That Dialogue: 0,0:43:16.19,0:43:17.85,Default,,0000,0000,0000,,training error will Dialogue: 0,0:43:17.85,0:43:20.84,Default,,0000,0000,0000,,become very close to generalization error Dialogue: 0,0:43:20.84,0:43:23.21,Default,,0000,0000,0000,,simultaneously for all hypotheses H. Dialogue: 0,0:43:23.21,0:43:25.46,Default,,0000,0000,0000,,That's what the term uniform Dialogue: 0,0:43:25.46,0:43:29.15,Default,,0000,0000,0000,,refers to, the fact that this converges for all hypotheses H and not just Dialogue: 0,0:43:29.15,0:43:31.06,Default,,0000,0000,0000,,for one hypothesis.
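The union-bound arithmetic is easy to package as a tiny helper; this sketch (with assumed values for K, M, and gamma, chosen only for illustration) prints the guaranteed probability that uniform convergence holds, one minus two K e to the negative two gamma squared M:

```python
import math

def uniform_convergence_prob(K, M, gamma):
    """Lower bound, via the union bound, on the probability that every one
    of the K hypotheses has |training error - generalization error| <= gamma."""
    return 1 - 2 * K * math.exp(-2 * gamma**2 * M)

# Assumed values, for illustration only: K = 100 hypotheses, gamma = 0.1.
for M in (300, 500, 1000):
    p = uniform_convergence_prob(K=100, M=M, gamma=0.1)
    # For small M the bound can go negative, i.e. it is vacuous.
    print(f"M = {M:4d}: uniform convergence holds w.p. >= {max(p, 0.0):.4f}")
```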
And so Dialogue: 0,0:43:31.06,0:43:36.13,Default,,0000,0000,0000,,what we've shown is one example of a uniform convergence result. Okay? Dialogue: 0,0:43:36.13,0:43:39.40,Default,,0000,0000,0000,,So let me clean a couple more boards. I'll come back and ask what questions you have Dialogue: 0,0:43:39.40,0:43:46.40,Default,,0000,0000,0000,,about this. We should take another look at this and make sure it all makes sense. Yeah, okay. Dialogue: 0,0:44:20.11,0:44:27.11,Default,,0000,0000,0000,,What questions do you have about this? Dialogue: 0,0:44:27.83,0:44:31.27,Default,,0000,0000,0000,, Dialogue: 0,0:44:31.27,0:44:34.24,Default,,0000,0000,0000,,Student: How is the Dialogue: 0,0:44:34.24,0:44:36.15,Default,,0000,0000,0000,,value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is how is the value of gamma computed? Dialogue: 0,0:44:36.15,0:44:40.29,Default,,0000,0000,0000,,So for these purposes for the purposes of this, gamma is a constant. Dialogue: 0,0:44:40.29,0:44:43.49,Default,,0000,0000,0000,,Imagine gamma is some constant that we chose in advance, Dialogue: 0,0:44:43.49,0:44:48.59,Default,,0000,0000,0000,,and this is a bound that holds true for any fixed value Dialogue: 0,0:44:48.59,0:44:50.74,Default,,0000,0000,0000,,of gamma. Later on as we Dialogue: 0,0:44:50.74,0:44:54.45,Default,,0000,0000,0000,,take this bound and then sort of develop this result further, Dialogue: 0,0:44:54.45,0:44:58.13,Default,,0000,0000,0000,,we'll choose specific values of gamma as a [inaudible] of this bound. For now Dialogue: 0,0:44:58.13,0:45:05.13,Default,,0000,0000,0000,,we'll just imagine that what we've proved holds true for any value of gamma. Any questions? Dialogue: 0,0:45:08.89,0:45:11.58,Default,,0000,0000,0000,,Yeah? Student: [Inaudible] hypothesis class is infinite [inaudible]? Instructor (Andrew Ng): Yes, if the hypothesis class is infinite, Dialogue: 0,0:45:11.58,0:45:16.28,Default,,0000,0000,0000,,then this simple result won't work in its present form, but we'll generalize this Dialogue: 0,0:45:16.28,0:45:20.20,Default,,0000,0000,0000,,probably won't get to it today but we'll generalize this at the beginning of the Dialogue: 0,0:45:20.20,0:45:27.20,Default,,0000,0000,0000,,next lecture to infinite hypothesis classes. Student: How do we use this Dialogue: 0,0:45:29.78,0:45:31.89,Default,,0000,0000,0000,,theory [inaudible]? Instructor (Andrew Ng): How do you use this theory in practice? So let me Dialogue: 0,0:45:31.89,0:45:37.26,Default,,0000,0000,0000,,I might get to a little of that later today, we'll talk concretely about algorithms, Dialogue: 0,0:45:37.26,0:45:38.89,Default,,0000,0000,0000,,and the consequences of Dialogue: 0,0:45:38.89,0:45:45.72,Default,,0000,0000,0000,,understanding these things, in the next lecture as well. Yeah, okay? Cool. Can you Dialogue: 0,0:45:45.72,0:45:47.29,Default,,0000,0000,0000,,just raise your hand if the Dialogue: 0,0:45:47.29,0:45:49.60,Default,,0000,0000,0000,,things I've proved so far Dialogue: 0,0:45:49.60,0:45:55.07,Default,,0000,0000,0000,,make sense? Okay. Cool. Great. Dialogue: 0,0:45:55.07,0:45:59.34,Default,,0000,0000,0000,,Thanks. All right. Let me just take this uniform convergence bound and rewrite Dialogue: 0,0:45:59.34,0:46:02.28,Default,,0000,0000,0000,,it in a couple of other forms.
Dialogue: 0,0:46:02.28,0:46:04.59,Default,,0000,0000,0000,,So Dialogue: 0,0:46:04.59,0:46:08.69,Default,,0000,0000,0000,,this is a sort of a bound on probability, this is saying suppose I fix my training Dialogue: 0,0:46:08.69,0:46:11.09,Default,,0000,0000,0000,,set and then fix my Dialogue: 0,0:46:11.09,0:46:14.70,Default,,0000,0000,0000,,threshold, my error threshold gamma, Dialogue: 0,0:46:14.70,0:46:19.11,Default,,0000,0000,0000,,what is the probability that uniform convergence holds, and well, Dialogue: 0,0:46:19.11,0:46:22.67,Default,,0000,0000,0000,,that's my formula that gives the answer. This is the probability of something Dialogue: 0,0:46:22.67,0:46:25.75,Default,,0000,0000,0000,,happening. Dialogue: 0,0:46:25.75,0:46:28.75,Default,,0000,0000,0000,,So there are actually three parameters of interest. One is, Dialogue: 0,0:46:28.75,0:46:31.54,Default,,0000,0000,0000,,what is this probability? The Dialogue: 0,0:46:31.54,0:46:34.29,Default,,0000,0000,0000,,other parameter is, what's the training set size M? Dialogue: 0,0:46:34.29,0:46:36.02,Default,,0000,0000,0000,,And the third parameter is, what Dialogue: 0,0:46:36.02,0:46:37.01,Default,,0000,0000,0000,,is the value Dialogue: 0,0:46:37.01,0:46:39.99,Default,,0000,0000,0000,,of this error Dialogue: 0,0:46:39.99,0:46:41.95,Default,,0000,0000,0000,,threshold gamma? I'm not gonna Dialogue: 0,0:46:41.95,0:46:44.31,Default,,0000,0000,0000,,vary K for these purposes. Dialogue: 0,0:46:44.31,0:46:47.62,Default,,0000,0000,0000,,So there are two other equivalent forms of the bound, Dialogue: 0,0:46:47.62,0:46:49.51,Default,,0000,0000,0000,,and so you can ask, Dialogue: 0,0:46:49.51,0:46:50.60,Default,,0000,0000,0000,,given Dialogue: 0,0:46:50.60,0:46:52.31,Default,,0000,0000,0000,,gamma Dialogue: 0,0:46:52.31,0:46:53.29,Default,,0000,0000,0000,,so what Dialogue: 0,0:46:53.29,0:46:56.02,Default,,0000,0000,0000,,we proved was given gamma and given M, what Dialogue: 0,0:46:56.02,0:46:57.85,Default,,0000,0000,0000,,is the probability of Dialogue: 0,0:46:57.85,0:47:01.02,Default,,0000,0000,0000,,uniform convergence? The Dialogue: 0,0:47:01.02,0:47:04.15,Default,,0000,0000,0000,,other equivalent forms are, Dialogue: 0,0:47:04.15,0:47:06.90,Default,,0000,0000,0000,,first, given gamma Dialogue: 0,0:47:06.90,0:47:13.90,Default,,0000,0000,0000,,and the probability delta of making a large error, Dialogue: 0,0:47:15.34,0:47:19.05,Default,,0000,0000,0000,,how large a training set size do you need in order to give Dialogue: 0,0:47:19.05,0:47:21.95,Default,,0000,0000,0000,,a bound? That is, how large a Dialogue: 0,0:47:21.95,0:47:25.46,Default,,0000,0000,0000,,training set size do you need Dialogue: 0,0:47:25.46,0:47:29.75,Default,,0000,0000,0000,,to give a uniform convergence bound with parameters gamma Dialogue: 0,0:47:29.75,0:47:31.21,Default,,0000,0000,0000,,and delta? Dialogue: 0,0:47:31.21,0:47:33.75,Default,,0000,0000,0000,,And so well, Dialogue: 0,0:47:33.75,0:47:37.73,Default,,0000,0000,0000,,so if you set delta to be two K E to the negative two gamma Dialogue: 0,0:47:37.73,0:47:38.49,Default,,0000,0000,0000,,squared M, Dialogue: 0,0:47:38.49,0:47:41.36,Default,,0000,0000,0000,,this is the form that I had on the left.
Dialogue: 0,0:47:41.36,0:47:43.63,Default,,0000,0000,0000,,And if you solve Dialogue: 0,0:47:43.63,0:47:46.08,Default,,0000,0000,0000,,for M, Dialogue: 0,0:47:46.08,0:47:48.17,Default,,0000,0000,0000,,what you find is that Dialogue: 0,0:47:48.17,0:47:54.43,Default,,0000,0000,0000,,there's an equivalent form of this result that says that Dialogue: 0,0:47:54.43,0:47:55.89,Default,,0000,0000,0000,,so long as your training Dialogue: 0,0:47:55.89,0:47:59.03,Default,,0000,0000,0000,,set size M is greater than this. Dialogue: 0,0:47:59.03,0:48:05.39,Default,,0000,0000,0000,,And this is the formula that I get by solving for M. Dialogue: 0,0:48:05.39,0:48:08.81,Default,,0000,0000,0000,,Okay? So long as M is greater than or equal to this, Dialogue: 0,0:48:08.81,0:48:12.93,Default,,0000,0000,0000,,then with probability, which I'm abbreviating to WP again, Dialogue: 0,0:48:12.93,0:48:17.37,Default,,0000,0000,0000,,with probability at least one minus delta, Dialogue: 0,0:48:17.37,0:48:24.37,Default,,0000,0000,0000,,we have for all. Dialogue: 0,0:48:24.88,0:48:30.96,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:48:30.96,0:48:34.30,Default,,0000,0000,0000,,So this says how large a training set size I need Dialogue: 0,0:48:34.30,0:48:37.86,Default,,0000,0000,0000,,to guarantee that with probability at least one minus delta, Dialogue: 0,0:48:37.86,0:48:42.24,Default,,0000,0000,0000,,we have the training error is within gamma of generalization error for all my Dialogue: 0,0:48:42.24,0:48:45.26,Default,,0000,0000,0000,,hypotheses, and this gives an answer. Dialogue: 0,0:48:45.26,0:48:49.40,Default,,0000,0000,0000,,And just to give this another name, this is an example of a sample Dialogue: 0,0:48:49.40,0:48:53.47,Default,,0000,0000,0000,,complexity Dialogue: 0,0:48:53.47,0:48:58.08,Default,,0000,0000,0000,, Dialogue: 0,0:48:58.08,0:48:58.94,Default,,0000,0000,0000,,bound. Dialogue: 0,0:48:58.94,0:49:02.87,Default,,0000,0000,0000,,So from undergrad computer science classes you may have heard of Dialogue: 0,0:49:02.87,0:49:05.68,Default,,0000,0000,0000,,computational complexity, which is how much computation you need to achieve Dialogue: 0,0:49:05.68,0:49:06.85,Default,,0000,0000,0000,,something. Dialogue: 0,0:49:06.85,0:49:11.32,Default,,0000,0000,0000,,So sample complexity just means how large a training set how Dialogue: 0,0:49:11.32,0:49:13.68,Default,,0000,0000,0000,,large a sample of examples do you need Dialogue: 0,0:49:13.68,0:49:16.55,Default,,0000,0000,0000,,in order to achieve a certain bound on the error. Dialogue: 0,0:49:16.55,0:49:19.62,Default,,0000,0000,0000,,And it turns out that in many of the theorems we write out you can Dialogue: 0,0:49:19.62,0:49:23.66,Default,,0000,0000,0000,,pose them in sort of a form of probability bound or a sample complexity bound or in some other Dialogue: 0,0:49:23.66,0:49:24.46,Default,,0000,0000,0000,,form. Dialogue: 0,0:49:24.46,0:49:28.37,Default,,0000,0000,0000,,I personally often find the sample complexity bounds the Dialogue: 0,0:49:28.37,0:49:32.42,Default,,0000,0000,0000,,easiest to interpret because it says how large a training set do you need to give a certain bound on the Dialogue: 0,0:49:32.42,0:49:35.82,Default,,0000,0000,0000,,error.
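The sample complexity form can be computed directly; here is a sketch (with assumed values for gamma and delta) that solves delta = two K e to the negative two gamma squared M for M. Varying K also shows how slowly the required training set size grows with the number of hypotheses, which is the point made next.

```python
import math

def sample_complexity(K, gamma, delta):
    """Smallest M satisfying the bound: set delta = 2K exp(-2 gamma^2 M)
    and solve for M, giving M >= (1 / (2 gamma^2)) * log(2K / delta)."""
    return math.ceil(math.log(2 * K / delta) / (2 * gamma**2))

# Assumed values, for illustration only: gamma = 0.05, delta = 0.05.
# Note how slowly the required M grows as K increases.
for K in (10, 1000, 10**6, 10**9):
    print(f"K = {K:>10}: M >= {sample_complexity(K, gamma=0.05, delta=0.05)}")
```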
Dialogue: 0,0:49:35.82,0:49:38.83,Default,,0000,0000,0000,,And in fact well, we'll see this later, Dialogue: 0,0:49:38.83,0:49:41.05,Default,,0000,0000,0000,,sample complexity bounds often sort of Dialogue: 0,0:49:41.05,0:49:43.30,Default,,0000,0000,0000,,help to give guidance for really if Dialogue: 0,0:49:43.30,0:49:44.24,Default,,0000,0000,0000,,you're trying to Dialogue: 0,0:49:44.24,0:49:46.14,Default,,0000,0000,0000,,achieve something on a machine learning problem, Dialogue: 0,0:49:46.14,0:49:46.94,Default,,0000,0000,0000,,this really is Dialogue: 0,0:49:46.94,0:49:49.72,Default,,0000,0000,0000,,trying to give guidance on how much training data you need Dialogue: 0,0:49:49.72,0:49:54.27,Default,,0000,0000,0000,,to achieve something. Dialogue: 0,0:49:54.27,0:49:57.77,Default,,0000,0000,0000,,The one thing I want to note here is that M Dialogue: 0,0:49:57.77,0:50:01.02,Default,,0000,0000,0000,,grows like the log of K, right, so Dialogue: 0,0:50:01.02,0:50:05.63,Default,,0000,0000,0000,,the log of K grows extremely slowly as a function of K. The log is one Dialogue: 0,0:50:05.63,0:50:10.62,Default,,0000,0000,0000,,of the slowest growing functions, right. It's one of well, Dialogue: 0,0:50:10.62,0:50:17.62,Default,,0000,0000,0000,,some of you may have heard this, right? That for all values of K, right I learned Dialogue: 0,0:50:18.87,0:50:22.18,Default,,0000,0000,0000,,this from a colleague, Andrew Moore, Dialogue: 0,0:50:22.18,0:50:23.74,Default,,0000,0000,0000,,at Carnegie Mellon Dialogue: 0,0:50:23.74,0:50:24.70,Default,,0000,0000,0000,,that in Dialogue: 0,0:50:24.70,0:50:30.55,Default,,0000,0000,0000,,computer science for all practical purposes for all values of K, log K is less than [inaudible], this is Dialogue: 0,0:50:30.55,0:50:31.51,Default,,0000,0000,0000,,almost true. Dialogue: 0,0:50:31.51,0:50:35.53,Default,,0000,0000,0000,,So the log is one of the slowest growing functions, and so the Dialogue: 0,0:50:35.53,0:50:40.06,Default,,0000,0000,0000,,fact that the sample complexity M grows like the log of K Dialogue: 0,0:50:40.06,0:50:41.55,Default,,0000,0000,0000,,means that Dialogue: 0,0:50:41.55,0:50:45.75,Default,,0000,0000,0000,,you can increase the number of hypotheses in your hypothesis class quite a lot Dialogue: 0,0:50:45.75,0:50:50.61,Default,,0000,0000,0000,,and the number of training examples you need won't grow Dialogue: 0,0:50:50.61,0:50:56.39,Default,,0000,0000,0000,, Dialogue: 0,0:50:56.39,0:51:00.29,Default,,0000,0000,0000,,very much. [Inaudible]. This property will be important later when we talk about infinite Dialogue: 0,0:51:00.29,0:51:02.86,Default,,0000,0000,0000,,hypothesis classes. Dialogue: 0,0:51:02.86,0:51:05.68,Default,,0000,0000,0000,,The final form is what Dialogue: 0,0:51:05.68,0:51:08.49,Default,,0000,0000,0000,,I guess is sometimes called the error bound, Dialogue: 0,0:51:08.49,0:51:13.66,Default,,0000,0000,0000,,which is when you hold M and delta fixed and solve for gamma. Dialogue: 0,0:51:13.66,0:51:20.66,Default,,0000,0000,0000,,And so Dialogue: 0,0:51:22.81,0:51:27.47,Default,,0000,0000,0000,,what you get then is that with Dialogue: 0,0:51:27.47,0:51:30.76,Default,,0000,0000,0000,,probability at least one minus delta, Dialogue: 0,0:51:30.76,0:51:37.76,Default,,0000,0000,0000,,we have that.
Dialogue: 0,0:51:41.34,0:51:44.41,Default,,0000,0000,0000,,For all hypotheses in my hypothesis class, Dialogue: 0,0:51:44.41,0:51:51.41,Default,,0000,0000,0000,,the difference between training and generalization error will be less Dialogue: 0,0:51:53.69,0:51:55.17,Default,,0000,0000,0000,,than or equal to that. Okay? And that's just Dialogue: 0,0:51:55.17,0:52:02.17,Default,,0000,0000,0000,,solving for gamma and plugging the value I get in there. Okay? All right. Dialogue: 0,0:52:03.24,0:52:10.24,Default,,0000,0000,0000,,So the Dialogue: 0,0:52:30.95,0:52:33.57,Default,,0000,0000,0000,,second Dialogue: 0,0:52:33.57,0:52:36.22,Default,,0000,0000,0000,,step of Dialogue: 0,0:52:36.22,0:52:43.22,Default,,0000,0000,0000,,the overall proof I want to execute is the following. Dialogue: 0,0:52:44.52,0:52:47.33,Default,,0000,0000,0000,,The result so far is essentially that uniform Dialogue: 0,0:52:47.33,0:52:49.10,Default,,0000,0000,0000,,convergence will hold true Dialogue: 0,0:52:49.10,0:52:51.44,Default,,0000,0000,0000,,with high probability. Dialogue: 0,0:52:51.44,0:52:55.25,Default,,0000,0000,0000,,What I want to show now is let's assume that uniform convergence holds. So let's Dialogue: 0,0:52:55.25,0:52:57.96,Default,,0000,0000,0000,,assume that for all Dialogue: 0,0:52:57.96,0:53:00.09,Default,,0000,0000,0000,,hypotheses H, Dialogue: 0,0:53:00.09,0:53:05.44,Default,,0000,0000,0000,,we have that epsilon of H minus epsilon hat of H is Dialogue: 0,0:53:05.44,0:53:08.62,Default,,0000,0000,0000,,less than or equal to gamma. Okay? Dialogue: 0,0:53:08.62,0:53:10.93,Default,,0000,0000,0000,,What I want to do now is Dialogue: 0,0:53:10.93,0:53:13.14,Default,,0000,0000,0000,,use this to see what we can Dialogue: 0,0:53:13.14,0:53:14.91,Default,,0000,0000,0000,,prove about Dialogue: 0,0:53:14.91,0:53:21.91,Default,,0000,0000,0000,,the generalization error. Dialogue: 0,0:53:28.83,0:53:31.40,Default,,0000,0000,0000,, Dialogue: 0,0:53:31.40,0:53:32.62,Default,,0000,0000,0000,,So I want to know, Dialogue: 0,0:53:32.62,0:53:35.64,Default,,0000,0000,0000,,suppose this holds true, I want Dialogue: 0,0:53:35.64,0:53:40.26,Default,,0000,0000,0000,,to know can we prove something about the generalization error of H hat, where Dialogue: 0,0:53:40.26,0:53:43.38,Default,,0000,0000,0000,,again, H hat Dialogue: 0,0:53:43.38,0:53:45.95,Default,,0000,0000,0000,,was the hypothesis Dialogue: 0,0:53:45.95,0:53:51.68,Default,,0000,0000,0000,,selected by empirical risk minimization. Dialogue: 0,0:53:51.68,0:53:55.79,Default,,0000,0000,0000,,Okay? So in order to show this, let me make one more definition, let me define H Dialogue: 0,0:53:55.79,0:54:00.29,Default,,0000,0000,0000,,star, Dialogue: 0,0:54:00.29,0:54:03.02,Default,,0000,0000,0000,,to be the hypothesis Dialogue: 0,0:54:03.02,0:54:06.46,Default,,0000,0000,0000,,in my class Dialogue: 0,0:54:06.46,0:54:10.19,Default,,0000,0000,0000,,script H that has the smallest generalization error. Dialogue: 0,0:54:10.19,0:54:11.31,Default,,0000,0000,0000,,So this is Dialogue: 0,0:54:11.31,0:54:14.93,Default,,0000,0000,0000,,if I had an infinite amount of training data or if I really Dialogue: 0,0:54:14.93,0:54:19.03,Default,,0000,0000,0000,,could go in and find the best Dialogue: 0,0:54:19.03,0:54:23.36,Default,,0000,0000,0000,,possible hypothesis in the sense of minimizing generalization error Dialogue: 0,0:54:23.36,0:54:28.99,Default,,0000,0000,0000,,what's the hypothesis I would get? Okay?
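Before continuing with the proof, the error bound form derived a moment ago, gamma equals the square root of one over two M times log two K over delta, can also be computed directly; a sketch with assumed values:

```python
import math

def error_bound(K, M, delta):
    """Solve delta = 2K exp(-2 gamma^2 M) for gamma: with probability at
    least 1 - delta, every hypothesis in the class has
    |training error - generalization error| <= this gamma."""
    return math.sqrt(math.log(2 * K / delta) / (2 * M))

# Assumed values, for illustration only: K = 1000 hypotheses, delta = 0.05.
for M in (100, 1000, 10000):
    print(f"M = {M:5d}: gamma = {error_bound(K=1000, M=M, delta=0.05):.4f}")
```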
So Dialogue: 0,0:54:28.99,0:54:32.96,Default,,0000,0000,0000,,in some sense, it sort of makes sense to compare the performance of our learning Dialogue: 0,0:54:32.96,0:54:33.61,Default,,0000,0000,0000,,algorithm Dialogue: 0,0:54:33.61,0:54:35.57,Default,,0000,0000,0000,,to the performance of H star, because we Dialogue: 0,0:54:35.57,0:54:39.96,Default,,0000,0000,0000,,clearly can't hope to do better than H star. Dialogue: 0,0:54:39.96,0:54:42.47,Default,,0000,0000,0000,,Another way of saying that is that if Dialogue: 0,0:54:42.47,0:54:47.61,Default,,0000,0000,0000,,your hypothesis class is the class of all linear decision boundaries, but Dialogue: 0,0:54:47.61,0:54:50.63,Default,,0000,0000,0000,,the data just can't be separated by any linear function, Dialogue: 0,0:54:50.63,0:54:53.74,Default,,0000,0000,0000,,so even H star is really bad, Dialogue: 0,0:54:53.74,0:54:55.24,Default,,0000,0000,0000,,then there's sort of Dialogue: 0,0:54:55.24,0:54:59.12,Default,,0000,0000,0000,,just not much hope that your learning algorithm could do Dialogue: 0,0:54:59.12,0:55:04.13,Default,,0000,0000,0000,,better Dialogue: 0,0:55:04.13,0:55:08.98,Default,,0000,0000,0000,,than H star. So I'll actually prove this result in three steps. Dialogue: 0,0:55:08.98,0:55:13.40,Default,,0000,0000,0000,,So the generalization error of H hat, the hypothesis I chose, Dialogue: 0,0:55:13.40,0:55:20.40,Default,,0000,0000,0000,,this is going to be less than or equal to that, actually let me number these Dialogue: 0,0:55:21.74,0:55:25.06,Default,,0000,0000,0000,,equations, right. This Dialogue: 0,0:55:25.06,0:55:27.20,Default,,0000,0000,0000,,is Dialogue: 0,0:55:27.20,0:55:30.29,Default,,0000,0000,0000,,because of equation one, because I see that Dialogue: 0,0:55:30.29,0:55:33.09,Default,,0000,0000,0000,,epsilon of H hat and epsilon hat of H hat are within gamma of each Dialogue: 0,0:55:33.09,0:55:36.63,Default,,0000,0000,0000,,other. Dialogue: 0,0:55:36.63,0:55:38.32,Default,,0000,0000,0000,,Now Dialogue: 0,0:55:38.32,0:55:42.14,Default,,0000,0000,0000,,because H star, excuse me, Dialogue: 0,0:55:42.14,0:55:45.89,Default,,0000,0000,0000,,now by the Dialogue: 0,0:55:45.89,0:55:51.01,Default,,0000,0000,0000,,definition of empirical risk minimization, H Dialogue: 0,0:55:51.01,0:55:54.05,Default,,0000,0000,0000,,hat was chosen to minimize training error Dialogue: 0,0:55:54.05,0:55:58.68,Default,,0000,0000,0000,,and so there can't be any hypothesis with lower training error than H hat. Dialogue: 0,0:55:58.68,0:56:04.63,Default,,0000,0000,0000,,So the training error of H hat must be less than or equal to Dialogue: 0,0:56:04.63,0:56:07.08,Default,,0000,0000,0000,,the training error of H star. So Dialogue: 0,0:56:07.08,0:56:10.55,Default,,0000,0000,0000,,this is sort of by two, or by the definition of H hat, Dialogue: 0,0:56:10.55,0:56:17.55,Default,,0000,0000,0000,,as the hypothesis that minimizes training error. Dialogue: 0,0:56:17.62,0:56:20.16,Default,,0000,0000,0000,,And the final step is I'm Dialogue: 0,0:56:20.16,0:56:23.50,Default,,0000,0000,0000,,going to apply this uniform convergence result again. We know that Dialogue: 0,0:56:23.50,0:56:24.93,Default,,0000,0000,0000,,epsilon hat Dialogue: 0,0:56:24.93,0:56:29.55,Default,,0000,0000,0000,,of H star must be within gamma of epsilon of H star. Dialogue: 0,0:56:29.55,0:56:33.88,Default,,0000,0000,0000,,And so this is at most Dialogue: 0,0:56:33.88,0:56:35.20,Default,,0000,0000,0000,,plus gamma.
Then Dialogue: 0,0:56:35.20,0:56:36.09,Default,,0000,0000,0000,, Dialogue: 0,0:56:36.09,0:56:38.30,Default,,0000,0000,0000,,I have my original gamma Dialogue: 0,0:56:38.30,0:56:41.32,Default,,0000,0000,0000,,there. Okay? And so this is by Dialogue: 0,0:56:41.32,0:56:44.52,Default,,0000,0000,0000,,equation one again because oh, excuse me Dialogue: 0,0:56:44.52,0:56:47.69,Default,,0000,0000,0000,,because I know the training error of H star must be within gamma of the Dialogue: 0,0:56:47.69,0:56:49.93,Default,,0000,0000,0000,,generalization error of H star. Dialogue: 0,0:56:49.93,0:56:51.70,Default,,0000,0000,0000,,And so well, I'll just Dialogue: 0,0:56:51.70,0:56:57.47,Default,,0000,0000,0000,,write this as plus two gamma. Okay? Yeah? Student: [Inaudible] notation, is epsilon hat of [inaudible] H hat the training Dialogue: 0,0:56:57.47,0:57:04.47,Default,,0000,0000,0000,,error, or the generalization error estimate of the hypothesis? Instructor (Andrew Ng): Oh, Dialogue: 0,0:57:24.87,0:57:26.65,Default,,0000,0000,0000,,okay. Let me just well, let Dialogue: 0,0:57:26.65,0:57:32.78,Default,,0000,0000,0000,,me write that down on this board. So actually let me Dialogue: 0,0:57:32.78,0:57:36.46,Default,,0000,0000,0000,,think Dialogue: 0,0:57:36.46,0:57:38.74,Default,,0000,0000,0000,,[inaudible] Dialogue: 0,0:57:38.74,0:57:43.70,Default,,0000,0000,0000,,fit this in here. So epsilon hat Dialogue: 0,0:57:43.70,0:57:45.17,Default,,0000,0000,0000,,of H is the Dialogue: 0,0:57:45.17,0:57:48.100,Default,,0000,0000,0000,,training Dialogue: 0,0:57:48.100,0:57:50.62,Default,,0000,0000,0000,, Dialogue: 0,0:57:50.62,0:57:52.47,Default,,0000,0000,0000,,error of the hypothesis H. In Dialogue: 0,0:57:52.47,0:57:55.95,Default,,0000,0000,0000,,other words, given the hypothesis a hypothesis is just a function, right, mapping from Xs to Dialogue: 0,0:57:55.95,0:57:57.56,Default,,0000,0000,0000,,Ys Dialogue: 0,0:57:57.56,0:57:59.49,Default,,0000,0000,0000,,so epsilon hat of H is, Dialogue: 0,0:57:59.49,0:58:02.81,Default,,0000,0000,0000,,given the hypothesis H, what's the fraction of training examples it Dialogue: 0,0:58:02.81,0:58:05.03,Default,,0000,0000,0000,,misclassifies? Dialogue: 0,0:58:05.03,0:58:11.60,Default,,0000,0000,0000,,And generalization error Dialogue: 0,0:58:11.60,0:58:13.84,Default,,0000,0000,0000,,of Dialogue: 0,0:58:13.84,0:58:17.95,Default,,0000,0000,0000,,H is, given the hypothesis H, if I Dialogue: 0,0:58:17.95,0:58:19.15,Default,,0000,0000,0000,,sample another Dialogue: 0,0:58:19.15,0:58:21.71,Default,,0000,0000,0000,,example from my distribution Dialogue: 0,0:58:21.71,0:58:22.90,Default,,0000,0000,0000,,script Dialogue: 0,0:58:22.90,0:58:28.31,Default,,0000,0000,0000,,D, what's the probability that H will misclassify that example? Does that make sense? Dialogue: 0,0:58:28.31,0:58:31.86,Default,,0000,0000,0000,,Oh, Dialogue: 0,0:58:31.86,0:58:38.24,Default,,0000,0000,0000,,okay. And H hat is the hypothesis that's chosen by empirical risk minimization.
Dialogue: 0,0:58:38.24,0:58:42.58,Default,,0000,0000,0000,,So when I talk about empirical risk minimization, that's the algorithm that Dialogue: 0,0:58:42.58,0:58:47.80,Default,,0000,0000,0000,,minimizes training error, and so epsilon hat of H is the training error of H, Dialogue: 0,0:58:47.80,0:58:50.22,Default,,0000,0000,0000,,and so H hat is defined as Dialogue: 0,0:58:50.22,0:58:52.05,Default,,0000,0000,0000,,the hypothesis that, out Dialogue: 0,0:58:52.05,0:58:55.03,Default,,0000,0000,0000,,of all hypotheses in my class script H, Dialogue: 0,0:58:55.03,0:58:57.59,Default,,0000,0000,0000,,is the one that minimizes training error epsilon hat of H. Okay? All right. Yeah? Student: [Inaudible] H is [inaudible] a member Dialogue: 0,0:58:57.59,0:59:04.59,Default,,0000,0000,0000,,of Dialogue: 0,0:59:06.67,0:59:08.80,Default,,0000,0000,0000,,script H, [inaudible] Dialogue: 0,0:59:08.80,0:59:10.69,Default,,0000,0000,0000,,family Dialogue: 0,0:59:10.69,0:59:12.91,Default,,0000,0000,0000,,right? Instructor (Andrew Ng): Yes it is. Student: Dialogue: 0,0:59:12.91,0:59:19.91,Default,,0000,0000,0000,,So what happens with the generalization error [inaudible]? Instructor (Andrew Ng): Dialogue: 0,0:59:20.07,0:59:24.49,Default,,0000,0000,0000,,I'll talk about that later. So Dialogue: 0,0:59:24.49,0:59:31.49,Default,,0000,0000,0000,,let me tie all these things together into a Dialogue: 0,0:59:34.34,0:59:38.05,Default,,0000,0000,0000,,theorem. Dialogue: 0,0:59:38.05,0:59:42.52,Default,,0000,0000,0000,,Let there be a hypothesis class given with a Dialogue: 0,0:59:42.52,0:59:48.87,Default,,0000,0000,0000,,finite set of K hypotheses, Dialogue: 0,0:59:48.87,0:59:50.89,Default,,0000,0000,0000,,and let any M Dialogue: 0,0:59:50.89,0:59:52.25,Default,,0000,0000,0000,,and delta Dialogue: 0,0:59:52.25,0:59:58.32,Default,,0000,0000,0000,,be fixed. Dialogue: 0,0:59:58.32,1:00:02.05,Default,,0000,0000,0000,,Then so I fixed M and delta, so this will be the error bound Dialogue: 0,1:00:02.05,1:00:04.50,Default,,0000,0000,0000,,form of the theorem, right? Dialogue: 0,1:00:04.50,1:00:06.76,Default,,0000,0000,0000,,Then, with Dialogue: 0,1:00:06.76,1:00:09.64,Default,,0000,0000,0000,,probability at least one minus delta, Dialogue: 0,1:00:09.64,1:00:13.92,Default,,0000,0000,0000,,we have that the generalization error of H hat is Dialogue: 0,1:00:13.92,1:00:19.22,Default,,0000,0000,0000,,less than or equal to Dialogue: 0,1:00:19.22,1:00:22.61,Default,,0000,0000,0000,,the minimum over all hypotheses H in Dialogue: 0,1:00:22.61,1:00:25.05,Default,,0000,0000,0000,,script H of epsilon of H, Dialogue: 0,1:00:25.05,1:00:26.68,Default,,0000,0000,0000,,plus Dialogue: 0,1:00:26.68,1:00:33.68,Default,,0000,0000,0000,,two times the square root of one over two M, Dialogue: 0,1:00:38.03,1:00:39.11,Default,,0000,0000,0000,,log two K over delta. Okay? Dialogue: 0,1:00:39.11,1:00:41.76,Default,,0000,0000,0000,,So to prove this, well, Dialogue: 0,1:00:41.76,1:00:45.53,Default,,0000,0000,0000,,this term of course is just epsilon of H star. Dialogue: 0,1:00:45.53,1:00:47.88,Default,,0000,0000,0000,,And so to prove this Dialogue: 0,1:00:47.88,1:00:49.74,Default,,0000,0000,0000,,we set Dialogue: 0,1:00:49.74,1:00:52.51,Default,,0000,0000,0000,,gamma equal to that Dialogue: 0,1:00:52.51,1:00:55.97,Default,,0000,0000,0000,,this is two times the square root term. Dialogue: 0,1:00:55.97,1:01:02.97,Default,,0000,0000,0000,,To prove this theorem we set gamma equal to that square root term. Say that again? Dialogue: 0,1:01:04.01,1:01:07.52,Default,,0000,0000,0000,,Wait. Say that again?
Dialogue: 0,1:01:07.52,1:01:11.39,Default,,0000,0000,0000,,Oh, Dialogue: 0,1:01:13.95,1:01:16.06,Default,,0000,0000,0000,,yes. Thank you. That didn't Dialogue: 0,1:01:16.06,1:01:19.64,Default,,0000,0000,0000,,make sense at all. Dialogue: 0,1:01:19.64,1:01:20.94,Default,,0000,0000,0000,,Thanks. Great. Dialogue: 0,1:01:20.94,1:01:25.67,Default,,0000,0000,0000,,So set gamma to that square root term, Dialogue: 0,1:01:25.67,1:01:29.91,Default,,0000,0000,0000,,and so we know equation one, Dialogue: 0,1:01:29.91,1:01:31.74,Default,,0000,0000,0000,,right, from the previous board Dialogue: 0,1:01:31.74,1:01:35.28,Default,,0000,0000,0000,,holds with probability one minus delta. Dialogue: 0,1:01:35.28,1:01:40.43,Default,,0000,0000,0000,,Right. Equation one was the uniform convergence result right, that well, Dialogue: 0,1:01:40.43,1:01:47.43,Default,,0000,0000,0000,,IE. Dialogue: 0,1:01:49.55,1:01:52.25,Default,,0000,0000,0000,,This is equation one from the previous board, right, so Dialogue: 0,1:01:52.25,1:01:54.76,Default,,0000,0000,0000,,set gamma equal to this and we know that with probability Dialogue: 0,1:01:54.76,1:01:58.46,Default,,0000,0000,0000,,one minus delta this uniform convergence holds, Dialogue: 0,1:01:58.46,1:02:01.15,Default,,0000,0000,0000,,and whenever that holds, that implies you Dialogue: 0,1:02:01.15,1:02:04.72,Default,,0000,0000,0000,,know, I Dialogue: 0,1:02:04.72,1:02:05.02,Default,,0000,0000,0000,,guess Dialogue: 0,1:02:05.02,1:02:07.67,Default,,0000,0000,0000,,if we call this equation star I guess. Dialogue: 0,1:02:07.67,1:02:11.81,Default,,0000,0000,0000,,And whenever uniform convergence holds, we showed again, on the previous boards Dialogue: 0,1:02:11.81,1:02:17.09,Default,,0000,0000,0000,,that this result holds, that generalization error of H hat is less than Dialogue: 0,1:02:17.09,1:02:20.77,Default,,0000,0000,0000,,or equal to generalization error of H star plus two times gamma. Okay? And Dialogue: 0,1:02:20.77,1:02:22.67,Default,,0000,0000,0000,,so that proves Dialogue: 0,1:02:22.67,1:02:29.67,Default,,0000,0000,0000,,this theorem. So Dialogue: 0,1:02:41.56,1:02:44.23,Default,,0000,0000,0000,,this result sort of helps us to quantify Dialogue: 0,1:02:44.23,1:02:45.23,Default,,0000,0000,0000,,a little bit Dialogue: 0,1:02:45.23,1:02:47.61,Default,,0000,0000,0000,,that bias variance tradeoff Dialogue: 0,1:02:47.61,1:02:49.74,Default,,0000,0000,0000,,that I talked about Dialogue: 0,1:02:49.74,1:02:54.24,Default,,0000,0000,0000,,at the beginning of actually near the very start of this lecture. Dialogue: 0,1:02:54.24,1:02:58.39,Default,,0000,0000,0000,,And in particular Dialogue: 0,1:02:58.39,1:03:03.66,Default,,0000,0000,0000,,let's say I have some hypothesis class script H, that Dialogue: 0,1:03:03.66,1:03:06.04,Default,,0000,0000,0000,,I'm using, maybe the class of all Dialogue: 0,1:03:06.04,1:03:09.80,Default,,0000,0000,0000,,linear functions in linear regression, or logistic regression with Dialogue: 0,1:03:09.80,1:03:11.79,Default,,0000,0000,0000,,just the linear features. Dialogue: 0,1:03:11.79,1:03:14.74,Default,,0000,0000,0000,,And let's say I'm considering switching Dialogue: 0,1:03:14.74,1:03:19.62,Default,,0000,0000,0000,,to some new class H prime by having more features.
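To see the whole argument end to end, here is a small self-contained simulation (an entirely hypothetical finite class of threshold classifiers on synthetic noisy data, chosen only for illustration) of empirical risk minimization; as the theorem predicts, the ERM hypothesis generalizes nearly as well as the best hypothesis in the class:

```python
import random

random.seed(0)

# Hypothetical finite hypothesis class (assumed, for illustration only):
# threshold classifiers on [0, 1], h_t(x) = 1 if x >= t else 0.
thresholds = [k / 20 for k in range(21)]            # K = 21 hypotheses

def h(t, x):
    return 1 if x >= t else 0

def sample(m):
    """Draw m IID examples: x ~ Uniform[0, 1], true threshold 0.37,
    with 10% label noise."""
    data = []
    for _ in range(m):
        x = random.random()
        y = h(0.37, x)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def err(t, data):
    """Fraction of examples in data that h_t misclassifies."""
    return sum(1 for x, y in data if h(t, x) != y) / len(data)

train = sample(500)
test = sample(50000)    # large held-out set, a proxy for generalization error

# Empirical risk minimization: pick the hypothesis with lowest training error.
t_hat = min(thresholds, key=lambda t: err(t, train))
# Best-in-class hypothesis h* (approximated on the big held-out set).
t_star = min(thresholds, key=lambda t: err(t, test))

print(f"ERM choice    t_hat = {t_hat:.2f}, generalization error ~ {err(t_hat, test):.3f}")
print(f"best in class t*    = {t_star:.2f}, generalization error ~ {err(t_star, test):.3f}")
```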
So let's say this is Dialogue: 0,1:03:19.62,1:03:23.06,Default,,0000,0000,0000,,linear Dialogue: 0,1:03:23.06,1:03:27.24,Default,,0000,0000,0000,,and this is quadratic, Dialogue: 0,1:03:27.24,1:03:28.04,Default,,0000,0000,0000,, Dialogue: 0,1:03:28.04,1:03:31.66,Default,,0000,0000,0000,,so the class of all linear functions is a subset of the class of all quadratic functions, Dialogue: 0,1:03:31.66,1:03:34.86,Default,,0000,0000,0000,,and so H is a subset of H prime. And Dialogue: 0,1:03:34.86,1:03:36.79,Default,,0000,0000,0000,,let's say I'm considering Dialogue: 0,1:03:36.79,1:03:40.93,Default,,0000,0000,0000,,instead of using my linear hypothesis class let's say I'm considering switching Dialogue: 0,1:03:40.93,1:03:44.77,Default,,0000,0000,0000,,to a quadratic hypothesis class, or switching to a larger hypothesis Dialogue: 0,1:03:44.77,1:03:46.37,Default,,0000,0000,0000,,class. Dialogue: 0,1:03:46.37,1:03:47.13,Default,,0000,0000,0000,, Dialogue: 0,1:03:47.13,1:03:49.81,Default,,0000,0000,0000,,Then what are the tradeoffs involved? Well, I Dialogue: 0,1:03:49.81,1:03:52.93,Default,,0000,0000,0000,,proved this only for finite hypothesis classes, but we'll see that something Dialogue: 0,1:03:52.93,1:03:56.02,Default,,0000,0000,0000,,very similar holds for infinite hypothesis classes too. Dialogue: 0,1:03:56.02,1:03:57.56,Default,,0000,0000,0000,,But the tradeoff is Dialogue: 0,1:03:57.56,1:04:00.62,Default,,0000,0000,0000,,what if I switch from H to H prime, or I switch from linear to quadratic Dialogue: 0,1:04:00.62,1:04:03.100,Default,,0000,0000,0000,,functions. Then Dialogue: 0,1:04:03.100,1:04:08.03,Default,,0000,0000,0000,,epsilon of H star will become better because Dialogue: 0,1:04:08.03,1:04:12.42,Default,,0000,0000,0000,,the best hypothesis in my hypothesis class will become better. Dialogue: 0,1:04:12.42,1:04:14.83,Default,,0000,0000,0000,,The best quadratic function Dialogue: 0,1:04:14.83,1:04:16.05,Default,,0000,0000,0000,, Dialogue: 0,1:04:16.05,1:04:18.80,Default,,0000,0000,0000,,by best I mean in the sense of generalization error Dialogue: 0,1:04:18.80,1:04:21.30,Default,,0000,0000,0000,,the Dialogue: 0,1:04:21.30,1:04:24.83,Default,,0000,0000,0000,,quadratic function with the lowest generalization error Dialogue: 0,1:04:24.83,1:04:25.60,Default,,0000,0000,0000,,has to have Dialogue: 0,1:04:25.60,1:04:27.06,Default,,0000,0000,0000,,equal or, Dialogue: 0,1:04:27.06,1:04:28.02,Default,,0000,0000,0000,,more likely, lower generalization error than the best Dialogue: 0,1:04:28.02,1:04:30.28,Default,,0000,0000,0000,,linear function. Dialogue: 0,1:04:30.28,1:04:31.80,Default,,0000,0000,0000,, Dialogue: 0,1:04:31.80,1:04:37.55,Default,,0000,0000,0000,,So by switching to a more complex hypothesis class you can get this first term to go down. Dialogue: 0,1:04:37.55,1:04:40.35,Default,,0000,0000,0000,,But what I pay for then is that Dialogue: 0,1:04:40.35,1:04:42.83,Default,,0000,0000,0000,,K will increase. Dialogue: 0,1:04:42.83,1:04:46.45,Default,,0000,0000,0000,,By switching to a larger hypothesis class, Dialogue: 0,1:04:46.45,1:04:48.11,Default,,0000,0000,0000,,the first term will go down, Dialogue: 0,1:04:48.11,1:04:51.82,Default,,0000,0000,0000,,but the second term will increase because I now have a larger class of Dialogue: 0,1:04:51.82,1:04:52.88,Default,,0000,0000,0000,,hypotheses Dialogue: 0,1:04:52.88,1:04:54.63,Default,,0000,0000,0000,,and so the second term K Dialogue: 0,1:04:54.63,1:04:57.19,Default,,0000,0000,0000,,will increase.
Dialogue: 0,1:04:57.19,1:05:01.68,Default,,0000,0000,0000,,And so this is usually called the bias variance Dialogue: 0,1:05:01.68,1:05:03.05,Default,,0000,0000,0000,,tradeoff. Whereby Dialogue: 0,1:05:03.05,1:05:07.70,Default,,0000,0000,0000,,going to a larger hypothesis class maybe I have the hope of finding a better function, Dialogue: 0,1:05:07.70,1:05:09.95,Default,,0000,0000,0000,,but my risk of sort of Dialogue: 0,1:05:09.95,1:05:13.80,Default,,0000,0000,0000,,not fitting my model so accurately also increases, and Dialogue: 0,1:05:13.80,1:05:19.45,Default,,0000,0000,0000,,that's illustrated by the second term going up when the Dialogue: 0,1:05:19.45,1:05:21.30,Default,,0000,0000,0000,,size of your hypothesis class, Dialogue: 0,1:05:21.30,1:05:28.20,Default,,0000,0000,0000,,when K, goes up. Dialogue: 0,1:05:28.20,1:05:32.08,Default,,0000,0000,0000,,And so Dialogue: 0,1:05:32.08,1:05:35.28,Default,,0000,0000,0000,,speaking very loosely, Dialogue: 0,1:05:35.28,1:05:37.11,Default,,0000,0000,0000,,we can think of Dialogue: 0,1:05:37.11,1:05:40.15,Default,,0000,0000,0000,,this first term as corresponding Dialogue: 0,1:05:40.15,1:05:41.46,Default,,0000,0000,0000,,maybe to the bias Dialogue: 0,1:05:41.46,1:05:45.21,Default,,0000,0000,0000,,of the learning algorithm, or the bias of the hypothesis class. Dialogue: 0,1:05:45.21,1:05:50.43,Default,,0000,0000,0000,,And you can again speaking very loosely, Dialogue: 0,1:05:50.43,1:05:52.88,Default,,0000,0000,0000,,think of the second term as corresponding Dialogue: 0,1:05:52.88,1:05:56.76,Default,,0000,0000,0000,,to the variance in your hypothesis, in other words, Dialogue: 0,1:05:56.76,1:05:57.87,Default,,0000,0000,0000,,how well you Dialogue: 0,1:05:57.87,1:05:59.76,Default,,0000,0000,0000,,can Dialogue: 0,1:05:59.76,1:06:02.87,Default,,0000,0000,0000,,actually fit this hypothesis class to the data. Dialogue: 0,1:06:02.87,1:06:06.83,Default,,0000,0000,0000,,And by switching to a more complex hypothesis class, your variance increases and your bias decreases. Dialogue: 0,1:06:06.83,1:06:09.49,Default,,0000,0000,0000,, Dialogue: 0,1:06:09.49,1:06:13.54,Default,,0000,0000,0000,,As a note of warning, it turns out that if you've taken a statistics class you've seen Dialogue: 0,1:06:13.54,1:06:15.12,Default,,0000,0000,0000,,definitions of bias and variance, Dialogue: 0,1:06:15.12,1:06:18.50,Default,,0000,0000,0000,,which are often defined in terms of squared error or something. Dialogue: 0,1:06:18.50,1:06:21.58,Default,,0000,0000,0000,,It turns out that for classification problems, Dialogue: 0,1:06:21.58,1:06:24.65,Default,,0000,0000,0000,,there actually is no Dialogue: 0,1:06:24.65,1:06:25.88,Default,,0000,0000,0000,, Dialogue: 0,1:06:25.88,1:06:27.49,Default,,0000,0000,0000,,universally accepted Dialogue: 0,1:06:27.49,1:06:28.99,Default,,0000,0000,0000,,formal definition of Dialogue: 0,1:06:28.99,1:06:30.68,Default,,0000,0000,0000,,bias and Dialogue: 0,1:06:30.68,1:06:35.26,Default,,0000,0000,0000,,variance for classification problems. For regression problems, there is this squared error Dialogue: 0,1:06:35.26,1:06:38.64,Default,,0000,0000,0000,,definition. For classification problems it Dialogue: 0,1:06:38.64,1:06:39.98,Default,,0000,0000,0000,,turns out there've Dialogue: 0,1:06:39.98,1:06:43.70,Default,,0000,0000,0000,,been several competing proposals for definitions of bias and variance.
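Speaking equally loosely, the tradeoff can be watched numerically in a small simulation (a hypothetical polynomial regression setup, invented only for illustration): training error falls monotonically as model complexity grows, while held-out error first falls and then rises, which is exactly the cartoon described next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression setup (assumed, for illustration only):
# noisy cubic target on [-1, 1], a small fixed training set.
def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = x**3 - 0.5 * x + rng.normal(0.0, 0.15, n)
    return x, y

x_train, y_train = make_data(20)      # fixed training set size M
x_test, y_test = make_data(2000)      # large held-out set

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Model complexity = degree of the polynomial being fit. Training error
# falls monotonically; held-out error traces the U-shaped curve.
for degree in range(1, 13):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree:2d}: train {mse(coeffs, x_train, y_train):.4f}"
          f"   test {mse(coeffs, x_test, y_test):.4f}")
```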
So when I Dialogue: 0,1:06:43.70,1:06:45.30,Default,,0000,0000,0000,,say bias and variance here, Dialogue: 0,1:06:45.30,1:06:47.42,Default,,0000,0000,0000,,think of these as very loose, informal, Dialogue: 0,1:06:47.42,1:06:54.42,Default,,0000,0000,0000,,intuitive definitions, and not formal definitions. Okay. Dialogue: 0,1:07:00.50,1:07:07.50,Default,,0000,0000,0000,,The Dialogue: 0,1:07:15.33,1:07:18.08,Default,,0000,0000,0000,,cartoon associated with the intuition I just Dialogue: 0,1:07:18.08,1:07:19.39,Default,,0000,0000,0000,,described would be Dialogue: 0,1:07:19.39,1:07:22.35,Default,,0000,0000,0000,,as follows: Let's Dialogue: 0,1:07:22.35,1:07:24.01,Default,,0000,0000,0000,,say Dialogue: 0,1:07:24.01,1:07:26.92,Default,,0000,0000,0000,, Dialogue: 0,1:07:26.92,1:07:30.37,Default,,0000,0000,0000,,and everything about the plot will be for a fixed value of M, for a Dialogue: 0,1:07:30.37,1:07:35.22,Default,,0000,0000,0000,,fixed training set size M. On the vertical axis I'll plot error and Dialogue: 0,1:07:35.22,1:07:38.80,Default,,0000,0000,0000,,on the horizontal axis I'll plot model Dialogue: 0,1:07:38.80,1:07:43.73,Default,,0000,0000,0000,,complexity. Dialogue: 0,1:07:43.73,1:07:46.52,Default,,0000,0000,0000,,And by model complexity I mean Dialogue: 0,1:07:46.52,1:07:53.50,Default,,0000,0000,0000,,sort of the degree of the polynomial, Dialogue: 0,1:07:53.50,1:07:57.67,Default,,0000,0000,0000,, Dialogue: 0,1:07:57.67,1:07:59.74,Default,,0000,0000,0000,,the size of your Dialogue: 0,1:07:59.74,1:08:04.67,Default,,0000,0000,0000,,hypothesis class script H etc. It actually turns out, Dialogue: 0,1:08:04.67,1:08:07.51,Default,,0000,0000,0000,,you remember the bandwidth parameter Dialogue: 0,1:08:07.51,1:08:10.58,Default,,0000,0000,0000,,from Dialogue: 0,1:08:10.58,1:08:13.89,Default,,0000,0000,0000,,locally weighted linear regression, that also Dialogue: 0,1:08:13.89,1:08:17.46,Default,,0000,0000,0000,,has a similar effect in controlling how complex your model is. Dialogue: 0,1:08:17.46,1:08:20.65,Default,,0000,0000,0000,,Model complexity [inaudible] polynomial I guess. Dialogue: 0,1:08:20.65,1:08:23.24,Default,,0000,0000,0000,,So the more complex your model, Dialogue: 0,1:08:23.24,1:08:25.39,Default,,0000,0000,0000,,the better your Dialogue: 0,1:08:25.39,1:08:26.02,Default,,0000,0000,0000,,training error, Dialogue: 0,1:08:26.02,1:08:30.22,Default,,0000,0000,0000,,and so Dialogue: 0,1:08:30.22,1:08:32.89,Default,,0000,0000,0000,,your training error will tend to [inaudible] zero as you increase the complexity of your model because the Dialogue: 0,1:08:32.89,1:08:36.42,Default,,0000,0000,0000,,more complex your model the better you can fit your training set. Dialogue: 0,1:08:36.42,1:08:37.52,Default,,0000,0000,0000,,But Dialogue: 0,1:08:37.52,1:08:44.52,Default,,0000,0000,0000,,because of this bias variance tradeoff, Dialogue: 0,1:08:44.69,1:08:48.64,Default,,0000,0000,0000,,you find Dialogue: 0,1:08:48.64,1:08:53.37,Default,,0000,0000,0000,,that Dialogue: 0,1:08:53.37,1:08:57.14,Default,,0000,0000,0000,,generalization error will come down for a while and then it will go back up. Dialogue: 0,1:08:57.14,1:09:03.67,Default,,0000,0000,0000,,And this regime on the left is when you're underfitting the data or when Dialogue: 0,1:09:03.67,1:09:06.26,Default,,0000,0000,0000,,you have high bias.
Dialogue: 0,1:09:06.26,1:09:07.69,Default,,0000,0000,0000,,And this regime Dialogue: 0,1:09:07.69,1:09:10.38,Default,,0000,0000,0000,,on the right Dialogue: 0,1:09:10.38,1:09:16.37,Default,,0000,0000,0000,,is when you have high variance or Dialogue: 0,1:09:16.37,1:09:20.81,Default,,0000,0000,0000,,you're overfitting the data. Okay? Dialogue: 0,1:09:20.81,1:09:24.46,Default,,0000,0000,0000,,And this is why a model of sort of intermediate Dialogue: 0,1:09:24.46,1:09:27.40,Default,,0000,0000,0000,,complexity, somewhere here, Dialogue: 0,1:09:27.40,1:09:28.100,Default,,0000,0000,0000,,is often preferable Dialogue: 0,1:09:28.100,1:09:30.24,Default,,0000,0000,0000,,if what Dialogue: 0,1:09:30.24,1:09:32.86,Default,,0000,0000,0000,,you want is to minimize generalization error. Dialogue: 0,1:09:32.86,1:09:35.73,Default,,0000,0000,0000,,Okay? Dialogue: 0,1:09:35.73,1:09:39.51,Default,,0000,0000,0000,,So that's just a cartoon. In the next lecture we'll actually talk about a number of Dialogue: 0,1:09:39.51,1:09:40.23,Default,,0000,0000,0000,,algorithms Dialogue: 0,1:09:40.23,1:09:44.46,Default,,0000,0000,0000,,for trying to automatically select model complexities, say to get Dialogue: 0,1:09:44.46,1:09:48.18,Default,,0000,0000,0000,,you as close as possible to this minimum to Dialogue: 0,1:09:48.18,1:09:55.18,Default,,0000,0000,0000,,this area of minimized generalization error. Dialogue: 0,1:10:10.55,1:10:14.54,Default,,0000,0000,0000,,The last thing I want to do is actually Dialogue: 0,1:10:14.54,1:10:18.96,Default,,0000,0000,0000,,going back to the theorem I wrote out, I just want to take that theorem well, Dialogue: 0,1:10:18.96,1:10:20.15,Default,,0000,0000,0000,,so the theorem I wrote out Dialogue: 0,1:10:20.15,1:10:25.13,Default,,0000,0000,0000,,was an error bound theorem this says for fixed M and delta, with probability Dialogue: 0,1:10:25.13,1:10:28.64,Default,,0000,0000,0000,,one minus delta, I get a bound on Dialogue: 0,1:10:28.64,1:10:30.78,Default,,0000,0000,0000,,gamma, which is what this term is. Dialogue: 0,1:10:30.78,1:10:34.77,Default,,0000,0000,0000,,So the very last thing I wanna do today is just come back to this theorem and write out a Dialogue: 0,1:10:34.77,1:10:35.72,Default,,0000,0000,0000,,corollary Dialogue: 0,1:10:35.72,1:10:39.59,Default,,0000,0000,0000,,where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta Dialogue: 0,1:10:39.59,1:10:41.39,Default,,0000,0000,0000,,and solve for M. Dialogue: 0,1:10:41.39,1:10:46.08,Default,,0000,0000,0000,,And if you do that, Dialogue: 0,1:10:46.08,1:10:47.19,Default,,0000,0000,0000,,you Dialogue: 0,1:10:47.19,1:10:49.29,Default,,0000,0000,0000,,get the following corollary: Let H Dialogue: 0,1:10:49.29,1:10:52.44,Default,,0000,0000,0000,,be Dialogue: 0,1:10:52.44,1:10:54.76,Default,,0000,0000,0000,,fixed with K hypotheses Dialogue: 0,1:10:54.76,1:10:58.99,Default,,0000,0000,0000,,and let Dialogue: 0,1:10:58.99,1:11:01.34,Default,,0000,0000,0000,,any delta Dialogue: 0,1:11:01.34,1:11:04.37,Default,,0000,0000,0000,,and gamma be fixed.
Dialogue: 0,1:11:09.02,1:11:11.45,Default,,0000,0000,0000,,Then Dialogue: 0,1:11:11.45,1:11:18.45,Default,,0000,0000,0000,,in order to guarantee that, Dialogue: 0,1:11:22.80,1:11:25.06,Default,,0000,0000,0000,,let's say I want a guarantee Dialogue: 0,1:11:25.06,1:11:26.99,Default,,0000,0000,0000,,that the generalization error Dialogue: 0,1:11:26.99,1:11:30.73,Default,,0000,0000,0000,,of the hypothesis I choose with empirical risk minimization, Dialogue: 0,1:11:30.73,1:11:33.87,Default,,0000,0000,0000,,that this is at most two times gamma worse Dialogue: 0,1:11:33.87,1:11:38.32,Default,,0000,0000,0000,,than the best possible error I could obtain with this hypothesis class. Dialogue: 0,1:11:38.32,1:11:40.47,Default,,0000,0000,0000,,Let's say I want this to Dialogue: 0,1:11:40.47,1:11:45.05,Default,,0000,0000,0000,,hold true with probability at least one minus delta, Dialogue: 0,1:11:45.05,1:11:50.16,Default,,0000,0000,0000,,then it suffices Dialogue: 0,1:11:50.16,1:11:53.69,Default,,0000,0000,0000,,that M Dialogue: 0,1:11:53.69,1:12:00.52,Default,,0000,0000,0000,,is greater than or equal to that. Okay? And this is Dialogue: 0,1:12:00.52,1:12:01.01,Default,,0000,0000,0000,,sort of Dialogue: 0,1:12:01.01,1:12:05.57,Default,,0000,0000,0000,,solving the error Dialogue: 0,1:12:05.57,1:12:07.70,Default,,0000,0000,0000,,bound for M. One thing you're going to convince yourselves Dialogue: 0,1:12:07.70,1:12:10.11,Default,,0000,0000,0000,,of the easy part of this is Dialogue: 0,1:12:10.11,1:12:14.96,Default,,0000,0000,0000,,if you set that term equal to gamma and solve for M you will get this. Dialogue: 0,1:12:14.96,1:12:18.67,Default,,0000,0000,0000,,One thing I want you to go home and sort of convince yourselves of is that this Dialogue: 0,1:12:18.67,1:12:20.56,Default,,0000,0000,0000,,result really holds true. Dialogue: 0,1:12:20.56,1:12:23.87,Default,,0000,0000,0000,,That this really logically follows from the theorem we've proved. Dialogue: 0,1:12:23.87,1:12:25.67,Default,,0000,0000,0000,,In other words, Dialogue: 0,1:12:25.67,1:12:29.85,Default,,0000,0000,0000,,you can take that formula we wrote and solve for M and because this is the formula you get for M, Dialogue: 0,1:12:29.85,1:12:31.67,Default,,0000,0000,0000,,that's just the easy part. So Dialogue: 0,1:12:31.67,1:12:33.77,Default,,0000,0000,0000,,go back and convince yourselves that Dialogue: 0,1:12:33.77,1:12:37.66,Default,,0000,0000,0000,,this theorem is a true fact and that it does indeed logically follow from the Dialogue: 0,1:12:37.66,1:12:38.66,Default,,0000,0000,0000,,other one. In Dialogue: 0,1:12:38.66,1:12:39.87,Default,,0000,0000,0000,,particular, Dialogue: 0,1:12:39.87,1:12:43.74,Default,,0000,0000,0000,,make sure that if you solve for that you really get M greater than or equal to this, and why is this M Dialogue: 0,1:12:43.74,1:12:46.68,Default,,0000,0000,0000,,greater than or equal to that and not M less than or equal to, and just make sure Dialogue: 0,1:12:46.68,1:12:50.23,Default,,0000,0000,0000,,I can write this down and it sounds plausible, but why don't you just go back and convince yourself this is really true. Okay? Dialogue: 0,1:12:53.28,1:12:54.84,Default,,0000,0000,0000,,And Dialogue: 0,1:12:54.84,1:12:58.11,Default,,0000,0000,0000,,it turns out that when we prove these bounds in learning theory it turns out Dialogue: 0,1:12:58.11,1:13:02.82,Default,,0000,0000,0000,,that very often the constants are sort of loose.
So it turns out that when we prove Dialogue: 0,1:13:02.82,1:13:05.62,Default,,0000,0000,0000,,these bounds usually Dialogue: 0,1:13:05.62,1:13:10.35,Default,,0000,0000,0000,,we're not very interested in the constants, and so I Dialogue: 0,1:13:10.35,1:13:12.41,Default,,0000,0000,0000,,write this as big O of Dialogue: 0,1:13:12.41,1:13:16.14,Default,,0000,0000,0000,,one over gamma squared, log Dialogue: 0,1:13:16.14,1:13:17.78,Default,,0000,0000,0000,,K over delta, Dialogue: 0,1:13:17.78,1:13:20.57,Default,,0000,0000,0000,,and again, the key Dialogue: 0,1:13:20.57,1:13:23.87,Default,,0000,0000,0000,,step in this is that the dependence of M Dialogue: 0,1:13:23.87,1:13:26.71,Default,,0000,0000,0000,,on the size of the hypothesis class is logarithmic. Dialogue: 0,1:13:26.71,1:13:30.10,Default,,0000,0000,0000,,And this will be very important later when we Dialogue: 0,1:13:30.10,1:13:35.43,Default,,0000,0000,0000,,talk about infinite hypothesis classes. Okay? Any Dialogue: 0,1:13:35.43,1:13:42.43,Default,,0000,0000,0000,,questions about this? No? Okay, cool. Dialogue: 0,1:13:49.66,1:13:51.38,Default,,0000,0000,0000,,So Dialogue: 0,1:13:51.38,1:13:54.49,Default,,0000,0000,0000,,next lecture we'll come back, we'll actually start from this result again. Dialogue: 0,1:13:54.49,1:13:57.70,Default,,0000,0000,0000,,Remember this. I'll write this down as the first thing I do in the next lecture Dialogue: 0,1:13:57.70,1:14:02.52,Default,,0000,0000,0000,,and we'll generalize these to infinite hypothesis classes and then talk about Dialogue: 0,1:14:02.52,1:14:04.66,Default,,0000,0000,0000,,practical algorithms for model selection. So I'll see you guys in a couple days.