1 00:00:11,799 --> 00:00:15,059 This presentation is delivered by the Stanford Center for Professional 2 00:00:15,059 --> 00:00:22,059 Development. Okay. 3 00:00:29,109 --> 00:00:31,810 So welcome back. 4 00:00:31,810 --> 00:00:35,860 What I want to do today is start a new chapter 5 00:00:35,860 --> 00:00:41,020 in between now and then. In particular, I want to talk about learning theory. 6 00:00:41,020 --> 00:00:45,450 So in the previous, I guess eight lectures so far, you've learned about 7 00:00:45,450 --> 00:00:49,920 a lot of learning algorithms, and yes, you now I hope understand a little about 8 00:00:49,920 --> 00:00:53,870 some of the best and most powerful tools of machine learning in the [inaudible]. 9 00:00:53,870 --> 00:00:58,219 And all of you are now sort of well qualified to go into industry 10 00:00:58,219 --> 00:01:01,539 and though powerful learning algorithms apply, really the most powerful 11 00:01:01,539 --> 00:01:05,170 learning algorithms we know to all sorts of problems, and in fact, I hope 12 00:01:05,170 --> 00:01:10,730 you start to do that on your projects right away as well. 13 00:01:10,730 --> 00:01:14,460 You might remember, I think it was in the very first lecture, 14 00:01:14,460 --> 00:01:18,960 that I made an analogy to if you're trying to learn to be a carpenter, so 15 00:01:18,960 --> 00:01:20,520 if 16 00:01:20,520 --> 00:01:24,440 you imagine you're going to carpentry school to learn to be a carpenter, 17 00:01:24,440 --> 00:01:28,540 then only a small part of what you need to do is to acquire a set of tools. If you learn to 18 00:01:28,540 --> 00:01:29,990 be a carpenter 19 00:01:29,990 --> 00:01:33,390 you don't walk in and pick up a tool box and [inaudible], so 20 00:01:33,390 --> 00:01:35,560 when you need to 21 00:01:35,560 --> 00:01:39,650 cut a piece of wood do you use a rip saw, or a jig saw, or a keyhole saw 22 00:01:39,650 --> 00:01:42,780 whatever, is this really mastering the tools there's also 23 00:01:42,780 --> 00:01:45,820 an essential part of becoming a good carpenter. 24 00:01:45,820 --> 00:01:46,970 And 25 00:01:46,970 --> 00:01:49,600 what I want to do in the next few lectures is 26 00:01:49,600 --> 00:01:52,640 actually give you a sense of the mastery of the machine learning tools all of 27 00:01:52,640 --> 00:01:55,170 you have. Okay? 28 00:01:55,170 --> 00:01:57,090 And so in particular, 29 00:01:57,090 --> 00:02:01,460 in the next few lectures what I want to is to talk more deeply about 30 00:02:01,460 --> 00:02:05,060 the properties of different machine learning algorithms so that you can get a sense of 31 00:02:05,060 --> 00:02:06,730 when it's most appropriate to use 32 00:02:06,730 --> 00:02:08,119 each one. 33 00:02:08,119 --> 00:02:11,660 And it turns out that one of the most common scenarios in machine learning is 34 00:02:12,869 --> 00:02:15,239 someday you'll be doing research or 35 00:02:15,239 --> 00:02:16,919 [inaudible] a company. 36 00:02:16,919 --> 00:02:20,419 And you'll apply one of the learning algorithms you learned about, you may apply logistic regression, 37 00:02:20,419 --> 00:02:23,599 or support vector machines, or Naive Bayes or something, 38 00:02:23,599 --> 00:02:27,849 and for whatever bizarre reason, it won't work as well as you were hoping, or it 39 00:02:27,849 --> 00:02:32,259 won't quite do what you were hoping it to. 40 00:02:32,980 --> 00:02:34,530 To 41 00:02:34,530 --> 00:02:37,409 me what really separates the people from 42 00:02:37,409 --> 00:02:40,689 what really separates the people that really understand and really get machine 43 00:02:40,689 --> 00:02:41,959 learning, 44 00:02:41,959 --> 00:02:46,089 compared to people that maybe read the textbook and so they'll work through the math, 45 00:02:46,089 --> 00:02:50,059 will be what you do next. Will be in your decisions of when 46 00:02:50,059 --> 00:02:53,689 you apply a support vector machine and it doesn't quite do what you wanted, 47 00:02:53,689 --> 00:02:56,540 do you really understand enough about support vector machines to know what to 48 00:02:56,540 --> 00:02:58,559 do next and how to modify the algorithm? 49 00:02:58,559 --> 00:03:02,019 And to me that's often what really separates the great people in machine 50 00:03:02,019 --> 00:03:06,259 learning versus the people that like read the text book and so they'll [inaudible] the math, and so they'll have 51 00:03:06,259 --> 00:03:08,989 just understood that. Okay? 52 00:03:08,989 --> 00:03:09,659 So 53 00:03:09,659 --> 00:03:13,819 what I want to do today, today's lecture will mainly be on learning theory and 54 00:03:13,819 --> 00:03:15,269 we'll start to 55 00:03:15,269 --> 00:03:18,549 talk about some of the theoretical results of machine learning. 56 00:03:19,349 --> 00:03:22,239 The next lecture, later this week, will be on algorithms for 57 00:03:22,239 --> 00:03:26,370 sort of [inaudible], or fixing some of the problems 58 00:03:26,370 --> 00:03:30,039 that the learning theory will point out to us and help us understand. And then 59 00:03:30,039 --> 00:03:34,459 two lectures from now, that lecture will be almost entirely focused on 60 00:03:34,459 --> 00:03:36,400 the practical advice for 61 00:03:36,400 --> 00:03:40,229 how to apply learning algorithms. Okay? 62 00:03:40,229 --> 00:03:47,229 So you have any questions about this before I start? Okay. 63 00:03:50,629 --> 00:03:53,949 So the very first thing we're gonna talk about is something that 64 00:03:53,949 --> 00:03:58,689 you've probably already seen on the first homework, and something that 65 00:03:58,689 --> 00:04:01,289 alluded to previously, 66 00:04:01,289 --> 00:04:07,119 which is the bias variance trade-off. So take 67 00:04:07,119 --> 00:04:10,999 ordinary least squares, the first learning algorithm we learned 68 00:04:10,999 --> 00:04:15,849 about, if you [inaudible] a straight line through these datas, this is not a very good model. 69 00:04:15,849 --> 00:04:19,789 Right. 70 00:04:19,789 --> 00:04:21,009 And if 71 00:04:21,009 --> 00:04:23,699 this happens, we say it has 72 00:04:23,699 --> 00:04:24,490 underfit 73 00:04:24,490 --> 00:04:27,460 the data, or we say that this is a learning algorithm 74 00:04:27,460 --> 00:04:30,629 with a very high bias, because it is 75 00:04:30,629 --> 00:04:35,669 failing to fit the evident quadratic structure in the data. 76 00:04:35,669 --> 00:04:36,250 And 77 00:04:36,250 --> 00:04:37,590 for the prefaces 78 00:04:37,590 --> 00:04:41,340 of [inaudible] you can formally think of the 79 00:04:41,340 --> 00:04:45,439 bias of the learning algorithm as representing the fact that even if you 80 00:04:45,439 --> 00:04:47,960 had an infinite amount of training data, even if 81 00:04:47,960 --> 00:04:50,829 you had tons of training data, 82 00:04:50,829 --> 00:04:52,580 this algorithm would still fail 83 00:04:52,580 --> 00:04:54,199 to fit the quadratic 84 00:04:54,199 --> 00:04:57,819 function, the quadratic structure in 85 00:04:57,819 --> 00:05:01,669 the data. And so we think of this as a learning algorithm with high bias. 86 00:05:01,669 --> 00:05:04,340 Then there's the opposite problem, so that's the 87 00:05:04,340 --> 00:05:06,469 same dataset. 88 00:05:06,469 --> 00:05:13,099 If 89 00:05:13,099 --> 00:05:19,500 you fit a fourth of the polynomials into this dataset, 90 00:05:19,500 --> 00:05:20,430 then you have 91 00:05:20,430 --> 00:05:25,520 you'll be able to interpolate the five data points exactly, but clearly, this is also 92 00:05:25,520 --> 00:05:27,200 not a great model 93 00:05:27,200 --> 00:05:30,650 to the structure that you and I probably see in the data. 94 00:05:30,650 --> 00:05:34,409 And we say that this 95 00:05:34,409 --> 00:05:37,310 algorithm has a problem, excuse me, 96 00:05:37,310 --> 00:05:42,830 is overfitting the data, 97 00:05:42,830 --> 00:05:48,539 or alternatively that this algorithm has high variance. Okay? And the intuition behind 98 00:05:48,539 --> 00:05:50,240 overfitting a high variance is that 99 00:05:50,240 --> 00:05:54,419 the algorithm is fitting serious patterns in the data, or is fitting 100 00:05:54,419 --> 00:05:58,360 idiosyncratic properties of this specific dataset, be it the dataset of 101 00:05:58,360 --> 00:06:00,889 housing prices or whatever. 102 00:06:00,889 --> 00:06:07,889 And quite often, they'll be some happy medium 103 00:06:08,189 --> 00:06:10,060 of fitting a quadratic function 104 00:06:10,060 --> 00:06:14,819 that maybe won't interpolate your data points perfectly, but also captures multistructure 105 00:06:14,819 --> 00:06:16,740 in your data 106 00:06:16,740 --> 00:06:20,759 than a simple model which under fits. 107 00:06:20,759 --> 00:06:23,829 I say that you can sort of have the exactly the same picture 108 00:06:23,829 --> 00:06:25,779 of classification problems as well, 109 00:06:25,779 --> 00:06:28,449 so lets say 110 00:06:28,449 --> 00:06:35,449 this is my training set, right, 111 00:06:35,840 --> 00:06:37,300 of 112 00:06:37,300 --> 00:06:39,319 positive and negative examples, 113 00:06:44,259 --> 00:06:46,779 and so you can fit 114 00:06:46,779 --> 00:06:51,679 logistic regression with a very high order polynomial [inaudible], or [inaudible] of X 115 00:06:51,679 --> 00:06:57,389 equals the sigmoid function of 116 00:07:01,240 --> 00:07:04,439 whatever. Sigmoid function applied to a tenth of the polynomial. 117 00:07:04,439 --> 00:07:07,520 And you do that, maybe you get a decision boundary 118 00:07:07,520 --> 00:07:13,460 like this. Right. 119 00:07:13,460 --> 00:07:16,830 That does indeed perfectly separate the positive and negative classes, this is 120 00:07:16,830 --> 00:07:18,610 another example of how 121 00:07:18,610 --> 00:07:21,169 overfitting, and 122 00:07:21,169 --> 00:07:24,029 in contrast you fit logistic regression into this model with just the linear features, 123 00:07:24,029 --> 00:07:26,619 with none 124 00:07:26,619 --> 00:07:29,560 of the quadratic features, then maybe you get a decision boundary like that, which 125 00:07:29,560 --> 00:07:33,179 can also underfit. 126 00:07:33,179 --> 00:07:34,589 Okay. 127 00:07:34,589 --> 00:07:38,059 So what I want to do now is 128 00:07:38,059 --> 00:07:41,899 understand this problem of overfitting versus underfitting, of high bias versus high 129 00:07:41,899 --> 00:07:43,409 variance, more explicitly, 130 00:07:44,819 --> 00:07:48,770 I will do that by posing a more formal model of machine learning and so 131 00:07:48,770 --> 00:07:50,789 trying to prove when 132 00:07:50,789 --> 00:07:56,680 these two twin problems when each of these two problems come up. 133 00:07:56,680 --> 00:07:59,220 And as I'm modeling the 134 00:07:59,220 --> 00:08:00,679 135 00:08:00,679 --> 00:08:04,389 example for our initial foray into learning theory, 136 00:08:04,389 --> 00:08:10,449 I want to talk about learning classification, 137 00:08:10,449 --> 00:08:13,060 in which 138 00:08:13,770 --> 00:08:16,360 H of X is equal 139 00:08:16,360 --> 00:08:18,819 to G of data transpose X. Okay? 140 00:08:18,819 --> 00:08:20,930 So the learning classifier. 141 00:08:20,930 --> 00:08:26,840 And for this class I'm going to use, Z 142 00:08:26,840 --> 00:08:32,740 excuse me. I'm gonna use G as indicator Z grading with 143 00:08:32,740 --> 00:08:39,740 zero. 144 00:08:40,260 --> 00:08:43,570 With apologies in advance for changing the notation yet again, 145 00:08:43,570 --> 00:08:45,350 for the support vector machine 146 00:08:45,350 --> 00:08:49,450 lectures we use Y equals minus one or plus one. For 147 00:08:49,450 --> 00:08:52,670 learning theory lectures, turns out it'll be a bit cleaner if I switch back to 148 00:08:52,670 --> 00:08:57,810 Y equals zero-one again, so I'm gonna switch back to my original notation. 149 00:08:57,810 --> 00:09:02,220 And so you think of this model as a model forum as 150 00:09:02,220 --> 00:09:04,540 logistic regressions, say, and think of this as being 151 00:09:04,540 --> 00:09:06,600 similar to logistic regression, 152 00:09:06,600 --> 00:09:08,420 except that now we're going to force 153 00:09:08,420 --> 00:09:12,420 the logistic regression algorithm, to opt for labels that are 154 00:09:12,420 --> 00:09:14,010 either zero or one. Okay? 155 00:09:14,010 --> 00:09:15,610 So you can think of this as a 156 00:09:15,610 --> 00:09:20,710 classifier to opt for labels zero or one involved in the probabilities. 157 00:09:20,710 --> 00:09:25,410 And so as 158 00:09:25,410 --> 00:09:32,410 usual, let's say we're given a training set of M examples. 159 00:09:34,350 --> 00:09:38,080 That's just my notation for writing a set of M examples ranging from I equals 160 00:09:38,080 --> 00:09:40,170 one 161 00:09:40,500 --> 00:09:45,450 through M. And I'm going to assume that the training example is XIYI. 162 00:09:45,450 --> 00:09:48,320 I've drawn IID, 163 00:09:48,320 --> 00:09:50,220 from sum distribution, 164 00:09:50,220 --> 00:09:50,959 scripts 165 00:09:50,959 --> 00:09:54,470 D. Okay? [Inaudible]. Identically and definitively distributed 166 00:09:54,470 --> 00:09:59,720 and if you have you have running a classification problem on houses, like 167 00:09:59,720 --> 00:10:03,450 features of the house comma, whether the house will be sold in the next six months, then this 168 00:10:03,450 --> 00:10:04,700 is just 169 00:10:04,700 --> 00:10:07,650 the priority distribution over 170 00:10:07,650 --> 00:10:12,190 features of houses and whether or not they'll be sold. Okay? So I'm gonna assume that 171 00:10:12,190 --> 00:10:16,810 training examples we've drawn IID from some probability distributions, 172 00:10:18,320 --> 00:10:19,940 scripts D. Well, same thing for spam, if you're 173 00:10:19,940 --> 00:10:23,420 trying to build a spam classifier then this would be the distribution of 174 00:10:23,420 --> 00:10:29,610 what emails look like comma, whether they are spam or not. 175 00:10:29,610 --> 00:10:32,130 And in particular, to understand 176 00:10:32,130 --> 00:10:38,000 or simplify to understand the phenomena of bias invariance, I'm actually going to use a 177 00:10:38,000 --> 00:10:41,880 simplified model of machine learning. 178 00:10:41,880 --> 00:10:44,670 And in particular, 179 00:10:44,670 --> 00:10:45,810 logistic regression fits 180 00:10:45,810 --> 00:10:50,040 this parameters the model like this for maximizing the law of likelihood. 181 00:10:50,040 --> 00:10:55,010 But in order to understand learning algorithms more deeply, I'm just going to assume a simplified 182 00:10:55,010 --> 00:10:58,890 model of machine learning, let me just write that down. 183 00:11:01,960 --> 00:11:05,260 So I'm going to define training error 184 00:11:05,260 --> 00:11:08,670 as 185 00:11:08,670 --> 00:11:14,490 so this is a training error of a hypothesis X subscript data. 186 00:11:14,490 --> 00:11:17,610 Write this epsilon hat of subscript data. 187 00:11:17,610 --> 00:11:19,980 If I want to make the 188 00:11:19,980 --> 00:11:23,450 dependence on a training set explicit, I'll write this with 189 00:11:23,450 --> 00:11:27,210 a subscript S there where S is a training set. 190 00:11:27,210 --> 00:11:34,210 And I'll define this as, 191 00:11:41,660 --> 00:11:44,270 let's see. Okay. I 192 00:11:44,270 --> 00:11:47,860 hope the notation is clear. This is a sum of indicator functions for whether your hypothesis 193 00:11:47,860 --> 00:11:52,170 correctly classifies the Y the IFE example. 194 00:11:52,170 --> 00:11:55,340 And so when you divide by M, this is just 195 00:11:55,340 --> 00:11:57,740 in your training set what's the fraction 196 00:11:57,740 --> 00:11:59,560 of training examples your 197 00:11:59,560 --> 00:12:01,200 hypothesis classifies 198 00:12:01,200 --> 00:12:05,340 so defined as a training error. And 199 00:12:05,340 --> 00:12:07,980 training error is also called risk. 200 00:12:10,190 --> 00:12:14,180 The simplified model of machine learning I'm gonna talk about is 201 00:12:14,180 --> 00:12:17,170 called empirical risk minimization. 202 00:12:17,170 --> 00:12:21,210 And in particular, I'm going to assume that the way my learning algorithm works 203 00:12:21,210 --> 00:12:25,950 is it will choose parameters 204 00:12:25,950 --> 00:12:32,950 data, that 205 00:12:33,940 --> 00:12:39,090 minimize my training error. Okay? 206 00:12:39,090 --> 00:12:43,240 And it will be this learning algorithm that we'll prove properties about. 207 00:12:43,240 --> 00:12:44,640 And it turns out that 208 00:12:44,640 --> 00:12:45,990 you 209 00:12:45,990 --> 00:12:49,910 can think of this as the most basic learning algorithm, the algorithm that minimizes 210 00:12:49,910 --> 00:12:51,970 your training error. It 211 00:12:51,970 --> 00:12:55,350 turns out that logistic regression and support vector machines can be 212 00:12:55,350 --> 00:12:58,930 formally viewed as approximation cities, so it 213 00:12:58,930 --> 00:13:02,510 turns out that if you actually want to do this, this is a nonconvex optimization 214 00:13:02,510 --> 00:13:03,399 problem. 215 00:13:03,399 --> 00:13:08,860 This is actually it actually [inaudible] hard to solve this optimization problem. 216 00:13:08,860 --> 00:13:13,790 And logistic regression and support vector machines can both be viewed as 217 00:13:13,790 --> 00:13:14,429 approximations 218 00:13:14,429 --> 00:13:17,270 to this nonconvex optimization problem 219 00:13:17,270 --> 00:13:20,240 by finding the convex approximation to it. 220 00:13:20,240 --> 00:13:22,589 Think of this as 221 00:13:22,589 --> 00:13:23,640 similar to what 222 00:13:23,640 --> 00:13:27,710 algorithms like logistic regression 223 00:13:27,710 --> 00:13:30,280 are doing. So 224 00:13:30,280 --> 00:13:32,550 let me take that 225 00:13:32,550 --> 00:13:34,230 definition of empirical risk 226 00:13:34,230 --> 00:13:35,710 minimization 227 00:13:35,710 --> 00:13:42,600 and actually just rewrite it in a different equivalent way. 228 00:13:42,600 --> 00:13:45,440 For the results I want to prove today, it turns out 229 00:13:45,440 --> 00:13:47,630 that it will be useful to think of 230 00:13:47,630 --> 00:13:49,340 our learning algorithm 231 00:13:49,340 --> 00:13:52,150 as not choosing a set of parameters, 232 00:13:52,150 --> 00:13:54,810 but as choosing a function. 233 00:13:54,810 --> 00:13:57,870 So let me say what I mean by that. Let me define 234 00:13:57,870 --> 00:14:00,319 the hypothesis 235 00:14:00,319 --> 00:14:03,490 class, 236 00:14:03,490 --> 00:14:04,800 script h, 237 00:14:04,800 --> 00:14:09,410 as the class of all hypotheses of in other words as the class of all linear 238 00:14:09,410 --> 00:14:10,690 classifiers, that your 239 00:14:10,690 --> 00:14:13,730 learning algorithm is choosing from. 240 00:14:13,730 --> 00:14:17,260 Okay? So 241 00:14:17,260 --> 00:14:19,029 H subscript data 242 00:14:19,029 --> 00:14:23,190 is a specific linear classifier, so H subscript data, 243 00:14:23,190 --> 00:14:29,880 in each of these functions each 244 00:14:29,880 --> 00:14:33,550 of these is a function mapping from the input domain X is the class zero-one. Each 245 00:14:33,550 --> 00:14:34,760 of 246 00:14:34,760 --> 00:14:38,430 these is a function, and as you vary the parameter's data, 247 00:14:38,430 --> 00:14:41,110 you get different functions. 248 00:14:41,110 --> 00:14:43,810 And so let me define the hypothesis class 249 00:14:43,810 --> 00:14:47,750 script H to be the class of all functions that say logistic regression can choose 250 00:14:47,750 --> 00:14:49,140 from. Okay. So 251 00:14:49,140 --> 00:14:53,000 this is the class of all linear classifiers 252 00:14:53,000 --> 00:14:54,480 and so 253 00:14:57,420 --> 00:15:00,900 I'm going to define, 254 00:15:00,900 --> 00:15:05,050 or maybe redefine empirical risk minimization as 255 00:15:05,050 --> 00:15:08,820 instead of writing this choosing a set of parameters, I want to think of it as 256 00:15:08,820 --> 00:15:10,390 choosing a 257 00:15:10,390 --> 00:15:12,500 function 258 00:15:12,500 --> 00:15:15,300 into hypothesis class of script H 259 00:15:15,300 --> 00:15:21,570 that minimizes 260 00:15:21,570 --> 00:15:27,360 that minimizes my training error. Okay? 261 00:15:27,360 --> 00:15:31,040 So actually can you raise your 262 00:15:31,040 --> 00:15:34,540 hand if it makes sense to you why this is equivalent to the previous 263 00:15:34,540 --> 00:15:40,900 formulation? Okay, cool. 264 00:15:40,900 --> 00:15:44,350 Thanks. So for development of the use of think of algorithms as choosing from 265 00:15:44,350 --> 00:15:46,750 function from the class instead, 266 00:15:46,750 --> 00:15:47,980 because 267 00:15:47,980 --> 00:15:51,300 in a more general case this set, script H, 268 00:15:51,300 --> 00:15:53,999 can be some other class of functions. Maybe 269 00:15:53,999 --> 00:15:58,220 is a class of all functions represented by viewer network, or the class of all 270 00:15:58,220 --> 00:16:03,670 some other class of functions the learning algorithm wants to choose from. 271 00:16:03,670 --> 00:16:05,240 And 272 00:16:05,240 --> 00:16:11,150 this definition for empirical risk minimization will still apply. Okay? 273 00:16:11,150 --> 00:16:13,330 So 274 00:16:13,330 --> 00:16:16,480 what we'd like to do is understand 275 00:16:16,480 --> 00:16:18,780 whether empirical risk minimization 276 00:16:18,780 --> 00:16:24,500 is a reasonable algorithm. Alex? Student:[Inaudible] a function that's 277 00:16:24,500 --> 00:16:26,239 defined 278 00:16:26,239 --> 00:16:29,610 by G 279 00:16:29,610 --> 00:16:35,310 of data TX, or is it now more general? Instructor (Andrew Ng): I see, right, so lets see 280 00:16:35,310 --> 00:16:36,470 I guess this 281 00:16:36,470 --> 00:16:40,610 the question is H data still defined by G of phase transpose 282 00:16:40,610 --> 00:16:46,700 X, is this more general? So Student:[Inaudible] Instructor (Andrew Ng): Oh, 283 00:16:46,700 --> 00:16:48,410 yeah so very two answers 284 00:16:48,410 --> 00:16:51,070 to that. One is, this framework is general, so 285 00:16:51,070 --> 00:16:54,770 for the purpose of this lecture it may be useful to you to keep in mind a model of the 286 00:16:54,770 --> 00:16:56,670 example of 287 00:16:56,670 --> 00:17:00,639 when H subscript data is the class of all linear classifiers such as those used by 288 00:17:00,639 --> 00:17:04,350 like a visectron algorithm or logistic regression. 289 00:17:04,350 --> 00:17:05,570 This 290 00:17:05,570 --> 00:17:06,790 everything on this board, 291 00:17:06,790 --> 00:17:11,140 however, is actually more general. H can be any set of functions, mapping 292 00:17:11,140 --> 00:17:14,930 from the INFA domain to the center of class label zero and one, 293 00:17:14,930 --> 00:17:17,630 and then you can perform empirical risk minimization 294 00:17:17,630 --> 00:17:21,240 over any hypothesis class. For the purpose 295 00:17:21,240 --> 00:17:22,770 of today's lecture, 296 00:17:22,770 --> 00:17:27,170 I am going to restrict myself to talking about binary classification, but it turns 297 00:17:27,170 --> 00:17:30,660 out everything I say generalizes to regression 298 00:17:30,660 --> 00:17:35,960 in other problem as 299 00:17:35,960 --> 00:17:39,730 well. Does that answer your question? Yes. Cool. All right. So I wanna understand if empirical risk minimization is a reasonable algorithm. 300 00:17:40,370 --> 00:17:45,340 In particular, what are the things we can prove about it? So 301 00:17:45,340 --> 00:17:48,900 clearly we don't actually care about training error, we don't really care about 302 00:17:48,900 --> 00:17:53,430 making accurate predictions on the training set, or at a least that's not the ultimate goal. The 303 00:17:53,430 --> 00:17:55,510 ultimate goal is 304 00:17:55,510 --> 00:17:57,560 how well it makes 305 00:17:57,560 --> 00:17:58,730 generalization 306 00:17:58,730 --> 00:18:03,940 how well it makes predictions on examples that we haven't seen before. How 307 00:18:03,940 --> 00:18:05,660 well it predicts prices or 308 00:18:05,660 --> 00:18:07,380 sale or no sale 309 00:18:07,380 --> 00:18:10,289 outcomes of houses you haven't seen before. 310 00:18:10,289 --> 00:18:12,870 So what we really care about 311 00:18:12,870 --> 00:18:16,160 is generalization error, which I write 312 00:18:16,160 --> 00:18:17,870 as epsilon of H. 313 00:18:17,870 --> 00:18:21,700 And this is defined as the probability 314 00:18:21,700 --> 00:18:23,830 that if I sample 315 00:18:23,830 --> 00:18:25,940 a new example, X 316 00:18:25,940 --> 00:18:28,370 comma Y, 317 00:18:28,370 --> 00:18:35,370 from that distribution scripts D, 318 00:18:40,250 --> 00:18:43,230 my hypothesis mislabels 319 00:18:43,230 --> 00:18:46,920 that example. And 320 00:18:46,920 --> 00:18:49,990 in terms of notational convention, usually 321 00:18:49,990 --> 00:18:53,400 if I use if I place a hat on top of something, it 322 00:18:53,400 --> 00:18:57,190 usually means not always but it usually means that it is an attempt to 323 00:18:57,190 --> 00:18:59,400 estimate something about the hat. So 324 00:18:59,400 --> 00:19:00,660 for example, 325 00:19:00,660 --> 00:19:02,580 epsilon hat here 326 00:19:02,580 --> 00:19:06,049 this is something that we're trying think of epsilon hat training error 327 00:19:06,049 --> 00:19:08,779 as an attempt to approximate generalization error. 328 00:19:08,779 --> 00:19:10,980 Okay, so the notation convention is 329 00:19:10,980 --> 00:19:15,060 usually the things with the hats on top are things we're using to estimate other 330 00:19:15,060 --> 00:19:15,950 quantities. 331 00:19:15,950 --> 00:19:19,560 And H hat is a hypothesis output by learning algorithm to try to 332 00:19:19,560 --> 00:19:24,670 estimate what the functions from H to Y X to Y. So 333 00:19:24,670 --> 00:19:29,240 let's actually prove some things about when empirical risk minimization 334 00:19:29,240 --> 00:19:30,149 will do well 335 00:19:30,149 --> 00:19:33,440 in a sense of giving us low generalization error, which is what 336 00:19:33,440 --> 00:19:40,440 we really care about. 337 00:19:46,750 --> 00:19:50,110 In order to prove our first learning theory result, I'm going to have to state 338 00:19:50,110 --> 00:19:52,670 two lemmas, the first is 339 00:19:52,670 --> 00:19:59,670 the union 340 00:20:00,250 --> 00:20:07,250 vowel, which is the following, 341 00:20:07,800 --> 00:20:10,640 let A1 through AK be 342 00:20:10,640 --> 00:20:12,250 K event. And when 343 00:20:12,250 --> 00:20:16,080 I say events, I mean events in a sense of a probabilistic event 344 00:20:16,080 --> 00:20:18,130 that either happens or not. 345 00:20:18,130 --> 00:20:25,130 And these are not necessarily independent. 346 00:20:28,490 --> 00:20:32,400 So there's some current distribution over the events A one through AK, 347 00:20:32,400 --> 00:20:34,830 and maybe they're independent maybe not, 348 00:20:34,830 --> 00:20:41,240 no assumption on that. Then 349 00:20:41,240 --> 00:20:44,740 the probability of A one or 350 00:20:44,740 --> 00:20:46,179 A two 351 00:20:46,179 --> 00:20:47,610 or dot, dot, 352 00:20:47,610 --> 00:20:48,960 dot, up to 353 00:20:48,960 --> 00:20:52,480 AK, this union symbol, this 354 00:20:52,480 --> 00:20:54,070 hat, this just means 355 00:20:54,070 --> 00:20:57,970 this sort of just set notation for probability just means or. So the probability 356 00:20:57,970 --> 00:20:58,640 of 357 00:20:58,640 --> 00:21:02,290 at least one of these events occurring, of A one or A two, or up 358 00:21:02,290 --> 00:21:03,429 to AK, 359 00:21:03,429 --> 00:21:07,950 this is S equal to the probability of A one plus probability of A two plus dot, 360 00:21:07,950 --> 00:21:11,120 dot, dot, plus probability of AK. 361 00:21:11,120 --> 00:21:17,540 Okay? So 362 00:21:17,540 --> 00:21:20,700 the intuition behind this is just that 363 00:21:20,700 --> 00:21:23,390 I'm not sure if you've seen Venn diagrams 364 00:21:23,390 --> 00:21:26,010 depictions of probability before, if you haven't, 365 00:21:26,010 --> 00:21:29,480 what I'm about to do may be a little cryptic, so just ignore that. Just 366 00:21:29,480 --> 00:21:31,799 ignore what I'm about to do if you haven't seen it before. 367 00:21:31,799 --> 00:21:37,340 But if you have seen it before then this is really 368 00:21:37,340 --> 00:21:40,770 this is really great the probability of A one, 369 00:21:40,770 --> 00:21:43,770 union A two, union A three, is less 370 00:21:43,770 --> 00:21:46,630 than the P of A one, plus 371 00:21:46,630 --> 00:21:51,380 P of A two, plus P of A 372 00:21:51,380 --> 00:21:51,809 three. 373 00:21:51,809 --> 00:21:54,370 Right. So that the total mass in 374 00:21:54,370 --> 00:21:57,510 the union of these three things [inaudible] to the sum of the masses 375 00:21:57,510 --> 00:22:01,250 in the three individual sets, it's not very surprising. 376 00:22:01,250 --> 00:22:04,520 It turns out that depending on how you define your axioms of probability, 377 00:22:04,520 --> 00:22:08,100 this is actually one of the axioms that probably varies, so 378 00:22:08,100 --> 00:22:12,180 I won't actually try to prove this. This is usually 379 00:22:12,180 --> 00:22:16,700 written as an axiom. So sigmas of avitivity are probably 380 00:22:16,700 --> 00:22:17,379 measured as this 381 00:22:17,379 --> 00:22:24,379 what is sometimes called as well. 382 00:22:26,370 --> 00:22:30,600 But in learning theory it's commonly called the union balance I just call it that. The 383 00:22:30,600 --> 00:22:31,750 other 384 00:22:31,750 --> 00:22:38,750 lemma I need is called the Hufting inequality. And 385 00:22:39,640 --> 00:22:43,240 again, I won't actually prove this, I'll just state it, 386 00:22:43,240 --> 00:22:44,039 which is 387 00:22:44,039 --> 00:22:46,290 let's let Z1 up to ZM, 388 00:22:46,290 --> 00:22:48,850 BM, IID, 389 00:22:48,850 --> 00:22:51,720 there 390 00:22:51,720 --> 00:22:55,180 391 00:22:55,180 --> 00:23:02,180 may be random variables with mean Phi. 392 00:23:06,559 --> 00:23:09,150 So the probability of ZI equals 1 is equal to 393 00:23:09,150 --> 00:23:16,150 Phi. 394 00:23:16,400 --> 00:23:18,289 So 395 00:23:18,289 --> 00:23:22,549 let's say you observe M IID for newly random variables and you want to estimate their 396 00:23:22,549 --> 00:23:23,640 mean. 397 00:23:23,640 --> 00:23:28,790 So let me define Phi hat, and this is again that notation, no convention, Phi hat means 398 00:23:29,520 --> 00:23:30,440 does not attempt 399 00:23:30,440 --> 00:23:33,880 is an estimate or something else. So when we define Phi 400 00:23:33,880 --> 00:23:39,290 hat to be 1 over M, semper my equals one through MZI. Okay? 401 00:23:39,290 --> 00:23:40,559 So this is 402 00:23:40,559 --> 00:23:44,500 our attempt to estimate the mean of these Benuve random variables by sort of taking 403 00:23:44,500 --> 00:23:46,549 its average. 404 00:23:48,690 --> 00:23:52,710 And let any gamma 405 00:23:52,710 --> 00:23:54,970 be fixed. 406 00:23:59,060 --> 00:24:00,690 Then, 407 00:24:00,690 --> 00:24:04,340 the Hufting inequality 408 00:24:04,340 --> 00:24:11,120 is that 409 00:24:13,810 --> 00:24:17,110 the probability your estimate of Phi 410 00:24:17,110 --> 00:24:20,630 is more than gamma away from the true value of Phi, 411 00:24:20,630 --> 00:24:26,380 that this is bounded by two E to the next of two gamma squared. Okay? So 412 00:24:26,380 --> 00:24:31,299 just in pictures 413 00:24:31,299 --> 00:24:35,210 so this theorem holds this lemma, the Hufting inequality, 414 00:24:35,210 --> 00:24:37,950 this is just a statement of fact, this just holds true. 415 00:24:37,950 --> 00:24:41,850 But let me now draw a cartoon to describe some of the intuition behind this, I 416 00:24:41,850 --> 00:24:44,059 guess. 417 00:24:44,059 --> 00:24:44,940 So 418 00:24:44,940 --> 00:24:49,070 lets say [inaudible] this is a real number line from zero to one. 419 00:24:49,070 --> 00:24:52,570 And so Phi is the mean of your Benuve random variables. 420 00:24:53,580 --> 00:24:56,000 You will remember from you know, 421 00:24:56,000 --> 00:24:59,370 whatever some undergraduate probability or statistics class, 422 00:24:59,370 --> 00:25:02,550 the central limit theorem that says that when you average all the things together, 423 00:25:02,550 --> 00:25:05,480 you tend to get a Gaussian distribution. 424 00:25:05,480 --> 00:25:10,380 And so when you toss M coins with bias Phi, we observe these M 425 00:25:10,380 --> 00:25:12,320 Benuve random variables, 426 00:25:12,320 --> 00:25:17,680 and we average them, then the probability distribution of 427 00:25:17,680 --> 00:25:24,680 Phi hat 428 00:25:26,260 --> 00:25:28,740 will roughly be a Gaussian lets say. Okay? It 429 00:25:28,740 --> 00:25:29,680 turns out if 430 00:25:29,680 --> 00:25:32,650 you haven't seen this up before, this is actually that the 431 00:25:32,650 --> 00:25:36,500 cumulative distribution function of Phi hat will converse with that of the Gaussian. 432 00:25:36,500 --> 00:25:40,060 Technically Phi hat can only take on a discreet set of values 433 00:25:40,060 --> 00:25:43,120 because these are factions one over Ms. It doesn't really have an 434 00:25:43,120 --> 00:25:45,159 entity but just as a cartoon 435 00:25:45,159 --> 00:25:49,470 think of it as a converse roughly to a Gaussian. 436 00:25:49,470 --> 00:25:56,050 So what the Hufting inequality says is that if you pick a value of gamma, let me put 437 00:25:56,050 --> 00:25:59,500 S one interval gamma there's another interval gamma. 438 00:25:59,500 --> 00:26:03,480 Then the saying that the probability mass of the details, 439 00:26:03,480 --> 00:26:05,820 in other words the probability that 440 00:26:05,820 --> 00:26:08,410 my value of Phi hat is more than 441 00:26:08,410 --> 00:26:10,960 a gamma away from the true value, 442 00:26:10,960 --> 00:26:15,500 that the total mass 443 00:26:15,500 --> 00:26:18,019 that the 444 00:26:18,019 --> 00:26:22,410 total probability mass in these tails is at most two 445 00:26:22,410 --> 00:26:25,660 E to the negative two gamma squared M. Okay? 446 00:26:25,660 --> 00:26:27,510 That's what the Hufting inequality so if you 447 00:26:27,510 --> 00:26:30,610 can't read that this just says this is just the right hand side of the bound, two E to 448 00:26:30,610 --> 00:26:32,659 negative two gamma squared. 449 00:26:32,659 --> 00:26:36,220 So balance the probability that you make a mistake in estimating the 450 00:26:36,220 --> 00:26:41,740 mean of a Benuve random variable. 451 00:26:41,740 --> 00:26:43,030 And the 452 00:26:43,030 --> 00:26:47,180 cool thing about this bound the interesting thing behind this bound is that 453 00:26:48,039 --> 00:26:50,610 the [inaudible] exponentially in M, 454 00:26:50,610 --> 00:26:51,710 so it says 455 00:26:51,710 --> 00:26:54,030 that for a fixed value of gamma, 456 00:26:54,030 --> 00:26:55,540 as you increase the size 457 00:26:55,540 --> 00:26:58,030 of your training set, as you toss a coin more and more, 458 00:26:58,030 --> 00:27:02,250 then the worth of this Gaussian will shrink. The worth of this 459 00:27:02,250 --> 00:27:06,270 Gaussian will actually shrink like one over root to M. 460 00:27:06,270 --> 00:27:12,010 And that will cause the probability mass left in the tails to decrease exponentially, 461 00:27:12,010 --> 00:27:19,010 quickly, as a function of that. And this will be important later. Yeah? Student: Does this come from the 462 00:27:21,400 --> 00:27:23,360 central limit theorem [inaudible]. Instructor (Andrew Ng): No it doesn't. So this is proved by a different 463 00:27:23,360 --> 00:27:27,220 this is proved no so the central limit theorem there may be a 464 00:27:27,220 --> 00:27:29,330 version of the central limit theorem, but the 465 00:27:29,330 --> 00:27:32,840 versions I'm familiar with tend are sort of asymptotic, but this works 466 00:27:32,840 --> 00:27:35,630 for any finer value of M. Oh, and for your 467 00:27:35,630 --> 00:27:40,190 this bound holds even if M is equal to two, or M is [inaudible], if M is 468 00:27:40,190 --> 00:27:41,410 very small, 469 00:27:41,410 --> 00:27:45,480 the central limit theorem approximation is not gonna hold, but this theorem holds 470 00:27:45,480 --> 00:27:46,370 regardless. 471 00:27:46,370 --> 00:27:47,680 Okay? I'm 472 00:27:47,680 --> 00:27:51,920 drawing this just as a cartoon to help explain the intuition, but this 473 00:27:51,920 --> 00:27:58,920 theorem just holds true, without reference to central limit 474 00:28:08,590 --> 00:28:12,830 theorem. All right. So lets start to understand empirical risk minimization, 475 00:28:12,830 --> 00:28:19,830 and what I want to do is 476 00:28:23,080 --> 00:28:25,490 begin with 477 00:28:25,490 --> 00:28:28,440 studying empirical risk minimization 478 00:28:28,440 --> 00:28:30,890 for a 479 00:28:30,890 --> 00:28:33,000 [inaudible] case 480 00:28:33,000 --> 00:28:36,670 that's a logistic regression, and in particular I want to start with studying 481 00:28:36,670 --> 00:28:40,510 the case of finite hypothesis classes. 482 00:28:40,510 --> 00:28:47,510 So let's say script H is a class of 483 00:28:48,909 --> 00:28:55,600 K hypotheses. 484 00:28:55,600 --> 00:28:59,540 Right. So this is K functions with no each of these is just a function mapping 485 00:28:59,540 --> 00:29:04,640 from inputs to outputs, there's no parameters in this. 486 00:29:04,640 --> 00:29:05,910 And so 487 00:29:05,910 --> 00:29:12,530 what the empirical risk minimization would do is it would take the training set 488 00:29:12,530 --> 00:29:13,809 and it'll then 489 00:29:13,809 --> 00:29:16,590 look at each of these K functions, 490 00:29:16,590 --> 00:29:21,660 and it'll pick whichever of these functions has the lowest training error. Okay? 491 00:29:21,660 --> 00:29:24,260 So now that the logistic regression uses an infinitely 492 00:29:24,260 --> 00:29:28,940 large a continuous infinitely large class of hypotheses, script H, 493 00:29:28,940 --> 00:29:32,440 but to prove the first row I actually want to just 494 00:29:32,440 --> 00:29:33,090 describe 495 00:29:33,090 --> 00:29:37,540 our first learning theorem is all for the case of when you have a finite hypothesis class, and then 496 00:29:37,540 --> 00:29:41,750 we'll later generalize that into the hypothesis classes. 497 00:29:43,050 --> 00:29:45,780 So 498 00:29:53,220 --> 00:29:58,340 empirical risk minimization takes the hypothesis of the lowest training error, 499 00:29:58,340 --> 00:30:04,270 and what I'd like to do is prove a bound on the generalization error 500 00:30:04,270 --> 00:30:05,960 of H hat. All right. So in other words I'm 501 00:30:05,960 --> 00:30:07,070 gonna prove that 502 00:30:07,070 --> 00:30:14,070 somehow minimizing training error allows me to do well on generalization error. 503 00:30:14,070 --> 00:30:15,920 And here's the strategy, 504 00:30:15,920 --> 00:30:19,950 I'm going to 505 00:30:19,950 --> 00:30:22,899 the first step in this prove I'm going to show that 506 00:30:22,899 --> 00:30:25,039 training error 507 00:30:25,039 --> 00:30:29,580 is a good approximation to generalization error, 508 00:30:32,040 --> 00:30:34,610 and then I'm going to show 509 00:30:34,610 --> 00:30:37,760 that this implies a bound 510 00:30:37,760 --> 00:30:40,040 on 511 00:30:40,040 --> 00:30:42,860 the generalization error 512 00:30:42,860 --> 00:30:48,360 of the hypothesis of [inaudible] empirical risk 513 00:30:48,360 --> 00:30:53,310 minimization. And I just realized, this class I guess is also maybe slightly notation 514 00:30:53,310 --> 00:30:55,559 heavy class 515 00:30:55,559 --> 00:30:59,160 round, instead of just introducing a reasonably large set of new symbols, so if 516 00:30:59,160 --> 00:31:02,570 again, in the course of today's lecture, you're looking at some symbol and you don't quite 517 00:31:02,570 --> 00:31:07,210 remember what it is, please raise your hand and ask. [Inaudible] what's that, what was that, was that a 518 00:31:07,210 --> 00:31:11,570 generalization error or was it something else? So raise your hand and 519 00:31:11,570 --> 00:31:16,740 ask if you don't understand what the notation I was defining. 520 00:31:16,740 --> 00:31:20,429 Okay. So let me introduce this in two steps. And the empirical risk strategy is I'm gonna show training errors 521 00:31:20,429 --> 00:31:22,970 that give approximation generalization error, 522 00:31:22,970 --> 00:31:26,360 and this will imply that minimizing training error 523 00:31:26,360 --> 00:31:29,919 will also do pretty well in terms of minimizing generalization error. 524 00:31:29,919 --> 00:31:33,130 And this will give us a bound on the generalization error 525 00:31:33,130 --> 00:31:39,520 of the hypothesis output by empirical risk minimization. Okay? 526 00:31:39,520 --> 00:31:46,520 So here's the idea. 527 00:31:47,509 --> 00:31:51,530 So 528 00:31:51,530 --> 00:31:56,330 lets even not consider all the hypotheses at once, lets pick any 529 00:31:56,330 --> 00:32:00,330 hypothesis, HJ in the class script H, and 530 00:32:00,330 --> 00:32:04,549 so until further notice lets just consider there one fixed hypothesis. So pick any 531 00:32:04,549 --> 00:32:10,250 one hypothesis and let's talk about that 532 00:32:10,250 --> 00:32:12,390 one. Let 533 00:32:12,390 --> 00:32:15,160 me define ZI 534 00:32:15,160 --> 00:32:22,160 to be indicator function 535 00:32:23,970 --> 00:32:25,490 for 536 00:32:25,490 --> 00:32:29,929 whether this hypothesis misclassifies the IFE example excuse me 537 00:32:29,929 --> 00:32:33,820 or Z subscript I. Okay? 538 00:32:33,820 --> 00:32:40,250 So 539 00:32:40,250 --> 00:32:42,809 ZI would be zero or one 540 00:32:42,809 --> 00:32:46,799 depending on whether this one hypothesis which is the only one 541 00:32:46,799 --> 00:32:48,640 I'm gonna even consider now, 542 00:32:48,640 --> 00:32:52,280 whether this hypothesis was classified as an example. 543 00:32:52,280 --> 00:32:56,970 And so 544 00:32:56,970 --> 00:33:00,620 my training set is drawn randomly from sum distribution scripts 545 00:33:00,620 --> 00:33:02,140 d, 546 00:33:02,140 --> 00:33:04,620 and 547 00:33:04,620 --> 00:33:09,820 depending on what training examples I've got, these ZIs would be either zero or one. 548 00:33:09,820 --> 00:33:14,260 So let's figure out what the probability distribution ZI is. 549 00:33:14,260 --> 00:33:15,020 Well, 550 00:33:15,020 --> 00:33:15,950 so 551 00:33:15,950 --> 00:33:20,970 ZI takes on the value of either zero or one, so clearly is a Benuve random variable, it can only 552 00:33:20,970 --> 00:33:24,870 take on these values. 553 00:33:24,870 --> 00:33:31,260 Well, 554 00:33:31,260 --> 00:33:35,160 what's the probability that ZI is equal to one? In other words, what's the 555 00:33:35,160 --> 00:33:36,490 probability 556 00:33:36,490 --> 00:33:40,780 that from a fixed hypothesis HJ, 557 00:33:40,780 --> 00:33:45,080 when I sample my training set IID from distribution D, what is the chance 558 00:33:45,080 --> 00:33:45,970 that 559 00:33:45,970 --> 00:33:50,370 my hypothesis will misclassify it? 560 00:33:50,370 --> 00:33:54,290 Well, by definition, 561 00:33:54,290 --> 00:33:56,770 that's just a generalization error of my 562 00:33:56,770 --> 00:34:00,669 hypothesis HJ. 563 00:34:00,669 --> 00:34:03,990 So ZI is a Benuve random variable 564 00:34:03,990 --> 00:34:05,490 with mean 565 00:34:05,490 --> 00:34:12,490 given by the generalization error of this hypothesis. 566 00:34:14,439 --> 00:34:20,169 Raise your hand if that made sense. Oh, cool. Great. 567 00:34:20,169 --> 00:34:21,880 And moreover, 568 00:34:21,880 --> 00:34:23,989 all the ZIs have the same 569 00:34:23,989 --> 00:34:27,969 probability of being one, and all my training examples I've drawn are IID, 570 00:34:27,969 --> 00:34:32,839 and so the ZIs are also independent 571 00:34:32,839 --> 00:34:39,099 and therefore 572 00:34:39,099 --> 00:34:42,419 the ZIs themselves are IID random 573 00:34:42,419 --> 00:34:49,419 variables. Okay? Because my training examples were drawn independently of each other, by assumption. 574 00:34:55,799 --> 00:34:59,149 If you read this as the definition of training error, 575 00:34:59,149 --> 00:35:04,329 the training error of my hypothesis HJ, that's just that. 576 00:35:07,210 --> 00:35:09,029 That's just the average 577 00:35:09,029 --> 00:35:16,029 of my ZIs, which was well I previously defined it like this. Okay? 578 00:35:25,729 --> 00:35:28,539 And so epsilon hat of HJ 579 00:35:28,539 --> 00:35:30,420 is exactly 580 00:35:30,420 --> 00:35:32,829 the average of MIID, 581 00:35:32,829 --> 00:35:35,849 Benuve random variables, 582 00:35:35,849 --> 00:35:40,039 drawn from Benuve distribution with mean given by 583 00:35:40,039 --> 00:35:41,749 the 584 00:35:41,749 --> 00:35:43,869 generalization error, so this is 585 00:35:43,869 --> 00:35:47,909 well this is the average of MIID Benuve random variables, 586 00:35:47,909 --> 00:35:54,909 each of which has meaning given by the 587 00:35:54,939 --> 00:35:58,689 generalization error of HJ. 588 00:35:58,689 --> 00:36:03,769 And therefore, by 589 00:36:03,769 --> 00:36:07,029 the Hufting inequality 590 00:36:07,029 --> 00:36:11,069 we have to add the 591 00:36:11,069 --> 00:36:15,699 probability 592 00:36:15,699 --> 00:36:20,029 that the difference between training and generalization error, the probability that this is greater than gamma is 593 00:36:20,029 --> 00:36:21,539 less than to two, E to 594 00:36:21,539 --> 00:36:24,719 the negative two, 595 00:36:24,719 --> 00:36:26,949 gamma squared M. Okay? 596 00:36:26,949 --> 00:36:32,420 Exactly by the Hufting inequality. 597 00:36:32,420 --> 00:36:34,579 And what this proves is that, 598 00:36:34,579 --> 00:36:37,289 for my fixed hypothesis HJ, 599 00:36:37,289 --> 00:36:40,529 my training error, epsilon hat will 600 00:36:40,529 --> 00:36:43,899 with high probability, assuming M is large, if 601 00:36:43,899 --> 00:36:46,890 M is large than this thing on the right hand side will be small, because 602 00:36:46,890 --> 00:36:49,239 this is two Es and a negative two gamma squared M. 603 00:36:49,239 --> 00:36:52,559 So this says that if my training set is large enough, 604 00:36:52,559 --> 00:36:54,390 then the probability 605 00:36:54,390 --> 00:36:57,299 my training error is far from generalization error, 606 00:36:57,299 --> 00:36:59,089 meaning that it is more than gamma, 607 00:36:59,089 --> 00:37:06,089 will be small, will be bounded by this thing on the right hand side. Okay? Now, 608 00:37:09,119 --> 00:37:11,569 here's the [inaudible] tricky part, 609 00:37:11,569 --> 00:37:16,569 what we've done is approve this bound for one fixed hypothesis, for HJ. 610 00:37:16,569 --> 00:37:20,429 What I want to prove is that training error will be a good estimate for generalization 611 00:37:20,429 --> 00:37:21,279 error, 612 00:37:21,279 --> 00:37:26,249 not just for this one hypothesis HJ, but actually for all K hypotheses in my 613 00:37:26,249 --> 00:37:27,829 hypothesis class 614 00:37:27,829 --> 00:37:29,650 script 615 00:37:29,650 --> 00:37:30,769 H. 616 00:37:30,769 --> 00:37:33,069 So let's do it 617 00:37:35,719 --> 00:37:42,719 well, 618 00:37:54,659 --> 00:37:57,859 better do it on a new board. So in order to show that, let me define 619 00:37:57,859 --> 00:38:01,229 a random event, let me define AJ to be the event 620 00:38:01,229 --> 00:38:05,269 that 621 00:38:05,269 --> 00:38:12,269 let 622 00:38:21,469 --> 00:38:24,479 me define AJ to be the event that 623 00:38:24,479 --> 00:38:28,039 you know, the difference between training and generalization error is more than gamma on a 624 00:38:28,039 --> 00:38:30,439 hypothesis HJ. 625 00:38:30,439 --> 00:38:32,090 And so what we 626 00:38:32,090 --> 00:38:36,449 put on the previous board was that the probability of AJ is less equal to two E 627 00:38:36,449 --> 00:38:43,449 to the negative two, gamma squared M, and this is pretty small. Now, 628 00:38:44,059 --> 00:38:49,729 What I want to bound is the probability that 629 00:38:49,729 --> 00:38:53,879 there exists some hypothesis in my class 630 00:38:53,879 --> 00:39:00,879 script H, 631 00:39:02,839 --> 00:39:08,109 such that I make a large error in my estimate of generalization error. Okay? 632 00:39:08,109 --> 00:39:10,269 Such that this holds true. 633 00:39:12,180 --> 00:39:14,289 So this is really just 634 00:39:14,289 --> 00:39:17,890 that the probability that there exists a hypothesis for which this 635 00:39:17,890 --> 00:39:20,650 holds. This is really the probability that 636 00:39:20,650 --> 00:39:25,299 A one or A two, or up to AK holds. 637 00:39:25,299 --> 00:39:28,659 The chance 638 00:39:28,659 --> 00:39:33,179 there exists a hypothesis is just well the priority 639 00:39:33,179 --> 00:39:37,479 that for hypothesis one and make a large error in estimating the generalization error, 640 00:39:37,479 --> 00:39:41,069 or for hypothesis two and make a large error in estimating generalization error, 641 00:39:41,069 --> 00:39:43,559 and so on. 642 00:39:43,559 --> 00:39:50,559 And so by the union bound, this is less than equal to 643 00:39:50,819 --> 00:39:56,229 that, 644 00:39:56,229 --> 00:39:59,329 which is therefore less than equal to 645 00:40:06,519 --> 00:40:13,519 is 646 00:40:13,989 --> 00:40:20,989 equal to that. Okay? 647 00:40:38,949 --> 00:40:40,439 So let 648 00:40:40,439 --> 00:40:42,359 me just take 649 00:40:42,359 --> 00:40:47,639 one minus both sides of the equation 650 00:40:47,639 --> 00:40:50,939 on the previous board let me take one minus both sides, so 651 00:40:50,939 --> 00:40:56,899 the probability that there does not exist for 652 00:40:56,899 --> 00:40:59,009 hypothesis 653 00:40:59,009 --> 00:41:06,009 such that, 654 00:41:07,650 --> 00:41:10,839 that. The probability that there does not exist a hypothesis on which I make a large 655 00:41:10,839 --> 00:41:13,190 error in this estimate 656 00:41:13,190 --> 00:41:20,190 while this is equal to the probability that for all hypotheses, 657 00:41:22,680 --> 00:41:24,759 I make a 658 00:41:24,759 --> 00:41:26,569 small error, or at 659 00:41:26,569 --> 00:41:31,609 most gamma, in my estimate of generalization error. 660 00:41:31,609 --> 00:41:35,519 In taking one minus on the right hand side I get two KE 661 00:41:35,519 --> 00:41:39,149 to the negative two 662 00:41:39,149 --> 00:41:44,109 gamma squared M. Okay? 663 00:41:44,109 --> 00:41:45,709 And so 664 00:41:45,709 --> 00:41:49,519 and the sign of the inequality flipped because I took one minus both 665 00:41:49,519 --> 00:41:53,799 sides. The minus sign flips the sign of the 666 00:41:53,799 --> 00:41:55,819 equality. 667 00:41:55,819 --> 00:41:59,799 So what we're shown is that 668 00:41:59,799 --> 00:42:04,679 with probability which abbreviates to WP with probability one minus 669 00:42:04,679 --> 00:42:08,529 two KE to the negative two gamma squared M. 670 00:42:08,529 --> 00:42:11,889 We have that, epsilon hat 671 00:42:11,889 --> 00:42:16,479 of H 672 00:42:16,479 --> 00:42:21,149 will be 673 00:42:21,149 --> 00:42:28,149 will then gamma of epsilon of H, 674 00:42:31,309 --> 00:42:33,059 simultaneously for all 675 00:42:33,059 --> 00:42:35,449 hypotheses in our 676 00:42:35,449 --> 00:42:42,449 class script H. 677 00:42:43,269 --> 00:42:47,209 And so 678 00:42:47,209 --> 00:42:54,209 just to give this result a name, this is called 679 00:42:55,529 --> 00:43:00,089 this is one instance of what's called a uniform conversions result, 680 00:43:00,089 --> 00:43:01,130 and 681 00:43:01,130 --> 00:43:03,759 the term uniform conversions 682 00:43:03,759 --> 00:43:05,719 this sort of alludes to the fact that 683 00:43:05,719 --> 00:43:09,659 this shows that as M becomes large, 684 00:43:09,659 --> 00:43:12,029 then these epsilon hats 685 00:43:12,029 --> 00:43:16,189 will all simultaneously converge to epsilon of H. That 686 00:43:16,189 --> 00:43:17,849 training error will 687 00:43:17,849 --> 00:43:20,840 become very close to generalization error 688 00:43:20,840 --> 00:43:23,210 simultaneously for all hypotheses H. 689 00:43:23,210 --> 00:43:25,459 That's what the term uniform 690 00:43:25,459 --> 00:43:29,150 refers to, is the fact that this converges for all hypotheses H and not just 691 00:43:29,150 --> 00:43:31,059 for one hypothesis. And so 692 00:43:31,059 --> 00:43:36,130 what we're shown is one example of a uniform conversions result. Okay? 693 00:43:36,130 --> 00:43:39,400 So let me clean a couple more boards. I'll come back and ask what questions you have 694 00:43:39,400 --> 00:43:46,400 about this. We should take another look at this and make sure it all makes sense. Yeah, okay. 695 00:44:20,109 --> 00:44:27,109 What questions do you have about this? 696 00:44:27,829 --> 00:44:31,269 697 00:44:31,269 --> 00:44:34,239 Student: How the is the 698 00:44:34,239 --> 00:44:36,150 value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is how is the value of gamma computed? 699 00:44:36,150 --> 00:44:40,289 So for these purposes for the purposes of this, gamma is a constant. 700 00:44:40,289 --> 00:44:43,489 Imagine a gamma is some constant that we chose in advance, 701 00:44:43,489 --> 00:44:48,589 and this is a bound that holds true for any fixed value 702 00:44:48,589 --> 00:44:50,739 of gamma. Later on as we 703 00:44:50,739 --> 00:44:54,449 take this bound and then sort of develop this result further, 704 00:44:54,449 --> 00:44:58,129 we'll choose specific values of gamma as a [inaudible] of this bound. For now 705 00:44:58,129 --> 00:45:05,129 we'll just imagine that when we're proved this holds true for any value of gamma. Any questions? 706 00:45:08,890 --> 00:45:11,579 Yeah? Student:[Inaudible] hypothesis phase is infinite [inaudible]? Instructor (Andrew Ng):Yes, the labs in the hypothesis phase is infinite, 707 00:45:11,579 --> 00:45:16,279 so this simple result won't work in this present form, but we'll generalize this 708 00:45:16,279 --> 00:45:20,199 probably won't get to it today but we'll generalize this at the beginning of the 709 00:45:20,199 --> 00:45:27,199 next lecture to infinite hypothesis classes. Student:How do we use this 710 00:45:29,779 --> 00:45:31,889 theory [inaudible]? Instructor (Andrew Ng):How do you use theorem factors? So let me 711 00:45:31,889 --> 00:45:37,259 I might get to a little of that later today, we'll talk concretely about algorithms, 712 00:45:37,259 --> 00:45:38,890 the consequences of 713 00:45:38,890 --> 00:45:45,719 the understanding of these things in the next lecture as well. Yeah, okay? Cool. Can you 714 00:45:45,719 --> 00:45:47,289 just raise your hand if the 715 00:45:47,289 --> 00:45:49,599 things I've proved so far 716 00:45:49,599 --> 00:45:55,069 make sense? Okay. Cool. Great. 717 00:45:55,069 --> 00:45:59,339 Thanks. All right. Let me just take this uniform conversions bound and rewrite 718 00:45:59,339 --> 00:46:02,279 it in a couple of other forms. 719 00:46:02,279 --> 00:46:04,589 So 720 00:46:04,589 --> 00:46:08,689 this is a sort of a bound on probability, this is saying suppose I fix my training 721 00:46:08,689 --> 00:46:11,089 set and then fix my training 722 00:46:11,089 --> 00:46:14,699 set fix my threshold, my error threshold gamma, 723 00:46:14,699 --> 00:46:19,109 what is the probability that uniform conversions holds, and well, 724 00:46:19,109 --> 00:46:22,670 that's my formula that gives the answer. This is the probability of something 725 00:46:22,670 --> 00:46:25,749 happening. 726 00:46:25,749 --> 00:46:28,750 So there are actually three parameters of interest. One is, 727 00:46:28,750 --> 00:46:31,539 What is this probability? The 728 00:46:31,539 --> 00:46:34,290 other parameter is, What's the training set size M? 729 00:46:34,290 --> 00:46:36,020 And the third parameter is, What 730 00:46:36,020 --> 00:46:37,009 is the value 731 00:46:37,009 --> 00:46:39,989 of this error 732 00:46:39,989 --> 00:46:41,950 threshold gamma? I'm not gonna 733 00:46:41,950 --> 00:46:44,309 vary K for these purposes. 734 00:46:44,309 --> 00:46:47,619 So other two other equivalent forms of the bounds, 735 00:46:47,619 --> 00:46:49,509 which so you can ask, 736 00:46:49,509 --> 00:46:50,599 given 737 00:46:50,599 --> 00:46:52,309 gamma 738 00:46:52,309 --> 00:46:53,289 so what 739 00:46:53,289 --> 00:46:56,019 we proved was given gamma and given M, what 740 00:46:56,019 --> 00:46:57,849 is the probability of 741 00:46:57,849 --> 00:47:01,019 uniform conversions? The 742 00:47:01,019 --> 00:47:04,150 other equivalent forms are, 743 00:47:04,150 --> 00:47:06,899 so that given gamma 744 00:47:06,899 --> 00:47:13,899 and the probability delta of making a large error, 745 00:47:15,339 --> 00:47:19,049 how large a training set size do you need in order to give 746 00:47:19,049 --> 00:47:21,949 a bound on how large a 747 00:47:21,949 --> 00:47:25,459 training set size do you need 748 00:47:25,459 --> 00:47:29,749 to give a uniform conversions bound with parameters gamma 749 00:47:29,749 --> 00:47:31,209 and delta? 750 00:47:31,209 --> 00:47:33,749 And so well, 751 00:47:33,749 --> 00:47:37,729 so if you set delta to be two KE so negative two gamma 752 00:47:37,729 --> 00:47:38,489 squared M. 753 00:47:38,489 --> 00:47:41,359 This is that form that I had on the left. 754 00:47:41,359 --> 00:47:43,630 And if you solve 755 00:47:43,630 --> 00:47:46,079 for M, 756 00:47:46,079 --> 00:47:48,170 what you find is that 757 00:47:48,170 --> 00:47:54,430 there's an equivalent form of this result that says that 758 00:47:54,430 --> 00:47:55,890 so long as your training 759 00:47:55,890 --> 00:47:59,029 set assigns M as greater than this. 760 00:47:59,029 --> 00:48:05,390 And this is the formula that I get by solving for M. 761 00:48:05,390 --> 00:48:08,809 Okay? So long as M is greater than equal to this, 762 00:48:08,809 --> 00:48:12,929 then with probability, which I'm abbreviating to WP again, 763 00:48:12,929 --> 00:48:17,369 with probability at least one minus delta, 764 00:48:17,369 --> 00:48:24,369 we have for all. 765 00:48:24,880 --> 00:48:30,959 Okay? 766 00:48:30,959 --> 00:48:34,299 So this says how large a training set size that I need 767 00:48:34,299 --> 00:48:37,859 to guarantee that with probability at least one minus delta, 768 00:48:37,859 --> 00:48:42,239 we have the training error is within gamma of generalization error for all my 769 00:48:42,239 --> 00:48:45,259 hypotheses, and this gives an answer. 770 00:48:45,259 --> 00:48:49,400 And just to give this another name, this is an example of a sample 771 00:48:49,400 --> 00:48:53,469 complexity 772 00:48:53,469 --> 00:48:58,079 773 00:48:58,079 --> 00:48:58,940 bound. 774 00:48:58,940 --> 00:49:02,870 So from undergrad computer science classes you may have heard of 775 00:49:02,870 --> 00:49:05,680 computational complexity, which is how much computations you need to achieve 776 00:49:05,680 --> 00:49:06,849 something. 777 00:49:06,849 --> 00:49:11,319 So sample complexity just means how large a training example how large a training set how 778 00:49:11,319 --> 00:49:13,679 large a sample of examples do you need 779 00:49:13,679 --> 00:49:16,549 in order to achieve a certain bound and error. 780 00:49:16,549 --> 00:49:19,619 And it turns out that in many of the theorems we write out you can 781 00:49:19,619 --> 00:49:23,659 pose them in sort of a form of probability bound or a sample complexity bound or in some other 782 00:49:23,659 --> 00:49:24,459 form. 783 00:49:24,459 --> 00:49:28,369 I personally often find the sample complexity bounds the most 784 00:49:28,369 --> 00:49:32,419 easy to interpret because it says how large a training set do you need to give a certain bound on the 785 00:49:32,419 --> 00:49:35,819 errors. 786 00:49:35,819 --> 00:49:38,829 And in fact well, we'll see this later, 787 00:49:38,829 --> 00:49:41,050 sample complexity bounds often sort of 788 00:49:41,050 --> 00:49:43,299 help to give guidance for really if 789 00:49:43,299 --> 00:49:44,240 you're trying to 790 00:49:44,240 --> 00:49:46,140 achieve something on a machine learning problem, 791 00:49:46,140 --> 00:49:46,940 this really is 792 00:49:46,940 --> 00:49:49,719 trying to give guidance on how much training data you need 793 00:49:49,719 --> 00:49:54,269 to prove something. 794 00:49:54,269 --> 00:49:57,769 The one thing I want to note here is that M 795 00:49:57,769 --> 00:50:01,019 grows like the log of K, right, so 796 00:50:01,019 --> 00:50:05,630 the log of K grows extremely slowly as a function of K. The log is one 797 00:50:05,630 --> 00:50:10,619 of the slowest growing functions, right. It's one of well, 798 00:50:10,619 --> 00:50:17,619 some of you may have heard this, right? That for all values of K, right I learned 799 00:50:18,869 --> 00:50:22,179 this from a colleague, Andrew Moore, 800 00:50:22,179 --> 00:50:23,739 at Carnegie Mellon 801 00:50:23,739 --> 00:50:24,699 that in 802 00:50:24,699 --> 00:50:30,549 computer science for all practical purposes for all values of K, log K is less [inaudible], this is 803 00:50:30,549 --> 00:50:31,509 almost true. 804 00:50:31,509 --> 00:50:35,529 So log K is logs is one of the slowest growing functions, and so the 805 00:50:35,529 --> 00:50:40,059 fact that M sample complexity grows like the log of K, 806 00:50:40,059 --> 00:50:41,549 means that 807 00:50:41,549 --> 00:50:45,749 you can increase this number of hypotheses in your hypothesis class quite a lot 808 00:50:45,749 --> 00:50:50,609 and the number of the training examples you need won't grow 809 00:50:50,609 --> 00:50:56,390 810 00:50:56,390 --> 00:51:00,289 very much. [Inaudible]. This property will be important later when we talk about infinite 811 00:51:00,289 --> 00:51:02,859 hypothesis classes. 812 00:51:02,859 --> 00:51:05,679 The final form is the 813 00:51:05,679 --> 00:51:08,489 I guess is sometimes called the error bound, 814 00:51:08,489 --> 00:51:13,659 which is when you hold M and delta fixed and solved for gamma. 815 00:51:13,659 --> 00:51:20,659 And so 816 00:51:22,809 --> 00:51:27,469 and what do you do what you get then is that the 817 00:51:27,469 --> 00:51:30,759 probability at least one minus delta, 818 00:51:30,759 --> 00:51:37,759 we have that. 819 00:51:41,339 --> 00:51:44,409 For all hypotheses in my hypothesis class, 820 00:51:44,409 --> 00:51:51,409 the difference in the training generalization error would be less 821 00:51:53,690 --> 00:51:55,169 than equal to that. Okay? And that's just 822 00:51:55,169 --> 00:52:02,169 solving for gamma and plugging the value I get in there. Okay? All right. 823 00:52:03,239 --> 00:52:10,239 So the 824 00:52:30,949 --> 00:52:33,569 second 825 00:52:33,569 --> 00:52:36,219 step of 826 00:52:36,219 --> 00:52:43,219 the overall proof I want to execute is the following. 827 00:52:44,519 --> 00:52:47,329 The result of the training error is essentially that uniform 828 00:52:47,329 --> 00:52:49,100 conversions will hold true 829 00:52:49,100 --> 00:52:51,439 with high probability. 830 00:52:51,439 --> 00:52:55,249 What I want to show now is let's assume that uniform conversions hold. So let's 831 00:52:55,249 --> 00:52:57,959 assume that for all 832 00:52:57,959 --> 00:53:00,089 hypotheses H, 833 00:53:00,089 --> 00:53:05,439 we have that epsilon of H minus epsilon hat of H, is 834 00:53:05,439 --> 00:53:08,619 less than of the gamma. Okay? 835 00:53:08,619 --> 00:53:10,930 What I want to do now is 836 00:53:10,930 --> 00:53:13,140 use this to see what we can 837 00:53:13,140 --> 00:53:14,910 prove about the bound 838 00:53:14,910 --> 00:53:21,910 of see what we can prove about the generalization error. 839 00:53:28,829 --> 00:53:31,400 840 00:53:31,400 --> 00:53:32,619 So I want to know 841 00:53:32,619 --> 00:53:35,639 suppose this holds true I want 842 00:53:35,639 --> 00:53:40,259 to know can we prove something about the generalization error of H hat, where 843 00:53:40,259 --> 00:53:43,380 again, H hat 844 00:53:43,380 --> 00:53:45,950 was the hypothesis 845 00:53:45,950 --> 00:53:51,679 selected by empirical risk minimization. 846 00:53:51,679 --> 00:53:55,789 Okay? So in order to show this, let me make one more definition, let me define H 847 00:53:55,789 --> 00:54:00,289 star, 848 00:54:00,289 --> 00:54:03,019 to be the hypothesis 849 00:54:03,019 --> 00:54:06,459 in my class 850 00:54:06,459 --> 00:54:10,189 script H that has the smallest generalization error. 851 00:54:10,189 --> 00:54:11,309 So this is 852 00:54:11,309 --> 00:54:14,930 if I had an infinite amount of training data or if I really I 853 00:54:14,930 --> 00:54:19,029 could go in and find the best possible hypothesis 854 00:54:19,029 --> 00:54:23,359 best possible hypothesis in the sense of minimizing generalization error 855 00:54:23,359 --> 00:54:28,989 what's the hypothesis I would get? Okay? So 856 00:54:28,989 --> 00:54:32,959 in some sense, it sort of makes sense to compare the performance of our learning 857 00:54:32,959 --> 00:54:33,609 algorithm 858 00:54:33,609 --> 00:54:35,569 to the performance of H star, because we 859 00:54:35,569 --> 00:54:39,959 sort of we clearly can't hope to do better than H star. 860 00:54:39,959 --> 00:54:42,469 Another way of saying that is that if 861 00:54:42,469 --> 00:54:47,609 your hypothesis class is a class of all linear decision boundaries, that 862 00:54:47,609 --> 00:54:50,630 the data just can't be separated by any linear functions. 863 00:54:50,630 --> 00:54:53,739 So if even H star is really bad, 864 00:54:53,739 --> 00:54:55,240 then there's sort of 865 00:54:55,240 --> 00:54:59,120 it's unlikely then there's just not much hope that your learning algorithm could do even 866 00:54:59,120 --> 00:55:04,130 better 867 00:55:04,130 --> 00:55:08,979 than H star. So I actually prove this result in three steps. 868 00:55:08,979 --> 00:55:13,399 So the generalization error of H hat, the hypothesis I chose, 869 00:55:13,399 --> 00:55:20,399 this is going to be less than equal to that, actually let me number these 870 00:55:21,739 --> 00:55:25,059 equations, right. This 871 00:55:25,059 --> 00:55:27,199 is 872 00:55:27,199 --> 00:55:30,289 because of equation one, because I see that 873 00:55:30,289 --> 00:55:33,089 epsilon of H hat and epsilon hat of H hat will then gamma of each 874 00:55:33,089 --> 00:55:36,630 other. 875 00:55:36,630 --> 00:55:38,319 Now 876 00:55:38,319 --> 00:55:42,140 because H star, excuse me, 877 00:55:42,140 --> 00:55:45,890 now by the 878 00:55:45,890 --> 00:55:51,009 definition of empirical risk minimization, H 879 00:55:51,009 --> 00:55:54,049 hat was chosen to minimize training error 880 00:55:54,049 --> 00:55:58,679 and so there can't be any hypothesis with lower training error than H hat. 881 00:55:58,679 --> 00:56:04,629 So the training error of H hat must be less than the equal to 882 00:56:04,629 --> 00:56:07,079 the training error of H star. So 883 00:56:07,079 --> 00:56:10,549 this is sort of by two, or by the definition of H hat, 884 00:56:10,549 --> 00:56:17,549 as the hypothesis that minimizes training error H hat. 885 00:56:17,619 --> 00:56:20,159 And the final step is I'm 886 00:56:20,159 --> 00:56:23,499 going to apply this uniform conversions result again. We know that 887 00:56:23,499 --> 00:56:24,929 epsilon hat 888 00:56:24,929 --> 00:56:29,549 of H star must be moving gamma of epsilon of H star. 889 00:56:29,549 --> 00:56:33,879 And so this is at most 890 00:56:33,879 --> 00:56:35,199 plus gamma. Then 891 00:56:35,199 --> 00:56:36,089 892 00:56:36,089 --> 00:56:38,299 I have my original gamma 893 00:56:38,299 --> 00:56:41,319 there. Okay? And so this is by 894 00:56:41,319 --> 00:56:44,519 equation one again because oh, excuse me 895 00:56:44,519 --> 00:56:47,690 because I know the training error of H star must be moving gamma of the 896 00:56:47,690 --> 00:56:49,929 generalization error of H star. 897 00:56:49,929 --> 00:56:51,699 And so well, I'll just 898 00:56:51,699 --> 00:56:57,469 write this as plus two gamma. Okay? Yeah? Student:[Inaudible] notation, is epsilon proof of [inaudible] H hat that's not the training 899 00:56:57,469 --> 00:57:04,469 error, that's the generalization error with estimate of the hypothesis? Instructor (Andrew Ng): Oh, 900 00:57:24,869 --> 00:57:26,649 okay. Let me just well, let 901 00:57:26,649 --> 00:57:32,779 me write that down on this board. So actually actually let me 902 00:57:32,779 --> 00:57:36,459 think 903 00:57:36,459 --> 00:57:38,739 [inaudible] 904 00:57:38,739 --> 00:57:43,699 fit this in here. So epsilon hat 905 00:57:43,699 --> 00:57:45,169 of H is the 906 00:57:45,169 --> 00:57:48,999 training 907 00:57:48,999 --> 00:57:50,620 908 00:57:50,620 --> 00:57:52,469 error of the hypothesis H. In 909 00:57:52,469 --> 00:57:55,949 other words, given the hypothesis a hypothesis is just a function, right mapped from X or 910 00:57:55,949 --> 00:57:57,559 Ys 911 00:57:57,559 --> 00:57:59,489 so epsilon hat of H is 912 00:57:59,489 --> 00:58:02,809 given the hypothesis H, what's the fraction of training examples it 913 00:58:02,809 --> 00:58:05,029 misclassifies? 914 00:58:05,029 --> 00:58:11,599 And generalization error 915 00:58:11,599 --> 00:58:13,839 of 916 00:58:13,839 --> 00:58:17,949 H, is given the hypothesis H if I 917 00:58:17,949 --> 00:58:19,149 sample another 918 00:58:19,149 --> 00:58:21,709 example from my distribution 919 00:58:21,709 --> 00:58:22,899 scripts 920 00:58:22,899 --> 00:58:28,309 D, what's the probability that H will misclassify that example? Does that make sense? 921 00:58:28,309 --> 00:58:31,859 Oh, 922 00:58:31,859 --> 00:58:38,239 okay. And H hat is the hypothesis that's chosen by empirical risk minimization. 923 00:58:38,239 --> 00:58:42,579 So when I talk about empirical risk minimization, is the algorithm that 924 00:58:42,579 --> 00:58:47,799 minimizes training error, and so epsilon hat of H is the training error of H, 925 00:58:47,799 --> 00:58:50,219 and so H hat is defined as 926 00:58:50,219 --> 00:58:52,049 the hypothesis that out 927 00:58:52,049 --> 00:58:55,029 of all hypotheses in my class script H, 928 00:58:55,029 --> 00:58:57,589 the one that minimizes training error epsilon hat of H. Okay? All right. Yeah? Student:[Inaudible] H is [inaudible] a member 929 00:58:57,589 --> 00:59:04,589 of 930 00:59:06,669 --> 00:59:08,799 typical H, [inaudible] 931 00:59:08,799 --> 00:59:10,689 family 932 00:59:10,689 --> 00:59:12,909 right? Yes it is. 933 00:59:12,909 --> 00:59:19,909 So what happens with the generalization error [inaudible]? 934 00:59:20,069 --> 00:59:24,489 I'll talk about that later. So 935 00:59:24,489 --> 00:59:31,489 let me tie all these things together into a 936 00:59:34,339 --> 00:59:38,049 theorem. 937 00:59:38,049 --> 00:59:42,519 Let there be a hypothesis class given with a 938 00:59:42,519 --> 00:59:48,869 finite set of K hypotheses, 939 00:59:48,869 --> 00:59:50,889 and let any M 940 00:59:50,889 --> 00:59:52,249 delta 941 00:59:52,249 --> 00:59:58,319 be fixed. 942 00:59:58,319 --> 01:00:02,049 Then so I fixed M and delta, so this will be the error bound 943 01:00:02,049 --> 01:00:04,499 form of the theorem, right? 944 01:00:04,499 --> 01:00:06,759 Then, with 945 01:00:06,759 --> 01:00:09,640 probability at least one minus delta. 946 01:00:09,640 --> 01:00:13,920 We have that. The generalization error of H hat is 947 01:00:13,920 --> 01:00:19,219 less than or equal to 948 01:00:19,219 --> 01:00:22,609 the minimum over all hypotheses in 949 01:00:22,609 --> 01:00:25,049 set H epsilon of H, 950 01:00:25,049 --> 01:00:26,679 plus 951 01:00:26,679 --> 01:00:33,679 two times, 952 01:00:38,029 --> 01:00:39,109 plus that. Okay? 953 01:00:39,109 --> 01:00:41,759 So to prove this, well, 954 01:00:41,759 --> 01:00:45,529 this term of course is just epsilon of H star. 955 01:00:45,529 --> 01:00:47,879 And so to prove this 956 01:00:47,879 --> 01:00:49,739 we set 957 01:00:49,739 --> 01:00:52,509 gamma to equal to that 958 01:00:52,509 --> 01:00:55,969 this is two times the square root term. 959 01:00:55,969 --> 01:01:02,969 To prove this theorem we set gamma to equal to that square root term. Say that again? 960 01:01:04,009 --> 01:01:07,519 Wait. Say that again? 961 01:01:07,519 --> 01:01:11,389 Oh, 962 01:01:13,949 --> 01:01:16,059 yes. Thank you. That didn't 963 01:01:16,059 --> 01:01:19,639 make sense at all. 964 01:01:19,639 --> 01:01:20,939 Thanks. Great. 965 01:01:20,939 --> 01:01:25,670 So set gamma to that square root term, 966 01:01:25,670 --> 01:01:29,910 and so we know equation one, 967 01:01:29,910 --> 01:01:31,739 right, from the previous board 968 01:01:31,739 --> 01:01:35,279 holds with probability one minus delta. 969 01:01:35,279 --> 01:01:40,430 Right. Equation one was the uniform conversions result right, that well, 970 01:01:40,430 --> 01:01:47,430 IE. 971 01:01:49,549 --> 01:01:52,249 This is equation one from the previous board, right, so 972 01:01:52,249 --> 01:01:54,759 set gamma equal to this we know that we'll probably 973 01:01:54,759 --> 01:01:58,459 use one minus delta this uniform conversions holds, 974 01:01:58,459 --> 01:02:01,149 and whenever that holds, that implies you 975 01:02:01,149 --> 01:02:04,719 know, I 976 01:02:04,719 --> 01:02:05,019 guess 977 01:02:05,019 --> 01:02:07,669 if we call this equation star I guess. 978 01:02:07,669 --> 01:02:11,809 And whenever uniform conversions holds, we showed again, on the previous boards 979 01:02:11,809 --> 01:02:17,089 that this result holds, that generalization error of H hat is less than 980 01:02:17,089 --> 01:02:20,769 two generalization error of H star plus two times gamma. Okay? And 981 01:02:20,769 --> 01:02:22,670 so that proves 982 01:02:22,670 --> 01:02:29,670 this theorem. So 983 01:02:41,559 --> 01:02:44,229 this result sort of helps us to quantify 984 01:02:44,229 --> 01:02:45,230 a little bit 985 01:02:45,230 --> 01:02:47,609 that bias variance tradeoff 986 01:02:47,609 --> 01:02:49,739 that I talked about 987 01:02:49,739 --> 01:02:54,239 at the beginning of actually near the very start of this lecture. 988 01:02:54,239 --> 01:02:58,390 And in particular 989 01:02:58,390 --> 01:03:03,659 let's say I have some hypothesis class script H, that 990 01:03:03,659 --> 01:03:06,039 I'm using, maybe as a class of all 991 01:03:06,039 --> 01:03:09,799 linear functions and linear regression, and logistic regression with 992 01:03:09,799 --> 01:03:11,789 just the linear features. 993 01:03:11,789 --> 01:03:14,739 And let's say I'm considering switching 994 01:03:14,739 --> 01:03:19,619 to some new class H prime by having more features. So lets say this is 995 01:03:19,619 --> 01:03:23,059 linear 996 01:03:23,059 --> 01:03:27,239 and this is quadratic, 997 01:03:27,239 --> 01:03:28,039 998 01:03:28,039 --> 01:03:31,659 so the class of all linear functions and the subset of the class of all quadratic functions, 999 01:03:31,659 --> 01:03:34,859 and so H is the subset of H prime. And 1000 01:03:34,859 --> 01:03:36,789 let's say I'm considering 1001 01:03:36,789 --> 01:03:40,929 instead of using my linear hypothesis class let's say I'm considering switching 1002 01:03:40,929 --> 01:03:44,769 to a quadratic hypothesis class, or switching to a larger hypothesis 1003 01:03:44,769 --> 01:03:46,369 class. 1004 01:03:46,369 --> 01:03:47,130 1005 01:03:47,130 --> 01:03:49,809 Then what are the tradeoffs involved? Well, I 1006 01:03:49,809 --> 01:03:52,929 proved this only for finite hypothesis classes, but we'll see that something 1007 01:03:52,929 --> 01:03:56,019 very similar holds for infinite hypothesis classes too. 1008 01:03:56,019 --> 01:03:57,559 But the tradeoff is 1009 01:03:57,559 --> 01:04:00,619 what if I switch from H to H prime, or I switch from linear to quadratic 1010 01:04:00,619 --> 01:04:03,999 functions. Then 1011 01:04:03,999 --> 01:04:08,029 epsilon of H star will become better because 1012 01:04:08,029 --> 01:04:12,419 the best hypothesis in my hypothesis class will become better. 1013 01:04:12,419 --> 01:04:14,829 The best quadratic function 1014 01:04:14,829 --> 01:04:16,049 1015 01:04:16,049 --> 01:04:18,799 by best I mean in the sense of generalization error 1016 01:04:18,799 --> 01:04:21,299 the hypothesis function 1017 01:04:21,299 --> 01:04:24,829 the quadratic function with the lowest generalization error 1018 01:04:24,829 --> 01:04:25,599 has to have 1019 01:04:25,599 --> 01:04:27,059 equal or 1020 01:04:27,059 --> 01:04:28,019 more likely lower generalization error than the best 1021 01:04:28,019 --> 01:04:30,279 linear function. 1022 01:04:30,279 --> 01:04:31,799 1023 01:04:31,799 --> 01:04:37,549 So by switching to a more complex hypothesis class you can get this first term as you go down. 1024 01:04:37,549 --> 01:04:40,349 But what I pay for then is that 1025 01:04:40,349 --> 01:04:42,829 K will increase. 1026 01:04:42,829 --> 01:04:46,449 By switching to a larger hypothesis class, 1027 01:04:46,449 --> 01:04:48,110 the first term will go down, 1028 01:04:48,110 --> 01:04:51,820 but the second term will increase because I now have a larger class of 1029 01:04:51,820 --> 01:04:52,880 hypotheses 1030 01:04:52,880 --> 01:04:54,630 and so the second term K 1031 01:04:54,630 --> 01:04:57,189 will increase. 1032 01:04:57,189 --> 01:05:01,679 And so this is sometimes called the bias this is usually called the bias variance 1033 01:05:01,679 --> 01:05:03,049 tradeoff. Whereby 1034 01:05:03,049 --> 01:05:07,699 going to larger hypothesis class maybe I have the hope for finding a better function, 1035 01:05:07,699 --> 01:05:09,949 that my risk of sort of 1036 01:05:09,949 --> 01:05:13,799 not fitting my model so accurately also increases, and 1037 01:05:13,799 --> 01:05:19,449 that's because illustrated by the second term going up when the 1038 01:05:19,449 --> 01:05:21,299 size of your hypothesis, 1039 01:05:21,299 --> 01:05:28,199 when K goes up. 1040 01:05:28,199 --> 01:05:32,079 And so 1041 01:05:32,079 --> 01:05:35,279 speaking very loosely, 1042 01:05:35,279 --> 01:05:37,109 we can think of 1043 01:05:37,109 --> 01:05:40,149 this first term as corresponding 1044 01:05:40,149 --> 01:05:41,459 maybe to the bias 1045 01:05:41,459 --> 01:05:45,209 of the learning algorithm, or the bias of the hypothesis class. 1046 01:05:45,209 --> 01:05:50,429 And you can again speaking very loosely, 1047 01:05:50,429 --> 01:05:52,880 think of the second term as corresponding 1048 01:05:52,880 --> 01:05:56,760 to the variance in your hypothesis, in other words how well you can actually fit a 1049 01:05:56,760 --> 01:05:57,869 hypothesis in the 1050 01:05:57,869 --> 01:05:59,759 how well you 1051 01:05:59,759 --> 01:06:02,869 actually fit this hypothesis class to the data. 1052 01:06:02,869 --> 01:06:06,829 And by switching to a more complex hypothesis class, your variance increases and your bias decreases. 1053 01:06:06,829 --> 01:06:09,489 1054 01:06:09,489 --> 01:06:13,539 As a note of warning, it turns out that if you take like a statistics class you've seen 1055 01:06:13,539 --> 01:06:15,119 definitions of bias and variance, 1056 01:06:15,119 --> 01:06:18,499 which are often defined in terms of squared error or something. 1057 01:06:18,499 --> 01:06:21,580 It turns out that for classification problems, 1058 01:06:21,580 --> 01:06:24,649 there actually is no 1059 01:06:24,649 --> 01:06:25,880 1060 01:06:25,880 --> 01:06:27,489 universally accepted 1061 01:06:27,489 --> 01:06:28,989 formal definition of 1062 01:06:28,989 --> 01:06:30,679 bias and 1063 01:06:30,679 --> 01:06:35,259 variance for classification problems. For regression problems, there is this square error 1064 01:06:35,259 --> 01:06:38,639 definition. For classification problems it 1065 01:06:38,639 --> 01:06:39,980 turns out there've 1066 01:06:39,980 --> 01:06:43,700 been several competing proposals for definitions of bias and variance. So when I 1067 01:06:43,700 --> 01:06:45,300 say bias and variance here, 1068 01:06:45,300 --> 01:06:47,420 think of these as very loose, informal, 1069 01:06:47,420 --> 01:06:54,420 intuitive definitions, and not formal definitions. Okay. 1070 01:07:00,499 --> 01:07:07,499 The 1071 01:07:15,329 --> 01:07:18,080 cartoon associated with intuition I just 1072 01:07:18,080 --> 01:07:19,390 said would be 1073 01:07:19,390 --> 01:07:22,349 as follows: Let's 1074 01:07:22,349 --> 01:07:24,009 say 1075 01:07:24,009 --> 01:07:26,919 1076 01:07:26,919 --> 01:07:30,369 and everything about the plot will be for a fixed value of M, for a 1077 01:07:30,369 --> 01:07:35,219 fixed training set size M. Vertical axis I'll plot ever and 1078 01:07:35,219 --> 01:07:38,799 on the horizontal axis I'll plot model 1079 01:07:38,799 --> 01:07:43,729 complexity. 1080 01:07:43,729 --> 01:07:46,519 And by model complexity I mean 1081 01:07:46,519 --> 01:07:53,499 sort of degree of polynomial, 1082 01:07:53,499 --> 01:07:57,669 1083 01:07:57,669 --> 01:07:59,739 size of your 1084 01:07:59,739 --> 01:08:04,669 hypothesis class script H etc. It actually turns out, 1085 01:08:04,669 --> 01:08:07,509 you remember the bandwidth parameter 1086 01:08:07,509 --> 01:08:10,579 from 1087 01:08:10,579 --> 01:08:13,890 locally weighted linear regression, that also 1088 01:08:13,890 --> 01:08:17,460 has a similar effect in controlling how complex your model is. 1089 01:08:17,460 --> 01:08:20,650 Model complexity [inaudible] polynomial I guess. 1090 01:08:20,650 --> 01:08:23,239 So the more complex your model, 1091 01:08:23,239 --> 01:08:25,390 the better your 1092 01:08:25,390 --> 01:08:26,020 training error, 1093 01:08:26,020 --> 01:08:30,219 and so 1094 01:08:30,219 --> 01:08:32,889 your training error will tend to [inaudible] zero as you increase the complexity of your model because the 1095 01:08:32,889 --> 01:08:36,419 more complete your model the better you can fit your training set. 1096 01:08:36,419 --> 01:08:37,519 But 1097 01:08:37,519 --> 01:08:44,519 because of this bias variance tradeoff, 1098 01:08:44,689 --> 01:08:48,639 you find 1099 01:08:48,639 --> 01:08:53,370 that 1100 01:08:53,370 --> 01:08:57,139 generalization error will come down for a while and then it will go back up. 1101 01:08:57,139 --> 01:09:03,669 And this regime on the left is when you're underfitting the data or when 1102 01:09:03,669 --> 01:09:06,259 you have high bias. 1103 01:09:06,259 --> 01:09:07,690 And this regime 1104 01:09:07,690 --> 01:09:10,379 on the right 1105 01:09:10,379 --> 01:09:16,370 is when you have high variance or 1106 01:09:16,370 --> 01:09:20,809 you're overfitting the data. Okay? 1107 01:09:20,809 --> 01:09:24,459 And this is why a model of sort of intermediate 1108 01:09:24,459 --> 01:09:27,400 complexity, somewhere here 1109 01:09:27,400 --> 01:09:28,999 if often preferable 1110 01:09:28,999 --> 01:09:30,239 to if 1111 01:09:30,239 --> 01:09:32,859 [inaudible] and minimize generalization error. 1112 01:09:32,859 --> 01:09:35,729 Okay? 1113 01:09:35,729 --> 01:09:39,509 So that's just a cartoon. In the next lecture we'll actually talk about the number of 1114 01:09:39,509 --> 01:09:40,230 algorithms 1115 01:09:40,230 --> 01:09:44,459 for trying to automatically select model complexities, say to get 1116 01:09:44,459 --> 01:09:48,179 you as close as possible to this minimum to 1117 01:09:48,179 --> 01:09:55,179 this area of minimized generalization error. 1118 01:10:10,550 --> 01:10:14,539 The last thing I want to do is actually 1119 01:10:14,539 --> 01:10:18,959 going back to the theorem I wrote out, I just want to take that theorem well, 1120 01:10:18,959 --> 01:10:20,150 so the theorem I wrote out 1121 01:10:20,150 --> 01:10:25,130 was an error bound theorem this says for fixed M and delta where probability 1122 01:10:25,130 --> 01:10:28,639 one minus delta, I get a bound on 1123 01:10:28,639 --> 01:10:30,779 gamma, which is what this term is. 1124 01:10:30,779 --> 01:10:34,769 So the very last thing I wanna do today is just come back to this theorem and write out a 1125 01:10:34,769 --> 01:10:35,719 corollary 1126 01:10:35,719 --> 01:10:39,590 where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta 1127 01:10:39,590 --> 01:10:41,389 and solve for M. 1128 01:10:41,389 --> 01:10:46,080 And if you do that, 1129 01:10:46,080 --> 01:10:47,190 you 1130 01:10:47,190 --> 01:10:49,289 get the following corollary: Let H 1131 01:10:49,289 --> 01:10:52,440 be 1132 01:10:52,440 --> 01:10:54,759 fixed with K hypotheses 1133 01:10:54,759 --> 01:10:58,989 and let 1134 01:10:58,989 --> 01:11:01,340 any delta 1135 01:11:01,340 --> 01:11:04,369 and gamma be fixed. 1136 01:11:09,019 --> 01:11:11,449 Then 1137 01:11:11,449 --> 01:11:18,449 in order to guarantee that, 1138 01:11:22,800 --> 01:11:25,060 let's say I want a guarantee 1139 01:11:25,060 --> 01:11:26,989 that the generalization error 1140 01:11:26,989 --> 01:11:30,729 of the hypothesis I choose with empirical risk minimization, 1141 01:11:30,729 --> 01:11:33,869 that this is at most two times gamma worse 1142 01:11:33,869 --> 01:11:38,319 than the best possible error I could obtain with this hypothesis class. 1143 01:11:38,319 --> 01:11:40,470 Lets say I want this to 1144 01:11:40,470 --> 01:11:45,050 hold true with probability at least one minus delta, 1145 01:11:45,050 --> 01:11:50,159 then it suffices 1146 01:11:50,159 --> 01:11:53,689 that M 1147 01:11:53,689 --> 01:12:00,519 is [inaudible] to that. Okay? And this is 1148 01:12:00,519 --> 01:12:01,010 sort of 1149 01:12:01,010 --> 01:12:05,570 solving for the error 1150 01:12:05,570 --> 01:12:07,699 bound for M. One thing we're going to convince yourselves 1151 01:12:07,699 --> 01:12:10,110 of the easy part of this is 1152 01:12:10,110 --> 01:12:14,960 if you set that term [inaudible] gamma and solve for M you will get this. 1153 01:12:14,960 --> 01:12:18,670 One thing I want you to go home and sort of convince yourselves of is that this 1154 01:12:18,670 --> 01:12:20,559 result really holds true. 1155 01:12:20,559 --> 01:12:23,869 That this really logically follows from the theorem we've proved. 1156 01:12:23,869 --> 01:12:25,669 In other words, 1157 01:12:25,669 --> 01:12:29,849 you can take that formula we wrote and solve for M and because this is the formula you get for M, 1158 01:12:29,849 --> 01:12:31,670 that's just that's the easy part. That 1159 01:12:31,670 --> 01:12:33,770 once you go back and convince yourselves that 1160 01:12:33,770 --> 01:12:37,660 this theorem is a true fact and that it does indeed logically follow from the 1161 01:12:37,660 --> 01:12:38,659 other one. In 1162 01:12:38,659 --> 01:12:39,869 particular, 1163 01:12:39,869 --> 01:12:43,739 make sure that if you solve for that you really get M grading equals this, and why is this M 1164 01:12:43,739 --> 01:12:46,679 grading that and not M less equal two, and just make sure 1165 01:12:46,679 --> 01:12:50,229 I can write this down and it sounds plausible why don't you just go back and convince yourself this is really true. Okay? 1166 01:12:53,280 --> 01:12:54,840 And 1167 01:12:54,840 --> 01:12:58,110 it turns out that when we prove these bounds in learning theory it turns out 1168 01:12:58,110 --> 01:13:02,820 that very often the constants are sort of loose. So it turns out that when we prove 1169 01:13:02,820 --> 01:13:05,619 these bounds usually we're interested 1170 01:13:05,619 --> 01:13:10,349 usually we're not very interested in the constants, and so I 1171 01:13:10,349 --> 01:13:12,409 write this as big O of 1172 01:13:12,409 --> 01:13:16,139 one over gamma squared, log 1173 01:13:16,139 --> 01:13:17,780 K over delta, 1174 01:13:17,780 --> 01:13:20,570 and again, the key 1175 01:13:20,570 --> 01:13:23,869 step in this is that the dependence on M 1176 01:13:23,869 --> 01:13:26,709 with the size of the hypothesis class is logarithmic. 1177 01:13:26,709 --> 01:13:30,099 And this will be very important later when we 1178 01:13:30,099 --> 01:13:35,429 talk about infinite hypothesis classes. Okay? Any 1179 01:13:35,429 --> 01:13:42,429 questions about this? No? Okay, cool. 1180 01:13:49,659 --> 01:13:51,379 So 1181 01:13:51,379 --> 01:13:54,489 next lecture we'll come back, we'll actually start from this result again. 1182 01:13:54,489 --> 01:13:57,699 Remember this. I'll write this down as the first thing I do in the next lecture 1183 01:13:57,699 --> 01:14:02,519 and we'll generalize these to infinite hypothesis classes and then talk about 1184 01:14:02,519 --> 01:14:04,659 practical algorithms for model spectrum. So I'll see you guys in a couple days.