This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is actually wrap up our discussion on learning theory. I'm gonna start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be on advice for applying machine learning algorithms to problems like, you know, the project, or other problems you may go work on after you graduate from this class.

But let's start by talking about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory, and we learned about bias and variance. And I guess we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and you may be less likely to overfit. Right?

What I want to do today is to talk about a different way to prevent overfitting. There's a method called regularization, which is a way that lets you keep all the parameters.

So here's the idea, and I'm gonna illustrate this with, say, linear regression. So you take the linear regression model, the very first model we learned about, right, and we said that we would choose the parameters via maximum likelihood. Right?
And that meant that, you know, you would choose the parameters theta that maximized the probability of the data we observe:

theta_ML = arg max_theta prod_{i=1}^{m} p(y^(i) | x^(i); theta)

Right? And so, to give this sort of procedure a name, this is one example of the most common frequentist procedure. And frequentist statistics you can think of, sort of, as maybe one school of statistics. And the philosophical view behind writing this down was that we envisioned there was some true parameter theta out there that generated, you know, the Xs and the Ys. There's some true parameter theta that governs housing prices, Y as a function of X, and we don't know what the value of theta is, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta.

And the way you formulated this, you know, theta was not a random variable. Right? That's why we said theta is just some true value out there. It's not random or anything, we just don't know what it is, and we have a procedure called maximum likelihood for estimating the value of theta. So this is one example of what's called a frequentist procedure.

The alternative to the, I guess, the frequentist school of statistics is the Bayesian school, in which we're gonna say that we don't know what theta is, and so we will put a prior on theta. Okay? So in the Bayesian school, you would say, "Well, we don't know what the value of theta is, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared I:

theta ~ N(0, tau^2 I)

Okay? And so - actually, if I use S to denote my training set, well - right, this prior represents my beliefs about what the parameters are in the absence of any data. So not having seen any data, the prior represents, you know, what I think theta is most likely to be.
And so, given the training set S, in the sort of Bayesian procedure, we would calculate the posterior probability of my parameters given my training set. Let's write this on the next board. So my posterior on my parameters given my training set, by Bayes' rule, will be proportional to, you know, this:

p(theta | S) ∝ p(S | theta) p(theta)

Right? So, by Bayes' rule. Let's call it the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a new prediction on the price of a new house, on the input X, I would say that, well, the distribution over the possible housing prices for this new house I'm trying to estimate the price of - given the size of the house, the features of the house X, and the training set I had previously - is going to be given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set:

p(y | x, S) = ∫_theta p(y | x, theta) p(theta | S) d theta

Okay? And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate over Y times that posterior:

E[y | x, S] = ∫_y y p(y | x, S) dy

You would take the expectation of Y with respect to your posterior distribution. Okay?

And you notice that when I was writing this down, with the Bayesian formulation, I've now started to write Y given X comma theta, because this formula now is the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta, I'm writing comma theta, because I'm now treating theta as a random variable. So all of this is somewhat abstract but this is - and it turns out - actually, let's check. Are there questions about this? No? Okay. Let's try to make this more concrete.
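As a minimal numerical sketch of that prediction integral - this is not from the lecture - consider Bayesian linear regression with Gaussian noise and the Gaussian prior above. There the posterior p(theta | S) happens to be Gaussian in closed form, so E[y | x, S] can be approximated by sampling theta from the posterior and averaging; the noise level sigma and all variable names below are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: y = theta^T x + noise, noise ~ N(0, sigma^2),
# with prior theta ~ N(0, tau^2 I). All values here are made up.
rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))
theta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.3, 1.0
y = X @ theta_true + sigma * rng.normal(size=m)

# For this Gaussian model the posterior p(theta | S) is Gaussian in
# closed form, so we can compute it rather than just write it down.
Sigma_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(n) / tau**2)
mu_post = Sigma_post @ (X.T @ y) / sigma**2

# Approximate the integral E[y | x, S] by Monte Carlo: sample theta from
# the posterior and average E[y | x, theta] = theta^T x over the samples.
x_new = rng.normal(size=n)
samples = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
print(np.mean(samples @ x_new))   # ~ mu_post @ x_new for this linear model
```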
It turns out that for many problems, both of these steps in the computation are difficult. Because if, you know, theta is an n-plus-one-dimensional parameter vector, then this is an integral over an n-plus-one-dimensional space, over R^{n+1}. And numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral, if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms, say, Bayesian logistic regression, this is hard to do.

And so what's commonly done is, instead of actually computing the full posterior distribution p(theta | S), we'll take this quantity on the right-hand side and just maximize it. So let me write this down. Commonly, instead of computing the full posterior distribution, we will choose the following. Okay? We will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta - the most probable value of theta - under your posterior distribution. And that's just

theta_MAP = arg max_theta prod_{i=1}^{m} p(y^(i) | x^(i), theta) p(theta)

And then when you need to make a prediction, you know, you would just predict using your usual hypothesis h_{theta_MAP}, using this MAP value of theta in place of - as the parameter vector you'd choose. Okay?

And notice, the only difference between this and standard maximum likelihood estimation is that instead of choosing the maximum likelihood value of theta, you're maximizing this, which is what you have for maximum likelihood estimation, times this other quantity, which is the prior. Right?
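For linear regression in particular, this MAP estimate with the Gaussian prior above has a closed form. Here's a minimal sketch, not from the lecture: the equivalence to ridge regression with lambda = sigma^2 / tau^2 is a standard fact, but the function name and interface are mine.

```python
import numpy as np

def map_estimate(X, y, sigma2, tau2):
    """MAP estimate for linear regression with prior theta ~ N(0, tau2 * I).

    Maximizing prod_i p(y_i | x_i, theta) * p(theta) with Gaussian noise of
    variance sigma2 is the same as ridge regression with lam = sigma2 / tau2.
    """
    lam = sigma2 / tau2
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# As tau2 grows, the prior flattens, lam -> 0, and theta_MAP approaches
# the ordinary maximum likelihood (least squares) estimate.
```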
And, let's see, one intuition is that if your prior is theta being Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? It's a Gaussian centered around the point zero, and so most of the mass is close to zero.

And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection: if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta equal to zero. All right? So if you set theta five equal to zero, that's the same as, you know, eliminating feature five from your hypothesis. And so, this is a prior that drives most of the parameter values to values close to zero. And you can think of this as doing something analogous to, or reminiscent of, feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures: if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial - well, I think that had too many bumps in it, but never mind. If you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice - it's sort of hard to see, but you can take my word for it - as tau becomes smaller and smaller, the curves you tend to fit to your data become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?
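Here's a tiny sketch of that picture in code, again not from the lecture: fit a fourth-order polynomial to five points by MAP and watch the parameter norm shrink as tau decreases. The data, the assumed noise variance, and the tau values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 5)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=5)
Phi = np.vander(x, 5)          # polynomial features x^4, x^3, ..., x, 1

for tau in [10.0, 1.0, 0.1]:
    lam = 0.01 / tau**2        # lambda = sigma^2 / tau^2, assuming sigma^2 = 0.01
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)
    print(f"tau = {tau:5.1f}   ||theta|| = {np.linalg.norm(theta):8.3f}")
# Smaller tau => smaller ||theta|| => a smoother fitted polynomial.
```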
Let's see. And one last piece of intuition that I would just toss out there - and you get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess - is that whereas maximum likelihood for, say, linear regression turns out to be minimizing this:

sum_{i=1}^{m} (y^(i) - theta^T x^(i))^2

it turns out that if you add this prior term there, the optimization objective you end up optimizing turns out to be that:

sum_{i=1}^{m} (y^(i) - theta^T x^(i))^2 + lambda ||theta||^2

where you add an extra term that, you know, penalizes your parameters theta for being large. And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect - again, it's kind of hard to see, but just take my word for it - that shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay?

Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check questions about all this.

Student: The smoothing behavior - is it because [inaudible] actually get different [inaudible]?

Instructor: Let's see. Yeah. It depends on - well, most priors with most of the mass close to zero will get this effect, I guess. And just by convention, the Gaussian prior is what's used most, the most common for models like logistic regression and linear regression - generalized linear models. There are a few other priors that people sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.

And so it turns out that for problems like text classification - text classification has like 30,000 features or 50,000 features - it seems like an algorithm like logistic regression would be very much prone to overfitting. Right? So imagine trying to build a spam classifier: maybe you have 100 training examples but you have 30,000 features or 50,000 features. That seems clearly to be prone to overfitting. Right?
But it turns out that with this sort of Bayesian regularization, with the [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor: Yeah, right. And so, to pick either tau squared or lambda - I think the relation is lambda equals one over tau squared - but right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So, all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. And so, you know, when you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what's called batch learning algorithms, where you're given a training set, then you get to run your learning algorithm on the training set, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning.

So here's how the problem goes. All right? I'm first gonna give you x1. Let's say this is a classification problem, so I'm first gonna give you x1 and then ask you, you know, "Can you make a prediction on x1? Is the label one or zero?" And you've not seen any data yet. And so you make a guess. Right? We'll call your guess y-hat-1. And after you've made your prediction, I will then reveal to you the true label y1. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly. And then I show you x2.
And then I ask you, "Can you make a prediction on x2?" And so you now maybe are gonna make a slightly more educated guess, and call that y-hat-2. And after you've made your guess, I reveal the true label to you. And then I show you x3, you make your guess, and learning proceeds like this.

So this is different from batch machine learning; this models settings where you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're running a website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes. And it's only, you know, as you're making predictions that you get shown more and more training examples.

So in online learning, what you care about is the total online error, which is

sum_{i=1}^{m} 1{y-hat^(i) ≠ y^(i)}

if you get the sequence of m examples all together. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that, you know, many of the learning algorithms you've learned about can be applied to this setting. One thing you could do is, when you're asked to make a prediction - to output y-hat-3, say - one simple thing to do is: well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on those examples. So just run the learning algorithm on all the examples you've seen previous to being asked to make a prediction on a certain example, and then use your learning algorithm to make a prediction on the next example. And it turns out that there are also algorithms, especially the algorithms that we saw you could train with stochastic gradient descent, that can be adapted very nicely to this.

So as a concrete example, if you remember the perceptron algorithm, say, right, you would initialize the parameters theta to be equal to zero.
And then after seeing the i-th training example, you'd update the parameters, you know, using - you've seen this rule a lot of times now, right - using the standard perceptron learning rule:

theta := theta + alpha (y^(i) - h_theta(x^(i))) x^(i)

And the same thing if you were using logistic regression: you can, again, after seeing each training example, just run, essentially, one step of stochastic gradient descent on just the example you saw. Okay?

And so the reason I've put this into the sort of "learning theory" section of this class is because it turns out that sometimes you can prove fairly amazing results on your total online error using algorithms like these. I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features x^(i) are maybe infinite-dimensional feature vectors, like we saw for support vector machines - and sometimes you may use infinite-dimensional feature vectors via kernel representations. Okay? It turns out that you can prove that when you use the perceptron algorithm, even when the data is maybe extremely high-dimensional and it seems like you'd be prone to overfitting, right, so long as the positive and negative examples are separated by a margin - so in this infinite-dimensional space, so long as, you know, there is some margin down there separating the positive and negative examples - you can prove that the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after seeing only a finite number of examples, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay? So let's see.
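Here's a short sketch of the online protocol with a perceptron learner. This isn't from the lecture itself, and the {-1, +1} label convention with a mistake-driven update is just one standard formulation of the rule.

```python
import numpy as np

def online_perceptron(stream):
    """Run the perceptron in the online setting and count the total online
    error. `stream` yields (x, y) pairs with labels y in {-1, +1}."""
    theta = None
    mistakes = 0
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x)          # initialize theta = 0
        y_hat = 1 if theta @ x >= 0 else -1   # predict before the label is revealed
        if y_hat != y:
            mistakes += 1                     # total online error counts mistakes
            theta = theta + y * x             # update only when we got it wrong
    return mistakes
```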
The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that I mean, you know, it won't appear on the midterm and you won't be asked about it specifically in the problem sets, but I thought it'd be - I know some of you were curious after the previous lecture about why you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And so the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay?

But regardless of the theoretical results, you know, the online learning setting is something that comes up reasonably often, and so these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today - and this was going to be the majority of today's lecture; actually, can I switch to the PowerPoint slides, please? - is I actually want to spend most of today's lecture sort of talking about advice for applying different machine learning algorithms. And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right?
And what I want to do today is give you some advice on how to apply them really powerfully. Because, you know, the same tool - it turns out that you can take the same machine learning tool, say logistic regression, and you can ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right? And so what I want to do today, in the rest of the time I have, is try to convey to you, you know, some of the methods for how to make sure you're one of the people who really knows how to get these learning algorithms to work well on problems.

So, just some caveats on what I'm gonna talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, conceptually most difficult material in this class to understand. All right? So this is not mathematical, but this is not easy. And I want to add this caveat: some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything I say. And some of what I'll say is also not good advice for doing machine learning research either; I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work: if you work in a company and you want to deliver a product, or you're, you know, building a system and you just want your machine learning system to work. Okay? Some of what I'm about to say isn't great advice if your goal is to invent a new machine learning algorithm, but this is advice for how to make a machine learning algorithm work and, you know, deploy a working system.

So, three key areas I'm gonna talk about. One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis. And third, I want to talk about advice for how to get started on a machine learning problem.
And one theme that'll come up later is - you've heard about premature optimization, right, in writing software? This is when someone over-designs from the start, when someone, you know, is writing a piece of code and they choose a subroutine to optimize heavily, and maybe they write the subroutine in assembly or something. Many of us have been guilty of premature optimization, where we're trying to get a piece of code to run faster, and we pick a piece of code, implement it in assembly, and really tune it and get it to run really quickly - and it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature optimization. And in undergraduate programming classes, we warn people all the time not to do premature optimization, and people still do it all the time. Right? And it turns out a very similar thing happens in building machine learning systems: many people are often guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system and that turns out not to be the important piece. Okay? So I'll talk about that later as well.

So let's first talk about debugging learning algorithms. As a motivating example, let's say you want to build an anti-spam system, and let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right? So instead of using 50,000 words, you've chosen a small set of 100 features to use for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right? So this is Bayesian logistic regression:

max_theta sum_{i=1}^{m} log p(y^(i) | x^(i), theta) - lambda ||theta||^2

and so it's just like maximum likelihood but, you know, with that additional lambda-norm-of-theta-squared term. And we're maximizing rather than minimizing, so there's a minus lambda theta squared instead of a plus lambda theta squared. So the question is: you implemented your Bayesian logistic regression algorithm, you tested it on your test set, and you got unacceptably high error. So what do you do next? Right?
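For concreteness, here's a minimal sketch of the objective just described, not the system from the story: the training loop, step size, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective_and_grad(theta, X, y, lam):
    """Bayesian logistic regression objective: log-likelihood minus
    lam * ||theta||^2 (maximized, hence the minus sign on the penalty)."""
    p = sigmoid(X @ theta)
    eps = 1e-12                                # avoid log(0)
    loglik = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loglik - lam * theta @ theta, X.T @ (y - p) - 2 * lam * theta

def train(X, y, lam=1.0, alpha=0.01, iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                     # plain gradient ascent
        _, grad = objective_and_grad(theta, X, y, lam)
        theta = theta + alpha * grad
    return theta
```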
So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do, instead of saying, "Well, let's sit down and figure out what actually went wrong, and then fix that."

Well, obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect that even 100 features were too many, so you might try to get a smaller set of features. What's more common is you might suspect your features aren't good enough, so you might spend some time looking at the email headers and see if you can figure out better features for, you know, finding spam emails or whatever - so you sit around and come up with better features, such as from the email headers. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer to see if that works - and clearly that can't hurt, right, just run gradient descent longer. Or maybe you remember hearing from class that maybe Newton's method converges better, so let's try that instead. You may want to tune the value for lambda, because you're not sure that was the right choice, or maybe you even want to try an SVM, because maybe you think an SVM might work better than logistic regression.

So I only listed eight things here, but you can imagine, if you were actually sitting down building a machine learning system, the options available to you are endless. You can think of, you know, hundreds of ways to improve a learning system. And some of these things - like, well, getting more training examples, surely that's gonna help, so that seems like a good use of your time. Right?

And it turns out that this approach of picking ways to improve the learning algorithm, picking one, and going for it might work, in the sense that it may eventually get you to a working system. But often it's very time-consuming, and I think it's often largely a matter of luck whether you end up fixing what the problem is. In particular, these eight improvements all fix very different problems.
And some of them will be fixing problems that you don't have. And if you can rule out six of these eight, say - if by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do - you can save yourself a lot of time. So let's see how we can go about doing that.

The people in industry and in research that I see who are really good would not go and try to change a learning algorithm randomly. There are lots of things that could obviously improve your learning algorithm, but the problem is there are so many of them it's hard to know what to do. So you find that all the really good people run various diagnostics to figure out where the problem is, and then fix exactly that. Okay?

So, for our motivating story, right, we said: let's say Bayesian logistic regression test error was 20 percent, which let's say is unacceptably high. And let's suppose you suspected the problem is either overfitting, so it's high bias, or - oh, excuse me, I think I wrote that wrong. Let's forget the labels for a second. Suppose you suspect the problem is either high bias or high variance. You want to know if you're overfitting, which would be high variance, or if you have too few features to classify spam with, which would be high bias. I had those two reversed, sorry. Okay?

So how can you figure out whether the problem is one of high bias or high variance? Right? It turns out there's a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon we'd seen previously for high variance problems: when you have high variance, the training error will be much lower than the test error. All right? When you have a high variance problem, that's when you're fitting your training set very well.
That's when you're fitting, you know, a tenth-order polynomial to 11 data points. All right? That's when you're fitting the data set very well, and so your training error will be much lower than your test error. And in contrast, if you have high bias, that's when your training error will also be high. Right? That's when your data is quadratic, say, but you're fitting a linear function to it, and so you aren't even fitting your training set well.

So, just in cartoons, I guess, this is what a typical learning curve for high variance looks like. On the horizontal axis I'm plotting the training set size m, right, and on the vertical axis I'm plotting the error. And so, let's see: if you have a high variance problem, you'll notice that as the training set size m increases, your test set error will keep on decreasing. And so this sort of suggests that, well, if you can increase the training set size even further - maybe if you extrapolate the green curve out - maybe that test set error will decrease even further. All right?

Another thing that's useful to plot here - let's say the red horizontal line is the desired performance you're trying to reach - another useful thing to plot is actually the training error. Right? And it turns out that your training error will actually grow as a function of the training set size, because the larger your training set, the harder it is to fit, you know, your training set perfectly. Right? So this is just a cartoon, don't take it too seriously, but in general your training error will actually grow as a function of your training set size. Because with small training sets - if you have one data point, it's really easy to fit that perfectly, but if you have 10,000 data points, it's much harder to fit them all perfectly. All right?

And so another diagnostic for high variance, and the one that I tend to use more, is to just look at training versus test error. And if there's a large gap between them, then this suggests that, you know, getting more training data may help you close that gap. Okay?
So this is what the cartoon would look like in the case of high variance. And this is what the cartoon looks like for high bias. Right? If you look at the learning curve, you see that the curve for test error has flattened out already. And so this is a sign that, you know, if you get more training examples - if you extrapolate this curve further to the right - it's maybe not likely to go down much further. And this is a property of high bias: getting more training data won't necessarily help.

But again, to me the more useful diagnostic is if you plot the training error as well - if you look at your training error as well as your, you know, held-out test set error. If you find that even your training error is high, then that's a sign that getting more training data is not going to help. Right? In fact, you know, think about it: training error grows as a function of your training set size. And so if your training error is already above your level of desired performance, then getting even more training data is not going to reduce your training error down to the desired level of performance. Right? Because, you know, your training error sort of only gets worse as you get more and more training examples. So if you extrapolate further to the right, it's not like this blue line will come back down to the level of desired performance; it will stay up there. Okay?

So for me personally, when looking at a curve like the green curve on test error, I actually tend to find it very difficult to tell if the curve is still going down or if it's flattened out. Sometimes you can tell, but very often it's somewhat ambiguous. So for me personally, the diagnostic I tend to use most often to tell if I have a bias problem or a variance problem is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay?
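Here's what that diagnostic looks like as code - a sketch rather than anything from the slides, where `train_fn` and `error_fn` are placeholder hooks for whatever learner you happen to be debugging.

```python
def learning_curve(train_fn, error_fn, X, y, X_test, y_test, sizes):
    """For each training set size m, fit on the first m examples and record
    (m, training error, test error). A large gap that is still closing
    suggests high variance; both errors high and flat suggests high bias."""
    curves = []
    for m in sizes:
        model = train_fn(X[:m], y[:m])
        train_err = error_fn(model, X[:m], y[:m])
        test_err = error_fn(model, X_test, y_test)
        curves.append((m, train_err, test_err))
    return curves
```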
And so, going back to the list of fixes: look at the first fix - getting more training examples is a way to fix high variance. Right? If you have a high variance problem, getting more training examples will help. Trying a smaller set of features: that also fixes high variance. All right? Trying a larger set of features, or adding email features: these are solutions that fix high bias. Right? High bias being: your hypothesis was too simple, you didn't have enough features. Okay?

And so, quite often you see people working on machine learning problems, and they'll remember that getting more training examples helps. And so they'll build a learning system - build an anti-spam system - and it doesn't work, and then they go off and spend lots of time and money and effort collecting more training data, because they'll say, "Oh well, getting more data's obviously got to help." But if they had a high bias problem in the first place, and not a high variance problem, it's entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldn't possibly help. Right?

And this actually happens a lot in, you know, Silicon Valley and companies; this happens a lot. There will often be people building various machine learning systems, and you often see people spending six months working on fixing a learning algorithm when you could've told them six months ago that, you know, that fix couldn't possibly help. But because they didn't know what the problem was, they'd easily spend six months trying to invent new features or something. And you see this surprisingly often, and it's somewhat depressing. You could've gone to them and told them, "I could've told you six months ago that this was not going to help." And the six months is not a joke; you actually see this.

And in contrast, if you actually figure out whether the problem is one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay?
I actually want to talk about these four at the bottom as well. But before I move on, let me just check if there were questions about what I've talked about so far. No? Okay, great.

So bias versus variance is one thing that comes up often; it's one common diagnostic. For other machine learning problems, it's often up to your own ingenuity to figure out your own diagnostics, to figure out what's wrong. All right? So if a machine learning algorithm isn't working, very often it's up to you to construct your own tests. Do you look at the difference between training and test errors, or do you look at something else? It's often up to your own ingenuity to construct your own diagnostics to figure out what's going on.

What I want to do is go through another example. This one is slightly more contrived, but it'll illustrate another common question that comes up, another one of the most common issues in applying learning algorithms.

So in this example, let's say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error on non-spam mail. So it's rejecting 98 percent of your spam, meaning 2 percent of all spam gets through, which is fine; but it's also rejecting 2 percent of your good email, 2 percent of the email from your friends, and that's unacceptably high, let's say.

And let's say that a support vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is more the acceptable performance you want. And let's say, for the sake of this example, you're trying to build an anti-spam system, and you really want to deploy logistic regression to your customers, because of computational efficiency, or because you need to retrain overnight every day, and logistic regression just runs more easily and more quickly, or something. Okay? So let's say you want to deploy logistic regression, but it's just not working out well. So the question is: what do you do next?
So it turns out that the issue that comes up here, the one other common question, is whether the algorithm is converging. You might suspect that maybe the problem with logistic regression is that it's just not converging; maybe you need to run more iterations. And it turns out that if you look at the optimization objective (say logistic regression is optimizing J of theta) and you plot the objective as a function of the number of iterations, the curve sort of looks like it's going up, but it also sort of looks like there are asymptotes. And when you look at these curves, it's often very hard to tell if the curve has already flattened out. All right? You look at these curves a lot, and when you look at J of theta like this, it's often hard to tell whether the algorithm has converged. You can run it ten times as long to see if it's flattened out, and it'll often still look like maybe it's going up very slowly, or something. Right? So you'd like a better diagnostic for whether logistic regression has converged than looking at this curve.

The other thing you might suspect is a problem is: are you optimizing the right function? So what you care about in spam, say, is a weighted accuracy function like that. So a of theta is a sum over your examples of some weights times whether you got that example right. And the weight may be higher for non-spam than for spam mail, because you care about getting your predictions correct on non-spam email, the email from your friends, much more than on spam, say. So let's say a of theta is the objective that you really care about, but what Bayesian logistic regression optimizes is a quantity like that: it's this sort of maximum likelihood thing, and then with this two-norm penalty term that we saw previously. And you might be wondering: is this the right optimization function to be optimizing?
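For concreteness, here is one way to write down the two objectives just described, as I understand them from the lecture; the weights $w^{(i)}$ and the regularization weight $\lambda$ are whatever you choose:

$$a(\theta) = \sum_i w^{(i)} \, 1\{\, h_\theta(x^{(i)}) = y^{(i)} \,\}$$

$$J(\theta) = \sum_i \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) \;-\; \lambda \|\theta\|^2$$

Here $a(\theta)$ is the weighted accuracy you actually care about, and $J(\theta)$ is the regularized log-likelihood that Bayesian logistic regression maximizes.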
Or do I maybe need to change the value for lambda, to change that regularization parameter? Or should I maybe really be switching to the support vector machine optimization objective? Okay? Does that make sense?

So the second diagnostic I'm gonna talk about is for when you want to figure out: is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay?

So here's the diagnostic you can use. Just to reiterate the story: let's say an SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian logistic regression for your problem. Let me let theta subscript SVM be the parameters learned by the SVM, and I'll let theta subscript BLR be the parameters learned by Bayesian logistic regression.

So the optimization objective you care about is this weighted accuracy criterion that I talked about just now. And the support vector machine outperforms Bayesian logistic regression, so the weighted accuracy of the support vector machine's parameters is better than the weighted accuracy of Bayesian logistic regression's: a of theta SVM is greater than a of theta BLR.

Further, Bayesian logistic regression tries to optimize an optimization objective, which I denoted J of theta. And so the diagnostic I choose to use is to see whether J of theta SVM is bigger than or less than J of theta BLR. Okay? I'll explain this on the next slide.

So we know one fact: the weighted accuracy of the support vector machine is bigger than the weighted accuracy of Bayesian logistic regression. And in order to figure out whether Bayesian logistic regression has failed to converge, or whether I'm just optimizing the wrong objective function, the diagnostic I'm gonna use is to check which way the inequality between J of theta SVM and J of theta BLR goes. Okay? So let me explain this. In Case 1, it's just those two relations copied over.
In Case 1, let's say that J of theta SVM is greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; that's the definition of Bayesian logistic regression. So this means that the value of theta output by Bayesian logistic regression actually fails to maximize J, because the support vector machine returned a value of theta that does a better job of maximizing J. And so this tells me that Bayesian logistic regression didn't actually maximize J correctly, and the problem is with the optimization algorithm: the optimization algorithm hasn't converged.

The other case is as follows: J of theta SVM is less than or equal to J of theta BLR. Okay? In this case, what does that mean? This means that Bayesian logistic regression actually attains the higher value for the optimization objective J than the support vector machine does. But the support vector machine, which does worse on your optimization objective, actually does better on the weighted accuracy measure. So something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion, and therefore that J of theta is maybe the wrong optimization objective to be maximizing. Maximizing J of theta just wasn't a good objective to be choosing if you care about the weighted accuracy. Okay?

Can you raise your hand if this made sense? Cool, good.
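In code, the whole diagnostic is just one comparison. Here is a minimal sketch, under the assumption that you can evaluate both a(theta) and J(theta) on the two learned parameter vectors; all of the names below are hypothetical:

```python
# a(theta): weighted accuracy, higher is better (the thing you care about).
# J(theta): the objective Bayesian logistic regression maximizes.

def diagnose(theta_svm, theta_blr, a, J):
    # Premise of the story: the SVM wins on the metric you actually care about.
    assert a(theta_svm) > a(theta_blr)
    if J(theta_svm) > J(theta_blr):
        # BLR was supposed to maximize J, yet the SVM's parameters score
        # higher on J, so BLR never actually maximized it.
        return "optimization ALGORITHM: it hasn't converged"
    else:
        # BLR does maximize J, yet still loses on weighted accuracy, so
        # maximizing J doesn't track what you care about.
        return "optimization OBJECTIVE: change lambda or switch objectives"
```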
So that tells us whether the problem is with the optimization algorithm or with the optimization objective. And going back to this slide, the eight fixes we had: you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value for lambda (in that lambda times norm of theta squared term in your objective) fixes the optimization objective. And changing to an SVM is another way of trying to fix the optimization objective. Okay?

And so once again, you actually see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. That's another very common pattern. The problem is in the formula for your J of theta, and yet you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, and then trying more and more exotic optimization algorithms, when optimizing J of theta better wasn't going to fix the problem at all. Okay? So there's another example of how these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective.

Okay, let me think how much time I have. Hmm, let's see. Well okay, we have time. Let's do this: I'll show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is the most complex of the three I'm gonna do today, and I'll go through it somewhat quickly. It actually draws on reinforcement learning, which is something I'm not gonna talk about until close to the end of the course, but this is just a more complicated example of a diagnostic.
What I'll do is probably go over this fairly quickly, and then after we've talked about reinforcement learning in the class, I'll probably come back and redo this exact same example, because you'll understand it more deeply. Okay?

So some of you know that my students and I fly autonomous helicopters. So how do you get a machine learning algorithm to design the controller for a helicopter? This is what we do. The first step is to build a simulator for the helicopter; there's a screenshot of our simulator. It's like a joystick simulator; you can fly a helicopter in simulation. Then you choose a cost function (in reinforcement learning it's usually called a reward function, but for this I'll call it a cost function). Say J of theta is the expected squared error in your helicopter's position. Okay? So J of theta is maybe the expected squared error, or just the squared error.

And then you run a reinforcement learning algorithm (you'll learn about RL algorithms in a few weeks) in your simulator, to try to minimize this cost function, to try to minimize the squared error in how well you're controlling your helicopter's position. Okay? The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you'll use that to fly your helicopter.

So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. Then what do you do next? In particular, corresponding to the three steps above, there are three natural things you can try. (Oh, the bottom of the slide got chopped off.) You can try to improve the simulator: maybe you think your simulator isn't that accurate and you need to capture the aerodynamic effects more accurately, capture the airflow and turbulence effects around the helicopter more accurately. Maybe you need to modify the cost function; maybe your squared error isn't cutting it.
Maybe what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging, or something. Okay? So these are the diagnostics that my students and I actually use to figure out what's going on.

Actually, why don't you just think about this for a second, think about what you'd do, and then I'll go on and tell you what we do.

All right, so let me tell you how we do this, and you can see whether it's the same as your answer or not. And if you have a better idea than I do, let me know and I'll let you try it on my helicopter.

So here's the reasoning. Let's say the controller output by our reinforcement learning algorithm does poorly. Now suppose the contrary: suppose the following three things hold true. Suppose the helicopter simulator is accurate, so we have an accurate model of our helicopter. Suppose the reinforcement learning algorithm correctly controls the helicopter in simulation (we run the learning algorithm in simulation so that it can crash the helicopter and it's fine, right), so assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And suppose that minimizing J of theta does indeed correspond to correct autonomous flight.

If all of these things held true, then the parameters theta RL should actually fly well on my real helicopter. Right? And so the fact that the learned control parameters theta RL do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which of the three it is. Okay? So these are the diagnostics we use.
The first one is we look at the controller and see if it even flies well in simulation, in the simulator of the helicopter that we did the learning on. If the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts that the controller will fly well, but it doesn't actually fly well in real life, so the problem could be in the simulator, and we should spend our efforts improving the accuracy of our simulator.

Otherwise, let me write theta subscript human for the human control policy. All right? So let's go ahead and ask a human to fly the helicopter (it could be in the simulator, it could be in real life), and let's measure the mean squared error of the human pilot's flight. And let's see if the human pilot does better or worse than the learned controller in terms of optimizing this objective function J of theta. Okay?

So if even a very good human pilot attains a better value, meaning a lower value, on my optimization objective, on my cost function, than my learning algorithm does, then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm isn't managing to minimize J of theta, and the problem is in the reinforcement learning algorithm.

And finally, suppose the human actually attains a larger value for J of theta; the human actually has larger mean squared error for the helicopter position than my reinforcement learning algorithm does, and yet I like the way the human flies much better than my reinforcement learning algorithm.
So if that holds true, then clearly the problem is in the cost function, right, because the human does worse on my cost function but flies much better than my learning algorithm. (Oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide.) That means my learning algorithm does a better job of minimizing the cost function, yet doesn't fly as well as a human pilot. So minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay?

And so for these reinforcement learning problems, if something doesn't work (often reinforcement learning algorithms just work, but when they don't), these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm. And again, if you don't know which of the three problems it is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on this still. So it's entirely possible to go out and spend six years, write a PhD thesis, and build a much better helicopter simulator, but if you're fixing the wrong problem, it's not gonna help.

So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. What I've described are maybe some of the most common diagnostics, ones that I've used and seen be useful for many problems. But very often, you'll need to come up with your own for your own specific learning problem.
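Putting the three checks from the last few paragraphs together, here is a minimal sketch of the decision logic; the predicates and parameter vectors are hypothetical stand-ins, since the real checks involve actual simulator runs and test flights:

```python
# J is the cost to be MINIMIZED; theta_rl and theta_human are the learned
# controller and the human control policy, respectively.

def helicopter_diagnostic(theta_rl, theta_human, J,
                          flies_well_in_sim, flies_well_in_real):
    if flies_well_in_sim(theta_rl) and not flies_well_in_real(theta_rl):
        # Simulation says the controller is good but reality disagrees.
        return "problem is the SIMULATOR: improve its accuracy"
    if J(theta_human) < J(theta_rl):
        # A human beats the optimizer at the optimizer's own objective,
        # so the optimizer never actually minimized J.
        return "problem is the RL ALGORITHM: it isn't minimizing J"
    # The algorithm minimizes J at least as well as the human, yet the human
    # flies better, so low cost doesn't correspond to good flight.
    return "problem is the COST FUNCTION: change J"
```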
And I just want to point out that even when the learning algorithm is working well, it's often a good idea to run diagnostics like the ones I talked about, to make sure you really understand what's going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you understand your application problem better.

So some of you will graduate from Stanford and go on to get some amazingly high-paying job applying machine learning algorithms to some application problem of significant economic interest. Right? And you're gonna be working on one specific, important machine learning application for many months, or even for years. One of the most valuable things for you personally will be to build an intuitive understanding of what works and what doesn't work for your problem. Right now in the industry, in Silicon Valley and around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or years on end. And when you're doing that, solving a really important problem using learning algorithms, one of the most valuable things is just your own personal, intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems.

It turns out, by the way, there are some Silicon Valley companies that outsource their machine learning. So there will be a company in Silicon Valley, and they'll hire a firm in New York to run all their learning algorithms for them. And I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed over to an outsourcing agency, then you don't maintain that expertise. If there's a problem you really care about, it'll be your own understanding of the problem, built up over months, that'll be really valuable.
And if that knowledge is outsourced, you don't get to keep that knowledge yourself. I personally think that's a terrible idea. But I'm not a businessman; I just see people do that a lot. Let's see.

Another reason for running diagnostics like these is in writing research papers. Diagnostics and error analyses, which I'll talk about in a minute, often help to convey insight about the problem and help justify your research claims. So for example, rather than writing a research paper that says, "Oh well, here's an algorithm that works. I built this helicopter and it flies," or whatever, it's often much more interesting to say, "Here's an algorithm that works, and it works because of a specific component X. And moreover, here's the diagnostic that justifies the claim that X was the thing that fixed this problem and made it work." Okay?

So that leads me into a discussion of error analysis, which is often good machine learning practice: a way of understanding what your sources of error are. So this is what I call error analysis. And let me check for questions about this first. Yeah?

Student: What ended up being wrong with the helicopter?

Instructor (Andrew Ng): Oh, I don't know. Let's see, we've flown so many times. The thing that is most difficult with a helicopter, I don't know, it changes all the time. Quite often, it's actually the simulator; building an accurate simulator of a helicopter is very hard. Yeah. Okay.

So error analysis is a way of figuring out what is working in your algorithm and what isn't working. And we're gonna talk about two specific examples. So there are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example of this, not dissimilar in many ways from actual machine learning systems you see. Let's say you want to recognize people from images. This is a picture of one of my friends.
So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. And then you run a face detection algorithm, a machine learning algorithm to detect people's faces. And then, let's say you want to recognize the identity of the person; that's your application. You then segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth and so on. (I know; she might not want to be my friend after she sees this.) And then, having found all these features, based on what the nose looks like, what the eyes look like, and so on, you feed all the features into a logistic regression algorithm. And your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay?

So this is what error analysis is for. You have a long, complicated pipeline combining many machine learning components, many of them learned. And so it's often very useful to figure out how much of your error can be attributed to each of these components.

So what we'll do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that is, in the figure on the bottom right, let's say the overall accuracy of the system is 85 percent. Then I want to know where my 15 percent of error comes from. So what I'll do is go to my test set and, instead of running my own background removal, plug in the correct background removal: I go in and give my algorithm the correct background-versus-foreground segmentation. And if I do that (let's color that blue to denote that I'm giving it ground-truth data on the test set), let's assume our accuracy increases to 85.1 percent. Okay?
And now I'll go in and give my algorithm the ground-truth face detection output; on my test set, I'll just tell the algorithm where the face is. And if I do that, let's say my algorithm's accuracy increases to 91 percent, and so on. And then I'll go through each of these components and give each one the ground-truth label. For instance, the nose segmentation algorithm is trying to figure out where the nose is; I just go in and tell it where the nose is, so it doesn't have to figure that out. And as I do this, one component after another, I end up giving everything the correct output labels and end up with 100 percent accuracy. (Sorry, this table is cut off on the bottom; it says logistic regression, 100 percent.)

Now you can look at this table and see how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right? So this tells me that if only I could get better face detection, maybe I could boost my accuracy by 6 percent. Whereas in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent. And so this sort of diagnostic also tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction: even with the ground truth, it gives you at most 0.1 percent accuracy, whereas with better face detection, maybe there's a much larger potential for gains. Okay?

So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on. Right?
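As a sketch, the whole procedure is a loop that swaps in ground truth for one more component each time; the stage names and the evaluate function below are hypothetical placeholders, not the lecture's code:

```python
# `stages` is an ordered list of component names, e.g.
# ["background removal", "face detection", "eyes segmentation", ...].
# `evaluate(test_set, use_ground_truth=...)` is assumed to run the pipeline,
# replacing the listed components' outputs with annotated ground truth.

def error_analysis(stages, test_set, evaluate):
    rows = [("baseline (no ground truth)",
             evaluate(test_set, use_ground_truth=[]))]
    for k in range(1, len(stages) + 1):
        # Components stages[0:k] emit ground truth; the rest run for real.
        acc = evaluate(test_set, use_ground_truth=stages[:k])
        rows.append((stages[k - 1], acc))
    # The accuracy jump from one row to the next is the headroom you could
    # gain by perfecting that row's component.
    return rows
```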
And choosing the right piece is critical, and this sort of diagnostic tells you which piece may actually be worth your time to work on.

There's another type of analysis that's sort of the opposite of what I just talked about. The error analysis I just described tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline, some really bad performance, and your current performance.

So for this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm. So you added features for spelling correction, sender host features, email header features, email text parser features, JavaScript parser features, features for embedded images, and so on. Now let's say you review the system and you want to figure out how much each of these components actually contributed. Maybe you want to write a research paper and claim this was the piece that made the big difference; can you actually document that claim and justify it?

So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. In ablative analysis, instead of adding components one at a time, we remove components one at a time and see how performance degrades. So you start with your overall system, which has 99.9 percent accuracy. Then you remove spelling correction and see how much performance drops. Then you remove the sender host features and see how much performance drops, and so on. All right?

And so, in this contrived example, you see that the biggest drop occurred when you removed the text parser features. And so you can then make a credible case that the text parser features were what really made the biggest difference here. Okay?
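Here is the mirror-image loop as a minimal sketch, again with hypothetical names; train_and_score stands in for whatever retrains your classifier on a subset of the feature groups:

```python
# `feature_groups` is an ordered list of feature-group names, e.g.
# ["spelling correction", "sender host", "email header", "text parser", ...].
# `train_and_score(groups)` is assumed to train on those groups only and
# return test accuracy.

def ablative_analysis(feature_groups, train_and_score):
    rows = [("full system", train_and_score(list(feature_groups)))]
    remaining = list(feature_groups)
    for group in feature_groups:
        remaining.remove(group)  # remove one group at a time, in this order
        rows.append((f"without {group}", train_and_score(list(remaining))))
    # The biggest drop between consecutive rows is your best evidence for
    # which feature group mattered most (with the ordering caveat discussed
    # right after this).
    return rows
```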
And you can also tell, for instance, that removing the sender host features (on this line, performance dropped from 99.9 to 98.9) means that in case you want to get rid of the sender host features to speed up computation or something, that would be a good candidate for elimination. Okay?

Student: Are there any guarantees that if you shuffle around the order in which you drop those features, you'll get the same result?

Instructor (Andrew Ng): Yeah, let's address the question: what if you shuffle the order in which you remove things? The answer is no; there's no guarantee you'd get a similar result. So in practice, sometimes there's a fairly natural ordering for both types of analyses, the error analysis and the ablative analysis, in which you add things or remove things, and sometimes there isn't. And quite often, you either choose one ordering and just go for it, or... And don't think of these analyses as formulas that are set in stone; feel free to invent your own as well. One of the things that's done quite often is to take the overall system and just remove one component and then put it back, then remove a different one and put it back, until all of these are done. Okay.

So the very last thing I want to talk about is some general advice for how to get started on a learning problem. Here's a cartoon description of two broad ways to get started on a learning problem. The first one is to carefully design your system: you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure, and then you implement it and hope it works. All right? The benefit of this sort of approach is that you get maybe nicer, more scalable algorithms, and maybe you come up with new, elegant learning algorithms.
And if your goal is to contribute to basic research in machine learning, if your goal is to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem is sort of the right way to go about it: think deeply about a problem and invent new solutions.

The second sort of approach is what I call build-and-fix: you implement something quick and dirty, and then you run error analyses and diagnostics to figure out what's wrong, and you fix those errors. The benefit of this second type of approach is that it'll often get your application working much more quickly. And especially for those of you who end up working in a company: very often it's not the best product that wins; it's the first product to market that wins. So, especially in industry, there's really something to be said for building a system quickly and getting it deployed quickly. And the second approach, building a quick-and-dirty (I'm gonna say hack) and then fixing the problems, will actually get you to a system that works well much more quickly.

And the reason is that very often it's really not clear which parts of a system are easy or hard to build, and therefore which parts you need to spend lots of time focusing on. So take that example I talked about just now, identifying people. With a big, complicated learning system like this, a big, complicated pipeline, it's really not obvious at the outset which of these components you should spend lots of time working on. And if you didn't know that preprocessing wasn't the right component, you could easily have spent three months working on better background subtraction, not knowing that it's just not gonna ultimately matter.
And so the only way to find out what really works is to implement something quickly, and find out which parts are really the hard parts to implement, or which parts are hard parts that could make a difference in performance. In fact, I'd say that if your goal is to build a people recognition system, a system like this is actually far too complicated as your initial system. Maybe after you've prototyped a few systems you'd converge to a system like this, but if this is the first system you're designing, it's much too complicated.

Also, and this is a very concrete piece of advice that applies to your projects as well: if your goal is to build a working application, Step 1 is probably not to design a system like this. Step 1 is to plot your data (there's a small sketch of this below). And very often, if you just take the data you're trying to predict and plot it, plot X, plot Y, plot your data every way you can think of, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they should be positive. Something's wrong with this dataset." About half the time you find something obviously wrong with your data, or something very surprising. And this is something you find out just by plotting your data, and that you won't find out by implementing big, complicated learning algorithms on it. Plotting the data sounds so simple; it's one of those pieces of advice that lots of us give but hardly anyone follows. So you can take that for what it's worth.

Let me just reiterate: what I just said may be bad advice if your goal is to come up with new machine learning algorithms. All right? For me personally, the learning algorithm I use most often is probably logistic regression, because I have code lying around. So given a learning problem, I probably won't try anything more complicated than logistic regression on it first. And it's only after trying something really simple, and figuring out what's easy and what's hard, that you know where to focus your efforts.
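As a minimal sketch of that plot-your-data step, with made-up file and column names; the point is only to eyeball the data before building anything complicated:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("my_dataset.csv")   # hypothetical dataset
print(df.describe())                 # ranges, means, obvious anomalies

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["y"], bins=50)                 # does the target look sane?
axes[0].set_title("target distribution")
axes[1].scatter(df["x1"], df["y"], s=5)        # any structure, or junk?
axes[1].set_title("x1 versus target")
plt.tight_layout()
plt.show()
```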
But 1417 01:16:03,940 --> 01:16:07,610 again, if your goal is to invent new machine learning algorithms, then you sort of don't 1418 01:16:07,610 --> 01:16:10,750 want to hack up something and then add another hack to fix it, and hack it even more to 1419 01:16:10,750 --> 01:16:12,219 fix it. Right? So if 1420 01:16:12,219 --> 01:16:15,919 your goal is to do novel machine learning research, then it pays to think more deeply about the 1421 01:16:15,919 --> 01:16:21,340 problem and not follow this advice too literally. 1422 01:16:21,340 --> 01:16:22,919 Shoot, you know what? All 1423 01:16:22,919 --> 01:16:28,280 right, sorry, I'm running late, but I just have two more slides so I'm gonna go through these quickly. 1424 01:16:28,280 --> 01:16:30,620 And so, this is what I think 1425 01:16:30,620 --> 01:16:33,459 of as premature statistical optimization, 1426 01:16:33,459 --> 01:16:35,079 where, 1427 01:16:35,079 --> 01:16:38,320 just like with premature optimization of code, quite often 1428 01:16:38,320 --> 01:16:44,369 people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more 1429 01:16:44,369 --> 01:16:46,949 slides. This 1430 01:16:46,949 --> 01:16:48,539 is 1431 01:16:48,539 --> 01:16:52,070 a sort of cartoon that highly influenced my own thinking. It was based on 1432 01:16:52,070 --> 01:16:55,340 a paper written by Christos Papadimitriou. 1433 01:16:55,340 --> 01:16:57,429 This is how 1434 01:16:57,429 --> 01:16:59,360 progress - this is how 1435 01:16:59,360 --> 01:17:02,360 the developmental progress of research often happens. Right? 1436 01:17:02,360 --> 01:17:05,559 Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. It 1437 01:17:05,559 --> 01:17:06,519 seems like a useful thing to have. 1438 01:17:06,519 --> 01:17:09,670 Right? You know, free people up so they don't have 1439 01:17:09,670 --> 01:17:12,760 to deliver mail. Well, 1440 01:17:12,760 --> 01:17:14,280 to deliver mail, 1441 01:17:14,280 --> 01:17:19,139 obviously you need a robot to wander around indoor environments and you need a robot to 1442 01:17:19,139 --> 01:17:21,480 manipulate objects and pick up envelopes. And so, 1443 01:17:21,480 --> 01:17:24,890 you need to build those two components in order to get a mail delivery robot. And 1444 01:17:24,890 --> 01:17:25,590 so I've 1445 01:17:25,590 --> 01:17:29,650 drawn those two components with little arrows to denote that, you know, obstacle avoidance 1446 01:17:29,650 --> 01:17:30,460 is 1447 01:17:30,460 --> 01:17:32,229 needed or would help build 1448 01:17:32,229 --> 01:17:35,510 your mail delivery robot. Well, 1449 01:17:35,510 --> 01:17:37,189 for obstacle avoidance, 1450 01:17:37,189 --> 01:17:43,159 clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. 1451 01:17:43,159 --> 01:17:46,840 Now we're gonna use computer vision to detect the objects. And so, 1452 01:17:46,840 --> 01:17:51,120 we know that, you know, lighting sometimes changes, right, depending on whether it's the 1453 01:17:51,120 --> 01:17:52,709 morning or noontime or evening. 1454 01:17:52,709 --> 01:17:53,930 Lighting 1455 01:17:53,930 --> 01:17:56,639 causes the color of things to change, and so you need 1456 01:17:56,639 --> 01:18:00,509 an object detection system that's invariant to the specific colors of an 1457 01:18:00,509 --> 01:18:01,199 object. Right?
1458 01:18:01,199 --> 01:18:04,420 Because lighting 1459 01:18:04,420 --> 01:18:05,400 changes, 1460 01:18:05,400 --> 01:18:09,849 say. Well, color - RGB values - is represented by three-dimensional vectors. And 1461 01:18:09,849 --> 01:18:11,170 so you need to learn 1462 01:18:11,170 --> 01:18:13,499 when two colors might be the same thing - 1463 01:18:13,499 --> 01:18:15,260 when, you know, 1464 01:18:15,260 --> 01:18:18,159 the visual appearances of two colors may be the same thing under just a lighting change or 1465 01:18:18,159 --> 01:18:19,539 something. 1466 01:18:19,539 --> 01:18:20,590 And 1467 01:18:20,590 --> 01:18:24,059 to understand that properly, we can go out and study the differential geometry 1468 01:18:24,059 --> 01:18:27,510 of 3D manifolds, because that helps us build a sound theory on which 1469 01:18:27,510 --> 01:18:32,250 to develop our 3D similarity learning algorithms. 1470 01:18:32,250 --> 01:18:36,159 And to really understand the fundamental aspects of this problem, 1471 01:18:36,159 --> 01:18:40,110 we have to study the complexity of non-Riemannian geometries. And on 1472 01:18:40,110 --> 01:18:43,850 and on it goes until eventually you're proving convergence bounds for 1473 01:18:43,850 --> 01:18:49,790 sampled non-monotonic logic. I don't even know what that is, because I just made it up. 1474 01:18:49,790 --> 01:18:51,530 Whereas in reality, 1475 01:18:51,530 --> 01:18:53,970 you know, chances are that link isn't real. 1476 01:18:53,970 --> 01:18:55,660 Color invariance 1477 01:18:55,660 --> 01:18:59,550 just barely helped object recognition, maybe. I'm making this up. 1478 01:18:59,550 --> 01:19:03,499 Maybe differential geometry was hardly gonna help 3D similarity learning, and that link's also gonna fail. Okay? 1479 01:19:03,499 --> 01:19:05,270 So, each of 1480 01:19:05,270 --> 01:19:09,130 these circles can represent a person, or a research community, or a thought in your 1481 01:19:09,130 --> 01:19:12,020 head. And there's a very real chance that 1482 01:19:12,020 --> 01:19:15,469 maybe there are all these papers written on the differential geometry of 3D manifolds, and they were 1483 01:19:15,469 --> 01:19:18,570 written because some guy once told someone else that it'll help 3D similarity learning. 1484 01:19:18,570 --> 01:19:20,489 And, 1485 01:19:20,489 --> 01:19:23,369 you know, it's like, "A friend of mine told me that color invariance would help in 1486 01:19:23,369 --> 01:19:26,119 object recognition, so I'm working on color invariance. And now I'm gonna tell a friend 1487 01:19:26,119 --> 01:19:27,440 of mine 1488 01:19:27,440 --> 01:19:30,280 that his thing will help my problem. And he'll tell a friend of his that his thing will help 1489 01:19:30,280 --> 01:19:31,619 with his problem." 1490 01:19:31,619 --> 01:19:33,520 And pretty soon, you're working on 1491 01:19:33,520 --> 01:19:37,540 convergence bounds for sampled non-monotonic logic, when in reality none of this will 1492 01:19:37,540 --> 01:19:39,129 ever see the light of 1493 01:19:39,129 --> 01:19:42,519 day in your mail delivery robot. Okay? 1494 01:19:42,519 --> 01:19:46,599 I'm not criticizing the role of theory. There are very powerful theories, like the 1495 01:19:46,599 --> 01:19:48,400 theory of VC dimension, 1496 01:19:48,400 --> 01:19:52,090 which is far, far, far to the right of this. So VC dimension is about 1497 01:19:52,090 --> 01:19:53,289 as theoretical 1498 01:19:53,289 --> 01:19:57,119 as it can get. And it's clearly had a huge impact on many applications.
And it's, 1499 01:19:57,119 --> 01:19:59,559 you know, dramatically advanced machine learning. And another example is the theory of NP-hardness, which again, you know, 1500 01:19:59,559 --> 01:20:00,750 is about as 1501 01:20:00,750 --> 01:20:04,219 theoretical as it can get. It's 1502 01:20:04,219 --> 01:20:05,800 had a huge impact 1503 01:20:05,800 --> 01:20:09,309 on all of computer science, the theory of NP-hardness. 1504 01:20:09,309 --> 01:20:10,669 But 1505 01:20:10,669 --> 01:20:13,799 when you are off working on highly theoretical things, I guess, to me 1506 01:20:13,799 --> 01:20:16,849 personally, it's important to keep in mind: 1507 01:20:16,849 --> 01:20:19,699 are you working on something like VC dimension, which is high impact, or are you 1508 01:20:19,699 --> 01:20:23,290 working on something like convergence bounds for sampled non-monotonic logic, which 1509 01:20:23,290 --> 01:20:24,710 you're only hoping 1510 01:20:24,710 --> 01:20:25,900 has some peripheral relevance 1511 01:20:25,900 --> 01:20:30,040 to some application? Okay? 1512 01:20:30,040 --> 01:20:34,849 For me personally, I tend to work on an application only if I - excuse me. 1513 01:20:34,849 --> 01:20:36,989 For me personally, and this is a personal choice, 1514 01:20:36,989 --> 01:20:41,340 I tend to trust a theory only if I personally can see a link from the 1515 01:20:41,340 --> 01:20:42,679 theory I'm working on 1516 01:20:42,679 --> 01:20:44,430 all the way back to an application. 1517 01:20:44,430 --> 01:20:46,010 And 1518 01:20:46,010 --> 01:20:50,299 if I don't personally see a direct link from what I'm doing to an application, then, 1519 01:20:50,299 --> 01:20:53,429 you know, that's fine. I can choose to work on theory, but 1520 01:20:53,429 --> 01:20:55,650 I wouldn't necessarily trust that 1521 01:20:55,650 --> 01:20:59,210 the theory I'm working on will relate to an application, if I don't personally 1522 01:20:59,210 --> 01:21:02,429 see a link all the way back. 1523 01:21:02,429 --> 01:21:04,400 Just to summarize. 1524 01:21:04,400 --> 01:21:06,409 1525 01:21:06,409 --> 01:21:08,679 One lesson to take away from today is, I think, 1526 01:21:08,679 --> 01:21:12,529 time spent coming up with diagnostics for learning algorithms is often time well spent. 1527 01:21:12,529 --> 01:21:13,029 1528 01:21:13,029 --> 01:21:16,199 It's often up to your own ingenuity to come up with great diagnostics. And 1529 01:21:16,199 --> 01:21:19,019 when I personally work on a machine learning algorithm, 1530 01:21:19,019 --> 01:21:21,169 it's not uncommon for me to be spending 1531 01:21:21,169 --> 01:21:23,679 between a third and often half of my time 1532 01:21:23,679 --> 01:21:26,409 just writing diagnostics and trying to figure out what's going right and what's 1533 01:21:26,409 --> 01:21:28,079 going wrong. 1534 01:21:28,079 --> 01:21:31,500 Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and 1535 01:21:31,500 --> 01:21:34,780 making progress. You don't want to be spending all this time, you know, implementing tests on your 1536 01:21:34,780 --> 01:21:38,280 learning algorithms; it doesn't feel like you're doing anything. But when 1537 01:21:38,280 --> 01:21:41,419 I implement learning algorithms, at least a third, and quite often half, of 1538 01:21:41,419 --> 01:21:45,880 my time is actually spent implementing those tests, so you can figure out what to work on.
And 1539 01:21:45,880 --> 01:21:49,219 I think it's actually one of the best uses of your time. We talked 1540 01:21:49,219 --> 01:21:50,729 about error 1541 01:21:50,729 --> 01:21:54,319 analyses and ablative analyses, and lastly we 1542 01:21:54,319 --> 01:21:56,890 talked about, you know, different approaches and the 1543 01:21:56,890 --> 01:22:00,979 risks of premature statistical optimization. Okay. 1544 01:22:00,979 --> 01:22:04,339 Sorry I ran over. I'll be here for a few more minutes for your questions.
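(For reference, here is a minimal sketch of one standard diagnostic of the kind discussed above: a learning curve comparing training and validation accuracy as the training set grows. The synthetic dataset and model are placeholders, not anything from the lecture; a large persistent gap between the curves suggests high variance, while two low, converged curves suggest high bias.)

```python
# Sketch of a learning-curve diagnostic (placeholder data and model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train acc={tr:.3f}  val acc={va:.3f}")
# A large, persistent train/val gap suggests high variance (get more
# data or regularize); two low, converged curves suggest high bias.
```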