This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is actually wrap up our discussion on learning theory. I'm going to start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be advice on applying machine learning algorithms to problems like the project, or other problems you may go work on after you graduate from this class. But let's start with Bayesian statistics and regularization.

So you remember, last week we started to talk about learning theory, and we learned about bias and variance. And we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that?
So feature selection algorithms choose a subset of the features so that you have fewer parameters and may be less likely to overfit. Right? What I want to do today is talk about a different way to prevent overfitting: there's a method called regularization, and it's a way that lets you keep all the parameters.

So here's the idea, and I'm going to illustrate it with, say, linear regression. You take the linear regression model, the very first model we learned about, and we said that we would choose the parameters via maximum likelihood. Right? And that meant that you would choose the parameters theta that maximized the probability of the data — the parameters theta that maximized the likelihood of the data we observe. Right?

And to give this sort of procedure a name, this is one example of the most common frequentist procedure. Frequentist statistics you can think of as maybe one school of statistics. And the philosophical view behind writing this down was that we envisioned there was some true parameter theta out there that generated the Xs and the Ys.
There's some true parameter theta that governs housing prices — Y as a function of X — and we don't know what the value of theta is, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta.

And the way we formulated this, theta was not a random variable. Right? That's what we said: theta is just some fixed, true value out there. It's not random or anything; we just don't know what it is, and we have a procedure called maximum likelihood for estimating it. So this is one example of what's called a frequentist procedure.

The alternative to the frequentist school of statistics is the Bayesian school, in which we're going to say that we don't know what theta is, and so we will put a prior on theta. Okay? So a Bayesian would say, "Well, I don't know the value of theta, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared times the identity: theta ~ N(0, tau^2 I). Okay?
And so — actually, let me use S to denote my training set. Right, so the prior represents my beliefs about what the parameters are in the absence of any data. Not having seen any data, it represents what I think theta is most likely to be.

And so, given the training set S, in the Bayesian procedure we would calculate the posterior probability of the parameters given my training set. Let's write this on the next board. So my posterior on my parameters given my training set — by Bayes' rule, this will be proportional to the likelihood of the data times the prior: p(theta | S) is proportional to p(S | theta) p(theta). Right? So, by Bayes' rule. Let's call it the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.
And when you now want to make a prediction on the price of a new house, with input X, I would say that the distribution over the possible housing prices for this new house I'm trying to estimate the price of — given the features X of the house and the training set I had previously — is given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set: p(Y | X, S) = the integral over theta of p(Y | X, theta) p(theta | S) d theta. Okay?

And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate over Y times that density: E[Y | X, S] = the integral over Y of Y p(Y | X, S) dY. You would take an expectation of Y with respect to your posterior distribution. Okay?

And notice that when I was writing this down, in the Bayesian formulation I started to write Y given X comma theta, because this formula now is the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta; I'm writing comma theta, because I'm now treating theta as a random variable. So all of this is somewhat abstract, but — actually, let's check.
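To make this predictive integral concrete, here is a small sketch that is not from the lecture itself: Bayesian linear regression, where (with Gaussian noise and the Gaussian prior) the posterior over theta happens to be Gaussian in closed form. The data, the noise variance sigma squared, and the prior variance tau squared are all made-up illustrative choices. The expectation E[Y | X, S] is approximated by averaging predictions over samples of theta drawn from the posterior, and checked against the exact conjugate answer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D linear data: y = 2*x + noise (illustrative assumption).
m, tau2, sigma2 = 20, 1.0, 0.25
X = rng.uniform(-1, 1, size=(m, 1))
y = 2.0 * X[:, 0] + rng.normal(0, np.sqrt(sigma2), size=m)

# With prior theta ~ N(0, tau^2 I) and Gaussian noise, the posterior
# p(theta | S) is itself Gaussian with these closed-form moments.
n = X.shape[1]
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(n) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# The predictive mean E[Y | x, S] = integral of (x . theta) p(theta | S) dtheta
# can be approximated by averaging over posterior samples of theta.
x_new = np.array([0.5])
thetas = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
mc_prediction = (thetas @ x_new).mean()

# In this conjugate case the integral also has an exact answer, x . mu_post,
# so we can check the Monte Carlo estimate against it.
exact_prediction = x_new @ mu_post
print(mc_prediction, exact_prediction)
```

In the high-dimensional, non-conjugate cases discussed next, neither the posterior moments nor the integral are available in closed form, which is what motivates the MAP approximation.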
Are there questions about this? No? Okay. Let's try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult, because if theta is an n-plus-one-dimensional parameter vector, then this is an integral over R^(n+1). And numerically, it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually this integral — actually, usually it's hard to compute the posterior over theta, and it's also hard to compute this integral if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms — say, Bayesian logistic regression — this is hard to do.

And so what's commonly done is to take the posterior distribution, and instead of actually computing the full posterior distribution p(theta | S), we'll instead take the quantity on the right-hand side and just maximize it. So let me write this down. So commonly, instead of computing the full posterior distribution, we will choose the following. Okay?
We will choose what's called the MAP estimate, or maximum a posteriori estimate of theta, which is the most likely value of theta — the most probable value of theta under your posterior distribution. And that's just theta_MAP = arg max over theta of p(S | theta) p(theta). And then when you need to make a prediction, you would just predict using your usual hypothesis, with this MAP value of theta as the parameter vector. Okay?

And notice, the only difference between this and standard maximum likelihood estimation is that instead of just maximizing the likelihood — which is what you have for maximum likelihood estimation — you're maximizing the likelihood times this other quantity, which is the prior. Right?

And let's see, one intuition is that if your prior on theta is Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? It's a Gaussian centered around the point zero, and so most of the mass is close to zero.
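For linear regression with Gaussian noise and the Gaussian prior, the MAP estimate actually has a closed form: it is ridge regression with regularization strength sigma squared over tau squared. The following minimal sketch (not from the lecture; the synthetic data and the sigma and tau values are arbitrary assumptions) computes theta_MAP and sanity-checks that the gradient of the log-posterior vanishes there, which is what "arg max" requires:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative): y approximately linear in two features.
m, sigma2, tau2 = 30, 0.25, 1.0
X = rng.normal(size=(m, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(0, np.sqrt(sigma2), size=m)

# Maximizing p(S | theta) p(theta) here has a closed-form solution:
# ridge regression with regularization strength sigma^2 / tau^2.
lam = sigma2 / tau2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Sanity check: gradient of the log-posterior should vanish at theta_MAP.
grad = X.T @ (y - X @ theta_map) / sigma2 - theta_map / tau2
print(theta_map, np.abs(grad).max())
```

Prediction then just uses the usual hypothesis with this value plugged in, i.e. h(x) = x . theta_map.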
And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection: if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta equal to zero. All right? So if you set theta-five equal to zero, that's the same as eliminating feature five from your hypothesis.

And so this is a prior that drives most of the parameter values to values close to zero, and you can think of this as doing something analogous to — something reminiscent of — feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures, if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial — well, I think that drawing had too many bumps in it, but never mind — if you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right?
In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. As you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice — it's sort of hard to see, but you can take my word for it — as tau becomes smaller and smaller, the curves you fit to your data also become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?

Let's see. One last piece of intuition that I'll just toss out there — and you get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess — is that whereas maximum likelihood for, say, linear regression turns out to be minimizing the sum of squared errors, it turns out that if you add this prior term, the optimization objective you end up with adds an extra term — lambda times the squared norm of theta — that penalizes your parameters theta for being large.
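The shrinking effect on the five-points, fourth-order-polynomial example can be seen numerically. In this sketch (the data is made up, and `lam` stands in for 1 over tau squared, up to constants), the maximum likelihood fit interpolates all five points, while the MAP fit with a small tau pulls the coefficients toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)

# Five noisy, roughly linear points, fit with a fourth-order polynomial
# (five coefficients, the same count as data points).
x = np.linspace(-1, 1, 5)
y = x + rng.normal(0, 0.1, size=5)
X = np.vander(x, 5)  # columns x^4, x^3, ..., x^0

def map_fit(lam):
    # MAP / ridge solution; lam plays the role of 1 / tau^2 (up to constants).
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

theta_ml = map_fit(0.0)     # maximum likelihood: interpolates all 5 points
theta_map = map_fit(10.0)   # small tau (large lam): coefficients shrunk

print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))
```

The regularized coefficient vector has strictly smaller norm, which is what produces the smoother curve.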
And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect — again, it's kind of hard to see, but just take my word for it — that shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay? Hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check for questions about all this.

Student: The smoothing behavior — is it because [inaudible] actually get different [inaudible]?

Instructor: Let's see. Yeah, it depends — well, most priors with most of the mass close to zero will give this effect, I guess. And just by convention, the Gaussian prior is the one most commonly used for models like logistic regression and linear regression — generalized linear models. There are a few other priors that are sometimes used, like the Laplace prior, but all of them tend to have these sorts of smoothing effects. All right. Cool.

And so it turns out that for problems like text classification — where text classification has like 30,000 features or 50,000 features — it seems like an algorithm like logistic regression would be very much prone to overfitting. Right?
So imagine trying to build a spam classifier. Maybe you have 100 training examples but you have 30,000 features or 50,000 features; that seems clearly prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with a Gaussian prior, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor: Yeah, right — how to pick either tau squared or lambda. I think the relation is lambda equals one over tau squared. But right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So, all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend five minutes talking about online learning. And this is sort of a digression. When you're designing the syllabus of a class, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it. Okay. So here's the idea.
So far, all the learning algorithms we've talked about are what are called batch learning algorithms, where you're given a training set, you run your learning algorithm on the training set, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning.

So here's how the problem proceeds. I'm first going to give you x1. Let's say it's a classification problem, so I'm first going to give you x1 and then ask you, "Can you make a prediction on x1? Is the label one or zero?" And you've not seen any data yet, so you make a guess; we'll call your guess y-hat-one. And after you've made your prediction, I will then reveal to you the true label y1. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent if you guess randomly.

And then I show you x2, and I ask you, "Can you make a prediction on x2?" And so you now maybe make a slightly more educated guess; call that y-hat-two. And after you've made your guess, I reveal the true label to you. And then I show you x3, you make your guess, and learning proceeds like this.
So this is in contrast to batch learning; it models settings where you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're running a website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes, and it's only as you keep making predictions that you get to see more and more training examples.

So in online learning, what you care about is the total online error, which is the sum from i equals one to m — if you get a sequence of m examples all together — of the indicator that y-hat-i is not equal to y-i. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that many of the learning algorithms you've learned about can be applied to this setting. One thing you could do, when you're asked to make the prediction y-hat-three, say, is: well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on the examples leading up to this point.
So just run the learning algorithm on all the examples you've seen previous to being asked to make a prediction on a certain example, and then use the resulting hypothesis to make a prediction on the next example. And it turns out that there are also algorithms — especially the algorithms we saw that use stochastic gradient descent — that can be adapted very nicely to this setting.

As a concrete example, if you remember the perceptron algorithm, you would initialize the parameters theta to be equal to zero. And then, after seeing the i-th training example, you'd update the parameters — you've seen this rule a lot of times now — using the standard perceptron learning rule: theta := theta + alpha (y_i − h_theta(x_i)) x_i. And the same thing if you were using logistic regression: after seeing each training example, you'd essentially run one step of stochastic gradient descent on just the example you saw. Okay?

And the reason I've put this into the "learning theory" section of this class is that it turns out you can sometimes prove fairly amazing results on your total online error using algorithms like these.
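The online protocol with the perceptron can be sketched as follows. This is an illustration, not code from the course: the data stream, its margin, and the learning rate alpha = 1 are all made-up assumptions, and labels are in {0, 1}, so the update theta := theta + alpha (y − y-hat) x changes theta only on mistakes. The loop also tallies the total online error:

```python
import numpy as np

rng = np.random.default_rng(4)

# A linearly separable stream (illustrative): the label is 1 exactly when
# the first coordinate is positive, and |x0| >= 0.2 gives a margin.
m = 200
x0 = rng.uniform(0.2, 1.0, m) * rng.choice([-1.0, 1.0], m)
x1 = rng.uniform(-1.0, 1.0, m)
X = np.c_[x0, x1, np.ones(m)]               # constant feature for the intercept
y = (x0 > 0).astype(int)

theta = np.zeros(3)
mistakes = 0                                # the total online error

for i in range(m):
    y_hat = 1 if theta @ X[i] >= 0 else 0   # predict before the label is revealed
    mistakes += int(y_hat != y[i])
    theta += (y[i] - y_hat) * X[i]          # perceptron update (alpha = 1)

print(mistakes)
```

Because this stream is separated by a margin, the number of mistakes stays far below the number of examples, which is exactly the flavor of the perceptron result discussed next.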
I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features x_i are maybe infinite-dimensional feature vectors — like we saw for support vector machines; sometimes infinite-dimensional feature vectors arise when you use kernel representations — it turns out that even when the data is extremely high-dimensional and it seems like you'd be prone to overfitting, you can prove that so long as the positive and negative examples are separated by a margin — right, so long as, in this infinite-dimensional space, there is some margin separating the positive and negative examples — the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after making only a finite number of mistakes, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay? So, let's see.
The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I've posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. By that I mean it won't appear on the midterm, and you won't be asked about it specifically in the problem sets, but I thought — I know some of you were curious after the previous lecture about how you can prove that SVMs have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And the perceptron bound that I just talked about is the simplest instance I know of that you can read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so. So go ahead and take a look at that if you're interested. Okay? But regardless of the theoretical results, the online learning setting is something that comes up reasonably often.
And so these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today — and this is the majority of today's lecture — actually, can I switch to the PowerPoint slides, please? — I want to spend most of today's lecture talking about advice for applying different machine learning algorithms.

And so, right now, you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully, because it turns out that you can take the same machine learning tool — say, logistic regression — and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right? And so what I want to do, in the rest of the time I have today, is try to convey to you some of the methods for making sure you're one of the people who really knows how to get these learning algorithms to work well on problems.
So, just some caveats on what I'm going to talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, most conceptually difficult material in this class to understand. All right? So this is not mathematical, but this is not easy.

And I want to add this caveat: some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything.

And some of what I'll say is also not good advice for doing machine learning research, so I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work — if you work at a company and you want to deliver a product, or you're building a system and you just want your machine learning system to work. Okay? Some of what I'm about to say isn't great advice if your goal is to invent a new machine learning algorithm; this is advice for how to make a machine learning algorithm work and deploy a working system.

So, three key areas I'm going to talk about. One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis.
And third, I want to talk about advice for how to get started on a machine learning problem. One theme that will come up later: you've heard about premature optimization in writing software, right? That's when someone over-designs from the start, when someone writing a piece of code picks a subroutine to optimize heavily, and maybe writes the subroutine in assembly. Many of us have been guilty of premature optimization: we're trying to get a piece of code to run faster, we pick some piece of code, implement it in assembly, really tune it to run quickly, and it turns out that wasn't the bottleneck in the code at all. We call that premature optimization. In undergraduate programming classes we warn people all the time not to do premature optimization, and people still do it all the time. And it turns out a very similar thing happens in building machine learning systems: many people are guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system, and that part turns out not to be the important piece. Okay? I'll talk about that later as well. So let's first talk about debugging learning algorithms.
As a motivating example, let's say you want to build an anti-spam system. And let's say you've carefully chosen a small set of 100 words to use as features. So instead of using 50,000 words, you've chosen a small set of 100 features for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and get 20 percent test error, which is unacceptably high. This is Bayesian logistic regression, so it's just like maximum likelihood but with that additional lambda times norm-of-theta-squared term. And we're maximizing rather than minimizing, so there's a minus lambda norm of theta squared instead of a plus. So the question is: you implemented your Bayesian logistic regression algorithm, you tested it on your test set, and you got unacceptably high error, so what do you do next? One thing you could do is think about ways you could improve this algorithm, and this is probably what most people will do, instead of sitting down and thinking about what could have gone wrong and then trying to fix it. Well, obviously having more training data could only help, so one thing you can try is getting more training examples.
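To make that objective concrete, here is a minimal sketch of "maximum likelihood minus a lambda times norm-of-theta-squared term" as code. This is not the course's actual implementation; the toy gradient-ascent loop, learning rate, and iteration count are illustrative choices of mine.

```python
import numpy as np

def map_objective(theta, X, y, lam):
    """The quantity Bayesian logistic regression maximizes:
    log-likelihood minus lambda * ||theta||^2 (maximizing, so the
    penalty enters with a minus sign)."""
    z = X @ theta
    # log p(y|x; theta) for y in {0, 1}, written stably:
    # y*z - log(1 + e^z)
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    return log_lik - lam * np.dot(theta, theta)

def fit_map(X, y, lam, lr=0.1, iters=5000):
    """Batch gradient ascent on the penalized log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        preds = 1.0 / (1.0 + np.exp(-(X @ theta)))
        # gradient of log-likelihood is X^T (y - p); penalty adds -2*lam*theta
        grad = X.T @ (y - preds) - 2.0 * lam * theta
        theta += lr * grad / len(y)
    return theta
```

On a toy separable dataset this drives the penalized objective up and classifies the training points correctly, which is all the sketch is meant to show.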
Maybe you suspect that even 100 features was too many, so you might try a smaller set of features. What's more common is that you suspect your features aren't good enough, so you might spend some time looking at the email headers to see if you can figure out better features for finding spam emails. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer and see if that works; clearly that can't hurt. Or maybe you remember hearing in class that Newton's method converges better, so let's try that instead. You may want to tune the value of lambda, because you're not sure that was the right choice, or maybe you even want to try an SVM, because maybe an SVM will work better than logistic regression. I've only listed eight things here, but you can imagine that if you were actually sitting down building a machine learning system, the options available to you are endless; you can think of hundreds of ways to improve a learning system.
And some of these things, like getting more training examples, surely that's going to help, so it seems like a good use of your time. Right? But it turns out that this strategy of picking ways to improve the learning algorithm, choosing one, and going for it might work in the sense that it may eventually get you to a working system, but it's often very time-consuming, and I think it's largely a matter of luck whether you end up fixing what the problem actually is. In particular, these eight improvements all fix very different problems, and some of them will be fixing problems that you don't have. If you can rule out six of the eight, say, if by looking at the problem more deeply you can figure out which one of these eight things is actually the right thing to do, you can save yourself a lot of time. So let's see how we can go about doing that. The people in industry and in research that I see being really good at this do not go and change the learning algorithm randomly. There are lots of things that could improve your learning algorithm, but the problem is there are so many of them it's hard to know what to do. So you find that the really good people run various diagnostics to figure out where the problem is. Okay?
So for our motivating story: we said Bayesian logistic regression's test error was 20 percent, which let's say is unacceptably high. And suppose you suspect the problem is either that you're overfitting, which would be high variance, or that you have too few features for classifying spam, which would be high bias. (On the slide I'd written those two reversed; sorry about that.) So how can you figure out whether the problem is one of high bias or high variance? It turns out there's a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon we saw previously for high variance problems: when you have high variance, the training error will be much lower than the test error. A high variance problem is when you're fitting your training set very well; that's when you're fitting, say, a tenth-order polynomial to 11 data points.
All right? That's when you're fitting the data set very well, and so your training error will be much lower than your test error. In contrast, if you have high bias, your training error will also be high. That's when your data is quadratic, say, but you're fitting a linear function to it, and so you aren't even fitting your training set well. So, just in cartoons, this is what a typical learning curve for high variance looks like. On the horizontal axis I'm plotting the training set size m, and on the vertical axis I'm plotting the error. If you have a high variance problem, you'll notice that as the training set size m increases, your test set error keeps on decreasing. And this suggests that if you could increase the training set size even further, if you extrapolate the green curve out, maybe the test set error would decrease even further. Say the red horizontal line is the desired performance you're trying to reach; another useful thing to plot is the training error.
And it turns out that your training error will actually grow as a function of the training set size, because the larger your training set, the harder it is to fit your training set perfectly. This is just a cartoon, so don't take it too seriously, but in general your training error grows with your training set size: with a small training set, if you have one data point, it's really easy to fit that perfectly, but if you have 10,000 data points, it's much harder to fit them perfectly. So another diagnostic for high variance, and the one I tend to use more, is to just look at training versus test error: if there's a large gap between them, that suggests that getting more training data may help close the gap. Okay? So that's what the cartoon looks like in the case of high variance. This is what the cartoon looks like for high bias: if you look at the learning curve, you see that the curve for test error has already flattened out. And that's a sign that if you get more training examples, if you extrapolate this curve further to the right, it's not likely to go down much further.
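These two learning curves are cheap to compute for a real system. Here is one way to sketch the computation, assuming a generic `fit` routine that returns a prediction function and using 0/1 misclassification as the error; the majority-class model in the test is just a stand-in, not anything from the lecture.

```python
import numpy as np

def learning_curve(fit, X_train, y_train, X_test, y_test, sizes):
    """For each training-set size m, fit on the first m examples and
    record training error (on those m examples) and test error.
    Both errors are 0/1 misclassification rates."""
    train_err, test_err = [], []
    for m in sizes:
        predict = fit(X_train[:m], y_train[:m])
        train_err.append(np.mean(predict(X_train[:m]) != y_train[:m]))
        test_err.append(np.mean(predict(X_test) != y_test))
    return np.array(train_err), np.array(test_err)
```

Plotting `train_err` and `test_err` against `sizes` gives the blue and green curves from the cartoon; notice that even a trivial model shows training error growing with m.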
And this is a property of high bias: getting more training data won't necessarily help. But again, to me the more useful diagnostic is to plot your training error as well as your held-out test error. If you find that even your training error is high, that's a sign that getting more training data is not going to help. In fact, think about it: training error grows as a function of your training set size. So if your training error is already above your level of desired performance, then getting even more training data is not going to bring your training error down to the desired level of performance, because your training error only gets worse as you get more and more training examples. If you extrapolate further to the right, it's not as if this blue line will come back down to the level of desired performance; it will stay up there. Okay? For me personally, when looking at a curve like the green test-error curve, I actually find it very difficult to tell whether the curve is still going down or has flattened out. Sometimes you can tell, but very often it's somewhat ambiguous.
So for me personally, the diagnostic I use most often to tell whether I have a bias problem or a variance problem is to look at training and test error and see whether they're very close together or relatively far apart. Okay? And so, going back to the list of fixes: the first fix, getting more training examples, is a way to fix high variance; if you have a high variance problem, getting more training examples will help. Trying a smaller set of features also fixes high variance. Trying a larger set of features, or adding email features, are fixes for high bias, high bias being when your hypothesis was too simple, when you didn't have enough features. Okay? And quite often you see people working on machine learning problems who remember that getting more training examples helps. So they'll build a learning system, build an anti-spam system, and it doesn't work, and then they go off and spend lots of time and money and effort collecting more training data, because "getting more data has obviously got to help."
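The rule of thumb in the last few minutes can be written down directly. This is just the lecture's heuristic restated as code; the gap threshold is an arbitrary illustrative choice of mine, not something from the lecture.

```python
def diagnose(train_err, test_err, desired_err, gap_tol=0.05):
    """Crude bias/variance diagnostic: a large train/test gap
    suggests high variance (more data or fewer features may help);
    training error already above the target suggests high bias
    (more data won't fix it)."""
    hints = []
    if test_err - train_err > gap_tol:
        hints.append("high variance: try more data or fewer features")
    if train_err > desired_err:
        hints.append("high bias: try more or better features")
    return hints or ["train and test error both near target"]
```

For the anti-spam story (1 percent training error, 20 percent test error, 5 percent target), this flags high variance; with 18 percent training error it would flag high bias instead.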
But if they had a high bias problem in the first place, and not a high variance problem, it's entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldn't possibly help. Right? And this actually happens a lot in Silicon Valley companies. There will often be people building various machine learning systems, and you often see someone spending six months working on fixing a learning algorithm when you could have told them six months ago that the fix couldn't possibly help. But because they didn't know what the problem was, they'd easily spend six months trying to invent new features or something. You see this surprisingly often, and it's somewhat depressing; and the six months is not a joke, you actually see this. In contrast, if you figure out whether the problem is one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay? I actually want to talk about the four fixes at the bottom as well, but before I move on, let me just check if there are questions about what I've talked about so far. No? Okay, great.
So bias versus variance is one thing that comes up often; it's one common diagnostic. For other machine learning problems, it's often up to your own ingenuity to figure out your own diagnostics for what's wrong. If a machine learning algorithm isn't working, very often it's up to you to construct your own tests: do you look at the difference between training and test errors, or do you look at something else? It's often up to your own ingenuity to construct your own diagnostics to figure out what's going on. What I want to do is go through another example. This one is slightly more contrived, but it illustrates another common question, another of the most common issues that come up in applying learning algorithms. So in this example, let's say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error on non-spam mail. So it's rejecting 98 percent of your spam mail, and 2 percent of all spam gets through, which is fine; but it's also rejecting 2 percent of your good email, 2 percent of the email from your friends, and that's unacceptably high, let's say.
And let's say that a support vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is more like the acceptable performance you want. And let's say, for the sake of this example, that you're trying to build an anti-spam system and you really want to deploy logistic regression to your customers, because of computational efficiency, or because you need to retrain overnight every day, and logistic regression just runs more easily and more quickly. So you want to deploy logistic regression, but it's just not working out well. So the question is: what do you do next? It turns out that the issue that comes up here, the other common question, is whether the algorithm is converging. You might suspect that the problem with logistic regression is that it's just not converging; maybe you need to run more iterations. And it turns out that, again, if you look at the optimization objective, say logistic regression is optimizing J of theta, and you plot the objective as a function of the number of iterations, the curve sort of looks like it's going up, but it also sort of looks like there's an asymptote.
And when you look at these curves, it's often very hard to tell whether the curve has already flattened out. You look at these curves a lot, and you ask: has the algorithm converged? When you look at J of theta like this, it's often hard to tell. You can run it ten times as long and see if it's flattened out, and it will often still look like maybe it's going up very slowly, or something. So we'd like a better diagnostic for whether logistic regression has converged than looking at this curve. The other thing you might suspect is a problem is whether you're optimizing the right function. What you care about in spam, say, is a weighted accuracy function like this: a of theta is a sum over your examples of some weights times whether you got each example right. And the weight may be higher for non-spam than for spam mail, because you care about getting your predictions correct on non-spam email much more than on spam email, say. So let's say a of theta is the objective that you really care about, but Bayesian logistic regression instead optimizes a quantity like that. Right?
It's this sort of maximum likelihood thing, with the two-norm penalty that we saw previously. And you might be wondering: is this the right function to be optimizing? Or do I maybe need to change the value of lambda? Or should I really be switching to the support vector machine's optimization objective? Does that make sense? So the second diagnostic I'm going to talk about is for figuring out: is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay? So here's the diagnostic you can use. To reiterate the story: an SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian logistic regression for your problem. Let theta subscript SVM be the parameters learned by the SVM, and let theta subscript BLR be the parameters learned by Bayesian logistic regression. The optimization objective you care about is the weighted accuracy criterion I talked about just now, and the support vector machine outperforms Bayesian logistic regression on it.
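The weighted accuracy a of theta is just a per-example weighted 0/1 score. Here is a minimal version, where the particular choice of weights (say, ten times higher for non-spam) is up to you; the `predict` function is a stand-in for whatever classifier the parameters define.

```python
import numpy as np

def weighted_accuracy(predict, X, y, w):
    """a(theta): sum of w_i * 1{prediction correct on example i},
    normalized by the total weight.  Larger w_i on non-spam examples
    makes mistakes on good email cost more."""
    correct = (predict(X) == y).astype(float)
    return np.dot(w, correct) / np.sum(w)
```

Under these weights, a classifier that never flags spam can still score well, which is exactly why the weighting has to reflect what you actually care about.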
And so the weighted accuracy on the support vector machine's parameters is better than the weighted accuracy on Bayesian logistic regression's parameters. Further, Bayesian logistic regression tries to optimize an optimization objective like the one I denoted J of theta. And so the diagnostic I choose to use is to see whether J of theta SVM is bigger than or less than J of theta BLR. Okay? I'll explain this on the next slide. So we know one fact: the weighted accuracy of the support vector machine is bigger than the weighted accuracy of Bayesian logistic regression. So in order to figure out whether Bayesian logistic regression is failing to converge, or whether I'm just optimizing the wrong objective function, the diagnostic I'm going to use is to check whether that inequality between the J values holds. So let me explain. Case 1 is just those two equations copied over: let's say that J of theta SVM is greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; that's the definition of Bayesian logistic regression.
So this means that the value of theta output by Bayesian logistic regression actually fails to maximize J, because the support vector machine returned a value of theta that does a better job of maximizing J. And so this tells me that Bayesian logistic regression didn't actually maximize J correctly, and so the problem is with the optimization algorithm: the optimization algorithm hasn't converged. The other case is as follows: J of theta SVM is less than or equal to J of theta BLR. Okay? In this case, what does that mean? This means that Bayesian logistic regression actually attains a higher value of the optimization objective J than the support vector machine does. The support vector machine, which does worse on your optimization objective, actually does better on the weighted accuracy measure. So something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion.
And therefore, this tells you that J of theta is maybe the wrong optimization objective to be maximizing: maximizing J of theta just wasn't a good objective to choose if you care about the weighted accuracy. Okay? Can you raise your hand if this made sense? Cool, good. So that tells us whether the problem is with the optimization algorithm or with the optimization objective. And going back to the slide with the eight fixes, you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value of lambda, in that lambda times norm of theta squared term in your objective, fixes the optimization objective, and changing to an SVM is also another way of trying to fix the optimization objective. Okay? And once again, you actually see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm.
That's another very common pattern: the problem is in the formula for your J of theta, but you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, then trying more and more elaborate optimization algorithms, whereas optimizing J of theta better wasn't going to fix the problem at all. Okay? So that's another example of when these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Let me think how much time I have. Hmm, let's see. Okay, we have time; let's do this. I'll show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is actually the most complex of the three examples I'm going to do today, and it draws on reinforcement learning, which is something I won't talk about until close to the end of the course, but it's just a more complicated example of a diagnostic, so I'll go over it somewhat quickly.
What I'll do is probably go over this fairly quickly, and then after we've talked about reinforcement learning in the class, I'll probably come back and redo this exact same example, because you'll understand it more deeply. Okay? So some of you know that my students and I fly autonomous helicopters. How do you get a machine learning algorithm to design the controller for a helicopter? This is what we do. All right? The first step is to build a simulator for a helicopter; there's a screenshot of our simulator. It's like a joystick simulator; you can fly a helicopter in simulation. And then you choose a cost function; it's actually called a reward function in reinforcement learning, but here I'll call it a cost function. Say J of theta is the expected squared error in your helicopter's position. Okay? So J of theta is maybe the expected squared error, or just the squared error. And then we run a reinforcement learning algorithm - you'll learn about RL algorithms in a few weeks - in your simulator, to try to minimize this cost function, that is, to minimize the squared error of how well you're controlling your helicopter's position. Okay? The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you'll use that to fly your helicopter.
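The cost function described here can be sketched as plain squared error of position; the function name, argument names, and array shapes below are illustrative assumptions, not the actual implementation from this work.

```python
import numpy as np

def cost_J(positions, targets):
    """J(theta) as (mean) squared error of the helicopter's position
    along a flown trajectory. Both arguments are (T, 3) arrays of 3-D
    positions over T timesteps; shapes are purely illustrative."""
    positions = np.asarray(positions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Squared distance from the target position at each timestep,
    # averaged over the trajectory.
    return float(np.mean(np.sum((positions - targets) ** 2, axis=1)))
```

The reinforcement learning step then searches for controller parameters theta that make this number small in simulation.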
So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. Then what do you do next? In particular, corresponding to the three steps above, there are three natural things you can try. Right? You can try to - oh, the bottom of the slide got chopped off - you can try to improve the simulator. Maybe you think your simulator isn't that accurate: you need to capture the aerodynamic effects more accurately, you need to capture the airflow and turbulence effects around the helicopter more accurately. Maybe you need to modify the cost function: maybe squared error isn't cutting it, maybe what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging, or something. Okay? So these are the diagnostics that my students and I actually use to figure out what's going on. Actually, why don't you just think about this for a second and think about what you'd do, and then I'll go on and tell you what we do. All right, so let me tell you how we do this, and see whether it's the same as yours or not.
And if you have a better idea than I do, let me know and I'll let you try it on my helicopter. So here's the reasoning behind this experiment. Right? Let's say the controller output by our reinforcement learning algorithm does poorly. Well, suppose the following three things hold true - suppose the contrary, I guess. Suppose that the helicopter simulator is accurate, so let's assume we have an accurate model of our helicopter. And let's suppose that the reinforcement learning algorithm correctly controls the helicopter in simulation - we tend to run the learning algorithm in simulation so that the learning algorithm can crash a helicopter and it's fine. Right? So let's assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And let's suppose that minimizing J of theta does indeed correspond to accurate, correct autonomous flight. If all of these things held true, then that means that the parameters theta RL should actually fly well on my real helicopter. Right?
And so the fact that the learned control parameters, theta RL, do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which of these three assumptions is wrong. Okay? So these are the diagnostics we use. The first one is: we look at the controller and see if it even flies well in simulation. Right? In the simulator of the helicopter that we did the learning on. And so if the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts the helicopter's controller will fly well, but it doesn't actually fly well in real life; so the problem could be in the simulator, and we should spend our efforts improving the accuracy of our simulator. Otherwise, let me write theta subscript human for the human control policy. All right? So let's go ahead and ask a human to fly the helicopter - it could be in the simulator, it could be in real life - and let's measure the mean squared error of the human pilot's flight. And let's see if the human pilot does better or worse than the learned controller, in terms of optimizing this objective function J of theta. Okay?
So if the human does better - if even a very good human pilot attains a better, meaning lower, value on my optimization objective, on my cost function, than my learning algorithm does - then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm is not managing to minimize J of theta, and that tells me the problem is in the reinforcement learning algorithm. And finally, if the human actually attains a larger value for J of theta - the human actually has a larger mean squared error for the helicopter position than my reinforcement learning algorithm does - and yet I like the way the human flies much better than my reinforcement learning algorithm, then clearly the problem is in the cost function. Right? Because the human does worse on my cost function but flies much better than my learning algorithm. And so that means the problem is in the cost function.
It means - oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide - it means that my learning algorithm does a better job minimizing the cost function but doesn't fly as well as a human pilot. So that tells you that minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay? And so for these reinforcement learning problems, if something doesn't work - often reinforcement learning algorithms just work, but when they don't work - these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm. And again, if you don't know which of the three the problem is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on this still. So it's entirely possible to go out and spend six years and write a PhD thesis and build a much better helicopter simulator - but if you're fixing the wrong problem, it's not gonna help.
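The three-way decision just walked through can be written down as a simple rule. This is a sketch of the reasoning only; the function and argument names are illustrative, and J is the cost being minimized (lower is better), with the premise that theta RL flies poorly on the real helicopter.

```python
def helicopter_diagnostic(flies_well_in_sim, J_human, J_rl):
    """Which of the three assumptions is broken? Premise: the learned
    controller theta_RL flies poorly on the real helicopter, and the
    human pilot flies well. J is the cost to be *minimized*."""
    if flies_well_in_sim:
        # Sim says it should work, reality disagrees: model mismatch.
        return "improve the simulator"
    if J_human < J_rl:
        # A human attains lower cost than the "minimizer" found,
        # so the RL algorithm isn't actually minimizing J.
        return "fix the reinforcement-learning algorithm"
    # Human has higher cost yet flies better: minimizing J
    # doesn't correspond to good autonomous flight.
    return "change the cost function J"
```

Each branch directs months of effort, which is exactly why running the cheap measurement first pays off.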
So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. And unfortunately, what I've described are maybe some of the most common diagnostics that I've used, that I've seen be useful for many problems; but very often, you need to come up with your own for your own specific learning problem. And I just want to point out that even when the learning algorithm is working well, it's often a good idea to run diagnostics like the ones I talked about, to make sure you really understand what's going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you to understand your application problem better. So some of you will graduate from Stanford and go on to get some amazingly high-paying job applying machine learning algorithms to some application problem of significant economic interest. Right? And you're gonna be working on one specific important machine learning application for many months, or even for years. One of the most valuable things for you personally will be to get an intuitive understanding of what works and what doesn't work on your problem.
Right now in the industry, in Silicon Valley or around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or for years on end. And when you're doing that - solving a really important problem using learning algorithms - one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems. It turns out, by the way, that there are some Silicon Valley companies that outsource their machine learning. So sometimes there's a company in Silicon Valley and they'll hire a firm in New York to run all their learning algorithms for them. And I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed to an outside agency, then you don't maintain that expertise. If there's a problem you really care about, then it'll be your own understanding of the problem, built up over months, that'll be really valuable; and if that knowledge is outsourced, you don't get to keep that knowledge yourself. I personally think that's a terrible idea.
But I'm not a businessman; I just see people do that a lot. Let's see. Another reason for running diagnostics like these is actually in writing research papers. Right? Diagnostics and error analyses, which I'll talk about in a minute, often help to convey insight about the problem and help justify your research claims. So for example, rather than writing a research paper that says, "Oh, well, here's an algorithm that works. I built this helicopter and it flies," or whatever, it's often much more interesting to say, "Here's an algorithm that works, and it works because of a specific component X. And moreover, here's the diagnostic that gives you justification that X was the thing that fixed the problem and made it work." Okay? So that leads me into a discussion of error analysis, which is often good machine learning practice; it's a way of understanding what your sources of error are. So, what I call error analysis - and let's check for questions about this first. Yeah? Student: What ended up being wrong with the helicopter? Instructor (Andrew Ng): Oh, I don't know. Let's see. We've flown so many times. The thing that is most difficult with a helicopter - I don't know, it changes all the time. Quite often, it's actually the simulator.
Building an accurate simulator of a helicopter is very hard. Yeah. Okay. So, error analysis: this is a way of figuring out what is working in your algorithm and what isn't working, and we're gonna talk about two specific examples. So there are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example of this, not dissimilar in many ways from actual machine learning systems you see. Let's say you want to recognize people from images. This is a picture of one of my friends. So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. And then you run a face detection algorithm - a machine learning algorithm to detect people's faces. Right? And then, let's say you want to recognize the identity of the person; this is your application. You then segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth, and so on. I know; she might not want to be my friend after she sees this.
And then, having found all these features - based on what the nose looks like, what the eyes look like, whatever - you feed all the features into a logistic regression algorithm. And your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay? So this is what error analysis is for. You have a long, complicated pipeline combining many machine learning components - many of these would be learned components. And so it's often very useful to figure out how much of your error can be attributed to each of these components. So what we'll do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that is - in the figure on the bottom right - let's say the overall accuracy of the system is 85 percent. Right? Then I want to know where my 15 percent of error comes from. And so what I'll do is go to my test set and, instead of running my learned background removal, actually implement the correct background removal: go in and give my algorithm what is the correct background versus foreground.
And if I do that - let's color that blue to denote that I'm giving that ground-truth data in the test set - let's assume our accuracy increases to 85.1 percent. Okay? And now I'll go in and give my algorithm the ground-truth face detection output; so on my test set, I'll just tell the algorithm where the face is. And if I do that, let's say my algorithm's accuracy increases to 91 percent, and so on. And then I'll go through each of these components and just give it the ground-truth label for each component. For example, the nose segmentation algorithm is trying to figure out where the nose is; I just go in and tell it where the nose is, so that it doesn't have to figure that out. And as I do this, one component after the other, I end up giving it the correct output label, and I end up with 100 percent accuracy. And now you can look at this table - I'm sorry, this is cut off on the bottom; it says logistic regression, 100 percent. Now you can look at this table and see how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right?
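The ground-truth plug-in table just described can be turned into per-component gains with a few lines. A minimal sketch, using the numbers from this contrived example; the remaining components of the pipeline (eyes, nose, mouth segmentation, logistic regression) are elided here since the lecture doesn't give their individual figures.

```python
def error_analysis_gains(table):
    """'table' lists (component, accuracy) in pipeline order, starting
    with the overall system and then the accuracy after the ground
    truth for each successive component is supplied on the test set."""
    return [(name, acc - prev_acc)
            for (_, prev_acc), (name, acc) in zip(table, table[1:])]

table = [
    ("overall system",     0.85),
    ("background removal", 0.851),
    ("face detection",     0.91),
    # ... remaining components, ending at 1.00 for the final classifier
]
```

The component with the largest gain is where ground truth would buy you the most, and so the best candidate for your time.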
So this tells me that if only I could get better face detection, maybe I could boost my accuracy by 6 percent. Whereas in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent. And so this sort of diagnostic also tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction: even if you got the ground truth, that gives you, at most, 0.1 percent accuracy, whereas if you do better face detection, there's a much larger potential for gains there. Okay? So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on. Right? And choosing the right piece is critical, and this sort of diagnostic tells you which piece may actually be worth your time to work on. There's another type of analysis that's sort of the opposite of what I just talked about.
The error analysis I just talked about tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline - some really bad performance - and your current performance. So for this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm. Right? So you added features for spelling correction, for sender host features, for email header features, email text parser features, JavaScript parser features, features for embedded images, and so on. Now let's say you review the system and you want to figure out how much each of these components actually contributed. Maybe you want to write a research paper and claim this was the piece that made the big difference; can you actually document that claim and justify it? So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. So in ablative analysis, instead of adding components one at a time, we'll instead remove components one at a time, to see how performance degrades.
So start with your overall system, which has 99.9 percent accuracy. Then remove spelling correction and see how much performance drops. Then remove the sender host features and see how much performance drops, and so on. All right? And so, in this contrived example, you see that, I guess, the biggest drop occurred when you removed the text parser features. And so you can then make a credible case that the text parser features were what really made the biggest difference here. Okay? And you can also tell, for instance, that removing the sender host features - on this line, right, performance dropped from 99.9 to 98.9 - means that in case you want to get rid of the sender host features to speed up computation or something, that would be a good candidate for elimination. Okay? Student: Are there any guarantees that if you shuffle around the order in which you drop those features, you'll get the same result? Instructor: Yeah, let's address that question: what if you shuffle the order in which you remove things? The answer is no; there's no guarantee you'd get a similar result. So in practice, for both types of analyses - the error analysis and the ablative analysis - sometimes there's a fairly natural ordering in which you add things or remove things, and sometimes there isn't.
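The ablative table above can be sketched in a few lines. The numbers below are in the spirit of the lecture's contrived spam example; only the sender-host figure (99.9 dropping to 98.9) is stated explicitly, so the other two accuracies are illustrative assumptions.

```python
def ablative_analysis(full_accuracy, accuracy_without):
    """Compute how much accuracy drops when each component is removed
    (one at a time) from the full system, largest drop first."""
    drops = {name: full_accuracy - acc
             for name, acc in accuracy_without.items()}
    return sorted(drops.items(), key=lambda kv: -kv[1])

# Accuracy of the full system with one named component removed.
removed = {
    "spelling correction":        0.990,   # illustrative
    "sender host features":       0.989,   # 99.9 -> 98.9, as in the lecture
    "email text parser features": 0.950,   # illustrative
}
ranking = ablative_analysis(0.999, removed)
# ranking[0][0] is the component whose removal hurts the most
```

The top of the ranking is the component you can credibly claim made the difference; the bottom is a candidate for elimination if you need speed.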
And quite often, you either choose one ordering and just go for it. And don't think of these analyses as fixed formulas, though; feel free to invent your own as well. One thing that's done quite often is to take the overall system and remove one component, then put it back; then remove a different one, then put it back; until all of them have been done. Okay. So the very last thing I want to talk about is some general advice for how to get started on a learning problem. Here's a cartoon description of two broad ways to get started on a learning problem. The first one is: carefully design your system. So you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure; then you implement it and hope it works. All right? The benefit of this sort of approach is that you get maybe nicer, maybe more scalable algorithms, and maybe you come up with new, elegant learning algorithms. And if your goal is to contribute to basic research in machine learning, if your goal is to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem is sort of the right way to go about it: think deeply about a problem and invent new solutions.
The second sort of approach is what I call build-and-fix: you implement something quick and dirty, and then you run error analyses and diagnostics to figure out what's wrong, and you fix those errors. The benefit of this second type of approach is that it'll often get your application working much more quickly. And especially for those of you who end up working in a company: very often it's not the best product that wins, it's the first product to market that wins. So, especially in industry, there's really something to be said for building a system quickly and getting it deployed quickly. And this second approach - building a quick-and-dirty, I'm gonna say, hack, and then fixing the problems - will actually get you to a system that works well much more quickly. And the reason is that very often it's really not clear which parts of a system are easy or hard to build, and therefore which parts you need to spend lots of time focusing on. So take that example I talked about just now, right, for identifying people. With a big, complicated learning system like this, a big, complicated pipeline like this, it's really not obvious at the outset which of these components you should spend lots of time working on. Right?
And if you didn't know that preprocessing wasn't the component that mattered, you could easily have spent three months working on better background subtraction, not knowing that it was just not gonna ultimately matter. And so the only way to find out what really works is to implement something quickly and find out which parts are really the hard parts to implement, or which parts could make a difference in performance. In fact, if your goal is to build a people recognition system, a system like this is actually far too complicated as your initial system. Maybe after you've prototyped a few systems, you'd converge to a system like this, but if this is the first system you're designing, this is much too complicated. Also, here's a very concrete piece of advice, and this applies to your projects as well. If your goal is to build a working application, Step 1 is actually probably not to design a system like this. Step 1 is to plot your data. Very often, if you just take the data you're trying to predict and plot it, plot X, plot Y, plot your data every way you can think of, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they should be positive. Something's wrong with this dataset."
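The look-at-your-data-first advice can be approximated even before reaching for a plotting library: summarize each variable and check for surprises like the negative values just mentioned. A minimal sketch with made-up numbers (in practice you would also draw histograms and scatter plots, e.g. with matplotlib):

```python
# Minimal "look at your data first" sketch. The values are made up;
# the point is to eyeball ranges before building any learning pipeline.

def quick_summary(values):
    """Return (min, max, mean) so obvious anomalies jump out."""
    lo, hi = min(values), max(values)
    mean = sum(values) / len(values)
    return lo, hi, mean

# Suppose these target values are supposed to be positive counts:
y = [12, 8, -3, 15, -7, 9]
lo, hi, mean = quick_summary(y)
print(f"min={lo} max={hi} mean={mean:.2f}")
if lo < 0:
    # Exactly the kind of surprise a quick look at the data reveals.
    print("Warning: negative values found; check the dataset.")
```

Five minutes of this kind of inspection often saves months of debugging a complicated system trained on broken data.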
About half the time, you find something obviously wrong with your data, or something very surprising. And this is something you find out just by plotting your data that you won't find out by implementing big complicated learning algorithms on it. Plotting the data sounds so simple; it's one of those pieces of advice that lots of us give but hardly anyone follows, so you can take that for what it's worth. Let me just reiterate that what I just said may be bad advice if your goal is to come up with new machine learning algorithms. For me personally, the learning algorithm I use the most often is probably logistic regression, because I have code lying around. So give me a learning problem, and I probably won't try anything more complicated than logistic regression on it first. It's only after trying something really simple and figuring out what's easy and what's hard that you know where to focus your efforts. But again, if your goal is to invent new machine learning algorithms, then you sort of don't want to hack up something, then add another hack to fix it, and hack it even more to fix it. So if your goal is to do novel machine learning research, then it pays to think more deeply about the problem and not follow this advice too literally. Shoot, you know what? All right, sorry if I'm running late, but I just have two more slides, so I'm gonna go through these quickly.
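To make the try-something-simple-first advice concrete, here is a minimal pure-Python sketch of a logistic regression baseline fit by batch gradient ascent on the log-likelihood. The toy data, learning rate, and iteration count are all illustrative choices, not recommendations:

```python
# Logistic regression baseline via batch gradient ascent (illustrative sketch).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    n = len(X[0])
    theta = [0.0] * (n + 1)  # last entry is the intercept term
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        for xi, yi in zip(X, y):
            z = sum(t * x for t, x in zip(theta, xi)) + theta[-1]
            err = yi - sigmoid(z)  # gradient of the log-likelihood
            for j in range(n):
                grad[j] += err * xi[j]
            grad[-1] += err
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x)) + theta[-1]
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy linearly separable data: label 1 when the feature sum is large.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
preds = [predict(theta, x) for x in X]  # should recover the training labels
```

A baseline like this tells you quickly how far simple methods get you, which is exactly the information you need before deciding whether anything more complicated is worth building.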
And so this is what I think of as premature statistical optimization: quite often, just like with premature optimization of code, people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more slides. This is a sort of cartoon that highly influenced my own thinking. It was based on a paper written by Christos Papadimitriou. This is how the developmental progress of research often happens. Let's say you want to build a mail delivery robot, so I've drawn a circle there that says "mail delivery robot." It seems like a useful thing to have, right? You know, it frees people up from having to deliver mail. Now, to deliver mail, obviously you need a robot that can wander around indoor environments, and you need a robot that can manipulate objects and pick up envelopes. So you need to build those two components in order to get a mail delivery robot, and I've drawn those components with little arrows to denote that, you know, obstacle avoidance is needed or would help build your mail delivery robot. Well, for obstacle avoidance, clearly you need a robot that can navigate, and you need to detect objects so you can avoid the obstacles. Now, we're gonna use computer vision to detect the objects.
And so we know that lighting sometimes changes, right, depending on whether it's morning or noon or evening. Lighting causes the color of things to change, and so you need an object detection system that's invariant to the specific colors of an object, because lighting changes, say. Well, color, or RGB values, is represented by three-dimensional vectors, and so you need to learn when two colors might be the same thing, when the visual appearance of two colors may be the same thing with just a lighting change or something. And to understand that properly, we can go out and study the differential geometry of 3D manifolds, because that helps us build a sound theory on which to develop our 3D similarity learning algorithms. And to really understand the fundamental aspects of this problem, we have to study the complexity of non-Riemannian geometries. And on and on it goes, until eventually you're proving convergence bounds for sampled non-monotonic logic. I don't even know what that is, because I just made it up. Whereas in reality, you know, chances are that link isn't real. Color invariance just barely helped object recognition, maybe; I'm making this up. Maybe differential geometry was hardly gonna help 3D similarity learning, and that link's also gonna fail. Okay?
So each of these circles can represent a person, or a research community, or a thought in your head. And there's a very real chance that maybe there are all these papers written on the differential geometry of 3D manifolds, and they were written because some guy once told someone else that it would help 3D similarity learning. And, you know, it's like: "A friend of mine told me that color invariance would help in object recognition, so I'm working on color invariance. And now I'm gonna tell a friend of mine that my thing will help his problem. And he'll tell a friend of his that his thing will help with his problem." And pretty soon, you're working on convergence bounds for sampled non-monotonic logic, when in reality none of these will ever see the light of day in your mail delivery robot. Okay? Now, I'm not criticizing the role of theory. There are very powerful theories, like the theory of VC dimension, which is far, far, far to the right of this. VC dimension is about as theoretical as it can get, and it's clearly had a huge impact on many applications and has dramatically advanced machine learning. Another example is the theory of NP-hardness, which, again, is about as theoretical as it can get, and it's had a huge impact on all of computer science.
But when you're off working on highly theoretical things, I guess, to me personally, it's important to keep in mind: are you working on something like VC dimension, which is high impact, or are you working on something like convergence bounds for sampled non-monotonic logic, which you're only hoping has some peripheral relevance to some application? Okay? For me personally, and this is a personal choice, I tend to trust something only if I personally can see a link from the theory I'm working on all the way back to an application. And if I don't personally see a direct link from what I'm doing to an application, then that's fine; I can choose to work on theory, but I wouldn't necessarily trust that the theory I'm working on will relate to an application if I don't personally see a link all the way back. Just to summarize: one lesson to take away from today is that I think time spent coming up with diagnostics for learning algorithms is often time well spent. It's often up to your own ingenuity to come up with great diagnostics.
And when I personally work on a machine learning algorithm, it's not uncommon for me to spend between a third and half of my time just writing diagnostics and trying to figure out what's going right and what's going wrong. Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and making progress. You don't want to be spending all this time implementing tests on your learning algorithms; it doesn't feel like you're doing anything. But when I implement learning algorithms, at least a third, and quite often half, of my time is actually spent implementing those tests so I can figure out what to work on. And I think it's actually one of the best uses of your time. We also talked about error analyses and ablative analyses, and lastly we talked about, you know, different approaches to getting started and the risks of premature statistical optimization. Okay. Sorry I ran over. I'll be here for a few more minutes for your questions.