This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is wrap up our discussion of learning theory. I'm going to start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be about applying machine learning algorithms to problems like, you know, the class project, or other problems you may go work on after you graduate from this class.

But let's start by talking about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory, and we learned about bias and variance. And in the previous lecture, we spent most of our time talking about algorithms for model selection and for feature selection; we talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model.
So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and may be less likely to overfit.

What I want to do today is talk about a different way to prevent overfitting. There's a method called regularization, and it's a way that lets you keep all the parameters.

So here's the idea, and I'm going to illustrate this with, say, linear regression. You take the linear regression model, the very first model we learned about, and we said that we would choose the parameters via maximum likelihood. Right? And that meant that you would choose the parameters theta that maximize the likelihood of the data — the parameters theta that maximize the probability of the data we observed. Right?
And so, to give this sort of procedure a name, this is one example of a frequentist procedure, and you can think of the frequentist view as maybe one school of statistics. The philosophical view behind writing this down is that we envision there is some true parameter theta out there that generated, you know, the Xs and the Ys. There's some true parameter theta that governs housing prices, Y as a function of X, and we don't know the value of theta, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta.

And the way we formulated this, theta was not a random variable. Right? That's why we said theta is just some fixed true value out there. It's not random or anything; we just don't know what it is, and we have a procedure called maximum likelihood for estimating the value of theta. So this is one example of what's called a frequentist procedure.
The alternative to, I guess, the frequentist school of statistics is the Bayesian school, in which we're going to say that we don't know what theta is, and so we will put a prior on theta. Okay? So in the Bayesian school, you would say, "Well, we don't know the value of theta, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared times the identity, I. Okay? And — actually, let me use S to denote my training set. So the prior represents my beliefs about what the parameters are in the absence of any data. Not having seen any data, the prior represents what I think theta is most likely to be. And so, given the training set S, in the Bayesian procedure we would calculate the posterior probability of the parameters given my training set. Let's write this on the next board.
So my posterior on my parameters given my training set — by Bayes' rule, this will be proportional to the likelihood of the data times the prior. Right? So, by Bayes' rule; let's call it the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a prediction on the price of a new house, on the input X — the distribution over the possible housing prices for this new house I'm trying to estimate the price of, given the size of the house, the features of the house X, and the training set I had previously — it is going to be given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set. Okay? And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate Y times that distribution. Okay? You would take the expectation of Y with respect to your posterior distribution. Okay?
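Written out, the quantities on the board are the following (I'm assuming the course's usual notation here, since the board itself isn't visible in the transcript):

```latex
% Posterior over the parameters (Bayes' rule), with the Gaussian prior
% from above:
\[
  p(\theta \mid S) \;\propto\;
  \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta),
  \qquad \theta \sim \mathcal{N}(0, \tau^2 I).
\]
% Prediction for a new input x integrates over the posterior, and the
% point prediction is the posterior-predictive mean:
\[
  p(y \mid x, S) = \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta,
  \qquad
  \mathbb{E}[y \mid x, S] = \int_{y} y\, p(y \mid x, S)\, dy.
\]
```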
And you notice that when I was writing this down, with the Bayesian formulation I've now started writing Y given X comma theta, because this formula is now the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta; I'm writing comma theta, because I'm now treating theta as a random variable.

So all of this is somewhat abstract. Actually, let's check — are there questions about this? No? Okay. Let's try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult, because if theta is an N-plus-one-dimensional parameter vector, then this is an integral over an N-plus-one-dimensional space, over R to the N plus one. And numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral when theta is very high-dimensional.
There are a few exceptions for which this can be done in closed form, but for many learning algorithms — say, Bayesian logistic regression — this is hard to do. And so what's commonly done is to take the posterior distribution and, instead of actually computing the full posterior distribution p of theta given S, take the quantity on the right-hand side of Bayes' rule and just maximize it. So let me write this down.

So commonly, instead of computing the full posterior distribution, we will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta — the most probable value of theta under your posterior distribution. And that's just the arg max over theta of the likelihood times the prior. And then when you need to make a prediction, you would just predict using your usual hypothesis, with this MAP value of theta in place of — as the parameter vector you'd choose. Okay?
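In symbols, the MAP estimate being described is (standard notation assumed, since the board writing is garbled in the transcript):

```latex
% The MAP estimate maximizes the right-hand side of Bayes' rule:
% the likelihood of the training set times the prior.
\[
  \theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\,
    \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta),
\]
% and predictions then just use the usual hypothesis with this value
% plugged in for the parameters, i.e. h_{\theta_{\mathrm{MAP}}}(x).
```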
And notice, the only difference between this and standard maximum likelihood estimation is that, instead of choosing the maximum likelihood value for theta, you're maximizing the likelihood — which is what you have in maximum likelihood estimation — times this other quantity, which is the prior. Right?

And let's see, the intuition is that if your prior is theta being Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? There's a Gaussian centered around the point zero, and so most of the mass is close to zero. And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection, if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta equal to zero. All right? So if you set theta five equal to zero, that's the same as, you know, eliminating feature five from your hypothesis. And so this is a prior that drives most of the parameter values to zero — to values close to zero.
And you can think of this as doing something analogous to — something reminiscent of — feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures: if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial — well, I think that sketch had too many bumps in it, but never mind. If you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. As you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice — it's sort of hard to see, but you can take my word for it — as tau becomes smaller and smaller, the curves you fit to your data become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?
Let's see, one last piece of intuition that I'll just toss out there — and you'll get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess. Whereas maximum likelihood for, say, linear regression turns out to be minimizing the sum of squared errors, it turns out that if you add this prior term, the optimization objective you end up with has an extra term that penalizes your parameters theta for being large. And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect — again, it's kind of hard to see, but just take my word for it — shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check for questions about all this.

[Student:] The smoothing behavior — is it because [inaudible] actually get different [inaudible]?

Let's see. Yeah.
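A minimal numpy sketch of the intuition just described (this is not the lecture's code; the data and the value of lambda are made up): MAP estimation with a Gaussian prior on theta is least squares plus an L2 penalty, lambda times the squared norm of theta, and the penalty shrinks the fitted parameters toward zero.

```python
import numpy as np

def fit(X, y, lam):
    # Closed-form solution of  min ||X theta - y||^2 + lam ||theta||^2,
    # i.e. theta = (X^T X + lam I)^{-1} X^T y  (ridge regression).
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)                     # five data points
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(5)

X = np.vander(x, 5)                              # degree-4 polynomial features

theta_ml = fit(X, y, lam=0.0)                    # maximum likelihood fit
theta_map = fit(X, y, lam=0.01)                  # MAP fit (Gaussian prior)

# The prior pulls the parameters toward zero, which in pictures shows up
# as a smoother fitted curve with smaller oscillations.
print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))
```

Decreasing tau in the prior corresponds to increasing lambda here, and drives the parameters (and hence the wiggliness of the fitted polynomial) further toward zero.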
It depends on — well, most priors with most of their mass close to zero will give you this effect, I guess. And just by convention, the Gaussian prior is what's used most commonly for models like logistic regression and linear regression — generalized linear models. There are a few other priors that are sometimes used, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.

And so it turns out that for problems like text classification — text classification has something like 30,000 features or 50,000 features — it seems like an algorithm like logistic regression would be very much prone to overfitting. Right? So imagine trying to build a spam classifier: maybe you have 100 training examples but you have 30,000 or 50,000 features. That seems clearly prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with a [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm. Alex? [Inaudible]? Yeah, right — and to pick either tau squared or lambda; I think the relation is lambda equals one over tau squared.
But right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool. So all right, that was all I wanted to say about methods for preventing overfitting.

What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. When you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what are called batch learning algorithms, where you're given a training set, you get to run your learning algorithm on the training set, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So here's how the problem proceeds. All right? I'm first going to give you X one.
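Before moving on: the suggestion just made — pick lambda (equivalently tau squared, via lambda equals one over tau squared) by cross-validation — can be sketched in a few lines. This is a hypothetical illustration, not from the lecture; the data, the lambda grid, and the `fit` helper (the closed-form ridge solution) are all made up for the example.

```python
import numpy as np

def fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
theta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ theta_true + 0.5 * rng.standard_normal(60)

X_tr, y_tr = X[:40], y[:40]          # training split
X_cv, y_cv = X[40:], y[40:]          # hold-out split

# Try a grid of lambdas, keep the one with the smallest hold-out error.
lambdas = [0.01, 0.1, 1.0, 10.0]
errors = [np.mean((X_cv @ fit(X_tr, y_tr, lam) - y_cv) ** 2)
          for lam in lambdas]
best_lam = lambdas[int(np.argmin(errors))]
print(best_lam)
```

In practice you would use k-fold cross-validation rather than a single hold-out split, exactly as in the model selection discussion from the previous lecture.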
Let's say it's a classification problem. So I'm first going to give you X one and then ask you, you know, "Can you make a prediction on X one? Is the label one or zero?" And you've not seen any data yet. And so, you make a guess. Right? We'll call your guess Y hat one. And after you've made your prediction, I will then reveal to you the true label Y one. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent if you guess randomly.

And then I show you X two, and I ask you, "Can you make a prediction on X two?" And so now maybe you make a slightly more educated guess; call that Y hat two. And after you've made your guess, I reveal the true label to you. And then I show you X three, you make your guess, and learning proceeds like this.

So this is unlike batch learning: this models settings where you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're setting up your website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes.
And only as you're making predictions do you get shown more and more training examples. So in online learning, what you care about is the total online error, which is — if you get a sequence of M examples all together — the sum from i equals one to M of the indicator that Y hat i is not equal to Y i. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that many of the learning algorithms you've learned about can be applied to this setting. One thing you could do, when you're asked to make the prediction Y hat three, say, is this: you've seen some training examples up to this point, so you can just take your learning algorithm and run it on all the examples you've seen previous to being asked to make a prediction on a given example, and then use your learning algorithm to make a prediction on the next example.
And it turns out that there are also algorithms — especially the algorithms we saw that you could train with stochastic gradient descent — that can be adapted very nicely to this. So as a concrete example, if you remember the perceptron algorithm: you would initialize the parameter theta to be equal to zero, and then after seeing the i-th training example, you'd update the parameters — you've seen this rule a lot of times now — using the standard perceptron learning rule. And the same thing if you were using logistic regression: after seeing each training example, you essentially run one step of stochastic gradient descent on just the example you saw. Okay?

And the reason I've put this into the sort of "learning theory" section of this class is that it turns out you can sometimes prove fairly amazing results on your total online error using algorithms like these.
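The online perceptron protocol just described can be sketched as follows (a minimal illustration, not the lecture's code; I'm assuming the usual plus-one/minus-one label convention, and the toy data stream is made up):

```python
import numpy as np

def online_perceptron(stream):
    # Online protocol: predict on each example as it arrives, then see the
    # true label, count the mistake if any (the "total online error"), and
    # update theta only on mistakes (the standard perceptron rule).
    theta = None
    mistakes = 0
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x, dtype=float)   # initialize theta = 0
        y_hat = 1.0 if theta @ x >= 0 else -1.0     # predict BEFORE seeing y
        if y_hat != y:                              # true label revealed
            mistakes += 1
            theta += y * x                          # update on a mistake
    return theta, mistakes

# Toy linearly separable stream: the label is the sign of the first feature.
rng = np.random.default_rng(1)
data = [(x, 1.0 if x[0] >= 0 else -1.0)
        for x in rng.standard_normal((200, 3))]

theta, total_online_error = online_perceptron(data)
print(total_online_error)
```

For logistic regression the analogous change would be to replace the mistake-driven update with one step of stochastic gradient descent on each incoming example.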
I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove the following about the perceptron algorithm. Even when the features X i are maybe infinite-dimensional feature vectors, like we saw for support vector machines — and sometimes infinite-dimensional feature vectors may be used via kernel representations — okay, it turns out you can prove that when you run the perceptron algorithm, even when the data is extremely high-dimensional and it seems like you'd be prone to overfitting, so long as the positive and negative examples are separated by a margin — so long as, in this infinite-dimensional space, there is some margin separating the positive and negative examples — you can prove that the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so, after seeing only a finite number of examples, it will converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space.
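The result being referenced is the classic perceptron mistake bound (Block/Novikoff). Stated in the usual notation — an assumption here, since the lecture defers the proof to the posted notes:

```latex
% Assume every example satisfies \|x^{(i)}\| \le R, and that some unit
% vector u separates the data with margin \gamma, i.e.
% y^{(i)} \, (u^\top x^{(i)}) \ge \gamma > 0 for all i.
% Then the online perceptron makes at most
\[
  \left(\frac{R}{\gamma}\right)^{2}
\]
% mistakes on the sequence -- a bound that does not depend on the
% dimension of the x^{(i)}, which is why it survives the move to
% infinite-dimensional kernel feature spaces.
```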
Okay? So let's see. The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that I mean, you know, it won't appear on the midterm and you won't be asked about it specifically in the problem sets, but I thought it'd be — I know some of you were curious after the previous lecture about how you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And so the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested.
Okay? But regardless of the theoretical results, you know, the online learning setting is something that comes up reasonably often, and so these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today — and it's the majority of today's lecture; actually, can I switch to the PowerPoint slides, please — is spend most of today's lecture talking about advice for applying different machine learning algorithms.

And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully. Because, you know, it turns out that you can take the same machine learning tool, say logistic regression, and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right?
00:25:47.150 --> 00:25:51.049 And so what I want to do today, in the rest of the time I have today, is try 00:25:51.049 --> 00:25:52.190 to 00:25:52.190 --> 00:25:55.210 convey to you, you know, some of the methods for how to 00:25:55.210 --> 00:25:56.600 make sure you're one of the people who 00:25:56.600 --> 00:26:03.600 really knows how to get these learning algorithms to work well in problems. So 00:26:05.120 --> 00:26:07.340 just some caveats on what I'm gonna, I 00:26:07.340 --> 00:26:08.859 guess, talk about in the 00:26:08.859 --> 00:26:11.080 rest of today's lecture. 00:26:11.940 --> 00:26:15.760 Something I want to talk about is actually not very mathematical but is also 00:26:15.760 --> 00:26:17.610 some of the hardest, 00:26:17.610 --> 00:26:18.480 most 00:26:18.480 --> 00:26:21.910 conceptually difficult material in this class to understand. All right? So this 00:26:21.910 --> 00:26:23.070 is 00:26:23.070 --> 00:26:25.400 not mathematical but this is not easy. 00:26:25.400 --> 00:26:29.730 And I want to say this caveat: some of what I'll say today is debatable. 00:26:29.730 --> 00:26:33.309 I think most good machine learning people will agree with most of what I say but maybe not 00:26:33.309 --> 00:26:35.730 everything I say. 00:26:35.730 --> 00:26:39.350 And some of what I'll say is also not good advice for doing machine learning research either, so I'll say more about 00:26:39.350 --> 00:26:41.680 this later. What I'm 00:26:41.680 --> 00:26:43.730 focusing on today is advice for 00:26:43.730 --> 00:26:47.269 how to just get stuff to work. If you work in a company and you want to deliver a 00:26:47.269 --> 00:26:49.120 product or you're, you know, 00:26:49.120 --> 00:26:52.750 building a system and you just want your machine learning system to work. Okay?
Some of what I'm about 00:26:52.750 --> 00:26:54.000 to say today 00:26:54.000 --> 00:26:57.559 isn't great advice if your goal is to invent a new machine learning algorithm, but 00:26:57.559 --> 00:26:59.179 this is advice for how to 00:26:59.179 --> 00:27:02.040 make a machine learning algorithm work and, you know, 00:27:02.040 --> 00:27:05.080 deploy a working system. So three 00:27:05.080 --> 00:27:08.900 key areas I'm gonna talk about. One: diagnostics for 00:27:08.900 --> 00:27:11.890 debugging learning algorithms. Second: sort of 00:27:11.890 --> 00:27:16.820 talk briefly about error analyses and ablative analysis. 00:27:16.820 --> 00:27:21.470 And third, I want to talk about just advice for how to get started on a machine-learning 00:27:23.410 --> 00:27:24.420 problem. 00:27:24.420 --> 00:27:27.340 And one theme that'll come up later is it 00:27:27.340 --> 00:27:31.230 turns out you've heard about premature optimization, right, 00:27:31.230 --> 00:27:33.890 in writing software. This is when 00:27:33.890 --> 00:27:38.120 someone over-designs from the start, when someone, you know, is writing a piece of code and 00:27:38.120 --> 00:27:41.330 they choose a subroutine to optimize 00:27:41.330 --> 00:27:45.350 heavily. And maybe you write the subroutine in assembly or something. And that's often 00:27:45.350 --> 00:27:48.919 - and many of us have been guilty of premature optimization, where 00:27:48.919 --> 00:27:51.510 we're trying to get a piece of code to run faster. And 00:27:51.510 --> 00:27:54.710 we choose a piece of code and we implement it in assembly, and 00:27:54.710 --> 00:27:56.820 really tune it and get it to run really quickly. And 00:27:56.820 --> 00:28:01.280 it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature 00:28:01.280 --> 00:28:02.210 optimization.
00:28:02.210 --> 00:28:06.140 And in undergraduate programming classes, we warn people all the time not to do 00:28:06.140 --> 00:28:10.250 premature optimization and people still do it all the time. Right? 00:28:10.250 --> 00:28:11.750 And it 00:28:11.750 --> 00:28:16.020 turns out, a very similar thing happens in building machine-learning systems. That 00:28:16.020 --> 00:28:20.980 many people are often guilty of, what I call, premature statistical optimization, where they 00:28:20.980 --> 00:28:21.630 00:28:21.630 --> 00:28:26.090 heavily optimize part of a machine learning system and that turns out 00:28:26.090 --> 00:28:27.830 not to be the important piece. Okay? 00:28:27.830 --> 00:28:30.170 So I'll talk about that later, as well. 00:28:30.170 --> 00:28:35.680 So let's first talk about debugging learning algorithms. 00:28:35.680 --> 00:28:37.700 00:28:37.700 --> 00:28:39.560 00:28:39.560 --> 00:28:42.610 As a motivating 00:28:42.610 --> 00:28:46.910 example, let's say you want to build an anti-spam system. And 00:28:46.910 --> 00:28:51.780 let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right? 00:28:51.780 --> 00:28:54.410 So instead of using 50,000 words, you've chosen a small set of 100 00:28:54.410 --> 00:28:55.390 features 00:28:55.390 --> 00:28:59.010 to use for your anti-spam system. 00:28:59.010 --> 00:29:02.140 And let's say you implement Bayesian logistic regression, implement gradient 00:29:02.140 --> 00:29:02.850 descent, 00:29:02.850 --> 00:29:07.310 and you get 20 percent test error, which is unacceptably high. Right? 00:29:07.310 --> 00:29:09.750 So this 00:29:09.750 --> 00:29:14.260 is Bayesian logistic regression, and so it's just like maximum likelihood but, you know, with that additional lambda theta 00:29:14.260 --> 00:29:15.160 squared term.
00:29:15.160 --> 00:29:19.950 And we're maximizing rather than minimizing as well, so there's a minus lambda 00:29:19.950 --> 00:29:23.809 theta squared instead of plus lambda theta squared. So 00:29:23.809 --> 00:29:28.390 the question is, you implemented your Bayesian logistic regression algorithm, 00:29:28.390 --> 00:29:34.050 and you tested it on your test set and you got unacceptably high error, so what do you do 00:29:34.050 --> 00:29:35.370 next? Right? 00:29:35.370 --> 00:29:37.310 So, 00:29:37.310 --> 00:29:40.730 you know, one thing you could do is think about the ways you could improve this algorithm. 00:29:40.730 --> 00:29:44.510 And this is probably what most people will do instead of, "Well let's sit down and think what 00:29:44.510 --> 00:29:47.929 could've gone wrong, and then we'll try to improve the algorithm." 00:29:47.929 --> 00:29:51.370 Well obviously having more training data could only help, so one thing you can do is try to 00:29:51.370 --> 00:29:55.140 get more training examples. 00:29:55.140 --> 00:29:58.880 Maybe you suspect that even 100 features was too many, so you might try to 00:29:58.880 --> 00:30:01.059 get a smaller set of 00:30:01.059 --> 00:30:04.510 features. What's more common is you might suspect your features aren't good enough, so you might 00:30:04.510 --> 00:30:06.880 spend some time, look at the email headers, see if 00:30:06.880 --> 00:30:09.240 you can figure out better features for, you know, 00:30:09.240 --> 00:30:12.890 finding spam emails or whatever. 00:30:12.890 --> 00:30:14.460 Right. And 00:30:14.460 --> 00:30:17.720 so you might 00:30:17.720 --> 00:30:20.780 just sit around and come up with better features, such as email header features. 00:30:20.780 --> 00:30:24.940 You may also suspect that gradient descent hasn't quite converged yet, and so let's try 00:30:24.940 --> 00:30:28.090 running gradient descent a bit longer to see if that works.
And clearly, that can't hurt, right, 00:30:28.090 --> 00:30:29.480 just run 00:30:29.480 --> 00:30:30.940 gradient descent longer. 00:30:30.940 --> 00:30:35.620 Or maybe you remember, you know, hearing in class that maybe 00:30:35.620 --> 00:30:38.580 Newton's method converges better, so let's 00:30:38.580 --> 00:30:39.809 try 00:30:39.809 --> 00:30:41.840 that instead. You may want to tune the value for lambda, because you're not 00:30:41.840 --> 00:30:43.560 sure if that was the right value, 00:30:43.560 --> 00:30:46.960 or maybe you even want to try an SVM because maybe you think an SVM might work better than logistic regression. So I only 00:30:46.960 --> 00:30:50.240 listed eight things 00:30:50.240 --> 00:30:54.549 here, but you can imagine if you were actually sitting down, building a machine-learning 00:30:54.549 --> 00:30:55.720 system, 00:30:55.720 --> 00:30:58.040 the options to you are endless. You can think of, you 00:30:58.040 --> 00:31:01.040 know, hundreds of ways to improve a learning system. 00:31:01.040 --> 00:31:02.429 And some of these things like, 00:31:02.429 --> 00:31:05.670 well getting more training examples, surely that's gonna help, so that seems like it's a good 00:31:05.670 --> 00:31:08.980 use of your time. Right? 00:31:08.980 --> 00:31:11.290 And it turns out that 00:31:11.290 --> 00:31:15.210 this [inaudible] of picking ways to improve the learning algorithm and picking one and going 00:31:15.210 --> 00:31:16.120 for it, 00:31:16.120 --> 00:31:20.620 it might work in the sense that it may eventually get you to a working system, but 00:31:20.620 --> 00:31:25.030 often it's very time-consuming. And I think it's often largely a matter of 00:31:25.030 --> 00:31:28.140 luck, whether you end up fixing what the problem is. 00:31:28.140 --> 00:31:29.490 In particular, these 00:31:29.490 --> 00:31:33.030 eight improvements all fix very different problems.
00:31:33.030 --> 00:31:37.340 And some of them will be fixing problems that you don't have. And 00:31:37.340 --> 00:31:40.070 if you can rule out six of 00:31:40.070 --> 00:31:44.129 these eight, say - if by somehow looking at the problem more deeply, 00:31:44.129 --> 00:31:46.809 you can figure out which one of these eight things is actually the right thing 00:31:46.809 --> 00:31:47.850 to do, 00:31:47.850 --> 00:31:48.850 you can save yourself 00:31:48.850 --> 00:31:50.760 a lot of time. 00:31:50.760 --> 00:31:56.090 So let's see how we can go about doing that. 00:31:56.090 --> 00:32:01.830 The people in industry and in research that I see that are really good would not 00:32:01.830 --> 00:32:05.490 go and try to change a learning algorithm randomly. There are lots of things that 00:32:05.490 --> 00:32:08.110 obviously improve your learning algorithm, 00:32:08.110 --> 00:32:12.460 but the problem is there are so many of them it's hard to know what to do. 00:32:12.460 --> 00:32:16.590 So what you find is that all the really good ones run various diagnostics to figure out 00:32:16.590 --> 00:32:18.010 where the problem is, 00:32:18.010 --> 00:32:21.610 and then fix where they think the problem is. Okay? 00:32:21.610 --> 00:32:23.830 So 00:32:23.830 --> 00:32:27.309 for our motivating story, right, we said - let's say Bayesian logistic regression test 00:32:27.309 --> 00:32:29.010 error was 20 percent, 00:32:29.010 --> 00:32:32.020 which let's say is unacceptably high. 00:32:32.020 --> 00:32:34.830 And let's suppose you suspected the problem is 00:32:34.830 --> 00:32:36.440 either overfitting, 00:32:36.440 --> 00:32:37.790 so it's high bias, 00:32:37.790 --> 00:32:42.240 or you suspect that, you know, maybe you have too few features that classify as spam, so there's - Oh excuse 00:32:42.240 --> 00:32:45.220 me; I think I 00:32:45.220 --> 00:32:46.620 wrote that wrong.
00:32:46.620 --> 00:32:48.080 Let's firstly - so let's 00:32:48.080 --> 00:32:49.219 forget - forget the tables. 00:32:49.219 --> 00:32:52.839 Suppose you suspect the problem is either high bias or high variance, and some of the text 00:32:52.839 --> 00:32:54.730 here 00:32:54.730 --> 00:32:55.250 doesn't make sense. And 00:32:55.250 --> 00:32:56.429 you want to know 00:32:56.429 --> 00:33:00.850 if you're overfitting, which would be high variance, or you have too few 00:33:00.850 --> 00:33:06.240 features classified as spam, it'd be high bias. I had those two reversed, sorry. Okay? So 00:33:06.240 --> 00:33:08.750 how can you figure out whether the problem 00:33:08.750 --> 00:33:10.790 is one of high bias 00:33:10.790 --> 00:33:15.610 or high variance? Right? So it turns 00:33:15.610 --> 00:33:19.009 out there's a simple diagnostic you can look at that will tell you 00:33:19.009 --> 00:33:24.150 whether the problem is high bias or high variance. If you 00:33:24.150 --> 00:33:27.900 remember the cartoon we'd seen previously for high variance problems, when you have high 00:33:27.900 --> 00:33:29.710 variance 00:33:29.710 --> 00:33:33.280 the training error will be much lower than the test error. All right? When you 00:33:33.280 --> 00:33:36.140 have a high variance problem, that's when you're fitting 00:33:36.140 --> 00:33:39.480 your training set very well. That's when you're fitting, you know, a tenth order polynomial to 00:33:39.480 --> 00:33:41.650 11 data points. All right? And 00:33:41.650 --> 00:33:44.670 that's when you're just fitting the data set very well, and so your training error will be 00:33:44.670 --> 00:33:45.670 much lower than 00:33:45.670 --> 00:33:47.640 your test 00:33:47.640 --> 00:33:49.940 error. And in contrast, if you have high bias, 00:33:49.940 --> 00:33:52.700 that's when your training error will also be high. Right? 
00:33:52.700 --> 00:33:56.450 That's when your data is quadratic, say, but you're fitting a linear function to it 00:33:56.450 --> 00:34:02.290 and so you aren't even fitting your training set well. So 00:34:02.290 --> 00:34:04.450 just in cartoons, I guess, 00:34:04.450 --> 00:34:07.950 this is a - this is what a typical learning curve for high variance looks 00:34:07.950 --> 00:34:09.339 like. 00:34:09.339 --> 00:34:13.690 On your horizontal axis, I'm plotting the training set size M, right, 00:34:13.690 --> 00:34:16.429 and on vertical axis, I'm plotting the error. 00:34:16.429 --> 00:34:19.469 And so, let's see, 00:34:19.469 --> 00:34:21.029 you know, as you increase - 00:34:21.029 --> 00:34:25.119 if you have a high variance problem, you'll notice as the training set size, M, 00:34:25.119 --> 00:34:29.219 increases, your test set error will keep on decreasing. 00:34:29.219 --> 00:34:32.829 And so this sort of suggests that, well, if you can increase the training set size even 00:34:32.829 --> 00:34:36.359 further, maybe if you extrapolate the green curve out, maybe 00:34:36.359 --> 00:34:39.970 that test set error will decrease even further. All right? 00:34:39.970 --> 00:34:43.399 Another thing that's useful to plot here is - let's say 00:34:43.399 --> 00:34:46.539 the red horizontal line is the desired performance 00:34:46.539 --> 00:34:50.259 you're trying to reach, another useful thing to plot is actually the training error. Right? 00:34:50.259 --> 00:34:52.009 And it turns out that 00:34:52.009 --> 00:34:59.009 your training error will actually grow as a function of the training set size 00:34:59.249 --> 00:35:01.609 because the larger your training set, 00:35:01.609 --> 00:35:03.619 the harder it is to fit, 00:35:03.619 --> 00:35:06.149 you know, your training set perfectly. Right? 
00:35:06.149 --> 00:35:09.249 So this is just a cartoon, don't take it too seriously, but in general, your training error 00:35:09.249 --> 00:35:11.420 will actually grow 00:35:11.420 --> 00:35:15.079 as a function of your training set size. Because with small training sets, if you have one data point, 00:35:15.079 --> 00:35:17.769 it's really easy to fit that perfectly, but if you have 00:35:17.769 --> 00:35:22.099 10,000 data points, it's much harder to fit that perfectly. 00:35:22.099 --> 00:35:23.149 All right? 00:35:23.149 --> 00:35:27.960 And so another diagnostic for high variance, and the one that I tend to use more, 00:35:27.960 --> 00:35:31.670 is to just look at training versus test error. And if there's a large gap between 00:35:31.670 --> 00:35:32.789 them, 00:35:32.789 --> 00:35:34.160 then this suggests that, you know, 00:35:34.160 --> 00:35:39.630 getting more training data may help you close that gap. Okay? 00:35:39.630 --> 00:35:41.420 So this is 00:35:41.420 --> 00:35:42.340 what the 00:35:42.340 --> 00:35:45.059 cartoon would look like in the 00:35:45.059 --> 00:35:49.199 case of high variance. 00:35:49.199 --> 00:35:53.099 This is what the cartoon looks like for high bias. Right? If you 00:35:53.099 --> 00:35:54.779 look at the learning curve, you 00:35:54.779 --> 00:35:57.499 see that the curve for test error 00:35:57.499 --> 00:36:01.419 has flattened out already. And so this is a sign that, 00:36:01.419 --> 00:36:05.179 you know, if you get more training examples, if you extrapolate this curve 00:36:05.179 --> 00:36:06.519 further to the right, 00:36:06.519 --> 00:36:09.670 it's maybe not likely to go down much further. 00:36:09.670 --> 00:36:12.469 And this is a property of high bias: that getting more training data won't 00:36:12.469 --> 00:36:15.619 necessarily help.
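To make the learning-curve diagnostic concrete, here is a minimal sketch (this is not code from the lecture; the least-squares learner and all function names are illustrative stand-ins for whatever model you are actually debugging):

```python
import numpy as np

def fit_least_squares(X, y):
    # Ordinary least squares, standing in for "train the model".
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def mean_squared_error(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

def learning_curves(X_train, y_train, X_test, y_test, sizes,
                    fit=fit_least_squares, error=mean_squared_error):
    # For each training-set size m, train on the first m examples and
    # record training error and test error: the two curves in the cartoon.
    # A large, persistent gap between the curves suggests high variance;
    # both curves flattening out above the target error suggests high bias.
    train_err, test_err = [], []
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        train_err.append(error(theta, X_train[:m], y_train[:m]))
        test_err.append(error(theta, X_test, y_test))
    return train_err, test_err
```

Plotting the two returned lists against `sizes` gives exactly the cartoon: training error creeping up with m, test error coming down, and the gap between them is what you read off.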
00:36:15.619 --> 00:36:18.999 But again, to me the more useful diagnostic is 00:36:18.999 --> 00:36:20.299 if you plot 00:36:20.299 --> 00:36:23.999 the training error as well - if you look at your training error as well as your, you know, 00:36:23.999 --> 00:36:26.369 hold-out test set error. 00:36:26.369 --> 00:36:29.409 If you find that even your training error 00:36:29.409 --> 00:36:31.529 is high, 00:36:31.529 --> 00:36:34.779 then that's a sign that getting more training data is not 00:36:34.779 --> 00:36:38.269 going to help. Right? 00:36:38.269 --> 00:36:42.199 In fact, you know, think about it, 00:36:42.199 --> 00:36:44.539 training error 00:36:44.539 --> 00:36:48.089 grows as a function of your training set size. 00:36:48.089 --> 00:36:50.449 And so if your 00:36:50.449 --> 00:36:55.569 training error is already above your level of desired performance, 00:36:55.569 --> 00:36:56.599 then 00:36:56.599 --> 00:37:00.789 getting even more training data is not going to reduce your training 00:37:00.789 --> 00:37:03.009 error down to the desired level of performance. Right? 00:37:03.009 --> 00:37:06.469 Because, you know, your training error sort of only gets worse as you get more and more training 00:37:06.469 --> 00:37:07.549 examples. 00:37:07.549 --> 00:37:10.799 So if you extrapolate further to the right, it's not like this blue line will come 00:37:10.799 --> 00:37:13.399 back down to the level of desired performance. Right? This will stay up 00:37:13.399 --> 00:37:17.479 there. Okay? So for 00:37:17.479 --> 00:37:21.339 me personally, when looking at a curve like the green 00:37:21.339 --> 00:37:25.380 curve on test error, I personally tend to find it very difficult to tell 00:37:25.380 --> 00:37:29.000 if the curve is still going down or if it's [inaudible]. Sometimes you can tell, but very 00:37:29.000 --> 00:37:31.009 often, it's somewhat 00:37:31.009 --> 00:37:32.899 ambiguous.
So for me personally, 00:37:32.899 --> 00:37:37.129 the diagnostic I tend to use the most often to tell if I have a bias problem or a variance 00:37:37.129 --> 00:37:37.859 problem 00:37:37.859 --> 00:37:41.319 is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay? And so, 00:37:41.319 --> 00:37:45.420 going 00:37:45.420 --> 00:37:47.130 back to 00:37:47.130 --> 00:37:52.399 the list of fixes, look 00:37:52.399 --> 00:37:54.109 at the first fix, 00:37:54.109 --> 00:37:56.339 getting more training examples 00:37:56.339 --> 00:37:58.650 is a way to fix high variance. 00:37:58.650 --> 00:38:02.749 Right? If you have a high variance problem, getting more training examples will help. 00:38:02.749 --> 00:38:05.529 Trying a smaller set of features: 00:38:05.529 --> 00:38:11.759 that also fixes high variance. All right? 00:38:11.759 --> 00:38:15.869 Trying a larger set of features or adding email features, these 00:38:15.869 --> 00:38:20.150 are solutions that fix high bias. Right? 00:38:20.150 --> 00:38:26.769 So high bias being if your hypothesis was too simple, you didn't have enough features. Okay? 00:38:26.769 --> 00:38:29.069 And so 00:38:29.069 --> 00:38:33.579 quite often you see people working on machine learning problems 00:38:33.579 --> 00:38:34.589 and 00:38:34.589 --> 00:38:37.569 they'll remember that getting more training examples helps. And 00:38:37.569 --> 00:38:41.119 so, they'll build a learning system, build an anti-spam system and it doesn't work. 00:38:41.119 --> 00:38:42.229 And then they 00:38:42.229 --> 00:38:45.999 go off and spend lots of time and money and effort collecting more training data 00:38:45.999 --> 00:38:50.509 because they'll say, "Oh well, getting more data's obviously got to help."
00:38:50.509 --> 00:38:53.319 But if they had a high bias problem in the first place, and not a high variance 00:38:53.319 --> 00:38:54.890 problem, 00:38:54.890 --> 00:38:56.769 it's entirely possible to spend 00:38:56.769 --> 00:39:00.149 three months or six months collecting more and more training data, 00:39:00.149 --> 00:39:04.999 not realizing that it couldn't possibly help. Right? 00:39:04.999 --> 00:39:07.619 And so, this actually happens a lot in, you 00:39:07.619 --> 00:39:12.409 know, in Silicon Valley and companies, this happens a lot. There will often 00:39:12.409 --> 00:39:15.329 be people building various machine learning systems, and 00:39:15.329 --> 00:39:19.519 they'll often - you often see people spending six months working on fixing a 00:39:19.519 --> 00:39:20.999 learning algorithm 00:39:20.999 --> 00:39:23.940 and you could've told them six months ago that, you know, 00:39:23.940 --> 00:39:27.210 that couldn't possibly have helped. But because they didn't know what the 00:39:27.210 --> 00:39:28.709 problem was, 00:39:28.709 --> 00:39:33.549 they'd easily spend six months trying to invent new features or something. And 00:39:33.549 --> 00:39:37.809 this is - you see this surprisingly often and this is somewhat depressing. You could've gone to them and 00:39:37.809 --> 00:39:42.289 told them, "I could've told you six months ago that this was not going to help." And 00:39:42.289 --> 00:39:46.149 the six months is not a joke, you actually see 00:39:46.149 --> 00:39:47.709 this. 00:39:47.709 --> 00:39:49.510 And in contrast, if you 00:39:49.510 --> 00:39:53.049 actually figure out the problem's one of high bias or high variance, then 00:39:53.049 --> 00:39:54.299 you can rule out 00:39:54.299 --> 00:39:55.799 two of these solutions and 00:39:55.799 --> 00:40:00.779 save yourself many months of fruitless effort. Okay? I actually 00:40:00.779 --> 00:40:03.709 want to talk about these four at the bottom as well.
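The rule of thumb above (a big train/test gap points at high variance; high training error points at high bias) can be written down as a toy decision rule. This is just a sketch: the function name and the numeric threshold are my own, and in practice you would eyeball the learning curves rather than hard-code a cutoff:

```python
def diagnose_bias_variance(train_error, test_error, target_error,
                           gap_threshold=0.05):
    # Large gap between test and training error: fitting the training
    # set much better than the test set, i.e. high variance.
    if test_error - train_error > gap_threshold:
        return ("high variance",
                ["get more training examples",
                 "try a smaller set of features"])
    # Not even fitting the training set to the desired level: high bias,
    # and collecting more data cannot bring training error down.
    if train_error > target_error:
        return ("high bias",
                ["try a larger set of features",
                 "try adding better (e.g. email header) features"])
    return ("neither", [])
```

The payoff is exactly the one described in the lecture: each diagnosis rules out the fixes that address the problem you do not have.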
But before I move on, let me 00:40:03.709 --> 00:40:05.319 just check if there were questions about what I've talked 00:40:05.319 --> 00:40:12.319 about so far. No? Okay, great. So bias 00:40:20.209 --> 00:40:23.220 versus variance is one thing that comes up 00:40:23.220 --> 00:40:29.539 often. This bias versus variance is one common diagnostic. And so, 00:40:29.539 --> 00:40:33.180 for other machine learning problems, it's often up to your own ingenuity to figure out 00:40:33.180 --> 00:40:35.700 your own diagnostics to figure out what's wrong. All right? 00:40:35.700 --> 00:40:41.230 So if a machine-learning algorithm isn't working, very often it's up to you to figure out, you 00:40:41.230 --> 00:40:44.300 know, to construct your own tests. Like do you look at the difference training and 00:40:44.300 --> 00:40:46.499 test errors or do you look at something else? 00:40:46.499 --> 00:40:49.929 It's often up to your own ingenuity to construct your own diagnostics to figure out what's 00:40:49.929 --> 00:40:52.589 going on. 00:40:52.589 --> 00:40:55.029 What I want to do is go through another example. All right? 00:40:55.029 --> 00:40:58.890 And this one is slightly more contrived but it'll illustrate another 00:40:58.890 --> 00:41:02.769 common question that comes up, another one of the most common 00:41:02.769 --> 00:41:04.750 issues that comes up in applying 00:41:04.750 --> 00:41:06.089 learning algorithms. 00:41:06.089 --> 00:41:08.319 So in this example, it's slightly more contrived, 00:41:08.319 --> 00:41:11.579 let's say you implement Bayesian logistic regression 00:41:11.579 --> 00:41:17.549 and you get 2 percent error on spam mail and 2 percent error non-spam mail. Right? 
So 00:41:17.549 --> 00:41:19.150 it's rejecting, you know, 00:41:19.150 --> 00:41:21.449 2 percent of - 00:41:21.449 --> 00:41:25.179 it's rejecting 98 percent of your spam mail, which is fine, so 2 percent of all 00:41:25.179 --> 00:41:26.959 spam gets 00:41:26.959 --> 00:41:30.660 through which is fine, but it is also rejecting 2 percent of your good email, 00:41:30.660 --> 00:41:35.489 2 percent of the email from your friends and that's unacceptably high, let's 00:41:35.489 --> 00:41:36.909 say. 00:41:36.909 --> 00:41:39.010 And let's say that 00:41:39.010 --> 00:41:41.899 a support vector machine using a linear kernel 00:41:41.899 --> 00:41:44.830 gets 10 percent error on spam and 00:41:44.830 --> 00:41:49.069 0.01 percent error on non-spam, which is more of the acceptable performance you want. And let's say for the sake of this 00:41:49.069 --> 00:41:53.359 example, let's say you're trying to build an anti-spam system. Right? 00:41:53.359 --> 00:41:56.170 Let's say that you really want to deploy 00:41:56.170 --> 00:41:57.679 logistic regression 00:41:57.679 --> 00:42:01.209 to your customers because of computational efficiency or because you need to 00:42:01.209 --> 00:42:03.389 retrain overnight every day, 00:42:03.389 --> 00:42:07.319 and because logistic regression just runs more easily and more quickly or something. Okay? So let's 00:42:07.319 --> 00:42:08.669 say you want to deploy logistic 00:42:08.669 --> 00:42:12.649 regression, but it's just not working out well. So the 00:42:12.649 --> 00:42:17.609 question is: What do you do next? So it 00:42:17.609 --> 00:42:18.829 turns out that this - 00:42:18.829 --> 00:42:22.319 the issue that comes up here, the one other common question that 00:42:22.319 --> 00:42:24.889 comes up is 00:42:24.889 --> 00:42:30.189 a question of is the algorithm converging. So you might suspect that maybe 00:42:30.189 --> 00:42:33.299 the problem with logistic regression is that it's just not converging.
00:42:33.299 --> 00:42:36.309 Maybe you need to run more iterations. And 00:42:36.309 --> 00:42:37.759 it 00:42:37.759 --> 00:42:40.359 turns out that, again if you look at the optimization objective, say, 00:42:40.359 --> 00:42:43.710 logistic regression is, let's say, optimizing J 00:42:43.710 --> 00:42:46.730 of theta, it actually turns out that if you plot your objective as a function of the number 00:42:46.730 --> 00:42:51.809 of iterations, when you look 00:42:51.809 --> 00:42:55.009 at this curve, you know, it sort of looks like it's going up but it sort of 00:42:55.009 --> 00:42:57.630 looks like it's asymptoting. And 00:42:57.630 --> 00:43:00.949 when you look at these curves, it's often very hard to tell 00:43:00.949 --> 00:43:03.729 if the curve has already flattened out. All right? And you look at these 00:43:03.729 --> 00:43:05.979 curves a lot so you can ask: 00:43:05.979 --> 00:43:08.229 Well has the algorithm converged? When you look at the J of theta like this, it's 00:43:08.229 --> 00:43:10.329 often hard to tell. 00:43:10.329 --> 00:43:14.149 You can run this ten times as long and see if it's flattened out. And you can run this ten 00:43:14.149 --> 00:43:21.079 times as long and it'll often still look like maybe it's going up very slowly, or something. Right? 00:43:21.079 --> 00:43:24.919 So you need a better diagnostic for whether logistic regression has converged than 00:43:24.919 --> 00:43:28.809 looking at this curve. 00:43:28.809 --> 00:43:32.089 The other question you might wonder - the other thing you might 00:43:32.089 --> 00:43:36.709 suspect is a problem is: are you optimizing the right function? 00:43:36.709 --> 00:43:38.920 So 00:43:38.920 --> 00:43:40.599 what you care about, 00:43:40.599 --> 00:43:42.879 right, in spam, say, 00:43:42.879 --> 00:43:44.260 is a 00:43:44.260 --> 00:43:47.499 weighted accuracy function like that.
So A of theta is, 00:43:47.499 --> 00:43:49.190 you know, sum over your 00:43:49.190 --> 00:43:52.249 examples of some weights times whether you got it right. 00:43:52.249 --> 00:43:56.809 And so the weight may be higher for non-spam than for spam mail because you care 00:43:56.809 --> 00:43:57.710 about getting 00:43:57.710 --> 00:44:01.469 your predictions correct for non-spam email much more than for spam mail, say. So let's 00:44:01.469 --> 00:44:02.359 00:44:02.359 --> 00:44:05.469 say A of theta 00:44:05.469 --> 00:44:10.819 is the optimization objective that you really care about, but what Bayesian logistic regression does is 00:44:10.819 --> 00:44:15.400 optimize a quantity like that. Right? It's this 00:44:15.400 --> 00:44:17.689 sort of maximum likelihood thing 00:44:17.689 --> 00:44:19.380 and then with this 00:44:19.380 --> 00:44:20.849 two-norm, you know, 00:44:20.849 --> 00:44:22.779 penalty thing that we saw previously. And you 00:44:22.779 --> 00:44:26.499 might be wondering: Is this the right optimization function to be optimizing? 00:44:26.499 --> 00:44:30.949 Okay? Or: Do I maybe need to change the value for lambda 00:44:30.949 --> 00:44:33.899 to change this parameter? Or: 00:44:33.899 --> 00:44:39.819 Should I maybe really be switching to the support vector machine optimization objective? 00:44:39.819 --> 00:44:42.130 Okay? Does that make sense? So 00:44:42.130 --> 00:44:44.490 the second diagnostic I'm gonna talk about 00:44:44.490 --> 00:44:46.989 is let's say you want to figure out 00:44:46.989 --> 00:44:50.609 is the algorithm converging, is the optimization algorithm converging, or 00:44:50.609 --> 00:44:51.900 is the problem with 00:44:51.900 --> 00:44:57.749 the optimization objective I chose in the first place? Okay? 00:44:57.749 --> 00:45:02.819 So here's 00:45:02.819 --> 00:45:07.329 the diagnostic you can use.
So to 00:45:07.329 --> 00:45:11.029 just reiterate the story, right, let's say an SVM outperforms Bayesian 00:45:11.029 --> 00:45:13.519 logistic regression but you really want to deploy 00:45:13.519 --> 00:45:16.759 Bayesian logistic regression to your problem. Let me 00:45:16.759 --> 00:45:19.049 let theta subscript SVM be the 00:45:19.049 --> 00:45:21.669 parameters learned by an SVM, 00:45:21.669 --> 00:45:25.259 and I'll let theta subscript BLR be the parameters learned by Bayesian 00:45:25.259 --> 00:45:28.049 logistic regression. 00:45:28.049 --> 00:45:32.480 So the optimization objective you care about is this, you know, weighted accuracy 00:45:32.480 --> 00:45:35.079 criterion that I talked about just now. 00:45:35.079 --> 00:45:37.859 And 00:45:37.859 --> 00:45:41.739 the support vector machine outperforms Bayesian logistic regression. And so, you know, 00:45:41.739 --> 00:45:44.969 the weighted accuracy on the support-vector-machine parameters 00:45:44.969 --> 00:45:46.969 is better than the weighted accuracy 00:45:46.969 --> 00:45:50.179 for Bayesian logistic regression. 00:45:50.179 --> 00:45:53.929 So 00:45:53.929 --> 00:45:57.039 further, Bayesian logistic regression tries to optimize 00:45:57.039 --> 00:45:59.410 an optimization objective like that, which I 00:45:59.410 --> 00:46:02.269 denoted J of theta. 00:46:02.269 --> 00:46:05.839 And so, the diagnostic I choose to use is 00:46:05.839 --> 00:46:08.430 to see if J of theta SVM 00:46:08.430 --> 00:46:12.269 is bigger than or less than J of theta BLR. Okay? 00:46:12.269 --> 00:46:14.609 So I explain this on the next slide. 00:46:14.609 --> 00:46:15.569 So 00:46:15.569 --> 00:46:19.529 we know two facts. We know that - well we know one fact. We know that the weighted 00:46:19.529 --> 00:46:20.519 accuracy 00:46:20.519 --> 00:46:23.160 of the support vector machine, right, 00:46:23.160 --> 00:46:24.479 is bigger than 00:46:24.479 --> 00:46:28.859 the weighted accuracy of Bayesian logistic regression.
So 00:46:28.859 --> 00:46:32.209 in order for me to figure out whether Bayesian logistic regression is converging, 00:46:32.209 --> 00:46:35.379 or whether I'm just optimizing the wrong objective function, 00:46:35.379 --> 00:46:41.059 the diagnostic I'm gonna use is to check whether this inequality holds. Okay? 00:46:41.059 --> 00:46:43.549 So let me explain this, 00:46:43.549 --> 00:46:44.769 so in Case 1, 00:46:44.769 --> 00:46:46.029 right, 00:46:46.029 --> 00:46:48.319 it's just those two equations copied over. 00:46:48.319 --> 00:46:50.489 In Case 1, let's say that 00:46:50.489 --> 00:46:54.589 J of SVM is, indeed, greater than J of BLR - or J of 00:46:54.589 --> 00:47:01.169 theta SVM is greater than J of theta BLR. But 00:47:01.169 --> 00:47:04.439 we know that Bayesian logistic regression 00:47:04.439 --> 00:47:07.519 was trying to maximize J of theta; 00:47:07.519 --> 00:47:08.869 that's the definition of 00:47:08.869 --> 00:47:12.359 Bayesian logistic regression. 00:47:12.359 --> 00:47:16.759 So this means that 00:47:16.759 --> 00:47:17.599 theta - 00:47:17.599 --> 00:47:22.029 the value of theta output by Bayesian logistic regression actually fails to 00:47:22.029 --> 00:47:24.209 maximize J 00:47:24.209 --> 00:47:27.309 because the support vector machine actually returned a value of theta that, 00:47:27.309 --> 00:47:28.719 you know, does a 00:47:28.719 --> 00:47:31.349 better job maximizing J. 00:47:31.349 --> 00:47:36.509 And so, this tells me that Bayesian logistic regression didn't actually maximize J 00:47:36.509 --> 00:47:39.319 correctly, and so the problem is with 00:47:39.319 --> 00:47:41.099 the optimization algorithm. The 00:47:41.099 --> 00:47:45.269 optimization algorithm hasn't converged. The other 00:47:45.269 --> 00:47:46.099 case 00:47:46.099 --> 00:47:49.890 is as follows, where 00:47:49.890 --> 00:47:52.579 J of theta SVM is less-than/equal to J of theta 00:47:52.579 --> 00:47:55.719 BLR. Okay?
In this case, what does that mean? This means that Bayesian logistic regression actually attains a higher value for the optimization objective J than the support vector machine does. The support vector machine, which does worse on your optimization problem, actually does better on the weighted accuracy measure. So something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion, and therefore J of theta is maybe the wrong optimization objective to be maximizing: maximizing J of theta just wasn't a good objective to choose if you care about the weighted accuracy. Okay? Can you raise your hand if this made sense? Cool, good. So that tells us whether the problem is with the optimization algorithm or with the optimization objective.
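The two cases above amount to a tiny decision rule. Here's an illustrative sketch, not code from the lecture: the function name `diagnose` is hypothetical, and it assumes (as in the story) that the SVM already beats Bayesian logistic regression on the weighted-accuracy criterion, and that J is the objective BLR tries to maximize.

```python
# Sketch of the "optimizer vs. objective" diagnostic (names illustrative).
# Standing assumption: the SVM already wins on weighted accuracy, and
# J is the objective Bayesian logistic regression (BLR) tries to MAXIMIZE.

def diagnose(J_svm: float, J_blr: float) -> str:
    """Compare J evaluated at the SVM's and at BLR's parameters."""
    if J_svm > J_blr:
        # Case 1: BLR failed to maximize its own objective; the SVM found
        # parameters with a higher J, so the optimizer hasn't converged.
        return "optimization algorithm"
    # Case 2: BLR maximizes J at least as well, yet has worse weighted
    # accuracy, so maximizing J is the wrong objective for this problem.
    return "optimization objective"
```

So a single comparison of J at the two parameter vectors points you either at the optimizer or at the objective itself.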
And so, going back to the slide with the eight fixes we had: you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value for lambda, in that lambda times norm of theta squared term in your objective, fixes the optimization objective. And changing to an SVM is another way of trying to fix the optimization objective. Okay? And so, once again, you actually see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. That's a very common pattern. The problem is in the formula for your J of theta, and yet you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, then trying more and more elaborate optimization algorithms, when optimizing J of theta better wasn't going to fix the problem at all. Okay?
So that's another example of where these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Okay? Let me think how much time I have. Hmm, let's see. Okay, we have time; let's do this. Let me show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is the most complex of the three I'm going to do today, and it draws on reinforcement learning, which is something I'm not going to talk about until close to the end of the course; it's just a more complicated example of a diagnostic. I'll probably go over this fairly quickly, and then after we've talked about reinforcement learning in the class, I'll come back and redo this exact same example, because you'll understand it more deeply. Okay? So some of you know that my students and I fly autonomous helicopters. So how do you get a machine learning algorithm to design the controller for a helicopter? This is what we do.
The first step is to build a simulator for the helicopter; there's a screenshot of our simulator. It's like a joystick simulator, and you can fly the helicopter in simulation. Then you choose a cost function (in reinforcement learning it's actually called a reward function, but for this I'll call it a cost function). Say J of theta is the expected squared error in your helicopter's position; so J of theta is maybe the expected squared error, or just the squared error. And then you run a reinforcement learning algorithm, and you'll learn about RL algorithms in a few weeks. You run the reinforcement learning algorithm in your simulator to try to minimize this cost function, to minimize the squared error in how well you're controlling your helicopter's position. Okay? The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you'll use those to fly your helicopter. So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. What do you do next?
And in particular, corresponding to the three steps above, there are three natural things you can try. (Oh, the bottom of the slide got chopped off.) You can try to improve the simulator: maybe you think your simulator isn't accurate enough, and you need to capture the aerodynamic effects more accurately, the airflow and turbulence effects around the helicopter. Maybe you need to modify the cost function: maybe squared error isn't cutting it, and what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging or something. Okay? So these are the diagnostics that my students and I actually used to figure out what's going on. Actually, why don't you just think about this for a second, and think about what you'd do, and then I'll go on and tell you what we do. All right, so let me tell you how we do this and see whether it's the same as your idea or not. And if you have a better idea than I do, let me know and I'll let you try it on my helicopter. So here's the line of reasoning I wanted to lay out.
So let's say the controller output by our reinforcement learning algorithm does poorly. Now suppose the contrary, that the following three things hold true. Suppose that the helicopter simulator is accurate, so we have an accurate model of our helicopter. Suppose that the reinforcement learning algorithm correctly controls the helicopter in simulation; we tend to run the learning algorithm in simulation so that it can crash the helicopter and it's fine, right? So assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And suppose that minimizing J of theta does indeed correspond to correct autonomous flight. If all of these things held true, then the parameters theta RL should actually fly well on my real helicopter. Right? And so the fact that the learned control parameters, theta RL, do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which of the three assumptions is wrong. Okay?
So these are the diagnostics we use. The first one is that we look at the controller and see if it even flies well in simulation, in the simulator of the helicopter that we did the learning on. If the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator: my simulator predicts the helicopter's controller will fly well, but it doesn't actually fly well in real life, so the problem could be in the simulator, and we should spend our efforts improving the accuracy of our simulator. Otherwise, let theta subscript human be the human control policy. So let's go ahead and ask a human to fly the helicopter (it could be in the simulator, it could be in real life) and measure the mean squared error of the human pilot's flight. And let's see if the human pilot does better or worse than the learned controller, in terms of optimizing this objective function J of theta. Okay?
So if the human does better, that is, if even a very good human pilot attains a better (lower) value on my optimization objective, on my cost function, than my learning algorithm does, then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm isn't managing to minimize J of theta, and that tells me the problem is in the reinforcement learning algorithm. And finally, if the human actually attains a larger value for J of theta, so the human has a larger mean squared error for the helicopter's position than my reinforcement learning algorithm does, and yet I like the way the human flies much better than the way my reinforcement learning algorithm flies: if that holds true, then clearly the problem is in the cost function, because the human does worse on my cost function but flies much better than my learning algorithm. And so that means the problem is in the cost function.
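Putting the three checks together, the helicopter diagnostic is a short cascade. This is a hedged sketch only: the function name and its inputs (a boolean for whether the learned controller flies well in simulation, plus the cost J attained by the RL controller and by the human pilot) are hypothetical stand-ins for the measurements described above, and J here is a cost to be minimized.

```python
# Hedged sketch of the three-way helicopter diagnostic; names and inputs
# are hypothetical stand-ins for the measurements described in the lecture.
# J is a COST to be minimized (e.g. squared position error).

def helicopter_diagnostic(flies_well_in_sim: bool,
                          J_rl: float, J_human: float) -> str:
    """Premise: theta_RL flies poorly on the real helicopter."""
    if flies_well_in_sim:
        # Good in simulation, bad in real life: the simulator is inaccurate.
        return "simulator"
    if J_human < J_rl:
        # The human attains a LOWER cost, so the RL algorithm failed to
        # minimize J: the optimizer is at fault.
        return "reinforcement learning algorithm"
    # RL attains a cost at least as low as the human's, yet the human flies
    # better: minimizing J doesn't correspond to good flight.
    return "cost function"
```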
It means... oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide. Because that means that my learning algorithm does a better job of minimizing the cost function, yet it doesn't fly as well as a human pilot. So that tells you that minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay? And so for these reinforcement learning problems, if something doesn't work (often reinforcement learning algorithms just work, but when they don't) these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm. And again, if you don't know which of the three problems it is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on it still.
So it's entirely possible to go out, spend six years, write a PhD thesis, and build a much better helicopter simulator, but if you're fixing the wrong problem, it's not going to help. So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. What I've described are maybe some of the most common diagnostics that I've used and that I've seen to be useful for many problems, but very often you'll need to come up with your own for your own specific learning problem. And I just want to point out that even when the learning algorithm is working well, it's often a good idea to run diagnostics like the ones I talked about, to make sure you really understand what's going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you to understand your application problem better. Some of you will graduate from Stanford and go on to some amazingly high-paying job applying machine learning algorithms to an application problem of significant economic interest, and you're going to be working on one specific important machine learning application for many months, or even for years.
One of the most valuable things for you personally will be to build up an intuitive understanding of what works and what doesn't work on your problem. Right now in industry, in Silicon Valley and around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or for years on end. And when you're doing that, solving a really important problem using learning algorithms, one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems. It turns out, by the way, that there are some Silicon Valley companies that outsource their machine learning. So sometimes a company in Silicon Valley will hire a firm in New York to run all their learning algorithms for them.
And I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed off to an outsourcing agency, then you don't maintain that expertise yourself. If there's a problem you really care about, it'll be your own understanding of the problem, built up over months, that'll be really valuable; and if that knowledge is outsourced, you don't get to keep it. I personally think that's a terrible idea, but I'm not a businessman; I just see people do it a lot. Let's see. Another reason for running diagnostics like these is in writing research papers. Diagnostics and error analyses, which I'll talk about in a minute, often help to convey insight about the problem and help justify your research claims. So for example, rather than writing a research paper that says, "Here's an algorithm that works: I built this helicopter and it flies," it's often much more interesting to say, "Here's an algorithm that works, and it works because of a specific component X."
"And moreover, here's the diagnostic that gives you justification that X was the thing that fixed this problem," and that's what made it work. Okay? So that leads me into a discussion of error analysis, which is often good machine learning practice: a way of understanding what your sources of error are. So that's what I call error analysis. And let's check for questions about this first. Yeah? Student: What ended up being wrong with the helicopter? Instructor (Andrew Ng): Oh, I don't know. We've flown so many times, and it changes all the time. Quite often it's actually the simulator; building an accurate simulator of a helicopter is very hard. Yeah. Okay. So error analysis is a way of figuring out what is working in your algorithm and what isn't working. And we're going to talk about two specific examples. So there are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example, not dissimilar in many ways from the actual machine learning systems you see. Let's say you want to recognize people from images.
This is a picture of one of my friends. So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. Then you run a face detection algorithm, a machine learning algorithm to detect people's faces. And then, since you want to recognize the identity of the person (that's your application), you segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth and so on. (I know; she might not want to be my friend after she sees this.) And then, having found all these features, based on what the nose looks like, what the eyes look like, and so on, you feed all the features into a logistic regression algorithm, and your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay? So this is the setting for error analysis: you have a long, complicated pipeline combining many machine learning components, and many of these components may themselves be learned.
And so it's often very useful to figure out how much of your error can be attributed to each of these components. So what we'll do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that, in the figure on the bottom right, is this: let's say the overall accuracy of the system is 85 percent. Then I want to know where my 15 percent of error comes from. So what I'll do is go to my test set and, instead of running my actual background removal, give my algorithm the correct background-versus-foreground segmentation. If I do that (let's color that blue to denote that I'm giving ground-truth data in the test set), let's say our accuracy increases to 85.1 percent. And now I'll go in and give my algorithm the ground-truth face detection output: on my test set I'll just tell the algorithm where the face is. And if I do that, let's say my algorithm's accuracy increases to 91 percent, and so on.
And then I'll go through each of these components and give it the ground-truth label. For example, the nose segmentation algorithm is trying to figure out where the nose is, so I just go in and tell it where the nose is, and it doesn't have to figure that out. And as I do this, one component after another, I end up giving every stage the correct output, and I end up with 100 percent accuracy. (Sorry, this is cut off on the bottom; it says logistic regression, 100 percent.) Now you can look at this table and see how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. So this tells me that if only I could get better face detection, maybe I could boost my accuracy by about 6 percent. Whereas, in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent.
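Tabulating these marginal gains is easy to automate. Below is an illustrative sketch: the component names and accuracy numbers just mirror the contrived face-recognition example (the intermediate segmentation numbers are made up for illustration), and `error_analysis_gains` is a hypothetical helper, not library code.

```python
# Illustrative error-analysis tabulation for a pipeline. Accuracies are
# made-up numbers mirroring the contrived face-recognition example;
# `error_analysis_gains` is a hypothetical helper.

def error_analysis_gains(rows):
    """rows: (stage, accuracy with ground truth plugged in up to and
    including that stage), starting from the overall system.
    Returns each stage's marginal gain."""
    gains = []
    prev = rows[0][1]  # overall system accuracy, no ground truth yet
    for name, acc in rows[1:]:
        gains.append((name, acc - prev))
        prev = acc
    return gains

table = [
    ("overall system", 85.0),
    ("background removal", 85.1),
    ("face detection", 91.0),
    ("eyes segmentation", 95.0),
    ("nose segmentation", 96.0),
    ("mouth segmentation", 97.0),
    ("logistic regression", 100.0),
]
# The largest marginal gain (face detection, ~5.9 points) marks the
# component most worth improving.
```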
And so this sort of diagnostic also tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction, because even if you got the ground truth, that gives you at most 0.1 percent accuracy, whereas if you do better face detection, there's a much larger potential for gains. Okay? So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on. Choosing the right piece is critical, and this sort of diagnostic tells you which piece may actually be worth your time to work on. There's another type of analysis that's sort of the opposite of what I just talked about. The error analysis I just described tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline, some really bad performance, and your current performance. For this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm.
So you added features for spelling correction, sender host features, email header features, email text parser features, JavaScript parser features, features for embedded images, and so on. Now let's say you review the system and you want to figure out how much each of these components actually contributed. Maybe you want to write a research paper and claim this was the piece that made the big difference; can you actually document that claim and justify it? So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. So in ablative analysis, instead of adding components one at a time, we remove components one at a time and see how performance degrades. We start with the overall system, which has 99.9 percent accuracy. Then we remove spelling correction and see how much performance drops. Then we remove the sender host features and see how much performance drops, and so on. All right?
And so, 01:10:24.219 --> 01:10:28.150 in this contrived example, 01:10:28.150 --> 01:10:31.120 you see that, I guess, the biggest drop 01:10:31.120 --> 01:10:32.380 occurred when you removed 01:10:32.380 --> 01:10:37.560 the text parser features. And so you can then make a credible case that, 01:10:37.560 --> 01:10:41.280 you know, the text parser features were what really made the biggest difference here. Okay? 01:10:41.280 --> 01:10:42.699 And you can also tell, 01:10:42.699 --> 01:10:45.530 for instance, that, I don't know, 01:10:45.530 --> 01:10:49.360 when removing the sender host features on this 01:10:49.360 --> 01:10:52.280 line, right, performance dropped from 99.9 to 98.9. And so this also means 01:10:52.280 --> 01:10:53.139 that 01:10:53.139 --> 01:10:56.449 in case you want to get rid of the sender host features to speed up 01:10:56.449 --> 01:11:03.449 computation, that would be a good candidate for elimination. Okay? Are there any 01:11:03.630 --> 01:11:05.420 guarantees that if you shuffle around the order in which 01:11:05.420 --> 01:11:06.420 you drop those 01:11:06.420 --> 01:11:09.580 features that you'll get the same result? Yeah, let's address the question: what if you shuffle the order in which you remove things? The answer is no. There's 01:11:09.580 --> 01:11:12.110 no guarantee you'd get a similar result. 01:11:12.110 --> 01:11:13.890 So in practice, 01:11:13.890 --> 01:11:17.730 sometimes there's a fairly natural ordering for both types of analyses, the error 01:11:17.730 --> 01:11:19.329 analysis and the ablative analysis, 01:11:19.329 --> 01:11:22.749 sometimes there's a fairly natural ordering in which you add things or remove things, 01:11:22.749 --> 01:11:24.559 and sometimes there isn't.
And 01:11:24.559 --> 01:11:28.469 quite often, you either choose one ordering and just go for it. 01:11:28.469 --> 01:11:32.070 And don't think of these analyses as sort of formulas that are set in stone, though; I mean, 01:11:32.070 --> 01:11:35.239 feel free to invent your own, as well. You know, 01:11:35.239 --> 01:11:36.639 one of the things 01:11:36.639 --> 01:11:37.920 that's done quite often is 01:11:37.920 --> 01:11:39.310 to take the overall system 01:11:39.310 --> 01:11:43.289 and just remove one component and then put it back, then remove a different one 01:11:43.289 --> 01:11:48.129 and put it back, until all of these things are done. Okay. 01:11:48.129 --> 01:11:51.009 So the very last thing I want to talk about is sort of this 01:11:51.009 --> 01:11:57.979 general advice for how to get started on a learning problem. So 01:11:57.979 --> 01:12:03.840 here's a cartoon description of two broad ways to get started on a learning problem. 01:12:03.840 --> 01:12:05.740 The first one is 01:12:05.740 --> 01:12:07.610 to carefully design your system, so 01:12:07.610 --> 01:12:11.739 you spend a long time designing exactly the right features, collecting the right data set, and 01:12:11.739 --> 01:12:14.189 designing the right algorithmic structure, then you 01:12:14.189 --> 01:12:17.679 implement it and hope it works. All right? 01:12:17.679 --> 01:12:21.039 The benefit of this sort of approach is you get maybe nicer, maybe more scalable 01:12:21.039 --> 01:12:22.429 algorithms, 01:12:22.429 --> 01:12:26.719 and maybe you come up with new, elegant learning algorithms.
And if your goal is to, 01:12:26.719 --> 01:12:30.760 you know, contribute to basic research in machine learning, if your goal is to invent new machine learning 01:12:30.760 --> 01:12:31.499 algorithms, 01:12:31.499 --> 01:12:33.550 this process of slowing down and 01:12:33.550 --> 01:12:36.300 thinking deeply about the problem, you know, is sort of the right way to go 01:12:36.300 --> 01:12:37.119 about it: 01:12:37.119 --> 01:12:41.099 think deeply about a problem and invent new solutions. 01:12:41.099 --> 01:12:42.280 01:12:42.280 --> 01:12:44.079 The second sort of approach 01:12:44.079 --> 01:12:48.839 is what I call build-and-fix, where you implement something quick and dirty 01:12:48.839 --> 01:12:52.309 and then you run error analyses and diagnostics to figure out what's wrong and 01:12:52.309 --> 01:12:54.199 you fix those errors. 01:12:54.199 --> 01:12:58.130 The benefit of this second type of approach is that it'll often get your 01:12:58.130 --> 01:13:01.119 application working much more quickly. 01:13:01.119 --> 01:13:04.400 And especially for those of you who end up working in 01:13:04.400 --> 01:13:05.550 a company, 01:13:05.550 --> 01:13:07.460 you know, very often it's not 01:13:07.460 --> 01:13:10.900 the best product that wins; it's the first product to market that 01:13:10.900 --> 01:13:11.690 wins. And 01:13:11.690 --> 01:13:14.869 so, especially in industry, there's really something to be said for, 01:13:14.869 --> 01:13:18.790 you know, building a system quickly and getting it deployed quickly. 01:13:18.790 --> 01:13:23.139 And this second approach of building a quick-and-dirty, I'm gonna say, hack 01:13:23.139 --> 01:13:26.469 and then fixing the problems will actually get you to a 01:13:26.469 --> 01:13:27.839 system that works well 01:13:27.839 --> 01:13:30.969 much more quickly.
01:13:30.969 --> 01:13:32.649 And the reason is 01:13:32.649 --> 01:13:36.149 that very often it's really not clear which parts of a system are easy or hard to 01:13:36.149 --> 01:13:37.590 build, and therefore which parts 01:13:37.590 --> 01:13:40.179 you need to spend lots of time focusing on. 01:13:40.179 --> 01:13:43.420 So there's that example I talked about just now. Right? 01:13:43.420 --> 01:13:46.929 For identifying 01:13:46.929 --> 01:13:48.710 people, say. 01:13:48.710 --> 01:13:53.199 And with a big, complicated learning system like this, a big, complicated pipeline like this, 01:13:53.199 --> 01:13:55.590 it's really not obvious at the outset 01:13:55.590 --> 01:13:59.130 which of these components you should spend lots of time working on. Right? And if 01:13:59.130 --> 01:14:00.960 you didn't know that 01:14:00.960 --> 01:14:03.800 preprocessing wasn't the right component, you could easily have 01:14:03.800 --> 01:14:07.269 spent three months working on better background subtraction, not knowing that it's 01:14:07.269 --> 01:14:09.880 just not gonna ultimately matter. 01:14:09.880 --> 01:14:10.769 And so 01:14:10.769 --> 01:14:13.690 the only way to find out what really works is implementing something quickly 01:14:13.690 --> 01:14:15.350 and finding out 01:14:15.350 --> 01:14:16.889 which parts 01:14:16.889 --> 01:14:17.889 are really 01:14:17.889 --> 01:14:21.359 the hard parts to implement, or which parts could make a 01:14:21.359 --> 01:14:23.079 difference in performance. 01:14:23.079 --> 01:14:26.579 In fact, I'd say that if your goal is to build a 01:14:26.579 --> 01:14:29.309 people recognition system, a system like this is actually far too 01:14:29.309 --> 01:14:31.639 complicated as your initial system. 01:14:31.639 --> 01:14:35.560 Maybe after you've prototyped a few systems, you'll converge on a system like this.
But if this 01:14:35.560 --> 01:14:42.560 is the first system you're designing, this is much too complicated. Also, here's a 01:14:43.570 --> 01:14:48.059 very concrete piece of advice, and this applies to your projects as well. 01:14:48.059 --> 01:14:51.229 If your goal is to build a working application, 01:14:51.229 --> 01:14:55.260 Step 1 is actually probably not to design a system like this. Step 1 is to plot your 01:14:55.260 --> 01:14:57.279 data. 01:14:57.279 --> 01:15:01.219 And very often, if you just take the data you're trying to predict and just plot your 01:15:01.219 --> 01:15:05.729 data, plot X, plot Y, plot your data every way you can think of, 01:15:05.729 --> 01:15:10.309 you know, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they 01:15:10.309 --> 01:15:13.899 should be positive. Something's wrong with this dataset." It's about 01:15:13.899 --> 01:15:18.389 half the time that you find something obviously wrong with your data or something very surprising. 01:15:18.389 --> 01:15:21.570 And this is something you find out just by plotting your data, that you 01:15:21.570 --> 01:15:28.179 won't find out by implementing these big, complicated learning algorithms on it. Plotting 01:15:28.179 --> 01:15:31.920 the data sounds so simple; it's one of the pieces of advice that lots of us give but 01:15:31.920 --> 01:15:38.570 hardly anyone follows, so you can take that for what it's worth. 01:15:38.570 --> 01:15:42.199 Let me just reiterate: what I just said here may be bad advice 01:15:42.199 --> 01:15:44.019 if your goal is to come up with 01:15:44.019 --> 01:15:46.639 new machine learning algorithms. All right? So 01:15:46.639 --> 01:15:51.019 for me personally, the learning algorithm I use the most often is probably 01:15:51.019 --> 01:15:53.600 logistic regression, because I have code lying around.
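As a concrete illustration of that kind of quick baseline, here is a minimal sketch of plain logistic regression fit by batch gradient ascent. This code is my own addition for illustration, not from the lecture, and it assumes NumPy with a feature matrix `X` and 0/1 labels `y`.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, steps=2000):
    """Fit weights and bias by batch gradient ascent on the average log-likelihood."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid: predicted P(y=1 | x)
        w += lr * X.T @ (y - p) / len(y)        # gradient step for the weights
        b += lr * np.mean(y - p)                # gradient step for the bias
    return w, b

def predict(X, w, b):
    """Classify as 1 wherever the decision function is positive."""
    return (X @ w + b > 0).astype(int)
```

Trying a simple baseline like this first tells you quickly how hard the problem actually is, before you invest in anything fancier.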
So give me a 01:15:53.600 --> 01:15:56.770 learning problem, and I probably won't try anything more complicated than logistic 01:15:56.770 --> 01:15:58.260 regression on it first. And it's 01:15:58.260 --> 01:16:01.940 only after trying something really simple and figuring out what's easy and what's hard that you know 01:16:01.940 --> 01:16:03.940 where to focus your efforts. But 01:16:03.940 --> 01:16:07.610 again, if your goal is to invent new machine learning algorithms, then you sort of don't 01:16:07.610 --> 01:16:10.750 want to hack up something and then add another hack to fix it, and hack it even more to 01:16:10.750 --> 01:16:12.219 fix it. Right? So if 01:16:12.219 --> 01:16:15.919 your goal is to do novel machine learning research, then it pays to think more deeply about the 01:16:15.919 --> 01:16:21.340 problem and not follow this advice specifically. 01:16:21.340 --> 01:16:22.919 Shoot, you know what? All 01:16:22.919 --> 01:16:28.280 right, sorry if I'm running late, but I just have two more slides, so I'm gonna go through these quickly. 01:16:28.280 --> 01:16:30.620 And so, this is what I think 01:16:30.620 --> 01:16:33.459 of as premature statistical optimization, 01:16:33.459 --> 01:16:35.079 where quite often, 01:16:35.079 --> 01:16:38.320 just like premature optimization of code, 01:16:38.320 --> 01:16:44.369 people will prematurely optimize one component of a big, complicated machine learning system. Okay? Just two more 01:16:44.369 --> 01:16:46.949 slides. 01:16:46.949 --> 01:16:48.539 This 01:16:48.539 --> 01:16:52.070 is a sort of cartoon that highly influenced my own thinking. It was based on 01:16:52.070 --> 01:16:55.340 a paper written by Christos Papadimitriou. 01:16:55.340 --> 01:16:57.429 This is how 01:16:57.429 --> 01:16:59.360 01:16:59.360 --> 01:17:02.360 developmental progress of research often happens. Right?
01:17:02.360 --> 01:17:05.559 Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. And it 01:17:05.559 --> 01:17:06.519 seems like a useful thing to have. 01:17:06.519 --> 01:17:09.670 Right? You know, free people up, so they don't have 01:17:09.670 --> 01:17:12.760 to deliver mail. So 01:17:12.760 --> 01:17:14.280 to deliver mail, 01:17:14.280 --> 01:17:19.139 obviously you need a robot to wander around indoor environments and you need a robot to 01:17:19.139 --> 01:17:21.480 manipulate objects and pick up envelopes. And so, 01:17:21.480 --> 01:17:24.890 you need to build those two components in order to get a mail delivery robot. And 01:17:24.890 --> 01:17:25.590 so I've 01:17:25.590 --> 01:17:29.650 drawn those two components and little arrows to denote that, you know, obstacle avoidance 01:17:29.650 --> 01:17:30.460 is 01:17:30.460 --> 01:17:32.229 needed or would help build 01:17:32.229 --> 01:17:35.510 your mail delivery robot. Well, 01:17:35.510 --> 01:17:37.189 for obstacle avoidance, 01:17:37.189 --> 01:17:43.159 clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. 01:17:43.159 --> 01:17:46.840 Now we're gonna use computer vision to detect the objects. And so, 01:17:46.840 --> 01:17:51.120 we know that, you know, lighting sometimes changes, right, depending on whether it's 01:17:51.120 --> 01:17:52.709 morning or noontime or evening. 01:17:52.709 --> 01:17:53.930 Lighting 01:17:53.930 --> 01:17:56.639 changes cause the color of things to change, and so you need 01:17:56.639 --> 01:18:00.509 an object detection system that's invariant to the specific colors of an 01:18:00.509 --> 01:18:01.199 object. Right? 01:18:01.199 --> 01:18:04.420 Because lighting 01:18:04.420 --> 01:18:05.400 changes, 01:18:05.400 --> 01:18:09.849 say. Well, color, or RGB values, are represented by three-dimensional vectors.
And 01:18:09.849 --> 01:18:11.170 so you need to learn 01:18:11.170 --> 01:18:13.499 when two colors might be the same thing, 01:18:13.499 --> 01:18:15.260 when, you know, 01:18:15.260 --> 01:18:18.159 the visual appearances of two colors may be the same thing with just a lighting change or 01:18:18.159 --> 01:18:19.539 something. 01:18:19.539 --> 01:18:20.590 And 01:18:20.590 --> 01:18:24.059 to understand that properly, we can go out and study the differential geometry 01:18:24.059 --> 01:18:27.510 of 3d manifolds, because that helps us build a sound theory on which 01:18:27.510 --> 01:18:32.250 to develop our 3d similarity learning algorithms. 01:18:32.250 --> 01:18:36.159 And to really understand the fundamental aspects of this problem, 01:18:36.159 --> 01:18:40.110 we have to study the complexity of non-Riemannian geometries. And on 01:18:40.110 --> 01:18:43.850 and on it goes until eventually you're proving convergence bounds for 01:18:43.850 --> 01:18:49.790 sampled non-monotonic logic. I don't even know what this is because I just made it up. 01:18:49.790 --> 01:18:51.530 Whereas in reality, 01:18:51.530 --> 01:18:53.970 you know, chances are that link isn't real. 01:18:53.970 --> 01:18:55.660 Color invariance 01:18:55.660 --> 01:18:59.550 just barely helped object recognition, maybe. I'm making this up. 01:18:59.550 --> 01:19:03.499 Maybe differential geometry was hardly gonna help 3d similarity learning, and that link's also gonna fail. Okay? 01:19:03.499 --> 01:19:05.270 So, each of 01:19:05.270 --> 01:19:09.130 these circles can represent a person, or a research community, or a thought in your 01:19:09.130 --> 01:19:12.020 head. And there's a very real chance that 01:19:12.020 --> 01:19:15.469 maybe there are all these papers written on differential geometry of 3d manifolds, and they are 01:19:15.469 --> 01:19:18.570 written because some guy once told someone else that it'll help 3d similarity learning.
01:19:18.570 --> 01:19:20.489 And, 01:19:20.489 --> 01:19:23.369 you know, it's like, "A friend of mine told me that color invariance would help in 01:19:23.369 --> 01:19:26.119 object recognition, so I'm working on color invariance. And now I'm gonna tell a friend 01:19:26.119 --> 01:19:27.440 of mine 01:19:27.440 --> 01:19:30.280 that his thing will help my problem. And he'll tell a friend of his that his thing will help 01:19:30.280 --> 01:19:31.619 with his problem." 01:19:31.619 --> 01:19:33.520 And pretty soon, you're working on 01:19:33.520 --> 01:19:37.540 convergence bounds for sampled non-monotonic logic, when in reality none of these will 01:19:37.540 --> 01:19:39.129 see the light of 01:19:39.129 --> 01:19:42.519 day in your mail delivery robot. Okay? 01:19:42.519 --> 01:19:46.599 I'm not criticizing the role of theory. There are very powerful theories, like the 01:19:46.599 --> 01:19:48.400 theory of VC dimension, 01:19:48.400 --> 01:19:52.090 which is far, far, far to the right of this. So VC dimension is about 01:19:52.090 --> 01:19:53.289 as theoretical 01:19:53.289 --> 01:19:57.119 as it can get. And it's clearly had a huge impact on many applications. And it's, 01:19:57.119 --> 01:19:59.559 you know, dramatically advanced machine learning. And another example is the theory of NP-hardness, which again, you know, 01:19:59.559 --> 01:20:00.750 is about as 01:20:00.750 --> 01:20:04.219 theoretical as it can get. It's 01:20:04.219 --> 01:20:05.800 had a huge impact 01:20:05.800 --> 01:20:09.309 on all of computer science, the theory of NP-hardness.
01:20:09.309 --> 01:20:10.669 But 01:20:10.669 --> 01:20:13.799 when you are off working on highly theoretical things, I guess, to me 01:20:13.799 --> 01:20:16.849 personally it's important to keep in mind: 01:20:16.849 --> 01:20:19.699 are you working on something like VC dimension, which is high impact, or are you 01:20:19.699 --> 01:20:23.290 working on something like convergence bounds for sampled non-monotonic logic, which 01:20:23.290 --> 01:20:24.710 you're only hoping 01:20:24.710 --> 01:20:25.900 has some peripheral relevance 01:20:25.900 --> 01:20:30.040 to some application? Okay? 01:20:30.040 --> 01:20:34.849 For me personally, I tend to work on an application only if I - excuse me. 01:20:34.849 --> 01:20:36.989 For me personally, and this is a personal choice, 01:20:36.989 --> 01:20:41.340 I tend to trust something only if I personally can see a link from the 01:20:41.340 --> 01:20:42.679 theory I'm working on 01:20:42.679 --> 01:20:44.430 all the way back to an application. 01:20:44.430 --> 01:20:46.010 And 01:20:46.010 --> 01:20:50.299 if I don't personally see a direct link from what I'm doing to an application, then, 01:20:50.299 --> 01:20:53.429 you know, that's fine. I can choose to work on theory, but 01:20:53.429 --> 01:20:55.650 I wouldn't necessarily trust that 01:20:55.650 --> 01:20:59.210 the theory I'm working on will relate to an application, if I don't personally 01:20:59.210 --> 01:21:02.429 see a link all the way back. 01:21:02.429 --> 01:21:04.400 Just to summarize. 01:21:04.400 --> 01:21:06.409 01:21:06.409 --> 01:21:08.679 One lesson to take away from today is, I think, that 01:21:08.679 --> 01:21:12.529 time spent coming up with diagnostics for learning algorithms is often time well spent. 01:21:12.529 --> 01:21:13.029 01:21:13.029 --> 01:21:16.199 It's often up to your own ingenuity to come up with great diagnostics.
And 01:21:16.199 --> 01:21:19.019 just personally, when I work on a machine learning algorithm, 01:21:19.019 --> 01:21:21.169 it's not uncommon for me to be spending 01:21:21.169 --> 01:21:23.679 between a third and often half of my time 01:21:23.679 --> 01:21:26.409 just writing diagnostics and trying to figure out what's going right and what's 01:21:26.409 --> 01:21:28.079 going wrong. 01:21:28.079 --> 01:21:31.500 Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and 01:21:31.500 --> 01:21:34.780 making progress. You don't want to be spending all this time, you know, implementing tests on your 01:21:34.780 --> 01:21:38.280 learning algorithms; it doesn't feel like you're doing anything. But when 01:21:38.280 --> 01:21:41.419 I implement learning algorithms, at least a third, and quite often half, of 01:21:41.419 --> 01:21:45.880 my time is actually spent implementing those tests so you can figure out what to work on. And 01:21:45.880 --> 01:21:49.219 I think it's actually one of the best uses of your time. We talked 01:21:49.219 --> 01:21:50.729 about error 01:21:50.729 --> 01:21:54.319 analysis and ablative analysis, and lastly we 01:21:54.319 --> 01:21:56.890 talked about, you know, different approaches and the 01:21:56.890 --> 01:22:00.979 risks of premature statistical optimization. Okay. 01:22:00.979 --> 01:22:04.339 Sorry I ran over. I'll be here for a few more minutes for your questions.