This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is actually wrap up our discussion on learning theory, and I'm gonna start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be on advice for applying machine learning algorithms to problems like, you know, the class project, or other problems you may go work on after you graduate from this class. But let's start by talking about Bayesian statistics and regularization.

So you remember, from last week we started to talk about learning theory, and we learned about bias and variance. And I guess we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and you may be less likely to overfit. Right? What I want to do today is to talk about a different way to prevent overfitting.
And there's a method called regularization, and that's a way that lets you keep all the parameters.

So here's the idea, and I'm gonna illustrate this with, say, linear regression. So you take the linear regression model, the very first model we learned about, right, and we said that we would choose the parameters via maximum likelihood. Right? And that meant that, you know, you would choose the parameters theta that maximized the probability of the data - the parameters theta that maximize the probability of the data we observe. Right?

And so, to give this sort of procedure a name, this is one example of the most common frequentist procedure, and frequentist statistics you can think of, sort of, as one school of statistics. And the philosophical view behind writing this down was: we envision that there is some true parameter theta out there that generated, you know, the Xs and the Ys. There's some true parameter theta that governs housing prices Y as a function of X, and we don't know what the value of theta is, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta. And the way you formulate this, you know, theta is not a random variable. Right? That's why we said theta is just some true value out there; it's not random or anything, we just don't know what it is, and we have a procedure called maximum likelihood for estimating the value of theta. So this is one example of what's called a frequentist procedure.
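In symbols, this maximum likelihood choice of the parameters - a sketch of what was on the board - is

    \theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)};\, \theta\big),

where the semicolon is there because theta is being treated as a fixed, non-random quantity.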
The alternative to the, I guess, frequentist school of statistics is the Bayesian school, in which we're gonna say that we don't know what theta is, and so we will put a prior on theta. Okay? So in the Bayesian school, we would say, "Well, we don't know the value of theta, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared I. Okay? And so - actually, let me use S to denote my training set - right, so this prior represents my beliefs about what the parameters are in the absence of any data. So, not having seen any data, it probably represents what I think theta is most likely to be.

And so, given the training set S, in the sort of Bayesian procedure we would calculate the posterior probability of the parameters given my training set - let's write this on the next board. So my posterior on my parameters given my training set, by Bayes' rule, will be proportional to the probability of the data given theta, times the prior on theta. Right? So by Bayes' rule. Let's call it the posterior.
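Written out - again, a sketch of the board equations - the prior and this posterior are

    \theta \sim \mathcal{N}\big(0,\; \tau^{2} I\big),
    \qquad
    p(\theta \mid S) \;\propto\; \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta).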
And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a new prediction on the price of a new house, with input X, I would say that, well, the distribution over the possible housing prices for this new house I'm trying to estimate the price of - given the features of the house X, and the training set I had previously - is going to be given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set. Okay? And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate over Y, Y times that distribution. Okay? You would take an expectation of Y with respect to this distribution. Okay?

And you notice that when I was writing this down, with the Bayesian formulation I've now started to write Y given X comma theta, because this formula is now the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta, I'm writing comma theta, because I'm now treating theta as a random variable.

So all of this is somewhat abstract, but it turns out - actually, let's check. Are there questions about this? No? Okay. Let's try to make this more concrete.
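Concretely, the two steps just described are, in equations,

    p(y \mid x, S) \;=\; \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta,
    \qquad
    \mathbb{E}[\, y \mid x, S \,] \;=\; \int_{y} y\, p(y \mid x, S)\, dy.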
It turns out that, for many problems, both of these steps in the computation are difficult, because if theta is an n-plus-one-dimensional parameter vector, then this is an integral over an n-plus-one-dimensional space, you know, over R^{n+1}. And numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral, if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms - say, Bayesian logistic regression - this is hard to do.

And so what's commonly done is, instead of actually computing the full posterior distribution P of theta given S, we'll take the quantity on the right-hand side and just maximize it. So let me write this down. So commonly, instead of computing the full posterior distribution, we will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta - the most probable value of theta under your posterior distribution. And that's just the arg max over theta of the likelihood of the data times the prior.
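In symbols, the MAP estimate is

    \theta_{\mathrm{MAP}} \;=\; \arg\max_{\theta}\, \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta),

which is the maximum likelihood objective multiplied by the extra prior factor p(theta).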
And then when you need to make a prediction, you know, you would just predict using your usual hypothesis, with this MAP value of theta as the parameter vector you choose. Okay?

And notice that the only difference between this and standard maximum likelihood estimation is that, instead of choosing the maximum likelihood value for theta, you're maximizing this - which is what you have for maximum likelihood estimation - times this other quantity, which is the prior. Right?

And let's see, one intuition is that if your prior on theta is Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? It's a Gaussian centered around the point zero, and so most of the mass is close to zero. And so this prior distribution is, in effect, saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection: if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta to be equal to zero. All right? So if you set theta five to be equal to zero, that's the same as, you know, eliminating feature five from your hypothesis. And so this is a prior that drives most of the parameter values to zero - to values close to zero - and you can think of it as doing something analogous to, or reminiscent of, feature selection. Okay?
And it turns out that with this formulation the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures, if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial - well, I think that one had too many bumps in it, but never mind - if you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice - it's sort of hard to see, but you can take my word for it - as tau becomes smaller and smaller, the curves you fit to your data also become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?

Let's see, one last piece of intuition that I would just toss out there - and you get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess - is that whereas maximum likelihood for, say, linear regression turns out to be minimizing the squared error, it turns out that if you add this prior term, the optimization objective you end up with has an extra term that penalizes your parameters theta for being large.
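In equations - roughly what was on the board - maximum likelihood for linear regression minimizes

    \sum_{i=1}^{m} \big(y^{(i)} - \theta^{T} x^{(i)}\big)^{2},

whereas with the Gaussian prior the objective you end up minimizing is

    \sum_{i=1}^{m} \big(y^{(i)} - \theta^{T} x^{(i)}\big)^{2} \;+\; \lambda \,\|\theta\|^{2},

for some constant lambda that grows as tau shrinks.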
And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect - again, it's kind of hard to see, but just take my word for it - that shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check if there are questions about all this.

Student: The smoothing behavior - is it because [inaudible] actually get different [inaudible]?

Instructor: Let's see. Yeah. It depends on - well, most priors with most of the mass close to zero will have this effect, I guess. And just by convention, the Gaussian prior is what's used most commonly for models like logistic regression and linear regression - generalized linear models. There are a few other priors that I sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.
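Here is a minimal sketch of that smoothing effect - this is not code from the course, and the data and the lambda values are made up - fitting a deliberately over-flexible polynomial by regularized least squares and watching the parameters shrink as lambda grows (equivalently, as tau shrinks):

    # Regularized least squares on a tiny made-up data set: minimize
    #   sum_i (y_i - theta^T x_i)^2 + lam * ||theta||^2
    # for a few values of lam and watch ||theta|| shrink as lam grows.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, size=8))                 # 8 points, made up
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)

    degree = 7                                                 # deliberately too flexible
    X = np.vander(x, degree + 1, increasing=True)              # columns 1, x, ..., x^7

    for lam in [0.0, 1e-4, 1e-1]:
        # closed form: theta = (X^T X + lam * I)^{-1} X^T y
        theta = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
        print(f"lambda={lam:g}  ||theta||={np.linalg.norm(theta):10.2f}  "
              f"train MSE={np.mean((X @ theta - y) ** 2):.4f}")

In practice you would pick lambda, or equivalently tau, by cross-validation rather than by eye, which is exactly the question that comes up next.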
And so it turns out that for problems like text classification - text classification has like 30,000 features or 50,000 features - it seems like an algorithm like logistic regression would be very prone to overfitting. Right? So imagine trying to build a spam classifier: maybe you have 100 training examples but you have 30,000 or 50,000 features. That seems clearly prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor: Yeah, right - and so to pick either tau squared or lambda - I think the relation is lambda equals one over tau squared - but right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So, all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. And so, you know, when you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what's called batch learning algorithms, where you're given a training set and then you get to run your learning algorithm on the training set, and then maybe you test it on some other test set.
And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So here's how the problem goes. All right? I'm first gonna give you X one - let's say this is a classification problem - so I'm first gonna give you X one and then ask you, you know, "Can you make a prediction on X one? Is the label one or zero?" And you've not seen any data yet. And so you make a guess. Right? We'll call your guess Y-hat one. And after you've made your prediction, I will then reveal to you the true label Y one. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly.

And then I show you X two. And then I ask you, "Can you make a prediction on X two?" And so you now maybe are gonna make a slightly more educated guess, and call that Y-hat two. And after you've made your guess, I reveal the true label to you. And so then I show you X three, and then you make your guess, and learning proceeds as follows.

So this is unlike a lot of machine learning - the batch learning setting - in that you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're running a website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes, and it's only as you're making predictions that you get to see more and more training examples.
So in online learning, what you care about is the total online error, which is the sum from i equals one to m - if you get a sequence of m examples all together - of the indicator that Y-hat i is not equal to Y i. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that, you know, many of the learning algorithms you've learned about can apply to this setting. One thing you could do is, when you're asked to make the prediction Y-hat three, right - one simple thing to do is, well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on the examples leading up to Y-hat three. So just run the learning algorithm on all the examples you've seen previous to being asked to make a prediction on a certain example, and then use your learning algorithm to make a prediction on the next example. And it turns out that there are also algorithms, especially the algorithms that we saw that you could use with stochastic gradient descent, that, you know, can be adapted very nicely to this.
So as a concrete example, if you remember the perceptron algorithm, say, right, you would initialize the parameters theta to be equal to zero. And then after seeing the i-th training example, you'd update the parameters, you know, using - you've seen this rule a lot of times now, right - the standard perceptron learning rule. And the same thing if you were using logistic regression: you can, again, after seeing each training example, just run, essentially, one step of stochastic gradient descent on just the example you saw. Okay?
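As a sketch of what that looks like in code - this is not the lecture's code; the data stream, the 0/1 labels, and the learning rate alpha are assumptions - here is the online perceptron using one standard form of the rule, and next to it one step of stochastic gradient ascent for logistic regression run on each example as it arrives:

    import numpy as np

    def online_perceptron(stream, n_features, alpha=1.0):
        theta = np.zeros(n_features)              # initialize theta to zero
        mistakes = 0                              # total online error
        for x, y in stream:                       # examples arrive one at a time
            y_hat = 1 if x @ theta >= 0 else 0    # predict before seeing the label
            mistakes += int(y_hat != y)
            theta += alpha * (y - y_hat) * x      # perceptron update on this one example
        return theta, mistakes

    def online_logistic_sgd(stream, n_features, alpha=0.1):
        theta = np.zeros(n_features)
        for x, y in stream:
            p = 1.0 / (1.0 + np.exp(-(x @ theta)))    # P(y = 1 | x, theta)
            theta += alpha * (y - p) * x              # one stochastic gradient step
        return theta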
And so the reason I've put this into the sort of "learning theory" section of this class is because it turns out that sometimes you can prove fairly amazing results on your total online error using algorithms like these. I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features X i are maybe infinite-dimensional feature vectors, like we saw for support vector machines - and sometimes infinite-dimensional feature vectors may use kernel representations, okay - even when the data is maybe extremely high-dimensional and it seems like you'd be prone to overfitting, you can prove that, so long as the positive and negative examples are separated by a margin in this infinite-dimensional space, the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after seeing only a finite number of examples, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay?

So let's see. The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that I mean, you know, it won't appear on the midterm and you won't be asked about it specifically in the problem sets. But I thought it'd be - I know some of you were curious after the previous lecture about how you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand.
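For reference, the result in those notes has roughly the following classical form (stated here from memory, so check the notes for the exact conditions): if the labels are in {-1, +1}, every example satisfies \|x^{(i)}\| \le D, and some unit-length vector u separates the data with margin gamma, meaning y^{(i)}\,(u^{T} x^{(i)}) \ge \gamma for all i, then the total number of mistakes the online perceptron makes is at most

    \left( \frac{D}{\gamma} \right)^{2},

no matter how high-dimensional the features are and no matter how many examples you see.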
So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay? But regardless of the theoretical results, you know, the online learning setting is something that comes up reasonably often, and so these algorithms based on stochastic gradient descent often do very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today - and this will actually be the majority of today's lecture; can I switch to the PowerPoint slides, please - is I want to spend most of today's lecture talking about advice for applying different machine learning algorithms.

And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully, because, you know, it turns out that you can take the same machine learning tool - say logistic regression - and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right?
And so what I want to do today, in the rest of the time I have, is try to convey to you, you know, some of the methods for making sure you're one of the people who really knows how to get these learning algorithms to work well on problems.

So, just some caveats on what I'm gonna, I guess, talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, conceptually most difficult material in this class to understand. All right? So this is not mathematical, but this is not easy. And I want to add the caveat that some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything I say. And some of what I'll say is also not good advice for doing machine learning research either, so I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work. If you work in a company and you want to deliver a product, or you're, you know, building a system and you just want your machine learning system to work - okay? - some of what I'm about to say today isn't great advice if your goal is to invent a new machine learning algorithm, but this is advice for how to make a machine learning algorithm work and, you know, deploy a working system.

So, three key areas I'm gonna talk about. One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis. And third, I want to talk about just advice for how to get started on a machine learning problem.
And one theme that'll come up later: you've heard about premature optimization, right, in writing software. This is when someone over-designs from the start - when someone, you know, is writing a piece of code and they choose a subroutine to optimize heavily, and maybe they write the subroutine in assembly or something. Many of us have been guilty of premature optimization, where we're trying to get a piece of code to run faster, and we choose some piece of code, implement it in assembly, and really tune it to run really quickly - and it turns out that wasn't the bottleneck in the code at all. Right? We call that premature optimization. And in undergraduate programming classes we warn people all the time not to do premature optimization, and people still do it all the time. Right?

And it turns out a very similar thing happens in building machine learning systems: many people are often guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system and that turns out not to be the important piece. Okay? So I'll talk about that later as well.

So let's first talk about debugging learning algorithms. As a motivating example, let's say you want to build an anti-spam system, and let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right? So instead of using 50,000 words, you've chosen a small set of 100 features to use for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right? So this is Bayesian logistic regression, and it's just like maximum likelihood but, you know, with that additional lambda times norm-of-theta-squared term. And we're maximizing rather than minimizing, so there's a minus lambda norm-of-theta squared instead of a plus lambda norm-of-theta squared.
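In symbols, the objective being maximized is something like

    \max_{\theta}\; \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}, \theta\big) \;-\; \lambda\, \|\theta\|^{2}.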
So the question is: you've implemented your Bayesian logistic regression algorithm, you've tested it on your test set, and you got unacceptably high error. So what do you do next? Right?

So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do, instead of saying, "Well, let's sit down and think about what could've gone wrong, and then we'll try to improve the algorithm." Well, obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect that even 100 features was too many, so you might try to get a smaller set of features. What's more common is you might suspect your features aren't good enough, so you might spend some time looking at the email headers to see if you can figure out better features for, you know, finding spam emails, or whatever. Right? So you sit around and come up with better features, such as from the email headers. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer to see if that works - and clearly that can't hurt, right, just running gradient descent longer.
Or maybe you remember, you know, hearing from class that maybe Newton's method converges better, so let's try that instead. You may want to tune the value of lambda, because you're not sure that was the right value. Or maybe you even want to try an SVM, because maybe you think an SVM might work better than logistic regression.

So I only listed eight things here, but you can imagine, if you were actually sitting down building a machine learning system, the options open to you are endless. You can think of, you know, hundreds of ways to improve a learning system. And some of these things - like, well, getting more training examples, surely that's gonna help - seem like a good use of your time. Right?

And it turns out that this approach of picking ways to improve the learning algorithm, picking one, and going for it might work, in the sense that it may eventually get you to a working system. But often it's very time-consuming, and I think it's often largely a matter of luck whether you end up fixing what the problem actually is. In particular, these eight improvements all fix very different problems, and some of them will be fixing problems that you don't have. And if you can rule out six of the eight - if, by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do - you can save yourself a lot of time.

So let's see how we can go about doing that. The people in industry and in research that I see who are really good would not go and try to change the learning algorithm randomly.
There are lots of things that Dialogue: 0,0:32:05.49,0:32:08.11,Default,,0000,0000,0000,,obviously improve your learning algorithm, Dialogue: 0,0:32:08.11,0:32:12.46,Default,,0000,0000,0000,,but the problem is there are so many of them it's hard to know what to do. Dialogue: 0,0:32:12.46,0:32:16.59,Default,,0000,0000,0000,,So what you find is that all the really good ones run various diagnostics to figure out Dialogue: 0,0:32:16.59,0:32:18.01,Default,,0000,0000,0000,,what the problem is Dialogue: 0,0:32:18.01,0:32:21.61,Default,,0000,0000,0000,,and to check where they think the problem is. Okay? Dialogue: 0,0:32:21.61,0:32:23.83,Default,,0000,0000,0000,,So Dialogue: 0,0:32:23.83,0:32:27.31,Default,,0000,0000,0000,,for our motivating story, right, we said - let's say Bayesian logistic regression test Dialogue: 0,0:32:27.31,0:32:29.01,Default,,0000,0000,0000,,error was 20 percent, Dialogue: 0,0:32:29.01,0:32:32.02,Default,,0000,0000,0000,,which let's say is unacceptably high. Dialogue: 0,0:32:32.02,0:32:34.83,Default,,0000,0000,0000,,And let's suppose you suspected the problem is Dialogue: 0,0:32:34.83,0:32:36.44,Default,,0000,0000,0000,,either overfitting, Dialogue: 0,0:32:36.44,0:32:37.79,Default,,0000,0000,0000,,so it's high bias, Dialogue: 0,0:32:37.79,0:32:42.24,Default,,0000,0000,0000,,or you suspect that, you know, maybe you have too few features to classify spam, so there's - Oh excuse Dialogue: 0,0:32:42.24,0:32:45.22,Default,,0000,0000,0000,,me; I think I Dialogue: 0,0:32:45.22,0:32:46.62,Default,,0000,0000,0000,,wrote that wrong. Dialogue: 0,0:32:46.62,0:32:48.08,Default,,0000,0000,0000,,Let's firstly - so let's Dialogue: 0,0:32:48.08,0:32:49.22,Default,,0000,0000,0000,,forget - forget the tables. Dialogue: 0,0:32:49.22,0:32:52.84,Default,,0000,0000,0000,,Suppose you suspect the problem is either high bias or high variance, and some of the text Dialogue: 0,0:32:52.84,0:32:54.73,Default,,0000,0000,0000,,here Dialogue: 0,0:32:54.73,0:32:55.25,Default,,0000,0000,0000,,doesn't make sense. And Dialogue: 0,0:32:55.25,0:32:56.43,Default,,0000,0000,0000,,you want to know Dialogue: 0,0:32:56.43,0:33:00.85,Default,,0000,0000,0000,,if you're overfitting, which would be high variance, or you have too few Dialogue: 0,0:33:00.85,0:33:06.24,Default,,0000,0000,0000,,features to classify spam, which would be high bias. I had those two reversed, sorry. Okay? So Dialogue: 0,0:33:06.24,0:33:08.75,Default,,0000,0000,0000,,how can you figure out whether the problem Dialogue: 0,0:33:08.75,0:33:10.79,Default,,0000,0000,0000,,is one of high bias Dialogue: 0,0:33:10.79,0:33:15.61,Default,,0000,0000,0000,,or high variance? Right? So it turns Dialogue: 0,0:33:15.61,0:33:19.01,Default,,0000,0000,0000,,out there's a simple diagnostic you can look at that will tell you Dialogue: 0,0:33:19.01,0:33:24.15,Default,,0000,0000,0000,,whether the problem is high bias or high variance. If you Dialogue: 0,0:33:24.15,0:33:27.90,Default,,0000,0000,0000,,remember the cartoon we'd seen previously for high variance problems, when you have high Dialogue: 0,0:33:27.90,0:33:29.71,Default,,0000,0000,0000,,variance Dialogue: 0,0:33:29.71,0:33:33.28,Default,,0000,0000,0000,,the training error will be much lower than the test error. All right? When you Dialogue: 0,0:33:33.28,0:33:36.14,Default,,0000,0000,0000,,have a high variance problem, that's when you're fitting Dialogue: 0,0:33:36.14,0:33:39.48,Default,,0000,0000,0000,,your training set very well.
That's when you're fitting, you know, a tenth order polynomial to Dialogue: 0,0:33:39.48,0:33:41.65,Default,,0000,0000,0000,,11 data points. All right? And Dialogue: 0,0:33:41.65,0:33:44.67,Default,,0000,0000,0000,,that's when you're just fitting the data set very well, and so your training error will be Dialogue: 0,0:33:44.67,0:33:45.67,Default,,0000,0000,0000,,much lower than Dialogue: 0,0:33:45.67,0:33:47.64,Default,,0000,0000,0000,,your test Dialogue: 0,0:33:47.64,0:33:49.94,Default,,0000,0000,0000,,error. And in contrast, if you have high bias, Dialogue: 0,0:33:49.94,0:33:52.70,Default,,0000,0000,0000,,that's when your training error will also be high. Right? Dialogue: 0,0:33:52.70,0:33:56.45,Default,,0000,0000,0000,,That's when your data is quadratic, say, but you're fitting a linear function to it Dialogue: 0,0:33:56.45,0:34:02.29,Default,,0000,0000,0000,,and so you aren't even fitting your training set well. So Dialogue: 0,0:34:02.29,0:34:04.45,Default,,0000,0000,0000,,just in cartoons, I guess, Dialogue: 0,0:34:04.45,0:34:07.95,Default,,0000,0000,0000,,this is a - this is what a typical learning curve for high variance looks Dialogue: 0,0:34:07.95,0:34:09.34,Default,,0000,0000,0000,,like. Dialogue: 0,0:34:09.34,0:34:13.69,Default,,0000,0000,0000,,On your horizontal axis, I'm plotting the training set size M, right, Dialogue: 0,0:34:13.69,0:34:16.43,Default,,0000,0000,0000,,and on vertical axis, I'm plotting the error. Dialogue: 0,0:34:16.43,0:34:19.47,Default,,0000,0000,0000,,And so, let's see, Dialogue: 0,0:34:19.47,0:34:21.03,Default,,0000,0000,0000,,you know, as you increase - Dialogue: 0,0:34:21.03,0:34:25.12,Default,,0000,0000,0000,,if you have a high variance problem, you'll notice as the training set size, M, Dialogue: 0,0:34:25.12,0:34:29.22,Default,,0000,0000,0000,,increases, your test set error will keep on decreasing. Dialogue: 0,0:34:29.22,0:34:32.83,Default,,0000,0000,0000,,And so this sort of suggests that, well, if you can increase the training set size even Dialogue: 0,0:34:32.83,0:34:36.36,Default,,0000,0000,0000,,further, maybe if you extrapolate the green curve out, maybe Dialogue: 0,0:34:36.36,0:34:39.97,Default,,0000,0000,0000,,that test set error will decrease even further. All right? Dialogue: 0,0:34:39.97,0:34:43.40,Default,,0000,0000,0000,,Another thing that's useful to plot here is - let's say Dialogue: 0,0:34:43.40,0:34:46.54,Default,,0000,0000,0000,,the red horizontal line is the desired performance Dialogue: 0,0:34:46.54,0:34:50.26,Default,,0000,0000,0000,,you're trying to reach, another useful thing to plot is actually the training error. Right? Dialogue: 0,0:34:50.26,0:34:52.01,Default,,0000,0000,0000,,And it turns out that Dialogue: 0,0:34:52.01,0:34:59.01,Default,,0000,0000,0000,,your training error will actually grow as a function of the training set size Dialogue: 0,0:34:59.25,0:35:01.61,Default,,0000,0000,0000,,because the larger your training set, Dialogue: 0,0:35:01.61,0:35:03.62,Default,,0000,0000,0000,,the harder it is to fit, Dialogue: 0,0:35:03.62,0:35:06.15,Default,,0000,0000,0000,,you know, your training set perfectly. Right? Dialogue: 0,0:35:06.15,0:35:09.25,Default,,0000,0000,0000,,So this is just a cartoon, don't take it too seriously, but in general, your training error Dialogue: 0,0:35:09.25,0:35:11.42,Default,,0000,0000,0000,,will actually grow Dialogue: 0,0:35:11.42,0:35:15.08,Default,,0000,0000,0000,,as a function of your training set size. 
Because with small training sets, if you have one data point, Dialogue: 0,0:35:15.08,0:35:17.77,Default,,0000,0000,0000,,it's really easy to fit that perfectly, but if you have Dialogue: 0,0:35:17.77,0:35:22.10,Default,,0000,0000,0000,,10,000 data points, it's much harder to fit that perfectly. Dialogue: 0,0:35:22.10,0:35:23.15,Default,,0000,0000,0000,,All right? Dialogue: 0,0:35:23.15,0:35:27.96,Default,,0000,0000,0000,,And so another diagnostic for high variance, and the one that I tend to use more, Dialogue: 0,0:35:27.96,0:35:31.67,Default,,0000,0000,0000,,is to just look at training versus test error. And if there's a large gap between Dialogue: 0,0:35:31.67,0:35:32.79,Default,,0000,0000,0000,,them, Dialogue: 0,0:35:32.79,0:35:34.16,Default,,0000,0000,0000,,then this suggests that, you know, Dialogue: 0,0:35:34.16,0:35:39.63,Default,,0000,0000,0000,,getting more training data may allow you to help close that gap. Okay? Dialogue: 0,0:35:39.63,0:35:41.42,Default,,0000,0000,0000,,So this is Dialogue: 0,0:35:41.42,0:35:42.34,Default,,0000,0000,0000,,what the Dialogue: 0,0:35:42.34,0:35:45.06,Default,,0000,0000,0000,,cartoon would look like when - in the Dialogue: 0,0:35:45.06,0:35:49.20,Default,,0000,0000,0000,,case of high variance. Dialogue: 0,0:35:49.20,0:35:53.10,Default,,0000,0000,0000,,This is what the cartoon looks like for high bias. Right? If you Dialogue: 0,0:35:53.10,0:35:54.78,Default,,0000,0000,0000,,look at the learning curve, you Dialogue: 0,0:35:54.78,0:35:57.50,Default,,0000,0000,0000,,see that the curve for test error Dialogue: 0,0:35:57.50,0:36:01.42,Default,,0000,0000,0000,,has flattened out already. And so this is a sign that, Dialogue: 0,0:36:01.42,0:36:05.18,Default,,0000,0000,0000,,you know, if you get more training examples, if you extrapolate this curve Dialogue: 0,0:36:05.18,0:36:06.52,Default,,0000,0000,0000,,further to the right, Dialogue: 0,0:36:06.52,0:36:09.67,Default,,0000,0000,0000,,it's maybe not likely to go down much further. Dialogue: 0,0:36:09.67,0:36:12.47,Default,,0000,0000,0000,,And this is a property of high bias: that getting more training data won't Dialogue: 0,0:36:12.47,0:36:15.62,Default,,0000,0000,0000,,necessarily help. Dialogue: 0,0:36:15.62,0:36:18.100,Default,,0000,0000,0000,,But again, to me the more useful diagnostic is Dialogue: 0,0:36:18.100,0:36:20.30,Default,,0000,0000,0000,,if you plot Dialogue: 0,0:36:20.30,0:36:23.100,Default,,0000,0000,0000,,your training error as well - if you look at your training error as well as your, you know, Dialogue: 0,0:36:23.100,0:36:26.37,Default,,0000,0000,0000,,held-out test set error. Dialogue: 0,0:36:26.37,0:36:29.41,Default,,0000,0000,0000,,If you find that even your training error Dialogue: 0,0:36:29.41,0:36:31.53,Default,,0000,0000,0000,,is high, Dialogue: 0,0:36:31.53,0:36:34.78,Default,,0000,0000,0000,,then that's a sign that getting more training data is not Dialogue: 0,0:36:34.78,0:36:38.27,Default,,0000,0000,0000,,going to help. Right? Dialogue: 0,0:36:38.27,0:36:42.20,Default,,0000,0000,0000,,In fact, you know, think about it, Dialogue: 0,0:36:42.20,0:36:44.54,Default,,0000,0000,0000,,training error Dialogue: 0,0:36:44.54,0:36:48.09,Default,,0000,0000,0000,,grows as a function of your training set size.
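As a rough illustration of the learning-curve cartoon just described (this sketch is not from the lecture; it assumes a scikit-learn-style setup, and X, y, and the size grid are hypothetical placeholders), one might plot training and test error against the training set size m like this:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def plot_learning_curves(X, y, sizes):
        # Hold out a fixed test set, then retrain on progressively larger
        # slices of the training data and record both errors.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        train_err, test_err = [], []
        for m in sizes:
            clf = LogisticRegression(max_iter=1000).fit(X_tr[:m], y_tr[:m])
            train_err.append(1 - clf.score(X_tr[:m], y_tr[:m]))  # error on the m examples trained on
            test_err.append(1 - clf.score(X_te, y_te))           # error on the held-out test set
        plt.plot(sizes, train_err, label="training error")
        plt.plot(sizes, test_err, label="test error")
        plt.xlabel("training set size m")
        plt.ylabel("error")
        plt.legend()
        plt.show()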
Dialogue: 0,0:36:48.09,0:36:50.45,Default,,0000,0000,0000,,And so if your Dialogue: 0,0:36:50.45,0:36:55.57,Default,,0000,0000,0000,,training error is already above your level of desired performance, Dialogue: 0,0:36:55.57,0:36:56.60,Default,,0000,0000,0000,,then Dialogue: 0,0:36:56.60,0:37:00.79,Default,,0000,0000,0000,,getting even more training data is not going to reduce your training Dialogue: 0,0:37:00.79,0:37:03.01,Default,,0000,0000,0000,,error down to the desired level of performance. Right? Dialogue: 0,0:37:03.01,0:37:06.47,Default,,0000,0000,0000,,Because, you know, your training error sort of only gets worse as you get more and more training Dialogue: 0,0:37:06.47,0:37:07.55,Default,,0000,0000,0000,,examples. Dialogue: 0,0:37:07.55,0:37:10.80,Default,,0000,0000,0000,,So if you extrapolate further to the right, it's not like this blue line will come Dialogue: 0,0:37:10.80,0:37:13.40,Default,,0000,0000,0000,,back down to the level of desired performance. Right? This will stay up Dialogue: 0,0:37:13.40,0:37:17.48,Default,,0000,0000,0000,,there. Okay? So for Dialogue: 0,0:37:17.48,0:37:21.34,Default,,0000,0000,0000,,me personally, I actually, when looking at a curve like the green Dialogue: 0,0:37:21.34,0:37:25.38,Default,,0000,0000,0000,,curve on test error, I actually personally tend to find it very difficult to tell Dialogue: 0,0:37:25.38,0:37:29.00,Default,,0000,0000,0000,,if the curve is still going down or if it's [inaudible]. Sometimes you can tell, but very Dialogue: 0,0:37:29.00,0:37:31.01,Default,,0000,0000,0000,,often, it's somewhat Dialogue: 0,0:37:31.01,0:37:32.90,Default,,0000,0000,0000,,ambiguous. So for me personally, Dialogue: 0,0:37:32.90,0:37:37.13,Default,,0000,0000,0000,,the diagnostic I tend to use the most often to tell if I have a bias problem or a variance Dialogue: 0,0:37:37.13,0:37:37.86,Default,,0000,0000,0000,,problem Dialogue: 0,0:37:37.86,0:37:41.32,Default,,0000,0000,0000,,is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay? And so, Dialogue: 0,0:37:41.32,0:37:45.42,Default,,0000,0000,0000,,going Dialogue: 0,0:37:45.42,0:37:47.13,Default,,0000,0000,0000,,back to Dialogue: 0,0:37:47.13,0:37:52.40,Default,,0000,0000,0000,,the list of fixes, look Dialogue: 0,0:37:52.40,0:37:54.11,Default,,0000,0000,0000,,at the first fix, Dialogue: 0,0:37:54.11,0:37:56.34,Default,,0000,0000,0000,,getting more training examples Dialogue: 0,0:37:56.34,0:37:58.65,Default,,0000,0000,0000,,is a way to fix high variance. Dialogue: 0,0:37:58.65,0:38:02.75,Default,,0000,0000,0000,,Right? If you have a high variance problem, getting more training examples will help. Dialogue: 0,0:38:02.75,0:38:05.53,Default,,0000,0000,0000,,Trying a smaller set of features: Dialogue: 0,0:38:05.53,0:38:11.76,Default,,0000,0000,0000,,that also fixes high variance. All right? Dialogue: 0,0:38:11.76,0:38:15.87,Default,,0000,0000,0000,,Trying a larger set of features or adding email features, these Dialogue: 0,0:38:15.87,0:38:20.15,Default,,0000,0000,0000,,are solutions that fix high bias. Right? Dialogue: 0,0:38:20.15,0:38:26.77,Default,,0000,0000,0000,,So high bias being if you're hypothesis was too simple, you didn't have enough features. Okay? 
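A minimal sketch of that decision rule (the gap threshold and the error numbers below are made-up illustrations; only the logic mirrors the diagnostic being described):

    def diagnose(train_err, test_err, desired_err, gap_threshold=0.05):
        # Large gap between training and test error: looks like high variance.
        if test_err - train_err > gap_threshold:
            return "high variance - more data or a smaller feature set may help"
        # Training error itself already misses the target: looks like high bias.
        if train_err > desired_err:
            return "high bias - better or larger features may help; more data alone will not"
        return "roughly at desired performance"

    print(diagnose(train_err=0.01, test_err=0.20, desired_err=0.05))  # suggests high variance
    print(diagnose(train_err=0.18, test_err=0.20, desired_err=0.05))  # suggests high bias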
Dialogue: 0,0:38:26.77,0:38:29.07,Default,,0000,0000,0000,,And so Dialogue: 0,0:38:29.07,0:38:33.58,Default,,0000,0000,0000,,quite often you see people working on machine learning problems Dialogue: 0,0:38:33.58,0:38:34.59,Default,,0000,0000,0000,,and Dialogue: 0,0:38:34.59,0:38:37.57,Default,,0000,0000,0000,,they'll remember that getting more training examples helps. And Dialogue: 0,0:38:37.57,0:38:41.12,Default,,0000,0000,0000,,so, they'll build a learning system, build an anti-spam system and it doesn't work. Dialogue: 0,0:38:41.12,0:38:42.23,Default,,0000,0000,0000,,And then they Dialogue: 0,0:38:42.23,0:38:45.100,Default,,0000,0000,0000,,go off and spend lots of time and money and effort collecting more training data Dialogue: 0,0:38:45.100,0:38:50.51,Default,,0000,0000,0000,,because they'll say, "Oh well, getting more data's obviously got to help." Dialogue: 0,0:38:50.51,0:38:53.32,Default,,0000,0000,0000,,But if they had a high bias problem in the first place, and not a high variance Dialogue: 0,0:38:53.32,0:38:54.89,Default,,0000,0000,0000,,problem, Dialogue: 0,0:38:54.89,0:38:56.77,Default,,0000,0000,0000,,it's entirely possible to spend Dialogue: 0,0:38:56.77,0:39:00.15,Default,,0000,0000,0000,,three months or six months collecting more and more training data, Dialogue: 0,0:39:00.15,0:39:04.100,Default,,0000,0000,0000,,not realizing that it couldn't possibly help. Right? Dialogue: 0,0:39:04.100,0:39:07.62,Default,,0000,0000,0000,,And so, this actually happens a lot in, you Dialogue: 0,0:39:07.62,0:39:12.41,Default,,0000,0000,0000,,know, in Silicon Valley and companies, this happens a lot. There will often Dialogue: 0,0:39:12.41,0:39:15.33,Default,,0000,0000,0000,,people building various machine learning systems, and Dialogue: 0,0:39:15.33,0:39:19.52,Default,,0000,0000,0000,,they'll often - you often see people spending six months working on fixing a Dialogue: 0,0:39:19.52,0:39:20.100,Default,,0000,0000,0000,,learning algorithm Dialogue: 0,0:39:20.100,0:39:23.94,Default,,0000,0000,0000,,and you could've told them six months ago that, you know, Dialogue: 0,0:39:23.94,0:39:27.21,Default,,0000,0000,0000,,that couldn't possibly have helped. But because they didn't know what the Dialogue: 0,0:39:27.21,0:39:28.71,Default,,0000,0000,0000,,problem was, and Dialogue: 0,0:39:28.71,0:39:33.55,Default,,0000,0000,0000,,they'd easily spend six months trying to invent new features or something. And Dialogue: 0,0:39:33.55,0:39:37.81,Default,,0000,0000,0000,,this is - you see this surprisingly often and this is somewhat depressing. You could've gone to them and Dialogue: 0,0:39:37.81,0:39:42.29,Default,,0000,0000,0000,,told them, "I could've told you six months ago that this was not going to help." And Dialogue: 0,0:39:42.29,0:39:46.15,Default,,0000,0000,0000,,the six months is not a joke, you actually see Dialogue: 0,0:39:46.15,0:39:47.71,Default,,0000,0000,0000,,this. Dialogue: 0,0:39:47.71,0:39:49.51,Default,,0000,0000,0000,,And in contrast, if you Dialogue: 0,0:39:49.51,0:39:53.05,Default,,0000,0000,0000,,actually figure out the problem's one of high bias or high variance, then Dialogue: 0,0:39:53.05,0:39:54.30,Default,,0000,0000,0000,,you can rule out Dialogue: 0,0:39:54.30,0:39:55.80,Default,,0000,0000,0000,,two of these solutions and Dialogue: 0,0:39:55.80,0:40:00.78,Default,,0000,0000,0000,,save yourself many months of fruitless effort. Okay? I actually Dialogue: 0,0:40:00.78,0:40:03.71,Default,,0000,0000,0000,,want to talk about these four at the bottom as well. 
But before I move on, let me Dialogue: 0,0:40:03.71,0:40:05.32,Default,,0000,0000,0000,,just check if there were questions about what I've talked Dialogue: 0,0:40:05.32,0:40:12.32,Default,,0000,0000,0000,,about so far. No? Okay, great. So bias Dialogue: 0,0:40:20.21,0:40:23.22,Default,,0000,0000,0000,,versus variance is one thing that comes up Dialogue: 0,0:40:23.22,0:40:29.54,Default,,0000,0000,0000,,often. This bias versus variance is one common diagnostic. And so, Dialogue: 0,0:40:29.54,0:40:33.18,Default,,0000,0000,0000,,for other machine learning problems, it's often up to your own ingenuity to figure out Dialogue: 0,0:40:33.18,0:40:35.70,Default,,0000,0000,0000,,your own diagnostics to figure out what's wrong. All right? Dialogue: 0,0:40:35.70,0:40:41.23,Default,,0000,0000,0000,,So if a machine-learning algorithm isn't working, very often it's up to you to figure out, you Dialogue: 0,0:40:41.23,0:40:44.30,Default,,0000,0000,0000,,know, to construct your own tests. Like do you look at the difference between training and Dialogue: 0,0:40:44.30,0:40:46.50,Default,,0000,0000,0000,,test errors or do you look at something else? Dialogue: 0,0:40:46.50,0:40:49.93,Default,,0000,0000,0000,,It's often up to your own ingenuity to construct your own diagnostics to figure out what's Dialogue: 0,0:40:49.93,0:40:52.59,Default,,0000,0000,0000,,going on. Dialogue: 0,0:40:52.59,0:40:55.03,Default,,0000,0000,0000,,What I want to do is go through another example. All right? Dialogue: 0,0:40:55.03,0:40:58.89,Default,,0000,0000,0000,,And this one is slightly more contrived but it'll illustrate another Dialogue: 0,0:40:58.89,0:41:02.77,Default,,0000,0000,0000,,common question that comes up, another one of the most common Dialogue: 0,0:41:02.77,0:41:04.75,Default,,0000,0000,0000,,issues that comes up in applying Dialogue: 0,0:41:04.75,0:41:06.09,Default,,0000,0000,0000,,learning algorithms. Dialogue: 0,0:41:06.09,0:41:08.32,Default,,0000,0000,0000,,So in this example, it's slightly more contrived, Dialogue: 0,0:41:08.32,0:41:11.58,Default,,0000,0000,0000,,let's say you implement Bayesian logistic regression Dialogue: 0,0:41:11.58,0:41:17.55,Default,,0000,0000,0000,,and you get 2 percent error on spam mail and 2 percent error on non-spam mail. Right? So Dialogue: 0,0:41:17.55,0:41:19.15,Default,,0000,0000,0000,,it's rejecting, you know, Dialogue: 0,0:41:19.15,0:41:21.45,Default,,0000,0000,0000,,2 percent of - Dialogue: 0,0:41:21.45,0:41:25.18,Default,,0000,0000,0000,,it's rejecting 98 percent of your spam mail, which is fine, so 2 percent of all Dialogue: 0,0:41:25.18,0:41:26.96,Default,,0000,0000,0000,,spam gets Dialogue: 0,0:41:26.96,0:41:30.66,Default,,0000,0000,0000,,through which is fine, but it's also rejecting 2 percent of your good email, Dialogue: 0,0:41:30.66,0:41:35.49,Default,,0000,0000,0000,,2 percent of the email from your friends and that's unacceptably high, let's Dialogue: 0,0:41:35.49,0:41:36.91,Default,,0000,0000,0000,,say. Dialogue: 0,0:41:36.91,0:41:39.01,Default,,0000,0000,0000,,And let's say that Dialogue: 0,0:41:39.01,0:41:41.90,Default,,0000,0000,0000,,a support vector machine using a linear kernel Dialogue: 0,0:41:41.90,0:41:44.83,Default,,0000,0000,0000,,gets 10 percent error on spam and Dialogue: 0,0:41:44.83,0:41:49.07,Default,,0000,0000,0000,,0.01 percent error on non-spam, which is more like the acceptable performance you want. And let's say for the sake of this Dialogue: 0,0:41:49.07,0:41:53.36,Default,,0000,0000,0000,,example, let's say you're trying to build an anti-spam system. Right?
Dialogue: 0,0:41:53.36,0:41:56.17,Default,,0000,0000,0000,,Let's say that you really want to deploy Dialogue: 0,0:41:56.17,0:41:57.68,Default,,0000,0000,0000,,logistic regression Dialogue: 0,0:41:57.68,0:42:01.21,Default,,0000,0000,0000,,to your customers because of computational efficiency or because you need Dialogue: 0,0:42:01.21,0:42:03.39,Default,,0000,0000,0000,,to retrain overnight every day, Dialogue: 0,0:42:03.39,0:42:07.32,Default,,0000,0000,0000,,and because logistic regression just runs more easily and more quickly or something. Okay? So let's Dialogue: 0,0:42:07.32,0:42:08.67,Default,,0000,0000,0000,,say you want to deploy logistic Dialogue: 0,0:42:08.67,0:42:12.65,Default,,0000,0000,0000,,regression, but it's just not working out well. So the Dialogue: 0,0:42:12.65,0:42:17.61,Default,,0000,0000,0000,,question is: What do you do next? So it Dialogue: 0,0:42:17.61,0:42:18.83,Default,,0000,0000,0000,,turns out that this - Dialogue: 0,0:42:18.83,0:42:22.32,Default,,0000,0000,0000,,the issue that comes up here, the other common question that Dialogue: 0,0:42:22.32,0:42:24.89,Default,,0000,0000,0000,,comes up is Dialogue: 0,0:42:24.89,0:42:30.19,Default,,0000,0000,0000,,a question of is the algorithm converging. So you might suspect that maybe Dialogue: 0,0:42:30.19,0:42:33.30,Default,,0000,0000,0000,,the problem with logistic regression is that it's just not converging. Dialogue: 0,0:42:33.30,0:42:36.31,Default,,0000,0000,0000,,Maybe you need to run more iterations. And Dialogue: 0,0:42:36.31,0:42:37.76,Default,,0000,0000,0000,,it Dialogue: 0,0:42:37.76,0:42:40.36,Default,,0000,0000,0000,,turns out that, again if you look at the optimization objective, say, Dialogue: 0,0:42:40.36,0:42:43.71,Default,,0000,0000,0000,,logistic regression is, let's say, optimizing J Dialogue: 0,0:42:43.71,0:42:46.73,Default,,0000,0000,0000,,of theta, it actually turns out that if you look at your objective as a function of the number Dialogue: 0,0:42:46.73,0:42:51.81,Default,,0000,0000,0000,,of iterations, when you look Dialogue: 0,0:42:51.81,0:42:55.01,Default,,0000,0000,0000,,at this curve, you know, it sort of looks like it's going up but it sort of Dialogue: 0,0:42:55.01,0:42:57.63,Default,,0000,0000,0000,,looks like there's an asymptote. And Dialogue: 0,0:42:57.63,0:43:00.95,Default,,0000,0000,0000,,when you look at these curves, it's often very hard to tell Dialogue: 0,0:43:00.95,0:43:03.73,Default,,0000,0000,0000,,if the curve has already flattened out. All right? And you look at these Dialogue: 0,0:43:03.73,0:43:05.98,Default,,0000,0000,0000,,curves a lot so you can ask: Dialogue: 0,0:43:05.98,0:43:08.23,Default,,0000,0000,0000,,Well has the algorithm converged? When you look at the J of theta like this, it's Dialogue: 0,0:43:08.23,0:43:10.33,Default,,0000,0000,0000,,often hard to tell. Dialogue: 0,0:43:10.33,0:43:14.15,Default,,0000,0000,0000,,You can run this ten times as long and see if it's flattened out. And you can run this ten Dialogue: 0,0:43:14.15,0:43:21.08,Default,,0000,0000,0000,,times as long and it'll often still look like maybe it's going up very slowly, or something. Right? Dialogue: 0,0:43:21.08,0:43:24.92,Default,,0000,0000,0000,,So you'd like a better diagnostic for whether logistic regression has converged than just Dialogue: 0,0:43:24.92,0:43:28.81,Default,,0000,0000,0000,,looking at this curve.
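If you do record the objective after every iteration (say in a list called J_history, a hypothetical name), the curve being described is simply:

    import matplotlib.pyplot as plt

    # J_history is assumed to hold J(theta) after each gradient-descent iteration.
    # The point being made is that this curve is often too ambiguous to judge
    # convergence by eye, which is why a better diagnostic is wanted.
    plt.plot(range(len(J_history)), J_history)
    plt.xlabel("iteration")
    plt.ylabel("J(theta)")
    plt.show()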
Dialogue: 0,0:43:28.81,0:43:32.09,Default,,0000,0000,0000,,The other question you might wonder - the other thing you might Dialogue: 0,0:43:32.09,0:43:36.71,Default,,0000,0000,0000,,suspect is a problem is: are you optimizing the right function? Dialogue: 0,0:43:36.71,0:43:38.92,Default,,0000,0000,0000,,So Dialogue: 0,0:43:38.92,0:43:40.60,Default,,0000,0000,0000,,what you care about, Dialogue: 0,0:43:40.60,0:43:42.88,Default,,0000,0000,0000,,right, in spam, say, Dialogue: 0,0:43:42.88,0:43:44.26,Default,,0000,0000,0000,,is a Dialogue: 0,0:43:44.26,0:43:47.50,Default,,0000,0000,0000,,weighted accuracy function like that. So A of theta is, Dialogue: 0,0:43:47.50,0:43:49.19,Default,,0000,0000,0000,,you know, sum over your Dialogue: 0,0:43:49.19,0:43:52.25,Default,,0000,0000,0000,,examples of some weights times whether you got it right. Dialogue: 0,0:43:52.25,0:43:56.81,Default,,0000,0000,0000,,And so the weight may be higher for non-spam than for spam mail because you care Dialogue: 0,0:43:56.81,0:43:57.71,Default,,0000,0000,0000,,about getting Dialogue: 0,0:43:57.71,0:44:01.47,Default,,0000,0000,0000,,your predictions correct on non-spam email much more than on spam email, say. So let's Dialogue: 0,0:44:01.47,0:44:02.36,Default,,0000,0000,0000,, Dialogue: 0,0:44:02.36,0:44:05.47,Default,,0000,0000,0000,,say A of theta Dialogue: 0,0:44:05.47,0:44:10.82,Default,,0000,0000,0000,,is the optimization objective that you really care about, but what Bayesian logistic regression does is Dialogue: 0,0:44:10.82,0:44:15.40,Default,,0000,0000,0000,,optimize a quantity like that. Right? It's this Dialogue: 0,0:44:15.40,0:44:17.69,Default,,0000,0000,0000,,sort of maximum likelihood thing Dialogue: 0,0:44:17.69,0:44:19.38,Default,,0000,0000,0000,,and then with this Dialogue: 0,0:44:19.38,0:44:20.85,Default,,0000,0000,0000,,two-norm, you know, Dialogue: 0,0:44:20.85,0:44:22.78,Default,,0000,0000,0000,,penalty thing that we saw previously. And you Dialogue: 0,0:44:22.78,0:44:26.50,Default,,0000,0000,0000,,might be wondering: Is this the right optimization function to be optimizing? Dialogue: 0,0:44:26.50,0:44:30.95,Default,,0000,0000,0000,,Okay? Or do I maybe need to change the value for lambda, Dialogue: 0,0:44:30.95,0:44:33.90,Default,,0000,0000,0000,,to change this parameter? Or: Dialogue: 0,0:44:33.90,0:44:39.82,Default,,0000,0000,0000,,Should I maybe really be switching to the support vector machine optimization objective? Dialogue: 0,0:44:39.82,0:44:42.13,Default,,0000,0000,0000,,Okay? Does that make sense? So Dialogue: 0,0:44:42.13,0:44:44.49,Default,,0000,0000,0000,,the second diagnostic I'm gonna talk about Dialogue: 0,0:44:44.49,0:44:46.99,Default,,0000,0000,0000,,is let's say you want to figure out: Dialogue: 0,0:44:46.99,0:44:50.61,Default,,0000,0000,0000,,is the algorithm converging, is the optimization algorithm converging, or Dialogue: 0,0:44:50.61,0:44:51.90,Default,,0000,0000,0000,,is the problem with Dialogue: 0,0:44:51.90,0:44:57.75,Default,,0000,0000,0000,,the optimization objective I chose in the first place? Okay? Dialogue: 0,0:44:57.75,0:45:02.82,Default,,0000,0000,0000,,So here's Dialogue: 0,0:45:02.82,0:45:07.33,Default,,0000,0000,0000,,the diagnostic you can use. Let me let - right.
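To make the two objectives concrete, here is a hedged sketch of the weighted accuracy A(theta) and the Bayesian-logistic-regression objective J(theta) being contrasted; the weights w, the regularization strength lam, and the linear decision rule are illustrative assumptions, not anything the lecture pins down:

    import numpy as np

    def weighted_accuracy(theta, X, y, w):
        # A(theta) = sum_i w_i * 1{prediction on example i equals y_i};
        # the weights w_i might be chosen larger for non-spam examples than for spam.
        preds = (X @ theta > 0).astype(int)
        return np.sum(w * (preds == y))

    def bayesian_lr_objective(theta, X, y, lam):
        # J(theta) = logistic-regression log-likelihood minus lambda * ||theta||^2,
        # i.e. the maximum-likelihood term with the two-norm penalty discussed earlier.
        z = X @ theta
        log_lik = np.sum(y * z - np.logaddexp(0, z))
        return log_lik - lam * np.dot(theta, theta)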
So to Dialogue: 0,0:45:07.33,0:45:11.03,Default,,0000,0000,0000,,just reiterate the story, right, let's say an SVM outperforms Bayesian Dialogue: 0,0:45:11.03,0:45:13.52,Default,,0000,0000,0000,,logistic regression but you really want to deploy Dialogue: 0,0:45:13.52,0:45:16.76,Default,,0000,0000,0000,,Bayesian logistic regression to your problem. Let me Dialogue: 0,0:45:16.76,0:45:19.05,Default,,0000,0000,0000,,let theta subscript SVM, be the Dialogue: 0,0:45:19.05,0:45:21.67,Default,,0000,0000,0000,,parameters learned by an SVM, Dialogue: 0,0:45:21.67,0:45:25.26,Default,,0000,0000,0000,,and I'll let theta subscript BLR be the parameters learned by Bayesian Dialogue: 0,0:45:25.26,0:45:28.05,Default,,0000,0000,0000,,logistic regression. Dialogue: 0,0:45:28.05,0:45:32.48,Default,,0000,0000,0000,,So the optimization objective you care about is this, you know, weighted accuracy Dialogue: 0,0:45:32.48,0:45:35.08,Default,,0000,0000,0000,,criteria that I talked about just now. Dialogue: 0,0:45:35.08,0:45:37.86,Default,,0000,0000,0000,,And Dialogue: 0,0:45:37.86,0:45:41.74,Default,,0000,0000,0000,,the support vector machine outperforms Bayesian logistic regression. And so, you know, Dialogue: 0,0:45:41.74,0:45:44.97,Default,,0000,0000,0000,,the weighted accuracy on the supportvector-machine parameters Dialogue: 0,0:45:44.97,0:45:46.97,Default,,0000,0000,0000,,is better than the weighted accuracy Dialogue: 0,0:45:46.97,0:45:50.18,Default,,0000,0000,0000,,for Bayesian logistic regression. Dialogue: 0,0:45:50.18,0:45:53.93,Default,,0000,0000,0000,,So Dialogue: 0,0:45:53.93,0:45:57.04,Default,,0000,0000,0000,,further, Bayesian logistic regression tries to optimize Dialogue: 0,0:45:57.04,0:45:59.41,Default,,0000,0000,0000,,an optimization objective like that, which I Dialogue: 0,0:45:59.41,0:46:02.27,Default,,0000,0000,0000,,denoted J theta. Dialogue: 0,0:46:02.27,0:46:05.84,Default,,0000,0000,0000,,And so, the diagnostic I choose to use is Dialogue: 0,0:46:05.84,0:46:08.43,Default,,0000,0000,0000,,to see if J of SVM Dialogue: 0,0:46:08.43,0:46:12.27,Default,,0000,0000,0000,,is bigger-than or less-than J of BLR. Okay? Dialogue: 0,0:46:12.27,0:46:14.61,Default,,0000,0000,0000,,So I explain this on the next slide. Dialogue: 0,0:46:14.61,0:46:15.57,Default,,0000,0000,0000,,So Dialogue: 0,0:46:15.57,0:46:19.53,Default,,0000,0000,0000,,we know two facts. We know that - well we know one fact. We know that a weighted Dialogue: 0,0:46:19.53,0:46:20.52,Default,,0000,0000,0000,,accuracy Dialogue: 0,0:46:20.52,0:46:23.16,Default,,0000,0000,0000,,of support vector machine, right, Dialogue: 0,0:46:23.16,0:46:24.48,Default,,0000,0000,0000,,is bigger than Dialogue: 0,0:46:24.48,0:46:28.86,Default,,0000,0000,0000,,this weighted accuracy of Bayesian logistic regression. So Dialogue: 0,0:46:28.86,0:46:32.21,Default,,0000,0000,0000,,in order for me to figure out whether Bayesian logistic regression is converging, Dialogue: 0,0:46:32.21,0:46:35.38,Default,,0000,0000,0000,,or whether I'm just optimizing the wrong objective function, Dialogue: 0,0:46:35.38,0:46:41.06,Default,,0000,0000,0000,,the diagnostic I'm gonna use and I'm gonna check if this equality hold through. Okay? Dialogue: 0,0:46:41.06,0:46:43.55,Default,,0000,0000,0000,,So let me explain this, Dialogue: 0,0:46:43.55,0:46:44.77,Default,,0000,0000,0000,,so in Case 1, Dialogue: 0,0:46:44.77,0:46:46.03,Default,,0000,0000,0000,,right, Dialogue: 0,0:46:46.03,0:46:48.32,Default,,0000,0000,0000,,it's just those two equations copied over. 
Dialogue: 0,0:46:48.32,0:46:50.49,Default,,0000,0000,0000,,In Case 1, let's say that Dialogue: 0,0:46:50.49,0:46:54.59,Default,,0000,0000,0000,,J of SVM is, indeed, greater than J of BLR - or J of Dialogue: 0,0:46:54.59,0:47:01.17,Default,,0000,0000,0000,,theta SVM is greater than J of theta BLR. But Dialogue: 0,0:47:01.17,0:47:04.44,Default,,0000,0000,0000,,we know that Bayesian logistic regression Dialogue: 0,0:47:04.44,0:47:07.52,Default,,0000,0000,0000,,was trying to maximize J of theta; Dialogue: 0,0:47:07.52,0:47:08.87,Default,,0000,0000,0000,,that's the definition of Dialogue: 0,0:47:08.87,0:47:12.36,Default,,0000,0000,0000,,Bayesian logistic regression. Dialogue: 0,0:47:12.36,0:47:16.76,Default,,0000,0000,0000,,So this means that Dialogue: 0,0:47:16.76,0:47:17.60,Default,,0000,0000,0000,,theta - Dialogue: 0,0:47:17.60,0:47:22.03,Default,,0000,0000,0000,,the value of theta output by Bayesian logistic regression actually fails to Dialogue: 0,0:47:22.03,0:47:24.21,Default,,0000,0000,0000,,maximize J Dialogue: 0,0:47:24.21,0:47:27.31,Default,,0000,0000,0000,,because the support vector machine actually returned a value of theta that, Dialogue: 0,0:47:27.31,0:47:28.72,Default,,0000,0000,0000,,you know, does a Dialogue: 0,0:47:28.72,0:47:31.35,Default,,0000,0000,0000,,better job of maximizing J. Dialogue: 0,0:47:31.35,0:47:36.51,Default,,0000,0000,0000,,And so, this tells me that Bayesian logistic regression didn't actually maximize J Dialogue: 0,0:47:36.51,0:47:39.32,Default,,0000,0000,0000,,correctly, and so the problem is with Dialogue: 0,0:47:39.32,0:47:41.10,Default,,0000,0000,0000,,the optimization algorithm. The Dialogue: 0,0:47:41.10,0:47:45.27,Default,,0000,0000,0000,,optimization algorithm hasn't converged. The other Dialogue: 0,0:47:45.27,0:47:46.10,Default,,0000,0000,0000,,case Dialogue: 0,0:47:46.10,0:47:49.89,Default,,0000,0000,0000,,is as follows, where Dialogue: 0,0:47:49.89,0:47:52.58,Default,,0000,0000,0000,,J of theta SVM is less-than/equal to J of theta Dialogue: 0,0:47:52.58,0:47:55.72,Default,,0000,0000,0000,,BLR. Okay? Dialogue: 0,0:47:55.72,0:47:58.39,Default,,0000,0000,0000,,In this case, what does Dialogue: 0,0:47:58.39,0:47:59.14,Default,,0000,0000,0000,,that mean? Dialogue: 0,0:47:59.14,0:48:01.85,Default,,0000,0000,0000,,This means that Bayesian logistic regression Dialogue: 0,0:48:01.85,0:48:04.60,Default,,0000,0000,0000,,actually attains the higher value Dialogue: 0,0:48:04.60,0:48:07.29,Default,,0000,0000,0000,,for the optimization objective J Dialogue: 0,0:48:07.29,0:48:10.93,Default,,0000,0000,0000,,than does the support vector machine. Dialogue: 0,0:48:10.93,0:48:13.16,Default,,0000,0000,0000,,The support vector machine, Dialogue: 0,0:48:13.16,0:48:14.97,Default,,0000,0000,0000,,which does worse Dialogue: 0,0:48:14.97,0:48:17.67,Default,,0000,0000,0000,,on your optimization problem, Dialogue: 0,0:48:17.67,0:48:19.20,Default,,0000,0000,0000,,actually does better Dialogue: 0,0:48:19.20,0:48:24.33,Default,,0000,0000,0000,,on the weighted accuracy measure. Dialogue: 0,0:48:24.33,0:48:27.100,Default,,0000,0000,0000,,So what this means is that something that does worse on your optimization Dialogue: 0,0:48:27.100,0:48:28.79,Default,,0000,0000,0000,,objective, Dialogue: 0,0:48:28.79,0:48:29.79,Default,,0000,0000,0000,,on J, Dialogue: 0,0:48:29.79,0:48:31.43,Default,,0000,0000,0000,,can actually do better Dialogue: 0,0:48:31.43,0:48:34.04,Default,,0000,0000,0000,,on the weighted accuracy objective.
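Putting the two cases together, a minimal sketch of this diagnostic (reusing the hypothetical helpers sketched above, and assuming theta_svm and theta_blr have already been fit):

    a_svm = weighted_accuracy(theta_svm, X, y, w)
    a_blr = weighted_accuracy(theta_blr, X, y, w)
    J_svm = bayesian_lr_objective(theta_svm, X, y, lam)
    J_blr = bayesian_lr_objective(theta_blr, X, y, lam)

    assert a_svm > a_blr  # the premise of the story: the SVM does better on what you care about

    if J_svm > J_blr:
        # Case 1: BLR was supposed to maximize J, yet the SVM found a theta with higher J,
        # so the optimization algorithm (e.g. gradient descent) hasn't converged.
        print("problem looks like the optimization algorithm")
    else:
        # Case 2: BLR maximizes J at least as well but still loses on weighted accuracy,
        # so maximizing J doesn't correspond to what you care about - revisit the objective
        # (e.g. the value of lambda, or the choice of objective altogether).
        print("problem looks like the optimization objective J(theta)")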
Dialogue: 0,0:48:34.04,0:48:37.11,Default,,0000,0000,0000,,And this really means that maximizing Dialogue: 0,0:48:37.11,0:48:38.37,Default,,0000,0000,0000,,J of theta, Dialogue: 0,0:48:38.37,0:48:42.06,Default,,0000,0000,0000,,you know, doesn't really correspond that well to maximizing your weighted accuracy criterion. Dialogue: 0,0:48:42.06,0:48:43.43,Default,,0000,0000,0000,, Dialogue: 0,0:48:43.43,0:48:47.36,Default,,0000,0000,0000,,And therefore, this tells you that J of theta is maybe the wrong optimization Dialogue: 0,0:48:47.36,0:48:49.65,Default,,0000,0000,0000,,objective to be maximizing. Right? Dialogue: 0,0:48:49.65,0:48:51.16,Default,,0000,0000,0000,,That just maximizing J of Dialogue: 0,0:48:51.16,0:48:53.15,Default,,0000,0000,0000,,theta just wasn't a good objective Dialogue: 0,0:48:53.15,0:49:00.15,Default,,0000,0000,0000,,to be choosing if you care about the weighted accuracy. Okay? Can you Dialogue: 0,0:49:02.67,0:49:03.46,Default,,0000,0000,0000,,raise your hand Dialogue: 0,0:49:03.46,0:49:09.99,Default,,0000,0000,0000,,if this made sense? Dialogue: 0,0:49:09.99,0:49:11.52,Default,,0000,0000,0000,,Cool, good. So Dialogue: 0,0:49:11.52,0:49:16.83,Default,,0000,0000,0000,,that tells us whether the problem is with the optimization algorithm Dialogue: 0,0:49:16.83,0:49:19.38,Default,,0000,0000,0000,,or whether it's with the objective function. Dialogue: 0,0:49:19.38,0:49:21.01,Default,,0000,0000,0000,,And so going back to this Dialogue: 0,0:49:21.01,0:49:23.15,Default,,0000,0000,0000,,slide, the eight fixes we had, Dialogue: 0,0:49:23.15,0:49:24.18,Default,,0000,0000,0000,,you notice that if you Dialogue: 0,0:49:24.18,0:49:27.17,Default,,0000,0000,0000,,run gradient descent for more iterations Dialogue: 0,0:49:27.17,0:49:31.02,Default,,0000,0000,0000,,that fixes the optimization algorithm. Trying Newton's method also Dialogue: 0,0:49:31.02,0:49:33.26,Default,,0000,0000,0000,,fixes the optimization algorithm, Dialogue: 0,0:49:33.26,0:49:37.29,Default,,0000,0000,0000,,whereas using a different value for lambda, in that lambda times norm of theta Dialogue: 0,0:49:37.29,0:49:39.47,Default,,0000,0000,0000,,squared, you know, in your objective, Dialogue: 0,0:49:39.47,0:49:42.36,Default,,0000,0000,0000,,fixes the optimization objective. And Dialogue: 0,0:49:42.36,0:49:47.63,Default,,0000,0000,0000,,changing to an SVM is also another way of trying to fix the optimization objective. Okay? Dialogue: 0,0:49:47.63,0:49:49.33,Default,,0000,0000,0000,,And so Dialogue: 0,0:49:49.33,0:49:52.31,Default,,0000,0000,0000,,once again, you actually see this quite often that - Dialogue: 0,0:49:52.31,0:49:55.08,Default,,0000,0000,0000,,actually, you see it very often, people will Dialogue: 0,0:49:55.08,0:49:58.48,Default,,0000,0000,0000,,have a problem with the optimization objective Dialogue: 0,0:49:58.48,0:50:00.99,Default,,0000,0000,0000,,and be working harder and harder Dialogue: 0,0:50:00.99,0:50:03.18,Default,,0000,0000,0000,,to fix the optimization algorithm. Dialogue: 0,0:50:03.18,0:50:06.08,Default,,0000,0000,0000,,That's another very common pattern: Dialogue: 0,0:50:06.08,0:50:10.19,Default,,0000,0000,0000,,the problem is in the formula for your J of theta, but often you see people, you know, Dialogue: 0,0:50:10.19,0:50:13.27,Default,,0000,0000,0000,,just running more and more iterations of gradient descent.
Like trying Newton's Dialogue: 0,0:50:13.27,0:50:16.01,Default,,0000,0000,0000,,method and trying conjugate and then trying Dialogue: 0,0:50:16.01,0:50:18.59,Default,,0000,0000,0000,,more and more crazy optimization algorithms, Dialogue: 0,0:50:18.59,0:50:20.89,Default,,0000,0000,0000,,whereas the problem was, you know, Dialogue: 0,0:50:20.89,0:50:24.46,Default,,0000,0000,0000,,optimizing J of theta wasn't going to fix the problem at all. Okay? Dialogue: 0,0:50:24.46,0:50:28.65,Default,,0000,0000,0000,,So there's another example of when these sorts of diagnostics will Dialogue: 0,0:50:28.65,0:50:31.91,Default,,0000,0000,0000,,help you figure out whether you should be fixing your optimization algorithm Dialogue: 0,0:50:31.91,0:50:33.26,Default,,0000,0000,0000,,or fixing the Dialogue: 0,0:50:33.26,0:50:38.85,Default,,0000,0000,0000,,optimization Dialogue: 0,0:50:38.85,0:50:45.34,Default,,0000,0000,0000,,objective. Okay? Let me think Dialogue: 0,0:50:45.34,0:50:47.60,Default,,0000,0000,0000,,how much time I have. Dialogue: 0,0:50:47.60,0:50:48.82,Default,,0000,0000,0000,,Hmm, let's Dialogue: 0,0:50:48.82,0:50:49.62,Default,,0000,0000,0000,,see. Well okay, we have time. Let's do this. Dialogue: 0,0:50:49.62,0:50:52.98,Default,,0000,0000,0000,,Show you one last example of a diagnostic. This is one that came up in, Dialogue: 0,0:50:52.98,0:50:56.100,Default,,0000,0000,0000,,you know, my students' and my work on flying helicopters. Dialogue: 0,0:50:56.100,0:50:57.84,Default,,0000,0000,0000,, Dialogue: 0,0:50:57.84,0:51:00.19,Default,,0000,0000,0000,,This one actually, Dialogue: 0,0:51:00.19,0:51:04.18,Default,,0000,0000,0000,,this example is the most complex of the three examples I'm gonna do Dialogue: 0,0:51:04.18,0:51:05.61,Default,,0000,0000,0000,,today. Dialogue: 0,0:51:05.61,0:51:08.56,Default,,0000,0000,0000,,I'm going to somewhat quickly, and Dialogue: 0,0:51:08.56,0:51:11.26,Default,,0000,0000,0000,,this actually draws on reinforcement learning which is something that I'm not Dialogue: 0,0:51:11.26,0:51:14.50,Default,,0000,0000,0000,,gonna talk about until towards - close to the end of the course here, but this just Dialogue: 0,0:51:14.50,0:51:16.76,Default,,0000,0000,0000,,a more Dialogue: 0,0:51:16.76,0:51:20.01,Default,,0000,0000,0000,,complicated example of a diagnostic we're gonna go over. Dialogue: 0,0:51:20.01,0:51:23.76,Default,,0000,0000,0000,,What I'll do is probably go over this fairly quickly, and then after we've talked about Dialogue: 0,0:51:23.76,0:51:26.84,Default,,0000,0000,0000,,reinforcement learning in the class, I'll probably actually come back and redo this exact Dialogue: 0,0:51:26.84,0:51:32.92,Default,,0000,0000,0000,,same example because you'll understand it more deeply. Okay? Dialogue: 0,0:51:32.92,0:51:37.10,Default,,0000,0000,0000,,So some of you know that my students and I fly autonomous helicopters, so how do you get a Dialogue: 0,0:51:37.10,0:51:41.56,Default,,0000,0000,0000,,machine-learning algorithm to design the controller for Dialogue: 0,0:51:41.56,0:51:44.20,Default,,0000,0000,0000,,helicopter? This is what we do. All right? Dialogue: 0,0:51:44.20,0:51:48.52,Default,,0000,0000,0000,,This first step was you build a simulator for a helicopter, so, you know, there's a screenshot of our Dialogue: 0,0:51:48.52,0:51:49.62,Default,,0000,0000,0000,,simulator. Dialogue: 0,0:51:49.62,0:51:53.50,Default,,0000,0000,0000,,This is just like a - it's like a joystick simulator; you can fly a helicopter in simulation. 
And then you Dialogue: 0,0:51:53.50,0:51:55.68,Default,,0000,0000,0000,, Dialogue: 0,0:51:55.68,0:51:57.19,Default,,0000,0000,0000,,choose a cost function, it's Dialogue: 0,0:51:57.19,0:52:00.85,Default,,0000,0000,0000,,actually called a [inaudible] function, but for this actually I'll call it cost function. Dialogue: 0,0:52:00.85,0:52:02.95,Default,,0000,0000,0000,,Say J of theta is, you know, Dialogue: 0,0:52:02.95,0:52:06.59,Default,,0000,0000,0000,,the expected squared error in your helicopter's Dialogue: 0,0:52:06.59,0:52:08.15,Default,,0000,0000,0000,,position. Okay? So this is J of theta is Dialogue: 0,0:52:08.15,0:52:08.51,Default,,0000,0000,0000,,maybe Dialogue: 0,0:52:08.51,0:52:12.36,Default,,0000,0000,0000,,it's expected square error or just the square error. Dialogue: 0,0:52:12.36,0:52:16.91,Default,,0000,0000,0000,,And then we run a reinforcement-learning algorithm, you'll learn about RL algorithms Dialogue: 0,0:52:16.91,0:52:18.60,Default,,0000,0000,0000,,in a few weeks. Dialogue: 0,0:52:18.60,0:52:22.50,Default,,0000,0000,0000,,You run reinforcement learning algorithm in your simulator Dialogue: 0,0:52:22.50,0:52:26.64,Default,,0000,0000,0000,,to try to minimize this cost function; try to minimize the squared error of Dialogue: 0,0:52:26.64,0:52:31.44,Default,,0000,0000,0000,,how well you're controlling your helicopter's position. Okay? Dialogue: 0,0:52:31.44,0:52:35.28,Default,,0000,0000,0000,,The reinforcement learning algorithm will output some parameters, which I'm denoting theta Dialogue: 0,0:52:35.28,0:52:37.21,Default,,0000,0000,0000,,subscript RL, Dialogue: 0,0:52:37.21,0:52:41.71,Default,,0000,0000,0000,,and then you'll use that to fly your helicopter. Dialogue: 0,0:52:41.71,0:52:44.96,Default,,0000,0000,0000,,So suppose you run this learning algorithm and Dialogue: 0,0:52:44.96,0:52:48.59,Default,,0000,0000,0000,,you get out a set of controller parameters, theta subscript RL, Dialogue: 0,0:52:48.59,0:52:52.30,Default,,0000,0000,0000,,that gives much worse performance than a human pilot. Then Dialogue: 0,0:52:52.30,0:52:54.73,Default,,0000,0000,0000,,what do you do next? And in particular, you Dialogue: 0,0:52:54.73,0:52:57.96,Default,,0000,0000,0000,,know, corresponding to the three steps above, there are three Dialogue: 0,0:52:57.96,0:53:00.59,Default,,0000,0000,0000,,natural things you can try. Right? You can Dialogue: 0,0:53:00.59,0:53:01.87,Default,,0000,0000,0000,,try to - oh, the bottom of Dialogue: 0,0:53:01.87,0:53:03.92,Default,,0000,0000,0000,,the slide got chopped off. Dialogue: 0,0:53:03.92,0:53:07.53,Default,,0000,0000,0000,,You can try to improve the simulator. And Dialogue: 0,0:53:07.53,0:53:10.33,Default,,0000,0000,0000,,maybe you think your simulator's isn't that accurate, you need to capture Dialogue: 0,0:53:10.33,0:53:12.34,Default,,0000,0000,0000,,the aerodynamic effects more Dialogue: 0,0:53:12.34,0:53:15.43,Default,,0000,0000,0000,,accurately. You need to capture the airflow and the turbulence affects around the helicopter Dialogue: 0,0:53:15.43,0:53:18.28,Default,,0000,0000,0000,,more accurately. Dialogue: 0,0:53:18.28,0:53:21.44,Default,,0000,0000,0000,,Maybe you need to modify the cost function. Maybe your square error isn't cutting it. Maybe Dialogue: 0,0:53:21.44,0:53:24.72,Default,,0000,0000,0000,,what a human pilot does isn't just optimizing square area but it's something more Dialogue: 0,0:53:24.72,0:53:25.99,Default,,0000,0000,0000,,subtle. 
Dialogue: 0,0:53:25.99,0:53:26.77,Default,,0000,0000,0000,,Or maybe Dialogue: 0,0:53:26.77,0:53:32.99,Default,,0000,0000,0000,,the reinforcement-learning algorithm isn't working; maybe it's not quite converging or something. Okay? So Dialogue: 0,0:53:32.99,0:53:36.80,Default,,0000,0000,0000,,these are the diagnostics that I actually used, and my students and I actually use to figure out what's Dialogue: 0,0:53:36.80,0:53:41.30,Default,,0000,0000,0000,,going on. Dialogue: 0,0:53:41.30,0:53:44.51,Default,,0000,0000,0000,,Actually, why don't you just think about this for a second and think what you'd do, and then Dialogue: 0,0:53:44.51,0:53:51.51,Default,,0000,0000,0000,,I'll go on and tell you what we do. All right, Dialogue: 0,0:54:46.23,0:54:47.87,Default,,0000,0000,0000,,so let me tell you what - Dialogue: 0,0:54:47.87,0:54:49.60,Default,,0000,0000,0000,,how we do this and see Dialogue: 0,0:54:49.60,0:54:52.77,Default,,0000,0000,0000,,whether it's the same as yours or not. And if you have a better idea than I do, let me Dialogue: 0,0:54:52.77,0:54:53.57,Default,,0000,0000,0000,,know and I'll let you try it Dialogue: 0,0:54:53.57,0:54:55.92,Default,,0000,0000,0000,,on my helicopter. Dialogue: 0,0:54:55.92,0:54:58.24,Default,,0000,0000,0000,,So Dialogue: 0,0:54:58.24,0:55:01.45,Default,,0000,0000,0000,,here's a reasoning that I wanted to experiment, right. So, Dialogue: 0,0:55:01.45,0:55:03.68,Default,,0000,0000,0000,,yeah, let's say the controller output Dialogue: 0,0:55:03.68,0:55:10.37,Default,,0000,0000,0000,,by our reinforcement-learning algorithm does poorly. Well Dialogue: 0,0:55:10.37,0:55:12.61,Default,,0000,0000,0000,,suppose the following three things hold true. Dialogue: 0,0:55:12.61,0:55:15.15,Default,,0000,0000,0000,,Suppose the contrary, I guess. Suppose that Dialogue: 0,0:55:15.15,0:55:19.65,Default,,0000,0000,0000,,the helicopter simulator is accurate, so let's assume we have an accurate model Dialogue: 0,0:55:19.65,0:55:22.45,Default,,0000,0000,0000,,of our helicopter. And Dialogue: 0,0:55:22.45,0:55:25.30,Default,,0000,0000,0000,,let's suppose that the reinforcement learning algorithm, Dialogue: 0,0:55:25.30,0:55:28.89,Default,,0000,0000,0000,,you know, correctly controls the helicopter in simulation, Dialogue: 0,0:55:28.89,0:55:31.82,Default,,0000,0000,0000,,so we tend to run a learning algorithm in simulation so that, you know, the Dialogue: 0,0:55:31.82,0:55:35.15,Default,,0000,0000,0000,,learning algorithm can crash a helicopter and it's fine. Right? Dialogue: 0,0:55:35.15,0:55:37.23,Default,,0000,0000,0000,,So let's assume our reinforcement-learning Dialogue: 0,0:55:37.23,0:55:40.11,Default,,0000,0000,0000,,algorithm correctly controls the helicopter so as to minimize the cost Dialogue: 0,0:55:40.11,0:55:42.10,Default,,0000,0000,0000,,function J of theta. Dialogue: 0,0:55:42.10,0:55:43.74,Default,,0000,0000,0000,,And let's suppose that Dialogue: 0,0:55:43.74,0:55:47.63,Default,,0000,0000,0000,,minimizing J of theta does indeed correspond to accurate or the correct autonomous Dialogue: 0,0:55:47.63,0:55:49.34,Default,,0000,0000,0000,,flight. Dialogue: 0,0:55:49.34,0:55:52.07,Default,,0000,0000,0000,,If all of these things held true, Dialogue: 0,0:55:52.07,0:55:53.91,Default,,0000,0000,0000,,then that means that Dialogue: 0,0:55:53.91,0:55:58.46,Default,,0000,0000,0000,,the parameters, theta RL, should actually fly well on my real Dialogue: 0,0:55:58.46,0:56:01.04,Default,,0000,0000,0000,,helicopter. Right? 
Dialogue: 0,0:56:01.04,0:56:03.28,Default,,0000,0000,0000,,And so the fact that the learning Dialogue: 0,0:56:03.28,0:56:05.34,Default,,0000,0000,0000,,control parameters, theta RL, Dialogue: 0,0:56:05.34,0:56:08.60,Default,,0000,0000,0000,,does not fly well on my helicopter, that sort Dialogue: 0,0:56:08.60,0:56:11.25,Default,,0000,0000,0000,,of means that ones of these three assumptions must be wrong Dialogue: 0,0:56:11.25,0:56:17.87,Default,,0000,0000,0000,,and I'd like to figure out which of these Dialogue: 0,0:56:17.87,0:56:19.67,Default,,0000,0000,0000,,three assumptions Dialogue: 0,0:56:19.67,0:56:22.09,Default,,0000,0000,0000,,is wrong. Okay? So these are the diagnostics we use. Dialogue: 0,0:56:22.09,0:56:25.45,Default,,0000,0000,0000,,First one is Dialogue: 0,0:56:25.45,0:56:31.72,Default,,0000,0000,0000,,we look at the controller and see if it even flies well in Dialogue: 0,0:56:31.72,0:56:35.09,Default,,0000,0000,0000,,simulation. Right? So the simulator of the helicopter that we did the learning on, Dialogue: 0,0:56:35.09,0:56:38.70,Default,,0000,0000,0000,,and so if the learning algorithm flies well in the simulator but Dialogue: 0,0:56:38.70,0:56:42.03,Default,,0000,0000,0000,,it doesn't fly well on my real helicopter, Dialogue: 0,0:56:42.03,0:56:46.11,Default,,0000,0000,0000,,then that tells me the problem is probably in the simulator. Right? Dialogue: 0,0:56:46.11,0:56:48.05,Default,,0000,0000,0000,,My simulator predicts Dialogue: 0,0:56:48.05,0:56:51.91,Default,,0000,0000,0000,,the helicopter's controller will fly well but it doesn't actually fly well in real life, so Dialogue: 0,0:56:51.91,0:56:53.58,Default,,0000,0000,0000,,could be the problem's in the simulator Dialogue: 0,0:56:53.58,0:56:59.24,Default,,0000,0000,0000,,and we should spend out efforts improving the accuracy of our simulator. Dialogue: 0,0:56:59.24,0:57:03.17,Default,,0000,0000,0000,,Otherwise, let me write theta subscript human, be the human Dialogue: 0,0:57:03.17,0:57:07.05,Default,,0000,0000,0000,,control policy. All right? So Dialogue: 0,0:57:07.05,0:57:11.64,Default,,0000,0000,0000,,let's go ahead and ask a human to fly the helicopter, it could be in the simulator, it Dialogue: 0,0:57:11.64,0:57:13.48,Default,,0000,0000,0000,,could be in real life, Dialogue: 0,0:57:13.48,0:57:16.77,Default,,0000,0000,0000,,and let's measure, you know, the means squared error Dialogue: 0,0:57:16.77,0:57:20.21,Default,,0000,0000,0000,,of the human pilot's flight. And Dialogue: 0,0:57:20.21,0:57:24.24,Default,,0000,0000,0000,,let's see if the human pilot does better or worse Dialogue: 0,0:57:24.24,0:57:26.09,Default,,0000,0000,0000,,than the learned controller, Dialogue: 0,0:57:26.09,0:57:28.25,Default,,0000,0000,0000,,in terms of optimizing this Dialogue: 0,0:57:28.25,0:57:31.97,Default,,0000,0000,0000,,objective function J of theta. Okay? Dialogue: 0,0:57:31.97,0:57:33.93,Default,,0000,0000,0000,,So if the human does Dialogue: 0,0:57:33.93,0:57:36.89,Default,,0000,0000,0000,,worse, if even a very good human pilot Dialogue: 0,0:57:36.89,0:57:41.44,Default,,0000,0000,0000,,attains a worse value on my optimization objective, on my cost Dialogue: 0,0:57:41.44,0:57:42.41,Default,,0000,0000,0000,,function, Dialogue: 0,0:57:42.41,0:57:48.62,Default,,0000,0000,0000,,than my learning algorithm, Dialogue: 0,0:57:48.62,0:57:51.80,Default,,0000,0000,0000,,then the problem is in the reinforcement-learning algorithm. 
Dialogue: 0,0:57:51.80,0:57:56.09,Default,,0000,0000,0000,,Because my reinforcement-learning algorithm was trying to minimize J of Dialogue: 0,0:57:56.09,0:58:00.14,Default,,0000,0000,0000,,theta, but a human actually attains a lower value for J of theta than does my Dialogue: 0,0:58:00.14,0:58:01.78,Default,,0000,0000,0000,,algorithm. Dialogue: 0,0:58:01.78,0:58:05.49,Default,,0000,0000,0000,,And so that tells me that clearly my algorithm's not Dialogue: 0,0:58:05.49,0:58:07.82,Default,,0000,0000,0000,,managing to minimize J of theta Dialogue: 0,0:58:07.82,0:58:12.88,Default,,0000,0000,0000,,and that tells me the problem's in the reinforcement learning algorithm. Dialogue: 0,0:58:12.88,0:58:17.65,Default,,0000,0000,0000,,And finally, if J of theta - if the human actually attains a larger value Dialogue: 0,0:58:17.65,0:58:19.55,Default,,0000,0000,0000,,for theta - excuse me, Dialogue: 0,0:58:19.55,0:58:24.40,Default,,0000,0000,0000,,if the human actually attains a larger value for J of theta, the human actually Dialogue: 0,0:58:24.40,0:58:27.86,Default,,0000,0000,0000,,has, you know, larger mean squared error for the helicopter position than Dialogue: 0,0:58:27.86,0:58:30.60,Default,,0000,0000,0000,,does my reinforcement learning algorithms, that's Dialogue: 0,0:58:30.60,0:58:34.00,Default,,0000,0000,0000,,I like - but I like the way the human flies much better than my reinforcement learning Dialogue: 0,0:58:34.00,0:58:35.32,Default,,0000,0000,0000,,algorithm. So Dialogue: 0,0:58:35.32,0:58:37.23,Default,,0000,0000,0000,,if that holds true, Dialogue: 0,0:58:37.23,0:58:39.78,Default,,0000,0000,0000,,then clearly the problem's in the cost function, right, Dialogue: 0,0:58:39.78,0:58:42.88,Default,,0000,0000,0000,,because the human does worse on my cost function Dialogue: 0,0:58:42.88,0:58:46.07,Default,,0000,0000,0000,,but flies much better than my learning algorithm. Dialogue: 0,0:58:46.07,0:58:48.36,Default,,0000,0000,0000,,And so that means the problem's in the cost function. It Dialogue: 0,0:58:48.36,0:58:50.09,Default,,0000,0000,0000,,means - oh Dialogue: 0,0:58:50.09,0:58:50.54,Default,,0000,0000,0000,,excuse me, I Dialogue: 0,0:58:50.54,0:58:53.68,Default,,0000,0000,0000,,meant minimizing it, not maximizing it, there's a typo on the slide, Dialogue: 0,0:58:53.68,0:58:55.38,Default,,0000,0000,0000,,because that means that minimizing Dialogue: 0,0:58:55.38,0:58:57.09,Default,,0000,0000,0000,,the cost function Dialogue: 0,0:58:57.09,0:59:00.22,Default,,0000,0000,0000,,- my learning algorithm does a better job minimizing the cost function but doesn't Dialogue: 0,0:59:00.22,0:59:03.44,Default,,0000,0000,0000,,fly as well as a human pilot. So that tells you that Dialogue: 0,0:59:03.44,0:59:04.72,Default,,0000,0000,0000,,minimizing the cost function Dialogue: 0,0:59:04.72,0:59:06.88,Default,,0000,0000,0000,,doesn't correspond to good autonomous flight. And what Dialogue: 0,0:59:06.88,0:59:11.86,Default,,0000,0000,0000,,you should do it go back and see if you can change J of Dialogue: 0,0:59:11.86,0:59:13.10,Default,,0000,0000,0000,,theta. Okay? 
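A hedged sketch of that three-way diagnostic; flies_well, J_sim, theta_rl, and theta_human are hypothetical stand-ins for measurements you would actually make on the simulator, the real helicopter, and the two controllers:

    def rl_diagnostic(theta_rl, theta_human, flies_well, J_sim):
        # flies_well(theta, platform) -> bool; J_sim(theta) -> cost measured in simulation,
        # e.g. mean squared error of the helicopter's position.
        if flies_well(theta_rl, "simulator") and not flies_well(theta_rl, "real helicopter"):
            # The controller looks fine in simulation but not in real life.
            return "problem is the simulator - improve its accuracy"
        if J_sim(theta_human) < J_sim(theta_rl):
            # A human attains lower cost than the algorithm that was supposed to minimize it.
            return "problem is the reinforcement learning algorithm"
        # The algorithm minimizes J better than the human, yet the human flies better,
        # so minimizing J doesn't correspond to good autonomous flight.
        return "problem is the cost function J(theta)"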
Dialogue: 0,0:59:13.10,0:59:18.38,Default,,0000,0000,0000,,And so for those reinforcement learning problems, you know, if something doesn't work - often reinforcement Dialogue: 0,0:59:18.38,0:59:21.73,Default,,0000,0000,0000,,learning algorithms just work but when they don't work, Dialogue: 0,0:59:21.73,0:59:26.20,Default,,0000,0000,0000,,these are the sorts of diagnostics you use to figure out should we be focusing on the simulator, Dialogue: 0,0:59:26.20,0:59:30.33,Default,,0000,0000,0000,,on changing the cost function, or on changing the reinforcement learning Dialogue: 0,0:59:30.33,0:59:32.09,Default,,0000,0000,0000,,algorithm. And Dialogue: 0,0:59:32.09,0:59:37.04,Default,,0000,0000,0000,,again, if you don't know which of your three problems it is, it's entirely possible, Dialogue: 0,0:59:37.04,0:59:40.28,Default,,0000,0000,0000,,you know, to spend two years, whatever, changing, building a better simulator Dialogue: 0,0:59:40.28,0:59:42.60,Default,,0000,0000,0000,,for your helicopter. Dialogue: 0,0:59:42.60,0:59:43.95,Default,,0000,0000,0000,,But it turns out that Dialogue: 0,0:59:43.95,0:59:47.69,Default,,0000,0000,0000,,modeling helicopter aerodynamics is an active area of research. There are people, you know, writing Dialogue: 0,0:59:47.69,0:59:49.79,Default,,0000,0000,0000,,entire PhD theses on this still. Dialogue: 0,0:59:49.79,0:59:53.56,Default,,0000,0000,0000,,So it's entirely possible to go out and spend six years and write a PhD thesis and build Dialogue: 0,0:59:53.56,0:59:55.50,Default,,0000,0000,0000,,a much better helicopter simulator, but if you're fixing Dialogue: 0,0:59:55.50,1:00:02.50,Default,,0000,0000,0000,,the wrong problem it's not gonna help. Dialogue: 0,1:00:03.21,1:00:05.53,Default,,0000,0000,0000,,So Dialogue: 0,1:00:05.53,1:00:08.92,Default,,0000,0000,0000,,quite often, you need to come up with your own diagnostics to figure out what's happening in an Dialogue: 0,1:00:08.92,1:00:11.64,Default,,0000,0000,0000,,algorithm when something is going wrong. Dialogue: 0,1:00:11.64,1:00:15.68,Default,,0000,0000,0000,,And unfortunately I don't know of - what I've described Dialogue: 0,1:00:15.68,1:00:17.15,Default,,0000,0000,0000,,are sort of maybe Dialogue: 0,1:00:17.15,1:00:20.51,Default,,0000,0000,0000,,some of the most common diagnostics that I've used, that I've seen, Dialogue: 0,1:00:20.51,1:00:23.71,Default,,0000,0000,0000,,you know, to be useful for many problems. But very often, you need to come up Dialogue: 0,1:00:23.71,1:00:28.19,Default,,0000,0000,0000,,with your own for your own specific learning problem. Dialogue: 0,1:00:28.19,1:00:31.73,Default,,0000,0000,0000,,And I just want to point out that even when the learning algorithm is working well, it's Dialogue: 0,1:00:31.73,1:00:35.16,Default,,0000,0000,0000,,often a good idea to run diagnostics, like the ones I talked Dialogue: 0,1:00:35.16,1:00:36.07,Default,,0000,0000,0000,,about, Dialogue: 0,1:00:36.07,1:00:38.31,Default,,0000,0000,0000,,to make sure you really understand what's going on. Dialogue: 0,1:00:38.31,1:00:41.60,Default,,0000,0000,0000,,All right? And this is useful for a couple of reasons. One is that Dialogue: 0,1:00:41.60,1:00:45.61,Default,,0000,0000,0000,,diagnostics like these will often help you to understand your application Dialogue: 0,1:00:45.61,1:00:47.90,Default,,0000,0000,0000,,problem better. 
Dialogue: 0,1:00:47.90,1:00:52.16,Default,,0000,0000,0000,,So some of you will, you know, graduate from Stanford and go on to get some amazingly high-paying Dialogue: 0,1:00:52.16,1:00:56.35,Default,,0000,0000,0000,,job to apply machine-learning algorithms to some application problem of, you Dialogue: 0,1:00:56.35,1:00:59.30,Default,,0000,0000,0000,,know, significant economic interest. Right? Dialogue: 0,1:00:59.30,1:01:02.93,Default,,0000,0000,0000,,And you're gonna be working on one specific Dialogue: 0,1:01:02.93,1:01:08.06,Default,,0000,0000,0000,,important machine learning application for many months, or even for years. Dialogue: 0,1:01:08.06,1:01:10.99,Default,,0000,0000,0000,,One of the most valuable things for you personally will be for you to Dialogue: 0,1:01:10.99,1:01:13.25,Default,,0000,0000,0000,,get in - for you personally Dialogue: 0,1:01:13.25,1:01:16.91,Default,,0000,0000,0000,,to get in an intuitive understanding of what works and what doesn't work your Dialogue: 0,1:01:16.91,1:01:17.37,Default,,0000,0000,0000,,problem. Dialogue: 0,1:01:17.37,1:01:21.24,Default,,0000,0000,0000,,Sort of right now in the industry, in Silicon Valley or around the world, Dialogue: 0,1:01:21.24,1:01:24.83,Default,,0000,0000,0000,,there are many companies with important machine learning problems and there are often people Dialogue: 0,1:01:24.83,1:01:26.95,Default,,0000,0000,0000,,working on the same machine learning problem, you Dialogue: 0,1:01:26.95,1:01:31.21,Default,,0000,0000,0000,,know, for many months or for years on end. And Dialogue: 0,1:01:31.21,1:01:34.66,Default,,0000,0000,0000,,when you're doing that, I mean solving a really important problem using learning algorithms, one of Dialogue: 0,1:01:34.66,1:01:38.72,Default,,0000,0000,0000,,the most valuable things is just your own personal intuitive understanding of the Dialogue: 0,1:01:38.72,1:01:40.100,Default,,0000,0000,0000,,problem. Dialogue: 0,1:01:40.100,1:01:42.17,Default,,0000,0000,0000,,Okay? Dialogue: 0,1:01:42.17,1:01:43.41,Default,,0000,0000,0000,,And diagnostics, like Dialogue: 0,1:01:43.41,1:01:48.15,Default,,0000,0000,0000,,the sort I talked about, will be one way for you to get a better and better understanding of Dialogue: 0,1:01:48.15,1:01:50.28,Default,,0000,0000,0000,,these problems. It Dialogue: 0,1:01:50.28,1:01:54.09,Default,,0000,0000,0000,,turns out, by the way, there are some of Silicon Valley companies that outsource their Dialogue: 0,1:01:54.09,1:01:56.68,Default,,0000,0000,0000,,machine learning. So there's sometimes, you know, whatever. Dialogue: 0,1:01:56.68,1:01:59.53,Default,,0000,0000,0000,,They're a company in Silicon Valley and they'll, you know, Dialogue: 0,1:01:59.53,1:02:03.23,Default,,0000,0000,0000,,hire a firm in New York to run all their learning algorithms for them. 
Dialogue: 0,1:02:03.23,1:02:06.89,Default,,0000,0000,0000,,And I'm not a businessman, but I personally think that's Dialogue: 0,1:02:06.89,1:02:09.31,Default,,0000,0000,0000,,often a terrible idea because Dialogue: 0,1:02:09.31,1:02:13.64,Default,,0000,0000,0000,,if your expertise, if your understanding of your data is given, Dialogue: 0,1:02:13.64,1:02:15.71,Default,,0000,0000,0000,,you know, to an outsource agency, Dialogue: 0,1:02:15.71,1:02:19.59,Default,,0000,0000,0000,,then if you don't maintain that expertise, if there's a problem you really care about Dialogue: 0,1:02:19.59,1:02:22.30,Default,,0000,0000,0000,,then it'll be your own, you know, Dialogue: 0,1:02:22.30,1:02:26.01,Default,,0000,0000,0000,,understanding of the problem that you build up over months that'll be really valuable. Dialogue: 0,1:02:26.01,1:02:28.65,Default,,0000,0000,0000,,And if that knowledge is outsourced, you don't get to keep that knowledge Dialogue: 0,1:02:28.65,1:02:29.49,Default,,0000,0000,0000,,yourself. Dialogue: 0,1:02:29.49,1:02:31.70,Default,,0000,0000,0000,,I personally think that's a terrible idea. Dialogue: 0,1:02:31.70,1:02:35.81,Default,,0000,0000,0000,,But I'm not a businessman, but I just see people do that a lot, Dialogue: 0,1:02:35.81,1:02:39.11,Default,,0000,0000,0000,,and just. Let's see. Dialogue: 0,1:02:39.11,1:02:42.95,Default,,0000,0000,0000,,Another reason for running diagnostics like these is actually in writing research Dialogue: 0,1:02:42.95,1:02:43.61,Default,,0000,0000,0000,,papers, Dialogue: 0,1:02:43.61,1:02:46.15,Default,,0000,0000,0000,,right? So Dialogue: 0,1:02:46.15,1:02:49.33,Default,,0000,0000,0000,,diagnostics and error analyses, which I'll talk about in a minute, Dialogue: 0,1:02:49.33,1:02:53.02,Default,,0000,0000,0000,,often help to convey insight about the problem and help justify your research Dialogue: 0,1:02:53.02,1:02:54.11,Default,,0000,0000,0000,,claims. Dialogue: 0,1:02:54.11,1:02:56.56,Default,,0000,0000,0000,, Dialogue: 0,1:02:56.56,1:02:57.78,Default,,0000,0000,0000,,So for example, Dialogue: 0,1:02:57.78,1:03:00.79,Default,,0000,0000,0000,,rather than writing a research paper, say, that's says, you know, "Oh well here's Dialogue: 0,1:03:00.79,1:03:04.04,Default,,0000,0000,0000,,an algorithm that works. I built this helicopter and it flies," or whatever, Dialogue: 0,1:03:04.04,1:03:05.65,Default,,0000,0000,0000,,it's often much more interesting to say, Dialogue: 0,1:03:05.65,1:03:09.61,Default,,0000,0000,0000,,"Here's an algorithm that works, and it works because of a specific Dialogue: 0,1:03:09.61,1:03:13.92,Default,,0000,0000,0000,,component X. And moreover, here's the diagnostic that gives you justification that shows X was Dialogue: 0,1:03:13.92,1:03:19.16,Default,,0000,0000,0000,,the thing that fixed this problem," and that's where you made it work. Okay? So Dialogue: 0,1:03:19.16,1:03:21.39,Default,,0000,0000,0000,,that leads me Dialogue: 0,1:03:21.39,1:03:25.93,Default,,0000,0000,0000,,into a discussion on error analysis, which is often good machine learning practice, Dialogue: 0,1:03:25.93,1:03:26.44,Default,,0000,0000,0000,, Dialogue: 0,1:03:26.44,1:03:32.10,Default,,0000,0000,0000,,is a way for understanding what your sources of errors are. So what I Dialogue: 0,1:03:32.10,1:03:34.69,Default,,0000,0000,0000,,call error analyses - and let's check Dialogue: 0,1:03:34.69,1:03:41.69,Default,,0000,0000,0000,,questions about this. Dialogue: 0,1:03:41.77,1:03:45.79,Default,,0000,0000,0000,,Yeah? 
Dialogue: 0,1:03:45.79,1:03:49.81,Default,,0000,0000,0000,,Student:What ended up being wrong with the helicopter? Instructor (Andrew Ng):Oh, I don't know. Let's see. We've flown so many times. Dialogue: 0,1:03:49.81,1:03:53.50,Default,,0000,0000,0000,,The thing that is most difficult with a helicopter is actually building a Dialogue: 0,1:03:53.50,1:03:55.11,Default,,0000,0000,0000,,very - I don't know. It Dialogue: 0,1:03:55.11,1:03:58.49,Default,,0000,0000,0000,,changes all the time. Quite often, it's actually the simulator. Building an accurate simulator of a helicopter Dialogue: 0,1:03:58.49,1:04:02.86,Default,,0000,0000,0000,,is very hard. Yeah. Okay. So Dialogue: 0,1:04:02.86,1:04:03.93,Default,,0000,0000,0000,,for error Dialogue: 0,1:04:03.93,1:04:06.27,Default,,0000,0000,0000,,analyses, Dialogue: 0,1:04:06.27,1:04:10.81,Default,,0000,0000,0000,,this is a way for figuring out what is working in your algorithm and what isn't working. Dialogue: 0,1:04:10.81,1:04:17.71,Default,,0000,0000,0000,,And we're gonna talk about two specific examples. So there are Dialogue: 0,1:04:17.71,1:04:21.53,Default,,0000,0000,0000,,many learning - there are many sort of AI systems, many machine learning systems, that Dialogue: 0,1:04:21.53,1:04:22.47,Default,,0000,0000,0000,,combine Dialogue: 0,1:04:22.47,1:04:24.89,Default,,0000,0000,0000,,many different components into a pipeline. So Dialogue: 0,1:04:24.89,1:04:27.47,Default,,0000,0000,0000,,here's sort of a contrived example for this, Dialogue: 0,1:04:27.47,1:04:31.02,Default,,0000,0000,0000,,not dissimilar in many ways from the actual machine learning systems you see. Dialogue: 0,1:04:31.02,1:04:32.39,Default,,0000,0000,0000,,So let's say you want to Dialogue: 0,1:04:32.39,1:04:37.75,Default,,0000,0000,0000,,recognize people from images. This is a picture of one of my friends. Dialogue: 0,1:04:37.75,1:04:41.90,Default,,0000,0000,0000,,So you take this input camera image, say, and you often run it through a long pipeline. So Dialogue: 0,1:04:41.90,1:04:43.07,Default,,0000,0000,0000,,for example, Dialogue: 0,1:04:43.07,1:04:47.86,Default,,0000,0000,0000,,the first thing you may do is preprocess the image and remove the background, so you remove the Dialogue: 0,1:04:47.86,1:04:49.19,Default,,0000,0000,0000,,background. Dialogue: 0,1:04:49.19,1:04:51.91,Default,,0000,0000,0000,,And then you run a Dialogue: 0,1:04:51.91,1:04:55.21,Default,,0000,0000,0000,,face detection algorithm, so a machine learning algorithm to detect people's faces. Dialogue: 0,1:04:55.21,1:04:56.11,Default,,0000,0000,0000,,Right? Dialogue: 0,1:04:56.11,1:04:59.76,Default,,0000,0000,0000,,And then, you know, let's say you want to recognize the identity of the person, right, this is your Dialogue: 0,1:04:59.76,1:05:01.72,Default,,0000,0000,0000,,application. Dialogue: 0,1:05:01.72,1:05:04.44,Default,,0000,0000,0000,,You then segment out the eyes, segment out the nose, Dialogue: 0,1:05:04.44,1:05:08.33,Default,,0000,0000,0000,,and have different learning algorithms to detect the mouth and so on. Dialogue: 0,1:05:08.33,1:05:10.03,Default,,0000,0000,0000,,I know; she might not want to be my friend Dialogue: 0,1:05:10.03,1:05:13.25,Default,,0000,0000,0000,,after she sees this.
Dialogue: 0,1:05:13.25,1:05:16.77,Default,,0000,0000,0000,,And then having found all these features, based on, you know, what the nose looks like, what the eyes Dialogue: 0,1:05:16.77,1:05:18.61,Default,,0000,0000,0000,,look like, whatever, then you Dialogue: 0,1:05:18.61,1:05:22.80,Default,,0000,0000,0000,,feed all the features into a logistic regression algorithm. And your logistic Dialogue: 0,1:05:22.80,1:05:24.77,Default,,0000,0000,0000,,regression or softmax regression, or whatever, Dialogue: 0,1:05:24.77,1:05:30.38,Default,,0000,0000,0000,,will tell you the identity of this person. Okay? Dialogue: 0,1:05:30.38,1:05:32.46,Default,,0000,0000,0000,,So Dialogue: 0,1:05:32.46,1:05:35.06,Default,,0000,0000,0000,,this is what error analysis is. Dialogue: 0,1:05:35.06,1:05:40.33,Default,,0000,0000,0000,,You have a long complicated pipeline combining many machine learning Dialogue: 0,1:05:40.33,1:05:43.92,Default,,0000,0000,0000,,components. Many of these components would themselves be learning algorithms. Dialogue: 0,1:05:43.92,1:05:45.69,Default,,0000,0000,0000,,And so, Dialogue: 0,1:05:45.69,1:05:50.42,Default,,0000,0000,0000,,it's often very useful to figure out how much of your error can be attributed to each of Dialogue: 0,1:05:50.42,1:05:55.18,Default,,0000,0000,0000,,these components. Dialogue: 0,1:05:55.18,1:05:56.18,Default,,0000,0000,0000,,So Dialogue: 0,1:05:56.18,1:05:59.59,Default,,0000,0000,0000,,what we'll do in a typical error analysis procedure Dialogue: 0,1:05:59.59,1:06:03.71,Default,,0000,0000,0000,,is we'll repeatedly plug in the ground-truth for each component and see how the Dialogue: 0,1:06:03.71,1:06:05.13,Default,,0000,0000,0000,,accuracy changes. Dialogue: 0,1:06:05.13,1:06:07.59,Default,,0000,0000,0000,,So what I mean by that is the Dialogue: 0,1:06:07.59,1:06:11.39,Default,,0000,0000,0000,,figure on the bottom right - let's say the overall accuracy of the system is Dialogue: 0,1:06:11.39,1:06:12.74,Default,,0000,0000,0000,,85 percent. Right? Dialogue: 0,1:06:12.74,1:06:14.69,Default,,0000,0000,0000,,Then I want to know Dialogue: 0,1:06:14.69,1:06:17.40,Default,,0000,0000,0000,,where my 15 percent of error comes from. Dialogue: 0,1:06:17.40,1:06:19.16,Default,,0000,0000,0000,,And so what I'll do is I'll go Dialogue: 0,1:06:19.16,1:06:21.33,Default,,0000,0000,0000,,to my test set Dialogue: 0,1:06:21.33,1:06:26.63,Default,,0000,0000,0000,,and, instead of running my own background removal, I'll plug in the correct Dialogue: 0,1:06:26.63,1:06:29.75,Default,,0000,0000,0000,,background removal. So I'll actually go in and give it - Dialogue: 0,1:06:29.75,1:06:33.44,Default,,0000,0000,0000,,give my algorithm - what is the correct background versus foreground. Dialogue: 0,1:06:33.44,1:06:36.84,Default,,0000,0000,0000,,And if I do that, let's color that blue to denote that I'm Dialogue: 0,1:06:36.84,1:06:39.53,Default,,0000,0000,0000,,giving that ground-truth data in the test set, Dialogue: 0,1:06:39.53,1:06:43.84,Default,,0000,0000,0000,,let's assume our accuracy increases to 85.1 percent. Okay? Dialogue: 0,1:06:43.84,1:06:47.76,Default,,0000,0000,0000,,And now I'll go in and, you know, give my algorithm the ground-truth Dialogue: 0,1:06:47.76,1:06:48.93,Default,,0000,0000,0000,,face detection Dialogue: 0,1:06:48.93,1:06:53.02,Default,,0000,0000,0000,,output. So I'll go in and actually on my test set I'll just tell the algorithm where the Dialogue: 0,1:06:53.02,1:06:55.13,Default,,0000,0000,0000,,face is.
And if I do that, Dialogue: 0,1:06:55.13,1:06:59.05,Default,,0000,0000,0000,,let's say my algorithm's accuracy increases to 91 percent, Dialogue: 0,1:06:59.05,1:07:02.52,Default,,0000,0000,0000,,and so on. And then I'll go for each of these components Dialogue: 0,1:07:02.52,1:07:05.02,Default,,0000,0000,0000,,and just give it Dialogue: 0,1:07:05.02,1:07:08.66,Default,,0000,0000,0000,,the ground-truth label for each of the components, Dialogue: 0,1:07:08.66,1:07:11.64,Default,,0000,0000,0000,,because say, like, the nose segmentation algorithm's trying to figure out Dialogue: 0,1:07:11.64,1:07:13.22,Default,,0000,0000,0000,,where the nose is. I just go in Dialogue: 0,1:07:13.22,1:07:16.59,Default,,0000,0000,0000,,and tell it where the nose is so that it doesn't have to figure that out. Dialogue: 0,1:07:16.59,1:07:20.56,Default,,0000,0000,0000,,And as I do this, one component after the other, you know, I end up giving it the correct output Dialogue: 0,1:07:20.56,1:07:23.65,Default,,0000,0000,0000,,label and end up with 100 percent accuracy. Dialogue: 0,1:07:23.65,1:07:27.00,Default,,0000,0000,0000,,And now you can look at this table - I'm sorry this is cut off on the bottom, Dialogue: 0,1:07:27.00,1:07:29.12,Default,,0000,0000,0000,,it says logistic regression 100 percent. Now you can Dialogue: 0,1:07:29.12,1:07:30.72,Default,,0000,0000,0000,,look at this Dialogue: 0,1:07:30.72,1:07:31.67,Default,,0000,0000,0000,,table and Dialogue: 0,1:07:31.67,1:07:33.01,Default,,0000,0000,0000,,see, Dialogue: 0,1:07:33.01,1:07:36.08,Default,,0000,0000,0000,,you know, how much giving the ground-truth labels for each of these Dialogue: 0,1:07:36.08,1:07:39.03,Default,,0000,0000,0000,,components could help boost your final performance. Dialogue: 0,1:07:39.03,1:07:42.42,Default,,0000,0000,0000,,In particular, if you look at this table, you notice that Dialogue: 0,1:07:42.42,1:07:45.27,Default,,0000,0000,0000,,when I added the face detection ground-truth, Dialogue: 0,1:07:45.27,1:07:48.28,Default,,0000,0000,0000,,my performance jumped from 85.1 percent accuracy Dialogue: 0,1:07:48.28,1:07:50.62,Default,,0000,0000,0000,,to 91 percent accuracy. Right? Dialogue: 0,1:07:50.62,1:07:54.53,Default,,0000,0000,0000,,So this tells me that if only I can get better face detection, Dialogue: 0,1:07:54.53,1:07:58.03,Default,,0000,0000,0000,,maybe I can boost my accuracy by 6 percent. Dialogue: 0,1:07:58.03,1:08:00.50,Default,,0000,0000,0000,,Whereas in contrast, when I, Dialogue: 0,1:08:00.50,1:08:04.35,Default,,0000,0000,0000,,you know, say, plugged in better Dialogue: 0,1:08:04.35,1:08:07.06,Default,,0000,0000,0000,,background removal, my accuracy improved from 85 Dialogue: 0,1:08:07.06,1:08:08.67,Default,,0000,0000,0000,,to 85.1 percent. Dialogue: 0,1:08:08.67,1:08:11.52,Default,,0000,0000,0000,,And so, this sort of diagnostic also tells you that if your goal Dialogue: 0,1:08:11.52,1:08:13.87,Default,,0000,0000,0000,,is to improve the system, it's probably a waste of Dialogue: 0,1:08:13.87,1:08:17.68,Default,,0000,0000,0000,,your time to try to improve your background subtraction. Because Dialogue: 0,1:08:17.68,1:08:19.22,Default,,0000,0000,0000,,even if you got the ground-truth, Dialogue: 0,1:08:19.22,1:08:22.06,Default,,0000,0000,0000,,this gives you, at most, a 0.1 percent accuracy gain, Dialogue: 0,1:08:22.06,1:08:24.60,Default,,0000,0000,0000,,whereas if you do better face detection, maybe there's a much Dialogue: 0,1:08:24.60,1:08:26.40,Default,,0000,0000,0000,,larger potential for gains there. Okay?
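Sketched in code, that plug-in-the-ground-truth procedure might look like the following. This is a sketch only: evaluate_pipeline is a hypothetical stand-in for evaluation code you would write for your own system, and the intermediate accuracies beyond the 85 / 85.1 / 91 / 100 percent figures quoted above are invented.

def error_analysis(evaluate_pipeline, components):
    # evaluate_pipeline(given) runs the pipeline on the test set, substituting
    # ground-truth output for every component in the list 'given', and returns accuracy.
    rows = [("overall system", evaluate_pipeline(None))]
    for i, comp in enumerate(components):
        rows.append(("+ ground-truth " + comp, evaluate_pipeline(components[: i + 1])))
    prev = rows[0][1]
    for name, acc in rows:
        print(f"{name:40s} {acc:6.1%}   gain {acc - prev:+.1%}")
        prev = acc

pipeline = ["background removal", "face detection", "eye segmentation",
            "nose segmentation", "mouth segmentation", "logistic regression"]

# Toy stand-in evaluator that just returns made-up accuracies so the sketch runs:
fake_accuracy = {0: 0.850, 1: 0.851, 2: 0.910, 3: 0.940, 4: 0.960, 5: 0.980, 6: 1.000}
error_analysis(lambda given: fake_accuracy[0 if given is None else len(given)], pipeline)

The component whose ground truth buys the biggest jump (face detection here, 85.1 to 91 percent) is the one worth spending your time on.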
Dialogue: 0,1:08:26.40,1:08:28.67,Default,,0000,0000,0000,,So this sort of diagnostic, Dialogue: 0,1:08:28.67,1:08:29.90,Default,,0000,0000,0000,,again, Dialogue: 0,1:08:29.90,1:08:33.15,Default,,0000,0000,0000,,is very useful because if your goal is to improve the system, Dialogue: 0,1:08:33.15,1:08:35.100,Default,,0000,0000,0000,,there are so many different pieces you can easily choose to spend the next three Dialogue: 0,1:08:35.100,1:08:36.65,Default,,0000,0000,0000,,months on. Right? Dialogue: 0,1:08:36.65,1:08:39.26,Default,,0000,0000,0000,,And choosing the right piece Dialogue: 0,1:08:39.26,1:08:42.80,Default,,0000,0000,0000,,is critical, and this sort of diagnostic tells you what's the piece that may Dialogue: 0,1:08:42.80,1:08:48.73,Default,,0000,0000,0000,,actually be worth your time to work on. Dialogue: 0,1:08:48.73,1:08:51.71,Default,,0000,0000,0000,,There's sort of another type of analysis that's sort of the opposite of what I just Dialogue: 0,1:08:51.71,1:08:53.37,Default,,0000,0000,0000,,talked about. Dialogue: 0,1:08:53.37,1:08:55.47,Default,,0000,0000,0000,,The error analysis I just talked about Dialogue: 0,1:08:55.47,1:08:58.28,Default,,0000,0000,0000,,tries to explain the difference between the current performance and perfect Dialogue: 0,1:08:58.28,1:08:59.77,Default,,0000,0000,0000,,performance, Dialogue: 0,1:08:59.77,1:09:03.62,Default,,0000,0000,0000,,whereas this sort of ablative analysis tries to explain the difference Dialogue: 0,1:09:03.62,1:09:09.12,Default,,0000,0000,0000,,between some baseline - some really bad performance - and your current performance. Dialogue: 0,1:09:09.12,1:09:13.09,Default,,0000,0000,0000,,So for this example, let's suppose you've built a very good anti-spam classifier by Dialogue: 0,1:09:13.09,1:09:17.15,Default,,0000,0000,0000,,adding lots of clever features to your logistic regression algorithm. Right? So you added Dialogue: 0,1:09:17.15,1:09:20.69,Default,,0000,0000,0000,,features for spelling correction, for, you know, sender host features, for email header Dialogue: 0,1:09:20.69,1:09:21.41,Default,,0000,0000,0000,,features, Dialogue: 0,1:09:21.41,1:09:24.80,Default,,0000,0000,0000,,email text parser features, JavaScript parser features, Dialogue: 0,1:09:24.80,1:09:26.84,Default,,0000,0000,0000,,features for embedded images, and so on. Dialogue: 0,1:09:26.84,1:09:30.23,Default,,0000,0000,0000,,So now let's say you've built the system and you want to figure out, you know, how well did Dialogue: 0,1:09:30.23,1:09:33.80,Default,,0000,0000,0000,,each of these - how much did each of these components actually contribute? Maybe you want Dialogue: 0,1:09:33.80,1:09:37.13,Default,,0000,0000,0000,,to write a research paper and claim this was the piece that made the Dialogue: 0,1:09:37.13,1:09:40.95,Default,,0000,0000,0000,,big difference. Can you actually document that claim and justify it? Dialogue: 0,1:09:40.95,1:09:43.32,Default,,0000,0000,0000,,So in ablative analysis, Dialogue: 0,1:09:43.32,1:09:44.57,Default,,0000,0000,0000,,here's what we do. Dialogue: 0,1:09:44.57,1:09:46.33,Default,,0000,0000,0000,,So in this example, Dialogue: 0,1:09:46.33,1:09:49.67,Default,,0000,0000,0000,,let's say that simple logistic regression without any of your clever Dialogue: 0,1:09:49.67,1:09:52.09,Default,,0000,0000,0000,,improvements gets 94 percent performance.
And Dialogue: 0,1:09:52.09,1:09:55.48,Default,,0000,0000,0000,,you want to figure out what accounts for your improvement from 94 to Dialogue: 0,1:09:55.48,1:09:58.43,Default,,0000,0000,0000,,99.9 percent performance. Dialogue: 0,1:09:58.43,1:10:03.28,Default,,0000,0000,0000,,So in ablative analysis, instead of adding components one at a time, we'll instead Dialogue: 0,1:10:03.28,1:10:06.84,Default,,0000,0000,0000,,remove components one at a time to see how the performance changes. Dialogue: 0,1:10:06.84,1:10:11.46,Default,,0000,0000,0000,,So start with your overall system, which is 99.9 percent accuracy. Dialogue: 0,1:10:11.46,1:10:14.13,Default,,0000,0000,0000,,And then we remove spelling correction and see how much performance Dialogue: 0,1:10:14.13,1:10:15.39,Default,,0000,0000,0000,,drops. Dialogue: 0,1:10:15.39,1:10:22.39,Default,,0000,0000,0000,,Then we'll remove the sender host features and see how much performance drops, and so on. All right? And so, Dialogue: 0,1:10:24.22,1:10:28.15,Default,,0000,0000,0000,,in this contrived example, Dialogue: 0,1:10:28.15,1:10:31.12,Default,,0000,0000,0000,,you see that, I guess, the biggest drop Dialogue: 0,1:10:31.12,1:10:32.38,Default,,0000,0000,0000,,occurred when you removed Dialogue: 0,1:10:32.38,1:10:37.56,Default,,0000,0000,0000,,the text parser features. And so you can then make a credible case that, Dialogue: 0,1:10:37.56,1:10:41.28,Default,,0000,0000,0000,,you know, the text parser features were what really made the biggest difference here. Okay? Dialogue: 0,1:10:41.28,1:10:42.70,Default,,0000,0000,0000,,And you can also tell, Dialogue: 0,1:10:42.70,1:10:45.53,Default,,0000,0000,0000,,for instance, that, Dialogue: 0,1:10:45.53,1:10:49.36,Default,,0000,0000,0000,,removing the sender host features on this Dialogue: 0,1:10:49.36,1:10:52.28,Default,,0000,0000,0000,,line, right, performance dropped from 99.9 to 98.9. And so this also means Dialogue: 0,1:10:52.28,1:10:53.14,Default,,0000,0000,0000,,that Dialogue: 0,1:10:53.14,1:10:56.45,Default,,0000,0000,0000,,in case you want to get rid of the sender host features to speed up Dialogue: 0,1:10:56.45,1:11:03.45,Default,,0000,0000,0000,,computation, that would be a good candidate for elimination. Okay? Student:Are there any Dialogue: 0,1:11:03.63,1:11:05.42,Default,,0000,0000,0000,,guarantees that if you shuffle around the order in which Dialogue: 0,1:11:05.42,1:11:06.42,Default,,0000,0000,0000,,you drop those Dialogue: 0,1:11:06.42,1:11:09.58,Default,,0000,0000,0000,,features that you'll get the same - Instructor (Andrew Ng):Yeah. Let's address the question: what if you shuffle the order in which you remove things? The answer is no. There's Dialogue: 0,1:11:09.58,1:11:12.11,Default,,0000,0000,0000,,no guarantee you'd get a similar result. Dialogue: 0,1:11:12.11,1:11:13.89,Default,,0000,0000,0000,,So in practice, Dialogue: 0,1:11:13.89,1:11:17.73,Default,,0000,0000,0000,,sometimes there's a fairly natural ordering for both types of analyses, the error Dialogue: 0,1:11:17.73,1:11:19.33,Default,,0000,0000,0000,,analysis and the ablative analysis, Dialogue: 0,1:11:19.33,1:11:22.75,Default,,0000,0000,0000,,sometimes there's a fairly natural ordering in which you add things or remove things, Dialogue: 0,1:11:22.75,1:11:24.56,Default,,0000,0000,0000,,sometimes there isn't.
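Here is what that remove-one-component-at-a-time analysis might look like as code, under the same caveat: train_and_evaluate is a hypothetical function that retrains your spam classifier on the listed feature groups and returns test accuracy, and a toy stand-in is used here so the sketch runs.

feature_groups = ["spelling correction", "sender host features", "email header features",
                  "text parser features", "javascript parser features", "embedded image features"]

def ablative_analysis(train_and_evaluate, groups):
    remaining = list(groups)
    prev = train_and_evaluate(remaining)                 # full system, e.g. ~99.9%
    print(f"{'full system':40s} {prev:6.1%}")
    for g in groups:
        remaining.remove(g)                              # drop this component and retrain
        acc = train_and_evaluate(remaining)
        print(f"{'- removed ' + g:40s} {acc:6.1%}   drop {acc - prev:+.1%}")
        prev = acc
    # The removal that causes the largest drop is the component you can most
    # credibly claim made the difference between the baseline and the full system.

# Toy evaluator: pretends each remaining feature group adds one percent over a ~94% baseline.
ablative_analysis(lambda groups: 0.939 + 0.01 * len(groups), feature_groups)

A common variant, which comes up in a moment, is leave-one-out ablation: remove a single component, measure, put it back, and repeat for each component, rather than removing them cumulatively.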
And Dialogue: 0,1:11:24.56,1:11:28.47,Default,,0000,0000,0000,,quite often, you either choose one ordering and just go for it Dialogue: 0,1:11:28.47,1:11:32.07,Default,,0000,0000,0000,,or the other. And don't think of these analyses as sort of fixed formulas, though; I mean Dialogue: 0,1:11:32.07,1:11:35.24,Default,,0000,0000,0000,,feel free to invent your own, as well. You know Dialogue: 0,1:11:35.24,1:11:36.64,Default,,0000,0000,0000,,one of the things Dialogue: 0,1:11:36.64,1:11:37.92,Default,,0000,0000,0000,,that's done quite often is Dialogue: 0,1:11:37.92,1:11:39.31,Default,,0000,0000,0000,,take the overall system Dialogue: 0,1:11:39.31,1:11:43.29,Default,,0000,0000,0000,,and just remove one and then put it back, then remove a different one Dialogue: 0,1:11:43.29,1:11:48.13,Default,,0000,0000,0000,,then put it back until all of these things are done. Okay. Dialogue: 0,1:11:48.13,1:11:51.01,Default,,0000,0000,0000,,So the very last thing I want to talk about is sort of this Dialogue: 0,1:11:51.01,1:11:57.98,Default,,0000,0000,0000,,general advice for how to get started on a learning problem. So Dialogue: 0,1:11:57.98,1:12:03.84,Default,,0000,0000,0000,,here's a cartoon description of two broad ways to get started on a learning problem. Dialogue: 0,1:12:03.84,1:12:05.74,Default,,0000,0000,0000,,The first one is Dialogue: 0,1:12:05.74,1:12:07.61,Default,,0000,0000,0000,,carefully design your system, so Dialogue: 0,1:12:07.61,1:12:11.74,Default,,0000,0000,0000,,you spend a long time designing exactly the right features, collecting the right data set, and Dialogue: 0,1:12:11.74,1:12:14.19,Default,,0000,0000,0000,,designing the right algorithmic structure, then you Dialogue: 0,1:12:14.19,1:12:17.68,Default,,0000,0000,0000,,implement it and hope it works. All right? Dialogue: 0,1:12:17.68,1:12:21.04,Default,,0000,0000,0000,,The benefit of this sort of approach is you get maybe nicer, maybe more scalable Dialogue: 0,1:12:21.04,1:12:22.43,Default,,0000,0000,0000,,algorithms, Dialogue: 0,1:12:22.43,1:12:26.72,Default,,0000,0000,0000,,and maybe you come up with new elegant learning algorithms. And if your goal is to, Dialogue: 0,1:12:26.72,1:12:30.76,Default,,0000,0000,0000,,you know, contribute to basic research in machine learning, if your goal is to invent new machine learning Dialogue: 0,1:12:30.76,1:12:31.50,Default,,0000,0000,0000,,algorithms, Dialogue: 0,1:12:31.50,1:12:33.55,Default,,0000,0000,0000,,this process of slowing down and Dialogue: 0,1:12:33.55,1:12:36.30,Default,,0000,0000,0000,,thinking deeply about the problem, you know, that is sort of the right way to go Dialogue: 0,1:12:36.30,1:12:37.12,Default,,0000,0000,0000,,about it - Dialogue: 0,1:12:37.12,1:12:41.10,Default,,0000,0000,0000,,think deeply about a problem and invent new solutions. Dialogue: 0,1:12:41.10,1:12:42.28,Default,,0000,0000,0000,, Dialogue: 0,1:12:42.28,1:12:44.08,Default,,0000,0000,0000,,The second sort of approach Dialogue: 0,1:12:44.08,1:12:48.84,Default,,0000,0000,0000,,is what I call build-and-fix, which is you implement something quick and dirty Dialogue: 0,1:12:48.84,1:12:52.31,Default,,0000,0000,0000,,and then you run error analyses and diagnostics to figure out what's wrong and Dialogue: 0,1:12:52.31,1:12:54.20,Default,,0000,0000,0000,,you fix those errors. Dialogue: 0,1:12:54.20,1:12:58.13,Default,,0000,0000,0000,,The benefit of this second type of approach is that it'll often get your Dialogue: 0,1:12:58.13,1:13:01.12,Default,,0000,0000,0000,,application working much more quickly.
Dialogue: 0,1:13:01.12,1:13:04.40,Default,,0000,0000,0000,,And especially with those of you, if you end up working in a company, and sometimes - if you end up working in Dialogue: 0,1:13:04.40,1:13:05.55,Default,,0000,0000,0000,,a company, Dialogue: 0,1:13:05.55,1:13:07.46,Default,,0000,0000,0000,,you know, very often it's not Dialogue: 0,1:13:07.46,1:13:10.90,Default,,0000,0000,0000,,the best product that wins; it's the first product to market that Dialogue: 0,1:13:10.90,1:13:11.69,Default,,0000,0000,0000,,wins. And Dialogue: 0,1:13:11.69,1:13:14.87,Default,,0000,0000,0000,,so there's - especially in the industry. There's really something to be said for, Dialogue: 0,1:13:14.87,1:13:18.79,Default,,0000,0000,0000,,you know, building a system quickly and getting it deployed quickly. Dialogue: 0,1:13:18.79,1:13:23.14,Default,,0000,0000,0000,,And the second approach of building a quick-and-dirty, I'm gonna say hack Dialogue: 0,1:13:23.14,1:13:26.47,Default,,0000,0000,0000,,and then fixing the problems will actually get you to a Dialogue: 0,1:13:26.47,1:13:27.84,Default,,0000,0000,0000,,system that works well Dialogue: 0,1:13:27.84,1:13:30.97,Default,,0000,0000,0000,,much more quickly. Dialogue: 0,1:13:30.97,1:13:32.65,Default,,0000,0000,0000,,And the reason is Dialogue: 0,1:13:32.65,1:13:36.15,Default,,0000,0000,0000,,very often it's really not clear what parts of a system are easier to think of to Dialogue: 0,1:13:36.15,1:13:37.59,Default,,0000,0000,0000,,build and therefore what Dialogue: 0,1:13:37.59,1:13:40.18,Default,,0000,0000,0000,,you need to spends lot of time focusing on. Dialogue: 0,1:13:40.18,1:13:43.42,Default,,0000,0000,0000,,So there's that example I talked about just now. Right? Dialogue: 0,1:13:43.42,1:13:46.93,Default,,0000,0000,0000,,For identifying Dialogue: 0,1:13:46.93,1:13:48.71,Default,,0000,0000,0000,,people, say. Dialogue: 0,1:13:48.71,1:13:53.20,Default,,0000,0000,0000,,And with a big complicated learning system like this, a big complicated pipeline like this, Dialogue: 0,1:13:53.20,1:13:55.59,Default,,0000,0000,0000,,it's really not obvious at the outset Dialogue: 0,1:13:55.59,1:13:59.13,Default,,0000,0000,0000,,which of these components you should spend lots of time working on. Right? And if Dialogue: 0,1:13:59.13,1:14:00.96,Default,,0000,0000,0000,,you didn't know that Dialogue: 0,1:14:00.96,1:14:03.80,Default,,0000,0000,0000,,preprocessing wasn't the right component, you could easily have Dialogue: 0,1:14:03.80,1:14:07.27,Default,,0000,0000,0000,,spent three months working on better background subtraction, not knowing that it's Dialogue: 0,1:14:07.27,1:14:09.88,Default,,0000,0000,0000,,just not gonna ultimately matter. Dialogue: 0,1:14:09.88,1:14:10.77,Default,,0000,0000,0000,,And so Dialogue: 0,1:14:10.77,1:14:13.69,Default,,0000,0000,0000,,the only way to find out what really works was inputting something quickly and Dialogue: 0,1:14:13.69,1:14:15.35,Default,,0000,0000,0000,,you find out what parts - Dialogue: 0,1:14:15.35,1:14:16.89,Default,,0000,0000,0000,,and find out Dialogue: 0,1:14:16.89,1:14:17.89,Default,,0000,0000,0000,,what parts Dialogue: 0,1:14:17.89,1:14:21.36,Default,,0000,0000,0000,,are really the hard parts to implement, or what parts are hard parts that could make a Dialogue: 0,1:14:21.36,1:14:23.08,Default,,0000,0000,0000,,difference in performance. 
Dialogue: 0,1:14:23.08,1:14:26.58,Default,,0000,0000,0000,,In fact, I'd say that if your goal is to build a Dialogue: 0,1:14:26.58,1:14:29.31,Default,,0000,0000,0000,,people recognition system, a system like this is actually far too Dialogue: 0,1:14:29.31,1:14:31.64,Default,,0000,0000,0000,,complicated as your initial system. Dialogue: 0,1:14:31.64,1:14:35.56,Default,,0000,0000,0000,,Maybe after you've prototyped a few systems, you'll converge on a system like this. But if this Dialogue: 0,1:14:35.56,1:14:42.56,Default,,0000,0000,0000,,is your first system you're designing, this is much too complicated. Also, this is a Dialogue: 0,1:14:43.57,1:14:48.06,Default,,0000,0000,0000,,very concrete piece of advice, and this applies to your projects as well. Dialogue: 0,1:14:48.06,1:14:51.23,Default,,0000,0000,0000,,If your goal is to build a working application, Dialogue: 0,1:14:51.23,1:14:55.26,Default,,0000,0000,0000,,Step 1 is actually probably not to design a system like this. Step 1 is to plot your Dialogue: 0,1:14:55.26,1:14:57.28,Default,,0000,0000,0000,,data. Dialogue: 0,1:14:57.28,1:15:01.22,Default,,0000,0000,0000,,And very often, if you just take the data you're trying to predict and just plot your Dialogue: 0,1:15:01.22,1:15:05.73,Default,,0000,0000,0000,,data, plot X, plot Y, plot your data every way you can think of, Dialogue: 0,1:15:05.73,1:15:10.31,Default,,0000,0000,0000,,you know, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they Dialogue: 0,1:15:10.31,1:15:13.90,Default,,0000,0000,0000,,should be positive. Something's wrong with this dataset." And it's about Dialogue: 0,1:15:13.90,1:15:18.39,Default,,0000,0000,0000,,half the time you find something obviously wrong with your data or something very surprising. Dialogue: 0,1:15:18.39,1:15:21.57,Default,,0000,0000,0000,,And this is something you find out just by plotting your data, and that you Dialogue: 0,1:15:21.57,1:15:28.18,Default,,0000,0000,0000,,won't find out by implementing these big complicated learning algorithms on it. Plotting Dialogue: 0,1:15:28.18,1:15:31.92,Default,,0000,0000,0000,,the data sounds so simple, but it's one of the pieces of advice that lots of us give and Dialogue: 0,1:15:31.92,1:15:38.57,Default,,0000,0000,0000,,hardly anyone follows, so you can take that for what it's worth. Dialogue: 0,1:15:38.57,1:15:42.20,Default,,0000,0000,0000,,Let me just reiterate, what I just said here may be bad advice Dialogue: 0,1:15:42.20,1:15:44.02,Default,,0000,0000,0000,,if your goal is to come up with Dialogue: 0,1:15:44.02,1:15:46.64,Default,,0000,0000,0000,,new machine learning algorithms. All right? So Dialogue: 0,1:15:46.64,1:15:51.02,Default,,0000,0000,0000,,for me personally, the learning algorithm I use the most often is probably Dialogue: 0,1:15:51.02,1:15:53.60,Default,,0000,0000,0000,,logistic regression because I have code lying around. So give me a Dialogue: 0,1:15:53.60,1:15:56.77,Default,,0000,0000,0000,,learning problem, I probably won't try anything more complicated than logistic Dialogue: 0,1:15:56.77,1:15:58.26,Default,,0000,0000,0000,,regression on it first. And it's Dialogue: 0,1:15:58.26,1:16:01.94,Default,,0000,0000,0000,,only after trying something really simple and figuring out what's easy, what's hard, that you know Dialogue: 0,1:16:01.94,1:16:03.94,Default,,0000,0000,0000,,where to focus your efforts.
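For a class project, that advice translates into something as simple as the sketch below: look at the data before fitting anything, then try the simplest classifier you have lying around. The data here is a randomly generated placeholder, and scikit-learn and matplotlib are just one convenient choice, not anything prescribed by the lecture.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1: just plot the data, every view of it you can think of.
# (Placeholder data; load your real X and y here.)
X = np.random.randn(500, 2)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.xlabel("feature 0")
plt.ylabel("feature 1")
plt.title("raw data, colored by label")
plt.show()

# Step 2: try the simplest thing first (logistic regression) to see what's easy and what's hard.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
baseline = LogisticRegression().fit(X_train, y_train)
print("baseline test accuracy:", baseline.score(X_test, y_test))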
But Dialogue: 0,1:16:03.94,1:16:07.61,Default,,0000,0000,0000,,again, if your goal is to invent new machine learning algorithms, then you sort of don't Dialogue: 0,1:16:07.61,1:16:10.75,Default,,0000,0000,0000,,want to hack up something and then add another hack to fix it, and hack it even more to Dialogue: 0,1:16:10.75,1:16:12.22,Default,,0000,0000,0000,,fix it. Right? So if Dialogue: 0,1:16:12.22,1:16:15.92,Default,,0000,0000,0000,,your goal is to do novel machine learning research, then it pays to think more deeply about the Dialogue: 0,1:16:15.92,1:16:21.34,Default,,0000,0000,0000,,problem and not gonna follow this specifically. Dialogue: 0,1:16:21.34,1:16:22.92,Default,,0000,0000,0000,,Shoot, you know what? All Dialogue: 0,1:16:22.92,1:16:28.28,Default,,0000,0000,0000,,right, sorry if I'm late but I just have two more slides so I'm gonna go through these quickly. Dialogue: 0,1:16:28.28,1:16:30.62,Default,,0000,0000,0000,,And so, this is what I think Dialogue: 0,1:16:30.62,1:16:33.46,Default,,0000,0000,0000,,of as premature statistical optimization, Dialogue: 0,1:16:33.46,1:16:35.08,Default,,0000,0000,0000,,where quite often, Dialogue: 0,1:16:35.08,1:16:38.32,Default,,0000,0000,0000,,just like premature optimization of code, quite often Dialogue: 0,1:16:38.32,1:16:44.37,Default,,0000,0000,0000,,people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more Dialogue: 0,1:16:44.37,1:16:46.95,Default,,0000,0000,0000,,slides. This Dialogue: 0,1:16:46.95,1:16:48.54,Default,,0000,0000,0000,,was - Dialogue: 0,1:16:48.54,1:16:52.07,Default,,0000,0000,0000,,this is a sort of cartoon that highly influenced my own thinking. It was based on Dialogue: 0,1:16:52.07,1:16:55.34,Default,,0000,0000,0000,,a paper written by Christos Papadimitriou. Dialogue: 0,1:16:55.34,1:16:57.43,Default,,0000,0000,0000,,This is how Dialogue: 0,1:16:57.43,1:16:59.36,Default,,0000,0000,0000,,progress - this is how Dialogue: 0,1:16:59.36,1:17:02.36,Default,,0000,0000,0000,,developmental progress of research often happens. Right? Dialogue: 0,1:17:02.36,1:17:05.56,Default,,0000,0000,0000,,Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. And it Dialogue: 0,1:17:05.56,1:17:06.52,Default,,0000,0000,0000,,seems like a useful thing to have. Dialogue: 0,1:17:06.52,1:17:09.67,Default,,0000,0000,0000,,Right? You know free up people, don't have Dialogue: 0,1:17:09.67,1:17:12.76,Default,,0000,0000,0000,,to deliver mail. So what - Dialogue: 0,1:17:12.76,1:17:14.28,Default,,0000,0000,0000,,to deliver mail, Dialogue: 0,1:17:14.28,1:17:19.14,Default,,0000,0000,0000,,obviously you need a robot to wander around indoor environments and you need a robot to Dialogue: 0,1:17:19.14,1:17:21.48,Default,,0000,0000,0000,,manipulate objects and pickup envelopes. And so, Dialogue: 0,1:17:21.48,1:17:24.89,Default,,0000,0000,0000,,you need to build those two components in order to get a mail delivery robot. And Dialogue: 0,1:17:24.89,1:17:25.59,Default,,0000,0000,0000,,so I've Dialogue: 0,1:17:25.59,1:17:29.65,Default,,0000,0000,0000,,drawing those two components and little arrows to denote that, you know, obstacle avoidance Dialogue: 0,1:17:29.65,1:17:30.46,Default,,0000,0000,0000,,is Dialogue: 0,1:17:30.46,1:17:32.23,Default,,0000,0000,0000,,needed or would help build Dialogue: 0,1:17:32.23,1:17:35.51,Default,,0000,0000,0000,,your mail delivery robot. 
Well Dialogue: 0,1:17:35.51,1:17:37.19,Default,,0000,0000,0000,,for obstacle for avoidance, Dialogue: 0,1:17:37.19,1:17:43.16,Default,,0000,0000,0000,,clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. Dialogue: 0,1:17:43.16,1:17:46.84,Default,,0000,0000,0000,,Now we're gonna use computer vision to detect the objects. And so, Dialogue: 0,1:17:46.84,1:17:51.12,Default,,0000,0000,0000,,we know that, you know, lighting sometimes changes, right, depending on whether it's the Dialogue: 0,1:17:51.12,1:17:52.71,Default,,0000,0000,0000,,morning or noontime or evening. This Dialogue: 0,1:17:52.71,1:17:53.93,Default,,0000,0000,0000,,is lighting Dialogue: 0,1:17:53.93,1:17:56.64,Default,,0000,0000,0000,,causes the color of things to change, and so you need Dialogue: 0,1:17:56.64,1:18:00.51,Default,,0000,0000,0000,,an object detection system that's invariant to the specific colors of an Dialogue: 0,1:18:00.51,1:18:01.20,Default,,0000,0000,0000,,object. Right? Dialogue: 0,1:18:01.20,1:18:04.42,Default,,0000,0000,0000,,Because lighting Dialogue: 0,1:18:04.42,1:18:05.40,Default,,0000,0000,0000,,changes, Dialogue: 0,1:18:05.40,1:18:09.85,Default,,0000,0000,0000,,say. Well color, or RGB values, is represented by three-dimensional vectors. And Dialogue: 0,1:18:09.85,1:18:11.17,Default,,0000,0000,0000,,so you need to learn Dialogue: 0,1:18:11.17,1:18:13.50,Default,,0000,0000,0000,,when two colors might be the same thing, Dialogue: 0,1:18:13.50,1:18:15.26,Default,,0000,0000,0000,,when two, you know, Dialogue: 0,1:18:15.26,1:18:18.16,Default,,0000,0000,0000,,visual appearance of two colors may be the same thing as just the lighting change or Dialogue: 0,1:18:18.16,1:18:19.54,Default,,0000,0000,0000,,something. Dialogue: 0,1:18:19.54,1:18:20.59,Default,,0000,0000,0000,,And Dialogue: 0,1:18:20.59,1:18:24.06,Default,,0000,0000,0000,,to understand that properly, we can go out and study differential geometry Dialogue: 0,1:18:24.06,1:18:27.51,Default,,0000,0000,0000,,of 3d manifolds because that helps us build a sound theory on which Dialogue: 0,1:18:27.51,1:18:32.25,Default,,0000,0000,0000,,to develop our 3d similarity learning algorithms. Dialogue: 0,1:18:32.25,1:18:36.16,Default,,0000,0000,0000,,And to really understand the fundamental aspects of this problem, Dialogue: 0,1:18:36.16,1:18:40.11,Default,,0000,0000,0000,,we have to study the complexity of non-Riemannian geometries. And on Dialogue: 0,1:18:40.11,1:18:43.85,Default,,0000,0000,0000,,and on it goes until eventually you're proving convergence bounds for Dialogue: 0,1:18:43.85,1:18:49.79,Default,,0000,0000,0000,,sampled of non-monotonic logic. I don't even know what this is because I just made it up. Dialogue: 0,1:18:49.79,1:18:51.53,Default,,0000,0000,0000,,Whereas in reality, Dialogue: 0,1:18:51.53,1:18:53.97,Default,,0000,0000,0000,,you know, chances are that link isn't real. Dialogue: 0,1:18:53.97,1:18:55.66,Default,,0000,0000,0000,,Color variance Dialogue: 0,1:18:55.66,1:18:59.55,Default,,0000,0000,0000,,just barely helped object recognition maybe. I'm making this up. Dialogue: 0,1:18:59.55,1:19:03.50,Default,,0000,0000,0000,,Maybe differential geometry was hardly gonna help 3d similarity learning and that link's also gonna fail. Okay? 
Dialogue: 0,1:19:03.50,1:19:05.27,Default,,0000,0000,0000,,So, each of Dialogue: 0,1:19:05.27,1:19:09.13,Default,,0000,0000,0000,,these circles can represent a person, or a research community, or a thought in your Dialogue: 0,1:19:09.13,1:19:12.02,Default,,0000,0000,0000,,head. And there's a very real chance that Dialogue: 0,1:19:12.02,1:19:15.47,Default,,0000,0000,0000,,maybe there are all these papers written on differential geometry of 3d manifolds, and they are Dialogue: 0,1:19:15.47,1:19:18.57,Default,,0000,0000,0000,,written because some guy once told someone else that it'll help 3d similarity learning. Dialogue: 0,1:19:18.57,1:19:20.49,Default,,0000,0000,0000,,And, Dialogue: 0,1:19:20.49,1:19:23.37,Default,,0000,0000,0000,,you know, it's like "A friend of mine told me that color invariance would help in Dialogue: 0,1:19:23.37,1:19:26.12,Default,,0000,0000,0000,,object recognition, so I'm working on color invariance. And now I'm gonna tell a friend Dialogue: 0,1:19:26.12,1:19:27.44,Default,,0000,0000,0000,,of mine Dialogue: 0,1:19:27.44,1:19:30.28,Default,,0000,0000,0000,,that his thing will help my problem. And he'll tell a friend of his that his thing will help Dialogue: 0,1:19:30.28,1:19:31.62,Default,,0000,0000,0000,,with his problem." Dialogue: 0,1:19:31.62,1:19:33.52,Default,,0000,0000,0000,,And pretty soon, you're working on Dialogue: 0,1:19:33.52,1:19:37.54,Default,,0000,0000,0000,,convergence bound for sampled non-monotonic logic, when in reality none of these will Dialogue: 0,1:19:37.54,1:19:39.13,Default,,0000,0000,0000,,see the light of Dialogue: 0,1:19:39.13,1:19:42.52,Default,,0000,0000,0000,,day of your mail delivery robot. Okay? Dialogue: 0,1:19:42.52,1:19:46.60,Default,,0000,0000,0000,,I'm not criticizing the role of theory. There are very powerful theories, like the Dialogue: 0,1:19:46.60,1:19:48.40,Default,,0000,0000,0000,,theory of VC dimension, Dialogue: 0,1:19:48.40,1:19:52.09,Default,,0000,0000,0000,,which is far, far, far to the right of this. So VC dimension is about Dialogue: 0,1:19:52.09,1:19:53.29,Default,,0000,0000,0000,,as theoretical Dialogue: 0,1:19:53.29,1:19:57.12,Default,,0000,0000,0000,,as it can get. And it's clearly had a huge impact on many applications. And there's, Dialogue: 0,1:19:57.12,1:19:59.56,Default,,0000,0000,0000,,you know, dramatically advanced data machine learning. And another example is theory of NP-hardness as again, you know, Dialogue: 0,1:19:59.56,1:20:00.75,Default,,0000,0000,0000,,is about Dialogue: 0,1:20:00.75,1:20:04.22,Default,,0000,0000,0000,,theoretical as it can get. It's Dialogue: 0,1:20:04.22,1:20:05.80,Default,,0000,0000,0000,,like a huge application Dialogue: 0,1:20:05.80,1:20:09.31,Default,,0000,0000,0000,,on all of computer science, the theory of NP-hardness. 
Dialogue: 0,1:20:09.31,1:20:10.67,Default,,0000,0000,0000,,But Dialogue: 0,1:20:10.67,1:20:13.80,Default,,0000,0000,0000,,when you are off working on highly theoretical things, I guess, to me Dialogue: 0,1:20:13.80,1:20:16.85,Default,,0000,0000,0000,,personally it's important to keep in mind Dialogue: 0,1:20:16.85,1:20:19.70,Default,,0000,0000,0000,,are you working on something like VC dimension, which is high impact, or are you Dialogue: 0,1:20:19.70,1:20:23.29,Default,,0000,0000,0000,,working on something like convergence bound for sampled nonmonotonic logic, which Dialogue: 0,1:20:23.29,1:20:24.71,Default,,0000,0000,0000,,you're only hoping Dialogue: 0,1:20:24.71,1:20:25.90,Default,,0000,0000,0000,,has some peripheral relevance Dialogue: 0,1:20:25.90,1:20:30.04,Default,,0000,0000,0000,,to some application. Okay? Dialogue: 0,1:20:30.04,1:20:34.85,Default,,0000,0000,0000,,For me personally, I tend to work on an application only if I - excuse me. Dialogue: 0,1:20:34.85,1:20:36.99,Default,,0000,0000,0000,,For me personally, and this is a personal choice, Dialogue: 0,1:20:36.99,1:20:41.34,Default,,0000,0000,0000,,I tend to trust something only if I personally can see a link from the Dialogue: 0,1:20:41.34,1:20:42.68,Default,,0000,0000,0000,,theory I'm working on Dialogue: 0,1:20:42.68,1:20:44.43,Default,,0000,0000,0000,,all the way back to an application. Dialogue: 0,1:20:44.43,1:20:46.01,Default,,0000,0000,0000,,And Dialogue: 0,1:20:46.01,1:20:50.30,Default,,0000,0000,0000,,if I don't personally see a direct link from what I'm doing to an application then, Dialogue: 0,1:20:50.30,1:20:53.43,Default,,0000,0000,0000,,you know, then that's fine. Then I can choose to work on theory, but Dialogue: 0,1:20:53.43,1:20:55.65,Default,,0000,0000,0000,,I wouldn't necessarily trust that Dialogue: 0,1:20:55.65,1:20:59.21,Default,,0000,0000,0000,,what the theory I'm working on will relate to an application, if I don't personally Dialogue: 0,1:20:59.21,1:21:02.43,Default,,0000,0000,0000,,see a link all the way back. Dialogue: 0,1:21:02.43,1:21:04.40,Default,,0000,0000,0000,,Just to summarize. Dialogue: 0,1:21:04.40,1:21:06.41,Default,,0000,0000,0000,, Dialogue: 0,1:21:06.41,1:21:08.68,Default,,0000,0000,0000,,One lesson to take away from today is I think Dialogue: 0,1:21:08.68,1:21:12.53,Default,,0000,0000,0000,,time spent coming up with diagnostics for learning algorithms is often time well spent. Dialogue: 0,1:21:12.53,1:21:13.03,Default,,0000,0000,0000,, Dialogue: 0,1:21:13.03,1:21:16.20,Default,,0000,0000,0000,,It's often up to your own ingenuity to come up with great diagnostics. And Dialogue: 0,1:21:16.20,1:21:19.02,Default,,0000,0000,0000,,just when I personally, when I work on machine learning algorithm, Dialogue: 0,1:21:19.02,1:21:21.17,Default,,0000,0000,0000,,it's not uncommon for me to be spending like Dialogue: 0,1:21:21.17,1:21:23.68,Default,,0000,0000,0000,,between a third and often half of my time Dialogue: 0,1:21:23.68,1:21:26.41,Default,,0000,0000,0000,,just writing diagnostics and trying to figure out what's going right and what's Dialogue: 0,1:21:26.41,1:21:28.08,Default,,0000,0000,0000,,going on. Dialogue: 0,1:21:28.08,1:21:31.50,Default,,0000,0000,0000,,Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and Dialogue: 0,1:21:31.50,1:21:34.78,Default,,0000,0000,0000,,making progress. 
You don't want to be spending all this time, you know, implementing tests on your Dialogue: 0,1:21:34.78,1:21:38.28,Default,,0000,0000,0000,,learning algorithms; it doesn't feel like when you're doing anything. But when Dialogue: 0,1:21:38.28,1:21:41.42,Default,,0000,0000,0000,,I implement learning algorithms, at least a third, and quite often half of Dialogue: 0,1:21:41.42,1:21:45.88,Default,,0000,0000,0000,,my time, is actually spent implementing those tests and you can figure out what to work on. And Dialogue: 0,1:21:45.88,1:21:49.22,Default,,0000,0000,0000,,I think it's actually one of the best uses of your time. Talked Dialogue: 0,1:21:49.22,1:21:50.73,Default,,0000,0000,0000,,about error Dialogue: 0,1:21:50.73,1:21:54.32,Default,,0000,0000,0000,,analyses and ablative analyses, and lastly Dialogue: 0,1:21:54.32,1:21:56.89,Default,,0000,0000,0000,,talked about, you know, different approaches and the Dialogue: 0,1:21:56.89,1:22:00.98,Default,,0000,0000,0000,,risks of premature statistical optimization. Okay. Dialogue: 0,1:22:00.98,1:22:04.34,Default,,0000,0000,0000,,Sorry I ran you over. I'll be here for a few more minutes for your questions.