This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is actually wrap up our discussion on learning theory, and I'm gonna start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be on advice for applying machine learning algorithms to problems like, you know, the class project, or other problems you may go work on after you graduate from this class. But let's start by talking about Bayesian statistics and regularization.

So you remember, from last week we started to talk about learning theory, and we learned about bias and variance. And I guess we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and you may be less likely to overfit. Right? What I want to do today is to talk about a different way to prevent overfitting.
And there's a method called regularization, and that's a way that lets you keep all the parameters.

So here's the idea, and I'm gonna illustrate this with, say, linear regression. So you take the linear regression model, the very first model we learned about, right, and we said that we would choose the parameters via maximum likelihood. Right? And that meant that, you know, you would choose the parameters theta that maximized the probability of the data - the parameters theta that maximize the probability of the data we observe. Right?

And so, to give this sort of procedure a name, this is one example of the most common frequentist procedure, and frequentist statistics you can think of, sort of, as one school of statistics. And the philosophical view behind writing this down was: we envision that there is some true parameter theta out there that generated, you know, the Xs and the Ys. There's some true parameter theta that governs housing prices Y as a function of X, and we don't know what the value of theta is, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta. And the way you formulate this, you know, theta is not a random variable. Right? That's why we said theta is just some true value out there; it's not random or anything, we just don't know what it is, and we have a procedure called maximum likelihood for estimating the value of theta. So this is one example of what's called a frequentist procedure.
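In symbols, this maximum likelihood choice of the parameters - a sketch of what was on the board - is

    \theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)};\, \theta\big),

where the semicolon is there because theta is being treated as a fixed, non-random quantity.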
The alternative to the, I guess, frequentist school of statistics is the Bayesian school, in which we're gonna say that we don't know what theta is, and so we will put a prior on theta. Okay? So in the Bayesian school, we would say, "Well, we don't know the value of theta, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared I. Okay? And so - actually, let me use S to denote my training set - right, so this prior represents my beliefs about what the parameters are in the absence of any data. So, not having seen any data, it probably represents what I think theta is most likely to be.

And so, given the training set S, in the sort of Bayesian procedure we would calculate the posterior probability of the parameters given my training set - let's write this on the next board. So my posterior on my parameters given my training set, by Bayes' rule, will be proportional to the probability of the data given theta, times the prior on theta. Right? So by Bayes' rule. Let's call it the posterior.
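Written out - again, a sketch of the board equations - the prior and this posterior are

    \theta \sim \mathcal{N}\big(0,\; \tau^{2} I\big),
    \qquad
    p(\theta \mid S) \;\propto\; \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta).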
And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a new prediction on the price of a new house, with input X, I would say that, well, the distribution over the possible housing prices for this new house I'm trying to estimate the price of - given the features of the house X, and the training set I had previously - is going to be given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set. Okay? And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate over Y, Y times that distribution. Okay? You would take an expectation of Y with respect to this distribution. Okay?

And you notice that when I was writing this down, with the Bayesian formulation I've now started to write Y given X comma theta, because this formula is now the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta, I'm writing comma theta, because I'm now treating theta as a random variable.

So all of this is somewhat abstract, but it turns out - actually, let's check. Are there questions about this? No? Okay. Let's try to make this more concrete.
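Concretely, the two steps just described are, in equations,

    p(y \mid x, S) \;=\; \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta,
    \qquad
    \mathbb{E}[\, y \mid x, S \,] \;=\; \int_{y} y\, p(y \mid x, S)\, dy.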
It turns out that, for many problems, both of these steps in the computation are difficult, because if theta is an n-plus-one-dimensional parameter vector, then this is an integral over an n-plus-one-dimensional space, you know, over R^{n+1}. And numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral, if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms - say, Bayesian logistic regression - this is hard to do.

And so what's commonly done is, instead of actually computing the full posterior distribution P of theta given S, we'll take the quantity on the right-hand side and just maximize it. So let me write this down. So commonly, instead of computing the full posterior distribution, we will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta - the most probable value of theta under your posterior distribution. And that's just the arg max over theta of the likelihood of the data times the prior.
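In symbols, the MAP estimate is

    \theta_{\mathrm{MAP}} \;=\; \arg\max_{\theta}\, \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta),

which is the maximum likelihood objective multiplied by the extra prior factor p(theta).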
And then when you need to make a prediction, you know, you would just predict using your usual hypothesis, with this MAP value of theta as the parameter vector you choose. Okay?

And notice that the only difference between this and standard maximum likelihood estimation is that, instead of choosing the maximum likelihood value for theta, you're maximizing this - which is what you have for maximum likelihood estimation - times this other quantity, which is the prior. Right?

And let's see, one intuition is that if your prior on theta is Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? It's a Gaussian centered around the point zero, and so most of the mass is close to zero. And so this prior distribution is, in effect, saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection: if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta to be equal to zero. All right? So if you set theta five to be equal to zero, that's the same as, you know, eliminating feature five from your hypothesis. And so this is a prior that drives most of the parameter values to zero - to values close to zero - and you can think of it as doing something analogous to, or reminiscent of, feature selection. Okay?
And it turns out that with this formulation the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures, if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial - well, I think that one had too many bumps in it, but never mind - if you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice - it's sort of hard to see, but you can take my word for it - as tau becomes smaller and smaller, the curves you fit to your data also become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?

Let's see, one last piece of intuition that I would just toss out there - and you get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess - is that whereas maximum likelihood for, say, linear regression turns out to be minimizing the squared error, it turns out that if you add this prior term, the optimization objective you end up with has an extra term that penalizes your parameters theta for being large.
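In equations - roughly what was on the board - maximum likelihood for linear regression minimizes

    \sum_{i=1}^{m} \big(y^{(i)} - \theta^{T} x^{(i)}\big)^{2},

whereas with the Gaussian prior the objective you end up minimizing is

    \sum_{i=1}^{m} \big(y^{(i)} - \theta^{T} x^{(i)}\big)^{2} \;+\; \lambda \,\|\theta\|^{2},

for some constant lambda that grows as tau shrinks.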
And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect - again, it's kind of hard to see, but just take my word for it - that shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check if there are questions about all this.

Student: The smoothing behavior - is it because [inaudible] actually get different [inaudible]?

Instructor: Let's see. Yeah. It depends on - well, most priors with most of the mass close to zero will have this effect, I guess. And just by convention, the Gaussian prior is what's used most commonly for models like logistic regression and linear regression - generalized linear models. There are a few other priors that I sometimes use, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.
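Here is a minimal sketch of that smoothing effect - this is not code from the course, and the data and the lambda values are made up - fitting a deliberately over-flexible polynomial by regularized least squares and watching the parameters shrink as lambda grows (equivalently, as tau shrinks):

    # Regularized least squares on a tiny made-up data set: minimize
    #   sum_i (y_i - theta^T x_i)^2 + lam * ||theta||^2
    # for a few values of lam and watch ||theta|| shrink as lam grows.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, size=8))                 # 8 points, made up
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)

    degree = 7                                                 # deliberately too flexible
    X = np.vander(x, degree + 1, increasing=True)              # columns 1, x, ..., x^7

    for lam in [0.0, 1e-4, 1e-1]:
        # closed form: theta = (X^T X + lam * I)^{-1} X^T y
        theta = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
        print(f"lambda={lam:g}  ||theta||={np.linalg.norm(theta):10.2f}  "
              f"train MSE={np.mean((X @ theta - y) ** 2):.4f}")

In practice you would pick lambda, or equivalently tau, by cross-validation rather than by eye, which is exactly the question that comes up next.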
And so it turns out that for problems like text classification - text classification has like 30,000 features or 50,000 features - it seems like an algorithm like logistic regression would be very prone to overfitting. Right? So imagine trying to build a spam classifier: maybe you have 100 training examples but you have 30,000 or 50,000 features. That seems clearly prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor: Yeah, right - and so to pick either tau squared or lambda - I think the relation is lambda equals one over tau squared - but right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So, all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. And so, you know, when you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what's called batch learning algorithms, where you're given a training set and then you get to run your learning algorithm on the training set, and then maybe you test it on some other test set.
And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So here's how the problem goes. All right? I'm first gonna give you X one - let's say this is a classification problem - so I'm first gonna give you X one and then ask you, you know, "Can you make a prediction on X one? Is the label one or zero?" And you've not seen any data yet. And so you make a guess. Right? We'll call your guess Y-hat one. And after you've made your prediction, I will then reveal to you the true label Y one. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent, right, if you guess randomly.

And then I show you X two. And then I ask you, "Can you make a prediction on X two?" And so you now maybe are gonna make a slightly more educated guess, and call that Y-hat two. And after you've made your guess, I reveal the true label to you. And so then I show you X three, and then you make your guess, and learning proceeds as follows.

So this is unlike a lot of machine learning - the batch learning setting - in that you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're running a website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes, and it's only as you're making predictions that you get to see more and more training examples.
So in online learning, what you care about is the total online error, which is the sum from i equals one to m - if you get a sequence of m examples all together - of the indicator that Y-hat i is not equal to Y i. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that, you know, many of the learning algorithms you've learned about can apply to this setting. One thing you could do is, when you're asked to make the prediction Y-hat three, right - one simple thing to do is, well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on the examples leading up to Y-hat three. So just run the learning algorithm on all the examples you've seen previous to being asked to make a prediction on a certain example, and then use your learning algorithm to make a prediction on the next example. And it turns out that there are also algorithms, especially the algorithms that we saw that you could use with stochastic gradient descent, that, you know, can be adapted very nicely to this.
So as a concrete example, if you remember the perceptron algorithm, say, right, you would initialize the parameters theta to be equal to zero. And then after seeing the i-th training example, you'd update the parameters, you know, using - you've seen this rule a lot of times now, right - the standard perceptron learning rule. And the same thing if you were using logistic regression: you can, again, after seeing each training example, just run, essentially, one step of stochastic gradient descent on just the example you saw. Okay?
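As a sketch of what that looks like in code - this is not the lecture's code; the data stream, the 0/1 labels, and the learning rate alpha are assumptions - here is the online perceptron using one standard form of the rule, and next to it one step of stochastic gradient ascent for logistic regression run on each example as it arrives:

    import numpy as np

    def online_perceptron(stream, n_features, alpha=1.0):
        theta = np.zeros(n_features)              # initialize theta to zero
        mistakes = 0                              # total online error
        for x, y in stream:                       # examples arrive one at a time
            y_hat = 1 if x @ theta >= 0 else 0    # predict before seeing the label
            mistakes += int(y_hat != y)
            theta += alpha * (y - y_hat) * x      # perceptron update on this one example
        return theta, mistakes

    def online_logistic_sgd(stream, n_features, alpha=0.1):
        theta = np.zeros(n_features)
        for x, y in stream:
            p = 1.0 / (1.0 + np.exp(-(x @ theta)))    # P(y = 1 | x, theta)
            theta += alpha * (y - p) * x              # one stochastic gradient step
        return theta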
And so the reason I've put this into the sort of "learning theory" section of this class is because it turns out that sometimes you can prove fairly amazing results on your total online error using algorithms like these. I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features X i are maybe infinite-dimensional feature vectors, like we saw for support vector machines - and sometimes infinite-dimensional feature vectors may use kernel representations, okay - even when the data is maybe extremely high-dimensional and it seems like you'd be prone to overfitting, you can prove that, so long as the positive and negative examples are separated by a margin in this infinite-dimensional space, the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after seeing only a finite number of examples, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay?

So let's see. The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that I mean, you know, it won't appear on the midterm and you won't be asked about it specifically in the problem sets. But I thought it'd be - I know some of you were curious after the previous lecture about how you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand.
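For reference, the result in those notes has roughly the following classical form (stated here from memory, so check the notes for the exact conditions): if the labels are in {-1, +1}, every example satisfies \|x^{(i)}\| \le D, and some unit-length vector u separates the data with margin gamma, meaning y^{(i)}\,(u^{T} x^{(i)}) \ge \gamma for all i, then the total number of mistakes the online perceptron makes is at most

    \left( \frac{D}{\gamma} \right)^{2},

no matter how high-dimensional the features are and no matter how many examples you see.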
So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay? But regardless of the theoretical results, you know, the online learning setting is something that comes up reasonably often, and so these algorithms based on stochastic gradient descent often do very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today - and this will actually be the majority of today's lecture; can I switch to the PowerPoint slides, please - is I want to spend most of today's lecture talking about advice for applying different machine learning algorithms.

And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully, because, you know, it turns out that you can take the same machine learning tool - say logistic regression - and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right?
And so what I want to do today, in the rest of the time I have, is try to convey to you, you know, some of the methods for making sure you're one of the people who really knows how to get these learning algorithms to work well on problems.

So, just some caveats on what I'm gonna, I guess, talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, conceptually most difficult material in this class to understand. All right? So this is not mathematical, but this is not easy. And I want to add the caveat that some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything I say. And some of what I'll say is also not good advice for doing machine learning research either, so I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work. If you work in a company and you want to deliver a product, or you're, you know, building a system and you just want your machine learning system to work - okay? - some of what I'm about to say today isn't great advice if your goal is to invent a new machine learning algorithm, but this is advice for how to make a machine learning algorithm work and, you know, deploy a working system.

So, three key areas I'm gonna talk about. One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis. And third, I want to talk about just advice for how to get started on a machine learning problem.
And one theme that'll come up later: you've heard about premature optimization, right, in writing software. This is when someone over-designs from the start - when someone, you know, is writing a piece of code and they choose a subroutine to optimize heavily, and maybe they write the subroutine in assembly or something. Many of us have been guilty of premature optimization, where we're trying to get a piece of code to run faster, and we choose some piece of code, implement it in assembly, and really tune it to run really quickly - and it turns out that wasn't the bottleneck in the code at all. Right? We call that premature optimization. And in undergraduate programming classes we warn people all the time not to do premature optimization, and people still do it all the time. Right?

And it turns out a very similar thing happens in building machine learning systems: many people are often guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system and that turns out not to be the important piece. Okay? So I'll talk about that later as well.

So let's first talk about debugging learning algorithms. As a motivating example, let's say you want to build an anti-spam system, and let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right? So instead of using 50,000 words, you've chosen a small set of 100 features to use for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and you get 20 percent test error, which is unacceptably high. Right? So this is Bayesian logistic regression, and it's just like maximum likelihood but, you know, with that additional lambda times norm-of-theta-squared term. And we're maximizing rather than minimizing, so there's a minus lambda norm-of-theta squared instead of a plus lambda norm-of-theta squared.
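In symbols, the objective being maximized is something like

    \max_{\theta}\; \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}, \theta\big) \;-\; \lambda\, \|\theta\|^{2}.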
So the question is: you've implemented your Bayesian logistic regression algorithm, you've tested it on your test set, and you got unacceptably high error. So what do you do next? Right?

So, you know, one thing you could do is think about the ways you could improve this algorithm. And this is probably what most people will do, instead of saying, "Well, let's sit down and think about what could've gone wrong, and then we'll try to improve the algorithm." Well, obviously having more training data could only help, so one thing you can do is try to get more training examples. Maybe you suspect that even 100 features was too many, so you might try to get a smaller set of features. What's more common is you might suspect your features aren't good enough, so you might spend some time looking at the email headers to see if you can figure out better features for, you know, finding spam emails, or whatever. Right? So you sit around and come up with better features, such as from the email headers. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer to see if that works - and clearly that can't hurt, right, just running gradient descent longer.
Or maybe you remember, you know, hearing from class that maybe Newton's method converges better, so let's try that instead. You may want to tune the value of lambda, because you're not sure that was the right value. Or maybe you even want to try an SVM, because maybe you think an SVM might work better than logistic regression.

So I only listed eight things here, but you can imagine, if you were actually sitting down building a machine learning system, the options open to you are endless. You can think of, you know, hundreds of ways to improve a learning system. And some of these things - like, well, getting more training examples, surely that's gonna help - seem like a good use of your time. Right?

And it turns out that this approach of picking ways to improve the learning algorithm, picking one, and going for it might work, in the sense that it may eventually get you to a working system. But often it's very time-consuming, and I think it's often largely a matter of luck whether you end up fixing what the problem actually is. In particular, these eight improvements all fix very different problems, and some of them will be fixing problems that you don't have. And if you can rule out six of the eight - if, by somehow looking at the problem more deeply, you can figure out which one of these eight things is actually the right thing to do - you can save yourself a lot of time.

So let's see how we can go about doing that. The people in industry and in research that I see who are really good would not go and try to change the learning algorithm randomly.
There are lots of things that Dialogue: 0,0:32:05.49,0:32:08.11,Default,,0000,0000,0000,,obviously improve your learning algorithm, Dialogue: 0,0:32:08.11,0:32:12.46,Default,,0000,0000,0000,,but the problem is there are so many of them it's hard to know what to do. Dialogue: 0,0:32:12.46,0:32:16.59,Default,,0000,0000,0000,,So what you find is that all the really good ones run various diagnostics to figure out Dialogue: 0,0:32:16.59,0:32:18.01,Default,,0000,0000,0000,,what the problem is Dialogue: 0,0:32:18.01,0:32:21.61,Default,,0000,0000,0000,,and to check where they think the problem is. Okay? Dialogue: 0,0:32:21.61,0:32:23.83,Default,,0000,0000,0000,,So Dialogue: 0,0:32:23.83,0:32:27.31,Default,,0000,0000,0000,,for our motivating story, right, we said - let's say Bayesian logistic regression test Dialogue: 0,0:32:27.31,0:32:29.01,Default,,0000,0000,0000,,error was 20 percent, Dialogue: 0,0:32:29.01,0:32:32.02,Default,,0000,0000,0000,,which let's say is unacceptably high. Dialogue: 0,0:32:32.02,0:32:34.83,Default,,0000,0000,0000,,And let's suppose you suspected the problem is Dialogue: 0,0:32:34.83,0:32:36.44,Default,,0000,0000,0000,,either overfitting, Dialogue: 0,0:32:36.44,0:32:37.79,Default,,0000,0000,0000,,so it's high bias, Dialogue: 0,0:32:37.79,0:32:42.24,Default,,0000,0000,0000,,or you suspect that, you know, maybe you have too few features to classify spam, so there's - Oh excuse Dialogue: 0,0:32:42.24,0:32:45.22,Default,,0000,0000,0000,,me; I think I Dialogue: 0,0:32:45.22,0:32:46.62,Default,,0000,0000,0000,,wrote that wrong. Dialogue: 0,0:32:46.62,0:32:48.08,Default,,0000,0000,0000,,Let's firstly - so let's Dialogue: 0,0:32:48.08,0:32:49.22,Default,,0000,0000,0000,,forget - forget the tables. Dialogue: 0,0:32:49.22,0:32:52.84,Default,,0000,0000,0000,,Suppose you suspect the problem is either high bias or high variance, and some of the text Dialogue: 0,0:32:52.84,0:32:54.73,Default,,0000,0000,0000,,here Dialogue: 0,0:32:54.73,0:32:55.25,Default,,0000,0000,0000,,doesn't make sense. And Dialogue: 0,0:32:55.25,0:32:56.43,Default,,0000,0000,0000,,you want to know Dialogue: 0,0:32:56.43,0:33:00.85,Default,,0000,0000,0000,,if you're overfitting, which would be high variance, or you have too few Dialogue: 0,0:33:00.85,0:33:06.24,Default,,0000,0000,0000,,features to classify spam, which would be high bias. I had those two reversed, sorry. Okay? So Dialogue: 0,0:33:06.24,0:33:08.75,Default,,0000,0000,0000,,how can you figure out whether the problem Dialogue: 0,0:33:08.75,0:33:10.79,Default,,0000,0000,0000,,is one of high bias Dialogue: 0,0:33:10.79,0:33:15.61,Default,,0000,0000,0000,,or high variance? Right? So it turns Dialogue: 0,0:33:15.61,0:33:19.01,Default,,0000,0000,0000,,out there's a simple diagnostic you can look at that will tell you Dialogue: 0,0:33:19.01,0:33:24.15,Default,,0000,0000,0000,,whether the problem is high bias or high variance. If you Dialogue: 0,0:33:24.15,0:33:27.90,Default,,0000,0000,0000,,remember the cartoon we'd seen previously for high variance problems, when you have high Dialogue: 0,0:33:27.90,0:33:29.71,Default,,0000,0000,0000,,variance Dialogue: 0,0:33:29.71,0:33:33.28,Default,,0000,0000,0000,,the training error will be much lower than the test error. All right? When you Dialogue: 0,0:33:33.28,0:33:36.14,Default,,0000,0000,0000,,have a high variance problem, that's when you're fitting Dialogue: 0,0:33:36.14,0:33:39.48,Default,,0000,0000,0000,,your training set very well.
That's when you're fitting, you know, a tenth order polynomial to Dialogue: 0,0:33:39.48,0:33:41.65,Default,,0000,0000,0000,,11 data points. All right? And Dialogue: 0,0:33:41.65,0:33:44.67,Default,,0000,0000,0000,,that's when you're just fitting the data set very well, and so your training error will be Dialogue: 0,0:33:44.67,0:33:45.67,Default,,0000,0000,0000,,much lower than Dialogue: 0,0:33:45.67,0:33:47.64,Default,,0000,0000,0000,,your test Dialogue: 0,0:33:47.64,0:33:49.94,Default,,0000,0000,0000,,error. And in contrast, if you have high bias, Dialogue: 0,0:33:49.94,0:33:52.70,Default,,0000,0000,0000,,that's when your training error will also be high. Right? Dialogue: 0,0:33:52.70,0:33:56.45,Default,,0000,0000,0000,,That's when your data is quadratic, say, but you're fitting a linear function to it Dialogue: 0,0:33:56.45,0:34:02.29,Default,,0000,0000,0000,,and so you aren't even fitting your training set well. So Dialogue: 0,0:34:02.29,0:34:04.45,Default,,0000,0000,0000,,just in cartoons, I guess, Dialogue: 0,0:34:04.45,0:34:07.95,Default,,0000,0000,0000,,this is a - this is what a typical learning curve for high variance looks Dialogue: 0,0:34:07.95,0:34:09.34,Default,,0000,0000,0000,,like. Dialogue: 0,0:34:09.34,0:34:13.69,Default,,0000,0000,0000,,On your horizontal axis, I'm plotting the training set size M, right, Dialogue: 0,0:34:13.69,0:34:16.43,Default,,0000,0000,0000,,and on vertical axis, I'm plotting the error. Dialogue: 0,0:34:16.43,0:34:19.47,Default,,0000,0000,0000,,And so, let's see, Dialogue: 0,0:34:19.47,0:34:21.03,Default,,0000,0000,0000,,you know, as you increase - Dialogue: 0,0:34:21.03,0:34:25.12,Default,,0000,0000,0000,,if you have a high variance problem, you'll notice as the training set size, M, Dialogue: 0,0:34:25.12,0:34:29.22,Default,,0000,0000,0000,,increases, your test set error will keep on decreasing. Dialogue: 0,0:34:29.22,0:34:32.83,Default,,0000,0000,0000,,And so this sort of suggests that, well, if you can increase the training set size even Dialogue: 0,0:34:32.83,0:34:36.36,Default,,0000,0000,0000,,further, maybe if you extrapolate the green curve out, maybe Dialogue: 0,0:34:36.36,0:34:39.97,Default,,0000,0000,0000,,that test set error will decrease even further. All right? Dialogue: 0,0:34:39.97,0:34:43.40,Default,,0000,0000,0000,,Another thing that's useful to plot here is - let's say Dialogue: 0,0:34:43.40,0:34:46.54,Default,,0000,0000,0000,,the red horizontal line is the desired performance Dialogue: 0,0:34:46.54,0:34:50.26,Default,,0000,0000,0000,,you're trying to reach, another useful thing to plot is actually the training error. Right? Dialogue: 0,0:34:50.26,0:34:52.01,Default,,0000,0000,0000,,And it turns out that Dialogue: 0,0:34:52.01,0:34:59.01,Default,,0000,0000,0000,,your training error will actually grow as a function of the training set size Dialogue: 0,0:34:59.25,0:35:01.61,Default,,0000,0000,0000,,because the larger your training set, Dialogue: 0,0:35:01.61,0:35:03.62,Default,,0000,0000,0000,,the harder it is to fit, Dialogue: 0,0:35:03.62,0:35:06.15,Default,,0000,0000,0000,,you know, your training set perfectly. Right? Dialogue: 0,0:35:06.15,0:35:09.25,Default,,0000,0000,0000,,So this is just a cartoon, don't take it too seriously, but in general, your training error Dialogue: 0,0:35:09.25,0:35:11.42,Default,,0000,0000,0000,,will actually grow Dialogue: 0,0:35:11.42,0:35:15.08,Default,,0000,0000,0000,,as a function of your training set size. 
Because with small training sets, if you have one data point, Dialogue: 0,0:35:15.08,0:35:17.77,Default,,0000,0000,0000,,it's really easy to fit that perfectly, but if you have Dialogue: 0,0:35:17.77,0:35:22.10,Default,,0000,0000,0000,,10,000 data points, it's much harder to fit that perfectly. Dialogue: 0,0:35:22.10,0:35:23.15,Default,,0000,0000,0000,,All right? Dialogue: 0,0:35:23.15,0:35:27.96,Default,,0000,0000,0000,,And so another diagnostic for high variance, and the one that I tend to use more, Dialogue: 0,0:35:27.96,0:35:31.67,Default,,0000,0000,0000,,is to just look at training versus test error. And if there's a large gap between Dialogue: 0,0:35:31.67,0:35:32.79,Default,,0000,0000,0000,,them, Dialogue: 0,0:35:32.79,0:35:34.16,Default,,0000,0000,0000,,then this suggests that, you know, Dialogue: 0,0:35:34.16,0:35:39.63,Default,,0000,0000,0000,,getting more training data may allow you to help close that gap. Okay? Dialogue: 0,0:35:39.63,0:35:41.42,Default,,0000,0000,0000,,So this is Dialogue: 0,0:35:41.42,0:35:42.34,Default,,0000,0000,0000,,what the Dialogue: 0,0:35:42.34,0:35:45.06,Default,,0000,0000,0000,,cartoon would look like when - in the Dialogue: 0,0:35:45.06,0:35:49.20,Default,,0000,0000,0000,,case of high variance. Dialogue: 0,0:35:49.20,0:35:53.10,Default,,0000,0000,0000,,This is what the cartoon looks like for high bias. Right? If you Dialogue: 0,0:35:53.10,0:35:54.78,Default,,0000,0000,0000,,look at the learning curve, you Dialogue: 0,0:35:54.78,0:35:57.50,Default,,0000,0000,0000,,see that the curve for test error Dialogue: 0,0:35:57.50,0:36:01.42,Default,,0000,0000,0000,,has flattened out already. And so this is a sign that, Dialogue: 0,0:36:01.42,0:36:05.18,Default,,0000,0000,0000,,you know, if you get more training examples, if you extrapolate this curve Dialogue: 0,0:36:05.18,0:36:06.52,Default,,0000,0000,0000,,further to the right, Dialogue: 0,0:36:06.52,0:36:09.67,Default,,0000,0000,0000,,it's maybe not likely to go down much further. Dialogue: 0,0:36:09.67,0:36:12.47,Default,,0000,0000,0000,,And this is a property of high bias: that getting more training data won't Dialogue: 0,0:36:12.47,0:36:15.62,Default,,0000,0000,0000,,necessarily help. Dialogue: 0,0:36:15.62,0:36:18.100,Default,,0000,0000,0000,,But again, to me the more useful diagnostic is Dialogue: 0,0:36:18.100,0:36:20.30,Default,,0000,0000,0000,,if you plot Dialogue: 0,0:36:20.30,0:36:23.100,Default,,0000,0000,0000,,your training error as well - if you look at your training error as well as your, you know, Dialogue: 0,0:36:23.100,0:36:26.37,Default,,0000,0000,0000,,held-out test set error. Dialogue: 0,0:36:26.37,0:36:29.41,Default,,0000,0000,0000,,If you find that even your training error Dialogue: 0,0:36:29.41,0:36:31.53,Default,,0000,0000,0000,,is high, Dialogue: 0,0:36:31.53,0:36:34.78,Default,,0000,0000,0000,,then that's a sign that getting more training data is not Dialogue: 0,0:36:34.78,0:36:38.27,Default,,0000,0000,0000,,going to help. Right? Dialogue: 0,0:36:38.27,0:36:42.20,Default,,0000,0000,0000,,In fact, you know, think about it, Dialogue: 0,0:36:42.20,0:36:44.54,Default,,0000,0000,0000,,training error Dialogue: 0,0:36:44.54,0:36:48.09,Default,,0000,0000,0000,,grows as a function of your training set size.
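As a rough illustration of the learning-curve cartoon just described (this sketch is not from the lecture; it assumes a scikit-learn-style setup, and X, y, and the size grid are hypothetical placeholders), one might plot training and test error against the training set size m like this:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def plot_learning_curves(X, y, sizes):
        # Hold out a fixed test set, then retrain on progressively larger
        # slices of the training data and record both errors.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        train_err, test_err = [], []
        for m in sizes:
            clf = LogisticRegression(max_iter=1000).fit(X_tr[:m], y_tr[:m])
            train_err.append(1 - clf.score(X_tr[:m], y_tr[:m]))  # error on the m examples trained on
            test_err.append(1 - clf.score(X_te, y_te))           # error on the held-out test set
        plt.plot(sizes, train_err, label="training error")
        plt.plot(sizes, test_err, label="test error")
        plt.xlabel("training set size m")
        plt.ylabel("error")
        plt.legend()
        plt.show()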
Dialogue: 0,0:36:48.09,0:36:50.45,Default,,0000,0000,0000,,And so if your Dialogue: 0,0:36:50.45,0:36:55.57,Default,,0000,0000,0000,,training error is already above your level of desired performance, Dialogue: 0,0:36:55.57,0:36:56.60,Default,,0000,0000,0000,,then Dialogue: 0,0:36:56.60,0:37:00.79,Default,,0000,0000,0000,,getting even more training data is not going to reduce your training Dialogue: 0,0:37:00.79,0:37:03.01,Default,,0000,0000,0000,,error down to the desired level of performance. Right? Dialogue: 0,0:37:03.01,0:37:06.47,Default,,0000,0000,0000,,Because, you know, your training error sort of only gets worse as you get more and more training Dialogue: 0,0:37:06.47,0:37:07.55,Default,,0000,0000,0000,,examples. Dialogue: 0,0:37:07.55,0:37:10.80,Default,,0000,0000,0000,,So if you extrapolate further to the right, it's not like this blue line will come Dialogue: 0,0:37:10.80,0:37:13.40,Default,,0000,0000,0000,,back down to the level of desired performance. Right? This will stay up Dialogue: 0,0:37:13.40,0:37:17.48,Default,,0000,0000,0000,,there. Okay? So for Dialogue: 0,0:37:17.48,0:37:21.34,Default,,0000,0000,0000,,me personally, I actually, when looking at a curve like the green Dialogue: 0,0:37:21.34,0:37:25.38,Default,,0000,0000,0000,,curve on test error, I actually personally tend to find it very difficult to tell Dialogue: 0,0:37:25.38,0:37:29.00,Default,,0000,0000,0000,,if the curve is still going down or if it's [inaudible]. Sometimes you can tell, but very Dialogue: 0,0:37:29.00,0:37:31.01,Default,,0000,0000,0000,,often, it's somewhat Dialogue: 0,0:37:31.01,0:37:32.90,Default,,0000,0000,0000,,ambiguous. So for me personally, Dialogue: 0,0:37:32.90,0:37:37.13,Default,,0000,0000,0000,,the diagnostic I tend to use the most often to tell if I have a bias problem or a variance Dialogue: 0,0:37:37.13,0:37:37.86,Default,,0000,0000,0000,,problem Dialogue: 0,0:37:37.86,0:37:41.32,Default,,0000,0000,0000,,is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay? And so, Dialogue: 0,0:37:41.32,0:37:45.42,Default,,0000,0000,0000,,going Dialogue: 0,0:37:45.42,0:37:47.13,Default,,0000,0000,0000,,back to Dialogue: 0,0:37:47.13,0:37:52.40,Default,,0000,0000,0000,,the list of fixes, look Dialogue: 0,0:37:52.40,0:37:54.11,Default,,0000,0000,0000,,at the first fix, Dialogue: 0,0:37:54.11,0:37:56.34,Default,,0000,0000,0000,,getting more training examples Dialogue: 0,0:37:56.34,0:37:58.65,Default,,0000,0000,0000,,is a way to fix high variance. Dialogue: 0,0:37:58.65,0:38:02.75,Default,,0000,0000,0000,,Right? If you have a high variance problem, getting more training examples will help. Dialogue: 0,0:38:02.75,0:38:05.53,Default,,0000,0000,0000,,Trying a smaller set of features: Dialogue: 0,0:38:05.53,0:38:11.76,Default,,0000,0000,0000,,that also fixes high variance. All right? Dialogue: 0,0:38:11.76,0:38:15.87,Default,,0000,0000,0000,,Trying a larger set of features or adding email features, these Dialogue: 0,0:38:15.87,0:38:20.15,Default,,0000,0000,0000,,are solutions that fix high bias. Right? Dialogue: 0,0:38:20.15,0:38:26.77,Default,,0000,0000,0000,,So high bias being if you're hypothesis was too simple, you didn't have enough features. Okay? 
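A minimal sketch of that decision rule (the gap threshold and the error numbers below are made-up illustrations; only the logic mirrors the diagnostic being described):

    def diagnose(train_err, test_err, desired_err, gap_threshold=0.05):
        # Large gap between training and test error: looks like high variance.
        if test_err - train_err > gap_threshold:
            return "high variance - more data or a smaller feature set may help"
        # Training error itself already misses the target: looks like high bias.
        if train_err > desired_err:
            return "high bias - better or larger features may help; more data alone will not"
        return "roughly at desired performance"

    print(diagnose(train_err=0.01, test_err=0.20, desired_err=0.05))  # suggests high variance
    print(diagnose(train_err=0.18, test_err=0.20, desired_err=0.05))  # suggests high bias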
Dialogue: 0,0:38:26.77,0:38:29.07,Default,,0000,0000,0000,,And so Dialogue: 0,0:38:29.07,0:38:33.58,Default,,0000,0000,0000,,quite often you see people working on machine learning problems Dialogue: 0,0:38:33.58,0:38:34.59,Default,,0000,0000,0000,,and Dialogue: 0,0:38:34.59,0:38:37.57,Default,,0000,0000,0000,,they'll remember that getting more training examples helps. And Dialogue: 0,0:38:37.57,0:38:41.12,Default,,0000,0000,0000,,so, they'll build a learning system, build an anti-spam system and it doesn't work. Dialogue: 0,0:38:41.12,0:38:42.23,Default,,0000,0000,0000,,And then they Dialogue: 0,0:38:42.23,0:38:45.100,Default,,0000,0000,0000,,go off and spend lots of time and money and effort collecting more training data Dialogue: 0,0:38:45.100,0:38:50.51,Default,,0000,0000,0000,,because they'll say, "Oh well, getting more data's obviously got to help." Dialogue: 0,0:38:50.51,0:38:53.32,Default,,0000,0000,0000,,But if they had a high bias problem in the first place, and not a high variance Dialogue: 0,0:38:53.32,0:38:54.89,Default,,0000,0000,0000,,problem, Dialogue: 0,0:38:54.89,0:38:56.77,Default,,0000,0000,0000,,it's entirely possible to spend Dialogue: 0,0:38:56.77,0:39:00.15,Default,,0000,0000,0000,,three months or six months collecting more and more training data, Dialogue: 0,0:39:00.15,0:39:04.100,Default,,0000,0000,0000,,not realizing that it couldn't possibly help. Right? Dialogue: 0,0:39:04.100,0:39:07.62,Default,,0000,0000,0000,,And so, this actually happens a lot in, you Dialogue: 0,0:39:07.62,0:39:12.41,Default,,0000,0000,0000,,know, in Silicon Valley and companies, this happens a lot. There will often Dialogue: 0,0:39:12.41,0:39:15.33,Default,,0000,0000,0000,,people building various machine learning systems, and Dialogue: 0,0:39:15.33,0:39:19.52,Default,,0000,0000,0000,,they'll often - you often see people spending six months working on fixing a Dialogue: 0,0:39:19.52,0:39:20.100,Default,,0000,0000,0000,,learning algorithm Dialogue: 0,0:39:20.100,0:39:23.94,Default,,0000,0000,0000,,and you could've told them six months ago that, you know, Dialogue: 0,0:39:23.94,0:39:27.21,Default,,0000,0000,0000,,that couldn't possibly have helped. But because they didn't know what the Dialogue: 0,0:39:27.21,0:39:28.71,Default,,0000,0000,0000,,problem was, and Dialogue: 0,0:39:28.71,0:39:33.55,Default,,0000,0000,0000,,they'd easily spend six months trying to invent new features or something. And Dialogue: 0,0:39:33.55,0:39:37.81,Default,,0000,0000,0000,,this is - you see this surprisingly often and this is somewhat depressing. You could've gone to them and Dialogue: 0,0:39:37.81,0:39:42.29,Default,,0000,0000,0000,,told them, "I could've told you six months ago that this was not going to help." And Dialogue: 0,0:39:42.29,0:39:46.15,Default,,0000,0000,0000,,the six months is not a joke, you actually see Dialogue: 0,0:39:46.15,0:39:47.71,Default,,0000,0000,0000,,this. Dialogue: 0,0:39:47.71,0:39:49.51,Default,,0000,0000,0000,,And in contrast, if you Dialogue: 0,0:39:49.51,0:39:53.05,Default,,0000,0000,0000,,actually figure out the problem's one of high bias or high variance, then Dialogue: 0,0:39:53.05,0:39:54.30,Default,,0000,0000,0000,,you can rule out Dialogue: 0,0:39:54.30,0:39:55.80,Default,,0000,0000,0000,,two of these solutions and Dialogue: 0,0:39:55.80,0:40:00.78,Default,,0000,0000,0000,,save yourself many months of fruitless effort. Okay? I actually Dialogue: 0,0:40:00.78,0:40:03.71,Default,,0000,0000,0000,,want to talk about these four at the bottom as well. 
But before I move on, let me Dialogue: 0,0:40:03.71,0:40:05.32,Default,,0000,0000,0000,,just check if there were questions about what I've talked Dialogue: 0,0:40:05.32,0:40:12.32,Default,,0000,0000,0000,,about so far. No? Okay, great. So bias Dialogue: 0,0:40:20.21,0:40:23.22,Default,,0000,0000,0000,,versus variance is one thing that comes up Dialogue: 0,0:40:23.22,0:40:29.54,Default,,0000,0000,0000,,often. This bias versus variance is one common diagnostic. And so, Dialogue: 0,0:40:29.54,0:40:33.18,Default,,0000,0000,0000,,for other machine learning problems, it's often up to your own ingenuity to figure out Dialogue: 0,0:40:33.18,0:40:35.70,Default,,0000,0000,0000,,your own diagnostics to figure out what's wrong. All right? Dialogue: 0,0:40:35.70,0:40:41.23,Default,,0000,0000,0000,,So if a machine-learning algorithm isn't working, very often it's up to you to figure out, you Dialogue: 0,0:40:41.23,0:40:44.30,Default,,0000,0000,0000,,know, to construct your own tests. Like do you look at the difference between training and Dialogue: 0,0:40:44.30,0:40:46.50,Default,,0000,0000,0000,,test errors or do you look at something else? Dialogue: 0,0:40:46.50,0:40:49.93,Default,,0000,0000,0000,,It's often up to your own ingenuity to construct your own diagnostics to figure out what's Dialogue: 0,0:40:49.93,0:40:52.59,Default,,0000,0000,0000,,going on. Dialogue: 0,0:40:52.59,0:40:55.03,Default,,0000,0000,0000,,What I want to do is go through another example. All right? Dialogue: 0,0:40:55.03,0:40:58.89,Default,,0000,0000,0000,,And this one is slightly more contrived but it'll illustrate another Dialogue: 0,0:40:58.89,0:41:02.77,Default,,0000,0000,0000,,common question that comes up, another one of the most common Dialogue: 0,0:41:02.77,0:41:04.75,Default,,0000,0000,0000,,issues that comes up in applying Dialogue: 0,0:41:04.75,0:41:06.09,Default,,0000,0000,0000,,learning algorithms. Dialogue: 0,0:41:06.09,0:41:08.32,Default,,0000,0000,0000,,So in this example, it's slightly more contrived, Dialogue: 0,0:41:08.32,0:41:11.58,Default,,0000,0000,0000,,let's say you implement Bayesian logistic regression Dialogue: 0,0:41:11.58,0:41:17.55,Default,,0000,0000,0000,,and you get 2 percent error on spam mail and 2 percent error on non-spam mail. Right? So Dialogue: 0,0:41:17.55,0:41:19.15,Default,,0000,0000,0000,,it's rejecting, you know, Dialogue: 0,0:41:19.15,0:41:21.45,Default,,0000,0000,0000,,2 percent of - Dialogue: 0,0:41:21.45,0:41:25.18,Default,,0000,0000,0000,,it's rejecting 98 percent of your spam mail, which is fine, so 2 percent of all Dialogue: 0,0:41:25.18,0:41:26.96,Default,,0000,0000,0000,,spam gets Dialogue: 0,0:41:26.96,0:41:30.66,Default,,0000,0000,0000,,through which is fine, but it's also rejecting 2 percent of your good email, Dialogue: 0,0:41:30.66,0:41:35.49,Default,,0000,0000,0000,,2 percent of the email from your friends and that's unacceptably high, let's Dialogue: 0,0:41:35.49,0:41:36.91,Default,,0000,0000,0000,,say. Dialogue: 0,0:41:36.91,0:41:39.01,Default,,0000,0000,0000,,And let's say that Dialogue: 0,0:41:39.01,0:41:41.90,Default,,0000,0000,0000,,a support vector machine using a linear kernel Dialogue: 0,0:41:41.90,0:41:44.83,Default,,0000,0000,0000,,gets 10 percent error on spam and Dialogue: 0,0:41:44.83,0:41:49.07,Default,,0000,0000,0000,,0.01 percent error on non-spam, which is more like the acceptable performance you want. And let's say for the sake of this Dialogue: 0,0:41:49.07,0:41:53.36,Default,,0000,0000,0000,,example, let's say you're trying to build an anti-spam system. Right?
Dialogue: 0,0:41:53.36,0:41:56.17,Default,,0000,0000,0000,,Let's say that you really want to deploy Dialogue: 0,0:41:56.17,0:41:57.68,Default,,0000,0000,0000,,logistic regression Dialogue: 0,0:41:57.68,0:42:01.21,Default,,0000,0000,0000,,to your customers because of computational efficiency or because you need Dialogue: 0,0:42:01.21,0:42:03.39,Default,,0000,0000,0000,,to retrain overnight every day, Dialogue: 0,0:42:03.39,0:42:07.32,Default,,0000,0000,0000,,and because logistic regression just runs more easily and more quickly or something. Okay? So let's Dialogue: 0,0:42:07.32,0:42:08.67,Default,,0000,0000,0000,,say you want to deploy logistic Dialogue: 0,0:42:08.67,0:42:12.65,Default,,0000,0000,0000,,regression, but it's just not working out well. So the Dialogue: 0,0:42:12.65,0:42:17.61,Default,,0000,0000,0000,,question is: What do you do next? So it Dialogue: 0,0:42:17.61,0:42:18.83,Default,,0000,0000,0000,,turns out that this - Dialogue: 0,0:42:18.83,0:42:22.32,Default,,0000,0000,0000,,the issue that comes up here, the other common question that Dialogue: 0,0:42:22.32,0:42:24.89,Default,,0000,0000,0000,,comes up is Dialogue: 0,0:42:24.89,0:42:30.19,Default,,0000,0000,0000,,a question of is the algorithm converging. So you might suspect that maybe Dialogue: 0,0:42:30.19,0:42:33.30,Default,,0000,0000,0000,,the problem with logistic regression is that it's just not converging. Dialogue: 0,0:42:33.30,0:42:36.31,Default,,0000,0000,0000,,Maybe you need to run more iterations. And Dialogue: 0,0:42:36.31,0:42:37.76,Default,,0000,0000,0000,,it Dialogue: 0,0:42:37.76,0:42:40.36,Default,,0000,0000,0000,,turns out that, again if you look at the optimization objective, say, Dialogue: 0,0:42:40.36,0:42:43.71,Default,,0000,0000,0000,,logistic regression is, let's say, optimizing J Dialogue: 0,0:42:43.71,0:42:46.73,Default,,0000,0000,0000,,of theta, it actually turns out that if you look at your objective as a function of the number Dialogue: 0,0:42:46.73,0:42:51.81,Default,,0000,0000,0000,,of iterations, when you look Dialogue: 0,0:42:51.81,0:42:55.01,Default,,0000,0000,0000,,at this curve, you know, it sort of looks like it's going up but it sort of Dialogue: 0,0:42:55.01,0:42:57.63,Default,,0000,0000,0000,,looks like there's an asymptote. And Dialogue: 0,0:42:57.63,0:43:00.95,Default,,0000,0000,0000,,when you look at these curves, it's often very hard to tell Dialogue: 0,0:43:00.95,0:43:03.73,Default,,0000,0000,0000,,if the curve has already flattened out. All right? And you look at these Dialogue: 0,0:43:03.73,0:43:05.98,Default,,0000,0000,0000,,curves a lot so you can ask: Dialogue: 0,0:43:05.98,0:43:08.23,Default,,0000,0000,0000,,Well has the algorithm converged? When you look at the J of theta like this, it's Dialogue: 0,0:43:08.23,0:43:10.33,Default,,0000,0000,0000,,often hard to tell. Dialogue: 0,0:43:10.33,0:43:14.15,Default,,0000,0000,0000,,You can run this ten times as long and see if it's flattened out. And you can run this ten Dialogue: 0,0:43:14.15,0:43:21.08,Default,,0000,0000,0000,,times as long and it'll often still look like maybe it's going up very slowly, or something. Right? Dialogue: 0,0:43:21.08,0:43:24.92,Default,,0000,0000,0000,,So you'd like a better diagnostic for whether logistic regression has converged than just Dialogue: 0,0:43:24.92,0:43:28.81,Default,,0000,0000,0000,,looking at this curve.
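If you do record the objective after every iteration (say in a list called J_history, a hypothetical name), the curve being described is simply:

    import matplotlib.pyplot as plt

    # J_history is assumed to hold J(theta) after each gradient-descent iteration.
    # The point being made is that this curve is often too ambiguous to judge
    # convergence by eye, which is why a better diagnostic is wanted.
    plt.plot(range(len(J_history)), J_history)
    plt.xlabel("iteration")
    plt.ylabel("J(theta)")
    plt.show()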
Dialogue: 0,0:43:28.81,0:43:32.09,Default,,0000,0000,0000,,The other question you might wonder - the other thing you might Dialogue: 0,0:43:32.09,0:43:36.71,Default,,0000,0000,0000,,suspect is a problem is: are you optimizing the right function? Dialogue: 0,0:43:36.71,0:43:38.92,Default,,0000,0000,0000,,So Dialogue: 0,0:43:38.92,0:43:40.60,Default,,0000,0000,0000,,what you care about, Dialogue: 0,0:43:40.60,0:43:42.88,Default,,0000,0000,0000,,right, in spam, say, Dialogue: 0,0:43:42.88,0:43:44.26,Default,,0000,0000,0000,,is a Dialogue: 0,0:43:44.26,0:43:47.50,Default,,0000,0000,0000,,weighted accuracy function like that. So A of theta is, Dialogue: 0,0:43:47.50,0:43:49.19,Default,,0000,0000,0000,,you know, sum over your Dialogue: 0,0:43:49.19,0:43:52.25,Default,,0000,0000,0000,,examples of some weights times whether you got it right. Dialogue: 0,0:43:52.25,0:43:56.81,Default,,0000,0000,0000,,And so the weight may be higher for non-spam than for spam mail because you care Dialogue: 0,0:43:56.81,0:43:57.71,Default,,0000,0000,0000,,about getting Dialogue: 0,0:43:57.71,0:44:01.47,Default,,0000,0000,0000,,your predictions correct on non-spam email much more than on spam email, say. So let's Dialogue: 0,0:44:01.47,0:44:02.36,Default,,0000,0000,0000,, Dialogue: 0,0:44:02.36,0:44:05.47,Default,,0000,0000,0000,,say A of theta Dialogue: 0,0:44:05.47,0:44:10.82,Default,,0000,0000,0000,,is the optimization objective that you really care about, but what Bayesian logistic regression does is Dialogue: 0,0:44:10.82,0:44:15.40,Default,,0000,0000,0000,,optimize a quantity like that. Right? It's this Dialogue: 0,0:44:15.40,0:44:17.69,Default,,0000,0000,0000,,sort of maximum likelihood thing Dialogue: 0,0:44:17.69,0:44:19.38,Default,,0000,0000,0000,,and then with this Dialogue: 0,0:44:19.38,0:44:20.85,Default,,0000,0000,0000,,two-norm, you know, Dialogue: 0,0:44:20.85,0:44:22.78,Default,,0000,0000,0000,,penalty thing that we saw previously. And you Dialogue: 0,0:44:22.78,0:44:26.50,Default,,0000,0000,0000,,might be wondering: Is this the right optimization function to be optimizing? Dialogue: 0,0:44:26.50,0:44:30.95,Default,,0000,0000,0000,,Okay? Or do I maybe need to change the value for lambda, Dialogue: 0,0:44:30.95,0:44:33.90,Default,,0000,0000,0000,,to change this parameter? Or: Dialogue: 0,0:44:33.90,0:44:39.82,Default,,0000,0000,0000,,Should I maybe really be switching to the support vector machine optimization objective? Dialogue: 0,0:44:39.82,0:44:42.13,Default,,0000,0000,0000,,Okay? Does that make sense? So Dialogue: 0,0:44:42.13,0:44:44.49,Default,,0000,0000,0000,,the second diagnostic I'm gonna talk about Dialogue: 0,0:44:44.49,0:44:46.99,Default,,0000,0000,0000,,is let's say you want to figure out: Dialogue: 0,0:44:46.99,0:44:50.61,Default,,0000,0000,0000,,is the algorithm converging, is the optimization algorithm converging, or Dialogue: 0,0:44:50.61,0:44:51.90,Default,,0000,0000,0000,,is the problem with Dialogue: 0,0:44:51.90,0:44:57.75,Default,,0000,0000,0000,,the optimization objective I chose in the first place? Okay? Dialogue: 0,0:44:57.75,0:45:02.82,Default,,0000,0000,0000,,So here's Dialogue: 0,0:45:02.82,0:45:07.33,Default,,0000,0000,0000,,the diagnostic you can use. Let me let - right.
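To make the two objectives concrete, here is a hedged sketch of the weighted accuracy A(theta) and the Bayesian-logistic-regression objective J(theta) being contrasted; the weights w, the regularization strength lam, and the linear decision rule are illustrative assumptions, not anything the lecture pins down:

    import numpy as np

    def weighted_accuracy(theta, X, y, w):
        # A(theta) = sum_i w_i * 1{prediction on example i equals y_i};
        # the weights w_i might be chosen larger for non-spam examples than for spam.
        preds = (X @ theta > 0).astype(int)
        return np.sum(w * (preds == y))

    def bayesian_lr_objective(theta, X, y, lam):
        # J(theta) = logistic-regression log-likelihood minus lambda * ||theta||^2,
        # i.e. the maximum-likelihood term with the two-norm penalty discussed earlier.
        z = X @ theta
        log_lik = np.sum(y * z - np.logaddexp(0, z))
        return log_lik - lam * np.dot(theta, theta)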
So to Dialogue: 0,0:45:07.33,0:45:11.03,Default,,0000,0000,0000,,just reiterate the story, right, let's say an SVM outperforms Bayesian Dialogue: 0,0:45:11.03,0:45:13.52,Default,,0000,0000,0000,,logistic regression but you really want to deploy Dialogue: 0,0:45:13.52,0:45:16.76,Default,,0000,0000,0000,,Bayesian logistic regression to your problem. Let me Dialogue: 0,0:45:16.76,0:45:19.05,Default,,0000,0000,0000,,let theta subscript SVM, be the Dialogue: 0,0:45:19.05,0:45:21.67,Default,,0000,0000,0000,,parameters learned by an SVM, Dialogue: 0,0:45:21.67,0:45:25.26,Default,,0000,0000,0000,,and I'll let theta subscript BLR be the parameters learned by Bayesian Dialogue: 0,0:45:25.26,0:45:28.05,Default,,0000,0000,0000,,logistic regression. Dialogue: 0,0:45:28.05,0:45:32.48,Default,,0000,0000,0000,,So the optimization objective you care about is this, you know, weighted accuracy Dialogue: 0,0:45:32.48,0:45:35.08,Default,,0000,0000,0000,,criteria that I talked about just now. Dialogue: 0,0:45:35.08,0:45:37.86,Default,,0000,0000,0000,,And Dialogue: 0,0:45:37.86,0:45:41.74,Default,,0000,0000,0000,,the support vector machine outperforms Bayesian logistic regression. And so, you know, Dialogue: 0,0:45:41.74,0:45:44.97,Default,,0000,0000,0000,,the weighted accuracy on the supportvector-machine parameters Dialogue: 0,0:45:44.97,0:45:46.97,Default,,0000,0000,0000,,is better than the weighted accuracy Dialogue: 0,0:45:46.97,0:45:50.18,Default,,0000,0000,0000,,for Bayesian logistic regression. Dialogue: 0,0:45:50.18,0:45:53.93,Default,,0000,0000,0000,,So Dialogue: 0,0:45:53.93,0:45:57.04,Default,,0000,0000,0000,,further, Bayesian logistic regression tries to optimize Dialogue: 0,0:45:57.04,0:45:59.41,Default,,0000,0000,0000,,an optimization objective like that, which I Dialogue: 0,0:45:59.41,0:46:02.27,Default,,0000,0000,0000,,denoted J theta. Dialogue: 0,0:46:02.27,0:46:05.84,Default,,0000,0000,0000,,And so, the diagnostic I choose to use is Dialogue: 0,0:46:05.84,0:46:08.43,Default,,0000,0000,0000,,to see if J of SVM Dialogue: 0,0:46:08.43,0:46:12.27,Default,,0000,0000,0000,,is bigger-than or less-than J of BLR. Okay? Dialogue: 0,0:46:12.27,0:46:14.61,Default,,0000,0000,0000,,So I explain this on the next slide. Dialogue: 0,0:46:14.61,0:46:15.57,Default,,0000,0000,0000,,So Dialogue: 0,0:46:15.57,0:46:19.53,Default,,0000,0000,0000,,we know two facts. We know that - well we know one fact. We know that a weighted Dialogue: 0,0:46:19.53,0:46:20.52,Default,,0000,0000,0000,,accuracy Dialogue: 0,0:46:20.52,0:46:23.16,Default,,0000,0000,0000,,of support vector machine, right, Dialogue: 0,0:46:23.16,0:46:24.48,Default,,0000,0000,0000,,is bigger than Dialogue: 0,0:46:24.48,0:46:28.86,Default,,0000,0000,0000,,this weighted accuracy of Bayesian logistic regression. So Dialogue: 0,0:46:28.86,0:46:32.21,Default,,0000,0000,0000,,in order for me to figure out whether Bayesian logistic regression is converging, Dialogue: 0,0:46:32.21,0:46:35.38,Default,,0000,0000,0000,,or whether I'm just optimizing the wrong objective function, Dialogue: 0,0:46:35.38,0:46:41.06,Default,,0000,0000,0000,,the diagnostic I'm gonna use and I'm gonna check if this equality hold through. Okay? Dialogue: 0,0:46:41.06,0:46:43.55,Default,,0000,0000,0000,,So let me explain this, Dialogue: 0,0:46:43.55,0:46:44.77,Default,,0000,0000,0000,,so in Case 1, Dialogue: 0,0:46:44.77,0:46:46.03,Default,,0000,0000,0000,,right, Dialogue: 0,0:46:46.03,0:46:48.32,Default,,0000,0000,0000,,it's just those two equations copied over. 
Dialogue: 0,0:46:48.32,0:46:50.49,Default,,0000,0000,0000,,In Case 1, let's say that Dialogue: 0,0:46:50.49,0:46:54.59,Default,,0000,0000,0000,,J of SVM is, indeed, greater than J of BLR - or J of Dialogue: 0,0:46:54.59,0:47:01.17,Default,,0000,0000,0000,,theta SVM is greater than J of theta BLR. But Dialogue: 0,0:47:01.17,0:47:04.44,Default,,0000,0000,0000,,we know that Bayesian logistic regression Dialogue: 0,0:47:04.44,0:47:07.52,Default,,0000,0000,0000,,was trying to maximize J of theta; Dialogue: 0,0:47:07.52,0:47:08.87,Default,,0000,0000,0000,,that's the definition of Dialogue: 0,0:47:08.87,0:47:12.36,Default,,0000,0000,0000,,Bayesian logistic regression. Dialogue: 0,0:47:12.36,0:47:16.76,Default,,0000,0000,0000,,So this means that Dialogue: 0,0:47:16.76,0:47:17.60,Default,,0000,0000,0000,,theta - Dialogue: 0,0:47:17.60,0:47:22.03,Default,,0000,0000,0000,,the value of theta output by Bayesian logistic regression actually fails to Dialogue: 0,0:47:22.03,0:47:24.21,Default,,0000,0000,0000,,maximize J Dialogue: 0,0:47:24.21,0:47:27.31,Default,,0000,0000,0000,,because the support vector machine actually returned a value of theta that, Dialogue: 0,0:47:27.31,0:47:28.72,Default,,0000,0000,0000,,you know, does a Dialogue: 0,0:47:28.72,0:47:31.35,Default,,0000,0000,0000,,better job of maximizing J. Dialogue: 0,0:47:31.35,0:47:36.51,Default,,0000,0000,0000,,And so, this tells me that Bayesian logistic regression didn't actually maximize J Dialogue: 0,0:47:36.51,0:47:39.32,Default,,0000,0000,0000,,correctly, and so the problem is with Dialogue: 0,0:47:39.32,0:47:41.10,Default,,0000,0000,0000,,the optimization algorithm. The Dialogue: 0,0:47:41.10,0:47:45.27,Default,,0000,0000,0000,,optimization algorithm hasn't converged. The other Dialogue: 0,0:47:45.27,0:47:46.10,Default,,0000,0000,0000,,case Dialogue: 0,0:47:46.10,0:47:49.89,Default,,0000,0000,0000,,is as follows, where Dialogue: 0,0:47:49.89,0:47:52.58,Default,,0000,0000,0000,,J of theta SVM is less-than/equal to J of theta Dialogue: 0,0:47:52.58,0:47:55.72,Default,,0000,0000,0000,,BLR. Okay? Dialogue: 0,0:47:55.72,0:47:58.39,Default,,0000,0000,0000,,In this case, what does Dialogue: 0,0:47:58.39,0:47:59.14,Default,,0000,0000,0000,,that mean? Dialogue: 0,0:47:59.14,0:48:01.85,Default,,0000,0000,0000,,This means that Bayesian logistic regression Dialogue: 0,0:48:01.85,0:48:04.60,Default,,0000,0000,0000,,actually attains the higher value Dialogue: 0,0:48:04.60,0:48:07.29,Default,,0000,0000,0000,,for the optimization objective J Dialogue: 0,0:48:07.29,0:48:10.93,Default,,0000,0000,0000,,than does the support vector machine. Dialogue: 0,0:48:10.93,0:48:13.16,Default,,0000,0000,0000,,The support vector machine, Dialogue: 0,0:48:13.16,0:48:14.97,Default,,0000,0000,0000,,which does worse Dialogue: 0,0:48:14.97,0:48:17.67,Default,,0000,0000,0000,,on your optimization problem, Dialogue: 0,0:48:17.67,0:48:19.20,Default,,0000,0000,0000,,actually does better Dialogue: 0,0:48:19.20,0:48:24.33,Default,,0000,0000,0000,,on the weighted accuracy measure. Dialogue: 0,0:48:24.33,0:48:27.100,Default,,0000,0000,0000,,So what this means is that something that does worse on your optimization Dialogue: 0,0:48:27.100,0:48:28.79,Default,,0000,0000,0000,,objective, Dialogue: 0,0:48:28.79,0:48:29.79,Default,,0000,0000,0000,,on J, Dialogue: 0,0:48:29.79,0:48:31.43,Default,,0000,0000,0000,,can actually do better Dialogue: 0,0:48:31.43,0:48:34.04,Default,,0000,0000,0000,,on the weighted accuracy objective.
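Putting the two cases together, a minimal sketch of this diagnostic (reusing the hypothetical helpers sketched above, and assuming theta_svm and theta_blr have already been fit):

    a_svm = weighted_accuracy(theta_svm, X, y, w)
    a_blr = weighted_accuracy(theta_blr, X, y, w)
    J_svm = bayesian_lr_objective(theta_svm, X, y, lam)
    J_blr = bayesian_lr_objective(theta_blr, X, y, lam)

    assert a_svm > a_blr  # the premise of the story: the SVM does better on what you care about

    if J_svm > J_blr:
        # Case 1: BLR was supposed to maximize J, yet the SVM found a theta with higher J,
        # so the optimization algorithm (e.g. gradient descent) hasn't converged.
        print("problem looks like the optimization algorithm")
    else:
        # Case 2: BLR maximizes J at least as well but still loses on weighted accuracy,
        # so maximizing J doesn't correspond to what you care about - revisit the objective
        # (e.g. the value of lambda, or the choice of objective altogether).
        print("problem looks like the optimization objective J(theta)")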
Dialogue: 0,0:48:34.04,0:48:37.11,Default,,0000,0000,0000,,And this really means that maximizing Dialogue: 0,0:48:37.11,0:48:38.37,Default,,0000,0000,0000,,J of theta, Dialogue: 0,0:48:38.37,0:48:42.06,Default,,0000,0000,0000,,you know, doesn't really correspond that well to maximizing your weighted accuracy criterion. Dialogue: 0,0:48:42.06,0:48:43.43,Default,,0000,0000,0000,, Dialogue: 0,0:48:43.43,0:48:47.36,Default,,0000,0000,0000,,And therefore, this tells you that J of theta is maybe the wrong optimization Dialogue: 0,0:48:47.36,0:48:49.65,Default,,0000,0000,0000,,objective to be maximizing. Right? Dialogue: 0,0:48:49.65,0:48:51.16,Default,,0000,0000,0000,,That just maximizing J of Dialogue: 0,0:48:51.16,0:48:53.15,Default,,0000,0000,0000,,theta just wasn't a good objective Dialogue: 0,0:48:53.15,0:49:00.15,Default,,0000,0000,0000,,to be choosing if you care about the weighted accuracy. Okay? Can you Dialogue: 0,0:49:02.67,0:49:03.46,Default,,0000,0000,0000,,raise your hand Dialogue: 0,0:49:03.46,0:49:09.99,Default,,0000,0000,0000,,if this made sense? Dialogue: 0,0:49:09.99,0:49:11.52,Default,,0000,0000,0000,,Cool, good. So Dialogue: 0,0:49:11.52,0:49:16.83,Default,,0000,0000,0000,,that tells us whether the problem is with the optimization algorithm Dialogue: 0,0:49:16.83,0:49:19.38,Default,,0000,0000,0000,,or whether it's with the objective function. Dialogue: 0,0:49:19.38,0:49:21.01,Default,,0000,0000,0000,,And so going back to this Dialogue: 0,0:49:21.01,0:49:23.15,Default,,0000,0000,0000,,slide, the eight fixes we had, Dialogue: 0,0:49:23.15,0:49:24.18,Default,,0000,0000,0000,,you notice that if you Dialogue: 0,0:49:24.18,0:49:27.17,Default,,0000,0000,0000,,run gradient descent for more iterations Dialogue: 0,0:49:27.17,0:49:31.02,Default,,0000,0000,0000,,that fixes the optimization algorithm. Trying Newton's method also Dialogue: 0,0:49:31.02,0:49:33.26,Default,,0000,0000,0000,,fixes the optimization algorithm, Dialogue: 0,0:49:33.26,0:49:37.29,Default,,0000,0000,0000,,whereas using a different value for lambda, in that lambda times norm of theta Dialogue: 0,0:49:37.29,0:49:39.47,Default,,0000,0000,0000,,squared, you know, in your objective, Dialogue: 0,0:49:39.47,0:49:42.36,Default,,0000,0000,0000,,fixes the optimization objective. And Dialogue: 0,0:49:42.36,0:49:47.63,Default,,0000,0000,0000,,changing to an SVM is also another way of trying to fix the optimization objective. Okay? Dialogue: 0,0:49:47.63,0:49:49.33,Default,,0000,0000,0000,,And so Dialogue: 0,0:49:49.33,0:49:52.31,Default,,0000,0000,0000,,once again, you actually see this quite often that - Dialogue: 0,0:49:52.31,0:49:55.08,Default,,0000,0000,0000,,actually, you see it very often, people will Dialogue: 0,0:49:55.08,0:49:58.48,Default,,0000,0000,0000,,have a problem with the optimization objective Dialogue: 0,0:49:58.48,0:50:00.99,Default,,0000,0000,0000,,and be working harder and harder Dialogue: 0,0:50:00.99,0:50:03.18,Default,,0000,0000,0000,,to fix the optimization algorithm. Dialogue: 0,0:50:03.18,0:50:06.08,Default,,0000,0000,0000,,That's another very common pattern: Dialogue: 0,0:50:06.08,0:50:10.19,Default,,0000,0000,0000,,the problem is in the formula for your J of theta, but often you see people, you know, Dialogue: 0,0:50:10.19,0:50:13.27,Default,,0000,0000,0000,,just running more and more iterations of gradient descent.
Like trying Newton's Dialogue: 0,0:50:13.27,0:50:16.01,Default,,0000,0000,0000,,method and trying conjugate and then trying Dialogue: 0,0:50:16.01,0:50:18.59,Default,,0000,0000,0000,,more and more crazy optimization algorithms, Dialogue: 0,0:50:18.59,0:50:20.89,Default,,0000,0000,0000,,whereas the problem was, you know, Dialogue: 0,0:50:20.89,0:50:24.46,Default,,0000,0000,0000,,optimizing J of theta wasn't going to fix the problem at all. Okay? Dialogue: 0,0:50:24.46,0:50:28.65,Default,,0000,0000,0000,,So there's another example of when these sorts of diagnostics will Dialogue: 0,0:50:28.65,0:50:31.91,Default,,0000,0000,0000,,help you figure out whether you should be fixing your optimization algorithm Dialogue: 0,0:50:31.91,0:50:33.26,Default,,0000,0000,0000,,or fixing the Dialogue: 0,0:50:33.26,0:50:38.85,Default,,0000,0000,0000,,optimization Dialogue: 0,0:50:38.85,0:50:45.34,Default,,0000,0000,0000,,objective. Okay? Let me think Dialogue: 0,0:50:45.34,0:50:47.60,Default,,0000,0000,0000,,how much time I have. Dialogue: 0,0:50:47.60,0:50:48.82,Default,,0000,0000,0000,,Hmm, let's Dialogue: 0,0:50:48.82,0:50:49.62,Default,,0000,0000,0000,,see. Well okay, we have time. Let's do this. Dialogue: 0,0:50:49.62,0:50:52.98,Default,,0000,0000,0000,,Show you one last example of a diagnostic. This is one that came up in, Dialogue: 0,0:50:52.98,0:50:56.100,Default,,0000,0000,0000,,you know, my students' and my work on flying helicopters. Dialogue: 0,0:50:56.100,0:50:57.84,Default,,0000,0000,0000,, Dialogue: 0,0:50:57.84,0:51:00.19,Default,,0000,0000,0000,,This one actually, Dialogue: 0,0:51:00.19,0:51:04.18,Default,,0000,0000,0000,,this example is the most complex of the three examples I'm gonna do Dialogue: 0,0:51:04.18,0:51:05.61,Default,,0000,0000,0000,,today. Dialogue: 0,0:51:05.61,0:51:08.56,Default,,0000,0000,0000,,I'm going to somewhat quickly, and Dialogue: 0,0:51:08.56,0:51:11.26,Default,,0000,0000,0000,,this actually draws on reinforcement learning which is something that I'm not Dialogue: 0,0:51:11.26,0:51:14.50,Default,,0000,0000,0000,,gonna talk about until towards - close to the end of the course here, but this just Dialogue: 0,0:51:14.50,0:51:16.76,Default,,0000,0000,0000,,a more Dialogue: 0,0:51:16.76,0:51:20.01,Default,,0000,0000,0000,,complicated example of a diagnostic we're gonna go over. Dialogue: 0,0:51:20.01,0:51:23.76,Default,,0000,0000,0000,,What I'll do is probably go over this fairly quickly, and then after we've talked about Dialogue: 0,0:51:23.76,0:51:26.84,Default,,0000,0000,0000,,reinforcement learning in the class, I'll probably actually come back and redo this exact Dialogue: 0,0:51:26.84,0:51:32.92,Default,,0000,0000,0000,,same example because you'll understand it more deeply. Okay? Dialogue: 0,0:51:32.92,0:51:37.10,Default,,0000,0000,0000,,So some of you know that my students and I fly autonomous helicopters, so how do you get a Dialogue: 0,0:51:37.10,0:51:41.56,Default,,0000,0000,0000,,machine-learning algorithm to design the controller for Dialogue: 0,0:51:41.56,0:51:44.20,Default,,0000,0000,0000,,helicopter? This is what we do. All right? Dialogue: 0,0:51:44.20,0:51:48.52,Default,,0000,0000,0000,,This first step was you build a simulator for a helicopter, so, you know, there's a screenshot of our Dialogue: 0,0:51:48.52,0:51:49.62,Default,,0000,0000,0000,,simulator. Dialogue: 0,0:51:49.62,0:51:53.50,Default,,0000,0000,0000,,This is just like a - it's like a joystick simulator; you can fly a helicopter in simulation. 
And then you Dialogue: 0,0:51:53.50,0:51:55.68,Default,,0000,0000,0000,, Dialogue: 0,0:51:55.68,0:51:57.19,Default,,0000,0000,0000,,choose a cost function, it's Dialogue: 0,0:51:57.19,0:52:00.85,Default,,0000,0000,0000,,actually called a [inaudible] function, but for this actually I'll call it cost function. Dialogue: 0,0:52:00.85,0:52:02.95,Default,,0000,0000,0000,,Say J of theta is, you know, Dialogue: 0,0:52:02.95,0:52:06.59,Default,,0000,0000,0000,,the expected squared error in your helicopter's Dialogue: 0,0:52:06.59,0:52:08.15,Default,,0000,0000,0000,,position. Okay? So this is J of theta is Dialogue: 0,0:52:08.15,0:52:08.51,Default,,0000,0000,0000,,maybe Dialogue: 0,0:52:08.51,0:52:12.36,Default,,0000,0000,0000,,it's expected square error or just the square error. Dialogue: 0,0:52:12.36,0:52:16.91,Default,,0000,0000,0000,,And then we run a reinforcement-learning algorithm, you'll learn about RL algorithms Dialogue: 0,0:52:16.91,0:52:18.60,Default,,0000,0000,0000,,in a few weeks. Dialogue: 0,0:52:18.60,0:52:22.50,Default,,0000,0000,0000,,You run reinforcement learning algorithm in your simulator Dialogue: 0,0:52:22.50,0:52:26.64,Default,,0000,0000,0000,,to try to minimize this cost function; try to minimize the squared error of Dialogue: 0,0:52:26.64,0:52:31.44,Default,,0000,0000,0000,,how well you're controlling your helicopter's position. Okay? Dialogue: 0,0:52:31.44,0:52:35.28,Default,,0000,0000,0000,,The reinforcement learning algorithm will output some parameters, which I'm denoting theta Dialogue: 0,0:52:35.28,0:52:37.21,Default,,0000,0000,0000,,subscript RL, Dialogue: 0,0:52:37.21,0:52:41.71,Default,,0000,0000,0000,,and then you'll use that to fly your helicopter. Dialogue: 0,0:52:41.71,0:52:44.96,Default,,0000,0000,0000,,So suppose you run this learning algorithm and Dialogue: 0,0:52:44.96,0:52:48.59,Default,,0000,0000,0000,,you get out a set of controller parameters, theta subscript RL, Dialogue: 0,0:52:48.59,0:52:52.30,Default,,0000,0000,0000,,that gives much worse performance than a human pilot. Then Dialogue: 0,0:52:52.30,0:52:54.73,Default,,0000,0000,0000,,what do you do next? And in particular, you Dialogue: 0,0:52:54.73,0:52:57.96,Default,,0000,0000,0000,,know, corresponding to the three steps above, there are three Dialogue: 0,0:52:57.96,0:53:00.59,Default,,0000,0000,0000,,natural things you can try. Right? You can Dialogue: 0,0:53:00.59,0:53:01.87,Default,,0000,0000,0000,,try to - oh, the bottom of Dialogue: 0,0:53:01.87,0:53:03.92,Default,,0000,0000,0000,,the slide got chopped off. Dialogue: 0,0:53:03.92,0:53:07.53,Default,,0000,0000,0000,,You can try to improve the simulator. And Dialogue: 0,0:53:07.53,0:53:10.33,Default,,0000,0000,0000,,maybe you think your simulator's isn't that accurate, you need to capture Dialogue: 0,0:53:10.33,0:53:12.34,Default,,0000,0000,0000,,the aerodynamic effects more Dialogue: 0,0:53:12.34,0:53:15.43,Default,,0000,0000,0000,,accurately. You need to capture the airflow and the turbulence affects around the helicopter Dialogue: 0,0:53:15.43,0:53:18.28,Default,,0000,0000,0000,,more accurately. Dialogue: 0,0:53:18.28,0:53:21.44,Default,,0000,0000,0000,,Maybe you need to modify the cost function. Maybe your square error isn't cutting it. Maybe Dialogue: 0,0:53:21.44,0:53:24.72,Default,,0000,0000,0000,,what a human pilot does isn't just optimizing square area but it's something more Dialogue: 0,0:53:24.72,0:53:25.99,Default,,0000,0000,0000,,subtle. 
Dialogue: 0,0:53:25.99,0:53:26.77,Default,,0000,0000,0000,,Or maybe Dialogue: 0,0:53:26.77,0:53:32.99,Default,,0000,0000,0000,,the reinforcement-learning algorithm isn't working; maybe it's not quite converging or something. Okay? So Dialogue: 0,0:53:32.99,0:53:36.80,Default,,0000,0000,0000,,these are the diagnostics that I actually used, and my students and I actually use to figure out what's Dialogue: 0,0:53:36.80,0:53:41.30,Default,,0000,0000,0000,,going on. Dialogue: 0,0:53:41.30,0:53:44.51,Default,,0000,0000,0000,,Actually, why don't you just think about this for a second and think what you'd do, and then Dialogue: 0,0:53:44.51,0:53:51.51,Default,,0000,0000,0000,,I'll go on and tell you what we do. All right, Dialogue: 0,0:54:46.23,0:54:47.87,Default,,0000,0000,0000,,so let me tell you what - Dialogue: 0,0:54:47.87,0:54:49.60,Default,,0000,0000,0000,,how we do this and see Dialogue: 0,0:54:49.60,0:54:52.77,Default,,0000,0000,0000,,whether it's the same as yours or not. And if you have a better idea than I do, let me Dialogue: 0,0:54:52.77,0:54:53.57,Default,,0000,0000,0000,,know and I'll let you try it Dialogue: 0,0:54:53.57,0:54:55.92,Default,,0000,0000,0000,,on my helicopter. Dialogue: 0,0:54:55.92,0:54:58.24,Default,,0000,0000,0000,,So Dialogue: 0,0:54:58.24,0:55:01.45,Default,,0000,0000,0000,,here's a reasoning that I wanted to experiment, right. So, Dialogue: 0,0:55:01.45,0:55:03.68,Default,,0000,0000,0000,,yeah, let's say the controller output Dialogue: 0,0:55:03.68,0:55:10.37,Default,,0000,0000,0000,,by our reinforcement-learning algorithm does poorly. Well Dialogue: 0,0:55:10.37,0:55:12.61,Default,,0000,0000,0000,,suppose the following three things hold true. Dialogue: 0,0:55:12.61,0:55:15.15,Default,,0000,0000,0000,,Suppose the contrary, I guess. Suppose that Dialogue: 0,0:55:15.15,0:55:19.65,Default,,0000,0000,0000,,the helicopter simulator is accurate, so let's assume we have an accurate model Dialogue: 0,0:55:19.65,0:55:22.45,Default,,0000,0000,0000,,of our helicopter. And Dialogue: 0,0:55:22.45,0:55:25.30,Default,,0000,0000,0000,,let's suppose that the reinforcement learning algorithm, Dialogue: 0,0:55:25.30,0:55:28.89,Default,,0000,0000,0000,,you know, correctly controls the helicopter in simulation, Dialogue: 0,0:55:28.89,0:55:31.82,Default,,0000,0000,0000,,so we tend to run a learning algorithm in simulation so that, you know, the Dialogue: 0,0:55:31.82,0:55:35.15,Default,,0000,0000,0000,,learning algorithm can crash a helicopter and it's fine. Right? Dialogue: 0,0:55:35.15,0:55:37.23,Default,,0000,0000,0000,,So let's assume our reinforcement-learning Dialogue: 0,0:55:37.23,0:55:40.11,Default,,0000,0000,0000,,algorithm correctly controls the helicopter so as to minimize the cost Dialogue: 0,0:55:40.11,0:55:42.10,Default,,0000,0000,0000,,function J of theta. Dialogue: 0,0:55:42.10,0:55:43.74,Default,,0000,0000,0000,,And let's suppose that Dialogue: 0,0:55:43.74,0:55:47.63,Default,,0000,0000,0000,,minimizing J of theta does indeed correspond to accurate or the correct autonomous Dialogue: 0,0:55:47.63,0:55:49.34,Default,,0000,0000,0000,,flight. Dialogue: 0,0:55:49.34,0:55:52.07,Default,,0000,0000,0000,,If all of these things held true, Dialogue: 0,0:55:52.07,0:55:53.91,Default,,0000,0000,0000,,then that means that Dialogue: 0,0:55:53.91,0:55:58.46,Default,,0000,0000,0000,,the parameters, theta RL, should actually fly well on my real Dialogue: 0,0:55:58.46,0:56:01.04,Default,,0000,0000,0000,,helicopter. Right? 
Dialogue: 0,0:56:01.04,0:56:03.28,Default,,0000,0000,0000,,And so the fact that the learning Dialogue: 0,0:56:03.28,0:56:05.34,Default,,0000,0000,0000,,control parameters, theta RL, Dialogue: 0,0:56:05.34,0:56:08.60,Default,,0000,0000,0000,,does not fly well on my helicopter, that sort Dialogue: 0,0:56:08.60,0:56:11.25,Default,,0000,0000,0000,,of means that ones of these three assumptions must be wrong Dialogue: 0,0:56:11.25,0:56:17.87,Default,,0000,0000,0000,,and I'd like to figure out which of these Dialogue: 0,0:56:17.87,0:56:19.67,Default,,0000,0000,0000,,three assumptions Dialogue: 0,0:56:19.67,0:56:22.09,Default,,0000,0000,0000,,is wrong. Okay? So these are the diagnostics we use. Dialogue: 0,0:56:22.09,0:56:25.45,Default,,0000,0000,0000,,First one is Dialogue: 0,0:56:25.45,0:56:31.72,Default,,0000,0000,0000,,we look at the controller and see if it even flies well in Dialogue: 0,0:56:31.72,0:56:35.09,Default,,0000,0000,0000,,simulation. Right? So the simulator of the helicopter that we did the learning on, Dialogue: 0,0:56:35.09,0:56:38.70,Default,,0000,0000,0000,,and so if the learning algorithm flies well in the simulator but Dialogue: 0,0:56:38.70,0:56:42.03,Default,,0000,0000,0000,,it doesn't fly well on my real helicopter, Dialogue: 0,0:56:42.03,0:56:46.11,Default,,0000,0000,0000,,then that tells me the problem is probably in the simulator. Right? Dialogue: 0,0:56:46.11,0:56:48.05,Default,,0000,0000,0000,,My simulator predicts Dialogue: 0,0:56:48.05,0:56:51.91,Default,,0000,0000,0000,,the helicopter's controller will fly well but it doesn't actually fly well in real life, so Dialogue: 0,0:56:51.91,0:56:53.58,Default,,0000,0000,0000,,could be the problem's in the simulator Dialogue: 0,0:56:53.58,0:56:59.24,Default,,0000,0000,0000,,and we should spend out efforts improving the accuracy of our simulator. Dialogue: 0,0:56:59.24,0:57:03.17,Default,,0000,0000,0000,,Otherwise, let me write theta subscript human, be the human Dialogue: 0,0:57:03.17,0:57:07.05,Default,,0000,0000,0000,,control policy. All right? So Dialogue: 0,0:57:07.05,0:57:11.64,Default,,0000,0000,0000,,let's go ahead and ask a human to fly the helicopter, it could be in the simulator, it Dialogue: 0,0:57:11.64,0:57:13.48,Default,,0000,0000,0000,,could be in real life, Dialogue: 0,0:57:13.48,0:57:16.77,Default,,0000,0000,0000,,and let's measure, you know, the means squared error Dialogue: 0,0:57:16.77,0:57:20.21,Default,,0000,0000,0000,,of the human pilot's flight. And Dialogue: 0,0:57:20.21,0:57:24.24,Default,,0000,0000,0000,,let's see if the human pilot does better or worse Dialogue: 0,0:57:24.24,0:57:26.09,Default,,0000,0000,0000,,than the learned controller, Dialogue: 0,0:57:26.09,0:57:28.25,Default,,0000,0000,0000,,in terms of optimizing this Dialogue: 0,0:57:28.25,0:57:31.97,Default,,0000,0000,0000,,objective function J of theta. Okay? Dialogue: 0,0:57:31.97,0:57:33.93,Default,,0000,0000,0000,,So if the human does Dialogue: 0,0:57:33.93,0:57:36.89,Default,,0000,0000,0000,,worse, if even a very good human pilot Dialogue: 0,0:57:36.89,0:57:41.44,Default,,0000,0000,0000,,attains a worse value on my optimization objective, on my cost Dialogue: 0,0:57:41.44,0:57:42.41,Default,,0000,0000,0000,,function, Dialogue: 0,0:57:42.41,0:57:48.62,Default,,0000,0000,0000,,than my learning algorithm, Dialogue: 0,0:57:48.62,0:57:51.80,Default,,0000,0000,0000,,then the problem is in the reinforcement-learning algorithm. 
Dialogue: 0,0:57:51.80,0:57:56.09,Default,,0000,0000,0000,,Because my reinforcement-learning algorithm was trying to minimize J of Dialogue: 0,0:57:56.09,0:58:00.14,Default,,0000,0000,0000,,theta, but a human actually attains a lower value for J of theta than does my Dialogue: 0,0:58:00.14,0:58:01.78,Default,,0000,0000,0000,,algorithm. Dialogue: 0,0:58:01.78,0:58:05.49,Default,,0000,0000,0000,,And so that tells me that clearly my algorithm's not Dialogue: 0,0:58:05.49,0:58:07.82,Default,,0000,0000,0000,,managing to minimize J of theta Dialogue: 0,0:58:07.82,0:58:12.88,Default,,0000,0000,0000,,and that tells me the problem's in the reinforcement learning algorithm. Dialogue: 0,0:58:12.88,0:58:17.65,Default,,0000,0000,0000,,And finally, if J of theta - if the human actually attains a larger value Dialogue: 0,0:58:17.65,0:58:19.55,Default,,0000,0000,0000,,for theta - excuse me, Dialogue: 0,0:58:19.55,0:58:24.40,Default,,0000,0000,0000,,if the human actually attains a larger value for J of theta, the human actually Dialogue: 0,0:58:24.40,0:58:27.86,Default,,0000,0000,0000,,has, you know, larger mean squared error for the helicopter position than Dialogue: 0,0:58:27.86,0:58:30.60,Default,,0000,0000,0000,,does my reinforcement learning algorithms, that's Dialogue: 0,0:58:30.60,0:58:34.00,Default,,0000,0000,0000,,I like - but I like the way the human flies much better than my reinforcement learning Dialogue: 0,0:58:34.00,0:58:35.32,Default,,0000,0000,0000,,algorithm. So Dialogue: 0,0:58:35.32,0:58:37.23,Default,,0000,0000,0000,,if that holds true, Dialogue: 0,0:58:37.23,0:58:39.78,Default,,0000,0000,0000,,then clearly the problem's in the cost function, right, Dialogue: 0,0:58:39.78,0:58:42.88,Default,,0000,0000,0000,,because the human does worse on my cost function Dialogue: 0,0:58:42.88,0:58:46.07,Default,,0000,0000,0000,,but flies much better than my learning algorithm. Dialogue: 0,0:58:46.07,0:58:48.36,Default,,0000,0000,0000,,And so that means the problem's in the cost function. It Dialogue: 0,0:58:48.36,0:58:50.09,Default,,0000,0000,0000,,means - oh Dialogue: 0,0:58:50.09,0:58:50.54,Default,,0000,0000,0000,,excuse me, I Dialogue: 0,0:58:50.54,0:58:53.68,Default,,0000,0000,0000,,meant minimizing it, not maximizing it, there's a typo on the slide, Dialogue: 0,0:58:53.68,0:58:55.38,Default,,0000,0000,0000,,because that means that minimizing Dialogue: 0,0:58:55.38,0:58:57.09,Default,,0000,0000,0000,,the cost function Dialogue: 0,0:58:57.09,0:59:00.22,Default,,0000,0000,0000,,- my learning algorithm does a better job minimizing the cost function but doesn't Dialogue: 0,0:59:00.22,0:59:03.44,Default,,0000,0000,0000,,fly as well as a human pilot. So that tells you that Dialogue: 0,0:59:03.44,0:59:04.72,Default,,0000,0000,0000,,minimizing the cost function Dialogue: 0,0:59:04.72,0:59:06.88,Default,,0000,0000,0000,,doesn't correspond to good autonomous flight. And what Dialogue: 0,0:59:06.88,0:59:11.86,Default,,0000,0000,0000,,you should do it go back and see if you can change J of Dialogue: 0,0:59:11.86,0:59:13.10,Default,,0000,0000,0000,,theta. Okay? 
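A hedged sketch of that three-way diagnostic; flies_well, J_sim, theta_rl, and theta_human are hypothetical stand-ins for measurements you would actually make on the simulator, the real helicopter, and the two controllers:

    def rl_diagnostic(theta_rl, theta_human, flies_well, J_sim):
        # flies_well(theta, platform) -> bool; J_sim(theta) -> cost measured in simulation,
        # e.g. mean squared error of the helicopter's position.
        if flies_well(theta_rl, "simulator") and not flies_well(theta_rl, "real helicopter"):
            # The controller looks fine in simulation but not in real life.
            return "problem is the simulator - improve its accuracy"
        if J_sim(theta_human) < J_sim(theta_rl):
            # A human attains lower cost than the algorithm that was supposed to minimize it.
            return "problem is the reinforcement learning algorithm"
        # The algorithm minimizes J better than the human, yet the human flies better,
        # so minimizing J doesn't correspond to good autonomous flight.
        return "problem is the cost function J(theta)"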
Dialogue: 0,0:59:13.10,0:59:18.38,Default,,0000,0000,0000,,And so for those reinforcement learning problems, you know, if something doesn't work - often reinforcement Dialogue: 0,0:59:18.38,0:59:21.73,Default,,0000,0000,0000,,learning algorithms just work but when they don't work, Dialogue: 0,0:59:21.73,0:59:26.20,Default,,0000,0000,0000,,these are the sorts of diagnostics you use to figure out should we be focusing on the simulator, Dialogue: 0,0:59:26.20,0:59:30.33,Default,,0000,0000,0000,,on changing the cost function, or on changing the reinforcement learning Dialogue: 0,0:59:30.33,0:59:32.09,Default,,0000,0000,0000,,algorithm. And Dialogue: 0,0:59:32.09,0:59:37.04,Default,,0000,0000,0000,,again, if you don't know which of your three problems it is, it's entirely possible, Dialogue: 0,0:59:37.04,0:59:40.28,Default,,0000,0000,0000,,you know, to spend two years, whatever, changing, building a better simulator Dialogue: 0,0:59:40.28,0:59:42.60,Default,,0000,0000,0000,,for your helicopter. Dialogue: 0,0:59:42.60,0:59:43.95,Default,,0000,0000,0000,,But it turns out that Dialogue: 0,0:59:43.95,0:59:47.69,Default,,0000,0000,0000,,modeling helicopter aerodynamics is an active area of research. There are people, you know, writing Dialogue: 0,0:59:47.69,0:59:49.79,Default,,0000,0000,0000,,entire PhD theses on this still. Dialogue: 0,0:59:49.79,0:59:53.56,Default,,0000,0000,0000,,So it's entirely possible to go out and spend six years and write a PhD thesis and build Dialogue: 0,0:59:53.56,0:59:55.50,Default,,0000,0000,0000,,a much better helicopter simulator, but if you're fixing Dialogue: 0,0:59:55.50,1:00:02.50,Default,,0000,0000,0000,,the wrong problem it's not gonna help. Dialogue: 0,1:00:03.21,1:00:05.53,Default,,0000,0000,0000,,So Dialogue: 0,1:00:05.53,1:00:08.92,Default,,0000,0000,0000,,quite often, you need to come up with your own diagnostics to figure out what's happening in an Dialogue: 0,1:00:08.92,1:00:11.64,Default,,0000,0000,0000,,algorithm when something is going wrong. Dialogue: 0,1:00:11.64,1:00:15.68,Default,,0000,0000,0000,,And unfortunately I don't know of - what I've described Dialogue: 0,1:00:15.68,1:00:17.15,Default,,0000,0000,0000,,are sort of maybe Dialogue: 0,1:00:17.15,1:00:20.51,Default,,0000,0000,0000,,some of the most common diagnostics that I've used, that I've seen, Dialogue: 0,1:00:20.51,1:00:23.71,Default,,0000,0000,0000,,you know, to be useful for many problems. But very often, you need to come up Dialogue: 0,1:00:23.71,1:00:28.19,Default,,0000,0000,0000,,with your own for your own specific learning problem. Dialogue: 0,1:00:28.19,1:00:31.73,Default,,0000,0000,0000,,And I just want to point out that even when the learning algorithm is working well, it's Dialogue: 0,1:00:31.73,1:00:35.16,Default,,0000,0000,0000,,often a good idea to run diagnostics, like the ones I talked Dialogue: 0,1:00:35.16,1:00:36.07,Default,,0000,0000,0000,,about, Dialogue: 0,1:00:36.07,1:00:38.31,Default,,0000,0000,0000,,to make sure you really understand what's going on. Dialogue: 0,1:00:38.31,1:00:41.60,Default,,0000,0000,0000,,All right? And this is useful for a couple of reasons. One is that Dialogue: 0,1:00:41.60,1:00:45.61,Default,,0000,0000,0000,,diagnostics like these will often help you to understand your application Dialogue: 0,1:00:45.61,1:00:47.90,Default,,0000,0000,0000,,problem better. 
Dialogue: 0,1:00:47.90,1:00:52.16,Default,,0000,0000,0000,,So some of you will, you know, graduate from Stanford and go on to get some amazingly high-paying Dialogue: 0,1:00:52.16,1:00:56.35,Default,,0000,0000,0000,,job to apply machine-learning algorithms to some application problem of, you Dialogue: 0,1:00:56.35,1:00:59.30,Default,,0000,0000,0000,,know, significant economic interest. Right? Dialogue: 0,1:00:59.30,1:01:02.93,Default,,0000,0000,0000,,And you're gonna be working on one specific Dialogue: 0,1:01:02.93,1:01:08.06,Default,,0000,0000,0000,,important machine learning application for many months, or even for years. Dialogue: 0,1:01:08.06,1:01:10.99,Default,,0000,0000,0000,,One of the most valuable things for you personally will be for you to Dialogue: 0,1:01:10.99,1:01:13.25,Default,,0000,0000,0000,,get in - for you personally Dialogue: 0,1:01:13.25,1:01:16.91,Default,,0000,0000,0000,,to get in an intuitive understanding of what works and what doesn't work your Dialogue: 0,1:01:16.91,1:01:17.37,Default,,0000,0000,0000,,problem. Dialogue: 0,1:01:17.37,1:01:21.24,Default,,0000,0000,0000,,Sort of right now in the industry, in Silicon Valley or around the world, Dialogue: 0,1:01:21.24,1:01:24.83,Default,,0000,0000,0000,,there are many companies with important machine learning problems and there are often people Dialogue: 0,1:01:24.83,1:01:26.95,Default,,0000,0000,0000,,working on the same machine learning problem, you Dialogue: 0,1:01:26.95,1:01:31.21,Default,,0000,0000,0000,,know, for many months or for years on end. And Dialogue: 0,1:01:31.21,1:01:34.66,Default,,0000,0000,0000,,when you're doing that, I mean solving a really important problem using learning algorithms, one of Dialogue: 0,1:01:34.66,1:01:38.72,Default,,0000,0000,0000,,the most valuable things is just your own personal intuitive understanding of the Dialogue: 0,1:01:38.72,1:01:40.100,Default,,0000,0000,0000,,problem. Dialogue: 0,1:01:40.100,1:01:42.17,Default,,0000,0000,0000,,Okay? Dialogue: 0,1:01:42.17,1:01:43.41,Default,,0000,0000,0000,,And diagnostics, like Dialogue: 0,1:01:43.41,1:01:48.15,Default,,0000,0000,0000,,the sort I talked about, will be one way for you to get a better and better understanding of Dialogue: 0,1:01:48.15,1:01:50.28,Default,,0000,0000,0000,,these problems. It Dialogue: 0,1:01:50.28,1:01:54.09,Default,,0000,0000,0000,,turns out, by the way, there are some of Silicon Valley companies that outsource their Dialogue: 0,1:01:54.09,1:01:56.68,Default,,0000,0000,0000,,machine learning. So there's sometimes, you know, whatever. Dialogue: 0,1:01:56.68,1:01:59.53,Default,,0000,0000,0000,,They're a company in Silicon Valley and they'll, you know, Dialogue: 0,1:01:59.53,1:02:03.23,Default,,0000,0000,0000,,hire a firm in New York to run all their learning algorithms for them. 
Dialogue: 0,1:02:03.23,1:02:06.89,Default,,0000,0000,0000,,And I'm not a businessman, but I personally think that's Dialogue: 0,1:02:06.89,1:02:09.31,Default,,0000,0000,0000,,often a terrible idea because Dialogue: 0,1:02:09.31,1:02:13.64,Default,,0000,0000,0000,,if your expertise, if your understanding of your data is given, Dialogue: 0,1:02:13.64,1:02:15.71,Default,,0000,0000,0000,,you know, to an outsource agency, Dialogue: 0,1:02:15.71,1:02:19.59,Default,,0000,0000,0000,,then if you don't maintain that expertise, if there's a problem you really care about Dialogue: 0,1:02:19.59,1:02:22.30,Default,,0000,0000,0000,,then it'll be your own, you know, Dialogue: 0,1:02:22.30,1:02:26.01,Default,,0000,0000,0000,,understanding of the problem that you build up over months that'll be really valuable. Dialogue: 0,1:02:26.01,1:02:28.65,Default,,0000,0000,0000,,And if that knowledge is outsourced, you don't get to keep that knowledge Dialogue: 0,1:02:28.65,1:02:29.49,Default,,0000,0000,0000,,yourself. Dialogue: 0,1:02:29.49,1:02:31.70,Default,,0000,0000,0000,,I personally think that's a terrible idea. Dialogue: 0,1:02:31.70,1:02:35.81,Default,,0000,0000,0000,,But I'm not a businessman, but I just see people do that a lot, Dialogue: 0,1:02:35.81,1:02:39.11,Default,,0000,0000,0000,,and just. Let's see. Dialogue: 0,1:02:39.11,1:02:42.95,Default,,0000,0000,0000,,Another reason for running diagnostics like these is actually in writing research Dialogue: 0,1:02:42.95,1:02:43.61,Default,,0000,0000,0000,,papers, Dialogue: 0,1:02:43.61,1:02:46.15,Default,,0000,0000,0000,,right? So Dialogue: 0,1:02:46.15,1:02:49.33,Default,,0000,0000,0000,,diagnostics and error analyses, which I'll talk about in a minute, Dialogue: 0,1:02:49.33,1:02:53.02,Default,,0000,0000,0000,,often help to convey insight about the problem and help justify your research Dialogue: 0,1:02:53.02,1:02:54.11,Default,,0000,0000,0000,,claims. Dialogue: 0,1:02:54.11,1:02:56.56,Default,,0000,0000,0000,, Dialogue: 0,1:02:56.56,1:02:57.78,Default,,0000,0000,0000,,So for example, Dialogue: 0,1:02:57.78,1:03:00.79,Default,,0000,0000,0000,,rather than writing a research paper, say, that's says, you know, "Oh well here's Dialogue: 0,1:03:00.79,1:03:04.04,Default,,0000,0000,0000,,an algorithm that works. I built this helicopter and it flies," or whatever, Dialogue: 0,1:03:04.04,1:03:05.65,Default,,0000,0000,0000,,it's often much more interesting to say, Dialogue: 0,1:03:05.65,1:03:09.61,Default,,0000,0000,0000,,"Here's an algorithm that works, and it works because of a specific Dialogue: 0,1:03:09.61,1:03:13.92,Default,,0000,0000,0000,,component X. And moreover, here's the diagnostic that gives you justification that shows X was Dialogue: 0,1:03:13.92,1:03:19.16,Default,,0000,0000,0000,,the thing that fixed this problem," and that's where you made it work. Okay? So Dialogue: 0,1:03:19.16,1:03:21.39,Default,,0000,0000,0000,,that leads me Dialogue: 0,1:03:21.39,1:03:25.93,Default,,0000,0000,0000,,into a discussion on error analysis, which is often good machine learning practice, Dialogue: 0,1:03:25.93,1:03:26.44,Default,,0000,0000,0000,, Dialogue: 0,1:03:26.44,1:03:32.10,Default,,0000,0000,0000,,is a way for understanding what your sources of errors are. So what I Dialogue: 0,1:03:32.10,1:03:34.69,Default,,0000,0000,0000,,call error analyses - and let's check Dialogue: 0,1:03:34.69,1:03:41.69,Default,,0000,0000,0000,,questions about this. Dialogue: 0,1:03:41.77,1:03:45.79,Default,,0000,0000,0000,,Yeah? 
Dialogue: 0,1:03:45.79,1:03:49.81,Default,,0000,0000,0000,,Student:What ended up being wrong with the helicopter? Instructor (Andrew Ng):Oh, I don't know. Let's see. We've flown so many times. Dialogue: 0,1:03:49.81,1:03:53.50,Default,,0000,0000,0000,,The thing that is most difficult with a helicopter is actually building a Dialogue: 0,1:03:53.50,1:03:55.11,Default,,0000,0000,0000,,very - I don't know. It Dialogue: 0,1:03:55.11,1:03:58.49,Default,,0000,0000,0000,,changes all the time. Quite often, it's actually the simulator. Building an accurate simulator of a helicopter Dialogue: 0,1:03:58.49,1:04:02.86,Default,,0000,0000,0000,,is very hard. Yeah. Okay. So Dialogue: 0,1:04:02.86,1:04:03.93,Default,,0000,0000,0000,,for error Dialogue: 0,1:04:03.93,1:04:06.27,Default,,0000,0000,0000,,analyses, Dialogue: 0,1:04:06.27,1:04:10.81,Default,,0000,0000,0000,,this is a way for figuring out what is working in your algorithm and what isn't working. Dialogue: 0,1:04:10.81,1:04:17.71,Default,,0000,0000,0000,,And we're gonna talk about two specific examples. So there are Dialogue: 0,1:04:17.71,1:04:21.53,Default,,0000,0000,0000,,many learning - there are many sort of AI systems, many machine learning systems, that Dialogue: 0,1:04:21.53,1:04:22.47,Default,,0000,0000,0000,,combine Dialogue: 0,1:04:22.47,1:04:24.89,Default,,0000,0000,0000,,many different components into a pipeline. So Dialogue: 0,1:04:24.89,1:04:27.47,Default,,0000,0000,0000,,here's sort of a contrived example for this, Dialogue: 0,1:04:27.47,1:04:31.02,Default,,0000,0000,0000,,not dissimilar in many ways from the actual machine learning systems you see. Dialogue: 0,1:04:31.02,1:04:32.39,Default,,0000,0000,0000,,So let's say you want to Dialogue: 0,1:04:32.39,1:04:37.75,Default,,0000,0000,0000,,recognize people from images. This is a picture of one of my friends. Dialogue: 0,1:04:37.75,1:04:41.90,Default,,0000,0000,0000,,So you take this input camera image, say, and you often run it through a long pipeline. So Dialogue: 0,1:04:41.90,1:04:43.07,Default,,0000,0000,0000,,for example, Dialogue: 0,1:04:43.07,1:04:47.86,Default,,0000,0000,0000,,the first thing you may do is preprocess the image and remove the background, so you remove the Dialogue: 0,1:04:47.86,1:04:49.19,Default,,0000,0000,0000,,background. Dialogue: 0,1:04:49.19,1:04:51.91,Default,,0000,0000,0000,,And then you run a Dialogue: 0,1:04:51.91,1:04:55.21,Default,,0000,0000,0000,,face detection algorithm, so a machine learning algorithm to detect people's faces. Dialogue: 0,1:04:55.21,1:04:56.11,Default,,0000,0000,0000,,Right? Dialogue: 0,1:04:56.11,1:04:59.76,Default,,0000,0000,0000,,And then, you know, let's say you want to recognize the identity of the person, right, this is your Dialogue: 0,1:04:59.76,1:05:01.72,Default,,0000,0000,0000,,application. Dialogue: 0,1:05:01.72,1:05:04.44,Default,,0000,0000,0000,,You then segment out the eyes, segment out the nose, Dialogue: 0,1:05:04.44,1:05:08.33,Default,,0000,0000,0000,,and have different learning algorithms to detect the mouth and so on. Dialogue: 0,1:05:08.33,1:05:10.03,Default,,0000,0000,0000,,I know; she might not want to be my friend Dialogue: 0,1:05:10.03,1:05:13.25,Default,,0000,0000,0000,,after she sees this.
Dialogue: 0,1:05:13.25,1:05:16.77,Default,,0000,0000,0000,,And then having found all these features, based on, you know, what the nose looks like, what the eyes Dialogue: 0,1:05:16.77,1:05:18.61,Default,,0000,0000,0000,,look like, whatever, then you Dialogue: 0,1:05:18.61,1:05:22.80,Default,,0000,0000,0000,,feed all the features into a logistic regression algorithm. And your logistic Dialogue: 0,1:05:22.80,1:05:24.77,Default,,0000,0000,0000,,regression or softmax regression, or whatever, Dialogue: 0,1:05:24.77,1:05:30.38,Default,,0000,0000,0000,,will tell you the identity of this person. Okay? Dialogue: 0,1:05:30.38,1:05:32.46,Default,,0000,0000,0000,,So Dialogue: 0,1:05:32.46,1:05:35.06,Default,,0000,0000,0000,,this is what error analysis is. Dialogue: 0,1:05:35.06,1:05:40.33,Default,,0000,0000,0000,,You have a long complicated pipeline combining many machine learning Dialogue: 0,1:05:40.33,1:05:43.92,Default,,0000,0000,0000,,components. Many of these components would themselves be learning algorithms. Dialogue: 0,1:05:43.92,1:05:45.69,Default,,0000,0000,0000,,And so, Dialogue: 0,1:05:45.69,1:05:50.42,Default,,0000,0000,0000,,it's often very useful to figure out how much of your error can be attributed to each of Dialogue: 0,1:05:50.42,1:05:55.18,Default,,0000,0000,0000,,these components. Dialogue: 0,1:05:55.18,1:05:56.18,Default,,0000,0000,0000,,So Dialogue: 0,1:05:56.18,1:05:59.59,Default,,0000,0000,0000,,what we'll do in a typical error analysis procedure Dialogue: 0,1:05:59.59,1:06:03.71,Default,,0000,0000,0000,,is we'll repeatedly plug in the ground-truth for each component and see how the Dialogue: 0,1:06:03.71,1:06:05.13,Default,,0000,0000,0000,,accuracy changes. Dialogue: 0,1:06:05.13,1:06:07.59,Default,,0000,0000,0000,,So what I mean by that is the Dialogue: 0,1:06:07.59,1:06:11.39,Default,,0000,0000,0000,,figure on the bottom right - let's say the overall accuracy of the system is Dialogue: 0,1:06:11.39,1:06:12.74,Default,,0000,0000,0000,,85 percent. Right? Dialogue: 0,1:06:12.74,1:06:14.69,Default,,0000,0000,0000,,Then I want to know Dialogue: 0,1:06:14.69,1:06:17.40,Default,,0000,0000,0000,,where my 15 percent of error comes from. Dialogue: 0,1:06:17.40,1:06:19.16,Default,,0000,0000,0000,,And so what I'll do is I'll go Dialogue: 0,1:06:19.16,1:06:21.33,Default,,0000,0000,0000,,to my test set Dialogue: 0,1:06:21.33,1:06:26.63,Default,,0000,0000,0000,,and, instead of running my own background removal, I'll plug in the correct Dialogue: 0,1:06:26.63,1:06:29.75,Default,,0000,0000,0000,,background removal. So I'll actually go in and give it - Dialogue: 0,1:06:29.75,1:06:33.44,Default,,0000,0000,0000,,give my algorithm - what is the correct background versus foreground. Dialogue: 0,1:06:33.44,1:06:36.84,Default,,0000,0000,0000,,And if I do that, let's color that blue to denote that I'm Dialogue: 0,1:06:36.84,1:06:39.53,Default,,0000,0000,0000,,giving that ground-truth data in the test set, Dialogue: 0,1:06:39.53,1:06:43.84,Default,,0000,0000,0000,,let's assume our accuracy increases to 85.1 percent. Okay? Dialogue: 0,1:06:43.84,1:06:47.76,Default,,0000,0000,0000,,And now I'll go in and, you know, give my algorithm the ground-truth Dialogue: 0,1:06:47.76,1:06:48.93,Default,,0000,0000,0000,,face detection Dialogue: 0,1:06:48.93,1:06:53.02,Default,,0000,0000,0000,,output. So I'll go in and actually on my test set I'll just tell the algorithm where the Dialogue: 0,1:06:53.02,1:06:55.13,Default,,0000,0000,0000,,face is.
And if I do that, Dialogue: 0,1:06:55.13,1:06:59.05,Default,,0000,0000,0000,,let's say my algorithm's accuracy increases to 91 percent, Dialogue: 0,1:06:59.05,1:07:02.52,Default,,0000,0000,0000,,and so on. And then I'll go for each of these components Dialogue: 0,1:07:02.52,1:07:05.02,Default,,0000,0000,0000,,and just give it Dialogue: 0,1:07:05.02,1:07:08.66,Default,,0000,0000,0000,,the ground-truth label for each of the components, Dialogue: 0,1:07:08.66,1:07:11.64,Default,,0000,0000,0000,,because say, like, the nose segmentation algorithm's trying to figure out Dialogue: 0,1:07:11.64,1:07:13.22,Default,,0000,0000,0000,,where the nose is. I just go in Dialogue: 0,1:07:13.22,1:07:16.59,Default,,0000,0000,0000,,and tell it where the nose is so that it doesn't have to figure that out. Dialogue: 0,1:07:16.59,1:07:20.56,Default,,0000,0000,0000,,And as I do this, one component after the other, you know, I end up giving it the correct output Dialogue: 0,1:07:20.56,1:07:23.65,Default,,0000,0000,0000,,label and end up with 100 percent accuracy. Dialogue: 0,1:07:23.65,1:07:27.00,Default,,0000,0000,0000,,And now you can look at this table - I'm sorry this is cut off on the bottom, Dialogue: 0,1:07:27.00,1:07:29.12,Default,,0000,0000,0000,,it says logistic regression 100 percent. Now you can Dialogue: 0,1:07:29.12,1:07:30.72,Default,,0000,0000,0000,,look at this Dialogue: 0,1:07:30.72,1:07:31.67,Default,,0000,0000,0000,,table and Dialogue: 0,1:07:31.67,1:07:33.01,Default,,0000,0000,0000,,see, Dialogue: 0,1:07:33.01,1:07:36.08,Default,,0000,0000,0000,,you know, how much giving the ground-truth labels for each of these Dialogue: 0,1:07:36.08,1:07:39.03,Default,,0000,0000,0000,,components could help boost your final performance. Dialogue: 0,1:07:39.03,1:07:42.42,Default,,0000,0000,0000,,In particular, if you look at this table, you notice that Dialogue: 0,1:07:42.42,1:07:45.27,Default,,0000,0000,0000,,when I added the face detection ground-truth, Dialogue: 0,1:07:45.27,1:07:48.28,Default,,0000,0000,0000,,my performance jumped from 85.1 percent accuracy Dialogue: 0,1:07:48.28,1:07:50.62,Default,,0000,0000,0000,,to 91 percent accuracy. Right? Dialogue: 0,1:07:50.62,1:07:54.53,Default,,0000,0000,0000,,So this tells me that if only I can get better face detection, Dialogue: 0,1:07:54.53,1:07:58.03,Default,,0000,0000,0000,,maybe I can boost my accuracy by 6 percent. Dialogue: 0,1:07:58.03,1:08:00.50,Default,,0000,0000,0000,,Whereas in contrast, when I, Dialogue: 0,1:08:00.50,1:08:04.35,Default,,0000,0000,0000,,you know, say, plugged in better Dialogue: 0,1:08:04.35,1:08:07.06,Default,,0000,0000,0000,,background removal, my accuracy improved from 85 Dialogue: 0,1:08:07.06,1:08:08.67,Default,,0000,0000,0000,,to 85.1 percent. Dialogue: 0,1:08:08.67,1:08:11.52,Default,,0000,0000,0000,,And so, this sort of diagnostic also tells you that if your goal Dialogue: 0,1:08:11.52,1:08:13.87,Default,,0000,0000,0000,,is to improve the system, it's probably a waste of Dialogue: 0,1:08:13.87,1:08:17.68,Default,,0000,0000,0000,,your time to try to improve your background subtraction. Because Dialogue: 0,1:08:17.68,1:08:19.22,Default,,0000,0000,0000,,even if you got the ground-truth, Dialogue: 0,1:08:19.22,1:08:22.06,Default,,0000,0000,0000,,this gives you, at most, a 0.1 percent accuracy gain, Dialogue: 0,1:08:22.06,1:08:24.60,Default,,0000,0000,0000,,whereas if you do better face detection, maybe there's a much Dialogue: 0,1:08:24.60,1:08:26.40,Default,,0000,0000,0000,,larger potential for gains there. Okay?
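Sketched in code, that plug-in-the-ground-truth procedure might look like the following. This is a sketch only: evaluate_pipeline is a hypothetical stand-in for evaluation code you would write for your own system, and the intermediate accuracies beyond the 85 / 85.1 / 91 / 100 percent figures quoted above are invented.

def error_analysis(evaluate_pipeline, components):
    # evaluate_pipeline(given) runs the pipeline on the test set, substituting
    # ground-truth output for every component in the list 'given', and returns accuracy.
    rows = [("overall system", evaluate_pipeline(None))]
    for i, comp in enumerate(components):
        rows.append(("+ ground-truth " + comp, evaluate_pipeline(components[: i + 1])))
    prev = rows[0][1]
    for name, acc in rows:
        print(f"{name:40s} {acc:6.1%}   gain {acc - prev:+.1%}")
        prev = acc

pipeline = ["background removal", "face detection", "eye segmentation",
            "nose segmentation", "mouth segmentation", "logistic regression"]

# Toy stand-in evaluator that just returns made-up accuracies so the sketch runs:
fake_accuracy = {0: 0.850, 1: 0.851, 2: 0.910, 3: 0.940, 4: 0.960, 5: 0.980, 6: 1.000}
error_analysis(lambda given: fake_accuracy[0 if given is None else len(given)], pipeline)

The component whose ground truth buys the biggest jump (face detection here, 85.1 to 91 percent) is the one worth spending your time on.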
Dialogue: 0,1:08:26.40,1:08:28.67,Default,,0000,0000,0000,,So this sort of diagnostic, Dialogue: 0,1:08:28.67,1:08:29.90,Default,,0000,0000,0000,,again, Dialogue: 0,1:08:29.90,1:08:33.15,Default,,0000,0000,0000,,is very useful because if your goal is to improve the system, Dialogue: 0,1:08:33.15,1:08:35.100,Default,,0000,0000,0000,,there are so many different pieces you can easily choose to spend the next three Dialogue: 0,1:08:35.100,1:08:36.65,Default,,0000,0000,0000,,months on. Right? Dialogue: 0,1:08:36.65,1:08:39.26,Default,,0000,0000,0000,,And choosing the right piece Dialogue: 0,1:08:39.26,1:08:42.80,Default,,0000,0000,0000,,is critical, and this sort of diagnostic tells you what's the piece that may Dialogue: 0,1:08:42.80,1:08:48.73,Default,,0000,0000,0000,,actually be worth your time to work on. Dialogue: 0,1:08:48.73,1:08:51.71,Default,,0000,0000,0000,,There's sort of another type of analysis that's sort of the opposite of what I just Dialogue: 0,1:08:51.71,1:08:53.37,Default,,0000,0000,0000,,talked about. Dialogue: 0,1:08:53.37,1:08:55.47,Default,,0000,0000,0000,,The error analysis I just talked about Dialogue: 0,1:08:55.47,1:08:58.28,Default,,0000,0000,0000,,tries to explain the difference between the current performance and perfect Dialogue: 0,1:08:58.28,1:08:59.77,Default,,0000,0000,0000,,performance, Dialogue: 0,1:08:59.77,1:09:03.62,Default,,0000,0000,0000,,whereas this sort of ablative analysis tries to explain the difference Dialogue: 0,1:09:03.62,1:09:09.12,Default,,0000,0000,0000,,between some baseline - some really bad performance - and your current performance. Dialogue: 0,1:09:09.12,1:09:13.09,Default,,0000,0000,0000,,So for this example, let's suppose you've built a very good anti-spam classifier by Dialogue: 0,1:09:13.09,1:09:17.15,Default,,0000,0000,0000,,adding lots of clever features to your logistic regression algorithm. Right? So you added Dialogue: 0,1:09:17.15,1:09:20.69,Default,,0000,0000,0000,,features for spelling correction, for, you know, sender host features, for email header Dialogue: 0,1:09:20.69,1:09:21.41,Default,,0000,0000,0000,,features, Dialogue: 0,1:09:21.41,1:09:24.80,Default,,0000,0000,0000,,email text parser features, JavaScript parser features, Dialogue: 0,1:09:24.80,1:09:26.84,Default,,0000,0000,0000,,features for embedded images, and so on. Dialogue: 0,1:09:26.84,1:09:30.23,Default,,0000,0000,0000,,So now let's say you've built the system and you want to figure out, you know, how well did Dialogue: 0,1:09:30.23,1:09:33.80,Default,,0000,0000,0000,,each of these - how much did each of these components actually contribute? Maybe you want Dialogue: 0,1:09:33.80,1:09:37.13,Default,,0000,0000,0000,,to write a research paper and claim this was the piece that made the Dialogue: 0,1:09:37.13,1:09:40.95,Default,,0000,0000,0000,,big difference. Can you actually document that claim and justify it? Dialogue: 0,1:09:40.95,1:09:43.32,Default,,0000,0000,0000,,So in ablative analysis, Dialogue: 0,1:09:43.32,1:09:44.57,Default,,0000,0000,0000,,here's what we do. Dialogue: 0,1:09:44.57,1:09:46.33,Default,,0000,0000,0000,,So in this example, Dialogue: 0,1:09:46.33,1:09:49.67,Default,,0000,0000,0000,,let's say that simple logistic regression without any of your clever Dialogue: 0,1:09:49.67,1:09:52.09,Default,,0000,0000,0000,,improvements gets 94 percent performance.
And Dialogue: 0,1:09:52.09,1:09:55.48,Default,,0000,0000,0000,,you want to figure out what accounts for your improvement from 94 to Dialogue: 0,1:09:55.48,1:09:58.43,Default,,0000,0000,0000,,99.9 percent performance. Dialogue: 0,1:09:58.43,1:10:03.28,Default,,0000,0000,0000,,So in ablative analysis, instead of adding components one at a time, we'll instead Dialogue: 0,1:10:03.28,1:10:06.84,Default,,0000,0000,0000,,remove components one at a time to see how the performance changes. Dialogue: 0,1:10:06.84,1:10:11.46,Default,,0000,0000,0000,,So start with your overall system, which is 99.9 percent accuracy. Dialogue: 0,1:10:11.46,1:10:14.13,Default,,0000,0000,0000,,And then we remove spelling correction and see how much performance Dialogue: 0,1:10:14.13,1:10:15.39,Default,,0000,0000,0000,,drops. Dialogue: 0,1:10:15.39,1:10:22.39,Default,,0000,0000,0000,,Then we'll remove the sender host features and see how much performance drops, and so on. All right? And so, Dialogue: 0,1:10:24.22,1:10:28.15,Default,,0000,0000,0000,,in this contrived example, Dialogue: 0,1:10:28.15,1:10:31.12,Default,,0000,0000,0000,,you see that, I guess, the biggest drop Dialogue: 0,1:10:31.12,1:10:32.38,Default,,0000,0000,0000,,occurred when you removed Dialogue: 0,1:10:32.38,1:10:37.56,Default,,0000,0000,0000,,the text parser features. And so you can then make a credible case that, Dialogue: 0,1:10:37.56,1:10:41.28,Default,,0000,0000,0000,,you know, the text parser features were what really made the biggest difference here. Okay? Dialogue: 0,1:10:41.28,1:10:42.70,Default,,0000,0000,0000,,And you can also tell, Dialogue: 0,1:10:42.70,1:10:45.53,Default,,0000,0000,0000,,for instance, that, Dialogue: 0,1:10:45.53,1:10:49.36,Default,,0000,0000,0000,,removing the sender host features on this Dialogue: 0,1:10:49.36,1:10:52.28,Default,,0000,0000,0000,,line, right, performance dropped from 99.9 to 98.9. And so this also means Dialogue: 0,1:10:52.28,1:10:53.14,Default,,0000,0000,0000,,that Dialogue: 0,1:10:53.14,1:10:56.45,Default,,0000,0000,0000,,in case you want to get rid of the sender host features to speed up Dialogue: 0,1:10:56.45,1:11:03.45,Default,,0000,0000,0000,,computation, that would be a good candidate for elimination. Okay? Student:Are there any Dialogue: 0,1:11:03.63,1:11:05.42,Default,,0000,0000,0000,,guarantees that if you shuffle around the order in which Dialogue: 0,1:11:05.42,1:11:06.42,Default,,0000,0000,0000,,you drop those Dialogue: 0,1:11:06.42,1:11:09.58,Default,,0000,0000,0000,,features that you'll get the same - Instructor (Andrew Ng):Yeah. Let's address the question: what if you shuffle the order in which you remove things? The answer is no. There's Dialogue: 0,1:11:09.58,1:11:12.11,Default,,0000,0000,0000,,no guarantee you'd get a similar result. Dialogue: 0,1:11:12.11,1:11:13.89,Default,,0000,0000,0000,,So in practice, Dialogue: 0,1:11:13.89,1:11:17.73,Default,,0000,0000,0000,,sometimes there's a fairly natural ordering for both types of analyses, the error Dialogue: 0,1:11:17.73,1:11:19.33,Default,,0000,0000,0000,,analysis and the ablative analysis, Dialogue: 0,1:11:19.33,1:11:22.75,Default,,0000,0000,0000,,sometimes there's a fairly natural ordering in which you add things or remove things, Dialogue: 0,1:11:22.75,1:11:24.56,Default,,0000,0000,0000,,sometimes there isn't.
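Here is what that remove-one-component-at-a-time analysis might look like as code, under the same caveat: train_and_evaluate is a hypothetical function that retrains your spam classifier on the listed feature groups and returns test accuracy, and a toy stand-in is used here so the sketch runs.

feature_groups = ["spelling correction", "sender host features", "email header features",
                  "text parser features", "javascript parser features", "embedded image features"]

def ablative_analysis(train_and_evaluate, groups):
    remaining = list(groups)
    prev = train_and_evaluate(remaining)                 # full system, e.g. ~99.9%
    print(f"{'full system':40s} {prev:6.1%}")
    for g in groups:
        remaining.remove(g)                              # drop this component and retrain
        acc = train_and_evaluate(remaining)
        print(f"{'- removed ' + g:40s} {acc:6.1%}   drop {acc - prev:+.1%}")
        prev = acc
    # The removal that causes the largest drop is the component you can most
    # credibly claim made the difference between the baseline and the full system.

# Toy evaluator: pretends each remaining feature group adds one percent over a ~94% baseline.
ablative_analysis(lambda groups: 0.939 + 0.01 * len(groups), feature_groups)

A common variant, which comes up in a moment, is leave-one-out ablation: remove a single component, measure, put it back, and repeat for each component, rather than removing them cumulatively.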
And Dialogue: 0,1:11:24.56,1:11:28.47,Default,,0000,0000,0000,,quite often, you either choose one ordering and just go for it Dialogue: 0,1:11:28.47,1:11:32.07,Default,,0000,0000,0000,,or the other. And don't think of these analyses as sort of fixed formulas, though; I mean Dialogue: 0,1:11:32.07,1:11:35.24,Default,,0000,0000,0000,,feel free to invent your own, as well. You know Dialogue: 0,1:11:35.24,1:11:36.64,Default,,0000,0000,0000,,one of the things Dialogue: 0,1:11:36.64,1:11:37.92,Default,,0000,0000,0000,,that's done quite often is Dialogue: 0,1:11:37.92,1:11:39.31,Default,,0000,0000,0000,,take the overall system Dialogue: 0,1:11:39.31,1:11:43.29,Default,,0000,0000,0000,,and just remove one and then put it back, then remove a different one Dialogue: 0,1:11:43.29,1:11:48.13,Default,,0000,0000,0000,,then put it back until all of these things are done. Okay. Dialogue: 0,1:11:48.13,1:11:51.01,Default,,0000,0000,0000,,So the very last thing I want to talk about is sort of this Dialogue: 0,1:11:51.01,1:11:57.98,Default,,0000,0000,0000,,general advice for how to get started on a learning problem. So Dialogue: 0,1:11:57.98,1:12:03.84,Default,,0000,0000,0000,,here's a cartoon description of two broad ways to get started on a learning problem. Dialogue: 0,1:12:03.84,1:12:05.74,Default,,0000,0000,0000,,The first one is Dialogue: 0,1:12:05.74,1:12:07.61,Default,,0000,0000,0000,,carefully design your system, so Dialogue: 0,1:12:07.61,1:12:11.74,Default,,0000,0000,0000,,you spend a long time designing exactly the right features, collecting the right data set, and Dialogue: 0,1:12:11.74,1:12:14.19,Default,,0000,0000,0000,,designing the right algorithmic structure, then you Dialogue: 0,1:12:14.19,1:12:17.68,Default,,0000,0000,0000,,implement it and hope it works. All right? Dialogue: 0,1:12:17.68,1:12:21.04,Default,,0000,0000,0000,,The benefit of this sort of approach is you get maybe nicer, maybe more scalable Dialogue: 0,1:12:21.04,1:12:22.43,Default,,0000,0000,0000,,algorithms, Dialogue: 0,1:12:22.43,1:12:26.72,Default,,0000,0000,0000,,and maybe you come up with new elegant learning algorithms. And if your goal is to, Dialogue: 0,1:12:26.72,1:12:30.76,Default,,0000,0000,0000,,you know, contribute to basic research in machine learning, if your goal is to invent new machine learning Dialogue: 0,1:12:30.76,1:12:31.50,Default,,0000,0000,0000,,algorithms, Dialogue: 0,1:12:31.50,1:12:33.55,Default,,0000,0000,0000,,this process of slowing down and Dialogue: 0,1:12:33.55,1:12:36.30,Default,,0000,0000,0000,,thinking deeply about the problem, you know, that is sort of the right way to go Dialogue: 0,1:12:36.30,1:12:37.12,Default,,0000,0000,0000,,about it - Dialogue: 0,1:12:37.12,1:12:41.10,Default,,0000,0000,0000,,think deeply about a problem and invent new solutions. Dialogue: 0,1:12:41.10,1:12:42.28,Default,,0000,0000,0000,, Dialogue: 0,1:12:42.28,1:12:44.08,Default,,0000,0000,0000,,The second sort of approach Dialogue: 0,1:12:44.08,1:12:48.84,Default,,0000,0000,0000,,is what I call build-and-fix, which is you implement something quick and dirty Dialogue: 0,1:12:48.84,1:12:52.31,Default,,0000,0000,0000,,and then you run error analyses and diagnostics to figure out what's wrong and Dialogue: 0,1:12:52.31,1:12:54.20,Default,,0000,0000,0000,,you fix those errors. Dialogue: 0,1:12:54.20,1:12:58.13,Default,,0000,0000,0000,,The benefit of this second type of approach is that it'll often get your Dialogue: 0,1:12:58.13,1:13:01.12,Default,,0000,0000,0000,,application working much more quickly.
Dialogue: 0,1:13:01.12,1:13:04.40,Default,,0000,0000,0000,,And especially with those of you, if you end up working in a company, and sometimes - if you end up working in Dialogue: 0,1:13:04.40,1:13:05.55,Default,,0000,0000,0000,,a company, Dialogue: 0,1:13:05.55,1:13:07.46,Default,,0000,0000,0000,,you know, very often it's not Dialogue: 0,1:13:07.46,1:13:10.90,Default,,0000,0000,0000,,the best product that wins; it's the first product to market that Dialogue: 0,1:13:10.90,1:13:11.69,Default,,0000,0000,0000,,wins. And Dialogue: 0,1:13:11.69,1:13:14.87,Default,,0000,0000,0000,,so there's - especially in the industry. There's really something to be said for, Dialogue: 0,1:13:14.87,1:13:18.79,Default,,0000,0000,0000,,you know, building a system quickly and getting it deployed quickly. Dialogue: 0,1:13:18.79,1:13:23.14,Default,,0000,0000,0000,,And the second approach of building a quick-and-dirty, I'm gonna say hack Dialogue: 0,1:13:23.14,1:13:26.47,Default,,0000,0000,0000,,and then fixing the problems will actually get you to a Dialogue: 0,1:13:26.47,1:13:27.84,Default,,0000,0000,0000,,system that works well Dialogue: 0,1:13:27.84,1:13:30.97,Default,,0000,0000,0000,,much more quickly. Dialogue: 0,1:13:30.97,1:13:32.65,Default,,0000,0000,0000,,And the reason is Dialogue: 0,1:13:32.65,1:13:36.15,Default,,0000,0000,0000,,very often it's really not clear what parts of a system are easier to think of to Dialogue: 0,1:13:36.15,1:13:37.59,Default,,0000,0000,0000,,build and therefore what Dialogue: 0,1:13:37.59,1:13:40.18,Default,,0000,0000,0000,,you need to spends lot of time focusing on. Dialogue: 0,1:13:40.18,1:13:43.42,Default,,0000,0000,0000,,So there's that example I talked about just now. Right? Dialogue: 0,1:13:43.42,1:13:46.93,Default,,0000,0000,0000,,For identifying Dialogue: 0,1:13:46.93,1:13:48.71,Default,,0000,0000,0000,,people, say. Dialogue: 0,1:13:48.71,1:13:53.20,Default,,0000,0000,0000,,And with a big complicated learning system like this, a big complicated pipeline like this, Dialogue: 0,1:13:53.20,1:13:55.59,Default,,0000,0000,0000,,it's really not obvious at the outset Dialogue: 0,1:13:55.59,1:13:59.13,Default,,0000,0000,0000,,which of these components you should spend lots of time working on. Right? And if Dialogue: 0,1:13:59.13,1:14:00.96,Default,,0000,0000,0000,,you didn't know that Dialogue: 0,1:14:00.96,1:14:03.80,Default,,0000,0000,0000,,preprocessing wasn't the right component, you could easily have Dialogue: 0,1:14:03.80,1:14:07.27,Default,,0000,0000,0000,,spent three months working on better background subtraction, not knowing that it's Dialogue: 0,1:14:07.27,1:14:09.88,Default,,0000,0000,0000,,just not gonna ultimately matter. Dialogue: 0,1:14:09.88,1:14:10.77,Default,,0000,0000,0000,,And so Dialogue: 0,1:14:10.77,1:14:13.69,Default,,0000,0000,0000,,the only way to find out what really works was inputting something quickly and Dialogue: 0,1:14:13.69,1:14:15.35,Default,,0000,0000,0000,,you find out what parts - Dialogue: 0,1:14:15.35,1:14:16.89,Default,,0000,0000,0000,,and find out Dialogue: 0,1:14:16.89,1:14:17.89,Default,,0000,0000,0000,,what parts Dialogue: 0,1:14:17.89,1:14:21.36,Default,,0000,0000,0000,,are really the hard parts to implement, or what parts are hard parts that could make a Dialogue: 0,1:14:21.36,1:14:23.08,Default,,0000,0000,0000,,difference in performance. 
Dialogue: 0,1:14:23.08,1:14:26.58,Default,,0000,0000,0000,,In fact, I'd say that if your goal is to build a Dialogue: 0,1:14:26.58,1:14:29.31,Default,,0000,0000,0000,,people recognition system, a system like this is actually far too Dialogue: 0,1:14:29.31,1:14:31.64,Default,,0000,0000,0000,,complicated as your initial system. Dialogue: 0,1:14:31.64,1:14:35.56,Default,,0000,0000,0000,,Maybe after you've prototyped a few systems, you'll converge on a system like this. But if this Dialogue: 0,1:14:35.56,1:14:42.56,Default,,0000,0000,0000,,is your first system you're designing, this is much too complicated. Also, this is a Dialogue: 0,1:14:43.57,1:14:48.06,Default,,0000,0000,0000,,very concrete piece of advice, and this applies to your projects as well. Dialogue: 0,1:14:48.06,1:14:51.23,Default,,0000,0000,0000,,If your goal is to build a working application, Dialogue: 0,1:14:51.23,1:14:55.26,Default,,0000,0000,0000,,Step 1 is actually probably not to design a system like this. Step 1 is to plot your Dialogue: 0,1:14:55.26,1:14:57.28,Default,,0000,0000,0000,,data. Dialogue: 0,1:14:57.28,1:15:01.22,Default,,0000,0000,0000,,And very often, if you just take the data you're trying to predict and just plot your Dialogue: 0,1:15:01.22,1:15:05.73,Default,,0000,0000,0000,,data, plot X, plot Y, plot your data every way you can think of, Dialogue: 0,1:15:05.73,1:15:10.31,Default,,0000,0000,0000,,you know, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they Dialogue: 0,1:15:10.31,1:15:13.90,Default,,0000,0000,0000,,should be positive. Something's wrong with this dataset." And it's about Dialogue: 0,1:15:13.90,1:15:18.39,Default,,0000,0000,0000,,half the time you find something obviously wrong with your data or something very surprising. Dialogue: 0,1:15:18.39,1:15:21.57,Default,,0000,0000,0000,,And this is something you find out just by plotting your data, and that you Dialogue: 0,1:15:21.57,1:15:28.18,Default,,0000,0000,0000,,won't find out by implementing these big complicated learning algorithms on it. Plotting Dialogue: 0,1:15:28.18,1:15:31.92,Default,,0000,0000,0000,,the data sounds so simple, but it's one of the pieces of advice that lots of us give and Dialogue: 0,1:15:31.92,1:15:38.57,Default,,0000,0000,0000,,hardly anyone follows, so you can take that for what it's worth. Dialogue: 0,1:15:38.57,1:15:42.20,Default,,0000,0000,0000,,Let me just reiterate, what I just said here may be bad advice Dialogue: 0,1:15:42.20,1:15:44.02,Default,,0000,0000,0000,,if your goal is to come up with Dialogue: 0,1:15:44.02,1:15:46.64,Default,,0000,0000,0000,,new machine learning algorithms. All right? So Dialogue: 0,1:15:46.64,1:15:51.02,Default,,0000,0000,0000,,for me personally, the learning algorithm I use the most often is probably Dialogue: 0,1:15:51.02,1:15:53.60,Default,,0000,0000,0000,,logistic regression because I have code lying around. So give me a Dialogue: 0,1:15:53.60,1:15:56.77,Default,,0000,0000,0000,,learning problem, I probably won't try anything more complicated than logistic Dialogue: 0,1:15:56.77,1:15:58.26,Default,,0000,0000,0000,,regression on it first. And it's Dialogue: 0,1:15:58.26,1:16:01.94,Default,,0000,0000,0000,,only after trying something really simple and figuring out what's easy, what's hard, that you know Dialogue: 0,1:16:01.94,1:16:03.94,Default,,0000,0000,0000,,where to focus your efforts.
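For a class project, that advice translates into something as simple as the sketch below: look at the data before fitting anything, then try the simplest classifier you have lying around. The data here is a randomly generated placeholder, and scikit-learn and matplotlib are just one convenient choice, not anything prescribed by the lecture.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1: just plot the data, every view of it you can think of.
# (Placeholder data; load your real X and y here.)
X = np.random.randn(500, 2)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.xlabel("feature 0")
plt.ylabel("feature 1")
plt.title("raw data, colored by label")
plt.show()

# Step 2: try the simplest thing first (logistic regression) to see what's easy and what's hard.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
baseline = LogisticRegression().fit(X_train, y_train)
print("baseline test accuracy:", baseline.score(X_test, y_test))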
But Dialogue: 0,1:16:03.94,1:16:07.61,Default,,0000,0000,0000,,again, if your goal is to invent new machine learning algorithms, then you sort of don't Dialogue: 0,1:16:07.61,1:16:10.75,Default,,0000,0000,0000,,want to hack up something and then add another hack to fix it, and hack it even more to Dialogue: 0,1:16:10.75,1:16:12.22,Default,,0000,0000,0000,,fix it. Right? So if Dialogue: 0,1:16:12.22,1:16:15.92,Default,,0000,0000,0000,,your goal is to do novel machine learning research, then it pays to think more deeply about the Dialogue: 0,1:16:15.92,1:16:21.34,Default,,0000,0000,0000,,problem and not gonna follow this specifically. Dialogue: 0,1:16:21.34,1:16:22.92,Default,,0000,0000,0000,,Shoot, you know what? All Dialogue: 0,1:16:22.92,1:16:28.28,Default,,0000,0000,0000,,right, sorry if I'm late but I just have two more slides so I'm gonna go through these quickly. Dialogue: 0,1:16:28.28,1:16:30.62,Default,,0000,0000,0000,,And so, this is what I think Dialogue: 0,1:16:30.62,1:16:33.46,Default,,0000,0000,0000,,of as premature statistical optimization, Dialogue: 0,1:16:33.46,1:16:35.08,Default,,0000,0000,0000,,where quite often, Dialogue: 0,1:16:35.08,1:16:38.32,Default,,0000,0000,0000,,just like premature optimization of code, quite often Dialogue: 0,1:16:38.32,1:16:44.37,Default,,0000,0000,0000,,people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more Dialogue: 0,1:16:44.37,1:16:46.95,Default,,0000,0000,0000,,slides. This Dialogue: 0,1:16:46.95,1:16:48.54,Default,,0000,0000,0000,,was - Dialogue: 0,1:16:48.54,1:16:52.07,Default,,0000,0000,0000,,this is a sort of cartoon that highly influenced my own thinking. It was based on Dialogue: 0,1:16:52.07,1:16:55.34,Default,,0000,0000,0000,,a paper written by Christos Papadimitriou. Dialogue: 0,1:16:55.34,1:16:57.43,Default,,0000,0000,0000,,This is how Dialogue: 0,1:16:57.43,1:16:59.36,Default,,0000,0000,0000,,progress - this is how Dialogue: 0,1:16:59.36,1:17:02.36,Default,,0000,0000,0000,,developmental progress of research often happens. Right? Dialogue: 0,1:17:02.36,1:17:05.56,Default,,0000,0000,0000,,Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. And it Dialogue: 0,1:17:05.56,1:17:06.52,Default,,0000,0000,0000,,seems like a useful thing to have. Dialogue: 0,1:17:06.52,1:17:09.67,Default,,0000,0000,0000,,Right? You know free up people, don't have Dialogue: 0,1:17:09.67,1:17:12.76,Default,,0000,0000,0000,,to deliver mail. So what - Dialogue: 0,1:17:12.76,1:17:14.28,Default,,0000,0000,0000,,to deliver mail, Dialogue: 0,1:17:14.28,1:17:19.14,Default,,0000,0000,0000,,obviously you need a robot to wander around indoor environments and you need a robot to Dialogue: 0,1:17:19.14,1:17:21.48,Default,,0000,0000,0000,,manipulate objects and pickup envelopes. And so, Dialogue: 0,1:17:21.48,1:17:24.89,Default,,0000,0000,0000,,you need to build those two components in order to get a mail delivery robot. And Dialogue: 0,1:17:24.89,1:17:25.59,Default,,0000,0000,0000,,so I've Dialogue: 0,1:17:25.59,1:17:29.65,Default,,0000,0000,0000,,drawing those two components and little arrows to denote that, you know, obstacle avoidance Dialogue: 0,1:17:29.65,1:17:30.46,Default,,0000,0000,0000,,is Dialogue: 0,1:17:30.46,1:17:32.23,Default,,0000,0000,0000,,needed or would help build Dialogue: 0,1:17:32.23,1:17:35.51,Default,,0000,0000,0000,,your mail delivery robot. 
Well Dialogue: 0,1:17:35.51,1:17:37.19,Default,,0000,0000,0000,,for obstacle for avoidance, Dialogue: 0,1:17:37.19,1:17:43.16,Default,,0000,0000,0000,,clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. Dialogue: 0,1:17:43.16,1:17:46.84,Default,,0000,0000,0000,,Now we're gonna use computer vision to detect the objects. And so, Dialogue: 0,1:17:46.84,1:17:51.12,Default,,0000,0000,0000,,we know that, you know, lighting sometimes changes, right, depending on whether it's the Dialogue: 0,1:17:51.12,1:17:52.71,Default,,0000,0000,0000,,morning or noontime or evening. This Dialogue: 0,1:17:52.71,1:17:53.93,Default,,0000,0000,0000,,is lighting Dialogue: 0,1:17:53.93,1:17:56.64,Default,,0000,0000,0000,,causes the color of things to change, and so you need Dialogue: 0,1:17:56.64,1:18:00.51,Default,,0000,0000,0000,,an object detection system that's invariant to the specific colors of an Dialogue: 0,1:18:00.51,1:18:01.20,Default,,0000,0000,0000,,object. Right? Dialogue: 0,1:18:01.20,1:18:04.42,Default,,0000,0000,0000,,Because lighting Dialogue: 0,1:18:04.42,1:18:05.40,Default,,0000,0000,0000,,changes, Dialogue: 0,1:18:05.40,1:18:09.85,Default,,0000,0000,0000,,say. Well color, or RGB values, is represented by three-dimensional vectors. And Dialogue: 0,1:18:09.85,1:18:11.17,Default,,0000,0000,0000,,so you need to learn Dialogue: 0,1:18:11.17,1:18:13.50,Default,,0000,0000,0000,,when two colors might be the same thing, Dialogue: 0,1:18:13.50,1:18:15.26,Default,,0000,0000,0000,,when two, you know, Dialogue: 0,1:18:15.26,1:18:18.16,Default,,0000,0000,0000,,visual appearance of two colors may be the same thing as just the lighting change or Dialogue: 0,1:18:18.16,1:18:19.54,Default,,0000,0000,0000,,something. Dialogue: 0,1:18:19.54,1:18:20.59,Default,,0000,0000,0000,,And Dialogue: 0,1:18:20.59,1:18:24.06,Default,,0000,0000,0000,,to understand that properly, we can go out and study differential geometry Dialogue: 0,1:18:24.06,1:18:27.51,Default,,0000,0000,0000,,of 3d manifolds because that helps us build a sound theory on which Dialogue: 0,1:18:27.51,1:18:32.25,Default,,0000,0000,0000,,to develop our 3d similarity learning algorithms. Dialogue: 0,1:18:32.25,1:18:36.16,Default,,0000,0000,0000,,And to really understand the fundamental aspects of this problem, Dialogue: 0,1:18:36.16,1:18:40.11,Default,,0000,0000,0000,,we have to study the complexity of non-Riemannian geometries. And on Dialogue: 0,1:18:40.11,1:18:43.85,Default,,0000,0000,0000,,and on it goes until eventually you're proving convergence bounds for Dialogue: 0,1:18:43.85,1:18:49.79,Default,,0000,0000,0000,,sampled of non-monotonic logic. I don't even know what this is because I just made it up. Dialogue: 0,1:18:49.79,1:18:51.53,Default,,0000,0000,0000,,Whereas in reality, Dialogue: 0,1:18:51.53,1:18:53.97,Default,,0000,0000,0000,,you know, chances are that link isn't real. Dialogue: 0,1:18:53.97,1:18:55.66,Default,,0000,0000,0000,,Color variance Dialogue: 0,1:18:55.66,1:18:59.55,Default,,0000,0000,0000,,just barely helped object recognition maybe. I'm making this up. Dialogue: 0,1:18:59.55,1:19:03.50,Default,,0000,0000,0000,,Maybe differential geometry was hardly gonna help 3d similarity learning and that link's also gonna fail. Okay? 
Dialogue: 0,1:19:03.50,1:19:05.27,Default,,0000,0000,0000,,So, each of Dialogue: 0,1:19:05.27,1:19:09.13,Default,,0000,0000,0000,,these circles can represent a person, or a research community, or a thought in your Dialogue: 0,1:19:09.13,1:19:12.02,Default,,0000,0000,0000,,head. And there's a very real chance that Dialogue: 0,1:19:12.02,1:19:15.47,Default,,0000,0000,0000,,maybe there are all these papers written on differential geometry of 3d manifolds, and they are Dialogue: 0,1:19:15.47,1:19:18.57,Default,,0000,0000,0000,,written because some guy once told someone else that it'll help 3d similarity learning. Dialogue: 0,1:19:18.57,1:19:20.49,Default,,0000,0000,0000,,And, Dialogue: 0,1:19:20.49,1:19:23.37,Default,,0000,0000,0000,,you know, it's like "A friend of mine told me that color invariance would help in Dialogue: 0,1:19:23.37,1:19:26.12,Default,,0000,0000,0000,,object recognition, so I'm working on color invariance. And now I'm gonna tell a friend Dialogue: 0,1:19:26.12,1:19:27.44,Default,,0000,0000,0000,,of mine Dialogue: 0,1:19:27.44,1:19:30.28,Default,,0000,0000,0000,,that his thing will help my problem. And he'll tell a friend of his that his thing will help Dialogue: 0,1:19:30.28,1:19:31.62,Default,,0000,0000,0000,,with his problem." Dialogue: 0,1:19:31.62,1:19:33.52,Default,,0000,0000,0000,,And pretty soon, you're working on Dialogue: 0,1:19:33.52,1:19:37.54,Default,,0000,0000,0000,,convergence bound for sampled non-monotonic logic, when in reality none of these will Dialogue: 0,1:19:37.54,1:19:39.13,Default,,0000,0000,0000,,see the light of Dialogue: 0,1:19:39.13,1:19:42.52,Default,,0000,0000,0000,,day of your mail delivery robot. Okay? Dialogue: 0,1:19:42.52,1:19:46.60,Default,,0000,0000,0000,,I'm not criticizing the role of theory. There are very powerful theories, like the Dialogue: 0,1:19:46.60,1:19:48.40,Default,,0000,0000,0000,,theory of VC dimension, Dialogue: 0,1:19:48.40,1:19:52.09,Default,,0000,0000,0000,,which is far, far, far to the right of this. So VC dimension is about Dialogue: 0,1:19:52.09,1:19:53.29,Default,,0000,0000,0000,,as theoretical Dialogue: 0,1:19:53.29,1:19:57.12,Default,,0000,0000,0000,,as it can get. And it's clearly had a huge impact on many applications. And there's, Dialogue: 0,1:19:57.12,1:19:59.56,Default,,0000,0000,0000,,you know, dramatically advanced data machine learning. And another example is theory of NP-hardness as again, you know, Dialogue: 0,1:19:59.56,1:20:00.75,Default,,0000,0000,0000,,is about Dialogue: 0,1:20:00.75,1:20:04.22,Default,,0000,0000,0000,,theoretical as it can get. It's Dialogue: 0,1:20:04.22,1:20:05.80,Default,,0000,0000,0000,,like a huge application Dialogue: 0,1:20:05.80,1:20:09.31,Default,,0000,0000,0000,,on all of computer science, the theory of NP-hardness. 
Dialogue: 0,1:20:09.31,1:20:10.67,Default,,0000,0000,0000,,But Dialogue: 0,1:20:10.67,1:20:13.80,Default,,0000,0000,0000,,when you are off working on highly theoretical things, I guess, to me Dialogue: 0,1:20:13.80,1:20:16.85,Default,,0000,0000,0000,,personally it's important to keep in mind Dialogue: 0,1:20:16.85,1:20:19.70,Default,,0000,0000,0000,,are you working on something like VC dimension, which is high impact, or are you Dialogue: 0,1:20:19.70,1:20:23.29,Default,,0000,0000,0000,,working on something like convergence bound for sampled nonmonotonic logic, which Dialogue: 0,1:20:23.29,1:20:24.71,Default,,0000,0000,0000,,you're only hoping Dialogue: 0,1:20:24.71,1:20:25.90,Default,,0000,0000,0000,,has some peripheral relevance Dialogue: 0,1:20:25.90,1:20:30.04,Default,,0000,0000,0000,,to some application. Okay? Dialogue: 0,1:20:30.04,1:20:34.85,Default,,0000,0000,0000,,For me personally, I tend to work on an application only if I - excuse me. Dialogue: 0,1:20:34.85,1:20:36.99,Default,,0000,0000,0000,,For me personally, and this is a personal choice, Dialogue: 0,1:20:36.99,1:20:41.34,Default,,0000,0000,0000,,I tend to trust something only if I personally can see a link from the Dialogue: 0,1:20:41.34,1:20:42.68,Default,,0000,0000,0000,,theory I'm working on Dialogue: 0,1:20:42.68,1:20:44.43,Default,,0000,0000,0000,,all the way back to an application. Dialogue: 0,1:20:44.43,1:20:46.01,Default,,0000,0000,0000,,And Dialogue: 0,1:20:46.01,1:20:50.30,Default,,0000,0000,0000,,if I don't personally see a direct link from what I'm doing to an application then, Dialogue: 0,1:20:50.30,1:20:53.43,Default,,0000,0000,0000,,you know, then that's fine. Then I can choose to work on theory, but Dialogue: 0,1:20:53.43,1:20:55.65,Default,,0000,0000,0000,,I wouldn't necessarily trust that Dialogue: 0,1:20:55.65,1:20:59.21,Default,,0000,0000,0000,,what the theory I'm working on will relate to an application, if I don't personally Dialogue: 0,1:20:59.21,1:21:02.43,Default,,0000,0000,0000,,see a link all the way back. Dialogue: 0,1:21:02.43,1:21:04.40,Default,,0000,0000,0000,,Just to summarize. Dialogue: 0,1:21:04.40,1:21:06.41,Default,,0000,0000,0000,, Dialogue: 0,1:21:06.41,1:21:08.68,Default,,0000,0000,0000,,One lesson to take away from today is I think Dialogue: 0,1:21:08.68,1:21:12.53,Default,,0000,0000,0000,,time spent coming up with diagnostics for learning algorithms is often time well spent. Dialogue: 0,1:21:12.53,1:21:13.03,Default,,0000,0000,0000,, Dialogue: 0,1:21:13.03,1:21:16.20,Default,,0000,0000,0000,,It's often up to your own ingenuity to come up with great diagnostics. And Dialogue: 0,1:21:16.20,1:21:19.02,Default,,0000,0000,0000,,just when I personally, when I work on machine learning algorithm, Dialogue: 0,1:21:19.02,1:21:21.17,Default,,0000,0000,0000,,it's not uncommon for me to be spending like Dialogue: 0,1:21:21.17,1:21:23.68,Default,,0000,0000,0000,,between a third and often half of my time Dialogue: 0,1:21:23.68,1:21:26.41,Default,,0000,0000,0000,,just writing diagnostics and trying to figure out what's going right and what's Dialogue: 0,1:21:26.41,1:21:28.08,Default,,0000,0000,0000,,going on. Dialogue: 0,1:21:28.08,1:21:31.50,Default,,0000,0000,0000,,Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and Dialogue: 0,1:21:31.50,1:21:34.78,Default,,0000,0000,0000,,making progress. 
You don't want to be spending all this time, you know, implementing tests on your Dialogue: 0,1:21:34.78,1:21:38.28,Default,,0000,0000,0000,,learning algorithms; it doesn't feel like when you're doing anything. But when Dialogue: 0,1:21:38.28,1:21:41.42,Default,,0000,0000,0000,,I implement learning algorithms, at least a third, and quite often half of Dialogue: 0,1:21:41.42,1:21:45.88,Default,,0000,0000,0000,,my time, is actually spent implementing those tests and you can figure out what to work on. And Dialogue: 0,1:21:45.88,1:21:49.22,Default,,0000,0000,0000,,I think it's actually one of the best uses of your time. Talked Dialogue: 0,1:21:49.22,1:21:50.73,Default,,0000,0000,0000,,about error Dialogue: 0,1:21:50.73,1:21:54.32,Default,,0000,0000,0000,,analyses and ablative analyses, and lastly Dialogue: 0,1:21:54.32,1:21:56.89,Default,,0000,0000,0000,,talked about, you know, different approaches and the Dialogue: 0,1:21:56.89,1:22:00.98,Default,,0000,0000,0000,,risks of premature statistical optimization. Okay. Dialogue: 0,1:22:00.98,1:22:04.34,Default,,0000,0000,0000,,Sorry I ran you over. I'll be here for a few more minutes for your questions.