This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is actually wrap up our discussion on learning theory. I'm going to start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be advice on applying machine learning algorithms to problems like the project, or other problems you may go work on after you graduate from this class. But let's start with Bayesian statistics and regularization.

So you remember, last week we started to talk about learning theory, and we learned about bias and variance. And we spent most of the previous lecture talking about algorithms for model selection and for feature selection. We talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model. So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that?
So feature selection algorithms choose a subset of the features so that you have fewer parameters and may be less likely to overfit. Right? What I want to do today is talk about a different way to prevent overfitting: there's a method called regularization, and it's a way that lets you keep all the parameters.

So here's the idea, and I'm going to illustrate it with, say, linear regression. You take the linear regression model, the very first model we learned about, and we said that we would choose the parameters via maximum likelihood. Right? And that meant that you would choose the parameters theta that maximized the probability of the data — the parameters theta that maximized the likelihood of the data we observe. Right?

And to give this sort of procedure a name, this is one example of the most common frequentist procedure. Frequentist statistics you can think of as maybe one school of statistics. And the philosophical view behind writing this down was that we envisioned there was some true parameter theta out there that generated the Xs and the Ys.
There's some true parameter theta that governs housing prices — Y as a function of X — and we don't know what the value of theta is, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta.

And the way we formulated this, theta was not a random variable. Right? That's what we said: theta is just some fixed, true value out there. It's not random or anything; we just don't know what it is, and we have a procedure called maximum likelihood for estimating it. So this is one example of what's called a frequentist procedure.

The alternative to the frequentist school of statistics is the Bayesian school, in which we're going to say that we don't know what theta is, and so we will put a prior on theta. Okay? So a Bayesian would say, "Well, I don't know the value of theta, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared times the identity: theta ~ N(0, tau^2 I). Okay?
And so — actually, let me use S to denote my training set. Right, so the prior represents my beliefs about what the parameters are in the absence of any data. Not having seen any data, it represents what I think theta is most likely to be.

And so, given the training set S, in the Bayesian procedure we would calculate the posterior probability of the parameters given my training set. Let's write this on the next board. So my posterior on my parameters given my training set — by Bayes' rule, this will be proportional to the likelihood of the data times the prior: p(theta | S) is proportional to p(S | theta) p(theta). Right? So, by Bayes' rule. Let's call it the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.
And when you now want to make a prediction on the price of a new house, with input X, I would say that the distribution over the possible housing prices for this new house I'm trying to estimate the price of — given the features X of the house and the training set I had previously — is given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set: p(Y | X, S) = the integral over theta of p(Y | X, theta) p(theta | S) d theta. Okay?

And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate over Y times that density: E[Y | X, S] = the integral over Y of Y p(Y | X, S) dY. You would take an expectation of Y with respect to your posterior distribution. Okay?

And notice that when I was writing this down, in the Bayesian formulation I started to write Y given X comma theta, because this formula now is the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta; I'm writing comma theta, because I'm now treating theta as a random variable. So all of this is somewhat abstract, but — actually, let's check.
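To make this predictive integral concrete, here is a small sketch that is not from the lecture itself: Bayesian linear regression, where (with Gaussian noise and the Gaussian prior) the posterior over theta happens to be Gaussian in closed form. The data, the noise variance sigma squared, and the prior variance tau squared are all made-up illustrative choices. The expectation E[Y | X, S] is approximated by averaging predictions over samples of theta drawn from the posterior, and checked against the exact conjugate answer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D linear data: y = 2*x + noise (illustrative assumption).
m, tau2, sigma2 = 20, 1.0, 0.25
X = rng.uniform(-1, 1, size=(m, 1))
y = 2.0 * X[:, 0] + rng.normal(0, np.sqrt(sigma2), size=m)

# With prior theta ~ N(0, tau^2 I) and Gaussian noise, the posterior
# p(theta | S) is itself Gaussian with these closed-form moments.
n = X.shape[1]
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(n) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# The predictive mean E[Y | x, S] = integral of (x . theta) p(theta | S) dtheta
# can be approximated by averaging over posterior samples of theta.
x_new = np.array([0.5])
thetas = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
mc_prediction = (thetas @ x_new).mean()

# In this conjugate case the integral also has an exact answer, x . mu_post,
# so we can check the Monte Carlo estimate against it.
exact_prediction = x_new @ mu_post
print(mc_prediction, exact_prediction)
```

In the high-dimensional, non-conjugate cases discussed next, neither the posterior moments nor the integral are available in closed form, which is what motivates the MAP approximation.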
Are there questions about this? No? Okay. Let's try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult, because if theta is an n-plus-one-dimensional parameter vector, then this is an integral over R^(n+1). And numerically, it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually this integral — actually, usually it's hard to compute the posterior over theta, and it's also hard to compute this integral if theta is very high-dimensional. There are a few exceptions for which this can be done in closed form, but for many learning algorithms — say, Bayesian logistic regression — this is hard to do.

And so what's commonly done is to take the posterior distribution, and instead of actually computing the full posterior distribution p(theta | S), we'll instead take the quantity on the right-hand side and just maximize it. So let me write this down. So commonly, instead of computing the full posterior distribution, we will choose the following. Okay?
We will choose what's called the MAP estimate, or maximum a posteriori estimate of theta, which is the most likely value of theta — the most probable value of theta under your posterior distribution. And that's just theta_MAP = arg max over theta of p(S | theta) p(theta). And then when you need to make a prediction, you would just predict using your usual hypothesis, with this MAP value of theta as the parameter vector. Okay?

And notice, the only difference between this and standard maximum likelihood estimation is that instead of just maximizing the likelihood — which is what you have for maximum likelihood estimation — you're maximizing the likelihood times this other quantity, which is the prior. Right?

And let's see, one intuition is that if your prior on theta is Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? It's a Gaussian centered around the point zero, and so most of the mass is close to zero.
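For linear regression with Gaussian noise and the Gaussian prior, the MAP estimate actually has a closed form: it is ridge regression with regularization strength sigma squared over tau squared. The following minimal sketch (not from the lecture; the synthetic data and the sigma and tau values are arbitrary assumptions) computes theta_MAP and sanity-checks that the gradient of the log-posterior vanishes there, which is what "arg max" requires:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative): y approximately linear in two features.
m, sigma2, tau2 = 30, 0.25, 1.0
X = rng.normal(size=(m, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(0, np.sqrt(sigma2), size=m)

# Maximizing p(S | theta) p(theta) here has a closed-form solution:
# ridge regression with regularization strength sigma^2 / tau^2.
lam = sigma2 / tau2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Sanity check: gradient of the log-posterior should vanish at theta_MAP.
grad = X.T @ (y - X @ theta_map) / sigma2 - theta_map / tau2
print(theta_map, np.abs(grad).max())
```

Prediction then just uses the usual hypothesis with this value plugged in, i.e. h(x) = x . theta_map.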
And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection: if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta equal to zero. All right? So if you set theta-five equal to zero, that's the same as eliminating feature five from your hypothesis.

And so this is a prior that drives most of the parameter values to values close to zero, and you can think of this as doing something analogous to — something reminiscent of — feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures, if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial — well, I think that drawing had too many bumps in it, but never mind — if you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right?
In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. As you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice — it's sort of hard to see, but you can take my word for it — as tau becomes smaller and smaller, the curves you fit to your data also become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?

Let's see. One last piece of intuition that I'll just toss out there — and you get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess — is that whereas maximum likelihood for, say, linear regression turns out to be minimizing the sum of squared errors, it turns out that if you add this prior term, the optimization objective you end up with adds an extra term — lambda times the squared norm of theta — that penalizes your parameters theta for being large.
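The shrinking effect on the five-points, fourth-order-polynomial example can be seen numerically. In this sketch (the data is made up, and `lam` stands in for 1 over tau squared, up to constants), the maximum likelihood fit interpolates all five points, while the MAP fit with a small tau pulls the coefficients toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)

# Five noisy, roughly linear points, fit with a fourth-order polynomial
# (five coefficients, the same count as data points).
x = np.linspace(-1, 1, 5)
y = x + rng.normal(0, 0.1, size=5)
X = np.vander(x, 5)  # columns x^4, x^3, ..., x^0

def map_fit(lam):
    # MAP / ridge solution; lam plays the role of 1 / tau^2 (up to constants).
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

theta_ml = map_fit(0.0)     # maximum likelihood: interpolates all 5 points
theta_map = map_fit(10.0)   # small tau (large lam): coefficients shrunk

print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))
```

The regularized coefficient vector has strictly smaller norm, which is what produces the smoother curve.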
And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect — again, it's kind of hard to see, but just take my word for it — that shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay? Hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check for questions about all this.

Student: The smoothing behavior — is it because [inaudible] actually get different [inaudible]?

Instructor: Let's see. Yeah, it depends — well, most priors with most of the mass close to zero will give this effect, I guess. And just by convention, the Gaussian prior is the one most commonly used for models like logistic regression and linear regression — generalized linear models. There are a few other priors that are sometimes used, like the Laplace prior, but all of them tend to have these sorts of smoothing effects. All right. Cool.

And so it turns out that for problems like text classification — where text classification has like 30,000 features or 50,000 features — it seems like an algorithm like logistic regression would be very much prone to overfitting. Right?
So imagine trying to build a spam classifier. Maybe you have 100 training examples but you have 30,000 features or 50,000 features; that seems clearly prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with a Gaussian prior, logistic regression becomes a very effective text classification algorithm. Alex?

Student: [Inaudible]?

Instructor: Yeah, right — how to pick either tau squared or lambda. I think the relation is lambda equals one over tau squared. But right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool.

So, all right, that was all I wanted to say about methods for preventing overfitting. What I want to do next is just spend five minutes talking about online learning. And this is sort of a digression. When you're designing the syllabus of a class, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it. Okay. So here's the idea.
So far, all the learning algorithms we've talked about are what are called batch learning algorithms, where you're given a training set, you run your learning algorithm on the training set, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning.

So here's how the problem proceeds. I'm first going to give you x1. Let's say it's a classification problem, so I'm first going to give you x1 and then ask you, "Can you make a prediction on x1? Is the label one or zero?" And you've not seen any data yet, so you make a guess; we'll call your guess y-hat-one. And after you've made your prediction, I will then reveal to you the true label y1. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent if you guess randomly.

And then I show you x2, and I ask you, "Can you make a prediction on x2?" And so you now maybe make a slightly more educated guess; call that y-hat-two. And after you've made your guess, I reveal the true label to you. And then I show you x3, you make your guess, and learning proceeds like this.
So this is in contrast to batch learning; it models settings where you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're running a website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes, and it's only as you keep making predictions that you get to see more and more training examples.

So in online learning, what you care about is the total online error, which is the sum from i equals one to m — if you get a sequence of m examples all together — of the indicator that y-hat-i is not equal to y-i. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that many of the learning algorithms you've learned about can be applied to this setting. One thing you could do, when you're asked to make the prediction y-hat-three, say, is: well, you've seen some other training examples up to this point, so you can just take your learning algorithm and run it on the examples leading up to this point.
So just run the learning algorithm on all the examples you've seen previous to being asked to make a prediction on a certain example, and then use the resulting hypothesis to make a prediction on the next example. And it turns out that there are also algorithms — especially the algorithms we saw that use stochastic gradient descent — that can be adapted very nicely to this setting.

As a concrete example, if you remember the perceptron algorithm, you would initialize the parameters theta to be equal to zero. And then, after seeing the i-th training example, you'd update the parameters — you've seen this rule a lot of times now — using the standard perceptron learning rule: theta := theta + alpha (y_i − h_theta(x_i)) x_i. And the same thing if you were using logistic regression: after seeing each training example, you'd essentially run one step of stochastic gradient descent on just the example you saw. Okay?

And the reason I've put this into the "learning theory" section of this class is that it turns out you can sometimes prove fairly amazing results on your total online error using algorithms like these.
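The online protocol with the perceptron can be sketched as follows. This is an illustration, not code from the course: the data stream, its margin, and the learning rate alpha = 1 are all made-up assumptions, and labels are in {0, 1}, so the update theta := theta + alpha (y − y-hat) x changes theta only on mistakes. The loop also tallies the total online error:

```python
import numpy as np

rng = np.random.default_rng(4)

# A linearly separable stream (illustrative): the label is 1 exactly when
# the first coordinate is positive, and |x0| >= 0.2 gives a margin.
m = 200
x0 = rng.uniform(0.2, 1.0, m) * rng.choice([-1.0, 1.0], m)
x1 = rng.uniform(-1.0, 1.0, m)
X = np.c_[x0, x1, np.ones(m)]               # constant feature for the intercept
y = (x0 > 0).astype(int)

theta = np.zeros(3)
mistakes = 0                                # the total online error

for i in range(m):
    y_hat = 1 if theta @ X[i] >= 0 else 0   # predict before the label is revealed
    mistakes += int(y_hat != y[i])
    theta += (y[i] - y_hat) * X[i]          # perceptron update (alpha = 1)

print(mistakes)
```

Because this stream is separated by a margin, the number of mistakes stays far below the number of examples, which is exactly the flavor of the perceptron result discussed next.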
I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove that when you use the perceptron algorithm, then even when the features x_i are maybe infinite-dimensional feature vectors — like we saw for support vector machines; sometimes infinite-dimensional feature vectors arise when you use kernel representations — it turns out that even when the data is extremely high-dimensional and it seems like you'd be prone to overfitting, you can prove that so long as the positive and negative examples are separated by a margin — right, so long as, in this infinite-dimensional space, there is some margin separating the positive and negative examples — the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so after making only a finite number of mistakes, it'll converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space. Okay? So, let's see.
The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I've posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. By that I mean it won't appear on the midterm, and you won't be asked about it specifically in the problem sets, but I thought — I know some of you were curious after the previous lecture about how you can prove that SVMs have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And the perceptron bound that I just talked about is the simplest instance I know of that you can read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so. So go ahead and take a look at that if you're interested. Okay? But regardless of the theoretical results, the online learning setting is something that comes up reasonably often.
And so these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today — and this is the majority of today's lecture — actually, can I switch to the PowerPoint slides, please? — I want to spend most of today's lecture talking about advice for applying different machine learning algorithms.

And so, right now, you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully, because it turns out that you can take the same machine learning tool — say, logistic regression — and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right? And so what I want to do, in the rest of the time I have today, is try to convey to you some of the methods for making sure you're one of the people who really knows how to get these learning algorithms to work well on problems.
So, just some caveats on what I'm going to talk about in the rest of today's lecture. What I want to talk about is actually not very mathematical, but it's also some of the hardest, most conceptually difficult material in this class to understand. All right? So this is not mathematical, but this is not easy.

And I want to add this caveat: some of what I'll say today is debatable. I think most good machine learning people will agree with most of what I say, but maybe not everything.

And some of what I'll say is also not good advice for doing machine learning research, so I'll say more about this later. What I'm focusing on today is advice for how to just get stuff to work — if you work at a company and you want to deliver a product, or you're building a system and you just want your machine learning system to work. Okay? Some of what I'm about to say isn't great advice if your goal is to invent a new machine learning algorithm; this is advice for how to make a machine learning algorithm work and deploy a working system.

So, three key areas I'm going to talk about. One: diagnostics for debugging learning algorithms. Second: I'll talk briefly about error analysis and ablative analysis.
And third, I want to talk about advice for how to get started on a machine learning problem. One theme that will come up later: you've heard about premature optimization in writing software, right? That's when someone over-designs from the start, when someone writing a piece of code picks a subroutine to optimize heavily, and maybe writes the subroutine in assembly. Many of us have been guilty of premature optimization: we're trying to get a piece of code to run faster, we pick some piece of code, implement it in assembly, really tune it to run quickly, and it turns out that wasn't the bottleneck in the code at all. We call that premature optimization. In undergraduate programming classes we warn people all the time not to do premature optimization, and people still do it all the time. And it turns out a very similar thing happens in building machine learning systems: many people are guilty of what I call premature statistical optimization, where they heavily optimize one part of a machine learning system, and that part turns out not to be the important piece. Okay? I'll talk about that later as well. So let's first talk about debugging learning algorithms.
As a motivating example, let's say you want to build an anti-spam system. And let's say you've carefully chosen a small set of 100 words to use as features. So instead of using 50,000 words, you've chosen a small set of 100 features for your anti-spam system. And let's say you implement Bayesian logistic regression, implement gradient descent, and get 20 percent test error, which is unacceptably high. This is Bayesian logistic regression, so it's just like maximum likelihood but with that additional lambda times norm-of-theta-squared term. And we're maximizing rather than minimizing, so there's a minus lambda norm of theta squared instead of a plus. So the question is: you implemented your Bayesian logistic regression algorithm, you tested it on your test set, and you got unacceptably high error, so what do you do next? One thing you could do is think about ways you could improve this algorithm, and this is probably what most people will do, instead of sitting down and thinking about what could have gone wrong and then trying to fix it. Well, obviously having more training data could only help, so one thing you can try is getting more training examples.
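To make that objective concrete, here is a minimal sketch of "maximum likelihood minus a lambda times norm-of-theta-squared term" as code. This is not the course's actual implementation; the toy gradient-ascent loop, learning rate, and iteration count are illustrative choices of mine.

```python
import numpy as np

def map_objective(theta, X, y, lam):
    """The quantity Bayesian logistic regression maximizes:
    log-likelihood minus lambda * ||theta||^2 (maximizing, so the
    penalty enters with a minus sign)."""
    z = X @ theta
    # log p(y|x; theta) for y in {0, 1}, written stably:
    # y*z - log(1 + e^z)
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    return log_lik - lam * np.dot(theta, theta)

def fit_map(X, y, lam, lr=0.1, iters=5000):
    """Batch gradient ascent on the penalized log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        preds = 1.0 / (1.0 + np.exp(-(X @ theta)))
        # gradient of log-likelihood is X^T (y - p); penalty adds -2*lam*theta
        grad = X.T @ (y - preds) - 2.0 * lam * theta
        theta += lr * grad / len(y)
    return theta
```

On a toy separable dataset this drives the penalized objective up and classifies the training points correctly, which is all the sketch is meant to show.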
Maybe you suspect that even 100 features was too many, so you might try a smaller set of features. What's more common is that you suspect your features aren't good enough, so you might spend some time looking at the email headers to see if you can figure out better features for finding spam emails. You may also suspect that gradient descent hasn't quite converged yet, so let's try running gradient descent a bit longer and see if that works; clearly that can't hurt. Or maybe you remember hearing in class that Newton's method converges better, so let's try that instead. You may want to tune the value of lambda, because you're not sure that was the right choice, or maybe you even want to try an SVM, because maybe an SVM will work better than logistic regression. I've only listed eight things here, but you can imagine that if you were actually sitting down building a machine learning system, the options available to you are endless; you can think of hundreds of ways to improve a learning system.
And some of these things, like getting more training examples, surely that's going to help, so it seems like a good use of your time. Right? But it turns out that this strategy of picking ways to improve the learning algorithm, choosing one, and going for it might work in the sense that it may eventually get you to a working system, but it's often very time-consuming, and I think it's largely a matter of luck whether you end up fixing what the problem actually is. In particular, these eight improvements all fix very different problems, and some of them will be fixing problems that you don't have. If you can rule out six of the eight, say, if by looking at the problem more deeply you can figure out which one of these eight things is actually the right thing to do, you can save yourself a lot of time. So let's see how we can go about doing that. The people in industry and in research that I see being really good at this do not go and change the learning algorithm randomly. There are lots of things that could improve your learning algorithm, but the problem is there are so many of them it's hard to know what to do. So you find that the really good people run various diagnostics to figure out where the problem is. Okay?
So for our motivating story: we said Bayesian logistic regression's test error was 20 percent, which let's say is unacceptably high. And suppose you suspect the problem is either that you're overfitting, which would be high variance, or that you have too few features for classifying spam, which would be high bias. (On the slide I'd written those two reversed; sorry about that.) So how can you figure out whether the problem is one of high bias or high variance? It turns out there's a simple diagnostic you can look at that will tell you whether the problem is high bias or high variance. If you remember the cartoon we saw previously for high variance problems: when you have high variance, the training error will be much lower than the test error. A high variance problem is when you're fitting your training set very well; that's when you're fitting, say, a tenth-order polynomial to 11 data points.
All right? That's when you're fitting the data set very well, and so your training error will be much lower than your test error. In contrast, if you have high bias, your training error will also be high. That's when your data is quadratic, say, but you're fitting a linear function to it, and so you aren't even fitting your training set well. So, just in cartoons, this is what a typical learning curve for high variance looks like. On the horizontal axis I'm plotting the training set size m, and on the vertical axis I'm plotting the error. If you have a high variance problem, you'll notice that as the training set size m increases, your test set error keeps on decreasing. And this suggests that if you could increase the training set size even further, if you extrapolate the green curve out, maybe the test set error would decrease even further. Say the red horizontal line is the desired performance you're trying to reach; another useful thing to plot is the training error.
And it turns out that your training error will actually grow as a function of the training set size, because the larger your training set, the harder it is to fit your training set perfectly. This is just a cartoon, so don't take it too seriously, but in general your training error grows with your training set size: with a small training set, if you have one data point, it's really easy to fit that perfectly, but if you have 10,000 data points, it's much harder to fit them perfectly. So another diagnostic for high variance, and the one I tend to use more, is to just look at training versus test error: if there's a large gap between them, that suggests that getting more training data may help close the gap. Okay? So that's what the cartoon looks like in the case of high variance. This is what the cartoon looks like for high bias: if you look at the learning curve, you see that the curve for test error has already flattened out. And that's a sign that if you get more training examples, if you extrapolate this curve further to the right, it's not likely to go down much further.
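These two learning curves are cheap to compute for a real system. Here is one way to sketch the computation, assuming a generic `fit` routine that returns a prediction function and using 0/1 misclassification as the error; the majority-class model in the test is just a stand-in, not anything from the lecture.

```python
import numpy as np

def learning_curve(fit, X_train, y_train, X_test, y_test, sizes):
    """For each training-set size m, fit on the first m examples and
    record training error (on those m examples) and test error.
    Both errors are 0/1 misclassification rates."""
    train_err, test_err = [], []
    for m in sizes:
        predict = fit(X_train[:m], y_train[:m])
        train_err.append(np.mean(predict(X_train[:m]) != y_train[:m]))
        test_err.append(np.mean(predict(X_test) != y_test))
    return np.array(train_err), np.array(test_err)
```

Plotting `train_err` and `test_err` against `sizes` gives the blue and green curves from the cartoon; notice that even a trivial model shows training error growing with m.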
And this is a property of high bias: getting more training data won't necessarily help. But again, to me the more useful diagnostic is to plot your training error as well as your held-out test error. If you find that even your training error is high, that's a sign that getting more training data is not going to help. In fact, think about it: training error grows as a function of your training set size. So if your training error is already above your level of desired performance, then getting even more training data is not going to bring your training error down to the desired level of performance, because your training error only gets worse as you get more and more training examples. If you extrapolate further to the right, it's not as if this blue line will come back down to the level of desired performance; it will stay up there. Okay? For me personally, when looking at a curve like the green test-error curve, I actually find it very difficult to tell whether the curve is still going down or has flattened out. Sometimes you can tell, but very often it's somewhat ambiguous.
So for me personally, the diagnostic I use most often to tell whether I have a bias problem or a variance problem is to look at training and test error and see whether they're very close together or relatively far apart. Okay? And so, going back to the list of fixes: the first fix, getting more training examples, is a way to fix high variance; if you have a high variance problem, getting more training examples will help. Trying a smaller set of features also fixes high variance. Trying a larger set of features, or adding email features, are fixes for high bias, high bias being when your hypothesis was too simple, when you didn't have enough features. Okay? And quite often you see people working on machine learning problems who remember that getting more training examples helps. So they'll build a learning system, build an anti-spam system, and it doesn't work, and then they go off and spend lots of time and money and effort collecting more training data, because "getting more data has obviously got to help."
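The rule of thumb in the last few minutes can be written down directly. This is just the lecture's heuristic restated as code; the gap threshold is an arbitrary illustrative choice of mine, not something from the lecture.

```python
def diagnose(train_err, test_err, desired_err, gap_tol=0.05):
    """Crude bias/variance diagnostic: a large train/test gap
    suggests high variance (more data or fewer features may help);
    training error already above the target suggests high bias
    (more data won't fix it)."""
    hints = []
    if test_err - train_err > gap_tol:
        hints.append("high variance: try more data or fewer features")
    if train_err > desired_err:
        hints.append("high bias: try more or better features")
    return hints or ["train and test error both near target"]
```

For the anti-spam story (1 percent training error, 20 percent test error, 5 percent target), this flags high variance; with 18 percent training error it would flag high bias instead.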
But if they had a high bias problem in the first place, and not a high variance problem, it's entirely possible to spend three months or six months collecting more and more training data, not realizing that it couldn't possibly help. Right? And this actually happens a lot in Silicon Valley companies. There will often be people building various machine learning systems, and you often see someone spending six months working on fixing a learning algorithm when you could have told them six months ago that the fix couldn't possibly help. But because they didn't know what the problem was, they'd easily spend six months trying to invent new features or something. You see this surprisingly often, and it's somewhat depressing; and the six months is not a joke, you actually see this. In contrast, if you figure out whether the problem is one of high bias or high variance, then you can rule out two of these solutions and save yourself many months of fruitless effort. Okay? I actually want to talk about the four fixes at the bottom as well, but before I move on, let me just check if there are questions about what I've talked about so far. No? Okay, great.
So bias versus variance is one thing that comes up often; it's one common diagnostic. For other machine learning problems, it's often up to your own ingenuity to figure out your own diagnostics for what's wrong. If a machine learning algorithm isn't working, very often it's up to you to construct your own tests: do you look at the difference between training and test errors, or do you look at something else? It's often up to your own ingenuity to construct your own diagnostics to figure out what's going on. What I want to do is go through another example. This one is slightly more contrived, but it illustrates another common question, another of the most common issues that come up in applying learning algorithms. So in this example, let's say you implement Bayesian logistic regression and you get 2 percent error on spam mail and 2 percent error on non-spam mail. So it's rejecting 98 percent of your spam mail, and 2 percent of all spam gets through, which is fine; but it's also rejecting 2 percent of your good email, 2 percent of the email from your friends, and that's unacceptably high, let's say.
And let's say that a support vector machine using a linear kernel gets 10 percent error on spam and 0.01 percent error on non-spam, which is more like the acceptable performance you want. And let's say, for the sake of this example, that you're trying to build an anti-spam system and you really want to deploy logistic regression to your customers, because of computational efficiency, or because you need to retrain overnight every day, and logistic regression just runs more easily and more quickly. So you want to deploy logistic regression, but it's just not working out well. So the question is: what do you do next? It turns out that the issue that comes up here, the other common question, is whether the algorithm is converging. You might suspect that the problem with logistic regression is that it's just not converging; maybe you need to run more iterations. And it turns out that, again, if you look at the optimization objective, say logistic regression is optimizing J of theta, and you plot the objective as a function of the number of iterations, the curve sort of looks like it's going up, but it also sort of looks like there's an asymptote.
And when you look at these curves, it's often very hard to tell whether the curve has already flattened out. You look at these curves a lot, and you ask: has the algorithm converged? When you look at J of theta like this, it's often hard to tell. You can run it ten times as long and see if it's flattened out, and it will often still look like maybe it's going up very slowly, or something. So we'd like a better diagnostic for whether logistic regression has converged than looking at this curve. The other thing you might suspect is a problem is whether you're optimizing the right function. What you care about in spam, say, is a weighted accuracy function like this: a of theta is a sum over your examples of some weights times whether you got each example right. And the weight may be higher for non-spam than for spam mail, because you care about getting your predictions correct on non-spam email much more than on spam email, say. So let's say a of theta is the objective that you really care about, but Bayesian logistic regression instead optimizes a quantity like that. Right?
It's this sort of maximum likelihood thing, with the two-norm penalty that we saw previously. And you might be wondering: is this the right function to be optimizing? Or do I maybe need to change the value of lambda? Or should I really be switching to the support vector machine's optimization objective? Does that make sense? So the second diagnostic I'm going to talk about is for figuring out: is the optimization algorithm converging, or is the problem with the optimization objective I chose in the first place? Okay? So here's the diagnostic you can use. To reiterate the story: an SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian logistic regression for your problem. Let theta subscript SVM be the parameters learned by the SVM, and let theta subscript BLR be the parameters learned by Bayesian logistic regression. The optimization objective you care about is the weighted accuracy criterion I talked about just now, and the support vector machine outperforms Bayesian logistic regression on it.
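The weighted accuracy a of theta is just a per-example weighted 0/1 score. Here is a minimal version, where the particular choice of weights (say, ten times higher for non-spam) is up to you; the `predict` function is a stand-in for whatever classifier the parameters define.

```python
import numpy as np

def weighted_accuracy(predict, X, y, w):
    """a(theta): sum of w_i * 1{prediction correct on example i},
    normalized by the total weight.  Larger w_i on non-spam examples
    makes mistakes on good email cost more."""
    correct = (predict(X) == y).astype(float)
    return np.dot(w, correct) / np.sum(w)
```

Under these weights, a classifier that never flags spam can still score well, which is exactly why the weighting has to reflect what you actually care about.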
And so the weighted accuracy on the support vector machine's parameters is better than the weighted accuracy on Bayesian logistic regression's parameters. Further, Bayesian logistic regression tries to optimize an optimization objective like the one I denoted J of theta. And so the diagnostic I choose to use is to see whether J of theta SVM is bigger than or less than J of theta BLR. Okay? I'll explain this on the next slide. So we know one fact: the weighted accuracy of the support vector machine is bigger than the weighted accuracy of Bayesian logistic regression. So in order to figure out whether Bayesian logistic regression is failing to converge, or whether I'm just optimizing the wrong objective function, the diagnostic I'm going to use is to check whether that inequality between the J values holds. So let me explain. Case 1 is just those two equations copied over: let's say that J of theta SVM is greater than J of theta BLR. But we know that Bayesian logistic regression was trying to maximize J of theta; that's the definition of Bayesian logistic regression.
So this means that the value of theta output by Bayesian logistic regression actually fails to maximize J, because the support vector machine returned a value of theta that does a better job of maximizing J. And so this tells me that Bayesian logistic regression didn't actually maximize J correctly, and so the problem is with the optimization algorithm: the optimization algorithm hasn't converged. The other case is as follows: J of theta SVM is less than or equal to J of theta BLR. Okay? In this case, what does that mean? This means that Bayesian logistic regression actually attains a higher value of the optimization objective J than the support vector machine does. The support vector machine, which does worse on your optimization objective, actually does better on the weighted accuracy measure. So something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion.
And therefore, this tells you that J of theta is maybe the wrong optimization objective to be maximizing: maximizing J of theta just wasn't a good objective to choose if you care about the weighted accuracy. Okay? Can you raise your hand if this made sense? Cool, good. So that tells us whether the problem is with the optimization algorithm or with the optimization objective. And going back to the slide with the eight fixes, you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value of lambda, in that lambda times norm of theta squared term in your objective, fixes the optimization objective, and changing to an SVM is also another way of trying to fix the optimization objective. Okay? And once again, you actually see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm.
That's another very common pattern: the problem is in the formula for your J of theta, but you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, then trying more and more elaborate optimization algorithms, whereas optimizing J of theta better wasn't going to fix the problem at all. Okay? So that's another example of when these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Let me think how much time I have. Hmm, let's see. Okay, we have time; let's do this. I'll show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is actually the most complex of the three examples I'm going to do today, and it draws on reinforcement learning, which is something I won't talk about until close to the end of the course, but it's just a more complicated example of a diagnostic, so I'll go over it somewhat quickly.
What I'll do is probably go over this fairly quickly, and then after we've talked about reinforcement learning in the class, I'll probably come back and redo this exact same example, because you'll understand it more deeply. Okay? So some of you know that my students and I fly autonomous helicopters. How do you get a machine learning algorithm to design the controller for a helicopter? This is what we do. All right? The first step is to build a simulator for a helicopter; there's a screenshot of our simulator. It's like a joystick simulator; you can fly a helicopter in simulation. And then you choose a cost function; it's actually called a reward function in reinforcement learning, but here I'll call it a cost function. Say J of theta is the expected squared error in your helicopter's position. Okay? So J of theta is maybe the expected squared error, or just the squared error. And then we run a reinforcement learning algorithm - you'll learn about RL algorithms in a few weeks - in your simulator, to try to minimize this cost function, that is, to minimize the squared error of how well you're controlling your helicopter's position. Okay? The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you'll use that to fly your helicopter.
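The cost function described here can be sketched as plain squared error of position; the function name, argument names, and array shapes below are illustrative assumptions, not the actual implementation from this work.

```python
import numpy as np

def cost_J(positions, targets):
    """J(theta) as (mean) squared error of the helicopter's position
    along a flown trajectory. Both arguments are (T, 3) arrays of 3-D
    positions over T timesteps; shapes are purely illustrative."""
    positions = np.asarray(positions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Squared distance from the target position at each timestep,
    # averaged over the trajectory.
    return float(np.mean(np.sum((positions - targets) ** 2, axis=1)))
```

The reinforcement learning step then searches for controller parameters theta that make this number small in simulation.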
So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. Then what do you do next? In particular, corresponding to the three steps above, there are three natural things you can try. Right? You can try to - oh, the bottom of the slide got chopped off - you can try to improve the simulator. Maybe you think your simulator isn't that accurate: you need to capture the aerodynamic effects more accurately, you need to capture the airflow and turbulence effects around the helicopter more accurately. Maybe you need to modify the cost function: maybe squared error isn't cutting it, maybe what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging, or something. Okay? So these are the diagnostics that my students and I actually use to figure out what's going on. Actually, why don't you just think about this for a second and think about what you'd do, and then I'll go on and tell you what we do. All right, so let me tell you how we do this, and see whether it's the same as yours or not.
And if you have a better idea than I do, let me know and I'll let you try it on my helicopter. So here's the reasoning behind this experiment. Right? Let's say the controller output by our reinforcement learning algorithm does poorly. Well, suppose the following three things hold true - suppose the contrary, I guess. Suppose that the helicopter simulator is accurate, so let's assume we have an accurate model of our helicopter. And let's suppose that the reinforcement learning algorithm correctly controls the helicopter in simulation - we tend to run the learning algorithm in simulation so that the learning algorithm can crash a helicopter and it's fine. Right? So let's assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And let's suppose that minimizing J of theta does indeed correspond to accurate, correct autonomous flight. If all of these things held true, then that means that the parameters theta RL should actually fly well on my real helicopter. Right?
And so the fact that the learned control parameters, theta RL, do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which of these three assumptions is wrong. Okay? So these are the diagnostics we use. The first one is: we look at the controller and see if it even flies well in simulation. Right? In the simulator of the helicopter that we did the learning on. And so if the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator. Right? My simulator predicts the helicopter's controller will fly well, but it doesn't actually fly well in real life; so the problem could be in the simulator, and we should spend our efforts improving the accuracy of our simulator. Otherwise, let me write theta subscript human for the human control policy. All right? So let's go ahead and ask a human to fly the helicopter - it could be in the simulator, it could be in real life - and let's measure the mean squared error of the human pilot's flight. And let's see if the human pilot does better or worse than the learned controller, in terms of optimizing this objective function J of theta. Okay?
So if the human does better - if even a very good human pilot attains a better, meaning lower, value on my optimization objective, on my cost function, than my learning algorithm does - then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm is not managing to minimize J of theta, and that tells me the problem is in the reinforcement learning algorithm. And finally, if the human actually attains a larger value for J of theta - the human actually has a larger mean squared error for the helicopter position than my reinforcement learning algorithm does - and yet I like the way the human flies much better than my reinforcement learning algorithm, then clearly the problem is in the cost function. Right? Because the human does worse on my cost function but flies much better than my learning algorithm. And so that means the problem is in the cost function.
It means - oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide - it means that my learning algorithm does a better job minimizing the cost function but doesn't fly as well as a human pilot. So that tells you that minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay? And so for these reinforcement learning problems, if something doesn't work - often reinforcement learning algorithms just work, but when they don't work - these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm. And again, if you don't know which of the three the problem is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on this still. So it's entirely possible to go out and spend six years and write a PhD thesis and build a much better helicopter simulator - but if you're fixing the wrong problem, it's not gonna help.
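The three-way decision just walked through can be written down as a simple rule. This is a sketch of the reasoning only; the function and argument names are illustrative, and J is the cost being minimized (lower is better), with the premise that theta RL flies poorly on the real helicopter.

```python
def helicopter_diagnostic(flies_well_in_sim, J_human, J_rl):
    """Which of the three assumptions is broken? Premise: the learned
    controller theta_RL flies poorly on the real helicopter, and the
    human pilot flies well. J is the cost to be *minimized*."""
    if flies_well_in_sim:
        # Sim says it should work, reality disagrees: model mismatch.
        return "improve the simulator"
    if J_human < J_rl:
        # A human attains lower cost than the "minimizer" found,
        # so the RL algorithm isn't actually minimizing J.
        return "fix the reinforcement-learning algorithm"
    # Human has higher cost yet flies better: minimizing J
    # doesn't correspond to good autonomous flight.
    return "change the cost function J"
```

Each branch directs months of effort, which is exactly why running the cheap measurement first pays off.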
So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. And unfortunately, what I've described are maybe some of the most common diagnostics that I've used, that I've seen be useful for many problems; but very often, you need to come up with your own for your own specific learning problem. And I just want to point out that even when the learning algorithm is working well, it's often a good idea to run diagnostics like the ones I talked about, to make sure you really understand what's going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you to understand your application problem better. So some of you will graduate from Stanford and go on to get some amazingly high-paying job applying machine learning algorithms to some application problem of significant economic interest. Right? And you're gonna be working on one specific important machine learning application for many months, or even for years. One of the most valuable things for you personally will be to get an intuitive understanding of what works and what doesn't work on your problem.
Right now in the industry, in Silicon Valley or around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or for years on end. And when you're doing that - solving a really important problem using learning algorithms - one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems. It turns out, by the way, that there are some Silicon Valley companies that outsource their machine learning. So sometimes there's a company in Silicon Valley and they'll hire a firm in New York to run all their learning algorithms for them. And I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed to an outside agency, then you don't maintain that expertise. If there's a problem you really care about, then it'll be your own understanding of the problem, built up over months, that'll be really valuable; and if that knowledge is outsourced, you don't get to keep that knowledge yourself. I personally think that's a terrible idea.
But I'm not a businessman; I just see people do that a lot. Let's see. Another reason for running diagnostics like these is actually in writing research papers. Right? Diagnostics and error analyses, which I'll talk about in a minute, often help to convey insight about the problem and help justify your research claims. So for example, rather than writing a research paper that says, "Oh, well, here's an algorithm that works. I built this helicopter and it flies," or whatever, it's often much more interesting to say, "Here's an algorithm that works, and it works because of a specific component X. And moreover, here's the diagnostic that gives you justification that X was the thing that fixed the problem and made it work." Okay? So that leads me into a discussion of error analysis, which is often good machine learning practice; it's a way of understanding what your sources of error are. So, what I call error analysis - and let's check for questions about this first. Yeah? Student: What ended up being wrong with the helicopter? Instructor (Andrew Ng): Oh, I don't know. Let's see. We've flown so many times. The thing that is most difficult with a helicopter - I don't know, it changes all the time. Quite often, it's actually the simulator.
Building an accurate simulator of a helicopter is very hard. Yeah. Okay. So, error analysis: this is a way of figuring out what is working in your algorithm and what isn't working, and we're gonna talk about two specific examples. So there are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example of this, not dissimilar in many ways from actual machine learning systems you see. Let's say you want to recognize people from images. This is a picture of one of my friends. So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. And then you run a face detection algorithm - a machine learning algorithm to detect people's faces. Right? And then, let's say you want to recognize the identity of the person; this is your application. You then segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth, and so on. I know; she might not want to be my friend after she sees this.
And then, having found all these features - based on what the nose looks like, what the eyes look like, whatever - you feed all the features into a logistic regression algorithm. And your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay? So this is what error analysis is for. You have a long, complicated pipeline combining many machine learning components - many of these would be learned components. And so it's often very useful to figure out how much of your error can be attributed to each of these components. So what we'll do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that is - in the figure on the bottom right - let's say the overall accuracy of the system is 85 percent. Right? Then I want to know where my 15 percent of error comes from. And so what I'll do is go to my test set and, instead of running my learned background removal, actually implement the correct background removal: go in and give my algorithm what is the correct background versus foreground.
And if I do that - let's color that blue to denote that I'm giving that ground-truth data in the test set - let's assume our accuracy increases to 85.1 percent. Okay? And now I'll go in and give my algorithm the ground-truth face detection output; so on my test set, I'll just tell the algorithm where the face is. And if I do that, let's say my algorithm's accuracy increases to 91 percent, and so on. And then I'll go through each of these components and just give it the ground-truth label for each component. For example, the nose segmentation algorithm is trying to figure out where the nose is; I just go in and tell it where the nose is, so that it doesn't have to figure that out. And as I do this, one component after the other, I end up giving it the correct output label, and I end up with 100 percent accuracy. And now you can look at this table - I'm sorry, this is cut off on the bottom; it says logistic regression, 100 percent. Now you can look at this table and see how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. Right?
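The ground-truth plug-in table just described can be turned into per-component gains with a few lines. A minimal sketch, using the numbers from this contrived example; the remaining components of the pipeline (eyes, nose, mouth segmentation, logistic regression) are elided here since the lecture doesn't give their individual figures.

```python
def error_analysis_gains(table):
    """'table' lists (component, accuracy) in pipeline order, starting
    with the overall system and then the accuracy after the ground
    truth for each successive component is supplied on the test set."""
    return [(name, acc - prev_acc)
            for (_, prev_acc), (name, acc) in zip(table, table[1:])]

table = [
    ("overall system",     0.85),
    ("background removal", 0.851),
    ("face detection",     0.91),
    # ... remaining components, ending at 1.00 for the final classifier
]
```

The component with the largest gain is where ground truth would buy you the most, and so the best candidate for your time.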
So this tells me that if only I could get better face detection, maybe I could boost my accuracy by 6 percent. Whereas in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent. And so this sort of diagnostic also tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction: even if you got the ground truth, that gives you, at most, 0.1 percent accuracy, whereas if you do better face detection, there's a much larger potential for gains there. Okay? So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on. Right? And choosing the right piece is critical, and this sort of diagnostic tells you which piece may actually be worth your time to work on. There's another type of analysis that's sort of the opposite of what I just talked about.
The error analysis I just talked about tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline - some really bad performance - and your current performance. So for this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm. Right? So you added features for spelling correction, for sender host features, for email header features, email text parser features, JavaScript parser features, features for embedded images, and so on. Now let's say you review the system and you want to figure out how much each of these components actually contributed. Maybe you want to write a research paper and claim this was the piece that made the big difference; can you actually document that claim and justify it? So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. So in ablative analysis, instead of adding components one at a time, we'll instead remove components one at a time, to see how performance degrades.
So start with your overall system, which has 99.9 percent accuracy. Then remove spelling correction and see how much performance drops. Then remove the sender host features and see how much performance drops, and so on. All right? And so, in this contrived example, you see that, I guess, the biggest drop occurred when you removed the text parser features. And so you can then make a credible case that the text parser features were what really made the biggest difference here. Okay? And you can also tell, for instance, that removing the sender host features - on this line, right, performance dropped from 99.9 to 98.9 - means that in case you want to get rid of the sender host features to speed up computation or something, that would be a good candidate for elimination. Okay? Student: Are there any guarantees that if you shuffle around the order in which you drop those features, you'll get the same result? Instructor: Yeah, let's address that question: what if you shuffle the order in which you remove things? The answer is no; there's no guarantee you'd get a similar result. So in practice, for both types of analyses - the error analysis and the ablative analysis - sometimes there's a fairly natural ordering in which you add things or remove things, and sometimes there isn't.
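The ablative table above can be sketched in a few lines. The numbers below are in the spirit of the lecture's contrived spam example; only the sender-host figure (99.9 dropping to 98.9) is stated explicitly, so the other two accuracies are illustrative assumptions.

```python
def ablative_analysis(full_accuracy, accuracy_without):
    """Compute how much accuracy drops when each component is removed
    (one at a time) from the full system, largest drop first."""
    drops = {name: full_accuracy - acc
             for name, acc in accuracy_without.items()}
    return sorted(drops.items(), key=lambda kv: -kv[1])

# Accuracy of the full system with one named component removed.
removed = {
    "spelling correction":        0.990,   # illustrative
    "sender host features":       0.989,   # 99.9 -> 98.9, as in the lecture
    "email text parser features": 0.950,   # illustrative
}
ranking = ablative_analysis(0.999, removed)
# ranking[0][0] is the component whose removal hurts the most
```

The top of the ranking is the component you can credibly claim made the difference; the bottom is a candidate for elimination if you need speed.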
And quite often, you either choose one ordering and just go for it. And don't think of these analyses as fixed formulas, though; feel free to invent your own as well. One thing that's done quite often is to take the overall system and remove one component, then put it back; then remove a different one, then put it back; until all of them have been done. Okay. So the very last thing I want to talk about is some general advice for how to get started on a learning problem. Here's a cartoon description of two broad ways to get started on a learning problem. The first one is: carefully design your system. So you spend a long time designing exactly the right features, collecting the right data set, and designing the right algorithmic structure; then you implement it and hope it works. All right? The benefit of this sort of approach is that you get maybe nicer, maybe more scalable algorithms, and maybe you come up with new, elegant learning algorithms. And if your goal is to contribute to basic research in machine learning, if your goal is to invent new machine learning algorithms, this process of slowing down and thinking deeply about the problem is sort of the right way to go about it: think deeply about a problem and invent new solutions.
The second sort of approach is what I call build-and-fix: you implement something quick and dirty, and then you run error analyses and diagnostics to figure out what's wrong, and you fix those errors. The benefit of this second type of approach is that it'll often get your application working much more quickly. And especially for those of you who end up working in a company: very often it's not the best product that wins, it's the first product to market that wins. So, especially in industry, there's really something to be said for building a system quickly and getting it deployed quickly. And this second approach - building a quick-and-dirty, I'm gonna say, hack, and then fixing the problems - will actually get you to a system that works well much more quickly. And the reason is that very often it's really not clear which parts of a system are easy or hard to build, and therefore which parts you need to spend lots of time focusing on. So take that example I talked about just now, right, for identifying people. With a big, complicated learning system like this, a big, complicated pipeline like this, it's really not obvious at the outset which of these components you should spend lots of time working on. Right?
And if you didn't know that preprocessing wasn't the component that mattered, you could easily have spent three months working on better background subtraction, not knowing that it was just not gonna ultimately matter. And so the only way to find out what really works is to implement something quickly and find out which parts are really the hard parts to implement, or which parts could make a difference in performance. In fact, if your goal is to build a people recognition system, a system like this is actually far too complicated as your initial system. Maybe after you've prototyped a few systems, you'd converge to a system like this, but if this is the first system you're designing, this is much too complicated. Also, here's a very concrete piece of advice, and this applies to your projects as well. If your goal is to build a working application, Step 1 is actually probably not to design a system like this. Step 1 is to plot your data. Very often, if you just take the data you're trying to predict and plot it, plot X, plot Y, plot your data every way you can think of, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they should be positive. Something's wrong with this dataset."
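The look-at-your-data-first advice can be approximated even before reaching for a plotting library: summarize each variable and check for surprises like the negative values just mentioned. A minimal sketch with made-up numbers (in practice you would also draw histograms and scatter plots, e.g. with matplotlib):

```python
# Minimal "look at your data first" sketch. The values are made up;
# the point is to eyeball ranges before building any learning pipeline.

def quick_summary(values):
    """Return (min, max, mean) so obvious anomalies jump out."""
    lo, hi = min(values), max(values)
    mean = sum(values) / len(values)
    return lo, hi, mean

# Suppose these target values are supposed to be positive counts:
y = [12, 8, -3, 15, -7, 9]
lo, hi, mean = quick_summary(y)
print(f"min={lo} max={hi} mean={mean:.2f}")
if lo < 0:
    # Exactly the kind of surprise a quick look at the data reveals.
    print("Warning: negative values found; check the dataset.")
```

Five minutes of this kind of inspection often saves months of debugging a complicated system trained on broken data.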
About half the time, you find something obviously wrong with your data, or something very surprising. And this is something you find out just by plotting your data that you won't find out by implementing big complicated learning algorithms on it. Plotting the data sounds so simple; it's one of those pieces of advice that lots of us give but hardly anyone follows, so you can take that for what it's worth. Let me just reiterate that what I just said may be bad advice if your goal is to come up with new machine learning algorithms. For me personally, the learning algorithm I use the most often is probably logistic regression, because I have code lying around. So give me a learning problem, and I probably won't try anything more complicated than logistic regression on it first. It's only after trying something really simple and figuring out what's easy and what's hard that you know where to focus your efforts. But again, if your goal is to invent new machine learning algorithms, then you sort of don't want to hack up something, then add another hack to fix it, and hack it even more to fix it. So if your goal is to do novel machine learning research, then it pays to think more deeply about the problem and not follow this advice too literally. Shoot, you know what? All right, sorry if I'm running late, but I just have two more slides, so I'm gonna go through these quickly.
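To make the try-something-simple-first advice concrete, here is a minimal pure-Python sketch of a logistic regression baseline fit by batch gradient ascent on the log-likelihood. The toy data, learning rate, and iteration count are all illustrative choices, not recommendations:

```python
# Logistic regression baseline via batch gradient ascent (illustrative sketch).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    n = len(X[0])
    theta = [0.0] * (n + 1)  # last entry is the intercept term
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        for xi, yi in zip(X, y):
            z = sum(t * x for t, x in zip(theta, xi)) + theta[-1]
            err = yi - sigmoid(z)  # gradient of the log-likelihood
            for j in range(n):
                grad[j] += err * xi[j]
            grad[-1] += err
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x)) + theta[-1]
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy linearly separable data: label 1 when the feature sum is large.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
preds = [predict(theta, x) for x in X]  # should recover the training labels
```

A baseline like this tells you quickly how far simple methods get you, which is exactly the information you need before deciding whether anything more complicated is worth building.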
And so this is what I think of as premature statistical optimization: quite often, just like with premature optimization of code, people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more slides. This is a sort of cartoon that highly influenced my own thinking. It was based on a paper written by Christos Papadimitriou. This is how the developmental progress of research often happens. Let's say you want to build a mail delivery robot, so I've drawn a circle there that says "mail delivery robot." It seems like a useful thing to have, right? You know, it frees people up from having to deliver mail. Now, to deliver mail, obviously you need a robot that can wander around indoor environments, and you need a robot that can manipulate objects and pick up envelopes. So you need to build those two components in order to get a mail delivery robot, and I've drawn those components with little arrows to denote that, you know, obstacle avoidance is needed or would help build your mail delivery robot. Well, for obstacle avoidance, clearly you need a robot that can navigate, and you need to detect objects so you can avoid the obstacles. Now, we're gonna use computer vision to detect the objects.
And so we know that lighting sometimes changes, right, depending on whether it's morning or noon or evening. Lighting causes the color of things to change, and so you need an object detection system that's invariant to the specific colors of an object, because lighting changes, say. Well, color, or RGB values, is represented by three-dimensional vectors, and so you need to learn when two colors might be the same thing, when the visual appearance of two colors may be the same thing with just a lighting change or something. And to understand that properly, we can go out and study the differential geometry of 3D manifolds, because that helps us build a sound theory on which to develop our 3D similarity learning algorithms. And to really understand the fundamental aspects of this problem, we have to study the complexity of non-Riemannian geometries. And on and on it goes, until eventually you're proving convergence bounds for sampled non-monotonic logic. I don't even know what that is, because I just made it up. Whereas in reality, you know, chances are that link isn't real. Color invariance just barely helped object recognition, maybe; I'm making this up. Maybe differential geometry was hardly gonna help 3D similarity learning, and that link's also gonna fail. Okay?
So each of these circles can represent a person, or a research community, or a thought in your head. And there's a very real chance that maybe there are all these papers written on the differential geometry of 3D manifolds, and they were written because some guy once told someone else that it would help 3D similarity learning. And, you know, it's like: "A friend of mine told me that color invariance would help in object recognition, so I'm working on color invariance. And now I'm gonna tell a friend of mine that my thing will help his problem. And he'll tell a friend of his that his thing will help with his problem." And pretty soon, you're working on convergence bounds for sampled non-monotonic logic, when in reality none of these will ever see the light of day in your mail delivery robot. Okay? Now, I'm not criticizing the role of theory. There are very powerful theories, like the theory of VC dimension, which is far, far, far to the right of this. VC dimension is about as theoretical as it can get, and it's clearly had a huge impact on many applications and has dramatically advanced machine learning. Another example is the theory of NP-hardness, which, again, is about as theoretical as it can get, and it's had a huge impact on all of computer science.
But when you're off working on highly theoretical things, I guess, to me personally, it's important to keep in mind: are you working on something like VC dimension, which is high impact, or are you working on something like convergence bounds for sampled non-monotonic logic, which you're only hoping has some peripheral relevance to some application? Okay? For me personally, and this is a personal choice, I tend to trust something only if I personally can see a link from the theory I'm working on all the way back to an application. And if I don't personally see a direct link from what I'm doing to an application, then that's fine; I can choose to work on theory, but I wouldn't necessarily trust that the theory I'm working on will relate to an application if I don't personally see a link all the way back. Just to summarize: one lesson to take away from today is that I think time spent coming up with diagnostics for learning algorithms is often time well spent. It's often up to your own ingenuity to come up with great diagnostics.
And when I personally work on a machine learning algorithm, it's not uncommon for me to spend between a third and half of my time just writing diagnostics and trying to figure out what's going right and what's going wrong. Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and making progress. You don't want to be spending all this time implementing tests on your learning algorithms; it doesn't feel like you're doing anything. But when I implement learning algorithms, at least a third, and quite often half, of my time is actually spent implementing those tests so I can figure out what to work on. And I think it's actually one of the best uses of your time. We also talked about error analyses and ablative analyses, and lastly we talked about, you know, different approaches to getting started and the risks of premature statistical optimization. Okay. Sorry I ran over. I'll be here for a few more minutes for your questions.