This presentation is delivered by the Stanford Center for Professional Development.

Okay. Good morning. Welcome back. What I want to do today is wrap up our discussion of learning theory. I'm going to start by talking about Bayesian statistics and regularization, and then take a very brief digression to tell you about online learning. And most of today's lecture will actually be about applying machine learning algorithms to problems like, you know, the class project, or other problems you may go work on after you graduate from this class.

But let's start by talking about Bayesian statistics and regularization. So you remember from last week, we started to talk about learning theory, and we learned about bias and variance. And in the previous lecture, we spent most of our time talking about algorithms for model selection and for feature selection; we talked about cross-validation. Right? So most of the methods we talked about in the previous lecture were ways for you to try to simplify the model.
So for example, the feature selection algorithms we talked about give you a way to eliminate a number of features, so as to reduce the number of parameters you need to fit and thereby reduce overfitting. Right? You remember that? So feature selection algorithms choose a subset of the features so that you have fewer parameters and may be less likely to overfit.

What I want to do today is talk about a different way to prevent overfitting. There's a method called regularization, and it's a way that lets you keep all the parameters.

So here's the idea, and I'm going to illustrate this with, say, linear regression. You take the linear regression model, the very first model we learned about, and we said that we would choose the parameters via maximum likelihood. Right? And that meant that you would choose the parameters theta that maximize the likelihood of the data — the parameters theta that maximize the probability of the data we observed. Right?
And so, to give this sort of procedure a name, this is one example of a frequentist procedure, and you can think of the frequentist view as maybe one school of statistics. The philosophical view behind writing this down is that we envision there is some true parameter theta out there that generated, you know, the Xs and the Ys. There's some true parameter theta that governs housing prices, Y as a function of X, and we don't know the value of theta, and we'd like to come up with some procedure for estimating the value of theta. Okay? And so maximum likelihood is just one possible procedure for estimating the unknown value of theta.

And the way we formulated this, theta was not a random variable. Right? That's why we said theta is just some fixed true value out there. It's not random or anything; we just don't know what it is, and we have a procedure called maximum likelihood for estimating the value of theta. So this is one example of what's called a frequentist procedure.
The alternative to, I guess, the frequentist school of statistics is the Bayesian school, in which we're going to say that we don't know what theta is, and so we will put a prior on theta. Okay? So in the Bayesian school, you would say, "Well, we don't know the value of theta, so let's represent our uncertainty over theta with a prior."

So for example, our prior on theta may be a Gaussian distribution with mean zero and covariance matrix given by tau squared times the identity, I. Okay? And — actually, let me use S to denote my training set. So the prior represents my beliefs about what the parameters are in the absence of any data. Not having seen any data, the prior represents what I think theta is most likely to be. And so, given the training set S, in the Bayesian procedure we would calculate the posterior probability of the parameters given my training set. Let's write this on the next board.
So my posterior on my parameters given my training set — by Bayes' rule, this will be proportional to the likelihood of the data times the prior. Right? So, by Bayes' rule; let's call it the posterior. And this distribution now represents my beliefs about what theta is after I've seen the training set.

And when you now want to make a prediction on the price of a new house, on the input X — the distribution over the possible housing prices for this new house I'm trying to estimate the price of, given the size of the house, the features of the house X, and the training set I had previously — it is going to be given by an integral over my parameters theta of the probability of Y given X comma theta, times the posterior distribution of theta given the training set. Okay? And in particular, if you want your prediction to be the expected value of Y given the input X and the training set, you would integrate Y times that distribution. Okay? You would take the expectation of Y with respect to your posterior distribution. Okay?
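Written out, the quantities on the board are the following (I'm assuming the course's usual notation here, since the board itself isn't visible in the transcript):

```latex
% Posterior over the parameters (Bayes' rule), with the Gaussian prior
% from above:
\[
  p(\theta \mid S) \;\propto\;
  \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta),
  \qquad \theta \sim \mathcal{N}(0, \tau^2 I).
\]
% Prediction for a new input x integrates over the posterior, and the
% point prediction is the posterior-predictive mean:
\[
  p(y \mid x, S) = \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta,
  \qquad
  \mathbb{E}[y \mid x, S] = \int_{y} y\, p(y \mid x, S)\, dy.
\]
```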
And you notice that when I was writing this down, with the Bayesian formulation I've now started writing Y given X comma theta, because this formula is now the probability of Y conditioned on the values of the random variables X and theta. So I'm no longer writing semicolon theta; I'm writing comma theta, because I'm now treating theta as a random variable.

So all of this is somewhat abstract. Actually, let's check — are there questions about this? No? Okay. Let's try to make this more concrete. It turns out that for many problems, both of these steps in the computation are difficult, because if theta is an N-plus-one-dimensional parameter vector, then this is an integral over an N-plus-one-dimensional space, over R to the N plus one. And numerically it's very difficult to compute integrals over very high-dimensional spaces. All right? So usually it's hard to compute the posterior over theta, and it's also hard to compute this integral when theta is very high-dimensional.
There are a few exceptions for which this can be done in closed form, but for many learning algorithms — say, Bayesian logistic regression — this is hard to do. And so what's commonly done is to take the posterior distribution and, instead of actually computing the full posterior distribution p of theta given S, take the quantity on the right-hand side of Bayes' rule and just maximize it. So let me write this down.

So commonly, instead of computing the full posterior distribution, we will choose what's called the MAP estimate, or the maximum a posteriori estimate of theta, which is the most likely value of theta — the most probable value of theta under your posterior distribution. And that's just the arg max over theta of the likelihood times the prior. And then when you need to make a prediction, you would just predict using your usual hypothesis, with this MAP value of theta in place of — as the parameter vector you'd choose. Okay?
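In symbols, the MAP estimate being described is (standard notation assumed, since the board writing is garbled in the transcript):

```latex
% The MAP estimate maximizes the right-hand side of Bayes' rule:
% the likelihood of the training set times the prior.
\[
  \theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\,
    \Big(\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\Big)\, p(\theta),
\]
% and predictions then just use the usual hypothesis with this value
% plugged in for the parameters, i.e. h_{\theta_{\mathrm{MAP}}}(x).
```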
And notice, the only difference between this and standard maximum likelihood estimation is that, instead of choosing the maximum likelihood value for theta, you're maximizing the likelihood — which is what you have in maximum likelihood estimation — times this other quantity, which is the prior. Right?

And let's see, the intuition is that if your prior is theta being Gaussian with mean zero and some covariance, then for a distribution like this, most of the probability mass is close to zero. Right? There's a Gaussian centered around the point zero, and so most of the mass is close to zero. And so the prior distribution is saying that you think most of the parameters should be close to zero. And if you remember our discussion on feature selection, if you eliminate a feature from consideration, that's the same as setting the corresponding value of theta equal to zero. All right? So if you set theta five equal to zero, that's the same as, you know, eliminating feature five from your hypothesis. And so this is a prior that drives most of the parameter values to zero — to values close to zero.
And you can think of this as doing something analogous to — something reminiscent of — feature selection. Okay? And it turns out that with this formulation, the parameters won't actually be exactly zero, but many of the values will be close to zero.

And I guess in pictures: if you remember, I said that if you have, say, five data points and you fit a fourth-order polynomial — well, I think that sketch had too many bumps in it, but never mind. If you fit a very high-order polynomial to a very small dataset, you can get these very large oscillations if you use maximum likelihood estimation. All right? In contrast, if you apply this sort of Bayesian regularization, you can actually fit a higher-order polynomial and still get a smoother and smoother fit to the data as you decrease tau. As you decrease tau, you're driving the parameters to be closer and closer to zero. And in practice — it's sort of hard to see, but you can take my word for it — as tau becomes smaller and smaller, the curves you fit to your data become smoother and smoother, and so you tend to overfit less and less, even when you're fitting a large number of parameters. Okay?
Let's see, one last piece of intuition that I'll just toss out there — and you'll get to play with this particular set of ideas more in Problem Set 3, which I'll post online later this week, I guess. Whereas maximum likelihood for, say, linear regression turns out to be minimizing the sum of squared errors, it turns out that if you add this prior term, the optimization objective you end up with has an extra term that penalizes your parameters theta for being large. And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to keep your parameters small. And this has the effect — again, it's kind of hard to see, but just take my word for it — shrinking the parameters has the effect of keeping the functions you fit smoother and less likely to overfit. Okay? Okay, hopefully this will make more sense when you play with these ideas a bit more in the next problem set. But let's check for questions about all this.

[Student:] The smoothing behavior — is it because [inaudible] actually get different [inaudible]?

Let's see. Yeah.
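A minimal numpy sketch of the intuition just described (this is not the lecture's code; the data and the value of lambda are made up): MAP estimation with a Gaussian prior on theta is least squares plus an L2 penalty, lambda times the squared norm of theta, and the penalty shrinks the fitted parameters toward zero.

```python
import numpy as np

def fit(X, y, lam):
    # Closed-form solution of  min ||X theta - y||^2 + lam ||theta||^2,
    # i.e. theta = (X^T X + lam I)^{-1} X^T y  (ridge regression).
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)                     # five data points
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(5)

X = np.vander(x, 5)                              # degree-4 polynomial features

theta_ml = fit(X, y, lam=0.0)                    # maximum likelihood fit
theta_map = fit(X, y, lam=0.01)                  # MAP fit (Gaussian prior)

# The prior pulls the parameters toward zero, which in pictures shows up
# as a smoother fitted curve with smaller oscillations.
print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))
```

Decreasing tau in the prior corresponds to increasing lambda here, and drives the parameters (and hence the wiggliness of the fitted polynomial) further toward zero.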
It depends on — well, most priors with most of their mass close to zero will give you this effect, I guess. And just by convention, the Gaussian prior is what's used most commonly for models like logistic regression and linear regression — generalized linear models. There are a few other priors that are sometimes used, like the Laplace prior, but all of them will tend to have these sorts of smoothing effects. All right. Cool.

And so it turns out that for problems like text classification — text classification has something like 30,000 features or 50,000 features — it seems like an algorithm like logistic regression would be very much prone to overfitting. Right? So imagine trying to build a spam classifier: maybe you have 100 training examples but you have 30,000 or 50,000 features. That seems clearly prone to overfitting. Right? But it turns out that with this sort of Bayesian regularization, with a [inaudible] Gaussian, logistic regression becomes a very effective text classification algorithm. Alex? [Inaudible]? Yeah, right — and to pick either tau squared or lambda; I think the relation is lambda equals one over tau squared.
But right, to pick either tau squared or lambda, you could use cross-validation, yeah. All right? Okay, cool. So all right, that was all I wanted to say about methods for preventing overfitting.

What I want to do next is just spend, you know, five minutes talking about online learning. And this is sort of a digression. When you're designing the syllabus of a class, I guess, sometimes there are just some ideas you want to talk about but can't find a very good place to fit in anywhere. So this is one of those ideas that may seem a bit disjointed from the rest of the class, but I just want to tell you a little bit about it.

Okay. So here's the idea. So far, all the learning algorithms we've talked about are what are called batch learning algorithms, where you're given a training set, you get to run your learning algorithm on the training set, and then maybe you test it on some other test set. And there's another learning setting called online learning, in which you have to make predictions even while you are in the process of learning. So here's how the problem proceeds. All right? I'm first going to give you X one.
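Before moving on: the suggestion just made — pick lambda (equivalently tau squared, via lambda equals one over tau squared) by cross-validation — can be sketched in a few lines. This is a hypothetical illustration, not from the lecture; the data, the lambda grid, and the `fit` helper (the closed-form ridge solution) are all made up for the example.

```python
import numpy as np

def fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
theta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ theta_true + 0.5 * rng.standard_normal(60)

X_tr, y_tr = X[:40], y[:40]          # training split
X_cv, y_cv = X[40:], y[40:]          # hold-out split

# Try a grid of lambdas, keep the one with the smallest hold-out error.
lambdas = [0.01, 0.1, 1.0, 10.0]
errors = [np.mean((X_cv @ fit(X_tr, y_tr, lam) - y_cv) ** 2)
          for lam in lambdas]
best_lam = lambdas[int(np.argmin(errors))]
print(best_lam)
```

In practice you would use k-fold cross-validation rather than a single hold-out split, exactly as in the model selection discussion from the previous lecture.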
Let's say it's a classification problem. So I'm first going to give you X one and then ask you, you know, "Can you make a prediction on X one? Is the label one or zero?" And you've not seen any data yet. And so, you make a guess. Right? We'll call your guess Y hat one. And after you've made your prediction, I will then reveal to you the true label Y one. Okay? And not having seen any data before, your odds of getting the first one right are only 50 percent if you guess randomly.

And then I show you X two, and I ask you, "Can you make a prediction on X two?" And so now maybe you make a slightly more educated guess; call that Y hat two. And after you've made your guess, I reveal the true label to you. And then I show you X three, you make your guess, and learning proceeds like this.

So this is unlike batch learning: this models settings where you have to keep learning even as you're making predictions. Okay? So, I don't know, say you're setting up your website and you have users coming in. As the first user comes in, you need to start making predictions already about what the user likes or dislikes.
And only as you're making predictions do you get shown more and more training examples. So in online learning, what you care about is the total online error, which is — if you get a sequence of M examples all together — the sum from i equals one to M of the indicator that Y hat i is not equal to Y i. Okay? So the total online error is the total number of mistakes you make on a sequence of examples like this.

And it turns out that many of the learning algorithms you've learned about can be applied to this setting. One thing you could do, when you're asked to make the prediction Y hat three, say, is this: you've seen some training examples up to this point, so you can just take your learning algorithm and run it on all the examples you've seen previous to being asked to make a prediction on a given example, and then use your learning algorithm to make a prediction on the next example.
And it turns out that there are also algorithms — especially the algorithms we saw that you could train with stochastic gradient descent — that can be adapted very nicely to this. So as a concrete example, if you remember the perceptron algorithm: you would initialize the parameter theta to be equal to zero, and then after seeing the i-th training example, you'd update the parameters — you've seen this rule a lot of times now — using the standard perceptron learning rule. And the same thing if you were using logistic regression: after seeing each training example, you essentially run one step of stochastic gradient descent on just the example you saw. Okay?

And the reason I've put this into the sort of "learning theory" section of this class is that it turns out you can sometimes prove fairly amazing results on your total online error using algorithms like these.
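The online perceptron protocol just described can be sketched as follows (a minimal illustration, not the lecture's code; I'm assuming the usual plus-one/minus-one label convention, and the toy data stream is made up):

```python
import numpy as np

def online_perceptron(stream):
    # Online protocol: predict on each example as it arrives, then see the
    # true label, count the mistake if any (the "total online error"), and
    # update theta only on mistakes (the standard perceptron rule).
    theta = None
    mistakes = 0
    for x, y in stream:
        if theta is None:
            theta = np.zeros_like(x, dtype=float)   # initialize theta = 0
        y_hat = 1.0 if theta @ x >= 0 else -1.0     # predict BEFORE seeing y
        if y_hat != y:                              # true label revealed
            mistakes += 1
            theta += y * x                          # update on a mistake
    return theta, mistakes

# Toy linearly separable stream: the label is the sign of the first feature.
rng = np.random.default_rng(1)
data = [(x, 1.0 if x[0] >= 0 else -1.0)
        for x in rng.standard_normal((200, 3))]

theta, total_online_error = online_perceptron(data)
print(total_online_error)
```

For logistic regression the analogous change would be to replace the mistake-driven update with one step of stochastic gradient descent on each incoming example.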
I don't actually want to spend the time in the main lecture to prove this, but, for example, you can prove the following about the perceptron algorithm. Even when the features X i are maybe infinite-dimensional feature vectors, like we saw for support vector machines — and sometimes infinite-dimensional feature vectors may be used via kernel representations — okay, it turns out you can prove that when you run the perceptron algorithm, even when the data is extremely high-dimensional and it seems like you'd be prone to overfitting, so long as the positive and negative examples are separated by a margin — so long as, in this infinite-dimensional space, there is some margin separating the positive and negative examples — you can prove that the perceptron algorithm will converge to a hypothesis that perfectly separates the positive and negative examples. Okay? And so, after seeing only a finite number of examples, it will converge to a decision boundary that perfectly separates the positive and negative examples, even though you may be in an infinite-dimensional space.
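The result being referenced is the classic perceptron mistake bound (Block/Novikoff). Stated in the usual notation — an assumption here, since the lecture defers the proof to the posted notes:

```latex
% Assume every example satisfies \|x^{(i)}\| \le R, and that some unit
% vector u separates the data with margin \gamma, i.e.
% y^{(i)} \, (u^\top x^{(i)}) \ge \gamma > 0 for all i.
% Then the online perceptron makes at most
\[
  \left(\frac{R}{\gamma}\right)^{2}
\]
% mistakes on the sequence -- a bound that does not depend on the
% dimension of the x^{(i)}, which is why it survives the move to
% infinite-dimensional kernel feature spaces.
```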
Okay? So let's see. The proof itself would take me almost an entire lecture to do, and there are other things that I want to do more than that. So if you want to see the proof of this yourself, it's actually written up in the lecture notes that I posted online. For the purposes of this class's syllabus, you can treat the proof of this result as optional reading. And by that I mean, you know, it won't appear on the midterm and you won't be asked about it specifically in the problem sets, but I thought it'd be — I know some of you were curious after the previous lecture about how you can prove that, you know, SVMs can have bounded VC dimension even in these infinite-dimensional spaces, and how you prove learning theory results in these infinite-dimensional feature spaces. And so the perceptron bound that I just talked about is the simplest instance I know of that you can sort of read in like half an hour and understand. So if you're interested, there are lecture notes online for how this perceptron bound is actually proved. It's a very [inaudible]; you can prove it in like a page or so, so go ahead and take a look at that if you're interested.
Okay? But regardless of the theoretical results, you know, the online learning setting is something that comes up reasonably often, and so these algorithms based on stochastic gradient descent often work very well. Okay, any questions about this before I move on?

All right. Cool. So the last thing I want to do today — and it's the majority of today's lecture; actually, can I switch to the PowerPoint slides, please — is spend most of today's lecture talking about advice for applying different machine learning algorithms.

And so, you know, right now you already have, I think, a good understanding of really the most powerful tools known to humankind in machine learning. Right? And what I want to do today is give you some advice on how to apply them really powerfully. Because, you know, it turns out that you can take the same machine learning tool, say logistic regression, and ask two different people to apply it to the same problem. And sometimes one person will do an amazing job and it'll work amazingly well, and the second person will sort of not really get it to work, even though it was exactly the same algorithm. Right?
00:25:47.150 --> 00:25:51.049 And so what I want to do today, in the rest of the time I have today, is try 00:25:51.049 --> 00:25:52.190 to 00:25:52.190 --> 00:25:55.210 convey to you, you know, some of the methods for how to 00:25:55.210 --> 00:25:56.600 make sure you're one of the people who 00:25:56.600 --> 00:26:03.600 really knows how to get these learning algorithms to work well in problems. So 00:26:05.120 --> 00:26:07.340 just some caveats on what I'm gonna, I 00:26:07.340 --> 00:26:08.859 guess, talk about in the 00:26:08.859 --> 00:26:11.080 rest of today's lecture. 00:26:11.940 --> 00:26:15.760 Something I want to talk about is actually not very mathematical but is also 00:26:15.760 --> 00:26:17.610 some of the hardest, 00:26:17.610 --> 00:26:18.480 most 00:26:18.480 --> 00:26:21.910 conceptually difficult material in this class to understand. All right? So this 00:26:21.910 --> 00:26:23.070 is 00:26:23.070 --> 00:26:25.400 not mathematical but this is not easy. 00:26:25.400 --> 00:26:29.730 And I want to say this caveat: some of what I'll say today is debatable. 00:26:29.730 --> 00:26:33.309 I think most good machine learning people will agree with most of what I say but maybe not 00:26:33.309 --> 00:26:35.730 everything I say. 00:26:35.730 --> 00:26:39.350 And some of what I'll say is also not good advice for doing machine learning research either, so I'll say more about 00:26:39.350 --> 00:26:41.680 this later. What I'm 00:26:41.680 --> 00:26:43.730 focusing on today is advice for 00:26:43.730 --> 00:26:47.269 how to just get stuff to work. If you work in a company and you want to deliver a 00:26:47.269 --> 00:26:49.120 product or you're, you know, 00:26:49.120 --> 00:26:52.750 building a system and you just want your machine learning system to work. Okay?
Some of what I'm about 00:26:52.750 --> 00:26:54.000 to say today 00:26:54.000 --> 00:26:57.559 isn't great advice if your goal is to invent a new machine learning algorithm, but 00:26:57.559 --> 00:26:59.179 this is advice for how to 00:26:59.179 --> 00:27:02.040 make a machine learning algorithm work and, you know, 00:27:02.040 --> 00:27:05.080 deploy a working system. So three 00:27:05.080 --> 00:27:08.900 key areas I'm gonna talk about. One: diagnostics for 00:27:08.900 --> 00:27:11.890 debugging learning algorithms. Second: sort of 00:27:11.890 --> 00:27:16.820 talk briefly about error analyses and ablative analysis. 00:27:16.820 --> 00:27:21.470 And third, I want to talk about just advice for how to get started on a machine-learning 00:27:23.410 --> 00:27:24.420 problem. 00:27:24.420 --> 00:27:27.340 And one theme that'll come up later is it 00:27:27.340 --> 00:27:31.230 turns out you've heard about premature optimization, right, 00:27:31.230 --> 00:27:33.890 in writing software. This is when 00:27:33.890 --> 00:27:38.120 someone over-designs from the start, when someone, you know, is writing a piece of code and 00:27:38.120 --> 00:27:41.330 they choose a subroutine to optimize 00:27:41.330 --> 00:27:45.350 heavily. And maybe you write the subroutine in assembly or something. And that's often 00:27:45.350 --> 00:27:48.919 - and many of us have been guilty of premature optimization, where 00:27:48.919 --> 00:27:51.510 we're trying to get a piece of code to run faster. And 00:27:51.510 --> 00:27:54.710 we choose a piece of code and we implement it in assembly, and 00:27:54.710 --> 00:27:56.820 really tune it and get it to run really quickly. And 00:27:56.820 --> 00:28:01.280 it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature 00:28:01.280 --> 00:28:02.210 optimization.
00:28:02.210 --> 00:28:06.140 And in undergraduate programming classes, we warn people all the time not to do 00:28:06.140 --> 00:28:10.250 premature optimization and people still do it all the time. Right? 00:28:10.250 --> 00:28:11.750 And it 00:28:11.750 --> 00:28:16.020 turns out, a very similar thing happens in building machine-learning systems. That 00:28:16.020 --> 00:28:20.980 many people are often guilty of, what I call, premature statistical optimization, where they 00:28:20.980 --> 00:28:21.630 00:28:21.630 --> 00:28:26.090 heavily optimize part of a machine learning system and that turns out 00:28:26.090 --> 00:28:27.830 not to be the important piece. Okay? 00:28:27.830 --> 00:28:30.170 So I'll talk about that later, as well. 00:28:30.170 --> 00:28:35.680 So let's first talk about debugging learning algorithms. 00:28:35.680 --> 00:28:37.700 00:28:37.700 --> 00:28:39.560 00:28:39.560 --> 00:28:42.610 As a motivating 00:28:42.610 --> 00:28:46.910 example, let's say you want to build an anti-spam system. And 00:28:46.910 --> 00:28:51.780 let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right? 00:28:51.780 --> 00:28:54.410 So instead of using 50,000 words, you've chosen a small set of 100 00:28:54.410 --> 00:28:55.390 features 00:28:55.390 --> 00:28:59.010 to use for your anti-spam system. 00:28:59.010 --> 00:29:02.140 And let's say you implement Bayesian logistic regression, implement gradient 00:29:02.140 --> 00:29:02.850 descent, 00:29:02.850 --> 00:29:07.310 and you get 20 percent test error, which is unacceptably high. Right? 00:29:07.310 --> 00:29:09.750 So this 00:29:09.750 --> 00:29:14.260 is Bayesian logistic regression, and so it's just like maximum likelihood but, you know, with that additional lambda theta 00:29:14.260 --> 00:29:15.160 squared term.
00:29:15.160 --> 00:29:19.950 And we're maximizing rather than minimizing as well, so there's a minus lambda 00:29:19.950 --> 00:29:23.809 theta squared instead of plus lambda theta squared. So 00:29:23.809 --> 00:29:28.390 the question is, you implemented your Bayesian logistic regression algorithm, 00:29:28.390 --> 00:29:34.050 and you tested it on your test set and you got unacceptably high error, so what do you do 00:29:34.050 --> 00:29:35.370 next? Right? 00:29:35.370 --> 00:29:37.310 So, 00:29:37.310 --> 00:29:40.730 you know, one thing you could do is think about the ways you could improve this algorithm. 00:29:40.730 --> 00:29:44.510 And this is probably what most people will do instead of, "Well let's sit down and think what 00:29:44.510 --> 00:29:47.929 could've gone wrong, and then we'll try to improve the algorithm." 00:29:47.929 --> 00:29:51.370 Well obviously having more training data could only help, so one thing you can do is try to 00:29:51.370 --> 00:29:55.140 get more training examples. 00:29:55.140 --> 00:29:58.880 Maybe you suspect that even 100 features was too many, so you might try to 00:29:58.880 --> 00:30:01.059 get a smaller set of 00:30:01.059 --> 00:30:04.510 features. What's more common is you might suspect your features aren't good enough, so you might 00:30:04.510 --> 00:30:06.880 spend some time, look at the email headers, see if 00:30:06.880 --> 00:30:09.240 you can figure out better features for, you know, 00:30:09.240 --> 00:30:12.890 finding spam emails or whatever. 00:30:12.890 --> 00:30:14.460 Right. And 00:30:14.460 --> 00:30:17.720 so you might 00:30:17.720 --> 00:30:20.780 just sit around and come up with better features, such as email header features. 00:30:20.780 --> 00:30:24.940 You may also suspect that gradient descent hasn't quite converged yet, and so let's try 00:30:24.940 --> 00:30:28.090 running gradient descent a bit longer to see if that works.
And clearly, that can't hurt, right, 00:30:28.090 --> 00:30:29.480 just run 00:30:29.480 --> 00:30:30.940 gradient descent longer. 00:30:30.940 --> 00:30:35.620 Or maybe you remember, you know, hearing in class that maybe 00:30:35.620 --> 00:30:38.580 Newton's method converges better, so let's 00:30:38.580 --> 00:30:39.809 try 00:30:39.809 --> 00:30:41.840 that instead. You may want to tune the value for lambda, because you're not 00:30:41.840 --> 00:30:43.560 sure if that was the right value, 00:30:43.560 --> 00:30:46.960 or maybe you even want to try an SVM because maybe you think an SVM might work better than logistic regression. So I only 00:30:46.960 --> 00:30:50.240 listed eight things 00:30:50.240 --> 00:30:54.549 here, but you can imagine if you were actually sitting down, building a machine-learning 00:30:54.549 --> 00:30:55.720 system, 00:30:55.720 --> 00:30:58.040 the options to you are endless. You can think of, you 00:30:58.040 --> 00:31:01.040 know, hundreds of ways to improve a learning system. 00:31:01.040 --> 00:31:02.429 And some of these things like, 00:31:02.429 --> 00:31:05.670 well getting more training examples, surely that's gonna help, so that seems like it's a good 00:31:05.670 --> 00:31:08.980 use of your time. Right? 00:31:08.980 --> 00:31:11.290 And it turns out that 00:31:11.290 --> 00:31:15.210 this [inaudible] of picking ways to improve the learning algorithm and picking one and going 00:31:15.210 --> 00:31:16.120 for it, 00:31:16.120 --> 00:31:20.620 it might work in the sense that it may eventually get you to a working system, but 00:31:20.620 --> 00:31:25.030 often it's very time-consuming. And I think it's often largely a matter of 00:31:25.030 --> 00:31:28.140 luck, whether you end up fixing what the problem is. 00:31:28.140 --> 00:31:29.490 In particular, these 00:31:29.490 --> 00:31:33.030 eight improvements all fix very different problems.
00:31:33.030 --> 00:31:37.340 And some of them will be fixing problems that you don't have. And 00:31:37.340 --> 00:31:40.070 if you can rule out six of 00:31:40.070 --> 00:31:44.129 these eight, say - if by somehow looking at the problem more deeply, 00:31:44.129 --> 00:31:46.809 you can figure out which one of these eight things is actually the right thing 00:31:46.809 --> 00:31:47.850 to do, 00:31:47.850 --> 00:31:48.850 you can save yourself 00:31:48.850 --> 00:31:50.760 a lot of time. 00:31:50.760 --> 00:31:56.090 So let's see how we can go about doing that. 00:31:56.090 --> 00:32:01.830 The people in industry and in research that I see that are really good would not 00:32:01.830 --> 00:32:05.490 go and try to change a learning algorithm randomly. There are lots of things that 00:32:05.490 --> 00:32:08.110 obviously improve your learning algorithm, 00:32:08.110 --> 00:32:12.460 but the problem is there are so many of them it's hard to know what to do. 00:32:12.460 --> 00:32:16.590 So what you find is that all the really good ones run various diagnostics to figure out 00:32:16.590 --> 00:32:18.010 where the problem is, 00:32:18.010 --> 00:32:21.610 and then fix where they think the problem is. Okay? 00:32:21.610 --> 00:32:23.830 So 00:32:23.830 --> 00:32:27.309 for our motivating story, right, we said - let's say Bayesian logistic regression test 00:32:27.309 --> 00:32:29.010 error was 20 percent, 00:32:29.010 --> 00:32:32.020 which let's say is unacceptably high. 00:32:32.020 --> 00:32:34.830 And let's suppose you suspected the problem is 00:32:34.830 --> 00:32:36.440 either overfitting, 00:32:36.440 --> 00:32:37.790 so it's high bias, 00:32:37.790 --> 00:32:42.240 or you suspect that, you know, maybe you have too few features that classify as spam, so there's - Oh excuse 00:32:42.240 --> 00:32:45.220 me; I think I 00:32:45.220 --> 00:32:46.620 wrote that wrong.
00:32:46.620 --> 00:32:48.080 Let's firstly - so let's 00:32:48.080 --> 00:32:49.219 forget - forget the tables. 00:32:49.219 --> 00:32:52.839 Suppose you suspect the problem is either high bias or high variance, and some of the text 00:32:52.839 --> 00:32:54.730 here 00:32:54.730 --> 00:32:55.250 doesn't make sense. And 00:32:55.250 --> 00:32:56.429 you want to know 00:32:56.429 --> 00:33:00.850 if you're overfitting, which would be high variance, or you have too few 00:33:00.850 --> 00:33:06.240 features classified as spam, it'd be high bias. I had those two reversed, sorry. Okay? So 00:33:06.240 --> 00:33:08.750 how can you figure out whether the problem 00:33:08.750 --> 00:33:10.790 is one of high bias 00:33:10.790 --> 00:33:15.610 or high variance? Right? So it turns 00:33:15.610 --> 00:33:19.009 out there's a simple diagnostic you can look at that will tell you 00:33:19.009 --> 00:33:24.150 whether the problem is high bias or high variance. If you 00:33:24.150 --> 00:33:27.900 remember the cartoon we'd seen previously for high variance problems, when you have high 00:33:27.900 --> 00:33:29.710 variance 00:33:29.710 --> 00:33:33.280 the training error will be much lower than the test error. All right? When you 00:33:33.280 --> 00:33:36.140 have a high variance problem, that's when you're fitting 00:33:36.140 --> 00:33:39.480 your training set very well. That's when you're fitting, you know, a tenth order polynomial to 00:33:39.480 --> 00:33:41.650 11 data points. All right? And 00:33:41.650 --> 00:33:44.670 that's when you're just fitting the data set very well, and so your training error will be 00:33:44.670 --> 00:33:45.670 much lower than 00:33:45.670 --> 00:33:47.640 your test 00:33:47.640 --> 00:33:49.940 error. And in contrast, if you have high bias, 00:33:49.940 --> 00:33:52.700 that's when your training error will also be high. Right? 
00:33:52.700 --> 00:33:56.450 That's when your data is quadratic, say, but you're fitting a linear function to it 00:33:56.450 --> 00:34:02.290 and so you aren't even fitting your training set well. So 00:34:02.290 --> 00:34:04.450 just in cartoons, I guess, 00:34:04.450 --> 00:34:07.950 this is a - this is what a typical learning curve for high variance looks 00:34:07.950 --> 00:34:09.339 like. 00:34:09.339 --> 00:34:13.690 On your horizontal axis, I'm plotting the training set size M, right, 00:34:13.690 --> 00:34:16.429 and on vertical axis, I'm plotting the error. 00:34:16.429 --> 00:34:19.469 And so, let's see, 00:34:19.469 --> 00:34:21.029 you know, as you increase - 00:34:21.029 --> 00:34:25.119 if you have a high variance problem, you'll notice as the training set size, M, 00:34:25.119 --> 00:34:29.219 increases, your test set error will keep on decreasing. 00:34:29.219 --> 00:34:32.829 And so this sort of suggests that, well, if you can increase the training set size even 00:34:32.829 --> 00:34:36.359 further, maybe if you extrapolate the green curve out, maybe 00:34:36.359 --> 00:34:39.970 that test set error will decrease even further. All right? 00:34:39.970 --> 00:34:43.399 Another thing that's useful to plot here is - let's say 00:34:43.399 --> 00:34:46.539 the red horizontal line is the desired performance 00:34:46.539 --> 00:34:50.259 you're trying to reach, another useful thing to plot is actually the training error. Right? 00:34:50.259 --> 00:34:52.009 And it turns out that 00:34:52.009 --> 00:34:59.009 your training error will actually grow as a function of the training set size 00:34:59.249 --> 00:35:01.609 because the larger your training set, 00:35:01.609 --> 00:35:03.619 the harder it is to fit, 00:35:03.619 --> 00:35:06.149 you know, your training set perfectly. Right? 
00:35:06.149 --> 00:35:09.249 So this is just a cartoon, don't take it too seriously, but in general, your training error 00:35:09.249 --> 00:35:11.420 will actually grow 00:35:11.420 --> 00:35:15.079 as a function of your training set size. Because with small training sets, if you have one data point, 00:35:15.079 --> 00:35:17.769 it's really easy to fit that perfectly, but if you have 00:35:17.769 --> 00:35:22.099 10,000 data points, it's much harder to fit that perfectly. 00:35:22.099 --> 00:35:23.149 All right? 00:35:23.149 --> 00:35:27.960 And so another diagnostic for high variance, and the one that I tend to use more, 00:35:27.960 --> 00:35:31.670 is to just look at training versus test error. And if there's a large gap between 00:35:31.670 --> 00:35:32.789 them, 00:35:32.789 --> 00:35:34.160 then this suggests that, you know, 00:35:34.160 --> 00:35:39.630 getting more training data may help you close that gap. Okay? 00:35:39.630 --> 00:35:41.420 So this is 00:35:41.420 --> 00:35:42.340 what the 00:35:42.340 --> 00:35:45.059 cartoon would look like in the 00:35:45.059 --> 00:35:49.199 case of high variance. 00:35:49.199 --> 00:35:53.099 This is what the cartoon looks like for high bias. Right? If you 00:35:53.099 --> 00:35:54.779 look at the learning curve, you 00:35:54.779 --> 00:35:57.499 see that the curve for test error 00:35:57.499 --> 00:36:01.419 has flattened out already. And so this is a sign that, 00:36:01.419 --> 00:36:05.179 you know, if you get more training examples, if you extrapolate this curve 00:36:05.179 --> 00:36:06.519 further to the right, 00:36:06.519 --> 00:36:09.670 it's maybe not likely to go down much further. 00:36:09.670 --> 00:36:12.469 And this is a property of high bias: that getting more training data won't 00:36:12.469 --> 00:36:15.619 necessarily help.
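To make the learning-curve diagnostic concrete, here is a minimal sketch (this is not code from the lecture; the least-squares learner and all function names are illustrative stand-ins for whatever model you are actually debugging):

```python
import numpy as np

def fit_least_squares(X, y):
    # Ordinary least squares, standing in for "train the model".
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def mean_squared_error(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

def learning_curves(X_train, y_train, X_test, y_test, sizes,
                    fit=fit_least_squares, error=mean_squared_error):
    # For each training-set size m, train on the first m examples and
    # record training error and test error: the two curves in the cartoon.
    # A large, persistent gap between the curves suggests high variance;
    # both curves flattening out above the target error suggests high bias.
    train_err, test_err = [], []
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        train_err.append(error(theta, X_train[:m], y_train[:m]))
        test_err.append(error(theta, X_test, y_test))
    return train_err, test_err
```

Plotting the two returned lists against `sizes` gives exactly the cartoon: training error creeping up with m, test error coming down, and the gap between them is what you read off.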
00:36:15.619 --> 00:36:18.999 But again, to me the more useful diagnostic is 00:36:18.999 --> 00:36:20.299 if you plot 00:36:20.299 --> 00:36:23.999 the training error as well - if you look at your training error as well as your, you know, 00:36:23.999 --> 00:36:26.369 hold-out test set error. 00:36:26.369 --> 00:36:29.409 If you find that even your training error 00:36:29.409 --> 00:36:31.529 is high, 00:36:31.529 --> 00:36:34.779 then that's a sign that getting more training data is not 00:36:34.779 --> 00:36:38.269 going to help. Right? 00:36:38.269 --> 00:36:42.199 In fact, you know, think about it, 00:36:42.199 --> 00:36:44.539 training error 00:36:44.539 --> 00:36:48.089 grows as a function of your training set size. 00:36:48.089 --> 00:36:50.449 And so if your 00:36:50.449 --> 00:36:55.569 training error is already above your level of desired performance, 00:36:55.569 --> 00:36:56.599 then 00:36:56.599 --> 00:37:00.789 getting even more training data is not going to reduce your training 00:37:00.789 --> 00:37:03.009 error down to the desired level of performance. Right? 00:37:03.009 --> 00:37:06.469 Because, you know, your training error sort of only gets worse as you get more and more training 00:37:06.469 --> 00:37:07.549 examples. 00:37:07.549 --> 00:37:10.799 So if you extrapolate further to the right, it's not like this blue line will come 00:37:10.799 --> 00:37:13.399 back down to the level of desired performance. Right? This will stay up 00:37:13.399 --> 00:37:17.479 there. Okay? So for 00:37:17.479 --> 00:37:21.339 me personally, when looking at a curve like the green 00:37:21.339 --> 00:37:25.380 curve on test error, I personally tend to find it very difficult to tell 00:37:25.380 --> 00:37:29.000 if the curve is still going down or if it's [inaudible]. Sometimes you can tell, but very 00:37:29.000 --> 00:37:31.009 often, it's somewhat 00:37:31.009 --> 00:37:32.899 ambiguous.
So for me personally, 00:37:32.899 --> 00:37:37.129 the diagnostic I tend to use the most often to tell if I have a bias problem or a variance 00:37:37.129 --> 00:37:37.859 problem 00:37:37.859 --> 00:37:41.319 is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay? And so, 00:37:41.319 --> 00:37:45.420 going 00:37:45.420 --> 00:37:47.130 back to 00:37:47.130 --> 00:37:52.399 the list of fixes, look 00:37:52.399 --> 00:37:54.109 at the first fix, 00:37:54.109 --> 00:37:56.339 getting more training examples 00:37:56.339 --> 00:37:58.650 is a way to fix high variance. 00:37:58.650 --> 00:38:02.749 Right? If you have a high variance problem, getting more training examples will help. 00:38:02.749 --> 00:38:05.529 Trying a smaller set of features: 00:38:05.529 --> 00:38:11.759 that also fixes high variance. All right? 00:38:11.759 --> 00:38:15.869 Trying a larger set of features or adding email features, these 00:38:15.869 --> 00:38:20.150 are solutions that fix high bias. Right? 00:38:20.150 --> 00:38:26.769 So high bias being if your hypothesis was too simple, you didn't have enough features. Okay? 00:38:26.769 --> 00:38:29.069 And so 00:38:29.069 --> 00:38:33.579 quite often you see people working on machine learning problems 00:38:33.579 --> 00:38:34.589 and 00:38:34.589 --> 00:38:37.569 they'll remember that getting more training examples helps. And 00:38:37.569 --> 00:38:41.119 so, they'll build a learning system, build an anti-spam system and it doesn't work. 00:38:41.119 --> 00:38:42.229 And then they 00:38:42.229 --> 00:38:45.999 go off and spend lots of time and money and effort collecting more training data 00:38:45.999 --> 00:38:50.509 because they'll say, "Oh well, getting more data's obviously got to help."
00:38:50.509 --> 00:38:53.319 But if they had a high bias problem in the first place, and not a high variance 00:38:53.319 --> 00:38:54.890 problem, 00:38:54.890 --> 00:38:56.769 it's entirely possible to spend 00:38:56.769 --> 00:39:00.149 three months or six months collecting more and more training data, 00:39:00.149 --> 00:39:04.999 not realizing that it couldn't possibly help. Right? 00:39:04.999 --> 00:39:07.619 And so, this actually happens a lot in, you 00:39:07.619 --> 00:39:12.409 know, in Silicon Valley and companies, this happens a lot. There will often 00:39:12.409 --> 00:39:15.329 be people building various machine learning systems, and 00:39:15.329 --> 00:39:19.519 they'll often - you often see people spending six months working on fixing a 00:39:19.519 --> 00:39:20.999 learning algorithm 00:39:20.999 --> 00:39:23.940 and you could've told them six months ago that, you know, 00:39:23.940 --> 00:39:27.210 that couldn't possibly have helped. But because they didn't know what the 00:39:27.210 --> 00:39:28.709 problem was, 00:39:28.709 --> 00:39:33.549 they'd easily spend six months trying to invent new features or something. And 00:39:33.549 --> 00:39:37.809 this is - you see this surprisingly often and this is somewhat depressing. You could've gone to them and 00:39:37.809 --> 00:39:42.289 told them, "I could've told you six months ago that this was not going to help." And 00:39:42.289 --> 00:39:46.149 the six months is not a joke, you actually see 00:39:46.149 --> 00:39:47.709 this. 00:39:47.709 --> 00:39:49.510 And in contrast, if you 00:39:49.510 --> 00:39:53.049 actually figure out the problem's one of high bias or high variance, then 00:39:53.049 --> 00:39:54.299 you can rule out 00:39:54.299 --> 00:39:55.799 two of these solutions and 00:39:55.799 --> 00:40:00.779 save yourself many months of fruitless effort. Okay? I actually 00:40:00.779 --> 00:40:03.709 want to talk about these four at the bottom as well.
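The rule of thumb above (a big train/test gap points at high variance; high training error points at high bias) can be written down as a toy decision rule. This is just a sketch: the function name and the numeric threshold are my own, and in practice you would eyeball the learning curves rather than hard-code a cutoff:

```python
def diagnose_bias_variance(train_error, test_error, target_error,
                           gap_threshold=0.05):
    # Large gap between test and training error: fitting the training
    # set much better than the test set, i.e. high variance.
    if test_error - train_error > gap_threshold:
        return ("high variance",
                ["get more training examples",
                 "try a smaller set of features"])
    # Not even fitting the training set to the desired level: high bias,
    # and collecting more data cannot bring training error down.
    if train_error > target_error:
        return ("high bias",
                ["try a larger set of features",
                 "try adding better (e.g. email header) features"])
    return ("neither", [])
```

The payoff is exactly the one described in the lecture: each diagnosis rules out the fixes that address the problem you do not have.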
But before I move on, let me 00:40:03.709 --> 00:40:05.319 just check if there were questions about what I've talked 00:40:05.319 --> 00:40:12.319 about so far. No? Okay, great. So bias 00:40:20.209 --> 00:40:23.220 versus variance is one thing that comes up 00:40:23.220 --> 00:40:29.539 often. This bias versus variance is one common diagnostic. And so, 00:40:29.539 --> 00:40:33.180 for other machine learning problems, it's often up to your own ingenuity to figure out 00:40:33.180 --> 00:40:35.700 your own diagnostics to figure out what's wrong. All right? 00:40:35.700 --> 00:40:41.230 So if a machine-learning algorithm isn't working, very often it's up to you to figure out, you 00:40:41.230 --> 00:40:44.300 know, to construct your own tests. Like do you look at the difference training and 00:40:44.300 --> 00:40:46.499 test errors or do you look at something else? 00:40:46.499 --> 00:40:49.929 It's often up to your own ingenuity to construct your own diagnostics to figure out what's 00:40:49.929 --> 00:40:52.589 going on. 00:40:52.589 --> 00:40:55.029 What I want to do is go through another example. All right? 00:40:55.029 --> 00:40:58.890 And this one is slightly more contrived but it'll illustrate another 00:40:58.890 --> 00:41:02.769 common question that comes up, another one of the most common 00:41:02.769 --> 00:41:04.750 issues that comes up in applying 00:41:04.750 --> 00:41:06.089 learning algorithms. 00:41:06.089 --> 00:41:08.319 So in this example, it's slightly more contrived, 00:41:08.319 --> 00:41:11.579 let's say you implement Bayesian logistic regression 00:41:11.579 --> 00:41:17.549 and you get 2 percent error on spam mail and 2 percent error non-spam mail. Right? 
So 00:41:17.549 --> 00:41:19.150 it's rejecting, you know, 00:41:19.150 --> 00:41:21.449 2 percent of - 00:41:21.449 --> 00:41:25.179 it's rejecting 98 percent of your spam mail, which is fine, so 2 percent of all 00:41:25.179 --> 00:41:26.959 spam gets 00:41:26.959 --> 00:41:30.660 through which is fine, but it is also rejecting 2 percent of your good email, 00:41:30.660 --> 00:41:35.489 2 percent of the email from your friends and that's unacceptably high, let's 00:41:35.489 --> 00:41:36.909 say. 00:41:36.909 --> 00:41:39.010 And let's say that 00:41:39.010 --> 00:41:41.899 a support vector machine using a linear kernel 00:41:41.899 --> 00:41:44.830 gets 10 percent error on spam and 00:41:44.830 --> 00:41:49.069 0.01 percent error on non-spam, which is more of the acceptable performance you want. And let's say for the sake of this 00:41:49.069 --> 00:41:53.359 example, let's say you're trying to build an anti-spam system. Right? 00:41:53.359 --> 00:41:56.170 Let's say that you really want to deploy 00:41:56.170 --> 00:41:57.679 logistic regression 00:41:57.679 --> 00:42:01.209 to your customers because of computational efficiency or because you need to 00:42:01.209 --> 00:42:03.389 retrain overnight every day, 00:42:03.389 --> 00:42:07.319 and because logistic regression just runs more easily and more quickly or something. Okay? So let's 00:42:07.319 --> 00:42:08.669 say you want to deploy logistic 00:42:08.669 --> 00:42:12.649 regression, but it's just not working out well. So the 00:42:12.649 --> 00:42:17.609 question is: What do you do next? So it 00:42:17.609 --> 00:42:18.829 turns out that this - 00:42:18.829 --> 00:42:22.319 the issue that comes up here, the one other common question that 00:42:22.319 --> 00:42:24.889 comes up is 00:42:24.889 --> 00:42:30.189 a question of is the algorithm converging. So you might suspect that maybe 00:42:30.189 --> 00:42:33.299 the problem with logistic regression is that it's just not converging.
00:42:33.299 --> 00:42:36.309 Maybe you need to run more iterations. And 00:42:36.309 --> 00:42:37.759 it 00:42:37.759 --> 00:42:40.359 turns out that, again if you look at the optimization objective, say, 00:42:40.359 --> 00:42:43.710 logistic regression is, let's say, optimizing J 00:42:43.710 --> 00:42:46.730 of theta, it actually turns out that if you plot your objective as a function of the number 00:42:46.730 --> 00:42:51.809 of iterations, when you look 00:42:51.809 --> 00:42:55.009 at this curve, you know, it sort of looks like it's going up but it sort of 00:42:55.009 --> 00:42:57.630 looks like it's asymptoting. And 00:42:57.630 --> 00:43:00.949 when you look at these curves, it's often very hard to tell 00:43:00.949 --> 00:43:03.729 if the curve has already flattened out. All right? And you look at these 00:43:03.729 --> 00:43:05.979 curves a lot so you can ask: 00:43:05.979 --> 00:43:08.229 Well has the algorithm converged? When you look at the J of theta like this, it's 00:43:08.229 --> 00:43:10.329 often hard to tell. 00:43:10.329 --> 00:43:14.149 You can run this ten times as long and see if it's flattened out. And you can run this ten 00:43:14.149 --> 00:43:21.079 times as long and it'll often still look like maybe it's going up very slowly, or something. Right? 00:43:21.079 --> 00:43:24.919 So you need a better diagnostic for whether logistic regression has converged than 00:43:24.919 --> 00:43:28.809 looking at this curve. 00:43:28.809 --> 00:43:32.089 The other question you might wonder - the other thing you might 00:43:32.089 --> 00:43:36.709 suspect is a problem is: are you optimizing the right function? 00:43:36.709 --> 00:43:38.920 So 00:43:38.920 --> 00:43:40.599 what you care about, 00:43:40.599 --> 00:43:42.879 right, in spam, say, 00:43:42.879 --> 00:43:44.260 is a 00:43:44.260 --> 00:43:47.499 weighted accuracy function like that.
So A of theta is, 00:43:47.499 --> 00:43:49.190 you know, sum over your 00:43:49.190 --> 00:43:52.249 examples of some weights times whether you got it right. 00:43:52.249 --> 00:43:56.809 And so the weight may be higher for non-spam than for spam mail because you care 00:43:56.809 --> 00:43:57.710 about getting 00:43:57.710 --> 00:44:01.469 your predictions correct for non-spam email much more than for spam mail, say. So let's 00:44:01.469 --> 00:44:02.359 00:44:02.359 --> 00:44:05.469 say A of theta 00:44:05.469 --> 00:44:10.819 is the optimization objective that you really care about, but what Bayesian logistic regression does is 00:44:10.819 --> 00:44:15.400 optimize a quantity like that. Right? It's this 00:44:15.400 --> 00:44:17.689 sort of maximum likelihood thing 00:44:17.689 --> 00:44:19.380 and then with this 00:44:19.380 --> 00:44:20.849 two-norm, you know, 00:44:20.849 --> 00:44:22.779 penalty thing that we saw previously. And you 00:44:22.779 --> 00:44:26.499 might be wondering: Is this the right optimization function to be optimizing? 00:44:26.499 --> 00:44:30.949 Okay? Or: Do I maybe need to change the value for lambda 00:44:30.949 --> 00:44:33.899 to change this parameter? Or: 00:44:33.899 --> 00:44:39.819 Should I maybe really be switching to the support vector machine optimization objective? 00:44:39.819 --> 00:44:42.130 Okay? Does that make sense? So 00:44:42.130 --> 00:44:44.490 the second diagnostic I'm gonna talk about 00:44:44.490 --> 00:44:46.989 is let's say you want to figure out 00:44:46.989 --> 00:44:50.609 is the algorithm converging, is the optimization algorithm converging, or 00:44:50.609 --> 00:44:51.900 is the problem with 00:44:51.900 --> 00:44:57.749 the optimization objective I chose in the first place? Okay? 00:44:57.749 --> 00:45:02.819 So here's 00:45:02.819 --> 00:45:07.329 the diagnostic you can use.
So to 00:45:07.329 --> 00:45:11.029 just reiterate the story, right, let's say an SVM outperforms Bayesian 00:45:11.029 --> 00:45:13.519 logistic regression but you really want to deploy 00:45:13.519 --> 00:45:16.759 Bayesian logistic regression to your problem. Let me 00:45:16.759 --> 00:45:19.049 let theta subscript SVM be the 00:45:19.049 --> 00:45:21.669 parameters learned by an SVM, 00:45:21.669 --> 00:45:25.259 and I'll let theta subscript BLR be the parameters learned by Bayesian 00:45:25.259 --> 00:45:28.049 logistic regression. 00:45:28.049 --> 00:45:32.480 So the optimization objective you care about is this, you know, weighted accuracy 00:45:32.480 --> 00:45:35.079 criterion that I talked about just now. 00:45:35.079 --> 00:45:37.859 And 00:45:37.859 --> 00:45:41.739 the support vector machine outperforms Bayesian logistic regression. And so, you know, 00:45:41.739 --> 00:45:44.969 the weighted accuracy on the support-vector-machine parameters 00:45:44.969 --> 00:45:46.969 is better than the weighted accuracy 00:45:46.969 --> 00:45:50.179 for Bayesian logistic regression. 00:45:50.179 --> 00:45:53.929 So 00:45:53.929 --> 00:45:57.039 further, Bayesian logistic regression tries to optimize 00:45:57.039 --> 00:45:59.410 an optimization objective like that, which I 00:45:59.410 --> 00:46:02.269 denoted J of theta. 00:46:02.269 --> 00:46:05.839 And so, the diagnostic I choose to use is 00:46:05.839 --> 00:46:08.430 to see if J of theta SVM 00:46:08.430 --> 00:46:12.269 is bigger than or less than J of theta BLR. Okay? 00:46:12.269 --> 00:46:14.609 So I explain this on the next slide. 00:46:14.609 --> 00:46:15.569 So 00:46:15.569 --> 00:46:19.529 we know two facts. We know that - well we know one fact. We know that the weighted 00:46:19.529 --> 00:46:20.519 accuracy 00:46:20.519 --> 00:46:23.160 of the support vector machine, right, 00:46:23.160 --> 00:46:24.479 is bigger than 00:46:24.479 --> 00:46:28.859 the weighted accuracy of Bayesian logistic regression.
So 00:46:28.859 --> 00:46:32.209 in order for me to figure out whether Bayesian logistic regression is converging, 00:46:32.209 --> 00:46:35.379 or whether I'm just optimizing the wrong objective function, 00:46:35.379 --> 00:46:41.059 the diagnostic I'm gonna use is to check whether this inequality holds. Okay? 00:46:41.059 --> 00:46:43.549 So let me explain this, 00:46:43.549 --> 00:46:44.769 so in Case 1, 00:46:44.769 --> 00:46:46.029 right, 00:46:46.029 --> 00:46:48.319 it's just those two equations copied over. 00:46:48.319 --> 00:46:50.489 In Case 1, let's say that 00:46:50.489 --> 00:46:54.589 J of SVM is, indeed, greater than J of BLR - or J of 00:46:54.589 --> 00:47:01.169 theta SVM is greater than J of theta BLR. But 00:47:01.169 --> 00:47:04.439 we know that Bayesian logistic regression 00:47:04.439 --> 00:47:07.519 was trying to maximize J of theta; 00:47:07.519 --> 00:47:08.869 that's the definition of 00:47:08.869 --> 00:47:12.359 Bayesian logistic regression. 00:47:12.359 --> 00:47:16.759 So this means that 00:47:16.759 --> 00:47:17.599 theta - 00:47:17.599 --> 00:47:22.029 the value of theta output by Bayesian logistic regression actually fails to 00:47:22.029 --> 00:47:24.209 maximize J 00:47:24.209 --> 00:47:27.309 because the support vector machine actually returned a value of theta that, 00:47:27.309 --> 00:47:28.719 you know, does a 00:47:28.719 --> 00:47:31.349 better job maximizing J. 00:47:31.349 --> 00:47:36.509 And so, this tells me that Bayesian logistic regression didn't actually maximize J 00:47:36.509 --> 00:47:39.319 correctly, and so the problem is with 00:47:39.319 --> 00:47:41.099 the optimization algorithm. The 00:47:41.099 --> 00:47:45.269 optimization algorithm hasn't converged. The other 00:47:45.269 --> 00:47:46.099 case 00:47:46.099 --> 00:47:49.890 is as follows, where 00:47:49.890 --> 00:47:52.579 J of theta SVM is less-than/equal to J of theta 00:47:52.579 --> 00:47:55.719 BLR. Okay?
In this case, what does that mean? This means that Bayesian logistic regression actually attains a higher value for the optimization objective J than the support vector machine does. The support vector machine, which does worse on your optimization problem, actually does better on the weighted accuracy measure. So something that does worse on your optimization objective, on J, can actually do better on the weighted accuracy objective. And this really means that maximizing J of theta doesn't correspond that well to maximizing your weighted accuracy criterion, and therefore J of theta is maybe the wrong optimization objective to be maximizing: maximizing J of theta just wasn't a good objective to choose if you care about the weighted accuracy. Okay? Can you raise your hand if this made sense? Cool, good. So that tells us whether the problem is with the optimization algorithm or with the optimization objective.
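The two cases above amount to a tiny decision rule. Here's an illustrative sketch, not code from the lecture: the function name `diagnose` is hypothetical, and it assumes (as in the story) that the SVM already beats Bayesian logistic regression on the weighted-accuracy criterion, and that J is the objective BLR tries to maximize.

```python
# Sketch of the "optimizer vs. objective" diagnostic (names illustrative).
# Standing assumption: the SVM already wins on weighted accuracy, and
# J is the objective Bayesian logistic regression (BLR) tries to MAXIMIZE.

def diagnose(J_svm: float, J_blr: float) -> str:
    """Compare J evaluated at the SVM's and at BLR's parameters."""
    if J_svm > J_blr:
        # Case 1: BLR failed to maximize its own objective; the SVM found
        # parameters with a higher J, so the optimizer hasn't converged.
        return "optimization algorithm"
    # Case 2: BLR maximizes J at least as well, yet has worse weighted
    # accuracy, so maximizing J is the wrong objective for this problem.
    return "optimization objective"
```

So a single comparison of J at the two parameter vectors points you either at the optimizer or at the objective itself.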
And so, going back to the slide with the eight fixes we had: you notice that running gradient descent for more iterations fixes the optimization algorithm, and trying Newton's method also fixes the optimization algorithm, whereas using a different value for lambda, in that lambda times norm of theta squared term in your objective, fixes the optimization objective. And changing to an SVM is another way of trying to fix the optimization objective. Okay? And so, once again, you actually see this quite often: people will have a problem with the optimization objective and be working harder and harder to fix the optimization algorithm. That's a very common pattern. The problem is in the formula for your J of theta, and yet you see people just running more and more iterations of gradient descent, then trying Newton's method, then trying conjugate gradient, then trying more and more elaborate optimization algorithms, when optimizing J of theta better wasn't going to fix the problem at all. Okay?
So that's another example of where these sorts of diagnostics will help you figure out whether you should be fixing your optimization algorithm or fixing the optimization objective. Okay? Let me think how much time I have. Hmm, let's see. Okay, we have time; let's do this. Let me show you one last example of a diagnostic. This is one that came up in my students' and my work on flying helicopters. This example is the most complex of the three I'm going to do today, and it draws on reinforcement learning, which is something I'm not going to talk about until close to the end of the course; it's just a more complicated example of a diagnostic. I'll probably go over this fairly quickly, and then after we've talked about reinforcement learning in the class, I'll come back and redo this exact same example, because you'll understand it more deeply. Okay? So some of you know that my students and I fly autonomous helicopters. So how do you get a machine learning algorithm to design the controller for a helicopter? This is what we do.
The first step is to build a simulator for the helicopter; there's a screenshot of our simulator. It's like a joystick simulator, and you can fly the helicopter in simulation. Then you choose a cost function (in reinforcement learning it's actually called a reward function, but for this I'll call it a cost function). Say J of theta is the expected squared error in your helicopter's position; so J of theta is maybe the expected squared error, or just the squared error. And then you run a reinforcement learning algorithm, and you'll learn about RL algorithms in a few weeks. You run the reinforcement learning algorithm in your simulator to try to minimize this cost function, to minimize the squared error in how well you're controlling your helicopter's position. Okay? The reinforcement learning algorithm will output some parameters, which I'm denoting theta subscript RL, and then you'll use those to fly your helicopter. So suppose you run this learning algorithm and you get out a set of controller parameters, theta subscript RL, that gives much worse performance than a human pilot. What do you do next?
And in particular, corresponding to the three steps above, there are three natural things you can try. (Oh, the bottom of the slide got chopped off.) You can try to improve the simulator: maybe you think your simulator isn't accurate enough, and you need to capture the aerodynamic effects more accurately, the airflow and turbulence effects around the helicopter. Maybe you need to modify the cost function: maybe squared error isn't cutting it, and what a human pilot does isn't just optimizing squared error but something more subtle. Or maybe the reinforcement learning algorithm isn't working; maybe it's not quite converging or something. Okay? So these are the diagnostics that my students and I actually used to figure out what's going on. Actually, why don't you just think about this for a second, and think about what you'd do, and then I'll go on and tell you what we do. All right, so let me tell you how we do this and see whether it's the same as your idea or not. And if you have a better idea than I do, let me know and I'll let you try it on my helicopter. So here's the line of reasoning I wanted to lay out.
So let's say the controller output by our reinforcement learning algorithm does poorly. Now suppose the contrary, that the following three things hold true. Suppose that the helicopter simulator is accurate, so we have an accurate model of our helicopter. Suppose that the reinforcement learning algorithm correctly controls the helicopter in simulation; we tend to run the learning algorithm in simulation so that it can crash the helicopter and it's fine, right? So assume our reinforcement learning algorithm correctly controls the helicopter so as to minimize the cost function J of theta. And suppose that minimizing J of theta does indeed correspond to correct autonomous flight. If all of these things held true, then the parameters theta RL should actually fly well on my real helicopter. Right? And so the fact that the learned control parameters, theta RL, do not fly well on my helicopter means that one of these three assumptions must be wrong, and I'd like to figure out which of the three assumptions is wrong. Okay?
So these are the diagnostics we use. The first one is that we look at the controller and see if it even flies well in simulation, in the simulator of the helicopter that we did the learning on. If the learned controller flies well in the simulator but doesn't fly well on my real helicopter, then that tells me the problem is probably in the simulator: my simulator predicts the helicopter's controller will fly well, but it doesn't actually fly well in real life, so the problem could be in the simulator, and we should spend our efforts improving the accuracy of our simulator. Otherwise, let theta subscript human be the human control policy. So let's go ahead and ask a human to fly the helicopter (it could be in the simulator, it could be in real life) and measure the mean squared error of the human pilot's flight. And let's see if the human pilot does better or worse than the learned controller, in terms of optimizing this objective function J of theta. Okay?
So if the human does better, that is, if even a very good human pilot attains a better (lower) value on my optimization objective, on my cost function, than my learning algorithm does, then the problem is in the reinforcement learning algorithm. Because my reinforcement learning algorithm was trying to minimize J of theta, but a human actually attains a lower value for J of theta than my algorithm does. And so that tells me that clearly my algorithm isn't managing to minimize J of theta, and that tells me the problem is in the reinforcement learning algorithm. And finally, if the human actually attains a larger value for J of theta, so the human has a larger mean squared error for the helicopter's position than my reinforcement learning algorithm does, and yet I like the way the human flies much better than the way my reinforcement learning algorithm flies: if that holds true, then clearly the problem is in the cost function, because the human does worse on my cost function but flies much better than my learning algorithm. And so that means the problem is in the cost function.
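Putting the three checks together, the helicopter diagnostic is a short cascade. This is a hedged sketch only: the function name and its inputs (a boolean for whether the learned controller flies well in simulation, plus the cost J attained by the RL controller and by the human pilot) are hypothetical stand-ins for the measurements described above, and J here is a cost to be minimized.

```python
# Hedged sketch of the three-way helicopter diagnostic; names and inputs
# are hypothetical stand-ins for the measurements described in the lecture.
# J is a COST to be minimized (e.g. squared position error).

def helicopter_diagnostic(flies_well_in_sim: bool,
                          J_rl: float, J_human: float) -> str:
    """Premise: theta_RL flies poorly on the real helicopter."""
    if flies_well_in_sim:
        # Good in simulation, bad in real life: the simulator is inaccurate.
        return "simulator"
    if J_human < J_rl:
        # The human attains a LOWER cost, so the RL algorithm failed to
        # minimize J: the optimizer is at fault.
        return "reinforcement learning algorithm"
    # RL attains a cost at least as low as the human's, yet the human flies
    # better: minimizing J doesn't correspond to good flight.
    return "cost function"
```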
It means... oh, excuse me, I meant minimizing it, not maximizing it; there's a typo on the slide. Because that means that my learning algorithm does a better job of minimizing the cost function, yet it doesn't fly as well as a human pilot. So that tells you that minimizing the cost function doesn't correspond to good autonomous flight, and what you should do is go back and see if you can change J of theta. Okay? And so for these reinforcement learning problems, if something doesn't work (often reinforcement learning algorithms just work, but when they don't) these are the sorts of diagnostics you use to figure out whether you should be focusing on the simulator, on changing the cost function, or on changing the reinforcement learning algorithm. And again, if you don't know which of the three problems it is, it's entirely possible to spend two years, whatever, building a better simulator for your helicopter. But it turns out that modeling helicopter aerodynamics is an active area of research; there are people writing entire PhD theses on it still.
So it's entirely possible to go out, spend six years, write a PhD thesis, and build a much better helicopter simulator, but if you're fixing the wrong problem, it's not going to help. So quite often, you need to come up with your own diagnostics to figure out what's happening in an algorithm when something is going wrong. What I've described are maybe some of the most common diagnostics that I've used and that I've seen to be useful for many problems, but very often you'll need to come up with your own for your own specific learning problem. And I just want to point out that even when the learning algorithm is working well, it's often a good idea to run diagnostics like the ones I talked about, to make sure you really understand what's going on. All right? And this is useful for a couple of reasons. One is that diagnostics like these will often help you to understand your application problem better. Some of you will graduate from Stanford and go on to some amazingly high-paying job applying machine learning algorithms to an application problem of significant economic interest, and you're going to be working on one specific important machine learning application for many months, or even for years.
One of the most valuable things for you personally will be to build up an intuitive understanding of what works and what doesn't work on your problem. Right now in industry, in Silicon Valley and around the world, there are many companies with important machine learning problems, and there are often people working on the same machine learning problem for many months or for years on end. And when you're doing that, solving a really important problem using learning algorithms, one of the most valuable things is just your own personal intuitive understanding of the problem. Okay? And diagnostics, like the sort I talked about, will be one way for you to get a better and better understanding of these problems. It turns out, by the way, that there are some Silicon Valley companies that outsource their machine learning. So sometimes a company in Silicon Valley will hire a firm in New York to run all their learning algorithms for them.
And I'm not a businessman, but I personally think that's often a terrible idea, because if your expertise, your understanding of your data, is handed off to an outsourcing agency, then you don't maintain that expertise yourself. If there's a problem you really care about, it'll be your own understanding of the problem, built up over months, that'll be really valuable; and if that knowledge is outsourced, you don't get to keep it. I personally think that's a terrible idea, but I'm not a businessman; I just see people do it a lot. Let's see. Another reason for running diagnostics like these is in writing research papers. Diagnostics and error analyses, which I'll talk about in a minute, often help to convey insight about the problem and help justify your research claims. So for example, rather than writing a research paper that says, "Here's an algorithm that works: I built this helicopter and it flies," it's often much more interesting to say, "Here's an algorithm that works, and it works because of a specific component X."
"And moreover, here's the diagnostic that gives you justification that X was the thing that fixed this problem," and that's what made it work. Okay? So that leads me into a discussion of error analysis, which is often good machine learning practice: a way of understanding what your sources of error are. So that's what I call error analysis. And let's check for questions about this first. Yeah? Student: What ended up being wrong with the helicopter? Instructor (Andrew Ng): Oh, I don't know. We've flown so many times, and it changes all the time. Quite often it's actually the simulator; building an accurate simulator of a helicopter is very hard. Yeah. Okay. So error analysis is a way of figuring out what is working in your algorithm and what isn't working. And we're going to talk about two specific examples. So there are many AI systems, many machine learning systems, that combine many different components into a pipeline. Here's a somewhat contrived example, not dissimilar in many ways from the actual machine learning systems you see. Let's say you want to recognize people from images.
This is a picture of one of my friends. So you take this input camera image, say, and you often run it through a long pipeline. For example, the first thing you may do is preprocess the image and remove the background. Then you run a face detection algorithm, a machine learning algorithm to detect people's faces. And then, since you want to recognize the identity of the person (that's your application), you segment out the eyes, segment out the nose, and have different learning algorithms to detect the mouth and so on. (I know; she might not want to be my friend after she sees this.) And then, having found all these features, based on what the nose looks like, what the eyes look like, and so on, you feed all the features into a logistic regression algorithm, and your logistic regression, or softmax regression, or whatever, will tell you the identity of this person. Okay? So this is the setting for error analysis: you have a long, complicated pipeline combining many machine learning components, and many of these components may themselves be learned.
And so it's often very useful to figure out how much of your error can be attributed to each of these components. So what we'll do in a typical error analysis procedure is repeatedly plug in the ground truth for each component and see how the accuracy changes. What I mean by that, in the figure on the bottom right, is this: let's say the overall accuracy of the system is 85 percent. Then I want to know where my 15 percent of error comes from. So what I'll do is go to my test set and, instead of running my actual background removal, give my algorithm the correct background-versus-foreground segmentation. If I do that (let's color that blue to denote that I'm giving ground-truth data in the test set), let's say our accuracy increases to 85.1 percent. And now I'll go in and give my algorithm the ground-truth face detection output: on my test set I'll just tell the algorithm where the face is. And if I do that, let's say my algorithm's accuracy increases to 91 percent, and so on.
And then I'll go through each of these components and give it the ground-truth label. For example, the nose segmentation algorithm is trying to figure out where the nose is, so I just go in and tell it where the nose is, and it doesn't have to figure that out. And as I do this, one component after another, I end up giving every stage the correct output, and I end up with 100 percent accuracy. (Sorry, this is cut off on the bottom; it says logistic regression, 100 percent.) Now you can look at this table and see how much giving the ground-truth labels for each of these components could help boost your final performance. In particular, if you look at this table, you notice that when I added the face detection ground truth, my performance jumped from 85.1 percent accuracy to 91 percent accuracy. So this tells me that if only I could get better face detection, maybe I could boost my accuracy by about 6 percent. Whereas, in contrast, when I plugged in better background removal, my accuracy improved only from 85 to 85.1 percent.
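Tabulating these marginal gains is easy to automate. Below is an illustrative sketch: the component names and accuracy numbers just mirror the contrived face-recognition example (the intermediate segmentation numbers are made up for illustration), and `error_analysis_gains` is a hypothetical helper, not library code.

```python
# Illustrative error-analysis tabulation for a pipeline. Accuracies are
# made-up numbers mirroring the contrived face-recognition example;
# `error_analysis_gains` is a hypothetical helper.

def error_analysis_gains(rows):
    """rows: (stage, accuracy with ground truth plugged in up to and
    including that stage), starting from the overall system.
    Returns each stage's marginal gain."""
    gains = []
    prev = rows[0][1]  # overall system accuracy, no ground truth yet
    for name, acc in rows[1:]:
        gains.append((name, acc - prev))
        prev = acc
    return gains

table = [
    ("overall system", 85.0),
    ("background removal", 85.1),
    ("face detection", 91.0),
    ("eyes segmentation", 95.0),
    ("nose segmentation", 96.0),
    ("mouth segmentation", 97.0),
    ("logistic regression", 100.0),
]
# The largest marginal gain (face detection, ~5.9 points) marks the
# component most worth improving.
```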
And so this sort of diagnostic also tells you that if your goal is to improve the system, it's probably a waste of your time to try to improve your background subtraction, because even if you got the ground truth, that gives you at most 0.1 percent accuracy, whereas if you do better face detection, there's a much larger potential for gains. Okay? So this sort of diagnostic, again, is very useful, because if your goal is to improve the system, there are so many different pieces you could easily choose to spend the next three months on. Choosing the right piece is critical, and this sort of diagnostic tells you which piece may actually be worth your time to work on. There's another type of analysis that's sort of the opposite of what I just talked about. The error analysis I just described tries to explain the difference between the current performance and perfect performance, whereas this sort of ablative analysis tries to explain the difference between some baseline, some really bad performance, and your current performance. For this example, let's suppose you've built a very good anti-spam classifier by adding lots of clever features to your logistic regression algorithm.
So you added features for spelling correction, sender host features, email header features, email text parser features, JavaScript parser features, features for embedded images, and so on. Now let's say you review the system and you want to figure out how much each of these components actually contributed. Maybe you want to write a research paper and claim this was the piece that made the big difference; can you actually document that claim and justify it? So in ablative analysis, here's what we do. In this example, let's say that simple logistic regression, without any of your clever improvements, gets 94 percent performance, and you want to figure out what accounts for your improvement from 94 to 99.9 percent performance. So in ablative analysis, instead of adding components one at a time, we remove components one at a time and see how performance degrades. We start with the overall system, which has 99.9 percent accuracy. Then we remove spelling correction and see how much performance drops. Then we remove the sender host features and see how much performance drops, and so on. All right?
And so, 01:10:24.219 --> 01:10:28.150 in this contrived example, 01:10:28.150 --> 01:10:31.120 you see that, I guess, the biggest drop 01:10:31.120 --> 01:10:32.380 occurred when you removed 01:10:32.380 --> 01:10:37.560 the text parser features. And so you can then make a credible case that, 01:10:37.560 --> 01:10:41.280 you know, the text parser features were what really made the biggest difference here. Okay? 01:10:41.280 --> 01:10:42.699 And you can also tell, 01:10:42.699 --> 01:10:45.530 for instance, that, I don't know, 01:10:45.530 --> 01:10:49.360 when removing the sender host features on this 01:10:49.360 --> 01:10:52.280 line, right, performance dropped from 99.9 to 98.9. And so this also means 01:10:52.280 --> 01:10:53.139 that 01:10:53.139 --> 01:10:56.449 in case you want to get rid of the sender host features to speed up 01:10:56.449 --> 01:11:03.449 computation, that would be a good candidate for elimination. Okay? Are there any 01:11:03.630 --> 01:11:05.420 guarantees that if you shuffle around the order in which 01:11:05.420 --> 01:11:06.420 you drop those 01:11:06.420 --> 01:11:09.580 features that you'll get the same result? Yeah, let's address the question: what if you shuffle the order in which you remove things? The answer is no. There's 01:11:09.580 --> 01:11:12.110 no guarantee you'd get a similar result. 01:11:12.110 --> 01:11:13.890 So in practice, 01:11:13.890 --> 01:11:17.730 sometimes there's a fairly natural ordering for both types of analyses, the error 01:11:17.730 --> 01:11:19.329 analysis and the ablative analysis, 01:11:19.329 --> 01:11:22.749 sometimes there's a fairly natural ordering in which you add things or remove things, 01:11:22.749 --> 01:11:24.559 and sometimes there isn't.
And 01:11:24.559 --> 01:11:28.469 quite often, you either choose one ordering and just go for it. 01:11:28.469 --> 01:11:32.070 And don't think of these analyses as sort of formulas that are set in stone, though; I mean, 01:11:32.070 --> 01:11:35.239 feel free to invent your own, as well. You know, 01:11:35.239 --> 01:11:36.639 one of the things 01:11:36.639 --> 01:11:37.920 that's done quite often is 01:11:37.920 --> 01:11:39.310 to take the overall system 01:11:39.310 --> 01:11:43.289 and just remove one component and then put it back, then remove a different one 01:11:43.289 --> 01:11:48.129 and put it back, until all of these things are done. Okay. 01:11:48.129 --> 01:11:51.009 So the very last thing I want to talk about is sort of this 01:11:51.009 --> 01:11:57.979 general advice for how to get started on a learning problem. So 01:11:57.979 --> 01:12:03.840 here's a cartoon description of two broad ways to get started on a learning problem. 01:12:03.840 --> 01:12:05.740 The first one is 01:12:05.740 --> 01:12:07.610 to carefully design your system, so 01:12:07.610 --> 01:12:11.739 you spend a long time designing exactly the right features, collecting the right data set, and 01:12:11.739 --> 01:12:14.189 designing the right algorithmic structure, then you 01:12:14.189 --> 01:12:17.679 implement it and hope it works. All right? 01:12:17.679 --> 01:12:21.039 The benefit of this sort of approach is you get maybe nicer, maybe more scalable 01:12:21.039 --> 01:12:22.429 algorithms, 01:12:22.429 --> 01:12:26.719 and maybe you come up with new, elegant learning algorithms.
And if your goal is to, 01:12:26.719 --> 01:12:30.760 you know, contribute to basic research in machine learning, if your goal is to invent new machine learning 01:12:30.760 --> 01:12:31.499 algorithms, 01:12:31.499 --> 01:12:33.550 this process of slowing down and 01:12:33.550 --> 01:12:36.300 thinking deeply about the problem, you know, is sort of the right way to go 01:12:36.300 --> 01:12:37.119 about it: 01:12:37.119 --> 01:12:41.099 think deeply about a problem and invent new solutions. 01:12:41.099 --> 01:12:42.280 01:12:42.280 --> 01:12:44.079 The second sort of approach 01:12:44.079 --> 01:12:48.839 is what I call build-and-fix, where you implement something quick and dirty 01:12:48.839 --> 01:12:52.309 and then you run error analyses and diagnostics to figure out what's wrong and 01:12:52.309 --> 01:12:54.199 you fix those errors. 01:12:54.199 --> 01:12:58.130 The benefit of this second type of approach is that it'll often get your 01:12:58.130 --> 01:13:01.119 application working much more quickly. 01:13:01.119 --> 01:13:04.400 And especially for those of you who end up working in 01:13:04.400 --> 01:13:05.550 a company, 01:13:05.550 --> 01:13:07.460 you know, very often it's not 01:13:07.460 --> 01:13:10.900 the best product that wins; it's the first product to market that 01:13:10.900 --> 01:13:11.690 wins. And 01:13:11.690 --> 01:13:14.869 so, especially in industry, there's really something to be said for, 01:13:14.869 --> 01:13:18.790 you know, building a system quickly and getting it deployed quickly. 01:13:18.790 --> 01:13:23.139 And this second approach of building a quick-and-dirty, I'm gonna say, hack 01:13:23.139 --> 01:13:26.469 and then fixing the problems will actually get you to a 01:13:26.469 --> 01:13:27.839 system that works well 01:13:27.839 --> 01:13:30.969 much more quickly.
01:13:30.969 --> 01:13:32.649 And the reason is 01:13:32.649 --> 01:13:36.149 that very often it's really not clear which parts of a system are easy or hard to 01:13:36.149 --> 01:13:37.590 build, and therefore which parts 01:13:37.590 --> 01:13:40.179 you need to spend lots of time focusing on. 01:13:40.179 --> 01:13:43.420 So there's that example I talked about just now. Right? 01:13:43.420 --> 01:13:46.929 For identifying 01:13:46.929 --> 01:13:48.710 people, say. 01:13:48.710 --> 01:13:53.199 And with a big, complicated learning system like this, a big, complicated pipeline like this, 01:13:53.199 --> 01:13:55.590 it's really not obvious at the outset 01:13:55.590 --> 01:13:59.130 which of these components you should spend lots of time working on. Right? And if 01:13:59.130 --> 01:14:00.960 you didn't know that 01:14:00.960 --> 01:14:03.800 preprocessing wasn't the right component, you could easily have 01:14:03.800 --> 01:14:07.269 spent three months working on better background subtraction, not knowing that it's 01:14:07.269 --> 01:14:09.880 just not gonna ultimately matter. 01:14:09.880 --> 01:14:10.769 And so 01:14:10.769 --> 01:14:13.690 the only way to find out what really works is implementing something quickly 01:14:13.690 --> 01:14:15.350 and finding out 01:14:15.350 --> 01:14:16.889 which parts 01:14:16.889 --> 01:14:17.889 are really 01:14:17.889 --> 01:14:21.359 the hard parts to implement, or which parts could make a 01:14:21.359 --> 01:14:23.079 difference in performance. 01:14:23.079 --> 01:14:26.579 In fact, I'd say that if your goal is to build a 01:14:26.579 --> 01:14:29.309 people recognition system, a system like this is actually far too 01:14:29.309 --> 01:14:31.639 complicated as your initial system. 01:14:31.639 --> 01:14:35.560 Maybe after you've prototyped a few systems, you'll converge on a system like this.
But if this 01:14:35.560 --> 01:14:42.560 is the first system you're designing, this is much too complicated. Also, here's a 01:14:43.570 --> 01:14:48.059 very concrete piece of advice, and this applies to your projects as well. 01:14:48.059 --> 01:14:51.229 If your goal is to build a working application, 01:14:51.229 --> 01:14:55.260 Step 1 is actually probably not to design a system like this. Step 1 is to plot your 01:14:55.260 --> 01:14:57.279 data. 01:14:57.279 --> 01:15:01.219 And very often, if you just take the data you're trying to predict and just plot your 01:15:01.219 --> 01:15:05.729 data, plot X, plot Y, plot your data every way you can think of, 01:15:05.729 --> 01:15:10.309 you know, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they 01:15:10.309 --> 01:15:13.899 should be positive. Something's wrong with this dataset." It's about 01:15:13.899 --> 01:15:18.389 half the time that you find something obviously wrong with your data or something very surprising. 01:15:18.389 --> 01:15:21.570 And this is something you find out just by plotting your data, that you 01:15:21.570 --> 01:15:28.179 won't find out by implementing these big, complicated learning algorithms on it. Plotting 01:15:28.179 --> 01:15:31.920 the data sounds so simple; it's one of the pieces of advice that lots of us give but 01:15:31.920 --> 01:15:38.570 hardly anyone follows, so you can take that for what it's worth. 01:15:38.570 --> 01:15:42.199 Let me just reiterate: what I just said here may be bad advice 01:15:42.199 --> 01:15:44.019 if your goal is to come up with 01:15:44.019 --> 01:15:46.639 new machine learning algorithms. All right? So 01:15:46.639 --> 01:15:51.019 for me personally, the learning algorithm I use the most often is probably 01:15:51.019 --> 01:15:53.600 logistic regression, because I have code lying around.
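As a concrete illustration of that kind of quick baseline, here is a minimal sketch of plain logistic regression fit by batch gradient ascent. This code is my own addition for illustration, not from the lecture, and it assumes NumPy with a feature matrix `X` and 0/1 labels `y`.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, steps=2000):
    """Fit weights and bias by batch gradient ascent on the average log-likelihood."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid: predicted P(y=1 | x)
        w += lr * X.T @ (y - p) / len(y)        # gradient step for the weights
        b += lr * np.mean(y - p)                # gradient step for the bias
    return w, b

def predict(X, w, b):
    """Classify as 1 wherever the decision function is positive."""
    return (X @ w + b > 0).astype(int)
```

Trying a simple baseline like this first tells you quickly how hard the problem actually is, before you invest in anything fancier.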
So give me a 01:15:53.600 --> 01:15:56.770 learning problem, and I probably won't try anything more complicated than logistic 01:15:56.770 --> 01:15:58.260 regression on it first. And it's 01:15:58.260 --> 01:16:01.940 only after trying something really simple and figuring out what's easy and what's hard that you know 01:16:01.940 --> 01:16:03.940 where to focus your efforts. But 01:16:03.940 --> 01:16:07.610 again, if your goal is to invent new machine learning algorithms, then you sort of don't 01:16:07.610 --> 01:16:10.750 want to hack up something and then add another hack to fix it, and hack it even more to 01:16:10.750 --> 01:16:12.219 fix it. Right? So if 01:16:12.219 --> 01:16:15.919 your goal is to do novel machine learning research, then it pays to think more deeply about the 01:16:15.919 --> 01:16:21.340 problem and not follow this advice specifically. 01:16:21.340 --> 01:16:22.919 Shoot, you know what? All 01:16:22.919 --> 01:16:28.280 right, sorry if I'm running late, but I just have two more slides, so I'm gonna go through these quickly. 01:16:28.280 --> 01:16:30.620 And so, this is what I think 01:16:30.620 --> 01:16:33.459 of as premature statistical optimization, 01:16:33.459 --> 01:16:35.079 where quite often, 01:16:35.079 --> 01:16:38.320 just like premature optimization of code, 01:16:38.320 --> 01:16:44.369 people will prematurely optimize one component of a big, complicated machine learning system. Okay? Just two more 01:16:44.369 --> 01:16:46.949 slides. 01:16:46.949 --> 01:16:48.539 This 01:16:48.539 --> 01:16:52.070 is a sort of cartoon that highly influenced my own thinking. It was based on 01:16:52.070 --> 01:16:55.340 a paper written by Christos Papadimitriou. 01:16:55.340 --> 01:16:57.429 This is how 01:16:57.429 --> 01:16:59.360 01:16:59.360 --> 01:17:02.360 developmental progress of research often happens. Right?
01:17:02.360 --> 01:17:05.559 Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. And it 01:17:05.559 --> 01:17:06.519 seems like a useful thing to have. 01:17:06.519 --> 01:17:09.670 Right? You know, free people up, so they don't have 01:17:09.670 --> 01:17:12.760 to deliver mail. So 01:17:12.760 --> 01:17:14.280 to deliver mail, 01:17:14.280 --> 01:17:19.139 obviously you need a robot to wander around indoor environments and you need a robot to 01:17:19.139 --> 01:17:21.480 manipulate objects and pick up envelopes. And so, 01:17:21.480 --> 01:17:24.890 you need to build those two components in order to get a mail delivery robot. And 01:17:24.890 --> 01:17:25.590 so I've 01:17:25.590 --> 01:17:29.650 drawn those two components and little arrows to denote that, you know, obstacle avoidance 01:17:29.650 --> 01:17:30.460 is 01:17:30.460 --> 01:17:32.229 needed or would help build 01:17:32.229 --> 01:17:35.510 your mail delivery robot. Well, 01:17:35.510 --> 01:17:37.189 for obstacle avoidance, 01:17:37.189 --> 01:17:43.159 clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles. 01:17:43.159 --> 01:17:46.840 Now we're gonna use computer vision to detect the objects. And so, 01:17:46.840 --> 01:17:51.120 we know that, you know, lighting sometimes changes, right, depending on whether it's 01:17:51.120 --> 01:17:52.709 morning or noontime or evening. 01:17:52.709 --> 01:17:53.930 Lighting 01:17:53.930 --> 01:17:56.639 changes cause the color of things to change, and so you need 01:17:56.639 --> 01:18:00.509 an object detection system that's invariant to the specific colors of an 01:18:00.509 --> 01:18:01.199 object. Right? 01:18:01.199 --> 01:18:04.420 Because lighting 01:18:04.420 --> 01:18:05.400 changes, 01:18:05.400 --> 01:18:09.849 say. Well, color, or RGB values, are represented by three-dimensional vectors.
And 01:18:09.849 --> 01:18:11.170 so you need to learn 01:18:11.170 --> 01:18:13.499 when two colors might be the same thing, 01:18:13.499 --> 01:18:15.260 when, you know, 01:18:15.260 --> 01:18:18.159 the visual appearances of two colors may be the same thing with just a lighting change or 01:18:18.159 --> 01:18:19.539 something. 01:18:19.539 --> 01:18:20.590 And 01:18:20.590 --> 01:18:24.059 to understand that properly, we can go out and study the differential geometry 01:18:24.059 --> 01:18:27.510 of 3d manifolds, because that helps us build a sound theory on which 01:18:27.510 --> 01:18:32.250 to develop our 3d similarity learning algorithms. 01:18:32.250 --> 01:18:36.159 And to really understand the fundamental aspects of this problem, 01:18:36.159 --> 01:18:40.110 we have to study the complexity of non-Riemannian geometries. And on 01:18:40.110 --> 01:18:43.850 and on it goes until eventually you're proving convergence bounds for 01:18:43.850 --> 01:18:49.790 sampled non-monotonic logic. I don't even know what this is because I just made it up. 01:18:49.790 --> 01:18:51.530 Whereas in reality, 01:18:51.530 --> 01:18:53.970 you know, chances are that link isn't real. 01:18:53.970 --> 01:18:55.660 Color invariance 01:18:55.660 --> 01:18:59.550 just barely helped object recognition, maybe. I'm making this up. 01:18:59.550 --> 01:19:03.499 Maybe differential geometry was hardly gonna help 3d similarity learning, and that link's also gonna fail. Okay? 01:19:03.499 --> 01:19:05.270 So, each of 01:19:05.270 --> 01:19:09.130 these circles can represent a person, or a research community, or a thought in your 01:19:09.130 --> 01:19:12.020 head. And there's a very real chance that 01:19:12.020 --> 01:19:15.469 maybe there are all these papers written on differential geometry of 3d manifolds, and they are 01:19:15.469 --> 01:19:18.570 written because some guy once told someone else that it'll help 3d similarity learning.
01:19:18.570 --> 01:19:20.489 And, 01:19:20.489 --> 01:19:23.369 you know, it's like, "A friend of mine told me that color invariance would help in 01:19:23.369 --> 01:19:26.119 object recognition, so I'm working on color invariance. And now I'm gonna tell a friend 01:19:26.119 --> 01:19:27.440 of mine 01:19:27.440 --> 01:19:30.280 that his thing will help my problem. And he'll tell a friend of his that his thing will help 01:19:30.280 --> 01:19:31.619 with his problem." 01:19:31.619 --> 01:19:33.520 And pretty soon, you're working on 01:19:33.520 --> 01:19:37.540 convergence bounds for sampled non-monotonic logic, when in reality none of these will 01:19:37.540 --> 01:19:39.129 see the light of 01:19:39.129 --> 01:19:42.519 day in your mail delivery robot. Okay? 01:19:42.519 --> 01:19:46.599 I'm not criticizing the role of theory. There are very powerful theories, like the 01:19:46.599 --> 01:19:48.400 theory of VC dimension, 01:19:48.400 --> 01:19:52.090 which is far, far, far to the right of this. So VC dimension is about 01:19:52.090 --> 01:19:53.289 as theoretical 01:19:53.289 --> 01:19:57.119 as it can get. And it's clearly had a huge impact on many applications. And it's, 01:19:57.119 --> 01:19:59.559 you know, dramatically advanced machine learning. And another example is the theory of NP-hardness, which again, you know, 01:19:59.559 --> 01:20:00.750 is about as 01:20:00.750 --> 01:20:04.219 theoretical as it can get. It's 01:20:04.219 --> 01:20:05.800 had a huge impact 01:20:05.800 --> 01:20:09.309 on all of computer science, the theory of NP-hardness.
01:20:09.309 --> 01:20:10.669 But 01:20:10.669 --> 01:20:13.799 when you are off working on highly theoretical things, I guess, to me 01:20:13.799 --> 01:20:16.849 personally it's important to keep in mind: 01:20:16.849 --> 01:20:19.699 are you working on something like VC dimension, which is high impact, or are you 01:20:19.699 --> 01:20:23.290 working on something like convergence bounds for sampled non-monotonic logic, which 01:20:23.290 --> 01:20:24.710 you're only hoping 01:20:24.710 --> 01:20:25.900 has some peripheral relevance 01:20:25.900 --> 01:20:30.040 to some application? Okay? 01:20:30.040 --> 01:20:34.849 For me personally, I tend to work on an application only if I - excuse me. 01:20:34.849 --> 01:20:36.989 For me personally, and this is a personal choice, 01:20:36.989 --> 01:20:41.340 I tend to trust something only if I personally can see a link from the 01:20:41.340 --> 01:20:42.679 theory I'm working on 01:20:42.679 --> 01:20:44.430 all the way back to an application. 01:20:44.430 --> 01:20:46.010 And 01:20:46.010 --> 01:20:50.299 if I don't personally see a direct link from what I'm doing to an application, then, 01:20:50.299 --> 01:20:53.429 you know, that's fine. I can choose to work on theory, but 01:20:53.429 --> 01:20:55.650 I wouldn't necessarily trust that 01:20:55.650 --> 01:20:59.210 the theory I'm working on will relate to an application, if I don't personally 01:20:59.210 --> 01:21:02.429 see a link all the way back. 01:21:02.429 --> 01:21:04.400 Just to summarize. 01:21:04.400 --> 01:21:06.409 01:21:06.409 --> 01:21:08.679 One lesson to take away from today is, I think, that 01:21:08.679 --> 01:21:12.529 time spent coming up with diagnostics for learning algorithms is often time well spent. 01:21:12.529 --> 01:21:13.029 01:21:13.029 --> 01:21:16.199 It's often up to your own ingenuity to come up with great diagnostics.
And 01:21:16.199 --> 01:21:19.019 just personally, when I work on a machine learning algorithm, 01:21:19.019 --> 01:21:21.169 it's not uncommon for me to be spending 01:21:21.169 --> 01:21:23.679 between a third and often half of my time 01:21:23.679 --> 01:21:26.409 just writing diagnostics and trying to figure out what's going right and what's 01:21:26.409 --> 01:21:28.079 going wrong. 01:21:28.079 --> 01:21:31.500 Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and 01:21:31.500 --> 01:21:34.780 making progress. You don't want to be spending all this time, you know, implementing tests on your 01:21:34.780 --> 01:21:38.280 learning algorithms; it doesn't feel like you're doing anything. But when 01:21:38.280 --> 01:21:41.419 I implement learning algorithms, at least a third, and quite often half, of 01:21:41.419 --> 01:21:45.880 my time is actually spent implementing those tests so you can figure out what to work on. And 01:21:45.880 --> 01:21:49.219 I think it's actually one of the best uses of your time. We talked 01:21:49.219 --> 01:21:50.729 about error 01:21:50.729 --> 01:21:54.319 analysis and ablative analysis, and lastly we 01:21:54.319 --> 01:21:56.890 talked about, you know, different approaches and the 01:21:56.890 --> 01:22:00.979 risks of premature statistical optimization. Okay. 01:22:00.979 --> 01:22:04.339 Sorry I ran over. I'll be here for a few more minutes for your questions.