Lecture 11 | Machine Learning (Stanford)
-
0:10 - 0:12
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:25 - 0:28Okay. Good morning. Welcome back.
-
0:28 - 0:31What I want to do today is
-
0:31 - 0:37actually wrap up our discussion on learning theory,
-
0:37 - 0:38and I'm gonna start by
-
0:38 - 0:42talking about Bayesian statistics and regularization,
-
0:42 - 0:46and then take a very brief digression to tell you about online learning.
-
0:46 - 0:50And most of today's lecture will actually be on various pieces of that, so applying
-
0:50 - 0:52machine learning algorithms to problems like, you know,
-
0:52 - 0:56like the project or other problems you may go work on after you graduate from this
-
0:57 - 0:58class. But
-
0:58 - 1:02let's start by talking about Bayesian statistics and regularization.
-
1:02 - 1:04So you remember from last week,
-
1:04 - 1:09we started to talk about learning theory and we learned about bias and
-
1:09 - 1:12variance. And I guess in the previous lecture, we spent most of the previous
-
1:12 - 1:13lecture
-
1:13 - 1:18talking about algorithms for model selection and for
-
1:18 - 1:21feature selection. We talked about cross-validation. Right? So
-
1:21 - 1:24most of the methods we talked about in the previous lecture were
-
1:24 - 1:28ways for you to try to simplify the model. So for example,
-
1:28 - 1:32the feature selection algorithms we talked about give you a way to eliminate a number
-
1:32 - 1:33of features,
-
1:33 - 1:36so as to reduce the number of parameters you need to fit and thereby reduce
-
1:36 - 1:39overfitting. Right? You remember that? So feature
-
1:39 - 1:41selection algorithms
-
1:42 - 1:44choose a subset of the features
-
1:44 - 1:48so that you have less parameters and you may be less likely to overfit. Right?
-
1:48 - 1:53What I want to do today is to talk about a different way to prevent overfitting.
-
1:53 - 1:54And
-
1:54 - 1:58there's a method called regularization, and it's a way that lets you keep all the
-
1:58 - 1:59parameters.
-
2:00 - 2:04So here's the idea, and I'm gonna illustrate this example with,
-
2:04 - 2:07say, linear regression.
-
2:07 - 2:09So
-
2:09 - 2:13you take
-
2:13 - 2:17the linear regression model, the very first model we learned about,
-
2:17 - 2:18right,
-
2:18 - 2:22we said that we would choose the parameters
-
2:22 - 2:24via
-
2:24 - 2:30maximum likelihood.
-
2:30 - 2:32Right? And that meant that,
-
2:32 - 2:35you know, you would choose the parameters theta
-
2:35 - 2:40that maximized
-
2:40 - 2:43the probability of the data,
-
2:43 - 2:46which is parameters theta that maximized the probability of the data we observe.
-
2:46 - 2:51Right?
-
2:51 - 2:52And so
-
2:52 - 2:56to give this sort of procedure a name, this is one example of a most
-
2:56 - 3:00common frequentist procedure,
-
3:00 - 3:05and frequentist statistics, you can think of sort of as maybe one school of statistics.
-
3:05 - 3:07And the philosophical view
-
3:07 - 3:09behind writing this down was
-
3:09 - 3:13we envisioned that there was some true parameter theta out there that generated,
-
3:13 - 3:16you know, the Xs and the Ys. There's some true parameter theta
-
3:16 - 3:21that governs housing prices, Y as a function of X,
-
3:21 - 3:23and we don't know what the value of theta is,
-
3:23 - 3:28and we'd like to come up with some procedure for estimating the value of theta. Okay?
-
3:28 - 3:28And so,
-
3:28 - 3:31maximum likelihood is just one possible procedure
-
3:31 - 3:33for estimating the unknown value
-
3:33 - 3:36for theta.
-
3:36 - 3:38And
-
3:38 - 3:41the way you formulated this, you know, theta was not a random variable. Right?
-
3:41 - 3:42That's why we said,
-
3:42 - 3:45so theta is just some true value out there. It's not random or anything, we just don't
-
3:45 - 3:50know what it is, and we have a procedure called maximum likelihood for estimating the
-
3:50 - 3:55value for theta. So this is one example of what's called a frequentist procedure.
-
3:55 - 3:57The alternative to the, I
-
3:57 - 4:04guess, the frequentist school of statistics is the Bayesian school,
-
4:04 - 4:07in which
-
4:07 - 4:10we're gonna say that we don't know what theta is,
-
4:10 - 4:13and so we will put a prior
-
4:13 - 4:15on
-
4:15 - 4:18theta. Okay? So in the Bayesian school, people would say, "Well, we
-
4:18 - 4:22don't know what the value of theta is, so let's represent our uncertainty over theta with a
-
4:22 - 4:27prior."
-
4:28 - 4:32So for example,
-
4:32 - 4:35our prior on theta
-
4:35 - 4:37may be a Gaussian distribution
-
4:37 - 4:40with mean zero and covariance matrix
-
4:40 - 4:43given by tau squared I. Okay?
-
4:43 - 4:45And so -
-
4:45 - 4:52actually, if I use S to denote my training set, well - right,
-
4:56 - 5:00so this prior represents my beliefs about what the parameters are in the absence of
-
5:00 - 5:01any data. So
-
5:01 - 5:03not having seen any data, this prior
-
5:03 - 5:06represents, you know, what I think theta
-
5:06 - 5:10is most likely to be.
-
5:10 - 5:16And so given the training set, S, in the sort of Bayesian procedure, we would,
-
5:16 - 5:22well,
-
5:22 - 5:27calculate the probability, the posterior probability of my parameters
-
5:27 - 5:29given my training set, and - let's
-
5:29 - 5:30write
-
5:30 - 5:32this on the next board.
-
5:32 - 5:33So my posterior
-
5:33 - 5:37on my parameters given my training set, by Bayes' rule, this will
-
5:37 - 5:40be proportional to, you
-
5:40 - 5:47know, the likelihood times the prior: P of S given theta, times P of theta.
-
5:51 - 5:57Right? So by Bayes' rule. Let's call
-
5:57 - 6:01it posterior.
-
6:01 - 6:06And this distribution now represents my beliefs about what theta is after I've
-
6:06 - 6:08seen the training set.
-
6:08 - 6:15And when you now want to make a new prediction on the price of a new house,
-
6:20 - 6:22on the input X,
-
6:22 - 6:27I would say that, well, the distribution over the possible housing prices for
-
6:27 - 6:31this new house I'm trying to estimate the price of, say,
-
6:31 - 6:34given the size of the house, the features of the house at X, and the training
-
6:34 - 6:37set I had previously, it
-
6:37 - 6:44is going to be given by
-
6:47 - 6:50an integral over my parameters theta of
-
6:50 - 6:53probably of Y given X comma theta
-
6:53 - 6:55and times
-
6:55 - 7:00the posterior distribution of theta given the training set. Okay?
-
7:00 - 7:02And in
-
7:02 - 7:04particular, if you want your prediction to be
-
7:04 - 7:08the expected value
-
7:08 - 7:09of Y given
-
7:09 - 7:10the
-
7:10 - 7:14input X in training set, you would
-
7:14 - 7:16say integrate
-
7:16 - 7:23over Y
-
7:24 - 7:27times the posterior. Okay?
-
7:27 - 7:34You would take an expectation of Y with respect to your posterior distribution. Okay?
-
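To make these two steps concrete, here is a minimal Python sketch of the posterior-predictive computation for Bayesian linear regression, one of the few cases where the posterior has a closed form (it's Gaussian); the toy dataset and the values of sigma and tau are assumptions made up purely for illustration, and the integral over theta is approximated by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set S = (X, y): y is roughly 2*x plus noise.
X = rng.normal(size=(20, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=20)

sigma2 = 0.1 ** 2   # assumed noise variance in p(y | x, theta)
tau2 = 1.0          # assumed prior variance: theta ~ N(0, tau^2 I)

# For a Gaussian prior and Gaussian likelihood, the posterior p(theta | S)
# is itself Gaussian, with this mean and covariance (closed form).
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(1) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# Predictive mean E[y | x_new, S] = integral of (theta^T x_new) p(theta | S) dtheta,
# approximated here by Monte Carlo: draw thetas from the posterior and average.
x_new = np.array([1.5])
thetas = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
print("Monte Carlo predictive mean:", np.mean(thetas @ x_new))
print("Exact predictive mean:     ", mu_post @ x_new)
```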
7:35 - 7:39And you notice that when I was writing this down, so with the Bayesian
-
7:39 - 7:42formulation, and now started to write up here Y given X
-
7:42 - 7:44comma theta because
-
7:44 - 7:48this formula now is the probability of Y conditioned on the values of the
-
7:48 - 7:52random variables X and theta. So I'm no longer writing semicolon theta, I'm writing
-
7:52 - 7:53comma theta
-
7:53 - 7:56because I'm now treating theta
-
7:56 - 8:00as a random variable. So all
-
8:00 - 8:03of this is somewhat abstract but this is
-
8:03 - 8:04- and
-
8:04 - 8:11it turns out - actually let's check. Are there questions about this? No? Okay. Let's
-
8:15 - 8:18try to make this more concrete. It turns out that
-
8:18 - 8:22for many problems,
-
8:22 - 8:26both of these steps in the computation are difficult because if,
-
8:26 - 8:29you know, theta is an N plus one-dimensional vector, is an N plus
-
8:29 - 8:31one-dimensional parameter vector,
-
8:31 - 8:35then this is an integral over an N plus one-dimensional space, you know, over R N
-
8:35 - 8:36plus one.
-
8:36 - 8:39And because numerically it's very difficult
-
8:39 - 8:41to compute integrals over
-
8:41 - 8:45very high dimensional spaces, all right? So
-
8:45 - 8:49usually this integral - actually usually it's hard to compute the posterior over theta
-
8:49 - 8:54and it's also hard to compute this integral if theta is very high dimensional. There are few
-
8:54 - 8:56exceptions for which this can be done in
-
8:56 - 8:58closed form, but for
-
8:58 - 9:00many learning algorithms, say,
-
9:00 - 9:05Bayesian logistic regression, this is hard to do.
-
9:05 - 9:10And so what's commonly done is to take the posterior distribution
-
9:10 - 9:13and instead of actually computing a full posterior distribution, P of theta
-
9:13 - 9:15given S,
-
9:15 - 9:17we'll instead
-
9:17 - 9:19take this quantity on the right-hand side
-
9:19 - 9:24and just maximize this quantity on the right-hand side. So let me write this down.
-
9:24 - 9:27So
-
9:27 - 9:30commonly, instead of computing the full posterior distribution,
-
9:30 - 9:37we will choose the following. Okay?
-
9:42 - 9:47We will choose what's called the MAP estimate, or the maximum a posteriori
-
9:47 - 9:50estimate of theta, which is the most likely value of theta,
-
9:50 - 9:54most probable value of theta under your posterior distribution.
-
9:54 - 9:56And that's just
-
9:56 - 10:03arg max over theta of P
-
10:08 - 10:12of theta given S.
-
10:12 - 10:19And then when you need to make a prediction,
-
10:21 - 10:28you know, you would just predict, say, well,
-
10:36 - 10:38using your usual hypothesis
-
10:38 - 10:41and using this MAP value of theta
-
10:41 - 10:42in place of
-
10:42 - 10:47- as the parameter vector you'd choose. Okay? And
-
10:47 - 10:51notice, the only difference between this and standard maximum likelihood estimation
-
10:51 - 10:53is that when you're choosing,
-
10:53 - 10:57you know, the - instead of choosing the maximum likelihood value for
-
10:57 - 11:01theta, you're instead maximizing this likelihood term, which is what you have for maximum likelihood estimation,
-
11:01 - 11:05and then times this other quantity which is the prior.
-
11:05 - 11:07Right?
-
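As a concrete sketch of the MAP idea (an illustration, not something from the lecture board): for linear regression with a Gaussian prior N(0, tau squared I) and Gaussian noise, maximizing the posterior works out to ridge regression, with lambda playing the role of sigma squared over tau squared, and theta MAP has a closed form. The toy data below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 10 examples, 5 features, only the first feature matters.
X = rng.normal(size=(10, 5))
theta_true = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ theta_true + 0.5 * rng.normal(size=10)

# Maximum likelihood (least squares): argmax over theta of p(S | theta).
theta_ml = np.linalg.lstsq(X, y, rcond=None)[0]

# MAP with Gaussian prior theta ~ N(0, tau^2 I): for Gaussian noise this
# is ridge regression with lambda = sigma^2 / tau^2, and shrinking tau
# (growing lambda) pulls all the parameters toward zero.
lam = 10.0  # plays the role of sigma^2 / tau^2 (made up for illustration)
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

print("ML estimate: ", np.round(theta_ml, 2))
print("MAP estimate:", np.round(theta_map, 2))  # smaller, closer to zero
```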
11:07 - 11:10And
-
11:10 - 11:11let's see,
-
11:11 - 11:17One intuition is that if your prior
-
11:17 - 11:21is
-
11:21 - 11:24theta being Gaussian and with mean
-
11:24 - 11:26zero and some covariance,
-
11:26 - 11:31then for a distribution like this, most of the probability mass is close to zero. Right?
-
11:31 - 11:34So there's a Gaussian centered around the point zero, and so most of the probability mass is close to zero.
-
11:38 - 11:41And so the prior distribution is, in essence, saying that
-
11:41 - 11:45you think most of the parameters should be close to
-
11:45 - 11:49zero, and if you remember our discussion on feature selection, if you eliminate a
-
11:49 - 11:51feature from consideration
-
11:51 - 11:53that's the same as
-
11:53 - 11:57setting the corresponding value of theta to be equal to zero. All right? So if you
-
11:57 - 11:58set
-
11:58 - 12:02theta five to be equal to zero, that's the same as, you know, eliminating feature
-
12:02 - 12:05five from your hypothesis.
-
12:05 - 12:10And so, this is the prior that drives most of the parameter values to zero
-
12:10 - 12:13- to values close to zero. And you'll think of this as
-
12:13 - 12:17doing something analogous to, or reminiscent of, feature selection. Okay? And
-
12:17 - 12:22it turns out that with this formulation, the parameters won't actually be
-
12:22 - 12:25exactly zero but many of the values will be close to zero.
-
12:25 - 12:27And
-
12:27 - 12:32I guess in pictures,
-
12:35 - 12:36if you remember,
-
12:36 - 12:40I said that if you have, say, five data points and you fit
-
12:40 - 12:47a fourth-order polynomial -
-
12:48 - 12:51well I think that had too many bumps in it, but never mind.
-
12:51 - 12:54If you fit it a - if you fit very high polynomial to a
-
12:54 - 12:58very small dataset, you can get these very large oscillations
-
12:58 - 13:01if you use maximum likelihood estimation. All right?
-
13:01 - 13:06In contrast, if you apply this sort of Bayesian regularization,
-
13:06 - 13:08you can actually fit a higher-order polynomial
-
13:08 - 13:11and still get
-
13:11 - 13:13sort of a smoother and smoother fit to the data
-
13:13 - 13:18as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer
-
13:18 - 13:20to zero. And that
-
13:20 - 13:23in practice - it's sort of hard to see, but you can take my word for it.
-
13:23 - 13:25As tau becomes smaller and smaller,
-
13:25 - 13:29the curves you tend to fit your data also become smoother and smoother, and so you
-
13:29 - 13:30tend
-
13:30 - 13:36to overfit less and less, even when you're fitting a large number of
-
13:36 - 13:38parameters.
-
13:38 - 13:41Okay? Let's see,
-
13:41 - 13:45and
-
13:45 - 13:48one last piece of intuition that I would just toss out there. And
-
13:48 - 13:50you get to play more with
-
13:50 - 13:53this particular set of ideas more in Problem Set
-
13:53 - 13:553, which I'll post online later
-
13:55 - 13:57this week I guess.
-
13:57 - 14:02Is that whereas maximum likelihood tries to minimize,
-
14:02 - 14:09say, this,
-
14:12 - 14:16right? Whereas maximum likelihood for, say, linear regression turns out to be minimizing
-
14:16 - 14:17this,
-
14:17 - 14:19it turns out that if you
-
14:19 - 14:21add this prior term there,
-
14:21 - 14:26it turns out that the optimization objective you end up optimizing
-
14:26 - 14:31turns out to be that: the same sum plus an extra term that, you know,
-
14:31 - 14:34penalizes your parameter theta for being large.
-
14:35 - 14:39And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to
-
14:39 - 14:41keep your parameters small.
-
14:41 - 14:43And this has the effect.
-
14:43 - 14:47Again, it's kind of hard to see but just take my word for it. That shrinking the parameters has the effect of
-
14:47 - 14:48keeping the functions you fit
-
14:48 - 14:55to be smoother and less likely to overfit. Okay? Okay, hopefully this will
-
14:58 - 15:02make more sense when you play with these ideas a bit more in the next problem set. But let's check
-
15:02 - 15:09questions about all this.
-
15:13 - 15:16The smoothing behavior is it because [inaudible] actually get different [inaudible]?
-
15:19 - 15:23Let's see. Yeah. It depends on - well
-
15:23 - 15:28most priors with most of the mass close to zero will get this effect, I guess. And just by
-
15:28 - 15:29convention, the Gaussian
-
15:29 - 15:32prior is what's most -
-
15:32 - 15:34used most commonly for models like logistic
-
15:34 - 15:38regression and linear regression, generalized linear models. There are a few
-
15:38 - 15:41other priors that are sometimes used, like the Laplace prior,
-
15:41 - 15:48but all of them will tend to have these sorts of smoothing effects. All right. Cool.
-
15:50 - 15:53And so it turns out that for
-
15:53 - 15:57problems like text classification - text classification has like 30,000 features or 50,000
-
15:57 - 16:00features,
-
16:00 - 16:03where it seems like an algorithm like logistic regression would be very much prone to
-
16:03 - 16:04overfitting. Right?
-
16:04 - 16:06So imagine trying to build a spam classifier,
-
16:06 - 16:10maybe you have 100 training examples but you have 30,000
-
16:10 - 16:13features or 50,000 features,
-
16:13 - 16:17that seems clearly to be prone to overfitting. Right? But it turns out that with
-
16:17 - 16:18this sort of Bayesian
-
16:18 - 16:19regularization,
-
16:19 - 16:22with the Gaussian prior,
-
16:22 - 16:27logistic regression becomes a very effective text classification algorithm
-
16:27 - 16:33with this sort of Bayesian regularization. Alex? [Inaudible]? Yeah,
-
16:33 - 16:37right, and so pick - and to pick either tau squared or lambda.
-
16:37 - 16:40I think the relation is lambda equals one over tau squared. But right, so pick either
-
16:40 - 16:42tau squared or lambda, you could
-
16:42 - 16:49use cross-validation, yeah. All right? Okay, cool.
-
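Here is a minimal sketch of picking lambda (equivalently tau) by hold-out cross-validation, as just suggested; the candidate grid, the ridge model, and the synthetic data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=60)

# Simple hold-out split: train on 40 examples, validate on 20.
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    # MAP / ridge closed form; lam corresponds to 1 / tau^2 (up to the noise scale).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Pick the lambda with the lowest validation error from a made-up grid.
best = min((np.mean((Xva @ ridge_fit(Xtr, ytr, lam) - yva) ** 2), lam)
           for lam in [0.01, 0.1, 1.0, 10.0, 100.0])
print("validation MSE %.4f at lambda = %g" % best)
```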
16:49 - 16:52So, all right, that
-
16:52 - 16:55was all I want to say about methods for preventing
-
16:55 - 16:56overfitting. What I
-
16:56 - 17:01want to do next is just spend, you know, five minutes talking about
-
17:01 - 17:02online learning.
-
17:02 - 17:05And this is sort of a digression.
-
17:05 - 17:06And so, you know, when
-
17:06 - 17:10you're designing the syllabus of a class, I guess, sometimes
-
17:10 - 17:13there are just some ideas you want to talk about but can't find a very good place to
-
17:13 - 17:16fit in anywhere. So this is one of those ideas that may
-
17:16 - 17:21seem a bit disjointed from the rest of the class but I just
-
17:21 - 17:26want to
-
17:26 - 17:30tell
-
17:30 - 17:34you a little bit about it. Okay. So here's the idea.
-
17:34 - 17:38So far, all the learning algorithms we've talked about are what's called batch learning
-
17:38 - 17:39algorithms,
-
17:39 - 17:42where you're given a training set and then you get to run your learning algorithm on the
-
17:42 - 17:47training set and then maybe you test it on some other test set.
-
17:47 - 17:50And there's another learning setting called online learning,
-
17:50 - 17:54in which you have to make predictions even while you are in the process of learning.
-
17:54 - 17:58So here's how the problem goes.
-
17:58 - 17:59All right? I'm first gonna give you X one.
-
17:59 - 18:01Let's say there's a classification problem,
-
18:01 - 18:03so I'm first gonna give you
-
18:03 - 18:07X one and then gonna ask you, you know, "Can you make a prediction on X one? Is the label one or zero?"
-
18:07 - 18:10And you've not seen any data yet.
-
18:10 - 18:12And so, you make a guess. Right? You
-
18:12 - 18:13guess -
-
18:13 - 18:15we'll call your guess Y hat one.
-
18:15 - 18:18And after you've made your prediction, I
-
18:18 - 18:22will then reveal to you the true label Y one. Okay? And
-
18:22 - 18:26not having seen any data before, your odds of getting the first one right are only
-
18:26 - 18:2950 percent, right, if you guess randomly.
-
18:29 - 18:30
-
18:30 - 18:35And then I show you X two. And then I ask you, "Can you make a prediction on X two?" And
-
18:35 - 18:36so you now maybe
-
18:36 - 18:40are gonna make a slightly more educated guess and call that Y hat two.
-
18:40 - 18:45And after you've made your guess, I reveal the true label to you.
-
18:45 - 18:50And so, then I show you X three, and then you make
-
18:50 - 18:53your guess, and learning proceeds as follows.
-
18:53 - 18:55So this is just unlike
-
18:55 - 18:59machine learning in the batch setting; this models settings where
-
18:59 - 19:04you have to keep learning even as you're making predictions,
-
19:04 - 19:08okay? So, I don't know, say you're running a website and you have users coming in. And as the first user
-
19:08 - 19:11comes in, you need to start making predictions already about what
-
19:11 - 19:13the user likes or dislikes. And only, you know,
-
19:13 - 19:18as you're making predictions do you get to see more and more training examples.
-
19:18 - 19:25So in online learning what you care about is the total online error,
-
19:27 - 19:28
-
19:28 - 19:33which is the sum from i equals one to m - if you get a sequence of m examples
-
19:33 - 19:34all together,
-
19:34 - 19:38of indicator Y hat i
-
19:38 - 19:41not equal to Y i. Okay?
-
19:41 - 19:44So the total online error is
-
19:44 - 19:46the total number of mistakes you make
-
19:46 - 19:49on a sequence of examples like this.
-
19:49 - 19:52And
-
19:52 - 19:54it turns out that, you know,
-
19:54 - 19:58many of the learning algorithms you have - well, many of the learning algorithms you've
-
19:58 - 20:01learned about can be applied to this setting. One thing
-
20:01 - 20:03you could do is
-
20:03 - 20:04when you're asked
-
20:04 - 20:06to make the prediction Y hat three,
-
20:06 - 20:10right, one simple thing to do is well you've seen some other training examples
-
20:10 - 20:14up to this point so you can just take your learning algorithm and run it
-
20:14 - 20:15on the examples,
-
20:15 - 20:19you know, leading up to Y hat three. So just run the learning algorithm on all
-
20:19 - 20:21the examples you've seen previous
-
20:21 - 20:25to being asked to make a prediction on a certain example, and then use your
-
20:25 - 20:27learning
-
20:27 - 20:31algorithm to make a prediction on the next example. And it turns out that there are also algorithms, especially the algorithms that we
-
20:31 - 20:34saw that use stochastic gradient descent, that,
-
20:34 - 20:41you know, can be adapted very nicely to this. So as a concrete example, if you
-
20:42 - 20:46remember the perceptron algorithms, say, right,
-
20:46 - 20:47you would
-
20:47 - 20:52say initialize the parameters theta to be equal to zero.
-
20:52 - 20:56And then after seeing the Ith training
-
20:56 - 20:59example, you'd
-
20:59 - 21:06update the parameters, you
-
21:14 - 21:15know,
-
21:15 - 21:20using - you've seen this rule a lot of times now, right, using the standard
-
21:20 - 21:22perceptron learning rule. And
-
21:22 - 21:26the same thing, if you were using logistic regression you can then,
-
21:26 - 21:30again, after seeing each training example, just run, you know, essentially run
-
21:30 - 21:34one-step stochastic gradient descent
-
21:34 - 21:38on just the example you saw. Okay?
-
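Here is a minimal sketch of this online setting with the perceptron rule, counting the total online error as we go; the synthetic separable stream, the {0, 1} label convention, and the learning rate of one are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linearly separable stream: label is 1 when w_true . x > 0.
w_true = rng.normal(size=3)
X = rng.normal(size=(200, 3))
Y = (X @ w_true > 0).astype(int)

theta = np.zeros(3)          # initialize the parameters to zero
total_online_error = 0

for x, y in zip(X, Y):
    y_hat = int(theta @ x >= 0)         # predict first...
    total_online_error += (y_hat != y)  # ...then the true label is revealed
    theta += (y - y_hat) * x            # perceptron update on this one example

print("mistakes on 200 examples:", total_online_error)
```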
21:38 - 21:40And so
-
21:40 - 21:41the reason I've
-
21:41 - 21:45put this into the sort of "learning theory" section of this class was because
-
21:45 - 21:49it turns out that sometimes you can prove fairly amazing results
-
21:49 - 21:52on your total online error
-
21:52 - 21:54using algorithms like these.
-
21:54 - 21:58I will actually - I don't actually want to spend the time in the main
-
21:58 - 21:59lecture to prove
-
21:59 - 22:02this, but, for example, you can prove that
-
22:02 - 22:05when you use the perceptron algorithm,
-
22:05 - 22:08then even when
-
22:08 - 22:12the features X i
-
22:12 - 22:15may be infinite-dimensional feature vectors, like we saw for
-
22:15 - 22:19support vector machines. And sometimes, infinite-dimensional feature vectors
-
22:19 - 22:21may use kernel representations. Okay? But so it
-
22:21 - 22:22turns out that you can prove that
-
22:22 - 22:25when you use the perceptron algorithm,
-
22:25 - 22:28even when the data is maybe
-
22:28 - 22:29extremely high dimensional and it
-
22:29 - 22:33seems like you'd be prone to overfitting, right, you can prove that so
-
22:33 - 22:37long as the positive and negative examples
-
22:37 - 22:39
-
22:39 - 22:42are separated by a margin,
-
22:42 - 22:43right. So in this
-
22:43 - 22:48infinite dimensional space,
-
22:48 - 22:50so long as,
-
22:50 - 22:53you know, there is some margin
-
22:53 - 22:57down there separating the positive and negative examples,
-
22:57 - 22:59you can prove that
-
22:59 - 23:02perceptron algorithm will converge
-
23:02 - 23:07to a hypothesis that perfectly separates the positive and negative examples.
-
23:07 - 23:11Okay? And then so after seeing only a finite number of examples, it'll
-
23:11 - 23:14converge to a decision boundary that perfectly separates the
-
23:14 - 23:19positive and negative examples, even though you may be in an infinite-dimensional space. Okay?
-
23:19 - 23:21So
-
23:21 - 23:22let's see.
-
23:22 - 23:26The proof itself would take me sort of almost an entire lecture to do,
-
23:26 - 23:27and there are sort of
-
23:27 - 23:29other things that I want to do more than that.
-
23:29 - 23:33So if you want to see the proof of this yourself, it's actually written up in the lecture
-
23:33 - 23:35notes that I posted online.
-
23:35 - 23:38For the purposes of this class' syllabus, the proof of this
-
23:38 - 23:41result, you can treat this as optional reading. And by that, I mean,
-
23:41 - 23:45you know, it won't appear on the midterm and you won't be asked about this
-
23:45 - 23:47specifically in the problem sets,
-
23:47 - 23:49but I thought it'd be -
-
23:49 - 23:52I know some of you are curious after the previous lecture about why
-
23:52 - 23:54you can prove
-
23:54 - 23:55that, you know, SVMs can have
-
23:55 - 23:59bounded VC dimension, even in these infinite dimensional spaces, and how do
-
23:59 - 24:01you prove things in these -
-
24:01 - 24:04how do you prove learning theory results in these infinite dimensional feature spaces. And
-
24:04 - 24:05so
-
24:05 - 24:08the perceptron bound that I just talked about was the simplest instance I
-
24:08 - 24:11know of that you can sort of read in like half an hour and understand
-
24:11 - 24:12it.
-
24:12 - 24:15So if you're
-
24:15 - 24:16interested, there are lecture notes online for
-
24:16 - 24:21how this perceptron bound is actually proved. It's a very
-
24:22 - 24:26[inaudible], you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay? But
-
24:26 - 24:28regardless of the theoretical results,
-
24:28 - 24:31you know, the online learning setting is something
-
24:31 - 24:33that you - that comes reasonably often. And so,
-
24:33 - 24:39these algorithms based on stochastic gradient descent often work very well. Okay, any
-
24:39 - 24:46questions about this before I move on?
-
24:50 - 24:55All right. Cool. So the last thing I want to do today - and this is the majority of today's lecture - actually, can I switch to
-
24:55 - 24:56PowerPoint slides, please,
-
24:56 - 25:00is I actually want to spend most of today's lecture sort of talking about
-
25:00 - 25:05advice for applying different machine learning
-
25:05 - 25:11algorithms.
-
25:11 - 25:12And so,
-
25:12 - 25:14you know, right now, already you have a,
-
25:14 - 25:16I think, a good understanding of
-
25:16 - 25:19really the most powerful tools
-
25:19 - 25:21known to humankind in machine learning.
-
25:21 - 25:22Right? And
-
25:22 - 25:27what I want to do today is give you some advice on how to apply them really
-
25:27 - 25:28powerfully because,
-
25:28 - 25:30you know, the same tool - it turns out that you can
-
25:30 - 25:33take the same machine learning tool, say logistic regression,
-
25:33 - 25:37and you can ask two different people to apply it to the same problem.
-
25:37 - 25:38And
-
25:38 - 25:41sometimes one person will do an amazing job and it'll work amazingly well, and the second
-
25:41 - 25:43person will sort of
-
25:43 - 25:47not really get it to work, even though it was exactly the same algorithm. Right?
-
25:47 - 25:51And so what I want to do today, in the rest of the time I have today, is try
-
25:51 - 25:52to
-
25:52 - 25:55convey to you, you know, some of the methods for how to
-
25:55 - 25:57make sure you're one of the people who
-
25:57 - 26:04really know how to get these learning algorithms to work well on problems. So
-
26:05 - 26:07just some caveats on what I'm gonna, I
-
26:07 - 26:09guess, talk about in the
-
26:09 - 26:11rest of today's lecture.
-
26:12 - 26:16Something I want to talk about is actually not very mathematical but is also
-
26:16 - 26:18some of the hardest,
-
26:18 - 26:18most
-
26:18 - 26:22conceptually difficult material in this class to understand. All right? So this
-
26:22 - 26:23is
-
26:23 - 26:25not mathematical but this is not easy.
-
26:25 - 26:30And I want to say this caveat: some of what I'll say today is debatable.
-
26:30 - 26:33I think most good machine learning people will agree with most of what I say but maybe not
-
26:33 - 26:36everything I say.
-
26:36 - 26:39And some of what I'll say is also not good advice for doing machine learning research either, so I'll say more about
-
26:39 - 26:42this later. What I'm
-
26:42 - 26:44focusing on today is advice for
-
26:44 - 26:47how to just get stuff to work. If you work in the company and you want to deliver a
-
26:47 - 26:49product or you're, you know,
-
26:49 - 26:53building a system and you just want your machine learning system to work. Okay? Some of what I'm about
-
26:53 - 26:54to say today
-
26:54 - 26:58isn't great advice if your goal is to invent a new machine learning algorithm, but
-
26:58 - 26:59this is advice for how to
-
26:59 - 27:02make machine learning algorithm work and, you know, and
-
27:02 - 27:05deploy a working system. So three
-
27:05 - 27:09key areas I'm gonna talk about. One: diagnostics for
-
27:09 - 27:12debugging learning algorithms. Second: sort of
-
27:12 - 27:17talk briefly about error analysis and ablative analysis.
-
27:17 - 27:21And third, I want to talk about just advice for how to get started on a machine-learning
-
27:23 - 27:24problem.
-
27:24 - 27:27And one theme that'll come up later is it
-
27:27 - 27:31turns out you've heard about premature optimization, right,
-
27:31 - 27:34in writing software. This is when
-
27:34 - 27:38someone over-designs from the start, when someone, you know, is writing a piece of code and
-
27:38 - 27:41they choose a subroutine to optimize
-
27:41 - 27:45heavily. And maybe you write the subroutine in assembly or something. And that's often
-
27:45 - 27:49- and many of us have been guilty of premature optimization, where
-
27:49 - 27:52we're trying to get a piece of code to run faster. And
-
27:52 - 27:55we choose probably a piece of code and we implement it in assembly, and
-
27:55 - 27:57really tune it and get it to run really quickly. And
-
27:57 - 28:01it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature
-
28:01 - 28:02optimization.
-
28:02 - 28:06And in undergraduate programming classes, we warn people all the time not to do
-
28:06 - 28:10premature optimization and people still do it all the time. Right?
-
28:10 - 28:12And
-
28:12 - 28:16turns out, a very similar thing happens in building machine-learning systems. That
-
28:16 - 28:21many people are often guilty of, what I call, premature statistical optimization, where they
-
28:21 - 28:22
-
28:22 - 28:26heavily optimize part of a machine learning system and that turns out
-
28:26 - 28:28not to be the important piece. Okay?
-
28:28 - 28:30So I'll talk about that later, as well.
-
28:30 - 28:36So let's first talk about debugging learning algorithms.
-
28:36 - 28:38
-
28:38 - 28:40
-
28:40 - 28:43As a motivating
-
28:43 - 28:47example, let's say you want to build an anti-spam system. And
-
28:47 - 28:52let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right?
-
28:52 - 28:54So instead of using 50,000 words, you've chosen a small set of 100
-
28:54 - 28:55features
-
28:55 - 28:59to use for your anti-spam system.
-
28:59 - 29:02And let's say you implement Bayesian logistic regression, implement gradient
-
29:02 - 29:03descent,
-
29:03 - 29:07and you get 20 percent test error, which is unacceptably high. Right?
-
29:07 - 29:10So this
-
29:10 - 29:14is Bayesian logistic regression, and so it's just like maximum likelihood but, you know, with that additional lambda theta
-
29:14 - 29:15squared term.
-
29:15 - 29:20And we're maximizing rather than minimizing as well, so there's a minus lambda
-
29:20 - 29:24theta squared instead of plus lambda theta squared.
-
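As a sketch of the objective being described (with made-up data and my own variable names, purely for illustration), maximizing J of theta by gradient ascent looks something like this:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))   # e.g., 100 emails, 10 word features (toy)
y = (X[:, 0] > 0).astype(float)  # toy labels

lam = 1.0
theta = np.zeros(10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta):
    # Regularized log-likelihood: sum_i log p(y_i | x_i; theta) - lam * ||theta||^2.
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) - lam * theta @ theta

# Gradient *ascent* on J (we maximize, hence the minus sign on the lambda term).
alpha = 0.01
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ theta)) - 2 * lam * theta
    theta += alpha * grad

print("J(theta) after training: %.2f" % J(theta))
```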
29:24 - 29:28So the question is: you implemented your Bayesian logistic regression algorithm,
-
29:28 - 29:34and you tested it on your test set and you got unacceptably high error, so what do you do
-
29:34 - 29:35next? Right?
-
29:35 - 29:37So,
-
29:37 - 29:41you know, one thing you could do is think about the ways you could improve this algorithm.
-
29:41 - 29:45And this is probably what most people will do, instead of, "Well, let's sit down and think about what
-
29:45 - 29:48could've gone wrong, and then we'll try to improve the algorithm."
-
29:48 - 29:51Well obviously having more training data could only help, so one thing you can do is try to
-
29:51 - 29:55get more training examples.
-
29:55 - 29:59Maybe you suspect that even 100 features was too many, so you might try to
-
29:59 - 30:01get a smaller set of
-
30:01 - 30:05features. What's more common is you might suspect your features aren't good enough, so you might
-
30:05 - 30:07spend some time, look at the email headers, see if
-
30:07 - 30:09you can figure out better features for, you know,
-
30:09 - 30:13finding spam emails or whatever.
-
30:13 - 30:14Right. And
-
30:14 - 30:18right, so you
-
30:18 - 30:21just sit around and come up with better features, such as from email headers.
-
30:21 - 30:25You may also suspect that gradient descent hasn't quite converged yet, and so let's try
-
30:25 - 30:28running gradient descent a bit longer to see if that works. And clearly, that can't hurt, right,
-
30:28 - 30:29just run
-
30:29 - 30:31gradient descent longer.
-
30:31 - 30:36Or maybe you remember, you know, you remember hearing from class that maybe
-
30:36 - 30:39Newton's method converges better, so let's
-
30:39 - 30:40try
-
30:40 - 30:42that instead. You may want to tune the value for lambda, because not
-
30:42 - 30:44sure if that was the right thing,
-
30:44 - 30:47or maybe you even want to try an SVM because maybe you think an SVM might work better than logistic regression. So I only
-
30:47 - 30:50listed eight things
-
30:50 - 30:55here, but you can imagine if you were actually sitting down, building machine-learning
-
30:55 - 30:56system,
-
30:56 - 30:58the options to you are endless. You can think of, you
-
30:58 - 31:01know, hundreds of ways to improve a learning system.
-
31:01 - 31:02And some of these things like,
-
31:02 - 31:06well getting more training examples, surely that's gonna help, so that seems like it's a good
-
31:06 - 31:09use of your time. Right?
-
31:09 - 31:11And it turns out that
-
31:11 - 31:15this approach of picking ways to improve the learning algorithm and picking one and going
-
31:15 - 31:16for it,
-
31:16 - 31:21it might work in the sense that it may eventually get you to a working system, but
-
31:21 - 31:25often it's very time-consuming. And I think it's often largely a matter of
-
31:25 - 31:28luck, whether you end up fixing what the problem is.
-
31:28 - 31:29In particular, these
-
31:29 - 31:33eight improvements all fix very different problems.
-
31:33 - 31:37And some of them will be fixing problems that you don't have. And
-
31:37 - 31:40if you can rule out six of
-
31:40 - 31:44eight of these, say, you could - if by somehow looking at the problem more deeply,
-
31:44 - 31:47you can figure out which one of these eight things is actually the right thing
-
31:47 - 31:48to do,
-
31:48 - 31:49you can save yourself
-
31:49 - 31:51a lot of time.
-
31:51 - 31:56So let's see how we can go about doing that.
-
31:56 - 32:02The people in industry and in research that I see that are really good, would not
-
32:02 - 32:05go and try to change a learning algorithm randomly. There are lots of things that
-
32:05 - 32:08obviously improve your learning algorithm,
-
32:08 - 32:12but the problem is there are so many of them it's hard to know what to do.
-
32:12 - 32:17So you find that all the really good ones run various diagnostics to figure out
-
32:17 - 32:18what the problem is,
-
32:18 - 32:22and where they think the problem is. Okay?
-
32:22 - 32:24So
-
32:24 - 32:27for our motivating story, right, we said - let's say Bayesian logistic regression test
-
32:27 - 32:29error was 20 percent,
-
32:29 - 32:32which let's say is unacceptably high.
-
32:32 - 32:35And let's suppose you suspected the problem is
-
32:35 - 32:36either overfitting,
-
32:36 - 32:38so it's high bias,
-
32:38 - 32:42or you suspect that, you know, maybe you have too few features to classify spam, so there's - Oh excuse
-
32:42 - 32:45me; I think I
-
32:45 - 32:47wrote that wrong.
-
32:47 - 32:48Let's firstly - so let's
-
32:48 - 32:49forget - forget the labels.
-
32:49 - 32:53Suppose you suspect the problem is either high bias or high variance, and some of the text
-
32:53 - 32:55here
-
32:55 - 32:55doesn't make sense. And
-
32:55 - 32:56you want to know
-
32:56 - 33:01if you're overfitting, which would be high variance, or you have too few
-
33:01 - 33:06features to classify spam, which would be high bias. I had those two reversed, sorry. Okay? So
-
33:06 - 33:09how can you figure out whether the problem
-
33:09 - 33:11is one of high bias
-
33:11 - 33:16or high variance? Right? So it turns
-
33:16 - 33:19out there's a simple diagnostic you can look at that will tell you
-
33:19 - 33:24whether the problem is high bias or high variance. If you
-
33:24 - 33:28remember the cartoon we'd seen previously for high variance problems, when you have high
-
33:28 - 33:30variance
-
33:30 - 33:33the training error will be much lower than the test error. All right? When you
-
33:33 - 33:36have a high variance problem, that's when you're fitting
-
33:36 - 33:39your training set very well. That's when you're fitting, you know, a tenth order polynomial to
-
33:39 - 33:4211 data points. All right? And
-
33:42 - 33:45that's when you're just fitting the data set very well, and so your training error will be
-
33:45 - 33:46much lower than
-
33:46 - 33:48your test
-
33:48 - 33:50error. And in contrast, if you have high bias,
-
33:50 - 33:53that's when your training error will also be high. Right?
-
33:53 - 33:56That's when your data is quadratic, say, but you're fitting a linear function to it
-
33:56 - 34:02and so you aren't even fitting your training set well. So
-
34:02 - 34:04just in cartoons, I guess,
-
34:04 - 34:08this is a - this is what a typical learning curve for high variance looks
-
34:08 - 34:09like.
-
34:09 - 34:14On your horizontal axis, I'm plotting the training set size M, right,
-
34:14 - 34:16and on vertical axis, I'm plotting the error.
-
34:16 - 34:19And so, let's see,
-
34:19 - 34:21you know, as you increase -
-
34:21 - 34:25if you have a high variance problem, you'll notice as the training set size, M,
-
34:25 - 34:29increases, your test set error will keep on decreasing.
-
34:29 - 34:33And so this sort of suggests that, well, if you can increase the training set size even
-
34:33 - 34:36further, maybe if you extrapolate the green curve out, maybe
-
34:36 - 34:40that test set error will decrease even further. All right?
-
34:40 - 34:43Another thing that's useful to plot here is - let's say
-
34:43 - 34:47the red horizontal line is the desired performance
-
34:47 - 34:50you're trying to reach, another useful thing to plot is actually the training error. Right?
-
34:50 - 34:52And it turns out that
-
34:52 - 34:59your training error will actually grow as a function of the training set size
-
34:59 - 35:02because the larger your training set,
-
35:02 - 35:04the harder it is to fit,
-
35:04 - 35:06you know, your training set perfectly. Right?
-
35:06 - 35:09So this is just a cartoon, don't take it too seriously, but in general, your training error
-
35:09 - 35:11will actually grow
-
35:11 - 35:15as a function of your training set size. Because with small training sets, if you have one data point,
-
35:15 - 35:18it's really easy to fit that perfectly, but if you have
-
35:18 - 35:2210,000 data points, it's much harder to fit that perfectly.
-
35:22 - 35:23All right?
-
35:23 - 35:28And so another diagnostic for high variance, and the one that I tend to use more,
-
35:28 - 35:32is to just look at training versus test error. And if there's a large gap between
-
35:32 - 35:33them,
-
35:33 - 35:34then this suggests that, you know,
-
35:34 - 35:40getting more training data may allow you to help close that gap. Okay?
-
35:40 - 35:41So this is
-
35:41 - 35:42what the
-
35:42 - 35:45cartoon would look like when - in the
-
35:45 - 35:49case of high variance.
-
35:49 - 35:53This is what the cartoon looks like for high bias. Right? If you
-
35:53 - 35:55look at the learning curve, you
-
35:55 - 35:57see that the curve for test error
-
35:57 - 36:01has flattened out already. And so this is a sign that,
-
36:01 - 36:05you know, if you get more training examples, if you extrapolate this curve
-
36:05 - 36:07further to the right,
-
36:07 - 36:10it's maybe not likely to go down much further.
-
36:10 - 36:12And this is a property of high bias: that getting more training data won't
-
36:12 - 36:16necessarily help.
-
36:16 - 36:19But again, to me the more useful diagnostic is
-
36:19 - 36:20if you plot
-
36:20 - 36:24training error as well - if you look at your training error as well as your, you know,
-
36:24 - 36:26hold out test set error.
-
36:26 - 36:29If you find that even your training error
-
36:29 - 36:32is high,
-
36:32 - 36:35then that's a sign that getting more training data is not
-
36:35 - 36:38going to help. Right?
-
36:38 - 36:42In fact, you know, think about it,
-
36:42 - 36:45training error
-
36:45 - 36:48grows as a function of your training set size.
-
36:48 - 36:50And so if your
-
36:50 - 36:56training error is already above your level of desired performance,
-
36:56 - 36:57then
-
36:57 - 37:01getting even more training data is not going to reduce your training
-
37:01 - 37:03error down to the desired level of performance. Right?
-
37:03 - 37:06Because, you know, your training error sort of only gets worse as you get more and more training
-
37:06 - 37:08examples.
-
37:08 - 37:11So if you extrapolate further to the right, it's not like this blue line will come
-
37:11 - 37:13back down to the level of desired performance. Right? This will stay up
-
37:13 - 37:17there. Okay? So for
-
37:17 - 37:21me personally, I actually, when looking at a curve like the green
-
37:21 - 37:25curve on test error, I actually personally tend to find it very difficult to tell
-
37:25 - 37:29if the curve is still going down or if it's flattened out. Sometimes you can tell, but very
-
37:29 - 37:31often, it's somewhat
-
37:31 - 37:33ambiguous. So for me personally,
-
37:33 - 37:37the diagnostic I tend to use the most often to tell if I have a bias problem or a variance
-
37:37 - 37:38problem
-
37:38 - 37:41is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay?
-
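Here is a minimal sketch of that diagnostic: fit a deliberately high-variance model (a tenth-order polynomial, chosen just for illustration) at increasing training set sizes and compare training error to hold-out error. A large gap suggests high variance; a training error that is itself above your target suggests high bias:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + 0.2 * rng.normal(size=300)
x_test, y_test = x[200:], y[200:]   # hold out the last 100 points

def poly_mse(xtr, ytr, xte, yte, degree=10):
    # Fit a degree-10 polynomial: a high-variance model when m is small.
    c = np.polyfit(xtr, ytr, degree)
    return (np.mean((np.polyval(c, xtr) - ytr) ** 2),
            np.mean((np.polyval(c, xte) - yte) ** 2))

for m in [15, 30, 60, 120, 200]:
    tr, te = poly_mse(x[:m], y[:m], x_test, y_test)
    # Large gap between tr and te -> high variance; high tr itself -> high bias.
    print("m=%3d  train error %.4f  test error %.4f" % (m, tr, te))
```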
37:41 - 37:45And so, going
-
37:45 - 37:47back to
-
37:47 - 37:52the list of fixes, look
-
37:52 - 37:54at the first fix,
-
37:54 - 37:56getting more training examples
-
37:56 - 37:59is a way to fix high variance.
-
37:59 - 38:03Right? If you have a high variance problem, getting more training examples will help.
-
38:03 - 38:06Trying a smaller set of features:
-
38:06 - 38:12that also fixes high variance. All right?
-
38:12 - 38:16Trying a larger set of features or adding email features, these
-
38:16 - 38:20are solutions that fix high bias. Right?
-
38:20 - 38:27So high bias being if you're hypothesis was too simple, you didn't have enough features. Okay?
-
38:27 - 38:29And so
-
38:29 - 38:34quite often you see people working on machine learning problems
-
38:34 - 38:35and
-
38:35 - 38:38they'll remember that getting more training examples helps. And
-
38:38 - 38:41so, they'll build a learning system, build an anti-spam system and it doesn't work.
-
38:41 - 38:42And then they
-
38:42 - 38:46go off and spend lots of time and money and effort collecting more training data
-
38:46 - 38:51because they'll say, "Oh well, getting more data's obviously got to help."
-
38:51 - 38:53But if they had a high bias problem in the first place, and not a high variance
-
38:53 - 38:55problem,
-
38:55 - 38:57it's entirely possible to spend
-
38:57 - 39:00three months or six months collecting more and more training data,
-
39:00 - 39:05not realizing that it couldn't possibly help. Right?
-
39:05 - 39:08And so, this actually happens a lot in, you
-
39:08 - 39:12know, in Silicon Valley companies, this happens a lot. There will often be
-
39:12 - 39:15people building various machine learning systems, and
-
39:15 - 39:20they'll often - you often see people spending six months working on fixing a
-
39:20 - 39:21learning algorithm
-
39:21 - 39:24and you could've told them six months ago that, you know,
-
39:24 - 39:27that couldn't possibly have helped. But because they didn't know what the
-
39:27 - 39:29problem was,
-
39:29 - 39:34they'd easily spend six months trying to invent new features or something. And
-
39:34 - 39:38this is - you see this surprisingly often and this is somewhat depressing. You could've gone to them and
-
39:38 - 39:42told them, "I could've told you six months ago that this was not going to help." And
-
39:42 - 39:46the six months is not a joke, you actually see
-
39:46 - 39:48this.
-
39:48 - 39:50And in contrast, if you
-
39:50 - 39:53actually figure out the problem's one of high bias or high variance, then
-
39:53 - 39:54you can rule out
-
39:54 - 39:56two of these solutions and
-
39:56 - 40:01save yourself many months of fruitless effort. Okay? I actually
-
40:01 - 40:04want to talk about these four at the bottom as well. But before I move on, let me
-
40:04 - 40:05just check if there were questions about what I've talked
-
40:05 - 40:12about so far. No? Okay, great. So bias
-
40:20 - 40:23versus variance is one thing that comes up
-
40:23 - 40:30often. This bias versus variance is one common diagnostic. And so,
-
40:30 - 40:33for other machine learning problems, it's often up to your own ingenuity to figure out
-
40:33 - 40:36your own diagnostics to figure out what's wrong. All right?
-
40:36 - 40:41So if a machine-learning algorithm isn't working, very often it's up to you to figure out, you
-
40:41 - 40:44know, to construct your own tests. Like do you look at the difference training and
-
40:44 - 40:46test errors or do you look at something else?
-
40:46 - 40:50It's often up to your own ingenuity to construct your own diagnostics to figure out what's
-
40:50 - 40:53going on.
-
40:53 - 40:55What I want to do is go through another example. All right?
-
40:55 - 40:59And this one is slightly more contrived but it'll illustrate another
-
40:59 - 41:03common question that comes up, another one of the most common
-
41:03 - 41:05issues that comes up in applying
-
41:05 - 41:06learning algorithms.
-
41:06 - 41:08So in this example, it's slightly more contrived,
-
41:08 - 41:12let's say you implement Bayesian logistic regression
-
41:12 - 41:18and you get 2 percent error on spam mail and 2 percent error on non-spam mail. Right? So
-
41:18 - 41:19it's rejecting, you know,
-
41:19 - 41:212 percent of -
-
41:21 - 41:25it's rejecting 98 percent of your spam mail, which is fine, so 2 percent of all
-
41:25 - 41:27spam gets
-
41:27 - 41:31through which is fine, but is also rejecting 2 percent of your good email,
-
41:31 - 41:352 percent of the email from your friends and that's unacceptably high, let's
-
41:35 - 41:37say.
-
41:37 - 41:39And let's say that
-
41:39 - 41:42a support vector machine using a linear kernel
-
41:42 - 41:45gets 10 percent error on spam and
-
41:45 - 41:490.01 percent error on non-spam, which is more like the acceptable performance you want. And let's say for the sake of this
-
41:49 - 41:53example, let's say you're trying to build an anti-spam system. Right?
-
41:53 - 41:56Let's say that you really want to deploy
-
41:56 - 41:58logistic regression
-
41:58 - 42:01to your customers because of computational efficiency or because you need
-
42:01 - 42:03to retrain overnight every day,
-
42:03 - 42:07and because logistic regression just runs more easily and more quickly or something. Okay? So let's
-
42:07 - 42:09say you want to deploy logistic
-
42:09 - 42:13regression, but it's just not working out well. So
-
42:13 - 42:18question is: What do you do next? So it
-
42:18 - 42:19turns out that this -
-
42:19 - 42:22the issue that comes up here, the one other common question that
-
42:22 - 42:25comes up is
-
42:25 - 42:30a question of is the algorithm converging. So you might suspect that maybe
-
42:30 - 42:33the problem with logistic regression is that it's just not converging.
-
42:33 - 42:36Maybe you need to run more iterations. And
-
42:36 - 42:38it
-
42:38 - 42:40turns out that, again if you look at the optimization objective, say,
-
42:40 - 42:44logistic regression is, let's say, optimizing J
-
42:44 - 42:47of theta, it actually turns out that if you plot your objective as a function of the number
-
42:47 - 42:52of iterations, when you look
-
42:52 - 42:55at this curve, you know, it sort of looks like it's going up but it sort of
-
42:55 - 42:58looks like it's asymptoting. And
-
42:58 - 43:01when you look at these curves, it's often very hard to tell
-
43:01 - 43:04if the curve has already flattened out. All right? And you look at these
-
43:04 - 43:06curves a lot so you can ask:
-
43:06 - 43:08Well has the algorithm converged? When you look at the J of theta like this, it's
-
43:08 - 43:10often hard to tell.
-
43:10 - 43:14You can run this ten times as long and see if it's flattened out. And you can run this ten
-
43:14 - 43:21times as long and it'll often still look like maybe it's going up very slowly, or something. Right?
-
43:21 - 43:25So you need a better diagnostic for whether logistic regression has converged than
-
43:25 - 43:29looking at this curve.
-
43:29 - 43:32The other question you might wonder - the other thing you might
-
43:32 - 43:37suspect is a problem is are you optimizing the right function.
-
43:37 - 43:39So
-
43:39 - 43:41what you care about,
-
43:41 - 43:43right, in spam, say,
-
43:43 - 43:44is a
-
43:44 - 43:47weighted accuracy function like that. So A of theta is,
-
43:47 - 43:49you know, sum over your
-
43:49 - 43:52examples of some weights times whether you got it right.
-
43:52 - 43:57And so the weight may be higher for non-spam than for spam mail because you care
-
43:57 - 43:58about getting
-
43:58 - 44:01your predictions correct for non-spam email much more than for spam email, say. So let's
-
44:01 - 44:02
-
44:02 - 44:05say A of theta
-
44:05 - 44:11is the optimization objective that you really care about, but what Bayesian logistic regression does is
-
44:11 - 44:15optimize a quantity like that. Right? It's this
-
44:15 - 44:18sort of maximum likelihood thing
-
44:18 - 44:19and then with this
-
44:19 - 44:21two-norm, you know,
-
44:21 - 44:23penalty thing that we saw previously. And you
-
44:23 - 44:26might be wondering: Is this the right optimization function to be optimizing.
-
44:26 - 44:31Okay? And: Or do I maybe need to change the value for lambda
-
44:31 - 44:34to change this parameter? Or:
-
44:34 - 44:40Should I maybe really be switching to support vector machine optimization objective?
-
44:40 - 44:42Okay? Does that make sense? So
-
44:42 - 44:44the second diagnostic I'm gonna talk about
-
44:44 - 44:47is let's say you want to figure out
-
44:47 - 44:51is the algorithm converging, is the optimization algorithm converging, or
-
44:51 - 44:52is the problem with
-
44:52 - 44:58the optimization objective I chose in the first place? Okay?
-
44:58 - 45:03So here's
-
45:03 - 45:07the diagnostic you can use. Let me let - right. So to
-
45:07 - 45:11just reiterate the story, right, let's say an SVM outperforms Bayesian
-
45:11 - 45:14logistic regression but you really want to deploy
-
45:14 - 45:17Bayesian logistic regression to your problem. Let me
-
45:17 - 45:19let theta subscript SVM, be the
-
45:19 - 45:22parameters learned by an SVM,
-
45:22 - 45:25and I'll let theta subscript BLR be the parameters learned by Bayesian
-
45:25 - 45:28logistic regression.
-
45:28 - 45:32So the optimization objective you care about is this, you know, weighted accuracy
-
45:32 - 45:35criteria that I talked about just now.
-
45:35 - 45:38And
-
45:38 - 45:42the support vector machine outperforms Bayesian logistic regression. And so, you know,
-
45:42 - 45:45the weighted accuracy on the support-vector-machine parameters
-
45:45 - 45:47is better than the weighted accuracy
-
45:47 - 45:50for Bayesian logistic regression.
-
45:50 - 45:54So
-
45:54 - 45:57further, Bayesian logistic regression tries to optimize
-
45:57 - 45:59an optimization objective like that, which I
-
45:59 - 46:02denoted J of theta.
-
46:02 - 46:06And so, the diagnostic I choose to use is
-
46:06 - 46:08to see if J of theta SVM
-
46:08 - 46:12is bigger than or less than J of theta BLR. Okay?
-
46:12 - 46:15So I explain this on the next slide.
-
46:15 - 46:16So
-
46:16 - 46:20we know two facts. We know that - well we know one fact. We know that a weighted
-
46:20 - 46:21accuracy
-
46:21 - 46:23of support vector machine, right,
-
46:23 - 46:24is bigger than
-
46:24 - 46:29this weighted accuracy of Bayesian logistic regression. So
-
46:29 - 46:32in order for me to figure out whether Bayesian logistic regression is converging,
-
46:32 - 46:35or whether I'm just optimizing the wrong objective function,
-
46:35 - 46:41the diagnostic I'm gonna use is I'm gonna check whether this inequality holds true. Okay?
-
46:41 - 46:44So let me explain this,
-
46:44 - 46:45so in Case 1,
-
46:45 - 46:46right,
-
46:46 - 46:48it's just those two equations copied over.
-
46:48 - 46:50In Case 1, let's say that
-
46:50 - 46:55J of SVM is, indeed, greater than J of BLR - or J of
-
46:55 - 47:01theta SVM is greater than J of theta BLR. But
-
47:01 - 47:04we know that Bayesian logistic regression
-
47:04 - 47:08was trying to maximize J of theta;
-
47:08 - 47:09that's the definition of
-
47:09 - 47:12Bayesian logistic regression.
-
47:12 - 47:17So this means that
-
47:17 - 47:18theta -
-
47:18 - 47:22the value of theta output by Bayesian logistic regression actually fails to
-
47:22 - 47:24maximize J,
-
47:24 - 47:27because the support vector machine actually returned a value of theta that,
-
47:27 - 47:29you know, does a
-
47:29 - 47:31better job of maximizing J.
-
47:31 - 47:37And so, this tells me that Bayesian logistic regression didn't actually maximize J
-
47:37 - 47:39correctly, and so the problem is with
-
47:39 - 47:41the optimization algorithm. The
-
47:41 - 47:45optimization algorithm hasn't converged. The other
-
47:45 - 47:46case
-
47:46 - 47:50is as follows, where
-
47:50 - 47:53J of theta SVM is less-than/equal to J of theta
-
47:53 - 47:56BLR. Okay?
-
47:56 - 47:58In this case, what does
-
47:58 - 47:59that mean?
-
47:59 - 48:02This means that Bayesian logistic regression
-
48:02 - 48:05actually attains a higher value
-
48:05 - 48:07for the optimization objective J
-
48:07 - 48:11than does the support vector machine.
-
48:11 - 48:13The support vector machine,
-
48:13 - 48:15which does worse
-
48:15 - 48:18on your optimization problem,
-
48:18 - 48:19actually does better
-
48:19 - 48:24on the weighted accuracy measure.
-
48:24 - 48:28So what this means is that something that does worse on your optimization
-
48:28 - 48:29objective,
-
48:29 - 48:30on J,
-
48:30 - 48:31can actually do better
-
48:31 - 48:34on the weighted accuracy objective.
-
48:34 - 48:37And this really means that maximizing
-
48:37 - 48:38J of theta,
-
48:38 - 48:42you know, doesn't really correspond that well to maximizing your weighted accuracy criteria.
-
48:42 - 48:43
-
48:43 - 48:47And therefore, this tells you that J of theta is maybe the wrong optimization
-
48:47 - 48:50objective to be maximizing. Right?
-
48:50 - 48:51That just maximizing J of
-
48:51 - 48:53theta just wasn't a good objective
-
48:53 - 49:00to be choosing if you care about the weighted accuracy. Okay?
-
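In code form, the whole diagnostic might look something like the sketch below; `J`, `weighted_accuracy`, and the two parameter vectors are placeholders you would supply from your own problem, not anything defined in the lecture:

```python
def diagnose(theta_svm, theta_blr, J, weighted_accuracy):
    """Assumes a(theta_svm) > a(theta_blr) has already been observed,
    where a is the weighted accuracy you actually care about."""
    assert weighted_accuracy(theta_svm) > weighted_accuracy(theta_blr)
    if J(theta_svm) > J(theta_blr):
        # BLR is *trying* to maximize J but found a worse value than the
        # SVM did: the optimization algorithm hasn't converged.
        return "fix the optimization algorithm (more iterations, Newton's method, ...)"
    else:
        # BLR maximized J just fine, yet does worse on weighted accuracy:
        # J is the wrong objective (change lambda, or change objectives).
        return "fix the optimization objective"
```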
49:03 - 49:03Can you raise your hand
-
49:03 - 49:10if this made sense?
-
49:10 - 49:12Cool, good. So
-
49:12 - 49:17that tells us whether the problem is with the optimization algorithm
-
49:17 - 49:19or whether it's with the objective function.
-
49:19 - 49:21And so going back to this
-
49:21 - 49:23slide, the eight fixes we had,
-
49:23 - 49:24you notice that if you
-
49:24 - 49:27run gradient descent for more iterations
-
49:27 - 49:31that fixes the optimization algorithm. Trying out Newton's method
-
49:31 - 49:33fixes the optimization algorithm,
-
49:33 - 49:37whereas using a different value for lambda, in that lambda times norm of theta
-
49:37 - 49:39squared, you know, in your objective,
-
49:39 - 49:42fixes the optimization objective. And
-
49:42 - 49:48changing to an SVM is also another way of trying to fix the optimization objective. Okay?
-
49:48 - 49:49And so
-
49:49 - 49:52once again, you actually see this quite often that -
-
49:52 - 49:55actually, you see it very often, people will
-
49:55 - 49:58have a problem with the optimization objective
-
49:58 - 50:01and be working harder and harder
-
50:01 - 50:03to fix the optimization algorithm.
-
50:03 - 50:06That's another very common pattern that
-
50:06 - 50:10the problem is in the formula for your J of theta, and often you see people, you know,
-
50:10 - 50:13just running more and more iterations of gradient descent. Like trying Newton's
-
50:13 - 50:16method and trying conjugate gradient and then trying
-
50:16 - 50:19more and more crazy optimization algorithms,
-
50:19 - 50:21whereas the problem was in J itself, you know, and
-
50:21 - 50:24optimizing J of theta harder wasn't going to fix the problem at all. Okay?
-
50:24 - 50:29So there's another example of when these sorts of diagnostics will
-
50:29 - 50:32help you figure out whether you should be fixing your optimization algorithm
-
50:32 - 50:33or fixing the
-
50:33 - 50:39optimization
-
50:39 - 50:45objective. Okay? Let me think
-
50:45 - 50:48how much time I have.
-
50:48 - 50:49Hmm, let's
-
50:49 - 50:50see. Well okay, we have time. Let's do this.
-
50:50 - 50:53Show you one last example of a diagnostic. This is one that came up in,
-
50:53 - 50:57you know, my students' and my work on flying helicopters.
-
50:57 - 50:58
-
50:58 - 51:00This one actually,
-
51:00 - 51:04this example is the most complex of the three examples I'm gonna do
-
51:04 - 51:06today.
-
51:06 - 51:09I'm going to go through this somewhat quickly, and
-
51:09 - 51:11this actually draws on reinforcement learning which is something that I'm not
-
51:11 - 51:14gonna talk about until towards - close to the end of the course here, but this is just
-
51:14 - 51:17a more
-
51:17 - 51:20complicated example of a diagnostic we're gonna go over.
-
51:20 - 51:24What I'll do is probably go over this fairly quickly, and then after we've talked about
-
51:24 - 51:27reinforcement learning in the class, I'll probably actually come back and redo this exact
-
51:27 - 51:33same example because you'll understand it more deeply. Okay?
-
51:33 - 51:37So some of you know that my students and I fly autonomous helicopters, so how do you get a
-
51:37 - 51:42machine-learning algorithm to design the controller for
-
51:42 - 51:44a helicopter? This is what we do. All right?
-
51:44 - 51:49The first step is you build a simulator for a helicopter, so, you know, there's a screenshot of our
-
51:49 - 51:50simulator.
-
51:50 - 51:53This is just like a - it's like a joystick simulator; you can fly a helicopter in simulation. And then you
-
51:53 - 51:56
-
51:56 - 51:57choose a cost function - it's
-
51:57 - 52:01actually called a reward function in reinforcement learning, but here I'll call it a cost function.
-
52:01 - 52:03Say J of theta is, you know,
-
52:03 - 52:07the expected squared error in your helicopter's
-
52:07 - 52:08position. Okay? So J of theta is,
-
52:08 - 52:09maybe,
-
52:09 - 52:12the expected squared error, or just the squared error.
-
52:12 - 52:17And then we run a reinforcement-learning algorithm, you'll learn about RL algorithms
-
52:17 - 52:19in a few weeks.
-
52:19 - 52:22You run the reinforcement learning algorithm in your simulator
-
52:22 - 52:27to try to minimize this cost function; try to minimize the squared error of
-
52:27 - 52:31how well you're controlling your helicopter's position. Okay?
-
52:31 - 52:35The reinforcement learning algorithm will output some parameters, which I'm denoting theta
-
52:35 - 52:37subscript RL,
-
52:37 - 52:42and then you'll use that to fly your helicopter.
-
52:42 - 52:45So suppose you run this learning algorithm and
-
52:45 - 52:49you get out a set of controller parameters, theta subscript RL,
-
52:49 - 52:52that gives much worse performance than a human pilot. Then
-
52:52 - 52:55what do you do next? And in particular, you
-
52:55 - 52:58know, corresponding to the three steps above, there are three
-
52:58 - 53:01natural things you can try. Right? You can
-
53:01 - 53:02try to - oh, the bottom of
-
53:02 - 53:04the slide got chopped off.
-
53:04 - 53:08You can try to improve the simulator. And
-
53:08 - 53:10maybe you think your simulator isn't that accurate - you need to capture
-
53:10 - 53:12the aerodynamic effects more
-
53:12 - 53:15accurately. You need to capture the airflow and the turbulence effects around the helicopter
-
53:15 - 53:18more accurately.
-
53:18 - 53:21Maybe you need to modify the cost function. Maybe your squared error isn't cutting it. Maybe
-
53:21 - 53:25what a human pilot does isn't just optimizing squared error but it's something more
-
53:25 - 53:26subtle.
-
53:26 - 53:27Or maybe
-
53:27 - 53:33the reinforcement-learning algorithm isn't working; maybe it's not quite converging or something. Okay? So
-
53:33 - 53:37these are the diagnostics that I actually used, and my students and I actually use to figure out what's
-
53:37 - 53:41going on.
-
53:41 - 53:45Actually, why don't you just think about this for a second and think what you'd do, and then
-
53:45 - 53:52I'll go on and tell you what we do. All right,
-
54:46 - 54:48so let me tell you what -
-
54:48 - 54:50how we do this and see
-
54:50 - 54:53whether it's the same as yours or not. And if you have a better idea than I do, let me
-
54:53 - 54:54know and I'll let you try it
-
54:54 - 54:56on my helicopter.
-
54:56 - 54:58So
-
54:58 - 55:01here's the reasoning behind the experiments, right. So,
-
55:01 - 55:04yeah, let's say the controller output
-
55:04 - 55:10by our reinforcement-learning algorithm does poorly. Well
-
55:10 - 55:13suppose the following three things hold true.
-
55:13 - 55:15Suppose, to the contrary, I guess. Suppose that
-
55:15 - 55:20the helicopter simulator is accurate, so let's assume we have an accurate model
-
55:20 - 55:22of our helicopter. And
-
55:22 - 55:25let's suppose that the reinforcement learning algorithm,
-
55:25 - 55:29you know, correctly controls the helicopter in simulation,
-
55:29 - 55:32so we tend to run the learning algorithm in simulation so that, you know, the
-
55:32 - 55:35learning algorithm can crash a helicopter and it's fine. Right?
-
55:35 - 55:37So let's assume our reinforcement-learning
-
55:37 - 55:40algorithm correctly controls the helicopter so as to minimize the cost
-
55:40 - 55:42function J of theta.
-
55:42 - 55:44And let's suppose that
-
55:44 - 55:48minimizing J of theta does indeed correspond to accurate, correct autonomous
-
55:48 - 55:49flight.
-
55:49 - 55:52If all of these things held true,
-
55:52 - 55:54then that means that
-
55:54 - 55:58the parameters, theta RL, should actually fly well on my real
-
55:58 - 56:01helicopter. Right?
-
56:01 - 56:03And so the fact that the learned
-
56:03 - 56:05control parameters, theta RL,
-
56:05 - 56:09do not fly well on my helicopter, that sort
-
56:09 - 56:11of means that one of these three assumptions must be wrong,
-
56:11 - 56:18and I'd like to figure out which of these
-
56:18 - 56:20three assumptions
-
56:20 - 56:22is wrong. Okay? So these are the diagnostics we use.
-
56:22 - 56:25First one is
-
56:25 - 56:32we look at the controller and see if it even flies well in
-
56:32 - 56:35simulation - right, in the simulator of the helicopter that we did the learning on -
-
56:35 - 56:39and so if the learning algorithm flies well in the simulator but
-
56:39 - 56:42it doesn't fly well on my real helicopter,
-
56:42 - 56:46then that tells me the problem is probably in the simulator. Right?
-
56:46 - 56:48My simulator predicts
-
56:48 - 56:52the helicopter's controller will fly well but it doesn't actually fly well in real life, so
-
56:52 - 56:54could be the problem's in the simulator
-
56:54 - 56:59and we should spend our efforts improving the accuracy of our simulator.
-
56:59 - 57:03Otherwise, let me write theta subscript human for the human
-
57:03 - 57:07control policy. All right? So
-
57:07 - 57:12let's go ahead and ask a human to fly the helicopter, it could be in the simulator, it
-
57:12 - 57:13could be in real life,
-
57:13 - 57:17and let's measure, you know, the mean squared error
-
57:17 - 57:20of the human pilot's flight. And
-
57:20 - 57:24let's see if the human pilot does better or worse
-
57:24 - 57:26than the learned controller,
-
57:26 - 57:28in terms of optimizing this
-
57:28 - 57:32objective function J of theta. Okay?
-
57:32 - 57:34So if the human does
-
57:34 - 57:37worse, if even a very good human pilot
-
57:37 - 57:41attains a worse value on my optimization objective, on my cost
-
57:41 - 57:42function,
-
57:42 - 57:49than my learning algorithm,
-
57:49 - 57:52then the problem is in the reinforcement-learning algorithm.
-
57:52 - 57:56Because my reinforcement-learning algorithm was trying to minimize J of
-
57:56 - 58:00theta, but a human actually attains a lower value for J of theta than does my
-
58:00 - 58:02algorithm.
-
58:02 - 58:05And so that tells me that clearly my algorithm's not
-
58:05 - 58:08managing to minimize J of theta
-
58:08 - 58:13and that tells me the problem's in the reinforcement learning algorithm.
-
58:13 - 58:18And finally, if J of theta - if the human actually attains a larger value
-
58:18 - 58:20for theta - excuse me,
-
58:20 - 58:24if the human actually attains a larger value for J of theta, so the human actually
-
58:24 - 58:28has, you know, a larger mean squared error for the helicopter position than
-
58:28 - 58:31does my reinforcement learning algorithm, but
-
58:31 - 58:34I like the way the human flies much better than my reinforcement learning
-
58:34 - 58:35algorithm. So
-
58:35 - 58:37if that holds true,
-
58:37 - 58:40then clearly the problem's in the cost function, right,
-
58:40 - 58:43because the human does worse on my cost function
-
58:43 - 58:46but flies much better than my learning algorithm.
-
58:46 - 58:48And so that means the problem's in the cost function. It
-
58:48 - 58:50means - oh
-
58:50 - 58:51excuse me, I
-
58:51 - 58:54meant minimizing it, not maximizing it, there's a typo on the slide,
-
58:54 - 58:55because that means that minimizing
-
58:55 - 58:57the cost function
-
58:57 - 59:00- my learning algorithm does a better job minimizing the cost function but doesn't
-
59:00 - 59:03fly as well as a human pilot. So that tells you that
-
59:03 - 59:05minimizing the cost function
-
59:05 - 59:07doesn't correspond to good autonomous flight. And what
-
59:07 - 59:12you should do is go back and see if you can change J of
-
59:12 - 59:13theta. Okay?
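Putting those three checks together, here's a hedged sketch of the whole diagnostic in Python; flies_well_in_sim and J are hypothetical stand-ins for a simulator rollout and the cost function, not code from our actual system.

```python
# Three-way helicopter diagnostic, under the premise that theta_rl flies
# poorly on the real helicopter. All names here are hypothetical stand-ins.

def helicopter_diagnostic(theta_rl, theta_human, flies_well_in_sim, J):
    # Check 1: the controller was trained in this simulator, so if it flies
    # well in simulation but poorly in reality, the simulator is inaccurate.
    if flies_well_in_sim(theta_rl):
        return "improve the simulator"

    # Check 2: if even a good human pilot attains a LOWER cost than the
    # learned controller, the RL algorithm isn't actually minimizing J.
    if J(theta_human) < J(theta_rl):
        return "fix the reinforcement learning algorithm"

    # Check 3: otherwise the human attains a higher cost yet flies better,
    # so minimizing J doesn't correspond to good autonomous flight.
    return "change the cost function J"
```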
-
59:13 - 59:18And so for those reinforcement learning problems, you know, if something doesn't work - often reinforcement
-
59:18 - 59:22learning algorithms just work but when they don't work,
-
59:22 - 59:26these are the sorts of diagnostics you use to figure out should we be focusing on the simulator,
-
59:26 - 59:30on changing the cost function, or on changing the reinforcement learning
-
59:30 - 59:32algorithm. And
-
59:32 - 59:37again, if you don't know which of your three problems it is, it's entirely possible,
-
59:37 - 59:40you know, to spend two years, whatever, building a better simulator
-
59:40 - 59:43for your helicopter.
-
59:43 - 59:44But it turns out that
-
59:44 - 59:48modeling helicopter aerodynamics is an active area of research. There are people, you know, writing
-
59:48 - 59:50entire PhD theses on this still.
-
59:50 - 59:54So it's entirely possible to go out and spend six years and write a PhD thesis and build
-
59:54 - 59:55a much better helicopter simulator, but if you're fixing
-
59:55 - 60:02the wrong problem it's not gonna help.
-
60:03 - 60:06So
-
60:06 - 60:09quite often, you need to come up with your own diagnostics to figure out what's happening in an
-
60:09 - 60:12algorithm when something is going wrong.
-
60:12 - 60:16And unfortunately I don't know of - what I've described
-
60:16 - 60:17are sort of maybe
-
60:17 - 60:21some of the most common diagnostics that I've used, that I've seen,
-
60:21 - 60:24you know, to be useful for many problems. But very often, you need to come up
-
60:24 - 60:28with your own for your own specific learning problem.
-
60:28 - 60:32And I just want to point out that even when the learning algorithm is working well, it's
-
60:32 - 60:35often a good idea to run diagnostics, like the ones I talked
-
60:35 - 60:36about,
-
60:36 - 60:38to make sure you really understand what's going on.
-
60:38 - 60:42All right? And this is useful for a couple of reasons. One is that
-
60:42 - 60:46diagnostics like these will often help you to understand your application
-
60:46 - 60:48problem better.
-
60:48 - 60:52So some of you will, you know, graduate from Stanford and go on to get some amazingly high-paying
-
60:52 - 60:56job to apply machine-learning algorithms to some application problem of, you
-
60:56 - 60:59know, significant economic interest. Right?
-
60:59 - 61:03And you're gonna be working on one specific
-
61:03 - 61:08important machine learning application for many months, or even for years.
-
61:08 - 61:11One of the most valuable things for you personally will be for you to
-
61:11 - 61:13get in - for you personally
-
61:13 - 61:17to gain an intuitive understanding of what works and what doesn't work on your
-
61:17 - 61:17problem.
-
61:17 - 61:21Sort of right now in the industry, in Silicon Valley or around the world,
-
61:21 - 61:25there are many companies with important machine learning problems and there are often people
-
61:25 - 61:27working on the same machine learning problem, you
-
61:27 - 61:31know, for many months or for years on end. And
-
61:31 - 61:35when you're doing that, I mean solving a really important problem using learning algorithms, one of
-
61:35 - 61:39the most valuable things is just your own personal intuitive understanding of the
-
61:39 - 61:41problem.
-
61:41 - 61:42Okay?
-
61:42 - 61:43And diagnostics, like
-
61:43 - 61:48the sort I talked about, will be one way for you to get a better and better understanding of
-
61:48 - 61:50these problems. It
-
61:50 - 61:54turns out, by the way, there are some Silicon Valley companies that outsource their
-
61:54 - 61:57machine learning. So there's sometimes, you know, whatever.
-
61:57 - 62:00There'll be a company in Silicon Valley and they'll, you know,
-
62:00 - 62:03hire a firm in New York to run all their learning algorithms for them.
-
62:03 - 62:07And I'm not a businessman, but I personally think that's
-
62:07 - 62:09often a terrible idea because
-
62:09 - 62:14if your expertise, if your understanding of your data is given,
-
62:14 - 62:16you know, to an outsource agency,
-
62:16 - 62:20then you don't maintain that expertise. And if there's a problem you really care about,
-
62:20 - 62:22it'll be your own, you know,
-
62:22 - 62:26understanding of the problem that you build up over months that'll be really valuable.
-
62:26 - 62:29And if that knowledge is outsourced, you don't get to keep that knowledge
-
62:29 - 62:29yourself.
-
62:29 - 62:32I personally think that's a terrible idea.
-
62:32 - 62:36But I'm not a businessman; I just see people do that a lot.
-
62:36 - 62:39Let's see.
-
62:39 - 62:43Another reason for running diagnostics like these is actually in writing research
-
62:43 - 62:44papers,
-
62:44 - 62:46right? So
-
62:46 - 62:49diagnostics and error analyses, which I'll talk about in a minute,
-
62:49 - 62:53often help to convey insight about the problem and help justify your research
-
62:53 - 62:54claims.
-
62:54 - 62:57
-
62:57 - 62:58So for example,
-
62:58 - 63:01rather than writing a research paper, say, that says, you know, "Oh well here's
-
63:01 - 63:04an algorithm that works. I built this helicopter and it flies," or whatever,
-
63:04 - 63:06it's often much more interesting to say,
-
63:06 - 63:10"Here's an algorithm that works, and it works because of a specific
-
63:10 - 63:14component X. And moreover, here's the diagnostic that gives you justification that shows X was
-
63:14 - 63:19the thing that fixed this problem," and that's where you made it work. Okay? So
-
63:19 - 63:21that leads me
-
63:21 - 63:26into a discussion on error analysis, which is often good machine learning practice -
-
63:26 - 63:26
-
63:26 - 63:32it's a way of understanding what your sources of error are. So what I
-
63:32 - 63:35call error analyses - and let me check for
-
63:35 - 63:42questions about this.
-
63:42 - 63:46Yeah?
-
63:46 - 63:50Student:What ended up being wrong with the helicopter? Instructor (Andrew Ng):Oh, don't know. Let's see. We've flown so many times.
-
63:50 - 63:54The thing that is most difficult with a helicopter is actually building a
-
63:54 - 63:55very - I don't know. It
-
63:55 - 63:58changes all the time. Quite often, it's actually the simulator. Building an accurate simulator of a helicopter
-
63:58 - 64:03is very hard. Yeah. Okay. So
-
64:03 - 64:04for error
-
64:04 - 64:06analyses,
-
64:06 - 64:11this is a way for figuring out what is working in your algorithm and what isn't working.
-
64:11 - 64:18And we're gonna talk about two specific examples. So there are
-
64:18 - 64:22many learning - there are many sort of AI systems, many machine learning systems, that
-
64:22 - 64:22combine
-
64:22 - 64:25many different components into a pipeline. So
-
64:25 - 64:27here's sort of a contrived example for this,
-
64:27 - 64:31not dissimilar in many ways from the actual machine learning systems you see.
-
64:31 - 64:32So let's say you want to
-
64:32 - 64:38recognize people from images. This is a picture of one of my friends.
-
64:38 - 64:42So you take this input camera image, say, and you often run it through a long pipeline. So
-
64:42 - 64:43for example,
-
64:43 - 64:48the first thing you may do is preprocess the image and remove the background, so you remove the
-
64:48 - 64:49background.
-
64:49 - 64:52And then you run a
-
64:52 - 64:55face detection algorithm, so a machine learning algorithm to detect people's faces.
-
64:55 - 64:56Right?
-
64:56 - 65:00And then, you know, let's say you want to recognize the identity of the person, right, this is your
-
65:00 - 65:02application.
-
65:02 - 65:04You then segment out the eyes, segment out the nose,
-
65:04 - 65:08and have different learning algorithms to detect the mouth and so on.
-
65:08 - 65:10I know; she might not want to be my friend
-
65:10 - 65:13after she sees this.
-
65:13 - 65:17And then having found all these features, based on, you know, what the nose looks like, what the eyes
-
65:17 - 65:19look like, whatever, then you
-
65:19 - 65:23feed all the features into a logistic regression algorithm. And your logistic
-
65:23 - 65:25regression or softmax regression, or whatever,
-
65:25 - 65:30will tell you the identity of this person. Okay?
-
65:30 - 65:32So
-
65:32 - 65:35this is what error analysis is.
-
65:35 - 65:40You have a long complicated pipeline combining many machine learning
-
65:40 - 65:44components. Many of these would themselves be learning algorithms.
-
65:44 - 65:46And so,
-
65:46 - 65:50it's often very useful to figure out how much of your error can be attributed to each of
-
65:50 - 65:55these components.
-
65:55 - 65:56So
-
65:56 - 66:00what we'll do in a typical error analysis procedure
-
66:00 - 66:04is we'll repeatedly plug in the ground-truth for each component and see how the
-
66:04 - 66:05accuracy changes.
-
66:05 - 66:08So what I mean by that is the
-
66:08 - 66:11figure on the bottom left - bottom right, let's say the overall accuracy of the system is
-
66:11 - 66:1385 percent. Right?
-
66:13 - 66:15Then I want to know
-
66:15 - 66:17where my 15 percent of error comes from.
-
66:17 - 66:19And so what I'll do is I'll go
-
66:19 - 66:21to my test set
-
66:21 - 66:27and, instead of running my own background removal, I'll plug in the correct
-
66:27 - 66:30background removal. So actually go in and give it -
-
66:30 - 66:33give my algorithm - the correct background versus foreground.
-
66:33 - 66:37And if I do that, let's color that blue to denote that I'm
-
66:37 - 66:40giving that ground-truth data in the test set,
-
66:40 - 66:44let's assume our accuracy increases to 85.1 percent. Okay?
-
66:44 - 66:48And now I'll go in and, you know, give my algorithm the ground-truth,
-
66:48 - 66:49face detection
-
66:49 - 66:53output. So I'll go in and actually on my test set I'll just tell the algorithm where the
-
66:53 - 66:55face is. And if I do that,
-
66:55 - 66:59let's say my algorithm's accuracy increases to 91 percent,
-
66:59 - 67:03and so on. And then I'll go for each of these components
-
67:03 - 67:05and just give it
-
67:05 - 67:09the ground-truth label for each of the components,
-
67:09 - 67:12because say, like, the nose segmentation algorithm's trying to figure out
-
67:12 - 67:13where the nose is. I just go in
-
67:13 - 67:17and tell it where the nose is so that it doesn't have to figure that out.
-
67:17 - 67:21And as I do this, one component after the other, you know, I end up giving it the correct output
-
67:21 - 67:24label and end up with 100 percent accuracy.
-
67:24 - 67:27And now you can look at this table - I'm sorry this is cut off on the bottom,
-
67:27 - 67:29it says logistic regression 100 percent. Now you can
-
67:29 - 67:31look at this
-
67:31 - 67:32table and
-
67:32 - 67:33see,
-
67:33 - 67:36you know, how much giving the ground-truth labels for each of these
-
67:36 - 67:39components could help boost your final performance.
-
67:39 - 67:42In particular, if you look at this table, you notice that
-
67:42 - 67:45when I added the face detection ground-truth,
-
67:45 - 67:48my performance jumped from 85.1 percent accuracy
-
67:48 - 67:51to 91 percent accuracy. Right?
-
67:51 - 67:55So this tells me that if only I can get better face detection,
-
67:55 - 67:58maybe I can boost my accuracy by 6 percent.
-
67:58 - 68:00Whereas in contrast, when I,
-
68:00 - 68:04you know, say plugged in better, I don't know,
-
68:04 - 68:07background removal, my accuracy improved from 85
-
68:07 - 68:09to 85.1 percent.
-
68:09 - 68:12And so, this sort of diagnostic also tells you that if your goal
-
68:12 - 68:14is to improve the system, it's probably a waste of
-
68:14 - 68:18your time to try to improve your background subtraction. Because
-
68:18 - 68:19even if you got the ground-truth,
-
68:19 - 68:22that gives you, at most, 0.1 percent accuracy improvement,
-
68:22 - 68:25whereas if you do better face detection, maybe there's a much
-
68:25 - 68:26larger potential for gains there. Okay?
-
68:26 - 68:29So this sort of diagnostic,
-
68:29 - 68:30again,
-
68:30 - 68:33is very useful because if your goal is to improve the system,
-
68:33 - 68:36there are so many different pieces you can easily choose to spend the next three
-
68:36 - 68:37months on. Right?
-
68:37 - 68:39And choosing the right piece
-
68:39 - 68:43is critical, and this sort of diagnostic tells you what's the piece that may
-
68:43 - 68:49actually be worth your time to work on.
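If you wanted to script that procedure, it might look something like the following sketch; pipeline.use_ground_truth, evaluate, and the component names are all made up for illustration and would need to be adapted to your own system.

```python
# Error analysis sketch: cumulatively substitute hand-labeled ground truth
# for each component, in pipeline order, and watch test accuracy change.
# `pipeline`, `evaluate`, and `use_ground_truth` are hypothetical.

COMPONENTS = ["background_removal", "face_detection",
              "eye_segmentation", "nose_segmentation", "mouth_segmentation"]

def error_analysis(pipeline, evaluate, test_set):
    print(f"baseline: {evaluate(pipeline, test_set):.1%}")
    for component in COMPONENTS:
        # From this point on, this component's output on the test set is
        # replaced by ground-truth labels.
        pipeline.use_ground_truth(component)
        print(f"with ground-truth {component}: {evaluate(pipeline, test_set):.1%}")
```

The component whose ground truth produces the biggest jump - face detection in the example above, from 85.1 to 91 percent - is where your effort is most likely to pay off.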
-
68:49 - 68:52There's sort of another type of analysis that's sort of the opposite of what I just
-
68:52 - 68:53talked about.
-
68:53 - 68:55The error analysis I just talked about
-
68:55 - 68:58tries to explain the difference between the current performance and perfect
-
68:58 - 69:00performance,
-
69:00 - 69:04whereas this sort of ablative analysis tries to explain the difference
-
69:04 - 69:09between some baselines, some really bad performance and your current performance.
-
69:09 - 69:13So for this example, let's suppose you've built a very good anti-spam classifier by
-
69:13 - 69:17adding lots of clever features to your logistic regression algorithm. Right? So you added
-
69:17 - 69:21features for spelling correction, for, you know, sender host features, for email header
-
69:21 - 69:21features,
-
69:21 - 69:25email text parser features, JavaScript parser features,
-
69:25 - 69:27features for embedded images, and so on.
-
69:27 - 69:30So now let's say you've built the system and you want to figure out, you know, how well did
-
69:30 - 69:34each of these - how much did each of these components actually contribute? Maybe you want
-
69:34 - 69:37to write a research paper and claim this was the piece that made the
-
69:37 - 69:41big difference. Can you actually document that claim and justify it?
-
69:41 - 69:43So in ablative analysis,
-
69:43 - 69:45here's what we do.
-
69:45 - 69:46So in this example,
-
69:46 - 69:50let's say that simple logistic regression without any of your clever
-
69:50 - 69:52improvements gets 94 percent performance. And
-
69:52 - 69:55you want to figure out what accounts for your improvement from 94 to
-
69:55 - 69:5899.9 percent performance.
-
69:58 - 70:03So in ablative analysis, instead of adding components one at a time, we'll
-
70:03 - 70:07remove components one at a time to see how performance degrades.
-
70:07 - 70:11So start with your overall system, which has 99.9 percent accuracy.
-
70:11 - 70:14And then we remove spelling correction and see how much performance
-
70:14 - 70:15drops.
-
70:15 - 70:22Then we'll remove the sender host features and see how much performance drops, and so on. All right? And so,
-
70:24 - 70:28in this contrived example,
-
70:28 - 70:31you see that, I guess, the biggest drop
-
70:31 - 70:32occurred when you removed
-
70:32 - 70:38the text parser features. And so you can then make a credible case that,
-
70:38 - 70:41you know, the text parser features were what really made the biggest difference here. Okay?
-
70:41 - 70:43And you can also tell,
-
70:43 - 70:46for instance, that, I don't know,
-
70:46 - 70:49removing the sender host features on this
-
70:49 - 70:52line, right, performance dropped from 99.9 to 98.9. And so this also means
-
70:52 - 70:53that
-
70:53 - 70:56in case you want to get rid of the sender host features to speed up
-
70:56 - 71:03computation, that would be a good candidate for elimination. Okay? Student:Are there any
-
71:04 - 71:05guarantees that if you shuffle around the order in which
-
71:05 - 71:06you drop those
-
71:06 - 71:10features that you'll get the same - Instructor (Andrew Ng):Yeah, let's address the question: what if you shuffle the order in which you remove things? The answer is no. There's
-
71:10 - 71:12no guarantee you'd get a similar result.
-
71:12 - 71:14So in practice,
-
71:14 - 71:18sometimes there's a fairly natural ordering for both types of analyses, the error
-
71:18 - 71:19analyses and ablative analysis -
-
71:19 - 71:23sometimes there's a fairly natural ordering in which you add things or remove things,
-
71:23 - 71:25and sometimes there isn't. And
-
71:25 - 71:28quite often, you just choose one ordering and go for it.
-
71:28 - 71:32And don't think of these analyses as sort of formulas that are cast in stone, though; I mean
-
71:32 - 71:35feel free to invent your own, as well. You know
-
71:35 - 71:37one of the things
-
71:37 - 71:38that's done quite often is
-
71:38 - 71:39take the overall system
-
71:39 - 71:43and just remove one component and then put it back, then remove a different one
-
71:43 - 71:48and put it back, until all of these things are done. Okay.
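As a sketch, ablative analysis is the mirror image of the error analysis loop: start from the full feature set and strip groups away one at a time. Again, train_and_evaluate and the feature-group names below are hypothetical.

```python
# Ablative analysis sketch: remove feature groups cumulatively, in one fixed
# order, from the full anti-spam system and record how much accuracy drops.
# `train_and_evaluate` and the group names are made up for illustration.

FEATURE_GROUPS = ["spelling_correction", "sender_host", "email_header",
                  "text_parser", "javascript_parser", "embedded_images"]

def ablative_analysis(train_and_evaluate):
    active = list(FEATURE_GROUPS)
    previous = train_and_evaluate(active)
    print(f"full system: {previous:.1%}")
    for group in FEATURE_GROUPS:
        active.remove(group)
        accuracy = train_and_evaluate(active)
        print(f"without {group}: {accuracy:.1%} (drop: {previous - accuracy:.1%})")
        previous = accuracy
```

The biggest single drop points at the component that mattered most, subject to the ordering caveat from the question above.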
-
71:48 - 71:51So the very last thing I want to talk about is sort of this
-
71:51 - 71:58general advice for how to get started on a learning problem. So
-
71:58 - 72:04here's a cartoon description of two broad ways to get started on a learning problem.
-
72:04 - 72:06The first one is
-
72:06 - 72:08carefully design your system, so
-
72:08 - 72:12you spend a long time designing exactly the right features, collecting the right data set, and
-
72:12 - 72:14designing the right algorithmic structure, then you
-
72:14 - 72:18implement it and hope it works. All right?
-
72:18 - 72:21The benefit of this sort of approach is you get maybe nicer, maybe more scalable
-
72:21 - 72:22algorithms,
-
72:22 - 72:27and maybe you come up with new elegant learning algorithms. And if your goal is to,
-
72:27 - 72:31you know, contribute to basic research in machine learning, if your goal is to invent new machine learning
-
72:31 - 72:31algorithms,
-
72:31 - 72:34this process of slowing down and
-
72:34 - 72:36thinking deeply about the problem, you know, that is sort of the right way to go
-
72:36 - 72:37about it:
-
72:37 - 72:41think deeply about a problem and invent new solutions.
-
72:41 - 72:42
-
72:42 - 72:44Second sort of approach
-
72:44 - 72:49is what I call build-and-fix, which is you implement something quick and dirty
-
72:49 - 72:52and then you run error analyses and diagnostics to figure out what's wrong and
-
72:52 - 72:54you fix those errors.
-
72:54 - 72:58The benefit of this second type of approach is that it'll often get your
-
72:58 - 73:01application working much more quickly.
-
73:01 - 73:04And especially for those of you who end up working in
-
73:04 - 73:06a company,
-
73:06 - 73:07you know, very often it's not
-
73:07 - 73:11the best product that wins; it's the first product to market that
-
73:11 - 73:12wins. And
-
73:12 - 73:15so there's - especially in the industry. There's really something to be said for,
-
73:15 - 73:19you know, building a system quickly and getting it deployed quickly.
-
73:19 - 73:23And the second approach of building a quick-and-dirty - I'm gonna say hack -
-
73:23 - 73:26and then fixing the problems will actually get you to a
-
73:26 - 73:28system that works well
-
73:28 - 73:31much more quickly.
-
73:31 - 73:33And the reason is
-
73:33 - 73:36very often it's really not clear which parts of a system are easy or hard to
-
73:36 - 73:38build, and therefore what
-
73:38 - 73:40you need to spend lots of time focusing on.
-
73:40 - 73:43So there's that example I talked about just now. Right?
-
73:43 - 73:47For identifying
-
73:47 - 73:49people, say.
-
73:49 - 73:53And with a big complicated learning system like this, a big complicated pipeline like this,
-
73:53 - 73:56it's really not obvious at the outset
-
73:56 - 73:59which of these components you should spend lots of time working on. Right? And if
-
73:59 - 74:01you didn't know that
-
74:01 - 74:04preprocessing wasn't the right component, you could easily have
-
74:04 - 74:07spent three months working on better background subtraction, not knowing that it's
-
74:07 - 74:10just not gonna ultimately matter.
-
74:10 - 74:11And so
-
74:11 - 74:14the only way to find out what really works is implementing something quickly and
-
74:14 - 74:15you find out what parts -
-
74:15 - 74:17and find out
-
74:17 - 74:18what parts
-
74:18 - 74:21are really the hard parts to implement, or what parts could make a
-
74:21 - 74:23difference in performance.
-
74:23 - 74:27In fact, I'd say that if your goal is to build a
-
74:27 - 74:29people recognition system, a system like this is actually far too
-
74:29 - 74:32complicated as your initial system.
-
74:32 - 74:36Maybe after you've prototyped a few systems, you'd converge to a system like this. But if this
-
74:36 - 74:43is your first system you're designing, this is much too complicated. Also, this is a
-
74:44 - 74:48very concrete piece of advice, and this applies to your projects as well.
-
74:48 - 74:51If your goal is to build a working application,
-
74:51 - 74:55Step 1 is actually probably not to design a system like this. Step 1 is to plot your
-
74:55 - 74:57data.
-
74:57 - 75:01And very often, if you just take the data you're trying to predict and just plot your
-
75:01 - 75:06data, plot X, plot Y, plot your data everywhere you can think of,
-
75:06 - 75:10you know, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they
-
75:10 - 75:14should be positive. Something's wrong with this dataset." And about
-
75:14 - 75:18half the time you find something obviously wrong with your data or something very surprising.
-
75:18 - 75:22And this is something you find out just by plotting your data, and that you
-
75:22 - 75:28won't find out by implementing these big complicated learning algorithms on it. Plotting
-
75:28 - 75:32the data sounds so simple, it was one of the pieces of advice that lots of us give but
-
75:32 - 75:39hardly anyone follows, so you can take that for what it's worth.
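For what it's worth, plotting your data really can be this cheap. Here's a minimal sketch, assuming your dataset sits in a CSV file; the file name and the column names x and y are placeholders.

```python
# "Plot your data first" sketch. File and column names are placeholders.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("my_dataset.csv")   # hypothetical dataset

df.hist(bins=50, figsize=(10, 6))    # distribution of every numeric column
plt.tight_layout()
plt.show()

# Raw input against the target: wrong signs, outliers, and clipped values
# tend to show up immediately in plots like these.
df.plot.scatter(x="x", y="y", s=5)
plt.show()
```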
-
75:39 - 75:42Let me just reiterate, what I just said here may be bad advice
-
75:42 - 75:44if your goal is to come up with
-
75:44 - 75:47new machine learning algorithms. All right? So
-
75:47 - 75:51for me personally, the learning algorithm I use the most often is probably
-
75:51 - 75:54logistic regression because I have code lying around. So give me a
-
75:54 - 75:57learning problem, I probably won't try anything more complicated than logistic
-
75:57 - 75:58regression on it first. And it's
-
75:58 - 76:02only after trying something really simple and figuring out what's easy, what's hard, that you know
-
76:02 - 76:04where to focus your efforts. But
-
76:04 - 76:08again, if your goal is to invent new machine learning algorithms, then you sort of don't
-
76:08 - 76:11want to hack up something and then add another hack to fix it, and hack it even more to
-
76:11 - 76:12fix it. Right? So if
-
76:12 - 76:16your goal is to do novel machine learning research, then it pays to think more deeply about the
-
76:16 - 76:21problem, and not to follow this advice too literally.
-
76:21 - 76:23Shoot, you know what? All
-
76:23 - 76:28right, sorry if I run late, but I just have two more slides so I'm gonna go through these quickly.
-
76:28 - 76:31And so, this is what I think
-
76:31 - 76:33of as premature statistical optimization,
-
76:33 - 76:35where quite often,
-
76:35 - 76:38just like premature optimization of code, quite often
-
76:38 - 76:44people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more
-
76:44 - 76:47slides. This
-
76:47 - 76:49was -
-
76:49 - 76:52this is a sort of cartoon that highly influenced my own thinking. It was based on
-
76:52 - 76:55a paper written by Christos Papadimitriou.
-
76:55 - 76:57This is how
-
76:57 - 76:59progress - this is how
-
76:59 - 77:02developmental progress of research often happens. Right?
-
77:02 - 77:06Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. And it
-
77:06 - 77:07seems like a useful thing to have.
-
77:07 - 77:10Right? You know, free up people so they don't have
-
77:10 - 77:13to deliver mail. So -
-
77:13 - 77:14to deliver mail,
-
77:14 - 77:19obviously you need a robot to wander around indoor environments and you need a robot to
-
77:19 - 77:21manipulate objects and pick up envelopes. And so,
-
77:21 - 77:25you need to build those two components in order to get a mail delivery robot. And
-
77:25 - 77:26so I've
-
77:26 - 77:30drawn those two components and little arrows to denote that, you know, obstacle avoidance
-
77:30 - 77:30is
-
77:30 - 77:32needed or would help build
-
77:32 - 77:36your mail delivery robot. Well
-
77:36 - 77:37for obstacle for avoidance,
-
77:37 - 77:43clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles.
-
77:43 - 77:47Now we're gonna use computer vision to detect the objects. And so,
-
77:47 - 77:51we know that, you know, lighting sometimes changes, right, depending on whether it's the
-
77:51 - 77:53morning or noontime or evening.
-
77:53 - 77:54Lighting
-
77:54 - 77:57causes the color of things to change, and so you need
-
77:57 - 78:01an object detection system that's invariant to the specific colors of an
-
78:01 - 78:01object. Right?
-
78:01 - 78:04Because lighting
-
78:04 - 78:05changes,
-
78:05 - 78:10say. Well, color, or RGB value, is represented by a three-dimensional vector. And
-
78:10 - 78:11so you need to learn
-
78:11 - 78:13when two colors might be the same thing -
-
78:13 - 78:15when, you know, the
-
78:15 - 78:18visual appearances of two colors may be the same thing under just a lighting change or
-
78:18 - 78:20something.
-
78:20 - 78:21And
-
78:21 - 78:24to understand that properly, we can go out and study differential geometry
-
78:24 - 78:28of 3d manifolds because that helps us build a sound theory on which
-
78:28 - 78:32to develop our 3d similarity learning algorithms.
-
78:32 - 78:36And to really understand the fundamental aspects of this problem,
-
78:36 - 78:40we have to study the complexity of non-Riemannian geometries. And on
-
78:40 - 78:44and on it goes until eventually you're proving convergence bounds for
-
78:44 - 78:50sampled non-monotonic logic. I don't even know what this is because I just made it up.
-
78:50 - 78:52Whereas in reality,
-
78:52 - 78:54you know, chances are that link isn't real.
-
78:54 - 78:56Color variance
-
78:56 - 79:00just barely helped object recognition maybe. I'm making this up.
-
79:00 - 79:03Maybe differential geometry was hardly gonna help 3d similarity learning and that link's also gonna fail. Okay?
-
79:03 - 79:05So, each of
-
79:05 - 79:09these circles can represent a person, or a research community, or a thought in your
-
79:09 - 79:12head. And there's a very real chance that
-
79:12 - 79:15maybe there are all these papers written on differential geometry of 3d manifolds, and they are
-
79:15 - 79:19written because some guy once told someone else that it'll help 3d similarity learning.
-
79:19 - 79:20And,
-
79:20 - 79:23you know, it's like "A friend of mine told me that color invariance would help in
-
79:23 - 79:26object recognition, so I'm working on color invariance. And now I'm gonna tell a friend
-
79:26 - 79:27of mine
-
79:27 - 79:30that his thing will help my problem. And he'll tell a friend of his that his thing will help
-
79:30 - 79:32with his problem."
-
79:32 - 79:34And pretty soon, you're working on
-
79:34 - 79:38convergence bounds for sampled non-monotonic logic, when in reality none of these will
-
79:38 - 79:39see the light of
-
79:39 - 79:43day in your mail delivery robot. Okay?
-
79:43 - 79:47I'm not criticizing the role of theory. There are very powerful theories, like the
-
79:47 - 79:48theory of VC dimension,
-
79:48 - 79:52which is far, far, far to the right of this. So VC dimension is about
-
79:52 - 79:53as theoretical
-
79:53 - 79:57as it can get. And it's clearly had a huge impact on many applications. And it's,
-
79:57 - 80:00you know, dramatically advanced machine learning. And another example is the theory of NP-hardness, which again, you know,
-
80:00 - 80:01is about as
-
80:01 - 80:04theoretical as it can get. It's
-
80:04 - 80:06had a huge impact
-
80:06 - 80:09on all of computer science, the theory of NP-hardness.
-
80:09 - 80:11But
-
80:11 - 80:14when you are off working on highly theoretical things, I guess, to me
-
80:14 - 80:17personally it's important to keep in mind
-
80:17 - 80:20are you working on something like VC dimension, which is high impact, or are you
-
80:20 - 80:23working on something like convergence bounds for sampled non-monotonic logic, which
-
80:23 - 80:25you're only hoping
-
80:25 - 80:26has some peripheral relevance
-
80:26 - 80:30to some application. Okay?
-
80:30 - 80:35For me personally, I tend to work on an application only if I - excuse me.
-
80:35 - 80:37For me personally, and this is a personal choice,
-
80:37 - 80:41I tend to trust something only if I personally can see a link from the
-
80:41 - 80:43theory I'm working on
-
80:43 - 80:44all the way back to an application.
-
80:44 - 80:46And
-
80:46 - 80:50if I don't personally see a direct link from what I'm doing to an application then,
-
80:50 - 80:53you know, then that's fine. Then I can choose to work on theory, but
-
80:53 - 80:56I wouldn't necessarily trust that
-
80:56 - 80:59the theory I'm working on will relate to an application, if I don't personally
-
80:59 - 81:02see a link all the way back.
-
81:02 - 81:04Just to summarize.
-
81:04 - 81:06
-
81:06 - 81:09One lesson to take away from today is I think
-
81:09 - 81:13time spent coming up with diagnostics for learning algorithms is often time well spent.
-
81:13 - 81:13
-
81:13 - 81:16It's often up to your own ingenuity to come up with great diagnostics. And
-
81:16 - 81:19when I personally work on a machine learning algorithm,
-
81:19 - 81:21it's not uncommon for me to be spending like
-
81:21 - 81:24between a third and often half of my time
-
81:24 - 81:26just writing diagnostics and trying to figure out what's going right and what's
-
81:26 - 81:28going wrong.
-
81:28 - 81:32Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and
-
81:32 - 81:35making progress. You don't want to be spending all this time, you know, implementing tests on your
-
81:35 - 81:38learning algorithms; it doesn't feel like you're doing anything. But when
-
81:38 - 81:41I implement learning algorithms, at least a third, and quite often half of
-
81:41 - 81:46my time, is actually spent implementing those tests so I can figure out what to work on. And
-
81:46 - 81:49I think it's actually one of the best uses of your time. We talked
-
81:49 - 81:51about error
-
81:51 - 81:54analyses and ablative analyses, and lastly
-
81:54 - 81:57talked about, you know, different approaches and the
-
81:57 - 82:01risks of premature statistical optimization. Okay.
-
82:01 - 82:04Sorry I ran over. I'll be here for a few more minutes for your questions.