Lecture 11 | Machine Learning (Stanford)
-
0:10 - 0:12
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:25 - 0:28Okay. Good morning. Welcome back.
-
0:28 - 0:31What I want to do today is
-
0:31 - 0:37actually wrap up our discussion on learning theory,
-
0:37 - 0:38and I'm gonna start by
-
0:38 - 0:42talking about Bayesian statistics and regularization,
-
0:42 - 0:46and then take a very brief digression to tell you about online learning.
-
0:46 - 0:50And most of today's lecture will actually be on various pieces of that, so applying
-
0:50 - 0:52machine learning algorithms to problems like, you know,
-
0:52 - 0:56like the project or other problems you may go work on after you graduate from this
-
0:57 - 0:58class. But
-
0:58 - 1:02let's start by talking about Bayesian statistics and regularization.
-
1:02 - 1:04So you remember from last week,
-
1:04 - 1:09we started to talk about learning theory and we learned about bias and
-
1:09 - 1:12variance. And I guess in the previous lecture, we spent most of the previous
-
1:12 - 1:13lecture
-
1:13 - 1:18talking about algorithms for model selection and for
-
1:18 - 1:21feature selection. We talked about cross-validation. Right? So
-
1:21 - 1:24most of the methods we talked about in the previous lecture were
-
1:24 - 1:28ways for you to try to simplify the model. So for example,
-
1:28 - 1:32the feature selection algorithms we talked about give you a way to eliminate a number
-
1:32 - 1:33of features,
-
1:33 - 1:36so as to reduce the number of parameters you need to fit and thereby reduce
-
1:36 - 1:39overfitting. Right? You remember that? So feature
-
1:39 - 1:41selection algorithms
-
1:42 - 1:44choose a subset of the features
-
1:44 - 1:48so that you have less parameters and you may be less likely to overfit. Right?
-
1:48 - 1:53What I want to do today is to talk about a different way to prevent overfitting.
-
1:53 - 1:54And
-
1:54 - 1:58there's a method called regularization, and it's a way that lets you keep all the
-
1:58 - 1:59parameters.
-
2:00 - 2:04So here's the idea, and I'm gonna illustrate this example with,
-
2:04 - 2:07say, linear regression.
-
2:07 - 2:09So
-
2:09 - 2:13you take
-
2:13 - 2:17the linear regression model, the very first model we learned about,
-
2:17 - 2:18right,
-
2:18 - 2:22we said that we would choose the parameters
-
2:22 - 2:24via
-
2:24 - 2:30maximum likelihood.
-
2:30 - 2:32Right? And that meant that,
-
2:32 - 2:35you know, you would choose the parameters theta
-
2:35 - 2:40that maximized
-
2:40 - 2:43the probability of the data,
-
2:43 - 2:46which is parameters theta that maximized the probability of the data we observe.
-
2:46 - 2:51Right?
-
2:51 - 2:52And so
-
2:52 - 2:56to give this sort of procedure a name, this is one example of a most
-
2:56 - 3:00common frequentist procedure,
-
3:00 - 3:05and frequentist statistics, you can think of sort of as maybe one school of statistics.
-
3:05 - 3:07And the philosophical view
-
3:07 - 3:09behind writing this down was
-
3:09 - 3:13we envisioned that there was some true parameter theta out there that generated,
-
3:13 - 3:16you know, the Xs and the Ys. There's some true parameter theta
-
3:16 - 3:21that governs housing prices, Y as a function of X,
-
3:21 - 3:23and we don't know what the value of theta is,
-
3:23 - 3:28and we'd like to come up with some procedure for estimating the value of theta. Okay?
-
3:28 - 3:28And so,
-
3:28 - 3:31maximum likelihood is just one possible procedure
-
3:31 - 3:33for estimating the unknown value
-
3:33 - 3:36for theta.
-
3:36 - 3:38And
-
3:38 - 3:41the way you formulated this, you know, theta was not a random variable. Right?
-
3:41 - 3:42That's why we said,
-
3:42 - 3:45so theta is just some true value out there. It's not random or anything, we just don't
-
3:45 - 3:50know what it is, and we have a procedure called maximum likelihood for estimating the
-
3:50 - 3:55value for theta. So this is one example of what's called a frequentist procedure.
-
3:55 - 3:57The alternative to the, I
-
3:57 - 4:04guess, the frequentist school of statistics is the Bayesian school,
-
4:04 - 4:07in which
-
4:07 - 4:10we're gonna say that we don't know what theta is,
-
4:10 - 4:13and so we will put a prior
-
4:13 - 4:15on
-
4:15 - 4:18theta. Okay? So in the Bayesian school, people would say, "Well, we
-
4:18 - 4:22don't know what the value of theta is, so let's represent our uncertainty over theta with a
-
4:22 - 4:27prior."
-
4:28 - 4:32So for example,
-
4:32 - 4:35our prior on theta
-
4:35 - 4:37may be a Gaussian distribution
-
4:37 - 4:40with mean zero and covariance matrix
-
4:40 - 4:43given by tau squared I. Okay?
-
4:43 - 4:45And so -
-
4:45 - 4:52actually, if I use S to denote my training set, well - right,
-
4:56 - 5:00so this prior represents my beliefs about what the parameters are in the absence of
-
5:00 - 5:01any data. So
-
5:01 - 5:03not having seen any data, this prior
-
5:03 - 5:06represents, you know, what I think theta
-
5:06 - 5:10is most likely to be.
-
5:10 - 5:16And so given the training set, S, in the sort of Bayesian procedure, we would,
-
5:16 - 5:22well,
-
5:22 - 5:27calculate the probability, the posterior probability of my parameters
-
5:27 - 5:29given my training set, and - let's
-
5:29 - 5:30write
-
5:30 - 5:32this on the next board.
-
5:32 - 5:33So my posterior
-
5:33 - 5:37on my parameters given my training set, by Bayes' rule, this will
-
5:37 - 5:40be proportional to, you
-
5:40 - 5:47know, the likelihood times the prior: P of S given theta, times P of theta.
-
5:51 - 5:57Right? So by Bayes' rule. Let's call
-
5:57 - 6:01it posterior.
-
6:01 - 6:06And this distribution now represents my beliefs about what theta is after I've
-
6:06 - 6:08seen the training set.
-
6:08 - 6:15And when you now want to make a new prediction on the price of a new house,
-
6:20 - 6:22on the input X,
-
6:22 - 6:27I would say that, well, the distribution over the possible housing prices for
-
6:27 - 6:31this new house I'm trying to estimate the price of, say,
-
6:31 - 6:34given the size of the house, the features of the house at X, and the training
-
6:34 - 6:37set I had previously, it
-
6:37 - 6:44is going to be given by
-
6:47 - 6:50an integral over my parameters theta of
-
6:50 - 6:53probably of Y given X comma theta
-
6:53 - 6:55and times
-
6:55 - 7:00the posterior distribution of theta given the training set. Okay?
-
7:00 - 7:02And in
-
7:02 - 7:04particular, if you want your prediction to be
-
7:04 - 7:08the expected value
-
7:08 - 7:09of Y given
-
7:09 - 7:10the
-
7:10 - 7:14input X in training set, you would
-
7:14 - 7:16say integrate
-
7:16 - 7:23over Y
-
7:24 - 7:27times the posterior. Okay?
-
7:27 - 7:34You would take an expectation of Y with respect to your posterior distribution. Okay?
-
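To make these two steps concrete, here is a minimal Python sketch of the posterior-predictive computation for Bayesian linear regression, one of the few cases where the posterior has a closed form (it's Gaussian); the toy dataset and the values of sigma and tau are assumptions made up purely for illustration, and the integral over theta is approximated by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set S = (X, y): y is roughly 2*x plus noise.
X = rng.normal(size=(20, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=20)

sigma2 = 0.1 ** 2   # assumed noise variance in p(y | x, theta)
tau2 = 1.0          # assumed prior variance: theta ~ N(0, tau^2 I)

# For a Gaussian prior and Gaussian likelihood, the posterior p(theta | S)
# is itself Gaussian, with this mean and covariance (closed form).
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(1) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# Predictive mean E[y | x_new, S] = integral of (theta^T x_new) p(theta | S) dtheta,
# approximated here by Monte Carlo: draw thetas from the posterior and average.
x_new = np.array([1.5])
thetas = rng.multivariate_normal(mu_post, Sigma_post, size=5000)
print("Monte Carlo predictive mean:", np.mean(thetas @ x_new))
print("Exact predictive mean:     ", mu_post @ x_new)
```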
7:35 - 7:39And you notice that when I was writing this down, so with the Bayesian
-
7:39 - 7:42formulation, and now started to write up here Y given X
-
7:42 - 7:44comma theta because
-
7:44 - 7:48this formula now is the probability of Y conditioned on the values of the
-
7:48 - 7:52random variables X and theta. So I'm no longer writing semicolon theta, I'm writing
-
7:52 - 7:53comma theta
-
7:53 - 7:56because I'm now treating theta
-
7:56 - 8:00as a random variable. So all
-
8:00 - 8:03of this is somewhat abstract but this is
-
8:03 - 8:04- and
-
8:04 - 8:11it turns out - actually let's check. Are there questions about this? No? Okay. Let's
-
8:15 - 8:18try to make this more concrete. It turns out that
-
8:18 - 8:22for many problems,
-
8:22 - 8:26both of these steps in the computation are difficult because if,
-
8:26 - 8:29you know, theta is an N plus one-dimensional vector, is an N plus
-
8:29 - 8:31one-dimensional parameter vector,
-
8:31 - 8:35then this is an integral over an N plus one-dimensional space, you know, over R N
-
8:35 - 8:36plus one.
-
8:36 - 8:39And because numerically it's very difficult
-
8:39 - 8:41to compute integrals over
-
8:41 - 8:45very high dimensional spaces, all right? So
-
8:45 - 8:49usually this integral - actually usually it's hard to compute the posterior over theta
-
8:49 - 8:54and it's also hard to compute this integral if theta is very high dimensional. There are few
-
8:54 - 8:56exceptions for which this can be done in
-
8:56 - 8:58closed form, but for
-
8:58 - 9:00many learning algorithms, say,
-
9:00 - 9:05Bayesian logistic regression, this is hard to do.
-
9:05 - 9:10And so what's commonly done is to take the posterior distribution
-
9:10 - 9:13and instead of actually computing a full posterior distribution, P of theta
-
9:13 - 9:15given S,
-
9:15 - 9:17we'll instead
-
9:17 - 9:19take this quantity on the right-hand side
-
9:19 - 9:24and just maximize this quantity on the right-hand side. So let me write this down.
-
9:24 - 9:27So
-
9:27 - 9:30commonly, instead of computing the full posterior distribution,
-
9:30 - 9:37we will choose the following. Okay?
-
9:42 - 9:47We will choose what's called the MAP estimate, or the maximum a posteriori
-
9:47 - 9:50estimate of theta, which is the most likely value of theta,
-
9:50 - 9:54most probable value of theta under your posterior distribution.
-
9:54 - 9:56And that's just
-
9:56 - 10:03arg max over theta of P
-
10:08 - 10:12of theta given S.
-
10:12 - 10:19And then when you need to make a prediction,
-
10:21 - 10:28you know, you would just predict, say, well,
-
10:36 - 10:38using your usual hypothesis
-
10:38 - 10:41and using this MAP value of theta
-
10:41 - 10:42in place of
-
10:42 - 10:47- as the parameter vector you'd choose. Okay? And
-
10:47 - 10:51notice, the only difference between this and standard maximum likelihood estimation
-
10:51 - 10:53is that when you're choosing,
-
10:53 - 10:57you know, the - instead of choosing the maximum likelihood value for
-
10:57 - 11:01theta, you're instead maximizing this likelihood term, which is what you have for maximum likelihood estimation,
-
11:01 - 11:05and then times this other quantity which is the prior.
-
11:05 - 11:07Right?
-
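As a concrete sketch of the MAP idea (an illustration, not something from the lecture board): for linear regression with a Gaussian prior N(0, tau squared I) and Gaussian noise, maximizing the posterior works out to ridge regression, with lambda playing the role of sigma squared over tau squared, and theta MAP has a closed form. The toy data below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 10 examples, 5 features, only the first feature matters.
X = rng.normal(size=(10, 5))
theta_true = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ theta_true + 0.5 * rng.normal(size=10)

# Maximum likelihood (least squares): argmax over theta of p(S | theta).
theta_ml = np.linalg.lstsq(X, y, rcond=None)[0]

# MAP with Gaussian prior theta ~ N(0, tau^2 I): for Gaussian noise this
# is ridge regression with lambda = sigma^2 / tau^2, and shrinking tau
# (growing lambda) pulls all the parameters toward zero.
lam = 10.0  # plays the role of sigma^2 / tau^2 (made up for illustration)
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

print("ML estimate: ", np.round(theta_ml, 2))
print("MAP estimate:", np.round(theta_map, 2))  # smaller, closer to zero
```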
11:07 - 11:10And
-
11:10 - 11:11let's see,
-
11:11 - 11:17One intuition is that if your prior
-
11:17 - 11:21is
-
11:21 - 11:24theta being Gaussian and with mean
-
11:24 - 11:26zero and some covariance,
-
11:26 - 11:31then for a distribution like this, most of the probability mass is close to zero. Right?
-
11:31 - 11:34So there's a Gaussian centered around the point zero, and so most of the probability mass is close to zero.
-
11:38 - 11:41And so the prior distribution is, in essence, saying that
-
11:41 - 11:45you think most of the parameters should be close to
-
11:45 - 11:49zero, and if you remember our discussion on feature selection, if you eliminate a
-
11:49 - 11:51feature from consideration
-
11:51 - 11:53that's the same as
-
11:53 - 11:57setting the corresponding value of theta to be equal to zero. All right? So if you
-
11:57 - 11:58set
-
11:58 - 12:02theta five to be equal to zero, that's the same as, you know, eliminating feature
-
12:02 - 12:05five from your hypothesis.
-
12:05 - 12:10And so, this is the prior that drives most of the parameter values to zero
-
12:10 - 12:13- to values close to zero. And you'll think of this as
-
12:13 - 12:17doing something analogous to, or reminiscent of, feature selection. Okay? And
-
12:17 - 12:22it turns out that with this formulation, the parameters won't actually be
-
12:22 - 12:25exactly zero but many of the values will be close to zero.
-
12:25 - 12:27And
-
12:27 - 12:32I guess in pictures,
-
12:35 - 12:36if you remember,
-
12:36 - 12:40I said that if you have, say, five data points and you fit
-
12:40 - 12:47a fourth-order polynomial -
-
12:48 - 12:51well I think that had too many bumps in it, but never mind.
-
12:51 - 12:54If you fit it a - if you fit very high polynomial to a
-
12:54 - 12:58very small dataset, you can get these very large oscillations
-
12:58 - 13:01if you use maximum likelihood estimation. All right?
-
13:01 - 13:06In contrast, if you apply this sort of Bayesian regularization,
-
13:06 - 13:08you can actually fit a higher-order polynomial
-
13:08 - 13:11and still get
-
13:11 - 13:13sort of a smoother and smoother fit to the data
-
13:13 - 13:18as you decrease tau. So as you decrease tau, you're driving the parameters to be closer and closer
-
13:18 - 13:20to zero. And that
-
13:20 - 13:23in practice - it's sort of hard to see, but you can take my word for it.
-
13:23 - 13:25As tau becomes smaller and smaller,
-
13:25 - 13:29the curves you tend to fit your data also become smoother and smoother, and so you
-
13:29 - 13:30tend
-
13:30 - 13:36to overfit less and less, even when you're fitting a large number of
-
13:36 - 13:38parameters.
-
13:38 - 13:41Okay? Let's see,
-
13:41 - 13:45and
-
13:45 - 13:48one last piece of intuition that I would just toss out there. And
-
13:48 - 13:50you get to play more with
-
13:50 - 13:53this particular set of ideas more in Problem Set
-
13:53 - 13:553, which I'll post online later
-
13:55 - 13:57this week I guess.
-
13:57 - 14:02Is that whereas maximum likelihood tries to minimize,
-
14:02 - 14:09say, this,
-
14:12 - 14:16right? Whereas maximum likelihood for, say, linear regression turns out to be minimizing
-
14:16 - 14:17this,
-
14:17 - 14:19it turns out that if you
-
14:19 - 14:21add this prior term there,
-
14:21 - 14:26it turns out that the optimization objective you end up optimizing
-
14:26 - 14:31turns out to be that: the same sum plus an extra term that, you know,
-
14:31 - 14:34penalizes your parameter theta for being large.
-
14:35 - 14:39And so this ends up being an algorithm that's very similar to maximum likelihood, except that you tend to
-
14:39 - 14:41keep your parameters small.
-
14:41 - 14:43And this has the effect.
-
14:43 - 14:47Again, it's kind of hard to see but just take my word for it. That shrinking the parameters has the effect of
-
14:47 - 14:48keeping the functions you fit
-
14:48 - 14:55to be smoother and less likely to overfit. Okay? Okay, hopefully this will
-
14:58 - 15:02make more sense when you play with these ideas a bit more in the next problem set. But let's check
-
15:02 - 15:09questions about all this.
-
15:13 - 15:16The smoothing behavior is it because [inaudible] actually get different [inaudible]?
-
15:19 - 15:23Let's see. Yeah. It depends on - well
-
15:23 - 15:28most priors with most of the mass close to zero will get this effect, I guess. And just by
-
15:28 - 15:29convention, the Gaussian
-
15:29 - 15:32prior is what's most -
-
15:32 - 15:34used most commonly for models like logistic
-
15:34 - 15:38regression and linear regression, generalized linear models. There are a few
-
15:38 - 15:41other priors that are sometimes used, like the Laplace prior,
-
15:41 - 15:48but all of them will tend to have these sorts of smoothing effects. All right. Cool.
-
15:50 - 15:53And so it turns out that for
-
15:53 - 15:57problems like text classification - text classification has like 30,000 features or 50,000
-
15:57 - 16:00features,
-
16:00 - 16:03where it seems like an algorithm like logistic regression would be very much prone to
-
16:03 - 16:04overfitting. Right?
-
16:04 - 16:06So imagine trying to build a spam classifier,
-
16:06 - 16:10maybe you have 100 training examples but you have 30,000
-
16:10 - 16:13features or 50,000 features,
-
16:13 - 16:17that seems clearly to be prone to overfitting. Right? But it turns out that with
-
16:17 - 16:18this sort of Bayesian
-
16:18 - 16:19regularization,
-
16:19 - 16:22with the Gaussian prior,
-
16:22 - 16:27logistic regression becomes a very effective text classification algorithm
-
16:27 - 16:33with this sort of Bayesian regularization. Alex? [Inaudible]? Yeah,
-
16:33 - 16:37right, and so pick - and to pick either tau squared or lambda.
-
16:37 - 16:40I think the relation is lambda equals one over tau squared. But right, so pick either
-
16:40 - 16:42tau squared or lambda, you could
-
16:42 - 16:49use cross-validation, yeah. All right? Okay, cool.
-
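Here is a minimal sketch of picking lambda (equivalently tau) by hold-out cross-validation, as just suggested; the candidate grid, the ridge model, and the synthetic data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=60)

# Simple hold-out split: train on 40 examples, validate on 20.
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    # MAP / ridge closed form; lam corresponds to 1 / tau^2 (up to the noise scale).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Pick the lambda with the lowest validation error from a made-up grid.
best = min((np.mean((Xva @ ridge_fit(Xtr, ytr, lam) - yva) ** 2), lam)
           for lam in [0.01, 0.1, 1.0, 10.0, 100.0])
print("validation MSE %.4f at lambda = %g" % best)
```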
16:49 - 16:52So, all right, that
-
16:52 - 16:55was all I want to say about methods for preventing
-
16:55 - 16:56overfitting. What I
-
16:56 - 17:01want to do next is just spend, you know, five minutes talking about
-
17:01 - 17:02online learning.
-
17:02 - 17:05And this is sort of a digression.
-
17:05 - 17:06And so, you know, when
-
17:06 - 17:10you're designing the syllabus of a class, I guess, sometimes
-
17:10 - 17:13there are just some ideas you want to talk about but can't find a very good place to
-
17:13 - 17:16fit in anywhere. So this is one of those ideas that may
-
17:16 - 17:21seem a bit disjointed from the rest of the class but I just
-
17:21 - 17:26want to
-
17:26 - 17:30tell
-
17:30 - 17:34you a little bit about it. Okay. So here's the idea.
-
17:34 - 17:38So far, all the learning algorithms we've talked about are what's called batch learning
-
17:38 - 17:39algorithms,
-
17:39 - 17:42where you're given a training set and then you get to run your learning algorithm on the
-
17:42 - 17:47training set and then maybe you test it on some other test set.
-
17:47 - 17:50And there's another learning setting called online learning,
-
17:50 - 17:54in which you have to make predictions even while you are in the process of learning.
-
17:54 - 17:58So here's how the problem goes.
-
17:58 - 17:59All right? I'm first gonna give you X one.
-
17:59 - 18:01Let's say there's a classification problem,
-
18:01 - 18:03so I'm first gonna give you
-
18:03 - 18:07X one and then gonna ask you, you know, "Can you make a prediction on X one? Is the label one or zero?"
-
18:07 - 18:10And you've not seen any data yet.
-
18:10 - 18:12And so, you make a guess. Right? You
-
18:12 - 18:13guess -
-
18:13 - 18:15we'll call your guess Y hat one.
-
18:15 - 18:18And after you've made your prediction, I
-
18:18 - 18:22will then reveal to you the true label Y one. Okay? And
-
18:22 - 18:26not having seen any data before, your odds of getting the first one right are only
-
18:26 - 18:2950 percent, right, if you guess randomly.
-
18:29 - 18:30
-
18:30 - 18:35And then I show you X two. And then I ask you, "Can you make a prediction on X two?" And
-
18:35 - 18:36so you now maybe
-
18:36 - 18:40are gonna make a slightly more educated guess and call that Y hat two.
-
18:40 - 18:45And after you've made your guess, I reveal the true label to you.
-
18:45 - 18:50And so, then I show you X three, and then you make
-
18:50 - 18:53your guess, and learning proceeds as follows.
-
18:53 - 18:55So this is just unlike
-
18:55 - 18:59machine learning in the batch setting; this models settings where
-
18:59 - 19:04you have to keep learning even as you're making predictions,
-
19:04 - 19:08okay? So, I don't know, say you're running a website and you have users coming in. And as the first user
-
19:08 - 19:11comes in, you need to start making predictions already about what
-
19:11 - 19:13the user likes or dislikes. And only, you know,
-
19:13 - 19:18as you're making predictions do you get to see more and more training examples.
-
19:18 - 19:25So in online learning what you care about is the total online error,
-
19:27 - 19:28
-
19:28 - 19:33which is the sum from i equals one to m - if you get a sequence of m examples
-
19:33 - 19:34all together,
-
19:34 - 19:38of indicator Y hat i
-
19:38 - 19:41not equal to Y i. Okay?
-
19:41 - 19:44So the total online error is
-
19:44 - 19:46the total number of mistakes you make
-
19:46 - 19:49on a sequence of examples like this.
-
19:49 - 19:52And
-
19:52 - 19:54it turns out that, you know,
-
19:54 - 19:58many of the learning algorithms you have - well, many of the learning algorithms you've
-
19:58 - 20:01learned about can be applied to this setting. One thing
-
20:01 - 20:03you could do is
-
20:03 - 20:04when you're asked
-
20:04 - 20:06to make the prediction Y hat three,
-
20:06 - 20:10right, one simple thing to do is well you've seen some other training examples
-
20:10 - 20:14up to this point so you can just take your learning algorithm and run it
-
20:14 - 20:15on the examples,
-
20:15 - 20:19you know, leading up to Y hat three. So just run the learning algorithm on all
-
20:19 - 20:21the examples you've seen previous
-
20:21 - 20:25to being asked to make a prediction on a certain example, and then use your
-
20:25 - 20:27learning
-
20:27 - 20:31algorithm to make a prediction on the next example. And it turns out that there are also algorithms, especially the algorithms that we
-
20:31 - 20:34saw that use stochastic gradient descent, that,
-
20:34 - 20:41you know, can be adapted very nicely to this. So as a concrete example, if you
-
20:42 - 20:46remember the perceptron algorithms, say, right,
-
20:46 - 20:47you would
-
20:47 - 20:52say initialize the parameters theta to be equal to zero.
-
20:52 - 20:56And then after seeing the Ith training
-
20:56 - 20:59example, you'd
-
20:59 - 21:06update the parameters, you
-
21:14 - 21:15know,
-
21:15 - 21:20using - you've seen this rule a lot of times now, right, using the standard
-
21:20 - 21:22perceptron learning rule. And
-
21:22 - 21:26the same thing, if you were using logistic regression you can then,
-
21:26 - 21:30again, after seeing each training example, just run, you know, essentially run
-
21:30 - 21:34one-step stochastic gradient descent
-
21:34 - 21:38on just the example you saw. Okay?
-
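Here is a minimal sketch of this online setting with the perceptron rule, counting the total online error as we go; the synthetic separable stream, the {0, 1} label convention, and the learning rate of one are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linearly separable stream: label is 1 when w_true . x > 0.
w_true = rng.normal(size=3)
X = rng.normal(size=(200, 3))
Y = (X @ w_true > 0).astype(int)

theta = np.zeros(3)          # initialize the parameters to zero
total_online_error = 0

for x, y in zip(X, Y):
    y_hat = int(theta @ x >= 0)         # predict first...
    total_online_error += (y_hat != y)  # ...then the true label is revealed
    theta += (y - y_hat) * x            # perceptron update on this one example

print("mistakes on 200 examples:", total_online_error)
```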
21:38 - 21:40And so
-
21:40 - 21:41the reason I've
-
21:41 - 21:45put this into the sort of "learning theory" section of this class was because
-
21:45 - 21:49it turns out that sometimes you can prove fairly amazing results
-
21:49 - 21:52on your total online error
-
21:52 - 21:54using algorithms like these.
-
21:54 - 21:58I will actually - I don't actually want to spend the time in the main
-
21:58 - 21:59lecture to prove
-
21:59 - 22:02this, but, for example, you can prove that
-
22:02 - 22:05when you use the perceptron algorithm,
-
22:05 - 22:08then even when
-
22:08 - 22:12the features X i
-
22:12 - 22:15may be infinite-dimensional feature vectors, like we saw for
-
22:15 - 22:19support vector machines. And sometimes, infinite-dimensional feature vectors
-
22:19 - 22:21may use kernel representations. Okay? But so it
-
22:21 - 22:22turns out that you can prove that
-
22:22 - 22:25when you use the perceptron algorithm,
-
22:25 - 22:28even when the data is maybe
-
22:28 - 22:29extremely high dimensional and it
-
22:29 - 22:33seems like you'd be prone to overfitting, right, you can prove that so
-
22:33 - 22:37long as the positive and negative examples
-
22:37 - 22:39
-
22:39 - 22:42are separated by a margin,
-
22:42 - 22:43right. So in this
-
22:43 - 22:48infinite dimensional space,
-
22:48 - 22:50so long as,
-
22:50 - 22:53you know, there is some margin
-
22:53 - 22:57down there separating the positive and negative examples,
-
22:57 - 22:59you can prove that
-
22:59 - 23:02perceptron algorithm will converge
-
23:02 - 23:07to a hypothesis that perfectly separates the positive and negative examples.
-
23:07 - 23:11Okay? And then so after seeing only a finite number of examples, it'll
-
23:11 - 23:14converge to a decision boundary that perfectly separates the
-
23:14 - 23:19positive and negative examples, even though you may be in an infinite-dimensional space. Okay?
-
23:19 - 23:21So
-
23:21 - 23:22let's see.
-
23:22 - 23:26The proof itself would take me sort of almost an entire lecture to do,
-
23:26 - 23:27and there are sort of
-
23:27 - 23:29other things that I want to do more than that.
-
23:29 - 23:33So if you want to see the proof of this yourself, it's actually written up in the lecture
-
23:33 - 23:35notes that I posted online.
-
23:35 - 23:38For the purposes of this class' syllabus, the proof of this
-
23:38 - 23:41result, you can treat this as optional reading. And by that, I mean,
-
23:41 - 23:45you know, it won't appear on the midterm and you won't be asked about this
-
23:45 - 23:47specifically in the problem sets,
-
23:47 - 23:49but I thought it'd be -
-
23:49 - 23:52I know some of you are curious after the previous lecture about why
-
23:52 - 23:54you can prove
-
23:54 - 23:55that, you know, SVMs can have
-
23:55 - 23:59bounded VC dimension, even in these infinite dimensional spaces, and how do
-
23:59 - 24:01you prove things in these -
-
24:01 - 24:04how do you prove learning theory results in these infinite dimensional feature spaces. And
-
24:04 - 24:05so
-
24:05 - 24:08the perceptron bound that I just talked about was the simplest instance I
-
24:08 - 24:11know of that you can sort of read in like half an hour and understand
-
24:11 - 24:12it.
-
24:12 - 24:15So if you're
-
24:15 - 24:16interested, there are lecture notes online for
-
24:16 - 24:21how this perceptron bound is actually proved. It's a very
-
24:22 - 24:26[inaudible], you can prove it in like a page or so, so go ahead and take a look at that if you're interested. Okay? But
-
24:26 - 24:28regardless of the theoretical results,
-
24:28 - 24:31you know, the online learning setting is something
-
24:31 - 24:33that you - that comes reasonably often. And so,
-
24:33 - 24:39these algorithms based on stochastic gradient descent often work very well. Okay, any
-
24:39 - 24:46questions about this before I move on?
-
24:50 - 24:55All right. Cool. So the last thing I want to do today - and this is the majority of today's lecture - actually, can I switch to
-
24:55 - 24:56PowerPoint slides, please,
-
24:56 - 25:00is I actually want to spend most of today's lecture sort of talking about
-
25:00 - 25:05advice for applying different machine learning
-
25:05 - 25:11algorithms.
-
25:11 - 25:12And so,
-
25:12 - 25:14you know, right now, already you have a,
-
25:14 - 25:16I think, a good understanding of
-
25:16 - 25:19really the most powerful tools
-
25:19 - 25:21known to humankind in machine learning.
-
25:21 - 25:22Right? And
-
25:22 - 25:27what I want to do today is give you some advice on how to apply them really
-
25:27 - 25:28powerfully because,
-
25:28 - 25:30you know, the same tool - it turns out that you can
-
25:30 - 25:33take the same machine learning tool, say logistic regression,
-
25:33 - 25:37and you can ask two different people to apply it to the same problem.
-
25:37 - 25:38And
-
25:38 - 25:41sometimes one person will do an amazing job and it'll work amazingly well, and the second
-
25:41 - 25:43person will sort of
-
25:43 - 25:47not really get it to work, even though it was exactly the same algorithm. Right?
-
25:47 - 25:51And so what I want to do today, in the rest of the time I have today, is try
-
25:51 - 25:52to
-
25:52 - 25:55convey to you, you know, some of the methods for how to
-
25:55 - 25:57make sure you're one of the people who
-
25:57 - 26:04really know how to get these learning algorithms to work well on problems. So
-
26:05 - 26:07just some caveats on what I'm gonna, I
-
26:07 - 26:09guess, talk about in the
-
26:09 - 26:11rest of today's lecture.
-
26:12 - 26:16Something I want to talk about is actually not very mathematical but is also
-
26:16 - 26:18some of the hardest,
-
26:18 - 26:18most
-
26:18 - 26:22conceptually difficult material in this class to understand. All right? So this
-
26:22 - 26:23is
-
26:23 - 26:25not mathematical but this is not easy.
-
26:25 - 26:30And I want to say this caveat: some of what I'll say today is debatable.
-
26:30 - 26:33I think most good machine learning people will agree with most of what I say but maybe not
-
26:33 - 26:36everything I say.
-
26:36 - 26:39And some of what I'll say is also not good advice for doing machine learning research either, so I'll say more about
-
26:39 - 26:42this later. What I'm
-
26:42 - 26:44focusing on today is advice for
-
26:44 - 26:47how to just get stuff to work. If you work in the company and you want to deliver a
-
26:47 - 26:49product or you're, you know,
-
26:49 - 26:53building a system and you just want your machine learning system to work. Okay? Some of what I'm about
-
26:53 - 26:54to say today
-
26:54 - 26:58isn't great advice if your goal is to invent a new machine learning algorithm, but
-
26:58 - 26:59this is advice for how to
-
26:59 - 27:02make machine learning algorithm work and, you know, and
-
27:02 - 27:05deploy a working system. So three
-
27:05 - 27:09key areas I'm gonna talk about. One: diagnostics for
-
27:09 - 27:12debugging learning algorithms. Second: sort of
-
27:12 - 27:17talk briefly about error analysis and ablative analysis.
-
27:17 - 27:21And third, I want to talk about just advice for how to get started on a machine-learning
-
27:23 - 27:24problem.
-
27:24 - 27:27And one theme that'll come up later is it
-
27:27 - 27:31turns out you've heard about premature optimization, right,
-
27:31 - 27:34in writing software. This is when
-
27:34 - 27:38someone over-designs from the start, when someone, you know, is writing a piece of code and
-
27:38 - 27:41they choose a subroutine to optimize
-
27:41 - 27:45heavily. And maybe you write the subroutine in assembly or something. And that's often
-
27:45 - 27:49- and many of us have been guilty of premature optimization, where
-
27:49 - 27:52we're trying to get a piece of code to run faster. And
-
27:52 - 27:55we choose probably a piece of code and we implement it in assembly, and
-
27:55 - 27:57really tune it and get it to run really quickly. And
-
27:57 - 28:01it turns out that wasn't the bottleneck in the code at all. Right? And we call that premature
-
28:01 - 28:02optimization.
-
28:02 - 28:06And in undergraduate programming classes, we warn people all the time not to do
-
28:06 - 28:10premature optimization and people still do it all the time. Right?
-
28:10 - 28:12And
-
28:12 - 28:16turns out, a very similar thing happens in building machine-learning systems. That
-
28:16 - 28:21many people are often guilty of, what I call, premature statistical optimization, where they
-
28:21 - 28:22
-
28:22 - 28:26heavily optimize part of a machine learning system and that turns out
-
28:26 - 28:28not to be the important piece. Okay?
-
28:28 - 28:30So I'll talk about that later, as well.
-
28:30 - 28:36So let's first talk about debugging learning algorithms.
-
28:36 - 28:38
-
28:38 - 28:40
-
28:40 - 28:43As a motivating
-
28:43 - 28:47example, let's say you want to build an anti-spam system. And
-
28:47 - 28:52let's say you've carefully chosen, you know, a small set of 100 words to use as features. All right?
-
28:52 - 28:54So instead of using 50,000 words, you've chosen a small set of 100
-
28:54 - 28:55features
-
28:55 - 28:59to use for your anti-spam system.
-
28:59 - 29:02And let's say you implement Bayesian logistic regression, implement gradient
-
29:02 - 29:03descent,
-
29:03 - 29:07and you get 20 percent test error, which is unacceptably high. Right?
-
29:07 - 29:10So this
-
29:10 - 29:14is Bayesian logistic regression, and so it's just like maximum likelihood but, you know, with that additional lambda theta
-
29:14 - 29:15squared term.
-
29:15 - 29:20And we're maximizing rather than minimizing as well, so there's a minus lambda
-
29:20 - 29:24theta squared instead of plus lambda theta squared.
-
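As a sketch of the objective being described (with made-up data and my own variable names, purely for illustration), maximizing J of theta by gradient ascent looks something like this:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))   # e.g., 100 emails, 10 word features (toy)
y = (X[:, 0] > 0).astype(float)  # toy labels

lam = 1.0
theta = np.zeros(10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta):
    # Regularized log-likelihood: sum_i log p(y_i | x_i; theta) - lam * ||theta||^2.
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) - lam * theta @ theta

# Gradient *ascent* on J (we maximize, hence the minus sign on the lambda term).
alpha = 0.01
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ theta)) - 2 * lam * theta
    theta += alpha * grad

print("J(theta) after training: %.2f" % J(theta))
```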
29:24 - 29:28So the question is: you implemented your Bayesian logistic regression algorithm,
-
29:28 - 29:34and you tested it on your test set and you got unacceptably high error, so what do you do
-
29:34 - 29:35next? Right?
-
29:35 - 29:37So,
-
29:37 - 29:41you know, one thing you could do is think about the ways you could improve this algorithm.
-
29:41 - 29:45And this is probably what most people will do, instead of, "Well, let's sit down and think about what
-
29:45 - 29:48could've gone wrong, and then we'll try to improve the algorithm."
-
29:48 - 29:51Well obviously having more training data could only help, so one thing you can do is try to
-
29:51 - 29:55get more training examples.
-
29:55 - 29:59Maybe you suspect that even 100 features was too many, so you might try to
-
29:59 - 30:01get a smaller set of
-
30:01 - 30:05features. What's more common is you might suspect your features aren't good enough, so you might
-
30:05 - 30:07spend some time, look at the email headers, see if
-
30:07 - 30:09you can figure out better features for, you know,
-
30:09 - 30:13finding spam emails or whatever.
-
30:13 - 30:14Right. And
-
30:14 - 30:18right, so you
-
30:18 - 30:21just sit around and come up with better features, such as from email headers.
-
30:21 - 30:25You may also suspect that gradient descent hasn't quite converged yet, and so let's try
-
30:25 - 30:28running gradient descent a bit longer to see if that works. And clearly, that can't hurt, right,
-
30:28 - 30:29just run
-
30:29 - 30:31gradient descent longer.
-
30:31 - 30:36Or maybe you remember, you know, you remember hearing from class that maybe
-
30:36 - 30:39Newton's method converges better, so let's
-
30:39 - 30:40try
-
30:40 - 30:42that instead. You may want to tune the value for lambda, because not
-
30:42 - 30:44sure if that was the right thing,
-
30:44 - 30:47or maybe you even want to try an SVM because maybe you think an SVM might work better than logistic regression. So I only
-
30:47 - 30:50listed eight things
-
30:50 - 30:55here, but you can imagine if you were actually sitting down, building machine-learning
-
30:55 - 30:56system,
-
30:56 - 30:58the options to you are endless. You can think of, you
-
30:58 - 31:01know, hundreds of ways to improve a learning system.
-
31:01 - 31:02And some of these things like,
-
31:02 - 31:06well getting more training examples, surely that's gonna help, so that seems like it's a good
-
31:06 - 31:09use of your time. Right?
-
31:09 - 31:11And it turns out that
-
31:11 - 31:15this approach of picking ways to improve the learning algorithm and picking one and going
-
31:15 - 31:16for it,
-
31:16 - 31:21it might work in the sense that it may eventually get you to a working system, but
-
31:21 - 31:25often it's very time-consuming. And I think it's often largely a matter of
-
31:25 - 31:28luck, whether you end up fixing what the problem is.
-
31:28 - 31:29In particular, these
-
31:29 - 31:33eight improvements all fix very different problems.
-
31:33 - 31:37And some of them will be fixing problems that you don't have. And
-
31:37 - 31:40if you can rule out six of
-
31:40 - 31:44eight of these, say, you could - if by somehow looking at the problem more deeply,
-
31:44 - 31:47you can figure out which one of these eight things is actually the right thing
-
31:47 - 31:48to do,
-
31:48 - 31:49you can save yourself
-
31:49 - 31:51a lot of time.
-
31:51 - 31:56So let's see how we can go about doing that.
-
31:56 - 32:02The people in industry and in research that I see that are really good, would not
-
32:02 - 32:05go and try to change a learning algorithm randomly. There are lots of things that
-
32:05 - 32:08obviously improve your learning algorithm,
-
32:08 - 32:12but the problem is there are so many of them it's hard to know what to do.
-
32:12 - 32:17So you find that all the really good ones run various diagnostics to figure out
-
32:17 - 32:18what the problem is,
-
32:18 - 32:22and where they think the problem is. Okay?
-
32:22 - 32:24So
-
32:24 - 32:27for our motivating story, right, we said - let's say Bayesian logistic regression test
-
32:27 - 32:29error was 20 percent,
-
32:29 - 32:32which let's say is unacceptably high.
-
32:32 - 32:35And let's suppose you suspected the problem is
-
32:35 - 32:36either overfitting,
-
32:36 - 32:38so it's high bias,
-
32:38 - 32:42or you suspect that, you know, maybe you have too few features to classify spam, so there's - Oh excuse
-
32:42 - 32:45me; I think I
-
32:45 - 32:47wrote that wrong.
-
32:47 - 32:48Let's firstly - so let's
-
32:48 - 32:49forget - forget the labels.
-
32:49 - 32:53Suppose you suspect the problem is either high bias or high variance, and some of the text
-
32:53 - 32:55here
-
32:55 - 32:55doesn't make sense. And
-
32:55 - 32:56you want to know
-
32:56 - 33:01if you're overfitting, which would be high variance, or you have too few
-
33:01 - 33:06features to classify spam, which would be high bias. I had those two reversed, sorry. Okay? So
-
33:06 - 33:09how can you figure out whether the problem
-
33:09 - 33:11is one of high bias
-
33:11 - 33:16or high variance? Right? So it turns
-
33:16 - 33:19out there's a simple diagnostic you can look at that will tell you
-
33:19 - 33:24whether the problem is high bias or high variance. If you
-
33:24 - 33:28remember the cartoon we'd seen previously for high variance problems, when you have high
-
33:28 - 33:30variance
-
33:30 - 33:33the training error will be much lower than the test error. All right? When you
-
33:33 - 33:36have a high variance problem, that's when you're fitting
-
33:36 - 33:39your training set very well. That's when you're fitting, you know, a tenth order polynomial to
-
33:39 - 33:4211 data points. All right? And
-
33:42 - 33:45that's when you're just fitting the data set very well, and so your training error will be
-
33:45 - 33:46much lower than
-
33:46 - 33:48your test
-
33:48 - 33:50error. And in contrast, if you have high bias,
-
33:50 - 33:53that's when your training error will also be high. Right?
-
33:53 - 33:56That's when your data is quadratic, say, but you're fitting a linear function to it
-
33:56 - 34:02and so you aren't even fitting your training set well. So
-
34:02 - 34:04just in cartoons, I guess,
-
34:04 - 34:08this is a - this is what a typical learning curve for high variance looks
-
34:08 - 34:09like.
-
34:09 - 34:14On your horizontal axis, I'm plotting the training set size M, right,
-
34:14 - 34:16and on vertical axis, I'm plotting the error.
-
34:16 - 34:19And so, let's see,
-
34:19 - 34:21you know, as you increase -
-
34:21 - 34:25if you have a high variance problem, you'll notice as the training set size, M,
-
34:25 - 34:29increases, your test set error will keep on decreasing.
-
34:29 - 34:33And so this sort of suggests that, well, if you can increase the training set size even
-
34:33 - 34:36further, maybe if you extrapolate the green curve out, maybe
-
34:36 - 34:40that test set error will decrease even further. All right?
-
34:40 - 34:43Another thing that's useful to plot here is - let's say
-
34:43 - 34:47the red horizontal line is the desired performance
-
34:47 - 34:50you're trying to reach, another useful thing to plot is actually the training error. Right?
-
34:50 - 34:52And it turns out that
-
34:52 - 34:59your training error will actually grow as a function of the training set size
-
34:59 - 35:02because the larger your training set,
-
35:02 - 35:04the harder it is to fit,
-
35:04 - 35:06you know, your training set perfectly. Right?
-
35:06 - 35:09So this is just a cartoon, don't take it too seriously, but in general, your training error
-
35:09 - 35:11will actually grow
-
35:11 - 35:15as a function of your training set size. Because with small training sets, if you have one data point,
-
35:15 - 35:18it's really easy to fit that perfectly, but if you have
-
35:18 - 35:2210,000 data points, it's much harder to fit that perfectly.
-
35:22 - 35:23All right?
-
35:23 - 35:28And so another diagnostic for high variance, and the one that I tend to use more,
-
35:28 - 35:32is to just look at training versus test error. And if there's a large gap between
-
35:32 - 35:33them,
-
35:33 - 35:34then this suggests that, you know,
-
35:34 - 35:40getting more training data may allow you to help close that gap. Okay?
-
35:40 - 35:41So this is
-
35:41 - 35:42what the
-
35:42 - 35:45cartoon would look like when - in the
-
35:45 - 35:49case of high variance.
-
35:49 - 35:53This is what the cartoon looks like for high bias. Right? If you
-
35:53 - 35:55look at the learning curve, you
-
35:55 - 35:57see that the curve for test error
-
35:57 - 36:01has flattened out already. And so this is a sign that,
-
36:01 - 36:05you know, if you get more training examples, if you extrapolate this curve
-
36:05 - 36:07further to the right,
-
36:07 - 36:10it's maybe not likely to go down much further.
-
36:10 - 36:12And this is a property of high bias: that getting more training data won't
-
36:12 - 36:16necessarily help.
-
36:16 - 36:19But again, to me the more useful diagnostic is
-
36:19 - 36:20if you plot
-
36:20 - 36:24training error as well - if you look at your training error as well as your, you know,
-
36:24 - 36:26hold out test set error.
-
36:26 - 36:29If you find that even your training error
-
36:29 - 36:32is high,
-
36:32 - 36:35then that's a sign that getting more training data is not
-
36:35 - 36:38going to help. Right?
-
36:38 - 36:42In fact, you know, think about it,
-
36:42 - 36:45training error
-
36:45 - 36:48grows as a function of your training set size.
-
36:48 - 36:50And so if your
-
36:50 - 36:56training error is already above your level of desired performance,
-
36:56 - 36:57then
-
36:57 - 37:01getting even more training data is not going to reduce your training
-
37:01 - 37:03error down to the desired level of performance. Right?
-
37:03 - 37:06Because, you know, your training error sort of only gets worse as you get more and more training
-
37:06 - 37:08examples.
-
37:08 - 37:11So if you extrapolate further to the right, it's not like this blue line will come
-
37:11 - 37:13back down to the level of desired performance. Right? This will stay up
-
37:13 - 37:17there. Okay? So for
-
37:17 - 37:21me personally, I actually, when looking at a curve like the green
-
37:21 - 37:25curve on test error, I actually personally tend to find it very difficult to tell
-
37:25 - 37:29if the curve is still going down or if it's flattened out. Sometimes you can tell, but very
-
37:29 - 37:31often, it's somewhat
-
37:31 - 37:33ambiguous. So for me personally,
-
37:33 - 37:37the diagnostic I tend to use the most often to tell if I have a bias problem or a variance
-
37:37 - 37:38problem
-
37:38 - 37:41is to look at training and test error and see if they're very close together or if they're relatively far apart. Okay?
-
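Here is a minimal sketch of that diagnostic: fit a deliberately high-variance model (a tenth-order polynomial, chosen just for illustration) at increasing training set sizes and compare training error to hold-out error. A large gap suggests high variance; a training error that is itself above your target suggests high bias:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + 0.2 * rng.normal(size=300)
x_test, y_test = x[200:], y[200:]   # hold out the last 100 points

def poly_mse(xtr, ytr, xte, yte, degree=10):
    # Fit a degree-10 polynomial: a high-variance model when m is small.
    c = np.polyfit(xtr, ytr, degree)
    return (np.mean((np.polyval(c, xtr) - ytr) ** 2),
            np.mean((np.polyval(c, xte) - yte) ** 2))

for m in [15, 30, 60, 120, 200]:
    tr, te = poly_mse(x[:m], y[:m], x_test, y_test)
    # Large gap between tr and te -> high variance; high tr itself -> high bias.
    print("m=%3d  train error %.4f  test error %.4f" % (m, tr, te))
```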
37:41 - 37:45And so, going
-
37:45 - 37:47back to
-
37:47 - 37:52the list of fixes, look
-
37:52 - 37:54at the first fix,
-
37:54 - 37:56getting more training examples
-
37:56 - 37:59is a way to fix high variance.
-
37:59 - 38:03Right? If you have a high variance problem, getting more training examples will help.
-
38:03 - 38:06Trying a smaller set of features:
-
38:06 - 38:12that also fixes high variance. All right?
-
38:12 - 38:16Trying a larger set of features or adding email features, these
-
38:16 - 38:20are solutions that fix high bias. Right?
-
38:20 - 38:27So high bias being if you're hypothesis was too simple, you didn't have enough features. Okay?
-
38:27 - 38:29And so
-
38:29 - 38:34quite often you see people working on machine learning problems
-
38:34 - 38:35and
-
38:35 - 38:38they'll remember that getting more training examples helps. And
-
38:38 - 38:41so, they'll build a learning system, build an anti-spam system and it doesn't work.
-
38:41 - 38:42And then they
-
38:42 - 38:46go off and spend lots of time and money and effort collecting more training data
-
38:46 - 38:51because they'll say, "Oh well, getting more data's obviously got to help."
-
38:51 - 38:53But if they had a high bias problem in the first place, and not a high variance
-
38:53 - 38:55problem,
-
38:55 - 38:57it's entirely possible to spend
-
38:57 - 39:00three months or six months collecting more and more training data,
-
39:00 - 39:05not realizing that it couldn't possibly help. Right?
-
39:05 - 39:08And so, this actually happens a lot in, you
-
39:08 - 39:12know, in Silicon Valley companies, this happens a lot. There will often be
-
39:12 - 39:15people building various machine learning systems, and
-
39:15 - 39:20they'll often - you often see people spending six months working on fixing a
-
39:20 - 39:21learning algorithm
-
39:21 - 39:24and you could've told them six months ago that, you know,
-
39:24 - 39:27that couldn't possibly have helped. But because they didn't know what the
-
39:27 - 39:29problem was,
-
39:29 - 39:34they'd easily spend six months trying to invent new features or something. And
-
39:34 - 39:38this is - you see this surprisingly often and this is somewhat depressing. You could've gone to them and
-
39:38 - 39:42told them, "I could've told you six months ago that this was not going to help." And
-
39:42 - 39:46the six months is not a joke, you actually see
-
39:46 - 39:48this.
-
39:48 - 39:50And in contrast, if you
-
39:50 - 39:53actually figure out the problem's one of high bias or high variance, then
-
39:53 - 39:54you can rule out
-
39:54 - 39:56two of these solutions and
-
39:56 - 40:01save yourself many months of fruitless effort. Okay? I actually
-
40:01 - 40:04want to talk about these four at the bottom as well. But before I move on, let me
-
40:04 - 40:05just check if there were questions about what I've talked
-
40:05 - 40:12about so far. No? Okay, great. So bias
-
40:20 - 40:23versus variance is one thing that comes up
-
40:23 - 40:30often. This bias versus variance is one common diagnostic. And so,
-
40:30 - 40:33for other machine learning problems, it's often up to your own ingenuity to figure out
-
40:33 - 40:36your own diagnostics to figure out what's wrong. All right?
-
40:36 - 40:41So if a machine-learning algorithm isn't working, very often it's up to you to figure out, you
-
40:41 - 40:44know, to construct your own tests. Like do you look at the difference training and
-
40:44 - 40:46test errors or do you look at something else?
-
40:46 - 40:50It's often up to your own ingenuity to construct your own diagnostics to figure out what's
-
40:50 - 40:53going on.
-
40:53 - 40:55What I want to do is go through another example. All right?
-
40:55 - 40:59And this one is slightly more contrived but it'll illustrate another
-
40:59 - 41:03common question that comes up, another one of the most common
-
41:03 - 41:05issues that comes up in applying
-
41:05 - 41:06learning algorithms.
-
41:06 - 41:08So in this example, it's slightly more contrived,
-
41:08 - 41:12let's say you implement Bayesian logistic regression
-
41:12 - 41:18and you get 2 percent error on spam mail and 2 percent error on non-spam mail. Right? So
-
41:18 - 41:19it's rejecting, you know,
-
41:19 - 41:212 percent of -
-
41:21 - 41:25it's rejecting 98 percent of your spam mail, which is fine, so 2 percent of all
-
41:25 - 41:27spam gets
-
41:27 - 41:31through which is fine, but is also rejecting 2 percent of your good email,
-
41:31 - 41:352 percent of the email from your friends and that's unacceptably high, let's
-
41:35 - 41:37say.
-
41:37 - 41:39And let's say that
-
41:39 - 41:42a support vector machine using a linear kernel
-
41:42 - 41:45gets 10 percent error on spam and
-
41:45 - 41:490.01 percent error on non-spam, which is more like the acceptable performance you want. And let's say for the sake of this
-
41:49 - 41:53example, let's say you're trying to build an anti-spam system. Right?
-
41:53 - 41:56Let's say that you really want to deploy
-
41:56 - 41:58logistic regression
-
41:58 - 42:01to your customers because of computational efficiency or because you need
-
42:01 - 42:03to retrain overnight every day,
-
42:03 - 42:07and because logistic regression just runs more easily and more quickly or something. Okay? So let's
-
42:07 - 42:09say you want to deploy logistic
-
42:09 - 42:13regression, but it's just not working out well. So
-
42:13 - 42:18question is: What do you do next? So it
-
42:18 - 42:19turns out that this -
-
42:19 - 42:22the issue that comes up here, the one other common question that
-
42:22 - 42:25comes up is
-
42:25 - 42:30a question of is the algorithm converging. So you might suspect that maybe
-
42:30 - 42:33the problem with logistic regression is that it's just not converging.
-
42:33 - 42:36Maybe you need to run more iterations. And
-
42:36 - 42:38it
-
42:38 - 42:40turns out that, again if you look at the optimization objective, say,
-
42:40 - 42:44logistic regression is, let's say, optimizing J
-
42:44 - 42:47of theta, it actually turns out that if you plot your objective as a function of the number
-
42:47 - 42:52of iterations, when you look
-
42:52 - 42:55at this curve, you know, it sort of looks like it's going up but it sort of
-
42:55 - 42:58looks like it's asymptoting. And
-
42:58 - 43:01when you look at these curves, it's often very hard to tell
-
43:01 - 43:04if the curve has already flattened out. All right? And you look at these
-
43:04 - 43:06curves a lot so you can ask:
-
43:06 - 43:08Well has the algorithm converged? When you look at the J of theta like this, it's
-
43:08 - 43:10often hard to tell.
-
43:10 - 43:14You can run this ten times as long and see if it's flattened out. And you can run this ten
-
43:14 - 43:21times as long and it'll often still look like maybe it's going up very slowly, or something. Right?
-
43:21 - 43:25So you need a better diagnostic for whether logistic regression has converged than
-
43:25 - 43:29looking at this curve.
-
43:29 - 43:32The other question you might wonder - the other thing you might
-
43:32 - 43:37suspect is a problem is are you optimizing the right function.
-
43:37 - 43:39So
-
43:39 - 43:41what you care about,
-
43:41 - 43:43right, in spam, say,
-
43:43 - 43:44is a
-
43:44 - 43:47weighted accuracy function like that. So A of theta is,
-
43:47 - 43:49you know, sum over your
-
43:49 - 43:52examples of some weights times whether you got it right.
-
43:52 - 43:57And so the weight may be higher for non-spam than for spam mail because you care
-
43:57 - 43:58about getting
-
43:58 - 44:01your predictions correct for non-spam email much more than for spam email, say. So let's
-
44:01 - 44:02
-
44:02 - 44:05say A of theta
-
44:05 - 44:11is the optimization objective that you really care about, but what Bayesian logistic regression does is
-
44:11 - 44:15optimize a quantity like that. Right? It's this
-
44:15 - 44:18sort of maximum likelihood thing
-
44:18 - 44:19and then with this
-
44:19 - 44:21two-norm, you know,
-
44:21 - 44:23penalty thing that we saw previously. And you
-
44:23 - 44:26might be wondering: Is this the right optimization function to be optimizing.
-
44:26 - 44:31Okay? And: Or do I maybe need to change the value for lambda
-
44:31 - 44:34to change this parameter? Or:
-
44:34 - 44:40Should I maybe really be switching to support vector machine optimization objective?
-
44:40 - 44:42Okay? Does that make sense? So
-
44:42 - 44:44the second diagnostic I'm gonna talk about
-
44:44 - 44:47is let's say you want to figure out
-
44:47 - 44:51is the algorithm converging, is the optimization algorithm converging, or
-
44:51 - 44:52is the problem with
-
44:52 - 44:58the optimization objective I chose in the first place? Okay?
-
44:58 - 45:03So here's
-
45:03 - 45:07the diagnostic you can use. Let me let - right. So to
-
45:07 - 45:11just reiterate the story, right, let's say an SVM outperforms Bayesian
-
45:11 - 45:14logistic regression but you really want to deploy
-
45:14 - 45:17Bayesian logistic regression to your problem. Let me
-
45:17 - 45:19let theta subscript SVM, be the
-
45:19 - 45:22parameters learned by an SVM,
-
45:22 - 45:25and I'll let theta subscript BLR be the parameters learned by Bayesian
-
45:25 - 45:28logistic regression.
-
45:28 - 45:32So the optimization objective you care about is this, you know, weighted accuracy
-
45:32 - 45:35criteria that I talked about just now.
-
45:35 - 45:38And
-
45:38 - 45:42the support vector machine outperforms Bayesian logistic regression. And so, you know,
-
45:42 - 45:45the weighted accuracy on the support-vector-machine parameters
-
45:45 - 45:47is better than the weighted accuracy
-
45:47 - 45:50for Bayesian logistic regression.
-
45:50 - 45:54So
-
45:54 - 45:57further, Bayesian logistic regression tries to optimize
-
45:57 - 45:59an optimization objective like that, which I
-
45:59 - 46:02denoted J of theta.
-
46:02 - 46:06And so, the diagnostic I choose to use is
-
46:06 - 46:08to see if J of theta SVM
-
46:08 - 46:12is bigger than or less than J of theta BLR. Okay?
-
46:12 - 46:15So I explain this on the next slide.
-
46:15 - 46:16So
-
46:16 - 46:20we know two facts. We know that - well we know one fact. We know that a weighted
-
46:20 - 46:21accuracy
-
46:21 - 46:23of support vector machine, right,
-
46:23 - 46:24is bigger than
-
46:24 - 46:29this weighted accuracy of Bayesian logistic regression. So
-
46:29 - 46:32in order for me to figure out whether Bayesian logistic regression is converging,
-
46:32 - 46:35or whether I'm just optimizing the wrong objective function,
-
46:35 - 46:41the diagnostic I'm gonna use is I'm gonna check whether this inequality holds true. Okay?
-
46:41 - 46:44So let me explain this,
-
46:44 - 46:45so in Case 1,
-
46:45 - 46:46right,
-
46:46 - 46:48it's just those two equations copied over.
-
46:48 - 46:50In Case 1, let's say that
-
46:50 - 46:55J of SVM is, indeed, greater than J of BLR - or J of
-
46:55 - 47:01theta SVM is greater than J of theta BLR. But
-
47:01 - 47:04we know that Bayesian logistic regression
-
47:04 - 47:08was trying to maximize J of theta;
-
47:08 - 47:09that's the definition of
-
47:09 - 47:12Bayesian logistic regression.
-
47:12 - 47:17So this means that
-
47:17 - 47:18theta -
-
47:18 - 47:22the value of theta output by Bayesian logistic regression actually fails to
-
47:22 - 47:24maximize J,
-
47:24 - 47:27because the support vector machine actually returned a value of theta that,
-
47:27 - 47:29you know, does a
-
47:29 - 47:31better job of maximizing J.
-
47:31 - 47:37And so, this tells me that Bayesian logistic regression didn't actually maximize J
-
47:37 - 47:39correctly, and so the problem is with
-
47:39 - 47:41the optimization algorithm. The
-
47:41 - 47:45optimization algorithm hasn't converged. The other
-
47:45 - 47:46case
-
47:46 - 47:50is as follows, where
-
47:50 - 47:53J of theta SVM is less-than/equal to J of theta
-
47:53 - 47:56BLR. Okay?
-
47:56 - 47:58In this case, what does
-
47:58 - 47:59that mean?
-
47:59 - 48:02This means that Bayesian logistic regression
-
48:02 - 48:05actually attains a higher value
-
48:05 - 48:07for the optimization objective J
-
48:07 - 48:11than does the support vector machine.
-
48:11 - 48:13The support vector machine,
-
48:13 - 48:15which does worse
-
48:15 - 48:18on your optimization problem,
-
48:18 - 48:19actually does better
-
48:19 - 48:24on the weighted accuracy measure.
-
48:24 - 48:28So what this means is that something that does worse on your optimization
-
48:28 - 48:29objective,
-
48:29 - 48:30on J,
-
48:30 - 48:31can actually do better
-
48:31 - 48:34on the weighted accuracy objective.
-
48:34 - 48:37And this really means that maximizing
-
48:37 - 48:38J of theta,
-
48:38 - 48:42you know, doesn't really correspond that well to maximizing your weighted accuracy criteria.
-
48:42 - 48:43
-
48:43 - 48:47And therefore, this tells you that J of theta is maybe the wrong optimization
-
48:47 - 48:50objective to be maximizing. Right?
-
48:50 - 48:51That just maximizing J of
-
48:51 - 48:53theta just wasn't a good objective
-
48:53 - 49:00to be choosing if you care about the weighted accuracy. Okay?
-
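In code form, the whole diagnostic might look something like the sketch below; `J`, `weighted_accuracy`, and the two parameter vectors are placeholders you would supply from your own problem, not anything defined in the lecture:

```python
def diagnose(theta_svm, theta_blr, J, weighted_accuracy):
    """Assumes a(theta_svm) > a(theta_blr) has already been observed,
    where a is the weighted accuracy you actually care about."""
    assert weighted_accuracy(theta_svm) > weighted_accuracy(theta_blr)
    if J(theta_svm) > J(theta_blr):
        # BLR is *trying* to maximize J but found a worse value than the
        # SVM did: the optimization algorithm hasn't converged.
        return "fix the optimization algorithm (more iterations, Newton's method, ...)"
    else:
        # BLR maximized J just fine, yet does worse on weighted accuracy:
        # J is the wrong objective (change lambda, or change objectives).
        return "fix the optimization objective"
```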
49:03 - 49:03Can you raise your hand
-
49:03 - 49:10if this made sense?
-
49:10 - 49:12Cool, good. So
-
49:12 - 49:17that tells us whether the problem is with the optimization algorithm
-
49:17 - 49:19or whether it's with the objective function.
-
49:19 - 49:21And so going back to this
-
49:21 - 49:23slide, the eight fixes we had,
-
49:23 - 49:24you notice that if you
-
49:24 - 49:27run gradient descent for more iterations
-
49:27 - 49:31that fixes the optimization algorithm. Trying out Newton's method
-
49:31 - 49:33fixes the optimization algorithm,
-
49:33 - 49:37whereas using a different value for lambda, in that lambda times norm of theta
-
49:37 - 49:39squared, you know, in your objective,
-
49:39 - 49:42fixes the optimization objective. And
-
49:42 - 49:48changing to an SVM is also another way of trying to fix the optimization objective. Okay?
-
49:48 - 49:49And so
-
49:49 - 49:52once again, you actually see this quite often that -
-
49:52 - 49:55actually, you see it very often, people will
-
49:55 - 49:58have a problem with the optimization objective
-
49:58 - 50:01and be working harder and harder
-
50:01 - 50:03to fix the optimization algorithm.
-
50:03 - 50:06That's another very common pattern that
-
50:06 - 50:10the problem is in the formula for your J of theta, and often you see people, you know,
-
50:10 - 50:13just running more and more iterations of gradient descent. Like trying Newton's
-
50:13 - 50:16method and trying conjugate gradient and then trying
-
50:16 - 50:19more and more crazy optimization algorithms,
-
50:19 - 50:21whereas the problem was in J itself, you know, and
-
50:21 - 50:24optimizing J of theta harder wasn't going to fix the problem at all. Okay?
-
50:24 - 50:29So there's another example of when these sorts of diagnostics will
-
50:29 - 50:32help you figure out whether you should be fixing your optimization algorithm
-
50:32 - 50:33or fixing the
-
50:33 - 50:39optimization
-
50:39 - 50:45objective. Okay? Let me think
-
50:45 - 50:48how much time I have.
-
50:48 - 50:49Hmm, let's
-
50:49 - 50:50see. Well okay, we have time. Let's do this.
-
50:50 - 50:53Show you one last example of a diagnostic. This is one that came up in,
-
50:53 - 50:57you know, my students' and my work on flying helicopters.
-
50:57 - 50:58
-
50:58 - 51:00This one actually,
-
51:00 - 51:04this example is the most complex of the three examples I'm gonna do
-
51:04 - 51:06today.
-
51:06 - 51:09I'm going to go through this somewhat quickly, and
-
51:09 - 51:11this actually draws on reinforcement learning which is something that I'm not
-
51:11 - 51:14gonna talk about until towards - close to the end of the course here, but this is just
-
51:14 - 51:17a more
-
51:17 - 51:20complicated example of a diagnostic we're gonna go over.
-
51:20 - 51:24What I'll do is probably go over this fairly quickly, and then after we've talked about
-
51:24 - 51:27reinforcement learning in the class, I'll probably actually come back and redo this exact
-
51:27 - 51:33same example because you'll understand it more deeply. Okay?
-
51:33 - 51:37So some of you know that my students and I fly autonomous helicopters, so how do you get a
-
51:37 - 51:42machine-learning algorithm to design the controller for
-
51:42 - 51:44a helicopter? This is what we do. All right?
-
51:44 - 51:49The first step is you build a simulator for a helicopter, so, you know, there's a screenshot of our
-
51:49 - 51:50simulator.
-
51:50 - 51:53This is just like a - it's like a joystick simulator; you can fly a helicopter in simulation. And then you
-
51:53 - 51:56
-
51:56 - 51:57choose a cost function - it's
-
51:57 - 52:01actually called a reward function in reinforcement learning, but here I'll call it a cost function.
-
52:01 - 52:03Say J of theta is, you know,
-
52:03 - 52:07the expected squared error in your helicopter's
-
52:07 - 52:08position. Okay? So J of theta is,
-
52:08 - 52:09maybe,
-
52:09 - 52:12the expected squared error, or just the squared error.
-
52:12 - 52:17And then we run a reinforcement-learning algorithm, you'll learn about RL algorithms
-
52:17 - 52:19in a few weeks.
-
52:19 - 52:22You run the reinforcement learning algorithm in your simulator
-
52:22 - 52:27to try to minimize this cost function; try to minimize the squared error of
-
52:27 - 52:31how well you're controlling your helicopter's position. Okay?
-
52:31 - 52:35The reinforcement learning algorithm will output some parameters, which I'm denoting theta
-
52:35 - 52:37subscript RL,
-
52:37 - 52:42and then you'll use that to fly your helicopter.
-
52:42 - 52:45So suppose you run this learning algorithm and
-
52:45 - 52:49you get out a set of controller parameters, theta subscript RL,
-
52:49 - 52:52that gives much worse performance than a human pilot. Then
-
52:52 - 52:55what do you do next? And in particular, you
-
52:55 - 52:58know, corresponding to the three steps above, there are three
-
52:58 - 53:01natural things you can try. Right? You can
-
53:01 - 53:02try to - oh, the bottom of
-
53:02 - 53:04the slide got chopped off.
-
53:04 - 53:08You can try to improve the simulator. And
-
53:08 - 53:10maybe you think your simulator isn't that accurate - you need to capture
-
53:10 - 53:12the aerodynamic effects more
-
53:12 - 53:15accurately. You need to capture the airflow and the turbulence effects around the helicopter
-
53:15 - 53:18more accurately.
-
53:18 - 53:21Maybe you need to modify the cost function. Maybe your squared error isn't cutting it. Maybe
-
53:21 - 53:25what a human pilot does isn't just optimizing squared error but it's something more
-
53:25 - 53:26subtle.
-
53:26 - 53:27Or maybe
-
53:27 - 53:33the reinforcement-learning algorithm isn't working; maybe it's not quite converging or something. Okay? So
-
53:33 - 53:37these are the diagnostics that I actually used, and my students and I actually use to figure out what's
-
53:37 - 53:41going on.
-
53:41 - 53:45Actually, why don't you just think about this for a second and think what you'd do, and then
-
53:45 - 53:52I'll go on and tell you what we do. All right,
-
54:46 - 54:48so let me tell you what -
-
54:48 - 54:50how we do this and see
-
54:50 - 54:53whether it's the same as yours or not. And if you have a better idea than I do, let me
-
54:53 - 54:54know and I'll let you try it
-
54:54 - 54:56on my helicopter.
-
54:56 - 54:58So
-
54:58 - 55:01here's the reasoning behind the experiments, right. So,
-
55:01 - 55:04yeah, let's say the controller output
-
55:04 - 55:10by our reinforcement-learning algorithm does poorly. Well
-
55:10 - 55:13suppose the following three things hold true.
-
55:13 - 55:15Suppose, to the contrary, I guess. Suppose that
-
55:15 - 55:20the helicopter simulator is accurate, so let's assume we have an accurate model
-
55:20 - 55:22of our helicopter. And
-
55:22 - 55:25let's suppose that the reinforcement learning algorithm,
-
55:25 - 55:29you know, correctly controls the helicopter in simulation,
-
55:29 - 55:32so we tend to run the learning algorithm in simulation so that, you know, the
-
55:32 - 55:35learning algorithm can crash a helicopter and it's fine. Right?
-
55:35 - 55:37So let's assume our reinforcement-learning
-
55:37 - 55:40algorithm correctly controls the helicopter so as to minimize the cost
-
55:40 - 55:42function J of theta.
-
55:42 - 55:44And let's suppose that
-
55:44 - 55:48minimizing J of theta does indeed correspond to accurate, correct autonomous
-
55:48 - 55:49flight.
-
55:49 - 55:52If all of these things held true,
-
55:52 - 55:54then that means that
-
55:54 - 55:58the parameters, theta RL, should actually fly well on my real
-
55:58 - 56:01helicopter. Right?
-
56:01 - 56:03And so the fact that the learned
-
56:03 - 56:05control parameters, theta RL,
-
56:05 - 56:09do not fly well on my helicopter, that sort
-
56:09 - 56:11of means that one of these three assumptions must be wrong,
-
56:11 - 56:18and I'd like to figure out which of these
-
56:18 - 56:20three assumptions
-
56:20 - 56:22is wrong. Okay? So these are the diagnostics we use.
-
56:22 - 56:25First one is
-
56:25 - 56:32we look at the controller and see if it even flies well in
-
56:32 - 56:35simulation - right, in the simulator of the helicopter that we did the learning on -
-
56:35 - 56:39and so if the learning algorithm flies well in the simulator but
-
56:39 - 56:42it doesn't fly well on my real helicopter,
-
56:42 - 56:46then that tells me the problem is probably in the simulator. Right?
-
56:46 - 56:48My simulator predicts
-
56:48 - 56:52the helicopter's controller will fly well but it doesn't actually fly well in real life, so
-
56:52 - 56:54could be the problem's in the simulator
-
56:54 - 56:59and we should spend our efforts improving the accuracy of our simulator.
-
56:59 - 57:03Otherwise, let me write theta subscript human for the human
-
57:03 - 57:07control policy. All right? So
-
57:07 - 57:12let's go ahead and ask a human to fly the helicopter, it could be in the simulator, it
-
57:12 - 57:13could be in real life,
-
57:13 - 57:17and let's measure, you know, the mean squared error
-
57:17 - 57:20of the human pilot's flight. And
-
57:20 - 57:24let's see if the human pilot does better or worse
-
57:24 - 57:26than the learned controller,
-
57:26 - 57:28in terms of optimizing this
-
57:28 - 57:32objective function J of theta. Okay?
-
57:32 - 57:34So if the human does
-
57:34 - 57:37worse, if even a very good human pilot
-
57:37 - 57:41attains a worse value on my optimization objective, on my cost
-
57:41 - 57:42function,
-
57:42 - 57:49than my learning algorithm,
-
57:49 - 57:52then the problem is in the reinforcement-learning algorithm.
-
57:52 - 57:56Because my reinforcement-learning algorithm was trying to minimize J of
-
57:56 - 58:00theta, but a human actually attains a lower value for J of theta than does my
-
58:00 - 58:02algorithm.
-
58:02 - 58:05And so that tells me that clearly my algorithm's not
-
58:05 - 58:08managing to minimize J of theta
-
58:08 - 58:13and that tells me the problem's in the reinforcement learning algorithm.
-
58:13 - 58:18And finally, if J of theta - if the human actually attains a larger value
-
58:18 - 58:20for theta - excuse me,
-
58:20 - 58:24if the human actually attains a larger value for J of theta, so the human actually
-
58:24 - 58:28has, you know, a larger mean squared error for the helicopter position than
-
58:28 - 58:31does my reinforcement learning algorithm, but
-
58:31 - 58:34I like the way the human flies much better than my reinforcement learning
-
58:34 - 58:35algorithm. So
-
58:35 - 58:37if that holds true,
-
58:37 - 58:40then clearly the problem's in the cost function, right,
-
58:40 - 58:43because the human does worse on my cost function
-
58:43 - 58:46but flies much better than my learning algorithm.
-
58:46 - 58:48And so that means the problem's in the cost function. It
-
58:48 - 58:50means - oh
-
58:50 - 58:51excuse me, I
-
58:51 - 58:54meant minimizing it, not maximizing it, there's a typo on the slide,
-
58:54 - 58:55because that means that minimizing
-
58:55 - 58:57the cost function
-
58:57 - 59:00- my learning algorithm does a better job minimizing the cost function but doesn't
-
59:00 - 59:03fly as well as a human pilot. So that tells you that
-
59:03 - 59:05minimizing the cost function
-
59:05 - 59:07doesn't correspond to good autonomous flight. And what
-
59:07 - 59:12you should do is go back and see if you can change J of
-
59:12 - 59:13theta. Okay?
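Putting those three checks together, here's a hedged sketch of the whole diagnostic in Python; flies_well_in_sim and J are hypothetical stand-ins for a simulator rollout and the cost function, not code from our actual system.

```python
# Three-way helicopter diagnostic, under the premise that theta_rl flies
# poorly on the real helicopter. All names here are hypothetical stand-ins.

def helicopter_diagnostic(theta_rl, theta_human, flies_well_in_sim, J):
    # Check 1: the controller was trained in this simulator, so if it flies
    # well in simulation but poorly in reality, the simulator is inaccurate.
    if flies_well_in_sim(theta_rl):
        return "improve the simulator"

    # Check 2: if even a good human pilot attains a LOWER cost than the
    # learned controller, the RL algorithm isn't actually minimizing J.
    if J(theta_human) < J(theta_rl):
        return "fix the reinforcement learning algorithm"

    # Check 3: otherwise the human attains a higher cost yet flies better,
    # so minimizing J doesn't correspond to good autonomous flight.
    return "change the cost function J"
```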
-
59:13 - 59:18And so for those reinforcement learning problems, you know, if something doesn't work - often reinforcement
-
59:18 - 59:22learning algorithms just work but when they don't work,
-
59:22 - 59:26these are the sorts of diagnostics you use to figure out should we be focusing on the simulator,
-
59:26 - 59:30on changing the cost function, or on changing the reinforcement learning
-
59:30 - 59:32algorithm. And
-
59:32 - 59:37again, if you don't know which of your three problems it is, it's entirely possible,
-
59:37 - 59:40you know, to spend two years, whatever, building a better simulator
-
59:40 - 59:43for your helicopter.
-
59:43 - 59:44But it turns out that
-
59:44 - 59:48modeling helicopter aerodynamics is an active area of research. There are people, you know, writing
-
59:48 - 59:50entire PhD theses on this still.
-
59:50 - 59:54So it's entirely possible to go out and spend six years and write a PhD thesis and build
-
59:54 - 59:55a much better helicopter simulator, but if you're fixing
-
59:55 - 60:02the wrong problem it's not gonna help.
-
60:03 - 60:06So
-
60:06 - 60:09quite often, you need to come up with your own diagnostics to figure out what's happening in an
-
60:09 - 60:12algorithm when something is going wrong.
-
60:12 - 60:16And unfortunately I don't know of - what I've described
-
60:16 - 60:17are sort of maybe
-
60:17 - 60:21some of the most common diagnostics that I've used, that I've seen,
-
60:21 - 60:24you know, to be useful for many problems. But very often, you need to come up
-
60:24 - 60:28with your own for your own specific learning problem.
-
60:28 - 60:32And I just want to point out that even when the learning algorithm is working well, it's
-
60:32 - 60:35often a good idea to run diagnostics, like the ones I talked
-
60:35 - 60:36about,
-
60:36 - 60:38to make sure you really understand what's going on.
-
60:38 - 60:42All right? And this is useful for a couple of reasons. One is that
-
60:42 - 60:46diagnostics like these will often help you to understand your application
-
60:46 - 60:48problem better.
-
60:48 - 60:52So some of you will, you know, graduate from Stanford and go on to get some amazingly high-paying
-
60:52 - 60:56job to apply machine-learning algorithms to some application problem of, you
-
60:56 - 60:59know, significant economic interest. Right?
-
60:59 - 61:03And you're gonna be working on one specific
-
61:03 - 61:08important machine learning application for many months, or even for years.
-
61:08 - 61:11One of the most valuable things for you personally will be for you to
-
61:11 - 61:13get in - for you personally
-
61:13 - 61:17to gain an intuitive understanding of what works and what doesn't work on your
-
61:17 - 61:17problem.
-
61:17 - 61:21Sort of right now in the industry, in Silicon Valley or around the world,
-
61:21 - 61:25there are many companies with important machine learning problems and there are often people
-
61:25 - 61:27working on the same machine learning problem, you
-
61:27 - 61:31know, for many months or for years on end. And
-
61:31 - 61:35when you're doing that, I mean solving a really important problem using learning algorithms, one of
-
61:35 - 61:39the most valuable things is just your own personal intuitive understanding of the
-
61:39 - 61:41problem.
-
61:41 - 61:42Okay?
-
61:42 - 61:43And diagnostics, like
-
61:43 - 61:48the sort I talked about, will be one way for you to get a better and better understanding of
-
61:48 - 61:50these problems. It
-
61:50 - 61:54turns out, by the way, there are some Silicon Valley companies that outsource their
-
61:54 - 61:57machine learning. So there's sometimes, you know, whatever.
-
61:57 - 62:00There'll be a company in Silicon Valley and they'll, you know,
-
62:00 - 62:03hire a firm in New York to run all their learning algorithms for them.
-
62:03 - 62:07And I'm not a businessman, but I personally think that's
-
62:07 - 62:09often a terrible idea because
-
62:09 - 62:14if your expertise, if your understanding of your data is given,
-
62:14 - 62:16you know, to an outsource agency,
-
62:16 - 62:20then you don't maintain that expertise. And if there's a problem you really care about,
-
62:20 - 62:22it'll be your own, you know,
-
62:22 - 62:26understanding of the problem that you build up over months that'll be really valuable.
-
62:26 - 62:29And if that knowledge is outsourced, you don't get to keep that knowledge
-
62:29 - 62:29yourself.
-
62:29 - 62:32I personally think that's a terrible idea.
-
62:32 - 62:36But I'm not a businessman; I just see people do that a lot.
-
62:36 - 62:39Let's see.
-
62:39 - 62:43Another reason for running diagnostics like these is actually in writing research
-
62:43 - 62:44papers,
-
62:44 - 62:46right? So
-
62:46 - 62:49diagnostics and error analyses, which I'll talk about in a minute,
-
62:49 - 62:53often help to convey insight about the problem and help justify your research
-
62:53 - 62:54claims.
-
62:54 - 62:57
-
62:57 - 62:58So for example,
-
62:58 - 63:01rather than writing a research paper, say, that says, you know, "Oh well here's
-
63:01 - 63:04an algorithm that works. I built this helicopter and it flies," or whatever,
-
63:04 - 63:06it's often much more interesting to say,
-
63:06 - 63:10"Here's an algorithm that works, and it works because of a specific
-
63:10 - 63:14component X. And moreover, here's the diagnostic that gives you justification that shows X was
-
63:14 - 63:19the thing that fixed this problem," and that's where you made it work. Okay? So
-
63:19 - 63:21that leads me
-
63:21 - 63:26into a discussion on error analysis, which is often good machine learning practice -
-
63:26 - 63:26
-
63:26 - 63:32it's a way of understanding what your sources of error are. So what I
-
63:32 - 63:35call error analyses - and let me check for
-
63:35 - 63:42questions about this.
-
63:42 - 63:46Yeah?
-
63:46 - 63:50Student:What ended up being wrong with the helicopter? Instructor (Andrew Ng):Oh, don't know. Let's see. We've flown so many times.
-
63:50 - 63:54The thing that is most difficult with a helicopter is actually building a
-
63:54 - 63:55very - I don't know. It
-
63:55 - 63:58changes all the time. Quite often, it's actually the simulator. Building an accurate simulator of a helicopter
-
63:58 - 64:03is very hard. Yeah. Okay. So
-
64:03 - 64:04for error
-
64:04 - 64:06analyses,
-
64:06 - 64:11this is a way for figuring out what is working in your algorithm and what isn't working.
-
64:11 - 64:18And we're gonna talk about two specific examples. So there are
-
64:18 - 64:22many learning - there are many sort of AI systems, many machine learning systems, that
-
64:22 - 64:22combine
-
64:22 - 64:25many different components into a pipeline. So
-
64:25 - 64:27here's sort of a contrived example for this,
-
64:27 - 64:31not dissimilar in many ways from the actual machine learning systems you see.
-
64:31 - 64:32So let's say you want to
-
64:32 - 64:38recognize people from images. This is a picture of one of my friends.
-
64:38 - 64:42So you take this input camera image, say, and you often run it through a long pipeline. So
-
64:42 - 64:43for example,
-
64:43 - 64:48the first thing you may do is preprocess the image and remove the background, so you remove the
-
64:48 - 64:49background.
-
64:49 - 64:52And then you run a
-
64:52 - 64:55face detection algorithm, so a machine learning algorithm to detect people's faces.
-
64:55 - 64:56Right?
-
64:56 - 65:00And then, you know, let's say you want to recognize the identity of the person, right, this is your
-
65:00 - 65:02application.
-
65:02 - 65:04You then segment out the eyes, segment out the nose,
-
65:04 - 65:08and have different learning algorithms to detect the mouth and so on.
-
65:08 - 65:10I know; she might not want to be my friend
-
65:10 - 65:13after she sees this.
-
65:13 - 65:17And then having found all these features, based on, you know, what the nose looks like, what the eyes
-
65:17 - 65:19look like, whatever, then you
-
65:19 - 65:23feed all the features into a logistic regression algorithm. And your logistic
-
65:23 - 65:25regression or softmax regression, or whatever,
-
65:25 - 65:30will tell you the identity of this person. Okay?
-
65:30 - 65:32So
-
65:32 - 65:35this is what error analysis is.
-
65:35 - 65:40You have a long complicated pipeline combining many machine learning
-
65:40 - 65:44components. Many of these would themselves be learning algorithms.
-
65:44 - 65:46And so,
-
65:46 - 65:50it's often very useful to figure out how much of your error can be attributed to each of
-
65:50 - 65:55these components.
-
65:55 - 65:56So
-
65:56 - 66:00what we'll do in a typical error analysis procedure
-
66:00 - 66:04is we'll repeatedly plug in the ground-truth for each component and see how the
-
66:04 - 66:05accuracy changes.
-
66:05 - 66:08So what I mean by that is the
-
66:08 - 66:11figure on the bottom left - bottom right, let's say the overall accuracy of the system is
-
66:11 - 66:1385 percent. Right?
-
66:13 - 66:15Then I want to know
-
66:15 - 66:17where my 15 percent of error comes from.
-
66:17 - 66:19And so what I'll do is I'll go
-
66:19 - 66:21to my test set
-
66:21 - 66:27and, instead of running my own background removal, I'll plug in the correct
-
66:27 - 66:30background removal. So actually go in and give it -
-
66:30 - 66:33give my algorithm - the correct background versus foreground.
-
66:33 - 66:37And if I do that, let's color that blue to denote that I'm
-
66:37 - 66:40giving that ground-truth data in the test set,
-
66:40 - 66:44let's assume our accuracy increases to 85.1 percent. Okay?
-
66:44 - 66:48And now I'll go in and, you know, give my algorithm the ground-truth,
-
66:48 - 66:49face detection
-
66:49 - 66:53output. So I'll go in and actually on my test set I'll just tell the algorithm where the
-
66:53 - 66:55face is. And if I do that,
-
66:55 - 66:59let's say my algorithm's accuracy increases to 91 percent,
-
66:59 - 67:03and so on. And then I'll go for each of these components
-
67:03 - 67:05and just give it
-
67:05 - 67:09the ground-truth label for each of the components,
-
67:09 - 67:12because say, like, the nose segmentation algorithm's trying to figure out
-
67:12 - 67:13where the nose is. I just go in
-
67:13 - 67:17and tell it where the nose is so that it doesn't have to figure that out.
-
67:17 - 67:21And as I do this, one component after the other, you know, I end up giving it the correct output
-
67:21 - 67:24label and end up with 100 percent accuracy.
-
67:24 - 67:27And now you can look at this table - I'm sorry this is cut off on the bottom,
-
67:27 - 67:29it says logistic regression 100 percent. Now you can
-
67:29 - 67:31look at this
-
67:31 - 67:32table and
-
67:32 - 67:33see,
-
67:33 - 67:36you know, how much giving the ground-truth labels for each of these
-
67:36 - 67:39components could help boost your final performance.
-
67:39 - 67:42In particular, if you look at this table, you notice that
-
67:42 - 67:45when I added the face detection ground-truth,
-
67:45 - 67:48my performance jumped from 85.1 percent accuracy
-
67:48 - 67:51to 91 percent accuracy. Right?
-
67:51 - 67:55So this tells me that if only I can get better face detection,
-
67:55 - 67:58maybe I can boost my accuracy by 6 percent.
-
67:58 - 68:00Whereas in contrast, when I,
-
68:00 - 68:04you know, say plugged in better, I don't know,
-
68:04 - 68:07background removal, my accuracy improved from 85
-
68:07 - 68:09to 85.1 percent.
-
68:09 - 68:12And so, this sort of diagnostic also tells you that if your goal
-
68:12 - 68:14is to improve the system, it's probably a waste of
-
68:14 - 68:18your time to try to improve your background subtraction. Because
-
68:18 - 68:19even if you got the ground-truth,
-
68:19 - 68:22that gives you, at most, 0.1 percent accuracy improvement,
-
68:22 - 68:25whereas if you do better face detection, maybe there's a much
-
68:25 - 68:26larger potential for gains there. Okay?
-
68:26 - 68:29So this sort of diagnostic,
-
68:29 - 68:30again,
-
68:30 - 68:33is very useful because if your goal is to improve the system,
-
68:33 - 68:36there are so many different pieces you can easily choose to spend the next three
-
68:36 - 68:37months on. Right?
-
68:37 - 68:39And choosing the right piece
-
68:39 - 68:43is critical, and this sort of diagnostic tells you what's the piece that may
-
68:43 - 68:49actually be worth your time to work on.
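If you wanted to script that procedure, it might look something like the following sketch; pipeline.use_ground_truth, evaluate, and the component names are all made up for illustration and would need to be adapted to your own system.

```python
# Error analysis sketch: cumulatively substitute hand-labeled ground truth
# for each component, in pipeline order, and watch test accuracy change.
# `pipeline`, `evaluate`, and `use_ground_truth` are hypothetical.

COMPONENTS = ["background_removal", "face_detection",
              "eye_segmentation", "nose_segmentation", "mouth_segmentation"]

def error_analysis(pipeline, evaluate, test_set):
    print(f"baseline: {evaluate(pipeline, test_set):.1%}")
    for component in COMPONENTS:
        # From this point on, this component's output on the test set is
        # replaced by ground-truth labels.
        pipeline.use_ground_truth(component)
        print(f"with ground-truth {component}: {evaluate(pipeline, test_set):.1%}")
```

The component whose ground truth produces the biggest jump - face detection in the example above, from 85.1 to 91 percent - is where your effort is most likely to pay off.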
-
68:49 - 68:52There's sort of another type of analysis that's sort of the opposite of what I just
-
68:52 - 68:53talked about.
-
68:53 - 68:55The error analysis I just talked about
-
68:55 - 68:58tries to explain the difference between the current performance and perfect
-
68:58 - 69:00performance,
-
69:00 - 69:04whereas this sort of ablative analysis tries to explain the difference
-
69:04 - 69:09between some baselines, some really bad performance and your current performance.
-
69:09 - 69:13So for this example, let's suppose you've built a very good anti-spam classifier by
-
69:13 - 69:17adding lots of clever features to your logistic regression algorithm. Right? So you added
-
69:17 - 69:21features for spelling correction, for, you know, sender host features, for email header
-
69:21 - 69:21features,
-
69:21 - 69:25email text parser features, JavaScript parser features,
-
69:25 - 69:27features for embedded images, and so on.
-
69:27 - 69:30So now let's say you've built the system and you want to figure out, you know, how well did
-
69:30 - 69:34each of these - how much did each of these components actually contribute? Maybe you want
-
69:34 - 69:37to write a research paper and claim this was the piece that made the
-
69:37 - 69:41big difference. Can you actually document that claim and justify it?
-
69:41 - 69:43So in ablative analysis,
-
69:43 - 69:45here's what we do.
-
69:45 - 69:46So in this example,
-
69:46 - 69:50let's say that simple logistic regression without any of your clever
-
69:50 - 69:52improvements gets 94 percent performance. And
-
69:52 - 69:55you want to figure out what accounts for your improvement from 94 to
-
69:55 - 69:5899.9 percent performance.
-
69:58 - 70:03So in ablative analysis, instead of adding components one at a time, we'll
-
70:03 - 70:07remove components one at a time to see how performance degrades.
-
70:07 - 70:11So start with your overall system, which has 99.9 percent accuracy.
-
70:11 - 70:14And then we remove spelling correction and see how much performance
-
70:14 - 70:15drops.
-
70:15 - 70:22Then we'll remove the sender host features and see how much performance drops, and so on. All right? And so,
-
70:24 - 70:28in this contrived example,
-
70:28 - 70:31you see that, I guess, the biggest drop
-
70:31 - 70:32occurred when you removed
-
70:32 - 70:38the text parser features. And so you can then make a credible case that,
-
70:38 - 70:41you know, the text parser features were what really made the biggest difference here. Okay?
-
70:41 - 70:43And you can also tell,
-
70:43 - 70:46for instance, that, I don't know,
-
70:46 - 70:49removing the sender host features on this
-
70:49 - 70:52line, right, performance dropped from 99.9 to 98.9. And so this also means
-
70:52 - 70:53that
-
70:53 - 70:56in case you want to get rid of the sender host features to speed up
-
70:56 - 71:03computation, that would be a good candidate for elimination. Okay? Student:Are there any
-
71:04 - 71:05guarantees that if you shuffle around the order in which
-
71:05 - 71:06you drop those
-
71:06 - 71:10features that you'll get the same - Instructor (Andrew Ng):Yeah, let's address the question: what if you shuffle the order in which you remove things? The answer is no. There's
-
71:10 - 71:12no guarantee you'd get a similar result.
-
71:12 - 71:14So in practice,
-
71:14 - 71:18sometimes there's a fairly natural ordering for both types of analyses, the error
-
71:18 - 71:19analyses and ablative analysis -
-
71:19 - 71:23sometimes there's a fairly natural ordering in which you add things or remove things,
-
71:23 - 71:25and sometimes there isn't. And
-
71:25 - 71:28quite often, you just choose one ordering and go for it.
-
71:28 - 71:32And don't think of these analyses as sort of formulas that are cast in stone, though; I mean
-
71:32 - 71:35feel free to invent your own, as well. You know
-
71:35 - 71:37one of the things
-
71:37 - 71:38that's done quite often is
-
71:38 - 71:39take the overall system
-
71:39 - 71:43and just remove one component and then put it back, then remove a different one
-
71:43 - 71:48and put it back, until all of these things are done. Okay.
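As a sketch, ablative analysis is the mirror image of the error analysis loop: start from the full feature set and strip groups away one at a time. Again, train_and_evaluate and the feature-group names below are hypothetical.

```python
# Ablative analysis sketch: remove feature groups cumulatively, in one fixed
# order, from the full anti-spam system and record how much accuracy drops.
# `train_and_evaluate` and the group names are made up for illustration.

FEATURE_GROUPS = ["spelling_correction", "sender_host", "email_header",
                  "text_parser", "javascript_parser", "embedded_images"]

def ablative_analysis(train_and_evaluate):
    active = list(FEATURE_GROUPS)
    previous = train_and_evaluate(active)
    print(f"full system: {previous:.1%}")
    for group in FEATURE_GROUPS:
        active.remove(group)
        accuracy = train_and_evaluate(active)
        print(f"without {group}: {accuracy:.1%} (drop: {previous - accuracy:.1%})")
        previous = accuracy
```

The biggest single drop points at the component that mattered most, subject to the ordering caveat from the question above.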
-
71:48 - 71:51So the very last thing I want to talk about is sort of this
-
71:51 - 71:58general advice for how to get started on a learning problem. So
-
71:58 - 72:04here's a cartoon description of two broad ways to get started on a learning problem.
-
72:04 - 72:06The first one is
-
72:06 - 72:08carefully design your system, so
-
72:08 - 72:12you spend a long time designing exactly the right features, collecting the right data set, and
-
72:12 - 72:14designing the right algorithmic structure, then you
-
72:14 - 72:18implement it and hope it works. All right?
-
72:18 - 72:21The benefit of this sort of approach is you get maybe nicer, maybe more scalable
-
72:21 - 72:22algorithms,
-
72:22 - 72:27and maybe you come up with new elegant learning algorithms. And if your goal is to,
-
72:27 - 72:31you know, contribute to basic research in machine learning, if your goal is to invent new machine learning
-
72:31 - 72:31algorithms,
-
72:31 - 72:34this process of slowing down and
-
72:34 - 72:36thinking deeply about the problem, you know, that is sort of the right way to go
-
72:36 - 72:37about it:
-
72:37 - 72:41think deeply about a problem and invent new solutions.
-
72:41 - 72:42
-
72:42 - 72:44Second sort of approach
-
72:44 - 72:49is what I call build-and-fix, which is you implement something quick and dirty
-
72:49 - 72:52and then you run error analyses and diagnostics to figure out what's wrong and
-
72:52 - 72:54you fix those errors.
-
72:54 - 72:58The benefit of this second type of approach is that it'll often get your
-
72:58 - 73:01application working much more quickly.
-
73:01 - 73:04And especially for those of you who end up working in
-
73:04 - 73:06a company,
-
73:06 - 73:07you know, very often it's not
-
73:07 - 73:11the best product that wins; it's the first product to market that
-
73:11 - 73:12wins. And
-
73:12 - 73:15so there's - especially in the industry. There's really something to be said for,
-
73:15 - 73:19you know, building a system quickly and getting it deployed quickly.
-
73:19 - 73:23And the second approach of building a quick-and-dirty - I'm gonna say hack -
-
73:23 - 73:26and then fixing the problems will actually get you to a
-
73:26 - 73:28system that works well
-
73:28 - 73:31much more quickly.
-
73:31 - 73:33And the reason is
-
73:33 - 73:36very often it's really not clear which parts of a system are easy or hard to
-
73:36 - 73:38build, and therefore what
-
73:38 - 73:40you need to spend lots of time focusing on.
-
73:40 - 73:43So there's that example I talked about just now. Right?
-
73:43 - 73:47For identifying
-
73:47 - 73:49people, say.
-
73:49 - 73:53And with a big complicated learning system like this, a big complicated pipeline like this,
-
73:53 - 73:56it's really not obvious at the outset
-
73:56 - 73:59which of these components you should spend lots of time working on. Right? And if
-
73:59 - 74:01you didn't know that
-
74:01 - 74:04preprocessing wasn't the right component, you could easily have
-
74:04 - 74:07spent three months working on better background subtraction, not knowing that it's
-
74:07 - 74:10just not gonna ultimately matter.
-
74:10 - 74:11And so
-
74:11 - 74:14the only way to find out what really works is implementing something quickly and
-
74:14 - 74:15you find out what parts -
-
74:15 - 74:17and find out
-
74:17 - 74:18what parts
-
74:18 - 74:21are really the hard parts to implement, or what parts could make a
-
74:21 - 74:23difference in performance.
-
74:23 - 74:27In fact, I'd say that if your goal is to build a
-
74:27 - 74:29people recognition system, a system like this is actually far too
-
74:29 - 74:32complicated as your initial system.
-
74:32 - 74:36Maybe after you've prototyped a few systems, you'd converge to a system like this. But if this
-
74:36 - 74:43is your first system you're designing, this is much too complicated. Also, this is a
-
74:44 - 74:48very concrete piece of advice, and this applies to your projects as well.
-
74:48 - 74:51If your goal is to build a working application,
-
74:51 - 74:55Step 1 is actually probably not to design a system like this. Step 1 is to plot your
-
74:55 - 74:57data.
-
74:57 - 75:01And very often, if you just take the data you're trying to predict and just plot your
-
75:01 - 75:06data, plot X, plot Y, plot your data everywhere you can think of,
-
75:06 - 75:10you know, half the time you look at it and go, "Gee, how come all those numbers are negative? I thought they
-
75:10 - 75:14should be positive. Something's wrong with this dataset." And about
-
75:14 - 75:18half the time you find something obviously wrong with your data or something very surprising.
-
75:18 - 75:22And this is something you find out just by plotting your data, and that you
-
75:22 - 75:28won't find out by implementing these big complicated learning algorithms on it. Plotting
-
75:28 - 75:32the data sounds so simple, it was one of the pieces of advice that lots of us give but
-
75:32 - 75:39hardly anyone follows, so you can take that for what it's worth.
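For what it's worth, plotting your data really can be this cheap. Here's a minimal sketch, assuming your dataset sits in a CSV file; the file name and the column names x and y are placeholders.

```python
# "Plot your data first" sketch. File and column names are placeholders.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("my_dataset.csv")   # hypothetical dataset

df.hist(bins=50, figsize=(10, 6))    # distribution of every numeric column
plt.tight_layout()
plt.show()

# Raw input against the target: wrong signs, outliers, and clipped values
# tend to show up immediately in plots like these.
df.plot.scatter(x="x", y="y", s=5)
plt.show()
```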
-
75:39 - 75:42Let me just reiterate, what I just said here may be bad advice
-
75:42 - 75:44if your goal is to come up with
-
75:44 - 75:47new machine learning algorithms. All right? So
-
75:47 - 75:51for me personally, the learning algorithm I use the most often is probably
-
75:51 - 75:54logistic regression because I have code lying around. So give me a
-
75:54 - 75:57learning problem, I probably won't try anything more complicated than logistic
-
75:57 - 75:58regression on it first. And it's
-
75:58 - 76:02only after trying something really simple and figuring out what's easy, what's hard, that you know
-
76:02 - 76:04where to focus your efforts. But
-
76:04 - 76:08again, if your goal is to invent new machine learning algorithms, then you sort of don't
-
76:08 - 76:11want to hack up something and then add another hack to fix it, and hack it even more to
-
76:11 - 76:12fix it. Right? So if
-
76:12 - 76:16your goal is to do novel machine learning research, then it pays to think more deeply about the
-
76:16 - 76:21problem, and not to follow this advice too literally.
-
76:21 - 76:23Shoot, you know what? All
-
76:23 - 76:28right, sorry if I run late, but I just have two more slides so I'm gonna go through these quickly.
-
76:28 - 76:31And so, this is what I think
-
76:31 - 76:33of as premature statistical optimization,
-
76:33 - 76:35where quite often,
-
76:35 - 76:38just like premature optimization of code, quite often
-
76:38 - 76:44people will prematurely optimize one component of a big complicated machine learning system. Okay? Just two more
-
76:44 - 76:47slides. This
-
76:47 - 76:49was -
-
76:49 - 76:52this is a sort of cartoon that highly influenced my own thinking. It was based on
-
76:52 - 76:55a paper written by Christos Papadimitriou.
-
76:55 - 76:57This is how
-
76:57 - 76:59progress - this is how
-
76:59 - 77:02developmental progress of research often happens. Right?
-
77:02 - 77:06Let's say you want to build a mail delivery robot, so I've drawn a circle there that says mail delivery robot. And it
-
77:06 - 77:07seems like a useful thing to have.
-
77:07 - 77:10Right? You know, free up people so they don't have
-
77:10 - 77:13to deliver mail. So -
-
77:13 - 77:14to deliver mail,
-
77:14 - 77:19obviously you need a robot to wander around indoor environments and you need a robot to
-
77:19 - 77:21manipulate objects and pick up envelopes. And so,
-
77:21 - 77:25you need to build those two components in order to get a mail delivery robot. And
-
77:25 - 77:26so I've
-
77:26 - 77:30drawn those two components and little arrows to denote that, you know, obstacle avoidance
-
77:30 - 77:30is
-
77:30 - 77:32needed or would help build
-
77:32 - 77:36your mail delivery robot. Well
-
77:36 - 77:37for obstacle for avoidance,
-
77:37 - 77:43clearly, you need a robot that can navigate and you need to detect objects so you can avoid the obstacles.
-
77:43 - 77:47Now we're gonna use computer vision to detect the objects. And so,
-
77:47 - 77:51we know that, you know, lighting sometimes changes, right, depending on whether it's the
-
77:51 - 77:53morning or noontime or evening.
-
77:53 - 77:54Lighting
-
77:54 - 77:57causes the color of things to change, and so you need
-
77:57 - 78:01an object detection system that's invariant to the specific colors of an
-
78:01 - 78:01object. Right?
-
78:01 - 78:04Because lighting
-
78:04 - 78:05changes,
-
78:05 - 78:10say. Well, color, or RGB value, is represented by a three-dimensional vector. And
-
78:10 - 78:11so you need to learn
-
78:11 - 78:13when two colors might be the same thing -
-
78:13 - 78:15when, you know, the
-
78:15 - 78:18visual appearances of two colors may be the same thing under just a lighting change or
-
78:18 - 78:20something.
-
78:20 - 78:21And
-
78:21 - 78:24to understand that properly, we can go out and study differential geometry
-
78:24 - 78:28of 3d manifolds because that helps us build a sound theory on which
-
78:28 - 78:32to develop our 3d similarity learning algorithms.
-
78:32 - 78:36And to really understand the fundamental aspects of this problem,
-
78:36 - 78:40we have to study the complexity of non-Riemannian geometries. And on
-
78:40 - 78:44and on it goes until eventually you're proving convergence bounds for
-
78:44 - 78:50sampled non-monotonic logic. I don't even know what this is because I just made it up.
-
78:50 - 78:52Whereas in reality,
-
78:52 - 78:54you know, chances are that link isn't real.
-
78:54 - 78:56Color variance
-
78:56 - 79:00just barely helped object recognition maybe. I'm making this up.
-
79:00 - 79:03Maybe differential geometry was hardly gonna help 3d similarity learning and that link's also gonna fail. Okay?
-
79:03 - 79:05So, each of
-
79:05 - 79:09these circles can represent a person, or a research community, or a thought in your
-
79:09 - 79:12head. And there's a very real chance that
-
79:12 - 79:15maybe there are all these papers written on differential geometry of 3d manifolds, and they are
-
79:15 - 79:19written because some guy once told someone else that it'll help 3d similarity learning.
-
79:19 - 79:20And,
-
79:20 - 79:23you know, it's like "A friend of mine told me that color invariance would help in
-
79:23 - 79:26object recognition, so I'm working on color invariance. And now I'm gonna tell a friend
-
79:26 - 79:27of mine
-
79:27 - 79:30that his thing will help my problem. And he'll tell a friend of his that his thing will help
-
79:30 - 79:32with his problem."
-
79:32 - 79:34And pretty soon, you're working on
-
79:34 - 79:38convergence bounds for sampled non-monotonic logic, when in reality none of these will
-
79:38 - 79:39see the light of
-
79:39 - 79:43day in your mail delivery robot. Okay?
-
79:43 - 79:47I'm not criticizing the role of theory. There are very powerful theories, like the
-
79:47 - 79:48theory of VC dimension,
-
79:48 - 79:52which is far, far, far to the right of this. So VC dimension is about
-
79:52 - 79:53as theoretical
-
79:53 - 79:57as it can get. And it's clearly had a huge impact on many applications. And it's,
-
79:57 - 80:00you know, dramatically advanced machine learning. And another example is the theory of NP-hardness, which again, you know,
-
80:00 - 80:01is about as
-
80:01 - 80:04theoretical as it can get. It's
-
80:04 - 80:06had a huge impact
-
80:06 - 80:09on all of computer science, the theory of NP-hardness.
-
80:09 - 80:11But
-
80:11 - 80:14when you are off working on highly theoretical things, I guess, to me
-
80:14 - 80:17personally it's important to keep in mind
-
80:17 - 80:20are you working on something like VC dimension, which is high impact, or are you
-
80:20 - 80:23working on something like convergence bounds for sampled non-monotonic logic, which
-
80:23 - 80:25you're only hoping
-
80:25 - 80:26has some peripheral relevance
-
80:26 - 80:30to some application. Okay?
-
80:30 - 80:35For me personally, I tend to work on an application only if I - excuse me.
-
80:35 - 80:37For me personally, and this is a personal choice,
-
80:37 - 80:41I tend to trust something only if I personally can see a link from the
-
80:41 - 80:43theory I'm working on
-
80:43 - 80:44all the way back to an application.
-
80:44 - 80:46And
-
80:46 - 80:50if I don't personally see a direct link from what I'm doing to an application then,
-
80:50 - 80:53you know, then that's fine. Then I can choose to work on theory, but
-
80:53 - 80:56I wouldn't necessarily trust that
-
80:56 - 80:59the theory I'm working on will relate to an application, if I don't personally
-
80:59 - 81:02see a link all the way back.
-
81:02 - 81:04Just to summarize.
-
81:04 - 81:06
-
81:06 - 81:09One lesson to take away from today is I think
-
81:09 - 81:13time spent coming up with diagnostics for learning algorithms is often time well spent.
-
81:13 - 81:13
-
81:13 - 81:16It's often up to your own ingenuity to come up with great diagnostics. And
-
81:16 - 81:19when I personally work on a machine learning algorithm,
-
81:19 - 81:21it's not uncommon for me to be spending like
-
81:21 - 81:24between a third and often half of my time
-
81:24 - 81:26just writing diagnostics and trying to figure out what's going right and what's
-
81:26 - 81:28going wrong.
-
81:28 - 81:32Sometimes it's tempting not to, right, because you want to be implementing learning algorithms and
-
81:32 - 81:35making progress. You don't want to be spending all this time, you know, implementing tests on your
-
81:35 - 81:38learning algorithms; it doesn't feel like you're doing anything. But when
-
81:38 - 81:41I implement learning algorithms, at least a third, and quite often half of
-
81:41 - 81:46my time, is actually spent implementing those tests so I can figure out what to work on. And
-
81:46 - 81:49I think it's actually one of the best uses of your time. We talked
-
81:49 - 81:51about error
-
81:51 - 81:54analyses and ablative analyses, and lastly
-
81:54 - 81:57talked about, you know, different approaches and the
-
81:57 - 82:01risks of premature statistical optimization. Okay.
-
82:01 - 82:04Sorry I ran over. I'll be here for a few more minutes for your questions.