Lecture 9 | Machine Learning (Stanford)
This presentation is delivered by the Stanford Center for Professional Development.
Okay. So welcome back. What I want to do today is start a new chapter; in particular, I want to talk about learning theory. In the previous, I guess, eight lectures so far, you've learned about a lot of learning algorithms, and I hope you now understand a little about some of the best and most powerful machine learning tools out there. All of you are now well qualified to go into industry and apply powerful learning algorithms, really the most powerful learning algorithms we know, to all sorts of problems, and in fact I hope you start to do that on your projects right away as well.
You might remember, I think it was in the very first lecture, that I made an analogy to learning to be a carpenter: if you imagine you're going to carpentry school, then only a small part of what you need to do is acquire a set of tools. If you learn to be a carpenter, you don't just walk in and pick up a toolbox; knowing, when you need to cut a piece of wood, whether to use a rip saw, a jig saw, or a keyhole saw, really mastering the tools, is also an essential part of becoming a good carpenter. And what I want to do in the next few lectures is actually give you a sense of that mastery of the machine learning tools all of you have. Okay?
And so in particular, in the next few lectures what I want to do is talk more deeply about the properties of different machine learning algorithms, so that you can get a sense of when it's most appropriate to use each one. And it turns out that one of the most common scenarios in machine learning is that someday you'll be doing research, or working at a company, and you'll apply one of the learning algorithms you learned about; you may apply logistic regression, or support vector machines, or Naive Bayes or something, and for whatever bizarre reason, it won't work as well as you were hoping, or it won't quite do what you were hoping it to.

To me, what really separates the people that really understand and really get machine learning, compared to the people that maybe read the textbook and can work through the math, is what you do next. When you apply a support vector machine and it doesn't quite do what you wanted, do you really understand enough about support vector machines to know what to do next and how to modify the algorithm? To me that's often what really separates the great people in machine learning from the people that just read the textbook, worked through the math, and understood only that. Okay?

So today's lecture will mainly be on learning theory, and we'll start to talk about some of the theoretical results of machine learning. The next lecture, later this week, will be on algorithms for addressing, or fixing, some of the problems that the learning theory will point out to us and help us understand. And then, two lectures from now, that lecture will be almost entirely focused on practical advice for how to apply learning algorithms. Okay? So do you have any questions about this before I start? Okay.
So the very first thing we're going to talk about is something that you've probably already seen on the first homework, and something that I alluded to previously, which is the bias-variance trade-off. Take ordinary least squares, the first learning algorithm we learned about: if you fit a straight line through this data, it's not a very good model. Right. And if this happens, we say the model has underfit the data, or we say that this is a learning algorithm with very high bias, because it is failing to fit the evident quadratic structure in the data. For the purposes of this lecture, you can informally think of the bias of the learning algorithm as representing the fact that even if you had an infinite amount of training data, even if you had tons of training data, this algorithm would still fail to fit the quadratic structure in the data. And so we think of this as a learning algorithm with high bias.
Then there's the opposite problem. Take the same dataset: if you fit a fourth-order polynomial to this dataset, you'll be able to interpolate the five data points exactly, but clearly this is also not a great model for the structure that you and I probably see in the data. And we say that this algorithm, excuse me, is overfitting the data, or alternatively that this algorithm has high variance. Okay? And the intuition behind overfitting and high variance is that the algorithm is fitting spurious patterns in the data, fitting idiosyncratic properties of this specific dataset, be it a dataset of housing prices or whatever. And quite often there'll be some happy medium, fitting a quadratic function that maybe won't interpolate your data points perfectly, but captures more structure in your data than a simple model which underfits.
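As a minimal illustration of this underfitting/overfitting picture, here is a short Python sketch (not from the lecture; the five data points and the noisy quadratic they come from are made up): a degree-one fit underfits, a degree-four fit interpolates all five points and drives the training error to essentially zero, and the quadratic is the happy medium.

    import numpy as np

    # Hypothetical data: five points sampled from a noisy quadratic.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 5)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(5)

    for degree in (1, 2, 4):
        coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
        mse = np.mean((y - np.polyval(coeffs, x))**2)
        print(degree, mse)                           # degree 4 interpolates: mse ~ 0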
And you can have exactly the same picture for classification problems as well. So let's say this is my training set of positive and negative examples, and you fit logistic regression with a very high-order polynomial feature set, so that h of x equals the sigmoid function applied to a tenth-order polynomial in x. If you do that, maybe you get a decision boundary like this. Right. It does indeed perfectly separate the positive and negative classes, but this is another example of overfitting. In contrast, if you fit logistic regression to this data with just the linear features, with none of the higher-order features, then maybe you get a decision boundary like that, which can also underfit. Okay.
So what I want to do now is understand this problem of overfitting versus underfitting, of high bias versus high variance, more explicitly. I will do that by posing a more formal model of machine learning and then trying to prove when each of these two twin problems comes up. As the motivating example for our initial foray into learning theory, I want to talk about learning a classifier in which

h_\theta(x) = g(\theta^T x),

and for this class I'm going to use g(z) = 1\{z \ge 0\}, the indicator that z is greater than or equal to zero. With apologies in advance for changing the notation yet again: for the support vector machine lectures we used y equals minus one or plus one, but for the learning theory lectures it turns out it'll be a bit cleaner if I switch back to y equals zero or one, so I'm going to switch back to my original notation. And so you can think of this model as being similar to logistic regression, except that now we're going to force the algorithm to output labels that are either zero or one. Okay? So you can think of this as a classifier that outputs labels zero or one instead of probabilities.
And so, as usual, let's say we're given a training set of m examples,

S = \{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\};

that's just my notation for writing a set of m examples ranging from i equals one through m. And I'm going to assume that the training examples (x^{(i)}, y^{(i)}) are drawn IID, independently and identically distributed, from some distribution, script D. Okay? So if you're running a classification problem on houses, with features of the house paired with whether the house will be sold in the next six months, then this is just the probability distribution over features of houses and whether or not they'll be sold. Okay? So I'm going to assume the training examples are drawn IID from some probability distribution, script D. Same thing for spam: if you're trying to build a spam classifier, then this would be the distribution over what emails look like, paired with whether they are spam or not.
And in particular, to simplify things and understand the phenomenon of bias and variance, I'm actually going to use a simplified model of machine learning. Logistic regression fits the parameters of a model like this by maximizing the log likelihood, but in order to understand learning algorithms more deeply, I'm just going to assume a simplified model of machine learning. Let me just write that down.
So I'm going to define the training error of a hypothesis h_\theta, written epsilon hat of h_\theta; if I want to make the dependence on a training set explicit, I'll write this with a subscript S, as \hat{\epsilon}_S, where S is the training set. And I'll define this as

\hat{\epsilon}(h_\theta) = \frac{1}{m} \sum_{i=1}^m 1\{h_\theta(x^{(i)}) \neq y^{(i)}\}.

Okay, I hope the notation is clear. This is a sum of indicator functions for whether your hypothesis misclassifies the i-th example, and so when you divide by m, this is just the fraction of training examples in your training set that your hypothesis misclassifies. That's defined as the training error, and training error is also called the empirical risk or empirical error.
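In code, the training error is just the misclassification rate on the training set. Here is a minimal Python sketch (not from the lecture; the parameters theta and the data are made up for illustration):

    import numpy as np

    def training_error(h, X, y):
        # Fraction of training examples that hypothesis h misclassifies.
        predictions = np.array([h(x) for x in X])
        return float(np.mean(predictions != y))

    # A linear classifier h(x) = 1{theta^T x >= 0}, with hypothetical theta.
    theta = np.array([0.5, -1.0])
    h = lambda x: int(np.dot(theta, x) >= 0)
    X = np.array([[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]])
    y = np.array([1, 0, 1])
    print(training_error(h, X, y))   # 0.0: this theta classifies all three correctly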
The simplified model of machine learning I'm going to talk about is called empirical risk minimization. In particular, I'm going to assume that the way my learning algorithm works is that it will choose the parameters theta that minimize my training error:

\hat{\theta} = \arg\min_\theta \hat{\epsilon}(h_\theta).

Okay? And it is this learning algorithm that we'll prove properties about. And it turns out that you can think of this as the most basic learning algorithm, the algorithm that minimizes your training error. It turns out that logistic regression and support vector machines can be formally viewed as approximations to it: if you actually want to carry out this minimization directly, it is a nonconvex optimization problem, and it is actually NP-hard to solve. Logistic regression and support vector machines can both be viewed as approximations to this nonconvex optimization problem, obtained by finding a convex approximation to it. So think of empirical risk minimization as similar to what algorithms like logistic regression are doing.

So let me take that definition of empirical risk minimization and rewrite it in a different, equivalent way.
For the results I want to prove today, it turns out that it will be useful to think of our learning algorithm not as choosing a set of parameters, but as choosing a function. So let me say what I mean by that. Let me define the hypothesis class, script H, as the class of all hypotheses, in other words the class of all linear classifiers, that your learning algorithm is choosing from. Okay? So h_\theta is a specific linear classifier; each of these is a function mapping from the input domain X to the class labels \{0, 1\}. Each of these is a function, and as you vary the parameters theta, you get different functions. And so I define the hypothesis class script H to be the class of all functions that, say, logistic regression can choose from; this is the class of all linear classifiers.

And so I'm going to define, or maybe redefine, empirical risk minimization: instead of writing it as choosing a set of parameters, I want to think of it as choosing a function in the hypothesis class script H that minimizes my training error:

\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\epsilon}(h).

Okay? So actually, can you raise your hand if it makes sense to you why this is equivalent to the previous formulation? Okay, cool. Thanks. The reason it's useful to think of the algorithm as choosing a function from a class is that in the more general case this set, script H, can be some other class of functions. Maybe it's the class of all functions representable by a neural network, or some other class of functions the learning algorithm wants to choose from, and this definition of empirical risk minimization will still apply. Okay?
So what we'd like to do is understand whether empirical risk minimization is a reasonable algorithm. Alex?

Student: Is the hypothesis still a function that's defined by g of theta transpose x, or is it now more general?

Instructor (Andrew Ng): I see, right. So the question is whether h_\theta is still defined by g of theta transpose x, or whether this is more general. There are two answers to that. One is, this framework is general: for the purposes of this lecture, it may be useful to keep in mind the example where script H is the class of all linear classifiers, such as those used by the perceptron algorithm or logistic regression. Everything on this board, however, is actually more general. H can be any set of functions mapping from the input domain to the set of class labels zero and one, and then you can perform empirical risk minimization over any hypothesis class. For the purposes of today's lecture, I am going to restrict myself to talking about binary classification, but it turns out everything I say generalizes to regression and other problems as well. Does that answer your question? Yes. Cool.

All right. So I want to understand whether empirical risk minimization is a reasonable algorithm.
In particular, what are the things we can prove about it? So clearly we don't actually care about training error; we don't really care about making accurate predictions on the training set, or at least that's not the ultimate goal. The ultimate goal is generalization: how well it makes predictions on examples that we haven't seen before, how well it predicts the sale-or-no-sale outcomes of houses you haven't seen before. So what we really care about is generalization error, which I write as epsilon of h, and which is defined as the probability that if I sample a new example (x, y) from that distribution, script D, my hypothesis mislabels that example:

\epsilon(h) = P_{(x, y) \sim \mathcal{D}}(h(x) \neq y).

And in terms of notational convention: if I place a hat on top of something, it usually means, not always, but usually means, that it is an attempt to estimate the quantity without the hat. So for example, think of the training error, epsilon hat, as an attempt to approximate the generalization error. Okay? So the notational convention is that the things with hats on top are estimates of the corresponding un-hatted quantities. And h hat is the hypothesis output by the learning algorithm in its attempt to estimate the best function from X to Y. So let's actually prove some things about when empirical risk minimization will do well, in the sense of giving us low generalization error, which is what we really care about.
In order to prove our first learning theory result, I'm going to have to state two lemmas. The first is the union bound, which is the following. Let A_1 through A_k be k events, and when I say events, I mean events in the sense of a probabilistic event that either happens or not, and these are not necessarily independent. So there's some joint distribution over the events A_1 through A_k, and maybe they're independent, maybe not; no assumption on that. Then the probability of A_1 or A_2 or, dot dot dot, up to A_k, where the union symbol is just set notation for "or", so the probability of at least one of these events occurring, is less than or equal to the probability of A_1, plus the probability of A_2, plus dot dot dot, plus the probability of A_k:

P(A_1 \cup A_2 \cup \cdots \cup A_k) \le P(A_1) + P(A_2) + \cdots + P(A_k).

Okay? The intuition behind this: I'm not sure if you've seen Venn diagram depictions of probability before; if you haven't, what I'm about to do may be a little cryptic, so just ignore it. But if you have seen them before, then the picture is that the probability of A_1 union A_2 union A_3 is less than or equal to P of A_1, plus P of A_2, plus P of A_3. Right. The total mass in the union of these three sets is at most the sum of the masses in the three individual sets; it's not very surprising. It turns out that, depending on how you define your axioms of probability, this is actually one of the axioms of probability, so I won't actually try to prove it; it's usually written as an axiom, following from the sigma-additivity of probability measures, as it's sometimes called. But in learning theory it's commonly called the union bound, so I'll just call it that.
22:31 - 22:32other
-
22:32 - 22:39lemma I need is called the Hufting inequality. And
-
22:40 - 22:43again, I won't actually prove this, I'll just state it,
-
22:43 - 22:44which is
-
22:44 - 22:46let's let Z1 up to ZM,
-
22:46 - 22:49BM, IID,
-
22:49 - 22:52there
-
22:52 - 22:55
-
22:55 - 23:02may be random variables with mean Phi.
-
23:07 - 23:09So the probability of ZI equals 1 is equal to
-
23:09 - 23:16Phi.
-
23:16 - 23:18So
-
23:18 - 23:23let's say you observe M IID for newly random variables and you want to estimate their
-
23:23 - 23:24mean.
-
23:24 - 23:29So let me define Phi hat, and this is again that notation, no convention, Phi hat means
-
23:30 - 23:30does not attempt
-
23:30 - 23:34is an estimate or something else. So when we define Phi
-
23:34 - 23:39hat to be 1 over M, semper my equals one through MZI. Okay?
-
23:39 - 23:41So this is
-
23:41 - 23:44our attempt to estimate the mean of these Benuve random variables by sort of taking
-
23:44 - 23:47its average.
-
23:49 - 23:53And let any gamma
-
23:53 - 23:55be fixed.
-
23:59 - 24:01Then,
-
24:01 - 24:04the Hufting inequality
-
24:04 - 24:11is that
-
24:14 - 24:17the probability your estimate of Phi
-
24:17 - 24:21is more than gamma away from the true value of Phi,
-
24:21 - 24:26that this is bounded by two E to the next of two gamma squared. Okay? So
-
24:26 - 24:31just in pictures
-
24:31 - 24:35so this theorem holds this lemma, the Hufting inequality,
-
24:35 - 24:38this is just a statement of fact, this just holds true.
-
24:38 - 24:42But let me now draw a cartoon to describe some of the intuition behind this, I
-
24:42 - 24:44guess.
-
24:44 - 24:45So
-
24:45 - 24:49lets say [inaudible] this is a real number line from zero to one.
-
24:49 - 24:53And so Phi is the mean of your Benuve random variables.
-
24:54 - 24:56You will remember from you know,
-
24:56 - 24:59whatever some undergraduate probability or statistics class,
-
24:59 - 25:03the central limit theorem that says that when you average all the things together,
-
25:03 - 25:05you tend to get a Gaussian distribution.
-
25:05 - 25:10And so when you toss M coins with bias Phi, we observe these M
-
25:10 - 25:12Benuve random variables,
-
25:12 - 25:18and we average them, then the probability distribution of
-
25:18 - 25:25Phi hat
-
25:26 - 25:29will roughly be a Gaussian lets say. Okay? It
-
25:29 - 25:30turns out if
-
25:30 - 25:33you haven't seen this up before, this is actually that the
-
25:33 - 25:36cumulative distribution function of Phi hat will converse with that of the Gaussian.
-
25:36 - 25:40Technically Phi hat can only take on a discreet set of values
-
25:40 - 25:43because these are factions one over Ms. It doesn't really have an
-
25:43 - 25:45entity but just as a cartoon
-
25:45 - 25:49think of it as a converse roughly to a Gaussian.
-
25:49 - 25:56So what the Hufting inequality says is that if you pick a value of gamma, let me put
-
25:56 - 26:00S one interval gamma there's another interval gamma.
-
26:00 - 26:03Then the saying that the probability mass of the details,
-
26:03 - 26:06in other words the probability that
-
26:06 - 26:08my value of Phi hat is more than
-
26:08 - 26:11a gamma away from the true value,
-
26:11 - 26:16that the total mass
-
26:16 - 26:18that the
-
26:18 - 26:22total probability mass in these tails is at most two
-
26:22 - 26:26E to the negative two gamma squared M. Okay?
-
26:26 - 26:28That's what the Hufting inequality so if you
-
26:28 - 26:31can't read that this just says this is just the right hand side of the bound, two E to
-
26:31 - 26:33negative two gamma squared.
-
26:33 - 26:36So balance the probability that you make a mistake in estimating the
-
26:36 - 26:42mean of a Benuve random variable.
-
26:42 - 26:43And the
-
26:43 - 26:47cool thing about this bound the interesting thing behind this bound is that
-
26:48 - 26:51the [inaudible] exponentially in M,
-
26:51 - 26:52so it says
-
26:52 - 26:54that for a fixed value of gamma,
-
26:54 - 26:56as you increase the size
-
26:56 - 26:58of your training set, as you toss a coin more and more,
-
26:58 - 27:02then the worth of this Gaussian will shrink. The worth of this
-
27:02 - 27:06Gaussian will actually shrink like one over root to M.
-
27:06 - 27:12And that will cause the probability mass left in the tails to decrease exponentially,
-
27:12 - 27:19quickly, as a function of that. And this will be important later. Yeah? Student: Does this come from the
-
27:21 - 27:23central limit theorem [inaudible]. Instructor (Andrew Ng): No it doesn't. So this is proved by a different
-
27:23 - 27:27this is proved no so the central limit theorem there may be a
-
27:27 - 27:29version of the central limit theorem, but the
-
27:29 - 27:33versions I'm familiar with tend are sort of asymptotic, but this works
-
27:33 - 27:36for any finer value of M. Oh, and for your
-
27:36 - 27:40this bound holds even if M is equal to two, or M is [inaudible], if M is
-
27:40 - 27:41very small,
-
27:41 - 27:45the central limit theorem approximation is not gonna hold, but this theorem holds
-
27:45 - 27:46regardless.
-
27:46 - 27:48Okay? I'm
-
27:48 - 27:52drawing this just as a cartoon to help explain the intuition, but this
-
27:52 - 27:59theorem just holds true, without reference to central limit
-
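Here is a minimal Monte Carlo sketch in Python (not from the lecture; phi, gamma, m, and the trial count are all made up) comparing the empirical tail probability against the Hoeffding bound:

    import numpy as np

    # Made-up constants: true mean phi, threshold gamma, sample size m, trial count.
    rng = np.random.default_rng(0)
    phi, gamma, m, trials = 0.3, 0.1, 50, 100_000

    z = rng.random((trials, m)) < phi              # trials x m Bernoulli(phi) draws
    phi_hat = z.mean(axis=1)                       # one estimate of phi per trial
    empirical = np.mean(np.abs(phi_hat - phi) > gamma)
    bound = 2 * np.exp(-2 * gamma**2 * m)
    print(empirical, bound)                        # empirical tail mass stays below the bound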
All right. So let's start to understand empirical risk minimization, and what I want to do is begin by studying empirical risk minimization in a simpler case than logistic regression. In particular, I want to start with the case of finite hypothesis classes. So let's say script H is a class of k hypotheses,

\mathcal{H} = \{h_1, \ldots, h_k\}.

Right. So this is k functions with, you know, no parameters in them now; each of these is just a function mapping from the inputs to the outputs. And so what empirical risk minimization would do is take the training set, look at each of these k functions, and pick whichever of these functions has the lowest training error. Okay? Now, logistic regression uses a continuous, infinitely large class of hypotheses, script H, but to prove the first result I actually want to describe our first learning theorem entirely for the case where you have a finite hypothesis class; we'll later generalize it to infinite hypothesis classes.
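A minimal Python sketch of empirical risk minimization over a finite class (not from the lecture; the three threshold classifiers and the data are made up for illustration):

    import numpy as np

    def erm(hypotheses, X, y):
        # Pick the hypothesis with the lowest training error.
        errors = [np.mean(np.array([h(x) for x in X]) != y) for h in hypotheses]
        return hypotheses[int(np.argmin(errors))]

    # Hypothetical finite class: k = 3 threshold classifiers on a 1-D input.
    H = [lambda x, t=t: int(x >= t) for t in (0.25, 0.5, 0.75)]
    X = np.array([0.1, 0.4, 0.6, 0.9])
    y = np.array([0, 0, 1, 1])
    h_hat = erm(H, X, y)
    print(h_hat(0.45), h_hat(0.55))   # the t = 0.5 threshold wins on this data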
So empirical risk minimization picks the hypothesis with the lowest training error, and what I'd like to do is prove a bound on the generalization error of h hat. All right? In other words, I'm going to prove that, somehow, minimizing training error allows me to do well on generalization error. And here's the strategy: the first step in this proof is to show that training error is a good approximation to generalization error, and the second step is to show that this implies a bound on the generalization error of the hypothesis output by empirical risk minimization.

And I just realized, this class, I guess, is also a slightly notation-heavy class, what with introducing a reasonably large set of new symbols. So if, in the course of today's lecture, you're looking at some symbol and you don't quite remember what it is, please raise your hand and ask. What's that, what was that, was that generalization error or was it something else? So raise your hand and ask if you don't understand the notation I'm defining.

Okay. So let me carry this out in two steps. The strategy is: I'm going to show that training error is a good approximation to generalization error, and this will imply that minimizing training error will also do pretty well in terms of minimizing generalization error. And that will give us a bound on the generalization error of the hypothesis output by empirical risk minimization. Okay?
So here's the idea. Let's not even consider all the hypotheses at once; let's pick any one hypothesis, h_j, in the class script H, and until further notice let's just consider that one fixed hypothesis. So pick any one hypothesis and let's talk about that one. Let me define Z_i to be the indicator for whether this hypothesis misclassifies the i-th example:

Z_i = 1\{h_j(x^{(i)}) \neq y^{(i)}\}.

Okay? So Z_i will be zero or one depending on whether this one hypothesis, which is the only one I'm even going to consider now, misclassifies the i-th example. Now, my training set is drawn randomly from some distribution, script D, and depending on what training examples I've got, these Z_i's will be either zero or one. So let's figure out the probability distribution of Z_i. Well, Z_i takes on the value of either zero or one, so clearly it is a Bernoulli random variable; it can only take on these values. And what's the probability that Z_i is equal to one? In other words, what's the probability that, for my fixed hypothesis h_j, when I sample my training set IID from distribution D, my hypothesis will misclassify an example? Well, by definition, that's just the generalization error of my hypothesis h_j. So Z_i is a Bernoulli random variable with mean given by the generalization error of this hypothesis.

Raise your hand if that made sense. Oh, cool. Great. And moreover, all the Z_i's have the same probability of being one, and all my training examples were drawn IID, and so the Z_i's are also independent, and therefore the Z_i's themselves are IID random variables. Okay? Because my training examples were drawn independently of each other, by assumption.
If you look back at the definition of training error, the training error of my hypothesis h_j is just the average of my Z_i's, which, well, I previously defined like this:

\hat{\epsilon}(h_j) = \frac{1}{m} \sum_{i=1}^m Z_i.

Okay? And so epsilon hat of h_j is exactly the average of m IID Bernoulli random variables, each of which has mean given by the generalization error of h_j. And therefore, by the Hoeffding inequality, the probability that the difference between training and generalization error is greater than gamma satisfies

P(|\epsilon(h_j) - \hat{\epsilon}(h_j)| > \gamma) \le 2 \exp(-2 \gamma^2 m),

exactly by the Hoeffding inequality. And what this proves is that, for my fixed hypothesis h_j, my training error, epsilon hat, will, with high probability, be close to the generalization error, assuming m is large: if m is large, then the thing on the right-hand side, two e to the negative two gamma squared m, will be small. So this says that if my training set is large enough, then the probability that my training error is far from the generalization error, meaning more than gamma away, will be small, bounded by the thing on the right-hand side. Okay?
Now, here's the slightly tricky part. What we've done is prove this bound for one fixed hypothesis, for h_j. What I want to prove is that training error will be a good estimate of generalization error not just for this one hypothesis h_j, but for all k hypotheses in my hypothesis class, script H. So let's do it; well, better do it on a new board. In order to show that, let me define a random event: let me define A_j to be the event that the difference between training and generalization error is more than gamma on the hypothesis h_j, that is, the event that |\epsilon(h_j) - \hat{\epsilon}(h_j)| > \gamma. And so what we put on the previous board was that the probability of A_j is less than or equal to two e to the negative two gamma squared m, and this is pretty small.

Now, what I want to bound is the probability that there exists some hypothesis in my class, script H, such that I make a large error in my estimate of its generalization error. Okay? Such that this holds true. So this is really just the probability that there exists a hypothesis for which this holds, and that is really the probability that A_1 or A_2, or up to A_k, holds. The chance that there exists such a hypothesis is just, well, the probability that for hypothesis one I make a large error in estimating the generalization error, or for hypothesis two I make a large error in estimating the generalization error, and so on. And so by the union bound, this is less than or equal to the sum of the probabilities of the A_j's, which is therefore less than or equal to k times the Hoeffding bound:

P(\exists h \in \mathcal{H} : |\epsilon(h) - \hat{\epsilon}(h)| > \gamma) \le \sum_{j=1}^k P(A_j) \le 2k \exp(-2 \gamma^2 m).

Okay?
So let me just take one minus both sides of the inequality on the previous board. The probability that there does not exist a hypothesis such that the estimate is off by more than gamma, in other words the probability that there is no hypothesis on which I make a large error in this estimate, is equal to the probability that, for all hypotheses, I make an error of at most gamma in my estimate of the generalization error. And taking one minus on the right-hand side, I get one minus two k e to the negative two gamma squared m:

P(\forall h \in \mathcal{H} : |\epsilon(h) - \hat{\epsilon}(h)| \le \gamma) \ge 1 - 2k \exp(-2 \gamma^2 m).

Okay? And the direction of the inequality flipped because I took one minus both sides; the minus sign flips the direction of the inequality. So what we've shown is that with probability, which I'll abbreviate to w.p., at least one minus two k e to the negative two gamma squared m, we have that epsilon hat of h will be within gamma of epsilon of h, simultaneously for all hypotheses in our class, script H.

And so, just to give this result a name, this is one instance of what's called a uniform convergence result, and the term uniform convergence alludes to the fact that this shows that, as m becomes large, these epsilon hats will all simultaneously converge to epsilon of h. That is, training error will become very close to generalization error simultaneously for all hypotheses h. That's what the term uniform refers to: the fact that this converges for all hypotheses h, and not just for one hypothesis. And so what we've shown is one example of a uniform convergence result. Okay? So let me clean a couple more boards; I'll come back and ask what questions you have about this. We should take another look at this and make sure it all makes sense. Yeah, okay.
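To make the uniform convergence statement concrete, here is a minimal simulation sketch in Python (not from the lecture; the distribution, the five threshold hypotheses, and all constants are made up). It draws many training sets, checks how often any hypothesis in the class has |epsilon(h) - epsilon-hat(h)| > gamma, and compares that frequency with the bound 2k e^(-2 gamma^2 m):

    import numpy as np

    # Made-up constants and hypothesis class for illustration.
    rng = np.random.default_rng(0)
    gamma, m, trials = 0.1, 200, 20_000

    # k threshold classifiers h_t(x) = 1{x >= t} on x ~ Uniform[0, 1], with true
    # labels y = 1{x >= 0.5}, so the generalization error of h_t is exactly |t - 0.5|.
    thresholds = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
    k = len(thresholds)
    true_eps = np.abs(thresholds - 0.5)

    failures = 0
    for _ in range(trials):
        x = rng.random(m)
        y = x >= 0.5
        # Training error of every h_t at once on this draw of the training set.
        eps_hat = np.mean((x[:, None] >= thresholds) != y[:, None], axis=0)
        failures += np.any(np.abs(eps_hat - true_eps) > gamma)

    print(failures / trials, 2 * k * np.exp(-2 * gamma**2 * m))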
What questions do you have about this?

Student: How is the value of gamma computed?

Instructor (Andrew Ng): Right. Yeah. So let's see, the question is how the value of gamma is computed. For the purposes of this, gamma is a constant. Imagine gamma is some constant that we chose in advance, and this is a bound that holds true for any fixed value of gamma. Later on, as we take this bound and develop this result further, we'll choose specific values of gamma as a function of this bound. For now, just imagine that what we've proved holds true for any value of gamma. Any questions? Yeah?

Student: What if the hypothesis class is infinite?

Instructor (Andrew Ng): Yes, if the hypothesis class is infinite, then this simple result won't work in its present form, but we'll generalize this, probably we won't get to it today, but we'll generalize it at the beginning of the next lecture, to infinite hypothesis classes.

Student: How do we use this theory in practice?

Instructor (Andrew Ng): How do you use these theorems? So, let me see, I might get to a little of that later today; we'll talk concretely about algorithms, about the consequences of understanding these things, in the next lecture as well. Yeah, okay? Cool. Can you just raise your hand if the things I've proved so far make sense? Okay. Cool. Great.
Thanks. All right. Let me just take this uniform convergence bound and rewrite it in a couple of other forms. This is a bound on a probability; it says, suppose I fix my training set size and fix my error threshold gamma, then what is the probability that uniform convergence holds? And, well, that's my formula that gives the answer: this is the probability of something happening. So there are actually three parameters of interest. One is: what is this probability? Another parameter is: what's the training set size m? And the third parameter is: what is the value of this error threshold gamma? I'm not going to vary k for these purposes.

So there are two other equivalent forms of the bound. What we proved was: given gamma and given m, what is the probability of uniform convergence? One of the other equivalent forms asks: given gamma, and given the probability delta of making a large error, how large a training set size do you need in order to get a uniform convergence bound with parameters gamma and delta? And so, well, if you set delta equal to two k e to the negative two gamma squared m, that's the form I had on the left, and if you solve for m, what you find is an equivalent form of this result that says: so long as your training set size satisfies

m \ge \frac{1}{2\gamma^2} \log \frac{2k}{\delta},

and this is the formula I get by solving for m, okay, so long as m is greater than or equal to this, then with probability, which I'm abbreviating to w.p. again, at least one minus delta, we have |\epsilon(h) - \hat{\epsilon}(h)| \le \gamma for all h in script H. Okay?
So this says how large a training set size I need to guarantee that, with probability at least one minus delta, the training error is within gamma of the generalization error for all my hypotheses, and this gives the answer. And just to give this another name, this is an example of a sample complexity bound. From undergrad computer science classes you may have heard of computational complexity, which is how much computation you need to achieve something. Sample complexity just means how large a training set, how large a sample of examples, you need in order to achieve a certain bound on the error. And it turns out that many of the theorems we write out can be posed in the form of a probability bound, or a sample complexity bound, or some other form. I personally often find the sample complexity bounds the easiest to interpret, because they say how large a training set you need to give a certain bound on the errors. And in fact, well, we'll see this later: sample complexity bounds often help to give guidance, if you're trying to achieve something on a machine learning problem, on how much training data you need to achieve it.
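As a sketch of what this bound looks like numerically (not from the lecture; the values of k, gamma, and delta are made up), the sample complexity form can be evaluated directly:

    import numpy as np

    def sample_complexity(k, gamma, delta):
        # m >= (1 / (2 gamma^2)) * log(2k / delta)
        return (1.0 / (2.0 * gamma**2)) * np.log(2.0 * k / delta)

    print(sample_complexity(k=100, gamma=0.05, delta=0.01))      # about 1,981 examples
    print(sample_complexity(k=100_000, gamma=0.05, delta=0.01))  # about 3,362: k grew 1,000x, m barely doubled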
The one thing I want to note here is that m grows like the log of k, right, and the log of k grows extremely slowly as a function of k. The log is one of the slowest growing functions. It's one of, well, some of you may have heard this, right? I learned this from a colleague, Andrew Moore, at Carnegie Mellon: that in computer science, for all practical purposes, for all values of k, log k is less than [inaudible], and this is almost true. So the log is one of the slowest growing functions, and the fact that the sample complexity m grows like the log of k means that you can increase the number of hypotheses in your hypothesis class quite a lot, and the number of training examples you need won't grow very much. This property will be important later when we talk about infinite hypothesis classes.
The final form is what is, I guess, sometimes called the error bound, which is what you get when you hold m and delta fixed and solve for gamma. And what you get then is that, with probability at least one minus delta, we have that, for all hypotheses in my hypothesis class, the difference between training and generalization error will be less than or equal to

|\hat{\epsilon}(h) - \epsilon(h)| \le \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.

Okay? And that's just solving for gamma and plugging the resulting value in there. Okay? All right.
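And the error-bound form, solving for gamma with m and delta held fixed, in the same style (values made up for illustration):

    import numpy as np

    def error_bound(k, m, delta):
        # gamma = sqrt((1 / (2m)) * log(2k / delta))
        return np.sqrt((1.0 / (2.0 * m)) * np.log(2.0 * k / delta))

    print(error_bound(k=100, m=2000, delta=0.01))   # roughly 0.05 for these values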
So the second step of the overall proof I want to execute is the following. The result so far is essentially that uniform convergence will hold true with high probability. What I want to show now is: let's assume that uniform convergence holds, so let's assume that for all hypotheses h we have

|\epsilon(h) - \hat{\epsilon}(h)| \le \gamma.     (1)

Okay? What I want to do now is use this to see what we can prove about the generalization error of the hypothesis that minimizes training error.
So I want to know: supposing this holds true, can we prove something about the generalization error of h hat, where, again, h hat was the hypothesis selected by empirical risk minimization? Okay? So in order to show this, let me make one more definition. Let me define h star to be the hypothesis in my class, script H, that has the smallest generalization error:

h^* = \arg\min_{h \in \mathcal{H}} \epsilon(h).

So this is: if I had an infinite amount of training data, or if I really could go in and find the best possible hypothesis, best in the sense of minimizing generalization error, what's the hypothesis I would get? Okay? So in some sense, it makes sense to compare the performance of our learning algorithm to the performance of h star, because we clearly can't hope to do better than h star. Another way of saying that is: if your hypothesis class is the class of all linear decision boundaries, and the data just can't be separated by any linear function, so that even h star is really bad, then there's just not much hope that your learning algorithm could do better than h star. So I'll actually prove this result in three steps.
The generalization error of h hat, the hypothesis I chose, is going to be bounded by the following chain; actually, let me number these steps:

\epsilon(\hat{h}) \le \hat{\epsilon}(\hat{h}) + \gamma \le \hat{\epsilon}(h^*) + \gamma \le \epsilon(h^*) + 2\gamma.

The first step is because of equation (1): epsilon of h hat and epsilon hat of h hat are within gamma of each other. Now, by the definition of empirical risk minimization, h hat was chosen to minimize training error, and so there can't be any hypothesis with lower training error than h hat; the training error of h hat must be less than or equal to the training error of h star. So the second step is, sort of, by the definition of h hat as the hypothesis that minimizes training error. And the final step is applying this uniform convergence result again: we know that epsilon hat of h star must be within gamma of epsilon of h star, and so this is at most epsilon of h star plus gamma; then I have my original gamma there. Okay? And so this is by equation (1) again, because, excuse me, I know the training error of h star must be within gamma of the generalization error of h star. And so, well, I'll just write this as plus two gamma. Okay? Yeah?
Student: A question about notation: epsilon hat of h hat, is that not the training error, but the generalization error of the estimated hypothesis?

Instructor (Andrew Ng): Oh, okay. Let me just, well, let me write that down on this board; let me fit this in here. So epsilon hat of h is the training error of the hypothesis h. In other words, given the hypothesis, and a hypothesis is just a function, right, mapping from the x's to the y's, epsilon hat of h is: given the hypothesis h, what's the fraction of training examples it misclassifies? And the generalization error, epsilon of h, is: given the hypothesis h, if I sample another example from my distribution, script D, what's the probability that h will misclassify that example? Does that make sense? Oh, okay. And h hat is the hypothesis that's chosen by empirical risk minimization. So when I talk about empirical risk minimization, that is the algorithm that minimizes training error, and so epsilon hat of h is the training error of h, and h hat is defined as the hypothesis that, out of all hypotheses in my class script H, minimizes the training error epsilon hat of h. Okay? All right. Yeah?

Student: And h hat is a member of script H, the same family, right?

Instructor (Andrew Ng): Yes, it is.

Student: So what happens with the generalization error [inaudible]?
Instructor (Andrew Ng): I'll talk about that later. So let me tie all these things together into a theorem. Let there be a hypothesis class, script H, given by a finite set of k hypotheses, and let any m and delta be fixed. So I fix m and delta, so this will be the error bound form of the theorem, right? Then, with probability at least one minus delta, we have that the generalization error of h hat is less than or equal to the minimum over all hypotheses h in script H of epsilon of h, plus two times that square root term:

\epsilon(\hat{h}) \le \left( \min_{h \in \mathcal{H}} \epsilon(h) \right) + 2 \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.

Okay?
So to prove this: well, this first term is of course just epsilon of h star. And so to prove this we set gamma equal to that, this being two times the square root term. Say that again? Wait. Say that again? Oh, yes. Thank you. That didn't make sense at all. Thanks. Great. To prove this theorem we set gamma equal to just the square root term, without the factor of two. And so we know that equation (1), right, from the previous board, holds with probability one minus delta. Equation (1) was the uniform convergence result, right, that is,

|\epsilon(h) - \hat{\epsilon}(h)| \le \gamma \text{ for all } h \in \mathcal{H}.

This is equation (1) from the previous board. So with gamma set equal to this, we know that with probability one minus delta this uniform convergence holds, and whenever that holds, that implies, you know, I guess, if we call this equation star, I guess, that whenever uniform convergence holds, we showed, again, on the previous boards, that this result holds: that the generalization error of h hat is less than or equal to the generalization error of h star plus two times gamma. Okay? And so that proves this theorem.
62:42 - 62:44this result sort of helps us to quantify
-
62:44 - 62:45a little bit
-
62:45 - 62:48that bias variance tradeoff
-
62:48 - 62:50that I talked about
-
62:50 - 62:54at the beginning of actually near the very start of this lecture.
-
62:54 - 62:58And in particular
-
62:58 - 63:04let's say I have some hypothesis class script H, that
-
63:04 - 63:06I'm using, maybe as a class of all
-
63:06 - 63:10linear functions and linear regression, and logistic regression with
-
63:10 - 63:12just the linear features.
-
63:12 - 63:15And let's say I'm considering switching
-
63:15 - 63:20to some new class H prime by having more features. So lets say this is
-
63:20 - 63:23linear
-
63:23 - 63:27and this is quadratic,
-
63:27 - 63:28
-
63:28 - 63:32so the class of all linear functions and the subset of the class of all quadratic functions,
-
63:32 - 63:35and so H is the subset of H prime. And
-
63:35 - 63:37let's say I'm considering
-
63:37 - 63:41instead of using my linear hypothesis class let's say I'm considering switching
-
63:41 - 63:45to a quadratic hypothesis class, or switching to a larger hypothesis
-
63:45 - 63:46class.
-
63:46 - 63:47
-
63:47 - 63:50Then what are the tradeoffs involved? Well, I
-
63:50 - 63:53proved this only for finite hypothesis classes, but we'll see that something
-
63:53 - 63:56very similar holds for infinite hypothesis classes too.
-
63:56 - 63:58But the tradeoff is
-
63:58 - 64:01what if I switch from H to H prime, or I switch from linear to quadratic
-
64:01 - 64:04functions. Then
-
64:04 - 64:08epsilon of H star will become better because
-
64:08 - 64:12the best hypothesis in my hypothesis class will become better.
-
64:12 - 64:16The best quadratic function,
-
64:16 - 64:19and by best I mean in the sense of generalization error,
-
64:19 - 64:25that is, the quadratic function with the lowest generalization error,
-
64:25 - 64:28has to have equal or, more likely, lower generalization error than the best
-
64:28 - 64:30linear function.
-
64:30 - 64:32
-
64:32 - 64:38So by switching to a more complex hypothesis class, you can get this first term to go down.
-
64:38 - 64:40But what I pay for then is that
-
64:40 - 64:43K will increase.
-
64:43 - 64:46By switching to a larger hypothesis class,
-
64:46 - 64:48the first term will go down,
-
64:48 - 64:52but the second term will increase because I now have a larger class of
-
64:52 - 64:53hypotheses
-
64:53 - 64:55and so the second term K
-
64:55 - 64:57will increase.
-
64:57 - 65:03And so this is usually called the bias variance tradeoff, whereby
-
65:03 - 65:08in going to a larger hypothesis class maybe I have the hope of finding a better function,
-
65:08 - 65:14but my risk of sort of not fitting my model so accurately also increases, and
-
65:14 - 65:19that's illustrated by the second term going up when the
-
65:19 - 65:28size of your hypothesis class, K, goes up.
-
65:28 - 65:32And so
-
65:32 - 65:35speaking very loosely,
-
65:35 - 65:37we can think of
-
65:37 - 65:40this first term as corresponding
-
65:40 - 65:41maybe to the bias
-
65:41 - 65:45of the learning algorithm, or the bias of the hypothesis class.
-
65:45 - 65:50And you can again speaking very loosely,
-
65:50 - 65:53think of the second term as corresponding
-
65:53 - 65:58to the variance in your hypothesis class, in other words
-
65:58 - 66:03how well you can actually fit this hypothesis class to the data.
-
66:03 - 66:07And by switching to a more complex hypothesis class, your variance increases and your bias decreases.
-
66:07 - 66:09
-
66:09 - 66:14As a note of warning, it turns out that if you've taken a statistics class you've seen
-
66:14 - 66:15definitions of bias and variance,
-
66:15 - 66:18which are often defined in terms of squared error or something.
-
66:18 - 66:22It turns out that for classification problems,
-
66:22 - 66:25there actually is no
-
66:25 - 66:26
-
66:26 - 66:27universally accepted
-
66:27 - 66:29formal definition of
-
66:29 - 66:31bias and
-
66:31 - 66:35variance for classification problems. For regression problems, there is this square error
-
66:35 - 66:39definition. For classification problems it
-
66:39 - 66:40turns out there've
-
66:40 - 66:44been several competing proposals for definitions of bias and variance. So when I
-
66:44 - 66:45say bias and variance here,
-
66:45 - 66:47think of these as very loose, informal,
-
66:47 - 66:54intuitive definitions, and not formal definitions. Okay.
-
67:00 - 67:07The
-
67:15 - 67:19cartoon associated with the intuition I just described would be
-
67:19 - 67:27as follows. Let's say,
-
67:27 - 67:30and everything about the plot will be for a fixed value of M, for a
-
67:30 - 67:35fixed training set size M, that on the vertical axis I'll plot error and
-
67:35 - 67:44on the horizontal axis I'll plot model complexity.
-
67:44 - 67:47And by model complexity I mean
-
67:47 - 67:53sort of degree of polynomial,
-
67:53 - 67:58
-
67:58 - 68:00size of your
-
68:00 - 68:05hypothesis class script H, etc. It actually turns out
-
68:05 - 68:14that the bandwidth parameter from
-
68:14 - 68:17locally weighted linear regression also
-
68:17 - 68:21has a similar effect in controlling how complex your model is. For this plot, take model complexity to be, say, the degree of the polynomial.
-
68:21 - 68:23So the more complex your model,
-
68:23 - 68:25the better your
-
68:25 - 68:26training error,
-
68:26 - 68:30and so
-
68:30 - 68:33your training error will tend toward zero as you increase the complexity of your model, because the
-
68:33 - 68:36more complex your model, the better you can fit your training set.
-
68:36 - 68:38But
-
68:38 - 68:45because of this bias variance tradeoff,
-
68:45 - 68:49you find
-
68:49 - 68:53that
-
68:53 - 68:57generalization error will come down for a while and then it will go back up.
-
68:57 - 69:04And this regime on the left is when you're underfitting the data or when
-
69:04 - 69:06you have high bias.
-
69:06 - 69:08And this regime
-
69:08 - 69:10on the right
-
69:10 - 69:16is when you have high variance or
-
69:16 - 69:21you're overfitting the data. Okay?
-
69:21 - 69:24And this is why a model of sort of intermediate
-
69:24 - 69:29complexity, somewhere here,
-
69:29 - 69:33is often preferable if your goal is to minimize generalization error.
-
69:33 - 69:36Okay?
-
69:36 - 69:40So that's just a cartoon. In the next lecture we'll actually talk about a number of
-
69:40 - 69:40algorithms
-
69:40 - 69:44for trying to automatically select model complexity, say to get
-
69:44 - 69:55you as close as possible to this area of minimized generalization error.
-
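[Illustration, not from the lecture: a minimal Python sketch of this cartoon. It fits polynomials of increasing degree to one fixed noisy training set and prints training and held-out error; the data-generating function and all parameter values here are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Noisy observations of a fixed underlying function.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=n)
    return x, y

m = 30                          # fixed training set size, as in the plot
x_train, y_train = sample(m)
x_test, y_test = sample(10000)  # large held-out set approximates generalization error

for degree in range(1, 13):     # "model complexity" = degree of polynomial
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, held-out MSE {test_mse:.3f}")

Training error typically decreases monotonically with degree, while held-out error comes down and then goes back up, mirroring the underfitting and overfitting regimes above.]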
70:11 - 70:15The last thing I want to do is actually
-
70:15 - 70:20go back to the theorem I wrote out.
-
70:20 - 70:25The theorem I wrote out was an error bound theorem: it says that for fixed M and delta, with probability
-
70:25 - 70:29one minus delta, I get an error bound in terms of
-
70:29 - 70:31gamma, which is what this term is.
-
70:31 - 70:35So the very last thing I wanna do today is just come back to this theorem and write out a
-
70:35 - 70:36corollary
-
70:36 - 70:40where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta
-
70:40 - 70:41and solve for M.
-
70:41 - 70:46And if you do that,
-
70:46 - 70:47you
-
70:47 - 70:49get the following corollary: Let H
-
70:49 - 70:52be
-
70:52 - 70:55fixed with K hypotheses
-
70:55 - 70:59and let
-
70:59 - 71:01any delta
-
71:01 - 71:04and gamma be fixed.
-
71:09 - 71:11Then
-
71:11 - 71:18in order to guarantee that,
-
71:23 - 71:25let's say I want a guarantee
-
71:25 - 71:27that the generalization error
-
71:27 - 71:31of the hypothesis I choose with empirical risk minimization,
-
71:31 - 71:34that this is at most two times gamma worse
-
71:34 - 71:38than the best possible error I could obtain with this hypothesis class.
-
71:38 - 71:40Lets say I want this to
-
71:40 - 71:45hold true with probability at least one minus delta,
-
71:45 - 71:50then it suffices
-
71:50 - 71:54that M
-
71:54 - 72:01is greater than or equal to that. Okay? And this is
-
72:01 - 72:06sort of
-
72:06 - 72:08solving the error
-
72:08 - 72:10bound for M. The easy part of this, which you should convince yourselves of, is that
-
72:10 - 72:15if you set that square root term equal to gamma and solve for M, you will get this.
-
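[Boardwork, reconstructed: the corollary states that to guarantee

$$\varepsilon(\hat{h}) \;\le\; \Big(\min_{h \in \mathcal{H}} \varepsilon(h)\Big) + 2\gamma$$

with probability at least 1 - delta, it suffices that

$$m \;\ge\; \frac{1}{2\gamma^{2}}\log\frac{2K}{\delta}.$$]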
72:15 - 72:19One thing I want you to go home and sort of convince yourselves of is that this
-
72:19 - 72:21result really holds true.
-
72:21 - 72:24That this really logically follows from the theorem we've proved.
-
72:24 - 72:26In other words,
-
72:26 - 72:30you can take that formula we wrote and solve for M, and this is the formula you get for M.
-
72:30 - 72:32That's just the easy part. But do
-
72:32 - 72:34go back and convince yourselves that
-
72:34 - 72:38this theorem is a true fact and that it does indeed logically follow from the
-
72:38 - 72:39other one. In
-
72:39 - 72:40particular,
-
72:40 - 72:44make sure that if you solve for that you really get M greater than or equal to this, and see why it is M
-
72:44 - 72:47greater than or equal to that and not M less than or equal to.
-
72:47 - 72:50I can write this down and it sounds plausible, but why don't you go back and convince yourself it is really true. Okay?
-
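[For reference, the algebra of the easy part: setting gamma equal to the square root term and solving,

$$\gamma = \sqrt{\frac{1}{2m}\log\frac{2K}{\delta}} \;\iff\; \gamma^{2} = \frac{1}{2m}\log\frac{2K}{\delta} \;\iff\; m = \frac{1}{2\gamma^{2}}\log\frac{2K}{\delta},$$

and since a larger m only shrinks gamma, requiring m to be at least this quantity keeps the error term at most gamma.]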
72:53 - 72:55And
-
72:55 - 72:58it turns out that when we prove these bounds in learning theory,
-
72:58 - 73:03very often the constants are sort of loose. So when we prove
-
73:03 - 73:06these bounds,
-
73:06 - 73:10usually we're not very interested in the constants, and so I
-
73:10 - 73:12write this as big O of
-
73:12 - 73:16one over gamma squared, log
-
73:16 - 73:18K over delta,
-
73:18 - 73:21and again, the key
-
73:21 - 73:24point in this is that the dependence of M
-
73:24 - 73:27on the size of the hypothesis class is logarithmic.
-
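[Boardwork, reconstructed:

$$m = O\!\Big(\frac{1}{\gamma^{2}}\log\frac{K}{\delta}\Big).$$]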
73:27 - 73:30And this will be very important later when we
-
73:30 - 73:35talk about infinite hypothesis classes. Okay? Any
-
73:35 - 73:42questions about this? No? Okay, cool.
-
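[Illustration, not from the lecture: a small Python sketch that evaluates the sufficient sample size from the corollary; the parameter values are arbitrary.

import math

def sufficient_m(gamma, delta, K):
    # Sufficient training set size from the corollary:
    # m >= (1 / (2 * gamma^2)) * log(2K / delta).
    return math.ceil(math.log(2 * K / delta) / (2 * gamma ** 2))

print(sufficient_m(gamma=0.05, delta=0.01, K=10**6))   # about 3823
print(sufficient_m(gamma=0.05, delta=0.01, K=10**12))  # about 6586

Squaring the number of hypotheses from 10^6 to 10^12 barely moves m, because the dependence on K is only logarithmic.]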
73:50 - 73:51So
-
73:51 - 73:54next lecture we'll come back, we'll actually start from this result again.
-
73:54 - 73:58Remember this. I'll write this down as the first thing I do in the next lecture
-
73:58 - 74:03and we'll generalize these to infinite hypothesis classes and then talk about
-
74:03 - 74:05practical algorithms for model selection. So I'll see you guys in a couple days.