This presentation is delivered by the Stanford Center for Professional Development.

Okay. So welcome back. What I want to do today is start a new chapter. In particular, I want to talk about learning theory. So in the previous, I guess, eight lectures so far, you've learned about a lot of learning algorithms, and I hope you now understand a little about some of the best and most powerful tools of machine learning. And all of you are now sort of well qualified to go into industry and apply those powerful learning algorithms, really the most powerful learning algorithms we know, to all sorts of problems, and in fact, I hope you start to do that on your projects right away as well.

You might remember, I think it was in the very first lecture, that I made an analogy to learning to be a carpenter: if you imagine you're going to carpentry school to learn to be a carpenter, then only a small part of what you need to do is to acquire a set of tools. If you learn to be a carpenter, you don't just walk in, pick up a toolbox, and be done; knowing, when you need to cut a piece of wood, whether to use a rip saw, or a jigsaw, or a keyhole saw, really mastering the tools, is also an essential part of becoming a good carpenter. And what I want to do in the next few lectures is actually give you a sense of that mastery of the machine learning tools all of you have. Okay?
And so in particular, in the next few lectures what I want to do is talk more deeply about the properties of different machine learning algorithms, so that you can get a sense of when it's most appropriate to use each one. And it turns out that one of the most common scenarios in machine learning is that someday you'll be doing research, or working at a company, and you'll apply one of the learning algorithms you learned about. You may apply logistic regression, or support vector machines, or Naive Bayes or something, and for whatever bizarre reason, it won't work as well as you were hoping, or it won't quite do what you were hoping it to.

To me, what really separates the people that really understand and really get machine learning, compared to the people that maybe read the textbook and work through the math, will be what you do next. It will be in your decisions: when you apply a support vector machine and it doesn't quite do what you wanted, do you really understand enough about support vector machines to know what to do next and how to modify the algorithm? And to me that's often what really separates the great people in machine learning from the people that read the textbook, work through the math, and have just understood that. Okay?

So, what I want to do today: today's lecture will mainly be on learning theory, and we'll start to talk about some of the theoretical results of machine learning. The next lecture, later this week, will be on algorithms for, sort of, correcting or fixing some of the problems that the learning theory will point out to us and help us understand.
And then two lectures from now, that lecture will be almost entirely focused on practical advice for how to apply learning algorithms. Okay? So do you have any questions about this before I start? Okay.

So the very first thing we're going to talk about is something that you've probably already seen on the first homework, and something that I alluded to previously, which is the bias-variance trade-off. So take ordinary least squares, the first learning algorithm we learned about. If you fit a straight line through this data, this is not a very good model. Right. And if this happens, we say it has underfit the data, or we say that this is a learning algorithm with very high bias, because it is failing to fit the evident quadratic structure in the data. And for the purposes of this class, you can informally think of the bias of the learning algorithm as representing the fact that even if you had an infinite amount of training data, even if you had tons of training data, this algorithm would still fail to fit the quadratic function, the quadratic structure in the data. And so we think of this as a learning algorithm with high bias.

Then there's the opposite problem; so that's the same dataset. If you fit a fourth-order polynomial to this dataset, then you'll be able to interpolate the five data points exactly, but clearly this is also not a great model for the structure that you and I probably see in the data.
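To make that concrete, here is a minimal sketch, not from the lecture: the five data points and the noise level are made up, and numpy's polynomial fitting stands in for least squares over polynomial features.

```python
import numpy as np

# Five made-up points drawn from a noisy quadratic, y = x^2 + noise.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x**2 + rng.normal(scale=0.5, size=x.shape)

for degree in (1, 2, 4):
    # Least-squares fit of a polynomial of the given degree.
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
```

The degree-1 fit underfits (high bias), the degree-4 fit drives the training error to essentially zero by interpolating all five points (high variance), and the degree-2 fit is the happy medium described next.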
And we say that this algorithm has a problem, excuse me, is overfitting the data, or alternatively that this algorithm has high variance. Okay? And the intuition behind overfitting and high variance is that the algorithm is fitting spurious patterns in the data, or is fitting idiosyncratic properties of this specific dataset, be it the dataset of housing prices or whatever. And quite often, there'll be some happy medium, like fitting a quadratic function, that maybe won't interpolate your data points perfectly, but captures more structure in your data than a simple model which underfits.

I'd say that you can have exactly the same picture for classification problems as well. So let's say this is my training set, right, of positive and negative examples, and so you can fit logistic regression with very high-order polynomial features, so h of x equals the sigmoid function of, whatever, the sigmoid function applied to a tenth-order polynomial. And if you do that, maybe you get a decision boundary like this. Right. That does indeed perfectly separate the positive and negative classes, but this is another example of overfitting. And in contrast, if you fit logistic regression to this data with just the linear features, with none of the quadratic features, then maybe you get a decision boundary like that, which can also underfit. Okay.
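The classification version can be sketched the same way; this assumes scikit-learn is available, and the two Gaussian blobs are made up, so treat it as an illustration rather than the lecture's example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up 2-D training set: two slightly overlapping Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

for degree in (1, 10):
    Phi = PolynomialFeatures(degree=degree).fit_transform(X)
    # Large C means essentially unregularized logistic regression.
    clf = LogisticRegression(C=1e6, max_iter=10_000).fit(Phi, y)
    print(f"degree {degree}: training accuracy = {clf.score(Phi, y):.2f}")
```

The degree-10 boundary can contort itself to classify nearly every training point correctly, while the linear boundary cannot; on fresh draws from the same blobs, the contorted boundary is the one that tends to do worse.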
So what I want to do now is understand this problem of overfitting versus underfitting, of high bias versus high variance, more explicitly. I will do that by posing a more formal model of machine learning, and then trying to prove when each of these two twin problems comes up.

And as the motivating example for our initial foray into learning theory, I want to talk about learning a classifier in which h of x is equal to g of theta transpose x. Okay? So, a linear classifier. And for this class I'm going to use, z, excuse me, I'm going to use g of z equals the indicator of z being greater than or equal to zero.

With apologies in advance for changing the notation yet again: for the support vector machine lectures we used y equals minus one or plus one. For the learning theory lectures, it turns out it'll be a bit cleaner if I switch back to y equals zero-one again, so I'm going to switch back to my original notation. And so you can think of this model as, say, similar to logistic regression, except that now we're going to force the algorithm to output labels that are either zero or one. Okay? So you can think of this as a classifier that outputs the labels zero or one directly, rather than outputting probabilities.

And so, as usual, let's say we're given a training set of m examples; that's just my notation for writing a set of m examples ranging from i equals one through m.
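Written out directly, this hypothesis is just a thresholded linear function; a minimal sketch, where the parameter values are arbitrary:

```python
import numpy as np

def h(theta, x):
    """h_theta(x) = g(theta^T x), where g(z) = 1 if z >= 0, else 0."""
    return 1 if theta @ x >= 0 else 0

theta = np.array([-1.5, 1.0])           # arbitrary example parameters
print(h(theta, np.array([1.0, 2.0])))   # -1.5 + 2.0 >= 0, so prints 1
print(h(theta, np.array([1.0, 1.0])))   # -1.5 + 1.0 <  0, so prints 0
```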
And I'm going to assume that the training examples (x^(i), y^(i)) are drawn i.i.d. from some distribution, script D. Okay? Independently and identically distributed. And if you're running a classification problem on houses, like features of the house, comma, whether the house will be sold in the next six months, then this is just the probability distribution over features of houses and whether or not they'll be sold. Okay? So I'm going to assume that the training examples were drawn i.i.d. from some probability distribution, script D. Well, same thing for spam: if you're trying to build a spam classifier, then this would be the distribution over what emails look like, comma, whether they are spam or not.

And in particular, to understand, or to simplify, to understand the phenomena of bias and variance, I'm actually going to use a simplified model of machine learning. In particular, logistic regression fits the parameters of a model like this by maximizing the log likelihood. But in order to understand learning algorithms more deeply, I'm just going to assume a simplified model of machine learning; let me just write that down.

So I'm going to define training error as, so this is the training error of a hypothesis h subscript theta, and I'll write this as epsilon hat of h subscript theta. If I want to make the dependence on a training set explicit, I'll write this with a subscript S there, where S is the training set. And I'll define this as: epsilon hat of h theta equals one over m, times the sum from i equals one through m, of the indicator that h theta of x^(i) is not equal to y^(i). Let's see. Okay. I hope the notation is clear: this is a sum of indicator functions for whether your hypothesis misclassifies the i-th example.
And so, when you divide by m, this is just, in your training set, what's the fraction of training examples your hypothesis misclassifies; that's defined as the training error. And training error is also called risk.

The simplified model of machine learning I'm going to talk about is called empirical risk minimization. In particular, I'm going to assume that the way my learning algorithm works is that it will choose parameters theta that minimize my training error. Okay? And it will be this learning algorithm that we'll prove properties about.

And it turns out that you can think of this as the most basic learning algorithm, the algorithm that minimizes your training error. It turns out that logistic regression and support vector machines can be formally viewed as approximations to this. If you actually want to do this, this is a nonconvex optimization problem; it's actually NP-hard to solve this optimization problem. And logistic regression and support vector machines can both be viewed as approximations to this nonconvex optimization problem, by finding a convex approximation to it. So think of this as similar to what algorithms like logistic regression are doing.

So let me take that definition of empirical risk minimization and actually just rewrite it in a different, equivalent way.
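Before the rewrite, here is a sketch of the two definitions so far, with random search standing in for the exact minimization (which, as just noted, is intractable in general); the search procedure is mine, purely for illustration:

```python
import numpy as np

def training_error(theta, X, Y):
    """epsilon_hat(h_theta) = (1/m) * sum_i 1{h_theta(x_i) != y_i}."""
    predictions = (X @ theta >= 0).astype(int)
    return float(np.mean(predictions != Y))

def erm_random_search(X, Y, n_candidates=10_000, seed=0):
    """Crude stand-in for empirical risk minimization: since exactly
    minimizing the 0-1 training error is a nonconvex problem, just try
    many random parameter vectors and keep the best one found."""
    rng = np.random.default_rng(seed)
    candidates = rng.normal(size=(n_candidates, X.shape[1]))
    errors = [training_error(theta, X, Y) for theta in candidates]
    return candidates[int(np.argmin(errors))]
```

Logistic regression and the SVM sidestep the intractability by replacing the 0-1 indicator inside the sum with a convex surrogate (the log loss and the hinge loss, respectively), which is the sense in which they approximate this problem.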
For the results I want to prove today, it turns out that it will be useful to think of our learning algorithm not as choosing a set of parameters, but as choosing a function. So let me say what I mean by that. Let me define the hypothesis class, script H, as the class of all hypotheses, in other words, as the class of all linear classifiers, that your learning algorithm is choosing from. Okay?

So h subscript theta is a specific linear classifier, and each of these is a function mapping from the input domain X to the class labels zero-one. Each of these is a function, and as you vary the parameters theta, you get different functions. And so let me define the hypothesis class script H to be the class of all functions that, say, logistic regression can choose from. Okay? So this is the class of all linear classifiers.

And so I'm going to define, or maybe redefine, empirical risk minimization as, instead of writing this as choosing a set of parameters, I want to think of it as choosing a function in the hypothesis class script H that minimizes my training error. Okay?

So actually, can you raise your hand if it makes sense to you why this is equivalent to the previous formulation? Okay, cool. Thanks.
So for the development that follows, it'll be useful to think of the algorithm as choosing a function from the class instead, because in the more general case this set, script H, can be some other class of functions. Maybe it's the class of all functions represented by a neural network, or some other class of functions the learning algorithm wants to choose from. And this definition of empirical risk minimization will still apply. Okay?

So what we'd like to do is understand whether empirical risk minimization is a reasonable algorithm. Alex?

Student: [Inaudible] a function that's defined by g of theta transpose x, or is it now more general?

Instructor (Andrew Ng): I see, right, so let's see, I guess the question is: is h theta still defined by g of theta transpose x, or is this more general? So...

Student: [Inaudible]

Instructor (Andrew Ng): Oh, yeah, so there are two answers to that. One is, this framework is general. So for the purpose of this lecture, it may be useful to you to keep in mind the model of the example where h subscript theta is in the class of all linear classifiers, such as those used by, like, the perceptron algorithm or logistic regression. Everything on this board, however, is actually more general. H can be any set of functions mapping from the input domain to the set of class labels zero and one, and then you can perform empirical risk minimization over any hypothesis class.
For the purpose of today's lecture, I am going to restrict myself to talking about binary classification, but it turns out everything I say generalizes to regression and other problems as well. Does that answer your question? Yes. Cool. All right.

So I want to understand if empirical risk minimization is a reasonable algorithm. In particular, what are the things we can prove about it? So clearly we don't actually care about training error; we don't really care about making accurate predictions on the training set, or at least that's not the ultimate goal. The ultimate goal is how well it generalizes, how well it makes predictions on examples that we haven't seen before. How well it predicts prices, or sale or no-sale outcomes, of houses you haven't seen before.

So what we really care about is generalization error, which I write as epsilon of h. And this is defined as the probability that, if I sample a new example (x, y) from that distribution script D, my hypothesis mislabels that example.

And in terms of notational convention: usually, if I place a hat on top of something, it usually means, not always, but it usually means, that it is an attempt to estimate the quantity without the hat. So, for example, epsilon hat here: think of epsilon hat, the training error, as an attempt to approximate the generalization error.
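Since epsilon of h is a probability under the distribution D, you can approximate it by sampling fresh examples; a small sketch, where the distribution D (a noisy threshold on a uniform x) is entirely made up:

```python
import numpy as np

def generalization_error(h, sample_from_D, n=100_000, seed=0):
    """Monte Carlo estimate of epsilon(h) = P_{(x,y)~D}(h(x) != y)."""
    rng = np.random.default_rng(seed)
    mistakes = 0
    for _ in range(n):
        x, y = sample_from_D(rng)
        mistakes += (h(x) != y)
    return mistakes / n

def sample_from_D(rng):
    """Made-up D: x ~ Uniform[-1, 1], label 1{x >= 0} flipped 10% of the time."""
    x = rng.uniform(-1.0, 1.0)
    y = int(x >= 0)
    if rng.random() < 0.1:
        y = 1 - y
    return x, y

# Even the best possible hypothesis here still suffers the 10% label noise.
print(generalization_error(lambda x: int(x >= 0), sample_from_D))  # ~0.10
```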
Okay, so the notational convention is: usually the things with the hats on top are things we're using to estimate other quantities. And h hat is the hypothesis output by the learning algorithm, our attempt to estimate a good function from X to Y.

So let's actually prove some things about when empirical risk minimization will do well, in the sense of giving us low generalization error, which is what we really care about.

In order to prove our first learning theory result, I'm going to have to state two lemmas. The first is the union bound, which is the following. Let A1 through AK be K events. And when I say events, I mean events in the sense of a probabilistic event that either happens or not. And these are not necessarily independent; so there's some joint distribution over the events A1 through AK, and maybe they're independent, maybe not, no assumption on that. Then the probability of A1 or A2 or, dot dot dot, up to AK, where this union symbol is just set notation for probability and just means "or", so the probability of at least one of these events occurring, of A1 or A2 or up to AK, is less than or equal to the probability of A1, plus the probability of A2, plus dot dot dot, plus the probability of AK. Okay?
So the intuition behind this is just that, well, I'm not sure if you've seen Venn diagram depictions of probability before; if you haven't, what I'm about to do may be a little cryptic, so just ignore what I'm about to do if you haven't seen it before. But if you have seen it before, then this is really great: the probability of A1, union A2, union A3, is less than or equal to P of A1, plus P of A2, plus P of A3. Right. So the total mass in the union of these three things is at most the sum of the masses in the three individual sets; it's not very surprising. It turns out that, depending on how you define your axioms of probability, this is actually one of the axioms of probability, so I won't actually try to prove this; it's usually written as an axiom. Sigma sub-additivity of probability measures is what this is sometimes called as well. But in learning theory it's commonly called the union bound, so I'll just call it that.

The other lemma I need is called the Hoeffding inequality. And again, I won't actually prove this, I'll just state it. Let Z1 up to Zm be m i.i.d. Bernoulli random variables with mean phi; so the probability that Zi equals one is equal to phi. So let's say you observe m i.i.d. Bernoulli random variables and you want to estimate their mean.
So let me define phi hat, and this is again that notational convention: phi hat means it's an estimate of something else. So we define phi hat to be one over m, times the sum from i equals one through m of Zi. Okay? So this is our attempt to estimate the mean of these Bernoulli random variables by, sort of, taking their average.

And let any gamma be fixed. Then the Hoeffding inequality is that the probability that your estimate of phi is more than gamma away from the true value of phi is bounded by two e to the negative two gamma squared m. Okay?

So, just in pictures: this theorem holds, this lemma, the Hoeffding inequality; this is just a statement of fact, this just holds true. But let me now draw a cartoon to describe some of the intuition behind this, I guess. So let's say this is the real number line from zero to one, and phi is the mean of your Bernoulli random variables. You will remember from, you know, whatever undergraduate probability or statistics class, the central limit theorem, which says that when you average a lot of things together, you tend to get a Gaussian distribution. And so when you toss m coins with bias phi, we observe these m Bernoulli random variables, and we average them, then the probability distribution of phi hat will roughly be a Gaussian, let's say. Okay? It turns out, if you haven't seen this before, that the cumulative distribution function of phi hat will converge to that of the Gaussian.
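The inequality is easy to check numerically; a small simulation sketch, where phi, m, and gamma are arbitrary choices of mine:

```python
import numpy as np

phi, m, gamma, trials = 0.3, 100, 0.1, 100_000
rng = np.random.default_rng(0)

# Each row is one experiment: m i.i.d. Bernoulli(phi) coin tosses.
Z = rng.random((trials, m)) < phi
phi_hat = Z.mean(axis=1)  # the averaged estimate, one per experiment

tail_prob = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"P(|phi_hat - phi| > gamma) ~= {tail_prob:.4f} <= bound {bound:.4f}")
```

With these numbers the empirical tail probability comes out around 0.03, comfortably below the Hoeffding bound of roughly 0.27; the bound is loose, but, crucially, it holds for every finite m.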
Technically, phi hat can only take on a discrete set of values, because these are fractions, one over m, so it doesn't really have a density; but just as a cartoon, think of it as converging roughly to a Gaussian.

So what the Hoeffding inequality says is that if you pick a value of gamma, let me put one interval gamma there and another interval gamma there, then it's saying that the probability mass in the tails, in other words the probability that my value of phi hat is more than gamma away from the true value, that the total probability mass in these tails, is at most two e to the negative two gamma squared m. Okay? That's what the Hoeffding inequality says, so if you can't read that, this is just the right-hand side of the bound, two e to the negative two gamma squared m. So this bounds the probability that you make a mistake in estimating the mean of a Bernoulli random variable.

And the cool thing about this bound, the interesting thing behind this bound, is that it decreases exponentially in m. So it says that for a fixed value of gamma, as you increase the size of your training set, as you toss a coin more and more, then the width of this Gaussian will shrink. The width of this Gaussian will actually shrink like one over root m, and that will cause the probability mass left in the tails to decrease exponentially, quickly, as a function of m. And this will be important later. Yeah?

Student: Does this come from the central limit theorem? [Inaudible]

Instructor (Andrew Ng): No, it doesn't.
This is proved by a different argument. So, the central limit theorem: there may be a version of the central limit theorem along these lines, but the versions I'm familiar with tend to be asymptotic, whereas this works for any finite value of m. Oh, and this bound holds even if m is equal to two, or if m is very small; the central limit theorem approximation is not going to hold then, but this theorem holds regardless. Okay? I'm drawing this just as a cartoon to help explain the intuition, but this theorem just holds true, without reference to the central limit theorem.

All right. So let's start to understand empirical risk minimization, and what I want to do is begin with studying empirical risk minimization for a simpler case than logistic regression, and in particular I want to start with studying the case of finite hypothesis classes. So let's say script H is a class of K hypotheses. Right. So this is K functions, where each of these is just a function mapping from inputs to outputs; there are no parameters in this. And so what empirical risk minimization would do is it would take the training set, and it'll then look at each of these K functions, and it'll pick whichever of these functions has the lowest training error. Okay?
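For a finite class, this algorithm is literally an argmin over K functions; a minimal sketch, where the three parameter-free hypotheses and the tiny training set are made up:

```python
import numpy as np

# A made-up finite hypothesis class, K = 3 fixed functions (no parameters).
H = [
    lambda x: int(x[0] >= 0),          # threshold on the first feature
    lambda x: int(x[1] >= 0),          # threshold on the second feature
    lambda x: int(x[0] + x[1] >= 0),   # threshold on their sum
]

def training_error(h, S):
    """Fraction of the training set S = [(x, y), ...] that h misclassifies."""
    return np.mean([h(x) != y for x, y in S])

def erm(H, S):
    """Empirical risk minimization over a finite class: return the
    hypothesis h_hat in H with the lowest training error."""
    return min(H, key=lambda h: training_error(h, S))

# Tiny made-up training set, labeled by the sign of the second feature.
S = [(np.array([1.0, -2.0]), 0), (np.array([-1.0, 3.0]), 1),
     (np.array([2.0, 1.0]), 1), (np.array([-2.0, -1.0]), 0)]
h_hat = erm(H, S)
print(H.index(h_hat), training_error(h_hat, S))  # picks hypothesis 1, error 0.0
```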
Now, logistic regression uses an infinitely large, a continuous, infinitely large class of hypotheses, script H, but to prove the first result, I actually want to just describe our first learning theorem for the case of when you have a finite hypothesis class, and then we'll later generalize to infinite hypothesis classes.

So empirical risk minimization takes the hypothesis with the lowest training error, and what I'd like to do is prove a bound on the generalization error of h hat. All right? So in other words, I'm going to prove that minimizing training error somehow allows me to do well on generalization error. And here's the strategy. The first step in this proof: I'm going to show that training error is a good approximation to generalization error, and then I'm going to show that this implies a bound on the generalization error of the hypothesis output by empirical risk minimization.

And I just realized, this class, I guess, is also maybe a slightly notation-heavy class, since I'm introducing a reasonably large set of new symbols. So if, in the course of today's lecture, you're looking at some symbol and you don't quite remember what it is, please raise your hand and ask: "What's that? What was that? Was that the generalization error, or was it something else?" So raise your hand and ask if you don't understand the notation I'm defining.

Okay. So let me introduce this in two steps.
So the strategy is: I'm going to show that training error is a good approximation to generalization error, and this will imply that minimizing training error will also do pretty well in terms of minimizing generalization error. And this will give us a bound on the generalization error of the hypothesis output by empirical risk minimization. Okay?

So here's the idea. Let's not even consider all the hypotheses at once; let's pick any hypothesis, h_j, in the class script H, and so, until further notice, let's just consider that one fixed hypothesis. So pick any one hypothesis and let's talk about that one.

Let me define Z_i to be the indicator function for whether this hypothesis misclassifies the i-th example, or Z subscript i. Okay? So Z_i would be zero or one, depending on whether this one hypothesis, which is the only one I'm going to consider for now, misclassifies that example. And so my training set is drawn randomly from some distribution script D, and depending on what training examples I've got, these Z_i's would be either zero or one. So let's figure out what the probability distribution of Z_i is. Well, Z_i takes on the value of either zero or one, so clearly it's a Bernoulli random variable; it can only take on these values. Well, what's the probability that Z_i is equal to one?
In other words, what's the probability that, for a fixed hypothesis h_j, when I sample my training set i.i.d. from the distribution D, my hypothesis will misclassify it? Well, by definition, that's just the generalization error of my hypothesis h_j. So Z_i is a Bernoulli random variable with mean given by the generalization error of this hypothesis. Raise your hand if that made sense. Oh, cool. Great.

And moreover, all the Z_i's have the same probability of being one, and all the training examples I've drawn are i.i.d., and so the Z_i's are also independent, and therefore the Z_i's themselves are i.i.d. random variables. Okay? Because my training examples were drawn independently of each other, by assumption.

If you look at the definition of training error, the training error of my hypothesis h_j is just that: it's just the average of my Z_i's, which is, well, how I previously defined it. Okay? And so epsilon hat of h_j is exactly the average of m i.i.d. Bernoulli random variables, drawn from a Bernoulli distribution with mean given by the generalization error; so this is, well, this is the average of m i.i.d. Bernoulli random variables, each of which has mean given by the generalization error of h_j.
Dialogue: 0,0:35:58.69,0:36:03.77,Default,,0000,0000,0000,,And therefore, by Dialogue: 0,0:36:03.77,0:36:07.03,Default,,0000,0000,0000,,the Hoeffding inequality Dialogue: 0,0:36:07.03,0:36:11.07,Default,,0000,0000,0000,,we have that the Dialogue: 0,0:36:11.07,0:36:15.70,Default,,0000,0000,0000,,probability Dialogue: 0,0:36:15.70,0:36:20.03,Default,,0000,0000,0000,,that the difference between training and generalization error, the probability that this is greater than gamma, is Dialogue: 0,0:36:20.03,0:36:21.54,Default,,0000,0000,0000,,less than or equal to two E to Dialogue: 0,0:36:21.54,0:36:24.72,Default,,0000,0000,0000,,the negative two Dialogue: 0,0:36:24.72,0:36:26.95,Default,,0000,0000,0000,,gamma squared M. Okay? Dialogue: 0,0:36:26.95,0:36:32.42,Default,,0000,0000,0000,,Exactly by the Hoeffding inequality. Dialogue: 0,0:36:32.42,0:36:34.58,Default,,0000,0000,0000,,And what this proves is that, Dialogue: 0,0:36:34.58,0:36:37.29,Default,,0000,0000,0000,,for my fixed hypothesis HJ, Dialogue: 0,0:36:37.29,0:36:40.53,Default,,0000,0000,0000,,my training error, epsilon hat, will be close to generalization error Dialogue: 0,0:36:40.53,0:36:43.90,Default,,0000,0000,0000,,with high probability, assuming M is large. If Dialogue: 0,0:36:43.90,0:36:46.89,Default,,0000,0000,0000,,M is large then this thing on the right hand side will be small, because Dialogue: 0,0:36:46.89,0:36:49.24,Default,,0000,0000,0000,,this is two E to the negative two gamma squared M. Dialogue: 0,0:36:49.24,0:36:52.56,Default,,0000,0000,0000,,So this says that if my training set is large enough, Dialogue: 0,0:36:52.56,0:36:54.39,Default,,0000,0000,0000,,then the probability Dialogue: 0,0:36:54.39,0:36:57.30,Default,,0000,0000,0000,,my training error is far from generalization error, Dialogue: 0,0:36:57.30,0:36:59.09,Default,,0000,0000,0000,,meaning that it is more than gamma, Dialogue: 0,0:36:59.09,0:37:06.09,Default,,0000,0000,0000,,will be small, will be bounded by this thing on the right hand side. Okay? Now, Dialogue: 0,0:37:09.12,0:37:11.57,Default,,0000,0000,0000,,here's the [inaudible] tricky part, Dialogue: 0,0:37:11.57,0:37:16.57,Default,,0000,0000,0000,,what we've done is prove this bound for one fixed hypothesis, for HJ. Dialogue: 0,0:37:16.57,0:37:20.43,Default,,0000,0000,0000,,What I want to prove is that training error will be a good estimate for generalization Dialogue: 0,0:37:20.43,0:37:21.28,Default,,0000,0000,0000,,error, Dialogue: 0,0:37:21.28,0:37:26.25,Default,,0000,0000,0000,,not just for this one hypothesis HJ, but actually for all K hypotheses in my Dialogue: 0,0:37:26.25,0:37:27.83,Default,,0000,0000,0000,,hypothesis class Dialogue: 0,0:37:27.83,0:37:29.65,Default,,0000,0000,0000,,script Dialogue: 0,0:37:29.65,0:37:30.77,Default,,0000,0000,0000,,H. Dialogue: 0,0:37:30.77,0:37:33.07,Default,,0000,0000,0000,,So let's do it Dialogue: 0,0:37:35.72,0:37:42.72,Default,,0000,0000,0000,,well, Dialogue: 0,0:37:54.66,0:37:57.86,Default,,0000,0000,0000,,better do it on a new board.
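As a numerical sanity check of the Hoeffding step (again a sketch, with assumed values for eps, M, and gamma), one can estimate the left-hand probability by Monte Carlo and compare it against the bound two e to the negative two gamma squared M:

```python
import math
import random

# Assumed values, for illustration only.
eps, M, gamma = 0.3, 200, 0.05
trials = 20000

# Monte Carlo estimate of P(|eps_hat - eps| > gamma) for one fixed
# hypothesis, by redrawing the training set many times.
count = 0
for _ in range(trials):
    eps_hat = sum(random.random() < eps for _ in range(M)) / M
    if abs(eps_hat - eps) > gamma:
        count += 1

empirical = count / trials
bound = 2 * math.exp(-2 * gamma**2 * M)   # Hoeffding bound
print(f"empirical probability: {empirical:.4f}")
print(f"Hoeffding bound:       {bound:.4f}")   # bound >= empirical
```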
So in order to show that, let me define Dialogue: 0,0:37:57.86,0:38:01.23,Default,,0000,0000,0000,,a random event, let me define AJ to be the event Dialogue: 0,0:38:01.23,0:38:05.27,Default,,0000,0000,0000,,that Dialogue: 0,0:38:05.27,0:38:12.27,Default,,0000,0000,0000,,let Dialogue: 0,0:38:21.47,0:38:24.48,Default,,0000,0000,0000,,me define AJ to be the event that Dialogue: 0,0:38:24.48,0:38:28.04,Default,,0000,0000,0000,,you know, the difference between training and generalization error is more than gamma on a Dialogue: 0,0:38:28.04,0:38:30.44,Default,,0000,0000,0000,,hypothesis HJ. Dialogue: 0,0:38:30.44,0:38:32.09,Default,,0000,0000,0000,,And so what we Dialogue: 0,0:38:32.09,0:38:36.45,Default,,0000,0000,0000,,put on the previous board was that the probability of AJ is less than or equal to two E Dialogue: 0,0:38:36.45,0:38:43.45,Default,,0000,0000,0000,,to the negative two gamma squared M, and this is pretty small. Now, Dialogue: 0,0:38:44.06,0:38:49.73,Default,,0000,0000,0000,,what I want to bound is the probability that Dialogue: 0,0:38:49.73,0:38:53.88,Default,,0000,0000,0000,,there exists some hypothesis in my class Dialogue: 0,0:38:53.88,0:39:00.88,Default,,0000,0000,0000,,script H, Dialogue: 0,0:39:02.84,0:39:08.11,Default,,0000,0000,0000,,such that I make a large error in my estimate of generalization error. Okay? Dialogue: 0,0:39:08.11,0:39:10.27,Default,,0000,0000,0000,,Such that this holds true. Dialogue: 0,0:39:12.18,0:39:14.29,Default,,0000,0000,0000,,So this is really just Dialogue: 0,0:39:14.29,0:39:17.89,Default,,0000,0000,0000,,the probability that there exists a hypothesis for which this Dialogue: 0,0:39:17.89,0:39:20.65,Default,,0000,0000,0000,,holds. This is really the probability that Dialogue: 0,0:39:20.65,0:39:25.30,Default,,0000,0000,0000,,A one or A two, or up to AK holds. Dialogue: 0,0:39:25.30,0:39:28.66,Default,,0000,0000,0000,,The chance Dialogue: 0,0:39:28.66,0:39:33.18,Default,,0000,0000,0000,,there exists such a hypothesis is just well the probability Dialogue: 0,0:39:33.18,0:39:37.48,Default,,0000,0000,0000,,that for hypothesis one I make a large error in estimating the generalization error, Dialogue: 0,0:39:37.48,0:39:41.07,Default,,0000,0000,0000,,or for hypothesis two I make a large error in estimating generalization error, Dialogue: 0,0:39:41.07,0:39:43.56,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:39:43.56,0:39:50.56,Default,,0000,0000,0000,,And so by the union bound, this is less than or equal to Dialogue: 0,0:39:50.82,0:39:56.23,Default,,0000,0000,0000,,that, Dialogue: 0,0:39:56.23,0:39:59.33,Default,,0000,0000,0000,,which is therefore less than or equal to Dialogue: 0,0:40:06.52,0:40:13.52,Default,,0000,0000,0000,,which Dialogue: 0,0:40:13.99,0:40:20.99,Default,,0000,0000,0000,,is equal to that. Okay? Dialogue: 0,0:40:38.95,0:40:40.44,Default,,0000,0000,0000,,So let Dialogue: 0,0:40:40.44,0:40:42.36,Default,,0000,0000,0000,,me just take Dialogue: 0,0:40:42.36,0:40:47.64,Default,,0000,0000,0000,,one minus both sides of the equation Dialogue: 0,0:40:47.64,0:40:50.94,Default,,0000,0000,0000,,on the previous board let me take one minus both sides, so Dialogue: 0,0:40:50.94,0:40:56.90,Default,,0000,0000,0000,,the probability that there does not exist a Dialogue: 0,0:40:56.90,0:40:59.01,Default,,0000,0000,0000,,hypothesis Dialogue: 0,0:40:59.01,0:41:06.01,Default,,0000,0000,0000,,such that Dialogue: 0,0:41:07.65,0:41:10.84,Default,,0000,0000,0000,,that.
The probability that there does not exist a hypothesis on which I make a large Dialogue: 0,0:41:10.84,0:41:13.19,Default,,0000,0000,0000,,error in this estimate Dialogue: 0,0:41:13.19,0:41:20.19,Default,,0000,0000,0000,,well, this is equal to the probability that for all hypotheses, Dialogue: 0,0:41:22.68,0:41:24.76,Default,,0000,0000,0000,,I make a Dialogue: 0,0:41:24.76,0:41:26.57,Default,,0000,0000,0000,,small error, or at Dialogue: 0,0:41:26.57,0:41:31.61,Default,,0000,0000,0000,,most gamma, in my estimate of generalization error. Dialogue: 0,0:41:31.61,0:41:35.52,Default,,0000,0000,0000,,And taking one minus on the right hand side I get one minus two K E Dialogue: 0,0:41:35.52,0:41:39.15,Default,,0000,0000,0000,,to the negative two Dialogue: 0,0:41:39.15,0:41:44.11,Default,,0000,0000,0000,,gamma squared M. Okay? Dialogue: 0,0:41:44.11,0:41:45.71,Default,,0000,0000,0000,,And so Dialogue: 0,0:41:45.71,0:41:49.52,Default,,0000,0000,0000,,the sign of the inequality flipped because I took one minus both Dialogue: 0,0:41:49.52,0:41:53.80,Default,,0000,0000,0000,,sides. The minus sign flips the sign of the Dialogue: 0,0:41:53.80,0:41:55.82,Default,,0000,0000,0000,,inequality. Dialogue: 0,0:41:55.82,0:41:59.80,Default,,0000,0000,0000,,So what we've shown is that Dialogue: 0,0:41:59.80,0:42:04.68,Default,,0000,0000,0000,,with probability which I abbreviate to WP with probability one minus Dialogue: 0,0:42:04.68,0:42:08.53,Default,,0000,0000,0000,,two K E to the negative two gamma squared M, Dialogue: 0,0:42:08.53,0:42:11.89,Default,,0000,0000,0000,,we have that epsilon hat Dialogue: 0,0:42:11.89,0:42:16.48,Default,,0000,0000,0000,,of H Dialogue: 0,0:42:16.48,0:42:21.15,Default,,0000,0000,0000,,will be Dialogue: 0,0:42:21.15,0:42:28.15,Default,,0000,0000,0000,,within gamma of epsilon of H, Dialogue: 0,0:42:31.31,0:42:33.06,Default,,0000,0000,0000,,simultaneously for all Dialogue: 0,0:42:33.06,0:42:35.45,Default,,0000,0000,0000,,hypotheses in our Dialogue: 0,0:42:35.45,0:42:42.45,Default,,0000,0000,0000,,class script H. Dialogue: 0,0:42:43.27,0:42:47.21,Default,,0000,0000,0000,,And so Dialogue: 0,0:42:47.21,0:42:54.21,Default,,0000,0000,0000,,just to give this result a name, this is called Dialogue: 0,0:42:55.53,0:43:00.09,Default,,0000,0000,0000,,this is one instance of what's called a uniform convergence result, Dialogue: 0,0:43:00.09,0:43:01.13,Default,,0000,0000,0000,,and Dialogue: 0,0:43:01.13,0:43:03.76,Default,,0000,0000,0000,,the term uniform convergence Dialogue: 0,0:43:03.76,0:43:05.72,Default,,0000,0000,0000,,sort of alludes to the fact that Dialogue: 0,0:43:05.72,0:43:09.66,Default,,0000,0000,0000,,this shows that as M becomes large, Dialogue: 0,0:43:09.66,0:43:12.03,Default,,0000,0000,0000,,then these epsilon hats Dialogue: 0,0:43:12.03,0:43:16.19,Default,,0000,0000,0000,,will all simultaneously converge to epsilon of H. That Dialogue: 0,0:43:16.19,0:43:17.85,Default,,0000,0000,0000,,training error will Dialogue: 0,0:43:17.85,0:43:20.84,Default,,0000,0000,0000,,become very close to generalization error Dialogue: 0,0:43:20.84,0:43:23.21,Default,,0000,0000,0000,,simultaneously for all hypotheses H. Dialogue: 0,0:43:23.21,0:43:25.46,Default,,0000,0000,0000,,That's what the term uniform Dialogue: 0,0:43:25.46,0:43:29.15,Default,,0000,0000,0000,,refers to, the fact that this converges for all hypotheses H and not just Dialogue: 0,0:43:29.15,0:43:31.06,Default,,0000,0000,0000,,for one hypothesis.
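The union-bound arithmetic is easy to package as a tiny helper; this sketch (with assumed values for K, M, and gamma, chosen only for illustration) prints the guaranteed probability that uniform convergence holds, one minus two K e to the negative two gamma squared M:

```python
import math

def uniform_convergence_prob(K, M, gamma):
    """Lower bound, via the union bound, on the probability that every one
    of the K hypotheses has |training error - generalization error| <= gamma."""
    return 1 - 2 * K * math.exp(-2 * gamma**2 * M)

# Assumed values, for illustration only: K = 100 hypotheses, gamma = 0.1.
for M in (300, 500, 1000):
    p = uniform_convergence_prob(K=100, M=M, gamma=0.1)
    # For small M the bound can go negative, i.e. it is vacuous.
    print(f"M = {M:4d}: uniform convergence holds w.p. >= {max(p, 0.0):.4f}")
```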
And so Dialogue: 0,0:43:31.06,0:43:36.13,Default,,0000,0000,0000,,what we've shown is one example of a uniform convergence result. Okay? Dialogue: 0,0:43:36.13,0:43:39.40,Default,,0000,0000,0000,,So let me clean a couple more boards. I'll come back and ask what questions you have Dialogue: 0,0:43:39.40,0:43:46.40,Default,,0000,0000,0000,,about this. We should take another look at this and make sure it all makes sense. Yeah, okay. Dialogue: 0,0:44:20.11,0:44:27.11,Default,,0000,0000,0000,,What questions do you have about this? Dialogue: 0,0:44:27.83,0:44:31.27,Default,,0000,0000,0000,, Dialogue: 0,0:44:31.27,0:44:34.24,Default,,0000,0000,0000,,Student: How is the Dialogue: 0,0:44:34.24,0:44:36.15,Default,,0000,0000,0000,,value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is how is the value of gamma computed? Dialogue: 0,0:44:36.15,0:44:40.29,Default,,0000,0000,0000,,So for these purposes for the purposes of this, gamma is a constant. Dialogue: 0,0:44:40.29,0:44:43.49,Default,,0000,0000,0000,,Imagine gamma is some constant that we chose in advance, Dialogue: 0,0:44:43.49,0:44:48.59,Default,,0000,0000,0000,,and this is a bound that holds true for any fixed value Dialogue: 0,0:44:48.59,0:44:50.74,Default,,0000,0000,0000,,of gamma. Later on as we Dialogue: 0,0:44:50.74,0:44:54.45,Default,,0000,0000,0000,,take this bound and then sort of develop this result further, Dialogue: 0,0:44:54.45,0:44:58.13,Default,,0000,0000,0000,,we'll choose specific values of gamma as a [inaudible] of this bound. For now Dialogue: 0,0:44:58.13,0:45:05.13,Default,,0000,0000,0000,,we'll just imagine that what we've proved holds true for any value of gamma. Any questions? Dialogue: 0,0:45:08.89,0:45:11.58,Default,,0000,0000,0000,,Yeah? Student: [Inaudible] hypothesis class is infinite [inaudible]? Instructor (Andrew Ng): Yes, if the hypothesis class is infinite, Dialogue: 0,0:45:11.58,0:45:16.28,Default,,0000,0000,0000,,then this simple result won't work in its present form, but we'll generalize this Dialogue: 0,0:45:16.28,0:45:20.20,Default,,0000,0000,0000,,probably won't get to it today but we'll generalize this at the beginning of the Dialogue: 0,0:45:20.20,0:45:27.20,Default,,0000,0000,0000,,next lecture to infinite hypothesis classes. Student: How do we use this Dialogue: 0,0:45:29.78,0:45:31.89,Default,,0000,0000,0000,,theory [inaudible]? Instructor (Andrew Ng): How do you use this theory in practice? So let me Dialogue: 0,0:45:31.89,0:45:37.26,Default,,0000,0000,0000,,I might get to a little of that later today, we'll talk concretely about algorithms, Dialogue: 0,0:45:37.26,0:45:38.89,Default,,0000,0000,0000,,and the consequences of Dialogue: 0,0:45:38.89,0:45:45.72,Default,,0000,0000,0000,,understanding these things, in the next lecture as well. Yeah, okay? Cool. Can you Dialogue: 0,0:45:45.72,0:45:47.29,Default,,0000,0000,0000,,just raise your hand if the Dialogue: 0,0:45:47.29,0:45:49.60,Default,,0000,0000,0000,,things I've proved so far Dialogue: 0,0:45:49.60,0:45:55.07,Default,,0000,0000,0000,,make sense? Okay. Cool. Great. Dialogue: 0,0:45:55.07,0:45:59.34,Default,,0000,0000,0000,,Thanks. All right. Let me just take this uniform convergence bound and rewrite Dialogue: 0,0:45:59.34,0:46:02.28,Default,,0000,0000,0000,,it in a couple of other forms.
Dialogue: 0,0:46:02.28,0:46:04.59,Default,,0000,0000,0000,,So Dialogue: 0,0:46:04.59,0:46:08.69,Default,,0000,0000,0000,,this is a sort of a bound on probability, this is saying suppose I fix my training Dialogue: 0,0:46:08.69,0:46:11.09,Default,,0000,0000,0000,,set and then fix my Dialogue: 0,0:46:11.09,0:46:14.70,Default,,0000,0000,0000,,threshold, my error threshold gamma, Dialogue: 0,0:46:14.70,0:46:19.11,Default,,0000,0000,0000,,what is the probability that uniform convergence holds, and well, Dialogue: 0,0:46:19.11,0:46:22.67,Default,,0000,0000,0000,,that's my formula that gives the answer. This is the probability of something Dialogue: 0,0:46:22.67,0:46:25.75,Default,,0000,0000,0000,,happening. Dialogue: 0,0:46:25.75,0:46:28.75,Default,,0000,0000,0000,,So there are actually three parameters of interest. One is, Dialogue: 0,0:46:28.75,0:46:31.54,Default,,0000,0000,0000,,what is this probability? The Dialogue: 0,0:46:31.54,0:46:34.29,Default,,0000,0000,0000,,other parameter is, what's the training set size M? Dialogue: 0,0:46:34.29,0:46:36.02,Default,,0000,0000,0000,,And the third parameter is, what Dialogue: 0,0:46:36.02,0:46:37.01,Default,,0000,0000,0000,,is the value Dialogue: 0,0:46:37.01,0:46:39.99,Default,,0000,0000,0000,,of this error Dialogue: 0,0:46:39.99,0:46:41.95,Default,,0000,0000,0000,,threshold gamma? I'm not gonna Dialogue: 0,0:46:41.95,0:46:44.31,Default,,0000,0000,0000,,vary K for these purposes. Dialogue: 0,0:46:44.31,0:46:47.62,Default,,0000,0000,0000,,So there are two other equivalent forms of the bound, Dialogue: 0,0:46:47.62,0:46:49.51,Default,,0000,0000,0000,,and so you can ask, Dialogue: 0,0:46:49.51,0:46:50.60,Default,,0000,0000,0000,,given Dialogue: 0,0:46:50.60,0:46:52.31,Default,,0000,0000,0000,,gamma Dialogue: 0,0:46:52.31,0:46:53.29,Default,,0000,0000,0000,,so what Dialogue: 0,0:46:53.29,0:46:56.02,Default,,0000,0000,0000,,we proved was given gamma and given M, what Dialogue: 0,0:46:56.02,0:46:57.85,Default,,0000,0000,0000,,is the probability of Dialogue: 0,0:46:57.85,0:47:01.02,Default,,0000,0000,0000,,uniform convergence? The Dialogue: 0,0:47:01.02,0:47:04.15,Default,,0000,0000,0000,,other equivalent forms are, Dialogue: 0,0:47:04.15,0:47:06.90,Default,,0000,0000,0000,,first, given gamma Dialogue: 0,0:47:06.90,0:47:13.90,Default,,0000,0000,0000,,and the probability delta of making a large error, Dialogue: 0,0:47:15.34,0:47:19.05,Default,,0000,0000,0000,,how large a training set size do you need in order to give Dialogue: 0,0:47:19.05,0:47:21.95,Default,,0000,0000,0000,,a bound? That is, how large a Dialogue: 0,0:47:21.95,0:47:25.46,Default,,0000,0000,0000,,training set size do you need Dialogue: 0,0:47:25.46,0:47:29.75,Default,,0000,0000,0000,,to give a uniform convergence bound with parameters gamma Dialogue: 0,0:47:29.75,0:47:31.21,Default,,0000,0000,0000,,and delta? Dialogue: 0,0:47:31.21,0:47:33.75,Default,,0000,0000,0000,,And so well, Dialogue: 0,0:47:33.75,0:47:37.73,Default,,0000,0000,0000,,so if you set delta to be two K E to the negative two gamma Dialogue: 0,0:47:37.73,0:47:38.49,Default,,0000,0000,0000,,squared M, Dialogue: 0,0:47:38.49,0:47:41.36,Default,,0000,0000,0000,,this is the form that I had on the left.
Dialogue: 0,0:47:41.36,0:47:43.63,Default,,0000,0000,0000,,And if you solve Dialogue: 0,0:47:43.63,0:47:46.08,Default,,0000,0000,0000,,for M, Dialogue: 0,0:47:46.08,0:47:48.17,Default,,0000,0000,0000,,what you find is that Dialogue: 0,0:47:48.17,0:47:54.43,Default,,0000,0000,0000,,there's an equivalent form of this result that says that Dialogue: 0,0:47:54.43,0:47:55.89,Default,,0000,0000,0000,,so long as your training Dialogue: 0,0:47:55.89,0:47:59.03,Default,,0000,0000,0000,,set size M is greater than this. Dialogue: 0,0:47:59.03,0:48:05.39,Default,,0000,0000,0000,,And this is the formula that I get by solving for M. Dialogue: 0,0:48:05.39,0:48:08.81,Default,,0000,0000,0000,,Okay? So long as M is greater than or equal to this, Dialogue: 0,0:48:08.81,0:48:12.93,Default,,0000,0000,0000,,then with probability, which I'm abbreviating to WP again, Dialogue: 0,0:48:12.93,0:48:17.37,Default,,0000,0000,0000,,with probability at least one minus delta, Dialogue: 0,0:48:17.37,0:48:24.37,Default,,0000,0000,0000,,we have for all. Dialogue: 0,0:48:24.88,0:48:30.96,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:48:30.96,0:48:34.30,Default,,0000,0000,0000,,So this says how large a training set size I need Dialogue: 0,0:48:34.30,0:48:37.86,Default,,0000,0000,0000,,to guarantee that with probability at least one minus delta, Dialogue: 0,0:48:37.86,0:48:42.24,Default,,0000,0000,0000,,we have the training error is within gamma of generalization error for all my Dialogue: 0,0:48:42.24,0:48:45.26,Default,,0000,0000,0000,,hypotheses, and this gives an answer. Dialogue: 0,0:48:45.26,0:48:49.40,Default,,0000,0000,0000,,And just to give this another name, this is an example of a sample Dialogue: 0,0:48:49.40,0:48:53.47,Default,,0000,0000,0000,,complexity Dialogue: 0,0:48:53.47,0:48:58.08,Default,,0000,0000,0000,, Dialogue: 0,0:48:58.08,0:48:58.94,Default,,0000,0000,0000,,bound. Dialogue: 0,0:48:58.94,0:49:02.87,Default,,0000,0000,0000,,So from undergrad computer science classes you may have heard of Dialogue: 0,0:49:02.87,0:49:05.68,Default,,0000,0000,0000,,computational complexity, which is how much computation you need to achieve Dialogue: 0,0:49:05.68,0:49:06.85,Default,,0000,0000,0000,,something. Dialogue: 0,0:49:06.85,0:49:11.32,Default,,0000,0000,0000,,So sample complexity just means how large a training set how Dialogue: 0,0:49:11.32,0:49:13.68,Default,,0000,0000,0000,,large a sample of examples do you need Dialogue: 0,0:49:13.68,0:49:16.55,Default,,0000,0000,0000,,in order to achieve a certain bound on the error. Dialogue: 0,0:49:16.55,0:49:19.62,Default,,0000,0000,0000,,And it turns out that in many of the theorems we write out you can Dialogue: 0,0:49:19.62,0:49:23.66,Default,,0000,0000,0000,,pose them in sort of a form of probability bound or a sample complexity bound or in some other Dialogue: 0,0:49:23.66,0:49:24.46,Default,,0000,0000,0000,,form. Dialogue: 0,0:49:24.46,0:49:28.37,Default,,0000,0000,0000,,I personally often find the sample complexity bounds the Dialogue: 0,0:49:28.37,0:49:32.42,Default,,0000,0000,0000,,easiest to interpret because it says how large a training set do you need to give a certain bound on the Dialogue: 0,0:49:32.42,0:49:35.82,Default,,0000,0000,0000,,error.
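The sample complexity form can be computed directly; here is a sketch (with assumed values for gamma and delta) that solves delta = two K e to the negative two gamma squared M for M. Varying K also shows how slowly the required training set size grows with the number of hypotheses, which is the point made next.

```python
import math

def sample_complexity(K, gamma, delta):
    """Smallest M satisfying the bound: set delta = 2K exp(-2 gamma^2 M)
    and solve for M, giving M >= (1 / (2 gamma^2)) * log(2K / delta)."""
    return math.ceil(math.log(2 * K / delta) / (2 * gamma**2))

# Assumed values, for illustration only: gamma = 0.05, delta = 0.05.
# Note how slowly the required M grows as K increases.
for K in (10, 1000, 10**6, 10**9):
    print(f"K = {K:>10}: M >= {sample_complexity(K, gamma=0.05, delta=0.05)}")
```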
Dialogue: 0,0:49:35.82,0:49:38.83,Default,,0000,0000,0000,,And in fact well, we'll see this later, Dialogue: 0,0:49:38.83,0:49:41.05,Default,,0000,0000,0000,,sample complexity bounds often sort of Dialogue: 0,0:49:41.05,0:49:43.30,Default,,0000,0000,0000,,help to give guidance for really if Dialogue: 0,0:49:43.30,0:49:44.24,Default,,0000,0000,0000,,you're trying to Dialogue: 0,0:49:44.24,0:49:46.14,Default,,0000,0000,0000,,achieve something on a machine learning problem, Dialogue: 0,0:49:46.14,0:49:46.94,Default,,0000,0000,0000,,this really is Dialogue: 0,0:49:46.94,0:49:49.72,Default,,0000,0000,0000,,trying to give guidance on how much training data you need Dialogue: 0,0:49:49.72,0:49:54.27,Default,,0000,0000,0000,,to achieve something. Dialogue: 0,0:49:54.27,0:49:57.77,Default,,0000,0000,0000,,The one thing I want to note here is that M Dialogue: 0,0:49:57.77,0:50:01.02,Default,,0000,0000,0000,,grows like the log of K, right, so Dialogue: 0,0:50:01.02,0:50:05.63,Default,,0000,0000,0000,,the log of K grows extremely slowly as a function of K. The log is one Dialogue: 0,0:50:05.63,0:50:10.62,Default,,0000,0000,0000,,of the slowest growing functions, right. It's one of well, Dialogue: 0,0:50:10.62,0:50:17.62,Default,,0000,0000,0000,,some of you may have heard this, right? That for all values of K, right I learned Dialogue: 0,0:50:18.87,0:50:22.18,Default,,0000,0000,0000,,this from a colleague, Andrew Moore, Dialogue: 0,0:50:22.18,0:50:23.74,Default,,0000,0000,0000,,at Carnegie Mellon Dialogue: 0,0:50:23.74,0:50:24.70,Default,,0000,0000,0000,,that in Dialogue: 0,0:50:24.70,0:50:30.55,Default,,0000,0000,0000,,computer science for all practical purposes for all values of K, log K is less than [inaudible], this is Dialogue: 0,0:50:30.55,0:50:31.51,Default,,0000,0000,0000,,almost true. Dialogue: 0,0:50:31.51,0:50:35.53,Default,,0000,0000,0000,,So the log is one of the slowest growing functions, and so the Dialogue: 0,0:50:35.53,0:50:40.06,Default,,0000,0000,0000,,fact that the sample complexity M grows like the log of K Dialogue: 0,0:50:40.06,0:50:41.55,Default,,0000,0000,0000,,means that Dialogue: 0,0:50:41.55,0:50:45.75,Default,,0000,0000,0000,,you can increase the number of hypotheses in your hypothesis class quite a lot Dialogue: 0,0:50:45.75,0:50:50.61,Default,,0000,0000,0000,,and the number of training examples you need won't grow Dialogue: 0,0:50:50.61,0:50:56.39,Default,,0000,0000,0000,, Dialogue: 0,0:50:56.39,0:51:00.29,Default,,0000,0000,0000,,very much. [Inaudible]. This property will be important later when we talk about infinite Dialogue: 0,0:51:00.29,0:51:02.86,Default,,0000,0000,0000,,hypothesis classes. Dialogue: 0,0:51:02.86,0:51:05.68,Default,,0000,0000,0000,,The final form is what Dialogue: 0,0:51:05.68,0:51:08.49,Default,,0000,0000,0000,,I guess is sometimes called the error bound, Dialogue: 0,0:51:08.49,0:51:13.66,Default,,0000,0000,0000,,which is when you hold M and delta fixed and solve for gamma. Dialogue: 0,0:51:13.66,0:51:20.66,Default,,0000,0000,0000,,And so Dialogue: 0,0:51:22.81,0:51:27.47,Default,,0000,0000,0000,,what you get then is that with Dialogue: 0,0:51:27.47,0:51:30.76,Default,,0000,0000,0000,,probability at least one minus delta, Dialogue: 0,0:51:30.76,0:51:37.76,Default,,0000,0000,0000,,we have that.
Dialogue: 0,0:51:41.34,0:51:44.41,Default,,0000,0000,0000,,For all hypotheses in my hypothesis class, Dialogue: 0,0:51:44.41,0:51:51.41,Default,,0000,0000,0000,,the difference between training and generalization error will be less Dialogue: 0,0:51:53.69,0:51:55.17,Default,,0000,0000,0000,,than or equal to that. Okay? And that's just Dialogue: 0,0:51:55.17,0:52:02.17,Default,,0000,0000,0000,,solving for gamma and plugging the value I get in there. Okay? All right. Dialogue: 0,0:52:03.24,0:52:10.24,Default,,0000,0000,0000,,So the Dialogue: 0,0:52:30.95,0:52:33.57,Default,,0000,0000,0000,,second Dialogue: 0,0:52:33.57,0:52:36.22,Default,,0000,0000,0000,,step of Dialogue: 0,0:52:36.22,0:52:43.22,Default,,0000,0000,0000,,the overall proof I want to execute is the following. Dialogue: 0,0:52:44.52,0:52:47.33,Default,,0000,0000,0000,,The result so far is essentially that uniform Dialogue: 0,0:52:47.33,0:52:49.10,Default,,0000,0000,0000,,convergence will hold true Dialogue: 0,0:52:49.10,0:52:51.44,Default,,0000,0000,0000,,with high probability. Dialogue: 0,0:52:51.44,0:52:55.25,Default,,0000,0000,0000,,What I want to show now is let's assume that uniform convergence holds. So let's Dialogue: 0,0:52:55.25,0:52:57.96,Default,,0000,0000,0000,,assume that for all Dialogue: 0,0:52:57.96,0:53:00.09,Default,,0000,0000,0000,,hypotheses H, Dialogue: 0,0:53:00.09,0:53:05.44,Default,,0000,0000,0000,,we have that epsilon of H minus epsilon hat of H is Dialogue: 0,0:53:05.44,0:53:08.62,Default,,0000,0000,0000,,less than or equal to gamma. Okay? Dialogue: 0,0:53:08.62,0:53:10.93,Default,,0000,0000,0000,,What I want to do now is Dialogue: 0,0:53:10.93,0:53:13.14,Default,,0000,0000,0000,,use this to see what we can Dialogue: 0,0:53:13.14,0:53:14.91,Default,,0000,0000,0000,,prove about Dialogue: 0,0:53:14.91,0:53:21.91,Default,,0000,0000,0000,,the generalization error. Dialogue: 0,0:53:28.83,0:53:31.40,Default,,0000,0000,0000,, Dialogue: 0,0:53:31.40,0:53:32.62,Default,,0000,0000,0000,,So I want to know, Dialogue: 0,0:53:32.62,0:53:35.64,Default,,0000,0000,0000,,suppose this holds true, I want Dialogue: 0,0:53:35.64,0:53:40.26,Default,,0000,0000,0000,,to know can we prove something about the generalization error of H hat, where Dialogue: 0,0:53:40.26,0:53:43.38,Default,,0000,0000,0000,,again, H hat Dialogue: 0,0:53:43.38,0:53:45.95,Default,,0000,0000,0000,,was the hypothesis Dialogue: 0,0:53:45.95,0:53:51.68,Default,,0000,0000,0000,,selected by empirical risk minimization. Dialogue: 0,0:53:51.68,0:53:55.79,Default,,0000,0000,0000,,Okay? So in order to show this, let me make one more definition, let me define H Dialogue: 0,0:53:55.79,0:54:00.29,Default,,0000,0000,0000,,star, Dialogue: 0,0:54:00.29,0:54:03.02,Default,,0000,0000,0000,,to be the hypothesis Dialogue: 0,0:54:03.02,0:54:06.46,Default,,0000,0000,0000,,in my class Dialogue: 0,0:54:06.46,0:54:10.19,Default,,0000,0000,0000,,script H that has the smallest generalization error. Dialogue: 0,0:54:10.19,0:54:11.31,Default,,0000,0000,0000,,So this is Dialogue: 0,0:54:11.31,0:54:14.93,Default,,0000,0000,0000,,if I had an infinite amount of training data or if I really Dialogue: 0,0:54:14.93,0:54:19.03,Default,,0000,0000,0000,,could go in and find the best Dialogue: 0,0:54:19.03,0:54:23.36,Default,,0000,0000,0000,,possible hypothesis in the sense of minimizing generalization error Dialogue: 0,0:54:23.36,0:54:28.99,Default,,0000,0000,0000,,what's the hypothesis I would get? Okay?
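Before continuing with the proof, the error bound form derived a moment ago, gamma equals the square root of one over two M times log two K over delta, can also be computed directly; a sketch with assumed values:

```python
import math

def error_bound(K, M, delta):
    """Solve delta = 2K exp(-2 gamma^2 M) for gamma: with probability at
    least 1 - delta, every hypothesis in the class has
    |training error - generalization error| <= this gamma."""
    return math.sqrt(math.log(2 * K / delta) / (2 * M))

# Assumed values, for illustration only: K = 1000 hypotheses, delta = 0.05.
for M in (100, 1000, 10000):
    print(f"M = {M:5d}: gamma = {error_bound(K=1000, M=M, delta=0.05):.4f}")
```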
So Dialogue: 0,0:54:28.99,0:54:32.96,Default,,0000,0000,0000,,in some sense, it sort of makes sense to compare the performance of our learning Dialogue: 0,0:54:32.96,0:54:33.61,Default,,0000,0000,0000,,algorithm Dialogue: 0,0:54:33.61,0:54:35.57,Default,,0000,0000,0000,,to the performance of H star, because we Dialogue: 0,0:54:35.57,0:54:39.96,Default,,0000,0000,0000,,clearly can't hope to do better than H star. Dialogue: 0,0:54:39.96,0:54:42.47,Default,,0000,0000,0000,,Another way of saying that is that if Dialogue: 0,0:54:42.47,0:54:47.61,Default,,0000,0000,0000,,your hypothesis class is the class of all linear decision boundaries, but Dialogue: 0,0:54:47.61,0:54:50.63,Default,,0000,0000,0000,,the data just can't be separated by any linear function, Dialogue: 0,0:54:50.63,0:54:53.74,Default,,0000,0000,0000,,so even H star is really bad, Dialogue: 0,0:54:53.74,0:54:55.24,Default,,0000,0000,0000,,then there's sort of Dialogue: 0,0:54:55.24,0:54:59.12,Default,,0000,0000,0000,,just not much hope that your learning algorithm could do Dialogue: 0,0:54:59.12,0:55:04.13,Default,,0000,0000,0000,,better Dialogue: 0,0:55:04.13,0:55:08.98,Default,,0000,0000,0000,,than H star. So I'll actually prove this result in three steps. Dialogue: 0,0:55:08.98,0:55:13.40,Default,,0000,0000,0000,,So the generalization error of H hat, the hypothesis I chose, Dialogue: 0,0:55:13.40,0:55:20.40,Default,,0000,0000,0000,,this is going to be less than or equal to that, actually let me number these Dialogue: 0,0:55:21.74,0:55:25.06,Default,,0000,0000,0000,,equations, right. This Dialogue: 0,0:55:25.06,0:55:27.20,Default,,0000,0000,0000,,is Dialogue: 0,0:55:27.20,0:55:30.29,Default,,0000,0000,0000,,because of equation one, because I see that Dialogue: 0,0:55:30.29,0:55:33.09,Default,,0000,0000,0000,,epsilon of H hat and epsilon hat of H hat are within gamma of each Dialogue: 0,0:55:33.09,0:55:36.63,Default,,0000,0000,0000,,other. Dialogue: 0,0:55:36.63,0:55:38.32,Default,,0000,0000,0000,,Now Dialogue: 0,0:55:38.32,0:55:42.14,Default,,0000,0000,0000,,because H star, excuse me, Dialogue: 0,0:55:42.14,0:55:45.89,Default,,0000,0000,0000,,now by the Dialogue: 0,0:55:45.89,0:55:51.01,Default,,0000,0000,0000,,definition of empirical risk minimization, H Dialogue: 0,0:55:51.01,0:55:54.05,Default,,0000,0000,0000,,hat was chosen to minimize training error Dialogue: 0,0:55:54.05,0:55:58.68,Default,,0000,0000,0000,,and so there can't be any hypothesis with lower training error than H hat. Dialogue: 0,0:55:58.68,0:56:04.63,Default,,0000,0000,0000,,So the training error of H hat must be less than or equal to Dialogue: 0,0:56:04.63,0:56:07.08,Default,,0000,0000,0000,,the training error of H star. So Dialogue: 0,0:56:07.08,0:56:10.55,Default,,0000,0000,0000,,this is sort of by two, or by the definition of H hat, Dialogue: 0,0:56:10.55,0:56:17.55,Default,,0000,0000,0000,,as the hypothesis that minimizes training error. Dialogue: 0,0:56:17.62,0:56:20.16,Default,,0000,0000,0000,,And the final step is I'm Dialogue: 0,0:56:20.16,0:56:23.50,Default,,0000,0000,0000,,going to apply this uniform convergence result again. We know that Dialogue: 0,0:56:23.50,0:56:24.93,Default,,0000,0000,0000,,epsilon hat Dialogue: 0,0:56:24.93,0:56:29.55,Default,,0000,0000,0000,,of H star must be within gamma of epsilon of H star. Dialogue: 0,0:56:29.55,0:56:33.88,Default,,0000,0000,0000,,And so this is at most Dialogue: 0,0:56:33.88,0:56:35.20,Default,,0000,0000,0000,,plus gamma.
Then Dialogue: 0,0:56:35.20,0:56:36.09,Default,,0000,0000,0000,, Dialogue: 0,0:56:36.09,0:56:38.30,Default,,0000,0000,0000,,I have my original gamma Dialogue: 0,0:56:38.30,0:56:41.32,Default,,0000,0000,0000,,there. Okay? And so this is by Dialogue: 0,0:56:41.32,0:56:44.52,Default,,0000,0000,0000,,equation one again because oh, excuse me Dialogue: 0,0:56:44.52,0:56:47.69,Default,,0000,0000,0000,,because I know the training error of H star must be within gamma of the Dialogue: 0,0:56:47.69,0:56:49.93,Default,,0000,0000,0000,,generalization error of H star. Dialogue: 0,0:56:49.93,0:56:51.70,Default,,0000,0000,0000,,And so well, I'll just Dialogue: 0,0:56:51.70,0:56:57.47,Default,,0000,0000,0000,,write this as plus two gamma. Okay? Yeah? Student: [Inaudible] notation, is epsilon hat of [inaudible] H hat the training Dialogue: 0,0:56:57.47,0:57:04.47,Default,,0000,0000,0000,,error, or the generalization error estimate of the hypothesis? Instructor (Andrew Ng): Oh, Dialogue: 0,0:57:24.87,0:57:26.65,Default,,0000,0000,0000,,okay. Let me just well, let Dialogue: 0,0:57:26.65,0:57:32.78,Default,,0000,0000,0000,,me write that down on this board. So actually let me Dialogue: 0,0:57:32.78,0:57:36.46,Default,,0000,0000,0000,,think Dialogue: 0,0:57:36.46,0:57:38.74,Default,,0000,0000,0000,,[inaudible] Dialogue: 0,0:57:38.74,0:57:43.70,Default,,0000,0000,0000,,fit this in here. So epsilon hat Dialogue: 0,0:57:43.70,0:57:45.17,Default,,0000,0000,0000,,of H is the Dialogue: 0,0:57:45.17,0:57:48.100,Default,,0000,0000,0000,,training Dialogue: 0,0:57:48.100,0:57:50.62,Default,,0000,0000,0000,, Dialogue: 0,0:57:50.62,0:57:52.47,Default,,0000,0000,0000,,error of the hypothesis H. In Dialogue: 0,0:57:52.47,0:57:55.95,Default,,0000,0000,0000,,other words, given the hypothesis a hypothesis is just a function, right, mapping from Xs to Dialogue: 0,0:57:55.95,0:57:57.56,Default,,0000,0000,0000,,Ys Dialogue: 0,0:57:57.56,0:57:59.49,Default,,0000,0000,0000,,so epsilon hat of H is, Dialogue: 0,0:57:59.49,0:58:02.81,Default,,0000,0000,0000,,given the hypothesis H, what's the fraction of training examples it Dialogue: 0,0:58:02.81,0:58:05.03,Default,,0000,0000,0000,,misclassifies? Dialogue: 0,0:58:05.03,0:58:11.60,Default,,0000,0000,0000,,And generalization error Dialogue: 0,0:58:11.60,0:58:13.84,Default,,0000,0000,0000,,of Dialogue: 0,0:58:13.84,0:58:17.95,Default,,0000,0000,0000,,H is, given the hypothesis H, if I Dialogue: 0,0:58:17.95,0:58:19.15,Default,,0000,0000,0000,,sample another Dialogue: 0,0:58:19.15,0:58:21.71,Default,,0000,0000,0000,,example from my distribution Dialogue: 0,0:58:21.71,0:58:22.90,Default,,0000,0000,0000,,script Dialogue: 0,0:58:22.90,0:58:28.31,Default,,0000,0000,0000,,D, what's the probability that H will misclassify that example? Does that make sense? Dialogue: 0,0:58:28.31,0:58:31.86,Default,,0000,0000,0000,,Oh, Dialogue: 0,0:58:31.86,0:58:38.24,Default,,0000,0000,0000,,okay. And H hat is the hypothesis that's chosen by empirical risk minimization.
Dialogue: 0,0:58:38.24,0:58:42.58,Default,,0000,0000,0000,,So when I talk about empirical risk minimization, that's the algorithm that Dialogue: 0,0:58:42.58,0:58:47.80,Default,,0000,0000,0000,,minimizes training error, and so epsilon hat of H is the training error of H, Dialogue: 0,0:58:47.80,0:58:50.22,Default,,0000,0000,0000,,and so H hat is defined as Dialogue: 0,0:58:50.22,0:58:52.05,Default,,0000,0000,0000,,the hypothesis that, out Dialogue: 0,0:58:52.05,0:58:55.03,Default,,0000,0000,0000,,of all hypotheses in my class script H, Dialogue: 0,0:58:55.03,0:58:57.59,Default,,0000,0000,0000,,is the one that minimizes training error epsilon hat of H. Okay? All right. Yeah? Student: [Inaudible] H is [inaudible] a member Dialogue: 0,0:58:57.59,0:59:04.59,Default,,0000,0000,0000,,of Dialogue: 0,0:59:06.67,0:59:08.80,Default,,0000,0000,0000,,script H, [inaudible] Dialogue: 0,0:59:08.80,0:59:10.69,Default,,0000,0000,0000,,family Dialogue: 0,0:59:10.69,0:59:12.91,Default,,0000,0000,0000,,right? Instructor (Andrew Ng): Yes it is. Student: Dialogue: 0,0:59:12.91,0:59:19.91,Default,,0000,0000,0000,,So what happens with the generalization error [inaudible]? Instructor (Andrew Ng): Dialogue: 0,0:59:20.07,0:59:24.49,Default,,0000,0000,0000,,I'll talk about that later. So Dialogue: 0,0:59:24.49,0:59:31.49,Default,,0000,0000,0000,,let me tie all these things together into a Dialogue: 0,0:59:34.34,0:59:38.05,Default,,0000,0000,0000,,theorem. Dialogue: 0,0:59:38.05,0:59:42.52,Default,,0000,0000,0000,,Let there be a hypothesis class given with a Dialogue: 0,0:59:42.52,0:59:48.87,Default,,0000,0000,0000,,finite set of K hypotheses, Dialogue: 0,0:59:48.87,0:59:50.89,Default,,0000,0000,0000,,and let any M Dialogue: 0,0:59:50.89,0:59:52.25,Default,,0000,0000,0000,,and delta Dialogue: 0,0:59:52.25,0:59:58.32,Default,,0000,0000,0000,,be fixed. Dialogue: 0,0:59:58.32,1:00:02.05,Default,,0000,0000,0000,,Then so I fixed M and delta, so this will be the error bound Dialogue: 0,1:00:02.05,1:00:04.50,Default,,0000,0000,0000,,form of the theorem, right? Dialogue: 0,1:00:04.50,1:00:06.76,Default,,0000,0000,0000,,Then, with Dialogue: 0,1:00:06.76,1:00:09.64,Default,,0000,0000,0000,,probability at least one minus delta, Dialogue: 0,1:00:09.64,1:00:13.92,Default,,0000,0000,0000,,we have that the generalization error of H hat is Dialogue: 0,1:00:13.92,1:00:19.22,Default,,0000,0000,0000,,less than or equal to Dialogue: 0,1:00:19.22,1:00:22.61,Default,,0000,0000,0000,,the minimum over all hypotheses H in Dialogue: 0,1:00:22.61,1:00:25.05,Default,,0000,0000,0000,,script H of epsilon of H, Dialogue: 0,1:00:25.05,1:00:26.68,Default,,0000,0000,0000,,plus Dialogue: 0,1:00:26.68,1:00:33.68,Default,,0000,0000,0000,,two times the square root of one over two M, Dialogue: 0,1:00:38.03,1:00:39.11,Default,,0000,0000,0000,,log two K over delta. Okay? Dialogue: 0,1:00:39.11,1:00:41.76,Default,,0000,0000,0000,,So to prove this, well, Dialogue: 0,1:00:41.76,1:00:45.53,Default,,0000,0000,0000,,this term of course is just epsilon of H star. Dialogue: 0,1:00:45.53,1:00:47.88,Default,,0000,0000,0000,,And so to prove this Dialogue: 0,1:00:47.88,1:00:49.74,Default,,0000,0000,0000,,we set Dialogue: 0,1:00:49.74,1:00:52.51,Default,,0000,0000,0000,,gamma equal to that Dialogue: 0,1:00:52.51,1:00:55.97,Default,,0000,0000,0000,,this is two times the square root term. Dialogue: 0,1:00:55.97,1:01:02.97,Default,,0000,0000,0000,,To prove this theorem we set gamma equal to that square root term. Say that again? Dialogue: 0,1:01:04.01,1:01:07.52,Default,,0000,0000,0000,,Wait. Say that again?
Dialogue: 0,1:01:07.52,1:01:11.39,Default,,0000,0000,0000,,Oh, Dialogue: 0,1:01:13.95,1:01:16.06,Default,,0000,0000,0000,,yes. Thank you. That didn't Dialogue: 0,1:01:16.06,1:01:19.64,Default,,0000,0000,0000,,make sense at all. Dialogue: 0,1:01:19.64,1:01:20.94,Default,,0000,0000,0000,,Thanks. Great. Dialogue: 0,1:01:20.94,1:01:25.67,Default,,0000,0000,0000,,So set gamma to that square root term, Dialogue: 0,1:01:25.67,1:01:29.91,Default,,0000,0000,0000,,and so we know equation one, Dialogue: 0,1:01:29.91,1:01:31.74,Default,,0000,0000,0000,,right, from the previous board Dialogue: 0,1:01:31.74,1:01:35.28,Default,,0000,0000,0000,,holds with probability one minus delta. Dialogue: 0,1:01:35.28,1:01:40.43,Default,,0000,0000,0000,,Right. Equation one was the uniform convergence result right, that well, Dialogue: 0,1:01:40.43,1:01:47.43,Default,,0000,0000,0000,,IE. Dialogue: 0,1:01:49.55,1:01:52.25,Default,,0000,0000,0000,,This is equation one from the previous board, right, so Dialogue: 0,1:01:52.25,1:01:54.76,Default,,0000,0000,0000,,set gamma equal to this and we know that with probability Dialogue: 0,1:01:54.76,1:01:58.46,Default,,0000,0000,0000,,one minus delta this uniform convergence holds, Dialogue: 0,1:01:58.46,1:02:01.15,Default,,0000,0000,0000,,and whenever that holds, that implies you Dialogue: 0,1:02:01.15,1:02:04.72,Default,,0000,0000,0000,,know, I Dialogue: 0,1:02:04.72,1:02:05.02,Default,,0000,0000,0000,,guess Dialogue: 0,1:02:05.02,1:02:07.67,Default,,0000,0000,0000,,if we call this equation star I guess. Dialogue: 0,1:02:07.67,1:02:11.81,Default,,0000,0000,0000,,And whenever uniform convergence holds, we showed again, on the previous boards Dialogue: 0,1:02:11.81,1:02:17.09,Default,,0000,0000,0000,,that this result holds, that generalization error of H hat is less than Dialogue: 0,1:02:17.09,1:02:20.77,Default,,0000,0000,0000,,or equal to generalization error of H star plus two times gamma. Okay? And Dialogue: 0,1:02:20.77,1:02:22.67,Default,,0000,0000,0000,,so that proves Dialogue: 0,1:02:22.67,1:02:29.67,Default,,0000,0000,0000,,this theorem. So Dialogue: 0,1:02:41.56,1:02:44.23,Default,,0000,0000,0000,,this result sort of helps us to quantify Dialogue: 0,1:02:44.23,1:02:45.23,Default,,0000,0000,0000,,a little bit Dialogue: 0,1:02:45.23,1:02:47.61,Default,,0000,0000,0000,,that bias variance tradeoff Dialogue: 0,1:02:47.61,1:02:49.74,Default,,0000,0000,0000,,that I talked about Dialogue: 0,1:02:49.74,1:02:54.24,Default,,0000,0000,0000,,at the beginning of actually near the very start of this lecture. Dialogue: 0,1:02:54.24,1:02:58.39,Default,,0000,0000,0000,,And in particular Dialogue: 0,1:02:58.39,1:03:03.66,Default,,0000,0000,0000,,let's say I have some hypothesis class script H, that Dialogue: 0,1:03:03.66,1:03:06.04,Default,,0000,0000,0000,,I'm using, maybe the class of all Dialogue: 0,1:03:06.04,1:03:09.80,Default,,0000,0000,0000,,linear functions in linear regression, or logistic regression with Dialogue: 0,1:03:09.80,1:03:11.79,Default,,0000,0000,0000,,just the linear features. Dialogue: 0,1:03:11.79,1:03:14.74,Default,,0000,0000,0000,,And let's say I'm considering switching Dialogue: 0,1:03:14.74,1:03:19.62,Default,,0000,0000,0000,,to some new class H prime by having more features.
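To see the whole argument end to end, here is a small self-contained simulation (an entirely hypothetical finite class of threshold classifiers on synthetic noisy data, chosen only for illustration) of empirical risk minimization; as the theorem predicts, the ERM hypothesis generalizes nearly as well as the best hypothesis in the class:

```python
import random

random.seed(0)

# Hypothetical finite hypothesis class (assumed, for illustration only):
# threshold classifiers on [0, 1], h_t(x) = 1 if x >= t else 0.
thresholds = [k / 20 for k in range(21)]            # K = 21 hypotheses

def h(t, x):
    return 1 if x >= t else 0

def sample(m):
    """Draw m IID examples: x ~ Uniform[0, 1], true threshold 0.37,
    with 10% label noise."""
    data = []
    for _ in range(m):
        x = random.random()
        y = h(0.37, x)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def err(t, data):
    """Fraction of examples in data that h_t misclassifies."""
    return sum(1 for x, y in data if h(t, x) != y) / len(data)

train = sample(500)
test = sample(50000)    # large held-out set, a proxy for generalization error

# Empirical risk minimization: pick the hypothesis with lowest training error.
t_hat = min(thresholds, key=lambda t: err(t, train))
# Best-in-class hypothesis h* (approximated on the big held-out set).
t_star = min(thresholds, key=lambda t: err(t, test))

print(f"ERM choice    t_hat = {t_hat:.2f}, generalization error ~ {err(t_hat, test):.3f}")
print(f"best in class t*    = {t_star:.2f}, generalization error ~ {err(t_star, test):.3f}")
```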
So let's say this is Dialogue: 0,1:03:19.62,1:03:23.06,Default,,0000,0000,0000,,linear Dialogue: 0,1:03:23.06,1:03:27.24,Default,,0000,0000,0000,,and this is quadratic, Dialogue: 0,1:03:27.24,1:03:28.04,Default,,0000,0000,0000,, Dialogue: 0,1:03:28.04,1:03:31.66,Default,,0000,0000,0000,,so the class of all linear functions is a subset of the class of all quadratic functions, Dialogue: 0,1:03:31.66,1:03:34.86,Default,,0000,0000,0000,,and so H is a subset of H prime. And Dialogue: 0,1:03:34.86,1:03:36.79,Default,,0000,0000,0000,,let's say I'm considering Dialogue: 0,1:03:36.79,1:03:40.93,Default,,0000,0000,0000,,instead of using my linear hypothesis class let's say I'm considering switching Dialogue: 0,1:03:40.93,1:03:44.77,Default,,0000,0000,0000,,to a quadratic hypothesis class, or switching to a larger hypothesis Dialogue: 0,1:03:44.77,1:03:46.37,Default,,0000,0000,0000,,class. Dialogue: 0,1:03:46.37,1:03:47.13,Default,,0000,0000,0000,, Dialogue: 0,1:03:47.13,1:03:49.81,Default,,0000,0000,0000,,Then what are the tradeoffs involved? Well, I Dialogue: 0,1:03:49.81,1:03:52.93,Default,,0000,0000,0000,,proved this only for finite hypothesis classes, but we'll see that something Dialogue: 0,1:03:52.93,1:03:56.02,Default,,0000,0000,0000,,very similar holds for infinite hypothesis classes too. Dialogue: 0,1:03:56.02,1:03:57.56,Default,,0000,0000,0000,,But the tradeoff is Dialogue: 0,1:03:57.56,1:04:00.62,Default,,0000,0000,0000,,what if I switch from H to H prime, or I switch from linear to quadratic Dialogue: 0,1:04:00.62,1:04:03.100,Default,,0000,0000,0000,,functions. Then Dialogue: 0,1:04:03.100,1:04:08.03,Default,,0000,0000,0000,,epsilon of H star will become better because Dialogue: 0,1:04:08.03,1:04:12.42,Default,,0000,0000,0000,,the best hypothesis in my hypothesis class will become better. Dialogue: 0,1:04:12.42,1:04:14.83,Default,,0000,0000,0000,,The best quadratic function Dialogue: 0,1:04:14.83,1:04:16.05,Default,,0000,0000,0000,, Dialogue: 0,1:04:16.05,1:04:18.80,Default,,0000,0000,0000,,by best I mean in the sense of generalization error Dialogue: 0,1:04:18.80,1:04:21.30,Default,,0000,0000,0000,,the Dialogue: 0,1:04:21.30,1:04:24.83,Default,,0000,0000,0000,,quadratic function with the lowest generalization error Dialogue: 0,1:04:24.83,1:04:25.60,Default,,0000,0000,0000,,has to have Dialogue: 0,1:04:25.60,1:04:27.06,Default,,0000,0000,0000,,equal or, Dialogue: 0,1:04:27.06,1:04:28.02,Default,,0000,0000,0000,,more likely, lower generalization error than the best Dialogue: 0,1:04:28.02,1:04:30.28,Default,,0000,0000,0000,,linear function. Dialogue: 0,1:04:30.28,1:04:31.80,Default,,0000,0000,0000,, Dialogue: 0,1:04:31.80,1:04:37.55,Default,,0000,0000,0000,,So by switching to a more complex hypothesis class you can get this first term to go down. Dialogue: 0,1:04:37.55,1:04:40.35,Default,,0000,0000,0000,,But what I pay for then is that Dialogue: 0,1:04:40.35,1:04:42.83,Default,,0000,0000,0000,,K will increase. Dialogue: 0,1:04:42.83,1:04:46.45,Default,,0000,0000,0000,,By switching to a larger hypothesis class, Dialogue: 0,1:04:46.45,1:04:48.11,Default,,0000,0000,0000,,the first term will go down, Dialogue: 0,1:04:48.11,1:04:51.82,Default,,0000,0000,0000,,but the second term will increase because I now have a larger class of Dialogue: 0,1:04:51.82,1:04:52.88,Default,,0000,0000,0000,,hypotheses Dialogue: 0,1:04:52.88,1:04:54.63,Default,,0000,0000,0000,,and so the second term K Dialogue: 0,1:04:54.63,1:04:57.19,Default,,0000,0000,0000,,will increase.
Dialogue: 0,1:04:57.19,1:05:01.68,Default,,0000,0000,0000,,And so this is usually called the bias variance Dialogue: 0,1:05:01.68,1:05:03.05,Default,,0000,0000,0000,,tradeoff. Whereby Dialogue: 0,1:05:03.05,1:05:07.70,Default,,0000,0000,0000,,going to a larger hypothesis class maybe I have the hope of finding a better function, Dialogue: 0,1:05:07.70,1:05:09.95,Default,,0000,0000,0000,,but my risk of sort of Dialogue: 0,1:05:09.95,1:05:13.80,Default,,0000,0000,0000,,not fitting my model so accurately also increases, and Dialogue: 0,1:05:13.80,1:05:19.45,Default,,0000,0000,0000,,that's illustrated by the second term going up when the Dialogue: 0,1:05:19.45,1:05:21.30,Default,,0000,0000,0000,,size of your hypothesis class, Dialogue: 0,1:05:21.30,1:05:28.20,Default,,0000,0000,0000,,when K, goes up. Dialogue: 0,1:05:28.20,1:05:32.08,Default,,0000,0000,0000,,And so Dialogue: 0,1:05:32.08,1:05:35.28,Default,,0000,0000,0000,,speaking very loosely, Dialogue: 0,1:05:35.28,1:05:37.11,Default,,0000,0000,0000,,we can think of Dialogue: 0,1:05:37.11,1:05:40.15,Default,,0000,0000,0000,,this first term as corresponding Dialogue: 0,1:05:40.15,1:05:41.46,Default,,0000,0000,0000,,maybe to the bias Dialogue: 0,1:05:41.46,1:05:45.21,Default,,0000,0000,0000,,of the learning algorithm, or the bias of the hypothesis class. Dialogue: 0,1:05:45.21,1:05:50.43,Default,,0000,0000,0000,,And you can again speaking very loosely, Dialogue: 0,1:05:50.43,1:05:52.88,Default,,0000,0000,0000,,think of the second term as corresponding Dialogue: 0,1:05:52.88,1:05:56.76,Default,,0000,0000,0000,,to the variance in your hypothesis, in other words, Dialogue: 0,1:05:56.76,1:05:57.87,Default,,0000,0000,0000,,how well you Dialogue: 0,1:05:57.87,1:05:59.76,Default,,0000,0000,0000,,can Dialogue: 0,1:05:59.76,1:06:02.87,Default,,0000,0000,0000,,actually fit this hypothesis class to the data. Dialogue: 0,1:06:02.87,1:06:06.83,Default,,0000,0000,0000,,And by switching to a more complex hypothesis class, your variance increases and your bias decreases. Dialogue: 0,1:06:06.83,1:06:09.49,Default,,0000,0000,0000,, Dialogue: 0,1:06:09.49,1:06:13.54,Default,,0000,0000,0000,,As a note of warning, it turns out that if you've taken a statistics class you've seen Dialogue: 0,1:06:13.54,1:06:15.12,Default,,0000,0000,0000,,definitions of bias and variance, Dialogue: 0,1:06:15.12,1:06:18.50,Default,,0000,0000,0000,,which are often defined in terms of squared error or something. Dialogue: 0,1:06:18.50,1:06:21.58,Default,,0000,0000,0000,,It turns out that for classification problems, Dialogue: 0,1:06:21.58,1:06:24.65,Default,,0000,0000,0000,,there actually is no Dialogue: 0,1:06:24.65,1:06:25.88,Default,,0000,0000,0000,, Dialogue: 0,1:06:25.88,1:06:27.49,Default,,0000,0000,0000,,universally accepted Dialogue: 0,1:06:27.49,1:06:28.99,Default,,0000,0000,0000,,formal definition of Dialogue: 0,1:06:28.99,1:06:30.68,Default,,0000,0000,0000,,bias and Dialogue: 0,1:06:30.68,1:06:35.26,Default,,0000,0000,0000,,variance for classification problems. For regression problems, there is this squared error Dialogue: 0,1:06:35.26,1:06:38.64,Default,,0000,0000,0000,,definition. For classification problems it Dialogue: 0,1:06:38.64,1:06:39.98,Default,,0000,0000,0000,,turns out there've Dialogue: 0,1:06:39.98,1:06:43.70,Default,,0000,0000,0000,,been several competing proposals for definitions of bias and variance.
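Speaking equally loosely, the tradeoff can be watched numerically in a small simulation (a hypothetical polynomial regression setup, invented only for illustration): training error falls monotonically as model complexity grows, while held-out error first falls and then rises, which is exactly the cartoon described next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression setup (assumed, for illustration only):
# noisy cubic target on [-1, 1], a small fixed training set.
def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = x**3 - 0.5 * x + rng.normal(0.0, 0.15, n)
    return x, y

x_train, y_train = make_data(20)      # fixed training set size M
x_test, y_test = make_data(2000)      # large held-out set

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Model complexity = degree of the polynomial being fit. Training error
# falls monotonically; held-out error traces the U-shaped curve.
for degree in range(1, 13):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree:2d}: train {mse(coeffs, x_train, y_train):.4f}"
          f"   test {mse(coeffs, x_test, y_test):.4f}")
```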
So when I Dialogue: 0,1:06:43.70,1:06:45.30,Default,,0000,0000,0000,,say bias and variance here, Dialogue: 0,1:06:45.30,1:06:47.42,Default,,0000,0000,0000,,think of these as very loose, informal, Dialogue: 0,1:06:47.42,1:06:54.42,Default,,0000,0000,0000,,intuitive definitions, and not formal definitions. Okay. Dialogue: 0,1:07:00.50,1:07:07.50,Default,,0000,0000,0000,,The Dialogue: 0,1:07:15.33,1:07:18.08,Default,,0000,0000,0000,,cartoon associated with the intuition I just Dialogue: 0,1:07:18.08,1:07:19.39,Default,,0000,0000,0000,,described would be Dialogue: 0,1:07:19.39,1:07:22.35,Default,,0000,0000,0000,,as follows: Let's Dialogue: 0,1:07:22.35,1:07:24.01,Default,,0000,0000,0000,,say Dialogue: 0,1:07:24.01,1:07:26.92,Default,,0000,0000,0000,, Dialogue: 0,1:07:26.92,1:07:30.37,Default,,0000,0000,0000,,and everything about the plot will be for a fixed value of M, for a Dialogue: 0,1:07:30.37,1:07:35.22,Default,,0000,0000,0000,,fixed training set size M. On the vertical axis I'll plot error and Dialogue: 0,1:07:35.22,1:07:38.80,Default,,0000,0000,0000,,on the horizontal axis I'll plot model Dialogue: 0,1:07:38.80,1:07:43.73,Default,,0000,0000,0000,,complexity. Dialogue: 0,1:07:43.73,1:07:46.52,Default,,0000,0000,0000,,And by model complexity I mean Dialogue: 0,1:07:46.52,1:07:53.50,Default,,0000,0000,0000,,sort of the degree of the polynomial, Dialogue: 0,1:07:53.50,1:07:57.67,Default,,0000,0000,0000,, Dialogue: 0,1:07:57.67,1:07:59.74,Default,,0000,0000,0000,,the size of your Dialogue: 0,1:07:59.74,1:08:04.67,Default,,0000,0000,0000,,hypothesis class script H etc. It actually turns out, Dialogue: 0,1:08:04.67,1:08:07.51,Default,,0000,0000,0000,,you remember the bandwidth parameter Dialogue: 0,1:08:07.51,1:08:10.58,Default,,0000,0000,0000,,from Dialogue: 0,1:08:10.58,1:08:13.89,Default,,0000,0000,0000,,locally weighted linear regression, that also Dialogue: 0,1:08:13.89,1:08:17.46,Default,,0000,0000,0000,,has a similar effect in controlling how complex your model is. Dialogue: 0,1:08:17.46,1:08:20.65,Default,,0000,0000,0000,,Model complexity [inaudible] polynomial I guess. Dialogue: 0,1:08:20.65,1:08:23.24,Default,,0000,0000,0000,,So the more complex your model, Dialogue: 0,1:08:23.24,1:08:25.39,Default,,0000,0000,0000,,the better your Dialogue: 0,1:08:25.39,1:08:26.02,Default,,0000,0000,0000,,training error, Dialogue: 0,1:08:26.02,1:08:30.22,Default,,0000,0000,0000,,and so Dialogue: 0,1:08:30.22,1:08:32.89,Default,,0000,0000,0000,,your training error will tend to [inaudible] zero as you increase the complexity of your model because the Dialogue: 0,1:08:32.89,1:08:36.42,Default,,0000,0000,0000,,more complex your model the better you can fit your training set. Dialogue: 0,1:08:36.42,1:08:37.52,Default,,0000,0000,0000,,But Dialogue: 0,1:08:37.52,1:08:44.52,Default,,0000,0000,0000,,because of this bias variance tradeoff, Dialogue: 0,1:08:44.69,1:08:48.64,Default,,0000,0000,0000,,you find Dialogue: 0,1:08:48.64,1:08:53.37,Default,,0000,0000,0000,,that Dialogue: 0,1:08:53.37,1:08:57.14,Default,,0000,0000,0000,,generalization error will come down for a while and then it will go back up. Dialogue: 0,1:08:57.14,1:09:03.67,Default,,0000,0000,0000,,And this regime on the left is when you're underfitting the data or when Dialogue: 0,1:09:03.67,1:09:06.26,Default,,0000,0000,0000,,you have high bias.
Dialogue: 0,1:09:06.26,1:09:07.69,Default,,0000,0000,0000,,And this regime Dialogue: 0,1:09:07.69,1:09:10.38,Default,,0000,0000,0000,,on the right Dialogue: 0,1:09:10.38,1:09:16.37,Default,,0000,0000,0000,,is when you have high variance or Dialogue: 0,1:09:16.37,1:09:20.81,Default,,0000,0000,0000,,you're overfitting the data. Okay? Dialogue: 0,1:09:20.81,1:09:24.46,Default,,0000,0000,0000,,And this is why a model of sort of intermediate Dialogue: 0,1:09:24.46,1:09:27.40,Default,,0000,0000,0000,,complexity, somewhere here, Dialogue: 0,1:09:27.40,1:09:28.100,Default,,0000,0000,0000,,is often preferable Dialogue: 0,1:09:28.100,1:09:30.24,Default,,0000,0000,0000,,if what Dialogue: 0,1:09:30.24,1:09:32.86,Default,,0000,0000,0000,,you want is to minimize generalization error. Dialogue: 0,1:09:32.86,1:09:35.73,Default,,0000,0000,0000,,Okay? Dialogue: 0,1:09:35.73,1:09:39.51,Default,,0000,0000,0000,,So that's just a cartoon. In the next lecture we'll actually talk about a number of Dialogue: 0,1:09:39.51,1:09:40.23,Default,,0000,0000,0000,,algorithms Dialogue: 0,1:09:40.23,1:09:44.46,Default,,0000,0000,0000,,for trying to automatically select model complexities, say to get Dialogue: 0,1:09:44.46,1:09:48.18,Default,,0000,0000,0000,,you as close as possible to this minimum to Dialogue: 0,1:09:48.18,1:09:55.18,Default,,0000,0000,0000,,this area of minimized generalization error. Dialogue: 0,1:10:10.55,1:10:14.54,Default,,0000,0000,0000,,The last thing I want to do is actually Dialogue: 0,1:10:14.54,1:10:18.96,Default,,0000,0000,0000,,going back to the theorem I wrote out, I just want to take that theorem well, Dialogue: 0,1:10:18.96,1:10:20.15,Default,,0000,0000,0000,,so the theorem I wrote out Dialogue: 0,1:10:20.15,1:10:25.13,Default,,0000,0000,0000,,was an error bound theorem this says for fixed M and delta, with probability Dialogue: 0,1:10:25.13,1:10:28.64,Default,,0000,0000,0000,,one minus delta, I get a bound on Dialogue: 0,1:10:28.64,1:10:30.78,Default,,0000,0000,0000,,gamma, which is what this term is. Dialogue: 0,1:10:30.78,1:10:34.77,Default,,0000,0000,0000,,So the very last thing I wanna do today is just come back to this theorem and write out a Dialogue: 0,1:10:34.77,1:10:35.72,Default,,0000,0000,0000,,corollary Dialogue: 0,1:10:35.72,1:10:39.59,Default,,0000,0000,0000,,where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta Dialogue: 0,1:10:39.59,1:10:41.39,Default,,0000,0000,0000,,and solve for M. Dialogue: 0,1:10:41.39,1:10:46.08,Default,,0000,0000,0000,,And if you do that, Dialogue: 0,1:10:46.08,1:10:47.19,Default,,0000,0000,0000,,you Dialogue: 0,1:10:47.19,1:10:49.29,Default,,0000,0000,0000,,get the following corollary: Let H Dialogue: 0,1:10:49.29,1:10:52.44,Default,,0000,0000,0000,,be Dialogue: 0,1:10:52.44,1:10:54.76,Default,,0000,0000,0000,,fixed with K hypotheses Dialogue: 0,1:10:54.76,1:10:58.99,Default,,0000,0000,0000,,and let Dialogue: 0,1:10:58.99,1:11:01.34,Default,,0000,0000,0000,,any delta Dialogue: 0,1:11:01.34,1:11:04.37,Default,,0000,0000,0000,,and gamma be fixed.
Dialogue: 0,1:11:09.02,1:11:11.45,Default,,0000,0000,0000,,Then Dialogue: 0,1:11:11.45,1:11:18.45,Default,,0000,0000,0000,,in order to guarantee that, Dialogue: 0,1:11:22.80,1:11:25.06,Default,,0000,0000,0000,,let's say I want a guarantee Dialogue: 0,1:11:25.06,1:11:26.99,Default,,0000,0000,0000,,that the generalization error Dialogue: 0,1:11:26.99,1:11:30.73,Default,,0000,0000,0000,,of the hypothesis I choose with empirical risk minimization, Dialogue: 0,1:11:30.73,1:11:33.87,Default,,0000,0000,0000,,that this is at most two times gamma worse Dialogue: 0,1:11:33.87,1:11:38.32,Default,,0000,0000,0000,,than the best possible error I could obtain with this hypothesis class. Dialogue: 0,1:11:38.32,1:11:40.47,Default,,0000,0000,0000,,Let's say I want this to Dialogue: 0,1:11:40.47,1:11:45.05,Default,,0000,0000,0000,,hold true with probability at least one minus delta, Dialogue: 0,1:11:45.05,1:11:50.16,Default,,0000,0000,0000,,then it suffices Dialogue: 0,1:11:50.16,1:11:53.69,Default,,0000,0000,0000,,that M Dialogue: 0,1:11:53.69,1:12:00.52,Default,,0000,0000,0000,,is greater than or equal to that. Okay? And this is Dialogue: 0,1:12:00.52,1:12:01.01,Default,,0000,0000,0000,,sort of Dialogue: 0,1:12:01.01,1:12:05.57,Default,,0000,0000,0000,,solving the error Dialogue: 0,1:12:05.57,1:12:07.70,Default,,0000,0000,0000,,bound for M. One thing you're going to convince yourselves Dialogue: 0,1:12:07.70,1:12:10.11,Default,,0000,0000,0000,,of the easy part of this is Dialogue: 0,1:12:10.11,1:12:14.96,Default,,0000,0000,0000,,if you set that term equal to gamma and solve for M you will get this. Dialogue: 0,1:12:14.96,1:12:18.67,Default,,0000,0000,0000,,One thing I want you to go home and sort of convince yourselves of is that this Dialogue: 0,1:12:18.67,1:12:20.56,Default,,0000,0000,0000,,result really holds true. Dialogue: 0,1:12:20.56,1:12:23.87,Default,,0000,0000,0000,,That this really logically follows from the theorem we've proved. Dialogue: 0,1:12:23.87,1:12:25.67,Default,,0000,0000,0000,,In other words, Dialogue: 0,1:12:25.67,1:12:29.85,Default,,0000,0000,0000,,you can take that formula we wrote and solve for M and because this is the formula you get for M, Dialogue: 0,1:12:29.85,1:12:31.67,Default,,0000,0000,0000,,that's just the easy part. So Dialogue: 0,1:12:31.67,1:12:33.77,Default,,0000,0000,0000,,go back and convince yourselves that Dialogue: 0,1:12:33.77,1:12:37.66,Default,,0000,0000,0000,,this theorem is a true fact and that it does indeed logically follow from the Dialogue: 0,1:12:37.66,1:12:38.66,Default,,0000,0000,0000,,other one. In Dialogue: 0,1:12:38.66,1:12:39.87,Default,,0000,0000,0000,,particular, Dialogue: 0,1:12:39.87,1:12:43.74,Default,,0000,0000,0000,,make sure that if you solve for that you really get M greater than or equal to this, and why is this M Dialogue: 0,1:12:43.74,1:12:46.68,Default,,0000,0000,0000,,greater than or equal to that and not M less than or equal to, and just make sure Dialogue: 0,1:12:46.68,1:12:50.23,Default,,0000,0000,0000,,I can write this down and it sounds plausible, but why don't you just go back and convince yourself this is really true. Okay? Dialogue: 0,1:12:53.28,1:12:54.84,Default,,0000,0000,0000,,And Dialogue: 0,1:12:54.84,1:12:58.11,Default,,0000,0000,0000,,it turns out that when we prove these bounds in learning theory it turns out Dialogue: 0,1:12:58.11,1:13:02.82,Default,,0000,0000,0000,,that very often the constants are sort of loose.
So it turns out that when we prove Dialogue: 0,1:13:02.82,1:13:05.62,Default,,0000,0000,0000,,these bounds usually Dialogue: 0,1:13:05.62,1:13:10.35,Default,,0000,0000,0000,,we're not very interested in the constants, and so I Dialogue: 0,1:13:10.35,1:13:12.41,Default,,0000,0000,0000,,write this as big O of Dialogue: 0,1:13:12.41,1:13:16.14,Default,,0000,0000,0000,,one over gamma squared, log Dialogue: 0,1:13:16.14,1:13:17.78,Default,,0000,0000,0000,,K over delta, Dialogue: 0,1:13:17.78,1:13:20.57,Default,,0000,0000,0000,,and again, the key Dialogue: 0,1:13:20.57,1:13:23.87,Default,,0000,0000,0000,,step in this is that the dependence of M Dialogue: 0,1:13:23.87,1:13:26.71,Default,,0000,0000,0000,,on the size of the hypothesis class is logarithmic. Dialogue: 0,1:13:26.71,1:13:30.10,Default,,0000,0000,0000,,And this will be very important later when we Dialogue: 0,1:13:30.10,1:13:35.43,Default,,0000,0000,0000,,talk about infinite hypothesis classes. Okay? Any Dialogue: 0,1:13:35.43,1:13:42.43,Default,,0000,0000,0000,,questions about this? No? Okay, cool. Dialogue: 0,1:13:49.66,1:13:51.38,Default,,0000,0000,0000,,So Dialogue: 0,1:13:51.38,1:13:54.49,Default,,0000,0000,0000,,next lecture we'll come back, we'll actually start from this result again. Dialogue: 0,1:13:54.49,1:13:57.70,Default,,0000,0000,0000,,Remember this. I'll write this down as the first thing I do in the next lecture Dialogue: 0,1:13:57.70,1:14:02.52,Default,,0000,0000,0000,,and we'll generalize these to infinite hypothesis classes and then talk about Dialogue: 0,1:14:02.52,1:14:04.66,Default,,0000,0000,0000,,practical algorithms for model selection. So I'll see you guys in a couple days.