This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): So what I want to do today in this lecture is talk a little bit more about learning theory. In particular, I'll talk about VC dimension, building on the issues of bias-variance tradeoffs, of underfitting and overfitting, that we've been seeing in the previous lecture and that we'll see in this one. I then want to talk about model selection algorithms for automatically making decisions about this bias-variance tradeoff that we started to talk about in the previous lecture. And depending on how much time there is, I actually may not get to Bayesian [inaudible]. But if I don't get to this today, I'll get to it in next week's lecture.

To recap: the result we proved in the previous lecture was that if you have a finite hypothesis class, if h is a set of k hypotheses, and suppose you have some fixed parameters, gamma and delta, then in order to guarantee that this holds with probability at least one minus delta, it suffices that m is greater than or equal to that; okay? And using big-O notation, just dropping constants, I can also write this as that; okay?
So just to quickly remind you of what all of the notation means: we talked about empirical risk minimization, which was the simplified model of machine learning that has a hypothesis class of script h. And what the empirical risk minimization learning algorithm does is it just chooses the hypothesis that attains the smallest error on the training set. And this symbol, epsilon, just denoted generalization error; right? This is the probability of a hypothesis h [inaudible] misclassifying a new example drawn from the same distribution as the training set. And so this says that, in order to guarantee that the generalization error of the hypothesis h hat output by empirical risk minimization is less than or equal to the best possible generalization error attained in your hypothesis class plus two times gamma, two times this error threshold, and we want to guarantee that this holds with probability at least one minus delta, we showed that it suffices for your training set size m to be greater than or equal to this; okay? One over two gamma squared, log two k over delta; where again, k is the size of your hypothesis class. And so this is a sample complexity result, because it gives us a bound on the number of training examples we need in order to give a guarantee on the error; okay? So this is a sample complexity result. So what I want to do now is take this result, and try to generalize it to the case of infinite hypothesis classes.
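Spelled out, the result being recapped is the finite-hypothesis-class bound from the previous lecture; this is a reconstruction of what the "this" and "that" on the board refer to, in the notation used above:

\[
\varepsilon(\hat h)\;\le\;\Big(\min_{h\in\mathcal H}\varepsilon(h)\Big)+2\gamma
\quad\text{with probability at least }1-\delta,
\qquad\text{provided}\qquad
m\;\ge\;\frac{1}{2\gamma^{2}}\log\frac{2k}{\delta}\;=\;O\!\Big(\frac{1}{\gamma^{2}}\log\frac{k}{\delta}\Big).
\]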
So here, we said that the set script h is sort of just k specific functions, but often you want to use a model like logistic regression, which is actually parameterized by real numbers. So I'm actually first going to give an argument that's sort of formally broken, just sort of technically somewhat broken, but conveys useful intuition. And then I'll give the more correct argument, but without proving it, because the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want to apply this result to analyzing logistic regression. So let's say your hypothesis class is the class of all linear decision boundaries; right? So say script h is parameterized by d real numbers; okay? So for example, if you're applying logistic regression over [inaudible], then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers. Now think about how your hypothesis class is really represented in a computer: computers use zero-one bits to represent real numbers. And so if you use, like, a normal standard computer, it normally will represent real numbers by what's called double precision floating point numbers. And what that means is that each real number is represented by a 64-bit representation; right? So, really, you know what floating point is in a computer; a 64-bit floating point number is what almost all of us use routinely.
And so this, parameterized by d real numbers, is really as if it's parameterized by 64 times d bits. Computers can't represent real numbers; they can only represent discrete things. And so, for the size of your hypothesis class in your computer representation, you have 64 times d bits that you can flip. And so the number of possible values for your 64 d bits is really just two to the power of 64 d; okay? Because that's the number of ways you can flip the 64 d bits. And so this is why it's important that we had log k there; right? So k is, therefore, two to the 64 d. And if I plug it into this equation over here, what you find is that, in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, one over gamma squared, log two to the 64 d over delta, which is that; okay? So just to be clear: this is the same sort of sample complexity question as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have a generalization error that's within two gamma of the best hypothesis in your hypothesis class. Then what this result suggests is that, you know, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this.
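As a rough reconstruction of the plug-in computation just described (substituting k = 2^{64d} into the finite-class bound; the exact constants are not the point of the argument):

\[
m\;\ge\;O\!\Big(\frac{1}{\gamma^{2}}\log\frac{2^{64d}}{\delta}\Big)
\;=\;O\!\Big(\frac{64\,d}{\gamma^{2}}+\frac{1}{\gamma^{2}}\log\frac{1}{\delta}\Big)
\;=\;O_{\gamma,\delta}(d).
\]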
In other words, your number of training examples has to be on the order of d over gamma squared, times log one over delta; okay? And the intuition that this conveys is actually roughly right. This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class; that m has to be on the order of something linear in [inaudible]. That intuition is actually roughly right. I'll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating point numbers. So let me actually go ahead and outline the "right way" to show this more formally; all right? And it turns out the "right way" to show this more formally involves a much longer argument; the proof is extremely involved, so I'm just actually going to state the result and not prove it. The full proof is the source of learning theory bounds for infinite hypothesis classes. So here's a definition: given a set of d points, we say a hypothesis class h shatters the set s if h can realize any labeling on it; okay? And what I mean by realizing any labeling on it, the informal way of thinking about this, is: if a hypothesis class shatters the set s, what that means is that I can take these d points, and I can associate these d points with any set of labels y; right? So choose any set of labels y for each of these d points.
And if your hypothesis class shatters s, then that means that there will be a hypothesis that labels those d examples perfectly; okay? That's what shattering means. So let me just illustrate that with an example. So let's say h is the class of all linear classifiers in 2D, and let's say that s is this [inaudible] comprising two points; okay? So there are four possible labelings that you can have for these two points. You can choose to label both positive; one positive, one negative; one negative, one positive; or you can label both of them negative. And if the hypothesis class h is the class of all linear classifiers in 2D, then, for each of these training sets, I can sort of find a linear classifier that attains zero training error on each of these, on all possible labelings of this set of two points. And so I'll say that the hypothesis class script h shatters this set s of two points; okay? One more example; let me show you a larger example. Suppose my set s is now this set of three points; right? Then I now have eight possible labelings for these three points; okay? And so for these three points, I now have eight possible labelings. And once again, for each of these labelings, I can find a hypothesis in the hypothesis class that labels these examples correctly. And so once again, I say that, by definition, my hypothesis class also shatters this set s. Student: Right.
Instructor (Andrew Ng): And that's the terminology: h can realize any labeling on s. That's obviously [inaudible]: give it any set of labels, and you can find a hypothesis that perfectly separates the positive and negative examples; okay? So how about this set? Suppose s is now this set of four points. Then, you know, there are lots of labelings; there are now 16 labelings we can choose on this; right? That's one, for instance, and this is another one; right? And so I can realize some labelings, but there's no linear decision boundary that can realize this labeling, and so h does not shatter this set of four points; okay? And I'm not really going to prove it here, but it turns out that you can show that, in two dimensions, there is no set of four points that the class of all linear classifiers can shatter; okay? So here's another definition. When I say that the, well, it's called the VC dimension, after these two people, Vapnik and Chervonenkis. So given a hypothesis class, the Vapnik-Chervonenkis dimension of h, which we usually write as VC of script h, is the size of the largest set that is shattered by h. And if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite.
So just as a kind of good example: if h is the class of all linear classifiers in 2D, then the VC dimension of this is equal to three, because we saw just now that there was a set s of size three that it could shatter, and, I won't really prove it, but it turns out there is no set of size four that it can shatter. And therefore, the VC dimension of this is three. Yeah? Student: But there are sets of size three that it cannot shatter; right? [Inaudible] was your point. Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this, that's actually a set s, then there are labelings on this it cannot realize. And so h cannot shatter this set. But that's okay, because there definitely exists some other set of size three that is shattered. So the VC dimension is three. And then there is no set of size four that it can shatter. Yeah? Student: [Inaudible]. Instructor (Andrew Ng): Not according to this definition, no. Right. So again, let's see, I can choose my set s to be a set of three points that are all overlapping: three points in exactly the same place. And clearly, I can't shatter this set, but that's okay. And I can't shatter this set, either, but that's okay, because there are some other sets of size three that I can shatter.
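None of the following is from the lecture; it is a minimal brute-force sketch of the shattering examples above, assuming Python with numpy and scipy available. The helper names (is_linearly_realizable, shatters) are made up for this note; realizability of a labeling by a linear classifier is tested with a small feasibility linear program.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def is_linearly_realizable(X, y):
        """Can labels y in {-1, +1} on the rows of X be produced by sign(w.x + b)?
        Solved as a small LP: maximize t subject to y_i * (w.x_i + b) >= t, t <= 1."""
        if len(set(y)) == 1:
            return True  # an all-positive or all-negative labeling is always realizable
        n, d = X.shape
        # variables z = [w_1, ..., w_d, b, t]; minimizing -t maximizes the margin proxy t
        c = np.zeros(d + 2)
        c[-1] = -1.0
        # constraint rows: -y_i * (w.x_i + b) + t <= 0
        A_ub = np.hstack([-y[:, None] * X, -y[:, None], np.ones((n, 1))])
        b_ub = np.zeros(n)
        bounds = [(-100.0, 100.0)] * (d + 1) + [(None, 1.0)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.status == 0 and -res.fun > 1e-9  # t > 0 means strictly separable

    def shatters(X):
        """True if linear classifiers realize every labeling of the rows of X."""
        n = X.shape[0]
        return all(
            is_linearly_realizable(X, np.array(labels))
            for labels in itertools.product([-1, 1], repeat=n)
        )

    three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(shatters(three_points))  # expected: True
    print(shatters(four_points))   # expected: False

Run as is, it should report True for the three-point set and False for the four corners of a square (the XOR labeling has no linear separator), consistent with the claim that the VC dimension of linear classifiers in the plane is three.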
And it turns out this result holds true more generally: in any dimension, the VC dimension of the class of linear classifiers in n dimensions is equal to n plus one. Okay? So this is in [inaudible]: if you have linear classifiers over an n-dimensional feature space, the VC dimension of the class of linear classifiers in n dimensions is equal to n plus one. So maybe you wanna write down what is arguably the best-known result in all of learning theory, I guess, which is this. Let a hypothesis class be given, and let the VC dimension of h be equal to d. Then, with probability one minus delta, we have that... the formula on the right looks a bit complicated, but don't worry about it; I'll point out the essential aspects of it later. But the key to this result is that if you have a hypothesis class with VC dimension d, and now this can be an infinite hypothesis class, what Vapnik and Chervonenkis show is that, with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that, for all hypotheses h, for all the hypotheses in your hypothesis class, you have that the generalization error of h minus the training error of h.
So the difference between these two things is bounded above by some complicated formula like this; okay? And thus, with probability one minus delta, we also have the same kind of thing; okay? And going from this step to this step; right? Going from this step to this step is actually something that you saw yourself, that we actually proved earlier. Because, you remember, in the previous lecture we proved that if you have uniform convergence, then that implies, actually we showed, that if generalization error and training error are close to each other, within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error. So this is really the generalization error of h hat being at most the best possible generalization error plus two times gamma, and it's just the two constants in front here that I've absorbed into the big-O notation. So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that, in order to guarantee that this holds with probability one minus delta, with probability at least one minus delta, I should say, it suffices that, and I'm gonna write this this way: I'm gonna write m equals big-O of d, and I'm going to put gamma and delta in as subscripts there to denote that.
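For reference, here is one way to write out the "complicated formula" being referred to, following the form in the course notes; this is a reconstruction of the board, not a verbatim copy:

\[
\text{With probability at least } 1-\delta,\ \text{for all } h\in\mathcal H:\qquad
\big|\varepsilon(h)-\hat\varepsilon(h)\big|\;\le\;O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}\right),
\]
and hence, with probability at least \(1-\delta\),
\[
\varepsilon(\hat h)\;\le\;\varepsilon(h^{*})+O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d}+\frac{1}{m}\log\frac{1}{\delta}}\right),
\qquad d=\mathrm{VC}(\mathcal H).
\]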
Let's see, if we treat gamma and delta as constants, so they allow me to absorb terms that depend on gamma and delta into the big-O notation, then, in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay? So, let's see. So what we conclude from this is that, for empirical risk minimization algorithms, in other words, less formally, for learning algorithms that try to minimize training error, the intuition to take away from this is that the number of training examples you need is, therefore, roughly linear in the VC dimension of the hypothesis class. And more formally, this shows that sample complexity is upper bounded by the VC dimension; okay? It turns out that, for most reasonable hypothesis classes, the VC dimension is sort of very similar, I guess, to the number of parameters of your model. So, for example, take logistic regression, linear classification: logistic regression in n dimensions has n plus one parameters, and the VC dimension of the class of linear classifiers is also n plus one. So it turns out that, for most reasonable hypothesis classes, the VC dimension is usually linear in the number of parameters of your model, or at worst some low-order polynomial in the number of parameters of your model.
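And the corollary, written out under the same caveat (a reconstruction of the board, in the notation above):

\[
\text{For } \varepsilon(\hat h)\le \min_{h\in\mathcal H}\varepsilon(h)+2\gamma \text{ to hold with probability at least } 1-\delta,
\text{ it suffices that } m=O_{\gamma,\delta}\!\big(\mathrm{VC}(\mathcal H)\big),
\]
and for the class of linear classifiers in n dimensions, \(\mathrm{VC}(\mathcal H)=n+1\), matching the n plus one parameters of logistic regression.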
And so the takeaway intuition from this is that the number of training examples you need to fit those models is going to be, let's say, roughly linear in the number of parameters in your model; okay? There are some somewhat strange examples where what I just said is not true; there are some strange examples where you have very few parameters, but the VC dimension is enormous. But actually, all of the examples I know of that fall into that regime are somewhat strange and degenerate, somewhat unusual, and not the sorts of learning algorithms you usually use. Let's see, just a few other things. It turns out that, so this result shows the sample complexity is upper bounded by the VC dimension, but it turns out that, in the worst case, sample complexity is also lower bounded by the VC dimension. And what that means is that, if you have a perfectly nasty learning problem, say, then if the number of training examples you have is less than on the order of the VC dimension, then it is not possible to prove this bound. So, I guess, in the worst case, sample complexity, the number of training examples you need, is upper bounded and lower bounded by the VC dimension. Let's see, questions about this? Student: Does the proof of this assume any sort of finiteness of, like, finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng): Let's see. The proof does not, no.
I've actually stated the entirety of the theorem; this is true. It turns out that, somewhere in the proof, there's a step with a construction called an epsilon net, which is a very clever [inaudible]. But regardless of the proof, it is not an assumption that you need; it's just that the proof has one step that uses a very clever [inaudible] to prove this. But that's not needed as an assumption. I'd like to say, back when I was a Ph.D. student, when I was working through this proof, there was sort of a solid week where I would wake up, and go to the office at 9:00 a.m. Then I'd start reading the book that led up to this proof. And then I'd read from 9:00 a.m. to 6:00 p.m. And then I'd go home, and then the next day, I'd pick up where I left off. And it sort of took me a whole week that way to understand this proof, so I thought I wouldn't inflict that on you. Just to tie up a couple of loose ends: what I'm about to do is just mention a few things that will maybe feel a little bit like random facts, but I'm just gonna tie up a couple of loose ends. And so, let's see, it turns out, just so this is more concrete: this bound was proved for an algorithm that uses empirical risk minimization, for an algorithm that minimizes 0-1 training error. So one question that some of you ask is: how about support vector machines; right? How come SVMs don't overfit?
And if you remember our discussion on support vector machines, we said that you use kernels, and map the features into an infinite-dimensional feature space. And so it seems like the VC dimension should be infinite: n plus one, and n is infinite. So it turns out that the class of linear separators with large margin actually has low VC dimension. I wanna say this very quickly, and informally. It's actually not very important for you to understand the details, but I'm going to say it very informally. It turns out that, if I give you a set of points, and if I ask you to consider only the class of lines that separate these points with a large margin [inaudible], so my hypothesis class will comprise only the linear decision boundaries that separate the points with a large margin, say with a margin of at least gamma; okay? And so I won't allow a line that comes closer; like, I won't allow that line, because it comes too close to one of my points. It turns out that, if my data points all lie within some sphere of radius r, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to r squared over four gamma squared, plus one; okay? So this funny symbol here just means rounding up. This is a ceiling symbol; it means rounding up x.
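Written as an equation, the bound just stated (with R the radius of the sphere containing the data, gamma the margin, and the restricted class written here as \(\mathcal H_{\gamma}\), a label introduced only for this note):

\[
\mathrm{VC}(\mathcal H_{\gamma})\;\le\;\left\lceil \frac{R^{2}}{4\gamma^{2}} \right\rceil+1.
\]

Note that, as discussed below, this bound does not depend on the dimension of the points x.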
And it turns out, you can prove, and there are some strange things about this result that I'm deliberately not gonna talk about, but it turns out you can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded. The surprising thing about this is that this is a bound on VC dimension that has no dependence on the dimension of the points x. So, in other words, your data points x can lie in an infinite-dimensional space, but so long as you restrict attention to the class of separators with large margin, the VC dimension is bounded. And so, in trying to find a large margin separator, in trying to find the line that separates your positive and your negative examples with large margin, it turns out, therefore, that the support vector machine is automatically trying to find a hypothesis class with small VC dimension, and therefore it does not overfit. Alex? Student: What is the [inaudible]? Instructor (Andrew Ng): It is actually defined the same way as in finite-dimensional spaces. So, you know, suppose you have, actually, these are countably infinite dimensional vectors, not [inaudible] to the infinite dimensional vectors. The norm, the two-norm squared, is equal to the sum from i equals one to infinity of xi squared, so if x is infinite dimensional, you just define it like that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again.
Student: [Inaudible]. Instructor (Andrew Ng): Yes. Although, I assume that this is bounded by r. So, yeah, this ensures that it converges. So that's just something people sometimes wonder about. And lastly, I want to actually tie empirical risk minimization back a little more strongly to the sorts of algorithms we've talked about. It turns out that the theory we've talked about so far was really for empirical risk minimization. So let's focus on just one training example. Let me draw a function, you know, that's zero here, jumps to one, and looks like that. And so this, for one training example, this may be the indicator for h where, say, z equals theta transpose x; okay? For one training example, your training example will be positive or negative. And depending on what the value of this theta transpose x is, you either get it right or wrong. And so, you know, I guess, if you have a positive example, then when z is positive, you get it right. Suppose you have a negative example, so y equals zero; right? Then if z, which is theta transpose x, if this is positive, then you will get this example wrong; whereas, if z is negative, then you'd get this example right. And so this is a plot of the indicator of h subscript theta of x not equal to y; okay?
You know, it's equal to g of theta transpose x; okay? And so it turns out that what you'd really like to do is choose parameters theta so as to minimize this step function; right? You'd like to choose parameters theta so that you end up with a correct classification on your training example, and so you'd like this indicator, h of x not equal to y, you'd like this indicator function to be zero. It turns out this step function is clearly a non-convex function. And so it turns out that, just for linear classifiers, minimizing the training error is an NP-hard problem. It turns out that both logistic regression and support vector machines can be viewed as using a convex approximation for this problem. And in particular, and I'll draw a function like that, it turns out that logistic regression is trying to maximize likelihood, and so it's trying to minimize the minus of the log likelihood. And if you plot the minus of the log likelihood, it actually turns out it'll be a function that looks like this. And this line that I just drew, you can think of it as a rough approximation to this step function, which is maybe what you're really trying to minimize, so you want to minimize training error. So you can actually think of logistic regression as trying to approximate empirical risk minimization, where, instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses this curve above it.
So approximate it, Dialogue: 0,0:34:25.19,0:34:28.23,Default,,0000,0000,0000,,so you have a convex optimization problem you can Dialogue: 0,0:34:28.23,0:34:29.93,Default,,0000,0000,0000,,find the Dialogue: 0,0:34:29.93,0:34:34.41,Default,,0000,0000,0000,,maximum likelihood estimates of the parameters for logistic regression. Dialogue: 0,0:34:34.41,0:34:38.69,Default,,0000,0000,0000,,And it turns out, the support vector machine also can be viewed as approximating this function, Dialogue: 0,0:34:38.69,0:34:41.14,Default,,0000,0000,0000,,too only a little bit differently let's Dialogue: 0,0:34:41.14,0:34:44.72,Default,,0000,0000,0000,,see, support vector machine turns out, can be viewed as trying to approximate this Dialogue: 0,0:34:44.72,0:34:46.21,Default,,0000,0000,0000,,step function too, Dialogue: 0,0:34:46.21,0:34:48.29,Default,,0000,0000,0000,,with a different Dialogue: 0,0:34:48.29,0:34:50.93,Default,,0000,0000,0000,,approximation that's linear, and then Dialogue: 0,0:34:50.93,0:34:53.59,Default,,0000,0000,0000,,that sort of [inaudible] linear that Dialogue: 0,0:34:53.59,0:34:55.71,Default,,0000,0000,0000,,goes flat [inaudible] there, and then it goes up as a Dialogue: 0,0:34:55.71,0:34:58.15,Default,,0000,0000,0000,,linear function there. And that Dialogue: 0,0:34:58.15,0:35:00.17,Default,,0000,0000,0000,,is called the hinge loss. And so you Dialogue: 0,0:35:00.17,0:35:03.80,Default,,0000,0000,0000,,can think of logistic regression and the support vector machine as Dialogue: 0,0:35:03.80,0:35:07.26,Default,,0000,0000,0000,,different approximations to try to minimize this Dialogue: 0,0:35:07.26,0:35:10.86,Default,,0000,0000,0000,,step function; Dialogue: 0,0:35:10.86,0:35:11.94,Default,,0000,0000,0000,,okay? Dialogue: 0,0:35:11.94,0:35:13.38,Default,,0000,0000,0000,,And that's why I guess, Dialogue: 0,0:35:13.38,0:35:16.11,Default,,0000,0000,0000,,all the theory we developed Dialogue: 0,0:35:16.11,0:35:19.24,Default,,0000,0000,0000,,even though SVM's and logistic regression aren't exactly doing Dialogue: 0,0:35:19.24,0:35:21.74,Default,,0000,0000,0000,,empirical risk minimization, Dialogue: 0,0:35:21.74,0:35:24.79,Default,,0000,0000,0000,,the theory we develop often gives the Dialogue: 0,0:35:24.79,0:35:31.79,Default,,0000,0000,0000,,completely appropriate intuitions for SVM's, and logistic regression; okay. Dialogue: 0,0:35:32.88,0:35:35.78,Default,,0000,0000,0000,,So that was the last of the loose ends. Dialogue: 0,0:35:35.78,0:35:39.32,Default,,0000,0000,0000,,And if you didn't get this, don't worry too much about it. It's a high-level message. It's Dialogue: 0,0:35:39.32,0:35:40.31,Default,,0000,0000,0000,,just that Dialogue: 0,0:35:40.31,0:35:44.48,Default,,0000,0000,0000,,SVM's and logistic regression are reasonable to think of as approximations to Dialogue: 0,0:35:44.48,0:35:47.77,Default,,0000,0000,0000,,empirical risk minimization algorithms. Dialogue: 0,0:35:47.77,0:35:51.61,Default,,0000,0000,0000,,What I want to do next is move on to talk about model selection. Before I do that, let me just Dialogue: 0,0:35:51.61,0:35:58.61,Default,,0000,0000,0000,,check for questions about Dialogue: 0,0:36:49.19,0:36:52.87,Default,,0000,0000,0000,,this. Okay. Cool. Okay.
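As a rough, hypothetical illustration of the picture described above (not code from the lecture), here is a minimal numpy sketch comparing the 0-1 step function with the logistic and hinge surrogates, written in terms of a margin z with labels taken in {-1, +1} for convenience:

```python
import numpy as np

# Margin z = y * theta^T x with labels in {-1, +1}; the lecture's 0/1-label
# picture gives the same shapes up to a shift and reflection.
z = np.linspace(-3.0, 3.0, 13)

zero_one = (z <= 0).astype(float)   # step function: 1 if the example is misclassified
logistic = np.log1p(np.exp(-z))     # negative log likelihood term for logistic regression
hinge = np.maximum(0.0, 1.0 - z)    # hinge loss used by the support vector machine

for zi, a, b, c in zip(z, zero_one, logistic, hinge):
    print(f"z={zi:5.2f}  0-1={a:.0f}  logistic={b:.3f}  hinge={c:.3f}")
```

Both surrogates are convex and roughly track the step function, which is why minimizing them is a tractable stand-in for minimizing training error.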
So in the theory that we started to develop in the previous lecture, and that Dialogue: 0,0:36:52.87,0:36:53.33,Default,,0000,0000,0000,,we Dialogue: 0,0:36:53.33,0:36:55.05,Default,,0000,0000,0000,,sort of wrapped up with a Dialogue: 0,0:36:55.05,0:36:56.80,Default,,0000,0000,0000,,discussion on VC dimension, Dialogue: 0,0:36:56.80,0:37:01.02,Default,,0000,0000,0000,,we saw that there's often a trade-off between bias and variance. And in Dialogue: 0,0:37:01.02,0:37:02.49,Default,,0000,0000,0000,,particular, so Dialogue: 0,0:37:02.49,0:37:06.76,Default,,0000,0000,0000,,it is important not to choose a hypothesis that's either too simple or too Dialogue: 0,0:37:06.76,0:37:07.85,Default,,0000,0000,0000,,complex. So Dialogue: 0,0:37:07.85,0:37:11.25,Default,,0000,0000,0000,,if your data has sort of a quadratic structure to it, Dialogue: 0,0:37:11.25,0:37:13.71,Default,,0000,0000,0000,,then if you choose a linear Dialogue: 0,0:37:13.71,0:37:17.89,Default,,0000,0000,0000,,function to try to approximate it, then you would under fit. So you have a Dialogue: 0,0:37:17.89,0:37:19.94,Default,,0000,0000,0000,,hypothesis with high bias. Dialogue: 0,0:37:19.94,0:37:24.18,Default,,0000,0000,0000,,And conversely, we choose a hypothesis that's too complex, and you have high variance. Dialogue: 0,0:37:24.18,0:37:26.13,Default,,0000,0000,0000,,And you'll also Dialogue: 0,0:37:26.13,0:37:31.90,Default,,0000,0000,0000,,fail to fit. Then you would over fit the data, and you'd also fail to generalize well. Dialogue: 0,0:37:31.90,0:37:35.65,Default,,0000,0000,0000,,So model selection algorithms Dialogue: 0,0:37:35.65,0:37:39.63,Default,,0000,0000,0000,,provide a class of methods to automatically trade make these tradeoffs Dialogue: 0,0:37:39.63,0:37:41.62,Default,,0000,0000,0000,,between bias Dialogue: 0,0:37:41.62,0:37:43.07,Default,,0000,0000,0000,,and variance; right? So remember the Dialogue: 0,0:37:43.07,0:37:44.99,Default,,0000,0000,0000,,cartoon I drew last time Dialogue: 0,0:37:44.99,0:37:51.99,Default,,0000,0000,0000,,of generalization error? Dialogue: 0,0:37:54.31,0:37:57.41,Default,,0000,0000,0000,,I drew this last time. Where on the x-axis Dialogue: 0,0:37:57.41,0:38:03.32,Default,,0000,0000,0000,,was model complexity, meaning the number of Dialogue: 0,0:38:03.32,0:38:07.60,Default,,0000,0000,0000,,the degree of the polynomial; the [inaudible] regression function Dialogue: 0,0:38:07.60,0:38:08.72,Default,,0000,0000,0000,,or whatever. Dialogue: 0,0:38:08.72,0:38:12.47,Default,,0000,0000,0000,,And if you have too simple a model, you have high generalization error, those under Dialogue: 0,0:38:12.47,0:38:13.53,Default,,0000,0000,0000,,fitting. Dialogue: 0,0:38:13.53,0:38:15.63,Default,,0000,0000,0000,,And you if have too complex a model, Dialogue: 0,0:38:15.63,0:38:19.73,Default,,0000,0000,0000,,like 15 or 14-degree polynomial to five data points, then you also have high Dialogue: 0,0:38:19.73,0:38:24.07,Default,,0000,0000,0000,,generalization error, and you're over fitting. Dialogue: 0,0:38:24.07,0:38:28.56,Default,,0000,0000,0000,,So what I wanna do now is actually just talk about model selection in the abstract; Dialogue: 0,0:38:28.56,0:38:30.02,Default,,0000,0000,0000,,all right? 
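To make that cartoon concrete, here is a small made-up numpy example (the data, noise level, and degrees are invented for illustration, not taken from the lecture): it fits polynomials of increasing degree and prints the training error next to the error on points held aside from the fit, which stands in for generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with quadratic structure plus noise (made-up numbers, purely illustrative).
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.3 * rng.standard_normal(x.size)

fit_x, fit_y = x[:30], y[:30]   # points used to fit each polynomial
new_x, new_y = x[30:], y[30:]   # points not used for fitting, standing in for new examples

for degree in (1, 2, 5, 12):
    coeffs = np.polyfit(fit_x, fit_y, degree)
    fit_err = np.mean((np.polyval(coeffs, fit_x) - fit_y) ** 2)
    new_err = np.mean((np.polyval(coeffs, new_x) - new_y) ** 2)
    print(f"degree {degree:2d}: training MSE {fit_err:.3f}, MSE on new points {new_err:.3f}")
```

Degree 1 underfits (high error everywhere), while a very high degree drives the training error down but the error on new points back up.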
Dialogue: 0,0:38:30.02,0:38:32.78,Default,,0000,0000,0000,,Some examples of model selection problems will include Dialogue: 0,0:38:32.78,0:38:34.37,Default,,0000,0000,0000,,well, I'll run the example Dialogue: 0,0:38:34.37,0:38:36.14,Default,,0000,0000,0000,,of Dialogue: 0,0:38:36.14,0:38:36.79,Default,,0000,0000,0000,,let's say you're Dialogue: 0,0:38:36.79,0:38:43.79,Default,,0000,0000,0000,,trying to choose the degree of a polynomial; Dialogue: 0,0:38:45.16,0:38:48.91,Default,,0000,0000,0000,,right? What degree polynomial do you want to choose? Dialogue: 0,0:38:48.91,0:38:51.90,Default,,0000,0000,0000,,Another example of a model selection problem would be if you're trying to choose Dialogue: 0,0:38:51.90,0:38:53.87,Default,,0000,0000,0000,,the parameter [inaudible], Dialogue: 0,0:38:53.87,0:38:59.84,Default,,0000,0000,0000,,which was the bandwidth parameter Dialogue: 0,0:38:59.84,0:39:04.14,Default,,0000,0000,0000,,in locally weighted linear regression Dialogue: 0,0:39:04.14,0:39:07.71,Default,,0000,0000,0000,,or in some sort of locally weighted Dialogue: 0,0:39:07.71,0:39:08.44,Default,,0000,0000,0000,,regression. Dialogue: 0,0:39:08.44,0:39:10.48,Default,,0000,0000,0000,,Yet, another model selection problem Dialogue: 0,0:39:10.48,0:39:13.94,Default,,0000,0000,0000,,is if you're trying to choose the parameter c [inaudible] in the Dialogue: 0,0:39:13.94,0:39:16.05,Default,,0000,0000,0000,,[inaudible]; Dialogue: 0,0:39:16.05,0:39:21.68,Default,,0000,0000,0000,,right? And so in the l1 norm soft margin SVM, Dialogue: 0,0:39:21.68,0:39:28.54,Default,,0000,0000,0000,,we had this optimization objective; right? Dialogue: 0,0:39:28.54,0:39:31.04,Default,,0000,0000,0000,,And the parameter c Dialogue: 0,0:39:31.04,0:39:32.97,Default,,0000,0000,0000,,controls the tradeoff between Dialogue: 0,0:39:32.97,0:39:34.02,Default,,0000,0000,0000,,how much you want Dialogue: 0,0:39:34.02,0:39:38.24,Default,,0000,0000,0000,,to have a large margin versus how much you want to penalize Dialogue: 0,0:39:38.24,0:39:40.50,Default,,0000,0000,0000,, Dialogue: 0,0:39:40.50,0:39:44.28,Default,,0000,0000,0000,,misclassifying examples. So these are three specific examples of model selection Dialogue: 0,0:39:44.28,0:39:45.34,Default,,0000,0000,0000,,problems. Dialogue: 0,0:39:45.34,0:39:47.57,Default,,0000,0000,0000,,And Dialogue: 0,0:39:47.57,0:39:54.57,Default,,0000,0000,0000,,let's come up with a method for automatically choosing them. Dialogue: 0,0:40:13.68,0:40:17.85,Default,,0000,0000,0000,,Let's say you have some finite set of models, and let's write these as m1, m2, m3, Dialogue: 0,0:40:17.85,0:40:20.13,Default,,0000,0000,0000,,and Dialogue: 0,0:40:20.13,0:40:25.22,Default,,0000,0000,0000,,so on. For example, this may be the linear Dialogue: 0,0:40:25.22,0:40:29.66,Default,,0000,0000,0000,,classifier or this may be the quadratic classifier, Dialogue: 0,0:40:29.66,0:40:30.41,Default,,0000,0000,0000,,and so on; Dialogue: 0,0:40:30.41,0:40:32.17,Default,,0000,0000,0000,,okay? Or this Dialogue: 0,0:40:32.17,0:40:33.70,Default,,0000,0000,0000,,may also be Dialogue: 0,0:40:33.70,0:40:37.24,Default,,0000,0000,0000,,you may also take the bandwidth parameter [inaudible] Dialogue: 0,0:40:37.24,0:40:41.13,Default,,0000,0000,0000,,and discretize it into a range of values, and you're trying to choose from among this Dialogue: 0,0:40:41.13,0:40:43.47,Default,,0000,0000,0000,,discrete set of values.
Dialogue: 0,0:40:43.47,0:40:48.24,Default,,0000,0000,0000,,So let's talk about how you would select an appropriate model; all right? Well, Dialogue: 0,0:40:48.24,0:40:52.03,Default,,0000,0000,0000,,one thing you could do is you can pick all of these models, and train them on Dialogue: 0,0:40:52.03,0:40:53.87,Default,,0000,0000,0000,,your training set. Dialogue: 0,0:40:53.87,0:40:56.99,Default,,0000,0000,0000,,And then see which model has the lowest training Dialogue: 0,0:40:56.99,0:41:03.99,Default,,0000,0000,0000,,error. So that's a terrible idea, and why's that? Dialogue: 0,0:41:05.45,0:41:11.10,Default,,0000,0000,0000,,Right. Cool. Because of the over fit; right. And so some of you are laughing that Dialogue: 0,0:41:11.10,0:41:13.94,Default,,0000,0000,0000,,I asked that. So that'd be a terrible idea to choose a model Dialogue: 0,0:41:13.94,0:41:17.02,Default,,0000,0000,0000,,by looking at your training set because well, obviously, you end up choosing the most Dialogue: 0,0:41:17.02,0:41:18.84,Default,,0000,0000,0000,,complex model; right? And you Dialogue: 0,0:41:18.84,0:41:25.55,Default,,0000,0000,0000,,choose a 10th degree polynomial because that's what fits the training set. So we Dialogue: 0,0:41:25.55,0:41:28.57,Default,,0000,0000,0000,,come to model selection on a training set, there are Dialogue: 0,0:41:28.57,0:41:31.94,Default,,0000,0000,0000,,several standard procedures to do this. One is hold out cross Dialogue: 0,0:41:31.94,0:41:38.87,Default,,0000,0000,0000,,validation, Dialogue: 0,0:41:38.87,0:41:41.41,Default,,0000,0000,0000,,and in hold out cross validation, Dialogue: 0,0:41:41.41,0:41:45.51,Default,,0000,0000,0000,,we take a training set. And we randomly split Dialogue: 0,0:41:45.51,0:41:50.53,Default,,0000,0000,0000,,the training set into two subsets. We Dialogue: 0,0:41:50.53,0:41:51.49,Default,,0000,0000,0000,,call it Dialogue: 0,0:41:51.49,0:41:53.59,Default,,0000,0000,0000,,subset take all the data you have Dialogue: 0,0:41:53.59,0:42:00.59,Default,,0000,0000,0000,,and randomly split it into two subsets. And we'll call it the training set, and the Dialogue: 0,0:42:00.80,0:42:05.16,Default,,0000,0000,0000,,hold out cross validation subset. Dialogue: 0,0:42:05.16,0:42:07.12,Default,,0000,0000,0000,,And then, Dialogue: 0,0:42:07.12,0:42:11.99,Default,,0000,0000,0000,,you know, you train each model Dialogue: 0,0:42:11.99,0:42:18.74,Default,,0000,0000,0000,,on just the training subset of it, and test it Dialogue: 0,0:42:18.74,0:42:22.07,Default,,0000,0000,0000,,on your hold out cross validation set. Dialogue: 0,0:42:22.07,0:42:25.29,Default,,0000,0000,0000,,And you pick the model Dialogue: 0,0:42:25.29,0:42:32.29,Default,,0000,0000,0000,,with the lowest error Dialogue: 0,0:42:32.95,0:42:34.89,Default,,0000,0000,0000,,on the hold out cross validation Dialogue: 0,0:42:34.89,0:42:36.96,Default,,0000,0000,0000,,subset; okay? So this is sort of a Dialogue: 0,0:42:36.96,0:42:39.80,Default,,0000,0000,0000,,relatively straightforward procedure, and it's commonly used Dialogue: 0,0:42:39.80,0:42:42.04,Default,,0000,0000,0000,,where you train on 70 percent of the data. Dialogue: 0,0:42:42.04,0:42:44.51,Default,,0000,0000,0000,,Then test all of your models on the 30 percent, and you can take Dialogue: 0,0:42:44.51,0:42:45.90,Default,,0000,0000,0000,,whatever has the smallest Dialogue: 0,0:42:45.90,0:42:51.24,Default,,0000,0000,0000,,hold out cross validation error. Dialogue: 0,0:42:51.24,0:42:54.83,Default,,0000,0000,0000,,And after this you actually have a choice.
You can actually Dialogue: 0,0:42:54.83,0:42:58.59,Default,,0000,0000,0000,,having taken all of these hypothesis trained on 70 percent of the data, Dialogue: 0,0:42:58.59,0:43:02.65,Default,,0000,0000,0000,,you can actually just output the hypothesis that has the lowest error on your hold out Dialogue: 0,0:43:02.65,0:43:04.47,Default,,0000,0000,0000,,cross validation set. Dialogue: 0,0:43:04.47,0:43:08.62,Default,,0000,0000,0000,,And optionally, you can actually take the model that you selected Dialogue: 0,0:43:08.62,0:43:13.08,Default,,0000,0000,0000,,and go back, and retrain it on all 100 percent of the data; Dialogue: 0,0:43:13.08,0:43:15.53,Default,,0000,0000,0000,,okay? So both versions are actually done and used really often. You can Dialogue: 0,0:43:15.53,0:43:16.41,Default,,0000,0000,0000,,either, Dialogue: 0,0:43:16.41,0:43:20.04,Default,,0000,0000,0000,,you know, just take the best hypothesis that was trained on 70 percent of the data, Dialogue: 0,0:43:20.04,0:43:22.42,Default,,0000,0000,0000,,and just output that as you find the hypothesis Dialogue: 0,0:43:22.42,0:43:23.96,Default,,0000,0000,0000,,or you can use this to Dialogue: 0,0:43:23.96,0:43:27.29,Default,,0000,0000,0000,,say, having chosen the degree of the polynomial you want to fit, you can then go Dialogue: 0,0:43:27.29,0:43:31.33,Default,,0000,0000,0000,,back and retrain the model on the entire 100 percent of your data. Dialogue: 0,0:43:31.33,0:43:38.33,Default,,0000,0000,0000,,And both of these are commonly done. Dialogue: 0,0:43:39.75,0:43:46.75,Default,,0000,0000,0000,, Dialogue: 0,0:43:50.49,0:43:54.46,Default,,0000,0000,0000,,How about a cross validation does sort of work straight? Dialogue: 0,0:43:54.46,0:43:55.76,Default,,0000,0000,0000,,And Dialogue: 0,0:43:55.76,0:43:58.25,Default,,0000,0000,0000,,sometimes we're working Dialogue: 0,0:43:58.25,0:43:58.95,Default,,0000,0000,0000,, Dialogue: 0,0:43:58.95,0:44:03.67,Default,,0000,0000,0000,,with a company or application or something. The Dialogue: 0,0:44:03.67,0:44:05.93,Default,,0000,0000,0000,,many machine-learning applications we have Dialogue: 0,0:44:05.93,0:44:07.58,Default,,0000,0000,0000,,very little data or where, you Dialogue: 0,0:44:07.58,0:44:10.10,Default,,0000,0000,0000,,know, every training example you have was Dialogue: 0,0:44:10.10,0:44:14.94,Default,,0000,0000,0000,,painfully acquired at great cost; right? Sometimes your data is Dialogue: 0,0:44:14.94,0:44:16.04,Default,,0000,0000,0000,,acquired by Dialogue: 0,0:44:16.04,0:44:18.78,Default,,0000,0000,0000,,medical experiments, and each of these each Dialogue: 0,0:44:18.78,0:44:23.83,Default,,0000,0000,0000,,training example represents a sick man in amounts of physical human pain or something. So Dialogue: 0,0:44:23.83,0:44:27.28,Default,,0000,0000,0000,,we talk and say, well, I'm going to hold out 30 percent of your data set, just to select Dialogue: 0,0:44:27.28,0:44:28.11,Default,,0000,0000,0000,,my model. Dialogue: 0,0:44:28.11,0:44:32.81,Default,,0000,0000,0000,,If people were who sometimes that causes unhappiness, and so maybe you wanna Dialogue: 0,0:44:32.81,0:44:37.10,Default,,0000,0000,0000,,use not have to leave out 30 percent of your data just to do model Dialogue: 0,0:44:37.10,0:44:38.72,Default,,0000,0000,0000,,selection. Dialogue: 0,0:44:38.72,0:44:43.40,Default,,0000,0000,0000,,So there are a couple of other variations on hold out cross validation that makes Dialogue: 0,0:44:43.40,0:44:47.81,Default,,0000,0000,0000,,sometimes, slightly more efficient use of the data. 
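Before moving on, here is a minimal sketch of the hold-out procedure just described, assuming each candidate model exposes hypothetical fit(X, y) and error(X, y) methods (placeholder names, not an interface from the lecture):

```python
import numpy as np

def holdout_select(models, X, y, train_frac=0.7, seed=0):
    """Hold-out cross validation sketch. `models` is a list of candidate models,
    each assumed to have hypothetical fit(X, y) and error(X, y) methods."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, hold = idx[:n_train], idx[n_train:]

    hold_errors = []
    for model in models:
        model.fit(X[train], y[train])                      # train on ~70 percent of the data
        hold_errors.append(model.error(X[hold], y[hold]))  # test on the held-out ~30 percent

    best = models[int(np.argmin(hold_errors))]
    best.fit(X, y)   # optionally retrain the selected model on 100 percent of the data
    return best
```

The final fit(X, y) call corresponds to the optional retraining step described above; you could equally return the hypothesis that was trained on only the 70 percent.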
Dialogue: 0,0:44:47.81,0:44:51.91,Default,,0000,0000,0000,,And one is called k-fold cross validation. Dialogue: 0,0:44:51.91,0:44:54.19,Default,,0000,0000,0000,,And here's the idea: I'm gonna Dialogue: 0,0:44:54.19,0:44:55.95,Default,,0000,0000,0000,,take all of Dialogue: 0,0:44:55.95,0:44:56.82,Default,,0000,0000,0000,,my data s; Dialogue: 0,0:44:56.82,0:45:00.62,Default,,0000,0000,0000,,so imagine, I'm gonna draw this Dialogue: 0,0:45:00.62,0:45:04.49,Default,,0000,0000,0000,,box s, as to note the entirety of all the data I have. Dialogue: 0,0:45:04.49,0:45:07.49,Default,,0000,0000,0000,,And I'll then divide it Dialogue: 0,0:45:07.49,0:45:12.100,Default,,0000,0000,0000,,into k pieces, and this is five pieces in what I've drawn. Dialogue: 0,0:45:12.100,0:45:16.81,Default,,0000,0000,0000,,Then what'll I'll do is I will repeatedly Dialogue: 0,0:45:16.81,0:45:23.41,Default,,0000,0000,0000,,train on k minus one pieces. Dialogue: 0,0:45:23.41,0:45:27.67,Default,,0000,0000,0000,,Test on the remaining one Dialogue: 0,0:45:27.67,0:45:30.65,Default,,0000,0000,0000,,test on the remaining piece, I guess; Dialogue: 0,0:45:30.65,0:45:35.70,Default,,0000,0000,0000,,right? And then you average Dialogue: 0,0:45:35.70,0:45:41.37,Default,,0000,0000,0000,,over the k result. Dialogue: 0,0:45:41.37,0:45:43.81,Default,,0000,0000,0000,,So another way, we'll just hold out I will Dialogue: 0,0:45:43.81,0:45:48.33,Default,,0000,0000,0000,,hold Dialogue: 0,0:45:48.33,0:45:50.62,Default,,0000,0000,0000,,out say, just 1/5 of my data Dialogue: 0,0:45:50.62,0:45:54.65,Default,,0000,0000,0000,,and I'll train on the remaining 4/5, and I'll test on the first one. Dialogue: 0,0:45:54.65,0:45:58.40,Default,,0000,0000,0000,,And then I'll then go and hold out the second 1/5 from my [inaudible] for the Dialogue: 0,0:45:58.40,0:46:03.10,Default,,0000,0000,0000,,remaining pieces test on this, you remove the third piece, Dialogue: 0,0:46:03.10,0:46:05.91,Default,,0000,0000,0000,,train on the 4/5; I'm gonna do this five times. Dialogue: 0,0:46:05.91,0:46:09.09,Default,,0000,0000,0000,,And then I'll take the five error measures I Dialogue: 0,0:46:09.09,0:46:10.57,Default,,0000,0000,0000,,have and I'll Dialogue: 0,0:46:10.57,0:46:16.37,Default,,0000,0000,0000,,average them. And this then gives me an estimate of the generalization error of my model; okay? Dialogue: 0,0:46:16.37,0:46:18.09,Default,,0000,0000,0000,,And then, again, when you do Dialogue: 0,0:46:18.09,0:46:19.93,Default,,0000,0000,0000,,k-fold cross validation, Dialogue: 0,0:46:19.93,0:46:21.83,Default,,0000,0000,0000,,usually you then go back and Dialogue: 0,0:46:21.83,0:46:24.03,Default,,0000,0000,0000,,retrain the model you selected Dialogue: 0,0:46:24.03,0:46:27.67,Default,,0000,0000,0000,,on the entirety of your training set. Dialogue: 0,0:46:27.67,0:46:33.93,Default,,0000,0000,0000,,So I drew five pieces here because that was easier for me to draw, but k equals 10 is Dialogue: 0,0:46:33.93,0:46:39.34,Default,,0000,0000,0000,,very Dialogue: 0,0:46:39.34,0:46:41.71,Default,,0000,0000,0000,,common; okay? I should say Dialogue: 0,0:46:41.71,0:46:47.51,Default,,0000,0000,0000,,k equals 10 is the fairly common choice to do 10 fold cross validation. Dialogue: 0,0:46:47.51,0:46:52.30,Default,,0000,0000,0000,,And the advantage of the over hold out cross option is that you switch the data into ten pieces. 
Dialogue: 0,0:46:52.30,0:46:54.40,Default,,0000,0000,0000,,Then each time you're only holding out Dialogue: 0,0:46:54.40,0:46:56.38,Default,,0000,0000,0000,,1/10 of your data, rather than, Dialogue: 0,0:46:56.38,0:46:59.12,Default,,0000,0000,0000,,you know, say, 30 percent of your data. I Dialogue: 0,0:46:59.12,0:47:03.31,Default,,0000,0000,0000,,must say, in standard hold out in simple hold out cross validation, Dialogue: 0,0:47:03.31,0:47:05.62,Default,,0000,0000,0000,,a 30 70 split is fairly common. Dialogue: 0,0:47:05.62,0:47:07.85,Default,,0000,0000,0000,,Sometimes like 2/3 1/3 or Dialogue: 0,0:47:07.85,0:47:09.92,Default,,0000,0000,0000,,a 70 30 split is fairly common. Dialogue: 0,0:47:09.92,0:47:14.30,Default,,0000,0000,0000,,And if you use k-fold cross validation, k equals 5 or more commonly k equals 10, and is the most Dialogue: 0,0:47:14.30,0:47:16.40,Default,,0000,0000,0000,,common choice. Dialogue: 0,0:47:16.40,0:47:19.85,Default,,0000,0000,0000,,The disadvantage of k-fold cross validation is that it can be much more Dialogue: 0,0:47:19.85,0:47:21.43,Default,,0000,0000,0000,,computationally expensive. Dialogue: 0,0:47:21.43,0:47:22.100,Default,,0000,0000,0000,,In particular, Dialogue: 0,0:47:22.100,0:47:26.16,Default,,0000,0000,0000,,to validate your model, you now need to train your model ten times, Dialogue: 0,0:47:26.16,0:47:29.09,Default,,0000,0000,0000,,instead of just once. And so you need Dialogue: 0,0:47:29.09,0:47:33.14,Default,,0000,0000,0000,,to: from logistic regression, ten times per model, rather than just once. And so this is Dialogue: 0,0:47:33.14,0:47:34.57,Default,,0000,0000,0000,,computationally more expensive. Dialogue: 0,0:47:34.57,0:47:38.33,Default,,0000,0000,0000,,But k equals ten works great. Dialogue: 0,0:47:38.33,0:47:40.25,Default,,0000,0000,0000,,And then, Dialogue: 0,0:47:40.25,0:47:44.27,Default,,0000,0000,0000,,finally, in Dialogue: 0,0:47:44.27,0:47:47.77,Default,,0000,0000,0000,,there's actually a version of this that you can take even further, Dialogue: 0,0:47:47.77,0:47:50.06,Default,,0000,0000,0000,,which is when your set k Dialogue: 0,0:47:50.06,0:47:51.66,Default,,0000,0000,0000,,equals m. Dialogue: 0,0:47:51.66,0:47:56.69,Default,,0000,0000,0000,,And so that's when you take your training set, and you split it into as many pieces as you have training Dialogue: 0,0:47:56.69,0:47:59.19,Default,,0000,0000,0000,,examples. Dialogue: 0,0:47:59.19,0:48:06.19,Default,,0000,0000,0000,,And this procedure is called leave one out cross validation. Dialogue: 0,0:48:06.43,0:48:07.48,Default,,0000,0000,0000,,And Dialogue: 0,0:48:07.48,0:48:09.07,Default,,0000,0000,0000,,what you do is you then Dialogue: 0,0:48:09.07,0:48:12.48,Default,,0000,0000,0000,,take out the first training example, train on the rest, and test on the first Dialogue: 0,0:48:12.48,0:48:13.36,Default,,0000,0000,0000,,example. Dialogue: 0,0:48:13.36,0:48:16.48,Default,,0000,0000,0000,,Then you take out the second training example, Dialogue: 0,0:48:16.48,0:48:18.56,Default,,0000,0000,0000,,train on the rest, and test on the second example. Then Dialogue: 0,0:48:18.56,0:48:22.18,Default,,0000,0000,0000,,you take out the third example, train on everything, but your third example. Test on Dialogue: 0,0:48:22.18,0:48:24.08,Default,,0000,0000,0000,,the third example, and so on. 
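Here is a corresponding sketch of k-fold cross validation under the same hypothetical fit/error interface as in the hold-out sketch; setting k equal to the number of training examples gives the leave-one-out variant just described.

```python
import numpy as np

def kfold_error(model, X, y, k=10, seed=0):
    """k-fold cross validation sketch; k = len(y) gives leave-one-out."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)

    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                 # train on the other k - 1 pieces
        errors.append(model.error(X[test], y[test]))  # test on the held-out piece
    return float(np.mean(errors))                     # average the k error measures
```

You would compute this estimate for each candidate model, pick the model with the lowest average error, and then, as described above, usually retrain that model on the entire training set.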
Dialogue: 0,0:48:24.08,0:48:25.41,Default,,0000,0000,0000,,And so Dialogue: 0,0:48:25.41,0:48:27.90,Default,,0000,0000,0000,,with this many pieces you are now making, Dialogue: 0,0:48:27.90,0:48:31.17,Default,,0000,0000,0000,,maybe even more effective use of your data than k-fold cross Dialogue: 0,0:48:31.17,0:48:32.46,Default,,0000,0000,0000,,validation. But Dialogue: 0,0:48:32.46,0:48:36.27,Default,,0000,0000,0000,,you could leave leave one out cross validation is computationally very expensive Dialogue: 0,0:48:36.27,0:48:37.46,Default,,0000,0000,0000,,because now Dialogue: 0,0:48:37.46,0:48:41.97,Default,,0000,0000,0000,,you need to repeatedly leave one example out, and then run your learning Dialogue: 0,0:48:41.97,0:48:46.46,Default,,0000,0000,0000,,algorithm on m minus one training examples. You need to do this a lot of times, and so Dialogue: 0,0:48:46.46,0:48:49.12,Default,,0000,0000,0000,,this is computationally very expensive. Dialogue: 0,0:48:49.12,0:48:54.83,Default,,0000,0000,0000,,And typically, this is done only when you're extremely data scarce. So if you Dialogue: 0,0:48:54.83,0:48:58.78,Default,,0000,0000,0000,,have a learning problem where you have, say, 15 training examples or something, Dialogue: 0,0:48:58.78,0:49:01.11,Default,,0000,0000,0000,,then if you have very few training examples, leave one Dialogue: 0,0:49:01.11,0:49:03.63,Default,,0000,0000,0000,,out cross validation Dialogue: 0,0:49:03.63,0:49:04.46,Default,,0000,0000,0000,,is Dialogue: 0,0:49:04.46,0:49:11.46,Default,,0000,0000,0000,,maybe preferred. Dialogue: 0,0:49:15.71,0:49:18.37,Default,,0000,0000,0000,,Yeah? Dialogue: 0,0:49:18.37,0:49:22.89,Default,,0000,0000,0000,,Student:You know, that time you proved that the difference between the generalized [inaudible] by Dialogue: 0,0:49:22.89,0:49:26.99,Default,,0000,0000,0000,,number of examples in your training Dialogue: 0,0:49:26.99,0:49:28.09,Default,,0000,0000,0000,,set and VC Dialogue: 0,0:49:28.09,0:49:30.61,Default,,0000,0000,0000,,dimension. So Dialogue: 0,0:49:30.61,0:49:32.68,Default,,0000,0000,0000,,maybe [inaudible] examples into Dialogue: 0,0:49:32.68,0:49:34.60,Default,,0000,0000,0000,,different groups, we can use that for [inaudible]. Instructor (Andrew Ng):Yeah, I Dialogue: 0,0:49:34.60,0:49:36.83,Default,,0000,0000,0000,,mean Student:- compute the training error, and use that for computing [inaudible] for a generalized error. Instructor (Andrew Ng):Yeah, that's done, but Dialogue: 0,0:49:36.83,0:49:40.10,Default,,0000,0000,0000,,yeah, in practice, I personally tend not to do that. It Dialogue: 0,0:49:40.10,0:49:42.73,Default,,0000,0000,0000,,tends not to be the VC Dialogue: 0,0:49:42.73,0:49:46.53,Default,,0000,0000,0000,,dimension bounds are somewhat loose bounds. And so Dialogue: 0,0:49:46.53,0:49:47.96,Default,,0000,0000,0000,,there are people Dialogue: 0,0:49:47.96,0:49:51.06,Default,,0000,0000,0000,,in structure risk minimization that propose what you do, but I personally tend not to do Dialogue: 0,0:49:51.06,0:49:54.54,Default,,0000,0000,0000,,that, Dialogue: 0,0:50:09.46,0:50:16.46,Default,,0000,0000,0000,,though. Questions for cross validation? Dialogue: 0,0:50:33.52,0:50:35.67,Default,,0000,0000,0000,,Yeah. This Dialogue: 0,0:50:35.67,0:50:38.90,Default,,0000,0000,0000,,is kind of far from there because when we spend all this time [inaudible] but how many data points do you sort of need to go into your certain marginal [inaudible]? Dialogue: 0,0:50:38.90,0:50:39.94,Default,,0000,0000,0000,,Right. 
Dialogue: 0,0:50:39.94,0:50:42.24,Default,,0000,0000,0000,,So it seems like when I'd be Dialogue: 0,0:50:42.24,0:50:44.13,Default,,0000,0000,0000,,able to use that Dialogue: 0,0:50:44.13,0:50:46.63,Default,,0000,0000,0000,, Dialogue: 0,0:50:46.63,0:50:47.77,Default,,0000,0000,0000,,instead Dialogue: 0,0:50:47.77,0:50:49.98,Default,,0000,0000,0000,,of do this; more analytically, I guess. I mean Yeah. [Inaudible]. Instructor Dialogue: 0,0:50:49.98,0:50:53.35,Default,,0000,0000,0000,,No okay. So it turns out that when you're proving learning theory bounds, very often Dialogue: 0,0:50:53.35,0:50:58.11,Default,,0000,0000,0000,,the bounds will be extremely loose because you're sort of proving the worse case upper bound that Dialogue: 0,0:50:58.11,0:50:59.16,Default,,0000,0000,0000,,holds true Dialogue: 0,0:50:59.16,0:51:03.30,Default,,0000,0000,0000,,even for very bad what is Dialogue: 0,0:51:03.30,0:51:06.43,Default,,0000,0000,0000,,it Dialogue: 0,0:51:06.43,0:51:10.48,Default,,0000,0000,0000,,so the bounds that I proved just now; right? That holds true for absolutely any Dialogue: 0,0:51:10.48,0:51:12.21,Default,,0000,0000,0000,,probability distribution over training Dialogue: 0,0:51:12.21,0:51:13.48,Default,,0000,0000,0000,,examples; right? Dialogue: 0,0:51:13.48,0:51:17.07,Default,,0000,0000,0000,,So just assume the training examples we've drawn, iid Dialogue: 0,0:51:17.07,0:51:19.03,Default,,0000,0000,0000,,from some distribution script d, Dialogue: 0,0:51:19.03,0:51:23.21,Default,,0000,0000,0000,,and the bounds I proved hold true for absolutely any probability Dialogue: 0,0:51:23.21,0:51:27.13,Default,,0000,0000,0000,,distribution over script Dialogue: 0,0:51:27.13,0:51:28.94,Default,,0000,0000,0000,,d. And chances are Dialogue: 0,0:51:28.94,0:51:33.05,Default,,0000,0000,0000,,whatever real life distribution you get over, you know, houses and their Dialogue: 0,0:51:33.05,0:51:36.14,Default,,0000,0000,0000,,prices or whatever, is probably not as bad as Dialogue: 0,0:51:36.14,0:51:39.12,Default,,0000,0000,0000,,the very worse one you Dialogue: 0,0:51:39.12,0:51:41.25,Default,,0000,0000,0000,,could've gotten; okay? Dialogue: 0,0:51:41.25,0:51:42.19,Default,,0000,0000,0000,,And so it turns out that if you Dialogue: 0,0:51:42.19,0:51:46.30,Default,,0000,0000,0000,,actually plug in the constants of learning theory bounds, you often get Dialogue: 0,0:51:46.30,0:51:50.84,Default,,0000,0000,0000,,extremely large numbers. Take logistic regression logistic Dialogue: 0,0:51:50.84,0:51:53.23,Default,,0000,0000,0000,,regression you have ten parameters Dialogue: 0,0:51:53.23,0:51:56.97,Default,,0000,0000,0000,,and 0.01 error, and with 95 percent probability. How many training Dialogue: 0,0:51:56.97,0:51:58.24,Default,,0000,0000,0000,,examples do I Dialogue: 0,0:51:58.24,0:52:02.11,Default,,0000,0000,0000,,need? If you actually plug in actual constants into the text for learning theory bounds, Dialogue: 0,0:52:02.11,0:52:06.33,Default,,0000,0000,0000,,you often get extremely pessimistic estimates with the number of examples you need. You end Dialogue: 0,0:52:06.33,0:52:07.53,Default,,0000,0000,0000,,up with Dialogue: 0,0:52:07.53,0:52:11.75,Default,,0000,0000,0000,,some ridiculously large numbers. You would need 10,000 training examples to fit Dialogue: 0,0:52:11.75,0:52:14.35,Default,,0000,0000,0000,,ten parameters. 
Dialogue: 0,0:52:14.35,0:52:17.49,Default,,0000,0000,0000,,So Dialogue: 0,0:52:17.49,0:52:20.32,Default,,0000,0000,0000,,a good way to think of these learning theory bounds is Dialogue: 0,0:52:20.32,0:52:24.60,Default,,0000,0000,0000,,and this is why, also, when I write papers on learning theory bounds, I Dialogue: 0,0:52:24.60,0:52:26.39,Default,,0000,0000,0000,,quite often Dialogue: 0,0:52:26.39,0:52:30.16,Default,,0000,0000,0000,,use big-O notation to just absolutely just ignore the constant factors because Dialogue: 0,0:52:30.16,0:52:33.34,Default,,0000,0000,0000,,the bounds seem to be very loose. Dialogue: 0,0:52:33.34,0:52:38.67,Default,,0000,0000,0000,,There are some attempts to use these bounds to give guidelines as to Dialogue: 0,0:52:38.67,0:52:40.02,Default,,0000,0000,0000,,what model Dialogue: 0,0:52:40.02,0:52:41.56,Default,,0000,0000,0000,,to choose, and so on. Dialogue: 0,0:52:41.56,0:52:46.59,Default,,0000,0000,0000,,But I personally tend to use the bounds again, Dialogue: 0,0:52:46.59,0:52:47.49,Default,,0000,0000,0000,,intuition Dialogue: 0,0:52:47.49,0:52:49.24,Default,,0000,0000,0000,,about Dialogue: 0,0:52:49.24,0:52:53.09,Default,,0000,0000,0000,,for example, what are the number of training examples you need Dialogue: 0,0:52:53.09,0:52:56.09,Default,,0000,0000,0000,,gross linearly in the number of parameters or what are your gross x dimension in number of parameters; whether it goes Dialogue: 0,0:52:56.09,0:52:59.67,Default,,0000,0000,0000,,quadratic parameters? Dialogue: 0,0:52:59.67,0:53:02.40,Default,,0000,0000,0000,,So it's quite often the shape of the bounds. The fact that Dialogue: 0,0:53:02.40,0:53:06.75,Default,,0000,0000,0000,,the number of training examples the fact that some complexity is linear in the VC dimension, Dialogue: 0,0:53:06.75,0:53:09.45,Default,,0000,0000,0000,,that's sort of a useful intuition you can get from Dialogue: 0,0:53:09.45,0:53:11.33,Default,,0000,0000,0000,,these theories. But the Dialogue: 0,0:53:11.33,0:53:14.44,Default,,0000,0000,0000,,actual magnitude of the bound will tend to be much looser than Dialogue: 0,0:53:14.44,0:53:18.01,Default,,0000,0000,0000,,will hold true for a particular problem you are working Dialogue: 0,0:53:18.01,0:53:25.01,Default,,0000,0000,0000,,on. So did that Dialogue: 0,0:53:25.84,0:53:28.41,Default,,0000,0000,0000,,answer your question? Student:Uh-huh. Instructor (Andrew Ng):Yeah. And it turns out, by the Dialogue: 0,0:53:28.41,0:53:31.51,Default,,0000,0000,0000,,way, for myself, a rule of thumb that I often use is if Dialogue: 0,0:53:31.51,0:53:34.97,Default,,0000,0000,0000,,you're trying to fit a logistic regression model, Dialogue: 0,0:53:34.97,0:53:36.18,Default,,0000,0000,0000,,if you have Dialogue: 0,0:53:36.18,0:53:38.75,Default,,0000,0000,0000,,n parameters or n plus one parameters; Dialogue: 0,0:53:38.75,0:53:42.43,Default,,0000,0000,0000,,if the number of training examples is ten times your number of Dialogue: 0,0:53:42.43,0:53:44.71,Default,,0000,0000,0000,,parameters, then you're probably in good shape. Dialogue: 0,0:53:44.71,0:53:49.29,Default,,0000,0000,0000,,And if your number of training examples is like tiny times the number of parameters, then Dialogue: 0,0:53:49.29,0:53:52.30,Default,,0000,0000,0000,,you're probably perfectly fine fitting that model. Dialogue: 0,0:53:52.30,0:53:54.88,Default,,0000,0000,0000,,So those are the sorts of intuitions Dialogue: 0,0:53:54.88,0:54:01.88,Default,,0000,0000,0000,,that you can get from these bounds. 
Student:In cross validation do Dialogue: 0,0:54:02.94,0:54:06.74,Default,,0000,0000,0000,,we assume these examples randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the train Dialogue: 0,0:54:06.74,0:54:13.74,Default,,0000,0000,0000,,testers randomly. Dialogue: 0,0:54:22.99,0:54:26.84,Default,,0000,0000,0000,,One more thing I want to talk about for model selection is there's actually a special case of model selections, Dialogue: 0,0:54:26.84,0:54:33.84,Default,,0000,0000,0000,,called the feature selection problem. Dialogue: 0,0:54:37.81,0:54:41.34,Default,,0000,0000,0000,,And so here's the intuition: Dialogue: 0,0:54:41.34,0:54:45.48,Default,,0000,0000,0000,,for many machine-learning problems you may have a very high dimensional Dialogue: 0,0:54:45.48,0:54:48.12,Default,,0000,0000,0000,,feature space with very high dimensional Dialogue: 0,0:54:48.12,0:54:51.64,Default,,0000,0000,0000,,you have x's [inaudible] feature x's. Dialogue: 0,0:54:51.64,0:54:56.29,Default,,0000,0000,0000,,So for example, for text classification and I wanna talk about this text classification example that Dialogue: 0,0:54:56.29,0:54:58.30,Default,,0000,0000,0000,,spam versus Dialogue: 0,0:54:58.30,0:55:00.43,Default,,0000,0000,0000,,non-spam. You may easily have on Dialogue: 0,0:55:00.43,0:55:04.13,Default,,0000,0000,0000,,the order of 30,000 or 50,000 features. I think I used 50,000 in Dialogue: 0,0:55:04.13,0:55:07.15,Default,,0000,0000,0000,,my early examples. So if you have Dialogue: 0,0:55:07.15,0:55:10.66,Default,,0000,0000,0000,,so many features you have 50,000 features, Dialogue: 0,0:55:10.66,0:55:13.76,Default,,0000,0000,0000,,depending on what learning algorithm you use, there may be a real Dialogue: 0,0:55:13.76,0:55:15.53,Default,,0000,0000,0000,,risk of over fitting. Dialogue: 0,0:55:15.53,0:55:17.22,Default,,0000,0000,0000,,And so Dialogue: 0,0:55:17.22,0:55:20.08,Default,,0000,0000,0000,,if you can reduce the number of features, maybe Dialogue: 0,0:55:20.08,0:55:23.18,Default,,0000,0000,0000,,you can reduce the variance of your learning algorithm, and reduce the risk of Dialogue: 0,0:55:23.18,0:55:25.31,Default,,0000,0000,0000,,over fitting. And for Dialogue: 0,0:55:25.31,0:55:28.26,Default,,0000,0000,0000,,the specific case of text classification, if Dialogue: 0,0:55:28.26,0:55:31.92,Default,,0000,0000,0000,,you imagine that maybe there's a small number of relevant features, so Dialogue: 0,0:55:31.92,0:55:33.68,Default,,0000,0000,0000,,there are all these English words. Dialogue: 0,0:55:33.68,0:55:36.97,Default,,0000,0000,0000,,And many of these English words probably don't tell you anything at all about Dialogue: 0,0:55:36.97,0:55:39.88,Default,,0000,0000,0000,,whether the email is spam or non-spam. If it were, you Dialogue: 0,0:55:39.88,0:55:43.99,Default,,0000,0000,0000,,know, English function words like, the, of, a, and; Dialogue: 0,0:55:43.99,0:55:48.05,Default,,0000,0000,0000,,these are probably words that don't tell you anything about whether the email is spam or non-spam. So Dialogue: 0,0:55:48.05,0:55:50.57,Default,,0000,0000,0000,,words in contrast will be a much smaller number of Dialogue: 0,0:55:50.57,0:55:54.32,Default,,0000,0000,0000,,features that are truly relevant to the learning problem. Dialogue: 0,0:55:54.32,0:55:57.37,Default,,0000,0000,0000,,So for example, you see the word buy or Viagra, those are words that Dialogue: 0,0:55:57.37,0:56:01.67,Default,,0000,0000,0000,,are very useful. So you words, some you spam and non-spam. 
You see the word Dialogue: 0,0:56:01.67,0:56:02.83,Default,,0000,0000,0000,,Stanford or Dialogue: 0,0:56:02.83,0:56:06.27,Default,,0000,0000,0000,,machine learning or your own personal name. These are other words that are useful for Dialogue: 0,0:56:06.27,0:56:10.53,Default,,0000,0000,0000,,telling you whether something is spam or non-spam. So Dialogue: 0,0:56:10.53,0:56:15.66,Default,,0000,0000,0000,,in feature selection, we would like to select a subset of the features Dialogue: 0,0:56:15.66,0:56:19.52,Default,,0000,0000,0000,,that are maybe or hopefully the most relevant ones for a specific learning Dialogue: 0,0:56:19.52,0:56:20.37,Default,,0000,0000,0000,,problem, so Dialogue: 0,0:56:20.37,0:56:25.01,Default,,0000,0000,0000,,as to give ourselves a simpler learning a simpler hypothesis class to choose Dialogue: 0,0:56:25.01,0:56:28.50,Default,,0000,0000,0000,,from. And then therefore, reduce the risk of over fitting. Dialogue: 0,0:56:28.50,0:56:29.70,Default,,0000,0000,0000,,Even when we Dialogue: 0,0:56:29.70,0:56:36.62,Default,,0000,0000,0000,,may have had 50,000 features originally. So Dialogue: 0,0:56:36.62,0:56:41.21,Default,,0000,0000,0000,,how do you do this? Well, if you Dialogue: 0,0:56:41.21,0:56:44.01,Default,,0000,0000,0000,,have n Dialogue: 0,0:56:44.01,0:56:47.33,Default,,0000,0000,0000,,features, then there are Dialogue: 0,0:56:47.33,0:56:49.53,Default,,0000,0000,0000,,two to the n possible subsets; Dialogue: 0,0:56:49.53,0:56:54.43,Default,,0000,0000,0000,, Dialogue: 0,0:56:54.43,0:56:55.90,Default,,0000,0000,0000,,right? Because, you know, Dialogue: 0,0:56:55.90,0:56:59.88,Default,,0000,0000,0000,,each of your n features can either be included or excluded. So there are two to the n Dialogue: 0,0:56:59.88,0:57:00.91,Default,,0000,0000,0000,,possibilities. Dialogue: 0,0:57:00.91,0:57:04.20,Default,,0000,0000,0000,,And this is a huge space. Dialogue: 0,0:57:04.20,0:57:08.86,Default,,0000,0000,0000,,So in feature selection, what we most commonly do is use various search heuristics Dialogue: 0,0:57:08.86,0:57:09.51,Default,,0000,0000,0000,,sort of Dialogue: 0,0:57:09.51,0:57:10.80,Default,,0000,0000,0000,,simple search algorithms Dialogue: 0,0:57:10.80,0:57:15.33,Default,,0000,0000,0000,,to try to search through this space of two to the n possible subsets of features; Dialogue: 0,0:57:15.33,0:57:16.71,Default,,0000,0000,0000,,to try to find a good Dialogue: 0,0:57:16.71,0:57:18.29,Default,,0000,0000,0000,,subset of features. This is Dialogue: 0,0:57:18.29,0:57:22.57,Default,,0000,0000,0000,,too large a number to enumerate all possible feature subsets. Dialogue: 0,0:57:22.57,0:57:25.52,Default,,0000,0000,0000,,And as a concrete example, Dialogue: 0,0:57:25.52,0:57:26.76,Default,,0000,0000,0000,, Dialogue: 0,0:57:26.76,0:57:31.16,Default,,0000,0000,0000,,this is the forward search algorithm; it's also called the forward selection algorithm. Dialogue: 0,0:57:31.16,0:57:32.69,Default,,0000,0000,0000,, Dialogue: 0,0:57:32.69,0:57:34.85,Default,,0000,0000,0000,,It's actually pretty simple, but I'll just Dialogue: 0,0:57:34.85,0:57:36.08,Default,,0000,0000,0000,,write it out.
Dialogue: 0,0:57:36.08,0:57:42.06,Default,,0000,0000,0000,,My writing it out will make it look more complicated than it really Dialogue: 0,0:57:42.06,0:57:44.65,Default,,0000,0000,0000,,is, but Dialogue: 0,0:57:44.65,0:57:46.04,Default,,0000,0000,0000,,it starts with Dialogue: 0,0:57:46.04,0:57:48.15,Default,,0000,0000,0000,,initialize the sets script f Dialogue: 0,0:57:48.15,0:57:51.65,Default,,0000,0000,0000,,to be the empty set, Dialogue: 0,0:57:51.65,0:57:53.24,Default,,0000,0000,0000,,and then Dialogue: 0,0:57:53.24,0:57:55.02,Default,,0000,0000,0000,,repeat Dialogue: 0,0:57:55.02,0:57:56.62,Default,,0000,0000,0000,,for Dialogue: 0,0:57:56.62,0:57:58.86,Default,,0000,0000,0000,,i equals one Dialogue: 0,0:57:58.86,0:58:00.44,Default,,0000,0000,0000,,to n; Dialogue: 0,0:58:04.85,0:58:08.24,Default,,0000,0000,0000,,try adding Dialogue: 0,0:58:08.24,0:58:11.46,Default,,0000,0000,0000,,feature i Dialogue: 0,0:58:11.46,0:58:13.85,Default,,0000,0000,0000,,to the set scripts Dialogue: 0,0:58:13.85,0:58:16.04,Default,,0000,0000,0000,,f, and evaluate the Dialogue: 0,0:58:16.04,0:58:19.74,Default,,0000,0000,0000,,model Dialogue: 0,0:58:19.74,0:58:25.72,Default,,0000,0000,0000,,using cross validation. Dialogue: 0,0:58:25.72,0:58:30.40,Default,,0000,0000,0000,,And by cross validation, I mean any of the three flavors, be it simple hold out cross Dialogue: 0,0:58:30.40,0:58:34.63,Default,,0000,0000,0000,,validation or k-fold cross validation or leave one out cross Dialogue: 0,0:58:34.63,0:58:35.74,Default,,0000,0000,0000,,validation. Dialogue: 0,0:58:35.74,0:58:39.95,Default,,0000,0000,0000,,And then, you know, set f to be Dialogue: 0,0:58:39.95,0:58:41.24,Default,,0000,0000,0000,,equal to Dialogue: 0,0:58:41.24,0:58:43.25,Default,,0000,0000,0000,,f union, I guess. Dialogue: 0,0:58:43.25,0:58:48.58,Default,,0000,0000,0000,,And then the best feature Dialogue: 0,0:58:48.58,0:58:55.58,Default,,0000,0000,0000,,found is f 1, I guess; okay? Dialogue: 0,0:58:57.62,0:59:00.06,Default,,0000,0000,0000,,And finally, you would Dialogue: 0,0:59:00.06,0:59:06.92,Default,,0000,0000,0000,,okay? So Dialogue: 0,0:59:06.92,0:59:13.92,Default,,0000,0000,0000,, Dialogue: 0,0:59:16.70,0:59:20.32,Default,,0000,0000,0000,,forward selections, procedure is: follow through the empty set of features. Dialogue: 0,0:59:20.32,0:59:22.32,Default,,0000,0000,0000,,And then on each Dialogue: 0,0:59:22.32,0:59:25.69,Default,,0000,0000,0000,,generation, take each of Dialogue: 0,0:59:25.69,0:59:28.83,Default,,0000,0000,0000,,your features that isn't already in your set script f and you try adding that Dialogue: 0,0:59:28.83,0:59:30.45,Default,,0000,0000,0000,,feature to your set. Then Dialogue: 0,0:59:30.45,0:59:33.87,Default,,0000,0000,0000,,you train them all, though, and evaluate them all, though, using Dialogue: 0,0:59:33.87,0:59:35.04,Default,,0000,0000,0000,,cross validation. Dialogue: 0,0:59:35.04,0:59:38.72,Default,,0000,0000,0000,,And basically, figure out what is the best single feature to add to your Dialogue: 0,0:59:38.72,0:59:40.72,Default,,0000,0000,0000,,set Dialogue: 0,0:59:40.72,0:59:43.71,Default,,0000,0000,0000,,script f. In step two here, you go ahead and add Dialogue: 0,0:59:43.71,0:59:46.51,Default,,0000,0000,0000,,that feature to your set script f, and you get it right. Dialogue: 0,0:59:46.51,0:59:50.56,Default,,0000,0000,0000,,And when I say best feature or best model here by best, I really mean the Dialogue: 0,0:59:50.56,0:59:53.67,Default,,0000,0000,0000,,best model according to hold out cross validation. 
Dialogue: 0,0:59:53.67,0:59:56.05,Default,,0000,0000,0000,,By best, I really mean Dialogue: 0,0:59:56.05,1:00:00.64,Default,,0000,0000,0000,,the single feature addition that results in the lowest hold out Dialogue: 0,1:00:00.64,1:00:01.51,Default,,0000,0000,0000,,cross validation error or the Dialogue: 0,1:00:01.51,1:00:03.21,Default,,0000,0000,0000,,lowest cross validation error. Dialogue: 0,1:00:03.21,1:00:04.50,Default,,0000,0000,0000,,So you do this Dialogue: 0,1:00:04.50,1:00:07.96,Default,,0000,0000,0000,,adding one feature at a time. Dialogue: 0,1:00:07.96,1:00:12.02,Default,,0000,0000,0000,,When you terminate this a little bit, as if you've Dialogue: 0,1:00:12.02,1:00:15.08,Default,,0000,0000,0000,,added all the features to f, so f is now Dialogue: 0,1:00:15.08,1:00:17.92,Default,,0000,0000,0000,,the entire set of features; you can terminate this. Dialogue: 0,1:00:17.92,1:00:19.31,Default,,0000,0000,0000,,Or if Dialogue: 0,1:00:19.31,1:00:22.04,Default,,0000,0000,0000,,by some rule of thumb, you know that you Dialogue: 0,1:00:22.04,1:00:23.77,Default,,0000,0000,0000,,probably don't ever want more than Dialogue: 0,1:00:23.77,1:00:26.73,Default,,0000,0000,0000,,k features, you can also terminate Dialogue: 0,1:00:26.73,1:00:30.24,Default,,0000,0000,0000,,this if f is already exceeded some threshold number of features. So maybe Dialogue: 0,1:00:30.24,1:00:30.98,Default,,0000,0000,0000,,if Dialogue: 0,1:00:30.98,1:00:34.14,Default,,0000,0000,0000,,you have 100 training examples, and you're fitting logistic regression, Dialogue: 0,1:00:34.14,1:00:39.36,Default,,0000,0000,0000,,you probably know you won't want more than 100 features. And so Dialogue: 0,1:00:39.36,1:00:41.23,Default,,0000,0000,0000,,you stop after you have Dialogue: 0,1:00:41.23,1:00:45.71,Default,,0000,0000,0000,,100 features added to set f; okay? Dialogue: 0,1:00:45.71,1:00:50.03,Default,,0000,0000,0000,,And then finally, having done this, output of best hypothesis found; again, by Dialogue: 0,1:00:50.03,1:00:51.67,Default,,0000,0000,0000,,best, I mean, Dialogue: 0,1:00:51.67,1:00:55.46,Default,,0000,0000,0000,,when learning this algorithm, you'd be seeing lots of hypothesis. You'd be training lots of Dialogue: 0,1:00:55.46,1:00:58.42,Default,,0000,0000,0000,,hypothesis, and testing them using cross validation. Dialogue: 0,1:00:58.42,1:01:01.66,Default,,0000,0000,0000,,So when I say output best hypothesis found, I mean Dialogue: 0,1:01:01.66,1:01:04.65,Default,,0000,0000,0000,,of all of the hypothesis you've seen during this entire procedure, Dialogue: 0,1:01:04.65,1:01:07.15,Default,,0000,0000,0000,,pick the one with the lowest Dialogue: 0,1:01:07.15,1:01:10.31,Default,,0000,0000,0000,,cross validation error that you saw; okay? Dialogue: 0,1:01:10.31,1:01:17.31,Default,,0000,0000,0000,,So that's forward selection. So Dialogue: 0,1:01:35.24,1:01:39.60,Default,,0000,0000,0000,,let's see, just to give this a name, this is an incidence of what's called Dialogue: 0,1:01:39.60,1:01:43.19,Default,,0000,0000,0000,, Dialogue: 0,1:01:43.19,1:01:46.44,Default,,0000,0000,0000,,wrapper feature Dialogue: 0,1:01:46.44,1:01:47.88,Default,,0000,0000,0000,,selection. 
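A compact sketch of the forward search loop described above; cv_error is a hypothetical callback (not something defined in the lecture) that trains the model using only the given feature subset and returns its cross validation error, whether hold-out, k-fold, or leave-one-out:

```python
def forward_select(n_features, cv_error, max_features=None):
    """Forward search sketch over subsets of {0, ..., n_features - 1}."""
    selected = set()                               # F := empty set
    best_subset, best_err = set(), float("inf")
    limit = max_features if max_features is not None else n_features

    while len(selected) < limit:
        remaining = [i for i in range(n_features) if i not in selected]
        if not remaining:
            break
        # try adding each feature not already in F, and keep the single best addition
        errs = {i: cv_error(selected | {i}) for i in remaining}
        best_i = min(errs, key=errs.get)
        selected = selected | {best_i}             # F := F union {best feature}
        if errs[best_i] < best_err:
            best_subset, best_err = set(selected), errs[best_i]

    # output the best feature subset (and hence hypothesis) seen over the whole search
    return best_subset, best_err
```

The max_features argument plays the role of the rule-of-thumb stopping threshold mentioned above, for example never keeping more features than you have training examples.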
Dialogue: 0,1:01:47.88,1:01:50.58,Default,,0000,0000,0000,,And the term wrapper comes Dialogue: 0,1:01:50.58,1:01:52.05,Default,,0000,0000,0000,,from the fact that Dialogue: 0,1:01:52.05,1:01:55.49,Default,,0000,0000,0000,,this feature selection algorithm that I just described is a forward selection or forward Dialogue: 0,1:01:55.49,1:01:56.82,Default,,0000,0000,0000,,search. Dialogue: 0,1:01:56.82,1:01:58.68,Default,,0000,0000,0000,,It's a piece of software that Dialogue: 0,1:01:58.68,1:02:02.15,Default,,0000,0000,0000,,you write that wraps around your learning algorithm. In the sense that Dialogue: 0,1:02:02.15,1:02:04.25,Default,,0000,0000,0000,,to perform forward selection, Dialogue: 0,1:02:04.25,1:02:07.58,Default,,0000,0000,0000,,you need to repeatedly make cause to your learning algorithm Dialogue: 0,1:02:07.58,1:02:08.83,Default,,0000,0000,0000,,to train Dialogue: 0,1:02:08.83,1:02:11.79,Default,,0000,0000,0000,,your model, Dialogue: 0,1:02:11.79,1:02:15.29,Default,,0000,0000,0000,,using different subsets of features; okay? So this is called wrapper model Dialogue: 0,1:02:15.29,1:02:17.32,Default,,0000,0000,0000,,feature selection. Dialogue: 0,1:02:17.32,1:02:20.51,Default,,0000,0000,0000,,And it tends to be somewhat computationally expensive because as you're Dialogue: 0,1:02:20.51,1:02:22.53,Default,,0000,0000,0000,,performing the search process, Dialogue: 0,1:02:22.53,1:02:26.03,Default,,0000,0000,0000,,you're repeatedly training your learning algorithm over and over and over on all of Dialogue: 0,1:02:26.03,1:02:30.33,Default,,0000,0000,0000,,these different subsets of features. Let's Dialogue: 0,1:02:30.33,1:02:37.33,Default,,0000,0000,0000,,just mention also, there is a variation of this called backward search or Dialogue: 0,1:02:38.90,1:02:40.83,Default,,0000,0000,0000,,backward selection, Dialogue: 0,1:02:40.83,1:02:43.71,Default,,0000,0000,0000,,which is where you start with Dialogue: 0,1:02:43.71,1:02:46.99,Default,,0000,0000,0000,,f equals Dialogue: 0,1:02:46.99,1:02:53.99,Default,,0000,0000,0000,,the entire set of features, and you delete features one at a time; Dialogue: 0,1:02:59.47,1:03:01.37,Default,,0000,0000,0000,,okay? Dialogue: 0,1:03:01.37,1:03:03.98,Default,,0000,0000,0000,, Dialogue: 0,1:03:03.98,1:03:06.76,Default,,0000,0000,0000,,So that's backward search or backward selection. Dialogue: 0,1:03:06.76,1:03:08.08,Default,,0000,0000,0000,,And Dialogue: 0,1:03:08.08,1:03:13.37,Default,,0000,0000,0000,,this is another feature selection algorithm that you might use. Part of Dialogue: 0,1:03:13.37,1:03:15.40,Default,,0000,0000,0000,,whether this makes sense is Dialogue: 0,1:03:15.40,1:03:17.41,Default,,0000,0000,0000,,really there will be problems where Dialogue: 0,1:03:17.41,1:03:21.32,Default,,0000,0000,0000,,it really doesn't even make sense to initialize f to be the set of all features. 
Dialogue: 0,1:03:21.32,1:03:23.86,Default,,0000,0000,0000,,So if you have 100 training examples Dialogue: 0,1:03:23.86,1:03:25.57,Default,,0000,0000,0000,,and 10,000 features, Dialogue: 0,1:03:25.57,1:03:28.83,Default,,0000,0000,0000,,which may well happen Dialogue: 0,1:03:28.83,1:03:30.76,Default,,0000,0000,0000,,100 emails and 10,000 training Dialogue: 0,1:03:30.76,1:03:35.29,Default,,0000,0000,0000,,10,000 features in email, then 100 training examples then Dialogue: 0,1:03:35.29,1:03:38.28,Default,,0000,0000,0000,,depending on the learning algorithm you're using, it may or may not make sense to Dialogue: 0,1:03:38.28,1:03:38.99,Default,,0000,0000,0000,,initialize Dialogue: 0,1:03:38.99,1:03:42.79,Default,,0000,0000,0000,,the set f to be all features, and train the model using all features. And if it Dialogue: 0,1:03:42.79,1:03:46.13,Default,,0000,0000,0000,,doesn't make sense to train the model using all the features, then Dialogue: 0,1:03:46.13,1:03:51.53,Default,,0000,0000,0000,,forward selection would be more common. Dialogue: 0,1:03:51.53,1:03:53.61,Default,,0000,0000,0000,,So Dialogue: 0,1:03:53.61,1:03:56.96,Default,,0000,0000,0000,,let's see. Wrapper model feature selection algorithms tend to work Dialogue: 0,1:03:56.96,1:03:58.46,Default,,0000,0000,0000,,well. Dialogue: 0,1:03:58.46,1:04:02.18,Default,,0000,0000,0000,,And in particular, they actually often work better than a different class of algorithms I'm gonna talk Dialogue: 0,1:04:02.18,1:04:05.79,Default,,0000,0000,0000,,about now. But their main disadvantage is that they're computationally very Dialogue: 0,1:04:05.79,1:04:08.71,Default,,0000,0000,0000,,expensive. Dialogue: 0,1:04:08.71,1:04:15.71,Default,,0000,0000,0000,,Do you have any questions about this Dialogue: 0,1:04:16.29,1:04:19.11,Default,,0000,0000,0000,,before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes, Dialogue: 0,1:04:19.11,1:04:23.46,Default,,0000,0000,0000,,you're actually right. So forward search and backward search, both of these are search heuristics, Dialogue: 0,1:04:23.46,1:04:24.36,Default,,0000,0000,0000,,and Dialogue: 0,1:04:24.36,1:04:28.09,Default,,0000,0000,0000,,you cannot but for either of these you cannot guarantee they'll find the best Dialogue: 0,1:04:28.09,1:04:29.12,Default,,0000,0000,0000,,subset of features. It Dialogue: 0,1:04:29.12,1:04:30.73,Default,,0000,0000,0000,,actually turns out that Dialogue: 0,1:04:30.73,1:04:34.100,Default,,0000,0000,0000,,under many formalizations of the feature selection problem it actually turns out to be an NP- Dialogue: 0,1:04:34.100,1:04:40.03,Default,,0000,0000,0000,,hard problem, to find the best subset of features. Dialogue: 0,1:04:40.03,1:04:43.100,Default,,0000,0000,0000,,But in practice, forward selection and backward selection work fine, Dialogue: 0,1:04:43.100,1:04:47.17,Default,,0000,0000,0000,,and you can also envision other search algorithms where you sort of Dialogue: 0,1:04:47.17,1:04:50.49,Default,,0000,0000,0000,,have other methods to search through the space of two to the n possible feature Dialogue: 0,1:04:50.49,1:04:57.49,Default,,0000,0000,0000,,subsets. So Dialogue: 0,1:05:01.08,1:05:03.70,Default,,0000,0000,0000,,let's see. Dialogue: 0,1:05:03.70,1:05:06.34,Default,,0000,0000,0000,,Wrapper feature selection Dialogue: 0,1:05:06.34,1:05:09.83,Default,,0000,0000,0000,,tends to work well when you can afford to do it Dialogue: 0,1:05:09.83,1:05:11.54,Default,,0000,0000,0000,,computationally.
Dialogue: 0,1:05:11.54,1:05:13.80,Default,,0000,0000,0000,,But for problems Dialogue: 0,1:05:13.80,1:05:18.28,Default,,0000,0000,0000,,such as text classification it turns out for text classification specifically Dialogue: 0,1:05:18.28,1:05:21.95,Default,,0000,0000,0000,,because you have so many features, and easily have 50,000 features. Dialogue: 0,1:05:21.95,1:05:25.79,Default,,0000,0000,0000,,Forward selection would be very, very expensive. Dialogue: 0,1:05:25.79,1:05:28.24,Default,,0000,0000,0000,,So there's a different class of algorithms that Dialogue: 0,1:05:28.24,1:05:31.19,Default,,0000,0000,0000,,I'll give you that Dialogue: 0,1:05:31.19,1:05:35.32,Default,,0000,0000,0000,,tends not to do as well in the sense of generalization error. So you tend to learn the Dialogue: 0,1:05:35.32,1:05:37.33,Default,,0000,0000,0000,,hypothesis that works less well, Dialogue: 0,1:05:37.33,1:05:40.29,Default,,0000,0000,0000,,but is computationally much less expensive. Dialogue: 0,1:05:40.29,1:05:44.19,Default,,0000,0000,0000,,And these are called Dialogue: 0,1:05:44.19,1:05:46.67,Default,,0000,0000,0000,,the Dialogue: 0,1:05:46.67,1:05:50.28,Default,,0000,0000,0000,,filter feature selection methods. Dialogue: 0,1:05:50.28,1:05:53.12,Default,,0000,0000,0000,,And the basic idea is Dialogue: 0,1:05:53.12,1:05:53.68,Default,,0000,0000,0000,, Dialogue: 0,1:05:53.68,1:06:00.54,Default,,0000,0000,0000,,that for each feature i, we'll Dialogue: 0,1:06:00.54,1:06:03.94,Default,,0000,0000,0000,,compute Dialogue: 0,1:06:03.94,1:06:10.54,Default,,0000,0000,0000,,some measure Dialogue: 0,1:06:10.54,1:06:17.22,Default,,0000,0000,0000,,of how informative Dialogue: 0,1:06:17.22,1:06:19.91,Default,,0000,0000,0000,,xi Dialogue: 0,1:06:19.91,1:06:25.48,Default,,0000,0000,0000,,is about y; okay? Dialogue: 0,1:06:25.48,1:06:28.78,Default,,0000,0000,0000,,And to do this, we'll use some simple Dialogue: 0,1:06:28.78,1:06:33.49,Default,,0000,0000,0000,,heuristics; for every feature we'll just try to compute some rough estimate or Dialogue: 0,1:06:33.49,1:06:40.49,Default,,0000,0000,0000,,compute Dialogue: 0,1:06:41.28,1:06:48.28,Default,,0000,0000,0000,,some measure of how informative Dialogue: 0,1:06:50.18,1:06:54.49,Default,,0000,0000,0000,,xi is about Dialogue: 0,1:06:54.49,1:06:59.14,Default,,0000,0000,0000,,y. So there are many ways you can do this. One way you can choose is to just compute the Dialogue: 0,1:06:59.14,1:07:01.28,Default,,0000,0000,0000,,correlation between xi and y. And just for each of Dialogue: 0,1:07:01.28,1:07:04.100,Default,,0000,0000,0000,,your features just see how correlated this is with Dialogue: 0,1:07:04.100,1:07:06.72,Default,,0000,0000,0000,,your class label y. Dialogue: 0,1:07:06.72,1:07:12.81,Default,,0000,0000,0000,,And then you just pick the top k most correlated features. Dialogue: 0,1:07:12.81,1:07:19.81,Default,,0000,0000,0000,,Another way Dialogue: 0,1:07:46.43,1:07:51.55,Default,,0000,0000,0000,,to do this Dialogue: 0,1:07:51.55,1:07:54.93,Default,,0000,0000,0000,,for the case of text classification, there's one other method, which especially Dialogue: 0,1:07:54.93,1:07:56.78,Default,,0000,0000,0000,,for this k features I guess Dialogue: 0,1:07:56.78,1:07:58.75,Default,,0000,0000,0000,,there's one other Dialogue: 0,1:07:58.75,1:08:00.71,Default,,0000,0000,0000,,informative measure Dialogue: 0,1:08:00.71,1:08:04.27,Default,,0000,0000,0000,,that's used very commonly, which is called mutual information.
I'm going to let you Dialogue: 0,1:08:04.27,1:08:11.23,Default,,0000,0000,0000,,play with Dialogue: 0,1:08:11.23,1:08:15.40,Default,,0000,0000,0000,,some of these ideas in the problem sets, but I'll Dialogue: 0,1:08:15.40,1:08:17.90,Default,,0000,0000,0000,,just say this very briefly. Dialogue: 0,1:08:17.90,1:08:21.60,Default,,0000,0000,0000,,So the mutual information between feature xi and y, Dialogue: 0,1:08:21.60,1:08:27.67,Default,,0000,0000,0000,,I'll just write out the definition, I guess. Dialogue: 0,1:08:27.67,1:08:32.23,Default,,0000,0000,0000,,Let's say this is text classification, so xi can take on two values, 0 and 1; Dialogue: 0,1:08:32.23,1:08:36.18,Default,,0000,0000,0000,,the mutual information between xi and y is defined as a sum over all possible values of Dialogue: 0,1:08:36.18,1:08:37.36,Default,,0000,0000,0000,,xi, Dialogue: 0,1:08:37.36,1:08:40.93,Default,,0000,0000,0000,,and a sum over all possible values of Dialogue: 0,1:08:40.93,1:08:45.61,Default,,0000,0000,0000,,y, of the joint probability p(xi, y) Dialogue: 0,1:08:45.61,1:08:52.05,Default,,0000,0000,0000,,times log of p(xi, y) Dialogue: 0,1:08:52.05,1:08:53.09,Default,,0000,0000,0000,,over p(xi) times p(y). Dialogue: 0,1:08:53.09,1:08:54.38,Default,,0000,0000,0000,,Where Dialogue: 0,1:08:59.57,1:09:04.04,Default,,0000,0000,0000,,all of these distributions, so the joint distribution over xi and y, Dialogue: 0,1:09:04.04,1:09:11.04,Default,,0000,0000,0000,,you would estimate from your training data; Dialogue: 0,1:09:12.56,1:09:14.22,Default,,0000,0000,0000,,all of these things you would estimate, as well. You would estimate Dialogue: 0,1:09:14.22,1:09:15.26,Default,,0000,0000,0000,,from Dialogue: 0,1:09:15.26,1:09:20.16,Default,,0000,0000,0000,,the training data what is the probability that xi is 0, what's the probability that xi is 1, what's the Dialogue: 0,1:09:20.16,1:09:27.16,Default,,0000,0000,0000,,probability that xi is 0 and y is 0, that xi is 1 and y is 0, and so on. So it Dialogue: 0,1:09:27.85,1:09:31.19,Default,,0000,0000,0000,,turns out Dialogue: 0,1:09:31.19,1:09:35.00,Default,,0000,0000,0000,,there's a standard information theoretic measure of how different Dialogue: 0,1:09:35.00,1:09:36.75,Default,,0000,0000,0000,,probability distributions are. Dialogue: 0,1:09:36.75,1:09:39.99,Default,,0000,0000,0000,,And I'm not gonna prove this here. But it turns out that Dialogue: 0,1:09:39.99,1:09:45.61,Default,,0000,0000,0000,,this mutual information is Dialogue: 0,1:09:45.61,1:09:51.75,Default,,0000,0000,0000,,actually Dialogue: 0,1:09:51.75,1:09:54.31,Default,,0000,0000,0000,,the K-L divergence between p(xi, y) and p(xi) times p(y); so the standard Dialogue: 0,1:09:54.31,1:09:58.10,Default,,0000,0000,0000,,measure of how different two distributions are is called the K-L divergence. Dialogue: 0,1:09:58.10,1:10:00.36,Default,,0000,0000,0000,,If you've taken a class in information theory, Dialogue: 0,1:10:00.36,1:10:03.42,Default,,0000,0000,0000,,you'll have seen the concepts of mutual information and the K-L divergence, but if you Dialogue: 0,1:10:03.42,1:10:06.02,Default,,0000,0000,0000,,haven't, don't worry about it. Just Dialogue: 0,1:10:06.02,1:10:09.24,Default,,0000,0000,0000,,the intuition is there's something called the K-L divergence that's a formal measure Dialogue: 0,1:10:09.24,1:10:12.95,Default,,0000,0000,0000,,of how different two probability distributions Dialogue: 0,1:10:12.95,1:10:16.61,Default,,0000,0000,0000,,are.
And mutual information is a measure of how different Dialogue: 0,1:10:16.61,1:10:19.83,Default,,0000,0000,0000,,the joint distribution is Dialogue: 0,1:10:19.83,1:10:21.57,Default,,0000,0000,0000,,of x and y Dialogue: 0,1:10:21.57,1:10:23.40,Default,,0000,0000,0000,,from the distribution Dialogue: 0,1:10:23.40,1:10:26.16,Default,,0000,0000,0000,,you would get if you were to assume they were independent; Dialogue: 0,1:10:26.16,1:10:27.21,Default,,0000,0000,0000,,okay? So Dialogue: 0,1:10:27.21,1:10:29.45,Default,,0000,0000,0000,,if x and y were independent, then Dialogue: 0,1:10:29.45,1:10:33.98,Default,,0000,0000,0000,,p of x, y would be equal to p of x times p of y. Dialogue: 0,1:10:33.98,1:10:37.34,Default,,0000,0000,0000,,And so, you know, these two distributions would be identical, and the Dialogue: 0,1:10:37.34,1:10:40.37,Default,,0000,0000,0000,,K-L divergence would be 0. Dialogue: 0,1:10:40.37,1:10:43.58,Default,,0000,0000,0000,,In contrast, if x and y were very non-independent, in Dialogue: 0,1:10:43.58,1:10:46.90,Default,,0000,0000,0000,,other words, if x and y are very informative about each other, Dialogue: 0,1:10:46.90,1:10:48.45,Default,,0000,0000,0000,,then this K-L divergence Dialogue: 0,1:10:48.45,1:10:49.72,Default,,0000,0000,0000,,will be large. Dialogue: 0,1:10:49.72,1:10:52.38,Default,,0000,0000,0000,,And so mutual information is a formal measure of Dialogue: 0,1:10:52.38,1:10:55.46,Default,,0000,0000,0000,,how non-independent x and y are. Dialogue: 0,1:10:55.46,1:10:58.30,Default,,0000,0000,0000,,And if x and y are highly non-independent, Dialogue: 0,1:10:58.30,1:10:59.26,Default,,0000,0000,0000,,then that means that Dialogue: 0,1:10:59.26,1:11:02.21,Default,,0000,0000,0000,,x will presumably tell you something about y, Dialogue: 0,1:11:02.21,1:11:05.99,Default,,0000,0000,0000,,and so they'll have large mutual information. And Dialogue: 0,1:11:05.99,1:11:09.52,Default,,0000,0000,0000,,this measure of mutual information will tell you that xi might be a good feature. And you Dialogue: 0,1:11:09.52,1:11:11.99,Default,,0000,0000,0000,,get to play with some of these ideas Dialogue: 0,1:11:11.99,1:11:15.31,Default,,0000,0000,0000,,more in the problem sets. So I won't say much more about it. Dialogue: 0,1:11:16.58,1:11:19.42,Default,,0000,0000,0000,,And what you do then is, having chosen Dialogue: 0,1:11:19.42,1:11:23.78,Default,,0000,0000,0000,,some measure like correlation or mutual information or something else, you Dialogue: 0,1:11:23.78,1:11:28.89,Default,,0000,0000,0000,,then pick the top Dialogue: 0,1:11:28.89,1:11:33.66,Default,,0000,0000,0000,,k features; meaning that you compute the correlation between xi and y for all the Dialogue: 0,1:11:33.66,1:11:36.98,Default,,0000,0000,0000,,features, or the mutual information between xi and y for all the features. Dialogue: 0,1:11:36.98,1:11:40.08,Default,,0000,0000,0000,,And then you include in your learning algorithm the Dialogue: 0,1:11:40.08,1:11:43.37,Default,,0000,0000,0000,,k features with the largest correlation with the label, or Dialogue: 0,1:11:43.37,1:11:46.49,Default,,0000,0000,0000,,the largest mutual information with the label, whatever. Dialogue: 0,1:11:46.49,1:11:51.56,Default,,0000,0000,0000,,And to choose Dialogue: 0,1:11:51.56,1:11:54.81,Default,,0000,0000,0000,,k, Dialogue: 0,1:11:54.81,1:11:57.80,Default,,0000,0000,0000,,you can actually use cross validation, as well; okay?
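Here is a minimal Python sketch of the mutual information computation for a binary feature and a binary label, with every probability estimated from training-data counts as described above; the function name and the convention that 0 log 0 is treated as 0 are illustrative. The lecture's recipe for choosing how many features to keep continues right after this sketch.

import numpy as np

def mutual_information(xi, y):
    # MI(x_i, y) = sum over a in {0,1} and b in {0,1} of
    #   p(x_i=a, y=b) * log( p(x_i=a, y=b) / (p(x_i=a) * p(y=b)) ),
    # which is the K-L divergence between the joint p(x_i, y) and the product
    # p(x_i) * p(y); it is zero exactly when x_i and y are independent.
    xi, y = np.asarray(xi), np.asarray(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((xi == a) & (y == b))          # joint, from training counts
            p_x, p_y = np.mean(xi == a), np.mean(y == b)  # marginals, from training counts
            if p_xy > 0:                                  # treat 0 * log 0 as 0
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi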
Dialogue: 0,1:11:57.80,1:11:59.36,Default,,0000,0000,0000,,So you would Dialogue: 0,1:11:59.36,1:12:03.06,Default,,0000,0000,0000,,take all your features, and sort them in decreasing order of mutual Dialogue: 0,1:12:03.06,1:12:04.11,Default,,0000,0000,0000,,information. Dialogue: 0,1:12:04.11,1:12:07.53,Default,,0000,0000,0000,,And then you'd try using just the top one feature, the top two features, the top Dialogue: 0,1:12:07.53,1:12:09.56,Default,,0000,0000,0000,,three features, and so on. Dialogue: 0,1:12:09.56,1:12:13.87,Default,,0000,0000,0000,,And you decide how many features to include Dialogue: 0,1:12:13.87,1:12:17.28,Default,,0000,0000,0000,,using cross validation; okay? Or you Dialogue: 0,1:12:17.28,1:12:24.28,Default,,0000,0000,0000,,can sometimes just choose this by hand, as well. Okay. Questions about Dialogue: 0,1:12:24.28,1:12:31.28,Default,,0000,0000,0000,,this? Okay. Dialogue: 0,1:12:34.56,1:12:36.48,Default,,0000,0000,0000,,Cool. Great. So Dialogue: 0,1:12:36.48,1:12:40.63,Default,,0000,0000,0000,,next lecture I'll continue; I'll wrap up with Bayesian model selection, but let's close Dialogue: 0,1:12:40.63,1:12:40.88,Default,,0000,0000,0000,,here; we're at the end.
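Finally, a minimal Python sketch of that last recipe: sort the features by mutual information with the label, then pick how many to keep with cross validation. It assumes scikit-learn is available and uses its mutual_info_classif as a stand-in estimate of the mutual information; the candidate values of k and the logistic regression model are illustrative choices, not from the lecture.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def choose_k_by_cv(X, y, candidate_ks=(1, 2, 5, 10, 20, 50), cv=5):
    # Sort features in decreasing order of estimated mutual information with y,
    # then use cross validation to decide how many of the top features to keep.
    order = np.argsort(mutual_info_classif(X, y))[::-1]
    best_k, best_score = None, -np.inf
    for k in candidate_ks:
        k = min(k, X.shape[1])
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, order[:k]], y, cv=cv).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k, order[:best_k]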