Lecture 4 | Machine Learning (Stanford)
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:24 - 0:26Okay, so welcome back. And
-
0:26 - 0:31what I want to do today is talk about
-
0:31 - 0:32Newton's method, an algorithm
-
0:32 - 0:35for fitting models like logistic regression,
-
0:35 - 0:39and then we'll talk about exponential family distributions and generalized linear
-
0:39 - 0:43models. It's a very nice class of ideas that will tie together
-
0:43 - 0:47the logistic regression and the ordinary least squares models that we'll see. So hopefully I'll get to that
-
0:47 - 0:51today. So
-
0:51 - 0:55throughout the previous lecture and this lecture, we're starting to use increasingly
-
0:55 - 0:58large amounts of material on probability.
-
0:58 - 1:02So if you'd like to see a refresher on sort of the
-
1:02 - 1:03foundations of probability
-
1:03 - 1:06- if you're not sure if you quite had your prerequisites for this class
-
1:06 - 1:09in terms of a background in probability and statistics,
-
1:09 - 1:11then the discussion section
-
1:11 - 1:16taught this week by the TAs will go over a review of probability.
-
1:16 - 1:21At the same discussion sections, the TAs will also briefly go over
-
1:21 - 1:26sort of Matlab and Octave notation, which you need to use for your problem sets. And so if you
-
1:26 - 1:28any of you want to see
-
1:28 - 1:32a review of the probability and statistics prerequisites, or if you want a short tutorial of
-
1:32 - 1:34Matlab and Octave,
-
1:34 - 1:41please come to this - the next discussion section. All right. So
-
1:42 - 1:45just to recap briefly,
-
1:45 - 1:49towards the end of the last lecture I talked about the logistic regression model
-
1:49 - 1:51where we had -
-
1:51 - 1:57which was an algorithm for classification. We had that P of Y
-
1:57 - 2:02equals one, given X, parameterized by theta,
-
2:02 - 2:04under this model, all right,
-
2:04 - 2:05was one over one
-
2:05 - 2:11plus e to the minus theta transpose X.
-
2:11 - 2:15And then you can write down the log-likelihood -
-
2:15 - 2:22like given the training sets,
-
2:23 - 2:30
-
2:30 - 2:31which was that.
-
2:31 - 2:32And
-
2:32 - 2:36by taking the derivatives of this, you can derive
-
2:36 - 2:38sort of a gradient ascent rule
-
2:38 - 2:43for finding the maximum likelihood estimate of the parameter theta for
-
2:43 - 2:46this logistic regression model. And so
-
2:46 - 2:48last time I wrote down
-
2:48 - 2:52the learning rule for batch gradient ascent, but the version of stochastic gradient
-
2:52 - 2:55ascent where
-
2:55 - 3:02you look at just one training example at a time,
-
3:07 - 3:09would be like this, okay.
-
3:09 - 3:12So last time I wrote down a batch gradient ascent.
This is stochastic gradient ascent. -
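In Python, the stochastic gradient ascent update for logistic regression could be sketched like this (a minimal sketch; the learning rate `alpha`, the epoch count, and the toy data set are illustrative choices, not from the lecture):

```python
import math

def sigmoid(z):
    # Hypothesis h_theta(x) = 1 / (1 + e^(-theta^T x))
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_gradient_ascent(X, y, alpha=0.1, epochs=100):
    # Update on one training example at a time:
    #   theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = sigmoid(sum(t * x for t, x in zip(theta, x_i)))
            for j in range(n):
                theta[j] += alpha * (y_i - h) * x_i[j]
    return theta

# Toy data: first column is the intercept term x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = stochastic_gradient_ascent(X, y)
```

The inner loop touches one example at a time, which is the point of the stochastic version: you start making progress before seeing the whole training set.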
3:12 - 3:14So
-
3:14 - 3:18if you want to fit a logistic regression model, meaning find the
-
3:18 - 3:21value of theta that maximizes this log likelihood,
-
3:21 - 3:25gradient ascent or stochastic gradient ascent or batch gradient ascent is a perfectly fine
-
3:25 - 3:27algorithm to use.
-
3:27 - 3:29But what I want to do is talk about
-
3:29 - 3:30a different
-
3:30 - 3:32algorithm
-
3:32 - 3:33for fitting
-
3:33 - 3:35models like logistic regression.
-
3:35 - 3:39And this would be an algorithm that will, I guess, often run much faster than
-
3:39 - 3:42gradient descent.
-
3:42 - 3:44Um.
-
3:44 - 3:49And this algorithm is called Newton's Method.
-
3:49 - 3:54And when we describe Newton's Method - let me ask you - I should
-
3:54 - 4:01ask you to consider a different problem first,
-
4:01 - 4:04which is -
-
4:04 - 4:05let's say
-
4:05 - 4:11you have a function F of theta,
-
4:11 - 4:14and let's say you want to find the value of theta
-
4:14 - 4:18so that
-
4:18 - 4:21F of theta
-
4:21 - 4:26is equal to zero. Let's start the [inaudible], and then we'll sort of slowly
-
4:26 - 4:28change this until it becomes an algorithm for
-
4:28 - 4:35fitting maximum likelihood models, like logistic regression.
So - let's see. -
4:52 - 4:55I guess that works. Okay, so let's say that's my function F.
-
4:55 - 4:57This
-
4:57 - 5:01is my horizontal axis of theta, and a plot of F of theta,
-
5:01 - 5:03and so we're really trying to find the value of theta at
-
5:03 - 5:05which F of theta is equal to zero, on
-
5:05 - 5:08the horizontal axis.
-
5:08 - 5:12So here's the [inaudible]. I'm going to
-
5:12 - 5:15initialize
-
5:15 - 5:18theta as some value.
-
5:18 - 5:22We'll call theta superscript zero.
-
5:22 - 5:27And then here's what Newton's Method does. We're going to evaluate the function F at a
-
5:27 - 5:30value of theta, and then
-
5:30 - 5:34we'll compute the derivative of F, and we'll use the linear approximation to the
-
5:34 - 5:38function F of that value of theta. So in particular,
-
5:38 - 5:39
-
5:39 - 5:43
-
5:43 - 5:45
-
5:45 - 5:52I'm going to take the tangents to my function -
-
5:52 - 5:55hope that makes sense - starting the function [inaudible] work out nicely. I'm going to take the
-
5:55 - 5:57
-
5:57 - 6:00tangent to my function at that point there zero,
-
6:00 - 6:04and I'm going to sort of extend this tangent down until it intercepts the horizontal axis.
-
6:04 - 6:11I want to see what value this is. And I'm going to call this theta one, okay.
-
6:12 - 6:16And then so that's one iteration of Newton's Method. And
-
6:16 - 6:19what I'll do then is the same thing with this point. Take the
-
6:19 - 6:21tangent
-
6:21 - 6:22down here,
-
6:22 - 6:25and that's two iterations of the algorithm. And then just
-
6:25 - 6:28sort of keep going, that's
-
6:28 - 6:31theta three and so on, okay.
-
6:31 - 6:32So
-
6:32 - 6:38let's just go ahead and write down what this algorithm actually does.
-
6:38 - 6:38
-
6:38 - 6:41To go from theta zero to theta one, let's look at this horizontal
-
6:46 - 6:50length - let me just call that capital delta.
-
6:50 - 6:52So capital - so if you
-
6:52 - 6:53remember the
-
6:53 - 6:55definition of a derivative, the
-
6:55 - 6:59derivative of F evaluated at theta zero,
-
6:59 - 7:01in other words, the gradient of this tangent line,
-
7:01 - 7:05by the definition of gradient is going to be equal to this vertical
-
7:05 - 7:06length,
-
7:06 - 7:07divided by this horizontal length. A
-
7:07 - 7:10gradient of this - so the slope of this function
-
7:10 - 7:11is defined as
-
7:11 - 7:14the ratio between this vertical height
-
7:14 - 7:16and this width of triangle.
-
7:16 - 7:19So that's just equal to F of theta
-
7:19 - 7:23zero,
-
7:23 - 7:25divided by delta,
-
7:25 - 7:28which implies that
-
7:28 - 7:30delta is equal to F of
-
7:30 - 7:32theta zero,
-
7:32 - 7:34divided by F
-
7:34 - 7:38prime of
-
7:38 - 7:42theta zero,
-
7:42 - 7:43okay.
-
7:43 - 7:46And so theta
-
7:46 - 7:50one is therefore theta zero minus delta,
-
7:50 - 7:52minus capital delta,
-
7:52 - 7:56which is therefore just F of theta zero over F
-
7:56 - 7:58prime of theta
-
7:58 - 8:00zero, all
-
8:00 - 8:05right. And more generally, one iteration of Newton's Method proceeds like this: theta T plus
-
8:05 - 8:08one equals theta T
-
8:08 - 8:11minus
-
8:11 - 8:15F of theta T divided by F prime of theta
-
8:15 - 8:17T. So that's one iteration
-
8:17 - 8:24of Newton's Method.
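That scalar update is easy to sketch in code (a minimal sketch; the toy function F(theta) = theta squared minus 4 is my illustrative choice, not from the lecture):

```python
def newtons_method(f, f_prime, theta0, iterations=10):
    # One iteration: theta_{t+1} = theta_t - F(theta_t) / F'(theta_t)
    theta = theta0
    for _ in range(iterations):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Toy example: F(theta) = theta^2 - 4 has a zero at theta = 2.
root = newtons_method(lambda t: t ** 2 - 4, lambda t: 2 * t, theta0=3.0)
```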
-
8:24 - 8:28Now, this is an algorithm for finding a value of theta for which F of theta equals
-
8:28 - 8:31zero. And so we apply the same idea
-
8:31 - 8:33to
-
8:33 - 8:37maximizing the log likelihood, right. So we have a function L
-
8:37 - 8:38of
-
8:38 - 8:40theta, and we want to maximize
-
8:40 - 8:44this function. Well, how do you maximize the function? You set the derivative to zero. So we want
-
8:44 - 8:50theta such that L
-
8:50 - 8:54prime of theta is equal to zero, so to maximize this function we want to find the place where
-
8:54 - 8:57the derivative of the function is equal to zero,
-
8:57 - 8:58and so we just apply
-
8:58 - 9:01the same idea.
-
9:01 - 9:04So we get
-
9:04 - 9:09theta T plus one equals theta T minus L
-
9:09 - 9:11prime
-
9:11 - 9:18of theta T over L
-
9:18 - 9:20double prime of theta T, L
-
9:20 - 9:23double prime of
-
9:23 - 9:28theta T, okay. Because to maximize this function, we just let F be equal to L prime. Let F be
-
9:28 - 9:31the derivative of L,
-
9:31 - 9:36and then we want to find the value of theta for which the derivative of L is
-
9:36 - 9:36zero, and
-
9:36 - 9:43therefore must be a local optimum.
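So it is the same sketch, with F replaced by L prime (the concave toy objective L(theta) = -(theta - 5)^2 is an illustrative assumption, not from the lecture):

```python
def newton_maximize(l_prime, l_double_prime, theta0, iterations=10):
    # theta_{t+1} = theta_t - L'(theta_t) / L''(theta_t)
    theta = theta0
    for _ in range(iterations):
        theta = theta - l_prime(theta) / l_double_prime(theta)
    return theta

# L(theta) = -(theta - 5)^2 is maximized at theta = 5,
# where L'(theta) = -2(theta - 5) is zero.
theta_star = newton_maximize(lambda t: -2 * (t - 5), lambda t: -2.0, theta0=0.0)
```

For a quadratic objective like this one, a single Newton step lands exactly on the optimum.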
-
9:48 - 9:55Does this make sense? Any questions about this?
-
9:59 - 10:03[Inaudible] The answer to that is fairly complicated. There
-
10:03 - 10:06are conditions on F that would guarantee that this will work. They are fairly complicated,
-
10:06 - 10:11and this is more complex than I want to go into now.
-
10:11 - 10:13In practice, this works very well for logistic regression,
-
10:13 - 10:19and for sort of the generalized linear models I'll talk about later. [Inaudible]
-
10:19 - 10:23Yeah, it
-
10:23 - 10:27usually doesn't matter. When I implement this, I usually just initialize theta zero to
-
10:27 - 10:29zero to
-
10:29 - 10:32just initialize the parameters
-
10:32 - 10:36to the - back to all zeros, and usually this works fine. It's usually not a huge
-
10:36 - 10:43deal how you initialize theta.
-
10:44 - 10:46[Inaudible]
-
10:46 - 10:51or is it just different convergence? Let me
-
10:51 - 10:56say some things about that that'll sort of answer it. All of these
-
10:56 - 10:57
-
10:57 - 10:58
-
10:58 - 11:01algorithms tend not to have convergence problems, and all of these
-
11:01 - 11:04algorithms will generally converge,
-
11:04 - 11:07unless you choose too large a learning rate for
-
11:07 - 11:09gradient ascent or something.
-
11:09 - 11:11But the speeds of
-
11:11 - 11:15convergence of these algorithms are very different.
-
11:15 - 11:18So
-
11:18 - 11:22it turns out that Newton's Method is an algorithm that enjoys extremely
-
11:22 - 11:25fast convergence.
-
11:25 - 11:29The technical term is that it enjoys a property called quadratic convergence. Don't
-
11:29 - 11:30worry if you don't know what that means,
-
11:30 - 11:31but just
-
11:31 - 11:34stated informally, it
-
11:34 - 11:35means that asymptotically
-
11:35 - 11:42every iteration of Newton's Method will double the number of significant digits
-
11:42 - 11:44that your solution
-
11:44 - 11:45is accurate
-
11:45 - 11:47to. Just lots of constant
-
11:47 - 11:48factors.
-
11:48 - 11:49Suppose that
-
11:49 - 11:51on a certain iteration
-
11:51 - 11:57your solution is within 0.01 of the optimum, so you have
-
11:57 - 11:590.01 error. Then after one
-
11:59 - 12:04iteration, your error will be on the order of 0.0001,
-
12:04 - 12:11and after another iteration, your error will be on the order of 0.00000001.
-
12:14 - 12:15So this is called
-
12:15 - 12:18quadratic convergence because you essentially get to square the error
-
12:18 - 12:22on every iteration of Newton's Method. [Inaudible]
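The error-squaring behavior is easy to see numerically (an illustrative example of my own, solving theta squared minus 2 equals zero for the square root of two):

```python
theta = 1.5                 # start near the root sqrt(2)
true_root = 2 ** 0.5
errors = []
for _ in range(4):
    # Newton step for F(theta) = theta^2 - 2
    theta = theta - (theta ** 2 - 2) / (2 * theta)
    errors.append(abs(theta - true_root))
# Near the root, each error is on the order of the previous error
# squared, so the number of correct digits roughly doubles per step.
```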
-
12:22 - 12:25this is an asymptotic result that holds only when you are pretty
-
12:25 - 12:27close to the optimum anyway, so this is
-
12:27 - 12:31the theoretical result that says it's true, but because of constant factors
-
12:31 - 12:34and so on, may paint a slightly rosier picture than
-
12:34 - 12:36might be accurate.
-
12:36 - 12:40But the fact is, when you implement - when I implement Newton's
-
12:40 - 12:41Method for logistic regression,
-
12:41 - 12:45it usually converges in like a dozen iterations or so for most reasonably
-
12:45 - 12:46sized problems of
-
12:46 - 12:51tens or hundreds of features. So one
-
12:51 - 12:52thing I should
-
12:52 - 12:53talk about, which
-
12:53 - 12:54is
-
12:54 - 12:58what I wrote down over there was actually Newton's Method for the case of
-
12:58 - 13:01theta being a single real number. The
-
13:01 - 13:06generalization to Newton's Method for when theta is a vector rather than
-
13:06 - 13:08when theta is just a real number
-
13:08 - 13:11is the following,
-
13:11 - 13:15which is that theta T plus one is theta T
-
13:15 - 13:17minus - and then we have the
-
13:17 - 13:19second derivative divided by the
-
13:19 - 13:23first - the first derivative divided by the second derivative.
-
13:23 - 13:25
-
13:25 - 13:30And the appropriate generalization is this, where
-
13:30 - 13:33this is the usual gradient of
-
13:33 - 13:37your objective, and
-
13:37 - 13:44H is a matrix called the Hessian,
-
13:51 - 13:55which is just the matrix of second derivatives, where H I J
-
13:55 - 14:02equals the second partial derivative of L with respect to theta I and theta J, okay.
-
14:07 - 14:11So just to sort of - the
-
14:11 - 14:16first derivative divided by the second derivative, now you have a vector of first derivatives
-
14:16 - 14:17times
-
14:17 - 14:21sort of the inverse of the matrix of second derivatives. So
-
14:21 - 14:23this is sort of just the same thing
-
14:23 - 14:28generalized to multiple dimensions.
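In code, one vector-valued Newton step (theta becomes theta minus the inverse Hessian times the gradient) might look like this for a two-parameter problem; the concave toy objective is an illustrative assumption, not from the lecture:

```python
def newton_step(theta, grad, hessian):
    # theta_new = theta - H^{-1} * grad, with the 2x2 Hessian inverted
    # explicitly; for larger problems you would solve H * delta = grad.
    (a, b), (c, d) = hessian
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    delta = [inv[0][0] * grad[0] + inv[0][1] * grad[1],
             inv[1][0] * grad[0] + inv[1][1] * grad[1]]
    return [theta[0] - delta[0], theta[1] - delta[1]]

# Toy concave objective: L(theta) = -(theta_0 - 1)^2 - (theta_1 + 2)^2,
# maximized at theta = (1, -2).
theta = [0.0, 0.0]
grad = [-2 * (theta[0] - 1), -2 * (theta[1] + 2)]  # gradient at theta
hessian = [[-2.0, 0.0], [0.0, -2.0]]               # constant Hessian
theta = newton_step(theta, grad, hessian)
```

Because this objective is quadratic, the single step lands exactly on the maximizer.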
-
14:28 - 14:33So for logistic regression, again, use the -
-
14:33 - 14:35for a reasonable number of features
-
14:35 - 14:40and training examples - when I run this algorithm, usually you see convergence
-
14:40 - 14:41anywhere from sort of [inaudible]
-
14:41 - 14:44to like a dozen or so iterations.
-
14:44 - 14:48Compared to gradient ascent, this
-
14:48 - 14:53usually means far fewer iterations to converge.
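Putting the pieces together for logistic regression itself (a minimal pure-Python sketch for an intercept plus one feature; the toy data set is illustrative, and it uses the standard facts that the gradient of the log-likelihood is the sum of (y_i - h(x_i)) x_i and the Hessian is minus the sum of h(x_i)(1 - h(x_i)) x_i x_i transpose):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_logistic(X, y, iterations=10):
    # Fit theta by Newton's method: theta := theta - H^{-1} * gradient.
    theta = [0.0, 0.0]
    for _ in range(iterations):
        g = [0.0, 0.0]                # gradient: sum_i (y_i - h_i) x_i
        H = [[0.0, 0.0], [0.0, 0.0]]  # Hessian: -sum_i h_i(1-h_i) x_i x_i^T
        for x_i, y_i in zip(X, y):
            h = sigmoid(theta[0] * x_i[0] + theta[1] * x_i[1])
            for j in range(2):
                g[j] += (y_i - h) * x_i[j]
                for k in range(2):
                    H[j][k] -= h * (1 - h) * x_i[j] * x_i[k]
        # Solve the 2x2 system H * delta = g, then step theta := theta - delta.
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        d0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
        d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
        theta = [theta[0] - d0, theta[1] - d1]
    return theta

# Toy non-separable data: first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 1, 0, 1]
theta = newton_logistic(X, y)
```

On data like this, the loop typically settles within a handful of iterations, consistent with the dozen-or-so figure mentioned above.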
-
14:53 - 14:55Compared to gradient ascent, let's say batch gradient ascent,
-
14:55 - 14:59the disadvantage of Newton's Method is that on every iteration
-
14:59 - 15:02you need to invert the Hessian.
-
15:02 - 15:03So the Hessian
-
15:03 - 15:07will be an N-by-N matrix, or an N plus one by N plus one-dimensional matrix if N
-
15:07 - 15:09is the number of features.
-
15:09 - 15:13And so if you have a large number of features in your learning problem, if you have
-
15:13 - 15:15tens of thousands of features,
-
15:15 - 15:18then inverting H could be a slightly computationally expensive
-
15:18 - 15:22step. But for smaller, more reasonable numbers of features, this is usually a very [inaudible]. Question? [Inaudible] Let's see. I think
-
15:22 - 15:29you're right.
-
15:35 - 15:39That should probably be a minus.
-
15:39 - 15:40Do you have [inaudible]?
-
15:40 - 15:45Yeah, thanks.
-
15:45 - 15:48Yeah, it's a minus. Thank you. [Inaudible] problem also.
-
15:48 - 15:50I wrote down
-
15:50 - 15:51this algorithm
-
15:51 - 15:55to find the maximum likelihood estimate of the parameters for logistic regression.
-
15:55 - 15:56I wrote this
-
15:56 - 15:58down for maximizing a function.
-
15:58 - 16:02So I'll leave you to think about this yourself. If I wanted to use Newton's
-
16:02 - 16:05Method to minimize the function, how
-
16:05 - 16:07does the algorithm change? All right.
-
16:07 - 16:09So I'll leave you to think about that. So in other words,
-
16:09 - 16:13it's now not maximization - how does the algorithm change if you want to use it
-
16:13 - 16:20for minimization?
-
16:28 - 16:30Actually,
-
16:30 - 16:32the answer
-
16:32 - 16:33is that
-
16:33 - 16:35it doesn't change. I'll
-
16:35 - 16:40leave you to work that out yourself why, okay. All right.
-
16:40 - 16:47Let's talk about generalized linear models.
-
16:47 - 16:50Let me just say, just to give a recap of
-
16:50 - 16:53both of the algorithms we've talked about so far.
-
16:53 - 16:55We've talked about
-
16:55 - 16:59two different algorithms for modeling PFY given X and parameterized by
-
16:59 - 17:01theta. And
-
17:01 - 17:03one of them - Y was
-
17:03 - 17:06a real number,
-
17:06 - 17:10and if we assume that Y given X has a Gaussian distribution, then
-
17:10 - 17:13we got
-
17:13 - 17:16the ordinary least squares model of linear regression.
-
17:16 - 17:20In the other case, we saw
-
17:20 - 17:23that if it was a classification problem where Y took on a value of either
-
17:23 - 17:25zero or one.
-
17:25 - 17:27
-
17:27 - 17:31In that case, well, what's the most natural distribution over zeros and
-
17:31 - 17:35ones? It's the Bernoulli. The Bernoulli distribution models
-
17:35 - 17:37random variables with two values,
-
17:37 - 17:40and in that case we got
-
17:40 - 17:46logistic regression.
-
17:46 - 17:49So along the way, some of the questions that came up were - so logistic regression, where
-
17:49 - 17:51
-
17:51 - 17:52on
-
17:52 - 17:54earth did I get the sigmoid function from?
-
17:54 - 17:57And then so there are the choices you can use for, sort of,
-
17:57 - 18:00just where did this function come from?
-
18:00 - 18:02And there are other functions I could've plugged in, but
-
18:02 - 18:06the sigmoid function turns out to be a natural
-
18:06 - 18:07default choice
-
18:07 - 18:09that lead us to logistic regression.
-
18:09 - 18:12And what I want to do now is
-
18:12 - 18:15take both of these algorithms and
-
18:15 - 18:19show that they are special cases of a class of algorithms called
-
18:19 - 18:19generalized
-
18:19 - 18:22linear models,
-
18:22 - 18:24and that as part of this class
-
18:24 - 18:27of algorithms, the sigmoid
-
18:27 - 18:31function will fall out very naturally as well.
-
18:31 - 18:33So, let's see -
-
18:33 - 18:37just looking for a longer piece of chalk. I
-
18:37 - 18:41should warn you, the ideas in generalized linear models are somewhat
-
18:41 - 18:42complex, so what
-
18:42 - 18:46I'm going to do today is try to sort of point you - point out the key ideas and give you a
-
18:46 - 18:49gist of the entire story.
-
18:49 - 18:52And then some of the details in the math and the derivations I'll leave you to work through
-
18:52 - 18:58by yourselves in the lecture notes, which are posted
-
18:58 - 19:05online.
-
19:06 - 19:09So consider these two distributions, the Bernoulli and
-
19:09 - 19:13the Gaussian.
-
19:13 - 19:15So suppose we have data
-
19:15 - 19:19that is zero-one valued, and we want to model it
-
19:19 - 19:21with
-
19:21 - 19:24a Bernoulli random variable
-
19:24 - 19:26parameterized by
-
19:26 - 19:31phi. So the Bernoulli distribution has the probability of Y equals one
-
19:31 - 19:35just equal to phi, right. So the parameter phi in the
-
19:35 - 19:40Bernoulli specifies the probability of Y being one. Now,
-
19:40 - 19:43as you vary the parameter phi, you get -
-
19:43 - 19:47you sort of get different Bernoulli distributions. As you vary the value of
-
19:47 - 19:51phi you get different probability distributions on Y
-
19:51 - 19:53that have different probabilities of being equal to one.
-
19:53 - 19:58And so I want you to think of this not as one fixed distribution, but as a
-
19:58 - 20:00class of distributions
-
20:00 - 20:04that you get as you vary phi.
-
20:04 - 20:09And in the same way, if you consider the Gaussian distribution,
-
20:09 - 20:11as you vary mu you
-
20:11 - 20:14would get different
-
20:14 - 20:18Gaussian distributions. So think of this again as a class, or as a set, of
-
20:18 - 20:20distributions.
-
20:20 - 20:27And what I want to do now is show that
-
20:28 - 20:32both of these are special cases of the class of distributions that's called the
-
20:32 - 20:34exponential family distribution.
-
20:34 - 20:38And in particular, we'll say that a class of distributions, like
-
20:38 - 20:41the Bernoulli distributions that you get as you vary phi,
-
20:41 - 20:45we'll say the class of distributions is in the exponential family
-
20:45 - 20:46if it can be written
-
20:46 - 20:50in the following form.
-
20:50 - 20:52P of Y parameterized by eta is
-
20:52 - 20:54equal to B of Y
-
20:54 - 21:01times E to the, eta transpose T of Y, minus A of eta,
-
21:05 - 21:06okay. Let me just
-
21:06 - 21:07get some of these
-
21:07 - 21:11terms, names, and
-
21:11 - 21:13then - let me - I'll
-
21:13 - 21:19say a bit more about what this means.
-
21:19 - 21:24So eta is called the natural parameter of the distribution,
-
21:24 - 21:26and T
-
21:26 - 21:33of Y is called the sufficient statistic.
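Written out, the form just described is:

```latex
p(y;\eta) = b(y)\,\exp\left(\eta^{\top} T(y) - a(\eta)\right)
```

where eta is the natural parameter, T(y) is the sufficient statistic, and a(eta) is the term that normalizes the distribution (often called the log partition function).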
-
21:35 - 21:37Usually,
-
21:37 - 21:41for many of the examples we'll see, including the Bernoulli
-
21:41 - 21:44and the Gaussian, T
-
21:44 - 21:48of Y is just equal to Y. So for most of this lecture you can
-
21:48 - 21:52mentally replace T of Y to be equal to Y, although this won't be true for the very
-
21:52 - 21:56final example we do today, but mentally, you think of T of Y as equal to
-
21:56 - 21:57Y.
-
21:57 - 22:04And so
-
22:08 - 22:10for a given choice of
-
22:10 - 22:16these functions, A, B and
-
22:16 - 22:21T, all right - so we're gonna sort of fix the forms of the functions A, B and T.
-
22:21 - 22:22Then
-
22:22 - 22:27this formula defines, again, a set of distributions. It defines the class of distributions
-
22:27 - 22:28that
-
22:28 - 22:31is now parameterized by
-
22:31 - 22:32eta.
-
22:32 - 22:36So again, once we write down specific formulas for A, B and T - for specific
-
22:36 - 22:39choices of A, B and T - then
-
22:39 - 22:43as I vary eta I get different distributions.
-
22:43 - 22:45And
-
22:45 - 22:46I'm going to show that
-
22:46 - 22:48the
-
22:48 - 22:52Bernoulli - I'm going to show that the Bernoulli and the Gaussian are special
-
22:52 - 22:54cases
-
22:54 - 22:56of exponential family distributions.
-
22:56 - 23:01And by that I mean that I can choose specific functions, A, B and T,
-
23:01 - 23:03so that this becomes the formula
-
23:03 - 23:06of the distributions of either a Bernoulli or a Gaussian.
-
23:06 - 23:09And then again, as I vary eta,
-
23:09 - 23:12I'll get Bernoulli distributions with different means,
-
23:12 - 23:13or as I vary eta, I'll get
-
23:13 - 23:17Gaussian distributions with different means for my
-
23:17 - 23:23fixed values of A, B and T.
-
23:23 - 23:27And for those of you that know what a sufficient statistic and statistics is,
-
23:27 - 23:28
-
23:28 - 23:31T of Y actually is a sufficient statistic
-
23:31 - 23:34in the formal sense of
-
23:34 - 23:37sufficient statistic for a probability distribution. You may have seen it in a statistics
-
23:37 - 23:38class.
-
23:38 - 23:42If you don't know what a sufficient statistic is, don't worry about it. We
-
23:42 - 23:49sort of don't need that property today. Okay.
-
23:52 - 23:56So -
-
23:56 - 23:58oh, one last comment.
-
23:58 - 24:03Often, T of Y is equal to Y, and in many of these cases, eta is also
-
24:03 - 24:04just a real number.
-
24:04 - 24:08So in many cases, the parameter of this distribution is just a real number,
-
24:08 - 24:10and eta transpose T of Y
-
24:10 - 24:14is just a product of real numbers. So again, that would be true for our first two examples, but
-
24:14 - 24:15
-
24:15 - 24:21not for the last example I'll do today.
-
24:21 - 24:25So now we'll show that the Bernoulli and the Gaussian are examples of exponential family
-
24:25 - 24:28distributions. We'll start with the Bernoulli.
-
24:28 - 24:31So the Bernoulli distribution with parameter phi - I guess I wrote this down
-
24:31 - 24:33already.
-
24:33 - 24:36P of Y equals one, parameterized by phi,
-
24:36 - 24:37is equal to phi.
-
24:37 - 24:41So the parameter of phi specifies the probability
-
24:41 - 24:43that Y equals one.
-
24:43 - 24:45And so my goal now is to choose
-
24:45 - 24:48T, A and B, or is to choose A, B and T
-
24:48 - 24:53so that my formula for the exponential family becomes identical to my formula
-
24:53 - 25:00for the distribution of a Bernoulli. So
-
25:08 - 25:15probability of Y parameterized by phi is equal to that, all
-
25:17 - 25:20right. And you already saw sort
-
25:20 - 25:24of a similar exponential notation where we talked about logistic regression. The probability of
-
25:24 - 25:27Y being one is
-
25:27 - 25:31phi, the probability of Y being zero is one minus phi, so we can write this compactly as phi to the
-
25:31 - 25:38Y times one minus phi to the one minus Y. So I'm gonna take
-
25:40 - 25:44the exponent of the log of this - exponentiation and taking logs
-
25:44 - 25:45cancel each other
-
25:45 - 25:46
-
25:46 - 25:53out [inaudible].
-
25:53 - 25:55And this is equal to E to the
-
25:55 - 26:02Y log phi over one minus phi, plus log one minus phi. And so log phi over one minus phi is eta, Y
-
26:28 - 26:31is to be T of Y,
-
26:31 - 26:33and
-
26:33 - 26:40log one minus phi - this will be
-
26:42 - 26:45minus A of eta.
-
26:45 - 26:52And then B of Y is just one, so B of Y doesn't matter. Just
-
26:57 - 26:58take a second
-
26:58 - 27:01to look through this and make sure it makes
-
27:01 - 27:08sense. I'll clean another
-
27:42 - 27:47board while you do that. So now let's write down a few more things. Just
-
27:47 - 27:51copying from the previous board, we had that eta was equal to log
-
27:51 - 27:53phi
-
27:53 - 27:57over one minus
-
27:57 - 28:01phi. So if I take this formula,
-
28:01 - 28:04and if you invert it, if you solve for
-
28:04 - 28:06phi - excuse me, if you solve for
-
28:06 - 28:10phi as a function of eta, which is really the inverse of eta as a function of
-
28:10 - 28:11phi -
-
28:11 - 28:17just invert this formula, you find that phi is
-
28:17 - 28:19one over one plus E to the minus eta.
-
28:19 - 28:21And so
-
28:21 - 28:24somehow the logistic function magically
-
28:24 - 28:29falls out of this. We'll take this even further later.
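A quick numerical check (an illustrative sketch, with phi = 0.3 chosen arbitrarily): writing the Bernoulli mass function as exp(y eta - a(eta)) with a(eta) = log(1 + e^eta) reproduces phi^y (1 - phi)^(1 - y), and inverting eta = log(phi / (1 - phi)) gives back the logistic function.

```python
import math

def bernoulli_pmf(y, phi):
    # phi^y * (1 - phi)^(1 - y), for y in {0, 1}
    return (phi ** y) * ((1 - phi) ** (1 - y))

def exp_family_pmf(y, eta):
    # b(y) = 1, T(y) = y, a(eta) = log(1 + e^eta)
    return math.exp(y * eta - math.log(1 + math.exp(eta)))

phi = 0.3
eta = math.log(phi / (1 - phi))       # natural parameter
phi_back = 1 / (1 + math.exp(-eta))   # the logistic function falls out
```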
-
28:29 - 28:31Again, copying definitions from
-
28:31 - 28:36the board on - from the previous board, A of eta
-
28:36 - 28:39I said is minus log of one minus phi. So again, phi and eta are functions of each other, all
-
28:39 - 28:43right. So eta depends on phi, and phi
-
28:43 - 28:46depends on eta. So if I plug in
-
28:46 - 28:49this definition for eta
-
28:49 - 28:51into this - excuse
-
28:51 - 28:54me, plug in this definition for phi into that,
-
28:54 - 28:57I'll find that A of eta is therefore equal
-
28:57 - 29:01to log one plus E to the eta. And again, this is just algebra. This is not terribly
-
29:01 - 29:02interesting.
-
29:02 - 29:08And just to complete - excuse me.
-
29:08 - 29:11And just to complete the rest of this, T of
-
29:11 - 29:13Y is equal to Y,
-
29:13 - 29:16and
-
29:16 - 29:18B of Y is equal to one, okay.
-
29:18 - 29:20So just to recap what we've done,
-
29:20 - 29:23we've come up with a certain choice of
-
29:23 - 29:25functions A, T and
-
29:25 - 29:26B,
-
29:26 - 29:30so then my formula for the exponential family distribution
-
29:30 - 29:34now becomes exactly the formula for the distributions, or for the probability
-
29:34 - 29:37mass function of the Bernoulli distribution.
-
29:37 - 29:40And the natural parameter eta has a certain relationship to the
-
29:40 - 29:45original parameter phi of the Bernoulli.
-
29:45 - 29:46Question?
-
29:46 - 29:49[Inaudible]
-
29:49 - 29:50Let's
-
29:50 - 29:54see. [Inaudible]. The second to
-
29:54 - 29:59the last one. Oh, this answer is fine. Okay.
-
29:59 - 30:01Let's see.
-
30:01 - 30:03Yeah, so this is -
-
30:03 - 30:07well, if you expand this term out, one minus Y times log one minus phi,
-
30:07 - 30:09and so
-
30:09 - 30:10one times log -
-
30:10 - 30:13one minus phi becomes this. And the other
-
30:13 - 30:16term is minus Y times log one minus phi.
-
30:16 - 30:18And then -
-
30:18 - 30:23so the minus of a log is log one over X, or is just log one over
-
30:23 - 30:25whatever. So minus Y times log
-
30:25 - 30:28one minus phi becomes
-
30:28 - 30:34sort of Y times log, one over one minus phi. Does that make sense? Yeah. Yeah,
-
30:34 - 30:39cool. Anything else? Yes?
-
30:39 - 30:41[Inaudible] is a scalar, isn't it? Up there
-
30:41 - 30:43-
-
30:43 - 30:47Yes - it's eta transposed, so it can be a vector or
-
30:47 - 30:49- Yes, [inaudible]. So
-
30:49 - 30:53let's see. In most - in this and the next example, eta will turn out to
-
30:53 - 30:55be a scalar.
-
30:55 - 30:56And so -
-
30:56 - 30:59well, on this board.
-
30:59 - 31:04And so if eta is a scalar and T of Y is a scalar, then this is just
-
31:04 - 31:06a real number times a real number. So this would be like a
-
31:06 - 31:08one-dimensional vector
-
31:08 - 31:12transposed times a one-dimensional vector. And so this is just real number times real number. Towards the
-
31:12 - 31:16end of today's lecture, we'll go with just one example where both of
-
31:16 - 31:17these are vectors.
-
31:17 - 31:24But for many distributions, these will turn out to be scalars.
-
31:24 - 31:25
-
31:25 - 31:27
-
31:27 - 31:33[Inaudible] distribution [inaudible]. I
-
31:33 - 31:35mean, it doesn't have the
-
31:35 - 31:38zero probability
-
31:38 - 31:42or [inaudible] zero and one. I
-
31:42 - 31:43see. So
-
31:43 - 31:46- yeah. Let's - for this,
-
31:46 - 31:51let's imagine that we're restricting the domain
-
31:51 - 31:52
-
31:52 - 31:55of the input of the function to be Y equals zero or one.
-
31:55 - 31:57So think of that as maybe an
-
31:57 - 32:02implicit constraint on it. [Inaudible]. But so this is a
-
32:02 - 32:03probability mass function
-
32:03 - 32:06for Y equals zero or Y equals one. So
-
32:06 - 32:13write down Y equals zero or one. Let's think of that as an [inaudible]. So - cool.
-
32:16 - 32:20So this
-
32:20 - 32:21takes the Bernoulli
-
32:21 - 32:25distribution and writes it in the form of an exponential family
-
32:25 - 32:28distribution. Let's do that very quickly for the Gaussian. I won't do the algebra for the
-
32:28 - 32:30Gaussian.
-
32:30 - 32:32I'll basically just write out the answers.
-
32:32 - 32:37So
-
32:37 - 32:38
-
32:38 - 32:40with a normal distribution with
-
32:40 - 32:43mean mu and variance sigma squared,
-
32:43 - 32:47and so you remember, was it two lectures ago,
-
32:47 - 32:51when we were deriving the maximum likelihood - excuse me, oh, no, just the
-
32:51 - 32:52previous lecture
-
32:52 - 32:56when we were deriving the maximum likelihood estimate
-
32:56 - 33:00for the parameters of ordinary least squares. We showed that
-
33:00 - 33:04the parameter sigma squared didn't matter. When we derived
-
33:04 - 33:06the probabilistic model for least squares regression,
-
33:06 - 33:09we said that no matter what sigma squared was, we end up with the same
-
33:09 - 33:12value of the parameters.
-
33:12 - 33:16So for the purposes of just writing out today's lecture, and not taking
-
33:16 - 33:16account of
-
33:16 - 33:18sigma squared,
-
33:18 - 33:21I'm just going to set
-
33:21 - 33:26sigma squared to be equal to one, okay, so as to not worry about it.
-
33:26 - 33:28The lecture notes talk a little bit more about this, but I'm just gonna -
-
33:28 - 33:30just to make
-
33:30 - 33:34the derivation in class a bit easier and simpler today, let's just
-
33:34 - 33:36say that sigma squared equals one.
-
33:36 - 33:39Sigma squared is essentially just a scaling factor
-
33:39 - 33:44on the variable Y.
-
33:44 - 33:46So in that case, the Gaussian
-
33:46 - 33:51density is given by this,
-
33:51 - 33:52one over root two pi, E to the minus one-half, Y minus mu
-
33:52 - 33:54squared.
-
33:54 - 33:55
-
33:55 - 33:59And
-
33:59 - 34:02- well, by a couple of steps of algebra, which I'm not going to do,
-
34:02 - 34:07but is written out in [inaudible] in the lecture now so you can download.
-
34:07 - 34:10This is one over root two
-
34:10 - 34:14pi, E to the minus one-half Y squared,
-
34:14 - 34:17times E to the,
-
34:17 - 34:20mu Y minus
-
34:20 - 34:23one-half mu squared, okay.
-
34:23 - 34:25So I'm just not doing the algebra.
-
34:25 - 34:28And so that's B
-
34:28 - 34:33of Y,
-
34:33 - 34:37we have eta that's equal to mu, T of
-
34:37 - 34:42Y equals
-
34:42 - 34:43Y,
-
34:43 - 34:45and - well, A of eta
-
34:45 - 34:50is equal to
-
34:50 - 34:54minus one-half - actually, I think that
-
34:54 - 34:59should be plus one-half. Have I got that right?
-
34:59 - 35:02Yeah, sorry. Let's
-
35:02 - 35:03see - excuse me. Plus sign
-
35:03 - 35:07there, okay. The density has E to the minus one-half mu squared, and
-
35:07 - 35:11because mu is equal to eta, this is just minus one-half
-
35:11 - 35:14eta squared, okay.
-
35:14 - 35:15And so
-
35:15 - 35:19this would be a specific choice again of A, B and T
-
35:19 - 35:21that
-
35:21 - 35:23expresses the Gaussian density
-
35:23 - 35:27in the form of an exponential family distribution.
-
35:27 - 35:31And in this case, the relationship between eta and mu is that
-
35:31 - 35:35eta is just equal to mu, so the mean of the Gaussian is just equal to the natural
-
35:35 - 35:38parameter of the exponential family distribution. Minus
-
35:38 - 35:44half. Oh, this is minus half? [Inaudible] Oh, okay, thanks.
-
35:44 - 35:47And so - guessing
-
35:47 - 35:51that should be plus then. Is that right? Okay. Oh,
-
35:51 - 35:58yes, you're right. Thank you. All right.
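To make the factorization just worked out concrete, here is a small numerical check - a sketch added for this transcript, not part of the lecture, and the helper names `gaussian_density` and `exp_family_density` are mine - that the N(mu, 1) density equals b(y) exp(eta y - eta squared over two) with eta = mu:

```python
import math

def gaussian_density(y, mu):
    # Standard N(mu, 1) density, with sigma squared set to one as in the lecture.
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def exp_family_density(y, eta):
    # Exponential family factorization b(y) * exp(eta * T(y) - a(eta)),
    # with b(y) = (1/sqrt(2 pi)) e^{-y^2/2}, T(y) = y, a(eta) = eta^2 / 2.
    b = math.exp(-0.5 * y ** 2) / math.sqrt(2 * math.pi)
    return b * math.exp(eta * y - 0.5 * eta ** 2)

# With eta = mu, the two expressions agree for any y.
```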
-
36:03 - 36:06And so a nice result is that,
-
36:06 - 36:09if you've taken an undergrad statistics class, it turns out that most of the
-
36:09 - 36:13"textbook distributions," not all, but most of them,
-
36:13 - 36:17can be written in the form of an exponential family distribution.
-
36:17 - 36:21So you saw the Gaussian, the normal distribution. It turns out the multivariate
-
36:21 - 36:25normal distribution, which is a generalization of the Gaussian to random variables
-
36:25 - 36:27in high-dimensional vector spaces -
-
36:27 - 36:32the multivariate normal distribution is also in the exponential family.
-
36:32 - 36:36You saw the Bernoulli as an exponential family. It turns out the multinomial distribution is too, all
-
36:36 - 36:38right. So the Bernoulli
-
36:38 - 36:42models outcomes over zero and one. That would be coin tosses with two outcomes.
-
36:42 - 36:47The multinomial models outcomes over K possible values. That's also an
-
36:47 - 36:49exponential family distribution.
-
36:49 - 36:53You may have heard of the Poisson distribution. And so the Poisson distribution is
-
36:53 - 36:55often used for modeling counts. Things like
-
36:55 - 37:00the number of radioactive decays in a sample, or the number of
-
37:00 - 37:04customers to your website, the number of visitors arriving in a store. The
-
37:04 - 37:08Poisson distribution is also in the exponential family. So are
-
37:08 - 37:10the gamma and the exponential
-
37:10 - 37:15distributions, if you've heard of them. So the gamma and the exponential distributions are
-
37:15 - 37:19distributions over the positive numbers. So they're often used to model time
-
37:19 - 37:20intervals, like if you're
-
37:20 - 37:22standing at the bus stop and you want to
-
37:22 - 37:26ask, "When is the next bus likely to arrive? How long do I have to wait for
-
37:26 - 37:27my bus to arrive?"
-
37:27 - 37:32Often you model that with sort of a gamma distribution
-
37:32 - 37:34or the exponential
-
37:34 - 37:37distribution. Those are also in the exponential family.
-
37:37 - 37:41Even more esoteric distributions, like the beta and the Dirichlet
-
37:41 - 37:44distributions - these are probability distributions over
-
37:44 - 37:49fractions, or really probability distributions over probability distributions.
-
37:49 - 37:52And also things like the Wishart distribution, which is a
-
37:52 - 37:57distribution over covariance matrices. So all of these, it turns out, can be written in
-
37:57 - 38:01the form of exponential family distributions. Well,
-
38:01 - 38:04and
-
38:04 - 38:09in the problem set, we'll ask you to take one of these distributions and write
-
38:09 - 38:13it in the form of the exponential family distribution, and derive a generalized linear model
-
38:13 - 38:15for it, okay.
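As one instance of the kind of exercise just described - a sketch added for this transcript, not lecture code, with function names of my choosing - the Poisson can be written in exponential family form with eta = log lambda, T(y) = y, a(eta) = e^eta, and b(y) = 1/y!:

```python
import math

def poisson_pmf(y, lam):
    # Standard Poisson pmf: e^{-lambda} * lambda^y / y!.
    return math.exp(-lam) * lam ** y / math.factorial(y)

def poisson_exp_family(y, eta):
    # Exponential family form b(y) * exp(eta * T(y) - a(eta)),
    # with b(y) = 1/y!, T(y) = y, eta = log(lambda), a(eta) = e^eta.
    return math.exp(eta * y - math.exp(eta)) / math.factorial(y)

# Plugging in eta = log(lambda) recovers the usual pmf.
```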
-
38:15 - 38:22Which brings me to the next topic of
-
38:29 - 38:32having chosen an exponential family distribution,
-
38:32 - 38:38how do you use it to derive a generalized linear model?
-
38:38 - 38:41So
-
38:41 - 38:45generalized linear models are often abbreviated GLMs.
-
38:45 - 38:49And
-
38:49 - 38:52I'm going to write down the three assumptions. You can think of them
-
38:52 - 38:55as assumptions, or you can think of them as design choices,
-
38:55 - 39:00that will then allow me to sort of turn a crank and come up with a generalized linear model.
-
39:00 - 39:04So the first one is - I'm going to assume
-
39:04 - 39:04that
-
39:04 - 39:06given
-
39:06 - 39:09my input X and my parameters theta,
-
39:09 - 39:13I'm going to assume that the variable Y,
-
39:13 - 39:17the output Y, or the response variable Y I'm trying to predict
-
39:17 - 39:20is distributed
-
39:20 - 39:22exponential family
-
39:22 - 39:24
-
39:24 - 39:27with some natural parameter eta. And so this means that there
-
39:27 - 39:29is some specific choice
-
39:29 - 39:32of those functions, A, B and T
-
39:32 - 39:33so that
-
39:33 - 39:38the conditional distribution of Y given X and parameterized by theta
-
39:38 - 39:40is exponential family
-
39:40 - 39:42with parameter
-
39:42 - 39:46eta. Where here, eta may depend on X in some way.
-
39:46 - 39:50So for example, if you're trying to predict - if you want to
-
39:50 - 39:51predict
-
39:51 - 39:53how many customers have arrived at your website,
-
39:53 - 39:55you may choose to
-
39:55 - 39:57model the number of people
-
39:57 - 40:01- the number of hits on your website with a Poisson distribution, since the Poisson
-
40:01 - 40:04distribution is natural for modeling count data.
-
40:04 - 40:05And so you may
-
40:05 - 40:12choose the exponential family distribution here to be the Poisson distribution.
-
40:15 - 40:18Assumption two is that given X, our
-
40:18 - 40:21goal is
-
40:21 - 40:22to output
-
40:22 - 40:25the
-
40:25 - 40:29expected value of Y
-
40:29 - 40:30given X. So
-
40:30 - 40:34given the features in the website examples, I've given a set of features
-
40:34 - 40:38about whether there were any promotions, whether there were sales, how many people linked to
-
40:38 - 40:40your website, or whatever.
-
40:40 - 40:41I'm going to assume that
-
40:41 - 40:45our goal in our learning problem is to estimate the expected number of people that will
-
40:45 - 40:49arrive at your website on a given day.
-
40:49 - 40:50
-
40:50 - 40:52So in other words,
-
40:52 - 40:55you're saying that I want H of
-
40:55 - 40:58X to be equal to - oh,
-
40:58 - 41:00excuse me. I actually
-
41:00 - 41:06meant to write T of Y here.
-
41:06 - 41:10My goal is to get my learning algorithms hypothesis to output the expected value
-
41:10 - 41:13of T of Y given X. But
-
41:13 - 41:17again, for most of the examples, T of Y is just equal to Y.
-
41:17 - 41:21And so for most of the examples, our goal is to get our learning algorithms output, T
-
41:21 - 41:28expected value of Y given X because T of Y is usually
-
41:33 - 41:34equal
-
41:34 - 41:36to Y. Yes? [Inaudible] Yes, same thing, right. T of Y is a sufficient statistic.
-
41:36 - 41:39Same T of Y.
-
41:39 - 41:44And lastly, this last one I wrote down - these are assumptions. This last
-
41:44 - 41:48one you might - maybe wanna think of this as a design choice.
-
41:48 - 41:49
-
41:49 - 41:50Which is - we'll
-
41:50 - 41:53assume that the distribution of Y given X
-
41:53 - 41:55is distributed exponential family
-
41:55 - 41:59with some parameter eta. So the number of visitors on the website
-
41:59 - 42:01on any given day will be Poisson
-
42:01 - 42:03with some parameter.
-
42:03 - 42:06And the last decision I need to make is
-
42:06 - 42:09what's the relationship between my input features
-
42:09 - 42:11and this parameter
-
42:11 - 42:15eta parameterizing my Poisson distribution or whatever.
-
42:15 - 42:19And this last step, I'm going to make the
-
42:19 - 42:21assumption, or really a design choice,
-
42:21 - 42:24that I'm going to assume the relationship between eta
-
42:24 - 42:26and my inputs X is linear,
-
42:26 - 42:30and in particular that it's governed by this - that eta is equal to theta,
-
42:30 - 42:32transpose X.
-
42:32 - 42:36And the reason I make this design choice is it will allow me to turn the crank of
-
42:36 - 42:37the
-
42:37 - 42:39generalized linear model of
-
42:39 - 42:42machinery and come up with very nice algorithms for fitting,
-
42:42 - 42:45say, Poisson regression models,
-
42:45 - 42:47or perform regression
-
42:47 - 42:50with a gamma distribution outputs or exponential distribution outputs and
-
42:50 - 42:53so on.
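A minimal sketch of how those three design choices turn into a prediction rule - added for illustration, not lecture code; `glm_predict` and the response-function names are my own:

```python
import math

def glm_predict(theta, x, response):
    # The three GLM design choices, as code:
    # 1. y | x; theta is exponential family with natural parameter eta.
    # 2. The hypothesis outputs the expected value of y given x, via the
    #    canonical response function g(eta).
    # 3. eta = theta transpose x, linear in the inputs.
    eta = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return response(eta)

# Swapping the response function swaps the model:
sigmoid = lambda eta: 1.0 / (1.0 + math.exp(-eta))  # Bernoulli -> logistic regression
exp_mean = lambda eta: math.exp(eta)                # Poisson -> Poisson regression
identity = lambda eta: eta                          # Gaussian -> least squares
```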
-
42:53 - 42:59So
-
42:59 - 43:06let's work through an example.
-
43:13 - 43:15
-
43:15 - 43:18Eta
-
43:18 - 43:23equals theta transpose X works for the case where eta is a real number.
-
43:23 - 43:25For the more general case, you
-
43:25 - 43:29would have eta I equals theta
-
43:29 - 43:35I, transpose X if eta
-
43:35 - 43:39is a vector rather than a real number. But again, in most of the examples eta
-
43:39 - 43:45will just be a real number.
-
43:45 - 43:52All right.
-
43:55 - 44:00So let's work through the Bernoulli example. We'll say
-
44:00 - 44:07that Y given X parameterized by theta is distributed
-
44:09 - 44:12exponential family with natural parameter eta.
-
44:12 - 44:14And for
-
44:14 - 44:17the Bernoulli distribution, I'm going to choose A, B and T to
-
44:17 - 44:20be the specific forms
-
44:20 - 44:21that
-
44:21 - 44:26cause the exponential family to become the Bernoulli distribution. This is the example we
-
44:26 - 44:30worked through just now, the first example we worked through just now.
-
44:30 - 44:34So - oh,
-
44:34 - 44:35and we also have -
-
44:35 - 44:41so for any fixed
-
44:41 - 44:44value of X and theta, my hypothesis, my learning
-
44:44 - 44:46algorithm
-
44:46 - 44:53will
-
44:53 - 44:57make a prediction, or will sort of output H of
-
44:57 - 45:01X,
-
45:01 - 45:03which is,
-
45:03 - 45:06by my, I guess, second assumption - I want my learning
-
45:06 - 45:10algorithm to output the expected value of Y given X
-
45:10 - 45:14
-
45:14 - 45:15and parameterized by theta, where Y can take on
-
45:15 - 45:17only the value zero and one,
-
45:17 - 45:22then the expected value of Y is just equal to the
-
45:22 - 45:26probability that Y is equal to one. So
-
45:26 - 45:30the expected value of a Bernoulli variable is just equal to the
-
45:30 - 45:37probability that it's equal to one.
-
45:37 - 45:38And so
-
45:38 - 45:42the probability that Y equals one is just equal to phi
-
45:42 - 45:43because that's the
-
45:43 - 45:45parameter
-
45:45 - 45:47of my Bernoulli distribution. Phi
-
45:47 - 45:54is, by definition, the probability of my Bernoulli random variable taking on the value one.
-
45:57 - 45:59Which we worked out previously,
-
45:59 - 46:03phi was one over one plus E to the negative eta.
-
46:03 - 46:06So we worked this out on our previous board. This is the relationship -
-
46:06 - 46:10so when we wrote down the Bernoulli distribution
-
46:10 - 46:14in the form of an exponential family, we worked out what the relationship was between phi and
-
46:14 - 46:18eta, and it was this. So we worked out that the relationship between the
-
46:18 - 46:24expected value of Y and eta was this relationship.
-
46:24 - 46:26And lastly, because
-
46:26 - 46:29we made the design choice, or the assumption that
-
46:29 - 46:31eta and X are
-
46:31 - 46:32linearly
-
46:32 - 46:36related - eta equals theta transpose X. This is therefore equal to
-
46:36 - 46:42one over one plus E to the minus theta, transpose X.
-
46:42 - 46:45And so that's
-
46:45 - 46:46how I
-
46:46 - 46:49come up with the logistic regression algorithm
-
46:49 - 46:51when
-
46:51 - 46:53you have a variable Y
-
46:53 - 46:57- when you have a binary variable Y, or a response variable
-
46:57 - 46:58Y
-
46:58 - 47:02that takes on two values, and then you choose to model
-
47:02 - 47:08it with a Bernoulli distribution. Does
-
47:08 - 47:14this make sense? Raise your hand if this makes sense. Yeah, okay, cool. So I hope
-
47:14 - 47:16you get
-
47:16 - 47:19the ease of use of this, or sort of the power of this. The only decision I
-
47:19 - 47:22made was really, I said Y -
-
47:22 - 47:25let's say I have a new machine-learning problem and
-
47:25 - 47:28I'm trying to predict the value of a variable Y that happens to take on two
-
47:28 - 47:30values.
-
47:30 - 47:34Then the only decision I need to make is to choose the Bernoulli distribution. I
-
47:34 - 47:38say I want to model - I want to assume that given X and theta,
-
47:38 - 47:41I'm going to assume Y is distributed
-
47:41 - 47:43Bernoulli. That's the only decision I made.
-
47:43 - 47:47And then everything else follows automatically having made the
-
47:47 - 47:48decision to
-
47:48 - 47:50model Y given X and
-
47:50 - 47:54parameterized by theta as being Bernoulli.
-
47:54 - 47:58In the same way you can choose a different distribution, you can choose Y as Poisson or Y as
-
47:58 - 48:00gamma or Y as whatever,
-
48:00 - 48:03and follow a similar process and come up with a different model and
-
48:03 - 48:04different learning algorithm.
-
48:04 - 48:07Come up with a different generalized linear model for whatever learning
-
48:07 - 48:12algorithm you're faced with.
-
48:12 - 48:16One tiny little bit of notation: the
-
48:16 - 48:20function G -
-
48:20 - 48:25G of eta - that relates the natural parameter
-
48:25 - 48:28to the expected value of Y,
-
48:28 - 48:32which in this case is one over one plus E to the minus eta,
-
48:32 - 48:39this is called the canonical response function. And G inverse is called the
-
48:42 - 48:43canonical
-
48:43 - 48:49link function. These aren't
-
48:49 - 48:53
-
48:53 - 48:57a huge deal. I won't use this terminology a lot. I'm just
-
48:57 - 48:59mentioning those in case
-
48:59 - 49:03you hear about - people talk about generalized linear models, and if they talk about canonical
-
49:03 - 49:05response functions or
-
49:05 - 49:09canonical link functions, just so you know there's all of this.
-
49:09 - 49:11Actually, many texts use
-
49:11 - 49:16the reverse convention - this is G inverse and this is G - but this
-
49:16 - 49:18notation turns out to be more consistent with
-
49:18 - 49:20other algorithms in machine learning. So
-
49:20 - 49:23I'm going to use this notation.
-
49:23 - 49:27But I probably won't use the terms canonical response functions and canonical
-
49:27 - 49:30link functions in lecture a lot, so just - I don't know.
-
49:30 - 49:32I'm not big on
-
49:32 - 49:39memorizing lots of names of things. I'm just tossing those out there in case you see it elsewhere. Okay.
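For what it's worth, here is a tiny sketch of the terminology just mentioned for the Bernoulli case - illustrative code added to the transcript, with function names of my choosing - showing that the canonical response function g and the canonical link function g inverse really do invert each other:

```python
import math

def canonical_response(eta):
    # g: natural parameter -> expected value of y.
    # For the Bernoulli family this is the sigmoid from the lecture.
    return 1.0 / (1.0 + math.exp(-eta))

def canonical_link(mu):
    # g inverse: expected value -> natural parameter.
    # For the Bernoulli family this is the logit.
    return math.log(mu / (1.0 - mu))
```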
-
49:50 - 49:51You know what, I think
-
49:51 - 49:55in the interest of time, I'm going to skip over the Gaussian example. But again,
-
49:55 - 49:57just like I said,
-
49:57 - 49:59if Y
-
49:59 - 50:03is Bernoulli, the derivation I gave gives logistic regression. You can do the same
-
50:03 - 50:05thing with the Gaussian distribution
-
50:05 - 50:09and end up with the ordinary least squares model.
-
50:09 - 50:12The problem with Gaussian is that it's almost so simple that when you see it for
-
50:12 - 50:14the first time that it's
-
50:14 - 50:17sometimes more confusing than the Bernoulli example because it looks so simple,
-
50:17 - 50:21it looks like it has to be more complicated. So let me just skip that
-
50:21 - 50:23and leave you to read about
-
50:23 - 50:25the Gaussian example in the lecture notes.
-
50:25 - 50:27And what I want to do is
-
50:27 - 50:30actually go through a more complex example.
-
50:30 - 50:34Question? [Inaudible]
-
50:34 - 50:35Okay, right. So
-
50:35 - 50:39how do you choose what theta will be?
-
50:39 - 50:43We'll get to that in the end. What you have there is the logistic
-
50:43 - 50:45regression model, which is a probabilistic model
-
50:45 - 50:51that assumes the probability of Y given X is given by a certain form.
-
50:51 - 50:53And so
-
50:53 - 50:58what you do is you can write down the log likelihood of your training set, and
-
50:58 - 51:02find the value of theta that maximizes the log likelihood of the parameters.
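A hedged sketch of that maximum-likelihood step - not lecture code; this is one batch gradient-ascent update on the logistic log likelihood, using the standard update theta_j := theta_j + alpha * sum over i of (y_i - h(x_i)) * x_ij:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent_step(theta, data, alpha):
    # One batch step uphill on the logistic regression log likelihood.
    # data is a list of (x, y) pairs with y in {0, 1}.
    grad = [0.0] * len(theta)
    for x, y in data:
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += (y - h) * xj
    return [t + alpha * g for t, g in zip(theta, grad)]
```

Iterating this step (or using Newton's method instead) finds the theta that maximizes the log likelihood.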
-
51:02 - 51:05Does that make
-
51:05 - 51:06sense?
-
51:06 - 51:09So I'll say that again towards the end of today's lecture. But
-
51:09 - 51:14for logistic regression, the way you choose theta is exactly maximum likelihood,
-
51:14 - 51:15as we
-
51:15 - 51:18worked out in the previous lecture, using Newton's Method or gradient
-
51:18 - 51:20ascent or
-
51:20 - 51:22whatever. I'll sort of try to
-
51:22 - 51:29do that again for one more example towards the end of today's lecture. So
-
51:29 - 51:33what I want to do is actually use the remaining, I don't know,
-
51:33 - 51:3619 minutes or so of this class,
-
51:36 - 51:38to go through the -
-
51:38 - 51:42one of the more - it's probably the most complex example of a
-
51:42 - 51:46generalized linear model that I've used. This one I want to go through because it's a little bit
-
51:46 - 51:48trickier than
-
51:48 - 51:51many of the other textbook examples of
-
51:51 - 51:54generalized linear models.
-
51:54 - 51:55So
-
51:55 - 51:58again, what I'm going to do is
-
51:58 - 52:00go through the derivation
-
52:00 - 52:04reasonably quickly and give you the gist of it, and if there are steps I skip or details
-
52:04 - 52:08omitted, I'll leave you to read about them more carefully
-
52:08 - 52:10in the lecture notes.
-
52:10 - 52:14And what I want to do is talk about
-
52:14 - 52:17the multinomial.
-
52:17 - 52:20And
-
52:20 - 52:23the multinomial
-
52:23 - 52:26is the distribution over
-
52:26 - 52:28K possible outcomes.
-
52:28 - 52:30
-
52:30 - 52:33Imagine you're now in a machine-learning problem where the value of Y that you're trying to
-
52:33 - 52:35predict can take on
-
52:35 - 52:37K possible outcomes, so rather than
-
52:37 - 52:40only two outcomes. So for example -
-
52:40 - 52:45if you want a learning algorithm to magically sort emails for you into the
-
52:45 - 52:48right email folder, you may have a dozen email folders you want your algorithm
-
52:48 - 52:50to classify emails into.
-
52:50 - 52:51Or
-
52:51 - 52:54predicting if the patient either has a disease or does not have
-
52:54 - 52:57a disease, which would be a binary classification problem.
-
52:57 - 53:02If you think that the patient may have one of K diseases, and
-
53:02 - 53:06you want to have a learning algorithm figure out which one of the K diseases your patient has. So
-
53:06 - 53:07lots
-
53:07 - 53:10of multi-class classification problems where you have more than two classes. You model that
-
53:10 - 53:14with a multinomial.
-
53:14 - 53:17And eventually -
-
53:17 - 53:21so for logistic regression, I had pictures like these where you have a
-
53:21 - 53:26training set and you find a decision boundary that separates them.
-
53:26 - 53:29With a multinomial, we're going to let
-
53:29 - 53:32the value we're predicting take on multiple values, so you now have
-
53:32 - 53:33three classes,
-
53:33 - 53:36and the learning algorithm
-
53:36 - 53:39will learn some way to separate out three classes or more, rather than just two
-
53:39 - 53:44classes.
-
53:44 - 53:47So let's write the multinomial in the form of
-
53:47 - 53:49an exponential family distribution.
-
53:49 - 53:53
-
53:53 - 53:58So the parameters of a multinomial are phi one, phi two,
-
53:58 - 54:00up to
-
54:00 - 54:04phi K. I'll actually change this in a second -
-
54:04 - 54:09where the probability of Y equals I is phi I,
-
54:09 - 54:10right, because there are
-
54:10 - 54:13K possible outcomes.
-
54:13 - 54:18But if I choose this as my parameterization of the multinomial, then
-
54:18 - 54:20my parameter's actually redundant because
-
54:20 - 54:23if these are probabilities, then they have to sum up to one.
-
54:23 - 54:26And therefore for example, I
-
54:26 - 54:30can derive the last parameter, phi K,
-
54:30 - 54:32as one minus phi
-
54:32 - 54:35one, up
-
54:35 - 54:36to phi K minus
-
54:36 - 54:41one. So this would be a
-
54:41 - 54:44redundant parameterization -
-
54:44 - 54:49the result is over-parameterized. And so for purposes of
-
54:49 - 54:50this derivation, I'm
-
54:50 - 54:54going to treat the parameters of my multinomial as phi one,
-
54:54 - 54:59phi two, up to phi K minus one.
-
54:59 - 55:03And I won't think of phi K as a parameter. I'll just - so my parameters are
-
55:03 - 55:03just -
-
55:03 - 55:08I just have K minus one parameters, parameterizing my
-
55:08 - 55:11multinomial. And sometimes I write phi K in my
-
55:11 - 55:13derivations as well, and you should think of
-
55:13 - 55:18phi K as just a shorthand for this, for one minus the rest of the parameters, okay.
-
55:18 - 55:25So
-
55:36 - 55:41it turns out the multinomial is one of the few examples where T of Y - it's one of the
-
55:41 - 55:45examples where T of Y is not equal to Y.
-
55:45 - 55:49So in this case, Y is
-
55:49 - 55:52one of K possible values.
-
55:52 - 55:55And so T of Y would be defined as follows; T
-
55:55 - 55:59of one is going to be a vector with a one
-
55:59 - 56:02and zeros everywhere else. T
-
56:02 - 56:04of two
-
56:04 - 56:07is going to be a zero, one, zero
-
56:07 - 56:09and so
-
56:09 - 56:12on. Except that these are going to be
-
56:12 - 56:16K minus one-dimensional vectors. And
-
56:16 - 56:17so
-
56:17 - 56:21T of K minus one is going to be zero, zero,
-
56:21 - 56:24zero, one.
-
56:24 - 56:27And
-
56:27 - 56:29T of K is going to be the vector of all zeros.
-
56:29 - 56:30So this is just
-
56:30 - 56:32how I'm choosing to define T
-
56:32 - 56:35of Y
-
56:35 - 56:39to write down the multinomial in the form of an exponential family
-
56:39 - 56:41distribution.
-
56:41 - 56:45Again, these are K minus one-dimensional vectors.
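The T(y) encoding just described can be sketched in a couple of lines - illustrative code added to the transcript, and `t_of_y` is my name for it:

```python
def t_of_y(y, k):
    # T(y) for the multinomial: y in {1, ..., k} maps to a (k-1)-dimensional
    # vector whose i-th entry is the indicator 1{y == i}.
    # T(k) comes out as the all-zeros vector.
    return [1 if y == i else 0 for i in range(1, k)]
```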
-
56:45 - 56:47So
-
56:47 - 56:50this is a good point to introduce one more useful piece of notation,
-
56:50 - 56:54which is called indicator function notation.
-
56:54 - 56:58So I'm going to write one, and then curly braces.
-
56:58 - 57:04And if I write a true statement inside, then the indicator of that statement is going to be
-
57:04 - 57:05one. If I write one,
-
57:05 - 57:08and then I write a false statement inside, then
-
57:08 - 57:12the value of this indicator function is going to be a
-
57:12 - 57:15zero. For example, if I write indicator two
-
57:15 - 57:17equals three
-
57:17 - 57:21- well, that's false, and so this is equal to zero.
-
57:21 - 57:22Whereas indicator one
-
57:22 - 57:25plus one equals two,
-
57:25 - 57:28I wrote down a true statement inside. And so the indicator of the statement was equal to
-
57:28 - 57:30one. So the
-
57:30 - 57:32indicator function is just a
-
57:32 - 57:35very useful notation for indicating sort of truth or falsehood
-
57:35 - 57:42of the statement inside. And so - actually, let's do
-
57:46 - 57:47this here.
-
57:47 - 57:54To combine both of these, right, if I carve out a bit
-
57:54 - 57:57of space here -
-
57:57 - 58:03so if I use - so TY is a
-
58:03 - 58:07vector. Y is one of K values, and so
-
58:07 - 58:10TY is one of these K vectors. If
-
58:10 - 58:13I use TY subscript I to denote
-
58:13 - 58:16the I-th element of the vector
-
58:16 - 58:18TY,
-
58:18 - 58:22then TY subscript I - the I-th element of the vector TY -
-
58:22 - 58:27is just equal to the indicator for
-
58:27 - 58:30whether Y is equal to I. Just take a
-
58:30 - 58:33- let me clean a couple more boards. Take a look at this for a second
-
58:33 - 58:35and make sure you understand why that - make
-
58:35 - 58:42sure you understand all that notation and why this is true. All
-
59:10 - 59:16right. Actually, raise your hand if this equation makes sense to you.
-
59:16 - 59:19Most of you,
-
59:19 - 59:24not all, okay. [Inaudible].
-
59:24 - 59:27Just as one kind of example,
-
59:27 - 59:30suppose Y is equal to one
-
59:30 - 59:34- let's
-
59:34 - 59:36say - let me see. Suppose Y is equal to one, right,
-
59:36 - 59:40so TY is equal to this vector,
-
59:40 - 59:42and therefore the first element of this vector
-
59:42 - 59:47will be one, and the rest of the elements will be equal to zero.
-
59:47 - 59:50And so - let
-
59:50 - 59:54me try that again, I'm sorry. Let's say I want to ask - I want to look at the I-th element of
-
59:54 - 59:59the vector TY, and I want to know is this one or zero. All right.
-
59:59 - 60:01Well, this will be one.
-
60:01 - 60:04The I-th element of the vector TY
-
60:04 - 60:05will be equal to one
-
60:05 - 60:06if, and only if Y is
-
60:06 - 60:09equal to I.
-
60:09 - 60:12Because for example, if Y is equal to one, then only the first element of this
-
60:12 - 60:14vector will be one.
-
60:14 - 60:18If Y is equal to two, then only the second element of the vector will be one,
-
60:18 - 60:20and so on. So the question of
-
60:20 - 60:24whether or not - whether the I-th element of this vector, TY, is equal to
-
60:24 - 60:26one is
-
60:26 - 60:27answered by
-
60:27 - 60:29just asking is Y
-
60:29 - 60:36equal to I. Okay. If you're
-
60:36 - 60:39still not quite sure why that's true, go home and
-
60:39 - 60:41think about it a bit more. And I think I -
-
60:41 - 60:45and take a look at the lecture notes as
-
60:45 - 60:50well, maybe that'll help. At least for now, only just take my word for it. So
-
60:50 - 60:54let's go ahead and write out the distribution
-
60:54 - 60:58for the multinomial in an exponential family form.
-
60:58 - 61:05So P of Y is equal to phi one
-
61:06 - 61:08to the indicator Y equals one,
-
61:08 - 61:11times phi two to the indicator Y equals
-
61:11 - 61:13two,
-
61:13 - 61:17up to phi K
-
61:17 - 61:21to the indicator Y equals K. And again, phi K is not a
-
61:21 - 61:23parameter of the distribution. Phi K
-
61:23 - 61:29is a shorthand for one minus phi one minus phi two minus the rest.
-
61:29 - 61:31And so
-
61:31 - 61:35using this equation on the left as well, I can also write this as
-
61:35 - 61:38phi one to the TY one, phi
-
61:38 - 61:40two to the TY
-
61:40 - 61:43two, dot, dot, dot,
-
61:43 - 61:45phi K minus one
-
61:45 - 61:46
-
61:46 - 61:48to the TY K minus one,
-
61:48 - 61:51times phi K to the
-
61:51 - 61:58one minus the sum of all the TY I's. That should
-
62:02 - 62:08be K.
-
62:08 - 62:11And it turns out -
-
62:11 - 62:15it takes some of the steps of algebra that I don't have time to show.
-
62:15 - 62:19It turns out, you can simplify this into - well, the
-
62:19 - 62:26exponential family form
-
62:50 - 62:57where eta is a vector - this is a
-
62:50 - 62:57K minus one-dimensional vector, and - well,
-
63:06 - 63:08okay. So deriving this is a few steps of algebra
-
63:08 - 63:13that you can work out yourself, but I won't do here.
-
63:13 - 63:18And so using my definition for TY, and
-
63:18 - 63:19by choosing
-
63:19 - 63:21eta, A and B this
-
63:21 - 63:24way, I can take my multinomial distribution and write it out in
-
63:24 - 63:31a form of an exponential family distribution.
-
63:33 - 63:35It turns out also that
-
63:35 - 63:37- let's
-
63:37 - 63:40see. For the Bernoulli, right, one of the things we did was we also had
-
63:40 - 63:42eta as a function of phi,
-
63:42 - 63:45and then we inverted that to write
-
63:45 - 63:48out phi as a function of eta.
-
63:48 - 63:51So it turns out you can do that as well.
-
63:51 - 63:55So this defines eta as a function of the multinomial distribution's parameters phi.
-
63:55 - 63:56So
-
63:56 - 64:00you can take this relationship between eta and phi and invert it,
-
64:00 - 64:03and write out phi as a function of eta.
-
64:03 - 64:06And it turns out, you get that
-
64:06 - 64:09phi I is equal to E to the eta I -
-
64:09 - 64:16excuse me.
-
64:16 - 64:21And you get that phi I is equal to
-
64:21 - 64:28E to the eta I over one plus
-
64:28 - 64:35the sum of the E to the eta J's.
-
64:36 - 64:39And the way you do this is you just - this defines
-
64:39 - 64:43eta as a function of the phi, so if you take this and solve for phi, you end up with this. And
-
64:43 - 64:48this is - again, there are a couple of steps of algebra that I'm just not showing.
-
64:48 - 64:52And then lastly, using our
-
64:52 - 64:56assumption that the etas are a linear function of the
-
64:56 - 64:57inputs X, phi
-
64:57 - 65:02I is therefore equal to E to the theta I, transpose X,
-
65:02 - 65:06divided by one plus sum over
-
65:06 - 65:09J equals one, to K
-
65:09 - 65:11minus one,
-
65:11 - 65:14E to the
-
65:14 - 65:18theta J, transpose
-
65:18 - 65:24X. And this is just using the fact
-
65:24 - 65:29that eta I equals theta I, transpose X, which was our earlier
-
65:29 - 65:36design choice from generalized linear models. So
-
65:44 - 65:50we're just about done.
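Putting the phi formulas just derived together, here is a sketch of the resulting hypothesis - illustrative code added to the transcript, and `softmax_probs` is my name for it:

```python
import math

def softmax_probs(thetas, x):
    # thetas holds the k-1 parameter vectors theta_1, ..., theta_{k-1}.
    # phi_i = e^{theta_i^T x} / (1 + sum_j e^{theta_j^T x}) for i < k,
    # and phi_k = 1 / (1 + sum_j e^{theta_j^T x}) for the implicit last class.
    scores = [sum(t * xj for t, xj in zip(theta, x)) for theta in thetas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores] + [1.0 / denom]
```

The returned list is the vector of probabilities of Y equals one through K, and it always sums to one.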
-
65:50 - 65:52So my learning algorithm
-
65:52 - 65:56- I'm going to think of it as outputting the
-
65:56 - 65:59expected value of TY
-
65:59 - 66:02given X and parameterized by theta.
-
66:02 - 66:05So
-
66:05 - 66:09TY was this vector indicator function. So
-
66:09 - 66:10T one
-
66:10 - 66:13was indicator Y equals one,
-
66:13 - 66:16down to indicator Y equals
-
66:16 - 66:20K minus one. All
-
66:20 - 66:24right. So I want my learning algorithm to output this; the expected value of this vector of
-
66:24 - 66:31indicator functions.
-
66:34 - 66:39The expected value of indicator Y equals one is just
-
66:39 - 66:42the probability that Y equals one,
-
66:42 - 66:45which is given by phi one.
-
66:45 - 66:48So I have a random variable that's one whenever Y is equal to one and zero
-
66:48 - 66:50otherwise,
-
66:50 - 66:52so the expected value of that,
-
66:52 - 66:55of this indicator Y equals one is just the
-
66:55 - 66:59probability that Y equals one, which is given by phi one.
-
66:59 - 67:02And therefore,
-
67:02 - 67:05by what we worked out earlier,
-
67:05 - 67:08this is therefore E
-
67:08 - 67:15to the theta one, transpose X, over one plus the sum of the E to the theta J transpose X's - well - okay.
-
67:46 - 67:47And so my
-
67:47 - 67:51learning algorithm will output the probability that Y equals one, Y equals two, up to Y
-
67:51 - 67:55equals K minus one.
-
67:55 - 67:58And these probabilities are going to be parameterized by
-
67:58 - 68:05these functions like these.
-
68:21 - 68:25And so just to give this algorithm a name,
-
68:25 - 68:32this algorithm is called softmax regression,
-
68:34 - 68:39and is widely thought of as the generalization of
-
68:39 - 68:42logistic regression, which is regression of two classes. Is widely thought
-
68:42 - 68:45of as a generalization of logistic regression
-
68:45 - 68:46to the case of
-
68:46 - 68:50K classes rather than two classes.
-
68:50 - 68:53
-
68:53 - 68:56And so just to be very concrete about what you do, right. So you have a machine-learning
-
68:56 - 69:00problem, and you want to apply softmax regression to it. So generally,
-
69:03 - 69:06that was the entire derivation. I think the
-
69:03 - 69:06question you had is about how to fit parameters.
-
69:06 - 69:10So let's say you have a machine-learning problem, and
-
69:10 - 69:13Y takes on one of K classes.
-
69:13 - 69:17What you do is you sit down and say, "Okay, I wanna model Y as being
-
69:17 - 69:19multinomial
-
69:19 - 69:24given X and theta." And so you chose the multinomial as the exponential family. Then you sort
-
69:24 - 69:27of turn the crank. And everything else I wrote down follows
-
69:27 - 69:30automatically from having made the choice
-
69:30 - 69:35of using the multinomial distribution as your choice of exponential family.
-
69:35 - 69:38And then what you do is you then have this training set, X, I,
-
69:38 - 69:39Y,
-
69:39 - 69:41I up to X, M,
-
69:41 - 69:44Y, M. So
-
69:44 - 69:45you're
-
69:45 - 69:49given the training set, where now the value of Y takes on one
-
69:49 - 69:51of K possible values.
-
69:51 - 69:55And what you do is you then
-
69:55 - 69:59find the parameters of the model by maximum likelihood. So you write down the likelihood
-
69:59 - 70:03of the parameters, and you maximize the likelihood.
-
70:03 - 70:20So what's the likelihood? Well, the likelihood, as usual, is the product over your training set of p(y(i) | x(i); theta). That's the likelihood, same as we had before.
-
70:20 - 70:47And that's the product over your training set (let me write these down now) of phi_1 raised to the indicator {y(i) = 1}, times phi_2 raised to the indicator {y(i) = 2}, dot, dot, dot, up to phi_K raised to the indicator {y(i) = K}.
-
70:47 - 71:20Where, for example, phi_1 depends on theta through this formula: it is e^(theta_1 transpose x) over 1 plus the sum over j from 1 to K minus 1 of e^(theta_j transpose x), the formula I had just now. And so phi_1 here is really a shorthand for this formula, and similarly for phi_2 and so on, up to phi_K, where phi_K is one minus all of these things.
-
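As a sketch of this likelihood in code (my own illustration, not the lecture's; the names are hypothetical), note that because exactly one indicator is nonzero per training example, the log-likelihood reduces to summing log phi_{y(i)} over the training set:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Log-likelihood of softmax regression parameters.

    theta: (K-1, n+1) parameter vectors; class K is the reference class.
    X:     (m, n+1) design matrix whose rows are the x(i), intercept included.
    y:     (m,) integer labels taking values in {1, ..., K}.

    Since the indicator exponents pick out exactly one phi per example,
    log L(theta) = sum over i of log phi_{y(i)}(x(i)).
    """
    ll = 0.0
    for xi, yi in zip(X, y):
        scores = np.append(theta @ xi, 0.0)   # implicit score 0 for class K
        scores -= scores.max()                # numerical stability
        phis = np.exp(scores) / np.exp(scores).sum()
        ll += np.log(phis[yi - 1])            # labels are 1-indexed
    return ll
```

This is the quantity whose log you would then differentiate and maximize.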
71:20 - 71:41So this formula looks more complicated than it really is. What you really do is you write this down, then you take logs, compute the derivative of this formula with respect to theta, and apply, say, gradient ascent to maximize the likelihood. What are the rows of theta? [Inaudible] it's just been a vector, right? And now it looks
-
71:41 - 71:45right? And now it looks
-
71:45 - 71:49like it's two-dimensional. Yeah. In the notation of the [inaudible] I think have theta one
-
71:49 - 71:50through
-
71:50 - 71:53theta
-
71:53 - 71:58K minus one. I've been thinking of each of these as -
-
71:58 - 71:59and N
-
71:59 - 72:01plus one-dimensional vector.
-
72:01 - 72:04If X is N plus one-dimensional,
-
72:04 - 72:05then I've been - see, I think if
-
72:05 - 72:09you have a set of parameters comprising K minus one vectors,
-
72:09 - 72:13and each of these is a - you could group all of these together and make these, but I
-
72:13 - 72:16just haven't been doing that. [Inaudible] the derivative
-
72:16 - 72:23of K minus one parameter vectors. [Inaudible], what do they correspond to?
-
72:23 - 72:25[Inaudible].
-
72:25 - 72:43We're sort of out of time, so let me take that offline. It's hard to answer, in the same way that it's hard to answer what theta corresponds to in logistic regression - you can sort of answer that as - Yeah. It's kind of like the [inaudible] feature - Yeah. Sort of a similar interpretation,
-
72:43 - 72:50yeah. That's good. I think I'm running a little bit late. Why don't we officially close for the day, but you can come up if you have more questions and take them offline. Thanks.
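The whole fitting procedure described above - write down the likelihood, take logs, differentiate, and run gradient ascent - can be sketched end to end in Python. This is my own minimal illustration under the lecture's setup (labels in {1, ..., K}, class K as the reference class), not code from the course; the gradient used, the sum over i of (1{y(i) = j} - phi_j(x(i))) x(i), is the standard derivative of the softmax log-likelihood:

```python
import numpy as np

def fit_softmax(X, y, K, lr=0.5, iters=500):
    """Fit softmax regression by batch gradient ascent on the log-likelihood.

    X: (m, n+1) design matrix, intercept term included.
    y: (m,) labels in {1, ..., K}.
    Returns theta of shape (K-1, n+1); class K is the reference class.
    """
    m, d = X.shape
    theta = np.zeros((K - 1, d))
    for _ in range(iters):
        scores = np.hstack([X @ theta.T, np.zeros((m, 1))])  # score 0 for class K
        scores -= scores.max(axis=1, keepdims=True)          # numerical stability
        phis = np.exp(scores)
        phis /= phis.sum(axis=1, keepdims=True)              # (m, K) probabilities
        onehot = np.eye(K)[y - 1]                            # indicators 1{y(i) = j}
        grad = (onehot - phis)[:, :K - 1].T @ X              # d(log L)/d(theta)
        theta += lr * grad / m                               # gradient ascent step
    return theta
```

Newton's method, discussed earlier in the lecture, could replace the gradient step here and would typically converge in far fewer iterations.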
- Title:
- Lecture 4 | Machine Learning (Stanford)
- Description:
-
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on Newton's method, exponential families, and generalized linear models and how they relate to machine learning.
This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599
CS 229 Course Website:
http://www.stanford.edu/class/cs229/
Stanford University:
http://www.stanford.edu/
Stanford University Channel on YouTube:
http://www.youtube.com/stanford
- Video Language:
- English
- Duration:
- 01:13:07