This presentation is delivered by the Stanford Center for Professional Development.

Okay, so welcome back. What I want to do today is talk about Newton's method, an algorithm for fitting models like logistic regression, and then we'll talk about exponential family distributions and generalized linear models. It's a very nice class of ideas that ties together the logistic regression and ordinary least squares models that we've already seen. So hopefully I'll get to that today.

Throughout the previous lecture and this lecture, we're starting to use increasingly large amounts of material on probability. So if you'd like a refresher on the foundations of probability - if you're not sure whether you have the prerequisites for this class in terms of a background in probability and statistics - then the discussion section taught this week by the TAs will review probability. At the same discussion section the TAs will also briefly go over Matlab and Octave notation, which you'll need for your problem sets. So if any of you want a review of the probability and statistics prerequisites, or a short tutorial on Matlab and Octave, please come to the next discussion section.

All right. So just to recap briefly, towards the end of the last lecture I talked about the logistic regression model, which is an algorithm for classification. We had that the probability of y equals one given x, parameterized by theta, under this model, is one over one plus e to the minus theta transpose x. And then you can write down the log-likelihood of theta given the training set. By taking derivatives of this, you can derive a gradient ascent rule for finding the maximum likelihood estimate of the parameter theta for this logistic regression model. So last time I wrote down the learning rule for batch gradient ascent, but the version of stochastic gradient ascent, where you look at just one training example at a time, would be like this, okay.
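As a side note for these notes, here is a minimal numpy sketch of that stochastic gradient ascent update on a single training example, assuming the sigmoid hypothesis from last lecture and a learning rate alpha; the variable names are just illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_step(theta, x_i, y_i, alpha=0.1):
    """One stochastic gradient ascent step on a single example (x_i, y_i):
    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij, for every j."""
    h = sigmoid(theta @ x_i)               # h_theta(x_i) = 1 / (1 + exp(-theta^T x_i))
    return theta + alpha * (y_i - h) * x_i
```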
So last time I wrote down batch gradient ascent; this is stochastic gradient ascent. So if you want to fit a logistic regression model, meaning find the value of theta that maximizes this log-likelihood, gradient ascent - stochastic or batch - is a perfectly fine algorithm to use. But what I want to do is talk about a different algorithm for fitting models like logistic regression, an algorithm that will often run much faster than gradient ascent. This algorithm is called Newton's method.

To describe Newton's method, let me ask you to consider a different problem first, which is this: let's say you have a function f of theta, and you want to find the value of theta so that f of theta is equal to zero. We'll start with that, and then we'll slowly change it until it becomes an algorithm for fitting maximum likelihood models like logistic regression.

Okay, so let's say that's my function f. This is my horizontal axis of theta, and this is a plot of f of theta, and we're really trying to find the value of theta at which f of theta is equal to zero - where the curve crosses the horizontal axis.

So here's what Newton's method does. I'm going to initialize theta to some value; call it theta superscript zero. Then we're going to evaluate the function f at that value of theta, compute the derivative of f there, and use the linear approximation to the function f at that value of theta. In particular, I'm going to take the tangent to my function - hopefully this drawing works out nicely - the tangent to my function at that point theta zero, and I'm going to extend this tangent line down until it intercepts the horizontal axis. Whatever value that is, I'm going to call theta one, okay. And so that's one iteration of Newton's method.
And then what I'll do is the same thing with this new point: take the tangent down here, and that's two iterations of the algorithm. And then just keep going - that's theta three, and so on, okay.

So let's just go ahead and write down what this algorithm actually does. To go from theta zero to theta one, let me call this horizontal length capital delta. So if you remember the definition of a derivative, the derivative of f evaluated at theta zero - in other words, the slope of this tangent line - is by definition equal to this vertical height divided by this horizontal length; the slope of the function is the ratio between the height and the width of this triangle. So the slope is just equal to f of theta zero divided by delta, which implies that delta is equal to f of theta zero divided by f prime of theta zero, okay. And so theta one is therefore theta zero minus capital delta, which is therefore just theta zero minus f of theta zero over f prime of theta zero, all right. And more generally, one iteration of Newton's method proceeds as

theta^(t+1) = theta^(t) - f(theta^(t)) / f'(theta^(t)).

So that's one iteration of Newton's method.

Now, this is an algorithm for finding a value of theta for which f of theta equals zero. So how do we apply the same idea to maximizing the log-likelihood? We have a function l of theta, and we want to maximize this function. Well, how do you maximize a function? You set the derivative to zero. So we want a theta such that l prime of theta is equal to zero - to maximize this function we want to find the place where the derivative of the function is equal to zero - and so we just apply the same idea.
So we get

theta^(t+1) = theta^(t) - l'(theta^(t)) / l''(theta^(t)),

okay. Because to maximize this function, we just let f be equal to l prime, the derivative of l, and then we want to find the value of theta for which the derivative of l is zero, which must therefore be a local optimum.

Does this make sense? Any questions about this?

[Student asks, roughly, whether this is always guaranteed to work.] The answer to that is fairly complicated. There are conditions on f that would guarantee that this will work. They're fairly complicated, and more complex than I want to go into now. In practice, this works very well for logistic regression, and for the generalized linear models I'll talk about later.

[Student asks about how to initialize theta.] Yeah, it usually doesn't matter. When I implement this, I usually just initialize theta zero to zero - just initialize the parameters to the vector of all zeros - and usually this works fine. It's usually not a huge deal how you initialize theta.

[Student asks whether these algorithms converge to different answers, or just converge differently.] Let me say some things about that that will sort of answer it. All of these algorithms tend not to have convergence problems - all of these algorithms will generally converge, unless you choose too large a learning rate for gradient ascent or something. But the speeds of convergence of these algorithms are very different. So it turns out that Newton's method is an algorithm that enjoys extremely fast convergence. The technical term is that it enjoys a property called quadratic convergence. Don't worry if you don't know what that means, but stated informally, it means that asymptotically, every iteration of Newton's method will double the number of significant digits that your solution is accurate to - up to constant factors. Suppose that on a certain iteration your solution is within 0.01 of the optimum, so you have 0.01 error.
Then after one iteration, your error will be on the order of 0.0001, and after another iteration, your error will be on the order of 0.00000001. So this is called quadratic convergence because you essentially get to square the error on every iteration of Newton's method. Now, this is an asymptotic result that holds only when you are pretty close to the optimum anyway, so it's a theoretical result that, because of constant factors and so on, may paint a slightly rosier picture than is accurate. But the fact is, when I implement Newton's method for logistic regression, it usually converges in a dozen iterations or so for most reasonably sized problems with tens or hundreds of features.

So one thing I should talk about: what I wrote down over there was actually Newton's method for the case of theta being a single real number. The generalization of Newton's method to when theta is a vector rather than just a real number is the following. It's no longer literally the first derivative divided by the second derivative; the appropriate generalization is

theta^(t+1) = theta^(t) - H^(-1) * gradient of l(theta),

where the gradient is the usual gradient of your objective, and H is a matrix called the Hessian, which is just the matrix of second derivatives, where H_ij equals the second partial derivative of l with respect to theta_i and theta_j.

So, just to compare: before you had the first derivative divided by the second derivative; now you have a vector of first derivatives times the inverse of the matrix of second derivatives. So this is just the same thing generalized to multiple dimensions.

So for logistic regression, again, for a reasonable number of features and training examples, when I run this algorithm you usually see convergence in anywhere from a handful to a dozen or so iterations. Compared to gradient ascent, this usually means far fewer iterations to converge.
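To make the vector version concrete, here is a minimal numpy sketch of Newton's method applied to the logistic regression log-likelihood - a sketch under the assumptions above, not a tuned implementation. The gradient X^T (y - h) and Hessian -X^T diag(h(1-h)) X are the standard expressions for this model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Fit logistic regression by Newton's method, maximizing the log-likelihood.
    X: (m, n) design matrix (include a column of ones yourself if you want an intercept).
    y: (m,) vector of 0/1 labels."""
    m, n = X.shape
    theta = np.zeros(n)                          # initialize parameters to all zeros, as in lecture
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                   # hypotheses h_theta(x^(i))
        grad = X.T @ (y - h)                     # gradient of the log-likelihood
        R = h * (1 - h)                          # diagonal weights
        H = -(X * R[:, None]).T @ X              # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad) # theta := theta - H^{-1} * gradient
    return theta
```

Solving the linear system with the Hessian on every iteration is exactly the step whose cost comes up next.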
Compared to gradient ascent - say batch gradient ascent - the disadvantage of Newton's method is that on every iteration you need to invert the Hessian. The Hessian will be an n-by-n matrix, or an (n+1)-by-(n+1)-dimensional matrix, where n is the number of features. And so if you have a large number of features in your learning problem - if you have tens of thousands of features - then inverting H could be a somewhat computationally expensive step. But for smaller, more reasonable numbers of features, this is usually very fast.

One other thing: I wrote down this algorithm to find the maximum likelihood estimate of the parameters for logistic regression - I wrote this down for maximizing a function. So I'll leave you to think about this yourself: if I wanted to use Newton's method to minimize a function, how does the algorithm change? In other words, not maximization - how does the algorithm change if you want to use it for minimization? Actually, the answer is that it doesn't change. I'll leave you to work out why yourself, okay. All right.

Let's talk about generalized linear models. Let me just give a recap of both of the algorithms we've talked about so far. We've talked about two different algorithms for modeling p of y given x, parameterized by theta. In one of them, y was a real number, and we assumed that y given x has a Gaussian distribution; then we got linear regression - ordinary least squares. In the other case, we saw a classification problem, where y took on a value of either zero or one. In that case, well, what's the most natural distribution over zeros and ones? It's the Bernoulli.
The Bernoulli distribution models random variables with two values, and in that case we got logistic regression.

So along the way, some of the questions that came up were: for logistic regression, where on earth did I get the sigmoid function from? There were other choices of function you could have used - so where did this particular function come from? And there are other functions I could have plugged in, but the sigmoid function turns out to be a natural default choice that led us to logistic regression. And what I want to do now is take both of these algorithms and show that they are special cases of a broader class of algorithms called generalized linear models, and as part of this, the sigmoid function will fall out very naturally as well.

So, let's see - just looking for a longer piece of chalk. I should warn you, the ideas in generalized linear models are somewhat complex, so what I'm going to do today is try to point out the key ideas and give you the gist of the entire story, and then some of the details in the math and the derivations I'll leave you to work through by yourselves in the lecture notes, which are posted online.

So let's look at these two distributions, the Bernoulli and the Gaussian. Suppose we have data that is zero-one valued, and we want to model it with a Bernoulli random variable parameterized by phi. So the Bernoulli distribution has the probability of y equals one, which just equals phi - the parameter phi in the Bernoulli specifies the probability of y being one. Now, as you vary the parameter phi, you get different Bernoulli distributions; as you vary the value of phi you get different probability distributions on y that have different probabilities of being equal to one. And so I want you to think of this not as one fixed distribution, but as a set - a class of distributions - that you get as you vary phi.
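Written out for reference, the Bernoulli family just described is (this compact form reappears a little later in the lecture):

```latex
p(y;\phi) = \phi^{\,y}(1-\phi)^{1-y}, \qquad y \in \{0,1\}, \qquad p(y{=}1;\phi) = \phi
```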
And in the same way, if you consider a Gaussian distribution, as you vary mu you would get different Gaussian distributions. So think of this again as a class, a set of distributions.

And what I want to do now is show that both of these are special cases of a class of distributions called the exponential family distributions. In particular, we'll say that a class of distributions - like the class of Bernoulli distributions that you get as you vary phi - is in the exponential family if it can be written in the following form: p of y, parameterized by eta, is equal to b of y times the exponential of eta transpose T of y, minus a of eta, okay. Let me just give some of these terms names, and then I'll say a bit more about what this means.

So eta is called the natural parameter of the distribution, and T of y is called the sufficient statistic. Usually, for many of the examples we'll see, including the Bernoulli and the Gaussian, T of y is just equal to y. So for most of this lecture you can mentally replace T of y with y, although this won't be true for the very final example we do today; but mentally, think of T of y as equal to y.

And so, for a given choice of these functions a, b and T - we're going to fix the forms of the functions a, b and T - this formula defines, again, a set of distributions. It defines the class of distributions that is now parameterized by eta. So again: write down specific formulas for a, b and T - specific choices of a, b and T - and then as I vary eta, I get different distributions. And I'm going to show that the Bernoulli and the Gaussian are special cases of exponential family distributions. By that I mean that I can choose specific functions a, b and T so that this formula becomes exactly the formula for the distributions of either a Bernoulli or a Gaussian.
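For reference, here is that definition displayed. The names for eta and T(y) are the lecturer's; calling a(eta) the log partition function and b(y) the base measure is standard usage, not terminology introduced in this lecture:

```latex
p(y;\eta) = b(y)\,\exp\!\big(\eta^{\top} T(y) - a(\eta)\big)
```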
And then again, as I vary eta, I'll get Bernoulli distributions with different means, or as I vary eta, I'll get Gaussian distributions with different means, for my fixed choices of a, b and T.

And for those of you that know what a sufficient statistic in statistics is, T of y actually is a sufficient statistic in the formal sense for these probability distributions - you may have seen that in a statistics class. If you don't know what a sufficient statistic is, don't worry about it; we don't need that property today. Okay.

So - oh, one last comment. Often T of y is equal to y, and in many of these cases eta is also just a real number. So in many cases, the parameter of this distribution is just a real number, and eta transpose T of y is just a product of real numbers. That will be true for our first two examples, but not for the last example I'll do today.

So now we'll show that the Bernoulli and the Gaussian are examples of exponential family distributions. We'll start with the Bernoulli. So the Bernoulli distribution with parameter phi - I guess I wrote this down already - has probability of y equals one, parameterized by phi, equal to phi. So the parameter phi specifies the probability that y equals one. And so my goal now is to choose a, b and T so that my formula for the exponential family becomes identical to my formula for the distribution of a Bernoulli.

So the probability of y parameterized by phi is equal to that, all right. And you already saw a similar exponential notation when we talked about logistic regression. The probability of y being one is phi, the probability of y being zero is one minus phi, so we can write this compactly as phi to the y times one minus phi to the one minus y. So I'm going to take the exponential of the log of this - exponentiation and taking the log cancel each other out - and this is equal to e to the y log phi plus (one minus y) log(one minus phi), which rearranges to

exp( log(phi / (1 - phi)) * y + log(1 - phi) ).
And so, comparing with the exponential family form, log(phi / (1 - phi)) is to be eta, y here is T of y, and this last term, log(1 - phi), will be minus a of eta. And then b of y is just one, so b of y doesn't matter. Just take a second to look through this and make sure it makes sense. I'll clean another board while you do that.

So now let's write down a few more things. Just copying from the previous board, we had that eta was equal to log of phi over one minus phi. So if I take this formula and invert it - if you solve for phi as a function of eta - you find that phi is one over one plus e to the minus eta. And so somehow the logistic function magically falls out of this. We'll take this even further later.

Again, copying definitions from the previous board, a of eta, I said, is minus log of one minus phi. So again, phi and eta are functions of each other - eta depends on phi, and phi depends on eta. So if I plug this definition for phi into that, I'll find that a of eta is therefore equal to log of one plus e to the eta. And again, this is just algebra; it's not terribly interesting. And just to complete the rest of this, T of y is equal to y, and b of y is equal to one, okay.

So just to recap what we've done: we've come up with a certain choice of functions a, T and b, so that my formula for the exponential family distribution now becomes exactly the formula for the probability mass function of the Bernoulli distribution. And the natural parameter eta has a certain relationship to the original parameter phi of the Bernoulli.
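Collected in one place, the choices just derived for the Bernoulli are:

```latex
\eta = \log\frac{\phi}{1-\phi}, \qquad
\phi = \frac{1}{1+e^{-\eta}}, \qquad
T(y) = y, \qquad
a(\eta) = -\log(1-\phi) = \log\!\big(1+e^{\eta}\big), \qquad
b(y) = 1
```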
Question? [Student asks about the second-to-last line of the derivation.] Let's see. Yeah, so if you expand this term out, (one minus y) times log(one minus phi), then one times log(one minus phi) becomes this term, and the other term is minus y times log(one minus phi). And the minus of a log is the log of one over the thing, so minus y times log(one minus phi) becomes y times log of one over one minus phi. Does that make sense? Yeah, cool. Anything else? Yes?

[Student: eta is a scalar, isn't it? But up there it's written eta transpose, so it could be a vector.] Yes. So, let's see - in this and the next example, eta will turn out to be a scalar. And if eta is a scalar and T of y is a scalar, then eta transpose T of y is just a real number times a real number - like a one-dimensional vector transposed times a one-dimensional vector. Towards the end of today's lecture we'll go through one example where both of these are vectors, but for most distributions these will turn out to be scalars.

[Student asks whether this form is a valid distribution, since it should put probability only on y equals zero and one.] I see. So, yeah - for this, let's imagine that we're restricting the domain of the input of the function to be y equals zero or one. Think of that as an implicit constraint: this is a probability mass function for y equals zero or y equals one. So write down y in the set zero, one; let's think of that as implicit. Cool.

So this takes the Bernoulli distribution and writes it in the form of an exponential family distribution. Let me do that very quickly for the Gaussian. I won't do the algebra for the Gaussian; I'll basically just write out the answers.
So, take a normal distribution with mean mu and variance sigma squared. And you remember - it was the previous lecture - when we were deriving the maximum likelihood estimate for the parameters of ordinary least squares, we showed that the value of sigma squared didn't matter: when we derived the model, no matter what sigma squared was, we ended up with the same value of the parameters. So for the purposes of today's lecture, to not have to take sigma squared into account, I'm just going to set sigma squared to be equal to one, okay, so as to not worry about it. The lecture notes talk a little bit more about the general case, but just to make the derivation in class a bit easier and simpler today, let's just say that sigma squared equals one; sigma squared is essentially just a scaling factor on the variable y.

So in that case, the Gaussian density is given by one over root two pi times e to the minus one-half (y minus mu) squared. And by a couple of steps of algebra, which I'm not going to do here but which are written out in the lecture notes that you can download, this is one over root two pi, e to the minus one-half y squared, times e to the mu y minus one-half mu squared, okay. So I'm just not doing the algebra. And so that first factor is b of y, we have eta equal to mu, T of y equals y, and a of eta is equal to one-half mu squared; and because eta is equal to mu, this is just one-half eta squared, okay.
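Displayed, the decomposition just described (with sigma squared set to one) is:

```latex
p(y;\mu) = \underbrace{\frac{1}{\sqrt{2\pi}}\,e^{-y^{2}/2}}_{b(y)}\;
           \exp\!\Big(\underbrace{\mu}_{\eta}\,\underbrace{y}_{T(y)}
           \;-\; \underbrace{\tfrac{1}{2}\mu^{2}}_{a(\eta)}\Big),
\qquad \eta = \mu, \qquad a(\eta) = \tfrac{1}{2}\eta^{2}
```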
And so this would be a specific choice again of a, b and T that expresses the Gaussian density in the form of an exponential family distribution. And in this case, the relationship between eta and mu is that eta is just equal to mu, so the mean of the Gaussian is just equal to the natural parameter of the exponential family distribution. All right.

And so, a result you may have seen if you've taken an undergrad statistics class: it turns out that most of the "textbook distributions" - not all, but most of them - can be written in the form of an exponential family distribution. So you saw the Gaussian, the normal distribution. It turns out the multivariate normal distribution, which is the generalization of Gaussian random variables to high-dimensional vectors, is also in the exponential family. You saw the Bernoulli is an exponential family; it turns out the multinomial distribution is too. So the Bernoulli models outcomes over zero and one - coin tosses with two outcomes - and the multinomial models outcomes over K possible values; that's also an exponential family distribution. You may have heard of the Poisson distribution. The Poisson distribution is often used for modeling counts - things like the number of radioactive decays in a sample, or the number of customers arriving at your website, or the number of visitors arriving at a store. The Poisson distribution is also in the exponential family. So are the gamma and the exponential distributions, if you've heard of them. The gamma and the exponential distributions are distributions over the positive numbers, so they're often used to model intervals: if you're standing at the bus stop and you want to ask, "When is the next bus likely to arrive? How long do I have to wait for my bus to arrive?", you'd often model that with a gamma distribution or an exponential distribution. Those are also in the exponential family.
Even more exotic distributions, like the beta and the Dirichlet distributions - these are probability distributions over fractions, or really probability distributions over probability distributions - and also things like the Wishart distribution, which is a distribution over covariance matrices. So all of these, it turns out, can be written in the form of exponential family distributions. And in the problem set we ask you to take one of these distributions, write it in the form of an exponential family distribution, and derive a generalized linear model for it, okay.

Which brings me to the next topic: having chosen an exponential family distribution, how do you use it to derive a generalized linear model?

So, generalized linear models are often abbreviated GLMs. And I'm going to write down three assumptions. You can think of them as assumptions, or you can think of them as design choices, that will then allow me to sort of turn a crank and come up with a generalized linear model.

So the first one is: I'm going to assume that, given my input x and my parameters theta, the variable y - the output, or the response variable, that I'm trying to predict - is distributed exponential family with some natural parameter eta. And so this means that there is some specific choice of those functions a, b and T so that the conditional distribution of y given x, parameterized by theta, is exponential family with parameter eta, where eta may depend on x in some way. So for example, if you're trying to predict how many customers have arrived at your website, you may choose to model the number of people - the number of hits on your website - with a Poisson distribution, since the Poisson distribution is natural for modeling count data. And so you may choose the exponential family distribution here to be the Poisson distribution.
The second assumption is that, given x, our goal is to output the expected value of T of y given x. So given the features in the website example - say features about whether there were any promotions, whether there were sales, how many people linked to your website, or whatever - I'm going to assume that the goal in our prediction problem is to estimate the expected number of people that will arrive at your website on a given day. So in other words, you're saying that we want h of x to be equal to - oh, excuse me, I actually meant to write T of y here - my goal is to get my learning algorithm's hypothesis to output the expected value of T of y given x. But again, for most of the examples, T of y is just equal to y, and so for most of the examples, our goal is to get our learning algorithm to output the expected value of y given x, because T of y is usually equal to y. Yes? [Student asks whether this is the same T as before.] Yes, same thing, right - T of y, the sufficient statistic, the same T of y.

And lastly, this last one I wrote down - these are assumptions, but this last one you might want to think of more as a design choice. We've assumed that the distribution of y given x is an exponential family distribution with some parameter eta - so the number of visitors to the website on any given day would be Poisson with some parameter, say. And the last decision I need to make is what the relationship is between my input features and this parameter eta parameterizing my Poisson distribution or whatever. And for this last step, I'm going to make the assumption, or really the design choice, that the relationship between eta and my inputs x is linear, and in particular that eta is equal to theta transpose x.
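Collected in one place, the three assumptions, or design choices, just stated are:

```latex
\begin{aligned}
&1.\ \ y \mid x;\theta \;\sim\; \mathrm{ExponentialFamily}(\eta)\\
&2.\ \ h(x) = \mathbb{E}\big[\,T(y)\mid x\,\big] \qquad (\text{usually } T(y)=y,\ \text{so } h(x)=\mathbb{E}[y\mid x])\\
&3.\ \ \eta = \theta^{\top}x \qquad (\text{or } \eta_i = \theta_i^{\top}x \text{ when } \eta \text{ is a vector})
\end{aligned}
```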
And the reason I make this design choice is that it will allow me to turn the crank of the generalized linear model machinery and come up with very nice algorithms for fitting, say, Poisson regression models, or for performing regression with gamma-distributed outputs or exponential-distributed outputs, and so on.

So let's work through an example. One note first: eta equals theta transpose x works for the case where eta is a real number. For the more general case, you would have eta_i equals theta_i transpose x, if eta is a vector rather than a real number. But again, in most of the examples, eta will just be a real number. All right.

So let's work through the Bernoulli example. We assume that y given x, parameterized by theta, is distributed exponential family with natural parameter eta, and for the Bernoulli distribution I'm going to choose a, b and T to be the specific forms that cause that exponential family to become the Bernoulli distribution. This is the example we worked through just now - the first example we worked through just now.

And so, for any fixed value of x and theta, my hypothesis - my learning algorithm - will make a prediction, will output h of x, which, by my assumption, is the expected value of y given x, parameterized by theta. Now, when y can take on only the values zero and one, the expected value of y is just equal to the probability that y is equal to one. So the expected value of a Bernoulli variable is just equal to the probability that it's equal to one. And so the probability that y equals one is just equal to phi, because phi is, by definition, the parameter of my Bernoulli distribution - the probability of the Bernoulli taking the value one.
Which we worked out previously: phi was one over one plus e to the negative eta. We worked this out on the previous board - when we wrote down the Bernoulli distribution in the form of an exponential family, we worked out what the relationship was between phi and eta, and it was this. So we've worked out the relationship between the expected value of y and eta. And lastly, because we made the design choice, or the assumption, that eta and theta are linearly related, this is therefore equal to one over one plus e to the minus theta transpose x.

And so that's how I come up with the logistic regression algorithm: when you have a response variable y that takes on two values, and then you choose to model it with a Bernoulli distribution. Are you sure this makes sense? Raise your hand if this makes sense. Yeah, okay, cool. So I hope you see the ease of use of this, or sort of the power of this. The only decision I made was really this: let's say I have a new machine learning problem and I'm trying to predict the value of a variable y that happens to take on two values. Then the only decision I need to make is to choose the Bernoulli distribution. I say I want to assume that, given x and theta, y is distributed Bernoulli. That's the only decision I made, and then everything else follows automatically from having made the decision to model y given x, parameterized by theta, as Bernoulli.

In the same way you can choose a different distribution - you can choose y as Poisson, or y as gamma, or y as whatever - and follow a similar process and come up with a different model and a different learning algorithm; come up with a different generalized linear model for whatever learning problem you're faced with.
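To illustrate turning that same crank with a different distribution, here is a minimal numpy sketch of Poisson regression under this recipe - not something written out in this lecture, just a sketch assuming the Poisson's canonical response E[y | x] = exp(theta transpose x) and fitting theta by gradient ascent on the log-likelihood:

```python
import numpy as np

def fit_poisson_regression(X, y, alpha=0.01, num_iters=2000):
    """GLM with y | x; theta ~ Poisson(lambda), where lambda = exp(theta^T x).
    The gradient of the log-likelihood has the familiar (y - h) * x form."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = np.exp(X @ theta)            # predicted expected counts, E[y | x; theta]
        grad = X.T @ (y - h) / m         # average gradient of the Poisson log-likelihood
        theta = theta + alpha * grad     # gradient ascent step
    return theta
```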
One tiny little piece of notation: the function g that relates the natural parameter eta to the expected value of y - which in this case is one over one plus e to the minus eta - is called the canonical response function. And g inverse is called the canonical link function. These aren't a huge deal; I won't use this terminology a lot. I'm just mentioning them in case you hear people talk about generalized linear models and they mention canonical response functions or canonical link functions, just so you know what that is. Actually, many texts use the names the other way around - they call this g inverse and that g - but this notation turns out to be more consistent with other algorithms in machine learning, so I'm going to use this notation. But I probably won't use the terms canonical response function and canonical link function in lecture a lot; I'm not big on memorizing lots of names of things. I'm just tossing those out there in case you see them elsewhere. Okay.

You know what, I think in the interest of time, I'm going to skip over the Gaussian example. But again, just as assuming y is Bernoulli gives you logistic regression, you can do the same thing with the Gaussian distribution and end up with the ordinary least squares model. The problem with the Gaussian is that it's almost so simple that when you see it for the first time it's sometimes more confusing than the Bernoulli model, because it looks so simple it seems like it has to be more complicated. So let me just skip that and leave you to read about the Gaussian example in the lecture notes. And what I want to do is actually go through a more complex example.
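For the Bernoulli case, then, the canonical response function is the logistic function and the canonical link function is its inverse, the logit. A small illustrative sketch (Python/NumPy; the names are mine, not the lecture's):

    import numpy as np

    def g(eta):
        # Canonical response function for the Bernoulli GLM:
        # maps the natural parameter eta to E[y] = phi.
        return 1.0 / (1.0 + np.exp(-eta))

    def g_inv(phi):
        # Canonical link function (the logit): maps E[y] back to eta.
        return np.log(phi / (1.0 - phi))

    eta = 0.8
    assert np.isclose(g_inv(g(eta)), eta)   # g and g_inv are inverses of each other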
Question? [Inaudible] Okay, right - so the question is how you choose what theta will be. We'll get to that at the end. What you have there is the logistic regression model, which is a probabilistic model that assumes the probability of y given x is given by a certain form. And so what you do is you write down the log likelihood of your training set, and find the value of theta that maximizes the log likelihood of the parameters. Does that make sense? I'll say that again towards the end of today's lecture. But for logistic regression, the way you choose theta is exactly maximum likelihood, as we worked out in the previous lecture, using Newton's Method or gradient ascent or whatever. I'll try to do that again for one more example towards the end of today's lecture.
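Before moving on, here is a minimal sketch of that maximum-likelihood fitting procedure for logistic regression, using plain batch gradient ascent rather than Newton's Method (Python/NumPy; an illustration under my own naming, not code from the course materials):

    import numpy as np

    def log_likelihood(theta, X, y):
        # l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    def fit_logistic(X, y, alpha=0.1, iters=1000):
        # Batch gradient ascent on the log likelihood;
        # the gradient works out to X^T (y - h).
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X @ theta))
            theta += alpha * X.T @ (y - h)
        return theta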
So what I want to do is actually use the remaining, I don't know, 19 minutes or so of this class to go through probably the most complex example of a generalized linear model that I'll use. This one I want to go through because it's a little bit trickier than many of the other textbook examples of generalized linear models. So again, what I'm going to do is go through the derivation reasonably quickly and give you the gist of it, and if there are steps I skip or details omitted, I'll leave you to read about them more carefully in the lecture notes.

And what I want to do is talk about the multinomial. The multinomial is the distribution over k possible outcomes. Imagine you're now in a machine-learning problem where the value of y that you're trying to predict can take on k possible values, rather than only two. For example, if you want a learning algorithm to automatically sort emails into the right email folder for you, you may have a dozen email folders you want your algorithm to classify emails into. Or, rather than predicting whether a patient has a disease or does not have a disease - which would be a binary classification problem - you may think that the patient has one of k diseases, and you want a learning algorithm to figure out which of the k diseases your patient has. So there are lots of multi-class classification problems where you have more than two classes, and you model those with the multinomial.

So for logistic regression, I had pictures like these, where you have a training set and you find a decision boundary that separates the two classes. Now we're going to entertain the value we're predicting taking on multiple values, so you might have three classes, and the learning algorithm will learn some way to separate out three classes or more, rather than just two.

So let's write the multinomial in the form of an exponential family distribution. The parameters of a multinomial are phi one, phi two, up to phi k - I'll actually change this in a second - where the probability of y equals i is phi i, because there are k possible outcomes. But if I choose this as my parameterization of the multinomial, then my parameters are actually redundant, because if these are probabilities, then they have to sum to one. And therefore, for example, I can derive the last parameter, phi k, as one minus phi one minus ... minus phi k minus one. So this would be a redundant parameterization of the multinomial; it's over-parameterized. And so for the purposes of this derivation, I'm going to treat the parameters of my multinomial as phi one, phi two, up to phi k minus one, and I won't think of phi k as a parameter. So I just have k minus one parameters parameterizing my multinomial. And sometimes I write phi k in my derivations as well, and you should think of phi k as just a shorthand for one minus the rest of the parameters, okay.

So it turns out the multinomial is one of the few examples where T of y is not equal to y. In this case, y is one of k possible values, and T of y is defined as follows: T of one is going to be a vector with a one in the first position and zeros everywhere else. T of two is going to be zero, one, zero and so on. Except that these are going to be k minus one-dimensional vectors.
And so T of k minus one is going to be zero, zero, ..., zero, one. And T of k is going to be the vector of all zeros. So this is just how I'm choosing to define T of y in order to write down the multinomial in the form of an exponential family distribution. Again, these are k minus one-dimensional vectors.

So this is a good point to introduce one more useful piece of notation, which is called indicator function notation. I'm going to write a one and then curly braces, and if I write a true statement inside, then the indicator of that statement is going to be one. If I write a one and then a false statement inside, then the value of the indicator function is going to be zero. For example, if I write indicator of two equals three, that's false, and so this is equal to zero. Whereas for indicator of one plus one equals two, I wrote down a true statement inside, and so the indicator of the statement is equal to one. So the indicator function is just a very useful notation for indicating the truth or falsehood of the statement inside.

And so - actually, let's do this here. To combine both of these: T of y is a vector; y is one of k values, and so T of y is one of these k vectors. If I use T of y subscript i to denote the i-th element of the vector T of y, then the i-th element of the vector T of y is just equal to the indicator of whether y is equal to i. Take a look at this for a second and make sure you understand all that notation and why this is true. All right. Actually, raise your hand if this equation makes sense to you. Most of you, not all, okay.
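A small sketch of this choice of T(y) and the identity (T(y))_i = 1{y = i}, just as a sanity check (Python/NumPy, hypothetical helper names):

    import numpy as np

    def T(y, k):
        # T(y) is a (k-1)-dimensional vector: T(i) has a 1 in position i
        # for i = 1..k-1, and T(k) is the all-zeros vector.
        t = np.zeros(k - 1)
        if y < k:
            t[y - 1] = 1.0
        return t

    k = 4
    for y in range(1, k + 1):
        for i in range(1, k):
            # (T(y))_i equals the indicator 1{y == i}
            assert T(y, k)[i - 1] == (1.0 if y == i else 0.0)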
Just as one kind of example, suppose y is equal to one. Then T of y is equal to this vector, and therefore the first element of this vector will be one, and the rest of the elements will be equal to zero. And so - let me try that again, I'm sorry. Let's say I want to look at the i-th element of the vector T of y, and I want to know whether it is one or zero. Well, the i-th element of the vector T of y will be equal to one if, and only if, y is equal to i. Because, for example, if y is equal to one, then only the first element of this vector will be one; if y is equal to two, then only the second element of the vector will be one, and so on. So the question of whether the i-th element of this vector T of y is equal to one is answered by just asking whether y is equal to i. Okay. If you're still not quite sure why that's true, go home and think about it a bit more, and take a look at the lecture notes as well - maybe that'll help. At least for now, just take my word for it.

So let's go ahead and write out the distribution for the multinomial in exponential family form. So p of y is equal to phi one to the indicator of y equals one, times phi two to the indicator of y equals two, up to phi k to the indicator of y equals k. And again, phi k is not a parameter of the distribution; phi k is a shorthand for one minus phi one minus phi two minus the rest. And so, using the equation on the left as well, I can also write this as phi one to the T of y sub one, times phi two to the T of y sub two, dot, dot, dot, times phi k minus one to the T of y sub k minus one, times phi k to the one minus the sum of the T of y sub i's.

And it turns out - and it takes some steps of algebra that I don't have time to show - that you can simplify this into the exponential family form, where eta is a vector, a k minus one-dimensional vector.
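Here is an illustrative check that this product-of-indicators form really is just the probability of the observed outcome (Python/NumPy; the function name is made up for this note):

    import numpy as np

    def multinomial_pmf_product_form(y, phi, k):
        # p(y) = prod_i phi_i^{1{y = i}}, where phi holds phi_1..phi_{k-1}
        # and phi_k = 1 - sum of the rest.
        phi_full = np.append(phi, 1.0 - np.sum(phi))
        return np.prod([phi_full[i - 1] ** (1.0 if y == i else 0.0)
                        for i in range(1, k + 1)])

    phi = np.array([0.2, 0.3, 0.4])   # k = 4, so phi_4 = 0.1 implicitly
    k = 4
    for y in range(1, k + 1):
        # The product collapses to just phi_y, the probability of outcome y.
        expected = np.append(phi, 1.0 - phi.sum())[y - 1]
        assert np.isclose(multinomial_pmf_product_form(y, phi, k), expected)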
Deriving this takes a few steps of algebra that you can work out yourself, but I won't do here. And so, using my definition for T of y, and by choosing eta, a and b this way, I can take my multinomial distribution and write it out in the form of an exponential family distribution.

It turns out also that - let's see. For the Bernoulli, one of the things we did was we had eta as a function of phi, and then we inverted that to write out phi as a function of eta. So it turns out you can do that here as well. This defines eta as a function of the multinomial distribution's parameters phi, so you can take this relationship between eta and phi and invert it, and write out phi as a function of eta. And it turns out you get that phi i is equal to e to the eta i, divided by one plus the sum over j of e to the eta j. The way you do this is: this defines eta as a function of phi, so if you take this and solve for phi, you end up with this. Again, there are a couple of steps of algebra that I'm just not showing.

And then lastly, using our assumption that the eta i are a linear function of the input features x, phi i is therefore equal to e to the theta i transpose x, divided by one plus the sum over j equals one to k minus one of e to the theta j transpose x. And this is just using the fact that eta i equals theta i transpose x, which was our earlier design choice from generalized linear models.

So we're just about done. My learning algorithm - I'm going to think of it as outputting the expected value of T of y, given x and parameterized by theta. So T of y was this vector of indicator functions: T one was indicator y equals one, down to indicator y equals k minus one. All right. So I want my learning algorithm to output this: the expected value of this vector of indicator functions.
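That inversion - recovering phi from eta, and then plugging in eta i equals theta i transpose x - can be sketched like this (Python/NumPy, illustrative names only):

    import numpy as np

    def phi_from_eta(eta):
        # phi_i = e^{eta_i} / (1 + sum_j e^{eta_j})  for i = 1..k-1;
        # the implicit phi_k is 1 / (1 + sum_j e^{eta_j}).
        denom = 1.0 + np.sum(np.exp(eta))
        return np.exp(eta) / denom

    def phi_from_x(Theta, x):
        # With the GLM design choice eta_i = theta_i^T x,
        # Theta is a (k-1) x (n+1) array whose rows are theta_1..theta_{k-1}.
        return phi_from_eta(Theta @ x)

    eta = np.array([1.0, -0.5, 0.2])
    phi = phi_from_eta(eta)
    print(phi, 1.0 - phi.sum())   # all k probabilities sum to one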
The expected value of indicator y equals one is just the probability that y equals one, which is given by phi one. So I have a random variable that's one whenever y is equal to one and zero otherwise, and so the expected value of that indicator is just the probability that y equals one, which is given by phi one. And therefore, by what we worked out earlier, this is e to the theta one transpose x, over one plus the sum over j of e to the theta j transpose x. And so my learning algorithm will output the probability that y equals one, y equals two, up to y equals k minus one, and these probabilities are parameterized by functions like these.

And so, just to give this algorithm a name, this algorithm is called softmax regression, and it's widely thought of as the generalization of logistic regression to the case of k classes rather than two classes.

And so, just to be very concrete about what you do: say you have a machine-learning problem and you want to apply softmax regression to it. Generally you'd work through the entire derivation, but I think the question you had is about how to fit the parameters. So let's say you have a machine-learning problem, and y takes on one of k classes. What you do is you sit down and say, "Okay, I want to model y as being multinomial given x, parameterized by theta." So you choose the multinomial as your exponential family. Then you turn the crank, and everything else I wrote down follows automatically from having made the choice of using the multinomial distribution as your choice of exponential family. And then you have your training set, x one, y one, up to x m, y m, where now the value of y takes on one of k possible values. And what you do is you then find the parameters of the model by maximum likelihood. So you write down the likelihood of the parameters, and you maximize the likelihood.
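Concretely, then, the softmax regression hypothesis outputs the vector of probabilities phi one through phi k minus one, with phi k implied. A minimal sketch (Python/NumPy; Theta here is just my own notation for stacking theta one through theta k minus one as rows):

    import numpy as np

    def softmax_hypothesis(Theta, x):
        # h_theta(x) = E[T(y) | x; theta] = [phi_1, ..., phi_{k-1}],
        # where phi_i = e^{theta_i^T x} / (1 + sum_j e^{theta_j^T x}).
        e = np.exp(Theta @ x)
        return e / (1.0 + e.sum())

    def predict(Theta, x):
        # Append the implicit phi_k and return the most probable class (1..k).
        phi = softmax_hypothesis(Theta, x)
        phi_all = np.append(phi, 1.0 - phi.sum())
        return int(np.argmax(phi_all)) + 1

    Theta = np.array([[0.1, 0.5], [-0.3, 0.2]])   # k = 3 classes, 2 input features
    x = np.array([1.0, 2.0])
    print(softmax_hypothesis(Theta, x), predict(Theta, x))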
So what's the likelihood? Well, the likelihood, as usual, is the product over your training set of p of y i given x i, parameterized by theta. That's the likelihood, same as we had before. And that's the product over your training set of - let me write these down now - phi one to the indicator y i equals one, times phi two to the indicator y i equals two, dot, dot, dot, up to phi k to the indicator y i equals k. Where, for example, phi one depends on theta through the formula from before: it is e to the theta one transpose x, over one plus the sum over j of e to the theta j transpose x. And so phi one here is really a shorthand for that formula, and similarly for phi two and so on, up to phi k, where phi k is one minus all of these things. All right.

So this formula looks more complicated than it really is. What you really do is you write this down, then you take logs, compute the derivative of this formula with respect to theta, and apply, say, gradient ascent to maximize the likelihood.
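As a rough sketch of that procedure - write down the log likelihood, differentiate, and run gradient ascent - here is one illustrative version using per-example (stochastic) updates. The gradient formula, d l / d theta i equals the sum over examples of (indicator y equals i minus phi i) times x, follows from the derivation above but isn't written out in the lecture (Python/NumPy, hypothetical names):

    import numpy as np

    def log_likelihood(Theta, X, y, k):
        # l(Theta) = sum_m log phi_{y^(m)}(x^(m)), where the phi's come from
        # the softmax formula with an implicit reference class k.
        ll = 0.0
        for x_m, y_m in zip(X, y):
            e = np.exp(Theta @ x_m)
            phi_all = np.append(e, 1.0) / (1.0 + e.sum())
            ll += np.log(phi_all[y_m - 1])
        return ll

    def fit_softmax(X, y, k, alpha=0.1, iters=500):
        # Stochastic gradient ascent: per example,
        # d l / d theta_i = (1{y = i} - phi_i) * x.
        Theta = np.zeros((k - 1, X.shape[1]))
        for _ in range(iters):
            for x_m, y_m in zip(X, y):
                e = np.exp(Theta @ x_m)
                phi = e / (1.0 + e.sum())
                ind = np.array([1.0 if y_m == i else 0.0 for i in range(1, k)])
                Theta += alpha * np.outer(ind - phi, x_m)
        return Theta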
[Student:] What are the rows of theta? It's just been a vector, right? And now it looks like it's two-dimensional.

Yeah. In the notation of the multinomial, I have theta one through theta k minus one, and I've been thinking of each of these as an n plus one-dimensional vector - if x is n plus one-dimensional, then I have a set of parameters comprising k minus one such vectors. You could group all of these together into a matrix, but I just haven't been doing that; I've been thinking of them as k minus one parameter vectors.

[Student:] [Inaudible], what do they correspond to?

We're sort of out of time, so let me take that offline. It's hard to answer, in the same way that "what does theta correspond to in logistic regression?" is hard to answer - it's kind of like the [inaudible] for each feature. Sort of a similar interpretation, yeah. That's good. I think I'm running a little bit late, so why don't we officially close for the day, but you can come up if you have more questions and I'll take them offline. Thanks.