
Lecture 4 | Machine Learning (Stanford)

  • 0:12 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development.
  • 0:24 - 0:26
    Okay, so welcome back. And
  • 0:26 - 0:31
    what I want to do today is talk about
  • 0:31 - 0:32
    Newton's method, an algorithm
  • 0:32 - 0:35
    for fitting models like logistic regression,
  • 0:35 - 0:39
    and then we'll talk about exponential family distributions and generalized linear
  • 0:39 - 0:43
    models. It's a very nice class of ideas that ties together
  • 0:43 - 0:47
    the logistic regression and the ordinary least squares models that we've seen. So hopefully I'll get to that
  • 0:47 - 0:51
    today. So
  • 0:51 - 0:55
    throughout the previous lecture and this lecture, we're starting to use increasingly
  • 0:55 - 0:58
    large amounts of material on probability.
  • 0:58 - 1:02
    So if you'd like to see a refresher on sort of the
  • 1:02 - 1:03
    foundations of probability
  • 1:03 - 1:06
    - if you're not sure if you quite had your prerequisites for this class
  • 1:06 - 1:09
    in terms of a background in probability and statistics,
  • 1:09 - 1:11
    then the discussion section
  • 1:11 - 1:16
    taught this week by the TAs will review probability.
  • 1:16 - 1:21
    In the same discussion sections, the TAs will also briefly go over
  • 1:21 - 1:26
    sort of Matlab and Octave notation, which you'll need to use for your problem sets. And so if
  • 1:26 - 1:28
    any of you want to see
  • 1:28 - 1:32
    a review of the probability and statistics prerequisites, or if you want a short tutorial on
  • 1:32 - 1:34
    Matlab and Octave,
  • 1:34 - 1:41
    please come to this - the next discussion section. All right. So
  • 1:42 - 1:45
    just to recap briefly,
  • 1:45 - 1:49
    towards the end of the last lecture I talked about the logistic regression model
  • 1:49 - 1:51
    where we had -
  • 1:51 - 1:57
    which was an algorithm for classification. We had that P of Y
  • 1:57 - 2:02
    equals one, given X, parameterized by theta,
  • 2:02 - 2:04
    under this model, all right,
  • 2:04 - 2:05
    was one over one
  • 2:05 - 2:11
    plus e to the negative theta transpose X.
  • 2:11 - 2:15
    And then you can write down the log-likelihood -
  • 2:15 - 2:22
    given the training set,
  • 2:23 - 2:30
  • 2:30 - 2:31
    which was that.
  • 2:31 - 2:32
    And
  • 2:32 - 2:36
    by taking the derivatives of this, you can derive
  • 2:36 - 2:38
    sort of a gradient ascent rule
  • 2:38 - 2:43
    for finding the maximum likelihood estimate of the parameter theta for
  • 2:43 - 2:46
    this logistic regression model. And so
  • 2:46 - 2:48
    last time I wrote down
  • 2:48 - 2:52
    the learning rule for batch gradient ascent, but the version of stochastic gradient
  • 2:52 - 2:55
    ascent where
  • 2:55 - 3:02
    you look at just one training example at a time,
  • 3:07 - 3:09
    would be like this, okay.
  • 3:09 - 3:12
    So last time I wrote down a batch gradient ascent.
    This is stochastic gradient ascent.
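The stochastic gradient ascent rule just described - h of x is one over one plus e to the negative theta transpose x, and theta_j := theta_j + alpha times (y minus h of x) times x_j, applied one training example at a time - can be sketched in a few lines. This is a minimal Python sketch; the tiny dataset and the learning rate are illustrative, not from the lecture:

```python
import math

def h(theta, x):
    """Logistic hypothesis: 1 / (1 + e^(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_gradient_ascent(data, alpha=0.1, epochs=100):
    """One parameter update per training example: theta_j += alpha * (y - h(x)) * x_j."""
    n = len(data[0][0])
    theta = [0.0] * n
    for _ in range(epochs):
        for x, y in data:
            err = y - h(theta, x)
            theta = [t + alpha * err * xi for t, xi in zip(theta, x)]
    return theta

# Illustrative data: each example is ([intercept, feature], label).
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
theta = stochastic_gradient_ascent(data)
```

Note that, unlike batch gradient ascent, the parameters move after every single example rather than after a full pass over the training set.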
  • 3:12 - 3:14
    So
  • 3:14 - 3:18
    if you want to fit a logistic regression model, meaning find the
  • 3:18 - 3:21
    value of theta that maximizes this log likelihood,
  • 3:21 - 3:25
    gradient ascent or stochastic gradient ascent or batch gradient ascent is a perfectly fine
  • 3:25 - 3:27
    algorithm to use.
  • 3:27 - 3:29
    But what I want to do is talk about
  • 3:29 - 3:30
    a different
  • 3:30 - 3:32
    algorithm
  • 3:32 - 3:33
    for fitting
  • 3:33 - 3:35
    models like logistic regression.
  • 3:35 - 3:39
    And this would be an algorithm that will, I guess, often run much faster than
  • 3:39 - 3:42
    gradient descent.
  • 3:44 - 3:49
    And this algorithm is called Newton's Method.
  • 3:49 - 3:54
    And before we describe Newton's Method, let me
  • 3:54 - 4:01
    ask you to consider a different problem first,
  • 4:01 - 4:04
    which is -
  • 4:04 - 4:05
    let's say
  • 4:05 - 4:11
    you have a function F of theta,
  • 4:11 - 4:14
    and let's say you want to find the value of theta
  • 4:14 - 4:18
    so that
  • 4:18 - 4:21
    F of theta
  • 4:21 - 4:26
    is equal to zero. Let's start with this problem, and then we'll sort of slowly
  • 4:26 - 4:28
    change it until it becomes an algorithm for
  • 4:28 - 4:35
    fitting maximum likelihood models, like logistic regression.
    So - let's see.
  • 4:52 - 4:55
    I guess that works. Okay, so let's say that's my function F.
  • 4:55 - 4:57
    This
  • 4:57 - 5:01
    is my horizontal axis of theta, plot of F of theta,
  • 5:01 - 5:03
    and so we're really trying to find the value of theta at
  • 5:03 - 5:05
    which F of theta is equal to zero. This
  • 5:05 - 5:08
    is the horizontal axis.
  • 5:08 - 5:12
    So here's the idea. I'm going to
  • 5:12 - 5:15
    initialize
  • 5:15 - 5:18
    theta as some value.
  • 5:18 - 5:22
    We'll call theta superscript zero.
  • 5:22 - 5:27
    And then here's what Newton's Method does. We're going to evaluate the function F at a
  • 5:27 - 5:30
    value of theta, and then
  • 5:30 - 5:34
    we'll compute the derivative of F, and we'll use the linear approximation to the
  • 5:34 - 5:38
    function F of that value of theta. So in particular,
  • 5:38 - 5:39
  • 5:39 - 5:43
  • 5:43 - 5:45
  • 5:45 - 5:52
    I'm going to take the tangent to my function -
  • 5:52 - 5:55
    hope that makes sense - I set up the function so this works out nicely. I'm going to take the
  • 5:55 - 5:57
  • 5:57 - 6:00
    tangent to my function at that point, theta zero,
  • 6:00 - 6:04
    and I'm going to sort of extend this tangent down until it intercepts the horizontal axis.
  • 6:04 - 6:11
    I want to see what value this is. And I'm going to call this theta one, okay.
  • 6:12 - 6:16
    And then so that's one iteration of Newton's Method. And
  • 6:16 - 6:19
    what I'll do then is the same thing with this point. Take the
  • 6:19 - 6:21
    tangent
  • 6:21 - 6:22
    down here,
  • 6:22 - 6:25
    and that's two iterations of the algorithm. And then just
  • 6:25 - 6:28
    sort of keep going, that's
  • 6:28 - 6:31
    theta three and so on, okay.
  • 6:31 - 6:32
    So
  • 6:32 - 6:38
    let's just go ahead and write down what this algorithm actually does.
  • 6:38 - 6:38
  • 6:38 - 6:41
    To go from theta zero to theta one, let me call this
  • 6:46 - 6:50
    horizontal length capital delta.
  • 6:50 - 6:52
    So if you
  • 6:52 - 6:53
    remember the
  • 6:53 - 6:55
    definition of a derivative, the
  • 6:55 - 6:59
    derivative of F evaluated at theta zero -
  • 6:59 - 7:01
    in other words, the slope of this tangent line -
  • 7:01 - 7:05
    by the definition of the derivative, is going to be equal to this vertical
  • 7:05 - 7:06
    length,
  • 7:06 - 7:07
    divided by this horizontal length. The
  • 7:07 - 7:10
    slope of this function
  • 7:10 - 7:11
    is defined as
  • 7:11 - 7:14
    the ratio between this vertical height
  • 7:14 - 7:16
    and this width of triangle.
  • 7:16 - 7:19
    So that's just equal to F of theta
  • 7:19 - 7:23
    zero,
  • 7:23 - 7:25
    divided by delta,
  • 7:25 - 7:28
    which implies that
  • 7:28 - 7:30
    delta is equal to F of
  • 7:30 - 7:32
    theta zero,
  • 7:32 - 7:34
    divided by F
  • 7:34 - 7:38
    prime of
  • 7:38 - 7:42
    theta zero,
  • 7:42 - 7:43
    okay.
  • 7:43 - 7:46
    And so theta
  • 7:46 - 7:50
    one is therefore theta zero minus delta,
  • 7:50 - 7:52
    minus capital delta,
  • 7:52 - 7:56
    which is therefore just F of theta zero over F
  • 7:56 - 7:58
    prime of theta
  • 7:58 - 8:00
    zero, all
  • 8:00 - 8:05
    right. And more generally, one iteration of Newton's Method proceeds as follows: theta T plus
  • 8:05 - 8:08
    one equals theta T
  • 8:08 - 8:11
    minus
  • 8:11 - 8:15
    F of theta T divided by F prime of theta
  • 8:15 - 8:17
    T. So that's one iteration
  • 8:17 - 8:24
    of Newton's Method.
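Written out, the one-dimensional update theta := theta - F(theta)/F'(theta) is only a few lines. This is a minimal Python sketch; the example function e^theta - 2 and the starting point are illustrative choices, not from the lecture:

```python
import math

def newtons_method(f, f_prime, theta, iterations=10):
    """Repeatedly slide down the tangent line: theta := theta - f(theta) / f'(theta)."""
    for _ in range(iterations):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Illustrative example: solve e^theta - 2 = 0, whose root is log(2),
# starting from theta = 0.
root = newtons_method(lambda t: math.exp(t) - 2.0, lambda t: math.exp(t), theta=0.0)
```

Each pass of the loop is exactly one tangent-line step of the picture drawn on the board.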
  • 8:24 - 8:28
    Now, this is an algorithm for finding a value of theta for which F of theta equals
  • 8:28 - 8:31
    zero. And so we apply the same idea
  • 8:31 - 8:33
    to
  • 8:33 - 8:37
    maximizing the log likelihood, right. So we have a function L
  • 8:37 - 8:38
    of
  • 8:38 - 8:40
    theta, and we want to maximize
  • 8:40 - 8:44
    this function. Well, how do you maximize the function? You set the derivative to zero. So we want
  • 8:44 - 8:50
    theta such that L
  • 8:50 - 8:54
    prime of theta is equal to zero, so to maximize this function we want to find the place where
  • 8:54 - 8:57
    the derivative of the function is equal to zero,
  • 8:57 - 8:58
    and so we just apply
  • 8:58 - 9:01
    the same idea.
  • 9:01 - 9:04
    So we get
  • 9:04 - 9:09
    theta T plus one equals theta T minus L
  • 9:09 - 9:11
    prime
  • 9:11 - 9:18
    of theta T over L
  • 9:18 - 9:20
    double prime
  • 9:20 - 9:23
    of
  • 9:23 - 9:28
    theta T, okay. Because to maximize this function, we just let F be equal to L prime. Let F be
  • 9:28 - 9:31
    the derivative of L,
  • 9:31 - 9:36
    and then we want to find the value of theta for which the derivative of L is
  • 9:36 - 9:36
    zero, and
  • 9:36 - 9:43
    therefore must be a local optimum.
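So for maximization, the same update is applied to the derivative: theta := theta - L'(theta)/L''(theta). A minimal Python sketch - the concave function l(theta) = log(theta) - theta, which has its maximum at theta = 1, is an illustrative choice, not from the lecture:

```python
def newton_maximize(l_prime, l_double_prime, theta, iterations=10):
    """Find a stationary point of l by running Newton's method on its derivative l'."""
    for _ in range(iterations):
        theta = theta - l_prime(theta) / l_double_prime(theta)
    return theta

# Illustrative example: l(theta) = log(theta) - theta has
# l'(theta) = 1/theta - 1 and l''(theta) = -1/theta^2,
# so the unique maximum is at theta = 1.
theta_star = newton_maximize(lambda t: 1.0 / t - 1.0,
                             lambda t: -1.0 / t ** 2,
                             theta=0.5)
```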
  • 9:48 - 9:55
    Does this make sense? Any questions about this?
  • 9:59 - 10:03
    [Inaudible] The answer to that is fairly complicated. There
  • 10:03 - 10:06
    are conditions on F that would guarantee that this will work. They are fairly complicated,
  • 10:06 - 10:11
    and this is more complex than I want to go into now.
  • 10:11 - 10:13
    In practice, this works very well for logistic regression,
  • 10:13 - 10:19
    and for the generalized linear models I'll talk about later. [Inaudible]
  • 10:19 - 10:23
    Yeah, it
  • 10:23 - 10:27
    usually doesn't matter. When I implement this, I usually just initialize theta to
  • 10:27 - 10:29
    zero -
  • 10:29 - 10:32
    just initialize the parameters
  • 10:32 - 10:36
    to all zeros - and usually this works fine. It's usually not a huge
  • 10:36 - 10:43
    deal how you initialize theta.
  • 10:44 - 10:46
    [Inaudible]
  • 10:46 - 10:51
    or is it just different convergence? Let me
  • 10:51 - 10:56
    say some things about that that'll sort of answer it. All of these
  • 10:56 - 10:57
  • 10:57 - 10:58
  • 10:58 - 11:01
    algorithms tend not to have convergence problems; all of these
  • 11:01 - 11:04
    algorithms will generally converge,
  • 11:04 - 11:07
    unless you choose too large a learning rate for
  • 11:07 - 11:09
    gradient ascent or something.
  • 11:09 - 11:11
    But the speeds of
  • 11:11 - 11:15
    convergence of these algorithms are very different.
  • 11:15 - 11:18
    So
  • 11:18 - 11:22
    it turns out that Newton's Method is an algorithm that enjoys extremely
  • 11:22 - 11:25
    fast convergence.
  • 11:25 - 11:29
    The technical term is that it enjoys a property called quadratic convergence. Don't
  • 11:29 - 11:30
    worry if you don't know what that means,
  • 11:30 - 11:31
    but just
  • 11:31 - 11:34
    stated informally, it
  • 11:34 - 11:35
    means that asymptotically
  • 11:35 - 11:42
    every iteration of Newton's Method will double the number of significant digits
  • 11:42 - 11:44
    that your solution
  • 11:44 - 11:45
    is accurate
  • 11:45 - 11:47
    to, up to constant
  • 11:47 - 11:48
    factors.
  • 11:48 - 11:49
    Suppose that
  • 11:49 - 11:51
    on a certain iteration
  • 11:51 - 11:57
    your solution is within 0.01 of the optimum, so you have
  • 11:57 - 11:59
    0.01 error. Then after one
  • 11:59 - 12:04
    iteration, your error will be on the order of 0.0001,
  • 11:59 - 12:04
    and after another iteration, your error will be on the order of 0.00000001.
  • 12:14 - 12:15
    So this is called
  • 12:15 - 12:18
    quadratic convergence because you essentially get to square the error
  • 12:18 - 12:22
    on every iteration of Newton's Method. [Inaudible]
  • 12:22 - 12:25
    this is an asymptotic result that holds only when you are pretty
  • 12:25 - 12:27
    close to the optimum anyway, so this is
  • 12:27 - 12:31
    the theoretical result that says it's true, but because of constant factors
  • 12:31 - 12:34
    and so on, may paint a slightly rosier picture than
  • 12:34 - 12:36
    might be accurate.
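The error-squaring behavior is easy to see numerically. This is a minimal Python sketch; the function theta^2 - 2, whose root is the square root of two, is an illustrative choice, not from the lecture:

```python
def newton_errors(theta, iterations=3):
    """Track |theta - sqrt(2)| while solving theta^2 - 2 = 0 by Newton's method."""
    target = 2.0 ** 0.5
    errors = []
    for _ in range(iterations):
        # Newton step: theta := theta - f(theta) / f'(theta)
        theta = theta - (theta ** 2 - 2.0) / (2.0 * theta)
        errors.append(abs(theta - target))
    return errors

# Starting from theta = 1.5, each iteration roughly squares the previous error,
# so the number of correct digits roughly doubles every step.
errors = newton_errors(1.5)
```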
  • 12:36 - 12:40
    But the fact is, when you implement - when I implement Newton's
  • 12:40 - 12:41
    Method for logistic regression,
  • 12:41 - 12:45
    it usually converges in like a dozen iterations or so for most reasonable
  • 12:45 - 12:46
    size problems of
  • 12:46 - 12:51
    tens or hundreds of features. So one
  • 12:51 - 12:52
    thing I should
  • 12:52 - 12:53
    talk about, which
  • 12:53 - 12:54
    is
  • 12:54 - 12:58
    what I wrote down over there was actually Newton's Method for the case of
  • 12:58 - 13:01
    theta being a single real number. The
  • 13:01 - 13:06
    generalization to Newton's Method for when theta is a vector rather than
  • 13:06 - 13:08
    when theta is just a real number
  • 13:08 - 13:11
    is the following,
  • 13:11 - 13:15
    which is that theta T plus one is theta T
  • 13:15 - 13:17
    plus - and then we have the
  • 13:17 - 13:19
    first derivative divided by
  • 13:19 - 13:23
    the second derivative.
  • 13:23 - 13:25
  • 13:25 - 13:30
    And the appropriate generalization is this, where
  • 13:30 - 13:33
    this is the usual gradient of
  • 13:33 - 13:37
    your objective, and
  • 13:37 - 13:44
    this H is a matrix called the Hessian,
  • 13:51 - 13:55
    which is just the matrix of second derivatives, where H-I-J
  • 13:55 - 14:02
    equals the second partial derivative of L with respect to theta I and theta J, okay.
  • 14:07 - 14:11
    So just to sort of - the
  • 14:11 - 14:16
    first derivative divided by the second derivative, now you have a vector of first derivatives
  • 14:16 - 14:17
    times
  • 14:17 - 14:21
    sort of the inverse of the matrix of second derivatives. So
  • 14:21 - 14:23
    this is sort of just the same thing
  • 14:23 - 14:28
    generalized to multiple dimensions.
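Putting this together for logistic regression: the gradient of the log-likelihood is X transpose times (y minus h), and the Hessian is minus X transpose S X where S holds the weights h times (one minus h), so each Newton step solves one small linear system. A minimal sketch using NumPy - the tiny dataset below is illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iterations=10):
    """Fit logistic regression by Newton's method: theta := theta - H^{-1} grad."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)        # gradient of the log-likelihood
        S = h * (1.0 - h)           # per-example weights h(1 - h)
        H = -(X.T * S) @ X          # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Illustrative data: a column of ones (intercept) plus one feature; the labels
# overlap so the maximum likelihood estimate exists and is finite.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = logistic_newton(X, y)
```

Solving the linear system with the Hessian, rather than explicitly inverting it, is the step that costs roughly cubic time in the number of features - which is the trade-off against gradient ascent discussed next.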
  • 14:28 - 14:33
    So for logistic regression, again, use the -
  • 14:33 - 14:35
    for a reasonable number of features
  • 14:35 - 14:40
    and training examples - when I run this algorithm, usually you see convergence
  • 14:40 - 14:41
    anywhere from sort of a handful
  • 14:41 - 14:44
    to like a dozen or so iterations.
  • 14:44 - 14:48
    Compared to gradient ascent, this
  • 14:48 - 14:53
    usually means far fewer iterations to converge.
  • 14:53 - 14:55
    Compared to gradient ascent - let's say batch gradient ascent -
  • 14:55 - 14:59
    the disadvantage of Newton's Method is that on every iteration
  • 14:59 - 15:02
    you need to invert the Hessian.
  • 15:02 - 15:03
    So the Hessian
  • 15:03 - 15:07
    will be an N-by-N matrix, or an N plus one by N plus one-dimensional matrix if N
  • 15:07 - 15:09
    is the number of features.
  • 15:09 - 15:13
    And so if you have a large number of features in your learning problem, if you have
  • 15:13 - 15:15
    tens of thousands of features,
  • 15:15 - 15:18
    then inverting H could be a slightly computationally expensive
  • 15:18 - 15:22
    step. But for smaller, more reasonable numbers of features, this is usually a very fast algorithm. Question? [Inaudible] Let's see. I think
  • 15:22 - 15:29
    you're right.
  • 15:35 - 15:39
    That should probably be a minus.
  • 15:39 - 15:40
    Do you have [inaudible]?
  • 15:40 - 15:45
    Yeah, thanks.
  • 15:45 - 15:48
    Yeah, it's a minus. Thank you. [Inaudible] problem also.
  • 15:48 - 15:50
    I wrote down
  • 15:50 - 15:51
    this algorithm
  • 15:51 - 15:55
    to find the maximum likelihood estimate of the parameters for logistic regression.
  • 15:55 - 15:56
    I wrote this
  • 15:56 - 15:58
    down for maximizing a function.
  • 15:58 - 16:02
    So I'll leave you to think about this yourself. If I wanted to use Newton's
  • 16:02 - 16:05
    Method to minimize the function, how
  • 16:05 - 16:07
    does the algorithm change? All right.
  • 16:07 - 16:09
    So I'll leave you to think about that. So in other words,
  • 16:09 - 16:13
    instead of maximization, how does the algorithm change if you want to use it
  • 16:13 - 16:20
    for minimization?
  • 16:28 - 16:30
    Actually,
  • 16:30 - 16:32
    the answer
  • 16:32 - 16:33
    is that
  • 16:33 - 16:35
    it doesn't change. I'll
  • 16:35 - 16:40
    leave you to work that out yourself why, okay. All right.
  • 16:40 - 16:47
    Let's talk about generalized linear models.
  • 16:47 - 16:50
    Let me just say, just to give a recap of
  • 16:50 - 16:53
    both of the algorithms we've talked about so far.
  • 16:53 - 16:55
    We've talked about
  • 16:55 - 16:59
    two different algorithms for modeling P of Y given X, parameterized by
  • 16:59 - 17:01
    theta. And
  • 17:01 - 17:03
    In one of them, Y was
  • 17:03 - 17:06
    a real number,
  • 17:06 - 17:10
    and we assumed that Y given X has a Gaussian distribution; then
  • 17:10 - 17:13
    we got
  • 17:13 - 17:16
    ordinary linear regression.
  • 17:16 - 17:20
    In the other case, we saw
  • 17:20 - 17:23
    that it was a classification problem where Y took on a value of either
  • 17:23 - 17:25
    zero or one.
  • 17:25 - 17:27
  • 17:27 - 17:31
    In that case, well, what's the most natural distribution of zeros and
  • 17:31 - 17:35
    ones is the Bernoulli. The Bernoulli distribution models
  • 17:35 - 17:37
    random variables with two values,
  • 17:37 - 17:40
    and in that case we got
  • 17:40 - 17:46
    logistic regression.
  • 17:46 - 17:49
    So along the way, some of the questions that came up were - so logistic regression, where
  • 17:49 - 17:51
  • 17:51 - 17:52
    on
  • 17:52 - 17:54
    earth did I get the sigmoid function from?
  • 17:54 - 17:57
    And then so there are the choices you can use for, sort of,
  • 17:57 - 18:00
    just where did this function come from?
  • 18:00 - 18:02
    And there are other functions I could've plugged in, but
  • 18:02 - 18:06
    the sigmoid function turns out to be a natural
  • 18:06 - 18:07
    default choice
  • 18:07 - 18:09
    that led us to logistic regression.
  • 18:09 - 18:12
    And what I want to do now is
  • 18:12 - 18:15
    take both of these algorithms and
  • 18:15 - 18:19
    show that they are special cases of a class of algorithms called
  • 18:19 - 18:19
    generalized
  • 18:19 - 18:22
    linear models,
  • 18:22 - 18:24
    and it will be from this class
  • 18:24 - 18:27
    of algorithms that the sigmoid
  • 18:27 - 18:31
    function will fall out very naturally as well.
  • 18:31 - 18:33
    So, let's see -
  • 18:33 - 18:37
    just looking for a longer piece of chalk. I
  • 18:37 - 18:41
    should warn you, the ideas in generalized linear models are somewhat
  • 18:41 - 18:42
    complex, so what
  • 18:42 - 18:46
    I'm going to do today is try to point out the key ideas and give you a
  • 18:46 - 18:49
    gist of the entire story.
  • 18:49 - 18:52
    And then some of the details in the math and the derivations I'll leave you to work through
  • 18:52 - 18:58
    by yourselves in the lecture notes, which are posted
  • 18:58 - 19:05
    online.
  • 19:06 - 19:09
    So consider these two distributions, the Bernoulli and
  • 19:09 - 19:13
    the Gaussian.
  • 19:13 - 19:15
    So suppose we have data
  • 19:15 - 19:19
    that is zero-one valued, and we want to model it
  • 19:19 - 19:21
    with
  • 19:21 - 19:24
    a Bernoulli random variable
  • 19:24 - 19:26
    parameterized by
  • 19:26 - 19:31
    phi. So the Bernoulli distribution has the probability of Y equals one,
  • 19:31 - 19:35
    which just equals phi, right. So the parameter phi in the
  • 19:35 - 19:40
    Bernoulli specifies the probability of Y being one. Now,
  • 19:40 - 19:43
    as you vary the parameter phi, you get -
  • 19:43 - 19:47
    you sort of get different Bernoulli distributions. As you vary the value of
  • 19:47 - 19:51
    phi you get different probability distributions on Y
  • 19:51 - 19:53
    that have different probabilities of being equal to one.
  • 19:53 - 19:58
    And so I want you to think of this as not one fixed distribution, but as a set - there is a
  • 19:58 - 20:00
    class of distributions
  • 20:00 - 20:04
    that you get as you vary phi.
  • 20:04 - 20:09
    And in the same way, if you consider Gaussian distribution,
  • 20:09 - 20:11
    as you vary the mean mu you
  • 20:11 - 20:14
    would get different
  • 20:14 - 20:18
    Gaussian distributions. So think of this again as a class, or as a set, of
  • 20:18 - 20:20
    distributions.
  • 20:20 - 20:27
    And what I want to do now is show that
  • 20:28 - 20:32
    both of these are special cases of a class of distributions that's called the
  • 20:32 - 20:34
    exponential family distribution.
  • 20:34 - 20:38
    And in particular, we'll say that a class of distributions, like
  • 20:38 - 20:41
    the Bernoulli distributions that you get as you vary phi,
  • 20:41 - 20:45
    is in the exponential family
  • 20:45 - 20:46
    if it can be written
  • 20:46 - 20:50
    in the following form.
  • 20:50 - 20:52
    P of Y parameterized by eta is
  • 20:52 - 20:54
    equal to B of Y
  • 20:54 - 21:01
    times E to the - eta transpose T of Y, minus A of eta,
  • 21:05 - 21:06
    okay. Let me just
  • 21:06 - 21:07
    get some of these
  • 21:07 - 21:11
    terms, names, and
  • 21:11 - 21:13
    then - let me - I'll
  • 21:13 - 21:19
    say a bit more about what this means.
  • 21:19 - 21:24
    So eta is called the natural parameter of the distribution,
  • 21:24 - 21:26
    and T
  • 21:26 - 21:33
    of Y is called the sufficient statistic.
  • 21:35 - 21:37
    Usually,
  • 21:37 - 21:41
    for many of the examples we'll see, including the Bernoulli
  • 21:41 - 21:44
    and the Gaussian, T
  • 21:44 - 21:48
    of Y is just equal to Y. So for most of this lecture you can
  • 21:48 - 21:52
    mentally replace T of Y to be equal to Y, although this won't be true for the very
  • 21:52 - 21:56
    final example we do today, but mentally, you think of T of Y as equal to
  • 21:56 - 21:57
    Y.
  • 21:57 - 22:04
    And so
  • 22:08 - 22:10
    for a given choice of
  • 22:10 - 22:16
    these functions, A, B and
  • 22:16 - 22:21
    T, all right - so we're gonna sort of fix the forms of the functions A, B and T.
  • 22:21 - 22:22
    Then
  • 22:22 - 22:27
    this formula defines, again, a set of distributions. It defines the class of distributions
  • 22:27 - 22:28
    that
  • 22:28 - 22:31
    is now parameterized by
  • 22:31 - 22:32
    eta.
  • 22:32 - 22:36
    So again, let's write down specific formulas for A, B and T - specific
  • 22:36 - 22:39
    choices of A, B and T. Then
  • 22:39 - 22:43
    as I vary eta I get different distributions.
  • 22:43 - 22:45
    And
  • 22:45 - 22:46
    I'm going to show that
  • 22:46 - 22:48
    the
  • 22:48 - 22:52
    Bernoulli - I'm going to show that the Bernoulli and the Gaussians are special
  • 22:52 - 22:54
    cases
  • 22:54 - 22:56
    of exponential family distributions.
  • 22:56 - 23:01
    And by that I mean that I can choose specific functions, A, B and T,
  • 23:01 - 23:03
    so that this becomes the formula
  • 23:03 - 23:06
    of the distributions of either a Bernoulli or a Gaussian.
  • 23:06 - 23:09
    And then again, as I vary eta,
  • 23:09 - 23:12
    I'll get Bernoulli distributions with different means,
  • 23:12 - 23:13
    or as I vary eta, I'll get
  • 23:13 - 23:17
    Gaussian distributions with different means for my
  • 23:17 - 23:23
    fixed values of A, B and T.
  • 23:23 - 23:27
    And for those of you that know what a sufficient statistic in statistics is,
  • 23:27 - 23:28
  • 23:28 - 23:31
    T of Y actually is a sufficient statistic
  • 23:31 - 23:34
    in the formal sense of
  • 23:34 - 23:37
    sufficient statistic for a probability distribution. You may have seen it in a statistics
  • 23:37 - 23:38
    class.
  • 23:38 - 23:42
    If you don't know what a sufficient statistic is, don't worry about it. We
  • 23:42 - 23:49
    sort of don't need that property today. Okay.
  • 23:52 - 23:56
    So -
  • 23:56 - 23:58
    oh, one last comment.
  • 23:58 - 24:03
    Often, T of Y is equal to Y, and in many of these cases, eta is also
  • 24:03 - 24:04
    just a real number.
  • 24:04 - 24:08
    So in many cases, the parameter of this distribution is just a real number,
  • 24:08 - 24:10
    and eta transpose T of Y
  • 24:10 - 24:14
    is just a product of real numbers. So again, that will be true for our first two examples, but
  • 24:14 - 24:15
  • 24:15 - 24:21
    not for the last example I'll do today.
  • 24:21 - 24:25
    So now we'll show that the Bernoulli and the Gaussian are examples of exponential family
  • 24:25 - 24:28
    distributions. We'll start with the Bernoulli.
  • 24:28 - 24:31
    So the Bernoulli distribution with parameter phi - I guess I wrote this down
  • 24:31 - 24:33
    already.
  • 24:33 - 24:36
    P of Y equals one, parameterized by phi,
  • 24:36 - 24:37
    is equal to phi.
  • 24:37 - 24:41
    So the parameter phi specifies the probability
  • 24:41 - 24:43
    that Y equals one.
  • 24:43 - 24:45
    And so my goal now is to choose
  • 24:45 - 24:48
    T, A and B - or rather, to choose A, B and T -
  • 24:48 - 24:53
    so that my formula for the exponential family becomes identical to my formula
  • 24:53 - 25:00
    for the distribution of a Bernoulli. So
  • 25:08 - 25:15
    probability of Y parameterized by phi is equal to that, all
  • 25:17 - 25:20
    right. And you already saw sort
  • 25:20 - 25:24
    of a similar exponential notation where we talked about logistic regression. The probability of
  • 25:24 - 25:27
    Y being one is
  • 25:27 - 25:31
    phi, the probability of Y being zero is one minus phi, so we can write this compactly as phi to the
  • 25:31 - 25:38
    Y times one minus phi to the one minus Y. So I'm gonna take
  • 25:40 - 25:44
    the exponent of the log of this - exponentiation and taking logs
  • 25:44 - 25:45
    cancel each other
  • 25:45 - 25:46
  • 25:46 - 25:53
    out.
  • 25:53 - 25:55
    And this is equal to E to the
  • 25:55 - 26:02
    Y log phi, plus one minus Y times log one minus phi - which we can rewrite as E to the, log phi over one minus phi, times Y, plus log one minus phi. And so Y
  • 26:28 - 26:31
    is to be T of Y, log phi over one minus phi is to be eta,
  • 26:31 - 26:33
    and
  • 26:33 - 26:40
    this log one minus phi term will be
  • 26:42 - 26:45
    minus A of eta.
  • 26:45 - 26:52
    And then B of Y is just one, so B of Y doesn't matter. Just
  • 26:57 - 26:58
    take a second
  • 26:58 - 27:01
    to look through this and make sure it makes
  • 27:01 - 27:08
    sense. I'll clean another
  • 27:42 - 27:47
    board while you do that. So now let's write down a few more things. Just
  • 27:47 - 27:51
    copying from the previous board, we had that eta was equal to log
  • 27:51 - 27:53
    phi
  • 27:53 - 27:57
    over one minus
  • 27:57 - 28:01
    phi. So if I take this formula
  • 28:01 - 28:04
    and invert it - if you solve for
  • 28:04 - 28:06
    phi - excuse me, if you solve for
  • 28:06 - 28:10
    phi as a function of eta, since eta is really a function of
  • 28:10 - 28:11
    phi.
  • 28:11 - 28:17
    Just invert this formula. You find that phi is
  • 28:17 - 28:19
    one over one plus E to the minus eta.
  • 28:19 - 28:21
    And so
  • 28:21 - 28:24
    somehow the logistic function magically
  • 28:24 - 28:29
    falls out of this. We'll take this even further later.
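The correspondence just derived - eta equals log of phi over one minus phi, its inverse phi equals one over one plus e to the minus eta (the logistic function), and A of eta equals log of one plus e to the eta - can be checked numerically. A minimal Python sketch, with illustrative values of phi:

```python
import math

def bernoulli_pmf(y, phi):
    """P(Y = y) = phi^y * (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi ** y * (1.0 - phi) ** (1 - y)

def exp_family_pmf(y, eta):
    """b(y) * exp(eta * T(y) - a(eta)) with b(y) = 1, T(y) = y, a(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log(1.0 + math.exp(eta)))

checks = []
for phi in (0.1, 0.5, 0.9):
    eta = math.log(phi / (1.0 - phi))        # natural parameter
    sigmoid = 1.0 / (1.0 + math.exp(-eta))   # inverting eta recovers phi
    for y in (0, 1):
        checks.append(abs(bernoulli_pmf(y, phi) - exp_family_pmf(y, eta)))
    checks.append(abs(sigmoid - phi))
```

Every difference collected in `checks` is zero up to floating-point error, confirming that the two parameterizations describe the same distribution.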
  • 28:29 - 28:31
    Again, copying definitions from
  • 28:31 - 28:36
    the board - from the previous board - A of eta,
  • 28:36 - 28:39
    I said is minus log of one minus phi. So again, phi and eta are functions of each other, all
  • 28:39 - 28:43
    right. So eta depends on phi, and phi
  • 28:43 - 28:46
    depends on eta. So if I plug in
  • 28:46 - 28:49
    this definition for eta
  • 28:49 - 28:51
    into this - excuse
  • 28:51 - 28:54
    me, plug in this definition for phi into that,
  • 28:54 - 28:57
    I'll find that A of eta is therefore equal
  • 28:57 - 29:01
    to log of one plus E to the eta. And again, this is just algebra. This is not terribly
  • 29:01 - 29:02
    interesting.
  • 29:02 - 29:08
    And just to complete - excuse me.
  • 29:08 - 29:11
    And just to complete the rest of this, T of
  • 29:11 - 29:13
    Y is equal to Y,
  • 29:13 - 29:16
    and
  • 29:16 - 29:18
    B of Y is equal to one, okay.
  • 29:18 - 29:20
    So just to recap what we've done,
  • 29:20 - 29:23
    we've come up with a certain choice of
  • 29:23 - 29:25
    functions A, T and
  • 29:25 - 29:26
    B,
  • 29:26 - 29:30
    so then my formula for the exponential family distribution
  • 29:30 - 29:34
    now becomes exactly the formula for the distributions, or for the probability
  • 29:34 - 29:37
    mass function of the Bernoulli distribution.
  • 29:37 - 29:40
    And the natural parameter eta has a certain relationship to the
  • 29:40 - 29:45
    original parameter phi of the Bernoulli.
  • 29:45 - 29:46
    Question?
  • 29:46 - 29:49
    [Inaudible]
  • 29:49 - 29:50
    Let's
  • 29:50 - 29:54
    see. [Inaudible]. The second to
  • 29:54 - 29:59
    the last one. Oh, this answer is fine. Okay.
  • 29:59 - 30:01
    Let's see.
  • 30:01 - 30:03
    Yeah, so this is -
  • 30:03 - 30:07
    well, if you expand this term out, one minus Y times log one minus phi,
  • 30:07 - 30:09
    and so
  • 30:09 - 30:10
    one times log -
  • 30:10 - 30:13
    one minus phi becomes this. And the other
  • 30:13 - 30:16
    term is minus Y times log one minus phi.
  • 30:16 - 30:18
    And then -
  • 30:18 - 30:23
    so the minus of a log is log one over X, or is just log one over
  • 30:23 - 30:25
    whatever. So minus Y times log
  • 30:25 - 30:28
    one minus phi becomes
  • 30:28 - 30:34
    sort of Y times log, one over one minus phi. Does that make sense? Yeah. Yeah,
  • 30:34 - 30:39
    cool. Anything else? Yes?
  • 30:39 - 30:41
    [Inaudible] is a scalar, isn't it? Up there
  • 30:41 - 30:43
    -
  • 30:43 - 30:47
    Yes - it's eta transposed, so it can be a vector, or
  • 30:47 - 30:49
    - Yes, [inaudible]. So
  • 30:49 - 30:53
    let's see. In this and the next example, eta will turn out to
  • 30:53 - 30:55
    be a scalar.
  • 30:55 - 30:56
    And so -
  • 30:56 - 30:59
    well, on this board.
  • 30:59 - 31:04
    And so if eta is a scalar and T of Y is a scalar, then this is just
  • 31:04 - 31:06
    a real number times a real number. So this would be like a
  • 31:06 - 31:08
    one-dimensional vector
  • 31:08 - 31:12
    transposed times a one-dimensional vector. And so this is just real number times real number. Towards the
  • 31:12 - 31:16
    end of today's lecture, we'll go with just one example where both of
  • 31:16 - 31:17
    these are vectors.
  • 31:17 - 31:24
    But for most distributions, these will turn out to be scalars.
  • 31:24 - 31:25
  • 31:25 - 31:27
  • 31:27 - 31:33
    The Bernoulli distribution [inaudible]. I
  • 31:33 - 31:35
    mean, it doesn't have the
  • 31:35 - 31:38
    zero probability
  • 31:38 - 31:42
    outside of zero and one. I
  • 31:42 - 31:43
    see. So
  • 31:43 - 31:46
    - yeah. Let's - for this,
  • 31:46 - 31:51
    let's imagine that we're restricting the domain
  • 31:51 - 31:52
  • 31:52 - 31:55
    of the input of the function to be Y equals zero or one.
  • 31:55 - 31:57
    So think of that as maybe an
  • 31:57 - 32:02
    implicit constraint on it. But so this is a
  • 32:02 - 32:03
    probability mass function
  • 32:03 - 32:06
    for Y equals zero or Y equals one. So
  • 32:06 - 32:13
    write down Y equals zero or one. Let's think of that as an implicit constraint. So - cool.
  • 32:16 - 32:20
    So this
  • 32:20 - 32:21
    takes the Bernoulli
  • 32:21 - 32:25
    distribution and writes it in the form of an exponential family
  • 32:25 - 32:28
    distribution. Let me do that very quickly for the Gaussian. I won't do the algebra for the
  • 32:28 - 32:30
    Gaussian.
  • 32:30 - 32:32
    I'll basically just write out the answers.
  • 32:32 - 32:37
    So
  • 32:37 - 32:38
  • 32:38 - 32:40
    with a normal distribution with
  • 32:40 - 32:43
    mean mu and variance sigma squared,
  • 32:43 - 32:47
    and so you remember, was it two lectures ago,
  • 32:47 - 32:51
    when we were deriving the maximum likelihood - excuse me, oh, no, just the
  • 32:51 - 32:52
    previous lecture
  • 32:52 - 32:56
    when we were deriving the maximum likelihood estimate
  • 32:56 - 33:00
    for the parameters of ordinary least squares. We showed that
  • 33:00 - 33:04
    the parameter sigma squared didn't matter. When we derived
  • 33:04 - 33:06
    the maximum likelihood model for sigma squared,
  • 33:06 - 33:09
    we said that no matter what sigma squared was, we end up with the same
  • 33:09 - 33:12
    value of the parameters.
  • 33:12 - 33:16
    So for the purposes of today's lecture, and not taking into
  • 33:16 - 33:16
    account
  • 33:16 - 33:18
    sigma squared,
  • 33:18 - 33:21
    I'm just going to set
  • 33:21 - 33:26
    sigma squared to be equal to one, okay, so as to not worry about it.
  • 33:26 - 33:28
    The lecture notes talk a little bit more about this, but I'm just gonna -
  • 33:28 - 33:30
    just to make
  • 33:30 - 33:34
    the math in class a bit easier and simpler today, let's just
  • 33:34 - 33:36
    say that sigma squared equals one.
  • 33:36 - 33:39
    Sigma squared is essentially just a scaling factor
  • 33:39 - 33:44
    on the variable Y.
  • 33:44 - 33:46
    So in that case, the Gaussian
  • 33:46 - 33:51
    density is given by this,
  • 33:51 - 33:52
    one over root two pi, E to the minus one-half, Y minus mu,
  • 33:52 - 33:54
    squared.
  • 33:54 - 33:55
  • 33:55 - 33:59
    And
  • 33:59 - 34:02
    - well, by a couple of steps of algebra, which I'm not going to do,
  • 34:02 - 34:07
    but is written out in full in the lecture notes, which you can download.
  • 34:07 - 34:10
    This is one over root two
  • 34:10 - 34:14
    pi, E to the minus one-half Y squared,
  • 34:14 - 34:17
    times E to the
  • 34:17 - 34:20
    mu Y minus
  • 34:20 - 34:23
    one-half mu squared, okay.
  • 34:23 - 34:25
    So I'm just not doing the algebra.
  • 34:25 - 34:28
    And so that's B
  • 34:28 - 34:33
    of Y,
  • 34:33 - 34:37
    we have eta that's equal to mu. T of
  • 34:37 - 34:42
    Y equals
  • 34:42 - 34:43
    Y,
  • 34:43 - 34:45
    and - well, A of eta
  • 34:45 - 34:50
    is equal to
  • 34:50 - 34:54
    minus one-half - actually, I think that
  • 34:54 - 34:59
    should be plus one-half. Have I got that right?
  • 34:59 - 35:02
    Yeah, sorry. Let's
  • 35:02 - 35:03
    see - excuse me. Plus sign
  • 35:03 - 35:07
    there, okay. So we have plus one-half mu squared, and
  • 35:07 - 35:11
    because mu is equal to eta, this is just one-half
  • 35:11 - 35:14
    eta squared, okay.
  • 35:14 - 35:15
    And so
  • 35:15 - 35:19
    this would be a specific choice again of A, B and T
  • 35:19 - 35:21
    that
  • 35:21 - 35:23
    expresses the Gaussian density
  • 35:23 - 35:27
    in the form of an exponential family distribution.
  • 35:27 - 35:31
    And in this case, the relationship between eta and mu is that
  • 35:31 - 35:35
    eta is just equal to mu, so the mean of the Gaussian is just equal to the natural
  • 35:35 - 35:38
    parameter of the exponential family distribution. Minus
  • 35:38 - 35:44
    half. Oh, this is minus half? [Inaudible] Oh, okay, thanks.
  • 35:44 - 35:47
    And so - guessing
  • 35:47 - 35:51
    that should be plus then. Is that right? Okay. Oh,
  • 35:51 - 35:58
    yes, you're right. Thank you. All right.
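The Gaussian factorization above can likewise be verified numerically. A minimal Python sketch (not from the lecture) compares the N(mu, 1) density with the exponential-family form b(y) * exp(eta*y - a(eta)):

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density, with sigma^2 fixed to 1 as in the lecture."""
    return (1.0 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * (y - mu) ** 2)

def gaussian_exp_family(y, mu):
    """Same density in exponential-family form b(y) * exp(eta*y - a(eta)),
    with b(y) = (1/sqrt(2*pi)) * exp(-y^2/2), eta = mu, T(y) = y,
    and a(eta) = eta^2 / 2."""
    eta = mu
    b = (1.0 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * y ** 2)
    a = 0.5 * eta ** 2
    return b * math.exp(eta * y - a)

for mu in (-1.0, 0.0, 2.5):
    for y in (-2.0, 0.3, 1.7):
        assert abs(gaussian_pdf(y, mu) - gaussian_exp_family(y, mu)) < 1e-12
```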
  • 36:03 - 36:06
    And so here's a result that
  • 36:06 - 36:09
    you may have seen in an undergrad statistics class: it turns out that most of the
  • 36:09 - 36:13
    "textbook distributions," not all, but most of them,
  • 36:13 - 36:17
    can be written in the form of an exponential family distribution.
  • 36:17 - 36:21
    So you saw the Gaussian, the normal distribution. It turns out the multivariate
  • 36:21 - 36:25
    normal distribution, which is a generalization of Gaussian random variables
  • 36:25 - 36:27
    to high-dimensional vectors -
  • 36:27 - 36:32
    the multivariate normal distribution is also in the exponential family.
  • 36:32 - 36:36
    You saw the Bernoulli as an exponential family. It turns out the multinomial distribution is too, all
  • 36:36 - 36:38
    right. So the Bernoulli
  • 36:38 - 36:42
    models outcomes over zero and one - they'd be coin tosses with two outcomes.
  • 36:42 - 36:47
    The multinomial models outcomes over K possible values. That's also an
  • 36:47 - 36:49
    exponential families distribution.
  • 36:49 - 36:53
    You may have heard of the Poisson distribution. And so the Poisson distribution is
  • 36:53 - 36:55
    often used for modeling counts. Things like
  • 36:55 - 37:00
    the number of radioactive decays in a sample, or the number of
  • 37:00 - 37:04
    customers to your website, the numbers of visitors arriving in a store. The
  • 37:04 - 37:08
    Poisson distribution is also in the exponential family. So are
  • 37:08 - 37:10
    the gamma and the exponential
  • 37:10 - 37:15
    distributions, if you've heard of them. So the gamma and the exponential distributions are
  • 37:15 - 37:19
    distributions over the positive numbers. So they're often used to model
  • 37:19 - 37:20
    intervals, like if you're
  • 37:20 - 37:22
    standing at the bus stop and you want to
  • 37:22 - 37:26
    ask, "When is the next bus likely to arrive? How long do I have to wait for
  • 37:26 - 37:27
    my bus to arrive?"
  • 37:27 - 37:32
    Often you model that with sort of gamma distribution or
  • 37:32 - 37:34
    exponential families, or the exponential
  • 37:34 - 37:37
    distribution. Those are also in the exponential family.
  • 37:37 - 37:41
    Even more esoteric distributions, like the beta and the Dirichlet
  • 37:41 - 37:44
    distributions - these are probability distributions over
  • 37:44 - 37:49
    fractions, or even probability distributions over probability distributions.
  • 37:49 - 37:52
    And also things like the Wishart distribution, which is the
  • 37:52 - 37:57
    distribution over covariance matrices. So all of these, it turns out, can be written in
  • 37:57 - 38:01
    the form of exponential family distributions. Well,
  • 38:01 - 38:04
    and
  • 38:04 - 38:09
    in the problem set, we'll ask you to take one of these distributions and write
  • 38:09 - 38:13
    it in the form of the exponential family distribution, and derive a generalized linear model
  • 38:13 - 38:15
    for it, okay.
  • 38:15 - 38:22
    Which brings me to the next topic of
  • 38:29 - 38:32
    having chosen an exponential family distribution,
  • 38:32 - 38:38
    how do you use it to derive a generalized linear model?
  • 38:38 - 38:41
    So
  • 38:41 - 38:45
    generalized linear models are often abbreviated GLMs.
  • 38:45 - 38:49
    And
  • 38:49 - 38:52
    I'm going to write down the three assumptions. You can think of them
  • 38:52 - 38:55
    as assumptions, or you can think of them as design choices,
  • 38:55 - 39:00
    that will then allow me to sort of turn a crank and come up with a generalized linear model.
  • 39:00 - 39:04
    So the first one is - I'm going to assume
  • 39:04 - 39:04
    that
  • 39:04 - 39:06
    given
  • 39:06 - 39:09
    my input X and my parameters theta,
  • 39:09 - 39:13
    I'm going to assume that the variable Y,
  • 39:13 - 39:17
    the output Y, or the response variable Y I'm trying to predict
  • 39:17 - 39:20
    is distributed
  • 39:20 - 39:22
    exponential family
  • 39:22 - 39:24
  • 39:24 - 39:27
    with some natural parameter eta. And so this means that there
  • 39:27 - 39:29
    is some specific choice
  • 39:29 - 39:32
    of those functions, A, B and T
  • 39:32 - 39:33
    so that
  • 39:33 - 39:38
    the conditional distribution of Y given X and parameterized by theta,
  • 39:38 - 39:40
    those exponential families
  • 39:40 - 39:42
    with parameter
  • 39:42 - 39:46
    eta. Where here, eta may depend on X in some way.
  • 39:46 - 39:50
    So for example, if you're trying to predict - if you want to
  • 39:50 - 39:51
    predict
  • 39:51 - 39:53
    how many customers have arrived at your website,
  • 39:53 - 39:55
    you may choose to
  • 39:55 - 39:57
    model the number of people
  • 39:57 - 40:01
    - the number of hits on your website by a Poisson distribution, since the Poisson
  • 40:01 - 40:04
    distribution is natural for modeling count data.
  • 40:04 - 40:05
    And so you may
  • 40:05 - 40:12
    choose the exponential family distribution here to be the Poisson distribution.
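As an aside not derived on the board here, the Poisson pmf also factors into the exponential-family form: p(y; lambda) = (1/y!) * exp(y * log(lambda) - lambda), so the natural parameter is eta = log(lambda). A short Python sketch can confirm this:

```python
import math

def poisson_pmf(y, lam):
    """Standard Poisson pmf: lam^y * exp(-lam) / y!."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def poisson_exp_family(y, lam):
    """Same pmf as b(y) * exp(eta*y - a(eta)) with b(y) = 1/y!,
    natural parameter eta = log(lam), and a(eta) = e^eta = lam."""
    eta = math.log(lam)
    return (1.0 / math.factorial(y)) * math.exp(eta * y - math.exp(eta))

for lam in (0.5, 3.0):
    for y in (0, 1, 4):
        assert abs(poisson_pmf(y, lam) - poisson_exp_family(y, lam)) < 1e-12
```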
  • 40:15 - 40:18
    The second assumption is that given X, our
  • 40:18 - 40:21
    goal is
  • 40:21 - 40:22
    to output
  • 40:22 - 40:25
    the
  • 40:25 - 40:29
    expected value of Y
  • 40:29 - 40:30
    given X. So
  • 40:30 - 40:34
    given the features in the website examples, I've given a set of features
  • 40:34 - 40:38
    about whether there were any promotions, whether there were sales, how many people linked to
  • 40:38 - 40:40
    your website, or whatever.
  • 40:40 - 40:41
    I'm going to assume that
  • 40:41 - 40:45
    our goal in our learning problem is to estimate the expected number of people that will
  • 40:45 - 40:49
    arrive at your website on a given day.
  • 40:49 - 40:50
  • 40:50 - 40:52
    So in other words,
  • 40:52 - 40:55
    you're saying that I want H of
  • 40:55 - 40:58
    X to be equal to - oh,
  • 40:58 - 41:00
    excuse me. I actually
  • 41:00 - 41:06
    meant to write T of Y here.
  • 41:06 - 41:10
    My goal is to get my learning algorithms hypothesis to output the expected value
  • 41:10 - 41:13
    of T of Y given X. But
  • 41:13 - 41:17
    again, for most of the examples, T of Y is just equal to Y.
  • 41:17 - 41:21
    And so for most of the examples, our goal is to get our learning algorithm to output the
  • 41:21 - 41:28
    expected value of Y given X because T of Y is usually
  • 41:33 - 41:34
    equal
  • 41:34 - 41:36
    to Y. Yes? [Inaudible] Yes, same thing, right. T of Y is a sufficient statistic.
  • 41:36 - 41:39
    Same T of Y.
  • 41:39 - 41:44
    And lastly, this last one I wrote down - these are assumptions. This last
  • 41:44 - 41:48
    one you might - maybe wanna think of this as a design choice.
  • 41:48 - 41:49
  • 41:49 - 41:50
    Which is - I'll
  • 41:50 - 41:53
    assume that the distribution of Y given X
  • 41:53 - 41:55
    is distributed exponential family
  • 41:55 - 41:59
    with some parameter eta. So the number of visitors on the website
  • 41:59 - 42:01
    on any given day will be Poisson
  • 42:01 - 42:03
    with some parameter.
  • 42:03 - 42:06
    And the last decision I need to make is
  • 42:06 - 42:09
    what the relationship is between my input features
  • 42:09 - 42:11
    and this parameter
  • 42:11 - 42:15
    eta parameterizing my Poisson distribution or whatever.
  • 42:15 - 42:19
    And this last step, I'm going to make the
  • 42:19 - 42:21
    assumption, or really a design choice,
  • 42:21 - 42:24
    that I'm going to assume the relationship between eta
  • 42:24 - 42:26
    and my inputs X is linear,
  • 42:26 - 42:30
    and in particular that they're governed by this - that eta is equal to theta,
  • 42:30 - 42:32
    transpose X.
  • 42:32 - 42:36
    And the reason I make this design choice is it will allow me to turn the crank of
  • 42:36 - 42:37
    the
  • 42:37 - 42:39
    generalized linear model
  • 42:39 - 42:42
    machinery and come up with very nice algorithms for fitting,
  • 42:42 - 42:45
    say, Poisson regression models,
  • 42:45 - 42:47
    or perform regression
  • 42:47 - 42:50
    with a gamma distribution outputs or exponential distribution outputs and
  • 42:50 - 42:53
    so on.
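The three design choices just described can be summarized in code. The sketch below is an illustration, not lecture material (the function and dictionary names are invented here); it uses the standard canonical response functions for three common exponential-family members:

```python
import math

# The three GLM design choices, sketched as code:
# (1) y | x; theta is exponential family with natural parameter eta,
# (2) the hypothesis outputs E[T(y) | x] (usually E[y | x]), and
# (3) eta = theta^T x.

# Canonical response functions g(eta) = E[y; eta] for three members of
# the exponential family:
CANONICAL_RESPONSE = {
    "gaussian": lambda eta: eta,                            # identity
    "bernoulli": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # sigmoid
    "poisson": lambda eta: math.exp(eta),                   # exp
}

def glm_hypothesis(theta, x, family):
    """h(x) = g(theta^T x): assumption (3) picks eta, then the canonical
    response function of the chosen family (assumption 1) gives E[y|x]."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    return CANONICAL_RESPONSE[family](eta)

# With family="bernoulli", this is exactly logistic regression's hypothesis.
```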
  • 42:53 - 42:59
    So
  • 42:59 - 43:06
    let's work through an example.
  • 43:13 - 43:15
  • 43:15 - 43:18
    Eta
  • 43:18 - 43:23
    equals theta transpose X works for the case where eta is a real number.
  • 43:23 - 43:25
    For the more general case, you
  • 43:25 - 43:29
    would have eta I equals theta
  • 43:29 - 43:35
    I, transpose X if eta
  • 43:35 - 43:39
    is a vector rather than a real number. But again, in most of the examples eta
  • 43:39 - 43:45
    will just be a real number.
  • 43:45 - 43:52
    All right.
  • 43:55 - 44:00
    So let's work through the Bernoulli example. You'll see
  • 44:00 - 44:07
    where Y given X parameterized by theta - this is a distributed
  • 44:09 - 44:12
    exponential family with natural parameter eta.
  • 44:12 - 44:14
    And for
  • 44:14 - 44:17
    the Bernoulli distribution, I'm going to choose A, B and T to
  • 44:17 - 44:20
    be the specific forms
  • 44:20 - 44:21
    that
  • 44:21 - 44:26
    cause those exponential families to become the Bernoulli distribution. This is the example we
  • 44:26 - 44:30
    worked through just now, the first example we worked through just now.
  • 44:30 - 44:34
    So - oh,
  • 44:34 - 44:35
    and we also have -
  • 44:35 - 44:41
    so for any fixed
  • 44:41 - 44:44
    value of X and theta, my hypothesis, my learning
  • 44:44 - 44:46
    algorithm
  • 44:46 - 44:53
    will
  • 44:53 - 44:57
    make a prediction, or will make - will sort of output H of
  • 44:57 - 45:01
    X,
  • 45:01 - 45:03
    which is
  • 45:03 - 45:06
    by my, I guess, assumption two. I want our learning
  • 45:06 - 45:10
    algorithm to output the expected value of Y given X
  • 45:10 - 45:14
  • 45:14 - 45:15
    and parameterized by theta, where Y can take on
  • 45:15 - 45:17
    only the value zero and one,
  • 45:17 - 45:22
    then the expected value of Y is just equal to the
  • 45:22 - 45:26
    probability that Y is equal to one. So
  • 45:26 - 45:30
    the expected value of a Bernoulli variable is just equal to the
  • 45:30 - 45:37
    probability that it's equal to one.
  • 45:37 - 45:38
    And so
  • 45:38 - 45:42
    the probability that Y equals one is just equal to phi
  • 45:42 - 45:43
    because that's the
  • 45:43 - 45:45
    parameter
  • 45:45 - 45:47
    of my Bernoulli distribution. Phi
  • 45:47 - 45:54
    is, by definition, I guess, the probability of my Bernoulli random variable taking on the value of one.
  • 45:57 - 45:59
    Which we worked out previously,
  • 45:59 - 46:03
    phi was one over one plus E to the negative eta.
  • 46:03 - 46:06
    So we worked this out on our previous board. This is the relationship -
  • 46:06 - 46:10
    so when we wrote down the Bernoulli distribution
  • 46:10 - 46:14
    in the form of an exponential family, we worked out what the relationship was between phi and
  • 46:14 - 46:18
    eta, and it was this. So we worked out the relationship between the
  • 46:18 - 46:24
    expected value of Y and eta was this relationship.
  • 46:24 - 46:26
    And lastly, because
  • 46:26 - 46:29
    we made the design choice, or the assumption that
  • 46:29 - 46:31
    eta and X are
  • 46:31 - 46:32
    linearly
  • 46:32 - 46:36
    related. This is therefore equal to
  • 46:36 - 46:42
    one over one plus E to the minus theta, transpose X.
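How theta itself is chosen comes up in a question later in the lecture: maximum likelihood, via gradient ascent or Newton's method. As a concrete illustration, here is a minimal batch gradient-ascent sketch; the data set, step size, and step count below are all made up for the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: h(x) = 1 / (1 + e^(-theta^T x))."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def fit_logistic(xs, ys, alpha=0.1, steps=1000):
    """Batch gradient ascent on the log likelihood; the update
    theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij is the rule from the
    previous lecture."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [sum((y - h(theta, x)) * x[j] for x, y in zip(xs, ys))
                for j in range(len(theta))]
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy 1-D data with an intercept feature x[0] = 1:
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 1]
theta = fit_logistic(xs, ys)
assert h(theta, [1.0, 2.0]) > 0.9 and h(theta, [1.0, -2.0]) < 0.1
```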
  • 46:42 - 46:45
    And so that's
  • 46:45 - 46:46
    how I
  • 46:46 - 46:49
    come up with the logistic regression algorithm
  • 46:49 - 46:51
    when
  • 46:51 - 46:53
    you have a variable Y
  • 46:53 - 46:57
    - when you have a binary variable Y, or a response variable
  • 46:57 - 46:58
    Y
  • 46:58 - 47:02
    that takes on two values, and then you choose to model
  • 47:02 - 47:08
    it with a Bernoulli distribution. Are you
  • 47:08 - 47:14
    sure this does make sense? Raise your hand if this makes sense. Yeah, okay, cool. So I hope
  • 47:14 - 47:16
    you get
  • 47:16 - 47:19
    the ease of use of this, or sort of the power of this. The only decision I
  • 47:19 - 47:22
    made was really, I said Y -
  • 47:22 - 47:25
    let's say I have a new machine-learning problem and
  • 47:25 - 47:28
    I'm trying to predict the value of a variable Y that happens to take on two
  • 47:28 - 47:30
    values.
  • 47:30 - 47:34
    Then the only decision I need to make is to choose the Bernoulli distribution. I
  • 47:34 - 47:38
    say I want to model - I want to assume that given X and theta,
  • 47:38 - 47:41
    I'm going to assume Y is distributed
  • 47:41 - 47:43
    Bernoulli. That's the only decision I made.
  • 47:43 - 47:47
    And then everything else follows automatically having made the
  • 47:47 - 47:48
    decision to
  • 47:48 - 47:50
    model Y given X and
  • 47:50 - 47:54
    parameterized by theta as being Bernoulli.
  • 47:54 - 47:58
    In the same way you can choose a different distribution, you can choose Y as Poisson or Y as
  • 47:58 - 48:00
    gamma or Y as whatever,
  • 48:00 - 48:03
    and follow a similar process and come up with a different model and
  • 48:03 - 48:04
    different learning algorithm.
  • 48:04 - 48:07
    Come up with a different generalized linear model for whatever learning
  • 48:07 - 48:12
    algorithm you're faced with.
  • 48:12 - 48:16
    This tiny little notation, the
  • 48:16 - 48:20
    function G that
  • 48:20 - 48:25
    relates - G of eta - that relates the natural parameter
  • 48:25 - 48:28
    to the expected value of Y,
  • 48:28 - 48:32
    which in this case is one over one plus E to the minus eta,
  • 48:32 - 48:39
    this is called the canonical response function. And G inverse is called the
  • 48:42 - 48:43
    canonical
  • 48:43 - 48:49
    link function. These aren't
  • 48:49 - 48:53
  • 48:53 - 48:57
    a huge deal. I won't use this terminology a lot. I'm just
  • 48:57 - 48:59
    mentioning those in case
  • 48:59 - 49:03
    you hear about - people talk about generalized linear models, and if they talk about canonical
  • 49:03 - 49:05
    response functions or
  • 49:05 - 49:09
    canonical link functions, just so you know there's all of this.
  • 49:09 - 49:11
    Actually, many texts actually use
  • 49:11 - 49:16
    the reverse way. This is G inverse and this is G, but this
  • 49:16 - 49:18
    notation turns out to be more consistent with
  • 49:18 - 49:20
    other algorithms in machine learning. So
  • 49:20 - 49:23
    I'm going to use this notation.
  • 49:23 - 49:27
    But I probably won't use the terms canonical response functions and canonical
  • 49:27 - 49:30
    link functions in lecture a lot, so just - I don't know.
  • 49:30 - 49:32
    I'm not big on
  • 49:32 - 49:39
    memorizing lots of names of things. I'm just tossing those out there in case you see it elsewhere. Okay.
  • 49:50 - 49:51
    You know what, I think
  • 49:51 - 49:55
    in the interest of time, I'm going to skip over the Gaussian example. But again,
  • 49:55 - 49:57
    just like I said,
  • 49:57 - 49:59
    if Y
  • 49:59 - 50:03
    is Bernoulli, the derivation I get is logistic regression. You can do the same
  • 50:03 - 50:05
    thing with the Gaussian distribution
  • 50:05 - 50:09
    and end up with the ordinary least squares model.
  • 50:09 - 50:12
    The problem with Gaussian is that it's almost so simple that when you see it for
  • 50:12 - 50:14
    the first time that it's
  • 50:14 - 50:17
    sometimes more confusing than the Bernoulli example because it looks so simple,
  • 50:17 - 50:21
    it looks like it has to be more complicated. So let me just skip that
  • 50:21 - 50:23
    and leave you to read about
  • 50:23 - 50:25
    the Gaussian example in the lecture notes.
  • 50:25 - 50:27
    And what I want to do is
  • 50:27 - 50:30
    actually go through a more complex example.
  • 50:30 - 50:34
    Question? [Inaudible]
  • 50:34 - 50:35
    Okay, right. So
  • 50:35 - 50:39
    how do you choose what theta will be?
  • 50:39 - 50:43
    We'll get to that in the end. What you have there is the logistic
  • 50:43 - 50:45
    regression model, which is a probabilistic model
  • 50:45 - 50:51
    that assumes the probability of Y given X is given by a certain form.
  • 50:51 - 50:53
    And so
  • 50:53 - 50:58
    what you do is you can write down the log likelihood of your training set, and
  • 50:58 - 51:02
    find the value of theta that maximizes the log likelihood of the parameters.
  • 51:02 - 51:05
    Does that make
  • 51:05 - 51:06
    sense?
  • 51:06 - 51:09
    So I'll say that again towards the end of today's lecture. But
  • 51:09 - 51:14
    for logistic regression, the way you choose theta is exactly maximum likelihood,
  • 51:14 - 51:15
    as we
  • 51:15 - 51:18
    worked out in the previous lecture, using Newton's Method or gradient
  • 51:18 - 51:20
    ascent or
  • 51:20 - 51:22
    whatever. I'll sort of try to
  • 51:22 - 51:29
    do that again for one more example towards the end of today's lecture. So
  • 51:29 - 51:33
    what I want to do is actually use the remaining, I don't know,
  • 51:33 - 51:36
    19 minutes or so of this class,
  • 51:36 - 51:38
    to go through the -
  • 51:38 - 51:42
    one of the more - it's probably the most complex example of a
  • 51:42 - 51:46
    generalized linear model that I've used. This one I want to go through because it's a little bit
  • 51:46 - 51:48
    trickier than
  • 51:48 - 51:51
    many of the other textbook examples of
  • 51:51 - 51:54
    generalized linear models.
  • 51:54 - 51:55
    So
  • 51:55 - 51:58
    again, what I'm going to do is
  • 51:58 - 52:00
    go through the derivation
  • 52:00 - 52:04
    reasonably quickly and give you the gist of it, and if there are steps I skip or details
  • 52:04 - 52:08
    omitted, I'll leave you to read about them more carefully
  • 52:08 - 52:10
    in the lecture notes.
  • 52:10 - 52:14
    And what I want to do is talk about
  • 52:14 - 52:17
    the multinomial.
  • 52:17 - 52:20
    And
  • 52:20 - 52:23
    the multinomial
  • 52:23 - 52:26
    is the distribution over
  • 52:26 - 52:28
    K possible outcomes.
  • 52:28 - 52:30
  • 52:30 - 52:33
    Imagine you're now in a machine-learning problem where the value of Y that you're trying to
  • 52:33 - 52:35
    predict can take on
  • 52:35 - 52:37
    K possible outcomes, so rather than
  • 52:37 - 52:40
    only two outcomes. So an obvious example is -
  • 52:40 - 52:45
    if you want to have a learning algorithm magically sort emails for you into the
  • 52:45 - 52:48
    right email folder, and you may have a dozen email folders you want your algorithm
  • 52:48 - 52:50
    to classify emails into.
  • 52:50 - 52:51
    Or
  • 52:51 - 52:54
    predicting if the patient either has a disease or does not have
  • 52:54 - 52:57
    a disease, which would be a binary classification problem.
  • 52:57 - 53:02
    If you think that the patient may have one of K diseases, and
  • 53:02 - 53:06
    you want to have a learning algorithm figure out which one of K diseases your patient has. So
  • 53:06 - 53:07
    lots
  • 53:07 - 53:10
    of multi-class classification problems where you have more than two classes. You model that
  • 53:10 - 53:14
    with the multinomial.
  • 53:14 - 53:17
    And eventually -
  • 53:17 - 53:21
    so for logistic regression, I had data sets like these where you have a
  • 53:21 - 53:26
    training set and you find a decision boundary that separates them.
  • 53:26 - 53:29
    For the multinomial, we're going to entertain
  • 53:29 - 53:32
    the value we're predicting taking on multiple values, so you now have
  • 53:32 - 53:33
    three classes,
  • 53:33 - 53:36
    and the learning algorithm
  • 53:36 - 53:39
    will learn some way to separate out three classes or more, rather than just two
  • 53:39 - 53:44
    classes.
  • 53:44 - 53:47
    So let's write the multinomial in the form of
  • 53:47 - 53:49
    an exponential family distribution.
  • 53:49 - 53:53
  • 53:53 - 53:58
    So the parameters of a multinomial are phi one, phi two
  • 53:58 - 54:00
    up to
  • 54:00 - 54:04
    phi K. I'll actually change this in a second -
  • 54:04 - 54:09
    where the probability of Y equals I is phi I,
  • 54:09 - 54:10
    right, because there are
  • 54:10 - 54:13
    K possible outcomes.
  • 54:13 - 54:18
    But if I choose this as my parameterization of the multinomial, then
  • 54:18 - 54:20
    my parameter's actually redundant because
  • 54:20 - 54:23
    if these are probabilities, then they have to sum up to one.
  • 54:23 - 54:26
    And therefore for example, I
  • 54:26 - 54:30
    can derive the last parameter, phi K,
  • 54:30 - 54:32
    as one minus phi
  • 54:32 - 54:35
    one, up
  • 54:35 - 54:36
    to phi K minus
  • 54:36 - 54:41
    one. So this would be a
  • 54:41 - 54:44
    redundant parameterization from
  • 54:44 - 54:49
    [inaudible]. The result is over-parameterized. And so for purposes of
  • 54:49 - 54:50
    this [inaudible], I'm
  • 54:50 - 54:54
    going to treat my parameters of my [inaudible] as phi one,
  • 54:54 - 54:59
    phi two, up to phi K minus one.
  • 54:59 - 55:03
    And I won't think of phi K as a parameter. I'll just - so my parameters are
  • 55:03 - 55:03
    just -
  • 55:03 - 55:08
    I just have K minus one parameters, parameterizing my
  • 55:08 - 55:11
    multinomial. And sometimes I write phi K in my
  • 55:11 - 55:13
    derivations as well, and you should think of
  • 55:13 - 55:18
    phi K as just a shorthand for this, for one minus the rest of the parameters, okay.
  • 55:18 - 55:25
    So
  • 55:36 - 55:41
    it turns out the multinomial is one of the few examples where T of Y - it's one of the
  • 55:41 - 55:45
    examples where T of Y is not equal to Y.
  • 55:45 - 55:49
    So in this case, Y is
  • 55:49 - 55:52
    one of K possible values.
  • 55:52 - 55:55
    And so T of Y would be defined as follows; T
  • 55:55 - 55:59
    of one is going to be a vector with a one
  • 55:59 - 56:02
    and zeros everywhere else. T
  • 56:02 - 56:04
    of two
  • 56:04 - 56:07
    is going to be a zero, one, zero
  • 56:07 - 56:09
    and so
  • 56:09 - 56:12
    on. Except that these are going to be
  • 56:12 - 56:16
    K minus one-dimensional vectors. And
  • 56:16 - 56:17
    so
  • 56:17 - 56:21
    T of K minus one is going to be zero, zero,
  • 56:21 - 56:24
    zero, one.
  • 56:24 - 56:27
    And
  • 56:27 - 56:29
    T of K is going to be the vector of all zeros.
  • 56:29 - 56:30
    So this is just
  • 56:30 - 56:32
    how I'm choosing to define T
  • 56:32 - 56:35
    of Y
  • 56:35 - 56:39
    to write down the multinomial in the form of an exponential family
  • 56:39 - 56:41
    distribution.
  • 56:41 - 56:45
    Again, these are K minus one-dimensional vectors.
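This definition of T(y) is easy to write down in code. A small Python sketch (the function name here is just for illustration) builds the (K-1)-dimensional vectors and checks the indicator property discussed next - that the i-th element of T(y) is exactly 1{y = i}:

```python
def T(y, k):
    """T(y) for the multinomial: a (k-1)-dimensional vector with a 1 in
    position y-1 for y = 1..k-1, and the all-zeros vector for y = k."""
    return [1 if y == i else 0 for i in range(1, k)]

# The i-th element of T(y) is the indicator 1{y == i}:
K = 4
for y in range(1, K + 1):
    t = T(y, K)
    for i in range(1, K):
        assert t[i - 1] == (1 if y == i else 0)

# Examples with K = 4:
# T(1) = [1, 0, 0], T(2) = [0, 1, 0], T(4) = [0, 0, 0]
```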
  • 56:45 - 56:47
    So
  • 56:47 - 56:50
    this is a good point to introduce one more useful piece of notation,
  • 56:50 - 56:54
    which is called indicator function notation.
  • 56:54 - 56:58
    So I'm going to write one, and then curly braces.
  • 56:58 - 57:04
    And if I write a true statement inside, then the indicator of that statement is going to be
  • 57:04 - 57:05
    one. If I write one,
  • 57:05 - 57:08
    and then I write a false statement inside, then
  • 57:08 - 57:12
    the value of this indicator function is going to be a
  • 57:12 - 57:15
    zero. For example, if I write indicator two
  • 57:15 - 57:17
    equals three
  • 57:17 - 57:21
    - that's false, and so this is equal to zero.
  • 57:21 - 57:22
    Whereas indicator one
  • 57:22 - 57:25
    plus one equals two,
  • 57:25 - 57:28
    I wrote down a true statement inside. And so the indicator of the statement was equal to
  • 57:28 - 57:30
    one. So the
  • 57:30 - 57:32
    indicator function is just a
  • 57:32 - 57:35
    very useful notation for indicating sort of truth or falsehood
  • 57:35 - 57:42
    of the statement inside. And so - actually, let's do
  • 57:46 - 57:47
    this here.
  • 57:47 - 57:54
    To combine both of these, right, if I carve out a bit
  • 57:54 - 57:57
    of space here -
  • 57:57 - 58:03
    so if I use - so TY is a
  • 58:03 - 58:07
    vector. Y is one of K values, and so
  • 58:07 - 58:10
    TY is one of these K vectors. If
  • 58:10 - 58:13
    I use TY subscript I to denote
  • 58:13 - 58:16
    the I-th element of the vector
  • 58:16 - 58:18
    TY,
  • 58:18 - 58:22
    then TY subscript I - the I-th element of the vector TY
  • 58:22 - 58:27
    is just equal to indicator for
  • 58:27 - 58:30
    whether Y is equal to I. Just take a
  • 58:30 - 58:33
    - let me clean a couple more boards. Take a look at this for a second
  • 58:33 - 58:35
    and make sure you understand why that - make
  • 58:35 - 58:42
    sure you understand all that notation and why this is true. All
  • 59:10 - 59:16
    right. Actually, raise your hand if this equation makes sense to you.
  • 59:16 - 59:19
    Most of you,
  • 59:19 - 59:24
    not all, okay. [Inaudible].
  • 59:24 - 59:27
    Just as one kind of example,
  • 59:27 - 59:30
    suppose Y is equal to one
  • 59:30 - 59:34
    - let's
  • 59:34 - 59:36
    say - let me see. Suppose Y is equal to one, right,
  • 59:36 - 59:40
    so TY is equal to this vector,
  • 59:40 - 59:42
    and therefore the first element of this vector
  • 59:42 - 59:47
    will be one, and the rest of the elements will be equal to zero.
  • 59:47 - 59:50
    And so - let
  • 59:50 - 59:54
    me try that again, I'm sorry. Let's say I want to ask - I want to look at the I-th element of
  • 59:54 - 59:59
    the vector TY, and I want to know is this one or zero. All right.
  • 59:59 - 60:01
    Well, this will be one.
  • 60:01 - 60:04
    The I-th element of the vector TY
  • 60:04 - 60:05
    will be equal to one
  • 60:05 - 60:06
    if, and only if Y is
  • 60:06 - 60:09
    equal to I.
  • 60:09 - 60:12
    Because for example, if Y is equal to one, then only the first element of this
  • 60:12 - 60:14
    vector will be one.
  • 60:14 - 60:18
    If Y is equal to two, then only the second element of the vector will be one
  • 60:18 - 60:20
    and so on. So the question of
  • 60:20 - 60:24
    whether or not - whether the I-th element of this vector, TY, is equal to
  • 60:24 - 60:26
    one is
  • 60:26 - 60:27
    answered by
  • 60:27 - 60:29
    just asking is Y
  • 60:29 - 60:36
    equal to I. Okay. If you're
  • 60:36 - 60:39
    still not quite sure why that's true, go home and
  • 60:39 - 60:41
    think about it a bit more. And I think I -
  • 60:41 - 60:45
    and take a look at the lecture notes as
  • 60:45 - 60:50
    well, maybe that'll help. At least for now, only just take my word for it. So
  • 60:50 - 60:54
    let's go ahead and write out the distribution
  • 60:54 - 60:58
    for the multinomial in an exponential family form.
  • 60:58 - 61:05
    So P of Y is equal to phi one to the
  • 61:06 - 61:08
    Indicator Y equals one
  • 61:08 - 61:11
    times phi two to the indicator Y equals
  • 61:11 - 61:13
    two,
  • 61:13 - 61:17
    up to phi K
  • 61:17 - 61:21
    to the indicator Y equals K. And again, phi K is not a
  • 61:21 - 61:23
    parameter of the distribution. Phi K
  • 61:23 - 61:29
    is a shorthand for one minus phi one minus phi two minus the rest.
  • 61:29 - 61:31
    And so
  • 61:31 - 61:35
    using this equation on the left as well, I can also write this as
  • 61:35 - 61:38
    phi one to the TY one, phi
  • 61:38 - 61:40
    two to the TY
  • 61:40 - 61:43
    two, dot, dot, dot.
  • 61:43 - 61:45
    Phi K minus one,
  • 61:45 - 61:46
  • 61:46 - 61:48
    to the TY K minus one
  • 61:48 - 61:51
    times phi K to the
  • 61:51 - 61:58
    one minus sum over I of TY I. That sum should
  • 62:02 - 62:08
    run up to K minus one.
  • 62:08 - 62:11
    And it turns out -
  • 62:11 - 62:15
    it takes some of the steps of algebra that I don't have time to show.
  • 62:15 - 62:19
    It turns out, you can simplify this into - well, the
  • 62:19 - 62:26
    exponential family form
  • 62:32 - 62:39
    where eta is a vector, this is a
  • 62:50 - 62:57
    K minus one-dimensional vector, and - well,
  • 63:06 - 63:08
    okay. So deriving this is a few steps of algebra
  • 63:08 - 63:13
    that you can work out yourself, but I won't do here.
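For reference, the algebra he skips lands on the standard exponential-family form from the course notes; written out, the pieces are:

```latex
p(y;\phi) = b(y)\,\exp\!\left(\eta^{\top} T(y) - a(\eta)\right),
\qquad
\eta =
\begin{bmatrix}
\log(\phi_1/\phi_K) \\
\vdots \\
\log(\phi_{K-1}/\phi_K)
\end{bmatrix},
\qquad
a(\eta) = -\log \phi_K,
\qquad
b(y) = 1.
```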
  • 63:13 - 63:18
    And so using my definition for TY, and
  • 63:18 - 63:19
    by choosing
  • 63:19 - 63:21
    eta, A and B this
  • 63:21 - 63:24
    way, I can take my multinomial distribution and write it out in
  • 63:24 - 63:31
    a form of an exponential family distribution.
  • 63:33 - 63:35
    It turns out also that
  • 63:35 - 63:37
    - let's
  • 63:37 - 63:40
    see. Previously, right, one of the things we did was we also had
  • 63:40 - 63:42
    eta as a function of phi,
  • 63:42 - 63:45
    and then we inverted that to write
  • 63:45 - 63:48
    out phi as a function of eta.
  • 63:48 - 63:51
    So it turns out you can do that as well.
  • 63:51 - 63:55
    So this defines eta as a function of the multinomial distribution's parameters phi.
  • 63:55 - 63:56
    So
  • 63:56 - 64:00
    you can take this relationship between eta and phi and invert it,
  • 64:00 - 64:03
    and write out phi as a function of eta.
  • 64:03 - 64:06
    And it turns out, you get that
  • 64:06 - 64:09
    phi I is equal to [inaudible] -
  • 64:09 - 64:16
    excuse me.
  • 64:16 - 64:21
    And you get that phi I is equal to
  • 64:21 - 64:28
    E to the eta I over one plus
  • 64:28 - 64:35
    the sum over J of E to the eta J.
  • 64:36 - 64:39
    And the way you do this is you just - this defines
  • 64:39 - 64:43
    eta as a function of the phi, so if you take this and solve for phi, you end up with this. And
  • 64:43 - 64:48
    this is - again, there are a couple of steps of algebra that I'm just not showing.
  • 64:48 - 64:52
    And then lastly, using our
  • 64:52 - 64:56
    assumption that the eta I are a linear function of the inputs
  • 64:56 - 64:57
    X, phi
  • 64:57 - 65:02
    I is therefore equal to E to the theta I, transpose X,
  • 65:02 - 65:06
    divided by one plus sum over
  • 65:06 - 65:09
    J equals one, to K
  • 65:09 - 65:11
    minus one,
  • 65:11 - 65:14
    E to the
  • 65:14 - 65:18
    theta J, transpose
  • 65:18 - 65:24
    X. And this is just using the fact
  • 65:24 - 65:29
    that eta I equals theta I, transpose X, which was our earlier
  • 65:29 - 65:36
    design choice from generalized linear models. So
  • 65:44 - 65:50
    we're just about done.
  • 65:50 - 65:52
    So my learning algorithm
  • 65:52 - 65:56
    will output H of X. I'm going to think of it as outputting the
  • 65:56 - 65:59
    expected value of TY
  • 65:59 - 66:02
    given X and parameterized by theta.
  • 66:02 - 66:05
    So
  • 66:05 - 66:09
    TY was this vector indicator function. So
  • 66:09 - 66:10
    T one
  • 66:10 - 66:13
    was indicator Y equals one,
  • 66:13 - 66:16
    down to indicator Y equals
  • 66:16 - 66:20
    K minus one. All
  • 66:20 - 66:24
    right. So I want my learning algorithm to output this; the expected value of this vector of
  • 66:24 - 66:31
    indicator functions.
  • 66:34 - 66:39
    The expected value of indicator Y equals one is just
  • 66:39 - 66:42
    the probability that Y equals one,
  • 66:42 - 66:45
    which is given by phi one.
  • 66:45 - 66:48
    So I have a random variable that's one whenever Y is equal to one and zero
  • 66:48 - 66:50
    otherwise,
  • 66:50 - 66:52
    so the expected value of that,
  • 66:52 - 66:55
    of this indicator Y equals one is just the
  • 66:55 - 66:59
    probability that Y equals one, which is given by phi one.
  • 66:59 - 67:02
    And therefore,
  • 67:02 - 67:05
    by what we were taught earlier,
  • 67:05 - 67:08
    this is therefore E
  • 67:08 - 67:15
    to the theta one, transpose X over - well - okay.
  • 67:46 - 67:47
    And so my
  • 67:47 - 67:51
    learning algorithm will output the probability that Y equals one, Y equals two, up to Y
  • 67:51 - 67:55
    equals K minus one.
  • 67:55 - 67:58
    And these probabilities are going to be parameterized by
  • 67:58 - 68:05
    these functions like these.
  • 68:21 - 68:25
    And so just to give this algorithm a name,
  • 68:25 - 68:32
    this algorithm is called softmax regression,
  • 68:34 - 68:39
    and is widely thought of as the generalization of
  • 68:39 - 68:42
    logistic regression, which handles two classes,
  • 68:42 - 68:45
  • 68:45 - 68:46
    to the case of
  • 68:46 - 68:50
    K classes rather than two classes.
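To make the final formula concrete, here is a plain-Python sketch (an editor's illustration rather than the Matlab/Octave used in the course; the function name is arbitrary) of the K class probabilities given the K minus one parameter vectors:

```python
import math

def softmax_probs(thetas, x):
    """Compute [phi_1, ..., phi_K] for softmax regression:
    phi_i = exp(theta_i . x) / (1 + sum_j exp(theta_j . x)) for i < K,
    and phi_K = 1 / (1 + sum_j exp(theta_j . x)).
    thetas holds the K-1 parameter vectors theta_1, ..., theta_{K-1}."""
    scores = [math.exp(sum(t * xj for t, xj in zip(theta, x)))
              for theta in thetas]
    z = 1.0 + sum(scores)
    return [s / z for s in scores] + [1.0 / z]
```

Note that the K entries always sum to one, and with all parameters zero every class gets probability 1/K.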
  • 68:50 - 68:53
  • 68:53 - 68:56
    And so just to be very concrete about what you do, right. So you have a machine-learning
  • 68:56 - 69:00
    problem, and you want to apply softmax regression to it. So generally,
  • 69:00 - 69:03
    this recipe will work for the entire derivation. I think the
  • 69:03 - 69:06
    question you had is about how to fit parameters.
  • 69:06 - 69:10
    So let's say you have a machine-learning problem, and
  • 69:10 - 69:13
    Y takes on one of K classes.
  • 69:13 - 69:17
    What you do is you sit down and say, "Okay, I wanna model Y as being
  • 69:17 - 69:19
    multinomial
  • 69:19 - 69:24
    given any X and theta." And so you choose the multinomial as the exponential family. Then you sort
  • 69:24 - 69:27
    of turn the crank. And everything else I wrote down follows
  • 69:27 - 69:30
    automatically from you have made the choice
  • 69:30 - 69:35
    of using the multinomial distribution as your choice of exponential family.
  • 69:35 - 69:38
    And then what you do is you then have this training set, X one,
  • 69:38 - 69:39
    Y,
  • 69:39 - 69:41
    one up to X M,
  • 69:41 - 69:44
    Y, M. So
  • 69:44 - 69:45
    you're
  • 69:45 - 69:49
    given the training set, where now the value of Y takes on one
  • 69:49 - 69:51
    of K possible values.
  • 69:51 - 69:55
    And what you do is you then
  • 69:55 - 69:59
    find the parameters of the model by maximum likelihood. So you write down the likelihood
  • 69:59 - 70:03
    of the parameters, and you maximize the likelihood.
  • 70:03 - 70:07
    So what's the likelihood? Well, the likelihood, as usual, is the
  • 70:07 - 70:09
    product of your training set of
  • 70:09 - 70:11
    P of YI
  • 70:11 - 70:15
    given XI parameterized
  • 70:15 - 70:18
    by theta. That's
  • 70:18 - 70:20
    the likelihood, same as we had before.
  • 70:20 - 70:21
    And that's
  • 70:21 - 70:24
  • 70:24 - 70:25
    product of your
  • 70:25 - 70:26
    training set of -
  • 70:26 - 70:30
    let me write these down now.
  • 70:30 - 70:33
    phi one of indicator YI equals one
  • 70:33 - 70:36
    times phi two of indicator YI
  • 70:36 - 70:38
    equals two, dot,
  • 70:38 - 70:39
    dot, dot,
  • 70:39 - 70:43
    to phi K of indicator YI
  • 70:43 - 70:45
    equals
  • 70:45 - 70:47
    K.
  • 70:47 - 70:50
    Where, for example,
  • 70:50 - 70:55
    phi one depends on theta through this formula. It is E to the theta one,
  • 70:55 - 70:57
    transpose X
  • 70:57 - 70:59
    over
  • 70:59 - 71:02
    one
  • 71:02 - 71:09
    plus sum over J - well, that formula I had just now.
  • 71:09 - 71:12
    And so phi one here is really a
  • 71:12 - 71:16
    shorthand for this formula, and similarly for phi two and so on,
  • 71:16 - 71:20
    up to phi K, where phi K is one minus all of these things. All right.
  • 71:20 - 71:22
    So this is a
  • 71:22 - 71:23
  • 71:23 - 71:26
    -this formula looks more complicated than it really is. What you
  • 71:26 - 71:28
    really do is you write this down,
  • 71:28 - 71:31
    then you take logs, compute a derivative of this formula with respect to theta,
  • 71:31 - 71:34
    and
  • 71:34 - 71:37
    apply say gradient ascent
  • 71:37 - 71:41
    to maximize the likelihood. What are the rows of theta? [Inaudible] it's just been a vector,
  • 71:41 - 71:45
    right? And now it looks
  • 71:45 - 71:49
    like it's two-dimensional. Yeah. In the notation of the lecture notes, I think I have theta one
  • 71:49 - 71:50
    through
  • 71:50 - 71:53
    theta
  • 71:53 - 71:58
    K minus one. I've been thinking of each of these as -
  • 71:58 - 71:59
    an N
  • 71:59 - 72:01
    plus one-dimensional vector.
  • 72:01 - 72:04
    If X is N plus one-dimensional,
  • 72:04 - 72:05
    then I've been - see, I think if
  • 72:05 - 72:09
    you have a set of parameters comprising K minus one vectors,
  • 72:09 - 72:13
    and each of these is a - you could group all of these together and make these, but I
  • 72:13 - 72:16
    just haven't been doing that. [Inaudible] the derivative
  • 72:16 - 72:23
    of K minus one parameter vectors. [Inaudible], what do they correspond to?
  • 72:23 - 72:25
    [Inaudible].
  • 72:25 - 72:29
    We're sort of out of time. Let me take that offline. It's hard to answer in the
  • 72:29 - 72:33
    same way that the logistic regression - what does theta correspond to
  • 72:33 - 72:38
    in logistic regression? You can sort of answer that as sort of - Yeah. It's kind of like
  • 72:38 - 72:43
    the [inaudible] feature - Yeah. Sort of similar interpretation,
  • 72:43 - 72:44
    yeah. That's good. I think I'm running a little bit
  • 72:44 - 72:49
    late. Why don't I - why don't we officially close for the day, but you can come up
  • 72:49 - 72:50
    if you have more questions and take them offline. Thanks.
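The fitting recipe described in the lecture (write down the likelihood, take logs, run gradient ascent) can be sketched in Python. This is an editor's illustration with arbitrary names, not code from the course; the gradient used here, the sum over examples of (indicator Y equals I minus phi I) times X for each theta I, is the standard derivative of the softmax log-likelihood:

```python
import math

def softmax_probs(thetas, x):
    # phi_i = exp(theta_i . x) / (1 + sum_j exp(theta_j . x)); phi_K is the remainder.
    scores = [math.exp(sum(t * xj for t, xj in zip(theta, x))) for theta in thetas]
    z = 1.0 + sum(scores)
    return [s / z for s in scores] + [1.0 / z]

def gradient_ascent_step(thetas, data, alpha):
    """One batch gradient-ascent step on the softmax log-likelihood.
    data is a list of (x, y) pairs with labels y in {1, ..., K}; thetas
    holds the K-1 parameter vectors.  The gradient with respect to
    theta_i is the sum over examples of (indicator(y == i) - phi_i) * x."""
    new = [list(theta) for theta in thetas]
    for x, y in data:
        phis = softmax_probs(thetas, x)
        for i in range(len(thetas)):
            err = (1.0 if y == i + 1 else 0.0) - phis[i]
            for j in range(len(x)):
                new[i][j] += alpha * err * x[j]
    return new
```

Repeating the step drives the parameters toward the maximum-likelihood fit; with K equal to two this reduces to logistic regression.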
Title:
Lecture 4 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on Newton's method, exponential families, and generalized linear models and how they relate to machine learning.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:13:07