
Lecture 5 | Machine Learning (Stanford)

  • 0:10 - 0:12
  • 0:12 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development.
  • 0:25 - 0:30
    So what I want to do today is talk about a different type of learning algorithm, and, in particular,
  • 0:30 - 0:33
    start to talk about generative learning algorithms
  • 0:33 - 0:38
    and the specific algorithm called Gaussian Discriminant Analysis.
  • 0:38 - 0:43
    Take a slight digression, talk about Gaussians, and I'll briefly discuss
  • 0:43 - 0:46
    generative versus discriminative learning algorithms,
  • 0:46 - 0:50
    and then hopefully wrap up today's lecture with a discussion of Naive Bayes and
  • 0:50 - 0:52
    Laplace smoothing.
  • 0:52 - 0:55
    So just to motivate our
  • 0:55 - 0:59
    discussion on generative learning algorithms, right, so by way of contrast,
  • 0:59 - 1:02
    the sort of classification algorithms we've been talking about,
  • 1:02 - 1:09
    I think of as algorithms that do this. So you're given a training set,
  • 1:09 - 1:12
    and
  • 1:12 - 1:15
    if you run an algorithm like logistic regression on those training sets.
  • 1:15 - 1:20
    The way I think of logistic regression is that it looks at the data and is
  • 1:20 - 1:24
    trying to find a straight line to divide the crosses and O's, right? So it's, sort of,
  • 1:24 - 1:27
    trying to find a straight line. Let
  • 1:27 - 1:29
    me " just make the days a bit noisier.
  • 1:29 - 1:33
    Trying to find a straight line
  • 1:33 - 1:34
    that separates out
  • 1:34 - 1:39
    the positive and the negative classes as well as possible, right? And,
  • 1:39 - 1:43
    in fact, I'll show it on the laptop. Maybe just use the screens or the small
  • 1:43 - 1:46
    monitors for this.
  • 1:46 - 1:47
    In fact,
  • 1:47 - 1:50
    you can see there's the data
  • 1:50 - 1:51
    set with
  • 1:51 - 1:53
    logistic regression,
  • 1:53 - 1:54
    and so
  • 1:54 - 1:58
    I've initialized the parameters randomly, and so logistic regression is, kind
  • 1:58 - 1:59
    of, outputting - it's
  • 1:59 - 2:02
    the, kind of, hypothesis that
  • 2:02 - 2:05
    iteration zero is that straight line shown in the bottom right.
  • 2:05 - 2:09
    And so after one iteration of gradient descent, the straight line changes a bit.
  • 2:09 - 2:11
    After two iterations, three,
  • 2:11 - 2:12
    four,
  • 2:12 - 2:15
    until logistic regression converges
  • 2:15 - 2:19
    and has found the straight line that, more or less, separates the positive and negative class, okay? So
  • 2:19 - 2:20
    you can think of this
  • 2:20 - 2:21
    as logistic regression,
  • 2:21 - 2:28
    sort of, searching for a line that separates the positive and the negative classes.
  • 2:28 - 2:31
    What I want to do today is talk about an algorithm that does something slightly
  • 2:31 - 2:33
    different,
  • 2:33 - 2:36
    and to motivate us, let's use our old example of trying to classify
  • 2:36 - 2:41
    tumors into malignant cancer and benign cancer, right? So a patient comes in
  • 2:41 - 2:45
    and they have a cancer; you want to know if it's malignant, meaning a harmful cancer,
  • 2:45 - 2:48
    or if it's benign, meaning a harmless cancer.
  • 2:48 - 2:52
    So rather than trying to find the straight line to separate the two classes, here's something else
  • 2:52 - 2:54
    we could do.
  • 2:54 - 2:56
    We can go from our training set
  • 2:56 - 3:00
    and look at all the cases of malignant cancers, go through, you know, look through our training set for all the
  • 3:00 - 3:04
    positive examples of malignant cancers,
  • 3:04 - 3:09
    and we can then build a model for what malignant cancer looks like.
  • 3:09 - 3:13
    Then we'll go through our training set again and take out all of the examples of benign cancers,
  • 3:13 - 3:18
    and then we'll build a model for what benign cancers look like, okay?
  • 3:18 - 3:19
    And then
  • 3:19 - 3:20
    when you need to
  • 3:20 - 3:25
    classify a new example, when you have a new patient, and you want to decide is this cancer malignant
  • 3:25 - 3:26
    or benign,
  • 3:26 - 3:28
    you then take your new cancer, and you
  • 3:28 - 3:31
    match it to your model of malignant cancers,
  • 3:31 - 3:36
    and you match it to your model of benign cancers, and you see which model it matches better, and
  • 3:36 - 3:39
    depending on which model it matches better to, you then
  • 3:39 - 3:44
    predict whether the new cancer is malignant or benign,
  • 3:44 - 3:46
    okay?
  • 3:46 - 3:47
    So
  • 3:47 - 3:50
    what I just described, this class
  • 3:50 - 3:53
    of methods where you build a separate model for malignant cancers and
  • 3:53 - 3:55
    a separate model for benign cancers
  • 3:55 - 3:58
    is called a generative learning algorithm,
  • 3:58 - 4:01
    and let me just, kind of, formalize this.
  • 4:01 - 4:03
  • 4:03 - 4:05
    So in the models that we've
  • 4:05 - 4:08
    been talking about previously, those were actually
  • 4:08 - 4:14
    all discriminative learning algorithms,
  • 4:14 - 4:19
    and stated more formally, a discriminative learning algorithm is one
  • 4:19 - 4:23
    that either learns P of Y
  • 4:23 - 4:25
    given X directly,
  • 4:25 - 4:27
  • 4:27 - 4:30
    or even
  • 4:30 - 4:33
    learns a hypothesis
  • 4:33 - 4:40
    that
  • 4:40 - 4:43
    outputs values 0, 1 directly,
  • 4:43 - 4:47
    okay? So logistic regression is an example of
  • 4:47 - 4:49
    a discriminative learning algorithm.
  • 4:49 - 4:56
    In contrast, a generative learning algorithm
  • 4:58 - 5:00
    models P of X given Y.
  • 5:00 - 5:03
    The probability of the features given the class label,
  • 5:03 - 5:04
  • 5:04 - 5:11
    and as a technical detail, it also models P of Y, but that's a less important thing, and the
  • 5:12 - 5:15
    interpretation of this is that a generative model
  • 5:15 - 5:17
    builds a probabilistic model for
  • 5:17 - 5:22
    what the features look like,
  • 5:22 - 5:27
    conditioned on the
  • 5:27 - 5:31
    class label, okay? In other words, conditioned on whether a cancer is
  • 5:31 - 5:35
    malignant or benign, it models probability distribution over what the features
  • 5:35 - 5:37
    of the cancer look like.
  • 5:37 - 5:40
    Then having built this model - having built a model for P of X
  • 5:40 - 5:42
    given Y and P of Y,
  • 5:42 - 5:46
    then by Bayes rule, obviously, you can compute P of Y = 1,
  • 5:46 - 5:47
    conditioned on X.
  • 5:47 - 5:49
    This is just
  • 5:49 - 5:51
    P of X
  • 5:51 - 5:53
    given Y = 1
  • 5:53 - 5:56
    times P of Y = 1
  • 5:56 - 5:59
    divided by P of X,
  • 5:59 - 6:01
    and, if necessary,
  • 6:01 - 6:06
    you can calculate the denominator
  • 6:06 - 6:13
    by summing P of X given Y times P of Y over the two values of Y, right?
  • 6:19 - 6:20
  • 6:20 - 6:23
    And so by modeling P of X given Y
  • 6:23 - 6:26
    and modeling P of Y, you can actually use Bayes rule to get back to P of Y given
  • 6:26 - 6:28
    X,
  • 6:28 - 6:30
    but a generative model -
  • 6:30 - 6:33
    generative learning algorithm starts in modeling P of X given Y, rather than P of Y
  • 6:33 - 6:35
    given X, okay?
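    Written out, the computation just described is the standard statement of Bayes rule (a reconstruction of the spoken equations):

        p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)},
        \qquad p(x) = p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)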
  • 6:35 - 6:38
    We'll talk about some of the tradeoffs, and why this may be a
  • 6:38 - 6:42
    better or worse idea than a discriminative model a bit later.
  • 6:42 - 6:47
    Let's go through a specific example of a generative learning algorithm,
  • 6:47 - 6:48
    and for
  • 6:48 - 6:51
    this specific motivating example, I'm
  • 6:51 - 6:54
    going to assume
  • 6:54 - 6:56
    that your input feature is X
  • 6:56 - 6:58
    in RN
  • 6:58 - 7:00
  • 7:00 - 7:03
    and are
  • 7:03 - 7:06
    continuous values, okay?
  • 7:06 - 7:08
    And under this assumption, let
  • 7:08 - 7:15
    me describe to you a specific algorithm called Gaussian Discriminant Analysis,
  • 7:20 - 7:23
  • 7:23 - 7:26
    and the, I
  • 7:26 - 7:30
    guess, core assumption is that we're going to assume in the Gaussian discriminant analysis
  • 7:30 - 7:32
    model of that P of X given Y
  • 7:32 - 7:39
    is Gaussian, okay? So
  • 7:40 - 7:43
    actually just raise your hand, how many of you have seen a multivariate Gaussian before -
  • 7:43 - 7:46
    not a 1D Gaussian, but a higher-dimensional one?
  • 7:46 - 7:51
    Okay, cool, like maybe half of you, two-thirds of
  • 7:51 - 7:55
    you. So let me just say a few words about Gaussians, and for those of you that have seen
  • 7:55 - 7:57
    it before, it'll be a refresher.
  • 7:57 - 8:03
    So we say that a random variable Z is distributed Gaussian - multivariate Gaussian - written Z ~ N(mu, Sigma), with
    the script N for normal, with parameters mean mu and covariance Sigma, if Z has the density

        p(z) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (z - \mu)^\top \Sigma^{-1} (z - \mu) \right),

    okay?
  • 8:31 - 8:36
    That's the formula for the density. It's the generalization of the one-dimensional Gaussian, and no
  • 8:36 - 8:38
    longer the familiar bell-shaped curve; it's now for
  • 8:38 - 8:42
    a high-dimensional, vector-valued random variable
  • 8:42 - 8:43
    Z.
  • 8:43 - 8:47
    Don't worry too much about this formula for the density. You rarely end up needing to
  • 8:47 - 8:48
    use it,
  • 8:48 - 8:52
    but the two key quantities are this
  • 8:52 - 8:55
    vector mu is the mean of the Gaussian and this
  • 8:55 - 8:57
    matrix sigma is
  • 8:57 - 9:00
    the covariance matrix -
  • 9:00 - 9:02
    covariance,
  • 9:02 - 9:05
    and so
  • 9:05 - 9:07
    sigma will be equal to,
  • 9:07 - 9:12
    right, the definition of the covariance of a vector-valued random variable
  • 9:12 - 9:14
    is the expected value of (Z - mu) times (Z - mu) transpose,
  • 9:14 - 9:17
    okay?
  • 9:17 - 9:20
    And, actually, if this
  • 9:20 - 9:23
    doesn't look familiar to you,
  • 9:23 - 9:26
    you might re-watch the
  • 9:26 - 9:29
    discussion section that the TAs held last Friday
  • 9:29 - 9:34
    or the one that they'll be holding later this week on, sort of, a recap of
  • 9:34 - 9:34
    probability, okay?
  • 9:34 - 9:38
    So
  • 9:38 - 9:41
    a multivariate Gaussian is parameterized by a mean and a covariance, and let me just -
  • 9:41 - 9:42
    can I
  • 9:42 - 9:46
    have the laptop displayed, please?
  • 9:46 - 9:49
    I'll just go ahead and actually show you,
  • 9:49 - 9:52
    you know, graphically, the effects of
  • 9:52 - 9:54
    varying a Gaussian -
  • 9:54 - 9:56
    varying the parameters of a Gaussian.
  • 9:56 - 9:59
    So what I have up here
  • 9:59 - 10:03
    is the density of a zero mean Gaussian with
  • 10:03 - 10:06
    covariance matrix equals the identity. The covariance matrix is shown in the upper right-hand
  • 10:06 - 10:08
    corner of the slide, and
  • 10:08 - 10:12
    there's the familiar bell-shaped curve in two dimensions.
  • 10:12 - 10:17
    And so if I shrink the covariance matrix - instead of the covariance being the identity, if
  • 10:17 - 10:22
    I shrink the covariance matrix, then the Gaussian becomes more peaked,
  • 10:22 - 10:25
    and if I widen the covariance, so like sigma = 2, 2 on the diagonal,
  • 10:25 - 10:26
    then
  • 10:26 - 10:31
    the distribution - well, the density becomes more spread out, okay?
  • 10:31 - 10:33
    That was the standard normal, identity
  • 10:33 - 10:35
    covariance.
  • 10:35 - 10:37
    If I increase
  • 10:37 - 10:39
    the off-diagonals of the covariance matrix, right,
  • 10:39 - 10:44
    if I make the variables correlated, the
  • 10:44 - 10:45
    Gaussian becomes flattened out in this X = Y direction, and
  • 10:45 - 10:47
    increase it even further,
  • 10:47 - 10:52
    then my variables - X and Y, right - excuse me, Z1 and Z2,
  • 10:52 - 10:57
    my two variables on the horizontal axes - become even more correlated. I'll just show the same thing in
  • 10:57 - 10:59
    contours.
  • 10:59 - 11:03
    The standard normal distribution has contours that are - they're actually
  • 11:03 - 11:07
    circles. Because of the aspect ratio, these look like ellipses.
  • 11:07 - 11:09
    These should actually be circles,
  • 11:09 - 11:13
    and if you increase the off diagonals of the Gaussian covariance matrix,
  • 11:13 - 11:15
    then it becomes
  • 11:15 - 11:16
    ellipses aligned along the, sort of,
  • 11:16 - 11:18
    45 degree angle
  • 11:18 - 11:21
    in this example.
  • 11:21 - 11:26
    This is the same thing. Here's an example of a Gaussian density with negative covariances.
  • 11:26 - 11:29
    So now the correlation
  • 11:29 - 11:33
    goes the other way, so that even strong [inaudible] of covariance and the same thing in
  • 11:33 - 11:36
    contours. This is a Gaussian with negative entries on the off-diagonals and even
  • 11:36 - 11:41
    larger negative entries on the off-diagonals, okay?
  • 11:41 - 11:45
    And the other parameter for the Gaussian is the mean parameter, so if this is - with mu = 0,
  • 11:45 - 11:48
    and as you change the mean parameter,
  • 11:48 - 11:50
    this is mu equals 0.15,
  • 11:50 - 11:56
    the location of the Gaussian just moves around, okay?
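    As a minimal sketch of what those slides are showing (this code is my own, not from the lecture, and assumes NumPy is available), you can evaluate a 2D Gaussian density and watch how changing the covariance changes its shape:

        import numpy as np

        def gaussian_density(z, mu, sigma):
            # Density of N(mu, sigma) evaluated at the point z.
            n = len(mu)
            diff = z - mu
            norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
            return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

        mu = np.zeros(2)
        covariances = [
            np.eye(2),                            # identity covariance: round contours
            0.5 * np.eye(2),                      # shrunk covariance: more peaked density
            np.array([[1.0, 0.8], [0.8, 1.0]]),   # positive off-diagonals: correlated,
        ]                                         # contours tilted along the 45-degree line
        for sigma in covariances:
            print(gaussian_density(np.array([0.5, 0.5]), mu, sigma))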
  • 11:56 - 12:02
    All right. So that was a quick primer on what Gaussians look like, and here's
  • 12:02 - 12:02
  • 12:02 - 12:06
    as a roadmap or a picture to keep in mind when we describe the Gaussian discriminant
  • 12:06 - 12:09
    analysis algorithm, this is what we're going to do. Here's
  • 12:09 - 12:12
    the training set,
  • 12:12 - 12:15
    and in the Gaussian discriminant analysis algorithm,
  • 12:15 - 12:19
    what I'm going to do is I'm going to look at the positive examples, say the crosses,
  • 12:19 - 12:24
    and just looking at only the positive examples, I'm gonna fit a Gaussian distribution to the
  • 12:24 - 12:25
    positive examples, and so
  • 12:25 - 12:28
    maybe I end up with a Gaussian distribution like that,
  • 12:28 - 12:31
    okay? So there's P of X given Y = 1.
  • 12:31 - 12:34
    And then I'll look at the negative examples, the O's in this figure,
  • 12:34 - 12:37
    and I'll fit a Gaussian to that, and maybe I get a
  • 12:37 - 12:38
    Gaussian
  • 12:38 - 12:41
    centered over there. This is the contour of my second Gaussian,
  • 12:41 - 12:42
    and together -
  • 12:42 - 12:45
    we'll say how later -
  • 12:45 - 12:51
    together these two Gaussian densities will define a separator for these two classes, okay?
  • 12:51 - 12:55
    And it'll turn out that the separator will turn out to be a little bit
  • 12:55 - 12:56
    different
  • 12:56 - 12:57
    from what logistic regression
  • 12:57 - 12:59
    gives you.
  • 12:59 - 13:00
    If you run logistic regression,
  • 13:00 - 13:04
    you actually get the decision boundary shown in the green line, whereas Gaussian discriminant
  • 13:04 - 13:07
    analysis gives you the blue line, okay? Switch back to chalkboard, please. All right. Here's the
  • 13:07 - 13:14
  • 13:29 - 13:34
    Gaussian discriminant analysis model. I'm going
  • 13:34 - 13:37
    to model P of Y
  • 13:37 - 13:41
    as a Bernoulli random variable as usual, but
  • 13:41 - 13:46
    as a Bernoulli random variable and parameterized by parameter phi; you've
  • 13:46 - 13:47
    seen this before.
  • 13:47 - 13:49
  • 13:49 - 13:56
    Model P of X given Y = 0 as a Gaussian -
  • 13:58 - 14:01
    oh, you know what? Yeah,
  • 14:01 - 14:05
    yes, excuse me. I
  • 14:05 - 14:06
    thought this looked strange.
  • 14:06 - 14:12
    This
  • 14:12 - 14:13
    should be a sigma,
  • 14:13 - 14:16
    a determinant of sigma to the one-half in the denominator there.
  • 14:16 - 14:18
    It's no big deal. It was - yeah,
  • 14:18 - 14:20
    well, okay. Right.
  • 14:20 - 14:23
    I
  • 14:23 - 14:27
    was missing the determinant of sigma to the one-half on a previous
  • 14:27 - 14:34
    board, excuse me. Okay,
  • 14:40 - 14:44
    and so I model P of X given Y = 0 as a Gaussian
  • 14:44 - 14:48
    with mean mu0 and covariance sigma, and P of X given Y = 1 as a Gaussian with mean mu1 and
  • 14:48 - 14:55
    covariance sigma,
  • 14:55 - 15:01
  • 15:01 - 15:02
    and
  • 15:02 - 15:09
    -
  • 15:11 - 15:13
    okay?
  • 15:13 - 15:17
    And so the parameters of this model are
  • 15:17 - 15:18
    phi,
  • 15:18 - 15:21
    mu0,
  • 15:21 - 15:24
    mu1, and sigma,
  • 15:24 - 15:25
    and so
  • 15:25 - 15:29
    I can now write down the likelihood of the parameters
  • 15:29 - 15:30
    as - oh, excuse
  • 15:30 - 15:37
    me, actually, the log likelihood of the parameters as the log of
  • 15:40 - 15:41
    that,
  • 15:41 - 15:45
    right?
  • 15:45 - 15:48
    So, in other words, if I'm given the training set, then
  • 15:48 - 15:53
    I can write down the log likelihood of the parameters as the log of, you know,
  • 15:53 - 15:58
    the product of the probabilities P of XI, YI, right?
  • 15:58 - 16:04
  • 16:04 - 16:10
  • 16:10 - 16:12
    And
  • 16:12 - 16:15
    this is just equal to that where each of these terms, P of XI given YI,
  • 16:15 - 16:17
    or P of YI is
  • 16:17 - 16:19
    then given
  • 16:19 - 16:20
  • 16:20 - 16:24
    by one of these three equations on top, okay?
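    Written out, the three equations on top and the joint log likelihood being referred to are (reconstructed here in standard notation):

        p(y) = \phi^{y} (1 - \phi)^{1 - y}, \qquad
        x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma), \qquad
        x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)

        \ell(\phi, \mu_0, \mu_1, \Sigma)
            = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)})
            = \sum_{i=1}^{m} \left[ \log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)}) \right]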
  • 16:24 - 16:26
    And I just want
  • 16:26 - 16:30
    to contrast this again with discriminative learning algorithms, right?
  • 16:30 - 16:32
    So
  • 16:32 - 16:35
    to give this a name, I guess, this sometimes is actually called
  • 16:35 - 16:39
    the Joint Data Likelihood - the Joint Likelihood,
  • 16:39 - 16:42
    and
  • 16:42 - 16:45
    let me just contrast this with what we had previously
  • 16:45 - 16:48
    when we're talking about logistic
  • 16:48 - 16:52
    regression, where I said the log likelihood of the parameters theta
  • 16:52 - 16:53
    was log
  • 16:53 - 16:57
    of a product I = 1 to M, P of YI
  • 16:57 - 16:59
    given XI
  • 16:59 - 17:02
    and parameterized
  • 17:02 - 17:04
    by theta, right?
  • 17:04 - 17:05
    So
  • 17:05 - 17:08
    back when we were fitting logistic regression models or generalized linear
  • 17:08 - 17:09
    models,
  • 17:09 - 17:14
    we're always modeling P of YI given XI, parameterized by theta, and that was the
  • 17:14 - 17:16
    conditional
  • 17:16 - 17:17
  • 17:17 - 17:20
    likelihood, okay,
  • 17:20 - 17:24
    in
  • 17:24 - 17:27
    which we're modeling P of YI given XI,
  • 17:27 - 17:28
    whereas, now,
  • 17:28 - 17:31
    in generative learning algorithms, we're going to look at the joint likelihood, which
  • 17:31 - 17:35
    is P of XI, YI, okay?
  • 17:35 - 17:36
    So let's
  • 17:36 - 17:43
    see.
  • 17:46 - 17:48
    So given the training sets
  • 17:48 - 17:51
    and using the Gaussian discriminant analysis model
  • 17:51 - 17:56
    to fit the parameters of the model, we'll do maximum likelihood estimation as usual,
  • 17:56 - 17:59
    and so you maximize your
  • 17:59 - 18:02
    L
  • 18:02 - 18:06
    with respect to the parameters phi, mu0,
  • 18:06 - 18:08
    mu1, sigma,
  • 18:08 - 18:10
    and so
  • 18:10 - 18:16
    if we find the maximum likelihood estimate of parameters, you find that phi is
  • 18:16 - 18:17
    -
  • 18:17 - 18:22
    the maximum likelihood estimate is actually no surprise, and I'm writing this down mainly as a practice for
  • 18:22 - 18:24
    indicator notation,
  • 18:24 - 18:27
    all right? So the maximum likelihood estimate for phi would be Sum over
  • 18:27 - 18:30
    I, YI divided by M,
  • 18:30 - 18:32
  • 18:32 - 18:35
    or written alternatively as Sum over -
  • 18:35 - 18:42
    all your training examples of indicator YI = 1, divided by M, okay?
  • 18:42 - 18:43
  • 18:43 - 18:46
    In other words, maximum likelihood estimate for a
  • 18:46 - 18:47
  • 18:47 - 18:53
    the parameter phi is just the fraction of training examples with
  • 18:53 - 18:57
    label one, with Y equals 1.
  • 18:57 - 19:03
    Maximum likelihood estimate for mu0 is this - sum over I of indicator YI = 0 times XI, divided by sum over I of indicator YI = 0 - okay?
  • 19:03 - 19:10
    You
  • 19:22 - 19:24
    should stare at this for a second and
  • 19:24 - 19:26
    see if it makes sense.
  • 19:26 - 19:33
    Actually, I'll just write on the next one for mu1 while you do that.
  • 19:34 - 19:41
    Okay?
  • 19:50 - 19:52
    So what this is is
  • 19:52 - 19:57
    what the denominator is: a sum over your training set of the indicator that YI = 0.
  • 19:57 - 20:02
    So for every training example for which YI = 0, this will
  • 20:02 - 20:06
    increment the count by one, all right? So the
  • 20:06 - 20:08
    denominator is just
  • 20:08 - 20:11
    the number of examples
  • 20:11 - 20:15
  • 20:15 - 20:15
    with
  • 20:15 - 20:21
    label zero, all right?
  • 20:21 - 20:23
    And then the numerator will be, let's
  • 20:23 - 20:26
    see, sum from I = 1 to M - every time
  • 20:26 - 20:30
    YI is equal to 0, this will be a one, and otherwise, this thing will be zero,
  • 20:30 - 20:31
    and so
  • 20:31 - 20:34
    this indicator function means that you're including
  • 20:34 - 20:38
    only the terms for
  • 20:38 - 20:40
    which YI is equal to one - rather, only the terms for which Y is equal to zero -
  • 20:40 - 20:45
    because for all the times where YI is equal to one,
  • 20:45 - 20:48
    this sum and will be equal to zero,
  • 20:48 - 20:53
    and then you multiply that by XI, and so the numerator is
  • 20:53 - 20:56
    really the sum
  • 20:56 - 21:01
    of XI's corresponding to
  • 21:01 - 21:03
  • 21:03 - 21:09
    examples where the class labels were zero, okay?
  • 21:09 - 21:15
    Raise your hand if this makes sense. Okay, cool.
  • 21:15 - 21:17
    So just to say this informally,
  • 21:17 - 21:20
    this just means: look through your training set,
  • 21:20 - 21:23
    find all the examples for which Y = 0,
  • 21:23 - 21:25
    and take the average
  • 21:25 - 21:30
    of the value of X for all your examples for which Y = 0. So take all your negative training
  • 21:30 - 21:30
    examples
  • 21:30 - 21:33
    and average the values for X
  • 21:33 - 21:38
    and
  • 21:38 - 21:41
    that's mu0, okay? If this
  • 21:41 - 21:45
    notation is still a little bit cryptic - if you're still not sure why this
  • 21:45 - 21:48
    equation translates into
  • 21:48 - 21:53
    what I just said, do go home and stare at it for a while until it just makes sense. This is, sort
  • 21:53 - 21:57
    of, no surprise. It just says to estimate the mean for the negative examples,
  • 21:57 - 22:00
    take all your negative examples, and average them. So
  • 22:00 - 22:07
    no surprise, but this is useful practice with indicator notation.
  • 22:07 - 22:08
    [Inaudible]
  • 22:08 - 22:11
    derive the maximum likelihood estimate for sigma. I won't do that. You can read that in
  • 22:11 - 22:17
    the notes yourself.
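    For concreteness, here is a minimal sketch of those maximum likelihood estimates in NumPy (my own code, not from the lecture), assuming X is an m-by-n array of inputs and y is a length-m array of 0/1 labels:

        import numpy as np

        def fit_gda(X, y):
            # Maximum likelihood estimates of phi, mu0, mu1, and the shared Sigma.
            m, n = X.shape
            phi = np.mean(y == 1)              # fraction of examples with label 1
            mu0 = X[y == 0].mean(axis=0)       # average of the negative examples
            mu1 = X[y == 1].mean(axis=0)       # average of the positive examples
            # Shared covariance: average outer product of each example's
            # deviation from its own class mean.
            mus = np.where((y == 1)[:, None], mu1, mu0)
            diffs = X - mus
            sigma = diffs.T @ diffs / m
            return phi, mu0, mu1, sigma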
  • 22:17 - 22:21
  • 22:21 - 22:24
    And so having fit
  • 22:24 - 22:29
    the parameters phi, mu0, mu1, and sigma
  • 22:29 - 22:30
    to your data,
  • 22:30 - 22:36
    well, you now need to make a prediction. You
  • 22:36 - 22:40
    know, when you're given a new value of X, when you're given a new cancer, you need to predict whether
  • 22:40 - 22:42
    it's malignant or benign.
  • 22:42 - 22:45
    Your prediction is then going to be,
  • 22:45 - 22:47
    let's say,
  • 22:47 - 22:50
    the most likely value of Y given X. I should
  • 22:50 - 22:55
    write, semicolon, the parameters there. I'll just leave that
  • 22:55 - 22:57
    - which is the arg max over Y
  • 22:57 - 23:04
    by Bayes rule, all right?
  • 23:06 - 23:13
    And that is, in turn,
  • 23:15 - 23:15
    just that
  • 23:15 - 23:20
    because the denominator P of X doesn't depend on Y,
  • 23:20 - 23:22
    and
  • 23:22 - 23:24
    if P of Y
  • 23:24 - 23:27
    is uniform.
  • 23:27 - 23:31
    In other words, if each of your classes is equally likely,
  • 23:31 - 23:33
    so if P of Y
  • 23:33 - 23:37
    takes the same value for all values
  • 23:37 - 23:38
    of Y,
  • 23:38 - 23:42
    then this is just arg max over Y of P of X
  • 23:42 - 23:47
    given Y, okay?
  • 23:47 - 23:51
    This happens sometimes, maybe not very often, so usually you end up using this
  • 23:51 - 23:53
    formula where you
  • 23:53 - 23:56
    compute P of X given Y and P of Y using
  • 23:56 - 24:00
    your model, okay? Student:Can
  • 24:00 - 24:01
    you give
  • 24:01 - 24:05
    us arg max? Instructor (Andrew Ng):Oh, let's see. So if you take - actually
  • 24:05 - 24:09
    let me.
  • 24:09 - 24:10
    So the min of
  • 24:10 - 24:15
    - arg max means the value of Y that maximizes this. Student:Oh, okay. Instructor (Andrew Ng):So just
  • 24:15 - 24:19
    for an example, the min over X of X - 5
  • 24:19 - 24:23
    squared is 0 because by choosing X equals 5, you can get this to be zero,
  • 24:23 - 24:24
    and the arg min over X
  • 24:24 - 24:31
    of X - 5 squared is equal to 5 because 5 is the value of X that minimizes this, okay?
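    So, as a sketch of this prediction rule (my own code, not from the lecture; it assumes SciPy and parameters like those fitted above), you pick the value of y that maximizes P of X given Y times P of Y, and the P of X denominator is dropped because it doesn't depend on y:

        import numpy as np
        from scipy.stats import multivariate_normal

        def predict_gda(x, phi, mu0, mu1, sigma):
            # arg max over y of p(x|y) p(y); p(x) is omitted since it
            # is the same for both classes.
            score0 = multivariate_normal.pdf(x, mean=mu0, cov=sigma) * (1 - phi)
            score1 = multivariate_normal.pdf(x, mean=mu1, cov=sigma) * phi
            return 1 if score1 > score0 else 0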
  • 24:31 - 24:38
    Cool. Thanks
  • 24:38 - 24:45
    for
  • 24:45 - 24:52
    asking that. Instructor (Andrew Ng):Okay. Actually any other questions about this? Yeah? Student:Why is
  • 25:01 - 25:04
    distributive removing? Why isn't [inaudible] - Instructor (Andrew Ng):Oh, I see. By uniform I meant - I was being loose
  • 25:04 - 25:06
    here.
  • 25:06 - 25:07
    I meant if
  • 25:07 - 25:11
    P of Y = 0 is equal to P of Y = 1, or if Y is the
  • 25:11 - 25:13
    uniform distribution over
  • 25:13 - 25:14
    the set 0 and 1. Student:Oh. Instructor (Andrew Ng):I just meant - yeah, if P of Y = 0
  • 25:14 - 25:15
    equals P of Y = 1. That's all I mean, see? Anything else?
  • 25:15 - 25:17
    All
  • 25:17 - 25:20
    right. Okay. So
  • 25:20 - 25:22
  • 25:22 - 25:28
  • 25:28 - 25:35
    it
  • 25:42 - 25:47
    turns out Gaussian discriminant analysis has an interesting relationship
  • 25:47 - 25:49
    to logistic
  • 25:49 - 25:52
    regression. Let me illustrate that.
  • 25:52 - 25:55
  • 25:55 - 25:59
    So let's say you have a training set
  • 25:59 - 26:06
    - actually let me just go ahead and draw 1D training set, and that will
  • 26:12 - 26:15
    kind of work, yes, okay.
  • 26:15 - 26:18
    So let's say we have a training set comprising a few negative and a few positive examples,
  • 26:18 - 26:23
    and let's say I run Gaussian discriminant analysis. So I'll fit Gaussians to each of these two
  • 26:23 - 26:25
    densities - a Gaussian density to each of these two - to my positive
  • 26:25 - 26:27
    and negative training
  • 26:27 - 26:30
    examples,
  • 26:30 - 26:33
    and so maybe my
  • 26:33 - 26:38
    positive examples, the X's, are fit with a Gaussian like this,
  • 26:38 - 26:40
  • 26:40 - 26:45
  • 26:45 - 26:48
    and my negative examples I will fit, and you have a
  • 26:48 - 26:55
    Gaussian that looks like that, okay?
  • 26:57 - 27:00
    Now, I
  • 27:00 - 27:04
    hope this [inaudible]. Now, let's
  • 27:04 - 27:07
    vary along the X axis,
  • 27:07 - 27:10
    and what I want to do is I'll
  • 27:10 - 27:14
    overlay on top of this plot. I'm going to plot
  • 27:14 - 27:21
    P of Y = 1 - no, actually,
  • 27:24 - 27:26
    given X
  • 27:26 - 27:32
    for a variety of values X, okay?
  • 27:32 - 27:33
    So I actually realize what I should have done.
  • 27:33 - 27:37
    I'm gonna call the X's the negative examples, and I'm gonna call the O's the positive examples. It just
  • 27:37 - 27:40
    makes this part come in better.
  • 27:40 - 27:44
    So let's take a value of X that's fairly small. Let's say X is this value here on a horizontal
  • 27:44 - 27:45
    axis.
  • 27:45 - 27:49
    Then what's the probability of Y being equal to one conditioned on X? Well,
  • 27:49 - 27:53
    the way you calculate that is you write P of Y
  • 27:53 - 27:57
    = 1 given X, and then you plug in all these formulas as usual, right? It's P of X
  • 27:57 - 27:59
    given Y = 1, which is
  • 27:59 - 28:01
    your Gaussian density,
  • 28:01 - 28:04
    times P of Y = 1, you know, which is
  • 28:04 - 28:08
    essentially - this is just going to be equal to phi,
  • 28:08 - 28:10
    and then divided by,
  • 28:10 - 28:12
    right, P of X, and then this shows you how you can calculate this.
  • 28:12 - 28:14
    By using
  • 28:14 - 28:19
    these two Gaussians and my phi on P of Y, I actually compute what P of Y
  • 28:19 - 28:21
    = 1 given X is, and
  • 28:21 - 28:25
    in this case,
  • 28:25 - 28:29
    if X is this small, clearly it belongs to the left Gaussian. It's very unlikely to belong to
  • 28:29 - 28:31
    a positive class, and so
  • 28:31 - 28:36
    it'll be very small; it'll be very close to zero say, okay?
  • 28:36 - 28:41
    And then we can increment the value of X a bit, and study a different value of X, and
  • 28:41 - 28:44
    plot what is the P of Y given X - P of Y
  • 28:44 - 28:50
    = 1 given X, and, again, it'll be pretty small. Let's
  • 28:50 - 28:52
    use a point like that, right? At this point,
  • 28:52 - 28:54
    the two Gaussian densities
  • 28:54 - 28:56
    have equal value,
  • 28:56 - 28:58
    and if
  • 28:58 - 28:59
    I ask
  • 28:59 - 29:02
    if X is this value, right, shown by the arrow,
  • 29:02 - 29:03
  • 29:03 - 29:07
    what's the probability of Y being equal to one for that value of X? Well, you really can't tell, so maybe it's about 0.5, okay? And if
  • 29:07 - 29:11
    you fill
  • 29:11 - 29:14
    in a bunch more points, you get a
  • 29:14 - 29:16
    curve like that,
  • 29:16 - 29:21
    and then you can keep going. Let's say for a point like that, you can ask what's the probability of Y
  • 29:21 - 29:22
    being one? Well, if
  • 29:22 - 29:25
    it's that far out, then clearly, it belongs to this
  • 29:25 - 29:28
    rightmost Gaussian, and so
  • 29:28 - 29:32
    the probability of Y being a one would be very high; it would be almost one, okay?
  • 29:32 - 29:34
    And so you
  • 29:34 - 29:36
    can repeat this exercise
  • 29:36 - 29:39
    for a bunch of points. All right,
  • 29:39 - 29:42
    compute P of Y equals one given X for a bunch of points,
  • 29:42 - 29:46
    and if you connect up these points,
  • 29:46 - 29:48
    you find that the
  • 29:48 - 29:50
    curve you get [Pause] plotted
  • 29:50 - 29:56
    takes a form of sigmoid function, okay? So,
  • 29:56 - 30:00
    in other words, when you make the assumptions under the Gaussian
  • 30:00 - 30:02
  • 30:02 - 30:06
    discriminant analysis model,
  • 30:06 - 30:08
    that P of X given Y is Gaussian,
  • 30:08 - 30:13
    when you go back and compute what P of Y given X is, you actually get back
  • 30:13 - 30:19
    exactly the same sigmoid function
  • 30:19 - 30:23
    that we're using which is the progression, okay? But it turns out the key difference is that
  • 30:23 - 30:27
    Gaussian discriminant analysis will end up choosing a different
  • 30:27 - 30:29
    position
  • 30:29 - 30:32
    and a steepness of the sigmoid
  • 30:32 - 30:36
    than would logistic regression. Is there a question? Student:I'm just
  • 30:36 - 30:40
    wondering,
  • 30:40 - 30:45
    the Gaussian of P of Y [inaudible] you do? Instructor (Andrew Ng):No, let's see. The Gaussian - so this Gaussian is
  • 30:45 - 30:48
    P of X given Y = 1, and
  • 30:48 - 30:50
    this Gaussian is P of X
  • 30:50 - 30:57
    given Y = 0; does that make sense? Anything else? Student:Okay. Instructor (Andrew Ng):Yeah? Student:When you drawing all the dots, how did you
  • 31:02 - 31:04
    decide what Y
  • 31:04 - 31:06
    given
  • 31:06 - 31:10
    P of X was? Instructor (Andrew Ng):What - say that again. Student:I'm sorry. Could you go over how you
  • 31:10 - 31:12
    figured out where
  • 31:12 - 31:14
    to draw each dot? Instructor (Andrew Ng):Let's see,
  • 31:14 - 31:16
    okay. So the
  • 31:16 - 31:19
    computation is as follows, right? The steps
  • 31:19 - 31:22
    are I have the training sets, and so given my training set, I'm going to fit
  • 31:22 - 31:25
    a Gaussian discriminant analysis model to it,
  • 31:25 - 31:29
    and what that means is I'll build a model for P of X given Y = 1. I'll
  • 31:29 - 31:30
    build
  • 31:30 - 31:33
    a model for P of X given Y = 0,
  • 31:33 - 31:38
    and I'll also fit a Bernoulli distribution to
  • 31:38 - 31:39
    P of Y, okay?
  • 31:39 - 31:43
    So, in other words, given my training set, I'll fit P of X given Y and P of Y
  • 31:43 - 31:47
    to my data, and now I've chosen my parameters
  • 31:47 - 31:49
    of phi, mu0,
  • 31:49 - 31:50
    mu1,
  • 31:50 - 31:52
    and the sigma, okay? Then
  • 31:52 - 31:55
    this is the process I went through
  • 31:55 - 31:59
    to plot all these dots, right? It's just I pick a point in the X axis,
  • 31:59 - 32:01
    and then I compute
  • 32:01 - 32:03
    P of Y given X
  • 32:03 - 32:05
    for that value of X,
  • 32:05 - 32:09
    and P of Y = 1 conditioned on X will be some value between zero and one. It'll
  • 32:09 - 32:12
    be some real number, and whatever that real number is, I then plot it on the vertical
  • 32:12 - 32:14
    axis,
  • 32:14 - 32:20
    okay? And the way I compute P of Y = 1 conditioned on X is
  • 32:20 - 32:21
    I would
  • 32:21 - 32:23
    use these quantities. I would use
  • 32:23 - 32:25
    P of X given Y
  • 32:25 - 32:30
    and P of Y, and, sort of, plug them into Bayes rule, and that allows me
  • 32:30 - 32:31
    to
  • 32:31 - 32:32
    compute P of Y given X
  • 32:32 - 32:34
    from these three quantities; does that make
  • 32:34 - 32:35
    sense? Student:Yeah. Instructor (Andrew Ng):Was there something more that -
  • 32:35 - 32:39
    Student:And how did you model P of X; is that - Instructor (Andrew Ng):Oh, okay. Yeah, so
  • 32:39 - 32:42
    -
  • 32:42 - 32:43
    well,
  • 32:43 - 32:45
    got this right
  • 32:45 - 32:49
    here. So P of X can be written as,
  • 32:49 - 32:50
  • 32:50 - 32:53
    right,
  • 32:53 - 32:55
    so
  • 32:55 - 32:59
    P of X given Y = 0 times P of Y = 0, plus P of X given Y = 1 times P of Y =
  • 32:59 - 33:03
    1, right?
  • 33:03 - 33:07
    And so each of these terms, P of X given Y
  • 33:07 - 33:11
    and P of Y, these are terms I can get out of, directly, from my Gaussian discriminant
  • 33:11 - 33:14
    analysis model. Each of these terms is something that
  • 33:14 - 33:16
    my model gives me directly,
  • 33:16 - 33:18
    so plugged in as the denominator,
  • 33:18 - 33:25
    and by doing that, that's how I compute P of Y = 1 given X, make sense? Student:Thank you. Instructor (Andrew Ng):Okay. Cool.
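    Here's a small sketch of that whole computation in the 1D case (my own code, with made-up parameter values, not from the lecture); printing P of Y = 1 given X across a range of X traces out the sigmoid-shaped curve that was just drawn:

        import numpy as np
        from scipy.stats import norm

        # Hypothetical 1D GDA parameters: negative class centered at -1,
        # positive class at +1, shared standard deviation 1, equal priors.
        mu0, mu1, std, phi = -1.0, 1.0, 1.0, 0.5

        for x in np.linspace(-4.0, 4.0, 9):
            joint1 = norm.pdf(x, mu1, std) * phi        # p(x|y=1) p(y=1)
            joint0 = norm.pdf(x, mu0, std) * (1 - phi)  # p(x|y=0) p(y=0)
            posterior = joint1 / (joint1 + joint0)      # Bayes rule
            print(f"x = {x:+.1f}  ->  P(y=1|x) = {posterior:.3f}")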
  • 33:31 - 33:38
  • 33:53 - 33:58
    So let's talk a little bit about the advantages and disadvantages of using a
  • 33:58 - 34:00
    generative
  • 34:00 - 34:05
    learning algorithm, okay? So in the particular case of Gaussian discriminant analysis, we
  • 34:05 - 34:06
    assume that
  • 34:06 - 34:08
    X conditioned on Y
  • 34:08 - 34:14
    is Gaussian,
  • 34:14 - 34:17
    and the argument I showed on the previous chalkboard, I didn't prove it formally,
  • 34:17 - 34:20
    but you can actually go back and prove it yourself
  • 34:20 - 34:24
    is that if you assume X given Y is Gaussian,
  • 34:24 - 34:27
    then that implies that
  • 34:27 - 34:29
    when you plot Y
  • 34:29 - 34:31
    given X,
  • 34:31 - 34:38
    you find that - well, let me just write logistic posterior, okay?
  • 34:41 - 34:44
    And the argument I showed just now, which I didn't prove; you can go home and prove it
  • 34:44 - 34:44
    yourself,
  • 34:44 - 34:49
    is that if you assume X given Y is Gaussian, then that implies that the posterior
  • 34:49 - 34:54
    distribution or the form of
  • 34:54 - 34:57
    P of Y = 1 given X
  • 34:57 - 35:01
    is going to be a logistic function,
  • 35:01 - 35:02
  • 35:02 - 35:06
    and it turns out this
  • 35:06 - 35:08
    implication in the opposite direction
  • 35:08 - 35:12
    does not hold true,
  • 35:12 - 35:16
    okay? In particular, it actually turns out - this is actually, kind of, cool. It
  • 35:16 - 35:19
    turns out that if you're
  • 35:19 - 35:23
    assuming that X given Y = 1 is
  • 35:23 - 35:24
  • 35:24 - 35:26
    Poisson with
  • 35:26 - 35:29
    parameter lambda 1,
  • 35:29 - 35:32
    and X given Y = 0,
  • 35:32 - 35:37
    is Poisson
  • 35:37 - 35:39
    with parameter lambda 0.
  • 35:39 - 35:41
    It turns out if you assumed this,
  • 35:41 - 35:43
    then
  • 35:43 - 35:44
    that also
  • 35:44 - 35:47
    implies that P of Y
  • 35:47 - 35:51
    given X
  • 35:51 - 35:52
  • 35:52 - 35:54
    is logistic, okay?
  • 35:54 - 35:57
    So there are lots of assumptions on X given Y
  • 35:57 - 36:04
    that will lead to P of Y given X being logistic, and,
  • 36:05 - 36:06
    therefore,
  • 36:06 - 36:12
    this: the assumption that X given Y is Gaussian is a stronger assumption
  • 36:12 - 36:15
    than the assumption that Y given X is logistic,
  • 36:15 - 36:17
    okay? Because this implies this,
  • 36:17 - 36:23
    right? That means that this is a stronger assumption than this because
  • 36:23 - 36:30
    this, the logistic posterior holds whenever X given Y is Gaussian but not vice versa.
  • 36:30 - 36:31
    And so this leaves some
  • 36:31 - 36:36
    of the tradeoffs between Gaussian discriminant analysis and logistic
  • 36:36 - 36:37
    regression,
  • 36:37 - 36:40
    right? Gaussian discriminant analysis makes a much stronger assumption
  • 36:40 - 36:43
    that X given Y is Gaussian,
  • 36:43 - 36:47
    and so when this assumption is true, when this assumption approximately holds, if you plot the
  • 36:47 - 36:49
    data,
  • 36:49 - 36:52
    and if X given Y is, indeed, approximately Gaussian,
  • 36:52 - 36:56
    then if you make this assumption, explicit to the algorithm, then the
  • 36:56 - 36:57
    algorithm will do better
  • 36:57 - 36:59
    because it's as if the
  • 36:59 - 37:03
    algorithm is making use of more information about the data. The algorithm knows that
  • 37:03 - 37:04
    the data is Gaussian,
  • 37:04 - 37:05
    right? And so
  • 37:05 - 37:08
    if the Gaussian assumption, you know,
  • 37:08 - 37:10
    holds or roughly holds,
  • 37:10 - 37:10
    then Gaussian
  • 37:10 - 37:14
    discriminant analysis may do better than logistic regression.
  • 37:14 - 37:19
    If, conversely, if you're actually not sure what X given Y is, then
  • 37:19 - 37:23
    logistic regression, the discriminative algorithm, may do better,
  • 37:23 - 37:26
    and, in particular, use logistic regression,
  • 37:26 - 37:26
    and
  • 37:26 - 37:30
    maybe you see [inaudible] before the data was Gaussian, but it turns out the data
  • 37:30 - 37:31
    was actually Poisson, right?
  • 37:31 - 37:34
    Then logistic regression will still do perfectly fine because if
  • 37:34 - 37:37
    the data were actually Poisson,
  • 37:37 - 37:40
    then P of Y = 1 given X will be logistic, and it'll do perfectly
  • 37:40 - 37:45
    fine, but if you assumed it was Gaussian, then the algorithm may go off and do something
  • 37:45 - 37:52
    that's not as good, okay?
  • 37:52 - 37:56
    So it turns out that - right.
  • 37:56 - 37:59
    So it's slightly different.
  • 37:59 - 38:01
    It
  • 38:01 - 38:05
    turns out the real advantage of generative learning algorithms is often that they
  • 38:05 - 38:08
    require less data,
  • 38:08 - 38:09
    and, in particular,
  • 38:09 - 38:13
    data is never really exactly Gaussian, right? Because data is often
  • 38:13 - 38:16
    approximately Gaussian; it's never exactly Gaussian.
  • 38:16 - 38:20
    And it turns out, generative learning algorithms often do surprisingly well
  • 38:20 - 38:21
    even when
  • 38:21 - 38:24
    these modeling assumptions are not met, but
  • 38:24 - 38:26
    one other tradeoff
  • 38:26 - 38:27
    is that
  • 38:27 - 38:30
    by making stronger assumptions
  • 38:30 - 38:32
    about the data,
  • 38:32 - 38:34
    Gaussian discriminant analysis
  • 38:34 - 38:36
    often needs less data
  • 38:36 - 38:40
    in order to fit, like, an okay model, even if there's less training data. Whereas, in
  • 38:40 - 38:42
    contrast,
  • 38:42 - 38:46
    logistic regression, by making fewer
  • 38:46 - 38:49
    assumptions, is more robust to errors in the modeling assumptions because you're making a weaker assumption; you're
  • 38:49 - 38:51
    making fewer assumptions,
  • 38:51 - 38:54
    but sometimes it takes a slightly larger training set to fit than Gaussian discriminant
  • 38:54 - 38:55
    analysis. Question? Student:In order
  • 38:55 - 39:00
    to meet any assumption about the number [inaudible], plus
  • 39:00 - 39:01
  • 39:01 - 39:03
    here
  • 39:03 - 39:09
    we assume that P of Y = 1, equal
  • 39:09 - 39:11
    two
  • 39:11 - 39:15
    number of.
  • 39:15 - 39:16
    [Inaudible]. Is true when
  • 39:16 - 39:18
    the number of samples is marginal? Instructor (Andrew Ng):Okay. So let's see.
  • 39:18 - 39:22
    So there's a question of is this true - what
  • 39:22 - 39:23
    was that? Let me translate that
  • 39:23 - 39:25
    differently. So
  • 39:25 - 39:29
    the modeling assumptions are made independently of the size
  • 39:29 - 39:30
    of
  • 39:30 - 39:33
    your training set, right? So, like, in least/great regression - well,
  • 39:33 - 39:37
    in all of these models I'm assuming that these are random variables
  • 39:37 - 39:41
    flowing from some distribution, and then, finally, I'm giving a single training set
  • 39:41 - 39:44
    and that as for the parameters of the
  • 39:44 - 39:44
    distribution, right?
  • 39:44 - 39:51
    Student:So
  • 39:52 - 39:54
    what's the probability of Y = 1?
  • 39:54 - 39:55
    Instructor (Andrew Ng):Probability of Y = 1?
  • 39:55 - 39:56
    Student:Yeah, you used the - Instructor (Andrew Ng):Sort
  • 39:56 - 39:58
    of, this like
  • 39:58 - 40:00
    -
  • 40:00 - 40:03
    back to the philosophy of maximum likelihood estimation,
  • 40:03 - 40:05
    right? I'm assuming that
  • 40:05 - 40:08
    that P of Y is equal to phi to the Y
  • 40:08 - 40:13
    times 1 - phi to the 1 - Y. So I'm assuming that there's some true value of phi
  • 40:13 - 40:14
    generating
  • 40:14 - 40:16
    all my data,
  • 40:16 - 40:19
    and then
  • 40:19 - 40:20
    -
  • 40:20 - 40:26
    well, when I write this, I guess, maybe what I should write isn't -
  • 40:26 - 40:28
    so when I write this, I
  • 40:28 - 40:30
    guess there are already two values of phi. One is
  • 40:30 - 40:32
    there's a true underlying value of phi
  • 40:32 - 40:35
    that was used to generate the data,
  • 40:35 - 40:40
    and then there's the maximum likelihood estimate of the value of phi, and so when I was writing
  • 40:40 - 40:42
    those formulas earlier,
  • 40:42 - 40:45
    those formulas I was writing for phi, and mu0, and mu1
  • 40:45 - 40:49
    were really the maximum likelihood estimates for phi, mu0, and mu1, and that's different from the true
  • 40:49 - 40:56
    underlying values of phi, mu0, and mu1, but - Student:[Off mic]. Instructor (Andrew Ng):Yeah, right. So maximum
  • 40:56 - 40:58
    likelihood estimate comes from the data,
  • 40:58 - 41:02
    and there's some, sort of, true underlying value of phi that I'm trying to estimate,
  • 41:02 - 41:05
    and my maximum likelihood estimate is my attempt to estimate the true value,
  • 41:05 - 41:08
    but, you know, by notational convention
  • 41:08 - 41:12
    it's often just written like that as well, without bothering to distinguish between
  • 41:12 - 41:15
    the maximum likelihood value and the true underlying value that I'm assuming is out
  • 41:15 - 41:16
    there, and that I'm
  • 41:16 - 41:23
    only hoping to estimate.
  • 41:24 - 41:26
    Actually, yeah,
  • 41:26 - 41:29
    so for the sort of questions like these about maximum likelihood and so on, I hope
  • 41:29 - 41:34
    to defer to the Friday discussion section
  • 41:34 - 41:36
    as a good time to
  • 41:36 - 41:38
    ask questions about, sort of,
  • 41:38 - 41:41
    probabilistic definitions like these as well. Are there any
  • 41:41 - 41:48
    other questions? No, great. Okay.
  • 41:53 - 42:00
    So,
  • 42:00 - 42:01
    great. Oh, it
  • 42:01 - 42:08
    turns out, just to mention one more thing that's, kind of, cool.
  • 42:08 - 42:09
    I said that
  • 42:09 - 42:14
    if X given Y is Poisson, then you also get a logistic posterior,
  • 42:14 - 42:17
    it actually turns out there's a more general version of this. If you assume
  • 42:17 - 42:19
    X
  • 42:19 - 42:24
    given Y = 1 is exponential family
  • 42:24 - 42:26
    with natural parameter eta 1, and then you assume
  • 42:26 - 42:33
    X given Y = 0 is exponential family
  • 42:34 - 42:39
    with natural parameter eta 0, then
  • 42:39 - 42:43
    this implies that P of Y = 1 given X is also logistic, okay? And
  • 42:43 - 42:45
    that's, kind of, cool.
  • 42:45 - 42:48
    It means that X given Y could be - I don't
  • 42:48 - 42:50
    know, some strange thing. It could be gamma because
  • 42:50 - 42:53
    we've seen Gaussian right
  • 42:53 - 42:54
    next to the - I
  • 42:54 - 42:57
    don't know, gamma exponential.
  • 42:57 - 42:59
    Or, actually, a beta. I'm
  • 42:59 - 43:02
    just rattling off my mental list of exponential family distributions. It could be any one
  • 43:02 - 43:04
    of those things,
  • 43:04 - 43:08
    so [inaudible] the same exponential family distribution for the two classes
  • 43:08 - 43:09
    with different natural parameters
  • 43:09 - 43:12
    than the
  • 43:12 - 43:13
    posterior
  • 43:13 - 43:17
    P of Y given 1 given X - P of Y = 1 given X would be logistic, and so this shows
  • 43:17 - 43:19
    the robustness of logistic regression
  • 43:19 - 43:22
    to the choice of modeling assumptions because it could be that
  • 43:22 - 43:25
    the data was actually, you know, gamma distributed,
  • 43:25 - 43:29
    and just still turns out to be logistic. So it's the
  • 43:29 - 43:36
    robustness of logistic regression to modeling
  • 43:39 - 43:41
    assumptions.
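    Stated as equations (a reconstruction of the claim, assuming x itself is the natural sufficient statistic):

        x \mid y = 1 \sim \text{ExpFamily}(\eta_1)
        \quad\text{and}\quad
        x \mid y = 0 \sim \text{ExpFamily}(\eta_0)
        \;\Longrightarrow\;
        p(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^\top x - \theta_0)} \;\text{ for some } \theta, \theta_0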
  • 43:41 - 43:43
    And this is the second. I think,
  • 43:43 - 43:44
    early on I promised
  • 43:44 - 43:49
    two justifications for where I pulled the logistic function out of the hat, right? So
  • 43:49 - 43:53
    one was the exponential family derivation we went through last time, and this is, sort of, the second one.
  • 43:53 - 44:00
    That all of these modeling assumptions also lead to the logistic function. Yeah? Student:[Off
  • 44:05 - 44:10
    mic]. Instructor (Andrew Ng):Oh, if P of Y = 1 given X is logistic, does that imply this? No. The reverse is also
  • 44:10 - 44:11
    not true, right?
  • 44:11 - 44:12
    Yeah, so this exponential
  • 44:12 - 44:14
    family distribution
  • 44:14 - 44:18
    implies P of Y = 1 given X is logistic, but the reverse implication is not true.
  • 44:18 - 44:23
    There are actually all sorts of really bizarre distributions
  • 44:23 - 44:30
    for X that would give rise to logistic function, okay? Okay. So
  • 44:30 - 44:35
    let's talk about - that was our first generative learning algorithm. Maybe I'll talk about the second
  • 44:35 - 44:38
    generative learning algorithm,
  • 44:38 - 44:45
    and the motivating example, actually this is called a Naive Bayes algorithm,
  • 44:45 - 44:50
    and the motivating example that I'm gonna use will be spam classification. All right. So let's
  • 44:50 - 44:54
    say that you want to build a spam classifier to take your incoming stream of email and decide if
  • 44:54 - 44:57
    it's spam or
  • 44:57 - 44:57
    not.
  • 44:57 - 45:02
    So let's
  • 45:02 - 45:05
    see. Y will be 0
  • 45:05 - 45:07
    or
  • 45:07 - 45:12
    1, with 1 being spam email
  • 45:12 - 45:13
    and 0 being non-spam, and
  • 45:13 - 45:17
    the first decision we need to make is, given a piece of email,
  • 45:17 - 45:21
    how do you represent a piece of email using a feature vector X,
  • 45:21 - 45:24
    right? So email is just a piece of text, right? Email
  • 45:24 - 45:28
    is like a list of words or a list of ASCII characters.
  • 45:28 - 45:31
    So I can represent email as a feature vector X.
  • 45:31 - 45:32
    So we'll use a couple of different
  • 45:32 - 45:34
    representations,
  • 45:34 - 45:37
    but the one I'll use today is
  • 45:37 - 45:38
    we will
  • 45:38 - 45:42
    construct the vector X as follows. I'm gonna go through my dictionary, and, sort of, make a listing of
  • 45:42 - 45:44
    all the words in my dictionary, okay? So
  • 45:44 - 45:46
    the first word is
  • 45:46 - 45:52
    A. The second word in my dictionary is aardvark, then aardwolf,
  • 45:52 - 45:56
    okay?
  • 45:56 - 45:59
    You know, and somewhere along the way you see the word buy in the spam email telling you to buy
  • 45:59 - 46:01
    stuff.
  • 46:01 - 46:04
    Depending on how you collect your list of words,
  • 46:04 - 46:09
    you know, you won't find CS229, right, course number in a dictionary, but
  • 46:09 - 46:10
    if you
  • 46:10 - 46:14
    collect a list of words via other emails you've gotten, you have this list somewhere
  • 46:14 - 46:18
    as well, and then the last word in my dictionary was
  • 46:18 - 46:22
    zygmurgy, which
  • 46:22 - 46:25
    pertains to the branch of chemistry that deals with
  • 46:25 - 46:28
    the fermentation process in
  • 46:28 - 46:31
    brewing.
  • 46:31 - 46:35
    So say I get a piece of email, and what I'll do is I'll then
  • 46:35 - 46:38
    scan through this list of words, and wherever
  • 46:38 - 46:41
    a certain word appears in my email, I'll put a 1 there. So if a particular
  • 46:41 - 46:44
    email has the word a, then that's a 1.
  • 46:44 - 46:47
    You know, my email doesn't have the words aardvark
  • 46:47 - 46:49
    or aardwolf, so those get zeros. And again,
  • 46:49 - 46:53
    it's a piece of email where they want me to buy something, so buy gets a 1; CS229 doesn't occur, and so on, okay?
  • 46:53 - 46:57
    So
  • 46:57 - 46:59
    this would be
  • 46:59 - 47:02
    one way of creating a feature vector
  • 47:02 - 47:04
    to represent a
  • 47:04 - 47:06
    piece of email.
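    As a minimal sketch of this representation (my own code; the miniature dictionary and email text are made up for illustration - a real dictionary would have tens of thousands of entries):

        # Hypothetical miniature dictionary, sorted the way the lecture describes.
        dictionary = ["a", "aardvark", "aardwolf", "buy", "cs229", "zygmurgy"]

        def email_to_feature_vector(email_text, dictionary):
            # Binary vector x with x[j] = 1 iff dictionary word j appears in the email.
            words_in_email = set(email_text.lower().split())
            return [1 if word in words_in_email else 0 for word in dictionary]

        x = email_to_feature_vector("Buy now - this is a great deal", dictionary)
        print(x)  # [1, 0, 0, 1, 0, 0]: "a" and "buy" appear, the others do not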
  • 47:06 - 47:10
    Now, let's write
  • 47:10 - 47:17
    out the generative model for this. Actually,
  • 47:18 - 47:19
    let's use
  • 47:19 - 47:21
    this.
  • 47:21 - 47:24
    In other words, I want to model P of X
  • 47:24 - 47:28
    given Y, for Y = 0 or Y = 1, all right?
  • 47:28 - 47:31
    And my feature vectors are going to be 0, 1
  • 47:31 - 47:36
    to the N. They're going to be these bit vectors, binary-valued vectors. They're N
  • 47:36 - 47:36
    dimensional.
  • 47:36 - 47:38
    Where N
  • 47:38 - 47:39
    may
  • 47:39 - 47:42
    be on the order of, say, 50,000, if you have 50,000
  • 47:42 - 47:44
    words in your dictionary,
  • 47:44 - 47:46
    which is not atypical. So
  • 47:46 - 47:47
    values from - I don't
  • 47:47 - 47:48
    know,
  • 47:48 - 47:52
    mid-thousands to tens of thousands is very typical
  • 47:52 - 47:54
    for problems like
  • 47:54 - 47:57
    these. And, therefore,
  • 47:57 - 48:00
    there are two to the 50,000 possible values for X, right? So two to
  • 48:00 - 48:02
    50,000
  • 48:02 - 48:03
    possible bit vectors
  • 48:03 - 48:04
    of length
  • 48:04 - 48:05
  • 48:05 - 48:07
    50,000, and so
  • 48:07 - 48:09
    one way to model this is
  • 48:09 - 48:11
    the multinomial distribution,
  • 48:11 - 48:14
    but because there are two to the 50,000 possible values for X,
  • 48:14 - 48:20
    I would need two to the 50,000 - well, minus 1 - parameters,
  • 48:20 - 48:21
    right? Because you have
  • 48:21 - 48:24
    the constraint that these sum to 1, right? So
  • 48:24 - 48:27
    minus 1. And this is clearly way too many parameters
  • 48:27 - 48:28
    to model
  • 48:28 - 48:30
    using the multinomial distribution
  • 48:30 - 48:34
    over all two to 50,000 possibilities.
  • 48:34 - 48:36
    So
  • 48:36 - 48:42
    in a Naive Bayes algorithm, we're
  • 48:42 - 48:47
    going to make a very strong assumption on P of X given Y,
  • 48:47 - 48:49
    and, in particular, I'm going to assume - let
  • 48:49 - 48:54
    me just say what it's called; then I'll write out what it means. I'm going to assume that the
  • 48:54 - 48:55
    XI's
  • 48:55 - 49:02
    are conditionally independent
  • 49:06 - 49:08
    given Y, okay?
  • 49:08 - 49:11
    Let me say what this means.
  • 49:11 - 49:18
    So I have that P of X1, X2, up to X50,000,
  • 49:19 - 49:21
    right, given the
  • 49:21 - 49:26
    Y. By the chain rule of probability, this is P of X1 given Y
  • 49:26 - 49:39
    times P of X2 given Y, X1,
  • 49:39 - 49:43
    times - I'll just put dot, dot, dot - up to, you know, well -
  • 49:43 - 49:46
    whatever. You get the idea, up to P of X50,000 given Y, X1, ..., X49,999, okay?
  • 49:46 - 49:50
    So this is the chain rule of probability. This always holds. I've not
  • 49:50 - 49:53
    made any assumption yet, and now, we're
  • 49:53 - 49:55
    gonna
  • 49:55 - 49:58
    make what's called the Naive Bayes assumption, this assumption that the
  • 49:58 - 50:02
    XI's are conditionally independent given Y. We're going
  • 50:02 - 50:04
    to assume that -
  • 50:04 - 50:07
    well, nothing changes for the first term,
  • 50:07 - 50:12
    but I'm gonna assume that P of X2 given Y, X1 is equal to P of X2 given Y. I'm gonna assume the next term's equal to P of X3
  • 50:12 - 50:13
    given
  • 50:13 - 50:17
    the
  • 50:17 - 50:18
    Y,
  • 50:18 - 50:20
    and so on, up
  • 50:20 - 50:24
    to P of X50,000 given Y, okay?
  • 50:24 - 50:29
    Or just written more compactly,
  • 50:29 - 50:32
  • 50:32 - 50:36
    this means we assume that P of X1 up through X50,000 given Y is
  • 50:36 - 50:41
    the product from I = 1 to 50,000 of P of XI
  • 50:41 - 50:44
    given Y,
  • 50:44 - 50:45
    okay?
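    Compactly, the chain rule identity and the Naive Bayes assumption just stated are:

        p(x_1, \ldots, x_{50000} \mid y) = \prod_{i=1}^{50000} p(x_i \mid y, x_1, \ldots, x_{i-1})  \quad\text{(chain rule, always true)}

        p(x_1, \ldots, x_{50000} \mid y) = \prod_{i=1}^{50000} p(x_i \mid y)  \quad\text{(Naive Bayes assumption)}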
  • 50:45 - 50:49
    And stating informally what this means is that I'm, sort of, assuming that -
  • 50:49 - 50:53
    once you know the class label Y - so long as you know whether this is spam or not
  • 50:53 - 50:55
    spam,
  • 50:55 - 50:59
    then knowing whether the word a appears in the email
  • 50:59 - 51:00
    does not affect
  • 51:00 - 51:01
    the probability
  • 51:01 - 51:03
    of whether the word
  • 51:03 - 51:07
    aardwolf appears in the email, all right?
  • 51:07 - 51:11
    And, in other words, this is assuming - once you know whether an email is spam
  • 51:11 - 51:13
    or not spam,
  • 51:13 - 51:16
    then knowing whether other words appear in the email won't help
  • 51:16 - 51:19
    you predict whether any other word appears in the email, okay?
  • 51:19 - 51:20
    And,
  • 51:20 - 51:23
    obviously, this assumption is false, right? This
  • 51:23 - 51:26
    assumption can't possibly be
  • 51:26 - 51:28
    true. I mean, if you see the word
  • 51:28 - 51:31
    - I don't know, CS229 in an email, you're much more likely to see my name in the email, or
  • 51:31 - 51:37
    the TA's names, or whatever. So this assumption is normally just false
  • 51:37 - 51:37
  • 51:37 - 51:39
    under English, right,
  • 51:39 - 51:41
    for normal written English,
  • 51:41 - 51:43
    but it
  • 51:43 - 51:44
    turns out that despite
  • 51:44 - 51:47
    this assumption being, sort of,
  • 51:47 - 51:48
    false in the literal sense,
  • 51:48 - 51:51
    the Naive Bayes algorithm is, sort of,
  • 51:51 - 51:53
    an extremely effective
  • 51:53 - 51:58
    algorithm for classifying text documents into spam or not spam, for
  • 51:58 - 52:02
    classifying your emails into different folders automatically, for
  • 52:02 - 52:04
    looking at web pages and classifying
  • 52:04 - 52:06
    whether this webpage is trying to sell something or whatever. It
  • 52:06 - 52:08
    turns out, this assumption
  • 52:08 - 52:12
    works very well for classifying text documents and for other applications too that I'll
  • 52:12 - 52:16
    talk a bit about later.
  • 52:16 - 52:21
    As a digression that'll make sense only to some of you.
  • 52:21 - 52:23
    Let me just say that
  • 52:23 - 52:28
    if you're familiar with Bayesian networks, say
  • 52:28 - 52:30
  • 52:30 - 52:34
    graphical models, the Bayesian network associated with this model looks like this, and you're assuming
  • 52:34 - 52:35
    that
  • 52:35 - 52:37
    this is random variable Y
  • 52:37 - 52:39
    that then generates X1, X2, through
  • 52:39 - 52:41
    X50,000, okay? If you've not seen the
  • 52:41 - 52:43
    Bayes Net before, if
  • 52:43 - 52:48
    you don't know graphical models, just ignore this. It's not important to our purposes, but
  • 52:48 - 52:55
    if you've seen it before, that's what it will look like. Okay.
  • 52:58 - 53:03
    So
  • 53:03 - 53:10
  • 53:12 - 53:16
    the parameters of the model
  • 53:16 - 53:18
    are as follows
  • 53:18 - 53:22
    with phi I given Y = 1, which
  • 53:22 - 53:23
  • 53:23 - 53:28
    is the probability that XI = 1
  • 53:28 - 53:29
    given Y = 1,
  • 53:29 - 53:36
    phi I
  • 53:36 - 53:40
    given Y = 0, which is the probability that XI = 1 given Y = 0, and phi Y, okay?
  • 53:40 - 53:43
    So these are the parameters of the model,
  • 53:43 - 53:45
  • 53:45 - 53:47
    and, therefore,
  • 53:47 - 53:50
    to fit the parameters of the model, you
  • 53:50 - 53:57
    can write down the joint likelihood, right,
  • 54:02 - 54:07
    is
  • 54:07 - 54:14
    equal to the product from I = 1 to M of P of XI, YI, as usual, okay?
  • 54:19 - 54:21
    So given the training sets,
  • 54:21 - 54:28
    you can write down the joint
  • 54:29 - 54:33
    likelihood of the parameters, and
  • 54:33 - 54:40
    then when
  • 54:44 - 54:45
    you
  • 54:45 - 54:47
    do maximum likelihood estimation,
  • 54:47 - 54:51
    you find that the maximum likelihood estimates of the parameters are
  • 54:51 - 54:53
    - they're really, pretty much, what you'd expect.
  • 54:53 - 54:56
    Maximum likelihood estimate for phi J given Y = 1 is
  • 54:56 - 54:58
    sum from I = 1 to
  • 54:58 - 55:02
    M of
  • 55:02 - 55:03
    indicator
  • 55:03 - 55:07
    XIJ = 1, YI = 1,
  • 55:07 - 55:14
    divided by sum from I = 1 to M of indicator YI = 1, okay?
  • 55:15 - 55:18
  • 55:18 - 55:19
  • 55:19 - 55:21
    And this is just a,
  • 55:21 - 55:23
    I guess, stated more simply,
  • 55:23 - 55:25
    the numerator just says, Run through
  • 55:25 - 55:28
    your entire training set, your M training examples,
  • 55:28 - 55:31
    and count up the number of times you saw word J
  • 55:31 - 55:33
    in a piece of email
  • 55:33 - 55:35
    for which the label Y was equal to 1. So, in other words, look
  • 55:35 - 55:37
    through all your spam emails
  • 55:37 - 55:39
    and count the number of emails in which the word
  • 55:39 - 55:41
    J appeared out of
  • 55:41 - 55:42
    all your spam emails,
  • 55:42 - 55:45
    and the denominator is, you know,
  • 55:45 - 55:46
    sum from I = 1 to M,
  • 55:46 - 55:48
    indicator YI = 1. The
  • 55:48 - 55:51
    denominator is just the number of spam emails you got.
  • 55:51 - 55:54
    And so this ratio is
  • 55:54 - 55:58
    in all your spam emails in your training set,
  • 55:58 - 56:00
    what fraction of these emails
  • 56:00 - 56:03
    did the word J
  • 56:03 - 56:04
    appear in -
  • 56:04 - 56:07
    did the Jth word in your dictionary appear in?
  • 56:07 - 56:10
    And that's the maximum likelihood estimate
  • 56:10 - 56:14
    for the probability of seeing the word J conditioned on the piece of email being spam, okay? And similarly, your
  • 56:14 - 56:17
  • 56:17 - 56:20
  • 56:20 - 56:23
    maximum likelihood estimate for phi
  • 56:23 - 56:25
    Y
  • 56:25 - 56:29
    is pretty much what you'd expect, right?
  • 56:29 - 56:36
    Okay?
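    (A minimal sketch of these maximum likelihood estimates in Python; the function and variable names are made up for illustration, and it assumes the training set contains both spam and non-spam examples.)

        def fit_naive_bayes_mle(X, y):
            # X: list of 0/1 word-indicator vectors; y: list of 0/1 labels (1 = spam).
            m, n = len(X), len(X[0])
            n_spam = sum(y)
            phi_y = n_spam / m   # fraction of training emails that are spam
            # phi_j_y1[j]: fraction of spam emails containing word j; same idea for y = 0.
            phi_j_y1 = [sum(X[i][j] for i in range(m) if y[i] == 1) / n_spam
                        for j in range(n)]
            phi_j_y0 = [sum(X[i][j] for i in range(m) if y[i] == 0) / (m - n_spam)
                        for j in range(n)]
            return phi_y, phi_j_y1, phi_j_y0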
  • 56:41 - 56:43
    And so
  • 56:43 - 56:48
    having estimated all these parameters,
  • 56:48 - 56:51
    when you're given a new piece of email that you want to classify,
  • 56:51 - 56:53
    you can then compute P of Y given X
  • 56:53 - 56:56
    using Bayes rule, right?
  • 56:56 - 56:58
    Same as before because
  • 56:58 - 57:04
    together these parameters give you a model for P of X given Y and for P of Y,
  • 57:04 - 57:08
    and by using Bayes rule, given these two terms, you can compute
  • 57:08 - 57:11
    P of Y given X, and
  • 57:11 - 57:16
    there's your spam classifier, okay?
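    (A sketch of that last step, reusing the parameters fitted above; note that in practice you would sum log probabilities instead of multiplying 50,000 factors, to avoid numerical underflow.)

        def predict_spam(x, phi_y, phi_j_y1, phi_j_y0):
            # Bayes rule: P(y=1 | x) = P(x | y=1) P(y=1) / (P(x | y=1) P(y=1) + P(x | y=0) P(y=0)).
            def p_x_given(phis):
                p = 1.0
                for xj, phi in zip(x, phis):
                    p *= phi if xj == 1 else (1.0 - phi)
                return p
            num = p_x_given(phi_j_y1) * phi_y
            den = num + p_x_given(phi_j_y0) * (1.0 - phi_y)
            return num / den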
  • 57:16 - 57:19
    Turns out we need one more elaboration to this idea, but let me check if there are
  • 57:19 - 57:25
    questions about this so far.
  • 57:25 - 57:29
    Student:So does this model depend
  • 57:29 - 57:30
    on
  • 57:30 - 57:34
    the number of inputs? Instructor (Andrew Ng):What do
  • 57:34 - 57:35
    you
  • 57:35 - 57:38
    mean, number of inputs, the number of features? Student:No, number of samples. Instructor (Andrew Ng):Well, M is the number of training examples, so this
  • 57:38 - 57:45
    given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? So other questions, does it make
  • 57:49 - 57:53
    sense? Or M is the number of training examples, so when you have M training examples, you plug them
  • 57:53 - 57:54
    into this formula,
  • 57:54 - 58:01
    and that's how you compute the maximum likelihood estimates. Student:By training examples, you mean M is the
  • 58:01 - 58:04
    number of emails? Instructor (Andrew Ng):Yeah, right. So, right.
  • 58:04 - 58:07
    So it's, kind of, your training set. I would go through all the email I've gotten
  • 58:07 - 58:09
    in the last two months
  • 58:09 - 58:11
    and label them as spam or not spam,
  • 58:11 - 58:14
    and so you have - I don't
  • 58:14 - 58:17
    know, like, a few hundred emails
  • 58:17 - 58:19
    labeled as spam or not spam,
  • 58:19 - 58:26
    and that will comprise your training sets for X1 and Y1 through XM,
  • 58:28 - 58:29
    YM,
  • 58:29 - 58:33
    where X is one of those vectors representing which words appeared in the email and Y
  • 58:33 - 58:40
    is 0, 1 depending on whether the email is spam or not spam, okay? Student:So you are saying that this model depends on the number
  • 58:41 - 58:42
  • 58:42 - 58:49
    of examples, but the last model didn't depend on the number of examples, but your phi is the
  • 58:50 - 58:54
    same for either one. Instructor (Andrew Ng):They're
  • 58:54 - 58:58
    different things, right? There's the model which is
  • 58:58 - 59:01
    - the modeling assumptions I'm making, right?
  • 59:01 - 59:04
    I'm assuming that - I'm making the Naive Bayes assumption.
  • 59:04 - 59:09
    So the probabilistic model is an assumption on the joint distribution
  • 59:09 - 59:10
    of X and Y.
  • 59:10 - 59:12
    That's what the model is,
  • 59:12 - 59:17
    and then I'm given a fixed number of training examples. I'm given M training examples, and
  • 59:17 - 59:20
    then it's, like, after I'm given the training sets, I'll then go in to write the maximum
  • 59:20 - 59:22
    likelihood estimate of the parameters, right? So that's, sort
  • 59:22 - 59:25
    of,
  • 59:25 - 59:32
    maybe we should take that offline for - yeah, ask a question? Student:Then how would you do this, like,
  • 59:38 - 59:41
    if this [inaudible] didn't work? Instructor (Andrew Ng):Say that again. Student:How would you do it, say, like the 50,000 words - Instructor (Andrew Ng):Oh, okay. How to do this with the 50,000 words, yeah. So
  • 59:41 - 59:42
    it turns out
  • 59:42 - 59:46
    this is, sort of, a very practical question, really. How do I come up with this list of
  • 59:46 - 59:50
    words? One common way to do this is to actually
  • 59:50 - 59:53
    find some way to come up with a list of words, like go through all your emails, go through
  • 59:53 - 59:54
    all the -
  • 59:54 - 59:57
    in practice, one common way to come up with a list of words
  • 59:57 - 60:00
    is to just take all the words that appear in your training set. That's one fairly common way
  • 60:00 - 60:02
    to do it,
  • 60:02 - 60:06
    or if that turns out to be too many words, you can take all words that appear
  • 60:06 - 60:07
    at least
  • 60:07 - 60:08
    three times
  • 60:08 - 60:09
    in your training set. So
  • 60:09 - 60:11
    words that
  • 60:11 - 60:15
    you didn't even see three times in the emails you got in the last
  • 60:15 - 60:17
    two months, you discard. So those are - I
  • 60:17 - 60:20
    was talking about going through a dictionary, which is a nice way of thinking about it, but in
  • 60:20 - 60:22
    practice, you might go through
  • 60:22 - 60:26
    your training set and then just take the union of all the words that appear in
  • 60:26 - 60:30
    it. In some of the problem sets, by the way, I've even asked you to select these features, but this is one
  • 60:30 - 60:32
    way to think about
  • 60:32 - 60:34
    creating your feature vector,
  • 60:34 - 60:41
    right, as zero and one values, okay?
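    (A rough sketch of that vocabulary construction, assuming the emails arrive already tokenized into lists of words; the names and the three-occurrence cutoff are illustrative.)

        from collections import Counter

        def build_vocabulary(tokenized_emails, min_count=3):
            # Keep every word seen at least min_count times across the training set.
            counts = Counter(w for email in tokenized_emails for w in email)
            return sorted(w for w, c in counts.items() if c >= min_count)

        def to_feature_vector(email, vocabulary):
            # 0/1 vector: does word j of the vocabulary appear in this email?
            words = set(email)
            return [1 if w in words else 0 for w in vocabulary]

    Moving on, yeah. Okay. Ask a question? Student:I'm getting, kind of, confused on how you compute all those parameters. Instructor (Andrew Ng):On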
  • 60:49 - 60:52
    how I came up with the parameters?
  • 60:52 - 60:54
    Student:Correct. Instructor (Andrew Ng):Let's see.
  • 60:54 - 60:58
    So in Naive Bayes, what I need to do - the question was how did I come up with the parameters, right?
  • 60:58 - 60:59
    In Naive Bayes,
  • 60:59 - 61:01
    I need to build a model
  • 61:01 - 61:04
    for P of X given Y and for
  • 61:04 - 61:05
    P of Y,
  • 61:05 - 61:10
    right? So this is, I mean, in generative learning algorithms, I need to come up with
  • 61:10 - 61:11
    models for these.
  • 61:11 - 61:15
    So how'd I model P of Y? Well, I just chose to model it using a Bernoulli
  • 61:15 - 61:16
    distribution,
  • 61:16 - 61:17
    and so
  • 61:17 - 61:20
    P of Y will be
  • 61:20 - 61:22
    parameterized by that, all right? Student:Okay.
  • 61:22 - 61:27
    Instructor (Andrew Ng):And then how'd I model P of X given Y? Well,
  • 61:27 - 61:29
    let me keep changing boards.
  • 61:29 - 61:33
    My model for P of X given Y under the Naive Bayes assumption, I assume
  • 61:33 - 61:35
    that P of X given Y
  • 61:35 - 61:38
    is the product of these probabilities,
  • 61:38 - 61:41
    and so I'm going to need parameters to tell me
  • 61:41 - 61:44
    what's the probability of each word occurring,
  • 61:44 - 61:46
    you know, of each word occurring or not occurring,
  • 61:46 - 61:53
    conditioned on the email being spam or not spam, okay? Student:How is that
  • 61:55 - 61:59
    Bernoulli? Instructor (Andrew Ng):Oh, because X is either zero or one, right? By the way I defined the feature
  • 61:59 - 62:01
    vectors, XI
  • 62:01 - 62:05
    is either one or zero, depending on whether word I appears in the email,
  • 62:05 - 62:09
    right? So by the way I define the
  • 62:09 - 62:11
    feature vectors, XI -
  • 62:11 - 62:17
    the XI is always zero or one. So, by definition, if XI, you know, is either zero or
  • 62:17 - 62:20
    one, then it has to be a Bernoulli distribution, right?
  • 62:20 - 62:22
    If XI were continuous, then
  • 62:22 - 62:25
    you might model this as a Gaussian, and you'd end up
  • 62:25 - 62:27
    like we did in Gaussian discriminant analysis. It's
  • 62:27 - 62:31
    just that the way I constructed my features for email, XI is always binary
  • 62:31 - 62:32
    value, and so you end up
  • 62:32 - 62:33
    with a
  • 62:33 - 62:36
    Bernoulli here, okay? All right. I
  • 62:36 - 62:40
    should move on. So
  • 62:40 - 62:42
    it turns out that
  • 62:42 - 62:44
    this idea
  • 62:44 - 62:46
    almost works.
  • 62:46 - 62:48
    Now, here's the problem.
  • 62:48 - 62:50
    So let's say you
  • 62:50 - 62:54
    complete this class and you start to do, maybe do the class project, and you
  • 62:54 - 62:57
    keep working on your class project for a bit, and it
  • 62:57 - 63:01
    becomes really good, and you want to submit your class project to a conference, right? So,
  • 63:01 - 63:03
    you know, around - I don't know,
  • 63:03 - 63:08
    June every year is the submission deadline for the NIPS conference.
  • 63:08 - 63:11
    It's just the name of the conference; it's an acronym.
  • 63:11 - 63:12
    And so maybe
  • 63:12 - 63:16
    you email your project partners or senior friends even, and say, Hey, let's
  • 63:16 - 63:20
    work on a project and submit it to the NIPS conference. And so you're getting these emails
  • 63:20 - 63:21
    with the word NIPS in them,
  • 63:21 - 63:24
    which you've probably never seen before,
  • 63:24 - 63:28
    and so a
  • 63:28 - 63:33
    piece of email comes from your project partner, and so you
  • 63:33 - 63:36
    go, Let's send a paper to the NIPS conference.
  • 63:36 - 63:37
    And then your spam classifier
  • 63:37 - 63:39
    will say
  • 63:39 - 63:40
    P of X -
  • 63:40 - 63:45
    let's say NIPS is the 30,000th word in your dictionary, okay?
  • 63:45 - 63:47
    So P of X30,000 = 1
  • 63:47 - 63:50
    given
  • 63:50 - 63:52
    Y =
  • 63:52 - 63:53
    1
  • 63:53 - 63:54
    will be equal to 0.
  • 63:54 - 63:57
    That's the maximum likelihood estimate, right? Because you've never seen the word NIPS before in
  • 63:57 - 64:01
    your training set, so the maximum likelihood estimate of the probability of seeing the word
  • 64:01 - 64:02
    NIPS is zero,
  • 64:02 - 64:07
    and, similarly,
  • 64:07 - 64:08
    you
  • 64:08 - 64:12
    know, in, I guess, non-spam mail, the chance of seeing the word NIPS is also
  • 64:12 - 64:15
    estimated
  • 64:15 - 64:21
    as zero.
  • 64:21 - 64:28
    So
  • 64:35 - 64:41
    when your spam classifier goes to compute P of Y = 1 given X, it will
  • 64:41 - 64:45
    compute this
  • 64:45 - 64:47
    right here: P of X given Y = 1, times P of Y = 1,
  • 64:47 - 64:54
    over P of X given Y = 1, times P of Y = 1,
  • 64:56 - 65:03
    plus P of X given Y = 0, times P of Y = 0, all
  • 65:05 - 65:10
    right.
  • 65:10 - 65:11
    And so
  • 65:11 - 65:16
    you look at those terms - say, this will be the product from I =
  • 65:16 - 65:18
    1 to 50,000,
  • 65:18 - 65:22
    P of XI given Y,
  • 65:22 - 65:26
    and one of those probabilities will be equal to
  • 65:26 - 65:28
  • 65:28 - 65:32
    zero because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a
  • 65:32 - 65:37
    zero in this product, and so the numerator is zero,
  • 65:37 - 65:40
    and in the same way, it turns out the denominator will also be zero, and so you end
  • 65:40 - 65:43
    up with -
  • 65:43 - 65:46
    actually all of these terms end up being zero. So you end up with P of Y = 1
  • 65:46 - 65:49
    given X is 0 over 0 + 0, okay, which is
  • 65:49 - 65:51
    undefined. And the
  • 65:51 - 65:54
    problem with this is that it's
  • 65:54 - 65:57
    just statistically a bad idea
  • 65:57 - 66:00
    to say that P of X30,000
  • 66:00 - 66:03
    given Y is
  • 66:03 - 66:03
    0,
  • 66:03 - 66:06
    right? Just because you haven't seen the word NIPS in your last
  • 66:06 - 66:10
    two months' worth of email, it's statistically not sound to say that,
  • 66:10 - 66:16
    therefore, the chance of ever seeing this word is zero, right?
  • 66:16 - 66:19
    And so
  • 66:19 - 66:20
  • 66:20 - 66:23
  • 66:23 - 66:26
    the problem is this idea that just because you haven't seen something
  • 66:26 - 66:30
    before, that may mean that that event is unlikely, but it doesn't mean that
  • 66:30 - 66:34
    it's impossible. It's wrong to say that if you've never seen the word NIPS before,
  • 66:34 - 66:41
    then it is impossible to ever see the word NIPS in future emails; the chance of that is just zero.
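    (A tiny numeric sketch of the failure, with made-up numbers: one never-seen word zeroes out P of X given Y for both classes, so Bayes rule produces 0/0.)

        phi_y = 0.5                   # assumed P(y = 1), for illustration only
        p_x_given_spam = 0.8 * 0.0    # product includes the unseen word's zero estimate
        p_x_given_ham = 0.2 * 0.0     # same zero factor on the non-spam side
        num = p_x_given_spam * phi_y
        den = num + p_x_given_ham * (1 - phi_y)
        print(num, den)               # 0.0 0.0 -> P(y = 1 | x) would be 0/0, undefined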
  • 66:41 - 66:48
    So we're gonna fix this,
  • 66:48 - 66:50
    and
  • 66:50 - 66:54
    to motivate the fix I'll talk about
  • 66:54 - 66:59
    - the example we're gonna use is let's say that you've been following the Stanford basketball
  • 66:59 - 67:04
    team for all of their away games, and been, sort of, tracking their wins and losses
  • 67:04 - 67:08
    to gather statistics, and, maybe - I don't know, form a betting pool about
  • 67:08 - 67:11
    whether they're likely to win or lose the next game, okay?
  • 67:11 - 67:13
    So
  • 67:13 - 67:15
    these are some of the statistics.
  • 67:15 - 67:18
  • 67:18 - 67:19
  • 67:19 - 67:25
    So on, I guess, the 8th of February
  • 67:25 - 67:29
    last season they played Washington State, and they
  • 67:29 - 67:32
    did not win.
  • 67:32 - 67:36
    On
  • 67:36 - 67:38
    the 11th of February,
  • 67:38 - 67:42
    they played Washington, on the 22nd
  • 67:42 - 67:47
    they played USC,
  • 67:47 - 67:50
  • 67:50 - 67:55
    played UCLA,
  • 67:55 - 67:58
    played USC again,
  • 67:58 - 68:04
  • 68:04 - 68:06
    and now you want to estimate
  • 68:06 - 68:10
    what's the chance that they'll win or lose against Louisville, right?
  • 68:10 - 68:11
    So
  • 68:11 - 68:15
    they played, I guess, five away games last year and didn't win any of them, but it
  • 68:15 - 68:17
    seems awfully harsh to say that - so it
  • 68:17 - 68:24
    seems awfully harsh to say there's zero chance that they'll
  • 68:37 - 68:40
    win the next game. So here's the idea behind Laplace smoothing
  • 68:40 - 68:43
    which is
  • 68:43 - 68:45
    that we're estimating
  • 68:45 - 68:48
    the probability of Y being equal to one, right?
  • 68:48 - 68:53
    Normally, the maximum likelihood estimate is the
  • 68:53 - 68:56
    number of ones
  • 68:56 - 68:58
    divided by
  • 68:58 - 69:01
    the number of zeros
  • 69:01 - 69:05
    plus the number of ones, okay? I
  • 69:05 - 69:08
    hope this informal notation makes sense, right? You know,
  • 69:08 - 69:11
    the maximum likelihood estimate for, sort of, a win or loss for Bernoulli random
  • 69:11 - 69:12
    variable
  • 69:12 - 69:15
    is
  • 69:15 - 69:17
    just the number of ones you saw
  • 69:17 - 69:21
    divided by the total number of examples. So it's the number of zeros you saw plus the number of ones you saw. So in
  • 69:21 - 69:23
    the Laplace
  • 69:23 - 69:24
    Smoothing
  • 69:24 - 69:26
    we're going to
  • 69:26 - 69:30
    just take each of these terms, the number of ones and, sort of, add one to that, the number
  • 69:30 - 69:32
    of zeros and add one to that,
  • 69:34 - 69:37
    and so in our example,
  • 69:37 - 69:38
    instead of estimating
  • 69:38 - 69:43
    the probability of winning the next game to be 0 over
  • 69:43 - 69:45
    5 + 0,
  • 69:45 - 69:50
    we'll add one to all of these counts, and so we say that the chance of
  • 69:50 - 69:52
    their
  • 69:52 - 69:54
    winning the next game is 1/7th,
  • 69:54 - 69:55
    okay? Which is
  • 69:55 - 69:59
    that having seen them lose, you know, five away games in a row, we aren't terribly -
  • 69:59 - 70:03
    we don't think it's terribly likely they'll win the next game, but at
  • 70:03 - 70:06
    least we're not saying it's impossible.
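    (The same arithmetic as a sketch: with zero wins and five losses, adding one to each count gives (0 + 1) / (5 + 0 + 2) = 1/7.)

        def laplace_bernoulli(num_ones, num_zeros):
            # Add one to each of the two counts before taking the ratio.
            return (num_ones + 1) / (num_ones + num_zeros + 2)

        print(laplace_bernoulli(0, 5))   # 0.1428... = 1/7, instead of 0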
  • 70:06 - 70:10
    As a historical side note, Laplace actually came up with this method -
  • 70:10 - 70:14
    it's called Laplace smoothing after him -
  • 70:14 - 70:18
  • 70:18 - 70:22
    when he was trying to estimate the probability that the sun will rise tomorrow, and his rationale
  • 70:22 - 70:25
    was, you know, we've seen the sun rise for a lot of days now,
  • 70:25 - 70:28
    but that doesn't mean we can be absolutely certain the sun will rise tomorrow.
  • 70:28 - 70:31
    He was using this to estimate the probability that the sun will rise tomorrow. This is, kind of,
  • 70:31 - 70:36
    cool. So,
  • 70:36 - 70:39
    and more generally,
  • 70:39 - 70:42
  • 70:42 - 70:44
  • 70:44 - 70:45
  • 70:45 - 70:46
  • 70:46 - 70:49
    if Y
  • 70:49 - 70:53
    takes on
  • 70:53 - 70:55
    K possible values,
  • 70:55 - 70:59
    if you're trying to estimate the parameter of the multinomial, then you estimate P of Y = 1.
  • 70:59 - 71:02
    Let's
  • 71:02 - 71:06
    see.
  • 71:06 - 71:11
    So the maximum likelihood estimate will be sum from I = 1 to M,
  • 71:11 - 71:16
    indicator YI = J, over M,
  • 71:16 - 71:19
    right?
  • 71:19 - 71:22
    That's the maximum likelihood estimate
  • 71:22 - 71:24
    of a multinomial probability of Y
  • 71:24 - 71:27
    being equal to - oh,
  • 71:27 - 71:29
    excuse me, Y = J. All right.
  • 71:29 - 71:33
    That's the maximum likelihood estimate for the probability of Y = J,
  • 71:33 - 71:37
    and so when you apply Laplace smoothing to that,
  • 71:37 - 71:40
    you add one to the numerator, and
  • 71:40 - 71:42
    add K to the denominator,
  • 71:42 - 71:49
    if Y can take on K possible values, okay?
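    (A sketch of that multinomial case, assuming labels take values 1 through K; the names are illustrative.)

        def laplace_multinomial(labels, k):
            # P(y = j) estimate: (count of j + 1) / (m + k), for each j in 1..k.
            m = len(labels)
            return {j: (sum(1 for y in labels if y == j) + 1) / (m + k)
                    for j in range(1, k + 1)}

        print(laplace_multinomial([2, 2, 3], k=3))  # {1: 1/6, 2: 3/6, 3: 2/6}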
  • 72:06 - 72:09
    So for Naive Bayes,
  • 72:09 - 72:14
    what that gives us is -
  • 72:14 - 72:21
    shoot.
  • 72:39 - 72:44
    Right? So that was the maximum likelihood estimate, and what you
  • 72:44 - 72:48
    end up doing is adding one to the numerator and adding two to the denominator,
  • 72:48 - 72:52
    and this solves the problem of the zero probabilities, and when your friend sends
  • 72:52 - 72:57
    you email about the NIPS conference,
  • 72:57 - 73:01
    your spam filter will still be able to
  • 73:01 - 73:08
    make a meaningful prediction, all right? Okay.
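    (Applying that to the Naive Bayes fit sketched earlier: the same counts, with one added to each numerator and two added to each denominator; again a sketch, not the lecture's exact notation.)

        def fit_naive_bayes_smoothed(X, y):
            # Same counts as the MLE fit, plus Laplace smoothing on each word probability.
            m, n = len(X), len(X[0])
            n_spam = sum(y)
            phi_y = n_spam / m
            phi_j_y1 = [(sum(X[i][j] for i in range(m) if y[i] == 1) + 1) / (n_spam + 2)
                        for j in range(n)]
            phi_j_y0 = [(sum(X[i][j] for i in range(m) if y[i] == 0) + 1) / (m - n_spam + 2)
                        for j in range(n)]
            return phi_y, phi_j_y1, phi_j_y0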
  • 73:12 - 73:15
    Shoot. Any questions about this? Yeah? Student:So that's what doesn't make sense because, for instance, if you take the
  • 73:15 - 73:16
    games on the board, it's a reasonable assumption that the probability
  • 73:16 - 73:23
    of
  • 73:25 - 73:28
    winning is very close to zero, so, I mean, the prediction should
  • 73:28 - 73:30
    be equal to, or very close to, zero. Instructor (Andrew Ng):Right.
  • 73:30 - 73:36
    I would say that
  • 73:36 - 73:38
    in this case the prediction
  • 73:38 - 73:41
    is 1/7th, right? We don't have a lot of - if you see somebody lose five games
  • 73:41 - 73:44
    in a row, you may not have a lot of faith in them,
  • 73:44 - 73:49
    but as an extreme example, suppose you saw them lose one game,
  • 73:49 - 73:52
    right? It's just not reasonable to say that the chances of winning the next game
  • 73:52 - 73:53
    is zero, but
  • 73:53 - 73:59
    that's what maximum likelihood
  • 73:59 - 74:01
    estimate
  • 74:01 - 74:05
    will say. Student:Yes. Instructor (Andrew Ng):And -
  • 74:05 - 74:08
    Student:In such a case, would the learning algorithm [inaudible] or - Instructor (Andrew Ng):So the question is, you
  • 74:08 - 74:11
    know, given just five training examples, what's a reasonable estimate for the chance of
  • 74:11 - 74:13
    winning the next game,
  • 74:13 - 74:14
    and
  • 74:14 - 74:18
    1/7th, I think, is actually pretty reasonable. It's less than 1/5th, for instance.
  • 74:18 - 74:21
    We're saying the chances of winning the next game is less
  • 74:21 - 74:25
    than 1/5th. It turns out, under a certain set of assumptions I won't go into - under a certain set of
  • 74:25 - 74:28
    Bayesian assumptions about the prior and posterior,
  • 74:28 - 74:31
    this Laplace smoothing actually gives the optimal estimate,
  • 74:31 - 74:33
    in a certain sense I won't go into
  • 74:33 - 74:37
    of what's the chance of winning the next game, and so under a certain assumption
  • 74:37 - 74:38
    about the
  • 74:38 - 74:41
    Bayesian prior on the parameter.
  • 74:41 - 74:44
    So I don't know. It actually seems like a pretty reasonable assumption to me.
  • 74:44 - 74:47
    Although, I should say, it actually
  • 74:47 - 74:50
    turned out -
  • 74:50 - 74:54
    No, I'm just being mean. We actually are a pretty good basketball team, but I chose a
  • 74:54 - 74:55
    losing streak
  • 74:55 - 74:59
    because it's funnier that way.
  • 74:59 - 75:06
    Let's see. Shoot. Does someone want to - are there other questions about
  • 75:07 - 75:09
  • 75:09 - 75:13
    this? No, yeah. Okay. So there's more that I want to say about Naive Bayes, but
  • 75:13 - 75:15
  • 75:15 - 75:16
    we'll do that in the next lecture. So let's wrap it.
Title:
Lecture 5 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on generative learning algorithms and Gaussian discriminant analysis and their applications in machine learning.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:15:31