This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today is talk about a different type of learning algorithm, and, in particular, start to talk about generative learning algorithms and a specific algorithm called Gaussian Discriminant Analysis. We'll take a slight digression to talk about Gaussians, I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of Naive Bayes and Laplace smoothing.

So just to motivate our discussion of generative learning algorithms, by way of contrast, the sort of classification algorithms we've been talking about so far I think of as algorithms that do the following. You're given a training set, and you run an algorithm like logistic regression on that training set. The way I think of logistic regression is that it looks at the data and tries to find a straight line to divide the crosses and O's, right? So it's, sort of, trying to find a straight line, let me just make the data a bit noisier, trying to find a straight line that separates out the positive and the negative classes as well as possible. And, in fact, I can show it on the laptop; maybe just use the screens or the small monitors for this.

In fact, you can see there's the data set with logistic regression, and I've initialized the parameters randomly, so the hypothesis that logistic regression outputs at iteration zero is that straight line shown at the bottom right. After one iteration of gradient descent, the straight line changes a bit. After two iterations, three, four, until logistic regression converges and has found a straight line that, more or less, separates the positive and negative classes, okay? So you can think of this as logistic regression, sort of, searching for a line that separates the positive and the negative classes.

What I want to do today is talk about an algorithm that does something slightly different, and to motivate it, let's use our old example of trying to classify between malignant cancer and benign cancer, right? So a patient comes in and they have a cancer; you want to know if it's a malignant, or harmful, cancer, or if it's a benign, meaning harmless, cancer.
So rather than trying to find a straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers, that is, look through the training set for all the positive examples of malignant cancers, and then build a model for what malignant cancers look like. Then we'll go through our training set again, take out all of the examples of benign cancers, and build a model for what benign cancers look like, okay?

And then, when you need to classify a new example, when you have a new patient and you want to decide whether this cancer is malignant or benign, you take the new cancer, you match it to your model of malignant cancers, you match it to your model of benign cancers, and you see which model it matches better. Depending on which model it matches better, you then predict whether the new cancer is malignant or benign, okay?

So what I just described, this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers, is called a generative learning algorithm, and let me just, kind of, formalize this.

So the models that we've been talking about previously were actually all discriminative learning algorithms. Stated more formally, a discriminative learning algorithm is one that either learns P of Y given X directly, or learns a hypothesis that outputs values 0 or 1 directly, okay? So logistic regression is an example of a discriminative learning algorithm.

In contrast, a generative learning algorithm models P of X given Y, the probability of the features given the class label, and, as a technical detail, it also models P of Y, but that's the less important piece. The interpretation is that a generative model builds a probabilistic model for what the features look like, conditioned on the class label, okay? In other words, conditioned on whether a cancer is malignant or benign, it models a probability distribution over what the features of the cancer look like.

Then, having built this model, having built a model for P of X given Y and for P of Y, by Bayes rule you can compute P of Y = 1 conditioned on X.
This is just P of Y = 1 given X equals P of X given Y = 1 times P of Y = 1, divided by P of X, and, if necessary, you can calculate the denominator as P of X equals P of X given Y = 1 times P of Y = 1, plus P of X given Y = 0 times P of Y = 0, right?

And so by modeling P of X given Y and modeling P of Y, you can actually use Bayes rule to get back to P of Y given X, but a generative learning algorithm starts by modeling P of X given Y, rather than P of Y given X, okay? We'll talk about some of the tradeoffs, and why this may be a better or worse idea than a discriminative model, a bit later.

Let's go through a specific example of a generative learning algorithm, and for this specific motivating example I'm going to assume that your input features X are in R^n and are continuous valued, okay? And under this assumption, let me describe to you a specific algorithm called Gaussian Discriminant Analysis, and the, I guess, core assumption is that in the Gaussian discriminant analysis model we're going to assume that P of X given Y is Gaussian, okay?

So actually, just raise your hand: how many of you have seen a multivariate Gaussian before, not a 1D Gaussian, but the higher-dimensional one? Okay, cool, maybe half of you, two-thirds of you. So let me just say a few words about Gaussians, and for those of you that have seen it before, this will be a refresher.

So we say that a random variable Z is distributed multivariate Gaussian, written with the script N for normal, with parameters mean mu and covariance matrix Sigma, if Z has density one over (2 pi) to the n/2 times the determinant of Sigma to the one-half, times e to the minus one-half (z minus mu) transpose Sigma inverse (z minus mu), okay? That's the formula for the density; it's the generalization of the one-dimensional Gaussian, with its familiar bell-shaped curve, to a higher-dimensional vector-valued random variable Z.

Don't worry too much about this formula for the density; you rarely end up needing to use it. The two key quantities are this vector mu, which is the mean of the Gaussian, and this matrix Sigma, which is the covariance matrix, and so Sigma will be equal to, right, the definition of the covariance of a vector-valued random variable, the expectation of (Z minus mu) times (Z minus mu) transpose, okay?
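For reference, here are the density and the covariance definition just stated, written out in standard notation (the determinant factor |Sigma|^(1/2) is the term the instructor corrects on the board a little later):

```latex
Z \sim \mathcal{N}(\mu,\Sigma)
\quad\text{with density}\quad
p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
       \exp\!\Big(-\tfrac{1}{2}(z-\mu)^{T}\Sigma^{-1}(z-\mu)\Big),
\qquad
\Sigma = \mathrm{E}\big[(Z-\mu)(Z-\mu)^{T}\big].
```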
And, actually, if this doesn't look familiar to you, you might re-watch the discussion section that the TAs held last Friday, or the one they'll be holding later this week, which is, sort of, a recap of probability, okay?

So the multivariate Gaussian is parameterized by a mean and a covariance, and, can I have the laptop displayed, please, I'll just go ahead and show you graphically the effects of varying the parameters of a Gaussian.

So what I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity; the covariance matrix is shown in the upper right-hand corner of the slide, and there's the familiar bell-shaped curve in two dimensions. If I shrink the covariance matrix, instead of the identity covariance, then the Gaussian becomes more peaked, and if I widen the covariance, say to twice the identity, then the density becomes more spread out, okay? That first one was the standard normal, with identity covariance.

If I increase the off-diagonals of the covariance matrix, that is, if I make the variables correlated, the Gaussian becomes flattened out along the X = Y direction, and if I increase them even further, then my two variables, Z1 and Z2 on the horizontal axes, become even more correlated. I'll just show the same thing in contours. The standard normal distribution has contours that are actually circles; because of the aspect ratio they look like ellipses here, but they should actually be circles. And if you increase the off-diagonals of the Gaussian's covariance matrix, the contours become ellipses aligned along the, sort of, 45-degree direction in this example.

This is the same thing. Here's an example of a Gaussian density with negative covariances, so now the correlation goes the other way, and here with even stronger negative covariance, and the same thing in contours: this is a Gaussian with negative entries on the off-diagonals, and with even larger negative entries on the off-diagonals, okay?

The other parameter of the Gaussian is the mean, so this is with mean zero, and as you change the mean parameter, the location of the Gaussian just moves around, okay? All right.
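If you want to reproduce plots like the ones shown on the laptop, here is a minimal sketch (not from the lecture; the specific covariance values are illustrative choices) that evaluates the 2D Gaussian density on a grid for a few covariance matrices and draws the contours:

```python
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

# Grid over which to evaluate the 2D density.
z1, z2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([z1, z2])

# A few illustrative covariance matrices: identity, widened,
# positively correlated, and negatively correlated.
covariances = {
    "identity":             np.array([[1.0, 0.0], [0.0, 1.0]]),
    "wider (2I)":           np.array([[2.0, 0.0], [0.0, 2.0]]),
    "positive correlation": np.array([[1.0, 0.8], [0.8, 1.0]]),
    "negative correlation": np.array([[1.0, -0.8], [-0.8, 1.0]]),
}

fig, axes = plt.subplots(1, len(covariances), figsize=(16, 4))
for ax, (name, sigma) in zip(axes, covariances.items()):
    density = multivariate_normal(mean=[0.0, 0.0], cov=sigma).pdf(grid)
    ax.contour(z1, z2, density)   # circles for the identity, ellipses otherwise
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```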
So that was a quick primer on what Gaussians look like, and here's a roadmap, a picture to keep in mind as we describe the Gaussian discriminant analysis algorithm; this is what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm, what I'm going to do is look at the positive examples, say the crosses, and, looking at only the positive examples, fit a Gaussian distribution to them, and so maybe I end up with a Gaussian distribution like that, okay? So that's P of X given Y = 1. And then I'll look at the negative examples, the O's in this figure, and I'll fit a Gaussian to those, and maybe I get a Gaussian centered over there; that's my second Gaussian. And together, we'll say how later, these two Gaussian densities will define a separator for the two classes, okay?

And it'll turn out that the separator will be a little bit different from what logistic regression gives you. If you run logistic regression, you actually get the decision boundary shown by the green line, whereas Gaussian discriminant analysis gives you the blue line, okay? Switch back to the chalkboard, please. All right.

Here's the Gaussian discriminant analysis model. We're going to model P of Y as a Bernoulli random variable as usual, parameterized by a parameter phi; you've seen this before. And we model P of X given Y = 0 as a Gaussian - oh, you know what? Yes, excuse me, I thought this looked strange: there should be a determinant of Sigma to the one-half in the denominator there. It's no big deal; I was missing the determinant of Sigma to the one-half on a previous board, excuse me. Okay.

And so I model P of X given Y = 0 as a Gaussian with mean mu0 and covariance Sigma, and P of X given Y = 1 as a Gaussian with mean mu1 and that same covariance Sigma, okay?
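Collected in one place, the model just written on the board (the shared covariance Sigma for both classes matches the parameter list that follows):

```latex
y \sim \mathrm{Bernoulli}(\phi), \qquad
x \mid y=0 \;\sim\; \mathcal{N}(\mu_0,\Sigma), \qquad
x \mid y=1 \;\sim\; \mathcal{N}(\mu_1,\Sigma),
```

so that, for example,

```latex
p(x\mid y=0) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
               \exp\!\Big(-\tfrac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)\Big).
```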
And so the parameters of this model are phi, mu0, mu1, and Sigma, and I can now write down the likelihood of the parameters, or, excuse me, actually the log likelihood of the parameters, as the log of that, right? So, in other words, given the training set, I can write down the log likelihood of the parameters as the log of the product from I = 1 to M of the joint probabilities P of XI comma YI, and this is just equal to the log of the product from I = 1 to M of P of XI given YI, times P of YI, where each of these terms, P of XI given YI and P of YI, is given by one of the three equations on top, okay?

And I just want to contrast this again with discriminative learning algorithms, right? To give it a name, this is sometimes called the joint data likelihood, the joint likelihood, and let me just contrast it with what we had previously when we were talking about logistic regression, where I said the log likelihood of the parameter theta was the log of the product from I = 1 to M of P of YI given XI, parameterized by theta, right? So back when we were fitting logistic regression models or generalized linear models, we were always modeling P of YI given XI, parameterized by theta, and that was the conditional likelihood, okay, in which we model P of YI given XI; whereas, now, with generative learning algorithms, we're instead going to look at the joint likelihood, which is P of XI comma YI, okay?

So let's see. Given the training set, and using the Gaussian discriminant analysis model, to fit the parameters of the model we'll do maximum likelihood estimation as usual, and so you maximize this log likelihood with respect to the parameters phi, mu0, mu1, Sigma, and if you find the maximum likelihood estimates of the parameters, you find that phi is - well, the maximum likelihood estimate is actually no surprise, and I'm writing it down mainly as practice with the indicator notation, all right?
So the maximum likelihood estimate for phi would be the sum over I of YI, divided by M, or, written alternatively, the sum over all your training examples of the indicator of YI = 1, divided by M, okay? In other words, the maximum likelihood estimate for the Bernoulli parameter phi is just the fraction of training examples with label one, with Y equal to 1.

The maximum likelihood estimate for mu0 is this: the sum from I = 1 to M of the indicator of YI = 0, times XI, divided by the sum from I = 1 to M of the indicator of YI = 0, okay? You should stare at this for a second and see if it makes sense. Actually, I'll just write the next one, for mu1, while you do that; it's the same thing with indicators of YI = 1, okay?

So what this is: the denominator is a sum over your training set of the indicator of YI = 0, so for every training example for which YI = 0, this increments the count by one, all right? So the denominator is just the number of examples with label zero, all right? And then in the numerator, the sum from I = 1 to M, every time YI is equal to zero this indicator is a one, and otherwise it's a zero, and so the indicator function means you're including only the terms for which YI is equal to zero, because for all the examples where YI is equal to one, the term in the sum is equal to zero. You then multiply that by XI, and so the numerator is really the sum of the XI's corresponding to examples where the class label was zero, okay? Raise your hand if this makes sense. Okay, cool.

So just to say this less formally, it means: look through your training set, find all the examples for which Y = 0, and take the average of the values of X for those examples. So take all your negative training examples and average their values of X, and that's mu0, okay? If this notation is still a little bit cryptic, if you're still not sure why the equation translates into what I just said, do go home and stare at it for a while until it makes sense. This is, sort of, no surprise; it just says that to estimate the mean for the negative examples, you take all your negative examples and average them. So no surprise, but this is useful practice with the indicator notation. Similarly, there's the maximum likelihood estimate for Sigma; I won't derive that here, you can read it in the notes yourself.
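As a concrete illustration of these estimates, and of the prediction rule discussed next, here is a minimal NumPy sketch, assuming a design matrix X of shape (m, n) and a 0/1 label vector y (the array names are mine, not from the lecture). The Sigma update written here is the standard shared-covariance maximum likelihood estimate from the notes the instructor points to, so treat it as stated there rather than derived on the board.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gda(X, y):
    """Maximum likelihood estimates of the GDA parameters phi, mu0, mu1, Sigma."""
    m, n = X.shape
    phi = np.mean(y == 1)              # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)       # average of the negative examples
    mu1 = X[y == 1].mean(axis=0)       # average of the positive examples
    # Shared covariance: average outer product of (x - mu_{y}) over all examples.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Predict arg max over y of p(x | y) p(y); the denominator p(x) cancels."""
    p_x_given_y0 = multivariate_normal(mu0, Sigma).pdf(X)
    p_x_given_y1 = multivariate_normal(mu1, Sigma).pdf(X)
    return (p_x_given_y1 * phi > p_x_given_y0 * (1 - phi)).astype(int)
```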
And so, having fit the parameters phi, mu0, mu1, and Sigma to your data, you now need to make a prediction. You know, when you're given a new value of X, when you're given a new cancer, you need to predict whether it's malignant or benign. Your prediction is then going to be, let's say, the most likely value of Y given X, and I should write a semicolon and the parameters there. So that's the arg max over Y of P of Y given X, which, by Bayes rule, is the arg max over Y of P of X given Y times P of Y, divided by P of X, all right?

And that is, in turn, just the arg max over Y of P of X given Y times P of Y, because the denominator P of X doesn't depend on Y. And if P of Y is uniform, in other words, if each of your classes is equally likely, so if P of Y takes the same value for all values of Y, then this is just the arg max over Y of P of X given Y, okay? That happens sometimes, maybe not very often, so usually you end up using the formula where you compute both P of X given Y and P of Y using your model, okay?

Student: Can you go over arg max?
Instructor (Andrew Ng): Oh, let's see. Arg max means the value of the argument that maximizes the expression. So, just for an example, the min over X of (X - 5) squared is 0, because by choosing X equal to 5 you can get this to be zero, whereas the arg min over X of (X - 5) squared is equal to 5, because 5 is the value of X that minimizes it, okay? Cool. Thanks for asking that.
Instructor (Andrew Ng): Okay. Actually, any other questions about this? Yeah?
Student: Why is the distribution uniform? Why isn't [inaudible] -
Instructor (Andrew Ng): Oh, I see. By uniform I meant, I was being loose here, I meant if P of Y = 0 is equal to P of Y = 1, or if Y has the uniform distribution over the set {0, 1}.
Student: Oh.
Instructor (Andrew Ng): I just meant, yeah, if P of Y = 0 equals P of Y = 1. That's all I mean, see? Anything else? All right. Okay.

So it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression. Let me illustrate that. So let's say you have a training set; actually, let me just go ahead and draw a 1D training set, and that will kind of work, yes, okay.
So let's say we have a training set comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis. So I'll fit a Gaussian density to each of the two classes, to my positive and my negative training examples, and maybe my positive examples, the X's, are fit with a Gaussian like this, and my negative examples are fit with a Gaussian that looks like that, okay?

Now, I hope this [inaudible]. Now, let's vary along the X axis, and what I want to do is overlay on top of this plot a plot of P of Y = 1 given X for a variety of values of X, okay? Actually, I realize what I should have done: I'm going to call the X's the negative examples and the O's the positive examples; it just makes this part come out better.

So let's take a value of X that's fairly small, say X is this value here on the horizontal axis. Then what's the probability of Y being equal to one conditioned on X? Well, the way you calculate that is you write P of Y = 1 given X, and then you plug in all these formulas as usual, right? It's P of X given Y = 1, which is your Gaussian density, times P of Y = 1, which is essentially just going to be equal to phi, and then divided by P of X, and that shows you how you can calculate it. By using these two Gaussians and my estimate of phi for P of Y, I can actually compute what P of Y = 1 given X is, and in this case, if X is this small, clearly it belongs to the left Gaussian; it's very unlikely to belong to the positive class, and so this quantity will be very small, very close to zero, say, okay?

And then we can increment the value of X a bit, study a different value of X, and plot what P of Y = 1 given X is, and, again, it'll be pretty small. Let's use a point like that, right? At this point, the two Gaussian densities have equal value, and if I ask, for X at the value shown by the arrow, what's the probability of Y being equal to one? Well, you really can't tell, so maybe it's about 0.5, okay? And if you fill in a bunch more points, you get a curve like that, and then you can keep going.
Let's say for a point like that, you can ask: what's the probability of Y being one? Well, if X is that far out, then clearly it belongs to the rightmost Gaussian, and so the probability of Y being one would be very high, almost one, okay? And so you can repeat this exercise for a bunch of points, all right, compute P of Y equals one given X for a bunch of points, and if you connect up these points, you find that the curve you get, plotted, takes the form of a sigmoid function, okay?

So, in other words, when you make the assumption of the Gaussian discriminant analysis model, that P of X given Y is Gaussian, and you go back and compute what P of Y given X is, you actually get back exactly the same form of sigmoid function that we were using for logistic regression, okay? But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and a different steepness of the sigmoid than logistic regression would. Is there a question?

Student: I'm just wondering, the Gaussian of P of Y [inaudible] you do?
Instructor (Andrew Ng): No, let's see. This Gaussian is P of X given Y = 1, and this Gaussian is P of X given Y = 0; does that make sense? Anything else?
Student: Okay.
Instructor (Andrew Ng): Yeah?
Student: When you were drawing all the dots, how did you decide what Y given P of X was?
Instructor (Andrew Ng): What - say that again.
Student: I'm sorry. Could you go over how you figured out where to draw each dot?
Instructor (Andrew Ng): Let's see, okay. So the computation is as follows, right? The steps are: I have the training set, and given my training set, I'm going to fit a Gaussian discriminant analysis model to it, and what that means is I'll build a model for P of X given Y = 1, I'll build a model for P of X given Y = 0, and I'll also fit a Bernoulli distribution to P of Y, okay? So, in other words, given my training set, I'll fit P of X given Y and P of Y to my data, and now I've chosen my parameters phi, mu0, mu1, and Sigma, okay? Then this is the process I went through to plot all these dots, right? I just pick a point on the X axis, and then I compute P of Y given X for that value of X, and P of Y = 1 conditioned on X will be some value between zero and one.
It'll be some real number, and whatever that real number is, I then plot it on the vertical axis, okay? And the way I compute P of Y = 1 conditioned on X is I use these quantities: I use P of X given Y and P of Y, and, sort of, plug them into Bayes rule, and that allows me to compute P of Y given X from those three quantities; does that make sense?
Student: Yeah.
Instructor (Andrew Ng): Was there something more that -
Student: And how did you model P of X; is that -
Instructor (Andrew Ng): Oh, okay. Yeah, so, well, we've got that right here. P of X can be written as P of X given Y = 0 times P of Y = 0, plus P of X given Y = 1 times P of Y = 1, right? And each of these terms, P of X given Y and P of Y, is something I can get directly out of my Gaussian discriminant analysis model; each of these terms is something my model gives me directly. So that gets plugged in as the denominator, and by doing that, that's how I compute P of Y = 1 given X; make sense?
Student: Thank you.
Instructor (Andrew Ng): Okay. Cool.
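Here is a small sketch of exactly the computation just described, for a 1D example (my own illustration, not from the lecture; the sample values and the shared standard deviation are illustrative assumptions): fit a Gaussian to each class and a Bernoulli prior, then sweep x and compute P(y = 1 | x) by Bayes rule. Plotting the resulting values traces out the sigmoid-shaped curve drawn on the board.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1D training data: negative examples clustered near -2,
# positive examples clustered near +2 (illustrative values only).
x_neg = np.array([-3.1, -2.4, -2.0, -1.7, -1.2])
x_pos = np.array([ 1.1,  1.8,  2.2,  2.6,  3.0])

# Fit the generative model: one Gaussian per class, Bernoulli prior on y.
phi = len(x_pos) / (len(x_pos) + len(x_neg))               # P(y = 1)
mu0, mu1 = x_neg.mean(), x_pos.mean()
sigma = np.concatenate([x_neg - mu0, x_pos - mu1]).std()   # shared std. dev.

# Sweep x and compute P(y = 1 | x) by Bayes rule, as in the lecture.
xs = np.linspace(-5, 5, 200)
p_x_given_y1 = norm.pdf(xs, mu1, sigma)
p_x_given_y0 = norm.pdf(xs, mu0, sigma)
p_y1_given_x = (p_x_given_y1 * phi) / (p_x_given_y1 * phi + p_x_given_y0 * (1 - phi))
# Plotting p_y1_given_x against xs gives a sigmoid-shaped curve.
```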
So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm, okay? In the particular case of Gaussian discriminant analysis, we assume that X conditioned on Y is Gaussian, and the argument I showed on the previous chalkboard, I didn't prove it formally, but you can actually go back and prove it yourself, is that if you assume X given Y is Gaussian, then that implies that when you plot P of Y given X, you get a logistic posterior. In other words, if you assume X given Y is Gaussian, then the posterior distribution, the form of P of Y = 1 given X, is going to be a logistic function, and it turns out the implication in the opposite direction does not hold true, okay?

In particular, it actually turns out, and this is, kind of, cool, that if you assume X given Y = 1 is Poisson with parameter lambda 1, and X given Y = 0 is Poisson with parameter lambda 0, then that also implies that P of Y given X is logistic, okay? So there are lots of assumptions on X given Y that will lead to P of Y given X being logistic, and, therefore, the assumption that X given Y is Gaussian is a stronger assumption than the assumption that P of Y given X is logistic, okay? Because the former implies the latter, right? The logistic posterior holds whenever X given Y is Gaussian, but not vice versa.
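The instructor leaves the proof as an exercise; here is a sketch of the key step for the Gaussian case with the shared covariance above (the intermediate algebra is mine, not shown in the lecture):

```latex
p(y=1\mid x)
  = \frac{p(x\mid y=1)\,\phi}{p(x\mid y=1)\,\phi + p(x\mid y=0)\,(1-\phi)}
  = \frac{1}{1+\dfrac{1-\phi}{\phi}\cdot\dfrac{p(x\mid y=0)}{p(x\mid y=1)}}
  = \frac{1}{1+\exp(-\theta^{T}x-\theta_{0})}.
```

Because both class-conditional densities share the same Sigma, the quadratic terms in x cancel in the ratio of the two Gaussian densities, leaving an exponent that is linear in x, with theta = Sigma^{-1}(mu1 - mu0) and theta_0 a constant collecting the remaining terms (including log(phi/(1 - phi))); that is exactly a logistic function of x.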
And so this leads to some of the tradeoffs between Gaussian discriminant analysis and logistic regression, right? Gaussian discriminant analysis makes the much stronger assumption that X given Y is Gaussian, and so when that assumption is true, or approximately holds, if you plot the data and X given Y is indeed approximately Gaussian, then if you make the assumption explicit in the algorithm, the algorithm will do better, because it's as if the algorithm is making use of more information about the data; the algorithm knows that the data is Gaussian, right? So if the Gaussian assumption holds, or roughly holds, then Gaussian discriminant analysis may do better than logistic regression.

If, conversely, you're actually not sure what X given Y is, then logistic regression, the discriminative algorithm, may do better. In particular, suppose you use logistic regression, and maybe you assumed beforehand that the data was Gaussian, but it turns out the data was actually Poisson, right? Then logistic regression will still do perfectly fine, because if the data were actually Poisson, then P of Y = 1 given X will be logistic, and it'll do perfectly fine; but if you assumed it was Gaussian, then the algorithm may go off and do something that's not as good, okay?

So it turns out the real advantage of generative learning algorithms is often that they require less data, and, in particular, data is never really exactly Gaussian; data is often approximately Gaussian, but it's never exactly Gaussian. And it turns out generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But the other tradeoff is that, by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data in order to fit, like, an okay model, even if there's less training data; whereas, in contrast, logistic regression, by making fewer assumptions, is more robust to your modeling assumptions, because you're making a weaker assumption, you're making fewer assumptions, but sometimes it takes a slightly larger training set to fit than Gaussian discriminant analysis. Question?

Student: In order to meet any assumption about the number [inaudible], plus here we assume that P of Y = 1 [inaudible]. Is that true when the number of samples is marginal?
Instructor (Andrew Ng): Okay. So let's see. So there's a question of, is this true - what was that? Let me translate that differently. The modeling assumptions are made independently of the size of your training set, right? So, like, in least squares regression, well, in all of these models, I'm assuming that these are random variables drawn from some distribution, and then, finally, I'm given a single training set and that's used to estimate the parameters of the distribution, right?
Student: So what's the probability of Y = 1?
Instructor (Andrew Ng): The probability of Y = 1?
Student: Yeah, you used the -
Instructor (Andrew Ng): Sort of, this, like - back to the philosophy of maximum likelihood estimation, right? I'm assuming that P of Y is equal to phi to the Y, times one minus phi to the one minus Y. So I'm assuming that there's some true value of phi generating all my data, and then - well, when I write this, I guess, maybe what I should write isn't - so when I write this, I guess there are really two values of phi.
One is the true underlying value of phi that was used to generate the data, and then there's the maximum likelihood estimate of phi, and so when I was writing those formulas earlier, those formulas for phi, mu0, and mu1 were really the maximum likelihood estimates of phi, mu0, and mu1, and that's different from the true underlying values of phi, mu0, and mu1, but -
Student: [Off mic].
Instructor (Andrew Ng): Yeah, right. So the maximum likelihood estimate comes from the data, and there's some, sort of, true underlying value of phi that I'm trying to estimate, and my maximum likelihood estimate is my attempt to estimate that true value, but, you know, by notational convention, we often just write it the same way without bothering to distinguish between the maximum likelihood value and the true underlying value that I'm assuming is out there, and that I'm only hoping to estimate.

Actually, yeah, so for questions like these about maximum likelihood and so on, I hope the Friday discussion section will be a good time to ask about, sort of, probabilistic definitions like these as well. Are there any other questions? No? Great. Okay.

So, great. Oh, it turns out, just to mention one more thing that's, kind of, cool: I said that if X given Y is Poisson, you also get a logistic posterior, and it actually turns out there's a more general version of this. If you assume X given Y = 1 is an exponential family distribution with natural parameter eta 1, and you assume X given Y = 0 is an exponential family distribution with natural parameter eta 0, then this implies that P of Y = 1 given X is also logistic, okay? And that's, kind of, cool. It means that X given Y could be, I don't know, some strange thing; it could be gamma, because we've seen Gaussian, gamma, exponential, there's actually a beta; I'm just rattling off my mental list of exponential family distributions.
It could be any one of those things, so if you use the same exponential family distribution for the two classes, with different natural parameters, then the posterior P of Y = 1 given X will be logistic, and so this shows the robustness of logistic regression to the choice of modeling assumptions, because it could be that the data was actually, you know, gamma distributed, and the posterior still turns out to be logistic. So this is the robustness of logistic regression to modeling assumptions.

And that's the second justification. I think early on I promised you two justifications for where I pulled the logistic function out of a hat, right? So one was the exponential family derivation we went through last time, and this is, sort of, the second one: all of these modeling assumptions also lead to the logistic function. Yeah?
Student: [Off mic].
Instructor (Andrew Ng): Oh, whether P of Y = 1 given X being logistic implies this? No, that's also not true, right? This exponential family assumption implies that P of Y = 1 given X is logistic, but the reverse implication is not true. There are actually all sorts of really bizarre distributions for X that would give rise to a logistic function, okay? Okay.

So let's talk about - that was our first generative learning algorithm; now I'll talk about a second generative learning algorithm, called the Naive Bayes algorithm, and the motivating example I'm going to use will be spam classification. All right. So let's say that you want to build a spam classifier to take your incoming stream of email and decide if it's spam or not.

So let's see. Y will be 0 or 1, with 1 being spam email and 0 being non-spam, and the first decision we need to make is: given a piece of email, how do you represent the piece of email using a feature vector X, right? Email is just a piece of text, right? Email is like a list of words, or a list of ASCII characters, so how do I represent it as a feature vector X?

We'll use a couple of different representations, but the one I'll use today is the following. We'll construct the vector X as follows: I'm going to go through my dictionary and, sort of, make a listing of all the words in my dictionary, okay? So the first word is A, the second word in my dictionary is Aardvark, then Ausworth, okay?
You know, and somewhere along the way you see the word "buy", as in the spam emails telling you to buy stuff. And, depending on how you collect your list of words, you won't find CS229, right, the course number, in a dictionary, but if you collect your list of words from other emails you've received, you'll have that in the list somewhere as well, and then the last word in my dictionary is zymurgy, which pertains to the technological chemistry that deals with the fermentation process in brewing.

So say I get a piece of email, and what I'll do is scan through this list of words, and wherever a certain word appears in my email, I'll put a 1 there. So if a particular email has the word "a", then that entry is 1. My email doesn't have the words Ausworth or Aardvark, so those get zeros. And, again, it's a piece of email where they want me to buy something, CS229 doesn't occur, and so on, okay? So this would be one way of creating a feature vector to represent a piece of email.
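As a quick illustration of this representation, here is a sketch with a made-up five-word vocabulary standing in for the 50,000-word dictionary from the lecture (the vocabulary and the example email are assumptions for illustration only):

```python
import numpy as np

# Hypothetical tiny vocabulary standing in for the 50,000-word dictionary.
vocabulary = ["a", "aardvark", "ausworth", "buy", "zymurgy"]
word_index = {word: j for j, word in enumerate(vocabulary)}

def email_to_feature_vector(email_text):
    """Binary vector x with x[j] = 1 iff the j-th dictionary word appears in the email."""
    x = np.zeros(len(vocabulary), dtype=int)
    for word in email_text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

# Example: a spammy email mentioning "buy" sets that coordinate to 1.
print(email_to_feature_vector("Buy a new watch now"))   # -> [1 0 0 1 0]
```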
Now, let's write down a generative model for this. In other words, I want to model P of X given Y, for Y = 0 or Y = 1, all right? And my feature vectors are going to be in {0, 1} to the N; they're binary-valued vectors, N dimensional, where N may be on the order of, say, 50,000, if you have 50,000 words in your dictionary, which is not atypical. So values from, I don't know, the mid-thousands to tens of thousands are very typical for problems like these.

And, therefore, there are two to the 50,000 possible values for X, right? Two to the 50,000 possible bit vectors of length 50,000, and so one way to model this is with a multinomial distribution, but because there are two to the 50,000 possible values for X, I would need two to the 50,000 minus 1 parameters, right? Because they have to sum to 1, hence the minus 1, and this is clearly way too many parameters to model with a multinomial distribution over all two to the 50,000 possibilities.

So in the Naive Bayes algorithm, we're going to make a very strong assumption on P of X given Y, and, in particular, I'm going to assume - let me just say what it's called, then I'll write out what it means - I'm going to assume that the XI's are conditionally independent given Y, okay?

Let me say what this means. So I have P of X1, X2, up to X50,000, given Y. By the chain rule of probability, this is P of X1 given Y, times P of X2 given Y and X1, times P of X3 given Y, X1, and X2, and so on, up to P of X50,000 given Y and all the previous X's, okay? This is the chain rule of probability; it always holds; I've not made any assumption yet. And now we're going to make what's called the Naive Bayes assumption, the assumption that the XI's are conditionally independent given Y. Nothing changes for the first term, but I'm going to assume that P of X2 given Y and X1 is equal to P of X2 given Y, that the next term is equal to P of X3 given Y, and so on, up to P of X50,000 given Y, okay? Or, written more compactly, we assume that P of X1 up to X50,000 given Y is the product from I = 1 to 50,000 of P of XI given Y, okay?

And stating informally what this means: I'm, sort of, assuming that, so long as you know the class label Y, so long as you know whether this is spam or not spam, then knowing whether the word "a" appears in the email does not affect the probability of whether the word Ausworth appears in the email, all right? In other words, once you know whether an email is spam or not spam, then knowing whether other words appear in the email won't help you predict whether any particular other word appears in the email, okay?
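Written out, the chain-rule expansion and the Naive Bayes assumption just described are:

```latex
p(x_1,\dots,x_{50000}\mid y)
  = p(x_1\mid y)\,p(x_2\mid y,x_1)\cdots p(x_{50000}\mid y,x_1,\dots,x_{49999})
  = \prod_{i=1}^{50000} p(x_i\mid y) \quad \text{(Naive Bayes assumption)}.
```

Under this assumption, the class-conditional model needs only the 2 x 50,000 Bernoulli parameters phi_{i|y=1} and phi_{i|y=0} (plus phi_y for the class prior), rather than the 2^50,000 minus 1 parameters of a full multinomial over all bit vectors.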
So this assumption is normally just false for normal written English, but it turns out that despite this assumption being false in the literal sense, the Naive Bayes algorithm is an extremely effective algorithm for classifying text documents into spam or not spam, for automatically sorting your emails into different folders, for looking at web pages and classifying whether a webpage is trying to sell something, or whatever. It turns out this assumption works very well for classifying text documents, and for other applications too that I'll talk a bit about later.

As a digression that will make sense only to some of you: if you're familiar with Bayes nets — say, graphical models — the Bayesian network associated with this model looks like this: there is a random variable Y that then generates X1, X2, through X50,000, okay? If you've not seen a Bayes net before, if you don't know your graphical models, just ignore this; it's not important for our purposes. But if you've seen it before, that's what it would look like. Okay.

So the parameters of the model are as follows: phi I given Y = 1, which is the probability that XI = 1 given Y = 1; phi I given Y = 0; and phi Y, okay? So these are the parameters of the model, and, therefore, to fit the parameters of the model, you can write down the joint likelihood — the product over your M training examples of the joint probability of XI and YI — as usual, okay?

So given the training set, you can write down the joint likelihood of the parameters, and then when you do maximum likelihood estimation, you find that the maximum likelihood estimates of the parameters are pretty much what you'd expect. The maximum likelihood estimate for phi J given Y = 1 is the sum from I = 1 to M of the indicator that XIJ = 1 and YI = 1, divided by the sum from I = 1 to M of the indicator that YI = 1, okay?
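To pin down that notation, the parameters and the joint likelihood are, using m for the number of training examples:

    \phi_{j|y=1} = P(x_j = 1 \mid y = 1), \qquad
    \phi_{j|y=0} = P(x_j = 1 \mid y = 0), \qquad
    \phi_y = P(y = 1)

    \mathcal{L}(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)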
Stated more simply, the numerator just says: run through your entire training set, your M training examples, and count up the number of times you saw word J in a piece of email for which the label Y was equal to 1. In other words, look through all your spam emails and count the number of emails in which word J appeared, out of all your spam emails. And the denominator is the sum from I = 1 to M of the indicator that YI = 1 — the denominator is just the number of spam emails you got. So this ratio is: of all the spam emails in your training set, what fraction did word J — the Jth word in your dictionary — appear in? And that's the maximum likelihood estimate for the probability of seeing word J conditioned on the piece of email being spam, okay? And, similarly, the maximum likelihood estimate for phi Y is pretty much what you'd expect, right? Okay?

And so, having estimated all these parameters, when you're given a new piece of email that you want to classify, you can then compute P of Y given X using Bayes' rule, right? Same as before, because together these parameters give you a model for P of X given Y and for P of Y, and by using Bayes' rule, given these two terms, you can compute P of Y given X, and there's your spam classifier, okay? It turns out we need one more elaboration on this idea, but let me check if there are questions about this so far.

Student: So does this model depend on the number of inputs?

Instructor (Andrew Ng): What do you mean by number of inputs — the number of features?

Student: No, the number of samples.

Instructor (Andrew Ng): Well, M is the number of training examples, so given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? Other questions — does it make sense? M is the number of training examples, so when you have M training examples, you plug them into this formula, and that's how you compute the maximum likelihood estimates.

Student: By training examples, you mean M is the number of emails?

Instructor (Andrew Ng): Yeah, right. So it's, kind of, your training set.
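Putting those maximum likelihood estimates into symbols (with 1{·} denoting the indicator function and m the number of training examples), and writing out the Bayes' rule prediction:

    \phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad
    \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}, \qquad
    \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}

    p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)}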
I would go through all the email I've gotten in the last two months and label them as spam or not spam, and so you have — I don't know — a few hundred emails labeled as spam or not spam, and that will comprise your training set X1, Y1 through XM, YM, where each X is one of those vectors representing which words appeared in the email and Y is 0 or 1 depending on whether the email is spam or not spam, okay?

Student: So you're saying that this model depends on the number of examples, but the last model doesn't depend on the number of examples, but your phi is the same for either one.

Instructor (Andrew Ng): They're different things, right? There's the model, which is the modeling assumptions — I'm making the Naive Bayes assumption. So the probabilistic model is an assumption on the joint distribution of X and Y. That's what the model is. And then I'm given a fixed number of training examples — I'm given M training examples — and then, after I'm given the training set, I go in and write down the maximum likelihood estimate of the parameters, right? So maybe we should take that offline. Yeah, question?

Student: Then how would you do this, like, if this [inaudible] didn't work?

Instructor (Andrew Ng): Say that again.

Student: How would you do it, say, like, with the 50,000 words —

Instructor (Andrew Ng): Oh, okay. How to do this with the 50,000 words, yeah. So it turns out this is, sort of, a very practical question, really: how do I come up with this list of words? In practice, one common way to come up with a list of words is to just take all the words that appear in your training set. That's one fairly common way to do it. Or, if that turns out to be too many words, you can take all the words that appear at least three times in your training set — so words that you didn't even see three times in the emails you got in the last two months, you discard. I was talking about going through a dictionary, which is a nice way of thinking about it, but in practice you might go through your training set and just take the union of all the words that appear in it.
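As a rough sketch of that vocabulary-building step in Python — the data format (a list of raw email strings) and the interpretation of "at least three times" as three total occurrences are assumptions made here for illustration:

    from collections import Counter

    def build_vocabulary(training_emails, min_count=3):
        # training_emails: a list of raw email strings from the training set.
        counts = Counter()
        for email_text in training_emails:
            counts.update(email_text.lower().split())
        # Keep words seen at least `min_count` times; discard the rest.
        return sorted(word for word, c in counts.items() if c >= min_count)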
In some texts, by the way, you'll even see other ways of selecting these features, but this is one way to think about creating your feature vector, right — as zero-and-one values, okay? Moving on — yeah, okay, you have a question?

Student: I'm getting kind of confused about how you compute all those parameters.

Instructor (Andrew Ng): On how I came up with the parameters?

Student: Correct.

Instructor (Andrew Ng): Let's see. So the question was how I came up with the parameters, right? In Naive Bayes, I need to build a model for P of X given Y and for P of Y, right? As with generative learning algorithms in general, I need to come up with models for these. So how did I model P of Y? Well, I just chose to model it using a Bernoulli distribution, and so P of Y will be parameterized by phi Y, all right?

Student: Okay.

Instructor (Andrew Ng): And then how did I model P of X given Y? Well, my model for P of X given Y, under the Naive Bayes assumption, is that P of X given Y is the product of these probabilities, and so I'm going to need parameters to tell me what the probability is of each word occurring or not occurring, conditioned on the email being spam or not spam, okay?

Student: How is that Bernoulli?

Instructor (Andrew Ng): Oh, because X is either zero or one, right? By the way I defined the feature vectors, XI is either one or zero, depending on whether word I appears in the email, right? So, by the way I defined the feature vectors, XI is always zero or one, and so, by definition, it has to be a Bernoulli distribution, right? If XI were continuous, then you might model this as a Gaussian, and you'd end up with something like we did in Gaussian discriminant analysis. It's just that the way I constructed my features for email, XI is always binary-valued, and so you end up with a Bernoulli here, okay? All right.

I should move on. So it turns out that this idea almost works. Now, here's the problem. Let's say you complete this class and you start to do, maybe, the class project, and you keep working on your class project for a bit, and it becomes really good, and you want to submit your class project to a conference, right?
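In equations, the Bernoulli models just described are, roughly:

    p(y) = \phi_y^{\,y}\,(1 - \phi_y)^{1-y}, \qquad
    p(x \mid y) = \prod_{j=1}^{n} p(x_j \mid y), \qquad
    p(x_j \mid y = 1) = \phi_{j|y=1}^{\,x_j}\,(1 - \phi_{j|y=1})^{1 - x_j}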
So, you know, around — I don't know — June every year is the deadline for the NIPS conference; NIPS is just the name of a conference, it's an acronym. And so maybe you email your project partners or senior friends and say, Hey, let's work on a project and submit it to the NIPS conference. And so you're getting these emails with the word NIPS in them, which you've probably never seen before. So a piece of email comes from your project partner saying, Let's send a paper to the NIPS conference, and then your spam classifier will say: P of X — let's say NIPS is the 30,000th word in your dictionary, okay? — so P of X30,000 = 1, given Y = 1, will be equal to 0. That's the maximum likelihood estimate, right? Because you've never seen the word NIPS before in your training set, the maximum likelihood estimate of the probability of seeing the word NIPS is zero, and, similarly, in non-spam mail, the chance of seeing the word NIPS is also estimated as zero.

So when your spam classifier goes to compute P of Y = 1 given X, it will compute this ratio — P of X given Y = 1 times P of Y = 1, divided by the sum of that term and the corresponding term for Y = 0. And if you look at those terms, each of them is a product from I = 1 to 50,000 of P of XI given Y, and one of those probabilities will be equal to zero, because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a zero in the product, and so the numerator is zero, and in the same way it turns out the denominator will also be zero — actually, all of these terms end up being zero — so you end up with P of Y = 1 given X being 0 over 0 + 0, okay, which is undefined.

And the problem with this is that it's just statistically a bad idea to say that P of X30,000 = 1 given Y is 0, right? Just because you haven't seen the word NIPS in your last two months' worth of email, it's statistically not sound to say that, therefore, the chance of ever seeing this word is zero, right?
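Concretely, the ratio being computed breaks down like this once one of the factors is an estimated zero:

    p(y = 1 \mid x)
      = \frac{\prod_{i=1}^{50000} p(x_i \mid y = 1)\; p(y = 1)}
             {\prod_{i=1}^{50000} p(x_i \mid y = 1)\; p(y = 1) \;+\; \prod_{i=1}^{50000} p(x_i \mid y = 0)\; p(y = 0)}
      = \frac{0}{0 + 0}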
And so the problem is this idea that just because you haven't seen something before, the event may be unlikely, but that doesn't mean it's impossible — and saying that because you've never seen the word NIPS before, it is therefore impossible to ever see the word NIPS in future emails, that the chance of that is just zero, is exactly that mistake.

So we're going to fix this, and to motivate the fix, here's the example we're going to use. Let's say that you've been following the Stanford basketball team for all of their away games, and been, sort of, tracking their wins and losses to gather statistics, and maybe — I don't know — forming a betting pool about whether they're likely to win or lose the next game, okay? So here are some of the statistics. On, I guess, the 8th of February last season they played Washington State, and they did not win. On the 11th of February they played Washington; on the 22nd they played USC; then they played UCLA; then they played USC again; and now you want to estimate the chance that they'll win or lose against Louisville, right? So they played five away games last year and didn't win any of them, but it seems awfully harsh to say there's zero chance that they'll win their next game.

So here's the idea behind Laplace smoothing. We're estimating the probability of Y being equal to one, right? Normally, the maximum likelihood estimate is the number of ones divided by the number of zeros plus the number of ones, okay? I hope this informal notation makes sense. The maximum likelihood estimate for, sort of, a win-or-loss Bernoulli random variable is just the number of ones you saw divided by the total number of examples — that is, the number of zeros you saw plus the number of ones you saw.
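With the away-game record above — zero wins and five losses — that maximum likelihood estimate works out to:

    \hat{\phi}_y
      = \frac{\#\{y^{(i)} = 1\}}{\#\{y^{(i)} = 0\} + \#\{y^{(i)} = 1\}}
      = \frac{0}{5 + 0}
      = 0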
So in Laplace smoothing, we're going to take each of these counts — the number of ones and the number of zeros — and just add one to each of them: the number of zeros plus one, and the number of ones plus one. And so in our example, instead of estimating the probability of winning the next game as 0 over 5 + 0, we'll add one to all of these counts, and so we say that the chance of their winning the next game is 1/7th, okay? Which says that, having seen them lose five away games in a row, we don't think it's terribly likely they'll win the next game, but at least we're not saying it's impossible.

As a historical side note, Laplace actually came up with this method — it's called Laplace smoothing after him — when he was trying to estimate the probability that the sun will rise tomorrow, and his rationale was that even though we've seen the sun rise on a great many days, that doesn't mean we can be absolutely certain the sun will rise tomorrow. He was using this to estimate the probability that the sun will rise tomorrow. Which is kind of cool.

So, more generally, if Y takes on K possible values — if you're trying to estimate the parameters of a multinomial — then you estimate P of Y = J. The maximum likelihood estimate would be the sum from I = 1 to M of the indicator that YI = J, divided by M, right? That's the maximum likelihood estimate of the multinomial probability of Y being equal to J. And when you apply Laplace smoothing to that, you add one to the numerator and add K to the denominator, if Y can take on K possible values, okay?

So for Naive Bayes, what that gives us is this: that was the maximum likelihood estimate, and what you end up doing is adding one to the numerator and adding two to the denominator, and this solves the problem of the zero probabilities. And when your friend sends you email about the NIPS conference, your spam filter will still be able to make a meaningful prediction, all right? Okay. Any questions about this? Yeah?
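Here is a minimal sketch in Python of the Laplace-smoothed estimates and the resulting classifier, assuming the binary feature vectors from before; the function and variable names are illustrative only.

    def fit_naive_bayes(X, y):
        # X: list of binary feature vectors (lists of 0/1); y: list of 0/1 labels.
        m, n = len(X), len(X[0])
        spam = [x for x, label in zip(X, y) if label == 1]
        ham = [x for x, label in zip(X, y) if label == 0]
        # Laplace smoothing for each word: add 1 to the numerator, 2 to the denominator.
        phi_spam = [(1 + sum(x[j] for x in spam)) / (2 + len(spam)) for j in range(n)]
        phi_ham = [(1 + sum(x[j] for x in ham)) / (2 + len(ham)) for j in range(n)]
        phi_y = sum(y) / m  # plain maximum likelihood estimate of P(y = 1)
        return phi_spam, phi_ham, phi_y

    def predict_spam_probability(x, phi_spam, phi_ham, phi_y):
        # p(y = 1 | x) by Bayes' rule, using products of per-word Bernoulli probabilities.
        p_x_given_spam = 1.0
        p_x_given_ham = 1.0
        for xj, ps, ph in zip(x, phi_spam, phi_ham):
            p_x_given_spam *= ps if xj == 1 else (1 - ps)
            p_x_given_ham *= ph if xj == 1 else (1 - ph)
        numerator = p_x_given_spam * phi_y
        return numerator / (numerator + p_x_given_ham * (1 - phi_y))

With a 50,000-word dictionary you would, in practice, sum log probabilities rather than multiply them directly, since the raw products underflow to zero in floating point.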
Student: So that still doesn't quite make sense, because — for instance, if you take the games above — it's a reasonable assumption that the probability of winning is very close to zero, so, I mean, the prediction should be close to zero.

Instructor (Andrew Ng): Right. I would say that in this case the prediction is 1/7th, right? If you see somebody lose five games in a row, you may not have a lot of faith in them, but, as an extreme example, suppose you saw them lose just one game, right? It's just not reasonable to say that the chance of winning the next game is zero, but that's what the maximum likelihood estimate will say.

Student: Yes.

Instructor (Andrew Ng): And —

Student: In such a case, would the learning algorithm [inaudible] or —

Instructor (Andrew Ng): So the question is, given just five training examples, what's a reasonable estimate for the chance of winning the next game, and 1/7th, I think, is actually pretty reasonable; it's less than 1/5th, for instance, so we're saying the chance of winning the next game is less than 1/5th. It turns out that, under a certain set of Bayesian assumptions about the prior and posterior that I won't go into, this Laplace smoothing actually gives the optimal estimate, in a certain sense, of the chance of winning the next game — that is, under a certain assumption about the Bayesian prior on the parameter. So it actually seems like a pretty reasonable estimate to me. Although, I should say — no, I'm just being mean; we actually are a pretty good basketball team, but I chose a losing streak because it's funnier that way.

Let's see. Are there other questions about this? No? Okay. So there's more that I want to say about Naive Bayes, but we'll do that in the next lecture. So let's wrap up.