This presentation is delivered by the Stanford Center for Professional Development.

So what I want to do today is talk about a different type of learning algorithm and, in particular, start to talk about generative learning algorithms and a specific algorithm called Gaussian Discriminant Analysis. We'll take a slight digression to talk about Gaussians, I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of Naive Bayes and Laplace smoothing.

So just to motivate our discussion of generative learning algorithms, by way of contrast, the sort of classification algorithms we've been talking about, I think of as algorithms that do this: you're given a training set, and you run an algorithm, say logistic regression, on that training set. The way I think of logistic regression is that it looks at the data and tries to find a straight line to divide the crosses and O's. So it's trying to find a straight line, let me just make the data a bit noisier, that separates out the positive and the negative classes as well as possible. And, in fact, I'll show it on the laptop; maybe just use the screens or the small monitors for this. You can see the data set there with logistic regression, and I've initialized the parameters randomly, so the hypothesis that logistic regression outputs at iteration zero is the straight line shown in the bottom right. After one iteration of gradient descent, the straight line changes a bit; after two iterations, three, four, until logistic regression converges and has found a straight line that, more or less, separates the positive and negative classes, okay? So you can think of this as logistic regression, sort of, searching for a line that separates the positive and the negative classes.
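As a minimal sketch of the iterative search just described (this code is not from the lecture; the synthetic 2D data, the zero initialization, and the learning rate are made-up choices), here is logistic regression fit by batch gradient ascent, with the learned parameters defining the separating line:

```python
import numpy as np

# Minimal sketch: batch gradient ascent on the logistic-regression log-likelihood,
# using a small synthetic 2D dataset (the data here is made up for illustration).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(20, 2))    # the "crosses"
X_neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(20, 2))  # the "O's"
X = np.vstack([X_pos, X_neg])
X = np.hstack([np.ones((X.shape[0], 1)), X])                   # add intercept term
y = np.concatenate([np.ones(20), np.zeros(20)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(3)        # parameters (initialized to zero here rather than randomly)
alpha = 0.1                # learning rate
for it in range(100):      # each iteration nudges the separating line a bit
    h = sigmoid(X @ theta)
    theta += alpha * X.T @ (y - h)   # gradient of the log-likelihood

# theta now defines the line theta0 + theta1*x1 + theta2*x2 = 0 separating the classes.
print(theta)
```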
What I want to do today is talk about an algorithm that does something slightly different, and to motivate it, let's use our old example of classifying tumors as malignant or benign. So a patient comes in with a tumor, and you want to know if it's malignant, a harmful cancer, or benign, meaning a harmless one. Rather than trying to find a straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers, all the positive examples, and build a model for what malignant cancers look like. Then we'll go through our training set again, take all of the examples of benign cancers, and build a model for what benign cancers look like, okay? And then, when you need to classify a new example, when you have a new patient and you want to decide whether the cancer is malignant or benign, you take the new case, match it against your model of malignant cancers, match it against your model of benign cancers, and see which model it matches better; depending on which model it matches better, you then predict whether the new cancer is malignant or benign, okay?

So what I just described, this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers, is called a generative learning algorithm, and let me just, kind of, formalize this. The models we've been talking about previously were actually all discriminative learning algorithms, and stated more formally, a discriminative learning algorithm is one that either learns P(y | x) directly, or learns a hypothesis that outputs values 0 or 1 directly, okay? So logistic regression is an example of a discriminative learning algorithm. In contrast, a generative learning algorithm models P(x | y):
the probability of the features given the class label. As a technical detail, it also models P(y), but that's the less important part. The interpretation is that a generative model builds a probabilistic model of what the features look like, conditioned on the class label, okay? In other words, conditioned on whether a cancer is malignant or benign, it models a probability distribution over what the features of the cancer look like. Then, having built a model for P(x | y) and P(y), by Bayes' rule you can obviously compute P(y = 1 | x). This is just

P(y = 1 \mid x) = \frac{P(x \mid y = 1)\, P(y = 1)}{P(x)},

and, if necessary, you can calculate the denominator using

P(x) = P(x \mid y = 1)\, P(y = 1) + P(x \mid y = 0)\, P(y = 0),

right? And so by modeling P(x | y) and modeling P(y), you can use Bayes' rule to get back to P(y | x), but a generative learning algorithm starts by modeling P(x | y), rather than P(y | x), okay? We'll talk about some of the tradeoffs, and why this may be a better or worse idea than a discriminative model, a bit later.
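Here is a minimal numeric sketch of the Bayes-rule step above (the class-conditional density values and the prior are made up for illustration):

```python
# Minimal sketch of the Bayes-rule step: turn class-conditional densities p(x|y)
# and a class prior p(y) into a posterior p(y=1|x). The numbers here are made up.
p_x_given_y1 = 0.30   # density of the features under the "malignant" model, p(x|y=1)
p_x_given_y0 = 0.05   # density of the features under the "benign" model, p(x|y=0)
p_y1 = 0.2            # class prior p(y=1); p(y=0) = 1 - p_y1

p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)   # denominator p(x)
p_y1_given_x = p_x_given_y1 * p_y1 / p_x                # Bayes' rule
print(p_y1_given_x)   # 0.6: this example "matches" the malignant model better
```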
Let's go through a specific example of a generative learning algorithm, and for this specific motivating example, I'm going to assume that your input features x are in R^n and are continuous values, okay? Under this assumption, let me describe to you a specific algorithm called Gaussian Discriminant Analysis, and the core assumption is that in the Gaussian discriminant analysis model we're going to assume that P(x | y) is Gaussian, okay? So, actually, just raise your hand: how many of you have seen a multivariate Gaussian before, not a 1D Gaussian, but the higher-dimensional one? Okay, cool, maybe half of you, two-thirds of you. So let me just say a few words about Gaussians, and for those of you that have seen it before, it'll be a refresher.

So we say that a random variable Z is distributed multivariate Gaussian, written Z ~ N(μ, Σ), with the script N for normal, with parameters mean μ and covariance matrix Σ, if Z has density

p(z) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right),

okay? That's the formula for the density; it's the generalization of the one-dimensional Gaussian, the familiar bell-shaped curve, to a higher-dimensional, vector-valued random variable Z. Don't worry too much about this formula for the density; you rarely end up needing to use it. The two key quantities are this vector μ, which is the mean of the Gaussian, and this matrix Σ, which is the covariance matrix. And so Σ will be equal to, right, the definition of the covariance of a vector-valued random variable:

\Sigma = E\big[ (Z - \mu)(Z - \mu)^T \big],

okay? And, actually, if this doesn't look familiar to you, you might re-watch the discussion section that the TAs held last Friday, or the one that they'll be holding later this week, on, sort of, a recap of probability, okay?
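A minimal sketch of the density formula above, assuming plain NumPy (the helper name gaussian_density is mine, not from the lecture):

```python
import numpy as np

def gaussian_density(z, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at z (minimal sketch)."""
    n = mu.shape[0]
    diff = z - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Example: standard normal in 2D, evaluated at the mean.
mu = np.zeros(2)
Sigma = np.eye(2)
print(gaussian_density(np.zeros(2), mu, Sigma))   # 1/(2*pi), about 0.159
```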
So the multivariate Gaussian is parameterized by a mean and a covariance, and let me just, can I have the laptop displayed, please? I'll just go ahead and show you graphically the effects of varying the parameters of a Gaussian. So what I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity; the covariance matrix is shown in the upper right-hand corner of the slide, and there's the familiar bell-shaped curve in two dimensions. If I shrink the covariance matrix, instead of the identity, then the Gaussian becomes more peaked, and if I widen the covariance, say to 2 times the identity, then the density becomes more spread out, okay? That was the standard normal: zero mean, identity covariance.

If I increase the off-diagonal entries of the covariance matrix, I make the variables correlated, and the Gaussian becomes flattened out along the x = y direction; increase them even further, and my two variables, Z1 and Z2 on the horizontal axes, become even more correlated. I'll just show the same thing in contours. The standard normal distribution has contours that are actually circles; because of the aspect ratio these look like ellipses, but they should actually be circles. And if you increase the off-diagonal entries of the Gaussian's covariance matrix, the contours become ellipses aligned along the, sort of, 45-degree direction in this example. This is the same thing. Here's an example of a Gaussian density with negative covariances, so now the correlation goes the other way; that's an even stronger negative covariance, and the same thing in contours: a Gaussian with negative off-diagonal entries, and with even larger off-diagonal entries, okay? And the other parameter of the Gaussian is the mean, so starting from μ = 0, as you change the mean parameter, the location of the Gaussian just moves around, okay?
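As a small aside on the covariance demo above (not from the lecture; the off-diagonal values below are arbitrary), sampling from Gaussians with different off-diagonal entries shows how those entries control the correlation between Z1 and Z2:

```python
import numpy as np

# Minimal sketch: the off-diagonal entries of the covariance matrix control how
# correlated the two components Z1, Z2 are (values below are just for illustration).
rng = np.random.default_rng(0)
for off_diag in [0.0, 0.5, 0.8, -0.8]:
    Sigma = np.array([[1.0, off_diag],
                      [off_diag, 1.0]])
    Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)
    corr = np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]
    print(f"off-diagonal {off_diag:+.1f} -> empirical correlation {corr:+.2f}")
```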
All right. So that was a quick primer on what Gaussians look like, and here's a picture to keep in mind as a roadmap when we describe the Gaussian discriminant analysis algorithm; this is what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm, I'm going to look at the positive examples, say the crosses, and, looking only at the positive examples, fit a Gaussian distribution to them, and so maybe I end up with a Gaussian distribution like that, okay? So there's P(x | y = 1). And then I'll look at the negative examples, the O's in this figure, and fit a Gaussian to those, and maybe I get a Gaussian centered over there; that's the contour of my second Gaussian. And together, we'll say how later, these two Gaussian densities will define a separator for these two classes, okay? And it'll turn out that the separator will be a little bit different from what logistic regression gives you: if you run logistic regression, you actually get the decision boundary shown as the green line, whereas Gaussian discriminant analysis gives you the blue line, okay? Switch back to the chalkboard, please.

All right. Here's the Gaussian discriminant analysis model. We're going to model P(y) as a Bernoulli random variable as usual, parameterized by a parameter φ; you've seen this before. And we model P(x | y = 0) as a Gaussian. Oh, you know what? Yes, excuse me, I thought this looked strange: there should be a determinant of Σ raised to the one-half in the denominator there. It's no big deal; I was missing the determinant of Σ to the one-half on a previous board, excuse me. Okay, and so I model P(x | y = 0) as a Gaussian with mean μ0 and covariance Σ, and P(x | y = 1) as a Gaussian with mean μ1 and the same covariance Σ, okay?

And so the parameters of this model are φ, μ0, μ1, and Σ, and I can now write down the likelihood of the parameters, or, excuse me, the log-likelihood of the parameters, as the log of that, right? So, in other words, given the training set, I can write down the log-likelihood of the parameters as the log of the product of the probabilities P(x^(i), y^(i)):

\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} P\big(x^{(i)}, y^{(i)}\big) = \log \prod_{i=1}^{m} P\big(x^{(i)} \mid y^{(i)}\big)\, P\big(y^{(i)}\big),

where each of these terms, P(x^(i) | y^(i)) and P(y^(i)), is given by one of the three equations on top, okay?
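A minimal sketch of this joint log-likelihood, assuming NumPy arrays X (m by n) and y of 0/1 labels; the helper names are mine, not from the lecture:

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """Log of the multivariate Gaussian density N(mu, Sigma) at x."""
    n = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def gda_log_likelihood(X, y, phi, mu0, mu1, Sigma):
    """Joint log-likelihood sum_i [ log P(x_i | y_i) + log P(y_i) ] for the GDA model."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        mu = mu1 if y_i == 1 else mu0
        total += gaussian_log_density(x_i, mu, Sigma)       # log P(x_i | y_i)
        total += np.log(phi if y_i == 1 else 1.0 - phi)     # log P(y_i), Bernoulli(phi)
    return total
```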
And I just want to contrast this again with discriminative learning algorithms, right? So, to give this a name, I guess this is sometimes called the joint data likelihood, the joint likelihood, and let me contrast it with what we had previously when we were talking about logistic regression, where I said the log-likelihood of the parameters θ was

\ell(\theta) = \log \prod_{i=1}^{m} P\big(y^{(i)} \mid x^{(i)}; \theta\big),

right? So back when we were fitting logistic regression models or generalized linear models, we were always modeling P(y^(i) | x^(i)), parameterized by θ, and that was the conditional likelihood, okay, in which we model P(y^(i) | x^(i)), whereas, now, in generative learning algorithms, we're going to look at the joint likelihood, which is P(x^(i), y^(i)), okay?

So let's see. Given the training set, and using the Gaussian discriminant analysis model, to fit the parameters of the model we'll do maximum likelihood estimation as usual, so you maximize ℓ with respect to the parameters φ, μ0, μ1, Σ, and if you find the maximum likelihood estimates of the parameters, you find that φ is, well, the maximum likelihood estimate is actually no surprise, and I'm writing this down mainly as practice with indicator notation. The maximum likelihood estimate for φ is

\phi = \frac{1}{m}\sum_{i=1}^{m} y^{(i)}, \qquad \text{or written alternatively,} \qquad \phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\}.

In other words, the maximum likelihood estimate for the Bernoulli parameter φ is just the fraction of training examples with label one, with y equal to 1. The maximum likelihood estimate for μ0 is this:

\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}},

okay? You should stare at this for a second and see if it makes sense; actually, I'll just write down the analogous one for μ1 while you do that, okay? So what this is: the denominator is the sum over your training set of the indicator that y^(i) = 0.
So for every training example for which y^(i) = 0, this increments the count by one, all right? So the denominator is just the number of examples with label zero, all right? And then the numerator: it's a sum from i = 1 to m, and every time y^(i) is equal to 0 this indicator is a one, and otherwise it's zero, so the indicator function means you're including only the terms for which y^(i) is equal to zero, because for all the terms where y^(i) is equal to one the summand is equal to zero. And then you multiply that by x^(i), and so the numerator is really the sum of the x^(i)'s corresponding to examples where the class label was zero, okay? Raise your hand if this makes sense. Okay, cool.

So just to say this less formally, this just means: look through your training set, find all the examples for which y = 0, and take the average of the value of x over all the examples with y = 0. So take all your negative training examples and average their values of x, and that's μ0, okay? If this notation is still a little bit cryptic, if you're still not sure why this equation translates into what I just said, do go home and stare at it for a while until it just makes sense. This is, sort of, no surprise; it just says that to estimate the mean for the negative examples, you take all your negative examples and average them. So no surprise, but it's useful practice with indicator notation. You can similarly [inaudible] derive the maximum likelihood estimate for Σ; I won't do that here. You can read that in the notes yourself.
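A minimal sketch of these maximum likelihood estimates, assuming NumPy arrays X (m by n) and y of 0/1 labels; the function name fit_gda is mine, and the shared-covariance formula follows the course notes rather than being derived here:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for the GDA parameters (minimal sketch).

    X: (m, n) array of features; y: (m,) array of 0/1 labels.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)                        # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)                 # average x over the negative examples
    mu1 = X[y == 1].mean(axis=0)                 # average x over the positive examples
    mus = np.where((y == 1)[:, None], mu1, mu0)  # mu_{y^(i)} for each example
    diffs = X - mus
    Sigma = diffs.T @ diffs / m                  # shared covariance (see the notes)
    return phi, mu0, mu1, Sigma
```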
And so, having fit the parameters φ, μ0, μ1, and Σ to your data, you now need to make a prediction. You know, when you're given a new value of x, when you're given a new cancer, you need to predict whether it's malignant or benign. Your prediction is then going to be, let's say, the most likely value of y given x; I should really write a semicolon and the parameters there. I'll just write that out: it's the arg max over y of P(y | x), which, by Bayes' rule, is

\arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)},

all right? And that is, in turn, just \arg\max_y P(x \mid y)\, P(y), because the denominator P(x) doesn't depend on y. And if P(y) is uniform, in other words, if each of your classes is equally likely, so if P(y) takes the same value for all values of y, then this is just \arg\max_y P(x \mid y), okay? That happens sometimes, maybe not very often, so usually you end up using the formula where you compute P(x | y) and P(y) using your model, okay?

Student: Can you explain the arg max?

Instructor (Andrew Ng): Oh, let's see. So the arg max means the value of the argument that maximizes (or, for arg min, minimizes) the expression. Just as an example, the min over x of (x - 5)^2 is 0, because by choosing x equal to 5 you can get this to be zero, and the arg min over x of (x - 5)^2 is equal to 5, because 5 is the value of x that minimizes it, okay? Cool. Thanks for asking that. Okay, any other questions about this? Yeah?

Student: Why is it uniform? Why isn't [inaudible]?

Instructor (Andrew Ng): Oh, I see. By uniform I meant, I was being loose here, I meant if P(y = 0) is equal to P(y = 1), that is, if y has the uniform distribution over the set {0, 1}. That's all I mean, see? Anything else? All right. Okay.
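A minimal sketch of this prediction rule, assuming SciPy for the Gaussian density and parameters such as those fit above (the helper name gda_predict is mine, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(x, phi, mu0, mu1, Sigma):
    """Predict y in {0, 1} as arg max_y P(x|y) P(y); the P(x) denominator is
    dropped because it doesn't depend on y (minimal sketch)."""
    score0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma) * (1 - phi)  # P(x|y=0) P(y=0)
    score1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * phi        # P(x|y=1) P(y=1)
    return int(score1 > score0)
```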
So it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression. Let me illustrate that. So let's say you have a training set; actually, let me just go ahead and draw a 1D training set, and that will kind of work, yes, okay. So let's say we have a training set comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis. So I'll fit a Gaussian density to each of these two classes, to my positive and my negative training examples, and maybe my positive examples, the X's, are fit with a Gaussian like this, and my negative examples are fit with a Gaussian that looks like that, okay?

Now, I hope this [inaudible]. Now, let's vary x along the horizontal axis, and what I want to do is overlay on top of this plot a plot of P(y = 1 | x) for a variety of values of x, okay? Actually, I realize what I should have done: I'm going to call the X's the negative examples and the O's the positive examples; it just makes this part come out better.

So let's take a value of x that's fairly small, say x is this value here on the horizontal axis. Then what's the probability of y being equal to one, conditioned on x? Well, the way you calculate that is you write P(y = 1 | x), and then you plug in all these formulas as usual, right? It's P(x | y = 1), which is your Gaussian density, times P(y = 1), which is essentially just going to be equal to φ, and then divided by P(x), and this shows you how you can calculate it. By using these two Gaussians and my φ for P(y), I can actually compute what P(y = 1 | x) is, and in this case, if x is this small, clearly it belongs to the left Gaussian; it's very unlikely to belong to the positive class, and so P(y = 1 | x) will be very small, very close to zero, say, okay? Then we can increment the value of x a bit and look at a different value of x, and plot P(y = 1 | x) again, and, again, it'll be pretty small. Now let's use a point like that, right? At this point the two Gaussian densities have equal value, and if I ask, for the value of x shown by the arrow, what's the probability of y being equal to one? Well, you really can't tell, so maybe it's about 0.5, okay?
And if you fill in a bunch more points, you get a curve like that, and then you can keep going. Let's say for a point like that, you can ask what's the probability of y being one. Well, if x is that far out, then clearly it belongs to the rightmost Gaussian, and so the probability of y being one would be very high, almost one, okay? And so you can repeat this exercise for a bunch of points: compute P(y = 1 | x) for a bunch of values of x, and if you connect up these points, you find that the curve you get takes the form of a sigmoid function, okay?

So, in other words, when you make the assumptions of the Gaussian discriminant analysis model, that P(x | y) is Gaussian, and you go back and compute what P(y = 1 | x) is, you actually get back exactly the same sigmoid functional form that we were using in logistic regression, okay? But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and a different steepness of the sigmoid than logistic regression would. Is there a question?

Student: I'm just wondering, the Gaussian of P of Y [inaudible] you do?

Instructor (Andrew Ng): No, let's see. So this Gaussian is P(x | y = 1), and this Gaussian is P(x | y = 0); does that make sense? Anything else?

Student: When you were drawing all the dots, how did you decide what Y given P of X was?

Instructor (Andrew Ng): What, say that again?

Student: I'm sorry. Could you go over how you figured out where to draw each dot?

Instructor (Andrew Ng): Let's see, okay. So the computation is as follows, right? The steps are: I have the training set, and given my training set, I'm going to fit a Gaussian discriminant analysis model to it, and what that means is I'll build a model for P(x | y = 1), I'll build a model for P(x | y = 0), and I'll also fit a Bernoulli distribution to P(y), okay?
So, in other words, given my training set, I'll fit P(x | y) and P(y) to my data, and now I've chosen my parameters φ, μ0, μ1, and Σ, okay? Then this is the process I went through to plot all these dots: I pick a point on the x axis, and I compute P(y = 1 | x) for that value of x, and P(y = 1 | x) will be some value between zero and one, some real number, and whatever that real number is, I plot it on the vertical axis, okay? And the way I compute P(y = 1 | x) is I use these quantities: I use P(x | y) and P(y) and, sort of, plug them into Bayes' rule, and that allows me to compute P(y | x) from these quantities; does that make sense?

Student: Yeah. And how did you model P of X; is that...

Instructor (Andrew Ng): Oh, okay. Yeah, so, well, we've got this right here. So P(x) can be written as

P(x) = P(x \mid y = 0)\, P(y = 0) + P(x \mid y = 1)\, P(y = 1),

right? And each of these terms, P(x | y) and P(y), is something I can get directly out of my Gaussian discriminant analysis model; each of these terms is something the model gives me directly. So I plug that in as the denominator, and, by doing that, that's how I compute P(y = 1 | x); make sense?

Student: Thank you.

Instructor (Andrew Ng): Okay. Cool.
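Here is a minimal 1D sketch of the curve-plotting exercise just described (the means, variance, and prior below are made up); it reproduces the sigmoid shape numerically:

```python
import numpy as np
from scipy.stats import norm

# Minimal 1D sketch: two Gaussian class-conditionals plus a Bernoulli prior give,
# via Bayes' rule, a sigmoid-shaped curve for P(y=1|x). Parameter values are made up.
mu0, mu1, sigma, phi = -2.0, 2.0, 1.0, 0.5

xs = np.linspace(-6, 6, 13)
p_x_y0 = norm.pdf(xs, loc=mu0, scale=sigma)          # P(x | y = 0)
p_x_y1 = norm.pdf(xs, loc=mu1, scale=sigma)          # P(x | y = 1)
posterior = p_x_y1 * phi / (p_x_y1 * phi + p_x_y0 * (1 - phi))   # P(y = 1 | x)

for x, p in zip(xs, posterior):
    print(f"x = {x:+5.1f}   P(y=1|x) = {p:.3f}")     # rises from ~0 to ~1 like a sigmoid
```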
So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm, okay? So, in the particular case of Gaussian discriminant analysis, we assume that x conditioned on y is Gaussian, and the argument I showed on the previous chalkboard, which I didn't prove formally, you can go home and prove it yourself, is that if you assume x given y is Gaussian, then that implies that the posterior distribution, the form of P(y = 1 | x), is going to be a logistic function, a logistic posterior.

And it turns out this implication in the opposite direction does not hold true, okay? In particular, and this is actually kind of cool, it turns out that if you assume x given y = 1 is Poisson with parameter λ1, and x given y = 0 is Poisson with parameter λ0, then that also implies that P(y | x) is logistic, okay? So there are lots of assumptions on x given y that will lead to P(y | x) being logistic, and, therefore, the assumption that x given y is Gaussian is a stronger assumption than the assumption that y given x is logistic, okay? Because the first implies the second, right? The Gaussian assumption is the stronger one, because the logistic posterior holds whenever x given y is Gaussian, but not vice versa.
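A short sketch of the algebra behind the forward implication, under the shared-covariance GDA assumptions (the lecture states the result but leaves the proof as an exercise):

```latex
% Why shared-covariance Gaussian class-conditionals give a logistic posterior (sketch).
P(y=1 \mid x)
  = \frac{P(x \mid y=1)\,\phi}{P(x \mid y=1)\,\phi + P(x \mid y=0)\,(1-\phi)}
  = \frac{1}{1 + \frac{1-\phi}{\phi}\cdot\frac{P(x \mid y=0)}{P(x \mid y=1)}}.

% Because both class-conditionals share the same \Sigma, the quadratic terms cancel:
\frac{P(x \mid y=0)}{P(x \mid y=1)}
  = \exp\!\Big( (\mu_0-\mu_1)^T \Sigma^{-1} x
      + \tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 - \tfrac{1}{2}\mu_0^T \Sigma^{-1}\mu_0 \Big),

% which is the exponential of a linear function of x, so
P(y=1 \mid x) = \frac{1}{1 + \exp(-\theta^T x - \theta_0)},
\qquad \theta = \Sigma^{-1}(\mu_1 - \mu_0),

% with \theta_0 collecting the constant terms: exactly a logistic function of x.
```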
And so this leads to some of the tradeoffs between Gaussian discriminant analysis and logistic regression, right? Gaussian discriminant analysis makes a much stronger assumption, that x given y is Gaussian, and so when this assumption is true, or approximately holds, if you plot the data and x given y is indeed approximately Gaussian, then if you make this assumption explicit to the algorithm, the algorithm will do better, because it's as if the algorithm is making use of more information about the data; the algorithm knows that the data is Gaussian, right? And so if the Gaussian assumption holds, or roughly holds, then Gaussian discriminant analysis may do better than logistic regression.

If, conversely, you're actually not sure what x given y is, then logistic regression, the discriminative algorithm, may do better. In particular, suppose you use logistic regression, and maybe you [inaudible] the data was Gaussian, but it turns out the data was actually Poisson, right? Then logistic regression will still do perfectly fine, because if the data were actually Poisson, then P(y = 1 | x) will still be logistic, and it'll do perfectly fine; but if you had assumed it was Gaussian, then the algorithm may go off and do something that's not as good, okay?

So it turns out that, right, it's slightly different. It turns out the real advantage of generative learning algorithms is often that they require less data. In particular, data is never really exactly Gaussian, right? Data is often approximately Gaussian; it's never exactly Gaussian. And it turns out generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But the tradeoff is that by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data in order to fit, like, an okay model, even if there's less training data. Whereas, in contrast, logistic regression, by making fewer assumptions, is more robust to your modeling assumptions, because you're making a weaker assumption, you're making fewer assumptions, but sometimes it takes a slightly larger training set to fit than Gaussian discriminant analysis. Question?

Student: In order to [inaudible] any assumption about the number [inaudible], here we assume that P of Y = 1 equals [inaudible]. Is that true when the number of samples is marginal?

Instructor (Andrew Ng): Okay, so let's see. So there's a question of, is this true, what was that? Let me translate that differently. So the modeling assumptions are made independently of the size of your training set, right?
So, like, in least-squares regression, well, in all of these models, I'm assuming that these are random variables drawn from some distribution, and then, finally, I'm given a single training set, and that's used to fit the parameters of the distribution, right?

Student: So what's the probability of Y = 1?

Instructor (Andrew Ng): The probability of Y = 1?

Student: Yeah, you used the...

Instructor (Andrew Ng): Sort of, this, like, goes back to the philosophy of maximum likelihood estimation, right? I'm assuming that P(y) is equal to φ^y (1 - φ)^(1 - y). So I'm assuming that there's some true value of φ generating all my data, and then, well, when I write this, I guess there are really two values of φ. One is the true underlying value of φ that was used to generate the data, and then there's the maximum likelihood estimate of the value of φ, and so when I was writing those formulas earlier, those formulas for φ, and μ0, and μ1 were really the maximum likelihood estimates of φ, μ0, and μ1, and that's different from the true underlying values of φ, μ0, and μ1, but...

Student: [Off mic].

Instructor (Andrew Ng): Yeah, right. So the maximum likelihood estimate comes from the data, and there's some, sort of, true underlying value of φ that I'm trying to estimate, and my maximum likelihood estimate is my attempt to estimate that true value, but, you know, by notational convention, we often just write the same symbol for both without bothering to distinguish between the maximum likelihood value and the true underlying value that I'm assuming is out there and that I'm only hoping to estimate.

Actually, yeah, so for questions like these about maximum likelihood and so on, I hope to use the Friday discussion section as a good time to ask questions about, sort of, probabilistic definitions like these as well. Are there any other questions? No? Great. Okay.

So, great. Oh, it turns out, just to mention one more thing that's, kind of, cool.
I said that if x given y is Poisson, you also get a logistic posterior; it actually turns out there's a more general version of this. If you assume x given y = 1 is exponential family with natural parameter η1, and you assume x given y = 0 is exponential family with natural parameter η0, then this implies that P(y = 1 | x) is also logistic, okay? And that's kind of cool. It means that x given y could be, I don't know, some strange thing: it could be gamma, because we've seen Gaussian, and right next to that, I don't know, gamma, exponential, there's actually a beta; I'm just rattling off my mental list of exponential family distributions. It could be any one of those things, so if you use [inaudible] the same exponential family distribution for the two classes, with different natural parameters, then the posterior P(y = 1 | x) will be logistic, and so this shows the robustness of logistic regression to the choice of modeling assumptions, because it could be that the data was actually, you know, gamma distributed, and the posterior still turns out to be logistic. So that's the robustness of logistic regression to modeling assumptions.

And this is, I think, the second of the two justifications I promised early on for where I pulled the logistic function out of a hat, right? One was the exponential family derivation we went through last time, and this is, sort of, the second one: all of these modeling assumptions also lead to the logistic function. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, whether P(y = 1 | x) being logistic implies the reverse? No, that's also not true, right? Yeah, so this exponential family assumption implies that P(y = 1 | x) is logistic, but the reverse implication is not true. There are actually all sorts of really bizarre distributions for x that would give rise to a logistic function, okay? Okay.

So, that was our first generative learning algorithm. Let me talk about a second generative learning algorithm; this one is called the Naive Bayes algorithm, and the motivating example that I'm going to use will be spam classification. All right.
So let's say that you want to build a spam classifier to take your incoming stream of email and decide if it's spam or not. So, let's see, y will be 0 or 1, with 1 being spam email and 0 being non-spam, and the first decision we need to make is: given a piece of email, how do you represent it using a feature vector x? Right, email is just a piece of text; email is like a list of words, or a list of ASCII characters. So I can represent email as a feature vector x. We'll use a couple of different representations, but the one I'll use today is the following: we will construct the vector x by going through my dictionary and, sort of, making a listing of all the words in my dictionary, okay? So the first word is, say, "a". The second word in my dictionary is "aardvark", then "aardwolf", okay? You know, and somewhere along the way you see the word "buy", as in the spam email telling you to buy stuff. However you collect your list of words, you won't find CS229, the course number, in a dictionary, but if you collect a list of words from other emails you've gotten, you'd have that somewhere in this list as well, and then the last word in my dictionary is, say, "zymurgy", which pertains to the branch of chemistry that deals with the fermentation process in brewing.

So say I get a piece of email, and what I'll do is scan through this list of words, and wherever a certain word appears in my email, I'll put a 1 there. So if a particular email has the word "a", then that entry is a 1. You know, my email doesn't have the words "aardvark" or "aardwolf", so those get zeros. And again, it's a piece of email where they want me to buy something, CS229 doesn't occur, and so on, okay? So this would be one way of creating a feature vector to represent a piece of email.
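A minimal sketch of this feature construction (the toy dictionary and email below are made up; a real dictionary would have tens of thousands of words):

```python
# Minimal sketch of the binary "bag of words" feature vector described above.
dictionary = ["a", "aardvark", "aardwolf", "buy", "cs229", "zymurgy"]

def email_to_features(email_text):
    words_in_email = set(email_text.lower().split())
    # x_i = 1 if dictionary word i appears in the email, else 0
    return [1 if word in words_in_email else 0 for word in dictionary]

print(email_to_features("Buy now and get a great deal"))   # [1, 0, 0, 1, 0, 0]
```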
And my feature vectors are going to be in {0, 1} to the n — bit vectors, binary-valued vectors, n-dimensional — where n may be on the order of, say, 50,000, if you have 50,000 words in your dictionary, which is not atypical. Vocabulary sizes from, I don't know, the mid-thousands to the tens of thousands are very typical for problems like these. And, therefore, there are 2 to the 50,000 possible values for X, right? Two to the 50,000 possible bit vectors of length 50,000. So one way to model this is with a multinomial distribution, but because there are 2 to the 50,000 possible values for X, I would need 2 to the 50,000 minus 1 parameters — the minus 1 because the probabilities have to sum to 1 — and this is clearly way too many parameters to model using a multinomial distribution over all 2 to the 50,000 possibilities.

So in the Naive Bayes algorithm, we're going to make a very strong assumption on P of X given Y, and, in particular — let me just say what it's called, then I'll write out what it means — I'm going to assume that the X_i's are conditionally independent given Y, okay? Let me say what this means. So I have P of X1, X2, up to X50,000, given Y. By the chain rule of probability, this is P of X1 given Y, times P of X2 given Y, X1, times dot, dot, dot, up to P of X50,000 given Y, X1, through X49,999, okay? So this is just the chain rule of probability; this always holds, and I've not made any assumption yet. And now we're going to make what's called the Naive Bayes assumption — the assumption that the X_i's are conditionally independent given Y. Nothing changes for the first term, but I'm going to assume that P of X2 given Y, X1 is just equal to P of X2 given Y.
I'm going to assume that the next term, P of X3 given Y, X1, X2, is equal to P of X3 given Y, and so on, up to P of X50,000 given Y, okay? Or, written more compactly, this means we assume that P of X1 through X50,000 given Y is the product from i = 1 to 50,000 of P of X_i given Y, okay?

Stating informally what this means: I'm assuming that, so long as you know the class label Y — so long as you know whether this is spam or not spam — then knowing whether the word a appears in the email does not affect the probability of whether the word aardwolf appears in the email, all right? In other words, once you know whether an email is spam or not spam, knowing whether certain words appear in the email won't help you predict whether any other word appears in the email, okay? And, obviously, this assumption is false, right? This assumption can't possibly be true. I mean, if you see the word — I don't know, CS229 — in an email, you're much more likely to see my name in the email, or the TAs' names, or whatever. So this assumption is normally just false for normal written English. But it turns out that, despite this assumption being false in the literal sense, the Naive Bayes algorithm is an extremely effective algorithm for classifying text documents into spam or not spam, for automatically sorting your emails into different folders, for looking at web pages and classifying whether a webpage is trying to sell something, or whatever. It turns out this assumption works very well for classifying text documents, and for other applications too that I'll talk a bit about later.

As a digression that'll make sense only to some of you:
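Written out, here are the chain-rule factorization and the Naive Bayes assumption that replaces it; this is just the board content in symbols.

```latex
\begin{aligned}
p(x_1,\dots,x_{50000} \mid y)
 &= p(x_1 \mid y)\, p(x_2 \mid y, x_1)\,\cdots\, p(x_{50000} \mid y, x_1,\dots,x_{49999})
 &&\text{(chain rule -- always true)}\\
 &= \prod_{i=1}^{50000} p(x_i \mid y)
 &&\text{(Naive Bayes assumption).}
\end{aligned}
```

The parameter-counting payoff: an unrestricted multinomial over {0,1}^n needs 2^n − 1 parameters per class, while the Naive Bayes model needs only n per class plus one for the class prior — 2n + 1, about 100,001 parameters when n = 50,000 (these are the phi parameters defined in a moment).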
Let me just say that if you're familiar with Bayesian networks — with graphical models — the Bayesian network associated with this model looks like this: you have the random variable Y, which then generates X1, X2, through X50,000, okay? If you've not seen a Bayes net before, if you don't know graphical models, just ignore this; it's not important for our purposes. But if you've seen them before, that's what it would look like. Okay.

So the parameters of the model are as follows: phi_{i|y=1}, which is the probability that X_i = 1 given Y = 1; phi_{i|y=0}, which is the probability that X_i = 1 given Y = 0; and phi_y, the probability that Y = 1, okay? So these are the parameters of the model, and therefore, to fit the parameters, you can write down the joint likelihood of the parameters, which is equal to the product over the training examples of P of X^(i), Y^(i), as usual, okay? So given the training set, you can write down the joint likelihood of the parameters, and when you do maximum likelihood estimation, you find that the maximum likelihood estimates of the parameters are really pretty much what you'd expect. The maximum likelihood estimate for phi_{j|y=1} is the sum from i = 1 to m of the indicator that X_j^(i) = 1 and Y^(i) = 1, divided by the sum from i = 1 to m of the indicator that Y^(i) = 1, okay?

Stated more simply, the numerator just says: run through your entire training set of m examples and count up the number of times you saw word j in a piece of email for which the label Y was equal to 1. In other words, look through all your spam emails and count the number of emails in which word j appeared. And the denominator — the sum from i = 1 to m of the indicator Y^(i) = 1 — is just the number of spam emails you got.
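Since the joint likelihood and the full maximum likelihood formulas were mostly written on the board, here they are in symbols; the superscript (i) indexes the m training examples and n is the vocabulary size.

```latex
\begin{aligned}
\mathcal{L}\big(\phi_y,\ \phi_{j|y=0},\ \phi_{j|y=1}\big)
 &= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
  = \prod_{i=1}^{m} p\big(y^{(i)}\big)\prod_{j=1}^{n} p\big(x_j^{(i)} \mid y^{(i)}\big),\\[4pt]
\phi_{j|y=1} &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}},\qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}},\qquad
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}{m}.
\end{aligned}
```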
And so this ratio is: of all the spam emails in your training set, in what fraction of them did word j — the j-th word you wrote in your dictionary — appear? And that's the maximum likelihood estimate for the probability of seeing word j conditioned on the piece of email being spam, okay? And similarly, the maximum likelihood estimates for phi_{j|y=0} and for phi_y are pretty much what you'd expect, right? Okay?

And so, having estimated all these parameters, when you're given a new piece of email that you want to classify, you can then compute P of Y given X using Bayes' rule, right? Same as before, because together these parameters give you a model for P of X given Y and for P of Y, and by using Bayes' rule, given those two terms, you can compute P of Y given X — and there's your spam classifier, okay? It turns out we need one more elaboration to this idea, but let me check if there are questions about this so far.
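Before the questions, here is a minimal sketch of that Bayes'-rule prediction step in code, assuming the three sets of parameters have already been estimated as above; the function and variable names are my own.

```python
def predict_spam_probability(x, phi_y, phi_given_spam, phi_given_ham):
    """Compute p(y=1 | x) by Bayes' rule for a binary feature vector x.

    phi_y            : estimate of p(y=1)
    phi_given_spam[j]: estimate of p(x_j=1 | y=1)
    phi_given_ham[j] : estimate of p(x_j=1 | y=0)
    """
    p_x_given_spam, p_x_given_ham = 1.0, 1.0
    for j, xj in enumerate(x):
        # Naive Bayes: multiply the per-word Bernoulli probabilities.
        p_x_given_spam *= phi_given_spam[j] if xj else (1.0 - phi_given_spam[j])
        p_x_given_ham  *= phi_given_ham[j]  if xj else (1.0 - phi_given_ham[j])
    numerator = p_x_given_spam * phi_y
    denominator = numerator + p_x_given_ham * (1.0 - phi_y)
    # Note: if some parameter is exactly 0, this can become 0/0 -- the problem
    # addressed later in the lecture by Laplace smoothing.
    return numerator / denominator
```

In practice, with tens of thousands of features, you would sum log-probabilities rather than multiply probabilities, to avoid numerical underflow.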
Student: So does this model depend on the number of inputs? Instructor (Andrew Ng): What do you mean by number of inputs — the number of features? Student: No, the number of samples. Instructor (Andrew Ng): Well, m is the number of training examples, so given m training examples, this is the formula for the maximum likelihood estimate of the parameters, right? Other questions — does that make sense? M is the number of training examples, so when you have m training examples, you plug them into this formula, and that's how you compute the maximum likelihood estimates. Student: By training examples, you mean m is the number of emails? Instructor (Andrew Ng): Yeah, right. So it's your training set. I would go through all the email I've gotten in the last two months and label each one as spam or not spam, and so you have, I don't know, a few hundred emails labeled as spam or not spam, and that will comprise your training set X1, Y1 through Xm, Ym, where each X is one of those vectors representing which words appeared in the email and Y is 0 or 1 depending on whether it's spam or not spam, okay?

Student: So you're saying that this model depends on the number of examples, but the previous model didn't — yet your phi is the same for either one? Instructor (Andrew Ng): They're different things, right? There's the model, which is the modeling assumptions being made — I'm making the Naive Bayes assumption, so the probabilistic model is an assumption on the joint distribution of X and Y; that's what the model is. And then I'm given a fixed number of training examples — m training examples — and after I'm given the training set, I go in and write down the maximum likelihood estimates of the parameters, right? So — maybe we should take that offline. Yeah, question?

Student: Then how would you do this, like, if this [inaudible] didn't work? Instructor (Andrew Ng): Say that again. Student: How would you do it, say, with the 50,000 words? Instructor (Andrew Ng): Oh, okay — how to come up with the list of 50,000 words, yeah. It turns out this is a very practical question, really: how do I come up with this list of words? In practice, one common way to do it is to just take all the words that appear in your training set — go through all your emails and take the union of all the words that appear in them. That's one fairly common way to do it. Or, if that turns out to be too many words, you can take all the words that appear at least three times in your training set, so words that you didn't even see three times in the emails you got in the last two months, you discard. I was talking about going through a dictionary, which is a nice way of thinking about it, but in practice you might just go through your training set and take the union of all the words that appear in it. In some of the texts, by the way, you'll see this described as selecting your features, but this is one way to think about creating your feature vector as zero-and-one values, okay?
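Here is a small sketch of the vocabulary construction just described — taking the union of words across the training emails and optionally keeping only words that appear in at least three of them; the threshold of three is the one mentioned above, and the helper name is my own.

```python
from collections import Counter

def build_vocabulary(training_emails, min_count=3):
    """Take the union of words across training emails, optionally dropping rare words.

    training_emails: list of email strings
    min_count      : keep only words appearing in at least this many emails
    """
    document_counts = Counter()
    for email_text in training_emails:
        # Count each word at most once per email (document frequency).
        document_counts.update(set(email_text.lower().split()))
    return sorted(word for word, count in document_counts.items() if count >= min_count)
```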
Moving on — yeah, okay, question? Student: I'm getting kind of confused about how you compute all those parameters. Instructor (Andrew Ng): On how I came up with the parameters? Student: Correct. Instructor (Andrew Ng): Let's see. So in Naive Bayes — the question was how I came up with the parameters, right? In Naive Bayes, I need to build a model for P of X given Y and for P of Y, right? In generative learning algorithms, I need to come up with models for these. So how do I model P of Y? Well, I just chose to model it using a Bernoulli distribution, and so P of Y will be parameterized by phi_y, all right? Student: Okay. Instructor (Andrew Ng): And then how do I model P of X given Y? Well — let me switch pens — my model for P of X given Y, under the Naive Bayes assumption, is that P of X given Y is the product of these per-word probabilities, and so I'm going to need parameters to tell me the probability of each word occurring or not occurring, conditioned on the email being spam or not spam, okay?

Student: How is that Bernoulli? Instructor (Andrew Ng): Oh, because each X_i is either zero or one, right? By the way I defined the feature vectors, X_i is either one or zero depending on whether word i appears in the email. So, by construction, X_i is always zero or one, and if X_i is either zero or one, then it has to follow a Bernoulli distribution, right? If X_i were continuous, then you might model it as a Gaussian, the way we did in Gaussian discriminant analysis. It's just that the way I constructed my features for email, X_i is always binary valued, and so you end up with a Bernoulli here, okay? All right, I should move on.

So it turns out that this idea almost works. Now, here's the problem. Let's say you complete this class and you start to do, maybe, the class project, and you keep working on your class project for a bit, and it becomes really good, and you want to submit your class project to a conference, right? So, you know, around June every year is the submission deadline for the NIPS conference — it's just the name of a conference; it's an acronym.
And so maybe you and your project partners, or senior friends even, say, Hey, let's work on a project and submit it to the NIPS conference. And so you're getting these emails with the word NIPS in them, which you've probably never seen before, and a piece of email comes from your project partner saying, Let's send a paper to the NIPS conference. And then your spam classifier will say — let's say NIPS is the 30,000th word in your dictionary, okay? — P of X30,000 = 1 given Y = 1 is equal to 0. That's the maximum likelihood estimate, right? Because you've never seen the word NIPS before in your training set, the maximum likelihood estimate of the probability of seeing the word NIPS is zero, and, similarly, in non-spam mail, the chance of seeing the word NIPS is also estimated as zero.

So when your spam classifier goes to compute P of Y = 1 given X, it will compute, by Bayes' rule, P of X given Y = 1 times P of Y = 1, divided by P of X given Y = 1 times P of Y = 1 plus P of X given Y = 0 times P of Y = 0, all right? And if you look at those terms — say the product from i = 1 to 50,000 of P of X_i given Y — one of those probabilities will be equal to zero, because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a zero in this product, and so the numerator is zero, and in the same way it turns out the denominator will also be zero — actually, all of these terms end up being zero — so you end up with P of Y = 1 given X equal to 0 over 0 + 0, okay, which is undefined. And the problem with this is that it's just statistically a bad idea to say that P of X30,000 = 1 given Y is 0, right? Just because you haven't seen the word NIPS in your last two months' worth of email, it's statistically not sound to say that, therefore, the chance of ever seeing this word is zero, right?
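Symbolically, the failure looks like this; this writes out the Bayes'-rule computation just described, reconstructing the board content.

```latex
\begin{aligned}
p(y=1 \mid x) &= \frac{\Big(\prod_{i=1}^{50000} p(x_i \mid y=1)\Big)\, p(y=1)}
                     {\Big(\prod_{i=1}^{50000} p(x_i \mid y=1)\Big) p(y=1)
                      + \Big(\prod_{i=1}^{50000} p(x_i \mid y=0)\Big) p(y=0)}\\[4pt]
&= \frac{0}{0 + 0}\qquad\text{since } p(x_{30000}=1 \mid y=1) = p(x_{30000}=1 \mid y=0) = 0.
\end{aligned}
```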
And so the point is that just because you haven't seen something before, that may mean the event is unlikely, but it doesn't mean it's impossible — and saying that because you've never seen the word NIPS before, it's impossible to ever see the word NIPS in future emails, that the chance of that is just zero, is wrong. So we're going to fix this, and to motivate the fix, here's the example I'm going to use. Let's say that you've been following the Stanford basketball team through all of their away games, and have been tracking their wins and losses to gather statistics and, maybe — I don't know — form a betting pool about whether they're likely to win or lose the next game, okay? So these are some of the statistics. On, I guess, the 8th of February last season they played Washington State, and they did not win. On the 11th of February they played Washington, on the 22nd they played USC, then they played UCLA, then they played USC again — and they didn't win any of those away games. And now you want to estimate the chance that they'll win or lose their next away game, against Louisville, right? So they played five away games and didn't win any of them, but it seems awfully harsh to say that there's zero chance that they'll win the next game.

So here's the idea behind Laplace smoothing. We're estimating the probability of Y being equal to one, right? Normally, the maximum likelihood estimate is the number of ones divided by the number of zeros plus the number of ones, okay? I hope this informal notation makes sense: the maximum likelihood estimate for a win-or-loss Bernoulli random variable is just the number of ones you saw divided by the total number of examples, which is the number of zeros you saw plus the number of ones you saw.
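In symbols — with z^(i) denoting the outcome of the i-th away game (1 = win), which is notation I'm introducing here — the maximum likelihood estimate just described is:

```latex
\phi_{\mathrm{MLE}} \;=\; \frac{\sum_{i=1}^{m} 1\{z^{(i)}=1\}}{m}
\;=\; \frac{\#\text{ wins}}{\#\text{ wins} + \#\text{ losses}}
\;=\; \frac{0}{5} \;=\; 0
\qquad\text{for the five away games above.}
```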
So in Laplace smoothing, we're going to take each of these counts — the number of ones and the number of zeros — and add one to each of them. And so in our example, instead of estimating the probability of winning the next game as 0 over 5 + 0, we'll add one to all of these counts, and we say that the chance of their winning the next game is 1/7, okay? Which says that, having seen them lose five away games in a row, we don't think it's terribly likely they'll win the next game, but at least we're not saying it's impossible.

As a historical side note, Laplace actually came up with this method — it's called Laplace smoothing after him — when he was trying to estimate the probability that the sun will rise tomorrow. His rationale was that even though we've seen the sun rise on a great many days, that doesn't mean we can be absolutely certain the sun will rise tomorrow, and he used this method to estimate that probability. It's kind of cool.

More generally, if Y takes on k possible values — if you're trying to estimate the parameters of a multinomial — then the maximum likelihood estimate of P of Y = j is the sum from i = 1 to m of the indicator that Y^(i) = j, divided by m, right? That's the maximum likelihood estimate of the multinomial probability of Y being equal to j. And when you apply Laplace smoothing to that, you add one to the numerator and add k to the denominator, if Y can take on k possible values, okay? So for Naive Bayes, what that gives us is the following.
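Since the final formulas went up on the board, here is a reconstruction in symbols, based on the add-one-to-the-numerator, add-two-to-the-denominator description that follows: the general Laplace-smoothed multinomial estimate, and what it gives for the Naive Bayes word parameters (each x_j is binary, so k = 2).

```latex
\begin{aligned}
\text{Laplace-smoothed multinomial:}\quad
p(y=j) &= \frac{1 + \sum_{i=1}^{m} 1\{y^{(i)}=j\}}{k + m},\qquad j = 1,\dots,k,\\[6pt]
\text{For Naive Bayes:}\quad
\phi_{j|y=1} &= \frac{1 + \sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=1\}}{2 + \sum_{i=1}^{m} 1\{y^{(i)}=1\}},\qquad
\phi_{j|y=0} = \frac{1 + \sum_{i=1}^{m} 1\{x_j^{(i)}=1,\; y^{(i)}=0\}}{2 + \sum_{i=1}^{m} 1\{y^{(i)}=0\}}.
\end{aligned}
```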
So that was the maximum likelihood estimate, and what you end up doing is adding one to the numerator and adding two to the denominator, and this solves the problem of the zero probabilities: when your friend sends you email about the NIPS conference, your spam filter will still be able to make a meaningful prediction, all right? Okay. Any questions about this? Yeah?

Student: That doesn't quite make sense, because, for instance, if you take the games on the board, it's a reasonable assumption that the probability of winning is very close to zero, so the prediction should essentially be zero. Instructor (Andrew Ng): Right. I would say that in this case the prediction is 1/7, right? If you see somebody lose five games in a row, you may not have a lot of faith in them, but as an extreme example, suppose you saw them lose just one game, right? It's just not reasonable to say that the chance of winning the next game is zero — but that's what the maximum likelihood estimate would say. Student: Yes. Instructor (Andrew Ng): And — Student: In such a case would the learning algorithm [inaudible] or — Instructor (Andrew Ng): So the question is, given just five training examples, what's a reasonable estimate for the chance of winning the next game? And 1/7, I think, is actually pretty reasonable — it's less than 1/5, for instance; we're saying the chance of winning the next game is less than 1/5. It turns out that, under a certain set of Bayesian assumptions about the prior and posterior that I won't go into, this Laplace smoothing actually gives the optimal estimate, in a certain sense I won't go into, of the chance of winning the next game. So under a certain assumption about the Bayesian prior on the parameter, it actually seems like a pretty reasonable estimate to me. Although, I should say — no, I'm just being mean; we actually are a pretty good basketball team, but I chose a losing streak because it's funnier that way. Let's see. Are there other questions about this? No? Okay.
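Putting the pieces of the lecture together, here is a compact, self-contained sketch of fitting Bernoulli Naive Bayes with Laplace smoothing, following the estimates above; the function and variable names are my own, it favors clarity over efficiency, and phi_y is left unsmoothed, as in the lecture.

```python
def fit_naive_bayes(X, y):
    """Fit Bernoulli Naive Bayes with Laplace smoothing.

    X: list of binary feature vectors (lists of 0/1), one per training email
    y: list of labels, 1 = spam, 0 = non-spam
    Returns (phi_y, phi_spam, phi_ham) in the lecture's notation.
    """
    m, n = len(X), len(X[0])
    num_spam = sum(y)
    num_ham = m - num_spam
    phi_y = num_spam / m
    # Laplace smoothing: add 1 to each word count, add 2 to each class count,
    # so no estimated word probability is ever exactly 0 or 1.
    phi_spam = [(1 + sum(X[i][j] for i in range(m) if y[i] == 1)) / (2 + num_spam)
                for j in range(n)]
    phi_ham = [(1 + sum(X[i][j] for i in range(m) if y[i] == 0)) / (2 + num_ham)
               for j in range(n)]
    return phi_y, phi_spam, phi_ham
```

The returned parameters can be plugged directly into the `predict_spam_probability` sketch shown earlier; with the smoothed estimates, the 0/0 failure on previously unseen words can no longer occur.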
So there's more that I want to say about Naive Bayes, but we'll do that in the next lecture. So let's wrap up here.