Lecture 5 | Machine Learning (Stanford)
-
0:10 - 0:12
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:25 - 0:30So what I want to do today is talk about a different type of learning algorithm, and, in particular,
-
0:30 - 0:33start to talk about generative learning algorithms
-
0:33 - 0:38and the specific algorithm called Gaussian Discriminant Analysis.
-
0:38 - 0:43Take a slight digression, talk about Gaussians, and I'll briefly discuss
-
0:43 - 0:46generative versus discriminative learning algorithms,
-
0:46 - 0:50and then hopefully wrap up today's lecture with a discussion of Naive Bayes and
-
0:50 - 0:52Laplace smoothing.
-
0:52 - 0:55So just to motivate our
-
0:55 - 0:59discussion on generative learning algorithms, right, so by way of contrast,
-
0:59 - 1:02the sort of classification algorithms we've been talking about,
-
1:02 - 1:09I think of as algorithms that do this. So you're given a training set,
-
1:09 - 1:12and
-
1:12 - 1:15you run an algorithm like, let's say, logistic regression on those training sets.
-
1:15 - 1:20The way I think of logistic regression is that it's trying to find - look at the data and it's
-
1:20 - 1:24trying to find a straight line to divide the crosses and O's, right? So it's, sort of,
-
1:24 - 1:27trying to find a straight line. Let
-
1:27 - 1:29me - just make the data a bit noisier.
-
1:29 - 1:33Trying to find a straight line
-
1:33 - 1:34that separates out
-
1:34 - 1:39the positive and the negative classes as well as possible, right? And,
-
1:39 - 1:43in fact, let me show it on the laptop. Maybe just use the screens or the small
-
1:43 - 1:46monitors for this.
-
1:46 - 1:47In fact,
-
1:47 - 1:50you can see there's the data
-
1:50 - 1:51set with
-
1:51 - 1:53logistic regression,
-
1:53 - 1:54and so
-
1:54 - 1:58I've initialized the parameters randomly, and so logistic regression is, kind
-
1:58 - 1:59of, outputting - it's
-
1:59 - 2:02the, kind of, hypothesis that
-
2:02 - 2:05iteration zero is that straight line shown in the bottom right.
-
2:05 - 2:09And so after one iteration of gradient descent, the straight line changes a bit.
-
2:09 - 2:11After two iterations, three,
-
2:11 - 2:12four,
-
2:12 - 2:15until logistic regression converges
-
2:15 - 2:19and has found the straight line that, more or less, separates the positive and negative class, okay? So
-
2:19 - 2:20you can think of this
-
2:20 - 2:21as logistic regression,
-
2:21 - 2:28sort of, searching for a line that separates the positive and the negative classes.
-
2:28 - 2:31What I want to do today is talk about an algorithm that does something slightly
-
2:31 - 2:33different,
-
2:33 - 2:36and to motivate us, let's use our old example of trying to classify
-
2:36 - 2:41between malignant cancer and benign cancer, right? So a patient comes in
-
2:41 - 2:45and they have a cancer, you want to know if it's malignant, meaning a harmful cancer,
-
2:45 - 2:48or if it's a benign, meaning a harmless cancer.
-
2:48 - 2:52So rather than trying to find the straight line to separate the two classes, here's something else
-
2:52 - 2:54we could do.
-
2:54 - 2:56We can go from our training set
-
2:56 - 3:00and look at all the cases of malignant cancers, go through, you know, look through our training set for all the
-
3:00 - 3:04positive examples of malignant cancers,
-
3:04 - 3:09and we can then build a model for what malignant cancer looks like.
-
3:09 - 3:13Then we'll go through our training set again and take out all of the examples of benign cancers,
-
3:13 - 3:18and then we'll build a model for what benign cancers look like, okay?
-
3:18 - 3:19And then
-
3:19 - 3:20when you need to
-
3:20 - 3:25classify a new example, when you have a new patient, and you want to decide is this cancer malignant
-
3:25 - 3:26or benign,
-
3:26 - 3:28you then take your new cancer, and you
-
3:28 - 3:31match it to your model of malignant cancers,
-
3:31 - 3:36and you match it to your model of benign cancers, and you see which model it matches better, and
-
3:36 - 3:39depending on which model it matches better to, you then
-
3:39 - 3:44predict whether the new cancer is malignant or benign,
-
3:44 - 3:46okay?
-
3:46 - 3:47So
-
3:47 - 3:50what I just described, just this class
-
3:50 - 3:53of methods where you build a separate model for malignant cancers and
-
3:53 - 3:55a separate model for benign cancers
-
3:55 - 3:58is called a generative learning algorithm,
-
3:58 - 4:01and let me just, kind of, formalize this.
-
4:01 - 4:03
-
4:03 - 4:05So in the models that we've
-
4:05 - 4:08been talking about previously, those were actually
-
4:08 - 4:14all discriminative learning algorithms,
-
4:14 - 4:19and stated more formally, a discriminative learning algorithm is one
-
4:19 - 4:23that either learns P of Y
-
4:23 - 4:25given X directly,
-
4:25 - 4:27
-
4:27 - 4:30or even
-
4:30 - 4:33learns a hypothesis
-
4:33 - 4:40that
-
4:40 - 4:43outputs values 0 or 1 directly,
-
4:43 - 4:47okay? So logistic regression is an example of
-
4:47 - 4:49a discriminative learning algorithm.
-
4:49 - 4:56In contrast, a generative learning algorithm
-
4:58 - 5:00models P of X given Y.
-
5:00 - 5:03The probability of the features given the class label,
-
5:03 - 5:04
-
5:04 - 5:11and as a technical detail, it also models P of Y, but that's a less important thing, and the
-
5:12 - 5:15interpretation of this is that a generative model
-
5:15 - 5:17builds a probabilistic model for
-
5:17 - 5:22what the features look like,
-
5:22 - 5:27conditioned on the
-
5:27 - 5:31class label, okay? In other words, conditioned on whether a cancer is
-
5:31 - 5:35malignant or benign, it models a probability distribution over what the features
-
5:35 - 5:37of the cancer look like.
-
5:37 - 5:40Then having built this model " having built a model for P of X
-
5:40 - 5:42given Y and P of Y,
-
5:42 - 5:46then by Bayes rule, obviously, you can compute P of Y = 1,
-
5:46 - 5:47conditioned on X.
-
5:47 - 5:49This is just
-
5:49 - 5:51P of X
-
5:51 - 5:53given Y = 1
-
5:53 - 5:56times P of Y = 1
-
5:56 - 5:59divided by P of X,
-
5:59 - 6:01and, if necessary,
-
6:01 - 6:06you can calculate the denominator
-
6:06 - 6:13using this, right?
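(For reference, the Bayes rule computation being described here, with the denominator P of X expanded, is:)

```latex
P(y=1 \mid x) = \frac{P(x \mid y=1)\,P(y=1)}{P(x)},
\qquad
P(x) = P(x \mid y=1)\,P(y=1) + P(x \mid y=0)\,P(y=0).
```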
-
6:19 - 6:20
-
6:20 - 6:23And so by modeling P of X given Y
-
6:23 - 6:26and modeling P of Y, you can actually use Bayes rule to get back to P of Y given
-
6:26 - 6:28X,
-
6:28 - 6:30but a generative model -
-
6:30 - 6:33generative learning algorithm starts by modeling P of X given Y, rather than P of Y
-
6:33 - 6:35given X, okay?
-
6:35 - 6:38We'll talk about some of the tradeoffs, and why this may be a
-
6:38 - 6:42better or worse idea than a discriminative model a bit later.
-
6:42 - 6:47Let's go for a specific example of a generative learning algorithm,
-
6:47 - 6:48and for
-
6:48 - 6:51this specific motivating example, I'm
-
6:51 - 6:54going to assume
-
6:54 - 6:56that your input features X
-
6:56 - 6:58are in RN
-
6:58 - 7:00
-
7:00 - 7:03and are
-
7:03 - 7:06continuous-valued, okay?
-
7:06 - 7:08And under this assumption, let
-
7:08 - 7:15me describe to you a specific algorithm called Gaussian Discriminant Analysis,
-
7:20 - 7:23
-
7:23 - 7:26and the, I
-
7:26 - 7:30guess, core assumption is that we're going to assume in the Gaussian discriminant analysis
-
7:30 - 7:32model, that P of X given Y
-
7:32 - 7:39is Gaussian, okay? So
-
7:40 - 7:43actually just raise your hand, how many of you have seen a multivariate Gaussian before -
-
7:43 - 7:46not a 1D Gaussian, but the higher-dimensional version?
-
7:46 - 7:51Okay, cool, like maybe half of you, two-thirds of
-
7:51 - 7:55you. So let me just say a few words about Gaussians, and for those of you that have seen
-
7:55 - 7:57it before, it'll be a refresher.
-
7:57 - 8:03So we say that a random variable Z is distributed Gaussian - multivariate Gaussian - and the
-
8:03 - 8:05script N is for normal
-
8:05 - 8:08with
-
8:08 - 8:13parameters mean mu and covariance matrix sigma, if
-
8:13 - 8:17Z has a density 1
-
8:17 - 8:20over 2 Pi to the n over 2, determinant of sigma
-
8:20 - 8:24to the one-half, times e to the minus one-half, Z minus mu transpose, sigma inverse, Z minus mu,
-
8:24 - 8:31okay?
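(For reference, the density being written on the board - the standard multivariate normal density in n dimensions - is:)

```latex
Z \sim \mathcal{N}(\mu, \Sigma), \qquad
p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(z-\mu)^{T}\Sigma^{-1}(z-\mu)\Big).
```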
-
8:31 - 8:36That's the formula for the density as a generalization of the one-dimensional Gaussian, you know,
-
8:36 - 8:38the more familiar bell-shaped curve. It's
-
8:38 - 8:42a high-dimensional, vector-valued random variable
-
8:42 - 8:43Z.
-
8:43 - 8:47Don't worry too much about this formula for the density. You rarely end up needing to
-
8:47 - 8:48use it,
-
8:48 - 8:52but the two key quantities are this
-
8:52 - 8:55vector mu is the mean of the Gaussian and this
-
8:55 - 8:57matrix sigma is
-
8:57 - 9:00the covariance matrix -
-
9:00 - 9:02covariance,
-
9:02 - 9:05and so
-
9:05 - 9:07sigma will be equal to,
-
9:07 - 9:12right, the definition of the covariance of a vector-valued random variable
-
9:12 - 9:14is the expected value of Z minus mu, times Z minus mu transpose,
-
9:14 - 9:17okay?
-
9:17 - 9:20And, actually, if this
-
9:20 - 9:23doesn't look familiar to you,
-
9:23 - 9:26you might re-watch the
-
9:26 - 9:29discussion section that the TAs held last Friday
-
9:29 - 9:34or the one that they'll be holding later this week on, sort of, a recap of
-
9:34 - 9:34probability, okay?
-
9:34 - 9:38So
-
9:38 - 9:41a multivariate Gaussian is parameterized by a mean and a covariance, and let me just -
-
9:41 - 9:42can I
-
9:42 - 9:46have the laptop displayed, please?
-
9:46 - 9:49I'll just go ahead and actually show you,
-
9:49 - 9:52you know, graphically, the effects of
-
9:52 - 9:54varying a Gaussian -
-
9:54 - 9:56varying the parameters of a Gaussian.
-
9:56 - 9:59So what I have up here
-
9:59 - 10:03is the density of a zero mean Gaussian with
-
10:03 - 10:06covariance matrix equals the identity. The covariance matrix is shown in the upper right-hand
-
10:06 - 10:08corner of the slide, and
-
10:08 - 10:12there's the familiar bell-shaped curve in two dimensions.
-
10:12 - 10:17And so if I shrink the covariance matrix, instead of covariance your identity, if
-
10:17 - 10:22I shrink the covariance matrix, then the Gaussian becomes more peaked,
-
10:22 - 10:25and if I widen the covariance, so like sigma equals 2 times the identity,
-
10:25 - 10:26then
-
10:26 - 10:31the distribution - well, the density becomes more spread out, okay?
-
10:31 - 10:33That was the standard normal, identity
-
10:33 - 10:35covariance one.
-
10:35 - 10:37If I increase
-
10:37 - 10:39the off-diagonals of the covariance matrix, right,
-
10:39 - 10:44I make the variables correlated, and the
-
10:44 - 10:45Gaussian becomes flattened out along this X = Y direction, and
-
10:45 - 10:47increase it even further,
-
10:47 - 10:52then my variables, X and Y - excuse me, Z1 and Z2
-
10:52 - 10:57are my two variables on the horizontal axes - they become even more correlated. I'll just show the same thing in
-
10:57 - 10:59contours.
-
10:59 - 11:03The standard normal distribution has contours that are - they're actually
-
11:03 - 11:07circles. Because of the aspect ratio, these look like ellipses.
-
11:07 - 11:09These should actually be circles,
-
11:09 - 11:13and if you increase the off diagonals of the Gaussian covariance matrix,
-
11:13 - 11:15then it becomes
-
11:15 - 11:16ellipses aligned along the, sort of,
-
11:16 - 11:1845 degree angle
-
11:18 - 11:21in this example.
-
11:21 - 11:26This is the same thing. Here's an example of a Gaussian density with negative covariances.
-
11:26 - 11:29So now the correlation
-
11:29 - 11:33goes the other way, so that's an even stronger negative covariance, and the same thing in
-
11:33 - 11:36contours. This is a Gaussian with negative entries on the off-diagonals and even
-
11:36 - 11:41larger negative entries on the off-diagonals, okay?
-
11:41 - 11:45The other parameter for the Gaussian is the mean parameter, so if this is - with mu equal to 0,
-
11:45 - 11:48and as you change the mean parameter,
-
11:48 - 11:50this is a different value of mu,
-
11:50 - 11:56the location of the Gaussian just moves around, okay?
-
11:56 - 12:02All right. So that was a quick primer on what Gaussians look like, and here's
-
12:02 - 12:02
-
12:02 - 12:06as a roadmap or as a picture to keep in mind, when we describe the Gaussian discriminant
-
12:06 - 12:09analysis algorithm, this is what we're going to do. Here's
-
12:09 - 12:12the training set,
-
12:12 - 12:15and in the Gaussian discriminant analysis algorithm,
-
12:15 - 12:19what I'm going to do is I'm going to look at the positive examples, say the crosses,
-
12:19 - 12:24and just looking at only the positive examples, I'm gonna fit a Gaussian distribution to the
-
12:24 - 12:25positive examples, and so
-
12:25 - 12:28maybe I end up with a Gaussian distribution like that,
-
12:28 - 12:31okay? So there's P of X given Y = 1.
-
12:31 - 12:34And then I'll look at the negative examples, the O's in this figure,
-
12:34 - 12:37and I'll fit a Gaussian to that, and maybe I get a
-
12:37 - 12:38Gaussian
-
12:38 - 12:41centered over there. This is the contour of my second Gaussian,
-
12:41 - 12:42and together -
-
12:42 - 12:45we'll see how later -
-
12:45 - 12:51together these two Gaussian densities will define a separator for these two classes, okay?
-
12:51 - 12:55And it'll turn out that the separator will turn out to be a little bit
-
12:55 - 12:56different
-
12:56 - 12:57from what logistic regression
-
12:57 - 12:59gives you.
-
12:59 - 13:00If you run logistic regression,
-
13:00 - 13:04you actually get the decision boundary shown in the green line, whereas Gaussian discriminant
-
13:04 - 13:07analysis gives you the blue line, okay? Switch back to chalkboard, please. All right. Here's the
-
13:07 - 13:14
-
13:29 - 13:34Gaussian discriminant analysis model. We're going
-
13:34 - 13:37to model P of Y
-
13:37 - 13:41as a Bernoulli random variable as usual -
-
13:41 - 13:46a Bernoulli random variable parameterized by parameter phi; you've
-
13:46 - 13:47seen this before.
-
13:47 - 13:49
-
13:49 - 13:56Model P of X given Y = 0 as a Gaussian -
-
13:58 - 14:01oh, you know what? Yeah,
-
14:01 - 14:05yes, excuse me. I
-
14:05 - 14:06thought this looked strange.
-
14:06 - 14:12This
-
14:12 - 14:13should be a sigma,
-
14:13 - 14:16determinant of sigma to the one-half in the denominator there.
-
14:16 - 14:18It's no big deal. It was - yeah,
-
14:18 - 14:20well, okay. Right.
-
14:20 - 14:23I
-
14:23 - 14:27was missing the determinant of sigma to the one-half on a previous
-
14:27 - 14:34board, excuse me. Okay,
-
14:40 - 14:44and so I model P of X given Y = 0 as a Gaussian
-
14:44 - 14:48with mean mu0 and covariance sigma,
-
14:48 - 14:55and P of X given Y = 1 as a Gaussian with mean mu1 and the same covariance sigma,
-
14:55 - 15:01
-
15:01 - 15:02and
-
15:02 - 15:09-
-
15:11 - 15:13okay?
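(Collecting the model being written on the board:)

```latex
y \sim \mathrm{Bernoulli}(\phi), \qquad
x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma), \qquad
x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma),
```

so that, for example,

```latex
p(x \mid y=0) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)\Big).
```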
-
15:13 - 15:17And so the parameters of this model are
-
15:17 - 15:18phi,
-
15:18 - 15:21mu0,
-
15:21 - 15:24mu1, and sigma,
-
15:24 - 15:25and so
-
15:25 - 15:29I can now write down the likelihood of the parameters
-
15:29 - 15:30as - oh, excuse
-
15:30 - 15:37me, actually, the log likelihood of the parameters as the log of
-
15:40 - 15:41that,
-
15:41 - 15:45right?
-
15:45 - 15:48So, in other words, if I'm given the training set, then
-
15:48 - 15:53you can write down the log likelihood of the parameters as the log of, you know,
-
15:53 - 15:58the product of the probabilities P of XI, YI, right?
-
15:58 - 16:04
-
16:04 - 16:10
-
16:10 - 16:12And
-
16:12 - 16:15this is just equal to that where each of these terms, P of XI given YI,
-
16:15 - 16:17or P of YI is
-
16:17 - 16:19then given
-
16:19 - 16:20
-
16:20 - 16:24by one of these three equations on top, okay?
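(Written out, the joint log likelihood being described is:)

```latex
\ell(\phi,\mu_0,\mu_1,\Sigma)
= \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
= \sum_{i=1}^{m} \Big[ \log p\big(x^{(i)} \mid y^{(i)};\, \mu_0,\mu_1,\Sigma\big)
+ \log p\big(y^{(i)};\, \phi\big) \Big].
```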
-
16:24 - 16:26And I just want
-
16:26 - 16:30to contrast this again with discriminative learning algorithms, right?
-
16:30 - 16:32So
-
16:32 - 16:35to give this a name, I guess, this sometimes is actually called
-
16:35 - 16:39the Joint Data Likelihood - the Joint Likelihood,
-
16:39 - 16:42and
-
16:42 - 16:45let me just contrast this with what we had previously
-
16:45 - 16:48when we're talking about logistic
-
16:48 - 16:52regression, where I said that the log likelihood of the parameters theta
-
16:52 - 16:53was log
-
16:53 - 16:57of a product I = 1 to M, P of YI
-
16:57 - 16:59given XI
-
16:59 - 17:02and parameterized
-
17:02 - 17:04by theta, right?
-
17:04 - 17:05So
-
17:05 - 17:08back where we're fitting logistic regression models or generalized linear
-
17:08 - 17:09models,
-
17:09 - 17:14we're always modeling P of YI given XI, parameterized by theta, and that was the
-
17:14 - 17:16conditional
-
17:16 - 17:17
-
17:17 - 17:20likelihood, okay,
-
17:20 - 17:24in
-
17:24 - 17:27which we're modeling P of YI given XI,
-
17:27 - 17:28whereas, now,
-
17:28 - 17:31for generative learning algorithms, we're going to look at the joint likelihood which
-
17:31 - 17:35is P of XI, YI, okay?
-
17:35 - 17:36So let's
-
17:36 - 17:43see.
-
17:46 - 17:48So given the training sets
-
17:48 - 17:51and using the Gaussian discriminant analysis model
-
17:51 - 17:56to fit the parameters of the model, we'll do maximum likelihood estimation as usual,
-
17:56 - 17:59and so you maximize your
-
17:59 - 18:02L
-
18:02 - 18:06with respect to the parameters phi, mu0,
-
18:06 - 18:08mu1, sigma,
-
18:08 - 18:10and so
-
18:10 - 18:16if we find the maximum likelihood estimate of parameters, you find that phi is
-
18:16 - 18:17-
-
18:17 - 18:22the maximum likelihood estimate is actually no surprise, and I'm writing this down mainly as practice for
-
18:22 - 18:24the indicator notation,
-
18:24 - 18:27all right? So the maximum likelihood estimate for phi would be Sum over
-
18:27 - 18:30I of YI, divided by M,
-
18:30 - 18:32
-
18:32 - 18:35or written alternatively as Sum over -
-
18:35 - 18:42all your training examples of the indicator YI = 1, divided by M, okay?
-
18:42 - 18:43
-
18:43 - 18:46In other words, maximum likelihood estimate for a
-
18:46 - 18:47
-
18:47 - 18:53Bernoulli parameter phi is just the fraction of training examples with
-
18:53 - 18:57label one, with Y equals 1.
-
18:57 - 19:03Maximum likelihood estimate for mu0 is this, okay?
-
19:03 - 19:10You
-
19:22 - 19:24should stare at this for a second and
-
19:24 - 19:26see if it makes sense.
-
19:26 - 19:33Actually, I'll just write out the next one, for mu1, while you do that.
-
19:34 - 19:41Okay?
-
19:50 - 19:52So what this is is
-
19:52 - 19:57what the denominator is - a sum over your training set of the indicator YI = 0.
-
19:57 - 20:02So for every training example for which YI = 0, this will
-
20:02 - 20:06increment the count by one, all right? So the
-
20:06 - 20:08denominator is just
-
20:08 - 20:11the number of examples
-
20:11 - 20:15
-
20:15 - 20:15with
-
20:15 - 20:21label zero, all right?
-
20:21 - 20:23And then the numerator will be, let's
-
20:23 - 20:26see, sum from I = 1 to M - for every time
-
20:26 - 20:30YI is equal to 0, this will be a one, and otherwise, this thing will be zero,
-
20:30 - 20:31and so
-
20:31 - 20:34this indicator function means that you're including
-
20:34 - 20:38only the times for
-
20:38 - 20:40which YI is equal to one - excuse me, only the terms for which Y is equal to zero
-
20:40 - 20:45because for all the times where YI is equal to one,
-
20:45 - 20:48this indicator will be equal to zero,
-
20:48 - 20:53and then you multiply that by XI, and so the numerator is
-
20:53 - 20:56really the sum
-
20:56 - 21:01of XI's corresponding to
-
21:01 - 21:03
-
21:03 - 21:09examples where the class labels were zero, okay?
-
21:09 - 21:15Raise your hand if this makes sense. Okay, cool.
-
21:15 - 21:17So just to say this fancifully,
-
21:17 - 21:20this just means look for your training set,
-
21:20 - 21:23find all the examples for which Y = 0,
-
21:23 - 21:25and take the average
-
21:25 - 21:30of the value of X for all your examples for which Y = 0. So take all your negative training
-
21:30 - 21:30examples
-
21:30 - 21:33and average the values for X
-
21:33 - 21:38and
-
21:38 - 21:41that's mu0, okay? If this
-
21:41 - 21:45notation is still a little bit cryptic - if you're still not sure why this
-
21:45 - 21:48equation translates into
-
21:48 - 21:53what I just said, do go home and stare at it for a while until it just makes sense. This is, sort
-
21:53 - 21:57of, no surprise. It just says to estimate the mean for the negative examples,
-
21:57 - 22:00take all your negative examples, and average them. So
-
22:00 - 22:07no surprise, but this is useful practice with the indicator notation.
-
22:07 - 22:08Similarly, you can
-
22:08 - 22:11derive the maximum likelihood estimate for sigma. I won't do that. You can read that in
-
22:11 - 22:17the notes yourself.
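(A minimal NumPy sketch of the maximum likelihood formulas just described, together with the Bayes-rule prediction discussed next; the function names here are illustrative, not from the course code:)

```python
import numpy as np

def fit_gda(X, y):
    """Fit phi, mu0, mu1 and the shared covariance Sigma by maximum likelihood.
    X: (m, n) array of real-valued features; y: (m,) array of 0/1 labels."""
    m = X.shape[0]
    phi = np.mean(y == 1)               # fraction of examples with label 1
    mu0 = X[y == 0].mean(axis=0)        # average of the negative examples
    mu1 = X[y == 1].mean(axis=0)        # average of the positive examples
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m   # shared covariance matrix
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Predict arg max over y of p(x | y) p(y); the denominator p(x) is the
    same for both classes, so it drops out of the comparison."""
    Sinv = np.linalg.inv(Sigma)
    def log_gaussian(mu):
        d = X - mu                      # log N(x; mu, Sigma) up to a shared constant
        return -0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d)
    score1 = log_gaussian(mu1) + np.log(phi)
    score0 = log_gaussian(mu0) + np.log(1 - phi)
    return (score1 > score0).astype(int)
```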
-
22:17 - 22:21
-
22:21 - 22:24And so having fit
-
22:24 - 22:29the parameters phi, mu0, mu1, and sigma
-
22:29 - 22:30to your data,
-
22:30 - 22:36well, you now need to make a prediction. You
-
22:36 - 22:40know, when you're given a new value of X, when you're given a new cancer, you need to predict whether
-
22:40 - 22:42it's malignant or benign.
-
22:42 - 22:45Your prediction is then going to be,
-
22:45 - 22:47let's say,
-
22:47 - 22:50the most likely value of Y given X. I should
-
22:50 - 22:55write semicolon the parameters there. I'll just give that
-
22:55 - 22:57- which is the arg max over Y
-
22:57 - 23:04by Bayes rule, all right?
-
23:06 - 23:13And that is, in turn,
-
23:15 - 23:15just that
-
23:15 - 23:20because the denominator P of X doesn't depend on Y,
-
23:20 - 23:22and
-
23:22 - 23:24if P of Y
-
23:24 - 23:27is uniform.
-
23:27 - 23:31In other words, if each of your classes is equally likely,
-
23:31 - 23:33so if P of Y
-
23:33 - 23:37takes the same value for all values
-
23:37 - 23:38of Y,
-
23:38 - 23:42then this is just arg max over Y of P of X
-
23:42 - 23:47given Y, okay?
-
23:47 - 23:51This happens sometimes, maybe not very often, so usually you end up using this
-
23:51 - 23:53formula where you
-
23:53 - 23:56compute P of X given Y and P of Y using
-
23:56 - 24:00your model, okay? Student:Can
-
24:00 - 24:01you give
-
24:01 - 24:05us arg max? Instructor (Andrew Ng):Oh, let's see. So if you take - actually
-
24:05 - 24:09let me.
-
24:09 - 24:10So the min of
-
24:10 - 24:15- arg max means the value of Y that maximizes this. Student:Oh, okay. Instructor (Andrew Ng):So just
-
24:15 - 24:19for an example, the min of X - 5
-
24:19 - 24:23squared is 0 because by choosing X equals 5, you can get this to be zero,
-
24:23 - 24:24and the arg min over X
-
24:24 - 24:31of X - 5 squared is equal to 5 because 5 is the value of X that minimizes this, okay?
-
24:31 - 24:38Cool. Thanks
-
24:38 - 24:45for
-
24:45 - 24:52asking that. Instructor (Andrew Ng):Okay. Actually any other questions about this? Yeah? Student:Why is
-
25:01 - 25:04the distribution uniform? Why isn't [inaudible] - Instructor (Andrew Ng):Oh, I see. By uniform I meant - I was being loose
-
25:04 - 25:06here.
-
25:06 - 25:07I meant if
-
25:07 - 25:11P of Y = 0 is equal to P of Y = 1, or if Y is the
-
25:11 - 25:13uniform distribution over
-
25:13 - 25:14the set 0 and 1. Student:Oh. Instructor (Andrew Ng):I just meant - yeah, if P of Y = 0
-
25:14 - 25:15equals P of Y = 1. That's all I mean, see? Anything else?
-
25:15 - 25:17All
-
25:17 - 25:20right. Okay. So
-
25:20 - 25:22
-
25:22 - 25:28
-
25:28 - 25:35it
-
25:42 - 25:47turns out Gaussian discriminant analysis has an interesting relationship
-
25:47 - 25:49to logistic
-
25:49 - 25:52regression. Let me illustrate that.
-
25:52 - 25:55
-
25:55 - 25:59So let's say you have a training set
-
25:59 - 26:06- actually let me just go ahead and draw a 1D training set, and that will
-
26:12 - 26:15kind of work, yes, okay.
-
26:15 - 26:18So let's say we have a training set comprising a few negative and a few positive examples,
-
26:18 - 26:23and let's say I run Gaussian discriminate analysis. So I'll fit Gaussians to each of these two
-
26:23 - 26:25densities - a Gaussian density to each of these two - to my positive
-
26:25 - 26:27and negative training
-
26:27 - 26:30examples,
-
26:30 - 26:33and so maybe my
-
26:33 - 26:38positive examples, the X's, are fit with a Gaussian like this,
-
26:38 - 26:40
-
26:40 - 26:45
-
26:45 - 26:48and my negative examples I will fit with a
-
26:48 - 26:55Gaussian that looks like that, okay?
-
26:57 - 27:00Now, I
-
27:00 - 27:04hope this [inaudible]. Now, let's
-
27:04 - 27:07vary along the X axis,
-
27:07 - 27:10and what I want to do is I'll
-
27:10 - 27:14overlay on top of this plot. I'm going to plot
-
27:14 - 27:21P of Y = 1 - no, actually,
-
27:24 - 27:26given X
-
27:26 - 27:32for a variety of values X, okay?
-
27:32 - 27:33So I actually realize what I should have done.
-
27:33 - 27:37I'm gonna call the X's the negative examples, and I'm gonna call the O's the positive examples. It just
-
27:37 - 27:40makes this part come in better.
-
27:40 - 27:44So let's take a value of X that's fairly small. Let's say X is this value here on a horizontal
-
27:44 - 27:45axis.
-
27:45 - 27:49Then what's the probability of Y being equal to one conditioned on X? Well,
-
27:49 - 27:53the way you calculate that is you write P of Y
-
27:53 - 27:57= 1 given X, and then you plug in all these formulas as usual, right? It's P of X
-
27:57 - 27:59given Y = 1, which is
-
27:59 - 28:01your Gaussian density,
-
28:01 - 28:04times P of Y = 1, you know, which is
-
28:04 - 28:08essentially - this is just going to be equal to phi,
-
28:08 - 28:10and then divided by,
-
28:10 - 28:12right, P of X, and then this shows you how you can calculate this.
-
28:12 - 28:14By using
-
28:14 - 28:19these two Gaussians and my phi on P of Y, I actually compute what P of Y
-
28:19 - 28:21= 1 given X is, and
-
28:21 - 28:25in this case,
-
28:25 - 28:29if X is this small, clearly it belongs to the left Gaussian. It's very unlikely to belong to
-
28:29 - 28:31a positive class, and so
-
28:31 - 28:36it'll be very small; it'll be very close to zero say, okay?
-
28:36 - 28:41And then we can increment the value of X a bit, and study a different value of X, and
-
28:41 - 28:44plot what is the P of Y given X - P of Y
-
28:44 - 28:50= 1 given X, and, again, it'll be pretty small. Let's
-
28:50 - 28:52use a point like that, right? At this point,
-
28:52 - 28:54the two Gaussian densities
-
28:54 - 28:56have equal value,
-
28:56 - 28:58and if
-
28:58 - 28:59I ask
-
28:59 - 29:02if X is this value, right, shown by the arrow,
-
29:02 - 29:03
-
29:03 - 29:07what's the probability of Y being equal to one for that value of X? Well, you really can't tell, so maybe it's about 0.5, okay? And if
-
29:07 - 29:11you fill
-
29:11 - 29:14in a bunch more points, you get a
-
29:14 - 29:16curve like that,
-
29:16 - 29:21and then you can keep going. Let's say for a point like that, you can ask what's the probability of Y
-
29:21 - 29:22being one? Well, if
-
29:22 - 29:25it's that far out, then clearly, it belongs to this
-
29:25 - 29:28rightmost Gaussian, and so
-
29:28 - 29:32the probability of Y being a one would be very high; it would be almost one, okay?
-
29:32 - 29:34And so you
-
29:34 - 29:36can repeat this exercise
-
29:36 - 29:39for a bunch of points. All right,
-
29:39 - 29:42compute P of Y equals one given X for a bunch of points,
-
29:42 - 29:46and if you connect up these points,
-
29:46 - 29:48you find that the
-
29:48 - 29:50curve you get [Pause] plotted
-
29:50 - 29:56takes a form of sigmoid function, okay? So,
-
29:56 - 30:00in other words, when you make the assumptions under the Gaussian
-
30:00 - 30:02
-
30:02 - 30:06discriminant analysis model,
-
30:06 - 30:08that P of X given Y is Gaussian,
-
30:08 - 30:13when you go back and compute what P of Y given X is, you actually get back
-
30:13 - 30:19exactly the same sigmoid function
-
30:19 - 30:23that we were using in logistic regression, okay? But it turns out the key difference is that
-
30:23 - 30:27Gaussian discriminant analysis will end up choosing a different
-
30:27 - 30:29position
-
30:29 - 30:32and a steepness of the sigmoid
-
30:32 - 30:36than would logistic regression. Is there a question? Student:I'm just
-
30:36 - 30:40wondering,
-
30:40 - 30:45the Gaussian of P of Y [inaudible] you do? Instructor (Andrew Ng):No, let's see. The Gaussian - so this Gaussian is
-
30:45 - 30:48P of X given Y = 1, and
-
30:48 - 30:50this Gaussian is P of X
-
30:50 - 30:57given Y = 0; does that make sense? Anything else? Student:Okay. Instructor (Andrew Ng):Yeah? Student:When you drawing all the dots, how did you
-
31:02 - 31:04decide what Y
-
31:04 - 31:06given
-
31:06 - 31:10P of X was? Instructor (Andrew Ng):What - say that again. Student:I'm sorry. Could you go over how you
-
31:10 - 31:12figured out where
-
31:12 - 31:14to draw each dot? Instructor (Andrew Ng):Let's see,
-
31:14 - 31:16okay. So the
-
31:16 - 31:19computation is as follows, right? The steps
-
31:19 - 31:22are I have the training sets, and so given my training set, I'm going to fit
-
31:22 - 31:25a Gaussian discriminant analysis model to it,
-
31:25 - 31:29and what that means is I'll build a model for P of X given Y = 1. I'll
-
31:29 - 31:30build
-
31:30 - 31:33a model for P of X given Y = 0,
-
31:33 - 31:38and I'll also fit a Bernoulli distribution to
-
31:38 - 31:39P of Y, okay?
-
31:39 - 31:43So, in other words, given my training set, I'll fit P of X given Y and P of Y
-
31:43 - 31:47to my data, and now I've chosen my parameters
-
31:47 - 31:49of phi, mu0,
-
31:49 - 31:50mu1,
-
31:50 - 31:52and the sigma, okay? Then
-
31:52 - 31:55this is the process I went through
-
31:55 - 31:59to plot all these dots, right? It's just I pick a point in the X axis,
-
31:59 - 32:01and then I compute
-
32:01 - 32:03P of Y given X
-
32:03 - 32:05for that value of X,
-
32:05 - 32:09and P of Y = 1 conditioned on X will be some value between zero and one. It'll
-
32:09 - 32:12be some real number, and whatever that real number is, I then plot it on the vertical
-
32:12 - 32:14axis,
-
32:14 - 32:20okay? And the way I compute P of Y = 1 conditioned on X is
-
32:20 - 32:21I would
-
32:21 - 32:23use these quantities. I would use
-
32:23 - 32:25P of X given Y
-
32:25 - 32:30and P of Y, and, sort of, plug them into Bayes rule, and that allows me
-
32:30 - 32:31to
-
32:31 - 32:32compute P of Y given X
-
32:32 - 32:34from these three quantities; does that make
-
32:34 - 32:35sense? Student:Yeah. Instructor (Andrew Ng):Was there something more that -
-
32:35 - 32:39Student:And how did you model P of X; is that - Instructor (Andrew Ng):Oh, okay. Yeah, so
-
32:39 - 32:42-
-
32:42 - 32:43well,
-
32:43 - 32:45got this right
-
32:45 - 32:49here. So P of X can be written as,
-
32:49 - 32:50
-
32:50 - 32:53right,
-
32:53 - 32:55so
-
32:55 - 32:59P of X given Y = 0 times P of Y = 0, plus P of X given Y = 1 times P of Y =
-
32:59 - 33:031, right?
-
33:03 - 33:07And so each of these terms, P of X given Y
-
33:07 - 33:11and P of Y, these are terms I can get out of, directly, from my Gaussian discriminant
-
33:11 - 33:14analysis model. Each of these terms is something that
-
33:14 - 33:16my model gives me directly,
-
33:16 - 33:18so plugged in as the denominator,
-
33:18 - 33:25and by doing that, that's how I compute P of Y = 1 given X, make sense? Student:Thank you. Instructor (Andrew Ng):Okay. Cool.
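(Carrying this computation through symbolically - the derivation the lecture leaves as an exercise - the posterior under the GDA assumptions does come out as a sigmoid in x:)

```latex
P(y=1 \mid x)
= \frac{\phi\,\mathcal{N}(x;\mu_1,\Sigma)}
       {\phi\,\mathcal{N}(x;\mu_1,\Sigma) + (1-\phi)\,\mathcal{N}(x;\mu_0,\Sigma)}
= \frac{1}{1 + \exp\!\big(-(\theta^{T}x + \theta_0)\big)},
\quad\text{where}\quad
\theta = \Sigma^{-1}(\mu_1 - \mu_0),
\;\;
\theta_0 = \tfrac{1}{2}\mu_0^{T}\Sigma^{-1}\mu_0
         - \tfrac{1}{2}\mu_1^{T}\Sigma^{-1}\mu_1
         + \log\tfrac{\phi}{1-\phi}.
```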
-
33:31 - 33:38
-
33:53 - 33:58So let's talk a little bit about the advantages and disadvantages of using a
-
33:58 - 34:00generative
-
34:00 - 34:05learning algorithm, okay? So in the particular case of Gaussian discriminant analysis, we
-
34:05 - 34:06assume that
-
34:06 - 34:08X conditioned on Y
-
34:08 - 34:14is Gaussian,
-
34:14 - 34:17and the argument I showed on the previous chalkboard, I didn't prove it formally,
-
34:17 - 34:20but you can actually go back and prove it yourself
-
34:20 - 34:24is that if you assume X given Y is Gaussian,
-
34:24 - 34:27then that implies that
-
34:27 - 34:29when you plot P of Y
-
34:29 - 34:31given X,
-
34:31 - 34:38you find that - well, let me just write logistic posterior, okay?
-
34:41 - 34:44And the argument I showed just now, which I didn't prove; you can go home and prove it
-
34:44 - 34:44yourself,
-
34:44 - 34:49is that if you assume X given Y is Gaussian, then that implies that the posterior
-
34:49 - 34:54distribution or the form of
-
34:54 - 34:57P of Y = 1 given X
-
34:57 - 35:01is going to be a logistic function,
-
35:01 - 35:02
-
35:02 - 35:06and it turns out this
-
35:06 - 35:08implication in the opposite direction
-
35:08 - 35:12does not hold true,
-
35:12 - 35:16okay? In particular, it actually turns out - this is actually, kind of, cool. It
-
35:16 - 35:19turns out that if you're
-
35:19 - 35:23seeing that X given Y = 1 is
-
35:23 - 35:24
-
35:24 - 35:26Poisson with
-
35:26 - 35:29parameter lambda 1,
-
35:29 - 35:32and X given Y = 0,
-
35:32 - 35:37is Poisson
-
35:37 - 35:39with parameter lambda 0.
-
35:39 - 35:41It turns out if you assumed this,
-
35:41 - 35:43then
-
35:43 - 35:44that also
-
35:44 - 35:47implies that P of Y
-
35:47 - 35:51given X
-
35:51 - 35:52
-
35:52 - 35:54is logistic, okay?
-
35:54 - 35:57So there are lots of assumptions on X given Y
-
35:57 - 36:04that will lead to P of Y given X being logistic, and,
-
36:05 - 36:06therefore,
-
36:06 - 36:12this, the assumption that X given Y being Gaussian is the stronger assumption
-
36:12 - 36:15than the assumption that Y given X is logistic,
-
36:15 - 36:17okay? Because this implies this,
-
36:17 - 36:23right? That means that this is a stronger assumption than this because
-
36:23 - 36:30this, the logistic posterior holds whenever X given Y is Gaussian but not vice versa.
-
36:30 - 36:31And so this leads to some
-
36:31 - 36:36of the tradeoffs between Gaussian discriminant analysis and logistic
-
36:36 - 36:37regression,
-
36:37 - 36:40right? Gaussian discriminant analysis makes a much stronger assumption
-
36:40 - 36:43that X given Y is Gaussian,
-
36:43 - 36:47and so when this assumption is true, when this assumption approximately holds, if you plot the
-
36:47 - 36:49data,
-
36:49 - 36:52and if X given Y is, indeed, approximately Gaussian,
-
36:52 - 36:56then if you make this assumption, explicit to the algorithm, then the
-
36:56 - 36:57algorithm will do better
-
36:57 - 36:59because it's as if the
-
36:59 - 37:03algorithm is making use of more information about the data. The algorithm knows that
-
37:03 - 37:04the data is Gaussian,
-
37:04 - 37:05right? And so
-
37:05 - 37:08if the Gaussian assumption, you know,
-
37:08 - 37:10holds or roughly holds,
-
37:10 - 37:10then Gaussian
-
37:10 - 37:14discriminant analysis may do better than logistic regression.
-
37:14 - 37:19If, conversely, if you're actually not sure what X given Y is, then
-
37:19 - 37:23logistic regression, the discriminative algorithm, may do better,
-
37:23 - 37:26and, in particular, use logistic regression,
-
37:26 - 37:26and
-
37:26 - 37:30maybe you see [inaudible] before the data was Gaussian, but it turns out the data
-
37:30 - 37:31was actually Poisson, right?
-
37:31 - 37:34Then logistic regression will still do perfectly fine because if
-
37:34 - 37:37the data were actually Poisson,
-
37:37 - 37:40then P of Y = 1 given X will be logistic, and it'll do perfectly
-
37:40 - 37:45fine, but if you assumed it was Gaussian, then the algorithm may go off and do something
-
37:45 - 37:52that's not as good, okay?
-
37:52 - 37:56So it turns out that - right.
-
37:56 - 37:59So it's slightly different.
-
37:59 - 38:01It
-
38:01 - 38:05turns out the real advantage of generative learning algorithms is often that it
-
38:05 - 38:08requires less data,
-
38:08 - 38:09and, in particular,
-
38:09 - 38:13data is never really exactly Gaussian, right? Because data is often
-
38:13 - 38:16approximately Gaussian; it's never exactly Gaussian.
-
38:16 - 38:20And it turns out, generative learning algorithms often do surprisingly well
-
38:20 - 38:21even when
-
38:21 - 38:24these modeling assumptions are not met, but
-
38:24 - 38:26one other tradeoff
-
38:26 - 38:27is that
-
38:27 - 38:30by making stronger assumptions
-
38:30 - 38:32about the data,
-
38:32 - 38:34Gaussian discriminant analysis
-
38:34 - 38:36often needs less data
-
38:36 - 38:40in order to fit, like, an okay model, even if there's less training data. Whereas, in
-
38:40 - 38:42contrast,
-
38:42 - 38:46logistic regression, by making fewer
-
38:46 - 38:49assumptions, is more robust to your modeling assumptions because you're making a weaker assumption; you're
-
38:49 - 38:51making fewer assumptions,
-
38:51 - 38:54but sometimes it takes a slightly larger training set to fit than Gaussian discriminant
-
38:54 - 38:55analysis. Question? Student:In order
-
38:55 - 39:00to meet any assumption about the number [inaudible], plus
-
39:00 - 39:01
-
39:01 - 39:03here
-
39:03 - 39:09we assume that P of Y = 1, equal
-
39:09 - 39:11two
-
39:11 - 39:15number of.
-
39:15 - 39:16[Inaudible]. Is true when
-
39:16 - 39:18the number of samples is marginal? Instructor (Andrew Ng):Okay. So let's see.
-
39:18 - 39:22So there's a question of is this true - what
-
39:22 - 39:23was that? Let me translate that
-
39:23 - 39:25differently. So
-
39:25 - 39:29the modeling assumptions are made independently of the size
-
39:29 - 39:30of
-
39:30 - 39:33your training set, right? So, like, in least squares regression - well,
-
39:33 - 39:37in all of these models I'm assuming that these are random variables
-
39:37 - 39:41flowing from some distribution, and then, finally, I'm giving a single training set
-
39:41 - 39:44and that's used to estimate the parameters of the
-
39:44 - 39:44distribution, right?
-
39:44 - 39:51Student:So
-
39:52 - 39:54what's the probability of Y = 1?
-
39:54 - 39:55Instructor (Andrew Ng):Probability of Y = 1?
-
39:55 - 39:56Student:Yeah, you used the - Instructor (Andrew Ng):Sort
-
39:56 - 39:58of, this like
-
39:58 - 40:00-
-
40:00 - 40:03back to the philosophy of maximum likelihood estimation,
-
40:03 - 40:05right? I'm assuming that
-
40:05 - 40:08P of Y is equal to phi to the Y, times
-
40:08 - 40:131 minus phi to the 1 minus Y. So I'm assuming that there's some true value of phi
-
40:13 - 40:14generating
-
40:14 - 40:16all my data,
-
40:16 - 40:19and then
-
40:19 - 40:20-
-
40:20 - 40:26well, when I write this, I guess, maybe what I should write isn't -
-
40:26 - 40:28so when I write this, I
-
40:28 - 40:30guess there are already two values of phi. One is
-
40:30 - 40:32there's a true underlying value of phi
-
40:32 - 40:35that was used to generate the data,
-
40:35 - 40:40and then there's the maximum likelihood estimate of the value of phi, and so when I was writing
-
40:40 - 40:42those formulas earlier,
-
40:42 - 40:45those formulas I was writing for phi, and mu0, and mu1
-
40:45 - 40:49were really the maximum likelihood estimates for phi, mu0, and mu1, and that's different from the true
-
40:49 - 40:56underlying values of phi, mu0, and mu1, but - Student:[Off mic]. Instructor (Andrew Ng):Yeah, right. So maximum
-
40:56 - 40:58likelihood estimate comes from the data,
-
40:58 - 41:02and there's some, sort of, true underlying value of phi that I'm trying to estimate,
-
41:02 - 41:05and my maximum likelihood estimate is my attempt to estimate the true value,
-
41:05 - 41:08but, you know, by notational convention
-
41:08 - 41:12we often just write it as that as well, without bothering to distinguish between
-
41:12 - 41:15the maximum likelihood value and the true underlying value that I'm assuming is out
-
41:15 - 41:16there, and that I'm
-
41:16 - 41:23only hoping to estimate.
-
41:24 - 41:26Actually, yeah,
-
41:26 - 41:29so for, sort of, questions like these about maximum likelihood and so on, I
-
41:29 - 41:34point you to the Friday discussion section
-
41:34 - 41:36as a good time to
-
41:36 - 41:38ask questions about, sort of,
-
41:38 - 41:41probabilistic definitions like these as well. Are there any
-
41:41 - 41:48other questions? No, great. Okay.
-
41:53 - 42:00So,
-
42:00 - 42:01great. Oh, it
-
42:01 - 42:08turns out, just to mention one more thing that's, kind of, cool.
-
42:08 - 42:09I said that
-
42:09 - 42:14if X given Y is Poisson, then you also get a logistic posterior,
-
42:14 - 42:17it actually turns out there's a more general version of this. If you assume
-
42:17 - 42:19X
-
42:19 - 42:24given Y = 1 is exponential family
-
42:24 - 42:26with parameter eta 1, and then you assume
-
42:26 - 42:33X given Y = 0 is exponential family
-
42:34 - 42:39with parameter eta 0, then
-
42:39 - 42:43this implies that P of Y = 1 given X is also logistic, okay? And
-
42:43 - 42:45that's, kind of, cool.
-
42:45 - 42:48It means that X given Y could be - I don't
-
42:48 - 42:50know, some strange thing. It could be gamma because
-
42:50 - 42:53we've seen Gaussian right
-
42:53 - 42:54next to the - I
-
42:54 - 42:57don't know, gamma, exponential.
-
42:57 - 42:59There's actually beta. I'm
-
42:59 - 43:02just rattling off my mental list of exponential family distributions. It could be any one
-
43:02 - 43:04of those things,
-
43:04 - 43:08so [inaudible] the same exponential family distribution for the two classes
-
43:08 - 43:09with different natural parameters
-
43:12 - 43:13then the
-
43:12 - 43:13posterior
-
43:13 - 43:17P of Y given 1 given X - P of Y = 1 given X would be logistic, and so this shows
-
43:17 - 43:19the robustness of logistic regression
-
43:19 - 43:22to the choice of modeling assumptions because it could be that
-
43:22 - 43:25the data was actually, you know, gamma distributed,
-
43:25 - 43:29and it still turns out to be logistic. So it's the
-
43:29 - 43:36robustness of logistic regression to modeling
-
43:39 - 43:41assumptions.
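(One way to state the more general fact, which is not worked out in lecture: if both class-conditional densities lie in the same exponential family, p(x | y) = b(x) exp(eta_y^T T(x) - a(eta_y)), then Bayes rule gives)

```latex
P(y=1 \mid x)
= \frac{1}{1 + \exp\!\big(-\big[(\eta_1-\eta_0)^{T}T(x) - a(\eta_1) + a(\eta_0) + \log\tfrac{\phi}{1-\phi}\big]\big)},
```

which is a logistic function of the sufficient statistic T(x).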
-
43:41 - 43:43And this is the second one. I think,
-
43:43 - 43:44early on I promised
-
43:44 - 43:49two justifications for where I pulled the logistic function out of the hat, right? So
-
43:49 - 43:53one was the exponential family derivation we went through last time, and this is, sort of, the second one.
-
43:53 - 44:00That all of these modeling assumptions also lead to the logistic function. Yeah? Student:[Off
-
44:05 - 44:10mic]. Instructor (Andrew Ng):Oh, if P of Y = 1 given X is logistic, does that imply this? No. This is also
-
44:10 - 44:11not true, right?
-
44:11 - 44:12Yeah, so this exponential
-
44:12 - 44:14family distribution
-
44:14 - 44:18implies P of Y = 1 given X is logistic, but the reverse implication is not true.
-
44:18 - 44:23There are actually all sorts of really bizarre distributions
-
44:23 - 44:30for X that would give rise to a logistic function, okay? Okay. So
-
44:30 - 44:35let's talk about - that was our first generative learning algorithm. Maybe I'll talk about the second
-
44:35 - 44:38generative learning algorithm,
-
44:38 - 44:45and the motivating example, actually this is called a Naive Bayes algorithm,
-
44:45 - 44:50and the motivating example that I'm gonna use will be spam classification. All right. So let's
-
44:50 - 44:54say that you want to build a spam classifier to take your incoming stream of email and decide if
-
44:54 - 44:57it's spam or
-
44:57 - 44:57not.
-
44:57 - 45:02So let's
-
45:02 - 45:05see. Y will be 0
-
45:05 - 45:07or
-
45:07 - 45:121, with 1 being spam email
-
45:12 - 45:13and 0 being non-spam, and
-
45:13 - 45:17the first decision we need to make is, given a piece of email,
-
45:17 - 45:21how do you represent a piece of email using a feature vector X,
-
45:21 - 45:24right? So email is just a piece of text, right? Email
-
45:24 - 45:28is like a list of words or a list of ASCII characters.
-
45:28 - 45:31So I can represent email as a feature of vector X.
-
45:31 - 45:32So we'll use a couple of different
-
45:32 - 45:34representations,
-
45:34 - 45:37but the one I'll use today is
-
45:37 - 45:38we will
-
45:38 - 45:42construct the vector X as follows. I'm gonna go through my dictionary, and, sort of, make a listing of
-
45:42 - 45:44all the words in my dictionary, okay? So
-
45:44 - 45:46the first word is
-
45:46 - 45:52A. The second word in my dictionary is aardvark, then aardwolf,
-
45:52 - 45:56okay?
-
45:56 - 45:59You know, and somewhere along the way you see the word buy in the spam email telling you to buy
-
45:59 - 46:01stuff.
-
46:01 - 46:04Depending on how you collect your list of words,
-
46:04 - 46:09you know, you won't find CS229, right, course number in a dictionary, but
-
46:09 - 46:10if you
-
46:10 - 46:14collect a list of words via other emails you've gotten, you have this list somewhere
-
46:14 - 46:18as well, and then the last word in my dictionary was
-
46:18 - 46:22zymurgy, which
-
46:22 - 46:25pertains to the branch of chemistry that deals with
-
46:25 - 46:28the fermentation process in
-
46:28 - 46:31brewing.
-
46:31 - 46:35So say I get a piece of email, and what I'll do is I'll then
-
46:35 - 46:38scan through this list of words, and wherever
-
46:38 - 46:41a certain word appears in my email, I'll put a 1 there. So if a particular
-
46:41 - 46:44email has the word a, then that's 1.
-
46:44 - 46:47You know, my email doesn't have the words aardvark
-
46:47 - 46:49or aardwolf, so it gets zeros. And again,
-
46:49 - 46:53a piece of email, they want me to buy something, CS229 doesn't occur, and so on, okay?
-
46:53 - 46:57So
-
46:57 - 46:59this would be
-
46:59 - 47:02one way of creating a feature vector
-
47:02 - 47:04to represent a
-
47:04 - 47:06piece of email.
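(A small sketch of this representation; the six-word vocabulary below is a made-up stand-in for the full dictionary, or for the word list collected from your training emails:)

```python
import re

vocabulary = ["a", "aardvark", "aardwolf", "buy", "cs229", "zymurgy"]

def email_to_feature_vector(email_text):
    """Return x in {0,1}^n with x[i] = 1 iff vocabulary word i appears in the email."""
    words_in_email = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    return [1 if w in words_in_email else 0 for w in vocabulary]

x = email_to_feature_vector("Please buy a new laptop")   # -> [1, 0, 0, 1, 0, 0]
```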
-
47:06 - 47:10Now, let's write
-
47:10 - 47:17down the generative model for this. Actually,
-
47:18 - 47:19let's use
-
47:19 - 47:21this.
-
47:21 - 47:24In other words, I want to model P of X given Y - the two terms
-
47:24 - 47:28given Y = 0 or Y = 1, all right?
-
47:28 - 47:31And my feature vectors are going to be 0, 1
-
47:31 - 47:36to the N. It's going to be these bit vectors, binary-valued vectors. They're N
-
47:36 - 47:36dimensional.
-
47:36 - 47:38Where N
-
47:38 - 47:39may
-
47:39 - 47:42be on the order of, say, 50,000, if you have 50,000
-
47:42 - 47:44words in your dictionary,
-
47:44 - 47:46which is not atypical. So
-
47:46 - 47:47values from - I don't
-
47:47 - 47:48know,
-
47:48 - 47:52mid-thousands to tens of thousands is very typical
-
47:52 - 47:54for problems like
-
47:54 - 47:57these. And, therefore,
-
47:57 - 48:00there are two to the 50,000 possible values for X, right? So two to
-
48:00 - 48:0250,000
-
48:02 - 48:03possible bit vectors
-
48:03 - 48:04of length
-
48:04 - 48:05
-
48:05 - 48:0750,000, and so
-
48:07 - 48:09one way to model this is
-
48:09 - 48:11the multinomial distribution,
-
48:11 - 48:14but because there are two to the 50,000 possible values for X,
-
48:14 - 48:20I would need two to the 50,000, well, minus 1 parameters,
-
48:20 - 48:21right? Because you have
-
48:21 - 48:24the constraint that they sum to 1, right? So
-
48:24 - 48:27minus 1. And this is clearly way too many parameters
-
48:27 - 48:28to model
-
48:28 - 48:30using the multinomial distribution
-
48:30 - 48:34over all two to 50,000 possibilities.
-
48:34 - 48:36So
-
48:36 - 48:42in a Naive Bayes algorithm, we're
-
48:42 - 48:47going to make a very strong assumption on P of X given Y,
-
48:47 - 48:49and, in particular, I'm going to assume - let
-
48:49 - 48:54me just say what it's called; then I'll write out what it means. I'm going to assume that the
-
48:54 - 48:55XI's
-
48:55 - 49:02are conditionally independent
-
49:06 - 49:08given Y, okay?
-
49:08 - 49:11Let me say what this means.
-
49:11 - 49:18So I have that P of X1, X2, up to X50,000,
-
49:19 - 49:21right, given the
-
49:21 - 49:26Y. By the chain rule of probability, this is P of X1 given Y
-
49:26 - 49:28times P of X2
-
49:28 - 49:30given
-
49:30 - 49:32
-
49:32 - 49:34Y,
-
49:34 - 49:39X1
-
49:39 - 49:43times P of X3 given Y, X1, X2 - I'll just put dot, dot, dot up to, you know, well -
-
49:43 - 49:46whatever. You get the idea, up to P of X50,000, okay?
-
49:46 - 49:50So this is the chain rule of probability. This always holds. I've not
-
49:50 - 49:53made any assumption yet, and now, we're
-
49:53 - 49:55gonna
-
49:55 - 49:58make what's called the Naive Bayes assumption, or this assumption that the
-
49:58 - 50:02XI's are conditionally independent given Y. We're going
-
50:02 - 50:04to assume that -
-
50:04 - 50:07well, nothing changes for the first term,
-
50:07 - 50:12but I'm gonna assume that P of X2 given Y, X1 is equal to P of X2 given just Y. I'm gonna assume that the next term is equal to P of X3
-
50:12 - 50:13given
-
50:13 - 50:17the
-
50:17 - 50:18Y,
-
50:18 - 50:20and so on, up
-
50:20 - 50:24to P of X50,000 given Y, okay?
-
50:24 - 50:29Or just written more compactly,
-
50:29 - 50:32
-
50:32 - 50:36we assume that P of X1 through X50,000 given Y is
-
50:36 - 50:41the product from I = 1 to 50,000 of P of XI
-
50:41 - 50:44given the Y,
-
50:44 - 50:45okay?
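(Putting the two board equations together - the chain rule, which always holds, and then the Naive Bayes assumption:)

```latex
p(x_1,\dots,x_{50000} \mid y)
= p(x_1 \mid y)\,p(x_2 \mid y, x_1)\cdots p(x_{50000} \mid y, x_1,\dots,x_{49999})
\;\overset{\text{NB}}{=}\; \prod_{i=1}^{50000} p(x_i \mid y).
```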
-
50:45 - 50:49And stating informally what this means is that I'm, sort of, assuming that -
-
50:49 - 50:53so as long as you know the class label Y, so long as you know whether this is spam or not
-
50:53 - 50:55spam,
-
50:55 - 50:59then knowing whether the word a appears in the email
-
50:59 - 51:00does not affect
-
51:00 - 51:01the probability
-
51:01 - 51:03of whether the word
-
51:03 - 51:07aardwolf appears in the email, all right?
-
51:07 - 51:11In other words, this is assuming - once you know whether an email is spam
-
51:11 - 51:13or not spam,
-
51:13 - 51:16then knowing whether one word appears in the email won't help
-
51:16 - 51:19you predict whether any other word appears in the email, okay?
-
51:19 - 51:20And,
-
51:20 - 51:23obviously, this assumption is false, right? This
-
51:23 - 51:26assumption can't possibly be
-
51:26 - 51:28true. I mean, if you see the word
-
51:28 - 51:31- I don't know, CS229 in an email, you're much more likely to see my name in the email, or
-
51:31 - 51:37the TA's names, or whatever. So this assumption is normally just false
-
51:37 - 51:37
-
51:37 - 51:39under English, right,
-
51:39 - 51:41for normal written English,
-
51:41 - 51:43but it
-
51:43 - 51:44turns out that despite
-
51:44 - 51:47this assumption being, sort of,
-
51:47 - 51:48false in the literal sense,
-
51:48 - 51:51the Naive Bayes algorithm is, sort of,
-
51:51 - 51:53an extremely effective
-
51:53 - 51:58algorithm for classifying text documents into spam or not spam, for
-
51:58 - 52:02classifying your emails into different emails for your automatic view, for
-
52:02 - 52:04looking at web pages and classifying
-
52:04 - 52:06whether this webpage is trying to sell something or whatever. It
-
52:06 - 52:08turns out, this assumption
-
52:08 - 52:12works very well for classifying text documents and for other applications too that I'll
-
52:12 - 52:16talk a bit about later.
-
52:16 - 52:21As a digression that'll make sense only to some of you.
-
52:21 - 52:23Let me just say that
-
52:23 - 52:28if you're familiar with Bayesian networks or, say,
-
52:28 - 52:30
-
52:30 - 52:34graphical models, the Bayesian network associated with this model looks like this, and you're assuming
-
52:34 - 52:35that
-
52:35 - 52:37this is random variable Y
-
52:37 - 52:39that then generates X1, X2, through
-
52:39 - 52:41X50,000, okay? If you've not seen the
-
52:41 - 52:43Bayes Net before, if
-
52:43 - 52:48you don't know your graphical model, just ignore this. It's not important to our purposes, but
-
52:48 - 52:55if you've seen it before, that's what it will look like. Okay.
-
52:58 - 53:03So
-
53:03 - 53:10
-
53:12 - 53:16the parameters of the model
-
53:16 - 53:18are as follows
-
53:18 - 53:22with phi I given Y = 1, which
-
53:22 - 53:23
-
53:23 - 53:28is the probability that XI = 1
-
53:28 - 53:29given Y = 1,
-
53:29 - 53:36phi I
-
53:36 - 53:40given Y = 0, and phi Y, okay?
-
53:40 - 53:43So these are the parameters of the model,
-
53:43 - 53:45
-
53:45 - 53:47and, therefore,
-
53:47 - 53:50to fit the parameters of the model, you
-
53:50 - 53:57can write down the joint likelihood, right,
-
54:02 - 54:07is
-
54:07 - 54:14equal to, as usual, okay?
-
54:19 - 54:21So given the training sets,
-
54:21 - 54:28you can write down the joint
-
54:29 - 54:33likelihood of the parameters, and
-
54:33 - 54:40then when
-
54:44 - 54:45you
-
54:45 - 54:47do maximum likelihood estimation,
-
54:47 - 54:51you find that the maximum likelihood estimate of the parameters are
-
54:51 - 54:53- they're really, pretty much, what you'd expect.
-
54:53 - 54:56Maximum likelihood estimate for phi J given Y = 1 is
-
54:56 - 54:58sum from I = 1 to
-
54:58 - 55:02M,
-
55:02 - 55:03indicator
-
55:03 - 55:07XIJ =
-
55:07 - 55:141 and YI = 1, divided by the sum from I = 1 to M of indicator YI = 1, okay?
-
55:15 - 55:18
-
55:18 - 55:19
-
55:19 - 55:21And this is just a,
-
55:21 - 55:23I guess, stated more simply,
-
55:23 - 55:25the numerator just says, Run through
-
55:25 - 55:28your entire training set, sum over your M examples,
-
55:28 - 55:31and count up the number of times you saw word J
-
55:31 - 55:33in a piece of email
-
55:33 - 55:35for which the label Y was equal to 1. So, in other words, look
-
55:35 - 55:37through all your spam emails
-
55:37 - 55:39and count the number of emails in which the word
-
55:39 - 55:41J appeared out of
-
55:41 - 55:42all your spam emails,
-
55:42 - 55:45and the denominator is, you know,
-
55:45 - 55:46sum from I = 1 to M,
-
55:46 - 55:48the number of spam. The
-
55:48 - 55:51denominator is just the number of spam emails you got.
-
55:51 - 55:54And so this ratio is
-
55:54 - 55:58in all your spam emails in your training set,
-
55:58 - 56:00what fraction of these emails
-
56:00 - 56:03did the word J
-
56:03 - 56:04appear in -
-
56:04 - 56:07did word J of your dictionary appear in?
-
56:07 - 56:10And that's the maximum likelihood estimate
-
56:10 - 56:14for the probability of seeing the word J conditioned on the piece of email being spam, okay? And similarly, your
-
56:14 - 56:17
-
56:17 - 56:20
-
56:20 - 56:23maximum likelihood estimate for phi
-
56:23 - 56:25Y
-
56:25 - 56:29is pretty much what you'd expect, right?
-
56:29 - 56:36Okay?
-
56:41 - 56:43And so
-
56:43 - 56:48having estimated all these parameters,
-
56:48 - 56:51when you're given a new piece of email that you want to classify,
-
56:51 - 56:53you can then compute P of Y given X
-
56:53 - 56:56using Bayes rule, right?
-
56:56 - 56:58Same as before because
-
56:58 - 57:04together these parameters gives you a model for P of X given Y and for P of Y,
-
57:04 - 57:08and by using Bayes rule, given these two terms, you can compute
-
57:08 - 57:11P of Y given X, and
-
57:11 - 57:16there's your spam classifier, okay?
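(A minimal NumPy sketch of the estimates and the Bayes-rule prediction just described; there's no Laplace smoothing yet, so it breaks down if some word never appears in one of the two classes - that's the elaboration the lecture turns to next. Function names are illustrative, not from the course code:)

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (m, n) 0/1 matrix of word occurrences; y: (m,) 0/1 spam labels."""
    phi_y = np.mean(y == 1)
    phi_j_given_1 = X[y == 1].mean(axis=0)   # fraction of spam emails containing word j
    phi_j_given_0 = X[y == 0].mean(axis=0)   # fraction of non-spam emails containing word j
    return phi_y, phi_j_given_0, phi_j_given_1

def predict_naive_bayes(x, phi_y, phi_j_given_0, phi_j_given_1):
    """x: length-n 0/1 array. Compare p(x|y=1)p(y=1) with p(x|y=0)p(y=0);
    the shared denominator p(x) drops out of the comparison."""
    def log_p_x_given_y(phi_j):
        # Naive Bayes: p(x|y) = prod_j phi_j^{x_j} * (1 - phi_j)^{1 - x_j}
        return np.sum(x * np.log(phi_j) + (1 - x) * np.log(1 - phi_j))
    score1 = log_p_x_given_y(phi_j_given_1) + np.log(phi_y)
    score0 = log_p_x_given_y(phi_j_given_0) + np.log(1 - phi_y)
    return int(score1 > score0)
```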
-
57:16 - 57:19Turns out we need one more elaboration to this idea, but let me check if there are
-
57:19 - 57:25questions about this so far.
-
57:25 - 57:29Student:So does this model depend
-
57:29 - 57:30on
-
57:30 - 57:34the number of inputs? Instructor (Andrew Ng):What do
-
57:34 - 57:35you
-
57:35 - 57:38mean, number of inputs, the number of features? Student:No, number of samples. Instructor (Andrew Ng):Well, M is the number of training examples, so this
-
57:38 - 57:45given M training examples, this is the formula for the maximum likelihood estimate of the parameters, right? So other questions, does it make
-
57:49 - 57:53sense? Or M is the number of training examples, so when you have M training examples, you plug them
-
57:53 - 57:54into this formula,
-
57:54 - 58:01and that's how you compute the maximum likelihood estimates. Student:Is training examples you mean M is the
-
58:01 - 58:04number of emails? Instructor (Andrew Ng):Yeah, right. So, right.
-
58:04 - 58:07So it's, kind of, your training set. I would go through all the email I've gotten
-
58:07 - 58:09in the last two months
-
58:09 - 58:11and label them as spam or not spam,
-
58:11 - 58:14and so you have - I don't
-
58:14 - 58:17know, like, a few hundred emails
-
58:17 - 58:19labeled as spam or not spam,
-
58:19 - 58:26and that will comprise your training sets for X1 and Y1 through XM,
-
58:28 - 58:29YM,
-
58:29 - 58:33where X is one of those vectors representing which words appeared in the email and Y
-
58:33 - 58:40is 0, 1 depending on whether they equal spam or not spam, okay? Student:So you are saying that this model depends on the number
-
58:41 - 58:42
-
58:42 - 58:49of examples, but the last model doesn't depend on the models, but your phi is the
-
58:50 - 58:54same for either one. Instructor (Andrew Ng):They're
-
58:54 - 58:58different things, right? There's the model which is
-
58:58 - 59:01- the modeling assumptions aren't made very well.
-
59:01 - 59:04I'm assuming that - I'm making the Naive Bayes assumption.
-
59:04 - 59:09So the probabilistic model is an assumption on the joint distribution
-
59:09 - 59:10of X and Y.
-
59:10 - 59:12That's what the model is,
-
59:12 - 59:17and then I'm given a fixed number of training examples. I'm given M training examples, and
-
59:17 - 59:20then it's, like, after I'm given the training sets, I'll then go in to write the maximum
-
59:20 - 59:22likelihood estimate of the parameters, right? So that's, sort
-
59:22 - 59:25of,
-
59:25 - 59:32maybe we should take that offline for - yeah, ask a question? Student:Then how would you do this, like,
-
59:38 - 59:41if this [inaudible] didn't work? Instructor (Andrew Ng):Say that again. Student:How would you do it, say, like the 50,000 words - Instructor (Andrew Ng):Oh, okay. How to do this with the 50,000 words, yeah. So
-
59:41 - 59:42it turns out
-
59:42 - 59:46this is, sort of, a very practical question, really. How do I come up with this list of
-
59:46 - 59:50words? One common way to do this is to actually
-
59:50 - 59:53find some way to come up with a list of words, like go through all your emails, go through
-
59:53 - 59:54all the -
-
59:54 - 59:57in practice, one common way to come up with a list of words
-
59:57 - 60:00is to just take all the words that appear in your training set. That's one fairly common way
-
60:00 - 60:02to do it,
-
60:02 - 60:06or if that turns out to be too many words, you can take all words that appear
-
60:06 - 60:07at least
-
60:07 - 60:08three times
-
60:08 - 60:09in your training set. So
-
60:09 - 60:11words that
-
60:11 - 60:15you didn't even see three times in the emails you got in the last
-
60:15 - 60:17two months, you discard. So those are - I
-
60:17 - 60:20was talking about going through a dictionary, which is a nice way of thinking about it, but in
-
60:20 - 60:22practice, you might go through
-
60:22 - 60:26your training set and then just take the union of all the words that appear in
-
60:26 - 60:30it. Some of the texts, by the way, might say select these as the features, but this is one
-
60:30 - 60:32way to think about
-
60:32 - 60:34creating your feature vector,
-
60:34 - 60:41right, as zero and one values, okay? Moving on, yeah. Okay. Ask a question? Student:I'm getting, kind of, confused on how you compute all those parameters. Instructor (Andrew Ng):On
-
60:49 - 60:52how I came up with the parameters?
-
60:52 - 60:54Student:Correct. Instructor (Andrew Ng):Let's see.
-
60:54 - 60:58So in Naive Bayes, what I need to do - the question was how did I come up with the parameters, right?
-
60:58 - 60:59In Naive Bayes,
-
60:59 - 61:01I need to build a model
-
61:01 - 61:04for P of X given Y and for
-
61:04 - 61:05P of Y,
-
61:05 - 61:10right? So, I mean, in generative learning algorithms, I need to come up with
-
61:10 - 61:11models for these.
-
61:11 - 61:15So how'd I model P of Y? Well, I just chose to model it using a Bernoulli
-
61:15 - 61:16distribution,
-
61:16 - 61:17and so
-
61:17 - 61:20P of Y will be
-
61:20 - 61:22parameterized by that, all right? Student:Okay.
-
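In the usual notation (this equation isn't read out at this point in the lecture, but it is the standard way to write the Bernoulli model just described):

    p(y) = \phi_y^{\,y} \, (1 - \phi_y)^{1 - y}, \qquad \phi_y = p(y = 1).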
61:22 - 61:27Instructor (Andrew Ng):And then how'd I model P of X given Y? Well,
-
61:27 - 61:29let's keep changing bullets.
-
61:29 - 61:33My model for P of X given Y under the Naive Bayes assumption, I assume
-
61:33 - 61:35that P of X given Y
-
61:35 - 61:38is the product of these probabilities,
-
61:38 - 61:41and so I'm going to need parameters to tell me
-
61:41 - 61:44what's the probability of each word occurring,
-
61:44 - 61:46you know, of each word occurring or not occurring,
-
61:46 - 61:53conditioned on the email being spam or non-spam, okay? Student:How is that
-
61:55 - 61:59Bernoulli? Instructor (Andrew Ng):Oh, because X is either zero or one, right? By the way I defined the feature
-
61:59 - 62:01vectors, XI
-
62:01 - 62:05is either one or zero, depending on whether word I appears in the email,
-
62:05 - 62:09right? So by the way I define the
-
62:09 - 62:11feature vectors, XI -
-
62:11 - 62:17the XI is always zero or one. So by definition, if XI, you know, is either zero or
-
62:17 - 62:20one, then it has to be a Bernoulli distribution, right?
-
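Putting the two modeling choices together, the class-conditional model being described is, in standard notation (again my writing, not the board's):

    p(x \mid y) = \prod_{i=1}^{n} p(x_i \mid y),
    \qquad
    p(x_i \mid y) = \phi_{i \mid y}^{\,x_i} \, (1 - \phi_{i \mid y})^{1 - x_i},

where \phi_{i \mid y} is the probability that word i appears in an email of class y, and n is the vocabulary size (50,000 in the running example).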
62:20 - 62:22If XI were continuous, then
-
62:22 - 62:25you might model this as a Gaussian, and you'd end up
-
62:25 - 62:27like we did in Gaussian discriminant analysis. It's
-
62:27 - 62:31just that the way I constructed my features for email, XI is always binary
-
62:31 - 62:32value, and so you end up
-
62:32 - 62:33with a
-
62:33 - 62:36Bernoulli here, okay? All right. I
-
62:36 - 62:40should move on. So
-
62:40 - 62:42it turns out that
-
62:42 - 62:44this idea
-
62:44 - 62:46almost works.
-
62:46 - 62:48Now, here's the problem.
-
62:48 - 62:50So let's say you
-
62:50 - 62:54complete this class and you start to do, maybe do the class project, and you
-
62:54 - 62:57keep working on your class project for a bit, and it
-
62:57 - 63:01becomes really good, and you want to submit your class project to a conference, right? So,
-
63:01 - 63:03you know, around - I don't know,
-
63:03 - 63:08June every year is the submission deadline for the NIPS conference.
-
63:08 - 63:11It's just the name of the conference; it's an acronym.
-
63:11 - 63:12And so maybe
-
63:12 - 63:16you email your project partners or senior friends even, and say, Hey, let's
-
63:16 - 63:20work on a project and submit it to the NIPS conference. And so you're getting these emails
-
63:20 - 63:21with the word NIPS in them,
-
63:21 - 63:24which you've probably never seen before,
-
63:24 - 63:28and so a
-
63:28 - 63:33piece of email comes from your project partner, and so you
-
63:33 - 63:36go, Let's send a paper to the NIPS conference.
-
63:36 - 63:37And then your spam classifier
-
63:37 - 63:39will say
-
63:39 - 63:40P of X -
-
63:40 - 63:45let's say NIPS is the 30,000th word in your dictionary, okay?
-
63:45 - 63:47So P of X30,000 =
-
63:47 - 63:501, given
-
63:50 - 63:52Y =
-
63:52 - 63:531
-
63:53 - 63:54will be equal to 0.
-
63:54 - 63:57That's the maximum likelihood of this, right? Because you've never seen the word NIPS before in
-
63:57 - 64:01your training set, so the maximum likelihood estimate of the parameter says the probability of seeing the word
-
64:01 - 64:02NIPS is zero,
-
64:02 - 64:07and, similarly,
-
64:07 - 64:08you
-
64:08 - 64:12know, in, I guess, non-spam mail, the chance of seeing the word NIPS is also
-
64:12 - 64:15estimated
-
64:15 - 64:21as zero.
-
64:21 - 64:28So
-
64:35 - 64:41when your spam classifier goes to compute P of Y = 1 given X, it will
-
64:41 - 64:45compute this
-
64:45 - 64:47right here, P of X given Y = 1, times P of Y = 1,
-
64:47 - 64:54over P of X - well,
-
64:56 - 65:03all
-
65:05 - 65:10right.
-
65:10 - 65:11And so
-
65:11 - 65:16you look at that term, say, this will be the product from I =
-
65:16 - 65:181 to 50,000,
-
65:18 - 65:22P of XI given Y,
-
65:22 - 65:26and one of those probabilities will be equal to
-
65:26 - 65:28
-
65:28 - 65:32zero because P of X30,000 = 1 given Y = 1 is equal to zero. So you have a
-
65:32 - 65:37zero in this product, and so the numerator is zero,
-
65:37 - 65:40and in the same way, it turns out the denominator will also be zero, and so you end
-
65:40 - 65:43up with -
-
65:43 - 65:46actually all of these terms end up being zero. So you end up with P of Y = 1
-
65:46 - 65:49given X is 0 over 0 + 0, okay, which is
-
65:49 - 65:51undefined. And the
-
65:51 - 65:54problem with this is that it's
-
65:54 - 65:57just statistically a bad idea
-
65:57 - 66:00to say that P of X30,000
-
66:00 - 66:03given Y is
-
66:03 - 66:030,
-
66:03 - 66:06right? Just because you haven't seen the word NIPS in your last
-
66:06 - 66:10two months worth of email, it's also statistically not sound to say that,
-
66:10 - 66:16therefore, the chance of ever seeing this word is zero, right?
-
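To see the failure numerically (a made-up toy example, not from the lecture): if some word's estimated probability is zero under both classes and that word appears in the new email, both class-conditional likelihoods pick up a factor of zero and Bayes rule divides zero by zero:

    import numpy as np

    phi_spam = np.array([0.80, 0.0])  # p(word i = 1 | spam); word 2 never seen in spam
    phi_ham = np.array([0.05, 0.0])   # p(word i = 1 | non-spam); never seen here either
    phi_y = 0.4                       # p(spam)
    x = np.array([1, 1])              # new email contains both words

    p_x_spam = np.prod(np.where(x == 1, phi_spam, 1 - phi_spam))  # 0.8 * 0.0 = 0.0
    p_x_ham = np.prod(np.where(x == 1, phi_ham, 1 - phi_ham))     # 0.05 * 0.0 = 0.0
    posterior = p_x_spam * phi_y / (p_x_spam * phi_y + p_x_ham * (1 - phi_y))
    print(posterior)                  # nan: 0 divided by 0, the posterior is undefined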
66:16 - 66:19And so
-
66:19 - 66:20
-
66:20 - 66:23
-
66:23 - 66:26the point is that just because you haven't seen something
-
66:26 - 66:30before, that may mean that that event is unlikely, but it doesn't mean that
-
66:30 - 66:34it's impossible, so it's just wrong to say that if you've never seen the word NIPS before,
-
66:34 - 66:41then it is impossible to ever see the word NIPS in future emails; the chance of that is just zero.
-
66:41 - 66:48So we're gonna fix this,
-
66:48 - 66:50and
-
66:50 - 66:54to motivate the fix I'll talk about
-
66:54 - 66:59- the example we're gonna use is let's say that you've been following the Stanford basketball
-
66:59 - 67:04team for all of their away games, and been, sort of, tracking their wins and losses
-
67:04 - 67:08to gather statistics, and, maybe - I don't know, form a betting pool about
-
67:08 - 67:11whether they're likely to win or lose the next game, okay?
-
67:11 - 67:13So
-
67:13 - 67:15these are some of the statistics.
-
67:15 - 67:18
-
67:18 - 67:19
-
67:19 - 67:25So on, I guess, the 8th of February
-
67:25 - 67:29last season they played Washington State, and they
-
67:29 - 67:32did not win.
-
67:32 - 67:36On
-
67:36 - 67:38the 11th of February,
-
67:38 - 67:42they played Washington, on the 22nd
-
67:42 - 67:47they played USC,
-
67:47 - 67:50
-
67:50 - 67:55played UCLA,
-
67:55 - 67:58played USC again,
-
67:58 - 68:04
-
68:04 - 68:06and now you want to estimate
-
68:06 - 68:10what's the chance that they'll win or lose against Louisville, right?
-
68:10 - 68:11So
-
68:11 - 68:15So they played, I guess, four or five times last year and they weren't good in their away games, but it
-
68:15 - 68:17seems awfully harsh to say that - so it
-
68:17 - 68:24seems awfully harsh to say there's zero chance that they'll
-
68:37 - 68:40win the next game - the 6th game. So here's the idea behind Laplace smoothing
-
68:40 - 68:43which is
-
68:43 - 68:45that we're estimating
-
68:45 - 68:48the probability of Y being equal to one, right?
-
68:48 - 68:53Normally, the maximum likelihood [inaudible] is the
-
68:53 - 68:56number of ones
-
68:56 - 68:58divided by
-
68:58 - 69:01the number of zeros
-
69:01 - 69:05plus the number of ones, okay? I
-
69:05 - 69:08hope this informal notation makes sense, right? You know,
-
69:08 - 69:11the maximum likelihood estimate for, sort of, a win or loss for Bernoulli random
-
69:11 - 69:12variable
-
69:12 - 69:15is
-
69:15 - 69:17just the number of ones you saw
-
69:17 - 69:21divided by the total number of examples. So it's the number of zeros you saw plus the number of ones you saw. So in
-
69:21 - 69:23the Laplace
-
69:23 - 69:24Smoothing
-
69:24 - 69:26we're going to
-
69:26 - 69:30just take each of these terms, the number of ones and, sort of, add one to that, the number
-
69:30 - 69:32of zeros and add one to that, the
-
69:32 - 69:34number of ones and add one to that,
-
69:34 - 69:37and so in our example,
-
69:37 - 69:38instead of estimating
-
69:38 - 69:43the probability of winning the next game to be 0 over
-
69:43 - 69:455 + 0,
-
69:45 - 69:50we'll add one to all of these counts, and so we say that the chance of
-
69:50 - 69:52their
-
69:52 - 69:54winning the next game is 1/7th,
-
69:54 - 69:55okay? Which is
-
69:55 - 69:59that having seen them lose, you know, five away games in a row, we aren't terribly -
-
69:59 - 70:03we don't think it's terribly likely they'll win the next game, but at
-
70:03 - 70:06least we're not saying it's impossible.
-
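Written out for the basketball example (my arithmetic, matching the counts just described):

    \hat{\phi} \;=\; \frac{\#\{\text{wins}\} + 1}{\#\{\text{losses}\} + \#\{\text{wins}\} + 2}
               \;=\; \frac{0 + 1}{5 + 0 + 2} \;=\; \frac{1}{7}.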
70:06 - 70:10As a historical side note, Laplace actually came up with this method -
-
70:10 - 70:14it's called Laplace smoothing after him -
-
70:14 - 70:18
-
70:18 - 70:22when he was trying to estimate the probability that the sun will rise tomorrow, and his rationale
-
70:22 - 70:25was, you know, for a lot of days now we've seen the sun rise,
-
70:25 - 70:28but that doesn't mean we can be absolutely certain the sun will rise tomorrow.
-
70:28 - 70:31He was using this to estimate the probability that the sun will rise tomorrow. This is, kind of,
-
70:31 - 70:36cool. So,
-
70:36 - 70:39and more generally,
-
70:39 - 70:42
-
70:42 - 70:44
-
70:44 - 70:45
-
70:45 - 70:46
-
70:46 - 70:49if Y
-
70:49 - 70:53takes on
-
70:53 - 70:55K possible values,
-
70:55 - 70:59if you're trying to estimate the parameter of the multinomial, then you estimate P of Y = 1.
-
70:59 - 71:02Let's
-
71:02 - 71:06see.
-
71:06 - 71:11So the maximum likelihood estimate will be the sum from I = 1 to M
-
71:11 - 71:16of the indicator YI = J, divided by M,
-
71:16 - 71:19right?
-
71:19 - 71:22That's the maximum likelihood estimate
-
71:22 - 71:24of a multinomial probability of Y
-
71:24 - 71:27being equal to - oh,
-
71:27 - 71:29excuse me, Y = J. All right.
-
71:29 - 71:33That's the maximum likelihood estimate for the probability of Y = J,
-
71:33 - 71:37and so when you apply Laplace smoothing to that,
-
71:37 - 71:40you add one to the numerator, and
-
71:40 - 71:42add K to the denominator,
-
71:42 - 71:49if Y can take on K possible values, okay?
-
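In symbols, the smoothed estimate for the multinomial case is (standard form; the lecture writes it on the board):

    \phi_j \;=\; p(y = j) \;\approx\; \frac{\sum_{i=1}^{m} 1\{y^{(i)} = j\} + 1}{m + k},

which reduces to the maximum likelihood estimate above if you drop the +1 and the +k.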
72:06 - 72:09So for Naive Bayes,
-
72:09 - 72:14what that gives us is -
-
72:14 - 72:21shoot.
-
72:39 - 72:44Right? So that was the maximum likelihood estimate, and what you
-
72:44 - 72:48end up doing is adding one to the numerator and adding two to the denominator,
-
72:48 - 72:52and this solves the problem of the zero probabilities, and when your friend sends
-
72:52 - 72:57you email about the NIPS conference,
-
72:57 - 73:01your spam filter will still be able to
-
73:01 - 73:08make a meaningful prediction, all right? Okay.
-
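A sketch of the change (again my code, not the lecture's): relative to the unsmoothed fit_naive_bayes_mle sketch earlier, Laplace smoothing just adds 1 to each word count and 2 to each class count, so no estimated probability can be exactly zero or one:

    import numpy as np

    def fit_naive_bayes_laplace(X, y):
        phi_y = np.mean(y == 1)
        # (# spam emails containing word j + 1) / (# spam emails + 2), and likewise for non-spam
        phi_spam = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)
        phi_ham = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)
        return phi_y, phi_spam, phi_ham

With these estimates, the posterior computed by Bayes rule is always well defined, so an email containing a never-before-seen word like NIPS still gets a sensible spam probability.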
73:12 - 73:15Shoot. Any questions about this? Yeah? Student:So that's what doesn't make sense, because, for instance, if you take the
-
73:15 - 73:16games on the right, it's a reasonable assumption that the probability
-
73:16 - 73:23of
-
73:25 - 73:28winning is very close to zero, so, I mean, the prediction should
-
73:28 - 73:30be equal to, well, 0. Instructor (Andrew Ng):Right.
-
73:30 - 73:36I would say that
-
73:36 - 73:38in this case the prediction
-
73:38 - 73:41is 1/7th, right? We don't have a lot of - if you see somebody lose five games
-
73:41 - 73:44in a row, you may not have a lot of faith in them,
-
73:44 - 73:49but as an extreme example, suppose you saw them lose one game,
-
73:49 - 73:52right? It's just not reasonable to say that the chances of winning the next game
-
73:52 - 73:53is zero, but
-
73:53 - 73:59that's what maximum likelihood
-
73:59 - 74:01estimate
-
74:01 - 74:05will say. Student:Yes. Instructor (Andrew Ng):And -
-
74:05 - 74:08Student:In such a case anywhere the learning algorithm [inaudible] or - Instructor (Andrew Ng):So the question is, you
-
74:08 - 74:11know, given just five training examples, what's a reasonable estimate for the chance of
-
74:11 - 74:13winning the next game,
-
74:13 - 74:14and
-
74:14 - 74:181/7th, I think, is actually pretty reasonable. It's less than 1/5th, for instance.
-
74:18 - 74:21We're saying the chances of winning the next game is less
-
74:21 - 74:25than 1/5th. It turns out, under a certain set of assumptions I won't go into - under a certain set of
-
74:25 - 74:28Bayesian assumptions about the prior and posterior,
-
74:28 - 74:31this Laplace smoothing actually gives the optimal estimate,
-
74:31 - 74:33in a certain sense I won't go into
-
74:33 - 74:37of what's the chance of winning the next game, and so under a certain assumption
-
74:37 - 74:38about the
-
74:38 - 74:41Bayesian prior on the parameter.
-
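The lecture doesn't derive this, but for reference, one standard way the claim is made precise: put a uniform Beta(1, 1) prior on the Bernoulli parameter \phi; after observing n_1 wins in n games, the posterior mean is

    \mathbb{E}[\phi \mid \text{data}] \;=\; \frac{n_1 + 1}{n + 2},

which is exactly the Laplace-smoothed estimate.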
74:41 - 74:44So I don't know. It actually seems like a pretty reasonable assumption to me.
-
74:44 - 74:47Although, I should say, it actually
-
74:47 - 74:50turned out -
-
74:50 - 74:54No, I'm just being mean. We actually are a pretty good basketball team, but I chose a
-
74:54 - 74:55losing streak
-
74:55 - 74:59because it's funnier that way.
-
74:59 - 75:06Let's see. Shoot. Does someone want to - are there other questions about
-
75:07 - 75:09
-
75:09 - 75:13this? No, yeah. Okay. So there's more that I want to say about Naive Bayes, but
-
75:13 - 75:15
-
75:15 - 75:16we'll do that in the next lecture. So let's wrap it.