Lecture 6 | Machine Learning (Stanford)
-
0:00 - 0:12This subtitle is provided from lecsub.jimdo.com under Creative Commons License.
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:24 - 0:25
-
0:25 - 0:27What I want to do today is
-
0:27 - 0:31continue our discussion of Naive Bayes, which is the learning algorithm that
-
0:31 - 0:33I
-
0:33 - 0:36started to discuss in the previous lecture
-
0:36 - 0:40and talk about a couple of different event models in Naive Bayes,
-
0:40 - 0:44and then I'll take a brief digression to talk about neural networks, which is
-
0:44 - 0:47something that I actually won't spend a lot of time on,
-
0:47 - 0:50and then I want to start to talk about support vector machines,
-
0:50 - 0:53and support vector machines is
-
0:53 - 0:56the learning algorithms, the supervised learning algorithm that
-
0:56 - 1:01many people consider the most effective, off-the-shelf
-
1:01 - 1:02supervised learning
-
1:02 - 1:05algorithm. That point of view is debatable, but there are many people that hold that
-
1:05 - 1:06point of view,
-
1:06 - 1:11and we'll start discussing that today, and this will actually take us a few
-
1:11 - 1:14lectures to complete.
-
1:14 - 1:17So let's talk about Naive Bayes. To recap
-
1:17 - 1:21from the previous lecture,
-
1:21 - 1:23I started off
-
1:23 - 1:24describing
-
1:24 - 1:28spam classification as the most [inaudible] example for Naive Bayes
-
1:28 - 1:34in which we would create feature vectors like these, right, that correspond
-
1:34 - 1:38to words in a dictionary.
-
1:38 - 1:40And so, you know,
-
1:40 - 1:43based on what words appear in a piece of email, we represented it as a
-
1:43 - 1:45feature vector with
-
1:45 - 1:48ones and zeros in the corresponding places,
-
1:48 - 1:50and Naive
-
1:50 - 1:57Bayes was a generative learning algorithm, and by that
-
1:59 - 2:00I mean it's
-
2:00 - 2:06an algorithm in which we model P(X|Y), and for Naive
-
2:06 - 2:08Bayes, specifically, we
-
2:08 - 2:13modeled it as product from I equals one
-
2:13 - 2:14to
-
2:14 - 2:17N, P(Xi|Y),
-
2:17 - 2:19and also we model P(Y),
-
2:19 - 2:23and then we use Bayes Rule, right, to combine these two together,
-
2:23 - 2:25and so our predictions,
-
2:25 - 2:29when you give it a new piece of email you want to tell if it's spam or not spam,
-
2:29 - 2:33you predict arg max over Y, P(Y|X), which
-
2:33 - 2:36by Bayes Rule is arg max over Y,
-
2:36 - 2:39P(X given Y)
-
2:39 - 2:41times P(Y), okay?
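(In symbols, the model and the prediction rule being recapped here are:)

    p(x \mid y) = \prod_{i=1}^{n} p(x_i \mid y), \qquad
    \arg\max_y \, p(y \mid x) = \arg\max_y \; p(x \mid y)\, p(y)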
-
2:41 - 2:44So this is Naive Bayes,
-
2:44 - 2:46and just to draw attention to two things,
-
2:46 - 2:50one is that in this model, each of our features
-
2:50 - 2:52were zero, one, so
-
2:52 - 2:54indicating whether different words appear,
-
2:54 - 2:58and the length of the feature vector was, sort of,
-
2:58 - 3:00the length N of the feature vector was
-
3:00 - 3:07the number of words in the dictionary.
-
3:08 - 3:15So it might be on this version on the order of 50,000 words, say. What I want
-
3:18 - 3:21to do now is describe two variations on this algorithm.
-
3:21 - 3:24The first one is the simpler one,
-
3:24 - 3:27which it's just a generalization to if
-
3:27 - 3:28xi
-
3:28 - 3:33takes on more values. So, you know, one thing that's commonly done
-
3:33 - 3:35is
-
3:35 - 3:39to apply Naive Bayes to problems where
-
3:39 - 3:43some of these features, xi, takes on K values rather than just two values,
-
3:43 - 3:47and in that case,
-
3:47 - 3:49you actually build,
-
3:49 - 3:52sort of, a very similar model where P(X|Y)
-
3:52 - 3:53is
-
3:53 - 3:57really the same thing, right,
-
3:57 - 4:02where now these are going to be multinomial probabilities
-
4:02 - 4:08rather than Bernoulli's because the XI's can, maybe, take on up to K values.
-
4:08 - 4:12It turns out, the situation where - one situation where this arises very
-
4:12 - 4:17commonly is if you have a feature that's actually continuous valued,
-
4:17 - 4:21and you choose to discretize it, and you choose to take a continuous value feature
-
4:21 - 4:26and discretize it into a finite set of K values,
-
4:26 - 4:28and so it's a perfect example
-
4:28 - 4:32if you remember our very first
-
4:32 - 4:35supervised learning problem of predicting
-
4:35 - 4:37the price of
-
4:37 - 4:41houses. If you have the classification problem on these houses, so based on
-
4:41 - 4:44features of a house, and you want to predict whether or not the house will be sold in the next six
-
4:44 - 4:47months, say. That's a classification problem, and
-
4:47 - 4:49once you use Naive Bayes,
-
4:49 - 4:52then given a continuous value feature
-
4:52 - 4:55like the living area,
-
4:55 - 4:59you know, one pretty common thing to do would be take the continuous value living area
-
4:59 - 5:02and just discretize it
-
5:02 - 5:07into a few
-
5:07 - 5:09
-
5:09 - 5:13- discrete buckets,
-
5:13 - 5:14and so
-
5:14 - 5:17depending on whether the living area of the house is less than 500 square feet
-
5:17 - 5:19or between 1,000 and 1500 square feet,
-
5:19 - 5:22and so on, or whether it's greater than 2,000 square feet,
-
5:22 - 5:26you choose the value of the corresponding feature, XI, to be one,
-
5:26 - 5:29two, three, or four, okay?
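(A minimal Python sketch of this kind of discretization; the bucket thresholds are just the illustrative ones from the board, not a recommendation:)

    def discretize_living_area(square_feet):
        # Map a continuous living-area value to a small number of buckets;
        # the bucket index then serves as the discrete feature value x_i.
        thresholds = [500, 1500, 2000]           # illustrative cut points
        for value, cutoff in enumerate(thresholds, start=1):
            if square_feet < cutoff:
                return value
        return len(thresholds) + 1               # larger than the last cut point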
-
5:29 - 5:31So that was the first
-
5:31 - 5:36variation or generalization of Naive Bayes I
-
5:36 - 5:43wanted to talk about. I should just check; are there questions about this? Okay.
-
5:50 - 5:56Cool. And so it turns out that in practice, it's fairly common to use about ten buckets
-
5:56 - 5:59to discretize a continuous value feature. I drew four here only to
-
5:59 - 6:01save on writing.
-
6:01 - 6:03
-
6:03 - 6:03
-
6:03 - 6:07The second and, sort of, final variation that I want to talk about for Naive Bayes
-
6:07 - 6:10is a variation that's specific
-
6:10 - 6:11to
-
6:11 - 6:16classifying text documents, or, more generally, for classifying sequences. So the text
-
6:16 - 6:20document, like a piece of email, you can think of as a sequence of words
-
6:20 - 6:24and you can apply this, sort of, model I'm about to describe to classifying other sequences as
-
6:24 - 6:25well, but let
-
6:25 - 6:27me just focus on text,
-
6:27 - 6:30and here's
-
6:30 - 6:33the idea. So the
-
6:33 - 6:38Naiven a piece of email, we were
-
6:38 - 6:42representing it using this binary vector value representation,
-
6:42 - 6:46and one of the things that this loses, for instance, is the number of times that different words
-
6:46 - 6:47appear,
-
6:47 - 6:49all right? So, for example, if
-
6:49 - 6:53some word appears a lot of times, and you see the word, you know, buy a lot of times.
-
6:53 - 6:58You see the word Viagra; it seems to be a common email example. You see the word Viagra a ton of
-
6:58 - 6:59times in the email, it
-
6:59 - 7:03is more likely to be spam than if it appears, I guess, only once
-
7:03 - 7:08because even once, I guess, is enough. So let me just
-
7:08 - 7:11try a different, what's called an event model
-
7:11 - 7:15for Naive Bayes that will take into account the number of times a word appears in
-
7:15 - 7:17the email,
-
7:17 - 7:17and to give
-
7:17 - 7:22this previous model a name as well this particular model for
-
7:22 - 7:24text
-
7:24 - 7:26classification
-
7:26 - 7:29
-
7:29 - 7:32
-
7:32 - 7:36is called the Multivariate Bernoulli Event Model. It's not a great name. Don't worry about what
-
7:36 - 7:40the name means. It
-
7:40 - 7:43refers to the fact that there are multiple Bernoulli random variables, but it's really - don't worry about
-
7:43 - 7:45what the name means.
-
7:45 - 7:46In contrast,
-
7:46 - 7:51what I want to do now is describe a different representation for email in terms of the
-
7:51 - 7:52feature
-
7:52 - 7:59vector, and this is called the Multinomial Event Model, and,
-
8:01 - 8:03again, there is a rationale behind the name, but it's slightly cryptic, so don't worry
-
8:03 - 8:07about why it's called the Multinomial Event Model; it's just called that.
-
8:07 - 8:09And here's what we're gonna do,
-
8:09 - 8:16given a piece of email, I'm going to represent my email as a feature vector,
-
8:16 - 8:19and so my Ith training
-
8:19 - 8:23example, XI, will be a feature vector,
-
8:23 - 8:29XI subscript one, XI subscript two, up to
-
8:29 - 8:31XI subscript
-
8:31 - 8:33NI
-
8:33 - 8:35where
-
8:35 - 8:39NI is equal to the
-
8:39 - 8:42number of words
-
8:42 - 8:46in this email, right? So if one of my training examples is an email with
-
8:46 - 8:47300 words in it,
-
8:47 - 8:50then I represent this email via
-
8:50 - 8:52a feature vector
-
8:52 - 8:57with 300 elements,
-
8:57 - 9:00and each
-
9:00 - 9:04of these elements of the feature vector - let's see. Let me just write this as X subscript
-
9:04 - 9:06J.
-
9:06 - 9:13These will be an index into my dictionary, okay?
-
9:13 - 9:16And so if my dictionary has 50,000 words, then
-
9:16 - 9:19each position in my feature vector
-
9:19 - 9:23will be a variable that takes on one of 50,000 possible
-
9:23 - 9:24values
-
9:24 - 9:26
-
9:26 - 9:32corresponding to what word appeared in the J position of my email, okay?
-
9:32 - 9:34So, in other words, I'm gonna take all the words in my email
-
9:34 - 9:36and you have a feature
-
9:36 - 9:39vector that just says which word
-
9:39 - 9:40in my dictionary
-
9:40 - 9:44was each word in the email, okay?
-
9:44 - 9:48So a different definition for NI now, NI now varies and is different for every
-
9:48 - 9:49training example,
-
9:49 - 9:53and this XJ is now indexed into the dictionary.
-
9:53 - 9:55You know, the components of the feature vector
-
9:55 - 9:59are no longer binary random variables; they're these indices in the dictionary that take on a much
-
9:59 - 10:03larger set of values.
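(A minimal sketch of this representation; the word-to-index dictionary vocab is a hypothetical input here:)

    def email_to_word_indices(email_text, vocab):
        # Represent an email as the sequence of dictionary indices of its words.
        # The length of the result is n_i, which differs from email to email,
        # and each element takes one of len(vocab) possible values.
        return [vocab[word] for word in email_text.lower().split() if word in vocab]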
-
10:03 - 10:06And so
-
10:06 - 10:09our generative model for this
-
10:09 - 10:10
-
10:10 - 10:17
-
10:23 - 10:24will be that the joint distribution
-
10:24 - 10:28over X and Y will be that, where again N is now the length of the email, all right?
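(Written out, the joint distribution being described is:)

    p(x, y) \;=\; p(y)\,\prod_{j=1}^{n} p(x_j \mid y)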
-
10:28 - 10:31So the way to think about this formula is you
-
10:31 - 10:32imagine
-
10:32 - 10:33that there was some
-
10:33 - 10:38probably distribution over emails. There's some random distribution that generates the emails,
-
10:38 - 10:41and that process proceeds as follows:
-
10:41 - 10:45First, Y is chosen, first the class label - whether someone is gonna send you spam email or not
-
10:45 - 10:49spam email is chosen for
-
10:49 - 10:54us. So first Y, the random variable Y, the class label of spam or not spam is generated,
-
10:54 - 10:59and then having decided whether they sent you spam or not spam,
-
10:59 - 11:03someone iterates over all 300 positions of the email,
-
11:03 - 11:07or the 300 words that are going to compose this email,
-
11:07 - 11:10and would generate words from some distribution that
-
11:10 - 11:14depends on whether they chose to send you spam or not spam. So if
-
11:14 - 11:17they sent you spam, they'll send you words - they'll tend to generate words like, you know, buy, and Viagra, and
-
11:17 - 11:20whatever at discounts, sale, whatever.
-
11:20 - 11:23And if somebody chose to send you not spam, then they'll send you, sort of, the more normal
-
11:23 - 11:28
-
11:28 - 11:31words you get in an email, okay? So, sort of, just be careful, right? XI here has a very different definition from the
-
11:31 - 11:35previous event model, and N has a very different definition from the previous event
-
11:35 - 11:38model.
-
11:38 - 11:45And so the parameters of this model are - let's see. Phi
-
11:45 - 11:48subscript K given Y equals one, which
-
11:48 - 11:51is the
-
11:51 - 11:55probability that,
-
11:55 - 11:57you know, conditioned on
-
11:57 - 11:59someone deciding to send you spam,
-
11:59 - 12:03what's the probability that the next word they choose to email you in
-
12:03 - 12:06the spam email is going to be word K, and
-
12:06 - 12:10similarly, you know, sort of, same thing - well, I'll just write it out, I guess - and Phi
-
12:10 - 12:17Y
-
12:21 - 12:26and just same as before, okay?
-
12:26 - 12:28So these are the parameters of the model,
-
12:28 - 12:29
-
12:29 - 12:31and
-
12:31 - 12:35given a training set, you can work out the maximum likelihood estimates of the
-
12:35 - 12:42
-
12:42 - 12:48parameters. So the maximum likelihood estimate of the parameters will be
-
12:48 - 12:49equal to -
-
12:49 - 12:53and now I'm gonna write one of these, you know, big indicator function things again. It'll be
-
12:53 - 12:58a sum over your training set of the
-
12:58 - 13:00indicator
-
13:00 - 13:07whether that was spam times the
-
13:07 - 13:10sum over all the words in that email where N subscript
-
13:10 - 13:15I is the number of words in email I in your training set,
-
13:15 - 13:22times indicator XIJ equals K, okay?
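(Reconstructing that estimate in LaTeX - this is the standard form of the quantity being described:)

    \phi_{k \mid y=1} \;=\;
      \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{ x^{(i)}_j = k \,\wedge\, y^{(i)} = 1 \}}
           {\sum_{i=1}^{m} 1\{ y^{(i)} = 1 \}\, n_i}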
-
13:39 - 13:40So
-
13:40 - 13:45the numerator says sum over all your emails and take into account all the emails that had
-
13:45 - 13:48class label one - take into account only the emails that were spam - because if Y equals zero, then
-
13:48 - 13:52this is zero, and this would go away,
-
13:52 - 13:54and then times sum over
-
13:54 - 13:57all the words in your spam email, and
-
13:57 - 14:01it counts up the number of times you observed the word K in your spam emails.
-
14:01 - 14:05So, in other words, the numerator is look at all the spam emails in your training set
-
14:05 - 14:12and count up the total number of times the word K appeared in this email. The
-
14:14 - 14:17denominator then is sum over I into our training set of
-
14:17 - 14:20whenever one of your examples is spam,
-
14:20 - 14:21you know, sum up the length
-
14:21 - 14:23of that spam email,
-
14:23 - 14:26and so the denominator is the total length
-
14:26 - 14:32of all of the spam emails in your training set.
-
14:32 - 14:37And so the ratio is just out of all your spam emails, what is the fraction of words in
-
14:37 - 14:39all your spam emails that were word K,
-
14:39 - 14:42and that's your estimate for the probability
-
14:42 - 14:44of the next piece of spam mail
-
14:44 - 14:51generating the word K in any given position, okay? At the
-
14:53 - 14:56end of the previous lecture, I talked about Laplace smoothing,
-
14:56 - 14:59and so when you do that as
-
14:59 - 15:03well, you add one to the numerator and K to the denominator, and this is
-
15:03 - 15:07the Laplace smoothed estimate
-
15:07 - 15:09of this parameter,
-
15:09 - 15:11okay? And similarly, you do the same thing for -
-
15:11 - 15:14and you can work out the estimates for the
-
15:14 - 15:21other parameters yourself, okay? So it's very similar. Yeah? Student:I'm sorry. On the right on the top, I was just
-
15:21 - 15:23wondering
-
15:23 - 15:24
-
15:24 - 15:26what the X of I is, and what the N of -
-
15:26 - 15:28Instructor (Andrew Ng):Right. So
-
15:28 - 15:32in this second event model, the definition for XI and the definition for N
-
15:32 - 15:35are different, right? So here -
-
15:35 - 15:42well, this is for one example XY. So here, N is the number of words
-
15:44 - 15:46in a given email,
-
15:46 - 15:51right? And if it's the Ith email, subscripting, then this is N subscript I, and so
-
15:51 - 15:56N will be different for different training examples, and here
-
15:56 - 16:01XI will be, you know,
-
16:01 - 16:04these values from 1 to 50,000,
-
16:04 - 16:06and XI is
-
16:06 - 16:07essentially
-
16:07 - 16:11the identity of the Ith
-
16:11 - 16:11word
-
16:11 - 16:15in a given piece of email,
-
16:15 - 16:17okay? So that's why this is
-
16:17 - 16:20a product over all the different words of your email
-
16:20 - 16:22of their probability - the probability of
-
16:22 - 16:24the Ith word in your email,
-
16:24 - 16:31conditioned on Y. Yeah?
-
16:32 - 16:36Student:[Off mic]. Instructor (Andrew Ng):Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation, right,
-
16:36 - 16:39and I shouldn't have used K here. Let
-
16:39 - 16:44me use a different alphabet and see if that makes sense; does
-
16:44 - 16:46that make sense? Oh, you know what, I'm sorry.
-
16:46 - 16:49You're absolutely right.
-
16:49 - 16:51Thank you. All right. So
-
16:51 - 16:55in Laplace smoothing, that shouldn't be
-
16:55 - 16:58K. This should be, you know, 50,000,
-
16:58 - 17:02if you have 50,000 words in your dictionary. Yeah, thanks. Great.
-
17:02 - 17:05I stole notation from the previous lecture and
-
17:05 - 17:07didn't translate it properly.
-
17:07 - 17:08So Laplace smoothing, right,
-
17:08 - 17:15this is the number of possible values that the random variable XI can
-
17:16 - 17:17take on. Cool. Raise your hand if
-
17:17 - 17:22this makes sense? Okay. Some of you, are there
-
17:22 - 17:29more questions about this?
-
17:35 - 17:42
-
17:42 - 17:44Yeah. Student:On Laplace smoothing, the denominator and the plus A is the number of values that Y could take? Instructor (Andrew Ng):Yeah,
-
17:44 - 17:45let's see.
-
17:45 - 17:50So Laplace smoothing is
-
17:50 - 17:51a
-
17:51 - 17:56method to give you, sort of, hopefully, better estimates of their probability
-
17:56 - 17:58distribution over a multinomial,
-
17:58 - 18:00and so
-
18:00 - 18:03was I using X or Y in the previous lecture?
-
18:03 - 18:05So
-
18:05 - 18:08in trying to estimate the probability over a multinomial - I think X
-
18:08 - 18:11and Y are different. I think -
-
18:11 - 18:14was it X or Y? I think it was X, actually. Well - oh,
-
18:14 - 18:17I see, right, right. I think I was using a different definition
-
18:17 - 18:19for the random variable Y because
-
18:19 - 18:24suppose you have a multinomial random variable, X
-
18:24 - 18:28which takes on - let's
-
18:28 - 18:30use a different alphabet. Suppose you have
-
18:30 - 18:35a multinomial random variable X which takes on L different values, then
-
18:35 - 18:37the maximum likelihood estimate
-
18:37 - 18:41for the probability of X,
-
18:41 - 18:42P of X equals K,
-
18:42 - 18:44will be equal to, right,
-
18:44 - 18:47the number of observations.
-
18:47 - 18:52
-
18:52 - 18:55The maximum likelihood estimate for the
-
18:55 - 19:02probability of X being equal to K will be the number of observations of X
-
19:02 - 19:05equals K divided
-
19:05 - 19:11by the total number of observations
-
19:11 - 19:11of X,
-
19:11 - 19:14okay? So that's the maximum likelihood estimate.
-
19:14 - 19:18And to add Laplace smoothing to this, you, sort of, add one to the numerator,
-
19:18 - 19:20and you add L to the
-
19:20 - 19:21
-
19:21 - 19:23denominator
-
19:23 - 19:26where L was the number of possible values that X can take on.
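(In symbols, for m observations x^{(1)}, ..., x^{(m)} of a multinomial random variable X taking L possible values:)

    \hat{p}(X = k) \;=\; \frac{\sum_{i=1}^{m} 1\{ x^{(i)} = k \} + 1}{m + L}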
-
19:26 - 19:29So, in this case,
-
19:29 - 19:32this is a probability that X equals K,
-
19:32 - 19:36and X can take on 50,000 values if 50,000 is the length of your dictionary; it may be
-
19:36 - 19:40something else, but that's why I add 50,000 to the denominator. Are there other questions? Yeah. Student:Is there a specific
-
19:40 - 19:47definition for a
-
19:48 - 19:49maximum likelihood estimation of a parameter? We've talked about it a couple times, and all the examples
-
19:49 - 19:52make sense, but I don't know what the, like, general formula for it is.
-
19:52 - 19:54Instructor (Andrew Ng):I see. Yeah, right. So the definition of maximum likelihood - so
-
19:54 - 19:56the question is
-
19:56 - 20:02what's the definition for maximum likelihood estimate? So actually
-
20:02 - 20:05in today's lecture and the previous lecture when I talk about Gaussian Discriminant Analysis I
-
20:05 - 20:07was, sort of,
-
20:07 - 20:09throwing out the maximum likelihood estimates on the board
-
20:09 - 20:11without proving them.
-
20:11 - 20:16The way to actually work this out is
-
20:16 - 20:20to actually
-
20:20 - 20:27write down the likelihood.
-
20:29 - 20:33So the way to figure out all of these maximum likelihood estimates is to write
-
20:33 - 20:34down
-
20:34 - 20:35the likelihood of
-
20:35 - 20:41the parameters, phi K given Y equals one, phi K given Y equals
-
20:41 - 20:44zero, phi
-
20:44 - 20:44Y,
-
20:44 - 20:47right? And so given a training set,
-
20:47 - 20:50the likelihood, I guess, I should be writing log likelihood
-
20:50 - 20:55will be the log of the product of I equals one to
-
20:55 - 20:56M, P of XI, YI,
-
20:56 - 20:58you
-
20:58 - 21:05know, parameterized by these things, okay?
-
21:07 - 21:09Where P of XI, YI, right, is given by the product from J equals one to NI of P of XIJ given YI. They are
-
21:09 - 21:12parameterized
-
21:12 - 21:14by
-
21:14 - 21:19
-
21:19 - 21:22
-
21:22 - 21:27- well,
-
21:27 - 21:28
-
21:28 - 21:32I'll just drop the parameters to write this
-
21:32 - 21:39
-
21:42 - 21:49more simply - oh, I just put it in - times P of YI, okay?
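(Putting that together, the log likelihood just described is:)

    \ell(\phi_{k \mid y=1}, \phi_{k \mid y=0}, \phi_y)
      \;=\; \log \prod_{i=1}^{m}
            \Big( \prod_{j=1}^{n_i} p\big(x^{(i)}_j \mid y^{(i)}\big) \Big)\, p\big(y^{(i)}\big)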
-
21:49 - 21:51So this is my log likelihood,
-
21:51 - 21:55and so the way you get the maximum likelihood estimate of the parameters
-
21:55 - 21:56is you -
-
21:56 - 22:00so if given a fixed training set, given a set of fixed XI, YI's,
-
22:00 - 22:02you maximize this in terms of
-
22:02 - 22:06these parameters, and then you get the maximum likelihood estimates that I've been writing
-
22:06 - 22:08out.
-
22:08 - 22:11So in a previous section of today's lecture I wrote out some maximum likelihood estimates
-
22:11 - 22:14for the Gaussian Discriminant Analysis model, and
-
22:14 - 22:16for Naive Bayes,
-
22:16 - 22:18and then this - I didn't prove them,
-
22:18 - 22:22but you get to, sort of, play with that yourself in the homework problem as well
-
22:22 - 22:26and for one of these models, and you'll be able to verify that when
-
22:26 - 22:30you maximize the likelihood and maximize the log likelihood
-
22:30 - 22:34that hopefully you do get the same formulas as what I was drawing up on the board, but the way
-
22:34 - 22:35to find how
-
22:35 - 22:42these are derived is by maximizing this, okay? Cool. All right.
-
22:42 - 22:45So
-
22:45 - 22:51that wraps up what I wanted to say about - oh,
-
22:51 - 22:56so that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that
-
22:56 - 23:01for text classification, the Naive Bayes event
-
23:01 - 23:04model, the last
-
23:04 - 23:07Naive Bayes event model,
-
23:07 - 23:10it turns out that almost always does better than
-
23:10 - 23:11the
-
23:11 - 23:15first Naive Bayes model I talked about when you're applying it to the
-
23:15 - 23:19specific case - to the specific case of text classification,
-
23:19 - 23:21and
-
23:21 - 23:25one of the reasons hypothesized for this is that this second model,
-
23:25 - 23:27the multinomial event model,
-
23:27 - 23:30takes into account
-
23:30 - 23:34the number of times a word appears in a document, whereas the former model
-
23:34 - 23:35doesn't.
-
23:35 - 23:37I should say that in truth
-
23:37 - 23:41it actually turns out not to be completely understood why the latter
-
23:41 - 23:44model does better than the former one for text classification, and, sort of,
-
23:44 - 23:46researchers are still
-
23:46 - 23:47debating about it a little bit, but if you
-
23:47 - 23:51ever have a text classification problem, you know, the Naive Bayes classifier is probably
-
23:51 - 23:52not,
-
23:52 - 23:54by far, the best learning algorithm out there,
-
23:54 - 23:56but it is
-
23:56 - 23:58relatively straightforward to implement, and it's a very good
-
23:58 - 24:03algorithm to try if you have a text classification problem, okay? Still a question? Yeah. Student:So the second model is still position invariant, right? It
-
24:03 - 24:04doesn't
-
24:04 - 24:07actually
-
24:07 - 24:08
-
24:08 - 24:14care where the words are. Instructor (Andrew Ng):Yes, all right. Student:And, I
-
24:14 - 24:15mean, a variant, like a model that did have position in it, does
-
24:15 - 24:17that usually
-
24:17 - 24:18do
-
24:18 - 24:20better if you have enough data?
-
24:20 - 24:23Instructor (Andrew Ng):Yeah, so the question is, sort of, the second model, right? The second model, the multinomial event model actually doesn't care about the ordering of the words.
-
24:23 - 24:27You can shuffle all the words in the email, and it does exactly the same thing.
-
24:27 - 24:28So
-
24:28 - 24:31in natural language processing, there's actually another name; it's called a unigram
-
24:31 - 24:33model in natural language processing,
-
24:33 - 24:36and there's some other models like, sort of,
-
24:36 - 24:39say, higher order Markov models that take into account some of the ordering
-
24:39 - 24:44of the words. It turns out that for text classification, the models
-
24:44 - 24:45like the
-
24:45 - 24:49bigram models or trigram models,
-
24:49 - 24:53I believe they do only very slightly better, if at all,
-
24:53 - 25:00but that's when you're applying them to text classification, okay? All right.
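(A rough Python sketch of the multinomial event model just summarized, with Laplace smoothing as described; it assumes emails have already been converted to lists of dictionary indices, for example with the earlier email_to_word_indices sketch, and the variable names are my own:)

    import math
    from collections import defaultdict

    def train_multinomial_nb(emails, labels, vocab_size):
        # emails: list of emails, each a list of word indices in {0, ..., vocab_size - 1}
        # labels: list of 0/1 labels (1 = spam)
        word_counts = {0: defaultdict(int), 1: defaultdict(int)}
        total_words = {0: 0, 1: 0}
        num_spam = 0
        for words, y in zip(emails, labels):
            num_spam += y
            total_words[y] += len(words)
            for k in words:
                word_counts[y][k] += 1
        phi_y = num_spam / len(emails)
        # phi[y][k]: fraction of words in class-y emails that are word k, Laplace smoothed.
        phi = {y: [(word_counts[y][k] + 1) / (total_words[y] + vocab_size)
                   for k in range(vocab_size)]
               for y in (0, 1)}
        return phi, phi_y

    def predict_multinomial_nb(words, phi, phi_y):
        # Choose y maximizing log p(y) + sum_j log p(x_j | y),
        # which is the arg max rule from the start of the lecture.
        scores = {1: math.log(phi_y) + sum(math.log(phi[1][k]) for k in words),
                  0: math.log(1 - phi_y) + sum(math.log(phi[0][k]) for k in words)}
        return max(scores, key=scores.get)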
-
25:04 - 25:07So the next thing I want to talk about is to start, again, a discussion of
-
25:07 - 25:12non-linear classifiers. So
-
25:12 - 25:19it turns out
-
25:22 - 25:28- well, and so the very first
-
25:28 - 25:31classification algorithm we talked about was logistic regression, which
-
25:31 - 25:33had the following form for its
-
25:33 - 25:36hypothesis,
-
25:36 - 25:38and
-
25:38 - 25:39you can think of this as
-
25:39 - 25:41predicting one when
-
25:41 - 25:44this estimator probability is greater or equal to 0.5 and
-
25:44 - 25:46predicting zero,
-
25:46 - 25:47right,
-
25:47 - 25:50when this is less than 0.5,
-
25:50 - 25:53and given a training set, right?
-
25:53 - 26:00Logistic regression
-
26:03 - 26:06will maybe do gradient descent or something or
-
26:06 - 26:07use Newton's method
-
26:07 - 26:09to find a straight line that
-
26:09 - 26:12reasonably separates the positive and negative classes.
-
26:12 - 26:16But sometimes a data set just can't be separated by a straight line, so is there
-
26:16 - 26:17an algorithm
-
26:17 - 26:18that will let you
-
26:18 - 26:25start to learn these sorts of non-linear decision boundaries?
-
26:26 - 26:29And so how do you go about getting a non-linear classifier? And, by the
-
26:29 - 26:34way, one cool result is that remember when I said - when we talked
-
26:34 - 26:38about generative learning algorithms, I said that
-
26:38 - 26:41if you assume X given Y equals one is
-
26:41 - 26:44exponential family,
-
26:44 - 26:45
-
26:45 - 26:47right, with natural parameter
-
26:47 - 26:51eta one, and if you build a generative learning algorithm using this, right, where X given Y equals
-
26:51 - 26:56zero, if this is
-
26:56 - 27:00exponential family
-
27:00 - 27:02with natural parameter eta zero,
-
27:02 - 27:06right. I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true,
-
27:06 - 27:06then
-
27:06 - 27:09you end up with a logistic posterior. It
-
27:09 - 27:11actually turns out that a Naive Bayes model
-
27:11 - 27:13actually falls into this
-
27:13 - 27:17as well. So the Naive Bayes model actually falls into this exponential family as well, and,
-
27:17 - 27:19therefore,
-
27:19 - 27:23under the Naive Bayes model, you're actually
-
27:23 - 27:27using this other linear classifier as well, okay?
-
27:27 - 27:29So the question is
-
27:29 - 27:32how can you start to get non-linear classifiers?
-
27:32 - 27:36And I'm going to talk about one method today
-
27:36 - 27:42which is - and
-
27:42 - 27:45we started to talk about it very briefly which is
-
27:45 - 27:47taking a
-
27:47 - 27:49simpler algorithm like
-
27:49 - 27:51logistic regression
-
27:51 - 27:58and using it to build up to more complex non-linear classifiers, okay? So
-
27:58 - 28:00to motivate this discussion,
-
28:00 - 28:05I'm going to use the little picture - let's see. So suppose you have features X1,
-
28:05 - 28:07X2, and X3,
-
28:07 - 28:10and so by convention, I'm gonna follow
-
28:10 - 28:13our earlier convention that X0 is set to one,
-
28:13 - 28:17and so I'm gonna use a little diagram like this
-
28:17 - 28:18to denote
-
28:18 - 28:20our
-
28:20 - 28:24logistic regression unit, okay? So
-
28:24 - 28:29think of a little picture like that, you know, this little circle as denoting a
-
28:29 - 28:29computation node
-
28:29 - 28:33that takes this input, you know, several features and then it outputs another
-
28:33 - 28:36number that's H subscript theta of X, given by
-
28:36 - 28:38a sigmoid function,
-
28:38 - 28:40and so this little computational unit -
-
28:40 - 28:43well, will have parameters theta.
-
28:43 - 28:48Now, in order to get non-linear decision boundaries, all we need to do -
-
28:48 - 28:51well, at least one thing to do is just
-
28:51 - 28:53come up with
-
28:53 - 28:54
-
28:54 - 28:57a way to represent hypotheses
-
28:57 - 29:01that can output non-linear decision boundaries, right,
-
29:01 - 29:03and so
-
29:03 - 29:07this is -
-
29:07 - 29:10when you put a bunch of those little pictures
-
29:10 - 29:11that I drew
-
29:11 - 29:13on the previous board,
-
29:13 - 29:14you can then get
-
29:14 - 29:19what's called
-
29:19 - 29:21a Neural Network in which you
-
29:21 - 29:24think of having my features here
-
29:24 - 29:31and then I would feed them to
-
29:31 - 29:35say a few of these little
-
29:35 - 29:39sigmoidal units,
-
29:39 - 29:44and these together will feed into yet another sigmoidal unit, say,
-
29:44 - 29:46which will output
-
29:46 - 29:52my final output H subscript theta of X, okay? And
-
29:52 - 29:54just to give these things names, let me call
-
29:54 - 29:58the values output by these three intermediate sigmoidal units; let me call them A1,
-
29:58 - 30:01A2, A3.
-
30:01 - 30:03And let me just be
-
30:03 - 30:06completely concrete about what this formula represents, right? So
-
30:06 - 30:07
-
30:07 - 30:08each of these
-
30:08 - 30:12units in the middle will have their own associated set of parameters,
-
30:12 - 30:16and so the value A1 will be computed as
-
30:16 - 30:18G
-
30:18 - 30:21of X transpose,
-
30:21 - 30:27and then some set of parameters, which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta
-
30:27 - 30:29two,
-
30:29 - 30:31and A3 will be
-
30:31 - 30:34G of X transpose,
-
30:34 - 30:36theta three,
-
30:36 - 30:43where G is the sigmoid function, all right? So G of Z,
-
30:44 - 30:50and then, finally, our hypothesis will output
-
30:50 - 30:52G of
-
30:52 - 30:54A
-
30:54 - 30:57transpose
-
30:57 - 30:59theta four, right? Where, you
-
30:59 - 31:01know, this A vector
-
31:01 - 31:04is a vector of A1,
-
31:04 - 31:06A2, A3.
-
31:06 - 31:08We can append another one to it, as the first element,
-
31:08 - 31:11if you want, okay? Let
-
31:11 - 31:14me
-
31:14 - 31:16just draw up here this - I'm
-
31:16 - 31:18sorry about the cluttered board.
-
31:18 - 31:21And so H subscript theta of X,
-
31:21 - 31:24this is a function
-
31:24 - 31:29of all the parameters theta one
-
31:29 - 31:32through theta four,
-
31:32 - 31:37and so one way to learn parameters for this model
-
31:37 - 31:38is to
-
31:38 - 31:40write down the cost function,
-
31:40 - 31:41say, J of theta
-
31:41 - 31:46equals one-half sum from I equals one to
-
31:46 - 31:50M, YI minus H subscript theta of
-
31:50 - 31:52XI
-
31:52 - 31:54squared, say. Okay,
-
31:54 - 31:56so that's our familiar
-
31:56 - 31:58quadratic cost function,
-
31:58 - 32:00and so
-
32:00 - 32:04one way to learn the parameters of an algorithm like this is to just use gradient
-
32:04 - 32:05descent
-
32:05 - 32:09to minimize J of theta as a function of theta, okay? So you can apply
-
32:09 - 32:11gradient descent
-
32:11 - 32:15to minimize this squared error, which
-
32:15 - 32:18stated differently means you use gradient descent to make the predictions of your neural
-
32:18 - 32:20network as close as possible
-
32:20 - 32:27to what you observed as the labels in your training
-
32:31 - 32:35set, okay? So it turns out gradient descent on this neural network has a specific name; the
-
32:35 - 32:35algorithm
-
32:35 - 32:39that implements gradient descent is called back propagation, and so if you ever hear that, all that
-
32:39 - 32:41means is - it just means gradient descent on
-
32:41 - 32:45a cost function like this or a variation of this on the neural network
-
32:45 - 32:48that looks like that,
-
32:48 - 32:50and -
-
32:50 - 32:54well, this algorithm actually has some advantages and disadvantages, but let me
-
32:54 - 32:58actually show you. So, let's see.
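(Before the video, a minimal numpy sketch of the little network just drawn - three sigmoid hidden units feeding one sigmoid output, with the quadratic cost from the board; this is an illustration of the equations above, not Professor Ng's code, and it leaves out the optional intercept on the hidden layer:)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, theta1, theta2, theta3, theta4):
        # x includes the intercept term x_0 = 1, so x has shape (n + 1,).
        a = np.array([sigmoid(x @ theta1),    # a_1 = g(x^T theta_1)
                      sigmoid(x @ theta2),    # a_2 = g(x^T theta_2)
                      sigmoid(x @ theta3)])   # a_3 = g(x^T theta_3)
        return sigmoid(a @ theta4)            # h_theta(x) = g(a^T theta_4)

    def cost(X, y, thetas):
        # J(theta) = 1/2 * sum_i (y^(i) - h_theta(x^(i)))^2; back propagation is
        # gradient descent on this function with respect to all of the thetas.
        predictions = np.array([forward(x, *thetas) for x in X])
        return 0.5 * np.sum((y - predictions) ** 2)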
-
32:58 - 33:01One of the interesting things about the neural network is that
-
33:01 - 33:05you can look at what these intermediate nodes are computing, right? So
-
33:05 - 33:08this neural network has what's called
-
33:08 - 33:12a hidden layer before you then have the output layer,
-
33:12 - 33:14and, more generally, you can actually have inputs feed
-
33:14 - 33:16into these
-
33:16 - 33:20computation units, feed into more layers of computation units, to even more layers, to
-
33:20 - 33:23more layers, and then finally you have an output layer at the end
-
33:23 - 33:27And one cool thing you can do is look at all of these intermediate units, look at
-
33:27 - 33:28these
-
33:28 - 33:31units and what's called a hidden layer of the neural network. Don't
-
33:31 - 33:33worry about why it's called that. Look
-
33:33 - 33:34at computations of the hidden unit
-
33:34 - 33:40and ask what is the hidden unit computing the neural network? So
-
33:40 - 33:43to, maybe, get a better sense of what neural networks might be doing, let me show you a
-
33:43 - 33:45video - I'm gonna switch to the laptop - this is made
-
33:45 - 33:46by a
-
33:46 - 33:49friend, Yann LeCun
-
33:49 - 33:52who's currently a professor at New York University.
-
33:52 - 33:56Can I show a video on the laptop?
-
33:56 - 33:59So let me show you a video
-
33:59 - 34:00from Yann
-
34:00 - 34:06LeCun on a neural network that he developed for handwritten digit recognition.
-
34:06 - 34:10There was one other thing he did in this neural network that I'm not gonna talk about called
-
34:10 - 34:12a Convolutional Neural Network
-
34:12 - 34:16that - well,
-
34:16 - 34:19his system is called LeNet,
-
34:19 - 34:23and let's see. Would you put
-
34:23 - 34:23on the
-
34:23 - 34:30laptop display? Hum,
-
34:38 - 34:39actually maybe if -
-
34:39 - 34:42or you can just put on the screen on the side; that would work too
-
34:42 - 34:49if the big screen isn't working.
-
35:12 - 35:14Let's see. I'm
-
35:14 - 35:19just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come
-
35:19 - 35:24on? Well, let me say a few more things about neural network.
-
35:24 - 35:26So it turns out that
-
35:26 - 35:29when you write a quadratic cost function like I
-
35:29 - 35:32wrote down on the chalkboard just now,
-
35:32 - 35:35it turns out that unlike logistic regression,
-
35:35 - 35:39this will almost always correspond to a non-convex optimization problem,
-
35:39 - 35:43and so whereas for logistic regression if you run gradient descent
-
35:43 - 35:44or Newton's method or whatever,
-
35:44 - 35:46you converge to the global optimum.
-
35:46 - 35:50This is not true for neural networks. In general, there are lots of local optima
-
35:50 - 35:55and it's, sort of, a much harder optimization problem.
-
35:55 - 35:56So
-
35:56 - 36:00neural networks, if you're, sort of, familiar with them, and you're good at making design choices
-
36:00 - 36:02like what learning rate to use, and
-
36:02 - 36:04how many hidden units to use, and so on, you can,
-
36:04 - 36:08sort of, get them to be fairly effective,
-
36:08 - 36:09and
-
36:09 - 36:13there's, sort of, often ongoing debates about, you know, is this learning algorithm
-
36:13 - 36:14better, or is that learning
-
36:14 - 36:18algorithm better? The vast majority of machine learning researchers today seem to perceive
-
36:18 - 36:20support vector machines, which is what I'll talk about later,
-
36:20 - 36:24to be a much more effective off-the-shelf learning algorithm than neural networks.
-
36:24 - 36:27This point of view is contested
-
36:27 - 36:32a bit, but so neural networks are not something that I personally use a lot right
-
36:32 - 36:36now because there's a hard optimization problem and it doesn't always converge,
-
36:36 - 36:40and it actually, sort of works. It, sort of, works reasonably
-
36:40 - 36:42well. It's just
-
36:42 - 36:44because this is fairly complicated,
-
36:44 - 36:45there's not an algorithm that
-
36:45 - 36:52I use commonly or that my friends use all the
-
36:52 - 36:56time. Oh, cool. So but let me just go and show you an example of neural network, which was for many years,
-
36:56 - 36:57
-
36:57 - 37:00you know, the most effective learning algorithm before support vector
-
37:00 - 37:02machines were invented.
-
37:02 - 37:06So here's Yann LeCun's video, and -
-
37:06 - 37:09well, there's actually audio on this too, the soundboard. So I'll just tell you
-
37:09 - 37:11what's happening.
-
37:11 - 37:12
-
37:12 - 37:15What you're seeing is a trained neural network,
-
37:15 - 37:19and this display where my mouse pointer is pointing, right, this big three
-
37:19 - 37:19there
-
37:19 - 37:21is
-
37:21 - 37:25the input to the neural network. So you're showing the neural network this image, and it's
-
37:25 - 37:27trying to recognize what is this.
-
37:27 - 37:31The final answer output by the neural network is this number up here, right
-
37:31 - 37:33below where it says LeNet-5,
-
37:33 - 37:34and the
-
37:34 - 37:38neural network correctly recognizes this image as a three,
-
37:38 - 37:40and if you look to the left of this image,
-
37:40 - 37:42what's interesting about this is
-
37:42 - 37:47the display on the left portion of this is actually showing the
-
37:47 - 37:50intermediate computations of the neural network. In other words, it's showing you
-
37:50 - 37:52what are the hidden layers of the
-
37:52 - 37:54neural network computing.
-
37:54 - 37:58And so, for example, if you look at this one, the third image down from the top,
-
37:58 - 38:02this seems to be computing, you know, certain edges of the digits,
-
38:02 - 38:06right? It's just computing edges on the
-
38:06 - 38:07right-hand side or the bottom or something
-
38:07 - 38:10of the input display
-
38:10 - 38:11of the input image, okay?
-
38:11 - 38:15So let me just play this video, and you can see
-
38:15 - 38:17some
-
38:17 - 38:21of the inputs and outputs of the neural network, and
-
38:21 - 38:22those
-
38:22 - 38:26are very different fonts. There's this robustness to
-
38:26 - 38:31noise.
-
38:31 - 38:37All
-
38:37 - 38:38
-
38:38 - 38:42
-
38:42 - 38:43right. Multiple
-
38:43 - 38:49digits,
-
38:49 - 38:51
-
38:51 - 38:58that's, kind of, cool. All
-
38:59 - 39:06right.
-
39:12 - 39:19
-
39:31 - 39:36
-
39:36 - 39:39
-
39:39 - 39:40
-
39:40 - 39:42
-
39:42 - 39:49
-
39:52 - 39:55So,
-
39:55 - 39:58just for fun, let me show you one more video,
-
39:58 - 40:01which was - let's
-
40:01 - 40:06see. This is another video from the series The Machine That Changed the World, which
-
40:06 - 40:07was produced by WGBH
-
40:07 - 40:09Television
-
40:09 - 40:11in
-
40:11 - 40:13cooperation with the British Broadcasting Corporation,
-
40:13 - 40:16and it was aired on PBS a few years ago, I think.
-
40:16 - 40:18I want to show you a
-
40:18 - 40:20video describing the NETtalk
-
40:20 - 40:24Neural Network, which was developed by Terry Sejnowski; he's a
-
40:24 - 40:26researcher.
-
40:26 - 40:30And so NETtalk was actually one of the major milestones in the
-
40:30 - 40:32history of neural network,
-
40:32 - 40:33and this specific application
-
40:33 - 40:36is getting the neural network to read text.
-
40:36 - 40:39So, in other words, can you show a
-
40:39 - 40:42piece of English to a computer
-
40:42 - 40:43and have the computer read,
-
40:43 - 40:46sort of, verbally produce sounds that correspond
-
40:46 - 40:48to the
-
40:48 - 40:51reading of the text.
-
40:51 - 40:55And it turns out that in the history of AI and the history of machine learning,
-
40:55 - 40:57this video
-
40:57 - 41:01created a lot of excitement about neural networks and about machine learning. Part of the
-
41:01 - 41:05reason was that Terry Sejnowski had the foresight to choose
-
41:05 - 41:07to use,
-
41:07 - 41:08in his video,
-
41:08 - 41:13a child-like voice talking about visiting your grandmother's house and so on. You'll
-
41:13 - 41:14see it in a second,
-
41:14 - 41:16and so
-
41:16 - 41:19this really created the perception of - created
-
41:19 - 41:23the impression of the neural network being like a young child learning how to speak,
-
41:23 - 41:25and talking about going to your grandmothers, and so
-
41:25 - 41:27on. So this actually
-
41:27 - 41:29helped generate a lot of excitement
-
41:29 - 41:32within academia and outside academia on neural networks, sort of, early in the history
-
41:32 - 41:36of neural networks. I'm just gonna show you the video.
-
41:36 - 41:38[Begin Video] You're going to hear first
-
41:38 - 41:42what the network sounds like at the very beginning of the training, and it won't
-
41:42 - 41:44sound like words, but it'll sound
-
41:44 - 41:48like attempts that will get better and better with
-
41:48 - 41:54time. [Computer's voice] The network
-
41:54 - 41:58takes the letters, say the phrase, grandmother's house, and
-
41:58 - 42:03makes a random attempt at pronouncing it. [Computer's
-
42:03 - 42:07voice]
-
42:07 - 42:09Grandmother's house. The
-
42:09 - 42:13phonetic difference between the guess and the right pronunciation
-
42:13 - 42:17is sent back through the network. [Computer's
-
42:17 - 42:19voice]
-
42:19 - 42:20Grandmother's house.
-
42:20 - 42:24By adjusting the connection strengths after each attempt,
-
42:24 - 42:27the net slowly improves. And,
-
42:27 - 42:30finally, after letting it train overnight,
-
42:30 - 42:33the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's
-
42:33 - 42:38house.
-
42:38 - 42:41Well, because she gives us candy.
-
42:41 - 42:47Well, and we - NETtalk understands nothing about the language. It is simply associating letters
-
42:47 - 42:50with sounds.
-
42:50 - 42:52[End Video] All right. So
-
42:52 - 42:54at the time this was done, I mean, this is
-
42:54 - 42:57an amazing piece of work. I should say
-
42:57 - 42:59today there are other
-
42:59 - 43:03text to speech systems that work better than what you just saw,
-
43:03 - 43:08and you'll also appreciate getting candy from your grandmother's house is a little bit
-
43:08 - 43:11less impressive than talking about the Dow Jones falling 15 points,
-
43:11 - 43:16and profit taking, whatever. So
-
43:16 - 43:19but I wanted to show that just because that was another cool,
-
43:19 - 43:26major landmark in the history of neural networks. Okay.
-
43:27 - 43:33So let's switch back to the chalkboard,
-
43:33 - 43:36and what I want to do next
-
43:36 - 43:39is
-
43:39 - 43:42tell you about Support Vector Machines,
-
43:42 - 43:49okay? That, sort of, wraps up our discussion on neural
-
44:09 - 44:11networks. So I started off talking about neural networks
-
44:11 - 44:13by motivating it as a way to get
-
44:13 - 44:18us to output non-linear classifiers, right? I didn't really prove it, but it turns out
-
44:18 - 44:18that
-
44:18 - 44:22you'd be able to come up with non-linear decision boundaries using a neural
-
44:22 - 44:22network
-
44:22 - 44:27like what I drew on the chalkboard earlier.
-
44:27 - 44:30Support Vector Machines will be another learning algorithm that will give us a way
-
44:30 - 44:35to come up with non-linear classifiers. It's a very effective, off-the-shelf
-
44:35 - 44:36learning algorithm,
-
44:36 - 44:39but it turns out that in the discussion I'm gonna - in
-
44:39 - 44:43the progression and development I'm gonna pursue, I'm actually going to start
-
44:43 - 44:48off by describing yet another class of linear classifiers with linear decision
-
44:48 - 44:49boundaries,
-
44:49 - 44:54and it'll only be later, sort of, in probably the next lecture or the one after that,
-
44:54 - 44:55that we'll then
-
44:55 - 44:59take the support vector machine idea and, sort of, do some clever things to it
-
44:59 - 45:03to make it work very well to generate non-linear decision boundaries as well, okay?
-
45:03 - 45:08But we'll actually start by talking about linear classifiers a little bit more.
-
45:08 - 45:13And to do that, I want to
-
45:13 - 45:15convey two
-
45:15 - 45:17intuitions about classification.
-
45:17 - 45:21One is
-
45:21 - 45:24you think about logistic regression; we have this logistic function that was outputting
-
45:24 - 45:29the probability that Y equals one,
-
45:29 - 45:31and it crosses
-
45:31 - 45:33this line at zero.
-
45:33 - 45:38So when you run logistic regression, I want you to think of it as
-
45:38 - 45:41an algorithm that computes
-
45:41 - 45:43theta transpose X,
-
45:43 - 45:47and then it predicts
-
45:47 - 45:51one, right,
-
45:51 - 45:56if and only if, theta transpose X is greater than zero, right? IFF stands for if
-
45:56 - 46:00and only if. It means the same thing as a double implication,
-
46:00 - 46:05and it predicts zero,
-
46:05 - 46:12if and only if, theta transpose X is less than zero, okay? So if
-
46:12 - 46:15it's the case that
-
46:15 - 46:18theta transpose X is much greater than zero,
-
46:18 - 46:22the double greater than sign means these are much greater than, all right.
-
46:22 - 46:26So if theta transpose X is much greater than zero,
-
46:26 - 46:27then,
-
46:27 - 46:33you know, think of that as a very confident
-
46:33 - 46:39prediction
-
46:39 - 46:41that Y is equal to one,
-
46:41 - 46:43right? If theta transpose X is much greater than zero, then we're
-
46:43 - 46:47gonna predict one, and moreover we're very confident it's one, and the picture for
-
46:47 - 46:48that is if theta
-
46:48 - 46:51transpose X is way out here, then
-
46:51 - 46:54we're estimating that the probability of Y being equal to one
-
46:54 - 46:56on the sigmoid function, it will
-
46:56 - 46:59be very close to one.
-
46:59 - 47:03And, in the same way, if theta transpose X
-
47:03 - 47:05is much less than zero,
-
47:05 - 47:12then we're very confident that
-
47:13 - 47:16Y is equal to zero.
-
47:16 - 47:20So
-
47:20 - 47:24wouldn't it be nice - so when we fit logistic regression or some of these
-
47:24 - 47:25classifiers to your training set,
-
47:25 - 47:30then so wouldn't it be nice if,
-
47:30 - 47:30right,
-
47:30 - 47:34for all I
-
47:34 - 47:39such that Y is equal to one.
-
47:39 - 47:40We have
-
47:40 - 47:41theta
-
47:41 - 47:45transpose XI is much greater than zero,
-
47:45 - 47:48and for all I such that Y is equal
-
47:48 - 47:50to
-
47:50 - 47:57zero,
-
47:58 - 48:00we have theta transpose XI is
-
48:00 - 48:01much less than zero,
-
48:01 - 48:06okay? So wouldn't it be nice if this is true? That,
-
48:06 - 48:06
-
48:06 - 48:10essentially, if our training set, we can find parameters theta
-
48:10 - 48:13so that our learning algorithm not only
-
48:13 - 48:17makes correct classifications on all the examples in a training set, but further it's, sort
-
48:17 - 48:20of, is very confident about all of those correct
-
48:20 - 48:22classifications. This is
-
48:22 - 48:27the first intuition that I want you to have, and we'll come back to this first intuition
-
48:27 - 48:29in a second
-
48:29 - 48:33when we talk about functional margins, okay?
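(In symbols, the first intuition is the wish that, on the training set,)

    \theta^T x^{(i)} \gg 0 \;\text{ whenever } y^{(i)} = 1,
    \qquad
    \theta^T x^{(i)} \ll 0 \;\text{ whenever } y^{(i)} = 0.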
-
48:33 - 48:40We'll define this later. The second
-
48:44 - 48:48intuition that I want to
-
48:48 - 48:55convey,
-
48:57 - 49:00and it turns out for the rest of today's lecture I'm going to assume
-
49:00 - 49:04that a training set is linearly separable, okay? So by that I mean for the rest of
-
49:04 - 49:06
-
49:06 - 49:09today's lecture, I'm going to assume that
-
49:09 - 49:11there is indeed a straight line that can separate
-
49:11 - 49:15your training set, and we'll remove this assumption later, but
-
49:15 - 49:17just to develop the algorithm, let's take a
-
49:17 - 49:19linearly separable training set.
-
49:19 - 49:20And so
-
49:20 - 49:24there's a sense that out of all the straight lines that separate the training
-
49:24 - 49:29set, you know, maybe that straight line isn't such a good one,
-
49:29 - 49:33and that one actually isn't such a great one either, but
-
49:33 - 49:35maybe that line in the
-
49:35 - 49:39middle is a much better linear separator than the others, right?
-
49:39 - 49:41And one reason that
-
49:41 - 49:46when you and I look at it this one seems best
-
49:46 - 49:48is because this line is just further from the data, all right?
-
49:48 - 49:52That is, it separates the data with a greater distance between your positive and your negative
-
49:52 - 49:56examples and division boundary, okay?
-
49:56 - 49:59And this second intuition, we'll come back to this shortly,
-
49:59 - 50:03about this final line
-
50:03 - 50:06that I drew being, maybe, the best line
-
50:06 - 50:10this notion of distance from the training examples. This is the second intuition I want
-
50:10 - 50:11to convey,
-
50:11 - 50:14and we'll formalize it later when
-
50:14 - 50:18we talk about
-
50:18 - 50:25geometric margins of our classifiers, okay?
-
50:42 - 50:45So in order to describe support vector machine, unfortunately, I'm gonna
-
50:45 - 50:48have to pull a notation change,
-
50:48 - 50:50and, sort
-
50:50 - 50:54of, unfortunately, it, sort of, was impossible to do logistic regression,
-
50:54 - 50:56and support vector machines,
-
50:56 - 51:01and all the other algorithms using one completely consistent notation,
-
51:01 - 51:04and so I'm actually gonna change notations slightly
-
51:04 - 51:08for linear classifiers, and that will actually make it much easier for us -
-
51:08 - 51:10that'll make it much easier later today
-
51:10 - 51:15and in next week's lectures to actually talk about support vector machine. But the
-
51:15 - 51:20notation that I'm gonna use for the rest
-
51:20 - 51:22of today and for most of next week
-
51:22 - 51:26will be that my labels Y,
-
51:26 - 51:30instead of being zero, one, will be minus one and plus one,
-
51:30 - 51:34
-
51:34 - 51:37and in the development of a support vector machine
-
51:37 - 51:38we will have H,
-
51:38 - 51:44have a hypothesis
-
51:44 - 51:49output values that
-
51:49 - 51:53are either plus one or minus one,
-
51:53 - 51:55and so
-
51:55 - 51:58we'll let G of Z be equal to
-
51:58 - 52:00one if Z is
-
52:00 - 52:04greater or equal to zero, and minus one otherwise, right? So just rather than zero and
-
52:04 - 52:08one, we change everything to plus one and minus one.
-
52:08 - 52:10And, finally,
-
52:10 - 52:14whereas previously I wrote H subscript
-
52:14 - 52:17theta of X equals
-
52:17 - 52:18G of theta transpose X
-
52:18 - 52:21and we had the convention
-
52:21 - 52:23that X zero is equal to one,
-
52:23 - 52:26right? And so X is an RN plus
-
52:26 - 52:29one.
-
52:29 - 52:34I'm gonna drop this convention of
-
52:34 - 52:38letting X zero equals a one, and letting X be an RN plus one, and instead I'm
-
52:38 - 52:44going to parameterize my linear classifier as H subscript W, B of X
-
52:44 - 52:46equals G of
-
52:46 - 52:51W transpose X plus B, okay?
-
52:51 - 52:51And so
-
52:51 - 52:53B
-
52:53 - 52:55just now plays the role of
-
52:55 - 52:56theta zero,
-
52:56 - 53:03and W now plays the role of the rest of the parameters, theta one
-
53:04 - 53:07through theta N, okay? So just by separating out
-
53:07 - 53:11the intercept term B rather than lumping it together, it'll make it easier for
-
53:11 - 53:11us
-
53:11 - 53:18to develop support vector machines.
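(Summarizing the notation change in symbols:)

    y \in \{-1, +1\}, \qquad
    h_{w,b}(x) = g(w^T x + b), \qquad
    g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}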
-
53:18 - 53:25So - yes. Student:[Off mic]. Instructor
-
53:28 - 53:32(Andrew Ng):Oh, yes. Right, yes. So W is -
-
53:32 - 53:33right. So W
-
53:33 - 53:35is a vector in RN, and
-
53:35 - 53:37X
-
53:37 - 53:43is now a vector in RN rather than N plus one,
-
53:43 - 53:50and a lowercase b is a real number. Okay.
-
53:56 - 54:03Now, let's formalize the notion of functional margin and geometric margin. Let me make a
-
54:03 - 54:04definition.
-
54:04 - 54:11I'm going to say that the functional margin
-
54:13 - 54:20of the hyper plane
-
54:20 - 54:22WB
-
54:22 - 54:25with respect to a specific training example,
-
54:25 - 54:27XIYI is - WRT
-
54:27 - 54:32stands for with respect to -
-
54:32 - 54:35the function margin of a hyper plane WB with
-
54:35 - 54:38respect to
-
54:38 - 54:43a certain training example, XIYI has been defined as Gamma
-
54:43 - 54:45Hat I equals YI
-
54:45 - 54:47times W transpose XI
-
54:47 - 54:51plus B, okay?
-
54:51 - 54:55And so a set of parameters, W, B defines
-
54:55 - 54:56a
-
54:56 - 55:00classifier - it, sort of, defines a linear separating boundary,
-
55:00 - 55:01and so
-
55:01 - 55:04when I say hyper plane, I just mean
-
55:04 - 55:07the decision boundary that's
-
55:07 - 55:11defined by the parameters W, B.
-
55:11 - 55:13You know what,
-
55:13 - 55:18if you're confused by the hyper plane term, just ignore it. The functional margin of a classifier with
-
55:18 - 55:19parameters W, B
-
55:19 - 55:21with respect to a training example
-
55:21 - 55:24is given by this formula, okay?
-
55:24 - 55:26And interpretation of this is that
-
55:26 - 55:28if
-
55:28 - 55:30YI is equal to one,
-
55:30 - 55:33then for each to have a large functional margin, you
-
55:33 - 55:37want W transpose XI plus B to
-
55:37 - 55:38be large,
-
55:38 - 55:39right?
-
55:39 - 55:41And if YI
-
55:41 - 55:44is equal minus one,
-
55:44 - 55:47then in order for the functional margin to be large - we, sort of, want the functional margins
-
55:47 - 55:50to be large, but in order for the functional margins to be large,
-
55:50 - 55:55if YI is equal to minus one, then the only way for this to be big is if W
-
55:55 - 55:57transpose XI
-
55:57 - 55:58plus B
-
55:58 - 56:04is much less than zero, okay? So this
-
56:04 - 56:07captures the intuition that we had earlier about functional
-
56:07 - 56:09margins - the
-
56:09 - 56:13intuition we had earlier that if YI is equal to one, we want this to be big, and if YI
-
56:13 - 56:15is equal to minus one, we
-
56:15 - 56:16want this to be small,
-
56:16 - 56:17and this, sort of,
-
56:17 - 56:19compresses the two cases
-
56:19 - 56:23into one statement that we'd like the functional margin to be large.
-
56:23 - 56:27And notice this is also that
-
56:27 - 56:31so long as YI times W transpose XI
-
56:31 - 56:34plus B, so long as this is greater than zero,
-
56:34 - 56:37that means we
-
56:37 - 56:44classified it correctly, okay?
-
57:14 - 57:18And one more definition, I'm going to say that
-
57:18 - 57:20the functional margin of
-
57:20 - 57:24a hyper plane with respect to an entire training set
-
57:24 - 57:31is
-
57:32 - 57:34going to define gamma hat to
-
57:34 - 57:36be equal to
-
57:36 - 57:36min
-
57:36 - 57:40over all your training examples of gamma hat, I, right?
-
57:40 - 57:41So if you have
-
57:41 - 57:44a training set, if you have just more than one training example,
-
57:44 - 57:47I'm going to define the functional margin
-
57:47 - 57:50with respect to the entire training set as
-
57:50 - 57:52the worst case of all of your
-
57:52 - 57:54functional margins of the entire training set.
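(Side by side, the two definitions are:)

    \hat{\gamma}^{(i)} = y^{(i)}\big(w^T x^{(i)} + b\big),
    \qquad
    \hat{\gamma} = \min_{i} \hat{\gamma}^{(i)}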
-
57:54 - 57:58And so for now we should think of the
-
57:58 - 58:02first function like an intuition of saying that we would like the function margin
-
58:02 - 58:03to be large,
-
58:03 - 58:06and for our purposes, for now,
-
58:06 - 58:07let's just say we would like
-
58:07 - 58:10the worst-case functional margin to be large, okay? And
-
58:10 - 58:16we'll change this a little bit later as well. Now, it turns out that
-
58:16 - 58:20there's one little problem with this intuition that will, sort of, edge
-
58:20 - 58:23us later,
-
58:23 - 58:26which it actually turns out to be very easy to make the functional margin large, all
-
58:26 - 58:29right? So, for example,
-
58:29 - 58:32so, say, I have a classifier with parameters W and
-
58:32 - 58:37B. If I take W and multiply it by two and take B and multiply it by two,
-
58:37 - 58:42then if you refer to the definition of the functional margin, I guess that was what?
-
58:42 - 58:43Gamma I,
-
58:43 - 58:44gamma hat I
-
58:44 - 58:48equals YI
-
58:48 - 58:51times W transpose XI plus B. If
-
58:51 - 58:53I double
-
58:53 - 58:55W and B,
-
58:55 - 58:55then
-
58:55 - 58:59I can easily double my functional margin. So this goal
-
58:59 - 59:03of making the functional margin large, in and of itself, isn't so
-
59:03 - 59:06useful because it's easy to make the functional margin arbitrarily large
-
59:06 - 59:09just by scaling other parameters. And so
-
59:09 - 59:13maybe one thing we need to do later is add a normalization
-
59:13 - 59:15condition.
-
59:15 - 59:17For example, maybe
-
59:17 - 59:20we want to add a normalization condition that the norm, the
-
59:20 - 59:22L2 norm of
-
59:22 - 59:23the parameter W
-
59:23 - 59:24is equal to one,
-
59:24 - 59:28and we'll come back to this in a second. All
-
59:28 - 59:34right. And then so - Okay.
-
59:34 - 59:40Now, let's talk about - see how much
-
59:40 - 59:47time we have, 15 minutes. Well, see, I'm trying to
-
59:53 - 60:00decide how much to try to do in the last 15 minutes. Okay. So let's talk
-
60:05 - 60:11about the geometric margin,
-
60:11 - 60:13and so
-
60:13 - 60:17the geometric margin of a training example -
-
60:17 - 60:24[inaudible], right?
-
60:27 - 60:31So the decision boundary of my classifier
-
60:31 - 60:33is going to be given by the plane W transpose X
-
60:33 - 60:35plus B is equal
-
60:35 - 60:38to zero, okay? Right, and these are the
-
60:38 - 60:41X1, X2 axes, say,
-
60:41 - 60:43and
-
60:43 - 60:48we're going to draw relatively few training examples here. Let's say I'm drawing
-
60:48 - 60:53deliberately few training examples so that I can add things to this, okay?
-
60:53 - 60:55And so
-
60:55 - 60:57assuming we classified an example correctly, I'm
-
60:57 - 61:01going to define the geometric margin
-
61:01 - 61:06as just the geometric distance between the
-
61:06 - 61:07training example -
-
61:07 - 61:10yeah, the training example XI,
-
61:10 - 61:12YI
-
61:12 - 61:13and
-
61:13 - 61:15this separating
-
61:15 - 61:18line, this separating hyperplane, okay?
-
61:18 - 61:20That's what I'm going to define the
-
61:20 - 61:24geometric margin to be.
-
61:24 - 61:26And so I'm gonna do
-
61:26 - 61:30some algebra fairly quickly. In case it doesn't make sense, read
-
61:30 - 61:34through the lecture notes more carefully for details.
-
61:34 - 61:36Sort of, by standard geometry,
-
61:36 - 61:41the normal, or in other words, the vector that's 90 degrees to the
-
61:41 - 61:42separating hyperplane
-
61:42 - 61:45is going to be given by W divided by
-
61:45 - 61:47the norm of W; that's just how
-
61:47 - 61:50planes in high dimensions work. If this
-
61:50 - 61:53stuff - if some of this isn't clear, take a look at the lecture notes on the
-
61:53 - 61:56website.
-
61:56 - 61:58And so let's say this distance
-
61:58 - 62:00is
-
62:00 - 62:01
-
62:01 - 62:02gamma I,
-
62:02 - 62:05okay? And so I'm going to use the convention that
-
62:05 - 62:09I'll put a hat on top where I'm referring to functional margins,
-
62:09 - 62:12and no hat on top for geometric margins. So let's say geometric margin,
-
62:12 - 62:14in this example, is
-
62:14 - 62:18gamma I.
-
62:18 - 62:22That means that
-
62:22 - 62:26this point here, right,
-
62:26 - 62:29is going to be
-
62:29 - 62:32XI minus
-
62:32 - 62:36gamma I times W
-
62:36 - 62:39over the norm of W, okay?
-
62:39 - 62:41Because
-
62:41 - 62:45W over the norm of W is the unit vector, the length-one vector that
-
62:45 - 62:48is normal to the separating hyperplane,
-
62:48 - 62:52and so when we subtract gamma I times the unit vector
-
62:52 - 62:55from this point, XI - and this point here is XI.
-
62:55 - 62:57So XI minus,
-
62:57 - 62:59you know, this little vector here
-
62:59 - 63:02is going to be this point that I've drawn as
-
63:02 - 63:03a heavy circle,
-
63:03 - 63:06okay? So this heavy point here is
-
63:06 - 63:07XI minus this vector,
-
63:07 - 63:14and this vector is gamma I times W over the norm of W, okay?
-
63:16 - 63:18And so
-
63:18 - 63:21because this heavy point is on the separating hyperplane,
-
63:21 - 63:23right, this point
-
63:23 - 63:25must satisfy
-
63:25 - 63:29W transpose times
-
63:29 - 63:35that point
-
63:35 - 63:36equals zero,
-
63:36 - 63:37right? Because
-
63:37 - 63:38all
-
63:38 - 63:41points X on the separating hyperplane satisfy the equation W transpose X plus B
-
63:41 - 63:43equals zero,
-
63:43 - 63:46and so this point is on the separating hyper plane, therefore,
-
63:46 - 63:49it must satisfy W transpose this point - oh,
-
63:49 - 63:54excuse me. Plus B is equal
-
63:54 - 63:58to zero, okay?
-
63:58 - 64:01Raise your hand if this makes sense so far? Oh,
-
64:01 - 64:04okay. Cool, most of you, but, again,
-
64:04 - 64:07I'm, sort of, being slightly fast in this geometry. So if you're not quite sure
-
64:07 - 64:10why this is a normal vector, or how I subtracted
-
64:10 - 64:17this, or whatever, take a look at the details in the lecture notes.
-
64:21 - 64:24And so
-
64:24 - 64:28what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? So
-
64:28 - 64:30this equation I just wrote down,
-
64:30 - 64:32solve this equation for gamma
-
64:32 - 64:34or gamma I,
-
64:34 - 64:41and you find that -
-
64:44 - 64:48you solve that previous equation for gamma I -
-
64:48 - 64:50well, why don't I just do it?
-
64:50 - 64:53You have W transpose XI plus
-
64:53 - 64:55B
-
64:55 - 64:55equals
-
64:55 - 64:57gamma
-
64:57 - 65:01I times W transpose W over norm of W;
-
65:01 - 65:05that's just equal to gamma
-
65:05 - 65:07times the norm of W
-
65:07 - 65:08because
-
65:08 - 65:10W transpose W
-
65:10 - 65:12is the norm
-
65:12 - 65:15of W squared, and, therefore,
-
65:15 - 65:18gamma
-
65:18 - 65:25I is just - well,
-
65:27 - 65:34gamma I equals W over the norm of W, transpose, times XI, plus B over the norm of W, okay?
-
65:34 - 65:38And, in other words, this little calculation just showed us that if you have a
-
65:38 - 65:41training example
-
65:41 - 65:43XI,
-
65:43 - 65:47then the distance between XI and the separating hyperplane defined by the
-
65:47 - 65:49parameters W and B
-
65:49 - 65:56can be computed by this formula, okay? So
-
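In symbols, the little calculation above goes like this (my restatement of the board work, using the notation of the lecture notes): since the point x^{(i)} - \gamma^{(i)} w / \|w\| lies on the hyperplane,

w^{T}\left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0
\quad\Longrightarrow\quad
\gamma^{(i)} = \frac{w^{T} x^{(i)} + b}{\|w\|} = \left( \frac{w}{\|w\|} \right)^{T} x^{(i)} + \frac{b}{\|w\|}.
-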
66:03 - 66:05the last thing I want to do is actually
-
66:05 - 66:07take into account
-
66:07 - 66:12the sign of the - the correct classification of the training example. So I've
-
66:12 - 66:13been assuming that
-
66:13 - 66:16we've been classifying an example correctly.
-
66:16 - 66:21So, more generally, I'll define
-
66:21 - 66:27the geometric
-
66:27 - 66:30margin of a training example to be
-
66:30 - 66:33gamma I equals YI
-
66:33 - 66:40times that thing on top, okay?
-
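That is, in the notation of the lecture notes, the (signed) geometric margin of a training example is

\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^{T} x^{(i)} + \frac{b}{\|w\|} \right).
-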
66:45 - 66:47And so
-
66:47 - 66:51this is very similar to the functional margin, except for the normalization by the
-
66:51 - 66:53norm of W,
-
66:53 - 66:58and so as before, you know, this says that so long as -
-
66:58 - 67:01we would like the geometric margin to be large, and all that means is that so
-
67:01 - 67:02long as
-
67:02 - 67:04we're classifying the example correctly,
-
67:04 - 67:08we would ideally hope for the example to be as far as possible from the separating
-
67:08 - 67:11hyperplane, so long as it's on the right side of
-
67:11 - 67:12the separating hyperplane, and that's what YI
-
67:12 - 67:19multiplied into this does.
-
67:23 - 67:28And so a couple of easy facts, one is if the
-
67:28 - 67:32norm of W is equal to one,
-
67:32 - 67:34then
-
67:34 - 67:37the functional margin is equal to the geometric margin, and you see
-
67:37 - 67:39that quite easily,
-
67:39 - 67:41and, more generally,
-
67:41 - 67:43
-
67:43 - 67:48the geometric margin is just equal
-
67:48 - 67:55to the functional margin divided by the norm of W, okay? Let's see, okay.
-
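Written out, the relationship between the two margins is simply

\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|},

so the two notions coincide exactly when \|w\| = 1.
-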
68:12 - 68:19And so one final definition is
-
68:22 - 68:27so far I've defined the geometric margin with respect to a single training example,
-
68:27 - 68:32and so as before, I'll define the geometric margin with respect to an entire training set
-
68:32 - 68:34as gamma equals
-
68:34 - 68:38min over I
-
68:38 - 68:45of gamma I, all right?
-
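Putting the last few definitions together, here is a small code sketch (my own toy data and function names, not part of the course materials) of the functional and geometric margins of a whole training set, including the scaling behavior discussed a few minutes ago:

import numpy as np

def functional_margin(w, b, X, y):
    # gamma_hat = min_i y_i * (w^T x_i + b)
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    # gamma = min_i y_i * ((w/||w||)^T x_i + b/||w||)
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# hypothetical toy data: two examples on each side of the boundary
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(functional_margin(w, b, X, y), functional_margin(10 * w, 10 * b, X, y))  # scales with w, b
print(geometric_margin(w, b, X, y), geometric_margin(10 * w, 10 * b, X, y))    # unchanged by scaling
-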
68:45 - 68:47And so
-
68:47 - 68:54the maximum margin classifier, which is a precursor to the support vector machine,
-
69:00 - 69:05is the learning algorithm that chooses the parameters W and B
-
69:05 - 69:07so as to maximize
-
69:07 - 69:09the geometric margin,
-
69:09 - 69:12and so I just write that down. The
-
69:12 - 69:16maximum margin classifier poses the following optimization problem. It says
-
69:16 - 69:20choose gamma, W, and B
-
69:20 - 69:24so as to maximize the geometric margin,
-
69:24 - 69:25subject to that YI times -
-
69:25 - 69:32well,
-
69:33 - 69:38this is just one way to write it,
-
69:38 - 69:41subject to - actually, do I write it like that? Yeah,
-
69:41 - 69:44fine.
-
69:44 - 69:46There are several ways to write this, and one of the things we'll
-
69:46 - 69:49do next time is actually - I'm trying
-
69:49 - 69:53to figure out if I can do this in five minutes. I'm guessing this could
-
69:53 - 69:55be difficult. Well,
-
69:55 - 69:59so this maximum margin classifier is the maximization problem
-
69:59 - 70:01over the parameters gamma,
-
70:01 - 70:03W
-
70:03 - 70:05and B, and for now, it turns out
-
70:05 - 70:09that the geometric margin doesn't change depending on the norm of W, right? Because
-
70:09 - 70:11in the definition of the geometric margin,
-
70:11 - 70:14notice that we're dividing by the norm of W anyway.
-
70:14 - 70:18So you can actually set the norm of W to be anything you want, and you can multiply
-
70:18 - 70:23W and B by any constant; it doesn't change the geometric margin.
-
70:23 - 70:27This will actually be important, and we'll come back to this later. Notice that you
-
70:27 - 70:30can take the parameters W and B,
-
70:30 - 70:34and you can impose any normalization constraint on them, or you can change W and B by
-
70:34 - 70:36any scaling factor and
-
70:36 - 70:38replace them by ten W and ten B
-
70:38 - 70:39whatever,
-
70:39 - 70:43and it does not change the geometric margin,
-
70:43 - 70:44okay? And so
-
70:44 - 70:48in this first formulation, I'm just gonna impose a constraint and say that the norm of W is
-
70:48 - 70:49one,
-
70:49 - 70:52and so the functional and the geometric margins will be the same,
-
70:52 - 70:53and then we'll say
-
70:53 - 70:55maximize the geometric margins subject to -
-
70:55 - 70:57you maximize gamma
-
70:57 - 71:00subject to that every training example must have geometric margin
-
71:00 - 71:02at least gamma,
-
71:02 - 71:04and this is a geometric margin because
-
71:04 - 71:06when the norm of W is equal to one, then
-
71:06 - 71:08the functional and the geometric margins
-
71:08 - 71:11are identical, okay?
-
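Written out, this first formulation of the maximum margin classifier is the optimization problem (roughly as it appears in the CS 229 lecture notes, with m the number of training examples):

\begin{aligned}
\max_{\gamma,\, w,\, b} \quad & \gamma \\
\text{s.t.} \quad & y^{(i)} \left( w^{T} x^{(i)} + b \right) \ge \gamma, \quad i = 1, \ldots, m, \\
& \|w\| = 1.
\end{aligned}
-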
71:11 - 71:15So this is the maximum margin classifier, and it turns out that if you do this,
-
71:15 - 71:19it'll run, you know, maybe about as well as a - maybe slight - maybe comparable to logistic regression, but
-
71:19 - 71:20it
-
71:20 - 71:23
-
71:23 - 71:24turns out that
-
71:24 - 71:26as we develop this algorithm further, there
-
71:26 - 71:31will be a clever way to allow us to change this algorithm to let it work
-
71:31 - 71:33in infinite dimensional feature spaces
-
71:33 - 71:37and come up with very efficient non-linear classifiers.
-
71:37 - 71:38So
-
71:38 - 71:41there's a ways to go before we turn this into a support vector machine,
-
71:41 - 71:43but this is the first step.
-
71:43 - 71:50So are there questions about this? Yeah.
-
72:04 - 72:08Student: [Off mic]. Instructor (Andrew Ng): For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a
-
72:08 - 72:10fixed training set, and the
-
72:10 - 72:13scaling of the training set is not something you get to play with, right?
-
72:13 - 72:17So everything I've said is for a fixed training set, so that you can't change the X's, and you can't change the
-
72:17 - 72:21Y's. Are there other questions? Okay. So all right.
-
72:21 - 72:26
-
72:26 - 72:30
-
72:30 - 72:35Next week we will take this, and we'll talk about optimization algorithms,
-
72:35 - 72:36and
-
72:36 - 72:40work our way towards turning this into one of the most effective off-the-shelf
-
72:40 - 72:42learning algorithms,
-
72:42 - 72:44and just a final reminder again, this next
-
72:44 - 72:48discussion session will be on Matlab and Octave. So show up for that if you want
-
72:48 - 72:50to see a tutorial.
-
72:50 - 72:52Okay. See you guys in the next class.