Lecture 6 | Machine Learning (Stanford)
-
0:00 - 0:12This subtitle is provided from lecsub.jimdo.com under Creative Commons License.
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:24 - 0:25
-
0:25 - 0:27What I want to do today is
-
0:27 - 0:31continue our discussion of Naive Bayes, which is the learning algorithm that
-
0:31 - 0:33I
-
0:33 - 0:36started to discuss in the previous lecture
-
0:36 - 0:40and talk about a couple of different event models in Naive Bayes,
-
0:40 - 0:44and then I'll take a brief digression to talk about neural networks, which is
-
0:44 - 0:47something that I actually won't spend a lot of time on,
-
0:47 - 0:50and then I want to start to talk about support vector machines,
-
0:50 - 0:53and support vector machines is
-
0:53 - 0:56the learning algorithms, the supervised learning algorithm that
-
0:56 - 1:01many people consider the most effective, off-the-shelf
-
1:01 - 1:02supervised learning
-
1:02 - 1:05algorithm. That point of view is debatable, but there are many people that hold that
-
1:05 - 1:06point of view,
-
1:06 - 1:11and we'll start discussing that today, and this will actually take us a few
-
1:11 - 1:14lectures to complete.
-
1:14 - 1:17So let's talk about Naive Bayes. To recap
-
1:17 - 1:21from the previous lecture,
-
1:21 - 1:23I started off
-
1:23 - 1:24describing
-
1:24 - 1:28spam classification as the most [inaudible] example for Naive Bayes
-
1:28 - 1:34in which we would create feature vectors like these, right, that correspond
-
1:34 - 1:38to words in a dictionary.
-
1:38 - 1:40And so, you know,
-
1:40 - 1:43based on what words appear in a piece of email, we represented it as a
-
1:43 - 1:45feature vector with
-
1:45 - 1:48ones and zeros in the corresponding places,
-
1:48 - 1:50and Naive
-
1:50 - 1:57Bayes was a generative learning algorithm, and by that
-
1:59 - 2:00I mean it's
-
2:00 - 2:06an algorithm in which we model P(X|Y), and for Naive
-
2:06 - 2:08Bayes, specifically, we
-
2:08 - 2:13modeled it as product from I equals one
-
2:13 - 2:14to
-
2:14 - 2:17N, P(Xi|Y),
-
2:17 - 2:19and also we model P(Y),
-
2:19 - 2:23and then we use Bayes Rule, right, to combine these two together,
-
2:23 - 2:25and so our predictions,
-
2:25 - 2:29when you give it a new piece of email you want to tell if it's spam or not spam,
-
2:29 - 2:33you predict arg max over Y, P(Y|X), which
-
2:33 - 2:36by Bayes Rule is arg max over Y,
-
2:36 - 2:39P(X given Y)
-
2:39 - 2:41times P(Y), okay?
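(In symbols, the model and the prediction rule being recapped here are:)

    p(x \mid y) = \prod_{i=1}^{n} p(x_i \mid y), \qquad
    \arg\max_y \, p(y \mid x) = \arg\max_y \; p(x \mid y)\, p(y)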
-
2:41 - 2:44So this is Naive Bayes,
-
2:44 - 2:46and just to draw attention to two things,
-
2:46 - 2:50one is that in this model, each of our features
-
2:50 - 2:52were zero, one, so
-
2:52 - 2:54indicating whether different words appear,
-
2:54 - 2:58and the length of the feature vector was, sort of,
-
2:58 - 3:00the length N of the feature vector was
-
3:00 - 3:07the number of words in the dictionary.
-
3:08 - 3:15So it might be on this version on the order of 50,000 words, say. What I want
-
3:18 - 3:21to do now is describe two variations on this algorithm.
-
3:21 - 3:24The first one is the simpler one,
-
3:24 - 3:27which it's just a generalization to if
-
3:27 - 3:28xi
-
3:28 - 3:33takes on more values. So, you know, one thing that's commonly done
-
3:33 - 3:35is
-
3:35 - 3:39to apply Naive Bayes to problems where
-
3:39 - 3:43some of these features, xi, takes on K values rather than just two values,
-
3:43 - 3:47and in that case,
-
3:47 - 3:49you actually build,
-
3:49 - 3:52sort of, a very similar model where P(X|Y)
-
3:52 - 3:53is
-
3:53 - 3:57really the same thing, right,
-
3:57 - 4:02where now these are going to be multinomial probabilities
-
4:02 - 4:08rather than Bernoulli's because the XI's can, maybe, take on up to K values.
-
4:08 - 4:12It turns out, the situation where - one situation where this arises very
-
4:12 - 4:17commonly is if you have a feature that's actually continuous valued,
-
4:17 - 4:21and you choose to discretize it, and you choose to take a continuous value feature
-
4:21 - 4:26and discretize it into a finite set of K values,
-
4:26 - 4:28and so it's a perfect example
-
4:28 - 4:32if you remember our very first
-
4:32 - 4:35supervised learning problem of predicting
-
4:35 - 4:37the price of
-
4:37 - 4:41houses. If you have the classification problem on these houses, so based on
-
4:41 - 4:44features of a house, and you want to predict whether or not the house will be sold in the next six
-
4:44 - 4:47months, say. That's a classification problem, and
-
4:47 - 4:49once you use Naive Bayes,
-
4:49 - 4:52then given a continuous value feature
-
4:52 - 4:55like the living area,
-
4:55 - 4:59you know, one pretty common thing to do would be take the continuous value living area
-
4:59 - 5:02and just discretize it
-
5:02 - 5:07into a few
-
5:07 - 5:09
-
5:09 - 5:13- discrete buckets,
-
5:13 - 5:14and so
-
5:14 - 5:17depending on whether the living area of the house is less than 500 square feet
-
5:17 - 5:19or between 1,000 and 1500 square feet,
-
5:19 - 5:22and so on, or whether it's greater than 2,000 square feet,
-
5:22 - 5:26you choose the value of the corresponding feature, XI, to be one,
-
5:26 - 5:29two, three, or four, okay?
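(A minimal Python sketch of this kind of discretization; the bucket thresholds are just the illustrative ones from the board, not a recommendation:)

    def discretize_living_area(square_feet):
        # Map a continuous living-area value to a small number of buckets;
        # the bucket index then serves as the discrete feature value x_i.
        thresholds = [500, 1500, 2000]           # illustrative cut points
        for value, cutoff in enumerate(thresholds, start=1):
            if square_feet < cutoff:
                return value
        return len(thresholds) + 1               # larger than the last cut point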
-
5:29 - 5:31So that was the first
-
5:31 - 5:36variation or generalization of Naive Bayes I
-
5:36 - 5:43wanted to talk about. I should just check; are there questions about this? Okay.
-
5:50 - 5:56Cool. And so it turns out that in practice, it's fairly common to use about ten buckets
-
5:56 - 5:59to discretize a continuous value feature. I drew four here only to
-
5:59 - 6:01save on writing.
-
6:01 - 6:03
-
6:03 - 6:03
-
6:03 - 6:07The second and, sort of, final variation that I want to talk about for Naive Bayes
-
6:07 - 6:10is a variation that's specific
-
6:10 - 6:11to
-
6:11 - 6:16classifying text documents, or, more generally, for classifying sequences. So the text
-
6:16 - 6:20document, like a piece of email, you can think of as a sequence of words
-
6:20 - 6:24and you can apply this, sort of, model I'm about to describe to classifying other sequences as
-
6:24 - 6:25well, but let
-
6:25 - 6:27me just focus on text,
-
6:27 - 6:30and here's
-
6:30 - 6:33the idea. So the
-
6:33 - 6:38Naiven a piece of email, we were
-
6:38 - 6:42representing it using this binary vector value representation,
-
6:42 - 6:46and one of the things that this loses, for instance, is the number of times that different words
-
6:46 - 6:47appear,
-
6:47 - 6:49all right? So, for example, if
-
6:49 - 6:53some word appears a lot of times, and you see the word, you know, buy a lot of times.
-
6:53 - 6:58You see the word Viagra; it seems to be a common email example. You see the word Viagra a ton of
-
6:58 - 6:59times in the email, it
-
6:59 - 7:03is more likely to be spam than if it appears, I guess, only once
-
7:03 - 7:08because even once, I guess, is enough. So let me just
-
7:08 - 7:11try a different, what's called an event model
-
7:11 - 7:15for Naive Bayes that will take into account the number of times a word appears in
-
7:15 - 7:17the email,
-
7:17 - 7:17and to give
-
7:17 - 7:22this previous model a name as well this particular model for
-
7:22 - 7:24text
-
7:24 - 7:26classification
-
7:26 - 7:29
-
7:29 - 7:32
-
7:32 - 7:36is called the Multivariate Bernoulli Event Model. It's not a great name. Don't worry about what
-
7:36 - 7:40the name means. It
-
7:40 - 7:43refers to the fact that there are multiple Bernoulli random variables, but it's really - don't worry about
-
7:43 - 7:45what the name means.
-
7:45 - 7:46In contrast,
-
7:46 - 7:51what I want to do now is describe a different representation for email in terms of the
-
7:51 - 7:52feature
-
7:52 - 7:59vector, and this is called the Multinomial Event Model, and,
-
8:01 - 8:03again, there is a rationale behind the name, but it's slightly cryptic, so don't worry
-
8:03 - 8:07about why it's called the Multinomial Event Model; it's just called that.
-
8:07 - 8:09And here's what we're gonna do,
-
8:09 - 8:16given a piece of email, I'm going to represent my email as a feature vector,
-
8:16 - 8:19and so my Ith training
-
8:19 - 8:23example, XI, will be a feature vector,
-
8:23 - 8:29XI subscript one, XI subscript two, up to
-
8:29 - 8:31XI subscript
-
8:31 - 8:33NI
-
8:33 - 8:35where
-
8:35 - 8:39NI is equal to the
-
8:39 - 8:42number of words
-
8:42 - 8:46in this email, right? So if one of my training examples is an email with
-
8:46 - 8:47300 words in it,
-
8:47 - 8:50then I represent this email via
-
8:50 - 8:52a feature vector
-
8:52 - 8:57with 300 elements,
-
8:57 - 9:00and each
-
9:00 - 9:04of these elements of the feature vector - let's see. Let me just write this as X subscript
-
9:04 - 9:06J.
-
9:06 - 9:13These will be an index into my dictionary, okay?
-
9:13 - 9:16And so if my dictionary has 50,000 words, then
-
9:16 - 9:19each position in my feature vector
-
9:19 - 9:23will be a variable that takes on one of 50,000 possible
-
9:23 - 9:24values
-
9:24 - 9:26
-
9:26 - 9:32corresponding to what word appeared in the J position of my email, okay?
-
9:32 - 9:34So, in other words, I'm gonna take all the words in my email
-
9:34 - 9:36and you have a feature
-
9:36 - 9:39vector that just says which word
-
9:39 - 9:40in my dictionary
-
9:40 - 9:44was each word in the email, okay?
-
9:44 - 9:48So a different definition for NI now, NI now varies and is different for every
-
9:48 - 9:49training example,
-
9:49 - 9:53and this XJ is now indexed into the dictionary.
-
9:53 - 9:55You know, the components of the feature vector
-
9:55 - 9:59are no longer binary random variables; they're these indices in the dictionary that take on a much
-
9:59 - 10:03larger set of values.
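(A minimal sketch of this representation; the word-to-index dictionary vocab is a hypothetical input here:)

    def email_to_word_indices(email_text, vocab):
        # Represent an email as the sequence of dictionary indices of its words.
        # The length of the result is n_i, which differs from email to email,
        # and each element takes one of len(vocab) possible values.
        return [vocab[word] for word in email_text.lower().split() if word in vocab]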
-
10:03 - 10:06And so
-
10:06 - 10:09our generative model for this
-
10:09 - 10:10
-
10:10 - 10:17
-
10:23 - 10:24will be that the joint distribution
-
10:24 - 10:28over X and Y will be that, where again N is now the length of the email, all right?
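(Written out, the joint distribution being described is:)

    p(x, y) \;=\; p(y)\,\prod_{j=1}^{n} p(x_j \mid y)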
-
10:28 - 10:31So the way to think about this formula is you
-
10:31 - 10:32imagine
-
10:32 - 10:33that there was some
-
10:33 - 10:38probably distribution over emails. There's some random distribution that generates the emails,
-
10:38 - 10:41and that process proceeds as follows:
-
10:41 - 10:45First, Y is chosen, first the class label - whether someone is gonna send you spam email or not
-
10:45 - 10:49spam email is chosen for
-
10:49 - 10:54us. So first Y, the random variable Y, the class label of spam or not spam is generated,
-
10:54 - 10:59and then having decided whether they sent you spam or not spam,
-
10:59 - 11:03someone iterates over all 300 positions of the email,
-
11:03 - 11:07or the 300 words that are going to compose this email,
-
11:07 - 11:10and would generate words from some distribution that
-
11:10 - 11:14depends on whether they chose to send you spam or not spam. So if
-
11:14 - 11:17they sent you spam, they'll send you words - they'll tend to generate words like, you know, buy, and Viagra, and
-
11:17 - 11:20whatever at discounts, sale, whatever.
-
11:20 - 11:23And if somebody chose to send you not spam, then they'll send you, sort of, the more normal
-
11:23 - 11:28
-
11:28 - 11:31words you get in an email, okay? So, sort of, just be careful, right? XI here has a very different definition from the
-
11:31 - 11:35previous event model, and N has a very different definition from the previous event
-
11:35 - 11:38model.
-
11:38 - 11:45And so the parameters of this model are - let's see. Phi
-
11:45 - 11:48subscript K given Y equals one, which
-
11:48 - 11:51is the
-
11:51 - 11:55probability that,
-
11:55 - 11:57you know, conditioned on
-
11:57 - 11:59someone deciding to send you spam,
-
11:59 - 12:03what's the probability that the next word they choose to email you in
-
12:03 - 12:06the spam email is going to be word K, and
-
12:06 - 12:10similarly, you know, sort of, same thing - well, I'll just write it out, I guess - and Phi
-
12:10 - 12:17Y
-
12:21 - 12:26and just same as before, okay?
-
12:26 - 12:28So these are the parameters of the model,
-
12:28 - 12:29
-
12:29 - 12:31and
-
12:31 - 12:35given a training set, you can work out the maximum likelihood estimates of the
-
12:35 - 12:42
-
12:42 - 12:48parameters. So the maximum likelihood estimate of the parameters will be
-
12:48 - 12:49equal to -
-
12:49 - 12:53and now I'm gonna write one of these, you know, big indicator function things again. It'll be
-
12:53 - 12:58a sum over your training set of the
-
12:58 - 13:00indicator
-
13:00 - 13:07whether that was spam times the
-
13:07 - 13:10sum over all the words in that email where N subscript
-
13:10 - 13:15I is the number of words in email I in your training set,
-
13:15 - 13:22times indicator XIJ equals K, okay?
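(Reconstructing that estimate in LaTeX - this is the standard form of the quantity being described:)

    \phi_{k \mid y=1} \;=\;
      \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{ x^{(i)}_j = k \,\wedge\, y^{(i)} = 1 \}}
           {\sum_{i=1}^{m} 1\{ y^{(i)} = 1 \}\, n_i}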
-
13:39 - 13:40So
-
13:40 - 13:45the numerator says sum over all your emails and take into account all the emails that had
-
13:45 - 13:48class label one - take into account only the emails that were spam - because if Y equals zero, then
-
13:48 - 13:52this is zero, and this would go away,
-
13:52 - 13:54and then times sum over
-
13:54 - 13:57all the words in your spam email, and
-
13:57 - 14:01it counts up the number of times you observed the word K in your spam emails.
-
14:01 - 14:05So, in other words, the numerator is look at all the spam emails in your training set
-
14:05 - 14:12and count up the total number of times the word K appeared in this email. The
-
14:14 - 14:17denominator then is sum over I into our training set of
-
14:17 - 14:20whenever one of your examples is spam,
-
14:20 - 14:21you know, sum up the length
-
14:21 - 14:23of that spam email,
-
14:23 - 14:26and so the denominator is the total length
-
14:26 - 14:32of all of the spam emails in your training set.
-
14:32 - 14:37And so the ratio is just out of all your spam emails, what is the fraction of words in
-
14:37 - 14:39all your spam emails that were word K,
-
14:39 - 14:42and that's your estimate for the probability
-
14:42 - 14:44of the next piece of spam mail
-
14:44 - 14:51generating the word K in any given position, okay? At the
-
14:53 - 14:56end of the previous lecture, I talked about Laplace smoothing,
-
14:56 - 14:59and so when you do that as
-
14:59 - 15:03well, you add one to the numerator and K to the denominator, and this is
-
15:03 - 15:07the Laplace smoothed estimate
-
15:07 - 15:09of this parameter,
-
15:09 - 15:11okay? And similarly, you do the same thing for -
-
15:11 - 15:14and you can work out the estimates for the
-
15:14 - 15:21other parameters yourself, okay? So it's very similar. Yeah? Student:I'm sorry. On the right on the top, I was just
-
15:21 - 15:23wondering
-
15:23 - 15:24
-
15:24 - 15:26what the X of I is, and what the N of -
-
15:26 - 15:28Instructor (Andrew Ng):Right. So
-
15:28 - 15:32in this second event model, the definition for XI and the definition for N
-
15:32 - 15:35are different, right? So here -
-
15:35 - 15:42well, this is for one example XY. So here, N is the number of words
-
15:44 - 15:46in a given email,
-
15:46 - 15:51right? And if it's the Ith email, subscripting, then this is N subscript I, and so
-
15:51 - 15:56N will be different for different training examples, and here
-
15:56 - 16:01XI will be, you know,
-
16:01 - 16:04these values from 1 to 50,000,
-
16:04 - 16:06and XI is
-
16:06 - 16:07essentially
-
16:07 - 16:11the identity of the Ith
-
16:11 - 16:11word
-
16:11 - 16:15in a given piece of email,
-
16:15 - 16:17okay? So that's why this is
-
16:17 - 16:20a product over all the different words of your email
-
16:20 - 16:22of their probability - the probability of
-
16:22 - 16:24the Ith word in your email,
-
16:24 - 16:31conditioned on Y. Yeah?
-
16:32 - 16:36Student:[Off mic]. Instructor (Andrew Ng):Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation, right,
-
16:36 - 16:39and I shouldn't have used K here. Let
-
16:39 - 16:44me use a different alphabet and see if that makes sense; does
-
16:44 - 16:46that make sense? Oh, you know what, I'm sorry.
-
16:46 - 16:49You're absolutely right.
-
16:49 - 16:51Thank you. All right. So
-
16:51 - 16:55in Laplace smoothing, that shouldn't be
-
16:55 - 16:58K. This should be, you know, 50,000,
-
16:58 - 17:02if you have 50,000 words in your dictionary. Yeah, thanks. Great.
-
17:02 - 17:05I stole notation from the previous lecture and
-
17:05 - 17:07didn't translate it properly.
-
17:07 - 17:08So Laplace smoothing, right,
-
17:08 - 17:15this is the number of possible values that the random variable XI can
-
17:16 - 17:17take on. Cool. Raise your hand if
-
17:17 - 17:22this makes sense? Okay. Some of you, are there
-
17:22 - 17:29more questions about this?
-
17:35 - 17:42
-
17:42 - 17:44Yeah. Student:On Laplace smoothing, the denominator and the plus A is the number of values that Y could take? Instructor (Andrew Ng):Yeah,
-
17:44 - 17:45let's see.
-
17:45 - 17:50So Laplace smoothing is
-
17:50 - 17:51a
-
17:51 - 17:56method to give you, sort of, hopefully, better estimates of their probability
-
17:56 - 17:58distribution over a multinomial,
-
17:58 - 18:00and so
-
18:00 - 18:03was I using X or Y in the previous lecture?
-
18:03 - 18:05So
-
18:05 - 18:08in trying to estimate the probability over a multinomial - I think X
-
18:08 - 18:11and Y are different. I think -
-
18:11 - 18:14was it X or Y? I think it was X, actually. Well - oh,
-
18:14 - 18:17I see, right, right. I think I was using a different definition
-
18:17 - 18:19for the random variable Y because
-
18:19 - 18:24suppose you have a multinomial random variable, X
-
18:24 - 18:28which takes on - let's
-
18:28 - 18:30use a different alphabet. Suppose you have
-
18:30 - 18:35a multinomial random variable X which takes on L different values, then
-
18:35 - 18:37the maximum likelihood estimate
-
18:37 - 18:41for the probability of X,
-
18:41 - 18:42P of X equals K,
-
18:42 - 18:44will be equal to, right,
-
18:44 - 18:47the number of observations.
-
18:47 - 18:52
-
18:52 - 18:55The maximum likelihood estimate for the
-
18:55 - 19:02probability of X being equal to K will be the number of observations of X
-
19:02 - 19:05equals K divided
-
19:05 - 19:11by the total number of observations
-
19:11 - 19:11of X,
-
19:11 - 19:14okay? So that's the maximum likelihood estimate.
-
19:14 - 19:18And to add Laplace smoothing to this, you, sort of, add one to the numerator,
-
19:18 - 19:20and you add L to the
-
19:20 - 19:21
-
19:21 - 19:23denominator
-
19:23 - 19:26where L was the number of possible values that X can take on.
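(In symbols, for m observations x^{(1)}, ..., x^{(m)} of a multinomial random variable X taking L possible values:)

    \hat{p}(X = k) \;=\; \frac{\sum_{i=1}^{m} 1\{ x^{(i)} = k \} + 1}{m + L}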
-
19:26 - 19:29So, in this case,
-
19:29 - 19:32this is a probability that X equals K,
-
19:32 - 19:36and X can take on 50,000 values if 50,000 is the length of your dictionary; it may be
-
19:36 - 19:40something else, but that's why I add 50,000 to the denominator. Are there other questions? Yeah. Student:Is there a specific
-
19:40 - 19:47definition for a
-
19:48 - 19:49maximum likelihood estimation of a parameter? We've talked about it a couple times, and all the examples
-
19:49 - 19:52make sense, but I don't know what the, like, general formula for it is.
-
19:52 - 19:54Instructor (Andrew Ng):I see. Yeah, right. So the definition of maximum likelihood - so
-
19:54 - 19:56the question is
-
19:56 - 20:02what's the definition for maximum likelihood estimate? So actually
-
20:02 - 20:05in today's lecture and the previous lecture when I talk about Gaussian Discriminant Analysis I
-
20:05 - 20:07was, sort of,
-
20:07 - 20:09throwing out the maximum likelihood estimates on the board
-
20:09 - 20:11without proving them.
-
20:11 - 20:16The way to actually work this out is
-
20:16 - 20:20to actually
-
20:20 - 20:27write down the likelihood.
-
20:29 - 20:33So the way to figure out all of these maximum likelihood estimates is to write
-
20:33 - 20:34down
-
20:34 - 20:35the likelihood of
-
20:35 - 20:41the parameters, phi K given Y equals one, phi K given Y equals
-
20:41 - 20:44zero, phi
-
20:44 - 20:44Y,
-
20:44 - 20:47right? And so given a training set,
-
20:47 - 20:50the likelihood, I guess, I should be writing log likelihood
-
20:50 - 20:55will be the log of the product of I equals one to
-
20:55 - 20:56M, P of XI, YI,
-
20:56 - 20:58you
-
20:58 - 21:05know, parameterized by these things, okay?
-
21:07 - 21:09Where P of XI, YI, right, is given by the product from J equals one to NI of P of XIJ given YI. They are
-
21:09 - 21:12parameterized
-
21:12 - 21:14by
-
21:14 - 21:19
-
21:19 - 21:22
-
21:22 - 21:27- well,
-
21:27 - 21:28
-
21:28 - 21:32I'll just drop the parameters to write this
-
21:32 - 21:39
-
21:42 - 21:49more simply - oh, I just put it in - times P of YI, okay?
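(Putting that together, the log likelihood just described is:)

    \ell(\phi_{k \mid y=1}, \phi_{k \mid y=0}, \phi_y)
      \;=\; \log \prod_{i=1}^{m}
            \Big( \prod_{j=1}^{n_i} p\big(x^{(i)}_j \mid y^{(i)}\big) \Big)\, p\big(y^{(i)}\big)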
-
21:49 - 21:51So this is my log likelihood,
-
21:51 - 21:55and so the way you get the maximum likelihood estimate of the parameters
-
21:55 - 21:56is you -
-
21:56 - 22:00so if given a fixed training set, given a set of fixed XI, YI's,
-
22:00 - 22:02you maximize this in terms of
-
22:02 - 22:06these parameters, and then you get the maximum likelihood estimates that I've been writing
-
22:06 - 22:08out.
-
22:08 - 22:11So in a previous section of today's lecture I wrote out some maximum likelihood estimates
-
22:11 - 22:14for the Gaussian Discriminant Analysis model, and
-
22:14 - 22:16for Naive Bayes,
-
22:16 - 22:18and then this - I didn't prove them,
-
22:18 - 22:22but you get to, sort of, play with that yourself in the homework problem as well
-
22:22 - 22:26and for one of these models, and you'll be able to verify that when
-
22:26 - 22:30you maximize the likelihood and maximize the log likelihood
-
22:30 - 22:34that hopefully you do get the same formulas as what I was drawing up on the board, but the way
-
22:34 - 22:35to find how
-
22:35 - 22:42these are derived is by maximizing this, okay? Cool. All right.
-
22:42 - 22:45So
-
22:45 - 22:51that wraps up what I wanted to say about - oh,
-
22:51 - 22:56so that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that
-
22:56 - 23:01for text classification, the Naive Bayes event
-
23:01 - 23:04model, the last
-
23:04 - 23:07Naive Bayes event model,
-
23:07 - 23:10it turns out that almost always does better than
-
23:10 - 23:11the
-
23:11 - 23:15first Naive Bayes model I talked about when you're applying it to the
-
23:15 - 23:19specific case - to the specific case of text classification,
-
23:19 - 23:21and
-
23:21 - 23:25one of the reasons hypothesized for this is that this second model,
-
23:25 - 23:27the multinomial event model,
-
23:27 - 23:30takes into account
-
23:30 - 23:34the number of times a word appears in a document, whereas the former model
-
23:34 - 23:35doesn't.
-
23:35 - 23:37I should say that in truth
-
23:37 - 23:41it actually turns out not to be completely understood why the latter
-
23:41 - 23:44model does better than the former one for text classification, and, sort of,
-
23:44 - 23:46researchers are still
-
23:46 - 23:47debating about it a little bit, but if you
-
23:47 - 23:51ever have a text classification problem, you know, the Naive Bayes classifier is probably
-
23:51 - 23:52not,
-
23:52 - 23:54by far, the best learning algorithm out there,
-
23:54 - 23:56but it is
-
23:56 - 23:58relatively straightforward to implement, and it's a very good
-
23:58 - 24:03algorithm to try if you have a text classification problem, okay? Still a question? Yeah. Student:So the second model is still position invariant, right? It
-
24:03 - 24:04doesn't
-
24:04 - 24:07actually
-
24:07 - 24:08
-
24:08 - 24:14care where the words are. Instructor (Andrew Ng):Yes, all right. Student:And, I
-
24:14 - 24:15mean, a variant, like a model that did have position in it, does
-
24:15 - 24:17that usually
-
24:17 - 24:18do
-
24:18 - 24:20better if you have enough data?
-
24:20 - 24:23Instructor (Andrew Ng):Yeah, so the question is, sort of, the second model, right? The second model, the multinomial event model actually doesn't care about the ordering of the words.
-
24:23 - 24:27You can shuffle all the words in the email, and it does exactly the same thing.
-
24:27 - 24:28So
-
24:28 - 24:31in natural language processing, there's actually another name; it's called a unigram
-
24:31 - 24:33model in natural language processing,
-
24:33 - 24:36and there's some other models like, sort of,
-
24:36 - 24:39say, higher order Markov models that take into account some of the ordering
-
24:39 - 24:44of the words. It turns out that for text classification, the models
-
24:44 - 24:45like the
-
24:45 - 24:49bigram models or trigram models,
-
24:49 - 24:53I believe they do only very slightly better, if at all,
-
24:53 - 25:00but that's when you're applying them to text classification, okay? All right.
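(A rough Python sketch of the multinomial event model just summarized, with Laplace smoothing as described; it assumes emails have already been converted to lists of dictionary indices, for example with the earlier email_to_word_indices sketch, and the variable names are my own:)

    import math
    from collections import defaultdict

    def train_multinomial_nb(emails, labels, vocab_size):
        # emails: list of emails, each a list of word indices in {0, ..., vocab_size - 1}
        # labels: list of 0/1 labels (1 = spam)
        word_counts = {0: defaultdict(int), 1: defaultdict(int)}
        total_words = {0: 0, 1: 0}
        num_spam = 0
        for words, y in zip(emails, labels):
            num_spam += y
            total_words[y] += len(words)
            for k in words:
                word_counts[y][k] += 1
        phi_y = num_spam / len(emails)
        # phi[y][k]: fraction of words in class-y emails that are word k, Laplace smoothed.
        phi = {y: [(word_counts[y][k] + 1) / (total_words[y] + vocab_size)
                   for k in range(vocab_size)]
               for y in (0, 1)}
        return phi, phi_y

    def predict_multinomial_nb(words, phi, phi_y):
        # Choose y maximizing log p(y) + sum_j log p(x_j | y),
        # which is the arg max rule from the start of the lecture.
        scores = {1: math.log(phi_y) + sum(math.log(phi[1][k]) for k in words),
                  0: math.log(1 - phi_y) + sum(math.log(phi[0][k]) for k in words)}
        return max(scores, key=scores.get)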
-
25:04 - 25:07So the next thing I want to talk about is to start, again, a discussion of
-
25:07 - 25:12non-linear classifiers. So
-
25:12 - 25:19it turns out
-
25:22 - 25:28- well, and so the very first
-
25:28 - 25:31classification algorithm we talked about was logistic regression, which
-
25:31 - 25:33had the following form for its
-
25:33 - 25:36hypothesis,
-
25:36 - 25:38and
-
25:38 - 25:39you can think of this as
-
25:39 - 25:41predicting one when
-
25:41 - 25:44this estimator probability is greater or equal to 0.5 and
-
25:44 - 25:46predicting zero,
-
25:46 - 25:47right,
-
25:47 - 25:50when this is less than 0.5,
-
25:50 - 25:53and given a training set, right?
-
25:53 - 26:00Logistic regression
-
26:03 - 26:06will maybe do gradient descent or something or
-
26:06 - 26:07use Newton's method
-
26:07 - 26:09to find a straight line that
-
26:09 - 26:12reasonably separates the positive and negative classes.
-
26:12 - 26:16But sometimes a data set just can't be separated by a straight line, so is there
-
26:16 - 26:17an algorithm
-
26:17 - 26:18that will let you
-
26:18 - 26:25start to learn these sorts of non-linear decision boundaries?
-
26:26 - 26:29And so how do you go about getting a non-linear classifier? And, by the
-
26:29 - 26:34way, one cool result is that remember when I said - when we talked
-
26:34 - 26:38about generative learning algorithms, I said that
-
26:38 - 26:41if you assume X given Y equals one is
-
26:41 - 26:44exponential family,
-
26:44 - 26:45
-
26:45 - 26:47right, with natural parameter
-
26:47 - 26:51eta one, and if you build a generative learning algorithm using this, right, where X given Y equals
-
26:51 - 26:56zero, if this is
-
26:56 - 27:00exponential family
-
27:00 - 27:02with natural parameter eta zero,
-
27:02 - 27:06right. I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true,
-
27:06 - 27:06then
-
27:06 - 27:09you end up with a logistic posterior. It
-
27:09 - 27:11actually turns out that a Naive Bayes model
-
27:11 - 27:13actually falls into this
-
27:13 - 27:17as well. So the Naive Bayes model actually falls into this exponential family as well, and,
-
27:17 - 27:19therefore,
-
27:19 - 27:23under the Naive Bayes model, you're actually
-
27:23 - 27:27using this other linear classifier as well, okay?
-
27:27 - 27:29So the question is
-
27:29 - 27:32how can you start to get non-linear classifiers?
-
27:32 - 27:36And I'm going to talk about one method today
-
27:36 - 27:42which is - and
-
27:42 - 27:45we started to talk about it very briefly which is
-
27:45 - 27:47taking a
-
27:47 - 27:49simpler algorithm like
-
27:49 - 27:51logistic regression
-
27:51 - 27:58and using it to build up to more complex non-linear classifiers, okay? So
-
27:58 - 28:00to motivate this discussion,
-
28:00 - 28:05I'm going to use the little picture - let's see. So suppose you have features X1,
-
28:05 - 28:07X2, and X3,
-
28:07 - 28:10and so by convention, I'm gonna follow
-
28:10 - 28:13our earlier convention that X0 is set to one,
-
28:13 - 28:17and so I'm gonna use a little diagram like this
-
28:17 - 28:18to denote
-
28:18 - 28:20our
-
28:20 - 28:24logistic regression unit, okay? So
-
28:24 - 28:29think of a little picture like that, you know, this little circle as denoting a
-
28:29 - 28:29computation node
-
28:29 - 28:33that takes this input, you know, several features and then it outputs another
-
28:33 - 28:36number that's H subscript theta of X, given by
-
28:36 - 28:38a sigmoid function,
-
28:38 - 28:40and so this little computational unit -
-
28:40 - 28:43well, will have parameters theta.
-
28:43 - 28:48Now, in order to get non-linear decision boundaries, all we need to do -
-
28:48 - 28:51well, at least one thing to do is just
-
28:51 - 28:53come up with
-
28:53 - 28:54
-
28:54 - 28:57a way to represent hypotheses
-
28:57 - 29:01that can output non-linear decision boundaries, right,
-
29:01 - 29:03and so
-
29:03 - 29:07this is -
-
29:07 - 29:10when you put a bunch of those little pictures
-
29:10 - 29:11that I drew
-
29:11 - 29:13on the previous board,
-
29:13 - 29:14you can then get
-
29:14 - 29:19what's called
-
29:19 - 29:21a Neural Network in which you
-
29:21 - 29:24think of having my features here
-
29:24 - 29:31and then I would feed them to
-
29:31 - 29:35say a few of these little
-
29:35 - 29:39sigmoidal units,
-
29:39 - 29:44and these together will feed into yet another sigmoidal unit, say,
-
29:44 - 29:46which will output
-
29:46 - 29:52my final output H subscript theta of X, okay? And
-
29:52 - 29:54just to give these things names, let me call
-
29:54 - 29:58the values output by these three intermediate sigmoidal units; let me call them A1,
-
29:58 - 30:01A2, A3.
-
30:01 - 30:03And let me just be
-
30:03 - 30:06completely concrete about what this formula represents, right? So
-
30:06 - 30:07
-
30:07 - 30:08each of these
-
30:08 - 30:12units in the middle will have their own associated set of parameters,
-
30:12 - 30:16and so the value A1 will be computed as
-
30:16 - 30:18G
-
30:18 - 30:21of X transpose,
-
30:21 - 30:27and then some set of parameters, which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta
-
30:27 - 30:29two,
-
30:29 - 30:31and A3 will be
-
30:31 - 30:34G of X transpose,
-
30:34 - 30:36theta three,
-
30:36 - 30:43where G is the sigmoid function, all right? So G of Z,
-
30:44 - 30:50and then, finally, our hypothesis will output
-
30:50 - 30:52G of
-
30:52 - 30:54A
-
30:54 - 30:57transpose
-
30:57 - 30:59theta four, right? Where, you
-
30:59 - 31:01know, this A vector
-
31:01 - 31:04is a vector of A1,
-
31:04 - 31:06A2, A3.
-
31:06 - 31:08We can append another one to it, as the first element,
-
31:08 - 31:11if you want, okay? Let
-
31:11 - 31:14me
-
31:14 - 31:16just draw up here this - I'm
-
31:16 - 31:18sorry about the cluttered board.
-
31:18 - 31:21And so H subscript theta of X,
-
31:21 - 31:24this is a function
-
31:24 - 31:29of all the parameters theta one
-
31:29 - 31:32through theta four,
-
31:32 - 31:37and so one way to learn parameters for this model
-
31:37 - 31:38is to
-
31:38 - 31:40write down the cost function,
-
31:40 - 31:41say, J of theta
-
31:41 - 31:46equals one-half sum from I equals one to
-
31:46 - 31:50M, YI minus H subscript theta of
-
31:50 - 31:52XI
-
31:52 - 31:54squared, say. Okay,
-
31:54 - 31:56so that's our familiar
-
31:56 - 31:58quadratic cost function,
-
31:58 - 32:00and so
-
32:00 - 32:04one way to learn the parameters of an algorithm like this is to just use gradient
-
32:04 - 32:05descent
-
32:05 - 32:09to minimize J of theta as a function of theta, okay? So you can apply
-
32:09 - 32:11gradient descent
-
32:11 - 32:15to minimize this squared error, which
-
32:15 - 32:18stated differently means you use gradient descent to make the predictions of your neural
-
32:18 - 32:20network as close as possible
-
32:20 - 32:27to what you observed as the labels in your training
-
32:31 - 32:35set, okay? So it turns out gradient descent on this neural network has a specific name; the
-
32:35 - 32:35algorithm
-
32:35 - 32:39that implements gradient descent is called back propagation, and so if you ever hear that, all that
-
32:39 - 32:41means is - it just means gradient descent on
-
32:41 - 32:45a cost function like this or a variation of this on the neural network
-
32:45 - 32:48that looks like that,
-
32:48 - 32:50and -
-
32:50 - 32:54well, this algorithm actually has some advantages and disadvantages, but let me
-
32:54 - 32:58actually show you. So, let's see.
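(Before the video, a minimal numpy sketch of the little network just drawn - three sigmoid hidden units feeding one sigmoid output, with the quadratic cost from the board; this is an illustration of the equations above, not Professor Ng's code, and it leaves out the optional intercept on the hidden layer:)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, theta1, theta2, theta3, theta4):
        # x includes the intercept term x_0 = 1, so x has shape (n + 1,).
        a = np.array([sigmoid(x @ theta1),    # a_1 = g(x^T theta_1)
                      sigmoid(x @ theta2),    # a_2 = g(x^T theta_2)
                      sigmoid(x @ theta3)])   # a_3 = g(x^T theta_3)
        return sigmoid(a @ theta4)            # h_theta(x) = g(a^T theta_4)

    def cost(X, y, thetas):
        # J(theta) = 1/2 * sum_i (y^(i) - h_theta(x^(i)))^2; back propagation is
        # gradient descent on this function with respect to all of the thetas.
        predictions = np.array([forward(x, *thetas) for x in X])
        return 0.5 * np.sum((y - predictions) ** 2)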
-
32:58 - 33:01One of the interesting things about the neural network is that
-
33:01 - 33:05you can look at what these intermediate nodes are computing, right? So
-
33:05 - 33:08this neural network has what's called
-
33:08 - 33:12a hidden layer before you then have the output layer,
-
33:12 - 33:14and, more generally, you can actually have inputs feed
-
33:14 - 33:16into these
-
33:16 - 33:20computation units, feed into more layers of computation units, to even more layers, to
-
33:20 - 33:23more layers, and then finally you have an output layer at the end
-
33:23 - 33:27And one cool thing you can do is look at all of these intermediate units, look at
-
33:27 - 33:28these
-
33:28 - 33:31units and what's called a hidden layer of the neural network. Don't
-
33:31 - 33:33worry about why it's called that. Look
-
33:33 - 33:34at computations of the hidden unit
-
33:34 - 33:40and ask what is the hidden unit computing the neural network? So
-
33:40 - 33:43to, maybe, get a better sense of what neural networks might be doing, let me show you a
-
33:43 - 33:45video - I'm gonna switch to the laptop - this is made
-
33:45 - 33:46by a
-
33:46 - 33:49friend, Yann LeCun
-
33:49 - 33:52who's currently a professor at New York University.
-
33:52 - 33:56Can I show a video on the laptop?
-
33:56 - 33:59So let me show you a video
-
33:59 - 34:00from Yann
-
34:00 - 34:06LeCun on a neural network that he developed for handwritten digit recognition.
-
34:06 - 34:10There was one other thing he did in this neural network that I'm not gonna talk about called
-
34:10 - 34:12a Convolutional Neural Network
-
34:12 - 34:16that - well,
-
34:16 - 34:19his system is called LeNet,
-
34:19 - 34:23and let's see. Would you put
-
34:23 - 34:23on the
-
34:23 - 34:30laptop display? Hum,
-
34:38 - 34:39actually maybe if -
-
34:39 - 34:42or you can just put on the screen on the side; that would work too
-
34:42 - 34:49if the big screen isn't working.
-
35:12 - 35:14Let's see. I'm
-
35:14 - 35:19just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come
-
35:19 - 35:24on? Well, let me say a few more things about neural network.
-
35:24 - 35:26So it turns out that
-
35:26 - 35:29when you write a quadratic cost function like I
-
35:29 - 35:32wrote down on the chalkboard just now,
-
35:32 - 35:35it turns out that unlike logistic regression,
-
35:35 - 35:39this will almost always correspond to a non-convex optimization problem,
-
35:39 - 35:43and so whereas for logistic regression if you run gradient descent
-
35:43 - 35:44or Newton's method or whatever,
-
35:44 - 35:46you converge to the global optimum.
-
35:46 - 35:50This is not true for neural networks. In general, there are lots of local optima
-
35:50 - 35:55and it's, sort of, a much harder optimization problem.
-
35:55 - 35:56So
-
35:56 - 36:00neural networks, if you're, sort of, familiar with them, and you're good at making design choices
-
36:00 - 36:02like what learning rate to use, and
-
36:02 - 36:04how many hidden units to use, and so on, you can,
-
36:04 - 36:08sort of, get them to be fairly effective,
-
36:08 - 36:09and
-
36:09 - 36:13there's, sort of, often ongoing debates about, you know, is this learning algorithm
-
36:13 - 36:14better, or is that learning
-
36:14 - 36:18algorithm better? The vast majority of machine learning researchers today seem to perceive
-
36:18 - 36:20support vector machines, which is what I'll talk about later,
-
36:20 - 36:24to be a much more effective off-the-shelf learning algorithm than neural networks.
-
36:24 - 36:27This point of view is contested
-
36:27 - 36:32a bit, but so neural networks are not something that I personally use a lot right
-
36:32 - 36:36now because there's a hard optimization problem and it doesn't always converge,
-
36:36 - 36:40and it actually, sort of works. It, sort of, works reasonably
-
36:40 - 36:42well. It's just
-
36:42 - 36:44because this is fairly complicated,
-
36:44 - 36:45there's not an algorithm that
-
36:45 - 36:52I use commonly or that my friends use all the
-
36:52 - 36:56time. Oh, cool. So but let me just go and show you an example of neural network, which was for many years,
-
36:56 - 36:57
-
36:57 - 37:00you know, the most effective learning algorithm before support vector
-
37:00 - 37:02machines were invented.
-
37:02 - 37:06So here's Yann LeCun's video, and -
-
37:06 - 37:09well, there's actually audio on this too, the soundboard. So I'll just tell you
-
37:09 - 37:11what's happening.
-
37:11 - 37:12
-
37:12 - 37:15What you're seeing is a trained neural network,
-
37:15 - 37:19and this display where my mouse pointer is pointing, right, this big three
-
37:19 - 37:19there
-
37:19 - 37:21is
-
37:21 - 37:25the input to the neural network. So you're showing the neural network this image, and it's
-
37:25 - 37:27trying to recognize what is this.
-
37:27 - 37:31The final answer output by the neural network is this number up here, right
-
37:31 - 37:33below where it says LeNet-5,
-
37:33 - 37:34and the
-
37:34 - 37:38neural network correctly recognizes this image as a three,
-
37:38 - 37:40and if you look to the left of this image,
-
37:40 - 37:42what's interesting about this is
-
37:42 - 37:47the display on the left portion of this is actually showing the
-
37:47 - 37:50intermediate computations of the neural network. In other words, it's showing you
-
37:50 - 37:52what are the hidden layers of the
-
37:52 - 37:54neural network computing.
-
37:54 - 37:58And so, for example, if you look at this one, the third image down from the top,
-
37:58 - 38:02this seems to be computing, you know, certain edges of the digits,
-
38:02 - 38:06right? It's just computing edges on the
-
38:06 - 38:07right-hand side or the bottom or something
-
38:07 - 38:10of the input display
-
38:10 - 38:11of the input image, okay?
-
38:11 - 38:15So let me just play this video, and you can see
-
38:15 - 38:17some
-
38:17 - 38:21of the inputs and outputs of the neural network, and
-
38:21 - 38:22those
-
38:22 - 38:26are very different fonts. There's this robustness to
-
38:26 - 38:31noise.
-
38:31 - 38:37All
-
38:37 - 38:38
-
38:38 - 38:42
-
38:42 - 38:43right. Multiple
-
38:43 - 38:49digits,
-
38:49 - 38:51
-
38:51 - 38:58that's, kind of, cool. All
-
38:59 - 39:06right.
-
39:12 - 39:19
-
39:31 - 39:36
-
39:36 - 39:39
-
39:39 - 39:40
-
39:40 - 39:42
-
39:42 - 39:49
-
39:52 - 39:55So,
-
39:55 - 39:58just for fun, let me show you one more video,
-
39:58 - 40:01which was - let's
-
40:01 - 40:06see. This is another video from the series The Machine That Changed the World, which
-
40:06 - 40:07was produced by WGBH
-
40:07 - 40:09Television
-
40:09 - 40:11in
-
40:11 - 40:13cooperation with the British Broadcasting Corporation,
-
40:13 - 40:16and it was aired on PBS a few years ago, I think.
-
40:16 - 40:18I want to show you a
-
40:18 - 40:20video describing the NETtalk
-
40:20 - 40:24Neural Network, which was developed by Terry Sejnowski; he's a
-
40:24 - 40:26researcher.
-
40:26 - 40:30And so NETtalk was actually one of the major milestones in the
-
40:30 - 40:32history of neural network,
-
40:32 - 40:33and this specific application
-
40:33 - 40:36is getting the neural network to read text.
-
40:36 - 40:39So, in other words, can you show a
-
40:39 - 40:42piece of English to a computer
-
40:42 - 40:43and have the computer read,
-
40:43 - 40:46sort of, verbally produce sounds that correspond
-
40:46 - 40:48to the
-
40:48 - 40:51reading of the text.
-
40:51 - 40:55And it turns out that in the history of AI and the history of machine learning,
-
40:55 - 40:57this video
-
40:57 - 41:01created a lot of excitement about neural networks and about machine learning. Part of the
-
41:01 - 41:05reason was that Terry Sejnowski had the foresight to choose
-
41:05 - 41:07to use,
-
41:07 - 41:08in his video,
-
41:08 - 41:13a child-like voice talking about visiting your grandmother's house and so on. You'll
-
41:13 - 41:14see it in a second,
-
41:14 - 41:16and so
-
41:16 - 41:19this really created the perception of - created
-
41:19 - 41:23the impression of the neural network being like a young child learning how to speak,
-
41:23 - 41:25and talking about going to your grandmothers, and so
-
41:25 - 41:27on. So this actually
-
41:27 - 41:29helped generate a lot of excitement
-
41:29 - 41:32within academia and outside academia on neural networks, sort of, early in the history
-
41:32 - 41:36of neural networks. I'm just gonna show you the video.
-
41:36 - 41:38[Begin Video] You're going to hear first
-
41:38 - 41:42what the network sounds like at the very beginning of the training, and it won't
-
41:42 - 41:44sound like words, but it'll sound
-
41:44 - 41:48like attempts that will get better and better with
-
41:48 - 41:54time. [Computer's voice] The network
-
41:54 - 41:58takes the letters, say the phrase, grandmother's house, and
-
41:58 - 42:03makes a random attempt at pronouncing it. [Computer's
-
42:03 - 42:07voice]
-
42:07 - 42:09Grandmother's house. The
-
42:09 - 42:13phonetic difference between the guess and the right pronunciation
-
42:13 - 42:17is sent back through the network. [Computer's
-
42:17 - 42:19voice]
-
42:19 - 42:20Grandmother's house.
-
42:20 - 42:24By adjusting the connection strengths after each attempt,
-
42:24 - 42:27the net slowly improves. And,
-
42:27 - 42:30finally, after letting it train overnight,
-
42:30 - 42:33the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's
-
42:33 - 42:38house.
-
42:38 - 42:41Well, because she gives us candy.
-
42:41 - 42:47Well, and we - NETtalk understands nothing about the language. It is simply associating letters
-
42:47 - 42:50with sounds.
-
42:50 - 42:52[End Video] All right. So
-
42:52 - 42:54at the time this was done, I mean, this is
-
42:54 - 42:57an amazing piece of work. I should say
-
42:57 - 42:59today there are other
-
42:59 - 43:03text to speech systems that work better than what you just saw,
-
43:03 - 43:08and you'll also appreciate getting candy from your grandmother's house is a little bit
-
43:08 - 43:11less impressive than talking about the Dow Jones falling 15 points,
-
43:11 - 43:16and profit taking, whatever. So
-
43:16 - 43:19but I wanted to show that just because that was another cool,
-
43:19 - 43:26major landmark in the history of neural networks. Okay.
-
43:27 - 43:33So let's switch back to the chalkboard,
-
43:33 - 43:36and what I want to do next
-
43:36 - 43:39is
-
43:39 - 43:42tell you about Support Vector Machines,
-
43:42 - 43:49okay? That, sort of, wraps up our discussion on neural
-
44:09 - 44:11networks. So I started off talking about neural networks
-
44:11 - 44:13by motivating it as a way to get
-
44:13 - 44:18us to output non-linear classifiers, right? I didn't really prove it, but it turns out
-
44:18 - 44:18that
-
44:18 - 44:22you'd be able to come up with non-linear decision boundaries using a neural
-
44:22 - 44:22network
-
44:22 - 44:27like what I drew on the chalkboard earlier.
-
44:27 - 44:30Support Vector Machines will be another learning algorithm that will give us a way
-
44:30 - 44:35to come up with non-linear classifiers. It's a very effective, off-the-shelf
-
44:35 - 44:36learning algorithm,
-
44:36 - 44:39but it turns out that in the discussion I'm gonna - in
-
44:39 - 44:43the progression and development I'm gonna pursue, I'm actually going to start
-
44:43 - 44:48off by describing yet another class of linear classifiers with linear decision
-
44:48 - 44:49boundaries,
-
44:49 - 44:54and it'll only be later, sort of, in probably the next lecture or the one after that,
-
44:54 - 44:55that we'll then
-
44:55 - 44:59take the support vector machine idea and, sort of, do some clever things to it
-
44:59 - 45:03to make it work very well to generate non-linear decision boundaries as well, okay?
-
45:03 - 45:08But we'll actually start by talking about linear classifiers a little bit more.
-
45:08 - 45:13And to do that, I want to
-
45:13 - 45:15convey two
-
45:15 - 45:17intuitions about classification.
-
45:17 - 45:21One is
-
45:21 - 45:24you think about logistic regression; we have this logistic function that was outputting
-
45:24 - 45:29the probability that Y equals one,
-
45:29 - 45:31and it crosses
-
45:31 - 45:33this line at zero.
-
45:33 - 45:38So when you run logistic regression, I want you to think of it as
-
45:38 - 45:41an algorithm that computes
-
45:41 - 45:43theta transpose X,
-
45:43 - 45:47and then it predicts
-
45:47 - 45:51one, right,
-
45:51 - 45:56if and only if, theta transpose X is greater than zero, right? IFF stands for if
-
45:56 - 46:00and only if. It means the same thing as a double implication,
-
46:00 - 46:05and it predicts zero,
-
46:05 - 46:12if and only if, theta transpose X is less than zero, okay? So if
-
46:12 - 46:15it's the case that
-
46:15 - 46:18theta transpose X is much greater than zero,
-
46:18 - 46:22the double greater than sign means these are much greater than, all right.
-
46:22 - 46:26So if theta transpose X is much greater than zero,
-
46:26 - 46:27then,
-
46:27 - 46:33you know, think of that as a very confident
-
46:33 - 46:39prediction
-
46:39 - 46:41that Y is equal to one,
-
46:41 - 46:43right? If theta transpose X is much greater than zero, then we're
-
46:43 - 46:47gonna predict one, and moreover we're very confident it's one, and the picture for
-
46:47 - 46:48that is if theta
-
46:48 - 46:51transpose X is way out here, then
-
46:51 - 46:54we're estimating that the probability of Y being equal to one
-
46:54 - 46:56on the sigmoid function, it will
-
46:56 - 46:59be very close to one.
-
46:59 - 47:03And, in the same way, if theta transpose X
-
47:03 - 47:05is much less than zero,
-
47:05 - 47:12then we're very confident that
-
47:13 - 47:16Y is equal to zero.
-
47:16 - 47:20So
-
47:20 - 47:24wouldn't it be nice - so when we fit logistic regression or some of these
-
47:24 - 47:25classifiers to your training set,
-
47:25 - 47:30then so wouldn't it be nice if,
-
47:30 - 47:30right,
-
47:30 - 47:34for all I
-
47:34 - 47:39such that Y is equal to one.
-
47:39 - 47:40We have
-
47:40 - 47:41theta
-
47:41 - 47:45transpose XI is much greater than zero,
-
47:45 - 47:48and for all I such that Y is equal
-
47:48 - 47:50to
-
47:50 - 47:57zero,
-
47:58 - 48:00we have theta transpose XI is
-
48:00 - 48:01much less than zero,
-
48:01 - 48:06okay? So wouldn't it be nice if this is true? That,
-
48:06 - 48:06
-
48:06 - 48:10essentially, if our training set, we can find parameters theta
-
48:10 - 48:13so that our learning algorithm not only
-
48:13 - 48:17makes correct classifications on all the examples in a training set, but further it's, sort
-
48:17 - 48:20of, is very confident about all of those correct
-
48:20 - 48:22classifications. This is
-
48:22 - 48:27the first intuition that I want you to have, and we'll come back to this first intuition
-
48:27 - 48:29in a second
-
48:29 - 48:33when we talk about functional margins, okay?
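(In symbols, the first intuition is the wish that, on the training set,)

    \theta^T x^{(i)} \gg 0 \;\text{ whenever } y^{(i)} = 1,
    \qquad
    \theta^T x^{(i)} \ll 0 \;\text{ whenever } y^{(i)} = 0.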
-
48:33 - 48:40We'll define this later. The second
-
48:44 - 48:48intuition that I want to
-
48:48 - 48:55convey,
-
48:57 - 49:00and it turns out for the rest of today's lecture I'm going to assume
-
49:00 - 49:04that a training set is linearly separable, okay? So by that I mean for the rest of
-
49:04 - 49:06
-
49:06 - 49:09today's lecture, I'm going to assume that
-
49:09 - 49:11there is indeed a straight line that can separate
-
49:11 - 49:15your training set, and we'll remove this assumption later, but
-
49:15 - 49:17just to develop the algorithm, let's take a
-
49:17 - 49:19linearly separable training set.
-
49:19 - 49:20And so
-
49:20 - 49:24there's a sense that out of all the straight lines that separate the training
-
49:24 - 49:29set, you know, maybe that straight line isn't such a good one,
-
49:29 - 49:33and that one actually isn't such a great one either, but
-
49:33 - 49:35maybe that line in the
-
49:35 - 49:39middle is a much better linear separator than the others, right?
-
49:39 - 49:41And one reason that
-
49:41 - 49:46when you and I look at it this one seems best
-
49:46 - 49:48is because this line is just further from the data, all right?
-
49:48 - 49:52That is, it separates the data with a greater distance between your positive and your negative
-
49:52 - 49:56examples and division boundary, okay?
-
49:56 - 49:59And this second intuition, we'll come back to this shortly,
-
49:59 - 50:03about this final line
-
50:03 - 50:06that I drew being, maybe, the best line
-
50:06 - 50:10this notion of distance from the training examples. This is the second intuition I want
-
50:10 - 50:11to convey,
-
50:11 - 50:14and we'll formalize it later when
-
50:14 - 50:18we talk about
-
50:18 - 50:25geometric margins of our classifiers, okay?
-
50:42 - 50:45So in order to describe support vector machine, unfortunately, I'm gonna
-
50:45 - 50:48have to pull a notation change,
-
50:48 - 50:50and, sort
-
50:50 - 50:54of, unfortunately, it, sort of, was impossible to do logistic regression,
-
50:54 - 50:56and support vector machines,
-
50:56 - 51:01and all the other algorithms using one completely consistent notation,
-
51:01 - 51:04and so I'm actually gonna change notations slightly
-
51:04 - 51:08for linear classifiers, and that will actually make it much easier for us -
-
51:08 - 51:10that'll make it much easier later today
-
51:10 - 51:15and in next week's lectures to actually talk about support vector machine. But the
-
51:15 - 51:20notation that I'm gonna use for the rest
-
51:20 - 51:22of today and for most of next week
-
51:22 - 51:26will be that my labels Y,
-
51:26 - 51:30instead of being zero, one, will be minus one and plus one,
-
51:30 - 51:34
-
51:34 - 51:37and in the development of a support vector machine
-
51:37 - 51:38we will have H,
-
51:38 - 51:44have a hypothesis
-
51:44 - 51:49output values that
-
51:49 - 51:53are either plus one or minus one,
-
51:53 - 51:55and so
-
51:55 - 51:58we'll let G of Z be equal to
-
51:58 - 52:00one if Z is
-
52:00 - 52:04greater or equal to zero, and minus one otherwise, right? So just rather than zero and
-
52:04 - 52:08one, we change everything to plus one and minus one.
-
52:08 - 52:10And, finally,
-
52:10 - 52:14whereas previously I wrote H subscript
-
52:14 - 52:17theta of X equals
-
52:17 - 52:18G of theta transpose X
-
52:18 - 52:21and we had the convention
-
52:21 - 52:23that X zero is equal to one,
-
52:23 - 52:26right? And so X is an RN plus
-
52:26 - 52:29one.
-
52:29 - 52:34I'm gonna drop this convention of
-
52:34 - 52:38letting X zero equals a one, and letting X be an RN plus one, and instead I'm
-
52:38 - 52:44going to parameterize my linear classifier as H subscript W, B of X
-
52:44 - 52:46equals G of
-
52:46 - 52:51W transpose X plus B, okay?
-
52:51 - 52:51And so
-
52:51 - 52:53B
-
52:53 - 52:55just now plays the role of
-
52:55 - 52:56theta zero,
-
52:56 - 53:03and W now plays the role of the rest of the parameters, theta one
-
53:04 - 53:07through theta N, okay? So just by separating out
-
53:07 - 53:11the intercept term B rather than lumping it together, it'll make it easier for
-
53:11 - 53:11us
-
53:11 - 53:18to develop support vector machines.
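(Summarizing the notation change in symbols:)

    y \in \{-1, +1\}, \qquad
    h_{w,b}(x) = g(w^T x + b), \qquad
    g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}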
-
53:18 - 53:25So - yes. Student:[Off mic]. Instructor
-
53:28 - 53:32(Andrew Ng):Oh, yes. Right, yes. So W is -
-
53:32 - 53:33right. So W
-
53:33 - 53:35is a vector in RN, and
-
53:35 - 53:37X
-
53:37 - 53:43is now a vector in RN rather than N plus one,
-
53:43 - 53:50and a lowercase b is a real number. Okay.
-
53:56 - 54:03Now, let's formalize the notion of functional margin and geometric margin. Let me make a
-
54:03 - 54:04definition.
-
54:04 - 54:11I'm going to say that the functional margin
-
54:13 - 54:20of the hyper plane
-
54:20 - 54:22WB
-
54:22 - 54:25with respect to a specific training example,
-
54:25 - 54:27XIYI is - WRT
-
54:27 - 54:32stands for with respect to -
-
54:32 - 54:35the function margin of a hyper plane WB with
-
54:35 - 54:38respect to
-
54:38 - 54:43a certain training example, XIYI has been defined as Gamma
-
54:43 - 54:45Hat I equals YI
-
54:45 - 54:47times W transpose XI
-
54:47 - 54:51plus B, okay?
-
54:51 - 54:55And so a set of parameters, W, B defines
-
54:55 - 54:56a
-
54:56 - 55:00classifier - it, sort of, defines a linear separating boundary,
-
55:00 - 55:01and so
-
55:01 - 55:04when I say hyper plane, I just mean
-
55:04 - 55:07the decision boundary that's
-
55:07 - 55:11defined by the parameters W, B.
-
55:11 - 55:13You know what,
-
55:13 - 55:18if you're confused by the hyper plane term, just ignore it. The functional margin of a classifier with
-
55:18 - 55:19parameters W, B
-
55:19 - 55:21with respect to a training example
-
55:21 - 55:24is given by this formula, okay?
-
55:24 - 55:26And interpretation of this is that
-
55:26 - 55:28if
-
55:28 - 55:30YI is equal to one,
-
55:30 - 55:33then for each to have a large functional margin, you
-
55:33 - 55:37want W transpose XI plus B to
-
55:37 - 55:38be large,
-
55:38 - 55:39right?
-
55:39 - 55:41And if YI
-
55:41 - 55:44is equal minus one,
-
55:44 - 55:47then in order for the functional margin to be large - we, sort of, want the functional margins
-
55:47 - 55:50to be large, but in order for the functional margins to be large,
-
55:50 - 55:55if YI is equal to minus one, then the only way for this to be big is if W
-
55:55 - 55:57transpose XI
-
55:57 - 55:58plus B
-
55:58 - 56:04is much less than zero, okay? So this
-
56:04 - 56:07captures the intuition that we had earlier about functional
-
56:07 - 56:09margins - the
-
56:09 - 56:13intuition we had earlier that if YI is equal to one, we want this to be big, and if YI
-
56:13 - 56:15is equal to minus one, we
-
56:15 - 56:16want this to be small,
-
56:16 - 56:17and this, sort of,
-
56:17 - 56:19compresses the two cases
-
56:19 - 56:23into one statement that we'd like the functional margin to be large.
-
56:23 - 56:27And notice this is also that
-
56:27 - 56:31so long as YI times W transpose XI
-
56:31 - 56:34plus B, so long as this is greater than zero,
-
56:34 - 56:37that means we
-
56:37 - 56:44classified it correctly, okay?
-
57:14 - 57:18And one more definition, I'm going to say that
-
57:18 - 57:20the functional margin of
-
57:20 - 57:24a hyper plane with respect to an entire training set
-
57:24 - 57:31is
-
57:32 - 57:34going to define gamma hat to
-
57:34 - 57:36be equal to
-
57:36 - 57:36min
-
57:36 - 57:40over all your training examples of gamma hat, I, right?
-
57:40 - 57:41So if you have
-
57:41 - 57:44a training set, if you have just more than one training example,
-
57:44 - 57:47I'm going to define the functional margin
-
57:47 - 57:50with respect to the entire training set as
-
57:50 - 57:52the worst case of all of your
-
57:52 - 57:54functional margins of the entire training set.
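(Side by side, the two definitions are:)

    \hat{\gamma}^{(i)} = y^{(i)}\big(w^T x^{(i)} + b\big),
    \qquad
    \hat{\gamma} = \min_{i} \hat{\gamma}^{(i)}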
-
57:54 - 57:58And so for now we should think of the
-
57:58 - 58:02first function like an intuition of saying that we would like the function margin
-
58:02 - 58:03to be large,
-
58:03 - 58:06and for our purposes, for now,
-
58:06 - 58:07let's just say we would like
-
58:07 - 58:10the worst-case functional margin to be large, okay? And
-
58:10 - 58:16we'll change this a little bit later as well. Now, it turns out that
-
58:16 - 58:20there's one little problem with this intuition that will, sort of, edge
-
58:20 - 58:23us later,
-
58:23 - 58:26which it actually turns out to be very easy to make the functional margin large, all
-
58:26 - 58:29right? So, for example,
-
58:29 - 58:32so, say, I have a classifier with parameters W and
-
58:32 - 58:37B. If I take W and multiply it by two and take B and multiply it by two,
-
58:37 - 58:42then if you refer to the definition of the functional margin, I guess that was what?
-
58:42 - 58:43Gamma I,
-
58:43 - 58:44gamma hat I
-
58:44 - 58:48equals YI
-
58:48 - 58:51times W transpose XI plus B. If
-
58:51 - 58:53I double
-
58:53 - 58:55W and B,
-
58:55 - 58:55then
-
58:55 - 58:59I can easily double my functional margin. So this goal
-
58:59 - 59:03of making the functional margin large, in and of itself, isn't so
-
59:03 - 59:06useful because it's easy to make the functional margin arbitrarily large
-
59:06 - 59:09just by scaling other parameters. And so
-
59:09 - 59:13maybe one thing we need to do later is add a normalization
-
59:13 - 59:15condition.
-
59:15 - 59:17For example, maybe
-
59:17 - 59:20we want to add a normalization condition that the norm, the
-
59:20 - 59:22L2 norm of
-
59:22 - 59:23the parameter W
-
59:23 - 59:24is equal to one,
-
59:24 - 59:28and we'll come back to this in a second. All
-
59:28 - 59:34right. And then so - Okay.
-
59:34 - 59:40Now, let's talk about - see how much
-
59:40 - 59:47time we have, 15 minutes. Well, see, I'm trying to
-
59:53 - 60:00decide how much to try to do in the last 15 minutes. Okay. So let's talk
-
60:05 - 60:11about the geometric margin,
-
60:11 - 60:13and so
-
60:13 - 60:17the geometric margin of a training example -
-
60:17 - 60:24[inaudible], right?
-
60:27 - 60:31So the decision boundary of my classifier
-
60:31 - 60:33is going to be given by the plane W transpose X
-
60:33 - 60:35plus B is equal
-
60:35 - 60:38to zero, okay? Right, and these are the
-
60:38 - 60:41X1, X2 axes, say,
-
60:41 - 60:43and
-
60:43 - 60:48we're going to draw relatively few training examples here. Let's say I'm drawing
-
60:48 - 60:53deliberately few training examples so that I can add things to this, okay?
-
60:53 - 60:55And so
-
60:55 - 60:57assuming we classified an example correctly, I'm
-
60:57 - 61:01going to define the geometric margin
-
61:01 - 61:06as just the geometric distance between the
-
61:06 - 61:07training example -
-
61:07 - 61:10yeah, the training example XI,
-
61:10 - 61:12YI
-
61:12 - 61:13and
-
61:13 - 61:15this separating
-
61:15 - 61:18line, this separating hyperplane, okay?
-
61:18 - 61:20That's what I'm going to define the
-
61:20 - 61:24geometric margin to be.
-
61:24 - 61:26And so I'm gonna do
-
61:26 - 61:30some algebra fairly quickly. In case it doesn't make sense, read
-
61:30 - 61:34through the lecture notes more carefully for details.
-
61:34 - 61:36Sort of, by standard geometry,
-
61:36 - 61:41the normal, or in other words, the vector that's 90 degrees to the
-
61:41 - 61:42separating hyperplane
-
61:42 - 61:45is going to be given by W divided by
-
61:45 - 61:47the norm of W; that's just how
-
61:47 - 61:50planes in high dimensions work. If this
-
61:50 - 61:53stuff - if some of this isn't clear, take a look at the lecture notes on the
-
61:53 - 61:56website.
-
61:56 - 61:58And so let's say this distance
-
61:58 - 62:00is
-
62:00 - 62:01
-
62:01 - 62:02gamma I,
-
62:02 - 62:05okay? And so I'm going to use the convention that
-
62:05 - 62:09I'll put a hat on top where I'm referring to functional margins,
-
62:09 - 62:12and no hat on top for geometric margins. So let's say geometric margin,
-
62:12 - 62:14in this example, is
-
62:14 - 62:18gamma I.
-
62:18 - 62:22That means that
-
62:22 - 62:26this point here, right,
-
62:26 - 62:29is going to be
-
62:29 - 62:32XI minus
-
62:32 - 62:36gamma I times W
-
62:36 - 62:39over the norm of W, okay?
-
62:39 - 62:41Because
-
62:41 - 62:45W over the norm of W is the unit vector, the length-one vector that
-
62:45 - 62:48is normal to the separating hyperplane,
-
62:48 - 62:52and so when we subtract gamma I times the unit vector
-
62:52 - 62:55from this point, XI - and this point here is XI.
-
62:55 - 62:57So XI minus,
-
62:57 - 62:59you know, this little vector here
-
62:59 - 63:02is going to be this point that I've drawn as
-
63:02 - 63:03a heavy circle,
-
63:03 - 63:06okay? So this heavy point here is
-
63:06 - 63:07XI minus this vector,
-
63:07 - 63:14and this vector is gamma I times W over the norm of W, okay?
-
63:16 - 63:18And so
-
63:18 - 63:21because this heavy point is on the separating hyperplane,
-
63:21 - 63:23right, this point
-
63:23 - 63:25must satisfy
-
63:25 - 63:29W transpose times
-
63:29 - 63:35that point
-
63:35 - 63:36equals zero,
-
63:36 - 63:37right? Because
-
63:37 - 63:38all
-
63:38 - 63:41points X on the separating hyperplane satisfy the equation W transpose X plus B
-
63:41 - 63:43equals zero,
-
63:43 - 63:46and so this point is on the separating hyper plane, therefore,
-
63:46 - 63:49it must satisfy W transpose this point - oh,
-
63:49 - 63:54excuse me. Plus B is equal
-
63:54 - 63:58to zero, okay?
-
63:58 - 64:01Raise your hand if this makes sense so far? Oh,
-
64:01 - 64:04okay. Cool, most of you, but, again,
-
64:04 - 64:07I'm, sort of, being slightly fast in this geometry. So if you're not quite sure
-
64:07 - 64:10why this is a normal vector, or how I subtracted
-
64:10 - 64:17this, or whatever, take a look at the details in the lecture notes.
-
64:21 - 64:24And so
-
64:24 - 64:28what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? So
-
64:28 - 64:30this equation I just wrote down,
-
64:30 - 64:32solve this equation for gamma
-
64:32 - 64:34or gamma I,
-
64:34 - 64:41and you find that -
-
64:44 - 64:48you solve that previous equation for gamma I -
-
64:48 - 64:50well, why don't I just do it?
-
64:50 - 64:53You have W transpose XI plus
-
64:53 - 64:55B
-
64:55 - 64:55equals
-
64:55 - 64:57gamma
-
64:57 - 65:01I times W transpose W over norm of W;
-
65:01 - 65:05that's just equal to gamma
-
65:05 - 65:07times the norm of W
-
65:07 - 65:08because
-
65:08 - 65:10W transpose W
-
65:10 - 65:12is the norm
-
65:12 - 65:15of W squared, and, therefore,
-
65:15 - 65:18gamma
-
65:18 - 65:25I is just - well,
-
65:27 - 65:34gamma I equals W over the norm of W, transpose, times XI, plus B over the norm of W, okay?
-
65:34 - 65:38And, in other words, this little calculation just showed us that if you have a
-
65:38 - 65:41training example
-
65:41 - 65:43XI,
-
65:43 - 65:47then the distance between XI and the separating hyperplane defined by the
-
65:47 - 65:49parameters W and B
-
65:49 - 65:56can be computed by this formula, okay? So
-
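In symbols, the little calculation above goes like this (my restatement of the board work, using the notation of the lecture notes): since the point x^{(i)} - \gamma^{(i)} w / \|w\| lies on the hyperplane,

w^{T}\left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0
\quad\Longrightarrow\quad
\gamma^{(i)} = \frac{w^{T} x^{(i)} + b}{\|w\|} = \left( \frac{w}{\|w\|} \right)^{T} x^{(i)} + \frac{b}{\|w\|}.
-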
66:03 - 66:05the last thing I want to do is actually
-
66:05 - 66:07take into account
-
66:07 - 66:12the sign of the - the correct classification of the training example. So I've
-
66:12 - 66:13been assuming that
-
66:13 - 66:16we've been classifying an example correctly.
-
66:16 - 66:21So, more generally, I'll define
-
66:21 - 66:27the geometric
-
66:27 - 66:30margin of a training example to be
-
66:30 - 66:33gamma I equals YI
-
66:33 - 66:40times that thing on top, okay?
-
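That is, in the notation of the lecture notes, the (signed) geometric margin of a training example is

\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^{T} x^{(i)} + \frac{b}{\|w\|} \right).
-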
66:45 - 66:47And so
-
66:47 - 66:51this is very similar to the functional margin, except for the normalization by the
-
66:51 - 66:53norm of W,
-
66:53 - 66:58and so as before, you know, this says that so long as -
-
66:58 - 67:01we would like the geometric margin to be large, and all that means is that so
-
67:01 - 67:02long as
-
67:02 - 67:04we're classifying the example correctly,
-
67:04 - 67:08we would ideally hope for the example to be as far as possible from the separating
-
67:08 - 67:11hyperplane, so long as it's on the right side of
-
67:11 - 67:12the separating hyperplane, and that's what YI
-
67:12 - 67:19multiplied into this does.
-
67:23 - 67:28And so a couple of easy facts, one is if the
-
67:28 - 67:32norm of W is equal to one,
-
67:32 - 67:34then
-
67:34 - 67:37the functional margin is equal to the geometric margin, and you see
-
67:37 - 67:39that quite easily,
-
67:39 - 67:41and, more generally,
-
67:41 - 67:43
-
67:43 - 67:48the geometric margin is just equal
-
67:48 - 67:55to the functional margin divided by the norm of W, okay? Let's see, okay.
-
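Written out, the relationship between the two margins is simply

\gamma^{(i)} = \frac{\hat{\gamma}^{(i)}}{\|w\|},

so the two notions coincide exactly when \|w\| = 1.
-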
68:12 - 68:19And so one final definition is
-
68:22 - 68:27so far I've defined the geometric margin with respect to a single training example,
-
68:27 - 68:32and so as before, I'll define the geometric margin with respect to an entire training set
-
68:32 - 68:34as gamma equals
-
68:34 - 68:38min over I
-
68:38 - 68:45of gamma I, all right?
-
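Putting the last few definitions together, here is a small code sketch (my own toy data and function names, not part of the course materials) of the functional and geometric margins of a whole training set, including the scaling behavior discussed a few minutes ago:

import numpy as np

def functional_margin(w, b, X, y):
    # gamma_hat = min_i y_i * (w^T x_i + b)
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    # gamma = min_i y_i * ((w/||w||)^T x_i + b/||w||)
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# hypothetical toy data: two examples on each side of the boundary
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(functional_margin(w, b, X, y), functional_margin(10 * w, 10 * b, X, y))  # scales with w, b
print(geometric_margin(w, b, X, y), geometric_margin(10 * w, 10 * b, X, y))    # unchanged by scaling
-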
68:45 - 68:47And so
-
68:47 - 68:54the maximum margin classifier, which is a precursor to the support vector machine,
-
69:00 - 69:05is the learning algorithm that chooses the parameters W and B
-
69:05 - 69:07so as to maximize
-
69:07 - 69:09the geometric margin,
-
69:09 - 69:12and so I just write that down. The
-
69:12 - 69:16maximum margin classifier poses the following optimization problem. It says
-
69:16 - 69:20choose gamma, W, and B
-
69:20 - 69:24so as to maximize the geometric margin,
-
69:24 - 69:25subject to that YI times -
-
69:25 - 69:32well,
-
69:33 - 69:38this is just one way to write it,
-
69:38 - 69:41subject to - actually, do I write it like that? Yeah,
-
69:41 - 69:44fine.
-
69:44 - 69:46There are several ways to write this, and one of the things we'll
-
69:46 - 69:49do next time is actually - I'm trying
-
69:49 - 69:53to figure out if I can do this in five minutes. I'm guessing this could
-
69:53 - 69:55be difficult. Well,
-
69:55 - 69:59so this maximum margin classifier is the maximization problem
-
69:59 - 70:01over the parameters gamma,
-
70:01 - 70:03W
-
70:03 - 70:05and B, and for now, it turns out
-
70:05 - 70:09that the geometric margin doesn't change depending on the norm of W, right? Because
-
70:09 - 70:11in the definition of the geometric margin,
-
70:11 - 70:14notice that we're dividing by the norm of W anyway.
-
70:14 - 70:18So you can actually set the norm of W to be anything you want, and you can multiply
-
70:18 - 70:23W and B by any constant; it doesn't change the geometric margin.
-
70:23 - 70:27This will actually be important, and we'll come back to this later. Notice that you
-
70:27 - 70:30can take the parameters W and B,
-
70:30 - 70:34and you can impose any normalization constraint on them, or you can change W and B by
-
70:34 - 70:36any scaling factor and
-
70:36 - 70:38replace them by ten W and ten B
-
70:38 - 70:39whatever,
-
70:39 - 70:43and it does not change the geometric margin,
-
70:43 - 70:44okay? And so
-
70:44 - 70:48in this first formulation, I'm just gonna impose a constraint and say that the norm of W is
-
70:48 - 70:49one,
-
70:49 - 70:52and so the functional and the geometric margins will be the same,
-
70:52 - 70:53and then we'll say
-
70:53 - 70:55maximize the geometric margins subject to -
-
70:55 - 70:57you maximize gamma
-
70:57 - 71:00subject to that every training example must have geometric margin
-
71:00 - 71:02at least gamma,
-
71:02 - 71:04and this is a geometric margin because
-
71:04 - 71:06when the norm of W is equal to one, then
-
71:06 - 71:08the functional and the geometric margins
-
71:08 - 71:11are identical, okay?
-
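Written out, this first formulation of the maximum margin classifier is the optimization problem (roughly as it appears in the CS 229 lecture notes, with m the number of training examples):

\begin{aligned}
\max_{\gamma,\, w,\, b} \quad & \gamma \\
\text{s.t.} \quad & y^{(i)} \left( w^{T} x^{(i)} + b \right) \ge \gamma, \quad i = 1, \ldots, m, \\
& \|w\| = 1.
\end{aligned}
-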
71:11 - 71:15So this is the maximum margin classifier, and it turns out that if you do this,
-
71:15 - 71:19it'll run, you know, maybe about as well as a - maybe slight - maybe comparable to logistic regression, but
-
71:19 - 71:20it
-
71:20 - 71:23
-
71:23 - 71:24turns out that
-
71:24 - 71:26as we develop this algorithm further, there
-
71:26 - 71:31will be a clever way to allow us to change this algorithm to let it work
-
71:31 - 71:33in infinite dimensional feature spaces
-
71:33 - 71:37and come up with very efficient non-linear classifiers.
-
71:37 - 71:38So
-
71:38 - 71:41there's a ways to go before we turn this into a support vector machine,
-
71:41 - 71:43but this is the first step.
-
71:43 - 71:50So are there questions about this? Yeah.
-
72:04 - 72:08Student: [Off mic]. Instructor (Andrew Ng): For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a
-
72:08 - 72:10fixed training set, and the
-
72:10 - 72:13scaling of the training set is not something you get to play with, right?
-
72:13 - 72:17So everything I've said is for a fixed training set, so that you can't change the X's, and you can't change the
-
72:17 - 72:21Y's. Are there other questions? Okay. So all right.
-
72:21 - 72:26
-
72:26 - 72:30
-
72:30 - 72:35Next week we will take this, and we'll talk about optimization algorithms,
-
72:35 - 72:36and
-
72:36 - 72:40work our way towards turning this into one of the most effective off-the-shelf
-
72:40 - 72:42learning algorithms,
-
72:42 - 72:44and just a final reminder again, this next
-
72:44 - 72:48discussion session will be on Matlab and Octave. So show up for that if you want
-
72:48 - 72:50to see a tutorial.
-
72:50 - 72:52Okay. See you guys in the next class.