This subtitle is provided from lecsub.jimdo.com under a Creative Commons license. This presentation is delivered by the Stanford Center for Professional Development.

What I want to do today is continue our discussion of Naive Bayes, the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models for Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something I actually won't spend a lot of time on, and then I want to start to talk about support vector machines, which is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people that hold it, and we'll start discussing that today; this will actually take us a few lectures to complete.

So let's talk about Naive Bayes. To recap from the previous lecture, I started off describing spam classification as the most [inaudible] example for Naive Bayes, in which we create feature vectors like these that correspond to words in a dictionary.
And so, based on which words appear in a piece of email, we represented the email as a feature vector with ones and zeros in the corresponding places. Naive Bayes is a generative learning algorithm, and by that I mean it's an algorithm in which we model p(x|y); for Naive Bayes specifically, we modeled it as p(x|y) = ∏_{i=1}^{n} p(x_i|y), and we also model p(y). We then use Bayes' rule to combine these two, and so our prediction, when you're given a new piece of email and want to tell whether it's spam or not spam, is arg max over y of p(y|x), which by Bayes' rule is arg max over y of p(x|y) p(y), okay? So this is Naive Bayes, and just to draw attention to two things: one is that in this model each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary, so it might be on the order of 50,000 words, say. What I want to do now is describe two variations on this algorithm.
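Before moving on to the variations, here is a minimal sketch of that arg max prediction rule in Python. The vocabulary, parameter values, and function name are all made up for illustration; this is not code from the course, just the Bayes-rule computation carried out in log space:

```python
import numpy as np

def predict_bernoulli_nb(x, phi_y, phi_j_y1, phi_j_y0):
    """Return arg max_y p(x|y) p(y) for a binary feature vector x.
    phi_j_y1[j] = p(x_j = 1 | y = 1) and phi_j_y0[j] = p(x_j = 1 | y = 0)."""
    # log p(y=1) + sum_j log p(x_j | y=1), using the Naive Bayes factorization
    log_p1 = np.log(phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_y1), np.log(1 - phi_j_y1)))
    log_p0 = np.log(1 - phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_y0), np.log(1 - phi_j_y0)))
    return 1 if log_p1 > log_p0 else 0

# Made-up parameters for a toy 3-word vocabulary ["buy", "hello", "meeting"]:
phi_y = 0.4                           # p(y = 1), i.e. p(spam)
phi_j_y1 = np.array([0.8, 0.3, 0.1])  # p(word j present | spam)
phi_j_y0 = np.array([0.1, 0.5, 0.6])  # p(word j present | not spam)

print(predict_bernoulli_nb(np.array([1, 0, 0]), phi_y, phi_j_y1, phi_j_y0))
print(predict_bernoulli_nb(np.array([0, 1, 1]), phi_y, phi_j_y1, phi_j_y0))
```

Working in log space avoids numerical underflow when the dictionary, and hence the product over features, is large.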
The first one is the simpler one; it's just a generalization to the case where x_i takes on more values. One thing that's commonly done is to apply Naive Bayes to problems where some of these features x_i take on k values rather than just two, and in that case you build a very similar model where p(x|y) is really the same thing, except that now these are multinomial probabilities rather than Bernoullis, because each x_i can take on up to k values. One situation where this arises very commonly is when you have a feature that's actually continuous-valued and you choose to discretize it, turning the continuous-valued feature into a finite set of k values. A perfect example, if you remember our very first supervised learning problem of predicting the price of houses: suppose you have a classification problem on these houses where, based on features of a house, you want to predict whether or not the house will be sold in the next six months, say.
That's a classification problem, and if you use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do would be to take the continuous-valued living area and just discretize it into a few discrete buckets. So depending on whether the living area of the house is less than 500 square feet, between 1,000 and 1,500 square feet, and so on, or greater than 2,000 square feet, you choose the value of the corresponding feature x_i to be one, two, three, or four, okay? So that was the first variation or generalization of Naive Bayes I wanted to talk about. I should just check; are there questions about this? Okay, cool. And it turns out that in practice it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing. The second and final variation I want to talk about for Naive Bayes is a variation that's specific to classifying text documents, or, more generally, to classifying sequences.
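The bucketing just described can be sketched with `numpy.digitize`; the bucket edges below are illustrative, not the lecture's exact numbers:

```python
import numpy as np

# Illustrative bucket edges for living area (square feet): below 500 is
# bucket 1, 500 to 1,000 is bucket 2, 1,000 to 2,000 is bucket 3,
# above 2,000 is bucket 4.
edges = np.array([500.0, 1000.0, 2000.0])

def discretize(living_area):
    """Map a continuous living area to a discrete feature value x_i in {1, 2, 3, 4}."""
    return int(np.digitize(living_area, edges)) + 1

print(discretize(450.0))
print(discretize(1200.0))
print(discretize(2500.0))
```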
So a text document, like a piece of email, you can think of as a sequence of words, and you can apply the model I'm about to describe to classifying other sequences as well, but let me just focus on text. Here's the idea. In Naive Bayes, given a piece of email, we were representing it using this binary vector-valued representation, and one of the things this loses, for instance, is the number of times that different words appear. For example, suppose some word appears a lot of times: you see the word "buy" a lot of times, or you see the word "Viagra" (it seems to be a common email example) a ton of times in the email. Then it is more likely to be spam than if it appears only once, although even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email. And to give this previous model a name as well: this particular model for text classification is called the multivariate Bernoulli event model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really, don't worry about what the name means.
In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the multinomial event model; it's just called that. And here's what we're going to do: given a piece of email, I'm going to represent my email as a feature vector, and so my i-th training example x^(i) will be a feature vector (x_1^(i), x_2^(i), ..., x_{n_i}^(i)), where n_i is equal to the number of words in this email. So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector, let me just write it as x_j, will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to which word appeared in the j-th position of my email, okay?
So, in other words, I'm going to take all the words in my email and have a feature vector that just says which word in my dictionary each word in the email was, okay? So note the different definition for n_i now: n_i varies and is different for every training example, and this x_j is now an index into the dictionary. The components of the feature vector are no longer binary random variables; they're indices into the dictionary that take on a much larger set of values. And so our generative model for this will be that the joint distribution over x and y is p(x, y) = (∏_{j=1}^{n} p(x_j|y)) p(y), where again n is now the length of the email, all right? The way to think about this formula is to imagine that there was some probability distribution over emails, some random process that generates the emails, and that process proceeds as follows: first y, the class label, is chosen; whether someone is going to send you spam email or not spam email is chosen for us.
So first y, the random variable giving the class label of spam or not spam, is generated, and then, having decided whether to send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates each word from some distribution that depends on whether they chose to send you spam or not spam. So if they sent you spam, they'll tend to generate words like buy, and Viagra, and whatever, discounts, sale, whatever; and if somebody chose to send you not spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful: x_i here has a very different definition from the previous event model, and n has a very different definition from the previous event model. And so the parameters of this model are, let's see, φ_{k|y=1} = p(x_j = k | y = 1), which is the probability that, conditioned on someone deciding to send you spam, the next word they choose to put in the spam email is going to be word k; similarly, φ_{k|y=0} = p(x_j = k | y = 0); and φ_y, just the same as before, okay?
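As a concrete sketch of this representation and generative model, with a made-up five-word dictionary and made-up parameter values (a real dictionary would have around 50,000 entries):

```python
import numpy as np

# Made-up 5-word dictionary; a real one would have ~50,000 entries.
dictionary = {"buy": 0, "now": 1, "hello": 2, "meeting": 3, "viagra": 4}

def email_to_indices(words):
    """Multinomial event model representation: one dictionary index per word
    position, so the length n_i of the vector equals the length of the email."""
    return [dictionary[w] for w in words]

def log_joint(x, y, phi_y, phi_k_y1, phi_k_y0):
    """log p(x, y) = sum_j log p(x_j | y) + log p(y)."""
    phi = phi_k_y1 if y == 1 else phi_k_y0
    log_py = np.log(phi_y if y == 1 else 1.0 - phi_y)
    return log_py + sum(np.log(phi[j]) for j in x)

phi_y = 0.4                                         # p(y = 1), i.e. p(spam)
phi_k_y1 = np.array([0.4, 0.2, 0.1, 0.1, 0.2])      # p(word k | spam)
phi_k_y0 = np.array([0.05, 0.15, 0.35, 0.35, 0.1])  # p(word k | not spam)

x = email_to_indices(["buy", "viagra", "buy"])      # n_i = 3, repeats kept
spammier = log_joint(x, 1, phi_y, phi_k_y1, phi_k_y0) > \
           log_joint(x, 0, phi_y, phi_k_y1, phi_k_y0)
print(x, spammier)
```

Note how the repeated word "buy" contributes twice to the product, which is exactly the count information the multivariate Bernoulli model throws away.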
So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. The maximum likelihood estimate of the parameter will be equal to, and now I'm going to write one of these big indicator-function things again:

φ_{k|y=1} = ( Σ_{i=1}^{m} 1{y^(i) = 1} Σ_{j=1}^{n_i} 1{x_j^(i) = k} ) / ( Σ_{i=1}^{m} 1{y^(i) = 1} n_i )

where n_i is the number of words in email i in your training set, okay? So the numerator says: sum over all your emails, taking into account only the emails that had class label one, only the emails that were spam, because if y^(i) equals zero, then the indicator is zero and the term goes away; and then sum over all the words in that spam email, counting up the number of times you observed the word k. So, in other words, the numerator is: look at all the spam emails in your training set and count up the total number of times the word k appeared in those emails. The denominator then is: sum over i in your training set, and whenever one of your examples is spam, add up the length of that spam email; so the denominator is the total length of all of the spam emails in your training set.
And so the ratio is just: out of all your spam emails, what fraction of the words were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay? At the end of the previous lecture, I talked about Laplace smoothing, and so when you do that as well, you add one to the numerator and k to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you do the same thing for, well, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry, on the right at the top, I was just wondering what the x_i is, and what the n is.

Instructor (Andrew Ng): Right. So in this second event model, the definition for x_i and the definition for n are different, right? So here, this is for one example (x, y). Here, n is the number of words in a given email, right? And if it's the i-th email, we subscript it as n_i, and so n will be different for different training examples; and here x_i will take these values from 1 to 50,000, and x_i is essentially the identity of the i-th word in a given piece of email, okay?
So that's why this is a product over all the different words of your email of the probability of the i-th word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no, actually, you know what, I apologize. I just realized that I overloaded the notation; I shouldn't have used k here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry, you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be k; this should be 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing, what you add to the denominator is the number of possible values that the random variable x_i can take on. Cool. Raise your hand if this makes sense? Okay. Are there more questions about this?

Student: On Laplace smoothing, the term you add to the denominator, is that the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see. So Laplace smoothing is a method to give you, hopefully, better estimates of a probability distribution over a multinomial, and so, was I using x or y in the previous lecture?
So, in trying to estimate the probability over a multinomial: I think x and y are different. Was it x or y? I think it was x, actually. Well, oh, I see, right, right: I think I was using a different definition for the random variable, because suppose you have a multinomial random variable x which takes on, let's use a different letter, suppose you have a multinomial random variable x which takes on L different values. Then the maximum likelihood estimate for the probability that x equals k, p(x = k), will be equal to the number of observations of x = k divided by the total number of observations of x, okay? So that's the maximum likelihood estimate. And to add Laplace smoothing to this, you add one to the numerator, and you add L to the denominator, where L is the number of possible values that x can take on. So, in this case, this is the probability that x equals k, and x can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but that's why I add 50,000 to the denominator. Are there other questions? Yeah.
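The smoothing recipe just described, in a quick sketch (the observations here are made up):

```python
from collections import Counter

def laplace_mle(observations, L):
    """Laplace-smoothed multinomial estimate over values 1..L:
    p(x = k) = (count(k) + 1) / (n + L)."""
    counts = Counter(observations)
    n = len(observations)
    return {k: (counts[k] + 1) / (n + L) for k in range(1, L + 1)}

# x takes L = 3 values; value 3 is never observed, but smoothing still
# gives it non-zero probability.
probs = laplace_mle([1, 1, 2, 1], L=3)
print(probs)
```

Here value 1 gets probability 4/7, value 2 gets 2/7, the unobserved value 3 gets 1/7, and the smoothed estimates still sum to one.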
Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple of times, and all the examples make sense, but I don't know what the general formula for it is.

Instructor (Andrew Ng): I see, yeah, right. So the question is: what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian discriminant analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work them out is to write down the likelihood. So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters φ_{k|y=1}, φ_{k|y=0}, φ_y, right? And so given a training set, the likelihood, I guess I should be writing the log likelihood, will be the log of the product over i = 1 to m of p(x^(i), y^(i)), parameterized by these things, okay? Where p(x^(i), y^(i)) is given by the product over the n_i words, ∏_{j=1}^{n_i} p(x_j^(i) | y^(i)),
they are parameterized by, well, I'll just drop the parameters to write this more simply, times p(y^(i)), okay? So this is my log likelihood, and so the way you get the maximum likelihood estimates of the parameters is that, given a fixed training set, a fixed set of x^(i), y^(i)'s, you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that I've been writing out. So in a previous section of today's lecture I wrote out some maximum likelihood estimates for the Gaussian discriminant analysis model and for Naive Bayes, and I didn't prove them, but you get to, sort of, play with that yourself in a homework problem as well for one of these models, and you'll be able to verify that when you maximize the likelihood, maximize the log likelihood, you do hopefully get the same formulas as what I was drawing up on the board. But the way these are derived is by maximizing this, okay? Cool. All right.
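Tying the estimation discussion together, here is a sketch of the count-based estimate of φ_{k|y=1} on a toy training set. The emails and vocabulary size are made up, and the smoothed version adds one to the numerator and the vocabulary size to the denominator, as corrected above:

```python
V = 5  # vocabulary size (50,000 in the lecture; tiny here for illustration)

# Toy training set: each email is a list of dictionary indices plus a label.
emails = [([0, 4, 0], 1),        # spam
          ([0, 1], 1),           # spam
          ([2, 3, 3, 2], 0)]     # not spam

def phi_k_spam(k, smooth=False):
    """phi_{k|y=1}: total count of word k across spam emails divided by the
    total number of words in spam emails. With Laplace smoothing, add 1 to
    the numerator and V (the number of values x_j can take) to the denominator."""
    num = sum(x.count(k) for x, y in emails if y == 1)
    den = sum(len(x) for x, y in emails if y == 1)
    return (num + 1) / (den + V) if smooth else num / den

print(phi_k_spam(0))               # 3 of the 5 spam words are word 0
print(phi_k_spam(2, smooth=True))  # unseen word still gets non-zero probability
```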
So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, the last Naive Bayes model, the multinomial event model, almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification. One reason that's hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesn't. I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and researchers are still debating about it a little bit. But if you ever have a text classification problem, you know, the Naive Bayes classifier is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position invariant, right? It doesn't actually care where the words are.
Instructor (Andrew Ng): Yes, that's right.

Student: And if you had a model that did care about word order, does that usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing. So in natural language processing, there's actually another name for it: it's called a unigram model in natural language processing, and there are some other models, say higher-order Markov models, that take into account some of the ordering of the words. It turns out that for text classification, the models like the bigram models or trigram models, I believe, do only very slightly better, if at all, but that's when you're applying them to text classification, okay? All right. So the next thing I want to talk about is to start again our discussion of non-linear classifiers.
So it turns out, well, the very first classification algorithm we talked about was logistic regression, which has this familiar form for the hypothesis, and you can think of this as predicting one when this estimated probability is greater than or equal to 0.5 and predicting zero when it is less than 0.5. And given a training set, logistic regression will maybe do gradient descent or something, or use Newton's method, to find a straight line that reasonably separates the positive and negative classes. But sometimes a data set just can't be separated by a straight line, so is there an algorithm that will let you start to learn these sorts of non-linear decision boundaries? And so how do you go about getting a non-linear classifier? And, by the way, one cool result: remember when we talked about generative learning algorithms, I said that if you assume x given y = 1 is exponential family with natural parameter η₁, and x given y = 0 is exponential family with natural parameter η₀, and you build a generative learning algorithm using this, right?
I think when we talked about Gaussian discriminant analysis, I said that if this holds true, then you end up with a logistic posterior. It actually turns out that the Naive Bayes model falls into this as well; the Naive Bayes model actually falls into this exponential family as well, and, therefore, under the Naive Bayes model, you're actually using a linear classifier as well, okay? So the question is: how can you start to get non-linear classifiers? And I'm going to talk about one method today, which we started to talk about very briefly, which is taking a simpler algorithm, like logistic regression, and using it to build up to more complex non-linear classifiers, okay? So to motivate this discussion, I'm going to use a little picture. Suppose you have features x1, x2, and x3, and so, by convention, I'm going to follow our earlier convention that x0 is set to one, and I'm going to use a little diagram like this to denote our logistic regression unit, okay?
So think of a little picture like that, you know, this little circle, as denoting a computation node that takes as input, you know, several features and then outputs another number, H subscript theta of X, given by a sigmoid function, and so this little computational unit will have parameters theta. Now, in order to get non-linear decision boundaries, all we need to do - well, at least one thing to do - is just come up with a way to represent hypotheses that can output non-linear decision boundaries, right? And so when you put together a bunch of those little pictures that I drew on the previous board, you can then get what's called a Neural Network, in which you think of having my features here, and then I would feed them to, say, a few of these little sigmoidal units, and these together will feed into yet another sigmoidal unit, say, which will output my final output H subscript theta of X, okay? And just to give these things names, let me call the values output by these three intermediate sigmoidal units A1, A2, A3. And let me just be completely concrete about what this formula represents, right?
So each of these units in the middle will have its own associated set of parameters, and so the value A1 will be computed as G of X transpose theta one, for some set of parameters which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta two, and A3 will be G of X transpose theta three, where G is the sigmoid function, all right? So G of Z equals one over one plus E to the negative Z, and then, finally, our hypothesis will output G of A transpose theta four, right? Where, you know, this A vector is a vector of A1, A2, A3. We can append another one to the front of it if you want, okay? Let me just draw up here - I'm sorry about the cluttered board. And so H subscript theta of X, this is a function of all the parameters theta one through theta four, and so one way to learn parameters for this model is to write down a cost function, say, J of theta equals one-half sum from I equals one to M of YI minus H subscript theta of XI, squared, say.
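As a rough sketch, the forward computation just described might look like this in code - assuming the bias term is folded into X, and with illustrative names that are not from the lecture:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(u_j * v_j for u_j, v_j in zip(u, v))

def forward(x, theta1, theta2, theta3, theta4):
    # hidden units: a_j = g(x^T theta_j) for j = 1, 2, 3
    a = [sigmoid(dot(x, th)) for th in (theta1, theta2, theta3)]
    # output unit: h_theta(x) = g(a^T theta4)
    return sigmoid(dot(a, theta4))
```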
Okay, so that's our familiar quadratic cost function, and so one way to learn the parameters of an algorithm like this is to just use gradient descent to minimize J of theta as a function of theta, okay? So you use gradient descent to minimize this squared error, which, stated differently, means you use gradient descent to make the predictions of your neural network as close as possible to what you observed as the labels in your training set, okay? So it turns out gradient descent on this neural network has a specific name: the algorithm that implements gradient descent is called back propagation, and so if you ever hear that, all that means is - it just means gradient descent on a cost function like this, or a variation of this, on a neural network that looks like that, and - well, this algorithm actually has some advantages and disadvantages, but let me actually show you. So, let's see. One of the interesting things about the neural network is that you can look at what these intermediate nodes are computing, right?
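One crude way to sketch that training loop in code is to use a numerical gradient in place of the back propagation equations themselves, which the lecture doesn't derive - again, this is illustrative only:

```python
def cost(params, xs, ys, predict):
    # J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2
    return 0.5 * sum((y - predict(params, x)) ** 2 for x, y in zip(xs, ys))

def gradient_step(params, xs, ys, predict, lr=0.1, eps=1e-5):
    # estimate dJ/dtheta_k by finite differences and take one descent step;
    # back propagation computes the same gradient, just much more efficiently
    base = cost(params, xs, ys, predict)
    grad = []
    for k in range(len(params)):
        bumped = list(params)
        bumped[k] += eps
        grad.append((cost(bumped, xs, ys, predict) - base) / eps)
    return [p - lr * g for p, g in zip(params, grad)]
```

Here `predict` stands for whatever hypothesis you've wired up, such as the network's forward computation; each step nudges the parameters so the network's predictions move closer to the training labels.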
So this neural network has what's called a hidden layer before you then have the output layer, and, more generally, you can actually have inputs feed into these computation units, which feed into more layers of computation units, to even more layers, to more layers, and then finally you have an output layer at the end. And one cool thing you can do is look at all of these intermediate units, look at these units in what's called a hidden layer of the neural network - don't worry about why it's called that - look at the computations of the hidden units, and ask: what are the hidden units of the neural network computing? So to, maybe, get a better sense of what neural networks might be doing, let me show you a video - I'm gonna switch to the laptop - this is made by a friend, Yann LeCun, who's currently a professor at New York University. Can I show a video on the laptop? So let me show you a video from Yann LeCun on a neural network that he developed for handwritten digit recognition. There was one other thing he did in this neural network that I'm not gonna talk about, called a Convolutional Neural Network. Well, his system is called LeNet, and let's see. Would you put on the laptop display?
Hum, actually maybe if - or you can just put it on the screen on the side; that would work too if the big screen isn't working. Let's see. I'm just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come on? Well, let me say a few more things about neural networks. So it turns out that when you write down a quadratic cost function like I wrote down on the chalkboard just now, it turns out that, unlike logistic regression, that will almost always correspond to a non-convex optimization problem, and so whereas for logistic regression, if you run gradient descent or Newton's method or whatever, you converge to the global optimum, this is not true for neural networks. In general, there are lots of local optima, and it's, sort of, a much harder optimization problem. So with neural networks, if you're, sort of, familiar with them, and you're good at making design choices like what learning rate to use, and how many hidden units to use, and so on, you can, sort of, get them to be fairly effective, and there are, sort of, often ongoing debates about, you know, is this learning algorithm better, or is that learning algorithm better?
The vast majority of machine learning researchers today seem to perceive support vector machines, which is what I'll talk about later, to be a much more effective off-the-shelf learning algorithm than neural networks. This point of view is contested a bit, but so neural networks are not something that I personally use a lot right now, because it's a hard optimization problem and it doesn't always converge; it actually, sort of, works - it, sort of, works reasonably well. It's just, because this is fairly complicated, it's not an algorithm that I use commonly or that my friends use all the time. Oh, cool. So let me just go and show you an example of a neural network, which was for many years, you know, the most effective learning algorithm before support vector machines were invented. So here's Yann LeCun's video, and - well, there's actually audio on this too from the soundboard, so I'll just tell you what's happening. What you're seeing is a trained neural network, and this display where my mouse pointer is pointing, right, this big three there, is the input to the neural network. So you're showing the neural network this image, and it's trying to recognize what this is.
The final answer output by the neural network is this number up here, right below where it says LeNet-5, and the neural network correctly recognizes this image as a three, and if you look to the left of this image, what's interesting about this is the display on the left portion is actually showing the intermediate computations of the neural network. In other words, it's showing you what the hidden layers of the neural network are computing. And so, for example, if you look at this one, the third image down from the top, this seems to be computing, you know, certain edges of the digits, right? Maybe edges on the right-hand side or the bottom of the input image, okay? So let me just play this video, and you can see some of the inputs and outputs of the neural network. And those are very different fonts. There's this robustness to noise. All right. Multiple digits, that's, kind of, cool. All right. So, just for fun, let me show you one more video, which was - let's see.
This is another video, from a documentary called The Machine That Changed the World, which was produced by WGBH Television in cooperation with the British Broadcasting Corporation, and it was aired on PBS a few years ago, I think. I want to show you a video describing the NETtalk Neural Network, which was developed by Terry Sejnowski; he's a researcher. And so NETtalk was actually one of the major milestones in the history of neural networks, and this specific application is getting the neural network to read text. So, in other words, can you show a piece of English to a computer and have the computer read - sort of, verbally produce sounds that correspond to the reading of the text? And it turns out that in the history of AI and the history of machine learning, this video created a lot of excitement about neural networks and about machine learning. Part of the reason was that Terry Sejnowski had the foresight to choose to use, in his video, a child-like voice talking about visiting your grandmother's house and so on.
You'll see it in a second, and so this really created the perception - created the impression of the neural network being like a young child learning how to speak, and talking about going to your grandmother's, and so on. So this actually helped generate a lot of excitement within academia and outside academia about neural networks, sort of, early in the history of neural networks. I'm just gonna show you the video. [Begin Video] You're going to hear first what the network sounds like at the very beginning of the training, and it won't sound like words, but it'll sound like attempts that will get better and better with time. [Computer's voice] The network takes the letters, say the phrase, grandmother's house, and makes a random attempt at pronouncing it. [Computer's voice] Grandmother's house. The phonetic difference between the guess and the right pronunciation is sent back through the network. [Computer's voice] Grandmother's house. By adjusting the connection strengths after each attempt, the net slowly improves. And, finally, after letting it train overnight, the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's house. Well, because she gives us candy. Well, and we - NETtalk understands nothing about the language.
It is simply associating letters with sounds. [End Video] All right. So at the time this was done, I mean, this was an amazing piece of work. I should say today there are other text-to-speech systems that work better than what you just saw, and you'll also appreciate that getting candy from your grandmother's house is a little bit less impressive than talking about the Dow Jones falling 15 points, and profit taking, whatever. So - but I wanted to show that just because that was another cool, major landmark in the history of neural networks. Okay. So let's switch back to the chalkboard, and what I want to do next is tell you about Support Vector Machines, okay? That, sort of, wraps up our discussion on neural networks. So I started off talking about neural networks by motivating them as a way to get us to output non-linear classifiers, right? I didn't really prove it, but it turns out that you'd be able to come up with non-linear decision boundaries using a neural network like what I drew on the chalkboard earlier. Support Vector Machines will be another learning algorithm that will give us a way to come up with non-linear classifiers.
It's a very effective, off-the-shelf learning algorithm, but it turns out that in the discussion I'm gonna - in the progression and development I'm gonna pursue, I'm actually going to start off by describing yet another class of linear classifiers with linear decision boundaries, and it'll only be later, sort of, probably in the next lecture or the one after that, that we'll then take the support vector machine idea and, sort of, do some clever things to it to make it work very well to generate non-linear decision boundaries as well, okay? But we'll actually start by talking about linear classifiers a little bit more. And to do that, I want to convey two intuitions about classification. One is, if you think about logistic regression, we have this logistic function that was outputting the probability that Y equals one, and it crosses 0.5 right where theta transpose X equals zero. So when you run logistic regression, I want you to think of it as an algorithm that computes theta transpose X, and then it predicts one, right, if and only if theta transpose X is greater than zero, right? IFF stands for if and only if; it means the same thing as a double implication. And it predicts zero if and only if theta transpose X is less than zero, okay?
So if it's the case that theta transpose X is much greater than zero - the double greater-than sign means much greater than, all right - so if theta transpose X is much greater than zero, then, you know, think of that as a very confident prediction that Y is equal to one, right? If theta transpose X is much greater than zero, then we're gonna predict one, and moreover we're very confident it's one, and the picture for that is, if theta transpose X is way out here, then we're estimating that the probability of Y being equal to one, on the sigmoid function, will be very close to one. And, in the same way, if theta transpose X is much less than zero, then we're very confident that Y is equal to zero. So wouldn't it be nice - so when we fit logistic regression or some other classifier to a training set - wouldn't it be nice if, right, for all I such that YI is equal to one, we have theta transpose XI much greater than zero, and for all I such that YI is equal to zero, we have theta transpose XI much less than zero, okay? So wouldn't it be nice if this is true?
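A quick numeric sketch of this "confident prediction" intuition - the particular values here are just examples, not from the lecture:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# when theta^T x >> 0, the estimated P(y = 1 | x) is close to one,
# and when theta^T x << 0, it is close to zero
very_positive = sigmoid(6.0)    # a confident prediction that y = 1
very_negative = sigmoid(-6.0)   # a confident prediction that y = 0
borderline = sigmoid(0.1)       # barely above 0.5: correct, but an
                                # unconfident prediction of y = 1
```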
That, essentially, on our training set, we can find parameters theta so that our learning algorithm not only makes correct classifications on all the examples in the training set, but, further, is, sort of, very confident about all of those correct classifications. This is the first intuition that I want you to have, and we'll come back to this first intuition in a second when we talk about functional margins, okay? We'll define this later. The second intuition that I want to convey - and it turns out, for the rest of today's lecture, I'm going to assume that the training set is linearly separable, okay? So by that I mean, for the rest of today's lecture, I'm going to assume that there is indeed a straight line that can separate your training set, and we'll remove this assumption later, but just to develop the algorithm, let's assume a linearly separable training set. And so there's a sense that, out of all the straight lines that separate the training set, you know, maybe that straight line isn't such a good one, and that one actually isn't such a great one either, but maybe that line in the middle is a much better linear separator than the others, right? And one reason that, when you and I look at it, this one seems best is because this line is just further from the data, all right?
That is, it separates the data with a greater distance between your positive and your negative examples and the decision boundary, okay? And this second intuition - we'll come back to it shortly - about this final line that I drew being, maybe, the best line: it's this notion of distance from the training examples. This is the second intuition I want to convey, and we'll formalize it later when we talk about geometric margins of our classifiers, okay? So in order to describe the support vector machine, unfortunately, I'm gonna have to pull a notation change, and, sort of, unfortunately, it, sort of, was impossible to do logistic regression, and support vector machines, and all the other algorithms using one completely consistent notation, and so I'm actually gonna change notation slightly for linear classifiers, and that will actually make it much easier for us - that'll make it much easier later today and in next week's lectures to actually talk about the support vector machine.
But the notation that I'm gonna use for the rest of today and for most of next week will be that my labels Y, instead of being zero, one, will be minus one and plus one, and in the development of the support vector machine we will have H, have a hypothesis, output values that are either plus one or minus one, and so we'll let G of Z be equal to one if Z is greater than or equal to zero, and minus one otherwise, right? So, just rather than zero and one, we change everything to plus one and minus one. And, finally, whereas previously I wrote H subscript theta of X equals G of theta transpose X, and we had the convention that X zero is equal to one, right, and so X is in RN plus one - I'm gonna drop this convention of letting X zero equal one and letting X be in RN plus one, and instead I'm going to parameterize my linear classifier as H subscript W, B of X equals G of W transpose X plus B, okay? And so B just now plays the role of theta zero, and W now plays the role of the rest of the parameters, theta one through theta N, okay?
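In code, this change of convention might look like the following sketch (illustrative, not from the lecture):

```python
def g(z):
    # threshold function: +1 if z >= 0, -1 otherwise
    # (rather than the 0/1 outputs we used before)
    return 1 if z >= 0 else -1

def h(w, b, x):
    # h_{w,b}(x) = g(w^T x + b); b plays the role of the old theta_0,
    # so x no longer carries an extra x_0 = 1 coordinate
    return g(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
```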
So, just by separating out the intercept B, rather than lumping it together, it'll make it easier for us to develop support vector machines. So - yes. Student:[Off mic]. Instructor (Andrew Ng):Oh, yes. Right, yes. So W is - right. So W is a vector in RN, and X is now a vector in RN rather than RN plus one, and lowercase B is a real number. Okay. Now, let's formalize the notions of functional margin and geometric margin. Let me make a definition. I'm going to say that the functional margin of the hyperplane W, B with respect to a specific training example, XI, YI - WRT stands for with respect to - the functional margin of a hyperplane W, B with respect to a certain training example XI, YI is defined as gamma hat I equals YI times W transpose XI plus B, okay? And so a set of parameters W, B defines a classifier - it, sort of, defines a linear separating boundary, and so when I say hyperplane, I just mean the decision boundary that's defined by the parameters W, B. You know what, if you're confused by the hyperplane term, just ignore it.
The functional margin of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And the interpretation of this is that if YI is equal to one, then for this to have a large functional margin, you want W transpose XI plus B to be large, right? And if YI is equal to minus one, then in order for the functional margin to be large - we, sort of, want the functional margins to be large, but in order for the functional margin to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins - the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small, and this, sort of, collapses the two cases into one statement: that we'd like the functional margin to be large. And notice also that so long as YI times W transpose XI plus B - so long as this is greater than zero, that means we classified it correctly, okay?
And one more definition: I'm going to say that the functional margin of a hyperplane with respect to an entire training set is - I'm going to define gamma hat to be equal to the min over all your training examples of gamma hat I, right? So if you have a training set, if you have more than one training example, I'm going to define the functional margin with respect to the entire training set as the worst case over all of the functional margins on the entire training set. And so, for now, you should think of the first intuition as saying that we would like the functional margin to be large, and for our purposes, for now, let's just say we would like the worst-case functional margin to be large, okay? And we'll change this a little bit later as well. Now, it turns out that there's one little problem with this intuition that will, sort of, trip us up later, which is that it actually turns out to be very easy to make the functional margin large, all right? So, for example, suppose I have a classifier with parameters W and B. If I take W and multiply it by two, and take B and multiply it by two, then - if you refer to the definition of the functional margin, I guess that was what?
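Both definitions are easy to write down as code - a sketch under the plus one/minus one label convention, with illustrative names:

```python
def functional_margin(w, b, x, y):
    # gamma_hat_i = y_i * (w^T x_i + b); this is positive exactly
    # when the example (x_i, y_i) is classified correctly
    return y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

def functional_margin_of_set(w, b, examples):
    # gamma_hat = min_i gamma_hat_i: the worst case over the training set
    return min(functional_margin(w, b, x, y) for x, y in examples)
```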
Gamma I, gamma hat I equals YI times W transpose XI plus B. If I double W and B, then I can easily double my functional margin. So this goal of making the functional margin large, in and of itself, isn't so useful, because it's easy to make the functional margin arbitrarily large just by scaling up the parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that the norm, the L2 norm, of the parameter W is equal to one, and we'll come back to this in a second. All right. And then so - okay.

Now, let's talk about - let's see how much time we have, 15 minutes. Well, see, I'm trying to decide how much to try to do in the last 15 minutes. Okay. So let's talk about the geometric margin, and so the geometric margin of a training example - [inaudible], right? So the decision boundary of my classifier is going to be given by the plane W transpose X plus B is equal to zero, okay? Right, and these are the X1, X2 axes, say, and we're going to draw relatively few training examples here.
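A tiny numeric check of this scaling problem, reusing the same hypothetical parameters as before: doubling W and B doubles the functional margin without changing the classifier at all, while dividing out the norm of W gives a number that doesn't move.

```python
import numpy as np

# Functional margin gamma_hat_i = y_i * (w^T x_i + b).
def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

w, b = np.array([2.0, -1.0]), 0.5
x, y = np.array([3.0, 1.0]), +1

m1 = functional_margin(w, b, x, y)
m2 = functional_margin(2 * w, 2 * b, x, y)  # rescale the parameters
assert m2 == 2 * m1  # margin doubled, decision boundary unchanged

# One possible fix: impose ||w|| = 1 before measuring the margin.
scale = np.linalg.norm(w)
m_norm = functional_margin(w / scale, b / scale, x, y)
print(m1, m2, m_norm)  # m_norm is invariant to rescaling (w, b)
```

This is exactly why the raw functional margin, by itself, is not a useful objective to maximize.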
Let's say I'm drawing deliberately few training examples so that I can add things to this, okay? And so, assuming we classify an example correctly, I'm going to define the geometric margin as just the geometric distance between the training example XI, YI and this separating line, this separating hyperplane, okay? That's what I'm going to define the geometric margin to be. And so I'm gonna do some algebra fairly quickly; in case it doesn't make sense, read through the lecture notes more carefully for details. Sort of, by standard geometry, the normal, or in other words, the vector that's 90 degrees to the separating hyperplane, is going to be given by W divided by the norm of W; that's just how planes in high dimensions work. If this stuff - for some of this you'll have to take a look at the lecture notes on the website. And so let's say this distance is gamma I, okay? And so I'm going to use the convention that I'll put a hat on top when I'm referring to functional margins, and no hat on top for geometric margins. So let's say the geometric margin of this example is gamma I.
That means that this point here, right, is going to be XI minus gamma I times W over the norm of W, okay? Because W over the norm of W is the unit vector, the length-one vector, that is normal to the separating hyperplane, and so when we subtract gamma I times the unit vector from this point XI - this point here is XI, so XI minus, you know, this little vector here, is going to be this point that I've drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I times W over the norm of W, okay?

And so because this heavy point is on the separating hyperplane, right, this point must satisfy W transpose times that point equals zero, right? Because all points X on the separating hyperplane satisfy the equation W transpose X plus B equals zero, and so this point is on the separating hyperplane, therefore it must satisfy W transpose this point - oh, excuse me - plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, I'm, sort of, going slightly fast through this geometry.
So if you're not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes. And so what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? So this equation I just wrote down - solve this equation for gamma, or gamma I, and you find that - you saw that previous equation for gamma I - well, why don't I just do it? You have W transpose XI plus B equals gamma I times W transpose W over the norm of W; that's just equal to gamma I times the norm of W, because W transpose W is the norm of W squared, and, therefore, gamma I is just W transpose XI plus B over the norm of W, okay? And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyperplane defined by the parameters W and B can be computed by this formula, okay?

So the last thing I want to do is actually take into account the sign of the - the correct classification of the training example. So I've been assuming that we've been classifying an example correctly.
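This distance calculation can be sketched in a few lines; the hyperplane below is a made-up example chosen so the answer is easy to see by eye, and the code also checks the step in the derivation that the projected point XI - gamma * W/||W|| lands exactly on the hyperplane.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Signed distance from x to the hyperplane w^T x + b = 0,
    namely (w^T x + b) / ||w||."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical hyperplane: x1 = 2, i.e. w = [1, 0], b = -2.
w, b = np.array([1.0, 0.0]), -2.0
x = np.array([5.0, 7.0])

gamma = distance_to_hyperplane(w, b, x)
print(gamma)  # 3.0: the point is three units to the right of x1 = 2

# Sanity check from the derivation: stepping gamma units back along
# the unit normal w/||w|| must land on the hyperplane itself.
projected = x - gamma * w / np.linalg.norm(w)
assert abs(np.dot(w, projected) + b) < 1e-9
```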
So, more generally, I'll define the geometric margin of a training example to be gamma I equals YI times that thing on top, okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so as before, you know, this says that we would like the geometric margin to be large, and all that means is that, so long as we're classifying the example correctly, we would ideally hope for the example to be as far as possible from the separating hyperplane, so long as it's on the right side of the separating hyperplane, and that's what multiplying YI into this does.

And so a couple of easy facts: one is that if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily; and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Let's see, okay. And so one final definition: so far I've defined the geometric margin with respect to a single training example, and so as before, I'll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right?
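Putting these definitions side by side makes the two easy facts concrete; the toy training set and parameters below are invented for illustration only.

```python
import numpy as np

def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # Geometric margin = functional margin / ||w||.
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Hypothetical toy training set with labels in {-1, +1}.
X = np.array([[3.0, 1.0], [-1.0, 2.0], [0.5, 0.5]])
Y = np.array([+1, -1, +1])
w, b = np.array([2.0, -1.0]), 0.5

# Margin of the whole training set: the worst case over all examples.
gammas = [geometric_margin(w, b, x, y) for x, y in zip(X, Y)]
gamma = min(gammas)
print(gamma)

# Easy fact: after rescaling so that ||w|| = 1, the functional and
# geometric margins coincide exactly.
scale = np.linalg.norm(w)
assert np.isclose(geometric_margin(w / scale, b / scale, X[0], Y[0]),
                  functional_margin(w / scale, b / scale, X[0], Y[0]))
```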
And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and so let me just write that down. The maximum margin classifier poses the following optimization problem. It says choose gamma, W, and B so as to maximize the geometric margin, subject to the constraint that YI times - well, this is just one way to write it, subject to - actually, do I write it like that? Yeah, fine. There are several ways to write this, and one of the things we'll do next time is actually - I'm trying to figure out if I can do this in five minutes. I'm guessing this could be difficult.

Well, so this maximum margin classifier is the maximization problem over parameters gamma, W, and B, and for now, it turns out that the geometric margin doesn't change depending on the norm of W, right? Because in the definition of the geometric margin, notice that we're dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesn't change the geometric margin. This will actually be important, and we'll come back to this later.
Notice that you can take the parameters W and B, and you can impose any normalization constant on them, or you can change W and B by any scaling factor and replace them by ten W and ten B, whatever, and it does not change the geometric margin, okay? And so in this first formulation, I'm just gonna impose a constraint and say that the norm of W is one, and so then the functional and the geometric margins will be the same, and then we'll say maximize the geometric margin subject to - you maximize gamma subject to the constraint that every training example must have geometric margin at least gamma, and this is a geometric margin because when the norm of W is equal to one, then the functional and the geometric margins are identical, okay?

So this is the maximum margin classifier, and it turns out that if you do this, it'll run, you know, maybe about as well as - maybe comparable to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to allow us to change this algorithm to let it work in infinite-dimensional feature spaces and come up with very efficient non-linear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step.
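This first formulation can be handed directly to a generic constrained solver; the sketch below is not how SVMs are solved in practice (the constraint ||W|| = 1 is non-convex, which is part of why the formulation gets reworked in later lectures), and the toy data set is invented purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
Y = np.array([+1, +1, -1, -1])

# Variables packed as v = [w1, w2, b, gamma].  Maximize gamma subject
# to y_i (w^T x_i + b) >= gamma for every i, and ||w|| = 1, so that
# gamma really is the geometric margin.
def neg_gamma(v):
    return -v[-1]

constraints = [{"type": "ineq",
                "fun": lambda v, x=x, y=y: y * (v[:2] @ x + v[2]) - v[-1]}
               for x, y in zip(X, Y)]
constraints.append({"type": "eq",
                    "fun": lambda v: np.linalg.norm(v[:2]) - 1.0})

res = minimize(neg_gamma, x0=np.array([1.0, 0.0, 0.0, 0.0]),
               constraints=constraints, method="SLSQP")
w, b, gamma = res.x[:2], res.x[2], res.x[3]
print(w, b, gamma)  # gamma: worst-case geometric margin achieved
```

On separable data like this, the solver pushes the hyperplane as far as possible from the closest examples on each side, which is exactly the worst-case margin being maximized.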
So are there questions about this? Yeah.

Student:[Off mic].

Instructor (Andrew Ng):For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything I've said is for a fixed training set, so you can't change the X's, and you can't change the Y's. Are there other questions? Okay. So all right.

Next week we will take this, and we'll talk about optimization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms, and just a final reminder again, the next discussion session will be on Matlab and Octave. So show up for that if you want to see a tutorial. Okay. See you guys in the next class.