This subtitle is provided from lecsub.jimdo.com under a Creative Commons License. This presentation is delivered by the Stanford Center for Professional Development.

What I want to do today is continue our discussion of Naive Bayes, which is the learning algorithm that I started to discuss in the previous lecture, and talk about a couple of different event models for Naive Bayes. Then I'll take a brief digression to talk about neural networks, which is something that I actually won't spend a lot of time on, and then I want to start to talk about support vector machines. The support vector machine is the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but there are many people who hold it, and we'll start discussing it today; this will actually take us a few lectures to complete.

So let's talk about Naive Bayes.
To recap from the previous lecture, I started off describing spam classification as the [inaudible] example for Naive Bayes, in which we would create feature vectors that correspond to words in a dictionary. And so, based on which words appear in a piece of email, the email is represented as a feature vector with ones and zeros in the corresponding places. Naive Bayes is a generative learning algorithm, and by that I mean it's an algorithm in which we model P(x|y); for Naive Bayes specifically, we modeled it as the product from i equals one to n of P(xi|y), and we also model P(y). We then use Bayes' rule to combine these two together, and so when you're given a new piece of email and want to tell whether it's spam or not spam, you predict arg max over y of P(y|x), which by Bayes' rule is arg max over y of P(x|y) times P(y), okay?
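As an aside (this sketch is not from the lecture itself), the prediction rule just described can be written out in a few lines for the binary-feature model, assuming the parameters — called `phi_y`, `phi_j_y1`, and `phi_j_y0` below, all illustrative names — have already been estimated from a training set:

```python
import math

def predict_bernoulli_nb(x, phi_j_y1, phi_j_y0, phi_y):
    """Predict y in {0, 1} for a binary feature vector x via
    arg max_y P(x|y) P(y), with P(x|y) = prod_j P(x_j|y)."""
    # Work in log space to avoid numerical underflow on long vectors.
    log_p1 = math.log(phi_y)
    log_p0 = math.log(1.0 - phi_y)
    for xj, p1, p0 in zip(x, phi_j_y1, phi_j_y0):
        log_p1 += math.log(p1 if xj == 1 else 1.0 - p1)
        log_p0 += math.log(p0 if xj == 1 else 1.0 - p0)
    return 1 if log_p1 > log_p0 else 0
```

Summing logs rather than multiplying probabilities is a standard trick, since with tens of thousands of features the raw product would underflow to zero.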
So this is Naive Bayes, and just to draw attention to two things: one is that in this model, each of our features was zero or one, indicating whether different words appear, and the length n of the feature vector was the number of words in the dictionary. So it might be, in this version, on the order of 50,000 words, say.

What I want to do now is describe two variations on this algorithm. The first one is the simpler one; it's just a generalization to the case where xi takes on more values. So one thing that's commonly done is to apply Naive Bayes to problems where some of these features, xi, take on k values rather than just two values, and in that case you build a very similar model, where P(x|y) is really the same thing, except that these are now going to be multinomial probabilities rather than Bernoullis, because the xi's can take on up to k values.
It turns out that one situation where this arises very commonly is when you have a feature that's actually continuous valued, and you choose to discretize it — you take the continuous-valued feature and discretize it into a finite set of k values. A perfect example, if you remember our very first supervised learning problem of predicting the price of houses: suppose you have a classification problem on these houses, so based on features of a house, you want to predict whether or not the house will be sold in the next six months, say. That's a classification problem, and if you use Naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do would be to take the continuous-valued living area and just discretize it into a few discrete buckets. And so depending on whether the living area of the house is less than 500 square feet, or between 1,000 and 1,500 square feet, and so on, or whether it's greater than 2,000 square feet, you choose the value of the corresponding feature, xi, to be one, two, three, or four, okay?
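As a small sketch of this bucketing (the exact cutoffs here are illustrative, not ones fixed by the lecture):

```python
def discretize(living_area, cutoffs=(500, 1000, 1500, 2000)):
    """Map a continuous living area (square feet) to a discrete bucket
    index: 1 for values below the first cutoff, up through
    len(cutoffs) + 1 for values at or above the last cutoff."""
    for bucket, cutoff in enumerate(cutoffs, start=1):
        if living_area < cutoff:
            return bucket
    return len(cutoffs) + 1
```

The resulting integer can then be treated as a k-valued multinomial feature exactly as above.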
So that was the first variation, or generalization, of Naive Bayes I wanted to talk about. I should just check; are there questions about this? Okay. Cool. And so it turns out that in practice, it's fairly common to use about ten buckets to discretize a continuous-valued feature; I drew four here only to save on writing.

The second and, sort of, final variation that I want to talk about for Naive Bayes is a variation that's specific to classifying text documents or, more generally, to classifying sequences. A text document, like a piece of email, you can think of as a sequence of words, and you can apply the model I'm about to describe to classifying other sequences as well, but let me just focus on text. And here's the idea. So far, given a piece of email, we were representing it using this binary-valued vector representation, and one of the things that this loses, for instance, is the number of times that different words appear. So, for example, suppose some word appears a lot of times — say you see the word "buy" a lot of times, or you see the word "Viagra"; it seems to be a common email example.
If you see the word Viagra a ton of times in the email, it is more likely to be spam than if it appears, I guess, only once — because even once, I guess, is enough. So let me describe a different, what's called an event model, for Naive Bayes that will take into account the number of times a word appears in the email, and let me give this previous model a name as well: this particular model for text classification is called the multivariate Bernoulli event model. It's not a great name; don't worry about what the name means. It refers to the fact that there are multiple Bernoulli random variables, but really — don't worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there is a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called the multinomial event model; it's just called that.
And here's what we're going to do. Given a piece of email, I'm going to represent my email as a feature vector, and so my i-th training example, x(i), will be a feature vector (x(i) sub 1, x(i) sub 2, ..., x(i) sub ni), where ni is equal to the number of words in this email, right? So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each of these elements of the feature vector — let me just write this as x subscript j — will be an index into my dictionary, okay? And so if my dictionary has 50,000 words, then each position in my feature vector will be a variable that takes on one of 50,000 possible values, corresponding to which word appeared in the j-th position of my email, okay? So, in other words, I'm going to take all the words in my email, and you have a feature vector that just says which word in my dictionary each word in the email was, okay?
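A minimal sketch of this representation (the tiny dictionary below is a toy assumption; a real one would have on the order of 50,000 entries):

```python
def email_to_indices(email_words, dictionary):
    """Represent an email as its sequence of 1-based dictionary indices,
    one per word, so the feature vector's length n_i is the word count.
    Words not found in the dictionary are simply dropped in this sketch."""
    index = {word: k for k, word in enumerate(dictionary, start=1)}
    return [index[w] for w in email_words if w in index]
```

For example, with the dictionary ["a", "buy", "now"], the email "buy now buy" becomes [2, 3, 2] — a length-3 vector, each entry an index into the dictionary.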
So we have a different definition for ni now — ni now varies and is different for every training example — and this xj is now an index into the dictionary. You know, the components of the feature vector are no longer binary random variables; they're these indices into the dictionary that take on a much larger set of values.

And so our generative model for this will be that the joint distribution over x and y is P(x, y) equals the product from j equals one to n of P(xj|y), times P(y), where again n is now the length of the email, all right? So the way to think about this formula is you imagine that there was some probability distribution over emails — some random process that generates the emails — and that process proceeds as follows. First, y is chosen — first the class label, whether someone is going to send you spam email or not, is chosen for us. So first the random variable y, the class label of spam or not spam, is generated, and then, having decided whether they'll send you spam or not spam, someone iterates over all 300 positions of the email, or however many words are going to compose the email, and generates words from some distribution that depends on whether they chose to send you spam or not spam.
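That generative story can be sketched directly as code (a toy illustration with made-up parameter names; `phi_k_y1[k-1]` is the probability of dictionary word k given spam):

```python
import random

def generate_email(n_words, phi_y, phi_k_y1, phi_k_y0):
    """Sample one (x, y) pair from the multinomial event model:
    draw the class label y ~ Bernoulli(phi_y), then draw each of the
    n_words positions i.i.d. from the word distribution for that class."""
    y = 1 if random.random() < phi_y else 0
    word_dist = phi_k_y1 if y == 1 else phi_k_y0
    words = random.choices(range(1, len(word_dist) + 1),
                           weights=word_dist, k=n_words)
    return words, y
```

Note that every word position is drawn from the same per-class distribution, which is exactly why this model counts word occurrences but ignores word order.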
So if they sent you spam, they'll tend to generate words like, you know, buy, and Viagra, and whatever — at discounts, sale, whatever. And if somebody chose to send you non-spam, then they'll send you, sort of, the more normal words you get in an email, okay? So just be careful, right? xi here has a very different definition from the previous event model, and n has a very different definition from the previous event model.

And so the parameters of this model are — let's see. Phi subscript k given y equals one, which is the probability that, you know, conditioned on someone deciding to send you spam, the next word they choose to put in the spam email is going to be word k; and similarly, phi subscript k given y equals zero — sort of, the same thing; well, I'll just write it out, I guess — and phi y, just the same as before, okay? So these are the parameters of the model, and given a training set, you can work out the maximum likelihood estimates of the parameters. So the maximum likelihood estimate of the parameters will be equal to — and now I'm going to write one of these, you know, big indicator-function things again.
It'll be a sum over your training set of the indicator that y(i) equals one, times the sum over all the words in that email — where n subscript i is the number of words in email i in your training set — of the indicator that x(i) sub j equals k, all divided by the sum over your training set of the indicator that y(i) equals one, times ni, okay? So the numerator says: sum over all your emails, and take into account only the emails that had class label one — take into account only the emails that were spam, because if y equals zero, then this indicator is zero and the term goes away — and then, summing over all the words in your spam email, it counts up the number of times you observed the word k in your spam emails. So, in other words, the numerator is: look at all the spam emails in your training set and count up the total number of times the word k appeared in these emails. The denominator then is the sum over i in your training set of, whenever one of your examples is spam, you know, the length of that spam email, and so the denominator is the total length of all of the spam emails in your training set.
And so the ratio is just: out of all your spam emails, what is the fraction of words in all your spam emails that were word k? And that's your estimate for the probability of the next piece of spam mail generating the word k in any given position, okay?

At the end of the previous lecture, I talked about Laplace smoothing, and so when you do that here as well, you add one to the numerator and k to the denominator, and this is the Laplace-smoothed estimate of this parameter, okay? And similarly, you can work out the estimates for the other parameters yourself, okay? So it's very similar. Yeah?

Student: I'm sorry. On the right, at the top — I was just wondering what the x(i) is, and what the n —

Instructor (Andrew Ng): Right. So in this second event model, the definition for xi and the definition for n are different, right? So here — well, this is for one example (x, y). So here, n is the number of words in a given email, right?
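The estimate just described can be sketched as follows (a toy illustration, without the Laplace smoothing the lecture adds afterward; emails are lists of dictionary indices, as above):

```python
def phi_k_given_spam(emails, labels, k):
    """Maximum likelihood estimate of phi_{k|y=1}: out of all word
    positions across the spam emails, the fraction holding word k."""
    count_k = sum(x.count(k) for x, y in zip(emails, labels) if y == 1)
    total_len = sum(len(x) for x, y in zip(emails, labels) if y == 1)
    return count_k / total_len
```

So with one spam email [1, 2, 1] and one non-spam email [3, 3], the estimate for word 1 is 2/3: word 1 fills two of the three spam word positions.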
And if it's the i-th email, we subscript it, so this is n subscript i, and so n will be different for different training examples. And here xi will take, you know, these values from 1 to 50,000, and xi is essentially the identity of the i-th word in a given piece of email, okay? So that's why this is a product over all the different words of your email of the probability of the i-th word in your email, conditioned on y. Yeah?

Student: [Off mic].

Instructor (Andrew Ng): Oh, no — actually, you know what, I apologize. I just realized that I overloaded the notation, right; I shouldn't have used k here. Let me use a different alphabet and see if that makes sense; does that make sense? Oh, you know what, I'm sorry — you're absolutely right. Thank you. All right. So in Laplace smoothing, that shouldn't be k. This should be, you know, 50,000, if you have 50,000 words in your dictionary. Yeah, thanks. Great. I stole notation from the previous lecture and didn't translate it properly. So in Laplace smoothing, right, this is the number of possible values that the random variable xi can take on. Cool. Raise your hand if this makes sense? Okay.
Some of you — are there more questions about this? Yeah.

Student: On Laplace smoothing, the term you add in the denominator — is that the number of values that y could take?

Instructor (Andrew Ng): Yeah, let's see. So Laplace smoothing is a method to give you, sort of, hopefully, better estimates of a probability distribution over a multinomial, and so — was I using x or y in the previous lecture? So in trying to estimate the probability over a multinomial — I think x and y are different. I think — was it x or y? I think it was x, actually. Well — oh, I see, right, right. I think I was using a different definition for the random variable, because suppose you have a multinomial random variable x which takes on — let's use a different alphabet — suppose you have a multinomial random variable x which takes on L different values. Then the maximum likelihood estimate for the probability of x, P(x = k), will be equal to, right, the number of observations.
The maximum likelihood estimate for the probability of x being equal to k will be the number of observations of x equals k, divided by the total number of observations of x, okay? So that's the maximum likelihood estimate. And to add Laplace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator, where L is the number of possible values that x can take on. So, in this case, this is a probability that x equals k, and x can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but that's why I add 50,000 to the denominator. Are there other questions? Yeah.

Student: Is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple of times, and all the examples make sense, but I don't know what the, like, general formula for it is.

Instructor (Andrew Ng): I see. Yeah, right. So the question is: what's the definition of the maximum likelihood estimate? So actually, in today's lecture and the previous lecture, when I talked about Gaussian discriminant analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them.
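As a sketch, the general recipe just stated — add one to the count of each outcome, and L to the total — looks like this:

```python
def laplace_estimate(observations, k, L):
    """Laplace-smoothed estimate of P(x = k) for a multinomial x taking
    values 1..L: (count of k + 1) / (number of observations + L), so
    no outcome ever gets probability exactly zero."""
    return (observations.count(k) + 1) / (len(observations) + L)
```

Even a value never observed in the data — say k = 3 with observations [1, 1, 2] and L = 3 — gets a small nonzero probability of 1/6 rather than zero.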
The way to actually work this out is to actually write down the likelihood. So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters — phi k given y equals one, phi k given y equals zero, and phi y, right? And so, given a training set, the likelihood — I guess I should be writing log likelihood — will be the log of the product from i equals one to m of P(x(i), y(i)), you know, parameterized by these things, okay? Where P(x(i), y(i)), right, is given by the product from j equals one to ni of P(xj(i) given y(i)) — they're parameterized by — well, I'll just drop the parameters to write this more simply — times P(y(i)), okay? So this is my log likelihood, and so the way you get the maximum likelihood estimates of the parameters is, given a fixed training set — given a set of fixed (x(i), y(i)) pairs — you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that I've been writing out.
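Written out (a sketch consistent with the board notation, taking m as the number of training examples and n_i as the length of email i), the log likelihood being maximized is:

```latex
\ell\big(\phi_{k|y=1}, \phi_{k|y=0}, \phi_y\big)
  = \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
  = \sum_{i=1}^{m} \Bigg[ \log p\big(y^{(i)}\big)
      + \sum_{j=1}^{n_i} \log p\big(x_j^{(i)} \mid y^{(i)}\big) \Bigg]
```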
So in a previous section of today's lecture I wrote out some maximum likelihood estimates for the Gaussian discriminant analysis model, and for Naive Bayes, and I didn't prove them, but you get to, sort of, play with that yourself in the homework problem as well, for one of these models, and you'll be able to verify that when you maximize the likelihood — maximize the log likelihood — hopefully you do get the same formulas as what I was drawing up on the board. But the way these are derived is by maximizing this, okay? Cool. All right.

So that, more or less, wraps up what I wanted to say about Naive Bayes, and it turns out that for text classification, the last Naive Bayes model — the multinomial event model — almost always does better than the first Naive Bayes model I talked about when you're applying it to the specific case of text classification, and one of the reasons hypothesized for this is that this second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesn't.
I should say that, in truth, it actually turns out not to be completely understood why the latter model does better than the former one for text classification, and, sort of, researchers are still debating about it a little bit. But if you ever have a text classification problem — you know, the Naive Bayes classifier is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good algorithm to try if you have a text classification problem, okay? Still a question? Yeah.

Student: So the second model is still position-invariant, right? It doesn't actually care where the words are.

Instructor (Andrew Ng): Yes, all right.

Student: And, I mean, [inaudible] — does that usually do better if you have enough data?

Instructor (Andrew Ng): Yeah, so the question is about, sort of, the second model, right? The second model, the multinomial event model, actually doesn't care about the ordering of the words. You can shuffle all the words in the email, and it does exactly the same thing.
So in natural language processing, there's actually another name for it; it's called a unigram model in natural language processing, and there are some other models — like, sort of, say, higher-order Markov models — that take into account some of the ordering of the words. It turns out that for text classification, the models like the bigram models or trigram models, I believe, do only very slightly better, if at all — but that's when you're applying them to text classification, okay? All right.

So the next thing I want to talk about is to start again our discussion of non-linear classifiers. So it turns out — well, the very first classification algorithm we talked about was logistic regression, which had the following form for the hypothesis, and you can think of this as predicting one when this estimated probability is greater than or equal to 0.5, and predicting zero, right, when this is less than 0.5. And given a training set, right?
00:25:52.760 --> 00:25:59.760 Logistic regression 00:26:02.549 --> 00:26:05.919 will maybe run gradient descent or something or 00:26:05.919 --> 00:26:07.440 use Newton's method 00:26:07.440 --> 00:26:09.370 to find a straight line that 00:26:09.370 --> 00:26:12.130 reasonably separates the positive and negative classes. 00:26:12.130 --> 00:26:15.830 But sometimes a data set just can't be separated by a straight line, so is there 00:26:15.830 --> 00:26:16.950 an algorithm 00:26:16.950 --> 00:26:18.230 that will let you 00:26:18.230 --> 00:26:25.230 start to learn these sorts of non-linear decision boundaries? 00:26:25.540 --> 00:26:29.230 And so how do you go about getting a non-linear classifier? And, by the 00:26:29.230 --> 00:26:34.360 way, one cool result is that remember when I said - when we talked 00:26:34.360 --> 00:26:37.790 about generative learning algorithms, I said that 00:26:37.790 --> 00:26:41.040 if you assume X given Y is 00:26:41.040 --> 00:26:43.710 exponential family, 00:26:43.710 --> 00:26:45.330 00:26:45.330 --> 00:26:46.669 right, with natural parameter 00:26:46.669 --> 00:26:50.730 eta one, if you build a generative learning algorithm assuming this, right, for Y equals 00:26:50.730 --> 00:26:55.850 one, and if X given Y equals 00:26:55.850 --> 00:26:59.710 zero is exponential family 00:26:59.710 --> 00:27:01.750 with natural parameter eta zero, 00:27:01.750 --> 00:27:05.610 right. I think when we talked about Gaussian Discriminant Analysis, I said that if this holds true, 00:27:05.610 --> 00:27:06.210 then 00:27:06.210 --> 00:27:08.610 you end up with a logistic posterior. It 00:27:08.610 --> 00:27:11.410 actually turns out that a Naive Bayes model 00:27:11.410 --> 00:27:12.629 actually falls into this 00:27:12.629 --> 00:27:17.440 as well.
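The claim in this passage can be written out as a short derivation (a sketch, using the standard exponential-family notation with sufficient statistic $T(x)$, log-partition function $a(\eta)$, and class prior $\phi = P(y=1)$). If both class-conditionals are exponential family, Bayes rule gives a posterior that is logistic in $T(x)$:

```latex
\begin{align*}
p(x \mid y=1) &= b(x)\exp\!\big(\eta_1^\top T(x) - a(\eta_1)\big), \qquad
p(x \mid y=0) = b(x)\exp\!\big(\eta_0^\top T(x) - a(\eta_0)\big), \\[4pt]
p(y=1 \mid x) &= \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)}
= \frac{1}{1 + \exp\!\big(-(\eta_1 - \eta_0)^\top T(x) - c\big)},
\end{align*}
```

where $c = a(\eta_0) - a(\eta_1) + \log\frac{\phi}{1-\phi}$ collects the constant terms. So the posterior is a sigmoid of a linear function of $T(x)$, i.e. a linear classifier, which is why Naive Bayes (whose class-conditionals are products of Bernoullis, an exponential-family distribution) also gives a linear decision boundary.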
So the Naive Bayes model actually falls into this exponential family as well, and, 00:27:17.440 --> 00:27:19.370 therefore, 00:27:19.370 --> 00:27:23.000 under the Naive Bayes model, you're actually 00:27:23.000 --> 00:27:27.080 using a linear classifier as well, okay? 00:27:27.080 --> 00:27:29.020 So the question is 00:27:29.020 --> 00:27:32.480 how can you start to get non-linear classifiers? 00:27:32.480 --> 00:27:35.649 And I'm going to talk about one method today 00:27:35.649 --> 00:27:41.840 which is - and 00:27:41.840 --> 00:27:45.300 we started to talk about it very briefly which is 00:27:45.300 --> 00:27:46.650 taking a 00:27:46.650 --> 00:27:48.990 simpler algorithm like 00:27:48.990 --> 00:27:50.720 logistic regression 00:27:50.720 --> 00:27:57.720 and using it to build up to more complex non-linear classifiers, okay? So 00:27:58.390 --> 00:28:00.440 to motivate this discussion, 00:28:00.440 --> 00:28:05.340 I'm going to use the little picture - let's see. So suppose you have features X1, 00:28:05.340 --> 00:28:07.340 X2, and X3, 00:28:07.340 --> 00:28:10.320 and so by convention, I'm gonna follow 00:28:10.320 --> 00:28:13.460 our earlier convention that X0 is set to one, 00:28:13.460 --> 00:28:17.010 and so I'm gonna use a little diagram like this 00:28:17.010 --> 00:28:18.090 to denote 00:28:18.090 --> 00:28:19.690 our 00:28:19.690 --> 00:28:24.320 logistic regression unit, okay? So 00:28:24.320 --> 00:28:28.609 think of a little picture like that, you know, this little circle as denoting a 00:28:28.609 --> 00:28:29.419 computation node 00:28:29.419 --> 00:28:32.630 that takes this input, you know, several features and then it outputs another 00:28:32.630 --> 00:28:36.090 number that's H subscript theta of X, given by 00:28:36.090 --> 00:28:37.800 a sigmoid function, 00:28:37.800 --> 00:28:39.930 and so this little computational unit - 00:28:39.930 --> 00:28:43.460 well, will have parameters theta.
00:28:43.460 --> 00:28:47.700 Now, in order to get non-linear decision boundaries, all we need to do - 00:28:47.700 --> 00:28:50.850 well, at least one thing to do is just 00:28:50.850 --> 00:28:52.780 come up with 00:28:52.780 --> 00:28:54.390 00:28:54.390 --> 00:28:57.020 a way to represent hypotheses 00:28:57.020 --> 00:29:00.640 that can output non-linear decision boundaries, right, 00:29:00.640 --> 00:29:02.700 and so 00:29:02.700 --> 00:29:07.490 this is - 00:29:07.490 --> 00:29:09.750 when you put a bunch of those little pictures 00:29:09.750 --> 00:29:11.390 that I drew 00:29:11.390 --> 00:29:12.790 on the previous board, 00:29:12.790 --> 00:29:14.169 you can then get 00:29:14.169 --> 00:29:18.549 what's called 00:29:18.549 --> 00:29:20.780 a Neural Network in which you 00:29:20.780 --> 00:29:23.710 think of having my features here 00:29:23.710 --> 00:29:30.710 and then I would feed them to 00:29:31.019 --> 00:29:34.530 say a few of these little 00:29:34.530 --> 00:29:38.940 sigmoidal units, 00:29:38.940 --> 00:29:43.559 and these together will feed into yet another sigmoidal unit, say, 00:29:43.559 --> 00:29:46.270 which will output 00:29:46.270 --> 00:29:51.510 my final output H subscript theta of X, okay? And 00:29:51.510 --> 00:29:54.500 just to give these things names, let me call 00:29:54.500 --> 00:29:58.380 the values output by these three intermediate sigmoidal units; let me call them A1, 00:29:58.380 --> 00:30:01.360 A2, A3. 00:30:01.360 --> 00:30:02.580 And let me just be 00:30:02.580 --> 00:30:05.679 completely concrete about what this formula represents, right?
So 00:30:05.679 --> 00:30:06.570 00:30:06.570 --> 00:30:08.309 each of these 00:30:08.309 --> 00:30:12.080 units in the middle will have their own associated set of parameters, 00:30:12.080 --> 00:30:15.820 and so the value A1 will be computed as 00:30:15.820 --> 00:30:17.940 G 00:30:17.940 --> 00:30:21.440 of X transpose, 00:30:21.440 --> 00:30:27.440 and then some set of parameters, which I'll write as theta one, and similarly, A2 will be computed as G of X transpose theta 00:30:27.440 --> 00:30:29.090 two, 00:30:29.090 --> 00:30:30.980 and A3 will be 00:30:30.980 --> 00:30:34.490 G of X transpose, 00:30:34.490 --> 00:30:35.700 theta three, 00:30:35.700 --> 00:30:42.700 where G is the sigmoid function, all right? So G of Z, 00:30:43.640 --> 00:30:49.850 and then, finally, our hypothesis will output 00:30:49.850 --> 00:30:52.360 G of 00:30:52.360 --> 00:30:53.950 A 00:30:53.950 --> 00:30:56.620 transpose 00:30:56.620 --> 00:30:59.200 theta four, right? Where, you 00:30:59.200 --> 00:31:01.320 know, this A vector 00:31:01.320 --> 00:31:04.390 is a vector of A1, 00:31:04.390 --> 00:31:05.780 A2, A3. 00:31:05.780 --> 00:31:08.220 We can append another one to it as the first entry 00:31:08.220 --> 00:31:11.470 if you want, okay? Let 00:31:11.470 --> 00:31:14.270 me 00:31:14.270 --> 00:31:16.210 just draw up here this - I'm 00:31:16.210 --> 00:31:18.470 sorry about the cluttered board.
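The forward computation just described can be sketched in a few lines of Python (not from the lecture; the parameter values are arbitrary placeholders, and the inputs carry the X0 = 1 intercept entry by the convention above): each hidden unit computes G of X transpose theta, and the output unit applies G to the hidden activations.

```python
import math

def g(z):
    """Sigmoid activation G(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward(x, theta1, theta2, theta3, theta4):
    """One hidden layer of three sigmoid units feeding a sigmoid output unit.
    x includes the x0 = 1 intercept entry; a gets a leading 1 appended,
    as mentioned in the lecture."""
    a = [g(dot(x, theta1)), g(dot(x, theta2)), g(dot(x, theta3))]
    a = [1.0] + a
    return g(dot(a, theta4))

# Hypothetical parameters, just to exercise the wiring
x = [1.0, 0.5, -1.2, 2.0]            # x0 = 1, then features x1, x2, x3
theta1 = [0.1, -0.4, 0.2, 0.3]
theta2 = [-0.2, 0.5, 0.1, -0.1]
theta3 = [0.0, 0.3, -0.3, 0.2]
theta4 = [0.05, 0.7, -0.6, 0.4]

h = forward(x, theta1, theta2, theta3, theta4)
assert 0.0 < h < 1.0                 # a sigmoid output always lies in (0, 1)
```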
00:31:18.470 --> 00:31:20.950 And so H subscript theta of X, 00:31:20.950 --> 00:31:23.850 this is a function 00:31:23.850 --> 00:31:29.300 of all the parameters theta one 00:31:29.300 --> 00:31:32.420 through theta four, 00:31:32.420 --> 00:31:36.680 and so one way to learn parameters for this model 00:31:36.680 --> 00:31:37.650 is to 00:31:37.650 --> 00:31:39.910 write down the cost function, 00:31:39.910 --> 00:31:41.180 say, J of theta 00:31:41.180 --> 00:31:46.040 equals one-half sum from I equals one to 00:31:46.040 --> 00:31:49.730 M, YI minus H subscript theta of 00:31:49.730 --> 00:31:51.770 XI 00:31:51.770 --> 00:31:54.450 squared, say. Okay, 00:31:54.450 --> 00:31:56.070 so that's our familiar 00:31:56.070 --> 00:31:58.280 quadratic cost function, 00:31:58.280 --> 00:31:59.980 and so 00:31:59.980 --> 00:32:03.990 one way to learn the parameters of an algorithm like this is to just use gradient 00:32:03.990 --> 00:32:04.840 descent 00:32:04.840 --> 00:32:08.809 to minimize J of theta as a function of theta, okay? So you, sort of, run 00:32:08.809 --> 00:32:10.940 gradient descent 00:32:10.940 --> 00:32:14.610 to minimize this squared error, which 00:32:14.610 --> 00:32:18.050 stated differently means you use gradient descent to make the predictions of your neural 00:32:18.050 --> 00:32:20.330 network as close as possible 00:32:20.330 --> 00:32:27.330 to what you observed as the labels in your training 00:32:30.550 --> 00:32:34.520 set, okay?
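Here is a minimal runnable sketch of that idea (not from the lecture): a tiny two-hidden-unit sigmoid network trained on made-up XOR-like data by plain gradient descent on the quadratic cost. For simplicity the gradient is computed by finite differences, a slow numerical stand-in for the analytic derivatives that back propagation would compute.

```python
import math
import random

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w):
    """Tiny network: two sigmoid hidden units feeding one sigmoid output.
    w packs all nine parameters (each unit has an intercept) in a flat list."""
    a1 = g(w[0] + w[1] * x[0] + w[2] * x[1])
    a2 = g(w[3] + w[4] * x[0] + w[5] * x[1])
    return g(w[6] + w[7] * a1 + w[8] * a2)

def J(w, data):
    """Quadratic cost: J(theta) = 1/2 * sum_i (y_i - h_theta(x_i))^2."""
    return 0.5 * sum((y - predict(x, w)) ** 2 for x, y in data)

def grad(w, data, eps=1e-5):
    """Central finite-difference gradient of J -- a numerical stand-in for
    back propagation, which computes the same derivatives analytically."""
    out = []
    for j in range(len(w)):
        wp, wm = w[:], w[:]
        wp[j] += eps
        wm[j] -= eps
        out.append((J(wp, data) - J(wm, data)) / (2 * eps))
    return out

# XOR-like toy training set: not separable by a straight line
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

random.seed(0)
w = [random.uniform(-1.0, 1.0) for _ in range(9)]
before = J(w, data)
for _ in range(300):  # plain gradient descent with a fixed learning rate
    dw = grad(w, data)
    w = [wj - 0.5 * gj for wj, gj in zip(w, dw)]
assert J(w, data) < before  # the cost decreased (toward some local optimum)
```

As the lecture notes shortly after this point, there is no guarantee of reaching a global optimum here; the assertion only checks that descent reduced the cost from its random starting value.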
So it turns out gradient descent on this neural network has a specific name, the 00:32:34.520 --> 00:32:35.059 algorithm 00:32:35.059 --> 00:32:38.789 that implements gradient descent is called back propagation, and so if you ever hear that, all that 00:32:38.789 --> 00:32:41.240 means is - it just means gradient descent on 00:32:41.240 --> 00:32:45.170 a cost function like this or a variation of this on the neural network 00:32:45.170 --> 00:32:47.909 that looks like that, 00:32:47.909 --> 00:32:50.220 and - 00:32:50.220 --> 00:32:54.360 well, this algorithm actually has some advantages and disadvantages, but let me 00:32:54.360 --> 00:32:57.830 actually show you. So, let's see. 00:32:57.830 --> 00:33:00.729 One of the interesting things about the neural network is that 00:33:00.729 --> 00:33:05.430 you can look at what these intermediate nodes are computing, right? So 00:33:05.430 --> 00:33:08.270 this neural network has what's called 00:33:08.270 --> 00:33:11.610 a hidden layer before you then have the output layer, 00:33:11.610 --> 00:33:14.410 and, more generally, you can actually have inputs feed 00:33:14.410 --> 00:33:16.279 into these 00:33:16.279 --> 00:33:19.690 computation units, feed into more layers of computation units, to even more layers, to 00:33:19.690 --> 00:33:23.370 more layers, and then finally you have an output layer at the end. 00:33:23.370 --> 00:33:26.970 And one cool thing you can do is look at all of these intermediate units, look at 00:33:26.970 --> 00:33:28.280 these 00:33:28.280 --> 00:33:30.980 units in what's called a hidden layer of the neural network. Don't 00:33:30.980 --> 00:33:32.580 worry about why it's called that. Look 00:33:32.580 --> 00:33:34.489 at computations of the hidden unit 00:33:34.489 --> 00:33:39.570 and ask what is the hidden unit of the neural network computing?
So 00:33:39.570 --> 00:33:42.900 to, maybe, get a better sense of what neural networks might be doing, let me show you a 00:33:42.900 --> 00:33:44.580 video - I'm gonna switch to the laptop - this is made 00:33:44.580 --> 00:33:46.250 by a 00:33:46.250 --> 00:33:49.360 friend, Yann LeCun 00:33:49.360 --> 00:33:52.420 who's currently a professor at New York University. 00:33:52.420 --> 00:33:56.289 Can I show a video on the laptop? 00:33:56.289 --> 00:33:58.600 So let me show you a video 00:33:58.600 --> 00:34:00.440 from Yann 00:34:00.440 --> 00:34:06.330 LeCun on a neural network that he developed for handwritten digit recognition. 00:34:06.330 --> 00:34:10.110 There was one other thing he did in this neural network that I'm not gonna talk about called 00:34:10.110 --> 00:34:12.389 a Convolutional Neural Network 00:34:12.389 --> 00:34:16.009 that - well, 00:34:16.009 --> 00:34:18.809 his system is called LeNet, 00:34:18.809 --> 00:34:22.569 and let's see. Would you put 00:34:22.569 --> 00:34:23.360 on the 00:34:23.360 --> 00:34:30.360 laptop display? Hum, 00:34:37.779 --> 00:34:38.789 actually maybe if - 00:34:38.789 --> 00:34:42.339 or you can just put on the screen on the side; that would work too 00:34:42.339 --> 00:34:49.339 if the big screen isn't working. 00:35:12.249 --> 00:35:13.729 Let's see. I'm 00:35:13.729 --> 00:35:19.299 just trying to think, okay, how do I keep you guys entertained while we're waiting for the video to come 00:35:19.299 --> 00:35:24.209 on? Well, let me say a few more things about neural network.
00:35:24.209 --> 00:35:25.529 So it turns out that 00:35:25.529 --> 00:35:29.039 when you write a quadratic cost function like I 00:35:29.039 --> 00:35:31.689 wrote down on the chalkboard just now, 00:35:31.689 --> 00:35:34.639 it turns out that unlike logistic regression, 00:35:34.639 --> 00:35:39.029 that will almost always correspond to a non-convex optimization problem, 00:35:39.029 --> 00:35:42.709 and so whereas for logistic regression if you run gradient descent 00:35:42.709 --> 00:35:44.280 or Newton's method or whatever, 00:35:44.280 --> 00:35:46.000 you converge to the global optimum. 00:35:46.000 --> 00:35:50.500 This is not true for neural networks. In general, there are lots of local optima 00:35:50.500 --> 00:35:55.209 and it's, sort of, a much harder optimization problem. 00:35:55.209 --> 00:35:56.159 So 00:35:56.159 --> 00:35:59.699 neural networks, if you're, sort of, familiar with them, and you're good at making design choices 00:35:59.699 --> 00:36:01.599 like what learning rate to use, and 00:36:01.599 --> 00:36:04.319 how many hidden units to use, and so on, you can, 00:36:04.319 --> 00:36:07.700 sort of, get them to be fairly effective, 00:36:07.700 --> 00:36:09.109 and 00:36:09.109 --> 00:36:12.910 there's, sort of, often ongoing debates about, you know, is this learning algorithm 00:36:12.910 --> 00:36:13.819 better, or is that learning 00:36:13.819 --> 00:36:17.869 algorithm better? The vast majority of machine learning researchers today seem to perceive 00:36:17.869 --> 00:36:20.180 support vector machines, which is what I'll talk about later, 00:36:20.180 --> 00:36:24.349 to be a much more effective off-the-shelf learning algorithm than neural networks.
00:36:24.349 --> 00:36:26.509 This point of view is contested 00:36:26.509 --> 00:36:32.039 a bit, but so neural networks are not something that I personally use a lot right 00:36:32.039 --> 00:36:35.670 now because it's a hard optimization problem and you, sort of, run it and it often converges, 00:36:35.670 --> 00:36:39.969 and it actually, sort of, works. It, sort of, works reasonably 00:36:39.969 --> 00:36:41.739 well. It's just 00:36:41.739 --> 00:36:43.700 because this is fairly complicated, 00:36:43.700 --> 00:36:45.269 there's not an algorithm that 00:36:45.269 --> 00:36:52.009 I use commonly or that my friends use all the 00:36:52.009 --> 00:36:56.019 time. Oh, cool. So but let me just go and show you an example of neural network, which was for many years, 00:36:56.019 --> 00:36:56.819 00:36:56.819 --> 00:37:00.400 you know, the most effective learning algorithm before support vector 00:37:00.400 --> 00:37:02.029 machines were invented. 00:37:02.029 --> 00:37:06.349 So here's Yann LeCun's video, and - 00:37:06.349 --> 00:37:09.350 well, there's actually audio on this too, the soundboard. So I'll just tell you 00:37:09.350 --> 00:37:11.209 what's happening. 00:37:11.209 --> 00:37:11.880 00:37:11.880 --> 00:37:14.629 What you're seeing is a trained neural network, 00:37:14.629 --> 00:37:18.549 and this display where my mouse pointer is pointing, right, this big three 00:37:18.549 --> 00:37:19.429 there 00:37:19.429 --> 00:37:21.189 is 00:37:21.189 --> 00:37:24.799 the input to the neural network. So you're showing the neural network this image, and it's 00:37:24.799 --> 00:37:26.859 trying to recognize what is this.
00:37:26.859 --> 00:37:30.769 The final answer output by the neural network is this number up here, right 00:37:30.769 --> 00:37:33.189 below where it says LeNet-5, 00:37:33.189 --> 00:37:34.229 and the 00:37:34.229 --> 00:37:37.759 neural network correctly recognizes this image as a three, 00:37:37.759 --> 00:37:39.890 and if you look to the left of this image, 00:37:39.890 --> 00:37:42.309 what's interesting about this is 00:37:42.309 --> 00:37:46.640 the display on the left portion of this is actually showing the 00:37:46.640 --> 00:37:49.999 intermediate computations of the neural network. In other words, it's showing you 00:37:49.999 --> 00:37:51.900 what are the hidden layers of the 00:37:51.900 --> 00:37:53.829 neural network computing. 00:37:53.829 --> 00:37:57.529 And so, for example, if you look at this one, the third image down from the top, 00:37:57.529 --> 00:38:01.569 this seems to be computing, you know, certain edges of the digits, 00:38:01.569 --> 00:38:05.630 right? Sort of, detecting edges on the 00:38:05.630 --> 00:38:07.489 right-hand side or the bottom or something 00:38:07.489 --> 00:38:09.889 of the input display, 00:38:09.889 --> 00:38:11.479 of the input image, okay? 00:38:11.479 --> 00:38:15.129 So let me just play this video, and you can see 00:38:15.129 --> 00:38:17.019 some 00:38:17.019 --> 00:38:20.619 of the inputs and outputs of the neural network, and 00:38:20.619 --> 00:38:21.529 those 00:38:21.529 --> 00:38:26.309 are very different fonts. There's this robustness to 00:38:26.309 --> 00:38:30.670 noise. 00:38:30.670 --> 00:38:36.929 All 00:38:36.929 --> 00:38:37.880 00:38:37.880 --> 00:38:41.949 00:38:41.949 --> 00:38:43.000 right. Multiple 00:38:43.000 --> 00:38:49.089 digits, 00:38:49.089 --> 00:38:50.739 00:38:50.739 --> 00:38:57.739 that's, kind of, cool. All 00:38:59.259 --> 00:39:06.259 right.
00:39:12.140 --> 00:39:19.140 00:39:30.779 --> 00:39:35.789 00:39:35.789 --> 00:39:38.689 00:39:38.689 --> 00:39:40.459 00:39:40.459 --> 00:39:41.660 00:39:41.660 --> 00:39:48.660 00:39:52.359 --> 00:39:54.809 So, 00:39:54.809 --> 00:39:58.219 just for fun, let me show you one more video, 00:39:58.219 --> 00:40:00.579 which was - let's 00:40:00.579 --> 00:40:06.089 see. This is another video from the series The Machine That Changed the World, which 00:40:06.089 --> 00:40:07.490 was produced by WGBH 00:40:07.490 --> 00:40:09.059 Television 00:40:09.059 --> 00:40:10.800 in 00:40:10.800 --> 00:40:12.599 cooperation with the British Broadcasting Corporation, 00:40:12.599 --> 00:40:16.159 and it was aired on PBS a few years ago, I think. 00:40:16.159 --> 00:40:17.630 I want to show you a 00:40:17.630 --> 00:40:19.589 video describing the NETtalk 00:40:19.589 --> 00:40:23.629 Neural Network, which was developed by Terry Sejnowski; he's a 00:40:23.629 --> 00:40:25.739 researcher. 00:40:25.739 --> 00:40:29.589 And so NETtalk was actually one of the major milestones in the 00:40:29.589 --> 00:40:31.749 history of neural network, 00:40:31.749 --> 00:40:33.490 and this specific application 00:40:33.490 --> 00:40:36.399 is getting the neural network to read text. 00:40:36.399 --> 00:40:38.779 So, in other words, can you show a 00:40:38.779 --> 00:40:41.659 piece of English to a computer 00:40:41.659 --> 00:40:43.419 and have the computer read, 00:40:43.419 --> 00:40:46.410 sort of, verbally produce sounds that correspond 00:40:46.410 --> 00:40:48.319 to the 00:40:48.319 --> 00:40:50.629 reading of the text. 00:40:50.629 --> 00:40:55.279 And it turns out that in the history of AI and the history of machine learning, 00:40:55.279 --> 00:40:56.509 this video 00:40:56.509 --> 00:41:00.849 created a lot of excitement about neural networks and about machine learning.
Part of the 00:41:00.849 --> 00:41:05.429 reason was that Terry Sejnowski had the foresight to choose 00:41:05.429 --> 00:41:06.699 to use, 00:41:06.699 --> 00:41:08.040 in his video, 00:41:08.040 --> 00:41:12.940 a child-like voice talking about visiting your grandmother's house and so on. You'll 00:41:12.940 --> 00:41:14.289 see it in a second, 00:41:14.289 --> 00:41:16.029 and so 00:41:16.029 --> 00:41:19.399 this really created the perception of - created 00:41:19.399 --> 00:41:23.129 the impression of the neural network being like a young child learning how to speak, 00:41:23.129 --> 00:41:25.260 and talking about going to your grandmothers, and so 00:41:25.260 --> 00:41:27.379 on. So this actually 00:41:27.379 --> 00:41:29.289 helped generate a lot of excitement 00:41:29.289 --> 00:41:32.449 within academia and outside academia on neural networks, sort of, early in the history 00:41:32.449 --> 00:41:36.109 of neural networks. I'm just gonna show you the video. 00:41:36.109 --> 00:41:37.910 [Begin Video] You're going to hear first 00:41:37.910 --> 00:41:41.559 what the network sounds like at the very beginning of the training, and it won't 00:41:41.559 --> 00:41:43.589 sound like words, but it'll sound 00:41:43.589 --> 00:41:48.009 like attempts that will get better and better with 00:41:48.009 --> 00:41:54.239 time. [Computer's voice] The network 00:41:54.239 --> 00:41:58.249 takes the letters, say the phrase, grandmother's house, and 00:41:58.249 --> 00:42:02.679 makes a random attempt at pronouncing it. [Computer's 00:42:02.679 --> 00:42:07.219 voice] 00:42:07.219 --> 00:42:09.269 Grandmother's house. The 00:42:09.269 --> 00:42:12.530 phonetic difference between the guess and the right pronunciation 00:42:12.530 --> 00:42:16.539 is sent back through the network. [Computer's 00:42:16.539 --> 00:42:18.989 voice] 00:42:18.989 --> 00:42:20.329 Grandmother's house. 
00:42:20.329 --> 00:42:23.920 By adjusting the connection strengths after each attempt, 00:42:23.920 --> 00:42:26.689 the net slowly improves. And, 00:42:26.689 --> 00:42:30.029 finally, after letting it train overnight, 00:42:30.029 --> 00:42:32.700 the next morning it sounds like this: Grandmother's house, I'd like to go to my grandmother's 00:42:32.700 --> 00:42:37.759 house. 00:42:37.759 --> 00:42:41.149 Well, because she gives us candy. 00:42:41.149 --> 00:42:46.789 Well, and we - NETtalk understands nothing about the language. It is simply associating letters 00:42:46.789 --> 00:42:49.809 with sounds. 00:42:49.809 --> 00:42:52.359 [End Video] All right. So 00:42:52.359 --> 00:42:54.380 at the time this was done, I mean, this is 00:42:54.380 --> 00:42:57.379 an amazing piece of work. I should say 00:42:57.379 --> 00:42:59.140 today there are other 00:42:59.140 --> 00:43:03.389 text to speech systems that work better than what you just saw, 00:43:03.389 --> 00:43:08.359 and you'll also appreciate getting candy from your grandmother's house is a little bit 00:43:08.359 --> 00:43:11.449 less impressive than talking about the Dow Jones falling 15 points, 00:43:11.449 --> 00:43:16.149 and profit taking, whatever. So 00:43:16.149 --> 00:43:18.789 but I wanted to show that just because that was another cool, 00:43:18.789 --> 00:43:25.789 major landmark in the history of neural networks. Okay. 00:43:27.259 --> 00:43:32.949 So let's switch back to the chalkboard, 00:43:32.949 --> 00:43:35.859 and what I want to do next 00:43:35.859 --> 00:43:39.479 is 00:43:39.479 --> 00:43:42.369 tell you about Support Vector Machines, 00:43:42.369 --> 00:43:49.369 okay? That, sort of, wraps up our discussion on neural 00:44:08.839 --> 00:44:10.850 networks. So I started off talking about neural networks 00:44:10.850 --> 00:44:13.269 by motivating it as a way to get 00:44:13.269 --> 00:44:17.920 us to output non-linear classifiers, right?
It turns out 00:44:17.920 --> 00:44:18.419 that 00:44:18.419 --> 00:44:21.799 you'd be able to come up with non-linear decision boundaries using a neural 00:44:21.799 --> 00:44:22.479 network 00:44:22.479 --> 00:44:27.309 like what I drew on the chalkboard earlier. 00:44:27.309 --> 00:44:30.469 Support Vector Machines will be another learning algorithm that will give us a way 00:44:30.469 --> 00:44:34.529 to come up with non-linear classifiers. It's a very effective, off-the-shelf 00:44:34.529 --> 00:44:36.019 learning algorithm, 00:44:36.019 --> 00:44:39.399 but it turns out that in the discussion I'm gonna - in 00:44:39.399 --> 00:44:43.379 the progression and development I'm gonna pursue, I'm actually going to start 00:44:43.379 --> 00:44:47.799 off by describing yet another class of linear classifiers with linear decision 00:44:47.799 --> 00:44:49.269 boundaries, 00:44:49.269 --> 00:44:54.179 and only later, sort of, in probably the next lecture or the one after that, 00:44:54.179 --> 00:44:55.249 will we then 00:44:55.249 --> 00:44:58.699 take the support vector machine idea and, sort of, do some clever things to it 00:44:58.699 --> 00:45:02.959 to make it work very well to generate non-linear decision boundaries as well, okay? 00:45:02.959 --> 00:45:07.859 But we'll actually start by talking about linear classifiers a little bit more. 00:45:07.859 --> 00:45:12.619 And to do that, I want to 00:45:12.619 --> 00:45:14.559 convey two 00:45:14.559 --> 00:45:16.689 intuitions about classification. 00:45:16.689 --> 00:45:20.700 One is 00:45:20.700 --> 00:45:24.450 you think about logistic regression; we have this logistic function that was outputting 00:45:24.450 --> 00:45:28.669 the probability that Y equals one, 00:45:28.669 --> 00:45:31.049 and it crosses 00:45:31.049 --> 00:45:32.959 one-half at zero.
00:45:32.959 --> 00:45:37.819 So when you run logistic regression, I want you to think of it as 00:45:37.819 --> 00:45:41.179 an algorithm that computes 00:45:41.179 --> 00:45:43.239 theta transpose X, 00:45:43.239 --> 00:45:46.789 and then it predicts 00:45:46.789 --> 00:45:51.170 one, right, 00:45:51.170 --> 00:45:56.359 if and only if, theta transpose X is greater than zero, right? IFF stands for if 00:45:56.359 --> 00:45:59.539 and only if. It means the same thing as a double implication, 00:45:59.539 --> 00:46:04.629 and it predicts zero, 00:46:04.629 --> 00:46:11.629 if and only if, theta transpose X is less than zero, okay? So if 00:46:12.499 --> 00:46:15.019 it's the case that 00:46:15.019 --> 00:46:18.069 theta transpose X is much greater than zero, 00:46:18.069 --> 00:46:22.309 the double greater than sign means these are much greater than, all right. 00:46:22.309 --> 00:46:25.589 So if theta transpose X is much greater than zero, 00:46:25.589 --> 00:46:26.739 then, 00:46:26.739 --> 00:46:33.069 you know, think of that as a very confident 00:46:33.069 --> 00:46:38.779 prediction 00:46:38.779 --> 00:46:40.779 that Y is equal to one, 00:46:40.779 --> 00:46:43.459 right? If theta transpose X is much greater than zero, then we're 00:46:43.459 --> 00:46:47.049 gonna predict one then moreover we're very confident it's one, and the picture for 00:46:47.049 --> 00:46:48.479 that is if theta 00:46:48.479 --> 00:46:51.179 transpose X is way out here, then 00:46:51.179 --> 00:46:54.329 we're estimating that the probability of Y being equal to one 00:46:54.329 --> 00:46:56.089 on the sigmoid function, it will 00:46:56.089 --> 00:46:58.699 be very close to one. 00:46:58.699 --> 00:47:02.779 And, in the same way, if theta transpose X 00:47:02.779 --> 00:47:05.190 is much less than zero, 00:47:05.190 --> 00:47:12.190 then we're very confident that 00:47:12.809 --> 00:47:15.849 Y is equal to zero. 
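This view of logistic regression can be sketched directly (a toy illustration, not from the lecture; the parameter values are made up): predict one if and only if theta transpose X is greater than zero, and treat the sigmoid of theta transpose X as the confidence. When theta transpose X is far from zero in either direction, the estimated probability saturates near one or zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Predict 1 iff theta^T x > 0; the sigmoid of theta^T x is the
    estimated probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return (1 if z > 0 else 0), sigmoid(z)

theta = [0.0, 2.0, -1.0]                      # hypothetical fitted parameters

# theta^T x >> 0: a very confident prediction that y = 1
label, p = predict(theta, [1.0, 5.0, 0.5])    # z = 9.5
assert label == 1 and p > 0.99

# theta^T x << 0: a very confident prediction that y = 0
label, p = predict(theta, [1.0, -4.0, 2.0])   # z = -10.0
assert label == 0 and p < 0.01
```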
00:47:15.849 --> 00:47:19.599 So 00:47:19.599 --> 00:47:23.859 wouldn't it be nice - so when we fit logistic regression or some other 00:47:23.859 --> 00:47:25.309 classifier to your training set, 00:47:25.309 --> 00:47:29.569 then wouldn't it be nice if, 00:47:29.569 --> 00:47:30.430 right, 00:47:30.430 --> 00:47:34.349 for all I 00:47:34.349 --> 00:47:38.739 such that Y is equal to one. 00:47:38.739 --> 00:47:39.699 We have 00:47:39.699 --> 00:47:41.279 theta 00:47:41.279 --> 00:47:45.009 transpose XI is much greater than zero, 00:47:45.009 --> 00:47:47.929 and for all I such that Y is equal 00:47:47.929 --> 00:47:49.869 to 00:47:49.869 --> 00:47:56.869 zero, 00:47:57.589 --> 00:47:59.910 we have theta transpose XI is 00:47:59.910 --> 00:48:00.970 much less than zero, 00:48:00.970 --> 00:48:05.659 okay? So wouldn't it be nice if this is true? That, 00:48:05.659 --> 00:48:06.380 00:48:06.380 --> 00:48:09.929 essentially, for our training set, we can find parameters theta 00:48:09.929 --> 00:48:12.749 so that our learning algorithm not only 00:48:12.749 --> 00:48:16.839 makes correct classifications on all the examples in a training set, but further it, sort 00:48:16.839 --> 00:48:19.880 of, is very confident about all of those correct 00:48:19.880 --> 00:48:22.299 classifications. This is 00:48:22.299 --> 00:48:26.859 the first intuition that I want you to have, and we'll come back to this first intuition 00:48:26.859 --> 00:48:28.889 in a second 00:48:28.889 --> 00:48:33.209 when we talk about functional margins, okay? 00:48:33.209 --> 00:48:40.209 We'll define this later. The second 00:48:43.539 --> 00:48:48.119 intuition that I want to 00:48:48.119 --> 00:48:55.119 convey, 00:48:56.769 --> 00:49:00.209 and it turns out for the rest of today's lecture I'm going to assume 00:49:00.209 --> 00:49:04.049 that a training set is linearly separable, okay?
So by that I mean for the rest of 00:49:04.049 --> 00:49:05.619 00:49:05.619 --> 00:49:08.690 today's lecture, I'm going to assume that 00:49:08.690 --> 00:49:10.900 there is indeed a straight line that can separate 00:49:10.900 --> 00:49:15.030 your training set, and we'll remove this assumption later, but 00:49:15.030 --> 00:49:17.089 just to develop the algorithm, let's start with a 00:49:17.089 --> 00:49:19.409 linearly separable training set. 00:49:19.409 --> 00:49:20.440 And so 00:49:20.440 --> 00:49:24.400 there's a sense that out of all the straight lines that separate the training 00:49:24.400 --> 00:49:28.569 set, you know, maybe that straight line isn't such a good one, 00:49:28.569 --> 00:49:33.139 and that one actually isn't such a great one either, but 00:49:33.139 --> 00:49:35.019 maybe that line in the 00:49:35.019 --> 00:49:38.869 middle is a much better linear separator than the others, right? 00:49:38.869 --> 00:49:41.179 And one reason that 00:49:41.179 --> 00:49:45.759 when you and I look at it this one seems best 00:49:45.759 --> 00:49:48.329 is because this line is just further from the data, all right? 00:49:48.329 --> 00:49:52.469 That is, it separates the data with a greater distance between your positive and your negative 00:49:52.469 --> 00:49:55.779 examples and the decision boundary, okay? 00:49:55.779 --> 00:49:59.390 And this second intuition, we'll come back to this shortly, 00:49:59.390 --> 00:50:02.829 about this final line 00:50:02.829 --> 00:50:06.099 that I drew being, maybe, the best line, 00:50:06.099 --> 00:50:09.999 this notion of distance from the training examples. This is the second intuition I want 00:50:09.999 --> 00:50:11.239 to convey, 00:50:11.239 --> 00:50:13.659 and we'll formalize it later when 00:50:13.659 --> 00:50:17.849 we talk about 00:50:17.849 --> 00:50:24.849 geometric margins of our classifiers, okay?
00:50:41.549 --> 00:50:44.889 So in order to describe support vector machines, unfortunately, I'm gonna 00:50:44.889 --> 00:50:48.239 have to pull a notation change, 00:50:48.239 --> 00:50:50.109 and, sort 00:50:50.109 --> 00:50:53.989 of, unfortunately, it, sort of, was impossible to do logistic regression, 00:50:53.989 --> 00:50:55.809 and support vector machines, 00:50:55.809 --> 00:51:01.029 and all the other algorithms using one completely consistent notation, 00:51:01.029 --> 00:51:04.180 and so I'm actually gonna change notations slightly 00:51:04.180 --> 00:51:07.999 for linear classifiers, and that will actually make it much easier for us - 00:51:07.999 --> 00:51:10.019 that'll make it much easier later today 00:51:10.019 --> 00:51:15.149 and in next week's lectures to actually talk about support vector machines. But the 00:51:15.149 --> 00:51:19.549 notation that I'm gonna use for the rest 00:51:19.549 --> 00:51:21.759 of today and for most of next week 00:51:21.759 --> 00:51:25.769 will be that my labels Y, 00:51:25.769 --> 00:51:30.219 instead of being zero, one, will be minus one and plus one, 00:51:30.219 --> 00:51:33.779 00:51:33.779 --> 00:51:36.759 and in the development of the support vector machine 00:51:36.759 --> 00:51:38.469 we will have H, 00:51:38.469 --> 00:51:43.939 the hypothesis, 00:51:43.939 --> 00:51:49.369 output values that 00:51:49.369 --> 00:51:53.179 are either plus one or minus one, 00:51:53.179 --> 00:51:54.619 and so 00:51:54.619 --> 00:51:58.039 we'll let G of Z be equal to 00:51:58.039 --> 00:52:00.349 one if Z is 00:52:00.349 --> 00:52:04.460 greater or equal to zero, and minus one otherwise, right? So just rather than zero and 00:52:04.460 --> 00:52:07.809 one, we change everything to plus one and minus one.
00:52:07.809 --> 00:52:10.149 And, finally, 00:52:10.149 --> 00:52:13.770 whereas previously I wrote H subscript 00:52:13.770 --> 00:52:16.779 theta of X equals 00:52:16.779 --> 00:52:18.429 G of theta transpose X 00:52:18.429 --> 00:52:20.880 and we had the convention 00:52:20.880 --> 00:52:23.409 that X zero is equal to one, 00:52:23.409 --> 00:52:25.789 right? And so X is in RN plus 00:52:25.789 --> 00:52:28.670 one. 00:52:28.670 --> 00:52:34.499 I'm gonna drop this convention of 00:52:34.499 --> 00:52:38.400 letting X zero equal one, and letting X be in RN plus one, and instead I'm 00:52:38.400 --> 00:52:44.049 going to parameterize my linear classifier as H subscript W, B of X 00:52:44.049 --> 00:52:46.159 equals G of 00:52:46.159 --> 00:52:50.629 W transpose X plus B, okay? 00:52:50.629 --> 00:52:51.349 And so 00:52:51.349 --> 00:52:53.359 B 00:52:53.359 --> 00:52:54.989 just now plays the role of 00:52:54.989 --> 00:52:56.289 theta zero, 00:52:56.289 --> 00:53:03.289 and W now plays the role of the rest of the parameters, theta one 00:53:04.179 --> 00:53:07.059 through theta N, okay? So just by separating out 00:53:07.059 --> 00:53:10.569 the intercept term B rather than lumping it together, it'll make it easier for 00:53:10.569 --> 00:53:11.230 us 00:53:11.230 --> 00:53:17.569 to develop support vector machines. 00:53:17.569 --> 00:53:24.569 So - yes. Student:[Off mic]. Instructor 00:53:27.889 --> 00:53:31.829 (Andrew Ng):Oh, yes. Right, yes. So W is - 00:53:31.829 --> 00:53:33.049 right. So W 00:53:33.049 --> 00:53:35.499 is a vector in RN, and 00:53:35.499 --> 00:53:37.069 X 00:53:37.069 --> 00:53:42.709 is now a vector in RN rather than RN plus one, 00:53:42.709 --> 00:53:49.709 and a lowercase b is a real number. Okay. 00:53:56.429 --> 00:54:02.650 Now, let's formalize the notion of functional margin and geometric margin. Let me make a 00:54:02.650 --> 00:54:03.660 definition.
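The notation change just described can be written out as a tiny sketch (not from the lecture; the parameter values are hypothetical): labels become plus one and minus one, G thresholds at zero, B plays the old role of theta zero, and X no longer carries an X0 = 1 entry.

```python
def g(z):
    """Threshold: +1 if z >= 0, -1 otherwise (replacing the 0/1 convention)."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """h_{w,b}(x) = g(w^T x + b): b plays the old role of theta_0, and
    w plays the role of theta_1 through theta_n; x is in R^n, with no
    x0 = 1 intercept entry."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [2.0, -1.0], -0.5             # hypothetical parameters
assert h(w, b, [1.0, 0.0]) == 1      # w^T x + b = 2.0 - 0.5 = 1.5 >= 0
assert h(w, b, [0.0, 2.0]) == -1     # w^T x + b = -2.0 - 0.5 = -2.5 < 0
```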
I'm going to say that the functional margin of the hyperplane W, B with respect to a specific training example XI, YI - WRT stands for with respect to - the functional margin of a hyperplane W, B with respect to a certain training example XI, YI is defined as gamma hat I equals YI times W transpose XI plus B, okay? And so a set of parameters W, B defines a classifier - it, sort of, defines a linear separating boundary, and so when I say hyperplane, I just mean the decision boundary that's defined by the parameters W, B. You know what, if you're confused by the hyperplane term, just ignore it. The functional margin of a classifier with parameters W, B with respect to a training example is given by this formula, okay? And the interpretation of this is that if YI is equal to one, then for this to have a large functional margin, you want W transpose XI plus B to be large, right?
And if YI is equal to minus one, then in order for the functional margin to be large - we, sort of, want the functional margins to be large, but in order for the functional margin to be large, if YI is equal to minus one, then the only way for this to be big is if W transpose XI plus B is much less than zero, okay? So this captures the intuition that we had earlier about functional margins - the intuition we had earlier that if YI is equal to one, we want this to be big, and if YI is equal to minus one, we want this to be small, and this, sort of, collapses the two cases into one statement: that we'd like the functional margin to be large. And notice also that so long as YI times W transpose XI plus B - so long as this is greater than zero, that means we classified it correctly, okay? And one more definition: I'm going to say that the functional margin of a hyperplane with respect to an entire training set - I'm going to define gamma hat to be equal to the min over all your training examples of gamma hat I, right?
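To make the two definitions concrete, here is a small Python sketch (again my own illustration, with made-up toy numbers, not anything from the lecture):

```python
import numpy as np

def functional_margin(w, b, x, y):
    # gamma_hat_i = y_i * (w'x_i + b); this is positive exactly when
    # the example (x_i, y_i) is classified correctly.
    return y * (np.dot(w, x) + b)

def functional_margin_of_set(w, b, X, Y):
    # gamma_hat = min over i of gamma_hat_i: the worst case over the training set.
    return min(functional_margin(w, b, x, y) for x, y in zip(X, Y))
```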
So if you have a training set, if you have more than one training example, I'm going to define the functional margin with respect to the entire training set as the worst case of all of your functional margins over the entire training set. And so for now we should think of the functional margin as capturing the intuition that we would like the margin to be large, and for our purposes, for now, let's just say we would like the worst-case functional margin to be large, okay? And we'll change this a little bit later as well. Now, it turns out that there's one little problem with this intuition that will, sort of, bite us later, which is that it actually turns out to be very easy to make the functional margin large, all right? So, for example, say I have a classifier with parameters W and B. If I take W and multiply it by two and take B and multiply it by two, then if you refer to the definition of the functional margin - I guess that was, what, gamma I, gamma hat I equals YI times W transpose XI plus B - if I double W and B, then I can easily double my functional margin.
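The scaling problem just described is easy to check numerically; this is just a sanity check with made-up numbers, not anything from the lecture:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 1.0])
y = 1.0

gamma_hat = y * (np.dot(w, x) + b)
# Doubling both w and b doubles the functional margin, even though the
# decision boundary, and hence every prediction, is completely unchanged.
gamma_hat_doubled = y * (np.dot(2 * w, x) + 2 * b)
```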
So this goal of making the functional margin large, in and of itself, isn't so useful because it's easy to make the functional margin arbitrarily large just by scaling up the parameters. And so maybe one thing we need to do later is add a normalization condition. For example, maybe we want to add a normalization condition that the norm, the two-norm, of the parameter W is equal to one, and we'll come back to this in a second. All right. And then so - okay. Now, let's talk about - let's see how much time we have, 15 minutes. Well, see, I'm trying to decide how much to try to do in the last 15 minutes. Okay. So let's talk about the geometric margin, and so the geometric margin of a training example - [inaudible], right? So the decision boundary of my classifier is going to be given by the plane W transpose X plus B is equal to zero, okay? Right, and these are the X1, X2 axes, say, and we're going to draw relatively few training examples here. Let's say I'm drawing deliberately few training examples so that I can add things to this, okay?
And so, assuming we classified an example correctly, I'm going to define the geometric margin as just the geometric distance between a point - between the training example XI, YI - and this separating line, given by this separating hyperplane, okay? That's what I'm going to define the geometric margin to be. And so I'm gonna do some algebra fairly quickly. In case it doesn't make sense, read through the lecture notes more carefully for the details. Sort of, by standard geometry, the normal, or in other words, the vector that's 90 degrees to the separating hyperplane, is going to be given by W divided by the norm of W; that's just how planes in high dimensions work. If some of this stuff isn't familiar, take a look at the lecture notes on the website. And so let's say this distance is gamma I, okay? And so I'm going to use the convention that I'll put a hat on top where I'm referring to functional margins, and no hat on top for geometric margins.
So let's say the geometric margin of this example is gamma I. That means that this point here, right, is going to be XI minus gamma I times W over the norm of W, okay? Because W over the norm of W is the unit vector, the length-one vector, that is normal to the separating hyperplane, and so when we subtract gamma I times the unit vector from this point XI - this point here is XI, so XI minus, you know, this little vector here, is going to be this point that I've drawn as a heavy circle, okay? So this heavy point here is XI minus this vector, and this vector is gamma I times W over the norm of W, okay? And so because this heavy point is on the separating hyperplane, right, this point must satisfy W transpose times that point equals zero, right? Because all points X on the separating hyperplane satisfy the equation W transpose X plus B equals zero, and so this point is on the separating hyperplane, therefore it must satisfy W transpose this point - oh, excuse me.
Plus B is equal to zero, okay? Raise your hand if this makes sense so far? Oh, okay. Cool, most of you, but, again, I'm, sort of, being slightly fast with this geometry. So if you're not quite sure why this is a normal vector, or how I subtracted this, or whatever, take a look at the details in the lecture notes. And so what I'm going to do is I'll just take this equation, and I'll solve for gamma, right? So this equation I just wrote down - solve this equation for gamma, or gamma I, and you find that - you saw the previous equation for gamma I - well, why don't I just do it? You have W transpose XI plus B equals gamma I times W transpose W over the norm of W; that's just equal to gamma I times the norm of W, because W transpose W is the norm of W squared, and, therefore, gamma I is just W over the norm of W, transpose XI, plus B over the norm of W, okay?
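This derivation can be checked numerically. In the sketch below (assumed toy values, my own illustration), stepping back by gamma times the unit normal from a point lands exactly on the hyperplane:

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
x = np.array([3.0, 3.0])   # a correctly classified positive example

# gamma_i = (w/||w||)'x_i + b/||w||, the distance just derived
gamma = (np.dot(w, x) + b) / np.linalg.norm(w)

# The foot of the perpendicular, x - gamma * w/||w||, lies on w'x + b = 0.
foot = x - gamma * w / np.linalg.norm(w)
```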
And, in other words, this little calculation just showed us that if you have a training example XI, then the distance between XI and the separating hyperplane defined by the parameters W and B can be computed by this formula, okay? So the last thing I want to do is actually take into account the sign of the - the correct classification of the training example. So I've been assuming that we've been classifying an example correctly. So, more generally, I'll define the geometric margin of a training example to be gamma I equals YI times that thing on top, okay? And so this is very similar to the functional margin, except for the normalization by the norm of W, and so as before, you know, this says that we would like the geometric margin to be large, and all that means is that, so long as we're classifying the example correctly, we would ideally hope for the example to be as far as possible from the separating hyperplane, so long as it's on the right side of the separating hyperplane, and that's what multiplying YI into this does.
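Putting in the label, the geometric margin of a single example can be sketched as follows (again my own illustration, with made-up numbers). Note that, unlike the functional margin, it does not change when W and B are rescaled:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    # gamma_i = y_i * ((w/||w||)'x_i + b/||w||),
    # i.e. the functional margin divided by the norm of w.
    norm_w = np.linalg.norm(w)
    return y * (np.dot(w, x) + b) / norm_w
```

For instance, replacing (w, b) by (10w, 10b) leaves the geometric margin unchanged, which is exactly the scale invariance discussed below.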
And so a couple of easy facts: one is, if the norm of W is equal to one, then the functional margin is equal to the geometric margin, and you see that quite easily; and, more generally, the geometric margin is just equal to the functional margin divided by the norm of W, okay? Let's see, okay. And so one final definition: so far I've defined the geometric margin with respect to a single training example, and so, as before, I'll define the geometric margin with respect to an entire training set as gamma equals min over I of gamma I, all right? And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters W and B so as to maximize the geometric margin, and so let me just write that down. The maximum margin classifier poses the following optimization problem. It says choose gamma, W, and B so as to maximize the geometric margin, subject to that YI times - well, this is just one way to write it, subject to - actually, do I write it like that? Yeah, fine.
There are several ways to write this, and one of the things we'll do next time is actually - I'm trying to figure out if I can do this in five minutes. I'm guessing this could be difficult. Well, so this maximum margin classifier is a maximization problem over the parameters gamma, W, and B, and for now, it turns out that the geometric margin doesn't change depending on the norm of W, right? Because in the definition of the geometric margin, notice that we're dividing by the norm of W anyway. So you can actually set the norm of W to be anything you want, and you can multiply W and B by any constant; it doesn't change the geometric margin. This will actually be important, and we'll come back to this later. Notice that you can take the parameters W, B, and you can impose any normalization constraint on them, or you can change W and B by any scaling factor and replace them by ten W and ten B, whatever, and it does not change the geometric margin, okay?
And so in this first formulation, I'm just gonna impose a constraint and say that the norm of W is one, and so the functional and the geometric margins will be the same, and then we'll say: maximize the geometric margin - you maximize gamma subject to the constraint that every training example must have functional margin at least gamma, and this is a geometric margin because, when the norm of W is equal to one, the functional and the geometric margins are identical, okay? So this is the maximum margin classifier, and it turns out that if you do this, it'll run, you know, maybe about as well as - maybe comparable to logistic regression, but it turns out that as we develop this algorithm further, there will be a clever way to allow us to change this algorithm to let it work in infinite-dimensional feature spaces and come up with very efficient non-linear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step. So are there questions about this? Yeah. Student:[Off mic].
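Written out (in the standard notation for this formulation, with m denoting the number of training examples), this first maximization problem is:

```latex
\begin{aligned}
\max_{\gamma,\, w,\, b}\quad & \gamma \\
\text{s.t.}\quad & y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \gamma, \quad i = 1, \dots, m, \\
& \lVert w \rVert = 1.
\end{aligned}
```

The constraint ||w|| = 1 is what makes each functional margin on the left-hand side equal to a geometric margin.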
Instructor (Andrew Ng): For now, let's just say you're given a fixed training set, and you can't - yeah, for now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with, right? So everything I've said is for a fixed training set, so you can't change the X's, and you can't change the Y's. Are there other questions? Okay. So, all right. Next week we will take this, and we'll talk about optimization algorithms, and work our way towards turning this into one of the most effective off-the-shelf learning algorithms, and just a final reminder again: the next discussion section will be on Matlab and Octave. So show up for that if you want to see a tutorial. Okay. See you guys in the next class.