Lecture 8 | Machine Learning (Stanford)
-
0:09 - 0:10
-
0:10 - 0:14This presentation is delivered by the Stanford Center for Professional
-
0:14 - 0:21Development. Welcome back. If you haven't given me the homework yet,
-
0:27 - 0:28you can just give it to me at the end of class. That's fine. Let's see. And also just a quick reminder: I've actually seen project proposals start to trickle in already, which is great. As a reminder, project proposals are due this Friday, and if any of you want to meet and chat more about project ideas, I also have office hours immediately after lecture today. Are there any questions about any of that before I get started today? Great. Okay. Welcome back.
-
0:28 - 0:32What I want to do today is wrap up our discussion on support vector
-
0:32 - 0:34machines
-
0:34 - 0:37and in particular we'll also talk about the idea of kernels
-
0:37 - 0:39and then talk about
-
0:39 - 0:42[inaudible]
-
0:42 - 0:46and then I'll talk about
-
0:46 - 0:50the SMO algorithm, which is an algorithm for solving the
-
0:50 - 0:54optimization problem that we posed last time.
-
0:54 - 0:56
-
0:56 - 0:58To recap, we
-
0:58 - 1:01wrote down
-
1:01 - 1:08the following convex optimization problem.
-
1:12 - 1:16All this is assuming that the data is linearly separable, which is an assumption that I'll
-
1:16 - 1:18fix later,
-
1:18 - 1:20
-
1:20 - 1:23
-
1:23 - 1:25
-
1:25 - 1:30and so with this optimization problem, given a training set,
-
1:30 - 1:35this will find
-
1:35 - 1:37the optimal margin
-
1:37 - 1:41classifier for the data set that maximizes this
-
1:41 - 1:45geometric margin from your training
-
1:45 - 1:47examples.
-
1:47 - 1:52And so in the previous lecture, we also derived the dual of this problem,
-
1:52 - 1:54which was to maximize
-
1:54 - 2:01
-
2:02 - 2:09
-
2:18 - 2:22this. And this is the dual of our
-
2:22 - 2:24primal [inaudible] optimization problem.
-
2:24 - 2:28Here, I'm using these angle brackets to denote inner product,
-
2:28 - 2:29
-
2:29 - 2:31so
-
2:31 - 2:33this is just XI transpose XJ
-
2:33 - 2:39for vectors XI
-
2:39 - 2:41and XJ. We also worked out the
-
2:41 - 2:47way that W would be given by sum over I alpha I
-
2:47 - 2:51YI XI. Therefore, when you need to make a prediction at classification
-
2:51 - 2:52time, you
-
2:52 - 2:55need to compute the
-
2:55 - 3:01value of the hypothesis applied to an input X, which is G of W
-
3:01 - 3:06transpose X plus B where G is that threshold function that outputs plus one and minus one.
-
3:06 - 3:09And so this is G of
-
3:09 - 3:10sum over I alpha
-
3:10 - 3:12
-
3:12 - 3:19
-
3:20 - 3:22I. So that can also be written
-
3:22 - 3:24in terms of inner
-
3:24 - 3:26products between input
-
3:26 - 3:31vectors X. So what I
-
3:31 - 3:35want to do is now talk about the idea of kernels, which will make use of this
-
3:35 - 3:37property because it turns out
-
3:37 - 3:42that the only dependence of the algorithm on X
-
3:42 - 3:44is through these inner products.
-
3:44 - 3:47In fact, you can write the entire algorithm
-
3:47 - 3:50without ever explicitly referring to
-
3:50 - 3:52an X vector; you only need the inner products
-
3:52 - 3:54
-
3:54 - 3:58between input feature vectors.
-
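For reference, the dual problem and the prediction rule being recapped here, written out in what is presumably the standard form from the previous lecture (the board work itself is not captured in the text):

    \max_\alpha \; W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
    \text{subject to } \alpha_i \ge 0, \qquad \sum_{i=1}^m \alpha_i y^{(i)} = 0

    w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}, \qquad h_{w,b}(x) = g(w^T x + b) = g\Big( \sum_{i=1}^m \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \Big)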
3:58 - 4:01And the idea of a kernel is as follows:
-
4:01 - 4:07let's say that you have an input attribute.
-
4:07 - 4:11Let's just say for now it's a real number. Maybe this is the
-
4:11 - 4:13living area of a house
-
4:13 - 4:20that you're trying to make a prediction on, like whether it will be sold in the next six months.
-
4:20 - 4:21Quite often, we'll take this
-
4:21 - 4:24feature X and we'll map it
-
4:24 - 4:29to a richer set of features. So for example, we will take X and map it
-
4:29 - 4:36to these four polynomial features, and let me call this mapping Phi. So we'll
-
4:37 - 4:39let Phi of X denote
-
4:39 - 4:41the mapping from your original features to
-
4:41 - 4:45some higher dimensional set of features.
-
4:45 - 4:49So if you do this and you want to use the features Phi of X,
-
4:49 - 4:50then
-
4:50 - 4:54all you need to do is go back to the learning algorithm
-
4:54 - 4:57and everywhere you see
-
4:57 - 4:59XI, XJ, we'll replace it
-
4:59 - 5:00with
-
5:00 - 5:05the inner product between Phi of XI and
-
5:05 - 5:08Phi
-
5:08 - 5:11of XJ. So this corresponds to running a support vector machine
-
5:11 - 5:14with the features given by Phi of X rather than with your original
-
5:14 - 5:19one-dimensional input feature X.
-
5:19 - 5:22And in a scenario that I want to consider,
-
5:22 - 5:23sometimes Phi of X
-
5:23 - 5:29will be very high dimensional,
-
5:29 - 5:34and in fact sometimes, for example, Phi of X may contain
-
5:34 - 5:37very high degree polynomial features.
-
5:37 - 5:42Sometimes Phi of X will actually even be an infinite dimensional vector of features,
-
5:42 - 5:45and the question is, if Phi of X is extremely high dimensional,
-
5:45 - 5:49then you can't actually compute these inner products very
-
5:49 - 5:51efficiently, it seems, because computers need to
-
5:51 - 5:55represent an extremely high dimensional feature vector and then
-
5:55 - 5:57take
-
5:57 - 6:00the inner product, which seems inefficient. It
-
6:00 - 6:03turns out that
-
6:03 - 6:08in many important special cases, we can write down what we'll call the kernel function,
-
6:08 - 6:10denoted by K,
-
6:10 - 6:12which will be
-
6:12 - 6:18this,
-
6:18 - 6:21which would be inner product between those feature
-
6:21 - 6:24vectors. It turns out there will be important special cases where
-
6:24 - 6:29computing Phi of X is computationally very expensive, maybe even impossible.
-
6:29 - 6:30Maybe it's an infinite dimensional vector,
-
6:30 - 6:33and you can't compute infinite dimensional vectors.
-
6:33 - 6:37There will be important special cases where Phi of X is very expensive to represent because it
-
6:37 - 6:39is so high dimensional,
-
6:39 - 6:42but nonetheless, you can actually compute a kernel between
-
6:42 - 6:46XI and XJ. You can compute the inner product between these two vectors
-
6:46 - 6:49very inexpensively.
-
6:49 - 6:52And so the idea of the support vector machine is that
-
6:52 - 6:57everywhere in the algorithm that you see these inner products,
-
6:57 - 6:58we're going to replace it
-
6:58 - 7:02with a kernel function that you can compute efficiently,
-
7:02 - 7:04and that lets you work in feature
-
7:04 - 7:08spaces Phi of X even if Phi of X
-
7:08 - 7:12are very high dimensional. Let
-
7:12 - 7:16me now say how that's done. A
-
7:16 - 7:17
-
7:17 - 7:21little bit later today, we'll actually see some concrete examples of Phi of X and
-
7:21 - 7:26of kernels. For now, let's just think about constructing kernels explicitly.
-
7:26 - 7:29This is best illustrated with an example.
-
7:29 - 7:33
-
7:33 - 7:35Let's say you have
-
7:35 - 7:40two inputs, X and Z. Normally I should write those as XI and XJ, but I'm just going to
-
7:40 - 7:46write X and Z to save on writing.
-
7:46 - 7:50Let's say my kernel is K of X, Z
-
7:50 - 7:51equals
-
7:51 - 7:55X transpose Z squared.
-
7:55 - 8:02And so this is
-
8:09 - 8:13X transpose Z times X transpose Z, right? This thing here is X
-
8:13 - 8:15transpose Z and this thing is X transpose Z, so this is X
-
8:15 - 8:17transpose Z squared.
-
8:17 - 8:18
-
8:18 - 8:25And that's equal
-
8:29 - 8:33to that.
-
8:33 - 8:36And so this kernel corresponds
-
8:36 - 8:38to the feature mapping where Phi
-
8:38 - 8:41of X is equal to the following, and I'll write
-
8:41 - 8:48this down for the case of N equals three, I guess.
-
9:03 - 9:07And so with this definition of Phi of X,
-
9:07 - 9:10you can verify for yourself that this thing
-
9:10 - 9:12becomes the
-
9:12 - 9:18inner product between Phi of X
-
9:18 - 9:22and Phi of Z, because to get an inner product between two vectors
-
9:22 - 9:25, you take the corresponding elements of the two vectors,
-
9:25 - 9:27multiply them, and add them up. So
-
9:27 - 9:29if this is Phi of X,
-
9:29 - 9:32then the inner product between Phi of X and Phi of Z will be
-
9:32 - 9:34the sum
-
9:34 - 9:35over all the
-
9:35 - 9:40elements of this vector times the corresponding elements of Phi of Z,
-
9:40 - 9:45and what you get is this one.
-
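As a quick numerical check of this claim, here is a minimal sketch in Python (using NumPy; the names phi and kernel are just for illustration):

    import numpy as np

    def phi(x):
        # explicit feature map for K(x, z) = (x^T z)^2: all n^2 products x_i * x_j
        return np.outer(x, x).ravel()

    def kernel(x, z):
        # the same quantity computed directly from the n-dimensional inputs
        return float(np.dot(x, z)) ** 2

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])
    print(np.dot(phi(x), phi(z)))   # inner product of the two n^2-dimensional feature vectors
    print(kernel(x, z))             # identical value, computed without ever forming phi

Forming phi takes on the order of n squared work, while the kernel call only needs one n-dimensional inner product, which is the point made next.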
9:45 - 9:48And so the cool thing about this is that
-
9:48 - 9:53in order to compute Phi of X, you
-
9:53 - 9:56
-
9:56 - 10:03need order N squared time just to compute
-
10:03 - 10:05Phi of X. If N
-
10:05 - 10:09is the dimension of X and Z, then
-
10:09 - 10:14Phi of X is a vector of all pairs of XI XJ multiplied with
-
10:14 - 10:15each other,
-
10:15 - 10:19and so the length of Phi of X is N squared. You need order N
-
10:19 - 10:24squared time just to compute Phi of X. But
-
10:24 - 10:31to compute K
-
10:37 - 10:39" to compute the kernel function,
-
10:39 - 10:41all you need is order N time,
-
10:41 - 10:43because the
-
10:43 - 10:47kernel function is defined as X transpose Z squared, so you just take the inner product
-
10:47 - 10:51between X and Z, which is order N time and you square that
-
10:51 - 10:52and you've computed this
-
10:52 - 10:54kernel function,
-
10:54 - 10:58and so you just computed the inner product between two vectors where each vector
-
10:58 - 11:00has N squared elements, but
-
11:00 - 11:07you did it in order N time. Student:For any kernel you find
-
11:14 - 11:21for X and Z,
-
11:21 - 11:24does Phi exist for X and Z? Instructor (Andrew
-
11:24 - 11:31Ng):Let me talk about that later. We'll talk about what is a valid kernel later. Please raise your hand if this makes sense.
-
11:32 - 11:34
-
11:34 - 11:35So
-
11:35 - 11:39let me just describe a couple of quick generalizations to
-
11:39 - 11:42this.
-
11:42 - 11:47One is that if
-
11:47 - 11:54you define KXZ to be equal to
-
11:55 - 12:00X transpose Z plus C squared, so again, you can compute this kernel
-
12:00 - 12:03in order N time,
-
12:03 - 12:07then that turns out to correspond to a feature vector
-
12:07 - 12:08where I'm
-
12:08 - 12:11just going to add a few more elements at the bottom
-
12:11 - 12:18where you add root 2.
-
12:24 - 12:28Let me read that. That was root 2 CX1 root 2 CX2 root 2 CX3
-
12:28 - 12:30and C.
-
12:30 - 12:31And so
-
12:31 - 12:35this is a way of creating a feature vector with both the monomials,
-
12:35 - 12:37meaning the first order terms,
-
12:37 - 12:41as well as the quadratic or the inner product
-
12:41 - 12:43terms between XI and XJ,
-
12:43 - 12:45and the parameter C here
-
12:45 - 12:49allows you to control the relative weighting between the monomial
-
12:49 - 12:51terms, so the first order terms,
-
12:51 - 12:54and the quadratic
-
12:54 - 12:57terms. Again, this is still an inner
-
12:57 - 13:01product between vectors of length N squared, computed
-
13:01 - 13:03in order N time.
-
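Assembling the terms being read off the board, for N equals 3 this kernel corresponds to the feature map

    \phi(x) = \big(x_1x_1,\; x_1x_2,\; x_1x_3,\; x_2x_1,\; x_2x_2,\; x_2x_3,\; x_3x_1,\; x_3x_2,\; x_3x_3,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; \sqrt{2c}\,x_3,\; c\big)

and indeed \langle \phi(x), \phi(z) \rangle = (x^T z)^2 + 2c\,(x^T z) + c^2 = (x^T z + c)^2.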
13:03 - 13:05
-
13:05 - 13:12More generally, here are some other examples of kernels.
-
13:14 - 13:18Actually, a
-
13:18 - 13:25generalization of the one I just derived right now would be the following kernel.
-
13:25 - 13:31And so this corresponds to using all N plus D choose D
-
13:31 - 13:34features
-
13:34 - 13:39of all
-
13:39 - 13:43monomials. Monomials just mean the products of XI XJ XK. Just
-
13:43 - 13:47all the polynomial terms up
-
13:47 - 13:52to degree D
-
13:52 - 13:56which is on the order of
-
13:56 - 13:59N plus D to the power of D, so this grows exponentially in D.
-
13:59 - 14:01This is a very high dimensional feature vector,
-
14:01 - 14:05but again, you can implicitly construct the feature vector and take
-
14:05 - 14:07inner products between them.
-
14:07 - 14:11It's very computationally efficient, because you just compute the inner product
-
14:11 - 14:15between X and Z, add C and you take that real number to the power of D
-
14:15 - 14:19and by plugging this in as a kernel, you're implicitly working in an extremely
-
14:19 - 14:26high dimensional feature space.
-
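In symbols, the kernel described here is the polynomial kernel

    K(x, z) = (x^T z + c)^d,

which corresponds to a feature space containing all monomials of the x_i up to degree d, a space of dimension \binom{n+d}{d}, even though evaluating K itself takes only order n time.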
14:26 - 14:28So what
-
14:28 - 14:35I've given is just a few specific examples of how to create kernels. I want
-
14:36 - 14:40to go over just a few specific examples of kernels. So let me ask
-
14:40 - 14:44you more generally if you're faced with a new machine-learning problem,
-
14:44 - 14:46how do you come up with a kernel? There
-
14:46 - 14:50are many ways to think about it, but here's one intuition that's sort of
-
14:50 - 14:51useful.
-
14:51 - 14:52
-
14:52 - 14:54So given a set of attributes of X, you're
-
14:54 - 14:58going to use a feature vector of Phi of X
-
14:58 - 15:00and given a set
-
15:00 - 15:04of attributes Z, you're going to use an input feature vector Phi of Z,
-
15:04 - 15:07and so the kernel is computing
-
15:07 - 15:12the inner product between Phi of X and Phi of Z.
-
15:12 - 15:13And so
-
15:13 - 15:15one intuition "
-
15:15 - 15:17this is a partial intuition. This
-
15:17 - 15:21isn't a rigorous intuition. It
-
15:21 - 15:22is that
-
15:22 - 15:24if X and Z are very similar,
-
15:24 - 15:28then Phi of X and Phi of Z will be pointing in the same direction,
-
15:28 - 15:33and therefore the inner product would be large.
-
15:33 - 15:34Whereas in contrast, if
-
15:34 - 15:37X and Z are very dissimilar,
-
15:37 - 15:40then Phi of X and Phi of Z may be pointing different directions,
-
15:40 - 15:42and so the inner product may be small.
-
15:42 - 15:47That intuition is not a rigorous one, but it's sort of a useful one to think about.
-
15:47 - 15:49
-
15:49 - 15:53If you're faced with a new learning problem " if I give you some random
-
15:53 - 15:55thing to classify
-
15:55 - 15:56and you want to decide how to come up with
-
15:56 - 15:58a kernel,
-
15:58 - 15:59one way is
-
15:59 - 16:03to try to come up with the function
-
16:03 - 16:04K of X, Z that
-
16:04 - 16:06is large,
-
16:06 - 16:11if you want the algorithm to think of X and Z as similar
-
16:11 - 16:11and
-
16:11 - 16:15small if you want it to think of them as dissimilar.
-
16:15 - 16:18
-
16:18 - 16:19
-
16:19 - 16:24Again, this isn't always true, but this is one of several intuitions.
-
16:24 - 16:28So if you're trying to classify some brand new thing, say you're trying to
-
16:28 - 16:32classify [inaudible] or DNA sequences or something,
-
16:32 - 16:33
-
16:33 - 16:37some strange thing you want to classify, one thing you could do is try to come up with a kernel that's large when you want the algorithm to think
-
16:37 - 16:39these are similar things
-
16:39 - 16:42and small when you want it to think these are dissimilar.
-
16:42 - 16:44And so this
-
16:44 - 16:47answers the question of let's
-
16:47 - 16:50say I have something I want to classify, and let's say I write down
-
16:50 - 16:53the function
-
16:53 - 16:58that I think is a good measure of how similar or dissimilar X and Z
-
16:58 - 16:59are for my specific problem.
-
16:59 - 17:05Let's say I write down K of XZ equals E to the
-
17:05 - 17:08minus.
-
17:08 - 17:13Let's say I write down this function,
-
17:13 - 17:19and I think this is a good measure of how similar X and Z are.
-
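The function being written on the board here is the Gaussian kernel, identified as such later in the lecture, presumably in its standard form

    K(x, z) = \exp\!\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right),

which is close to one when x and z are near each other and close to zero when they are far apart.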
17:19 - 17:26The question, then, is: is this really a valid
-
17:28 - 17:33kernel? In other words, to understand how we can construct kernels: if I write down a
-
17:33 - 17:34function like that,
-
17:34 - 17:38the question is does there really exist some Phi
-
17:38 - 17:40such that
-
17:40 - 17:41KXZ
-
17:41 - 17:42is equal to
-
17:42 - 17:48the inner product?
-
17:48 - 17:50And that's the
-
17:50 - 17:56question of is K a valid kernel.
-
17:56 - 18:00It turns out that
-
18:00 - 18:04there is a result that characterizes necessary and sufficient
-
18:04 - 18:05conditions for
-
18:05 - 18:08when a function K that you might choose
-
18:08 - 18:10is a valid kernel.
-
18:10 - 18:17I should go ahead and show part of that result now.
-
18:17 - 18:24Suppose K is a valid kernel,
-
18:24 - 18:27and when I say K is a kernel, what I mean is there does indeed exist some
-
18:27 - 18:32function Phi for which this holds true.
-
18:32 - 18:35Then
-
18:35 - 18:39let any set of points
-
18:39 - 18:46X1 up to XM be given. Let
-
18:48 - 18:53me define a matrix K.
-
18:53 - 18:54
-
18:54 - 18:58I apologize for overloading notation. K I'm going
-
18:58 - 19:00to use to denote both the
-
19:00 - 19:03kernel function, which is the function of X and Z
-
19:03 - 19:07as well as a matrix. Unfortunately,
-
19:07 - 19:10there aren't enough letters in the alphabet.
-
19:10 - 19:13Well, that's not true. We're going to define
-
19:13 - 19:15the kernel matrix to
-
19:15 - 19:20be an M-by-M matrix such that K subscript IJ is equal to the
-
19:20 - 19:24kernel function
-
19:24 - 19:31applied to two of my examples.
-
19:31 - 19:38Then it
-
19:39 - 19:45turns out that for any M-dimensional vector Z, I want
-
19:45 - 19:50you to consider Z transpose
-
19:50 - 19:57
-
20:02 - 20:04KZ.
-
20:04 - 20:06
-
20:06 - 20:08By
-
20:08 - 20:13definition of matrix multiplication, this is
-
20:13 - 20:15that,
-
20:15 - 20:18and so
-
20:18 - 20:23KIJ is the kernel function applied to XI and XJ, so that must be equal to this. I assumed that K is
-
20:23 - 20:28a valid kernel function,
-
20:28 - 20:31and so there must exist such a
-
20:31 - 20:33function
-
20:33 - 20:37Phi.
-
20:37 - 20:40This is the inner product between two feature vectors, so let me just make that inner product
-
20:40 - 20:43explicit. I'm going
-
20:43 - 20:46to sum over the elements of this vector,
-
20:46 - 20:50and I'm going to use Phi of XI subscript K just to denote the Kth
-
20:50 - 20:57element of this vector.
-
20:58 - 21:03Just
-
21:03 - 21:10rearrange sums. You get
-
21:25 - 21:30sum over K. This next step may look familiar to some of you,
-
21:30 - 21:37which is just this,
-
21:47 - 21:48right? Therefore,
-
21:48 - 21:52this is the sum of squares
-
21:52 - 21:57and it must therefore be greater than or equal to zero.
-
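Written out, the chain of steps being checked here is

    z^T K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i z_j\, \phi(x^{(i)})^T \phi(x^{(j)}) = \sum_i \sum_j z_i z_j \sum_k \phi_k(x^{(i)})\, \phi_k(x^{(j)}) = \sum_k \Big( \sum_i z_i\, \phi_k(x^{(i)}) \Big)^2 \ge 0,

where \phi_k(x^{(i)}) denotes the Kth element of \phi(x^{(i)}).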
21:57 - 22:00Do you want to take a minute to look for all the steps and just make
-
22:00 - 22:07sure you buy
-
22:09 - 22:11them all? Oh, this is
-
22:11 - 22:15the inner product between the vector of Phi of XI and Phi
-
22:15 - 22:17of XJ,
-
22:17 - 22:19so the
-
22:19 - 22:21inner product between two vectors
-
22:21 - 22:28is the sum over all the elements of the vectors of the corresponding element.
-
22:28 - 22:32Student:[Inaudible]. Instructor (Andrew Ng):Oh, yes it is. This is just A
-
22:32 - 22:37transpose B equals sum over K, AK BK, so that's
-
22:37 - 22:39just this. This is the sum over K of the
-
22:39 - 22:44Kth elements of this vector.
-
22:44 - 22:46Take a look at this and make sure
-
22:46 - 22:50it makes sense. Questions about
-
22:50 - 22:57this? So just to
-
23:00 - 23:00summarize,
-
23:00 - 23:03what we showed was that
-
23:03 - 23:05for any vector Z,
-
23:05 - 23:07Z transpose KZ
-
23:07 - 23:10is greater than or equal to zero,
-
23:10 - 23:14and this is one of the standard definitions of
-
23:14 - 23:18the matrix K being positive semidefinite;
-
23:18 - 23:24when a matrix K is positive semidefinite, we write K greater than or equal to
-
23:24 - 23:28zero. Just to summarize, what was shown is that if
-
23:28 - 23:29K
-
23:29 - 23:31is a
-
23:31 - 23:34valid kernel " in other words, if K is a function
-
23:34 - 23:37for which there exists
-
23:37 - 23:38some Phi such that
-
23:38 - 23:40K of XI
-
23:40 - 23:44XJ is the inner product between Phi of XI and Phi of XJ.
-
23:44 - 23:48So if K is a valid kernel, we showed, then, that the kernel matrix
-
23:48 - 23:51
-
23:51 - 23:55must be positive semidefinite.
-
23:55 - 24:00
-
24:00 - 24:02It turns out that the
-
24:02 - 24:05converse is also true,
-
24:05 - 24:09and so this gives you a test for
-
24:09 - 24:13whether a function K is a valid kernel.
-
24:13 - 24:15So this is
-
24:15 - 24:18a theorem due to Mercer, and so kernels are also sometimes called Mercer
-
24:18 - 24:25kernels. It means the same thing. It just means it's a valid kernel. Let
-
24:26 - 24:31K of XZ be given.
-
24:31 - 24:34Then K
-
24:34 - 24:37is a valid
-
24:37 - 24:43kernel "
-
24:43 - 24:45in other
-
24:45 - 24:48words, it's a Mercer kernel, i.e., there exists a Phi
-
24:48 - 24:50such that
-
24:50 - 24:52KXZ
-
24:52 - 24:54
-
24:54 - 24:59
-
24:59 - 25:01equals Phi of X
-
25:01 - 25:05transpose
-
25:05 - 25:11Phi of Z " if and only if for any set of M examples,
-
25:11 - 25:15and this really means for any set of M points. It's not necessarily a training set.
-
25:15 - 25:21It's just any set of M points you may choose. It
-
25:21 - 25:28holds true that the kernel
-
25:29 - 25:30matrix,
-
25:30 - 25:34capital K that I defined just now,
-
25:34 - 25:37is
-
25:37 - 25:38
-
25:38 - 25:45symmetric positive semidefinite.
-
25:49 - 25:54And so I proved only one direction of this result. I proved that if it's a
-
25:54 - 25:55valid kernel, then K
-
25:55 - 25:57is symmetric
-
25:57 - 26:01positive semidefinite, but the converse I didn't show. It turns out that this is
-
26:01 - 26:04a necessary and sufficient condition. And
-
26:04 - 26:06so this gives you a useful test for
-
26:06 - 26:09whether any function that you might want to choose
-
26:09 - 26:16is a
-
26:23 - 26:27kernel. A concrete example of something that's clearly not a
-
26:27 - 26:28valid kernel would
-
26:28 - 26:31be
-
26:31 - 26:36if you find an input X such that K of X, X
-
26:36 - 26:39is minus one, for example, then this is an example of something that's clearly
-
26:39 - 26:42not a valid kernel,
-
26:42 - 26:48because minus one cannot possibly be equal to Phi of X
-
26:48 - 26:49transpose Phi of X,
-
26:49 - 26:53and so this would be one of many examples of functions that
-
26:53 - 26:57will fail to meet the conditions of this theorem,
-
26:57 - 27:04because the inner product of a vector with itself is always greater than or equal to zero. So
-
27:16 - 27:23
-
27:30 - 27:34just to tie this back explicitly to an
-
27:34 - 27:35SVM,
-
27:35 - 27:37let's say you want to use a support vector machine with a kernel,
-
27:37 - 27:41what you do is you choose some function K of XZ,
-
27:41 - 27:46and it turns out that the function I wrote down just now is,
-
27:46 - 27:49indeed, a valid kernel. It
-
27:49 - 27:51is called the Gaussian kernel
-
27:51 - 27:52because of the similarity to
-
27:52 - 27:54Gaussians.
-
27:54 - 27:59So you choose some kernel function like this, or you may choose X
-
27:59 - 28:02transpose Z plus
-
28:02 - 28:03C
-
28:03 - 28:06to the power of D. To apply a support vector machine with a kernel, you choose one of these
-
28:06 - 28:08functions,
-
28:08 - 28:09and the choice of this would depend on your problem.
-
28:09 - 28:14It depends on what is a good measure of whether two examples are
-
28:14 - 28:17similar and whether two examples are different, for your problem.
-
28:17 - 28:22Then you go back to our formulation of support vector machine,
-
28:22 - 28:25and you have to use the dual formulation,
-
28:25 - 28:29and you then replace everywhere you see
-
28:29 - 28:32these
-
28:32 - 28:37things, you
-
28:37 - 28:40replace it with K of XI,
-
28:40 - 28:42XJ.
-
28:42 - 28:45And you then run exactly the same support vector machine algorithm, only everywhere
-
28:45 - 28:49you see these inner products, you replace them with that,
-
28:49 - 28:52and what you've just done is you've taken a support vector machine
-
28:52 - 28:57and you've taken each of your feature vectors X
-
28:57 - 29:03and you've replaced it with implicitly a very high dimensional feature vector.
-
29:03 - 29:06It turns out that the Gaussian kernel corresponds to a feature vector that's
-
29:06 - 29:09infinite dimensional.
-
29:09 - 29:10
-
29:10 - 29:14Nonetheless, you can run a support vector machine in a finite amount of time,
-
29:14 - 29:15even though you're working with
-
29:15 - 29:19infinite dimensional feature vectors,
-
29:19 - 29:21because all you ever need to do is
-
29:21 - 29:23compute these things, and you don't ever need
-
29:23 - 29:26to represent these infinite dimensional feature vectors
-
29:26 - 29:28explicitly.
-
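To make the recipe concrete, here is a minimal sketch in Python of a kernelized decision rule (NumPy; the array alpha and intercept b are assumed to come from solving the dual, and the choice of Gaussian kernel is just an example):

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def svm_predict(x, X_train, y_train, alpha, b, kernel=gaussian_kernel):
        # g( sum_i alpha_i y_i K(x_i, x) + b ) with g = sign; phi is never formed explicitly
        s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, y_train, X_train))
        return 1 if s + b >= 0 else -1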
29:28 - 29:30Why
-
29:30 - 29:33is this a good idea? It turns out "
-
29:33 - 29:38I think I started off talking about support vector machines. I started
-
29:38 - 29:42saying that we wanted to start to develop non-linear learning algorithms.
-
29:42 - 29:45So here's one useful picture to keep in mind, which is that
-
29:45 - 29:48let's say your original data "
-
29:48 - 29:52I didn't mean to draw that slanted. Let's say you have one-dimensional input
-
29:52 - 29:57data. You just have one feature X and R. What a kernel
-
29:57 - 30:01does is the following. It takes your original input data and maps it to a
-
30:01 - 30:05very high dimensional feature space. In the case of Gaussian kernels, an infinite dimensional feature space; for pedagogical reasons, I'll draw
-
30:05 - 30:09two
-
30:09 - 30:13dimensions here.
-
30:13 - 30:20So say [inaudible] very high dimensional feature space where,
-
30:23 - 30:29like so. So it takes all your data in R1 and maps it to R infinity,
-
30:29 - 30:30and then you run a
-
30:30 - 30:34support vector machine in this infinite dimensional or exponentially
-
30:34 - 30:35high dimensional space,
-
30:35 - 30:38and you'll find the optimal margin classifier,
-
30:38 - 30:40in other words, the
-
30:40 - 30:43classifier that separates your data in this very high dimensional space
-
30:43 - 30:46with the largest possible geometric margin.
-
30:46 - 30:50In this example that I just drew, anyway,
-
30:50 - 30:55whereas your data was not linearly separable in your original one dimensional space,
-
30:55 - 30:58when you map it to this much higher dimensional space, it becomes linearly separable, so you
-
30:58 - 31:00can use your
-
31:00 - 31:01linear classifier
-
31:01 - 31:05to classify data which is not linearly separable in your original space.
-
31:05 - 31:07This is how support vector machines
-
31:07 - 31:10output nonlinear decision boundaries,
-
31:10 - 31:15and in the entire process, all you ever need to do is solve convex optimization
-
31:15 - 31:19problems.
-
31:19 - 31:20
-
31:20 - 31:25Questions about any of this? Student:[Inaudible] sigma? Instructor
-
31:25 - 31:30(Andrew Ng):Yeah, so sigma is, let's
-
31:30 - 31:32see. Well, I was going to talk about [inaudible]
-
31:32 - 31:33later. One way
-
31:33 - 31:36to choose sigma is
-
31:36 - 31:39save aside a small amount of your data and
-
31:39 - 31:42try different values of sigma and train an SVM using,
-
31:42 - 31:45say, two thirds of your data.
-
31:45 - 31:48Try different values of sigma, then see what works best on a separate
-
31:48 - 31:49hold out
-
31:49 - 31:53cross validation set, on a separate set that you're testing on. Some of the other
-
31:53 - 31:56learning algorithms we talked about,
-
31:56 - 31:57like
-
31:57 - 32:00locally weighted linear regression, had a bandwidth parameter, so there are a
-
32:00 - 32:03number of parameters to some of these algorithms that you
-
32:03 - 32:06can choose by saving aside some data to test on. I'll
-
32:06 - 32:10talk more about model selection [inaudible] explicitly. Are there other questions? Student:So
-
32:10 - 32:13how do you know that
-
32:13 - 32:14
-
32:14 - 32:17moving it up to high dimensional space
-
32:17 - 32:20is going to give you that kind of separation? Instructor (Andrew
-
32:20 - 32:21Ng):Good question.
-
32:21 - 32:26Usually, you don't know [inaudible]. Sometimes you can know, but
-
32:26 - 32:29in most cases, you won't know in advance whether the data is actually going to be
-
32:29 - 32:30linearly
-
32:30 - 32:34separable, so the next topic will be the soft margin SVM, which is a version of
-
32:34 - 32:41SVMs that works even though the data is not linearly separable. Student:If you can't get the data
-
32:41 - 32:42linearly
-
32:42 - 32:45separated by mapping to a higher dimension, couldn't you
-
32:45 - 32:47also just use [inaudible]
-
32:47 - 32:49higher
-
32:49 - 32:54dimension? Instructor (Andrew Ng):So very right. This is a question about what to do if you can't
-
32:54 - 32:55separate it in higher dimensional
-
32:55 - 32:59space. Let me try to address that with the discussion of the L1 norm soft margin SVM.
-
32:59 - 33:06Okay. Student:What
-
33:06 - 33:08if
-
33:08 - 33:09you run an SVM algorithm
-
33:09 - 33:12that assumes the data are linearly
-
33:12 - 33:13separable on data that
-
33:13 - 33:15is not actually linearly separable? Instructor
-
33:15 - 33:19(Andrew Ng):You guys are really giving me a hard time about whether the
-
33:19 - 33:20data's linearly
-
33:20 - 33:24separable. It turns out this algorithm won't work if the data is not linearly separable, but
-
33:24 - 33:29I'll change that in a second and make it work.
-
33:29 - 33:31
-
33:31 - 33:35Before I move on to talk about that,
-
33:35 - 33:40let me just say one final word about kernels,
-
33:40 - 33:46which is that
-
33:46 - 33:49
-
33:49 - 33:53
-
33:53 - 33:55I talked about kernels
-
33:55 - 34:01in a context of support vector machines,
-
34:01 - 34:05and the idea of kernels was what really made support vector machines a very
-
34:05 - 34:07powerful learning algorithm,
-
34:07 - 34:10and actually towards the end of today's lecture if I have time, I'll actually
-
34:10 - 34:14give a couple more [inaudible] examples of how to choose kernels as well.
-
34:14 - 34:15
-
34:15 - 34:19It turns out that the idea of kernels is actually more general than support vector
-
34:19 - 34:20machines,
-
34:20 - 34:21and in particular,
-
34:21 - 34:26we took this SVM algorithm and we derived a dual, and that was what let us write
-
34:26 - 34:28the entire algorithm
-
34:28 - 34:31in terms of inner products of these.
-
34:31 - 34:35It turns out that you can take many of the other algorithms that you've seen
-
34:35 - 34:37in this class " in fact, it turns
-
34:37 - 34:40out you can take most of the linear algorithms we talked about, such as linear
-
34:40 - 34:42regression, logistic regression
-
34:42 - 34:44[inaudible]
-
34:44 - 34:48and it turns out you can take all of these algorithms and rewrite them entirely
-
34:48 - 34:51in terms of these inner products.
-
34:51 - 34:55So if you have any algorithm that you can rewrite in terms of inner products, then that
-
34:55 - 35:01means you can replace it with K of
-
35:01 - 35:02XI XJ
-
35:02 - 35:06and that means that you can take any of theses algorithms and
-
35:06 - 35:11implicitly map the feature vectors to these very high dimensional feature spaces
-
35:11 - 35:13and have the algorithm
-
35:13 - 35:14
-
35:14 - 35:17still work. The idea of kernels is perhaps most widely used with support vector
-
35:17 - 35:20machines, but it is actually more general than that,
-
35:20 - 35:23and you can take many of the other algorithms that you've seen and many of the algorithms that
-
35:23 - 35:26we'll see later this quarter as well
-
35:26 - 35:30and write them in terms of inner products and thereby kernelize them and apply them to
-
35:30 - 35:33infinite dimensional feature spaces.
-
35:33 - 35:36You'll actually get to play with many of these ideas more in the next problem
-
35:36 - 35:41
-
35:41 - 35:43set.
-
35:43 - 35:47Let's talk about non-linear decision boundaries, and this is
-
35:47 - 35:50the idea of
-
35:50 - 35:55" it's called the L1 norm soft
-
35:55 - 35:59margin SVM. Machine only people sometimes aren't great at coming up with good
-
35:59 - 36:00names, but here's
-
36:00 - 36:02the idea.
-
36:02 - 36:09Let's say I have a data
-
36:13 - 36:15set. This is a linearly separable data
-
36:15 - 36:19set, but what do I do if I have a couple of other examples there that make the data
-
36:19 - 36:24nonlinearly separable,
-
36:24 - 36:26and in fact,
-
36:26 - 36:30sometimes even if the data is linearly separable, maybe you might not want to.
-
36:30 - 36:33So for example, this is a very nice data set. It looks like
-
36:33 - 36:37there's a great decision boundary that separates the two classes. Well,
-
36:37 - 36:40what if I had just one outlier
-
36:40 - 36:42down here? I could
-
36:42 - 36:46still linearly separate this data set
-
36:46 - 36:48with something like that,
-
36:48 - 36:50but I'm somehow letting one
-
36:50 - 36:53slightly suspicious example
-
36:53 - 36:56skew my entire decision boundary by a lot,
-
36:56 - 36:57and so
-
36:57 - 37:01what I'm going to talk about now is the L1 norm soft margin SVM, which is
-
37:01 - 37:05a slightly modified formulation of the SVM optimization problem.
-
37:05 - 37:09This will let us deal with both of these cases: one where the data's
-
37:09 - 37:11just not linearly separable
-
37:11 - 37:14and two, what if you have some examples that you'd rather
-
37:14 - 37:17not get [inaudible] in a training set. Maybe
-
37:17 - 37:19with an outlier here, maybe you actually prefer to
-
37:19 - 37:23choose that original decision boundary and not try so hard to get
-
37:23 - 37:27that training example. Here's the formulation.
-
37:27 - 37:34Our
-
37:44 - 37:45
-
37:45 - 37:46SVM
-
37:46 - 37:51primal problem was to minimize one-half the norm of W
-
37:51 - 37:52squared.
-
37:52 - 37:59
-
38:03 - 38:05So this is our original problem, and I'm
-
38:05 - 38:10going to modify this by adding the
-
38:10 - 38:17following.
-
38:23 - 38:27In other words, I'm gonna add these penalty terms, the slack variables Xi I,
-
38:27 - 38:32and I'm going to demand that each of my training examples is separated with
-
38:32 - 38:36functional margin greater than or equal to one minus Xi I,
-
38:36 - 38:40and you remember
-
38:40 - 38:45
-
38:45 - 38:47if this is
-
38:47 - 38:50greater than zero, was it
-
38:50 - 38:53two lectures ago that I said that if the
-
38:53 - 38:55functional margin is greater than zero,
-
38:55 - 39:01that implies you classified it correctly.
-
39:01 - 39:05If it's less than zero, then you misclassified it.
-
39:05 - 39:08By setting some of the Xi I to be larger than one, I can
-
39:08 - 39:11actually have some examples with
-
39:11 - 39:13functional margin negative,
-
39:13 - 39:16and therefore I'm allowing my algorithm to misclassify some of the examples of the
-
39:16 - 39:19training set.
-
39:19 - 39:23However, I'll encourage the algorithm not to do that by adding to the
-
39:23 - 39:24optimization objective,
-
39:24 - 39:30this sort of penalty term that penalizes setting the Xi I to be large.
-
39:30 - 39:34This is an optimization problem where the parameters are W, B,
-
39:34 - 39:37and all of the Xi I,
-
39:37 - 39:43and this is also a convex optimization problem. It
-
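In full, the modified primal problem being described, with the slack variables written \xi_i, is

    \min_{w, b, \xi}\;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^m \xi_i
    \text{subject to}\;\; y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, m.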
39:43 - 39:47turns out that
-
39:47 - 39:51similar to how we worked on the dual of the support vector machine,
-
39:51 - 39:54we can also work out the dual for this optimization problem. I won't
-
39:54 - 39:55actually do it,
-
39:55 - 39:59but just to show you the steps, what you do is you construct the Lagrangian, and I'm going
-
39:59 - 40:02to
-
40:02 - 40:03use
-
40:03 - 40:08Alpha and R to denote the Lagrange multipliers corresponding to
-
40:08 - 40:11this set of constraints that we had previously
-
40:11 - 40:15and this new set of constraints that the Xi I are greater than or equal to zero. This gives us
-
40:15 - 40:16two sets of
-
40:16 - 40:23Lagrange multipliers. The Lagrangian will be the
-
40:23 - 40:25optimization objective
-
40:25 - 40:32minus sum from
-
40:36 - 40:39plus Xi I
-
40:39 - 40:44minus
-
40:44 - 40:47and so
-
40:47 - 40:51there's our [inaudible] optimization objective
-
40:51 - 40:55minus or plus Alpha times each of these constraints,
-
40:55 - 40:57
-
40:57 - 41:04which are
-
41:05 - 41:12
-
41:18 - 41:21greater or equal to
-
41:21 - 41:28zero. I won't redivide the entire dual again, but it's really the same,
-
41:28 - 41:34and when you derive the dual of this optimization problem and when you simplify,
-
41:34 - 41:37you find that you get the following.
-
41:37 - 41:41You have to maximize
-
41:41 - 41:48[inaudible], which is actually the same as before.
-
41:51 - 41:58
-
42:03 - 42:10
-
42:20 - 42:23So it turns out when you derive the dual
-
42:23 - 42:25and simplify,
-
42:25 - 42:29it turns out that the only way the dual changes compared to the previous one
-
42:29 - 42:32is that rather than the constraint that the Alpha I's are greater than or equal to zero,
-
42:32 - 42:37we now have a constraint that the Alphas are between zero and C.
-
42:37 - 42:40This derivation isn't very hard, and you're encouraged to go home and try to do it yourself.
-
42:40 - 42:42It's really essentially the same math,
-
42:42 - 42:44and when you simplify, it turns out
-
42:44 - 42:49you can simplify the R Lagrange multipliers away
-
42:49 - 42:49and you end up with
-
42:49 - 42:55just these constraints of
-
42:55 - 42:57
-
42:57 - 43:02the
-
43:02 - 43:05Alphas.
-
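For reference, the resulting dual, identical to before except for the box constraint on the alphas, is

    \max_\alpha\; W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
    \text{subject to}\;\; 0 \le \alpha_i \le C, \qquad \sum_{i=1}^m \alpha_i y^{(i)} = 0.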
43:05 - 43:06Just as an aside,
-
43:06 - 43:09I won't
-
43:09 - 43:12derive these, either. It turns out that,
-
43:12 - 43:16remember, I wrote down the KKT conditions in
-
43:16 - 43:18the last lecture.
-
43:18 - 43:22The necessary conditions for something to be an optimal solution
-
43:22 - 43:24to constrained optimization problems.
-
43:24 - 43:27So if you use the KKT conditions,
-
43:27 - 43:30it turns out you can actually derive convergence conditions, so
-
43:30 - 43:33we want to solve this optimization problem.
-
43:33 - 43:38When do we know the Alphas have converged to the global optimum?
-
43:38 - 43:41It turns out you can use the following.
-
43:41 - 43:48
-
43:48 - 43:55
-
44:05 - 44:12
-
44:14 - 44:17
-
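The conditions written on the board here are presumably the standard KKT dual-complementarity conditions for this problem:

    \alpha_i = 0 \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \ge 1
    \alpha_i = C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \le 1
    0 < \alpha_i < C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) = 1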
44:17 - 44:21I don't want to say a lot about these. It
-
44:21 - 44:24turns out from the KKT conditions you can derive
-
44:24 - 44:27these as the convergence conditions for an algorithm that you might choose to
-
44:27 - 44:28use
-
44:28 - 44:34to try to solve the optimization problem in terms of the
-
44:34 - 44:40Alphas. That's the L1 norm soft margin SVM, and this is the
-
44:40 - 44:44change to the algorithm that lets us handle non-linearly separable data sets as well as
-
44:44 - 44:47single outliers that may still be linearly separable
-
44:47 - 44:54but you may choose not to separate [inaudible].
-
44:56 - 44:58Questions about this?
-
44:58 - 45:03Raise
-
45:03 - 45:10
-
45:18 - 45:25
-
45:40 - 45:47
-
45:52 - 45:57your hand if this stuff makes sense at all. Great.
-
45:57 - 46:00So
-
46:00 - 46:03the last thing I want to do is
-
46:03 - 46:08talk about an algorithm for actually solving this optimization problem.
-
46:08 - 46:10We wrote down
-
46:10 - 46:14this dual optimization problem with convergence criteria,
-
46:14 - 46:19so let's come up with an efficient algorithm to actually solve this optimization problem.
-
46:19 - 46:21
-
46:21 - 46:25I want to do this partly to give me an excuse to talk about an algorithm
-
46:25 - 46:29called coordinate ascent, which is useful to
-
46:29 - 46:34do. What I actually want to do is tell you about an algorithm called coordinate
-
46:34 - 46:37ascent, which is a useful algorithm to know about,
-
46:37 - 46:40and it'll turn out that it won't apply in the simplest form
-
46:40 - 46:44to this problem, but we'll then be able to modify it slightly and then it'll
-
46:44 - 46:46give us a very efficient
-
46:46 - 46:49algorithm for solving this [inaudible]
-
46:49 - 46:52optimization problem. That was the other reason that I had to derive the dual, not only so that we could
-
46:52 - 46:53use kernels
-
46:53 - 46:58but also so that we can apply an algorithm like the
-
46:58 - 47:01SMO
-
47:01 - 47:05algorithm. First, let's talk about coordinate ascent, which is another
-
47:05 - 47:12[inaudible] optimization
-
47:18 - 47:21algorithm. To describe coordinate
-
47:21 - 47:25ascent, I just want you to consider the problem where we
-
47:25 - 47:30want to maximize some function
-
47:30 - 47:34W, which is a function of Alpha one through Alpha M with no constraints. So for
-
47:34 - 47:39
-
47:39 - 47:40
-
47:40 - 47:43now, forget about the constraint that the Alpha I's must be between zero and C. Forget about the
-
47:43 - 47:50constraint that the sum of YI Alpha I must be equal to zero.
-
47:50 - 47:53Then this is the coordinate
-
47:53 - 47:54ascent
-
47:54 - 47:56algorithm. It will
-
47:56 - 48:01repeat until convergence
-
48:01 - 48:04
-
48:04 - 48:07and will do for I equals one to M. The inner loop of coordinate ascent essentially
-
48:07 - 48:10holds all the parameters
-
48:10 - 48:12except Alpha I fixed
-
48:12 - 48:16and then it just maximizes this function with respect to just one of the parameters. Let me
-
48:16 - 48:18write that
-
48:18 - 48:21as Alpha I gets
-
48:21 - 48:24updated as the arg max
-
48:24 - 48:29over Alpha I hat of
-
48:29 - 48:30W
-
48:30 - 48:32Alpha one
-
48:32 - 48:39
-
48:42 - 48:49Alpha I minus one. This is really the fancy way of saying hold everything
-
48:50 - 48:54except Alpha I
-
48:54 - 48:58fixed. Just optimize W, my optimization objective, with
-
48:58 - 49:00respect to only
-
49:00 - 49:01Alpha I.
-
49:01 - 49:02This is just a fancy
-
49:02 - 49:04way of writing
-
49:04 - 49:09it. This is coordinate
-
49:09 - 49:11ascent.
-
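As a minimal sketch of this loop in Python, here is coordinate ascent applied to a small concave quadratic (the particular objective W and its closed-form one-variable update are purely illustrative, not the SVM dual):

    import numpy as np

    # illustrative concave objective: W(a) = b^T a - 0.5 * a^T A a, with A symmetric positive definite
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])

    def W(a):
        return b @ a - 0.5 * a @ A @ a

    a = np.zeros(2)
    for sweep in range(50):                 # repeat until (approximately) converged
        for i in range(len(a)):             # for i = 1, ..., m
            # hold every coordinate except a[i] fixed and maximize W over a[i] alone;
            # for this quadratic, the one-dimensional maximizer has a closed form
            a[i] = (b[i] - A[i] @ a + A[i, i] * a[i]) / A[i, i]

    print(a, W(a))                          # approaches the global maximizer, A^{-1} b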
49:11 - 49:14One picture
-
49:14 - 49:18that's kind of useful for coordinate ascent is if you
-
49:18 - 49:25imagine you're trying to optimize a quadratic function, it
-
49:25 - 49:27really looks like
-
49:27 - 49:31that. These are the contours of the quadratic function, and the minimum is
-
49:31 - 49:32here.
-
49:32 - 49:37This is what
-
49:37 - 49:40coordinate ascent would look like. These are my axes; I'll call this Alpha two and I'll call this Alpha one.
-
49:40 - 49:45My Alpha one Alpha two axis, and so let's say I start down here. Then I'm going
-
49:45 - 49:49to begin by minimizing this with respect to Alpha one. I
-
49:49 - 49:51go there.
-
49:51 - 49:55And then at my new point, I'll minimize with respect to Alpha two, and so I might go to someplace like
-
49:55 - 49:56that.
-
49:56 - 49:58Then, I'll minimize
-
49:58 - 50:02with respect to Alpha one, then go back to Alpha two
-
50:02 - 50:04and so on.
-
50:04 - 50:08You're always taking these axis-aligned steps
-
50:08 - 50:12to get to the minimum.
-
50:12 - 50:16It turns out that there's a modification to this. There are
-
50:16 - 50:18variations of this as well. The way I describe
-
50:18 - 50:21the algorithm, we're always doing this
-
50:21 - 50:24in alternating order. We always optimize with respect to Alpha
-
50:24 - 50:29one then Alpha two, then Alpha one, then Alpha two.
-
50:29 - 50:32What I'm about to say applies only in higher dimensions, but it turns out if you have
-
50:32 - 50:35a lot of parameters, Alpha one through Alpha M,
-
50:35 - 50:39you may not choose to always visit them in a fixed order. You may choose
-
50:39 - 50:43which Alphas update next depending on what you
-
50:43 - 50:46think will allow you to make the most progress. If you have only two parameters, it makes
-
50:46 - 50:50sense to alternate between them.
-
50:50 - 50:51If
-
50:51 - 50:53you have
-
50:53 - 50:57higher dimensional parameters, sometimes you may choose to update them in a
-
50:57 - 51:01different order if you think doing so would let you make faster progress towards the
-
51:01 - 51:02
-
51:02 - 51:05
-
51:05 - 51:09maximum. It turns out that coordinate assent
-
51:09 - 51:11compared to some of the algorithms we saw previously, such as, say, Newton's
-
51:11 - 51:12
-
51:12 - 51:16method, coordinate ascent will usually take a lot more steps,
-
51:16 - 51:21but the chief advantage of coordinate ascent when it works well
-
51:21 - 51:23is that sometimes
-
51:23 - 51:29the optimization objective W
-
51:29 - 51:33is very inexpensive to optimize with respect to any one of your
-
51:33 - 51:34parameters,
-
51:34 - 51:38and so even though coordinate ascent has to take many more iterations than, say, Newton's
-
51:38 - 51:39method
-
51:39 - 51:41in order to
-
51:41 - 51:45converge, each of those iterations can be much cheaper. It turns out that there are many optimization problems for which it's
-
51:45 - 51:47particularly easy
-
51:47 - 51:48to fix
-
51:48 - 51:52all but one of the parameters and optimize with respect to just that one parameter,
-
51:52 - 51:56and if that's true, then the inner loop of coordinate ascent, optimizing with
-
51:56 - 51:58respect to Alpha I can be done very quickly
-
51:58 - 52:01and cause [inaudible].
-
52:01 - 52:02It turns out that
-
52:02 - 52:07this will be true when we modify this algorithm to solve the SVM
-
52:07 - 52:12optimization problem.
-
52:12 - 52:19Questions about this?
-
52:22 - 52:29
-
52:47 - 52:50Okay. Let's go ahead and apply this
-
52:50 - 52:54to our support vector machine dual optimization problem. It
-
52:54 - 52:57turns out that
-
52:57 - 53:02coordinate assent in its basic form does not work for the following reason. The
-
53:02 - 53:07reason is we have constraints on the Alpha I's. Namely, if you
-
53:07 - 53:11recall from what we worked out previously,
-
53:11 - 53:12we have a constraint that the sum over I of YI
-
53:12 - 53:16Alpha I must be equal to zero,
-
53:16 - 53:20and so if you fix all the Alphas except for one,
-
53:20 - 53:24then you can't change one Alpha without violating the constraint. If
-
53:24 - 53:25
-
53:25 - 53:26I just try to change Alpha one,
-
53:26 - 53:27
-
53:27 - 53:31Alpha one is actually exactly determined as a function of the other
-
53:31 - 53:33Alphas, because these have to sum to zero.
-
53:33 - 53:36The
-
53:36 - 53:39SMO algorithm, by the way, is due to John Platt, a colleague at
-
53:39 - 53:43Microsoft. The SMO
-
53:43 - 53:46algorithm, therefore, instead of trying to change one Alpha at a time,
-
53:46 - 53:49we will try to change
-
53:49 - 53:50two
-
53:50 - 53:54Alphas at a time.
-
53:54 - 53:55This is
-
53:55 - 53:58called the SMO algorithm, in a
-
53:58 - 54:01sense the sequential minimal optimization
-
54:01 - 54:03and the term minimal refers to the fact that
-
54:03 - 54:06we're choosing the smallest number of Alpha Is to change at a time, which
-
54:06 - 54:11in this case, we need to change at least two at a time.
-
54:11 - 54:17So let me go ahead and outline the algorithm.
-
54:17 - 54:20We will
-
54:20 - 54:24select
-
54:24 - 54:29two Alphas to change, some Alpha I and Alpha J, via some heuristic, and a heuristic
-
54:29 - 54:34just means a rule of thumb.
-
54:34 - 54:41We'll hold all the Alpha Ks fixed
-
54:41 - 54:45except
-
54:45 - 54:50Alpha I and Alpha J
-
54:50 - 54:54and optimize W of Alpha
-
54:54 - 54:57with respect
-
54:57 - 54:59to Alpha I
-
54:59 - 55:06and Alpha J subject to all the constraints.
-
55:14 - 55:18It turns out the
-
55:18 - 55:20key step
-
55:20 - 55:21which I'm going to
-
55:21 - 55:22work out
-
55:22 - 55:26is this one, which
-
55:26 - 55:28is how do you optimize W of Alpha
-
55:28 - 55:32with respect to the two parameters that you just chose to update
-
55:32 - 55:35and subject to the constraints? I'll do
-
55:35 - 55:37that in a second.
-
55:37 - 55:39
-
55:39 - 55:41You would keep running this algorithm
-
55:41 - 55:43until you have satisfied
-
55:43 - 55:45these
-
55:45 - 55:49convergence criteria up to Epsilon.
-
55:49 - 55:54What I want to do now is describe how to do this key step,
-
55:54 - 55:56how to optimize W of Alpha
-
55:56 - 55:58with respect to
-
55:58 - 56:00Alpha I and Alpha J,
-
56:00 - 56:05and it turns out that it's because you can do this extremely efficiently
-
56:05 - 56:08that the SMO algorithm works well. The inner loop of the SMO
-
56:08 - 56:12algorithm can be done extremely efficiently, so it may take a large number of
-
56:12 - 56:19iterations to converge, but each iteration is very cheap. Let's talk about that.
-
56:33 - 56:40
-
56:42 - 56:46So in order to derive that step where we update in respect to Alpha I and
-
56:46 - 56:47Alpha J,
-
56:47 - 56:51I'm actually going to use Alpha one and Alpha two as my example. I'm gonna
-
56:51 - 56:52update
-
56:52 - 56:57Alpha one and Alpha two. In general, this could be any Alpha I and Alpha J, but
-
56:57 - 57:00just to make my notation on the board easier, I'm going to derive the
-
57:00 - 57:02derivation for Alpha one and Alpha two
-
57:02 - 57:07and the general case is completely analogous.
-
57:07 - 57:09
-
57:09 - 57:13On every step of the algorithm, we have to respect the constraint that the
-
57:13 - 57:17sum over I Alpha I YI is equal to zero. This is one of the
-
57:17 - 57:19constraints we had previously for
-
57:19 - 57:23our dual optimization problem.
-
57:23 - 57:26This means that Alpha one Y1 plus Alpha
-
57:26 - 57:28two Y2 must be equal to this,
-
57:28 - 57:31to
-
57:31 - 57:38
-
57:41 - 57:43which I'm going to denote
-
57:43 - 57:46by Zeta.
-
57:46 - 57:48
-
57:48 - 57:49So
-
57:49 - 57:52we also have the constraint
-
57:52 - 57:53that the Alpha Is
-
57:53 - 57:57must be between zero and C. We had two constraints on our dual. This was one of the
-
57:57 - 58:00constraints. This was the other one.
-
58:00 - 58:07In pictures,
-
58:08 - 58:15the
-
58:29 - 58:33constraint that the Alpha I's are between zero
-
58:33 - 58:35and C, that is often called the box constraint,
-
58:35 - 58:38and so if I draw Alpha one
-
58:38 - 58:43and Alpha two, then
-
58:43 - 58:50I
-
58:50 - 58:53have a constraint that
-
58:53 - 58:57the values of Alpha one and Alpha two must lie within this box that ranges from zero
-
58:57 - 58:59to C.
-
58:59 - 59:02And so the picture of the algorithm is as follows.
-
59:02 - 59:04I have some constraint
-
59:04 - 59:07that Alpha one Y1
-
59:07 - 59:09plus Alpha
-
59:09 - 59:10two Y2
-
59:10 - 59:14must be equal to Zeta,
-
59:14 - 59:18
-
59:18 - 59:21
-
59:21 - 59:22
-
59:22 - 59:26and so this implies that
-
59:26 - 59:32Alpha one must be equal to Zeta minus Alpha two Y2
-
59:32 - 59:37over Y1,
-
59:37 - 59:40and so what I want to do is I want to optimize the objective with respect
-
59:40 - 59:43
-
59:43 - 59:45to this.
-
59:45 - 59:49What I can do is plug in my definition for Alpha one as a function of Alpha two
-
59:49 - 59:51and so this becomes W of
-
59:51 - 59:55Zeta minus Alpha two Y2
-
59:55 - 59:57over
-
59:57 - 60:02Y1, Alpha two, Alpha three
-
60:02 - 60:03and so on,
-
60:03 - 60:07and it turns out that because W is a quadratic function " if you look back to our
-
60:07 - 60:11earlier definition of W, you find it's a quadratic function in all the Alphas "
-
60:11 - 60:13it turns out that if you look at this expression for W
-
60:13 - 60:17and if you view it as just a function of Alpha two,
-
60:17 - 60:22you find that this is a one dimensional quadratic function of Alpha two
-
60:22 - 60:25if you hold Alpha three, Alpha four and so on fixed,
-
60:25 - 60:26and so
-
60:26 - 60:30this can be simplified to some expression of the form A Alpha two squared
-
60:30 - 60:32plus B Alpha two plus
-
60:32 - 60:33C. This
-
60:33 - 60:38is a standard quadratic function.
-
60:38 - 60:40This is really easy to optimize. We know how to optimize
-
60:40 - 60:42"
-
60:42 - 60:44when did we learn this? This was high school
-
60:44 - 60:48or undergrad or something. You know how to optimize quadratic functions like these.
-
60:48 - 60:50You just do that and that gives you the
-
60:50 - 60:52optimal value for Alpha two.
-
60:52 - 60:55The last step with a
-
60:55 - 60:59box constraint like this, just in pictures,
-
60:59 - 61:02you know your solution must lie on this line, and so there'll be some
-
61:02 - 61:05sort of quadratic function
-
61:05 - 61:08over this line, say,
-
61:08 - 61:10and so if you minimize the quadratic function,
-
61:10 - 61:14maybe you get a value that lies in the box, and if so, then you're done.
-
61:14 - 61:17Or if your quadratic function looks like this,
-
61:17 - 61:21maybe when you optimize your quadratic function, you may end up with a value outside, so you end up with
-
61:21 - 61:23a solution like that. If
-
61:23 - 61:26that happens, you clip your solution
-
61:26 - 61:30just to map it back inside the box.
-
61:30 - 61:35That'll give you the optimal solution of this quadratic optimization problem
-
61:35 - 61:36subject to
-
61:36 - 61:37your solution
-
61:37 - 61:40satisfying this box constraint
-
61:40 - 61:43and lying on this straight line " in other words, subject to
-
61:43 - 61:49the solution lying on this line segment within the box.
-
61:49 - 61:53
-
61:53 - 61:58Having solved the Alpha two this way, you can clip it if necessary
-
61:58 - 62:00to get it back within the box constraint
-
62:00 - 62:01and then
-
62:01 - 62:04we have Alpha one as a function of Alpha two
-
62:04 - 62:06and this allows you to
-
62:06 - 62:10optimize W with respect to Alpha one and Alpha two quickly,
-
62:10 - 62:11subject to all the constraints,
-
62:11 - 62:15and the key step is really just sort of a one-dimensional quadratic optimization, which
-
62:15 - 62:16we do very quickly,
-
62:16 - 62:23which is what makes the inner loop of the SMO algorithm very efficient. Student:You mentioned here that
-
62:25 - 62:32we can
-
62:33 - 62:34change
-
62:34 - 62:38whatever, but the SMO
-
62:38 - 62:43algorithm, we can
-
62:43 - 62:45change two at a
-
62:45 - 62:46time,
-
62:46 - 62:49so how is that [inaudible] understand that. Instructor (Andrew Ng):Right. Let's say I want to change
-
62:49 - 62:52" as I run optimization algorithm, I
-
62:52 - 62:55have to respect the constraint that
-
62:55 - 62:56sum over
-
62:56 - 62:58I Alpha I YI
-
62:58 - 63:00must be equal to zero,
-
63:00 - 63:07so this is a linear constraint that I didn't have when I was talking about coordinate ascent.
-
63:07 - 63:09Suppose I tried
-
63:09 - 63:13to change just Alpha one.
-
63:13 - 63:15Then I know that Alpha one
-
63:15 - 63:20must be equal to minus the sum from I equals two to M
-
63:20 - 63:22Alpha I
-
63:22 - 63:23YI
-
63:23 - 63:26divided by Y1, right,
-
63:26 - 63:27and so Alpha
-
63:27 - 63:30one can actually be written exactly as a function of Alpha two, Alpha three and so on
-
63:30 - 63:33through Alpha M.
-
63:33 - 63:36And so if I hold Alpha two, Alpha three, Alpha four
-
63:36 - 63:40through Alpha M fixed, then I can't change Alpha one, because Alpha one is fully
-
63:40 - 63:42
-
63:42 - 63:46determined by the other Alphas. Whereas in contrast, if I choose to change Alpha one and Alpha two at the same
-
63:46 - 63:47time,
-
63:47 - 63:51then I still have a constraint and so I know Alpha one and Alpha two
-
63:51 - 63:54must satisfy
-
63:54 - 63:55that linear constraint
-
63:55 - 63:59but at least this way I can change Alpha one if I also
-
63:59 - 64:05change Alpha two accordingly to make sure this satisfies
-
64:05 - 64:09the constraint.
-
64:09 - 64:14Student:[Inaudible]. Instructor (Andrew Ng):So Zeta was defined [inaudible]. So on
-
64:14 - 64:16each iteration, I have some
-
64:16 - 64:18setting of the parameters,
-
64:18 - 64:21Alpha one, Alpha two, Alpha three and so on,
-
64:21 - 64:25and I want to change Alpha one and Alpha two, say.
-
64:25 - 64:27So from the previous iteration,
-
64:27 - 64:28let's
-
64:28 - 64:32say I had not violated the constraint, so that holds true,
-
64:32 - 64:35and so I'm just defining Zeta to be equal to this, because
-
64:35 - 64:38Alpha one Y1 plus Alpha two Y2 must be equal to sum from
-
64:38 - 64:42I equals three to M of that,
-
64:42 - 64:44and so I'm just defining this
-
64:44 - 64:51to be Zeta.
-
65:00 - 65:01Student:[Inaudible]. Instructor (Andrew
-
65:01 - 65:07Ng):On every iteration, you change maybe a different pair of Alphas to update. The
-
65:07 - 65:11way you do this is something I don't want to talk about. I'll say a couple more words about
-
65:11 - 65:12that, but
-
65:12 - 65:14the basic outline of the algorithm is
-
65:14 - 65:18on every iteration of the algorithm, you're going to select some Alpha I and
-
65:18 - 65:19Alpha J to update
-
65:19 - 65:23like on this board. So that's an Alpha I and an Alpha J to update
-
65:23 - 65:25via some heuristic
-
65:25 - 65:28and then you use the procedure I just described to
-
65:28 - 65:31actually update
-
65:31 - 65:34Alpha I and Alpha J. What I actually just talked about
-
65:34 - 65:35was the procedure
-
65:35 - 65:40to optimize W with respect to Alpha I and Alpha J. I didn't actually talk about the
-
65:40 - 65:47heuristic for choosing Alpha
-
65:49 - 65:52I and Alpha J. Student:What is the function W? Instructor (Andrew Ng):W is way up
-
65:52 - 65:54there. I'll just write it again. W of
-
65:54 - 66:00Alpha is that function we had previously. W of Alpha was the sum over I
-
66:00 - 66:03" this is about solving the "
-
66:03 - 66:10it was that thing.
-
66:10 - 66:13All of this is about solving the optimization problem for the SVM, so this
-
66:13 - 66:19is the objective function we had, so that's W of Alpha. Student:[Inaudible]? Exchanging one of the Alphas, optimizing
-
66:19 - 66:26that one, you
-
66:26 - 66:27can
-
66:27 - 66:29make the
-
66:29 - 66:34other one that you have to change work,
-
66:34 - 66:36right? Instructor (Andrew Ng):What
-
66:36 - 66:40do
-
66:40 - 66:44you mean works? Student:It will get farther from its optimal. Instructor (Andrew Ng):Let me translate it differently.
-
66:44 - 66:47What we're trying to do is we're trying to optimize the objective function W
-
66:47 - 66:48of Alpha,
-
66:48 - 66:52so the metric of progress that we care about is whether W of Alpha is getting
-
66:52 - 66:55better on every iteration,
-
66:55 - 66:59and so what is true for coordinate ascent and for SMO is that on every iteration,
-
66:59 - 67:03W of Alpha can only increase. It may stay the same
-
67:03 - 67:05or it may increase. It can't get worse.
-
67:05 - 67:09It's true that eventually, the Alphas will converge to some value. It's true that
-
67:09 - 67:13in intervening iterations, one of the Alphas may move further
-
67:13 - 67:17away and then closer and further and closer to its final value,
-
67:17 - 67:20but what we really care about is that W of Alpha is getting better
-
67:20 - 67:27every time, which is true.
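For reference, the function W of Alpha being discussed here is the dual objective derived earlier. Written out (a reconstruction, since the board is not captured in the transcript), it is

\[
W(\alpha) \;=\; \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \,\bigl\langle x^{(i)}, x^{(j)} \bigr\rangle ,
\]

maximized subject to \(\sum_{i=1}^{m} \alpha_i y^{(i)} = 0\) and, in the regularized (non-separable) version, \(0 \le \alpha_i \le C\).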
-
67:28 - 67:31Just a couple more words on
-
67:31 - 67:34SMO before I wrap up on this.
-
67:34 - 67:53One is that the heuristic in John Platt's original algorithm, for choosing which pair of values Alpha I and Alpha J to update next, is one of those things that's not conceptually complicated but is very complicated to explain in words, so I won't talk about that here.
-
67:53 - 67:56If you want to learn about it,
-
67:56 - 67:59go ahead and look up John Platt's paper on the
-
67:59 - 68:00
-
68:00 - 68:24SMO algorithm. The paper is pretty easy to read, and later on, we'll also be posting a handout on the course homepage with a simplified version of this algorithm that you can use in the problem sets, so you can see some of these steps in more detail.
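To give a rough sense of what such a simplified version looks like, here is a self-contained sketch (my own reconstruction, not the course handout and not Platt's algorithm: the second Alpha is chosen at random rather than by the heuristic, and there is none of the error caching a serious implementation would use):

```python
import numpy as np

def simplified_smo(K, y, C=1.0, tol=1e-4, max_passes=10):
    """Sketch of SMO: K is the kernel (Gram) matrix, y holds labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    f = lambda k: np.sum(alpha * y * K[:, k]) + b      # decision value on training example k
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # Only try to update Alphas that look like they violate the KKT conditions.
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(m) if k != i])   # stand-in for the heuristic
                E_j = f(j) - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # Feasible range for alpha[j] given 0 <= alpha <= C and the linear constraint.
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # curvature of W along the constraint line
                if eta >= 0:
                    continue
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-7:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])   # keep sum of alpha_i y_i unchanged
                # Standard intercept update for the pair just changed.
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

On every pass, each pairwise update can only increase (or leave unchanged) the dual objective, which is exactly the coordinate ascent property discussed above.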
-
68:24 - 68:40One other thing that I didn't talk about was how to compute the parameter B. So this procedure solves for all your Alphas, and the Alphas also give us W. The remaining piece, then, is the intercept B, and it turns out that again is actually not very difficult.
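As an aside, in the linearly separable case one standard way to recover the intercept once the optimal Alphas (and hence W) are known is

\[
b \;=\; -\,\frac{\displaystyle \max_{i:\,y^{(i)}=-1} w^{T} x^{(i)} \;+\; \min_{i:\,y^{(i)}=1} w^{T} x^{(i)}}{2} ,
\]

which just places the hyperplane halfway between the closest negative and closest positive training examples; take this as a pointer to the idea rather than the derivation the notes give.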
-
68:40 - 68:43I'll let you read about that yourself with the
-
68:43 - 68:45notes that we'll post
-
68:45 - 68:52along with the next
-
68:52 - 68:56problems. To wrap up today's lecture, what I want to do is just tell
-
68:56 - 69:01you briefly about a couple of examples of applications
-
69:01 - 69:03
-
69:03 - 69:04
-
69:04 - 69:07
-
69:07 - 69:14of SVMs.
-
69:26 - 69:33Let's consider the problem of handwritten digit recognition. In handwritten digit recognition,
-
69:33 - 69:37you're given a pixel array with
-
69:37 - 69:41a scanned image of,
-
69:41 - 69:44say, a zip code that someone has written. This is an array of pixels,
-
69:44 - 69:48and some of these pixels will be on
-
69:48 - 69:49and other pixels
-
69:49 - 69:53will be off. This combination of pixels being on maybe represents the
-
69:53 - 69:56character one.
-
69:56 - 70:32The question is: given an input feature vector like this - if you have, say, ten pixels by ten pixels, then you have a hundred-dimensional feature vector, where maybe these are binary features with each X being zero or one, or maybe the Xs are gray-scale values corresponding to how dark each of these pixels was - can you recognize which digit it is?
-
70:32 - 70:40It turns out that for many years, a neural network was the champion algorithm for handwritten digit recognition.
-
70:40 - 70:44And it turns out that you can apply an SVM with the
-
70:44 - 70:47following
-
70:47 - 70:53kernel. It turns out either the polynomial kernel or
-
70:53 - 70:58the Gaussian
-
70:58 - 71:03
-
71:03 - 71:05kernel works fine for this problem,
-
71:05 - 71:10and just by writing down this kernel and throwing an SVM at
-
71:10 - 71:14it, the SVM gave performance comparable to the very best neural networks.
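As a rough modern illustration of that kind of setup, here is a minimal sketch using scikit-learn and its small 8x8 digits dataset (neither of which is part of the lecture; the kernel choice and the value of C are illustrative, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Each row of X is a flattened array of pixel intensities; y is the digit label.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM with a Gaussian (RBF) kernel; a polynomial kernel (kernel="poly") also works here.
clf = SVC(kernel="rbf", gamma="scale", C=10.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Note that the classifier only ever sees each image as a flat vector of pixel values, which is exactly the point made next: it has no built-in notion of which pixels are next to which.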
-
71:14 - 71:15
-
71:15 - 71:20
-
71:20 - 71:22This is surprising because support vector machine
-
71:22 - 71:26doesn't take into account any knowledge about the pixels,
-
71:26 - 71:27and in particular,
-
71:27 - 71:30it doesn't know that this pixel is
-
71:30 - 71:32next to that pixel
-
71:32 - 71:37because it's just representing the pixel intensity value as a vector.
-
71:37 - 71:40And so this means the performance of SVM would be the same even if you were to
-
71:40 - 71:43shuffle all the pixels around.
-
71:43 - 71:46[Inaudible]
-
71:46 - 71:52let's say comparable to the very best neural networks, which had been
-
71:52 - 71:59under very careful development for many years. I want
-
71:59 - 72:00to
-
72:00 - 72:04tell you about one other cool example, which is SVMs are
-
72:04 - 72:09also used to classify other fairly esoteric objects. So for
-
72:09 - 72:09example,
-
72:09 - 72:13let's say you want to classify
-
72:13 - 72:17protein sequences into different classes of
-
72:17 - 72:21proteins. Every time I do this, I suspect that biologists in the room cringe, so I apologize
-
72:21 - 72:25for that. There are
-
72:25 - 72:2920 amino acids, and proteins in our bodies are made up by
-
72:29 - 72:31sequences of amino acids.
-
72:31 - 72:41Even though there are 20 amino acids and 26 letters in the alphabet, I'm going to denote the amino acids by the letters A through Z, with apologies to the biologists. Here's
-
72:41 - 72:45an amino acid sequence
-
72:45 - 72:52
-
72:54 - 73:01
-
73:03 - 73:05
-
73:05 - 73:12represented by a series of letters.
-
73:16 - 73:18
-
73:18 - 73:19So
-
73:19 - 73:21suppose I want to assign
-
73:21 - 73:25this protein into a few classes depending on what
-
73:25 - 73:28type of protein it is. The
-
73:28 - 73:30question is how
-
73:30 - 73:33do I construct my feature
-
73:33 - 73:37vector? This is challenging for many reasons, one of which is that protein
-
73:37 - 73:40sequences can be of different lengths. There are some very long protein
-
73:40 - 73:42sequences and some very short ones,
-
73:42 - 73:45and so you can't have a feature saying what is
-
73:45 - 73:49the amino acid in the 100th position, because maybe there is no 100th
-
73:49 - 73:53position in this protein sequence. Some are very long. Some are very short.
-
73:53 - 73:55Here's my feature representation,
-
73:55 - 73:57which is
-
73:57 - 74:00I'm going to write down
-
74:00 - 74:05all possible combinations of four letters. I'm going to write down AAAA, AAAB, AAAC
-
74:05 - 74:07down
-
74:07 - 74:09to
-
74:09 - 74:11AAAZ and then
-
74:11 - 74:14AABA
-
74:14 - 74:20and so on. You get the idea. Write down
-
74:20 - 74:24all
-
74:24 - 74:27possible combinations of
-
74:27 - 74:28four letters
-
74:28 - 74:31and my feature representation will be
-
74:31 - 74:34I'm going to scan through this sequence of amino acids
-
74:34 - 74:39and count how often each of these subsequences occur. So for example,
-
74:39 - 74:40BAJT occurs twice and so I'll put
-
74:40 - 74:43a two there, and
-
74:43 - 74:45none of these sequences occur, so I'll put
-
74:45 - 74:46a zero there. I guess
-
74:46 - 74:50I have a one here
-
74:50 - 74:54and a zero there. This very long vector
-
74:54 - 74:57will be my feature representation for protein.
-
74:57 - 75:00This representation applies no matter how long my protein sequence is.
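Here is a sketch of that feature map in code (an illustration only: the use of contiguous length-4 substrings and a sparse dictionary of counts is an assumption about the representation being described, and the example sequence is made up):

```python
from collections import Counter

def four_mer_counts(seq):
    """Count how often each length-4 substring occurs in an amino acid sequence.
    Conceptually this is a vector with one coordinate per possible 4-letter string,
    but only the nonzero coordinates are ever stored."""
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))

phi = four_mer_counts("MKVLAAGMKVLA")   # made-up sequence
print(phi)                              # Counter({'MKVL': 2, 'KVLA': 2, 'VLAA': 1, ...})
```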
-
75:00 - 75:03How
-
75:03 - 75:05large is this? Well, it turns out
-
75:05 - 75:18this is going to be in R^(20^4), and so you have a 160,000-dimensional feature vector,
-
75:18 - 75:21which is reasonably large, even by modern computer standards.
-
75:21 - 75:24Clearly, we don't want to explicitly represent
-
75:24 - 75:26these high
-
75:26 - 75:30dimensional feature vectors. Imagine you have 1,000 examples and you store this as double
-
75:30 - 75:34precision floating-point numbers. Even on modern-day computers, this is big.
-
75:34 - 75:38It turns out that there's an
-
75:38 - 75:42efficient dynamic programming algorithm that can efficiently
-
75:42 - 75:45compute inner products between these feature vectors, and so we can apply this
-
75:45 - 75:52feature representation, even though it's a ridiculously high-dimensional feature vector, to classify protein sequences.
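For this contiguous-substring version of the feature map, the inner product can be computed directly from the two count dictionaries, without ever building the full vectors (a sketch under that assumption; the dynamic programming algorithm the lecture refers to is a more general approach, for instance for non-contiguous subsequences, and is not what is shown here):

```python
from collections import Counter

def four_mer_counts(seq):
    # Same helper as in the sketch above: counts of all contiguous length-4 substrings.
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))

def four_mer_kernel(seq_a, seq_b):
    """Inner product <phi(a), phi(b)> of the two count vectors, summing only over
    substrings that actually occur - the 160,000 coordinates are never materialized."""
    ca, cb = four_mer_counts(seq_a), four_mer_counts(seq_b)
    return sum(count * cb.get(kmer, 0) for kmer, count in ca.items())

print(four_mer_kernel("MKVLAAGMKVLA", "GMKVLAAG"))   # made-up sequences
```

The cost scales with the lengths of the two sequences, not with the size of the feature space, which is what makes this kernel practical.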
-
75:52 - 75:57I won't talk about the details of the dynamic programming algorithm. If any of you have seen
-
75:57 - 76:01the [inaudible] algorithm for finding subsequences, it's kind of
-
76:01 - 76:02reminiscent of that. You
-
76:02 - 76:05can look those up if you're interested.
-
76:05 - 76:08This is just another example of a cool kernel, and
-
76:08 - 76:13more generally, if you're faced with some new machine-learning problem, sometimes you can
-
76:13 - 76:16choose a standard kernel like a Gaussian kernel,
-
76:16 - 76:20and sometimes there are research papers written on how to
-
76:20 - 76:21come up with a new kernel
-
76:21 - 76:28for a new problem.
-
76:28 - 76:30Two last sentences I want
-
76:30 - 76:34to say. Where are we now? That wraps up SVMs, which many
-
76:34 - 76:38people would consider one of the most effective off-the-shelf learning algorithms,
-
76:38 - 76:42and so as of today, you've actually seen a lot of learning algorithms. I want to
-
76:42 - 76:46close this class by saying congrats. You're now well qualified to actually go and apply learning algorithms
-
76:46 - 76:48to a lot of problems.
-
76:48 - 76:53We're still in week four of the quarter, so there's more to come. In particular, what I
-
76:53 - 76:57want to do next is talk about how to really understand the learning algorithms and
-
76:57 - 76:59when they work well and when they work poorly
-
76:59 - 77:02and to take the tools which you now have and really talk about how you can use
-
77:02 - 77:04them really well. We'll start to do that in the next lecture. Thanks.