Lecture 20 | Machine Learning (Stanford)

  • 0:11 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development.
  • 0:24 - 0:28
    So welcome to the last lecture of this course.
  • 0:28 - 0:33
    What I want to do today is tell you about one final class of reinforcement
  • 0:33 - 0:35
    learning algorithms. I
  • 0:35 - 0:39
    just want to say a little bit about POMDPs, the partially observable MDPs, and then
  • 0:39 - 0:44
    the main technical topic for today will be policy search algorithms.
  • 0:44 - 0:49
    I'll talk about two specific algorithms, essentially one called reinforce and one called Pegasus,
  • 0:49 - 0:55
    and then we'll wrap up the class. So if you recall from the last lecture,
  • 0:55 - 1:00
    I actually started to talk about one specific example of a POMDP,
  • 1:00 - 1:07
    which was this sort of linear dynamical
  • 1:07 - 1:11
    system. This is sort of the LQR, linear quadratic regulation problem,
  • 1:11 - 1:24
    but I changed it and said what if we only have observations
  • 1:24 - 1:28
    YT. And what if we couldn't observe the state of the system directly,
  • 1:28 - 1:37
    but had to choose an action only based on some noisy observations that may be some function of the state?
  • 1:37 - 1:43
    So our strategy last time was that we said that in the fully observable case,
  • 1:43 - 1:55
    we could choose actions - AT equal to that matrix LT times ST. So LT was this
  • 1:55 - 1:59
    matrix of parameters that [inaudible] describe the dynamic programming
  • 1:59 - 2:02
    algorithm for finite horizon MDPs in the LQR problem.
  • 2:02 - 2:11
    And so we said if only we knew what the state was, we choose actions according to some matrix LT times the state.
  • 2:11 - 2:14
    And then I said in the partially observable case,
  • 2:14 - 2:20
    we would compute these estimates.
  • 2:20 - 2:27
    I wrote them as S of T given T,
  • 2:30 - 2:37
    which were our best estimate for what the state is given all the observations. And in particular,
  • 2:37 - 2:41
    I'm gonna talk about a Kalman filter
  • 2:41 - 2:49
    with which we worked out that our posterior distribution of what the state is
  • 2:49 - 2:53
    given all the observations up to a certain time
  • 2:53 - 2:59
    that was this. So this is from last time. So that
  • 2:59 - 3:03
    given the observations Y one through YT, our posterior distribution
  • 3:03 - 3:06
    of the current state ST was Gaussian
  • 3:06 - 3:10
    with mean ST given T and covariance sigma T given T.
  • 3:10 - 3:14
    So I said we use a Kalman filter to compute this thing, this ST given T,
  • 3:14 - 3:20
    which is going to be our best guess for what the state is currently.
  • 3:20 - 3:31
    And then we choose actions using
  • 3:31 - 3:34
    our estimate for what the state is, rather than using the true state because we
  • 3:34 - 3:38
    don't know the true state anymore in this POMDP.
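
A minimal sketch of the strategy just described, assuming the standard linear-Gaussian model s_{t+1} = A s_t + B a_t + w_t and y_t = C s_t + v_t; the matrices A, B, C, the noise covariances Q and R, and the gains L_t are placeholders rather than values from the lecture:

    import numpy as np

    def kalman_filter_step(s_hat, Sigma, a_prev, y, A, B, C, Q, R):
        # Predict: push the previous estimate s_{t-1|t-1} through the assumed linear dynamics.
        s_pred = A @ s_hat + B @ a_prev
        Sigma_pred = A @ Sigma @ A.T + Q
        # Correct: fold in the new noisy observation y_t to get s_{t|t} and Sigma_{t|t}.
        K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)
        s_new = s_pred + K @ (y - C @ s_pred)
        Sigma_new = (np.eye(len(s_new)) - K @ C) @ Sigma_pred
        return s_new, Sigma_new

    def choose_action(L_t, s_hat):
        # Act as if the estimate were the true state: a_t = L_t s_{t|t}.
        return L_t @ s_hat
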
  • 3:38 - 3:40
    So
  • 3:40 - 3:47
    it turns out that this specific strategy actually allows you to choose optimal actions,
  • 3:47 - 3:52
    allows you to choose actions as well as you possibly can given that this is a
  • 3:52 - 3:55
    POMDP, and given there are these noisy observations.
  • 3:55 - 3:59
    It turns out that in general finding optimal policies with POMDPs -
  • 3:59 - 4:06
    finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.
  • 4:06 - 4:16
    Just to be concrete about the formalism of the POMDP - I should just write it here -
  • 4:16 - 4:32
    a POMDP formally is a tuple like that
  • 4:32 - 4:43
    where the changes are that Y is the set of possible observations,
  • 4:45 - 4:54
    and this O subscript S are the observation distributions.
  • 4:58 - 5:11
    And so at each step, we observe
  • 5:18 - 5:20
    - at each step in the POMDP, if we're
  • 5:20 - 5:24
    in some state ST, we observe some observation YT
  • 5:24 - 5:28
    drawn from the observation distribution O subscript ST, so it's
  • 5:28 - 5:30
    indexed by what the current state is.
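
The tuple itself was written on the board rather than spoken; a plausible reconstruction, keeping the usual MDP elements and adding the two changes named above, is

    \[
    \text{POMDP} = (S, A, Y, \{P_{sa}\}, \{O_s\}, T, R), \qquad y_t \sim O_{s_t}.
    \]
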
  • 5:30 - 5:31
    And
  • 5:31 - 5:36
    it turns out that computing the optimal policy in a POMDP is an NP-hard problem.
  • 5:36 - 5:41
    For the specific case of linear dynamical systems with the Kalman filter
  • 5:41 - 5:46
    model, we have this strategy of computing the optimal policy assuming full observability
  • 5:46 - 5:49
    and then estimating the states from the observations,
  • 5:49 - 5:51
    and then plugging the two together.
  • 5:51 - 5:56
    That turns out to be optimal essentially for only that special case of a POMDP.
  • 5:56 - 5:58
    In the more general case,
  • 5:58 - 6:03
    that strategy of designing a controller assuming full observability
  • 6:03 - 6:06
    and then just estimating the state and plugging the two together,
  • 6:06 - 6:08
    for general POMDPs
  • 6:08 - 6:15
    that same strategy is often a very reasonable strategy but is not always guaranteed to be optimal.
  • 6:15 - 6:21
    Solving these problems in general is NP-hard.
  • 6:21 - 6:22
    So what I want to do today
  • 6:22 - 6:26
    is actually
  • 6:26 - 6:29
    talk about a different class of reinforcement learning algorithms. These are called
  • 6:29 - 6:38
    policy search algorithms. In particular, policy search algorithms
  • 6:38 - 6:44
    can be applied equally well to MDPs, to fully observed Markov decision processes,
  • 6:44 - 6:46
    or to these POMDPs,
  • 6:46 - 6:50
    or to these partially observable MDPs.
  • 6:50 - 6:58
    What I want to do now, I'll actually just describe policy search algorithms applied to MDPs, applied to the fully observable case.
  • 6:58 - 7:02
    And at the end, I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs. In
  • 7:02 - 7:05
    the latter case, when
  • 7:05 - 7:10
    you apply a policy search algorithm to a POMDP, it's
  • 7:10 - 7:15
    going to be hard to guarantee that you get the globally optimal policy because
  • 7:15 - 7:21
    solving POMDPs in general is NP-hard, but nonetheless policy search algorithms - it turns out to be
  • 7:21 - 7:23
    I think one of the most
  • 7:23 - 7:31
    effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.
  • 7:31 - 7:34
    So
  • 7:34 - 7:37
    here's what we're going to do.
  • 7:37 - 7:41
    In policy search,
  • 7:41 - 7:48
    we're going to define some set
  • 7:48 - 7:55
    which I denote capital pi of policies,
  • 7:55 - 8:09
    and our strategy is to search for a good policy, lowercase pi, in the set capital Pi.
  • 8:11 - 8:15
    Just by analogy, I want to say
  • 8:15 - 8:19
    - in the same way, back when we were talking about supervised learning,
  • 8:19 - 8:24
    the way we define the set capital Pi of policies and then search for a policy in this
  • 8:24 - 8:31
    set capital pi is analogous to supervised learning
  • 8:31 - 8:46
    where we defined a set script H of hypotheses and search -
  • 8:50 - 8:58
    and would search for a good hypothesis in this hypothesis class script H.
  • 8:58 - 9:03
    Policy search is sometimes also called direct policy search.
  • 9:03 - 9:07
    To contrast this with the sort of algorithms we've been talking about so
  • 9:07 - 9:11
    far, in all the algorithms we've been talking about so far, we would try to find V star. We would try to
  • 9:11 - 9:14
    find the optimal value function.
  • 9:14 - 9:19
    And then we'd use V star - we'd use the optimal value function to then try to compute or
  • 9:19 - 9:20
    try to approximate pi star.
  • 9:20 - 9:24
    So all the approaches we talked about previously are
  • 9:24 - 9:30
    strategies for finding a good policy: once we compute the value function, we go from that to the policy.
  • 9:30 - 9:35
    In contrast, in policy search algorithms and something that's called direct policy search algorithms,
  • 9:35 - 9:40
    the idea is that we're going to quote "directly" try to approximate a good policy
  • 9:40 - 9:48
    without going through the intermediate stage of trying to find the value function.
  • 9:48 - 9:50
    Let's see.
  • 9:50 - 9:51
    And also
  • 9:51 - 9:58
    as I develop policy search - just one step that's sometimes slightly confusing.
  • 9:58 - 10:03
    Making an analogy to supervised learning again, when we talked about logistic regression,
  • 10:03 - 10:09
    I said we have input features X and some labels Y, and I sort of said
  • 10:09 - 10:13
    let's approximate Y using the logistic function of the inputs X. And
  • 10:13 - 10:18
    at least initially, the logistic function was sort of pulled out of the air.
  • 10:18 - 10:23
    In the same way, as I define policy search algorithms, there'll sort of be a step where I say,
  • 10:23 - 10:29
    "Well, let's try to compute the actions. Let's try to approximate what a good action is
  • 10:29 - 10:33
    using a logistic function of the state." So again, I'll sort of pull a
  • 10:33 - 10:37
    function out of the air. I'll say, "Let's just choose a function, and
  • 10:37 - 10:41
    that'll be our choice of the policy class," and I'll say,
  • 10:41 - 10:45
    "Let's take as input the state, and then we'll map it through a logistic function, and then hopefully, we'll approximate
  • 10:45 - 10:47
    what is a good function -
  • 10:47 - 10:48
    excuse me,
  • 10:48 - 10:53
    we'll approximate what is a good action using a logistic function of the state."
  • 10:53 - 10:54
    So there's that sort of -
  • 10:54 - 10:59
    the choice of the policy class that's again a little bit arbitrary,
  • 10:59 - 11:06
    but it's arbitrary as it was when we were talking about supervised learning.
  • 11:06 - 11:17
    So to develop our first policy search algorithm, I'm actually gonna need a new definition.
  • 11:56 - 12:03
    So
  • 12:07 - 12:13
    our first policy search algorithm, we'll actually need to work with stochastic policies.
  • 12:13 - 12:18
    What I mean by stochastic policy is there's going to be a function that maps from
  • 12:18 - 12:23
    the space of states cross actions to the real numbers,
  • 12:23 - 12:25
    where pi of S comma A will be
  • 12:25 - 12:31
    interpreted as the probability of taking this action A in some state S.
  • 12:31 - 12:46
    And so we have that the sum over A of pi of S comma A equals one. In other words, for every state
  • 12:46 - 12:54
    a stochastic policy specifies a probability distribution over the actions.
  • 12:55 - 12:59
    So concretely,
  • 12:59 - 13:07
    suppose you are executing some policy pi. Say I have some stochastic policy pi. I wanna execute the policy pi.
  • 13:07 - 13:12
    What that means is that - in this example let's say I have three actions.
  • 13:12 - 13:16
    What that means is that suppose I'm in some state S.
  • 13:16 - 13:25
    I would then compute pi of S comma A1, pi of S comma A2,
  • 13:25 - 13:29
    pi of S comma A3, if I have a three action MDP.
  • 13:29 - 13:34
    These will be three numbers that sum up to one,
  • 13:34 - 13:37
    and then my chance of taking action A1 will be
  • 13:37 - 13:41
    equal to this. My chance of taking action A2 will be equal to pi of S comma A2.
  • 13:41 - 13:44
    My chance of taking action A3 will be equal
  • 13:44 - 13:50
    to this number. So that's what it means to execute a stochastic policy.
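
As a minimal sketch of that sampling step, assuming pi(s, a) is available as a function and the actions are collected in a list (the names here are placeholders):

    import numpy as np

    def execute_stochastic_policy(pi, s, actions, rng=None):
        rng = rng or np.random.default_rng()
        probs = [pi(s, a) for a in actions]   # pi(s, a1), pi(s, a2), pi(s, a3); these sum to one
        return actions[rng.choice(len(actions), p=probs)]
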
  • 13:50 - 13:51
    So
  • 13:51 - 13:54
    as a concrete example, just let me make this
  • 14:03 - 14:07
    let me just go ahead and give one specific example of what a stochastic
  • 14:07 - 14:10
    policy may look like.
  • 14:10 - 14:14
    For this example, I'm gonna use the inverted pendulum
  • 14:14 - 14:20
    as my motivating example. It's that problem of balancing a pole. We have an
  • 14:20 - 14:21
    inverted
  • 14:21 - 14:26
    pendulum that swings freely, and you want to move the cart left and
  • 14:26 - 14:29
    right to keep the pole vertical.
  • 14:29 - 14:35
    Let's say my actions -
  • 14:35 - 14:44
    for today's example, I'm gonna use phi to denote the angle of the pole.
  • 14:44 - 14:50
    I have two actions where A1 is to accelerate left
  • 14:50 - 15:03
    and A2 is to accelerate right. Actually, let me just write that the other way around. A1 is to accelerate right. A2 is to accelerate left.
  • 15:03 - 15:05
    So
  • 15:05 - 15:06
    let's see.
  • 15:06 - 15:10
    Choose a reward function that penalizes the pole falling over, or whatever.
  • 15:10 - 15:14
    And now let's come up with a stochastic policy for this problem.
  • 15:15 - 15:18
    To come up with a class of stochastic policies
  • 15:18 - 15:20
  • 15:20 - 15:24
    really means coming up with some class of functions to approximate what action you
  • 15:24 - 15:27
    want to take as a function of the state.
  • 15:27 - 15:30
    So here's my somewhat arbitrary choice. I'm gonna say that
  • 15:30 - 15:35
    the probability of action A1, so pi of S comma A1,
  • 15:35 - 15:40
    I'm gonna write as one over one plus E to the minus theta transpose S - okay?
  • 15:40 - 15:46
    And I just chose the logistic function because it's a convenient function we've used a lot. So I'm gonna say that
  • 15:46 - 15:50
    my policy is parameterized by a set of parameters theta,
  • 15:50 - 15:53
    and for any given set of parameters theta,
  • 15:53 - 15:56
    that gives me a stochastic policy.
  • 15:56 - 15:59
    And if I'm executing that policy with parameters theta, that means
  • 15:59 - 16:06
    that the chance of my choosing to a set of [inaudible] is given by this number.
  • 16:18 - 16:26
    Because my chances of executing actions A1 or A2 must sum to one, this gives me pi of S A2. So just
  • 16:26 - 16:29
    [inaudible], this means that when I'm in some state S,
  • 16:29 - 16:31
    I'm going to compute this number,
  • 16:31 - 16:34
    compute one over one plus E to the minus theta transpose S.
  • 16:34 - 16:40
    And then with this probability, I will execute the accelerate right action,
  • 16:40 - 16:46
    and with one minus this probability, I'll execute the accelerate left action.
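
Here is a sketch of that logistic parameterization for the two-action pendulum problem; the state vector s is assumed to already include the extra intercept feature mentioned just below, and the string action labels are only placeholders:

    import numpy as np

    def pi_right(theta, s):
        # pi_theta(s, a1): probability of accelerating right = 1 / (1 + e^{-theta^T s}).
        return 1.0 / (1.0 + np.exp(-theta @ s))

    def sample_action(theta, s, rng=None):
        rng = rng or np.random.default_rng()
        # Accelerate right with probability pi_theta(s, a1), otherwise accelerate left.
        return "right" if rng.random() < pi_right(theta, s) else "left"
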
  • 16:46 - 16:51
    And again, just to give you a sense of why this might be a reasonable thing to do,
  • 16:52 - 16:59
    let's say my state vector is - this
  • 17:02 - 17:05
    is [inaudible] state, and I added an extra one as an intercept term, just to give my
  • 17:05 - 17:09
    logistic function an extra feature.
  • 17:09 - 17:20
    If I choose my parameters and my policy to be say this,
  • 17:20 - 17:23
    then that means that at any state,
  • 17:23 - 17:34
    the probability of my taking action A1 - the probability of my taking the accelerate
  • 17:34 - 17:41
    right action is this one over one plus E to the minus theta transpose S, which taking the inner product
  • 17:41 - 17:48
    of theta and S, this just gives you phi,
  • 17:48 - 17:51
    equals one over one plus E to the minus phi.
  • 17:51 - 17:55
    And so if I choose my parameters theta as follows,
  • 17:55 - 18:01
    what that means is that
  • 18:01 - 18:08
    just depending on the angle phi of my inverted pendulum,
  • 18:09 - 18:12
    the chance of my accelerating to the
  • 18:12 - 18:15
    right is just this function of
  • 18:15 - 18:18
    the angle of my inverted pendulum. And so this means for example
  • 18:18 - 18:22
    that if my inverted pendulum is leaning far over to the right,
  • 18:22 - 18:26
    then I'm very likely to accelerate to the right to try to catch it. I
  • 18:26 - 18:30
    hope the physics of this inverted pendulum thing make
  • 18:30 - 18:33
    sense. If my pole's leaning over to the right, then I wanna accelerate to the right to catch it. And
  • 18:33 - 18:37
    conversely if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try
  • 18:37 - 18:39
    to catch it.
  • 18:39 - 18:44
    So this is one example for one specific choice of parameters theta.
  • 18:44 - 18:49
    Obviously, this isn't a great policy because it ignores the rest of the features. Maybe if
  • 18:49 - 18:53
    the cart is further to the right, you want it to be less likely to accelerate to the
  • 18:53 - 18:56
    right, and you can capture that by changing
  • 18:56 - 18:58
    one of these coefficients to take into account the
  • 18:58 - 19:00
    actual position of the cart. And then depending on
  • 19:00 - 19:03
    the velocity of the cart and the angular velocity,
  • 19:03 - 19:08
    you might want to change theta to take into account these other effects as well. Maybe
  • 19:08 - 19:13
    if the pole's leaning far to the right, but is actually on its way to swinging back, as
  • 19:13 - 19:16
    specified by the angular velocity, then you might be
  • 19:16 - 19:19
    less worried about having to accelerate hard to the right. And so
  • 19:19 - 19:23
    these are the sorts of behavior you can get by varying the parameters theta.
  • 19:24 - 19:29
    And so our goal is to tune the parameters theta - our goal in policy search
  • 19:29 - 19:37
    is to tune the parameters theta so that when we execute the policy pi subscript theta,
  • 19:37 - 19:40
    the pole stays up as long as possible.
  • 19:40 - 19:48
    In other words, our goal is to maximize as a function of theta -
  • 19:48 - 20:05
    our goal is to maximize the expected value of the payoff
  • 20:05 - 20:10
    for when we execute the policy pi theta. We
  • 20:10 - 20:11
    want to choose parameters theta
  • 20:11 - 20:21
    to maximize that. Are there questions about the problem set up,
  • 20:21 - 20:26
    and policy search and policy classes or anything? Yeah. Student: In a case where we have
  • 20:26 - 20:33
    more than two actions, would we use a different theta for each of the distributions, or still have
  • 20:33 - 20:36
    the same parameters? Oh, yeah.
  • 20:36 - 20:39
    Right. So what if we have more than two actions. It turns out you can choose almost anything you want for the policy class, but
  • 20:39 - 20:44
    if you have, say, a fixed number of discrete actions,
  • 20:44 - 20:48
    I would sometimes use like a softmax parameterization.
  • 20:48 - 20:54
    Similar to softmax regression that we saw earlier in the class,
  • 20:54 - 20:56
    you may
  • 20:56 - 21:03
    say
  • 21:06 - 21:07
    that - [inaudible] out of space. You may have a set of parameters theta 1
  • 21:07 - 21:09
    through theta D if
  • 21:09 - 21:25
    you have D actions, and pi of S comma A I equals E to the theta I transpose S over the sum over J of E to the theta J transpose S -
  • 21:25 - 21:29
    so that would be an example of a softmax parameterization for multiple actions.
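
A sketch of that softmax parameterization, with one parameter vector theta_i per discrete action; the denominator (the sum over all actions) was only written on the board, so it is filled in here by analogy with softmax regression:

    import numpy as np

    def softmax_policy(thetas, s):
        # pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s).
        logits = np.array([theta_i @ s for theta_i in thetas])
        logits -= logits.max()        # subtract the max for numerical stability only
        probs = np.exp(logits)
        return probs / probs.sum()    # one probability per action, summing to one
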
  • 21:29 - 21:33
    It turns out that if you have continuous actions, you can actually make this be a
  • 21:33 - 21:34
    density
  • 21:34 - 21:36
    over the actions A
  • 21:36 - 21:38
    and parameterized by other things as well.
  • 21:38 - 21:43
    But the choice of policy class is somewhat up to you, in the same way that
  • 21:44 - 21:48
    the choice of whether we chose to use a linear function
  • 21:48 - 21:49
    or linear function with
  • 21:49 - 21:56
    quadratic features or whatever in supervised learning that was sort of up to us. Anything
  • 22:02 - 22:09
    else? Yeah. [Inaudible] stochastic?
  • 22:12 - 22:16
    Yes. So is it possible to [inaudible] a stochastic policy using numbers [inaudible]? I see. Given that MDP has stochastic transition
  • 22:16 - 22:18
    probabilities, is it possible to
  • 22:18 - 22:23
    use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities.
  • 22:23 - 22:24
    The answer is yes,
  • 22:24 - 22:26
    but
  • 22:26 - 22:29
    for the purposes of what I want to show later, that won't be useful.
  • 22:29 - 22:32
    But formally, it is possible.
  • 22:32 - 22:35
    If you already have a fixed - if you have a
  • 22:35 - 22:41
    fixed policy, then you'd be able to do that. Anything else?
  • 22:41 - 22:44
    Yeah. No, I guess even a [inaudible] class of policy can do that,
  • 22:44 - 22:50
    but for the derivation later, I actually need to keep it separate.
  • 22:50 - 22:55
    Actually, could you just - I know the concept of policy search is sometimes a little confusing. Could you just raise your hand
  • 22:55 - 23:02
    if this makes sense?
  • 23:02 - 23:05
    Okay. Thanks. So let's talk about
  • 23:05 - 23:09
    an algorithm. What I'm gonna talk about - the first algorithm I'm going to present
  • 23:10 - 23:14
    is sometimes called the reinforce algorithm. What I'm going to present it turns out
  • 23:14 - 23:19
    isn't exactly the reinforce algorithm as it was originally presented by the
  • 23:19 - 23:31
    author Ron Williams, but it sort of captures its essence. Here's the idea.
  • 23:31 - 23:37
    In the sequel - in what I'm about to do, I'm going to assume that S0 is
  • 23:37 - 23:40
    some fixed initial state.
  • 23:42 - 23:45
    Or
  • 23:45 - 23:49
    it turns out if S0 is drawn from some fixed initial state
  • 23:49 - 23:52
    distribution then everything else [inaudible], but let's just say S0 is some
  • 23:52 - 23:54
    fixed initial state.
  • 23:54 - 24:05
    So my goal is to maximize this expected sum
  • 24:06 - 24:10
    [inaudible]. Given the policy and whatever else, drop that.
  • 24:10 - 24:15
    So the random variables in this expectation are the sequence of states and actions:
  • 24:15 - 24:20
    S0, A0, S1, A1, and so on, up to ST, AT are the random variables.
  • 24:20 - 24:26
    So let me write out this expectation explicitly as a sum
  • 24:26 - 24:33
    over all possible state and action sequences of that
  • 24:46 - 24:53
    - so that's what an expectation is. It's the probability of the random variables times that. Let
  • 24:53 - 25:00
    me just expand out this probability.
  • 25:01 - 25:06
    So the probability of seeing this exact sequence of states and actions is
  • 25:06 - 25:11
    the probability of the MDP starting in that state.
  • 25:11 - 25:14
    If this is a deterministic initial state, then all the probability
  • 25:14 - 25:18
    mass would be on one state. Otherwise, there's some distribution over initial states.
  • 25:18 - 25:21
    Then times the probability
  • 25:21 - 25:24
    that
  • 25:24 - 25:29
    you chose action A0 from that state S zero,
  • 25:29 - 25:34
    and then times the probability that
  • 25:34 - 25:38
    the MDP's transition probabilities happen to transition you to state
  • 25:38 - 25:45
    S1 when you chose action A0 in state S0,
  • 25:45 - 25:53
    times the probability that you chose that and so on.
  • 25:53 - 26:06
    The last term here is that, and then times that.
  • 26:12 - 26:14
    So what I did was just
  • 26:14 - 26:19
    take this probability of seeing this sequence of states and actions, and then just
  • 26:19 - 26:23
    [inaudible] explicitly or expanded explicitly like this.
  • 26:23 - 26:24
    It turns out later on
  • 26:24 - 26:30
    I'm going to need to write this sum of rewards a lot, so I'm just gonna call this the payoff
  • 26:32 - 26:36
    from now on. So whenever later in this lecture I write the word payoff, I
  • 26:36 - 26:40
    just mean this sum.
  • 26:40 - 26:47
    So
  • 26:47 - 26:52
    our goal is to maximize the expected payoff, so our goal is to maximize this sum.
  • 26:52 - 26:56
    Let me actually just skip ahead. I'm going to write down what the final answer is, and
  • 26:56 - 26:59
    then I'll come back and justify the
  • 26:59 - 27:06
    algorithm. So here's the algorithm. This is
  • 27:25 - 27:32
  • 28:13 - 28:16
    how we're going to
  • 28:16 - 28:19
    update the parameters of the algorithm. We're going to
  • 28:19 - 28:23
    sample a state action sequence. The way you do this is you just take your
  • 28:23 - 28:25
    current stochastic policy,
  • 28:25 - 28:29
    and you execute it in the MDP. So just go ahead and start from some initial state, take
  • 28:29 - 28:33
    a stochastic action according to your current stochastic policy,
  • 28:33 - 28:35
    see where the state transition probabilities take you, and so you just
  • 28:35 - 28:39
    do that for T time steps, and that's how you sample the state sequence. Then
  • 28:39 - 28:45
    you compute the payoff, and then you perform this update.
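
The update itself was written on the board; below is a sketch of what is essentially that update, consistent with the derivation that follows. The learning rate alpha is an assumption, and grad_log_pi is shown for the two-action logistic policy used earlier:

    import numpy as np

    def grad_log_pi(theta, s, a):
        # Gradient of log pi_theta(s, a) for the logistic policy; a = 1 means "accelerate right".
        p_right = 1.0 / (1.0 + np.exp(-theta @ s))
        return (a - p_right) * s

    def reinforce_update(theta, trajectory, payoff, alpha=0.01):
        # trajectory: the sampled (state, action) pairs from executing pi_theta for T steps.
        # payoff: the sum of rewards R(s0) + ... + R(sT) along that trajectory.
        # On expectation, this step points in the direction of grad_theta E[payoff].
        step = sum(grad_log_pi(theta, s, a) for s, a in trajectory)
        return theta + alpha * payoff * step
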
  • 28:45 - 28:49
    So let's go back and figure out what this algorithm is doing.
  • 28:49 - 28:53
    Notice that this algorithm performs stochastic updates because on
  • 28:53 - 28:57
    every step it updates theta according to this thing on the right hand side.
  • 28:57 - 29:00
    This thing on the right hand side depends very much on your payoff and on
  • 29:00 - 29:02
    the state action sequence you saw. Your
  • 29:02 - 29:05
    state action sequence is random,
  • 29:05 - 29:07
    so what I want to do is figure out
  • 29:07 - 29:12
    - so on every step, I'll sort of take a step that's chosen randomly
  • 29:13 - 29:16
    because it depends on this random state action sequence.
  • 29:16 - 29:22
    So what I want to do is figure out on average how does it change the parameters theta.
  • 29:22 - 29:23
    In particular,
  • 29:23 - 29:44
    I want to know what is the expected value of the change to the parameters.
  • 29:58 - 30:06
    So I want to know what is the expected value of this change to my parameters theta. Our
  • 30:06 - 30:11
    goal is to maximize the sum [inaudible] - our goal is to maximize the value of the payoff.
  • 30:11 - 30:14
    So long as
  • 30:14 - 30:18
    the updates on expectation are on average taking us uphill
  • 30:18 - 30:20
    on the expected payoff,
  • 30:20 - 30:22
    then we're happy.
  • 30:22 - 30:25
    It turns out that
  • 30:25 - 30:32
    this algorithm is a form of stochastic gradient ascent in which -
  • 30:32 - 30:37
    remember when I talked about stochastic gradient descent for least squares regression, I
  • 30:37 - 30:43
    said that you have some parameters - maybe you're trying to minimize a quadratic function.
  • 30:43 - 30:49
    Then you may have parameters that will wander around randomly
  • 30:49 - 30:54
    until they get close to the optimum of the [inaudible] quadratic surface.
  • 30:54 - 30:56
    It turns out that
  • 30:56 - 31:00
    the reinforce algorithm will be very much like that. It will be a
  • 31:00 - 31:03
    stochastic gradient ascent algorithm in which on every step -
  • 31:03 - 31:07
    the step we take is a little bit random. It's determined by the random state action sequence,
  • 31:07 - 31:12
    but on expectation this turns out to be essentially a gradient ascent algorithm.
  • 31:12 - 31:16
    And so we'll do something like this. It'll wander around randomly, but on average take you towards the
  • 31:16 - 31:25
    optimum. So let me go ahead and prove that now.
  • 31:29 - 31:36
    Let's see.
  • 31:38 - 31:46
    What I'm going to do is I'm going to derive a gradient ascent update rule
  • 31:46 - 31:50
    for maximizing the expected payoff. Then I'll hopefully show that
  • 31:50 - 31:59
    by deriving a gradient ascent update rule, I'll end up with this thing on expectation.
  • 31:59 - 32:02
    So before I do the derivation, let me just remind you of
  • 32:02 - 32:06
    the chain rule - the product rule for differentiation
  • 32:06 - 32:09
    in which
  • 32:09 - 32:14
    if I have a product of functions,
  • 32:14 - 32:17
    then
  • 32:33 - 32:38
    the derivative of the product is given by
  • 32:38 - 32:41
    taking the derivatives of these things one at a time. So first I differentiate
  • 32:41 - 32:45
    with respect to F, getting F prime and leaving the other two fixed. Then I differentiate with respect to G,
  • 32:45 - 32:47
    leaving the other two fixed.
  • 32:47 - 32:51
    Then I differentiate with respect to H, so I get H prime leaving the other two fixed. So that's the
  • 32:51 - 32:54
    product rule
  • 32:54 - 32:56
    for
  • 32:56 - 33:01
    derivatives.
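
In symbols, for three factors as used in the derivation below:

    \[
    \frac{d}{d\theta}\big[f(\theta)\,g(\theta)\,h(\theta)\big]
    = f'(\theta)\,g(\theta)\,h(\theta) + f(\theta)\,g'(\theta)\,h(\theta) + f(\theta)\,g(\theta)\,h'(\theta).
    \]
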
  • 33:01 - 33:05
    If you refer back to this equation where
  • 33:05 - 33:07
    earlier we wrote out that
  • 33:07 - 33:11
    the expected payoff is given by this equation, this
  • 33:11 - 33:14
    sum over all the states of the probability
  • 33:14 - 33:16
    times the payoff.
  • 33:16 - 33:20
    So what I'm going to do is take the derivative of this expression
  • 33:20 - 33:23
    with respect to the parameters theta
  • 33:23 - 33:26
    because I want to do gradient ascent on this function. So I'm going to take the
  • 33:26 - 33:31
    derivative of this function with respect to theta, and then try to go uphill on this function.
  • 33:31 - 33:36
    So using the product rule, when I take the derivative of this function with respect to theta
  • 33:36 - 33:39
    what I get is - we'll end up with a sum of terms right there. There are a lot
  • 33:39 - 33:42
    of terms here that depend on theta,
  • 33:42 - 33:47
    and so what I'll end up with is I'll end up having a sum - having one term
  • 33:47 - 33:51
    that corresponds to the derivative of this keeping everything else fixed,
  • 33:51 - 33:54
    to have one term from the derivative of this keeping everything else fixed, and I'll
  • 33:54 - 33:55
    have one term
  • 33:55 - 33:56
    from the derivative of
  • 33:56 - 34:02
    that last thing keeping everything else fixed. So just apply the product rule to this. Let's
  • 34:02 - 34:09
    write that down. So I have that - the derivative
  • 34:11 - 34:16
    with respect to theta of the expected value of the payoff is -
  • 34:16 - 34:22
    it turns out I can actually do this entire derivation in exactly four steps,
  • 34:22 - 34:27
    but each of the steps requires a huge amount of writing, so
  • 34:27 - 34:38
    I'll just start writing and see how that goes, but this is a four step derivation.
  • 34:38 - 34:45
    So there's the sum over the state action sequences as we saw before. Close the
  • 35:05 - 35:12
    bracket,
  • 36:00 - 36:03
    and
  • 36:03 - 36:09
    then times the payoff.
  • 36:09 - 36:13
    So that huge amount of writing, that was just taking my previous formula and
  • 36:13 - 36:14
  • 36:14 - 36:17
    differentiating these terms that depend on theta one at a time. This was the term with
  • 36:17 - 36:18
    the derivative
  • 36:18 - 36:19
    of the first pi of
  • 36:19 - 36:22
    theta S0 A0. So there's the first derivative
  • 36:22 - 36:26
    term. There's the second one. Then you have plus dot, dot, dot, like
  • 36:26 - 36:28
    in terms of [inaudible].
  • 36:28 - 36:31
    That's my last term.
  • 36:31 - 36:38
    So that was step one of four.
  • 36:51 - 37:07
    And so by algebra - let me just write this down
  • 38:04 - 38:07
    and convince us all that it's true.
  • 38:07 - 38:09
    This is the second of four steps
  • 38:09 - 38:10
    in which you
  • 38:10 - 38:14
    just convince yourself that if I expand out - take the sum and multiply it by
  • 38:14 - 38:15
    that big product in front,
  • 38:15 - 38:19
    then I get back that sum of terms I get. It's essentially -
  • 38:19 - 38:23
    for example, when I multiply out, this product on top of this ratio, of this
  • 38:23 - 38:25
    first fraction,
  • 38:25 - 38:29
    then pi subscript theta S0 A0, that would cancel out this pi subscript
  • 38:29 - 38:31
    theta S0 A0
  • 38:31 - 38:35
    and replace it with the derivative with respect to theta of pi theta S0 A0. So [inaudible] algebra was
  • 38:35 - 38:42
    the second step. But
  • 38:45 - 38:48
    that term on top is just
  • 38:48 - 38:54
    what I worked out previously
  • 38:54 - 38:59
    - was the joint probability of the state action sequence,
  • 38:59 - 39:32
    and now I have that times that times the payoff.
  • 39:32 - 40:07
    And so by the definition of expectation, this is just equal to that thing times the payoff.
  • 40:07 - 40:08
    So
  • 40:08 - 40:12
    this thing inside the expectation, this is exactly
  • 40:12 - 40:19
    the step that we were taking in the inner loop of our reinforce algorithm,
  • 40:19 - 40:21
    roughly the reinforce algorithm. This
  • 40:21 - 40:26
    proves that the expected value of our change to theta
  • 40:26 - 40:31
    is exactly in the direction of the gradient of our expected payoff. That's how
  • 40:31 - 40:33
    I started this whole derivation. I said
  • 40:33 - 40:36
    let's look at our expected payoff and take the derivative of that with respect to theta.
  • 40:37 - 40:41
    What we've proved is that on expectation,
  • 40:41 - 40:44
    the step direction I take in reinforce is exactly the gradient of
  • 40:44 - 40:46
    the thing I'm trying to
  • 40:46 - 40:49
    optimize. This shows that this algorithm is
  • 40:49 - 40:54
    a stochastic gradient ascent
  • 40:54 - 40:58
    algorithm. I wrote a lot. Why don't you take a minute to look at the equations and [inaudible]
  • 40:58 - 41:01
    check if everything makes sense. I'll erase a couple of boards and then check if you have
  • 41:01 - 41:08
    questions after that. Questions?
  • 41:43 - 41:50
    Could you
  • 41:54 - 41:56
    raise your hand if this makes sense?
  • 42:01 - 42:04
    Great.
  • 42:04 - 42:07
    Some of the comments - we talked about
  • 42:07 - 42:11
    those value function approximation approaches where you approximate
  • 42:11 - 42:14
    V star, then you go from V star
  • 42:14 - 42:18
    to pi star. And then there are also policy search approaches, where you try to approximate the policy directly.
  • 42:18 - 42:22
    So let's talk briefly about when either one may be preferable.
  • 42:22 - 42:24
    It turns out that
  • 42:24 - 42:27
    policy search algorithms are especially effective
  • 42:27 - 42:31
    when you can choose a simple policy class pi.
  • 42:32 - 42:36
    So the question really is for your problem
  • 42:36 - 42:40
    does there exist a simple function like a linear function or a logistic function
  • 42:40 - 42:45
    that maps from features of the state to the action that works pretty well.
  • 42:45 - 42:49
    So for the problem of the inverted pendulum - this is quite likely to be true.
  • 42:49 - 42:52
    Going through all the different choices of parameters, you can say
  • 42:52 - 42:55
    things like if the pole's leaning towards the right,
  • 42:55 - 42:58
    then accelerate towards the right to try to catch it. For the
  • 42:58 - 43:02
    inverted pendulum, this is probably true. For lots of what's called
  • 43:02 - 43:05
    low level control tasks, things like driving a car,
  • 43:05 - 43:09
    the low level reflexes of do you steer your car left to avoid another car, do
  • 43:09 - 43:13
    you steer your car left to follow the road, flying a helicopter,
  • 43:13 - 43:18
    again very short time scale types of decisions - I like to think of these as
  • 43:18 - 43:22
    decisions made by a trained operator, like a trained driver or a trained
  • 43:22 - 43:24
    pilot. It would almost
  • 43:24 - 43:27
    be a reflex, these sorts of very quick instinctive things where
  • 43:27 - 43:31
    you map very directly from the inputs to an action. These are
  • 43:31 - 43:32
    problems for which
  • 43:32 - 43:36
    you can probably choose a reasonable policy class like a logistic function or something,
  • 43:36 - 43:38
    and it will often work well.
  • 43:38 - 43:42
    In contrast, if you have problems that require
  • 43:42 - 43:46
    long multistep reasoning, so things like a game of chess where you have to
  • 43:46 - 43:47
    reason carefully about if
  • 43:47 - 43:50
    I do this, then they'll do that, then they'll do this, then they'll do that.
  • 43:50 - 43:56
    Those I think of as less instinctual, very high level decision making.
  • 43:56 - 44:00
    For problems like that,
  • 44:00 - 44:06
    I would sometimes use value function approximation approaches instead. Let
  • 44:06 - 44:09
    me say more about this later.
  • 44:09 - 44:14
    The last thing I want to do is actually tell you about -
  • 44:14 - 44:22
    I guess just as a side comment,
  • 44:22 - 44:25
    it turns out also that if you have a POMDP, if you have a partially observable
  • 44:25 - 44:28
    MDP - I don't want to say too much about this -
  • 44:28 - 44:37
    it turns out that if you only have an approximation
  • 44:37 - 44:42
    - let's call it S hat of the true
  • 44:42 - 44:47
    state, and so this could be S hat equals S of T given T from the Kalman filter -
  • 44:57 - 45:13
    then you can still use these sorts of policy search algorithms where you can say pi theta of S
  • 45:13 - 45:16
    hat comma A - There are various other ways you can use
  • 45:16 - 45:19
    policy search algorithms for POMDPs, but this is one of them
  • 45:19 - 45:21
    where if you only have estimates of the state, you can then
  • 45:21 - 45:26
    choose a policy class that only looks at your estimate of the state to choose the action.
  • 45:26 - 45:31
    By using the same way of estimating the states in both training and testing,
  • 45:31 - 45:33
    this will usually do some -
  • 45:33 - 45:37
    so these sorts of policy search algorithms can be applied
  • 45:37 - 45:47
    often reasonably effectively to
  • 45:47 - 45:50
    POMDPs as well. There's one more algorithm I wanna talk about, but some final
  • 45:50 - 45:55
    words on the reinforce algorithm. It turns out the reinforce algorithm
  • 45:55 - 46:00
    often works well but is often extremely slow. So it [inaudible]
  • 46:00 - 46:04
    works, but one thing to watch out for is that because you're taking these
  • 46:04 - 46:07
    gradient ascent steps that are very noisy, you're sampling a state action sequence, and then
  • 46:07 - 46:08
    you're sort of
  • 46:08 - 46:10
    taking a gradient ascent step in essentially a sort
  • 46:10 - 46:13
    of random direction that only on expectation is
  • 46:13 - 46:14
    correct. The
  • 46:14 - 46:18
    gradient ascent direction for reinforce can sometimes be a bit noisy,
  • 46:18 - 46:22
    and so it's not that uncommon to need like a million iterations of gradient ascent,
  • 46:22 - 46:24
    or ten million, or 100 million
  • 46:24 - 46:26
    iterations of gradient ascent
  • 46:26 - 46:29
    for reinforce [inaudible], so that's just something to watch out for.
  • 46:29 - 46:41
    One consequence of that is in the reinforce algorithm - I shouldn't
  • 46:41 - 46:45
    really call it reinforce. In what's essentially the reinforce algorithm, there's
  • 46:45 - 46:48
    this step where you need to sample a state action sequence.
  • 46:48 - 46:51
    So in
  • 46:51 - 46:54
    principle you could do this on your own robot. If there were a robot you were trying to
  • 46:54 - 46:55
    control, you can actually
  • 46:55 - 46:59
    physically initialize in some state, pick an action and so on, and go from there
  • 46:59 - 47:00
    to sample a state action sequence.
  • 47:00 - 47:02
    But
  • 47:02 - 47:05
    if you need to do this ten million times, you probably don't want to [inaudible]
  • 47:05 - 47:07
    your robot ten million times.
  • 47:07 - 47:13
    I personally have seen many more applications of reinforce in simulation.
  • 47:13 - 47:17
    You can easily run ten thousand simulations or ten million simulations of your robot in
  • 47:17 - 47:21
    simulation maybe, but you might not want to do that -
  • 47:21 - 47:25
    have your robot physically repeat some action ten million times.
  • 47:25 - 47:28
    So I personally have seen many more applications of reinforce
  • 47:28 - 47:30
    to learn using a simulator
  • 47:30 - 47:35
    than to actually do this on a physical device.
  • 47:35 - 47:38
    The last thing I wanted to do is tell you about one other algorithm,
  • 47:38 - 47:45
    one final policy search algorithm. [Inaudible] the laptop display please. It's
  • 47:52 - 47:56
    a policy search algorithm called Pegasus that's
  • 47:56 - 48:00
    actually what we used on our autonomous helicopter flight work
  • 48:00 - 48:05
    for many years. There are some other things we do now. So
  • 48:05 - 48:08
    here's the idea. There's a
  • 48:08 - 48:11
    reminder slide on RL formalism. There's nothing here that you don't know, but
  • 48:11 - 48:17
    I just want to pictorially describe the RL formalism because I'll use that later.
  • 48:17 - 48:21
    I'm gonna draw the reinforcement learning picture as follows. You
  • 48:21 - 48:22
    initialize the [inaudible]
  • 48:22 - 48:24
    system, say a
  • 48:24 - 48:26
    helicopter or whatever in some state S0,
  • 48:26 - 48:28
    you choose an action A0,
  • 48:28 - 48:32
    and then, say, the helicopter dynamics take you to some new state S1, you choose
  • 48:32 - 48:33
    some other action A1,
  • 48:33 - 48:35
    and so on.
  • 48:35 - 48:39
    And then you have some reward function, which you apply to the sequence of states you ended up in,
  • 48:39 - 48:41
    and that's your total payoff. So
  • 48:41 - 48:47
    this is just a picture I wanna use to summarize the RL problem.
  • 48:48 - 48:51
    Our goal is to maximize the expected payoff,
  • 48:51 - 48:53
    which is this, the expected sum of the rewards.
  • 48:53 - 48:57
    And our goal is to learn the policy, which I denote by a green box.
  • 48:57 - 49:02
    So our policy - and I'll switch back to deterministic policies for now.
  • 49:02 - 49:09
    So my deterministic policy will be some function mapping from the states to the actions.
  • 49:09 - 49:15
    As a concrete example, you imagine that in the policy search setting,
  • 49:15 - 49:17
    you may have a linear class of policies.
  • 49:17 - 49:23
    So you may imagine that the action A will be say a linear function of the states,
  • 49:23 - 49:28
    and your goal is to learn the parameters of the linear function.
  • 49:28 - 49:33
    So imagine trying to do linear regression on policies, except you're trying to optimize the reinforcement learning
  • 49:33 - 49:36
    objective. So just [inaudible] imagine that the action A
  • 49:36 - 49:40
    is theta transpose S, and you use policy search to come up with
  • 49:40 - 49:42
    good parameters theta so as
  • 49:42 - 49:49
    to maximize the expected payoff. That would be one setting in which this picture applies. There's the idea.
  • 49:49 - 49:50
    Quite often
  • 49:50 - 49:54
    we come up with a model or a simulator for the MDP, and as before
  • 49:54 - 49:57
    a model or a simulator is just a box
  • 49:57 - 49:59
    that takes as input
  • 49:59 - 50:00
    some state ST,
  • 50:00 - 50:03
    takes as input some action AT,
  • 50:03 - 50:07
    and then outputs some [inaudible] state ST plus one that you might transition to in the MDP.
  • 50:07 - 50:10
    This ST plus one will be a random state. It will be drawn from
  • 50:10 - 50:13
    the random state transition probabilities of the MDP.
  • 50:13 - 50:17
    This is important. Very important, ST plus one
  • 50:17 - 50:22
    will be a random function of ST and AT. In the simulator, this is [inaudible].
  • 50:22 - 50:26
    So for example, for autonomous helicopter flight, you [inaudible] build a simulator
  • 50:26 - 50:30
    using supervised learning, an algorithm like linear regression
  • 50:30 - 50:32
    [inaudible] to linear regression,
  • 50:32 - 50:35
    so we can get a nonlinear model of the dynamics of what ST
  • 50:35 - 50:41
    plus one is as a random function of ST and AT. Now
  • 50:41 - 50:45
    once you have a simulator,
  • 50:45 - 50:48
    given any fixed policy you can quite
  • 50:48 - 50:52
    straightforwardly evaluate any policy in a simulator.
  • 50:53 - 51:00
    Concretely, our goal is to find the policy pi mapping from states to actions, so the goal is to find the green box
  • 51:00 - 51:03
    like that, that works well.
  • 51:03 - 51:08
    So if you have any one fixed policy pi,
  • 51:08 - 51:13
    you can evaluate the policy pi just using the simulator
  • 51:13 - 51:16
    via the picture shown at the bottom of the slide.
  • 51:16 - 51:19
    So concretely, you can take your initial state S0,
  • 51:19 - 51:21
    feed it into the policy pi,
  • 51:21 - 51:25
    your policy pi will output some action A0, you plug it in
  • 51:25 - 51:28
    the simulator, the simulator outputs a random state S1, you feed
  • 51:28 - 51:32
    S1 into the policy and so on, and you get a sequence of states S0 through ST
  • 51:32 - 51:35
    that your helicopter flies through in simulation.
  • 51:35 - 51:38
    Then sum up the rewards,
  • 51:38 - 51:43
    and this gives you an estimate of the expected payoff of the policy.
  • 51:43 - 51:45
    This picture is just a fancy way of saying fly
  • 51:45 - 51:49
    your helicopter in simulation and see how well it does, and measure the sum of
  • 51:49 - 51:54
    rewards you get on average in the simulator. The
  • 51:54 - 51:57
    picture I've drawn here assumes that you run your policy through the
  • 51:57 - 51:59
    simulator just once. In general,
  • 51:59 - 52:01
    you would run the policy through the simulator
  • 52:01 - 52:05
    some number of times and then average to get an average over M
  • 52:05 - 52:11
    simulations to get a better estimate of the policy's expected payoff.
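
A sketch of that evaluation loop; the names simulator and reward, and the rollout count M, are placeholders, and the linear policy in the comment is the example from the slide:

    def estimate_payoff(policy, simulator, reward, s0, T, M=100):
        # Average the summed rewards over M rollouts of the policy in the simulator.
        total = 0.0
        for _ in range(M):
            s = s0
            payoff = reward(s)
            for _ in range(T):
                a = policy(s)          # e.g. a = theta^T s for a linear policy
                s = simulator(s, a)    # random next state from the learned model
                payoff += reward(s)
            total += payoff
        return total / M
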
  • 52:13 - 52:15
    Now that I have a way
  • 52:15 - 52:18
    - given any one fixed policy,
  • 52:18 - 52:23
    this gives me a way to evaluate the expected payoff of that policy.
  • 52:23 - 52:28
    So one reasonably obvious thing you might try to do is then
  • 52:28 - 52:30
    just search for a policy,
  • 52:30 - 52:34
    in other words search for parameters theta for your policy, that
  • 52:34 - 52:36
    gives you high estimated payoff.
  • 52:36 - 52:40
    Does that make sense? So my policy has some parameters theta, so
  • 52:40 - 52:41
    my policy is
  • 52:41 - 52:46
    my actions A are equal to theta transpose S, say, if it's a linear policy.
  • 52:46 - 52:50
    For any fixed value of the parameters theta,
  • 52:50 - 52:51
    I can evaluate -
  • 52:51 - 52:54
    I can get an estimate for how good the policy is using the simulator.
  • 52:54 - 52:59
    One thing I might try to do is search for parameters theta to try to
  • 52:59 - 53:03
    maximize my estimated payoff.
  • 53:03 - 53:06
    It turns out you can sort of do that,
  • 53:06 - 53:11
    but that idea as I've just stated is hard to get to work.
  • 53:11 - 53:12
    Here's the
  • 53:12 - 53:16
    reason. The simulator allows us to evaluate a
  • 53:16 - 53:18
    policy, so let's search for a policy of high
  • 53:18 - 53:21
    value. The difficulty is that the simulator is random,
  • 53:21 - 53:23
    and so every time we evaluate a policy,
  • 53:23 - 53:27
    we get back a very slightly different answer. So
  • 53:27 - 53:29
    in the
  • 53:29 - 53:33
    cartoon below, I want you to imagine that the horizontal axis is the space of policies.
  • 53:33 - 53:37
    In other words, as I vary the
  • 53:37 - 53:41
    parameters in my policy, I get different points on the horizontal axis here. As I
  • 53:41 - 53:44
    vary the parameters theta, I get different policies,
  • 53:44 - 53:46
    and so I'm moving along the X axis,
  • 53:46 - 53:50
    and my total payoff I'm gonna plot on the vertical axis,
  • 53:50 - 53:54
    and the red line in this cartoon is the expected payoff of the different
  • 53:54 - 53:56
    policies.
  • 53:56 - 54:02
    My goal is to find the policy with the highest expected payoff.
  • 54:02 - 54:06
    You could search for a policy with high expected payoff, but every time you evaluate a policy
  • 54:06 - 54:08
    - say I evaluate some policy,
  • 54:08 - 54:10
    then I might get a point that
  • 54:10 - 54:13
    just by chance looked a little bit better than it should have. If
  • 54:13 - 54:16
    I evaluate a second policy and just by chance it looked a little bit worse. I evaluate a third
  • 54:16 - 54:21
    policy, fourth,
  • 54:21 - 54:25
    sometimes you look here - sometimes I might actually evaluate exactly the same policy
  • 54:25 - 54:28
    twice and get back slightly different
  • 54:28 - 54:32
    answers just because my simulator is random, so when I apply the same policy
  • 54:32 - 54:39
    twice in simulation, I might get back slightly different answers.
  • 54:39 - 54:42
    So as I evaluate more and more policies, these are the pictures I get.
  • 54:42 - 54:46
    My goal is to try to optimize the red line. I hope you appreciate this is a
  • 54:46 - 54:51
    hard problem, especially when all [inaudible] optimization algorithm gets to see are these
  • 54:51 - 54:54
    black dots, and they don't have direct access to the red line.
  • 54:54 - 54:58
    So when my input space is some fairly high dimensional space, if I have
  • 54:58 - 55:02
    ten parameters, the horizontal axis would actually be a 10-D space,
  • 55:02 - 55:07
    all I get are these noisy estimates of what the red line is. This is a very hard
  • 55:07 - 55:10
    stochastic optimization problem.
  • 55:10 - 55:16
    So it turns out there's one way to make this optimization problem much easier. Here's the idea.
  • 55:16 - 55:21
    And the method is called Pegasus, which is an acronym for something.
  • 55:21 - 55:23
    I'll tell you later.
  • 55:23 - 55:26
    So the simulator repeatedly makes calls
  • 55:26 - 55:28
    to a random number generator
  • 55:28 - 55:33
    to generate random numbers RT, which are used to simulate the stochastic dynamics.
  • 55:33 - 55:37
    What I mean by that is that the simulator takes this input of state
  • 55:37 - 55:40
    and action, and it outputs the next state randomly,
  • 55:40 - 55:44
    and if you peer into the simulator, if you open the box of the simulator and ask
  • 55:44 - 55:50
    how is my simulator generating these random next states ST plus one,
  • 55:50 - 55:54
    pretty much the only way to do so - pretty much the only way
  • 55:54 - 55:57
    to write a simulator with random outputs is
  • 55:57 - 56:00
    we're gonna make calls to a random number generator,
  • 56:00 - 56:03
    and get random numbers, these RTs,
  • 56:03 - 56:08
    and then you have some function that takes as input S0, A0, and
  • 56:08 - 56:11
    the results of your random number generator,
  • 56:11 - 56:14
    and it computes some next state as a function of the inputs and of the random
  • 56:14 - 56:17
    number it got from the random number
  • 56:17 - 56:18
    generator. This is
  • 56:18 - 56:22
    pretty much the only way anyone implements any random code, any code
  • 56:22 - 56:26
    that generates random outputs. You make a call to a random number generator,
  • 56:26 - 56:31
    and you compute some function of the random number and of your other
  • 56:31 - 56:35
    inputs. The reason that when you evaluate different policies you get
  • 56:35 - 56:36
    different answers
  • 56:36 - 56:38
    is because
  • 56:38 - 56:41
    every time you rerun the simulator, you get a different sequence of random
  • 56:41 - 56:44
    numbers from the random number generator,
  • 56:44 - 56:46
    and so you get a different answer every time,
  • 56:46 - 56:48
    even if you evaluate the same policy twice.
  • 56:48 - 56:51
    So here's the idea.
  • 56:51 - 56:56
    Let's say we fix in advance the sequence of random numbers,
  • 56:56 - 57:00
    choose R1, R2, up to RT minus one.
  • 57:00 - 57:03
    Fix the sequence of random numbers in advance,
  • 57:03 - 57:07
    and we'll always use the same sequence of random numbers
  • 57:07 - 57:10
    to evaluate different policies. Go
  • 57:10 - 57:15
    into your code and fix R1, R2, through RT minus one.
  • 57:15 - 57:18
    Choose them randomly once and then fix them forever.
  • 57:18 - 57:21
    If you always use the same sequence of random numbers, then
  • 57:21 - 57:25
    the system is no longer random, and if you evaluate the same policy twice,
  • 57:25 - 57:29
    you get back exactly the same answer.
  • 57:29 - 57:33
    And so these sequences of random numbers, [inaudible] call them scenarios, and
  • 57:33 - 57:37
    Pegasus is an acronym for policy evaluation-of-goodness and search using scenarios.
  • 57:37 - 57:39
    So
  • 57:42 - 57:46
    when you do that, this is the picture you get.
  • 57:46 - 57:50
    As before, the red line is your expected payoff, and by fixing the random numbers,
  • 57:50 - 57:53
    you've defined some estimate of the expected payoff.
  • 57:53 - 57:57
    And as you evaluate different policies, they're
  • 57:57 - 58:01
    still only approximations to their true expected payoff, but at least now you have
  • 58:01 - 58:01
    a
  • 58:01 - 58:03
    deterministic function to optimize,
  • 58:03 - 58:07
    and you can now apply your favorite algorithms, be it gradient ascent or
  • 58:08 - 58:10
    some sort
  • 58:10 - 58:12
    of local [inaudible] search
  • 58:12 - 58:14
    to try to optimize the black
  • 58:14 - 58:16
    curve. This gives you a much easier
  • 58:16 - 58:18
    optimization problem, and you
  • 58:18 - 58:23
    can optimize this to find hopefully a good policy. So this is
  • 58:23 - 58:27
    the Pegasus policy search method.
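
A sketch of the Pegasus evaluation just described; simulator_det is assumed to be the simulator with its randomness exposed as an explicit argument, and the scenario count and noise dimension are placeholders:

    import numpy as np

    def make_scenarios(M, T, noise_dim, seed=0):
        # Draw the random numbers once, up front, and reuse them for every policy evaluated.
        rng = np.random.default_rng(seed)
        return rng.standard_normal((M, T, noise_dim))

    def pegasus_payoff(theta, policy, simulator_det, reward, s0, scenarios):
        # Deterministic estimate of the expected payoff: average rollouts over the fixed scenarios.
        total = 0.0
        for scenario in scenarios:
            s = s0
            payoff = reward(s)
            for r_t in scenario:
                a = policy(theta, s)
                s = simulator_det(s, a, r_t)   # same r_t every time, so the same theta gives the same answer
                payoff += reward(s)
            total += payoff
        return total / len(scenarios)
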
  • 58:27 - 58:27
    So when
  • 58:27 - 58:31
    I started to talk about reinforcement learning, I showed
  • 58:31 - 58:35
    that video of a helicopter flying upside down. That was actually done using
  • 58:35 - 58:40
    exactly this method, using exactly this policy search algorithm. This
  • 58:40 - 58:42
    seems to scale well
  • 58:42 - 58:47
    even to fairly large problems, even to fairly high dimensional state spaces.
  • 58:47 - 58:50
    Typically, with Pegasus policy search algorithms -
  • 58:50 - 58:53
    the optimization problem
  • 58:53 - 58:56
    is still - is much easier than the stochastic version, but sometimes it's
  • 58:56 - 58:58
    not entirely trivial, and so
  • 58:58 - 59:00
    you have to apply this sort of method with
  • 59:00 - 59:05
    maybe on the order of ten parameters or tens of parameters, so 30, 40
  • 59:05 - 59:06
    parameters, but
  • 59:06 - 59:08
    not thousands of parameters,
  • 59:08 - 59:13
    at least in these sorts of applications.
  • 59:13 - 59:19
    Student: So is that method different than just assuming that
  • 59:19 - 59:27
    you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that
  • 59:27 - 59:29
    we have a deterministic simulator?
  • 59:29 - 59:31
    The answer is no. In the way you do this,
  • 59:31 - 59:35
    for the sake of simplicity I talked about one sequence of random numbers.
  • 59:35 - 59:37
    What you do is -
  • 59:37 - 59:42
    so imagine that the random numbers are simulating different wind gusts against your helicopter.
  • 59:42 - 59:46
    So what you want to do isn't really evaluate just against one pattern of wind gusts.
  • 59:46 - 59:48
    What you want to do is
  • 59:48 - 59:50
    sample some set of
  • 59:50 - 59:54
    different patterns of wind gusts, and evaluate against all of them and average.
  • 59:54 - 59:56
    So what you do is you actually
  • 59:56 - 59:58
    sample say 100 -
  • 59:58 - 60:01
    some number I made up like 100 sequences of random numbers,
  • 60:01 - 60:03
    and every time you want to evaluate a policy,
  • 60:03 - 60:09
    you evaluate it against all 100 sequences of random numbers and then average.
  • 60:09 - 60:13
    This is in exactly the same way that on this earlier picture
  • 60:13 - 60:17
    you wouldn't necessarily evaluate the policy just once. You evaluate it maybe 100
  • 60:17 - 60:18
    times in simulation, and then
  • 60:18 - 60:22
    average to get a better estimate of the expected reward.
  • 60:22 - 60:23
    In the same way,
  • 60:23 - 60:24
    you do that here
  • 60:24 - 60:29
    but with 100 fixed sequences of random numbers. Does that make sense? Any
  • 60:29 - 60:36
    other questions? Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]?
  • 60:48 - 60:52
    Yeah. I guess you're right. So the quality - for a fixed policy,
  • 60:52 - 60:57
    the quality of the approximation is equally good for both cases. The
  • 60:57 - 61:01
    advantage of fixing the random numbers is that you end up with
  • 61:01 - 61:04
    an optimization problem that's much easier.
  • 61:05 - 61:06
    I have some search problem,
  • 61:06 - 61:10
    and on the horizontal axis there's a space of control policies,
  • 61:10 - 61:13
    and my goal is to find a control policy that
  • 61:13 - 61:15
    maximizes the payoff.
  • 61:15 - 61:19
    The problem with this earlier setting was that
  • 61:19 - 61:22
    when I evaluate policies I get these noisy estimates,
  • 61:22 - 61:23
    and then it's
  • 61:23 - 61:27
    just very hard to optimize the red curve if I have these points that are all over the
  • 61:27 - 61:29
    place. And if I evaluate the same policy twice, I don't even get back the same
  • 61:31 - 61:33
    answer. By fixing the random numbers,
  • 61:33 - 61:37
    the algorithm still doesn't get to see the red curve, but at least it's
  • 61:37 - 61:41
    now optimizing a deterministic function. That makes the
  • 61:41 - 61:42
    optimization problem much easier. Does
  • 61:42 - 61:49
that make sense? Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves
  • 61:51 - 61:58
    that are easy to optimize. And then you smush them together?
  • 61:58 - 61:59
    Let's see.
  • 61:59 - 62:05
I have just one nice black curve that I'm trying to optimize. Student: For each scenario?
  • 62:05 - 62:09
    I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios.
  • 62:09 - 62:12
    So the black curve here is defined by averaging over
  • 62:12 - 62:16
    a large set of scenarios. Does that make sense?
  • 62:16 - 62:20
    So instead of only one - if the averaging thing doesn't make sense, imagine that there's
  • 62:20 - 62:24
    just one sequence of random numbers. That might be easier to think
  • 62:24 - 62:27
    about. Fix one sequence of random numbers, and every time I evaluate another policy,
  • 62:27 - 62:31
    I evaluate against the same sequence of random numbers, and that gives me
  • 62:31 - 62:36
    a nice deterministic function to optimize.
  • 62:36 - 62:39
    Any other questions?
  • 62:39 - 62:41
    The advantage is really
  • 62:41 - 62:45
    that - one way to think about it is when I evaluate the same policy twice,
  • 62:45 - 62:50
    at least I get back the same answer. This gives me a deterministic function
  • 62:50 - 62:54
    mapping from parameters in my policy
  • 62:54 - 62:56
    to my estimate of the expected payoff.
  • 62:58 - 63:13
    That's just a function that I can try to optimize using the search algorithm.
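As a rough illustration of that last step, the deterministic estimate can be handed to any off-the-shelf derivative-free search routine. The sketch below assumes the hypothetical estimated_payoff function and dimensions from the earlier snippet and uses SciPy's Nelder-Mead simplex search; hill climbing or any other local search over the policy parameters would serve the same purpose.

```python
import numpy as np
from scipy.optimize import minimize

# Negate the deterministic payoff estimate so that minimizing it maximizes payoff.
# estimated_payoff, ACTION_DIM and STATE_DIM come from the sketch above (hypothetical).
theta0 = np.zeros(ACTION_DIM * STATE_DIM)
result = minimize(lambda th: -estimated_payoff(th),
                  x0=theta0,
                  method="Nelder-Mead",
                  options={"maxiter": 2000, "xatol": 1e-4, "fatol": 1e-4})

theta_best = result.x                      # policy parameters found by the search
print("best estimated payoff:", -result.fun)
```

Nelder-Mead is only one choice here; the point is simply that once the objective is deterministic, standard optimization machinery can be applied to it.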
  • 63:13 - 63:17
So we used this algorithm for inverted hovering,
  • 63:17 - 63:20
    and again policy search algorithms tend to work well when you can find
  • 63:20 - 63:24
    a reasonably simple policy mapping from
  • 63:24 - 63:29
the states to the actions. This is especially true for low-level control tasks,
  • 63:29 - 63:33
    which I think of as sort of reflexes almost.
  • 63:33 - 63:34
Conversely,
  • 63:34 - 63:37
    if you want to solve a problem like Tetris where you might plan
  • 63:37 - 63:40
    ahead a few steps about what's a nice configuration of the board, or
  • 63:40 - 63:42
    something like a game of chess,
  • 63:42 - 63:47
or problems of long path planning (go here, then go there, then go there),
  • 63:47 - 63:51
    then sometimes you might apply a value function method instead.
  • 63:51 - 63:55
    But for tasks like helicopter flight, for
  • 63:55 - 63:59
    low level control tasks, for the reflexes of driving or controlling
  • 64:01 - 64:06
various robots, policy search algorithms - where
  • 64:06 - 64:11
we can sometimes more easily approximate the policy directly - work very well.
  • 64:12 - 64:16
    Some [inaudible] the state of RL today. RL algorithms are applied
  • 64:16 - 64:18
    to a wide range of problems,
  • 64:18 - 64:22
    and the key is really sequential decision making. The place where you
  • 64:22 - 64:26
think about applying a reinforcement learning algorithm is when you need to make a decision,
  • 64:26 - 64:28
    then another decision, then another decision,
  • 64:28 - 64:32
    and some of your actions may have long-term consequences. I think that
  • 64:32 - 64:38
is the heart of RL: sequential decision making, where you make multiple decisions,
  • 64:38 - 64:42
    and some of your actions may have long-term consequences. I've shown you
  • 64:42 - 64:44
    a bunch of robotics examples. RL
  • 64:44 - 64:46
    is also
  • 64:46 - 64:49
applied to things like medical decision making, where you may have a patient and you
  • 64:49 - 64:52
    want to choose a sequence of treatments, and you do this now for the patient, and the
  • 64:52 - 64:56
    patient may be in some other state, and you choose to do that later, and so on.
  • 64:56 - 65:01
    It turns out there's a large community of people applying these sorts of tools to queues. So
  • 65:01 - 65:05
    imagine you have a bank where you have people lining up, and
  • 65:05 - 65:06
    after they go to one cashier,
  • 65:06 - 65:10
    some of them have to go to the manager to deal with something else. You have
  • 65:10 - 65:13
    a system of multiple people standing in line in multiple queues, and so how do you route
  • 65:13 - 65:18
    people optimally to minimize the waiting
  • 65:18 - 65:20
time? And not just people, but
  • 65:20 - 65:24
    objects in an assembly line and so on. It turns out there's a surprisingly large community
  • 65:24 - 65:26
    working on optimizing queues.
  • 65:26 - 65:30
    I mentioned game playing a little bit already. Things
  • 65:30 - 65:33
    like financial decision making, if you have a large amount of stock,
  • 65:33 - 65:38
    how do you sell off a large amount - how do you time the selling off of your stock
  • 65:38 - 65:41
    so as to not affect market prices adversely too much? There
  • 65:41 - 65:45
    are many operations research problems, things like factory automation. Can you design
  • 65:45 - 65:46
    a factory
  • 65:46 - 65:50
    to optimize throughput, or minimize cost, or whatever. These are all
  • 65:50 - 65:54
    sorts of problems that people are applying reinforcement learning algorithms to.
  • 65:54 - 65:58
    Let me just close with a few robotics examples because they're always fun, and we just
  • 65:58 - 65:59
    have these videos.
  • 66:00 - 66:07
This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students
  • 66:07 - 66:14
here. They were working on getting a robot dog to climb over difficult rugged
  • 66:14 - 66:17
    terrain. Using a reinforcement learning algorithm,
  • 66:17 - 66:20
    they applied an approach that's
  • 66:20 - 66:24
    similar to a value function approximation approach, not quite but similar.
  • 66:24 - 66:29
    They allowed the robot dog to sort of plan ahead multiple steps, and
  • 66:29 - 66:34
carefully choose its footsteps and traverse rugged terrain.
  • 66:34 - 66:43
This is really state of the art in terms of what you can get a robotic dog to do.
  • 66:43 - 66:46
    Here's another fun one.
  • 66:46 - 66:48
    It turns out that
  • 66:48 - 66:51
wheeled robots are very fuel-efficient. Cars and trucks are almost the
  • 66:51 - 66:55
most fuel-efficient robots in the world.
  • 66:55 - 66:57
    Cars and trucks are very fuel-efficient,
  • 66:57 - 67:01
but legged robots have the ability to traverse more rugged terrain.
  • 67:01 - 67:05
    So this is a robot - this is actually a small scale mockup of a larger
  • 67:05 - 67:07
    vehicle built by Lockheed Martin,
  • 67:07 - 67:10
    but can you come up with a vehicle that
  • 67:10 - 67:14
    has wheels and has the fuel efficiency of wheeled robots, but also
  • 67:14 - 67:19
has legs so it can traverse obstacles?
  • 67:19 - 67:27
Again, a reinforcement learning algorithm was used to design a controller for this robot to make it traverse obstacles, using
  • 67:27 - 67:31
    somewhat complex gaits that would be very hard to design by hand,
  • 67:31 - 67:33
but by choosing a reward function - telling the robot
  • 67:33 - 67:37
there's a plus one reward on top of the goal,
  • 67:37 - 67:45
and a few other things - it learns these sorts of policies automatically.
  • 67:47 - 67:55
    Last couple fun ones - I'll show you a couple last helicopter videos.
  • 67:57 - 68:09
This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates
  • 68:09 - 68:33
    who have been working on autonomous flight. I'll just let you watch. I'll just
  • 69:49 - 69:54
    show you one more. [Inaudible] do this with a real helicopter [inaudible]?
  • 69:54 - 69:55
    Not a full-size helicopter.
  • 69:55 - 70:01
    Only small radio control helicopters. [Inaudible]. Full-size helicopters
  • 70:01 - 70:06
    are the wrong design for this. You shouldn't do this on a full-size helicopter.
  • 70:06 - 70:13
    Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more.
  • 70:13 - 70:19
    Are there any human [inaudible]? Yes,
  • 70:19 - 70:26
    there are very good human pilots that can.
  • 70:28 - 70:35
    This is just one more
  • 70:44 - 70:46
    maneuver. That was kind of fun. So this is the work of
  • 70:46 - 70:56
Pieter Abbeel and Adam Coates. So that was it. That was all the technical material
  • 70:56 - 71:03
    I wanted to present in this class. I guess
  • 71:03 - 71:10
    you're all experts on machine learning now. Congratulations.
  • 71:10 - 71:14
I hope you've got the sense through this class that this is one of the technologies
  • 71:14 - 71:19
that's really having a huge impact on science, engineering, and industry.
  • 71:19 - 71:23
    As I said in the first lecture, I think many people use machine
  • 71:23 - 71:30
    learning algorithms dozens of times a day without even knowing about it.
  • 71:30 - 71:34
    Based on the projects you've done, I hope that
  • 71:34 - 71:38
    most of you will be able to imagine yourself
  • 71:38 - 71:43
    going out after this class and applying these things to solve a variety of problems.
  • 71:43 - 71:46
    Hopefully, some of you will also imagine yourselves
  • 71:46 - 71:49
    writing research papers after this class, be it on
  • 71:49 - 71:52
    a novel way to do machine learning, or on some way of applying machine
  • 71:52 - 71:55
    learning to a problem that you care
  • 71:55 - 71:59
    about. In fact, looking at project milestones, I'm actually sure that a large fraction of
  • 71:59 - 72:04
    the projects in this class will be publishable, so I think that's great.
  • 72:07 - 72:11
I guess many of you will also go into industry, make new products, and make lots
  • 72:11 - 72:13
    of money using learning
  • 72:13 - 72:16
    algorithms. Remember me
  • 72:16 - 72:20
    if that happens. One of the things I'm excited about is through research or
  • 72:20 - 72:23
    industry, I'm actually completely sure that the people in this class in the
  • 72:23 - 72:24
    next few months
  • 72:24 - 72:27
    will apply machine learning algorithms to lots of problems in
  • 72:27 - 72:32
industrial management and computer science - things like optimizing computer architectures,
  • 72:32 - 72:33
    network security,
  • 72:33 - 72:38
    robotics, computer vision, to problems in computational biology,
  • 72:39 - 72:42
    to problems in aerospace, or understanding natural language,
  • 72:42 - 72:44
    and many more things like that.
  • 72:44 - 72:46
    So
  • 72:46 - 72:50
    right now I have no idea what all of you are going to do with the learning algorithms you learned about,
  • 72:50 - 72:52
    but
  • 72:52 - 72:55
every time I wrap up this class, I always
  • 72:55 - 72:58
    feel very excited, and optimistic, and hopeful about the sorts of amazing
  • 72:58 - 73:03
    things you'll be able to do.
  • 73:03 - 73:10
    One final thing, I'll just give out this handout.
  • 73:31 - 73:35
One final thing is that machine learning has grown out of a larger literature on AI,
  • 73:35 - 73:36
where there's
  • 73:36 - 73:42
this desire to build systems that exhibit intelligent behavior. Machine learning
  • 73:42 - 73:45
    is one of the tools of AI, maybe one that's had a
  • 73:45 - 73:47
    disproportionately large impact,
  • 73:47 - 73:52
    but there are many other ideas in AI that I hope you
  • 73:52 - 73:55
    go and continue to learn about. Fortunately, Stanford
  • 73:55 - 73:59
    has one of the best and broadest sets of AI classes, and I hope that
  • 73:59 - 74:04
    you take advantage of some of these classes, and go and learn more about AI, and more
  • 74:04 - 74:08
    about other fields which often apply learning algorithms to problems in vision,
  • 74:08 - 74:10
    problems in natural language processing in robotics, and so on.
  • 74:10 - 74:15
    So the handout I just gave out has a list of AI related courses. Just
  • 74:15 - 74:18
    running down very quickly, I guess, CS221 is an overview that I
  • 74:18 - 74:19
    teach. There
  • 74:19 - 74:24
    are a lot of robotics classes also: 223A,
  • 74:24 - 74:25
    225A, 225B
  • 74:25 - 74:29
- many robotics classes. There are so many applications
  • 74:29 - 74:32
    of learning algorithms to robotics today. 222
  • 74:32 - 74:36
    and 227 are knowledge representation and reasoning classes.
  • 74:36 - 74:40
    228 - of all the classes on this list, 228, which Daphne
  • 74:40 - 74:43
    Koller teaches, is probably closest in spirit to 229. This is one of
  • 74:43 - 74:45
    the classes I highly recommend
  • 74:45 - 74:49
    to all of my PhD students as well.
  • 74:49 - 74:53
    Many other problems also touch on machine learning. On the next page,
  • 74:54 - 74:58
    courses on computer vision, speech recognition, natural language processing,
  • 74:58 - 75:03
    various tools that aren't just machine learning, but often involve machine learning in many ways.
  • 75:03 - 75:07
    Other aspects of AI, multi-agent systems taught by [inaudible].
  • 75:10 - 75:13
    EE364A is convex optimization. It's a class taught by
  • 75:13 - 75:17
    Steve Boyd, and convex optimization came up many times in this class. If you
  • 75:17 - 75:21
    want to become really good at it, EE364 is a great class. If you're
  • 75:21 - 75:25
    interested in project courses, I also teach a project class
  • 75:25 - 75:27
    next quarter where we spend the whole quarter working
  • 75:27 - 75:32
    on research projects.
  • 75:32 - 75:38
    So I hope you go and take some more of those classes.
  • 75:38 - 75:40
    In closing,
  • 75:40 - 75:43
    let me just say
  • 75:43 - 75:47
    this class has been really fun to teach, and it's very satisfying to
  • 75:47 - 75:52
me personally when we set these insanely difficult homeworks, and
  • 75:52 - 75:55
    then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's
  • 75:55 - 75:59
    actually very satisfying when I see that.
  • 75:59 - 76:05
Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable."
  • 76:05 - 76:05
    So
  • 76:05 - 76:09
    I hope you take what you've learned, go forth, and
  • 76:09 - 76:12
    do amazing things with learning algorithms.
  • 76:12 - 76:16
    I know this is a heavy workload class, so thank you all very much for the hard
  • 76:16 - 76:20
    work you've put into this class, and the hard work you've put into learning this material,
  • 76:20 - 76:22
    and
  • 76:22 - 76:23
thank you very much for having been students in this class.
Title:
Lecture 20 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses POMDPs, policy search, and Pegasus in the context of reinforcement learning.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:16:40