Lecture 20 | Machine Learning (Stanford)
-
0:11 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:24 - 0:28So welcome to the last lecture of this course.
-
0:28 - 0:33What I want to do today is tell you about one final class of reinforcement
-
0:33 - 0:35learning algorithms. I
-
0:35 - 0:39just want to say a little bit about POMDPs, the partially observable MDPs, and then
-
0:39 - 0:44the main technical topic for today will be policy search algorithms.
-
0:44 - 0:49I'll talk about two specific algorithms, essentially called reinforced and called Pegasus,
-
0:49 - 0:55and then we'll wrap up the class. So if you recall from the last lecture,
-
0:55 - 1:00I actually started to talk about one specific example of a POMDP,
-
1:00 - 1:07which was this sort of linear dynamical
-
1:07 - 1:11system. This is sort of the LQR, linear quadratic regulation problem,
-
1:11 - 1:24but I changed it and said what if we only have observations
-
1:24 - 1:28YT. And what if we couldn't observe the state of the system directly,
-
1:28 - 1:37but had to choose an action only based on some noisy observations that may be some function of the state?
-
1:37 - 1:43So our strategy last time was that we said that in the fully observable case,
-
1:43 - 1:55we could choose actions - AT equal to that matrix LT times ST. So LT was this
-
1:55 - 1:59matrix of parameters that came out of the dynamic programming
-
1:59 - 2:02algorithm for finite horizon MDPs in the LQR problem.
-
2:02 - 2:11And so we said if only we knew what the state was, we would choose actions according to some matrix LT times the state.
-
2:11 - 2:14And then I said in the partially observable case,
-
2:14 - 2:20we would compute these estimates.
-
2:20 - 2:27I wrote them as S of T given T,
-
2:30 - 2:37which were our best estimate for what the state is given all the observations. And in particular,
-
2:37 - 2:41I'm gonna talk about a Kalman filter
-
2:41 - 2:49which we worked out that our posterior distribution of what the state is
-
2:49 - 2:53given all the observations up to a certain time
-
2:53 - 2:59that was this. So this is from last time. So that
-
2:59 - 3:03given the observations Y one through YT, our posterior distribution
-
3:03 - 3:06of the current state ST was Gaussian
-
3:06 - 3:10with mean ST given T and covariance sigma T given T.
-
3:10 - 3:14So I said we use a Kalman filter to compute this thing, this ST given T,
-
3:14 - 3:20which is going to be our best guess for what the state is currently.
-
3:20 - 3:31And then we choose actions using
-
3:31 - 3:34our estimate for what the state is, rather than using the true state because we
-
3:34 - 3:38don't know the true state anymore in this POMDP.
-
3:38 - 3:40So
-
3:40 - 3:47it turns out that this specific strategy actually allows you to choose optimal actions,
-
3:47 - 3:52allows you to choose actions as well as you possibly can given that this is a
-
3:52 - 3:55POMDP, and given there are these noisy observations.
-
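As an aside, here is a minimal Python sketch (not from the lecture) of how these two pieces fit together in the linear dynamical system case: a Kalman filter maintains the estimate s_{t|t}, and the action is chosen as a_t = L_t s_{t|t}, where L_t is the LQR gain computed as if the state were fully observed. The dynamics matrices A, B, C and the noise covariances are hypothetical placeholders.

```python
import numpy as np

def kalman_step(A, B, C, Q, R_obs, s_hat, Sigma, a_prev, y):
    """One Kalman filter update: predict s_{t|t-1} from the dynamics and the
    previous action, then correct with the new observation y_t to get s_{t|t}."""
    # Predict using the known linear dynamics
    s_pred = A @ s_hat + B @ a_prev
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct using the noisy observation y_t ~ C s_t + noise
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R_obs)
    s_hat_new = s_pred + K @ (y - C @ s_pred)
    Sigma_new = (np.eye(len(s_pred)) - K @ C) @ Sigma_pred
    return s_hat_new, Sigma_new

def choose_action(L_t, s_hat):
    """Act on the state estimate exactly as we would have acted on the
    true state in the fully observable LQR problem: a_t = L_t s_{t|t}."""
    return L_t @ s_hat
```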
3:55 - 3:59It turns out that in general finding optimal policies with POMDPs -
-
3:59 - 4:06finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.
-
4:06 - 4:16Just to be concrete about the formalism of the POMDP - I should just write it here -
-
4:16 - 4:32a POMDP formally is a tuple like that
-
4:32 - 4:43where the changes are that the set Y is the set of possible observations,
-
4:45 - 4:54and this O subscript S are the observation distributions.
-
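The tuple written on the board is not captured in the transcript; consistent with the MDP notation used earlier in the course, it is presumably something like

$$\text{POMDP} = (S, A, Y, \{P_{sa}\}, \{O_s\}, T, R), \qquad y_t \sim O_{s_t},$$

where $Y$ is the set of possible observations and $O_s$ is the distribution over observations when the system is in state $s$.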
4:58 - 5:11And so at each step, we observe
-
5:18 - 5:20- at each step in the POMDP, if we're
-
5:20 - 5:24in some state ST, we observe some observation YT
-
5:24 - 5:28drawn from the observation distribution O subscript ST, which is
-
5:28 - 5:30indexed by what the current state is.
-
5:30 - 5:31And
-
5:31 - 5:36it turns out that computing the optimal policy in a POMDP is an NP-hard problem.
-
5:36 - 5:41For the specific case of linear dynamical systems with the Kalman filter
-
5:41 - 5:46model, we have this strategy of computing the optimal policy assuming full observability
-
5:46 - 5:49and then estimating the states from the observations,
-
5:49 - 5:51and then plugging the two together.
-
5:51 - 5:56That turns out to be optimal essentially for only that special case of a POMDP.
-
5:56 - 5:58In the more general case,
-
5:58 - 6:03that strategy of designing a controller assuming full observability
-
6:03 - 6:06and then just estimating the state and plugging the two together,
-
6:06 - 6:08for general POMDPs
-
6:08 - 6:15that same strategy is often a very reasonable strategy but is not always guaranteed to be optimal.
-
6:15 - 6:21Solving these problems in general is NP-hard.
-
6:21 - 6:22So what I want to do today
-
6:22 - 6:26is actually
-
6:26 - 6:29talk about a different class of reinforcement learning algorithms. These are called
-
6:29 - 6:38policy search algorithms. In particular, policy search algorithms
-
6:38 - 6:44can be applied equally well to MDPs, to fully observed Markov decision processes,
-
6:44 - 6:46or to these POMDPs,
-
6:46 - 6:50these partially observable MDPs.
-
6:50 - 6:58What I want to do now, I'll actually just describe policy search algorithms applied to MDPs, applied to the fully observable case.
-
6:58 - 7:02And at the end, I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs. In
-
7:02 - 7:05the latter case, when
-
7:05 - 7:10you apply a policy search algorithm to a POMDP, it's
-
7:10 - 7:15going to be hard to guarantee that you get the globally optimal policy because
-
7:15 - 7:21solving POMDPs in general is NP-hard, but nonetheless policy search algorithms - it turns out to be
-
7:21 - 7:23I think one of the most
-
7:23 - 7:31effective classes of reinforcement learning algorithms, as well both for MDPs and for POMDPs.
-
7:31 - 7:34So
-
7:34 - 7:37here's what we're going to do.
-
7:37 - 7:41In policy search,
-
7:41 - 7:48we're going to define some set
-
7:48 - 7:55which I denote capital pi of policies,
-
7:55 - 8:09and our strategy is to search for a good policy lowercase pi in the set capital pi.
-
8:11 - 8:15Just by analogy, I want to say
-
8:15 - 8:19- in the same way, back when we were talking about supervised learning,
-
8:19 - 8:24the way we define the set capital pi of policies and search for a policy in this
-
8:24 - 8:31set capital pi is analogous to supervised learning
-
8:31 - 8:46where we defined a set script H of hypotheses and search -
-
8:50 - 8:58and would search for a good hypothesis in this hypothesis class script H.
-
8:58 - 9:03Policy search is sometimes also called direct policy search.
-
9:03 - 9:07To contrast this with the sorts of algorithms we've been talking about so
-
9:07 - 9:11far, in all the algorithms we've been talking about so far, we would try to find V star. We would try to
-
9:11 - 9:14find the optimal value function.
-
9:14 - 9:19And then we'd use V star - we'd use the optimal value function to then try to compute or
-
9:19 - 9:20try to approximate pi star.
-
9:20 - 9:24So all the approaches we talked about previously are
-
9:24 - 9:30an indirect strategy for finding a good policy: once we compute the value function, then we go from that to the policy.
-
9:30 - 9:35In contrast, in policy search algorithms and something that's called direct policy search algorithms,
-
9:35 - 9:40the idea is that we're going to quote "directly" try to approximate a good policy
-
9:40 - 9:48without going through the intermediate stage of trying to find the value function.
-
9:48 - 9:50Let's see.
-
9:50 - 9:51And also
-
9:51 - 9:58as I develop policy search - just one step that's sometimes slightly confusing.
-
9:58 - 10:03Making an analogy to supervised learning again, when we talked about logistic regression,
-
10:03 - 10:09I said we have input features X and some labels Y, and I sort of said
-
10:09 - 10:13let's approximate Y using the logistic function of the inputs X. And
-
10:13 - 10:18at least initially, the logistic function was sort of pulled out of the air.
-
10:18 - 10:23In the same way, as I define policy search algorithms, there'll sort of be a step where I say,
-
10:23 - 10:29"Well, let's try to compute the actions. Let's try to approximate what a good action is
-
10:29 - 10:33using a logistic function of the state." So again, I'll sort of pull a
-
10:33 - 10:37function out of the air. I'll say, "Let's just choose a function, and
-
10:37 - 10:41that'll be our choice of the policy class," and I'll say,
-
10:41 - 10:45"Let's take this input the state, and then we'll map it through logistic function, and then hopefully, we'll approximate
-
10:45 - 10:47what is a good function -
-
10:47 - 10:48excuse me,
-
10:48 - 10:53we'll approximate what is a good action using a logistic function of the state."
-
10:53 - 10:54So there's that sort of -
-
10:54 - 10:59the choice of the policy class that's again a little bit arbitrary,
-
10:59 - 11:06but it's arbitrary as it was when we were talking about supervised learning.
-
11:06 - 11:17So to develop our first policy search algorithm, I'm actually gonna need the new definition.
-
11:56 - 12:03So
-
12:07 - 12:13our first policy search algorithm, we'll actually need to work with stochastic policies.
-
12:13 - 12:18What I mean by stochastic policy is there's going to be a function that maps from
-
12:18 - 12:23the space of states cross actions to the real numbers,
-
12:23 - 12:25where pi of S comma A will be
-
12:25 - 12:31interpreted as the probability of taking action A in some state S.
-
12:31 - 12:46And so we have that the sum over A of pi of S comma A equals one - in other words, for every state
-
12:46 - 12:54a stochastic policy specifies a probability distribution over the actions.
-
12:55 - 12:59So concretely,
-
12:59 - 13:07suppose you are executing some policy pi. Say I have some stochastic policy pi. I wanna execute the policy pi.
-
13:07 - 13:12What that means is that - in this example let's say I have three actions.
-
13:12 - 13:16What that means is that suppose I'm in some state S.
-
13:16 - 13:25I would then compute pi of S comma A1, pi of S comma A2,
-
13:25 - 13:29pi of S comma A3, if I have a three action MDP.
-
13:29 - 13:34These will be three numbers that sum up to one,
-
13:34 - 13:37and then my chance of taking action A1 will be
-
13:37 - 13:41equal to this. My chance of taking action A2 will be equal to pi of S comma A2.
-
13:41 - 13:44My chance of taking action A3 will be equal
-
13:44 - 13:50to this number. So that's what it means to execute a stochastic policy.
-
13:50 - 13:51So
-
13:51 - 13:54as a concrete example, just let me make this
-
14:03 - 14:07let me just go ahead and give one specific example of what a stochastic
-
14:07 - 14:10policy may look like.
-
14:10 - 14:14For this example, I'm gonna use the inverted pendulum
-
14:14 - 14:20as my motivating example. It's that problem of balancing a pole. We have an
-
14:20 - 14:21inverted
-
14:21 - 14:26pendulum that swings freely, and you want to move the cart left and
-
14:26 - 14:29right to keep the pole vertical.
-
14:29 - 14:35Let's say my actions -
-
14:35 - 14:44for today's example, I'm gonna use that angle to denote the angle of the pole
-
14:44 - 14:50phi. I have two actions where A1 is to accelerate left
-
14:50 - 15:03and A2 is to accelerate right. Actually, let me just write that the other way around. A1 is to accelerate right. A2 is to accelerate left.
-
15:03 - 15:05So
-
15:05 - 15:06let's see.
-
15:06 - 15:10Choose a reward function that penalizes the pole falling over whatever.
-
15:10 - 15:14And now let's come up with a stochastic policy for this problem.
-
15:15 - 15:18To come up with a class of stochastic policies
-
15:18 - 15:20
-
15:20 - 15:24really means coming up with some class of functions to approximate what action you
-
15:24 - 15:27want to take as a function of the state.
-
15:27 - 15:30So here's my somewhat arbitrary choice. I'm gonna say that
-
15:30 - 15:35the probability of action A1, so pi of S comma A1,
-
15:35 - 15:40I'm gonna write as one over one plus E to the minus theta transpose S - okay?
-
15:40 - 15:46And I just chose the logistic function because it's a convenient function we've used a lot. So I'm gonna say that
-
15:46 - 15:50my policy is parameterized by a set of parameters theta,
-
15:50 - 15:53and for any given set of parameters theta,
-
15:53 - 15:56that gives me a stochastic policy.
-
15:56 - 15:59And if I'm executing that policy with parameters theta, that means
-
15:59 - 16:06that the chance of my choosing the accelerate right action is given by this number.
-
16:18 - 16:26Because my chances of executing actions A1 or A2 must sum to one, one minus this gives me pi of S comma A2. So just
-
16:26 - 16:29to be concrete, this means that when I'm in some state S,
-
16:29 - 16:31I'm going to compute this number,
-
16:31 - 16:34compute one over one plus E to the minus theta transpose S.
-
16:34 - 16:40And then with this probability, I will execute the accelerate right action,
-
16:40 - 16:46and with one minus this probability, I'll execute the accelerate left action.
-
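As a minimal Python sketch of executing this stochastic policy (the state ordering and parameter values below are just illustrative assumptions, not from the lecture):

```python
import numpy as np

def pi_right(theta, s):
    """pi(s, a1) = 1 / (1 + exp(-theta^T s)): probability of accelerating right.
    The probability of accelerating left is one minus this."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def sample_action(theta, s, rng):
    """Execute the stochastic policy: accelerate right with probability
    pi(s, a1), otherwise accelerate left."""
    return "right" if rng.random() < pi_right(theta, s) else "left"

# Illustrative state: [1 (intercept), cart position, cart velocity, angle phi, angular velocity]
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # depends only on the angle phi
s = np.array([1.0, 0.0, 0.0, 0.3, 0.0])      # pole leaning to the right
rng = np.random.default_rng(0)
print(sample_action(theta, s, rng))          # accelerates right most of the time
```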
16:46 - 16:51And again, just to give you a sense of why this might be a reasonable thing to do,
-
16:52 - 16:59let's say my state vector is - this
-
17:02 - 17:05is my state, and I added an extra one as an intercept term, just to give my
-
17:05 - 17:09logistic function an extra feature.
-
17:09 - 17:20If I choose my parameters and my policy to be say this,
-
17:20 - 17:23then that means that at any state,
-
17:23 - 17:34the probability of my taking action A1 - the probability of my taking the accelerate
-
17:34 - 17:41right action is this one over one plus E to the minus theta transpose S, and taking the inner product
-
17:41 - 17:48of theta and S, this just gives you phi,
-
17:48 - 17:51equals one over one plus E to the minus phi.
-
17:51 - 17:55And so if I choose my parameters theta as follows,
-
17:55 - 18:01what that means is that
-
18:01 - 18:08just depending on the angle phi of my inverted pendulum,
-
18:09 - 18:12the chance of my accelerating to the
-
18:12 - 18:15right is just this function of
-
18:15 - 18:18the angle of my inverted pendulum. And so this means for example
-
18:18 - 18:22that if my inverted pendulum is leaning far over to the right,
-
18:22 - 18:26then I'm very likely to accelerate to the right to try to catch it. I
-
18:26 - 18:30hope the physics of this inverted pendulum thing make
-
18:30 - 18:33sense. If my pole's leaning over to the right, then I wanna accelerate to the right to catch it. And
-
18:33 - 18:37conversely if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try
-
18:37 - 18:39to catch it.
-
18:39 - 18:44So this is one example for one specific choice of parameters theta.
-
18:44 - 18:49Obviously, this isn't a great policy because it ignores the rest of the features. Maybe if
-
18:49 - 18:53the cart is further to the right, you want it to be less likely to accelerate to the
-
18:53 - 18:56right, and you can capture that by changing
-
18:56 - 18:58one of these coefficients to take into account the
-
18:58 - 19:00actual position of the cart. And then depending on
-
19:00 - 19:03the velocity of the cart and the angular velocity,
-
19:03 - 19:08you might want to change theta to take into account these other effects as well. Maybe
-
19:08 - 19:13if the pole's leaning far to the right, but is actually on its way to swinging back, as
-
19:13 - 19:16indicated by the angular velocity, then you might be
-
19:16 - 19:19less worried about having to accelerate hard to the right. And so
-
19:19 - 19:23these are the sorts of behavior you can get by varying the parameters theta.
-
19:24 - 19:29And so our goal is to tune the parameters theta - our goal in policy search
-
19:29 - 19:37is to tune the parameters theta so that when we execute the policy pi subscript theta,
-
19:37 - 19:40the pole stays up as long as possible.
-
19:40 - 19:48In other words, our goal is to maximize as a function of theta -
-
19:48 - 20:05our goal is to maximize the expected value of the payoff
-
20:05 - 20:10for when we execute the policy pi theta. We
-
20:10 - 20:11want to choose parameters theta
-
20:11 - 20:21to maximize that. Are there questions about the problem set up,
-
20:21 - 20:26and policy search and policy classes or anything? Yeah? Student: In a case where we have
-
20:26 - 20:33more than two actions, would we use a different theta for each of the distributions, or still have
-
20:33 - 20:36the same parameters? Oh, yeah.
-
20:36 - 20:39Right. So what if we have more than two actions. It turns out you can choose almost anything you want for the policy class, but
-
20:39 - 20:44if you have say a fixed number of discrete actions,
-
20:44 - 20:48I would sometimes use like a softmax parameterization.
-
20:48 - 20:54Similar to softmax regression that we saw earlier in the class,
-
20:54 - 20:56you may
-
20:56 - 21:03say
-
21:06 - 21:07that - I'm running out of space. You may have a set of parameters theta 1
-
21:07 - 21:09through theta D if
-
21:09 - 21:25you have D actions, and pi of S comma A I equals E to the theta I transpose S over the sum over J of E to the theta J transpose S -
-
21:25 - 21:29so that would be an example of a softmax parameterization for multiple actions.
-
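A minimal sketch of that softmax parameterization for D discrete actions (in Python, with hypothetical parameters):

```python
import numpy as np

def softmax_policy(thetas, s):
    """pi(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s).
    `thetas` is a (D, n) array with one parameter vector per action."""
    logits = thetas @ s
    logits = logits - logits.max()     # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(thetas, s, rng):
    """Execute the stochastic policy by sampling an action index from pi(s, .)."""
    return rng.choice(len(thetas), p=softmax_policy(thetas, s))
```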
21:29 - 21:33It turns out that if you have continuous actions, you can actually make this be a
-
21:33 - 21:34density
-
21:34 - 21:36over the actions A
-
21:36 - 21:38and parameterized by other things as well.
-
21:38 - 21:43But the choice of policy class is somewhat up to you, in the same way that
-
21:44 - 21:48the choice of whether we chose to use a linear function
-
21:48 - 21:49or linear function with
-
21:49 - 21:56quadratic features or whatever in supervised learning that was sort of up to us. Anything
-
22:02 - 22:09else? Yeah. [Inaudible] stochastic?
-
22:12 - 22:16Yes. So is it possible to [inaudible] a stochastic policy using numbers [inaudible]? I see. Given that MDP has stochastic transition
-
22:16 - 22:18probabilities, is it possible to
-
22:18 - 22:23use deterministic policies and rely on the stochasticity of the state transition probabilities?
-
22:23 - 22:24The answer is yes,
-
22:24 - 22:26but
-
22:26 - 22:29for the purposes of what I want to show later, that won't be useful.
-
22:29 - 22:32But formally, it is possible.
-
22:32 - 22:35If you already have a fixed - if you have a
-
22:35 - 22:41fixed policy, then you'd be able to do that. Anything else?
-
22:41 - 22:44Yeah. No, I guess even a [inaudible] class of policy can do that,
-
22:44 - 22:50but for the derivation later, I actually need to keep it separate.
-
22:50 - 22:55Actually, could you just - I know the concept of policy search is sometimes a little confusing. Could you just raise your hand
-
22:55 - 23:02if this makes sense?
-
23:02 - 23:05Okay. Thanks. So let's talk about
-
23:05 - 23:09an algorithm. What I'm gonna talk about - the first algorithm I'm going to present
-
23:10 - 23:14is sometimes called the reinforce algorithm. What I'm going to present it turns out
-
23:14 - 23:19isn't exactly the reinforce algorithm as it was originally presented by the
-
23:19 - 23:31author Ron Williams, but it sort of captures its essence. Here's the idea.
-
23:31 - 23:37In the sequel - in what I'm about to do, I'm going to assume that S0 is
-
23:37 - 23:40some fixed initial state.
-
23:42 - 23:45Or
-
23:45 - 23:49it turns out if S0 is drawn from some fixed initial state
-
23:49 - 23:52distribution, then everything I'm about to say still goes through, but let's just say S0 is some
-
23:52 - 23:54fixed initial state.
-
23:54 - 24:05So my goal is to maximize this expected sum
-
24:06 - 24:10of rewards, given that we execute the policy - and I'll just drop that conditioning from the notation.
-
24:10 - 24:15So the random variables in this expectation are the sequence of states and actions:
-
24:15 - 24:20S0, A0, S1, A1, and so on, up to ST, AT are the random variables.
-
24:20 - 24:26So let me write out this expectation explicitly as a sum
-
24:26 - 24:33over all possible state and action sequences of the probability of that sequence times the payoff
-
24:46 - 24:53- so that's what an expectation is. It's the probability of the random variables times the payoff. Let
-
24:53 - 25:00me just expand out this probability.
-
25:01 - 25:06So the probability of seeing this exact sequence of states and actions is
-
25:06 - 25:11the probability of the MDP starting in that state.
-
25:11 - 25:14If this is a deterministic initial state, then all the probability
-
25:14 - 25:18mass would be on one state. Otherwise, there's some distribution over initial states.
-
25:18 - 25:21Then times the probability
-
25:21 - 25:24that
-
25:24 - 25:29you chose action A0 from that state S zero,
-
25:29 - 25:34and then times the probability that
-
25:34 - 25:38the MDP's transition probabilities happen to transition you to state
-
25:38 - 25:45S1 when you chose action A0 in state S0,
-
25:45 - 25:53times the probability that you chose action A1 there, and so on.
-
25:53 - 26:06The last term here is that, and then times that.
-
26:12 - 26:14So what I did was just
-
26:14 - 26:19take this probability of seeing this sequence of states and actions, and then just
-
26:19 - 26:23wrote it out explicitly or expanded it explicitly like this.
-
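Reconstructing the blackboard expression from the verbal description (writing the payoff as $R(s_0) + \cdots + R(s_T)$), the expansion is presumably

$$\mathbb{E}\Big[\textstyle\sum_{t=0}^{T} R(s_t)\Big] = \sum_{s_0, a_0, \ldots, s_T, a_T} P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots P_{s_{T-1} a_{T-1}}(s_T)\,\pi_\theta(s_T,a_T)\,\big[R(s_0) + \cdots + R(s_T)\big].$$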
26:23 - 26:24It turns out later on
-
26:24 - 26:30I'm going to need to write this sum of rewards a lot, so I'm just gonna call this the payoff
-
26:32 - 26:36from now on. So whenever later in this lecture I write the word payoff, I
-
26:36 - 26:40just mean this sum.
-
26:40 - 26:47So
-
26:47 - 26:52our goal is to maximize the expected payoff, so our goal is to maximize this sum.
-
26:52 - 26:56Let me actually just skip ahead. I'm going to write down what the final answer is, and
-
26:56 - 26:59then I'll come back and justify the
-
26:59 - 27:06algorithm. So here's the algorithm. This is
-
27:25 - 27:32
-
28:13 - 28:16how we're going to
-
28:16 - 28:19update the parameters of the algorithm. We're going to
-
28:19 - 28:23sample a state action sequence. The way you do this is you just take your
-
28:23 - 28:25current stochastic policy,
-
28:25 - 28:29and you execute it in the MDP. So just go ahead and start from some initial state, take
-
28:29 - 28:33a stochastic action according to your current stochastic policy,
-
28:33 - 28:35see where the state transition probabilities take you, and so you just
-
28:35 - 28:39do that for T time steps, and that's how you sample the state action sequence. Then
-
28:39 - 28:45you compute the payoff, and then you perform this update.
-
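The update written on the board is not captured in the transcript; based on the derivation that follows, it is presumably of the form

$$\theta := \theta + \alpha \left[ \frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + \cdots + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)} \right] \times \text{payoff},$$

where $\alpha$ is a learning rate and the payoff is $R(s_0) + \cdots + R(s_T)$ computed on the sampled sequence.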
28:45 - 28:49So let's go back and figure out what this algorithm is doing.
-
28:49 - 28:53Notice that this algorithm performs stochastic updates because on
-
28:53 - 28:57every step it updates theta according to this thing on the right hand side.
-
28:57 - 29:00This thing on the right hand side depends very much on your payoff and on
-
29:00 - 29:02the state action sequence you saw. Your
-
29:02 - 29:05state action sequence is random,
-
29:05 - 29:07so what I want to do is figure out
-
29:07 - 29:12- so on every step, I'll sort of take a step that's chosen randomly
-
29:13 - 29:16because it depends on this random state action sequence.
-
29:16 - 29:22So what I want to do is figure out on average how does it change the parameters theta.
-
29:22 - 29:23In particular,
-
29:23 - 29:44I want to know what is the expected value of the change to the parameters.
-
29:58 - 30:06So I want to know what is the expected value of this change to my parameters theta. Our
-
30:06 - 30:11goal is to maximize the expected sum of rewards - our goal is to maximize the expected value of the payoff.
-
30:11 - 30:14So long as
-
30:14 - 30:18the updates on expectation are on average taking us uphill
-
30:18 - 30:20on the expected payoff,
-
30:20 - 30:22then we're happy.
-
30:22 - 30:25It turns out that
-
30:25 - 30:32this algorithm is a form of stochastic gradient ascent in which -
-
30:32 - 30:37remember when I talked about stochastic gradient descent for least squares regression, I
-
30:37 - 30:43said that you have some parameters - maybe you're trying to minimize a quadratic function.
-
30:43 - 30:49Then you may have parameters that will wander around randomly
-
30:49 - 30:54until it gets close to the optimum of this quadratic surface.
-
30:54 - 30:56It turns out that
-
30:56 - 31:00the reinforce algorithm will be very much like that. It will be a
-
31:00 - 31:03stochastic gradient ascent algorithm in which on every step -
-
31:03 - 31:07the step we take is a little bit random. It's determined by the random state action sequence,
-
31:07 - 31:12but on expectation this turns out to be essentially a gradient ascent algorithm.
-
31:12 - 31:16And so we'll do something like this. It'll wander around randomly, but on average take you towards the
-
31:16 - 31:25optimum. So let me go ahead and prove that now.
-
31:29 - 31:36Let's see.
-
31:38 - 31:46What I'm going to do is I'm going to derive a gradient ascent update rule
-
31:46 - 31:50for maximizing the expected payoff. Then I'll hopefully show that
-
31:50 - 31:59by deriving a gradient ascent update rule, I'll end up with this thing on expectation.
-
31:59 - 32:02So before I do the derivation, let me just remind you of
-
32:02 - 32:06the chain rule - the product rule for differentiation
-
32:06 - 32:09in which
-
32:09 - 32:14if I have a product of functions,
-
32:14 - 32:17then
-
32:33 - 32:38the derivative of the product is given by
-
32:38 - 32:41taking the derivatives of these things one at a time. So first I differentiate
-
32:41 - 32:45F, getting F prime, leaving the other two fixed. Then I differentiate with respect to G,
-
32:45 - 32:47leaving the other two fixed.
-
32:47 - 32:51Then I differentiate with respect to H, so I get H prime leaving the other two fixed. So that's the
-
32:51 - 32:54product rule
-
32:54 - 32:56for
-
32:56 - 33:01derivatives.
-
33:01 - 33:05If you refer back to this equation where
-
33:05 - 33:07earlier we wrote out that
-
33:07 - 33:11the expected payoff was given by this equation, this
-
33:11 - 33:14sum over all the state action sequences of the probability
-
33:14 - 33:16times the payoff.
-
33:16 - 33:20So what I'm going to do is take the derivative of this expression
-
33:20 - 33:23with respect to the parameters theta
-
33:23 - 33:26because I want to do gradient ascent on this function. So I'm going to take the
-
33:26 - 33:31derivative of this function with respect to theta, and then try to go uphill on this function.
-
33:31 - 33:36So using the product rule, when I take the derivative of this function with respect to theta
-
33:36 - 33:39what I get is - we'll end up with the sum of terms right there. There are a lot
-
33:39 - 33:42of terms here that depend on theta,
-
33:42 - 33:47and so what I'll end up with is I'll end up having a sum - having one term
-
33:47 - 33:51that corresponds to the derivative of this keeping everything else fixed,
-
33:51 - 33:54to have one term from the derivative of this keeping everything else fixed, and I'll
-
33:54 - 33:55have one term
-
33:55 - 33:56from the derivative of
-
33:56 - 34:02that last thing keeping everything else fixed. So just apply the product rule to this. Let's
-
34:02 - 34:09write that down. So I have that - the derivative
-
34:11 - 34:16with respect to theta of the expected value of the payoff is -
-
34:16 - 34:22it turns out I can actually do this entire derivation in exactly four steps,
-
34:22 - 34:27but each of the steps requires a huge amount of writing, so
-
34:27 - 34:38I'll just start writing and see how that goes, but this is a four step derivation.
-
34:38 - 34:45So there's the sum over the state action sequences as we saw before. Close the
-
35:05 - 35:12bracket,
-
36:00 - 36:03and
-
36:03 - 36:09then times the payoff.
-
36:09 - 36:13So that huge amount of writing, that was just taking my previous formula and
-
36:13 - 36:14
-
36:14 - 36:17differentiating these terms that depend on theta one at a time. This was the term with
-
36:17 - 36:18the derivative
-
36:18 - 36:19of the first pi of
-
36:19 - 36:22theta S0 A0. So there's the first derivative
-
36:22 - 36:26term. There's the second one. Then you have plus dot, dot, dot, and so on
-
36:26 - 36:28for the remaining terms.
-
36:28 - 36:31That's my last term.
-
36:31 - 36:38So that was step one of four.
-
36:51 - 37:07And so by algebra - let me just write this down
-
38:04 - 38:07and convince us all that it's true.
-
38:07 - 38:09This is the second of four steps
-
38:09 - 38:10in which you can
-
38:10 - 38:14just convince yourself that if I expand out - take the sum and multiply it by
-
38:14 - 38:15that big product in front,
-
38:15 - 38:19then I get back that sum of terms I had. It's essentially -
-
38:19 - 38:23for example, when I multiply out, this product on top of this ratio, of this
-
38:23 - 38:25first fraction,
-
38:25 - 38:29then pi subscript theta S0 A0, that would cancel out this pi subscript
-
38:29 - 38:31theta S0 A0
-
38:31 - 38:35and replace it with the derivative with respect to theta of pi theta S0 A0. So that algebra was
-
38:35 - 38:42the second step. But
-
38:45 - 38:48that term on top is just
-
38:48 - 38:54what I worked out previously
-
38:54 - 38:59- was the joint probability of the state action sequence,
-
38:59 - 39:32and now I have that times that times the payoff.
-
39:32 - 40:07And so by the definition of expectation, this is just equal to that thing times the payoff.
-
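In summary, reconstructing the blackboard steps, the derivation presumably arrives at

$$\nabla_\theta\, \mathbb{E}[\text{payoff}] = \mathbb{E}\left[\left( \frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)} + \cdots + \frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)} \right) \times \text{payoff}\right],$$

where the expectation is over the state-action sequences generated by executing $\pi_\theta$ - which is exactly the expected value of the update the algorithm takes.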
40:07 - 40:08So
-
40:08 - 40:12this thing inside the expectation, this is exactly
-
40:12 - 40:19the step that we were taking in the inner loop of our reinforce algorithm,
-
40:19 - 40:21roughly the reinforce algorithm. This
-
40:21 - 40:26proves that the expected value of our change to theta
-
40:26 - 40:31is exactly in the direction of the gradient of our expected payoff. That's how
-
40:31 - 40:33I started this whole derivation. I said
-
40:33 - 40:36let's look at our expected payoff and take the derivative of that with respect to theta.
-
40:37 - 40:41What we've proved is that on expectation,
-
40:41 - 40:44the step direction taken by reinforce is exactly the gradient of
-
40:44 - 40:46the thing I'm trying to
-
40:46 - 40:49optimize. This shows that this algorithm is
-
40:49 - 40:54a stochastic gradient ascent
-
40:54 - 40:58algorithm. I wrote a lot. Why don't you take a minute to look at the equations and [inaudible]
-
40:58 - 41:01check if everything makes sense. I'll erase a couple of boards and then check if you have
-
41:01 - 41:08questions after that. Questions?
-
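Putting the pieces together, here is a minimal Python sketch of the (approximately) reinforce update for the two-action logistic policy from earlier; `simulate_step` and `reward` stand in for an MDP simulator and reward function and are hypothetical.

```python
import numpy as np

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(s, a) for the logistic policy
    pi(s, right) = sigmoid(theta^T s), pi(s, left) = 1 - sigmoid(theta^T s)."""
    p_right = 1.0 / (1.0 + np.exp(-theta @ s))
    return (1.0 - p_right) * s if a == "right" else -p_right * s

def reinforce_update(theta, s0, T, alpha, simulate_step, reward, rng):
    """One stochastic gradient ascent step: sample a trajectory by executing
    pi_theta, compute its payoff, and move theta along
    (sum_t grad log pi_theta(s_t, a_t)) * payoff."""
    s, grad_sum, payoff = s0, np.zeros_like(theta), 0.0
    for _ in range(T):
        a = "right" if rng.random() < 1.0 / (1.0 + np.exp(-theta @ s)) else "left"
        grad_sum += grad_log_pi(theta, s, a)
        payoff += reward(s)
        s = simulate_step(s, a)      # random next state from the simulator
    payoff += reward(s)              # include R(s_T)
    return theta + alpha * grad_sum * payoff
```

Note that the gradient of log pi equals the ratio of the gradient of pi to pi, so on expectation this is the same update as in the formula above.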
41:43 - 41:50Could you
-
41:54 - 41:56raise your hand if this makes sense?
-
42:01 - 42:04Great.
-
42:04 - 42:07Some of the comments - we talked about
-
42:07 - 42:11those value function approximation approaches where you approximate
-
42:11 - 42:14V star, then you go from V star
-
42:14 - 42:18to pi star. Then there are also these policy search approaches, where you try to approximate the policy directly.
-
42:18 - 42:22So let's talk briefly about when either one may be preferable.
-
42:22 - 42:24It turns out that
-
42:24 - 42:27policy search algorithms are especially effective
-
42:27 - 42:31when you can choose a simple policy class pi.
-
42:32 - 42:36So the question really is for your problem
-
42:36 - 42:40does there exist a simple function like a linear function or a logistic function
-
42:40 - 42:45that maps from features of the state to the action that works pretty well.
-
42:45 - 42:49So for the problem of the inverted pendulum, this is quite likely to be true.
-
42:49 - 42:52Going through all the different choices of parameters, you can say
-
42:52 - 42:55things like if the pole's leaning towards the right,
-
42:55 - 42:58then accelerate towards the right to try to catch it. For the
-
42:58 - 43:02inverted pendulum, this is probably true. For lots of what's called
-
43:02 - 43:05low level control tasks, things like driving a car,
-
43:05 - 43:09the low level reflexes of do you steer your car left to avoid another car, do
-
43:09 - 43:13you steer your car left to follow the road, flying a helicopter,
-
43:13 - 43:18again very short time scale types of decisions - I like to think of these as
-
43:18 - 43:22the sorts of decisions a trained operator makes - like a trained driver or a trained
-
43:22 - 43:24pilot. It would almost
-
43:24 - 43:27be a reflex, these sorts of very quick instinctive things where
-
43:27 - 43:31you map very directly from the input state to an action. These are
-
43:31 - 43:32problems for which
-
43:32 - 43:36you can probably choose a reasonable policy class like a logistic function or something,
-
43:36 - 43:38and it will often work well.
-
43:38 - 43:42In contrast, if you have problems that require
-
43:42 - 43:46long multistep reasoning, so things like a game of chess where you have to
-
43:46 - 43:47reason carefully about if
-
43:47 - 43:50I do this, then they'll do that, then they'll do this, then they'll do that.
-
43:50 - 43:56Those I think of as less instinctual, very high level decision making.
-
43:56 - 44:00For problems like that,
-
44:00 - 44:06I would sometimes use value function approximation approaches instead. Let
-
44:06 - 44:09me say more about this later.
-
44:09 - 44:14The last thing I want to do is actually tell you about -
-
44:14 - 44:22I guess just as a side comment,
-
44:22 - 44:25it turns out also that if you have POMDP, if you have a partially observable
-
44:25 - 44:28MDP - I don't want to say too much about this -
-
44:28 - 44:37it turns out that if you only have an approximation
-
44:37 - 44:42- let's call it S hat of the true
-
44:42 - 44:47state, and so this could be S hat equals S of T given T from Kalman filter -
-
44:57 - 45:13then you can still use these sorts of policy search algorithms where you can say pi theta of S
-
45:13 - 45:16hat comma A - There are various other ways you can use
-
45:16 - 45:19policy search algorithms for POMDPs, but this is one of them
-
45:19 - 45:21where if you only have estimates of the state, you can then
-
45:21 - 45:26choose a policy class that only looks at your estimate of the state to choose the action.
-
45:26 - 45:31By using the same way of estimating the states in both training and testing,
-
45:31 - 45:33this will usually do some -
-
45:33 - 45:37so these sorts of policy search algorithms can be applied
-
45:37 - 45:47often reasonably effectively to
-
45:47 - 45:50POMDPs as well. There's one more algorithm I wanna talk about, but some final
-
45:50 - 45:55words on the reinforce algorithm. It turns out the reinforce algorithm
-
45:55 - 46:00often works well but is often extremely slow. So it generally
-
46:00 - 46:04works, but one thing to watch out for is that because you're taking these
-
46:04 - 46:07gradient ascent steps that are very noisy, you're sampling a state action sequence, and then
-
46:07 - 46:08you're sort of
-
46:08 - 46:10taking a gradient ascent step in essentially a sort
-
46:10 - 46:13of random direction that only on expectation is
-
46:13 - 46:14correct. The
-
46:14 - 46:18gradient ascent direction for reinforce can sometimes be a bit noisy,
-
46:18 - 46:22and so it's not that uncommon to need like a million iterations of gradient ascent,
-
46:22 - 46:24or ten million, or 100 million
-
46:24 - 46:26iterations of gradient ascent
-
46:26 - 46:29for reinforce to converge, so that's just something to watch out for.
-
46:29 - 46:41One consequence of that is in the reinforce algorithm - I shouldn't
-
46:41 - 46:45really call it reinforce. In what's essentially the reinforce algorithm, there's
-
46:45 - 46:48this step where you need to sample a state action sequence.
-
46:48 - 46:51So in
-
46:51 - 46:54principle you could do this on your own robot. If there were a robot you were trying to
-
46:54 - 46:55control, you can actually
-
46:55 - 46:59physically initialize in some state, pick an action and so on, and go from there
-
46:59 - 47:00to sample a state action sequence.
-
47:00 - 47:02But
-
47:02 - 47:05if you need to do this ten million times, you probably don't want to run
-
47:05 - 47:07your robot ten million times.
-
47:07 - 47:13I personally have seen many more applications of reinforce in simulation.
-
47:13 - 47:17You can easily run ten thousand simulations or ten million simulations of your robot in
-
47:17 - 47:21simulation maybe, but you might not want to do that -
-
47:21 - 47:25have your robot physically repeat some action ten million times.
-
47:25 - 47:28So I personally have seen many more applications of reinforce
-
47:28 - 47:30to learn using a simulator
-
47:30 - 47:35than to actually do this on a physical device.
-
47:35 - 47:38The last thing I wanted to do is tell you about one other algorithm,
-
47:38 - 47:45one final policy search algorithm. Could I have the laptop display, please? It's
-
47:52 - 47:56a policy search algorithm called Pegasus that's
-
47:56 - 48:00actually what we use on our autonomous helicopter flight things
-
48:00 - 48:05for many years. There are some other things we do now. So
-
48:05 - 48:08here's the idea. There's a
-
48:08 - 48:11reminder slide on RL formalism. There's nothing here that you don't know, but
-
48:11 - 48:17I just want to pictorially describe the RL formalism because I'll use that later.
-
48:17 - 48:21I'm gonna draw the reinforcement learning picture as follows. You
-
48:21 - 48:22initialize the
-
48:22 - 48:24system, say a
-
48:24 - 48:26helicopter or whatever, in some state S0,
-
48:26 - 48:28you choose an action A0,
-
48:28 - 48:32and then, say, the helicopter dynamics takes you to some new state S1, you choose
-
48:32 - 48:33some other action A1,
-
48:33 - 48:35and so on.
-
48:35 - 48:39And then you have some reward function, which you apply to the sequence of states you went through,
-
48:39 - 48:41and that's your total payoff. So
-
48:41 - 48:47this is just a picture I wanna use to summarize the RL problem.
-
48:48 - 48:51Our goal is to maximize the expected payoff,
-
48:51 - 48:53which is this, the expected sum of the rewards.
-
48:53 - 48:57And our goal is to learn the policy, which I denote by a green box.
-
48:57 - 49:02So our policy - and I'll switch back to deterministic policies for now.
-
49:02 - 49:09So my deterministic policy will be some function mapping from the states to the actions.
-
49:09 - 49:15As a concrete example, you imagine that in the policy search setting,
-
49:15 - 49:17you may have a linear class of policies.
-
49:17 - 49:23So you may imagine that the action A will be say a linear function of the states,
-
49:23 - 49:28and your goal is to learn the parameters of the linear function.
-
49:28 - 49:33So imagine trying to do linear regression on policies, except you're trying to optimize the reinforcement learning
-
49:33 - 49:36objective. So just [inaudible] imagine that the action A
-
49:36 - 49:40is theta transpose S, and you use policy search to come up with
-
49:40 - 49:42good parameters theta so as
-
49:42 - 49:49to maximize the expected payoff. That would be one setting in which this picture applies. There's the idea.
-
49:49 - 49:50Quite often
-
49:50 - 49:54we come up with a model or a simulator for the MDP, and as before
-
49:54 - 49:57a model or a simulator is just a box
-
49:57 - 49:59that takes as input
-
49:59 - 50:00some state ST,
-
50:00 - 50:03takes as input some action AT,
-
50:03 - 50:07and then outputs some next state ST plus one of the sort you might get in the MDP.
-
50:07 - 50:10This ST plus one will be a random state. It will be drawn from
-
50:10 - 50:13the random state transition probabilities of the MDP.
-
50:13 - 50:17This is important. Very important, ST plus one
-
50:17 - 50:22will be a random function of ST and AT. In the simulator, this is [inaudible].
-
50:22 - 50:26So for example, for autonomous helicopter flight, you might build a simulator
-
50:26 - 50:30using supervised learning, an algorithm like linear regression
-
50:30 - 50:32or some nonlinear extension to linear regression,
-
50:32 - 50:35so we can get a nonlinear model of the dynamics of what ST
-
50:35 - 50:41plus one is as a random function of ST and AT. Now
-
50:41 - 50:45once you have a simulator,
-
50:45 - 50:48given any fixed policy you can quite
-
50:48 - 50:52straightforwardly evaluate any policy in a simulator.
-
50:53 - 51:00Concretely, our goal is to find the policy pi mapping from states to actions, so the goal is to find the green box
-
51:00 - 51:03like that, such that it works well.
-
51:03 - 51:08So if you have any one fixed policy pi,
-
51:08 - 51:13you can evaluate the policy pi just using the simulator
-
51:13 - 51:16via the picture shown at the bottom of the slide.
-
51:16 - 51:19So concretely, you can take your initial state S0,
-
51:19 - 51:21feed it into the policy pi,
-
51:21 - 51:25your policy pi will output some action A0, you plug it in
-
51:25 - 51:28the simulator, the simulator outputs a random state S1, you feed
-
51:28 - 51:32S1 into the policy and so on, and you get a sequence of states S0 through ST
-
51:32 - 51:35that your helicopter flies through in simulation.
-
51:35 - 51:38Then sum up the rewards,
-
51:38 - 51:43and this gives you an estimate of the expected payoff of the policy.
-
51:43 - 51:45This picture is just a fancy way of saying fly
-
51:45 - 51:49your helicopter in simulation and see how well it does, and measure the sum of
-
51:49 - 51:54rewards you get on average in the simulator. The
-
51:54 - 51:57picture I've drawn here assumes that you run your policy through the
-
51:57 - 51:59simulator just once. In general,
-
51:59 - 52:01you would run the policy through the simulator
-
52:01 - 52:05some number of times and then average to get an average over M
-
52:05 - 52:11simulations to get a better estimate of the policy's expected payoff.
-
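A minimal Python sketch of this evaluation procedure; `simulator` and `reward` are assumed to be given, and the linear policy a = theta transpose S is just the example class mentioned above.

```python
import numpy as np

def estimate_expected_payoff(theta, s0, T, M, simulator, reward):
    """Estimate the expected payoff of the deterministic policy a = theta^T s
    by rolling it out M times in the stochastic simulator and averaging."""
    total = 0.0
    for _ in range(M):
        s = s0
        for _ in range(T):
            a = theta @ s            # linear policy: pi(s) = theta^T s
            total += reward(s)
            s = simulator(s, a)      # random next state from the simulator
        total += reward(s)           # include the reward of the final state
    return total / M
```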
52:13 - 52:15Now that I have a way
-
52:15 - 52:18- given any one fixed policy,
-
52:18 - 52:23this gives me a way to evaluate the expected payoff of that policy.
-
52:23 - 52:28So one reasonably obvious thing you might try to do is then
-
52:28 - 52:30just search for a policy,
-
52:30 - 52:34in other words search for parameters theta for your policy, that
-
52:34 - 52:36gives you high estimated payoff.
-
52:36 - 52:40Does that make sense? So my policy has some parameters theta, so
-
52:40 - 52:41my policy is
-
52:41 - 52:46my actions A are equal to theta transpose S, say, if it's a linear policy.
-
52:46 - 52:50For any fixed value of the parameters theta,
-
52:50 - 52:51I can evaluate -
-
52:51 - 52:54I can get an estimate for how good the policy is using the simulator.
-
52:54 - 52:59One thing I might try to do is search for parameters theta to try to
-
52:59 - 53:03maximize my estimated payoff.
-
53:03 - 53:06It turns out you can sort of do that,
-
53:06 - 53:11but that idea as I've just stated is hard to get to work.
-
53:11 - 53:12Here's the
-
53:12 - 53:16reason. The simulator allows us to evaluate
-
53:16 - 53:18a policy, so let's search for a policy of high
-
53:18 - 53:21value. The difficulty is that the simulator is random,
-
53:21 - 53:23and so every time we evaluate a policy,
-
53:23 - 53:27we get back a very slightly different answer. So
-
53:27 - 53:29in the
-
53:29 - 53:33cartoon below, I want you to imagine that the horizontal axis is the space of policies.
-
53:33 - 53:37In other words, as I vary the
-
53:37 - 53:41parameters in my policy, I get different points on the horizontal axis here. As I
-
53:41 - 53:44vary the parameters theta, I get different policies,
-
53:44 - 53:46and so I'm moving along the X axis,
-
53:46 - 53:50and my total payoff I'm gonna plot on the vertical axis,
-
53:50 - 53:54and the red line in this cartoon is the expected payoff of the different
-
53:54 - 53:56policies.
-
53:56 - 54:02My goal is to find the policy with the highest expected payoff.
-
54:02 - 54:06You could search for a policy with high expected payoff, but every time you evaluate a policy
-
54:06 - 54:08- say I evaluate some policy,
-
54:08 - 54:10then I might get a point that
-
54:10 - 54:13just by chance looked a little bit better than it should have. If
-
54:13 - 54:16I evaluate a second policy, and just by chance it might look a little bit worse. I evaluate a third
-
54:16 - 54:21policy, fourth,
-
54:21 - 54:25sometimes you look here - sometimes I might actually evaluate exactly the same policy
-
54:25 - 54:28twice and get back slightly different
-
54:28 - 54:32answers just because my simulator is random, so when I apply the same policy
-
54:32 - 54:39twice in simulation, I might get back slightly different answers.
-
54:39 - 54:42So as I evaluate more and more policies, these are the pictures I get.
-
54:42 - 54:46My goal is to try to optimize the red line. I hope you appreciate this is a
-
54:46 - 54:51hard problem, especially when all your optimization algorithm gets to see are these
-
54:51 - 54:54black dots, and they don't have direct access to the red line.
-
54:54 - 54:58So when my input space is some fairly high dimensional space, if I have
-
54:58 - 55:02ten parameters, the horizontal axis would actually be a 10-D space,
-
55:02 - 55:07all I get are these noisy estimates of what the red line is. This is a very hard
-
55:07 - 55:10stochastic optimization problem.
-
55:10 - 55:16So it turns out there's one way to make this optimization problem much easier. Here's the idea.
-
55:16 - 55:21And the method is called Pegasus, which is an acronym for something.
-
55:21 - 55:23I'll tell you later.
-
55:23 - 55:26So the simulator repeatedly makes calls
-
55:26 - 55:28to a random number generator
-
55:28 - 55:33to generate random numbers RT, which are used to simulate the stochastic dynamics.
-
55:33 - 55:37What I mean by that is that the simulator takes this input of state
-
55:37 - 55:40and action, and it outputs the next state randomly,
-
55:40 - 55:44and if you peer into the simulator, if you open the box of the simulator and ask
-
55:44 - 55:50how is my simulator generating these random next states ST plus one,
-
55:50 - 55:54pretty much the only way to do so - pretty much the only way
-
55:54 - 55:57to write a simulator with random outputs is
-
55:57 - 56:00we're gonna make calls to a random number generator,
-
56:00 - 56:03and get random numbers, these RTs,
-
56:03 - 56:08and then you have some function that takes as input S0, A0, and
-
56:08 - 56:11the results of your random number generator,
-
56:11 - 56:14and it computes some next state as a function of the inputs and of the random
-
56:14 - 56:17number it got from the random number
-
56:17 - 56:18generator. This is
-
56:18 - 56:22pretty much the only way anyone implements any random code, any code
-
56:22 - 56:26that generates random outputs. You make a call to a random number generator,
-
56:26 - 56:31and you compute some function of the random number and of your other
-
56:31 - 56:35inputs. The reason that when you evaluate different policies you get
-
56:35 - 56:36different answers
-
56:36 - 56:38is because
-
56:38 - 56:41every time you rerun the simulator, you get a different sequence of random
-
56:41 - 56:44numbers from the random number generator,
-
56:44 - 56:46and so you get a different answer every time,
-
56:46 - 56:48even if you evaluate the same policy twice.
-
56:48 - 56:51So here's the idea.
-
56:51 - 56:56Let's say we fix in advance the sequence of random numbers,
-
56:56 - 57:00choose R1, R2, up to RT minus one.
-
57:00 - 57:03Fix the sequence of random numbers in advance,
-
57:03 - 57:07and we'll always use the same sequence of random numbers
-
57:07 - 57:10to evaluate different policies. Go
-
57:10 - 57:15into your code and fix R1, R2, through RT minus one.
-
57:15 - 57:18Choose them randomly once and then fix them forever.
-
57:18 - 57:21If you always use the same sequence of random numbers, then
-
57:21 - 57:25the system is no longer random, and if you evaluate the same policy twice,
-
57:25 - 57:29you get back exactly the same answer.
-
57:29 - 57:33And so these sequences of random numbers - we call them scenarios, and
-
57:33 - 57:37Pegasus is an acronym for policy evaluation of goodness and search using scenarios.
-
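Here is a minimal Python sketch of the Pegasus idea: draw the scenarios (fixed sequences of random numbers) once, then evaluate every policy against those same scenarios, so the estimated payoff becomes a deterministic function of the parameters theta. The simulator is assumed to accept the random numbers as an explicit argument; everything here is a hypothetical placeholder, not the lecture's actual implementation.

```python
import numpy as np

def make_scenarios(M, T, noise_dim, seed=0):
    """Draw M fixed sequences of random numbers once, up front."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((M, T, noise_dim))

def pegasus_payoff(theta, s0, scenarios, simulator, reward):
    """Deterministic payoff estimate: calling this twice with the same theta
    reuses the same random numbers and returns exactly the same value."""
    total = 0.0
    for noise_seq in scenarios:          # one scenario = one fixed noise sequence
        s = s0
        for r_t in noise_seq:
            a = theta @ s                # example policy class: a = theta^T s
            total += reward(s)
            s = simulator(s, a, r_t)     # next state is a deterministic function
                                         # of (s, a) and the fixed random numbers
        total += reward(s)
    return total / len(scenarios)

# This deterministic function of theta can then be handed to gradient ascent or
# any local search routine.
```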
57:37 - 57:39So
-
57:42 - 57:46when you do that, this is the picture you get.
-
57:46 - 57:50As before, the red line is your expected payoff, and by fixing the random numbers,
-
57:50 - 57:53you've defined some estimate of the expected payoff.
-
57:53 - 57:57And as you evaluate different policies, they're
-
57:57 - 58:01still only approximations to their true expected payoff, but at least now you have
-
58:01 - 58:01a
-
58:01 - 58:03deterministic function to optimize,
-
58:03 - 58:07and you can now apply your favorite algorithms, be it gradient ascent or
-
58:08 - 58:10some sort
-
58:10 - 58:12of local [inaudible] search
-
58:12 - 58:14to try to optimize the black
-
58:14 - 58:16curve. This gives you a much easier
-
58:16 - 58:18optimization problem, and you
-
58:18 - 58:23can optimize this to find hopefully a good policy. So this is
-
58:23 - 58:27the Pegasus policy search method.
-
58:27 - 58:27So when
-
58:27 - 58:31I started to talk about reinforcement learning, I showed
-
58:31 - 58:35that video of a helicopter flying upside down. That was actually done using
-
58:35 - 58:40exactly this method, using exactly this policy search algorithm. This
-
58:40 - 58:42seems to scale well
-
58:42 - 58:47even to fairly large problems, even to fairly high dimensional state spaces.
-
58:47 - 58:50Typically, where Pegasus policy search algorithms have been used -
-
58:50 - 58:53the optimization problem
-
58:53 - 58:56is still - is much easier than the stochastic version, but sometimes it's
-
58:56 - 58:58not entirely trivial, and so
-
58:58 - 59:00you have to apply this sort of method with
-
59:00 - 59:05maybe on the order of ten parameters or tens of parameters, so 30, 40
-
59:05 - 59:06parameters, but
-
59:06 - 59:08not thousands of parameters,
-
59:08 - 59:13at least with these sorts of methods.
-
59:13 - 59:19Student: So is that method different than just assuming that
-
59:19 - 59:27you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that
-
59:27 - 59:29we have a deterministic simulator?
-
59:29 - 59:31The answer is no. In the way you do this,
-
59:31 - 59:35for the sake of simplicity I talked about one sequence of random numbers.
-
59:35 - 59:37What you do is -
-
59:37 - 59:42so imagine that the random numbers are simulating different wind gusts against your helicopter.
-
59:42 - 59:46So what you want to do isn't really evaluate just against one pattern of wind gusts.
-
59:46 - 59:48What you want to do is
-
59:48 - 59:50sample some set of
-
59:50 - 59:54different patterns of wind gusts, and evaluate against all of them and average.
-
59:54 - 59:56So what you do is you actually
-
59:56 - 59:58sample say 100 -
-
59:58 - 60:01some number I made up like 100 sequences of random numbers,
-
60:01 - 60:03and every time you want to evaluate a policy,
-
60:03 - 60:09you evaluate it against all 100 sequences of random numbers and then average.
-
60:09 - 60:13This is in exactly the same way that on this earlier picture
-
60:13 - 60:17you wouldn't necessarily evaluate the policy just once. You evaluate it maybe 100
-
60:17 - 60:18times in simulation, and then
-
60:18 - 60:22average to get a better estimate of the expected reward.
-
60:22 - 60:23In the same way,
-
60:23 - 60:24you do that here
-
60:24 - 60:29but with 100 fixed sequences of random numbers. Does that make sense? Any
-
60:29 - 60:36other questions? Student: If we use 100 scenarios and get an estimate for the policy, versus evaluating the policy 100 times with fresh random numbers, won't you get similar estimates?
-
60:48 - 60:52Yeah. I guess you're right. So the quality - for a fixed policy,
-
60:52 - 60:57the quality of the approximation is equally good for both cases. The
-
60:57 - 61:01advantage of fixing the random numbers is that you end up with
-
61:01 - 61:04an optimization problem that's much easier.
-
61:05 - 61:06I have some search problem,
-
61:06 - 61:10and on the horizontal axis there's a space of control policies,
-
61:10 - 61:13and my goal is to find a control policy that
-
61:13 - 61:15maximizes the payoff.
-
61:15 - 61:19The problem with this earlier setting was that
-
61:19 - 61:22when I evaluate policies I get these noisy estimates,
-
61:22 - 61:23and then it's
-
61:23 - 61:27just very hard to optimize the red curve if I have these points that are all over the
-
61:27 - 61:29place. And if I evaluate the same policy twice, I don't even get back the same
-
61:31 - 61:33answer. By fixing the random numbers,
-
61:33 - 61:37the algorithm still doesn't get to see the red curve, but at least it's
-
61:37 - 61:41now optimizing a deterministic function. That makes the
-
61:41 - 61:42optimization problem much easier. Does
-
61:42 - 61:49that make sense? Student:So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves
-
61:51 - 61:58that are easy to optimize. And then you smush them together?
-
61:58 - 61:59Let's see.
-
61:59 - 62:05I have just one nice black curve that I'm trying to optimize. Student: For each scenario?
-
62:05 - 62:09I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios.
-
62:09 - 62:12So the black curve here is defined by averaging over
-
62:12 - 62:16a large set of scenarios. Does that make sense?
-
62:16 - 62:20So instead of only one - if the averaging thing doesn't make sense, imagine that there's
-
62:20 - 62:24just one sequence of random numbers. That might be easier to think
-
62:24 - 62:27about. Fix one sequence of random numbers, and every time I evaluate another policy,
-
62:27 - 62:31I evaluate against the same sequence of random numbers, and that gives me
-
62:31 - 62:36a nice deterministic function to optimize.
-
62:36 - 62:39Any other questions?
-
62:39 - 62:41The advantage is really
-
62:41 - 62:45that - one way to think about it is when I evaluate the same policy twice,
-
62:45 - 62:50at least I get back the same answer. This gives me a deterministic function
-
62:50 - 62:54mapping from parameters in my policy
-
62:54 - 62:56to my estimate of the expected payoff.
-
62:58 - 63:13That's just a function that I can try to optimize using the search algorithm.
-
63:13 - 63:17So we use this algorithm for inverted hovering,
-
63:17 - 63:20and again policy search algorithms tend to work well when you can find
-
63:20 - 63:24a reasonably simple policy mapping from
-
63:24 - 63:29the states to the actions. This is especially true for the low level control tasks,
-
63:29 - 63:33which I think of as sort of reflexes almost.
-
63:33 - 63:34Conversely,
-
63:34 - 63:37if you want to solve a problem like Tetris where you might plan
-
63:37 - 63:40ahead a few steps about what's a nice configuration of the board, or
-
63:40 - 63:42something like a game of chess,
-
63:42 - 63:47or problems of long path planning - go here, then go there, then go there -
-
63:47 - 63:51then sometimes you might apply a value function method instead.
-
63:51 - 63:55But for tasks like helicopter flight, for
-
63:55 - 63:59low level control tasks, for the reflexes of driving or controlling
-
64:01 - 64:06various robots, policy search algorithms - where
-
64:06 - 64:11we can sometimes more easily approximate the policy directly - work very well.
-
64:12 - 64:16Some final comments on the state of RL today. RL algorithms are applied
-
64:16 - 64:18to a wide range of problems,
-
64:18 - 64:22and the key is really sequential decision making. The place where you
-
64:22 - 64:26think about applying reinforcement learning algorithm is when you need to make a decision,
-
64:26 - 64:28then another decision, then another decision,
-
64:28 - 64:32and some of your actions may have long-term consequences. I think that
-
64:32 - 64:38is the heart of RL: sequential decision making, where you make multiple decisions,
-
64:38 - 64:42and some of your actions may have long-term consequences. I've shown you
-
64:42 - 64:44a bunch of robotics examples. RL
-
64:44 - 64:46is also
-
64:46 - 64:49applied to things like medical decision making, where you may have a patient and you
-
64:49 - 64:52want to choose a sequence of treatments, and you do this now for the patient, and the
-
64:52 - 64:56patient may be in some other state, and you choose to do that later, and so on.
-
64:56 - 65:01It turns out there's a large community of people applying these sorts of tools to queues. So
-
65:01 - 65:05imagine you have a bank where you have people lining up, and
-
65:05 - 65:06after they go to one cashier,
-
65:06 - 65:10some of them have to go to the manager to deal with something else. You have
-
65:10 - 65:13a system of multiple people standing in line in multiple queues, and so how do you route
-
65:13 - 65:18people optimally to minimize the waiting
-
65:18 - 65:20time. And not just people, but
-
65:20 - 65:24objects in an assembly line and so on. It turns out there's a surprisingly large community
-
65:24 - 65:26working on optimizing queues.
-
65:26 - 65:30I mentioned game playing a little bit already. Things
-
65:30 - 65:33like financial decision making, if you have a large amount of stock,
-
65:33 - 65:38how do you sell off a large amount - how do you time the selling off of your stock
-
65:38 - 65:41so as to not affect market prices adversely too much? There
-
65:41 - 65:45are many operations research problems, things like factory automation. Can you design
-
65:45 - 65:46a factory
-
65:46 - 65:50to optimize throughput, or minimize cost, or whatever. These are all
-
65:50 - 65:54sorts of problems that people are applying reinforcement learning algorithms to.
-
65:54 - 65:58Let me just close with a few robotics examples because they're always fun, and we just
-
65:58 - 65:59have these videos.
-
66:00 - 66:07This video is the work of Zico Kolter and Pieter Abbeel, who are PhD students
-
66:07 - 66:14here. They were working on getting a robot dog to climb over difficult, rugged
-
66:14 - 66:17terrain. Using a reinforcement learning algorithm,
-
66:17 - 66:20they applied an approach that's
-
66:20 - 66:24similar to a value function approximation approach, not quite but similar.
-
66:24 - 66:29They allowed the robot dog to sort of plan ahead multiple steps, and
-
66:29 - 66:34carefully choose its footsteps and traverse rugged terrain.
-
66:34 - 66:43This is really state of the art in terms of what you can get a robotic dog to do.
-
66:43 - 66:46Here's another fun one.
-
66:46 - 66:48It turns out that
-
66:48 - 66:51wheeled robots are very fuel-efficient. Cars and trucks are almost the
-
66:51 - 66:55most fuel-efficient robots in the world.
-
66:55 - 66:57Cars and trucks are very fuel-efficient,
-
66:57 - 67:01but legged robots have the ability to traverse more rugged terrain.
-
67:01 - 67:05So this is a robot - this is actually a small scale mockup of a larger
-
67:05 - 67:07vehicle built by Lockheed Martin,
-
67:07 - 67:10but can you come up with a vehicle that
-
67:10 - 67:14has wheels and has the fuel efficiency of wheeled robots, but also
-
67:14 - 67:19has legs so it can traverse obstacles?
-
67:19 - 67:27Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles, with
-
67:27 - 67:31somewhat complex gaits that would be very hard to design by hand.
-
67:31 - 67:33But by choosing a reward function that tells the robot
-
67:33 - 67:37there's a plus-one reward for getting to the top, which is the goal,
-
67:37 - 67:45and a few other things, it learns these sorts of policies automatically.
-
67:47 - 67:55Last couple fun ones - I'll show you a couple last helicopter videos.
-
67:57 - 68:09This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates
-
68:09 - 68:33who have been working on autonomous flight. I'll just let you watch. I'll just
-
69:49 - 69:54show you one more. [Inaudible] do this with a real helicopter [inaudible]?
-
69:54 - 69:55Not a full-size helicopter.
-
69:55 - 70:01Only small radio control helicopters. [Inaudible]. Full-size helicopters
-
70:01 - 70:06are the wrong design for this. You shouldn't do this on a full-size helicopter.
-
70:06 - 70:13Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more.
-
70:13 - 70:19Are there any human [inaudible]? Yes,
-
70:19 - 70:26there are very good human pilots that can.
-
70:28 - 70:35This is just one more
-
70:44 - 70:46maneuver. That was kind of fun. So this is the work of
-
70:46 - 70:56Pieter Abbeel and Adam Coates. So that was it. That was all the technical material
-
70:56 - 71:03I wanted to present in this class. I guess
-
71:03 - 71:10you're all experts on machine learning now. Congratulations.
-
71:10 - 71:14And I hope you've gotten the sense through this class that this is one of the technologies
-
71:14 - 71:19that's really having a huge impact on science, engineering, and industry.
-
71:19 - 71:23As I said in the first lecture, I think many people use machine
-
71:23 - 71:30learning algorithms dozens of times a day without even knowing about it.
-
71:30 - 71:34Based on the projects you've done, I hope that
-
71:34 - 71:38most of you will be able to imagine yourself
-
71:38 - 71:43going out after this class and applying these things to solve a variety of problems.
-
71:43 - 71:46Hopefully, some of you will also imagine yourselves
-
71:46 - 71:49writing research papers after this class, be it on
-
71:49 - 71:52a novel way to do machine learning, or on some way of applying machine
-
71:52 - 71:55learning to a problem that you care
-
71:55 - 71:59about. In fact, looking at project milestones, I'm actually sure that a large fraction of
-
71:59 - 72:04the projects in this class will be publishable, so I think that's great.
-
72:07 - 72:11I guess many of you will also go into industry, make new products, and make lots
-
72:11 - 72:13of money using learning
-
72:13 - 72:16algorithms. Remember me
-
72:16 - 72:20if that happens. One of the things I'm excited about is that, through research or
-
72:20 - 72:23industry, I'm actually completely sure that the people in this class in the
-
72:23 - 72:24next few months
-
72:24 - 72:27will apply machine learning algorithms to lots of problems in
-
72:27 - 72:32industrial management, and computer science, things like optimizing computer architectures,
-
72:32 - 72:33network security,
-
72:33 - 72:38robotics, computer vision, to problems in computational biology,
-
72:39 - 72:42to problems in aerospace, or understanding natural language,
-
72:42 - 72:44and many more things like that.
-
72:44 - 72:46So
-
72:46 - 72:50right now I have no idea what all of you are going to do with the learning algorithms you learned about,
-
72:50 - 72:52but
-
72:52 - 72:55every time I wrap up this class, I always
-
72:55 - 72:58feel very excited, and optimistic, and hopeful about the sorts of amazing
-
72:58 - 73:03things you'll be able to do.
-
73:03 - 73:10One final thing, I'll just give out this handout.
-
73:31 - 73:35One final thing is that machine learning has grown out of a larger literature on AI,
-
73:35 - 73:36where
-
73:36 - 73:42there's this desire to build systems that exhibit intelligent behavior. Machine learning
-
73:42 - 73:45is one of the tools of AI, maybe one that's had a
-
73:45 - 73:47disproportionately large impact,
-
73:47 - 73:52but there are many other ideas in AI that I hope you
-
73:52 - 73:55go and continue to learn about. Fortunately, Stanford
-
73:55 - 73:59has one of the best and broadest sets of AI classes, and I hope that
-
73:59 - 74:04you take advantage of some of these classes, and go and learn more about AI, and more
-
74:04 - 74:08about other fields which often apply learning algorithms to problems in vision,
-
74:08 - 74:10problems in natural language processing, in robotics, and so on.
-
74:10 - 74:15So the handout I just gave out has a list of AI related courses. Just
-
74:15 - 74:18running down very quickly, I guess, CS221 is an overview that I
-
74:18 - 74:19teach. There
-
74:19 - 74:24are a lot of robotics classes also: 223A,
-
74:24 - 74:25225A, 225B
-
74:25 - 74:29- many robotics classes. There are so many applications
-
74:29 - 74:32of learning algorithms to robotics today. 222
-
74:32 - 74:36and 227 are knowledge representation and reasoning classes.
-
74:36 - 74:40228 - of all the classes on this list, 228, which Daphne
-
74:40 - 74:43Koller teaches, is probably closest in spirit to 229. This is one of
-
74:43 - 74:45the classes I highly recommend
-
74:45 - 74:49to all of my PhD students as well.
-
74:49 - 74:53Many other classes also touch on machine learning. On the next page,
-
74:54 - 74:58courses on computer vision, speech recognition, natural language processing,
-
74:58 - 75:03various tools that aren't just machine learning, but often involve machine learning in many ways.
-
75:03 - 75:07Other aspects of AI, multi-agent systems taught by [inaudible].
-
75:10 - 75:13EE364A is convex optimization. It's a class taught by
-
75:13 - 75:17Steve Boyd, and convex optimization came up many times in this class. If you
-
75:17 - 75:21want to become really good at it, EE364 is a great class. If you're
-
75:21 - 75:25interested in project courses, I also teach a project class
-
75:25 - 75:27next quarter where we spend the whole quarter working
-
75:27 - 75:32on research projects.
-
75:32 - 75:38So I hope you go and take some more of those classes.
-
75:38 - 75:40In closing,
-
75:40 - 75:43let me just say
-
75:43 - 75:47this class has been really fun to teach, and it's very satisfying to
-
75:47 - 75:52me personally when we set these insanely difficult homeworks, and
-
75:52 - 75:55then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's
-
75:55 - 75:59actually very satisfying when I see that.
-
75:59 - 76:05Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable,"
-
76:05 - 76:05So
-
76:05 - 76:09I hope you take what you've learned, go forth, and
-
76:09 - 76:12do amazing things with learning algorithms.
-
76:12 - 76:16I know this is a heavy workload class, so thank you all very much for the hard
-
76:16 - 76:20work you've put into this class, and the hard work you've put into learning this material,
-
76:20 - 76:22and
-
76:22 - 76:23thank you very much for having been students in this class.