Lecture 17 | Machine Learning (Stanford)
-
0:10 - 0:14This presentation is delivered by the Stanford Center for Professional
-
0:14 - 0:21Development.
-
0:22 - 0:24Okay, good morning. Welcome back.
-
0:24 - 0:27So I hope all of you had a good Thanksgiving break.
-
0:27 - 0:31After the problem sets, I suspect many of us needed one.
-
0:31 - 0:36Just one quick announcement so as I announced by email
-
0:36 - 0:38a few days ago, this afternoon
-
0:38 - 0:38we'll be
-
0:38 - 0:43doing another tape ahead of lecture, so I won't physically be here on
-
0:43 - 0:44Wednesday, and so we'll be taping
-
0:44 - 0:46this Wednesday's lecture ahead of time.
-
0:46 - 0:47If you're free
-
0:47 - 0:53this afternoon, please come to that; it'll be at 3:45 p.m.
-
0:53 - 0:58in the Skilling Auditorium in Skilling 193
-
0:58 - 1:00at 3:45. But of course, you can also just show up
-
1:00 - 1:07in class as usual at the usual time or just watch it online as usual also.
-
1:07 - 1:13Okay, welcome back. What I want to do today is continue our discussion on
-
1:13 - 1:16Reinforcement Learning in MDPs. Quite a
-
1:16 - 1:20long topic for me to go over today, so most of today's lecture will be on
-
1:20 - 1:22continuous state MDPs,
-
1:22 - 1:24and in particular,
-
1:24 - 1:27algorithms for solving continuous state MDPs,
-
1:27 - 1:30so I'll talk just very briefly about discretization.
-
1:30 - 1:34I'll spend a lot of time talking about models, assimilators of MDPs,
-
1:34 - 1:39and then talk about one algorithm called fitted value iteration and
-
1:40 - 1:45two functions which builds on that, and then
-
1:45 - 1:50hopefully, I'll have time to get to a second algorithm called, approximate policy
-
1:50 - 1:51iteration
-
1:51 - 1:53Just to recap,
-
1:53 - 1:55right, in the previous lecture,
-
1:55 - 1:58I defined the Reinforcement Learning problem and
-
1:58 - 2:03I defined MDPs, so let me just recap the notation.
-
2:04 - 2:11I said that an MDP or a Markov Decision Process, was a ? tuple,
-
2:14 - 2:16comprising
-
2:16 - 2:19those things and
-
2:19 - 2:22the running example of those using last time
-
2:22 - 2:25was this one
-
2:25 - 2:27right, adapted from
-
2:27 - 2:30the Russell and Norvig AI textbook.
-
2:30 - 2:32So
-
2:32 - 2:35in this example MDP that I was using,
-
2:35 - 2:39it had 11 states, so that's where S was.
-
2:39 - 2:46The actions were compass directions: north, south, east and west. The
-
2:47 - 2:50state transition probability is to capture
-
2:50 - 2:51chance of your
-
2:51 - 2:52transitioning to every state
-
2:52 - 2:56when you take any action in any other given state and so
-
2:56 - 2:58in our example that captured
-
2:58 - 3:01the stochastic dynamics of our
-
3:01 - 3:04robot wondering around [inaudible], and we said if you take the action north
-
3:04 - 3:06and the south,
-
3:06 - 3:09you have a .8 chance of actually going north and .1 chance of veering
-
3:09 - 3:10off,
-
3:10 - 3:13so that .1 chance of veering off to the right so
-
3:13 - 3:17said model of the robot's noisy
-
3:17 - 3:19dynamic with a [inaudible]
-
3:19 - 3:22and the
-
3:22 - 3:25reward function was that +/-1 at the
-
3:25 - 3:31absorbing states
-
3:31 - 3:33and
-
3:33 - 3:38-0.02
-
3:38 - 3:40elsewhere.
-
3:40 - 3:41This is an example of an MDP, and
-
3:41 - 3:48that's what these five things were. Oh, and I used a discount
-
3:48 - 3:49factor G
-
3:49 - 3:54of usually a number slightly less than one, so that's the 0.99.
-
3:54 - 3:59And so our goal was to find the policy, the
-
3:59 - 4:02control policy and that's at ?,
-
4:02 - 4:04which is a function
-
4:04 - 4:06mapping from the states of the actions
-
4:06 - 4:10that tells us what action to take in every state,
-
4:10 - 4:14and our goal was to find a policy that maximizes the expected value of
-
4:14 - 4:17our total payoff. So we want to find a
-
4:17 - 4:20policy. Well,
-
4:20 - 4:27let's see. We define value functions Vp (s) to be equal to
-
4:33 - 4:36this.
-
4:36 - 4:37We said that
-
4:37 - 4:41the value of a policy ? from State S was given by the expected
-
4:41 - 4:42value
-
4:42 - 4:46of the sum of discounted rewards, conditioned on your executing the policy ?
-
4:46 - 4:49and
-
4:49 - 4:54you're stating off your [inaudible] to say in the State S,
-
4:54 - 5:00and so our strategy for finding the policy was sort of comprised of
-
5:00 - 5:07two steps. So the goal is to find
-
5:12 - 5:15a good policy that maximizes the suspected value of
-
5:15 - 5:17the sum of discounted rewards,
-
5:17 - 5:21and so I said last time that one strategy for finding the [inaudible] of a
-
5:21 - 5:22policy
-
5:22 - 5:26is to first compute the optimal value function which I denoted V*(s) and is defined
-
5:26 - 5:29like that. It's the maximum
-
5:29 - 5:32value that any policy can obtain,
-
5:32 - 5:39and for example,
-
5:45 - 5:48the optimal value function
-
5:48 - 5:55for that MDP looks like this.
-
6:01 - 6:04So in other words, starting from any of these states,
-
6:04 - 6:07what's the expected value of the
-
6:07 - 6:09sum of discounted rewards you get,
-
6:09 - 6:12so this is
-
6:12 - 6:13V*.
-
6:13 - 6:15We also said that
-
6:15 - 6:17once you've found V*,
-
6:17 - 6:20you can compute the optimal policy
-
6:20 - 6:27using this.
-
6:34 - 6:36And
-
6:36 - 6:41so once you've found
-
6:41 - 6:48V and the last piece of this algorithm was Bellman's equations where
-
6:48 - 6:51we know that V*,
-
6:51 - 6:56the optimal sum of discounted rewards you can get for State S, is equal to the
-
6:56 - 6:59immediate reward you get just for starting off in that state
-
6:59 - 7:02+G(for the
-
7:02 - 7:06max over all the actions you could take)(your
-
7:12 - 7:14future
-
7:14 - 7:16sum of discounted
-
7:16 - 7:16rewards)(your
-
7:16 - 7:21future payoff starting from the State S(p) which is where you might transition to
-
7:21 - 7:22after 1(s).
-
7:22 - 7:26And so this gave us a value iteration algorithm,
-
7:26 - 7:30which was essentially
-
7:30 - 7:35V.I. I'm abbreviating value iteration as V.I., so in the value iteration
-
7:36 - 7:43algorithm, in V.I., you just take Bellman's equations and you repeatedly do this.
-
7:50 - 7:54So initialize some guess of the value functions. Initialize a zero as the sum
-
7:54 - 7:55rounding guess and
-
7:55 - 7:59then repeatedly perform this update for all states, and I said last time that if
-
7:59 - 8:01you do this repeatedly,
-
8:01 - 8:06then V(s) will converge to the optimal value function, V(s),
-
8:06 - 8:09you can
-
8:09 - 8:12compute the
-
8:12 - 8:14
-
8:14 - 8:19optimal policy ?*. Just one final thing I want to recap was
-
8:20 - 8:27the policy iteration algorithm
-
8:32 - 8:39in which we repeat the following two steps.
-
8:40 - 8:44So let's see, given a random initial policy, we'll solve for Vp. We'll solve for the
-
8:44 - 8:48value function for that specific policy.
-
8:48 - 8:52So this means for every state, compute the expected sum of discounted rewards
-
8:52 - 8:55for if you execute the policy ?
-
8:55 - 8:58from that state,
-
8:58 - 9:05and then the other step of policy
-
9:28 - 9:30iteration
-
9:30 - 9:32is
-
9:32 - 9:35having found the value function for your policy,
-
9:35 - 9:37you then update the policy
-
9:37 - 9:41pretending that you've already found the optimal value function, V*,
-
9:41 - 9:42and then you
-
9:42 - 9:46repeatedly perform these two steps where you solve for the value function for your current
-
9:46 - 9:47policy
-
9:47 - 9:51and then pretend that that's actually the optimal value function and solve for the
-
9:51 - 9:55policy given the value function, and you repeatedly update
-
9:55 - 9:59the value function or update the policy using that value function.
-
9:59 - 10:00And
-
10:00 - 10:04last time I said that this will also cause
-
10:04 - 10:08the estimated value function V to converge to V*
-
10:08 - 10:13and this will cause p to converge to ?*, the optimal policy.
-
10:13 - 10:14So those are based
-
10:14 - 10:18on our last lecture [inaudible] MDPs and introduced a lot of new
-
10:18 - 10:20notation symbols and just
-
10:20 - 10:22summarize all that again.
-
10:22 - 10:26What I'm about to do now, what I'm about to do for the rest of today's
-
10:26 - 10:28lecture is actually
-
10:28 - 10:32build on these two algorithms so
-
10:32 - 10:36I guess if you have any questions about this piece, ask now since I've got to go on please. Yeah. Student:[Inaudible] how
-
10:36 - 10:42those two algorithms are very
-
10:42 - 10:44different?
-
10:44 - 10:47Instructor (Andrew Ng):I see, right, so
-
10:47 - 10:52yeah, do you see that they're different? Okay,
-
10:52 - 10:54how it's different. Let's see.
-
10:54 - 10:55So
-
10:55 - 11:00well here's one difference. I didn't say this cause no longer use it today. So
-
11:00 - 11:04value iteration and policy iteration are different algorithms.
-
11:04 - 11:06In policy iteration
-
11:06 - 11:08in this
-
11:08 - 11:11step, you're given a fixed policy,
-
11:11 - 11:14and you're going to solve for the value function for that policy
-
11:15 - 11:21and so you're given some fixed policy ?, meaning
-
11:21 - 11:27some function mapping from the state's actions. So give you some policy
-
11:29 - 11:35and
-
11:35 - 11:39whatever. That's just some policy; it's not a great policy.
-
11:39 - 11:41And
-
11:41 - 11:45in that step that I circled, we have to find the ? of S
-
12:02 - 12:05which means that for every state you need to compute
-
12:05 - 12:08your expected sum of discounted rewards or if you execute
-
12:08 - 12:11this specific policy
-
12:11 - 12:16and starting off the MDP in that state S.
-
12:16 - 12:19So I showed this last time. I won't go into details today, so I said last
-
12:19 - 12:20time that
-
12:20 - 12:24you can actually solve for Vp by solving a linear system of
-
12:24 - 12:25equations.
-
12:25 - 12:27There was a form of Bellman's equations
-
12:27 - 12:29for Vp,
-
12:29 - 12:32and it turned out to be, if you write this out, you end up with a
-
12:32 - 12:36linear system of 11 equations of 11 unknowns and so you
-
12:36 - 12:39can actually solve for the value function
-
12:39 - 12:40for a fixed policy
-
12:40 - 12:42by solving like a system of
-
12:42 - 12:46linear equations with 11 variables and 11 constraints,
-
12:46 - 12:49and so
-
12:49 - 12:52that's policy iteration;
-
12:52 - 12:55whereas, in value iteration,
-
12:55 - 12:57going back on board,
-
12:57 - 13:01in value iteration you sort of repeatedly perform this update where you
-
13:01 - 13:04update the value of a state as the [inaudible].
-
13:04 - 13:11So I hope that makes sense that the algorithm of these is different. Student:[Inaudible] on the atomic kits
-
13:13 - 13:13so is the assumption
-
13:13 - 13:16that we can never get out of
-
13:16 - 13:18those states? Instructor
-
13:18 - 13:22(Andrew Ng):Yes. There's always things that you where you solve for this [inaudible], for example, and make the numbers
-
13:22 - 13:26come up nicely, but I don't wanna spend too much time on them, but yeah, so the
-
13:26 - 13:27assumption is that
-
13:27 - 13:30once you enter the absorbing state, then the world ends or there're no more
-
13:30 - 13:33rewards after that and
-
13:33 - 13:34you can think of
-
13:34 - 13:37another way to think of the absorbing states which is sort of
-
13:37 - 13:38mathematically equivalent.
-
13:38 - 13:41You can think of the absorbing states as
-
13:41 - 13:43transitioning with probability 1
-
13:43 - 13:46to sum 12 state, and then once you're in that
-
13:46 - 13:4912th state, you always
-
13:49 - 13:52remain in that 12th state, and there're no further rewards from there. If you
-
13:52 - 13:54want, you can think of this
-
13:54 - 13:57as actually an MDP with 12 states rather than 11 states,
-
13:57 - 13:59and the 12th state
-
13:59 - 14:05is this zero cost absorbing state that you get stuck in forever. Other
-
14:05 - 14:12questions? Yeah,
-
14:20 - 14:26please go. Student:Where did the Bellman's equations [inaudible] to optimal value [inaudible]? Instructor (Andrew Ng):Boy, yeah. Okay, this Bellman's equations, this equation that I'm pointing to, I sort
-
14:26 - 14:27of
-
14:27 - 14:31tried to give it justification for this last time.
-
14:31 - 14:35I'll
-
14:35 - 14:38say it in
-
14:38 - 14:42one sentence so that's that the expected total payoff I get, I
-
14:42 - 14:47expect to get something from the state as is equal to my immediate reward which is
-
14:47 - 14:50the reward I get for starting a state. Let's see. If I sum
-
14:50 - 14:54the state, I'm gonna get some first reward and then I can transition to
-
14:54 - 14:55some other state, and
-
14:55 - 14:59then from that other state, I'll get some additional rewards from then.
-
14:59 - 15:03So Bellman's equations breaks that sum into two pieces. It says the value of a
-
15:03 - 15:05state is equal to
-
15:05 - 15:07the reward you get right away
-
15:07 - 15:11is really, well.
-
15:11 - 15:17V*(s) is really equal to
-
15:31 - 15:33+G, so this is
-
15:36 - 15:40V into two terms and says that there's this first term which is the immediate reward, that, and then +G(the
-
15:40 - 15:42rewards
-
15:44 - 15:46you get in the future)
-
15:46 - 15:50which it turns out to be equal to that
-
15:50 - 15:52second row. I spent more time
-
15:52 - 15:55justifying this in the previous lecture,
-
15:55 - 16:00although yeah, hopefully, for the purposes of this lecture, if you're not sure where this is came, if
-
16:00 - 16:02you don't remember the justification of that, why don't you just
-
16:02 - 16:03maybe take my word for
-
16:03 - 16:06that this equation holds true
-
16:06 - 16:08since I use it a little bit later as well,
-
16:08 - 16:11and then the lecture notes sort of
-
16:11 - 16:16explain a little further the justification for why this equation might hold
-
16:16 - 16:17true. But for now,
-
16:17 - 16:21yeah, just for now take my word for it that this holds true cause we'll use it a little bit later
-
16:21 - 16:28today as well.
-
16:41 - 16:45[Inaudible] and would it be in sort of turn back into [inaudible]. Actually, [inaudible] right question is if in policy iteration if we represent
-
16:45 - 16:47? implicitly,
-
16:48 - 16:53using V(s), would it become equivalent to valuation,
-
16:53 - 16:58and the answer is sort of no.
-
16:58 - 17:02Let's see. It's true that policy iteration and value iteration are
-
17:02 - 17:07closely related algorithms, and there's actually a continuum between them, but
-
17:07 - 17:13yeah, it actually turns out that,
-
17:13 - 17:15oh, no,
-
17:22 - 17:24the
-
17:25 - 17:29algorithms are not equivalent. It's just in policy iteration,
-
17:29 - 17:33there is a step where you're solving for the value function for the policy vehicle is V,
-
17:33 - 17:35solve for Vp. Usually,
-
17:35 - 17:41you can do this, for instance, by solving a linear system of equations. In value
-
17:41 - 17:41iteration,
-
17:41 - 17:44it is a different
-
17:44 - 17:51algorithm, yes. I hope it makes sense that at least cosmetically it's different.
-
17:56 - 17:58[Inaudible]
-
17:58 - 18:02you have [inaudible] representing ? implicitly, then you won't have
-
18:02 - 18:07to solve that to [inaudible] equations. Yeah, the problem is - let's see. To solve for Vp, this works only if you have a fixed policy, so once you
-
18:07 - 18:09change a value function,
-
18:09 - 18:10if ? changes as well,
-
18:10 - 18:16then it's sort of hard to solve this.
-
18:16 - 18:19Yeah, so later on we'll actually talk about some examples of when ? is implicitly
-
18:19 - 18:20represented
-
18:20 - 18:24but at
-
18:24 - 18:26least for now it's I think there's " yeah.
-
18:26 - 18:30Maybe there's a way to redefine something, see a mapping onto value iteration but that's not
-
18:30 - 18:33usually done. These are
-
18:33 - 18:37viewed as different algorithms.
-
18:37 - 18:40Okay, cool, so all good questions.
-
18:40 - 18:43Let me move on and talk about how
-
18:43 - 18:50to generalize these ideas to continuous states.
-
18:59 - 19:01Everything we've done so far
-
19:01 - 19:03has been for
-
19:03 - 19:05discrete states or finite-state
-
19:05 - 19:10MDPs. Where, for example, here we had an MDP with a finite set of 11
-
19:10 - 19:11states
-
19:11 - 19:13and so
-
19:13 - 19:17the value function or V(s) or our estimate for the value function,
-
19:17 - 19:17V(s),
-
19:17 - 19:22could then be represented using an array of 11 numbers cause if you have 11 states, the
-
19:22 - 19:24value function
-
19:24 - 19:26needs to assign a real number
-
19:26 - 19:29to each of the 11 states and so to represent V(s)
-
19:29 - 19:32using an array of 11
-
19:32 - 19:33numbers.
-
19:33 - 19:37What I want to do for [inaudible] today is talk about
-
19:37 - 19:38continuous
-
19:38 - 19:42states, so for example,
-
19:42 - 19:45if you want to control
-
19:45 - 19:48any of the number of real [inaudible], so for example, if you want
-
19:48 - 19:51to control a car,
-
19:51 - 19:53
-
19:53 - 19:55a car
-
19:55 - 20:00is positioned given by XYT, as position
-
20:00 - 20:05and orientation and if you want to Markov the velocity as well, then Xdot, Ydot, Tdot,
-
20:05 - 20:05so these
-
20:05 - 20:07are so depending on
-
20:07 - 20:09whether you want to model the kinematics and
-
20:09 - 20:12so just position, or whether you want to model the dynamics, meaning the velocity as
-
20:12 - 20:15well.
-
20:15 - 20:16Earlier I showed you
-
20:16 - 20:20video of a helicopter
-
20:20 - 20:23that was flying, using a rain forest we're learning algorithms, so the
-
20:23 - 20:25helicopter which can
-
20:25 - 20:28fly in three-dimensional space rather than just drive on the 2-D plane, the
-
20:28 - 20:33state will be given by XYZ position,
-
20:35 - 20:38FT?, which is
-
20:38 - 20:43?[inaudible].
-
20:43 - 20:45The FT? is
-
20:45 - 20:49sometimes used to note the P[inaudible]
-
20:49 - 20:50of the helicopter,
-
20:50 - 20:52just orientation,
-
20:52 - 20:55and
-
20:59 - 21:03if you want to control a helicopter, you pretty much have to model
-
21:03 - 21:06velocity as well which means both linear velocity as well as angular
-
21:06 - 21:10velocity, and so this would be a 12-dimensional state.
-
21:10 - 21:13If you want an example that
-
21:13 - 21:14is
-
21:14 - 21:17kind of fun but unusual
-
21:17 - 21:20is,
-
21:20 - 21:25and I'm just gonna use this as an example
-
21:25 - 21:28and actually use this little bit example in today's lecture
-
21:28 - 21:31is the inverted pendulum problem which is
-
21:31 - 21:36sort of a long-running classic in reinforcement learning
-
21:36 - 21:39in which imagine that you
-
21:39 - 21:43have a little cart that's on a rail. The rail
-
21:43 - 21:45ends at some point
-
21:45 - 21:48and if you imagine that you have a pole
-
21:48 - 21:52attached to the cart, and this is a free hinge and so
-
21:52 - 21:54the pole here can rotate freely,
-
21:54 - 21:57and your goal is to control the cart and
-
21:57 - 21:59to move it back and forth on this rail
-
21:59 - 22:01so as to keep the pole balanced.
-
22:01 - 22:03Yeah,
-
22:03 - 22:05there's no long pole
-
22:05 - 22:05in
-
22:05 - 22:07this class but you know what I
-
22:07 - 22:08mean, so you
-
22:08 - 22:15can imagine. Oh, is there a long pole here?
-
22:15 - 22:17Back in the corner. Oh, thanks. Cool. So I did not practice this
-
22:17 - 22:20but you can take a long pole and sort of hold it up, balance,
-
22:20 - 22:25so imagine that you can do it better than I can. Imagine these are [inaudible] just moving
-
22:25 - 22:28back and forth to try to keep the pole balanced, so
-
22:28 - 22:31you can actually us the reinforcement learning algorithm to do that.
-
22:31 - 22:34This is actually one of the longstanding classic problems that
-
22:34 - 22:35people [inaudible]
-
22:35 - 22:38implement and play off using reinforcement learning algorithms,
-
22:38 - 22:41and so for this, the
-
22:41 - 22:42states would be X and
-
22:42 - 22:44T, so X
-
22:44 - 22:50would be the position of the cart,
-
22:50 - 22:56and T would be the orientation of the pole
-
22:56 - 23:00and also the linear velocity and the angular velocity of
-
23:00 - 23:01
-
23:01 - 23:08the pole, so I'll
-
23:12 - 23:15actually use this example a
-
23:15 - 23:19couple times. So to read continuous state space,
-
23:19 - 23:23how can you apply an algorithm like value iteration and policy iteration to solve the
-
23:23 - 23:24MDP to control
-
23:24 - 23:28like the car or a helicopter or something like the inverted
-
23:28 - 23:29pendulum?
-
23:29 - 23:32So one thing you can do and
-
23:32 - 23:35this is maybe the most straightforward thing is,
-
23:35 - 23:38if you have say a two-dimensional
-
23:38 - 23:42continuous state space, a S-1 and S-2 are my state variables, and in all the
-
23:42 - 23:44examples there
-
23:44 - 23:48are I guess between 4dimensional to 12-dimensional. I'll just draw 2-D here.
-
23:48 - 23:54The most straightforward thing to do would be to take the continuous state space and discretize
-
23:54 - 23:56it
-
23:56 - 24:03into a number of discrete cells.
-
24:07 - 24:12And I use S-bar to denote they're discretized or they're discrete
-
24:12 - 24:13states,
-
24:13 - 24:17and so you can [inaudible] with this continuous state problem with a finite or
-
24:17 - 24:18discrete set of states
-
24:18 - 24:19and then
-
24:19 - 24:23you can use policy iteration or value iteration
-
24:23 - 24:29to solve for
-
24:32 - 24:36V(s)-bar.
-
24:36 - 24:37And
-
24:37 - 24:40if you're robot is then in some state
-
24:40 - 24:42given by that dot,
-
24:42 - 24:46you would then figure out what discretized state it is in. In this case it's in,
-
24:46 - 24:48this discretized dygrid cell
-
24:48 - 24:49that's called
-
24:49 - 24:51S-bar,
-
24:51 - 24:53and then you execute.
-
24:53 - 24:56You choose the policy. You choose the action
-
24:56 - 24:59given by applied to that discrete
-
24:59 - 25:01state, so discretization is maybe the
-
25:01 - 25:07most straightforward way to turn a continuous state problem into a discrete state problem. Sometimes
-
25:07 - 25:10you can sorta make this work but a couple of reasons why
-
25:10 - 25:13this does not work very well.
-
25:13 - 25:15One reason is the following,
-
25:15 - 25:18and
-
25:18 - 25:21for this picture, let's even temporarily put aside reinforcement
-
25:21 - 25:23learning. Let's just
-
25:23 - 25:27think about doing regression for now and so
-
25:27 - 25:30suppose you have some invariable X
-
25:30 - 25:36and suppose I have some data,
-
25:36 - 25:38and I want to fill a function.
-
25:38 - 25:40Y is the function of X,
-
25:40 - 25:41so
-
25:41 - 25:45discretization is saying that I'm going to take my
-
25:45 - 25:48horizontal Xs and chop it up into a number of
-
25:48 - 25:49intervals.
-
25:49 - 25:52Sometimes I call these intervals buckets as well.
-
25:52 - 25:56We chop my horizontal Xs up into a number of buckets
-
25:56 - 25:59and then we're approximate this function
-
25:59 - 26:04using something that's piecewise constant
-
26:04 - 26:05in each of these buckets.
-
26:05 - 26:09And just look at this. This is clearly not a very good representation, right, and when we
-
26:09 - 26:10talk about regression,
-
26:10 - 26:14you just choose some features of X
-
26:14 - 26:18and run linear regression or something. You get a much better fit to the function.
-
26:18 - 26:22And so the sense that discretization just isn't a very good source
-
26:22 - 26:26of piecewise constant functions. This just isn't a very good function
-
26:26 - 26:28for representing many things,
-
26:28 - 26:32and there's also the sense that
-
26:32 - 26:36there's no smoothing or there's no generalization across the different buckets.
-
26:36 - 26:39And in fact, back in regression, I would never have chosen
-
26:39 - 26:43to do regression using this sort of visualization. It's just really doesn't make sense.
-
26:46 - 26:48And so in the same way,
-
26:48 - 26:51instead of
-
26:51 - 26:54X, V(s), instead of X and
-
26:54 - 26:58some hypothesis function of X, if you have the state here and you're trying to approximate the value
-
26:58 - 26:59function, then
-
26:59 - 27:02you can get discretization to work for many problems but maybe this isn't the best
-
27:02 - 27:03representation to represent
-
27:03 - 27:06a value function.
-
27:09 - 27:13The other problem with discretization and maybe the more serious
-
27:13 - 27:17problem is what's
-
27:17 - 27:21often somewhat fancifully called the curse of dimensionality.
-
27:26 - 27:30And just the observation that
-
27:31 - 27:34if the state space is in
-
27:34 - 27:39RN, and if you discretize each variable into K buckets,
-
27:39 - 27:40so
-
27:40 - 27:42if discretize each
-
27:44 - 27:46variable into K
-
27:46 - 27:48discrete values,
-
27:48 - 27:50then you get
-
27:50 - 27:53on the order of K to the power of N
-
27:55 - 27:56discrete states.
-
27:56 - 27:57In other words,
-
27:57 - 28:00the number of discrete states you end up with
-
28:00 - 28:02grows exponentially
-
28:02 - 28:05in the dimension of the problem,
-
28:05 - 28:06and so
-
28:06 - 28:09for a helicopter with 12-dimensional state
-
28:09 - 28:10space, this would be
-
28:10 - 28:13maybe like 100 to the power of 12, just huge, and it's
-
28:13 - 28:13not
-
28:13 - 28:15feasible.
-
28:15 - 28:19And so discretization doesn't scale well at all with two problems in high-dimensional
-
28:19 - 28:24state spaces,
-
28:24 - 28:25and
-
28:25 - 28:29this observation actually applies more generally than to
-
28:29 - 28:33just robotics and continuous state problems.
-
28:33 - 28:38For example, another fairly well-known applications
-
28:38 - 28:40of reinforcement learning
-
28:40 - 28:45has been to factory automations. If you imagine that you have
-
28:45 - 28:4720 machines sitting in the factory
-
28:47 - 28:50and the machines lie in a assembly line and they all
-
28:50 - 28:53do something to a part on the assembly line, then they route the part onto a
-
28:53 - 28:55different machine. You want to
-
28:55 - 28:59use reinforcement learning algorithms, [inaudible] the order in which the
-
28:59 - 29:03different machines operate on your different things that are flowing through your assembly line and maybe
-
29:03 - 29:05different machines can do different things.
-
29:05 - 29:09So if you have N machines and each machine
-
29:09 - 29:11can be in K states,
-
29:11 - 29:14then if you do this sort of discretization, the total number of states would be K
-
29:14 - 29:15to N as well.
-
29:15 - 29:18If you have N machines
-
29:18 - 29:22and if each machine can be in K states, then again, you can get
-
29:22 - 29:25this huge number of states.
-
29:25 - 29:27Other well-known examples would be
-
29:27 - 29:31if you have a board game is another example.
-
29:31 - 29:34You'd want to use reinforcement learning to play chess.
-
29:34 - 29:36Then if you have N
-
29:36 - 29:39pieces on your board game,
-
29:39 - 29:44you have N pieces on the chessboard and if each piece can be in K positions, then this is a
-
29:44 - 29:46game sort of the curse of dimensionality thing where the
-
29:46 - 29:55number of discrete states you end up with goes exponentially with the number of pieces
-
29:55 - 29:57in your board game.
-
29:57 - 30:01So the curse of dimensionality
-
30:01 - 30:06means that discretization scales poorly to high-dimensional state spaces or
-
30:06 - 30:09at least discrete representations
-
30:09 - 30:14scale poorly to high-dimensional state spaces. In practice, discretization will usually, if you have a 2-dimensional
-
30:14 - 30:17problem, discretization will usually work great.
-
30:17 - 30:21If you have a 3-dimensional problem, you can often get discretization to
-
30:21 - 30:21work
-
30:21 - 30:23not too badly without too much trouble.
-
30:23 - 30:28With a 4-dimensional problem, you can still often get to where that they
-
30:28 - 30:30could be challenging
-
30:33 - 30:39and as you go to higher and higher dimensional state spaces,
-
30:39 - 30:43the odds and [inaudible] that you need to figure around to discretization and do things
-
30:43 - 30:46like non-uniform grids, so for example,
-
30:46 - 30:49what I've
-
30:49 - 30:53drawn for you is an example of a non-uniform discretization
-
30:53 - 30:57where I'm discretizing S-2 much more finally than S-1. If I think the value
-
30:57 - 30:58function
-
30:58 - 30:59is much more sensitive
-
30:59 - 31:03to the value of state variable S-2 than to S-1,
-
31:03 - 31:07and so as you get into higher dimensional state spaces, you may need to manually fiddle
-
31:07 - 31:10with choices like these with no uniform discretizations and so on.
-
31:10 - 31:13But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it
-
31:13 - 31:14work fine.
-
31:14 - 31:18With 4-dimensional problems, you can probably get it to work but it'd be just
-
31:18 - 31:19slightly challenging
-
31:19 - 31:20and
-
31:20 - 31:24you can sometimes by fooling around and being clever, you can often push
-
31:24 - 31:25discretization
-
31:25 - 31:29up to let's say about 6-dimensional problems but
-
31:29 - 31:31with some difficulty
-
31:31 - 31:33and problems higher
-
31:33 - 31:37than 6-dimensional would be extremely difficult to solve with discretization.
-
31:37 - 31:40So that's just rough
-
31:40 - 31:44folk wisdom order of managing problems you think about using for discretization.
-
31:51 - 31:53But what I want to
-
31:53 - 31:56spend most of today talking about is
-
31:56 - 32:00[inaudible] methods that often work much better than discretization
-
32:00 - 32:02and which we will
-
32:02 - 32:05approximate V* directly
-
32:05 - 32:09without resulting to these sort of discretizations.
-
32:09 - 32:12Before I jump to the specific representation let me just spend a few minutes
-
32:12 - 32:15talking about the problem setup
-
32:15 - 32:16then.
-
32:16 - 32:19For today's lecture, I'm going to focus on
-
32:19 - 32:25the problem of continuous states
-
32:25 - 32:28and just to keep things sort of very
-
32:28 - 32:29simple in this lecture,
-
32:29 - 32:34I want view of continuous actions, so I'm gonna see
-
32:36 - 32:38discrete actions
-
32:38 - 32:40A. So it turns out actually that
-
32:40 - 32:44is a critical fact also for many problems, it turns out that the state
-
32:44 - 32:46space is
-
32:46 - 32:47much larger than
-
32:47 - 32:51the states of actions. That just seems to have worked out that way for many problems, so for example,
-
32:53 - 32:58for driving a car the state space is 6-dimensional, so if
-
32:58 - 33:00XY T, Xdot, Ydot, Tdot.
-
33:00 - 33:05Whereas, your action has, you still have two actions. You have forward
-
33:05 - 33:08backward motion and steering the car, so you have
-
33:08 - 33:126-D states and 2-D actions, and so you can discretize the action much more easily than discretize the states.
-
33:12 - 33:15The only
-
33:15 - 33:17examples down for a helicopter you've 12-D states in a
-
33:17 - 33:194-dimensional action
-
33:19 - 33:20it turns out, and
-
33:20 - 33:24it's also often much easier to just discretize a continuous
-
33:24 - 33:27actions into a discrete sum of actions.
-
33:27 - 33:28
-
33:28 - 33:32And for the inverted pendulum, you have a 4-D state and a 1-D action.
-
33:32 - 33:33Whether you accelerate
-
33:33 - 33:38your cart to the left or the right is one D action
-
33:38 - 33:42and so for the rest of today, I'm gonna assume a continuous state
-
33:42 - 33:45but I'll assume that maybe you've already discretized your actions,
-
33:45 - 33:48just because in practice it turns out that
-
33:48 - 33:53not for all problems, with many problems large actions
-
33:53 - 33:54is just less of a
-
33:54 - 33:58difficulty than large state spaces.
-
33:58 - 34:00So I'm
-
34:00 - 34:01going to assume
-
34:01 - 34:06that we have
-
34:06 - 34:09a model
-
34:09 - 34:15or simulator of the
-
34:15 - 34:16MDP, and so
-
34:16 - 34:18this is really
-
34:18 - 34:23an assumption on how the state transition probabilities are represented.
-
34:24 - 34:25I'm gonna assume
-
34:25 - 34:28and I'm going to use the terms "model" and "simulator"
-
34:28 - 34:30pretty much synonymously,
-
34:30 - 34:32so
-
34:32 - 34:39specifically, what I'm going to assume is that we have a black box and a
-
34:41 - 34:42piece of code,
-
34:42 - 34:45so that I can input any state, input an action
-
34:45 - 34:46and it will output
-
34:46 - 34:49S
-
34:49 - 34:51prime,
-
34:51 - 34:54sample from the state transition distribution. Says that
-
34:54 - 34:55this is really
-
34:55 - 34:57my assumption
-
34:57 - 35:01on the representation I have for the state transition probabilities,
-
35:01 - 35:05so I'll assume I have a box that read
-
35:05 - 35:07take us in for the stated action
-
35:07 - 35:10and output in mixed state.
-
35:10 - 35:17And so
-
35:17 - 35:20since they're fairly common ways to get models of
-
35:20 - 35:22different MDPs you may work with,
-
35:22 - 35:24one is
-
35:24 - 35:31you might get a model from a physics
-
35:31 - 35:33simulator. So for example,
-
35:33 - 35:37if you're interested in controlling that inverted pendulum,
-
35:37 - 35:39so your action is
-
35:39 - 35:41A which is
-
35:41 - 35:45the magnitude of the force you exert on the cart to
-
35:45 - 35:46
-
35:46 - 35:48left
-
35:48 - 35:49or right, and your
-
35:49 - 35:56state is Xdot, T, Tdot. I'm just gonna write that in that order.
-
35:58 - 36:01And so I'm gonna
-
36:01 - 36:04write down a bunch of equations just for completeness but
-
36:04 - 36:09everything I'm gonna write below here is most of what I wanna write is a bit
-
36:09 - 36:13gratuitous, but so since I'll maybe flip open a textbook on physics, a
-
36:13 - 36:14textbook on mechanics,
-
36:14 - 36:17you can work out the equations of motion of a
-
36:17 - 36:19physical device like this, so you find that
-
36:23 - 36:26Sdot. The dot denotes derivative, so the derivative of the state with respect to time
-
36:26 - 36:28is given by
-
36:29 - 36:32Xdot, ?-L(B)
-
36:32 - 36:34cost B
-
36:34 - 36:36over
-
36:37 - 36:39M
-
36:39 - 36:46Tdot,
-
36:52 - 36:53B.
-
36:53 - 36:58And so on where A is the force is the action that you exert on the cart. L is
-
36:58 - 37:00the length of the pole.
-
37:00 - 37:04M is the total mass of the system and so on. So all these equations
-
37:04 - 37:06are good uses, just writing
-
37:06 - 37:09them down for completeness, but by
-
37:09 - 37:12flipping over, open like a physics textbook, you can work out these equations
-
37:12 - 37:14and notions yourself
-
37:14 - 37:15and this
-
37:15 - 37:17then gives you a model which
-
37:17 - 37:18can say that S-2+1.
-
37:18 - 37:22You're still one time step later
-
37:22 - 37:27will be equal to your previous state
-
37:27 - 37:32plus [inaudible], so in your simulator or in my model
-
37:32 - 37:35what happens to the cart every
-
37:35 - 37:3910th of a second, so ? T
-
37:39 - 37:43would be within one second
-
37:43 - 37:50and then so plus ? T times
-
38:03 - 38:06that. And so that'd be one way to come up with a model of
-
38:06 - 38:08your MDP.
-
38:08 - 38:12And in this specific example, I've actually written down deterministic model because
-
38:13 - 38:15and by deterministic I mean that
-
38:15 - 38:18given an action in a state, the next state
-
38:18 - 38:20is not random,
-
38:20 - 38:23so would be an example of a deterministic model
-
38:23 - 38:26where I can compute the next state exactly
-
38:26 - 38:30as a function of the previous state and the previous action
-
38:30 - 38:33or it's a deterministic model because all the probability
-
38:33 - 38:35mass is on a single state
-
38:35 - 38:37given the previous stated action.
-
38:37 - 38:44You can also make this a stochastic model.
-
38:59 - 39:06A second way that is often used to attain a model is to learn one.
-
39:08 - 39:13And so again, just concretely what you do is you would
-
39:13 - 39:16imagine that you have a physical
-
39:16 - 39:17inverted pendulum system
-
39:17 - 39:21as you physically own an inverted pendulum robot.
-
39:21 - 39:24What you would do is you would then
-
39:24 - 39:27initialize your inverted pendulum robot to some state
-
39:27 - 39:32and then execute some policy, could be some random policy or some policy that you think is pretty
-
39:32 - 39:36good, or you could even try controlling yourself with a joystick or something.
-
39:36 - 39:37But so you set
-
39:37 - 39:41the system off in some state as zero.
-
39:41 - 39:44Then you take some action.
-
39:44 - 39:48Here's zero and the game could be chosen by some policy or chosen by you using a joystick tryina
-
39:48 - 39:51control your inverted pendulum or whatever.
-
39:51 - 39:55System would transition to some new state, S-1, and then you
-
39:55 - 39:57take some new action, A-1
-
39:57 - 40:00and so on. Let's say you do
-
40:00 - 40:02this
-
40:02 - 40:05for two time steps
-
40:05 - 40:06and
-
40:06 - 40:09sometimes I call this one trajectory and
-
40:09 - 40:10you repeat this
-
40:10 - 40:13M times, so
-
40:13 - 40:17this is the first trial of the first trajectory,
-
40:17 - 40:24and then you do this again. Initialize it in some
-
40:35 - 40:37and so on.
-
40:37 - 40:40So you do this a bunch of times
-
40:40 - 40:44and then you would run the learning algorithm
-
40:44 - 40:49to
-
40:49 - 40:53estimate ST+1
-
40:53 - 40:57as a function
-
40:57 - 41:02of
-
41:02 - 41:05ST and
-
41:05 - 41:08AT. And for sake of completeness, you should just think of this
-
41:08 - 41:10as inverted pendulum problem,
-
41:10 - 41:11so ST+1 is
-
41:11 - 41:13a 4-dimensional vector. ST, AT
-
41:13 - 41:17will be a 4-dimensional vector and that'll be a real number,
-
41:17 - 41:18and so you might
-
41:18 - 41:21run linear regression 4 times to predict each of these state variables as
-
41:21 - 41:23a function
-
41:23 - 41:25of each of these
-
41:25 - 41:275 real numbers
-
41:27 - 41:30and so on.
-
41:30 - 41:33Just for example, if you
-
41:33 - 41:37say that if you want to estimate your next state ST+1 as a linear
-
41:37 - 41:41function
-
41:41 - 41:46of your previous state in your action and so
-
41:46 - 41:51A here will be a 4 by 4 matrix,
-
41:51 - 41:54and B would be a 4-dimensional vector,
-
41:54 - 42:00then
-
42:00 - 42:07you would choose the values of A and B that minimize this.
-
42:32 - 42:35So if you want your model to be that
-
42:35 - 42:39ST+1 is some linear function of the previous stated action,
-
42:39 - 42:40then you might
-
42:40 - 42:42pose this optimization objective
-
42:42 - 42:45and choose A and B to minimize the sum of squares error
-
42:45 - 42:52in your predictive value for ST+1 as the linear function of ST
-
42:52 - 42:57and AT. I
-
42:57 - 42:58should say that
-
42:58 - 43:03this is one specific example where you're using a linear function of the previous
-
43:03 - 43:06stated action to predict the next state. Of course, you can also use other algorithms
-
43:06 - 43:08like low [inaudible] weight to linear regression
-
43:08 - 43:12or linear regression with nonlinear features or kernel linear
-
43:12 - 43:13regression or whatever
-
43:13 - 43:16to predict the next state as a nonlinear function of the current state as well, so this
-
43:16 - 43:23is just [inaudible] linear problems.
-
43:25 - 43:28And it turns out that low [inaudible] weight to linear
-
43:28 - 43:29regression is
-
43:29 - 43:32for many robots turns out to be an effective method for this learning
-
43:32 - 43:39problem as well.
-
43:47 - 43:52And so having learned to model, having learned the parameters A
-
43:52 - 43:53and B,
-
43:53 - 43:59you then have a model where you say that ST+1 is AST
-
43:59 - 44:05plus BAT, and so that would be an example of a deterministic
-
44:05 - 44:06model
-
44:08 - 44:11or having learned the parameters A and B,
-
44:11 - 44:13you might say that ST+1 is
-
44:13 - 44:15equal to AST +
-
44:15 - 44:17BAT
-
44:24 - 44:26+
-
44:26 - 44:28?T.
-
44:28 - 44:30And so these would be
-
44:30 - 44:33very reasonable ways to come up with either
-
44:33 - 44:37a deterministic or a stochastic model for your
-
44:37 - 44:39inverted pendulum MDP.
-
44:39 - 44:42And so
-
44:42 - 44:44just to summarize,
-
44:44 - 44:48what we have now is a model, meaning a piece of code,
-
44:48 - 44:53where you can input a state and an action
-
44:53 - 44:55and get an ST+1.
-
44:55 - 44:58And so if you have a stochastic model,
-
44:58 - 45:02then to influence this model, you would actually sample ?T from this [inaudible]
-
45:02 - 45:03distribution
-
45:03 - 45:10in order to generate ST+1. So it
-
45:13 - 45:18actually turns out that in a
-
45:18 - 45:20preview, I guess, in the next lecture
-
45:20 - 45:25it actually turns out that in the specific case of linear dynamical systems, in the specific
-
45:25 - 45:29case where the next state is a linear function of the current stated action, it
-
45:29 - 45:32actually turns out that there're very powerful algorithms you can use.
-
45:32 - 45:36So I'm actually not gonna talk about that today. I'll talk about that in the next lecture rather than
-
45:36 - 45:37right now
-
45:39 - 45:43but turns out that for many problems of inverted pendulum go if you use low [inaudible] weights
-
45:43 - 45:46and linear regressionals and long linear
-
45:46 - 45:48algorithm 'cause many systems aren't really
-
45:48 - 45:54linear. You can build a nonlinear model.
-
45:54 - 45:57So what I wanna do now is talk about
-
45:57 - 45:57given
-
45:57 - 46:01a model, given a simulator for your
-
46:01 - 46:03MDP, how to come up with an algorithm
-
46:03 - 46:07to approximate the alpha value function piece.
-
46:07 - 46:14Before I move on, let me check if there're questions on this. Okay,
-
46:38 - 46:43cool. So here's
-
46:43 - 46:44the idea.
-
46:44 - 46:47Back when we talked about linear regression, we said that
-
46:47 - 46:52given some inputs X in supervised learning, given the input feature is X, we
-
46:52 - 46:55may choose some features of X
-
46:55 - 46:57and then
-
46:57 - 46:59approximate the type of variable
-
46:59 - 47:02as a linear function of various features of X,
-
47:02 - 47:05and so just do exactly the same thing to
-
47:05 - 47:06approximate
-
47:06 - 47:09the optimal value function,
-
47:09 - 47:11and in particular,
-
47:11 - 47:16we'll choose some features
-
47:16 - 47:195-S of a
-
47:19 - 47:21state
-
47:21 - 47:26S. And so you could actually choose
-
47:26 - 47:305-S equals S. That would be one reasonable choice, if you want to approximate the value
-
47:30 - 47:30function
-
47:30 - 47:33as a linear function of the states, but you can
-
47:33 - 47:34also
-
47:34 - 47:37choose other things, so for example,
-
47:37 - 47:41for the inverted pendulum example, you may choose 5-S
-
47:41 - 47:45to be equal to a vector of features that may be [inaudible]1 or you may have
-
47:48 - 47:52Xdot2,
-
47:52 - 47:52Xdot
-
47:52 - 47:55maybe some cross terms,
-
47:55 - 47:57maybe times
-
47:57 - 47:59X, maybe dot2
-
47:59 - 48:04and so on. So
-
48:04 - 48:07you choose some vector or features
-
48:07 - 48:14and then approximate
-
48:17 - 48:22the value function as the value of the state as is equal to data
-
48:22 - 48:24transfers
-
48:24 - 48:28times the features. And I should apologize in advance; I'm overloading notation here.
-
48:28 - 48:30It's unfortunate. I
-
48:30 - 48:34use data both to denote the angle of the cart of the pole inverted
-
48:34 - 48:38pendulum. So this is known as
-
48:38 - 48:39the angle T
-
48:39 - 48:43but also using T to denote the vector of parameters in my [inaudible] algorithm.
-
48:43 - 48:44So sorry
-
48:44 - 48:50about the overloading notation.
-
48:50 - 48:52
-
48:52 - 48:55Just like we did in linear regression, my goal is to come up
-
48:55 - 48:57with
-
48:57 - 49:00a linear combination of the features that gives me a good approximation
-
49:00 - 49:02to the value function
-
49:02 - 49:08and this is completely analogous to when we said that in linear regression
-
49:08 - 49:10our estimate,
-
49:13 - 49:18my response there but Y as a linear function a feature is at the
-
49:18 - 49:23input. That's what we have
-
49:23 - 49:30in
-
49:40 - 49:47linear regression. Let
-
49:47 - 49:51me just write
-
49:51 - 49:56down value iteration again and then I'll written down an approximation to value
-
49:56 - 49:57iteration, so
-
49:57 - 50:01for discrete states,
-
50:01 - 50:04this is the idea behind value iteration and we said that
-
50:04 - 50:09V(s) will be
-
50:09 - 50:12updated
-
50:12 - 50:17as R(s) + G
-
50:17 - 50:20[inaudible]. That was value iteration
-
50:20 - 50:24and in the case of continuous states, this would be the replaced by an [inaudible], an
-
50:24 - 50:31[inaudible] over states rather via sum over states.
-
50:31 - 50:34Let me just write this
-
50:34 - 50:36as
-
50:39 - 50:43R(s) + G([inaudible]) and then that sum over T's prime.
-
50:43 - 50:45That's really an expectation
-
50:45 - 50:47with respect to random state as
-
50:47 - 50:54prime drawn from the state transition probabilities piece SA
-
50:55 - 50:56of V(s)
-
50:56 - 51:00prime. So this is a sum of all states S prime with the
-
51:00 - 51:02probability of going to S prime (value),
-
51:02 - 51:05so that's really an expectation over the random state S prime flowing from PSA of
-
51:05 - 51:09that.
-
51:09 - 51:12And so what I'll do now is
-
51:12 - 51:15write down an algorithm called
-
51:15 - 51:22fitted value iteration
-
51:25 - 51:28that's in
-
51:28 - 51:31approximation to this
-
51:31 - 51:34but specifically for continuous states.
-
51:34 - 51:39I just wrote down the first two steps, and then I'll continue on the next board, so the
-
51:39 - 51:43first step of the algorithm is we'll sample.
-
51:43 - 51:45Choose some set of states
-
51:45 - 51:52at
-
51:53 - 51:55random.
-
51:55 - 52:00So sample S-1, S-2 through S-M randomly so choose a
-
52:00 - 52:02set of states randomly
-
52:02 - 52:08and
-
52:08 - 52:12initialize my parameter vector to be equal to zero. This is
-
52:12 - 52:16analogous to in value iteration where I
-
52:16 - 52:21might initialize the value function to be the function of all zeros. Then here's the
-
52:21 - 52:28end view for the algorithm.
-
52:42 - 52:44Got
-
52:44 - 52:51quite a lot to
-
52:59 - 53:06write
-
55:05 - 55:12actually.
-
55:29 - 55:30Let's
-
55:30 - 55:37see.
-
56:10 - 56:12And so
-
56:12 - 56:18that's the algorithm.
-
56:18 - 56:20Let me just adjust the writing. Give me a second. Give me a minute
-
56:20 - 56:27to finish and then I'll step through this.
-
56:27 - 56:28Actually, if some of my
-
56:28 - 56:35handwriting is eligible, let me know. So
-
56:58 - 57:02let me step through this and so briefly explain the rationale.
-
57:02 - 57:08So the hear of the algorithm is - let's see.
-
57:08 - 57:11In the original value iteration algorithm,
-
57:11 - 57:13we would take the value for each state,
-
57:13 - 57:15V(s)I,
-
57:15 - 57:18and we will overwrite it with this
-
57:18 - 57:22expression here. In the original, this discrete value iteration algorithm
-
57:22 - 57:23was to V(s)I
-
57:23 - 57:24and we will
-
57:24 - 57:26set V(s)I
-
57:26 - 57:29to be equal to that, I think.
-
57:29 - 57:33Now we have in the continuous state case, we have an infinite continuous set of
-
57:33 - 57:34states and so
-
57:34 - 57:38you can't discretely set the value of each of these to that.
-
57:38 - 57:43So what we'll do instead is choose the parameters T
-
57:43 - 57:47so that V(s)I is as close as possible to
-
57:47 - 57:50this thing on the right hand side
-
57:50 - 57:55instead. And this is what YI turns out to be.
-
57:55 - 57:58So completely, what I'm going to do is I'm going to construct
-
57:58 - 58:01estimates of this term,
-
58:01 - 58:04and then I'm going to choose the parameters of my function approximator. I'm
-
58:04 - 58:08gonna choose my parameter as T, so that V(s)I is as close as possible to
-
58:08 - 58:11these. That's what YI is,
-
58:11 - 58:12and specifically,
-
58:12 - 58:15what I'm going to do is I'll choose parameters data
-
58:15 - 58:18to minimize the sum of square differences between T [inaudible] plus
-
58:18 - 58:205SI.
-
58:20 - 58:22This thing here
-
58:22 - 58:24is just V(s)I because I'm approximating V(s)I
-
58:24 - 58:26is a linear function of 5SI
-
58:26 - 58:28and so choose
-
58:28 - 58:32the parameters data to minimize the sum of square differences.
-
58:32 - 58:35So this is last step is basically
-
58:35 - 58:39the approximation version of value
-
58:39 - 58:41iteration. What everything else above
-
58:41 - 58:42was doing was just
-
58:42 - 58:45coming up with an approximation
-
58:45 - 58:46to this term, to
-
58:46 - 58:47this thing here
-
58:47 - 58:51and which I was calling YI.
-
58:51 - 58:54And so confluently, for every state SI we want to
-
58:54 - 58:57estimate what the thing on the right hand side is
-
58:57 - 58:58and but
-
58:58 - 59:02there's an expectation here. There's an expectation over a continuous set of states,
-
59:02 - 59:07may be a very high dimensional state so I can't compute this expectation exactly.
-
59:07 - 59:10What I'll do instead is I'll use my simulator to sample
-
59:10 - 59:14a set of states from this distribution from this P substrip,
-
59:14 - 59:17SIA, from the state transition distribution
-
59:17 - 59:21of where I get to if I take the action A in the state as I,
-
59:21 - 59:24and then I'll average over that sample of states
-
59:24 - 59:27to compute this expectation.
-
59:27 - 59:31And so stepping through the algorithm just says that for
-
59:31 - 59:32each
-
59:32 - 59:36state and for each action, I'm going to sample a set of states. This S prime 1 through S
-
59:36 - 59:41prime K from that state transition distribution, still using the model, and
-
59:41 - 59:44then I'll set Q(a) to be equal to that average
-
59:44 - 59:45and so this is my
-
59:45 - 59:48estimate for R(s)I + G(this
-
59:48 - 59:50expected value
-
59:50 - 59:54for that specific action A).
-
59:54 - 59:57Then I'll take the maximum of actions A
-
59:57 - 60:01and this gives me YI, and so YI is for S for that.
-
60:01 - 60:03And finally, I'll run
-
60:03 - 60:07really run linear regression which is that last of the set [inaudible] to get
-
60:07 - 60:08V(s)I
-
60:08 - 60:11to be close to the YIs.
-
60:11 - 60:16And so this algorithm is called fitted value iteration and it actually often
-
60:16 - 60:19works quite well for continuous,
-
60:19 - 60:23for problems with anywhere from
-
60:23 - 60:266- to 10- to 20-dimensional state spaces
-
60:34 - 60:39if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense?
-
60:39 - 60:41Some of you didn't have your hands up.
-
60:41 - 60:47Are there questions for those,
-
60:48 - 60:51yeah? Is there a recess [inaudible] function in this setup?
-
60:51 - 60:56Oh, yes. An MDP comprises
-
60:56 - 60:57SA,
-
60:57 - 61:00the state transition probabilities G
-
61:00 - 61:01and
-
61:01 - 61:02R
-
61:02 - 61:03and so
-
61:03 - 61:08for continuous state spaces, S would be like R4 for the inverted pendulum
-
61:08 - 61:09or something.
-
61:09 - 61:14Actions with discretized state transitions probabilities with specifying with the model
-
61:14 - 61:15or
-
61:15 - 61:17the simulator,
-
61:17 - 61:20G is just a real number like .99
-
61:20 - 61:24and the real function is usually a function that's given to you. And so
-
61:24 - 61:26the reward function
-
61:26 - 61:31is some function of your 4-dimensional state,
-
61:31 - 61:31and
-
61:31 - 61:33for example, you
-
61:33 - 61:40might choose a reward function to be minus " I don't know.
-
61:40 - 61:47Just for an example of simple reward function, if we want a -1 if the pole has fallen
-
61:48 - 61:52and there it depends on you choose your reward function to be -1 if
-
61:52 - 61:52the
-
61:52 - 61:55inverted pendulum falls over
-
61:55 - 61:59and to find that its angle is greater than 30° or something
-
61:59 - 62:02and zero otherwise.
-
62:02 - 62:04So that would be an example
-
62:04 - 62:07of a reward function that you can choose for the inverted pendulum, but yes, assuming a
-
62:07 - 62:10reward function is given to you
-
62:10 - 62:15so that you can compute R(s)I for any state.
-
62:15 - 62:22Are there other questions?
-
62:40 - 62:41Actually, let
-
62:41 - 62:48me try
-
62:50 - 62:57
-
63:24 - 63:28asking a question, so everything I did here assume that we have a stochastic simulator.
-
63:28 - 63:32So it
-
63:32 - 63:35turns out I can simply this algorithm if I have a deterministic simulator,
-
63:35 - 63:37but
-
63:37 - 63:39deterministic simulator is given a
-
63:39 - 63:42stated action, my next state is always exactly determined.
-
63:42 - 63:46So let me ask you, if I have a deterministic simulator,
-
63:46 - 63:53how would I change this algorithm? How would I simplify this algorithm? Lower your samples that you're drawing [inaudible].
-
63:54 - 63:58
-
63:58 - 63:59Right, so
-
63:59 - 64:01Justin's going right. If I have a deterministic simulator,
-
64:01 - 64:05all my samples from those would be exactly the same, and so if I have a
-
64:05 - 64:06deterministic simulator,
-
64:06 - 64:09I can set K to be equal to 1,
-
64:09 - 64:11so I don't need to draw
-
64:11 - 64:16K different samples. I really only need one sample if I have a
-
64:16 - 64:17deterministic simulator,
-
64:17 - 64:22so you can simplify this by setting K=1 if you have a deterministic simulator. Yeah? I guess I'm really
-
64:22 - 64:28confused about the,
-
64:28 - 64:29yeah,
-
64:29 - 64:34we sorta turned this [inaudible] into something that looks like linear
-
64:34 - 64:37state
-
64:37 - 64:40regression or some' you know the data transpose times something that
-
64:40 - 64:44we're used to but I guess I'm a
-
64:44 - 64:45little. I don't know really know what question to ask but like when we did this before we
-
64:45 - 64:48had like discrete states and everything. We
-
64:48 - 64:50were determined with
-
64:50 - 64:53finding this optimal policy
-
64:53 - 64:57and I
-
64:57 - 65:02guess it doesn't look like we haven't said the word policy in a while so kinda
-
65:02 - 65:05difficult. Okay, yeah, so [inaudible] matters back to policy but maybe I should
-
65:05 - 65:07just say a couple words so
-
65:07 - 65:11let me actually try to get at some of what maybe what you're saying.
-
65:11 - 65:14Our strategy for finding optimal policy
-
65:14 - 65:15has been to
-
65:15 - 65:19find some way to find V
-
65:19 - 65:21and
-
65:21 - 65:24some of approximations
-
65:24 - 65:29of ?*. So far everything I've been doing has been focused on how to find V*. I just
-
65:29 - 65:30want
-
65:33 - 65:36to say one more word. It actually turns out that
-
65:36 - 65:41for linear regression it's often very easy. It's often not terribly difficult to
-
65:41 - 65:43choose some resource of the features. Choosing
-
65:43 - 65:47features for approximating the value function is often somewhat
-
65:47 - 65:48harder,
-
65:48 - 65:51so because the value of a state is
-
65:51 - 65:52how good is
-
65:52 - 65:54starting off in this state.
-
65:54 - 65:56What is my
-
65:56 - 66:00expected sum of discounted rewards? What if I start in a certain state?
-
66:00 - 66:00And so
-
66:00 - 66:04what the feature of the state have to measure is really
-
66:04 - 66:08how good is it to start in a certain state?
-
66:08 - 66:12And so for inverted pendulum you actually have that
-
66:12 - 66:16states where the poles are vertical and when a cart that's centered on your
-
66:16 - 66:18track or something, maybe better and so you can
-
66:18 - 66:21come up with features that measure the orientation of the pole and how close
-
66:21 - 66:26you are to the center of the track and so on and those will be reasonable features to use
-
66:26 - 66:28to approximate V*.
-
66:28 - 66:30Although in general it is true that
-
66:30 - 66:34choosing features, the value function approximation, is often slightly trickier
-
66:34 - 66:40than choosing good features for linear regression. Okay and then Justin's
-
66:40 - 66:41questions of
-
66:41 - 66:43so given
-
66:43 - 66:47V*, how do you go back to actually find a policy?
-
66:47 - 66:50In the discrete case, so we
-
66:50 - 66:54have that ?*(s) is equal to all
-
66:54 - 67:01[inaudible] over A of
-
67:04 - 67:07that. So that's again, I used to write this as a sum over states [inaudible]. I'll just write this
-
67:07 - 67:13as an expectation
-
67:13 - 67:16and so then once you find the optimal value
-
67:16 - 67:17function V*,
-
67:17 - 67:19you can then
-
67:19 - 67:20find the optimal policy ?* by computing the
-
67:20 - 67:23[inaudible].
-
67:23 - 67:25So if you're in a continuous state MDP,
-
67:25 - 67:26then
-
67:26 - 67:30you can't actually do this in advance for every single state because there's an
-
67:30 - 67:32infinite number of states
-
67:32 - 67:35and so you can't actually perform this computation in advance to every single
-
67:35 - 67:36state.
-
67:36 - 67:39What you do instead is,
-
67:39 - 67:43whenever your robot is in some specific state S, that is, only when your system
-
67:43 - 67:48is in some specific state S, like your car is at some specific position and orientation, or
-
67:48 - 67:50your inverted pendulum
-
67:50 - 67:53is in some specific position, posed at some
-
67:53 - 67:57specific angle θ, it's
-
67:57 - 67:58only when
-
67:58 - 68:03your system, be it a factory or a board game or a robot,
-
68:03 - 68:05is in some specific state S
-
68:05 - 68:09that you would then go ahead and compute this argmax, so
-
68:09 - 68:10it's only when
-
68:10 - 68:16you're in some state S that you then compute this argmax, and then
-
68:16 - 68:18you
-
68:23 - 68:24then execute that action A,
-
68:24 - 68:25and then,
-
68:25 - 68:29as a result of your action, your robot would transition to some new state, and then,
-
68:29 - 68:34given that specific new state, you compute the
-
68:34 - 68:41argmax again using that specific state S that
-
68:42 - 68:46you're in. There are a few ways to do it. One way to do this is
-
68:46 - 68:50actually the same as in the inner loop of the fitted value iteration algorithm: so
-
68:50 - 68:52because this is an expectation over a large number
-
68:52 - 68:55of states, you'd need to sample
-
68:55 - 68:58some set of states from the simulator and then
-
68:58 - 69:03approximate this expectation using an average over your samples, exactly as in the
-
69:03 - 69:06inner loop of the fitted value iteration algorithm.
-
69:06 - 69:09So you could do that; that's sometimes done. Sometimes it can also be a pain to have to sample
-
69:10 - 69:13a set of states
-
69:13 - 69:16to approximate those expectations every time you want to take an action in your MDP.
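A minimal sketch of that sampling-based action selection, assuming a stochastic simulator simulate(s, a, rng) and a fitted value function V_hat; these names are placeholders for illustration, not the lecture's notation.

    import numpy as np

    def greedy_action_by_sampling(s, actions, simulate, V_hat, k=1000, rng=None):
        # Approximate pi(s) = argmax_a E[V*(s')] by averaging the fitted value
        # function over k next states sampled from the simulator, just as in
        # the inner loop of fitted value iteration.
        rng = rng if rng is not None else np.random.default_rng()
        best_action, best_value = None, -np.inf
        for a in actions:
            value = np.mean([V_hat(simulate(s, a, rng)) for _ in range(k)])
            if value > best_value:
                best_action, best_value = a, value
        return best_action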
-
69:16 - 69:18There are a couple
-
69:18 - 69:23of special cases where this can be done without sampling. One special case is if you
-
69:23 - 69:30have
-
69:32 - 69:36a deterministic simulator. If it's a deterministic simulator, so in other words, if your simulator is
-
69:36 - 69:40just some function f,
-
69:40 - 69:44which could be a linear or a nonlinear function. If it's a deterministic simulator,
-
69:44 - 69:48then the next state, ST+1, is just some function f of your
-
69:48 - 69:50previous state and action.
-
69:50 - 69:52If that's the case, then
-
69:52 - 69:55this expectation, well, then
-
69:55 - 70:07this simplifies to the argmax over A of V*(f(S, A)),
-
70:07 - 70:08because
-
70:08 - 70:11this is really saying
-
70:11 - 70:18S prime = f(S, A). I switched back and forth between notation; I hope that's okay. I use S
-
70:18 - 70:22to denote the current state and S prime to denote the next state, versus ST and ST+1 for the current and
-
70:22 - 70:26next state. Both of these are
-
70:26 - 70:28sort of standard notation, so don't
-
70:28 - 70:31mind my switching back and forth between them. But if it's a
-
70:31 - 70:35deterministic simulator you can then just compute what the next state S prime would be
-
70:35 - 70:38for each action you might take from the current state,
-
70:38 - 70:42and then take the argmax over actions A,
-
70:42 - 70:47basically choose the action that gets you to the highest value
-
70:47 - 70:54state.
-
71:04 - 71:05And
-
71:05 - 71:09so that's one case where you can compute the argmax, and you can compute that
-
71:09 - 71:13expectation without needing to average over a set of samples.
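A minimal sketch of action selection with a deterministic simulator, under the same placeholder names as before:

    def greedy_action_deterministic(s, actions, f, V_hat):
        # With a deterministic simulator s' = f(s, a), each action leads to exactly
        # one next state, so pi(s) = argmax_a V_hat(f(s, a)) needs no sampling.
        # f and V_hat are placeholders for your simulator and fitted value function.
        return max(actions, key=lambda a: V_hat(f(s, a)))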
-
71:13 - 71:17Another very common case, it actually turns out, is if you have a stochastic
-
71:24 - 71:26simulator, but your
-
71:26 - 71:30simulator happens to take on the very specific form
-
71:34 - 71:36ST+1 = f(ST, AT) + εT,
-
71:37 - 71:44where this εT
-
71:44 - 71:48is Gaussian noise. This is a very common way to build simulators, where you model the next
-
71:48 - 71:50state as a function of the current state and action
-
71:50 - 71:52plus some noise, and so
-
71:52 - 71:57one specific example would be the sort of linear dynamical system that
-
71:57 - 72:01we talked about, with the next state a linear function of the current state and action plus
-
72:01 - 72:03Gaussian
-
72:03 - 72:08noise. In this case, you can approximate the argmax over
-
72:08 - 72:15A as follows.
-
72:22 - 72:27In that case you take that
-
72:27 - 72:29expectation that
-
72:29 - 72:33you're trying to approximate, the
-
72:33 - 72:41expected value of V*(S prime), and approximate it by V* of the
-
72:41 - 72:45expected value of
-
72:45 - 72:49S prime, that is, V*(E[S prime]) = V*(f(S, A)), and this is an approximation.
-
72:49 - 72:53The expected value of a function is usually not equal to the function of the
-
72:53 - 72:53expected value, but
-
72:53 - 72:54it is often
-
72:54 - 72:56a reasonable approximation,
-
72:56 - 72:58and so
-
73:02 - 73:05that would be another way to
-
73:05 - 73:06approximate that expectation,
-
73:06 - 73:11and so you choose the actions
-
73:11 - 73:14according to
-
73:14 - 73:21essentially the same formula as I wrote just now.
-
73:22 - 73:24And so this
-
73:24 - 73:25would be a way of
-
73:25 - 73:27approximating
-
73:28 - 73:31this argmax,
-
73:31 - 73:34ignoring the noise in the simulator essentially. And
-
73:34 - 73:38this often works pretty well as well, just because many simulators turn
-
73:38 - 73:39out
-
73:39 - 73:43to be of the form of some linear or some nonlinear function plus zero-mean
-
73:43 - 73:44Gaussian noise,
-
73:44 - 73:46and so you can just ignore the zero-mean
-
73:46 - 73:47Gaussian
-
73:47 - 73:50noise, so that you can compute this quickly.
-
73:50 - 73:54And just to complete this, what
-
73:54 - 73:56that is, right,
-
73:56 - 73:58that V*(f(S, A)),
-
73:58 - 74:02this you
-
74:04 - 74:09would write as θ transpose φ(S prime), where S prime = f(S, A). Great,
-
74:09 - 74:12so this V* you would compute using
-
74:12 - 74:15the parameters θ that you just learned using the
-
74:15 - 74:22fitted value iteration
-
74:22 - 74:23algorithm.
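Putting these pieces together, here is a minimal sketch of the noise-ignoring action selection, assuming a simulator of the form s' = f(s, a) + noise and the linear value function approximation V*(s') ≈ θᵀφ(s') with θ from fitted value iteration; f, phi, and theta are placeholders.

    def greedy_action_mean_dynamics(s, actions, f, phi, theta):
        # Assumes s' = f(s, a) + epsilon with zero-mean Gaussian noise, and
        # V*(s') approximated as theta^T phi(s') with theta learned by fitted
        # value iteration.  Dropping the noise, evaluate each action at the
        # mean next state f(s, a) and pick the best one.
        return max(actions, key=lambda a: theta @ phi(f(s, a)))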
-
74:23 - 74:30Questions about this? Student:[Inaudible] case,
-
74:34 - 74:41for real-time applications is it possible to use that [inaudible], for example for
-
74:42 - 74:48[inaudible]. Instructor (Andrew Ng):Yes, so the question is, in real-time applications, is it possible to sample states and use them to approximate this
-
74:51 - 74:56expectation. Computers today are actually amazingly fast. I'm actually often
-
74:56 - 74:59surprised by how much you can do in real time so
-
74:59 - 75:00the helicopter, we're actually flying the helicopter
-
75:00 - 75:04using an algorithm different from
-
75:04 - 75:05this, I should say.
-
75:06 - 75:11But my intuition is that you could actually do this with a
-
75:11 - 75:14helicopter. A helicopter would be controlled at somewhere between 10 Hz and 50 Hz, so you'd need to do
-
75:14 - 75:16this 10 times a second to 50 times a second,
-
75:16 - 75:18and that's actually plenty of time
-
75:18 - 75:19to sample
-
75:19 - 75:241,000 states and compute this
-
75:24 - 75:27expectation. Helicopters are really difficult because helicopters are mission critical, and if you
-
75:27 - 75:30do something wrong, you
-
75:30 - 75:32can do serious damage, and so,
-
75:32 - 75:36maybe not for good reasons, we've actually tended to avoid
-
75:36 - 75:37tossing coins when we're
-
75:37 - 75:40in the air, so the idea of letting our actions come
-
75:40 - 75:45out of some random process is slightly scary, and we just tend not to do that.
-
75:45 - 75:49I should say that's probably not a great reason, because if you average a
-
75:49 - 75:52large number of things here you'll very likely be fine, but just
-
75:52 - 75:56as a maybe overly conservative design choice, we actually
-
75:56 - 75:57don't, we tend not to run
-
75:57 - 75:59anything randomized onboard,
-
75:59 - 76:01which is probably being over-conservative.
-
76:02 - 76:05It's the choice we made because it seemed slightly safer.
-
76:05 - 76:08I think you can actually often do this,
-
76:08 - 76:13so long as your model can be evaluated fast enough that you can sample
-
76:13 - 76:15100 state transitions or 1,000 state
-
76:15 - 76:18transitions, and then do that at
-
76:18 - 76:2210 Hz. Having said that, this isn't always attainable, which is why we often
-
76:22 - 76:29use the other approximations that don't require you to draw a large sample. Anything else? No, okay, cool. So
-
76:32 - 76:36now you know one algorithm for [inaudible] reinforcement learning on continuous state
-
76:36 - 76:39spaces. We'll then pick up with some more ideas on some even
-
76:39 - 76:40more powerful algorithms
-
76:40 - 76:43for solving MDPs with continuous state spaces.
-
76:43 - 76:44Thanks. Let's close for today.
Title: Lecture 17 | Machine Learning (Stanford)
Description:
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses the topic of reinforcement learning, focusing particularly on continuous state MDPs, discretization, and policy and value iterations.
This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Complete Playlist for the Course: http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599
CS 229 Course Website: http://www.stanford.edu/class/cs229/
Stanford University: http://www.stanford.edu/
Stanford University Channel on YouTube: http://www.youtube.com/stanford
Video Language: English
Duration: 01:17:00