Lecture 18 | Machine Learning (Stanford)
-
0:13 - 0:16This presentation is delivered by the Stanford Center for Professional
-
0:16 - 0:23Development.
-
0:26 - 0:28Okay. Welcome back. What I
-
0:28 - 0:31want to do today is talk about
-
0:31 - 0:35one of my favorite algorithms for controlling MDPs that I think is
-
0:36 - 0:39one of the more elegant and efficient and powerful algorithms that
-
0:39 - 0:41I know of.
-
0:41 - 0:43So
-
0:43 - 0:47what I'll do is I'll first start by talking about a couple
-
0:47 - 0:49variations of MDPs
-
0:49 - 0:52that are slightly different from the MDP definition you've seen so far. These
-
0:52 - 0:54are pretty
-
0:54 - 0:59common variations. One is state action rewards,
-
0:59 - 1:04and the other is finite horizon MDPs. Using these slightly modified definitions of an
-
1:04 - 1:07MDP, I'll talk about linear dynamical systems. I'll spend a little bit
-
1:07 - 1:10of time talking about models for linear dynamical systems, and then talk about LQR, or
-
1:10 - 1:13linear quadratic regulation control,
-
1:13 - 1:19which will lead us to something called the Riccati equation, which is
-
1:19 - 1:23something we will solve in order to do LQR
-
1:23 - 1:24controls.
-
1:24 - 1:26
-
1:26 - 1:27So
-
1:27 - 1:29just
-
1:29 - 1:33to recap, and we've seen this definition many times now.
-
1:33 - 1:37We've been defining an MDP
-
1:37 - 1:42as
-
1:42 - 1:47a five-tuple of states, actions, state transition probabilities, a discount factor gamma, and a reward function,
-
1:47 - 1:50where - gamma's
-
1:50 - 1:53the
-
1:53 - 1:57discount factor, a number between zero and one.
-
1:57 - 2:00And R, the reward function,
-
2:00 - 2:07was a function mapping from the states to the real numbers.
-
2:07 - 2:08So we had
-
2:08 - 2:10value
-
2:10 - 2:14iteration, which would do this.
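For reference, the update being written on the board here is the standard value iteration update from the earlier lectures, with a reward that depends only on the state:

```latex
V(s) := R(s) + \gamma \max_{a} \sum_{s'} P_{sa}(s')\, V(s')
```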
-
2:22 - 2:26So after a while, value iteration will cause V to converge to V star.
-
2:26 - 2:28Then
-
2:28 - 2:31having found the optimal value function, if you
-
2:31 - 2:33compute the optimal policy
-
2:33 - 2:44by taking essentially the arg max over actions A of the right-hand side of the equation above. So
-
2:46 - 2:48in
-
2:48 - 2:51value iteration,
-
2:51 - 3:13as you iterate - you know, perform this update - the function V will converge to V star. So with any finite number of iterations you only get closer and closer to V star; it actually converges exponentially quickly to V star, but you never exactly reach V star in a finite number of iterations.
-
3:15 - 3:20So what I want to do now is describe a couple of common variations of
-
3:20 - 3:20MDPs
-
3:20 - 3:35that have slightly different definitions. First, the reward function, and then second, we'll do something slightly different from just discounting. Then remember in the last lecture, I said that
-
3:35 - 3:37for infinite-state or continuous-state MDPs,
-
3:37 - 3:41we couldn't apply the most straightforward version of value iteration
-
3:41 - 3:44because if you have a continuous state
-
3:44 - 3:46MDP, we need to use
-
3:46 - 3:48some approximations of the optimal value function.
-
3:48 - 3:52But later in this lecture,
-
3:52 - 3:56I'll talk about a special case of MDPs, where you can actually represent the
-
3:56 - 3:58value function exactly,
-
3:58 - 4:01even if you have an infinite-state space or even if you have a continuous-state
-
4:01 - 4:02space.
-
4:02 - 4:15I'll actually do that, talk about this special class of infinite-state MDPs, using this new variation of the reward function and the alternative to discounting, which will make the formulation a little easier.
-
4:15 - 4:17So the first
-
4:17 - 4:21variation I want to talk about is
-
4:21 - 4:33state action rewards. So I'm going to change the definition of the reward function. It turns out this won't be a huge deal. In particular, I change the reward function
-
4:33 - 4:36to be a function mapping from
-
4:36 - 4:38a state action pair
-
4:38 - 4:41to the real numbers.
-
4:41 - 4:45What I mean by this is just the following.
-
4:45 - 4:49You start off in some state S zero. You
-
4:49 - 4:53take an action A zero as a result of your state of action choice. You transition
-
4:53 - 4:55to some new state, S1.
-
4:55 - 5:00You take some action, A1. You transition to some new state, S2. You take some
-
5:00 - 5:05action, A2, and so on. So this is the sequence of states and actions that you see.
-
5:05 - 5:08So in an MDP where you have state action rewards,
-
5:08 - 5:13your total payoff is now defined as
-
5:13 - 5:14this,
-
5:14 - 5:19where your reward function is now a function both of the current state and of the
-
5:19 - 5:20action
-
5:20 - 5:22you took in the current state.
-
5:22 - 5:27So this is my
-
5:27 - 5:29total payoff.
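The total payoff being written on the board here is the usual discounted sum, now with rewards that depend on both the state and the action:

```latex
R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots
```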
-
5:29 - 5:32Then as usual, my goal will be to find a policy -
-
5:32 - 5:35to find the function mapping from the state's actions,
-
5:35 - 5:38so that when I execute that policy,
-
5:38 - 5:40I can maximize the expected value
-
5:40 - 5:42of my total payoff.
-
5:42 - 5:47So this definition, it actually turns out that given an MDP with
-
5:47 - 5:48state action rewards,
-
5:48 - 5:49you can actually -
-
5:49 - 5:57so by redefining the states, you can actually reduce this back to an MDP with rewards that are only a function of the state. That may or may
-
5:57 - 5:58not be obvious. Don't worry if
-
5:58 - 6:01it isn't. But
-
6:01 - 6:05using state action rewards allows you to more directly model
-
6:06 - 6:09problems in which different actions have different costs. So
-
6:09 - 6:13a running example is a robot. So suppose you have a robot,
-
6:13 - 6:17and it's more costly for the robot to move than for it to stay still.
-
6:17 - 6:26So if there's an action to stay still, the action to stay still may have a lower cost, because you're not using battery power that you'd have to recharge for that action.
-
6:26 - 6:30Another example would be -
-
6:30 - 6:33actually, another navigation example
-
6:33 - 6:37would be if you have an outdoor vehicle. You need to drive
-
6:37 - 6:39over
-
6:39 - 6:42some sort of outdoor terrain, like very rough rocks
-
6:42 - 6:46or driving over grass. It may be costly, more difficult,
-
6:46 - 6:49than driving over, say, a paved road. So you may assign
-
6:49 - 6:51an action that requires
-
6:51 - 6:56driving over grass or driving over rocks to be
-
6:56 - 7:00more costly than driving over paved
-
7:00 - 7:05road.
-
7:05 - 7:08So this really isn't a huge change to the definition of
-
7:08 - 7:24an MDP. I won't really bother to justify this a lot, but Bellman's equations generalize in the way that you probably expect. V star of S is now equal to
-
7:27 - 7:34that.
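The equation being written on the board is Bellman's equation with state action rewards:

```latex
V^*(s) = \max_{a}\left[ R(s,a) + \gamma \sum_{s'} P_{sa}(s')\, V^*(s') \right]
```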
-
7:37 - 7:41So previously, when the reward function was just a function of the state, S, we could
-
7:41 - 7:44take the max and push it in here.
-
7:44 - 7:45But now that
-
7:46 - 7:50the reward is a function of the action you're taking as well, the max comes outside. So
-
7:50 - 7:51this says that
-
7:51 - 8:02your expected total payoff, starting from the state S and acting optimally, is equal to, first, your immediate reward, R of S, A, for executing some action, A, in state S, and then
-
8:02 - 8:07plus gamma times your future expected total payoff.
-
8:07 - 8:09So
-
8:09 - 8:11this is your expected total payoff if you
-
8:11 - 8:15take the action, A, from the current state. So
-
8:15 - 8:19V star is the optimal value function, so your actual optimal expected total payoff
-
8:19 - 8:22is the max of all actions of this
-
8:22 - 8:26thing on
-
8:26 - 8:30the right. Let's see. Value iteration, which I'm abbreviating VI,
-
8:30 - 8:53is really the same algorithm. V of S is updated as max over A of R of S, A plus the same thing as before. You just use the right-hand side of that Bellman equation to update V of S, and again, you get value iteration, exactly the same way.
-
8:53 - 8:56Then finally,
-
8:57 - 9:00having found the optimal value function, V star,
-
9:00 - 9:03using the value iteration algorithm,
-
9:03 - 9:08you can then compute the optimal policy, pi star of S as same as
-
9:08 - 9:12before. The best action to take in the state, S, is the
-
9:12 - 9:19action, A, that maximizes the thing on the righthand
-
9:28 - 9:32side. So having used value iteration
-
9:32 - 9:35to compute the optimal value function, you can then find
-
9:35 - 9:40pi star using that.
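As a concrete illustration, here is a small sketch (not from the lecture) of value iteration with state action rewards for a finite MDP, using hypothetical arrays R[s, a] and P[s, a, s']:

```python
import numpy as np

def value_iteration(R, P, gamma, num_iters=1000):
    """Value iteration with state-action rewards.
    R: (S, A) array of rewards R(s, a)
    P: (S, A, S) array of transition probabilities P_sa(s')
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
        Q = R + gamma * P.dot(V)
        V = Q.max(axis=1)              # V(s) = max_a Q(s, a)
    Q = R + gamma * P.dot(V)
    pi = Q.argmax(axis=1)              # greedy policy pi*(s) = argmax_a Q(s, a)
    return V, pi
```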
-
9:42 - 9:46So this was the easier of the two
-
9:46 - 9:48variations of MDPs [inaudible]
-
9:48 - 9:49so far.
-
9:49 - 9:56Any questions? Actually, are there questions about this?
-
10:01 - 10:03So
-
10:03 - 10:07the other variation, the other alternative definition,
-
10:08 - 10:11will be
-
10:11 - 10:16finite horizon
-
10:16 - 10:19MDPs. So
-
10:19 - 10:22finite horizon MDP
-
10:22 - 10:32comprises a tuple of states S, actions A, state transition probabilities P sub S, A, a parameter T, and the reward function.
-
10:32 - 10:36Here,
-
10:36 - 10:37T
-
10:37 - 10:41is a parameter called the horizon time.
-
10:41 - 10:43Concretely,
-
10:43 - 10:46what this really means is that we'll
-
10:46 - 10:48be taking actions in the MDP
-
10:48 - 10:55only for a total of capital T time steps. So we won't use discounting anymore. [Audio cuts out]
-
11:00 - 11:04You start in some state S zero, take action A0, get to some other state S1, take
-
11:04 - 11:07action A1 and so on.
-
11:07 - 11:12Eventually, you get to some state, S capital T, after T time steps. So
-
11:12 - 11:17my total payoff, now,
-
11:17 - 11:18will be
-
11:18 - 11:19given
-
11:19 - 11:22by this
-
11:22 - 11:26sum from time zero up to time T of my sum over rewards.
-
11:28 - 11:32Okay? My goal, as usual
-
11:32 - 11:34- so this is my total payoff.
-
11:34 - 11:38My goal, as usual, is to maximize the expected value of my total payoff. We want to come up
-
11:38 - 11:40with a policy to maximize the
-
11:40 - 11:44expected value of this total payoff. The key difference is that
-
11:44 - 11:51the world only exists for capital T time steps, and after that, there are no more rewards to be collected.
-
11:51 - 11:55So this turns out to be quite a bit of a difference, because it turns out that the optimal
-
11:55 - 12:00policy
-
12:01 - 12:03may be non-stationary.
-
12:06 - 12:08The
-
12:08 - 12:10term, stationary, means that it doesn't depend
-
12:10 - 12:13on time. Non-stationary
-
12:13 - 12:14means that it may depend on time.
-
12:14 - 12:18So non-stationary
-
12:18 - 12:22roughly means that my optimal action to take will be different for different time
-
12:22 - 12:22steps.
-
12:22 - 12:24That's what nonstationary
-
12:24 - 12:27means. Just as an example of that,
-
12:27 - 12:31imagine that we have some robot. Let's say the
-
12:31 - 12:34robot
-
12:34 - 12:35is here.
-
12:35 - 12:38Let's say that there's
-
12:38 - 12:41a goal over there with a plus one reward.
-
12:43 - 12:46Much further away, there's a plus ten reward.
-
12:46 - 12:51So depending on how much time you have left on the clock,
-
12:51 - 12:53it may be better to go after the plus one
-
12:53 - 12:55or the plus ten reward. If it's still early
-
12:55 - 12:59in the game, you still have a lot of time, it may be better to head toward
-
12:59 - 13:12the plus-ten reward and get a much larger reward. If you only have a couple of time steps left, if the clock has nearly reached the time, capital T, then you may not have enough time to get to the plus-ten reward. You'd be better
-
13:12 - 13:16off heading for the plus one reward that's much more close by.
-
13:16 - 13:20So what this example illustrates is that when you're in that state,
-
13:20 - 13:24the best action to take could be to go left or to go right, depending on what
-
13:24 - 13:26time it is. So just as an example, illustrating
-
13:26 - 13:32how the optimal policy can be non-stationary.
-
13:35 - 13:36In fact,
-
13:36 - 13:38since we have non-stationary
-
13:38 - 13:43policies anyway in the sequence, what I'm going to do next, I'm going to
-
13:43 - 13:44allow
-
13:44 - 13:48non-stationary transition probabilities as well. So I'll just write that
-
13:48 - 13:50up there.
-
13:50 - 13:52What I mean is that
-
13:52 - 14:02so far, assuming that the state S T plus one is drawn from the state transition probabilities indexed by the previous state and the previous action.
-
14:02 - 14:06I've been assuming that these state transition probabilities are the same
-
14:06 - 14:16for all times. So if I'm in some state and take some action, the distribution over the next state doesn't change. It doesn't depend on time.
-
14:16 - 14:17So
-
14:17 - 14:22I'm going to allow a slightly more general definition as well, in which we have
-
14:22 - 14:25non-stationary state transition probabilities
-
14:25 - 14:25so that
-
14:25 - 14:28the chance of where you end up next
-
14:28 - 14:30may also
-
14:30 - 14:33depend on what time it is.
-
14:33 - 14:37So as examples of this non-stationary state
-
14:37 - 14:39transition probabilities,
-
14:39 - 14:44one example would be if you model flying an aircraft over a long distance.
-
14:44 - 14:49Then as the aircraft flies, you burn fuel and become lighter. So the
-
14:49 - 14:53dynamics of the aircraft actually change over time. The mass of the aircraft can
-
14:53 - 14:55change significantly over time as you burn fuel.
-
14:55 - 14:56So depending on
-
14:56 - 15:01what time it is, your next state could actually
-
15:01 - 15:03depend on not only your current state and your action,
-
15:03 - 15:06but also on how much fuel you burn, therefore, what time it is.
-
15:07 - 15:11Other examples, another aerospace one, is
-
15:11 - 15:15if you have the weather forecast for the next 24 hours, say,
-
15:15 - 15:18you know what the winds and precipitation are going to be like over the next 24 hours. Then again,
-
15:18 - 15:22if you fly the aircraft from, say, here to New York,
-
15:22 - 15:25it may cost different amounts to fly
-
15:25 - 15:29different routes at different times. Maybe flying over
-
15:29 - 15:33the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or
-
15:33 - 15:34if you do it
-
15:34 - 15:38a few hours from now, when the weather may be forecast really bad.
-
15:39 - 15:41For an
-
15:41 - 15:46example you see everyday, same thing for traffic, right?
-
15:46 - 15:48There's at least -
-
15:48 - 15:53depending on where you live, certainly here in California, there are times of day where traffic is
-
15:53 - 15:57really bad in lots of places. So the costs of driving certain roads may vary, depending on
-
15:57 - 15:59what time of day it is.
-
15:59 - 16:04Lots of other examples. Industrial automation, different machines
-
16:04 - 16:07in the factory may be available to different degrees at different times of day. They
-
16:07 - 16:10cost different amounts to hire different workers, depending on whether you pay
-
16:10 - 16:13over time [inaudible] or whatever.
-
16:13 - 16:16So the cost of doing different things in the factory can also
-
16:16 - 16:17be a function of time.
-
16:17 - 16:18
-
16:18 - 16:24The state transition probabilities can also be a function of time.
-
16:26 - 16:28Lastly,
-
16:28 - 16:32while we're doing this as well, to make this fully general, we might as
-
16:32 - 16:43well have non-stationary reward functions as well, where you might also index the reward function by time with these subscripts,
-
16:43 - 16:45where the cost of doing different things may depend on
-
16:45 - 16:48the time as well.
-
16:51 - 16:59There are lots more examples of non-stationary MDPs. So now that we have non-stationary MDPs and policies, let's talk about an algorithm to actually
-
16:59 - 17:02try to find the optimal policy.
-
17:04 - 17:10So let me
-
17:10 - 17:14define the following.
-
17:14 - 17:20This is now a slightly modified definition of the optimal value function. I'll
-
17:20 - 17:27just
-
17:31 - 17:38write this down, I guess.
-
17:39 - 17:43So I'm going to define the optimal value function, and this going to be
-
17:43 - 17:45indexed by T, with a subscript T. The
-
17:45 - 17:46optimal value of a state
-
17:48 - 17:50for time, T, we're going to define as your
-
17:50 - 17:52optimal
-
17:52 - 17:57sum of rewards for if you start the MDP at that state, S,
-
17:57 - 17:59and if the clock starts off
-
17:59 - 18:02at time,
-
18:02 - 18:06lowercase T. So the optimal value of a state will depend on what time it is and how
-
18:06 - 18:10much time you have left to run this
-
18:10 - 18:11MDP. Therefore,
-
18:11 - 18:16the sum on the right sums only for time T, time T plus one, time T plus two up
-
18:16 - 18:18to time,
-
18:18 - 18:20capital T.
-
18:23 - 18:27I'll just state in English again, this is your expected optimal
-
18:27 - 18:28total payoff
-
18:28 - 18:31if you start your system in a state, S,
-
18:31 - 18:35and if the clock is already at time,
-
18:35 - 18:37lowercase T.
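In symbols, the definition being written down is roughly:

```latex
V^*_t(s) \;=\; \max_{\pi}\ \mathbb{E}\!\left[\, R^{(t)}(s_t,a_t) + R^{(t+1)}(s_{t+1},a_{t+1}) + \cdots + R^{(T)}(s_T,a_T) \;\middle|\; s_t = s,\ \pi \right]
```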
-
18:38 - 18:46So it turns out there's a version of value iteration you can apply to this. Let me just write out the value iteration algorithm for this.
-
18:46 - 18:48It turns out
-
18:48 - 18:49you can -
-
18:49 - 18:54well, let me just write this
-
18:54 - 18:57out. I'll write this below.
-
18:57 - 19:00It turns out you can compute the optimal value function for the MDP using the
-
19:00 - 19:03following recursion, which is very similar to
-
19:03 - 19:06what we have for value iteration. We're going
-
19:06 - 19:07to set V
-
19:07 - 19:20of S to be equal to the max over A, same as before, right? Okay?
-
19:41 - 19:42So
-
19:42 - 19:47if I start the clock at time T and from state S,
-
19:47 - 19:49my expected total payoff is equal to
-
19:49 - 19:52the maximum [inaudible] actions I may take
-
19:52 - 19:54of my immediate reward. Taking that
-
19:54 - 19:56action, A, in that state,
-
19:56 - 20:00S. Then plus my expected future payoff. So if I take action, A,
-
20:00 - 20:02I would transition with probability P, subscript SA, of S prime
-
20:02 - 20:06to some new state, S
-
20:06 - 20:09prime. If I get to the state, S prime,
-
20:09 - 20:12my total expected payoff from the
-
20:12 - 20:19state S prime would be V star, now subscript T plus one, of S prime. The subscript T plus one
-
20:19 - 20:23reflects that after I've taken one action, my clock will have advanced from
-
20:23 - 20:26time T to time T plus one. So this is now V
-
20:26 - 20:31star subscript T plus one.
-
20:31 - 20:34So this expresses V star of T
-
20:34 - 20:36in terms of V star T plus one.
-
20:36 - 20:40Then lastly, to start off this recursion, you
-
20:40 - 20:42would have V star,
-
20:42 - 20:43capital T
-
20:43 - 20:50is equal to - it's just equal
-
20:51 - 20:53to that.
-
20:53 - 20:57If you're already at time, capital T, then you just get to take one action, and then
-
20:57 - 20:59the clock runs out. So this is
-
20:59 - 21:04V star capital T. Your value of starting in some state, S, with no time -
-
21:04 - 21:09with just one time step, with no time left on the clock.
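Putting the pieces of this recursion together:

```latex
V^*_T(s) = \max_{a}\ R^{(T)}(s,a), \qquad
V^*_t(s) = \max_{a}\left[ R^{(t)}(s,a) + \sum_{s'} P^{(t)}_{sa}(s')\, V^*_{t+1}(s') \right]
```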
-
21:09 - 21:13So in the case of finite horizon MDPs, this actually gives us a very nice
-
21:14 - 21:16dynamic programming algorithm
-
21:16 - 21:17in which you can
-
21:17 - 21:21start off by computing V star of T.
-
21:21 - 21:38Then you use this backward recursion to compute V star of capital T minus one, capital T minus two and so on. We compute V star of T, then T minus one, and so on; it recurses backwards until you've computed V star for all of your time steps. Can you see this
-
21:38 - 21:40board?
-
21:40 - 21:42Cool. Then
-
21:45 - 21:48the final step is
-
21:48 - 22:04- previously, we said that pi star of S - I'm going to come back and change this a bit - was the arg max over A of R of S, A, plus gamma times the sum over S prime of P of S, A of S prime, times V star of S prime - this
-
22:04 - 22:07is sort of what we had. In
-
22:07 - 22:10the finite horizon case, the
-
22:10 - 22:27optimal action may depend on what time it is. So the optimal action to take at time T is the arg max over actions, A. This is basically the arg max of exactly that same thing on the right-hand side as we had
-
22:27 - 22:30in our dynamic programming
-
22:30 - 22:33algorithm. So you do this for every time step, and now you compute it,
-
22:33 - 22:37your optimal policy for different time
-
22:37 - 22:40steps. Again, this is a non-stationary policy
-
22:40 - 22:42because pi star
-
22:42 - 22:47of S may depend on what time it is.
-
22:47 - 23:01So to summarize, the difference between this and the earlier version of value iteration is that - so what you do is you compute V star of capital T. Then using the backwards recursion of that dynamic programming algorithm, you compute V star T
-
23:01 - 23:04minus one. Then V star T minus two
-
23:04 - 23:07and so on down to V star of zero.
-
23:07 - 23:10Then from these, you compute pi star.
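Here is a small sketch (not from the lecture) of that backward recursion for a finite MDP, again with hypothetical arrays R[t][s, a] and P[t][s, a, s'] to allow non-stationary rewards and dynamics:

```python
import numpy as np

def finite_horizon_dp(R, P):
    """Finite-horizon dynamic programming.
    R: list of (S, A) reward arrays R^(t)(s, a), for t = 0..T
    P: list of (S, A, S) transition arrays P^(t)_sa(s'), for t = 0..T-1
    Returns V[t][s] = V*_t(s) and pi[t][s] = pi*_t(s).
    """
    T = len(R) - 1
    V = [None] * (T + 1)
    pi = [None] * (T + 1)
    # Base case: at time T you only get to collect one more reward.
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)
    # Backward recursion: compute V*_t from V*_{t+1}.
    for t in range(T - 1, -1, -1):
        Q = R[t] + P[t].dot(V[t + 1])   # Q_t(s, a)
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```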
-
23:13 - 23:18So one - there's not a huge difference, but one minor difference compared to
-
23:18 - 23:22the infinite horizon discounted case
-
23:22 - 23:26is that by running this recursion once, you now have exactly
-
23:26 - 23:30the right value function. So this just computes the value function, rather
-
23:30 - 23:31than
-
23:31 - 23:38merely converging to it asymptotically. This just gives you the right value function with one
-
23:38 - 23:40
-
23:40 - 23:43pass. Cool. Any questions there? Yeah.
-
23:43 - 23:48[Inaudible].
-
23:50 - 24:05Student: Is this computation much shorter than value iteration? Instructor: So, sort of yes and no. It depends on how large capital T is. Student: [Inaudible] for the normal MDP, could you choose a capital T and then use this approach for that case? Instructor: I see.
-
24:06 - 24:09So for the normal MDP, can you assume capital T and then assume this? So
-
24:09 - 24:13it actually turns out that - that's a
-
24:13 - 24:15great question. Let
-
24:15 - 24:19me just answer this in a hand-wavy way.
-
24:19 - 24:23So it actually turns out for a discounted infinite horizon MDP
-
24:23 - 24:26where some discount factor gamma.
-
24:26 - 24:29So what you can do is you can say,
-
24:29 - 24:35after T time steps, gamma to the T would be really small. It would be like epsilon or something. I
-
24:35 - 24:36don't
-
24:36 - 24:40really care what happens after that many times because the rewards are
-
24:40 - 24:43multiplied by gamma to the T. After that, I don't really care. So you can
-
24:43 - 24:44ask,
-
24:44 - 24:45can I take
-
24:45 - 24:48my infinite horizon discounted MDP
-
24:48 - 24:53and approximate that with a finite horizon MDP where the number of time steps, T,
-
24:53 - 24:55is chosen so that this holds true. So it
-
24:55 - 24:58turns out you can do that.
-
24:58 - 25:02Then you end up with some value for T. You can solve for T so
-
25:02 - 25:04that this holds true.
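For instance, one way to make this precise (an illustrative choice, not spelled out in the lecture) is to require gamma to the T to be at most some small epsilon:

```latex
\gamma^{T} \le \epsilon \quad\Longleftrightarrow\quad T \ge \frac{\log(1/\epsilon)}{\log(1/\gamma)}
```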
-
25:04 - 25:10It turns out you can prove that if you took the original value iteration algorithm
-
25:10 - 25:15and if you run the original value iteration algorithm - the version for discounted MDPs -
-
25:15 - 25:17If you run that for
-
25:17 - 25:20this same number of time steps,
-
25:20 - 25:24you will end up with an approximation to the value function that is
-
25:24 - 25:27about this close, up to some small constant factors.
-
25:27 - 25:31So to do that, you end up with roughly the same amounts of computation anyway.
-
25:31 - 25:33Then
-
25:33 - 25:36you actually end up with a non-stationary policy, which is more expensive to keep
-
25:37 - 25:41around. You need to keep around a different policy for every time step, which is not
-
25:41 - 25:47as nice as if you had the stationary policy, same policy for all times. So
-
25:47 - 25:51there are other reasons, but sometimes you might take an
-
25:51 - 25:55infinite horizon discounted problem and approximate it to a finite horizon problem.
-
25:55 - 25:56But
-
25:56 - 25:59this particular reason is
-
25:59 - 26:01not the one. That makes
-
26:01 - 26:04sense. More
-
26:04 - 26:11questions? Interviewee: [Inaudible]?
-
26:13 - 26:15Is
-
26:15 - 26:17there a gamma in this?
-
26:17 - 26:20So if you want, you can actually
-
26:20 - 26:23change the definition of an MDP
-
26:23 - 26:28and use a finite horizon discounted MDP. If you want, you can do that. You
-
26:28 - 26:29can actually come
-
26:29 - 26:32in and put a gamma there,
-
26:32 - 26:35and use discounting and the finite horizon.
-
26:35 - 26:38It turns out that usually, for most
-
26:38 - 26:41problems that people deal with, you either use discounting
-
26:41 - 26:43or you
-
26:43 - 26:47use the finite horizon. It's been less common to do both, but you can certainly do as well.
-
26:47 - 26:52One of the nice things about discounting is it makes sure your value function is finite.
-
26:53 - 26:58Algorithmically and mathematically, one of the reasons to use discounting is
-
26:58 - 27:08because you're multiplying your rewards by an exponentially decaying factor. It's a geometrically decaying series, which ensures that the value function is always finite. This is a really nice mathematical property when you do discounting.
-
27:08 - 27:11So when you have a finite horizon anyway,
-
27:11 - 27:15then the value function's also guaranteed to be finite. So with that, you
-
27:15 - 27:22don't have to use discounting. But if you want, you can actually discount
-
27:24 - 27:35as well. Student: [Inaudible]. Instructor (Andrew Ng): Yeah, yes, you're right. If you want, you can redefine the reward function to fold the discounting into the reward function, since we have non-stationary rewards as well. So
-
27:40 - 27:41that was
-
27:41 - 27:44finite horizon
-
27:44 - 27:45MDPs.
-
27:45 - 27:49What I want to do now is actually use
-
27:49 - 27:53both of these ideas, your state action rewards and your finite horizon MDPs
-
27:53 - 27:58to describe a special case of MDPs that
-
27:58 - 28:00makes very strong assumptions about the problem.
-
28:00 - 28:01
-
28:01 - 28:04But these assumptions are reasonable for many systems. With these
-
28:04 - 28:06assumptions, what we come up with, I
-
28:06 - 28:12think, are very nice and very elegant algorithms for solving even very large
-
28:12 - 28:16MDPs.
-
28:30 - 28:33So let's talk about linear
-
28:33 - 28:36quadratic regulation.
-
28:52 - 28:56We just talked about dynamic programming for finite horizon MDPs, so
-
28:56 - 28:58just remember that algorithm.
-
28:58 - 29:02When I come back to talk about an algorithm for solving LQR problems,
-
29:02 - 29:05I'm actually going to use exactly that dynamic programming algorithm
-
29:05 - 29:10that you just saw for finite horizon MDPs. I'll be using exactly
-
29:10 - 29:14that algorithm again. So just remember that for now.
-
29:14 - 29:16So let's talk about LQR.
-
29:16 - 29:22So I want to take these ideas and apply them to MDPs with continuous
-
29:22 - 29:25state spaces and maybe even continuous action
-
29:25 - 29:26spaces. So
-
29:26 - 29:34to specify an MDP, I need to give you this five-tuple
-
29:34 - 29:36of states, actions, state transition probabilities and the reward. I'm going to use
-
29:36 - 29:41the finite horizon, capital T, rather than
-
29:41 - 29:43discounting.
-
29:43 - 29:48So in LQR problems, I'm going to assume the following. I'm
-
29:48 - 29:51going to assume that the
-
29:51 - 30:04state space is continuous, S equals R N. And I'm going to assume, also, a continuous set of actions lying in R D.
-
30:04 - 30:08To specify the state transition probabilities, PSA,
-
30:08 - 30:12I need to tell you what the distribution of the next state is, given the current state and
-
30:12 - 30:14the current action.
-
30:14 - 30:17So we actually saw a little bit of this in the last lecture. I want to assume
-
30:17 - 30:20the next state, ST plus one,
-
30:20 - 30:23is going to be a linear function
-
30:23 - 30:25of the previous
-
30:25 - 30:27state, AST plus BAT
-
30:27 - 30:29plus WT -
-
30:29 - 30:35oh, excuse me. I meant to subscript that.
-
30:55 - 31:12where W T is Gaussian noise with mean zero and some covariance given by sigma W. I've written A and B here with subscripts T to show that these matrices could change over time. So this would be
-
31:12 - 31:15non-stationary dynamics.
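Written out, the assumed dynamics are:

```latex
s_{t+1} = A_t s_t + B_t a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w)
```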
-
31:15 - 31:20As a point of notation, unfortunately compiling
-
31:20 - 31:22ideas from multiple literatures,
-
31:22 - 31:26so it's sort of unfortunately that capital A denotes both a set of actions
-
31:26 - 31:28as well as
-
31:28 - 31:30a matrix.
-
31:30 - 31:34When you see A later on, A will usually be used to denote a matrix,
-
31:34 - 31:39rather than a set of actions. So [inaudible] overload notation again, but
-
31:39 - 31:42unfortunately the notational conventions when you have
-
31:42 - 31:47research ideas in multiple research communities, often they share the same symbol.
-
31:48 - 31:50So just to be concrete,
-
31:50 - 31:53AT is a matrix
-
31:53 - 31:56that's N
-
31:56 - 32:01by N. The B T's are matrices that are N by D.
-
32:02 - 32:06Just to be completely clear, right, the matrixes A and B,
-
32:06 - 32:09I'm going to assume, are fixed and known in advance. So I'm going to
-
32:09 - 32:12give you the matrices,
-
32:12 - 32:16A and B, and I'm going to give you sigma W. Your job is to find a good policy
-
32:16 - 32:17for
-
32:17 - 32:19this MDP. So in other words,
-
32:19 - 32:21this is my specification
-
32:21 - 32:25of the state transition probabilities.
-
32:25 - 32:28Looking ahead,
-
32:28 - 32:31we see this later, it turns out this
-
32:31 - 32:38noise term is not very important. So
-
32:40 - 32:41it turns out that
-
32:41 - 32:44the treatment of the noise term is not very important.
-
32:44 - 32:46We'll see this later. We can
-
32:46 - 32:47pretty much
-
32:47 - 32:49ignore the noise
-
32:49 - 32:53term, and we'll still do fine. This is just a warning
-
32:53 - 32:57in the sequel, what I do later, I might be slightly sloppy
-
32:57 - 32:58in my treatment
-
32:58 - 33:03of the noise term. In this very special case, it would be unimportant.
-
33:03 - 33:04The
-
33:04 - 33:06last
-
33:06 - 33:09thing I have to specify is some horizon time,
-
33:09 - 33:13and then I also have some
-
33:13 - 33:14reward
-
33:14 - 33:17function.
-
33:17 - 33:20For LQR control, I'm going to assume that a reward function
-
33:20 - 33:27can be written as
-
33:30 - 33:32this, where
-
33:32 - 33:35UT is a matrix that's N by N. VT is
-
33:35 - 33:36a
-
33:38 - 33:41matrix that's D by D. I'll
-
33:41 - 33:44assume that UT
-
33:44 - 33:47and VT are both positive semi-definite.
-
33:47 - 33:51Are both PSD. So
-
33:51 - 33:57the fact that UT and VT are both positive semi-definite matrices,
-
33:57 - 34:00that implies that
-
34:00 - 34:10S T transpose U T S T is greater than or equal to zero. Similarly, A T transpose V T A T is greater than or equal to zero.
-
34:10 - 34:17So this implies that
-
34:17 - 34:20your rewards are always negative. This is a
-
34:20 - 34:24somewhat depressing MDP in which there are only costs and no positive rewards, right,
-
34:24 - 34:24because of
-
34:24 - 34:27the minus sign
-
34:27 - 34:29there.
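So the reward function being assumed is:

```latex
R^{(t)}(s_t, a_t) = -\left( s_t^T U_t\, s_t + a_t^T V_t\, a_t \right), \qquad U_t \succeq 0,\ V_t \succeq 0
```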
-
34:29 - 34:36So
-
34:43 - 34:47as
-
34:47 - 34:50a complete example for how you might want to apply this,
-
34:50 - 34:54you've seen my helicopter videos, right? So one thing is,
-
34:54 - 34:56for example,
-
34:56 - 34:58suppose you have a helicopter,
-
34:58 - 35:01and you want the state ST to be
-
35:01 - 35:04as close to zero as possible.
-
35:05 - 35:08Then
-
35:08 - 35:10you might choose UT
-
35:10 - 35:13to be equal to the identity matrix.
-
35:13 - 35:38This will make R of S T, A T be equal to - I'll just write that down - oh, excuse me, minus: minus S T transpose S T, the negative of the squared norm of the state vector. So this would be penalizing the
-
35:38 - 35:40system
-
35:40 - 35:41quadratically
-
35:41 - 35:52for having a state that's far from zero, assuming that zero's the origin state. So if the goal is to make a helicopter hover around the state zero, then you might choose this sort of reward function. It
-
35:52 - 35:54turns out
-
35:55 - 36:12it's also very common to choose a cost for the action. So suppose I choose V T to be equal to an identity matrix. I get minus A T transpose A T here, which penalizes large actions as well.
-
36:12 - 36:16Including a quadratic cost for actions, it's also a fairly common
-
36:16 - 36:17thing to do.
-
36:17 - 36:20In practice, this tends to be effective
-
36:20 - 36:23of discouraging your system from
-
36:23 - 36:26jerking the controls around. This discourages making very huge control commands. Having
-
36:26 - 36:28a term like this
-
36:28 - 36:32reward function often makes many systems behave better.
-
36:32 - 36:45Of course, you can choose different values on the diagonal to give different state variables different weights and so on. So there are lots of possible choices for U and V. This is one example.
-
36:50 - 36:56So
-
36:56 - 36:59for the next few steps,
-
36:59 - 37:04I'm going to write out things, I'm going to derive things,
-
37:04 - 37:09for the general case of non-stationary dynamics. I'm going - as I write
-
37:09 - 37:12out more math and more equations for LQR,
-
37:12 - 37:14I'm going to try write it out
-
37:14 - 37:19for the fairly general case of time varied dynamics and time varied reward
-
37:19 - 37:22functions. So I'm [inaudible] function.
-
37:22 - 37:26But for purposes of understanding this material,
-
37:26 - 37:29you might want to think of just ignoring many of the subscripts, in terms
-
37:29 - 37:32of T. So
-
37:34 - 37:36for the sake of understanding this material,
-
37:36 - 37:40you might want to mentally assume that there are some fixed matrix, A, so
-
37:40 - 37:46that A is equal to A1, A2, equals A3 and so on.
-
37:46 - 37:51Similarly, there's some matrix B.
-
37:51 - 37:52Okay?
-
37:52 - 37:56So write it out for the fully general, non-stationary case, but
-
37:56 - 37:59you might just want to ignore many of the time subscripts and imagine the
-
37:59 - 38:02stationary case for now.
-
38:07 - 38:10Quite a bit later,
-
38:10 - 38:14we're going to talk about an extension of this called differential dynamic programming that will
-
38:14 - 38:17actually use a non-stationary
-
38:17 - 38:21[inaudible] to a very powerful effect for a specific algorithm. But
-
38:21 - 38:26for most of what we're about to do, just pretend that MDP is stationary.
-
38:26 - 38:33Okay.
-
38:49 - 38:51So before I
-
38:51 - 38:55talk about models, let me just say a couple of words about how you would go about
-
38:55 - 38:57coming up with the linear models. The
-
38:57 - 39:01key assumption in this model is that the dynamics are linear. There's also the assumption the
-
39:01 - 39:03reward function is quadratic, but
-
39:03 - 39:07let's talk about the assumption that the dynamics are linear.
-
39:07 - 39:14S T plus one equals A S T plus B A T. Maybe time varying, maybe
-
39:14 - 39:17stationary. I'm just writing stationary for now.
-
39:17 - 39:20So how do you get models like this?
-
39:20 - 39:23We actually saw one example of this already in the previous lecture. If
-
39:23 - 39:27you have an inverted pendulum system, and you want to model the
-
39:27 - 39:28inverted pendulum
-
39:29 - 39:34using a linear model like this, maybe [inaudible]. I'm not going to write that down.
-
39:34 - 39:38One thing you could do is run your inverted pendulum, start it off in some state as
-
39:38 - 39:39zero,
-
39:39 - 39:41take some action, A0, have it
-
39:41 - 39:46get to some state, S1. Take action A1 and so on, get to some state ST.
-
39:46 - 39:52I'll use a superscript one to denote that this is my first trial. Then you can repeat this a bunch of times.
-
39:52 - 39:55You can repeat this N times. I'm
-
39:55 - 39:58just executing actions on
-
39:58 - 40:00your physical robot.
-
40:00 - 40:02It could be a robot, it could be a
-
40:02 - 40:06chemical plant. It could be whatever. Trying out different actions in
-
40:06 - 40:10your system and watch
-
40:13 - 40:15what states it gets to. So
-
40:15 - 40:21you fit a linear model to your data, and choose the parameters A
-
40:21 - 40:26and B,
-
40:26 - 40:28that minimize
-
40:38 - 40:40the quadratic
-
40:40 - 40:41error term.
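As a small sketch (not from the lecture) of that fitting step: stacking all observed transitions, you can solve for A and B jointly by linear least squares, for example as follows.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Fit s_{t+1} ~ A s_t + B a_t by least squares.
    states:      (m, n) array of observed s_t
    actions:     (m, d) array of observed a_t
    next_states: (m, n) array of observed s_{t+1}
    Returns (A, B) minimizing sum ||s_{t+1} - (A s_t + B a_t)||^2.
    """
    X = np.hstack([states, actions])          # (m, n + d)
    # Solve X @ Theta ~ next_states, where Theta stacks A^T over B^T.
    Theta, _, _, _ = np.linalg.lstsq(X, next_states, rcond=None)
    n = states.shape[1]
    A = Theta[:n, :].T
    B = Theta[n:, :].T
    return A, B
```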
-
40:47 - 40:52So this says how well does AST plus BAT
-
40:52 - 40:55predict ST plus one. So you minimize the quadratic penalty term. This would be
-
40:55 - 40:58one reasonable way to estimate
-
40:58 - 41:01the parameters of a linear dynamical system for a
-
41:01 - 41:08physical robot or a physical chemical plant or whatever you may have. Another
-
41:09 - 41:11way to
-
41:11 - 41:13come up with a linear
-
41:14 - 41:19model
-
41:19 - 41:24consistently, if I want to control, is to take a nonlinear model and to
-
41:24 - 41:30linearize it. Let me show you what I mean by that. So you can linearize a
-
41:30 - 41:36nonlinear model.
-
41:36 - 41:37So
-
41:39 - 41:42let's say you have some nonlinear model
-
41:42 - 41:45that expresses ST plus one as some function of
-
41:45 - 41:46ST
-
41:46 - 41:47and
-
41:47 - 41:50AT.
-
41:50 - 41:54In the example in the previous lecture, I said for the inverted pendulum
-
41:57 - 42:04you could come up with such a model by referring to the laws of physics - or actually by downloading off-the-shelf software for doing physics simulations. So if you haven't
-
42:04 - 42:08seen such simulators before, you can go online. You can easily find
-
42:08 - 42:11many open-source packages for simulating the
-
42:11 - 42:14physics of simple devices like these.
-
42:14 - 42:18Download the software, type in the specifications of your robot, and it will
-
42:18 - 42:22simulate the physics for you. There are lots of open-source software packages like that. You can just
-
42:22 - 42:24download them.
-
42:24 - 42:27But something like that, you can now build a physics simulator
-
42:27 - 42:32that predicts the state as a function of the previous state and the previous action. So
-
42:32 - 42:34you actually come up with some function that
-
42:34 - 42:40says that
-
42:42 - 42:45- the state at the next time step - the state vector -
-
42:45 - 42:52will be some function of the current state
-
42:53 - 42:55and the current
-
42:55 - 42:58action, where the action in this case is just a real number that says how
-
42:58 - 43:02hard you accelerated to the left or right.
-
43:02 - 43:06Then you can take this nonlinear model. I actually wrote down a sample of a
-
43:06 - 43:08model in the last lecture, but
-
43:08 - 43:11in general, F would be some nonlinear function. [Inaudible] of
-
43:11 - 43:13a linear function.
-
43:13 - 43:17So what I mean by linearize is the following. So here's just a cartoon. I'll
-
43:17 - 43:20write down the math in a second. Let's say the horizontal axis is
-
43:20 - 43:24the input state, ST,
-
43:24 - 43:28and the output state, ST plus one, as I said.
-
43:28 - 43:32Here's
-
43:32 - 43:35the function at F.
-
43:35 - 43:39So the next state, ST plus one, will be some function of the previous state,
-
43:39 - 43:42ST and the action
-
43:42 - 43:46AT. So to linearize this model, what you would do is you would choose a point.
-
43:46 - 44:00We'll call it S bar T. Then you would take the derivative of this function and fit the tangent straight line to that function.
-
44:00 - 44:04So this allows you to express the next state, ST plus one. You
-
44:04 - 44:07can approximate the next state, ST plus one, as
-
44:07 - 44:10this linear function of the
-
44:10 - 44:14previous state, ST. So
-
44:14 - 44:30to make this cartoon really right, the horizontal axis here is really a state action pair that you're linearizing around. So this is just a cartoon. The horizontal axis represents the input state and the input action.
-
44:30 - 44:37So
-
44:42 - 44:46just to write this out in
-
44:46 - 44:50math, I'll write out the simple case first and the fully general one
-
44:50 - 44:51in a second. Suppose
-
44:51 - 45:22the horizontal axis was only the state - so let's pretend there are no actions for now - and S T plus one is just some function of S T. Then the linear function I drew would be: S T plus one is approximated as F of S bar T, plus F prime evaluated at the point S bar T, times S T minus S bar T. So with this, you'd express S T plus one as a linear
-
45:22 - 45:23function
-
45:23 - 45:29of ST. Just note that S bar T is a constant.
-
45:31 - 45:33It's not
-
45:33 - 45:37a variable. Does that make sense? S bar T is a constant. F prime of S bar T is gradient of the function F at the point
-
45:37 - 45:38S bar T. This is
-
45:38 - 45:42really just the equation of that linear function. So you
-
45:42 - 45:46can then convert this to
-
45:52 - 45:55A and B matrixes. Jumping back one board, I'm going
-
45:55 - 45:57to point out one other thing.
-
45:57 - 45:58Let's
-
45:58 - 46:00say I look at this straight line,
-
46:00 - 46:01and I ask
-
46:01 - 46:03how well does this straight line
-
46:03 - 46:08approximate my function F, my original simulator, my original function F.
-
46:08 - 46:11Then you sort of notice that
-
46:11 - 46:12in this neighborhood,
-
46:12 - 46:15in the neighborhood of S bar, there's a
-
46:15 - 46:17pretty good approximation. It's fairly close.
-
46:17 - 46:19But then as you move further away,
-
46:19 - 46:24moving far off to the left here, it becomes a pretty terrible approximation.
-
46:24 - 46:26So
-
46:26 - 46:29when you linearize
-
46:29 - 46:33a nonlinear model to apply LQR,
-
46:33 - 46:37one of the parameters you have to choose would be the point around which to
-
46:37 - 46:39linearize your nonlinear model.
-
46:39 - 46:44So if you expect your inverted pendulum system
-
46:44 - 46:49to spend most of its time in the vicinity of this state,
-
46:49 - 46:49then it'd
-
46:49 - 46:53be reasonable to linearize around this state
-
46:53 - 46:55because that means that
-
46:55 - 46:57the linear approximation would be a good approximation,
-
46:57 - 47:02usually, for the states that you expect the system to spend most of its time in.
-
47:02 - 47:06If conversely, you expect the system to spend most of its time
-
47:06 - 47:08at states far to the left, then this would be
-
47:08 - 47:10a terrible location to linearize.
-
47:10 - 47:14So one rule of thumb is to choose the position to linearize according to where
-
47:14 - 47:17you expect the system to spend most of its time
-
47:17 - 47:20so that the linear approximation will tend to be
-
47:20 - 47:23an accurate approximation in the
-
47:23 - 47:27vicinity of the states the system actually visits.
-
47:27 - 47:29Just to be fair, it is about
-
47:29 - 47:30choosing the point, S bar,
-
47:30 - 47:32A bar,
-
47:32 - 47:34that we'll use
-
47:34 - 47:37to come up with a linear function
-
47:37 - 47:41that we'll pretend it's a good approximation to my original
-
47:41 - 47:47nonlinear function,
-
47:55 - 48:01F. So for an example like the inverted pendulum problem, this problem, if
-
48:01 - 48:04you expect to do pretty well in this problem,
-
48:04 - 48:08then you would expect the state
-
48:08 - 48:13to often be near the zero state.
-
48:13 - 48:29If S equals zero corresponds to X being the center of the railway track that the inverted pendulum lives on, and you expect to do fairly well, then you expect the pole to mostly be upright, at zero degrees or 90 degrees, I guess. So you choose
-
48:29 - 48:33whatever state corresponds to having the pole
-
48:33 - 48:34upright. The zero
-
48:34 - 48:37velocity [inaudible], near zero velocity in the middle of the track.
-
48:37 - 48:40So you usually choose that as a state
-
48:40 - 48:43to linearize
-
48:43 - 48:46your inverted pendulum dynamics around. That's
-
48:46 - 48:53a region where you might want your approximation to be good.
-
48:56 - 49:00So I wrote this down. To come back to this formula, I wrote this down for the special
-
49:00 - 49:54case of a one-D state variable with no actions. The general formula for the linearization approximation is: S T plus one is approximated as F of S bar T, A bar T, plus the gradient of F with respect to S, evaluated at S bar T, A bar T, times S T minus S bar T, plus the gradient of F with respect to A, evaluated at S bar T, A bar T, times A T minus A bar T. Okay?
-
49:54 - 49:55Where
-
49:55 - 50:07these upside down triangles are the usual symbol for taking the derivative of F with respect to its first, vector-valued argument and its second argument, respectively. So
-
50:07 - 50:12by choosing an appropriate point, S bar T, A bar T, to linearize
-
50:12 - 50:13around,
-
50:13 - 50:14you'd
-
50:14 - 50:18now express ST plus one
-
50:18 - 50:22as a linear function of the current state and
-
50:22 - 50:23the current
-
50:23 - 50:24action, AT. Again,
-
50:24 - 50:29these things, S bar T, is a constant that you choose ahead of time.
-
50:29 - 50:32Same for A bar
-
50:35 - 50:40T. Lastly, having linearized this thing, you can then convert it
-
50:40 - 50:43to matrices
-
50:43 - 50:47like
-
50:47 - 50:49that.
-
50:49 - 50:55So the ST plus one is now a linear function of ST and
-
50:55 - 50:57AT.
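As an illustrative sketch (not from the lecture), here is one way to get those A and B matrices numerically from a simulator f(s, a), by finite-difference approximation of the two gradients around a chosen point (s_bar, a_bar):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Linearize s_{t+1} = f(s, a) around (s_bar, a_bar):
       f(s, a) ~ f(s_bar, a_bar) + A (s - s_bar) + B (a - a_bar).
    Returns A (n x n), B (n x d), and the constant c = f(s_bar, a_bar)."""
    s_bar = np.asarray(s_bar, dtype=float)
    a_bar = np.asarray(a_bar, dtype=float)
    n, d = len(s_bar), len(a_bar)
    c = f(s_bar, a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):                 # df/ds_i by central differences
        ds = np.zeros(n)
        ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f(s_bar - ds, a_bar)) / (2 * eps)
    for j in range(d):                 # df/da_j by central differences
        da = np.zeros(d)
        da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f(s_bar, a_bar - da)) / (2 * eps)
    return A, B, c
```

The leftover constant term c minus A s_bar minus B a_bar is exactly the affine offset discussed next, which can be absorbed by augmenting the state.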
-
50:57 - 51:04Questions about this?
-
51:11 - 51:16So just one tiny detail, and it's really not a huge deal, is that this thing
-
51:16 - 51:20below is technically an affine function.
-
51:20 - 51:23There might actually be an extra constant there, but
-
51:23 - 51:27this is not a difficult generalization of a linear dynamical system definition.
-
51:28 - 51:32One way to deal with that constant is actually to do something like take your
-
51:32 - 51:46definition of the state - let's say X, X dot, theta, theta dot - and augment your state vector to have an extra intercept term, one. With the intercept one, and working out the A matrix, you can then
-
51:46 - 51:49take care of the extra constant, C, as well. So you can
-
51:49 - 51:51deal with this thing being -
-
51:51 - 51:55technically it's an affine function because of this extra offset, rather than a linear
-
51:55 - 51:56function.
-
51:56 - 52:02But this is just a little bit of bookkeeping you can work out for yourself and shouldn't
-
52:02 - 52:03be
-
52:03 - 52:05a huge
-
52:06 - 52:13deal.
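For instance, one way to write that bookkeeping (a sketch, with c denoting the extra constant term):

```latex
\tilde{s}_t = \begin{bmatrix} s_t \\ 1 \end{bmatrix}, \qquad
\tilde{A} = \begin{bmatrix} A & c \\ 0 & 1 \end{bmatrix}, \qquad
\tilde{B} = \begin{bmatrix} B \\ 0 \end{bmatrix}, \qquad
\tilde{s}_{t+1} \approx \tilde{A}\,\tilde{s}_t + \tilde{B}\, a_t
```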
-
52:18 - 52:19So to summarize, you see I
-
52:19 - 52:21have this up,
-
52:21 - 52:25you can learn a model, you can take a nonlinear model. Your nonlinear
-
52:25 - 52:29model can be a physics model or a nonlinear model you learned and linearize it.
-
52:30 - 52:32Now I'll pose the LQR problem
-
52:32 - 52:35in which we have
-
52:35 - 52:42specification of the MDP in which the states are in RN, the actions are in RD,
-
52:42 - 52:50and the state transition probabilities are given by the noisy linear equation: S T plus one equals
-
52:50 - 52:54ATST plus BTAT. Our rewards are going to be
-
52:54 - 52:57these quadratic functions.
-
52:57 - 53:02So the specification of the MDP means that we know the A matrixes,
-
53:02 - 53:06the B matrixes, the U matrixes
-
53:06 - 53:09and the V matrixes. Our goal is to come up with a policy
-
53:09 - 53:14to maximize our finite
-
53:14 - 53:19horizon sum of rewards.
-
53:20 - 53:23So our goal is to come up with a policy,
-
53:23 - 53:26first, to maximize
-
53:26 - 53:29the expected value of this
-
53:29 - 53:34finite horizon sum of
-
53:39 - 53:46rewards. Okay.
-
53:48 - 53:49So our
-
53:49 - 53:54approach to solving this problem
-
53:54 - 53:57will be exactly that
-
53:57 - 54:00finite horizon dynamic programming algorithm that we worked out a
-
54:00 - 54:02little earlier in this
-
54:02 - 54:04lecture. In particular,
-
54:04 - 54:07my strategy for finding the optimal policy
-
54:07 - 54:09will be to
-
54:09 - 54:12first find V star of
-
54:12 - 54:14T, the capital T,
-
54:14 - 54:19and then I'll apply by a recursion to find V star of T minus one, V star of T minus
-
54:19 - 54:23two and so on.
-
54:23 - 54:28In the dynamic programming algorithm we worked out, V star subscript T of the
-
54:28 - 54:39state S T - this is the maximum over the actions you might take at that time of R of S T, A T.
-
54:39 - 54:43Again, just for the sake of understanding this material, you can probably
-
54:43 - 54:47pretend the rewards and the dynamics are actually stationary. I'll write out all
-
54:47 - 54:53these subscripts all the time, but you can ignore them if you're reading this for the first time.
-
54:53 - 55:04So this is equal to the max over A capital T of minus S T transpose U T S T, minus A T transpose V T A T -
-
55:12 - 55:16right? I hope this isn't confusing. The superscript Ts denote transposes.
-
55:16 - 55:24The lowercase t subscripts denote the time index, which here is capital T. So that's just the definition of my quadratic rewards. So this
-
55:24 - 55:28is clearly maximized as minus
-
55:29 - 55:31ST transpose
-
55:31 - 55:38UTST
-
55:38 - 55:40because
-
55:40 - 55:48that last term, A T transpose V T A T, is greater than or equal to zero - that comes from my assumption that V T is positive semi-definite. So the best action to take in
-
55:48 - 55:49the last time step
-
55:49 - 55:53is just the action
-
55:53 - 55:54zero. So
-
55:54 - 56:01pi star subscript T of ST is equal to the
-
56:02 - 56:04arg max over actions of
-
56:04 - 56:11that same thing. It's just
-
56:11 - 56:12zero. It's by
-
56:12 - 56:16choosing the zero action,
-
56:16 - 56:19AT transpose VTAT becomes zero, and that's
-
56:19 - 56:39how this reward is maximized.
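So, written out, the start of the recursion is:

```latex
V^*_T(s_T) = -\, s_T^T U_T\, s_T, \qquad \pi^*_T(s_T) = 0
```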
-
56:50 - 56:57Any questions, or is something illegible? Okay.
-
57:00 - 57:04So now let's do the dynamic programming step where
-
57:04 - 57:10my goal is given
-
57:10 - 57:12VT plus one,
-
57:12 - 57:16I want to compute VT. Given V star T plus one, I want to compute
-
57:16 - 57:20V star of T. So this is the dynamic programming step.
-
57:22 - 57:25So the
-
57:25 - 57:29DP steps I wrote down previously was this. So for the finite state case, I
-
57:29 - 57:36wrote down the following.
-
58:32 - 58:36So this is exactly the equation I wrote down
-
58:36 - 58:37previously,
-
58:37 - 58:52and this is what I wrote down for finite states, where you have these discrete state transition probabilities, and we can sum over this discrete set of states. Now we're in a continuous, infinite state space again, so this sum over states should actually become an integral. I'm going to
-
58:52 - 58:56actually skip the integral step. We'll just go ahead and write this last term here as an
-
58:56 - 58:57expectation.
-
58:57 - 59:19So this is going to be the max over actions A T of R of S T, A T, plus - and then this becomes an expectation over the random next state, S T plus one, drawn from the state transition probabilities given by P of S T, A T - of V star T plus one of S T plus one. So this
-
59:19 - 59:21is
-
59:21 - 59:24the same equation written down
-
59:24 - 59:26as an expectation.
-
59:27 - 59:32So what I need to do is given a representation of V star T plus one, I
-
59:32 - 59:33need to find V
-
59:33 - 59:36star of T.
-
59:36 - 59:41So it turns out that LQR has the following useful property. It turns out that each of
-
59:41 - 59:43these value functions
-
59:43 - 59:46can be represented as a quadratic function.
-
59:46 - 59:49So concretely,
-
59:49 - 59:53let's suppose that V
-
59:53 - 60:00star T plus one - suppose that this can be expressed as a
-
60:00 - 60:07quadratic
-
60:09 - 60:11function, written like so,
-
60:11 - 60:12where
-
60:12 - 60:15the matrix phi T plus one
-
60:15 - 60:17is an N by N matrix, and
-
60:17 - 60:21psi T plus one
-
60:21 - 60:24is just a real number.
-
60:24 - 60:29So in other words, suppose V star T plus one is just a quadratic function
-
60:30 - 60:34of the state ST
-
60:34 - 60:37plus one.
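That is, the assumption (which the recursion below will preserve) is:

```latex
V^*_{t+1}(s_{t+1}) = s_{t+1}^T\, \Phi_{t+1}\, s_{t+1} + \Psi_{t+1}
```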
-
60:37 - 60:40We can then show that
-
60:40 - 60:44when you do one dynamic programming step - when you
-
60:46 - 60:50plug this definition of V star T plus one into your dynamic
-
60:50 - 60:52programming step in the equation I had just now,
-
60:52 - 60:55you can show that you would get
-
60:55 - 60:58that V star T as well, will
-
60:58 - 61:02also be a quadratic function of the
-
61:06 - 61:28same form, right? For some appropriate matrix, phi T, and some appropriate real number, psi of T. So what you can do is start off the recursion with - well, does that make
-
61:28 - 61:32sense?
-
61:32 - 61:35So what you can do is
-
61:35 - 61:37start off the recursion
-
61:37 - 61:41as follows. So previously, we worked out that V star capital T,
-
61:41 - 61:44we said that this is minus
-
61:44 - 61:48ST
-
61:48 - 61:50transpose UTST. So we
-
61:50 - 61:51have that phi
-
61:51 - 61:53of capital T is
-
61:53 - 61:56equal to minus UT,
-
61:56 - 61:59and psi of capital T is equal to zero.
-
61:59 - 62:01Now V star T of ST
-
62:01 - 62:08is equal to ST transpose phi of T, ST plus psi of T.
-
62:08 - 62:09So
-
62:09 - 62:13you can start out the recursion this way with phi of T equals minus UT and psi of T
-
62:13 - 62:15equals zero.
-
62:15 - 62:17Then work out
-
62:17 - 62:20what the recursion is. I won't
-
62:20 - 62:23
-
62:23 - 62:27actually do the full derivation. It's mainly algebra, and
-
62:29 - 62:33you've actually done this sort of Gaussian expectation
-
62:33 - 62:40math a lot in your homework by now.
-
62:40 - 62:44So I won't do the full derivation. I'll just outline the
-
62:45 - 62:47one-ish G step. So
-
62:47 - 62:51in dynamic programming step, V star ST is
-
62:51 - 62:53equal to
-
62:53 - 62:56max over actions AT of
-
62:56 - 63:03the immediate reward.
-
63:04 - 63:06So this was R of
-
63:06 - 63:09SA from my equation in the dynamic programming step.
-
63:09 - 63:12Then plus an expected value
-
63:12 - 63:18over the random next state, S T plus one, drawn from the
-
63:18 - 63:22Gaussian distribution would mean
-
63:22 - 63:24ATST plus
-
63:24 - 63:28B T A T and covariance sigma
-
63:28 - 63:32W. So what this is, this is really my
-
63:32 - 63:35specification for P of STAT. This is my
-
63:35 - 63:39state transition distribution in the LQR setting. This is my
-
63:39 - 63:41state transition distribution
-
63:41 - 63:50if I take action A T in the state S T. Then my next state is distributed Gaussian with mean A T S T plus B T A T and
-
63:50 - 63:51covariance
-
63:51 - 63:54sigma W. Then
-
63:54 - 64:24I take the expected value of the value of that next state. This, of course, is just V star T plus one of S T plus one. I hope
-
64:24 - 64:27this makes sense. This is just taking that equation I had previously in the
-
64:27 - 64:30dynamic programming step. So the V star of T, ST
-
64:30 - 64:33equals max over actions
-
64:33 - 64:36of the immediate rewards
-
64:36 - 64:41plus an expected value over the next state of V star of the next state with
-
64:41 - 64:43the clock advanced by one.
-
64:43 - 64:48So I've just plugged in the definitions of the reward, of the state transition distribution,
-
64:48 - 64:55and of the value function.
-
64:56 - 65:02Actually, could you raise your hand if this makes sense? Cool. So if you write
-
65:02 - 65:03this out
-
65:03 - 65:08and you expand the expectation - I know you've done this many times, so I
-
65:08 - 65:09won't do it -
-
65:09 - 65:11this whole thing on the right-hand side
-
65:11 - 65:15simplifies to a big quadratic function of the action, AT.
-
65:15 - 65:18So this whole
-
65:18 - 65:23thing simplifies to a big
-
65:23 - 65:26quadratic function
-
65:32 - 65:36of the action
-
65:36 - 65:38AT.
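Concretely, plugging in the quadratic form of V star T plus one and using the standard Gaussian identity E[(m + w)^T Phi (m + w)] = m^T Phi m + tr(Sigma_w Phi) for w ~ N(0, Sigma_w), the quantity inside the max works out to (a sketch of the algebra being skipped here):

```latex
-\,s_t^T U_t s_t \;-\; a_t^T V_t a_t
\;+\; (A_t s_t + B_t a_t)^T \Phi_{t+1} (A_t s_t + B_t a_t)
\;+\; \operatorname{tr}(\Sigma_w \Phi_{t+1}) \;+\; \Psi_{t+1}
```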
-
65:38 - 65:42We want to maximize this with respect to the actions AT. So to
-
65:42 - 65:44maximize a big quadratic function, you
-
65:44 - 65:48just take the derivative of the function with respect to the action AT, set the derivative equal
-
65:48 - 65:52to zero, and then you've maximized the right-hand side, with
-
65:52 - 65:56respect to the action,
-
65:56 - 66:00AT. It turns out - I'm just going to write this expression down for completeness. You
-
66:00 - 66:02can derive it yourself at any time. It turns
-
66:02 - 66:05out if you actually maximize that thing on the right- hand side as a
-
66:05 - 66:07function of the actions, AT,
-
66:07 - 66:12you find that the optimal AT is going to be
-
66:12 - 66:14that
-
66:14 - 66:21times
-
66:23 - 66:26ST.
-
66:26 - 66:29Don't
-
66:29 - 66:34worry about this expression. You can get it from [inaudible] and derive it yourself.
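(Since the expression written on the board is inaudible in the recording, here, for reference, is the form a standard derivation gives, under the assumption - a reconstruction, not a transcription - that the immediate reward is R^{(t)}(s_t, a_t) = -(s_t^\top U_t s_t + a_t^\top V_t a_t), with V_t the action-cost matrix:

\[
a_t = \big(V_t - B_t^\top \Phi_{t+1} B_t\big)^{-1} B_t^\top \Phi_{t+1} A_t \; s_t,
\]

obtained by setting the gradient of the right-hand side with respect to a_t to zero.)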
-
66:34 - 66:38But the key thing to note is that the optimal action, AT, is going to be some big
-
66:38 - 66:39matrix.
-
66:39 - 66:40We're going to call this thing
-
66:40 - 66:43LT
-
66:43 - 66:47times
-
66:49 - 66:51ST. In other words, the optimal action to take in this
-
66:51 - 66:53given state is going to be
-
66:53 - 66:55some linear function
-
66:55 - 66:59of the state, ST. So
-
66:59 - 67:03having done dynamic programming, you remember also when we worked out the
-
67:03 - 67:05dynamic programming algorithm for finite
-
67:05 - 67:06horizon MDPs,
-
67:06 - 67:08we said that
-
67:08 - 67:12the way you compute the optimal policy, pi star of T
-
67:12 - 67:14of ST.
-
67:14 - 67:18This is always the arg max of the same thing. The arg max
-
67:18 - 67:23over actions AT of the same thing:
-
67:23 - 67:25R of ST, AT plus
-
67:25 - 67:32your expected value, over ST plus one drawn from P of ST, AT, of V star, T
-
67:36 - 67:39plus one, ST plus one. This thing on the right-hand side is always the same thing as
-
67:39 - 67:42the thing we maximized
-
67:42 - 67:43
-
67:43 - 67:46above. So what this means is that when I take the value of AT that maximizes
-
67:46 - 67:50this, the optimal action to take from the state ST
-
67:50 - 67:53is actually equal to
-
67:53 - 67:56LT
-
67:56 - 68:02times ST.
-
68:02 - 68:04What
-
68:04 - 68:06was shown is that
-
68:06 - 68:08when you're in some state, ST, the
-
68:08 - 68:14optimal action for that state is going to be some matrix, LT, which can compute,
-
68:14 - 68:14times
-
68:14 - 68:17the state,
-
68:17 - 68:18ST.
-
68:18 - 68:24In other words, the optimal action is actually a linear function of the state.
-
68:24 - 68:27I'm just going to point out, this is not a function approximation here, right.
-
68:27 - 68:32What we did not do, we did not say,
-
68:32 - 68:35let's find the optimal linear policy. We didn't say, let's look at the optimal
-
68:35 - 68:38policy, and then we'll fit this straight line to the optimal policy. This is not
-
68:38 - 68:42about approximating the optimal policy with a straight line.
-
68:42 - 68:47This derivation is saying that the optimal policy is a straight line. The
-
68:47 - 68:54optimal action is a linear function of the current
-
68:54 - 68:57state.
-
68:57 - 69:01Moreover, you've worked out this value for AT
-
69:01 - 69:02
-
69:02 - 69:05that maximizes this thing on
-
69:05 - 69:10the right-hand side. So you take this and plug it back in to do the dynamic programming recursion.
-
69:10 - 69:17What you find is that -
-
69:20 - 69:24so you take AT and plug it back in to do the maximization. It will
-
69:24 - 69:26actually get
-
69:26 - 69:28you this formula,
-
69:28 - 69:31so V star
-
69:31 - 69:35TST. So you find that
-
69:35 - 69:39it will indeed be a quadratic function like this
-
69:39 - 69:42of the following form where
-
69:42 - 69:46- and I just write out the equations for the sake of
-
69:46 - 69:53completeness. Don't worry too much about their forms. You can
-
69:57 - 70:04derive this yourself.
-
70:48 - 70:52So just to summarize, don't worry too much about the forms of these
-
70:52 - 70:54equations.
-
70:54 - 70:58What we've done is written down the recursion that expresses
-
70:58 - 71:00phi T and psi T
-
71:00 - 71:02as a function of phi T plus one
-
71:02 - 71:04and psi T plus one.
-
71:04 - 71:05So
-
71:05 - 71:08this allows you to compute the optimal value function
-
71:08 - 71:14for when the clock is at time lowercase T, as a function of the optimal value
-
71:14 - 71:18function for when the clock is at time T plus one.
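(For reference, under the same assumed reward convention as above, carrying the algebra through gives the recursion

\[
\Phi_t = A_t^\top\Big(\Phi_{t+1} - \Phi_{t+1} B_t\big(B_t^\top \Phi_{t+1} B_t - V_t\big)^{-1} B_t^\top \Phi_{t+1}\Big) A_t \;-\; U_t,
\qquad
\psi_t = \psi_{t+1} + \operatorname{tr}\big(\Sigma_w \Phi_{t+1}\big).
\]

Treat the exact symbols as a reconstruction rather than a transcription of the board.)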
-
71:20 - 71:25So
-
71:25 - 71:28to summarize,
-
71:28 - 71:33here's our finite horizon LQR algorithm
-
71:33 - 71:36- actually, just to give this equation a name as well. This
-
71:36 - 71:40recursion, in terms of the phi Ts, this is called the
-
71:40 - 71:45discrete
-
71:45 - 71:52time Riccati equation. It's the
-
71:54 - 71:58recursion that gives you phi
-
71:58 - 72:02T in terms of phi T plus one.
-
72:02 - 72:05So
-
72:05 - 72:11to summarize,
-
72:12 - 72:17our algorithm for finding the exact solution to finite horizon LQR
-
72:17 - 72:23problems is as follows. We initialize phi T
-
72:23 - 72:24to
-
72:24 - 72:27be equal to minus
-
72:30 - 72:33UT
-
72:33 - 72:37and psi T to be equal to zero.
-
72:37 - 72:39Then
-
72:39 - 72:43recursively,
-
72:43 - 72:45calculate
-
72:45 - 72:47phi T
-
72:47 - 72:49and psi
-
72:49 - 72:50T
-
72:50 - 72:54as a function of phi T plus one
-
72:54 - 72:56and psi T plus
-
72:56 - 73:01one with the discrete time -
-
73:02 - 73:06actually, excuse me. So
-
73:06 - 73:09recursively calculate phi T and psi
-
73:09 - 73:11T as a function of phi T plus one and psi T plus one, as I showed,
-
73:11 - 73:15using the discrete time Riccati equation.
-
73:15 - 73:18So you do this for
-
73:18 - 73:19T equals
-
73:19 - 73:22T minus one, T minus two
-
73:22 - 73:29and so on, down to time zero. Then you
-
73:29 - 73:33compute
-
73:33 - 73:37LT as a function of - actually, is it phi T or phi T
-
73:37 - 73:40plus one? Phi T
-
73:40 - 73:42plus one, I think.
-
73:42 - 73:46As a function of phi T plus one and psi T plus one.
-
73:46 - 73:50This is actually a function of only phi T plus one. You don't really need psi T
-
73:50 - 73:53plus one.
-
73:53 - 74:00Now you have your optimal policy.
-
74:00 - 74:03So having computed the LTs,
-
74:03 - 74:06you now have the
-
74:06 - 74:08optimal action to take in the state
-
74:08 - 74:15ST, just given by
-
74:25 - 74:29this
-
74:29 - 74:35linear equation.
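(As a minimal reference sketch of the summary above - written under the assumed reward convention R(s_t, a_t) = -(s_t' U s_t + a_t' V a_t), with time-invariant matrices purely to keep it short; this is illustrative code, not code from the course:

import numpy as np

def finite_horizon_lqr(A, B, U, V, Sigma_w, T):
    """Backward recursion for the finite-horizon LQR problem sketched above.
    Assumed conventions: reward R(s_t, a_t) = -(s_t' U s_t + a_t' V a_t),
    dynamics s_{t+1} ~ N(A s_t + B a_t, Sigma_w), and value function
    V*_t(s) = s' Phi_t s + psi_t.  The time-varying case simply indexes
    A, B, U, V by t."""
    Phi = [None] * (T + 1)
    psi = [None] * (T + 1)
    L = [None] * T

    # Base case: V*_T(s) = -s' U s, i.e. Phi_T = -U and psi_T = 0.
    Phi[T] = -U
    psi[T] = 0.0

    # Discrete-time Riccati recursion, run backwards from t = T-1 down to 0.
    for t in reversed(range(T)):
        P = Phi[t + 1]
        M = B.T @ P @ B - V                  # negative definite here
        K = np.linalg.solve(M, B.T @ P)      # M^{-1} B' P
        Phi[t] = A.T @ (P - P @ B @ K) @ A - U
        psi[t] = psi[t + 1] + np.trace(Sigma_w @ P)
        # Optimal action is linear in the state: a_t = L_t s_t.
        L[t] = -np.linalg.solve(M, B.T @ P @ A)
    return Phi, psi, L

# Tiny usage example with made-up matrices (purely illustrative).
if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    U = np.eye(2)
    V = 0.1 * np.eye(1)
    Sigma_w = 0.01 * np.eye(2)
    Phi, psi, L = finite_horizon_lqr(A, B, U, V, Sigma_w, T=50)
    s = np.array([1.0, 0.0])
    a = L[0] @ s          # optimal action at time 0 in state s
    print(a)

Note that, exactly as said in the lecture, the controller only needs the gains L_t; the psi_t terms affect the value of a state but not which action is optimal.)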
-
74:35 - 74:44How much time do I have left? Okay. Let me just say one last thing about this before I close.
-
74:44 - 74:48Maybe I'll do it next week. I think I'll do it next session
-
74:48 - 74:52instead. So it actually turns out there's one cool property about this that is
-
74:52 - 74:55kind of subtle, but you'll find it out in the next lecture.
-
74:55 - 75:02Are there questions about this before we close for today, then? So
-
75:05 - 75:07the very cool thing about
-
75:07 - 75:13the solution of discrete time LQR problems - finite horizon LQR
-
75:13 - 75:17problems is that this is a problem with an infinite state space - with a continuous state space.
-
75:17 - 75:21But nonetheless, under the assumptions we made, you can prove that the
-
75:21 - 75:25value function is a quadratic function of the state.
-
75:25 - 75:29Therefore, just by computing these matrices phi T and the real numbers
-
75:29 - 75:33psi T, you can actually exactly represent the value function, even for
-
75:33 - 75:37these infinitely large state spaces, even for continuous state spaces.
-
75:37 - 75:43So the computation of these algorithms scales only like the cube,
-
75:43 - 75:46scales only as a polynomial in terms of the number of state variables
-
75:46 - 75:50whereas in [inaudible] dimensionality problems, with [inaudible], we
-
75:50 - 75:53had algorithms that scale exponentially with the dimension of the problem.
-
75:53 - 75:55Whereas LQR scales only like the cube
-
75:55 - 76:01of the dimension of the problem. So this easily
-
76:01 - 76:04applies to problems with even very large state spaces. So we actually often apply
-
76:04 - 76:06variations of this algorithm
-
76:06 - 76:09to some subset, to some particular subset of the things we do on our
-
76:09 - 76:13helicopter, which has high dimensional state spaces, with twelve or higher
-
76:13 - 76:14dimensions.
-
76:14 - 76:16This has worked very well for that.
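(To make the scaling claim concrete: each backward step of the Riccati recursion costs a handful of n-by-n matrix multiplications plus one linear solve in the action dimension, so the total work is on the order of T(n^3 + m^3) for state dimension n and action dimension m - polynomial, rather than exponential, in the dimension of the state.)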
-
76:16 - 76:17So
-
76:17 - 76:21it turns out there are even more things you can do with this, and I'll continue with
-
76:21 - 76:24that in the next lecture. So let's close for today, then.