This presentation is delivered by the Stanford Center for Professional Development.

Okay. Welcome back. What I want to do today is talk about one of my favorite algorithms for controlling MDPs, which I think is one of the more elegant, efficient, and powerful algorithms that I know of. So what I'll do is first start by talking about a couple of variations of MDPs that are slightly different from the MDP definition you've seen so far. These are pretty common variations. One is state-action rewards, and the other is finite horizon MDPs. Using this slightly modified definition of an MDP, I'll talk about linear dynamical systems. I'll spend a little bit of time talking about models of linear dynamical systems, and then talk about LQR, or linear quadratic regulation control, which will lead us to a kind of Riccati equation, which is something we will solve in order to do LQR control.

So just to recap - and we've seen this definition many times now.
We've been defining an MDP as a tuple (S, A, {P_sa}, γ, R) of states, actions, state transition probabilities, a discount factor, and a reward function, where gamma, the discount factor, is a number between zero and one, and R, the reward function, was a function mapping from the states to the real numbers. So we had value iteration, which would do this:

V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s')

So after a while, value iteration will cause V to converge to V*. Then, having found the optimal value function, you compute the optimal policy by essentially taking the argmax over a of the equation above:

π*(s) = argmax_a Σ_{s'} P_sa(s') V*(s')

So in value iteration, as you iterate - you know, perform this update - the function V will converge to V*. So with an increasing number of iterations, you get closer and closer to V*. This actually converges exponentially quickly to V*, though we will never exactly reach V* in a finite number of iterations.

So what I want to do now is describe a couple of common variations of MDPs for which we use slightly different definitions.
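To make the update concrete, here is a minimal sketch of this value iteration in code. Everything numeric here - the three-state, two-action MDP, the random transition matrix, the rewards, gamma = 0.9 - is an invented toy for illustration, not something from the lecture.

```python
import numpy as np

# Value iteration for a small discrete MDP with state-only rewards R(s).
# The MDP itself (sizes, transitions, rewards) is made up for illustration.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize rows into distributions

R = np.array([0.0, 1.0, 10.0])             # reward depends only on the state
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
    Q = P @ V                              # Q[a, s] = sum_{s'} P[a, s, s'] V[s']
    V_new = R + gamma * Q.max(axis=0)      # max over actions, pushed past R(s)
    if np.max(np.abs(V_new - V)) < 1e-10:  # converges exponentially quickly
        V = V_new
        break
    V = V_new

# Greedy policy from the (near-)optimal value function:
pi = (P @ V).argmax(axis=0)                # pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s')
print(V, pi)
```

Note how the max sits inside, after the reward term: because R depends only on s, the immediate reward can be pulled out of the maximization, which is exactly the point the next variation changes.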
First, we'll change the reward function, and second, we'll do something slightly different from just discounting. Remember, in the last lecture I said that for infinite-state or continuous-state MDPs, we couldn't apply the most straightforward version of value iteration, because with a continuous-state MDP we need to use some approximation of the optimal value function. Later in this lecture, I'll talk about a special case of MDPs where you can actually represent the value function exactly, even if you have an infinite-state space or even a continuous-state space. I'll do that - talk about this special class of infinite-state MDPs - using this new variation of the reward function and this alternative to discounting, which make the formulation a little easier.

So the first variation I want to talk about is state-action rewards. So I'm going to change the definition of the reward function. It turns out this won't be a huge deal. In particular, I change the reward function to be a function mapping from a state-action pair to the real numbers:

R : S × A → ℝ

What I mean by this is just the following.
You start off in some state s0. You take an action a0, and as a result of your state-action choice, you transition to some new state, s1. You take some action, a1. You transition to some new state, s2. You take some action, a2, and so on. So this is the state-action sequence that you see. So in an MDP where you have state-action rewards, your total payoff is now defined as

R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ...

where your reward function is now a function both of the current state and of the action you took in the current state. So this is my total payoff.

Then, as usual, my goal will be to find a policy - to find a function mapping from states to actions - so that when I execute that policy, I maximize the expected value of my total payoff. Given this definition, it actually turns out that given an MDP with state-action rewards, you can reduce this back to an MDP with rewards that are a function only of the states, by changing the definition of the states. That may or may not be obvious to you. Don't worry if it isn't.
But using state-action rewards allows you to more directly model problems in which different actions have different costs. So a running example is the robot. Say for a robot, it's more costly for the robot to move than for it to stay still. If you give an action to stay still, the action to stay still may have a lower cost, because you're not using battery power and having to recharge for that action. Another example would be - actually, another navigation example would be if you have an outdoor vehicle. You need to drive over some sort of outdoor terrain, like very rough rocks, or driving over grass. That may be more costly, more difficult, than driving over, say, a paved road. So you may assign an action that requires driving over grass or driving over rocks a higher cost than driving over paved road.

So this really isn't a huge change to the definition of an MDP. I won't really bother to justify this a lot, but Bellman's equation generalizes in the way that you probably expect. V*(s) is now equal to

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P_sa(s') V*(s') ]

So previously, when the reward function was just a function of the state s, we could take the max and push it inside.
But now that the reward is a function of the action you're taking as well, the max comes outside. So this says that your expected total payoff, starting from the state s, is equal to, first, your immediate reward R(s, a) for executing some action a in state s, plus gamma times your future expected total payoff. So the term inside the brackets is your expected total payoff if you take the action a from the current state. And since V* is the optimal value function, your optimal expected total payoff is the max over all actions of the thing on the right.

Let's see. Value iteration, which I'm abbreviating VI, is really the same algorithm. V(s) is updated as the max over a of R(s, a) plus the same thing:

V(s) := max_a [ R(s, a) + γ Σ_{s'} P_sa(s') V(s') ]

You just keep updating V(s) using the right-hand side of Bellman's equation, and again, you get value iteration exactly the same way. Then finally, having found the optimal value function V* using the value iteration algorithm, you can then compute the optimal policy π*(s) the same as before. The best action to take in the state s is the action a that maximizes the thing on the right-hand side:

π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} P_sa(s') V*(s') ]
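As code, the only change from the state-reward version of value iteration is where the max sits. This is a sketch on another invented toy MDP; the sizes, random transitions, and random rewards are made up for illustration.

```python
import numpy as np

# Value iteration when the reward is R(s, a) rather than R(s).
# The max over actions now wraps the immediate reward as well.
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

R = rng.random((n_actions, n_states))      # R[a, s]: reward depends on the action too
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # V(s) := max_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V(s') ]
    Q = R + gamma * (P @ V)                # Q[a, s], immediate reward included
    V_new = Q.max(axis=0)                  # the max comes outside, past R(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

# pi*(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V*(s') ]
pi = (R + gamma * (P @ V)).argmax(axis=0)
print(V, pi)
```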
So having used value iteration to compute the optimal value function, you can then find π* using that. So this was the easier of the two variations of MDPs that I'll describe. Any questions? Actually, are there questions about this?

So the other variation, the other alternative definition, will be finite horizon MDPs. So a finite horizon MDP comprises a tuple (S, A, {P_sa}, T, R) of states, actions, state transition probabilities, a parameter T, and the reward function. Here, T is a parameter called the horizon time. Concretely, what this really means is that we'll be taking actions in the MDP only for a total of capital T time steps. So we won't use discounting anymore. [Audio cuts out] You start in some state s0, take action a0, get to some other state s1, take action a1, and so on. Eventually, you get to some state s_T after T time steps. So my total payoff, now, will be given by this sum, from time zero up to time T, of my rewards:

R(s0, a0) + R(s1, a1) + ... + R(s_T, a_T)

Okay?
My goal, as usual - so this is my total payoff. My goal, as usual, is to maximize the expected value of my total payoff. We want to come up with a policy to maximize the expected value of this total payoff. The key difference is that the world only exists for T time steps, and after that, there are no more rewards to be collected.

So this turns out to be more of a difference than you might expect, because it turns out that the optimal policy may be non-stationary. The term stationary means that it doesn't depend on time. Non-stationary means that it may depend on time. So non-stationary roughly means that my optimal action to take will be different for different time steps. That's what non-stationary means. Just as an example of that, imagine that we have some robot. Let's say the robot is here. Let's say that there's a goal over there with a plus-one reward. Much further away, there's a plus-ten reward. So depending on how much time you have left on the clock, it may be better to go after the plus-one or the plus-ten reward.
If it's still early in the game and you still have a lot of time, it may be better to head toward the plus-ten reward and collect a much larger reward. If you only have a couple of time steps left - if the clock has nearly reached the time capital T - then you may not have enough time to get to the plus-ten reward, and you'd be better off heading for the plus-one reward that's much closer by. So what this example illustrates is that when you're in that state, the best action to take could be to go left or to go right, depending on what time it is. So this is just an example illustrating how the optimal policy can be non-stationary.

In fact, since we have non-stationary policies anyway in this setting, what I'm going to do next is allow non-stationary transition probabilities as well. So I'll just write that up there. What I mean is that so far, we've been assuming that the state s_{t+1} is drawn from the state transition probabilities P_{s_t a_t}, indexed by the previous state and the previous action. I've been assuming that these state transition probabilities are the same for all times. So if I'm in some state and take some action, the distribution over the next state doesn't depend on time.
So I'm going to allow a slightly more general definition as well, in which we have non-stationary state transition probabilities, so that the chance of where you end up may also depend on what time it is:

s_{t+1} ~ P^(t)_{s_t a_t}

So as an example of these non-stationary state transition probabilities, one example would be if you model flying an aircraft over a long distance. Then as the aircraft flies, you burn fuel and become lighter, so the dynamics of the aircraft actually change over time. The mass of the aircraft can change significantly over time as you burn fuel. So depending on what time it is, your next state could actually depend not only on your current state and your action, but also on how much fuel you've burned and, therefore, what time it is.

Another example, another aerospace one, is if you have the weather forecast for the next 24 hours, say - you know what the winds and precipitation are going to be like over the next 24 hours. Then again, if you fly the aircraft from, say, here to New York, it may cost different amounts to fly different routes at different times.
Maybe flying over the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or if you do it a few hours from now, when the weather forecast may be really bad.

For an example you see every day, same thing for traffic, right? Depending on where you live - certainly here in California - there are times of day when traffic is really bad in lots of places. So the cost of driving certain roads may vary depending on what time of day it is. Lots of other examples: in industrial automation, different machines in the factory may be available to different degrees at different times of day, and it costs different amounts to hire different workers, depending on whether you pay overtime and so on. So the cost of doing different things in the factory can also be a function of time. The state transition probabilities can also be a function of time.

Lastly, while we're doing this, to make this fully general, we might as well have non-stationary reward functions as well, where you might also index the reward function with a time superscript, R^(t)(s, a), since the cost of doing different things may depend on the time as well.
Actually, those are enough examples of non-stationary MDPs. So now that we have a non-stationary policy, let's talk about an algorithm to actually try to find the optimal policy.

So let me define the following. This is now a slightly modified definition of the optimal value function. I'll just write this down, I guess. So I'm going to define the optimal value function, and this is going to be indexed by t, with a subscript t. The optimal value of a state for time t, we're going to define as your optimal expected sum of rewards if you start the MDP at that state s, and if the clock starts off at time lowercase t:

V*_t(s) = max_π E[ R^(t)(s_t, a_t) + R^(t+1)(s_{t+1}, a_{t+1}) + ... + R^(T)(s_T, a_T) | s_t = s, π ]

So the optimal value of a state will depend on what time it is and how much time you have left to run this MDP. Therefore, the sum on the right sums only over time t, time t plus one, time t plus two, up to time capital T. I'll just state it in English again: this is your expected optimal total payoff if you start your system in a state s, and if the clock is already at time lowercase t.

So it turns out there's a version of value iteration you can use to compute this.
Let me just write out the value iteration algorithm for this. It turns out you can - well, let me just write this out. I'll write this below. It turns out you can compute the optimal value function for the MDP using the following recursion, which is very similar to what we have for value iteration. We're going to set V*_t(s) to be equal to the max over a, same as before:

V*_t(s) = max_a [ R^(t)(s, a) + Σ_{s'} P^(t)_{sa}(s') V*_{t+1}(s') ]

Okay? So if I start the clock at time t in state s, my expected total payoff is equal to the max, over the actions a I may take, of my immediate reward R^(t)(s, a) for taking that action a in that state s, plus my expected future payoff. So if I take action a, I transition with probability P^(t)_{sa}(s') to some new state s'. If I get to the state s', my total expected payoff from the state s' is given by V*, now with subscript t plus one, of s'. The subscript t plus one reflects that after I've taken one action, my clock will have advanced from time t to time t plus one. So this is V* subscript t plus one.
So this expresses V*_t in terms of V*_{t+1}. Then lastly, to start off this recursion, you would have

V*_T(s) = max_a R^(T)(s, a)

If you're already at time capital T, then you just get to take one action, and then the clock runs out. So this is V*_T: your value of starting in some state s with just one time step, with no time left on the clock after that.

So in the case of a finite horizon MDP, this actually gives rise to a very nice dynamic programming algorithm, in which you start off by computing V*_T. Then you use this backward recursion to compute V*_{T-1}, V*_{T-2}, and so on. You recurse backwards until you've computed V* for all of your time steps. Can you see this board? Cool.

Then the final step is - previously, we said that π*(s) - I'm going to come back and change this a bit - was the argmax over a of R(s, a) plus gamma times the sum over s' of P_sa(s') V*(s'). This is sort of what we had.
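Collecting the pieces written on the board, the whole finite-horizon dynamic program can be summarized in one place (no discount factor appears, per this finite-horizon formulation; the superscripts mark the non-stationary transitions and rewards):

```latex
% Finite-horizon dynamic programming: base case, backward recursion, policy.
\begin{align*}
V_T^*(s)   &= \max_a R^{(T)}(s, a) \\
V_t^*(s)   &= \max_a \Big[ R^{(t)}(s, a)
              + \sum_{s'} P_{sa}^{(t)}(s')\, V_{t+1}^*(s') \Big],
              \qquad t = T-1, \dots, 0 \\
\pi_t^*(s) &= \arg\max_a \Big[ R^{(t)}(s, a)
              + \sum_{s'} P_{sa}^{(t)}(s')\, V_{t+1}^*(s') \Big]
\end{align*}
```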
In the finite horizon case, the optimal action may depend on what time it is. So the optimal action to take at time t is the argmax over actions a:

π*_t(s) = argmax_a [ R^(t)(s, a) + Σ_{s'} P^(t)_{sa}(s') V*_{t+1}(s') ]

This is basically the argmax of exactly that same thing on the right-hand side as we had in our dynamic programming algorithm. So you do this for every time step, and now you've computed your optimal policy for the different time steps. Again, this is a non-stationary policy, because π*_t(s) may depend on what time it is.

So the difference between this and the earlier version of value iteration is - what you do is you compute V*_T. Then, using the backward recursion of that dynamic programming algorithm, you compute V*_{T-1}, then V*_{T-2}, and so on, down to V*_0. Then from these, you compute π*_t. So there's not a huge difference, but one minor difference from the infinite horizon discounted case is that by running this recursion once, you now have exactly the right value function. So this just computes the value function, rather than merely converging to it asymptotically. This gives you the right value function in one pass. Cool.
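As a concrete sketch, the whole backward pass might look like this in code. The sizes, the horizon, and the random per-time-step transition and reward arrays are all invented for illustration; having one model per time step is what makes this the non-stationary case.

```python
import numpy as np

# Backward dynamic-programming pass for a finite horizon MDP with
# non-stationary transitions P[t][a, s, s'] and rewards R[t][a, s].
# All sizes and numeric values are made up for illustration.
n_states, n_actions, T = 4, 2, 5
rng = np.random.default_rng(2)

P = [rng.random((n_actions, n_states, n_states)) for _ in range(T + 1)]
for Pt in P:
    Pt /= Pt.sum(axis=2, keepdims=True)    # normalize into distributions
R = [rng.random((n_actions, n_states)) for _ in range(T + 1)]

V = np.empty((T + 1, n_states))
pi = np.empty((T + 1, n_states), dtype=int)

# Base case: at time T you take one action, then the clock runs out.
V[T] = R[T].max(axis=0)
pi[T] = R[T].argmax(axis=0)

# Backward recursion: V*_t from V*_{t+1}, for t = T-1 down to 0.
# Note: no discount factor in this finite-horizon formulation.
for t in range(T - 1, -1, -1):
    Q = R[t] + P[t] @ V[t + 1]             # Q[a, s]
    V[t] = Q.max(axis=0)
    pi[t] = Q.argmax(axis=0)               # the optimal policy is non-stationary

print(V[0], pi[0])
```

One pass of this loop yields the exact optimal value function for every time step, unlike discounted value iteration, which only converges toward it.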
Any questions there? Yeah.

Interviewee: [Inaudible].

Instructor (Andrew Ng): Is this computation much shorter than value iteration? So, sort of yes and no. It depends on how large capital T is.

Interviewee: [Inaudible] the normal MDP, could [inaudible] and then use this case for that case?

Instructor (Andrew Ng): I see. So for the normal MDP, can you assume a capital T and then use this? It actually turns out that - that's a great question. Let me just answer this in a hand-wavy way. So it actually turns out, for a discounted infinite horizon MDP with some discount factor gamma, what you can do is you can say: after T time steps, gamma to the T would be really small - it would be epsilon or something - and I don't really care what happens after that many time steps, because the rewards there are multiplied by gamma to the T; after that, I don't really care. So you can ask: can I take my infinite horizon discounted MDP and approximate it with a finite horizon MDP, where the number of time steps T is chosen so that this holds true? It turns out you can do that. You end up with some value for T; you can solve for T so that this holds true.
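The "solve for T" step can be made precise under one standard assumption (not stated explicitly in the lecture): if rewards are bounded by some R_max, the discounted payoff ignored after step T is at most gamma^T · R_max / (1 − gamma), and you pick the smallest T making that tail at most epsilon. The numbers below are illustrative only.

```python
import math

# Horizon-truncation sketch: choose T so the discounted tail
#   gamma^T * r_max / (1 - gamma)
# is at most epsilon. Assumes rewards bounded in magnitude by r_max.
def horizon_for_tail(gamma: float, r_max: float, epsilon: float) -> int:
    """Smallest integer T with gamma**T * r_max / (1 - gamma) <= epsilon."""
    # gamma^T <= epsilon * (1 - gamma) / r_max   =>   T >= log(bound)/log(gamma)
    bound = epsilon * (1 - gamma) / r_max
    return max(0, math.ceil(math.log(bound) / math.log(gamma)))

# Illustrative values, not from the lecture:
gamma, r_max, epsilon = 0.9, 1.0, 1e-3
T = horizon_for_tail(gamma, r_max, epsilon)
print(T, gamma**T * r_max / (1 - gamma))
```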
It turns out you can prove that if you took the original value iteration algorithm - the version for discounted MDPs - and you ran it for that same number of time steps, you would end up with an approximation to the value function that is about this close, up to some small constant factors. So doing that, you end up with roughly the same amount of computation anyway. And then you actually end up with a non-stationary policy, which is more expensive to keep around: you need to keep around a different policy for every time step, which is not as nice as having a stationary policy, the same policy for all times. So there are other reasons why you might sometimes take an infinite horizon discounted problem and approximate it with a finite horizon problem, but this particular reason is not one of them. Does that make sense? More questions?

Interviewee: [Inaudible]?

Instructor (Andrew Ng): Is there a gamma in this? So if you want, you can actually change the definition of an MDP and use a finite horizon discounted MDP. If you want, you can do that. You can actually come in and put a gamma there, and use discounting in the finite horizon.
It turns out that usually, for most problems people deal with, you either use discounting or you use the finite horizon; it's been less common to do both, but you can certainly do that as well. One of the nice things about discounting is that it makes sure your value function is finite. Algorithmically and mathematically, one of the reasons to use discounting is that because your rewards are multiplied by an exponentially decaying factor — it's a geometrically decaying series — the value function is always finite. That's a really nice mathematical property of discounting. When you have a finite horizon, the value function is also guaranteed to be finite, so you don't have to use discounting. But if you want, you can discount as well.

Interviewee: [Inaudible].

Instructor (Andrew Ng): Yeah, yes, you're right. If you want, you can redefine the reward function to fold the discount into the reward function, since we allow non-stationary rewards as well.

So that was finite horizon MDPs. What I want to do now is actually use both of these ideas — state action rewards and finite horizon MDPs — to describe a special case of MDPs that makes very strong assumptions about the problem.
But these assumptions are reasonable for many systems, and with these assumptions, what we come up with are, I think, very nice and very elegant algorithms for solving even very large MDPs.

So let's talk about linear quadratic regulation. We just talked about dynamic programming for finite horizon MDPs, so remember that algorithm. When I come back to talk about an algorithm for solving LQR problems, I'm actually going to use exactly that dynamic programming algorithm that you just saw for finite horizon MDPs. I'll be using exactly that algorithm again, so just remember it for now.

So let's talk about LQR. I want to take these ideas and apply them to MDPs with continuous state spaces and maybe even continuous action spaces. So to specify an MDP, I need to give you the five-tuple of states, actions, transition probabilities, horizon, and reward; I'm going to use the finite horizon, capital T, rather than discounting. So in LQR problems, I'm going to assume the following. I'm going to assume that the state space is continuous, S = R^n, and I'm going to assume a continuous set of actions lying in R^d.
To specify the state transition probabilities P_{sa}, I need to tell you what the distribution of the next state is, given the current state and the current action. So we actually saw a little bit of this in the last lecture. I'm going to assume the next state, s_{t+1}, is a linear function of the previous state and action, plus noise:

s_{t+1} = A_t s_t + B_t a_t + w_t,

where w_t is Gaussian noise with mean zero and some covariance sigma_W. I've subscripted A and B with t here to show that these matrices could change over time; that would be non-stationary dynamics.

As a point of notation: unfortunately, because we're compiling ideas from multiple literatures, capital A denotes both a set of actions as well as a matrix. When you see A later on, it will usually be used to denote a matrix rather than a set of actions. Sorry to overload notation again, but when you have research ideas from multiple research communities, often they share the same symbols.

So just to be concrete, A_t is a matrix that's n by n.
And the B_t are matrices that are n by d. Just to be completely clear, the matrices A and B, I'm going to assume, are fixed and known in advance. So I'm going to give you the matrices A and B, and I'm going to give you sigma_W; your job is to find a good policy for this MDP. So in other words, this is my specification of the state transition probabilities.

Looking ahead — we'll see this later — it turns out that the treatment of this noise term is not very important. We can pretty much ignore the noise term, and we'll still do fine. This is just a warning: in the sequel, in what I do later, I might be slightly sloppy in my treatment of the noise term; in this very special case, it's unimportant.

The last things I have to specify are some horizon time, capital T, and a reward function. For LQR control, I'm going to assume that the reward function can be written as

R(s_t, a_t) = -(s_t^T U_t s_t + a_t^T V_t a_t),

where U_t is a matrix that's n by n, and V_t is a matrix that's d by d.
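The dynamics just specified, s_{t+1} = A s_t + B a_t + w_t, can be sketched as a small simulation. The particular A, B, and Sigma_w below, and the feedback action used in the loop, are made-up illustrative values, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 1-action system; A, B, Sigma_w are made-up values.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Sigma_w = 0.01 * np.eye(2)

def step(s, a):
    """One step of the dynamics: s' = A s + B a + w, with w ~ N(0, Sigma_w)."""
    w = rng.multivariate_normal(np.zeros(2), Sigma_w)
    return A @ s + B @ a + w

s = np.zeros(2)
for t in range(50):
    s = step(s, np.array([-0.5 * s[1]]))  # an arbitrary feedback action
```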
I'll assume that U_t and V_t are both positive semi-definite — both PSD. The fact that U_t and V_t are both positive semi-definite matrices implies that s_t^T U_t s_t >= 0 and, similarly, a_t^T V_t a_t >= 0. So this implies that your rewards are always negative — this is a somewhat depressing MDP in which there are only costs and no positive rewards — because of the minus sign there.

So as a concrete example of how you might want to apply this: you've seen my helicopter videos, right? So one thing is, for example, suppose you have a helicopter, and you want the state s_t to be as close to zero as possible. Then you might choose U_t to be equal to the identity matrix. This makes R(s_t, a_t) equal to -s_t^T s_t — the negative of the squared norm of the state vector.
So this penalizes the system quadratically for having a state that's far from zero, assuming that zero's the origin state. So if the goal is to make a helicopter hover around the state zero, then you might choose this sort of reward function.

It turns out it's also very common to charge a cost for the action. So suppose I choose V_t to be equal to the identity matrix; I get a -a_t^T a_t term here, which penalizes large actions as well. Including a quadratic cost for actions is also a fairly common thing to do. In practice, this tends to be effective at discouraging your system from jerking the controls around — this discourages making very large control commands. Having a term like this in the reward function often makes many systems behave better. Of course, you can choose different values on the diagonal to give different state variables different weights, and so on. So there are lots of possible choices for U and V; this is one example.

So for the next few steps, I'm going to write out things — I'm going to derive things — for the general case of non-stationary dynamics.
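The quadratic reward with U = I and V = I can be sketched numerically; the state and action values below are arbitrary.

```python
import numpy as np

def reward(s, a, U, V):
    """LQR reward R(s, a) = -(s^T U s + a^T V a); always <= 0 for PSD U, V."""
    return -(s @ U @ s + a @ V @ a)

n, d = 2, 1
U = np.eye(n)  # penalize squared distance of the state from the origin
V = np.eye(d)  # penalize large control commands
r = reward(np.array([3.0, 4.0]), np.array([2.0]), U, V)
print(r)  # -(25 + 4) = -29.0
```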
As I write out more math and more equations for LQR, I'm going to try to write it out for the fairly general case of time-varying dynamics and time-varying reward functions. But for purposes of understanding this material, you might want to just ignore many of the subscripts in terms of t. So for the sake of understanding the material, you might want to mentally assume that there is some fixed matrix A, so that A = A_1 = A_2 = A_3 and so on, and similarly that there's some fixed matrix B. Okay? So I'll write it out for the fully general, non-stationary case, but you might just want to ignore many of the time subscripts and imagine the stationary case for now.

Quite a bit later, we're going to talk about an extension of this called differential dynamic programming that will actually use non-stationary dynamics to very powerful effect in a specific algorithm. But for most of what we're about to do, just pretend the MDP is stationary. Okay.

So before I talk about models — let me just say a couple of words about how you would go about coming up with linear models. The key assumption in this model is that the dynamics are linear.
There's also the assumption that the reward function is quadratic, but let's talk about the assumption that the dynamics are linear: s_{t+1} = A s_t + B a_t — maybe time-varying, maybe stationary; I'm just writing the stationary case for now. So how do you get models like this?

We actually saw one example of this already in the previous lecture. If you have an inverted pendulum system, and you want to model the inverted pendulum using a linear model like this, one thing you could do is run your inverted pendulum: start it off in some state s_0^(1), take some action a_0^(1), have it get to some state s_1^(1), take action a_1^(1) and so on, until you get to some state s_T^(1). The superscript one denotes that this is my first trial. Then you can repeat this a bunch of times — repeat it m times, say — just executing actions on your physical robot. It could be a robot, it could be a chemical plant, it could be whatever: trying out different actions in your system and watching what states it gets to. Then you fit a linear model to your data, and choose the parameters A and B that minimize the quadratic error term:

min over A, B of the sum over trials i and time steps t of || s_{t+1}^(i) - (A s_t^(i) + B a_t^(i)) ||^2.
So this says: how well does A s_t + B a_t predict s_{t+1}? You minimize that quadratic penalty term. This would be one reasonable way to estimate the parameters of a linear dynamical system for a physical robot, or a physical chemical plant, or whatever you may have.

Another way to come up with a linear model, if what you want is control, is to take a nonlinear model and linearize it. Let me show you what I mean by that. So you can linearize a nonlinear model. Let's say you have some nonlinear model that expresses s_{t+1} as some function of s_t and a_t: s_{t+1} = f(s_t, a_t). In the example in the previous lecture, I gave such a model for the inverted pendulum, obtained by referring to the laws of physics — actually, by downloading off-the-shelf software for doing physics simulations. If you haven't seen these before, you can go online and easily find many open-source packages for simulating the physics of simple devices like these. Download the software, type in the specification of your robot, and it will simulate the physics for you. There are lots of open-source software packages like that; you can just download them.
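The least-squares fitting procedure just described can be sketched as follows. The trajectories here are synthetic — generated from known, noiseless dynamics A_true and B_true (made-up values) — so the recovered A and B can be checked against the truth; with real robot data there would be noise and the fit would only be approximate.

```python
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[1.0, 0.1],
                   [0.0, 0.9]])
B_true = np.array([[0.0],
                   [0.1]])

# Collect (s_t, a_t, s_{t+1}) triples from m = 20 noiseless trials.
S, Act, S_next = [], [], []
for _ in range(20):
    s = rng.standard_normal(2)
    for t in range(30):
        a = rng.standard_normal(1)
        s_next = A_true @ s + B_true @ a
        S.append(s); Act.append(a); S_next.append(s_next)
        s = s_next

# Solve min_{A,B} sum || s_{t+1} - (A s_t + B a_t) ||^2 by least squares
# on the stacked regressor [s_t; a_t].
X = np.hstack([np.array(S), np.array(Act)])    # shape (N, n + d)
Y = np.array(S_next)                           # shape (N, n)
Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (n + d, n)
A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]
```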
With something like that, you can now build a physics simulator that predicts the next state as a function of the previous state and the previous action. So you come up with some function f saying that the state vector at the next time step will be some function of the current state and the current action, where the action in this case is just a real number that says how hard you accelerate to the left or right. Then you can take this nonlinear model — I actually wrote down a sample model in the last lecture, but in general f will be some nonlinear function, not a linear function.

So what I mean by linearize is the following. Here's just a cartoon; I'll write down the math in a second. Let's say the horizontal axis is the input state, s_t, and the vertical axis is the output state, s_{t+1}. Here's the function f: the next state, s_{t+1}, will be some function of the previous state, s_t, and the action, a_t. So to linearize this model, what you would do is choose a point — we'll call it s_bar_t — and take the derivative of the function there, fitting a straight line tangent to the function at that point.
So this allows you to approximate the next state, s_{t+1}, as a linear function of the previous state, s_t. To make this cartoon really right, the horizontal axis here is really a state-action pair that you're linearizing around — this is just a cartoon, so the horizontal axis represents both the input state and the input action.

So just to write this out in math, I'll write out the simple case first and the fully general one in a second. Suppose the horizontal axis were only the state — let's pretend there are no actions for now, so s_{t+1} is just some function of s_t. Then the linear function I drew would be

s_{t+1} ≈ f(s_bar_t) + f'(s_bar_t) (s_t - s_bar_t).

So with this, you'd express s_{t+1} as a linear function of s_t. Just note that s_bar_t is a constant — it's not a variable. Does that make sense? s_bar_t is a constant, and f'(s_bar_t) is the gradient of the function f at the point s_bar_t. This is really just the equation of that tangent line.
So you can then convert this to A and B matrices. Jumping back one board, I'm going to point out one other thing. Let's say I look at this straight line, and I ask: how well does this straight line approximate my function f — my original simulator, my original function f? Then you notice that in this neighborhood — the neighborhood of s_bar — it's a pretty good approximation; it's fairly close. But then as you move further away, moving far off to the left here, it becomes a pretty terrible approximation.

So when you linearize a nonlinear model to apply LQR, one of the parameters you have to choose is the point around which to linearize your nonlinear model. So if you expect your inverted pendulum system to spend most of its time in the vicinity of some state, then it'd be reasonable to linearize around that state, because that means the linear approximation would usually be a good approximation for the states where you expect the system to spend most of its time.
If, conversely, you expect the system to spend most of its time at states far to the left, then this would be a terrible location to linearize around. So one rule of thumb is to choose the point to linearize around according to where you expect the system to spend most of its time, so that the linear approximation will tend to be an accurate approximation in the vicinity of the states the system actually visits. To be clear, it is about choosing the point (s_bar, a_bar) that we'll use to come up with a linear function that we'll pretend is a good approximation to my original nonlinear function, f.

So for an example like the inverted pendulum problem: if you expect to do pretty well on this problem, then you would expect the state to often be near the zero state, where s = 0 corresponds to x being the center of the railway track that the inverted pendulum lives on. If you expect to do fairly well, you expect the pole to mostly be upright — at zero degrees, or 90 degrees, I guess, depending on convention. So you choose whatever state corresponds to having the pole upright, with near-zero velocity, in the middle of the track.
So you'd usually choose that as the state to linearize your inverted pendulum dynamics around — that's the region where you might want your approximation to be good.

So, to come back to this formula: I wrote it down for the special case of a one-dimensional state variable with no actions. The general formula for the linearization approximation is

s_{t+1} ≈ f(s_bar_t, a_bar_t) + (∇_s f(s_bar_t, a_bar_t))^T (s_t - s_bar_t) + (∇_a f(s_bar_t, a_bar_t))^T (a_t - a_bar_t),

where these upside-down triangles are the symbol for taking the derivative of f with respect to its vector-valued first and second arguments. So by choosing an appropriate state s_bar_t and action a_bar_t to linearize around, you'd now express s_{t+1} as a linear function of the current state, s_t, and the current action, a_t. Again, s_bar_t is a constant that you choose ahead of time, and the same goes for a_bar_t. Lastly, having linearized this thing, you can then convert it to matrices A and B, so that s_{t+1} is now a linear function of s_t and a_t. Questions about this?
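The general linearization formula can be sketched with a made-up nonlinear f (a stand-in for a physics simulator) and numerical finite-difference derivatives in place of analytic gradients; the dynamics, the linearization point, and the helper name `linearize` are all illustrative assumptions.

```python
import numpy as np

def f(s, a):
    """Hypothetical nonlinear dynamics, standing in for a physics simulator."""
    return np.array([s[0] + 0.1 * s[1],
                     s[1] + 0.1 * np.sin(s[0]) + 0.1 * a[0]])

def linearize(f, s_bar, a_bar, eps=1e-6):
    """First-order expansion of f around (s_bar, a_bar), so that
    f(s, a) ~ f0 + A (s - s_bar) + B (a - a_bar)."""
    n, d = len(s_bar), len(a_bar)
    f0 = f(s_bar, a_bar)
    A, B = np.zeros((n, n)), np.zeros((n, d))
    for i in range(n):
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f0) / eps  # column i of grad wrt s
    for j in range(d):
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f0) / eps  # column j of grad wrt a
    return f0, A, B

f0, A, B = linearize(f, np.zeros(2), np.zeros(1))  # linearize about the origin
```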
So just one tiny detail — and it's really not a huge deal — is that this expression below is technically an affine function: there might actually be an extra constant term there. But this is not a difficult generalization of the linear dynamical system definition. One way to deal with that constant is to take your definition of the state — say (x, x-dot, theta, theta-dot) — and augment your state vector with an extra intercept term, 1. With the intercept 1, and working out the A matrix, you can then take care of the extra constant, c, as well. So you can deal with this thing being — technically — an affine function, because of the extra offset, rather than a linear function. But this is just a little bit of bookkeeping for yourself and shouldn't be a huge deal.

So to summarize what I have up here: you can learn a linear model, or you can take a nonlinear model — your nonlinear model can be a physics model or a nonlinear model you've learned — and linearize it.

Now I'll pose an LQR problem in which we have a specification of the MDP: the states are in R^n, the actions are in R^d, and the state transition probabilities are given by the noisy linear equation
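The intercept trick just described — absorbing the affine offset c into an augmented state [s; 1] — can be sketched as follows, with illustrative A, B, and c values.

```python
import numpy as np

# Affine dynamics s' = A s + B a + c, rewritten as purely linear dynamics
# in the augmented state [s; 1]. A, B, c are made-up illustrative values.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
c = np.array([0.2, -0.1])

A_aug = np.block([[A, c[:, None]],
                  [np.zeros((1, 2)), np.ones((1, 1))]])  # last row keeps the 1 fixed
B_aug = np.vstack([B, np.zeros((1, 1))])

s, a = np.array([1.0, 2.0]), np.array([3.0])
s_next_aug = A_aug @ np.append(s, 1.0) + B_aug @ a  # first n entries = A s + B a + c
```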
s_{t+1} = A_t s_t + B_t a_t + w_t. Our rewards are going to be these quadratic functions. So the specification of the MDP means that we know the A matrices, the B matrices, the U matrices, and the V matrices. Our goal is to come up with a policy to maximize our finite horizon sum of rewards — that is, to come up with a policy that maximizes the expected value of this finite horizon sum of rewards,

E[ R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T) ]. Okay.

So our approach to solving this problem will be exactly that finite horizon dynamic programming algorithm that we worked out a little earlier in this lecture. In particular, my strategy for finding the optimal policy will be to first find V* of capital T, and then apply the recursion to find V* of T minus one, V* of T minus two, and so on.

In the dynamic programming algorithm we worked out, V* subscript T of the state s_T is the maximum over the actions a_T you might take at that time of R(s_T, a_T).
Again, just for the sake of understanding this material, you can probably pretend the rewards and the dynamics are stationary; I'll write out all these time subscripts, but feel free to ignore them if you're seeing this for the first time. So

V*_T(s_T) = max over a_T of -(s_T^T U_T s_T + a_T^T V_T a_T).

I hope this isn't confusing: the superscript T's denote transposes, while the lowercase t's and the capital T's in subscripts denote the time index. So that's just the definition of my quadratic rewards. This is clearly maximized as -s_T^T U_T s_T, because that last term, a_T^T V_T a_T, is greater than or equal to zero — that comes from my assumption that V_T is positive semi-definite. So the best action to take in the last time step is just the action zero:

pi*_T(s_T) = arg max over a_T of that same expression = 0.

By choosing the zero action, a_T^T V_T a_T becomes zero, and that's how this reward is maximized. Any questions, or is something illegible? Okay.

So now let's do the dynamic programming step, where my goal is: given V*_{t+1}, I want to compute V*_t.
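The terminal-step argument — that with V_T positive semi-definite the reward is maximized by the zero action — can be checked numerically; the cost matrices and the test state below are arbitrary illustrative choices.

```python
import numpy as np

U_T = np.diag([1.0, 2.0])  # illustrative PSD state-cost matrix
V_T = np.diag([0.5])       # illustrative PSD action-cost matrix

def R(s, a):
    """Terminal reward R(s, a) = -(s^T U_T s + a^T V_T a)."""
    return -(s @ U_T @ s + a @ V_T @ a)

s = np.array([1.0, -2.0])
v_star = R(s, np.zeros(1))  # value of the zero action: -s^T U_T s
print(v_star)  # -9.0; any nonzero action can only lower the reward
```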
Given V*_{t+1}, I want to compute V*_t. So this is the dynamic programming step. The DP step I wrote down previously was this: for the finite state case, I wrote down V*_t(s_t) = max over a_t of R(s_t, a_t) plus the sum over s_{t+1} of P_{s_t a_t}(s_{t+1}) times V*_{t+1}(s_{t+1}). So this is exactly the equation I wrote down previously, and this is what I wrote down for finite states, where you have these discrete state transition probabilities and we can sum over this discrete set of states. Now we're in a continuous, infinite state setting again, so this sum over states should actually become an integral. I'm going to skip the integral step; we'll just go ahead and write this last term as an expectation. So this is going to be max over actions a_t of R(s_t, a_t) plus an expectation over the random next state s_{t+1}, drawn from the state transition probabilities given by P_{s_t a_t}, of V*_{t+1}(s_{t+1}). So this is the same equation written down as an expectation. What I need to do is, given a representation of V*_{t+1}, find V*_t. So it turns out that LQR has the following useful property.
It turns out that each of these value functions can be represented as a quadratic function. So concretely, let's suppose that V*_{t+1} can be expressed as a quadratic function, written like so: V*_{t+1}(s_{t+1}) = s_{t+1}^T Phi_{t+1} s_{t+1} + Psi_{t+1}, where the matrix Phi_{t+1} is an n by n matrix, and Psi_{t+1} is just a real number. So in other words, suppose V*_{t+1} is just a quadratic function of the state s_{t+1}. We can then show that when you do one dynamic programming step - when you plug this definition of V*_{t+1} into the dynamic programming equation I had just now - you can show that V*_t will also be a quadratic function of the same form, V*_t(s_t) = s_t^T Phi_t s_t + Psi_t, for some appropriate matrix Phi_t and some appropriate real number Psi_t. Does that make sense? So what you can do is start off the recursion as follows.
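In code, such a quadratic value function is fully described by just one matrix and one real number. A sketch, with illustrative values for Phi and Psi (negative, matching the convention that rewards are costs with a minus sign):

```python
import numpy as np

# A quadratic value function V*(s) = s^T Phi s + Psi is fully described
# by one n x n matrix and one scalar (these particular values are made up).
Phi = np.array([[-2.0, 0.0],
                [0.0, -1.0]])
Psi = -0.5

def value(s):
    return float(s @ Phi @ s + Psi)
```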
So previously, we worked out that V*_T(s_T) is minus s_T transpose U_T s_T. So we have that Phi of capital T is equal to minus U_T, and Psi of capital T is equal to zero; then V*_T(s_T) is equal to s_T^T Phi_T s_T + Psi_T. So you can start out the recursion this way, with Phi_T = -U_T and Psi_T = 0, and then work out what the recursion is. I won't actually do the full derivation - it's mainly algebra, and you've actually done this sort of Gaussian expectation math a lot in your homework by now - so I'll just outline the key step. So in the dynamic programming step, V*_t(s_t) is equal to the max over actions a_t of the immediate reward - so this was R(s_t, a_t) from my equation in the dynamic programming step - plus an expected value over the random next state s_{t+1}, drawn from the Gaussian distribution with mean A_t s_t + B_t a_t and covariance Sigma_w.
So what this is - this is really my specification for P_{s_t a_t}. This is my state transition distribution in the LQR setting: if I take action a_t in the state s_t, then my next state is distributed Gaussian with mean A_t s_t + B_t a_t and covariance Sigma_w. And then we take the expected value of V star of this next state - this, of course, is just V*_{t+1}(s_{t+1}). I hope this makes sense. This is just taking that equation I had previously in the dynamic programming step - V*_t(s_t) equals the max over actions of the immediate reward plus an expected value, over the next state, of V star of the next state with the clock advanced by one - and plugging in the definitions of the reward, of the state transition distribution, and of the value function. Actually, could you raise your hand if this makes sense? Cool. So if you write this out and you expand the expectation - I know you've done this many times, so I won't do it - this whole thing on the right-hand side simplifies to a big quadratic function of the action a_t.
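The Gaussian expectation being expanded here has a simple closed form: if s' ~ N(mu, Sigma) and V(s') = s'^T Phi s' + Psi, then E[V(s')] = mu^T Phi mu + tr(Phi Sigma) + Psi. Here is a small Monte Carlo sanity check of that identity, with made-up matrices (this is a numerical check of the algebra, not part of the algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Sigma_w = 0.05 * np.eye(2)
Phi = np.array([[-2.0, 0.0],
                [0.0, -1.0]])
Psi = 0.0

s, a = np.array([1.0, 0.0]), np.array([0.5])
mu = A @ s + B @ a                      # mean of the next state

# Closed form: E[s'^T Phi s' + Psi] = mu^T Phi mu + tr(Phi Sigma_w) + Psi.
exact = float(mu @ Phi @ mu + np.trace(Phi @ Sigma_w) + Psi)

# Monte Carlo estimate of the same expectation.
samples = rng.multivariate_normal(mu, Sigma_w, size=200_000)
mc = float(np.mean(np.einsum('ij,jk,ik->i', samples, Phi, samples)) + Psi)
assert abs(exact - mc) < 0.02
```

This is the step that makes the right-hand side of the DP equation an explicit quadratic in a_t, since mu = A_t s_t + B_t a_t is itself linear in a_t.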
So this whole thing simplifies to a big quadratic function of the action a_t, and we want to maximize it with respect to a_t. To maximize a big quadratic function, you just take the derivative of the function with respect to the action a_t, set the derivative equal to zero, and then you've maximized the right-hand side with respect to the action a_t. It turns out - I'm just going to write this expression down for completeness; you can derive it yourself at any time - if you actually maximize that thing on the right-hand side as a function of the actions a_t, you find that the optimal a_t is going to be (V_t - B_t^T Phi_{t+1} B_t)^{-1} B_t^T Phi_{t+1} A_t times s_t. Don't worry about this expression; you can go to the lecture notes and derive it yourself. But the key thing to note is that the optimal action a_t is going to be some big matrix, which we're going to call L_t, times s_t. In other words, the optimal action to take in a given state is going to be some linear function of the state s_t.
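The "set the derivative to zero" step is the ordinary maximization of a quadratic with a negative definite Hessian. A self-contained sketch, with arbitrary illustrative K and g (not derived from any particular MDP):

```python
import numpy as np

# Maximizing a quadratic in the action: f(a) = a^T K a + 2 g^T a, with K
# negative definite, by setting the gradient 2 K a + 2 g to zero.
K = np.array([[-2.0, 0.0],
              [0.0, -1.0]])
g = np.array([2.0, 1.0])

a_star = -np.linalg.solve(K, g)         # stationary point: a = -K^{-1} g

def f(a):
    return float(a @ K @ a + 2 * g @ a)

# Because K is negative definite, the stationary point is the maximum:
# nudging a_star in any coordinate direction can only decrease f.
for d in np.eye(2):
    assert f(a_star + 0.1 * d) < f(a_star)
    assert f(a_star - 0.1 * d) < f(a_star)
```

In the LQR derivation, g turns out to be linear in s_t, which is exactly why the maximizer a_star is a linear function of the state.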
So having done dynamic programming - you remember also, when we worked out the dynamic programming algorithm for finite horizon MDPs, we said that the way you compute the optimal policy, pi*_t(s_t), is always the arg max of the same thing: the arg max over actions a_t of R(s_t, a_t) plus the expected value, over s_{t+1} drawn from P_{s_t a_t}, of V*_{t+1}(s_{t+1}). This thing on the right-hand side is always the same thing as the thing we just maximized. So what this means is that, when I solved for the value of a_t that maximizes this, what this means is that the optimal action to take from the state s_t is actually equal to L_t times s_t. What we showed is that when you're in some state s_t, the optimal action for that state is going to be some matrix L_t, which we can compute, times the state s_t. In other words, the optimal action is actually a linear function of the state. I'm just going to point out, this is not function approximation here, right? What we did not do - we did not say, let's find the optimal linear policy.
We didn't say, let's look at the optimal policy and then fit a straight line to it. This is not about approximating the optimal policy with a straight line. This derivation is saying that the optimal policy is a straight line - the optimal action is a linear function of the current state. Moreover, you've now worked out this value of a_t that maximizes the thing on the right-hand side. So you take this a_t and plug it back in to do the maximization in the dynamic programming recursion. What you find is that V*_t(s_t) will indeed be a quadratic function of the following form - and I just write out the equations for the sake of completeness; don't worry too much about their forms, and you can derive this yourself: Phi_t = A_t^T (Phi_{t+1} - Phi_{t+1} B_t (B_t^T Phi_{t+1} B_t - V_t)^{-1} B_t^T Phi_{t+1}) A_t - U_t, and Psi_t = tr(Sigma_w Phi_{t+1}) + Psi_{t+1}. So just to summarize - don't worry too much about the forms of these equations - what we've done is written down the recursion that expresses Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}.
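One step of this recursion can be sanity-checked numerically. Here is the scalar (1-D) specialization, with illustrative numbers, compared against a brute-force maximization over a fine grid of actions; the algebraic form of the update is my own reconstruction under the lecture's sign conventions, so check it against the course notes:

```python
import numpy as np

# One step of the recursion in a 1-D problem (all numbers illustrative),
# checked against brute-force maximization over a grid of actions.
# Conventions: V*_{t+1}(s) = Phi1*s^2 + Psi1, rewards -U*s^2 - V*a^2,
# next state ~ N(A*s + B*a, sigma_w2).
A, B, U, V = 1.0, 1.0, 1.0, 1.0
sigma_w2 = 0.2           # variance of the noise w_t
Phi1, Psi1 = -1.0, 0.0   # quadratic coefficients of V*_{t+1}

# Candidate closed-form update (scalar case of the matrix recursion).
K = B * Phi1 * B - V
Phi0 = A * (Phi1 - Phi1 * B * (1 / K) * B * Phi1) * A - U
Psi0 = sigma_w2 * Phi1 + Psi1

# Brute force: V*_t(s) = max_a [ -U s^2 - V a^2
#                                + Phi1 ((A s + B a)^2 + sigma_w2) + Psi1 ],
# using E[(m + w)^2] = m^2 + sigma_w2.
def v0_brute(s):
    actions = np.linspace(-5.0, 5.0, 100001)
    m = A * s + B * actions
    return np.max(-U * s**2 - V * actions**2
                  + Phi1 * (m**2 + sigma_w2) + Psi1)

for s in [0.0, 0.7, -1.3]:
    assert abs(v0_brute(s) - (Phi0 * s**2 + Psi0)) < 1e-6
```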
So this allows you to compute the optimal value function for when the clock is at time lowercase t, as a function of the optimal value function for when the clock is at time t plus one. Actually, just to give this equation a name as well: this recursion, in terms of the Phi_t's, is called the discrete time Riccati equation. It's the recursion that gives you Phi_t in terms of Phi_{t+1}. So to summarize, our algorithm for finding the exact solution to finite horizon LQR problems is as follows. We initialize Phi_T to be equal to minus U_T and Psi_T to be equal to zero. Then, recursively, calculate Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}, as I showed, using the discrete time Riccati equation.
So you do this for t equals T minus one, T minus two and so on, down to time zero. Then you compute L_t as a function of - actually, is it Phi_t or Phi_{t+1}? Phi_{t+1}, I think - as a function of Phi_{t+1} and Psi_{t+1}. This is actually a function of only Phi_{t+1}; you don't really need Psi_{t+1}. Now you have your optimal policy: having computed the L_t's, you now have the optimal action to take in the state s_t, just given by the linear equation a_t = L_t s_t. How much time do I have left? Okay. Let me just say one last thing about this before I close. Maybe I'll do it next week - I think I'll do it next session instead. It actually turns out there's one cool property about this that's kind of subtle, but you'll find it out in the next lecture. Are there questions about this before we close for today, then? So the very cool thing about the solution of discrete time, finite horizon LQR problems is that this is a problem with an infinite, continuous state space. But nonetheless, under the assumptions we made, you can prove that the value function is a quadratic function of the state.
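Putting the whole backward pass together, here is a NumPy sketch of the algorithm just summarized. For simplicity it assumes time-invariant A, B, U, V; the exact algebraic form of the gain and of the Riccati update is a reconstruction under the lecture's sign conventions (rewards -s^T U s - a^T V a, Phi negative semi-definite), so verify it against the lecture notes before relying on it:

```python
import numpy as np

def lqr_backward(A, B, U, V, Sigma_w, T):
    """Backward pass for finite-horizon LQR with time-invariant A, B, U, V.

    Returns the gains L_t (optimal action a_t = L_t s_t) and the
    coefficients (Phi_0, Psi_0) of V*_0(s) = s^T Phi_0 s + Psi_0.
    """
    Phi = -U.astype(float)          # initialization: Phi_T = -U, Psi_T = 0
    Psi = 0.0
    gains = [None] * T
    for t in range(T - 1, -1, -1):
        # At this point Phi, Psi hold Phi_{t+1}, Psi_{t+1}.
        M = np.linalg.solve(B.T @ Phi @ B - V, B.T @ Phi)
        gains[t] = -M @ A                         # L_t depends on Phi_{t+1} only
        Psi = Psi + np.trace(Sigma_w @ Phi)       # Psi_t update
        Phi = A.T @ (Phi - Phi @ B @ M) @ A - U   # discrete-time Riccati step
    return gains, Phi, Psi

# Tiny 1-D sanity check (T = 1, no noise): maximizing
# -s^2 - a^2 - (s + a)^2 by hand gives a = -s/2 and value -1.5 s^2.
L, Phi0, Psi0 = lqr_backward(np.array([[1.0]]), np.array([[1.0]]),
                             np.array([[1.0]]), np.array([[1.0]]),
                             np.zeros((1, 1)), T=1)
assert abs(L[0][0, 0] + 0.5) < 1e-12
assert abs(Phi0[0, 0] + 1.5) < 1e-12
```

For the time-varying case of the lecture, you would index A_t, B_t, U_t, V_t inside the loop; note that, as remarked above, each gain L_t is computed from Phi_{t+1} alone, before Phi is advanced.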
Therefore, just by computing these matrices Phi_t and the real numbers Psi_t, you can actually exactly represent the value function, even for these infinitely large, continuous state spaces. So the computation of these algorithms scales only as a polynomial in the number of state variables, whereas with discretization we ran into the curse of dimensionality and had algorithms that scale exponentially with the dimension of the problem. LQR scales only like the cube of the dimension of the problem, so this easily applies to problems with even very large state spaces. We actually often apply variations of this algorithm to some of the things we do on our helicopter, which has high dimensional state spaces, with twelve or higher dimensions, and this has worked very well. So it turns out there are even more things you can do with this, and I'll continue with that in the next lecture. So let's close for today, then.