This presentation is delivered by the Stanford Center for Professional Development.

Okay. Welcome back. What I want to do today is talk about one of my favorite algorithms for controlling MDPs, one of the more elegant and efficient and powerful algorithms that I know of. So what I'll do is first start by talking about a couple of variations of MDPs that are slightly different from the MDP definition you've seen so far. These are pretty common variations. One is state action rewards, and the other is finite horizon MDPs. Using this slightly modified definition of an MDP, I'll talk about linear dynamical systems. I'll spend a little bit of time talking about models of linear dynamical systems, and then talk about LQR, or linear quadratic regulation control, which will lead us to something called the Riccati equation, which is what we solve in order to do LQR control.

So just to recap - we've seen this definition many times now - we've been defining an MDP as a tuple of states, actions, state transition probabilities, a discount factor and a reward function, where gamma, the discount factor, is a number between zero and one, and R, the reward function, was a function mapping from the states to the real numbers. So we had value iteration, which would do this update. After a while, value iteration will cause V to converge to V star. Then, having found the optimal value function, you compute the optimal policy by taking essentially the argmax over a of the right-hand side of the equation above.

So in value iteration, as you perform this update, the function V will converge to V star. With any finite number of iterations you only get closer and closer to V star; this actually converges exponentially quickly to V star, but you never exactly reach V star in a finite number of iterations.
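(A sketch of the value iteration update and policy extraction being recapped here, in the notation of earlier lectures; the board work itself isn't transcribed:)

$$V(s) := R(s) + \gamma \max_{a} \sum_{s'} P_{sa}(s')\, V(s'), \qquad \pi^*(s) = \arg\max_{a} \sum_{s'} P_{sa}(s')\, V^*(s').$$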
So what I want to do now is describe a couple of common variations of MDPs that use slightly different definitions: first of the reward function, and then second, something slightly different from discounting.

Remember that in the last lecture, I said that for infinite-state or continuous-state MDPs, we couldn't apply the most straightforward version of value iteration: if you have a continuous-state MDP, we need to use some approximation of the optimal value function. In contrast, later in this lecture, I'll talk about a special case of MDPs where you can actually represent the value function exactly, even if you have an infinite state space or even a continuous state space. I'll do that - talk about this special class of infinite-state MDPs - using this new variation of the reward function and the alternative to discounting, since they make the formulation a little easier.

So the first variation I want to talk about is state action rewards. I'm going to change the definition of the reward function; it turns out this won't be a huge deal. In particular, I change the reward function to be a function mapping from a state-action pair to the real numbers. What I mean by this is just the following. You start off in some state S0. You take an action A0, and as a result of your state-action choice, you transition to some new state, S1. You take some action, A1, you transition to some new state, S2, you take some action, A2, and so on. So this is the state-action sequence that you see.

So in an MDP where you have state action rewards, your total payoff is now defined as R(S0, A0) plus gamma times R(S1, A1) plus gamma squared times R(S2, A2) and so on, where your reward function is now a function both of the current state and of the action you took in the current state. So this is my total payoff. Then as usual, my goal will be to find a policy - a function mapping from the states to the actions - so that when I execute that policy, I maximize the expected value of my total payoff.

It actually turns out that, given an MDP with state action rewards, by playing with the definition of the states you can reduce this back to an MDP with rewards that are a function only of the states. That may or may not be obvious to you; don't worry if it isn't. But using state action rewards allows you to more directly model problems in which different actions have different costs.
A running example is a robot: it's more costly for the robot to move than for it to stay still. If you give it an action to stay still, the action to stay still may have a lower cost because you're not using battery power for that action. Another example - actually, another navigation example - would be if you have an outdoor vehicle that needs to drive over some sort of outdoor terrain, like very rough rocks or grass. That may be more costly, more difficult, than driving over, say, a paved road. So you may assign an action that requires driving over grass or driving over rocks a higher cost than driving over paved road.

So this really isn't a huge change to the definition of an MDP. I won't really bother to justify this a lot, but Bellman's equations generalize in the way that you'd probably expect: V star of S is now equal to the max over a of [ R(s, a) plus gamma times the sum over s prime of P_sa(s') times V star of s prime ]. So previously, when the reward function was just a function of the state S, we could take the max and push it inside. But now that the reward is a function of the action you're taking as well, the max comes outside. So this says that your expected total payoff, starting from the state, under the optimal policy, is equal to first your immediate reward, R(s, a), for executing some action a in state s, plus gamma times your future expected total payoff. So this is your expected total payoff if you take the action a from the current state, and since this is the optimal value function, your optimal expected total payoff is the max over all actions of the thing on the right.

Let's see. Value iteration, which I'm abbreviating VI, is really the same algorithm: V of S is updated as the max over a of R(s, a) plus the same thing. You just keep updating V of S using the right-hand side of Bellman's equations, and again you get value iteration in exactly the same way. Then finally, having found the optimal value function, V star, using the value iteration algorithm, you can then compute the optimal policy, pi star of S, the same as before: the best action to take in the state S is the action a that maximizes the thing on the right-hand side. So having used value iteration to compute the optimal value function, you can then find pi star using that.
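As a concrete illustration - not from the lecture itself - here is a minimal sketch of value iteration with state-action rewards for a small discrete MDP; the transition matrix, reward table, and discount factor are made-up placeholders:

```python
import numpy as np

# Hypothetical toy MDP with state-action rewards.
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)

# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[s, a] = state-action reward, as in the modified definition above.
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: V(s) := max_a [ R(s, a) + gamma * sum_s' P_sa(s') V(s') ]
    Q = R + gamma * np.einsum('asj,j->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # converges only in the limit; stop when close
        V = V_new
        break
    V = V_new

# Greedy policy: pi*(s) = argmax_a [ R(s, a) + gamma * E[V*(s')] ]
Q = R + gamma * np.einsum('asj,j->sa', P, V)
pi = Q.argmax(axis=1)
print(V, pi)
```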
So this was the easier of the two variations of MDPs that I want to describe. Any questions? Actually, are there questions about this?

So the other variation, the other alternative definition, will be finite horizon MDPs. A finite horizon MDP comprises a tuple of states, actions, state transition probabilities, a parameter T, and the reward function. Here, T is a parameter called the horizon time. Concretely, what this really means is that we'll be taking actions in the MDP only for a total of capital T time steps; we won't use discounting anymore. So you start in some state, take action A0, get to some other state S1, take action A1 and so on. Eventually, you get to some state S capital T after T time steps. So my total payoff now will be given by this sum from time zero up to time capital T of my rewards: R(S0, A0) plus R(S1, A1) and so on up to R(S_T, A_T). Okay? So this is my total payoff. My goal, as usual, is to maximize the expected value of my total payoff; we want to come up with a policy to maximize the expected value of this total payoff. The key difference is that the world only exists for capital T time steps, and after that, there are no more rewards to be collected.

So this turns out to be quite a difference, because it turns out that the optimal policy may be non-stationary. The term stationary means that it doesn't depend on time; non-stationary means that it may depend on time. So non-stationary roughly means that my optimal action to take will be different for different time steps. That's what non-stationary means.
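(A compact restatement of the finite-horizon setup just described, in the course's usual notation:)

$$\text{MDP} = (S, A, \{P_{sa}\}, T, R), \qquad \text{total payoff} = \sum_{t=0}^{T} R(s_t, a_t),$$

with no discount factor, and with the optimal policy allowed to be non-stationary: $\pi^*_t : S \to A$ for $t = 0, \dots, T$.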
Just as an example of that, imagine that we have some robot. Let's say the robot is here, and let's say there's a plus-one reward over there, and much further away, there's a plus-ten reward. So depending on how much time you have left on the clock, it may be better to go after the plus-one or the plus-ten reward. If it's still early in the game and you still have a lot of time, it may be better to head toward the plus-ten reward and collect a much larger reward. If you only have a couple of time steps left - if the clock has nearly reached the time capital T - then you may not have enough time to get to the plus-ten reward, and you'd be better off heading for the plus-one reward that's much closer by. So what this example illustrates is that when you're in that state, the best action to take could be to go left or to go right, depending on what time it is. So that's just an example illustrating how the optimal policy can be non-stationary.

In fact, since we have non-stationary policies anyway in the sequel, what I'm going to do next is allow non-stationary transition probabilities as well. So I'll just write that up there. What I mean is that so far, I've been assuming that the state S_t plus one is drawn from the state transition probabilities indexed by the previous state and the previous action, and I've been assuming that these state transition probabilities are the same for all times. So if I'm in some state and take some action, the distribution over the next state doesn't depend on time. I'm going to allow a slightly more general definition as well, in which we have non-stationary state transition probabilities, so that the chance of where you end up may also depend on what time it is.

As examples of these non-stationary state transition probabilities: one example would be if you model flying an aircraft over a long distance. As the aircraft flies, you burn fuel and become lighter, so the dynamics of the aircraft actually change over time - the mass of the aircraft can change significantly as you burn fuel. So depending on what time it is, your next state could actually depend not only on your current state and your action, but also on how much fuel you've burned and therefore what time it is. Another example, another aerospace one: if you have the weather forecast for the next 24 hours, say - you know what the winds and precipitation are going to be like over the next 24 hours - then if you fly the aircraft from, say, here to New York, it may cost different amounts to fly different routes at different times. Flying over the Rockies may cost different amounts depending on whether you do it now, when there's really great weather, or a few hours from now, when the weather forecast may be really bad.
For an example you see every day: the same thing holds for traffic, right? Depending on where you live - certainly here in California - there are times of day where traffic is really bad in lots of places, so the cost of driving certain roads may vary depending on what time of day it is. Lots of other examples: in industrial automation, different machines in the factory may be available to different degrees at different times of day, and it costs different amounts to hire different workers, depending on whether you're paying overtime and so on. So the cost of doing different things in the factory can also be a function of time, and the state transition probabilities can also be a function of time.

Lastly, while we're doing this, to make it fully general, we might as well have non-stationary reward functions as well, where you also index the reward function with a time subscript, so that the cost of doing different things may depend on the time as well.

There are more examples of non-stationary MDPs, but now that we have non-stationary policies, let's talk about an algorithm to actually try to find the optimal policy. So let me define the following; this is now a slightly modified definition of the optimal value function. I'll just write this down. I'm going to define the optimal value function, indexed by t, with a subscript t. The optimal value of a state for time t, we're going to define as your optimal sum of rewards if you start the MDP at that state S and if the clock starts off at time lowercase t. So the optimal value of a state will depend on what time it is and how much time you have left to run this MDP. Therefore, the sum on the right sums only over times t, t plus one, t plus two, up to time capital T. To state it in English again: this is your expected optimal total payoff if you start your system in a state S and if the clock is already at time lowercase t.
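(A sketch of the definition just stated, in symbols; the expectation is over the state sequence under the optimal policy:)

$$V^*_t(s) \;=\; E\!\left[\, R_t(s_t, a_t) + R_{t+1}(s_{t+1}, a_{t+1}) + \cdots + R_T(s_T, a_T) \;\middle|\; s_t = s,\ \pi^* \right].$$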
So it turns out there's a recursion you can use to compute this. Let me just write out the dynamic programming algorithm for it. It turns out you can compute the optimal value function for the MDP using the following recursion, which is very similar to what we had for value iteration. We're going to set V star t of S to be equal to the max over a of the immediate reward plus the expected future payoff, the same as before, right? Okay?

So if I start the clock at time t in state S, my expected total payoff is equal to the maximum over the actions I may take of my immediate reward for taking that action a in that state S, plus my expected future payoff. If I take action a, I would transition with probability P subscript sa of s prime to some new state s prime, and if I get to the state s prime, my total expected payoff from the state s prime would be V star, now with subscript t plus one, of s prime. The subscript t plus one reflects that after I've taken one action, my clock will have advanced from time t to time t plus one. So this is now V star subscript t plus one, and this expresses V star of t in terms of V star of t plus one.

Then lastly, to start off this recursion, you would have V star capital T, which is just equal to the max over a of the immediate reward. If you're already at time capital T, then you just get to take one action, and then the clock runs out. So this is V star capital T: your value of starting in some state S with just one time step left, with no time left on the clock after that.

So in the case of a finite horizon MDP, this actually gives rise to a very nice dynamic programming algorithm in which you start off by computing V star of capital T. Then you use this backward recursion to compute V star of capital T minus one, capital T minus two and so on - you compute V star of T, then T minus one, and so on, recursing backwards until you've computed V star for all of your time steps. Can you see this board? Cool.
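(A sketch of the backward recursion on the board, written out; the time superscript on $P^{(t)}_{sa}$ allows the non-stationary transition probabilities just discussed:)

$$V^*_T(s) \;=\; \max_{a}\; R_T(s, a),$$
$$V^*_t(s) \;=\; \max_{a}\;\Big[\, R_t(s, a) \;+\; \sum_{s'} P^{(t)}_{sa}(s')\, V^*_{t+1}(s') \,\Big], \qquad t = T-1, \dots, 0.$$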
Then the final step: previously, we said that pi star of S - I'm going to come back and change this a bit - was the argmax over a of R(s, a) plus gamma times the sum over s prime of P_sa(s') times V star of s prime; that's sort of what we had. In the finite horizon case, the optimal action may depend on what time it is. So the optimal action to take at time t, pi star subscript t of S, is the argmax over actions a of basically exactly that same thing on the right-hand side as we had in our dynamic programming algorithm. So you do this for every time step, and you've now computed your optimal policy for the different time steps. Again, this is a non-stationary policy, because pi star subscript t of S may depend on what time it is.

So the difference between this and the earlier version of value iteration is: what you do is you compute V star of capital T, then using the backwards recursion of that dynamic programming algorithm, you compute V star of T minus one, then V star of T minus two, and so on down to V star of zero; then from these you compute pi star. There's not a huge difference, but one minor difference from the infinite horizon discounted case is that by running this recursion once, you now have exactly the right value function. This just computes the value function, rather than merely converging to it in the limit; this gives you the right value function in one pass. Cool.
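As an illustration (again not from the lecture), a minimal sketch of this backward pass for a small discrete finite-horizon MDP with made-up transition and reward arrays; it returns the whole non-stationary policy in a single sweep:

```python
import numpy as np

# Hypothetical finite-horizon MDP: horizon T, stationary P and R for simplicity.
n_states, n_actions, T = 4, 2, 10
rng = np.random.default_rng(1)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # P[a, s, s'] = transition probability
R = rng.random((n_states, n_actions))      # R[s, a] = state-action reward

V = np.empty((T + 1, n_states))            # V[t, s] = V*_t(s)
pi = np.empty((T + 1, n_states), dtype=int)

# Base case: at the last step, just take the best immediate reward.
V[T] = R.max(axis=1)
pi[T] = R.argmax(axis=1)

# Backward recursion: V*_t(s) = max_a [ R(s, a) + sum_s' P_sa(s') V*_{t+1}(s') ]
for t in range(T - 1, -1, -1):
    Q = R + np.einsum('asj,j->sa', P, V[t + 1])
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)               # the optimal action may differ across t

print(V[0], pi[0])
```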
Any questions there? Yeah. Student: [Inaudible] Instructor (Andrew Ng): Is this computation much shorter than value iteration? Sort of yes and no - it depends on how large capital T is. Student: [Inaudible] for the normal MDP, could you pick a capital T and then use this case for that case? Instructor: I see - so for the normal MDP, can you assume some capital T and then use this? It actually turns out that - that's a great question. Let me just answer it in a hand-wavy way. It turns out that for a discounted infinite horizon MDP with some discount factor gamma, what you can do is say: after T time steps, gamma to the T would be really small, and I don't really care what happens after that many time steps because the rewards are multiplied by gamma to the T. So you can ask: can I take my infinite horizon discounted MDP and approximate it with a finite horizon MDP, where the number of time steps T is chosen so that gamma to the T is small enough? It turns out you can do that; you end up with some value for T, and you can solve for the T that makes this hold true.

It also turns out you can prove that if you took the original value iteration algorithm - the version for discounted MDPs - and ran it for this same number of time steps, you would end up with an approximation to the value function that is about this close, up to some small constant factors. So by doing that, you end up with roughly the same amount of computation anyway, and you then end up with a non-stationary policy, which is more expensive to keep around - you need to keep a different policy for every time step, which is not as nice as having a stationary policy, the same policy for all times. So there are other reasons why you might sometimes take an infinite horizon discounted problem and approximate it by a finite horizon problem, but this particular reason is not one of them. That makes sense. More questions?

Student: [Inaudible]? Is there a gamma in this? Instructor: So if you want, you can actually change the definition of an MDP and use a finite horizon discounted MDP; if you want, you can do that. You can put a gamma there and use discounting with the finite horizon. It turns out that usually, for most problems that people deal with, you either use discounting or you use the finite horizon; it's been less common to do both, but you can certainly do that as well. One of the nice things about discounting is that it makes sure your value function is finite. Algorithmically and mathematically, one of the reasons to use discounting is that you're multiplying your rewards by a geometrically decaying series, which ensures that the value function is always finite; that's a really nice mathematical property of discounting. When you have a finite horizon, the value function is also guaranteed to be finite, so with that you don't have to use discounting. But if you want, you can actually discount as well. Student: [Inaudible]. Instructor: Yeah, yes, you're right. If you want, you can fold the discount into the reward function, since we have non-stationary rewards as well.

So that was finite horizon MDPs. What I want to do now is actually use both of these ideas - state action rewards and finite horizon MDPs - to describe a special case of MDPs that makes very strong assumptions about the problem. But these assumptions are reasonable for many systems.
With these assumptions, what we come up with are, I think, very nice and very elegant algorithms for solving even very large MDPs. So let's talk about linear quadratic regulation.

We just talked about dynamic programming for finite horizon MDPs, so just remember that algorithm. When I come back to talk about an algorithm for solving LQR problems, I'm actually going to use exactly that dynamic programming algorithm that you just saw for finite horizon MDPs - I'll be using exactly that algorithm again. So just remember that for now.

So let's talk about LQR. I want to take these ideas and apply them to MDPs with continuous state spaces and maybe even continuous action spaces. To specify an MDP, I need to give you this five-tuple of states, actions, transition probabilities, horizon and reward; I'm going to use the finite horizon, capital T, rather than discounting. So in LQR problems, I'm going to assume the following. I'm going to assume that the state space is continuous, S equals R^n, and I'm going to assume a continuous set of actions as well, lying in R^d.

To specify the state transition probabilities P_sa, I need to tell you what the distribution of the next state is, given the current state and the current action. We actually saw a little bit of this in the last lecture. I'm going to assume the next state, S_t plus one, is a linear function of the previous state plus the action plus noise: A_t S_t plus B_t times the action a_t, plus w_t, where w_t is Gaussian with mean zero and some covariance given by sigma W. The subscripts t on A and B here show that these matrices could change over time, so this would be non-stationary dynamics.

As a point of notation: unfortunately, because we're compiling ideas from multiple literatures, capital A denotes both a set of actions as well as a matrix. When you see A later on, A will usually be used to denote a matrix rather than a set of actions. So I'm overloading notation again, but unfortunately, when you have research ideas from multiple research communities, often they share the same symbols. So just to be concrete, A_t is a matrix that's n by n, and the B_t are matrices that are n by d.
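(Putting the dynamics just described in one place - a sketch, with lowercase $a_t$ for the action to avoid clashing with the matrix $A_t$:)

$$s_{t+1} \;=\; A_t\, s_t + B_t\, a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w), \qquad s_t \in \mathbb{R}^n,\ a_t \in \mathbb{R}^d,\ A_t \in \mathbb{R}^{n\times n},\ B_t \in \mathbb{R}^{n\times d}.$$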
Just to be completely clear: the matrices A and B, I'm going to assume, are fixed and known in advance. So I'm going to give you the matrices A and B, and I'm going to give you sigma W; your job is to find a good policy for this MDP. So in other words, this is my specification of the state transition probabilities.

Looking ahead - we'll see this later - it turns out the treatment of this noise term is not very important. We can pretty much ignore the noise term and we'll still do fine. This is just a warning that in the sequel, what I do later, I might be slightly sloppy in my treatment of the noise term; in this very special case, it is unimportant.

The last things I have to specify are some horizon time, capital T, and the reward function. For LQR control, I'm going to assume that the reward function can be written as minus the quantity S_t transpose U_t S_t plus A_t transpose V_t A_t, where U_t is a matrix that's n by n and V_t is a matrix that's d by d. I'll assume that U_t and V_t are both positive semi-definite - both PSD. The fact that U_t and V_t are both positive semi-definite matrices implies that S_t transpose U_t S_t is greater than or equal to zero, and similarly A_t transpose V_t A_t is greater than or equal to zero. So this implies that your rewards are always negative. This is a somewhat depressing MDP in which there are only costs and no positive rewards, right, because of the minus sign there.
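(The reward just defined, written out; again a sketch with lowercase $a_t$ for the action:)

$$R_t(s_t, a_t) \;=\; -\big( s_t^{\top} U_t\, s_t \;+\; a_t^{\top} V_t\, a_t \big), \qquad U_t \in \mathbb{R}^{n\times n},\ V_t \in \mathbb{R}^{d\times d},\ U_t, V_t \succeq 0.$$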
As a complete example of how you might want to apply this - you've seen my helicopter videos, right? - suppose you have a helicopter, and you want the state S_t to be as close to zero as possible. Then you might choose U_t to be equal to the identity matrix. This will make R of S_t, A_t be equal to minus S_t transpose S_t, the negative of the squared norm of the state vector. So this would be penalizing the system quadratically for having a state that's far from zero, assuming that zero is the origin state. So if the goal is to make a helicopter hover around the state zero, then you might choose this sort of reward function.

It turns out it's also very common to choose a cost for the action. So suppose I choose V_t to be equal to an identity matrix; I get minus A_t transpose A_t here, so I penalize large actions as well. Including a quadratic cost for actions is also a fairly common thing to do. In practice, this tends to be effective at discouraging your system from jerking the controls around - it discourages making very huge control commands. Having a term like this in the reward function often makes many systems behave better. Of course, you can choose different values on the diagonal to give different state variables different weights and so on. So there are lots of possible choices for U and V; this is one example.

For the next few steps, I'm going to write things out - derive things - for the general case of non-stationary dynamics. As I write out more math and more equations for LQR, I'm going to try to write it out for the fairly general case of time-varying dynamics and time-varying reward functions. But for the purposes of understanding this material, you might want to think of just ignoring many of the subscripts in terms of t. So for the sake of understanding the material, you might want to mentally assume that there is some fixed matrix A, so that A equals A1 equals A2 equals A3 and so on, and similarly that there's some fixed matrix B. Okay? So I'll write it out for the fully general, non-stationary case, but you might just want to ignore many of the time subscripts and imagine the stationary case for now. Quite a bit later, we're going to talk about an extension of this called differential dynamic programming that will actually use non-stationary dynamics to very powerful effect for a specific algorithm. But for most of what we're about to do, just pretend the MDP is stationary. Okay.

So before I move on, let me just say a couple of words about how you would go about coming up with the linear models. The key assumption in this model is that the dynamics are linear; there's also the assumption that the reward function is quadratic, but let's talk about the assumption that the dynamics are linear.
So the dynamics are S_t plus one equals A S_t plus B A_t - maybe time-varying, maybe stationary; I'm just writing the stationary version for now. So how do you get models like this? We actually saw one example of this already in the previous lecture. If you have an inverted pendulum system and you want to model the inverted pendulum using a linear model like this, one thing you could do is run your inverted pendulum: start it off in some state S0, take some action A0, have it get to some state S1, take action A1 and so on, until you get to some state S_T. The index one denotes that this is my first trial. Then you can repeat this a bunch of times - you can repeat this N times - just executing actions on your physical robot. It could be a robot, it could be a chemical plant, it could be whatever: you try out different actions in your system and watch what states it gets to. You then fit the linear model to your data and choose the parameters A and B that minimize the quadratic error term, the sum over your trials and time steps of the squared norm of S_t plus one minus (A S_t plus B A_t). So this measures how well A S_t plus B A_t predicts S_t plus one, and you minimize that quadratic penalty term. This would be one reasonable way to estimate the parameters of a linear dynamical system for a physical robot or a physical chemical plant or whatever you may have.
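As a sketch of that fitting step (not from the lecture; the trajectory data here is synthetic), A and B can be estimated jointly by ordinary least squares on the stacked state-action pairs:

```python
import numpy as np

n, d = 4, 2                      # state and action dimensions (made-up)
rng = np.random.default_rng(2)

# Pretend we logged (s_t, a_t, s_{t+1}) triples from the real system.
A_true = np.eye(n) + 0.05 * rng.standard_normal((n, n))
B_true = 0.1 * rng.standard_normal((n, d))
S, U, S_next = [], [], []
s = rng.standard_normal(n)
for _ in range(500):
    a = rng.standard_normal(d)
    s_next = A_true @ s + B_true @ a + 0.01 * rng.standard_normal(n)
    S.append(s); U.append(a); S_next.append(s_next)
    s = s_next
S, U, S_next = map(np.array, (S, U, S_next))

# Least squares: minimize sum_t || s_{t+1} - (A s_t + B a_t) ||^2 over A, B.
X = np.hstack([S, U])                        # each row is [s_t, a_t]
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
A_hat, B_hat = W[:n].T, W[n:].T              # recover A (n x n) and B (n x d)

print(np.max(np.abs(A_hat - A_true)), np.max(np.abs(B_hat - B_true)))
```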
Another way to come up with a linear model, if you want to use one for control, is to take a nonlinear model and linearize it. Let me show you what I mean by that. So let's say you have some nonlinear model that expresses S_t plus one as some function F of S_t and A_t. In the example in the previous lecture, I said you could get such a model for the inverted pendulum by referring to the laws of physics - actually, by downloading off-the-shelf software for doing physics simulations. If you haven't seen this before, you can go online; you can easily find many open-source packages for simulating the physics of simple devices like these. Download the software, type in the specifications of your robot, and it will simulate the physics for you. There are lots of open-source software packages like that; you can just download them.

With something like that, you can now build a physics simulator that predicts the state as a function of the previous state and the previous action. So you come up with some function F that says that the state vector at the next time step will be some function of the current state and the current action, where the action in this case is just a real number that says how hard you accelerated to the left or right. Then you can take this nonlinear model. I actually wrote down a sample model in the last lecture, but in general F would be some nonlinear function, rather than a linear function.

So what I mean by linearize is the following. Here's just a cartoon; I'll write down the math in a second. Let's say the horizontal axis is the input state S_t, the vertical axis is the output state S_t plus one, and here's the function F: the next state S_t plus one will be some function of the previous state S_t and the action A_t. So to linearize this model, what you would do is choose a point - we'll call it S bar t - take the derivative of the function at that point, and fit a straight line tangent to the function there. This allows you to approximate the next state S_t plus one as a linear function of the previous state S_t. To make this cartoon really right, the horizontal axis here is really a state-action pair that you're linearizing around - so this is just a cartoon; the horizontal axis represents both the input state and the input action.

To write this out in math, I'll write out the simple case first and the fully general one in a second. Suppose the horizontal axis was only the state - let's pretend there are no actions for now, and S_t plus one is just some function of S_t. Then the linear function I drew would be: S_t plus one is approximated as F prime, evaluated at the point S bar t, times the quantity S_t minus S bar t, plus F of S bar t. So with this, you've expressed S_t plus one as a linear function of S_t. Just note that S bar t is a constant - it's not a variable. Does that make sense? S bar t is a constant, and F prime of S bar t is the gradient of the function F at the point S bar t.
This is really just the equation of that tangent line, and you can then convert it to A and B matrices.

Jumping back one board, I'm going to point out one other thing. Let's say I look at this straight line and I ask how well it approximates my function F - my original simulator, my original function F. Then you notice that in this neighborhood, in the neighborhood of S bar, it's a pretty good approximation; it's fairly close. But as you move further away, moving far off to the left here, it becomes a pretty terrible approximation. So when you linearize a nonlinear model to apply LQR, one of the parameters you have to choose is the point around which to linearize your nonlinear model. If you expect your inverted pendulum system to spend most of its time in the vicinity of some state, then it would be reasonable to linearize around that state, because that means the linear approximation will usually be a good approximation for the states where you expect the system to spend most of its time. If, conversely, you expect the system to spend most of its time at states far to the left, then this would be a terrible location to linearize around. So one rule of thumb is to choose the point to linearize around according to where you expect the system to spend most of its time, so that the linear approximation will tend to be accurate in the vicinity of those states. To be clear, this is about choosing the point S bar, A bar that we'll use to come up with a linear function that we'll pretend is a good approximation to my original nonlinear function F.

So for an example like the inverted pendulum problem: if you expect to do pretty well on this problem, then you would expect the state to often be near the zero state, where S equals zero corresponds to X being the center of the railway track that the inverted pendulum lives on. If you expect to do fairly well, you expect the pole to mostly be upright - upright at zero degrees, or 90 degrees from the track, I guess. So you choose whatever state corresponds to having the pole upright, with near zero velocity, in the middle of the track.
So you usually choose that as the state to linearize your inverted pendulum dynamics around: that's the region where you want your approximation to be good.

So coming back to this formula: I wrote it down for the special case of a one-dimensional state variable with no actions. The general formula for the linearization approximation is that S_t plus one is approximated as F of (S bar t, A bar t), plus the gradient of F with respect to the state, evaluated at (S bar t, A bar t), times (S_t minus S bar t), plus the gradient of F with respect to the action, evaluated at (S bar t, A bar t), times (A_t minus A bar t). Okay? The upside-down triangles are the usual symbol for taking the derivative of F with respect to its first or its vector-valued second argument. So by choosing an appropriate point S bar t, A bar t to linearize around, you can now express S_t plus one as a linear function of the current state S_t and the current action A_t. Again, S bar t is a constant that you choose ahead of time, and the same for A bar t. Lastly, having linearized this thing, you can then read off the matrices A and B, so that S_t plus one is now a linear function of S_t and A_t. Questions about this?

So just one tiny detail - and it's really not a huge deal - is that the expression below is technically an affine function: there might actually be an extra constant there. But this is not a difficult generalization of the linear dynamical system definition. One way to deal with that constant is to take your definition for the state - let's say X, X dot, theta, theta dot - and augment your state vector to have an extra intercept term, a one. With the intercept one, and working out the A matrix, you can then take care of the extra constant, C, as well. So you can deal with this thing being technically an affine function, because of the extra offset, rather than a linear function. But this is just a little bit of bookkeeping you can do for yourself and shouldn't be a huge deal.

So to summarize: you can learn a model, or you can take a nonlinear model - your nonlinear model can be a physics model or a nonlinear model you've learned from data - and linearize it.
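A minimal sketch of that linearization step (not from the lecture): given any black-box simulator f(s, a), estimate A, B, and the extra constant by finite differences around a chosen operating point, then absorb the constant with the intercept trick just described; the dynamics function here is made up.

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Numerically linearize s' = f(s, a) around (s_bar, a_bar):
    f(s, a) ~= c + A (s - s_bar) + B (a - a_bar)."""
    n, d = len(s_bar), len(a_bar)
    c = f(s_bar, a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):                       # Jacobian with respect to the state
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f(s_bar - ds, a_bar)) / (2 * eps)
    for j in range(d):                       # Jacobian with respect to the action
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f(s_bar, a_bar - da)) / (2 * eps)
    return A, B, c

# Made-up nonlinear dynamics standing in for a physics simulator.
f = lambda s, a: np.array([s[0] + 0.1 * s[1], s[1] + 0.1 * np.sin(s[0]) + 0.1 * a[0]])
s_bar, a_bar = np.zeros(2), np.zeros(1)
A, B, c = linearize(f, s_bar, a_bar)

# Intercept trick: augment the state with a constant 1 so the model is purely linear.
const = c - A @ s_bar - B @ a_bar            # affine offset term
A_aug = np.block([[A, const.reshape(-1, 1)],
                  [np.zeros((1, 2)), np.ones((1, 1))]])
B_aug = np.vstack([B, np.zeros((1, 1))])
```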
Now I'll pose the LQR problem, in which we have a specification of the MDP where the states are in R^n, the actions are in R^d, the state transition probabilities are given by the linear equation S_t plus one equals A_t S_t plus B_t A_t, and our rewards are those quadratic functions. So the specification of the MDP means that we know the A matrices, the B matrices, the U matrices and the V matrices. Our goal is to come up with a policy to maximize our finite horizon sum of rewards - to come up with a policy that maximizes the expected value of this finite horizon sum of rewards. Okay.

So our approach to solving this problem will be exactly that finite horizon dynamic programming algorithm that we worked out a little earlier in this lecture. In particular, my strategy for finding the optimal policy will be to first find V star of capital T, and then apply the backward recursion to find V star of T minus one, V star of T minus two and so on.

In the dynamic programming algorithm we worked out, V star subscript capital T of the state S_T is the maximum over the actions you might take at that time of R of S_T, A_T. Again, just for the sake of understanding this material, you can probably pretend the rewards and the dynamics are stationary; I'll write out all these subscripts, which may be confusing if you're seeing this for the first time. So this is equal to the max over A_T of minus the quantity S_T transpose U_T S_T plus A_T transpose V_T A_T - right? I hope this isn't confusing: the superscript Ts denote transposes, and the lowercase ts and capital Ts denote the time index. So that's just the definition of my quadratic rewards at the final time.

This is clearly maximized as minus S_T transpose U_T S_T, because the last term, A_T transpose V_T A_T, is greater than or equal to zero - that's my assumption that V_T is positive semi-definite. So the best action to take in the last time step is just the action zero: pi star subscript capital T of S_T is equal to the argmax over actions of that same thing, which is just zero. By choosing the zero action, A_T transpose V_T A_T becomes zero, and that's how this reward is maximized. Any questions, or is something illegible?
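(The base case just derived, written out:)

$$V^*_T(s_T) \;=\; \max_{a_T}\; -\big(s_T^{\top} U_T\, s_T + a_T^{\top} V_T\, a_T\big) \;=\; -\,s_T^{\top} U_T\, s_T, \qquad \pi^*_T(s_T) \;=\; 0.$$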
Okay. So now let's do the dynamic programming step, where my goal is: given V*_{t+1}, I want to compute V*_t. Given V* at time t+1, I want to compute V* at time t. So this is the dynamic programming step.

The DP step I wrote down previously was this. For the finite-state case, I wrote down the following: V*_t(s_t) = max over actions a_t of R(s_t, a_t), plus the sum over states s_{t+1} of P_{s_t a_t}(s_{t+1}) times V*_{t+1}(s_{t+1}). So this is exactly the equation I wrote down previously, and this is what I wrote down for finite states, where you have these discrete state transition probabilities and we can sum over this discrete set of states. Now we're considering an infinite, continuous state space again, so the sum over states should really become an integral. I'm going to skip the integral step; we'll just go ahead and write that last term as an expectation. So this is going to be max over actions a_t of R(s_t, a_t), plus an expectation over the random next state s_{t+1}, drawn from the state transition probabilities P_{s_t a_t}, of V*_{t+1}(s_{t+1}). So this is the same equation, written down as an expectation.

So what I need to do is: given a representation of V*_{t+1}, I need to find V*_t. It turns out that LQR has the following useful property: each of these value functions can be represented as a quadratic function. So concretely, let's suppose that V*_{t+1} can be expressed as a quadratic function, written like so: V*_{t+1}(s_{t+1}) = s_{t+1}^T Phi_{t+1} s_{t+1} + Psi_{t+1}, where the matrix Phi_{t+1} is an n-by-n matrix and Psi_{t+1} is just a real number. In other words, suppose V*_{t+1} is just a quadratic function of the state s_{t+1}. We can then show that when you do one dynamic programming step - when you plug this definition of V*_{t+1} into the dynamic programming equation I had just now - you can show that V*_t as well will also be a quadratic function of the same form: s_t^T Phi_t s_t + Psi_t, for some appropriate matrix Phi_t and some appropriate real number Psi_t. So what you can do is start off the recursion with - well, does that make sense?
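As a tiny illustration of that representation - hypothetical helper names, not lecture code - a quadratic value function is fully described by the pair (Phi_t, Psi_t):

```python
import numpy as np

def quadratic_value(Phi, Psi):
    """Return V(s) = s' Phi s + Psi for a value function stored as (Phi, Psi)."""
    return lambda s: float(s @ Phi @ s + Psi)

# The last-step value function V*_T(s) = -s' U_T s corresponds to
# Phi_T = -U_T and Psi_T = 0, which is exactly how the recursion gets seeded.
U_T = np.eye(3)
V_T_star = quadratic_value(-U_T, 0.0)
print(V_T_star(np.array([1.0, 2.0, 0.5])))   # -> -(1 + 4 + 0.25) = -5.25
```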
So what you can do is start off the recursion as follows. Previously, we worked out that V*_T(s_T) is -s_T^T U_T s_T. So we have that Phi at capital T is equal to -U_T, and Psi at capital T is equal to zero. Then V*_T(s_T) is equal to s_T^T Phi_T s_T + Psi_T. So you can start out the recursion this way, with Phi_T = -U_T and Psi_T = 0, and then work out what the recursion is.

I won't actually do the full derivation. It's mainly algebra, and you've actually done this sort of Gaussian expectation math a lot in your homework by now. So I won't do the full derivation; I'll just outline the key step. In the dynamic programming step, V*_t(s_t) is equal to the max over actions a_t of the immediate reward - so this was R(s_t, a_t) from my equation in the dynamic programming step - plus an expected value over the random next state s_{t+1}, drawn from the Gaussian distribution with mean A_t s_t + B_t a_t and covariance Sigma_w. So what this is, this is really my specification for P_{s_t a_t}. This is my state transition distribution in the LQR setting - my state transition distribution if I take action a_t in the state s_t. My next state is then distributed Gaussian with mean A_t s_t + B_t a_t and covariance Sigma_w. And inside the expectation we have the value of that next state - this, of course, is just V*_{t+1}(s_{t+1}). I hope this makes sense. This is just taking that equation I had previously in the dynamic programming step - V*_t(s_t) equals the max over actions of the immediate reward plus an expected value, over the next state, of V* of the next state with the clock advanced by one - and plugging in the definitions of the reward, of the state transition distribution, and of the value function. Actually, could you raise your hand if this makes sense? Cool.
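The "Gaussian expectation math" being alluded to is, I believe, the identity E[x' Phi x] = mu' Phi mu + tr(Sigma Phi) for x ~ N(mu, Sigma); it is what produces the trace term in the constant part of the value function. Here is a quick, self-contained numerical check of that identity (illustrative only, with arbitrary made-up matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
Phi = rng.standard_normal((n, n)); Phi = (Phi + Phi.T) / 2   # any symmetric matrix
mu = rng.standard_normal(n)
L = rng.standard_normal((n, n)); Sigma = L @ L.T             # a PSD covariance

# Closed form: E[x' Phi x] = mu' Phi mu + tr(Sigma Phi) for x ~ N(mu, Sigma)
closed_form = mu @ Phi @ mu + np.trace(Sigma @ Phi)

# Monte Carlo estimate of the same expectation
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
monte_carlo = np.mean(np.einsum('ij,jk,ik->i', xs, Phi, xs))

print(closed_form, monte_carlo)   # the two numbers should be close
```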
So if you write this out and you expand the expectation - I know you've done this many times, so I won't do it - this whole thing on the right-hand side simplifies to a big quadratic function of the action a_t. So this whole thing simplifies to a big quadratic function of the action a_t, and we want to maximize it with respect to the action a_t. So to maximize a big quadratic function, you just take the derivative of the function with respect to the action a_t, set the derivative equal to zero, and then you've maximized the right-hand side with respect to the action a_t.

It turns out - I'm just going to write this expression down for completeness; you can derive it yourself at any time - that if you actually maximize that thing on the right-hand side as a function of the action a_t, you find that the maximizing a_t is going to be that expression times s_t. Don't worry about this expression; you can get it from the board and derive it yourself. But the key thing to note is that the optimal action a_t is going to be some big matrix, which we're going to call L_t, times s_t. In other words, the optimal action to take in this given state is going to be some linear function of the state s_t.

So, having done dynamic programming - you remember also, when we worked out the dynamic programming algorithm for finite-horizon MDPs, we said that the way you compute the optimal policy is pi*_t(s_t) equals the argmax over actions a_t of the same thing: R(s_t, a_t) plus your expected value, over s_{t+1} drawn from P_{s_t a_t}, of V*_{t+1}(s_{t+1}). That thing on the right-hand side is always the same thing as the thing we just maximized. So what this means is that, having found the value of a_t that maximizes this, we've found that the optimal action to take from the state s_t is actually equal to L_t times s_t. What we've shown is that when you're in some state s_t, the optimal action for that state is going to be some matrix L_t, which you can compute, times the state s_t.
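For completeness, here is one way that board expression comes out if you carry the derivative through yourself, under the reward convention R(s, a) = -(s'Us + a'Va) used above. Treat this as a sketch of my own working, to be checked against your own derivation, not as a transcript of the board:

```python
import numpy as np

def gain_matrix(A, B, U, V, Phi_next):
    """
    Optimal-action gain L_t for one LQR backup, assuming
      reward    R(s, a) = -(s' U s + a' V a)
      dynamics  s_{t+1} = A s_t + B a_t + w_t,  w_t ~ N(0, Sigma_w)
      V*_{t+1}(s) = s' Phi_next s + Psi_next.
    Setting d/da [ -a'Va + (As + Ba)' Phi_next (As + Ba) ] = 0 gives
      (V - B' Phi_next B) a = B' Phi_next A s,
    so a_t = L_t s_t with L_t as below.  Since Phi_next is negative
    semi-definite, V - B' Phi_next B is invertible whenever V is positive definite.
    """
    return np.linalg.solve(V - B.T @ Phi_next @ B, B.T @ Phi_next @ A)
```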
In other words, the optimal action is actually a linear function of the state. I'm just going to point out, this is not function approximation here, right? What we did not do - we did not say, let's find the optimal linear policy. We didn't say, let's look at the optimal policy and then fit a straight line to it. This is not about approximating the optimal policy with a straight line. This derivation is saying that the optimal policy is a straight line: the optimal action is a linear function of the current state.

Moreover, we've now worked out the value of a_t that maximizes the thing on the right-hand side. So you take this and plug it back in to do the dynamic programming recursion. What you find is that - so you take a_t and plug it back in to do the maximization - it will actually get you this formula for V*_t(s_t). So you find that it will indeed be a quadratic function of the following form, s_t^T Phi_t s_t + Psi_t, where - and I just write out the equations for the sake of completeness; don't worry too much about their forms, you can derive them yourself.

So just to summarize: don't worry too much about the forms of these equations. What we've done is written down the recursion that expresses Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}. So this allows you to compute the optimal value function for when the clock is at time lowercase t, as a function of the optimal value function for when the clock is at time t+1.

So, to summarize - and actually, just to give this equation a name as well: this recursion, in terms of the Phi_t's, is called the discrete-time Riccati equation. It's the recursion that gives you Phi_t in terms of Phi_{t+1}.

So, to summarize, our algorithm for finding the exact solution to finite-horizon LQR problems is as follows. We initialize Phi_T to be equal to -U_T and Psi_T to be equal to zero. Then we recursively calculate Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1} with the discrete-time - actually, excuse me.
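Since the board equations are left as "derive it yourself", here is one way to write that backup step in code - a sketch under the same assumptions and sign conventions as the gain_matrix sketch above; Phi_t is obtained by plugging a_t = L_t s_t back into the backup, which is algebraically equivalent to the usual discrete-time Riccati form:

```python
import numpy as np

def lqr_backup(A, B, U, V, Sigma_w, Phi_next, Psi_next):
    """
    One dynamic-programming step: given (Phi_{t+1}, Psi_{t+1}) representing
    V*_{t+1}(s) = s' Phi_{t+1} s + Psi_{t+1}, return (Phi_t, Psi_t, L_t).
    """
    L = np.linalg.solve(V - B.T @ Phi_next @ B, B.T @ Phi_next @ A)
    A_cl = A + B @ L                    # closed-loop dynamics under a_t = L s_t
    # Plug a_t = L s_t back into  -s'Us - a'Va + E[V*_{t+1}(s_{t+1})]:
    Phi = -U - L.T @ V @ L + A_cl.T @ Phi_next @ A_cl
    Psi = float(np.trace(Sigma_w @ Phi_next)) + Psi_next   # Gaussian-noise term
    return Phi, Psi, L
```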
So, recursively calculate Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}, as I showed, using the discrete-time Riccati equation. You do this for t = T-1, T-2 and so on, down to time zero. Then you compute L_t as a function of - actually, is it Phi_t or Phi_{t+1}? Phi_{t+1}, I think - as a function of Phi_{t+1} and Psi_{t+1}. This is actually a function of only Phi_{t+1}; you don't really need Psi_{t+1}. Now you have your optimal policy. So having computed the L_t's, you now have the optimal action to take in the state s_t, just given by this linear equation, a_t = L_t s_t.

How much time do I have left? Okay. Let me just say one last thing about this before I close. Maybe I'll do it next week - I think I'll do it next session instead. It actually turns out there's one cool property about this that is kind of subtle, but you'll find it out in the next lecture. Are there questions about this before we close for today, then?

So the very cool thing about the solution of discrete-time LQR problems - finite-horizon LQR problems - is that this is a problem with an infinite, continuous state space. But nonetheless, under the assumptions we made, you can prove that the value function is a quadratic function of the state. Therefore, just by computing these matrices Phi_t and the real numbers Psi_t, you can exactly represent the value function, even for these infinitely large state spaces, even for continuous state spaces.

So the computation in these algorithms scales only like the cube - scales only as a polynomial - in the number of state variables, whereas with discretization the curse of dimensionality gave us algorithms that scaled exponentially in the dimension of the problem. LQR scales only like the cube of the dimension of the problem, so this easily applies to problems with even very large state spaces. We actually often apply variations of this algorithm to some particular subset of the things we do on our helicopter, which has a high-dimensional state space, with twelve or higher dimensions. This has worked very well for that. So it turns out there are even more things you can do with this, and I'll continue with that in the next lecture. So let's close for today, then.
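Putting the pieces together, here is a compact sketch of the full algorithm just summarized - backward pass seeded with Phi_T = -U_T, Psi_T = 0, then acting with a_t = L_t s_t going forward. It reuses the same assumptions (stationary, made-up matrices and the sign conventions above); a time-varying problem would simply index A, B, U, V by t:

```python
import numpy as np

def solve_finite_horizon_lqr(A, B, U, V, Sigma_w, T):
    """Backward pass: returns the gains L_0, ..., L_{T-1}."""
    Phi_next, Psi_next = -U, 0.0            # Phi_T = -U_T, Psi_T = 0
    gains = []
    for _ in range(T):                      # t = T-1, T-2, ..., 0
        L = np.linalg.solve(V - B.T @ Phi_next @ B, B.T @ Phi_next @ A)
        A_cl = A + B @ L
        Phi = -U - L.T @ V @ L + A_cl.T @ Phi_next @ A_cl
        Psi = float(np.trace(Sigma_w @ Phi_next)) + Psi_next
        gains.insert(0, L)
        Phi_next, Psi_next = Phi, Psi
    return gains

def simulate(A, B, Sigma_w, gains, s0, rng):
    """Forward pass: act with a_t = L_t s_t through the noisy dynamics."""
    s = s0
    for L in gains:
        a = L @ s                           # optimal action is linear in the state
        s = A @ s + B @ a + rng.multivariate_normal(np.zeros(len(s0)), Sigma_w)
    return s

# Example usage with the made-up matrices from the first sketch:
#   gains = solve_finite_horizon_lqr(A, B, U, V, Sigma_w, T)
#   s_final = simulate(A, B, Sigma_w, gains, np.zeros(n), np.random.default_rng(0))
```

Note that only the Phi_t's (through the L_t's) are needed to act; the Psi_t's matter only if you want the exact value of the expected sum of rewards, which is where the cubic-in-n matrix operations per step come from.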