This presentation is delivered by the Stanford Center for Professional Development.

Okay. Welcome back. What I want to do today is talk about one of my favorite algorithms for controlling MDPs, which I think is one of the more elegant, efficient, and powerful algorithms that I know of. I'll first start by talking about a couple of variations of MDPs that are slightly different from the MDP definition you've seen so far. These are pretty common variations. One is state action rewards, and the other is finite horizon MDPs. Using this slightly modified definition of an MDP, I'll then talk about linear dynamical systems. I'll spend a little bit of time talking about how to model linear dynamical systems, and then talk about LQR, or linear quadratic regulation control, which will lead us to the Riccati equation, which is something we will solve in order to do LQR control.

So just to recap - and we've seen this definition many times now - we've been defining an MDP as a tuple of states, actions, state transition probabilities, a discount factor, and a reward function, where gamma, the discount factor, is a number between zero and one, and R, the reward function, was a function mapping from the states to the real numbers.
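(The board notation here was partly inaudible; as a reconstruction in the standard notation, the five-tuple and discounted total payoff being recapped would be written as:)

    (S, A, \{P_{sa}\}, \gamma, R), \qquad \gamma \in [0, 1), \qquad R : S \to \mathbb{R}

    \text{total payoff} = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots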
So we had value iteration, which would perform this update. After a while, value iteration will cause V to converge to V star. Then, having found the optimal value function, you compute the optimal policy by taking essentially the argmax of the equation above - the argmax over a of that same quantity. So in value iteration, as you iterate - as you perform this update - the function V will converge to V star. With any finite number of iterations you just get closer and closer to V star; it actually converges exponentially quickly to V star, but you never exactly reach V star in a finite number of iterations.

So what I want to do now is describe a couple of common variations of MDPs for which we use slightly different definitions. First, the reward function, and then second, we'll do something slightly different from just discounting. Remember, in the last lecture I said that for infinite-state or continuous-state MDPs, we couldn't apply the most straightforward version of value iteration, because if you have a continuous-state MDP, we need to use some approximation of the optimal value function. Then later in this lecture, I'll talk about a special class of MDPs where you can actually represent the value function exactly, even if you have an infinite-state space or even a continuous-state space.
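(A minimal sketch of the value iteration update being recapped here, for a small finite MDP; the arrays P and R below are illustrative assumptions, not anything given in the lecture.)

    import numpy as np

    def value_iteration(P, R, gamma=0.95, iters=1000):
        """Synchronous value iteration for a finite MDP.

        P: array of shape (num_states, num_actions, num_states),
           P[s, a, s'] = transition probability.
        R: array of shape (num_states,), reward as a function of the state only.
        Returns an approximation to V* and the greedy policy pi.
        """
        num_states, num_actions, _ = P.shape
        V = np.zeros(num_states)
        for _ in range(iters):
            # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
            V = R + gamma * np.max(P @ V, axis=1)
        # Greedy policy: pi(s) = argmax_a sum_{s'} P_sa(s') V(s')
        pi = np.argmax(P @ V, axis=1)
        return V, pi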
I'll actually do that - talk about these special classes of infinite-state MDPs - using this new variation of the reward function and the alternative to just discounting, because they make the formulation a little easier.

So the first variation I want to talk about is state action rewards. I'm going to change the definition of the reward function. It turns out this won't be a huge deal. In particular, I change the reward function to be a function mapping from a state-action pair to the real numbers. What I mean by this is just the following. You start off in some state s0. You take an action a0, and as a result of your state and action choice, you transition to some new state, s1. You take some action, a1. You transition to some new state, s2. You take some action, a2, and so on. So this is the state-action sequence that you see. In an MDP where you have state action rewards, your total payoff is now defined as this, where your reward function is now a function both of the current state and of the action you took in the current state. So this is my total payoff. Then, as usual, my goal will be to find a policy - a function mapping from the states to the actions - so that when I execute that policy, I maximize the expected value of my total payoff.
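(The total payoff "defined as this" on the board would, in the standard notation with state action rewards, be the discounted sum:)

    R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots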
So given this definition, it actually turns out that, given an MDP with state action rewards, you can - by playing around with the definition of the states - reduce it back to an MDP whose rewards are a function only of the states. That may or may not be obvious; don't worry if it isn't. But using state action rewards allows you to more directly model problems in which different actions have different costs. A running example is a robot: it's more costly for the robot to move than for it to stay still, so the action to stay still may have a lower cost because you're not using up battery power that you'd have to recharge. Another navigation example would be if you have an outdoor vehicle that needs to drive over some sort of outdoor terrain, like very rough rocks or grass. That may be more costly, more difficult, than driving over, say, a paved road. So you may assign an action that requires driving over grass or rocks a higher cost than driving over paved road.

So this really isn't a huge change to the definition of an MDP. I won't really bother to justify this a lot, but Bellman's equations generalize the way you'd probably expect: V star of s is now equal to that. Previously, when the reward function was just a function of the state s, we could take the max and push it inside, next to the sum. But now that the reward is a function of the action you're taking as well, the max comes outside. So this says that your optimal expected total payoff, starting from the state s, is equal to, first, your immediate reward R of s comma a for executing some action a in state s, plus gamma times your future expected total payoff. That whole thing is your expected total payoff if you take the action a from the current state; since this is the optimal value function, your optimal expected total payoff is the max over all actions of the thing on the right.

Value iteration, which I'm abbreviating VI, is really the same algorithm: V of s is updated as max over a of R of s comma a plus that same term. With that right-hand side, you'd be updating V of s using Bellman's equations, and you get value iteration exactly the same way. Then finally, having found the optimal value function V star using the value iteration algorithm, you can compute the optimal policy pi star of s the same as before: the best action to take in the state s is the action a that maximizes the thing on the right-hand side. So having used value iteration to compute the optimal value function, you can then find pi star using that.

So that was the easier of the two variations of MDPs. Any questions? Actually, are there questions about this?
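(The equations being written on the board here, in the standard notation with state action rewards:)

    V^*(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} P_{sa}(s')\, V^*(s') \Big]

    \text{value iteration:} \quad V(s) := \max_a \Big[ R(s, a) + \gamma \sum_{s'} P_{sa}(s')\, V(s') \Big]

    \pi^*(s) = \arg\max_a \Big[ R(s, a) + \gamma \sum_{s'} P_{sa}(s')\, V^*(s') \Big]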
So the other variation, the other alternative definition, will be finite horizon MDPs. A finite horizon MDP comprises a tuple of states S, actions A, the state transition probabilities, a parameter T, and the reward function. Here, T is a parameter called the horizon time. Concretely, what this really means is that we'll be taking actions in the MDP only for a total of capital T time steps. So we won't use discounting anymore. You start in some state s0 and take action a0, get to some other state s1, take action a1, and so on. Eventually, you get to some state s T after T time steps. So my total payoff now will be given by this sum, from time zero up to time capital T, of my rewards. My goal, as usual, is to maximize the expected value of my total payoff - we want to come up with a policy that maximizes the expected value of this total payoff. The key difference is that the world will only exist for T time steps, and after that, there are no more rewards to be collected.

This turns out to be quite a significant difference, because it turns out that the optimal policy may be non-stationary. The term stationary means that it doesn't depend on time; non-stationary means that it may depend on time. So non-stationary roughly means that my optimal action to take will be different for different time steps. That's what non-stationary means. Just as an example of that, imagine that we have some robot. Let's say the robot is here, and let's say that there's a goal over there with a plus one reward, and much further away, there's a plus ten reward. Depending on how much time you have left on the clock, it may be better to go after the plus one or the plus ten reward. If it's still early in the game and you still have a lot of time, it may be better to head toward the plus ten reward and get a much larger reward. If you only have a couple of time steps left - if the clock has nearly reached the time capital T - then you may not have enough time to get to the plus ten reward, and you'd be better off heading for the plus one reward that's much closer by. So what this example illustrates is that when you're in that state, the best action to take could be to go left or to go right, depending on what time it is - just an example illustrating how the optimal policy can be non-stationary.

In fact, since we have non-stationary policies anyway, in the sequel I'm going to allow non-stationary transition probabilities as well. So I'll just write that up there.
What I mean is that so far, I've been assuming that the state s t plus one is drawn from the state transition probabilities indexed by the previous state and the previous action, and I've been assuming that these state transition probabilities are the same for all times. So if I'm in some state and take some action, the distribution over the next state doesn't depend on time. I'm now going to allow a slightly more general definition as well, in which we have non-stationary state transition probabilities, so that the chance of where you end up next may also depend on what time it is.

As an example of non-stationary state transition probabilities, suppose you model flying an aircraft over a long distance. As the aircraft flies, you burn fuel and become lighter, so the dynamics of the aircraft actually change over time - the mass of the aircraft can change significantly as you burn fuel. So your next state could depend not only on your current state and your action, but also on how much fuel you've burned, and therefore on what time it is. Another aerospace example: if you have the weather forecast for the next 24 hours - you know what the winds and precipitation are going to be like over the next 24 hours - then if you fly the aircraft from, say, here to New York, it may cost different amounts to fly different routes at different times. Maybe flying over the Rockies costs different amounts depending on whether you do it now, when there's really great weather, or a few hours from now, when the weather is forecast to be really bad. For an example you see every day, the same thing is true for traffic, right? Depending on where you live - certainly here in California - there are times of day when traffic is really bad in lots of places, so the cost of driving on certain roads may vary depending on what time of day it is. Lots of other examples: in industrial automation, different machines in the factory may be available to different degrees at different times of day, and it may cost different amounts to hire different workers, depending on whether you pay overtime or whatever. So the cost of doing different things in the factory can also be a function of time, and the state transition probabilities can also be a function of time.

Lastly, while we're doing this, to make this fully general we might as well have non-stationary reward functions too, where you also index the reward function with a time subscript, so that the cost of doing different things may depend on the time as well. There are more examples of non-stationary MDPs, but now that we have non-stationary policies, let's talk about an algorithm to actually try to find the optimal policy. So let me define the following. This is now a slightly modified definition of the optimal value function. I'll just write this down, I guess.
So I'm going to define the optimal value function, and this is going to be indexed by t, with a subscript t. The optimal value of a state for time t we're going to define as your optimal sum of rewards if you start the MDP at that state s, and if the clock starts off at time lowercase t. So the optimal value of a state will depend on what time it is and how much time you have left to run this MDP. Therefore, the sum on the right sums only over time t, time t plus one, time t plus two, up to time capital T. Just to state it in English again: this is your expected optimal total payoff if you start your system in a state s, and if the clock is already at time lowercase t.

So it turns out you can compute this recursively. Let me just write out the algorithm for this - I'll write it below. It turns out you can compute the optimal value function for the MDP using the following recursion, which is very similar to what we had for value iteration. We're going to set V star subscript t of s to be equal to the max over a, same as before. So if I start the clock at time t in state s, my expected total payoff is equal to the maximum over the actions I may take of my immediate reward for taking that action a in that state s, plus my expected future payoff. If I take action a, I transition with probability P subscript s a of s prime to some new state s prime. If I get to the state s prime, my total expected payoff from the state s prime would be V star - now with subscript t plus one - of s prime. The subscript t plus one reflects that after I've taken one action, my clock will have advanced from time t to time t plus one. So this is now V star subscript t plus one, and this expresses V star of t in terms of V star of t plus one.

Then, lastly, to start off this recursion, you would have V star subscript capital T equal to just the max over a of the immediate reward. If you're already at time capital T, you just get to take one action, and then the clock runs out. So this is V star of capital T: your value of starting in some state s with just one time step, with no time left on the clock after that. So in the case of a finite horizon MDP, this actually gives us a very nice dynamic programming algorithm, in which you start off by computing V star of capital T, then you use this backward recursion to compute V star of capital T minus one, capital T minus two, and so on. It recurses backwards until you've computed V star for all of your time steps. Can you see this board? Cool.
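(A minimal sketch of this backward dynamic programming recursion for a small finite MDP, under the illustrative assumption of stationary transitions P and stationary state action rewards R; nothing here is verbatim from the lecture.)

    import numpy as np

    def finite_horizon_dp(P, R, T):
        """Backward recursion for a finite horizon MDP.

        P: (num_states, num_actions, num_states) transition probabilities.
        R: (num_states, num_actions) state action rewards.
        T: horizon; actions are taken at t = 0, 1, ..., T.
        Returns V[t, s] = V*_t(s) and the non-stationary policy pi[t, s].
        """
        S, A, _ = P.shape
        V = np.zeros((T + 1, S))
        pi = np.zeros((T + 1, S), dtype=int)
        # Base case: no future after the last step, so V*_T(s) = max_a R(s, a).
        V[T] = R.max(axis=1)
        pi[T] = R.argmax(axis=1)
        # Recursion: V*_t(s) = max_a [ R(s,a) + sum_{s'} P_sa(s') V*_{t+1}(s') ]
        for t in range(T - 1, -1, -1):
            Q = R + P @ V[t + 1]          # Q[s, a]
            V[t] = Q.max(axis=1)
            pi[t] = Q.argmax(axis=1)      # the best action depends on t: non-stationary
        return V, pi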
Then the final step is - previously, we said that pi star of s (I'm going to come back and change this a bit) was the argmax over a of R plus gamma times the sum over P s a of s prime; this is sort of what we had. In the finite horizon case, the optimal action may depend on what time it is. So the optimal action to take at time t is the argmax over actions a of exactly that same thing on the right-hand side as we had in our dynamic programming algorithm. You do this for every time step, and that gives you your optimal policy for the different time steps. Again, this is a non-stationary policy, because pi star of s may depend on what time it is.

So the difference between this and the earlier version of value iteration is that what you do is compute V star of capital T, then, using the backwards recursion of that dynamic programming algorithm, you compute V star of T minus one, then V star of T minus two, and so on down to V star of zero. Then from these, you compute pi star. There's not a huge difference, but one minor difference from the infinite horizon discounted case is that by running this recursion once, you now have exactly the right value function. So this just computes the value function, rather than merely converging to it asymptotically - this gives you the right value function in one pass. Cool. Any questions there? Yeah.

Student: [Inaudible] this computation's much shorter than value iteration?

Instructor (Andrew Ng): So, sort of yes and no. It depends on how large capital T is.

Student: [Inaudible] the normal MDP, could you [inaudible] and then use this case for that case?

Instructor (Andrew Ng): I see. So for the normal MDP, can you assume a capital T and then use this? It actually turns out - that's a great question. Let me just answer it in a hand-wavy way. It turns out, for a discounted infinite horizon MDP with some discount factor gamma, what you can do is say: after T time steps, gamma to the T would be really small - some tiny number - and I don't really care what happens after that many time steps, because the rewards there are multiplied by gamma to the T. So you can ask: can I take my infinite horizon discounted MDP and approximate it with a finite horizon MDP, where the number of time steps T is chosen so that this holds true? It turns out you can do that - you end up with some value for T; you can solve for T so that this holds true. But it also turns out you can prove that if you took the original value iteration algorithm - the version for discounted MDPs - and ran it for that same number of time steps, you would end up with an approximation to the value function that is about this close, up to some small constant factors. So by doing that, you end up with roughly the same amount of computation anyway, and then you actually end up with a non-stationary policy, which is more expensive to keep around - you need to keep around a different policy for every time step, which is not as nice as having a stationary policy, the same policy for all times. So there are other reasons why you might sometimes take an infinite horizon discounted problem and approximate it by a finite horizon problem, but this particular reason is not one of them. Does that make sense? More questions?

Student: [Inaudible]?

Instructor (Andrew Ng): Is there a gamma in this? So if you want, you can actually change the definition of an MDP and use a finite horizon discounted MDP - you can certainly come in and put a gamma there and use discounting with a finite horizon. It turns out that usually, for most problems people deal with, you either use discounting or you use the finite horizon; it's less common to do both, but you certainly can. One of the nice things about discounting is that it makes sure your value function is finite. Algorithmically and mathematically, one of the reasons to use discounting is that because you're multiplying your rewards by gamma to the t, you get a geometric series, which shows that the value function is always finite. That's a really nice mathematical property of discounting. When you have a finite horizon, the value function is also guaranteed to be finite, so with that, you don't have to use discounting. But if you want, you can discount as well.

Student: [Inaudible].

Instructor (Andrew Ng): Yeah, yes, you're right. If you want, you can fold the discounting into the reward function, since we have non-stationary rewards as well.
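(As a back-of-the-envelope version of the hand-wavy argument above - my own illustration, not from the lecture - one common way to pick such a horizon, if rewards are bounded by R_max, is:)

    \gamma^T \le \epsilon \iff T \ge \frac{\log(1/\epsilon)}{\log(1/\gamma)},
    \qquad \text{so the ignored tail is at most } \gamma^T \cdot \frac{R_{\max}}{1 - \gamma} \le \epsilon \cdot \frac{R_{\max}}{1 - \gamma}.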
So that was finite horizon MDPs. What I want to do now is actually use both of these ideas - state action rewards and finite horizon MDPs - to describe a special case of MDPs that makes very strong assumptions about the problem. But these assumptions are reasonable for many systems, and with these assumptions, what we come up with, I think, are very nice and very elegant algorithms for solving even very large MDPs.

So let's talk about linear quadratic regulation. We just talked about dynamic programming for finite horizon MDPs, so just remember that algorithm. When I come back to talk about an algorithm for solving LQR problems, I'm actually going to use exactly that dynamic programming algorithm that you just saw for finite horizon MDPs - I'll be using exactly that algorithm again. So just remember that for now.

So let's talk about LQR. I want to take these ideas and apply them to MDPs with continuous state spaces and maybe even continuous action spaces. To specify an MDP, I need to give you the five-tuple of states, actions, transition probabilities, the horizon, and the reward; I'm going to use the finite horizon, capital T, rather than discounting. In LQR problems, I'm going to assume the following. I'm going to assume that the state space is continuous, S equals R to the n, and I'm going to assume also a continuous set of actions lying in R to the d.

To specify the state transition probabilities P s a, I need to tell you what the distribution of the next state is, given the current state and the current action. We actually saw a little bit of this in the last lecture. I'm going to assume the next state, s t plus one, is going to be a linear function of the previous state: A t s t plus B t a t plus w t - oh, excuse me, I meant to subscript that - where w t is Gaussian with mean zero and some covariance given by sigma W. I subscript A and B here with t to show that these matrices could change over time, so this would be non-stationary dynamics. As a point of notation - unfortunately, we're combining ideas from multiple literatures - it's sort of unfortunate that capital A denotes both a set of actions as well as a matrix. When you see A later on, A will usually be used to denote a matrix rather than a set of actions. So we're overloading notation again, but unfortunately, when you have research ideas from multiple research communities, often they share the same symbols. Just to be concrete, A t is a matrix that's n by n, and B t is a matrix that's n by d. And just to be completely clear, the matrices A and B I'm going to assume are fixed and known in advance. So I'm going to give you the matrices A and B, and I'm going to give you sigma W. Your job is to find a good policy for this MDP.
So in other words, this is my specification of the state transition probabilities. Looking ahead - we'll see this later - it turns out the treatment of this noise term is not very important. We can pretty much ignore the noise term and still do fine. This is just a warning that in the sequel, in what I do later, I might be slightly sloppy in my treatment of the noise term; in this very special case, it turns out to be unimportant.

The last thing I have to specify is some horizon time, and then I also have some reward function. For LQR control, I'm going to assume that the reward function can be written as this, where U t is a matrix that's n by n and V t is a matrix that's d by d. I'll assume that U t and V t are both positive semi-definite - are both PSD. The fact that U t and V t are both positive semi-definite matrices implies that s t transpose U t s t is greater than or equal to zero, and similarly, a t transpose V t a t is greater than or equal to zero. So this implies that your rewards are always negative. This is a somewhat depressing MDP, in which there are only costs and no positive rewards, right, because of the minus sign there.
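(The dynamics and reward being written on the board here, in the standard LQR notation:)

    s_{t+1} = A_t s_t + B_t a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w)

    R(s_t, a_t) = -\big( s_t^\top U_t\, s_t + a_t^\top V_t\, a_t \big), \qquad U_t \succeq 0, \; V_t \succeq 0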
Dialogue: 0,0:34:28.57,0:34:35.57,Default,,0000,0000,0000,,So Dialogue: 0,0:34:43.32,0:34:46.66,Default,,0000,0000,0000,,as Dialogue: 0,0:34:46.66,0:34:49.63,Default,,0000,0000,0000,,a complete example for how you might want to apply this, Dialogue: 0,0:34:49.63,0:34:53.51,Default,,0000,0000,0000,,you've seen my helicopter videos, right? So one thing is, Dialogue: 0,0:34:53.51,0:34:55.73,Default,,0000,0000,0000,,for example, Dialogue: 0,0:34:55.73,0:34:57.99,Default,,0000,0000,0000,,suppose you have a helicopter, Dialogue: 0,0:34:57.99,0:35:00.89,Default,,0000,0000,0000,,and you want the state ST to be Dialogue: 0,0:35:00.89,0:35:04.23,Default,,0000,0000,0000,,as close to zero as possible. Dialogue: 0,0:35:05.44,0:35:08.06,Default,,0000,0000,0000,,Then Dialogue: 0,0:35:08.06,0:35:10.21,Default,,0000,0000,0000,,you might choose UT Dialogue: 0,0:35:10.21,0:35:12.62,Default,,0000,0000,0000,,to be equal to the identity matrix. Dialogue: 0,0:35:12.62,0:35:16.71,Default,,0000,0000,0000,,This will make R of STAT be Dialogue: 0,0:35:16.71,0:35:20.06,Default,,0000,0000,0000,,equal to ST transpose Dialogue: 0,0:35:20.06,0:35:23.82,Default,,0000,0000,0000,,ST. But that's just Dialogue: 0,0:35:27.24,0:35:30.72,Default,,0000,0000,0000,,- I'll just Dialogue: 0,0:35:30.72,0:35:33.81,Default,,0000,0000,0000,,write that down. [Inaudible] oh, excuse me - minus. Dialogue: 0,0:35:33.81,0:35:38.43,Default,,0000,0000,0000,,The squared negative of the squared [inaudible] vector. So this would be penalizing the Dialogue: 0,0:35:38.43,0:35:39.68,Default,,0000,0000,0000,,system Dialogue: 0,0:35:39.68,0:35:40.77,Default,,0000,0000,0000,,quadratically Dialogue: 0,0:35:40.77,0:35:45.22,Default,,0000,0000,0000,,for having a state that's half of zero, assuming that zero's the origin state. So if it Dialogue: 0,0:35:45.22,0:35:46.89,Default,,0000,0000,0000,,goes to make a helicopter Dialogue: 0,0:35:46.89,0:35:52.04,Default,,0000,0000,0000,,hover around the state zero, then you might choose this sort of reward function. It Dialogue: 0,0:35:52.04,0:35:54.44,Default,,0000,0000,0000,,turns out Dialogue: 0,0:35:55.28,0:35:58.68,Default,,0000,0000,0000,,it's also very common for action to choose a cost Dialogue: 0,0:35:58.68,0:36:03.21,Default,,0000,0000,0000,,for the action. So suppose I choose VT to be equal to an identity matrix. I get minus Dialogue: 0,0:36:03.21,0:36:05.36,Default,,0000,0000,0000,,AT transpose Dialogue: 0,0:36:05.36,0:36:09.51,Default,,0000,0000,0000,,AT Dialogue: 0,0:36:09.51,0:36:12.26,Default,,0000,0000,0000,,here. Then minus [inaudible] actions as well. Dialogue: 0,0:36:12.26,0:36:16.08,Default,,0000,0000,0000,,Including a quadratic cost for actions, it's also a fairly common Dialogue: 0,0:36:16.08,0:36:17.10,Default,,0000,0000,0000,,thing to do. Dialogue: 0,0:36:17.10,0:36:19.56,Default,,0000,0000,0000,,In practice, this tends to be effective Dialogue: 0,0:36:19.56,0:36:22.63,Default,,0000,0000,0000,,of discouraging your system from Dialogue: 0,0:36:22.63,0:36:26.21,Default,,0000,0000,0000,,jerking the controls around. This discourages making very huge control commands. Having Dialogue: 0,0:36:26.21,0:36:27.78,Default,,0000,0000,0000,,a term like this Dialogue: 0,0:36:27.78,0:36:31.86,Default,,0000,0000,0000,,reward function often makes many systems behave better. 
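With UT and VT both taken to be identity matrices, as in this helicopter example, the reward specializes to

\[ R(s_t, a_t) = -\,s_t^\top s_t - a_t^\top a_t = -\|s_t\|^2 - \|a_t\|^2, \]

penalizing both the distance of the state from the origin and the magnitude of the control command.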
Dialogue: 0,0:36:31.86,0:36:35.79,Default,,0000,0000,0000,,Of course, [inaudible] choose different values, we have different values on the diagonal to Dialogue: 0,0:36:35.79,0:36:37.07,Default,,0000,0000,0000,,give different Dialogue: 0,0:36:37.07,0:36:39.96,Default,,0000,0000,0000,,state variables, different weight and Dialogue: 0,0:36:39.96,0:36:44.81,Default,,0000,0000,0000,,so on. So lots of possible choices for U and B. This is one example. Dialogue: 0,0:36:49.85,0:36:55.99,Default,,0000,0000,0000,,So Dialogue: 0,0:36:55.99,0:36:59.42,Default,,0000,0000,0000,,for the next few steps, Dialogue: 0,0:36:59.42,0:37:03.62,Default,,0000,0000,0000,,I'm going to write out things, I'm going to derive things, Dialogue: 0,0:37:03.62,0:37:08.51,Default,,0000,0000,0000,,for the general case of non-stationary dynamics. I'm going - as I write Dialogue: 0,0:37:08.51,0:37:11.53,Default,,0000,0000,0000,,out more math and more equations for LQR, Dialogue: 0,0:37:11.53,0:37:13.81,Default,,0000,0000,0000,,I'm going to try write it out Dialogue: 0,0:37:13.81,0:37:18.55,Default,,0000,0000,0000,,for the fairly general case of time varied dynamics and time varied reward Dialogue: 0,0:37:18.55,0:37:21.64,Default,,0000,0000,0000,,functions. So I'm [inaudible] function. Dialogue: 0,0:37:21.64,0:37:26.31,Default,,0000,0000,0000,,But for purposes of understanding this material, Dialogue: 0,0:37:26.31,0:37:29.50,Default,,0000,0000,0000,,you might want to think of just ignoring many of the subscripts, in terms Dialogue: 0,0:37:29.50,0:37:31.58,Default,,0000,0000,0000,,of T. So Dialogue: 0,0:37:33.85,0:37:36.05,Default,,0000,0000,0000,,for the sake of [inaudible] material, Dialogue: 0,0:37:36.05,0:37:40.06,Default,,0000,0000,0000,,you might want to mentally assume that there are some fixed matrix, A, so Dialogue: 0,0:37:40.06,0:37:46.08,Default,,0000,0000,0000,,that A is equal to A1, A2, equals A3 and so on. Dialogue: 0,0:37:46.08,0:37:50.60,Default,,0000,0000,0000,,Similarly, there's some matrix B. Dialogue: 0,0:37:50.60,0:37:51.51,Default,,0000,0000,0000,,Okay? Dialogue: 0,0:37:51.51,0:37:55.86,Default,,0000,0000,0000,,So write it out for the fully general, non-stationary case, but Dialogue: 0,0:37:55.86,0:37:59.42,Default,,0000,0000,0000,,you might just want to ignore many of the time subscripts and imagine the Dialogue: 0,0:37:59.42,0:38:02.10,Default,,0000,0000,0000,,stationary case for now. Dialogue: 0,0:38:07.42,0:38:09.94,Default,,0000,0000,0000,,Quite a bit later, Dialogue: 0,0:38:09.94,0:38:13.57,Default,,0000,0000,0000,,we're going to talk about an extension of this called differential dynamic programming that will Dialogue: 0,0:38:13.57,0:38:16.67,Default,,0000,0000,0000,,actually use a non-stationary Dialogue: 0,0:38:16.67,0:38:20.79,Default,,0000,0000,0000,,[inaudible] to a very powerful effect for a specific algorithm. But Dialogue: 0,0:38:20.79,0:38:25.58,Default,,0000,0000,0000,,for most of what we're about to do, just pretend that MDP is stationary. Dialogue: 0,0:38:25.58,0:38:32.58,Default,,0000,0000,0000,,Okay. Dialogue: 0,0:38:49.16,0:38:50.92,Default,,0000,0000,0000,,So before I Dialogue: 0,0:38:50.92,0:38:54.76,Default,,0000,0000,0000,,talk about models, let me jus say a couple of words about how you would go about Dialogue: 0,0:38:54.76,0:38:56.91,Default,,0000,0000,0000,,coming up with the linear models. The Dialogue: 0,0:38:56.91,0:39:00.63,Default,,0000,0000,0000,,key assumption in this model is that the dynamics are linear. 
There's also the assumption the Dialogue: 0,0:39:00.63,0:39:03.43,Default,,0000,0000,0000,,reward function is quadratic, but Dialogue: 0,0:39:03.43,0:39:06.85,Default,,0000,0000,0000,,let's talk about the assumption that the dynamics are linear. Dialogue: 0,0:39:06.85,0:39:13.79,Default,,0000,0000,0000,,ST plus one equals AST plus VAT. Maybe time varying, maybe Dialogue: 0,0:39:13.79,0:39:16.88,Default,,0000,0000,0000,,stationary. I'm just writing stationary for now. Dialogue: 0,0:39:16.88,0:39:19.53,Default,,0000,0000,0000,,So how do you get models like this? Dialogue: 0,0:39:19.53,0:39:23.17,Default,,0000,0000,0000,,We actually saw one example of this already in the previous lecture. If Dialogue: 0,0:39:23.17,0:39:26.73,Default,,0000,0000,0000,,you have an inverted pendulum system, and you want to model the Dialogue: 0,0:39:26.73,0:39:28.10,Default,,0000,0000,0000,,inverted pendulum Dialogue: 0,0:39:28.90,0:39:34.04,Default,,0000,0000,0000,,using a linear model like this, maybe [inaudible]. I'm not going to write that down. Dialogue: 0,0:39:34.04,0:39:37.85,Default,,0000,0000,0000,,One thing you could do is run your inverted pendulum, start it off in some state as Dialogue: 0,0:39:37.85,0:39:38.72,Default,,0000,0000,0000,,zero, Dialogue: 0,0:39:38.72,0:39:40.80,Default,,0000,0000,0000,,take some action, A0, have it Dialogue: 0,0:39:40.80,0:39:45.76,Default,,0000,0000,0000,,get to some state, S1. Take action A1 and so on, get to some state ST. Dialogue: 0,0:39:45.76,0:39:48.41,Default,,0000,0000,0000,,Our index is one Dialogue: 0,0:39:48.41,0:39:52.14,Default,,0000,0000,0000,,to denote that this is my first trial. Then you can repeat this a bunch of times. Dialogue: 0,0:39:52.14,0:39:55.17,Default,,0000,0000,0000,,You can repeat this N times. I'm Dialogue: 0,0:39:55.17,0:39:57.67,Default,,0000,0000,0000,,just executing actions on Dialogue: 0,0:39:57.67,0:39:59.52,Default,,0000,0000,0000,,your physical robot. Dialogue: 0,0:39:59.52,0:40:02.05,Default,,0000,0000,0000,,It could be a robot, it could be a Dialogue: 0,0:40:02.05,0:40:05.74,Default,,0000,0000,0000,,chemical plant. It could be whatever. Trying out different actions in Dialogue: 0,0:40:05.74,0:40:10.01,Default,,0000,0000,0000,,your system and watch Dialogue: 0,0:40:12.77,0:40:14.99,Default,,0000,0000,0000,,what states it gets to. So Dialogue: 0,0:40:14.99,0:40:20.62,Default,,0000,0000,0000,,for the linear model to your data, and choose the parameters A Dialogue: 0,0:40:20.62,0:40:25.66,Default,,0000,0000,0000,,and B, Dialogue: 0,0:40:25.66,0:40:27.94,Default,,0000,0000,0000,,that minimize Dialogue: 0,0:40:38.05,0:40:39.71,Default,,0000,0000,0000,,the quadratic Dialogue: 0,0:40:39.71,0:40:41.37,Default,,0000,0000,0000,,error term. Dialogue: 0,0:40:46.95,0:40:51.64,Default,,0000,0000,0000,,So this says how well does AST plus BAT Dialogue: 0,0:40:51.64,0:40:55.36,Default,,0000,0000,0000,,predict ST plus one. So you minimize the quadratic penalty term. This would be Dialogue: 0,0:40:55.36,0:40:57.72,Default,,0000,0000,0000,,one reasonable way to estimate Dialogue: 0,0:40:57.72,0:41:01.39,Default,,0000,0000,0000,,the parameters of a linear dynamical system for a Dialogue: 0,0:41:01.39,0:41:08.39,Default,,0000,0000,0000,,physical robot or a physical chemical part of whatever they may have. 
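As one concrete way to carry out this fitting step, here is a minimal least-squares sketch in Python; the data layout (lists of per-trial state and action arrays) is an assumed convention, not something specified in the lecture.

    import numpy as np

    def fit_linear_model(states, actions):
        # states: list of arrays of shape (T+1, n); actions: list of arrays of shape (T, d).
        # Chooses A (n x n) and B (n x d) minimizing
        #   sum_i sum_t || s_{t+1}^(i) - (A s_t^(i) + B a_t^(i)) ||^2 .
        X, Y = [], []
        for S, U in zip(states, actions):
            for t in range(len(U)):
                X.append(np.concatenate([S[t], U[t]]))   # regressor [s_t; a_t]
                Y.append(S[t + 1])                       # target s_{t+1}
        X, Y = np.array(X), np.array(Y)
        coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves X @ coeffs ~= Y
        M = coeffs.T                                     # M = [A  B], shape (n, n+d)
        n = states[0].shape[1]
        return M[:, :n], M[:, n:]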
Another Dialogue: 0,0:41:08.84,0:41:10.57,Default,,0000,0000,0000,,way to Dialogue: 0,0:41:10.57,0:41:13.13,Default,,0000,0000,0000,,come up with a linear Dialogue: 0,0:41:14.21,0:41:19.20,Default,,0000,0000,0000,,model Dialogue: 0,0:41:19.20,0:41:24.18,Default,,0000,0000,0000,,consistently, if I want to control, is to take a nonlinear model and to Dialogue: 0,0:41:24.18,0:41:29.92,Default,,0000,0000,0000,,linearize it. Let me show you what I mean by that. So you can linearize a Dialogue: 0,0:41:29.92,0:41:35.82,Default,,0000,0000,0000,,nonlinear model. Dialogue: 0,0:41:35.82,0:41:37.44,Default,,0000,0000,0000,,So Dialogue: 0,0:41:38.57,0:41:41.75,Default,,0000,0000,0000,,let's say you have some nonlinear model Dialogue: 0,0:41:41.75,0:41:45.29,Default,,0000,0000,0000,,that expresses ST plus one as some function of Dialogue: 0,0:41:45.29,0:41:46.22,Default,,0000,0000,0000,,ST Dialogue: 0,0:41:46.22,0:41:47.50,Default,,0000,0000,0000,,and Dialogue: 0,0:41:47.50,0:41:49.53,Default,,0000,0000,0000,,AT. Dialogue: 0,0:41:49.53,0:41:53.84,Default,,0000,0000,0000,,In the example in the previous lecture, I said for the inverted pendulum Dialogue: 0,0:41:56.58,0:41:57.49,Default,,0000,0000,0000,,[inaudible]. Dialogue: 0,0:41:57.49,0:42:01.44,Default,,0000,0000,0000,,By referring to the laws of physics. It was actually by downloading off the shelf Dialogue: 0,0:42:01.44,0:42:04.38,Default,,0000,0000,0000,,software for doing physics simulations. So if you haven't Dialogue: 0,0:42:04.38,0:42:08.14,Default,,0000,0000,0000,,seen [inaudible] before, you can go online. You can easily find Dialogue: 0,0:42:08.14,0:42:11.27,Default,,0000,0000,0000,,many open-source packages for simulating the Dialogue: 0,0:42:11.27,0:42:14.35,Default,,0000,0000,0000,,physics of simple devices like these. Dialogue: 0,0:42:14.35,0:42:18.43,Default,,0000,0000,0000,,Download the software, type in the specifications of your robot, and it will Dialogue: 0,0:42:18.43,0:42:21.98,Default,,0000,0000,0000,,simulate the physics that you use. There's lots of open-source software patches like that. You can just Dialogue: 0,0:42:21.98,0:42:23.83,Default,,0000,0000,0000,,download them. Dialogue: 0,0:42:23.83,0:42:27.06,Default,,0000,0000,0000,,But something like that, you can now build a physics simulator Dialogue: 0,0:42:27.06,0:42:31.73,Default,,0000,0000,0000,,that predicts the state as a function of the previous state and the previous action. So Dialogue: 0,0:42:31.73,0:42:34.03,Default,,0000,0000,0000,,you actually come up with some function that Dialogue: 0,0:42:34.03,0:42:39.55,Default,,0000,0000,0000,,says that Dialogue: 0,0:42:41.53,0:42:45.17,Default,,0000,0000,0000,,- the state [inaudible] next time. The [inaudible] vector Dialogue: 0,0:42:45.17,0:42:52.17,Default,,0000,0000,0000,,will be some function of the current state Dialogue: 0,0:42:53.35,0:42:54.63,Default,,0000,0000,0000,,and the current Dialogue: 0,0:42:54.63,0:42:58.02,Default,,0000,0000,0000,,action, where the action in this case is just a real number that says how Dialogue: 0,0:42:58.02,0:43:02.13,Default,,0000,0000,0000,,hard you accelerated to the left or right. Dialogue: 0,0:43:02.13,0:43:06.36,Default,,0000,0000,0000,,Then you can take this nonlinear model. I actually wrote down a sample of a Dialogue: 0,0:43:06.36,0:43:08.12,Default,,0000,0000,0000,,model in the last lecture, but Dialogue: 0,0:43:08.12,0:43:11.39,Default,,0000,0000,0000,,in general, F would be some nonlinear function. 
Instead of Dialogue: 0,0:43:11.39,0:43:12.77,Default,,0000,0000,0000,,a linear function. Dialogue: 0,0:43:12.77,0:43:16.92,Default,,0000,0000,0000,,So what I mean by linearize is the following. So here's just a cartoon. I'll Dialogue: 0,0:43:16.92,0:43:19.82,Default,,0000,0000,0000,,write down the math in a second. Let's say the horizontal axis is Dialogue: 0,0:43:19.82,0:43:24.03,Default,,0000,0000,0000,,the input state, ST, Dialogue: 0,0:43:24.03,0:43:27.68,Default,,0000,0000,0000,,and the output state, ST plus one, as I said. Dialogue: 0,0:43:27.68,0:43:31.78,Default,,0000,0000,0000,,Here's Dialogue: 0,0:43:31.78,0:43:34.71,Default,,0000,0000,0000,,the function F. Dialogue: 0,0:43:34.71,0:43:38.69,Default,,0000,0000,0000,,So the next state, ST plus one, will be some function of the previous state, Dialogue: 0,0:43:38.69,0:43:42.03,Default,,0000,0000,0000,,ST and the action Dialogue: 0,0:43:42.03,0:43:46.01,Default,,0000,0000,0000,,AT. So to linearize this model, what you would do is you would choose a point. Dialogue: 0,0:43:46.01,0:43:49.02,Default,,0000,0000,0000,,We'll call this S bar Dialogue: 0,0:43:49.02,0:43:54.14,Default,,0000,0000,0000,,T. Then you would take the derivative of this Dialogue: 0,0:43:54.14,0:43:56.47,Default,,0000,0000,0000,,function, and fit the Dialogue: 0,0:43:56.47,0:44:00.28,Default,,0000,0000,0000,,tangent straight line to that function. Dialogue: 0,0:44:00.28,0:44:04.45,Default,,0000,0000,0000,,So this allows you to express the next state, ST plus one. You Dialogue: 0,0:44:04.45,0:44:07.23,Default,,0000,0000,0000,,can approximate the next state, ST plus one, as Dialogue: 0,0:44:07.23,0:44:10.09,Default,,0000,0000,0000,,this linear function of the Dialogue: 0,0:44:10.09,0:44:13.100,Default,,0000,0000,0000,,previous state, ST. So Dialogue: 0,0:44:13.100,0:44:17.81,Default,,0000,0000,0000,,to make this cartoon really right, the horizontal axis here is really a state Dialogue: 0,0:44:17.81,0:44:20.11,Default,,0000,0000,0000,,action pair Dialogue: 0,0:44:20.11,0:44:24.65,Default,,0000,0000,0000,,you're linearizing around. So this is just a cartoon. The horizontal Dialogue: 0,0:44:24.65,0:44:29.70,Default,,0000,0000,0000,,axis represents the input state and the input action. Dialogue: 0,0:44:29.70,0:44:36.70,Default,,0000,0000,0000,,So Dialogue: 0,0:44:42.05,0:44:45.55,Default,,0000,0000,0000,,just to write this out in Dialogue: 0,0:44:45.55,0:44:49.53,Default,,0000,0000,0000,,math, I'll write out the simple case first and the fully general one Dialogue: 0,0:44:49.53,0:44:51.09,Default,,0000,0000,0000,,in a second. Suppose Dialogue: 0,0:44:51.09,0:44:54.71,Default,,0000,0000,0000,,the horizontal axis was only the state. So let's pretend there are no actions for now. Dialogue: 0,0:44:54.71,0:44:56.10,Default,,0000,0000,0000,,ST plus Dialogue: 0,0:44:56.10,0:45:00.31,Default,,0000,0000,0000,,one is just some function of ST, then that linear function I drew would be: Dialogue: 0,0:45:00.31,0:45:02.22,Default,,0000,0000,0000,,ST plus one Dialogue: 0,0:45:02.22,0:45:04.45,Default,,0000,0000,0000,,we're approximating Dialogue: 0,0:45:04.45,0:45:08.35,Default,,0000,0000,0000,,as F prime evaluated at some point S bar T, Dialogue: 0,0:45:08.35,0:45:12.19,Default,,0000,0000,0000,,times ST minus S bar T, Dialogue: 0,0:45:12.19,0:45:15.58,Default,,0000,0000,0000,,plus F of Dialogue: 0,0:45:15.58,0:45:17.83,Default,,0000,0000,0000,,S bar Dialogue: 0,0:45:17.83,0:45:21.88,Default,,0000,0000,0000,,T. 
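In symbols, the one-dimensional, no-action version of the linearization just described is

\[ s_{t+1} \;\approx\; f(\bar{s}_t) \;+\; f'(\bar{s}_t)\,\big(s_t - \bar{s}_t\big), \]

the tangent line to f at the chosen point s bar t.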
So with this, you'd express ST plus one as a linear Dialogue: 0,0:45:21.88,0:45:22.72,Default,,0000,0000,0000,,function Dialogue: 0,0:45:22.72,0:45:29.36,Default,,0000,0000,0000,,of ST. Just note that S bar T is a constant. Dialogue: 0,0:45:31.18,0:45:32.52,Default,,0000,0000,0000,,It's not Dialogue: 0,0:45:32.52,0:45:36.52,Default,,0000,0000,0000,,a variable. Does that make sense? S bar T is a constant. F prime of S bar T is gradient of the function F at the point Dialogue: 0,0:45:36.52,0:45:38.02,Default,,0000,0000,0000,,S bar T. This is Dialogue: 0,0:45:38.02,0:45:41.78,Default,,0000,0000,0000,,really just the equation of that linear function. So you Dialogue: 0,0:45:41.78,0:45:46.31,Default,,0000,0000,0000,,can then convert this to Dialogue: 0,0:45:51.81,0:45:55.40,Default,,0000,0000,0000,,A and B matrixes. Jumping back one board, I'm going Dialogue: 0,0:45:55.40,0:45:57.04,Default,,0000,0000,0000,,to point out one other thing. Dialogue: 0,0:45:57.04,0:45:57.83,Default,,0000,0000,0000,,Let's Dialogue: 0,0:45:57.83,0:45:59.93,Default,,0000,0000,0000,,say I look at this straight line, Dialogue: 0,0:45:59.93,0:46:01.16,Default,,0000,0000,0000,,and I ask Dialogue: 0,0:46:01.16,0:46:03.49,Default,,0000,0000,0000,,how well does this straight line Dialogue: 0,0:46:03.49,0:46:07.77,Default,,0000,0000,0000,,approximate my function F, my original simulator, my original function F. Dialogue: 0,0:46:07.77,0:46:10.63,Default,,0000,0000,0000,,Then you sort of notice that Dialogue: 0,0:46:10.63,0:46:12.44,Default,,0000,0000,0000,,in this neighborhood, Dialogue: 0,0:46:12.44,0:46:14.83,Default,,0000,0000,0000,,in the neighborhood of S bar, there's a Dialogue: 0,0:46:14.83,0:46:17.40,Default,,0000,0000,0000,,pretty good approximation. It's fairly close. Dialogue: 0,0:46:17.40,0:46:19.39,Default,,0000,0000,0000,,But then as you move further away, Dialogue: 0,0:46:19.39,0:46:24.42,Default,,0000,0000,0000,,moving far off to the left here, it becomes a pretty terrible approximation. Dialogue: 0,0:46:24.42,0:46:25.54,Default,,0000,0000,0000,,So Dialogue: 0,0:46:25.54,0:46:29.14,Default,,0000,0000,0000,,when you linearize Dialogue: 0,0:46:29.14,0:46:33.10,Default,,0000,0000,0000,,a nonlinear model to apply LQR, Dialogue: 0,0:46:33.10,0:46:36.79,Default,,0000,0000,0000,,one of the parameters you have to choose would be the point around which to Dialogue: 0,0:46:36.79,0:46:38.96,Default,,0000,0000,0000,,linearize your nonlinear model. Dialogue: 0,0:46:38.96,0:46:44.17,Default,,0000,0000,0000,,So if you expect your inverted pendulum system Dialogue: 0,0:46:44.17,0:46:48.51,Default,,0000,0000,0000,,to spend most of its time in the vicinity of this state, Dialogue: 0,0:46:48.51,0:46:49.41,Default,,0000,0000,0000,,then it'd Dialogue: 0,0:46:49.41,0:46:53.46,Default,,0000,0000,0000,,be reasonable to linearize around this state Dialogue: 0,0:46:53.46,0:46:54.83,Default,,0000,0000,0000,,because that means that Dialogue: 0,0:46:54.83,0:46:57.32,Default,,0000,0000,0000,,the linear approximation would be a good approximation, Dialogue: 0,0:46:57.32,0:47:02.49,Default,,0000,0000,0000,,usually, for the states that you expect [inaudible] to spend most of this time. Dialogue: 0,0:47:02.49,0:47:06.18,Default,,0000,0000,0000,,If conversely, you expect the system to spend most of its time Dialogue: 0,0:47:06.18,0:47:07.74,Default,,0000,0000,0000,,at states far to the left, then this would be Dialogue: 0,0:47:07.74,0:47:10.04,Default,,0000,0000,0000,,a terrible location to linearize. 
Dialogue: 0,0:47:10.04,0:47:14.27,Default,,0000,0000,0000,,So one rule of thumb is to choose the position to linearize according to where Dialogue: 0,0:47:14.27,0:47:17.42,Default,,0000,0000,0000,,you expect the system to spend most of its time Dialogue: 0,0:47:17.42,0:47:20.37,Default,,0000,0000,0000,,so that the linear approximation will tend to be Dialogue: 0,0:47:20.37,0:47:22.93,Default,,0000,0000,0000,,an accurate approximation in the Dialogue: 0,0:47:22.93,0:47:27.27,Default,,0000,0000,0000,,vicinity of the states [inaudible]. Dialogue: 0,0:47:27.27,0:47:28.77,Default,,0000,0000,0000,,Just to be fair, it is about Dialogue: 0,0:47:28.77,0:47:30.11,Default,,0000,0000,0000,,choosing the point, S bar, Dialogue: 0,0:47:30.11,0:47:31.64,Default,,0000,0000,0000,,A bar, Dialogue: 0,0:47:31.64,0:47:34.17,Default,,0000,0000,0000,,that we'll use Dialogue: 0,0:47:34.17,0:47:36.95,Default,,0000,0000,0000,,to come up with a linear function Dialogue: 0,0:47:36.95,0:47:40.60,Default,,0000,0000,0000,,that we'll pretend it's a good approximation to my original Dialogue: 0,0:47:40.60,0:47:47.08,Default,,0000,0000,0000,,nonlinear function, Dialogue: 0,0:47:55.27,0:48:00.59,Default,,0000,0000,0000,,F. So for an example like the inverted pendulum problem, this problem, if Dialogue: 0,0:48:00.59,0:48:03.74,Default,,0000,0000,0000,,you expect to do pretty well in this problem, Dialogue: 0,0:48:03.74,0:48:08.09,Default,,0000,0000,0000,,then you would expect the state Dialogue: 0,0:48:08.09,0:48:12.68,Default,,0000,0000,0000,,to often be near the zero state. Dialogue: 0,0:48:12.68,0:48:17.89,Default,,0000,0000,0000,,If S equals zero corresponds to X being the center of the railway track that the inverted Dialogue: 0,0:48:17.89,0:48:20.25,Default,,0000,0000,0000,,pendulum lives on. You Dialogue: 0,0:48:20.25,0:48:23.00,Default,,0000,0000,0000,,expect to do fairly well. You expect the pole to Dialogue: 0,0:48:23.00,0:48:24.69,Default,,0000,0000,0000,,mostly be upright Dialogue: 0,0:48:24.69,0:48:28.56,Default,,0000,0000,0000,,[inaudible] upright at zero degrees or 90 degrees, I guess. So you choose Dialogue: 0,0:48:28.56,0:48:32.91,Default,,0000,0000,0000,,whatever state corresponds to having the pole Dialogue: 0,0:48:32.91,0:48:33.96,Default,,0000,0000,0000,,upright. The zero Dialogue: 0,0:48:33.96,0:48:37.30,Default,,0000,0000,0000,,velocity [inaudible], near zero velocity in the middle of the track. Dialogue: 0,0:48:37.30,0:48:40.25,Default,,0000,0000,0000,,So you usually choose that as a state Dialogue: 0,0:48:40.25,0:48:42.58,Default,,0000,0000,0000,,to linearize Dialogue: 0,0:48:42.58,0:48:45.68,Default,,0000,0000,0000,,your inverted pendulum dynamics around. That's Dialogue: 0,0:48:45.68,0:48:52.68,Default,,0000,0000,0000,,a region where you might want your approximation to be good. Dialogue: 0,0:48:55.86,0:48:59.76,Default,,0000,0000,0000,,So I wrote this down. To come back to this formula, I wrote this down for the special Dialogue: 0,0:48:59.76,0:49:02.23,Default,,0000,0000,0000,,case of a one D state variable. If there Dialogue: 0,0:49:02.23,0:49:08.04,Default,,0000,0000,0000,,are no actions. The general formula for the linearization approximation is ST Dialogue: 0,0:49:08.95,0:49:11.87,Default,,0000,0000,0000,,plus one were approximate as F Dialogue: 0,0:49:11.87,0:49:14.56,Default,,0000,0000,0000,,of S bar T. 
Dialogue: 0,0:49:14.56,0:49:17.12,Default,,0000,0000,0000,,A bar T Dialogue: 0,0:49:17.12,0:49:24.12,Default,,0000,0000,0000,,plus Dialogue: 0,0:49:50.46,0:49:54.23,Default,,0000,0000,0000,,- Okay? Dialogue: 0,0:49:54.23,0:49:55.24,Default,,0000,0000,0000,,Where Dialogue: 0,0:49:55.24,0:49:59.67,Default,,0000,0000,0000,,these upside down triangles are an unusual Dialogue: 0,0:49:59.67,0:50:02.74,Default,,0000,0000,0000,,symbol for taking the Dialogue: 0,0:50:02.74,0:50:07.29,Default,,0000,0000,0000,,derivative of F with respect to [inaudible] vector value, second argument. So Dialogue: 0,0:50:07.29,0:50:12.17,Default,,0000,0000,0000,,by choosing an appropriate state as bar T, A bar T to linearize Dialogue: 0,0:50:12.17,0:50:12.95,Default,,0000,0000,0000,,around, Dialogue: 0,0:50:12.95,0:50:14.01,Default,,0000,0000,0000,,you'd Dialogue: 0,0:50:14.01,0:50:17.92,Default,,0000,0000,0000,,now express ST plus one Dialogue: 0,0:50:17.92,0:50:22.04,Default,,0000,0000,0000,,as a linear function of the current state and Dialogue: 0,0:50:22.04,0:50:23.28,Default,,0000,0000,0000,,the current Dialogue: 0,0:50:23.28,0:50:24.28,Default,,0000,0000,0000,,action, AT. Again, Dialogue: 0,0:50:24.28,0:50:29.42,Default,,0000,0000,0000,,these things, S bar T, is a constant that you choose ahead of time. Dialogue: 0,0:50:29.42,0:50:32.41,Default,,0000,0000,0000,,Same for A bar Dialogue: 0,0:50:34.96,0:50:40.11,Default,,0000,0000,0000,,T. Lastly, having linearized this thing, you can then convert it Dialogue: 0,0:50:40.11,0:50:43.26,Default,,0000,0000,0000,,to matrices Dialogue: 0,0:50:43.26,0:50:46.51,Default,,0000,0000,0000,,like Dialogue: 0,0:50:46.51,0:50:48.85,Default,,0000,0000,0000,,that. Dialogue: 0,0:50:48.85,0:50:55.46,Default,,0000,0000,0000,,So the ST plus one is now a linear function of ST and Dialogue: 0,0:50:55.46,0:50:56.79,Default,,0000,0000,0000,,AT. Dialogue: 0,0:50:56.79,0:51:03.79,Default,,0000,0000,0000,,Questions about this? Dialogue: 0,0:51:10.76,0:51:16.02,Default,,0000,0000,0000,,So just one tiny detail, and it's really not a huge deal, is that this thing Dialogue: 0,0:51:16.02,0:51:20.18,Default,,0000,0000,0000,,below is technically an [inaudible] function. Dialogue: 0,0:51:20.18,0:51:23.14,Default,,0000,0000,0000,,There might actually be an extra constant there, but Dialogue: 0,0:51:23.14,0:51:27.20,Default,,0000,0000,0000,,this is not a difficult generalization of a linear dynamical system definition. Dialogue: 0,0:51:28.13,0:51:32.32,Default,,0000,0000,0000,,One way to deal with that constant is actually to do something like take your Dialogue: 0,0:51:32.32,0:51:36.25,Default,,0000,0000,0000,,definition for this state, let's say XX dot theta theta dot. Dialogue: 0,0:51:36.25,0:51:38.86,Default,,0000,0000,0000,,You can then augment your state vector to Dialogue: 0,0:51:38.86,0:51:40.96,Default,,0000,0000,0000,,have an extra interceptor, one. Dialogue: 0,0:51:40.96,0:51:46.07,Default,,0000,0000,0000,,With the interceptor one and working out the A matrix, you can then Dialogue: 0,0:51:46.07,0:51:49.17,Default,,0000,0000,0000,,take care of the extra constant, C, as well. So you can Dialogue: 0,0:51:49.17,0:51:50.68,Default,,0000,0000,0000,,deal with this thing being - Dialogue: 0,0:51:50.68,0:51:54.81,Default,,0000,0000,0000,,technically it's an affine function because of this extra offset, rather than a linear Dialogue: 0,0:51:54.81,0:51:55.57,Default,,0000,0000,0000,,function. 
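Collecting the pieces, the general linearization (here the upside-down-triangle symbols denote the Jacobians of the vector-valued f with respect to the state and the action) reads

\[ s_{t+1} \;\approx\; f(\bar{s}_t, \bar{a}_t) \;+\; \nabla_s f(\bar{s}_t, \bar{a}_t)\,\big(s_t - \bar{s}_t\big) \;+\; \nabla_a f(\bar{s}_t, \bar{a}_t)\,\big(a_t - \bar{a}_t\big), \]

and identifying the coefficients of s_t and a_t gives the matrices A and B, with the leftover constant being exactly the affine offset that the intercept trick just mentioned absorbs.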
Dialogue: 0,0:51:55.57,0:52:01.81,Default,,0000,0000,0000,,But this is just a little bit of bookkeeping [inaudible] for yourself and shouldn't Dialogue: 0,0:52:01.81,0:52:02.83,Default,,0000,0000,0000,,be Dialogue: 0,0:52:02.83,0:52:04.69,Default,,0000,0000,0000,,a huge Dialogue: 0,0:52:06.46,0:52:13.46,Default,,0000,0000,0000,,deal. Dialogue: 0,0:52:18.06,0:52:19.45,Default,,0000,0000,0000,,So to summarize, you see I Dialogue: 0,0:52:19.45,0:52:20.73,Default,,0000,0000,0000,,have this up, Dialogue: 0,0:52:20.73,0:52:24.70,Default,,0000,0000,0000,,you can learn a model, you can take a nonlinear model. Your nonlinear Dialogue: 0,0:52:24.70,0:52:28.79,Default,,0000,0000,0000,,model can be a physics model or a nonlinear model you learned and linearize it. Dialogue: 0,0:52:29.56,0:52:32.30,Default,,0000,0000,0000,,Now I'll post an LQR problem Dialogue: 0,0:52:32.30,0:52:35.34,Default,,0000,0000,0000,,in which we have Dialogue: 0,0:52:35.34,0:52:41.69,Default,,0000,0000,0000,,specification of the MDP in which the states are in RN, the actions are in RD, Dialogue: 0,0:52:41.69,0:52:46.67,Default,,0000,0000,0000,,and the state has zero probabilities given by Dialogue: 0,0:52:46.67,0:52:50.24,Default,,0000,0000,0000,,the [inaudible] linear equation. SD plus one equals Dialogue: 0,0:52:50.24,0:52:53.85,Default,,0000,0000,0000,,ATST plus BTAT. Our rewards are going to be Dialogue: 0,0:52:53.85,0:52:56.99,Default,,0000,0000,0000,,these quadratic functions. Dialogue: 0,0:52:56.99,0:53:01.90,Default,,0000,0000,0000,,So the specification of the MDP means that we know the A matrixes, Dialogue: 0,0:53:01.90,0:53:06.09,Default,,0000,0000,0000,,the B matrixes, the U matrixes Dialogue: 0,0:53:06.09,0:53:08.93,Default,,0000,0000,0000,,and the V matrixes. Our goal is to come up with a policy Dialogue: 0,0:53:08.93,0:53:13.79,Default,,0000,0000,0000,,to maximize our finite Dialogue: 0,0:53:13.79,0:53:19.05,Default,,0000,0000,0000,,horizon sum of rewards. Dialogue: 0,0:53:20.03,0:53:23.15,Default,,0000,0000,0000,,So our goal is to come up with a policy, Dialogue: 0,0:53:23.15,0:53:25.80,Default,,0000,0000,0000,,first, to maximize Dialogue: 0,0:53:25.80,0:53:29.06,Default,,0000,0000,0000,,the expected value of this Dialogue: 0,0:53:29.06,0:53:34.28,Default,,0000,0000,0000,,finite horizon sum of Dialogue: 0,0:53:39.01,0:53:46.01,Default,,0000,0000,0000,,rewards. Okay. Dialogue: 0,0:53:48.22,0:53:49.39,Default,,0000,0000,0000,,So our Dialogue: 0,0:53:49.39,0:53:53.55,Default,,0000,0000,0000,,approach to solving this problem Dialogue: 0,0:53:53.55,0:53:56.90,Default,,0000,0000,0000,,will be exactly that Dialogue: 0,0:53:56.90,0:54:00.02,Default,,0000,0000,0000,,finite horizon dynamic programming algorithm that we worked out a Dialogue: 0,0:54:00.02,0:54:01.77,Default,,0000,0000,0000,,little earlier in this Dialogue: 0,0:54:01.77,0:54:03.58,Default,,0000,0000,0000,,lecture. In particular, Dialogue: 0,0:54:03.58,0:54:06.57,Default,,0000,0000,0000,,my strategy for finding the optimal policy Dialogue: 0,0:54:06.57,0:54:08.88,Default,,0000,0000,0000,,will be to Dialogue: 0,0:54:08.88,0:54:12.36,Default,,0000,0000,0000,,first find V star of Dialogue: 0,0:54:12.36,0:54:14.22,Default,,0000,0000,0000,,T, the capital T, Dialogue: 0,0:54:14.22,0:54:18.56,Default,,0000,0000,0000,,and then I'll apply by a recursion to find V star of T minus one, V star of T minus Dialogue: 0,0:54:18.56,0:54:22.99,Default,,0000,0000,0000,,two and so on. 
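Stated compactly, the problem just posed is to choose the policy so as to maximize

\[ \mathbb{E}\left[ \sum_{t=0}^{T} R^{(t)}(s_t, a_t) \right] \;=\; \mathbb{E}\left[ -\sum_{t=0}^{T} \big( s_t^\top U_t\, s_t + a_t^\top V_t\, a_t \big) \right], \qquad s_{t+1} = A_t s_t + B_t a_t + w_t, \]

and the plan is to compute V star of capital T first and then recurse backwards down to V star of zero.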
Dialogue: 0,0:54:22.99,0:54:27.62,Default,,0000,0000,0000,,In the dynamic programming algorithm we worked out, V star subscript T of the Dialogue: 0,0:54:27.62,0:54:29.21,Default,,0000,0000,0000,,state ST, this is the maximum Dialogue: 0,0:54:29.21,0:54:31.28,Default,,0000,0000,0000,,[inaudible] Dialogue: 0,0:54:31.28,0:54:35.44,Default,,0000,0000,0000,,actions you might take at that time of R Dialogue: 0,0:54:35.44,0:54:39.30,Default,,0000,0000,0000,,of STAT. Dialogue: 0,0:54:39.30,0:54:42.80,Default,,0000,0000,0000,,Again, just for the sake of understanding this material, you can probably Dialogue: 0,0:54:42.80,0:54:47.17,Default,,0000,0000,0000,,pretend the rewards and the dynamics are actually stationary. I'll write out all Dialogue: 0,0:54:47.17,0:54:53.25,Default,,0000,0000,0000,,these superscripts all the time [inaudible] if you're reading this for the first time. Dialogue: 0,0:54:53.25,0:54:57.34,Default,,0000,0000,0000,,The reward is equal to max of Dialogue: 0,0:54:57.34,0:55:04.34,Default,,0000,0000,0000,,AT of minus - Dialogue: 0,0:55:11.83,0:55:16.10,Default,,0000,0000,0000,,right? I hope this isn't confusing. The superscript Ts denote transposes. Dialogue: 0,0:55:16.10,0:55:19.85,Default,,0000,0000,0000,,The lowercase Ts denote the time index capital T. Dialogue: 0,0:55:19.85,0:55:23.77,Default,,0000,0000,0000,,So that's just a definition of my next quadratic awards. So this Dialogue: 0,0:55:23.77,0:55:27.86,Default,,0000,0000,0000,,is clearly maximized as minus Dialogue: 0,0:55:29.10,0:55:31.27,Default,,0000,0000,0000,,ST transpose Dialogue: 0,0:55:31.27,0:55:37.89,Default,,0000,0000,0000,,UTST Dialogue: 0,0:55:37.89,0:55:39.95,Default,,0000,0000,0000,,because Dialogue: 0,0:55:39.95,0:55:43.91,Default,,0000,0000,0000,,that last term is - this is greater than or equal to zero. That gives Dialogue: 0,0:55:43.91,0:55:47.76,Default,,0000,0000,0000,,me my assumption that VT is [inaudible] semi-definite. So the best action to take in Dialogue: 0,0:55:47.76,0:55:49.40,Default,,0000,0000,0000,,the last time step Dialogue: 0,0:55:49.40,0:55:52.64,Default,,0000,0000,0000,,is just the action Dialogue: 0,0:55:52.64,0:55:53.77,Default,,0000,0000,0000,,zero. So Dialogue: 0,0:55:53.77,0:56:00.77,Default,,0000,0000,0000,,pi star subscript T of ST is equal to the Dialogue: 0,0:56:02.32,0:56:04.37,Default,,0000,0000,0000,,[inaudible] of actions of Dialogue: 0,0:56:04.37,0:56:10.97,Default,,0000,0000,0000,,that same thing. It's just Dialogue: 0,0:56:10.97,0:56:12.47,Default,,0000,0000,0000,,zero. It's by Dialogue: 0,0:56:12.47,0:56:15.61,Default,,0000,0000,0000,,choosing the zero action, Dialogue: 0,0:56:15.61,0:56:18.91,Default,,0000,0000,0000,,AT transpose VTAT becomes zero, and that's Dialogue: 0,0:56:18.91,0:56:38.83,Default,,0000,0000,0000,,how this reward is maximized. Dialogue: 0,0:56:50.26,0:56:57.17,Default,,0000,0000,0000,,Any questions, or is something illegible? Okay. Dialogue: 0,0:57:00.39,0:57:04.33,Default,,0000,0000,0000,,So now let's do the dynamic programming step where Dialogue: 0,0:57:04.33,0:57:09.68,Default,,0000,0000,0000,,my goal is given Dialogue: 0,0:57:09.68,0:57:11.74,Default,,0000,0000,0000,,VT plus one, Dialogue: 0,0:57:11.74,0:57:15.80,Default,,0000,0000,0000,,I want to compute VT. Given V star T plus one, I want to compute Dialogue: 0,0:57:15.80,0:57:20.41,Default,,0000,0000,0000,,V star of T. So this is the dynamic programming step. 
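To restate the base case just worked out, in symbols:

\[ V^*_T(s_T) \;=\; \max_{a_T}\; -\big( s_T^\top U_T\, s_T + a_T^\top V_T\, a_T \big) \;=\; -\,s_T^\top U_T\, s_T, \qquad \pi^*_T(s_T) = 0, \]

since the a_T transpose V_T a_T term is nonnegative and is driven to zero by choosing the zero action.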
Dialogue: 0,0:57:21.63,0:57:25.24,Default,,0000,0000,0000,,So the Dialogue: 0,0:57:25.24,0:57:29.50,Default,,0000,0000,0000,,DP steps I wrote down previously was this. So for the finite state case, I Dialogue: 0,0:57:29.50,0:57:36.50,Default,,0000,0000,0000,,wrote down the following. Dialogue: 0,0:58:32.29,0:58:35.77,Default,,0000,0000,0000,,So this is exactly the equation I wrote down Dialogue: 0,0:58:35.77,0:58:37.50,Default,,0000,0000,0000,,previously, Dialogue: 0,0:58:37.50,0:58:41.49,Default,,0000,0000,0000,,and this is what I wrote down for finite states, where you have these discreet state transition Dialogue: 0,0:58:41.49,0:58:43.68,Default,,0000,0000,0000,,probabilities, and we can sum over Dialogue: 0,0:58:43.68,0:58:48.04,Default,,0000,0000,0000,,this discreet set of states. Now we're going to continue as an infinite state Dialogue: 0,0:58:48.04,0:58:51.85,Default,,0000,0000,0000,,again, so this sum over state should actually become an integral. I'm going to Dialogue: 0,0:58:51.85,0:58:55.94,Default,,0000,0000,0000,,actually skip the integral step. We'll just go ahead and write this last term here as an Dialogue: 0,0:58:55.94,0:58:56.83,Default,,0000,0000,0000,,expectation. Dialogue: 0,0:58:56.83,0:59:03.83,Default,,0000,0000,0000,,So this is going to be max over actions AT Dialogue: 0,0:59:04.31,0:59:06.88,Default,,0000,0000,0000,,plus - and then this becomes and expectation Dialogue: 0,0:59:06.88,0:59:10.09,Default,,0000,0000,0000,,over the random mixed state, ST plus one, Dialogue: 0,0:59:10.09,0:59:13.50,Default,,0000,0000,0000,,[inaudible] from state transition probabilities given by P of STAT of V star T Dialogue: 0,0:59:13.50,0:59:16.43,Default,,0000,0000,0000,,plus one, ST Dialogue: 0,0:59:16.43,0:59:19.08,Default,,0000,0000,0000,,plus one. So this Dialogue: 0,0:59:19.08,0:59:21.31,Default,,0000,0000,0000,,is Dialogue: 0,0:59:21.31,0:59:23.72,Default,,0000,0000,0000,,the same equation written down Dialogue: 0,0:59:23.72,0:59:25.70,Default,,0000,0000,0000,,as an expectation. Dialogue: 0,0:59:26.88,0:59:31.52,Default,,0000,0000,0000,,So what I need to do is given a representation of V star T plus one, I Dialogue: 0,0:59:31.52,0:59:32.79,Default,,0000,0000,0000,,need to find V Dialogue: 0,0:59:32.79,0:59:36.43,Default,,0000,0000,0000,,star of T. Dialogue: 0,0:59:36.43,0:59:40.82,Default,,0000,0000,0000,,So it turns out that LQR has the following useful property. It turns out that each of Dialogue: 0,0:59:40.82,0:59:42.84,Default,,0000,0000,0000,,these value functions Dialogue: 0,0:59:42.84,0:59:46.13,Default,,0000,0000,0000,,can be represented as a quadratic function. Dialogue: 0,0:59:46.13,0:59:48.66,Default,,0000,0000,0000,,So concretely, Dialogue: 0,0:59:48.66,0:59:53.26,Default,,0000,0000,0000,,let's suppose that V Dialogue: 0,0:59:53.26,1:00:00.26,Default,,0000,0000,0000,,star T plus one - suppose that this can be expressed as a Dialogue: 0,1:00:00.43,1:00:07.43,Default,,0000,0000,0000,,quadratic Dialogue: 0,1:00:09.37,1:00:10.84,Default,,0000,0000,0000,,function, written like so, Dialogue: 0,1:00:10.84,1:00:11.86,Default,,0000,0000,0000,,where Dialogue: 0,1:00:11.86,1:00:14.88,Default,,0000,0000,0000,,the matrix phi T plus one Dialogue: 0,1:00:14.88,1:00:17.05,Default,,0000,0000,0000,,is an N by N matrix, and Dialogue: 0,1:00:17.05,1:00:21.34,Default,,0000,0000,0000,,psi T plus one Dialogue: 0,1:00:21.34,1:00:23.60,Default,,0000,0000,0000,,is just a real number. 
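The dynamic programming step being used here, written with the expectation, is

\[ V^*_t(s_t) \;=\; \max_{a_t}\; \Big[\, R^{(t)}(s_t, a_t) \;+\; \mathbb{E}_{\,s_{t+1} \sim P_{s_t a_t}}\big[ V^*_{t+1}(s_{t+1}) \big] \Big], \]

and the assumption being introduced is that the next value function is quadratic, V star T plus one of s equal to s transpose Phi_{t+1} s plus Psi_{t+1}, with Phi_{t+1} an N by N matrix and Psi_{t+1} a real number.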
Dialogue: 0,1:00:23.60,1:00:29.16,Default,,0000,0000,0000,,So in other words, suppose V star T plus one is just a quadratic function Dialogue: 0,1:00:29.97,1:00:33.81,Default,,0000,0000,0000,,of the state ST Dialogue: 0,1:00:33.81,1:00:36.68,Default,,0000,0000,0000,,plus one. Dialogue: 0,1:00:36.68,1:00:40.17,Default,,0000,0000,0000,,We can then show that Dialogue: 0,1:00:40.17,1:00:44.17,Default,,0000,0000,0000,,when you do one dynamic programming step - when you Dialogue: 0,1:00:46.09,1:00:49.57,Default,,0000,0000,0000,,plug this definition of V star T plus one into your dynamic Dialogue: 0,1:00:49.57,1:00:51.88,Default,,0000,0000,0000,,programming step in the equation I had just now, Dialogue: 0,1:00:51.88,1:00:54.56,Default,,0000,0000,0000,,you can show that you would get Dialogue: 0,1:00:54.56,1:00:57.90,Default,,0000,0000,0000,,that V star T as well, will Dialogue: 0,1:00:57.90,1:01:01.75,Default,,0000,0000,0000,,also be a quadratic function of the Dialogue: 0,1:01:06.28,1:01:12.64,Default,,0000,0000,0000,,same form. [Inaudible] here, right? The sum-appropriate matrix, phi T Dialogue: 0,1:01:12.64,1:01:19.61,Default,,0000,0000,0000,,and sum appropriate real number, psi Dialogue: 0,1:01:19.61,1:01:21.84,Default,,0000,0000,0000,,of Dialogue: 0,1:01:21.84,1:01:27.82,Default,,0000,0000,0000,,T. So what you can do is stall off the recursion with - well, does that make Dialogue: 0,1:01:27.82,1:01:32.10,Default,,0000,0000,0000,,sense? Dialogue: 0,1:01:32.10,1:01:35.01,Default,,0000,0000,0000,,So what you can do is Dialogue: 0,1:01:35.01,1:01:36.77,Default,,0000,0000,0000,,stall off the recursion Dialogue: 0,1:01:36.77,1:01:41.14,Default,,0000,0000,0000,,as follows. So previously, we worked out that V star capital T, Dialogue: 0,1:01:41.14,1:01:44.22,Default,,0000,0000,0000,,we said that this is minus Dialogue: 0,1:01:44.22,1:01:47.54,Default,,0000,0000,0000,,ST Dialogue: 0,1:01:47.54,1:01:50.01,Default,,0000,0000,0000,,transpose UTST. So we Dialogue: 0,1:01:50.01,1:01:50.76,Default,,0000,0000,0000,,have that phi Dialogue: 0,1:01:50.76,1:01:53.16,Default,,0000,0000,0000,,of capital T is Dialogue: 0,1:01:53.16,1:01:56.33,Default,,0000,0000,0000,,equal to minus UT, Dialogue: 0,1:01:56.33,1:01:59.31,Default,,0000,0000,0000,,and psi of capital T is equal to zero. Dialogue: 0,1:01:59.31,1:02:00.70,Default,,0000,0000,0000,,Now V star T of ST Dialogue: 0,1:02:00.70,1:02:07.70,Default,,0000,0000,0000,,is equal to ST transpose phi of T, ST plus psi of T. Dialogue: 0,1:02:08.42,1:02:09.05,Default,,0000,0000,0000,,So Dialogue: 0,1:02:09.05,1:02:13.45,Default,,0000,0000,0000,,you can start out the recursion this way with phi of T equals minus UT and psi of T Dialogue: 0,1:02:13.45,1:02:14.94,Default,,0000,0000,0000,,equals zero. Dialogue: 0,1:02:14.94,1:02:16.85,Default,,0000,0000,0000,,Then work out Dialogue: 0,1:02:16.85,1:02:20.22,Default,,0000,0000,0000,,what the recursion is. I won't Dialogue: 0,1:02:20.22,1:02:22.82,Default,,0000,0000,0000,, Dialogue: 0,1:02:22.82,1:02:27.12,Default,,0000,0000,0000,,actually do the full [inaudible]. This may be algebra, and Dialogue: 0,1:02:28.81,1:02:32.94,Default,,0000,0000,0000,,you've actually done this sort of Gaussian expectation Dialogue: 0,1:02:32.94,1:02:39.56,Default,,0000,0000,0000,,math a lot in your homework by now. Dialogue: 0,1:02:39.56,1:02:44.01,Default,,0000,0000,0000,,So I won't do the full derivation. I'll just outline the Dialogue: 0,1:02:45.04,1:02:46.87,Default,,0000,0000,0000,,one-ish G step. 
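So the inductive claim is that every value function in this problem has the form

\[ V^*_t(s_t) \;=\; s_t^\top \Phi_t\, s_t \;+\; \Psi_t, \qquad \text{with } \Phi_T = -U_T,\; \Psi_T = 0, \]

and the dynamic programming step is what carries the pair Phi_{t+1}, Psi_{t+1} back to Phi_t, Psi_t.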
So Dialogue: 0,1:02:46.87,1:02:51.03,Default,,0000,0000,0000,,in the dynamic programming step, V star ST is Dialogue: 0,1:02:51.03,1:02:53.24,Default,,0000,0000,0000,,equal to Dialogue: 0,1:02:53.24,1:02:55.68,Default,,0000,0000,0000,,max over actions AT of Dialogue: 0,1:02:55.68,1:03:02.68,Default,,0000,0000,0000,,the immediate reward. Dialogue: 0,1:03:03.62,1:03:05.81,Default,,0000,0000,0000,,So this was R of Dialogue: 0,1:03:05.81,1:03:09.48,Default,,0000,0000,0000,,SA from my equation in the dynamic programming step. Dialogue: 0,1:03:09.48,1:03:12.10,Default,,0000,0000,0000,,Then plus an expected value Dialogue: 0,1:03:12.10,1:03:18.28,Default,,0000,0000,0000,,over the random next state, ST plus one, drawn from the Dialogue: 0,1:03:18.28,1:03:22.08,Default,,0000,0000,0000,,Gaussian distribution with mean Dialogue: 0,1:03:22.08,1:03:23.89,Default,,0000,0000,0000,,ATST plus Dialogue: 0,1:03:23.89,1:03:28.17,Default,,0000,0000,0000,,BTAT and covariance sigma Dialogue: 0,1:03:28.17,1:03:31.89,Default,,0000,0000,0000,,W. So what this is, this is really my Dialogue: 0,1:03:31.89,1:03:35.03,Default,,0000,0000,0000,,specification for P of STAT. This is my Dialogue: 0,1:03:35.03,1:03:38.75,Default,,0000,0000,0000,,state transition distribution in the LQR setting. This is my Dialogue: 0,1:03:38.75,1:03:40.81,Default,,0000,0000,0000,,state transition distribution Dialogue: 0,1:03:40.81,1:03:43.37,Default,,0000,0000,0000,,when I take action AT Dialogue: 0,1:03:43.37,1:03:44.67,Default,,0000,0000,0000,,in Dialogue: 0,1:03:44.67,1:03:46.18,Default,,0000,0000,0000,,the state Dialogue: 0,1:03:46.18,1:03:50.32,Default,,0000,0000,0000,,ST. Then my next state is distributed Gaussian with mean ATST plus BTAT and Dialogue: 0,1:03:50.32,1:03:50.98,Default,,0000,0000,0000,,covariance Dialogue: 0,1:03:50.98,1:03:54.43,Default,,0000,0000,0000,,sigma W. Then Dialogue: 0,1:03:54.43,1:04:01.43,Default,,0000,0000,0000,,the value of this Dialogue: 0,1:04:10.67,1:04:13.72,Default,,0000,0000,0000,,state. This, of course, is just V star Dialogue: 0,1:04:13.72,1:04:17.11,Default,,0000,0000,0000,,T plus one Dialogue: 0,1:04:17.11,1:04:23.55,Default,,0000,0000,0000,,of ST plus one. I hope Dialogue: 0,1:04:23.55,1:04:27.46,Default,,0000,0000,0000,,this makes sense. This is just taking that equation I had previously in the Dialogue: 0,1:04:27.46,1:04:30.31,Default,,0000,0000,0000,,dynamic programming step. So the V star of T, ST Dialogue: 0,1:04:30.31,1:04:33.28,Default,,0000,0000,0000,,equals max over actions Dialogue: 0,1:04:33.28,1:04:35.68,Default,,0000,0000,0000,,of the immediate rewards Dialogue: 0,1:04:35.68,1:04:38.74,Default,,0000,0000,0000,,plus an expected value over the next state of Dialogue: 0,1:04:38.74,1:04:40.64,Default,,0000,0000,0000,,V star of the next state with Dialogue: 0,1:04:40.64,1:04:43.25,Default,,0000,0000,0000,,the clock advanced by one. Dialogue: 0,1:04:43.25,1:04:46.18,Default,,0000,0000,0000,,So I've just plugged in all the definitions of the reward, of the Dialogue: 0,1:04:46.18,1:04:47.89,Default,,0000,0000,0000,,state transition distribution Dialogue: 0,1:04:47.89,1:04:54.89,Default,,0000,0000,0000,,and of the value function. Dialogue: 0,1:04:55.91,1:05:01.73,Default,,0000,0000,0000,,Actually, could you raise your hand if this makes sense? Cool. 
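The expansion of that expectation leans on the standard Gaussian identity

\[ \mathbb{E}_{\,s' \sim \mathcal{N}(\mu, \Sigma)}\big[\, s'^\top \Phi\, s' \,\big] \;=\; \mu^\top \Phi\, \mu \;+\; \operatorname{tr}\!\big( \Phi\, \Sigma \big), \qquad \text{here } \mu = A_t s_t + B_t a_t,\ \Sigma = \Sigma_w, \]

which is also why the noise only ever contributes a constant, through the trace term, and never affects the choice of action.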
So if you write Dialogue: 0,1:05:01.73,1:05:03.43,Default,,0000,0000,0000,,this out Dialogue: 0,1:05:03.43,1:05:07.63,Default,,0000,0000,0000,,and you expand the expectation - I know you've done this many times, so I Dialogue: 0,1:05:07.63,1:05:08.98,Default,,0000,0000,0000,,won't do it - Dialogue: 0,1:05:08.98,1:05:11.47,Default,,0000,0000,0000,,this whole thing on the right-hand side Dialogue: 0,1:05:11.47,1:05:15.14,Default,,0000,0000,0000,,simplifies to a big quadratic function of the action, AT. Dialogue: 0,1:05:15.14,1:05:17.84,Default,,0000,0000,0000,,So this whole Dialogue: 0,1:05:17.84,1:05:22.84,Default,,0000,0000,0000,,thing simplifies to a big Dialogue: 0,1:05:22.84,1:05:26.40,Default,,0000,0000,0000,,quadratic function Dialogue: 0,1:05:32.45,1:05:35.91,Default,,0000,0000,0000,,of the action Dialogue: 0,1:05:35.91,1:05:37.86,Default,,0000,0000,0000,,AT. Dialogue: 0,1:05:37.86,1:05:42.49,Default,,0000,0000,0000,,We want to maximize this with respect to the actions AT. So to Dialogue: 0,1:05:42.49,1:05:44.34,Default,,0000,0000,0000,,maximize a big quadratic function, you Dialogue: 0,1:05:44.34,1:05:48.15,Default,,0000,0000,0000,,just take the derivatives of the functions with respect to the action AT, set the derivative equal Dialogue: 0,1:05:48.15,1:05:52.04,Default,,0000,0000,0000,,to zero, and then you've maximized the right-hand side, with Dialogue: 0,1:05:52.04,1:05:55.56,Default,,0000,0000,0000,,respect to the action, Dialogue: 0,1:05:55.56,1:05:59.71,Default,,0000,0000,0000,,AT. It turns out - I'm just going to write this expression down for completeness. You Dialogue: 0,1:05:59.71,1:06:02.21,Default,,0000,0000,0000,,can derive it yourself at any time. It turns Dialogue: 0,1:06:02.21,1:06:05.24,Default,,0000,0000,0000,,out if you actually maximize that thing on the right- hand side as a Dialogue: 0,1:06:05.24,1:06:07.30,Default,,0000,0000,0000,,function of the actions, AT, Dialogue: 0,1:06:07.30,1:06:12.47,Default,,0000,0000,0000,,you find that [inaudible] AT is going to be Dialogue: 0,1:06:12.47,1:06:14.43,Default,,0000,0000,0000,,that Dialogue: 0,1:06:14.43,1:06:21.43,Default,,0000,0000,0000,,times Dialogue: 0,1:06:23.32,1:06:25.93,Default,,0000,0000,0000,,ST. Dialogue: 0,1:06:25.93,1:06:29.39,Default,,0000,0000,0000,,Don't Dialogue: 0,1:06:29.39,1:06:34.20,Default,,0000,0000,0000,,worry about this expression. You can get it from [inaudible] and derive it yourself. Dialogue: 0,1:06:34.20,1:06:37.84,Default,,0000,0000,0000,,But the key thing to note is that the optimal action, AT, is going to be some big Dialogue: 0,1:06:37.84,1:06:38.95,Default,,0000,0000,0000,,matrix. Dialogue: 0,1:06:38.95,1:06:40.41,Default,,0000,0000,0000,,We're going to call this thing Dialogue: 0,1:06:40.41,1:06:43.30,Default,,0000,0000,0000,,LT Dialogue: 0,1:06:43.30,1:06:47.26,Default,,0000,0000,0000,,times Dialogue: 0,1:06:49.01,1:06:51.46,Default,,0000,0000,0000,,ST. In other words, the optimal action to take in this Dialogue: 0,1:06:51.46,1:06:53.33,Default,,0000,0000,0000,,given state is going to be Dialogue: 0,1:06:53.33,1:06:55.25,Default,,0000,0000,0000,,some linear function Dialogue: 0,1:06:55.25,1:06:59.29,Default,,0000,0000,0000,,of the state, ST. 
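For completeness, carrying out that derivative-and-set-to-zero step under the sign conventions above gives, as a derivation sketch rather than a quoted board formula,

\[ a_t \;=\; \underbrace{\big( V_t - B_t^\top \Phi_{t+1} B_t \big)^{-1} B_t^\top \Phi_{t+1} A_t}_{=:\ L_t}\; s_t, \]

so the gain L_t depends only on the problem matrices and on Phi_{t+1}, not on Psi_{t+1}.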
So Dialogue: 0,1:06:59.29,1:07:02.82,Default,,0000,0000,0000,,having done dynamic programming, you remember also when we worked out the Dialogue: 0,1:07:02.82,1:07:04.98,Default,,0000,0000,0000,,dynamic programming algorithm for finite Dialogue: 0,1:07:04.98,1:07:06.28,Default,,0000,0000,0000,,horizon MDPs, Dialogue: 0,1:07:06.28,1:07:07.100,Default,,0000,0000,0000,,we said that Dialogue: 0,1:07:07.100,1:07:12.11,Default,,0000,0000,0000,,the way you compute the optimal policy, pi star of T Dialogue: 0,1:07:12.11,1:07:13.70,Default,,0000,0000,0000,,of ST. Dialogue: 0,1:07:13.70,1:07:17.82,Default,,0000,0000,0000,,This is always the [inaudible] of the same thing. [Inaudible] Dialogue: 0,1:07:17.82,1:07:22.80,Default,,0000,0000,0000,,of actions AT of the same thing. Dialogue: 0,1:07:22.80,1:07:25.14,Default,,0000,0000,0000,,STAT plus Dialogue: 0,1:07:25.14,1:07:32.14,Default,,0000,0000,0000,,your expected value of [inaudible] PSTAT, P-star, T Dialogue: 0,1:07:35.86,1:07:39.18,Default,,0000,0000,0000,,plus one, ST plus one. This thing on the right-hand side is always the same thing as Dialogue: 0,1:07:39.18,1:07:41.84,Default,,0000,0000,0000,,the thing we maximized Dialogue: 0,1:07:41.84,1:07:42.97,Default,,0000,0000,0000,, Dialogue: 0,1:07:42.97,1:07:46.48,Default,,0000,0000,0000,,[inaudible]. So what this means is that when I said this a value of A to the maximize Dialogue: 0,1:07:46.48,1:07:50.38,Default,,0000,0000,0000,,of this. So what this means is that the optimal action to take from the state of ST Dialogue: 0,1:07:50.38,1:07:53.08,Default,,0000,0000,0000,,is actually equal to Dialogue: 0,1:07:53.08,1:07:55.87,Default,,0000,0000,0000,,LT Dialogue: 0,1:07:55.87,1:08:01.69,Default,,0000,0000,0000,,times ST. Dialogue: 0,1:08:01.69,1:08:03.86,Default,,0000,0000,0000,,What Dialogue: 0,1:08:03.86,1:08:05.92,Default,,0000,0000,0000,,was shown is that Dialogue: 0,1:08:05.92,1:08:08.13,Default,,0000,0000,0000,,when you're in some state, ST, the Dialogue: 0,1:08:08.13,1:08:13.60,Default,,0000,0000,0000,,optimal action for that state is going to be some matrix, LT, which can compute, Dialogue: 0,1:08:13.60,1:08:14.23,Default,,0000,0000,0000,,times Dialogue: 0,1:08:14.23,1:08:16.86,Default,,0000,0000,0000,,the state, Dialogue: 0,1:08:16.86,1:08:18.08,Default,,0000,0000,0000,,ST. Dialogue: 0,1:08:18.08,1:08:23.55,Default,,0000,0000,0000,,In other words, the optimal action is actually a linear function of the state. Dialogue: 0,1:08:23.55,1:08:26.84,Default,,0000,0000,0000,,I'm just going to point out, this is not a function of approximation here, right. Dialogue: 0,1:08:26.84,1:08:31.74,Default,,0000,0000,0000,,What we did not do, we did not say, Dialogue: 0,1:08:31.74,1:08:35.39,Default,,0000,0000,0000,,let's find the optimal linear policy. We didn't say, let's look at the optimal Dialogue: 0,1:08:35.39,1:08:38.36,Default,,0000,0000,0000,,policy, and then we'll fit this straight line to the optimal policy. This is not Dialogue: 0,1:08:38.36,1:08:42.09,Default,,0000,0000,0000,,about approximating the optimal policy with a straight line. Dialogue: 0,1:08:42.09,1:08:46.55,Default,,0000,0000,0000,,This derivation is saying that the optimal policy is a straight line. The Dialogue: 0,1:08:46.55,1:08:53.55,Default,,0000,0000,0000,,optimal action is a linear function of the current Dialogue: 0,1:08:54.24,1:08:57.36,Default,,0000,0000,0000,,state. 
Dialogue: 0,1:08:57.36,1:09:01.32,Default,,0000,0000,0000,,Moreover, when you've worked out this value for AT Dialogue: 0,1:09:01.32,1:09:02.31,Default,,0000,0000,0000,, Dialogue: 0,1:09:02.31,1:09:05.37,Default,,0000,0000,0000,,that maximizes this thing on Dialogue: 0,1:09:05.37,1:09:10.01,Default,,0000,0000,0000,,the right-hand side. So you take this and plug it back in to do the dynamic programming recursion. Dialogue: 0,1:09:10.01,1:09:17.01,Default,,0000,0000,0000,,What you find is that - Dialogue: 0,1:09:19.72,1:09:24.19,Default,,0000,0000,0000,,so you take AT and plug it back in to do the maximization. It will Dialogue: 0,1:09:24.19,1:09:26.32,Default,,0000,0000,0000,,actually get Dialogue: 0,1:09:26.32,1:09:28.28,Default,,0000,0000,0000,,you this formula, Dialogue: 0,1:09:28.28,1:09:30.56,Default,,0000,0000,0000,,so V star Dialogue: 0,1:09:30.56,1:09:35.05,Default,,0000,0000,0000,,T of ST. So you find that Dialogue: 0,1:09:35.05,1:09:38.77,Default,,0000,0000,0000,,it will indeed be a quadratic function like this Dialogue: 0,1:09:38.77,1:09:41.80,Default,,0000,0000,0000,,of the following form where Dialogue: 0,1:09:41.80,1:09:45.54,Default,,0000,0000,0000,,- and I just write out the equations for the sake of Dialogue: 0,1:09:45.54,1:09:52.54,Default,,0000,0000,0000,,completeness. Don't worry too much about their forms. You can Dialogue: 0,1:09:57.37,1:10:03.63,Default,,0000,0000,0000,,derive this yourself. Dialogue: 0,1:10:48.13,1:10:52.19,Default,,0000,0000,0000,,So just to summarize, don't worry too much about the forms of these Dialogue: 0,1:10:52.19,1:10:54.43,Default,,0000,0000,0000,,equations. Dialogue: 0,1:10:54.43,1:10:57.90,Default,,0000,0000,0000,,What we've done is written down the recursion that expresses Dialogue: 0,1:10:57.90,1:10:59.82,Default,,0000,0000,0000,,phi T and psi T Dialogue: 0,1:10:59.82,1:11:02.33,Default,,0000,0000,0000,,as a function of phi T plus one Dialogue: 0,1:11:02.33,1:11:04.13,Default,,0000,0000,0000,,and psi T plus one. Dialogue: 0,1:11:04.13,1:11:05.09,Default,,0000,0000,0000,,So Dialogue: 0,1:11:05.09,1:11:08.34,Default,,0000,0000,0000,,this allows you to compute the optimal value function Dialogue: 0,1:11:08.34,1:11:13.58,Default,,0000,0000,0000,,for when the clock is at time lowercase T, as a function of the optimal value Dialogue: 0,1:11:13.58,1:11:17.70,Default,,0000,0000,0000,,function for when the clock is at time T plus one. Dialogue: 0,1:11:20.02,1:11:25.48,Default,,0000,0000,0000,,So Dialogue: 0,1:11:25.48,1:11:28.26,Default,,0000,0000,0000,,to summarize, Dialogue: 0,1:11:28.26,1:11:32.66,Default,,0000,0000,0000,,this is how we solve a finite horizon LQR problem Dialogue: 0,1:11:32.66,1:11:35.92,Default,,0000,0000,0000,,- actually, just to give this equation a name as well. This Dialogue: 0,1:11:35.92,1:11:40.37,Default,,0000,0000,0000,,recursion, in terms of the phi Ts, this is called the Dialogue: 0,1:11:40.37,1:11:44.91,Default,,0000,0000,0000,,discrete Dialogue: 0,1:11:44.91,1:11:51.91,Default,,0000,0000,0000,,time Riccati equation. It's the Dialogue: 0,1:11:54.10,1:11:57.61,Default,,0000,0000,0000,,recursion that gives you phi Dialogue: 0,1:11:57.61,1:12:02.42,Default,,0000,0000,0000,,T in terms of phi T plus one. 
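For reference, one way the two recursions come out under the conventions used here (a reconstruction sketch worked from the derivation above, rather than a verbatim copy of the board equations) is

\[ \Phi_t \;=\; A_t^\top \Big( \Phi_{t+1} - \Phi_{t+1} B_t \big( B_t^\top \Phi_{t+1} B_t - V_t \big)^{-1} B_t^\top \Phi_{t+1} \Big) A_t \;-\; U_t, \qquad \Psi_t \;=\; \operatorname{tr}\!\big( \Sigma_w \Phi_{t+1} \big) \;+\; \Psi_{t+1}, \]

where the first equation is the discrete time Riccati equation just named, and the second shows the noise entering only through the constant term.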
Dialogue: 0,1:12:02.42,1:12:05.13,Default,,0000,0000,0000,,So Dialogue: 0,1:12:05.13,1:12:10.97,Default,,0000,0000,0000,,to summarize, Dialogue: 0,1:12:12.27,1:12:17.50,Default,,0000,0000,0000,,our algorithm for finding the exact solution to finite horizon LQR Dialogue: 0,1:12:17.50,1:12:22.95,Default,,0000,0000,0000,,problems is as follows. We initialize phi T Dialogue: 0,1:12:22.95,1:12:24.34,Default,,0000,0000,0000,,to Dialogue: 0,1:12:24.34,1:12:26.84,Default,,0000,0000,0000,,be equal to minus Dialogue: 0,1:12:30.45,1:12:32.70,Default,,0000,0000,0000,,UT Dialogue: 0,1:12:32.70,1:12:36.56,Default,,0000,0000,0000,,and psi T to be equal to zero. Dialogue: 0,1:12:36.56,1:12:38.71,Default,,0000,0000,0000,,Then Dialogue: 0,1:12:38.71,1:12:42.68,Default,,0000,0000,0000,,recursively, Dialogue: 0,1:12:42.68,1:12:45.08,Default,,0000,0000,0000,,calculate Dialogue: 0,1:12:45.08,1:12:47.18,Default,,0000,0000,0000,,phi T Dialogue: 0,1:12:47.18,1:12:49.26,Default,,0000,0000,0000,,and psi Dialogue: 0,1:12:49.26,1:12:50.31,Default,,0000,0000,0000,,T Dialogue: 0,1:12:50.31,1:12:53.82,Default,,0000,0000,0000,,as a function of phi T plus one Dialogue: 0,1:12:53.82,1:12:56.36,Default,,0000,0000,0000,,and psi T plus Dialogue: 0,1:12:56.36,1:13:01.43,Default,,0000,0000,0000,,one with the discrete time - Dialogue: 0,1:13:01.77,1:13:06.03,Default,,0000,0000,0000,,actually, excuse me. So Dialogue: 0,1:13:06.03,1:13:08.59,Default,,0000,0000,0000,,recursively calculate phi T and psi Dialogue: 0,1:13:08.59,1:13:11.35,Default,,0000,0000,0000,,T as a function of phi T plus one and psi T plus one, as I showed, Dialogue: 0,1:13:11.35,1:13:14.73,Default,,0000,0000,0000,,using the discrete time Riccati equation. Dialogue: 0,1:13:14.73,1:13:17.94,Default,,0000,0000,0000,,So you do this for Dialogue: 0,1:13:17.94,1:13:19.20,Default,,0000,0000,0000,,T equals Dialogue: 0,1:13:19.20,1:13:22.43,Default,,0000,0000,0000,,T minus one, T minus two Dialogue: 0,1:13:22.43,1:13:28.58,Default,,0000,0000,0000,,and so on, down to time zero. Then you Dialogue: 0,1:13:28.58,1:13:32.52,Default,,0000,0000,0000,,compute Dialogue: 0,1:13:32.52,1:13:37.43,Default,,0000,0000,0000,,LT as a function of - actually, is it phi T or phi T Dialogue: 0,1:13:37.43,1:13:39.60,Default,,0000,0000,0000,,plus one? Phi T Dialogue: 0,1:13:39.60,1:13:41.93,Default,,0000,0000,0000,,plus one, I think. Dialogue: 0,1:13:41.93,1:13:46.26,Default,,0000,0000,0000,,As a function of phi T plus one and psi T plus one. Dialogue: 0,1:13:46.26,1:13:50.20,Default,,0000,0000,0000,,This is actually a function of only phi T plus one. You don't really need psi T Dialogue: 0,1:13:50.20,1:13:53.11,Default,,0000,0000,0000,,plus one. Dialogue: 0,1:13:53.11,1:13:59.82,Default,,0000,0000,0000,,Now you have your optimal policy. Dialogue: 0,1:13:59.82,1:14:03.10,Default,,0000,0000,0000,,So having computed the LTs, Dialogue: 0,1:14:03.10,1:14:05.94,Default,,0000,0000,0000,,you now have the Dialogue: 0,1:14:05.94,1:14:08.15,Default,,0000,0000,0000,,optimal action to take in the state Dialogue: 0,1:14:08.15,1:14:15.15,Default,,0000,0000,0000,,ST, just given by Dialogue: 0,1:14:24.53,1:14:29.42,Default,,0000,0000,0000,,this Dialogue: 0,1:14:29.42,1:14:34.98,Default,,0000,0000,0000,,linear equation. Dialogue: 0,1:14:34.98,1:14:44.50,Default,,0000,0000,0000,,How much time do I have left? Okay. Let me just say one last thing about this before I close. Dialogue: 0,1:14:44.48,1:14:48.45,Default,,0000,0000,0000,,Maybe I'll do it next week. 
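Here is a minimal runnable sketch of that backward pass, assuming stationary A, B, U, V just to keep it short; the gain and Riccati updates follow the derivation sketched above, so treat it as an illustration rather than the lecture's own code.

    import numpy as np

    def lqr_backward(A, B, U, V, Sigma_w, T):
        # Finite-horizon LQR backward pass in the reward (maximization) convention:
        # V*_t(s) = s' Phi[t] s + Psi[t], with base case Phi[T] = -U, Psi[T] = 0.
        n = A.shape[0]
        Phi = [None] * (T + 1)
        Psi = [0.0] * (T + 1)
        L = [None] * T
        Phi[T] = -U
        for t in range(T - 1, -1, -1):            # t = T-1, T-2, ..., 0
            P = Phi[t + 1]
            # Gain for the linear policy a_t = L[t] @ s_t
            L[t] = np.linalg.solve(V - B.T @ P @ B, B.T @ P @ A)
            # Discrete time Riccati update for the quadratic part
            Phi[t] = A.T @ (P - P @ B @ np.linalg.solve(B.T @ P @ B - V, B.T @ P)) @ A - U
            # Constant part; only the noise covariance shows up here
            Psi[t] = float(np.trace(Sigma_w @ P)) + Psi[t + 1]
        return L, Phi, Psi

    # Usage: having computed the gains, act with a_t = L[t] @ s_t at each time step.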
I think I'll do it next session Dialogue: 0,1:14:48.45,1:14:52.18,Default,,0000,0000,0000,,instead. So it actually turns out there's one cool property about this that's kind of that is Dialogue: 0,1:14:52.18,1:14:54.75,Default,,0000,0000,0000,,kind of subtle, but you'll find it out in the next lecture. Dialogue: 0,1:14:54.75,1:15:01.75,Default,,0000,0000,0000,,Are there question about this before we close for today, then? So Dialogue: 0,1:15:04.83,1:15:07.31,Default,,0000,0000,0000,,the very cool thing about Dialogue: 0,1:15:07.31,1:15:12.58,Default,,0000,0000,0000,,the solution of discrete time LQR problems - finite horizon LQR Dialogue: 0,1:15:12.58,1:15:17.28,Default,,0000,0000,0000,,problems is that this is a problem in an infinite state, with a continuous state. Dialogue: 0,1:15:17.28,1:15:21.06,Default,,0000,0000,0000,,But nonetheless, under the assumptions we made, you can prove that the Dialogue: 0,1:15:21.06,1:15:24.75,Default,,0000,0000,0000,,value function is a quadratic function of the state. Dialogue: 0,1:15:24.75,1:15:29.09,Default,,0000,0000,0000,,Therefore, just by computing these matrixes phi T and the real numbers Dialogue: 0,1:15:29.09,1:15:33.01,Default,,0000,0000,0000,,psi T, you can actually exactly represent the value function, even for Dialogue: 0,1:15:33.01,1:15:37.49,Default,,0000,0000,0000,,these infinitely large state spaces, even for continuous state spaces. Dialogue: 0,1:15:37.49,1:15:42.60,Default,,0000,0000,0000,,So the computation of these algorithms scales only like the cube, Dialogue: 0,1:15:42.60,1:15:46.50,Default,,0000,0000,0000,,scales only as a polynomial in terms of the number of state variables Dialogue: 0,1:15:46.50,1:15:49.96,Default,,0000,0000,0000,,whereas in [inaudible] dimensionality problems, with [inaudible], we Dialogue: 0,1:15:49.96,1:15:53.10,Default,,0000,0000,0000,,had algorithms of a scale exponentially dimensional problem. Dialogue: 0,1:15:53.10,1:15:55.42,Default,,0000,0000,0000,,Whereas LQR scales only are like the cube Dialogue: 0,1:15:55.42,1:16:00.70,Default,,0000,0000,0000,,of the dimension of the problem. So this easily Dialogue: 0,1:16:00.70,1:16:03.96,Default,,0000,0000,0000,,applies to problems with even very large state spaces. So we actually often apply Dialogue: 0,1:16:03.96,1:16:06.05,Default,,0000,0000,0000,,variations of this algorithm Dialogue: 0,1:16:06.05,1:16:09.16,Default,,0000,0000,0000,,to some subset, to some particular subset for the things we do on our Dialogue: 0,1:16:09.16,1:16:13.01,Default,,0000,0000,0000,,helicopter, which has high dimensional state spaces, with twelve or higher Dialogue: 0,1:16:13.01,1:16:13.97,Default,,0000,0000,0000,,dimensions. Dialogue: 0,1:16:13.97,1:16:15.99,Default,,0000,0000,0000,,This has worked very well for that. Dialogue: 0,1:16:15.99,1:16:17.23,Default,,0000,0000,0000,,So Dialogue: 0,1:16:17.23,1:16:21.28,Default,,0000,0000,0000,,it turns out there are even more things you can do with this, and I'll continue with Dialogue: 0,1:16:21.28,1:16:23.52,Default,,0000,0000,0000,,that in the next lecture. So let's close for today, then.