This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. So I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and so we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in Skilling Auditorium, in Skilling 193, at 3:45. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual also.

Okay, welcome back. What I want to do today is continue our discussion of reinforcement learning in MDPs. It's quite a lot for me to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. I'll talk just very briefly about discretization. I'll spend a lot of time talking about models, or simulators, of MDPs, and then talk about one algorithm called fitted value iteration and ideas which build on that, and then hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap, in the previous lecture I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation.
I said that an MDP, or Markov decision process, was a 5-tuple (S, A, {P_sa}, γ, R) comprising those things, and the running example of those I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP that I was using, it had 11 states, so that's what S was. The actions were the compass directions: north, south, east and west. The state transition probabilities P_sa capture the chance of your transitioning to every state when you take any action in any other given state, and so in our example those captured the stochastic dynamics of our robot wandering around [inaudible], and we said if you take the action north, and similarly for south, you have a 0.8 chance of actually going north, a 0.1 chance of veering off to the left, and a 0.1 chance of veering off to the right, so that modeled the robot's noisy dynamics with a [inaudible]. And the reward function was +1 or -1 at the absorbing states and -0.02 elsewhere. This is an example of an MDP, and that's what these five things were. Oh, and I used a discount factor γ of usually a number slightly less than one, so that's the 0.99.
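For concreteness, those five elements could be stored like this for the 11-state grid world; a minimal sketch, where the array names, the absorbing-state indices and the omitted transition entries are illustrative rather than from the lecture:

```python
import numpy as np

# The 5-tuple (S, A, {P_sa}, gamma, R) for the 11-state grid world.
n_states = 11
actions = ["north", "south", "east", "west"]

# P[a, s, s'] = probability of landing in s' after taking action a in state s.
# Each row would be filled in with 0.8 for the intended direction and 0.1 for
# each of the two perpendicular directions, per the noisy dynamics above.
P = np.zeros((len(actions), n_states, n_states))

# Rewards: -0.02 everywhere except the two absorbing states (+1 and -1).
R = np.full(n_states, -0.02)
R[9], R[10] = 1.0, -1.0   # hypothetical indices for the two absorbing states

gamma = 0.99              # discount factor
```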
And so our goal was to find the policy, the control policy, and that's π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined the value function V^π(s) to be equal to this. We said that the value of a policy π from a state s was given by the expected value of the sum of discounted rewards, conditioned on your executing the policy π and on your starting off the [inaudible], let's say, in the state s, and so our strategy for finding the policy was sort of comprised of two steps. So the goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and so I said last time that one strategy for finding the [inaudible] of a policy is to first compute the optimal value function, which I denoted V*(s) and is defined like that. It's the maximum value that any policy can obtain, and for example, the optimal value function for that MDP looks like this. So in other words, starting from any of these states, what's the expected value of the sum of discounted rewards you get; so this is V*. We also said that once you've found V*, you can compute the optimal policy using this.
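Written out, the definitions just recapped ("this", "like that" and "using this") are the standard ones:

```latex
V^{\pi}(s) = \mathrm{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \;\middle|\; s_0 = s,\ \pi \right]

V^{*}(s) = \max_{\pi} V^{\pi}(s)

\pi^{*}(s) = \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V^{*}(s')
```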
And so once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*(s), the optimal sum of discounted rewards you can get from a state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max, over all the actions you could take, of your future sum of discounted rewards: your future payoff starting from the state s', which is where you might transition to after one step. And so this gave us a value iteration algorithm, which was essentially, V.I., I'm abbreviating value iteration as V.I., so in the value iteration algorithm, in V.I., you just take Bellman's equations and you repeatedly do this. So initialize some guess of the value function, initialize it to zero as the starting guess, and then repeatedly perform this update for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), and from V* you can compute the optimal policy π*. Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let's see: given a random initial policy, we'll solve for V^π; we'll solve for the value function for that specific policy. So this means, for every state, compute the expected sum of discounted rewards if you execute the policy π
from that state, and then the other step of policy iteration is, having found the value function for your policy, you then update the policy, pretending that you've already found the optimal value function V*, and then you repeatedly perform these two steps, where you solve for the value function for your current policy and then pretend that that's actually the optimal value function and solve for the policy given the value function, and you repeatedly update the value function or update the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and this will cause π to converge to π*, the optimal policy. So those were based on our last lecture [inaudible] MDPs, which introduced a lot of new notation and symbols, and that just summarizes all that again. What I'm about to do now, what I'm about to do for the rest of today's lecture, is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to go on, please. Yeah.

Student: [Inaudible] how those two algorithms are very different?

Instructor (Andrew Ng): I see, right, so yeah, do you see that they're different? Okay, how it's different. Let's see. So, well, here's one difference. I didn't say this because we won't use it today. So value iteration and policy iteration are different algorithms.
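For reference, a minimal sketch of the value iteration algorithm just recapped, assuming the P, R and gamma arrays from the earlier sketch (the function name, fixed iteration count and the greedy-policy extraction at the end are illustrative choices):

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=1000):
    """Repeatedly apply the Bellman backup V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    V = np.zeros(R.shape[0])                 # initialize V(s) = 0 as the starting guess
    for _ in range(n_iters):
        # P has shape (n_actions, n_states, n_states), so P @ V has shape (n_actions, n_states)
        V = R + gamma * (P @ V).max(axis=0)
    pi = (P @ V).argmax(axis=0)              # greedy policy with respect to the (near-)optimal V
    return V, pi
```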
In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy, and so you're given some fixed policy π, meaning some function mapping from the states to the actions. So I give you some policy, and whatever; that's just some policy, it's not a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state you need to compute the expected sum of discounted rewards you get if you execute this specific policy, starting off the MDP in that state s. So I showed this last time. I won't go into details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There was a form of Bellman's equations for V^π, and it turned out to be, if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function for a fixed policy by solving a system of linear equations with 11 variables and 11 constraints; and so that's policy iteration. Whereas in value iteration, going back to the board, in value iteration you sort of repeatedly perform this update where you update the value of a state as the [inaudible]. So I hope that makes sense, that the algorithms are different.

Student: [Inaudible] on the absorbing states, so is the assumption that we can never get out of those states?
Instructor (Andrew Ng): Yes. There are always things that, where you solve for this [inaudible], for example, to make the numbers come up nicely, but I don't wanna spend too much time on them, but yeah, so the assumption is that once you enter the absorbing state, then the world ends, or there are no more rewards after that. And there is another way to think of the absorbing states, which is sort of mathematically equivalent. You can think of the absorbing states as transitioning with probability 1 to some twelfth state, and then once you're in that twelfth state, you always remain in that twelfth state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the twelfth state is this zero-cost absorbing state that you get stuck in forever. Other questions? Yeah, please go.

Student: Where did the Bellman's equations [inaudible] to optimal value [inaudible]?

Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equations, this equation that I'm pointing to, I sort of tried to give a justification for this last time. I'll say it in one sentence: the expected total payoff that I expect to get from the state s is equal to my immediate reward, which is the reward I get for starting in that state. Let's see. If I start in that state, I'm gonna get some first reward, and then I can transition to some other state, and then from that other state, I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces.
It says the value of a state is equal to the reward you get right away, which is really, well, V*(s) is really equal to R(s) + γ max_a Σ_{s'} P_sa(s') V*(s'). So this breaks V* into two terms and says that there's this first term, which is the immediate reward, that, and then plus γ times the rewards you get in the future, which turns out to be equal to that second term. I spent more time justifying this in the previous lecture, although, yeah, hopefully, for the purposes of this lecture, if you're not sure where this came from, if you don't remember the justification of that, why don't you just maybe take my word for it that this equation holds true, since I use it a little bit later as well, and then the lecture notes sort of explain a little further the justification for why this equation might hold true. But for now, yeah, just for now take my word for it that this holds true, 'cause we'll use it a little bit later today as well.

Student: [Inaudible] and would it then sort of turn back into [inaudible]?

Instructor (Andrew Ng): Actually, [inaudible], right, the question is: in policy iteration, if we represent π implicitly, using V(s), would it become equivalent to value iteration? And the answer is sort of no. Let's see. It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but, yeah, it actually turns out that, oh, no, the algorithms are not equivalent. It's just that in policy iteration, there is a step where you're solving for the value function for the policy, that is, solve for V^π.
Usually, you can do this, for instance, by solving a linear system of equations. In value iteration, it is a different algorithm, yes. I hope it makes sense that, at least cosmetically, it's different.

Student: [Inaudible] if you have [inaudible] representing π implicitly, then you won't have to solve that [inaudible] equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see. To solve for V^π, this works only if you have a fixed policy, so once you change the value function, if π changes as well, then it's sort of hard to solve this. Yeah, so later on we'll actually talk about some examples of when π is implicitly represented, but at least for now, I think there's, yeah. Maybe there's a way to redefine something to see a mapping onto value iteration, but that's not usually done. These are viewed as different algorithms. Okay, cool, so all good questions. Let me move on and talk about how to generalize these ideas to continuous states.

Everything we've done so far has been for discrete states or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function V(s), could then be represented using an array of 11 numbers, 'cause if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.
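To round out the discrete-state recap in code, here is a minimal sketch of the policy-evaluation step described above, solving Bellman's equations for a fixed policy as an 11-by-11 linear system, together with the policy iteration loop around it; the function names are illustrative, and P, R and gamma are the arrays from the earlier sketch:

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Solve Bellman's equations for a fixed policy pi: V = R + gamma * P_pi V,
    i.e. the linear system (I - gamma * P_pi) V = R (11 equations in 11 unknowns here)."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]      # row s holds P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

def policy_iteration(P, R, gamma, n_iters=100):
    """Alternate policy evaluation with a greedy policy improvement step."""
    pi = np.zeros(R.shape[0], dtype=int)      # start from some arbitrary initial policy
    for _ in range(n_iters):
        V = evaluate_policy(P, R, gamma, pi)
        pi = (P @ V).argmax(axis=0)           # act greedily as if V were V*
    return V, pi
```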
What I want to do for most of today is talk about continuous states. So for example, if you want to control any of a number of real [inaudible], so for example, if you want to control a car, a car's position is given by x, y, θ: its position and orientation; and if you want to model the velocity as well, then also ẋ, ẏ, θ̇. So it depends on whether you want to model just the kinematics, so just position, or whether you want to model the dynamics, meaning the velocity as well. Earlier I showed you a video of a helicopter that was flying, using a reinforcement learning algorithm. For the helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state will be given by the x, y, z position, and φ, θ, ψ, which are sometimes used to denote the roll, pitch and yaw of the helicopter, its orientation; and if you want to control a helicopter, you pretty much have to model velocity as well, which means both linear velocity as well as angular velocity, and so this would be a 12-dimensional state.
If you want an example that is kind of fun but unusual, and I'm just gonna use this as an example, and actually use this little example a bit in today's lecture, it's the inverted pendulum problem, which is sort of a long-running classic in reinforcement learning, in which you imagine that you have a little cart that's on a rail. The rail ends at some point, and imagine that you have a pole attached to the cart, and this is a free hinge, so the pole here can rotate freely, and your goal is to control the cart and to move it back and forth on this rail so as to keep the pole balanced. Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here? Back in the corner. Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine these are [inaudible] just moving back and forth to try to keep the pole balanced, so you can actually use a reinforcement learning algorithm to do that.
This is actually one of the longstanding classic problems that people [inaudible] implement and play with using reinforcement learning algorithms, and so for this, the states would be x and θ, so x would be the position of the cart, and θ would be the orientation of the pole, and also the linear velocity and the angular velocity of the pole; so I'll actually use this example a couple of times. So given a continuous state space, how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or something like the inverted pendulum? So one thing you can do, and this is maybe the most straightforward thing, is, if you have, say, a two-dimensional continuous state space, say s1 and s2 are my state variables (and in all the examples these are, I guess, between 4-dimensional and 12-dimensional; I'll just draw 2-D here), the most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. And I use s-bar to denote the discretized, or discrete, states, and so you can approximate this continuous state problem with a finite or discrete set of states and then use policy iteration or value iteration to solve for V(s-bar).
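A minimal sketch of that discretization step for a 2-D state, assuming you simply grid each dimension into a fixed number of buckets (the bounds, bucket count and names below are illustrative):

```python
import numpy as np

# Illustrative bounds and resolution for a 2-D continuous state (s1, s2).
S_MIN = np.array([-1.0, -1.0])
S_MAX = np.array([ 1.0,  1.0])
K = 10                               # buckets per dimension, so K**2 discrete cells in total

def discretize(s):
    """Map a continuous state s to the integer index s_bar of its grid cell."""
    idx = np.floor((s - S_MIN) / (S_MAX - S_MIN) * K).astype(int)
    idx = np.clip(idx, 0, K - 1)     # keep boundary states inside the grid
    return idx[0] * K + idx[1]       # flatten the 2-D cell coordinates into one index

# Value iteration or policy iteration is then run on the K**2 discrete states, and at
# execution time the robot looks up pi(discretize(s)) for its current continuous state s.
```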
And if your robot is then in some state, given by that dot, you would then figure out what discretized state it is in; in this case it's in this discretized grid cell, which is called s-bar, and then you execute, you choose, the action given by your policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem. Sometimes you can sort of make this work, but there are a couple of reasons why this does not work very well. One reason is the following, and for this picture, let's even temporarily put aside reinforcement learning. Let's just think about doing regression for now, and so suppose you have some input variable x, and suppose I have some data, and I want to fit a function, y as a function of x; so discretization is saying that I'm going to take my horizontal axis and chop it up into a number of intervals. Sometimes I call these intervals buckets as well. We chop my horizontal axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of these buckets. And just look at this. This is clearly not a very good representation, right? Whereas when we talk about regression, you'd just choose some features of x and run linear regression or something, and you get a much better fit to the function.
And so there's the sense that discretization just isn't a very good class of functions; this sort of piecewise-constant function just isn't a very good representation for many things, and there's also the sense that there's no smoothing, or there's no generalization, across the different buckets. And in fact, back in regression, I would never have chosen to do regression using this sort of discretization. It just really doesn't make sense. And so in the same way, instead of x and some hypothesis function of x, if you have the state here and you're trying to approximate the value function V(s), then you can get discretization to work for many problems, but maybe this isn't the best representation for a value function. The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. This is just the observation that if the state space is in R^n, and if you discretize each variable into K buckets, so if you discretize each variable into K discrete values, then you get on the order of K to the power of n discrete states.
In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe something like 100 to the power of 12, which is just huge, and it's not feasible. And so discretization doesn't scale well at all to problems in high-dimensional state spaces, and this observation actually applies more generally than to just robotics and continuous state problems. For example, another fairly well-known application of reinforcement learning has been factory automation. If you imagine that you have 20 machines sitting in a factory, and the machines lie in an assembly line and they all do something to a part on the assembly line and then route the part onto a different machine, you may want to use reinforcement learning algorithms to [inaudible] the order in which the different machines operate on the different things that are flowing through your assembly line, and maybe different machines can do different things. So if you have N machines and each machine can be in K states, then if you do this sort of discretization, the total number of states would be K to the N as well: if you have N machines and each machine can be in K states, then again, you get this huge number of states. Another well-known example would be a board game, say if you want to use reinforcement learning to play chess.
Then if you have N pieces in your board game, you have N pieces on the chessboard, and if each piece can be in K positions, then this is again sort of the curse-of-dimensionality thing, where the number of discrete states you end up with goes exponentially with the number of pieces in your board game. So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are [inaudible] that you'll need to fiddle around with the discretization and do things like non-uniform grids. So for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1, if I think the value function is much more sensitive to the value of the state variable s2 than to s1; and so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on. But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it'll work fine.
With 4-dimensional problems, you can probably get it to work, but it'll be just slightly challenging, and sometimes, by fooling around and being clever, you can often push discretization up to, let's say, about 6-dimensional problems, but with some difficulty, and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just the rough folk wisdom on the order of magnitude of problems you'd think about using discretization for.

But what I want to spend most of today talking about is [inaudible] methods that often work much better than discretization and with which we will approximate V* directly, without resorting to these sorts of discretizations. Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup, then. For today's lecture, I'm going to focus on the problem of continuous states, and just to keep things sort of very simple in this lecture, I don't want to deal with continuous actions, so I'm going to assume discrete actions A. And it turns out, actually, and this is a critical fact, that for many problems the state space is much larger than the space of actions. That just seems to have worked out that way for many problems. So for example, for driving a car the state space is 6-dimensional, so x, y, θ, ẋ, ẏ, θ̇; whereas your actions, you still have two actions.
You have forward and backward motion, and steering the car, so you have 6-D states and 2-D actions, and so you can discretize the actions much more easily than you can discretize the states. In the other example, for a helicopter, you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action: whether you accelerate your cart to the left or to the right is a one-D action. And so for the rest of today, I'm gonna assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems, but for many problems, large action spaces are just less of a difficulty than large state spaces. So I'm going to assume that we have a model, or simulator, of the MDP, and so this is really an assumption on how the state transition probabilities are represented. I'm gonna assume, and I'm going to use the terms "model" and "simulator" pretty much synonymously, so specifically, what I'm going to assume is that we have a black box, a piece of code, so that I can input any state, input an action, and it will output s', a sample from the state transition distribution.
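In code, that black-box assumption is nothing more than a function with the following shape; the toy dynamics and noise level here are made up purely so the sketch runs:

```python
import numpy as np

def simulate(s, a, dt=0.1):
    """Black-box model/simulator: take a state s and an action a, and return a
    sample s' drawn from the state transition distribution P_sa."""
    drift = np.array([s[1], a])          # hypothetical toy dynamics for a (position, velocity) state
    noise = 0.01 * np.random.randn(2)    # noise makes the sampled next state stochastic
    return s + dt * drift + noise
```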
So Dialogue: 0,0:34:54.22,0:34:55.07,Default,,0000,0000,0000,,this is really Dialogue: 0,0:34:55.07,0:34:57.08,Default,,0000,0000,0000,,my assumption Dialogue: 0,0:34:57.08,0:35:01.21,Default,,0000,0000,0000,,on the representation I have for the state transition probabilities, Dialogue: 0,0:35:01.21,0:35:05.19,Default,,0000,0000,0000,,so I'll assume I have a black box that Dialogue: 0,0:35:05.19,0:35:06.100,Default,,0000,0000,0000,,takes as input the state and action Dialogue: 0,0:35:06.100,0:35:09.85,Default,,0000,0000,0000,,and outputs the next state. Dialogue: 0,0:35:09.85,0:35:16.85,Default,,0000,0000,0000,,And so Dialogue: 0,0:35:17.41,0:35:20.47,Default,,0000,0000,0000,,there are a few fairly common ways to get models of Dialogue: 0,0:35:20.47,0:35:22.36,Default,,0000,0000,0000,,different MDPs you may work with, Dialogue: 0,0:35:22.36,0:35:24.48,Default,,0000,0000,0000,,one is Dialogue: 0,0:35:24.48,0:35:30.60,Default,,0000,0000,0000,,you might get a model from a physics Dialogue: 0,0:35:30.60,0:35:32.96,Default,,0000,0000,0000,,simulator. So for example, Dialogue: 0,0:35:32.96,0:35:36.76,Default,,0000,0000,0000,,if you're interested in controlling that inverted pendulum, Dialogue: 0,0:35:36.76,0:35:39.45,Default,,0000,0000,0000,,so your action is Dialogue: 0,0:35:39.45,0:35:41.23,Default,,0000,0000,0000,,A which is Dialogue: 0,0:35:41.23,0:35:45.13,Default,,0000,0000,0000,,the magnitude of the force you exert on the cart to Dialogue: 0,0:35:45.13,0:35:46.14,Default,,0000,0000,0000,, Dialogue: 0,0:35:46.14,0:35:47.71,Default,,0000,0000,0000,,the left Dialogue: 0,0:35:47.71,0:35:49.38,Default,,0000,0000,0000,,or right, and your Dialogue: 0,0:35:49.38,0:35:56.38,Default,,0000,0000,0000,,state is X, Xdot, theta, theta-dot. I'm just gonna write that in that order. Dialogue: 0,0:35:58.11,0:36:01.47,Default,,0000,0000,0000,,And so I'm gonna Dialogue: 0,0:36:01.47,0:36:04.38,Default,,0000,0000,0000,,write down a bunch of equations just for completeness but Dialogue: 0,0:36:04.38,0:36:08.78,Default,,0000,0000,0000,,most of what I'm gonna write below here is a bit Dialogue: 0,0:36:08.78,0:36:12.60,Default,,0000,0000,0000,,gratuitous, but if you flip open a textbook on physics, a Dialogue: 0,0:36:12.60,0:36:13.82,Default,,0000,0000,0000,,textbook on mechanics, Dialogue: 0,0:36:13.82,0:36:17.07,Default,,0000,0000,0000,,you can work out the equations of motion of a Dialogue: 0,0:36:17.07,0:36:19.50,Default,,0000,0000,0000,,physical device like this, so you find that Dialogue: 0,0:36:22.75,0:36:26.40,Default,,0000,0000,0000,,Sdot - the dot denotes derivative, so the derivative of the state with respect to time - Dialogue: 0,0:36:26.40,0:36:28.08,Default,,0000,0000,0000,,is given by Dialogue: 0,0:36:28.74,0:36:52.86,Default,,0000,0000,0000,,the vector [Xdot; the acceleration Xdot-dot, a term involving A, L and cos(theta) divided by M; theta-dot; and a corresponding expression for the angular acceleration theta-dot-dot]. Dialogue: 0,0:36:52.86,0:36:58.20,Default,,0000,0000,0000,,And so on, where A, the force, is the action that you exert on the cart, L is Dialogue: 0,0:36:58.20,0:36:59.87,Default,,0000,0000,0000,,the length of the pole, Dialogue: 0,0:36:59.87,0:37:04.10,Default,,0000,0000,0000,,M is the total mass of the system and so on.
So all these equations Dialogue: 0,0:37:04.10,0:37:06.21,Default,,0000,0000,0000,,are good uses, just writing Dialogue: 0,0:37:06.21,0:37:08.55,Default,,0000,0000,0000,,them down for completeness, but by Dialogue: 0,0:37:08.55,0:37:12.18,Default,,0000,0000,0000,,flipping over, open like a physics textbook, you can work out these equations Dialogue: 0,0:37:12.18,0:37:13.84,Default,,0000,0000,0000,,and notions yourself Dialogue: 0,0:37:13.84,0:37:15.17,Default,,0000,0000,0000,,and this Dialogue: 0,0:37:15.17,0:37:17.41,Default,,0000,0000,0000,,then gives you a model which Dialogue: 0,0:37:17.41,0:37:18.26,Default,,0000,0000,0000,,can say that S-2+1. Dialogue: 0,0:37:18.26,0:37:21.55,Default,,0000,0000,0000,,You're still one time step later Dialogue: 0,0:37:21.55,0:37:27.49,Default,,0000,0000,0000,,will be equal to your previous state Dialogue: 0,0:37:27.49,0:37:32.22,Default,,0000,0000,0000,,plus [inaudible], so in your simulator or in my model Dialogue: 0,0:37:32.22,0:37:35.20,Default,,0000,0000,0000,,what happens to the cart every Dialogue: 0,0:37:35.20,0:37:39.23,Default,,0000,0000,0000,,10th of a second, so ? T Dialogue: 0,0:37:39.23,0:37:43.13,Default,,0000,0000,0000,,would be within one second Dialogue: 0,0:37:43.13,0:37:50.13,Default,,0000,0000,0000,,and then so plus ? T times Dialogue: 0,0:38:03.47,0:38:06.13,Default,,0000,0000,0000,,that. And so that'd be one way to come up with a model of Dialogue: 0,0:38:06.13,0:38:08.17,Default,,0000,0000,0000,,your MDP. Dialogue: 0,0:38:08.17,0:38:12.39,Default,,0000,0000,0000,,And in this specific example, I've actually written down deterministic model because Dialogue: 0,0:38:12.83,0:38:14.83,Default,,0000,0000,0000,,and by deterministic I mean that Dialogue: 0,0:38:14.83,0:38:18.32,Default,,0000,0000,0000,,given an action in a state, the next state Dialogue: 0,0:38:18.32,0:38:19.84,Default,,0000,0000,0000,,is not random, Dialogue: 0,0:38:19.84,0:38:23.18,Default,,0000,0000,0000,,so would be an example of a deterministic model Dialogue: 0,0:38:23.18,0:38:26.18,Default,,0000,0000,0000,,where I can compute the next state exactly Dialogue: 0,0:38:26.18,0:38:29.70,Default,,0000,0000,0000,,as a function of the previous state and the previous action Dialogue: 0,0:38:29.70,0:38:33.48,Default,,0000,0000,0000,,or it's a deterministic model because all the probability Dialogue: 0,0:38:33.48,0:38:34.100,Default,,0000,0000,0000,,mass is on a single state Dialogue: 0,0:38:34.100,0:38:36.94,Default,,0000,0000,0000,,given the previous stated action. Dialogue: 0,0:38:36.94,0:38:43.94,Default,,0000,0000,0000,,You can also make this a stochastic model. Dialogue: 0,0:38:58.59,0:39:05.59,Default,,0000,0000,0000,,A second way that is often used to attain a model is to learn one. Dialogue: 0,0:39:08.36,0:39:13.08,Default,,0000,0000,0000,,And so again, just concretely what you do is you would Dialogue: 0,0:39:13.08,0:39:15.60,Default,,0000,0000,0000,,imagine that you have a physical Dialogue: 0,0:39:15.60,0:39:17.11,Default,,0000,0000,0000,,inverted pendulum system Dialogue: 0,0:39:17.11,0:39:20.63,Default,,0000,0000,0000,,as you physically own an inverted pendulum robot. 
Dialogue: 0,0:39:20.63,0:39:24.11,Default,,0000,0000,0000,,What you would do is you would then Dialogue: 0,0:39:24.11,0:39:27.11,Default,,0000,0000,0000,,initialize your inverted pendulum robot to some state Dialogue: 0,0:39:27.11,0:39:31.91,Default,,0000,0000,0000,,and then execute some policy, could be some random policy or some policy that you think is pretty Dialogue: 0,0:39:31.91,0:39:36.22,Default,,0000,0000,0000,,good, or you could even try controlling yourself with a joystick or something. Dialogue: 0,0:39:36.22,0:39:37.40,Default,,0000,0000,0000,,But so you set Dialogue: 0,0:39:37.40,0:39:40.89,Default,,0000,0000,0000,,the system off in some state as zero. Dialogue: 0,0:39:40.89,0:39:43.62,Default,,0000,0000,0000,,Then you take some action. Dialogue: 0,0:39:43.62,0:39:48.04,Default,,0000,0000,0000,,Here's zero and the game could be chosen by some policy or chosen by you using a joystick tryina Dialogue: 0,0:39:48.04,0:39:51.18,Default,,0000,0000,0000,,control your inverted pendulum or whatever. Dialogue: 0,0:39:51.18,0:39:55.03,Default,,0000,0000,0000,,System would transition to some new state, S-1, and then you Dialogue: 0,0:39:55.03,0:39:57.27,Default,,0000,0000,0000,,take some new action, A-1 Dialogue: 0,0:39:57.27,0:39:59.80,Default,,0000,0000,0000,,and so on. Let's say you do Dialogue: 0,0:39:59.80,0:40:01.58,Default,,0000,0000,0000,,this Dialogue: 0,0:40:01.58,0:40:04.78,Default,,0000,0000,0000,,for two time steps Dialogue: 0,0:40:04.78,0:40:06.48,Default,,0000,0000,0000,,and Dialogue: 0,0:40:06.48,0:40:08.86,Default,,0000,0000,0000,,sometimes I call this one trajectory and Dialogue: 0,0:40:08.86,0:40:10.01,Default,,0000,0000,0000,,you repeat this Dialogue: 0,0:40:10.01,0:40:12.99,Default,,0000,0000,0000,,M times, so Dialogue: 0,0:40:12.99,0:40:16.99,Default,,0000,0000,0000,,this is the first trial of the first trajectory, Dialogue: 0,0:40:16.99,0:40:23.99,Default,,0000,0000,0000,,and then you do this again. Initialize it in some Dialogue: 0,0:40:35.42,0:40:36.88,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:40:36.88,0:40:39.98,Default,,0000,0000,0000,,So you do this a bunch of times Dialogue: 0,0:40:39.98,0:40:43.64,Default,,0000,0000,0000,,and then you would run the learning algorithm Dialogue: 0,0:40:43.64,0:40:49.19,Default,,0000,0000,0000,,to Dialogue: 0,0:40:49.19,0:40:53.20,Default,,0000,0000,0000,,estimate ST+1 Dialogue: 0,0:40:53.20,0:40:56.65,Default,,0000,0000,0000,,as a function Dialogue: 0,0:40:56.65,0:41:01.87,Default,,0000,0000,0000,,of Dialogue: 0,0:41:01.87,0:41:04.62,Default,,0000,0000,0000,,ST and Dialogue: 0,0:41:04.62,0:41:08.01,Default,,0000,0000,0000,,AT. And for sake of completeness, you should just think of this Dialogue: 0,0:41:08.01,0:41:10.13,Default,,0000,0000,0000,,as inverted pendulum problem, Dialogue: 0,0:41:10.13,0:41:11.19,Default,,0000,0000,0000,,so ST+1 is Dialogue: 0,0:41:11.19,0:41:13.29,Default,,0000,0000,0000,,a 4-dimensional vector. 
ST, AT Dialogue: 0,0:41:13.29,0:41:16.58,Default,,0000,0000,0000,,will be a 4-dimensional vector and that'll be a real number, Dialogue: 0,0:41:16.58,0:41:17.68,Default,,0000,0000,0000,,and so you might Dialogue: 0,0:41:17.68,0:41:21.19,Default,,0000,0000,0000,,run linear regression 4 times to predict each of these state variables as Dialogue: 0,0:41:21.19,0:41:22.75,Default,,0000,0000,0000,,a function Dialogue: 0,0:41:22.75,0:41:25.25,Default,,0000,0000,0000,,of each of these Dialogue: 0,0:41:25.25,0:41:26.98,Default,,0000,0000,0000,,5 real numbers Dialogue: 0,0:41:26.98,0:41:29.52,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:41:29.52,0:41:32.58,Default,,0000,0000,0000,,Just for example, if you Dialogue: 0,0:41:32.58,0:41:36.90,Default,,0000,0000,0000,,say that if you want to estimate your next state ST+1 as a linear Dialogue: 0,0:41:36.90,0:41:40.64,Default,,0000,0000,0000,,function Dialogue: 0,0:41:40.64,0:41:45.89,Default,,0000,0000,0000,,of your previous state in your action and so Dialogue: 0,0:41:45.89,0:41:51.47,Default,,0000,0000,0000,,A here will be a 4 by 4 matrix, Dialogue: 0,0:41:51.47,0:41:54.45,Default,,0000,0000,0000,,and B would be a 4-dimensional vector, Dialogue: 0,0:41:54.45,0:42:00.20,Default,,0000,0000,0000,,then Dialogue: 0,0:42:00.20,0:42:07.20,Default,,0000,0000,0000,,you would choose the values of A and B that minimize this. Dialogue: 0,0:42:32.27,0:42:35.22,Default,,0000,0000,0000,,So if you want your model to be that Dialogue: 0,0:42:35.22,0:42:38.85,Default,,0000,0000,0000,,ST+1 is some linear function of the previous stated action, Dialogue: 0,0:42:38.85,0:42:39.96,Default,,0000,0000,0000,,then you might Dialogue: 0,0:42:39.96,0:42:41.95,Default,,0000,0000,0000,,pose this optimization objective Dialogue: 0,0:42:41.95,0:42:45.13,Default,,0000,0000,0000,,and choose A and B to minimize the sum of squares error Dialogue: 0,0:42:45.13,0:42:52.13,Default,,0000,0000,0000,,in your predictive value for ST+1 as the linear function of ST Dialogue: 0,0:42:52.49,0:42:56.89,Default,,0000,0000,0000,,and AT. I Dialogue: 0,0:42:56.89,0:42:58.12,Default,,0000,0000,0000,,should say that Dialogue: 0,0:42:58.12,0:43:02.84,Default,,0000,0000,0000,,this is one specific example where you're using a linear function of the previous Dialogue: 0,0:43:02.84,0:43:06.40,Default,,0000,0000,0000,,stated action to predict the next state. Of course, you can also use other algorithms Dialogue: 0,0:43:06.40,0:43:08.47,Default,,0000,0000,0000,,like low [inaudible] weight to linear regression Dialogue: 0,0:43:08.47,0:43:12.08,Default,,0000,0000,0000,,or linear regression with nonlinear features or kernel linear Dialogue: 0,0:43:12.08,0:43:13.12,Default,,0000,0000,0000,,regression or whatever Dialogue: 0,0:43:13.12,0:43:16.38,Default,,0000,0000,0000,,to predict the next state as a nonlinear function of the current state as well, so this Dialogue: 0,0:43:16.38,0:43:23.38,Default,,0000,0000,0000,,is just [inaudible] linear problems. Dialogue: 0,0:43:25.47,0:43:27.82,Default,,0000,0000,0000,,And it turns out that low [inaudible] weight to linear Dialogue: 0,0:43:27.82,0:43:28.77,Default,,0000,0000,0000,,regression is Dialogue: 0,0:43:28.77,0:43:31.99,Default,,0000,0000,0000,,for many robots turns out to be an effective method for this learning Dialogue: 0,0:43:31.99,0:43:38.99,Default,,0000,0000,0000,,problem as well. 
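As a rough sketch of what this model-learning step might look like in code - purely illustrative, assuming the collected data is stored as (states, actions) pairs of numpy arrays, with states of shape (T+1, 4) and actions of shape (T, 1) for the inverted pendulum; the function and variable names here are not from the lecture:

    # Fit s_{t+1} ~ A s_t + B a_t by least squares over all recorded transitions.
    import numpy as np

    def fit_linear_model(trajectories):
        X, Y = [], []
        for states, actions in trajectories:
            for t in range(len(actions)):
                X.append(np.concatenate([states[t], actions[t]]))  # [s_t; a_t], 5 numbers
                Y.append(states[t + 1])                             # s_{t+1}, 4 numbers
        X, Y = np.array(X), np.array(Y)
        # Solve min_W ||X W - Y||^2; A and B are then slices of W.
        W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
        A = W[:4].T   # 4 x 4 matrix
        B = W[4:].T   # 4 x 1 vector
        return A, B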
Dialogue: 0,0:43:47.28,0:43:51.92,Default,,0000,0000,0000,,And so having learned to model, having learned the parameters A Dialogue: 0,0:43:51.92,0:43:52.68,Default,,0000,0000,0000,,and B, Dialogue: 0,0:43:52.68,0:43:58.98,Default,,0000,0000,0000,,you then have a model where you say that ST+1 is AST Dialogue: 0,0:43:58.98,0:44:05.36,Default,,0000,0000,0000,,plus BAT, and so that would be an example of a deterministic Dialogue: 0,0:44:05.36,0:44:06.22,Default,,0000,0000,0000,,model Dialogue: 0,0:44:07.95,0:44:10.75,Default,,0000,0000,0000,,or having learned the parameters A and B, Dialogue: 0,0:44:10.75,0:44:13.26,Default,,0000,0000,0000,,you might say that ST+1 is Dialogue: 0,0:44:13.26,0:44:15.23,Default,,0000,0000,0000,,equal to AST + Dialogue: 0,0:44:15.23,0:44:17.44,Default,,0000,0000,0000,,BAT Dialogue: 0,0:44:24.04,0:44:25.98,Default,,0000,0000,0000,,+ Dialogue: 0,0:44:25.98,0:44:28.25,Default,,0000,0000,0000,,?T. Dialogue: 0,0:44:28.25,0:44:29.92,Default,,0000,0000,0000,,And so these would be Dialogue: 0,0:44:29.92,0:44:32.70,Default,,0000,0000,0000,,very reasonable ways to come up with either Dialogue: 0,0:44:32.70,0:44:37.07,Default,,0000,0000,0000,,a deterministic or a stochastic model for your Dialogue: 0,0:44:37.07,0:44:39.33,Default,,0000,0000,0000,,inverted pendulum MDP. Dialogue: 0,0:44:39.33,0:44:41.52,Default,,0000,0000,0000,,And so Dialogue: 0,0:44:41.52,0:44:43.55,Default,,0000,0000,0000,,just to summarize, Dialogue: 0,0:44:43.55,0:44:48.24,Default,,0000,0000,0000,,what we have now is a model, meaning a piece of code, Dialogue: 0,0:44:48.24,0:44:53.46,Default,,0000,0000,0000,,where you can input a state and an action Dialogue: 0,0:44:53.46,0:44:55.08,Default,,0000,0000,0000,,and get an ST+1. Dialogue: 0,0:44:55.08,0:44:57.97,Default,,0000,0000,0000,,And so if you have a stochastic model, Dialogue: 0,0:44:57.97,0:45:02.42,Default,,0000,0000,0000,,then to influence this model, you would actually sample ?T from this [inaudible] Dialogue: 0,0:45:02.42,0:45:03.04,Default,,0000,0000,0000,,distribution Dialogue: 0,0:45:03.04,0:45:10.04,Default,,0000,0000,0000,,in order to generate ST+1. So it Dialogue: 0,0:45:13.18,0:45:18.27,Default,,0000,0000,0000,,actually turns out that in a Dialogue: 0,0:45:18.27,0:45:20.17,Default,,0000,0000,0000,,preview, I guess, in the next lecture Dialogue: 0,0:45:20.17,0:45:24.68,Default,,0000,0000,0000,,it actually turns out that in the specific case of linear dynamical systems, in the specific Dialogue: 0,0:45:24.68,0:45:28.56,Default,,0000,0000,0000,,case where the next state is a linear function of the current stated action, it Dialogue: 0,0:45:28.56,0:45:31.58,Default,,0000,0000,0000,,actually turns out that there're very powerful algorithms you can use. Dialogue: 0,0:45:31.58,0:45:35.64,Default,,0000,0000,0000,,So I'm actually not gonna talk about that today. I'll talk about that in the next lecture rather than Dialogue: 0,0:45:35.64,0:45:36.63,Default,,0000,0000,0000,,right now Dialogue: 0,0:45:38.56,0:45:42.76,Default,,0000,0000,0000,,but turns out that for many problems of inverted pendulum go if you use low [inaudible] weights Dialogue: 0,0:45:42.76,0:45:45.60,Default,,0000,0000,0000,,and linear regressionals and long linear Dialogue: 0,0:45:45.60,0:45:48.34,Default,,0000,0000,0000,,algorithm 'cause many systems aren't really Dialogue: 0,0:45:48.34,0:45:54.31,Default,,0000,0000,0000,,linear. You can build a nonlinear model. 
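And, as one illustrative sketch of the resulting black-box simulator, the learned A and B could be wrapped up as follows - with noise_cov left as None this is the deterministic model ST+1 = A ST + B AT, and with a covariance supplied it is the stochastic model that adds a sampled Gaussian epsilon T; again the names are only for illustration, not the lecture's own code:

    import numpy as np

    def make_simulator(A, B, noise_cov=None, rng=np.random.default_rng(0)):
        def step(s, a):
            s_next = A @ s + B @ np.atleast_1d(a)       # deterministic part
            if noise_cov is not None:                    # optional Gaussian noise
                s_next = s_next + rng.multivariate_normal(np.zeros(len(s)), noise_cov)
            return s_next
        return step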
Dialogue: 0,0:45:54.31,0:45:56.95,Default,,0000,0000,0000,,So what I wanna do now is talk about Dialogue: 0,0:45:56.95,0:45:57.33,Default,,0000,0000,0000,,given Dialogue: 0,0:45:57.33,0:46:00.51,Default,,0000,0000,0000,,a model, given a simulator for your Dialogue: 0,0:46:00.51,0:46:03.35,Default,,0000,0000,0000,,MDP, how to come up with an algorithm Dialogue: 0,0:46:03.35,0:46:07.32,Default,,0000,0000,0000,,to approximate the optimal value function V*. Dialogue: 0,0:46:07.32,0:46:14.32,Default,,0000,0000,0000,,Before I move on, let me check if there're questions on this. Okay, Dialogue: 0,0:46:38.34,0:46:42.58,Default,,0000,0000,0000,,cool. So here's Dialogue: 0,0:46:42.58,0:46:43.50,Default,,0000,0000,0000,,the idea. Dialogue: 0,0:46:43.50,0:46:46.89,Default,,0000,0000,0000,,Back when we talked about linear regression, we said that Dialogue: 0,0:46:46.89,0:46:52.12,Default,,0000,0000,0000,,given some inputs X in supervised learning, given the input features X, we Dialogue: 0,0:46:52.12,0:46:54.59,Default,,0000,0000,0000,,may choose some features of X Dialogue: 0,0:46:54.59,0:46:56.53,Default,,0000,0000,0000,,and then Dialogue: 0,0:46:57.23,0:46:58.82,Default,,0000,0000,0000,,approximate the target variable Dialogue: 0,0:46:58.82,0:47:02.23,Default,,0000,0000,0000,,as a linear function of various features of X, Dialogue: 0,0:47:02.23,0:47:04.93,Default,,0000,0000,0000,,and so we'll do exactly the same thing to Dialogue: 0,0:47:04.93,0:47:06.33,Default,,0000,0000,0000,,approximate Dialogue: 0,0:47:06.33,0:47:08.54,Default,,0000,0000,0000,,the optimal value function, Dialogue: 0,0:47:08.54,0:47:10.87,Default,,0000,0000,0000,,and in particular, Dialogue: 0,0:47:10.87,0:47:15.82,Default,,0000,0000,0000,,we'll choose some features Dialogue: 0,0:47:15.82,0:47:19.25,Default,,0000,0000,0000,,phi(s) of a Dialogue: 0,0:47:19.25,0:47:20.98,Default,,0000,0000,0000,,state Dialogue: 0,0:47:20.98,0:47:25.55,Default,,0000,0000,0000,,S. And so you could actually choose Dialogue: 0,0:47:25.55,0:47:29.53,Default,,0000,0000,0000,,phi(s) equals S. That would be one reasonable choice, if you want to approximate the value Dialogue: 0,0:47:29.53,0:47:30.16,Default,,0000,0000,0000,,function Dialogue: 0,0:47:30.16,0:47:32.76,Default,,0000,0000,0000,,as a linear function of the states, but you can Dialogue: 0,0:47:32.76,0:47:33.66,Default,,0000,0000,0000,,also Dialogue: 0,0:47:33.66,0:47:37.07,Default,,0000,0000,0000,,choose other things, so for example, Dialogue: 0,0:47:37.07,0:47:41.03,Default,,0000,0000,0000,,for the inverted pendulum example, you may choose phi(s) Dialogue: 0,0:47:41.03,0:47:45.05,Default,,0000,0000,0000,,to be equal to a vector of features that may be, say, a constant 1, or you may have Dialogue: 0,0:47:47.87,0:47:59.37,Default,,0000,0000,0000,,Xdot squared, maybe some cross terms, maybe theta times Xdot, maybe theta-dot squared, Dialogue: 0,0:47:59.37,0:48:03.66,Default,,0000,0000,0000,,and so on.
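For concreteness, one possible feature map of the kind just described might look like the following sketch for the 4-dimensional inverted-pendulum state; the particular features chosen here are only an example, not a recommendation from the lecture:

    import numpy as np

    def phi(s):
        # s = (x, x_dot, theta, theta_dot); constant term, raw state, squares, one cross term
        x, x_dot, theta, theta_dot = s
        return np.array([1.0, x, x_dot, theta, theta_dot,
                         theta**2, theta_dot**2, theta * x_dot])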
So Dialogue: 0,0:48:03.66,0:48:07.21,Default,,0000,0000,0000,,you choose some vector of features Dialogue: 0,0:48:07.21,0:48:14.21,Default,,0000,0000,0000,,and then approximate Dialogue: 0,0:48:16.82,0:48:21.76,Default,,0000,0000,0000,,the value function, the value of a state, as being equal to theta Dialogue: 0,0:48:21.76,0:48:24.29,Default,,0000,0000,0000,,transpose Dialogue: 0,0:48:24.29,0:48:28.42,Default,,0000,0000,0000,,times the features. And I should apologize in advance; I'm overloading notation here. Dialogue: 0,0:48:28.42,0:48:29.84,Default,,0000,0000,0000,,It's unfortunate. I Dialogue: 0,0:48:29.84,0:48:34.35,Default,,0000,0000,0000,,use theta both to denote the angle of the pole of the inverted Dialogue: 0,0:48:34.35,0:48:37.55,Default,,0000,0000,0000,,pendulum, so that's known as Dialogue: 0,0:48:37.55,0:48:39.14,Default,,0000,0000,0000,,the angle theta, Dialogue: 0,0:48:39.14,0:48:43.28,Default,,0000,0000,0000,,but also using theta to denote the vector of parameters in my learning algorithm. Dialogue: 0,0:48:43.28,0:48:43.91,Default,,0000,0000,0000,,So sorry Dialogue: 0,0:48:43.91,0:48:49.94,Default,,0000,0000,0000,,about the overloaded notation. Dialogue: 0,0:48:49.94,0:48:51.60,Default,,0000,0000,0000,, Dialogue: 0,0:48:51.60,0:48:55.44,Default,,0000,0000,0000,,Just like we did in linear regression, my goal is to come up Dialogue: 0,0:48:55.44,0:48:56.60,Default,,0000,0000,0000,,with Dialogue: 0,0:48:56.60,0:49:00.34,Default,,0000,0000,0000,,a linear combination of the features that gives me a good approximation Dialogue: 0,0:49:00.34,0:49:01.93,Default,,0000,0000,0000,,to the value function Dialogue: 0,0:49:01.93,0:49:07.66,Default,,0000,0000,0000,,and this is completely analogous to when we said that in linear regression Dialogue: 0,0:49:07.66,0:49:09.50,Default,,0000,0000,0000,,our estimate Dialogue: 0,0:49:13.37,0:49:18.05,Default,,0000,0000,0000,,of my response variable Y was a linear function of the features of the Dialogue: 0,0:49:18.05,0:49:23.46,Default,,0000,0000,0000,,input. That's what we have Dialogue: 0,0:49:23.46,0:49:30.46,Default,,0000,0000,0000,,in Dialogue: 0,0:49:40.11,0:49:47.08,Default,,0000,0000,0000,,linear regression. Let Dialogue: 0,0:49:47.08,0:49:50.78,Default,,0000,0000,0000,,me just write Dialogue: 0,0:49:50.78,0:49:55.86,Default,,0000,0000,0000,,down value iteration again and then I'll write down an approximation to value Dialogue: 0,0:49:55.86,0:49:57.20,Default,,0000,0000,0000,,iteration, so Dialogue: 0,0:49:57.20,0:50:00.76,Default,,0000,0000,0000,,for discrete states, Dialogue: 0,0:50:00.76,0:50:03.93,Default,,0000,0000,0000,,this is the idea behind value iteration and we said that Dialogue: 0,0:50:03.93,0:50:08.85,Default,,0000,0000,0000,,V(s) will be Dialogue: 0,0:50:08.85,0:50:11.61,Default,,0000,0000,0000,,updated Dialogue: 0,0:50:11.61,0:50:17.40,Default,,0000,0000,0000,,as R(s) + G Dialogue: 0,0:50:17.40,0:50:19.53,Default,,0000,0000,0000,,times the max over actions a of the sum over states S prime of P_sa(S prime) V(S prime). That was value iteration Dialogue: 0,0:50:19.53,0:50:24.13,Default,,0000,0000,0000,,and in the case of continuous states, this sum would be replaced by an integral, an Dialogue: 0,0:50:24.13,0:50:31.13,Default,,0000,0000,0000,,integral over states rather than a sum over states. Dialogue: 0,0:50:31.39,0:50:34.34,Default,,0000,0000,0000,,Let me just write this Dialogue: 0,0:50:34.34,0:50:35.67,Default,,0000,0000,0000,,as Dialogue: 0,0:50:38.81,0:50:42.59,Default,,0000,0000,0000,,R(s) + G times the max over a of an expectation of V(S prime), rather than that sum over S prime.
Dialogue: 0,0:50:42.59,0:50:45.15,Default,,0000,0000,0000,,That's really an expectation Dialogue: 0,0:50:45.15,0:50:47.40,Default,,0000,0000,0000,,with respect to random state as Dialogue: 0,0:50:47.40,0:50:54.40,Default,,0000,0000,0000,,prime drawn from the state transition probabilities piece SA Dialogue: 0,0:50:55.05,0:50:56.36,Default,,0000,0000,0000,,of V(s) Dialogue: 0,0:50:56.36,0:51:00.34,Default,,0000,0000,0000,,prime. So this is a sum of all states S prime with the Dialogue: 0,0:51:00.34,0:51:01.60,Default,,0000,0000,0000,,probability of going to S prime (value), Dialogue: 0,0:51:01.60,0:51:05.25,Default,,0000,0000,0000,,so that's really an expectation over the random state S prime flowing from PSA of Dialogue: 0,0:51:05.25,0:51:09.37,Default,,0000,0000,0000,,that. Dialogue: 0,0:51:09.37,0:51:11.99,Default,,0000,0000,0000,,And so what I'll do now is Dialogue: 0,0:51:11.99,0:51:15.20,Default,,0000,0000,0000,,write down an algorithm called Dialogue: 0,0:51:15.20,0:51:22.20,Default,,0000,0000,0000,,fitted value iteration Dialogue: 0,0:51:25.29,0:51:27.95,Default,,0000,0000,0000,,that's in Dialogue: 0,0:51:27.95,0:51:30.60,Default,,0000,0000,0000,,approximation to this Dialogue: 0,0:51:30.60,0:51:33.81,Default,,0000,0000,0000,,but specifically for continuous states. Dialogue: 0,0:51:33.81,0:51:38.52,Default,,0000,0000,0000,,I just wrote down the first two steps, and then I'll continue on the next board, so the Dialogue: 0,0:51:38.52,0:51:43.50,Default,,0000,0000,0000,,first step of the algorithm is we'll sample. Dialogue: 0,0:51:43.50,0:51:45.27,Default,,0000,0000,0000,,Choose some set of states Dialogue: 0,0:51:45.27,0:51:52.27,Default,,0000,0000,0000,,at Dialogue: 0,0:51:53.25,0:51:55.12,Default,,0000,0000,0000,,random. Dialogue: 0,0:51:55.12,0:51:59.87,Default,,0000,0000,0000,,So sample S-1, S-2 through S-M randomly so choose a Dialogue: 0,0:51:59.87,0:52:02.47,Default,,0000,0000,0000,,set of states randomly Dialogue: 0,0:52:02.47,0:52:08.21,Default,,0000,0000,0000,,and Dialogue: 0,0:52:08.21,0:52:12.02,Default,,0000,0000,0000,,initialize my parameter vector to be equal to zero. This is Dialogue: 0,0:52:12.02,0:52:16.19,Default,,0000,0000,0000,,analogous to in value iteration where I Dialogue: 0,0:52:16.19,0:52:21.14,Default,,0000,0000,0000,,might initialize the value function to be the function of all zeros. Then here's the Dialogue: 0,0:52:21.14,0:52:28.14,Default,,0000,0000,0000,,end view for the algorithm. Dialogue: 0,0:52:42.36,0:52:44.12,Default,,0000,0000,0000,,Got Dialogue: 0,0:52:44.12,0:52:51.12,Default,,0000,0000,0000,,quite a lot to Dialogue: 0,0:52:59.19,0:53:06.19,Default,,0000,0000,0000,,write Dialogue: 0,0:55:05.41,0:55:12.41,Default,,0000,0000,0000,,actually. Dialogue: 0,0:55:28.60,0:55:29.64,Default,,0000,0000,0000,,Let's Dialogue: 0,0:55:29.64,0:55:36.64,Default,,0000,0000,0000,,see. Dialogue: 0,0:56:10.42,0:56:12.01,Default,,0000,0000,0000,,And so Dialogue: 0,0:56:12.01,0:56:17.84,Default,,0000,0000,0000,,that's the algorithm. Dialogue: 0,0:56:17.84,0:56:19.88,Default,,0000,0000,0000,,Let me just adjust the writing. Give me a second. Give me a minute Dialogue: 0,0:56:19.88,0:56:26.88,Default,,0000,0000,0000,,to finish and then I'll step through this. Dialogue: 0,0:56:26.99,0:56:28.49,Default,,0000,0000,0000,,Actually, if some of my Dialogue: 0,0:56:28.49,0:56:35.49,Default,,0000,0000,0000,,handwriting is eligible, let me know. So Dialogue: 0,0:56:57.56,0:57:02.08,Default,,0000,0000,0000,,let me step through this and so briefly explain the rationale. 
Dialogue: 0,0:57:02.08,0:57:08.14,Default,,0000,0000,0000,,So the heart of the algorithm is - let's see. Dialogue: 0,0:57:08.14,0:57:10.87,Default,,0000,0000,0000,,In the original value iteration algorithm, Dialogue: 0,0:57:10.87,0:57:13.24,Default,,0000,0000,0000,,we would take the value for each state, Dialogue: 0,0:57:13.24,0:57:14.80,Default,,0000,0000,0000,,V(s_i), Dialogue: 0,0:57:14.80,0:57:17.60,Default,,0000,0000,0000,,and we will overwrite it with this Dialogue: 0,0:57:17.60,0:57:21.66,Default,,0000,0000,0000,,expression here. In the original discrete value iteration algorithm Dialogue: 0,0:57:21.66,0:57:23.20,Default,,0000,0000,0000,,we would take V(s_i) Dialogue: 0,0:57:23.20,0:57:23.94,Default,,0000,0000,0000,,and we would Dialogue: 0,0:57:23.94,0:57:25.83,Default,,0000,0000,0000,,set V(s_i) Dialogue: 0,0:57:25.83,0:57:28.59,Default,,0000,0000,0000,,to be equal to that. Dialogue: 0,0:57:28.59,0:57:33.13,Default,,0000,0000,0000,,Now, in the continuous state case, we have an infinite continuous set of Dialogue: 0,0:57:33.13,0:57:34.20,Default,,0000,0000,0000,,states and so Dialogue: 0,0:57:34.20,0:57:37.86,Default,,0000,0000,0000,,you can't discretely set the value of each of these to that. Dialogue: 0,0:57:37.86,0:57:42.51,Default,,0000,0000,0000,,So what we'll do instead is choose the parameters theta Dialogue: 0,0:57:42.51,0:57:46.85,Default,,0000,0000,0000,,so that V(s_i) is as close as possible to Dialogue: 0,0:57:46.85,0:57:49.93,Default,,0000,0000,0000,,this thing on the right hand side Dialogue: 0,0:57:49.93,0:57:54.51,Default,,0000,0000,0000,,instead. And this is what y_i turns out to be. Dialogue: 0,0:57:54.51,0:57:58.48,Default,,0000,0000,0000,,So concretely, what I'm going to do is I'm going to construct Dialogue: 0,0:57:58.48,0:58:00.58,Default,,0000,0000,0000,,estimates of this term, Dialogue: 0,0:58:00.58,0:58:04.28,Default,,0000,0000,0000,,and then I'm going to choose the parameters of my function approximator. I'm Dialogue: 0,0:58:04.28,0:58:07.73,Default,,0000,0000,0000,,gonna choose my parameters theta so that V(s_i) is as close as possible to Dialogue: 0,0:58:07.73,0:58:10.62,Default,,0000,0000,0000,,these. That's what y_i is, Dialogue: 0,0:58:10.62,0:58:11.98,Default,,0000,0000,0000,,and specifically, Dialogue: 0,0:58:11.98,0:58:15.14,Default,,0000,0000,0000,,what I'm going to do is I'll choose parameters theta Dialogue: 0,0:58:15.14,0:58:18.49,Default,,0000,0000,0000,,to minimize the sum of squared differences between theta transpose Dialogue: 0,0:58:18.49,0:58:20.41,Default,,0000,0000,0000,,phi(s_i) and y_i. Dialogue: 0,0:58:20.41,0:58:21.55,Default,,0000,0000,0000,,This thing here Dialogue: 0,0:58:21.55,0:58:24.49,Default,,0000,0000,0000,,is just V(s_i) because I'm approximating V(s_i) Dialogue: 0,0:58:24.49,0:58:26.22,Default,,0000,0000,0000,,as a linear function of phi(s_i) Dialogue: 0,0:58:26.22,0:58:27.88,Default,,0000,0000,0000,,and so I choose Dialogue: 0,0:58:27.88,0:58:32.02,Default,,0000,0000,0000,,the parameters theta to minimize the sum of squared differences. Dialogue: 0,0:58:32.02,0:58:34.83,Default,,0000,0000,0000,,So this last step is basically Dialogue: 0,0:58:34.83,0:58:38.76,Default,,0000,0000,0000,,the approximation version of value Dialogue: 0,0:58:38.76,0:58:40.61,Default,,0000,0000,0000,,iteration.
What everything else above Dialogue: 0,0:58:40.61,0:58:42.42,Default,,0000,0000,0000,,was doing was just Dialogue: 0,0:58:42.42,0:58:44.75,Default,,0000,0000,0000,,coming up with an approximation Dialogue: 0,0:58:44.75,0:58:46.19,Default,,0000,0000,0000,,to this term, to Dialogue: 0,0:58:46.19,0:58:47.50,Default,,0000,0000,0000,,this thing here Dialogue: 0,0:58:47.50,0:58:50.98,Default,,0000,0000,0000,,and which I was calling YI. Dialogue: 0,0:58:50.98,0:58:54.20,Default,,0000,0000,0000,,And so confluently, for every state SI we want to Dialogue: 0,0:58:54.20,0:58:57.16,Default,,0000,0000,0000,,estimate what the thing on the right hand side is Dialogue: 0,0:58:57.16,0:58:57.98,Default,,0000,0000,0000,,and but Dialogue: 0,0:58:57.98,0:59:01.98,Default,,0000,0000,0000,,there's an expectation here. There's an expectation over a continuous set of states, Dialogue: 0,0:59:01.98,0:59:06.61,Default,,0000,0000,0000,,may be a very high dimensional state so I can't compute this expectation exactly. Dialogue: 0,0:59:06.61,0:59:09.81,Default,,0000,0000,0000,,What I'll do instead is I'll use my simulator to sample Dialogue: 0,0:59:09.81,0:59:13.97,Default,,0000,0000,0000,,a set of states from this distribution from this P substrip, Dialogue: 0,0:59:13.97,0:59:16.96,Default,,0000,0000,0000,,SIA, from the state transition distribution Dialogue: 0,0:59:16.96,0:59:20.67,Default,,0000,0000,0000,,of where I get to if I take the action A in the state as I, Dialogue: 0,0:59:20.67,0:59:23.92,Default,,0000,0000,0000,,and then I'll average over that sample of states Dialogue: 0,0:59:23.92,0:59:26.82,Default,,0000,0000,0000,,to compute this expectation. Dialogue: 0,0:59:26.82,0:59:30.61,Default,,0000,0000,0000,,And so stepping through the algorithm just says that for Dialogue: 0,0:59:30.61,0:59:31.50,Default,,0000,0000,0000,,each Dialogue: 0,0:59:31.50,0:59:36.33,Default,,0000,0000,0000,,state and for each action, I'm going to sample a set of states. This S prime 1 through S Dialogue: 0,0:59:36.33,0:59:40.66,Default,,0000,0000,0000,,prime K from that state transition distribution, still using the model, and Dialogue: 0,0:59:40.66,0:59:43.91,Default,,0000,0000,0000,,then I'll set Q(a) to be equal to that average Dialogue: 0,0:59:43.91,0:59:45.49,Default,,0000,0000,0000,,and so this is my Dialogue: 0,0:59:45.49,0:59:48.43,Default,,0000,0000,0000,,estimate for R(s)I + G(this Dialogue: 0,0:59:48.43,0:59:50.40,Default,,0000,0000,0000,,expected value Dialogue: 0,0:59:50.40,0:59:53.63,Default,,0000,0000,0000,,for that specific action A). Dialogue: 0,0:59:53.63,0:59:56.76,Default,,0000,0000,0000,,Then I'll take the maximum of actions A Dialogue: 0,0:59:56.76,1:00:00.89,Default,,0000,0000,0000,,and this gives me YI, and so YI is for S for that. Dialogue: 0,1:00:00.89,1:00:02.92,Default,,0000,0000,0000,,And finally, I'll run Dialogue: 0,1:00:02.92,1:00:06.78,Default,,0000,0000,0000,,really run linear regression which is that last of the set [inaudible] to get Dialogue: 0,1:00:06.78,1:00:07.79,Default,,0000,0000,0000,,V(s)I Dialogue: 0,1:00:07.79,1:00:11.48,Default,,0000,0000,0000,,to be close to the YIs. 
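Putting the steps just described together, a minimal sketch of fitted value iteration might look like the following - assuming a sampled list of states, a list of discrete actions, a stochastic simulator step(s, a), a reward function R(s), the feature map phi(s), and a discount gamma; the code and its names are illustrative, not the lecture's own implementation:

    import numpy as np

    def fitted_value_iteration(states, actions, step, R, phi, gamma,
                               k=10, num_iters=50):
        theta = np.zeros(len(phi(states[0])))          # initialize theta := 0
        Phi = np.array([phi(s) for s in states])       # m x d design matrix
        for _ in range(num_iters):
            y = []
            for s in states:
                q = []
                for a in actions:
                    # sample k next states s' ~ P_{sa} using the simulator
                    samples = [step(s, a) for _ in range(k)]
                    # estimate q(a) = R(s) + gamma * E[V(s')], with V(s') = theta^T phi(s')
                    q.append(R(s) + gamma * np.mean([theta @ phi(sp) for sp in samples]))
                y.append(max(q))                       # y_i = max_a q(a)
            # choose theta to minimize sum_i (theta^T phi(s_i) - y_i)^2
            theta, _, _, _ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
        return theta

With a deterministic simulator, k can be set to 1, as comes up a little later in the discussion.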
Dialogue: 0,1:00:11.48,1:00:16.18,Default,,0000,0000,0000,,And so this algorithm is called fitted value iteration and it actually often Dialogue: 0,1:00:16.18,1:00:19.39,Default,,0000,0000,0000,,works quite well for continuous, Dialogue: 0,1:00:19.39,1:00:22.66,Default,,0000,0000,0000,,for problems with anywhere from Dialogue: 0,1:00:22.66,1:00:26.01,Default,,0000,0000,0000,,6- to 10- to 20-dimensional state spaces Dialogue: 0,1:00:33.95,1:00:39.50,Default,,0000,0000,0000,,if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Dialogue: 0,1:00:39.50,1:00:40.79,Default,,0000,0000,0000,,Some of you didn't have your hands up. Dialogue: 0,1:00:40.79,1:00:47.28,Default,,0000,0000,0000,,Are there questions for those, Dialogue: 0,1:00:48.41,1:00:51.34,Default,,0000,0000,0000,,yeah? Is there a recess [inaudible] function in this setup? Dialogue: 0,1:00:51.34,1:00:55.67,Default,,0000,0000,0000,,Oh, yes. An MDP comprises Dialogue: 0,1:00:55.67,1:00:57.19,Default,,0000,0000,0000,,SA, Dialogue: 0,1:00:57.19,1:00:59.87,Default,,0000,0000,0000,,the state transition probabilities G Dialogue: 0,1:00:59.87,1:01:01.23,Default,,0000,0000,0000,,and Dialogue: 0,1:01:01.23,1:01:02.38,Default,,0000,0000,0000,,R Dialogue: 0,1:01:02.38,1:01:03.36,Default,,0000,0000,0000,,and so Dialogue: 0,1:01:03.36,1:01:08.36,Default,,0000,0000,0000,,for continuous state spaces, S would be like R4 for the inverted pendulum Dialogue: 0,1:01:08.36,1:01:09.41,Default,,0000,0000,0000,,or something. Dialogue: 0,1:01:09.41,1:01:13.53,Default,,0000,0000,0000,,Actions with discretized state transitions probabilities with specifying with the model Dialogue: 0,1:01:13.53,1:01:14.78,Default,,0000,0000,0000,,or Dialogue: 0,1:01:14.78,1:01:17.47,Default,,0000,0000,0000,,the simulator, Dialogue: 0,1:01:17.47,1:01:19.55,Default,,0000,0000,0000,,G is just a real number like .99 Dialogue: 0,1:01:19.55,1:01:23.60,Default,,0000,0000,0000,,and the real function is usually a function that's given to you. And so Dialogue: 0,1:01:23.60,1:01:26.10,Default,,0000,0000,0000,,the reward function Dialogue: 0,1:01:26.10,1:01:30.69,Default,,0000,0000,0000,,is some function of your 4-dimensional state, Dialogue: 0,1:01:30.69,1:01:31.49,Default,,0000,0000,0000,,and Dialogue: 0,1:01:31.49,1:01:32.99,Default,,0000,0000,0000,,for example, you Dialogue: 0,1:01:32.99,1:01:39.99,Default,,0000,0000,0000,,might choose a reward function to be minus " I don't know. Dialogue: 0,1:01:40.03,1:01:47.03,Default,,0000,0000,0000,,Just for an example of simple reward function, if we want a -1 if the pole has fallen Dialogue: 0,1:01:47.82,1:01:51.64,Default,,0000,0000,0000,,and there it depends on you choose your reward function to be -1 if Dialogue: 0,1:01:51.64,1:01:52.29,Default,,0000,0000,0000,,the Dialogue: 0,1:01:52.29,1:01:54.62,Default,,0000,0000,0000,,inverted pendulum falls over Dialogue: 0,1:01:54.62,1:01:58.70,Default,,0000,0000,0000,,and to find that its angle is greater than 30° or something Dialogue: 0,1:01:58.70,1:02:02.18,Default,,0000,0000,0000,,and zero otherwise. Dialogue: 0,1:02:02.18,1:02:03.94,Default,,0000,0000,0000,,So that would be an example Dialogue: 0,1:02:03.94,1:02:06.64,Default,,0000,0000,0000,,of a reward function that you can choose for the inverted pendulum, but yes, assuming a Dialogue: 0,1:02:06.64,1:02:09.62,Default,,0000,0000,0000,,reward function is given to you Dialogue: 0,1:02:09.62,1:02:14.59,Default,,0000,0000,0000,,so that you can compute R(s)I for any state. 
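That example reward, written out as a small sketch using the 30-degree threshold just mentioned (purely illustrative):

    import numpy as np

    def R(s):
        # -1 if the pole has fallen past 30 degrees, 0 otherwise
        x, x_dot, theta, theta_dot = s
        return -1.0 if abs(theta) > np.deg2rad(30) else 0.0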
Dialogue: 0,1:02:14.59,1:02:21.59,Default,,0000,0000,0000,,Are there other questions? Dialogue: 0,1:02:39.69,1:02:40.62,Default,,0000,0000,0000,,Actually, let Dialogue: 0,1:02:40.62,1:02:47.62,Default,,0000,0000,0000,,me try Dialogue: 0,1:02:50.46,1:02:57.46,Default,,0000,0000,0000,, Dialogue: 0,1:03:23.98,1:03:28.48,Default,,0000,0000,0000,,asking a question, so everything I did here assume that we have a stochastic simulator. Dialogue: 0,1:03:28.48,1:03:31.88,Default,,0000,0000,0000,,So it Dialogue: 0,1:03:31.88,1:03:35.39,Default,,0000,0000,0000,,turns out I can simply this algorithm if I have a deterministic simulator, Dialogue: 0,1:03:35.39,1:03:36.51,Default,,0000,0000,0000,,but Dialogue: 0,1:03:36.51,1:03:38.77,Default,,0000,0000,0000,,deterministic simulator is given a Dialogue: 0,1:03:38.77,1:03:42.08,Default,,0000,0000,0000,,stated action, my next state is always exactly determined. Dialogue: 0,1:03:42.08,1:03:45.57,Default,,0000,0000,0000,,So let me ask you, if I have a deterministic simulator, Dialogue: 0,1:03:45.57,1:03:52.57,Default,,0000,0000,0000,,how would I change this algorithm? How would I simplify this algorithm? Lower your samples that you're drawing [inaudible]. Dialogue: 0,1:03:53.93,1:03:57.71,Default,,0000,0000,0000,, Dialogue: 0,1:03:57.71,1:03:58.72,Default,,0000,0000,0000,,Right, so Dialogue: 0,1:03:58.72,1:04:01.25,Default,,0000,0000,0000,,Justin's going right. If I have a deterministic simulator, Dialogue: 0,1:04:01.25,1:04:04.87,Default,,0000,0000,0000,,all my samples from those would be exactly the same, and so if I have a Dialogue: 0,1:04:04.87,1:04:06.15,Default,,0000,0000,0000,,deterministic simulator, Dialogue: 0,1:04:06.15,1:04:09.11,Default,,0000,0000,0000,,I can set K to be equal to 1, Dialogue: 0,1:04:09.11,1:04:10.93,Default,,0000,0000,0000,,so I don't need to draw Dialogue: 0,1:04:10.93,1:04:15.87,Default,,0000,0000,0000,,K different samples. I really only need one sample if I have a Dialogue: 0,1:04:15.87,1:04:17.30,Default,,0000,0000,0000,,deterministic simulator, Dialogue: 0,1:04:17.30,1:04:21.67,Default,,0000,0000,0000,,so you can simplify this by setting K=1 if you have a deterministic simulator. Yeah? I guess I'm really Dialogue: 0,1:04:21.67,1:04:28.16,Default,,0000,0000,0000,,confused about the, Dialogue: 0,1:04:28.16,1:04:29.03,Default,,0000,0000,0000,,yeah, Dialogue: 0,1:04:29.03,1:04:34.43,Default,,0000,0000,0000,,we sorta turned this [inaudible] into something that looks like linear Dialogue: 0,1:04:34.43,1:04:37.25,Default,,0000,0000,0000,,state Dialogue: 0,1:04:37.25,1:04:40.09,Default,,0000,0000,0000,,regression or some' you know the data transpose times something that Dialogue: 0,1:04:40.09,1:04:43.89,Default,,0000,0000,0000,,we're used to but I guess I'm a Dialogue: 0,1:04:43.89,1:04:45.37,Default,,0000,0000,0000,,little. I don't know really know what question to ask but like when we did this before we Dialogue: 0,1:04:45.37,1:04:48.33,Default,,0000,0000,0000,,had like discrete states and everything. We Dialogue: 0,1:04:48.33,1:04:50.02,Default,,0000,0000,0000,,were determined with Dialogue: 0,1:04:50.02,1:04:53.19,Default,,0000,0000,0000,,finding this optimal policy Dialogue: 0,1:04:53.19,1:04:56.71,Default,,0000,0000,0000,,and I Dialogue: 0,1:04:56.71,1:05:01.87,Default,,0000,0000,0000,,guess it doesn't look like we haven't said the word policy in a while so kinda Dialogue: 0,1:05:01.87,1:05:05.01,Default,,0000,0000,0000,,difficult. 
Okay, yeah, so [inaudible] matters back to policy but maybe I should Dialogue: 0,1:05:05.01,1:05:06.82,Default,,0000,0000,0000,,just say a couple words so Dialogue: 0,1:05:06.82,1:05:10.66,Default,,0000,0000,0000,,let me actually try to get at some of what maybe what you're saying. Dialogue: 0,1:05:10.66,1:05:14.25,Default,,0000,0000,0000,,Our strategy for finding optimal policy Dialogue: 0,1:05:14.25,1:05:15.08,Default,,0000,0000,0000,,has been to Dialogue: 0,1:05:15.08,1:05:19.27,Default,,0000,0000,0000,,find some way to find V Dialogue: 0,1:05:19.27,1:05:21.17,Default,,0000,0000,0000,,and Dialogue: 0,1:05:21.17,1:05:24.09,Default,,0000,0000,0000,,some of approximations Dialogue: 0,1:05:24.09,1:05:28.55,Default,,0000,0000,0000,,of ?*. So far everything I've been doing has been focused on how to find V*. I just Dialogue: 0,1:05:28.55,1:05:30.48,Default,,0000,0000,0000,,want Dialogue: 0,1:05:33.12,1:05:36.36,Default,,0000,0000,0000,,to say one more word. It actually turns out that Dialogue: 0,1:05:36.36,1:05:40.73,Default,,0000,0000,0000,,for linear regression it's often very easy. It's often not terribly difficult to Dialogue: 0,1:05:40.73,1:05:43.02,Default,,0000,0000,0000,,choose some resource of the features. Choosing Dialogue: 0,1:05:43.02,1:05:46.98,Default,,0000,0000,0000,,features for approximating the value function is often somewhat Dialogue: 0,1:05:46.98,1:05:48.40,Default,,0000,0000,0000,,harder, Dialogue: 0,1:05:48.40,1:05:50.97,Default,,0000,0000,0000,,so because the value of a state is Dialogue: 0,1:05:50.97,1:05:52.02,Default,,0000,0000,0000,,how good is Dialogue: 0,1:05:52.02,1:05:54.38,Default,,0000,0000,0000,,starting off in this state. Dialogue: 0,1:05:54.38,1:05:56.04,Default,,0000,0000,0000,,What is my Dialogue: 0,1:05:56.04,1:05:59.61,Default,,0000,0000,0000,,expected sum of discounted rewards? What if I start in a certain state? Dialogue: 0,1:05:59.61,1:06:00.41,Default,,0000,0000,0000,,And so Dialogue: 0,1:06:00.41,1:06:03.56,Default,,0000,0000,0000,,what the feature of the state have to measure is really Dialogue: 0,1:06:03.56,1:06:07.89,Default,,0000,0000,0000,,how good is it to start in a certain state? Dialogue: 0,1:06:07.89,1:06:12.21,Default,,0000,0000,0000,,And so for inverted pendulum you actually have that Dialogue: 0,1:06:12.21,1:06:15.90,Default,,0000,0000,0000,,states where the poles are vertical and when a cart that's centered on your Dialogue: 0,1:06:15.90,1:06:18.08,Default,,0000,0000,0000,,track or something, maybe better and so you can Dialogue: 0,1:06:18.08,1:06:21.30,Default,,0000,0000,0000,,come up with features that measure the orientation of the pole and how close Dialogue: 0,1:06:21.30,1:06:25.69,Default,,0000,0000,0000,,you are to the center of the track and so on and those will be reasonable features to use Dialogue: 0,1:06:25.69,1:06:28.33,Default,,0000,0000,0000,,to approximate V*. Dialogue: 0,1:06:28.33,1:06:30.25,Default,,0000,0000,0000,,Although in general it is true that Dialogue: 0,1:06:30.25,1:06:33.86,Default,,0000,0000,0000,,choosing features, the value function approximation, is often slightly trickier Dialogue: 0,1:06:33.86,1:06:39.72,Default,,0000,0000,0000,,than choosing good features for linear regression. Okay and then Justin's Dialogue: 0,1:06:39.72,1:06:40.71,Default,,0000,0000,0000,,questions of Dialogue: 0,1:06:40.71,1:06:42.53,Default,,0000,0000,0000,,so given Dialogue: 0,1:06:42.53,1:06:47.11,Default,,0000,0000,0000,,V*, how do you go back to actually find a policy? 
Dialogue: 0,1:06:47.11,1:06:49.80,Default,,0000,0000,0000,,In the discrete case, so we Dialogue: 0,1:06:49.80,1:06:54.11,Default,,0000,0000,0000,,have that ?*(s) is equal to all Dialogue: 0,1:06:54.11,1:07:01.11,Default,,0000,0000,0000,,[inaudible] over A of Dialogue: 0,1:07:04.32,1:07:07.47,Default,,0000,0000,0000,,that. So that's again, I used to write this as a sum over states [inaudible]. I'll just write this Dialogue: 0,1:07:07.47,1:07:12.87,Default,,0000,0000,0000,,as an expectation Dialogue: 0,1:07:12.87,1:07:15.61,Default,,0000,0000,0000,,and so then once you find the optimal value Dialogue: 0,1:07:15.61,1:07:16.74,Default,,0000,0000,0000,,function V*, Dialogue: 0,1:07:16.74,1:07:18.58,Default,,0000,0000,0000,,you can then Dialogue: 0,1:07:18.58,1:07:20.35,Default,,0000,0000,0000,,find the optimal policy ?* by computing the Dialogue: 0,1:07:20.35,1:07:22.58,Default,,0000,0000,0000,,[inaudible]. Dialogue: 0,1:07:22.58,1:07:25.22,Default,,0000,0000,0000,,So if you're in a continuous state MDP, Dialogue: 0,1:07:25.22,1:07:26.43,Default,,0000,0000,0000,,then Dialogue: 0,1:07:26.43,1:07:30.41,Default,,0000,0000,0000,,you can't actually do this in advance for every single state because there's an Dialogue: 0,1:07:30.41,1:07:31.90,Default,,0000,0000,0000,,infinite number of states Dialogue: 0,1:07:31.90,1:07:34.81,Default,,0000,0000,0000,,and so you can't actually perform this computation in advance to every single Dialogue: 0,1:07:34.81,1:07:36.43,Default,,0000,0000,0000,,state. Dialogue: 0,1:07:36.43,1:07:38.95,Default,,0000,0000,0000,,What you do instead is Dialogue: 0,1:07:38.95,1:07:43.21,Default,,0000,0000,0000,,whenever your robot is in some specific state S is only when your system Dialogue: 0,1:07:43.21,1:07:48.07,Default,,0000,0000,0000,,is in some specific state S like your car is at some position orientation or Dialogue: 0,1:07:48.07,1:07:49.83,Default,,0000,0000,0000,,your inverted pendulum Dialogue: 0,1:07:49.83,1:07:52.95,Default,,0000,0000,0000,,is in some specific position, posed in some Dialogue: 0,1:07:52.95,1:07:56.53,Default,,0000,0000,0000,,specific angle T. It's Dialogue: 0,1:07:56.53,1:07:58.49,Default,,0000,0000,0000,,only when Dialogue: 0,1:07:58.49,1:08:03.07,Default,,0000,0000,0000,,your system, be it a factor or a board game or a robot, Dialogue: 0,1:08:03.07,1:08:05.30,Default,,0000,0000,0000,,is in some specific state S Dialogue: 0,1:08:05.30,1:08:08.81,Default,,0000,0000,0000,,that you would then go ahead and compute this augmax, so Dialogue: 0,1:08:08.81,1:08:10.23,Default,,0000,0000,0000,,it's only when Dialogue: 0,1:08:10.23,1:08:15.78,Default,,0000,0000,0000,,you're in some state S that you then compute this augmax, and then Dialogue: 0,1:08:15.78,1:08:18.10,Default,,0000,0000,0000,,you Dialogue: 0,1:08:18.10,1:08:22.59,Default,,0000,0000,0000,, Dialogue: 0,1:08:22.59,1:08:24.44,Default,,0000,0000,0000,,execute that action A Dialogue: 0,1:08:24.44,1:08:25.16,Default,,0000,0000,0000,,and then Dialogue: 0,1:08:25.16,1:08:29.49,Default,,0000,0000,0000,,as a result of your action, your robot would transition to some new state and then Dialogue: 0,1:08:29.49,1:08:33.76,Default,,0000,0000,0000,,so it'll be given that specific new state that you compute as Dialogue: 0,1:08:33.76,1:08:40.76,Default,,0000,0000,0000,,augmax using that specific state S that Dialogue: 0,1:08:41.80,1:08:45.74,Default,,0000,0000,0000,,you're in. There're a few ways to do it. 
One way to do this is Dialogue: 0,1:08:45.74,1:08:50.10,Default,,0000,0000,0000,,actually the same as in the inner loop of the fitted value iteration algorithm so Dialogue: 0,1:08:50.10,1:08:52.15,Default,,0000,0000,0000,,because of an expectation of a large number Dialogue: 0,1:08:52.15,1:08:55.22,Default,,0000,0000,0000,,of states, you'd need to sample Dialogue: 0,1:08:55.22,1:08:57.100,Default,,0000,0000,0000,,some set of states from the simulator and then Dialogue: 0,1:08:57.100,1:09:03.12,Default,,0000,0000,0000,,approximate this expectation using an average over your samples, so it's actually as Dialogue: 0,1:09:03.12,1:09:05.82,Default,,0000,0000,0000,,inner loop of the value iteration algorithm. Dialogue: 0,1:09:05.82,1:09:09.48,Default,,0000,0000,0000,,So you could do that. That's sometimes done. Sometimes it can also be a pain to have to sample Dialogue: 0,1:09:09.77,1:09:13.03,Default,,0000,0000,0000,,a set of states Dialogue: 0,1:09:13.03,1:09:16.23,Default,,0000,0000,0000,,to approximate those expectations every time you want to take an action in your MDP. Dialogue: 0,1:09:16.23,1:09:18.36,Default,,0000,0000,0000,,Couple Dialogue: 0,1:09:18.36,1:09:22.71,Default,,0000,0000,0000,,of special cases where this can be done, one special case is if you Dialogue: 0,1:09:22.71,1:09:29.71,Default,,0000,0000,0000,,have Dialogue: 0,1:09:31.71,1:09:35.56,Default,,0000,0000,0000,,a deterministic simulator. If it's a deterministic simulator, so in other words, if your similar is Dialogue: 0,1:09:35.56,1:09:39.77,Default,,0000,0000,0000,,just some function, Dialogue: 0,1:09:39.77,1:09:43.69,Default,,0000,0000,0000,,could be a linear or a nonlinear function. If it's a deterministic simulator Dialogue: 0,1:09:43.69,1:09:47.57,Default,,0000,0000,0000,,then the next state, ST+1, is just some function of your Dialogue: 0,1:09:47.57,1:09:50.19,Default,,0000,0000,0000,,previous stated action. Dialogue: 0,1:09:50.19,1:09:52.06,Default,,0000,0000,0000,,If that's the case then Dialogue: 0,1:09:52.06,1:09:54.85,Default,,0000,0000,0000,,this expectation, well, then Dialogue: 0,1:09:54.85,1:09:59.26,Default,,0000,0000,0000,,this simplifies to augmax of A of V* Dialogue: 0,1:09:59.26,1:10:00.71,Default,,0000,0000,0000,,of F Dialogue: 0,1:10:00.71,1:10:03.41,Default,,0000,0000,0000,,of Dialogue: 0,1:10:03.41,1:10:07.28,Default,,0000,0000,0000,,I guess S,A Dialogue: 0,1:10:07.28,1:10:08.25,Default,,0000,0000,0000,,because Dialogue: 0,1:10:08.25,1:10:11.07,Default,,0000,0000,0000,,this is really saying Dialogue: 0,1:10:11.07,1:10:14.57,Default,,0000,0000,0000,,S Dialogue: 0,1:10:14.57,1:10:18.41,Default,,0000,0000,0000,,prime=F(s),A. I switched back and forth between notation; I hope that's okay. S Dialogue: 0,1:10:18.41,1:10:21.80,Default,,0000,0000,0000,,to denote the current state, and S prime to deterministic state versus ST and ST+1 through the Dialogue: 0,1:10:21.80,1:10:25.53,Default,,0000,0000,0000,,current state. Both of these are Dialogue: 0,1:10:25.53,1:10:27.70,Default,,0000,0000,0000,,sorta standing notation and don't Dialogue: 0,1:10:27.70,1:10:31.19,Default,,0000,0000,0000,,mind my switching back and forth between them. 
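A minimal sketch of that argmax for the deterministic-simulator case - assuming a deterministic next-state function f(s, a), the feature map phi, the learned parameters theta, and a list of discrete actions; the names are carried over from the earlier sketches, not from the lecture:

    import numpy as np

    def pi(s, f, phi, theta, actions):
        # pi(s) = argmax_a V(f(s, a)), with V(s') approximated by theta^T phi(s')
        values = [theta @ phi(f(s, a)) for a in actions]
        return actions[int(np.argmax(values))]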
But if it's a Dialogue: 0,1:10:31.19,1:10:35.31,Default,,0000,0000,0000,,deterministic simulator you can then just compute what the next state S prime would be Dialogue: 0,1:10:35.31,1:10:37.73,Default,,0000,0000,0000,,for each action you might take from the current state, Dialogue: 0,1:10:37.73,1:10:41.69,Default,,0000,0000,0000,,and then take the argmax over actions A, Dialogue: 0,1:10:41.69,1:10:47.23,Default,,0000,0000,0000,,basically choose the action that gets you to the highest value Dialogue: 0,1:10:47.23,1:10:54.23,Default,,0000,0000,0000,,state. Dialogue: 0,1:11:04.30,1:11:04.66,Default,,0000,0000,0000,,And Dialogue: 0,1:11:04.66,1:11:08.92,Default,,0000,0000,0000,,so that's one case where you can compute the argmax and we can compute that Dialogue: 0,1:11:08.92,1:11:13.47,Default,,0000,0000,0000,,expectation without needing to approximate it with an average over samples. Dialogue: 0,1:11:13.47,1:11:16.67,Default,,0000,0000,0000,,Another very common case actually it turns out is if you have a stochastic Dialogue: 0,1:11:24.01,1:11:25.97,Default,,0000,0000,0000,,simulator, but if your Dialogue: 0,1:11:25.97,1:11:30.46,Default,,0000,0000,0000,,simulator happens to take on the very specific form Dialogue: 0,1:11:33.82,1:11:36.12,Default,,0000,0000,0000,,ST+1 = F(ST, AT) + epsilon T Dialogue: 0,1:11:37.12,1:11:43.69,Default,,0000,0000,0000,,where this Dialogue: 0,1:11:43.69,1:11:47.90,Default,,0000,0000,0000,,is Gaussian noise. This is a very common way to build simulators where you model the next Dialogue: 0,1:11:47.90,1:11:50.23,Default,,0000,0000,0000,,state as a function of the current state and action Dialogue: 0,1:11:50.23,1:11:52.26,Default,,0000,0000,0000,,plus some noise and so Dialogue: 0,1:11:52.26,1:11:57.34,Default,,0000,0000,0000,,one specific example would be the sort of linear dynamical system that Dialogue: 0,1:11:57.34,1:12:01.40,Default,,0000,0000,0000,,we talked about, with a linear function of the current state and action plus Dialogue: 0,1:12:01.40,1:12:03.02,Default,,0000,0000,0000,,Gaussian Dialogue: 0,1:12:03.02,1:12:07.96,Default,,0000,0000,0000,,noise. In this case, you can approximate the argmax over Dialogue: 0,1:12:07.96,1:12:14.96,Default,,0000,0000,0000,,A, well. Dialogue: 0,1:12:21.81,1:12:27.16,Default,,0000,0000,0000,,In that case you take that Dialogue: 0,1:12:27.16,1:12:29.09,Default,,0000,0000,0000,,expectation that Dialogue: 0,1:12:29.09,1:12:33.18,Default,,0000,0000,0000,,you're trying to approximate. The Dialogue: 0,1:12:33.18,1:12:37.31,Default,,0000,0000,0000,,expected value of V of S prime you replace with V of the Dialogue: 0,1:12:37.31,1:12:41.48,Default,,0000,0000,0000,, Dialogue: 0,1:12:41.48,1:12:45.26,Default,,0000,0000,0000,,expected value of Dialogue: 0,1:12:45.26,1:12:49.18,Default,,0000,0000,0000,,S prime, and this is an approximation.
Dialogue: 0,1:12:49.18,1:12:52.51,Default,,0000,0000,0000,,The expected value of a function is usually not equal to the function of the Dialogue: 0,1:12:52.51,1:12:53.06,Default,,0000,0000,0000,,expectation but Dialogue: 0,1:12:53.06,1:12:54.16,Default,,0000,0000,0000,,it is often Dialogue: 0,1:12:54.16,1:12:56.12,Default,,0000,0000,0000,,a reasonable approximation Dialogue: 0,1:12:56.12,1:12:58.34,Default,,0000,0000,0000,,and so Dialogue: 0,1:13:01.92,1:13:04.67,Default,,0000,0000,0000,,that would be another way to Dialogue: 0,1:13:04.67,1:13:06.44,Default,,0000,0000,0000,,approximate that expectation Dialogue: 0,1:13:06.44,1:13:11.27,Default,,0000,0000,0000,,and so you choose the actions Dialogue: 0,1:13:11.27,1:13:14.34,Default,,0000,0000,0000,,according to Dialogue: 0,1:13:14.34,1:13:21.34,Default,,0000,0000,0000,,basically the same formula as I wrote just now. Dialogue: 0,1:13:22.01,1:13:23.62,Default,,0000,0000,0000,,And so this Dialogue: 0,1:13:23.62,1:13:24.92,Default,,0000,0000,0000,,would be a way of Dialogue: 0,1:13:24.92,1:13:27.11,Default,,0000,0000,0000,,approximating Dialogue: 0,1:13:28.16,1:13:30.53,Default,,0000,0000,0000,,this argmax, Dialogue: 0,1:13:30.53,1:13:34.49,Default,,0000,0000,0000,,ignoring the noise in the simulator essentially. And Dialogue: 0,1:13:34.49,1:13:37.61,Default,,0000,0000,0000,,this often works pretty well as well just because many simulators turn Dialogue: 0,1:13:37.61,1:13:38.54,Default,,0000,0000,0000,,out Dialogue: 0,1:13:38.54,1:13:42.70,Default,,0000,0000,0000,,to be of the form of some linear or some nonlinear function plus zero-mean Dialogue: 0,1:13:42.70,1:13:43.75,Default,,0000,0000,0000,,Gaussian noise, Dialogue: 0,1:13:43.75,1:13:46.13,Default,,0000,0000,0000,,so you can just ignore the zero-mean Dialogue: 0,1:13:46.13,1:13:47.37,Default,,0000,0000,0000,,Gaussian Dialogue: 0,1:13:47.37,1:13:50.07,Default,,0000,0000,0000,,noise, so that you can compute this quickly. Dialogue: 0,1:13:50.07,1:13:54.23,Default,,0000,0000,0000,,And just to be complete about this, what Dialogue: 0,1:13:54.23,1:13:55.90,Default,,0000,0000,0000,,that is, right, Dialogue: 0,1:13:55.90,1:13:58.46,Default,,0000,0000,0000,,that V* of F(S, A), Dialogue: 0,1:13:58.46,1:13:58.97,Default,,0000,0000,0000,,this Dialogue: 0,1:13:58.97,1:14:01.67,Default,,0000,0000,0000,,you Dialogue: 0,1:14:04.39,1:14:08.88,Default,,0000,0000,0000,,would write as theta transpose phi of S prime, where S prime = F(S, A). Great, Dialogue: 0,1:14:08.88,1:14:12.22,Default,,0000,0000,0000,,so this V* you would compute using Dialogue: 0,1:14:12.22,1:14:15.05,Default,,0000,0000,0000,,the parameters theta that you just learned using the Dialogue: 0,1:14:15.05,1:14:21.72,Default,,0000,0000,0000,,fitted value iteration Dialogue: 0,1:14:21.72,1:14:23.17,Default,,0000,0000,0000,,algorithm. Dialogue: 0,1:14:23.17,1:14:30.17,Default,,0000,0000,0000,,Questions about this? Student:[Inaudible] case, Dialogue: 0,1:14:33.85,1:14:40.85,Default,,0000,0000,0000,,for real-time application is it possible to use that [inaudible], for example for Dialogue: 0,1:14:41.81,1:14:47.74,Default,,0000,0000,0000,,[inaudible]. Instructor (Andrew Ng): Yes - in real-time applications, is it possible to sample k states and use that to estimate this Dialogue: 0,1:14:51.16,1:14:55.51,Default,,0000,0000,0000,,expectation. Computers today are actually amazingly fast.
I'm actually often Dialogue: 0,1:14:55.51,1:14:58.66,Default,,0000,0000,0000,,surprised by how much you can do in real time so Dialogue: 0,1:14:58.66,1:15:00.27,Default,,0000,0000,0000,,the helicopter we actually flying a helicopter Dialogue: 0,1:15:00.27,1:15:03.89,Default,,0000,0000,0000,,using an algorithm different than Dialogue: 0,1:15:03.89,1:15:04.74,Default,,0000,0000,0000,,this? I can't say. Dialogue: 0,1:15:06.28,1:15:10.54,Default,,0000,0000,0000,,But my intuition is that you could actually do this with a Dialogue: 0,1:15:10.54,1:15:14.01,Default,,0000,0000,0000,,helicopter. A helicopter would control at somewhere between 10hz and 50hz. You need to do Dialogue: 0,1:15:14.01,1:15:16.39,Default,,0000,0000,0000,,this 10 times a second to 50 times a second, Dialogue: 0,1:15:16.39,1:15:18.10,Default,,0000,0000,0000,,and that's actually plenty of time Dialogue: 0,1:15:18.10,1:15:19.40,Default,,0000,0000,0000,,to sample Dialogue: 0,1:15:19.40,1:15:23.50,Default,,0000,0000,0000,,1,000 states and compute this Dialogue: 0,1:15:23.50,1:15:26.91,Default,,0000,0000,0000,,expectation. They're real difficult, helicopters because helicopters are mission critical, and you Dialogue: 0,1:15:26.91,1:15:29.50,Default,,0000,0000,0000,,do something it's like fast. You Dialogue: 0,1:15:29.50,1:15:31.85,Default,,0000,0000,0000,,can do serious damage and so Dialogue: 0,1:15:31.85,1:15:35.75,Default,,0000,0000,0000,,maybe not for good reasons. We've actually tended to avoid Dialogue: 0,1:15:35.75,1:15:37.04,Default,,0000,0000,0000,,tossing coins when we're Dialogue: 0,1:15:37.04,1:15:40.18,Default,,0000,0000,0000,,in the air, so the ideal of letting our actions be some Dialogue: 0,1:15:40.18,1:15:45.10,Default,,0000,0000,0000,,up with some random process is slightly scary and just tend not to do that. Dialogue: 0,1:15:45.10,1:15:48.99,Default,,0000,0000,0000,,I should say that's prob'ly not a great reason because you average a Dialogue: 0,1:15:48.99,1:15:51.55,Default,,0000,0000,0000,,large number of things here very well fine but just Dialogue: 0,1:15:51.55,1:15:55.83,Default,,0000,0000,0000,,as a maybe overly conservative design choice, we actually Dialogue: 0,1:15:55.83,1:15:57.06,Default,,0000,0000,0000,,don't, tend not to find Dialogue: 0,1:15:57.06,1:15:59.50,Default,,0000,0000,0000,,anything randomized on Dialogue: 0,1:15:59.50,1:16:01.40,Default,,0000,0000,0000,,which is prob'ly being over conservative. Dialogue: 0,1:16:02.34,1:16:04.97,Default,,0000,0000,0000,,It's the choice we made cause other things are slightly safer. Dialogue: 0,1:16:04.97,1:16:08.15,Default,,0000,0000,0000,,I think you can actually often do this. Dialogue: 0,1:16:08.15,1:16:13.19,Default,,0000,0000,0000,,So long as I see a model can be evaluated fast enough where you can sample Dialogue: 0,1:16:13.19,1:16:15.47,Default,,0000,0000,0000,,100 state transitions or 1,000 state Dialogue: 0,1:16:15.47,1:16:18.12,Default,,0000,0000,0000,,transitions, and then do that at Dialogue: 0,1:16:18.12,1:16:21.55,Default,,0000,0000,0000,,10hz. They haven't said that. This is often attained which is why we often Dialogue: 0,1:16:21.55,1:16:28.55,Default,,0000,0000,0000,,use the other approximations that don't require your drawing a large sample. Anything else? No, okay, cool. So Dialogue: 0,1:16:31.80,1:16:35.80,Default,,0000,0000,0000,,now you know one algorithm [inaudible] reinforcement learning on continuous state Dialogue: 0,1:16:35.80,1:16:38.71,Default,,0000,0000,0000,,spaces. 
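For the stochastic-simulator case discussed in this exchange, the corresponding sketch would approximate the expectation by averaging over k sampled next states per action, on the order of the 100 or 1,000 samples mentioned above; again this is purely illustrative, reusing the names from the earlier sketches:

    import numpy as np

    def pi_sampled(s, step, phi, theta, actions, k=100):
        # approximate E[V(s')] for each action by an average over k simulator samples
        def q(a):
            return np.mean([theta @ phi(step(s, a)) for _ in range(k)])
        return actions[int(np.argmax([q(a) for a in actions]))]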
Then we'll pick up with some more ideas on some even Dialogue: 0,1:16:38.71,1:16:40.45,Default,,0000,0000,0000,,more powerful algorithms, Dialogue: 0,1:16:40.45,1:16:43.37,Default,,0000,0000,0000,,the solving MDPs of continuous state spaces. Dialogue: 0,1:16:43.37,1:16:44.17,Default,,0000,0000,0000,,Thanks. Let's close for today.