This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. So I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in the Skilling Auditorium, in Skilling 193, at 3:45. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual.

Okay, welcome back. What I want to do today is continue our discussion of reinforcement learning in MDPs. It's quite a lot for me to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. I'll talk just very briefly about discretization, I'll spend a lot of time talking about models and simulators of MDPs, and then I'll talk about one algorithm called fitted value iteration that builds on that, and then hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap: in the previous lecture, I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation. I said that an MDP, or Markov decision process, was a five-tuple (S, A, {P_sa}, γ, R) comprising those things, and the running example I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP, there were 11 states, so that's what S was. The actions were compass directions: north, south, east and west.
The state transition probabilities P_sa capture the chance of transitioning to each state when you take any action in any given state, and so in our example they captured the stochastic dynamics of our robot wandering around the grid. We said that if you take the action north, you have a 0.8 chance of actually going north and a 0.1 chance of veering off to the left and a 0.1 chance of veering off to the right, so that modeled the robot's noisy dynamics. And the reward function was +1 and -1 at the absorbing states, and -0.02 elsewhere. This is an example of an MDP, and that's what those five things were. Oh, and I used a discount factor γ, usually a number slightly less than one, so that's the 0.99.

And so our goal was to find the policy, the control policy π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined the value function V^π(s) to be the expected value of the sum of discounted rewards, E[R(s_0) + γ R(s_1) + γ² R(s_2) + ... | π, s_0 = s], conditioned on your executing the policy π and on your starting off in the state s. And so our strategy for finding the policy was comprised of two steps. The goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and I said last time that one strategy for finding the optimal policy is to first compute the optimal value function, which I denoted V*(s) and which is defined as the maximum value that any policy can obtain, V*(s) = max_π V^π(s). And for example, the optimal value function for that MDP looks like this: starting from any of these states, it gives the expected value of the sum of discounted rewards you get, so this is V*. We also said that once you've found V*, you can compute the optimal policy using π*(s) = argmax_a Σ_{s'} P_sa(s') V*(s').
And so once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*(s), the optimal sum of discounted rewards you can get from state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max over all the actions you could take of your expected future payoff, your future sum of discounted rewards starting from the state s' that you might transition to after one step. So

V*(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V*(s').

And so this gave us the value iteration algorithm (I'm abbreviating value iteration as V.I.), in which you just take Bellman's equations and repeatedly do this: initialize some guess of the value function, initialize it to zero as the starting guess, and then repeatedly perform the update V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s') for all states. And I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), from which you can compute the optimal policy π*.

Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let's see: given a random initial policy, we'll solve for V^π, the value function for that specific policy. This means for every state, compute the expected sum of discounted rewards you get if you execute the policy π from that state. And then the other step of policy iteration is, having found the value function for your policy, you then update the policy pretending that that's already the optimal value function V*, so π(s) := argmax_a Σ_{s'} P_sa(s') V(s'). And then you repeatedly perform these two steps, where you solve for the value function for your current policy, and then pretend that that's actually the optimal value function and solve for the policy given the value function, and you keep updating the value function, or updating the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and will cause π to converge to π*, the optimal policy. So those were the highlights of our last lecture on MDPs, which introduced a lot of new notation; that just summarizes it all again.
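For concreteness, here is a minimal sketch, in Python, of what the value iteration update and the greedy policy extraction just described might look like for a small discrete MDP such as the 11-state grid world. The data layout (a list of states, a list of actions, nested dictionaries P[s][a] mapping next states to probabilities, a reward table R, and a discount gamma) is my own choice for illustration, not something specified in the lecture.

```python
# Minimal sketch of value iteration for a small discrete MDP.
# Assumed inputs (illustrative, not from the lecture): states and actions are lists,
# P[s][a] is a dict mapping next state s' -> probability, R[s] is the immediate
# reward for state s, and gamma is the discount factor.

def value_iteration(states, actions, P, R, gamma, num_iters=1000):
    V = {s: 0.0 for s in states}                      # initialize V(s) := 0
    for _ in range(num_iters):
        for s in states:
            # Bellman update: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
            V[s] = R[s] + gamma * max(
                sum(p * V[sp] for sp, p in P[s][a].items())
                for a in actions
            )
    return V

def greedy_policy(states, actions, P, V):
    # pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s')
    return {s: max(actions,
                   key=lambda a: sum(p * V[sp] for sp, p in P[s][a].items()))
            for s in states}
```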
What I'm about to do now, for the rest of today's lecture, is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to move on. Yeah?

Student: [Inaudible] how are those two algorithms very different?

Instructor (Andrew Ng): I see, right. So, yeah, do you see that they're different? Okay, how it's different. Let's see. Well, here's one difference; I didn't say this before since I won't use it today. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy. So you're given some fixed policy π, meaning some function mapping from the states to the actions. I give you some policy, and whatever; that's just some policy, it's not a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state you need to compute the expected sum of discounted rewards you get if you execute this specific policy, starting off the MDP in that state s. So I showed this last time; I won't go into the details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There was a form of Bellman's equations for V^π, and it turned out that if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function for a fixed policy by solving a system of linear equations with 11 variables and 11 constraints. So that's policy iteration; whereas in value iteration, going back to the board, you repeatedly perform this update where you update the value of a state using the Bellman update. So I hope it makes sense that the algorithms are different.

Student: [Inaudible] on the absorbing states, is the assumption that we can never get out of those states?

Instructor (Andrew Ng): Yes.
There are ways you could set this up without absorbing states, for example, and make the numbers come out nicely, but I don't want to spend too much time on that. So yeah, the assumption is that once you enter an absorbing state, the world ends, or there are no more rewards after that. Another way to think of the absorbing states, which is mathematically equivalent, is that you can think of each absorbing state as transitioning with probability 1 to some 12th state, and then once you're in that 12th state, you always remain in that 12th state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the 12th state is this zero-cost absorbing state that you get stuck in forever. Other questions? Yeah, please go.

Student: Where did the Bellman's equations [inaudible] to optimal value [inaudible]?

Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equation, this equation that I'm pointing to, I sort of tried to give a justification for last time. I'll say it in one sentence: the expected total payoff I expect to get starting from the state s is equal to my immediate reward, which is the reward I get just for starting in that state, plus what comes after. Let's see: if I start in the state, I'm going to get some first reward, and then I can transition to some other state, and then from that other state I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces. They say the value of a state is equal to the reward you get right away, plus γ times the rewards you get in the future; that is, V*(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V*(s'), where the first term is the immediate reward and the second term turns out to be equal to the expected future discounted rewards. I spent more time justifying this in the previous lecture, although, yeah, hopefully for the purposes of this lecture, if you don't remember where this came from, if you don't remember the justification of that, why don't you just maybe take my word for it that this equation holds true, since I use it a little bit later as well, and the lecture notes explain a little further the justification for why this equation might hold true.
But for now, yeah, just take my word for it that this holds true, because we'll use it a little bit later today as well.

Student: [Inaudible] and would it then sort of turn back into [inaudible]?

Instructor (Andrew Ng): Right, the question is: in policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration? And the answer is sort of no. Let's see. It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but it actually turns out that, oh, no, the algorithms are not equivalent. It's just that in policy iteration there is a step where you're solving for the value function for the policy, where you solve for V^π. Usually you can do this, for instance, by solving a linear system of equations. Value iteration is a different algorithm, yes. I hope it makes sense that at least cosmetically it's different.

Student: [Inaudible] if you have [inaudible] representing π implicitly, then you won't have to solve that system of [inaudible] equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see, solving for V^π this way works only if you have a fixed policy, so once you change the value function, if π changes as well, then it's sort of hard to solve this. Yeah, so later on we'll actually talk about some examples of when π is implicitly represented, but at least for now, I think, yeah, maybe there's a way to redefine something and see a mapping onto value iteration, but that's not usually done. These are viewed as different algorithms. Okay, cool, so all good questions.

Let me move on and talk about how to generalize these ideas to continuous states. Everything we've done so far has been for discrete states, or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function, could be represented using an array of 11 numbers, because if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.
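To make the distinction concrete, here is a minimal sketch of the policy-evaluation step and the policy iteration loop just discussed: for a fixed policy, Bellman's equation for V^π is linear, so V^π can be found by solving a linear system (11 equations in 11 unknowns for the grid world). The array layout (P given as one n-by-n transition matrix per action, R as a reward vector, pi as an array of action indices) is my own assumption for illustration.

```python
import numpy as np

# Sketch of the policy-evaluation step: for a fixed policy pi, Bellman's equation
# V_pi = R + gamma * P_pi V_pi is linear in V_pi, so solve (I - gamma P_pi) V_pi = R.
# Assumed layout (mine, not the lecture's): R is a length-n reward vector,
# P[a] is an n x n matrix of transition probabilities under action a,
# and pi is a length-n array of action indices.

def evaluate_policy(P, R, pi, gamma):
    n = len(R)
    P_pi = np.array([P[pi[s]][s] for s in range(n)])   # row s is P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def policy_iteration(P, R, gamma, num_iters=100):
    n, num_actions = len(R), len(P)
    pi = np.zeros(n, dtype=int)                        # some initial policy
    for _ in range(num_iters):
        V = evaluate_policy(P, R, pi, gamma)           # solve for V_pi
        # greedy improvement: pi(s) := argmax_a sum_{s'} P_sa(s') V(s')
        Q = np.array([P[a] @ V for a in range(num_actions)])
        pi = np.argmax(Q, axis=0)
    return pi, V
```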
What I want to do for the rest of today is talk about continuous states. So for example, if you want to control any number of real systems, for example if you want to control a car, a car's position is given by x, y, θ, its position and orientation, and if you want to model the velocity as well, then you add ẋ, ẏ, θ̇. So it depends on whether you want to model only the kinematics, meaning just the position, or whether you want to model the dynamics, meaning the velocity as well.

Earlier I showed you video of a helicopter that was flying using a reinforcement learning algorithm. For a helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state will be given by the x, y, z position, plus φ, θ, ψ, which are sometimes used to denote the pose of the helicopter, just its orientation. And if you want to control a helicopter, you pretty much have to model velocity as well, which means both linear velocity and angular velocity, and so this would be a 12-dimensional state.

An example that is kind of fun but unusual, and one that I'm just going to use a little bit in today's lecture, is the inverted pendulum problem, which is a long-running classic in reinforcement learning. Imagine that you have a little cart that's on a rail; the rail ends at some point, and imagine that you have a pole attached to the cart by a free hinge, so the pole here can rotate freely, and your goal is to control the cart and move it back and forth on this rail so as to keep the pole balanced. Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here? Back in the corner. Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine the hand just moving back and forth to try to keep the pole balanced; you can actually use a reinforcement learning algorithm to do that.
This is actually one of the longstanding classic problems that people implement and play with using reinforcement learning algorithms, and so for this, the state would be x and θ, where x is the position of the cart and θ is the orientation of the pole, and also the linear velocity ẋ and the angular velocity θ̇ of the pole. I'll actually use this example a couple of times.

So with continuous state spaces, how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or the inverted pendulum? One thing you can do, and this is maybe the most straightforward thing, is the following. Say you have a two-dimensional continuous state space, where s1 and s2 are my state variables (in all the examples the state is, I guess, between 4-dimensional and 12-dimensional, but I'll just draw 2-D here). The most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. I'll use s̄ to denote the discretized or discrete states, and so you can approximate this continuous state problem with a finite or discrete set of states, and then you can use policy iteration or value iteration to solve for V*(s̄). And if your robot is then in some state given by that dot, you would figure out which discretized state it is in, in this case this discretized grid cell that's called s̄, and then you execute the action given by the policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem.

Sometimes you can sort of make this work, but there are a couple of reasons why it does not work very well. One reason is the following, and for this picture let's even temporarily put aside reinforcement learning. Let's just think about doing regression for now, and so suppose you have some input variable x, and suppose I have some data and I want to fit a function, y as a function of x.
So discretization is saying that I'm going to take my horizontal x-axis and chop it up into a number of intervals (sometimes I call these intervals buckets as well). We chop the horizontal x-axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of these buckets. And just look at this: this is clearly not a very good representation, right? When we talked about regression, you would just choose some features of x and run linear regression or something, and you'd get a much better fit to the function. So there's the sense that discretization just isn't a very good representation: this sort of piecewise constant function just isn't a very good function for representing many things, and there's also the sense that there's no smoothing, no generalization across the different buckets. In fact, back in regression, I would never have chosen to do regression using this sort of representation; it just really doesn't make sense.

And so in the same way, instead of x we have the state s, and instead of some hypothesis function of x we have the value function: if you have the state here and you're trying to approximate the value function, you can get discretization to work for many problems, but maybe this isn't the best representation for a value function.

The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. This is just the observation that if the state space is in R^n, and if you discretize each variable into k buckets, that is, into k discrete values, then you get on the order of k^n discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe like 100 to the power of 12, which is just huge; it's not feasible. And so discretization doesn't scale well at all to problems with high-dimensional state spaces, and this observation actually applies more generally than to just robotics and continuous state problems.
For example, another fairly well-known application of reinforcement learning has been to factory automation. If you imagine that you have 20 machines sitting in the factory, and the machines lie in an assembly line and each does something to a part on the assembly line and then routes the part onto a different machine, you might want to use reinforcement learning algorithms to choose the order in which the different machines operate on the different things that are flowing through your assembly line, and maybe different machines can do different things. So if you have n machines and each machine can be in k states, then if you do this sort of discretization, the total number of states would be k^n as well. If you have n machines and each machine can be in k states, then again you get this huge number of states.

Another well-known example would be board games. Say you want to use reinforcement learning to play chess. If you have n pieces on your board game, n pieces on the chessboard, and each piece can be in k positions, then this is again the curse of dimensionality, where the number of discrete states you end up with grows exponentially with the number of pieces in your board game.

So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are that you'll need to fiddle with the discretization and do things like non-uniform grids. So for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1, if I think the value function is much more sensitive to the value of the state variable s2 than to s1. And so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on.
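As a small illustration of the discretization idea and why it blows up, here is a sketch that maps a continuous state in R^n onto a grid cell, using a uniform grid with k buckets per dimension. The function name, the bounding box lows/highs, and the uniform spacing are my own simplifications for illustration; as noted, in practice you may want a non-uniform grid.

```python
import numpy as np

# Sketch of the discretization idea: map a continuous state in R^n onto a grid cell,
# using k buckets per dimension between per-dimension bounds lows and highs.
# Names and the uniform-grid choice are illustrative assumptions.

def discretize(s, lows, highs, k):
    """Return the index of the grid cell containing continuous state s."""
    s, lows, highs = np.asarray(s), np.asarray(lows), np.asarray(highs)
    # bucket index along each dimension, clipped into [0, k-1]
    idx = np.clip(((s - lows) / (highs - lows) * k).astype(int), 0, k - 1)
    # flatten the n per-dimension indices into a single discrete state id
    return int(np.ravel_multi_index(tuple(idx), dims=[k] * len(s)))

# Curse of dimensionality: with k buckets per dimension, an n-dimensional state
# space produces k**n discrete states, e.g. k=100, n=12 gives 100**12 states.
```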
But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it works fine. With 4-dimensional problems, you can probably get it to work, but it'll be slightly challenging, and sometimes, by fooling around and being clever, you can push discretization up to, let's say, about 6-dimensional problems, but with some difficulty. Problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just rough folk wisdom on the order of problem you might think about using discretization for.

But what I want to spend most of today talking about are methods that often work much better than discretization, and with which we will approximate V* directly, without resorting to these sorts of discretizations. Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup. For today's lecture, I'm going to focus on the problem of continuous states, and just to keep things very simple in this lecture, I won't deal with continuous actions, so I'm going to assume discrete actions A. It turns out, actually, and this is a critical fact for many problems, that the state space is often much larger than the space of actions; that just seems to have worked out that way for many problems. So for example, for driving a car the state space is 6-dimensional (x, y, θ, ẋ, ẏ, θ̇), whereas you have just two action dimensions: forward/backward motion and steering the car. So you have 6-D states and 2-D actions, and you can discretize the actions much more easily than you can discretize the states. The other example was the helicopter: you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action.
Whether you accelerate your cart to the left or to the right, that's the 1-D action. And so for the rest of today, I'm going to assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems but for many problems, large action spaces are just less of a difficulty than large state spaces.

So I'm going to assume that we have a model, or simulator, of the MDP, and this is really an assumption on how the state transition probabilities are represented. I'm going to use the terms "model" and "simulator" pretty much synonymously. Specifically, what I'm going to assume is that we have a black box, a piece of code, such that I can input any state, input an action, and it will output s' sampled from the state transition distribution P_sa. So this is really my assumption on the representation I have for the state transition probabilities: I'll assume I have a black box that takes as input the state and action and outputs the next state.

And here are some fairly common ways to get models of the MDPs you may work with. One is that you might get a model from a physics simulator. So for example, if you're interested in controlling that inverted pendulum, your action is a, which is the magnitude of the force you exert on the cart to push it left or right, and your state is (x, ẋ, θ, θ̇); I'm just going to write it in that order. And so I'm going to write down a bunch of equations just for completeness, but everything I'm going to write below here is a bit gratuitous: if you flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this. You find that ṡ, the derivative of the state with respect to time, is (ẋ, ẍ, θ̇, θ̈), where the accelerations are given by expressions involving the action a, terms like l cos θ, the mass m, and so on, where a is the force, the action that you exert on the cart, l is the length of the pole, and m is the total mass of the system, and so on.
So these equations aren't the point; I'm just mentioning them for completeness, but by flipping open a physics textbook you can work out these equations of motion yourself, and this then gives you a model which says that s_{t+1}, your state one time step later, will be equal to your previous state plus Δt times ṡ. So in your simulator, or in my model, this tells you what happens to the cart every, say, tenth of a second, so Δt would be a tenth of a second, and then s_{t+1} = s_t + Δt · ṡ. And so that'd be one way to come up with a model of your MDP.

In this specific example, I've actually written down a deterministic model, and by deterministic I mean that given an action and a state, the next state is not random. So this would be an example of a deterministic model, where I can compute the next state exactly as a function of the previous state and the previous action; or put another way, it's a deterministic model because all the probability mass is on a single state given the previous state and action. You can also make this a stochastic model.

A second way that is often used to obtain a model is to learn one. And so again, concretely, what you do is imagine that you have a physical inverted pendulum system, that you physically own an inverted pendulum robot. What you would do is initialize your inverted pendulum robot to some state and then execute some policy: it could be some random policy, or some policy that you think is pretty good, or you could even try controlling it yourself with a joystick or something. So you set the system off in some state s_0, then you take some action a_0 (again, the actions could be chosen by some policy, or chosen by you using a joystick to try to control your inverted pendulum, or whatever), the system transitions to some new state s_1, and then you take some new action a_1, and so on. Let's say you do this for T time steps; sometimes I call this one trajectory. And you repeat this m times, so this is the first trial, the first trajectory, and then you do it again: initialize the system in some state, run it for T time steps, and so on.
So you do this a bunch of times, and then you would run a learning algorithm to estimate s_{t+1} as a function of s_t and a_t. For the sake of concreteness, you should just think of this as the inverted pendulum problem, so s_{t+1} is a 4-dimensional vector, and s_t, a_t will be a 4-dimensional vector and a real number, and so you might run linear regression four times to predict each of the four state variables as a function of those five real numbers.

Just for example, if you want to estimate your next state s_{t+1} as a linear function of your previous state and your action, s_{t+1} = A s_t + B a_t, where A will be a 4-by-4 matrix and B a 4-dimensional vector, then you would choose the values of A and B that minimize

Σ_i Σ_t ‖ s_{t+1}^(i) − ( A s_t^(i) + B a_t^(i) ) ‖²,

summing over your trajectories and over the time steps in each trajectory. So if you want your model to be that s_{t+1} is some linear function of the previous state and action, you might pose this optimization objective and choose A and B to minimize the sum of squared errors in your predicted value for s_{t+1} as a linear function of s_t and a_t.

I should say that this is one specific example, where you're using a linear function of the previous state and action to predict the next state. Of course, you can also use other algorithms, like locally weighted linear regression, or linear regression with nonlinear features, or kernel linear regression, or whatever, to predict the next state as a nonlinear function of the current state as well, so this isn't limited to linear models. And it turns out that locally weighted linear regression, for many robots, turns out to be an effective method for this learning problem as well.
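Here is a rough sketch of that model-learning step: fitting A and B by least squares from recorded trajectories, and then using the result as a deterministic or stochastic one-step simulator. The array shapes and function names are my own assumptions for illustration; for a real robot you might swap in locally weighted regression or a nonlinear model, as noted above.

```python
import numpy as np

# Sketch of learning the model s_{t+1} = A s_t + B a_t by least squares from
# recorded data. Assumes (my layout, not the lecture's) that states has shape
# (T+1, 4) and actions has shape (T, 1) for one trajectory; several trajectories
# can be handled by stacking their (s_t, a_t, s_{t+1}) rows.

def fit_linear_model(states, actions):
    X = np.hstack([states[:-1], actions])        # inputs  [s_t, a_t], shape (T, 5)
    Y = states[1:]                               # targets s_{t+1},   shape (T, 4)
    # Solve min_W ||X W - Y||^2; W stacks A and B so that s_{t+1} = A s_t + B a_t
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    A, B = W[:4].T, W[4:].T                      # A is 4x4, B is 4x1
    return A, B

def simulate_step(A, B, s, a, noise_std=0.0):
    # Deterministic model, or a stochastic one if we add a noise term eps_t
    eps = noise_std * np.random.randn(len(s))
    return A @ s + B @ np.atleast_1d(a) + eps
```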
And so having learned the model, having learned the parameters A and B, you then have a model where you say that s_{t+1} = A s_t + B a_t, and that would be an example of a deterministic model. Or, having learned the parameters A and B, you might say that s_{t+1} = A s_t + B a_t + ε_t, where ε_t is a noise term, say Gaussian noise. And so these would be very reasonable ways to come up with either a deterministic or a stochastic model for your inverted pendulum MDP.

So just to summarize, what we have now is a model, meaning a piece of code, where you can input a state and an action and get out s_{t+1}. And if you have a stochastic model, then to run this model you would actually sample ε_t from the noise distribution in order to generate s_{t+1}.

It actually turns out, as a preview, I guess, of the next lecture, that in the specific case of linear dynamical systems, in the specific case where the next state is a linear function of the current state and action, there are very powerful algorithms you can use. I'm not going to talk about that today; I'll talk about that in the next lecture rather than right now. But it turns out that for many problems, like the inverted pendulum, you'd use locally weighted regression or some nonlinear learning algorithm, because many systems aren't really linear, and you can build a nonlinear model.

So what I want to do now is talk about, given a model, given a simulator for your MDP, how to come up with an algorithm to approximate the optimal value function V*. Before I move on, let me check if there are questions on this. Okay, cool.

So here's the idea. Back when we talked about linear regression, we said that given some input x in supervised learning, given the input features x, we may choose some features of x and then approximate the target variable as a linear function of various features of x. So we'll do exactly the same thing to approximate the optimal value function, and in particular, we'll choose some features φ(s) of a state s.
And so you could actually choose φ(s) = s; that would be one reasonable choice, if you want to approximate the value function as a linear function of the state. But you can also choose other things. For example, for the inverted pendulum example, you may choose φ(s) to be a vector of features that might be, say, 1, or x², or ẋ², or ẋ, maybe some cross terms like x times θ̇, maybe θ², and so on. So you choose some vector of features, and then approximate the value of the state as V(s) = θᵀφ(s), the parameter vector transposed times the features. And I should apologize in advance: I'm overloading notation here, which is unfortunate. I'm using θ both to denote the angle of the pole of the inverted pendulum, that was the angle θ, and also to denote the vector of parameters in my learning algorithm. So sorry about the overloaded notation.

Just like we did in linear regression, my goal is to come up with a linear combination of the features that gives me a good approximation to the value function, and this is completely analogous to when we said that in linear regression our estimate of the response y was a linear function θᵀx of the features of the input. That's what we had in linear regression.

Let me just write down value iteration again, and then I'll write down an approximation to value iteration. So for discrete states, this is the idea behind value iteration: we said that V(s) will be updated as V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s'). That was value iteration, and in the case of continuous states, this sum over states would be replaced by an integral over states; let me just write this as V(s) := R(s) + γ max_a E_{s' ∼ P_sa}[V(s')]. That sum over all states s' of the probability of going to s' times the value is really an expectation with respect to the random next state s' drawn from the state transition probabilities P_sa, applied to V(s').

And so what I'll do now is write down an algorithm called fitted value iteration that is an approximation to this, but specifically for continuous states. I'll just write down the first two steps, and then I'll continue on the next board. So the first step of the algorithm is: sample, that is, choose some set of states at random. So sample s^(1), s^(2), ..., s^(m) randomly, choose a set of m states randomly, and initialize my parameter vector θ to be equal to zero.
This is 0:52:12.020,0:52:16.190 analogous to in value iteration where I 0:52:16.190,0:52:21.140 might initialize the value function to be the function of all zeros. Then here's the 0:52:21.140,0:52:28.140 end view for the algorithm. 0:52:42.359,0:52:44.119 Got 0:52:44.119,0:52:51.119 quite a lot to 0:52:59.189,0:53:06.189 write 0:55:05.409,0:55:12.409 actually. 0:55:28.599,0:55:29.639 Let's 0:55:29.639,0:55:36.639 see. 0:56:10.420,0:56:12.009 And so 0:56:12.009,0:56:17.839 that's the algorithm. 0:56:17.839,0:56:19.880 Let me just adjust the writing. Give me a second. Give me a minute 0:56:19.880,0:56:26.880 to finish and then I'll step through this. 0:56:26.989,0:56:28.489 Actually, if some of my 0:56:28.489,0:56:35.489 handwriting is eligible, let me know. So 0:56:57.559,0:57:02.079 let me step through this and so briefly explain the rationale. 0:57:02.079,0:57:08.140 So the hear of the algorithm is - let's see. 0:57:08.140,0:57:10.869 In the original value iteration algorithm, 0:57:10.869,0:57:13.240 we would take the value for each state, 0:57:13.240,0:57:14.799 V(s)I, 0:57:14.799,0:57:17.599 and we will overwrite it with this 0:57:17.599,0:57:21.659 expression here. In the original, this discrete value iteration algorithm 0:57:21.659,0:57:23.199 was to V(s)I 0:57:23.199,0:57:23.940 and we will 0:57:23.940,0:57:25.829 set V(s)I 0:57:25.829,0:57:28.589 to be equal to that, I think. 0:57:28.589,0:57:33.130 Now we have in the continuous state case, we have an infinite continuous set of 0:57:33.130,0:57:34.199 states and so 0:57:34.199,0:57:37.859 you can't discretely set the value of each of these to that. 0:57:37.859,0:57:42.509 So what we'll do instead is choose the parameters T 0:57:42.509,0:57:46.849 so that V(s)I is as close as possible to 0:57:46.849,0:57:49.930 this thing on the right hand side 0:57:49.930,0:57:54.509 instead. And this is what YI turns out to be. 0:57:54.509,0:57:58.479 So completely, what I'm going to do is I'm going to construct 0:57:58.479,0:58:00.579 estimates of this term, 0:58:00.579,0:58:04.279 and then I'm going to choose the parameters of my function approximator. I'm 0:58:04.279,0:58:07.729 gonna choose my parameter as T, so that V(s)I is as close as possible to 0:58:07.729,0:58:10.619 these. That's what YI is, 0:58:10.619,0:58:11.979 and specifically, 0:58:11.979,0:58:15.139 what I'm going to do is I'll choose parameters data 0:58:15.139,0:58:18.489 to minimize the sum of square differences between T [inaudible] plus 0:58:18.489,0:58:20.409 5SI. 0:58:20.409,0:58:21.550 This thing here 0:58:21.550,0:58:24.489 is just V(s)I because I'm approximating V(s)I 0:58:24.489,0:58:26.219 is a linear function of 5SI 0:58:26.219,0:58:27.880 and so choose 0:58:27.880,0:58:32.019 the parameters data to minimize the sum of square differences. 0:58:32.019,0:58:34.829 So this is last step is basically 0:58:34.829,0:58:38.759 the approximation version of value 0:58:38.759,0:58:40.610 iteration. What everything else above 0:58:40.610,0:58:42.419 was doing was just 0:58:42.419,0:58:44.749 coming up with an approximation 0:58:44.749,0:58:46.189 to this term, to 0:58:46.189,0:58:47.499 this thing here 0:58:47.499,0:58:50.979 and which I was calling YI. 0:58:50.979,0:58:54.199 And so confluently, for every state SI we want to 0:58:54.199,0:58:57.159 estimate what the thing on the right hand side is 0:58:57.159,0:58:57.979 and but 0:58:57.979,0:59:01.980 there's an expectation here. 
Concretely, for every state s^(i) we want to estimate what that right-hand side is, but there's an expectation in it: an expectation over a continuous, maybe very high-dimensional, set of states, so I can't compute the expectation exactly. What I'll do instead is use my simulator to sample a set of states from that distribution P_{s^(i) a}, the state transition distribution over where I get to if I take the action a in the state s^(i), and then I'll average over that sample of states to approximate the expectation.

So stepping through the algorithm: for each state s^(i) and for each action a, I'm going to sample a set of states s'_1 through s'_k from the state transition distribution P_{s^(i) a}, still using the model, and then I'll set q(a) equal to the average, q(a) = (1/k) Σ_j [R(s^(i)) + γ V(s'_j)]. So q(a) is my estimate of R(s^(i)) + γ E[V(s')] for that specific action a. Then I take the maximum over actions a, and that gives me y^(i); so y^(i) = max_a q(a) for that state. And finally, I run linear regression, which is that last step of the algorithm, to get θᵀφ(s^(i)) to be close to the y^(i)'s.

And so this algorithm is called fitted value iteration, and it actually often works quite well for continuous problems, for problems with anywhere from 6- to 10- to 20-dimensional state spaces, if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Some of you didn't have your hands up. Are there questions? Yeah?

Student: Is there a reward function in this setup?

Instructor: Oh, yes. An MDP comprises (S, A, {P_sa}, γ, R), and so for continuous state spaces S would be something like R^4 for the inverted pendulum. The actions we've discretized; the state transition probabilities are specified by the model, or the simulator; γ is just a real number like 0.99; and the reward function is usually a function that's given to you. So the reward function is some function of your 4-dimensional state. For example, just as a simple example of a reward function, you might choose the reward to be -1 if the pole has fallen over, say if you find that its angle is greater than 30 degrees or something, and zero otherwise. So that would be an example of a reward function you might choose for the inverted pendulum.
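Putting those pieces together, here is a compact sketch of the whole fitted value iteration loop as just described. The simulator simulate(s, a), the discretized action set ACTIONS, the number of sampled next states k, the fixed number of sweeps, and the example reward function (-1 if the pole's angle exceeds 30 degrees, 0 otherwise, echoing the example above) are all assumptions for illustration; phi, fit_theta, and the sampled states come from the earlier hypothetical snippets.

```python
gamma = 0.99                  # discount factor
k = 10                        # sampled next states per (state, action) pair
ACTIONS = [-10.0, 0.0, 10.0]  # e.g. a discretized set of horizontal forces (assumed)

def R(s):
    """Example reward: -1 if the pole has fallen over (|angle| > 30 degrees), else 0."""
    pole_angle = s[2]         # third component of the state is the pole angle here
    return -1.0 if abs(pole_angle) > np.deg2rad(30) else 0.0

def fitted_value_iteration(states, simulate, n_sweeps=50):
    """Repeatedly build targets y_i and refit theta by least squares."""
    theta = np.zeros(phi(states[0]).shape[0])
    for _ in range(n_sweeps):
        y = np.zeros(len(states))
        for i, s in enumerate(states):
            q = []
            for a in ACTIONS:
                # Sample k next states s' ~ P_{sa} from the simulator and average,
                # approximating q(a) = R(s) + gamma * E[ V(s') ].
                s_primes = [simulate(s, a) for _ in range(k)]
                q.append(R(s) + gamma * np.mean([theta @ phi(sp) for sp in s_primes]))
            y[i] = max(q)                # y_i = max_a q(a)
        theta = fit_theta(states, y)     # supervised step: make theta^T phi(s_i) close to y_i
    return theta
```

One design note on this sketch: unlike exact value iteration, the fitted version is not guaranteed to converge, which is why it simply runs a fixed number of sweeps rather than testing for convergence.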
In any case, yes, we're assuming a reward function is given to you, so that you can compute R(s^(i)) for any state. Are there other questions?

Actually, let me try asking a question. Everything I did here assumed that we have a stochastic simulator. It turns out I can simplify this algorithm if I have a deterministic simulator, where by a deterministic simulator I mean that given a state and an action, my next state is always exactly determined. So let me ask you: if I have a deterministic simulator, how would I change this algorithm? How would I simplify it?

Student: Lower the number of samples that you're drawing [inaudible].

Instructor: Right, so Justin's got it right. If I have a deterministic simulator, all my samples from that distribution would be exactly the same, and so I can set k equal to 1. I don't need to draw k different samples; I really only need one sample if I have a deterministic simulator. So you can simplify this by setting k = 1 if you have a deterministic simulator.
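For that deterministic special case, the inner computation of q(a) collapses to a single simulator call. A minimal sketch, reusing the hypothetical R, gamma, and phi from the snippets above and assuming simulate(s, a) now returns the one deterministic next state:

```python
def q_value_deterministic(s, a, theta, simulate):
    """q(a) with a deterministic simulator: k = 1 is enough, since every
    draw of s' from simulate(s, a) would come back identical."""
    s_prime = simulate(s, a)
    return R(s) + gamma * (theta @ phi(s_prime))
```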
Student: I guess I'm a little confused. We've sort of turned this into something that looks like linear regression, you know, θ transpose times something, which we're used to, but I don't quite know what question to ask. When we did this before, we had discrete states and we were concerned with finding the optimal policy, and it seems like we haven't said the word "policy" in a while.

Instructor: Okay, yeah, so this does come back to the policy, but maybe I should just say a couple of words first, so let me try to get at some of what you're asking. Our strategy for finding the optimal policy has been to find some way to compute V*, or an approximation to it, and then use that to get an approximation to π*. So far, everything I've been doing has been focused on how to find V*.

I just want to say one more word about features. It actually turns out that for linear regression, it's often not terribly difficult to choose a reasonable set of features. Choosing features for approximating the value function is often somewhat harder, because the value of a state is how good it is to start off in that state: what is my expected sum of discounted rewards if I start in this state? And so what the features of the state have to capture is really how good it is to start in a given state. For the inverted pendulum, states where the pole is close to vertical and the cart is centered on the track are better, and so you can come up with features that measure the orientation of the pole and how close you are to the center of the track, and those would be reasonable features to use to approximate V*. Although in general it is true that choosing features for value function approximation is often slightly trickier than choosing good features for linear regression.

Okay, and then Justin's question: given V*, how do you go back and actually find a policy? In the discrete case, we have that π*(s) is equal to the arg max over actions a of Σ_{s'} P_sa(s') V*(s'). I used to write this as a sum over states; I'll just write it as an expectation: π*(s) = arg max_a E_{s' ~ P_sa}[V*(s')]. And so once you've found the optimal value function V*, you can then find the optimal policy π* by computing this arg max.
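In the tabular, discrete case that arg max can be computed exactly from the transition probabilities. A minimal sketch, assuming the transition model is stored as a hypothetical array P of shape (num_states, num_actions, num_states) and V_star is a vector of (approximately) optimal values:

```python
def greedy_policy_discrete(P, V_star):
    """pi*(s) = argmax_a sum_{s'} P[s, a, s'] * V*(s') for every state of a tabular MDP.

    P has shape (num_states, num_actions, num_states); V_star has shape (num_states,).
    Returns an array of greedy action indices, one per state."""
    expected_values = P @ V_star          # shape (num_states, num_actions)
    return np.argmax(expected_values, axis=1)
```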
If you're in a continuous-state MDP, though, you can't actually do this in advance for every single state, because there's an infinite number of states, so you can't perform this computation ahead of time for every state. What you do instead is this: it's only when your system, be it a factory or a board game or a robot, is actually in some specific state s, like your car being at some specific position and orientation, or your inverted pendulum being in some specific position, posed at some specific angle θ, that you go ahead and compute this arg max. So it's only when you're in some state s that you compute the arg max; then you execute that action a; and then, as a result of your action, your robot transitions to some new state, and it is given that specific new state that you compute the arg max again, using the specific state you're in.

There are a few ways to do this. One way is actually the same as the inner loop of the fitted value iteration algorithm: because this is an expectation over a large number of states, you'd sample a set of states from the simulator and then approximate the expectation using an average over your samples, just as in the inner loop of the value iteration algorithm. So you could do that; that's sometimes done. But sometimes it can also be a pain to have to sample a set of states to approximate those expectations every time you want to take an action in your MDP.

There are a couple of special cases where this can be done more cheaply. One special case is if you have a deterministic simulator: in other words, if your simulator is just some function, which could be a linear or a nonlinear function, so that the next state s_{t+1} is just some function f of your previous state and action. If that's the case, then this expectation simplifies, and the policy becomes arg max_a V*(f(s, a)), because this is really saying s' = f(s, a). (I'm switching back and forth between notation, using s and s' to denote the current and next state versus s_t and s_{t+1}; both are fairly standard notation, so I hope you don't mind my switching between them.) So if you have a deterministic simulator, you can just compute what the next state s' would be for each action you might take from the current state, and then take the arg max over actions a: basically, choose the action that gets you to the highest-value state. And so that's one case where you can compute the arg max, compute that expectation, without needing to sample and average over a set of samples.
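Here is a sketch of that online action selection in both flavors just described: approximating the expectation by an average over samples from the simulator, and the simpler deterministic form arg max_a V*(f(s, a)). As before, ACTIONS, phi, simulate, and the learned theta are the hypothetical pieces from the earlier snippets, and f is assumed to be the deterministic dynamics function.

```python
def choose_action_sampled(s, theta, simulate, n_samples=100):
    """Greedy action at state s: estimate E_{s'}[V*(s')] for each action by
    averaging theta^T phi(s') over sampled next states, then take the arg max."""
    def estimate(a):
        return np.mean([theta @ phi(simulate(s, a)) for _ in range(n_samples)])
    return max(ACTIONS, key=estimate)

def choose_action_deterministic(s, theta, f):
    """Greedy action with a deterministic simulator s' = f(s, a):
    pi(s) = argmax_a V*(f(s, a)) = argmax_a theta^T phi(f(s, a))."""
    return max(ACTIONS, key=lambda a: theta @ phi(f(s, a)))
```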
Another very common case, it turns out, is if you have a stochastic simulator, but your simulator happens to take on the very specific form s_{t+1} = f(s_t, a_t) + ε_t, where ε_t is Gaussian noise. This is a very common way to build simulators: you model the next state as a function of the current state and action plus some noise. One specific example would be the sort of linear dynamical system we talked about, a linear function of the current state and action plus Gaussian noise.

In this case, you can approximate the arg max over a as follows. You take the expectation you're trying to approximate, E[V*(s')], and approximate it by V* of the expected value of s', which is V*(f(s, a)). The expected value of a function is usually not equal to the function of the expectation, but it is often a reasonable approximation, and so that would be another way to approximate that expectation. You then choose the action according to essentially the same formula I wrote just now, π(s) ≈ arg max_a V*(f(s, a)), and so this is a way of approximating the arg max by essentially ignoring the noise in the simulator. And this often works pretty well too, just because many simulators turn out to have the form of some linear or nonlinear function plus zero-mean Gaussian noise; so you just ignore the zero-mean Gaussian noise, and then you can compute this quickly.

And just to complete this: what that V*(f(s, a)) is, you would write it down as θᵀφ(s') where s' = f(s, a), and this V* you would compute using the parameters θ that you just learned using the fitted value iteration algorithm.
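A sketch of that approximation for one hypothetical simulator of this form, a linear model s_{t+1} = A s_t + B a_t + ε_t with Gaussian ε_t. The matrices A and B and the noise level are made-up illustrative values, and rng, ACTIONS, phi, and the learned theta are reused from the earlier snippets. The noisy simulate is what you would feed to fitted value iteration; for action selection you evaluate the learned value function at the noise-free mean f(s, a) and ignore the noise:

```python
# Hypothetical linear-Gaussian dynamics: s' = A s + B a + eps, eps ~ N(0, sigma^2 I).
A = np.eye(4) + 0.02 * rng.standard_normal((4, 4))   # assumed, for illustration only
B = 0.1 * rng.standard_normal((4, 1))
sigma = 0.01

def f(s, a):
    """Deterministic part of the dynamics: the mean next state."""
    return A @ s + (B * a).ravel()

def simulate(s, a):
    """Stochastic simulator: mean dynamics plus zero-mean Gaussian noise."""
    return f(s, a) + sigma * rng.standard_normal(4)

def choose_action_ignore_noise(s, theta):
    """Approximate argmax_a E[V*(s')] by argmax_a V*(f(s, a)), i.e. evaluate
    theta^T phi(.) at the noise-free mean next state and ignore the noise."""
    return max(ACTIONS, key=lambda a: theta @ phi(f(s, a)))
```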
Questions about this?

Student: [Inaudible] For real-time applications, is it possible to use that [inaudible], for example for [inaudible]?

Instructor (Andrew Ng): Yes, so the question is whether, in real-time applications, it's possible to sample k states and use them to approximate the expectation. Computers today are actually amazingly fast; I'm often surprised by how much you can do in real time. So for the helicopter: are we actually flying a helicopter using an algorithm different from this one? I can't say. But my intuition is that you could actually do this with a helicopter. A helicopter would be controlled at somewhere between 10 Hz and 50 Hz, so you need to do this 10 to 50 times a second, and that's actually plenty of time to sample 1,000 states and compute this expectation.

Helicopters are a little different because they're mission critical: if you do something wrong, you can do serious damage, and so, maybe not for good reasons, we've actually tended to avoid tossing coins when we're in the air. The idea of letting our actions come out of some random process is slightly scary, so we just tend not to do that. I should say that's probably not a great reason, because if you average over a large number of samples it works perfectly well, but as a maybe overly conservative design choice we tend not to run anything randomized on board, which is probably being over-conservative; it's a choice we made because other things are slightly safer. But I think you can often do this, so long as your model can be evaluated fast enough that you can sample 100 or 1,000 state transitions and do that at 10 Hz. Having said that, this isn't always the case, which is why we often use the other approximations that don't require drawing a large sample. Anything else? No? Okay, cool.

So now you know one algorithm for reinforcement learning on continuous state spaces. Next time we'll pick up with some more ideas on some even more powerful algorithms for solving MDPs with continuous state spaces. Thanks; let's close for today.