This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in the Skilling Auditorium, in Skilling 193. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual.

Okay, welcome back. What I want to do today is continue our discussion of reinforcement learning in MDPs. I have quite a lot to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. I'll talk just very briefly about discretization, I'll spend a lot of time talking about models, or simulators, of MDPs, and then talk about one algorithm called fitted value iteration, and, building on that, hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap: in the previous lecture, I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation. I said that an MDP, or Markov decision process, is a five-tuple (S, A, {P_sa}, γ, R) comprising those things, and the running example I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP, there were 11 states, so that's what S was. The actions were the compass directions: north, south, east and west.
The state transition probabilities P_sa capture the chance of transitioning to each state when you take a given action in a given state, and so in our example they captured the stochastic dynamics of our robot wandering around the grid world. We said that if you take the action north, you have a 0.8 chance of actually going north, a 0.1 chance of veering off to the left, and a 0.1 chance of veering off to the right, so that was the model of the robot's noisy dynamics. And the reward function was +1 and -1 at the absorbing states, and -0.02 elsewhere. This is an example of an MDP, and that's what these five things were. Oh, and I used a discount factor γ, usually a number slightly less than one, so that's the 0.99.

And so our goal was to find the policy, the control policy, and that's π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined the value function V^π(s) to be equal to this. We said that the value of a policy π from a state s is given by the expected value of the sum of discounted rewards, conditioned on your executing the policy π and on your starting off in the state s.

And so our strategy for finding the policy was comprised of two steps. So the goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and I said last time that one strategy for finding the optimal policy is to first compute the optimal value function, which I denoted V*(s) and which is defined like that. It's the maximum value that any policy can obtain, and, for example, the optimal value function for that MDP looks like this.
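(Written out, using the standard definitions from the course notes, the two quantities being pointed to on the board are:

    V^π(s) = E[ R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯ | s_0 = s, π ],
    V*(s)  = max_π V^π(s).)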
So, in other words, starting from any of these states, what's the expected value of the sum of discounted rewards you get? So this is V*. We also said that once you've found V*, you can compute the optimal policy using this: π*(s) is the argmax over actions a of the expected value of V* at the state you transition to.

And so, once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*, the optimal sum of discounted rewards you can get from a state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max over all the actions you could take of your expected future sum of discounted rewards, your future payoff starting from the state s', which is where you might transition to after one step.

And so this gave us the value iteration algorithm; I'm abbreviating value iteration as V.I. In the value iteration algorithm, in V.I., you just take Bellman's equations and you repeatedly do this update. So initialize some guess of the value function, initialize it to zero as the starting guess, and then repeatedly perform this update for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), and from V* you can compute the optimal policy π*.

Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So, let's see: given a random initial policy, we'll solve for V^π; we'll solve for the value function for that specific policy. So this means that for every state, we compute the expected sum of discounted rewards if you execute the policy π starting from that state.
And then the other step of policy iteration is: having found the value function for your policy, you then update the policy, pretending that the value function you've just found is the optimal value function V*. And then you repeatedly perform these two steps, where you solve for the value function for your current policy, and then pretend that that's actually the optimal value function and solve for the policy given that value function, and you repeatedly update the value function, or update the policy, using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and this will cause π to converge to π*, the optimal policy.

So that was a recap of our last lecture on MDPs, which introduced a lot of new notation and symbols, just to summarize all that again. What I'm about to do for the rest of today's lecture is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to go on. Please, yeah.

Student: [Inaudible] how are those two algorithms different?

Instructor (Andrew Ng): I see, right, so yeah, do you see that they're different? Okay, how they're different. Let's see. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy. And so you're given some fixed policy π, meaning some function mapping from the states to the actions. So you're given some policy, and whatever; that's just some policy, it's not necessarily a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state you need to compute your expected sum of discounted rewards if you execute this specific policy, starting off the MDP in that state s. So I showed this last time.
I won't go into the details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There's a form of Bellman's equations for V^π, and it turns out that if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function of a fixed policy by solving a system of linear equations with 11 variables and 11 constraints. So that's policy iteration; whereas in value iteration, going back to the board, you repeatedly perform this update, where you update the value of a state using the immediate reward plus the discounted expected future value. So I hope that makes sense, that these algorithms are different.

Student: [Inaudible] on the absorbing states, so is the assumption that we can never get out of those states?

Instructor (Andrew Ng): Yes. There are always choices you make when you set these examples up to make the numbers come out nicely, and I don't want to spend too much time on them, but yes, the assumption is that once you enter the absorbing state, then the world ends, or there are no more rewards after that. Another way to think of the absorbing states, which is mathematically equivalent, is that the absorbing states transition with probability 1 to some 12th state, and once you're in that 12th state, you always remain in that 12th state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the 12th state is this zero-reward absorbing state that you get stuck in forever. Other questions? Yeah, please go.

Student: Where did the Bellman's equations for the optimal value function come from?

Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equation, this equation that I'm pointing to, I tried to give a justification for last time. I'll say it in one sentence: the expected total payoff I expect to get starting from the state s is equal to my immediate reward, which is the reward I get just for starting in that state, plus what comes after. Let's see.
If I start in that state, I'm going to get some first reward, and then I can transition to some other state, and from that other state I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces: it says the value of a state is equal to the reward you get right away, plus γ times the rewards you get in the future. So V*(s) is really equal to the immediate reward R(s), plus γ times the expected future payoff, and that second term turns out to be equal to that second row on the board. I spent more time justifying this in the previous lecture, although, hopefully, for the purposes of this lecture, if you're not sure where this came from, if you don't remember the justification, why don't you just take my word for it that this equation holds true, since I'll use it a little bit later as well, and the lecture notes explain the justification for why this equation holds true a little further. But for now, just take my word for it that this holds true, because we'll use it a little bit later today as well.

Student: [Inaudible] In policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration?

Instructor (Andrew Ng): Right, so the question is, in policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration, and the answer is sort of no. Let's see. It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but it turns out that the algorithms are not equivalent. It's just that in policy iteration, there is a step where you're solving for the value function for the policy, where you solve for V^π. Usually, you can do this, for instance, by solving a linear system of equations. Value iteration is a different algorithm, yes. I hope it makes sense that at least cosmetically they're different.

Student: [Inaudible] If you have [inaudible] representing π implicitly, then you won't have to solve that system of linear equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see, solving for V^π that way works only if you have a fixed policy, so once you change the value function, if π changes as well, then it's sort of hard to solve this.
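(To make the contrast concrete, here is a minimal sketch of the two pieces being compared: one synchronous Bellman backup sweep, which is the value iteration update, and exact policy evaluation by solving the linear system of Bellman equations for a fixed policy. The array sizes and the randomly generated transition probabilities and rewards are placeholders, not the grid world's actual numbers.)

```python
import numpy as np

# Illustrative sizes: a small discrete MDP, e.g. the 11-state grid world.
n_states, n_actions, gamma = 11, 4, 0.99
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.full(n_states, -0.02)                     # placeholder per-state rewards

def value_iteration_sweep(V):
    """One synchronous Bellman backup: V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    return R + gamma * np.max(P @ V, axis=1)

def evaluate_policy(pi):
    """Policy evaluation: solve the linear system V_pi = R + gamma * P_pi V_pi for fixed pi."""
    P_pi = P[np.arange(n_states), pi]            # n_states x n_states transition matrix under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

V = np.zeros(n_states)
for _ in range(1000):                            # value iteration: repeat the backup
    V = value_iteration_sweep(V)

pi = np.zeros(n_states, dtype=int)               # policy iteration: alternate the two steps
for _ in range(20):
    V_pi = evaluate_policy(pi)
    pi = np.argmax(P @ V_pi, axis=1)             # greedy policy w.r.t. the current value function
```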
Yeah, so later on we'll actually talk about some examples where π is implicitly represented, but at least for now, maybe there's a way to redefine something and see a mapping onto value iteration, but that's not usually done. These are viewed as different algorithms. Okay, cool, so all good questions. Let me move on and talk about how to generalize these ideas to continuous states.

Everything we've done so far has been for discrete states, or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function, could be represented using an array of 11 numbers, because if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.

What I want to do for the rest of today is talk about continuous states. So, for example, if you want to control any of a number of real systems, say you want to control a car: a car's position is given by x, y and θ, its position and orientation, and if you want to model the velocity as well, then also ẋ, ẏ and θ̇. So it depends on whether you want to model only the kinematics, meaning just the position, or whether you want to model the dynamics, meaning the velocity as well.

Earlier I showed you video of a helicopter that was flying using a reinforcement learning algorithm. For a helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state would be given by the x, y, z position and by φ, θ, ψ.
φ, θ and ψ are sometimes used to denote the pose of the helicopter, just its orientation, and if you want to control a helicopter, you pretty much have to model the velocity as well, which means both the linear velocity and the angular velocity, and so this would be a 12-dimensional state.

An example that is kind of fun but unusual, and I'm going to use it as a running example a couple of times in today's lecture, is the inverted pendulum problem, which is a long-running classic in reinforcement learning. Imagine that you have a little cart that's on a rail; the rail ends at some point, and imagine that you have a pole attached to the cart by a free hinge, so the pole can rotate freely, and your goal is to control the cart and move it back and forth on this rail so as to keep the pole balanced. Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here? Back in the corner. Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine moving your hand back and forth to try to keep the pole balanced; you can actually use a reinforcement learning algorithm to do that. This is actually one of the longstanding classic problems that people implement and play with using reinforcement learning algorithms. And so for this, the state would be x and θ, where x is the position of the cart and θ is the orientation of the pole, and also the linear velocity and the angular velocity, ẋ and θ̇. I'll actually use this example a couple of times.

So, given a continuous state space, how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or the inverted pendulum?
So one thing you can do, and this is maybe the most straightforward thing, is this. Say you have a two-dimensional continuous state space, where s1 and s2 are my state variables; in all the examples they're really between 4-dimensional and 12-dimensional, but I'll just draw 2-D here. The most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. I'll use s̄ to denote the discretized, or discrete, states, and so you can approximate this continuous state problem with a finite, discrete set of states, and then you can use policy iteration or value iteration to solve for V*(s̄). And if your robot is then in some state given by that dot, you figure out what discretized state it is in; in this case it's in this discretized grid cell, call it s̄, and then you execute, you choose, the action given by π*(s̄), the policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem.

Sometimes you can sort of make this work, but there are a couple of reasons why it does not work very well. One reason is the following, and for this picture, let's even temporarily put aside reinforcement learning. Let's just think about doing regression for now, and so suppose you have some input variable x and I have some data, and I want to fit a function, y as a function of x. Discretization is saying that I'm going to take my horizontal axis of x's and chop it up into a number of intervals; sometimes I call these intervals buckets as well. We chop the horizontal axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of those buckets.
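(As a concrete picture of the discretization approach being described, here is a minimal sketch of the mapping from a continuous state to its grid cell, with the value function and policy stored as tables over cells, i.e., piecewise constant over the grid. The bounds and the bucket count are made-up illustrative numbers.)

```python
import numpy as np

# Illustrative 2-D example: state s = (s1, s2), each dimension chopped into k buckets.
lower = np.array([-3.0, -3.0])
upper = np.array([ 3.0,  3.0])
k = 10                                        # buckets per dimension -> k**2 discrete states

def discretize(s):
    """Map a continuous state to the index of the grid cell s-bar that contains it."""
    frac = (np.asarray(s) - lower) / (upper - lower)   # rescale each coordinate to [0, 1]
    idx = np.clip((frac * k).astype(int), 0, k - 1)    # bucket index per dimension
    return tuple(idx)

# Value function and policy become tables indexed by cells: piecewise constant over the grid.
V_bar = np.zeros((k, k))
pi_bar = np.zeros((k, k), dtype=int)

s = np.array([0.7, -1.2])                     # a continuous state
a = pi_bar[discretize(s)]                     # act using the policy of the enclosing cell
```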
And just look at this: this is clearly not a very good representation, right? When we talked about regression, you'd just choose some features of x and run linear regression or something, and you'd get a much better fit to the function. And so there's the sense that discretization, this class of piecewise constant functions, just isn't a very good class of functions for representing many things, and there's also the sense that there's no smoothing, no generalization, across the different buckets. In fact, back in regression, I would never have chosen to do regression using this sort of representation; it just really doesn't make sense. And so, in the same way, instead of x and some hypothesis function of x, if you have the state here and you're trying to approximate the value function, then you can get discretization to work for many problems, but maybe this isn't the best representation for a value function.

The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. This is just the observation that if the state space is in R^n, and if you discretize each variable into k buckets, so if you discretize each state variable into k discrete values, then you get on the order of k^n discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe something like 100^12, which is just huge, and it's not feasible. And so discretization doesn't scale well at all to problems with high-dimensional state spaces.

And this observation actually applies more generally than just to robotics and continuous state problems. For example, another fairly well-known application of reinforcement learning has been to factory automation.
If you imagine that you have 20 machines sitting in a factory, and the machines lie in an assembly line and they all do something to a part on the assembly line and then route the part on to a different machine, you may want to use reinforcement learning algorithms to choose the order in which the different machines operate on the different things flowing through your assembly line, and maybe different machines can do different things. So if you have n machines and each machine can be in k states, then if you do this sort of discretization, the total number of states would be k^n as well. If you have n machines and each machine can be in k states, then again, you get this huge number of states.

Another well-known example would be board games. Say you want to use reinforcement learning to play chess. Then if you have n pieces on the chessboard, and if each piece can be in k positions, then this is again sort of the curse of dimensionality, where the number of discrete states you end up with grows exponentially with the number of pieces in your board game.

So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are that you'll need to fiddle with the discretization and do things like non-uniform grids. So, for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1.
I'd do that if I think the value function is much more sensitive to the value of the state variable s2 than to s1, and so as you get into higher-dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on. But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it works fine. With 4-dimensional problems, you can probably get it to work, but it'll be slightly challenging, and sometimes, by fooling around and being clever, you can push discretization up to, let's say, about 6-dimensional problems, but with some difficulty, and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just a rough, folk-wisdom order of magnitude for the problems you might think about using discretization for.

But what I want to spend most of today talking about are methods that often work much better than discretization, in which we will approximate V* directly, without resorting to these sorts of discretizations. Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup. For today's lecture, I'm going to focus on the problem of continuous states, and, just to keep things very simple in this lecture, I won't consider continuous actions, so I'm going to assume discrete actions A. It turns out, and this is actually a critical fact, that for many problems the state space is much larger than the space of actions; that just seems to have worked out that way for many problems. So, for example, for driving a car the state space is 6-dimensional, so x, y, θ, ẋ, ẏ, θ̇, whereas you still have only two actions, forward-and-backward motion and steering the car. So you have 6-D states and 2-D actions, and so you can discretize the actions much more easily than you can discretize the states.
The other example, for the helicopter, you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action: whether, and how hard, you accelerate your cart to the left or the right is a one-dimensional action. And so for the rest of today, I'm going to assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems, but for many problems, large action spaces are just less of a difficulty than large state spaces.

So I'm going to assume that we have a model, or simulator, of the MDP, and this is really an assumption on how the state transition probabilities are represented. I'm going to use the terms "model" and "simulator" pretty much synonymously. Specifically, what I'm going to assume is that we have a black box, a piece of code, such that I can input any state, input an action, and it will output s', sampled from the state transition distribution P_sa. So this is really my assumption on the representation I have for the state transition probabilities: I'll assume I have a black box that takes as input a state and an action and outputs the next state.

And there are a few fairly common ways to get models of the different MDPs you may work with. One is that you might get a model from a physics simulator. So, for example, if you're interested in controlling that inverted pendulum, your action is a, which is the magnitude of the force you exert on the cart to the left or right, and your state is x, ẋ, θ, θ̇; I'm just going to write it in that order.
And so I'm going to write down a bunch of equations just for completeness, but most of what I'm about to write is a bit gratuitous. If you flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this, so you find that ṡ, where the dot denotes the derivative of the state with respect to time, is given by some expressions in terms of x, ẋ, θ, θ̇ and the action, where a is the force you exert on the cart, l is the length of the pole, m is the total mass of the system, and so on. All these equations, I'm just writing them down for completeness, but by flipping open a physics textbook you can work out these equations of motion yourself, and this then gives you a model which says that s_{t+1}, your state one time step later, will be equal to your previous state s_t plus Δt times ṡ_t. So your simulator, or your model, tells you what happens to the cart every, say, tenth of a second, so Δt would be 0.1 seconds. And so that would be one way to come up with a model of your MDP.

And in this specific example, I've actually written down a deterministic model, and by deterministic I mean that, given an action and a state, the next state is not random. So this would be an example of a deterministic model, where I can compute the next state exactly as a function of the previous state and the previous action, or, put differently, it's a deterministic model because all the probability mass is on a single next state given the previous state and action. You can also make this a stochastic model.
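(Here is a minimal sketch of such a physics-based simulator for the cart-pole, using one common textbook form of the frictionless cart-pole dynamics and an Euler integration step. The specific equations, constants and time step are illustrative assumptions, not what was written on the board.)

```python
import numpy as np

# Assumed constants: gravity, cart mass, pole mass, pole half-length, time step.
g, m_cart, m_pole, l, dt = 9.8, 1.0, 0.1, 0.5, 0.1

def simulate(s, a):
    """Deterministic model: return s_{t+1} = s_t + dt * s_dot(s_t, a_t).

    State s = (x, x_dot, theta, theta_dot); action a is the force on the cart.
    """
    x, x_dot, theta, theta_dot = s
    total_m = m_cart + m_pole
    tmp = (a + m_pole * l * theta_dot**2 * np.sin(theta)) / total_m
    theta_acc = (g * np.sin(theta) - np.cos(theta) * tmp) / (
        l * (4.0 / 3.0 - m_pole * np.cos(theta)**2 / total_m))
    x_acc = tmp - m_pole * l * theta_acc * np.cos(theta) / total_m
    s_dot = np.array([x_dot, x_acc, theta_dot, theta_acc])
    return np.asarray(s) + dt * s_dot            # Euler step: s_{t+1} = s_t + dt * s_dot

s_next = simulate(np.zeros(4), a=10.0)           # push the cart to the right for one step
```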
A second way that is often used to obtain a model is to learn one. And so, concretely, what you would do is imagine that you have a physical inverted pendulum system, that you physically own an inverted pendulum robot. You would then initialize your inverted pendulum robot to some state and execute some policy; it could be some random policy, or some policy that you think is pretty good, or you could even try controlling it yourself with a joystick or something. So you set the system off in some state s_0, then you take some action a_0, where again the action could be chosen by some policy or chosen by you using a joystick, trying to control your inverted pendulum or whatever. The system transitions to some new state s_1, and then you take some new action a_1, and so on; let's say you do this for T time steps. Sometimes I call this one trajectory, and you repeat this m times, so this is the first trial, the first trajectory, and then you do it again, initializing it in some state, and so on.

So you do this a bunch of times, and then you run a learning algorithm to estimate s_{t+1} as a function of s_t and a_t. And, for the sake of concreteness, think of this as the inverted pendulum problem, so s_{t+1} is a 4-dimensional vector, s_t will be a 4-dimensional vector, and a_t will be a real number, and so you might run linear regression 4 times to predict each of the state variables as a function of those 5 real numbers. Just for example, if you say that you want to estimate your next state s_{t+1} as a linear function of your previous state and your action, s_{t+1} = A s_t + B a_t, where A here will be a 4-by-4 matrix and B will be a 4-dimensional vector, then you would choose the values of A and B that minimize the sum, over your trials and over the time steps, of the squared errors ||s_{t+1} - (A s_t + B a_t)||². So if you want your model to be that s_{t+1} is some linear function of the previous state and action, then you would pose this optimization objective and choose A and B to minimize the sum of squares error in your predicted value for s_{t+1} as a linear function of s_t and a_t.
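(A minimal sketch of that fitting step: stack all observed transitions from the m trajectories and solve the least-squares problem for A and B. The function name, the data layout and the single lstsq solve are my own choices; the lecture only specifies the sum-of-squared-errors objective. Solving for all four output dimensions at once is the same as running linear regression 4 times.)

```python
import numpy as np

def fit_linear_model(trajectories):
    """Fit s_{t+1} ~= A s_t + B a_t by least squares over all observed transitions.

    `trajectories` is a list of (states, actions) pairs, where `states` has shape
    (T+1, 4) and `actions` has shape (T,) for the inverted pendulum.
    """
    X, Y = [], []
    for states, actions in trajectories:
        for t in range(len(actions)):
            X.append(np.append(states[t], actions[t]))   # input [s_t, a_t], a 5-vector
            Y.append(states[t + 1])                       # target s_{t+1}, a 4-vector
    X, Y = np.array(X), np.array(Y)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)             # W is 5 x 4
    A, B = W[:4].T, W[4]                                  # A: 4x4 matrix, B: 4-vector
    return A, B
```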
I should say that this is one specific example, where you're using a linear function of the previous state and action to predict the next state. Of course, you can also use other algorithms, like locally weighted linear regression, or linear regression with nonlinear features, or kernel linear regression or whatever, to predict the next state as a nonlinear function of the current state, so this isn't limited to linear models. And it turns out that locally weighted linear regression turns out to be an effective method for this learning problem for many robots as well.

And so, having learned the model, having learned the parameters A and B, you then have a model where you say that s_{t+1} = A s_t + B a_t, and that would be an example of a deterministic model. Or, having learned the parameters A and B, you might say that s_{t+1} = A s_t + B a_t + ε_t, where ε_t is a noise term, and so these would be very reasonable ways to come up with either a deterministic or a stochastic model for your inverted pendulum MDP.

And so, just to summarize, what we have now is a model, meaning a piece of code, where you can input a state and an action and get out s_{t+1}. And if you have a stochastic model, then to run this model, you would actually sample ε_t from its distribution in order to generate s_{t+1}.

So, as a preview of the next lecture, it actually turns out that in the specific case of linear dynamical systems, in the specific case where the next state is a linear function of the current state and action, there are very powerful algorithms you can use. I'm not going to talk about that today; I'll talk about that in the next lecture rather than right now. But it turns out that for many problems like the inverted pendulum, if you use locally weighted linear regression or a similar learning algorithm, because many systems aren't really linear, you can build a nonlinear model.
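(Putting the learned parameters together, a sketch of what that piece of code might look like; the choice of a zero-mean Gaussian for the ε_t distribution is an assumption for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a, A, B, noise_cov=None):
    """Learned model: deterministic if noise_cov is None, stochastic otherwise."""
    s_next = A @ s + B * a                        # s_{t+1} = A s_t + B a_t
    if noise_cov is not None:                     # stochastic model: add eps_t
        s_next = s_next + rng.multivariate_normal(np.zeros(len(s)), noise_cov)
    return s_next
```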
So what I want to do now is talk about, given a model, given a simulator for your MDP, how to come up with an algorithm to approximate the optimal value function. Before I move on, let me check if there are questions on this. Okay, cool.

So here's the idea. Back when we talked about linear regression, we said that, given some input x in supervised learning, given the input features x, we may choose some features of x and then approximate the target variable as a linear function of those features of x. We'll do exactly the same thing here to approximate the optimal value function, and in particular, we'll choose some features φ(s) of a state s. And so you could actually choose φ(s) = s; that would be one reasonable choice if you want to approximate the value function as a linear function of the state. But you can also choose other things, so, for example, for the inverted pendulum example, you may choose φ(s) to be a vector of features such as 1, x, ẋ², maybe some cross terms like x times θ, maybe θ̇², and so on. So you choose some vector of features, and then you approximate the value function, the value of the state, as V(s) = θᵀφ(s), theta transpose times the features.

And I should apologize in advance; I'm overloading notation here, which is unfortunate. I'm using θ both to denote the angle of the pole of the inverted pendulum, so that's the angle θ, and also to denote the vector of parameters in my learning algorithm. So sorry about the overloaded notation.

Just like we did in linear regression, my goal is to come up with a linear combination of the features that gives me a good approximation to the value function, and this is completely analogous to linear regression, where my estimate of the response y is a linear function of the features of the input. That's what we have in linear regression.
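(A minimal sketch of this representation; the particular features chosen here, a constant, the raw state, and a couple of quadratic and cross terms, are just an example of the kind of choice being described, not a prescribed set.)

```python
import numpy as np

def phi(s):
    """Illustrative feature vector for the inverted pendulum state s = (x, x_dot, theta, theta_dot)."""
    x, x_dot, theta, theta_dot = s
    return np.array([1.0, x, x_dot, theta, theta_dot,
                     x * theta, theta**2, theta_dot**2])

theta_params = np.zeros(8)            # parameter vector (distinct from the pole angle theta)

def V(s, theta_params):
    """Approximate value function: V(s) = theta^T phi(s)."""
    return theta_params @ phi(s)
```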
Let me just write down value iteration again, and then I'll write down an approximation to value iteration. So, for discrete states, this is the idea behind value iteration, and we said that V(s) would be updated as V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s'). That was value iteration, and in the case of continuous states, this sum would be replaced by an integral over states, so let me just write this as V(s) := R(s) + γ max_a E_{s' ~ P_sa}[V(s')]. That sum over s' is really an expectation with respect to a random state s' drawn from the state transition probabilities P_sa: it's a sum over all states s' of the probability of going to s' times the value of s', so it's really an expectation over the random state s' drawn from P_sa.

And so what I'll do now is write down an algorithm called fitted value iteration that is an approximation to this, but specifically for continuous states. I'll just write down the first two steps, and then I'll continue on the next board. So the first step of the algorithm is: we'll sample, we'll choose, some set of states at random. So sample s^(1), s^(2) through s^(m) randomly, so choose a set of m states randomly, and initialize my parameter vector θ to be equal to zero. This is analogous to value iteration, where I might initialize the value function to be the function of all zeros. Then here's the main loop of the algorithm.

Got quite a lot to write, actually. Let's see. And so that's the algorithm. Let me just adjust the writing; give me a minute to finish, and then I'll step through this. Actually, if some of my handwriting is illegible, let me know. So let me step through this and briefly explain the rationale. So the heart of the algorithm is, let's see.
In the original value iteration algorithm, we would take the value for each state, V(s^(i)), and we would overwrite it with this expression here. In the original discrete value iteration algorithm, we would take V(s^(i)) and we would set V(s^(i)) to be equal to that. Now, in the continuous-state case, we have an infinite, continuous set of states, and so you can't discretely set the value of each of these to that. So what we'll do instead is choose the parameters θ so that V(s^(i)) is as close as possible to this thing on the right-hand side instead. And this is what y^(i) turns out to be.

So concretely, what I'm going to do is construct estimates of this term, and then I'm going to choose the parameters of my function approximator. I'm going to choose my parameters θ so that V(s^(i)) is as close as possible to these. That's what y^(i) is, and specifically, what I'm going to do is choose the parameters θ to minimize the sum of squared differences between y^(i) and θᵀφ(s^(i)). This thing here is just V(s^(i)), because I'm approximating V(s^(i)) as a linear function of φ(s^(i)), and so we choose the parameters θ to minimize the sum of squared differences. So this last step is basically the approximation version of value iteration. What everything else above was doing was just coming up with an approximation to this term, to this thing here, which I was calling y^(i).

And so concretely, for every state s^(i) we want to estimate what the thing on the right-hand side is, but there's an expectation here. There's an expectation over a continuous set of states, maybe a very high-dimensional state, so I can't compute this expectation exactly. What I'll do instead is use my simulator to sample a set of states from this distribution, from this P_{s^(i)a}, from the state transition distribution of where I get to if I take the action a in the state s^(i), and then I'll average over that sample of states to compute this expectation.
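Putting that together, the target for each sampled state and the fitting step over θ look like this:

```latex
% Target value for each sampled state s^{(i)}
y^{(i)} = R(s^{(i)}) + \gamma \max_{a} \, \mathbb{E}_{s' \sim P_{s^{(i)}a}}\!\left[ V(s') \right]

% Fit the parameters by least squares, since V(s^{(i)}) \approx \theta^{T}\phi(s^{(i)})
\theta := \arg\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{m} \left( \theta^{T}\phi(s^{(i)}) - y^{(i)} \right)^{2}
```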
And so, stepping through the algorithm, it just says that for each state and for each action, I'm going to sample a set of states, s'_1 through s'_k, from that state transition distribution P_{s^(i)a}, still using the model, and then I'll set q(a) to be equal to that average, and so this is my estimate for R(s^(i)) + γ times that expected value of V(s') for that specific action a. Then I'll take the maximum over actions a, and this gives me y^(i), so y^(i) is my estimate for that state. And finally, I'll run linear regression, which is that last step of the algorithm, to get θᵀφ(s^(i)), that is V(s^(i)), to be close to the y^(i)'s. And so this algorithm is called fitted value iteration, and it actually often works quite well for continuous problems, for problems with anywhere from 6- to 10- to 20-dimensional state spaces, if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Some of you didn't have your hands up. Are there questions? Yeah?

Student: Is there a reward function in this setup?

Instructor (Andrew Ng): Oh, yes. An MDP comprises S, A, the state transition probabilities, γ and R, and so for continuous state spaces, S would be something like R^4 for the inverted pendulum. The actions we discretize, the state transition probabilities we specify with the model or the simulator, γ is just a real number like 0.99, and the reward function is usually a function that's given to you. And so the reward function is some function of your 4-dimensional state. For example, you might choose a reward function to be, I don't know, just as an example of a simple reward function, -1 if the pole has fallen over. So you choose your reward function to be -1 if the inverted pendulum falls over, which you might define as its angle being greater than 30 degrees or something, and zero otherwise. So that would be an example of a reward function that you can choose for the inverted pendulum. But yes, we're assuming a reward function is given to you, so that you can compute R(s^(i)) for any state.
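To make the whole loop concrete, here is a minimal sketch of fitted value iteration, assuming you have a stochastic simulator `simulate(s, a)` that returns a sampled next state, a reward function `R(s)`, a feature map `phi(s)`, and a finite set of discretized actions; all of these names are placeholders for whatever your problem provides, not anything standard.

```python
import numpy as np

def fitted_value_iteration(states, actions, simulate, R, phi,
                           gamma=0.99, k=10, n_iters=50):
    """Fitted value iteration sketch for a continuous-state MDP.

    states  : list of m sampled states s^(1), ..., s^(m)
    actions : finite (discretized) set of actions
    simulate: stochastic simulator, simulate(s, a) -> sampled next state s'
    R       : reward function, R(s) -> float
    phi     : feature map, phi(s) -> feature vector
    """
    n_features = len(phi(states[0]))
    theta = np.zeros(n_features)        # initialize parameters to zero

    def V(s):
        # current approximation V(s) = theta^T phi(s); reads the latest theta
        return theta @ phi(s)

    for _ in range(n_iters):
        ys = []
        for s in states:
            q_values = []
            for a in actions:
                # sample k next states to approximate E_{s' ~ P_sa}[V(s')]
                # (with a deterministic simulator, k = 1 suffices)
                next_states = [simulate(s, a) for _ in range(k)]
                q_values.append(R(s) + gamma * np.mean([V(sp) for sp in next_states]))
            ys.append(max(q_values))    # y^(i) = max_a q(a)

        # linear regression step: fit theta so that theta^T phi(s^(i)) ~ y^(i)
        Phi = np.vstack([phi(s) for s in states])
        theta, *_ = np.linalg.lstsq(Phi, np.array(ys), rcond=None)

    return theta
```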
Are there other questions? Actually, let me try asking a question. So, everything I did here assumed that we have a stochastic simulator. It turns out I can simplify this algorithm if I have a deterministic simulator, where a deterministic simulator means that given a state and an action, my next state is always exactly determined. So let me ask you: if I have a deterministic simulator, how would I change this algorithm? How would I simplify this algorithm?

Student: Lower the number of samples that you're drawing [inaudible].

Instructor (Andrew Ng): Right, so Justin's got it right. If I have a deterministic simulator, all my samples from it would be exactly the same, and so if I have a deterministic simulator, I can set k to be equal to 1; I don't need to draw k different samples. I really only need one sample if I have a deterministic simulator, so you can simplify this by setting k = 1 if you have a deterministic simulator. Yeah?

Student: I guess I'm really confused about the, yeah, we sort of turned this [inaudible] into something that looks like linear regression, or, you know, the theta transpose times something that we're used to, but I guess I'm a little, I don't really know what question to ask, but when we did this before we had discrete states and everything. We were concerned with finding this optimal policy, and it doesn't seem like we've said the word policy in a while, so it's kind of difficult.

Instructor (Andrew Ng): Okay, yeah, so this does all get back to the policy, but maybe I should just say a couple of words, so let me actually try to get at some of what you're saying. Our strategy for finding an optimal policy has been to first find some way to find V*, and then to get some approximation of π*. So far everything I've been doing has been focused on how to find V*. I just want to say one more word on features. It actually turns out that for linear regression it's often very easy, it's often not terribly difficult, to choose a reasonable set of features.
Choosing features for approximating the value function is often somewhat harder, because the value of a state is how good it is to start off in that state: what is my expected sum of discounted rewards if I start off in a certain state? And so what the features of the state have to measure is really how good it is to start in a certain state. And so for the inverted pendulum, you actually have that states where the pole is vertical and where the cart is centered on your track, say, are maybe better, and so you can come up with features that measure the orientation of the pole and how close you are to the center of the track and so on, and those will be reasonable features to use to approximate V*. Although in general it is true that choosing features for value function approximation is often slightly trickier than choosing good features for linear regression.

Okay, and then there's Justin's question of: given V*, how do you go back to actually find a policy? In the discrete case, we have that π*(s) is equal to the arg max over a of the expectation of V*(s'), where s' is drawn from P_sa. So again, I used to write this as a sum over states; I'll just write it as an expectation. And so then once you find the optimal value function V*, you can then find the optimal policy π* by computing this arg max. Now, if you're in a continuous-state MDP, then you can't actually do this in advance for every single state, because there's an infinite number of states, and so you can't actually perform this computation in advance for every single state. What you do instead is, whenever your robot is in some specific state s, it's only when your system is in some specific state s, like your car is at some specific position and orientation, or your inverted pendulum is in some specific position, posed at some specific angle θ.
It's only when your system, be it a factory or a board game or a robot, is in some specific state s that you would then go ahead and compute this arg max. So it's only when you're in some state s that you compute this arg max, and then you execute that action a, and then as a result of your action, your robot transitions to some new state, and then it's given that specific new state that you compute the arg max again, using that specific state s that you're in.

There are a few ways to do this. One way to do it is actually the same as in the inner loop of the fitted value iteration algorithm: because there's an expectation over a large number of states, you'd sample a set of states from the simulator and then approximate this expectation using an average over your samples, so it's exactly like the inner loop of the fitted value iteration algorithm. So you could do that; that's sometimes done. Sometimes it can also be a pain to have to sample a set of states to approximate those expectations every time you want to take an action in your MDP.

There are a couple of special cases where this can be done more simply. One special case is if you have a deterministic simulator. If it's a deterministic simulator, so in other words, if your simulator is just some function f, which could be a linear or a nonlinear function, then the next state, s_{t+1}, is just some function of your previous state and action. If that's the case, then this expectation simplifies, and the policy becomes the arg max over a of V*(f(s, a)), because this is really saying s' = f(s, a). I've switched back and forth between notations, and I hope that's okay: s to denote the current state and s' to denote the next state, versus s_t and s_{t+1} for the current and next state. Both of these are sort of standard notation, so don't mind my switching back and forth between them.
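Written out, the greedy policy with respect to V*, and its simplification when the simulator is deterministic with s' = f(s, a), are:

```latex
% Greedy policy with respect to the (approximate) optimal value function
\pi^{*}(s) = \arg\max_{a} \, \mathbb{E}_{s' \sim P_{sa}}\!\left[ V^{*}(s') \right]

% Deterministic simulator, s' = f(s, a): the expectation drops away
\pi^{*}(s) = \arg\max_{a} \, V^{*}\!\left( f(s, a) \right)
```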
But if it's a deterministic simulator, you can then just compute what the next state s' would be for each action you might take from the current state, and then take the arg max over actions a; basically, choose the action that gets you to the highest-value state. And so that's one case where you can compute the arg max, where you can compute that expectation without needing to sample and average over some set of samples.

Another very common case, it turns out, is if you have a stochastic simulator, but your simulator happens to take on a very specific form: s_{t+1} = f(s_t, a_t) + ε_t, where ε_t is Gaussian noise. This is a very common way to build simulators, where you model the next state as a function of the current state and action plus some noise, and so one specific example would be that sort of linear dynamical system that we talked about, where the next state is a linear function of the current state and action plus Gaussian noise. In this case, you can approximate the arg max over a as follows: you take that expectation that you're trying to approximate, the expected value of V(s'), and you approximate it by V of the expected value of s'. This is an approximation; the expected value of a function is usually not equal to the function of the expectation, but it is often a reasonable approximation, and so that would be another way to approximate that expectation. And so you choose the actions according to the same formula as I wrote just now. And so this would be a way of approximating this arg max, essentially ignoring the noise in the simulator. And this often works pretty well as well, just because many simulators turn out to be of the form of some linear or some nonlinear function plus zero-mean Gaussian noise, so you can just ignore the zero-mean Gaussian noise, so that you can compute this arg max quickly.
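Here is a minimal sketch of these three ways of choosing an action greedily with respect to a learned value approximation; `simulate`, `f`, `f_mean`, `V`, and the discretized `actions` are placeholders under the same assumptions as the fitted value iteration sketch above.

```python
import numpy as np

def greedy_action_sampled(s, actions, simulate, V, k=100):
    """General stochastic simulator: approximate E_{s' ~ P_sa}[V(s')] by
    averaging V over k sampled next states, then take the arg max over actions."""
    def expected_value(a):
        return np.mean([V(simulate(s, a)) for _ in range(k)])
    return max(actions, key=expected_value)

def greedy_action_deterministic(s, actions, f, V):
    """Deterministic simulator s' = f(s, a): no sampling needed; just pick
    the action whose successor state has the highest approximate value."""
    return max(actions, key=lambda a: V(f(s, a)))

def greedy_action_mean_dynamics(s, actions, f_mean, V):
    """Simulator of the form s' = f(s, a) + zero-mean Gaussian noise:
    approximate E[V(s')] by V(E[s']) = V(f(s, a)), i.e. ignore the noise."""
    return max(actions, key=lambda a: V(f_mean(s, a)))

# Example usage, with theta learned by fitted value iteration and feature map phi:
#   V = lambda s: theta @ phi(s)
#   a = greedy_action_deterministic(current_state, actions, f, V)
```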
And just to complete this: what that V*(f(s, a)) is, you'd write it as θᵀφ(s'), where s' = f(s, a). So this V* you would compute using the parameters θ that you just learned with the fitted value iteration algorithm. Questions about this?

Student: [Inaudible] case, for real-time applications, is it possible to use that [inaudible], for example for [inaudible]?

Instructor (Andrew Ng): Yes, for real-time applications, is it possible to sample k states and use them to approximate the expectation? Computers today are actually amazingly fast. I'm actually often surprised by how much you can do in real time. So the helicopter: we're actually flying the helicopter using an algorithm different than this, I should say, but my intuition is that you could actually do this with a helicopter. A helicopter controller runs at somewhere between 10 Hz and 50 Hz, so you'd need to do this 10 times a second to 50 times a second, and that's actually plenty of time to sample 1,000 states and compute this expectation. Helicopters are difficult because helicopters are mission critical, and if you do something wrong, you can do serious damage very fast, and so, maybe not for good reasons, we've actually tended to avoid tossing coins when we're in the air. So the idea of letting our actions come out of some random process is slightly scary, and we just tend not to do that. I should say that's probably not a great reason, because you're averaging over a large number of things here and it would very likely work fine, but just as a maybe overly conservative design choice, we tend not to run anything randomized on the helicopter, which is probably being overly conservative. It's the choice we made because other things are slightly safer. But I think you can actually often do this, so long as the model can be evaluated fast enough that you can sample 100 state transitions or 1,000 state transitions and do that at 10 Hz. That said, that isn't always attainable, which is why we often use the other approximations that don't require drawing a large sample. Anything else? No, okay, cool.
So now you know one algorithm for reinforcement learning on continuous state spaces. We'll then pick up with some more ideas on some even more powerful algorithms for solving MDPs with continuous state spaces. Thanks, let's close for today.