This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, essentially called REINFORCE and called Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that are maybe some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t equal to that matrix L_t times s_t. L_t was this matrix of parameters that came out of the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we would choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate of what the state is given all the observations. In particular, I talked about a Kalman filter, with which we worked out our posterior distribution of what the state is given all the observations up to a certain time: given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this s_{t|t}, which is going to be our best guess for what the state currently is. And then we choose actions a_t = L_t s_{t|t}, using our estimate of the state rather than the true state, because we don't know the true state anymore in this POMDP.
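To make the "estimate, then act" recipe concrete, here is a minimal sketch of one time step of this certainty-equivalence controller. It assumes the usual linear-Gaussian model (next state = A s + B a + process noise, observation = C s + observation noise) and assumes the LQR gain L_t and the noise covariances Q and R are already available; the names and shapes here are illustrative, not from the lecture.

```python
import numpy as np

def kalman_step(s_hat, Sigma, a_prev, y, A, B, C, Q, R):
    """One Kalman filter step: predict with the dynamics, then correct with the observation.

    s_hat, Sigma : previous posterior mean and covariance (s_{t-1|t-1}, Sigma_{t-1|t-1})
    Q, R         : process and observation noise covariances (assumed known)
    Returns the new posterior mean and covariance (s_{t|t}, Sigma_{t|t}).
    """
    # Predict: s_{t|t-1}, Sigma_{t|t-1}
    s_pred = A @ s_hat + B @ a_prev
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct using the noisy observation y_t
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)  # Kalman gain
    s_new = s_pred + K @ (y - C @ s_pred)
    Sigma_new = (np.eye(len(s_hat)) - K @ C) @ Sigma_pred
    return s_new, Sigma_new

def choose_action(L_t, s_hat):
    # Certainty equivalence: act as if the state estimate were the true state.
    return L_t @ s_hat
```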
So it turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism of the POMDP - I should just write it here - a POMDP formally is a tuple (S, A, Y, {P_sa}, {O_s}, T, R), where the changes from an MDP are that Y is the set of possible observations and the O_s are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O_{s_t}, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP. In the more general case, that strategy of designing a controller assuming full observability and then just estimating the state and plugging the two together - for general POMDPs, that same strategy is often a very reasonable strategy, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is actually talk about a different class of reinforcement learning algorithms, called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs - to fully observed Markov decision processes - or to these POMDPs, these partially observable MDPs. What I'll actually do now is describe policy search algorithms applied to MDPs, to the fully observable case, and at the end I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs.
In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard. But nonetheless, policy search algorithms turn out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: the way we define the set capital Pi of policies and search for a policy in this set is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in that class script H.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've talked about so far, we would try to find V-star, the optimal value function, and then we'd use the optimal value function to compute, or try to approximate, pi-star. So all the approaches we talked about previously are a strategy for finding a good policy in which we first compute the value function, and then go from that to the policy. In contrast, in policy search algorithms - in these so-called direct policy search algorithms - the idea is that we're going to, quote, "directly" try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

And also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I said let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air.
In the same way, as I define policy search algorithms, there will be a step where I say, "Well, let's try to compute the actions - let's try to approximate what a good action is - using a logistic function of the state." So again, I'll sort of pull a function out of the air. I'll say, "Let's just choose a function, and that'll be our choice of the policy class: let's take as input the state, map it through a logistic function, and hopefully we'll approximate what a good action is using a logistic function of the state." So that choice of policy class is again a little bit arbitrary, but it's arbitrary in the same way it was when we were talking about supervised learning.

So to develop our first policy search algorithm, I'm actually going to need a new definition. For our first policy search algorithm, we'll need to work with stochastic policies. What I mean by a stochastic policy is a function pi that maps from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so these have to sum to one over the actions: for every state s, the sum over a of pi(s, a) equals one. In other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose you're executing some stochastic policy pi. In this example, let's say I have three actions. What that means is that if I'm in some state s, I would compute pi(s, a1), pi(s, a2), and pi(s, a3) - if I have a three-action MDP. These will be three numbers that sum up to one. Then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.
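As a minimal sketch of what "executing a stochastic policy" means in code (the function name and the three-action setup are just for illustration):

```python
import numpy as np

def execute_stochastic_policy(pi, s, actions):
    """Sample one action in state s from a stochastic policy.

    pi(s, a) returns the probability of taking action a in state s;
    the probabilities over `actions` are assumed to sum to one.
    """
    probs = np.array([pi(s, a) for a in actions])
    return np.random.choice(actions, p=probs)
```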
So as a concrete example, let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever.

And now let's come up with a stochastic policy for this problem. To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1 - so pi(s, a1) - is 1/(1 + e^{-theta^T s}). I just chose the logistic function because it's a convenient function we've used a lot. So I'm going to say that my policy is parameterized by a set of parameters theta, and any given set of parameters theta gives me a stochastic policy. If I'm executing that policy with parameters theta, that means the chance of my choosing the accelerate-right action is given by this number. And because my chances of executing actions a1 and a2 must sum to one, this also gives me pi(s, a2) = 1 - 1/(1 + e^{-theta^T s}). So, just to say it again: when I'm in some state s, I'm going to compute this number 1/(1 + e^{-theta^T s}), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is s = [1, x, x-dot, phi, phi-dot] - the cart position, the cart velocity, the pole angle, and the angular velocity - where I added an extra one as an intercept term, just to give my logistic function an extra feature.
If I choose my parameters and my policy to be, say, a theta with a one in the position corresponding to phi and zeros elsewhere, then that means that at any state, the probability of my taking action a1 - the probability of my taking the accelerate-right action - is 1/(1 + e^{-theta^T s}); and taking the inner product of theta and s just gives you phi, so this equals 1/(1 + e^{-phi}). So if I choose my parameters theta like that, what that means is that, just depending on the angle phi of my inverted pendulum, the chance of my accelerating to the right is just this function of the angle of my inverted pendulum. And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole is leaning over to the right, then I want to accelerate to the right to catch it; and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take those other effects into account as well. Maybe if the pole is leaning far to the right but is actually on its way to swinging back, as captured by the angular velocity, then you might be less worried about having to accelerate hard to the right. So these are the sorts of behavior you can get by varying the parameters theta.

And so our goal is to tune the parameters theta - our goal in policy search is to tune the parameters theta so that when we execute the policy pi_theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta; we want to choose parameters theta to maximize that.
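Here is a minimal sketch of this two-action logistic policy for the inverted pendulum, following the parameterization above; the feature ordering and the specific theta are just the illustrative choice from the example.

```python
import numpy as np

def pi_theta(theta, s):
    """Return (P(accelerate right), P(accelerate left)) in state s."""
    p_right = 1.0 / (1.0 + np.exp(-theta @ s))  # logistic of theta^T s
    return p_right, 1.0 - p_right

def sample_action(theta, s):
    p_right, _ = pi_theta(theta, s)
    return "right" if np.random.rand() < p_right else "left"

# State features: [1 (intercept), cart position x, cart velocity, pole angle phi, angular velocity]
s = np.array([1.0, 0.0, 0.0, 0.1, 0.0])      # pole leaning slightly to the right
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # theta^T s = phi, as in the example
print(pi_theta(theta, s))  # P(right) > 0.5, so we usually accelerate right to catch the pole
```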
Are there questions about the problem setup, and policy search and policy classes, or anything? Yeah? In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? Oh, yeah, right - so what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use something like a softmax parameterization, similar to the softmax regression that we saw earlier in the class. You may have a set of parameters theta_1 through theta_d if you have d actions, and say pi(s, a_i) = e^{theta_i^T s} / (sum over j of e^{theta_j^T s}). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.

Anything else? Yeah? [Inaudible question about stochastic policies.] I see - so the question is: given that the MDP has stochastic transition probabilities, is it possible to use deterministic policies and rely on the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. Formally, it is possible: if you already have a fixed policy, then you'd be able to do that. Anything else? Yeah - no, I guess even a deterministic class of policies could do that, but for the derivation later, I actually need to keep the stochasticity in the policy itself.

Actually - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay. Thanks.
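A minimal sketch of the softmax policy class just mentioned, for d discrete actions (the parameter shapes are just an illustrative choice):

```python
import numpy as np

def softmax_policy(Theta, s):
    """Theta holds one parameter vector per action (shape: d actions x n features).

    Returns pi(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s) for each action i.
    """
    logits = Theta @ s
    logits -= logits.max()          # numerical stability; doesn't change the probabilities
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_softmax_action(Theta, s):
    probs = softmax_policy(Theta, s)
    return np.random.choice(len(probs), p=probs)  # index of the chosen action
```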
So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by the author Ron Williams, but it captures its essence.

Here's the idea. In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. Or - it turns out that if s_0 is drawn from some fixed initial-state distribution, everything that follows still works, but let's just say s_0 is some fixed initial state. So my goal is to maximize the expected sum of rewards; I'll take the policy as given and leave it implicit in the notation. The random variables in this expectation are the sequence of states and actions s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.

So let me write out this expectation explicitly as a sum over all possible state and action sequences of the probability of that sequence times the sum of rewards - that's what an expectation is: the probability of the random variables times that quantity. Let me just expand out this probability. The probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state - if this is a deterministic initial state, then all the probability mass would be on one state; otherwise, there's some distribution over initial states - times the probability that you chose action a_0 from that state s_0, times the probability that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability that you chose the action you chose there, and so on, down to the last transition and the last action.

So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this. It turns out that later on I'm going to need to write this sum of rewards a lot, so I'm just going to call it the payoff from now on. Whenever later in this lecture I write the word payoff, I just mean this sum of rewards. So our goal is to maximize the expected payoff - our goal is to maximize this sum.
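The boards aren't captured in the transcript, but written out, the expansion just described is the following (with the payoff taken to be the sum of rewards along the sequence, written here as a function of the states, as in the examples later in the lecture):

```latex
\mathbb{E}[\text{payoff}]
  \;=\; \sum_{(s_0,a_0,\ldots,s_T,a_T)}
        P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
        P_{s_{T-1} a_{T-1}}(s_T)\,\pi_\theta(s_T,a_T)
        \;\cdot\;\underbrace{\big[R(s_0)+R(s_1)+\cdots+R(s_T)\big]}_{\text{payoff}}
```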
Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm - this is how we're going to update the parameters theta. Repeat: sample a state-action sequence s_0, a_0, s_1, a_1, ..., s_T, a_T; compute the payoff; and then update theta := theta + alpha [ (gradient of pi_theta(s_0, a_0) with respect to theta) / pi_theta(s_0, a_0) + ... + (gradient of pi_theta(s_T, a_T) with respect to theta) / pi_theta(s_T, a_T) ] times the payoff, for some learning rate alpha.

The way you sample a state-action sequence is that you just take your current stochastic policy and execute it in the MDP: start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps. That's how you sample the state sequence. Then you compute the payoff, and then you perform this update.

So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and this thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I'm taking a step that's chosen somewhat randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how this changes the parameters theta - in particular, I want to know what is the expected value of the change to my parameters theta. Our goal is to maximize the expected payoff, so as long as the updates are, on expectation - on average - taking us uphill on the expected payoff, then we're happy.
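The update itself was written on the board and isn't captured in the transcript, so here is a minimal sketch of one iteration in the form just described. It assumes you already have functions to sample from pi_theta, to compute the gradient of log pi_theta, and to step the MDP; all of those names are placeholders.

```python
import numpy as np

def reinforce_iteration(theta, sample_action, grad_log_pi, reward, env_reset, env_step, T, alpha):
    """One iteration of (essentially) the REINFORCE update described above.

    sample_action(theta, s) -> a        : draw an action from the stochastic policy pi_theta
    grad_log_pi(theta, s, a) -> vector  : grad_theta log pi_theta(s, a), i.e. (grad_theta pi)/pi
    reward(s) -> scalar, env_reset() -> s_0, env_step(s, a) -> next state
    """
    s = env_reset()
    grad_sum = np.zeros_like(theta)
    payoff = reward(s)                        # R(s_0)
    for _ in range(T):
        a = sample_action(theta, s)
        grad_sum += grad_log_pi(theta, s, a)  # accumulate sum_t (grad_theta pi)/pi
        s = env_step(s, a)
        payoff += reward(s)                   # ... + R(s_1) + ... + R(s_T)
    # Stochastic gradient ascent step: its expectation is grad_theta E[payoff].
    return theta + alpha * grad_sum * payoff
```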
It turns out that this algorithm is a form of stochastic gradient ascent. Remember, when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters - maybe you're trying to minimize a quadratic function - and the parameters wander around randomly until they get close to the optimum of that quadratic surface. It turns out that the REINFORCE algorithm will be very much like that: it will be a stochastic gradient ascent algorithm in which the step we take on each iteration is a little bit random - it's determined by the random state-action sequence - but on expectation, this turns out to be essentially a gradient ascent algorithm. And so it will do something like this: it'll wander around randomly, but on average take you towards the optimum. So let me go ahead and prove that now.

Let's see. What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then hopefully show that by deriving a gradient ascent update rule, I end up with this update on expectation.

Before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is f' g h + f g' h + f g h'. You take the derivatives of these functions one at a time: first I differentiate f, getting f prime, leaving the other two fixed; then I differentiate g, leaving the other two fixed; then I differentiate h, getting h prime, leaving the other two fixed. So that's the product rule for derivatives.

Now refer back to the equation where we wrote out the expected payoff as this sum over all the state-action sequences of the probability times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function - take the derivative with respect to theta and then try to go uphill on it. So using the product rule, when I take the derivative of this function with respect to theta, what I get is a sum of terms: there are several factors here that depend on theta, and so I'll end up with one term that corresponds to the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the next pi_theta factor keeping everything else fixed, and so on, down to one term from the derivative of the last pi_theta factor keeping everything else fixed. So just apply the product rule to this.
Let's write that down. So I have that the derivative with respect to theta of the expected value of the payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how it goes; this is a four-step derivation.

So there's the sum over the state-action sequences, as we saw before, and inside it a sum of terms, one for each time step, where in each term one of the pi_theta factors has been replaced by its derivative with respect to theta, and then all of that times the payoff. That huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta, one at a time. There's the term with the derivative of the first factor, pi_theta(s_0, a_0); there's the second one; then plus dot, dot, dot; and then the last term. So that was step one of four.

And then, by algebra - let me just write this down and convince us all that it's true - this is the second of four steps, in which we just convince ourselves that if I take that big product out in front and multiply it by a sum of ratios, the derivative of pi_theta over pi_theta, one ratio for each time step, then I get back the same sum of terms. For example, when I multiply the product out in front against the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and is replaced by the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But that product out in front is just what I worked out previously: it's the joint probability of the state-action sequence. So now I have that probability, times the sum of ratios, times the payoff. And so, by the definition of expectation, this whole thing is just equal to the expected value of that sum of ratios times the payoff.

So this thing inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm - roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff.
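The boards themselves aren't in the transcript; written out in the same notation as above, the four steps just described amount to:

```latex
\begin{aligned}
\nabla_\theta\,\mathbb{E}[\text{payoff}]
&= \sum_{s_0,a_0,\ldots}\Big[(\nabla_\theta \pi_\theta(s_0,a_0))\,P(s_0)\,P_{s_0a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
   \;+\; P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0a_0}(s_1)\,(\nabla_\theta \pi_\theta(s_1,a_1))\cdots
   \;+\;\cdots\Big]\cdot\text{payoff} \\
&= \sum_{s_0,a_0,\ldots} P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
   \left[\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
        +\cdots
        +\frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right]\cdot\text{payoff} \\
&= \mathbb{E}\!\left[\left(\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)\cdot\text{payoff}\right].
\end{aligned}
```

The quantity inside the final expectation is exactly the update direction used in the inner loop of the algorithm above, which is what the argument needed.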
That's how I started this whole derivation: I said, let's look at our expected payoff and take the derivative of that with respect to theta. What we've proved is that, on expectation, the step direction taken by REINFORCE is exactly the gradient of the thing I'm trying to optimize. This shows that this algorithm is a stochastic gradient ascent algorithm.

I wrote a lot. Why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then see if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

A few more comments. We've talked about value function approximation approaches, where you approximate V-star and then go from V-star to pi-star, and now also about policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is: for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well? For a problem like the inverted pendulum, this is quite likely to be true: going through the different choices of parameters, you can express things like, if the pole is leaning towards the right, then accelerate towards the right to try to catch it. For the inverted pendulum, that's probably good enough. For lots of what are called low-level control tasks - things like driving a car, where the low-level reflexes are do you steer left to avoid another car, do you steer left to follow the road, or flying a helicopter - again, very short-time-scale types of decisions. I like to think of these as the kinds of decisions a trained driver or a trained pilot makes almost as a reflex: very quick, instinctive things where you map very directly from the inputs to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.
In contrast, if you have problems that require long, multi-step reasoning - things like a game of chess, where you have to reason carefully about if I do this, then they'll do that, then I'll do this, then they'll do that - those I think of as less instinctual, as very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. I'll say more about this later.

The last thing I want to do is tell you about one more algorithm, but first, just as a side comment: it turns out also that if you have a POMDP, a partially observable MDP - I don't want to say too much about this - and you only have an approximation, call it s-hat, of the true state (and this could be s-hat equal to s_{t|t} from a Kalman filter), then you can still use these sorts of policy search algorithms, with a policy of the form pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that looks only at your estimate of the state to choose the action. By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow. So it does work, but one thing to watch out for is that because you're taking these gradient ascent steps that are very noisy - you're sampling a state-action sequence and then taking a gradient ascent step in what is essentially a random direction that is only correct on expectation - the gradient ascent direction for REINFORCE can sometimes be quite noisy, and so it's not that uncommon to need a million iterations of gradient ascent, or ten million, or a hundred million iterations of gradient ascent for REINFORCE. So that's just something to watch out for.
One consequence of that is that in the REINFORCE algorithm - I shouldn't really call it REINFORCE; in what's essentially the REINFORCE algorithm - there's this step where you need to sample a state-action sequence. In principle, you could do this on your own robot: if there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to run your robot ten million times. I personally have seen many more applications of REINFORCE in simulation: you can easily run ten thousand, or ten million, simulations of your robot, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE used to learn with a simulator than applied directly on a physical device.

The last thing I wanted to do is tell you about one other algorithm, one final policy search algorithm. Could we switch to the laptop display, please? It's a policy search algorithm called Pegasus, and it's actually what we used for our autonomous helicopter flights for many years - there are some other things we do now as well.

So here's the idea. There's a reminder slide on the RL formalism. There's nothing here that you don't know, but I just want to describe the RL formalism pictorially, because I'll use it later. I'm going to draw the reinforcement learning picture as follows: you initialize the system - say a helicopter, or whatever - in some state s_0, you choose an action a_0, then the helicopter dynamics take you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you passed through, and that's your total payoff. So this is just the picture I want to use to summarize the RL problem. Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by a green box.
So, our policy - and I'll switch back to deterministic policies for now - my deterministic policy will be some function mapping from the states to the actions. As a concrete example, you can imagine that in the policy search setting, you may have a linear class of policies: you may imagine that the action a will be, say, a linear function of the state, and your goal is to learn the parameters of that linear function. So it's as if you're doing linear regression on policies, except you're trying to optimize the reinforcement learning objective. Just imagine that the action a is theta transpose s, and you run policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies.

Here's the idea. Quite often we come up with a model, or a simulator, for the MDP. As before, a model or a simulator is just a box that takes as input some state s_t and some action a_t, and then outputs the next state s_{t+1} that you might end up in in the MDP. This s_{t+1} will be a random state; it will be drawn from the random state transition probabilities of the MDP. This is important - very important: s_{t+1} will be a random function of s_t and a_t, and the simulator reflects that. So, for example, for autonomous helicopter flight, you might build a simulator using supervised learning - an algorithm like linear regression, or a nonlinear generalization of linear regression - so that you get a nonlinear model of the dynamics, of what s_{t+1} is as a random function of s_t and a_t.

Now, once you have a simulator, given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find a policy pi, mapping from states to actions - the goal is to find the green box like that - that works well. So if you have any one fixed policy pi, you can evaluate the policy pi just using the simulator, via the picture shown at the bottom of the slide.
So concretely, you take your initial state s_0 and feed it into the policy pi; your policy pi outputs some action a_0; you plug that into the simulator; the simulator outputs a random state s_1; you feed s_1 into the policy, and so on; and you get a sequence of states s_0 through s_T that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation and see how well it does - measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once; in general, you would run the policy through the simulator some number of times and then average over M simulations, to get a better estimate of the policy's expected payoff.

Now, given any one fixed policy, this gives me a way to evaluate the expected payoff of that policy. So one reasonably obvious thing you might try to do is then just search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff. Does that make sense? My policy has some parameters theta - say my policy is that my action a is equal to theta transpose s, if it's a linear policy - and for any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. So one thing I might try is to search for parameters theta that maximize my estimated payoff.

It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate policies, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a very slightly different answer.
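Here's a minimal sketch of that evaluation loop - roll the policy out in the simulator and average the total reward over M rollouts; the simulator, reward, and policy functions are placeholders. Running it twice on the same policy will generally return slightly different numbers, which is exactly the difficulty described next.

```python
import numpy as np

def estimate_payoff(policy, simulator, reward, s0, T, M):
    """Monte Carlo estimate of a policy's expected payoff using a stochastic simulator.

    policy(s) -> a, simulator(s, a) -> next state (random), reward(s) -> scalar.
    """
    totals = []
    for _ in range(M):                 # M independent rollouts
        s, total = s0, reward(s0)
        for _ in range(T):
            a = policy(s)
            s = simulator(s, a)        # random next state
            total += reward(s)
        totals.append(total)
    return np.mean(totals)             # noisy estimate: repeated calls give different answers
```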
So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies. In other words, as I vary the parameters in my policy, I get different points on the horizontal axis: as I vary the parameters theta, I get different policies, and so I'm moving along the x-axis. The total payoff I'm going to plot on the vertical axis, and the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff.

You could try to search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that, just by chance, looked a little bit better than it should have. I evaluate a second policy, and just by chance it looks a little bit worse. I evaluate a third policy, a fourth, and sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random: when I run the same policy twice in simulation, I might get back slightly different answers. So as I evaluate more and more policies, these black dots are the picture I get. My goal is to try to optimize the red line, and I hope you appreciate that this is a hard problem, especially when all your optimization algorithm gets to see are these black dots; it doesn't have direct access to the red line. And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a ten-dimensional space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem.

So it turns out there's one way to make this optimization problem much easier. Here's the idea - the method is called Pegasus, which is an acronym for something; I'll tell you what later. The simulator repeatedly makes calls to a random number generator to generate random numbers r_t, which are used to simulate the stochastic dynamics.
643 00:55:33,079 --> 00:55:36,709 What I mean by that is that the simulator takes as input the state 644 00:55:36,709 --> 00:55:39,749 and action, and it outputs the next state randomly, 645 00:55:39,749 --> 00:55:43,999 and if you peer into the simulator, if you open the box of the simulator and ask 646 00:55:43,999 --> 00:55:49,529 how is my simulator generating these random next states ST plus one, 647 00:55:49,529 --> 00:55:53,809 pretty much the only way to do so - pretty much the only way 648 00:55:53,809 --> 00:55:56,909 to write a simulator with random outputs is 649 00:55:56,909 --> 00:56:00,209 that we're gonna make calls to a random number generator, 650 00:56:00,209 --> 00:56:03,440 and get random numbers, these RTs, 651 00:56:03,440 --> 00:56:08,299 and then you have some function that takes as input S0, A0, and 652 00:56:08,299 --> 00:56:10,979 the results of your random number generator, 653 00:56:10,979 --> 00:56:14,430 and it computes the next state as a function of the inputs and of the random 654 00:56:14,430 --> 00:56:16,839 number it got from the random number 655 00:56:16,839 --> 00:56:17,699 generator. This is 656 00:56:17,699 --> 00:56:22,359 pretty much the only way anyone implements any random code, any code 657 00:56:22,359 --> 00:56:26,179 that generates random outputs. You make a call to a random number generator, 658 00:56:26,179 --> 00:56:30,869 and you compute some function of the random number and of your other 659 00:56:30,869 --> 00:56:35,299 inputs. The reason that when you evaluate different policies you get 660 00:56:35,299 --> 00:56:36,439 different answers 661 00:56:36,439 --> 00:56:37,749 is because 662 00:56:37,749 --> 00:56:41,459 every time you rerun the simulator, you get a different sequence of random 663 00:56:41,459 --> 00:56:43,569 numbers from the random number generator, 664 00:56:43,569 --> 00:56:45,569 and so you get a different answer every time, 665 00:56:45,569 --> 00:56:48,139 even if you evaluate the same policy twice. 666 00:56:48,139 --> 00:56:50,910 So here's the idea. 667 00:56:50,910 --> 00:56:56,150 Let's say we fix in advance the sequence of random numbers, 668 00:56:56,150 --> 00:57:00,229 choose R1, R2, up to RT minus one. 669 00:57:00,229 --> 00:57:03,359 Fix the sequence of random numbers in advance, 670 00:57:03,359 --> 00:57:07,280 and we'll always use the same sequence of random numbers 671 00:57:07,280 --> 00:57:10,189 to evaluate different policies. Go 672 00:57:10,189 --> 00:57:14,549 into your code and fix R1, R2, through RT minus one. 673 00:57:14,549 --> 00:57:17,640 Choose them randomly once and then fix them forever. 674 00:57:17,640 --> 00:57:20,979 If you always use the same sequence of random numbers, then 675 00:57:20,979 --> 00:57:25,210 the system is no longer random, and if you evaluate the same policy twice, 676 00:57:25,210 --> 00:57:28,569 you get back exactly the same answer. 677 00:57:28,569 --> 00:57:33,330 And so these sequences of random numbers are what we call scenarios, and 678 00:57:33,330 --> 00:57:37,410 Pegasus is an acronym for policy evaluation-of-goodness and search using scenarios. 679 00:57:37,410 --> 00:57:38,859 So 680 00:57:42,369 --> 00:57:45,949 when you do that, this is the picture you get. 681 00:57:45,949 --> 00:57:50,190 As before, the red line is your expected payoff, and by fixing the random numbers, 682 00:57:50,190 --> 00:57:53,469 you've defined some estimate of the expected payoff.
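A minimal sketch of that idea, under the assumption that the simulator can be written as a deterministic function `step_deterministic(s, a, r)` of the state, the action, and an externally supplied random number; that function, along with the horizon and noise size below, is a made-up illustration rather than anything specific from the lecture. The scenario is drawn once and then frozen.

```python
import numpy as np

T, noise_dim = 100, 4               # horizon and per-step noise size (illustrative values)
rng = np.random.default_rng(0)

# Draw the random numbers R_1, ..., R_T once, up front.  This fixed sequence
# is the "scenario"; every policy evaluation reuses exactly these numbers.
scenario = rng.normal(size=(T, noise_dim))

def pegasus_payoff(theta, s0, step_deterministic, reward, scenario):
    """Payoff of the linear policy a = theta^T s when all of the simulator's
    randomness comes from the pre-drawn scenario.  Evaluating the same theta
    twice now returns exactly the same number."""
    s, total = s0, 0.0
    for r_t in scenario:
        a = theta.T @ s
        total += reward(s, a)
        s = step_deterministic(s, a, r_t)   # next state = f(state, action, random numbers)
    return total
```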
683 00:57:53,469 --> 00:57:56,739 And as you evaluate different policies, they're 684 00:57:56,739 --> 00:58:00,609 still only approximations to their true expected payoff, but at least now you have 685 00:58:00,609 --> 00:58:01,310 a 686 00:58:01,310 --> 00:58:03,359 deterministic function to optimize, 687 00:58:03,359 --> 00:58:07,109 and you can now apply your favorite algorithms, be it gradient ascent or 688 00:58:08,089 --> 00:58:09,809 some sort 689 00:58:09,809 --> 00:58:12,259 of local [inaudible] search 690 00:58:12,259 --> 00:58:13,609 to try to optimize the black 691 00:58:13,609 --> 00:58:16,329 curve. This gives you a much easier 692 00:58:16,329 --> 00:58:18,390 optimization problem, and you 693 00:58:18,390 --> 00:58:22,789 can optimize this to find hopefully a good policy. So this is 694 00:58:22,789 --> 00:58:26,650 the Pegasus policy search method. 695 00:58:26,650 --> 00:58:27,450 So when 696 00:58:27,450 --> 00:58:31,390 I started to talk about reinforcement learning, I showed 697 00:58:31,390 --> 00:58:34,980 that video of a helicopter flying upside down. That was actually done using 698 00:58:34,980 --> 00:58:40,029 exactly this method, using exactly this policy search algorithm. This 699 00:58:40,029 --> 00:58:41,849 seems to scale well 700 00:58:41,849 --> 00:58:47,099 even to fairly large problems, even to fairly high dimensional state spaces. 701 00:58:47,099 --> 00:58:50,439 Typically, with Pegasus policy search algorithms, 702 00:58:50,439 --> 00:58:52,719 the optimization problem 703 00:58:52,719 --> 00:58:56,440 is still much easier than the stochastic version, but sometimes it's 704 00:58:56,440 --> 00:58:58,279 not entirely trivial, and so 705 00:58:58,279 --> 00:59:00,299 you apply this sort of method with 706 00:59:00,299 --> 00:59:04,549 maybe on the order of ten parameters or tens of parameters, so 30, 40 707 00:59:04,549 --> 00:59:05,750 parameters, but 708 00:59:05,750 --> 00:59:08,169 not thousands of parameters, 709 00:59:08,169 --> 00:59:13,049 at least in these sorts of applications. 710 00:59:13,049 --> 00:59:19,469 So is that method different than just assuming that 711 00:59:19,469 --> 00:59:26,949 you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that 712 00:59:26,949 --> 00:59:28,809 we have a deterministic simulator? 713 00:59:28,809 --> 00:59:31,249 The answer is no. Here's the way you actually do this - 714 00:59:31,249 --> 00:59:34,949 for the sake of simplicity, I talked about one sequence of random numbers. 715 00:59:34,949 --> 00:59:36,890 What you do is - 716 00:59:36,890 --> 00:59:41,969 so imagine that the random numbers are simulating different wind gusts against your helicopter. 717 00:59:41,969 --> 00:59:46,429 So what you want to do isn't really to evaluate just against one pattern of wind gusts. 718 00:59:46,429 --> 00:59:48,000 What you want to do is 719 00:59:48,000 --> 00:59:49,920 sample some set of 720 00:59:49,920 --> 00:59:53,650 different patterns of wind gusts, and evaluate against all of them and average. 721 00:59:53,650 --> 00:59:55,680 So what you do is you actually 722 00:59:55,680 --> 00:59:58,039 sample say 100 - 723 00:59:58,039 --> 01:00:01,359 some number I made up, like 100 sequences of random numbers, 724 01:00:01,359 --> 01:00:03,430 and every time you want to evaluate a policy, 725 01:00:03,430 --> 01:00:08,679 you evaluate it against all 100 sequences of random numbers and then average.
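Continuing the sketch above (same hypothetical `step_deterministic` and `pegasus_payoff`), the 100-scenario version just draws the scenarios once and averages over them; the average is still a deterministic function of theta.

```python
M = 100                                                           # number of scenarios
scenarios = [rng.normal(size=(T, noise_dim)) for _ in range(M)]   # drawn once, then frozen

def pegasus_objective(theta, s0, step_deterministic, reward):
    """Average the deterministic payoff over the M fixed scenarios.
    This is the function the policy search actually optimizes."""
    return float(np.mean([pegasus_payoff(theta, s0, step_deterministic, reward, sc)
                          for sc in scenarios]))
```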
726 01:00:08,679 --> 01:00:13,249 This is in exactly the same way that on this earlier picture 727 01:00:13,249 --> 01:00:16,879 you wouldn't necessarily evaluate the policy just once. You evaluate it maybe 100 728 01:00:16,879 --> 01:00:18,450 times in simulation, and then 729 01:00:18,450 --> 01:00:21,709 average to get a better estimate of the expected reward. 730 01:00:21,709 --> 01:00:23,009 In the same way, 731 01:00:23,009 --> 01:00:24,079 you do that here 732 01:00:24,079 --> 01:00:29,079 but with 100 fixed sequences of random numbers. Does that make sense? Any 733 01:00:29,079 --> 01:00:36,079 other questions? If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]? 734 01:00:47,719 --> 01:00:51,829 Yeah. I guess you're right. So the quality - for a fixed policy, 735 01:00:51,829 --> 01:00:57,099 the quality of the approximation is equally good for both cases. The 736 01:00:57,099 --> 01:01:00,989 advantage of fixing the random numbers is that you end up with 737 01:01:00,989 --> 01:01:03,690 an optimization problem that's much easier. 738 01:01:04,609 --> 01:01:06,100 I have some search problem, 739 01:01:06,100 --> 01:01:10,049 and on the horizontal axis there's a space of control policies, 740 01:01:10,049 --> 01:01:12,559 and my goal is to find a control policy that 741 01:01:12,559 --> 01:01:15,319 maximizes the payoff. 742 01:01:15,319 --> 01:01:18,660 The problem with this earlier setting was that 743 01:01:18,660 --> 01:01:21,949 when I evaluate policies I get these noisy estimates, 744 01:01:21,949 --> 01:01:22,950 and then it's 745 01:01:22,950 --> 01:01:26,690 just very hard to optimize the red curve if I have these points that are all over the 746 01:01:26,690 --> 01:01:29,420 place. And if I evaluate the same policy twice, I don't even get back the same 747 01:01:30,719 --> 01:01:32,959 answer. By fixing the random numbers, 748 01:01:32,959 --> 01:01:36,669 the algorithm still doesn't get to see the red curve, but at least it's 749 01:01:36,669 --> 01:01:40,609 now optimizing a deterministic function. That makes the 750 01:01:40,609 --> 01:01:42,349 optimization problem much easier. Does 751 01:01:42,349 --> 01:01:49,349 that make sense? Student:So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves 752 01:01:50,539 --> 01:01:58,009 that are easy to optimize. And then you smush them together? 753 01:01:58,009 --> 01:01:59,309 Let's see. 754 01:01:59,309 --> 01:02:04,559 I have just one nice black curve that I'm trying to optimize. For each scenario. 755 01:02:04,559 --> 01:02:08,890 I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios. 756 01:02:08,890 --> 01:02:12,339 So the black curve here is defined by averaging over 757 01:02:12,339 --> 01:02:16,249 a large set of scenarios. Does that make sense? 758 01:02:16,249 --> 01:02:20,099 So instead of only one - if the averaging thing doesn't make sense, imagine that there's 759 01:02:20,099 --> 01:02:24,049 just one sequence of random numbers. That might be easier to think 760 01:02:24,049 --> 01:02:27,419 about. Fix one sequence of random numbers, and every time I evaluate another policy, 761 01:02:27,419 --> 01:02:31,039 I evaluate against the same sequence of random numbers, and that gives me 762 01:02:31,039 --> 01:02:36,489 a nice deterministic function to optimize. 
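Since the scenario-averaged objective above is now an ordinary deterministic function of theta, any off-the-shelf optimizer can be pointed at it. As one hedged possibility (not the specific optimizer used in the lecture), here is a crude finite-difference gradient ascent; the step size, perturbation, and iteration count are made-up illustrative values.

```python
import numpy as np

def finite_difference_ascent(objective, theta0, step=1e-2, eps=1e-3, iters=200):
    """Climb a deterministic objective using a numerically estimated gradient.
    Any local search method (hill climbing, Nelder-Mead, etc.) would do as well."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = eps
            # central difference estimate of d(objective)/d(theta_i)
            grad.flat[i] = (objective(theta + e) - objective(theta - e)) / (2 * eps)
        theta = theta + step * grad
    return theta

# Usage, with the hypothetical pieces defined above:
# theta_star = finite_difference_ascent(
#     lambda th: pegasus_objective(th, s0, step_deterministic, reward), theta_init)
```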
763 01:02:36,489 --> 01:02:39,449 Any other questions? 764 01:02:39,449 --> 01:02:41,269 The advantage is really 765 01:02:41,269 --> 01:02:45,459 that - one way to think about it is that when I evaluate the same policy twice, 766 01:02:45,459 --> 01:02:50,449 at least I get back the same answer. This gives me a deterministic function 767 01:02:50,449 --> 01:02:53,890 mapping from the parameters of my policy 768 01:02:53,890 --> 01:02:56,239 to my estimate of the expected payoff. 769 01:02:57,789 --> 01:03:13,229 That's just a function that I can try to optimize using a search algorithm. 770 01:03:13,229 --> 01:03:16,799 So we used this algorithm for inverted hovering, 771 01:03:16,799 --> 01:03:20,329 and again, policy search algorithms tend to work well when you can find 772 01:03:20,329 --> 01:03:23,739 a reasonably simple policy mapping from 773 01:03:23,739 --> 01:03:28,640 the states to the actions. This is especially true for low-level control tasks, 774 01:03:28,640 --> 01:03:32,699 which I think of as sort of reflexes, almost. 775 01:03:32,699 --> 01:03:33,799 Conversely, 776 01:03:33,799 --> 01:03:37,269 if you want to solve a problem like Tetris, where you might plan 777 01:03:37,269 --> 01:03:40,499 ahead a few steps about what's a nice configuration of the board, or 778 01:03:40,499 --> 01:03:42,289 something like a game of chess, 779 01:03:42,289 --> 01:03:46,719 or problems with long path plans of go here, then go there, then go there, 780 01:03:46,719 --> 01:03:51,069 then sometimes you might apply a value function method instead. 781 01:03:51,069 --> 01:03:55,059 But for tasks like helicopter flight, for 782 01:03:55,059 --> 01:03:58,569 low-level control tasks, for the reflexes of driving or controlling 783 01:04:00,749 --> 01:04:05,699 various robots, policy search algorithms - 784 01:04:05,699 --> 01:04:10,629 where we can sometimes more easily approximate the policy directly - work very well. 785 01:04:11,839 --> 01:04:15,599 A few comments on the state of RL today. RL algorithms are applied 786 01:04:15,599 --> 01:04:18,499 to a wide range of problems, 787 01:04:18,499 --> 01:04:22,239 and the key is really sequential decision making. The place where you 788 01:04:22,239 --> 01:04:25,569 think about applying a reinforcement learning algorithm is when you need to make a decision, 789 01:04:25,569 --> 01:04:28,209 then another decision, then another decision, 790 01:04:28,209 --> 01:04:31,889 and some of your actions may have long-term consequences. I think that 791 01:04:31,889 --> 01:04:37,910 is the heart of RL: sequential decision making, where you make multiple decisions, 792 01:04:37,910 --> 01:04:41,599 and some of your actions may have long-term consequences. I've shown you 793 01:04:41,599 --> 01:04:44,170 a bunch of robotics examples. RL 794 01:04:44,170 --> 01:04:45,569 is also 795 01:04:45,569 --> 01:04:48,759 applied to things like medical decision making, where you may have a patient and you 796 01:04:48,759 --> 01:04:51,879 want to choose a sequence of treatments; you do this now for the patient, and then the 797 01:04:51,879 --> 01:04:56,449 patient may be in some other state, and you choose what to do later, and so on. 798 01:04:56,449 --> 01:05:00,839 It turns out there's a large community of people applying these sorts of tools to queues.
So 799 01:05:00,839 --> 01:05:04,539 imagine you have a bank where you have people lining up, and 800 01:05:04,539 --> 01:05:06,009 after they go to one cashier, 801 01:05:06,009 --> 01:05:09,740 some of them have to go to the manager to deal with something else. You have 802 01:05:09,740 --> 01:05:13,410 a system of multiple people standing in line in multiple queues, and so how do you route 803 01:05:13,410 --> 01:05:17,839 people optimally to minimize the waiting 804 01:05:17,839 --> 01:05:20,239 time? And not just people, but 805 01:05:20,239 --> 01:05:23,969 objects in an assembly line and so on. It turns out there's a surprisingly large community 806 01:05:23,969 --> 01:05:26,349 working on optimizing queues. 807 01:05:26,349 --> 01:05:29,719 I mentioned game playing a little bit already. Things 808 01:05:29,719 --> 01:05:33,289 like financial decision making: if you have a large amount of stock, 809 01:05:33,289 --> 01:05:37,569 how do you sell off a large amount - how do you time the selling off of your stock 810 01:05:37,569 --> 01:05:41,329 so as to not affect market prices adversely too much? There 811 01:05:41,329 --> 01:05:44,949 are many operations research problems, things like factory automation. Can you design 812 01:05:44,949 --> 01:05:46,369 a factory 813 01:05:46,369 --> 01:05:50,140 to optimize throughput, or minimize cost, or whatever? These are all 814 01:05:50,140 --> 01:05:54,049 sorts of problems that people are applying reinforcement learning algorithms to. 815 01:05:54,049 --> 01:05:57,529 Let me just close with a few robotics examples because they're always fun, and we just 816 01:05:57,529 --> 01:05:59,190 have these videos. 817 01:06:00,189 --> 01:06:07,259 This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students 818 01:06:07,259 --> 01:06:13,760 here. They were working on getting a robot dog to climb over difficult, rugged 819 01:06:13,760 --> 01:06:16,959 terrain. Using a reinforcement learning algorithm, 820 01:06:16,959 --> 01:06:19,979 they applied an approach that's 821 01:06:19,979 --> 01:06:24,149 similar to a value function approximation approach - not quite, but similar. 822 01:06:24,149 --> 01:06:28,849 They allowed the robot dog to sort of plan ahead multiple steps, and 823 01:06:28,849 --> 01:06:33,999 carefully choose its footsteps and traverse rugged terrain. 824 01:06:33,999 --> 01:06:43,339 This is really state of the art in terms of what you can get a robotic dog to do. 825 01:06:43,339 --> 01:06:45,569 Here's another fun one. 826 01:06:45,569 --> 01:06:48,009 It turns out that 827 01:06:48,009 --> 01:06:51,259 wheeled robots are very fuel-efficient. Cars and trucks are almost the 828 01:06:51,259 --> 01:06:55,250 most fuel-efficient robots in the world. 829 01:06:55,250 --> 01:06:57,260 Cars and trucks are very fuel-efficient, 830 01:06:57,260 --> 01:07:01,309 but legged robots have the ability to traverse more rugged terrain. 831 01:07:01,309 --> 01:07:04,599 So this is a robot - this is actually a small-scale mockup of a larger 832 01:07:04,599 --> 01:07:06,519 vehicle built by Lockheed Martin - 833 01:07:06,519 --> 01:07:09,699 and the question is, can you come up with a vehicle that 834 01:07:09,699 --> 01:07:13,689 has wheels, and so has the fuel efficiency of wheeled robots, but also 835 01:07:13,689 --> 01:07:19,099 has legs so it can traverse obstacles?
836 01:07:19,099 --> 01:07:27,349 Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles, with 837 01:07:27,349 --> 01:07:31,319 somewhat complex gaits that would be very hard to design by hand. 838 01:07:31,319 --> 01:07:33,279 But by choosing a reward function - telling the robot 839 01:07:33,279 --> 01:07:36,789 there's a plus one reward at the top of the goal, 840 01:07:36,789 --> 01:07:45,469 and a few other things - it learns these sorts of policies automatically. 841 01:07:46,519 --> 01:07:55,229 Last couple of fun ones - I'll show you a couple of last helicopter videos. 842 01:07:57,029 --> 01:08:08,700 This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, 843 01:08:08,769 --> 01:08:33,319 who have been working on autonomous flight. I'll just let you watch. I'll just 844 01:09:49,339 --> 01:09:53,900 show you one more. [Inaudible] do this with a real helicopter [inaudible]? 845 01:09:53,900 --> 01:09:55,199 Not a full-size helicopter. 846 01:09:55,199 --> 01:10:00,860 Only small radio control helicopters. [Inaudible]. Full-size helicopters 847 01:10:00,860 --> 01:10:05,780 are the wrong design for this. You shouldn't do this on a full-size helicopter. 848 01:10:05,780 --> 01:10:12,619 Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more. 849 01:10:12,619 --> 01:10:18,919 Are there any human [inaudible]? Yes, 850 01:10:18,919 --> 01:10:25,919 there are very good human pilots that can. 851 01:10:28,420 --> 01:10:35,420 This is just one more 852 01:10:43,739 --> 01:10:45,510 maneuver. That was kind of fun. So this is the work of 853 01:10:45,510 --> 01:10:55,530 Pieter Abbeel and Adam Coates. So that was it. That was all the technical material 854 01:10:55,530 --> 01:11:02,530 I wanted to present in this class. I guess 855 01:11:03,109 --> 01:11:10,199 you're all experts on machine learning now. Congratulations. 856 01:11:10,199 --> 01:11:14,209 And I hope you've gotten the sense through this class that this is one of the technologies 857 01:11:14,209 --> 01:11:18,969 that's really having a huge impact on science, engineering, and industry. 858 01:11:18,969 --> 01:11:23,130 As I said in the first lecture, I think many people use machine 859 01:11:23,130 --> 01:11:30,130 learning algorithms dozens of times a day without even knowing about it. 860 01:11:30,389 --> 01:11:34,110 Based on the projects you've done, I hope that 861 01:11:34,110 --> 01:11:37,899 most of you will be able to imagine yourselves 862 01:11:37,899 --> 01:11:43,280 going out after this class and applying these things to solve a variety of problems. 863 01:11:43,280 --> 01:11:45,980 Hopefully, some of you will also imagine yourselves 864 01:11:45,980 --> 01:11:48,989 writing research papers after this class, be it on 865 01:11:48,989 --> 01:11:52,440 a novel way to do machine learning, or on some way of applying machine 866 01:11:52,440 --> 01:11:55,289 learning to a problem that you care 867 01:11:55,289 --> 01:11:58,749 about. In fact, looking at the project milestones, I'm actually sure that a large fraction of 868 01:11:58,749 --> 01:12:04,300 the projects in this class will be publishable, so I think that's great. 869 01:12:06,679 --> 01:12:10,750 I guess many of you will also go into industry, make new products, and make lots 870 01:12:10,750 --> 01:12:12,659 of money using learning 871 01:12:12,659 --> 01:12:15,580 algorithms. Remember me 872 01:12:15,580 --> 01:12:19,530 if that happens.
One of the things I'm excited about is that, through research or 873 01:12:19,530 --> 01:12:22,570 industry, I'm actually completely sure that the people in this class, in the 874 01:12:22,570 --> 01:12:23,849 next few months, 875 01:12:23,849 --> 01:12:27,320 will apply machine learning algorithms to lots of problems in 876 01:12:27,320 --> 01:12:31,770 industrial management and computer science, things like optimizing computer architectures, 877 01:12:31,770 --> 01:12:33,340 network security, 878 01:12:33,340 --> 01:12:38,480 robotics, computer vision, to problems in computational biology, 879 01:12:39,140 --> 01:12:42,429 to problems in aerospace, or understanding natural language, 880 01:12:42,429 --> 01:12:44,309 and many more things like that. 881 01:12:44,309 --> 01:12:46,030 So 882 01:12:46,030 --> 01:12:49,920 right now I have no idea what all of you are going to do with the learning algorithms you learned about, 883 01:12:49,920 --> 01:12:51,540 but 884 01:12:51,540 --> 01:12:54,809 every time I wrap up this class, I always 885 01:12:54,809 --> 01:12:58,209 feel very excited, and optimistic, and hopeful about the sorts of amazing 886 01:12:58,209 --> 01:13:03,129 things you'll be able to do. 887 01:13:03,129 --> 01:13:10,489 One final thing, I'll just give out this handout. 888 01:13:30,749 --> 01:13:35,170 One final thing is that machine learning has grown out of a larger literature on AI, 889 01:13:35,170 --> 01:13:36,419 with 890 01:13:36,419 --> 01:13:41,829 its desire to build systems that exhibit intelligent behavior. Machine learning 891 01:13:41,829 --> 01:13:44,889 is one of the tools of AI, maybe one that's had a 892 01:13:44,889 --> 01:13:46,510 disproportionately large impact, 893 01:13:46,510 --> 01:13:51,519 but there are many other ideas in AI that I hope you 894 01:13:51,519 --> 01:13:55,289 go and continue to learn about. Fortunately, Stanford 895 01:13:55,289 --> 01:13:59,279 has one of the best and broadest sets of AI classes, and I hope that 896 01:13:59,279 --> 01:14:03,690 you take advantage of some of these classes, and go and learn more about AI, and more 897 01:14:03,690 --> 01:14:07,690 about other fields which often apply learning algorithms to problems in vision, 898 01:14:07,690 --> 01:14:10,480 problems in natural language processing, in robotics, and so on. 899 01:14:10,480 --> 01:14:14,960 So the handout I just gave out has a list of AI-related courses. Just 900 01:14:14,960 --> 01:14:18,489 running down very quickly, I guess, CS221 is an overview course that I 901 01:14:18,489 --> 01:14:19,340 teach. There 902 01:14:19,340 --> 01:14:23,699 are a lot of robotics classes also: 223A, 903 01:14:23,699 --> 01:14:25,139 225A, 225B 904 01:14:25,139 --> 01:14:28,760 - many robotics classes. There are so many applications 905 01:14:28,760 --> 01:14:31,999 of learning algorithms to robotics today. 222 906 01:14:31,999 --> 01:14:36,369 and 227 are knowledge representation and reasoning classes. 907 01:14:36,369 --> 01:14:39,919 228 - of all the classes on this list, 228, which Daphne 908 01:14:39,919 --> 01:14:43,479 Koller teaches, is probably closest in spirit to 229. This is one of 909 01:14:43,479 --> 01:14:45,300 the classes I highly recommend 910 01:14:45,300 --> 01:14:48,600 to all of my PhD students as well. 911 01:14:48,600 --> 01:14:53,300 Many other classes also touch on machine learning.
On the next page, 912 01:14:54,040 --> 01:14:57,760 there are courses on computer vision, speech recognition, natural language processing - 913 01:14:57,760 --> 01:15:03,199 various tools that aren't just machine learning, but often involve machine learning in many ways. 914 01:15:03,199 --> 01:15:07,339 Other aspects of AI, multi-agent systems taught by [inaudible]. 915 01:15:09,630 --> 01:15:13,139 EE364A is convex optimization. It's a class taught by 916 01:15:13,139 --> 01:15:16,630 Steve Boyd, and convex optimization came up many times in this class. If you 917 01:15:16,630 --> 01:15:20,909 want to become really good at it, EE364 is a great class. If you're 918 01:15:20,909 --> 01:15:24,579 interested in project courses, I also teach a project class 919 01:15:24,579 --> 01:15:27,179 next quarter where we spend the whole quarter working 920 01:15:27,179 --> 01:15:31,510 on research projects. 921 01:15:31,510 --> 01:15:37,589 So I hope you go and take some more of those classes. 922 01:15:37,589 --> 01:15:40,379 In closing, 923 01:15:40,379 --> 01:15:42,849 let me just say 924 01:15:42,849 --> 01:15:46,880 this class has been really fun to teach, and it's very satisfying to 925 01:15:46,880 --> 01:15:52,010 me personally when we set these insanely difficult homeworks, and 926 01:15:52,010 --> 01:15:54,649 then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's 927 01:15:54,649 --> 01:15:58,570 actually very satisfying when I see that. 928 01:15:58,570 --> 01:16:04,739 Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable." 929 01:16:04,739 --> 01:16:05,469 So 930 01:16:05,469 --> 01:16:09,039 I hope you take what you've learned, go forth, and 931 01:16:09,039 --> 01:16:11,570 do amazing things with learning algorithms. 932 01:16:11,570 --> 01:16:16,220 I know this is a heavy-workload class, so thank you all very much for the hard 933 01:16:16,220 --> 01:16:20,280 work you've put into this class, and the hard work you've put into learning this material, 934 01:16:20,280 --> 01:16:21,699 and 935 01:16:21,699 --> 01:16:23,389 thank you very much for having been students in this class.