This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that may be some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t = L_t s_t, where L_t was the matrix of parameters that came out of the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we would choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate for what the state is given all the observations. In particular, I talked about the Kalman filter, with which we worked out our posterior distribution of what the state is given all the observations up to a certain time. So this is from last time: given the observations y_1 through y_t, our posterior distribution of the current state s_t is Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this s_{t|t}, which is going to be our best guess for what the state currently is. And then we choose actions using our estimate of the state, rather than using the true state, because we don't know the true state anymore in this POMDP.
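As a minimal sketch of this "estimate the state, then act as if the estimate were the truth" strategy, here is one Kalman filter step followed by the control step. The system matrices A, B, C, the noise covariances Q and R_obs, and the precomputed LQR gain L_t are all assumed to be given; they are hypothetical placeholders rather than anything worked out on the board.

```python
import numpy as np

def kalman_step(s_hat, Sigma, y, a_prev, A, B, C, Q, R_obs):
    """One Kalman filter update: predict with the dynamics, then correct with
    the new observation y. Returns the new posterior mean and covariance."""
    # Predict: propagate the previous posterior through the (noisy) dynamics.
    s_pred = A @ s_hat + B @ a_prev
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct: fold in the observation y = C s + noise.
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R_obs)
    s_new = s_pred + K @ (y - C @ s_pred)
    Sigma_new = (np.eye(len(s_hat)) - K @ C) @ Sigma_pred
    return s_new, Sigma_new

def act(L_t, s_hat):
    """Certainty-equivalent control: apply the LQR gain computed for the
    fully observed problem to the state estimate s_{t|t}."""
    return L_t @ s_hat
```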
So it turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism, a POMDP is formally a tuple (S, A, Y, {P_sa}, {O_s}, T, R), where the changes relative to an MDP are that Y is the set of possible observations and the O_s are the observation distributions. At each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O_{s_t}, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP. In the more general case, that strategy of designing a controller assuming full observability and then just estimating the state and plugging the two together is often a very reasonable strategy for general POMDPs, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is talk about a different class of reinforcement learning algorithms, called policy search algorithms. Policy search algorithms can be applied equally well to MDPs - to fully observable Markov decision processes - or to these partially observable MDPs. What I'll do now is describe policy search algorithms applied to MDPs, to the fully observable case, and at the end I'll briefly describe how you can take policy search algorithms and apply them to POMDPs.
In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard. But nonetheless, policy search turns out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: the way we define this set capital Pi of policies and then search for a policy in it is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in that class script H.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've talked about so far, we would try to find V*, the optimal value function, and then use V* to try to compute or approximate pi*. So all the approaches we talked about previously are strategies for finding a good policy where we first compute the value function and then go from that to the policy. In contrast, in policy search algorithms - in what are called direct policy search algorithms - the idea is that we're going to, quote, "directly" try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

Also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and labels y, and I said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there will be a step where I say, "Well, let's try to approximate what a good action is using a logistic function of the state."
So again, I'll sort of pull a function out of the air. I'll say, "Let's just choose a function, and that'll be our choice of the policy class," and I'll say, "Let's take as input the state, map it through a logistic function, and then hopefully we'll approximate what a good action is using a logistic function of the state." So the choice of policy class is again a little bit arbitrary, but it's arbitrary in the same way it was when we were talking about supervised learning.

So to develop our first policy search algorithm, I'm actually going to need a new definition. For our first policy search algorithm, we'll need to work with stochastic policies. What I mean by a stochastic policy is a function pi mapping from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so for every state, the sum over a of pi(s, a) equals one - in other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose I have some stochastic policy pi and I want to execute it. In this example, let's say I have three actions. What that means is that if I'm in some state s, I would then compute pi(s, a1), pi(s, a2), and pi(s, a3). These will be three numbers that sum to one, and then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.

As a concrete example, let me go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole.
I have two actions, where - actually, let me just write it this way around - a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever. And now let's come up with a stochastic policy for this problem.

Coming up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1 is pi_theta(s, a1) = 1 / (1 + e^(-theta^T s)). I just chose the logistic function because it's a convenient function we've used a lot. So my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. If I'm executing that policy with parameters theta, the chance of my choosing the accelerate-right action is given by this number. And because my chances of executing actions a1 and a2 must sum to one, this also gives me pi_theta(s, a2) = 1 - 1 / (1 + e^(-theta^T s)). So when I'm in some state s, I'm going to compute this number, one over one plus e to the minus theta transpose s, and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do: let's say my state vector consists of the cart position, the cart velocity, the pole angle phi, and the angular velocity, and I've added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose my parameters theta so that the only non-zero coefficient is the one multiplying phi, then at any state, the probability of my taking action a1 - the probability of my taking the accelerate-right action - is one over one plus e to the minus theta transpose s, and since the inner product of theta and s just gives phi, that equals one over one plus e to the minus phi. And so with that choice of parameters theta, the chance of my accelerating to the right is just this function of the angle phi of my inverted pendulum.
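Here is a minimal sketch of this policy class, assuming a state layout of [1 (intercept), cart position, cart velocity, pole angle phi, angular velocity]; the exact ordering of features is an assumption, not something specified in the lecture.

```python
import numpy as np

def pi_theta(theta, s):
    """Probability of the 'accelerate right' action a1 under the logistic
    policy pi_theta(s, a1) = 1 / (1 + exp(-theta^T s))."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def sample_action(theta, s, rng=np.random.default_rng()):
    """Execute the stochastic policy: accelerate right with probability
    pi_theta(s, a1), otherwise accelerate left."""
    return "right" if rng.random() < pi_theta(theta, s) else "left"

# Hypothetical feature layout: s = [1 (intercept), x, x_dot, phi, phi_dot].
# Choosing theta so that only the coefficient on phi is non-zero makes
# P(accelerate right) = 1 / (1 + exp(-phi)), exactly the example above.
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
s = np.array([1.0, 0.2, 0.0, 0.3, -0.1])   # pole leaning right (phi = 0.3 rad)
print(pi_theta(theta, s), sample_action(theta, s))
```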
And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum make sense: if my pole's leaning over to the right, then I want to accelerate to the right to catch it, and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of the coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take these other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back, as indicated by the angular velocity, then you might be less worried about having to accelerate hard to the right. So these are the sorts of behavior you can get by varying the parameters theta.

And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi_theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff R(s_0, a_0) + ... + R(s_T, a_T) when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything? Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? Instructor: Right, so what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use something like a softmax parameterization, similar to the softmax regression that we saw earlier in the class. You may say the following - I'm running out of space here.
You may have a set of parameters theta_1 through theta_D if you have D actions, and set pi(s, a_i) = e^(theta_i^T s) / sum_j e^(theta_j^T s). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.

Anything else? Student: [Inaudible] stochastic? Instructor: I see - so the question is, given that the MDP has stochastic transition probabilities, is it possible to use deterministic policies and rely on the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. Formally, though, it is possible: if you have a fixed policy, then you'd be able to do that. Anything else? Student: [Inaudible]? Instructor: No, I guess even a deterministic class of policies can do that, but for the derivation later, I actually need to keep it separate. Actually - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay. Thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it sort of captures its essence. Here's the idea. In what I'm about to do, I'm going to assume that s_0 is some fixed initial state - or, it turns out, if s_0 is drawn from some fixed initial-state distribution, then everything I'm about to say still goes through, but let's just say s_0 is some fixed initial state. So my goal is to maximize the expected sum of rewards, E[R(s_0, a_0) + ... + R(s_T, a_T)], given that actions are chosen according to the policy pi_theta; to keep the notation simple, I'll drop the explicit conditioning on the policy from here on. The random variables in this expectation are the sequence of states and actions s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.
So let me write out this expectation explicitly, as a sum over all possible state-action sequences of the probability of the sequence times the sum of rewards - that's just what an expectation is: the probability of the random variables times the quantity. Let me expand out this probability. The probability of seeing one exact sequence of states and actions is the probability of the MDP starting in that state s_0 (if this is a deterministic initial state, then all the probability mass is on one state; otherwise, there's some distribution over initial states), times the probability pi_theta(s_0, a_0) that you chose action a_0 in that state, times the probability P_{s_0 a_0}(s_1) that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability pi_theta(s_1, a_1) that you chose that next action, and so on, with the last term being pi_theta(s_T, a_T). So what I did was just take the probability of seeing this sequence of states and actions and expand it out explicitly like this.

It turns out that later on I'm going to need to write this sum of rewards a lot, so I'm just going to call R(s_0, a_0) + ... + R(s_T, a_T) the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. Our goal is to maximize the expected payoff.

Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm; this is how we're going to update the parameters. Repeat: sample a state-action sequence s_0, a_0, ..., s_T, a_T; compute the payoff R(s_0, a_0) + ... + R(s_T, a_T); and update theta := theta + alpha [ (grad_theta pi_theta(s_0, a_0)) / pi_theta(s_0, a_0) + ... + (grad_theta pi_theta(s_T, a_T)) / pi_theta(s_T, a_T) ] times the payoff. The way you sample the state-action sequence is you just take your current stochastic policy and execute it in the MDP: start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probability takes you, and do that for T time steps. Then you compute the payoff, and then you perform this update.

So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every iteration it updates theta according to this thing on the right-hand side.
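As a rough sketch of one iteration of this update for the two-action logistic policy from the inverted pendulum example: the environment interface (reset/step) is a hypothetical stand-in, not something defined in the lecture, and the ratio grad pi / pi is computed as grad log pi, which for the logistic policy has a simple closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_update(theta, env, alpha, T, rng=np.random.default_rng()):
    """One iteration of the (essentially) REINFORCE update described above,
    for the two-action logistic policy pi_theta(s, a1) = sigmoid(theta^T s).
    `env` is a hypothetical simulator with reset() -> s0 and
    step(a) -> (s_next, reward); it is not anything defined in the lecture."""
    s = env.reset()
    grad_sum = np.zeros_like(theta)   # sum over t of grad pi / pi = grad log pi
    payoff = 0.0
    for t in range(T):
        p = sigmoid(theta @ s)                       # P(a = a1 | s)
        a = 1 if rng.random() < p else 0
        # grad of log pi_theta(s, a): (1 - p) s if a1 was taken, else -p s
        grad_sum += ((1 - p) if a == 1 else -p) * s
        s, r = env.step(a)
        payoff += r
    # Stochastic gradient ascent step: on expectation this moves theta in the
    # direction of the gradient of the expected payoff.
    return theta + alpha * grad_sum * payoff
```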
This thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw, and your state-action sequence is random. So on every iteration, I'll take a step that's chosen somewhat randomly, because it depends on this random state-action sequence. What I want to do is figure out how, on average, this changes the parameters theta. In particular, I want to know what the expected value of the change to the parameters theta is. Our goal is to maximize the expected payoff, and so as long as the updates are, on expectation - on average - taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember when I talked about stochastic gradient descent for least-squares regression: you have some parameters, maybe you're trying to minimize a quadratic function, and the parameters wander around somewhat randomly until they get close to the optimum of the quadratic surface. It turns out that the REINFORCE algorithm is very much like that. It will be a stochastic gradient ascent algorithm in which the step we take on each iteration is a little bit random - it's determined by the random state-action sequence - but on expectation it is essentially a gradient ascent algorithm. So it will wander around randomly, but on average take you towards the optimum. Let me go ahead and prove that now.

What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then show that on expectation, the update above is exactly that rule. Before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is given by differentiating these functions one at a time. So first I differentiate f, getting f'(theta) g(theta) h(theta), leaving the other two fixed; then I differentiate g, leaving the other two fixed;
then I differentiate h, getting f(theta) g(theta) h'(theta). So that's the product rule for derivatives: d/dtheta [f(theta) g(theta) h(theta)] = f'(theta) g(theta) h(theta) + f(theta) g'(theta) h(theta) + f(theta) g(theta) h'(theta).

Now refer back to the equation where we wrote out the expected payoff as this sum, over all state-action sequences, of the probability of the sequence times the payoff. What I'm going to do is take the derivative of that expression with respect to the parameters theta, because I want to do gradient ascent on this function - take the derivative with respect to theta and then try to go uphill on it. When I take the derivative with respect to theta using the product rule, I end up with a sum of terms: there are a lot of factors in that probability that depend on theta, and so I end up with one term corresponding to the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the second pi_theta factor keeping everything else fixed, and so on, down to one term from the derivative of the last pi_theta factor keeping everything else fixed. So we just apply the product rule to this. Let's write that down.

So the derivative with respect to theta of the expected value of the payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how it goes - the sum over the state-action sequences of the sum of those derivative terms, times the payoff: the term with the derivative of the first factor pi_theta(s_0, a_0), then the second one, then plus dot dot dot for each of the remaining pi_theta terms, through the last one. So that huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta one at a time. That was step one of four. And so by algebra - let me just write this down and convince us all that it's true.
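The board work itself isn't captured in the transcript; here is one way the four steps might be written out in full, using the notation above (payoff = R(s_0, a_0) + ... + R(s_T, a_T)). The remarks that follow walk through the second, third, and fourth lines.

```latex
\begin{align*}
\nabla_\theta \,\mathrm{E}[\text{payoff}]
 &= \sum_{s_0,a_0,\ldots,s_T,a_T}
    \Big[ P(s_0)\big(\nabla_\theta \pi_\theta(s_0,a_0)\big) P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots \pi_\theta(s_T,a_T) \\
 &\qquad\quad{} + P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\big(\nabla_\theta \pi_\theta(s_1,a_1)\big)\cdots \pi_\theta(s_T,a_T)
    + \cdots \Big]\times\text{payoff} \\[4pt]
 &= \sum_{s_0,a_0,\ldots,s_T,a_T}
    P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\cdots \pi_\theta(s_T,a_T)
    \left[\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right]\times\text{payoff} \\[4pt]
 &= \sum_{s_0,a_0,\ldots,s_T,a_T}
    P(s_0,a_0,\ldots,s_T,a_T)
    \left[\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right]\times\text{payoff} \\[4pt]
 &= \mathrm{E}\!\left[\left(\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)
    \big(R(s_0,a_0)+\cdots+R(s_T,a_T)\big)\right].
\end{align*}
```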
This is the second of four steps, in which we just convince ourselves that if I take that sum of ratio terms and multiply it by the big product in front, then I get back the sum of terms I had before. For example, when I multiply the product out against the first fraction, the pi_theta(s_0, a_0) in the product cancels the pi_theta(s_0, a_0) in the denominator and replaces it with the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But the product out front is just what I worked out previously - it's the joint probability of the state-action sequence - so that's the third step, and now I have the probability of the sequence, times the sum of ratios, times the payoff. And so by the definition of expectation, this is just equal to the expected value of that sum of ratios times the payoff - that's the fourth step. And this thing inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm - roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff. That's how I started this whole derivation: I said, let's look at our expected payoff and take its derivative with respect to theta. What we've proved is that on expectation, the step direction taken by REINFORCE is exactly the gradient of the thing we're trying to optimize. This shows that the algorithm is a stochastic gradient ascent algorithm.

I wrote a lot. Why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then check whether you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some further comments. We've talked about value function approximation approaches, where you approximate V* and then go from V* to pi*. And then there are these policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi.
So the question really is, for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well? For the inverted pendulum problem, this is quite likely to be true: going through the different choices of parameters, you can say things like, if the pole's leaning towards the right, then accelerate towards the right to try to catch it. So for the inverted pendulum, this is probably true. The same goes for lots of what are called low-level control tasks - things like driving a car, the low-level reflexes of do you steer your car left to avoid another car, do you steer left to follow the road, or flying a helicopter - again, very short-time-scale types of decisions. I like to think of these as the decisions of a trained operator, like a trained driver or a trained pilot: they're almost reflexes, very quick instinctive things where you map very directly from the inputs to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.

In contrast, if you have problems that require long, multi-step reasoning - things like a game of chess, where you have to reason carefully about if I do this, then they'll do that, then they'll do this, then they'll do that - those I think of as less instinctual, as very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. Let me say more about this later.

Just as a side comment, it turns out also that if you have a POMDP, a partially observable MDP - I don't want to say too much about this - and you only have an approximation, call it s-hat, of the true state (so this could be s-hat equal to s_{t|t} from a Kalman filter), then you can still use these sorts of policy search algorithms, with a policy of the form pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that looks only at your estimate of the state to choose the action.
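A minimal sketch of that idea, reusing the two-action logistic policy from earlier; the kalman_estimate function is a hypothetical stand-in for whatever filter computes s-hat from the observations, not anything defined in the lecture.

```python
import numpy as np

def pi_theta_on_estimate(theta, s_hat):
    """Stochastic policy applied to the state *estimate* rather than the true
    (unobserved) state: pi_theta(s_hat, a1) = 1 / (1 + exp(-theta^T s_hat))."""
    return 1.0 / (1.0 + np.exp(-theta @ s_hat))

def choose_action(theta, obs_history, kalman_estimate,
                  rng=np.random.default_rng()):
    """`kalman_estimate` is a hypothetical stand-in for whatever filter you use
    to compute s_hat = s_{t|t} from the observations y_1..y_t. The key point is
    that the same estimator is used at training time and at test time."""
    s_hat = kalman_estimate(obs_history)
    return 1 if rng.random() < pi_theta_on_estimate(theta, s_hat) else 0
```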
By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow. So it works, but one thing to watch out for is that because you're taking these gradient ascent steps that are very noisy - you're sampling a state-action sequence and then taking a gradient ascent step in an essentially random direction that is only correct on expectation - the gradient ascent direction for REINFORCE can sometimes be quite noisy, and so it's not uncommon to need a million iterations of gradient ascent, or ten million, or a hundred million iterations, for REINFORCE to work. That's just something to watch out for.

One consequence of that is that in the REINFORCE algorithm - I shouldn't really call it REINFORCE; in what's essentially the REINFORCE algorithm - there's this step where you need to sample a state-action sequence. In principle you could do this on your own robot: if there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to run your robot ten million times. I personally have seen many more applications of REINFORCE in simulation: you can easily run ten thousand simulations, or ten million simulations, of your robot, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE used to learn with a simulator than run directly on a physical device.

The last thing I wanted to do is tell you about one other, final policy search algorithm. Could I switch to the laptop display, please? It's a policy search algorithm called Pegasus, which is actually what we used on our autonomous helicopter flight work for many years - there are some other things we do now.
So here's the idea. This is a reminder slide on the RL formalism. There's nothing here that you don't know, but I just want to describe the RL formalism pictorially, because I'll use it later. I'm going to draw the reinforcement learning picture as follows: you initialize the system - say a helicopter, or whatever - in some state s_0, you choose an action a_0, then the helicopter dynamics takes you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you see, and that's your total payoff. So this is just a picture I want to use to summarize the RL problem.

Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by a green box. I'll switch back to deterministic policies for now, so my deterministic policy will be some function mapping from the states to the actions. As a concrete example, you can imagine that in the policy search setting you have a linear class of policies: the action a will be, say, a linear function theta transpose s of the state, and your goal is to learn the parameters theta of the linear function. So imagine trying to do something like linear regression on policies, except you're trying to optimize the reinforcement learning objective. Just imagine that the action a is theta transpose s, and you do policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies.

Here's the idea. Quite often we come up with a model or a simulator for the MDP, and as before, a model or a simulator is just a box that takes as input some state s_t and some action a_t, and outputs the next state s_{t+1} that you might get to in the MDP. This s_{t+1} will be a random state, drawn from the state transition probabilities of the MDP. This is important - very important: s_{t+1} will be a random function of s_t and a_t, and the same is true in the simulator.
So, for example, for autonomous helicopter flight, you might build a simulator using supervised learning - an algorithm like linear regression, or something similar to linear regression - so that you get a nonlinear model of the dynamics, of what s_{t+1} is as a random function of s_t and a_t.

Now, once you have a simulator, then given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find a policy pi mapping from states to actions - the goal is to find the green box - that works well. If you have any one fixed policy pi, you can evaluate it just using the simulator, via the picture shown at the bottom of the slide: you take your initial state s_0, feed it into the policy pi, your policy pi outputs some action a_0, you plug that into the simulator, the simulator outputs a random state s_1, you feed s_1 into the policy, and so on, and you get a sequence of states s_0 through s_T that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation and see how well it does - measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once. In general, you would run the policy through the simulator some number of times and then average, to get an average over M simulations and a better estimate of the policy's expected payoff.

So now, given any one fixed policy, this gives me a way to evaluate the expected payoff of that policy. One reasonably obvious thing you might then try to do is just search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff. Does that make sense? My policy has some parameters theta - say my actions a are equal to theta transpose s, if it's a linear policy - and for any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator.
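Here is a minimal sketch of that evaluation procedure; the simulator, reward, and policy interfaces are hypothetical placeholders rather than anything defined in the lecture.

```python
import numpy as np

def estimate_expected_payoff(policy, simulator, reward, s0, T, M,
                             rng=np.random.default_rng()):
    """Monte Carlo evaluation of a fixed (deterministic) policy: fly the
    policy in the simulator M times and average the summed rewards.
    `simulator(s, a, rng)` is a hypothetical model returning a random next
    state s_{t+1}; `policy(s)` returns the action, e.g. theta @ s for a
    linear policy; `reward(s)` scores a state."""
    total = 0.0
    for _ in range(M):
        s, payoff = s0, 0.0
        for t in range(T):
            a = policy(s)
            payoff += reward(s)
            s = simulator(s, a, rng)        # random next state from the model
        total += payoff
    return total / M
```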
One thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate a policy, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a slightly different answer.

So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies: as I vary the parameters theta, I get different policies, and so I'm moving along the x-axis; and my total payoff is plotted on the vertical axis. The red line in this cartoon is the expected payoff of the different policies, and my goal is to find the policy with the highest expected payoff. You could search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that just by chance looked a little bit better than it should have; I evaluate a second policy, and just by chance it looks a little bit worse; I evaluate a third policy, a fourth; and sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random. When I apply the same policy twice in simulation, I might get back slightly different answers.

So as I evaluate more and more policies, these scattered black dots are the picture I get. My goal is to try to optimize the red line, and I hope you can appreciate that this is a hard problem, especially when all the optimization algorithm gets to see are these black dots - it doesn't have direct access to the red line. And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a ten-dimensional space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem.

So it turns out there's one way to make this optimization problem much easier. Here's the idea. The method is called Pegasus, which is an acronym for something; I'll tell you later.
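To see the difficulty concretely, here is roughly what evaluating the same policy twice with the Monte Carlo estimator sketched above might look like. The toy simulator and reward below are made-up stand-ins; the point is only that two calls with the same policy generally return different numbers, because fresh random numbers are drawn each time.

```python
import numpy as np

# Hypothetical toy setup: a 1-D noisy "simulator" and a reward that
# penalizes drifting away from the origin. These stand in for the real model.
def simulator(s, a, rng):
    return s + a + rng.normal(scale=0.1)

def reward(s):
    return -abs(s)

policy = lambda s: -0.5 * s              # one fixed linear policy
rng = np.random.default_rng()

def estimate(policy, M=10, T=50):
    total = 0.0
    for _ in range(M):
        s, payoff = 0.0, 0.0
        for _ in range(T):
            a = policy(s)
            payoff += reward(s)
            s = simulator(s, a, rng)
        total += payoff
    return total / M

print(estimate(policy), estimate(policy))   # two different noisy answers
```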
So the simulator repeatedly makes calls to a random number generator to generate random numbers r_t, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input a state and an action and outputs the next state randomly, and if you peer into the simulator - if you open the box and ask how it is generating these random next states s_{t+1} - pretty much the only way to write a simulator with random outputs is to make calls to a random number generator, get random numbers, these r_t's, and then have some function that takes as input s_0, a_0, and the result from your random number generator, and computes the next state as a function of those inputs and of the random number. This is pretty much the only way anyone implements any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice.

So here's the idea. Let's fix the sequence of random numbers in advance: choose r_1, r_2, up to r_{T-1}, and always use that same sequence of random numbers to evaluate different policies. Go into your code and fix r_1, r_2, through r_{T-1}: choose them randomly once and then fix them forever. If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. These sequences of random numbers are called scenarios, and Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios.

So when you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you've defined some deterministic estimate of the expected payoff.
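As a minimal sketch of this idea - assuming, for simplicity, that the simulator consumes one random number per time step (a real simulator might need more), and again with hypothetical simulator, reward, and policy interfaces - the scenarios are drawn once and then reused for every policy evaluation, so the returned objective is a deterministic function of the policy.

```python
import numpy as np

def make_pegasus_objective(simulator, reward, s0, T, num_scenarios=100, seed=0):
    """Pegasus-style objective: draw the 'scenarios' (sequences of random
    numbers) once, then reuse exactly the same ones for every policy you
    evaluate. `simulator(s, a, r)` is a hypothetical deterministic function
    of the state, the action, and a random number r."""
    rng = np.random.default_rng(seed)
    # Fix the random numbers in advance: one sequence of length T per scenario.
    scenarios = rng.random(size=(num_scenarios, T))

    def objective(policy):
        total = 0.0
        for rs in scenarios:                 # average over the fixed scenarios
            s, payoff = s0, 0.0
            for t in range(T):
                a = policy(s)
                payoff += reward(s)
                s = simulator(s, a, rs[t])   # same random numbers every call
            total += payoff
        return total / num_scenarios

    return objective
```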
00:57:53.469 --> 00:57:56.739 And as you evaluate different policies, they're 00:57:56.739 --> 00:58:00.609 still only approximations to their true expected payoff, but at least now you have 00:58:00.609 --> 00:58:01.310 a 00:58:01.310 --> 00:58:03.359 deterministic function to optimize, 00:58:03.359 --> 00:58:07.109 and you can now apply your favorite algorithms, be it gradient ascent or 00:58:08.089 --> 00:58:09.809 some sort 00:58:09.809 --> 00:58:12.259 of local [inaudible] search 00:58:12.259 --> 00:58:13.609 to try to optimize the black 00:58:13.609 --> 00:58:16.329 curve. This gives you a much easier 00:58:16.329 --> 00:58:18.390 optimization problem, and you 00:58:18.390 --> 00:58:22.789 can optimize this to find hopefully a good policy. So this is 00:58:22.789 --> 00:58:26.650 the Pegasus policy search method. 00:58:26.650 --> 00:58:27.450 So when 00:58:27.450 --> 00:58:31.390 I started to talk about reinforcement learning, I showed 00:58:31.390 --> 00:58:34.980 that video of a helicopter flying upside down. That was actually done using 00:58:34.980 --> 00:58:40.029 exactly this method, using exactly this policy search algorithm. This 00:58:40.029 --> 00:58:41.849 seems to scale well 00:58:41.849 --> 00:58:47.099 even to fairly large problems, even to fairly high dimensional state spaces. 00:58:47.099 --> 00:58:50.439 Typically, where Pegasus policy search algorithms have been used - 00:58:50.439 --> 00:58:52.719 the optimization problem 00:58:52.719 --> 00:58:56.440 is still much easier than the stochastic version, but sometimes it's 00:58:56.440 --> 00:58:58.279 not entirely trivial, and so 00:58:58.279 --> 00:59:00.299 you have to apply this sort of method with 00:59:00.299 --> 00:59:04.549 maybe on the order of ten parameters or tens of parameters, so 30, 40 00:59:04.549 --> 00:59:05.750 parameters, but 00:59:05.750 --> 00:59:08.169 not thousands of parameters, 00:59:08.169 --> 00:59:13.049 at least in these sorts of applications. 00:59:13.049 --> 00:59:19.469 Student: So is that method different than just assuming that 00:59:19.469 --> 00:59:26.949 you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that 00:59:26.949 --> 00:59:28.809 we have a deterministic simulator? 00:59:28.809 --> 00:59:31.249 The answer is no, this isn't the same as just assuming a deterministic simulator. In the way you do this, 00:59:31.249 --> 00:59:34.949 for the sake of simplicity I talked about one sequence of random numbers. 00:59:34.949 --> 00:59:36.890 What you do is - 00:59:36.890 --> 00:59:41.969 so imagine that the random numbers are simulating different wind gusts against your helicopter. 00:59:41.969 --> 00:59:46.429 So what you want to do isn't really evaluate just against one pattern of wind gusts. 00:59:46.429 --> 00:59:48.000 What you want to do is 00:59:48.000 --> 00:59:49.920 sample some set of 00:59:49.920 --> 00:59:53.650 different patterns of wind gusts, and evaluate against all of them and average. 00:59:53.650 --> 00:59:55.680 So what you do is you actually 00:59:55.680 --> 00:59:58.039 sample say 100 - 00:59:58.039 --> 01:00:01.359 some number I made up like 100 sequences of random numbers, 01:00:01.359 --> 01:00:03.430 and every time you want to evaluate a policy, 01:00:03.430 --> 01:00:08.679 you evaluate it against all 100 sequences of random numbers and then average. 01:00:08.679 --> 01:00:13.249 This is in exactly the same way that on this earlier picture 01:00:13.249 --> 01:00:16.879 you wouldn't necessarily evaluate the policy just once.
You evaluate it maybe 100 01:00:16.879 --> 01:00:18.450 times in simulation, and then 01:00:18.450 --> 01:00:21.709 average to get a better estimate of the expected reward. 01:00:21.709 --> 01:00:23.009 In the same way, 01:00:23.009 --> 01:00:24.079 you do that here 01:00:24.079 --> 01:00:29.079 but with 100 fixed sequences of random numbers. Does that make sense? Any 01:00:29.079 --> 01:00:36.079 other questions? Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]? 01:00:47.719 --> 01:00:51.829 Yeah. I guess you're right. So the quality - for a fixed policy, 01:00:51.829 --> 01:00:57.099 the quality of the approximation is equally good for both cases. The 01:00:57.099 --> 01:01:00.989 advantage of fixing the random numbers is that you end up with 01:01:00.989 --> 01:01:03.690 an optimization problem that's much easier. 01:01:04.609 --> 01:01:06.100 I have some search problem, 01:01:06.100 --> 01:01:10.049 and on the horizontal axis there's a space of control policies, 01:01:10.049 --> 01:01:12.559 and my goal is to find a control policy that 01:01:12.559 --> 01:01:15.319 maximizes the payoff. 01:01:15.319 --> 01:01:18.660 The problem with this earlier setting was that 01:01:18.660 --> 01:01:21.949 when I evaluate policies I get these noisy estimates, 01:01:21.949 --> 01:01:22.950 and then it's 01:01:22.950 --> 01:01:26.690 just very hard to optimize the red curve if I have these points that are all over the 01:01:26.690 --> 01:01:29.420 place. And if I evaluate the same policy twice, I don't even get back the same 01:01:30.719 --> 01:01:32.959 answer. By fixing the random numbers, 01:01:32.959 --> 01:01:36.669 the algorithm still doesn't get to see the red curve, but at least it's 01:01:36.669 --> 01:01:40.609 now optimizing a deterministic function. That makes the 01:01:40.609 --> 01:01:42.349 optimization problem much easier. Does 01:01:42.349 --> 01:01:49.349 that make sense? Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves 01:01:50.539 --> 01:01:58.009 that are easy to optimize. And then you smush them together? 01:01:58.009 --> 01:01:59.309 Let's see. 01:01:59.309 --> 01:02:04.559 I have just one nice black curve that I'm trying to optimize. Student: For each scenario? 01:02:04.559 --> 01:02:08.890 I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios. 01:02:08.890 --> 01:02:12.339 So the black curve here is defined by averaging over 01:02:12.339 --> 01:02:16.249 a large set of scenarios. Does that make sense? 01:02:16.249 --> 01:02:20.099 So instead of only one - if the averaging thing doesn't make sense, imagine that there's 01:02:20.099 --> 01:02:24.049 just one sequence of random numbers. That might be easier to think 01:02:24.049 --> 01:02:27.419 about. Fix one sequence of random numbers, and every time I evaluate another policy, 01:02:27.419 --> 01:02:31.039 I evaluate against the same sequence of random numbers, and that gives me 01:02:31.039 --> 01:02:36.489 a nice deterministic function to optimize. 01:02:36.489 --> 01:02:39.449 Any other questions? 01:02:39.449 --> 01:02:41.269 The advantage is really 01:02:41.269 --> 01:02:45.459 that - one way to think about it is when I evaluate the same policy twice, 01:02:45.459 --> 01:02:50.449 at least I get back the same answer.
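Continuing the toy sketch from above (and reusing numpy, the rng, and the evaluate_on_scenario function defined there; the number of scenarios M = 100 and the crude hill climber below are illustrative choices, not the actual method used for the helicopter), here is roughly what averaging over a fixed set of scenarios looks like. The key property is that estimated_payoff is an ordinary deterministic function of the policy parameter, so any off-the-shelf local search can be run on it.

# Draw M scenarios once, up front, and then never change them.
M, T = 100, 100
scenarios = [rng.standard_normal(T) for _ in range(M)]

# Deterministic estimate of the expected payoff: average over the fixed scenarios.
def estimated_payoff(theta):
    return np.mean([evaluate_on_scenario(theta, sc) for sc in scenarios])

# Because the objective is now deterministic in theta, even a very crude
# hill-climbing search (a stand-in for gradient ascent or any favorite
# local search method) can be applied to it.
def hill_climb(theta=0.0, step=0.05, iters=100):
    best = estimated_payoff(theta)
    for _ in range(iters):
        for delta in (+step, -step):
            candidate = estimated_payoff(theta + delta)
            if candidate > best:
                theta, best = theta + delta, candidate
    return theta, best

print(hill_climb())

Evaluating estimated_payoff at the same theta twice always returns the same number, which is the property being emphasized here; the averaging over many scenarios is what keeps the fixed random numbers from amounting to a single arbitrary deterministic simulator.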
This gives me a deterministic function 01:02:50.449 --> 01:02:53.890 mapping from parameters in my policy 01:02:53.890 --> 01:02:56.239 to my estimate of the expected payoff. 01:02:57.789 --> 01:03:13.229 That's just a function that I can try to optimize using a search algorithm. 01:03:13.229 --> 01:03:16.799 So we use this algorithm for inverted hovering, 01:03:16.799 --> 01:03:20.329 and again policy search algorithms tend to work well when you can find 01:03:20.329 --> 01:03:23.739 a reasonably simple policy mapping from 01:03:23.739 --> 01:03:28.640 the states to the actions. This is especially true for low level control tasks, 01:03:28.640 --> 01:03:32.699 which I think of as sort of reflexes almost. 01:03:32.699 --> 01:03:33.799 Conversely, 01:03:33.799 --> 01:03:37.269 if you want to solve a problem like Tetris where you might plan 01:03:37.269 --> 01:03:40.499 ahead a few steps about what's a nice configuration of the board, or 01:03:40.499 --> 01:03:42.289 something like a game of chess, 01:03:42.289 --> 01:03:46.719 or problems involving long path planning - go here, then go there, then go there - 01:03:46.719 --> 01:03:51.069 then sometimes you might apply a value function method instead. 01:03:51.069 --> 01:03:55.059 But for tasks like helicopter flight, for 01:03:55.059 --> 01:03:58.569 low level control tasks, for the reflexes of driving or controlling 01:04:00.749 --> 01:04:05.699 various robots, policy search algorithms - 01:04:05.699 --> 01:04:10.629 where we can sometimes more easily approximate the policy directly - work very well. 01:04:11.839 --> 01:04:15.599 Some [inaudible] the state of RL today. RL algorithms are applied 01:04:15.599 --> 01:04:18.499 to a wide range of problems, 01:04:18.499 --> 01:04:22.239 and the key is really sequential decision making. The place where you 01:04:22.239 --> 01:04:25.569 think about applying a reinforcement learning algorithm is when you need to make a decision, 01:04:25.569 --> 01:04:28.209 then another decision, then another decision, 01:04:28.209 --> 01:04:31.889 and some of your actions may have long-term consequences. I think that 01:04:31.889 --> 01:04:37.910 is the heart of RL: sequential decision making, where you make multiple decisions, 01:04:37.910 --> 01:04:41.599 and some of your actions may have long-term consequences. I've shown you 01:04:41.599 --> 01:04:44.170 a bunch of robotics examples. RL 01:04:44.170 --> 01:04:45.569 is also 01:04:45.569 --> 01:04:48.759 applied to things like medical decision making, where you may have a patient and you 01:04:48.759 --> 01:04:51.879 want to choose a sequence of treatments, and you do this now for the patient, and the 01:04:51.879 --> 01:04:56.449 patient may be in some other state, and you choose to do that later, and so on. 01:04:56.449 --> 01:05:00.839 It turns out there's a large community of people applying these sorts of tools to queues. So 01:05:00.839 --> 01:05:04.539 imagine you have a bank where you have people lining up, and 01:05:04.539 --> 01:05:06.009 after they go to one cashier, 01:05:06.009 --> 01:05:09.740 some of them have to go to the manager to deal with something else. You have 01:05:09.740 --> 01:05:13.410 a system of multiple people standing in line in multiple queues, and so how do you route 01:05:13.410 --> 01:05:17.839 people optimally to minimize the waiting 01:05:17.839 --> 01:05:20.239 time? And not just people, but 01:05:20.239 --> 01:05:23.969 objects in an assembly line and so on.
It turns out there's a surprisingly large community 01:05:23.969 --> 01:05:26.349 working on optimizing queues. 01:05:26.349 --> 01:05:29.719 I mentioned game playing a little bit already. Things 01:05:29.719 --> 01:05:33.289 like financial decision making, if you have a large amount of stock, 01:05:33.289 --> 01:05:37.569 how do you sell off a large amount - how do you time the selling off of your stock 01:05:37.569 --> 01:05:41.329 so as to not affect market prices adversely too much? There 01:05:41.329 --> 01:05:44.949 are many operations research problems, things like factory automation. Can you design 01:05:44.949 --> 01:05:46.369 a factory 01:05:46.369 --> 01:05:50.140 to optimize throughput, or minimize cost, or whatever? These are all 01:05:50.140 --> 01:05:54.049 sorts of problems that people are applying reinforcement learning algorithms to. 01:05:54.049 --> 01:05:57.529 Let me just close with a few robotics examples because they're always fun, and we just 01:05:57.529 --> 01:05:59.190 have these videos. 01:06:00.189 --> 01:06:07.259 This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students 01:06:07.259 --> 01:06:13.760 here. They were working on getting a robot dog to climb over difficult, rugged 01:06:13.760 --> 01:06:16.959 terrain. Using a reinforcement learning algorithm, 01:06:16.959 --> 01:06:19.979 they applied an approach that's 01:06:19.979 --> 01:06:24.149 similar to a value function approximation approach, not quite but similar. 01:06:24.149 --> 01:06:28.849 They allowed the robot dog to sort of plan ahead multiple steps, and 01:06:28.849 --> 01:06:33.999 carefully choose its footsteps and traverse rugged terrain. 01:06:33.999 --> 01:06:43.339 This is really state of the art in terms of what you can get a robotic dog to do. 01:06:43.339 --> 01:06:45.569 Here's another fun one. 01:06:45.569 --> 01:06:48.009 It turns out that 01:06:48.009 --> 01:06:51.259 wheeled robots are very fuel-efficient. Cars and trucks are almost the 01:06:51.259 --> 01:06:55.250 most fuel-efficient robots in the world. 01:06:55.250 --> 01:06:57.260 Cars and trucks are very fuel-efficient, 01:06:57.260 --> 01:07:01.309 but legged robots have the ability to traverse more rugged terrain. 01:07:01.309 --> 01:07:04.599 So this is a robot - this is actually a small scale mockup of a larger 01:07:04.599 --> 01:07:06.519 vehicle built by Lockheed Martin - 01:07:06.519 --> 01:07:09.699 but can you come up with a vehicle that 01:07:09.699 --> 01:07:13.689 has wheels, and so has the fuel efficiency of wheeled robots, but also 01:07:13.689 --> 01:07:19.099 has legs so it can traverse obstacles? 01:07:19.099 --> 01:07:27.349 Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles with 01:07:27.349 --> 01:07:31.319 somewhat complex gaits that would be very hard to design by hand, 01:07:31.319 --> 01:07:33.279 but by choosing a reward function - telling the robot 01:07:33.279 --> 01:07:36.789 there's a plus one reward at the top of the goal, 01:07:36.789 --> 01:07:45.469 and a few other things - it learns these sorts of policies automatically. 01:07:46.519 --> 01:07:55.229 Last couple fun ones - I'll show you a couple last helicopter videos. 01:07:57.029 --> 01:08:08.700 This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, 01:08:08.769 --> 01:08:33.319 who have been working on autonomous flight. I'll just let you watch. I'll just 01:09:49.339 --> 01:09:53.900 show you one more.
[Inaudible] do this with a real helicopter [inaudible]? 01:09:53.900 --> 01:09:55.199 Not a full-size helicopter. 01:09:55.199 --> 01:10:00.860 Only small radio control helicopters. [Inaudible]. Full-size helicopters 01:10:00.860 --> 01:10:05.780 are the wrong design for this. You shouldn't do this on a full-size helicopter. 01:10:05.780 --> 01:10:12.619 Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more. 01:10:12.619 --> 01:10:18.919 Are there any human [inaudible]? Yes, 01:10:18.919 --> 01:10:25.919 there are very good human pilots that can. 01:10:28.420 --> 01:10:35.420 This is just one more 01:10:43.739 --> 01:10:45.510 maneuver. That was kind of fun. So this is the work of 01:10:45.510 --> 01:10:55.530 Pieter Abbeel and Adam Coates. So that was it. That was all the technical material 01:10:55.530 --> 01:11:02.530 I wanted to present in this class. I guess 01:11:03.109 --> 01:11:10.199 you're all experts on machine learning now. Congratulations. 01:11:10.199 --> 01:11:14.209 And I hope you've gotten the sense through this class that this is one of the technologies 01:11:14.209 --> 01:11:18.969 that's really having a huge impact on science, engineering, and industry. 01:11:18.969 --> 01:11:23.130 As I said in the first lecture, I think many people use machine 01:11:23.130 --> 01:11:30.130 learning algorithms dozens of times a day without even knowing about it. 01:11:30.389 --> 01:11:34.110 Based on the projects you've done, I hope that 01:11:34.110 --> 01:11:37.899 most of you will be able to imagine yourself 01:11:37.899 --> 01:11:43.280 going out after this class and applying these things to solve a variety of problems. 01:11:43.280 --> 01:11:45.980 Hopefully, some of you will also imagine yourselves 01:11:45.980 --> 01:11:48.989 writing research papers after this class, be it on 01:11:48.989 --> 01:11:52.440 a novel way to do machine learning, or on some way of applying machine 01:11:52.440 --> 01:11:55.289 learning to a problem that you care 01:11:55.289 --> 01:11:58.749 about. In fact, looking at project milestones, I'm actually sure that a large fraction of 01:11:58.749 --> 01:12:04.300 the projects in this class will be publishable, so I think that's great. 01:12:06.679 --> 01:12:10.750 I guess many of you will also go into industry, make new products, and make lots 01:12:10.750 --> 01:12:12.659 of money using learning 01:12:12.659 --> 01:12:15.580 algorithms. Remember me 01:12:15.580 --> 01:12:19.530 if that happens. One of the things I'm excited about is that through research or 01:12:19.530 --> 01:12:22.570 industry, I'm actually completely sure that the people in this class in the 01:12:22.570 --> 01:12:23.849 next few months 01:12:23.849 --> 01:12:27.320 will apply machine learning algorithms to lots of problems in 01:12:27.320 --> 01:12:31.770 industrial management, and computer science, things like optimizing computer architectures, 01:12:31.770 --> 01:12:33.340 network security, 01:12:33.340 --> 01:12:38.480 robotics, computer vision, to problems in computational biology, 01:12:39.140 --> 01:12:42.429 to problems in aerospace, or understanding natural language, 01:12:42.429 --> 01:12:44.309 and many more things like that.
01:12:44.309 --> 01:12:46.030 So 01:12:46.030 --> 01:12:49.920 right now I have no idea what all of you are going to do with the learning algorithms you learned about, 01:12:49.920 --> 01:12:51.540 but 01:12:51.540 --> 01:12:54.809 every time as I wrap up this class, I always 01:12:54.809 --> 01:12:58.209 feel very excited, and optimistic, and hopeful about the sorts of amazing 01:12:58.209 --> 01:13:03.129 things you'll be able to do. 01:13:03.129 --> 01:13:10.489 One final thing, I'll just give out this handout. 01:13:30.749 --> 01:13:35.170 One final thing is that machine learning has grown out of a larger literature on AI, 01:13:35.170 --> 01:13:36.419 where 01:13:36.419 --> 01:13:41.829 there's this desire to build systems that exhibit intelligent behavior. Machine learning 01:13:41.829 --> 01:13:44.889 is one of the tools of AI, maybe one that's had a 01:13:44.889 --> 01:13:46.510 disproportionately large impact, 01:13:46.510 --> 01:13:51.519 but there are many other ideas in AI that I hope you 01:13:51.519 --> 01:13:55.289 go and continue to learn about. Fortunately, Stanford 01:13:55.289 --> 01:13:59.279 has one of the best and broadest sets of AI classes, and I hope that 01:13:59.279 --> 01:14:03.690 you take advantage of some of these classes, and go and learn more about AI, and more 01:14:03.690 --> 01:14:07.690 about other fields which often apply learning algorithms to problems in vision, 01:14:07.690 --> 01:14:10.480 problems in natural language processing, in robotics, and so on. 01:14:10.480 --> 01:14:14.960 So the handout I just gave out has a list of AI related courses. Just 01:14:14.960 --> 01:14:18.489 running down very quickly, I guess, CS221 is an overview that I 01:14:18.489 --> 01:14:19.340 teach. There 01:14:19.340 --> 01:14:23.699 are a lot of robotics classes also: 223A, 01:14:23.699 --> 01:14:25.139 225A, 225B 01:14:25.139 --> 01:14:28.760 - many robotics classes. There are so many applications 01:14:28.760 --> 01:14:31.999 of learning algorithms to robotics today. 222 01:14:31.999 --> 01:14:36.369 and 227 are knowledge representation and reasoning classes. 01:14:36.369 --> 01:14:39.919 228 - of all the classes on this list, 228, which Daphne 01:14:39.919 --> 01:14:43.479 Koller teaches, is probably closest in spirit to 229. This is one of 01:14:43.479 --> 01:14:45.300 the classes I highly recommend 01:14:45.300 --> 01:14:48.600 to all of my PhD students as well. 01:14:48.600 --> 01:14:53.300 Many other classes also touch on machine learning. On the next page, 01:14:54.040 --> 01:14:57.760 there are courses on computer vision, speech recognition, natural language processing - 01:14:57.760 --> 01:15:03.199 various tools that aren't just machine learning, but often involve machine learning in many ways. 01:15:03.199 --> 01:15:07.339 Other aspects of AI: multi-agent systems, taught by [inaudible]. 01:15:09.630 --> 01:15:13.139 EE364A is convex optimization. It's a class taught by 01:15:13.139 --> 01:15:16.630 Steve Boyd, and convex optimization came up many times in this class. If you 01:15:16.630 --> 01:15:20.909 want to become really good at it, EE364 is a great class. If you're 01:15:20.909 --> 01:15:24.579 interested in project courses, I also teach a project class 01:15:24.579 --> 01:15:27.179 next quarter where we spend the whole quarter working 01:15:27.179 --> 01:15:31.510 on research projects. 01:15:31.510 --> 01:15:37.589 So I hope you go and take some more of those classes.
01:15:37.589 --> 01:15:40.379 In closing, 01:15:40.379 --> 01:15:42.849 let me just say 01:15:42.849 --> 01:15:46.880 this class has been really fun to teach, and it's very satisfying to 01:15:46.880 --> 01:15:52.010 me personally when we set these insanely difficult homeworks, and 01:15:52.010 --> 01:15:54.649 then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's 01:15:54.649 --> 01:15:58.570 actually very satisfying when I see that. 01:15:58.570 --> 01:16:04.739 Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable." 01:16:04.739 --> 01:16:05.469 So 01:16:05.469 --> 01:16:09.039 I hope you take what you've learned, go forth, and 01:16:09.039 --> 01:16:11.570 do amazing things with learning algorithms. 01:16:11.570 --> 01:16:16.220 I know this is a heavy workload class, so thank you all very much for the hard 01:16:16.220 --> 01:16:20.280 work you've put into this class, and the hard work you've put into learning this material, 01:16:20.280 --> 01:16:21.699 and 01:16:21.699 --> 01:16:23.389 thank you very much for having been students in this class.