This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, essentially called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t, and what if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that are maybe some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t = L_t s_t. So L_t was this matrix of parameters that came out of the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we'd choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate of what the state is given all the observations. And in particular, I talked about a Kalman filter, with which we worked out that our posterior distribution of what the state is, given all the observations up to a certain time, was this. So this is from last time: given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this s_{t|t}, which is going to be our best guess for what the state currently is.

And then we choose actions a_t = L_t s_{t|t}, using our estimate of the state rather than the true state, because we don't know the true state anymore in this POMDP. So it turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given there are these noisy observations.
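As a minimal sketch of the strategy just described - run a Kalman filter to get s_{t|t}, then act with the precomputed LQR gain L_t as if that estimate were the true state - here is roughly what one step could look like in Python. This is not from the lecture; the matrices A, B, C, Q, R and the gain L are placeholders for whatever linear model and LQR solution you have.

```python
import numpy as np

def kalman_step(mu, Sigma, a, y, A, B, C, Q, R):
    """One predict/update step of a Kalman filter.

    mu, Sigma : posterior mean/covariance of s_{t-1} given y_1..y_{t-1}
    a         : action a_{t-1} applied at time t-1
    y         : new observation y_t
    A, B      : dynamics      s_t = A s_{t-1} + B a_{t-1} + w,  w ~ N(0, Q)
    C         : observations  y_t = C s_t + v,                  v ~ N(0, R)
    Returns the new posterior mean and covariance, s_{t|t} and Sigma_{t|t}.
    """
    # Predict
    mu_pred = A @ mu + B @ a
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update with the new observation
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

def choose_action(L, mu):
    # Act as if the state estimate were the true state: a_t = L_t s_{t|t}
    return L @ mu
```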
It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem. Just to be concrete about the formalism of the POMDP - I should just write it here - a POMDP formally is a tuple (S, A, Y, {P_sa}, {O_s}, T, R), where the changes from an ordinary MDP are that Y is the set of possible observations and the O subscript s are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O subscript s_t, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the state from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP. In the more general case, that strategy - designing a controller assuming full observability, then just estimating the state and plugging the two together - is often a very reasonable strategy for general POMDPs, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.
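To make that tuple concrete, here is a minimal sketch of how a small discrete POMDP might be represented in code; the field names and dictionary layout are my own illustrative choices, not notation from the lecture.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list        # S
    actions: list       # A
    observations: list  # Y, the set of possible observations
    P: dict             # P[(s, a)] = probability distribution over next states
    O: dict             # O[s]       = distribution over observations (O subscript s)
    T: int              # horizon
    R: dict             # R[s]       = reward

def observe(pomdp, s, rng):
    """At each step the agent sees only y_t drawn from O_{s_t}, never s_t itself."""
    return rng.choice(pomdp.observations, p=pomdp.O[s])
```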
So what I want to do today is actually talk about a different class of reinforcement learning algorithms, called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs - to fully observed Markov decision processes - or to these partially observable MDPs. What I'll do is describe policy search algorithms applied to MDPs, applied to the fully observable case, and at the end I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs. In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard. But nonetheless, policy search algorithms turn out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs. So here's what we're going to do.

In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: in the same way, back when we were talking about supervised learning, we defined a set script H of hypotheses and would search for a good hypothesis in that class script H. Defining the set capital Pi of policies and searching for a good policy in capital Pi is exactly analogous.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've been talking about so far, we would try to find V-star, the optimal value function, and then we'd use the optimal value function to try to compute, or approximate, pi-star. So all the approaches we talked about previously are indirect strategies for finding a good policy: once we compute the value function, we go from that to the policy. In contrast, in policy search algorithms - in particular what are called direct policy search algorithms - the idea is that we're going to quote "directly" try to approximate a good policy, without going through the intermediate stage of finding the value function.

And also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there'll be a step where I say, "Well, let's try to compute the actions. Let's try to approximate what a good action is using a logistic function of the state." So again, I'll sort of pull a function out of the air and say, "Let's just choose a function, and that'll be our choice of the policy class: we'll take as input the state, map it through a logistic function, and then hopefully we'll approximate what a good action is using a logistic function of the state." So the choice of policy class is again a little bit arbitrary, but it's no more arbitrary than it was when we were talking about supervised learning.
So to develop our first policy search algorithm, I'm actually going to need a new definition. For our first policy search algorithm, we'll need to work with stochastic policies. What I mean by a stochastic policy is a function pi mapping from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so for every state, these have to sum to one over the actions: sum over a of pi(s, a) = 1. In other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose you are executing some stochastic policy pi - say I have some stochastic policy pi and I want to execute it. What that means is that, in this example where let's say I have three actions, suppose I'm in some state s. I would then compute pi(s, a1), pi(s, a2), and pi(s, a3), if I have a three-action MDP. These will be three numbers that sum up to one, and then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.

So as a concrete example, let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever. And now let's come up with a stochastic policy for this problem.

To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1, so pi(s, a1), I'm going to write as 1/(1 + e^(-theta transpose s)). And I just chose the logistic function because it's a convenient function we've used a lot.
So I'm going to say that my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. If I'm executing that policy with parameters theta, that means the chance of my choosing the accelerate-right action is given by this number. Because my chances of executing actions a1 and a2 must sum to one, this also gives me pi(s, a2) = 1 - 1/(1 + e^(-theta transpose s)). So just to say this again: when I'm in some state s, I'm going to compute this number, 1/(1 + e^(-theta transpose s)), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is s = [1, x, x-dot, phi, phi-dot] - the cart-pole state, where I added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose my parameters theta so that the inner product theta transpose s just gives you phi, then that means that at any state, the probability of my taking the accelerate-right action is 1/(1 + e^(-theta transpose s)) = 1/(1 + e^(-phi)). And so with that choice of parameters theta, the chance of my accelerating to the right is just this function of the angle phi of my inverted pendulum. And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole's leaning over to the right, then I want to accelerate to the right to catch it, and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it. So this is one example, for one specific choice of parameters theta.
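Here is a minimal sketch of this two-action logistic policy in Python - my own code, not something written in the lecture. The state layout [1, x, x_dot, phi, phi_dot] and the particular choice theta = [0, 0, 0, 1, 0], which makes theta-transpose-s equal to phi, are illustrative assumptions.

```python
import numpy as np

def pi_right(theta, s):
    """pi_theta(s, a1): probability of the accelerate-right action."""
    return 1.0 / (1.0 + np.exp(-theta @ s))   # P(accelerate left | s) is just 1 minus this

def sample_action(theta, s, rng):
    """Execute the stochastic policy: flip a coin with bias pi_theta(s, a1)."""
    return "right" if rng.random() < pi_right(theta, s) else "left"

# State with an intercept feature: s = [1, x, x_dot, phi, phi_dot]
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])   # theta' s = phi, the pole angle
s = np.array([1.0, 0.2, 0.0, 0.5, -0.1])      # pole leaning to the right (phi > 0)
rng = np.random.default_rng(0)
print(pi_right(theta, s))                     # about 0.62, so accelerating right is more likely
```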
Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then depending on the velocity of the cart and the angular velocity of the pole, you might want to change theta to take those other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back, as indicated by the angular velocity, then you might be less worried about having to accelerate hard to the right. And so these are the sorts of behaviors you can get by varying the parameters theta.

And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi subscript theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything? Yeah. Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? Oh, yeah, right. So what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use a softmax parameterization. Similar to the softmax regression that we saw earlier in the class, you may have a set of parameters theta_1 through theta_D if you have D actions, and pi_theta(s, a_i) = e^(theta_i transpose s) / sum over j of e^(theta_j transpose s). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.
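Here is a hedged sketch of that softmax parameterization for D discrete actions - again my own code, not the lecture's; the parameters theta_1, ..., theta_D are stacked as the rows of a matrix Theta.

```python
import numpy as np

def softmax_policy(Theta, s):
    """pi_theta(s, a_i) = exp(theta_i' s) / sum_j exp(theta_j' s).

    Theta has one row per action; returns a length-D probability vector over actions.
    """
    logits = Theta @ s
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_discrete_action(Theta, s, rng):
    """Draw action index i with probability pi_theta(s, a_i)."""
    return rng.choice(len(Theta), p=softmax_policy(Theta, s))
```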
Anything else? Yeah. Student: [Inaudible] stochastic? So the question is: is it possible to [inaudible] a stochastic policy using [inaudible]? I see - given that the MDP has stochastic transition probabilities, is it possible to use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. Formally, it is possible: if you already have a fixed policy, then you'd be able to do that. Anything else? Yeah. Student: [Inaudible]. No, I guess even a [inaudible] class of policies can do that, but for the derivation later, I actually need to keep it separate.

Actually, could you just - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay, thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it captures its essence. Here's the idea.

In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. (It turns out that if s_0 is drawn from some fixed initial-state distribution, then everything that follows still works, but let's just say s_0 is some fixed initial state.) So my goal is to maximize the expected sum of rewards, E[R(s_0) + R(s_1) + ... + R(s_T)], when we execute the policy pi_theta; I'll drop the explicit conditioning on the policy and so on from the notation. The random variables in this expectation are the sequence of states and actions: s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.

So let me write out this expectation explicitly as a sum over all possible state-action sequences of the probability of that sequence times the sum of rewards - that's just what an expectation is: the probability of the random variables times the payoff. Let me just expand out this probability. The probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state, P(s_0) - if this is a deterministic initial state, then all the probability mass would be on one state; otherwise there's some distribution over initial states - times the probability pi_theta(s_0, a_0) that you chose action a_0 in that state s_0, times the probability P_{s_0 a_0}(s_1) that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability pi_theta(s_1, a_1) that you then chose that action, and so on; the last terms are P_{s_{T-1} a_{T-1}}(s_T) times pi_theta(s_T, a_T). So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this:

P(s_0, a_0, ..., s_T, a_T) = P(s_0) pi_theta(s_0, a_0) P_{s_0 a_0}(s_1) pi_theta(s_1, a_1) ... P_{s_{T-1} a_{T-1}}(s_T) pi_theta(s_T, a_T).

It turns out later on I'm going to need to write this sum of rewards a lot, so I'm just going to call the sum R(s_0) + ... + R(s_T) the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. So our goal is to maximize the expected payoff - our goal is to maximize this sum.
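Since everything that follows keeps referring to sampling a state-action sequence and computing its payoff, here is a hedged sketch of those two pieces. The functions sample_initial_state, transition, and reward are placeholders for whatever MDP or simulator you have, and policy is any stochastic policy such as the logistic one above.

```python
def sample_trajectory(policy, theta, sample_initial_state, transition, T, rng):
    """Roll out the stochastic policy pi_theta for T steps.

    Returns [s_0, ..., s_T] and [a_0, ..., a_T].
    """
    states, actions = [], []
    s = sample_initial_state(rng)      # s_0 (fixed, or drawn from an initial-state distribution)
    for t in range(T + 1):
        a = policy(theta, s, rng)      # a_t ~ pi_theta(s_t, .)
        states.append(s)
        actions.append(a)
        if t < T:
            s = transition(s, a, rng)  # s_{t+1} ~ P_{s_t a_t}
    return states, actions

def payoff(states, reward):
    """The 'payoff' of the lecture: R(s_0) + R(s_1) + ... + R(s_T)."""
    return sum(reward(s) for s in states)
```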
Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm - this is how we're going to update the parameters. Repeat: (1) sample a state-action sequence s_0, a_0, ..., s_T, a_T by executing the current policy pi_theta; (2) compute the payoff R(s_0) + ... + R(s_T); (3) update theta := theta + alpha [ grad_theta pi_theta(s_0, a_0)/pi_theta(s_0, a_0) + ... + grad_theta pi_theta(s_T, a_T)/pi_theta(s_T, a_T) ] times the payoff.

The way you sample a state-action sequence is you just take your current stochastic policy and execute it in the MDP: go ahead and start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps - that's how you sample the state sequence. Then you compute the payoff, and then you perform this update.

So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and this thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I take a step that's chosen randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how this changes the parameters theta - in particular, what the expected value of the change to the parameters is. Our goal is to maximize the expected payoff, so as long as the updates are, on expectation, on average taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember, when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters - maybe you're trying to minimize a quadratic function - and the parameters will wander around randomly until they get close to the optimum of the quadratic surface. It turns out that the REINFORCE algorithm will be very much like that. It will be a stochastic gradient ascent algorithm in which, on every step, the step we take is a little bit random - it's determined by the random state-action sequence - but on expectation, this turns out to be essentially a gradient ascent algorithm. And so it'll do something like this: it'll wander around randomly, but on average take you toward the optimum.
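Here is what that update might look like for the two-action logistic policy above - a sketch of the "essentially REINFORCE" update described here, not Williams' original formulation. It uses the fact that grad_theta pi_theta(s,a) / pi_theta(s,a) = grad_theta log pi_theta(s,a), which for this policy is (1 - p) s for the accelerate-right action and -p s for the accelerate-left action, where p = 1/(1 + e^(-theta' s)). The learning rate alpha is an assumption.

```python
import numpy as np

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(s, a) for the two-action logistic policy."""
    p_right = 1.0 / (1.0 + np.exp(-theta @ s))
    return (1.0 - p_right) * s if a == "right" else -p_right * s

def reinforce_update(theta, states, actions, total_payoff, alpha):
    """One stochastic update:
        theta := theta + alpha * [sum_t grad pi(s_t,a_t)/pi(s_t,a_t)] * payoff.
    Each individual step is noisy; only on expectation does it point along
    the gradient of the expected payoff (this is what the derivation below proves)."""
    g = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    return theta + alpha * g * total_payoff
```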
So let me go ahead and prove that now. Let's see. What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then I'll hopefully show that this gradient ascent update rule is what the algorithm above does on expectation.

So before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is given by taking the derivatives of these things one at a time,

d/d theta [ f(theta) g(theta) h(theta) ] = f'(theta) g(theta) h(theta) + f(theta) g'(theta) h(theta) + f(theta) g(theta) h'(theta).

So first I differentiate with respect to f, getting f prime and leaving the other two fixed; then I differentiate with respect to g, leaving the other two fixed; then I differentiate with respect to h, getting h prime and leaving the other two fixed. That's the product rule for derivatives.

Now refer back to the equation where earlier we wrote out the expected payoff as a sum over all state-action sequences of the probability times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function - I'll take the derivative with respect to theta and then try to go uphill on it. So using the product rule, when I take the derivative of this function with respect to theta, what I get is a sum of terms: there are several factors here that depend on theta, namely the pi_theta factors, and so I'll end up with one term from the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the next pi_theta factor keeping everything else fixed, and so on, down to one term from the derivative of the last pi_theta factor keeping everything else fixed. So just apply the product rule to this. Let's write that down.

So I have that the derivative with respect to theta of the expected value of the payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how that goes, but this is a four-step derivation. So there's the sum over the state-action sequences as we saw before:

grad_theta E[payoff] = sum over (s_0, a_0, ..., s_T, a_T) of [ P(s_0) (grad_theta pi_theta(s_0, a_0)) P_{s_0 a_0}(s_1) pi_theta(s_1, a_1) ... + P(s_0) pi_theta(s_0, a_0) P_{s_0 a_0}(s_1) (grad_theta pi_theta(s_1, a_1)) ... + ... ] times the payoff.

So that huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta one at a time.
This was the term with the derivative of the first factor, pi_theta(s_0, a_0) - so there's the first derivative term, there's the second one, and then plus dot, dot, dot, down to the last term. So that was step one of four.

And so by algebra - let me just write this down and convince us all that it's true - this equals

sum over (s_0, a_0, ..., s_T, a_T) of P(s_0) pi_theta(s_0, a_0) P_{s_0 a_0}(s_1) ... pi_theta(s_T, a_T) times [ grad_theta pi_theta(s_0, a_0)/pi_theta(s_0, a_0) + ... + grad_theta pi_theta(s_T, a_T)/pi_theta(s_T, a_T) ] times the payoff.

This is the second of four steps, in which you just convince yourself that if you take the sum inside the brackets and multiply it by that big product in front, you get back the sum of terms we had. For example, when I multiply the product in front by the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and is replaced by the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But the term in front is just what I worked out previously - it's the joint probability of the state-action sequence - and now I have that, times the bracketed sum, times the payoff. And so, by the definition of expectation, this is just equal to

E[ ( grad_theta pi_theta(s_0, a_0)/pi_theta(s_0, a_0) + ... + grad_theta pi_theta(s_T, a_T)/pi_theta(s_T, a_T) ) times the payoff ].

So this thing inside the expectation is exactly the step that we were taking in the inner loop of our - roughly - REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff. That's how I started this whole derivation: I said, let's look at our expected payoff and take the derivative of that with respect to theta. What we've proved is that, on expectation, the step direction REINFORCE takes is exactly the gradient of the thing we're trying to optimize. This shows that this algorithm is a stochastic gradient ascent algorithm.
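For reference, here are the four steps written out compactly - my own transcription of the board work into LaTeX, writing the payoff as R(s_0) + ... + R(s_T):

```latex
\begin{align*}
\nabla_\theta \mathbb{E}[\text{payoff}]
&= \nabla_\theta \!\!\sum_{(s_0,a_0,\ldots,s_T,a_T)}\!\!
   P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots\pi_\theta(s_T,a_T)\,
   \bigl[R(s_0)+\cdots+R(s_T)\bigr] \\
&= \sum_{(s_0,a_0,\ldots)}
   \Bigl[\,P(s_0)\bigl(\nabla_\theta\pi_\theta(s_0,a_0)\bigr)P_{s_0 a_0}(s_1)\cdots\pi_\theta(s_T,a_T)
   + \cdots
   + P(s_0)\,\pi_\theta(s_0,a_0)\cdots\bigl(\nabla_\theta\pi_\theta(s_T,a_T)\bigr)\Bigr]
   \bigl[R(s_0)+\cdots+R(s_T)\bigr] \\
&= \sum_{(s_0,a_0,\ldots)}
   P(s_0,a_0,\ldots,s_T,a_T)
   \left[\frac{\nabla_\theta\pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
   + \cdots
   + \frac{\nabla_\theta\pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right]
   \bigl[R(s_0)+\cdots+R(s_T)\bigr] \\
&= \mathbb{E}\!\left[\left(\sum_{t=0}^{T}
   \frac{\nabla_\theta\pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)
   \bigl(R(s_0)+\cdots+R(s_T)\bigr)\right].
\end{align*}
```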
I wrote a lot, so why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then check if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some comments: we've talked about value function approximation approaches, where you approximate V-star and then go from V-star to pi-star, and now also policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable.

It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is, for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well. For the inverted pendulum problem, this is quite likely to be true: by going through different choices of parameters, you can express things like "if the pole's leaning towards the right, then accelerate towards the right to try to catch it," and for the inverted pendulum that's probably enough. For lots of what are called low-level control tasks - things like driving a car, the low-level reflexes of do you steer left to avoid another car, do you steer left to follow the road, or flying a helicopter - these are very short-time-scale types of decisions. I like to think of these as the decisions of a trained operator, like a trained driver or a trained pilot: they're almost reflexes, very quick, instinctive things where you map very directly from the input data to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.

In contrast, if you have problems that require long, multistep reasoning - things like a game of chess, where you have to reason carefully about "if I do this, then they'll do that, then I'll do this, then they'll do that" - those I think of as less instinctual, very high-level decision making. For problems like that, I would sometimes use a value function approximation approach instead. I'll say more about this later.

The last thing I want to mention here, just as a side comment, is that if you have a POMDP - a partially observable MDP, and I don't want to say too much about this - it turns out that if you only have an approximation, call it s-hat, of the true state, and this could be s-hat = s_{t|t} from a Kalman filter, then you can still use these sorts of policy search algorithms, with a policy of the form pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that only looks at your estimate of the state to choose the action.
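A minimal sketch of that idea, reusing the kalman_step sketch from earlier (my own code, not the lecture's): the policy only ever sees the filtered estimate s_hat = s_{t|t}, never the true underlying state. Here policy is any stochastic policy that maps an estimated state and parameters theta to an action, and a_prev is the previous control input as a vector.

```python
def pomdp_act(theta, mu, Sigma, a_prev, y, policy, A, B, C, Q, R, rng):
    """Update the state estimate from the newest observation, then choose an
    action from pi_theta(s_hat, .), where s_hat = s_{t|t}."""
    mu, Sigma = kalman_step(mu, Sigma, a_prev, y, A, B, C, Q, R)  # s_{t|t}, Sigma_{t|t}
    a = policy(theta, mu, rng)                                    # a_t ~ pi_theta(s_hat_t, .)
    return a, mu, Sigma
```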
By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well, but it is often extremely slow. So it works, but one thing to watch out for is that you're taking gradient ascent steps that are very noisy: you're sampling a state-action sequence and then taking a gradient ascent step in an essentially random direction that is correct only on expectation. The gradient ascent direction for REINFORCE can sometimes be quite noisy, and so it's not that uncommon to need a million iterations of gradient ascent, or ten million, or 100 million iterations, for REINFORCE to work. That's just something to watch out for.

One consequence of that is that in the REINFORCE algorithm - or in what's essentially the REINFORCE algorithm - there's this step where you need to sample a state-action sequence. In principle, you could do this on your own robot: if there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to run your robot ten million times. I personally have seen many more applications of REINFORCE in simulation: you can easily run ten thousand or ten million simulations of your robot, but you might not want to have your robot physically repeat some action ten million times. So I've seen many more applications of REINFORCE used to learn with a simulator than to learn directly on a physical device.

The last thing I wanted to do is tell you about one other, final policy search algorithm. Can we switch to the laptop display, please? It's a policy search algorithm called Pegasus, and it's actually what we used on our autonomous helicopter flight work for many years; there are some other things we do now. So here's the idea. There's a reminder slide on the RL formalism - there's nothing here that you don't know, but I just want to describe the RL formalism pictorially, because I'll use that later.
I'm going to draw the reinforcement learning picture as follows. You initialize the system - say a helicopter, or whatever - in some state s_0, you choose an action a_0, and then the helicopter dynamics take you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you passed through, and that's your total payoff. So this is just a picture I want to use to summarize the RL problem.

Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by the green box. So our policy - and I'll switch back to deterministic policies for now - will be some function mapping from the states to the actions. As a concrete example, imagine that in the policy search setting you have a linear class of policies, so the action a will be, say, a linear function of the state, and your goal is to learn the parameters of that linear function. So imagine trying to do something like linear regression on policies, except you're trying to optimize the reinforcement learning objective. So imagine that the action a is theta transpose s, and you use policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies.

Here's the idea. Quite often we come up with a model, or a simulator, for the MDP, and as before, a model or simulator is just a box that takes as input some state s_t and some action a_t, and then outputs some next state s_{t+1} sampled from where the MDP might take you. This s_{t+1} will be a random state: it will be drawn from the random state transition probabilities of the MDP. This is important - very important: s_{t+1} is a random function of s_t and a_t, and that's true in the simulator as well. So for example, for autonomous helicopter flight, you might build a simulator using supervised learning, with an algorithm like linear regression or a nonlinear generalization of linear regression, so you can get a model of the dynamics - of what s_{t+1} is as a random function of s_t and a_t. Now, once you have a simulator, given any fixed policy, you can quite straightforwardly evaluate that policy in the simulator.
Concretely, our goal is to find a policy pi mapping from states to actions - the goal is to find the green box - that works well. So if you have any one fixed policy pi, you can evaluate that policy just using the simulator, via the picture shown at the bottom of the slide. Concretely, you take your initial state s_0, feed it into the policy pi, your policy pi outputs some action a_0, you plug that into the simulator, the simulator outputs a random state s_1, you feed s_1 into the policy, and so on, and you get a sequence of states s_0 through s_T that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation and see how well it does - measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once; in general, you would run the policy through the simulator some number of times and then average, to get an average over M simulations and hence a better estimate of the policy's expected payoff.

Now that I have a way, given any one fixed policy, to evaluate the expected payoff of that policy, one reasonably obvious thing you might try to do is simply search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff. Does that make sense? My policy has some parameters theta - say my actions a are equal to theta transpose s, if it's a linear policy - and for any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. So one thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate a policy, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a very slightly different answer.
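Here is a hedged sketch of that evaluation step - run the policy through the simulator M times and average - with simulator(s, a, rng) standing in for whatever learned model you have. Because the simulator is random, two calls to this function with the same theta generally return slightly different numbers, which is exactly the difficulty just described.

```python
def estimate_payoff(theta, policy, simulator, reward, s0, T, M, rng):
    """Monte Carlo estimate of the expected payoff E[R(s_0) + ... + R(s_T)]
    under pi_theta, averaged over M rollouts through the (random) simulator."""
    total = 0.0
    for _ in range(M):
        s, ret = s0, reward(s0)
        for t in range(T):
            a = policy(theta, s, rng)   # a_t from the policy
            s = simulator(s, a, rng)    # s_{t+1} drawn randomly by the simulator
            ret += reward(s)
        total += ret
    return total / M
```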
So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies: as I vary the parameters theta, I get different policies, and so I'm moving along the x axis. My total payoff I'm going to plot on the vertical axis, and the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff.

You could search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that just by chance looks a little bit better than it should; I evaluate a second policy and just by chance it looks a little bit worse; I evaluate a third policy, a fourth; sometimes I might even evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random. So when I apply the same policy twice in simulation, I might get back slightly different answers. As I evaluate more and more policies, these black dots are the picture I get. My goal is to try to optimize the red line, and I hope you appreciate that this is a hard problem, especially when all your optimization algorithm gets to see are these black dots and it doesn't have direct access to the red line. And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a 10-D space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem.

So it turns out there's one way to make this optimization problem much easier. Here's the idea - and the method is called Pegasus, which is an acronym for something; I'll tell you later. The simulator repeatedly makes calls to a random number generator to generate random numbers r_t, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input the state and action and outputs the next state randomly, and if you peer into the simulator - if you open the box and ask how it is generating these random next states s_{t+1} - pretty much the only way to write a simulator with random outputs is to make calls to a random number generator to get random numbers, these r_t's, and then have some function that takes as input s_0, a_0, and the result of your random number generator, and computes the next state as a function of those inputs and of the random number it got from the random number generator.
This is pretty much the only way anyone implements any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice.

So here's the idea. Let's say we fix in advance the sequence of random numbers: choose r_1, r_2, up to r_{T-1}, fix that sequence of random numbers in advance, and always use the same sequence of random numbers to evaluate different policies. Go into your code and fix r_1, r_2, through r_{T-1}; choose them randomly once and then fix them forever. If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. These sequences of random numbers are called scenarios, and Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios.

So when you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you've defined some deterministic estimate of the expected payoff. As you evaluate different policies, these are still only approximations to their true expected payoffs, but at least now you have a deterministic function to optimize, and you can apply your favorite algorithm - be it gradient ascent or some sort of local search - to try to optimize the black curve. This gives you a much easier optimization problem, and you can optimize it to find, hopefully, a good policy. So this is the Pegasus policy search method.
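A minimal sketch of the Pegasus idea, under the assumptions above: draw the scenarios (sequences of random numbers) once, then reuse exactly the same scenarios every time any policy is evaluated, so the objective below is a deterministic function of theta. Here the policy is deterministic, e.g. a = theta' s, and det_simulator(s, a, r) computes the next state as a deterministic function of the state, the action, and the supplied random number; all the names are my own.

```python
import numpy as np

def make_scenarios(M, T, seed=0):
    """Fix the randomness once and for all: M scenarios of T random numbers each."""
    return np.random.default_rng(seed).random((M, T))

def pegasus_payoff(theta, policy, det_simulator, reward, s0, scenarios):
    """Average payoff of the deterministic policy over the fixed scenarios.
    Evaluating the same theta twice now returns exactly the same number."""
    total = 0.0
    for rs in scenarios:                  # one rollout per scenario
        s, ret = s0, reward(s0)
        for r in rs:
            a = policy(theta, s)          # e.g. a = theta @ s
            s = det_simulator(s, a, r)    # next state from (s, a) and the fixed random number r
            ret += reward(s)
        total += ret
    return total / len(scenarios)

# Because pegasus_payoff is a deterministic function of theta, you can hand it to
# your favorite optimizer (gradient ascent, local search, ...) to search for theta.
```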
When I started to talk about reinforcement learning, I showed that video of a helicopter flying upside down. That was actually done using exactly this method, exactly this policy search algorithm. This seems to scale well even to fairly large problems, even to fairly high-dimensional state spaces. Typically, with Pegasus policy search, the optimization problem you end up with is much easier than the stochastic version, but it's still not entirely trivial, and so you'd apply this sort of method with maybe on the order of ten parameters, or tens of parameters - 30, 40 parameters - but not thousands of parameters, at least with these sorts of methods.

Student: So is that method different than just assuming that you know your simulator exactly - just throwing away all the random numbers entirely? So the question is whether this is just the same as assuming we have a deterministic simulator. The answer is no, and here's why. For the sake of simplicity I talked about one sequence of random numbers, but what you actually do is this: imagine that the random numbers are simulating different wind gusts against your helicopter. What you want isn't really to evaluate against just one pattern of wind gusts; what you want is to sample some set of different patterns of wind gusts and evaluate against all of them on average. So what you do is sample, say, 100 - some number I made up - 100 sequences of random numbers, and every time you want to evaluate a policy, you evaluate it against all 100 sequences of random numbers and then average. This is in exactly the same way that, in the earlier picture, you wouldn't necessarily evaluate the policy just once: you'd evaluate it maybe 100 times in simulation and then average to get a better estimate of the expected reward. In the same way, you do that here, but with 100 fixed sequences of random numbers. Does that make sense? Any other questions?

Student: If we use 100 scenarios and get an estimate for the policy, versus evaluating 100 times with [fresh] random numbers, won't you get a similarly good [estimate]? Yeah, I guess you're right: for a fixed policy, the quality of the approximation is equally good in both cases. The advantage of fixing the random numbers is that you end up with an optimization problem that's much easier. I have some search problem, and on the horizontal axis there's a space of control policies, and my goal is to find a control policy that maximizes the payoff. The problem with the earlier setting was that when I evaluate policies I get these noisy estimates, and then it's just very hard to optimize the red curve if I have these points that are all over the place.
By fixing the random numbers, the algorithm still doesn't get to see the red curve, but at least it's now optimizing a deterministic function. That makes the optimization problem much easier. Does that make sense?

Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves that are easy to optimize. And then you smush them together?

Let's see. I have just one nice black curve that I'm trying to optimize.

Student: For each scenario?

I see. So I'm going to average over M scenarios, so I'm going to average over 100 scenarios. The black curve here is defined by averaging over a large set of scenarios. Does that make sense? If the averaging doesn't make sense, imagine that there's just one sequence of random numbers; that might be easier to think about. Fix one sequence of random numbers, and every time I evaluate another policy, I evaluate it against that same sequence of random numbers, and that gives me a nice deterministic function to optimize. Any other questions?

The advantage is really that, one way to think about it, when I evaluate the same policy twice, at least I get back the same answer. This gives me a deterministic function mapping from the parameters of my policy to my estimate of the expected payoff. That's just a function that I can try to optimize using a search algorithm.

So we used this algorithm for inverted hovering, and again, policy search algorithms tend to work well when you can find a reasonably simple policy mapping from the states to the actions. That is especially true for low-level control tasks, which I think of almost as reflexes. In contrast, if you want to solve a problem like Tetris, where you might plan ahead a few steps about what's a nice configuration of the board, or something like a game of chess, or problems of long path planning, go here, then go there, then go there, then sometimes you might apply a value function method instead. But for tasks like helicopter flight, for low-level control tasks, for the reflexes of driving or controlling various robots, policy search algorithms, where we can sometimes more easily approximate the policy directly, work very well.
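To round out the toy sketch before moving on (again under the same made-up dynamics; estimated_payoff is the scenario-averaged estimate defined above, not anything from the actual helicopter work), this is roughly what optimizing that deterministic estimate could look like, using plain finite-difference gradient ascent on a single policy parameter. Any local search method could be substituted.

    def optimize(theta0=0.0, step=0.05, eps=1e-3, iters=200):
        # Gradient ascent on the deterministic, scenario-averaged payoff estimate,
        # using a central finite difference in place of an analytic gradient.
        theta = theta0
        for _ in range(iters):
            grad = (estimated_payoff(theta + eps) - estimated_payoff(theta - eps)) / (2 * eps)
            theta += step * grad
        return theta

    best_theta = optimize()
    print(best_theta, estimated_payoff(best_theta))

Because the objective never changes between calls, ordinary deterministic search machinery applies; with the noisy estimates of the earlier picture, the finite differences above would mostly be measuring noise.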
Let me now say a little bit about the state of RL today. RL algorithms are applied to a wide range of problems, and the key is really sequential decision making. The place where you think about applying a reinforcement learning algorithm is when you need to make a decision, then another decision, then another decision, and some of your actions may have long-term consequences. I think that is the heart of RL: sequential decision making, where you make multiple decisions and some of your actions may have long-term consequences.

I've shown you a bunch of robotics examples. RL is also applied to things like medical decision making, where you may have a patient and you want to choose a sequence of treatments: you do this now for the patient, the patient may then be in some other state, you choose to do that later, and so on. It turns out there's a large community of people applying these sorts of tools to queues. Imagine you have a bank where people are lining up, and after they go to one cashier, some of them have to go to the manager to deal with something else. You have a system with multiple people standing in line in multiple queues, so how do you route people optimally to minimize the waiting time? And not just people, but objects on an assembly line and so on. It turns out there's a surprisingly large community working on optimizing queues.

I mentioned game playing a little bit already. There are things like financial decision making: if you have a large amount of stock, how do you time the selling of your stock so as not to affect market prices too adversely? There are many operations research problems, things like factory automation: can you design a factory to optimize throughput, or minimize cost, or whatever? These are all the sorts of problems that people are applying reinforcement learning algorithms to.

Let me just close with a few robotics examples, because they're always fun, and we just have these videos. This video is the work of Zico Kolter and Pieter Abbeel, a PhD student here. They were working on getting a robot dog to climb over difficult, rugged terrain. Using a reinforcement learning algorithm, they applied an approach that's similar to a value function approximation approach, not quite, but similar. They allowed the robot dog to plan ahead multiple steps, and to carefully choose its footsteps and traverse rugged terrain. This is really the state of the art in terms of what you can get a robotic dog to do.

Here's another fun one. It turns out that wheeled robots are very fuel-efficient; cars and trucks are almost the most fuel-efficient robots in the world.
Cars and trucks are very fuel-efficient, but legged robots have the ability to traverse more rugged terrain. So this is a robot, actually a small-scale mockup of a larger vehicle built by Lockheed Martin, and the question is: can you come up with a vehicle that has wheels, and so has the fuel efficiency of wheeled robots, but also has legs, so it can traverse obstacles? Again, a reinforcement learning algorithm was used to design a controller for this robot to make it traverse obstacles, with somewhat complex gaits that would be very hard to design by hand. By choosing a reward function, telling the robot there's a plus-one reward at the top of the goal, and a few other things, it learns these sorts of policies automatically.

A last couple of fun ones: I'll show you a couple of final helicopter videos. This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, who have been working on autonomous flight. I'll just let you watch. I'll just show you one more.

Student: [Inaudible] do this with a real helicopter [inaudible]?

Not a full-size helicopter, only small radio-controlled helicopters. [Inaudible]. Full-size helicopters are the wrong design for this; you shouldn't do this on a full-size helicopter. Many full-size helicopters would fall apart if you tried to do this. Let's see, there's one more.

Student: Are there any human [inaudible]?

Yes, there are very good human pilots that can.

This is just one more maneuver. That was kind of fun. So that was the work of Pieter Abbeel and Adam Coates.

So that was it. That was all the technical material I wanted to present in this class. I guess you're all experts on machine learning now. Congratulations. I hope you've gotten the sense through this class that this is one of the technologies that's really having a huge impact on science, engineering, and industry. As I said in the first lecture, I think many people use machine learning algorithms dozens of times a day without even knowing it.

Based on the projects you've done, I hope that most of you will be able to imagine yourselves going out after this class and applying these things to solve a variety of problems. Hopefully, some of you will also imagine yourselves writing research papers after this class, be it on a novel way to do machine learning, or on some way of applying machine learning to a problem that you care about. In fact, looking at the project milestones, I'm actually sure that a large fraction of the projects in this class will be publishable, so I think that's great.
I guess many of you will also go into industry, make new products, and make lots of money using learning algorithms. Remember me if that happens. One of the things I'm excited about is that, through research or industry, I'm actually completely sure that the people in this class will, in the next few months, apply machine learning algorithms to lots of problems: in industrial management and computer science, things like optimizing computer architectures, network security, robotics, computer vision, problems in computational biology, problems in aerospace, understanding natural language, and many more things like that. Right now I have no idea what all of you are going to do with the learning algorithms you've learned about, but every time I wrap up this class, I always feel very excited, and optimistic, and hopeful about the sorts of amazing things you'll be able to do.

One final thing: I'll just give out this handout. One final thing is that machine learning has grown out of a larger literature on AI, with its desire to build systems that exhibit intelligent behavior. Machine learning is one of the tools of AI, maybe the one that's had a disproportionately large impact, but there are many other ideas in AI that I hope you go and continue to learn about. Fortunately, Stanford has one of the best and broadest sets of AI classes, and I hope that you take advantage of some of them, go and learn more about AI, and learn more about other fields which often apply learning algorithms to problems in vision, natural language processing, robotics, and so on.

So the handout I just gave out has a list of AI-related courses. Just running down it very quickly: CS221 is an overview class that I teach. There are a lot of robotics classes as well: 223A, 225A, 225B, many robotics classes; there are so many applications of learning algorithms to robotics today. 222 and 227 are knowledge representation and reasoning classes. Of all the classes on this list, 228, which Daphne Koller teaches, is probably the closest in spirit to 229; it's one of the classes I highly recommend to all of my PhD students as well. Many other classes also touch on machine learning. On the next page there are courses on computer vision, speech recognition, and natural language processing: various tools that aren't just machine learning, but that often involve machine learning in many ways.
There are other aspects of AI too, like multi-agent systems, taught by [inaudible]. EE364A is convex optimization, a class taught by Steve Boyd; convex optimization came up many times in this class, and if you want to become really good at it, EE364A is a great class. If you're interested in project courses, I also teach a project class next quarter where we spend the whole quarter working on research projects.

So I hope you go and take some more of those classes. In closing, let me just say that this class has been really fun to teach, and it's very satisfying to me personally when we set these insanely difficult homeworks, and then we see a solution and I think, "Oh my god, they actually figured that one out." It's actually very satisfying when I see that. Or, looking at the milestones, I often go, "Wow, that's really cool; I bet this would be publishable."

So I hope you take what you've learned, go forth, and do amazing things with learning algorithms. I know this is a heavy-workload class, so thank you all very much for the hard work you've put into this class, and the hard work you've put into learning this material, and thank you very much for having been students in this class.