This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, essentially called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t, and what if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that are maybe some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t = L_t s_t. So L_t was this matrix of parameters that came out of the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we'd choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate of what the state is given all the observations. And in particular, I talked about a Kalman filter, with which we worked out that our posterior distribution of what the state is, given all the observations up to a certain time, was this. So this is from last time: given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this s_{t|t}, which is going to be our best guess for what the state currently is.

And then we choose actions a_t = L_t s_{t|t}, using our estimate of the state rather than the true state, because we don't know the true state anymore in this POMDP. So it turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given there are these noisy observations.
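As a minimal sketch of the strategy just described - run a Kalman filter to get s_{t|t}, then act with the precomputed LQR gain L_t as if that estimate were the true state - here is roughly what one step could look like in Python. This is not from the lecture; the matrices A, B, C, Q, R and the gain L are placeholders for whatever linear model and LQR solution you have.

```python
import numpy as np

def kalman_step(mu, Sigma, a, y, A, B, C, Q, R):
    """One predict/update step of a Kalman filter.

    mu, Sigma : posterior mean/covariance of s_{t-1} given y_1..y_{t-1}
    a         : action a_{t-1} applied at time t-1
    y         : new observation y_t
    A, B      : dynamics      s_t = A s_{t-1} + B a_{t-1} + w,  w ~ N(0, Q)
    C         : observations  y_t = C s_t + v,                  v ~ N(0, R)
    Returns the new posterior mean and covariance, s_{t|t} and Sigma_{t|t}.
    """
    # Predict
    mu_pred = A @ mu + B @ a
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update with the new observation
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

def choose_action(L, mu):
    # Act as if the state estimate were the true state: a_t = L_t s_{t|t}
    return L @ mu
```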
It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem. Just to be concrete about the formalism of the POMDP - I should just write it here - a POMDP formally is a tuple (S, A, Y, {P_sa}, {O_s}, T, R), where the changes from an ordinary MDP are that Y is the set of possible observations and the O subscript s are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O subscript s_t, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the state from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP. In the more general case, that strategy - designing a controller assuming full observability, then just estimating the state and plugging the two together - is often a very reasonable strategy for general POMDPs, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.
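To make that tuple concrete, here is a minimal sketch of how a small discrete POMDP might be represented in code; the field names and dictionary layout are my own illustrative choices, not notation from the lecture.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list        # S
    actions: list       # A
    observations: list  # Y, the set of possible observations
    P: dict             # P[(s, a)] = probability distribution over next states
    O: dict             # O[s]       = distribution over observations (O subscript s)
    T: int              # horizon
    R: dict             # R[s]       = reward

def observe(pomdp, s, rng):
    """At each step the agent sees only y_t drawn from O_{s_t}, never s_t itself."""
    return rng.choice(pomdp.observations, p=pomdp.O[s])
```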
So what I want to do today is actually talk about a different class of reinforcement learning algorithms, called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs - to fully observed Markov decision processes - or to these partially observable MDPs. What I'll do is describe policy search algorithms applied to MDPs, applied to the fully observable case, and at the end I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs. In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard. But nonetheless, policy search algorithms turn out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs. So here's what we're going to do.

In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: in the same way, back when we were talking about supervised learning, we defined a set script H of hypotheses and would search for a good hypothesis in that class script H. Defining the set capital Pi of policies and searching for a good policy in capital Pi is exactly analogous.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've been talking about so far, we would try to find V-star, the optimal value function, and then we'd use the optimal value function to try to compute, or approximate, pi-star. So all the approaches we talked about previously are indirect strategies for finding a good policy: once we compute the value function, we go from that to the policy. In contrast, in policy search algorithms - in particular what are called direct policy search algorithms - the idea is that we're going to quote "directly" try to approximate a good policy, without going through the intermediate stage of finding the value function.

And also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there'll be a step where I say, "Well, let's try to compute the actions. Let's try to approximate what a good action is using a logistic function of the state." So again, I'll sort of pull a function out of the air and say, "Let's just choose a function, and that'll be our choice of the policy class: we'll take as input the state, map it through a logistic function, and then hopefully we'll approximate what a good action is using a logistic function of the state." So the choice of policy class is again a little bit arbitrary, but it's no more arbitrary than it was when we were talking about supervised learning.
So to develop our first policy search algorithm, I'm actually going to need a new definition. For our first policy search algorithm, we'll need to work with stochastic policies. What I mean by a stochastic policy is a function pi mapping from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so for every state, these have to sum to one over the actions: sum over a of pi(s, a) = 1. In other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose you are executing some stochastic policy pi - say I have some stochastic policy pi and I want to execute it. What that means is that, in this example where let's say I have three actions, suppose I'm in some state s. I would then compute pi(s, a1), pi(s, a2), and pi(s, a3), if I have a three-action MDP. These will be three numbers that sum up to one, and then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.

So as a concrete example, let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever. And now let's come up with a stochastic policy for this problem.

To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1, so pi(s, a1), I'm going to write as 1/(1 + e^(-theta transpose s)). And I just chose the logistic function because it's a convenient function we've used a lot.
So I'm going to say that my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. If I'm executing that policy with parameters theta, that means the chance of my choosing the accelerate-right action is given by this number. Because my chances of executing actions a1 and a2 must sum to one, this also gives me pi(s, a2) = 1 - 1/(1 + e^(-theta transpose s)). So just to say this again: when I'm in some state s, I'm going to compute this number, 1/(1 + e^(-theta transpose s)), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is s = [1, x, x-dot, phi, phi-dot] - the cart-pole state, where I added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose my parameters theta so that the inner product theta transpose s just gives you phi, then that means that at any state, the probability of my taking the accelerate-right action is 1/(1 + e^(-theta transpose s)) = 1/(1 + e^(-phi)). And so with that choice of parameters theta, the chance of my accelerating to the right is just this function of the angle phi of my inverted pendulum. And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole's leaning over to the right, then I want to accelerate to the right to catch it, and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it. So this is one example, for one specific choice of parameters theta.
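Here is a minimal sketch of this two-action logistic policy in Python - my own code, not something written in the lecture. The state layout [1, x, x_dot, phi, phi_dot] and the particular choice theta = [0, 0, 0, 1, 0], which makes theta-transpose-s equal to phi, are illustrative assumptions.

```python
import numpy as np

def pi_right(theta, s):
    """pi_theta(s, a1): probability of the accelerate-right action."""
    return 1.0 / (1.0 + np.exp(-theta @ s))   # P(accelerate left | s) is just 1 minus this

def sample_action(theta, s, rng):
    """Execute the stochastic policy: flip a coin with bias pi_theta(s, a1)."""
    return "right" if rng.random() < pi_right(theta, s) else "left"

# State with an intercept feature: s = [1, x, x_dot, phi, phi_dot]
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])   # theta' s = phi, the pole angle
s = np.array([1.0, 0.2, 0.0, 0.5, -0.1])      # pole leaning to the right (phi > 0)
rng = np.random.default_rng(0)
print(pi_right(theta, s))                     # about 0.62, so accelerating right is more likely
```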
Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then depending on the velocity of the cart and the angular velocity of the pole, you might want to change theta to take those other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back, as indicated by the angular velocity, then you might be less worried about having to accelerate hard to the right. And so these are the sorts of behaviors you can get by varying the parameters theta.

And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi subscript theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything? Yeah. Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? Oh, yeah, right. So what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use a softmax parameterization. Similar to the softmax regression that we saw earlier in the class, you may have a set of parameters theta_1 through theta_D if you have D actions, and pi_theta(s, a_i) = e^(theta_i transpose s) / sum over j of e^(theta_j transpose s). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.
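Here is a hedged sketch of that softmax parameterization for D discrete actions - again my own code, not the lecture's; the parameters theta_1, ..., theta_D are stacked as the rows of a matrix Theta.

```python
import numpy as np

def softmax_policy(Theta, s):
    """pi_theta(s, a_i) = exp(theta_i' s) / sum_j exp(theta_j' s).

    Theta has one row per action; returns a length-D probability vector over actions.
    """
    logits = Theta @ s
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_discrete_action(Theta, s, rng):
    """Draw action index i with probability pi_theta(s, a_i)."""
    return rng.choice(len(Theta), p=softmax_policy(Theta, s))
```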
Anything else? Yeah. Student: [Inaudible] stochastic? So the question is: is it possible to [inaudible] a stochastic policy using [inaudible]? I see - given that the MDP has stochastic transition probabilities, is it possible to use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. Formally, it is possible: if you already have a fixed policy, then you'd be able to do that. Anything else? Yeah. Student: [Inaudible]. No, I guess even a [inaudible] class of policies can do that, but for the derivation later, I actually need to keep it separate.

Actually, could you just - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay, thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it captures its essence. Here's the idea.

In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. (It turns out that if s_0 is drawn from some fixed initial-state distribution, then everything that follows still works, but let's just say s_0 is some fixed initial state.) So my goal is to maximize the expected sum of rewards, E[R(s_0) + R(s_1) + ... + R(s_T)], when we execute the policy pi_theta; I'll drop the explicit conditioning on the policy and so on from the notation. The random variables in this expectation are the sequence of states and actions: s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.

So let me write out this expectation explicitly as a sum over all possible state-action sequences of the probability of that sequence times the sum of rewards - that's just what an expectation is: the probability of the random variables times the payoff. Let me just expand out this probability. The probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state, P(s_0) - if this is a deterministic initial state, then all the probability mass would be on one state; otherwise there's some distribution over initial states - times the probability pi_theta(s_0, a_0) that you chose action a_0 in that state s_0, times the probability P_{s_0 a_0}(s_1) that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability pi_theta(s_1, a_1) that you then chose that action, and so on; the last terms are P_{s_{T-1} a_{T-1}}(s_T) times pi_theta(s_T, a_T). So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this:

P(s_0, a_0, ..., s_T, a_T) = P(s_0) pi_theta(s_0, a_0) P_{s_0 a_0}(s_1) pi_theta(s_1, a_1) ... P_{s_{T-1} a_{T-1}}(s_T) pi_theta(s_T, a_T).

It turns out later on I'm going to need to write this sum of rewards a lot, so I'm just going to call the sum R(s_0) + ... + R(s_T) the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. So our goal is to maximize the expected payoff - our goal is to maximize this sum.
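Since everything that follows keeps referring to sampling a state-action sequence and computing its payoff, here is a hedged sketch of those two pieces. The functions sample_initial_state, transition, and reward are placeholders for whatever MDP or simulator you have, and policy is any stochastic policy such as the logistic one above.

```python
def sample_trajectory(policy, theta, sample_initial_state, transition, T, rng):
    """Roll out the stochastic policy pi_theta for T steps.

    Returns [s_0, ..., s_T] and [a_0, ..., a_T].
    """
    states, actions = [], []
    s = sample_initial_state(rng)      # s_0 (fixed, or drawn from an initial-state distribution)
    for t in range(T + 1):
        a = policy(theta, s, rng)      # a_t ~ pi_theta(s_t, .)
        states.append(s)
        actions.append(a)
        if t < T:
            s = transition(s, a, rng)  # s_{t+1} ~ P_{s_t a_t}
    return states, actions

def payoff(states, reward):
    """The 'payoff' of the lecture: R(s_0) + R(s_1) + ... + R(s_T)."""
    return sum(reward(s) for s in states)
```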
Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm - this is how we're going to update the parameters. Repeat: (1) sample a state-action sequence s_0, a_0, ..., s_T, a_T by executing the current policy pi_theta; (2) compute the payoff R(s_0) + ... + R(s_T); (3) update theta := theta + alpha [ grad_theta pi_theta(s_0, a_0)/pi_theta(s_0, a_0) + ... + grad_theta pi_theta(s_T, a_T)/pi_theta(s_T, a_T) ] times the payoff.

The way you sample a state-action sequence is you just take your current stochastic policy and execute it in the MDP: go ahead and start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps - that's how you sample the state sequence. Then you compute the payoff, and then you perform this update.

So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and this thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I take a step that's chosen randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how this changes the parameters theta - in particular, what the expected value of the change to the parameters is. Our goal is to maximize the expected payoff, so as long as the updates are, on expectation, on average taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember, when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters - maybe you're trying to minimize a quadratic function - and the parameters will wander around randomly until they get close to the optimum of the quadratic surface. It turns out that the REINFORCE algorithm will be very much like that. It will be a stochastic gradient ascent algorithm in which, on every step, the step we take is a little bit random - it's determined by the random state-action sequence - but on expectation, this turns out to be essentially a gradient ascent algorithm. And so it'll do something like this: it'll wander around randomly, but on average take you toward the optimum.
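Here is what that update might look like for the two-action logistic policy above - a sketch of the "essentially REINFORCE" update described here, not Williams' original formulation. It uses the fact that grad_theta pi_theta(s,a) / pi_theta(s,a) = grad_theta log pi_theta(s,a), which for this policy is (1 - p) s for the accelerate-right action and -p s for the accelerate-left action, where p = 1/(1 + e^(-theta' s)). The learning rate alpha is an assumption.

```python
import numpy as np

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(s, a) for the two-action logistic policy."""
    p_right = 1.0 / (1.0 + np.exp(-theta @ s))
    return (1.0 - p_right) * s if a == "right" else -p_right * s

def reinforce_update(theta, states, actions, total_payoff, alpha):
    """One stochastic update:
        theta := theta + alpha * [sum_t grad pi(s_t,a_t)/pi(s_t,a_t)] * payoff.
    Each individual step is noisy; only on expectation does it point along
    the gradient of the expected payoff (this is what the derivation below proves)."""
    g = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    return theta + alpha * g * total_payoff
```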
So let me go ahead and prove that now. Let's see. What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then I'll hopefully show that this gradient ascent update rule is what the algorithm above does on expectation.

So before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is given by taking the derivatives of these things one at a time,

d/d theta [ f(theta) g(theta) h(theta) ] = f'(theta) g(theta) h(theta) + f(theta) g'(theta) h(theta) + f(theta) g(theta) h'(theta).

So first I differentiate with respect to f, getting f prime and leaving the other two fixed; then I differentiate with respect to g, leaving the other two fixed; then I differentiate with respect to h, getting h prime and leaving the other two fixed. That's the product rule for derivatives.

Now refer back to the equation where earlier we wrote out the expected payoff as a sum over all state-action sequences of the probability times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function - I'll take the derivative with respect to theta and then try to go uphill on it. So using the product rule, when I take the derivative of this function with respect to theta, what I get is a sum of terms: there are several factors here that depend on theta, namely the pi_theta factors, and so I'll end up with one term from the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the next pi_theta factor keeping everything else fixed, and so on, down to one term from the derivative of the last pi_theta factor keeping everything else fixed. So just apply the product rule to this. Let's write that down.

So I have that the derivative with respect to theta of the expected value of the payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how that goes, but this is a four-step derivation. So there's the sum over the state-action sequences as we saw before:

grad_theta E[payoff] = sum over (s_0, a_0, ..., s_T, a_T) of [ P(s_0) (grad_theta pi_theta(s_0, a_0)) P_{s_0 a_0}(s_1) pi_theta(s_1, a_1) ... + P(s_0) pi_theta(s_0, a_0) P_{s_0 a_0}(s_1) (grad_theta pi_theta(s_1, a_1)) ... + ... ] times the payoff.

So that huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta one at a time.
This was the term with the derivative of the first factor, pi_theta(s_0, a_0) - so there's the first derivative term, there's the second one, and then plus dot, dot, dot, down to the last term. So that was step one of four.

And so by algebra - let me just write this down and convince us all that it's true - this equals

sum over (s_0, a_0, ..., s_T, a_T) of P(s_0) pi_theta(s_0, a_0) P_{s_0 a_0}(s_1) ... pi_theta(s_T, a_T) times [ grad_theta pi_theta(s_0, a_0)/pi_theta(s_0, a_0) + ... + grad_theta pi_theta(s_T, a_T)/pi_theta(s_T, a_T) ] times the payoff.

This is the second of four steps, in which you just convince yourself that if you take the sum inside the brackets and multiply it by that big product in front, you get back the sum of terms we had. For example, when I multiply the product in front by the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and is replaced by the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But the term in front is just what I worked out previously - it's the joint probability of the state-action sequence - and now I have that, times the bracketed sum, times the payoff. And so, by the definition of expectation, this is just equal to

E[ ( grad_theta pi_theta(s_0, a_0)/pi_theta(s_0, a_0) + ... + grad_theta pi_theta(s_T, a_T)/pi_theta(s_T, a_T) ) times the payoff ].

So this thing inside the expectation is exactly the step that we were taking in the inner loop of our - roughly - REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff. That's how I started this whole derivation: I said, let's look at our expected payoff and take the derivative of that with respect to theta. What we've proved is that, on expectation, the step direction REINFORCE takes is exactly the gradient of the thing we're trying to optimize. This shows that this algorithm is a stochastic gradient ascent algorithm.
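For reference, here are the four steps written out compactly - my own transcription of the board work into LaTeX, writing the payoff as R(s_0) + ... + R(s_T):

```latex
\begin{align*}
\nabla_\theta \mathbb{E}[\text{payoff}]
&= \nabla_\theta \!\!\sum_{(s_0,a_0,\ldots,s_T,a_T)}\!\!
   P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots\pi_\theta(s_T,a_T)\,
   \bigl[R(s_0)+\cdots+R(s_T)\bigr] \\
&= \sum_{(s_0,a_0,\ldots)}
   \Bigl[\,P(s_0)\bigl(\nabla_\theta\pi_\theta(s_0,a_0)\bigr)P_{s_0 a_0}(s_1)\cdots\pi_\theta(s_T,a_T)
   + \cdots
   + P(s_0)\,\pi_\theta(s_0,a_0)\cdots\bigl(\nabla_\theta\pi_\theta(s_T,a_T)\bigr)\Bigr]
   \bigl[R(s_0)+\cdots+R(s_T)\bigr] \\
&= \sum_{(s_0,a_0,\ldots)}
   P(s_0,a_0,\ldots,s_T,a_T)
   \left[\frac{\nabla_\theta\pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
   + \cdots
   + \frac{\nabla_\theta\pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right]
   \bigl[R(s_0)+\cdots+R(s_T)\bigr] \\
&= \mathbb{E}\!\left[\left(\sum_{t=0}^{T}
   \frac{\nabla_\theta\pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)
   \bigl(R(s_0)+\cdots+R(s_T)\bigr)\right].
\end{align*}
```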
I wrote a lot, so why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then check if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some comments: we've talked about value function approximation approaches, where you approximate V-star and then go from V-star to pi-star, and now also policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable.

It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is, for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well. For the inverted pendulum problem, this is quite likely to be true: by going through different choices of parameters, you can express things like "if the pole's leaning towards the right, then accelerate towards the right to try to catch it," and for the inverted pendulum that's probably enough. For lots of what are called low-level control tasks - things like driving a car, the low-level reflexes of do you steer left to avoid another car, do you steer left to follow the road, or flying a helicopter - these are very short-time-scale types of decisions. I like to think of these as the decisions of a trained operator, like a trained driver or a trained pilot: they're almost reflexes, very quick, instinctive things where you map very directly from the input data to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.

In contrast, if you have problems that require long, multistep reasoning - things like a game of chess, where you have to reason carefully about "if I do this, then they'll do that, then I'll do this, then they'll do that" - those I think of as less instinctual, very high-level decision making. For problems like that, I would sometimes use a value function approximation approach instead. I'll say more about this later.

The last thing I want to mention here, just as a side comment, is that if you have a POMDP - a partially observable MDP, and I don't want to say too much about this - it turns out that if you only have an approximation, call it s-hat, of the true state, and this could be s-hat = s_{t|t} from a Kalman filter, then you can still use these sorts of policy search algorithms, with a policy of the form pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that only looks at your estimate of the state to choose the action.
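A minimal sketch of that idea, reusing the kalman_step sketch from earlier (my own code, not the lecture's): the policy only ever sees the filtered estimate s_hat = s_{t|t}, never the true underlying state. Here policy is any stochastic policy that maps an estimated state and parameters theta to an action, and a_prev is the previous control input as a vector.

```python
def pomdp_act(theta, mu, Sigma, a_prev, y, policy, A, B, C, Q, R, rng):
    """Update the state estimate from the newest observation, then choose an
    action from pi_theta(s_hat, .), where s_hat = s_{t|t}."""
    mu, Sigma = kalman_step(mu, Sigma, a_prev, y, A, B, C, Q, R)  # s_{t|t}, Sigma_{t|t}
    a = policy(theta, mu, rng)                                    # a_t ~ pi_theta(s_hat_t, .)
    return a, mu, Sigma
```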
By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well, but it is often extremely slow. So it works, but one thing to watch out for is that you're taking gradient ascent steps that are very noisy: you're sampling a state-action sequence and then taking a gradient ascent step in an essentially random direction that is correct only on expectation. The gradient ascent direction for REINFORCE can sometimes be quite noisy, and so it's not that uncommon to need a million iterations of gradient ascent, or ten million, or 100 million iterations, for REINFORCE to work. That's just something to watch out for.

One consequence of that is that in the REINFORCE algorithm - or in what's essentially the REINFORCE algorithm - there's this step where you need to sample a state-action sequence. In principle, you could do this on your own robot: if there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to run your robot ten million times. I personally have seen many more applications of REINFORCE in simulation: you can easily run ten thousand or ten million simulations of your robot, but you might not want to have your robot physically repeat some action ten million times. So I've seen many more applications of REINFORCE used to learn with a simulator than to learn directly on a physical device.

The last thing I wanted to do is tell you about one other, final policy search algorithm. Can we switch to the laptop display, please? It's a policy search algorithm called Pegasus, and it's actually what we used on our autonomous helicopter flight work for many years; there are some other things we do now. So here's the idea. There's a reminder slide on the RL formalism - there's nothing here that you don't know, but I just want to describe the RL formalism pictorially, because I'll use that later.
I'm going to draw the reinforcement learning picture as follows. You initialize the system - say a helicopter, or whatever - in some state s_0, you choose an action a_0, and then the helicopter dynamics take you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you passed through, and that's your total payoff. So this is just a picture I want to use to summarize the RL problem.

Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by the green box. So our policy - and I'll switch back to deterministic policies for now - will be some function mapping from the states to the actions. As a concrete example, imagine that in the policy search setting you have a linear class of policies, so the action a will be, say, a linear function of the state, and your goal is to learn the parameters of that linear function. So imagine trying to do something like linear regression on policies, except you're trying to optimize the reinforcement learning objective. So imagine that the action a is theta transpose s, and you use policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies.

Here's the idea. Quite often we come up with a model, or a simulator, for the MDP, and as before, a model or simulator is just a box that takes as input some state s_t and some action a_t, and then outputs some next state s_{t+1} sampled from where the MDP might take you. This s_{t+1} will be a random state: it will be drawn from the random state transition probabilities of the MDP. This is important - very important: s_{t+1} is a random function of s_t and a_t, and that's true in the simulator as well. So for example, for autonomous helicopter flight, you might build a simulator using supervised learning, with an algorithm like linear regression or a nonlinear generalization of linear regression, so you can get a model of the dynamics - of what s_{t+1} is as a random function of s_t and a_t. Now, once you have a simulator, given any fixed policy, you can quite straightforwardly evaluate that policy in the simulator.
Concretely, our goal is to find a policy pi mapping from states to actions - the goal is to find the green box - that works well. So if you have any one fixed policy pi, you can evaluate that policy just using the simulator, via the picture shown at the bottom of the slide. Concretely, you take your initial state s_0, feed it into the policy pi, your policy pi outputs some action a_0, you plug that into the simulator, the simulator outputs a random state s_1, you feed s_1 into the policy, and so on, and you get a sequence of states s_0 through s_T that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation and see how well it does - measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once; in general, you would run the policy through the simulator some number of times and then average, to get an average over M simulations and hence a better estimate of the policy's expected payoff.

Now that I have a way, given any one fixed policy, to evaluate the expected payoff of that policy, one reasonably obvious thing you might try to do is simply search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff. Does that make sense? My policy has some parameters theta - say my actions a are equal to theta transpose s, if it's a linear policy - and for any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. So one thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate a policy, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a very slightly different answer.
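Here is a hedged sketch of that evaluation step - run the policy through the simulator M times and average - with simulator(s, a, rng) standing in for whatever learned model you have. Because the simulator is random, two calls to this function with the same theta generally return slightly different numbers, which is exactly the difficulty just described.

```python
def estimate_payoff(theta, policy, simulator, reward, s0, T, M, rng):
    """Monte Carlo estimate of the expected payoff E[R(s_0) + ... + R(s_T)]
    under pi_theta, averaged over M rollouts through the (random) simulator."""
    total = 0.0
    for _ in range(M):
        s, ret = s0, reward(s0)
        for t in range(T):
            a = policy(theta, s, rng)   # a_t from the policy
            s = simulator(s, a, rng)    # s_{t+1} drawn randomly by the simulator
            ret += reward(s)
        total += ret
    return total / M
```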
So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies: as I vary the parameters theta, I get different policies, and so I'm moving along the x axis. My total payoff I'm going to plot on the vertical axis, and the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff.

You could search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that just by chance looks a little bit better than it should; I evaluate a second policy and just by chance it looks a little bit worse; I evaluate a third policy, a fourth; sometimes I might even evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random. So when I apply the same policy twice in simulation, I might get back slightly different answers. As I evaluate more and more policies, these black dots are the picture I get. My goal is to try to optimize the red line, and I hope you appreciate that this is a hard problem, especially when all your optimization algorithm gets to see are these black dots and it doesn't have direct access to the red line. And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a 10-D space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem.

So it turns out there's one way to make this optimization problem much easier. Here's the idea - and the method is called Pegasus, which is an acronym for something; I'll tell you later. The simulator repeatedly makes calls to a random number generator to generate random numbers r_t, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input the state and action and outputs the next state randomly, and if you peer into the simulator - if you open the box and ask how it is generating these random next states s_{t+1} - pretty much the only way to write a simulator with random outputs is to make calls to a random number generator to get random numbers, these r_t's, and then have some function that takes as input s_0, a_0, and the result of your random number generator, and computes the next state as a function of those inputs and of the random number it got from the random number generator.
This is pretty much the only way anyone implements any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice.

So here's the idea. Let's say we fix in advance the sequence of random numbers: choose r_1, r_2, up to r_{T-1}, fix that sequence of random numbers in advance, and always use the same sequence of random numbers to evaluate different policies. Go into your code and fix r_1, r_2, through r_{T-1}; choose them randomly once and then fix them forever. If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. These sequences of random numbers are called scenarios, and Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios.

So when you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you've defined some deterministic estimate of the expected payoff. As you evaluate different policies, these are still only approximations to their true expected payoffs, but at least now you have a deterministic function to optimize, and you can apply your favorite algorithm - be it gradient ascent or some sort of local search - to try to optimize the black curve. This gives you a much easier optimization problem, and you can optimize it to find, hopefully, a good policy. So this is the Pegasus policy search method.
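A minimal sketch of the Pegasus idea, under the assumptions above: draw the scenarios (sequences of random numbers) once, then reuse exactly the same scenarios every time any policy is evaluated, so the objective below is a deterministic function of theta. Here the policy is deterministic, e.g. a = theta' s, and det_simulator(s, a, r) computes the next state as a deterministic function of the state, the action, and the supplied random number; all the names are my own.

```python
import numpy as np

def make_scenarios(M, T, seed=0):
    """Fix the randomness once and for all: M scenarios of T random numbers each."""
    return np.random.default_rng(seed).random((M, T))

def pegasus_payoff(theta, policy, det_simulator, reward, s0, scenarios):
    """Average payoff of the deterministic policy over the fixed scenarios.
    Evaluating the same theta twice now returns exactly the same number."""
    total = 0.0
    for rs in scenarios:                  # one rollout per scenario
        s, ret = s0, reward(s0)
        for r in rs:
            a = policy(theta, s)          # e.g. a = theta @ s
            s = det_simulator(s, a, r)    # next state from (s, a) and the fixed random number r
            ret += reward(s)
        total += ret
    return total / len(scenarios)

# Because pegasus_payoff is a deterministic function of theta, you can hand it to
# your favorite optimizer (gradient ascent, local search, ...) to search for theta.
```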
When I started to talk about reinforcement learning, I showed that video of a helicopter flying upside down. That was actually done using exactly this method, exactly this policy search algorithm. This seems to scale well even to fairly large problems, even to fairly high-dimensional state spaces. Typically, with Pegasus policy search, the optimization problem you end up with is much easier than the stochastic version, but it's still not entirely trivial, and so you'd apply this sort of method with maybe on the order of ten parameters, or tens of parameters - 30, 40 parameters - but not thousands of parameters, at least with these sorts of methods.

Student: So is that method different than just assuming that you know your simulator exactly - just throwing away all the random numbers entirely? So the question is whether this is just the same as assuming we have a deterministic simulator. The answer is no, and here's why. For the sake of simplicity I talked about one sequence of random numbers, but what you actually do is this: imagine that the random numbers are simulating different wind gusts against your helicopter. What you want isn't really to evaluate against just one pattern of wind gusts; what you want is to sample some set of different patterns of wind gusts and evaluate against all of them on average. So what you do is sample, say, 100 - some number I made up - 100 sequences of random numbers, and every time you want to evaluate a policy, you evaluate it against all 100 sequences of random numbers and then average. This is in exactly the same way that, in the earlier picture, you wouldn't necessarily evaluate the policy just once: you'd evaluate it maybe 100 times in simulation and then average to get a better estimate of the expected reward. In the same way, you do that here, but with 100 fixed sequences of random numbers. Does that make sense? Any other questions?

Student: If we use 100 scenarios and get an estimate for the policy, versus evaluating 100 times with [fresh] random numbers, won't you get a similarly good [estimate]? Yeah, I guess you're right: for a fixed policy, the quality of the approximation is equally good in both cases. The advantage of fixing the random numbers is that you end up with an optimization problem that's much easier. I have some search problem, and on the horizontal axis there's a space of control policies, and my goal is to find a control policy that maximizes the payoff. The problem with the earlier setting was that when I evaluate policies I get these noisy estimates, and then it's just very hard to optimize the red curve if I have these points that are all over the place.
By fixing the random numbers, the algorithm still doesn't get to see the red curve, but at least it's now optimizing a deterministic function. That makes the optimization problem much easier. Does that make sense?

Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves that are easy to optimize. And then you smush them together?

Let's see. I have just one nice black curve that I'm trying to optimize.

Student: For each scenario?

I see. So I'm going to average over M scenarios, so I'm going to average over 100 scenarios. The black curve here is defined by averaging over a large set of scenarios. Does that make sense? If the averaging doesn't make sense, imagine that there's just one sequence of random numbers; that might be easier to think about. Fix one sequence of random numbers, and every time I evaluate another policy, I evaluate it against that same sequence of random numbers, and that gives me a nice deterministic function to optimize. Any other questions?

The advantage is really that, one way to think about it, when I evaluate the same policy twice, at least I get back the same answer. This gives me a deterministic function mapping from the parameters of my policy to my estimate of the expected payoff. That's just a function that I can try to optimize using a search algorithm.

So we used this algorithm for inverted hovering, and again, policy search algorithms tend to work well when you can find a reasonably simple policy mapping from the states to the actions. That is especially true for low-level control tasks, which I think of almost as reflexes. In contrast, if you want to solve a problem like Tetris, where you might plan ahead a few steps about what's a nice configuration of the board, or something like a game of chess, or problems of long path planning, go here, then go there, then go there, then sometimes you might apply a value function method instead. But for tasks like helicopter flight, for low-level control tasks, for the reflexes of driving or controlling various robots, policy search algorithms, where we can sometimes more easily approximate the policy directly, work very well.
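To round out the toy sketch before moving on (again under the same made-up dynamics; estimated_payoff is the scenario-averaged estimate defined above, not anything from the actual helicopter work), this is roughly what optimizing that deterministic estimate could look like, using plain finite-difference gradient ascent on a single policy parameter. Any local search method could be substituted.

    def optimize(theta0=0.0, step=0.05, eps=1e-3, iters=200):
        # Gradient ascent on the deterministic, scenario-averaged payoff estimate,
        # using a central finite difference in place of an analytic gradient.
        theta = theta0
        for _ in range(iters):
            grad = (estimated_payoff(theta + eps) - estimated_payoff(theta - eps)) / (2 * eps)
            theta += step * grad
        return theta

    best_theta = optimize()
    print(best_theta, estimated_payoff(best_theta))

Because the objective never changes between calls, ordinary deterministic search machinery applies; with the noisy estimates of the earlier picture, the finite differences above would mostly be measuring noise.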
Let me now say a little bit about the state of RL today. RL algorithms are applied to a wide range of problems, and the key is really sequential decision making. The place where you think about applying a reinforcement learning algorithm is when you need to make a decision, then another decision, then another decision, and some of your actions may have long-term consequences. I think that is the heart of RL: sequential decision making, where you make multiple decisions and some of your actions may have long-term consequences.

I've shown you a bunch of robotics examples. RL is also applied to things like medical decision making, where you may have a patient and you want to choose a sequence of treatments: you do this now for the patient, the patient may then be in some other state, you choose to do that later, and so on. It turns out there's a large community of people applying these sorts of tools to queues. Imagine you have a bank where people are lining up, and after they go to one cashier, some of them have to go to the manager to deal with something else. You have a system with multiple people standing in line in multiple queues, so how do you route people optimally to minimize the waiting time? And not just people, but objects on an assembly line and so on. It turns out there's a surprisingly large community working on optimizing queues.

I mentioned game playing a little bit already. There are things like financial decision making: if you have a large amount of stock, how do you time the selling of your stock so as not to affect market prices too adversely? There are many operations research problems, things like factory automation: can you design a factory to optimize throughput, or minimize cost, or whatever? These are all the sorts of problems that people are applying reinforcement learning algorithms to.

Let me just close with a few robotics examples, because they're always fun, and we just have these videos. This video is the work of Zico Kolter and Pieter Abbeel, a PhD student here. They were working on getting a robot dog to climb over difficult, rugged terrain. Using a reinforcement learning algorithm, they applied an approach that's similar to a value function approximation approach, not quite, but similar. They allowed the robot dog to plan ahead multiple steps, and to carefully choose its footsteps and traverse rugged terrain. This is really the state of the art in terms of what you can get a robotic dog to do.

Here's another fun one. It turns out that wheeled robots are very fuel-efficient; cars and trucks are almost the most fuel-efficient robots in the world.
Cars and trucks are very fuel-efficient, but legged robots have the ability to traverse more rugged terrain. So this is a robot, actually a small-scale mockup of a larger vehicle built by Lockheed Martin, and the question is: can you come up with a vehicle that has wheels, and so has the fuel efficiency of wheeled robots, but also has legs, so it can traverse obstacles? Again, a reinforcement learning algorithm was used to design a controller for this robot to make it traverse obstacles, with somewhat complex gaits that would be very hard to design by hand. By choosing a reward function, telling the robot there's a plus-one reward at the top of the goal, and a few other things, it learns these sorts of policies automatically.

A last couple of fun ones: I'll show you a couple of final helicopter videos. This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, who have been working on autonomous flight. I'll just let you watch. I'll just show you one more.

Student: [Inaudible] do this with a real helicopter [inaudible]?

Not a full-size helicopter, only small radio-controlled helicopters. [Inaudible]. Full-size helicopters are the wrong design for this; you shouldn't do this on a full-size helicopter. Many full-size helicopters would fall apart if you tried to do this. Let's see, there's one more.

Student: Are there any human [inaudible]?

Yes, there are very good human pilots that can.

This is just one more maneuver. That was kind of fun. So that was the work of Pieter Abbeel and Adam Coates.

So that was it. That was all the technical material I wanted to present in this class. I guess you're all experts on machine learning now. Congratulations. I hope you've gotten the sense through this class that this is one of the technologies that's really having a huge impact on science, engineering, and industry. As I said in the first lecture, I think many people use machine learning algorithms dozens of times a day without even knowing it.

Based on the projects you've done, I hope that most of you will be able to imagine yourselves going out after this class and applying these things to solve a variety of problems. Hopefully, some of you will also imagine yourselves writing research papers after this class, be it on a novel way to do machine learning, or on some way of applying machine learning to a problem that you care about. In fact, looking at the project milestones, I'm actually sure that a large fraction of the projects in this class will be publishable, so I think that's great.
I guess many of you will also go into industry, make new products, and make lots of money using learning algorithms. Remember me if that happens. One of the things I'm excited about is that, through research or industry, I'm actually completely sure that the people in this class will, in the next few months, apply machine learning algorithms to lots of problems: in industrial management and computer science, things like optimizing computer architectures, network security, robotics, computer vision, problems in computational biology, problems in aerospace, understanding natural language, and many more things like that. Right now I have no idea what all of you are going to do with the learning algorithms you've learned about, but every time I wrap up this class, I always feel very excited, and optimistic, and hopeful about the sorts of amazing things you'll be able to do.

One final thing: I'll just give out this handout. One final thing is that machine learning has grown out of a larger literature on AI, with its desire to build systems that exhibit intelligent behavior. Machine learning is one of the tools of AI, maybe the one that's had a disproportionately large impact, but there are many other ideas in AI that I hope you go and continue to learn about. Fortunately, Stanford has one of the best and broadest sets of AI classes, and I hope that you take advantage of some of them, go and learn more about AI, and learn more about other fields which often apply learning algorithms to problems in vision, natural language processing, robotics, and so on.

So the handout I just gave out has a list of AI-related courses. Just running down it very quickly: CS221 is an overview class that I teach. There are a lot of robotics classes as well: 223A, 225A, 225B, many robotics classes; there are so many applications of learning algorithms to robotics today. 222 and 227 are knowledge representation and reasoning classes. Of all the classes on this list, 228, which Daphne Koller teaches, is probably the closest in spirit to 229; it's one of the classes I highly recommend to all of my PhD students as well. Many other classes also touch on machine learning. On the next page there are courses on computer vision, speech recognition, and natural language processing: various tools that aren't just machine learning, but that often involve machine learning in many ways.
There are other aspects of AI too, like multi-agent systems, taught by [inaudible]. EE364A is convex optimization, a class taught by Steve Boyd; convex optimization came up many times in this class, and if you want to become really good at it, EE364A is a great class. If you're interested in project courses, I also teach a project class next quarter where we spend the whole quarter working on research projects.

So I hope you go and take some more of those classes. In closing, let me just say that this class has been really fun to teach, and it's very satisfying to me personally when we set these insanely difficult homeworks, and then we see a solution and I think, "Oh my god, they actually figured that one out." It's actually very satisfying when I see that. Or, looking at the milestones, I often go, "Wow, that's really cool; I bet this would be publishable."

So I hope you take what you've learned, go forth, and do amazing things with learning algorithms. I know this is a heavy-workload class, so thank you all very much for the hard work you've put into this class, and the hard work you've put into learning this material, and thank you very much for having been students in this class.