This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that may be some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t = L_t s_t, where L_t was the matrix of parameters that came out of the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we would choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate for what the state is given all the observations. In particular, I talked about the Kalman filter, with which we worked out our posterior distribution of what the state is given all the observations up to a certain time. So this is from last time: given the observations y_1 through y_t, our posterior distribution of the current state s_t is Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this s_{t|t}, which is going to be our best guess for what the state currently is. And then we choose actions using our estimate of the state, rather than using the true state, because we don't know the true state anymore in this POMDP.
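As a minimal sketch of this "estimate the state, then act as if the estimate were the truth" strategy, here is one Kalman filter step followed by the control step. The system matrices A, B, C, the noise covariances Q and R_obs, and the precomputed LQR gain L_t are all assumed to be given; they are hypothetical placeholders rather than anything worked out on the board.

```python
import numpy as np

def kalman_step(s_hat, Sigma, y, a_prev, A, B, C, Q, R_obs):
    """One Kalman filter update: predict with the dynamics, then correct with
    the new observation y. Returns the new posterior mean and covariance."""
    # Predict: propagate the previous posterior through the (noisy) dynamics.
    s_pred = A @ s_hat + B @ a_prev
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct: fold in the observation y = C s + noise.
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R_obs)
    s_new = s_pred + K @ (y - C @ s_pred)
    Sigma_new = (np.eye(len(s_hat)) - K @ C) @ Sigma_pred
    return s_new, Sigma_new

def act(L_t, s_hat):
    """Certainty-equivalent control: apply the LQR gain computed for the
    fully observed problem to the state estimate s_{t|t}."""
    return L_t @ s_hat
```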
So it turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism, a POMDP is formally a tuple (S, A, Y, {P_sa}, {O_s}, T, R), where the changes relative to an MDP are that Y is the set of possible observations and the O_s are the observation distributions. At each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O_{s_t}, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP. In the more general case, that strategy of designing a controller assuming full observability and then just estimating the state and plugging the two together is often a very reasonable strategy for general POMDPs, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is talk about a different class of reinforcement learning algorithms, called policy search algorithms. Policy search algorithms can be applied equally well to MDPs - to fully observable Markov decision processes - or to these partially observable MDPs. What I'll do now is describe policy search algorithms applied to MDPs, to the fully observable case, and at the end I'll briefly describe how you can take policy search algorithms and apply them to POMDPs.
In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard. But nonetheless, policy search turns out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: the way we define this set capital Pi of policies and then search for a policy in it is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in that class script H.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've talked about so far, we would try to find V*, the optimal value function, and then use V* to try to compute or approximate pi*. So all the approaches we talked about previously are strategies for finding a good policy where we first compute the value function and then go from that to the policy. In contrast, in policy search algorithms - in what are called direct policy search algorithms - the idea is that we're going to, quote, "directly" try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

Also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and labels y, and I said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there will be a step where I say, "Well, let's try to approximate what a good action is using a logistic function of the state."
So again, I'll sort of pull a function out of the air. I'll say, "Let's just choose a function, and that'll be our choice of the policy class," and I'll say, "Let's take as input the state, map it through a logistic function, and then hopefully we'll approximate what a good action is using a logistic function of the state." So the choice of policy class is again a little bit arbitrary, but it's arbitrary in the same way it was when we were talking about supervised learning.

So to develop our first policy search algorithm, I'm actually going to need a new definition. For our first policy search algorithm, we'll need to work with stochastic policies. What I mean by a stochastic policy is a function pi mapping from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so for every state, the sum over a of pi(s, a) equals one - in other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose I have some stochastic policy pi and I want to execute it. In this example, let's say I have three actions. What that means is that if I'm in some state s, I would then compute pi(s, a1), pi(s, a2), and pi(s, a3). These will be three numbers that sum to one, and then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.

As a concrete example, let me go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole.
I have two actions, where - actually, let me just write it this way around - a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever. And now let's come up with a stochastic policy for this problem.

Coming up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1 is pi_theta(s, a1) = 1 / (1 + e^(-theta^T s)). I just chose the logistic function because it's a convenient function we've used a lot. So my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. If I'm executing that policy with parameters theta, the chance of my choosing the accelerate-right action is given by this number. And because my chances of executing actions a1 and a2 must sum to one, this also gives me pi_theta(s, a2) = 1 - 1 / (1 + e^(-theta^T s)). So when I'm in some state s, I'm going to compute this number, one over one plus e to the minus theta transpose s, and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do: let's say my state vector consists of the cart position, the cart velocity, the pole angle phi, and the angular velocity, and I've added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose my parameters theta so that the only non-zero coefficient is the one multiplying phi, then at any state, the probability of my taking action a1 - the probability of my taking the accelerate-right action - is one over one plus e to the minus theta transpose s, and since the inner product of theta and s just gives phi, that equals one over one plus e to the minus phi. And so with that choice of parameters theta, the chance of my accelerating to the right is just this function of the angle phi of my inverted pendulum.
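Here is a minimal sketch of this policy class, assuming a state layout of [1 (intercept), cart position, cart velocity, pole angle phi, angular velocity]; the exact ordering of features is an assumption, not something specified in the lecture.

```python
import numpy as np

def pi_theta(theta, s):
    """Probability of the 'accelerate right' action a1 under the logistic
    policy pi_theta(s, a1) = 1 / (1 + exp(-theta^T s))."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

def sample_action(theta, s, rng=np.random.default_rng()):
    """Execute the stochastic policy: accelerate right with probability
    pi_theta(s, a1), otherwise accelerate left."""
    return "right" if rng.random() < pi_theta(theta, s) else "left"

# Hypothetical feature layout: s = [1 (intercept), x, x_dot, phi, phi_dot].
# Choosing theta so that only the coefficient on phi is non-zero makes
# P(accelerate right) = 1 / (1 + exp(-phi)), exactly the example above.
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
s = np.array([1.0, 0.2, 0.0, 0.3, -0.1])   # pole leaning right (phi = 0.3 rad)
print(pi_theta(theta, s), sample_action(theta, s))
```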
And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum make sense: if my pole's leaning over to the right, then I want to accelerate to the right to catch it, and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of the coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take these other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back, as indicated by the angular velocity, then you might be less worried about having to accelerate hard to the right. So these are the sorts of behavior you can get by varying the parameters theta.

And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi_theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff R(s_0, a_0) + ... + R(s_T, a_T) when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything? Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? Instructor: Right, so what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use something like a softmax parameterization, similar to the softmax regression that we saw earlier in the class. You may say the following - I'm running out of space here.
You may have a set of parameters theta_1 through theta_D if you have D actions, and set pi(s, a_i) = e^(theta_i^T s) / sum_j e^(theta_j^T s). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.

Anything else? Student: [Inaudible] stochastic? Instructor: I see - so the question is, given that the MDP has stochastic transition probabilities, is it possible to use deterministic policies and rely on the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. Formally, though, it is possible: if you have a fixed policy, then you'd be able to do that. Anything else? Student: [Inaudible]? Instructor: No, I guess even a deterministic class of policies can do that, but for the derivation later, I actually need to keep it separate. Actually - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay. Thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it sort of captures its essence. Here's the idea. In what I'm about to do, I'm going to assume that s_0 is some fixed initial state - or, it turns out, if s_0 is drawn from some fixed initial-state distribution, then everything I'm about to say still goes through, but let's just say s_0 is some fixed initial state. So my goal is to maximize the expected sum of rewards, E[R(s_0, a_0) + ... + R(s_T, a_T)], given that actions are chosen according to the policy pi_theta; to keep the notation simple, I'll drop the explicit conditioning on the policy from here on. The random variables in this expectation are the sequence of states and actions s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.
So let me write out this expectation explicitly, as a sum over all possible state-action sequences of the probability of the sequence times the sum of rewards - that's just what an expectation is: the probability of the random variables times the quantity. Let me expand out this probability. The probability of seeing one exact sequence of states and actions is the probability of the MDP starting in that state s_0 (if this is a deterministic initial state, then all the probability mass is on one state; otherwise, there's some distribution over initial states), times the probability pi_theta(s_0, a_0) that you chose action a_0 in that state, times the probability P_{s_0 a_0}(s_1) that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability pi_theta(s_1, a_1) that you chose that next action, and so on, with the last term being pi_theta(s_T, a_T). So what I did was just take the probability of seeing this sequence of states and actions and expand it out explicitly like this.

It turns out that later on I'm going to need to write this sum of rewards a lot, so I'm just going to call R(s_0, a_0) + ... + R(s_T, a_T) the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. Our goal is to maximize the expected payoff.

Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm; this is how we're going to update the parameters. Repeat: sample a state-action sequence s_0, a_0, ..., s_T, a_T; compute the payoff R(s_0, a_0) + ... + R(s_T, a_T); and update theta := theta + alpha [ (grad_theta pi_theta(s_0, a_0)) / pi_theta(s_0, a_0) + ... + (grad_theta pi_theta(s_T, a_T)) / pi_theta(s_T, a_T) ] times the payoff. The way you sample the state-action sequence is you just take your current stochastic policy and execute it in the MDP: start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probability takes you, and do that for T time steps. Then you compute the payoff, and then you perform this update.

So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every iteration it updates theta according to this thing on the right-hand side.
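As a rough sketch of one iteration of this update for the two-action logistic policy from the inverted pendulum example: the environment interface (reset/step) is a hypothetical stand-in, not something defined in the lecture, and the ratio grad pi / pi is computed as grad log pi, which for the logistic policy has a simple closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_update(theta, env, alpha, T, rng=np.random.default_rng()):
    """One iteration of the (essentially) REINFORCE update described above,
    for the two-action logistic policy pi_theta(s, a1) = sigmoid(theta^T s).
    `env` is a hypothetical simulator with reset() -> s0 and
    step(a) -> (s_next, reward); it is not anything defined in the lecture."""
    s = env.reset()
    grad_sum = np.zeros_like(theta)   # sum over t of grad pi / pi = grad log pi
    payoff = 0.0
    for t in range(T):
        p = sigmoid(theta @ s)                       # P(a = a1 | s)
        a = 1 if rng.random() < p else 0
        # grad of log pi_theta(s, a): (1 - p) s if a1 was taken, else -p s
        grad_sum += ((1 - p) if a == 1 else -p) * s
        s, r = env.step(a)
        payoff += r
    # Stochastic gradient ascent step: on expectation this moves theta in the
    # direction of the gradient of the expected payoff.
    return theta + alpha * grad_sum * payoff
```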
This thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw, and your state-action sequence is random. So on every iteration, I'll take a step that's chosen somewhat randomly, because it depends on this random state-action sequence. What I want to do is figure out how, on average, this changes the parameters theta. In particular, I want to know what the expected value of the change to the parameters theta is. Our goal is to maximize the expected payoff, and so as long as the updates are, on expectation - on average - taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember when I talked about stochastic gradient descent for least-squares regression: you have some parameters, maybe you're trying to minimize a quadratic function, and the parameters wander around somewhat randomly until they get close to the optimum of the quadratic surface. It turns out that the REINFORCE algorithm is very much like that. It will be a stochastic gradient ascent algorithm in which the step we take on each iteration is a little bit random - it's determined by the random state-action sequence - but on expectation it is essentially a gradient ascent algorithm. So it will wander around randomly, but on average take you towards the optimum. Let me go ahead and prove that now.

What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then show that on expectation, the update above is exactly that rule. Before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is given by differentiating these functions one at a time. So first I differentiate f, getting f'(theta) g(theta) h(theta), leaving the other two fixed; then I differentiate g, leaving the other two fixed;
then I differentiate h, getting f(theta) g(theta) h'(theta). So that's the product rule for derivatives: d/dtheta [f(theta) g(theta) h(theta)] = f'(theta) g(theta) h(theta) + f(theta) g'(theta) h(theta) + f(theta) g(theta) h'(theta).

Now refer back to the equation where we wrote out the expected payoff as this sum, over all state-action sequences, of the probability of the sequence times the payoff. What I'm going to do is take the derivative of that expression with respect to the parameters theta, because I want to do gradient ascent on this function - take the derivative with respect to theta and then try to go uphill on it. When I take the derivative with respect to theta using the product rule, I end up with a sum of terms: there are a lot of factors in that probability that depend on theta, and so I end up with one term corresponding to the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the second pi_theta factor keeping everything else fixed, and so on, down to one term from the derivative of the last pi_theta factor keeping everything else fixed. So we just apply the product rule to this. Let's write that down.

So the derivative with respect to theta of the expected value of the payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how it goes - the sum over the state-action sequences of the sum of those derivative terms, times the payoff: the term with the derivative of the first factor pi_theta(s_0, a_0), then the second one, then plus dot dot dot for each of the remaining pi_theta terms, through the last one. So that huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta one at a time. That was step one of four. And so by algebra - let me just write this down and convince us all that it's true.
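The board work itself isn't captured in the transcript; here is one way the four steps might be written out in full, using the notation above (payoff = R(s_0, a_0) + ... + R(s_T, a_T)). The remarks that follow walk through the second, third, and fourth lines.

```latex
\begin{align*}
\nabla_\theta \,\mathrm{E}[\text{payoff}]
 &= \sum_{s_0,a_0,\ldots,s_T,a_T}
    \Big[ P(s_0)\big(\nabla_\theta \pi_\theta(s_0,a_0)\big) P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots \pi_\theta(s_T,a_T) \\
 &\qquad\quad{} + P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\big(\nabla_\theta \pi_\theta(s_1,a_1)\big)\cdots \pi_\theta(s_T,a_T)
    + \cdots \Big]\times\text{payoff} \\[4pt]
 &= \sum_{s_0,a_0,\ldots,s_T,a_T}
    P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\cdots \pi_\theta(s_T,a_T)
    \left[\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right]\times\text{payoff} \\[4pt]
 &= \sum_{s_0,a_0,\ldots,s_T,a_T}
    P(s_0,a_0,\ldots,s_T,a_T)
    \left[\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right]\times\text{payoff} \\[4pt]
 &= \mathrm{E}\!\left[\left(\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)
    \big(R(s_0,a_0)+\cdots+R(s_T,a_T)\big)\right].
\end{align*}
```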
This is the second of four steps, in which we just convince ourselves that if I take that sum of ratio terms and multiply it by the big product in front, then I get back the sum of terms I had before. For example, when I multiply the product out against the first fraction, the pi_theta(s_0, a_0) in the product cancels the pi_theta(s_0, a_0) in the denominator and replaces it with the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But the product out front is just what I worked out previously - it's the joint probability of the state-action sequence - so that's the third step, and now I have the probability of the sequence, times the sum of ratios, times the payoff. And so by the definition of expectation, this is just equal to the expected value of that sum of ratios times the payoff - that's the fourth step. And this thing inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm - roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff. That's how I started this whole derivation: I said, let's look at our expected payoff and take its derivative with respect to theta. What we've proved is that on expectation, the step direction taken by REINFORCE is exactly the gradient of the thing we're trying to optimize. This shows that the algorithm is a stochastic gradient ascent algorithm.

I wrote a lot. Why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then check whether you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some further comments. We've talked about value function approximation approaches, where you approximate V* and then go from V* to pi*. And then there are these policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi.
So the question really is, for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well? For the inverted pendulum problem, this is quite likely to be true: going through the different choices of parameters, you can say things like, if the pole's leaning towards the right, then accelerate towards the right to try to catch it. So for the inverted pendulum, this is probably true. The same goes for lots of what are called low-level control tasks - things like driving a car, the low-level reflexes of do you steer your car left to avoid another car, do you steer left to follow the road, or flying a helicopter - again, very short-time-scale types of decisions. I like to think of these as the decisions of a trained operator, like a trained driver or a trained pilot: they're almost reflexes, very quick instinctive things where you map very directly from the inputs to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.

In contrast, if you have problems that require long, multi-step reasoning - things like a game of chess, where you have to reason carefully about if I do this, then they'll do that, then they'll do this, then they'll do that - those I think of as less instinctual, as very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. Let me say more about this later.

Just as a side comment, it turns out also that if you have a POMDP, a partially observable MDP - I don't want to say too much about this - and you only have an approximation, call it s-hat, of the true state (so this could be s-hat equal to s_{t|t} from a Kalman filter), then you can still use these sorts of policy search algorithms, with a policy of the form pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that looks only at your estimate of the state to choose the action.
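A minimal sketch of that idea, reusing the two-action logistic policy from earlier; the kalman_estimate function is a hypothetical stand-in for whatever filter computes s-hat from the observations, not anything defined in the lecture.

```python
import numpy as np

def pi_theta_on_estimate(theta, s_hat):
    """Stochastic policy applied to the state *estimate* rather than the true
    (unobserved) state: pi_theta(s_hat, a1) = 1 / (1 + exp(-theta^T s_hat))."""
    return 1.0 / (1.0 + np.exp(-theta @ s_hat))

def choose_action(theta, obs_history, kalman_estimate,
                  rng=np.random.default_rng()):
    """`kalman_estimate` is a hypothetical stand-in for whatever filter you use
    to compute s_hat = s_{t|t} from the observations y_1..y_t. The key point is
    that the same estimator is used at training time and at test time."""
    s_hat = kalman_estimate(obs_history)
    return 1 if rng.random() < pi_theta_on_estimate(theta, s_hat) else 0
```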
By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow. So it works, but one thing to watch out for is that because you're taking these gradient ascent steps that are very noisy - you're sampling a state-action sequence and then taking a gradient ascent step in an essentially random direction that is only correct on expectation - the gradient ascent direction for REINFORCE can sometimes be quite noisy, and so it's not uncommon to need a million iterations of gradient ascent, or ten million, or a hundred million iterations, for REINFORCE to work. That's just something to watch out for.

One consequence of that is that in the REINFORCE algorithm - I shouldn't really call it REINFORCE; in what's essentially the REINFORCE algorithm - there's this step where you need to sample a state-action sequence. In principle you could do this on your own robot: if there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to run your robot ten million times. I personally have seen many more applications of REINFORCE in simulation: you can easily run ten thousand simulations, or ten million simulations, of your robot, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE used to learn with a simulator than run directly on a physical device.

The last thing I wanted to do is tell you about one other, final policy search algorithm. Could I switch to the laptop display, please? It's a policy search algorithm called Pegasus, which is actually what we used on our autonomous helicopter flight work for many years - there are some other things we do now.
So here's the idea. This is a reminder slide on the RL formalism. There's nothing here that you don't know, but I just want to describe the RL formalism pictorially, because I'll use it later. I'm going to draw the reinforcement learning picture as follows: you initialize the system - say a helicopter, or whatever - in some state s_0, you choose an action a_0, then the helicopter dynamics takes you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you see, and that's your total payoff. So this is just a picture I want to use to summarize the RL problem.

Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by a green box. I'll switch back to deterministic policies for now, so my deterministic policy will be some function mapping from the states to the actions. As a concrete example, you can imagine that in the policy search setting you have a linear class of policies: the action a will be, say, a linear function theta transpose s of the state, and your goal is to learn the parameters theta of the linear function. So imagine trying to do something like linear regression on policies, except you're trying to optimize the reinforcement learning objective. Just imagine that the action a is theta transpose s, and you do policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies.

Here's the idea. Quite often we come up with a model or a simulator for the MDP, and as before, a model or a simulator is just a box that takes as input some state s_t and some action a_t, and outputs the next state s_{t+1} that you might get to in the MDP. This s_{t+1} will be a random state, drawn from the state transition probabilities of the MDP. This is important - very important: s_{t+1} will be a random function of s_t and a_t, and the same is true in the simulator.
So, for example, for autonomous helicopter flight, you might build a simulator using supervised learning - an algorithm like linear regression, or something similar to linear regression - so that you get a nonlinear model of the dynamics, of what s_{t+1} is as a random function of s_t and a_t.

Now, once you have a simulator, then given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find a policy pi mapping from states to actions - the goal is to find the green box - that works well. If you have any one fixed policy pi, you can evaluate it just using the simulator, via the picture shown at the bottom of the slide: you take your initial state s_0, feed it into the policy pi, your policy pi outputs some action a_0, you plug that into the simulator, the simulator outputs a random state s_1, you feed s_1 into the policy, and so on, and you get a sequence of states s_0 through s_T that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation and see how well it does - measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once. In general, you would run the policy through the simulator some number of times and then average, to get an average over M simulations and a better estimate of the policy's expected payoff.

So now, given any one fixed policy, this gives me a way to evaluate the expected payoff of that policy. One reasonably obvious thing you might then try to do is just search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff. Does that make sense? My policy has some parameters theta - say my actions a are equal to theta transpose s, if it's a linear policy - and for any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator.
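Here is a minimal sketch of that evaluation procedure; the simulator, reward, and policy interfaces are hypothetical placeholders rather than anything defined in the lecture.

```python
import numpy as np

def estimate_expected_payoff(policy, simulator, reward, s0, T, M,
                             rng=np.random.default_rng()):
    """Monte Carlo evaluation of a fixed (deterministic) policy: fly the
    policy in the simulator M times and average the summed rewards.
    `simulator(s, a, rng)` is a hypothetical model returning a random next
    state s_{t+1}; `policy(s)` returns the action, e.g. theta @ s for a
    linear policy; `reward(s)` scores a state."""
    total = 0.0
    for _ in range(M):
        s, payoff = s0, 0.0
        for t in range(T):
            a = policy(s)
            payoff += reward(s)
            s = simulator(s, a, rng)        # random next state from the model
        total += payoff
    return total / M
```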
One thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate a policy, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a slightly different answer.

So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies: as I vary the parameters theta, I get different policies, and so I'm moving along the x-axis; and my total payoff is plotted on the vertical axis. The red line in this cartoon is the expected payoff of the different policies, and my goal is to find the policy with the highest expected payoff. You could search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that just by chance looked a little bit better than it should have; I evaluate a second policy, and just by chance it looks a little bit worse; I evaluate a third policy, a fourth; and sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random. When I apply the same policy twice in simulation, I might get back slightly different answers.

So as I evaluate more and more policies, these scattered black dots are the picture I get. My goal is to try to optimize the red line, and I hope you can appreciate that this is a hard problem, especially when all the optimization algorithm gets to see are these black dots - it doesn't have direct access to the red line. And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a ten-dimensional space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem.

So it turns out there's one way to make this optimization problem much easier. Here's the idea. The method is called Pegasus, which is an acronym for something; I'll tell you later.
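To see the difficulty concretely, here is roughly what evaluating the same policy twice with the Monte Carlo estimator sketched above might look like. The toy simulator and reward below are made-up stand-ins; the point is only that two calls with the same policy generally return different numbers, because fresh random numbers are drawn each time.

```python
import numpy as np

# Hypothetical toy setup: a 1-D noisy "simulator" and a reward that
# penalizes drifting away from the origin. These stand in for the real model.
def simulator(s, a, rng):
    return s + a + rng.normal(scale=0.1)

def reward(s):
    return -abs(s)

policy = lambda s: -0.5 * s              # one fixed linear policy
rng = np.random.default_rng()

def estimate(policy, M=10, T=50):
    total = 0.0
    for _ in range(M):
        s, payoff = 0.0, 0.0
        for _ in range(T):
            a = policy(s)
            payoff += reward(s)
            s = simulator(s, a, rng)
        total += payoff
    return total / M

print(estimate(policy), estimate(policy))   # two different noisy answers
```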
So the simulator repeatedly makes calls to a random number generator to generate random numbers r_t, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input a state and an action and outputs the next state randomly, and if you peer into the simulator - if you open the box and ask how it is generating these random next states s_{t+1} - pretty much the only way to write a simulator with random outputs is to make calls to a random number generator, get random numbers, these r_t's, and then have some function that takes as input s_0, a_0, and the result from your random number generator, and computes the next state as a function of those inputs and of the random number. This is pretty much the only way anyone implements any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice.

So here's the idea. Let's fix the sequence of random numbers in advance: choose r_1, r_2, up to r_{T-1}, and always use that same sequence of random numbers to evaluate different policies. Go into your code and fix r_1, r_2, through r_{T-1}: choose them randomly once and then fix them forever. If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. These sequences of random numbers are called scenarios, and Pegasus is an acronym for Policy Evaluation-of-Goodness And Search Using Scenarios.

So when you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you've defined some deterministic estimate of the expected payoff.
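As a minimal sketch of this idea - assuming, for simplicity, that the simulator consumes one random number per time step (a real simulator might need more), and again with hypothetical simulator, reward, and policy interfaces - the scenarios are drawn once and then reused for every policy evaluation, so the returned objective is a deterministic function of the policy.

```python
import numpy as np

def make_pegasus_objective(simulator, reward, s0, T, num_scenarios=100, seed=0):
    """Pegasus-style objective: draw the 'scenarios' (sequences of random
    numbers) once, then reuse exactly the same ones for every policy you
    evaluate. `simulator(s, a, r)` is a hypothetical deterministic function
    of the state, the action, and a random number r."""
    rng = np.random.default_rng(seed)
    # Fix the random numbers in advance: one sequence of length T per scenario.
    scenarios = rng.random(size=(num_scenarios, T))

    def objective(policy):
        total = 0.0
        for rs in scenarios:                 # average over the fixed scenarios
            s, payoff = s0, 0.0
            for t in range(T):
                a = policy(s)
                payoff += reward(s)
                s = simulator(s, a, rs[t])   # same random numbers every call
            total += payoff
        return total / num_scenarios

    return objective
```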
00:57:53.469 --> 00:57:56.739 And as you evaluate different policies, they're 00:57:56.739 --> 00:58:00.609 still only approximations to their true expected payoff, but at least now you have 00:58:00.609 --> 00:58:01.310 a 00:58:01.310 --> 00:58:03.359 deterministic function to optimize, 00:58:03.359 --> 00:58:07.109 and you can now apply your favorite algorithms, be it gradient ascent or 00:58:08.089 --> 00:58:09.809 some sort 00:58:09.809 --> 00:58:12.259 of local [inaudible] search 00:58:12.259 --> 00:58:13.609 to try to optimize the black 00:58:13.609 --> 00:58:16.329 curve. This gives you a much easier 00:58:16.329 --> 00:58:18.390 optimization problem, and you 00:58:18.390 --> 00:58:22.789 can optimize this to find hopefully a good policy. So this is 00:58:22.789 --> 00:58:26.650 the Pegasus policy search method. 00:58:26.650 --> 00:58:27.450 So when 00:58:27.450 --> 00:58:31.390 I started to talk about reinforcement learning, I showed 00:58:31.390 --> 00:58:34.980 that video of a helicopter flying upside down. That was actually done using 00:58:34.980 --> 00:58:40.029 exactly this method, using exactly this policy search algorithm. This 00:58:40.029 --> 00:58:41.849 seems to scale well 00:58:41.849 --> 00:58:47.099 even to fairly large problems, even to fairly high dimensional state spaces. 00:58:47.099 --> 00:58:50.439 Typically, where Pegasus policy search algorithms have been used - 00:58:50.439 --> 00:58:52.719 the optimization problem 00:58:52.719 --> 00:58:56.440 is still much easier than the stochastic version, but sometimes it's 00:58:56.440 --> 00:58:58.279 not entirely trivial, and so 00:58:58.279 --> 00:59:00.299 you have to apply this sort of method with 00:59:00.299 --> 00:59:04.549 maybe on the order of ten parameters or tens of parameters, so 30, 40 00:59:04.549 --> 00:59:05.750 parameters, but 00:59:05.750 --> 00:59:08.169 not thousands of parameters, 00:59:08.169 --> 00:59:13.049 at least in these sorts of applications. 00:59:13.049 --> 00:59:19.469 Student: So is that method different than just assuming that 00:59:19.469 --> 00:59:26.949 you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that 00:59:26.949 --> 00:59:28.809 we have a deterministic simulator? 00:59:28.809 --> 00:59:31.249 The answer is no, this isn't the same as just assuming a deterministic simulator. In the way you do this, 00:59:31.249 --> 00:59:34.949 for the sake of simplicity I talked about one sequence of random numbers. 00:59:34.949 --> 00:59:36.890 What you do is - 00:59:36.890 --> 00:59:41.969 so imagine that the random numbers are simulating different wind gusts against your helicopter. 00:59:41.969 --> 00:59:46.429 So what you want to do isn't really evaluate just against one pattern of wind gusts. 00:59:46.429 --> 00:59:48.000 What you want to do is 00:59:48.000 --> 00:59:49.920 sample some set of 00:59:49.920 --> 00:59:53.650 different patterns of wind gusts, and evaluate against all of them and average. 00:59:53.650 --> 00:59:55.680 So what you do is you actually 00:59:55.680 --> 00:59:58.039 sample say 100 - 00:59:58.039 --> 01:00:01.359 some number I made up like 100 sequences of random numbers, 01:00:01.359 --> 01:00:03.430 and every time you want to evaluate a policy, 01:00:03.430 --> 01:00:08.679 you evaluate it against all 100 sequences of random numbers and then average. 01:00:08.679 --> 01:00:13.249 This is in exactly the same way that on this earlier picture 01:00:13.249 --> 01:00:16.879 you wouldn't necessarily evaluate the policy just once.
You evaluate it maybe 100 01:00:16.879 --> 01:00:18.450 times in simulation, and then 01:00:18.450 --> 01:00:21.709 average to get a better estimate of the expected reward. 01:00:21.709 --> 01:00:23.009 In the same way, 01:00:23.009 --> 01:00:24.079 you do that here 01:00:24.079 --> 01:00:29.079 but with 100 fixed sequences of random numbers. Does that make sense? Any 01:00:29.079 --> 01:00:36.079 other questions? Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]? 01:00:47.719 --> 01:00:51.829 Yeah. I guess you're right. So the quality - for a fixed policy, 01:00:51.829 --> 01:00:57.099 the quality of the approximation is equally good for both cases. The 01:00:57.099 --> 01:01:00.989 advantage of fixing the random numbers is that you end up with 01:01:00.989 --> 01:01:03.690 an optimization problem that's much easier. 01:01:04.609 --> 01:01:06.100 I have some search problem, 01:01:06.100 --> 01:01:10.049 and on the horizontal axis there's a space of control policies, 01:01:10.049 --> 01:01:12.559 and my goal is to find a control policy that 01:01:12.559 --> 01:01:15.319 maximizes the payoff. 01:01:15.319 --> 01:01:18.660 The problem with this earlier setting was that 01:01:18.660 --> 01:01:21.949 when I evaluate policies I get these noisy estimates, 01:01:21.949 --> 01:01:22.950 and then it's 01:01:22.950 --> 01:01:26.690 just very hard to optimize the red curve if I have these points that are all over the 01:01:26.690 --> 01:01:29.420 place. And if I evaluate the same policy twice, I don't even get back the same 01:01:30.719 --> 01:01:32.959 answer. By fixing the random numbers, 01:01:32.959 --> 01:01:36.669 the algorithm still doesn't get to see the red curve, but at least it's 01:01:36.669 --> 01:01:40.609 now optimizing a deterministic function. That makes the 01:01:40.609 --> 01:01:42.349 optimization problem much easier. Does 01:01:42.349 --> 01:01:49.349 that make sense? Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves 01:01:50.539 --> 01:01:58.009 that are easy to optimize. And then you smush them together? 01:01:58.009 --> 01:01:59.309 Let's see. 01:01:59.309 --> 01:02:04.559 I have just one nice black curve that I'm trying to optimize. Student: For each scenario? 01:02:04.559 --> 01:02:08.890 I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios. 01:02:08.890 --> 01:02:12.339 So the black curve here is defined by averaging over 01:02:12.339 --> 01:02:16.249 a large set of scenarios. Does that make sense? 01:02:16.249 --> 01:02:20.099 So instead of only one - if the averaging thing doesn't make sense, imagine that there's 01:02:20.099 --> 01:02:24.049 just one sequence of random numbers. That might be easier to think 01:02:24.049 --> 01:02:27.419 about. Fix one sequence of random numbers, and every time I evaluate another policy, 01:02:27.419 --> 01:02:31.039 I evaluate against the same sequence of random numbers, and that gives me 01:02:31.039 --> 01:02:36.489 a nice deterministic function to optimize. 01:02:36.489 --> 01:02:39.449 Any other questions? 01:02:39.449 --> 01:02:41.269 The advantage is really 01:02:41.269 --> 01:02:45.459 that - one way to think about it is when I evaluate the same policy twice, 01:02:45.459 --> 01:02:50.449 at least I get back the same answer.
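Continuing the toy sketch from above (and reusing numpy, the rng, and the evaluate_on_scenario function defined there; the number of scenarios M = 100 and the crude hill climber below are illustrative choices, not the actual method used for the helicopter), here is roughly what averaging over a fixed set of scenarios looks like. The key property is that estimated_payoff is an ordinary deterministic function of the policy parameter, so any off-the-shelf local search can be run on it.

# Draw M scenarios once, up front, and then never change them.
M, T = 100, 100
scenarios = [rng.standard_normal(T) for _ in range(M)]

# Deterministic estimate of the expected payoff: average over the fixed scenarios.
def estimated_payoff(theta):
    return np.mean([evaluate_on_scenario(theta, sc) for sc in scenarios])

# Because the objective is now deterministic in theta, even a very crude
# hill-climbing search (a stand-in for gradient ascent or any favorite
# local search method) can be applied to it.
def hill_climb(theta=0.0, step=0.05, iters=100):
    best = estimated_payoff(theta)
    for _ in range(iters):
        for delta in (+step, -step):
            candidate = estimated_payoff(theta + delta)
            if candidate > best:
                theta, best = theta + delta, candidate
    return theta, best

print(hill_climb())

Evaluating estimated_payoff at the same theta twice always returns the same number, which is the property being emphasized here; the averaging over many scenarios is what keeps the fixed random numbers from amounting to a single arbitrary deterministic simulator.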
This gives me a deterministic function 01:02:50.449 --> 01:02:53.890 mapping from parameters in my policy 01:02:53.890 --> 01:02:56.239 to my estimate of the expected payoff. 01:02:57.789 --> 01:03:13.229 That's just a function that I can try to optimize using a search algorithm. 01:03:13.229 --> 01:03:16.799 So we use this algorithm for inverted hovering, 01:03:16.799 --> 01:03:20.329 and again policy search algorithms tend to work well when you can find 01:03:20.329 --> 01:03:23.739 a reasonably simple policy mapping from 01:03:23.739 --> 01:03:28.640 the states to the actions. This is especially true for low level control tasks, 01:03:28.640 --> 01:03:32.699 which I think of as sort of reflexes almost. 01:03:32.699 --> 01:03:33.799 Conversely, 01:03:33.799 --> 01:03:37.269 if you want to solve a problem like Tetris where you might plan 01:03:37.269 --> 01:03:40.499 ahead a few steps about what's a nice configuration of the board, or 01:03:40.499 --> 01:03:42.289 something like a game of chess, 01:03:42.289 --> 01:03:46.719 or problems involving long path planning - go here, then go there, then go there - 01:03:46.719 --> 01:03:51.069 then sometimes you might apply a value function method instead. 01:03:51.069 --> 01:03:55.059 But for tasks like helicopter flight, for 01:03:55.059 --> 01:03:58.569 low level control tasks, for the reflexes of driving or controlling 01:04:00.749 --> 01:04:05.699 various robots, policy search algorithms - 01:04:05.699 --> 01:04:10.629 where we can sometimes more easily approximate the policy directly - work very well. 01:04:11.839 --> 01:04:15.599 Some [inaudible] the state of RL today. RL algorithms are applied 01:04:15.599 --> 01:04:18.499 to a wide range of problems, 01:04:18.499 --> 01:04:22.239 and the key is really sequential decision making. The place where you 01:04:22.239 --> 01:04:25.569 think about applying a reinforcement learning algorithm is when you need to make a decision, 01:04:25.569 --> 01:04:28.209 then another decision, then another decision, 01:04:28.209 --> 01:04:31.889 and some of your actions may have long-term consequences. I think that 01:04:31.889 --> 01:04:37.910 is the heart of RL: sequential decision making, where you make multiple decisions, 01:04:37.910 --> 01:04:41.599 and some of your actions may have long-term consequences. I've shown you 01:04:41.599 --> 01:04:44.170 a bunch of robotics examples. RL 01:04:44.170 --> 01:04:45.569 is also 01:04:45.569 --> 01:04:48.759 applied to things like medical decision making, where you may have a patient and you 01:04:48.759 --> 01:04:51.879 want to choose a sequence of treatments, and you do this now for the patient, and the 01:04:51.879 --> 01:04:56.449 patient may be in some other state, and you choose to do that later, and so on. 01:04:56.449 --> 01:05:00.839 It turns out there's a large community of people applying these sorts of tools to queues. So 01:05:00.839 --> 01:05:04.539 imagine you have a bank where you have people lining up, and 01:05:04.539 --> 01:05:06.009 after they go to one cashier, 01:05:06.009 --> 01:05:09.740 some of them have to go to the manager to deal with something else. You have 01:05:09.740 --> 01:05:13.410 a system of multiple people standing in line in multiple queues, and so how do you route 01:05:13.410 --> 01:05:17.839 people optimally to minimize the waiting 01:05:17.839 --> 01:05:20.239 time? And not just people, but 01:05:20.239 --> 01:05:23.969 objects in an assembly line and so on.
It turns out there's a surprisingly large community 01:05:23.969 --> 01:05:26.349 working on optimizing queues. 01:05:26.349 --> 01:05:29.719 I mentioned game playing a little bit already. Things 01:05:29.719 --> 01:05:33.289 like financial decision making, if you have a large amount of stock, 01:05:33.289 --> 01:05:37.569 how do you sell off a large amount - how do you time the selling off of your stock 01:05:37.569 --> 01:05:41.329 so as to not affect market prices adversely too much? There 01:05:41.329 --> 01:05:44.949 are many operations research problems, things like factory automation. Can you design 01:05:44.949 --> 01:05:46.369 a factory 01:05:46.369 --> 01:05:50.140 to optimize throughput, or minimize cost, or whatever? These are all 01:05:50.140 --> 01:05:54.049 sorts of problems that people are applying reinforcement learning algorithms to. 01:05:54.049 --> 01:05:57.529 Let me just close with a few robotics examples because they're always fun, and we just 01:05:57.529 --> 01:05:59.190 have these videos. 01:06:00.189 --> 01:06:07.259 This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students 01:06:07.259 --> 01:06:13.760 here. They were working on getting a robot dog to climb over difficult, rugged 01:06:13.760 --> 01:06:16.959 terrain. Using a reinforcement learning algorithm, 01:06:16.959 --> 01:06:19.979 they applied an approach that's 01:06:19.979 --> 01:06:24.149 similar to a value function approximation approach, not quite but similar. 01:06:24.149 --> 01:06:28.849 They allowed the robot dog to sort of plan ahead multiple steps, and 01:06:28.849 --> 01:06:33.999 carefully choose its footsteps and traverse rugged terrain. 01:06:33.999 --> 01:06:43.339 This is really state of the art in terms of what you can get a robotic dog to do. 01:06:43.339 --> 01:06:45.569 Here's another fun one. 01:06:45.569 --> 01:06:48.009 It turns out that 01:06:48.009 --> 01:06:51.259 wheeled robots are very fuel-efficient. Cars and trucks are almost the 01:06:51.259 --> 01:06:55.250 most fuel-efficient robots in the world. 01:06:55.250 --> 01:06:57.260 Cars and trucks are very fuel-efficient, 01:06:57.260 --> 01:07:01.309 but legged robots have the ability to traverse more rugged terrain. 01:07:01.309 --> 01:07:04.599 So this is a robot - this is actually a small scale mockup of a larger 01:07:04.599 --> 01:07:06.519 vehicle built by Lockheed Martin - 01:07:06.519 --> 01:07:09.699 but can you come up with a vehicle that 01:07:09.699 --> 01:07:13.689 has wheels, and so has the fuel efficiency of wheeled robots, but also 01:07:13.689 --> 01:07:19.099 has legs so it can traverse obstacles? 01:07:19.099 --> 01:07:27.349 Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles with 01:07:27.349 --> 01:07:31.319 somewhat complex gaits that would be very hard to design by hand, 01:07:31.319 --> 01:07:33.279 but by choosing a reward function - telling the robot 01:07:33.279 --> 01:07:36.789 there's a plus one reward at the top of the goal, 01:07:36.789 --> 01:07:45.469 and a few other things - it learns these sorts of policies automatically. 01:07:46.519 --> 01:07:55.229 Last couple fun ones - I'll show you a couple last helicopter videos. 01:07:57.029 --> 01:08:08.700 This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, 01:08:08.769 --> 01:08:33.319 who have been working on autonomous flight. I'll just let you watch. I'll just 01:09:49.339 --> 01:09:53.900 show you one more.
[Inaudible] do this with a real helicopter [inaudible]? 01:09:53.900 --> 01:09:55.199 Not a full-size helicopter. 01:09:55.199 --> 01:10:00.860 Only small radio control helicopters. [Inaudible]. Full-size helicopters 01:10:00.860 --> 01:10:05.780 are the wrong design for this. You shouldn't do this on a full-size helicopter. 01:10:05.780 --> 01:10:12.619 Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more. 01:10:12.619 --> 01:10:18.919 Are there any human [inaudible]? Yes, 01:10:18.919 --> 01:10:25.919 there are very good human pilots that can. 01:10:28.420 --> 01:10:35.420 This is just one more 01:10:43.739 --> 01:10:45.510 maneuver. That was kind of fun. So this is the work of 01:10:45.510 --> 01:10:55.530 Pieter Abbeel and Adam Coates. So that was it. That was all the technical material 01:10:55.530 --> 01:11:02.530 I wanted to present in this class. I guess 01:11:03.109 --> 01:11:10.199 you're all experts on machine learning now. Congratulations. 01:11:10.199 --> 01:11:14.209 And I hope you've gotten the sense through this class that this is one of the technologies 01:11:14.209 --> 01:11:18.969 that's really having a huge impact on science, engineering, and industry. 01:11:18.969 --> 01:11:23.130 As I said in the first lecture, I think many people use machine 01:11:23.130 --> 01:11:30.130 learning algorithms dozens of times a day without even knowing about it. 01:11:30.389 --> 01:11:34.110 Based on the projects you've done, I hope that 01:11:34.110 --> 01:11:37.899 most of you will be able to imagine yourself 01:11:37.899 --> 01:11:43.280 going out after this class and applying these things to solve a variety of problems. 01:11:43.280 --> 01:11:45.980 Hopefully, some of you will also imagine yourselves 01:11:45.980 --> 01:11:48.989 writing research papers after this class, be it on 01:11:48.989 --> 01:11:52.440 a novel way to do machine learning, or on some way of applying machine 01:11:52.440 --> 01:11:55.289 learning to a problem that you care 01:11:55.289 --> 01:11:58.749 about. In fact, looking at project milestones, I'm actually sure that a large fraction of 01:11:58.749 --> 01:12:04.300 the projects in this class will be publishable, so I think that's great. 01:12:06.679 --> 01:12:10.750 I guess many of you will also go into industry, make new products, and make lots 01:12:10.750 --> 01:12:12.659 of money using learning 01:12:12.659 --> 01:12:15.580 algorithms. Remember me 01:12:15.580 --> 01:12:19.530 if that happens. One of the things I'm excited about is that through research or 01:12:19.530 --> 01:12:22.570 industry, I'm actually completely sure that the people in this class in the 01:12:22.570 --> 01:12:23.849 next few months 01:12:23.849 --> 01:12:27.320 will apply machine learning algorithms to lots of problems in 01:12:27.320 --> 01:12:31.770 industrial management, and computer science, things like optimizing computer architectures, 01:12:31.770 --> 01:12:33.340 network security, 01:12:33.340 --> 01:12:38.480 robotics, computer vision, to problems in computational biology, 01:12:39.140 --> 01:12:42.429 to problems in aerospace, or understanding natural language, 01:12:42.429 --> 01:12:44.309 and many more things like that.
01:12:44.309 --> 01:12:46.030 So 01:12:46.030 --> 01:12:49.920 right now I have no idea what all of you are going to do with the learning algorithms you learned about, 01:12:49.920 --> 01:12:51.540 but 01:12:51.540 --> 01:12:54.809 every time as I wrap up this class, I always 01:12:54.809 --> 01:12:58.209 feel very excited, and optimistic, and hopeful about the sorts of amazing 01:12:58.209 --> 01:13:03.129 things you'll be able to do. 01:13:03.129 --> 01:13:10.489 One final thing, I'll just give out this handout. 01:13:30.749 --> 01:13:35.170 One final thing is that machine learning has grown out of a larger literature on AI, 01:13:35.170 --> 01:13:36.419 where 01:13:36.419 --> 01:13:41.829 there's this desire to build systems that exhibit intelligent behavior. Machine learning 01:13:41.829 --> 01:13:44.889 is one of the tools of AI, maybe one that's had a 01:13:44.889 --> 01:13:46.510 disproportionately large impact, 01:13:46.510 --> 01:13:51.519 but there are many other ideas in AI that I hope you 01:13:51.519 --> 01:13:55.289 go and continue to learn about. Fortunately, Stanford 01:13:55.289 --> 01:13:59.279 has one of the best and broadest sets of AI classes, and I hope that 01:13:59.279 --> 01:14:03.690 you take advantage of some of these classes, and go and learn more about AI, and more 01:14:03.690 --> 01:14:07.690 about other fields which often apply learning algorithms to problems in vision, 01:14:07.690 --> 01:14:10.480 problems in natural language processing, in robotics, and so on. 01:14:10.480 --> 01:14:14.960 So the handout I just gave out has a list of AI related courses. Just 01:14:14.960 --> 01:14:18.489 running down very quickly, I guess, CS221 is an overview that I 01:14:18.489 --> 01:14:19.340 teach. There 01:14:19.340 --> 01:14:23.699 are a lot of robotics classes also: 223A, 01:14:23.699 --> 01:14:25.139 225A, 225B 01:14:25.139 --> 01:14:28.760 - many robotics classes. There are so many applications 01:14:28.760 --> 01:14:31.999 of learning algorithms to robotics today. 222 01:14:31.999 --> 01:14:36.369 and 227 are knowledge representation and reasoning classes. 01:14:36.369 --> 01:14:39.919 228 - of all the classes on this list, 228, which Daphne 01:14:39.919 --> 01:14:43.479 Koller teaches, is probably closest in spirit to 229. This is one of 01:14:43.479 --> 01:14:45.300 the classes I highly recommend 01:14:45.300 --> 01:14:48.600 to all of my PhD students as well. 01:14:48.600 --> 01:14:53.300 Many other classes also touch on machine learning. On the next page, 01:14:54.040 --> 01:14:57.760 there are courses on computer vision, speech recognition, natural language processing - 01:14:57.760 --> 01:15:03.199 various tools that aren't just machine learning, but often involve machine learning in many ways. 01:15:03.199 --> 01:15:07.339 Other aspects of AI: multi-agent systems, taught by [inaudible]. 01:15:09.630 --> 01:15:13.139 EE364A is convex optimization. It's a class taught by 01:15:13.139 --> 01:15:16.630 Steve Boyd, and convex optimization came up many times in this class. If you 01:15:16.630 --> 01:15:20.909 want to become really good at it, EE364 is a great class. If you're 01:15:20.909 --> 01:15:24.579 interested in project courses, I also teach a project class 01:15:24.579 --> 01:15:27.179 next quarter where we spend the whole quarter working 01:15:27.179 --> 01:15:31.510 on research projects. 01:15:31.510 --> 01:15:37.589 So I hope you go and take some more of those classes.
01:15:37.589 --> 01:15:40.379 In closing, 01:15:40.379 --> 01:15:42.849 let me just say 01:15:42.849 --> 01:15:46.880 this class has been really fun to teach, and it's very satisfying to 01:15:46.880 --> 01:15:52.010 me personally when we set these insanely difficult homeworks, and 01:15:52.010 --> 01:15:54.649 then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's 01:15:54.649 --> 01:15:58.570 actually very satisfying when I see that. 01:15:58.570 --> 01:16:04.739 Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable." 01:16:04.739 --> 01:16:05.469 So 01:16:05.469 --> 01:16:09.039 I hope you take what you've learned, go forth, and 01:16:09.039 --> 01:16:11.570 do amazing things with learning algorithms. 01:16:11.570 --> 01:16:16.220 I know this is a heavy workload class, so thank you all very much for the hard 01:16:16.220 --> 01:16:20.280 work you've put into this class, and the hard work you've put into learning this material, 01:16:20.280 --> 01:16:21.699 and 01:16:21.699 --> 01:16:23.389 thank you very much for having been students in this class.