This presentation is delivered by the Stanford Center for Professional Development.

Okay. Welcome back. What I want to do today is talk about one of my favorite algorithms for controlling MDPs, one of the more elegant and efficient and powerful algorithms that I know of. So what I'll do is first start by talking about a couple of variations of MDPs that are slightly different from the MDP definition you've seen so far. These are pretty common variations. One is state action rewards, and the other is finite horizon MDPs. Using this slightly modified definition of an MDP, I'll talk about linear dynamical systems. I'll spend a little bit of time talking about models of linear dynamical systems, and then talk about LQR, or linear quadratic regulation control, which will lead us to something called the Riccati equation, which is what we solve in order to do LQR control.

So just to recap - we've seen this definition many times now - we've been defining an MDP as a tuple of states, actions, state transition probabilities, a discount factor and a reward function, where gamma, the discount factor, is a number between zero and one, and R, the reward function, was a function mapping from the states to the real numbers. So we had value iteration, which would do this update. After a while, value iteration will cause V to converge to V star. Then, having found the optimal value function, you compute the optimal policy by taking essentially the argmax over a of the right-hand side of the equation above.

So in value iteration, as you perform this update, the function V will converge to V star. With any finite number of iterations you only get closer and closer to V star; this actually converges exponentially quickly to V star, but you never exactly reach V star in a finite number of iterations.
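(A sketch of the value iteration update and policy extraction being recapped here, in the notation of earlier lectures; the board work itself isn't transcribed:)

$$V(s) := R(s) + \gamma \max_{a} \sum_{s'} P_{sa}(s')\, V(s'), \qquad \pi^*(s) = \arg\max_{a} \sum_{s'} P_{sa}(s')\, V^*(s').$$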
So what I want to do now is describe a couple of common variations of MDPs that use slightly different definitions: first of the reward function, and then second, something slightly different from discounting.

Remember that in the last lecture, I said that for infinite-state or continuous-state MDPs, we couldn't apply the most straightforward version of value iteration: if you have a continuous-state MDP, we need to use some approximation of the optimal value function. In contrast, later in this lecture, I'll talk about a special case of MDPs where you can actually represent the value function exactly, even if you have an infinite state space or even a continuous state space. I'll do that - talk about this special class of infinite-state MDPs - using this new variation of the reward function and the alternative to discounting, since they make the formulation a little easier.

So the first variation I want to talk about is state action rewards. I'm going to change the definition of the reward function; it turns out this won't be a huge deal. In particular, I change the reward function to be a function mapping from a state-action pair to the real numbers. What I mean by this is just the following. You start off in some state S0. You take an action A0, and as a result of your state-action choice, you transition to some new state, S1. You take some action, A1, you transition to some new state, S2, you take some action, A2, and so on. So this is the state-action sequence that you see.

So in an MDP where you have state action rewards, your total payoff is now defined as R(S0, A0) plus gamma times R(S1, A1) plus gamma squared times R(S2, A2) and so on, where your reward function is now a function both of the current state and of the action you took in the current state. So this is my total payoff. Then as usual, my goal will be to find a policy - a function mapping from the states to the actions - so that when I execute that policy, I maximize the expected value of my total payoff.

It actually turns out that, given an MDP with state action rewards, by playing with the definition of the states you can reduce this back to an MDP with rewards that are a function only of the states. That may or may not be obvious to you; don't worry if it isn't. But using state action rewards allows you to more directly model problems in which different actions have different costs.
A running example is a robot: it's more costly for the robot to move than for it to stay still. If you give it an action to stay still, the action to stay still may have a lower cost because you're not using battery power for that action. Another example - actually, another navigation example - would be if you have an outdoor vehicle that needs to drive over some sort of outdoor terrain, like very rough rocks or grass. That may be more costly, more difficult, than driving over, say, a paved road. So you may assign an action that requires driving over grass or driving over rocks a higher cost than driving over paved road.

So this really isn't a huge change to the definition of an MDP. I won't really bother to justify this a lot, but Bellman's equations generalize in the way that you'd probably expect: V star of S is now equal to the max over a of [ R(s, a) plus gamma times the sum over s prime of P_sa(s') times V star of s prime ]. So previously, when the reward function was just a function of the state S, we could take the max and push it inside. But now that the reward is a function of the action you're taking as well, the max comes outside. So this says that your expected total payoff, starting from the state, under the optimal policy, is equal to first your immediate reward, R(s, a), for executing some action a in state s, plus gamma times your future expected total payoff. So this is your expected total payoff if you take the action a from the current state, and since this is the optimal value function, your optimal expected total payoff is the max over all actions of the thing on the right.

Let's see. Value iteration, which I'm abbreviating VI, is really the same algorithm: V of S is updated as the max over a of R(s, a) plus the same thing. You just keep updating V of S using the right-hand side of Bellman's equations, and again you get value iteration in exactly the same way. Then finally, having found the optimal value function, V star, using the value iteration algorithm, you can then compute the optimal policy, pi star of S, the same as before: the best action to take in the state S is the action a that maximizes the thing on the right-hand side. So having used value iteration to compute the optimal value function, you can then find pi star using that.
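As a concrete illustration - not from the lecture itself - here is a minimal sketch of value iteration with state-action rewards for a small discrete MDP; the transition matrix, reward table, and discount factor are made-up placeholders:

```python
import numpy as np

# Hypothetical toy MDP with state-action rewards.
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)

# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[s, a] = state-action reward, as in the modified definition above.
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: V(s) := max_a [ R(s, a) + gamma * sum_s' P_sa(s') V(s') ]
    Q = R + gamma * np.einsum('asj,j->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # converges only in the limit; stop when close
        V = V_new
        break
    V = V_new

# Greedy policy: pi*(s) = argmax_a [ R(s, a) + gamma * E[V*(s')] ]
Q = R + gamma * np.einsum('asj,j->sa', P, V)
pi = Q.argmax(axis=1)
print(V, pi)
```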
So this was the easier of the two variations of MDPs that I want to describe. Any questions? Actually, are there questions about this?

So the other variation, the other alternative definition, will be finite horizon MDPs. A finite horizon MDP comprises a tuple of states, actions, state transition probabilities, a parameter T, and the reward function. Here, T is a parameter called the horizon time. Concretely, what this really means is that we'll be taking actions in the MDP only for a total of capital T time steps; we won't use discounting anymore. So you start in some state, take action A0, get to some other state S1, take action A1 and so on. Eventually, you get to some state S capital T after T time steps. So my total payoff now will be given by this sum from time zero up to time capital T of my rewards: R(S0, A0) plus R(S1, A1) and so on up to R(S_T, A_T). Okay? So this is my total payoff. My goal, as usual, is to maximize the expected value of my total payoff; we want to come up with a policy to maximize the expected value of this total payoff. The key difference is that the world only exists for capital T time steps, and after that, there are no more rewards to be collected.

So this turns out to be quite a difference, because it turns out that the optimal policy may be non-stationary. The term stationary means that it doesn't depend on time; non-stationary means that it may depend on time. So non-stationary roughly means that my optimal action to take will be different for different time steps. That's what non-stationary means.
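(A compact restatement of the finite-horizon setup just described, in the course's usual notation:)

$$\text{MDP} = (S, A, \{P_{sa}\}, T, R), \qquad \text{total payoff} = \sum_{t=0}^{T} R(s_t, a_t),$$

with no discount factor, and with the optimal policy allowed to be non-stationary: $\pi^*_t : S \to A$ for $t = 0, \dots, T$.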
Just as an example of that, imagine that we have some robot. Let's say the robot is here, and let's say there's a plus-one reward over there, and much further away, there's a plus-ten reward. So depending on how much time you have left on the clock, it may be better to go after the plus-one or the plus-ten reward. If it's still early in the game and you still have a lot of time, it may be better to head toward the plus-ten reward and collect a much larger reward. If you only have a couple of time steps left - if the clock has nearly reached the time capital T - then you may not have enough time to get to the plus-ten reward, and you'd be better off heading for the plus-one reward that's much closer by. So what this example illustrates is that when you're in that state, the best action to take could be to go left or to go right, depending on what time it is. So that's just an example illustrating how the optimal policy can be non-stationary.

In fact, since we have non-stationary policies anyway in the sequel, what I'm going to do next is allow non-stationary transition probabilities as well. So I'll just write that up there. What I mean is that so far, I've been assuming that the state S_t plus one is drawn from the state transition probabilities indexed by the previous state and the previous action, and I've been assuming that these state transition probabilities are the same for all times. So if I'm in some state and take some action, the distribution over the next state doesn't depend on time. I'm going to allow a slightly more general definition as well, in which we have non-stationary state transition probabilities, so that the chance of where you end up may also depend on what time it is.

As examples of these non-stationary state transition probabilities: one example would be if you model flying an aircraft over a long distance. As the aircraft flies, you burn fuel and become lighter, so the dynamics of the aircraft actually change over time - the mass of the aircraft can change significantly as you burn fuel. So depending on what time it is, your next state could actually depend not only on your current state and your action, but also on how much fuel you've burned and therefore what time it is. Another example, another aerospace one: if you have the weather forecast for the next 24 hours, say - you know what the winds and precipitation are going to be like over the next 24 hours - then if you fly the aircraft from, say, here to New York, it may cost different amounts to fly different routes at different times. Flying over the Rockies may cost different amounts depending on whether you do it now, when there's really great weather, or a few hours from now, when the weather forecast may be really bad.
For an example you see every day: the same thing holds for traffic, right? Depending on where you live - certainly here in California - there are times of day where traffic is really bad in lots of places, so the cost of driving certain roads may vary depending on what time of day it is. Lots of other examples: in industrial automation, different machines in the factory may be available to different degrees at different times of day, and it costs different amounts to hire different workers, depending on whether you're paying overtime and so on. So the cost of doing different things in the factory can also be a function of time, and the state transition probabilities can also be a function of time.

Lastly, while we're doing this, to make it fully general, we might as well have non-stationary reward functions as well, where you also index the reward function with a time subscript, so that the cost of doing different things may depend on the time as well.

There are more examples of non-stationary MDPs, but now that we have non-stationary policies, let's talk about an algorithm to actually try to find the optimal policy. So let me define the following; this is now a slightly modified definition of the optimal value function. I'll just write this down. I'm going to define the optimal value function, indexed by t, with a subscript t. The optimal value of a state for time t, we're going to define as your optimal sum of rewards if you start the MDP at that state S and if the clock starts off at time lowercase t. So the optimal value of a state will depend on what time it is and how much time you have left to run this MDP. Therefore, the sum on the right sums only over times t, t plus one, t plus two, up to time capital T. To state it in English again: this is your expected optimal total payoff if you start your system in a state S and if the clock is already at time lowercase t.
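(A sketch of the definition just stated, in symbols; the expectation is over the state sequence under the optimal policy:)

$$V^*_t(s) \;=\; E\!\left[\, R_t(s_t, a_t) + R_{t+1}(s_{t+1}, a_{t+1}) + \cdots + R_T(s_T, a_T) \;\middle|\; s_t = s,\ \pi^* \right].$$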
So it turns out there's a recursion you can use to compute this. Let me just write out the dynamic programming algorithm for it. It turns out you can compute the optimal value function for the MDP using the following recursion, which is very similar to what we had for value iteration. We're going to set V star t of S to be equal to the max over a of the immediate reward plus the expected future payoff, the same as before, right? Okay?

So if I start the clock at time t in state S, my expected total payoff is equal to the maximum over the actions I may take of my immediate reward for taking that action a in that state S, plus my expected future payoff. If I take action a, I would transition with probability P subscript sa of s prime to some new state s prime, and if I get to the state s prime, my total expected payoff from the state s prime would be V star, now with subscript t plus one, of s prime. The subscript t plus one reflects that after I've taken one action, my clock will have advanced from time t to time t plus one. So this is now V star subscript t plus one, and this expresses V star of t in terms of V star of t plus one.

Then lastly, to start off this recursion, you would have V star capital T, which is just equal to the max over a of the immediate reward. If you're already at time capital T, then you just get to take one action, and then the clock runs out. So this is V star capital T: your value of starting in some state S with just one time step left, with no time left on the clock after that.

So in the case of a finite horizon MDP, this actually gives rise to a very nice dynamic programming algorithm in which you start off by computing V star of capital T. Then you use this backward recursion to compute V star of capital T minus one, capital T minus two and so on - you compute V star of T, then T minus one, and so on, recursing backwards until you've computed V star for all of your time steps. Can you see this board? Cool.
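(A sketch of the backward recursion on the board, written out; the time superscript on $P^{(t)}_{sa}$ allows the non-stationary transition probabilities just discussed:)

$$V^*_T(s) \;=\; \max_{a}\; R_T(s, a),$$
$$V^*_t(s) \;=\; \max_{a}\;\Big[\, R_t(s, a) \;+\; \sum_{s'} P^{(t)}_{sa}(s')\, V^*_{t+1}(s') \,\Big], \qquad t = T-1, \dots, 0.$$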
Then the final step: previously, we said that pi star of S - I'm going to come back and change this a bit - was the argmax over a of R(s, a) plus gamma times the sum over s prime of P_sa(s') times V star of s prime; that's sort of what we had. In the finite horizon case, the optimal action may depend on what time it is. So the optimal action to take at time t, pi star subscript t of S, is the argmax over actions a of basically exactly that same thing on the right-hand side as we had in our dynamic programming algorithm. So you do this for every time step, and you've now computed your optimal policy for the different time steps. Again, this is a non-stationary policy, because pi star subscript t of S may depend on what time it is.

So the difference between this and the earlier version of value iteration is: what you do is you compute V star of capital T, then using the backwards recursion of that dynamic programming algorithm, you compute V star of T minus one, then V star of T minus two, and so on down to V star of zero; then from these you compute pi star. There's not a huge difference, but one minor difference from the infinite horizon discounted case is that by running this recursion once, you now have exactly the right value function. This just computes the value function, rather than merely converging to it in the limit; this gives you the right value function in one pass. Cool.
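As an illustration (again not from the lecture), a minimal sketch of this backward pass for a small discrete finite-horizon MDP with made-up transition and reward arrays; it returns the whole non-stationary policy in a single sweep:

```python
import numpy as np

# Hypothetical finite-horizon MDP: horizon T, stationary P and R for simplicity.
n_states, n_actions, T = 4, 2, 10
rng = np.random.default_rng(1)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # P[a, s, s'] = transition probability
R = rng.random((n_states, n_actions))      # R[s, a] = state-action reward

V = np.empty((T + 1, n_states))            # V[t, s] = V*_t(s)
pi = np.empty((T + 1, n_states), dtype=int)

# Base case: at the last step, just take the best immediate reward.
V[T] = R.max(axis=1)
pi[T] = R.argmax(axis=1)

# Backward recursion: V*_t(s) = max_a [ R(s, a) + sum_s' P_sa(s') V*_{t+1}(s') ]
for t in range(T - 1, -1, -1):
    Q = R + np.einsum('asj,j->sa', P, V[t + 1])
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)               # the optimal action may differ across t

print(V[0], pi[0])
```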
Any questions there? Yeah. Student: [Inaudible] Instructor (Andrew Ng): Is this computation much shorter than value iteration? Sort of yes and no - it depends on how large capital T is. Student: [Inaudible] for the normal MDP, could you pick a capital T and then use this case for that case? Instructor: I see - so for the normal MDP, can you assume some capital T and then use this? It actually turns out that - that's a great question. Let me just answer it in a hand-wavy way. It turns out that for a discounted infinite horizon MDP with some discount factor gamma, what you can do is say: after T time steps, gamma to the T would be really small, and I don't really care what happens after that many time steps because the rewards are multiplied by gamma to the T. So you can ask: can I take my infinite horizon discounted MDP and approximate it with a finite horizon MDP, where the number of time steps T is chosen so that gamma to the T is small enough? It turns out you can do that; you end up with some value for T, and you can solve for the T that makes this hold true.

It also turns out you can prove that if you took the original value iteration algorithm - the version for discounted MDPs - and ran it for this same number of time steps, you would end up with an approximation to the value function that is about this close, up to some small constant factors. So by doing that, you end up with roughly the same amount of computation anyway, and you then end up with a non-stationary policy, which is more expensive to keep around - you need to keep a different policy for every time step, which is not as nice as having a stationary policy, the same policy for all times. So there are other reasons why you might sometimes take an infinite horizon discounted problem and approximate it by a finite horizon problem, but this particular reason is not one of them. That makes sense. More questions?

Student: [Inaudible]? Is there a gamma in this? Instructor: So if you want, you can actually change the definition of an MDP and use a finite horizon discounted MDP; if you want, you can do that. You can put a gamma there and use discounting with the finite horizon. It turns out that usually, for most problems that people deal with, you either use discounting or you use the finite horizon; it's been less common to do both, but you can certainly do that as well. One of the nice things about discounting is that it makes sure your value function is finite. Algorithmically and mathematically, one of the reasons to use discounting is that you're multiplying your rewards by a geometrically decaying series, which ensures that the value function is always finite; that's a really nice mathematical property of discounting. When you have a finite horizon, the value function is also guaranteed to be finite, so with that you don't have to use discounting. But if you want, you can actually discount as well. Student: [Inaudible]. Instructor: Yeah, yes, you're right. If you want, you can fold the discount into the reward function, since we have non-stationary rewards as well.

So that was finite horizon MDPs. What I want to do now is actually use both of these ideas - state action rewards and finite horizon MDPs - to describe a special case of MDPs that makes very strong assumptions about the problem. But these assumptions are reasonable for many systems.
With these assumptions, what we come up with are, I think, very nice and very elegant algorithms for solving even very large MDPs. So let's talk about linear quadratic regulation.

We just talked about dynamic programming for finite horizon MDPs, so just remember that algorithm. When I come back to talk about an algorithm for solving LQR problems, I'm actually going to use exactly that dynamic programming algorithm that you just saw for finite horizon MDPs - I'll be using exactly that algorithm again. So just remember that for now.

So let's talk about LQR. I want to take these ideas and apply them to MDPs with continuous state spaces and maybe even continuous action spaces. To specify an MDP, I need to give you this five-tuple of states, actions, transition probabilities, horizon and reward; I'm going to use the finite horizon, capital T, rather than discounting. So in LQR problems, I'm going to assume the following. I'm going to assume that the state space is continuous, S equals R^n, and I'm going to assume a continuous set of actions as well, lying in R^d.

To specify the state transition probabilities P_sa, I need to tell you what the distribution of the next state is, given the current state and the current action. We actually saw a little bit of this in the last lecture. I'm going to assume the next state, S_t plus one, is a linear function of the previous state plus the action plus noise: A_t S_t plus B_t times the action a_t, plus w_t, where w_t is Gaussian with mean zero and some covariance given by sigma W. The subscripts t on A and B here show that these matrices could change over time, so this would be non-stationary dynamics.

As a point of notation: unfortunately, because we're compiling ideas from multiple literatures, capital A denotes both a set of actions as well as a matrix. When you see A later on, A will usually be used to denote a matrix rather than a set of actions. So I'm overloading notation again, but unfortunately, when you have research ideas from multiple research communities, often they share the same symbols. So just to be concrete, A_t is a matrix that's n by n, and the B_t are matrices that are n by d.
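(Putting the dynamics just described in one place - a sketch, with lowercase $a_t$ for the action to avoid clashing with the matrix $A_t$:)

$$s_{t+1} \;=\; A_t\, s_t + B_t\, a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w), \qquad s_t \in \mathbb{R}^n,\ a_t \in \mathbb{R}^d,\ A_t \in \mathbb{R}^{n\times n},\ B_t \in \mathbb{R}^{n\times d}.$$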
Just to be completely clear: the matrices A and B, I'm going to assume, are fixed and known in advance. So I'm going to give you the matrices A and B, and I'm going to give you sigma W; your job is to find a good policy for this MDP. So in other words, this is my specification of the state transition probabilities.

Looking ahead - we'll see this later - it turns out the treatment of this noise term is not very important. We can pretty much ignore the noise term and we'll still do fine. This is just a warning that in the sequel, what I do later, I might be slightly sloppy in my treatment of the noise term; in this very special case, it is unimportant.

The last things I have to specify are some horizon time, capital T, and the reward function. For LQR control, I'm going to assume that the reward function can be written as minus the quantity S_t transpose U_t S_t plus A_t transpose V_t A_t, where U_t is a matrix that's n by n and V_t is a matrix that's d by d. I'll assume that U_t and V_t are both positive semi-definite - both PSD. The fact that U_t and V_t are both positive semi-definite matrices implies that S_t transpose U_t S_t is greater than or equal to zero, and similarly A_t transpose V_t A_t is greater than or equal to zero. So this implies that your rewards are always negative. This is a somewhat depressing MDP in which there are only costs and no positive rewards, right, because of the minus sign there.
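(The reward just defined, written out; again a sketch with lowercase $a_t$ for the action:)

$$R_t(s_t, a_t) \;=\; -\big( s_t^{\top} U_t\, s_t \;+\; a_t^{\top} V_t\, a_t \big), \qquad U_t \in \mathbb{R}^{n\times n},\ V_t \in \mathbb{R}^{d\times d},\ U_t, V_t \succeq 0.$$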
As a complete example of how you might want to apply this - you've seen my helicopter videos, right? - suppose you have a helicopter, and you want the state S_t to be as close to zero as possible. Then you might choose U_t to be equal to the identity matrix. This will make R of S_t, A_t be equal to minus S_t transpose S_t, the negative of the squared norm of the state vector. So this would be penalizing the system quadratically for having a state that's far from zero, assuming that zero is the origin state. So if the goal is to make a helicopter hover around the state zero, then you might choose this sort of reward function.

It turns out it's also very common to choose a cost for the action. So suppose I choose V_t to be equal to an identity matrix; I get minus A_t transpose A_t here, so I penalize large actions as well. Including a quadratic cost for actions is also a fairly common thing to do. In practice, this tends to be effective at discouraging your system from jerking the controls around - it discourages making very huge control commands. Having a term like this in the reward function often makes many systems behave better. Of course, you can choose different values on the diagonal to give different state variables different weights and so on. So there are lots of possible choices for U and V; this is one example.

For the next few steps, I'm going to write things out - derive things - for the general case of non-stationary dynamics. As I write out more math and more equations for LQR, I'm going to try to write it out for the fairly general case of time-varying dynamics and time-varying reward functions. But for the purposes of understanding this material, you might want to think of just ignoring many of the subscripts in terms of t. So for the sake of understanding the material, you might want to mentally assume that there is some fixed matrix A, so that A equals A1 equals A2 equals A3 and so on, and similarly that there's some fixed matrix B. Okay? So I'll write it out for the fully general, non-stationary case, but you might just want to ignore many of the time subscripts and imagine the stationary case for now. Quite a bit later, we're going to talk about an extension of this called differential dynamic programming that will actually use non-stationary dynamics to very powerful effect for a specific algorithm. But for most of what we're about to do, just pretend the MDP is stationary. Okay.

So before I move on, let me just say a couple of words about how you would go about coming up with the linear models. The key assumption in this model is that the dynamics are linear; there's also the assumption that the reward function is quadratic, but let's talk about the assumption that the dynamics are linear.
So the dynamics are S_t plus one equals A S_t plus B A_t - maybe time-varying, maybe stationary; I'm just writing the stationary version for now. So how do you get models like this? We actually saw one example of this already in the previous lecture. If you have an inverted pendulum system and you want to model the inverted pendulum using a linear model like this, one thing you could do is run your inverted pendulum: start it off in some state S0, take some action A0, have it get to some state S1, take action A1 and so on, until you get to some state S_T. The index one denotes that this is my first trial. Then you can repeat this a bunch of times - you can repeat this N times - just executing actions on your physical robot. It could be a robot, it could be a chemical plant, it could be whatever: you try out different actions in your system and watch what states it gets to. You then fit the linear model to your data and choose the parameters A and B that minimize the quadratic error term, the sum over your trials and time steps of the squared norm of S_t plus one minus (A S_t plus B A_t). So this measures how well A S_t plus B A_t predicts S_t plus one, and you minimize that quadratic penalty term. This would be one reasonable way to estimate the parameters of a linear dynamical system for a physical robot or a physical chemical plant or whatever you may have.
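As a sketch of that fitting step (not from the lecture; the trajectory data here is synthetic), A and B can be estimated jointly by ordinary least squares on the stacked state-action pairs:

```python
import numpy as np

n, d = 4, 2                      # state and action dimensions (made-up)
rng = np.random.default_rng(2)

# Pretend we logged (s_t, a_t, s_{t+1}) triples from the real system.
A_true = np.eye(n) + 0.05 * rng.standard_normal((n, n))
B_true = 0.1 * rng.standard_normal((n, d))
S, U, S_next = [], [], []
s = rng.standard_normal(n)
for _ in range(500):
    a = rng.standard_normal(d)
    s_next = A_true @ s + B_true @ a + 0.01 * rng.standard_normal(n)
    S.append(s); U.append(a); S_next.append(s_next)
    s = s_next
S, U, S_next = map(np.array, (S, U, S_next))

# Least squares: minimize sum_t || s_{t+1} - (A s_t + B a_t) ||^2 over A, B.
X = np.hstack([S, U])                        # each row is [s_t, a_t]
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
A_hat, B_hat = W[:n].T, W[n:].T              # recover A (n x n) and B (n x d)

print(np.max(np.abs(A_hat - A_true)), np.max(np.abs(B_hat - B_true)))
```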
Another way to come up with a linear model, if you want to use one for control, is to take a nonlinear model and linearize it. Let me show you what I mean by that. So let's say you have some nonlinear model that expresses S_t plus one as some function F of S_t and A_t. In the example in the previous lecture, I said you could get such a model for the inverted pendulum by referring to the laws of physics - actually, by downloading off-the-shelf software for doing physics simulations. If you haven't seen this before, you can go online; you can easily find many open-source packages for simulating the physics of simple devices like these. Download the software, type in the specifications of your robot, and it will simulate the physics for you. There are lots of open-source software packages like that; you can just download them.

With something like that, you can now build a physics simulator that predicts the state as a function of the previous state and the previous action. So you come up with some function F that says that the state vector at the next time step will be some function of the current state and the current action, where the action in this case is just a real number that says how hard you accelerated to the left or right. Then you can take this nonlinear model. I actually wrote down a sample model in the last lecture, but in general F would be some nonlinear function, rather than a linear function.

So what I mean by linearize is the following. Here's just a cartoon; I'll write down the math in a second. Let's say the horizontal axis is the input state S_t, the vertical axis is the output state S_t plus one, and here's the function F: the next state S_t plus one will be some function of the previous state S_t and the action A_t. So to linearize this model, what you would do is choose a point - we'll call it S bar t - take the derivative of the function at that point, and fit a straight line tangent to the function there. This allows you to approximate the next state S_t plus one as a linear function of the previous state S_t. To make this cartoon really right, the horizontal axis here is really a state-action pair that you're linearizing around - so this is just a cartoon; the horizontal axis represents both the input state and the input action.

To write this out in math, I'll write out the simple case first and the fully general one in a second. Suppose the horizontal axis was only the state - let's pretend there are no actions for now, and S_t plus one is just some function of S_t. Then the linear function I drew would be: S_t plus one is approximated as F prime, evaluated at the point S bar t, times the quantity S_t minus S bar t, plus F of S bar t. So with this, you've expressed S_t plus one as a linear function of S_t. Just note that S bar t is a constant - it's not a variable. Does that make sense? S bar t is a constant, and F prime of S bar t is the gradient of the function F at the point S bar t.
This is really just the equation of that tangent line, and you can then convert it to A and B matrices.

Jumping back one board, I'm going to point out one other thing. Let's say I look at this straight line and I ask how well it approximates my function F - my original simulator, my original function F. Then you notice that in this neighborhood, in the neighborhood of S bar, it's a pretty good approximation; it's fairly close. But as you move further away, moving far off to the left here, it becomes a pretty terrible approximation. So when you linearize a nonlinear model to apply LQR, one of the parameters you have to choose is the point around which to linearize your nonlinear model. If you expect your inverted pendulum system to spend most of its time in the vicinity of some state, then it would be reasonable to linearize around that state, because that means the linear approximation will usually be a good approximation for the states where you expect the system to spend most of its time. If, conversely, you expect the system to spend most of its time at states far to the left, then this would be a terrible location to linearize around. So one rule of thumb is to choose the point to linearize around according to where you expect the system to spend most of its time, so that the linear approximation will tend to be accurate in the vicinity of those states. To be clear, this is about choosing the point S bar, A bar that we'll use to come up with a linear function that we'll pretend is a good approximation to my original nonlinear function F.

So for an example like the inverted pendulum problem: if you expect to do pretty well on this problem, then you would expect the state to often be near the zero state, where S equals zero corresponds to X being the center of the railway track that the inverted pendulum lives on. If you expect to do fairly well, you expect the pole to mostly be upright - upright at zero degrees, or 90 degrees from the track, I guess. So you choose whatever state corresponds to having the pole upright, with near zero velocity, in the middle of the track.
So you usually choose that as the state to linearize your inverted pendulum dynamics around: that's the region where you want your approximation to be good.

So coming back to this formula: I wrote it down for the special case of a one-dimensional state variable with no actions. The general formula for the linearization approximation is that S_t plus one is approximated as F of (S bar t, A bar t), plus the gradient of F with respect to the state, evaluated at (S bar t, A bar t), times (S_t minus S bar t), plus the gradient of F with respect to the action, evaluated at (S bar t, A bar t), times (A_t minus A bar t). Okay? The upside-down triangles are the usual symbol for taking the derivative of F with respect to its first or its vector-valued second argument. So by choosing an appropriate point S bar t, A bar t to linearize around, you can now express S_t plus one as a linear function of the current state S_t and the current action A_t. Again, S bar t is a constant that you choose ahead of time, and the same for A bar t. Lastly, having linearized this thing, you can then read off the matrices A and B, so that S_t plus one is now a linear function of S_t and A_t. Questions about this?

So just one tiny detail - and it's really not a huge deal - is that the expression below is technically an affine function: there might actually be an extra constant there. But this is not a difficult generalization of the linear dynamical system definition. One way to deal with that constant is to take your definition for the state - let's say X, X dot, theta, theta dot - and augment your state vector to have an extra intercept term, a one. With the intercept one, and working out the A matrix, you can then take care of the extra constant, C, as well. So you can deal with this thing being technically an affine function, because of the extra offset, rather than a linear function. But this is just a little bit of bookkeeping you can do for yourself and shouldn't be a huge deal.

So to summarize: you can learn a model, or you can take a nonlinear model - your nonlinear model can be a physics model or a nonlinear model you've learned from data - and linearize it.
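A minimal sketch of that linearization step (not from the lecture): given any black-box simulator f(s, a), estimate A, B, and the extra constant by finite differences around a chosen operating point, then absorb the constant with the intercept trick just described; the dynamics function here is made up.

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Numerically linearize s' = f(s, a) around (s_bar, a_bar):
    f(s, a) ~= c + A (s - s_bar) + B (a - a_bar)."""
    n, d = len(s_bar), len(a_bar)
    c = f(s_bar, a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):                       # Jacobian with respect to the state
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f(s_bar - ds, a_bar)) / (2 * eps)
    for j in range(d):                       # Jacobian with respect to the action
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f(s_bar, a_bar - da)) / (2 * eps)
    return A, B, c

# Made-up nonlinear dynamics standing in for a physics simulator.
f = lambda s, a: np.array([s[0] + 0.1 * s[1], s[1] + 0.1 * np.sin(s[0]) + 0.1 * a[0]])
s_bar, a_bar = np.zeros(2), np.zeros(1)
A, B, c = linearize(f, s_bar, a_bar)

# Intercept trick: augment the state with a constant 1 so the model is purely linear.
const = c - A @ s_bar - B @ a_bar            # affine offset term
A_aug = np.block([[A, const.reshape(-1, 1)],
                  [np.zeros((1, 2)), np.ones((1, 1))]])
B_aug = np.vstack([B, np.zeros((1, 1))])
```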
Now I'll pose the LQR problem, in which we have a specification of the MDP where the states are in R^n, the actions are in R^d, the state transition probabilities are given by the linear equation S_t plus one equals A_t S_t plus B_t A_t, and our rewards are those quadratic functions. So the specification of the MDP means that we know the A matrices, the B matrices, the U matrices and the V matrices. Our goal is to come up with a policy to maximize our finite horizon sum of rewards - to come up with a policy that maximizes the expected value of this finite horizon sum of rewards. Okay.

So our approach to solving this problem will be exactly that finite horizon dynamic programming algorithm that we worked out a little earlier in this lecture. In particular, my strategy for finding the optimal policy will be to first find V star of capital T, and then apply the backward recursion to find V star of T minus one, V star of T minus two and so on.

In the dynamic programming algorithm we worked out, V star subscript capital T of the state S_T is the maximum over the actions you might take at that time of R of S_T, A_T. Again, just for the sake of understanding this material, you can probably pretend the rewards and the dynamics are stationary; I'll write out all these subscripts, which may be confusing if you're seeing this for the first time. So this is equal to the max over A_T of minus the quantity S_T transpose U_T S_T plus A_T transpose V_T A_T - right? I hope this isn't confusing: the superscript Ts denote transposes, and the lowercase ts and capital Ts denote the time index. So that's just the definition of my quadratic rewards at the final time.

This is clearly maximized as minus S_T transpose U_T S_T, because the last term, A_T transpose V_T A_T, is greater than or equal to zero - that's my assumption that V_T is positive semi-definite. So the best action to take in the last time step is just the action zero: pi star subscript capital T of S_T is equal to the argmax over actions of that same thing, which is just zero. By choosing the zero action, A_T transpose V_T A_T becomes zero, and that's how this reward is maximized. Any questions, or is something illegible?
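(The base case just derived, written out:)

$$V^*_T(s_T) \;=\; \max_{a_T}\; -\big(s_T^{\top} U_T\, s_T + a_T^{\top} V_T\, a_T\big) \;=\; -\,s_T^{\top} U_T\, s_T, \qquad \pi^*_T(s_T) \;=\; 0.$$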
Okay. So now let's do the dynamic programming step, where my goal is: given V*_{t+1}, I want to compute V*_t. Given V* at time t+1, I want to compute V* at time t. So this is the dynamic programming step.

The DP step I wrote down previously was this. For the finite-state case, I wrote down the following: V*_t(s_t) = max over actions a_t of R(s_t, a_t), plus the sum over states s_{t+1} of P_{s_t a_t}(s_{t+1}) times V*_{t+1}(s_{t+1}). So this is exactly the equation I wrote down previously, and this is what I wrote down for finite states, where you have these discrete state transition probabilities and we can sum over this discrete set of states. Now we're considering an infinite, continuous state space again, so the sum over states should really become an integral. I'm going to skip the integral step; we'll just go ahead and write that last term as an expectation. So this is going to be max over actions a_t of R(s_t, a_t), plus an expectation over the random next state s_{t+1}, drawn from the state transition probabilities P_{s_t a_t}, of V*_{t+1}(s_{t+1}). So this is the same equation, written down as an expectation.

So what I need to do is: given a representation of V*_{t+1}, I need to find V*_t. It turns out that LQR has the following useful property: each of these value functions can be represented as a quadratic function. So concretely, let's suppose that V*_{t+1} can be expressed as a quadratic function, written like so: V*_{t+1}(s_{t+1}) = s_{t+1}^T Phi_{t+1} s_{t+1} + Psi_{t+1}, where the matrix Phi_{t+1} is an n-by-n matrix and Psi_{t+1} is just a real number. In other words, suppose V*_{t+1} is just a quadratic function of the state s_{t+1}. We can then show that when you do one dynamic programming step - when you plug this definition of V*_{t+1} into the dynamic programming equation I had just now - you can show that V*_t as well will also be a quadratic function of the same form: s_t^T Phi_t s_t + Psi_t, for some appropriate matrix Phi_t and some appropriate real number Psi_t. So what you can do is start off the recursion with - well, does that make sense?
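As a tiny illustration of that representation - hypothetical helper names, not lecture code - a quadratic value function is fully described by the pair (Phi_t, Psi_t):

```python
import numpy as np

def quadratic_value(Phi, Psi):
    """Return V(s) = s' Phi s + Psi for a value function stored as (Phi, Psi)."""
    return lambda s: float(s @ Phi @ s + Psi)

# The last-step value function V*_T(s) = -s' U_T s corresponds to
# Phi_T = -U_T and Psi_T = 0, which is exactly how the recursion gets seeded.
U_T = np.eye(3)
V_T_star = quadratic_value(-U_T, 0.0)
print(V_T_star(np.array([1.0, 2.0, 0.5])))   # -> -(1 + 4 + 0.25) = -5.25
```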
So what you can do is start off the recursion as follows. Previously, we worked out that V*_T(s_T) is -s_T^T U_T s_T. So we have that Phi at capital T is equal to -U_T, and Psi at capital T is equal to zero. Then V*_T(s_T) is equal to s_T^T Phi_T s_T + Psi_T. So you can start out the recursion this way, with Phi_T = -U_T and Psi_T = 0, and then work out what the recursion is.

I won't actually do the full derivation. It's mainly algebra, and you've actually done this sort of Gaussian expectation math a lot in your homework by now. So I won't do the full derivation; I'll just outline the key step. In the dynamic programming step, V*_t(s_t) is equal to the max over actions a_t of the immediate reward - so this was R(s_t, a_t) from my equation in the dynamic programming step - plus an expected value over the random next state s_{t+1}, drawn from the Gaussian distribution with mean A_t s_t + B_t a_t and covariance Sigma_w. So what this is, this is really my specification for P_{s_t a_t}. This is my state transition distribution in the LQR setting - my state transition distribution if I take action a_t in the state s_t. My next state is then distributed Gaussian with mean A_t s_t + B_t a_t and covariance Sigma_w. And inside the expectation we have the value of that next state - this, of course, is just V*_{t+1}(s_{t+1}). I hope this makes sense. This is just taking that equation I had previously in the dynamic programming step - V*_t(s_t) equals the max over actions of the immediate reward plus an expected value, over the next state, of V* of the next state with the clock advanced by one - and plugging in the definitions of the reward, of the state transition distribution, and of the value function. Actually, could you raise your hand if this makes sense? Cool.
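The "Gaussian expectation math" being alluded to is, I believe, the identity E[x' Phi x] = mu' Phi mu + tr(Sigma Phi) for x ~ N(mu, Sigma); it is what produces the trace term in the constant part of the value function. Here is a quick, self-contained numerical check of that identity (illustrative only, with arbitrary made-up matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
Phi = rng.standard_normal((n, n)); Phi = (Phi + Phi.T) / 2   # any symmetric matrix
mu = rng.standard_normal(n)
L = rng.standard_normal((n, n)); Sigma = L @ L.T             # a PSD covariance

# Closed form: E[x' Phi x] = mu' Phi mu + tr(Sigma Phi) for x ~ N(mu, Sigma)
closed_form = mu @ Phi @ mu + np.trace(Sigma @ Phi)

# Monte Carlo estimate of the same expectation
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
monte_carlo = np.mean(np.einsum('ij,jk,ik->i', xs, Phi, xs))

print(closed_form, monte_carlo)   # the two numbers should be close
```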
So if you write this out and you expand the expectation - I know you've done this many times, so I won't do it - this whole thing on the right-hand side simplifies to a big quadratic function of the action a_t. So this whole thing simplifies to a big quadratic function of the action a_t, and we want to maximize it with respect to the action a_t. So to maximize a big quadratic function, you just take the derivative of the function with respect to the action a_t, set the derivative equal to zero, and then you've maximized the right-hand side with respect to the action a_t.

It turns out - I'm just going to write this expression down for completeness; you can derive it yourself at any time - that if you actually maximize that thing on the right-hand side as a function of the action a_t, you find that the maximizing a_t is going to be that expression times s_t. Don't worry about this expression; you can get it from the board and derive it yourself. But the key thing to note is that the optimal action a_t is going to be some big matrix, which we're going to call L_t, times s_t. In other words, the optimal action to take in this given state is going to be some linear function of the state s_t.

So, having done dynamic programming - you remember also, when we worked out the dynamic programming algorithm for finite-horizon MDPs, we said that the way you compute the optimal policy is pi*_t(s_t) equals the argmax over actions a_t of the same thing: R(s_t, a_t) plus your expected value, over s_{t+1} drawn from P_{s_t a_t}, of V*_{t+1}(s_{t+1}). That thing on the right-hand side is always the same thing as the thing we just maximized. So what this means is that, having found the value of a_t that maximizes this, we've found that the optimal action to take from the state s_t is actually equal to L_t times s_t. What we've shown is that when you're in some state s_t, the optimal action for that state is going to be some matrix L_t, which you can compute, times the state s_t.
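For completeness, here is one way that board expression comes out if you carry the derivative through yourself, under the reward convention R(s, a) = -(s'Us + a'Va) used above. Treat this as a sketch of my own working, to be checked against your own derivation, not as a transcript of the board:

```python
import numpy as np

def gain_matrix(A, B, U, V, Phi_next):
    """
    Optimal-action gain L_t for one LQR backup, assuming
      reward    R(s, a) = -(s' U s + a' V a)
      dynamics  s_{t+1} = A s_t + B a_t + w_t,  w_t ~ N(0, Sigma_w)
      V*_{t+1}(s) = s' Phi_next s + Psi_next.
    Setting d/da [ -a'Va + (As + Ba)' Phi_next (As + Ba) ] = 0 gives
      (V - B' Phi_next B) a = B' Phi_next A s,
    so a_t = L_t s_t with L_t as below.  Since Phi_next is negative
    semi-definite, V - B' Phi_next B is invertible whenever V is positive definite.
    """
    return np.linalg.solve(V - B.T @ Phi_next @ B, B.T @ Phi_next @ A)
```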
In other words, the optimal action is actually a linear function of the state. I'm just going to point out, this is not function approximation here, right? What we did not do - we did not say, let's find the optimal linear policy. We didn't say, let's look at the optimal policy and then fit a straight line to it. This is not about approximating the optimal policy with a straight line. This derivation is saying that the optimal policy is a straight line: the optimal action is a linear function of the current state.

Moreover, we've now worked out the value of a_t that maximizes the thing on the right-hand side. So you take this and plug it back in to do the dynamic programming recursion. What you find is that - so you take a_t and plug it back in to do the maximization - it will actually get you this formula for V*_t(s_t). So you find that it will indeed be a quadratic function of the following form, s_t^T Phi_t s_t + Psi_t, where - and I just write out the equations for the sake of completeness; don't worry too much about their forms, you can derive them yourself.

So just to summarize: don't worry too much about the forms of these equations. What we've done is written down the recursion that expresses Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}. So this allows you to compute the optimal value function for when the clock is at time lowercase t, as a function of the optimal value function for when the clock is at time t+1.

So, to summarize - and actually, just to give this equation a name as well: this recursion, in terms of the Phi_t's, is called the discrete-time Riccati equation. It's the recursion that gives you Phi_t in terms of Phi_{t+1}.

So, to summarize, our algorithm for finding the exact solution to finite-horizon LQR problems is as follows. We initialize Phi_T to be equal to -U_T and Psi_T to be equal to zero. Then we recursively calculate Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1} with the discrete-time - actually, excuse me.
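Since the board equations are left as "derive it yourself", here is one way to write that backup step in code - a sketch under the same assumptions and sign conventions as the gain_matrix sketch above; Phi_t is obtained by plugging a_t = L_t s_t back into the backup, which is algebraically equivalent to the usual discrete-time Riccati form:

```python
import numpy as np

def lqr_backup(A, B, U, V, Sigma_w, Phi_next, Psi_next):
    """
    One dynamic-programming step: given (Phi_{t+1}, Psi_{t+1}) representing
    V*_{t+1}(s) = s' Phi_{t+1} s + Psi_{t+1}, return (Phi_t, Psi_t, L_t).
    """
    L = np.linalg.solve(V - B.T @ Phi_next @ B, B.T @ Phi_next @ A)
    A_cl = A + B @ L                    # closed-loop dynamics under a_t = L s_t
    # Plug a_t = L s_t back into  -s'Us - a'Va + E[V*_{t+1}(s_{t+1})]:
    Phi = -U - L.T @ V @ L + A_cl.T @ Phi_next @ A_cl
    Psi = float(np.trace(Sigma_w @ Phi_next)) + Psi_next   # Gaussian-noise term
    return Phi, Psi, L
```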
So, recursively calculate Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}, as I showed, using the discrete-time Riccati equation. You do this for t = T-1, T-2 and so on, down to time zero. Then you compute L_t as a function of - actually, is it Phi_t or Phi_{t+1}? Phi_{t+1}, I think - as a function of Phi_{t+1} and Psi_{t+1}. This is actually a function of only Phi_{t+1}; you don't really need Psi_{t+1}. Now you have your optimal policy. So having computed the L_t's, you now have the optimal action to take in the state s_t, just given by this linear equation, a_t = L_t s_t.

How much time do I have left? Okay. Let me just say one last thing about this before I close. Maybe I'll do it next week - I think I'll do it next session instead. It actually turns out there's one cool property about this that is kind of subtle, but you'll find it out in the next lecture. Are there questions about this before we close for today, then?

So the very cool thing about the solution of discrete-time LQR problems - finite-horizon LQR problems - is that this is a problem with an infinite, continuous state space. But nonetheless, under the assumptions we made, you can prove that the value function is a quadratic function of the state. Therefore, just by computing these matrices Phi_t and the real numbers Psi_t, you can exactly represent the value function, even for these infinitely large state spaces, even for continuous state spaces.

So the computation in these algorithms scales only like the cube - scales only as a polynomial - in the number of state variables, whereas with discretization the curse of dimensionality gave us algorithms that scaled exponentially in the dimension of the problem. LQR scales only like the cube of the dimension of the problem, so this easily applies to problems with even very large state spaces. We actually often apply variations of this algorithm to some particular subset of the things we do on our helicopter, which has a high-dimensional state space, with twelve or higher dimensions. This has worked very well for that. So it turns out there are even more things you can do with this, and I'll continue with that in the next lecture. So let's close for today, then.
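Putting the pieces together, here is a compact sketch of the full algorithm just summarized - backward pass seeded with Phi_T = -U_T, Psi_T = 0, then acting with a_t = L_t s_t going forward. It reuses the same assumptions (stationary, made-up matrices and the sign conventions above); a time-varying problem would simply index A, B, U, V by t:

```python
import numpy as np

def solve_finite_horizon_lqr(A, B, U, V, Sigma_w, T):
    """Backward pass: returns the gains L_0, ..., L_{T-1}."""
    Phi_next, Psi_next = -U, 0.0            # Phi_T = -U_T, Psi_T = 0
    gains = []
    for _ in range(T):                      # t = T-1, T-2, ..., 0
        L = np.linalg.solve(V - B.T @ Phi_next @ B, B.T @ Phi_next @ A)
        A_cl = A + B @ L
        Phi = -U - L.T @ V @ L + A_cl.T @ Phi_next @ A_cl
        Psi = float(np.trace(Sigma_w @ Phi_next)) + Psi_next
        gains.insert(0, L)
        Phi_next, Psi_next = Phi, Psi
    return gains

def simulate(A, B, Sigma_w, gains, s0, rng):
    """Forward pass: act with a_t = L_t s_t through the noisy dynamics."""
    s = s0
    for L in gains:
        a = L @ s                           # optimal action is linear in the state
        s = A @ s + B @ a + rng.multivariate_normal(np.zeros(len(s0)), Sigma_w)
    return s

# Example usage with the made-up matrices from the first sketch:
#   gains = solve_finite_horizon_lqr(A, B, U, V, Sigma_w, T)
#   s_final = simulate(A, B, Sigma_w, gains, np.zeros(n), np.random.default_rng(0))
```

Note that only the Phi_t's (through the L_t's) are needed to act; the Psi_t's matter only if you want the exact value of the expected sum of rewards, which is where the cubic-in-n matrix operations per step come from.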