This presentation is delivered by the Stanford Center for Professional Development.

Okay. Welcome back. What I want to do today is talk about one of my favorite algorithms for controlling MDPs, which I think is one of the more elegant, efficient, and powerful algorithms that I know of. So what I'll do is first start by talking about a couple of variations of MDPs that are slightly different from the MDP definition you've seen so far. These are pretty common variations. One is state-action rewards, and the other is finite horizon MDPs. Using this slightly modified definition of an MDP, I'll talk about linear dynamical systems. I'll spend a little bit of time talking about models of linear dynamical systems, and then talk about LQR, or linear quadratic regulation control, which will lead us to a kind of Riccati equation, which is something we will solve in order to do LQR control.

So just to recap - and we've seen this definition many times now.
We've been defining an MDP as a tuple (S, A, {P_sa}, γ, R) of states, actions, state transition probabilities, a discount factor, and a reward function, where gamma, the discount factor, is a number between zero and one, and R, the reward function, was a function mapping from the states to the real numbers. So we had value iteration, which would do this:

V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s')

So after a while, value iteration will cause V to converge to V*. Then, having found the optimal value function, you compute the optimal policy by essentially taking the argmax over a of the equation above:

π*(s) = argmax_a Σ_{s'} P_sa(s') V*(s')

So in value iteration, as you iterate - you know, perform this update - the function V will converge to V*. So with an increasing number of iterations, you get closer and closer to V*. This actually converges exponentially quickly to V*, though we will never exactly reach V* in a finite number of iterations.

So what I want to do now is describe a couple of common variations of MDPs for which we use slightly different definitions.
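To make the update concrete, here is a minimal sketch of this value iteration in code. Everything numeric here - the three-state, two-action MDP, the random transition matrix, the rewards, gamma = 0.9 - is an invented toy for illustration, not something from the lecture.

```python
import numpy as np

# Value iteration for a small discrete MDP with state-only rewards R(s).
# The MDP itself (sizes, transitions, rewards) is made up for illustration.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize rows into distributions

R = np.array([0.0, 1.0, 10.0])             # reward depends only on the state
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
    Q = P @ V                              # Q[a, s] = sum_{s'} P[a, s, s'] V[s']
    V_new = R + gamma * Q.max(axis=0)      # max over actions, pushed past R(s)
    if np.max(np.abs(V_new - V)) < 1e-10:  # converges exponentially quickly
        V = V_new
        break
    V = V_new

# Greedy policy from the (near-)optimal value function:
pi = (P @ V).argmax(axis=0)                # pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s')
print(V, pi)
```

Note how the max sits inside, after the reward term: because R depends only on s, the immediate reward can be pulled out of the maximization, which is exactly the point the next variation changes.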
First, we'll change the reward function, and second, we'll do something slightly different from just discounting. Remember, in the last lecture I said that for infinite-state or continuous-state MDPs, we couldn't apply the most straightforward version of value iteration, because with a continuous-state MDP we need to use some approximation of the optimal value function. Later in this lecture, I'll talk about a special case of MDPs where you can actually represent the value function exactly, even if you have an infinite-state space or even a continuous-state space. I'll do that - talk about this special class of infinite-state MDPs - using this new variation of the reward function and this alternative to discounting, which make the formulation a little easier.

So the first variation I want to talk about is state-action rewards. So I'm going to change the definition of the reward function. It turns out this won't be a huge deal. In particular, I change the reward function to be a function mapping from a state-action pair to the real numbers:

R : S × A → ℝ

What I mean by this is just the following.
You start off in some state s0. You take an action a0, and as a result of your state-action choice, you transition to some new state, s1. You take some action, a1. You transition to some new state, s2. You take some action, a2, and so on. So this is the state-action sequence that you see. So in an MDP where you have state-action rewards, your total payoff is now defined as

R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ...

where your reward function is now a function both of the current state and of the action you took in the current state. So this is my total payoff.

Then, as usual, my goal will be to find a policy - to find a function mapping from states to actions - so that when I execute that policy, I maximize the expected value of my total payoff. Given this definition, it actually turns out that given an MDP with state-action rewards, you can reduce this back to an MDP with rewards that are a function only of the states, by changing the definition of the states. That may or may not be obvious to you. Don't worry if it isn't.
But using state-action rewards allows you to more directly model problems in which different actions have different costs. So a running example is the robot. Say for a robot, it's more costly for the robot to move than for it to stay still. If you give an action to stay still, the action to stay still may have a lower cost, because you're not using battery power and having to recharge for that action. Another example would be - actually, another navigation example would be if you have an outdoor vehicle. You need to drive over some sort of outdoor terrain, like very rough rocks, or driving over grass. That may be more costly, more difficult, than driving over, say, a paved road. So you may assign an action that requires driving over grass or driving over rocks a higher cost than driving over paved road.

So this really isn't a huge change to the definition of an MDP. I won't really bother to justify this a lot, but Bellman's equation generalizes in the way that you probably expect. V*(s) is now equal to

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P_sa(s') V*(s') ]

So previously, when the reward function was just a function of the state s, we could take the max and push it inside.
But now that the reward is a function of the action you're taking as well, the max comes outside. So this says that your expected total payoff, starting from the state s, is equal to, first, your immediate reward R(s, a) for executing some action a in state s, plus gamma times your future expected total payoff. So the term inside the brackets is your expected total payoff if you take the action a from the current state. And since V* is the optimal value function, your optimal expected total payoff is the max over all actions of the thing on the right.

Let's see. Value iteration, which I'm abbreviating VI, is really the same algorithm. V(s) is updated as the max over a of R(s, a) plus the same thing:

V(s) := max_a [ R(s, a) + γ Σ_{s'} P_sa(s') V(s') ]

You just keep updating V(s) using the right-hand side of Bellman's equation, and again, you get value iteration exactly the same way. Then finally, having found the optimal value function V* using the value iteration algorithm, you can then compute the optimal policy π*(s) the same as before. The best action to take in the state s is the action a that maximizes the thing on the right-hand side:

π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} P_sa(s') V*(s') ]
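As code, the only change from the state-reward version of value iteration is where the max sits. This is a sketch on another invented toy MDP; the sizes, random transitions, and random rewards are made up for illustration.

```python
import numpy as np

# Value iteration when the reward is R(s, a) rather than R(s).
# The max over actions now wraps the immediate reward as well.
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

R = rng.random((n_actions, n_states))      # R[a, s]: reward depends on the action too
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # V(s) := max_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V(s') ]
    Q = R + gamma * (P @ V)                # Q[a, s], immediate reward included
    V_new = Q.max(axis=0)                  # the max comes outside, past R(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

# pi*(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V*(s') ]
pi = (R + gamma * (P @ V)).argmax(axis=0)
print(V, pi)
```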
So having used value iteration to compute the optimal value function, you can then find π* using that. So this was the easier of the two variations of MDPs that I'll describe. Any questions? Actually, are there questions about this?

So the other variation, the other alternative definition, will be finite horizon MDPs. So a finite horizon MDP comprises a tuple (S, A, {P_sa}, T, R) of states, actions, state transition probabilities, a parameter T, and the reward function. Here, T is a parameter called the horizon time. Concretely, what this really means is that we'll be taking actions in the MDP only for a total of capital T time steps. So we won't use discounting anymore. [Audio cuts out] You start in some state s0, take action a0, get to some other state s1, take action a1, and so on. Eventually, you get to some state s_T after T time steps. So my total payoff, now, will be given by this sum, from time zero up to time T, of my rewards:

R(s0, a0) + R(s1, a1) + ... + R(s_T, a_T)

Okay?
My goal, as usual - so this is my total payoff. My goal, as usual, is to maximize the expected value of my total payoff. We want to come up with a policy to maximize the expected value of this total payoff. The key difference is that the world only exists for T time steps, and after that, there are no more rewards to be collected.

So this turns out to be more of a difference than you might expect, because it turns out that the optimal policy may be non-stationary. The term stationary means that it doesn't depend on time. Non-stationary means that it may depend on time. So non-stationary roughly means that my optimal action to take will be different for different time steps. That's what non-stationary means. Just as an example of that, imagine that we have some robot. Let's say the robot is here. Let's say that there's a goal over there with a plus-one reward. Much further away, there's a plus-ten reward. So depending on how much time you have left on the clock, it may be better to go after the plus-one or the plus-ten reward.
If it's still early in the game and you still have a lot of time, it may be better to head toward the plus-ten reward and collect a much larger reward. If you only have a couple of time steps left - if the clock has nearly reached the time capital T - then you may not have enough time to get to the plus-ten reward, and you'd be better off heading for the plus-one reward that's much closer by. So what this example illustrates is that when you're in that state, the best action to take could be to go left or to go right, depending on what time it is. So this is just an example illustrating how the optimal policy can be non-stationary.

In fact, since we have non-stationary policies anyway in this setting, what I'm going to do next is allow non-stationary transition probabilities as well. So I'll just write that up there. What I mean is that so far, we've been assuming that the state s_{t+1} is drawn from the state transition probabilities P_{s_t a_t}, indexed by the previous state and the previous action. I've been assuming that these state transition probabilities are the same for all times. So if I'm in some state and take some action, the distribution over the next state doesn't depend on time.
So I'm going to allow a slightly more general definition as well, in which we have non-stationary state transition probabilities, so that the chance of where you end up may also depend on what time it is:

s_{t+1} ~ P^(t)_{s_t a_t}

So as an example of these non-stationary state transition probabilities, one example would be if you model flying an aircraft over a long distance. Then as the aircraft flies, you burn fuel and become lighter, so the dynamics of the aircraft actually change over time. The mass of the aircraft can change significantly over time as you burn fuel. So depending on what time it is, your next state could actually depend not only on your current state and your action, but also on how much fuel you've burned and, therefore, what time it is.

Another example, another aerospace one, is if you have the weather forecast for the next 24 hours, say - you know what the winds and precipitation are going to be like over the next 24 hours. Then again, if you fly the aircraft from, say, here to New York, it may cost different amounts to fly different routes at different times.
Maybe flying over the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or if you do it a few hours from now, when the weather forecast may be really bad.

For an example you see every day, same thing for traffic, right? Depending on where you live - certainly here in California - there are times of day when traffic is really bad in lots of places. So the cost of driving certain roads may vary depending on what time of day it is. Lots of other examples: in industrial automation, different machines in the factory may be available to different degrees at different times of day, and it costs different amounts to hire different workers, depending on whether you pay overtime and so on. So the cost of doing different things in the factory can also be a function of time. The state transition probabilities can also be a function of time.

Lastly, while we're doing this, to make this fully general, we might as well have non-stationary reward functions as well, where you might also index the reward function with a time superscript, R^(t)(s, a), since the cost of doing different things may depend on the time as well.
Actually, those are enough examples of non-stationary MDPs. So now that we have a non-stationary policy, let's talk about an algorithm to actually try to find the optimal policy.

So let me define the following. This is now a slightly modified definition of the optimal value function. I'll just write this down, I guess. So I'm going to define the optimal value function, and this is going to be indexed by t, with a subscript t. The optimal value of a state for time t, we're going to define as your optimal expected sum of rewards if you start the MDP at that state s, and if the clock starts off at time lowercase t:

V*_t(s) = max_π E[ R^(t)(s_t, a_t) + R^(t+1)(s_{t+1}, a_{t+1}) + ... + R^(T)(s_T, a_T) | s_t = s, π ]

So the optimal value of a state will depend on what time it is and how much time you have left to run this MDP. Therefore, the sum on the right sums only over time t, time t plus one, time t plus two, up to time capital T. I'll just state it in English again: this is your expected optimal total payoff if you start your system in a state s, and if the clock is already at time lowercase t.

So it turns out there's a version of value iteration you can use to compute this.
Let me just write out the value iteration algorithm for this. It turns out you can - well, let me just write this out. I'll write this below. It turns out you can compute the optimal value function for the MDP using the following recursion, which is very similar to what we have for value iteration. We're going to set V*_t(s) to be equal to the max over a, same as before:

V*_t(s) = max_a [ R^(t)(s, a) + Σ_{s'} P^(t)_{sa}(s') V*_{t+1}(s') ]

Okay? So if I start the clock at time t in state s, my expected total payoff is equal to the max, over the actions a I may take, of my immediate reward R^(t)(s, a) for taking that action a in that state s, plus my expected future payoff. So if I take action a, I transition with probability P^(t)_{sa}(s') to some new state s'. If I get to the state s', my total expected payoff from the state s' is given by V*, now with subscript t plus one, of s'. The subscript t plus one reflects that after I've taken one action, my clock will have advanced from time t to time t plus one. So this is V* subscript t plus one.
So this expresses V*_t in terms of V*_{t+1}. Then lastly, to start off this recursion, you would have

V*_T(s) = max_a R^(T)(s, a)

If you're already at time capital T, then you just get to take one action, and then the clock runs out. So this is V*_T: your value of starting in some state s with just one time step, with no time left on the clock after that.

So in the case of a finite horizon MDP, this actually gives rise to a very nice dynamic programming algorithm, in which you start off by computing V*_T. Then you use this backward recursion to compute V*_{T-1}, V*_{T-2}, and so on. You recurse backwards until you've computed V* for all of your time steps. Can you see this board? Cool.

Then the final step is - previously, we said that π*(s) - I'm going to come back and change this a bit - was the argmax over a of R(s, a) plus gamma times the sum over s' of P_sa(s') V*(s'). This is sort of what we had.
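Collecting the pieces written on the board, the whole finite-horizon dynamic program can be summarized in one place (no discount factor appears, per this finite-horizon formulation; the superscripts mark the non-stationary transitions and rewards):

```latex
% Finite-horizon dynamic programming: base case, backward recursion, policy.
\begin{align*}
V_T^*(s)   &= \max_a R^{(T)}(s, a) \\
V_t^*(s)   &= \max_a \Big[ R^{(t)}(s, a)
              + \sum_{s'} P_{sa}^{(t)}(s')\, V_{t+1}^*(s') \Big],
              \qquad t = T-1, \dots, 0 \\
\pi_t^*(s) &= \arg\max_a \Big[ R^{(t)}(s, a)
              + \sum_{s'} P_{sa}^{(t)}(s')\, V_{t+1}^*(s') \Big]
\end{align*}
```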
In the finite horizon case, the optimal action may depend on what time it is. So the optimal action to take at time t is the argmax over actions a:

π*_t(s) = argmax_a [ R^(t)(s, a) + Σ_{s'} P^(t)_{sa}(s') V*_{t+1}(s') ]

This is basically the argmax of exactly that same thing on the right-hand side as we had in our dynamic programming algorithm. So you do this for every time step, and now you've computed your optimal policy for the different time steps. Again, this is a non-stationary policy, because π*_t(s) may depend on what time it is.

So the difference between this and the earlier version of value iteration is - what you do is you compute V*_T. Then, using the backward recursion of that dynamic programming algorithm, you compute V*_{T-1}, then V*_{T-2}, and so on, down to V*_0. Then from these, you compute π*_t. So there's not a huge difference, but one minor difference from the infinite horizon discounted case is that by running this recursion once, you now have exactly the right value function. So this just computes the value function, rather than merely converging to it asymptotically. This gives you the right value function in one pass. Cool.
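As a concrete sketch, the whole backward pass might look like this in code. The sizes, the horizon, and the random per-time-step transition and reward arrays are all invented for illustration; having one model per time step is what makes this the non-stationary case.

```python
import numpy as np

# Backward dynamic-programming pass for a finite horizon MDP with
# non-stationary transitions P[t][a, s, s'] and rewards R[t][a, s].
# All sizes and numeric values are made up for illustration.
n_states, n_actions, T = 4, 2, 5
rng = np.random.default_rng(2)

P = [rng.random((n_actions, n_states, n_states)) for _ in range(T + 1)]
for Pt in P:
    Pt /= Pt.sum(axis=2, keepdims=True)    # normalize into distributions
R = [rng.random((n_actions, n_states)) for _ in range(T + 1)]

V = np.empty((T + 1, n_states))
pi = np.empty((T + 1, n_states), dtype=int)

# Base case: at time T you take one action, then the clock runs out.
V[T] = R[T].max(axis=0)
pi[T] = R[T].argmax(axis=0)

# Backward recursion: V*_t from V*_{t+1}, for t = T-1 down to 0.
# Note: no discount factor in this finite-horizon formulation.
for t in range(T - 1, -1, -1):
    Q = R[t] + P[t] @ V[t + 1]             # Q[a, s]
    V[t] = Q.max(axis=0)
    pi[t] = Q.argmax(axis=0)               # the optimal policy is non-stationary

print(V[0], pi[0])
```

One pass of this loop yields the exact optimal value function for every time step, unlike discounted value iteration, which only converges toward it.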
Any questions there? Yeah.

Interviewee: [Inaudible].

Instructor (Andrew Ng): Is this computation much shorter than value iteration? So, sort of yes and no. It depends on how large capital T is.

Interviewee: [Inaudible] the normal MDP, could [inaudible] and then use this case for that case?

Instructor (Andrew Ng): I see. So for the normal MDP, can you assume a capital T and then use this? It actually turns out that - that's a great question. Let me just answer this in a hand-wavy way. So it actually turns out, for a discounted infinite horizon MDP with some discount factor gamma, what you can do is you can say: after T time steps, gamma to the T would be really small - it would be epsilon or something - and I don't really care what happens after that many time steps, because the rewards there are multiplied by gamma to the T; after that, I don't really care. So you can ask: can I take my infinite horizon discounted MDP and approximate it with a finite horizon MDP, where the number of time steps T is chosen so that this holds true? It turns out you can do that. You end up with some value for T; you can solve for T so that this holds true.
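The "solve for T" step can be made precise under one standard assumption (not stated explicitly in the lecture): if rewards are bounded by some R_max, the discounted payoff ignored after step T is at most gamma^T · R_max / (1 − gamma), and you pick the smallest T making that tail at most epsilon. The numbers below are illustrative only.

```python
import math

# Horizon-truncation sketch: choose T so the discounted tail
#   gamma^T * r_max / (1 - gamma)
# is at most epsilon. Assumes rewards bounded in magnitude by r_max.
def horizon_for_tail(gamma: float, r_max: float, epsilon: float) -> int:
    """Smallest integer T with gamma**T * r_max / (1 - gamma) <= epsilon."""
    # gamma^T <= epsilon * (1 - gamma) / r_max   =>   T >= log(bound)/log(gamma)
    bound = epsilon * (1 - gamma) / r_max
    return max(0, math.ceil(math.log(bound) / math.log(gamma)))

# Illustrative values, not from the lecture:
gamma, r_max, epsilon = 0.9, 1.0, 1e-3
T = horizon_for_tail(gamma, r_max, epsilon)
print(T, gamma**T * r_max / (1 - gamma))
```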
It turns out you can prove that if you took the original value iteration algorithm - the version for discounted MDPs - and you ran it for that same number of time steps, you would end up with an approximation to the value function that is about this close, up to some small constant factors. So doing that, you end up with roughly the same amount of computation anyway. And then you actually end up with a non-stationary policy, which is more expensive to keep around: you need to keep around a different policy for every time step, which is not as nice as having a stationary policy, the same policy for all times. So there are other reasons why you might sometimes take an infinite horizon discounted problem and approximate it with a finite horizon problem, but this particular reason is not one of them. Does that make sense? More questions?

Interviewee: [Inaudible]?

Instructor (Andrew Ng): Is there a gamma in this? So if you want, you can actually change the definition of an MDP and use a finite horizon discounted MDP. If you want, you can do that. You can actually come in and put a gamma there, and use discounting in the finite horizon.
It turns out that usually, for most problems people deal with, you either use discounting or you use the finite horizon; it's been less common to do both, but you can certainly do that as well. One of the nice things about discounting is that it makes sure your value function is finite. Algorithmically and mathematically, one of the reasons to use discounting is that because your rewards are multiplied by an exponentially decaying factor — it's a geometrically decaying series — the value function is always finite. That's a really nice mathematical property of discounting. When you have a finite horizon, the value function is also guaranteed to be finite, so you don't have to use discounting. But if you want, you can discount as well.

Interviewee: [Inaudible].

Instructor (Andrew Ng): Yeah, yes, you're right. If you want, you can redefine the reward function to fold the discount into the reward function, since we allow non-stationary rewards as well.

So that was finite horizon MDPs. What I want to do now is actually use both of these ideas — state action rewards and finite horizon MDPs — to describe a special case of MDPs that makes very strong assumptions about the problem.
But these assumptions are reasonable for many systems, and with these assumptions, what we come up with are, I think, very nice and very elegant algorithms for solving even very large MDPs.

So let's talk about linear quadratic regulation. We just talked about dynamic programming for finite horizon MDPs, so remember that algorithm. When I come back to talk about an algorithm for solving LQR problems, I'm actually going to use exactly that dynamic programming algorithm that you just saw for finite horizon MDPs. I'll be using exactly that algorithm again, so just remember it for now.

So let's talk about LQR. I want to take these ideas and apply them to MDPs with continuous state spaces and maybe even continuous action spaces. So to specify an MDP, I need to give you the five-tuple of states, actions, transition probabilities, horizon, and reward; I'm going to use the finite horizon, capital T, rather than discounting. So in LQR problems, I'm going to assume the following. I'm going to assume that the state space is continuous, S = R^n, and I'm going to assume a continuous set of actions lying in R^d.
To specify the state transition probabilities P_{sa}, I need to tell you what the distribution of the next state is, given the current state and the current action. So we actually saw a little bit of this in the last lecture. I'm going to assume the next state, s_{t+1}, is a linear function of the previous state and action, plus noise:

s_{t+1} = A_t s_t + B_t a_t + w_t,

where w_t is Gaussian noise with mean zero and some covariance sigma_W. I've subscripted A and B with t here to show that these matrices could change over time; that would be non-stationary dynamics.

As a point of notation: unfortunately, because we're compiling ideas from multiple literatures, capital A denotes both a set of actions as well as a matrix. When you see A later on, it will usually be used to denote a matrix rather than a set of actions. Sorry to overload notation again, but when you have research ideas from multiple research communities, often they share the same symbols.

So just to be concrete, A_t is a matrix that's n by n.
And the B_t are matrices that are n by d. Just to be completely clear, the matrices A and B, I'm going to assume, are fixed and known in advance. So I'm going to give you the matrices A and B, and I'm going to give you sigma_W; your job is to find a good policy for this MDP. So in other words, this is my specification of the state transition probabilities.

Looking ahead — we'll see this later — it turns out that the treatment of this noise term is not very important. We can pretty much ignore the noise term, and we'll still do fine. This is just a warning: in the sequel, in what I do later, I might be slightly sloppy in my treatment of the noise term; in this very special case, it's unimportant.

The last things I have to specify are some horizon time, capital T, and a reward function. For LQR control, I'm going to assume that the reward function can be written as

R(s_t, a_t) = -(s_t^T U_t s_t + a_t^T V_t a_t),

where U_t is a matrix that's n by n, and V_t is a matrix that's d by d.
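The dynamics just specified, s_{t+1} = A s_t + B a_t + w_t, can be sketched as a small simulation. The particular A, B, and Sigma_w below, and the feedback action used in the loop, are made-up illustrative values, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 1-action system; A, B, Sigma_w are made-up values.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Sigma_w = 0.01 * np.eye(2)

def step(s, a):
    """One step of the dynamics: s' = A s + B a + w, with w ~ N(0, Sigma_w)."""
    w = rng.multivariate_normal(np.zeros(2), Sigma_w)
    return A @ s + B @ a + w

s = np.zeros(2)
for t in range(50):
    s = step(s, np.array([-0.5 * s[1]]))  # an arbitrary feedback action
```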
I'll assume that U_t and V_t are both positive semi-definite — both PSD. The fact that U_t and V_t are both positive semi-definite matrices implies that s_t^T U_t s_t >= 0 and, similarly, a_t^T V_t a_t >= 0. So this implies that your rewards are always negative — this is a somewhat depressing MDP in which there are only costs and no positive rewards — because of the minus sign there.

So as a concrete example of how you might want to apply this: you've seen my helicopter videos, right? So one thing is, for example, suppose you have a helicopter, and you want the state s_t to be as close to zero as possible. Then you might choose U_t to be equal to the identity matrix. This makes R(s_t, a_t) equal to -s_t^T s_t — the negative of the squared norm of the state vector.
So this penalizes the system quadratically for having a state that's far from zero, assuming that zero's the origin state. So if the goal is to make a helicopter hover around the state zero, then you might choose this sort of reward function.

It turns out it's also very common to charge a cost for the action. So suppose I choose V_t to be equal to the identity matrix; I get a -a_t^T a_t term here, which penalizes large actions as well. Including a quadratic cost for actions is also a fairly common thing to do. In practice, this tends to be effective at discouraging your system from jerking the controls around — this discourages making very large control commands. Having a term like this in the reward function often makes many systems behave better. Of course, you can choose different values on the diagonal to give different state variables different weights, and so on. So there are lots of possible choices for U and V; this is one example.

So for the next few steps, I'm going to write out things — I'm going to derive things — for the general case of non-stationary dynamics.
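The quadratic reward with U = I and V = I can be sketched numerically; the state and action values below are arbitrary.

```python
import numpy as np

def reward(s, a, U, V):
    """LQR reward R(s, a) = -(s^T U s + a^T V a); always <= 0 for PSD U, V."""
    return -(s @ U @ s + a @ V @ a)

n, d = 2, 1
U = np.eye(n)  # penalize squared distance of the state from the origin
V = np.eye(d)  # penalize large control commands
r = reward(np.array([3.0, 4.0]), np.array([2.0]), U, V)
print(r)  # -(25 + 4) = -29.0
```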
As I write out more math and more equations for LQR, I'm going to try to write it out for the fairly general case of time-varying dynamics and time-varying reward functions. But for purposes of understanding this material, you might want to just ignore many of the subscripts in terms of t. So for the sake of understanding the material, you might want to mentally assume that there is some fixed matrix A, so that A = A_1 = A_2 = A_3 and so on, and similarly that there's some fixed matrix B. Okay? So I'll write it out for the fully general, non-stationary case, but you might just want to ignore many of the time subscripts and imagine the stationary case for now.

Quite a bit later, we're going to talk about an extension of this called differential dynamic programming that will actually use non-stationary dynamics to very powerful effect in a specific algorithm. But for most of what we're about to do, just pretend the MDP is stationary. Okay.

So before I talk about models — let me just say a couple of words about how you would go about coming up with linear models. The key assumption in this model is that the dynamics are linear.
There's also the assumption that the reward function is quadratic, but let's talk about the assumption that the dynamics are linear: s_{t+1} = A s_t + B a_t — maybe time-varying, maybe stationary; I'm just writing the stationary case for now. So how do you get models like this?

We actually saw one example of this already in the previous lecture. If you have an inverted pendulum system, and you want to model the inverted pendulum using a linear model like this, one thing you could do is run your inverted pendulum: start it off in some state s_0^(1), take some action a_0^(1), have it get to some state s_1^(1), take action a_1^(1) and so on, until you get to some state s_T^(1). The superscript one denotes that this is my first trial. Then you can repeat this a bunch of times — repeat it m times, say — just executing actions on your physical robot. It could be a robot, it could be a chemical plant, it could be whatever: trying out different actions in your system and watching what states it gets to. Then you fit a linear model to your data, and choose the parameters A and B that minimize the quadratic error term:

min over A, B of the sum over trials i and time steps t of || s_{t+1}^(i) - (A s_t^(i) + B a_t^(i)) ||^2.
So this says: how well does A s_t + B a_t predict s_{t+1}? You minimize that quadratic penalty term. This would be one reasonable way to estimate the parameters of a linear dynamical system for a physical robot, or a physical chemical plant, or whatever you may have.

Another way to come up with a linear model, if what you want is control, is to take a nonlinear model and linearize it. Let me show you what I mean by that. So you can linearize a nonlinear model. Let's say you have some nonlinear model that expresses s_{t+1} as some function of s_t and a_t: s_{t+1} = f(s_t, a_t). In the example in the previous lecture, I gave such a model for the inverted pendulum, obtained by referring to the laws of physics — actually, by downloading off-the-shelf software for doing physics simulations. If you haven't seen these before, you can go online and easily find many open-source packages for simulating the physics of simple devices like these. Download the software, type in the specification of your robot, and it will simulate the physics for you. There are lots of open-source software packages like that; you can just download them.
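The least-squares fitting procedure just described can be sketched as follows. The trajectories here are synthetic — generated from known, noiseless dynamics A_true and B_true (made-up values) — so the recovered A and B can be checked against the truth; with real robot data there would be noise and the fit would only be approximate.

```python
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[1.0, 0.1],
                   [0.0, 0.9]])
B_true = np.array([[0.0],
                   [0.1]])

# Collect (s_t, a_t, s_{t+1}) triples from m = 20 noiseless trials.
S, Act, S_next = [], [], []
for _ in range(20):
    s = rng.standard_normal(2)
    for t in range(30):
        a = rng.standard_normal(1)
        s_next = A_true @ s + B_true @ a
        S.append(s); Act.append(a); S_next.append(s_next)
        s = s_next

# Solve min_{A,B} sum || s_{t+1} - (A s_t + B a_t) ||^2 by least squares
# on the stacked regressor [s_t; a_t].
X = np.hstack([np.array(S), np.array(Act)])    # shape (N, n + d)
Y = np.array(S_next)                           # shape (N, n)
Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (n + d, n)
A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]
```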
With something like that, you can now build a physics simulator that predicts the next state as a function of the previous state and the previous action. So you come up with some function f saying that the state vector at the next time step will be some function of the current state and the current action, where the action in this case is just a real number that says how hard you accelerate to the left or right. Then you can take this nonlinear model — I actually wrote down a sample model in the last lecture, but in general f will be some nonlinear function, not a linear function.

So what I mean by linearize is the following. Here's just a cartoon; I'll write down the math in a second. Let's say the horizontal axis is the input state, s_t, and the vertical axis is the output state, s_{t+1}. Here's the function f: the next state, s_{t+1}, will be some function of the previous state, s_t, and the action, a_t. So to linearize this model, what you would do is choose a point — we'll call it s_bar_t — and take the derivative of the function there, fitting a straight line tangent to the function at that point.
So this allows you to approximate the next state, s_{t+1}, as a linear function of the previous state, s_t. To make this cartoon really right, the horizontal axis here is really a state-action pair that you're linearizing around — this is just a cartoon, so the horizontal axis represents both the input state and the input action.

So just to write this out in math, I'll write out the simple case first and the fully general one in a second. Suppose the horizontal axis were only the state — let's pretend there are no actions for now, so s_{t+1} is just some function of s_t. Then the linear function I drew would be

s_{t+1} ≈ f(s_bar_t) + f'(s_bar_t) (s_t - s_bar_t).

So with this, you'd express s_{t+1} as a linear function of s_t. Just note that s_bar_t is a constant — it's not a variable. Does that make sense? s_bar_t is a constant, and f'(s_bar_t) is the gradient of the function f at the point s_bar_t. This is really just the equation of that tangent line.
So you can then convert this to A and B matrices. Jumping back one board, I'm going to point out one other thing. Let's say I look at this straight line, and I ask: how well does this straight line approximate my function f — my original simulator, my original function f? Then you notice that in this neighborhood — the neighborhood of s_bar — it's a pretty good approximation; it's fairly close. But then as you move further away, moving far off to the left here, it becomes a pretty terrible approximation.

So when you linearize a nonlinear model to apply LQR, one of the parameters you have to choose is the point around which to linearize your nonlinear model. So if you expect your inverted pendulum system to spend most of its time in the vicinity of some state, then it'd be reasonable to linearize around that state, because that means the linear approximation would usually be a good approximation for the states where you expect the system to spend most of its time.
If, conversely, you expect the system to spend most of its time at states far to the left, then this would be a terrible location to linearize around. So one rule of thumb is to choose the point to linearize around according to where you expect the system to spend most of its time, so that the linear approximation will tend to be an accurate approximation in the vicinity of the states the system actually visits. To be clear, it is about choosing the point (s_bar, a_bar) that we'll use to come up with a linear function that we'll pretend is a good approximation to my original nonlinear function, f.

So for an example like the inverted pendulum problem: if you expect to do pretty well on this problem, then you would expect the state to often be near the zero state, where s = 0 corresponds to x being the center of the railway track that the inverted pendulum lives on. If you expect to do fairly well, you expect the pole to mostly be upright — at zero degrees, or 90 degrees, I guess, depending on convention. So you choose whatever state corresponds to having the pole upright, with near-zero velocity, in the middle of the track.
So you'd usually choose that as the state to linearize your inverted pendulum dynamics around — that's the region where you might want your approximation to be good.

So, to come back to this formula: I wrote it down for the special case of a one-dimensional state variable with no actions. The general formula for the linearization approximation is

s_{t+1} ≈ f(s_bar_t, a_bar_t) + (∇_s f(s_bar_t, a_bar_t))^T (s_t - s_bar_t) + (∇_a f(s_bar_t, a_bar_t))^T (a_t - a_bar_t),

where these upside-down triangles are the symbol for taking the derivative of f with respect to its vector-valued first and second arguments. So by choosing an appropriate state s_bar_t and action a_bar_t to linearize around, you'd now express s_{t+1} as a linear function of the current state, s_t, and the current action, a_t. Again, s_bar_t is a constant that you choose ahead of time, and the same goes for a_bar_t. Lastly, having linearized this thing, you can then convert it to matrices A and B, so that s_{t+1} is now a linear function of s_t and a_t. Questions about this?
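The general linearization formula can be sketched with a made-up nonlinear f (a stand-in for a physics simulator) and numerical finite-difference derivatives in place of analytic gradients; the dynamics, the linearization point, and the helper name `linearize` are all illustrative assumptions.

```python
import numpy as np

def f(s, a):
    """Hypothetical nonlinear dynamics, standing in for a physics simulator."""
    return np.array([s[0] + 0.1 * s[1],
                     s[1] + 0.1 * np.sin(s[0]) + 0.1 * a[0]])

def linearize(f, s_bar, a_bar, eps=1e-6):
    """First-order expansion of f around (s_bar, a_bar), so that
    f(s, a) ~ f0 + A (s - s_bar) + B (a - a_bar)."""
    n, d = len(s_bar), len(a_bar)
    f0 = f(s_bar, a_bar)
    A, B = np.zeros((n, n)), np.zeros((n, d))
    for i in range(n):
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f0) / eps  # column i of grad wrt s
    for j in range(d):
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f0) / eps  # column j of grad wrt a
    return f0, A, B

f0, A, B = linearize(f, np.zeros(2), np.zeros(1))  # linearize about the origin
```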
So just one tiny detail — and it's really not a huge deal — is that this expression below is technically an affine function: there might actually be an extra constant term there. But this is not a difficult generalization of the linear dynamical system definition. One way to deal with that constant is to take your definition of the state — say (x, x-dot, theta, theta-dot) — and augment your state vector with an extra intercept term, 1. With the intercept 1, and working out the A matrix, you can then take care of the extra constant, c, as well. So you can deal with this thing being — technically — an affine function, because of the extra offset, rather than a linear function. But this is just a little bit of bookkeeping for yourself and shouldn't be a huge deal.

So to summarize what I have up here: you can learn a linear model, or you can take a nonlinear model — your nonlinear model can be a physics model or a nonlinear model you've learned — and linearize it.

Now I'll pose an LQR problem in which we have a specification of the MDP: the states are in R^n, the actions are in R^d, and the state transition probabilities are given by the noisy linear equation
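The intercept trick just described — absorbing the affine offset c into an augmented state [s; 1] — can be sketched as follows, with illustrative A, B, and c values.

```python
import numpy as np

# Affine dynamics s' = A s + B a + c, rewritten as purely linear dynamics
# in the augmented state [s; 1]. A, B, c are made-up illustrative values.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
c = np.array([0.2, -0.1])

A_aug = np.block([[A, c[:, None]],
                  [np.zeros((1, 2)), np.ones((1, 1))]])  # last row keeps the 1 fixed
B_aug = np.vstack([B, np.zeros((1, 1))])

s, a = np.array([1.0, 2.0]), np.array([3.0])
s_next_aug = A_aug @ np.append(s, 1.0) + B_aug @ a  # first n entries = A s + B a + c
```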
s_{t+1} = A_t s_t + B_t a_t + w_t. Our rewards are going to be these quadratic functions. So the specification of the MDP means that we know the A matrices, the B matrices, the U matrices, and the V matrices. Our goal is to come up with a policy to maximize our finite horizon sum of rewards — that is, to come up with a policy that maximizes the expected value of this finite horizon sum of rewards,

E[ R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T) ]. Okay.

So our approach to solving this problem will be exactly that finite horizon dynamic programming algorithm that we worked out a little earlier in this lecture. In particular, my strategy for finding the optimal policy will be to first find V* of capital T, and then apply the recursion to find V* of T minus one, V* of T minus two, and so on.

In the dynamic programming algorithm we worked out, V* subscript T of the state s_T is the maximum over the actions a_T you might take at that time of R(s_T, a_T).
Again, just for the sake of understanding this material, you can probably pretend the rewards and the dynamics are stationary; I'll write out all these time subscripts, but feel free to ignore them if you're seeing this for the first time. So

V*_T(s_T) = max over a_T of -(s_T^T U_T s_T + a_T^T V_T a_T).

I hope this isn't confusing: the superscript T's denote transposes, while the lowercase t's and the capital T's in subscripts denote the time index. So that's just the definition of my quadratic rewards. This is clearly maximized as -s_T^T U_T s_T, because that last term, a_T^T V_T a_T, is greater than or equal to zero — that comes from my assumption that V_T is positive semi-definite. So the best action to take in the last time step is just the action zero:

pi*_T(s_T) = arg max over a_T of that same expression = 0.

By choosing the zero action, a_T^T V_T a_T becomes zero, and that's how this reward is maximized. Any questions, or is something illegible? Okay.

So now let's do the dynamic programming step, where my goal is: given V*_{t+1}, I want to compute V*_t.
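The terminal-step argument — that with V_T positive semi-definite the reward is maximized by the zero action — can be checked numerically; the cost matrices and the test state below are arbitrary illustrative choices.

```python
import numpy as np

U_T = np.diag([1.0, 2.0])  # illustrative PSD state-cost matrix
V_T = np.diag([0.5])       # illustrative PSD action-cost matrix

def R(s, a):
    """Terminal reward R(s, a) = -(s^T U_T s + a^T V_T a)."""
    return -(s @ U_T @ s + a @ V_T @ a)

s = np.array([1.0, -2.0])
v_star = R(s, np.zeros(1))  # value of the zero action: -s^T U_T s
print(v_star)  # -9.0; any nonzero action can only lower the reward
```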
Given V*_{t+1}, I want to compute V*_t. So this is the dynamic programming step. The DP step I wrote down previously was this: for the finite state case, I wrote down V*_t(s_t) = max over a_t of R(s_t, a_t) plus the sum over s_{t+1} of P_{s_t a_t}(s_{t+1}) times V*_{t+1}(s_{t+1}). So this is exactly the equation I wrote down previously, and this is what I wrote down for finite states, where you have these discrete state transition probabilities and we can sum over this discrete set of states. Now we're in a continuous, infinite state setting again, so this sum over states should actually become an integral. I'm going to skip the integral step; we'll just go ahead and write this last term as an expectation. So this is going to be max over actions a_t of R(s_t, a_t) plus an expectation over the random next state s_{t+1}, drawn from the state transition probabilities given by P_{s_t a_t}, of V*_{t+1}(s_{t+1}). So this is the same equation written down as an expectation. What I need to do is, given a representation of V*_{t+1}, find V*_t. So it turns out that LQR has the following useful property.
It turns out that each of these value functions can be represented as a quadratic function. So concretely, let's suppose that V*_{t+1} can be expressed as a quadratic function, written like so: V*_{t+1}(s_{t+1}) = s_{t+1}^T Phi_{t+1} s_{t+1} + Psi_{t+1}, where the matrix Phi_{t+1} is an n by n matrix, and Psi_{t+1} is just a real number. So in other words, suppose V*_{t+1} is just a quadratic function of the state s_{t+1}. We can then show that when you do one dynamic programming step - when you plug this definition of V*_{t+1} into the dynamic programming equation I had just now - you can show that V*_t will also be a quadratic function of the same form, V*_t(s_t) = s_t^T Phi_t s_t + Psi_t, for some appropriate matrix Phi_t and some appropriate real number Psi_t. Does that make sense? So what you can do is start off the recursion as follows.
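In code, such a quadratic value function is fully described by just one matrix and one real number. A sketch, with illustrative values for Phi and Psi (negative, matching the convention that rewards are costs with a minus sign):

```python
import numpy as np

# A quadratic value function V*(s) = s^T Phi s + Psi is fully described
# by one n x n matrix and one scalar (these particular values are made up).
Phi = np.array([[-2.0, 0.0],
                [0.0, -1.0]])
Psi = -0.5

def value(s):
    return float(s @ Phi @ s + Psi)
```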
So previously, we worked out that V*_T(s_T) is minus s_T transpose U_T s_T. So we have that Phi of capital T is equal to minus U_T, and Psi of capital T is equal to zero; then V*_T(s_T) is equal to s_T^T Phi_T s_T + Psi_T. So you can start out the recursion this way, with Phi_T = -U_T and Psi_T = 0, and then work out what the recursion is. I won't actually do the full derivation - it's mainly algebra, and you've actually done this sort of Gaussian expectation math a lot in your homework by now - so I'll just outline the key step. So in the dynamic programming step, V*_t(s_t) is equal to the max over actions a_t of the immediate reward - so this was R(s_t, a_t) from my equation in the dynamic programming step - plus an expected value over the random next state s_{t+1}, drawn from the Gaussian distribution with mean A_t s_t + B_t a_t and covariance Sigma_w.
So what this is - this is really my specification for P_{s_t a_t}. This is my state transition distribution in the LQR setting: if I take action a_t in the state s_t, then my next state is distributed Gaussian with mean A_t s_t + B_t a_t and covariance Sigma_w. And then we take the expected value of V star of this next state - this, of course, is just V*_{t+1}(s_{t+1}). I hope this makes sense. This is just taking that equation I had previously in the dynamic programming step - V*_t(s_t) equals the max over actions of the immediate reward plus an expected value, over the next state, of V star of the next state with the clock advanced by one - and plugging in the definitions of the reward, of the state transition distribution, and of the value function. Actually, could you raise your hand if this makes sense? Cool. So if you write this out and you expand the expectation - I know you've done this many times, so I won't do it - this whole thing on the right-hand side simplifies to a big quadratic function of the action a_t.
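The Gaussian expectation being expanded here has a simple closed form: if s' ~ N(mu, Sigma) and V(s') = s'^T Phi s' + Psi, then E[V(s')] = mu^T Phi mu + tr(Phi Sigma) + Psi. Here is a small Monte Carlo sanity check of that identity, with made-up matrices (this is a numerical check of the algebra, not part of the algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Sigma_w = 0.05 * np.eye(2)
Phi = np.array([[-2.0, 0.0],
                [0.0, -1.0]])
Psi = 0.0

s, a = np.array([1.0, 0.0]), np.array([0.5])
mu = A @ s + B @ a                      # mean of the next state

# Closed form: E[s'^T Phi s' + Psi] = mu^T Phi mu + tr(Phi Sigma_w) + Psi.
exact = float(mu @ Phi @ mu + np.trace(Phi @ Sigma_w) + Psi)

# Monte Carlo estimate of the same expectation.
samples = rng.multivariate_normal(mu, Sigma_w, size=200_000)
mc = float(np.mean(np.einsum('ij,jk,ik->i', samples, Phi, samples)) + Psi)
assert abs(exact - mc) < 0.02
```

This is the step that makes the right-hand side of the DP equation an explicit quadratic in a_t, since mu = A_t s_t + B_t a_t is itself linear in a_t.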
So this whole thing simplifies to a big quadratic function of the action a_t, and we want to maximize it with respect to a_t. To maximize a big quadratic function, you just take the derivative of the function with respect to the action a_t, set the derivative equal to zero, and then you've maximized the right-hand side with respect to the action a_t. It turns out - I'm just going to write this expression down for completeness; you can derive it yourself at any time - if you actually maximize that thing on the right-hand side as a function of the actions a_t, you find that the optimal a_t is going to be (V_t - B_t^T Phi_{t+1} B_t)^{-1} B_t^T Phi_{t+1} A_t times s_t. Don't worry about this expression; you can go to the lecture notes and derive it yourself. But the key thing to note is that the optimal action a_t is going to be some big matrix, which we're going to call L_t, times s_t. In other words, the optimal action to take in a given state is going to be some linear function of the state s_t.
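The "set the derivative to zero" step is the ordinary maximization of a quadratic with a negative definite Hessian. A self-contained sketch, with arbitrary illustrative K and g (not derived from any particular MDP):

```python
import numpy as np

# Maximizing a quadratic in the action: f(a) = a^T K a + 2 g^T a, with K
# negative definite, by setting the gradient 2 K a + 2 g to zero.
K = np.array([[-2.0, 0.0],
              [0.0, -1.0]])
g = np.array([2.0, 1.0])

a_star = -np.linalg.solve(K, g)         # stationary point: a = -K^{-1} g

def f(a):
    return float(a @ K @ a + 2 * g @ a)

# Because K is negative definite, the stationary point is the maximum:
# nudging a_star in any coordinate direction can only decrease f.
for d in np.eye(2):
    assert f(a_star + 0.1 * d) < f(a_star)
    assert f(a_star - 0.1 * d) < f(a_star)
```

In the LQR derivation, g turns out to be linear in s_t, which is exactly why the maximizer a_star is a linear function of the state.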
So having done dynamic programming - you remember also, when we worked out the dynamic programming algorithm for finite horizon MDPs, we said that the way you compute the optimal policy, pi*_t(s_t), is always the arg max of the same thing: the arg max over actions a_t of R(s_t, a_t) plus the expected value, over s_{t+1} drawn from P_{s_t a_t}, of V*_{t+1}(s_{t+1}). This thing on the right-hand side is always the same thing as the thing we just maximized. So what this means is that, when I solved for the value of a_t that maximizes this, what this means is that the optimal action to take from the state s_t is actually equal to L_t times s_t. What we showed is that when you're in some state s_t, the optimal action for that state is going to be some matrix L_t, which we can compute, times the state s_t. In other words, the optimal action is actually a linear function of the state. I'm just going to point out, this is not function approximation here, right? What we did not do - we did not say, let's find the optimal linear policy.
We didn't say, let's look at the optimal policy and then fit a straight line to it. This is not about approximating the optimal policy with a straight line. This derivation is saying that the optimal policy is a straight line - the optimal action is a linear function of the current state. Moreover, you've now worked out this value of a_t that maximizes the thing on the right-hand side. So you take this a_t and plug it back in to do the maximization in the dynamic programming recursion. What you find is that V*_t(s_t) will indeed be a quadratic function of the following form - and I just write out the equations for the sake of completeness; don't worry too much about their forms, and you can derive this yourself: Phi_t = A_t^T (Phi_{t+1} - Phi_{t+1} B_t (B_t^T Phi_{t+1} B_t - V_t)^{-1} B_t^T Phi_{t+1}) A_t - U_t, and Psi_t = tr(Sigma_w Phi_{t+1}) + Psi_{t+1}. So just to summarize - don't worry too much about the forms of these equations - what we've done is written down the recursion that expresses Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}.
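One step of this recursion can be sanity-checked numerically. Here is the scalar (1-D) specialization, with illustrative numbers, compared against a brute-force maximization over a fine grid of actions; the algebraic form of the update is my own reconstruction under the lecture's sign conventions, so check it against the course notes:

```python
import numpy as np

# One step of the recursion in a 1-D problem (all numbers illustrative),
# checked against brute-force maximization over a grid of actions.
# Conventions: V*_{t+1}(s) = Phi1*s^2 + Psi1, rewards -U*s^2 - V*a^2,
# next state ~ N(A*s + B*a, sigma_w2).
A, B, U, V = 1.0, 1.0, 1.0, 1.0
sigma_w2 = 0.2           # variance of the noise w_t
Phi1, Psi1 = -1.0, 0.0   # quadratic coefficients of V*_{t+1}

# Candidate closed-form update (scalar case of the matrix recursion).
K = B * Phi1 * B - V
Phi0 = A * (Phi1 - Phi1 * B * (1 / K) * B * Phi1) * A - U
Psi0 = sigma_w2 * Phi1 + Psi1

# Brute force: V*_t(s) = max_a [ -U s^2 - V a^2
#                                + Phi1 ((A s + B a)^2 + sigma_w2) + Psi1 ],
# using E[(m + w)^2] = m^2 + sigma_w2.
def v0_brute(s):
    actions = np.linspace(-5.0, 5.0, 100001)
    m = A * s + B * actions
    return np.max(-U * s**2 - V * actions**2
                  + Phi1 * (m**2 + sigma_w2) + Psi1)

for s in [0.0, 0.7, -1.3]:
    assert abs(v0_brute(s) - (Phi0 * s**2 + Psi0)) < 1e-6
```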
So this allows you to compute the optimal value function for when the clock is at time lowercase t, as a function of the optimal value function for when the clock is at time t plus one. Actually, just to give this equation a name as well: this recursion, in terms of the Phi_t's, is called the discrete time Riccati equation. It's the recursion that gives you Phi_t in terms of Phi_{t+1}. So to summarize, our algorithm for finding the exact solution to finite horizon LQR problems is as follows. We initialize Phi_T to be equal to minus U_T and Psi_T to be equal to zero. Then, recursively, calculate Phi_t and Psi_t as a function of Phi_{t+1} and Psi_{t+1}, as I showed, using the discrete time Riccati equation.
So you do this for t equals T minus one, T minus two and so on, down to time zero. Then you compute L_t as a function of - actually, is it Phi_t or Phi_{t+1}? Phi_{t+1}, I think - as a function of Phi_{t+1} and Psi_{t+1}. This is actually a function of only Phi_{t+1}; you don't really need Psi_{t+1}. Now you have your optimal policy: having computed the L_t's, you now have the optimal action to take in the state s_t, just given by the linear equation a_t = L_t s_t. How much time do I have left? Okay. Let me just say one last thing about this before I close. Maybe I'll do it next week - I think I'll do it next session instead. It actually turns out there's one cool property about this that's kind of subtle, but you'll find it out in the next lecture. Are there questions about this before we close for today, then? So the very cool thing about the solution of discrete time, finite horizon LQR problems is that this is a problem with an infinite, continuous state space. But nonetheless, under the assumptions we made, you can prove that the value function is a quadratic function of the state.
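Putting the whole backward pass together, here is a NumPy sketch of the algorithm just summarized. For simplicity it assumes time-invariant A, B, U, V; the exact algebraic form of the gain and of the Riccati update is a reconstruction under the lecture's sign conventions (rewards -s^T U s - a^T V a, Phi negative semi-definite), so verify it against the lecture notes before relying on it:

```python
import numpy as np

def lqr_backward(A, B, U, V, Sigma_w, T):
    """Backward pass for finite-horizon LQR with time-invariant A, B, U, V.

    Returns the gains L_t (optimal action a_t = L_t s_t) and the
    coefficients (Phi_0, Psi_0) of V*_0(s) = s^T Phi_0 s + Psi_0.
    """
    Phi = -U.astype(float)          # initialization: Phi_T = -U, Psi_T = 0
    Psi = 0.0
    gains = [None] * T
    for t in range(T - 1, -1, -1):
        # At this point Phi, Psi hold Phi_{t+1}, Psi_{t+1}.
        M = np.linalg.solve(B.T @ Phi @ B - V, B.T @ Phi)
        gains[t] = -M @ A                         # L_t depends on Phi_{t+1} only
        Psi = Psi + np.trace(Sigma_w @ Phi)       # Psi_t update
        Phi = A.T @ (Phi - Phi @ B @ M) @ A - U   # discrete-time Riccati step
    return gains, Phi, Psi

# Tiny 1-D sanity check (T = 1, no noise): maximizing
# -s^2 - a^2 - (s + a)^2 by hand gives a = -s/2 and value -1.5 s^2.
L, Phi0, Psi0 = lqr_backward(np.array([[1.0]]), np.array([[1.0]]),
                             np.array([[1.0]]), np.array([[1.0]]),
                             np.zeros((1, 1)), T=1)
assert abs(L[0][0, 0] + 0.5) < 1e-12
assert abs(Phi0[0, 0] + 1.5) < 1e-12
```

For the time-varying case of the lecture, you would index A_t, B_t, U_t, V_t inside the loop; note that, as remarked above, each gain L_t is computed from Phi_{t+1} alone, before Phi is advanced.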
Therefore, just by computing these matrices Phi_t and the real numbers Psi_t, you can actually exactly represent the value function, even for these infinitely large, continuous state spaces. So the computation of these algorithms scales only as a polynomial in the number of state variables, whereas with discretization we ran into the curse of dimensionality and had algorithms that scale exponentially with the dimension of the problem. LQR scales only like the cube of the dimension of the problem, so this easily applies to problems with even very large state spaces. We actually often apply variations of this algorithm to some of the things we do on our helicopter, which has high dimensional state spaces, with twelve or higher dimensions, and this has worked very well. So it turns out there are even more things you can do with this, and I'll continue with that in the next lecture. So let's close for today, then.