Lecture 18 | Machine Learning (Stanford)
-
0:13 - 0:16This presentation is delivered by the Stanford Center for Professional
-
0:16 - 0:23Development.
-
0:26 - 0:28Okay. Welcome back. What I
-
0:28 - 0:31want to do today is talk about
-
0:31 - 0:35one of my favorite algorithms for controlling MDPs that I think is
-
0:36 - 0:39one of the more elegant and efficient and powerful algorithms that
-
0:39 - 0:41I know of.
-
0:41 - 0:43So
-
0:43 - 0:47what I'll do is I'll first start by talking about a couple
-
0:47 - 0:49variations of MDPs
-
0:49 - 0:52that are slightly different from the MDP definition you've seen so far. These
-
0:52 - 0:54are pretty
-
0:54 - 0:59common variations. One is state action rewards,
-
0:59 - 1:04and the other is finite horizon MDPs. Using these slightly modified definitions of an
-
1:04 - 1:07MDP, I'll talk about linear dynamical systems. I'll spend a little bit
-
1:07 - 1:10of time talking about models for linear dynamical systems, and then talk about LQR, or
-
1:10 - 1:13linear quadratic regulation control,
-
1:13 - 1:19which will lead us to something called the Riccati equation, which is
-
1:19 - 1:23something we will solve in order to do LQR
-
1:23 - 1:24controls.
-
1:24 - 1:26
-
1:26 - 1:27So
-
1:27 - 1:29just
-
1:29 - 1:33to recap, and we've seen this definition many times now.
-
1:33 - 1:37We've been defining an MDP
-
1:37 - 1:42as
-
1:42 - 1:47a five-tuple of states, actions, state transition probabilities, a discount factor gamma, and a reward function,
-
1:47 - 1:50where - gamma's
-
1:50 - 1:53the
-
1:53 - 1:57discount factor, a number between zero and one.
-
1:57 - 2:00And R, the reward function,
-
2:00 - 2:07was a function mapping from the states to the real numbers.
-
2:07 - 2:08So we had
-
2:08 - 2:10value
-
2:10 - 2:14iteration, which would do this.
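For reference, the update being written on the board here is the standard value iteration update from the earlier lectures, with a reward that depends only on the state:

```latex
V(s) := R(s) + \gamma \max_{a} \sum_{s'} P_{sa}(s')\, V(s')
```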
-
2:22 - 2:26So after a while, value iteration will cause V to converge to V star.
-
2:26 - 2:28Then
-
2:28 - 2:31having found the optimal value function, if you
-
2:31 - 2:33compute the optimal policy
-
2:33 - 2:44by taking essentially the arg max over actions A of the right-hand side of the equation above. So
-
2:46 - 2:48in
-
2:48 - 2:51value iteration,
-
2:51 - 3:13as you iterate - you know, perform this update - the function V will converge to V star. So with any finite number of iterations you only get closer and closer to V star; it actually converges exponentially quickly to V star, but you never exactly reach V star in a finite number of iterations.
-
3:15 - 3:20So what I want to do now is describe a couple of common variations of
-
3:20 - 3:20MDPs
-
3:20 - 3:35that have slightly different definitions. First, the reward function, and then second, we'll do something slightly different from just discounting. Then remember in the last lecture, I said that
-
3:35 - 3:37for infinite-state or continuous-state MDPs,
-
3:37 - 3:41we couldn't apply the most straightforward version of value iteration
-
3:41 - 3:44because if you have a continuous state
-
3:44 - 3:46MDP, we need to use
-
3:46 - 3:48some approximations of the optimal value function.
-
3:48 - 3:52But later in this lecture,
-
3:52 - 3:56I'll talk about a special case of MDPs, where you can actually represent the
-
3:56 - 3:58value function exactly,
-
3:58 - 4:01even if you have an infinite-state space or even if you have a continuous-state
-
4:01 - 4:02space.
-
4:02 - 4:15I'll actually do that, talk about this special class of infinite-state MDPs, using this new variation of the reward function and the alternative to discounting, which will make the formulation a little easier.
-
4:15 - 4:17So the first
-
4:17 - 4:21variation I want to talk about is
-
4:21 - 4:33state action rewards. So I'm going to change the definition of the reward function. It turns out this won't be a huge deal. In particular, I change the reward function
-
4:33 - 4:36to be a function mapping from
-
4:36 - 4:38a state action pair
-
4:38 - 4:41to the real numbers.
-
4:41 - 4:45What I mean by this is just the following.
-
4:45 - 4:49You start off in some state S zero. You
-
4:49 - 4:53take an action A zero as a result of your state of action choice. You transition
-
4:53 - 4:55to some new state, S1.
-
4:55 - 5:00You take some action, A1. You transition to some new state, S2. You take some
-
5:00 - 5:05action, A2, and so on. So this is the sequence of states and actions that you see.
-
5:05 - 5:08So in an MDP where you have state action rewards,
-
5:08 - 5:13your total payoff is now defined as
-
5:13 - 5:14this,
-
5:14 - 5:19where your reward function is now a function both of the current state and of the
-
5:19 - 5:20action
-
5:20 - 5:22you took in the current state.
-
5:22 - 5:27So this is my
-
5:27 - 5:29total payoff.
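The total payoff being written on the board here is the usual discounted sum, now with rewards that depend on both the state and the action:

```latex
R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots
```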
-
5:29 - 5:32Then as usual, my goal will be to find a policy -
-
5:32 - 5:35to find the function mapping from the state's actions,
-
5:35 - 5:38so that when I execute that policy,
-
5:38 - 5:40I can maximize the expected value
-
5:40 - 5:42of my total payoff.
-
5:42 - 5:47So this definition, it actually turns out that given an MDP with
-
5:47 - 5:48state action rewards,
-
5:48 - 5:49you can actually -
-
5:49 - 5:57so by redefining the states, you can actually reduce this back to an MDP with rewards that are only a function of the state. That may or may
-
5:57 - 5:58not be obvious. Don't worry if
-
5:58 - 6:01it isn't. But
-
6:01 - 6:05using state action rewards allows you to more directly model
-
6:06 - 6:09problems in which different actions have different costs. So
-
6:09 - 6:13a running example is a robot. So suppose you have a robot,
-
6:13 - 6:17and it's more costly for the robot to move than for it to stay still.
-
6:17 - 6:26So if there's an action to stay still, the action to stay still may have a lower cost, because you're not using battery power that you'd have to recharge for that action.
-
6:26 - 6:30Another example would be -
-
6:30 - 6:33actually, another navigation example
-
6:33 - 6:37would be if you have an outdoor vehicle. You need to drive
-
6:37 - 6:39over
-
6:39 - 6:42some sort of outdoor terrain, like very rough rocks
-
6:42 - 6:46or driving over grass. It may be costly, more difficult,
-
6:46 - 6:49than driving over, say, a paved road. So you may assign
-
6:49 - 6:51an action that requires
-
6:51 - 6:56driving over grass or driving over rocks to be
-
6:56 - 7:00more costly than driving over paved
-
7:00 - 7:05road.
-
7:05 - 7:08So this really isn't a huge change to the definition of
-
7:08 - 7:24an MDP. I won't really bother to justify this a lot, but Bellman's equations generalize in the way that you probably expect. V star of S is now equal to
-
7:27 - 7:34that.
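The equation being written on the board is Bellman's equation with state action rewards:

```latex
V^*(s) = \max_{a}\left[ R(s,a) + \gamma \sum_{s'} P_{sa}(s')\, V^*(s') \right]
```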
-
7:37 - 7:41So previously, when the reward function was just a function of the state, S, we could
-
7:41 - 7:44take the max and push it in here.
-
7:44 - 7:45But now that
-
7:46 - 7:50the reward is a function of the action you're taking as well, the max comes outside. So
-
7:50 - 7:51this says that
-
7:51 - 8:02your expected total payoff, starting from the state S and acting optimally, is equal to, first, your immediate reward, R of S, A, for executing some action, A, in state S, and then
-
8:02 - 8:07plus gamma times your future expected total payoff.
-
8:07 - 8:09So
-
8:09 - 8:11this is your expected total payoff if you
-
8:11 - 8:15take the action, A, from the current state. So
-
8:15 - 8:19V star is the optimal value function, so your actual optimal expected total payoff
-
8:19 - 8:22is the max of all actions of this
-
8:22 - 8:26thing on
-
8:26 - 8:30the right. Let's see. Value iteration, which I'm abbreviating VI,
-
8:30 - 8:53is really the same algorithm. V of S is updated as max over A of R of S, A plus the same thing as before. You just use the right-hand side of that Bellman equation to update V of S, and again, you get value iteration, exactly the same way.
-
8:53 - 8:56Then finally,
-
8:57 - 9:00having found the optimal value function, V star,
-
9:00 - 9:03using the value iteration algorithm,
-
9:03 - 9:08you can then compute the optimal policy, pi star of S as same as
-
9:08 - 9:12before. The best action to take in the state, S, is the
-
9:12 - 9:19action, A, that maximizes the thing on the righthand
-
9:28 - 9:32side. So having used value iteration
-
9:32 - 9:35to compute the optimal value function, you can then find
-
9:35 - 9:40pi star using that.
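As a concrete illustration, here is a small sketch (not from the lecture) of value iteration with state action rewards for a finite MDP, using hypothetical arrays R[s, a] and P[s, a, s']:

```python
import numpy as np

def value_iteration(R, P, gamma, num_iters=1000):
    """Value iteration with state-action rewards.
    R: (S, A) array of rewards R(s, a)
    P: (S, A, S) array of transition probabilities P_sa(s')
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
        Q = R + gamma * P.dot(V)
        V = Q.max(axis=1)              # V(s) = max_a Q(s, a)
    Q = R + gamma * P.dot(V)
    pi = Q.argmax(axis=1)              # greedy policy pi*(s) = argmax_a Q(s, a)
    return V, pi
```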
-
9:42 - 9:46So this was the easier of the two
-
9:46 - 9:48variations of MDPs [inaudible]
-
9:48 - 9:49so far.
-
9:49 - 9:56Any questions? Actually, are there questions about this?
-
10:01 - 10:03So
-
10:03 - 10:07the other variation, the other alternative definition,
-
10:08 - 10:11will be
-
10:11 - 10:16finite horizon
-
10:16 - 10:19MDPs. So
-
10:19 - 10:22finite horizon MDP
-
10:22 - 10:32comprises a tuple of states S, actions A, state transition probabilities P sub S, A, a parameter T, and the reward function.
-
10:32 - 10:36Here,
-
10:36 - 10:37T
-
10:37 - 10:41is a parameter called the horizon time.
-
10:41 - 10:43Concretely,
-
10:43 - 10:46what this really means is that we'll
-
10:46 - 10:48be taking actions in the MDP
-
10:48 - 10:55only for a total of capital T time steps. So we won't use discounting anymore. [Audio cuts out]
-
11:00 - 11:04You start in some state S zero, take action A0, get to some other state S1, take
-
11:04 - 11:07action A1 and so on.
-
11:07 - 11:12Eventually, you get to some state, S capital T, after T time steps. So
-
11:12 - 11:17my total payoff, now,
-
11:17 - 11:18will be
-
11:18 - 11:19given
-
11:19 - 11:22by this
-
11:22 - 11:26sum from time zero up to time T of my sum over rewards.
-
11:28 - 11:32Okay? My goal, as usual
-
11:32 - 11:34- so this is my total payoff.
-
11:34 - 11:38My goal, as usual, is to maximize the expected value of my total payoff. We want to come up
-
11:38 - 11:40with a policy to maximize the
-
11:40 - 11:44expected value of this total payoff. The key difference is that
-
11:44 - 11:51the world only exists for capital T time steps, and after that, there are no more rewards to be collected.
-
11:51 - 11:55So this turns out to be quite a bit of a difference, because it turns out that the optimal
-
11:55 - 12:00policy
-
12:01 - 12:03may be non-stationary.
-
12:06 - 12:08The
-
12:08 - 12:10term, stationary, means that it doesn't depend
-
12:10 - 12:13on time. Non-stationary
-
12:13 - 12:14means that it may depend on time.
-
12:14 - 12:18So non-stationary
-
12:18 - 12:22roughly means that my optimal action to take will be different for different time
-
12:22 - 12:22steps.
-
12:22 - 12:24That's what nonstationary
-
12:24 - 12:27means. Just as an example of that,
-
12:27 - 12:31imagine that we have some robot. Let's say the
-
12:31 - 12:34robot
-
12:34 - 12:35is here.
-
12:35 - 12:38Let's say that there's
-
12:38 - 12:41a goal over there with a plus one reward.
-
12:43 - 12:46Much further away, there's a plus ten reward.
-
12:46 - 12:51So depending on how much time you have left on the clock,
-
12:51 - 12:53it may be better to go after the plus one
-
12:53 - 12:55or the plus ten reward. If it's still early
-
12:55 - 12:59in the game, you still have a lot of time, it may be better to head toward
-
12:59 - 13:12the plus-ten reward and get a much larger reward. If you only have a couple of time steps left, if the clock has nearly reached the time, capital T, then you may not have enough time to get to the plus-ten reward. You'd be better
-
13:12 - 13:16off heading for the plus one reward that's much more close by.
-
13:16 - 13:20So what this example illustrates is that when you're in that state,
-
13:20 - 13:24the best action to take could be to go left or to go right, depending on what
-
13:24 - 13:26time it is. So just as an example, illustrating
-
13:26 - 13:32how the optimal policy can be non-stationary.
-
13:35 - 13:36In fact,
-
13:36 - 13:38since we have non-stationary
-
13:38 - 13:43policies anyway in the sequence, what I'm going to do next, I'm going to
-
13:43 - 13:44allow
-
13:44 - 13:48non-stationary transition probabilities as well. So I'll just write that
-
13:48 - 13:50up there.
-
13:50 - 13:52What I mean is that
-
13:52 - 14:02so far, assuming that the state S T plus one is drawn from the state transition probabilities indexed by the previous state and the previous action.
-
14:02 - 14:06I've been assuming that these state transition probabilities are the same
-
14:06 - 14:16for all times. So if I'm in some state and take some action, the distribution over the next state doesn't change. It doesn't depend on time.
-
14:16 - 14:17So
-
14:17 - 14:22I'm going to allow a slightly more general definition as well, in which we have
-
14:22 - 14:25non-stationary state transition probabilities
-
14:25 - 14:25so that
-
14:25 - 14:28the chance of where you end up next
-
14:28 - 14:30may also
-
14:30 - 14:33depend on what time it is.
-
14:33 - 14:37So as examples of this non-stationary state
-
14:37 - 14:39transition probabilities,
-
14:39 - 14:44one example would be if you model flying an aircraft over a long distance.
-
14:44 - 14:49Then as the aircraft flies, you burn fuel and become lighter. So the
-
14:49 - 14:53dynamics of the aircraft actually change over time. The mass of the aircraft can
-
14:53 - 14:55change significantly over time as you burn fuel.
-
14:55 - 14:56So depending on
-
14:56 - 15:01what time it is, your next state could actually
-
15:01 - 15:03depend on not only your current state and your action,
-
15:03 - 15:06but also on how much fuel you burn, therefore, what time it is.
-
15:07 - 15:11Other examples, another aerospace one, is
-
15:11 - 15:15if you have the weather forecast for the next 24 hours, say,
-
15:15 - 15:18you know what the winds and precipitation are going to be like over the next 24 hours. Then again,
-
15:18 - 15:22if you fly the aircraft from, say, here to New York,
-
15:22 - 15:25it may cost different amounts to fly
-
15:25 - 15:29different routes at different times. Maybe flying over
-
15:29 - 15:33the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or
-
15:33 - 15:34if you do it
-
15:34 - 15:38a few hours from now, when the weather may be forecast really bad.
-
15:39 - 15:41For an
-
15:41 - 15:46example you see everyday, same thing for traffic, right?
-
15:46 - 15:48There's at least -
-
15:48 - 15:53depending on where you live, certainly here in California, there are times of day where traffic is
-
15:53 - 15:57really bad in lots of places. So the costs of driving certain roads may vary, depending on
-
15:57 - 15:59what time of day it is.
-
15:59 - 16:04Lots of other examples. Industrial automation, different machines
-
16:04 - 16:07in the factory may be available to different degrees at different times of day. They
-
16:07 - 16:10cost different amounts to hire different workers, depending on whether you pay
-
16:10 - 16:13over time [inaudible] or whatever.
-
16:13 - 16:16So the cost of doing different things in the factory can also
-
16:16 - 16:17be a function of time.
-
16:17 - 16:18
-
16:18 - 16:24The state transition probabilities can also be a function of time.
-
16:26 - 16:28Lastly,
-
16:28 - 16:32while we're doing this as well, to make this fully general, we might as
-
16:32 - 16:43well have non-stationary reward functions as well, where you might also index the reward function by time with these subscripts,
-
16:43 - 16:45where the cost of doing different things may depend on
-
16:45 - 16:48the time as well.
-
16:51 - 16:59There are lots more examples of non-stationary MDPs. So now that we have non-stationary MDPs and policies, let's talk about an algorithm to actually
-
16:59 - 17:02try to find the optimal policy.
-
17:04 - 17:10So let me
-
17:10 - 17:14define the following.
-
17:14 - 17:20This is now a slightly modified definition of the optimal value function. I'll
-
17:20 - 17:27just
-
17:31 - 17:38write this down, I guess.
-
17:39 - 17:43So I'm going to define the optimal value function, and this going to be
-
17:43 - 17:45indexed by T, with a subscript T. The
-
17:45 - 17:46optimal value of a state
-
17:48 - 17:50for time, T, we're going to define as your
-
17:50 - 17:52optimal
-
17:52 - 17:57sum of rewards for if you start the MDP at that state, S,
-
17:57 - 17:59and if the clock starts off
-
17:59 - 18:02at time,
-
18:02 - 18:06lowercase T. So the optimal value of a state will depend on what time it is and how
-
18:06 - 18:10much time you have left to run this
-
18:10 - 18:11MDP. Therefore,
-
18:11 - 18:16the sum on the right sums only for time T, time T plus one, time T plus two up
-
18:16 - 18:18to time,
-
18:18 - 18:20capital T.
-
18:23 - 18:27I'll just state in English again, this is your expected optimal
-
18:27 - 18:28total payoff
-
18:28 - 18:31if you start your system in a state, S,
-
18:31 - 18:35and if the clock is already at time,
-
18:35 - 18:37lowercase T.
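In symbols, the definition being written down is roughly:

```latex
V^*_t(s) \;=\; \max_{\pi}\ \mathbb{E}\!\left[\, R^{(t)}(s_t,a_t) + R^{(t+1)}(s_{t+1},a_{t+1}) + \cdots + R^{(T)}(s_T,a_T) \;\middle|\; s_t = s,\ \pi \right]
```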
-
18:38 - 18:46So it turns out there's a version of value iteration you can apply to this. Let me just write out the value iteration algorithm for this.
-
18:46 - 18:48It turns out
-
18:48 - 18:49you can -
-
18:49 - 18:54well, let me just write this
-
18:54 - 18:57out. I'll write this below.
-
18:57 - 19:00It turns out you can compute the optimal value function for the MDP using the
-
19:00 - 19:03following recursion, which is very similar to
-
19:03 - 19:06what we have for value iteration. We're going
-
19:06 - 19:07to set V
-
19:07 - 19:20of S to be equal to the max over A, same as before, right? Okay?
-
19:41 - 19:42So
-
19:42 - 19:47if I start the clock at time T and from state S,
-
19:47 - 19:49my expected total payoff is equal to
-
19:49 - 19:52the maximum [inaudible] actions I may take
-
19:52 - 19:54of my immediate reward. Taking that
-
19:54 - 19:56action, A, in that state,
-
19:56 - 20:00S. Then plus my expected future payoff. So if I take action, A,
-
20:00 - 20:02I would transition with probability P, subscript SA, of S prime
-
20:02 - 20:06to some new state, S
-
20:06 - 20:09prime. If I get to the state, S prime,
-
20:09 - 20:12my total expected payoff from the
-
20:12 - 20:19state S prime would be V star, now subscript T plus one, of S prime. The subscript T plus one
-
20:19 - 20:23reflects that after I've taken one action, my clock will have advanced from
-
20:23 - 20:26time T to time T plus one. So this is now V
-
20:26 - 20:31star subscript T plus one.
-
20:31 - 20:34So this expresses V star of T
-
20:34 - 20:36in terms of V star T plus one.
-
20:36 - 20:40Then lastly, to start off this recursion, you
-
20:40 - 20:42would have V star,
-
20:42 - 20:43capital T
-
20:43 - 20:50is equal to - it's just equal
-
20:51 - 20:53to that.
-
20:53 - 20:57If you're already at time, capital T, then you just get to take one action, and then
-
20:57 - 20:59the clock runs out. So this is
-
20:59 - 21:04V star capital T. Your value of starting in some state, S, with no time -
-
21:04 - 21:09with just one time step, with no time left on the clock.
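Putting the pieces of this recursion together:

```latex
V^*_T(s) = \max_{a}\ R^{(T)}(s,a), \qquad
V^*_t(s) = \max_{a}\left[ R^{(t)}(s,a) + \sum_{s'} P^{(t)}_{sa}(s')\, V^*_{t+1}(s') \right]
```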
-
21:09 - 21:13So in the case of finite horizon MDPs, this actually gives us a very nice
-
21:14 - 21:16dynamic programming algorithm
-
21:16 - 21:17in which you can
-
21:17 - 21:21start off by computing V star of T.
-
21:21 - 21:38Then you use this backward recursion to compute V star of capital T minus one, capital T minus two and so on. We compute V star of T, then T minus one, and so on; it recurses backwards until you've computed V star for all of your time steps. Can you see this
-
21:38 - 21:40board?
-
21:40 - 21:42Cool. Then
-
21:45 - 21:48the final step is
-
21:48 - 22:04- previously, we said that pi star of S - I'm going to come back and change this a bit - was the arg max over A of R of S, A, plus gamma times the sum over S prime of P of S, A of S prime, times V star of S prime - this
-
22:04 - 22:07is sort of what we had. In
-
22:07 - 22:10the finite horizon case, the
-
22:10 - 22:27optimal action may depend on what time it is. So the optimal action to take at time T is the arg max over actions, A. This is basically the arg max of exactly that same thing on the right-hand side as we had
-
22:27 - 22:30in our dynamic programming
-
22:30 - 22:33algorithm. So you do this for every time step, and now you compute it,
-
22:33 - 22:37your optimal policy for different time
-
22:37 - 22:40steps. Again, this is a non-stationary policy
-
22:40 - 22:42because pi star
-
22:42 - 22:47of S may depend on what time it is.
-
22:47 - 23:01So to summarize, the difference between this and the earlier version of value iteration is that - so what you do is you compute V star of capital T. Then using the backwards recursion of that dynamic programming algorithm, you compute V star T
-
23:01 - 23:04minus one. Then V star T minus two
-
23:04 - 23:07and so on down to V star of zero.
-
23:07 - 23:10Then from these, you compute pi star.
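Here is a small sketch (not from the lecture) of that backward recursion for a finite MDP, again with hypothetical arrays R[t][s, a] and P[t][s, a, s'] to allow non-stationary rewards and dynamics:

```python
import numpy as np

def finite_horizon_dp(R, P):
    """Finite-horizon dynamic programming.
    R: list of (S, A) reward arrays R^(t)(s, a), for t = 0..T
    P: list of (S, A, S) transition arrays P^(t)_sa(s'), for t = 0..T-1
    Returns V[t][s] = V*_t(s) and pi[t][s] = pi*_t(s).
    """
    T = len(R) - 1
    V = [None] * (T + 1)
    pi = [None] * (T + 1)
    # Base case: at time T you only get to collect one more reward.
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)
    # Backward recursion: compute V*_t from V*_{t+1}.
    for t in range(T - 1, -1, -1):
        Q = R[t] + P[t].dot(V[t + 1])   # Q_t(s, a)
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```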
-
23:13 - 23:18So one - there's not a huge difference, but one minor difference compared to
-
23:18 - 23:22the infinite horizon discounted case
-
23:22 - 23:26is that by running this recursion once, you now have exactly
-
23:26 - 23:30the right value function. So this just computes the value function, rather
-
23:30 - 23:31than
-
23:31 - 23:38merely converging to it asymptotically. This just gives you the right value function with one
-
23:38 - 23:40
-
23:40 - 23:43pass. Cool. Any questions there? Yeah.
-
23:43 - 23:48[Inaudible].
-
23:50 - 24:05Student: Is this computation much shorter than value iteration? Instructor: So, sort of yes and no. It depends on how large capital T is. Student: [Inaudible] for the normal MDP, could you choose a capital T and then use this approach for that case? Instructor: I see.
-
24:06 - 24:09So for the normal MDP, can you assume capital T and then assume this? So
-
24:09 - 24:13it actually turns out that - that's a
-
24:13 - 24:15great question. Let
-
24:15 - 24:19me just answer this in a hand-wavy way.
-
24:19 - 24:23So it actually turns out for a discounted infinite horizon MDP
-
24:23 - 24:26where some discount factor gamma.
-
24:26 - 24:29So what you can do is you can say,
-
24:29 - 24:35after T time steps, gamma to the T would be really small. It would be like epsilon or something. I
-
24:35 - 24:36don't
-
24:36 - 24:40really care what happens after that many times because the rewards are
-
24:40 - 24:43multiplied by gamma to the T. After that, I don't really care. So you can
-
24:43 - 24:44ask,
-
24:44 - 24:45can I take
-
24:45 - 24:48my infinite horizon discounted MDP
-
24:48 - 24:53and approximate that with a finite horizon MDP where the number of time steps, T,
-
24:53 - 24:55is chosen so that this holds true. So it
-
24:55 - 24:58turns out you can do that.
-
24:58 - 25:02Then you end up with some value for T. You can solve for T so
-
25:02 - 25:04that this holds true.
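For instance, one way to make this precise (an illustrative choice, not spelled out in the lecture) is to require gamma to the T to be at most some small epsilon:

```latex
\gamma^{T} \le \epsilon \quad\Longleftrightarrow\quad T \ge \frac{\log(1/\epsilon)}{\log(1/\gamma)}
```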
-
25:04 - 25:10It turns out you can prove that if you took the original value iteration algorithm
-
25:10 - 25:15and if you run the original value iteration algorithm - the version for discounted MDPs -
-
25:15 - 25:17If you run that for
-
25:17 - 25:20this same number of time steps,
-
25:20 - 25:24you will end up with an approximation to the value function that is
-
25:24 - 25:27about this close, up to some small constant factors.
-
25:27 - 25:31So to do that, you end up with roughly the same amounts of computation anyway.
-
25:31 - 25:33Then
-
25:33 - 25:36you actually end up with a non-stationary policy, which is more expensive to keep
-
25:37 - 25:41around. You need to keep around a different policy for every time step, which is not
-
25:41 - 25:47as nice as if you had the stationary policy, same policy for all times. So
-
25:47 - 25:51there are other reasons, but sometimes you might take an
-
25:51 - 25:55infinite horizon discounted problem and approximate it to a finite horizon problem.
-
25:55 - 25:56But
-
25:56 - 25:59this particular reason is
-
25:59 - 26:01not the one. That makes
-
26:01 - 26:04sense. More
-
26:04 - 26:11questions? Interviewee: [Inaudible]?
-
26:13 - 26:15Is
-
26:15 - 26:17there a gamma in this?
-
26:17 - 26:20So if you want, you can actually
-
26:20 - 26:23change the definition of an MDP
-
26:23 - 26:28and use a finite horizon discounted MDP. If you want, you can do that. You
-
26:28 - 26:29can actually come
-
26:29 - 26:32in and put a gamma there,
-
26:32 - 26:35and use discounting and the finite horizon.
-
26:35 - 26:38It turns out that usually, for most
-
26:38 - 26:41problems that people deal with, you either use discounting
-
26:41 - 26:43or you
-
26:43 - 26:47use the finite horizon. It's been less common to do both, but you can certainly do as well.
-
26:47 - 26:52One of the nice things about discounting is it makes sure your value function is finite.
-
26:53 - 26:58Algorithmically and mathematically, one of the reasons to use discounting is
-
26:58 - 27:08because you're multiplying your rewards by an exponentially decaying factor. It's a geometrically decaying series, which ensures that the value function is always finite. This is a really nice mathematical property when you do discounting.
-
27:08 - 27:11So when you have a finite horizon anyway,
-
27:11 - 27:15then the value function's also guaranteed to be finite. So with that, you
-
27:15 - 27:22don't have to use discounting. But if you want, you can actually discount
-
27:24 - 27:35as well. Student: [Inaudible]. Instructor (Andrew Ng): Yeah, yes, you're right. If you want, you can redefine the reward function to fold the discounting into the reward function, since we have non-stationary rewards as well. So
-
27:40 - 27:41that was
-
27:41 - 27:44finite horizon
-
27:44 - 27:45MDPs.
-
27:45 - 27:49What I want to do now is actually use
-
27:49 - 27:53both of these ideas, your state action rewards and your finite horizon MDPs
-
27:53 - 27:58to describe a special case of MDPs that
-
27:58 - 28:00makes very strong assumptions about the problem.
-
28:00 - 28:01
-
28:01 - 28:04But these assumptions are reasonable for many systems. With these
-
28:04 - 28:06assumptions, what we come up with, I
-
28:06 - 28:12think, are very nice and very elegant algorithms for solving even very large
-
28:12 - 28:16MDPs.
-
28:30 - 28:33So let's talk about linear
-
28:33 - 28:36quadratic regulation.
-
28:52 - 28:56We just talked about dynamic programming for finite horizon MDPs, so
-
28:56 - 28:58just remember that algorithm.
-
28:58 - 29:02When I come back to talk about an algorithm for solving LQR problems,
-
29:02 - 29:05I'm actually going to use exactly that dynamic programming algorithm
-
29:05 - 29:10that you just saw for finite horizon MDPs. I'll be using exactly
-
29:10 - 29:14that algorithm again. So just remember that for now.
-
29:14 - 29:16So let's talk about LQR.
-
29:16 - 29:22So I want to take these ideas and apply them to MDPs with continuous
-
29:22 - 29:25state spaces and maybe even continuous action
-
29:25 - 29:26spaces. So
-
29:26 - 29:34to specify an MDP, I need to give you this five-tuple
-
29:34 - 29:36of states, actions, state transition probabilities and the reward. I'm going to use
-
29:36 - 29:41the finite horizon, capital T, rather than
-
29:41 - 29:43discounting.
-
29:43 - 29:48So in LQR problems, I'm going to assume the following. I'm
-
29:48 - 29:51going to assume that the
-
29:51 - 30:04state space is continuous, S equals R N. And I'm going to assume, also, a continuous set of actions lying in R D.
-
30:04 - 30:08To specify the state transition probabilities, PSA,
-
30:08 - 30:12I need to tell you what the distribution of the next state is, given the current state and
-
30:12 - 30:14the current action.
-
30:14 - 30:17So we actually saw a little bit of this in the last lecture. I want to assume
-
30:17 - 30:20the next state, ST plus one,
-
30:20 - 30:23is going to be a linear function
-
30:23 - 30:25of the previous
-
30:25 - 30:27state, AST plus BAT
-
30:27 - 30:29plus WT -
-
30:29 - 30:35oh, excuse me. I meant to subscript that.
-
30:55 - 31:12where W T is Gaussian noise with mean zero and some covariance given by sigma W. I've written A and B here with subscripts T to show that these matrices could change over time. So this would be
-
31:12 - 31:15non-stationary dynamics.
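Written out, the assumed dynamics are:

```latex
s_{t+1} = A_t s_t + B_t a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w)
```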
-
31:15 - 31:20As a point of notation, unfortunately compiling
-
31:20 - 31:22ideas from multiple literatures,
-
31:22 - 31:26so it's sort of unfortunately that capital A denotes both a set of actions
-
31:26 - 31:28as well as
-
31:28 - 31:30a matrix.
-
31:30 - 31:34When you see A later on, A will usually be used to denote a matrix,
-
31:34 - 31:39rather than a set of actions. So [inaudible] overload notation again, but
-
31:39 - 31:42unfortunately the notational conventions when you have
-
31:42 - 31:47research ideas in multiple research communities, often they share the same symbol.
-
31:48 - 31:50So just to be concrete,
-
31:50 - 31:53AT is a matrix
-
31:53 - 31:56that's N
-
31:56 - 32:01by N. The B T's are matrices that are N by D.
-
32:02 - 32:06Just to be completely clear, right, the matrixes A and B,
-
32:06 - 32:09I'm going to assume, are fixed and known in advance. So I'm going to
-
32:09 - 32:12give you the matrices,
-
32:12 - 32:16A and B, and I'm going to give you sigma W. Your job is to find a good policy
-
32:16 - 32:17for
-
32:17 - 32:19this MDP. So in other words,
-
32:19 - 32:21this is my specification
-
32:21 - 32:25of the state transition probabilities.
-
32:25 - 32:28Looking ahead,
-
32:28 - 32:31we see this later, it turns out this
-
32:31 - 32:38noise term is not very important. So
-
32:40 - 32:41it turns out that
-
32:41 - 32:44the treatment of the noise term is not very important.
-
32:44 - 32:46We'll see this later. We can
-
32:46 - 32:47pretty much
-
32:47 - 32:49ignore the noise
-
32:49 - 32:53term, and we'll still do fine. This is just a warning
-
32:53 - 32:57in the sequel, what I do later, I might be slightly sloppy
-
32:57 - 32:58in my treatment
-
32:58 - 33:03of the noise term. In this very special case, it would be unimportant.
-
33:03 - 33:04The
-
33:04 - 33:06last
-
33:06 - 33:09thing I have to specify is some horizon time,
-
33:09 - 33:13and then I also have some
-
33:13 - 33:14reward
-
33:14 - 33:17function.
-
33:17 - 33:20For LQR control, I'm going to assume that a reward function
-
33:20 - 33:27can be written as
-
33:30 - 33:32this, where
-
33:32 - 33:35UT is a matrix that's N by N. VT is
-
33:35 - 33:36a
-
33:38 - 33:41matrix that's D by D. I'll
-
33:41 - 33:44assume that UT
-
33:44 - 33:47and VT are both positive semi-definite.
-
33:47 - 33:51Are both PSD. So
-
33:51 - 33:57the fact that UT and VT are both positive semi-definite matrices,
-
33:57 - 34:00that implies that
-
34:00 - 34:10S T transpose U T S T is greater than or equal to zero. Similarly, A T transpose V T A T is greater than or equal to zero.
-
34:10 - 34:17So this implies that
-
34:17 - 34:20your rewards are always negative. This is a
-
34:20 - 34:24somewhat depressing MDP in which there are only costs and no positive rewards, right,
-
34:24 - 34:24because of
-
34:24 - 34:27the minus sign
-
34:27 - 34:29there.
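So the reward function being assumed is:

```latex
R^{(t)}(s_t, a_t) = -\left( s_t^T U_t\, s_t + a_t^T V_t\, a_t \right), \qquad U_t \succeq 0,\ V_t \succeq 0
```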
-
34:29 - 34:36So
-
34:43 - 34:47as
-
34:47 - 34:50a complete example for how you might want to apply this,
-
34:50 - 34:54you've seen my helicopter videos, right? So one thing is,
-
34:54 - 34:56for example,
-
34:56 - 34:58suppose you have a helicopter,
-
34:58 - 35:01and you want the state ST to be
-
35:01 - 35:04as close to zero as possible.
-
35:05 - 35:08Then
-
35:08 - 35:10you might choose UT
-
35:10 - 35:13to be equal to the identity matrix.
-
35:13 - 35:38This will make R of S T, A T be equal to - I'll just write that down - oh, excuse me, minus: minus S T transpose S T, the negative of the squared norm of the state vector. So this would be penalizing the
-
35:38 - 35:40system
-
35:40 - 35:41quadratically
-
35:41 - 35:52for having a state that's far from zero, assuming that zero's the origin state. So if the goal is to make a helicopter hover around the state zero, then you might choose this sort of reward function. It
-
35:52 - 35:54turns out
-
35:55 - 36:12it's also very common to choose a cost for the action. So suppose I choose V T to be equal to an identity matrix. I get minus A T transpose A T here, which penalizes large actions as well.
-
36:12 - 36:16Including a quadratic cost for actions, it's also a fairly common
-
36:16 - 36:17thing to do.
-
36:17 - 36:20In practice, this tends to be effective
-
36:20 - 36:23of discouraging your system from
-
36:23 - 36:26jerking the controls around. This discourages making very huge control commands. Having
-
36:26 - 36:28a term like this
-
36:28 - 36:32reward function often makes many systems behave better.
-
36:32 - 36:45Of course, you can choose different values on the diagonal to give different state variables different weights and so on. So there are lots of possible choices for U and V. This is one example.
-
36:50 - 36:56So
-
36:56 - 36:59for the next few steps,
-
36:59 - 37:04I'm going to write out things, I'm going to derive things,
-
37:04 - 37:09for the general case of non-stationary dynamics. I'm going - as I write
-
37:09 - 37:12out more math and more equations for LQR,
-
37:12 - 37:14I'm going to try write it out
-
37:14 - 37:19for the fairly general case of time varied dynamics and time varied reward
-
37:19 - 37:22functions. So I'm [inaudible] function.
-
37:22 - 37:26But for purposes of understanding this material,
-
37:26 - 37:29you might want to think of just ignoring many of the subscripts, in terms
-
37:29 - 37:32of T. So
-
37:34 - 37:36for the sake of understanding this material,
-
37:36 - 37:40you might want to mentally assume that there are some fixed matrix, A, so
-
37:40 - 37:46that A is equal to A1, A2, equals A3 and so on.
-
37:46 - 37:51Similarly, there's some matrix B.
-
37:51 - 37:52Okay?
-
37:52 - 37:56So write it out for the fully general, non-stationary case, but
-
37:56 - 37:59you might just want to ignore many of the time subscripts and imagine the
-
37:59 - 38:02stationary case for now.
-
38:07 - 38:10Quite a bit later,
-
38:10 - 38:14we're going to talk about an extension of this called differential dynamic programming that will
-
38:14 - 38:17actually use a non-stationary
-
38:17 - 38:21[inaudible] to a very powerful effect for a specific algorithm. But
-
38:21 - 38:26for most of what we're about to do, just pretend that MDP is stationary.
-
38:26 - 38:33Okay.
-
38:49 - 38:51So before I
-
38:51 - 38:55talk about models, let me just say a couple of words about how you would go about
-
38:55 - 38:57coming up with the linear models. The
-
38:57 - 39:01key assumption in this model is that the dynamics are linear. There's also the assumption the
-
39:01 - 39:03reward function is quadratic, but
-
39:03 - 39:07let's talk about the assumption that the dynamics are linear.
-
39:07 - 39:14S T plus one equals A S T plus B A T. Maybe time varying, maybe
-
39:14 - 39:17stationary. I'm just writing stationary for now.
-
39:17 - 39:20So how do you get models like this?
-
39:20 - 39:23We actually saw one example of this already in the previous lecture. If
-
39:23 - 39:27you have an inverted pendulum system, and you want to model the
-
39:27 - 39:28inverted pendulum
-
39:29 - 39:34using a linear model like this, maybe [inaudible]. I'm not going to write that down.
-
39:34 - 39:38One thing you could do is run your inverted pendulum, start it off in some state as
-
39:38 - 39:39zero,
-
39:39 - 39:41take some action, A0, have it
-
39:41 - 39:46get to some state, S1. Take action A1 and so on, get to some state ST.
-
39:46 - 39:52I'll use a superscript one to denote that this is my first trial. Then you can repeat this a bunch of times.
-
39:52 - 39:55You can repeat this N times. I'm
-
39:55 - 39:58just executing actions on
-
39:58 - 40:00your physical robot.
-
40:00 - 40:02It could be a robot, it could be a
-
40:02 - 40:06chemical plant. It could be whatever. Trying out different actions in
-
40:06 - 40:10your system and watch
-
40:13 - 40:15what states it gets to. So
-
40:15 - 40:21you fit a linear model to your data, and choose the parameters A
-
40:21 - 40:26and B,
-
40:26 - 40:28that minimize
-
40:38 - 40:40the quadratic
-
40:40 - 40:41error term.
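As a small sketch (not from the lecture) of that fitting step: stacking all observed transitions, you can solve for A and B jointly by linear least squares, for example as follows.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Fit s_{t+1} ~ A s_t + B a_t by least squares.
    states:      (m, n) array of observed s_t
    actions:     (m, d) array of observed a_t
    next_states: (m, n) array of observed s_{t+1}
    Returns (A, B) minimizing sum ||s_{t+1} - (A s_t + B a_t)||^2.
    """
    X = np.hstack([states, actions])          # (m, n + d)
    # Solve X @ Theta ~ next_states, where Theta stacks A^T over B^T.
    Theta, _, _, _ = np.linalg.lstsq(X, next_states, rcond=None)
    n = states.shape[1]
    A = Theta[:n, :].T
    B = Theta[n:, :].T
    return A, B
```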
-
40:47 - 40:52So this says how well does AST plus BAT
-
40:52 - 40:55predict ST plus one. So you minimize the quadratic penalty term. This would be
-
40:55 - 40:58one reasonable way to estimate
-
40:58 - 41:01the parameters of a linear dynamical system for a
-
41:01 - 41:08physical robot or a physical chemical plant or whatever you may have. Another
-
41:09 - 41:11way to
-
41:11 - 41:13come up with a linear
-
41:14 - 41:19model
-
41:19 - 41:24consistently, if I want to control, is to take a nonlinear model and to
-
41:24 - 41:30linearize it. Let me show you what I mean by that. So you can linearize a
-
41:30 - 41:36nonlinear model.
-
41:36 - 41:37So
-
41:39 - 41:42let's say you have some nonlinear model
-
41:42 - 41:45that expresses ST plus one as some function of
-
41:45 - 41:46ST
-
41:46 - 41:47and
-
41:47 - 41:50AT.
-
41:50 - 41:54In the example in the previous lecture, I said for the inverted pendulum
-
41:57 - 42:04you could come up with such a model by referring to the laws of physics - or actually by downloading off-the-shelf software for doing physics simulations. So if you haven't
-
42:04 - 42:08seen such simulators before, you can go online. You can easily find
-
42:08 - 42:11many open-source packages for simulating the
-
42:11 - 42:14physics of simple devices like these.
-
42:14 - 42:18Download the software, type in the specifications of your robot, and it will
-
42:18 - 42:22simulate the physics for you. There are lots of open-source software packages like that. You can just
-
42:22 - 42:24download them.
-
42:24 - 42:27But something like that, you can now build a physics simulator
-
42:27 - 42:32that predicts the state as a function of the previous state and the previous action. So
-
42:32 - 42:34you actually come up with some function that
-
42:34 - 42:40says that
-
42:42 - 42:45- the state at the next time step - the state vector -
-
42:45 - 42:52will be some function of the current state
-
42:53 - 42:55and the current
-
42:55 - 42:58action, where the action in this case is just a real number that says how
-
42:58 - 43:02hard you accelerated to the left or right.
-
43:02 - 43:06Then you can take this nonlinear model. I actually wrote down a sample of a
-
43:06 - 43:08model in the last lecture, but
-
43:08 - 43:11in general, F would be some nonlinear function. [Inaudible] of
-
43:11 - 43:13a linear function.
-
43:13 - 43:17So what I mean by linearize is the following. So here's just a cartoon. I'll
-
43:17 - 43:20write down the math in a second. Let's say the horizontal axis is
-
43:20 - 43:24the input state, ST,
-
43:24 - 43:28and the output state, ST plus one, as I said.
-
43:28 - 43:32Here's
-
43:32 - 43:35the function at F.
-
43:35 - 43:39So the next state, ST plus one, will be some function of the previous state,
-
43:39 - 43:42ST and the action
-
43:42 - 43:46AT. So to linearize this model, what you would do is you would choose a point.
-
43:46 - 44:00We'll call it S bar T. Then you would take the derivative of this function and fit the tangent straight line to that function.
-
44:00 - 44:04So this allows you to express the next state, ST plus one. You
-
44:04 - 44:07can approximate the next state, ST plus one, as
-
44:07 - 44:10this linear function of the
-
44:10 - 44:14previous state, ST. So
-
44:14 - 44:30to make this cartoon really right, the horizontal axis here is really a state action pair that you're linearizing around. So this is just a cartoon. The horizontal axis represents the input state and the input action.
-
44:30 - 44:37So
-
44:42 - 44:46just to write this out in
-
44:46 - 44:50math, I'll write out the simple case first and the fully general one
-
44:50 - 44:51in a second. Suppose
-
44:51 - 45:22the horizontal axis was only the state - so let's pretend there are no actions for now - and S T plus one is just some function of S T. Then the linear function I drew would be: S T plus one is approximated as F of S bar T, plus F prime evaluated at the point S bar T, times S T minus S bar T. So with this, you'd express S T plus one as a linear
-
45:22 - 45:23function
-
45:23 - 45:29of ST. Just note that S bar T is a constant.
-
45:31 - 45:33It's not
-
45:33 - 45:37a variable. Does that make sense? S bar T is a constant. F prime of S bar T is gradient of the function F at the point
-
45:37 - 45:38S bar T. This is
-
45:38 - 45:42really just the equation of that linear function. So you
-
45:42 - 45:46can then convert this to
-
45:52 - 45:55A and B matrixes. Jumping back one board, I'm going
-
45:55 - 45:57to point out one other thing.
-
45:57 - 45:58Let's
-
45:58 - 46:00say I look at this straight line,
-
46:00 - 46:01and I ask
-
46:01 - 46:03how well does this straight line
-
46:03 - 46:08approximate my function F, my original simulator, my original function F.
-
46:08 - 46:11Then you sort of notice that
-
46:11 - 46:12in this neighborhood,
-
46:12 - 46:15in the neighborhood of S bar, there's a
-
46:15 - 46:17pretty good approximation. It's fairly close.
-
46:17 - 46:19But then as you move further away,
-
46:19 - 46:24moving far off to the left here, it becomes a pretty terrible approximation.
-
46:24 - 46:26So
-
46:26 - 46:29when you linearize
-
46:29 - 46:33a nonlinear model to apply LQR,
-
46:33 - 46:37one of the parameters you have to choose would be the point around which to
-
46:37 - 46:39linearize your nonlinear model.
-
46:39 - 46:44So if you expect your inverted pendulum system
-
46:44 - 46:49to spend most of its time in the vicinity of this state,
-
46:49 - 46:49then it'd
-
46:49 - 46:53be reasonable to linearize around this state
-
46:53 - 46:55because that means that
-
46:55 - 46:57the linear approximation would be a good approximation,
-
46:57 - 47:02usually, for the states that you expect the system to spend most of its time in.
-
47:02 - 47:06If conversely, you expect the system to spend most of its time
-
47:06 - 47:08at states far to the left, then this would be
-
47:08 - 47:10a terrible location to linearize.
-
47:10 - 47:14So one rule of thumb is to choose the position to linearize according to where
-
47:14 - 47:17you expect the system to spend most of its time
-
47:17 - 47:20so that the linear approximation will tend to be
-
47:20 - 47:23an accurate approximation in the
-
47:23 - 47:27vicinity of the states the system actually visits.
-
47:27 - 47:29Just to be fair, it is about
-
47:29 - 47:30choosing the point, S bar,
-
47:30 - 47:32A bar,
-
47:32 - 47:34that we'll use
-
47:34 - 47:37to come up with a linear function
-
47:37 - 47:41that we'll pretend it's a good approximation to my original
-
47:41 - 47:47nonlinear function,
-
47:55 - 48:01F. So for an example like the inverted pendulum problem, this problem, if
-
48:01 - 48:04you expect to do pretty well in this problem,
-
48:04 - 48:08then you would expect the state
-
48:08 - 48:13to often be near the zero state.
-
48:13 - 48:29If S equals zero corresponds to X being the center of the railway track that the inverted pendulum lives on, and you expect to do fairly well, then you expect the pole to mostly be upright, at zero degrees or 90 degrees, I guess. So you choose
-
48:29 - 48:33whatever state corresponds to having the pole
-
48:33 - 48:34upright. The zero
-
48:34 - 48:37velocity [inaudible], near zero velocity in the middle of the track.
-
48:37 - 48:40So you usually choose that as a state
-
48:40 - 48:43to linearize
-
48:43 - 48:46your inverted pendulum dynamics around. That's
-
48:46 - 48:53a region where you might want your approximation to be good.
-
48:56 - 49:00So I wrote this down. To come back to this formula, I wrote this down for the special
-
49:00 - 49:54case of a one-D state variable with no actions. The general formula for the linearization approximation is: S T plus one is approximated as F of S bar T, A bar T, plus the gradient of F with respect to S, evaluated at S bar T, A bar T, times S T minus S bar T, plus the gradient of F with respect to A, evaluated at S bar T, A bar T, times A T minus A bar T. Okay?
-
49:54 - 49:55Where
-
49:55 - 50:07these upside down triangles are the usual symbol for taking the derivative of F with respect to its first, vector-valued argument and its second argument, respectively. So
-
50:07 - 50:12by choosing an appropriate point, S bar T, A bar T, to linearize
-
50:12 - 50:13around,
-
50:13 - 50:14you'd
-
50:14 - 50:18now express ST plus one
-
50:18 - 50:22as a linear function of the current state and
-
50:22 - 50:23the current
-
50:23 - 50:24action, AT. Again,
-
50:24 - 50:29these things, S bar T, is a constant that you choose ahead of time.
-
50:29 - 50:32Same for A bar
-
50:35 - 50:40T. Lastly, having linearized this thing, you can then convert it
-
50:40 - 50:43to matrices
-
50:43 - 50:47like
-
50:47 - 50:49that.
-
50:49 - 50:55So the ST plus one is now a linear function of ST and
-
50:55 - 50:57AT.
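As an illustrative sketch (not from the lecture), here is one way to get those A and B matrices numerically from a simulator f(s, a), by finite-difference approximation of the two gradients around a chosen point (s_bar, a_bar):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Linearize s_{t+1} = f(s, a) around (s_bar, a_bar):
       f(s, a) ~ f(s_bar, a_bar) + A (s - s_bar) + B (a - a_bar).
    Returns A (n x n), B (n x d), and the constant c = f(s_bar, a_bar)."""
    s_bar = np.asarray(s_bar, dtype=float)
    a_bar = np.asarray(a_bar, dtype=float)
    n, d = len(s_bar), len(a_bar)
    c = f(s_bar, a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):                 # df/ds_i by central differences
        ds = np.zeros(n)
        ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f(s_bar - ds, a_bar)) / (2 * eps)
    for j in range(d):                 # df/da_j by central differences
        da = np.zeros(d)
        da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f(s_bar, a_bar - da)) / (2 * eps)
    return A, B, c
```

The leftover constant term c minus A s_bar minus B a_bar is exactly the affine offset discussed next, which can be absorbed by augmenting the state.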
-
50:57 - 51:04Questions about this?
-
51:11 - 51:16So just one tiny detail, and it's really not a huge deal, is that this thing
-
51:16 - 51:20below is technically an affine function.
-
51:20 - 51:23There might actually be an extra constant there, but
-
51:23 - 51:27this is not a difficult generalization of a linear dynamical system definition.
-
51:28 - 51:32One way to deal with that constant is actually to do something like take your
-
51:32 - 51:46definition of the state - let's say X, X dot, theta, theta dot - and augment your state vector to have an extra intercept term, one. With the intercept one, and working out the A matrix, you can then
-
51:46 - 51:49take care of the extra constant, C, as well. So you can
-
51:49 - 51:51deal with this thing being -
-
51:51 - 51:55technically it's an affine function because of this extra offset, rather than a linear
-
51:55 - 51:56function.
-
51:56 - 52:02But this is just a little bit of bookkeeping you can work out for yourself and shouldn't
-
52:02 - 52:03be
-
52:03 - 52:05a huge
-
52:06 - 52:13deal.
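For instance, one way to write that bookkeeping (a sketch, with c denoting the extra constant term):

```latex
\tilde{s}_t = \begin{bmatrix} s_t \\ 1 \end{bmatrix}, \qquad
\tilde{A} = \begin{bmatrix} A & c \\ 0 & 1 \end{bmatrix}, \qquad
\tilde{B} = \begin{bmatrix} B \\ 0 \end{bmatrix}, \qquad
\tilde{s}_{t+1} \approx \tilde{A}\,\tilde{s}_t + \tilde{B}\, a_t
```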
-
52:18 - 52:19So to summarize, you see I
-
52:19 - 52:21have this up,
-
52:21 - 52:25you can learn a model, you can take a nonlinear model. Your nonlinear
-
52:25 - 52:29model can be a physics model or a nonlinear model you learned and linearize it.
-
52:30 - 52:32Now I'll pose the LQR problem
-
52:32 - 52:35in which we have
-
52:35 - 52:42specification of the MDP in which the states are in RN, the actions are in RD,
-
52:42 - 52:50and the state transition probabilities are given by the noisy linear equation: S T plus one equals
-
52:50 - 52:54ATST plus BTAT. Our rewards are going to be
-
52:54 - 52:57these quadratic functions.
-
52:57 - 53:02So the specification of the MDP means that we know the A matrixes,
-
53:02 - 53:06the B matrixes, the U matrixes
-
53:06 - 53:09and the V matrixes. Our goal is to come up with a policy
-
53:09 - 53:14to maximize our finite
-
53:14 - 53:19horizon sum of rewards.
-
53:20 - 53:23So our goal is to come up with a policy,
-
53:23 - 53:26first, to maximize
-
53:26 - 53:29the expected value of this
-
53:29 - 53:34finite horizon sum of
-
53:39 - 53:46rewards. Okay.
-
53:48 - 53:49So our
-
53:49 - 53:54approach to solving this problem
-
53:54 - 53:57will be exactly that
-
53:57 - 54:00finite horizon dynamic programming algorithm that we worked out a
-
54:00 - 54:02little earlier in this
-
54:02 - 54:04lecture. In particular,
-
54:04 - 54:07my strategy for finding the optimal policy
-
54:07 - 54:09will be to
-
54:09 - 54:12first find V star of
-
54:12 - 54:14T, the capital T,
-
54:14 - 54:19and then I'll apply by a recursion to find V star of T minus one, V star of T minus
-
54:19 - 54:23two and so on.
-
54:23 - 54:28In the dynamic programming algorithm we worked out, V star subscript T of the
-
54:28 - 54:39state S T - this is the maximum over the actions you might take at that time of R of S T, A T.
-
54:39 - 54:43Again, just for the sake of understanding this material, you can probably
-
54:43 - 54:47pretend the rewards and the dynamics are actually stationary. I'll write out all
-
54:47 - 54:53these subscripts all the time, but you can ignore them if you're reading this for the first time.
-
54:53 - 55:04So this is equal to the max over A capital T of minus S T transpose U T S T, minus A T transpose V T A T -
-
55:12 - 55:16right? I hope this isn't confusing. The superscript Ts denote transposes.
-
55:16 - 55:24The lowercase t subscripts denote the time index, which here is capital T. So that's just the definition of my quadratic rewards. So this
-
55:24 - 55:28is clearly maximized as minus
-
55:29 - 55:31ST transpose
-
55:31 - 55:38UTST
-
55:38 - 55:40because
-
55:40 - 55:48that last term, A T transpose V T A T, is greater than or equal to zero - that comes from my assumption that V T is positive semi-definite. So the best action to take in
-
55:48 - 55:49the last time step
-
55:49 - 55:53is just the action
-
55:53 - 55:54zero. So
-
55:54 - 56:01pi star subscript T of ST is equal to the
-
56:02 - 56:04arg max over actions of
-
56:04 - 56:11that same thing. It's just
-
56:11 - 56:12zero. It's by
-
56:12 - 56:16choosing the zero action,
-
56:16 - 56:19AT transpose VTAT becomes zero, and that's
-
56:19 - 56:39how this reward is maximized.
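So, written out, the start of the recursion is:

```latex
V^*_T(s_T) = -\, s_T^T U_T\, s_T, \qquad \pi^*_T(s_T) = 0
```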
-
56:50 - 56:57Any questions, or is something illegible? Okay.
-
57:00 - 57:04So now let's do the dynamic programming step where
-
57:04 - 57:10my goal is given
-
57:10 - 57:12VT plus one,
-
57:12 - 57:16I want to compute VT. Given V star T plus one, I want to compute
-
57:16 - 57:20V star of T. So this is the dynamic programming step.
-
57:22 - 57:25So the
-
57:25 - 57:29DP steps I wrote down previously was this. So for the finite state case, I
-
57:29 - 57:36wrote down the following.
-
58:32 - 58:36So this is exactly the equation I wrote down
-
58:36 - 58:37previously,
-
58:37 - 58:52and this is what I wrote down for finite states, where you have these discrete state transition probabilities, and we can sum over this discrete set of states. Now we're in a continuous, infinite state space again, so this sum over states should actually become an integral. I'm going to
-
58:52 - 58:56actually skip the integral step. We'll just go ahead and write this last term here as an
-
58:56 - 58:57expectation.
-
58:57 - 59:19So this is going to be the max over actions A T of R of S T, A T, plus - and then this becomes an expectation over the random next state, S T plus one, drawn from the state transition probabilities given by P of S T, A T - of V star T plus one of S T plus one. So this
-
59:19 - 59:21is
-
59:21 - 59:24the same equation written down
-
59:24 - 59:26as an expectation.
-
59:27 - 59:32So what I need to do is given a representation of V star T plus one, I
-
59:32 - 59:33need to find V
-
59:33 - 59:36star of T.
-
59:36 - 59:41So it turns out that LQR has the following useful property. It turns out that each of
-
59:41 - 59:43these value functions
-
59:43 - 59:46can be represented as a quadratic function.
-
59:46 - 59:49So concretely,
-
59:49 - 59:53let's suppose that V
-
59:53 - 60:00star T plus one - suppose that this can be expressed as a
-
60:00 - 60:07quadratic
-
60:09 - 60:11function, written like so,
-
60:11 - 60:12where
-
60:12 - 60:15the matrix phi T plus one
-
60:15 - 60:17is an N by N matrix, and
-
60:17 - 60:21psi T plus one
-
60:21 - 60:24is just a real number.
-
60:24 - 60:29So in other words, suppose V star T plus one is just a quadratic function
-
60:30 - 60:34of the state ST
-
60:34 - 60:37plus one.
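That is, the assumption (which the recursion below will preserve) is:

```latex
V^*_{t+1}(s_{t+1}) = s_{t+1}^T\, \Phi_{t+1}\, s_{t+1} + \Psi_{t+1}
```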
-
60:37 - 60:40We can then show that
-
60:40 - 60:44when you do one dynamic programming step - when you
-
60:46 - 60:50plug this definition of V star T plus one into your dynamic
-
60:50 - 60:52programming step in the equation I had just now,
-
60:52 - 60:55you can show that you would get
-
60:55 - 60:58that V star T as well, will
-
60:58 - 61:02also be a quadratic function of the
-
61:06 - 61:28same form, right? For some appropriate matrix, phi T, and some appropriate real number, psi of T. So what you can do is start off the recursion with - well, does that make
-
61:28 - 61:32sense?
-
61:32 - 61:35So what you can do is
-
61:35 - 61:37start off the recursion
-
61:37 - 61:41as follows. So previously, we worked out that V star capital T,
-
61:41 - 61:44we said that this is minus
-
61:44 - 61:48ST
-
61:48 - 61:50transpose UTST. So we
-
61:50 - 61:51have that phi
-
61:51 - 61:53of capital T is
-
61:53 - 61:56equal to minus UT,
-
61:56 - 61:59and psi of capital T is equal to zero.
-
61:59 - 62:01Now V star T of ST
-
62:01 - 62:08is equal to ST transpose phi of T, ST plus psi of T.
-
62:08 - 62:09So
-
62:09 - 62:13you can start out the recursion this way with phi of T equals minus UT and psi of T
-
62:13 - 62:15equals zero.
-
62:15 - 62:17Then work out
-
62:17 - 62:20what the recursion is. I won't
-
62:20 - 62:23
-
62:23 - 62:27actually do the full derivation. It's mainly algebra, and
-
62:29 - 62:33you've actually done this sort of Gaussian expectation
-
62:33 - 62:40math a lot in your homework by now.
-
62:40 - 62:44So I won't do the full derivation. I'll just outline the
-
62:45 - 62:47one-ish G step. So
-
62:47 - 62:51in dynamic programming step, V star ST is
-
62:51 - 62:53equal to
-
62:53 - 62:56max over actions AT of
-
62:56 - 63:03the immediate reward.
-
63:04 - 63:06So this was R of
-
63:06 - 63:09SA from my equation in the dynamic programming step.
-
63:09 - 63:12Then plus an expected value
-
63:12 - 63:18over the random next state, S T plus one, drawn from the
-
63:18 - 63:22Gaussian distribution would mean
-
63:22 - 63:24ATST plus
-
63:24 - 63:28B T A T and covariance sigma
-
63:28 - 63:32W. So what this is, this is really my
-
63:32 - 63:35specification for P of STAT. This is my
-
63:35 - 63:39state transition distribution in the LQR setting. This is my
-
63:39 - 63:41state transition distribution
-
63:41 - 63:50if I take action A T in the state S T. Then my next state is distributed Gaussian with mean A T S T plus B T A T and
-
63:50 - 63:51covariance
-
63:51 - 63:54sigma W. Then
-
63:54 - 64:24I take the expected value of the value of that next state. This, of course, is just V star T plus one of S T plus one. I hope
-
64:24 - 64:27this makes sense. This is just taking that equation I had previously in the
-
64:27 - 64:30dynamic programming step. So the V star of T, ST
-
64:30 - 64:33equals max over actions
-
64:33 - 64:36of the immediate rewards
-
64:36 - 64:41plus an expected value over the next state of V star of the next state with
-
64:41 - 64:43the clock advanced by one.
-
64:43 - 64:48So I've just plugged in the definitions of the reward, of the state transition distribution,
-
64:48 - 64:55and of the value function.
-
64:56 - 65:02Actually, could you raise your hand if this makes sense? Cool. So if you write
-
65:02 - 65:03this out
-
65:03 - 65:08and you expand the expectation - I know you've done this many times, so I
-
65:08 - 65:09won't do it -
-
65:09 - 65:11this whole thing on the right-hand side
-
65:11 - 65:15simplifies to a big quadratic function of the action, AT.
-
65:15 - 65:18So this whole
-
65:18 - 65:23thing simplifies to a big
-
65:23 - 65:26quadratic function
-
65:32 - 65:36of the action
-
65:36 - 65:38AT.
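Concretely, plugging in the quadratic form of V star T plus one and using the standard Gaussian identity E[(m + w)^T Phi (m + w)] = m^T Phi m + tr(Sigma_w Phi) for w ~ N(0, Sigma_w), the quantity inside the max works out to (a sketch of the algebra being skipped here):

```latex
-\,s_t^T U_t s_t \;-\; a_t^T V_t a_t
\;+\; (A_t s_t + B_t a_t)^T \Phi_{t+1} (A_t s_t + B_t a_t)
\;+\; \operatorname{tr}(\Sigma_w \Phi_{t+1}) \;+\; \Psi_{t+1}
```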
-
65:38 - 65:42We want to maximize this with respect to the actions AT. So to
-
65:42 - 65:44maximize a big quadratic function, you
-
65:44 - 65:48just take the derivative of the function with respect to the action AT, set the derivative equal
-
65:48 - 65:52to zero, and then you've maximized the right-hand side, with
-
65:52 - 65:56respect to the action,
-
65:56 - 66:00AT. It turns out - I'm just going to write this expression down for completeness. You
-
66:00 - 66:02can derive it yourself at any time. It turns
-
66:02 - 66:05out if you actually maximize that thing on the right- hand side as a
-
66:05 - 66:07function of the actions, AT,
-
66:07 - 66:12you find that the optimal AT is going to be
-
66:12 - 66:14that
-
66:14 - 66:21times
-
66:23 - 66:26ST.
-
66:26 - 66:29Don't
-
66:29 - 66:34worry about this expression. You can get it from [inaudible] and derive it yourself.
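(Since the expression written on the board is inaudible in the recording, here, for reference, is the form a standard derivation gives, under the assumption - a reconstruction, not a transcription - that the immediate reward is R^{(t)}(s_t, a_t) = -(s_t^\top U_t s_t + a_t^\top V_t a_t), with V_t the action-cost matrix:

\[
a_t = \big(V_t - B_t^\top \Phi_{t+1} B_t\big)^{-1} B_t^\top \Phi_{t+1} A_t \; s_t,
\]

obtained by setting the gradient of the right-hand side with respect to a_t to zero.)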
-
66:34 - 66:38But the key thing to note is that the optimal action, AT, is going to be some big
-
66:38 - 66:39matrix.
-
66:39 - 66:40We're going to call this thing
-
66:40 - 66:43LT
-
66:43 - 66:47times
-
66:49 - 66:51ST. In other words, the optimal action to take in this
-
66:51 - 66:53given state is going to be
-
66:53 - 66:55some linear function
-
66:55 - 66:59of the state, ST. So
-
66:59 - 67:03having done dynamic programming, you remember also when we worked out the
-
67:03 - 67:05dynamic programming algorithm for finite
-
67:05 - 67:06horizon MDPs,
-
67:06 - 67:08we said that
-
67:08 - 67:12the way you compute the optimal policy, pi star of T
-
67:12 - 67:14of ST.
-
67:14 - 67:18This is always the arg max of the same thing. The arg max
-
67:18 - 67:23over actions AT of the same thing:
-
67:23 - 67:25R of ST, AT plus
-
67:25 - 67:32your expected value, over ST plus one drawn from P of ST, AT, of V star, T
-
67:36 - 67:39plus one, ST plus one. This thing on the right-hand side is always the same thing as
-
67:39 - 67:42the thing we maximized
-
67:42 - 67:43
-
67:43 - 67:46above. So what this means is that when I take the value of AT that maximizes
-
67:46 - 67:50this, the optimal action to take from the state ST
-
67:50 - 67:53is actually equal to
-
67:53 - 67:56LT
-
67:56 - 68:02times ST.
-
68:02 - 68:04What
-
68:04 - 68:06was shown is that
-
68:06 - 68:08when you're in some state, ST, the
-
68:08 - 68:14optimal action for that state is going to be some matrix, LT, which can compute,
-
68:14 - 68:14times
-
68:14 - 68:17the state,
-
68:17 - 68:18ST.
-
68:18 - 68:24In other words, the optimal action is actually a linear function of the state.
-
68:24 - 68:27I'm just going to point out, this is not a function approximation here, right.
-
68:27 - 68:32What we did not do, we did not say,
-
68:32 - 68:35let's find the optimal linear policy. We didn't say, let's look at the optimal
-
68:35 - 68:38policy, and then we'll fit this straight line to the optimal policy. This is not
-
68:38 - 68:42about approximating the optimal policy with a straight line.
-
68:42 - 68:47This derivation is saying that the optimal policy is a straight line. The
-
68:47 - 68:54optimal action is a linear function of the current
-
68:54 - 68:57state.
-
68:57 - 69:01Moreover, you've worked out this value for AT
-
69:01 - 69:02
-
69:02 - 69:05that maximizes this thing on
-
69:05 - 69:10the right-hand side. So you take this and plug it back in to do the dynamic programming recursion.
-
69:10 - 69:17What you find is that -
-
69:20 - 69:24so you take AT and plug it back in to do the maximization. It will
-
69:24 - 69:26actually get
-
69:26 - 69:28you this formula,
-
69:28 - 69:31so V star
-
69:31 - 69:35TST. So you find that
-
69:35 - 69:39it will indeed be a quadratic function like this
-
69:39 - 69:42of the following form where
-
69:42 - 69:46- and I just write out the equations for the sake of
-
69:46 - 69:53completeness. Don't worry too much about their forms. You can
-
69:57 - 70:04derive this yourself.
-
70:48 - 70:52So just to summarize, don't worry too much about the forms of these
-
70:52 - 70:54equations.
-
70:54 - 70:58What we've done is written down the recursion that expresses
-
70:58 - 71:00phi T and psi T
-
71:00 - 71:02as a function of phi T plus one
-
71:02 - 71:04and psi T plus one.
-
71:04 - 71:05So
-
71:05 - 71:08this allows you to compute the optimal value function
-
71:08 - 71:14for when the clock is at time lowercase T, as a function of the optimal value
-
71:14 - 71:18function for when the clock is at time T plus one.
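(For reference, under the same assumed reward convention as above, carrying the algebra through gives the recursion

\[
\Phi_t = A_t^\top\Big(\Phi_{t+1} - \Phi_{t+1} B_t\big(B_t^\top \Phi_{t+1} B_t - V_t\big)^{-1} B_t^\top \Phi_{t+1}\Big) A_t \;-\; U_t,
\qquad
\psi_t = \psi_{t+1} + \operatorname{tr}\big(\Sigma_w \Phi_{t+1}\big).
\]

Treat the exact symbols as a reconstruction rather than a transcription of the board.)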
-
71:20 - 71:25So
-
71:25 - 71:28to summarize,
-
71:28 - 71:33here's our finite horizon LQR algorithm
-
71:33 - 71:36- actually, just to give this equation a name as well. This
-
71:36 - 71:40recursion, in terms of the phi Ts, this is called the
-
71:40 - 71:45discrete
-
71:45 - 71:52time Riccati equation. It's the
-
71:54 - 71:58recursion that gives you phi
-
71:58 - 72:02T in terms of phi T plus one.
-
72:02 - 72:05So
-
72:05 - 72:11to summarize,
-
72:12 - 72:17our algorithm for finding the exact solution to finite horizon LQR
-
72:17 - 72:23problems is as follows. We initialize phi T
-
72:23 - 72:24to
-
72:24 - 72:27be equal to minus
-
72:30 - 72:33UT
-
72:33 - 72:37and psi T to be equal to zero.
-
72:37 - 72:39Then
-
72:39 - 72:43recursively,
-
72:43 - 72:45calculate
-
72:45 - 72:47phi T
-
72:47 - 72:49and psi
-
72:49 - 72:50T
-
72:50 - 72:54as a function of phi T plus one
-
72:54 - 72:56and psi T plus
-
72:56 - 73:01one with the discrete time -
-
73:02 - 73:06actually, excuse me. So
-
73:06 - 73:09recursively calculate phi T and psi
-
73:09 - 73:11T as a function of phi T plus one and psi T plus one, as I showed,
-
73:11 - 73:15using the discrete time Riccati equation.
-
73:15 - 73:18So you do this for
-
73:18 - 73:19T equals
-
73:19 - 73:22T minus one, T minus two
-
73:22 - 73:29and so on, down to time zero. Then you
-
73:29 - 73:33compute
-
73:33 - 73:37LT as a function of - actually, is it phi T or phi T
-
73:37 - 73:40plus one? Phi T
-
73:40 - 73:42plus one, I think.
-
73:42 - 73:46As a function of phi T plus one and psi T plus one.
-
73:46 - 73:50This is actually a function of only phi T plus one. You don't really need psi T
-
73:50 - 73:53plus one.
-
73:53 - 74:00Now you have your optimal policy.
-
74:00 - 74:03So having computed the LTs,
-
74:03 - 74:06you now have the
-
74:06 - 74:08optimal action to take in the state
-
74:08 - 74:15ST, just given by
-
74:25 - 74:29this
-
74:29 - 74:35linear equation.
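(As a minimal reference sketch of the summary above - written under the assumed reward convention R(s_t, a_t) = -(s_t' U s_t + a_t' V a_t), with time-invariant matrices purely to keep it short; this is illustrative code, not code from the course:

import numpy as np

def finite_horizon_lqr(A, B, U, V, Sigma_w, T):
    """Backward recursion for the finite-horizon LQR problem sketched above.
    Assumed conventions: reward R(s_t, a_t) = -(s_t' U s_t + a_t' V a_t),
    dynamics s_{t+1} ~ N(A s_t + B a_t, Sigma_w), and value function
    V*_t(s) = s' Phi_t s + psi_t.  The time-varying case simply indexes
    A, B, U, V by t."""
    Phi = [None] * (T + 1)
    psi = [None] * (T + 1)
    L = [None] * T

    # Base case: V*_T(s) = -s' U s, i.e. Phi_T = -U and psi_T = 0.
    Phi[T] = -U
    psi[T] = 0.0

    # Discrete-time Riccati recursion, run backwards from t = T-1 down to 0.
    for t in reversed(range(T)):
        P = Phi[t + 1]
        M = B.T @ P @ B - V                  # negative definite here
        K = np.linalg.solve(M, B.T @ P)      # M^{-1} B' P
        Phi[t] = A.T @ (P - P @ B @ K) @ A - U
        psi[t] = psi[t + 1] + np.trace(Sigma_w @ P)
        # Optimal action is linear in the state: a_t = L_t s_t.
        L[t] = -np.linalg.solve(M, B.T @ P @ A)
    return Phi, psi, L

# Tiny usage example with made-up matrices (purely illustrative).
if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    U = np.eye(2)
    V = 0.1 * np.eye(1)
    Sigma_w = 0.01 * np.eye(2)
    Phi, psi, L = finite_horizon_lqr(A, B, U, V, Sigma_w, T=50)
    s = np.array([1.0, 0.0])
    a = L[0] @ s          # optimal action at time 0 in state s
    print(a)

Note that, exactly as said in the lecture, the controller only needs the gains L_t; the psi_t terms affect the value of a state but not which action is optimal.)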
-
74:35 - 74:44How much time do I have left? Okay. Let me just say one last thing about this before I close.
-
74:44 - 74:48Maybe I'll do it next week. I think I'll do it next session
-
74:48 - 74:52instead. So it actually turns out there's one cool property about this that is
-
74:52 - 74:55kind of subtle, but you'll find it out in the next lecture.
-
74:55 - 75:02Are there questions about this before we close for today, then? So
-
75:05 - 75:07the very cool thing about
-
75:07 - 75:13the solution of discrete time LQR problems - finite horizon LQR
-
75:13 - 75:17problems is that this is a problem with an infinite state space - with a continuous state space.
-
75:17 - 75:21But nonetheless, under the assumptions we made, you can prove that the
-
75:21 - 75:25value function is a quadratic function of the state.
-
75:25 - 75:29Therefore, just by computing these matrices phi T and the real numbers
-
75:29 - 75:33psi T, you can actually exactly represent the value function, even for
-
75:33 - 75:37these infinitely large state spaces, even for continuous state spaces.
-
75:37 - 75:43So the computation of these algorithms scales only like the cube,
-
75:43 - 75:46scales only as a polynomial in terms of the number of state variables
-
75:46 - 75:50whereas in [inaudible] dimensionality problems, with [inaudible], we
-
75:50 - 75:53had algorithms that scale exponentially with the dimension of the problem.
-
75:53 - 75:55Whereas LQR scales only like the cube
-
75:55 - 76:01of the dimension of the problem. So this easily
-
76:01 - 76:04applies to problems with even very large state spaces. So we actually often apply
-
76:04 - 76:06variations of this algorithm
-
76:06 - 76:09to some subset, to some particular subset of the things we do on our
-
76:09 - 76:13helicopter, which has high dimensional state spaces, with twelve or higher
-
76:13 - 76:14dimensions.
-
76:14 - 76:16This has worked very well for that.
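(To make the scaling claim concrete: each backward step of the Riccati recursion costs a handful of n-by-n matrix multiplications plus one linear solve in the action dimension, so the total work is on the order of T(n^3 + m^3) for state dimension n and action dimension m - polynomial, rather than exponential, in the dimension of the state.)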
-
76:16 - 76:17So
-
76:17 - 76:21it turns out there are even more things you can do with this, and I'll continue with
-
76:21 - 76:24that in the next lecture. So let's close for today, then.