Lecture 17 | Machine Learning (Stanford)
-
0:10 - 0:14This presentation is delivered by the Stanford Center for Professional
-
0:14 - 0:21Development.
-
0:22 - 0:24Okay, good morning. Welcome back.
-
0:24 - 0:27So I hope all of you had a good Thanksgiving break.
-
0:27 - 0:31After the problem sets, I suspect many of us needed one.
-
0:31 - 0:36Just one quick announcement so as I announced by email
-
0:36 - 0:38a few days ago, this afternoon
-
0:38 - 0:38we'll be
-
0:38 - 0:43doing another tape ahead of lecture, so I won't physically be here on
-
0:43 - 0:44Wednesday, and so we'll be taping
-
0:44 - 0:46this Wednesday's lecture ahead of time.
-
0:46 - 0:47If you're free
-
0:47 - 0:53this afternoon, please come to that; it'll be at 3:45 p.m.
-
0:53 - 0:58in the Skilling Auditorium in Skilling 193
-
0:58 - 1:00at 3:45. But of course, you can also just show up
-
1:00 - 1:07in class as usual at the usual time or just watch it online as usual also.
-
1:07 - 1:13Okay, welcome back. What I want to do today is continue our discussion on
-
1:13 - 1:16Reinforcement Learning in MDPs. Quite a
-
1:16 - 1:20long topic for me to go over today, so most of today's lecture will be on
-
1:20 - 1:22continuous state MDPs,
-
1:22 - 1:24and in particular,
-
1:24 - 1:27algorithms for solving continuous state MDPs,
-
1:27 - 1:30so I'll talk just very briefly about discretization.
-
1:30 - 1:34I'll spend a lot of time talking about models, assimilators of MDPs,
-
1:34 - 1:39and then talk about one algorithm called fitted value iteration and
-
1:40 - 1:45two functions which builds on that, and then
-
1:45 - 1:50hopefully, I'll have time to get to a second algorithm called, approximate policy
-
1:50 - 1:51iteration
-
1:51 - 1:53Just to recap,
-
1:53 - 1:55right, in the previous lecture,
-
1:55 - 1:58I defined the Reinforcement Learning problem and
-
1:58 - 2:03I defined MDPs, so let me just recap the notation.
-
2:04 - 2:11I said that an MDP or a Markov Decision Process, was a ? tuple,
-
2:14 - 2:16comprising
-
2:16 - 2:19those things and
-
2:19 - 2:22the running example of those using last time
-
2:22 - 2:25was this one
-
2:25 - 2:27right, adapted from
-
2:27 - 2:30the Russell and Norvig AI textbook.
-
2:30 - 2:32So
-
2:32 - 2:35in this example MDP that I was using,
-
2:35 - 2:39it had 11 states, so that's where S was.
-
2:39 - 2:46The actions were compass directions: north, south, east and west. The
-
2:47 - 2:50state transition probability is to capture
-
2:50 - 2:51chance of your
-
2:51 - 2:52transitioning to every state
-
2:52 - 2:56when you take any action in any other given state and so
-
2:56 - 2:58in our example that captured
-
2:58 - 3:01the stochastic dynamics of our
-
3:01 - 3:04robot wondering around [inaudible], and we said if you take the action north
-
3:04 - 3:06and the south,
-
3:06 - 3:09you have a .8 chance of actually going north and .1 chance of veering
-
3:09 - 3:10off,
-
3:10 - 3:13so that .1 chance of veering off to the right so
-
3:13 - 3:17said model of the robot's noisy
-
3:17 - 3:19dynamic with a [inaudible]
-
3:19 - 3:22and the
-
3:22 - 3:25reward function was that +/-1 at the
-
3:25 - 3:31absorbing states
-
3:31 - 3:33and
-
3:33 - 3:38-0.02
-
3:38 - 3:40elsewhere.
-
3:40 - 3:41This is an example of an MDP, and
-
3:41 - 3:48that's what these five things were. Oh, and I used a discount
-
3:48 - 3:49factor G
-
3:49 - 3:54of usually a number slightly less than one, so that's the 0.99.
-
3:54 - 3:59And so our goal was to find the policy, the
-
3:59 - 4:02control policy and that's at ?,
-
4:02 - 4:04which is a function
-
4:04 - 4:06mapping from the states of the actions
-
4:06 - 4:10that tells us what action to take in every state,
-
4:10 - 4:14and our goal was to find a policy that maximizes the expected value of
-
4:14 - 4:17our total payoff. So we want to find a
-
4:17 - 4:20policy. Well,
-
4:20 - 4:27let's see. We define value functions Vp (s) to be equal to
-
4:33 - 4:36this.
-
4:36 - 4:37We said that
-
4:37 - 4:41the value of a policy ? from State S was given by the expected
-
4:41 - 4:42value
-
4:42 - 4:46of the sum of discounted rewards, conditioned on your executing the policy ?
-
4:46 - 4:49and
-
4:49 - 4:54you're stating off your [inaudible] to say in the State S,
-
4:54 - 5:00and so our strategy for finding the policy was sort of comprised of
-
5:00 - 5:07two steps. So the goal is to find
-
5:12 - 5:15a good policy that maximizes the suspected value of
-
5:15 - 5:17the sum of discounted rewards,
-
5:17 - 5:21and so I said last time that one strategy for finding the [inaudible] of a
-
5:21 - 5:22policy
-
5:22 - 5:26is to first compute the optimal value function which I denoted V*(s) and is defined
-
5:26 - 5:29like that. It's the maximum
-
5:29 - 5:32value that any policy can obtain,
-
5:32 - 5:39and for example,
-
5:45 - 5:48the optimal value function
-
5:48 - 5:55for that MDP looks like this.
-
6:01 - 6:04So in other words, starting from any of these states,
-
6:04 - 6:07what's the expected value of the
-
6:07 - 6:09sum of discounted rewards you get,
-
6:09 - 6:12so this is
-
6:12 - 6:13V*.
-
6:13 - 6:15We also said that
-
6:15 - 6:17once you've found V*,
-
6:17 - 6:20you can compute the optimal policy
-
6:20 - 6:27using this.
-
6:34 - 6:36And
-
6:36 - 6:41so once you've found
-
6:41 - 6:48V and the last piece of this algorithm was Bellman's equations where
-
6:48 - 6:51we know that V*,
-
6:51 - 6:56the optimal sum of discounted rewards you can get for State S, is equal to the
-
6:56 - 6:59immediate reward you get just for starting off in that state
-
6:59 - 7:02+G(for the
-
7:02 - 7:06max over all the actions you could take)(your
-
7:12 - 7:14future
-
7:14 - 7:16sum of discounted
-
7:16 - 7:16rewards)(your
-
7:16 - 7:21future payoff starting from the State S(p) which is where you might transition to
-
7:21 - 7:22after 1(s).
-
7:22 - 7:26And so this gave us a value iteration algorithm,
-
7:26 - 7:30which was essentially
-
7:30 - 7:35V.I. I'm abbreviating value iteration as V.I., so in the value iteration
-
7:36 - 7:43algorithm, in V.I., you just take Bellman's equations and you repeatedly do this.
-
7:50 - 7:54So initialize some guess of the value functions. Initialize a zero as the sum
-
7:54 - 7:55rounding guess and
-
7:55 - 7:59then repeatedly perform this update for all states, and I said last time that if
-
7:59 - 8:01you do this repeatedly,
-
8:01 - 8:06then V(s) will converge to the optimal value function, V(s),
-
8:06 - 8:09you can
-
8:09 - 8:12compute the
-
8:12 - 8:14
-
8:14 - 8:19optimal policy ?*. Just one final thing I want to recap was
-
8:20 - 8:27the policy iteration algorithm
-
8:32 - 8:39in which we repeat the following two steps.
-
8:40 - 8:44So let's see, given a random initial policy, we'll solve for Vp. We'll solve for the
-
8:44 - 8:48value function for that specific policy.
-
8:48 - 8:52So this means for every state, compute the expected sum of discounted rewards
-
8:52 - 8:55for if you execute the policy ?
-
8:55 - 8:58from that state,
-
8:58 - 9:05and then the other step of policy
-
9:28 - 9:30iteration
-
9:30 - 9:32is
-
9:32 - 9:35having found the value function for your policy,
-
9:35 - 9:37you then update the policy
-
9:37 - 9:41pretending that you've already found the optimal value function, V*,
-
9:41 - 9:42and then you
-
9:42 - 9:46repeatedly perform these two steps where you solve for the value function for your current
-
9:46 - 9:47policy
-
9:47 - 9:51and then pretend that that's actually the optimal value function and solve for the
-
9:51 - 9:55policy given the value function, and you repeatedly update
-
9:55 - 9:59the value function or update the policy using that value function.
-
9:59 - 10:00And
-
10:00 - 10:04last time I said that this will also cause
-
10:04 - 10:08the estimated value function V to converge to V*
-
10:08 - 10:13and this will cause p to converge to ?*, the optimal policy.
-
10:13 - 10:14So those are based
-
10:14 - 10:18on our last lecture [inaudible] MDPs and introduced a lot of new
-
10:18 - 10:20notation symbols and just
-
10:20 - 10:22summarize all that again.
-
10:22 - 10:26What I'm about to do now, what I'm about to do for the rest of today's
-
10:26 - 10:28lecture is actually
-
10:28 - 10:32build on these two algorithms so
-
10:32 - 10:36I guess if you have any questions about this piece, ask now since I've got to go on please. Yeah. Student:[Inaudible] how
-
10:36 - 10:42those two algorithms are very
-
10:42 - 10:44different?
-
10:44 - 10:47Instructor (Andrew Ng):I see, right, so
-
10:47 - 10:52yeah, do you see that they're different? Okay,
-
10:52 - 10:54how it's different. Let's see.
-
10:54 - 10:55So
-
10:55 - 11:00well here's one difference. I didn't say this cause no longer use it today. So
-
11:00 - 11:04value iteration and policy iteration are different algorithms.
-
11:04 - 11:06In policy iteration
-
11:06 - 11:08in this
-
11:08 - 11:11step, you're given a fixed policy,
-
11:11 - 11:14and you're going to solve for the value function for that policy
-
11:15 - 11:21and so you're given some fixed policy ?, meaning
-
11:21 - 11:27some function mapping from the state's actions. So give you some policy
-
11:29 - 11:35and
-
11:35 - 11:39whatever. That's just some policy; it's not a great policy.
-
11:39 - 11:41And
-
11:41 - 11:45in that step that I circled, we have to find the ? of S
-
12:02 - 12:05which means that for every state you need to compute
-
12:05 - 12:08your expected sum of discounted rewards or if you execute
-
12:08 - 12:11this specific policy
-
12:11 - 12:16and starting off the MDP in that state S.
-
12:16 - 12:19So I showed this last time. I won't go into details today, so I said last
-
12:19 - 12:20time that
-
12:20 - 12:24you can actually solve for Vp by solving a linear system of
-
12:24 - 12:25equations.
-
12:25 - 12:27There was a form of Bellman's equations
-
12:27 - 12:29for Vp,
-
12:29 - 12:32and it turned out to be, if you write this out, you end up with a
-
12:32 - 12:36linear system of 11 equations of 11 unknowns and so you
-
12:36 - 12:39can actually solve for the value function
-
12:39 - 12:40for a fixed policy
-
12:40 - 12:42by solving like a system of
-
12:42 - 12:46linear equations with 11 variables and 11 constraints,
-
12:46 - 12:49and so
-
12:49 - 12:52that's policy iteration;
-
12:52 - 12:55whereas, in value iteration,
-
12:55 - 12:57going back on board,
-
12:57 - 13:01in value iteration you sort of repeatedly perform this update where you
-
13:01 - 13:04update the value of a state as the [inaudible].
-
13:04 - 13:11So I hope that makes sense that the algorithm of these is different. Student:[Inaudible] on the atomic kits
-
13:13 - 13:13so is the assumption
-
13:13 - 13:16that we can never get out of
-
13:16 - 13:18those states? Instructor
-
13:18 - 13:22(Andrew Ng):Yes. There's always things that you where you solve for this [inaudible], for example, and make the numbers
-
13:22 - 13:26come up nicely, but I don't wanna spend too much time on them, but yeah, so the
-
13:26 - 13:27assumption is that
-
13:27 - 13:30once you enter the absorbing state, then the world ends or there're no more
-
13:30 - 13:33rewards after that and
-
13:33 - 13:34you can think of
-
13:34 - 13:37another way to think of the absorbing states which is sort of
-
13:37 - 13:38mathematically equivalent.
-
13:38 - 13:41You can think of the absorbing states as
-
13:41 - 13:43transitioning with probability 1
-
13:43 - 13:46to sum 12 state, and then once you're in that
-
13:46 - 13:4912th state, you always
-
13:49 - 13:52remain in that 12th state, and there're no further rewards from there. If you
-
13:52 - 13:54want, you can think of this
-
13:54 - 13:57as actually an MDP with 12 states rather than 11 states,
-
13:57 - 13:59and the 12th state
-
13:59 - 14:05is this zero cost absorbing state that you get stuck in forever. Other
-
14:05 - 14:12questions? Yeah,
-
14:20 - 14:26please go. Student:Where did the Bellman's equations [inaudible] to optimal value [inaudible]? Instructor (Andrew Ng):Boy, yeah. Okay, this Bellman's equations, this equation that I'm pointing to, I sort
-
14:26 - 14:27of
-
14:27 - 14:31tried to give it justification for this last time.
-
14:31 - 14:35I'll
-
14:35 - 14:38say it in
-
14:38 - 14:42one sentence so that's that the expected total payoff I get, I
-
14:42 - 14:47expect to get something from the state as is equal to my immediate reward which is
-
14:47 - 14:50the reward I get for starting a state. Let's see. If I sum
-
14:50 - 14:54the state, I'm gonna get some first reward and then I can transition to
-
14:54 - 14:55some other state, and
-
14:55 - 14:59then from that other state, I'll get some additional rewards from then.
-
14:59 - 15:03So Bellman's equations breaks that sum into two pieces. It says the value of a
-
15:03 - 15:05state is equal to
-
15:05 - 15:07the reward you get right away
-
15:07 - 15:11is really, well.
-
15:11 - 15:17V*(s) is really equal to
-
15:31 - 15:33+G, so this is
-
15:36 - 15:40V into two terms and says that there's this first term which is the immediate reward, that, and then +G(the
-
15:40 - 15:42rewards
-
15:44 - 15:46you get in the future)
-
15:46 - 15:50which it turns out to be equal to that
-
15:50 - 15:52second row. I spent more time
-
15:52 - 15:55justifying this in the previous lecture,
-
15:55 - 16:00although yeah, hopefully, for the purposes of this lecture, if you're not sure where this is came, if
-
16:00 - 16:02you don't remember the justification of that, why don't you just
-
16:02 - 16:03maybe take my word for
-
16:03 - 16:06that this equation holds true
-
16:06 - 16:08since I use it a little bit later as well,
-
16:08 - 16:11and then the lecture notes sort of
-
16:11 - 16:16explain a little further the justification for why this equation might hold
-
16:16 - 16:17true. But for now,
-
16:17 - 16:21yeah, just for now take my word for it that this holds true cause we'll use it a little bit later
-
16:21 - 16:28today as well.
-
16:41 - 16:45[Inaudible] and would it be in sort of turn back into [inaudible]. Actually, [inaudible] right question is if in policy iteration if we represent
-
16:45 - 16:47? implicitly,
-
16:48 - 16:53using V(s), would it become equivalent to valuation,
-
16:53 - 16:58and the answer is sort of no.
-
16:58 - 17:02Let's see. It's true that policy iteration and value iteration are
-
17:02 - 17:07closely related algorithms, and there's actually a continuum between them, but
-
17:07 - 17:13yeah, it actually turns out that,
-
17:13 - 17:15oh, no,
-
17:22 - 17:24the
-
17:25 - 17:29algorithms are not equivalent. It's just in policy iteration,
-
17:29 - 17:33there is a step where you're solving for the value function for the policy vehicle is V,
-
17:33 - 17:35solve for Vp. Usually,
-
17:35 - 17:41you can do this, for instance, by solving a linear system of equations. In value
-
17:41 - 17:41iteration,
-
17:41 - 17:44it is a different
-
17:44 - 17:51algorithm, yes. I hope it makes sense that at least cosmetically it's different.
-
17:56 - 17:58[Inaudible]
-
17:58 - 18:02you have [inaudible] representing ? implicitly, then you won't have
-
18:02 - 18:07to solve that to [inaudible] equations. Yeah, the problem is - let's see. To solve for Vp, this works only if you have a fixed policy, so once you
-
18:07 - 18:09change a value function,
-
18:09 - 18:10if ? changes as well,
-
18:10 - 18:16then it's sort of hard to solve this.
-
18:16 - 18:19Yeah, so later on we'll actually talk about some examples of when ? is implicitly
-
18:19 - 18:20represented
-
18:20 - 18:24but at
-
18:24 - 18:26least for now it's I think there's " yeah.
-
18:26 - 18:30Maybe there's a way to redefine something, see a mapping onto value iteration but that's not
-
18:30 - 18:33usually done. These are
-
18:33 - 18:37viewed as different algorithms.
-
18:37 - 18:40Okay, cool, so all good questions.
-
18:40 - 18:43Let me move on and talk about how
-
18:43 - 18:50to generalize these ideas to continuous states.
-
18:59 - 19:01Everything we've done so far
-
19:01 - 19:03has been for
-
19:03 - 19:05discrete states or finite-state
-
19:05 - 19:10MDPs. Where, for example, here we had an MDP with a finite set of 11
-
19:10 - 19:11states
-
19:11 - 19:13and so
-
19:13 - 19:17the value function or V(s) or our estimate for the value function,
-
19:17 - 19:17V(s),
-
19:17 - 19:22could then be represented using an array of 11 numbers cause if you have 11 states, the
-
19:22 - 19:24value function
-
19:24 - 19:26needs to assign a real number
-
19:26 - 19:29to each of the 11 states and so to represent V(s)
-
19:29 - 19:32using an array of 11
-
19:32 - 19:33numbers.
-
19:33 - 19:37What I want to do for [inaudible] today is talk about
-
19:37 - 19:38continuous
-
19:38 - 19:42states, so for example,
-
19:42 - 19:45if you want to control
-
19:45 - 19:48any of the number of real [inaudible], so for example, if you want
-
19:48 - 19:51to control a car,
-
19:51 - 19:53
-
19:53 - 19:55a car
-
19:55 - 20:00is positioned given by XYT, as position
-
20:00 - 20:05and orientation and if you want to Markov the velocity as well, then Xdot, Ydot, Tdot,
-
20:05 - 20:05so these
-
20:05 - 20:07are so depending on
-
20:07 - 20:09whether you want to model the kinematics and
-
20:09 - 20:12so just position, or whether you want to model the dynamics, meaning the velocity as
-
20:12 - 20:15well.
-
20:15 - 20:16Earlier I showed you
-
20:16 - 20:20video of a helicopter
-
20:20 - 20:23that was flying, using a rain forest we're learning algorithms, so the
-
20:23 - 20:25helicopter which can
-
20:25 - 20:28fly in three-dimensional space rather than just drive on the 2-D plane, the
-
20:28 - 20:33state will be given by XYZ position,
-
20:35 - 20:38FT?, which is
-
20:38 - 20:43?[inaudible].
-
20:43 - 20:45The FT? is
-
20:45 - 20:49sometimes used to note the P[inaudible]
-
20:49 - 20:50of the helicopter,
-
20:50 - 20:52just orientation,
-
20:52 - 20:55and
-
20:59 - 21:03if you want to control a helicopter, you pretty much have to model
-
21:03 - 21:06velocity as well which means both linear velocity as well as angular
-
21:06 - 21:10velocity, and so this would be a 12-dimensional state.
-
21:10 - 21:13If you want an example that
-
21:13 - 21:14is
-
21:14 - 21:17kind of fun but unusual
-
21:17 - 21:20is,
-
21:20 - 21:25and I'm just gonna use this as an example
-
21:25 - 21:28and actually use this little bit example in today's lecture
-
21:28 - 21:31is the inverted pendulum problem which is
-
21:31 - 21:36sort of a long-running classic in reinforcement learning
-
21:36 - 21:39in which imagine that you
-
21:39 - 21:43have a little cart that's on a rail. The rail
-
21:43 - 21:45ends at some point
-
21:45 - 21:48and if you imagine that you have a pole
-
21:48 - 21:52attached to the cart, and this is a free hinge and so
-
21:52 - 21:54the pole here can rotate freely,
-
21:54 - 21:57and your goal is to control the cart and
-
21:57 - 21:59to move it back and forth on this rail
-
21:59 - 22:01so as to keep the pole balanced.
-
22:01 - 22:03Yeah,
-
22:03 - 22:05there's no long pole
-
22:05 - 22:05in
-
22:05 - 22:07this class but you know what I
-
22:07 - 22:08mean, so you
-
22:08 - 22:15can imagine. Oh, is there a long pole here?
-
22:15 - 22:17Back in the corner. Oh, thanks. Cool. So I did not practice this
-
22:17 - 22:20but you can take a long pole and sort of hold it up, balance,
-
22:20 - 22:25so imagine that you can do it better than I can. Imagine these are [inaudible] just moving
-
22:25 - 22:28back and forth to try to keep the pole balanced, so
-
22:28 - 22:31you can actually us the reinforcement learning algorithm to do that.
-
22:31 - 22:34This is actually one of the longstanding classic problems that
-
22:34 - 22:35people [inaudible]
-
22:35 - 22:38implement and play off using reinforcement learning algorithms,
-
22:38 - 22:41and so for this, the
-
22:41 - 22:42states would be X and
-
22:42 - 22:44T, so X
-
22:44 - 22:50would be the position of the cart,
-
22:50 - 22:56and T would be the orientation of the pole
-
22:56 - 23:00and also the linear velocity and the angular velocity of
-
23:00 - 23:01
-
23:01 - 23:08the pole, so I'll
-
23:12 - 23:15actually use this example a
-
23:15 - 23:19couple times. So to read continuous state space,
-
23:19 - 23:23how can you apply an algorithm like value iteration and policy iteration to solve the
-
23:23 - 23:24MDP to control
-
23:24 - 23:28like the car or a helicopter or something like the inverted
-
23:28 - 23:29pendulum?
-
23:29 - 23:32So one thing you can do and
-
23:32 - 23:35this is maybe the most straightforward thing is,
-
23:35 - 23:38if you have say a two-dimensional
-
23:38 - 23:42continuous state space, a S-1 and S-2 are my state variables, and in all the
-
23:42 - 23:44examples there
-
23:44 - 23:48are I guess between 4dimensional to 12-dimensional. I'll just draw 2-D here.
-
23:48 - 23:54The most straightforward thing to do would be to take the continuous state space and discretize
-
23:54 - 23:56it
-
23:56 - 24:03into a number of discrete cells.
-
24:07 - 24:12And I use S-bar to denote they're discretized or they're discrete
-
24:12 - 24:13states,
-
24:13 - 24:17and so you can [inaudible] with this continuous state problem with a finite or
-
24:17 - 24:18discrete set of states
-
24:18 - 24:19and then
-
24:19 - 24:23you can use policy iteration or value iteration
-
24:23 - 24:29to solve for
-
24:32 - 24:36V(s)-bar.
-
24:36 - 24:37And
-
24:37 - 24:40if you're robot is then in some state
-
24:40 - 24:42given by that dot,
-
24:42 - 24:46you would then figure out what discretized state it is in. In this case it's in,
-
24:46 - 24:48this discretized dygrid cell
-
24:48 - 24:49that's called
-
24:49 - 24:51S-bar,
-
24:51 - 24:53and then you execute.
-
24:53 - 24:56You choose the policy. You choose the action
-
24:56 - 24:59given by applied to that discrete
-
24:59 - 25:01state, so discretization is maybe the
-
25:01 - 25:07most straightforward way to turn a continuous state problem into a discrete state problem. Sometimes
-
25:07 - 25:10you can sorta make this work but a couple of reasons why
-
25:10 - 25:13this does not work very well.
-
25:13 - 25:15One reason is the following,
-
25:15 - 25:18and
-
25:18 - 25:21for this picture, let's even temporarily put aside reinforcement
-
25:21 - 25:23learning. Let's just
-
25:23 - 25:27think about doing regression for now and so
-
25:27 - 25:30suppose you have some invariable X
-
25:30 - 25:36and suppose I have some data,
-
25:36 - 25:38and I want to fill a function.
-
25:38 - 25:40Y is the function of X,
-
25:40 - 25:41so
-
25:41 - 25:45discretization is saying that I'm going to take my
-
25:45 - 25:48horizontal Xs and chop it up into a number of
-
25:48 - 25:49intervals.
-
25:49 - 25:52Sometimes I call these intervals buckets as well.
-
25:52 - 25:56We chop my horizontal Xs up into a number of buckets
-
25:56 - 25:59and then we're approximate this function
-
25:59 - 26:04using something that's piecewise constant
-
26:04 - 26:05in each of these buckets.
-
26:05 - 26:09And just look at this. This is clearly not a very good representation, right, and when we
-
26:09 - 26:10talk about regression,
-
26:10 - 26:14you just choose some features of X
-
26:14 - 26:18and run linear regression or something. You get a much better fit to the function.
-
26:18 - 26:22And so the sense that discretization just isn't a very good source
-
26:22 - 26:26of piecewise constant functions. This just isn't a very good function
-
26:26 - 26:28for representing many things,
-
26:28 - 26:32and there's also the sense that
-
26:32 - 26:36there's no smoothing or there's no generalization across the different buckets.
-
26:36 - 26:39And in fact, back in regression, I would never have chosen
-
26:39 - 26:43to do regression using this sort of visualization. It's just really doesn't make sense.
-
26:46 - 26:48And so in the same way,
-
26:48 - 26:51instead of
-
26:51 - 26:54X, V(s), instead of X and
-
26:54 - 26:58some hypothesis function of X, if you have the state here and you're trying to approximate the value
-
26:58 - 26:59function, then
-
26:59 - 27:02you can get discretization to work for many problems but maybe this isn't the best
-
27:02 - 27:03representation to represent
-
27:03 - 27:06a value function.
-
27:09 - 27:13The other problem with discretization and maybe the more serious
-
27:13 - 27:17problem is what's
-
27:17 - 27:21often somewhat fancifully called the curse of dimensionality.
-
27:26 - 27:30And just the observation that
-
27:31 - 27:34if the state space is in
-
27:34 - 27:39RN, and if you discretize each variable into K buckets,
-
27:39 - 27:40so
-
27:40 - 27:42if discretize each
-
27:44 - 27:46variable into K
-
27:46 - 27:48discrete values,
-
27:48 - 27:50then you get
-
27:50 - 27:53on the order of K to the power of N
-
27:55 - 27:56discrete states.
-
27:56 - 27:57In other words,
-
27:57 - 28:00the number of discrete states you end up with
-
28:00 - 28:02grows exponentially
-
28:02 - 28:05in the dimension of the problem,
-
28:05 - 28:06and so
-
28:06 - 28:09for a helicopter with 12-dimensional state
-
28:09 - 28:10space, this would be
-
28:10 - 28:13maybe like 100 to the power of 12, just huge, and it's
-
28:13 - 28:13not
-
28:13 - 28:15feasible.
-
28:15 - 28:19And so discretization doesn't scale well at all with two problems in high-dimensional
-
28:19 - 28:24state spaces,
-
28:24 - 28:25and
-
28:25 - 28:29this observation actually applies more generally than to
-
28:29 - 28:33just robotics and continuous state problems.
-
28:33 - 28:38For example, another fairly well-known applications
-
28:38 - 28:40of reinforcement learning
-
28:40 - 28:45has been to factory automations. If you imagine that you have
-
28:45 - 28:4720 machines sitting in the factory
-
28:47 - 28:50and the machines lie in a assembly line and they all
-
28:50 - 28:53do something to a part on the assembly line, then they route the part onto a
-
28:53 - 28:55different machine. You want to
-
28:55 - 28:59use reinforcement learning algorithms, [inaudible] the order in which the
-
28:59 - 29:03different machines operate on your different things that are flowing through your assembly line and maybe
-
29:03 - 29:05different machines can do different things.
-
29:05 - 29:09So if you have N machines and each machine
-
29:09 - 29:11can be in K states,
-
29:11 - 29:14then if you do this sort of discretization, the total number of states would be K
-
29:14 - 29:15to N as well.
-
29:15 - 29:18If you have N machines
-
29:18 - 29:22and if each machine can be in K states, then again, you can get
-
29:22 - 29:25this huge number of states.
-
29:25 - 29:27Other well-known examples would be
-
29:27 - 29:31if you have a board game is another example.
-
29:31 - 29:34You'd want to use reinforcement learning to play chess.
-
29:34 - 29:36Then if you have N
-
29:36 - 29:39pieces on your board game,
-
29:39 - 29:44you have N pieces on the chessboard and if each piece can be in K positions, then this is a
-
29:44 - 29:46game sort of the curse of dimensionality thing where the
-
29:46 - 29:55number of discrete states you end up with goes exponentially with the number of pieces
-
29:55 - 29:57in your board game.
-
29:57 - 30:01So the curse of dimensionality
-
30:01 - 30:06means that discretization scales poorly to high-dimensional state spaces or
-
30:06 - 30:09at least discrete representations
-
30:09 - 30:14scale poorly to high-dimensional state spaces. In practice, discretization will usually, if you have a 2-dimensional
-
30:14 - 30:17problem, discretization will usually work great.
-
30:17 - 30:21If you have a 3-dimensional problem, you can often get discretization to
-
30:21 - 30:21work
-
30:21 - 30:23not too badly without too much trouble.
-
30:23 - 30:28With a 4-dimensional problem, you can still often get to where that they
-
30:28 - 30:30could be challenging
-
30:33 - 30:39and as you go to higher and higher dimensional state spaces,
-
30:39 - 30:43the odds and [inaudible] that you need to figure around to discretization and do things
-
30:43 - 30:46like non-uniform grids, so for example,
-
30:46 - 30:49what I've
-
30:49 - 30:53drawn for you is an example of a non-uniform discretization
-
30:53 - 30:57where I'm discretizing S-2 much more finally than S-1. If I think the value
-
30:57 - 30:58function
-
30:58 - 30:59is much more sensitive
-
30:59 - 31:03to the value of state variable S-2 than to S-1,
-
31:03 - 31:07and so as you get into higher dimensional state spaces, you may need to manually fiddle
-
31:07 - 31:10with choices like these with no uniform discretizations and so on.
-
31:10 - 31:13But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it
-
31:13 - 31:14work fine.
-
31:14 - 31:18With 4-dimensional problems, you can probably get it to work but it'd be just
-
31:18 - 31:19slightly challenging
-
31:19 - 31:20and
-
31:20 - 31:24you can sometimes by fooling around and being clever, you can often push
-
31:24 - 31:25discretization
-
31:25 - 31:29up to let's say about 6-dimensional problems but
-
31:29 - 31:31with some difficulty
-
31:31 - 31:33and problems higher
-
31:33 - 31:37than 6-dimensional would be extremely difficult to solve with discretization.
-
31:37 - 31:40So that's just rough
-
31:40 - 31:44folk wisdom order of managing problems you think about using for discretization.
-
31:51 - 31:53But what I want to
-
31:53 - 31:56spend most of today talking about is
-
31:56 - 32:00[inaudible] methods that often work much better than discretization
-
32:00 - 32:02and which we will
-
32:02 - 32:05approximate V* directly
-
32:05 - 32:09without resulting to these sort of discretizations.
-
32:09 - 32:12Before I jump to the specific representation let me just spend a few minutes
-
32:12 - 32:15talking about the problem setup
-
32:15 - 32:16then.
-
32:16 - 32:19For today's lecture, I'm going to focus on
-
32:19 - 32:25the problem of continuous states
-
32:25 - 32:28and just to keep things sort of very
-
32:28 - 32:29simple in this lecture,
-
32:29 - 32:34I want view of continuous actions, so I'm gonna see
-
32:36 - 32:38discrete actions
-
32:38 - 32:40A. So it turns out actually that
-
32:40 - 32:44is a critical fact also for many problems, it turns out that the state
-
32:44 - 32:46space is
-
32:46 - 32:47much larger than
-
32:47 - 32:51the states of actions. That just seems to have worked out that way for many problems, so for example,
-
32:53 - 32:58for driving a car the state space is 6-dimensional, so if
-
32:58 - 33:00XY T, Xdot, Ydot, Tdot.
-
33:00 - 33:05Whereas, your action has, you still have two actions. You have forward
-
33:05 - 33:08backward motion and steering the car, so you have
-
33:08 - 33:126-D states and 2-D actions, and so you can discretize the action much more easily than discretize the states.
-
33:12 - 33:15The only
-
33:15 - 33:17examples down for a helicopter you've 12-D states in a
-
33:17 - 33:194-dimensional action
-
33:19 - 33:20it turns out, and
-
33:20 - 33:24it's also often much easier to just discretize a continuous
-
33:24 - 33:27actions into a discrete sum of actions.
-
33:27 - 33:28
-
33:28 - 33:32And for the inverted pendulum, you have a 4-D state and a 1-D action.
-
33:32 - 33:33Whether you accelerate
-
33:33 - 33:38your cart to the left or the right is one D action
-
33:38 - 33:42and so for the rest of today, I'm gonna assume a continuous state
-
33:42 - 33:45but I'll assume that maybe you've already discretized your actions,
-
33:45 - 33:48just because in practice it turns out that
-
33:48 - 33:53not for all problems, with many problems large actions
-
33:53 - 33:54is just less of a
-
33:54 - 33:58difficulty than large state spaces.
-
33:58 - 34:00So I'm
-
34:00 - 34:01going to assume
-
34:01 - 34:06that we have
-
34:06 - 34:09a model
-
34:09 - 34:15or simulator of the
-
34:15 - 34:16MDP, and so
-
34:16 - 34:18this is really
-
34:18 - 34:23an assumption on how the state transition probabilities are represented.
-
34:24 - 34:25I'm gonna assume
-
34:25 - 34:28and I'm going to use the terms "model" and "simulator"
-
34:28 - 34:30pretty much synonymously,
-
34:30 - 34:32so
-
34:32 - 34:39specifically, what I'm going to assume is that we have a black box and a
-
34:41 - 34:42piece of code,
-
34:42 - 34:45so that I can input any state, input an action
-
34:45 - 34:46and it will output
-
34:46 - 34:49S
-
34:49 - 34:51prime,
-
34:51 - 34:54sample from the state transition distribution. Says that
-
34:54 - 34:55this is really
-
34:55 - 34:57my assumption
-
34:57 - 35:01on the representation I have for the state transition probabilities,
-
35:01 - 35:05so I'll assume I have a box that read
-
35:05 - 35:07take us in for the stated action
-
35:07 - 35:10and output in mixed state.
-
35:10 - 35:17And so
-
35:17 - 35:20since they're fairly common ways to get models of
-
35:20 - 35:22different MDPs you may work with,
-
35:22 - 35:24one is
-
35:24 - 35:31you might get a model from a physics
-
35:31 - 35:33simulator. So for example,
-
35:33 - 35:37if you're interested in controlling that inverted pendulum,
-
35:37 - 35:39so your action is
-
35:39 - 35:41A which is
-
35:41 - 35:45the magnitude of the force you exert on the cart to
-
35:45 - 35:46
-
35:46 - 35:48left
-
35:48 - 35:49or right, and your
-
35:49 - 35:56state is Xdot, T, Tdot. I'm just gonna write that in that order.
-
35:58 - 36:01And so I'm gonna
-
36:01 - 36:04write down a bunch of equations just for completeness but
-
36:04 - 36:09everything I'm gonna write below here is most of what I wanna write is a bit
-
36:09 - 36:13gratuitous, but so since I'll maybe flip open a textbook on physics, a
-
36:13 - 36:14textbook on mechanics,
-
36:14 - 36:17you can work out the equations of motion of a
-
36:17 - 36:19physical device like this, so you find that
-
36:23 - 36:26Sdot. The dot denotes derivative, so the derivative of the state with respect to time
-
36:26 - 36:28is given by
-
36:29 - 36:32Xdot, ?-L(B)
-
36:32 - 36:34cost B
-
36:34 - 36:36over
-
36:37 - 36:39M
-
36:39 - 36:46Tdot,
-
36:52 - 36:53B.
-
36:53 - 36:58And so on where A is the force is the action that you exert on the cart. L is
-
36:58 - 37:00the length of the pole.
-
37:00 - 37:04M is the total mass of the system and so on. So all these equations
-
37:04 - 37:06are good uses, just writing
-
37:06 - 37:09them down for completeness, but by
-
37:09 - 37:12flipping over, open like a physics textbook, you can work out these equations
-
37:12 - 37:14and notions yourself
-
37:14 - 37:15and this
-
37:15 - 37:17then gives you a model which
-
37:17 - 37:18can say that S-2+1.
-
37:18 - 37:22You're still one time step later
-
37:22 - 37:27will be equal to your previous state
-
37:27 - 37:32plus [inaudible], so in your simulator or in my model
-
37:32 - 37:35what happens to the cart every
-
37:35 - 37:3910th of a second, so ? T
-
37:39 - 37:43would be within one second
-
37:43 - 37:50and then so plus ? T times
-
38:03 - 38:06that. And so that'd be one way to come up with a model of
-
38:06 - 38:08your MDP.
-
38:08 - 38:12And in this specific example, I've actually written down deterministic model because
-
38:13 - 38:15and by deterministic I mean that
-
38:15 - 38:18given an action in a state, the next state
-
38:18 - 38:20is not random,
-
38:20 - 38:23so would be an example of a deterministic model
-
38:23 - 38:26where I can compute the next state exactly
-
38:26 - 38:30as a function of the previous state and the previous action
-
38:30 - 38:33or it's a deterministic model because all the probability
-
38:33 - 38:35mass is on a single state
-
38:35 - 38:37given the previous stated action.
-
38:37 - 38:44You can also make this a stochastic model.
-
38:59 - 39:06A second way that is often used to attain a model is to learn one.
-
39:08 - 39:13And so again, just concretely what you do is you would
-
39:13 - 39:16imagine that you have a physical
-
39:16 - 39:17inverted pendulum system
-
39:17 - 39:21as you physically own an inverted pendulum robot.
-
39:21 - 39:24What you would do is you would then
-
39:24 - 39:27initialize your inverted pendulum robot to some state
-
39:27 - 39:32and then execute some policy, could be some random policy or some policy that you think is pretty
-
39:32 - 39:36good, or you could even try controlling yourself with a joystick or something.
-
39:36 - 39:37But so you set
-
39:37 - 39:41the system off in some state as zero.
-
39:41 - 39:44Then you take some action.
-
39:44 - 39:48Here's zero and the game could be chosen by some policy or chosen by you using a joystick tryina
-
39:48 - 39:51control your inverted pendulum or whatever.
-
39:51 - 39:55System would transition to some new state, S-1, and then you
-
39:55 - 39:57take some new action, A-1
-
39:57 - 40:00and so on. Let's say you do
-
40:00 - 40:02this
-
40:02 - 40:05for two time steps
-
40:05 - 40:06and
-
40:06 - 40:09sometimes I call this one trajectory and
-
40:09 - 40:10you repeat this
-
40:10 - 40:13M times, so
-
40:13 - 40:17this is the first trial of the first trajectory,
-
40:17 - 40:24and then you do this again. Initialize it in some
-
40:35 - 40:37and so on.
-
40:37 - 40:40So you do this a bunch of times
-
40:40 - 40:44and then you would run the learning algorithm
-
40:44 - 40:49to
-
40:49 - 40:53estimate ST+1
-
40:53 - 40:57as a function
-
40:57 - 41:02of
-
41:02 - 41:05ST and
-
41:05 - 41:08AT. And for sake of completeness, you should just think of this
-
41:08 - 41:10as inverted pendulum problem,
-
41:10 - 41:11so ST+1 is
-
41:11 - 41:13a 4-dimensional vector. ST, AT
-
41:13 - 41:17will be a 4-dimensional vector and that'll be a real number,
-
41:17 - 41:18and so you might
-
41:18 - 41:21run linear regression 4 times to predict each of these state variables as
-
41:21 - 41:23a function
-
41:23 - 41:25of each of these
-
41:25 - 41:275 real numbers
-
41:27 - 41:30and so on.
-
41:30 - 41:33Just for example, if you
-
41:33 - 41:37say that if you want to estimate your next state ST+1 as a linear
-
41:37 - 41:41function
-
41:41 - 41:46of your previous state in your action and so
-
41:46 - 41:51A here will be a 4 by 4 matrix,
-
41:51 - 41:54and B would be a 4-dimensional vector,
-
41:54 - 42:00then
-
42:00 - 42:07you would choose the values of A and B that minimize this.
-
42:32 - 42:35So if you want your model to be that
-
42:35 - 42:39ST+1 is some linear function of the previous stated action,
-
42:39 - 42:40then you might
-
42:40 - 42:42pose this optimization objective
-
42:42 - 42:45and choose A and B to minimize the sum of squares error
-
42:45 - 42:52in your predictive value for ST+1 as the linear function of ST
-
42:52 - 42:57and AT. I
-
42:57 - 42:58should say that
-
42:58 - 43:03this is one specific example where you're using a linear function of the previous
-
43:03 - 43:06stated action to predict the next state. Of course, you can also use other algorithms
-
43:06 - 43:08like low [inaudible] weight to linear regression
-
43:08 - 43:12or linear regression with nonlinear features or kernel linear
-
43:12 - 43:13regression or whatever
-
43:13 - 43:16to predict the next state as a nonlinear function of the current state as well, so this
-
43:16 - 43:23is just [inaudible] linear problems.
-
43:25 - 43:28And it turns out that low [inaudible] weight to linear
-
43:28 - 43:29regression is
-
43:29 - 43:32for many robots turns out to be an effective method for this learning
-
43:32 - 43:39problem as well.
-
43:47 - 43:52And so having learned to model, having learned the parameters A
-
43:52 - 43:53and B,
-
43:53 - 43:59you then have a model where you say that ST+1 is AST
-
43:59 - 44:05plus BAT, and so that would be an example of a deterministic
-
44:05 - 44:06model
-
44:08 - 44:11or having learned the parameters A and B,
-
44:11 - 44:13you might say that ST+1 is
-
44:13 - 44:15equal to AST +
-
44:15 - 44:17BAT
-
44:24 - 44:26+
-
44:26 - 44:28?T.
-
44:28 - 44:30And so these would be
-
44:30 - 44:33very reasonable ways to come up with either
-
44:33 - 44:37a deterministic or a stochastic model for your
-
44:37 - 44:39inverted pendulum MDP.
-
44:39 - 44:42And so
-
44:42 - 44:44just to summarize,
-
44:44 - 44:48what we have now is a model, meaning a piece of code,
-
44:48 - 44:53where you can input a state and an action
-
44:53 - 44:55and get an ST+1.
-
44:55 - 44:58And so if you have a stochastic model,
-
44:58 - 45:02then to influence this model, you would actually sample ?T from this [inaudible]
-
45:02 - 45:03distribution
-
45:03 - 45:10in order to generate ST+1. So it
-
45:13 - 45:18actually turns out that in a
-
45:18 - 45:20preview, I guess, in the next lecture
-
45:20 - 45:25it actually turns out that in the specific case of linear dynamical systems, in the specific
-
45:25 - 45:29case where the next state is a linear function of the current stated action, it
-
45:29 - 45:32actually turns out that there're very powerful algorithms you can use.
-
45:32 - 45:36So I'm actually not gonna talk about that today. I'll talk about that in the next lecture rather than
-
45:36 - 45:37right now
-
45:39 - 45:43but turns out that for many problems of inverted pendulum go if you use low [inaudible] weights
-
45:43 - 45:46and linear regressionals and long linear
-
45:46 - 45:48algorithm 'cause many systems aren't really
-
45:48 - 45:54linear. You can build a nonlinear model.
-
45:54 - 45:57So what I wanna do now is talk about
-
45:57 - 45:57given
-
45:57 - 46:01a model, given a simulator for your
-
46:01 - 46:03MDP, how to come up with an algorithm
-
46:03 - 46:07to approximate the alpha value function piece.
-
46:07 - 46:14Before I move on, let me check if there're questions on this. Okay,
-
46:38 - 46:43cool. So here's
-
46:43 - 46:44the idea.
-
46:44 - 46:47Back when we talked about linear regression, we said that
-
46:47 - 46:52given some inputs X in supervised learning, given the input feature is X, we
-
46:52 - 46:55may choose some features of X
-
46:55 - 46:57and then
-
46:57 - 46:59approximate the type of variable
-
46:59 - 47:02as a linear function of various features of X,
-
47:02 - 47:05and so just do exactly the same thing to
-
47:05 - 47:06approximate
-
47:06 - 47:09the optimal value function,
-
47:09 - 47:11and in particular,
-
47:11 - 47:16we'll choose some features
-
47:16 - 47:195-S of a
-
47:19 - 47:21state
-
47:21 - 47:26S. And so you could actually choose
-
47:26 - 47:305-S equals S. That would be one reasonable choice, if you want to approximate the value
-
47:30 - 47:30function
-
47:30 - 47:33as a linear function of the states, but you can
-
47:33 - 47:34also
-
47:34 - 47:37choose other things, so for example,
-
47:37 - 47:41for the inverted pendulum example, you may choose 5-S
-
47:41 - 47:45to be equal to a vector of features that may be [inaudible]1 or you may have
-
47:48 - 47:52Xdot2,
-
47:52 - 47:52Xdot
-
47:52 - 47:55maybe some cross terms,
-
47:55 - 47:57maybe times
-
47:57 - 47:59X, maybe dot2
-
47:59 - 48:04and so on. So
-
48:04 - 48:07you choose some vector or features
-
48:07 - 48:14and then approximate
-
48:17 - 48:22the value function as the value of the state as is equal to data
-
48:22 - 48:24transfers
-
48:24 - 48:28times the features. And I should apologize in advance; I'm overloading notation here.
-
48:28 - 48:30It's unfortunate. I
-
48:30 - 48:34use data both to denote the angle of the cart of the pole inverted
-
48:34 - 48:38pendulum. So this is known as
-
48:38 - 48:39the angle T
-
48:39 - 48:43but also using T to denote the vector of parameters in my [inaudible] algorithm.
-
48:43 - 48:44So sorry
-
48:44 - 48:50about the overloading notation.
-
48:50 - 48:52
-
48:52 - 48:55Just like we did in linear regression, my goal is to come up
-
48:55 - 48:57with
-
48:57 - 49:00a linear combination of the features that gives me a good approximation
-
49:00 - 49:02to the value function
-
49:02 - 49:08and this is completely analogous to when we said that in linear regression
-
49:08 - 49:10our estimate,
-
49:13 - 49:18my response there but Y as a linear function a feature is at the
-
49:18 - 49:23input. That's what we have
-
49:23 - 49:30in
-
49:40 - 49:47linear regression. Let
-
49:47 - 49:51me just write
-
49:51 - 49:56down value iteration again and then I'll written down an approximation to value
-
49:56 - 49:57iteration, so
-
49:57 - 50:01for discrete states,
-
50:01 - 50:04this is the idea behind value iteration and we said that
-
50:04 - 50:09V(s) will be
-
50:09 - 50:12updated
-
50:12 - 50:17as R(s) + G
-
50:17 - 50:20[inaudible]. That was value iteration
-
50:20 - 50:24and in the case of continuous states, this would be the replaced by an [inaudible], an
-
50:24 - 50:31[inaudible] over states rather via sum over states.
-
50:31 - 50:34Let me just write this
-
50:34 - 50:36as
-
50:39 - 50:43R(s) + G([inaudible]) and then that sum over T's prime.
-
50:43 - 50:45That's really an expectation
-
50:45 - 50:47with respect to random state as
-
50:47 - 50:54prime drawn from the state transition probabilities piece SA
-
50:55 - 50:56of V(s)
-
50:56 - 51:00prime. So this is a sum of all states S prime with the
-
51:00 - 51:02probability of going to S prime (value),
-
51:02 - 51:05so that's really an expectation over the random state S prime flowing from PSA of
-
51:05 - 51:09that.
-
51:09 - 51:12And so what I'll do now is
-
51:12 - 51:15write down an algorithm called
-
51:15 - 51:22fitted value iteration
-
51:25 - 51:28that's in
-
51:28 - 51:31approximation to this
-
51:31 - 51:34but specifically for continuous states.
-
51:34 - 51:39I just wrote down the first two steps, and then I'll continue on the next board, so the
-
51:39 - 51:43first step of the algorithm is we'll sample.
-
51:43 - 51:45Choose some set of states
-
51:45 - 51:52at
-
51:53 - 51:55random.
-
51:55 - 52:00So sample S-1, S-2 through S-M randomly so choose a
-
52:00 - 52:02set of states randomly
-
52:02 - 52:08and
-
52:08 - 52:12initialize my parameter vector to be equal to zero. This is
-
52:12 - 52:16analogous to in value iteration where I
-
52:16 - 52:21might initialize the value function to be the function of all zeros. Then here's the
-
52:21 - 52:28end view for the algorithm.
-
52:42 - 52:44Got
-
52:44 - 52:51quite a lot to
-
52:59 - 53:06write
-
55:05 - 55:12actually.
-
55:29 - 55:30Let's
-
55:30 - 55:37see.
-
56:10 - 56:12And so
-
56:12 - 56:18that's the algorithm.
-
56:18 - 56:20Let me just adjust the writing. Give me a second. Give me a minute
-
56:20 - 56:27to finish and then I'll step through this.
-
56:27 - 56:28Actually, if some of my
-
56:28 - 56:35handwriting is eligible, let me know. So
-
56:58 - 57:02let me step through this and so briefly explain the rationale.
-
57:02 - 57:08So the hear of the algorithm is - let's see.
-
57:08 - 57:11In the original value iteration algorithm,
-
57:11 - 57:13we would take the value for each state,
-
57:13 - 57:15V(s)I,
-
57:15 - 57:18and we will overwrite it with this
-
57:18 - 57:22expression here. In the original, this discrete value iteration algorithm
-
57:22 - 57:23was to V(s)I
-
57:23 - 57:24and we will
-
57:24 - 57:26set V(s)I
-
57:26 - 57:29to be equal to that, I think.
-
57:29 - 57:33Now we have in the continuous state case, we have an infinite continuous set of
-
57:33 - 57:34states and so
-
57:34 - 57:38you can't discretely set the value of each of these to that.
-
57:38 - 57:43So what we'll do instead is choose the parameters T
-
57:43 - 57:47so that V(s)I is as close as possible to
-
57:47 - 57:50this thing on the right hand side
-
57:50 - 57:55instead. And this is what YI turns out to be.
-
57:55 - 57:58So completely, what I'm going to do is I'm going to construct
-
57:58 - 58:01estimates of this term,
-
58:01 - 58:04and then I'm going to choose the parameters of my function approximator. I'm
-
58:04 - 58:08gonna choose my parameter as T, so that V(s)I is as close as possible to
-
58:08 - 58:11these. That's what YI is,
-
58:11 - 58:12and specifically,
-
58:12 - 58:15what I'm going to do is I'll choose parameters data
-
58:15 - 58:18to minimize the sum of square differences between T [inaudible] plus
-
58:18 - 58:205SI.
-
58:20 - 58:22This thing here
-
58:22 - 58:24is just V(s)I because I'm approximating V(s)I
-
58:24 - 58:26is a linear function of 5SI
-
58:26 - 58:28and so choose
-
58:28 - 58:32the parameters data to minimize the sum of square differences.
-
58:32 - 58:35So this is last step is basically
-
58:35 - 58:39the approximation version of value
-
58:39 - 58:41iteration. What everything else above
-
58:41 - 58:42was doing was just
-
58:42 - 58:45coming up with an approximation
-
58:45 - 58:46to this term, to
-
58:46 - 58:47this thing here
-
58:47 - 58:51and which I was calling YI.
-
58:51 - 58:54And so confluently, for every state SI we want to
-
58:54 - 58:57estimate what the thing on the right hand side is
-
58:57 - 58:58and but
-
58:58 - 59:02there's an expectation here. There's an expectation over a continuous set of states,
-
59:02 - 59:07may be a very high dimensional state so I can't compute this expectation exactly.
-
59:07 - 59:10What I'll do instead is I'll use my simulator to sample
-
59:10 - 59:14a set of states from this distribution from this P substrip,
-
59:14 - 59:17SIA, from the state transition distribution
-
59:17 - 59:21of where I get to if I take the action A in the state as I,
-
59:21 - 59:24and then I'll average over that sample of states
-
59:24 - 59:27to compute this expectation.
-
59:27 - 59:31And so stepping through the algorithm just says that for
-
59:31 - 59:32each
-
59:32 - 59:36state and for each action, I'm going to sample a set of states. This S prime 1 through S
-
59:36 - 59:41prime K from that state transition distribution, still using the model, and
-
59:41 - 59:44then I'll set Q(a) to be equal to that average
-
59:44 - 59:45and so this is my
-
59:45 - 59:48estimate for R(s)I + G(this
-
59:48 - 59:50expected value
-
59:50 - 59:54for that specific action A).
-
59:54 - 59:57Then I'll take the maximum of actions A
-
59:57 - 60:01and this gives me YI, and so YI is for S for that.
-
60:01 - 60:03And finally, I'll run
-
60:03 - 60:07really run linear regression which is that last of the set [inaudible] to get
-
60:07 - 60:08V(s)I
-
60:08 - 60:11to be close to the YIs.
-
60:11 - 60:16And so this algorithm is called fitted value iteration and it actually often
-
60:16 - 60:19works quite well for continuous,
-
60:19 - 60:23for problems with anywhere from
-
60:23 - 60:266- to 10- to 20-dimensional state spaces
-
60:34 - 60:39if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense?
-
60:39 - 60:41Some of you didn't have your hands up.
-
60:41 - 60:47Are there questions for those,
-
60:48 - 60:51yeah? Is there a recess [inaudible] function in this setup?
-
60:51 - 60:56Oh, yes. An MDP comprises
-
60:56 - 60:57SA,
-
60:57 - 61:00the state transition probabilities G
-
61:00 - 61:01and
-
61:01 - 61:02R
-
61:02 - 61:03and so
-
61:03 - 61:08for continuous state spaces, S would be like R4 for the inverted pendulum
-
61:08 - 61:09or something.
-
61:09 - 61:14Actions with discretized state transitions probabilities with specifying with the model
-
61:14 - 61:15or
-
61:15 - 61:17the simulator,
-
61:17 - 61:20G is just a real number like .99
-
61:20 - 61:24and the real function is usually a function that's given to you. And so
-
61:24 - 61:26the reward function
-
61:26 - 61:31is some function of your 4-dimensional state,
-
61:31 - 61:31and
-
61:31 - 61:33for example, you
-
61:33 - 61:40might choose a reward function to be minus " I don't know.
-
61:40 - 61:47Just for an example of simple reward function, if we want a -1 if the pole has fallen
-
61:48 - 61:52and there it depends on you choose your reward function to be -1 if
-
61:52 - 61:52the
-
61:52 - 61:55inverted pendulum falls over
-
61:55 - 61:59and to find that its angle is greater than 30° or something
-
61:59 - 62:02and zero otherwise.
-
62:02 - 62:04So that would be an example
-
62:04 - 62:07of a reward function that you can choose for the inverted pendulum, but yes, assuming a
-
62:07 - 62:10reward function is given to you
-
62:10 - 62:15so that you can compute R(s)I for any state.
-
62:15 - 62:22Are there other questions?
-
62:40 - 62:41Actually, let
-
62:41 - 62:48me try
-
62:50 - 62:57
-
63:24 - 63:28asking a question, so everything I did here assume that we have a stochastic simulator.
-
63:28 - 63:32So it
-
63:32 - 63:35turns out I can simply this algorithm if I have a deterministic simulator,
-
63:35 - 63:37but
-
63:37 - 63:39deterministic simulator is given a
-
63:39 - 63:42stated action, my next state is always exactly determined.
-
63:42 - 63:46So let me ask you, if I have a deterministic simulator,
-
63:46 - 63:53how would I change this algorithm? How would I simplify this algorithm? Lower your samples that you're drawing [inaudible].
-
63:54 - 63:58
-
63:58 - 63:59Right, so
-
63:59 - 64:01Justin's going right. If I have a deterministic simulator,
-
64:01 - 64:05all my samples from those would be exactly the same, and so if I have a
-
64:05 - 64:06deterministic simulator,
-
64:06 - 64:09I can set K to be equal to 1,
-
64:09 - 64:11so I don't need to draw
-
64:11 - 64:16K different samples. I really only need one sample if I have a
-
64:16 - 64:17deterministic simulator,
-
64:17 - 64:22so you can simplify this by setting K=1 if you have a deterministic simulator. Yeah? I guess I'm really
-
64:22 - 64:28confused about the,
-
64:28 - 64:29yeah,
-
64:29 - 64:34we sorta turned this [inaudible] into something that looks like linear
-
64:34 - 64:37state
-
64:37 - 64:40regression or some' you know the data transpose times something that
-
64:40 - 64:44we're used to but I guess I'm a
-
64:44 - 64:45little. I don't know really know what question to ask but like when we did this before we
-
64:45 - 64:48had like discrete states and everything. We
-
64:48 - 64:50were determined with
-
64:50 - 64:53finding this optimal policy
-
64:53 - 64:57and I
-
64:57 - 65:02guess it doesn't look like we haven't said the word policy in a while so kinda
-
65:02 - 65:05difficult. Okay, yeah, so [inaudible] matters back to policy but maybe I should
-
65:05 - 65:07just say a couple words so
-
65:07 - 65:11let me actually try to get at some of what maybe what you're saying.
-
65:11 - 65:14Our strategy for finding optimal policy
-
65:14 - 65:15has been to
-
65:15 - 65:19find some way to find V
-
65:19 - 65:21and
-
65:21 - 65:24some of approximations
-
65:24 - 65:29of ?*. So far everything I've been doing has been focused on how to find V*. I just
-
65:29 - 65:30want
-
65:33 - 65:36to say one more word. It actually turns out that
-
65:36 - 65:41for linear regression it's often very easy. It's often not terribly difficult to
-
65:41 - 65:43choose some resource of the features. Choosing
-
65:43 - 65:47features for approximating the value function is often somewhat
-
65:47 - 65:48harder,
-
65:48 - 65:51so because the value of a state is
-
65:51 - 65:52how good is
-
65:52 - 65:54starting off in this state.
-
65:54 - 65:56What is my
-
65:56 - 66:00expected sum of discounted rewards? What if I start in a certain state?
-
66:00 - 66:00And so
-
66:00 - 66:04what the feature of the state have to measure is really
-
66:04 - 66:08how good is it to start in a certain state?
-
66:08 - 66:12And so for inverted pendulum you actually have that
-
66:12 - 66:16states where the poles are vertical and when a cart that's centered on your
-
66:16 - 66:18track or something, maybe better and so you can
-
66:18 - 66:21come up with features that measure the orientation of the pole and how close
-
66:21 - 66:26you are to the center of the track and so on and those will be reasonable features to use
-
66:26 - 66:28to approximate V*.
-
66:28 - 66:30Although in general it is true that
-
66:30 - 66:34choosing features, the value function approximation, is often slightly trickier
-
66:34 - 66:40than choosing good features for linear regression. Okay and then Justin's
-
66:40 - 66:41questions of
-
66:41 - 66:43so given
-
66:43 - 66:47V*, how do you go back to actually find a policy?
-
66:47 - 66:50In the discrete case, so we
-
66:50 - 66:54have that ?*(s) is equal to all
-
66:54 - 67:01[inaudible] over A of
-
67:04 - 67:07that. So that's again, I used to write this as a sum over states [inaudible]. I'll just write this
-
67:07 - 67:13as an expectation
-
67:13 - 67:16and so then once you find the optimal value
-
67:16 - 67:17function V*,
-
67:17 - 67:19you can then
-
67:19 - 67:20find the optimal policy ?* by computing the
-
67:20 - 67:23[inaudible].
-
67:23 - 67:25So if you're in a continuous state MDP,
-
67:25 - 67:26then
-
67:26 - 67:30you can't actually do this in advance for every single state because there's an
-
67:30 - 67:32infinite number of states
-
67:32 - 67:35and so you can't actually perform this computation in advance to every single
-
67:35 - 67:36state.
-
67:36 - 67:39What you do instead is,
-
67:39 - 67:43whenever your robot is in some specific state S, that is, only when your system
-
67:43 - 67:48is in some specific state S, like your car is at some specific position and orientation, or
-
67:48 - 67:50your inverted pendulum
-
67:50 - 67:53is in some specific position, posed at some
-
67:53 - 67:57specific angle θ, it's
-
67:57 - 67:58only when
-
67:58 - 68:03your system, be it a factory or a board game or a robot,
-
68:03 - 68:05is in some specific state S
-
68:05 - 68:09that you would then go ahead and compute this argmax, so
-
68:09 - 68:10it's only when
-
68:10 - 68:16you're in some state S that you then compute this argmax, and then
-
68:16 - 68:18you
-
68:23 - 68:24then execute that action A,
-
68:24 - 68:25and then,
-
68:25 - 68:29as a result of your action, your robot would transition to some new state, and then,
-
68:29 - 68:34given that specific new state, you compute the
-
68:34 - 68:41argmax again using that specific state S that
-
68:42 - 68:46you're in. There are a few ways to do it. One way to do this is
-
68:46 - 68:50actually the same as in the inner loop of the fitted value iteration algorithm: so
-
68:50 - 68:52because this is an expectation over a large number
-
68:52 - 68:55of states, you'd need to sample
-
68:55 - 68:58some set of states from the simulator and then
-
68:58 - 69:03approximate this expectation using an average over your samples, exactly as in the
-
69:03 - 69:06inner loop of the fitted value iteration algorithm.
-
69:06 - 69:09So you could do that; that's sometimes done. Sometimes it can also be a pain to have to sample
-
69:10 - 69:13a set of states
-
69:13 - 69:16to approximate those expectations every time you want to take an action in your MDP.
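A minimal sketch of that sampling-based action selection, assuming a stochastic simulator simulate(s, a, rng) and a fitted value function V_hat; these names are placeholders for illustration, not the lecture's notation.

    import numpy as np

    def greedy_action_by_sampling(s, actions, simulate, V_hat, k=1000, rng=None):
        # Approximate pi(s) = argmax_a E[V*(s')] by averaging the fitted value
        # function over k next states sampled from the simulator, just as in
        # the inner loop of fitted value iteration.
        rng = rng if rng is not None else np.random.default_rng()
        best_action, best_value = None, -np.inf
        for a in actions:
            value = np.mean([V_hat(simulate(s, a, rng)) for _ in range(k)])
            if value > best_value:
                best_action, best_value = a, value
        return best_action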
-
69:16 - 69:18There are a couple
-
69:18 - 69:23of special cases where this can be done without sampling. One special case is if you
-
69:23 - 69:30have
-
69:32 - 69:36a deterministic simulator. If it's a deterministic simulator, so in other words, if your simulator is
-
69:36 - 69:40just some function f,
-
69:40 - 69:44which could be a linear or a nonlinear function. If it's a deterministic simulator,
-
69:44 - 69:48then the next state, ST+1, is just some function f of your
-
69:48 - 69:50previous state and action.
-
69:50 - 69:52If that's the case, then
-
69:52 - 69:55this expectation, well, then
-
69:55 - 70:07this simplifies to the argmax over A of V*(f(S, A)),
-
70:07 - 70:08because
-
70:08 - 70:11this is really saying
-
70:11 - 70:18S prime = f(S, A). I switched back and forth between notation; I hope that's okay. I use S
-
70:18 - 70:22to denote the current state and S prime to denote the next state, versus ST and ST+1 for the current and
-
70:22 - 70:26next state. Both of these are
-
70:26 - 70:28sort of standard notation, so don't
-
70:28 - 70:31mind my switching back and forth between them. But if it's a
-
70:31 - 70:35deterministic simulator you can then just compute what the next state S prime would be
-
70:35 - 70:38for each action you might take from the current state,
-
70:38 - 70:42and then take the argmax over actions A,
-
70:42 - 70:47basically choose the action that gets you to the highest value
-
70:47 - 70:54state.
-
71:04 - 71:05And
-
71:05 - 71:09so that's one case where you can compute the argmax, and you can compute that
-
71:09 - 71:13expectation without needing to average over a set of samples.
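A minimal sketch of action selection with a deterministic simulator, under the same placeholder names as before:

    def greedy_action_deterministic(s, actions, f, V_hat):
        # With a deterministic simulator s' = f(s, a), each action leads to exactly
        # one next state, so pi(s) = argmax_a V_hat(f(s, a)) needs no sampling.
        # f and V_hat are placeholders for your simulator and fitted value function.
        return max(actions, key=lambda a: V_hat(f(s, a)))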
-
71:13 - 71:17Another very common case, it actually turns out, is if you have a stochastic
-
71:24 - 71:26simulator, but your
-
71:26 - 71:30simulator happens to take on the very specific form
-
71:34 - 71:36ST+1 = f(ST, AT) + εT,
-
71:37 - 71:44where this εT
-
71:44 - 71:48is Gaussian noise. This is a very common way to build simulators, where you model the next
-
71:48 - 71:50state as a function of the current state and action
-
71:50 - 71:52plus some noise, and so
-
71:52 - 71:57one specific example would be the sort of linear dynamical system that
-
71:57 - 72:01we talked about, with the next state a linear function of the current state and action plus
-
72:01 - 72:03Gaussian
-
72:03 - 72:08noise. In this case, you can approximate the argmax over
-
72:08 - 72:15A as follows.
-
72:22 - 72:27In that case you take that
-
72:27 - 72:29expectation that
-
72:29 - 72:33you're trying to approximate, the
-
72:33 - 72:41expected value of V*(S prime), and approximate it by V* of the
-
72:41 - 72:45expected value of
-
72:45 - 72:49S prime, that is, V*(E[S prime]) = V*(f(S, A)), and this is an approximation.
-
72:49 - 72:53The expected value of a function is usually not equal to the function of the
-
72:53 - 72:53expected value, but
-
72:53 - 72:54it is often
-
72:54 - 72:56a reasonable approximation,
-
72:56 - 72:58and so
-
73:02 - 73:05that would be another way to
-
73:05 - 73:06approximate that expectation,
-
73:06 - 73:11and so you choose the actions
-
73:11 - 73:14according to
-
73:14 - 73:21essentially the same formula as I wrote just now.
-
73:22 - 73:24And so this
-
73:24 - 73:25would be a way of
-
73:25 - 73:27approximating
-
73:28 - 73:31this argmax,
-
73:31 - 73:34ignoring the noise in the simulator essentially. And
-
73:34 - 73:38this often works pretty well as well, just because many simulators turn
-
73:38 - 73:39out
-
73:39 - 73:43to be of the form of some linear or some nonlinear function plus zero-mean
-
73:43 - 73:44Gaussian noise,
-
73:44 - 73:46and so you can just ignore the zero-mean
-
73:46 - 73:47Gaussian
-
73:47 - 73:50noise, so that you can compute this quickly.
-
73:50 - 73:54And just to complete this, what
-
73:54 - 73:56that is, right,
-
73:56 - 73:58that V*(f(S, A)),
-
73:58 - 74:02this you
-
74:04 - 74:09would write as θ transpose φ(S prime), where S prime = f(S, A). Great,
-
74:09 - 74:12so this V* you would compute using
-
74:12 - 74:15the parameters θ that you just learned using the
-
74:15 - 74:22fitted value iteration
-
74:22 - 74:23algorithm.
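Putting these pieces together, here is a minimal sketch of the noise-ignoring action selection, assuming a simulator of the form s' = f(s, a) + noise and the linear value function approximation V*(s') ≈ θᵀφ(s') with θ from fitted value iteration; f, phi, and theta are placeholders.

    def greedy_action_mean_dynamics(s, actions, f, phi, theta):
        # Assumes s' = f(s, a) + epsilon with zero-mean Gaussian noise, and
        # V*(s') approximated as theta^T phi(s') with theta learned by fitted
        # value iteration.  Dropping the noise, evaluate each action at the
        # mean next state f(s, a) and pick the best one.
        return max(actions, key=lambda a: theta @ phi(f(s, a)))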
-
74:23 - 74:30Questions about this? Student:[Inaudible] case,
-
74:34 - 74:41for real-time applications is it possible to use that [inaudible], for example for
-
74:42 - 74:48[inaudible]. Instructor (Andrew Ng):Yes, so the question is, in real-time applications, is it possible to sample states and use them to approximate this
-
74:51 - 74:56expectation. Computers today are actually amazingly fast. I'm actually often
-
74:56 - 74:59surprised by how much you can do in real time so
-
74:59 - 75:00the helicopter, we're actually flying the helicopter
-
75:00 - 75:04using an algorithm different from
-
75:04 - 75:05this, I should say.
-
75:06 - 75:11But my intuition is that you could actually do this with a
-
75:11 - 75:14helicopter. A helicopter would be controlled at somewhere between 10 Hz and 50 Hz, so you'd need to do
-
75:14 - 75:16this 10 times a second to 50 times a second,
-
75:16 - 75:18and that's actually plenty of time
-
75:18 - 75:19to sample
-
75:19 - 75:241,000 states and compute this
-
75:24 - 75:27expectation. Helicopters are really difficult because helicopters are mission critical, and if you
-
75:27 - 75:30do something wrong, you
-
75:30 - 75:32can do serious damage, and so,
-
75:32 - 75:36maybe not for good reasons, we've actually tended to avoid
-
75:36 - 75:37tossing coins when we're
-
75:37 - 75:40in the air, so the idea of letting our actions come
-
75:40 - 75:45out of some random process is slightly scary, and we just tend not to do that.
-
75:45 - 75:49I should say that's probably not a great reason, because if you average a
-
75:49 - 75:52large number of things here you'll very likely be fine, but just
-
75:52 - 75:56as a maybe overly conservative design choice, we actually
-
75:56 - 75:57don't, we tend not to run
-
75:57 - 75:59anything randomized onboard,
-
75:59 - 76:01which is probably being over-conservative.
-
76:02 - 76:05It's the choice we made because it seemed slightly safer.
-
76:05 - 76:08I think you can actually often do this,
-
76:08 - 76:13so long as your model can be evaluated fast enough that you can sample
-
76:13 - 76:15100 state transitions or 1,000 state
-
76:15 - 76:18transitions, and then do that at
-
76:18 - 76:2210 Hz. Having said that, this isn't always attainable, which is why we often
-
76:22 - 76:29use the other approximations that don't require you to draw a large sample. Anything else? No, okay, cool. So
-
76:32 - 76:36now you know one algorithm for [inaudible] reinforcement learning on continuous state
-
76:36 - 76:39spaces. We'll then pick up with some more ideas on some even
-
76:39 - 76:40more powerful algorithms
-
76:40 - 76:43for solving MDPs with continuous state spaces.
-
76:43 - 76:44Thanks. Let's close for today.
Title: Lecture 17 | Machine Learning (Stanford)
Description:
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses the topic of reinforcement learning, focusing particularly on continuous state MDPs, discretization, and policy and value iterations.
This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Complete Playlist for the Course: http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599
CS 229 Course Website: http://www.stanford.edu/class/cs229/
Stanford University: http://www.stanford.edu/
Stanford University Channel on YouTube: http://www.youtube.com/stanford
Video Language: English
Duration: 01:17:00