This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. So I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in the Skilling Auditorium, in Skilling 193, at 3:45. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual.

Okay, welcome back. What I want to do today is continue our discussion of reinforcement learning in MDPs. It's quite a lot for me to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. I'll talk just very briefly about discretization, I'll spend a lot of time talking about models and simulators of MDPs, and then I'll talk about one algorithm called fitted value iteration that builds on that, and then hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap: in the previous lecture, I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation. I said that an MDP, or Markov decision process, was a five-tuple (S, A, {P_sa}, γ, R) comprising those things, and the running example I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP, there were 11 states, so that's what S was. The actions were compass directions: north, south, east and west.
The state transition probabilities P_sa capture the chance of transitioning to each state when you take any action in any given state, and so in our example they captured the stochastic dynamics of our robot wandering around the grid. We said that if you take the action north, you have a 0.8 chance of actually going north and a 0.1 chance of veering off to the left and a 0.1 chance of veering off to the right, so that modeled the robot's noisy dynamics. And the reward function was +1 and -1 at the absorbing states, and -0.02 elsewhere. This is an example of an MDP, and that's what those five things were. Oh, and I used a discount factor γ, usually a number slightly less than one, so that's the 0.99.

And so our goal was to find the policy, the control policy π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined the value function V^π(s) to be the expected value of the sum of discounted rewards, E[R(s_0) + γ R(s_1) + γ² R(s_2) + ... | π, s_0 = s], conditioned on your executing the policy π and on your starting off in the state s. And so our strategy for finding the policy was comprised of two steps. The goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and I said last time that one strategy for finding the optimal policy is to first compute the optimal value function, which I denoted V*(s) and which is defined as the maximum value that any policy can obtain, V*(s) = max_π V^π(s). And for example, the optimal value function for that MDP looks like this: starting from any of these states, it gives the expected value of the sum of discounted rewards you get, so this is V*. We also said that once you've found V*, you can compute the optimal policy using π*(s) = argmax_a Σ_{s'} P_sa(s') V*(s').
And so once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*(s), the optimal sum of discounted rewards you can get from state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max over all the actions you could take of your expected future payoff, your future sum of discounted rewards starting from the state s' that you might transition to after one step. So

V*(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V*(s').

And so this gave us the value iteration algorithm (I'm abbreviating value iteration as V.I.), in which you just take Bellman's equations and repeatedly do this: initialize some guess of the value function, initialize it to zero as the starting guess, and then repeatedly perform the update V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s') for all states. And I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), from which you can compute the optimal policy π*.

Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let's see: given a random initial policy, we'll solve for V^π, the value function for that specific policy. This means for every state, compute the expected sum of discounted rewards you get if you execute the policy π from that state. And then the other step of policy iteration is, having found the value function for your policy, you then update the policy pretending that that's already the optimal value function V*, so π(s) := argmax_a Σ_{s'} P_sa(s') V(s'). And then you repeatedly perform these two steps, where you solve for the value function for your current policy, and then pretend that that's actually the optimal value function and solve for the policy given the value function, and you keep updating the value function, or updating the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and will cause π to converge to π*, the optimal policy. So those were the highlights of our last lecture on MDPs, which introduced a lot of new notation; that just summarizes it all again.
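For concreteness, here is a minimal sketch, in Python, of what the value iteration update and the greedy policy extraction just described might look like for a small discrete MDP such as the 11-state grid world. The data layout (a list of states, a list of actions, nested dictionaries P[s][a] mapping next states to probabilities, a reward table R, and a discount gamma) is my own choice for illustration, not something specified in the lecture.

```python
# Minimal sketch of value iteration for a small discrete MDP.
# Assumed inputs (illustrative, not from the lecture): states and actions are lists,
# P[s][a] is a dict mapping next state s' -> probability, R[s] is the immediate
# reward for state s, and gamma is the discount factor.

def value_iteration(states, actions, P, R, gamma, num_iters=1000):
    V = {s: 0.0 for s in states}                      # initialize V(s) := 0
    for _ in range(num_iters):
        for s in states:
            # Bellman update: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
            V[s] = R[s] + gamma * max(
                sum(p * V[sp] for sp, p in P[s][a].items())
                for a in actions
            )
    return V

def greedy_policy(states, actions, P, V):
    # pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s')
    return {s: max(actions,
                   key=lambda a: sum(p * V[sp] for sp, p in P[s][a].items()))
            for s in states}
```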
What I'm about to do now, for the rest of today's lecture, is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to move on. Yeah?

Student: [Inaudible] how are those two algorithms very different?

Instructor (Andrew Ng): I see, right. So, yeah, do you see that they're different? Okay, how it's different. Let's see. Well, here's one difference; I didn't say this before since I won't use it today. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy. So you're given some fixed policy π, meaning some function mapping from the states to the actions. I give you some policy, and whatever; that's just some policy, it's not a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state you need to compute the expected sum of discounted rewards you get if you execute this specific policy, starting off the MDP in that state s. So I showed this last time; I won't go into the details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There was a form of Bellman's equations for V^π, and it turned out that if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function for a fixed policy by solving a system of linear equations with 11 variables and 11 constraints. So that's policy iteration; whereas in value iteration, going back to the board, you repeatedly perform this update where you update the value of a state using the Bellman update. So I hope it makes sense that the algorithms are different.

Student: [Inaudible] on the absorbing states, is the assumption that we can never get out of those states?

Instructor (Andrew Ng): Yes.
There are ways you could set this up without absorbing states, for example, and make the numbers come out nicely, but I don't want to spend too much time on that. So yeah, the assumption is that once you enter an absorbing state, the world ends, or there are no more rewards after that. Another way to think of the absorbing states, which is mathematically equivalent, is that you can think of each absorbing state as transitioning with probability 1 to some 12th state, and then once you're in that 12th state, you always remain in that 12th state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the 12th state is this zero-cost absorbing state that you get stuck in forever. Other questions? Yeah, please go.

Student: Where did the Bellman's equations [inaudible] to optimal value [inaudible]?

Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equation, this equation that I'm pointing to, I sort of tried to give a justification for last time. I'll say it in one sentence: the expected total payoff I expect to get starting from the state s is equal to my immediate reward, which is the reward I get just for starting in that state, plus what comes after. Let's see: if I start in the state, I'm going to get some first reward, and then I can transition to some other state, and then from that other state I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces. They say the value of a state is equal to the reward you get right away, plus γ times the rewards you get in the future; that is, V*(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V*(s'), where the first term is the immediate reward and the second term turns out to be equal to the expected future discounted rewards. I spent more time justifying this in the previous lecture, although, yeah, hopefully for the purposes of this lecture, if you don't remember where this came from, if you don't remember the justification of that, why don't you just maybe take my word for it that this equation holds true, since I use it a little bit later as well, and the lecture notes explain a little further the justification for why this equation might hold true.
But for now, yeah, just take my word for it that this holds true, because we'll use it a little bit later today as well.

Student: [Inaudible] and would it then sort of turn back into [inaudible]?

Instructor (Andrew Ng): Right, the question is: in policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration? And the answer is sort of no. Let's see. It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but it actually turns out that, oh, no, the algorithms are not equivalent. It's just that in policy iteration there is a step where you're solving for the value function for the policy, where you solve for V^π. Usually you can do this, for instance, by solving a linear system of equations. Value iteration is a different algorithm, yes. I hope it makes sense that at least cosmetically it's different.

Student: [Inaudible] if you have [inaudible] representing π implicitly, then you won't have to solve that system of [inaudible] equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see, solving for V^π this way works only if you have a fixed policy, so once you change the value function, if π changes as well, then it's sort of hard to solve this. Yeah, so later on we'll actually talk about some examples of when π is implicitly represented, but at least for now, I think, yeah, maybe there's a way to redefine something and see a mapping onto value iteration, but that's not usually done. These are viewed as different algorithms. Okay, cool, so all good questions.

Let me move on and talk about how to generalize these ideas to continuous states. Everything we've done so far has been for discrete states, or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function, could be represented using an array of 11 numbers, because if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.
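To make the distinction concrete, here is a minimal sketch of the policy-evaluation step and the policy iteration loop just discussed: for a fixed policy, Bellman's equation for V^π is linear, so V^π can be found by solving a linear system (11 equations in 11 unknowns for the grid world). The array layout (P given as one n-by-n transition matrix per action, R as a reward vector, pi as an array of action indices) is my own assumption for illustration.

```python
import numpy as np

# Sketch of the policy-evaluation step: for a fixed policy pi, Bellman's equation
# V_pi = R + gamma * P_pi V_pi is linear in V_pi, so solve (I - gamma P_pi) V_pi = R.
# Assumed layout (mine, not the lecture's): R is a length-n reward vector,
# P[a] is an n x n matrix of transition probabilities under action a,
# and pi is a length-n array of action indices.

def evaluate_policy(P, R, pi, gamma):
    n = len(R)
    P_pi = np.array([P[pi[s]][s] for s in range(n)])   # row s is P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def policy_iteration(P, R, gamma, num_iters=100):
    n, num_actions = len(R), len(P)
    pi = np.zeros(n, dtype=int)                        # some initial policy
    for _ in range(num_iters):
        V = evaluate_policy(P, R, pi, gamma)           # solve for V_pi
        # greedy improvement: pi(s) := argmax_a sum_{s'} P_sa(s') V(s')
        Q = np.array([P[a] @ V for a in range(num_actions)])
        pi = np.argmax(Q, axis=0)
    return pi, V
```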
What I want to do for the rest of today is talk about continuous states. So for example, if you want to control any number of real systems, for example if you want to control a car, a car's position is given by x, y, θ, its position and orientation, and if you want to model the velocity as well, then you add ẋ, ẏ, θ̇. So it depends on whether you want to model only the kinematics, meaning just the position, or whether you want to model the dynamics, meaning the velocity as well.

Earlier I showed you video of a helicopter that was flying using a reinforcement learning algorithm. For a helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state will be given by the x, y, z position, plus φ, θ, ψ, which are sometimes used to denote the pose of the helicopter, just its orientation. And if you want to control a helicopter, you pretty much have to model velocity as well, which means both linear velocity and angular velocity, and so this would be a 12-dimensional state.

An example that is kind of fun but unusual, and one that I'm just going to use a little bit in today's lecture, is the inverted pendulum problem, which is a long-running classic in reinforcement learning. Imagine that you have a little cart that's on a rail; the rail ends at some point, and imagine that you have a pole attached to the cart by a free hinge, so the pole here can rotate freely, and your goal is to control the cart and move it back and forth on this rail so as to keep the pole balanced. Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here? Back in the corner. Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine the hand just moving back and forth to try to keep the pole balanced; you can actually use a reinforcement learning algorithm to do that.
This is actually one of the longstanding classic problems that people implement and play with using reinforcement learning algorithms, and so for this, the state would be x and θ, where x is the position of the cart and θ is the orientation of the pole, and also the linear velocity ẋ and the angular velocity θ̇ of the pole. I'll actually use this example a couple of times.

So with continuous state spaces, how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or the inverted pendulum? One thing you can do, and this is maybe the most straightforward thing, is the following. Say you have a two-dimensional continuous state space, where s1 and s2 are my state variables (in all the examples the state is, I guess, between 4-dimensional and 12-dimensional, but I'll just draw 2-D here). The most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. I'll use s̄ to denote the discretized or discrete states, and so you can approximate this continuous state problem with a finite or discrete set of states, and then you can use policy iteration or value iteration to solve for V*(s̄). And if your robot is then in some state given by that dot, you would figure out which discretized state it is in, in this case this discretized grid cell that's called s̄, and then you execute the action given by the policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem.

Sometimes you can sort of make this work, but there are a couple of reasons why it does not work very well. One reason is the following, and for this picture let's even temporarily put aside reinforcement learning. Let's just think about doing regression for now, and so suppose you have some input variable x, and suppose I have some data and I want to fit a function, y as a function of x.
So discretization is saying that I'm going to take my horizontal x-axis and chop it up into a number of intervals (sometimes I call these intervals buckets as well). We chop the horizontal x-axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of these buckets. And just look at this: this is clearly not a very good representation, right? When we talked about regression, you would just choose some features of x and run linear regression or something, and you'd get a much better fit to the function. So there's the sense that discretization just isn't a very good representation: this sort of piecewise constant function just isn't a very good function for representing many things, and there's also the sense that there's no smoothing, no generalization across the different buckets. In fact, back in regression, I would never have chosen to do regression using this sort of representation; it just really doesn't make sense.

And so in the same way, instead of x we have the state s, and instead of some hypothesis function of x we have the value function: if you have the state here and you're trying to approximate the value function, you can get discretization to work for many problems, but maybe this isn't the best representation for a value function.

The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. This is just the observation that if the state space is in R^n, and if you discretize each variable into k buckets, that is, into k discrete values, then you get on the order of k^n discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe like 100 to the power of 12, which is just huge; it's not feasible. And so discretization doesn't scale well at all to problems with high-dimensional state spaces, and this observation actually applies more generally than to just robotics and continuous state problems.
For example, another fairly well-known application of reinforcement learning has been to factory automation. If you imagine that you have 20 machines sitting in the factory, and the machines lie in an assembly line and each does something to a part on the assembly line and then routes the part onto a different machine, you might want to use reinforcement learning algorithms to choose the order in which the different machines operate on the different things that are flowing through your assembly line, and maybe different machines can do different things. So if you have n machines and each machine can be in k states, then if you do this sort of discretization, the total number of states would be k^n as well. If you have n machines and each machine can be in k states, then again you get this huge number of states.

Another well-known example would be board games. Say you want to use reinforcement learning to play chess. If you have n pieces on your board game, n pieces on the chessboard, and each piece can be in k positions, then this is again the curse of dimensionality, where the number of discrete states you end up with grows exponentially with the number of pieces in your board game.

So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are that you'll need to fiddle with the discretization and do things like non-uniform grids. So for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1, if I think the value function is much more sensitive to the value of the state variable s2 than to s1. And so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on.
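As a small illustration of the discretization idea and why it blows up, here is a sketch that maps a continuous state in R^n onto a grid cell, using a uniform grid with k buckets per dimension. The function name, the bounding box lows/highs, and the uniform spacing are my own simplifications for illustration; as noted, in practice you may want a non-uniform grid.

```python
import numpy as np

# Sketch of the discretization idea: map a continuous state in R^n onto a grid cell,
# using k buckets per dimension between per-dimension bounds lows and highs.
# Names and the uniform-grid choice are illustrative assumptions.

def discretize(s, lows, highs, k):
    """Return the index of the grid cell containing continuous state s."""
    s, lows, highs = np.asarray(s), np.asarray(lows), np.asarray(highs)
    # bucket index along each dimension, clipped into [0, k-1]
    idx = np.clip(((s - lows) / (highs - lows) * k).astype(int), 0, k - 1)
    # flatten the n per-dimension indices into a single discrete state id
    return int(np.ravel_multi_index(tuple(idx), dims=[k] * len(s)))

# Curse of dimensionality: with k buckets per dimension, an n-dimensional state
# space produces k**n discrete states, e.g. k=100, n=12 gives 100**12 states.
```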
But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it works fine. With 4-dimensional problems, you can probably get it to work, but it'll be slightly challenging, and sometimes, by fooling around and being clever, you can push discretization up to, let's say, about 6-dimensional problems, but with some difficulty. Problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just rough folk wisdom on the order of problem you might think about using discretization for.

But what I want to spend most of today talking about are methods that often work much better than discretization, and with which we will approximate V* directly, without resorting to these sorts of discretizations. Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup. For today's lecture, I'm going to focus on the problem of continuous states, and just to keep things very simple in this lecture, I won't deal with continuous actions, so I'm going to assume discrete actions A. It turns out, actually, and this is a critical fact for many problems, that the state space is often much larger than the space of actions; that just seems to have worked out that way for many problems. So for example, for driving a car the state space is 6-dimensional (x, y, θ, ẋ, ẏ, θ̇), whereas you have just two action dimensions: forward/backward motion and steering the car. So you have 6-D states and 2-D actions, and you can discretize the actions much more easily than you can discretize the states. The other example was the helicopter: you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action.
Whether you accelerate your cart to the left or to the right, that's the 1-D action. And so for the rest of today, I'm going to assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems but for many problems, large action spaces are just less of a difficulty than large state spaces.

So I'm going to assume that we have a model, or simulator, of the MDP, and this is really an assumption on how the state transition probabilities are represented. I'm going to use the terms "model" and "simulator" pretty much synonymously. Specifically, what I'm going to assume is that we have a black box, a piece of code, such that I can input any state, input an action, and it will output s' sampled from the state transition distribution P_sa. So this is really my assumption on the representation I have for the state transition probabilities: I'll assume I have a black box that takes as input the state and action and outputs the next state.

And here are some fairly common ways to get models of the MDPs you may work with. One is that you might get a model from a physics simulator. So for example, if you're interested in controlling that inverted pendulum, your action is a, which is the magnitude of the force you exert on the cart to push it left or right, and your state is (x, ẋ, θ, θ̇); I'm just going to write it in that order. And so I'm going to write down a bunch of equations just for completeness, but everything I'm going to write below here is a bit gratuitous: if you flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this. You find that ṡ, the derivative of the state with respect to time, is (ẋ, ẍ, θ̇, θ̈), where the accelerations are given by expressions involving the action a, terms like l cos θ, the mass m, and so on, where a is the force, the action that you exert on the cart, l is the length of the pole, and m is the total mass of the system, and so on.
So these equations aren't the point; I'm just mentioning them for completeness, but by flipping open a physics textbook you can work out these equations of motion yourself, and this then gives you a model which says that s_{t+1}, your state one time step later, will be equal to your previous state plus Δt times ṡ. So in your simulator, or in my model, this tells you what happens to the cart every, say, tenth of a second, so Δt would be a tenth of a second, and then s_{t+1} = s_t + Δt · ṡ. And so that'd be one way to come up with a model of your MDP.

In this specific example, I've actually written down a deterministic model, and by deterministic I mean that given an action and a state, the next state is not random. So this would be an example of a deterministic model, where I can compute the next state exactly as a function of the previous state and the previous action; or put another way, it's a deterministic model because all the probability mass is on a single state given the previous state and action. You can also make this a stochastic model.

A second way that is often used to obtain a model is to learn one. And so again, concretely, what you do is imagine that you have a physical inverted pendulum system, that you physically own an inverted pendulum robot. What you would do is initialize your inverted pendulum robot to some state and then execute some policy: it could be some random policy, or some policy that you think is pretty good, or you could even try controlling it yourself with a joystick or something. So you set the system off in some state s_0, then you take some action a_0 (again, the actions could be chosen by some policy, or chosen by you using a joystick to try to control your inverted pendulum, or whatever), the system transitions to some new state s_1, and then you take some new action a_1, and so on. Let's say you do this for T time steps; sometimes I call this one trajectory. And you repeat this m times, so this is the first trial, the first trajectory, and then you do it again: initialize the system in some state, run it for T time steps, and so on.
So you do this a bunch of times, and then you would run a learning algorithm to estimate s_{t+1} as a function of s_t and a_t. For the sake of concreteness, you should just think of this as the inverted pendulum problem, so s_{t+1} is a 4-dimensional vector, and s_t, a_t will be a 4-dimensional vector and a real number, and so you might run linear regression four times to predict each of the four state variables as a function of those five real numbers.

Just for example, if you want to estimate your next state s_{t+1} as a linear function of your previous state and your action, s_{t+1} = A s_t + B a_t, where A will be a 4-by-4 matrix and B a 4-dimensional vector, then you would choose the values of A and B that minimize

Σ_i Σ_t ‖ s_{t+1}^(i) − ( A s_t^(i) + B a_t^(i) ) ‖²,

summing over your trajectories and over the time steps in each trajectory. So if you want your model to be that s_{t+1} is some linear function of the previous state and action, you might pose this optimization objective and choose A and B to minimize the sum of squared errors in your predicted value for s_{t+1} as a linear function of s_t and a_t.

I should say that this is one specific example, where you're using a linear function of the previous state and action to predict the next state. Of course, you can also use other algorithms, like locally weighted linear regression, or linear regression with nonlinear features, or kernel linear regression, or whatever, to predict the next state as a nonlinear function of the current state as well, so this isn't limited to linear models. And it turns out that locally weighted linear regression, for many robots, turns out to be an effective method for this learning problem as well.
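Here is a rough sketch of that model-learning step: fitting A and B by least squares from recorded trajectories, and then using the result as a deterministic or stochastic one-step simulator. The array shapes and function names are my own assumptions for illustration; for a real robot you might swap in locally weighted regression or a nonlinear model, as noted above.

```python
import numpy as np

# Sketch of learning the model s_{t+1} = A s_t + B a_t by least squares from
# recorded data. Assumes (my layout, not the lecture's) that states has shape
# (T+1, 4) and actions has shape (T, 1) for one trajectory; several trajectories
# can be handled by stacking their (s_t, a_t, s_{t+1}) rows.

def fit_linear_model(states, actions):
    X = np.hstack([states[:-1], actions])        # inputs  [s_t, a_t], shape (T, 5)
    Y = states[1:]                               # targets s_{t+1},   shape (T, 4)
    # Solve min_W ||X W - Y||^2; W stacks A and B so that s_{t+1} = A s_t + B a_t
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    A, B = W[:4].T, W[4:].T                      # A is 4x4, B is 4x1
    return A, B

def simulate_step(A, B, s, a, noise_std=0.0):
    # Deterministic model, or a stochastic one if we add a noise term eps_t
    eps = noise_std * np.random.randn(len(s))
    return A @ s + B @ np.atleast_1d(a) + eps
```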
And so having learned the model, having learned the parameters A and B, you then have a model where you say that s_{t+1} = A s_t + B a_t, and that would be an example of a deterministic model. Or, having learned the parameters A and B, you might say that s_{t+1} = A s_t + B a_t + ε_t, where ε_t is a noise term, say Gaussian noise. And so these would be very reasonable ways to come up with either a deterministic or a stochastic model for your inverted pendulum MDP.

So just to summarize, what we have now is a model, meaning a piece of code, where you can input a state and an action and get out s_{t+1}. And if you have a stochastic model, then to run this model you would actually sample ε_t from the noise distribution in order to generate s_{t+1}.

It actually turns out, as a preview, I guess, of the next lecture, that in the specific case of linear dynamical systems, in the specific case where the next state is a linear function of the current state and action, there are very powerful algorithms you can use. I'm not going to talk about that today; I'll talk about that in the next lecture rather than right now. But it turns out that for many problems, like the inverted pendulum, you'd use locally weighted regression or some nonlinear learning algorithm, because many systems aren't really linear, and you can build a nonlinear model.

So what I want to do now is talk about, given a model, given a simulator for your MDP, how to come up with an algorithm to approximate the optimal value function V*. Before I move on, let me check if there are questions on this. Okay, cool.

So here's the idea. Back when we talked about linear regression, we said that given some input x in supervised learning, given the input features x, we may choose some features of x and then approximate the target variable as a linear function of various features of x. So we'll do exactly the same thing to approximate the optimal value function, and in particular, we'll choose some features φ(s) of a state s.
And so you could actually choose φ(s) = s; that would be one reasonable choice, if you want to approximate the value function as a linear function of the state. But you can also choose other things. For example, for the inverted pendulum example, you may choose φ(s) to be a vector of features that might be, say, 1, or x², or ẋ², or ẋ, maybe some cross terms like x times θ̇, maybe θ², and so on. So you choose some vector of features, and then approximate the value of the state as V(s) = θᵀφ(s), the parameter vector transposed times the features. And I should apologize in advance: I'm overloading notation here, which is unfortunate. I'm using θ both to denote the angle of the pole of the inverted pendulum, that was the angle θ, and also to denote the vector of parameters in my learning algorithm. So sorry about the overloaded notation.

Just like we did in linear regression, my goal is to come up with a linear combination of the features that gives me a good approximation to the value function, and this is completely analogous to when we said that in linear regression our estimate of the response y was a linear function θᵀx of the features of the input. That's what we had in linear regression.

Let me just write down value iteration again, and then I'll write down an approximation to value iteration. So for discrete states, this is the idea behind value iteration: we said that V(s) will be updated as V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s'). That was value iteration, and in the case of continuous states, this sum over states would be replaced by an integral over states; let me just write this as V(s) := R(s) + γ max_a E_{s' ∼ P_sa}[V(s')]. That sum over all states s' of the probability of going to s' times the value is really an expectation with respect to the random next state s' drawn from the state transition probabilities P_sa, applied to V(s').

And so what I'll do now is write down an algorithm called fitted value iteration that is an approximation to this, but specifically for continuous states. I'll just write down the first two steps, and then I'll continue on the next board. So the first step of the algorithm is: sample, that is, choose some set of states at random. So sample s^(1), s^(2), ..., s^(m) randomly, choose a set of m states randomly, and initialize my parameter vector θ to be equal to zero.
This is 0:52:12.020,0:52:16.190 analogous to in value iteration where I 0:52:16.190,0:52:21.140 might initialize the value function to be the function of all zeros. Then here's the 0:52:21.140,0:52:28.140 end view for the algorithm. 0:52:42.359,0:52:44.119 Got 0:52:44.119,0:52:51.119 quite a lot to 0:52:59.189,0:53:06.189 write 0:55:05.409,0:55:12.409 actually. 0:55:28.599,0:55:29.639 Let's 0:55:29.639,0:55:36.639 see. 0:56:10.420,0:56:12.009 And so 0:56:12.009,0:56:17.839 that's the algorithm. 0:56:17.839,0:56:19.880 Let me just adjust the writing. Give me a second. Give me a minute 0:56:19.880,0:56:26.880 to finish and then I'll step through this. 0:56:26.989,0:56:28.489 Actually, if some of my 0:56:28.489,0:56:35.489 handwriting is eligible, let me know. So 0:56:57.559,0:57:02.079 let me step through this and so briefly explain the rationale. 0:57:02.079,0:57:08.140 So the hear of the algorithm is - let's see. 0:57:08.140,0:57:10.869 In the original value iteration algorithm, 0:57:10.869,0:57:13.240 we would take the value for each state, 0:57:13.240,0:57:14.799 V(s)I, 0:57:14.799,0:57:17.599 and we will overwrite it with this 0:57:17.599,0:57:21.659 expression here. In the original, this discrete value iteration algorithm 0:57:21.659,0:57:23.199 was to V(s)I 0:57:23.199,0:57:23.940 and we will 0:57:23.940,0:57:25.829 set V(s)I 0:57:25.829,0:57:28.589 to be equal to that, I think. 0:57:28.589,0:57:33.130 Now we have in the continuous state case, we have an infinite continuous set of 0:57:33.130,0:57:34.199 states and so 0:57:34.199,0:57:37.859 you can't discretely set the value of each of these to that. 0:57:37.859,0:57:42.509 So what we'll do instead is choose the parameters T 0:57:42.509,0:57:46.849 so that V(s)I is as close as possible to 0:57:46.849,0:57:49.930 this thing on the right hand side 0:57:49.930,0:57:54.509 instead. And this is what YI turns out to be. 0:57:54.509,0:57:58.479 So completely, what I'm going to do is I'm going to construct 0:57:58.479,0:58:00.579 estimates of this term, 0:58:00.579,0:58:04.279 and then I'm going to choose the parameters of my function approximator. I'm 0:58:04.279,0:58:07.729 gonna choose my parameter as T, so that V(s)I is as close as possible to 0:58:07.729,0:58:10.619 these. That's what YI is, 0:58:10.619,0:58:11.979 and specifically, 0:58:11.979,0:58:15.139 what I'm going to do is I'll choose parameters data 0:58:15.139,0:58:18.489 to minimize the sum of square differences between T [inaudible] plus 0:58:18.489,0:58:20.409 5SI. 0:58:20.409,0:58:21.550 This thing here 0:58:21.550,0:58:24.489 is just V(s)I because I'm approximating V(s)I 0:58:24.489,0:58:26.219 is a linear function of 5SI 0:58:26.219,0:58:27.880 and so choose 0:58:27.880,0:58:32.019 the parameters data to minimize the sum of square differences. 0:58:32.019,0:58:34.829 So this is last step is basically 0:58:34.829,0:58:38.759 the approximation version of value 0:58:38.759,0:58:40.610 iteration. What everything else above 0:58:40.610,0:58:42.419 was doing was just 0:58:42.419,0:58:44.749 coming up with an approximation 0:58:44.749,0:58:46.189 to this term, to 0:58:46.189,0:58:47.499 this thing here 0:58:47.499,0:58:50.979 and which I was calling YI. 0:58:50.979,0:58:54.199 And so confluently, for every state SI we want to 0:58:54.199,0:58:57.159 estimate what the thing on the right hand side is 0:58:57.159,0:58:57.979 and but 0:58:57.979,0:59:01.980 there's an expectation here. 
Concretely, for every state s^(i) we want to estimate what that right-hand side is, but there's an expectation in it: an expectation over a continuous, maybe very high-dimensional, set of states, so I can't compute the expectation exactly. What I'll do instead is use my simulator to sample a set of states from that distribution P_{s^(i) a}, the state transition distribution over where I get to if I take the action a in the state s^(i), and then I'll average over that sample of states to approximate the expectation.

So stepping through the algorithm: for each state s^(i) and for each action a, I'm going to sample a set of states s'_1 through s'_k from the state transition distribution P_{s^(i) a}, still using the model, and then I'll set q(a) equal to the average, q(a) = (1/k) Σ_j [R(s^(i)) + γ V(s'_j)]. So q(a) is my estimate of R(s^(i)) + γ E[V(s')] for that specific action a. Then I take the maximum over actions a, and that gives me y^(i); so y^(i) = max_a q(a) for that state. And finally, I run linear regression, which is that last step of the algorithm, to get θᵀφ(s^(i)) to be close to the y^(i)'s.

And so this algorithm is called fitted value iteration, and it actually often works quite well for continuous problems, for problems with anywhere from 6- to 10- to 20-dimensional state spaces, if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Some of you didn't have your hands up. Are there questions? Yeah?

Student: Is there a reward function in this setup?

Instructor: Oh, yes. An MDP comprises (S, A, {P_sa}, γ, R), and so for continuous state spaces S would be something like R^4 for the inverted pendulum. The actions we've discretized; the state transition probabilities are specified by the model, or the simulator; γ is just a real number like 0.99; and the reward function is usually a function that's given to you. So the reward function is some function of your 4-dimensional state. For example, just as a simple example of a reward function, you might choose the reward to be -1 if the pole has fallen over, say if you find that its angle is greater than 30 degrees or something, and zero otherwise. So that would be an example of a reward function you might choose for the inverted pendulum.
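Putting those pieces together, here is a compact sketch of the whole fitted value iteration loop as just described. The simulator simulate(s, a), the discretized action set ACTIONS, the number of sampled next states k, the fixed number of sweeps, and the example reward function (-1 if the pole's angle exceeds 30 degrees, 0 otherwise, echoing the example above) are all assumptions for illustration; phi, fit_theta, and the sampled states come from the earlier hypothetical snippets.

```python
gamma = 0.99                  # discount factor
k = 10                        # sampled next states per (state, action) pair
ACTIONS = [-10.0, 0.0, 10.0]  # e.g. a discretized set of horizontal forces (assumed)

def R(s):
    """Example reward: -1 if the pole has fallen over (|angle| > 30 degrees), else 0."""
    pole_angle = s[2]         # third component of the state is the pole angle here
    return -1.0 if abs(pole_angle) > np.deg2rad(30) else 0.0

def fitted_value_iteration(states, simulate, n_sweeps=50):
    """Repeatedly build targets y_i and refit theta by least squares."""
    theta = np.zeros(phi(states[0]).shape[0])
    for _ in range(n_sweeps):
        y = np.zeros(len(states))
        for i, s in enumerate(states):
            q = []
            for a in ACTIONS:
                # Sample k next states s' ~ P_{sa} from the simulator and average,
                # approximating q(a) = R(s) + gamma * E[ V(s') ].
                s_primes = [simulate(s, a) for _ in range(k)]
                q.append(R(s) + gamma * np.mean([theta @ phi(sp) for sp in s_primes]))
            y[i] = max(q)                # y_i = max_a q(a)
        theta = fit_theta(states, y)     # supervised step: make theta^T phi(s_i) close to y_i
    return theta
```

One design note on this sketch: unlike exact value iteration, the fitted version is not guaranteed to converge, which is why it simply runs a fixed number of sweeps rather than testing for convergence.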
In any case, yes, we're assuming a reward function is given to you, so that you can compute R(s^(i)) for any state. Are there other questions?

Actually, let me try asking a question. Everything I did here assumed that we have a stochastic simulator. It turns out I can simplify this algorithm if I have a deterministic simulator, where by a deterministic simulator I mean that given a state and an action, my next state is always exactly determined. So let me ask you: if I have a deterministic simulator, how would I change this algorithm? How would I simplify it?

Student: Lower the number of samples that you're drawing [inaudible].

Instructor: Right, so Justin's got it right. If I have a deterministic simulator, all my samples from that distribution would be exactly the same, and so I can set k equal to 1. I don't need to draw k different samples; I really only need one sample if I have a deterministic simulator. So you can simplify this by setting k = 1 if you have a deterministic simulator.
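For that deterministic special case, the inner computation of q(a) collapses to a single simulator call. A minimal sketch, reusing the hypothetical R, gamma, and phi from the snippets above and assuming simulate(s, a) now returns the one deterministic next state:

```python
def q_value_deterministic(s, a, theta, simulate):
    """q(a) with a deterministic simulator: k = 1 is enough, since every
    draw of s' from simulate(s, a) would come back identical."""
    s_prime = simulate(s, a)
    return R(s) + gamma * (theta @ phi(s_prime))
```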
Student: I guess I'm a little confused. We've sort of turned this into something that looks like linear regression, you know, θ transpose times something, which we're used to, but I don't quite know what question to ask. When we did this before, we had discrete states and we were concerned with finding the optimal policy, and it seems like we haven't said the word "policy" in a while.

Instructor: Okay, yeah, so this does come back to the policy, but maybe I should just say a couple of words first, so let me try to get at some of what you're asking. Our strategy for finding the optimal policy has been to find some way to compute V*, or an approximation to it, and then use that to get an approximation to π*. So far, everything I've been doing has been focused on how to find V*.

I just want to say one more word about features. It actually turns out that for linear regression, it's often not terribly difficult to choose a reasonable set of features. Choosing features for approximating the value function is often somewhat harder, because the value of a state is how good it is to start off in that state: what is my expected sum of discounted rewards if I start in this state? And so what the features of the state have to capture is really how good it is to start in a given state. For the inverted pendulum, states where the pole is close to vertical and the cart is centered on the track are better, and so you can come up with features that measure the orientation of the pole and how close you are to the center of the track, and those would be reasonable features to use to approximate V*. Although in general it is true that choosing features for value function approximation is often slightly trickier than choosing good features for linear regression.

Okay, and then Justin's question: given V*, how do you go back and actually find a policy? In the discrete case, we have that π*(s) is equal to the arg max over actions a of Σ_{s'} P_sa(s') V*(s'). I used to write this as a sum over states; I'll just write it as an expectation: π*(s) = arg max_a E_{s' ~ P_sa}[V*(s')]. And so once you've found the optimal value function V*, you can then find the optimal policy π* by computing this arg max.
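In the tabular, discrete case that arg max can be computed exactly from the transition probabilities. A minimal sketch, assuming the transition model is stored as a hypothetical array P of shape (num_states, num_actions, num_states) and V_star is a vector of (approximately) optimal values:

```python
def greedy_policy_discrete(P, V_star):
    """pi*(s) = argmax_a sum_{s'} P[s, a, s'] * V*(s') for every state of a tabular MDP.

    P has shape (num_states, num_actions, num_states); V_star has shape (num_states,).
    Returns an array of greedy action indices, one per state."""
    expected_values = P @ V_star          # shape (num_states, num_actions)
    return np.argmax(expected_values, axis=1)
```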
If you're in a continuous-state MDP, though, you can't actually do this in advance for every single state, because there's an infinite number of states, so you can't perform this computation ahead of time for every state. What you do instead is this: it's only when your system, be it a factory or a board game or a robot, is actually in some specific state s, like your car being at some specific position and orientation, or your inverted pendulum being in some specific position, posed at some specific angle θ, that you go ahead and compute this arg max. So it's only when you're in some state s that you compute the arg max; then you execute that action a; and then, as a result of your action, your robot transitions to some new state, and it is given that specific new state that you compute the arg max again, using the specific state you're in.

There are a few ways to do this. One way is actually the same as the inner loop of the fitted value iteration algorithm: because this is an expectation over a large number of states, you'd sample a set of states from the simulator and then approximate the expectation using an average over your samples, just as in the inner loop of the value iteration algorithm. So you could do that; that's sometimes done. But sometimes it can also be a pain to have to sample a set of states to approximate those expectations every time you want to take an action in your MDP.

There are a couple of special cases where this can be done more cheaply. One special case is if you have a deterministic simulator: in other words, if your simulator is just some function, which could be a linear or a nonlinear function, so that the next state s_{t+1} is just some function f of your previous state and action. If that's the case, then this expectation simplifies, and the policy becomes arg max_a V*(f(s, a)), because this is really saying s' = f(s, a). (I'm switching back and forth between notation, using s and s' to denote the current and next state versus s_t and s_{t+1}; both are fairly standard notation, so I hope you don't mind my switching between them.) So if you have a deterministic simulator, you can just compute what the next state s' would be for each action you might take from the current state, and then take the arg max over actions a: basically, choose the action that gets you to the highest-value state. And so that's one case where you can compute the arg max, compute that expectation, without needing to sample and average over a set of samples.
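Here is a sketch of that online action selection in both flavors just described: approximating the expectation by an average over samples from the simulator, and the simpler deterministic form arg max_a V*(f(s, a)). As before, ACTIONS, phi, simulate, and the learned theta are the hypothetical pieces from the earlier snippets, and f is assumed to be the deterministic dynamics function.

```python
def choose_action_sampled(s, theta, simulate, n_samples=100):
    """Greedy action at state s: estimate E_{s'}[V*(s')] for each action by
    averaging theta^T phi(s') over sampled next states, then take the arg max."""
    def estimate(a):
        return np.mean([theta @ phi(simulate(s, a)) for _ in range(n_samples)])
    return max(ACTIONS, key=estimate)

def choose_action_deterministic(s, theta, f):
    """Greedy action with a deterministic simulator s' = f(s, a):
    pi(s) = argmax_a V*(f(s, a)) = argmax_a theta^T phi(f(s, a))."""
    return max(ACTIONS, key=lambda a: theta @ phi(f(s, a)))
```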
Another very common case, it turns out, is if you have a stochastic simulator, but your simulator happens to take on the very specific form s_{t+1} = f(s_t, a_t) + ε_t, where ε_t is Gaussian noise. This is a very common way to build simulators: you model the next state as a function of the current state and action plus some noise. One specific example would be the sort of linear dynamical system we talked about, a linear function of the current state and action plus Gaussian noise.

In this case, you can approximate the arg max over a as follows. You take the expectation you're trying to approximate, E[V*(s')], and approximate it by V* of the expected value of s', which is V*(f(s, a)). The expected value of a function is usually not equal to the function of the expectation, but it is often a reasonable approximation, and so that would be another way to approximate that expectation. You then choose the action according to essentially the same formula I wrote just now, π(s) ≈ arg max_a V*(f(s, a)), and so this is a way of approximating the arg max by essentially ignoring the noise in the simulator. And this often works pretty well too, just because many simulators turn out to have the form of some linear or nonlinear function plus zero-mean Gaussian noise; so you just ignore the zero-mean Gaussian noise, and then you can compute this quickly.

And just to complete this: what that V*(f(s, a)) is, you would write it down as θᵀφ(s') where s' = f(s, a), and this V* you would compute using the parameters θ that you just learned using the fitted value iteration algorithm.
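A sketch of that approximation for one hypothetical simulator of this form, a linear model s_{t+1} = A s_t + B a_t + ε_t with Gaussian ε_t. The matrices A and B and the noise level are made-up illustrative values, and rng, ACTIONS, phi, and the learned theta are reused from the earlier snippets. The noisy simulate is what you would feed to fitted value iteration; for action selection you evaluate the learned value function at the noise-free mean f(s, a) and ignore the noise:

```python
# Hypothetical linear-Gaussian dynamics: s' = A s + B a + eps, eps ~ N(0, sigma^2 I).
A = np.eye(4) + 0.02 * rng.standard_normal((4, 4))   # assumed, for illustration only
B = 0.1 * rng.standard_normal((4, 1))
sigma = 0.01

def f(s, a):
    """Deterministic part of the dynamics: the mean next state."""
    return A @ s + (B * a).ravel()

def simulate(s, a):
    """Stochastic simulator: mean dynamics plus zero-mean Gaussian noise."""
    return f(s, a) + sigma * rng.standard_normal(4)

def choose_action_ignore_noise(s, theta):
    """Approximate argmax_a E[V*(s')] by argmax_a V*(f(s, a)), i.e. evaluate
    theta^T phi(.) at the noise-free mean next state and ignore the noise."""
    return max(ACTIONS, key=lambda a: theta @ phi(f(s, a)))
```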
Questions about this?

Student: [Inaudible] For real-time applications, is it possible to use that [inaudible], for example for [inaudible]?

Instructor (Andrew Ng): Yes, so the question is whether, in real-time applications, it's possible to sample k states and use them to approximate the expectation. Computers today are actually amazingly fast; I'm often surprised by how much you can do in real time. So for the helicopter: are we actually flying a helicopter using an algorithm different from this one? I can't say. But my intuition is that you could actually do this with a helicopter. A helicopter would be controlled at somewhere between 10 Hz and 50 Hz, so you need to do this 10 to 50 times a second, and that's actually plenty of time to sample 1,000 states and compute this expectation.

Helicopters are a little different because they're mission critical: if you do something wrong, you can do serious damage, and so, maybe not for good reasons, we've actually tended to avoid tossing coins when we're in the air. The idea of letting our actions come out of some random process is slightly scary, so we just tend not to do that. I should say that's probably not a great reason, because if you average over a large number of samples it works perfectly well, but as a maybe overly conservative design choice we tend not to run anything randomized on board, which is probably being over-conservative; it's a choice we made because other things are slightly safer. But I think you can often do this, so long as your model can be evaluated fast enough that you can sample 100 or 1,000 state transitions and do that at 10 Hz. Having said that, this isn't always the case, which is why we often use the other approximations that don't require drawing a large sample. Anything else? No? Okay, cool.

So now you know one algorithm for reinforcement learning on continuous state spaces. Next time we'll pick up with some more ideas on some even more powerful algorithms for solving MDPs with continuous state spaces. Thanks; let's close for today.