This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in the Skilling Auditorium, in Skilling 193. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual.

Okay, welcome back. What I want to do today is continue our discussion of reinforcement learning in MDPs. I have quite a lot to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. I'll talk just very briefly about discretization, I'll spend a lot of time talking about models, or simulators, of MDPs, and then talk about one algorithm called fitted value iteration, and, building on that, hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap: in the previous lecture, I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation. I said that an MDP, or Markov decision process, is a five-tuple (S, A, {P_sa}, γ, R) comprising those things, and the running example I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP, there were 11 states, so that's what S was. The actions were the compass directions: north, south, east and west.
The state transition probabilities P_sa capture the chance of transitioning to each state when you take a given action in a given state, and so in our example they captured the stochastic dynamics of our robot wandering around the grid world. We said that if you take the action north, you have a 0.8 chance of actually going north, a 0.1 chance of veering off to the left, and a 0.1 chance of veering off to the right, so that was the model of the robot's noisy dynamics. And the reward function was +1 and -1 at the absorbing states, and -0.02 elsewhere. This is an example of an MDP, and that's what these five things were. Oh, and I used a discount factor γ, usually a number slightly less than one, so that's the 0.99.

And so our goal was to find the policy, the control policy, and that's π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined the value function V^π(s) to be equal to this. We said that the value of a policy π from a state s is given by the expected value of the sum of discounted rewards, conditioned on your executing the policy π and on your starting off in the state s.

And so our strategy for finding the policy was comprised of two steps. So the goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and I said last time that one strategy for finding the optimal policy is to first compute the optimal value function, which I denoted V*(s) and which is defined like that. It's the maximum value that any policy can obtain, and, for example, the optimal value function for that MDP looks like this.
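(Written out, using the standard definitions from the course notes, the two quantities being pointed to on the board are:

    V^π(s) = E[ R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯ | s_0 = s, π ],
    V*(s)  = max_π V^π(s).)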
So, in other words, starting from any of these states, what's the expected value of the sum of discounted rewards you get? So this is V*. We also said that once you've found V*, you can compute the optimal policy using this: π*(s) is the argmax over actions a of the expected value of V* at the state you transition to.

And so, once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*, the optimal sum of discounted rewards you can get from a state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max over all the actions you could take of your expected future sum of discounted rewards, your future payoff starting from the state s', which is where you might transition to after one step.

And so this gave us the value iteration algorithm; I'm abbreviating value iteration as V.I. In the value iteration algorithm, in V.I., you just take Bellman's equations and you repeatedly do this update. So initialize some guess of the value function, initialize it to zero as the starting guess, and then repeatedly perform this update for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), and from V* you can compute the optimal policy π*.

Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So, let's see: given a random initial policy, we'll solve for V^π; we'll solve for the value function for that specific policy. So this means that for every state, we compute the expected sum of discounted rewards if you execute the policy π starting from that state.
And then the other step of policy iteration is: having found the value function for your policy, you then update the policy, pretending that the value function you've just found is the optimal value function V*. And then you repeatedly perform these two steps, where you solve for the value function for your current policy, and then pretend that that's actually the optimal value function and solve for the policy given that value function, and you repeatedly update the value function, or update the policy, using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and this will cause π to converge to π*, the optimal policy.

So that was a recap of our last lecture on MDPs, which introduced a lot of new notation and symbols, just to summarize all that again. What I'm about to do for the rest of today's lecture is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to go on. Please, yeah.

Student: [Inaudible] how are those two algorithms different?

Instructor (Andrew Ng): I see, right, so yeah, do you see that they're different? Okay, how they're different. Let's see. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy. And so you're given some fixed policy π, meaning some function mapping from the states to the actions. So you're given some policy, and whatever; that's just some policy, it's not necessarily a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state you need to compute your expected sum of discounted rewards if you execute this specific policy, starting off the MDP in that state s. So I showed this last time.
I won't go into the details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There's a form of Bellman's equations for V^π, and it turns out that if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function of a fixed policy by solving a system of linear equations with 11 variables and 11 constraints. So that's policy iteration; whereas in value iteration, going back to the board, you repeatedly perform this update, where you update the value of a state using the immediate reward plus the discounted expected future value. So I hope that makes sense, that these algorithms are different.

Student: [Inaudible] on the absorbing states, so is the assumption that we can never get out of those states?

Instructor (Andrew Ng): Yes. There are always choices you make when you set these examples up to make the numbers come out nicely, and I don't want to spend too much time on them, but yes, the assumption is that once you enter the absorbing state, then the world ends, or there are no more rewards after that. Another way to think of the absorbing states, which is mathematically equivalent, is that the absorbing states transition with probability 1 to some 12th state, and once you're in that 12th state, you always remain in that 12th state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the 12th state is this zero-reward absorbing state that you get stuck in forever. Other questions? Yeah, please go.

Student: Where did the Bellman's equations for the optimal value function come from?

Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equation, this equation that I'm pointing to, I tried to give a justification for last time. I'll say it in one sentence: the expected total payoff I expect to get starting from the state s is equal to my immediate reward, which is the reward I get just for starting in that state, plus what comes after. Let's see.
If I start in that state, I'm going to get some first reward, and then I can transition to some other state, and from that other state I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces: it says the value of a state is equal to the reward you get right away, plus γ times the rewards you get in the future. So V*(s) is really equal to the immediate reward R(s), plus γ times the expected future payoff, and that second term turns out to be equal to that second row on the board. I spent more time justifying this in the previous lecture, although, hopefully, for the purposes of this lecture, if you're not sure where this came from, if you don't remember the justification, why don't you just take my word for it that this equation holds true, since I'll use it a little bit later as well, and the lecture notes explain the justification for why this equation holds true a little further. But for now, just take my word for it that this holds true, because we'll use it a little bit later today as well.

Student: [Inaudible] In policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration?

Instructor (Andrew Ng): Right, so the question is, in policy iteration, if we represent π implicitly using V(s), would it become equivalent to value iteration, and the answer is sort of no. Let's see. It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but it turns out that the algorithms are not equivalent. It's just that in policy iteration, there is a step where you're solving for the value function for the policy, where you solve for V^π. Usually, you can do this, for instance, by solving a linear system of equations. Value iteration is a different algorithm, yes. I hope it makes sense that at least cosmetically they're different.

Student: [Inaudible] If you have [inaudible] representing π implicitly, then you won't have to solve that system of linear equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see, solving for V^π that way works only if you have a fixed policy, so once you change the value function, if π changes as well, then it's sort of hard to solve this.
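(To make the contrast concrete, here is a minimal sketch of the two pieces being compared: one synchronous Bellman backup sweep, which is the value iteration update, and exact policy evaluation by solving the linear system of Bellman equations for a fixed policy. The array sizes and the randomly generated transition probabilities and rewards are placeholders, not the grid world's actual numbers.)

```python
import numpy as np

# Illustrative sizes: a small discrete MDP, e.g. the 11-state grid world.
n_states, n_actions, gamma = 11, 4, 0.99
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.full(n_states, -0.02)                     # placeholder per-state rewards

def value_iteration_sweep(V):
    """One synchronous Bellman backup: V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    return R + gamma * np.max(P @ V, axis=1)

def evaluate_policy(pi):
    """Policy evaluation: solve the linear system V_pi = R + gamma * P_pi V_pi for fixed pi."""
    P_pi = P[np.arange(n_states), pi]            # n_states x n_states transition matrix under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

V = np.zeros(n_states)
for _ in range(1000):                            # value iteration: repeat the backup
    V = value_iteration_sweep(V)

pi = np.zeros(n_states, dtype=int)               # policy iteration: alternate the two steps
for _ in range(20):
    V_pi = evaluate_policy(pi)
    pi = np.argmax(P @ V_pi, axis=1)             # greedy policy w.r.t. the current value function
```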
Yeah, so later on we'll actually talk about some examples where π is implicitly represented, but at least for now, maybe there's a way to redefine something and see a mapping onto value iteration, but that's not usually done. These are viewed as different algorithms. Okay, cool, so all good questions. Let me move on and talk about how to generalize these ideas to continuous states.

Everything we've done so far has been for discrete states, or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function, could be represented using an array of 11 numbers, because if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.

What I want to do for the rest of today is talk about continuous states. So, for example, if you want to control any of a number of real systems, say you want to control a car: a car's position is given by x, y and θ, its position and orientation, and if you want to model the velocity as well, then also ẋ, ẏ and θ̇. So it depends on whether you want to model only the kinematics, meaning just the position, or whether you want to model the dynamics, meaning the velocity as well.

Earlier I showed you video of a helicopter that was flying using a reinforcement learning algorithm. For a helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state would be given by the x, y, z position and by φ, θ, ψ.
φ, θ and ψ are sometimes used to denote the pose of the helicopter, just its orientation, and if you want to control a helicopter, you pretty much have to model the velocity as well, which means both the linear velocity and the angular velocity, and so this would be a 12-dimensional state.

An example that is kind of fun but unusual, and I'm going to use it as a running example a couple of times in today's lecture, is the inverted pendulum problem, which is a long-running classic in reinforcement learning. Imagine that you have a little cart that's on a rail; the rail ends at some point, and imagine that you have a pole attached to the cart by a free hinge, so the pole can rotate freely, and your goal is to control the cart and move it back and forth on this rail so as to keep the pole balanced. Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here? Back in the corner. Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine moving your hand back and forth to try to keep the pole balanced; you can actually use a reinforcement learning algorithm to do that. This is actually one of the longstanding classic problems that people implement and play with using reinforcement learning algorithms. And so for this, the state would be x and θ, where x is the position of the cart and θ is the orientation of the pole, and also the linear velocity and the angular velocity, ẋ and θ̇. I'll actually use this example a couple of times.

So, given a continuous state space, how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or the inverted pendulum?
So one thing you can do, and this is maybe the most straightforward thing, is this. Say you have a two-dimensional continuous state space, where s1 and s2 are my state variables; in all the examples they're really between 4-dimensional and 12-dimensional, but I'll just draw 2-D here. The most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. I'll use s̄ to denote the discretized, or discrete, states, and so you can approximate this continuous state problem with a finite, discrete set of states, and then you can use policy iteration or value iteration to solve for V*(s̄). And if your robot is then in some state given by that dot, you figure out what discretized state it is in; in this case it's in this discretized grid cell, call it s̄, and then you execute, you choose, the action given by π*(s̄), the policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem.

Sometimes you can sort of make this work, but there are a couple of reasons why it does not work very well. One reason is the following, and for this picture, let's even temporarily put aside reinforcement learning. Let's just think about doing regression for now, and so suppose you have some input variable x and I have some data, and I want to fit a function, y as a function of x. Discretization is saying that I'm going to take my horizontal axis of x's and chop it up into a number of intervals; sometimes I call these intervals buckets as well. We chop the horizontal axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of those buckets.
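(As a concrete picture of the discretization approach being described, here is a minimal sketch of the mapping from a continuous state to its grid cell, with the value function and policy stored as tables over cells, i.e., piecewise constant over the grid. The bounds and the bucket count are made-up illustrative numbers.)

```python
import numpy as np

# Illustrative 2-D example: state s = (s1, s2), each dimension chopped into k buckets.
lower = np.array([-3.0, -3.0])
upper = np.array([ 3.0,  3.0])
k = 10                                        # buckets per dimension -> k**2 discrete states

def discretize(s):
    """Map a continuous state to the index of the grid cell s-bar that contains it."""
    frac = (np.asarray(s) - lower) / (upper - lower)   # rescale each coordinate to [0, 1]
    idx = np.clip((frac * k).astype(int), 0, k - 1)    # bucket index per dimension
    return tuple(idx)

# Value function and policy become tables indexed by cells: piecewise constant over the grid.
V_bar = np.zeros((k, k))
pi_bar = np.zeros((k, k), dtype=int)

s = np.array([0.7, -1.2])                     # a continuous state
a = pi_bar[discretize(s)]                     # act using the policy of the enclosing cell
```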
And just look at this: this is clearly not a very good representation, right? When we talked about regression, you'd just choose some features of x and run linear regression or something, and you'd get a much better fit to the function. And so there's the sense that discretization, this class of piecewise constant functions, just isn't a very good class of functions for representing many things, and there's also the sense that there's no smoothing, no generalization, across the different buckets. In fact, back in regression, I would never have chosen to do regression using this sort of representation; it just really doesn't make sense. And so, in the same way, instead of x and some hypothesis function of x, if you have the state here and you're trying to approximate the value function, then you can get discretization to work for many problems, but maybe this isn't the best representation for a value function.

The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. This is just the observation that if the state space is in R^n, and if you discretize each variable into k buckets, so if you discretize each state variable into k discrete values, then you get on the order of k^n discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe something like 100^12, which is just huge, and it's not feasible. And so discretization doesn't scale well at all to problems with high-dimensional state spaces.

And this observation actually applies more generally than just to robotics and continuous state problems. For example, another fairly well-known application of reinforcement learning has been to factory automation.
If you imagine that you have 20 machines sitting in a factory, and the machines lie in an assembly line and they all do something to a part on the assembly line and then route the part on to a different machine, you may want to use reinforcement learning algorithms to choose the order in which the different machines operate on the different things flowing through your assembly line, and maybe different machines can do different things. So if you have n machines and each machine can be in k states, then if you do this sort of discretization, the total number of states would be k^n as well. If you have n machines and each machine can be in k states, then again, you get this huge number of states.

Another well-known example would be board games. Say you want to use reinforcement learning to play chess. Then if you have n pieces on the chessboard, and if each piece can be in k positions, then this is again sort of the curse of dimensionality, where the number of discrete states you end up with grows exponentially with the number of pieces in your board game.

So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are that you'll need to fiddle with the discretization and do things like non-uniform grids. So, for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1.
I'd do that if I think the value function is much more sensitive to the value of the state variable s2 than to s1, and so as you get into higher-dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on. But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it works fine. With 4-dimensional problems, you can probably get it to work, but it'll be slightly challenging, and sometimes, by fooling around and being clever, you can push discretization up to, let's say, about 6-dimensional problems, but with some difficulty, and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just a rough, folk-wisdom order of magnitude for the problems you might think about using discretization for.

But what I want to spend most of today talking about are methods that often work much better than discretization, in which we will approximate V* directly, without resorting to these sorts of discretizations. Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup. For today's lecture, I'm going to focus on the problem of continuous states, and, just to keep things very simple in this lecture, I won't consider continuous actions, so I'm going to assume discrete actions A. It turns out, and this is actually a critical fact, that for many problems the state space is much larger than the space of actions; that just seems to have worked out that way for many problems. So, for example, for driving a car the state space is 6-dimensional, so x, y, θ, ẋ, ẏ, θ̇, whereas you still have only two actions, forward-and-backward motion and steering the car. So you have 6-D states and 2-D actions, and so you can discretize the actions much more easily than you can discretize the states.
The other example, for the helicopter, you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action: whether, and how hard, you accelerate your cart to the left or the right is a one-dimensional action. And so for the rest of today, I'm going to assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems, but for many problems, large action spaces are just less of a difficulty than large state spaces.

So I'm going to assume that we have a model, or simulator, of the MDP, and this is really an assumption on how the state transition probabilities are represented. I'm going to use the terms "model" and "simulator" pretty much synonymously. Specifically, what I'm going to assume is that we have a black box, a piece of code, such that I can input any state, input an action, and it will output s', sampled from the state transition distribution P_sa. So this is really my assumption on the representation I have for the state transition probabilities: I'll assume I have a black box that takes as input a state and an action and outputs the next state.

And there are a few fairly common ways to get models of the different MDPs you may work with. One is that you might get a model from a physics simulator. So, for example, if you're interested in controlling that inverted pendulum, your action is a, which is the magnitude of the force you exert on the cart to the left or right, and your state is x, ẋ, θ, θ̇; I'm just going to write it in that order.
And so I'm going to write down a bunch of equations just for completeness, but most of what I'm about to write is a bit gratuitous. If you flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this, so you find that ṡ, where the dot denotes the derivative of the state with respect to time, is given by some expressions in terms of x, ẋ, θ, θ̇ and the action, where a is the force you exert on the cart, l is the length of the pole, m is the total mass of the system, and so on. All these equations, I'm just writing them down for completeness, but by flipping open a physics textbook you can work out these equations of motion yourself, and this then gives you a model which says that s_{t+1}, your state one time step later, will be equal to your previous state s_t plus Δt times ṡ_t. So your simulator, or your model, tells you what happens to the cart every, say, tenth of a second, so Δt would be 0.1 seconds. And so that would be one way to come up with a model of your MDP.

And in this specific example, I've actually written down a deterministic model, and by deterministic I mean that, given an action and a state, the next state is not random. So this would be an example of a deterministic model, where I can compute the next state exactly as a function of the previous state and the previous action, or, put differently, it's a deterministic model because all the probability mass is on a single next state given the previous state and action. You can also make this a stochastic model.
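(Here is a minimal sketch of such a physics-based simulator for the cart-pole, using one common textbook form of the frictionless cart-pole dynamics and an Euler integration step. The specific equations, constants and time step are illustrative assumptions, not what was written on the board.)

```python
import numpy as np

# Assumed constants: gravity, cart mass, pole mass, pole half-length, time step.
g, m_cart, m_pole, l, dt = 9.8, 1.0, 0.1, 0.5, 0.1

def simulate(s, a):
    """Deterministic model: return s_{t+1} = s_t + dt * s_dot(s_t, a_t).

    State s = (x, x_dot, theta, theta_dot); action a is the force on the cart.
    """
    x, x_dot, theta, theta_dot = s
    total_m = m_cart + m_pole
    tmp = (a + m_pole * l * theta_dot**2 * np.sin(theta)) / total_m
    theta_acc = (g * np.sin(theta) - np.cos(theta) * tmp) / (
        l * (4.0 / 3.0 - m_pole * np.cos(theta)**2 / total_m))
    x_acc = tmp - m_pole * l * theta_acc * np.cos(theta) / total_m
    s_dot = np.array([x_dot, x_acc, theta_dot, theta_acc])
    return np.asarray(s) + dt * s_dot            # Euler step: s_{t+1} = s_t + dt * s_dot

s_next = simulate(np.zeros(4), a=10.0)           # push the cart to the right for one step
```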
A second way that is often used to obtain a model is to learn one. And so, concretely, what you would do is imagine that you have a physical inverted pendulum system, that you physically own an inverted pendulum robot. You would then initialize your inverted pendulum robot to some state and execute some policy; it could be some random policy, or some policy that you think is pretty good, or you could even try controlling it yourself with a joystick or something. So you set the system off in some state s_0, then you take some action a_0, where again the action could be chosen by some policy or chosen by you using a joystick, trying to control your inverted pendulum or whatever. The system transitions to some new state s_1, and then you take some new action a_1, and so on; let's say you do this for T time steps. Sometimes I call this one trajectory, and you repeat this m times, so this is the first trial, the first trajectory, and then you do it again, initializing it in some state, and so on.

So you do this a bunch of times, and then you run a learning algorithm to estimate s_{t+1} as a function of s_t and a_t. And, for the sake of concreteness, think of this as the inverted pendulum problem, so s_{t+1} is a 4-dimensional vector, s_t will be a 4-dimensional vector, and a_t will be a real number, and so you might run linear regression 4 times to predict each of the state variables as a function of those 5 real numbers. Just for example, if you say that you want to estimate your next state s_{t+1} as a linear function of your previous state and your action, s_{t+1} = A s_t + B a_t, where A here will be a 4-by-4 matrix and B will be a 4-dimensional vector, then you would choose the values of A and B that minimize the sum, over your trials and over the time steps, of the squared errors ||s_{t+1} - (A s_t + B a_t)||². So if you want your model to be that s_{t+1} is some linear function of the previous state and action, then you would pose this optimization objective and choose A and B to minimize the sum of squares error in your predicted value for s_{t+1} as a linear function of s_t and a_t.
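(A minimal sketch of that fitting step: stack all observed transitions from the m trajectories and solve the least-squares problem for A and B. The function name, the data layout and the single lstsq solve are my own choices; the lecture only specifies the sum-of-squared-errors objective. Solving for all four output dimensions at once is the same as running linear regression 4 times.)

```python
import numpy as np

def fit_linear_model(trajectories):
    """Fit s_{t+1} ~= A s_t + B a_t by least squares over all observed transitions.

    `trajectories` is a list of (states, actions) pairs, where `states` has shape
    (T+1, 4) and `actions` has shape (T,) for the inverted pendulum.
    """
    X, Y = [], []
    for states, actions in trajectories:
        for t in range(len(actions)):
            X.append(np.append(states[t], actions[t]))   # input [s_t, a_t], a 5-vector
            Y.append(states[t + 1])                       # target s_{t+1}, a 4-vector
    X, Y = np.array(X), np.array(Y)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)             # W is 5 x 4
    A, B = W[:4].T, W[4]                                  # A: 4x4 matrix, B: 4-vector
    return A, B
```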
I should say that this is one specific example, where you're using a linear function of the previous state and action to predict the next state. Of course, you can also use other algorithms, like locally weighted linear regression, or linear regression with nonlinear features, or kernel linear regression or whatever, to predict the next state as a nonlinear function of the current state, so this isn't limited to linear models. And it turns out that locally weighted linear regression turns out to be an effective method for this learning problem for many robots as well.

And so, having learned the model, having learned the parameters A and B, you then have a model where you say that s_{t+1} = A s_t + B a_t, and that would be an example of a deterministic model. Or, having learned the parameters A and B, you might say that s_{t+1} = A s_t + B a_t + ε_t, where ε_t is a noise term, and so these would be very reasonable ways to come up with either a deterministic or a stochastic model for your inverted pendulum MDP.

And so, just to summarize, what we have now is a model, meaning a piece of code, where you can input a state and an action and get out s_{t+1}. And if you have a stochastic model, then to run this model, you would actually sample ε_t from its distribution in order to generate s_{t+1}.

So, as a preview of the next lecture, it actually turns out that in the specific case of linear dynamical systems, in the specific case where the next state is a linear function of the current state and action, there are very powerful algorithms you can use. I'm not going to talk about that today; I'll talk about that in the next lecture rather than right now. But it turns out that for many problems like the inverted pendulum, if you use locally weighted linear regression or a similar learning algorithm, because many systems aren't really linear, you can build a nonlinear model.
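(Putting the learned parameters together, a sketch of what that piece of code might look like; the choice of a zero-mean Gaussian for the ε_t distribution is an assumption for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a, A, B, noise_cov=None):
    """Learned model: deterministic if noise_cov is None, stochastic otherwise."""
    s_next = A @ s + B * a                        # s_{t+1} = A s_t + B a_t
    if noise_cov is not None:                     # stochastic model: add eps_t
        s_next = s_next + rng.multivariate_normal(np.zeros(len(s)), noise_cov)
    return s_next
```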
So what I want to do now is talk about, given a model, given a simulator for your MDP, how to come up with an algorithm to approximate the optimal value function. Before I move on, let me check if there are questions on this. Okay, cool.

So here's the idea. Back when we talked about linear regression, we said that, given some input x in supervised learning, given the input features x, we may choose some features of x and then approximate the target variable as a linear function of those features of x. We'll do exactly the same thing here to approximate the optimal value function, and in particular, we'll choose some features φ(s) of a state s. And so you could actually choose φ(s) = s; that would be one reasonable choice if you want to approximate the value function as a linear function of the state. But you can also choose other things, so, for example, for the inverted pendulum example, you may choose φ(s) to be a vector of features such as 1, x, ẋ², maybe some cross terms like x times θ, maybe θ̇², and so on. So you choose some vector of features, and then you approximate the value function, the value of the state, as V(s) = θᵀφ(s), theta transpose times the features.

And I should apologize in advance; I'm overloading notation here, which is unfortunate. I'm using θ both to denote the angle of the pole of the inverted pendulum, so that's the angle θ, and also to denote the vector of parameters in my learning algorithm. So sorry about the overloaded notation.

Just like we did in linear regression, my goal is to come up with a linear combination of the features that gives me a good approximation to the value function, and this is completely analogous to linear regression, where my estimate of the response y is a linear function of the features of the input. That's what we have in linear regression.
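(A minimal sketch of this representation; the particular features chosen here, a constant, the raw state, and a couple of quadratic and cross terms, are just an example of the kind of choice being described, not a prescribed set.)

```python
import numpy as np

def phi(s):
    """Illustrative feature vector for the inverted pendulum state s = (x, x_dot, theta, theta_dot)."""
    x, x_dot, theta, theta_dot = s
    return np.array([1.0, x, x_dot, theta, theta_dot,
                     x * theta, theta**2, theta_dot**2])

theta_params = np.zeros(8)            # parameter vector (distinct from the pole angle theta)

def V(s, theta_params):
    """Approximate value function: V(s) = theta^T phi(s)."""
    return theta_params @ phi(s)
```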
Let me just write down value iteration again, and then I'll write down an approximation to value iteration. So, for discrete states, this is the idea behind value iteration, and we said that V(s) would be updated as V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s'). That was value iteration, and in the case of continuous states, this sum would be replaced by an integral over states, so let me just write this as V(s) := R(s) + γ max_a E_{s' ~ P_sa}[V(s')]. That sum over s' is really an expectation with respect to a random state s' drawn from the state transition probabilities P_sa: it's a sum over all states s' of the probability of going to s' times the value of s', so it's really an expectation over the random state s' drawn from P_sa.

And so what I'll do now is write down an algorithm called fitted value iteration that is an approximation to this, but specifically for continuous states. I'll just write down the first two steps, and then I'll continue on the next board. So the first step of the algorithm is: we'll sample, we'll choose, some set of states at random. So sample s^(1), s^(2) through s^(m) randomly, so choose a set of m states randomly, and initialize my parameter vector θ to be equal to zero. This is analogous to value iteration, where I might initialize the value function to be the function of all zeros. Then here's the main loop of the algorithm.

Got quite a lot to write, actually. Let's see. And so that's the algorithm. Let me just adjust the writing; give me a minute to finish, and then I'll step through this. Actually, if some of my handwriting is illegible, let me know. So let me step through this and briefly explain the rationale. So the heart of the algorithm is, let's see.
In the original value iteration algorithm, we would take the value for each state, V(s^(i)), and we would overwrite it with this expression here. In the original discrete value iteration algorithm, we would take V(s^(i)) and we would set V(s^(i)) to be equal to that. Now, in the continuous-state case, we have an infinite, continuous set of states, and so you can't discretely set the value of each of these to that. So what we'll do instead is choose the parameters θ so that V(s^(i)) is as close as possible to this thing on the right-hand side instead. And this is what y^(i) turns out to be.

So concretely, what I'm going to do is construct estimates of this term, and then I'm going to choose the parameters of my function approximator. I'm going to choose my parameters θ so that V(s^(i)) is as close as possible to these. That's what y^(i) is, and specifically, what I'm going to do is choose the parameters θ to minimize the sum of squared differences between y^(i) and θᵀφ(s^(i)). This thing here is just V(s^(i)), because I'm approximating V(s^(i)) as a linear function of φ(s^(i)), and so we choose the parameters θ to minimize the sum of squared differences. So this last step is basically the approximation version of value iteration. What everything else above was doing was just coming up with an approximation to this term, to this thing here, which I was calling y^(i).

And so concretely, for every state s^(i) we want to estimate what the thing on the right-hand side is, but there's an expectation here. There's an expectation over a continuous set of states, maybe a very high-dimensional state, so I can't compute this expectation exactly. What I'll do instead is use my simulator to sample a set of states from this distribution, from this P_{s^(i)a}, from the state transition distribution of where I get to if I take the action a in the state s^(i), and then I'll average over that sample of states to compute this expectation.
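Putting that together, the target for each sampled state and the fitting step over θ look like this:

```latex
% Target value for each sampled state s^{(i)}
y^{(i)} = R(s^{(i)}) + \gamma \max_{a} \, \mathbb{E}_{s' \sim P_{s^{(i)}a}}\!\left[ V(s') \right]

% Fit the parameters by least squares, since V(s^{(i)}) \approx \theta^{T}\phi(s^{(i)})
\theta := \arg\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{m} \left( \theta^{T}\phi(s^{(i)}) - y^{(i)} \right)^{2}
```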
And so, stepping through the algorithm, it just says that for each state and for each action, I'm going to sample a set of states, s'_1 through s'_k, from that state transition distribution P_{s^(i)a}, still using the model, and then I'll set q(a) to be equal to that average, and so this is my estimate for R(s^(i)) + γ times that expected value of V(s') for that specific action a. Then I'll take the maximum over actions a, and this gives me y^(i), so y^(i) is my estimate for that state. And finally, I'll run linear regression, which is that last step of the algorithm, to get θᵀφ(s^(i)), that is V(s^(i)), to be close to the y^(i)'s. And so this algorithm is called fitted value iteration, and it actually often works quite well for continuous problems, for problems with anywhere from 6- to 10- to 20-dimensional state spaces, if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Some of you didn't have your hands up. Are there questions? Yeah?

Student: Is there a reward function in this setup?

Instructor (Andrew Ng): Oh, yes. An MDP comprises S, A, the state transition probabilities, γ and R, and so for continuous state spaces, S would be something like R^4 for the inverted pendulum. The actions we discretize, the state transition probabilities we specify with the model or the simulator, γ is just a real number like 0.99, and the reward function is usually a function that's given to you. And so the reward function is some function of your 4-dimensional state. For example, you might choose a reward function to be, I don't know, just as an example of a simple reward function, -1 if the pole has fallen over. So you choose your reward function to be -1 if the inverted pendulum falls over, which you might define as its angle being greater than 30 degrees or something, and zero otherwise. So that would be an example of a reward function that you can choose for the inverted pendulum. But yes, we're assuming a reward function is given to you, so that you can compute R(s^(i)) for any state.
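To make the whole loop concrete, here is a minimal sketch of fitted value iteration, assuming you have a stochastic simulator `simulate(s, a)` that returns a sampled next state, a reward function `R(s)`, a feature map `phi(s)`, and a finite set of discretized actions; all of these names are placeholders for whatever your problem provides, not anything standard.

```python
import numpy as np

def fitted_value_iteration(states, actions, simulate, R, phi,
                           gamma=0.99, k=10, n_iters=50):
    """Fitted value iteration sketch for a continuous-state MDP.

    states  : list of m sampled states s^(1), ..., s^(m)
    actions : finite (discretized) set of actions
    simulate: stochastic simulator, simulate(s, a) -> sampled next state s'
    R       : reward function, R(s) -> float
    phi     : feature map, phi(s) -> feature vector
    """
    n_features = len(phi(states[0]))
    theta = np.zeros(n_features)        # initialize parameters to zero

    def V(s):
        # current approximation V(s) = theta^T phi(s); reads the latest theta
        return theta @ phi(s)

    for _ in range(n_iters):
        ys = []
        for s in states:
            q_values = []
            for a in actions:
                # sample k next states to approximate E_{s' ~ P_sa}[V(s')]
                # (with a deterministic simulator, k = 1 suffices)
                next_states = [simulate(s, a) for _ in range(k)]
                q_values.append(R(s) + gamma * np.mean([V(sp) for sp in next_states]))
            ys.append(max(q_values))    # y^(i) = max_a q(a)

        # linear regression step: fit theta so that theta^T phi(s^(i)) ~ y^(i)
        Phi = np.vstack([phi(s) for s in states])
        theta, *_ = np.linalg.lstsq(Phi, np.array(ys), rcond=None)

    return theta
```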
Are there other questions? Actually, let me try asking a question. So, everything I did here assumed that we have a stochastic simulator. It turns out I can simplify this algorithm if I have a deterministic simulator, where a deterministic simulator means that given a state and an action, my next state is always exactly determined. So let me ask you: if I have a deterministic simulator, how would I change this algorithm? How would I simplify this algorithm?

Student: Lower the number of samples that you're drawing [inaudible].

Instructor (Andrew Ng): Right, so Justin's got it right. If I have a deterministic simulator, all my samples from it would be exactly the same, and so if I have a deterministic simulator, I can set k to be equal to 1; I don't need to draw k different samples. I really only need one sample if I have a deterministic simulator, so you can simplify this by setting k = 1 if you have a deterministic simulator. Yeah?

Student: I guess I'm really confused about the, yeah, we sort of turned this [inaudible] into something that looks like linear regression, or, you know, the theta transpose times something that we're used to, but I guess I'm a little, I don't really know what question to ask, but when we did this before we had discrete states and everything. We were concerned with finding this optimal policy, and it doesn't seem like we've said the word policy in a while, so it's kind of difficult.

Instructor (Andrew Ng): Okay, yeah, so this does all get back to the policy, but maybe I should just say a couple of words, so let me actually try to get at some of what you're saying. Our strategy for finding an optimal policy has been to first find some way to find V*, and then to get some approximation of π*. So far everything I've been doing has been focused on how to find V*. I just want to say one more word on features. It actually turns out that for linear regression it's often very easy, it's often not terribly difficult, to choose a reasonable set of features.
Choosing features for approximating the value function is often somewhat harder, because the value of a state is how good it is to start off in that state: what is my expected sum of discounted rewards if I start off in a certain state? And so what the features of the state have to measure is really how good it is to start in a certain state. And so for the inverted pendulum, you actually have that states where the pole is vertical and where the cart is centered on your track, say, are maybe better, and so you can come up with features that measure the orientation of the pole and how close you are to the center of the track and so on, and those will be reasonable features to use to approximate V*. Although in general it is true that choosing features for value function approximation is often slightly trickier than choosing good features for linear regression.

Okay, and then there's Justin's question of: given V*, how do you go back to actually find a policy? In the discrete case, we have that π*(s) is equal to the arg max over a of the expectation of V*(s'), where s' is drawn from P_sa. So again, I used to write this as a sum over states; I'll just write it as an expectation. And so then once you find the optimal value function V*, you can then find the optimal policy π* by computing this arg max. Now, if you're in a continuous-state MDP, then you can't actually do this in advance for every single state, because there's an infinite number of states, and so you can't actually perform this computation in advance for every single state. What you do instead is, whenever your robot is in some specific state s, it's only when your system is in some specific state s, like your car is at some specific position and orientation, or your inverted pendulum is in some specific position, posed at some specific angle θ.
It's only when your system, be it a factory or a board game or a robot, is in some specific state s that you would then go ahead and compute this arg max. So it's only when you're in some state s that you compute this arg max, and then you execute that action a, and then as a result of your action, your robot transitions to some new state, and then it's given that specific new state that you compute the arg max again, using that specific state s that you're in.

There are a few ways to do this. One way to do it is actually the same as in the inner loop of the fitted value iteration algorithm: because there's an expectation over a large number of states, you'd sample a set of states from the simulator and then approximate this expectation using an average over your samples, so it's exactly like the inner loop of the fitted value iteration algorithm. So you could do that; that's sometimes done. Sometimes it can also be a pain to have to sample a set of states to approximate those expectations every time you want to take an action in your MDP.

There are a couple of special cases where this can be done more simply. One special case is if you have a deterministic simulator. If it's a deterministic simulator, so in other words, if your simulator is just some function f, which could be a linear or a nonlinear function, then the next state, s_{t+1}, is just some function of your previous state and action. If that's the case, then this expectation simplifies, and the policy becomes the arg max over a of V*(f(s, a)), because this is really saying s' = f(s, a). I've switched back and forth between notations, and I hope that's okay: s to denote the current state and s' to denote the next state, versus s_t and s_{t+1} for the current and next state. Both of these are sort of standard notation, so don't mind my switching back and forth between them.
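Written out, the greedy policy with respect to V*, and its simplification when the simulator is deterministic with s' = f(s, a), are:

```latex
% Greedy policy with respect to the (approximate) optimal value function
\pi^{*}(s) = \arg\max_{a} \, \mathbb{E}_{s' \sim P_{sa}}\!\left[ V^{*}(s') \right]

% Deterministic simulator, s' = f(s, a): the expectation drops away
\pi^{*}(s) = \arg\max_{a} \, V^{*}\!\left( f(s, a) \right)
```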
But if it's a deterministic simulator, you can then just compute what the next state s' would be for each action you might take from the current state, and then take the arg max over actions a; basically, choose the action that gets you to the highest-value state. And so that's one case where you can compute the arg max, where you can compute that expectation without needing to sample and average over some set of samples.

Another very common case, it turns out, is if you have a stochastic simulator, but your simulator happens to take on a very specific form: s_{t+1} = f(s_t, a_t) + ε_t, where ε_t is Gaussian noise. This is a very common way to build simulators, where you model the next state as a function of the current state and action plus some noise, and so one specific example would be that sort of linear dynamical system that we talked about, where the next state is a linear function of the current state and action plus Gaussian noise. In this case, you can approximate the arg max over a as follows: you take that expectation that you're trying to approximate, the expected value of V(s'), and you approximate it by V of the expected value of s'. This is an approximation; the expected value of a function is usually not equal to the function of the expectation, but it is often a reasonable approximation, and so that would be another way to approximate that expectation. And so you choose the actions according to the same formula as I wrote just now. And so this would be a way of approximating this arg max, essentially ignoring the noise in the simulator. And this often works pretty well as well, just because many simulators turn out to be of the form of some linear or some nonlinear function plus zero-mean Gaussian noise, so you can just ignore the zero-mean Gaussian noise, so that you can compute this arg max quickly.
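Here is a minimal sketch of these three ways of choosing an action greedily with respect to a learned value approximation; `simulate`, `f`, `f_mean`, `V`, and the discretized `actions` are placeholders under the same assumptions as the fitted value iteration sketch above.

```python
import numpy as np

def greedy_action_sampled(s, actions, simulate, V, k=100):
    """General stochastic simulator: approximate E_{s' ~ P_sa}[V(s')] by
    averaging V over k sampled next states, then take the arg max over actions."""
    def expected_value(a):
        return np.mean([V(simulate(s, a)) for _ in range(k)])
    return max(actions, key=expected_value)

def greedy_action_deterministic(s, actions, f, V):
    """Deterministic simulator s' = f(s, a): no sampling needed; just pick
    the action whose successor state has the highest approximate value."""
    return max(actions, key=lambda a: V(f(s, a)))

def greedy_action_mean_dynamics(s, actions, f_mean, V):
    """Simulator of the form s' = f(s, a) + zero-mean Gaussian noise:
    approximate E[V(s')] by V(E[s']) = V(f(s, a)), i.e. ignore the noise."""
    return max(actions, key=lambda a: V(f_mean(s, a)))

# Example usage, with theta learned by fitted value iteration and feature map phi:
#   V = lambda s: theta @ phi(s)
#   a = greedy_action_deterministic(current_state, actions, f, V)
```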
And just to complete this: what that V*(f(s, a)) is, you'd write it as θᵀφ(s'), where s' = f(s, a). So this V* you would compute using the parameters θ that you just learned with the fitted value iteration algorithm. Questions about this?

Student: [Inaudible] case, for real-time applications, is it possible to use that [inaudible], for example for [inaudible]?

Instructor (Andrew Ng): Yes, for real-time applications, is it possible to sample k states and use them to approximate the expectation? Computers today are actually amazingly fast. I'm actually often surprised by how much you can do in real time. So the helicopter: we're actually flying the helicopter using an algorithm different than this, I should say, but my intuition is that you could actually do this with a helicopter. A helicopter controller runs at somewhere between 10 Hz and 50 Hz, so you'd need to do this 10 times a second to 50 times a second, and that's actually plenty of time to sample 1,000 states and compute this expectation. Helicopters are difficult because helicopters are mission critical, and if you do something wrong, you can do serious damage very fast, and so, maybe not for good reasons, we've actually tended to avoid tossing coins when we're in the air. So the idea of letting our actions come out of some random process is slightly scary, and we just tend not to do that. I should say that's probably not a great reason, because you're averaging over a large number of things here and it would very likely work fine, but just as a maybe overly conservative design choice, we tend not to run anything randomized on the helicopter, which is probably being overly conservative. It's the choice we made because other things are slightly safer. But I think you can actually often do this, so long as the model can be evaluated fast enough that you can sample 100 state transitions or 1,000 state transitions and do that at 10 Hz. That said, that isn't always attainable, which is why we often use the other approximations that don't require drawing a large sample. Anything else? No, okay, cool.
So now you know one algorithm for reinforcement learning on continuous state spaces. We'll then pick up with some more ideas on some even more powerful algorithms for solving MDPs with continuous state spaces. Thanks, let's close for today.