
Title:
1010 Passive Temporal Difference Learning

Description:
Unit 10 10 Passive Temporal Difference Learning.mp4

Let's start by looking at passive reinforcement learning.

I'm going to describe an algorithm called

Temporal Difference Learning, or TD.

And that sounds like a fancy name,

but all it really means is we're going to move

from one state to the next;

and we're going to look at the difference between the 2 states,

and learn that, and then kind of back up

the values, from one state to the next.

So if we're going to follow a fixed policy, pi,

and let's say our policy tells us to go this way, and then go this way.

We'll eventually learn that we get a plus 1 reward there

and we'll start feeding back that plus 1, saying:

if it was good to get a plus 1 here,

it must be somewhat good to be in this state,

somewhat good to be in this state, and so on, back to the start state.

So, in order to run this algorithm,

we're going to try to build up a table of utilities for each state

and along the way, we're going to keep track of

the number of times that we visited each state.

Now, the table of utilities we're going to start blank;

we're not going to start them at zero or anything else;

rather, they're just going to be undefined.

And the table of numbers, we're going to start at zero,

saying we visited each state a total of zero times.

What we're going to do is run the policy,

have a trial that goes through the states;

when it gets to a terminal state,

we start it over again at the start and run it again;

and we keep track of how many times we visited each state,

we update the utilities, and we get a better

and better estimate for the utility.

And this is what the inner loop of the algorithm looks like

and let's see if we can trace it out.
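The formula on the slide isn't reproduced in this transcript, so here is a minimal Python sketch of that inner loop, under the assumptions stated in the lecture (utilities start undefined, visit counts start at zero, Alpha is 1 over N plus 1, Gamma is 1); all function and variable names here are illustrative, not from the lecture:

```python
def td_update(U, N, s, s_next, r, r_next, gamma=1.0):
    """One pass of the passive TD inner loop (illustrative sketch).

    U: utility table, state -> estimate (a missing key means 'undefined')
    N: visit counts, state -> int
    s, r: the previous state and its reward (s is None at the start of a trial)
    s_next, r_next: the state we just arrived in, and its reward
    """
    if s_next not in U:
        # New state: initialize its utility to its reward.
        U[s_next] = r_next
    if s is not None:
        N[s] = N.get(s, 0) + 1
        alpha = 1.0 / (N[s] + 1)  # big steps when new, small steps when well-visited
        # Temporal-difference update: nudge U[s] toward r + gamma * U[s_next].
        U[s] = U[s] + alpha * (r + gamma * U[s_next] - U[s])


# Two trials along the same 5-state path, ending in the +1 terminal state.
U, N = {}, {}
path = [('start', 0.0), ('a', 0.0), ('b', 0.0), ('c', 0.0), ('goal', 1.0)]
for _ in range(2):
    prev_s, prev_r = None, 0.0
    for s, r in path:
        td_update(U, N, prev_s, s, prev_r, r)
        prev_s, prev_r = s, r
print(U['c'], U['b'])  # approximately 2/3 and 1/6, matching the trace below
```

Running two trials along the same path reproduces the numbers traced below: the state next to the +1 square goes from 0 to ½ and then to 2/3, and the state before it goes from 0 to 1/6.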

So we'll start at a start state,

we'll apply the policy, and let's say the policy tells us to move in this direction.

Then we get a reward here,

which is zero;

and then we look at it with the algorithm,

and the algorithm tells us if the state

is new. Yes, it is; we've never been there before;

then set the utility of that state to the new reward, which is zero.

Okay, so now we have a zero here;

and then let's say, the next step, we move up here.

So, again, we have a zero;

and let's say our policy looks like a good one,

so we get here, we have a zero.

We get here, we have a zero.

We get here. Now, this state,

we get a reward of 1, so that state gets a utility of 1.

And all along the way, we have to think about

how we're backing up these values, as well.

So when we get here, we have to look at this formula to say:

How are we going to update the utility of the prior state?

And the difference between this state and this state is zero;

so this difference, here, is going to be zero;

the reward is zero, and so there's going to be no update to this state.

But now, finally, for the first time, we're going to have an actual update.

So we're going to update this state to be plus 1,

and now we're going to think about changing this state.

And what was its old utility? Well, it was zero.

And then there's a factor called Alpha,

which is the learning rate

that tells us how much we want to move this utility

towards something that's maybe a better estimate.

And the learning rate should be such that,

if we are brand new,

we want to move a big step;

and if we've seen this state a lot of times,

we're pretty confident of our number

and we want to make a small step.

So let's say that the Alpha function is 1 over N plus 1.

Well, we'd better not make it just 1 over N, because when N is zero, we'd be dividing by zero.

So here, with N equal to 1, 1 over N plus 1 would be ½;

and then the reward in this state was zero;

plus, we had a Gamma

and let's just say that Gamma is 1,

so there's no discounting; and then

we look at the difference between the utility

of the resulting state, which is 1,

minus the utility of this state, which was zero.

So we get ½ times (1 minus zero), which is ½.

So we update this;

and we change this zero to ½.
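As a quick check on that first update, plugging the numbers from the walkthrough into the update rule (variable names here are mine, not from the lecture):

```python
# First-trial update of the state next to the +1 square.
old_u = 0.0              # utility of the state being updated, before the update
n = 1                    # we've visited that state once
alpha = 1.0 / (n + 1)    # learning rate: 1/2
reward = 0.0             # reward in the state being updated
gamma = 1.0              # no discounting
next_u = 1.0             # utility of the terminal +1 state
new_u = old_u + alpha * (reward + gamma * next_u - old_u)
print(new_u)  # 0.5
```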

Now let's say we start all over again

and let's say our policy is right on track;

and nothing unusual, stochastically, has happened.

So we follow the same path,

we don't update, because they're all zeros all along this path.

We go here, here, here;

and now it's time for an update.

So now, we've transitioned from a zero to ½

so how are we going to update this state?

Well, the old utility was zero,

and now we have 1 over N plus 1,

which is 1/3.

So we're getting a little bit more confident, because we've been there

twice, rather than just once.

The reward in this state was zero,

and then we have to look at the difference between these 2 states.

That's where we get the name, Temporal Difference;

and so, we have ½ minus zero

and so that's 1/3 times ½

so that's 1/6.

Now we update this state.

It was zero; now it becomes 1/6.

And you can see how the result

of the positive 1 starts to propagate

backwards, but it propagates slowly.

We have to have 1 trial at a time

to get that to propagate backwards.

Now, how about the update from this state to this state?

Now, we were ½ here, so our old utility was ½;

plus Alpha, the learning rate, which is 1/3.

The reward in the old state was zero;

plus the difference between these two,

which is 1 minus ½.

So that's ½ plus 1/3 times ½, which is ½ plus 1/6, or 2/3.

And now the second time through,

we've updated the utility of this state from ½ to 2/3.
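Both second-trial updates can be checked the same way, with the smaller learning rate (again, the variable names are mine):

```python
# Second trial: alpha shrinks to 1/(2 + 1) = 1/3 for states visited twice.
alpha = 1.0 / 3.0

# State two steps from the goal: old utility 0, successor's utility now 1/2.
u_mid = 0.0 + alpha * (0.0 + 0.5 - 0.0)

# State next to the goal: old utility 1/2, successor (terminal) utility 1.
u_last = 0.5 + alpha * (0.0 + 1.0 - 0.5)

print(u_mid, u_last)  # approximately 1/6 and 2/3
```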

And we keep on going, and you can see the results of the positive propagating backwards.

And if we did more examples through here,

you would see the results of the negative propagating backwards.

And eventually, it converges to the correct utilities for this policy.