
Title:
0916 MDP and Costs

Description:

[Narrator] So even for the simple grid world,

the optimal control policy assuming stochastic actions

and no costs of moving, except for the final absorbing costs,

is somewhat nontrivial.

Take a second to look at this.

Along here it seems pretty obvious, but

for the state over here, B3, and for the state over here, C4,

we choose an action that just avoids falling into the minus 100,

which is more important than trying to make progress towards the plus 100.

Now obviously this is not the general case of an MDP,

and it's somewhat unintuitive that we'd be willing to run into the wall

just to avoid falling into the minus 100,

and the reason why this seems unintuitive is

because we're really forgetting the issue of costs.

In normal life, there is a cost associated with moving.

MDPs are general enough to have a cost factor,

and the way we're going to denote costs

is by defining our reward function over any possible state.

Reaching the state A4 gives us plus 100,

minus 100 for B4,

and perhaps minus 3 for every other state,

which reflects the fact that if we take a step somewhere,

we pay minus 3.

So this gives an incentive to shorten the final action sequence.
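This reward function can be sketched in a few lines; the state labels A4 and B4 are the grid coordinates from the example, and the minus 3 per-step cost is the one just mentioned.

```python
# A minimal sketch of the reward function described above.
# The state labels come from the grid-world example.
def reward(state):
    if state == "A4":
        return 100    # absorbing goal state
    if state == "B4":
        return -100   # absorbing penalty state
    return -3         # cost of taking a step anywhere else
```

Because every non-absorbing state pays minus 3, longer paths accumulate more cost, which is exactly the incentive to finish quickly.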

So we're now ready to state the actual objective

of an MDP, which is to maximize not

just the immediate reward, but the sum

of all future rewards.

We're going to write R_t to denote the fact that

this reward is received at time t, and because

the reward itself is stochastic,

we have to compute the expectation over those rewards,

and that expectation is what we seek to maximize.

So we seek to find the policy that maximizes the expression over here.

Now another interesting caveat is that sometimes people put

a so-called discount factor gamma into this equation,

with an exponent of t, where the discount factor might be 0.9,

and what this does is decay future rewards

relative to more immediate rewards, and it's

kind of an alternative way to specify costs.

So we can make costs explicit by a negative reward per state,

or we can bring in a discount factor

that discounts the plus 100 by the

number of steps taken before reaching the plus 100.

This also gives an incentive to get to the goal as fast as possible.

The nice mathematical thing about a discount factor is that

it keeps this expectation bounded.

It's easy to show that this expression over here

will always be smaller than or equal to 1 over 1 minus gamma times the

maximum absolute reward,

which in this case would be plus 100.