
I'd like to show you some value functions after convergence

and the corresponding policies.

If we assume gamma = 1 and our cost for the nonabsorbing states

equals 3, as before, we get the following approximate value function

after convergence, and the corresponding policy looks as follows.

Up here we go right until we hit the absorbing state.

Over here we prefer to go north.

Here we go left, and here we go north again.

I left the policy open for the absorbing states

because there's no action to be chosen here.

This is a situation where

the risk of falling into the -100 is balanced by

the time spent going around.

We have an action over here, in this state here,

that risks a 10% chance of falling into the -100.

But that's preferable under the cost model of 3

to the action of going south.
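The grid world itself isn't fully spelled out in the transcript, so here is a minimal value-iteration sketch under assumed details: a 3x4 grid with the +100 and -100 absorbing states in the rightmost column, one wall, 80/10/10 action noise (80% intended direction, 10% to each side), gamma = 1, and a per-step cost of 3. The layout, noise model, and all names here are my assumptions, not taken from the lecture.

```python
GAMMA = 1.0
COST = 3.0          # per-step cost charged in nonabsorbing states (assumed)
ROWS, COLS = 3, 4
WALL = {(1, 1)}                               # assumed obstacle
TERMINALS = {(0, 3): 100.0, (1, 3): -100.0}   # assumed absorbing states
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
LEFT_OF  = {'N': 'W', 'S': 'E', 'E': 'N', 'W': 'S'}
RIGHT_OF = {'N': 'E', 'S': 'W', 'E': 'S', 'W': 'N'}

def move(state, action):
    # Bumping into a wall or the grid edge leaves the agent in place.
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) in WALL:
        return state
    return (nr, nc)

def q_value(V, state, action):
    # 80% intended direction, 10% for each perpendicular slip.
    outcomes = [(0.8, move(state, action)),
                (0.1, move(state, LEFT_OF[action])),
                (0.1, move(state, RIGHT_OF[action]))]
    return sum(p * V[s] for p, s in outcomes)

def value_iteration(cost, eps=1e-6):
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) not in WALL]
    V = {s: TERMINALS.get(s, 0.0) for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                continue  # absorbing states keep their fixed values
            best = max(q_value(V, s, a) for a in ACTIONS)
            new_v = -cost + GAMMA * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V

V = value_iteration(COST)
```

With this setup, the values come out positive but below +100, and reading off the argmax action in each state recovers a policy like the one described: head toward the +100 while accepting some slip risk near the -100.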

Now, this all changes if we assume a cost of 0

for all the states over here, in which case,

the value function after convergence looks interesting.

And with some thought, you realize it's exactly the right one.

Each value is exactly 100,

and the reason is with a cost of 0,

it doesn't matter how long we move around.

Eventually we can guarantee in this case that we reach the +100,

and therefore each value will become +100 after the backups.
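The zero-cost effect is easy to see in a stripped-down example. Below is an assumed toy setup, not the lecture's grid: a deterministic 1D corridor of five states with an absorbing +100 at the right end. With cost 0 and gamma = 1, each backup simply copies the neighbor's value, so repeated sweeps drive every state to +100.

```python
N = 5
V = [0.0] * (N - 1) + [100.0]   # last state is the absorbing +100

for _ in range(N):               # a few backup sweeps suffice here
    for s in range(N - 1):
        # Cost 0, gamma 1: the backup is just the successor's value.
        V[s] = 0.0 + 1.0 * V[s + 1]

print(V)  # → [100.0, 100.0, 100.0, 100.0, 100.0]
```

Waiting is free, so the number of steps to the goal never discounts or penalizes the value, and +100 propagates all the way back.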

The corresponding policy is the one we discussed before.

And the crucial thing here is that for this state,

we go south, willing to take the extra time.

For this state over here, we go west,

willing to take the extra time so as to avoid

falling into the -100.

And all the other states resolve

exactly as you would expect them to resolve

as shown over here.

If we set the cost to 200,

so each step itself is even more expensive

than falling into this ditch over here,

we get a value function that's strongly negative everywhere

with this being the most negative state.

But more interesting is the policy.

This is a situation where our agent tries to end the game

as fast as possible so as not to endure the penalty of 200.

And even over here, where the agent jumps into the -100,

it's still better than going north, paying the penalty of 200 for each extra step,

and then reaching the +100.
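The arithmetic behind that choice can be sketched directly. The step counts below are my assumptions for illustration: one step to end the game in the -100 versus a three-step walk to the +100, at a cost of 200 per step.

```python
step_cost = 200

end_fast  = -step_cost * 1 - 100   # one step, straight into the -100 ditch
walk_long = -step_cost * 3 + 100   # three steps, then collect the +100

print(end_fast, walk_long)         # → -300 -500
assert end_fast > walk_long        # ending the game fast wins here
```

With the per-step penalty exceeding the ditch penalty, any path long enough to reach the +100 accumulates more cost than the -100 it avoids.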

Similarly, over here we go straight north,

and over here we go as fast as possible

to the state over here.

Now, this is an extreme case.

I don't know why it would make sense to set a penalty for living

that is so negative that even the negative outcome of death is preferable to staying alive,

but certainly that's the result of running value iteration in this extreme case.