I'd like to show you some value functions after convergence
and the corresponding policies.
If we assume gamma = 1 and our cost for the non-absorbing states
equals -3, as before, we get the following approximate value function
after convergence, and the corresponding policy looks as follows.
Up here we go right until we hit the absorbing state.
Over here we prefer to go north.
Here we go left, and here we go north again.
I left the policy open for the absorbing states
because there's no action to be chosen here.
This is a situation where
the risk of falling into the -100 is balanced against
the time spent going around it.
We have an action over here, in this state,
that risks a 10% chance of falling into the -100.
But that's preferable under the cost model of -3
to the action of going south.
Now, this all changes if we assume a cost of 0
for all the states over here, in which case,
the value function after convergence looks interesting.
And with some thought, you realize it's exactly the right one.
Each value is exactly 100,
and the reason is with a cost of 0,
it doesn't matter how long we move around.
Eventually, in this case, we can guarantee that we reach the +100;
therefore each value converges to 100 under repeated backups.
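To spell out the fixed-point argument: with step cost 0 and gamma = 1, plugging a value of 100 for every non-absorbing state into the Bellman backup gives 100 right back, provided each non-absorbing state has at least one action that cannot land in the -100 state (which is what lets us guarantee reaching the +100):

```latex
V(s) \;=\; 0 \;+\; \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s')
     \;=\; \max_a \sum_{s'} P(s' \mid s, a) \cdot 100 \;=\; 100 .
```

So the all-100 value function is a fixed point of the backup, and value iteration converges to it from below.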
The corresponding policy is the one we discussed before.
And the crucial thing here is that for this state,
we go south, because we can afford to wait.
For this state over here, we go west,
again taking the extra time so as to avoid
falling into the -100.
And all the other states resolve
exactly as you would expect,
as shown over here.
If we set the costs to -200,
so each step itself is even more expensive
than falling into this ditch over here,
we get a value function that's strongly negative everywhere
with this being the most negative state.
But more interesting is the policy.
This is a situation where our agent tries to end the game
as fast as possible so as not to endure the penalty of -200.
And even over here, where the agent throws itself into the -100,
that's still better than going north, paying the -200 penalty
for each extra step, and only then reaching the +100.
Similarly, over here we go straight north,
and over here we go as fast as possible
to the state over here.
Now, this is an extreme case.
I don't know when it would make sense to set a penalty for living
that is so negative that even a painful death is preferable to staying alive,
but certainly that's the result of running value iteration in this extreme case.
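The three cases above can be reproduced with a short value-iteration sketch. The exact grid is an assumption: I use the classic 4x3 layout (+100 in the top-right corner, -100 directly below it, one wall, gamma = 1) with the usual stochastic motion model of 80% in the intended direction and 10% to each side, staying put when bumping into a wall or the border. The transcript only confirms the ±100 outcomes, the cost values, and gamma = 1, so treat the layout and noise model as illustrative.

```python
# Value iteration on an assumed 4x3 grid world, states indexed (row, col),
# row 0 at the top. +100 at (0,3), -100 at (1,3), a wall at (1,1).
ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINAL = {(0, 3): 100.0, (1, 3): -100.0}
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
PERP = {'N': ('W', 'E'), 'S': ('W', 'E'), 'E': ('N', 'S'), 'W': ('N', 'S')}

def step(state, action):
    """Deterministic successor: stay put on hitting a wall or the border."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) in WALL:
        return state
    return (nr, nc)

def backup(V, s, a, cost):
    """Expected value of action a: 80% intended, 10% each perpendicular."""
    outcomes = [(0.8, step(s, a))] + [(0.1, step(s, p)) for p in PERP[a]]
    return cost + sum(p * V[s2] for p, s2 in outcomes)

def value_iteration(cost, sweeps=2000):
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) not in WALL]
    V = {s: TERMINAL.get(s, 0.0) for s in states}
    for _ in range(sweeps):
        for s in states:
            if s not in TERMINAL:
                V[s] = max(backup(V, s, a, cost) for a in ACTIONS)
    policy = {s: max(ACTIONS, key=lambda a: backup(V, s, a, cost))
              for s in states if s not in TERMINAL}
    return V, policy

for cost in (-3.0, 0.0, -200.0):
    V, pi = value_iteration(cost)
    print(f"cost {cost}: V(bottom-left) = {V[(2, 0)]:.1f}, "
          f"action next to the -100 at (1,2): {pi[(1, 2)]}")
```

Under this layout, the zero-cost run drives every non-absorbing value to 100 and picks the risk-free bouncing actions (west at (1,2), south at (2,3)), while at cost -200 the policy at (1,2) flips to east, diving straight into the -100 to end the episode as fast as possible.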