## 09-31 Value Iterations And Policy 2

• 0:00 - 0:04
I'd like to show you some value functions after convergence
• 0:04 - 0:07
and the corresponding policies.
• 0:07 - 0:11
If we assume gamma = 1 and our cost for the non-absorbing states
• 0:11 - 0:15
equals -3, as before, we get the following approximate value function
• 0:15 - 0:21
after convergence, and the corresponding policy looks as follows.
• 0:21 - 0:25
Up here we go right until we hit the absorbing state.
• 0:25 - 0:27
Over here we prefer to go north.
• 0:27 - 0:31
Here we go left, and here we go north again.
• 0:31 - 0:33
I left the policy open for the absorbing states
• 0:33 - 0:36
because there's no action to be chosen here.
• 0:36 - 0:39
This is a situation where
• 0:39 - 0:42
the risk of falling into the -100 is balanced against
• 0:42 - 0:45
the time spent going around.
• 0:45 - 0:48
We have an action over here, in this state here,
• 0:48 - 0:52
that risks the 10% chance of falling into the -100.
• 0:52 - 0:55
But under the cost model of -3, that's preferable
• 0:55 - 0:58
to the action of going south.
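
As a concrete companion to this first setting, here is a minimal value-iteration sketch in Python. The 3x4 grid, the wall at (1, 1), the positions of the +100/-100 absorbing states, and the 80/10/10 motion noise are assumptions modeled on the standard grid-world example; only gamma = 1 and the step costs come from the lecture.

```python
# Value iteration on an assumed 3x4 grid world: +100 at (0, 3), -100 at
# (1, 3), a wall at (1, 1), and 80/10/10 motion noise. Gamma and the step
# costs are the lecture's; the layout itself is an assumption.

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINAL = {(0, 3): 100.0, (1, 3): -100.0}
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
SLIP = {'N': ('W', 'E'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('S', 'N')}

def move(state, action):
    """Deterministic move; bumping the edge or the wall leaves us in place."""
    r, c = state
    dr, dc = MOVES[action]
    nxt = (r + dr, c + dc)
    if nxt in WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt

def outcomes(state, action):
    """80% intended direction, 10% slip to each perpendicular side."""
    left, right = SLIP[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, left)),
            (0.1, move(state, right))]

def value_iteration(step_reward, gamma=1.0, eps=1e-6, max_sweeps=10_000):
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) not in WALL]
    V = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            if s in TERMINAL:
                new = TERMINAL[s]      # absorbing state: value is its payout
            else:
                new = step_reward + gamma * max(
                    sum(p * V[t] for p, t in outcomes(s, a)) for a in MOVES)
            delta = max(delta, abs(new - V[s]))
            V[s] = new                 # in-place (Gauss-Seidel) backup
        if delta < eps:                # converged: no value moved much
            break
    return V

def greedy_policy(V):
    """One-step lookahead; absorbing states get no action, as in the lecture."""
    return {s: max(MOVES, key=lambda a: sum(p * V[t] for p, t in outcomes(s, a)))
            for s in V if s not in TERMINAL}

V3 = value_iteration(step_reward=-3.0)
print(greedy_policy(V3))   # under cost -3, the risky shortcut past the -100 wins
```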
• 0:58 - 1:02
Now, this all changes if we assume a cost of 0
• 1:02 - 1:05
for all the states over here, in which case,
• 1:05 - 1:09
the value function after convergence looks interesting.
• 1:09 - 1:13
And with some thought, you realize it's exactly the right one.
• 1:13 - 1:16
Each value is exactly 100,
• 1:16 - 1:18
and the reason is that with a cost of 0,
• 1:18 - 1:21
it doesn't matter how long we move around.
• 1:21 - 1:24
Eventually, in this case, we can guarantee that we reach the +100,
• 1:24 - 1:28
and therefore each value becomes 100 after repeated backups.
• 1:28 - 1:32
The corresponding policy is the one we discussed before.
• 1:32 - 1:35
And the crucial thing here is that for this state,
• 1:35 - 1:38
we go south, willing to wait the extra time.
• 1:38 - 1:40
For this state over here, we go west,
• 1:40 - 1:42
willing to wait the time so as to avoid
• 1:42 - 1:44
falling into the -100.
• 1:44 - 1:46
And all the other states resolve
• 1:46 - 1:49
exactly as you would expect,
• 1:49 - 1:52
as shown over here.
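
Reusing the sketch above, the cost-0 claim can be checked directly: with gamma = 1 and no step cost, waiting is free, so every non-absorbing value should come out at (numerically, essentially) 100, and the greedy policy should be the safe, patient one.

```python
V0 = value_iteration(step_reward=0.0)
# with zero step cost and gamma = 1, every non-terminal value converges to 100
assert all(abs(V0[s] - 100.0) < 1e-3 for s in V0 if s not in TERMINAL)
print(greedy_policy(V0))   # e.g. west at the cell left of the -100: slow but safe
```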
• 1:52 - 1:55
If we set the cost to -200,
• 1:55 - 1:58
so each step itself is even more expensive
• 1:58 - 2:02
than falling into this ditch over here,
• 2:02 - 2:05
we get a value function that's strongly negative everywhere
• 2:05 - 2:08
with this being the most negative state.
• 2:08 - 2:11
But more interesting is the policy.
• 2:11 - 2:14
This is a situation where our agent tries to end the game
• 2:14 - 2:18
as fast as possible so as not to endure the penalty of -200.
• 2:18 - 2:21
And even over here, where the agent jumps itself into the -100,
• 2:21 - 2:25
that's still better than going north, taking the extra -200 penalties,
• 2:25 - 2:27
and only then collecting the +100.
• 2:27 - 2:30
Similarly, over here we go straight north,
• 2:30 - 2:32
and over here we go as fast as possible
• 2:32 - 2:35
to the state over here.
• 2:35 - 2:37
Now, this is an extreme case.
• 2:37 - 2:39
I don't know when it would make sense to set a penalty for living
• 2:39 - 2:45
that is so negative that even death at -100 is preferable to staying alive,
• 2:45 -
but certainly that's the result of running value iteration in this extreme case.
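
And reusing the sketch once more for this extreme case: ending the game in the -100 (roughly -200 - 100 = -300 from the cell just below it) beats paying -200 per step on a four-step detour to the +100 (roughly 4 x -200 + 100 = -700), so the greedy action there should point straight into the ditch. The specific cells, again, depend on the assumed layout.

```python
V200 = value_iteration(step_reward=-200.0)
pi = greedy_policy(V200)
print(pi[(2, 3)])               # expected 'N': jump into the -100 to end the game
print(min(V200, key=V200.get))  # the most negative state in this layout
```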