
09-31 Value Iterations And Policy 2

  • 0:00 - 0:04
I'd like to show you some value functions after convergence
  • 0:04 - 0:07
    and the corresponding policies.
  • 0:07 - 0:11
If we assume gamma = 1 and our cost for the non-absorbing states
  • 0:11 - 0:15
    equals -3, as before, we get the following approximate value function
  • 0:15 - 0:21
    after convergence, and the corresponding policy looks as follows.
  • 0:21 - 0:25
    Up here we go right until we hit the absorbing state.
  • 0:25 - 0:27
    Over here we prefer to go north.
  • 0:27 - 0:31
    Here we go left, and here we go north again.
  • 0:31 - 0:33
    I left the policy open for the absorbing states
  • 0:33 - 0:36
    because there's no action to be chosen here.
  • 0:36 - 0:39
    This is a situation where
  • 0:39 - 0:42
    the risk of falling into the -100 is balanced by
  • 0:42 - 0:45
    the time spent going around.
  • 0:45 - 0:48
We have an action over here, in this state here,
  • 0:48 - 0:52
    that risks the 10% chance of falling into the -100.
  • 0:52 - 0:55
    But that's preferable under the cost model of -3
  • 0:55 - 0:58
    to the action of going south.
  • 0:58 - 1:02
    Now, this all changes if we assume a cost of 0
  • 1:02 - 1:05
    for all the states over here, in which case,
  • 1:05 - 1:09
    the value function after convergence looks interesting.
  • 1:09 - 1:13
    And with some thought, you realize it's exactly the right one.
  • 1:13 - 1:16
    Each value is exactly 100,
  • 1:16 - 1:18
and the reason is that with a cost of 0,
  • 1:18 - 1:21
    it doesn't matter how long we move around.
  • 1:21 - 1:24
Eventually, in this case, we can guarantee that we reach the +100,
  • 1:24 - 1:28
    therefore each value after backups will become 100.
  • 1:28 - 1:32
    The corresponding policy is the one we discussed before.
  • 1:32 - 1:35
    And the crucial thing here is that for this state,
  • 1:35 - 1:38
we go south, if we're willing to take the time.
  • 1:38 - 1:40
    For this state over here, we go west,
  • 1:40 - 1:42
willing to take the time so as to avoid
  • 1:42 - 1:44
    falling into the -100.
  • 1:44 - 1:46
    And all the other states resolve
  • 1:46 - 1:49
exactly as you would expect,
  • 1:49 - 1:52
    as shown over here.
  • 1:52 - 1:55
    If we set the costs to -200,
  • 1:55 - 1:58
    so each step itself is even more expensive
  • 1:58 - 2:02
than falling into this ditch over here,
  • 2:02 - 2:05
    we get a value function that's strongly negative everywhere
  • 2:05 - 2:08
    with this being the most negative state.
  • 2:08 - 2:11
    But more interesting is the policy.
  • 2:11 - 2:14
    This is a situation where our agent tries to end the game
  • 2:14 - 2:18
    as fast as possible so as not to endure the penalty of -200.
  • 2:18 - 2:21
And even over here, where it throws itself into the -100,
  • 2:21 - 2:25
it's still better than going north, taking in excess of 200 as a penalty,
  • 2:25 - 2:27
and only then reaching the +100.
  • 2:27 - 2:30
    Similarly, over here we go straight north,
  • 2:30 - 2:32
    and over here we go as fast as possible
  • 2:32 - 2:35
    to the state over here.
  • 2:35 - 2:37
    Now, this is an extreme case.
  • 2:37 - 2:39
    I don't know why it would make sense to set a penalty for life
  • 2:39 - 2:45
that is so negative that even a negative death is preferable to living,
  • 2:45 -
    but certainly that's the result of running value iteration in this extreme case.
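
Below is a minimal value-iteration sketch of the three experiments described above. The transcript does not spell out the grid, so this assumes the course's standard 4x3 grid world: absorbing states worth +100 and -100, one blocked cell, moves that succeed with probability 0.8 and slip sideways with probability 0.1 each, gamma = 1, and the three per-step costs from the lecture (-3, 0, -200). The layout and all names here are illustrative assumptions, not taken from the video.

# Minimal value iteration sketch (assumed 4x3 grid world; see note above).

ROWS, COLS = 3, 4
TERMINALS = {(0, 3): 100.0, (1, 3): -100.0}     # absorbing +100 and -100 states (assumed layout)
WALL = (1, 1)                                   # blocked cell (assumed layout)
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
SIDEWAYS = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

def step(state, action):
    """Deterministic move; bumping into the wall or the border leaves us in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt == WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt

def transitions(state, action):
    """(probability, next_state) pairs: 0.8 intended, 0.1 for each sideways slip."""
    pairs = [(0.8, step(state, action))]
    pairs += [(0.1, step(state, side)) for side in SIDEWAYS[action]]
    return pairs

def value_iteration(cost, gamma=1.0, theta=1e-4):
    """Backs up values until the largest change in a sweep drops below theta."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                new = TERMINALS[s]              # absorbing: value is the terminal reward
            else:
                new = max(cost + gamma * sum(p * V[s2] for p, s2 in transitions(s, a))
                          for a in ACTIONS)
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < theta:
            return V

def greedy_policy(V, cost, gamma=1.0):
    """One-step lookahead on the converged values."""
    return {s: max(ACTIONS, key=lambda a: cost + gamma *
                   sum(p * V[s2] for p, s2 in transitions(s, a)))
            for s in V if s not in TERMINALS}

if __name__ == '__main__':
    arrows = {'N': '^', 'S': 'v', 'E': '>', 'W': '<'}
    for cost in (-3, 0, -200):                  # the three per-step costs from the lecture
        V = value_iteration(cost)
        pi = greedy_policy(V, cost)
        print(f"\nper-step cost = {cost}")
        for r in range(ROWS):
            print('  '.join(
                '  ####  ' if (r, c) == WALL else
                f'{TERMINALS[(r, c)]:+8.0f}' if (r, c) in TERMINALS else
                f'{arrows[pi[(r, c)]]}{V[(r, c)]:7.1f}'
                for c in range(COLS)))

With these assumptions, the printed grids should show the three behaviors discussed in the lecture: a policy that accepts some risk of the -100 at cost -3, values of approximately 100 everywhere at cost 0, and a policy at cost -200 that ends the episode as quickly as possible, even by stepping into the -100.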
Team:
Udacity
Project:
CS271 - Intro to Artificial Intelligence
Duration:
02:49