
English subtitles


Unit 9-16: MDP and Costs


Showing Revision 1 created 11/28/2012 by Amara Bot.

  1. [Narrator] So even for the simple grid world,
  2. the optimal control policy assuming stochastic actions
  3. and no costs of moving, except for the final absorbing costs,
  4. is somewhat nontrivial.
  5. Take a second to look at this.
  6. Along here it seems pretty obvious, but
  7. for the state over here, B3, and for the state over here, C4,
  8. we choose an action that just avoids falling into the minus 100,
  9. which is more important than trying to make progress towards the plus 100.
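The trade-off described at C4 can be sketched with a one-step expected-reward comparison. This is only a sketch: the 80/10/10 slip model and the grid geometry (the minus 100 state B4 directly above C4, an outer wall below C4) are assumptions about the figure, which is not visible in the transcript.

```python
# Sketch of the trade-off at state C4, assuming a standard slip model:
# the intended move succeeds with probability 0.8, and the agent slips to
# each perpendicular side with probability 0.1 (assumed, not stated here).
P_INTENDED = 0.8
P_SLIP = 0.1

# Action "west" from C4: the intended cell C3 is neutral (0), but a
# northward slip lands in B4 (-100); a southward slip hits the wall (0).
exp_west = P_INTENDED * 0 + P_SLIP * (-100) + P_SLIP * 0

# Action "south" from C4: the agent bumps the outer wall and stays put (0);
# the perpendicular slips go east/west along the bottom row, both neutral.
exp_south = P_INTENDED * 0 + P_SLIP * 0 + P_SLIP * 0

print(exp_west, exp_south)  # -10.0 0.0: running into the wall is safer
```

Under these assumed numbers, aiming at the wall has a strictly better one-step expected reward than making progress, which is exactly the behavior the lecture calls nontrivial.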
  10. Now obviously this is not the general case of an MDP,
  11. and it's somewhat frustrating that the agent would be willing to run through the wall,
  12. just so as to avoid falling into the minus 100,
  13. and the reason why this seems unintuitive is
  14. because we're really forgetting the issue of costs.
  15. In normal life, there is a cost associated with moving.
  16. MDPs are general enough to have a cost factor,
  17. and the way we're going to denote costs
  18. is by defining our reward function over any possible state.
  19. Reaching the state A4
  20. gives us plus 100, minus 100 for B4,
  21. and perhaps minus 3 for every other state,
  22. which reflects the fact that if you take a step somewhere,
  23. we will pay minus 3.
  24. So this gives an incentive to shorten the final action sequence.
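The reward function just described can be written down directly. A minimal sketch using the values from the lecture (state names follow the transcript's labels):

```python
# Reward function for the grid world: +100 for the goal state A4,
# -100 for the pit B4, and a -3 step cost for every other state.
def reward(state):
    """Reward received for reaching a grid state."""
    if state == "A4":
        return 100
    if state == "B4":
        return -100
    return -3  # every other state: the cost of taking a step

print(reward("A4"), reward("B4"), reward("C1"))  # 100 -100 -3
```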
  25. So we're now ready to state the actual objective
  26. of an MDP, which is to maximize not
  27. just the momentary reward, but the sum
  28. of all future rewards.
  29. We're going to write R_t to denote the fact that
  30. this reward is received at time t, and because
  31. our reward itself is stochastic,
  32. we have to compute the expectation over those,
  33. and that is what we seek to maximize.
  34. So we seek to find the policy that maximizes the expression over here.
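Written out, the objective just described is the standard MDP formulation (the notation here is assumed, since the slide itself is not shown in the transcript):

```latex
\pi^{*} \;=\; \operatorname*{argmax}_{\pi}\; E\!\left[\,\sum_{t=0}^{\infty} R_t \;\middle|\; \pi \,\right]
```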
  35. Now another interesting caveat is that often people put
  36. a so-called discount factor, gamma, into this equation,
  37. with an exponent of t, where the discount factor might be 0.9,
  38. and what this does is it decays future reward
  39. relative to more immediate rewards, and it's
  40. kind of an alternative way to specify costs.
  41. So we can make this explicit by a negative reward per state
  42. or we can bring in a discount factor
  43. that discounts the plus 100 by the
  44. number of steps that went by before the agent reached the plus 100.
  45. This also gives an incentive to get to the goal as fast as possible.
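As a quick illustration of that incentive, here is how gamma = 0.9 (the value mentioned above) shrinks a +100 goal reward depending on how many steps it takes to collect it; the two path lengths are made-up examples:

```python
# Discounting as an implicit step cost: the same +100 is worth less
# the later it is received. Gamma = 0.9 as in the lecture; the path
# lengths (4 vs. 10 steps) are illustrative assumptions.
GAMMA = 0.9

def discounted_goal_value(steps, goal_reward=100):
    """Present value of a goal reward received after `steps` steps."""
    return (GAMMA ** steps) * goal_reward

short_path = discounted_goal_value(4)    # reach the +100 in 4 steps
long_path = discounted_goal_value(10)    # reach the +100 in 10 steps
print(round(short_path, 2), round(long_path, 2))  # 65.61 34.87
```

The shorter path keeps more of the reward, so an optimal policy under discounting heads for the goal as fast as possible.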
  46. The nice mathematical thing about the discount factor is that
  47. it keeps this expectation bounded.
  48. It's easy to show that this expression over here
  49. will always be smaller than or equal to 1 over 1 minus gamma times the
  50. maximal absolute reward,
  51. which in this case would be plus 100.
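The bound follows from the geometric series, and it is easy to check numerically. A small sketch with the lecture's values (gamma = 0.9, maximal absolute reward 100), using a truncated sum as an approximation:

```python
# Numerical check of the bound: even if the agent collected the maximal
# reward +100 at every step, the discounted sum stays below
# 100 / (1 - gamma) = 1000 (gamma = 0.9, values from the lecture).
GAMMA = 0.9
R_MAX = 100

# Worst case: receive the maximal reward at every one of 100 steps.
discounted_sum = sum((GAMMA ** t) * R_MAX for t in range(100))
bound = R_MAX / (1 - GAMMA)

print(f"{discounted_sum:.2f} <= {bound:.2f}")  # 999.97 <= 1000.00
```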