## 09-16 MDP and Costs




Showing Revision 1 created 11/28/2012 by Amara Bot.

[Narrator] So even for this simple grid world, the optimal control policy, assuming stochastic actions and no cost of moving except for the final absorbing costs, is somewhat nontrivial. Take a second to look at it. Along here it seems pretty obvious, but for the state over here, B3, and the state over here, C4, we choose an action that just avoids falling into the -100, which is more important than trying to make progress toward the +100.

Now obviously this is not the general case of an MDP, and it's somewhat frustrating that the agent would be willing to run into the wall just to avoid falling into the -100. The reason this seems unintuitive is that we're really forgetting the issue of costs. In normal life, there is a cost associated with moving. MDPs are general enough to have a cost factor, and the way we're going to denote costs is by defining our reward function over every possible state.

Reaching the state A4 gives us +100, B4 gives us -100, and perhaps every other state gives us -3, which reflects the fact that each step we take costs us 3. This gives an incentive to shorten the final action sequence.

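
The reward function described here can be sketched in a few lines. The state labels A4 and B4 come from the video's grid; treating states as plain strings is an assumption for illustration:

```python
# A minimal sketch of the reward function from the transcript:
# +100 at the absorbing goal A4, -100 at the absorbing trap B4,
# and a -3 per-step cost in every other state.
def reward(state):
    if state == "A4":
        return 100   # absorbing goal
    if state == "B4":
        return -100  # absorbing trap
    return -3        # cost of taking a step anywhere else
```

With this per-step penalty, longer paths accumulate more cost, so the optimal policy no longer runs into walls indefinitely just to avoid the -100.
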
So we're now ready to state the actual objective of an MDP, which is to maximize not just the momentary reward, but the sum of all future rewards. We're going to write R_t to denote the reward received at time t, and because the reward itself is stochastic, we have to compute the expectation over those rewards, and that is what we seek to maximize. So we seek to find the policy that maximizes the expression over here.

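
Written out, the objective the narrator describes is the following (this is a standard rendering of what is said, not copied verbatim from the slide):

```latex
\pi^{*} \;=\; \operatorname*{argmax}_{\pi}\; E\!\left[\,\sum_{t} R_{t} \,\middle|\, \pi \right]
```
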
Now another interesting caveat is that people often put a so-called discount factor into this equation, with an exponent of t, where the discount factor might be, say, 0.9. What this does is decay future rewards relative to more immediate rewards, and it's kind of an alternative way to specify costs. So we can make costs explicit by a negative reward per state, or we can bring in a discount factor that discounts the +100 by the number of steps that elapsed before the agent reached the +100. This also gives an incentive to get to the goal as fast as possible.

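
As a small sketch, the discounted return weights the reward at time t by gamma to the power t, so the +100 is worth less the more steps it takes to reach it. The reward sequences below are illustrative, not from the video:

```python
# Discounted return: the reward at time t is weighted by gamma ** t,
# so later rewards count for less than immediate ones.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reaching the +100 goal after two -3 steps, versus after three -3 steps:
two_steps = discounted_return([-3, -3, 100])        # ≈ 75.3
three_steps = discounted_return([-3, -3, -3, 100])  # ≈ 64.8
```

The extra step both adds a -3 penalty and shrinks the +100 by another factor of gamma, which is exactly the incentive to reach the goal quickly.
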
The nice mathematical thing about the discount factor is that it keeps this expectation bounded. It is easy to show that this expression over here will always be smaller than or equal to 1/(1 - gamma) times the maximum absolute reward, which in this case would be +100.
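
The bound the narrator mentions is just the geometric series: if every reward satisfies |R_t| ≤ R_max, then

```latex
\left|\, E\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right] \right|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{1}{1-\gamma}\, R_{\max}
```

With gamma = 0.9 and R_max = 100, the expected discounted return can never exceed 1000 in absolute value.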