Unit 10 07: Solving an MDP


Now, to solve an MDP, we're trying to find a policy, pi of s; that's going to be our answer. The pi that we want, the optimal policy, is the one that's going to maximize the discounted total reward. What we mean is: we want to take the sum, over all times into the future, of the reward you get from starting out in the state you're in at time t, applying the policy to that state, and arriving at a new state at time t plus 1. We want to maximize that sum, but the sum might be infinite, so what we do is take this value, gamma, and raise it to the t power, saying that we're going to count future rewards less than current rewards; that way, we make sure the sum total is bounded. So we want the policy that maximizes that result.
If we figure out the utility of a state by solving the Markov decision process, then we have: the utility of any state s is equal to the maximum, over all possible actions we could take in s, of the expected value of taking that action. And what's the expected value? It's just the sum, over all resulting states, of the transition model, the probability that we get to that resulting state given that from the start state we take that action (the one the optimal policy specifies), times the utility of that resulting state.
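As an equation, with P(s' | s, a) standing for the transition model (again, notation assumed rather than shown in the video), this is roughly:

    U(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \, U(s')

That is, for each action, weight the utility of each possible resulting state by the probability of reaching it, and then take the action whose expected utility is largest.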
So: look at all possible actions, and choose the best one according to its expected utility, weighted by the transition probabilities.
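As a minimal sketch of that action-selection step (the function and variable names here are hypothetical, not from the course), in Python:

    # Minimal sketch of picking the best action by expected utility.
    # `transition(state, action)` is a hypothetical function returning a list of
    # (probability, next_state) pairs; `utility` is a dict mapping states to
    # their utilities; `actions(state)` returns the actions available in `state`.

    def expected_utility(state, action, utility, transition):
        # Sum over resulting states of P(s' | s, a) * U(s').
        return sum(p * utility[s2] for p, s2 in transition(state, action))

    def best_action(state, actions, utility, transition):
        # Look at all possible actions and choose the one with the highest
        # expected utility.
        return max(actions(state),
                   key=lambda a: expected_utility(state, a, utility, transition))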