
09-32 Mdp Conclusion

  • 0:00 - 0:02
    So, we've learned quite a bit so far.
  • 0:02 - 0:06
    We've learned about Markov Decision Processes.
  • 0:06 - 0:10
    They are fully observable, with a set of states
  • 0:10 - 0:14
    and corresponding actions, which have stochastic effects
  • 0:14 - 0:19
    characterized by a conditional probability distribution P of S prime
  • 0:19 - 0:22
    given that we apply action A in state S.
  • 0:22 - 0:25
    We seek to maximize a reward function
  • 0:25 - 0:27
    that we define over states.
  • 0:27 - 0:30
    You can equally define it over state-action pairs.
  • 0:30 - 0:33
    The objective was to maximize the expected
  • 0:33 - 0:36
    future cumulative and discounted rewards,
  • 0:36 - 0:38
    as shown by this formula over here (written out below, after the transcript).
  • 0:38 - 0:42
    The key to solving them was called value iteration
  • 0:42 - 0:45
    where we assigned a value to each state.
  • 0:45 - 0:47
    There are alternative techniques that assign values
  • 0:47 - 0:50
    to state-action pairs, often called Q(s, a),
  • 0:50 - 0:53
    but we didn't really consider this so far.
  • 0:53 - 0:55
    We defined a recursive update rule
  • 0:55 - 0:58
    to update V(s) that was very logical
  • 0:58 - 1:00
    after we understood that we have an action choice,
  • 1:00 - 1:03
    but nature chooses for us the outcome of the action
  • 1:03 - 1:07
    according to the stochastic transition probability over here (the update rule is written out below the transcript).
  • 1:07 - 1:10
    And then we observed that value iteration converged,
  • 1:10 - 1:12
    and we were able to define a policy by taking
  • 1:12 - 1:16
    the argmax of the value iteration expression,
  • 1:16 - 1:18
    which I don't spell out over here (but see below, after the transcript).
  • 1:18 - 1:20
    This is a beautiful framework.
  • 1:20 - 1:22
    It's really different from the planning we did before
  • 1:22 - 1:26
    because of the stochasticity of the action effects.
  • 1:26 - 1:29
    Rather than making a single sequence of states and actions,
  • 1:29 - 1:31
    as would be the case in deterministic planning,
  • 1:31 - 1:35
    now we make an entire field, a so-called policy,
  • 1:35 - 1:39
    that assigns an action to every possible state.
  • 1:39 - 1:42
    And we compute this using a technique called value iteration
  • 1:42 -
    that spreads value in reverse order through the field of states (a small worked sketch follows the transcript).
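
The objective formula that the transcript points to ("this formula over here") is not reproduced on this page. A standard way to write the expected cumulative discounted reward, assuming a discount factor gamma in [0, 1) as used in the course, is:

    \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} R_t \Big]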
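
The recursive update rule and the argmax policy mentioned at the end are likewise only gestured at on screen. With the reward defined over states, as in the lecture, they are usually written as:

    V(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')

    \pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')

The max over a is the agent's action choice; the sum over s' is nature choosing the outcome of that action.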
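
As a small worked sketch of how value iteration spreads value through the field of states, here is a minimal Python example. The five-state corridor, the 0.8/0.2 transition model, the reward of 10 in the last state, and the discount of 0.9 are illustrative placeholders, not the lecture's grid world.

# Minimal value-iteration sketch on a made-up 5-state corridor MDP.
# States are cells 0..4; actions are "left" and "right".
# An action moves one cell in its direction with probability 0.8
# and leaves the agent where it is with probability 0.2.

GAMMA = 0.9                                              # discount factor
STATES = list(range(5))
ACTIONS = ("left", "right")
REWARD = {s: 10.0 if s == 4 else 0.0 for s in STATES}    # reward defined over states


def transitions(state, action):
    """Return (probability, next_state) pairs for P(s' | s, a)."""
    target = max(state - 1, 0) if action == "left" else min(state + 1, 4)
    return [(0.8, target), (0.2, state)]


def value_iteration(eps=1e-6):
    """Repeat Bellman backups until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # We choose the action; nature chooses the outcome via P(s' | s, a).
            best = max(sum(p * V[s2] for p, s2 in transitions(s, a)) for a in ACTIONS)
            new_v = REWARD[s] + GAMMA * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:                                  # converged
            return V


def greedy_policy(V):
    """Extract the policy by taking the argmax over actions."""
    return {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
            for s in STATES}


if __name__ == "__main__":
    values = value_iteration()
    policy = greedy_policy(values)
    for s in STATES:
        print(s, round(values[s], 2), policy[s])

With these placeholder numbers the values decay from the rewarding state back toward the start, and the greedy policy points "right" everywhere, which is the "value spreading in reverse order" that the lecture describes.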
Title:
09-32 Mdp Conclusion
Description:

Unit 9 32 MDP Conclusion

Team:
Udacity
Project:
CS271 - Intro to Artificial Intelligence
Duration:
01:47