09-32 Mdp Conclusion

• 0:00 - 0:02
So, we've learned quite a bit so far.
• 0:02 - 0:06
We've learned about Markov Decision Processes.
• 0:06 - 0:10
They are fully observable, with a set of states
• 0:10 - 0:14
and corresponding actions, which have stochastic action effects
• 0:14 - 0:19
characterized by a conditional probability distribution P(s' | s, a),
• 0:19 - 0:22
the probability of reaching s' given that we apply action a in state s.
• 0:22 - 0:25
We seek to maximize a reward function
• 0:25 - 0:27
that we define over states.
• 0:27 - 0:30
You could equally define it over state-action pairs.
• 0:30 - 0:33
The objective was to maximize the expected
• 0:33 - 0:36
future cumulative, discounted rewards,
• 0:36 - 0:38
as shown by this formula over here.
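The formula the lecture points to is not visible in the transcript; a standard way to write the expected discounted reward sum (reconstructed here, with γ as the discount factor) is:

```latex
E\!\left[\sum_{t=0}^{\infty} \gamma^{t} \, R(s_t)\right], \qquad 0 \le \gamma < 1
```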
• 0:38 - 0:42
The key to solving them was called value iteration
• 0:42 - 0:45
where we assigned a value to each state.
• 0:45 - 0:47
There are alternative techniques that assign values
• 0:47 - 0:50
to state action pairs, often called Q(s, a),
• 0:50 - 0:53
but we didn't really consider this so far.
• 0:53 - 0:55
We defined a recursive update rule
• 0:55 - 0:58
to update V(s) that was very logical
• 0:58 - 1:00
after we understood that we have an action choice,
• 1:00 - 1:03
but nature chooses for us the outcome of the action
• 1:03 - 1:07
in a stochastic transition probability over here.
• 1:07 - 1:10
And then we observed that value iteration converged,
• 1:10 - 1:12
and we were able to define a policy by taking
• 1:12 - 1:16
the argmax of the value-iteration expression,
• 1:16 - 1:18
which I don't spell out over here.
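The update rule and the argmax policy can be sketched in code. This is a minimal illustration, not the lecture's own example: the three-state MDP below (its states, transition probabilities P(s' | s, a), and rewards R(s)) is entirely made up; only the Bellman-style update and the greedy argmax come from the lecture.

```python
GAMMA = 0.9  # discount factor (assumed value, for illustration)

# Hypothetical MDP: P[(s, a)] is a distribution over successor states.
P = {
    ('s0', 'a'): {'s0': 0.5, 's1': 0.5},
    ('s0', 'b'): {'s2': 1.0},
    ('s1', 'a'): {'s2': 1.0},
    ('s1', 'b'): {'s0': 1.0},
    ('s2', 'a'): {'s2': 1.0},
    ('s2', 'b'): {'s2': 1.0},
}
R = {'s0': 0.0, 's1': 0.0, 's2': 1.0}  # reward defined over states
states = ['s0', 's1', 's2']
actions = ['a', 'b']

def value_iteration(theta=1e-6):
    """Repeat the recursive update until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # We choose the action; nature chooses the outcome,
            # so we take an expectation over successor states.
            best = max(
                sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            new_v = R[s] + GAMMA * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

def greedy_policy(V):
    """The argmax the lecture leaves unspelled: in each state, pick the
    action with the highest expected successor value."""
    return {
        s: max(actions,
               key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
```

Running `greedy_policy(value_iteration())` on this toy MDP assigns an action to every state, which is exactly the "field" of actions the lecture calls a policy.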
• 1:18 - 1:20
This is a beautiful framework.
• 1:20 - 1:22
It's really different from the planning we did before
• 1:22 - 1:26
because of the stochasticity of the action effects.
• 1:26 - 1:29
Rather than making a single sequence of states and actions,
• 1:29 - 1:31
as would be the case in deterministic planning,
• 1:31 - 1:35
now we make an entire field, a so-called policy,
• 1:35 - 1:39
that assigns an action to every possible state.
• 1:39 - 1:42
And we compute this using a technique called value iteration
• 1:42 -
that spreads value in reverse order through the field of states.
Team:
Udacity
Project:
CS271 - Intro to Artificial Intelligence
Duration:
01:47