
Title:
1007 Solving an MDP

Description:
Unit 10 07 Solving a MDP.mp4

Now, to solve an MDP,

we're trying to find a policy, pi of S;

that's going to be our answer.

The pi that we want, the optimal policy,

is the one that's going to maximize

the discounted total reward.

So what we mean is:

we want to take the sum over all times

into the future of the reward

that you get from starting out

in the state that you're in at time T,

then applying the policy to that state,

and arriving at a new state at time T plus 1.

And so we want to maximize that sum,

but the sum might be infinite,

and so what we do is

we take this value, gamma,

and raise it to the T power, saying

we're going to count future rewards less than

current rewards, and that way

we'll make sure that the sum total is bounded.

So we want the policy that maximizes that result.
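Spelled out in symbols, the objective just described can be written as follows (assuming, as the narration suggests, a reward R that depends on the state at time t, the action the policy picks there, and the resulting state, with discount gamma between 0 and 1):

```latex
\pi^{*} \;=\; \operatorname*{argmax}_{\pi}\;
E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
R\big(s_t,\; \pi(s_t),\; s_{t+1}\big)\right]
```

Because gamma is strictly less than 1, the geometric weighting keeps this infinite sum bounded.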

If we figure out the utility of the state

by solving the Markov Decision Process,

then we have: the utility of any state, S,

is equal to the maximum over all

possible actions that we could take in S

of the expected value of taking that action.

And what's the expected value?

Well, it's just the sum over all resulting states

of the transition model,

that is, the probability that we get to that state

given that, from the start state, we take the action

specified by the optimal policy,

times the utility of that resulting state.
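In symbols, the equation just described reads as follows, writing P(s' | s, a) for the transition model (note that the full Bellman equation usually also adds a reward term R(s) and the discount gamma, which the narration leaves implicit):

```latex
U(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
```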

So look at all possible actions;

choose the best one

according to its expected utility, weighted by the transition probabilities.
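As a concrete sketch of this "look at all actions, choose the best expected utility" step, here is a tiny hypothetical two-state MDP in Python. All state names, actions, rewards, and probabilities are made up for illustration, and the utilities are computed by value iteration using the common formulation with a state reward R(s) and discount gamma:

```python
# Transition model: P[state][action] -> list of (probability, next_state).
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(0.9, "A"), (0.1, "B")]},
}
R = {"A": 0.0, "B": 1.0}   # reward for being in each state (illustrative)
gamma = 0.9                # discount factor, counts future rewards less

def expected_utility(s, a, U):
    """Sum over resulting states of P(s'|s,a) times U(s')."""
    return sum(p * U[s2] for p, s2 in P[s][a])

def value_iteration(eps=1e-6):
    """Iterate the Bellman update until utilities stop changing."""
    U = {s: 0.0 for s in P}
    while True:
        U_new = {
            s: R[s] + gamma * max(expected_utility(s, a, U) for a in P[s])
            for s in P
        }
        if max(abs(U_new[s] - U[s]) for s in P) < eps:
            return U_new
        U = U_new

def best_policy(U):
    """Look at all possible actions; pick the one with highest expected utility."""
    return {s: max(P[s], key=lambda a: expected_utility(s, a, U)) for s in P}

U = value_iteration()
pi = best_policy(U)
```

In this toy example, state B carries all the reward, so the extracted policy moves toward B from A and stays put once there.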