
Title:
1007 Solving an MDP

Description:
Unit 10 07 Solving a MDP.mp4

Now, to solve an MDP,

we're trying to find a policy, pi of S;

that's going to be our answer.

The pi that we want, the optimal policy,

is the one that's going to maximize

the discounted total reward.

So what we mean is:

we want to take the sum over all times

into the future of the reward

that you get from starting out

in the state that you're in at time T,

then applying the policy to that state,

and arriving at a new state at time T plus 1.

And so we want to maximize that sum,

but the sum might be infinite,

and so what we do is

we take this value, gamma,

and raise it to the T power, saying

we're going to count future rewards less than

current rewards, and that way

we'll make sure that the sum total is bounded.

So we want the policy that maximizes that result.
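Spelled out in symbols, the objective just described can be written as follows (assuming, as the narration suggests, a reward R that depends on the state at time t, the action the policy picks there, and the resulting state, with discount gamma between 0 and 1):

```latex
\pi^{*} \;=\; \operatorname*{argmax}_{\pi}\;
E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
R\big(s_t,\; \pi(s_t),\; s_{t+1}\big)\right]
```

Because gamma is strictly less than 1, the geometric weighting keeps this infinite sum bounded.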

If we figure out the utility of the state

by solving the Markov Decision Process,

then we have: the utility of any state, S,

is equal to the maximum over all

possible actions that we could take in S

of the expected value of taking that action.

And what's the expected value?

Well, it's just the sum over all resulting states

of the transition model,

that is, the probability that we get to that state

given that, from the start state, we take the action

specified by the optimal policy,

times the utility of that resulting state.
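In symbols, the equation just described reads as follows, writing P(s' | s, a) for the transition model (note that the full Bellman equation usually also adds a reward term R(s) and the discount gamma, which the narration leaves implicit):

```latex
U(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
```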

So look at all possible actions;

choose the best one

according to its expected utility, weighted by the transition probabilities.
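As a concrete sketch of this "look at all actions, choose the best expected utility" step, here is a tiny hypothetical two-state MDP in Python. All state names, actions, rewards, and probabilities are made up for illustration, and the utilities are computed by value iteration using the common formulation with a state reward R(s) and discount gamma:

```python
# Transition model: P[state][action] -> list of (probability, next_state).
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(0.9, "A"), (0.1, "B")]},
}
R = {"A": 0.0, "B": 1.0}   # reward for being in each state (illustrative)
gamma = 0.9                # discount factor, counts future rewards less

def expected_utility(s, a, U):
    """Sum over resulting states of P(s'|s,a) times U(s')."""
    return sum(p * U[s2] for p, s2 in P[s][a])

def value_iteration(eps=1e-6):
    """Iterate the Bellman update until utilities stop changing."""
    U = {s: 0.0 for s in P}
    while True:
        U_new = {
            s: R[s] + gamma * max(expected_utility(s, a, U) for a in P[s])
            for s in P
        }
        if max(abs(U_new[s] - U[s]) for s in P) < eps:
            return U_new
        U = U_new

def best_policy(U):
    """Look at all possible actions; pick the one with highest expected utility."""
    return {s: max(P[s], key=lambda a: expected_utility(s, a, U)) for s in P}

U = value_iteration()
pi = best_policy(U)
```

In this toy example, state B carries all the reward, so the extracted policy moves toward B from A and stays put once there.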