## 09-35 POMDP

• 0:00 - 0:05
I'd like to illustrate the problem, using a very simple environment
• 0:05 - 0:07
that looks as follows:
• 0:07 - 0:09
Suppose you live in a world like this;
• 0:09 - 0:11
and your agent starts over here,
• 0:11 - 0:13
and there are 2 possible outcomes.
• 0:13 - 0:16
You can exit the maze over here--
• 0:16 - 0:18
where you get a plus 100--
• 0:18 - 0:20
or you can exit the maze over here,
• 0:20 - 0:22
where you receive a minus 100.
• 0:22 - 0:25
Now, in a fully observable case,
• 0:25 - 0:28
and in a deterministic case,
• 0:28 - 0:32
the optimal plan might look something like this;
• 0:32 - 0:35
and whether or not it goes straight over here depends on the details.
• 0:35 - 0:38
For example, whether the agent has momentum or not.
• 0:38 - 0:44
But you'll find a single sequence of actions and states that might cut the corners,
• 0:44 - 0:47
as close as possible, to reach the plus 100 as fast as possible.
• 0:47 - 0:50
That's conventional planning.
• 0:50 - 0:53
Let's contrast this with the case we just learned about,
• 0:53 - 0:57
which is the fully observable but stochastic case.
• 0:57 - 1:01
We just learned that the best thing to compute is a policy
• 1:01 - 1:04
that assigns an optimal action to every possible state;
• 1:04 - 1:07
and, simply speaking, this might look as follows:
• 1:07 - 1:09
Where each of these arrows corresponds
• 1:09 - 1:12
to a sample of the control policy.
• 1:12 - 1:16
And those are defined even in parts of the state space that are far away.
• 1:16 - 1:18
So this would be an example of a control policy
• 1:18 - 1:22
where all the arrows gradually point you over here.
• 1:22 - 1:25
We just learned about this, using MDPs and value iteration.
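As a quick refresher, here is a minimal sketch of that value-iteration computation on a small grid MDP. The 3x4 layout, the exit rewards, and the discount, slip, and step-cost parameters are illustrative assumptions, not the exact maze from the video.

```python
# Minimal value iteration on a small grid MDP (layout and parameters
# are assumed for illustration, not taken from the lecture's maze).
GAMMA, SLIP, COST = 0.9, 0.2, -0.04
ROWS, COLS = 3, 4
TERMINALS = {(0, 3): +100.0, (1, 3): -100.0}   # assumed exit cells
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
PERP = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

def step(state, action):
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state  # wall bump

def q_value(V, state, action):
    # The intended move succeeds with probability 1-SLIP; otherwise the
    # agent slips to one of the two perpendicular directions.
    outcomes = [(1 - SLIP, step(state, action))]
    outcomes += [(SLIP / 2, step(state, p)) for p in PERP[action]]
    return sum(p * (COST + GAMMA * V[s2]) for p, s2 in outcomes)

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for _ in range(100):   # repeat the Bellman update until values settle
    for s in V:
        V[s] = TERMINALS[s] if s in TERMINALS else max(
            q_value(V, s, a) for a in MOVES)

# The policy is read off the converged values: one arrow per state.
policy = {s: max(MOVES, key=lambda a: q_value(V, s, a))
          for s in V if s not in TERMINALS}
```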
• 1:25 - 1:29
The case I really want to get at is the case of partial observability--
• 1:29 - 1:32
which we will eventually solve, using a technique called POMDP.
• 1:32 - 1:37
And in this case, I'm going to keep the location of the agent in the maze observable.
• 1:37 - 1:43
The part I'm going to make unobservable is where, exactly, I receive plus 100
• 1:43 - 1:45
and where I receive minus 100.
• 1:45 - 1:48
Instead, I'm going to put a sign over here
• 1:48 - 1:51
that tells the agent where to expect plus 100,
• 1:51 - 1:53
and where to expect minus 100.
• 1:53 - 1:57
So the optimum policy would be to first move to the sign,
• 1:59 - 2:03
and then return and go to the corresponding exit,
• 2:03 - 2:07
because the agent now knows where it will receive plus 100.
• 2:07 - 2:10
So, for example, if this exit over here gives us plus 100,
• 2:10 - 2:12
the sign will say Left.
• 2:12 - 2:15
If this exit over here gives us plus 100, the sign will say Right.
• 2:15 - 2:17
What makes this environment interesting is
• 2:17 - 2:21
that if the agent knew which exit would have plus 100,
• 2:21 - 2:23
it would go north from its starting position.
• 2:23 - 2:26
It goes south exclusively to gather information.
• 2:26 - 2:30
So the question becomes: Can we devise a method for planning
• 2:30 - 2:36
that understands that, even though we wish to reach the plus 100 exit,
• 2:36 - 2:40
there's a detour necessary to gather information.
• 2:40 - 2:42
So here's a solution that doesn't work:
• 2:42 - 2:46
Obviously, the agent might be in one of 2 different worlds--and it doesn't know which.
• 2:46 - 2:49
It might be in the world where there's plus 100 on the Left side
• 2:49 - 2:51
or it might be in the world with plus 100 on the Right side,
• 2:51 - 2:53
with minus 100 in the corresponding other exit.
• 2:53 - 2:59
What doesn't work is to solve the problem for each of these cases separately
• 2:59 - 3:00
and then put these solutions together--
• 3:00 - 3:02
for example, by averaging.
• 3:02 - 3:04
The reason why this doesn't work is that
• 3:04 - 3:08
this agent, after averaging, would go north.
• 3:08 - 3:11
It would never have the idea that it is worthwhile to go south,
• 3:11 - 3:15
read the sign, and then return to the optimal exit.
• 3:15 - 3:18
When it finally arrives at the intersection over here,
• 3:18 - 3:20
it doesn't really know what to do.
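To make this concrete, here is a toy back-of-the-envelope comparison; the 6-step detour cost is an assumed number, not one from the video.

```python
# Assumed toy numbers: reading the sign costs a 6-step detour relative
# to walking straight north and guessing an exit at the junction.
p_left = 0.5            # prior: +100 is behind the Left exit

# Averaged plan: go north, then pick an exit without knowing the sign.
v_guess = p_left * 100 + (1 - p_left) * (-100)   # expected payoff: 0.0

# Information-gathering plan: detour to the sign, then exit correctly.
detour_cost = 6
v_sign = 100 - detour_cost                       # payoff: 94, every time

print(v_guess, v_sign)   # 0.0 94 -> the detour south is clearly worth it
```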
• 3:20 - 3:22
So here is the solution that does work--
• 3:22 - 3:25
and it's related to information space or belief space.
• 3:25 - 3:29
In the information space or belief space representation, you do planning,
• 3:29 - 3:31
not in the set of physical world states,
• 3:31 - 3:34
but in what you might know about those states.
• 3:34 - 3:39
And if you're really honest, you find out that there's a multitude of belief states.
• 3:39 - 3:44
Here's the initial one, where you just don't know where to receive 100.
• 3:44 - 3:48
Now, if you move around and either reach one of these exits or the sign,
• 3:48 - 3:51
you will suddenly know where to receive 100.
• 3:51 - 3:55
And that makes your belief state change--
• 3:55 - 3:58
and that makes your belief state change as well.
• 3:58 - 4:01
So, for example, if you find out that 100 is Left,
• 4:01 - 4:03
then your belief state will look like this--
• 4:03 - 4:05
where the ambiguity is now resolved.
• 4:05 - 4:09
Now, how would you jump from this state space to this state space?
• 4:09 - 4:12
The answer is: when you read the sign,
• 4:12 - 4:16
there's a 50 percent chance that the location over here
• 4:16 - 4:19
will result in a transition to the location over here--
• 4:19 - 4:23
50 percent because there's a 50 percent chance that the plus 100 is on the Left.
• 4:23 - 4:28
There's also a 50 percent chance that the plus 100 is on the Right,
• 4:28 - 4:31
so the transition over here is stochastic;
• 4:31 - 4:35
and with 50 percent chance, it will result in a transition over here.
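In code, that stochastic jump between belief states might be sketched like this; the state is the agent's physical position plus what it knows, and the cell label 'sign' and belief labels '?', 'L', 'R' are hypothetical names, not notation from the video.

```python
# Belief-space state: (position, knowledge), where knowledge is one of
# '?' (unknown), 'L' (+100 is Left), or 'R' (+100 is Right).
def belief_transition(position, knowledge):
    """(probability, next_knowledge) pairs on arriving at `position`."""
    if knowledge == '?' and position == 'sign':
        # Reading the sign resolves the ambiguity: a 50 percent chance
        # of jumping into the 'Left' world and 50 percent into 'Right'.
        return [(0.5, 'L'), (0.5, 'R')]
    return [(1.0, knowledge)]   # elsewhere the belief is unchanged
```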
• 4:35 - 4:39
If we now do the MDP trick in this new belief space,
• 4:39 - 4:44
and pour water in here, it kind of flows through here
• 4:44 - 4:48
and creates all these gradients--as we had before.
• 4:48 - 4:51
We do the same over here, and all these gradients that are being created
• 4:51 - 4:53
point to this exit on the Left side.
• 4:53 - 4:58
Then, eventually, this water will flow through here and create gradients like this;
• 4:58 - 5:02
and then flow back through here, where it creates gradients like this.
• 5:02 - 5:06
So the value function is plus 100 over here and plus 100 over here,
• 5:06 - 5:08
and gradually decreases down here and down here;
• 5:08 - 5:11
and then decreases gradually further over here--
• 5:11 - 5:15
and decreases even further over there, so we've got arrows like these.
• 5:15 - 5:20
And that shows you that in this new belief space, you can find a solution.
• 5:20 - 5:24
In fact, you can use value iteration--MDP's value iteration--
• 5:24 - 5:28
in this new space to find a solution to this really complicated
• 5:28 - 5:30
partially observable planning process.
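Here is a minimal end-to-end sketch of that trick: plain value iteration run over (position, belief) pairs, on an abstracted corridor version of the maze. All cell names, costs, and the discount are assumptions for illustration.

```python
# Abstracted maze: start connects to the junction (north) and to the
# sign (south); the junction connects to the two exits. All names and
# parameters are illustrative assumptions.
GAMMA, COST = 0.95, -1.0
ADJ = {'start': ['junction', 'sign'], 'sign': ['start'],
       'junction': ['start', 'left', 'right'],
       'left': [], 'right': []}                 # exits are terminal
BELIEFS = ['?', 'L', 'R']

def reward(cell, belief):
    # Exit payoff depends on the world; under '?' both worlds are
    # equally likely, so the expected payoff of guessing is 0.
    if cell in ('left', 'right'):
        if belief == '?':
            return 0.0
        return 100.0 if (cell == 'left') == (belief == 'L') else -100.0
    return 0.0

def successors(nxt, belief):
    # Stepping onto the sign while uncertain splits the belief 50/50;
    # every other transition leaves the belief untouched.
    if nxt == 'sign' and belief == '?':
        return [(0.5, ('sign', 'L')), (0.5, ('sign', 'R'))]
    return [(1.0, (nxt, belief))]

V = {(c, b): 0.0 for c in ADJ for b in BELIEFS}
for _ in range(100):   # ordinary Bellman updates, now in belief space
    for (c, b) in V:
        if c in ('left', 'right'):
            V[(c, b)] = reward(c, b)
        else:
            V[(c, b)] = max(
                sum(p * (COST + GAMMA * V[s2]) for p, s2 in successors(n, b))
                for n in ADJ[c])

best = max(ADJ['start'],
           key=lambda n: sum(p * V[s2] for p, s2 in successors(n, '?')))
print(best)   # -> 'sign': the optimal first move is the detour south
```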
• 5:30 - 5:33
And the solution--just to reiterate--
• 5:33 - 5:35
will suggest: Go south first,
• 5:37 - 5:41
expose yourself to the random transition into the Left or the Right world
• 5:41 -
in which you are now able to reach the plus 100 with absolute confidence.