
09-35 Pomdp

  • 0:00 - 0:05
    I'd like to illustrate the problem, using a very simple environment
  • 0:05 - 0:07
    that looks as follows:
  • 0:07 - 0:09
    Suppose you live in a world like this;
  • 0:09 - 0:11
    and your agent starts over here,
  • 0:11 - 0:13
    and there are 2 possible outcomes.
  • 0:13 - 0:16
    You can exit the maze over here--
  • 0:16 - 0:18
    where you get a plus 100--
  • 0:18 - 0:20
    or you can exit the maze over here,
  • 0:20 - 0:22
    where you receive a minus 100.
  • 0:22 - 0:25
    Now, in a fully observable case,
  • 0:25 - 0:28
    and in a deterministic case,
  • 0:28 - 0:32
    the optimal plan might look something like this;
  • 0:32 - 0:35
    and whether it goes straight over here or not depends on the details.
  • 0:35 - 0:38
    For example, whether the agent has momentum or not.
  • 0:38 - 0:44
    But you'll find a single sequence of actions and states that might cut the corners,
  • 0:44 - 0:47
    as close as possible, to reach the plus 100 as fast as possible.
  • 0:47 - 0:50
    That's conventional planning.
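
To make that conventional planning step concrete, here is a minimal sketch: a plain breadth-first search that returns a single fixed sequence of actions. The corridor graph and the position names below are my own stand-in for the maze in the video, not its actual layout.

```python
from collections import deque

# Hypothetical stand-in for the maze: position -> {action: next position}.
GRAPH = {
    'start':    {'north': 'junction', 'south': 'sign'},
    'sign':     {'north': 'start'},
    'junction': {'west': 'left_exit', 'east': 'right_exit', 'south': 'start'},
}

def plan(start, goal):
    """Breadth-first search: return one action sequence from start to goal."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        pos, actions = frontier.popleft()
        if pos == goal:
            return actions
        for action, nxt in GRAPH.get(pos, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # goal unreachable

print(plan('start', 'left_exit'))   # ['north', 'west']
```

The result is one open-loop action sequence, which is only adequate because nothing here is uncertain.
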
  • 0:50 - 0:53
    Let's contrast this with the case we just learned about,
  • 0:53 - 0:57
    which is the fully observable but stochastic case.
  • 0:57 - 1:01
    We just learned that the best thing to compute is a policy
  • 1:01 - 1:04
    that assigns, to every possible state, an optimal action;
  • 1:04 - 1:07
    and simply speaking, this might look as follows:
  • 1:07 - 1:09
    Where each of these arrows corresponds
  • 1:09 - 1:12
    to a sample control policy.
  • 1:12 - 1:16
    And those are defined even in parts of the state space that are far away.
  • 1:16 - 1:18
    So this would be an example of a control policy
  • 1:18 - 1:22
    where all the arrows gradually point you over here.
  • 1:22 - 1:25
    We just learned about this, using MDPs and value iteration.
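
For the fully observable but stochastic case, a minimal value-iteration sketch looks roughly like this. The grid size, exit rewards, step cost, and slip probability are toy choices of mine, not the maze from the video; the point is only that the output is a policy, an action for every state, rather than a single action sequence.

```python
# Toy grid: 3 rows x 4 columns, two terminal exits with assumed rewards.
ROWS, COLS = 3, 4
TERMINALS = {(0, 3): 100.0, (1, 3): -100.0}
STEP_COST, GAMMA, NOISE = -3.0, 1.0, 0.2       # assumed: 20% chance of slipping sideways
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}

def move(state, delta):
    r, c = state[0] + delta[0], state[1] + delta[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state  # bump into wall: stay

def outcomes(state, action):
    """Intended move with prob 0.8, slip to each perpendicular side with 0.1."""
    dr, dc = ACTIONS[action]
    return [(1 - NOISE, move(state, (dr, dc))),
            (NOISE / 2, move(state, (dc, dr))),
            (NOISE / 2, move(state, (-dc, -dr)))]

states = [(r, c) for r in range(ROWS) for c in range(COLS)]
V = {s: TERMINALS.get(s, 0.0) for s in states}

for _ in range(100):                           # value-iteration sweeps
    for s in states:
        if s in TERMINALS:
            continue                           # a terminal's value is just its reward
        V[s] = max(sum(p * (STEP_COST + GAMMA * V[s2])
                       for p, s2 in outcomes(s, a))
                   for a in ACTIONS)

# The result is a policy: an optimal action for every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in outcomes(s, a)))
          for s in states if s not in TERMINALS}
print(policy[(2, 0)])                          # best action from the bottom-left cell
```

Every non-terminal cell ends up with its own arrow, which is exactly the kind of policy the lecture draws.
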
  • 1:25 - 1:29
    The case I really want to get at is the case of partial observability--
  • 1:29 - 1:32
    which we will eventually solve, using a technique called POMDP.
  • 1:32 - 1:37
    And in this case, I'm going to keep the location of the agent in the maze observable.
  • 1:37 - 1:43
    The part I'm going to make unobservable is where, exactly, I receive plus 100
  • 1:43 - 1:45
    and where I receive minus 100.
  • 1:45 - 1:48
    Instead, I'm going to put a sign over here
  • 1:48 - 1:51
    that tells the agent where to expect plus 100,
  • 1:51 - 1:53
    and where to expect minus 100.
  • 1:53 - 1:57
    So the optimum policy would be to first move to the sign,
  • 1:57 - 1:59
    read the sign;
  • 1:59 - 2:03
    and then return and go to the corresponding exit,
  • 2:03 - 2:07
    where the agent now knows it will receive plus 100.
  • 2:07 - 2:10
    So, for example, if this exit over here gives us plus 100,
  • 2:10 - 2:12
    the sign will say Left.
  • 2:12 - 2:15
    If this exit over here gives us plus 100, the sign will say Right.
  • 2:15 - 2:17
    What makes this environment interesting is
  • 2:17 - 2:21
    that if the agent knew which exit would have plus 100,
  • 2:21 - 2:23
    it would go north from its starting position.
  • 2:23 - 2:26
    It goes south exclusively to gather information.
  • 2:26 - 2:30
    So the question becomes: Can we devise a method for planning
  • 2:30 - 2:36
    that understands that, even though we wish to reach the plus 100 exit,
  • 2:36 - 2:40
    a detour is necessary to gather information?
  • 2:40 - 2:42
    So here's a solution that doesn't work:
  • 2:42 - 2:46
    Obviously, the agent might be in one of 2 different worlds--and it doesn't know which.
  • 2:46 - 2:49
    It might be in the world where there's plus 100 on the Left side
  • 2:49 - 2:51
    or it might be in the world with plus 100 on the Right side,
  • 2:51 - 2:53
    with minus 100 in the corresponding other exit.
  • 2:53 - 2:59
    What doesn't work is to solve the problem for each of these cases separately
  • 2:59 - 3:00
    and then put these solutions together--
  • 3:00 - 3:02
    for example, by averaging.
  • 3:02 - 3:04
    The reason why this doesn't work is
  • 3:04 - 3:08
    this agent, after averaging, would go north.
  • 3:08 - 3:11
    It would never have the idea that it is worthwhile to go south,
  • 3:11 - 3:15
    read the sign, and then return to the optimal exit.
  • 3:15 - 3:18
    When it arrives, finally, at the intersection over here,
  • 3:18 - 3:20
    it doesn't really know what to do.
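
A quick back-of-the-envelope check makes the failure concrete. The path lengths and step cost below are assumed numbers, not taken from the video: in each world solved separately the optimal plan marches straight north, so any merge of the two plans also marches north; yet under the agent's actual 50/50 uncertainty that merged plan is worth roughly nothing, while the detour past the sign is worth almost the full plus 100.

```python
STEP_COST = -1.0                       # assumed cost per move
STEPS_DIRECT, STEPS_VIA_SIGN = 2, 4    # assumed path lengths

# Merged ("averaged") plan: go straight north without knowing the true world.
go_north_blind = 0.5 * 100 + 0.5 * (-100) + STEPS_DIRECT * STEP_COST

# Information-gathering plan: detour to the sign, then exit with certainty.
read_sign_first = 100 + STEPS_VIA_SIGN * STEP_COST

print(go_north_blind)    # -2.0
print(read_sign_first)   # 96.0
```
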
  • 3:20 - 3:22
    So here is the situation that does work--
  • 3:22 - 3:25
    and it's related to information space or belief space.
  • 3:25 - 3:29
    In the information space or belief space representation, you do planning,
  • 3:29 - 3:31
    not in the set of physical world states,
  • 3:31 - 3:34
    but in what you might know about those states.
  • 3:34 - 3:39
    And if you're really honest, you find out that there's a multitude of belief states.
  • 3:39 - 3:44
    Here's the initial one, where you just don't know where to receive 100.
  • 3:44 - 3:48
    Now, if you move around and either reach one of these exits or the sign,
  • 3:48 - 3:51
    you will suddenly know where to receive 100.
  • 3:51 - 3:55
    And that makes your belief state change--
  • 3:55 - 3:58
    and that makes your belief state change.
  • 3:58 - 4:01
    So, for example, if you find out that 100 is Left,
  • 4:01 - 4:03
    then your belief state will look like this--
  • 4:03 - 4:05
    where the ambiguity is now resolved.
  • 4:05 - 4:09
    Now, how would you jump from this state space to this state space?
  • 4:09 - 4:12
    The answer is: when you read the sign,
  • 4:12 - 4:16
    there's a 50 percent chance that the location over here
  • 4:16 - 4:19
    will result in a transition to the location over here--
  • 4:19 - 4:23
    50 percent because there's a 50 percent chance that the plus 100 is on the Left.
  • 4:23 - 4:28
    There's also a 50 percent chance that the plus 100 is on the Right,
  • 4:28 - 4:31
    so the transition over here is stochastic;
  • 4:31 - 4:35
    and with 50 percent chance, it will result in a transition over here.
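
In code, that stochastic jump between belief states can be written as an ordinary transition function over pairs of (position, what-you-know). The state names below are my own notation for the lecture's example, not an established API.

```python
def sign_transition(position, belief):
    """Return a list of (probability, (next_position, next_belief)) pairs."""
    if position == 'sign' and belief == 'unknown':
        # Reading the sign resolves the ambiguity: 50/50 a priori.
        return [(0.5, ('sign', 'plus_100_left')),
                (0.5, ('sign', 'plus_100_right'))]
    return [(1.0, (position, belief))]   # anywhere else, the belief is unchanged

print(sign_transition('sign', 'unknown'))
# [(0.5, ('sign', 'plus_100_left')), (0.5, ('sign', 'plus_100_right'))]
```
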
  • 4:35 - 4:39
    If we now do the MDP trick in this new belief space,
  • 4:39 - 4:44
    and you pour water in here, it kind of flows through here
  • 4:44 - 4:48
    and creates all these gradients--as we had before.
  • 4:48 - 4:51
    We do the same over here, and all these gradients that are being created
  • 4:51 - 4:53
    point to this exit on the Left side.
  • 4:53 - 4:58
    Then, eventually, this water will flow through here and create gradients like this;
  • 4:58 - 5:02
    and then flow back through here, where it creates gradients like this.
  • 5:02 - 5:06
    So the value function is plus 100 over here and plus 100 over here,
  • 5:06 - 5:08
    gradually decreasing down here and down here;
  • 5:08 - 5:11
    and then gradually further decrease over here--
  • 5:11 - 5:15
    and even further decrease over there, so we've got arrows like these.
  • 5:15 - 5:20
    And that shows you that in this new belief space, you can find a solution.
  • 5:20 - 5:24
    In fact, you can use value iteration--MDP's value iteration--
  • 5:24 - 5:28
    in this new space to find a solution to this really complicated
  • 5:28 - 5:30
    partially observable planning process.
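
Here is a minimal sketch of that idea end to end: plain value iteration, but run over belief states (position, belief) instead of positions alone. The five-position corridor, the step cost, and the 50/50 prior are my own simplifications of the video's maze; the computed behavior is the one the lecture describes--go south to the sign when you don't know, go straight north once you do.

```python
STEP_COST, GAMMA = -1.0, 1.0          # assumed per-move cost; undiscounted

MOVES = {                             # toy layout: position -> {action: next position}
    'start':    {'north': 'junction', 'south': 'sign'},
    'sign':     {'north': 'start'},
    'junction': {'west': 'left_exit', 'east': 'right_exit', 'south': 'start'},
}
TERMINALS = {'left_exit', 'right_exit'}
BELIEFS = ['unknown', 'plus_left', 'plus_right']

def exit_reward(pos, belief):
    """Reward collected on entering an exit, given what the agent believes."""
    if pos not in TERMINALS:
        return 0.0
    if belief == 'unknown':
        return 0.0                    # 0.5 * (+100) + 0.5 * (-100) in expectation
    plus_side = 'left_exit' if belief == 'plus_left' else 'right_exit'
    return 100.0 if pos == plus_side else -100.0

def successors(pos, belief, action):
    """(probability, next belief state) pairs; reading the sign splits the belief."""
    nxt = MOVES[pos][action]
    if nxt == 'sign' and belief == 'unknown':
        return [(0.5, ('sign', 'plus_left')), (0.5, ('sign', 'plus_right'))]
    return [(1.0, (nxt, belief))]

states = [(p, b) for p in list(MOVES) + list(TERMINALS) for b in BELIEFS]
V = {s: 0.0 for s in states}

for _ in range(100):                  # value iteration over the belief space
    for pos, belief in states:
        if pos in TERMINALS:
            continue
        V[(pos, belief)] = max(
            sum(p * (STEP_COST + exit_reward(np_, nb) + GAMMA * V[(np_, nb)])
                for p, (np_, nb) in successors(pos, belief, a))
            for a in MOVES[pos])

def best_action(pos, belief):
    return max(MOVES[pos], key=lambda a: sum(
        p * (STEP_COST + exit_reward(np_, nb) + GAMMA * V[(np_, nb)])
        for p, (np_, nb) in successors(pos, belief, a)))

print(best_action('start', 'unknown'))    # 'south' -- detour to read the sign
print(best_action('start', 'plus_left'))  # 'north' -- head straight for the +100
```

Under these assumed costs, moving south from the start in the 'unknown' belief is worth about 96, while moving north first is worth less, so the computed policy gathers information before committing to an exit.
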
  • 5:30 - 5:33
    And the solution--just to reiterate--
  • 5:33 - 5:35
    will suggest: Go south first,
  • 5:35 - 5:37
    read the sign,
  • 5:37 - 5:41
    expose yourself to the random transition to the Left or Right world,
  • 5:41 -
    in which you are now able to reach the plus 100 with absolute confidence.
Title:
09-35 Pomdp
Description:

Unit 9 35 POMDP

Team:
Udacity
Project:
CS271 - Intro to Artificial Intelligence
Duration:
05:46
Revisions