
10-02 Successes

  • 0:00 - 0:03
    For example, in the 4 by 3 GridWorld,
  • 0:03 - 0:08
    what if we don't know where the plus 1 and minus 1 rewards are when we start out?
  • 0:08 - 0:13
    A reinforcement learning agent can learn to explore the territory,
  • 0:13 - 0:15
    find where the rewards are,
  • 0:15 - 0:17
    and then learn an optimal policy (a small sketch of such an agent appears after the transcript).
  • 0:17 - 0:19
    Whereas an MDP solver can only do that
  • 0:19 - 0:22
    once it knows exactly where the rewards are.
  • 0:22 - 0:27
    Now, this idea of wandering around and then finding a plus 1 or a minus 1
  • 0:27 - 0:32
    is analogous to many forms of games, such as backgammon--
  • 0:32 - 0:35
    and here's an example: backgammon is a stochastic game;
  • 0:35 - 0:38
    and at the end, you either win or lose.
  • 0:38 - 0:40
    And in the 1990s, Gerald Tesauro at IBM
  • 0:40 - 0:43
    wrote a program to play backgammon.
  • 0:43 - 0:49
    His first attempt tried to learn the utility of a game state, U(s),
  • 0:49 - 0:53
    using examples that were labelled by human expert backgammon players.
  • 0:53 - 0:55
    But this was tedious work for the experts,
  • 0:55 - 0:58
    so only a small number of states were labelled.
  • 0:58 - 1:00
    The program tried to generalize from that,
  • 1:00 - 1:02
    using supervised learning,
  • 1:02 - 1:04
    and was not able to perform very well.
  • 1:04 - 1:11
    So Tesauro's second attempt used no human expertise and no supervision.
  • 1:11 - 1:14
    Instead, he had 1 copy of his program play against another;
  • 1:14 - 1:18
    and at the end of the game, the winner got a positive reward,
  • 1:18 - 1:20
    and the loser, a negative.
  • 1:20 - 1:22
    So he used reinforcement learning;
  • 1:22 - 1:25
    he backed up that knowledge throughout the game states (a sketch of this kind of backup appears after the transcript),
  • 1:25 - 1:27
    and he was able to arrive at a function
  • 1:27 - 1:30
    that had no input from human expert players,
  • 1:30 - 1:32
    but, still, was able to perform
  • 1:32 - 1:35
    at the level of the very best players in the world.
  • 1:35 - 1:41
    He was able to do this after learning from about 200,000 example games.
  • 1:41 - 1:43
    Now, that may seem like a lot--
  • 1:43 - 1:46
    but it really only covers about 1 trillionth
  • 1:46 - 1:49
    of the total state space of backgammon.
  • 1:49 - 1:51
    Now, here's another example:
  • 1:51 - 1:54
    This is a remote controlled helicopter
  • 1:54 - 1:56
    that Professor Andrew Ng at Stanford trained,
  • 1:56 - 1:58
    using reinforcement learning;
  • 1:58 - 2:00
    and the helicopter--oh--oh, sorry--
  • 2:00 - 2:04
    I made a mistake--I put this picture upside down
  • 2:04 - 2:08
    because--really, Ng trained the helicopter
  • 2:08 - 2:11
    to be able to fly fancy maneuvers--like flying upside down.
  • 2:11 - 2:15
    And he did that by looking at only a few hours
  • 2:15 - 2:18
    of training data from expert helicopter pilots
  • 2:18 - 2:20
    who would take over the remote controls,
  • 2:20 - 2:23
    pilot the helicopter--and those would all be recorded--
  • 2:23 - 2:27
    and then it would get rewards when it did something good,
  • 2:27 - 2:29
    or when it did something bad;
  • 2:29 - 2:32
    and Ng was able to use reinforcement learning
  • 2:32 - 2:34
    to build an automated helicopter pilot,
  • 2:34 - 2:36
    just from those training examples.
  • 2:36 - 2:39
    And that automated pilot, too, can perform tricks
  • 2:39 - 2:43
    that only a handful of humans are capable of performing.
  • 2:43 - 2:49
    But enough of this still picture--let's watch a video of Ng's helicopters in action.
  • 2:49 - 2:52
    [Stanford University Autonomous Helicopter]
  • 2:52 - 3:05
    [sound of helicopter flying] [Chaos]
  • 3:05 -
    [Stanford University Autonomous Helicopter]
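
The GridWorld point from the start of the transcript can be made concrete. Below is a minimal sketch, not part of the lecture, of a tabular Q-learning agent on an assumed 4 by 3 grid: the agent does not know in advance where the +1 and -1 squares are, discovers them only through the rewards it receives while wandering, and then reads a policy off its learned values. The layout assumed here (wall at (2,2), +1 at (4,3), -1 at (4,2), step cost of -0.04) follows the usual course example, and moves are made deterministic for simplicity, whereas the lecture's GridWorld is stochastic.

import random
from collections import defaultdict

COLS, ROWS = 4, 3
WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # the agent only discovers these through rewards
STEP_COST = -0.04
ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def step(state, action):
    # Deterministic move; bumping into the wall or the grid edge leaves the state unchanged.
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        nxt = state
    if nxt in TERMINALS:
        return nxt, TERMINALS[nxt], True
    return nxt, STEP_COST, False

Q = defaultdict(float)                     # Q[(state, action)], all zero before any experience
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s, done = (1, 1), False
    while not done:
        # Epsilon-greedy exploration: this is the "wandering around" that
        # eventually finds the +1 and -1 squares.
        if random.random() < epsilon:
            a = random.choice(list(ACTIONS))
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Read a greedy policy off the learned Q-values.
policy = {(x, y): max(ACTIONS, key=lambda act: Q[((x, y), act)])
          for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) != WALL and (x, y) not in TERMINALS}
print(policy)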
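The self-play idea behind Tesauro's program can be sketched in the same spirit. This is not his actual TD-Gammon system, which trained a neural network over hand-chosen board features with TD(lambda); it is a simplified, table-based TD(0) illustration in which one copy of a program plays against another, the winner's side gets +1 and the loser's -1 at the end, and that single terminal reward is backed up through the states visited during the game. The functions new_game, legal_moves, apply_move, winner, and to_move are hypothetical stand-ins for a real game engine, and states are assumed to be hashable.

import random

V = {}                          # learned value of a game state, from the first player's point of view
alpha, gamma = 0.1, 1.0

def value(state):
    return V.get(state, 0.0)

def self_play_game(new_game, legal_moves, apply_move, winner, to_move):
    # One copy of the program plays against another; both read and update the same V.
    state = new_game()
    trajectory = [state]
    while winner(state) is None:
        moves = legal_moves(state)
        if random.random() < 0.1:                  # occasional random move keeps exploring new positions
            move = random.choice(moves)
        elif to_move(state) == 'first_player':     # first player prefers high-valued successors
            move = max(moves, key=lambda m: value(apply_move(state, m)))
        else:                                      # second player prefers low-valued successors
            move = min(moves, key=lambda m: value(apply_move(state, m)))
        state = apply_move(state, move)
        trajectory.append(state)

    # The only reward arrives at the very end: +1 if the first player won, -1 if it lost.
    V[trajectory[-1]] = 1.0 if winner(state) == 'first_player' else -1.0

    # Back that outcome up through the states visited during the game (TD(0) update).
    for s, s_next in zip(reversed(trajectory[:-1]), reversed(trajectory[1:])):
        V[s] = value(s) + alpha * (gamma * value(s_next) - value(s))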
Revisions