Unit 10 2 Successes.mp4

Showing Revision 1 created 11/28/2012 by Amara Bot.

  1. For example, in the 4 by 3 GridWorld,
  2. what if we don't know where the plus 1 and minus 1 rewards are when we start out?
  3. A reinforcement learning agent can learn to explore the territory,
  4. find where the rewards are,
  5. and then learn an optimal policy.
  6. Whereas an MDP solver can only do that
  7. once it knows exactly where the rewards are.
  8. Now, this idea of wandering around and then finding a plus 1 or a minus 1
  9. is analogous to many forms of games, such as backgammon--
  10. and here's an example: backgammon is a stochastic game;
  11. and at the end, you either win or lose.
  12. And in the 1990s, Gerald Tesauro at IBM
  13. wrote a program to play backgammon.
  14. His first attempt tried to learn the utility of a game state, U(s),
  15. using examples that were labelled by human expert backgammon players.
  16. But this was tedious work for the experts,
  17. so only a small number of states were labelled.
  18. The program tried to generalize from that,
  19. using supervised learning,
  20. and was not able to perform very well.
  21. So Tesauro's second attempt used no human expertise and no supervision.
  22. Instead, he had 1 copy of his program play against another;
  23. and at the end of the game, the winner got a positive reward,
  24. and the loser, a negative.
  25. So he used reinforcement learning;
  26. he backed up that knowledge throughout the game states,
  27. and he was able to arrive at a function
  28. that had no input from human expert players,
  29. but, still, was able to perform
  30. at the level of the very best players in the world.
  31. He was able to do this after learning from examples of about 200,000 games.
  32. Now, that may seem like a lot--
  33. but it really only covers about 1 trillionth
  34. of the total state space of backgammon.
  35. Now, here's another example:
  36. This is a remote controlled helicopter
  37. that Professor Andrew Ng at Stanford trained,
  38. using reinforcement learning;
  39. and the helicopter--oh--oh, sorry--
  40. I made a mistake--I put this picture upside down
  41. because--really, Ng trained the helicopter
  42. to be able to fly fancy maneuvers--like flying upside down.
  43. And he did that by looking at only a few hours
  44. of training data from expert helicopter pilots
  45. who would take over the remote controls,
  46. pilot the helicopter--and those would all be recorded--
  47. and then, you would get rewards when it did something good,
  48. or when it did something bad;
  49. and Ng was able to use reinforcement learning
  50. to build an automated helicopter pilot,
  51. just from those training examples.
  52. And that automated pilot, too, can perform tricks
  53. that only a handful of humans are capable of performing.
  54. But enough of this still picture--let's watch a video of Ng's helicopters in action.
  55. [Stanford University Autonomous Helicopter]
  56. [sound of helicopter flying] [Chaos]
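The idea described above, wandering around until a +1 or −1 reward is found and then backing that knowledge up through the states, can be sketched with temporal-difference learning on a toy episodic game. This is a minimal illustrative sketch, not Tesauro's actual program: the 7-state random walk, the learning rate `ALPHA`, and the episode count are all assumptions chosen for the example.

```python
import random

# TD(0) value learning on a tiny "wander until you hit +1 or -1" game,
# a toy analogue of the GridWorld / backgammon setup in the lecture.
# States 0..6; state 0 is the -1 terminal, state 6 is the +1 terminal.
# The agent starts in the middle and moves left or right at random.

N_STATES = 7
ALPHA = 0.1   # learning rate (assumed value, not from the lecture)
GAMMA = 1.0   # no discounting in this episodic game

def play_episode(U):
    """Run one random episode, updating utilities U with the TD(0) rule."""
    s = N_STATES // 2
    while s not in (0, N_STATES - 1):
        s_next = s + random.choice((-1, 1))
        # Terminal rewards: -1 at the left edge, +1 at the right edge.
        if s_next == 0:
            r, target = -1.0, 0.0
        elif s_next == N_STATES - 1:
            r, target = 1.0, 0.0
        else:
            r, target = 0.0, U[s_next]
        # TD(0): nudge U[s] toward the observed reward plus the
        # estimated utility of the next state, backing rewards up.
        U[s] += ALPHA * (r + GAMMA * target - U[s])
        s = s_next

def learn(n_episodes=20000, seed=0):
    random.seed(seed)
    U = [0.0] * N_STATES
    for _ in range(n_episodes):
        play_episode(U)
    return U

if __name__ == "__main__":
    # Utilities should grade from near -1 on the left to near +1 on the right.
    print([round(u, 2) for u in learn()[1:-1]])
```

No state is ever labelled by a human; just as in the self-play backgammon experiment, the only signal is the terminal +1 or −1, and the TD update spreads it back through the intermediate states.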