Unit 10-10: Passive Temporal Difference Learning


Showing Revision 1 created 11/28/2012 by Amara Bot.

  1. Let's start by looking at passive reinforcement learning.
  2. I'm going to describe an algorithm called
  3. Temporal Difference Learning--or TD.
  4. That sounds like a fancy name,
  5. but all it really means is we're going to move
  6. from one state to the next;
  7. and we're going to look at the difference between the 2 states,
  8. and learn that--and then kind of back up
  9. the values, from one state to the next.
  10. So suppose we're going to follow a fixed policy, pi,
  11. and let's say our policy tells us to go this way, and then go this way.
  12. We'll eventually learn that we get a plus 1 reward there
  13. and we'll start feeding back that plus 1, saying:
  14. if it was good to get a plus 1 here,
  15. it must be somewhat good to be in this state,
  16. somewhat good to be in this state--and so on, back to the start state.
  17. So, in order to run this algorithm,
  18. we're going to try to build up a table of utilities for each state
  19. and along the way, we're going to keep track of
  20. the number of times that we visited each state.
  21. Now, the table of utilities, we're going to start blank--
  22. we're not going to start them at zero or anything else;
  23. rather, they're just going to be undefined.
  24. And the table of numbers, we're going to start at zero,
  25. saying we visited each state a total of zero times.
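Those two tables might be set up like this in Python (a sketch; the variable names are mine, not from the lecture):

```python
# Table of utilities: starts blank -- no entry at all until a state is
# visited, rather than starting at zero or any other value.
U = {}   # state -> estimated utility under the fixed policy pi

# Table of counts: every state starts at zero visits.
N = {}   # state -> number of visits; N.get(s, 0) reads an unseen state as 0
```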
  26. What we're going to do is run the policy,
  27. have a trial that goes through the states;
  28. when it gets to a terminal state,
  29. we start it over again at the start and run it again;
  30. and we keep track of how many times we visited each state,
  31. we update the utilities, and we get a better
  32. and better estimate for the utility.
  33. And this is what the inner loop of the algorithm looks like--
  34. and let's see if we can trace it out.
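As a sketch of what that inner loop might look like in Python (function and variable names are my own; the learning rate 1 over N plus 1 and gamma equal to 1 follow the values used in the walkthrough below):

```python
def td_step(prev_state, prev_reward, state, reward, U, N, gamma=1.0):
    """One inner-loop step of passive TD learning under a fixed policy.

    U maps each visited state to its utility estimate; N counts visits.
    """
    if state not in U:
        U[state] = reward            # new state: utility starts at its reward
    if prev_state is not None:
        N[prev_state] = N.get(prev_state, 0) + 1
        alpha = 1.0 / (N[prev_state] + 1)        # decaying learning rate
        # Back up along the temporal difference between the two states.
        U[prev_state] += alpha * (prev_reward + gamma * U[state] - U[prev_state])
    return state, reward             # these become (prev_state, prev_reward)
```

Driving this along each trial, and resetting the previous state whenever we restart from the start state, reproduces the updates traced below.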
  35. So we'll start at a start state,
  36. we'll apply the policy--and let's say the policy tells us to move in this direction.
  37. Then we get a reward here,
  38. which is zero;
  39. and then we look at it with the algorithm,
  40. and the algorithm tells us if the state
  41. is new--yes, it is; we've never been there before--
  42. then set the utility of that state to the new reward, which is zero.
  43. Okay--so now we have a zero here;
  44. and then let's say, the next step, we move up here.
  45. So, again, we have a zero;
  46. and let's say our policy looks like a good one,
  47. so we get: here, we have a zero.
  48. We get: here, we have a zero.
  49. We get: here--now, this state,
  50. we get a reward of 1, so that state gets a utility of 1.
  51. And all along the way, we have to think about
  52. how we're backing up these values, as well.
  53. So when we get here, we have to look at this formula to say:
  54. How are we going to update the utility of the prior state?
  55. And the difference between this state and this state is zero,
  56. so this difference, here, is going to be zero--
  57. the reward is zero, and so there's going to be no update to this state.
  58. But now, finally--for the first time--we're going to have an actual update.
  59. So we're going to update this state to be plus 1,
  60. and now we're going to think about changing this state.
  61. And what was its old utility?--well, it was zero.
  62. And then there's a factor called Alpha,
  63. which is the learning rate
  64. that tells us how much we want to move this utility
  65. towards something that's maybe a better estimate.
  66. And the learning rate should be such that,
  67. if we are brand new,
  68. we want to move a big step;
  69. and if we've seen this state a lot of times,
  70. we're pretty confident of our number
  71. and we want to make a small step.
  72. So let's say that the Alpha function is 1 over N plus 1.
  73. We'd better not make it just 1 over N--that would be undefined when N is zero.
  74. So, with N equal to 1 here, 1 over N plus 1 is ½;
  75. and then the reward in this state was zero;
  76. plus, we had a Gamma--
  77. and let's just say that Gamma is 1,
  78. so there's no discounting; and then
  79. we look at the difference between the utility
  80. of the resulting state--which is 1--
  81. minus the utility of this state, which was zero.
  82. So we get ½ times 1 minus zero--which is ½.
  83. So we update this;
  84. and we change this zero to ½.
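Plugging in the numbers of this first update, with the learning-rate schedule just described (a sketch; the names are mine):

```python
def alpha(n):
    """Learning rate after n visits: the +1 keeps it defined when n is 0."""
    return 1.0 / (n + 1)

old_u  = 0.0   # old utility of the state before the +1 state
reward = 0.0   # reward received in that state
gamma  = 1.0   # no discounting
next_u = 1.0   # utility of the +1 state we just reached
new_u = old_u + alpha(1) * (reward + gamma * next_u - old_u)
# new_u is 1/2: the value written into the table
```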
  85. Now let's say we start all over again
  86. and let's say our policy is right on track;
  87. and nothing unusual, stochastically, has happened.
  88. So we follow the same path,
  89. we don't update--because they're all zeros all along this path.
  90. We go here, here, here;
  91. and now it's time for an update.
  92. So now, we've transitioned from a zero to ½--
  93. so how are we going to update this state?
  94. Well, the old state was zero
  95. and now we have a 1 over N plus 1--
  96. so that's 1/3.
  97. So we're getting a little bit more confident--because we've been there
  98. twice, rather than just once.
  99. The reward in this state was zero,
  100. and then we have to look at the difference between these 2 states.
  101. That's where we get the name, Temporal Difference;
  102. and so, we have ½ minus zero--
  103. and so that's 1/3 times ½--
  104. so that's 1/6.
  105. Now we update this state.
  106. It was zero; now it becomes 1/6.
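That second-trial update, in numbers (all values from the walkthrough; Alpha is 1/3 because the state has now been visited twice):

```python
alpha, gamma = 1.0 / 3.0, 1.0
old_u, reward = 0.0, 0.0   # this state's old utility and reward
next_u = 0.5               # its successor's utility is now 1/2
new_u = old_u + alpha * (reward + gamma * next_u - old_u)
# new_u is 1/6
```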
  107. And you can see how the result
  108. of the positive 1 starts to propagate
  109. backwards--but it propagates slowly.
  110. We have to have 1 trial at a time
  111. to get that to propagate backwards.
  112. Now, how about the update from this state to this state?
  113. Now, we were ½ here--so our old utility was ½;
  114. plus Alpha--the learning rate--is 1/3.
  115. The reward in the old state was zero;
  116. plus the difference between these two,
  117. which is 1 minus ½.
  118. So that's ½ plus 1/6 is 2/3.
  119. And now the second time through,
  120. we've updated the utility of this state from ½ to 2/3.
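And the update of the state just before the terminal one, with the same formula (values from the walkthrough):

```python
alpha, gamma = 1.0 / 3.0, 1.0
old_u, reward = 0.5, 0.0   # this state already had utility 1/2
next_u = 1.0               # its successor is the +1 state
new_u = old_u + alpha * (reward + gamma * next_u - old_u)
# new_u is 1/2 + 1/6 = 2/3
```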
  121. And we keep on going--and you can see the results of the positive, propagating backwards.
  122. And if we did more examples through here,
  123. you would see the results of the negative propagating backwards.
  124. And eventually, it converges to the correct utilities for this policy.
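To see that convergence concretely, here is a self-contained sketch that runs many trials over a simple chain of states standing in for the path above (the chain itself and the state names are my simplification, not the lecture's grid):

```python
U, N = {}, {}
# A fixed path: three zero-reward states, then the +1 terminal state.
chain = [("s1", 0.0), ("s2", 0.0), ("s3", 0.0), ("terminal", 1.0)]

for trial in range(2000):
    prev_s, prev_r = None, None
    for s, r in chain:
        if s not in U:
            U[s] = r                       # new state: utility = its reward
        if prev_s is not None:
            N[prev_s] = N.get(prev_s, 0) + 1
            alpha = 1.0 / (N[prev_s] + 1)  # decaying learning rate
            U[prev_s] += alpha * (prev_r + U[s] - U[prev_s])  # gamma = 1
        prev_s, prev_r = s, r

# Under this fixed path every state's true utility is 1 (the eventual
# reward), and each estimate creeps toward it, one trial at a time.
```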