This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, essentially called REINFORCE and called Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and said: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that are maybe some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t equal to that matrix L_t times s_t. L_t was this matrix of parameters that came out of the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we would choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_{t|t}, which were our best estimate of what the state is given all the observations. In particular, I talked about a Kalman filter, with which we worked out our posterior distribution of what the state is given all the observations up to a certain time: given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian with mean s_{t|t} and covariance Sigma_{t|t}. So I said we use a Kalman filter to compute this s_{t|t}, which is going to be our best guess for what the state currently is. And then we choose actions a_t = L_t s_{t|t}, using our estimate of the state rather than the true state, because we don't know the true state anymore in this POMDP.
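To make the "estimate, then act" recipe concrete, here is a minimal sketch of one time step of this certainty-equivalence controller. It assumes the usual linear-Gaussian model (next state = A s + B a + process noise, observation = C s + observation noise) and assumes the LQR gain L_t and the noise covariances Q and R are already available; the names and shapes here are illustrative, not from the lecture.

```python
import numpy as np

def kalman_step(s_hat, Sigma, a_prev, y, A, B, C, Q, R):
    """One Kalman filter step: predict with the dynamics, then correct with the observation.

    s_hat, Sigma : previous posterior mean and covariance (s_{t-1|t-1}, Sigma_{t-1|t-1})
    Q, R         : process and observation noise covariances (assumed known)
    Returns the new posterior mean and covariance (s_{t|t}, Sigma_{t|t}).
    """
    # Predict: s_{t|t-1}, Sigma_{t|t-1}
    s_pred = A @ s_hat + B @ a_prev
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct using the noisy observation y_t
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)  # Kalman gain
    s_new = s_pred + K @ (y - C @ s_pred)
    Sigma_new = (np.eye(len(s_hat)) - K @ C) @ Sigma_pred
    return s_new, Sigma_new

def choose_action(L_t, s_hat):
    # Certainty equivalence: act as if the state estimate were the true state.
    return L_t @ s_hat
```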
So it turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism of the POMDP - I should just write it here - a POMDP formally is a tuple (S, A, Y, {P_sa}, {O_s}, T, R), where the changes from an MDP are that Y is the set of possible observations and the O_s are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O_{s_t}, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP. In the more general case, that strategy of designing a controller assuming full observability and then just estimating the state and plugging the two together - for general POMDPs, that same strategy is often a very reasonable strategy, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is actually talk about a different class of reinforcement learning algorithms, called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs - to fully observed Markov decision processes - or to these POMDPs, these partially observable MDPs. What I'll actually do now is describe policy search algorithms applied to MDPs, to the fully observable case, and at the end I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs.
In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard. But nonetheless, policy search algorithms turn out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi. Just by analogy: the way we define the set capital Pi of policies and search for a policy in this set is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in that class script H.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've talked about so far, we would try to find V-star, the optimal value function, and then we'd use the optimal value function to compute, or try to approximate, pi-star. So all the approaches we talked about previously are a strategy for finding a good policy in which we first compute the value function, and then go from that to the policy. In contrast, in policy search algorithms - in these so-called direct policy search algorithms - the idea is that we're going to, quote, "directly" try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

And also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I said let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air.
In the same way, as I define policy search algorithms, there will be a step where I say, "Well, let's try to compute the actions - let's try to approximate what a good action is - using a logistic function of the state." So again, I'll sort of pull a function out of the air. I'll say, "Let's just choose a function, and that'll be our choice of the policy class: let's take as input the state, map it through a logistic function, and hopefully we'll approximate what a good action is using a logistic function of the state." So that choice of policy class is again a little bit arbitrary, but it's arbitrary in the same way it was when we were talking about supervised learning.

So to develop our first policy search algorithm, I'm actually going to need a new definition. For our first policy search algorithm, we'll need to work with stochastic policies. What I mean by a stochastic policy is a function pi that maps from the space of states cross actions to the real numbers, where pi(s, a) will be interpreted as the probability of taking action a in some state s. And so these have to sum to one over the actions: for every state s, the sum over a of pi(s, a) equals one. In other words, for every state, a stochastic policy specifies a probability distribution over the actions.

So concretely, suppose you're executing some stochastic policy pi. In this example, let's say I have three actions. What that means is that if I'm in some state s, I would compute pi(s, a1), pi(s, a2), and pi(s, a3) - if I have a three-action MDP. These will be three numbers that sum up to one. Then my chance of taking action a1 will be equal to pi(s, a1), my chance of taking action a2 will be equal to pi(s, a2), and my chance of taking action a3 will be equal to pi(s, a3). So that's what it means to execute a stochastic policy.
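As a minimal sketch of what "executing a stochastic policy" means in code (the function name and the three-action setup are just for illustration):

```python
import numpy as np

def execute_stochastic_policy(pi, s, actions):
    """Sample one action in state s from a stochastic policy.

    pi(s, a) returns the probability of taking action a in state s;
    the probabilities over `actions` are assumed to sum to one.
    """
    probs = np.array([pi(s, a) for a in actions])
    return np.random.choice(actions, p=probs)
```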
So as a concrete example, let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever.

And now let's come up with a stochastic policy for this problem. To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1 - so pi(s, a1) - is 1/(1 + e^{-theta^T s}). I just chose the logistic function because it's a convenient function we've used a lot. So I'm going to say that my policy is parameterized by a set of parameters theta, and any given set of parameters theta gives me a stochastic policy. If I'm executing that policy with parameters theta, that means the chance of my choosing the accelerate-right action is given by this number. And because my chances of executing actions a1 and a2 must sum to one, this also gives me pi(s, a2) = 1 - 1/(1 + e^{-theta^T s}). So, just to say it again: when I'm in some state s, I'm going to compute this number 1/(1 + e^{-theta^T s}), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is s = [1, x, x-dot, phi, phi-dot] - the cart position, the cart velocity, the pole angle, and the angular velocity - where I added an extra one as an intercept term, just to give my logistic function an extra feature.
If I choose my parameters and my policy to be, say, a theta with a one in the position corresponding to phi and zeros elsewhere, then that means that at any state, the probability of my taking action a1 - the probability of my taking the accelerate-right action - is 1/(1 + e^{-theta^T s}); and taking the inner product of theta and s just gives you phi, so this equals 1/(1 + e^{-phi}). So if I choose my parameters theta like that, what that means is that, just depending on the angle phi of my inverted pendulum, the chance of my accelerating to the right is just this function of the angle of my inverted pendulum. And so this means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole is leaning over to the right, then I want to accelerate to the right to catch it; and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take those other effects into account as well. Maybe if the pole is leaning far to the right but is actually on its way to swinging back, as captured by the angular velocity, then you might be less worried about having to accelerate hard to the right. So these are the sorts of behavior you can get by varying the parameters theta.

And so our goal is to tune the parameters theta - our goal in policy search is to tune the parameters theta so that when we execute the policy pi_theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta; we want to choose parameters theta to maximize that.
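Here is a minimal sketch of this two-action logistic policy for the inverted pendulum, following the parameterization above; the feature ordering and the specific theta are just the illustrative choice from the example.

```python
import numpy as np

def pi_theta(theta, s):
    """Return (P(accelerate right), P(accelerate left)) in state s."""
    p_right = 1.0 / (1.0 + np.exp(-theta @ s))  # logistic of theta^T s
    return p_right, 1.0 - p_right

def sample_action(theta, s):
    p_right, _ = pi_theta(theta, s)
    return "right" if np.random.rand() < p_right else "left"

# State features: [1 (intercept), cart position x, cart velocity, pole angle phi, angular velocity]
s = np.array([1.0, 0.0, 0.0, 0.1, 0.0])      # pole leaning slightly to the right
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # theta^T s = phi, as in the example
print(pi_theta(theta, s))  # P(right) > 0.5, so we usually accelerate right to catch the pole
```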
Are there questions about the problem setup, and policy search and policy classes, or anything? Yeah? In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? Oh, yeah, right - so what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use something like a softmax parameterization, similar to the softmax regression that we saw earlier in the class. You may have a set of parameters theta_1 through theta_d if you have d actions, and say pi(s, a_i) = e^{theta_i^T s} / (sum over j of e^{theta_j^T s}). So that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.

Anything else? Yeah? [Inaudible question about stochastic policies.] I see - so the question is: given that the MDP has stochastic transition probabilities, is it possible to use deterministic policies and rely on the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. Formally, it is possible: if you already have a fixed policy, then you'd be able to do that. Anything else? Yeah - no, I guess even a deterministic class of policies could do that, but for the derivation later, I actually need to keep the stochasticity in the policy itself.

Actually - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay. Thanks.
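A minimal sketch of the softmax policy class just mentioned, for d discrete actions (the parameter shapes are just an illustrative choice):

```python
import numpy as np

def softmax_policy(Theta, s):
    """Theta holds one parameter vector per action (shape: d actions x n features).

    Returns pi(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s) for each action i.
    """
    logits = Theta @ s
    logits -= logits.max()          # numerical stability; doesn't change the probabilities
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_softmax_action(Theta, s):
    probs = softmax_policy(Theta, s)
    return np.random.choice(len(probs), p=probs)  # index of the chosen action
```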
So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by the author Ron Williams, but it captures its essence.

Here's the idea. In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. Or - it turns out that if s_0 is drawn from some fixed initial-state distribution, everything that follows still works, but let's just say s_0 is some fixed initial state. So my goal is to maximize the expected sum of rewards; I'll take the policy as given and leave it implicit in the notation. The random variables in this expectation are the sequence of states and actions s_0, a_0, s_1, a_1, and so on, up to s_T, a_T.

So let me write out this expectation explicitly as a sum over all possible state and action sequences of the probability of that sequence times the sum of rewards - that's what an expectation is: the probability of the random variables times that quantity. Let me just expand out this probability. The probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state - if this is a deterministic initial state, then all the probability mass would be on one state; otherwise, there's some distribution over initial states - times the probability that you chose action a_0 from that state s_0, times the probability that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, times the probability that you chose the action you chose there, and so on, down to the last transition and the last action.

So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this. It turns out that later on I'm going to need to write this sum of rewards a lot, so I'm just going to call it the payoff from now on. Whenever later in this lecture I write the word payoff, I just mean this sum of rewards. So our goal is to maximize the expected payoff - our goal is to maximize this sum.
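The boards aren't captured in the transcript, but written out, the expansion just described is the following (with the payoff taken to be the sum of rewards along the sequence, written here as a function of the states, as in the examples later in the lecture):

```latex
\mathbb{E}[\text{payoff}]
  \;=\; \sum_{(s_0,a_0,\ldots,s_T,a_T)}
        P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
        P_{s_{T-1} a_{T-1}}(s_T)\,\pi_\theta(s_T,a_T)
        \;\cdot\;\underbrace{\big[R(s_0)+R(s_1)+\cdots+R(s_T)\big]}_{\text{payoff}}
```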
Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm - this is how we're going to update the parameters theta. Repeat: sample a state-action sequence s_0, a_0, s_1, a_1, ..., s_T, a_T; compute the payoff; and then update theta := theta + alpha [ (gradient of pi_theta(s_0, a_0) with respect to theta) / pi_theta(s_0, a_0) + ... + (gradient of pi_theta(s_T, a_T) with respect to theta) / pi_theta(s_T, a_T) ] times the payoff, for some learning rate alpha.

The way you sample a state-action sequence is that you just take your current stochastic policy and execute it in the MDP: start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps. That's how you sample the state sequence. Then you compute the payoff, and then you perform this update.

So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and this thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I'm taking a step that's chosen somewhat randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how this changes the parameters theta - in particular, I want to know what is the expected value of the change to my parameters theta. Our goal is to maximize the expected payoff, so as long as the updates are, on expectation - on average - taking us uphill on the expected payoff, then we're happy.
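The update itself was written on the board and isn't captured in the transcript, so here is a minimal sketch of one iteration in the form just described. It assumes you already have functions to sample from pi_theta, to compute the gradient of log pi_theta, and to step the MDP; all of those names are placeholders.

```python
import numpy as np

def reinforce_iteration(theta, sample_action, grad_log_pi, reward, env_reset, env_step, T, alpha):
    """One iteration of (essentially) the REINFORCE update described above.

    sample_action(theta, s) -> a        : draw an action from the stochastic policy pi_theta
    grad_log_pi(theta, s, a) -> vector  : grad_theta log pi_theta(s, a), i.e. (grad_theta pi)/pi
    reward(s) -> scalar, env_reset() -> s_0, env_step(s, a) -> next state
    """
    s = env_reset()
    grad_sum = np.zeros_like(theta)
    payoff = reward(s)                        # R(s_0)
    for _ in range(T):
        a = sample_action(theta, s)
        grad_sum += grad_log_pi(theta, s, a)  # accumulate sum_t (grad_theta pi)/pi
        s = env_step(s, a)
        payoff += reward(s)                   # ... + R(s_1) + ... + R(s_T)
    # Stochastic gradient ascent step: its expectation is grad_theta E[payoff].
    return theta + alpha * grad_sum * payoff
```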
It turns out that this algorithm is a form of stochastic gradient ascent. Remember, when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters - maybe you're trying to minimize a quadratic function - and the parameters wander around randomly until they get close to the optimum of that quadratic surface. It turns out that the REINFORCE algorithm will be very much like that: it will be a stochastic gradient ascent algorithm in which the step we take on each iteration is a little bit random - it's determined by the random state-action sequence - but on expectation, this turns out to be essentially a gradient ascent algorithm. And so it will do something like this: it'll wander around randomly, but on average take you towards the optimum. So let me go ahead and prove that now.

Let's see. What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then hopefully show that by deriving a gradient ascent update rule, I end up with this update on expectation.

Before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions f(theta) g(theta) h(theta), then the derivative of the product is f' g h + f g' h + f g h'. You take the derivatives of these functions one at a time: first I differentiate f, getting f prime, leaving the other two fixed; then I differentiate g, leaving the other two fixed; then I differentiate h, getting h prime, leaving the other two fixed. So that's the product rule for derivatives.

Now refer back to the equation where we wrote out the expected payoff as this sum over all the state-action sequences of the probability times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function - take the derivative with respect to theta and then try to go uphill on it. So using the product rule, when I take the derivative of this function with respect to theta, what I get is a sum of terms: there are several factors here that depend on theta, and so I'll end up with one term that corresponds to the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the next pi_theta factor keeping everything else fixed, and so on, down to one term from the derivative of the last pi_theta factor keeping everything else fixed. So just apply the product rule to this.
Let's write that down. So I have that the derivative with respect to theta of the expected value of the payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how it goes; this is a four-step derivation.

So there's the sum over the state-action sequences, as we saw before, and inside it a sum of terms, one for each time step, where in each term one of the pi_theta factors has been replaced by its derivative with respect to theta, and then all of that times the payoff. That huge amount of writing was just taking my previous formula and differentiating the terms that depend on theta, one at a time. There's the term with the derivative of the first factor, pi_theta(s_0, a_0); there's the second one; then plus dot, dot, dot; and then the last term. So that was step one of four.

And then, by algebra - let me just write this down and convince us all that it's true - this is the second of four steps, in which we just convince ourselves that if I take that big product out in front and multiply it by a sum of ratios, the derivative of pi_theta over pi_theta, one ratio for each time step, then I get back the same sum of terms. For example, when I multiply the product out in front against the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and is replaced by the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But that product out in front is just what I worked out previously: it's the joint probability of the state-action sequence. So now I have that probability, times the sum of ratios, times the payoff. And so, by the definition of expectation, this whole thing is just equal to the expected value of that sum of ratios times the payoff.

So this thing inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm - roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff.
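The boards themselves aren't in the transcript; written out in the same notation as above, the four steps just described amount to:

```latex
\begin{aligned}
\nabla_\theta\,\mathbb{E}[\text{payoff}]
&= \sum_{s_0,a_0,\ldots}\Big[(\nabla_\theta \pi_\theta(s_0,a_0))\,P(s_0)\,P_{s_0a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
   \;+\; P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0a_0}(s_1)\,(\nabla_\theta \pi_\theta(s_1,a_1))\cdots
   \;+\;\cdots\Big]\cdot\text{payoff} \\
&= \sum_{s_0,a_0,\ldots} P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
   \left[\frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
        +\cdots
        +\frac{\nabla_\theta \pi_\theta(s_T,a_T)}{\pi_\theta(s_T,a_T)}\right]\cdot\text{payoff} \\
&= \mathbb{E}\!\left[\left(\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)\cdot\text{payoff}\right].
\end{aligned}
```

The quantity inside the final expectation is exactly the update direction used in the inner loop of the algorithm above, which is what the argument needed.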
That's how I started this whole derivation: I said, let's look at our expected payoff and take the derivative of that with respect to theta. What we've proved is that, on expectation, the step direction taken by REINFORCE is exactly the gradient of the thing I'm trying to optimize. This shows that this algorithm is a stochastic gradient ascent algorithm.

I wrote a lot. Why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then see if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

A few more comments. We've talked about value function approximation approaches, where you approximate V-star and then go from V-star to pi-star, and now also about policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when either one may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is: for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well? For a problem like the inverted pendulum, this is quite likely to be true: going through the different choices of parameters, you can express things like, if the pole is leaning towards the right, then accelerate towards the right to try to catch it. For the inverted pendulum, that's probably good enough. For lots of what are called low-level control tasks - things like driving a car, where the low-level reflexes are do you steer left to avoid another car, do you steer left to follow the road, or flying a helicopter - again, very short-time-scale types of decisions. I like to think of these as the kinds of decisions a trained driver or a trained pilot makes almost as a reflex: very quick, instinctive things where you map very directly from the inputs to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.
In contrast, if you have problems that require long, multi-step reasoning - things like a game of chess, where you have to reason carefully about if I do this, then they'll do that, then I'll do this, then they'll do that - those I think of as less instinctual, as very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. I'll say more about this later.

The last thing I want to do is tell you about one more algorithm, but first, just as a side comment: it turns out also that if you have a POMDP, a partially observable MDP - I don't want to say too much about this - and you only have an approximation, call it s-hat, of the true state (and this could be s-hat equal to s_{t|t} from a Kalman filter), then you can still use these sorts of policy search algorithms, with a policy of the form pi_theta(s-hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that looks only at your estimate of the state to choose the action. By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.

There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow. So it does work, but one thing to watch out for is that because you're taking these gradient ascent steps that are very noisy - you're sampling a state-action sequence and then taking a gradient ascent step in what is essentially a random direction that is only correct on expectation - the gradient ascent direction for REINFORCE can sometimes be quite noisy, and so it's not that uncommon to need a million iterations of gradient ascent, or ten million, or a hundred million iterations of gradient ascent for REINFORCE. So that's just something to watch out for.
One consequence of that is that in the REINFORCE algorithm - I shouldn't really call it REINFORCE; in what's essentially the REINFORCE algorithm - there's this step where you need to sample a state-action sequence. In principle, you could do this on your own robot: if there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don't want to run your robot ten million times. I personally have seen many more applications of REINFORCE in simulation: you can easily run ten thousand, or ten million, simulations of your robot, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE used to learn with a simulator than applied directly on a physical device.

The last thing I wanted to do is tell you about one other algorithm, one final policy search algorithm. Could we switch to the laptop display, please? It's a policy search algorithm called Pegasus, and it's actually what we used for our autonomous helicopter flights for many years - there are some other things we do now as well.

So here's the idea. There's a reminder slide on the RL formalism. There's nothing here that you don't know, but I just want to describe the RL formalism pictorially, because I'll use it later. I'm going to draw the reinforcement learning picture as follows: you initialize the system - say a helicopter, or whatever - in some state s_0, you choose an action a_0, then the helicopter dynamics take you to some new state s_1, you choose some other action a_1, and so on. And then you have some reward function, which you apply to the sequence of states you passed through, and that's your total payoff. So this is just the picture I want to use to summarize the RL problem. Our goal is to maximize the expected payoff, which is the expected sum of the rewards, and our goal is to learn the policy, which I denote by a green box.
So, our policy - and I'll switch back to deterministic policies for now - my deterministic policy will be some function mapping from the states to the actions. As a concrete example, you can imagine that in the policy search setting, you may have a linear class of policies: you may imagine that the action a will be, say, a linear function of the state, and your goal is to learn the parameters of that linear function. So it's as if you're doing linear regression on policies, except you're trying to optimize the reinforcement learning objective. Just imagine that the action a is theta transpose s, and you run policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies.

Here's the idea. Quite often we come up with a model, or a simulator, for the MDP. As before, a model or a simulator is just a box that takes as input some state s_t and some action a_t, and then outputs the next state s_{t+1} that you might end up in in the MDP. This s_{t+1} will be a random state; it will be drawn from the random state transition probabilities of the MDP. This is important - very important: s_{t+1} will be a random function of s_t and a_t, and the simulator reflects that. So, for example, for autonomous helicopter flight, you might build a simulator using supervised learning - an algorithm like linear regression, or a nonlinear generalization of linear regression - so that you get a nonlinear model of the dynamics, of what s_{t+1} is as a random function of s_t and a_t.

Now, once you have a simulator, given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find a policy pi, mapping from states to actions - the goal is to find the green box like that - that works well. So if you have any one fixed policy pi, you can evaluate the policy pi just using the simulator, via the picture shown at the bottom of the slide.
So concretely, you take your initial state s_0 and feed it into the policy pi; your policy pi outputs some action a_0; you plug that into the simulator; the simulator outputs a random state s_1; you feed s_1 into the policy, and so on; and you get a sequence of states s_0 through s_T that your helicopter flies through in simulation. Then you sum up the rewards, and this gives you an estimate of the expected payoff of the policy. This picture is just a fancy way of saying: fly your helicopter in simulation and see how well it does - measure the sum of rewards you get on average in the simulator. The picture I've drawn here assumes that you run your policy through the simulator just once; in general, you would run the policy through the simulator some number of times and then average over M simulations, to get a better estimate of the policy's expected payoff.

Now, given any one fixed policy, this gives me a way to evaluate the expected payoff of that policy. So one reasonably obvious thing you might try to do is then just search for a policy - in other words, search for parameters theta for your policy - that gives you a high estimated payoff. Does that make sense? My policy has some parameters theta - say my policy is that my action a is equal to theta transpose s, if it's a linear policy - and for any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. So one thing I might try is to search for parameters theta that maximize my estimated payoff.

It turns out you can sort of do that, but that idea, as I've just stated it, is hard to get to work. Here's the reason. The simulator allows us to evaluate policies, so let's search for a policy of high value. The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a very slightly different answer.
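Here's a minimal sketch of that evaluation loop - roll the policy out in the simulator and average the total reward over M rollouts; the simulator, reward, and policy functions are placeholders. Running it twice on the same policy will generally return slightly different numbers, which is exactly the difficulty described next.

```python
import numpy as np

def estimate_payoff(policy, simulator, reward, s0, T, M):
    """Monte Carlo estimate of a policy's expected payoff using a stochastic simulator.

    policy(s) -> a, simulator(s, a) -> next state (random), reward(s) -> scalar.
    """
    totals = []
    for _ in range(M):                 # M independent rollouts
        s, total = s0, reward(s0)
        for _ in range(T):
            a = policy(s)
            s = simulator(s, a)        # random next state
            total += reward(s)
        totals.append(total)
    return np.mean(totals)             # noisy estimate: repeated calls give different answers
```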
So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies. In other words, as I vary the parameters in my policy, I get different points on the horizontal axis: as I vary the parameters theta, I get different policies, and so I'm moving along the x-axis. The total payoff I'm going to plot on the vertical axis, and the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff.

You could try to search for a policy with high expected payoff, but every time you evaluate a policy - say I evaluate some policy - I might get a point that, just by chance, looked a little bit better than it should have. I evaluate a second policy, and just by chance it looks a little bit worse. I evaluate a third policy, a fourth, and sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random: when I run the same policy twice in simulation, I might get back slightly different answers. So as I evaluate more and more policies, these black dots are the picture I get. My goal is to try to optimize the red line, and I hope you appreciate that this is a hard problem, especially when all your optimization algorithm gets to see are these black dots; it doesn't have direct access to the red line. And when my input space is some fairly high-dimensional space - if I have ten parameters, the horizontal axis would actually be a ten-dimensional space - all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem.

So it turns out there's one way to make this optimization problem much easier. Here's the idea - the method is called Pegasus, which is an acronym for something; I'll tell you what later. The simulator repeatedly makes calls to a random number generator to generate random numbers r_t, which are used to simulate the stochastic dynamics.
643 00:55:33,079 --> 00:55:36,709 What I mean by that is that the simulator takes as input the state 644 00:55:36,709 --> 00:55:39,749 and action, and it outputs the next state randomly, 645 00:55:39,749 --> 00:55:43,999 and if you peer into the simulator, if you open the box of the simulator and ask 646 00:55:43,999 --> 00:55:49,529 how is my simulator generating these random next states ST plus one, 647 00:55:49,529 --> 00:55:53,809 pretty much the only way to do so - pretty much the only way 648 00:55:53,809 --> 00:55:56,909 to write a simulator with random outputs is 649 00:55:56,909 --> 00:56:00,209 that we're gonna make calls to a random number generator, 650 00:56:00,209 --> 00:56:03,440 and get random numbers, these RTs, 651 00:56:03,440 --> 00:56:08,299 and then you have some function that takes as input S0, A0, and 652 00:56:08,299 --> 00:56:10,979 the results of your random number generator, 653 00:56:10,979 --> 00:56:14,430 and it computes the next state as a function of the inputs and of the random 654 00:56:14,430 --> 00:56:16,839 number it got from the random number 655 00:56:16,839 --> 00:56:17,699 generator. This is 656 00:56:17,699 --> 00:56:22,359 pretty much the only way anyone implements any random code, any code 657 00:56:22,359 --> 00:56:26,179 that generates random outputs. You make a call to a random number generator, 658 00:56:26,179 --> 00:56:30,869 and you compute some function of the random number and of your other 659 00:56:30,869 --> 00:56:35,299 inputs. The reason that when you evaluate different policies you get 660 00:56:35,299 --> 00:56:36,439 different answers 661 00:56:36,439 --> 00:56:37,749 is because 662 00:56:37,749 --> 00:56:41,459 every time you rerun the simulator, you get a different sequence of random 663 00:56:41,459 --> 00:56:43,569 numbers from the random number generator, 664 00:56:43,569 --> 00:56:45,569 and so you get a different answer every time, 665 00:56:45,569 --> 00:56:48,139 even if you evaluate the same policy twice. 666 00:56:48,139 --> 00:56:50,910 So here's the idea. 667 00:56:50,910 --> 00:56:56,150 Let's say we fix in advance the sequence of random numbers, 668 00:56:56,150 --> 00:57:00,229 choose R1, R2, up to RT minus one. 669 00:57:00,229 --> 00:57:03,359 Fix the sequence of random numbers in advance, 670 00:57:03,359 --> 00:57:07,280 and we'll always use the same sequence of random numbers 671 00:57:07,280 --> 00:57:10,189 to evaluate different policies. Go 672 00:57:10,189 --> 00:57:14,549 into your code and fix R1, R2, through RT minus one. 673 00:57:14,549 --> 00:57:17,640 Choose them randomly once and then fix them forever. 674 00:57:17,640 --> 00:57:20,979 If you always use the same sequence of random numbers, then 675 00:57:20,979 --> 00:57:25,210 the system is no longer random, and if you evaluate the same policy twice, 676 00:57:25,210 --> 00:57:28,569 you get back exactly the same answer. 677 00:57:28,569 --> 00:57:33,330 And so these sequences of random numbers are what we call scenarios, and 678 00:57:33,330 --> 00:57:37,410 Pegasus is an acronym for policy evaluation-of-goodness and search using scenarios. 679 00:57:37,410 --> 00:57:38,859 So 680 00:57:42,369 --> 00:57:45,949 when you do that, this is the picture you get. 681 00:57:45,949 --> 00:57:50,190 As before, the red line is your expected payoff, and by fixing the random numbers, 682 00:57:50,190 --> 00:57:53,469 you've defined some estimate of the expected payoff.
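A minimal sketch of that idea, under the assumption that the simulator can be written as a deterministic function `step_deterministic(s, a, r)` of the state, the action, and an externally supplied random number; that function, along with the horizon and noise size below, is a made-up illustration rather than anything specific from the lecture. The scenario is drawn once and then frozen.

```python
import numpy as np

T, noise_dim = 100, 4               # horizon and per-step noise size (illustrative values)
rng = np.random.default_rng(0)

# Draw the random numbers R_1, ..., R_T once, up front.  This fixed sequence
# is the "scenario"; every policy evaluation reuses exactly these numbers.
scenario = rng.normal(size=(T, noise_dim))

def pegasus_payoff(theta, s0, step_deterministic, reward, scenario):
    """Payoff of the linear policy a = theta^T s when all of the simulator's
    randomness comes from the pre-drawn scenario.  Evaluating the same theta
    twice now returns exactly the same number."""
    s, total = s0, 0.0
    for r_t in scenario:
        a = theta.T @ s
        total += reward(s, a)
        s = step_deterministic(s, a, r_t)   # next state = f(state, action, random numbers)
    return total
```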
683 00:57:53,469 --> 00:57:56,739 And as you evaluate different policies, they're 684 00:57:56,739 --> 00:58:00,609 still only approximations to their true expected payoff, but at least now you have 685 00:58:00,609 --> 00:58:01,310 a 686 00:58:01,310 --> 00:58:03,359 deterministic function to optimize, 687 00:58:03,359 --> 00:58:07,109 and you can now apply your favorite algorithms, be it gradient ascent or 688 00:58:08,089 --> 00:58:09,809 some sort 689 00:58:09,809 --> 00:58:12,259 of local [inaudible] search 690 00:58:12,259 --> 00:58:13,609 to try to optimize the black 691 00:58:13,609 --> 00:58:16,329 curve. This gives you a much easier 692 00:58:16,329 --> 00:58:18,390 optimization problem, and you 693 00:58:18,390 --> 00:58:22,789 can optimize this to find hopefully a good policy. So this is 694 00:58:22,789 --> 00:58:26,650 the Pegasus policy search method. 695 00:58:26,650 --> 00:58:27,450 So when 696 00:58:27,450 --> 00:58:31,390 I started to talk about reinforcement learning, I showed 697 00:58:31,390 --> 00:58:34,980 that video of a helicopter flying upside down. That was actually done using 698 00:58:34,980 --> 00:58:40,029 exactly this method, using exactly this policy search algorithm. This 699 00:58:40,029 --> 00:58:41,849 seems to scale well 700 00:58:41,849 --> 00:58:47,099 even to fairly large problems, even to fairly high dimensional state spaces. 701 00:58:47,099 --> 00:58:50,439 Typically, with Pegasus policy search algorithms, 702 00:58:50,439 --> 00:58:52,719 the optimization problem 703 00:58:52,719 --> 00:58:56,440 is still much easier than the stochastic version, but sometimes it's 704 00:58:56,440 --> 00:58:58,279 not entirely trivial, and so 705 00:58:58,279 --> 00:59:00,299 you apply this sort of method with 706 00:59:00,299 --> 00:59:04,549 maybe on the order of ten parameters or tens of parameters, so 30, 40 707 00:59:04,549 --> 00:59:05,750 parameters, but 708 00:59:05,750 --> 00:59:08,169 not thousands of parameters, 709 00:59:08,169 --> 00:59:13,049 at least in these sorts of applications. 710 00:59:13,049 --> 00:59:19,469 So is that method different than just assuming that 711 00:59:19,469 --> 00:59:26,949 you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that 712 00:59:26,949 --> 00:59:28,809 we have a deterministic simulator? 713 00:59:28,809 --> 00:59:31,249 The answer is no. Here's the way you actually do this - 714 00:59:31,249 --> 00:59:34,949 for the sake of simplicity, I talked about one sequence of random numbers. 715 00:59:34,949 --> 00:59:36,890 What you do is - 716 00:59:36,890 --> 00:59:41,969 so imagine that the random numbers are simulating different wind gusts against your helicopter. 717 00:59:41,969 --> 00:59:46,429 So what you want to do isn't really to evaluate just against one pattern of wind gusts. 718 00:59:46,429 --> 00:59:48,000 What you want to do is 719 00:59:48,000 --> 00:59:49,920 sample some set of 720 00:59:49,920 --> 00:59:53,650 different patterns of wind gusts, and evaluate against all of them and average. 721 00:59:53,650 --> 00:59:55,680 So what you do is you actually 722 00:59:55,680 --> 00:59:58,039 sample say 100 - 723 00:59:58,039 --> 01:00:01,359 some number I made up, like 100 sequences of random numbers, 724 01:00:01,359 --> 01:00:03,430 and every time you want to evaluate a policy, 725 01:00:03,430 --> 01:00:08,679 you evaluate it against all 100 sequences of random numbers and then average.
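Continuing the sketch above (same hypothetical `step_deterministic` and `pegasus_payoff`), the 100-scenario version just draws the scenarios once and averages over them; the average is still a deterministic function of theta.

```python
M = 100                                                           # number of scenarios
scenarios = [rng.normal(size=(T, noise_dim)) for _ in range(M)]   # drawn once, then frozen

def pegasus_objective(theta, s0, step_deterministic, reward):
    """Average the deterministic payoff over the M fixed scenarios.
    This is the function the policy search actually optimizes."""
    return float(np.mean([pegasus_payoff(theta, s0, step_deterministic, reward, sc)
                          for sc in scenarios]))
```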
726 01:00:08,679 --> 01:00:13,249 This is in exactly the same way that on this earlier picture 727 01:00:13,249 --> 01:00:16,879 you wouldn't necessarily evaluate the policy just once. You evaluate it maybe 100 728 01:00:16,879 --> 01:00:18,450 times in simulation, and then 729 01:00:18,450 --> 01:00:21,709 average to get a better estimate of the expected reward. 730 01:00:21,709 --> 01:00:23,009 In the same way, 731 01:00:23,009 --> 01:00:24,079 you do that here 732 01:00:24,079 --> 01:00:29,079 but with 100 fixed sequences of random numbers. Does that make sense? Any 733 01:00:29,079 --> 01:00:36,079 other questions? If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]? 734 01:00:47,719 --> 01:00:51,829 Yeah. I guess you're right. So the quality - for a fixed policy, 735 01:00:51,829 --> 01:00:57,099 the quality of the approximation is equally good for both cases. The 736 01:00:57,099 --> 01:01:00,989 advantage of fixing the random numbers is that you end up with 737 01:01:00,989 --> 01:01:03,690 an optimization problem that's much easier. 738 01:01:04,609 --> 01:01:06,100 I have some search problem, 739 01:01:06,100 --> 01:01:10,049 and on the horizontal axis there's a space of control policies, 740 01:01:10,049 --> 01:01:12,559 and my goal is to find a control policy that 741 01:01:12,559 --> 01:01:15,319 maximizes the payoff. 742 01:01:15,319 --> 01:01:18,660 The problem with this earlier setting was that 743 01:01:18,660 --> 01:01:21,949 when I evaluate policies I get these noisy estimates, 744 01:01:21,949 --> 01:01:22,950 and then it's 745 01:01:22,950 --> 01:01:26,690 just very hard to optimize the red curve if I have these points that are all over the 746 01:01:26,690 --> 01:01:29,420 place. And if I evaluate the same policy twice, I don't even get back the same 747 01:01:30,719 --> 01:01:32,959 answer. By fixing the random numbers, 748 01:01:32,959 --> 01:01:36,669 the algorithm still doesn't get to see the red curve, but at least it's 749 01:01:36,669 --> 01:01:40,609 now optimizing a deterministic function. That makes the 750 01:01:40,609 --> 01:01:42,349 optimization problem much easier. Does 751 01:01:42,349 --> 01:01:49,349 that make sense? Student:So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves 752 01:01:50,539 --> 01:01:58,009 that are easy to optimize. And then you smush them together? 753 01:01:58,009 --> 01:01:59,309 Let's see. 754 01:01:59,309 --> 01:02:04,559 I have just one nice black curve that I'm trying to optimize. For each scenario. 755 01:02:04,559 --> 01:02:08,890 I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios. 756 01:02:08,890 --> 01:02:12,339 So the black curve here is defined by averaging over 757 01:02:12,339 --> 01:02:16,249 a large set of scenarios. Does that make sense? 758 01:02:16,249 --> 01:02:20,099 So instead of only one - if the averaging thing doesn't make sense, imagine that there's 759 01:02:20,099 --> 01:02:24,049 just one sequence of random numbers. That might be easier to think 760 01:02:24,049 --> 01:02:27,419 about. Fix one sequence of random numbers, and every time I evaluate another policy, 761 01:02:27,419 --> 01:02:31,039 I evaluate against the same sequence of random numbers, and that gives me 762 01:02:31,039 --> 01:02:36,489 a nice deterministic function to optimize. 
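Since the scenario-averaged objective above is now an ordinary deterministic function of theta, any off-the-shelf optimizer can be pointed at it. As one hedged possibility (not the specific optimizer used in the lecture), here is a crude finite-difference gradient ascent; the step size, perturbation, and iteration count are made-up illustrative values.

```python
import numpy as np

def finite_difference_ascent(objective, theta0, step=1e-2, eps=1e-3, iters=200):
    """Climb a deterministic objective using a numerically estimated gradient.
    Any local search method (hill climbing, Nelder-Mead, etc.) would do as well."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = eps
            # central difference estimate of d(objective)/d(theta_i)
            grad.flat[i] = (objective(theta + e) - objective(theta - e)) / (2 * eps)
        theta = theta + step * grad
    return theta

# Usage, with the hypothetical pieces defined above:
# theta_star = finite_difference_ascent(
#     lambda th: pegasus_objective(th, s0, step_deterministic, reward), theta_init)
```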
763 01:02:36,489 --> 01:02:39,449 Any other questions? 764 01:02:39,449 --> 01:02:41,269 The advantage is really 765 01:02:41,269 --> 01:02:45,459 that - one way to think about it is that when I evaluate the same policy twice, 766 01:02:45,459 --> 01:02:50,449 at least I get back the same answer. This gives me a deterministic function 767 01:02:50,449 --> 01:02:53,890 mapping from the parameters of my policy 768 01:02:53,890 --> 01:02:56,239 to my estimate of the expected payoff. 769 01:02:57,789 --> 01:03:13,229 That's just a function that I can try to optimize using a search algorithm. 770 01:03:13,229 --> 01:03:16,799 So we used this algorithm for inverted hovering, 771 01:03:16,799 --> 01:03:20,329 and again, policy search algorithms tend to work well when you can find 772 01:03:20,329 --> 01:03:23,739 a reasonably simple policy mapping from 773 01:03:23,739 --> 01:03:28,640 the states to the actions. This is especially true for low-level control tasks, 774 01:03:28,640 --> 01:03:32,699 which I think of as sort of reflexes, almost. 775 01:03:32,699 --> 01:03:33,799 Conversely, 776 01:03:33,799 --> 01:03:37,269 if you want to solve a problem like Tetris, where you might plan 777 01:03:37,269 --> 01:03:40,499 ahead a few steps about what's a nice configuration of the board, or 778 01:03:40,499 --> 01:03:42,289 something like a game of chess, 779 01:03:42,289 --> 01:03:46,719 or problems with long path plans of go here, then go there, then go there, 780 01:03:46,719 --> 01:03:51,069 then sometimes you might apply a value function method instead. 781 01:03:51,069 --> 01:03:55,059 But for tasks like helicopter flight, for 782 01:03:55,059 --> 01:03:58,569 low-level control tasks, for the reflexes of driving or controlling 783 01:04:00,749 --> 01:04:05,699 various robots, policy search algorithms - 784 01:04:05,699 --> 01:04:10,629 where we can sometimes more easily approximate the policy directly - work very well. 785 01:04:11,839 --> 01:04:15,599 A few comments on the state of RL today. RL algorithms are applied 786 01:04:15,599 --> 01:04:18,499 to a wide range of problems, 787 01:04:18,499 --> 01:04:22,239 and the key is really sequential decision making. The place where you 788 01:04:22,239 --> 01:04:25,569 think about applying a reinforcement learning algorithm is when you need to make a decision, 789 01:04:25,569 --> 01:04:28,209 then another decision, then another decision, 790 01:04:28,209 --> 01:04:31,889 and some of your actions may have long-term consequences. I think that 791 01:04:31,889 --> 01:04:37,910 is the heart of RL: sequential decision making, where you make multiple decisions, 792 01:04:37,910 --> 01:04:41,599 and some of your actions may have long-term consequences. I've shown you 793 01:04:41,599 --> 01:04:44,170 a bunch of robotics examples. RL 794 01:04:44,170 --> 01:04:45,569 is also 795 01:04:45,569 --> 01:04:48,759 applied to things like medical decision making, where you may have a patient and you 796 01:04:48,759 --> 01:04:51,879 want to choose a sequence of treatments; you do this now for the patient, and then the 797 01:04:51,879 --> 01:04:56,449 patient may be in some other state, and you choose what to do later, and so on. 798 01:04:56,449 --> 01:05:00,839 It turns out there's a large community of people applying these sorts of tools to queues.
So 799 01:05:00,839 --> 01:05:04,539 imagine you have a bank where you have people lining up, and 800 01:05:04,539 --> 01:05:06,009 after they go to one cashier, 801 01:05:06,009 --> 01:05:09,740 some of them have to go to the manager to deal with something else. You have 802 01:05:09,740 --> 01:05:13,410 a system of multiple people standing in line in multiple queues, and so how do you route 803 01:05:13,410 --> 01:05:17,839 people optimally to minimize the waiting 804 01:05:17,839 --> 01:05:20,239 time? And not just people, but 805 01:05:20,239 --> 01:05:23,969 objects in an assembly line and so on. It turns out there's a surprisingly large community 806 01:05:23,969 --> 01:05:26,349 working on optimizing queues. 807 01:05:26,349 --> 01:05:29,719 I mentioned game playing a little bit already. Things 808 01:05:29,719 --> 01:05:33,289 like financial decision making: if you have a large amount of stock, 809 01:05:33,289 --> 01:05:37,569 how do you sell off a large amount - how do you time the selling off of your stock 810 01:05:37,569 --> 01:05:41,329 so as to not affect market prices adversely too much? There 811 01:05:41,329 --> 01:05:44,949 are many operations research problems, things like factory automation. Can you design 812 01:05:44,949 --> 01:05:46,369 a factory 813 01:05:46,369 --> 01:05:50,140 to optimize throughput, or minimize cost, or whatever? These are all 814 01:05:50,140 --> 01:05:54,049 sorts of problems that people are applying reinforcement learning algorithms to. 815 01:05:54,049 --> 01:05:57,529 Let me just close with a few robotics examples because they're always fun, and we just 816 01:05:57,529 --> 01:05:59,190 have these videos. 817 01:06:00,189 --> 01:06:07,259 This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students 818 01:06:07,259 --> 01:06:13,760 here. They were working on getting a robot dog to climb over difficult, rugged 819 01:06:13,760 --> 01:06:16,959 terrain. Using a reinforcement learning algorithm, 820 01:06:16,959 --> 01:06:19,979 they applied an approach that's 821 01:06:19,979 --> 01:06:24,149 similar to a value function approximation approach - not quite, but similar. 822 01:06:24,149 --> 01:06:28,849 They allowed the robot dog to sort of plan ahead multiple steps, and 823 01:06:28,849 --> 01:06:33,999 carefully choose its footsteps and traverse rugged terrain. 824 01:06:33,999 --> 01:06:43,339 This is really state of the art in terms of what you can get a robotic dog to do. 825 01:06:43,339 --> 01:06:45,569 Here's another fun one. 826 01:06:45,569 --> 01:06:48,009 It turns out that 827 01:06:48,009 --> 01:06:51,259 wheeled robots are very fuel-efficient. Cars and trucks are almost the 828 01:06:51,259 --> 01:06:55,250 most fuel-efficient robots in the world. 829 01:06:55,250 --> 01:06:57,260 Cars and trucks are very fuel-efficient, 830 01:06:57,260 --> 01:07:01,309 but legged robots have the ability to traverse more rugged terrain. 831 01:07:01,309 --> 01:07:04,599 So this is a robot - this is actually a small-scale mockup of a larger 832 01:07:04,599 --> 01:07:06,519 vehicle built by Lockheed Martin - 833 01:07:06,519 --> 01:07:09,699 and the question is, can you come up with a vehicle that 834 01:07:09,699 --> 01:07:13,689 has wheels, and so has the fuel efficiency of wheeled robots, but also 835 01:07:13,689 --> 01:07:19,099 has legs so it can traverse obstacles?
836 01:07:19,099 --> 01:07:27,349 Again, they used a reinforcement learning algorithm to design a controller for this robot to make it traverse obstacles, with 837 01:07:27,349 --> 01:07:31,319 somewhat complex gaits that would be very hard to design by hand. 838 01:07:31,319 --> 01:07:33,279 But by choosing a reward function - telling the robot 839 01:07:33,279 --> 01:07:36,789 there's a plus one reward at the top of the goal, 840 01:07:36,789 --> 01:07:45,469 and a few other things - it learns these sorts of policies automatically. 841 01:07:46,519 --> 01:07:55,229 Last couple of fun ones - I'll show you a couple of last helicopter videos. 842 01:07:57,029 --> 01:08:08,700 This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates, 843 01:08:08,769 --> 01:08:33,319 who have been working on autonomous flight. I'll just let you watch. I'll just 844 01:09:49,339 --> 01:09:53,900 show you one more. [Inaudible] do this with a real helicopter [inaudible]? 845 01:09:53,900 --> 01:09:55,199 Not a full-size helicopter. 846 01:09:55,199 --> 01:10:00,860 Only small radio control helicopters. [Inaudible]. Full-size helicopters 847 01:10:00,860 --> 01:10:05,780 are the wrong design for this. You shouldn't do this on a full-size helicopter. 848 01:10:05,780 --> 01:10:12,619 Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more. 849 01:10:12,619 --> 01:10:18,919 Are there any human [inaudible]? Yes, 850 01:10:18,919 --> 01:10:25,919 there are very good human pilots that can. 851 01:10:28,420 --> 01:10:35,420 This is just one more 852 01:10:43,739 --> 01:10:45,510 maneuver. That was kind of fun. So this is the work of 853 01:10:45,510 --> 01:10:55,530 Pieter Abbeel and Adam Coates. So that was it. That was all the technical material 854 01:10:55,530 --> 01:11:02,530 I wanted to present in this class. I guess 855 01:11:03,109 --> 01:11:10,199 you're all experts on machine learning now. Congratulations. 856 01:11:10,199 --> 01:11:14,209 And I hope you've gotten the sense through this class that this is one of the technologies 857 01:11:14,209 --> 01:11:18,969 that's really having a huge impact on science, engineering, and industry. 858 01:11:18,969 --> 01:11:23,130 As I said in the first lecture, I think many people use machine 859 01:11:23,130 --> 01:11:30,130 learning algorithms dozens of times a day without even knowing about it. 860 01:11:30,389 --> 01:11:34,110 Based on the projects you've done, I hope that 861 01:11:34,110 --> 01:11:37,899 most of you will be able to imagine yourselves 862 01:11:37,899 --> 01:11:43,280 going out after this class and applying these things to solve a variety of problems. 863 01:11:43,280 --> 01:11:45,980 Hopefully, some of you will also imagine yourselves 864 01:11:45,980 --> 01:11:48,989 writing research papers after this class, be it on 865 01:11:48,989 --> 01:11:52,440 a novel way to do machine learning, or on some way of applying machine 866 01:11:52,440 --> 01:11:55,289 learning to a problem that you care 867 01:11:55,289 --> 01:11:58,749 about. In fact, looking at the project milestones, I'm actually sure that a large fraction of 868 01:11:58,749 --> 01:12:04,300 the projects in this class will be publishable, so I think that's great. 869 01:12:06,679 --> 01:12:10,750 I guess many of you will also go into industry, make new products, and make lots 870 01:12:10,750 --> 01:12:12,659 of money using learning 871 01:12:12,659 --> 01:12:15,580 algorithms. Remember me 872 01:12:15,580 --> 01:12:19,530 if that happens.
One of the things I'm excited about is that, through research or 873 01:12:19,530 --> 01:12:22,570 industry, I'm actually completely sure that the people in this class, in the 874 01:12:22,570 --> 01:12:23,849 next few months, 875 01:12:23,849 --> 01:12:27,320 will apply machine learning algorithms to lots of problems in 876 01:12:27,320 --> 01:12:31,770 industrial management and computer science, things like optimizing computer architectures, 877 01:12:31,770 --> 01:12:33,340 network security, 878 01:12:33,340 --> 01:12:38,480 robotics, computer vision, to problems in computational biology, 879 01:12:39,140 --> 01:12:42,429 to problems in aerospace, or understanding natural language, 880 01:12:42,429 --> 01:12:44,309 and many more things like that. 881 01:12:44,309 --> 01:12:46,030 So 882 01:12:46,030 --> 01:12:49,920 right now I have no idea what all of you are going to do with the learning algorithms you learned about, 883 01:12:49,920 --> 01:12:51,540 but 884 01:12:51,540 --> 01:12:54,809 every time I wrap up this class, I always 885 01:12:54,809 --> 01:12:58,209 feel very excited, and optimistic, and hopeful about the sorts of amazing 886 01:12:58,209 --> 01:13:03,129 things you'll be able to do. 887 01:13:03,129 --> 01:13:10,489 One final thing, I'll just give out this handout. 888 01:13:30,749 --> 01:13:35,170 One final thing is that machine learning has grown out of a larger literature on AI, 889 01:13:35,170 --> 01:13:36,419 with 890 01:13:36,419 --> 01:13:41,829 its desire to build systems that exhibit intelligent behavior. Machine learning 891 01:13:41,829 --> 01:13:44,889 is one of the tools of AI, maybe one that's had a 892 01:13:44,889 --> 01:13:46,510 disproportionately large impact, 893 01:13:46,510 --> 01:13:51,519 but there are many other ideas in AI that I hope you 894 01:13:51,519 --> 01:13:55,289 go and continue to learn about. Fortunately, Stanford 895 01:13:55,289 --> 01:13:59,279 has one of the best and broadest sets of AI classes, and I hope that 896 01:13:59,279 --> 01:14:03,690 you take advantage of some of these classes, and go and learn more about AI, and more 897 01:14:03,690 --> 01:14:07,690 about other fields which often apply learning algorithms to problems in vision, 898 01:14:07,690 --> 01:14:10,480 problems in natural language processing, in robotics, and so on. 899 01:14:10,480 --> 01:14:14,960 So the handout I just gave out has a list of AI-related courses. Just 900 01:14:14,960 --> 01:14:18,489 running down very quickly, I guess, CS221 is an overview course that I 901 01:14:18,489 --> 01:14:19,340 teach. There 902 01:14:19,340 --> 01:14:23,699 are a lot of robotics classes also: 223A, 903 01:14:23,699 --> 01:14:25,139 225A, 225B 904 01:14:25,139 --> 01:14:28,760 - many robotics classes. There are so many applications 905 01:14:28,760 --> 01:14:31,999 of learning algorithms to robotics today. 222 906 01:14:31,999 --> 01:14:36,369 and 227 are knowledge representation and reasoning classes. 907 01:14:36,369 --> 01:14:39,919 228 - of all the classes on this list, 228, which Daphne 908 01:14:39,919 --> 01:14:43,479 Koller teaches, is probably closest in spirit to 229. This is one of 909 01:14:43,479 --> 01:14:45,300 the classes I highly recommend 910 01:14:45,300 --> 01:14:48,600 to all of my PhD students as well. 911 01:14:48,600 --> 01:14:53,300 Many other classes also touch on machine learning.
On the next page, 912 01:14:54,040 --> 01:14:57,760 there are courses on computer vision, speech recognition, natural language processing - 913 01:14:57,760 --> 01:15:03,199 various tools that aren't just machine learning, but often involve machine learning in many ways. 914 01:15:03,199 --> 01:15:07,339 Other aspects of AI, multi-agent systems taught by [inaudible]. 915 01:15:09,630 --> 01:15:13,139 EE364A is convex optimization. It's a class taught by 916 01:15:13,139 --> 01:15:16,630 Steve Boyd, and convex optimization came up many times in this class. If you 917 01:15:16,630 --> 01:15:20,909 want to become really good at it, EE364 is a great class. If you're 918 01:15:20,909 --> 01:15:24,579 interested in project courses, I also teach a project class 919 01:15:24,579 --> 01:15:27,179 next quarter where we spend the whole quarter working 920 01:15:27,179 --> 01:15:31,510 on research projects. 921 01:15:31,510 --> 01:15:37,589 So I hope you go and take some more of those classes. 922 01:15:37,589 --> 01:15:40,379 In closing, 923 01:15:40,379 --> 01:15:42,849 let me just say 924 01:15:42,849 --> 01:15:46,880 this class has been really fun to teach, and it's very satisfying to 925 01:15:46,880 --> 01:15:52,010 me personally when we set these insanely difficult homeworks, and 926 01:15:52,010 --> 01:15:54,649 then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's 927 01:15:54,649 --> 01:15:58,570 actually very satisfying when I see that. 928 01:15:58,570 --> 01:16:04,739 Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable." 929 01:16:04,739 --> 01:16:05,469 So 930 01:16:05,469 --> 01:16:09,039 I hope you take what you've learned, go forth, and 931 01:16:09,039 --> 01:16:11,570 do amazing things with learning algorithms. 932 01:16:11,570 --> 01:16:16,220 I know this is a heavy-workload class, so thank you all very much for the hard 933 01:16:16,220 --> 01:16:20,280 work you've put into this class, and the hard work you've put into learning this material, 934 01:16:20,280 --> 01:16:21,699 and 935 01:16:21,699 --> 01:16:23,389 thank you very much for having been students in this class.