This presentation is delivered by the Stanford Center for Professional Development.

So welcome to the last lecture of this course. What I want to do today is tell you about one final class of reinforcement learning algorithms. I just want to say a little bit about POMDPs, the partially observable MDPs, and then the main technical topic for today will be policy search algorithms. I'll talk about two specific algorithms, essentially called REINFORCE and Pegasus, and then we'll wrap up the class.

So if you recall from the last lecture, I actually started to talk about one specific example of a POMDP, which was this sort of linear dynamical system. This is the LQR, the linear quadratic regulation problem, but I changed it and asked: what if we only have observations y_t? What if we couldn't observe the state of the system directly, but had to choose an action based only on some noisy observations that are maybe some function of the state?

So our strategy last time was that we said that in the fully observable case, we could choose actions a_t = L_t s_t, where L_t was the matrix of parameters we computed with the dynamic programming algorithm for finite-horizon MDPs in the LQR problem. And so we said, if only we knew what the state was, we would choose actions according to some matrix L_t times the state. And then I said that in the partially observable case, we would compute these estimates, which I wrote as s_t|t, which were our best estimate for what the state is given all the observations. In particular, I talked about a Kalman filter, with which we worked out our posterior distribution of what the state is, given all the observations up to a certain time. So this is from last time.
So given the observations y_1 through y_t, our posterior distribution of the current state s_t was Gaussian, with mean s_t|t and covariance Sigma_t|t. So I said we use a Kalman filter to compute this thing, s_t|t, which is going to be our best guess for what the state is currently. And then we choose actions a_t = L_t s_t|t, using our estimate of the state rather than the true state, because we don't know the true state anymore in this POMDP.

It turns out that this specific strategy actually allows you to choose optimal actions - allows you to choose actions as well as you possibly can, given that this is a POMDP and given that there are these noisy observations. It turns out that in general, finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.

Just to be concrete about the formalism of the POMDP - I should just write it here - a POMDP formally is a tuple like the MDP tuple, (S, A, Y, {P_sa}, {O_s}, T, R), where the changes are that the set Y is the set of possible observations, and these O subscript s are the observation distributions. And so at each step in the POMDP, if we're in some state s_t, we observe some observation y_t drawn from the observation distribution O_{s_t}, which is indexed by what the current state is. And it turns out that computing the optimal policy in a POMDP is an NP-hard problem.

For the specific case of linear dynamical systems with the Kalman filter model, we have this strategy of computing the optimal policy assuming full observability, then estimating the states from the observations, and then plugging the two together. That turns out to be optimal essentially only for that special case of a POMDP.
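To make the "estimate the state, then plug it into the controller designed assuming full observability" strategy concrete, here is a minimal sketch in Python. The dynamics matrices A, B, C, the noise covariances Q, R, and the precomputed LQR gain L_t are assumptions for illustration only; they are not defined in the lecture.

import numpy as np

def kalman_step(s_hat, Sigma, a_prev, y, A, B, C, Q, R):
    """One predict/update step of a Kalman filter for
    s_{t+1} = A s_t + B a_t + w,  y_t = C s_t + v,
    returning the posterior mean s_{t|t} and covariance Sigma_{t|t}."""
    # Predict step: s_{t|t-1}, Sigma_{t|t-1}
    s_pred = A @ s_hat + B @ a_prev
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update step with the new observation y_t
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)
    s_new = s_pred + K @ (y - C @ s_pred)
    Sigma_new = (np.eye(len(s_hat)) - K @ C) @ Sigma_pred
    return s_new, Sigma_new

def act(L_t, s_hat):
    """Certainty-equivalent control: plug the state estimate s_{t|t}
    into the LQR policy a_t = L_t s_t that assumed full observability."""
    return L_t @ s_hat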
In the more general case, that strategy of designing a controller assuming full observability, and then just estimating the state and plugging the two together - for general POMDPs, that same strategy is often a very reasonable strategy, but it is not always guaranteed to be optimal. Solving these problems in general is NP-hard.

So what I want to do today is talk about a different class of reinforcement learning algorithms, called policy search algorithms. In particular, policy search algorithms can be applied equally well to MDPs, to fully observed Markov decision processes, or to these POMDPs, these partially observable MDPs. What I'll do now is describe policy search algorithms applied to MDPs, applied to the fully observable case, and at the end I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs. In the latter case, when you apply a policy search algorithm to a POMDP, it's going to be hard to guarantee that you get the globally optimal policy, because solving POMDPs in general is NP-hard; but nonetheless, policy search turns out to be, I think, one of the most effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.

So here's what we're going to do. In policy search, we're going to define some set, which I'll denote capital Pi, of policies, and our strategy is to search for a good policy, lowercase pi, in the set capital Pi.
Just by analogy, the way we define the set capital Pi of policies and then search for a policy in this set Pi is analogous to supervised learning, where we defined a set script H of hypotheses and would search for a good hypothesis in this class script H.

Policy search is sometimes also called direct policy search, to contrast it with the sorts of algorithms we've been talking about so far. In all the algorithms we've been talking about so far, we would try to find V*, the optimal value function, and then we'd use the optimal value function to try to compute or approximate pi*. So all the approaches we talked about previously are strategies for finding a good policy by first computing the value function and then going from that to the policy. In contrast, in policy search algorithms - in what are called direct policy search algorithms - the idea is that we're going to, quote, "directly" try to approximate a good policy, without going through the intermediate stage of trying to find the value function.

And also, as I develop policy search, there's one step that's sometimes slightly confusing. Making an analogy to supervised learning again: when we talked about logistic regression, I said we have input features x and some labels y, and I said, let's approximate y using the logistic function of the inputs x. And at least initially, the logistic function was sort of pulled out of the air. In the same way, as I define policy search algorithms, there'll be a step where I say, "Let's try to approximate what a good action is using a logistic function of the state." So again, I'll sort of pull a function out of the air.
I'll say, "Let's just choose a function, and Dialogue: 0,0:10:36.77,0:10:40.65,Default,,0000,0000,0000,,that'll be our choice of the policy cost," and I'll say, Dialogue: 0,0:10:40.65,0:10:45.31,Default,,0000,0000,0000,,"Let's take this input the state, and then we'll map it through logistic function, and then hopefully, we'll approximate Dialogue: 0,0:10:45.31,0:10:47.30,Default,,0000,0000,0000,,what is a good function - Dialogue: 0,0:10:47.30,0:10:48.35,Default,,0000,0000,0000,,excuse me, Dialogue: 0,0:10:48.35,0:10:53.27,Default,,0000,0000,0000,,we'll approximate what is a good action using a logistic function of the state." Dialogue: 0,0:10:53.27,0:10:54.42,Default,,0000,0000,0000,,So there's that sort of - Dialogue: 0,0:10:54.42,0:10:59.46,Default,,0000,0000,0000,,the function of the choice of policy cost that's again a little bit arbitrary, Dialogue: 0,0:10:59.46,0:11:06.08,Default,,0000,0000,0000,,but it's arbitrary as it was when we were talking about supervised learning. Dialogue: 0,0:11:06.08,0:11:17.35,Default,,0000,0000,0000,,So to develop our first policy search algorithm, I'm actually gonna need the new definition. Dialogue: 0,0:11:56.10,0:12:03.10,Default,,0000,0000,0000,,So Dialogue: 0,0:12:06.65,0:12:12.71,Default,,0000,0000,0000,,our first policy search algorithm, we'll actually need to work with stochastic policies. Dialogue: 0,0:12:12.71,0:12:17.68,Default,,0000,0000,0000,,What I mean by stochastic policy is there's going to be a function that maps from Dialogue: 0,0:12:17.68,0:12:22.51,Default,,0000,0000,0000,,the space of states across actions. They're real numbers Dialogue: 0,0:12:22.51,0:12:25.04,Default,,0000,0000,0000,,where pi of S comma A will be Dialogue: 0,0:12:25.04,0:12:30.58,Default,,0000,0000,0000,,interpreted as the probability of taking this action A in sum state S. Dialogue: 0,0:12:30.58,0:12:46.34,Default,,0000,0000,0000,,And so we have to add sum over A - In other words, for every state Dialogue: 0,0:12:46.34,0:12:53.83,Default,,0000,0000,0000,,a stochastic policy specifies a probability distribution over the actions. Dialogue: 0,0:12:54.55,0:12:59.25,Default,,0000,0000,0000,,So concretely, Dialogue: 0,0:12:59.25,0:13:07.27,Default,,0000,0000,0000,,suppose you are executing some policy pi. Say I have some stochastic policy pi. I wanna execute the policy pi. Dialogue: 0,0:13:07.27,0:13:11.69,Default,,0000,0000,0000,,What that means is that - in this example let's say I have three actions. Dialogue: 0,0:13:11.69,0:13:16.06,Default,,0000,0000,0000,,What that means is that suppose I'm in some state S. Dialogue: 0,0:13:16.06,0:13:25.29,Default,,0000,0000,0000,,I would then compute pi of S comma A1, pi of S comma A2, Dialogue: 0,0:13:25.29,0:13:29.37,Default,,0000,0000,0000,,pi of S comma A3, if I have a three action MDP. Dialogue: 0,0:13:29.37,0:13:33.63,Default,,0000,0000,0000,,These will be three numbers that sum up to one, Dialogue: 0,0:13:33.63,0:13:36.73,Default,,0000,0000,0000,,and then my chance of taking action A1 will be Dialogue: 0,0:13:36.73,0:13:41.35,Default,,0000,0000,0000,,equal to this. My chance of taking action A2 will be equal to pi of S comma A2. Dialogue: 0,0:13:41.35,0:13:44.13,Default,,0000,0000,0000,,My chance of taking action A3 will be equal Dialogue: 0,0:13:44.13,0:13:49.67,Default,,0000,0000,0000,,to this number. So that's what it means to execute a stochastic policy. 
So as a concrete example, let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I'm going to use the inverted pendulum as my motivating example. It's that problem of balancing a pole: we have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. For today's example, I'm going to use phi to denote the angle of the pole, and I have two actions, where a1 is to accelerate right and a2 is to accelerate left. Choose a reward function that penalizes the pole falling over, or whatever.

And now let's come up with a stochastic policy for this problem. To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state. So here's my somewhat arbitrary choice: I'm going to say that the probability of action a1, pi(s, a1), is 1/(1 + e^{-theta^T s}). And I just chose the logistic function because it's a convenient function we've used a lot. So I'm going to say that my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. And if I'm executing that policy with parameters theta, that means that the chance of my choosing the accelerate-right action is given by this number. Because my chances of executing actions a1 or a2 must sum to one, this gives me pi(s, a2) = 1 - 1/(1 + e^{-theta^T s}).
So just to say it again, this means that when I'm in some state s, I'm going to compute this number, 1/(1 + e^{-theta^T s}), and then with this probability I will execute the accelerate-right action, and with one minus this probability I'll execute the accelerate-left action.

And again, just to give you a sense of why this might be a reasonable thing to do, let's say my state vector is s = [1, x, x_dot, phi, phi_dot]^T - this is the inverted pendulum state, and I've added an extra 1 as an intercept term, just to give my logistic function an extra feature. If I choose my parameters and my policy so that theta^T s just picks out the angle - say theta = [0, 0, 0, 1, 0]^T - then that means that at any state, the probability of my taking the accelerate-right action is 1/(1 + e^{-theta^T s}); and since taking the inner product of theta and s just gives you phi, this equals 1/(1 + e^{-phi}). And so with this choice of parameters theta, just depending on the angle phi of my inverted pendulum, the chance of my accelerating to the right is just this function of the angle of my inverted pendulum. This means, for example, that if my inverted pendulum is leaning far over to the right, then I'm very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing makes sense: if my pole's leaning over to the right, then I want to accelerate to the right to catch it; and conversely, if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try to catch it.

So this is one example for one specific choice of parameters theta. Obviously, this isn't a great policy, because it ignores the rest of the features.
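Here is a minimal sketch of this logistic policy class in Python. The state layout [1, x, x_dot, phi, phi_dot] and the particular theta that picks out the angle are just the illustration above, not a prescribed implementation.

import numpy as np

def pi_right(theta, s):
    """Probability of the accelerate-right action under the logistic
    policy pi_theta(s, a_right) = 1 / (1 + exp(-theta^T s))."""
    return 1.0 / (1.0 + np.exp(-theta @ s))

# State features: [intercept, cart position, cart velocity, pole angle, angular velocity]
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])   # theta^T s = phi

s = np.array([1.0, 0.3, -0.1, 0.4, 0.0])      # pole leaning to the right (phi = 0.4 rad)
p = pi_right(theta, s)                        # about 0.60: more likely to accelerate right
a = "right" if np.random.rand() < p else "left"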
Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then, depending on the velocity of the cart and the angular velocity, you might want to change theta to take those other effects into account as well. Maybe if the pole's leaning far to the right but is actually on its way to swinging back - as indicated by the angular velocity - then you might be less worried about having to accelerate hard to the right. And so these are the sorts of behavior you can get by varying the parameters theta.

And so our goal in policy search is to tune the parameters theta so that when we execute the policy pi_theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi_theta. We want to choose parameters theta to maximize that.

Are there questions about the problem setup, and policy search and policy classes, or anything? [Student:] In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters? [Instructor:] Oh yeah, right - so what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use a softmax parameterization, similar to the softmax regression that we saw earlier in the class. You may have a set of parameters theta_1 through theta_d if you have d actions, and set pi_theta(s, a_i) = e^{theta_i^T s} / sum_j e^{theta_j^T s}.
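As a minimal sketch of that softmax parameterization (the dimensions and parameter values here are made up for illustration):

import numpy as np

def softmax_policy(Theta, s):
    """pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s),
    with one parameter vector theta_i per discrete action (the rows of Theta)."""
    logits = Theta @ s
    logits -= logits.max()            # stabilize the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()

Theta = np.random.randn(4, 5)         # 4 actions, 5 state features (illustrative)
s = np.array([1.0, 0.3, -0.1, 0.4, 0.0])
probs = softmax_policy(Theta, s)      # a probability distribution over the 4 actions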
That would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions a, parameterized by other things as well. But the choice of policy class is somewhat up to you, in the same way that the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, in supervised learning was sort of up to us.

Anything else? [Student:] [Inaudible] stochastic? Is it possible to [inaudible] a stochastic policy using [inaudible]? [Instructor:] I see. So the question is: given that the MDP has stochastic transition probabilities, is it possible to use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities? The answer is yes, but for the purposes of what I want to show later, that won't be useful. But formally, it is possible: if you already have a fixed policy, then you'd be able to do that. Anything else? [Student:] [Inaudible] [Instructor:] No, I guess even a [inaudible] class of policies can do that, but for the derivation later, I actually need to keep them separate.

Actually - I know the concept of policy search is sometimes a little confusing - could you just raise your hand if this makes sense? Okay, thanks.

So let's talk about an algorithm. The first algorithm I'm going to present is sometimes called the REINFORCE algorithm. What I'm going to present, it turns out, isn't exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it sort of captures its essence. Here's the idea.
In what I'm about to do, I'm going to assume that s_0 is some fixed initial state. (It turns out that if s_0 is drawn from some fixed initial-state distribution, then everything that follows still goes through, but let's just say s_0 is some fixed initial state.)

So my goal is to maximize the expected sum of rewards, E[R(s_0) + R(s_1) + ... + R(s_T)], given that we execute the policy pi_theta. The random variables in this expectation are the sequence of states and actions: s_0, a_0, s_1, a_1, and so on, up to s_T, a_T. So let me write out this expectation explicitly as a sum over all possible state-action sequences of the probability of the sequence times the sum of rewards - that's what an expectation is: the probability of the random variables times the payoff.

Let me just expand out this probability. The probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state - if this is a deterministic initial state, then all the probability mass is on one state; otherwise, there's some distribution over initial states - times the probability that you chose action a_0 from that state s_0, which is pi_theta(s_0, a_0), times the probability that the MDP's transition probabilities happened to transition you to state s_1 when you chose action a_0 in state s_0, which is P_{s_0 a_0}(s_1), times the probability that you chose action a_1 there, pi_theta(s_1, a_1), and so on, with the last terms being the final transition probability and pi_theta(s_T, a_T), and then all of that times the payoff.

So what I did was just take this probability of seeing this sequence of states and actions and expand it out explicitly like this. It turns out that later on I'm going to need to write this sum of rewards a lot, so from now on I'm just going to call this sum the payoff.
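To make that factorization concrete, here is a small sketch that computes the probability of a given state-action sequence under a stochastic policy in a finite MDP, and the corresponding payoff. The tables P_init, pi, P_trans, and R are hypothetical lookup tables for illustration.

def sequence_probability(states, actions, P_init, pi, P_trans):
    """P(s_0) * pi(s_0,a_0) * P_{s_0 a_0}(s_1) * pi(s_1,a_1) * ... * pi(s_T,a_T)."""
    p = P_init[states[0]]
    for t in range(len(actions)):
        p *= pi[states[t]][actions[t]]      # probability the policy chose a_t in s_t
        if t + 1 < len(states):
            p *= P_trans[(states[t], actions[t])][states[t + 1]]  # s_t, a_t -> s_{t+1}
    return p

def payoff(states, R):
    """The payoff: R(s_0) + R(s_1) + ... + R(s_T)."""
    return sum(R[s] for s in states)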
So whenever later in this lecture I write the word payoff, I just mean this sum. And our goal is to maximize the expected payoff.

Let me actually just skip ahead: I'm going to write down what the final answer is, and then I'll come back and justify the algorithm. So here's the algorithm; this is how we're going to update the parameters. We repeatedly do the following. First, sample a state-action sequence: the way you do this is you just take your current stochastic policy and execute it in the MDP - start from the initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and do that for T time steps; that's how you sample the state-action sequence. Then you compute the payoff, and then you perform the update to theta.
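Written as code, one pass of that loop might look like the sketch below. The update shown is the standard likelihood-ratio form, theta := theta + alpha * [sum_t grad_theta log pi_theta(s_t, a_t)] * payoff, which is the same quantity sum_t grad_theta pi_theta(s_t, a_t) / pi_theta(s_t, a_t) that appears in the derivation that follows. The env object and the two-action logistic policy are assumptions for illustration, not part of the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(s, a) for the two-action logistic policy:
    pi(s, right) = sigmoid(theta^T s), pi(s, left) = 1 - sigmoid(theta^T s)."""
    p_right = sigmoid(theta @ s)
    return s * (1.0 - p_right) if a == 1 else s * (-p_right)

def reinforce_step(theta, env, T, alpha):
    """Sample one state-action sequence with the current stochastic policy,
    compute its payoff, and take one stochastic ascent step on theta."""
    s = env.reset()                                             # assumed simulator interface
    grads, total_payoff = [], 0.0
    for _ in range(T):
        a = 1 if np.random.rand() < sigmoid(theta @ s) else 0   # execute the stochastic policy
        grads.append(grad_log_pi(theta, s, a))
        s, r = env.step(a)                                      # next state and reward
        total_payoff += r
    return theta + alpha * sum(grads) * total_payoff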
So let's go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every iteration it updates theta according to the quantity on the right-hand side, and that quantity depends very much on the payoff and on the particular state-action sequence you saw. Your state-action sequence is random, so on every iteration I take a step that's chosen somewhat randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how this changes the parameters theta; in particular, I want to know the expected value of the change to the parameters. Our goal is to maximize the expected payoff, and so long as the updates are, on expectation - on average - taking us uphill on the expected payoff, then we're happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember when I talked about stochastic gradient descent for least-squares regression: I said you have some parameters - maybe you're trying to minimize a quadratic function - and the parameters wander around somewhat randomly until they get close to the optimum of that quadratic surface. It turns out that the REINFORCE algorithm will be very much like that. It will be a stochastic gradient ascent algorithm in which, on every step, the step we take is a little bit random - it's determined by the random state-action sequence - but on expectation this turns out to be essentially a gradient ascent algorithm. And so it will wander around randomly, but on average take you towards the optimum. So let me go ahead and prove that now.

What I'm going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then hopefully show that this update rule is what the algorithm above does on expectation. So before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions, f(theta) g(theta) h(theta), then the derivative of the product is given by differentiating these factors one at a time. First I differentiate with respect to f, getting f', leaving the other two fixed; then I differentiate g, leaving the other two fixed; then I differentiate h, getting h', leaving the other two fixed: (f g h)' = f' g h + f g' h + f g h'.
So that's the product rule for derivatives.

Now refer back to the equation where earlier we wrote out the expected payoff: the sum over all state-action sequences of the probability of the sequence times the payoff. What I'm going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function - take its derivative with respect to theta and then try to go uphill on it. So using the product rule, when I take the derivative of this function with respect to theta, I end up with a sum of terms: there are several factors here that depend on theta, so I'll get one term corresponding to the derivative of the first pi_theta factor keeping everything else fixed, one term from the derivative of the second pi_theta factor keeping everything else fixed, and so on, up to one term from the derivative of the last pi_theta factor keeping everything else fixed. So just apply the product rule to this; let's write that down.

So I have that the derivative with respect to theta of the expected payoff is - it turns out I can actually do this entire derivation in exactly four steps, but each of the steps requires a huge amount of writing, so I'll just start writing and see how it goes; but this is a four-step derivation. Step one: there's the sum over the state-action sequences as we saw before, and inside it a sum of terms in which each pi_theta(s_t, a_t) factor is differentiated in turn, with all the other factors kept fixed, and then all of that times the payoff.
That huge amount of writing was just taking my previous formula and differentiating the factors that depend on theta one at a time: there's the term with the derivative of the first factor, pi_theta(s_0, a_0); there's the second one; then plus dot, dot, dot; and then the last term. So that was step one of four.

Step two is just algebra - let me write it down and convince us all that it's true. In this second of four steps, you just convince yourself that if I instead put the original big product out in front, and multiply it by the sum of ratios of the derivative of each pi_theta factor to that factor itself, then expanding that out gives back the same sum of terms. For example, when I multiply the product by the first fraction, the pi_theta(s_0, a_0) in the product cancels against the pi_theta(s_0, a_0) in the denominator and is replaced by the derivative with respect to theta of pi_theta(s_0, a_0). So that algebra was the second step.

But the product out front is just what I worked out previously - it's the joint probability of the state-action sequence - and now I have that, times the sum of those ratios, times the payoff. And so, by the definition of expectation, this is just equal to the expected value of that sum of ratios times the payoff.

So this quantity inside the expectation is exactly the step that we were taking in the inner loop of our REINFORCE algorithm - roughly the REINFORCE algorithm. This proves that the expected value of our change to theta is exactly in the direction of the gradient of our expected payoff.
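Since the board work isn't captured in the transcript, here is one way to write out the four steps just described, using the notation above, with P(s_0) the initial-state distribution and P_{s_t a_t}(s_{t+1}) the transition probabilities; take it as a reconstruction under those assumptions rather than a verbatim copy of the board.

\begin{align*}
\nabla_\theta \,\mathrm{E}[\text{payoff}]
&= \nabla_\theta \sum_{s_0,a_0,\ldots,s_T,a_T} P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots \pi_\theta(s_T,a_T)\,\big(R(s_0)+\cdots+R(s_T)\big) \\
&= \sum_{s_0,a_0,\ldots} \Big[\big(\nabla_\theta \pi_\theta(s_0,a_0)\big)\,P(s_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdots
   \;+\;\cdots\;+\; P(s_0)\,\pi_\theta(s_0,a_0)\cdots\big(\nabla_\theta \pi_\theta(s_T,a_T)\big)\Big]\big(R(s_0)+\cdots+R(s_T)\big) \\
&= \sum_{s_0,a_0,\ldots} P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\cdots \pi_\theta(s_T,a_T)
   \left[\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right]\big(R(s_0)+\cdots+R(s_T)\big) \\
&= \mathrm{E}\!\left[\left(\sum_{t=0}^{T}\frac{\nabla_\theta \pi_\theta(s_t,a_t)}{\pi_\theta(s_t,a_t)}\right)\big(R(s_0)+\cdots+R(s_T)\big)\right].
\end{align*}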
That's how I started this whole derivation: I said, let's look at our expected payoff and take its derivative with respect to theta. What we've proved is that, on expectation, the step direction REINFORCE takes is exactly the gradient of the thing I'm trying to optimize. This shows that the algorithm is a stochastic gradient ascent algorithm.

I wrote a lot - why don't you take a minute to look at the equations and check that everything makes sense; I'll erase a couple of boards and then check if you have questions after that. Questions? Could you raise your hand if this makes sense? Great.

Some comments. We've talked about value function approximation approaches, where you approximate V* and then go from V* to pi*, and now also policy search approaches, where you try to approximate the policy directly. So let's talk briefly about when each may be preferable. It turns out that policy search algorithms are especially effective when you can choose a simple policy class Pi. So the question really is: for your problem, does there exist a simple function - like a linear function or a logistic function - that maps from features of the state to the action and works pretty well? For the inverted pendulum problem, this is quite likely to be true: going through the different choices of parameters, you can express things like, if the pole's leaning towards the right, then accelerate towards the right to try to catch it. So for the inverted pendulum, this is probably true.
For lots of what are called low-level control tasks - things like driving a car, the low-level reflexes of whether you steer your car left to avoid another car or steer left to follow the road, or flying a helicopter, again very short time-scale types of decisions - I like to think of these as decisions like those of a trained operator, a trained driver or a trained pilot. They're almost reflexes, these very quick, instinctive things where you map very directly from the input state to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function or something, and it will often work well.

In contrast, if you have problems that require long multistep reasoning - things like a game of chess, where you have to reason carefully about, if I do this, then they'll do that, then they'll do this, then they'll do that - those I think of as less instinctual, very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead. Let me say more about this later.
One more thing - I guess just as a side comment - it turns out that if you have a POMDP, a partially observable MDP - and I don't want to say too much about this - and you only have an approximation, call it s_hat, of the true state, which could for instance be s_hat = s_t|t from a Kalman filter, then you can still use these sorts of policy search algorithms, where your policy is pi_theta(s_hat, a). There are various other ways you can use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that looks only at your estimate of the state to choose the action. By using the same way of estimating the state in both training and testing, these sorts of policy search algorithms can often be applied reasonably effectively to POMDPs as well.
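As a tiny sketch of that idea - the policy looks only at the state estimate, and the same estimator is used at training and test time. The logistic two-action policy and the estimate s_hat are assumptions carried over from the earlier illustrations.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def act_on_estimate(theta, s_hat):
    """Policy search policy applied to the state estimate:
    pi_theta(s_hat, a) instead of pi_theta(s, a)."""
    p_right = sigmoid(theta @ s_hat)          # s_hat could be s_{t|t} from a Kalman filter
    return 1 if np.random.rand() < p_right else 0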
Dialogue: 0,0:46:29.45,0:46:40.62,Default,,0000,0000,0000,,One consequence of that is in the reinforce algorithm - I shouldn't Dialogue: 0,0:46:40.62,0:46:44.97,Default,,0000,0000,0000,,really call it reinforce. In what's essentially the reinforce algorithm, there's Dialogue: 0,0:46:44.97,0:46:48.45,Default,,0000,0000,0000,,this step where you need to sample a state action sequence. Dialogue: 0,0:46:48.45,0:46:50.76,Default,,0000,0000,0000,,So in Dialogue: 0,0:46:50.76,0:46:54.16,Default,,0000,0000,0000,,principle you could do this on your own robot. If there were a robot you were trying to Dialogue: 0,0:46:54.16,0:46:55.28,Default,,0000,0000,0000,,control, you can actually Dialogue: 0,0:46:55.28,0:46:58.71,Default,,0000,0000,0000,,physically initialize in some state, pick an action and so on, and go from there Dialogue: 0,0:46:58.71,0:47:00.45,Default,,0000,0000,0000,,to sample a state action sequence. Dialogue: 0,0:47:00.45,0:47:01.53,Default,,0000,0000,0000,,But Dialogue: 0,0:47:01.53,0:47:05.27,Default,,0000,0000,0000,,if you need to do this ten million times, you probably don't want to [inaudible] Dialogue: 0,0:47:05.27,0:47:07.46,Default,,0000,0000,0000,,your robot ten million times. Dialogue: 0,0:47:07.46,0:47:13.16,Default,,0000,0000,0000,,I personally have seen many more applications of reinforce in simulation. Dialogue: 0,0:47:13.16,0:47:16.93,Default,,0000,0000,0000,,You can easily run ten thousand simulations or ten million simulations of your robot in Dialogue: 0,0:47:16.93,0:47:20.89,Default,,0000,0000,0000,,simulation maybe, but you might not want to do that - Dialogue: 0,0:47:20.89,0:47:24.51,Default,,0000,0000,0000,,have your robot physically repeat some action ten million times. Dialogue: 0,0:47:24.51,0:47:27.53,Default,,0000,0000,0000,,So I personally have seen many more applications of reinforce Dialogue: 0,0:47:27.53,0:47:30.32,Default,,0000,0000,0000,,to learn using a simulator Dialogue: 0,0:47:30.32,0:47:34.80,Default,,0000,0000,0000,,than to actually do this on a physical device. Dialogue: 0,0:47:34.80,0:47:38.23,Default,,0000,0000,0000,,The last thing I wanted to do is tell you about one other algorithm, Dialogue: 0,0:47:38.23,0:47:45.23,Default,,0000,0000,0000,,one final policy search algorithm. [Inaudible] the laptop display please. It's Dialogue: 0,0:47:52.25,0:47:56.12,Default,,0000,0000,0000,,a policy search algorithm called Pegasus that's Dialogue: 0,0:47:56.12,0:48:00.04,Default,,0000,0000,0000,,actually what we use on our autonomous helicopter flight things Dialogue: 0,0:48:00.04,0:48:05.46,Default,,0000,0000,0000,,for many years. There are some other things we do now. So Dialogue: 0,0:48:05.46,0:48:07.58,Default,,0000,0000,0000,,here's the idea. There's a Dialogue: 0,0:48:07.58,0:48:11.06,Default,,0000,0000,0000,,reminder slide on RL formalism. There's nothing here that you don't know, but Dialogue: 0,0:48:11.06,0:48:17.12,Default,,0000,0000,0000,,I just want to pictorially describe the RL formalism because I'll use that later. Dialogue: 0,0:48:17.12,0:48:20.86,Default,,0000,0000,0000,,I'm gonna draw the reinforcement learning picture as follows. 
The Dialogue: 0,0:48:20.86,0:48:21.69,Default,,0000,0000,0000,,initialized [inaudible] Dialogue: 0,0:48:21.69,0:48:23.51,Default,,0000,0000,0000,,system, say a Dialogue: 0,0:48:23.51,0:48:25.56,Default,,0000,0000,0000,,helicopter or whatever, in some state S0, Dialogue: 0,0:48:25.56,0:48:27.51,Default,,0000,0000,0000,,you choose an action A0, Dialogue: 0,0:48:27.51,0:48:31.78,Default,,0000,0000,0000,,and then, let's say, the helicopter dynamics takes you to some new state S1, you choose Dialogue: 0,0:48:31.78,0:48:33.26,Default,,0000,0000,0000,,some other action A1, Dialogue: 0,0:48:33.26,0:48:34.62,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:48:34.62,0:48:38.79,Default,,0000,0000,0000,,And then you have some reward function, which you apply to the sequence of states you passed through, Dialogue: 0,0:48:38.79,0:48:41.25,Default,,0000,0000,0000,,and that's your total payoff. So Dialogue: 0,0:48:41.25,0:48:46.89,Default,,0000,0000,0000,,this is just a picture I wanna use to summarize the RL problem. Dialogue: 0,0:48:47.95,0:48:50.88,Default,,0000,0000,0000,,Our goal is to maximize the expected payoff, Dialogue: 0,0:48:50.88,0:48:53.00,Default,,0000,0000,0000,,which is this, the expected sum of the rewards. Dialogue: 0,0:48:53.00,0:48:56.91,Default,,0000,0000,0000,,And our goal is to learn the policy, which I denote by a green box. Dialogue: 0,0:48:56.91,0:49:01.99,Default,,0000,0000,0000,,So our policy - and I'll switch back to deterministic policies for now. Dialogue: 0,0:49:01.99,0:49:09.13,Default,,0000,0000,0000,,So my deterministic policy will be some function mapping from the states to the actions. Dialogue: 0,0:49:09.13,0:49:14.60,Default,,0000,0000,0000,,As a concrete example, you imagine that in the policy search setting, Dialogue: 0,0:49:14.60,0:49:17.34,Default,,0000,0000,0000,,you may have a linear class of policies. Dialogue: 0,0:49:17.34,0:49:22.66,Default,,0000,0000,0000,,So you may imagine that the action A will be say a linear function of the states, Dialogue: 0,0:49:22.66,0:49:27.60,Default,,0000,0000,0000,,and your goal is to learn the parameters of the linear function. Dialogue: 0,0:49:27.60,0:49:33.35,Default,,0000,0000,0000,,So imagine trying to do linear regression on policies, except you're trying to optimize the reinforcement learning Dialogue: 0,0:49:33.35,0:49:36.21,Default,,0000,0000,0000,,objective. So just [inaudible] imagine that the action A Dialogue: 0,0:49:36.21,0:49:40.31,Default,,0000,0000,0000,,is theta transpose S, and you'd use policy search to come up with Dialogue: 0,0:49:40.31,0:49:42.11,Default,,0000,0000,0000,,good parameters theta so as Dialogue: 0,0:49:42.11,0:49:48.98,Default,,0000,0000,0000,,to maximize the expected payoff. That would be one setting in which this picture applies. There's the idea. Dialogue: 0,0:49:48.98,0:49:49.81,Default,,0000,0000,0000,,Quite often Dialogue: 0,0:49:49.81,0:49:54.46,Default,,0000,0000,0000,,we come up with a model or a simulator for the MDP, and as before Dialogue: 0,0:49:54.46,0:49:57.50,Default,,0000,0000,0000,,a model or a simulator is just a box Dialogue: 0,0:49:57.50,0:49:58.83,Default,,0000,0000,0000,,that takes as input Dialogue: 0,0:49:58.83,0:50:00.42,Default,,0000,0000,0000,,some state ST, Dialogue: 0,0:50:00.42,0:50:02.69,Default,,0000,0000,0000,,takes as input some action AT, Dialogue: 0,0:50:02.69,0:50:06.57,Default,,0000,0000,0000,,and then outputs some [inaudible] state ST plus one that you might end up in, in the MDP.
Dialogue: 0,0:50:06.57,0:50:10.13,Default,,0000,0000,0000,,This ST plus one will be a random state. It will be drawn from Dialogue: 0,0:50:10.13,0:50:13.03,Default,,0000,0000,0000,,the random state transition probabilities of MDP. Dialogue: 0,0:50:13.03,0:50:16.98,Default,,0000,0000,0000,,This is important. Very important, ST plus one Dialogue: 0,0:50:16.98,0:50:21.81,Default,,0000,0000,0000,,will be a random function ST and AT. In the simulator, this is [inaudible]. Dialogue: 0,0:50:21.81,0:50:25.71,Default,,0000,0000,0000,,So for example, for autonomous helicopter flight, you [inaudible] build a simulator Dialogue: 0,0:50:25.71,0:50:29.82,Default,,0000,0000,0000,,using supervised learning, an algorithm like linear regression Dialogue: 0,0:50:29.82,0:50:31.53,Default,,0000,0000,0000,,[inaudible] to linear regression, Dialogue: 0,0:50:31.53,0:50:35.10,Default,,0000,0000,0000,,so we can get a nonlinear model of the dynamics of what ST Dialogue: 0,0:50:35.10,0:50:41.40,Default,,0000,0000,0000,,plus one is as a random function of ST and AT. Now Dialogue: 0,0:50:41.40,0:50:44.100,Default,,0000,0000,0000,,once you have a simulator, Dialogue: 0,0:50:44.100,0:50:48.10,Default,,0000,0000,0000,,given any fixed policy you can quite Dialogue: 0,0:50:48.10,0:50:52.31,Default,,0000,0000,0000,,straightforwardly evaluate any policy in a simulator. Dialogue: 0,0:50:53.36,0:51:00.21,Default,,0000,0000,0000,,Concretely, our goal is to find the policy pi mapping from states to actions, so the goal is to find the green box Dialogue: 0,0:51:00.21,0:51:02.88,Default,,0000,0000,0000,,like that. It works well. Dialogue: 0,0:51:02.88,0:51:07.57,Default,,0000,0000,0000,,So if you have any one fixed policy pi, Dialogue: 0,0:51:07.57,0:51:13.11,Default,,0000,0000,0000,,you can evaluate the policy pi just using the simulator Dialogue: 0,0:51:13.11,0:51:15.97,Default,,0000,0000,0000,,via the picture shown at the bottom of the slide. Dialogue: 0,0:51:15.97,0:51:18.88,Default,,0000,0000,0000,,So concretely, you can take your initial state S0, Dialogue: 0,0:51:18.88,0:51:20.98,Default,,0000,0000,0000,,feed it into the policy pi, Dialogue: 0,0:51:20.98,0:51:25.45,Default,,0000,0000,0000,,your policy pi will output some action A0, you plug it in Dialogue: 0,0:51:25.45,0:51:28.07,Default,,0000,0000,0000,,the simulator, the simulator outputs a random state S1, you feed Dialogue: 0,0:51:28.07,0:51:31.63,Default,,0000,0000,0000,,S1 into the policy and so on, and you get a sequence of states S0 through ST Dialogue: 0,0:51:31.63,0:51:35.43,Default,,0000,0000,0000,,that your helicopter flies through in simulation. Dialogue: 0,0:51:35.43,0:51:37.81,Default,,0000,0000,0000,,Then sum up the rewards, Dialogue: 0,0:51:37.81,0:51:42.61,Default,,0000,0000,0000,,and this gives you an estimate of the expected payoff of the policy. Dialogue: 0,0:51:42.61,0:51:44.93,Default,,0000,0000,0000,,This picture is just a fancy way of saying fly Dialogue: 0,0:51:44.93,0:51:48.81,Default,,0000,0000,0000,,your helicopter in simulation and see how well it does, and measure the sum of Dialogue: 0,0:51:48.81,0:51:53.97,Default,,0000,0000,0000,,rewards you get on average in the simulator. The Dialogue: 0,0:51:53.97,0:51:57.22,Default,,0000,0000,0000,,picture I've drawn here assumes that you run your policy through the Dialogue: 0,0:51:57.22,0:51:59.16,Default,,0000,0000,0000,,simulator just once. 
In general, Dialogue: 0,0:51:59.16,0:52:01.30,Default,,0000,0000,0000,,you would run the policy through the simulator Dialogue: 0,0:52:01.30,0:52:05.10,Default,,0000,0000,0000,,some number of times and then average to get an average over M Dialogue: 0,0:52:05.10,0:52:10.87,Default,,0000,0000,0000,,simulations to get a better estimate of the policy's expected payoff. Dialogue: 0,0:52:12.80,0:52:14.62,Default,,0000,0000,0000,,Now that I have a way Dialogue: 0,0:52:14.62,0:52:18.07,Default,,0000,0000,0000,,- given any one fixed policy, Dialogue: 0,0:52:18.07,0:52:23.43,Default,,0000,0000,0000,,this gives me a way to evaluate the expected payoff of that policy. Dialogue: 0,0:52:23.43,0:52:28.18,Default,,0000,0000,0000,,So one reasonably obvious thing you might try to do is then Dialogue: 0,0:52:28.18,0:52:30.07,Default,,0000,0000,0000,,just search for a policy, Dialogue: 0,0:52:30.07,0:52:33.61,Default,,0000,0000,0000,,in other words search for parameters theta for your policy, that Dialogue: 0,0:52:33.61,0:52:35.93,Default,,0000,0000,0000,,gives you high estimated payoff. Dialogue: 0,0:52:35.93,0:52:39.51,Default,,0000,0000,0000,,Does that make sense? So my policy has some parameters theta, so Dialogue: 0,0:52:39.51,0:52:40.98,Default,,0000,0000,0000,,my policy is Dialogue: 0,0:52:40.98,0:52:46.31,Default,,0000,0000,0000,,my actions A are equal to theta transpose S say if there's a linear policy. Dialogue: 0,0:52:46.31,0:52:49.56,Default,,0000,0000,0000,,For any fixed value of the parameters theta, Dialogue: 0,0:52:49.56,0:52:50.99,Default,,0000,0000,0000,,I can evaluate - Dialogue: 0,0:52:50.99,0:52:54.44,Default,,0000,0000,0000,,I can get an estimate for how good the policy is using the simulator. Dialogue: 0,0:52:54.44,0:52:58.60,Default,,0000,0000,0000,,One thing I might try to do is search for parameters theta to try to Dialogue: 0,0:52:58.60,0:53:03.11,Default,,0000,0000,0000,,maximize my estimated payoff. Dialogue: 0,0:53:03.11,0:53:05.93,Default,,0000,0000,0000,,It turns out you can sort of do that, Dialogue: 0,0:53:05.93,0:53:10.98,Default,,0000,0000,0000,,but that idea as I've just stated is hard to get to work. Dialogue: 0,0:53:10.98,0:53:12.18,Default,,0000,0000,0000,,Here's the Dialogue: 0,0:53:12.18,0:53:15.72,Default,,0000,0000,0000,,reason. The simulator allows us to evaluate Dialogue: 0,0:53:15.72,0:53:17.53,Default,,0000,0000,0000,,policy, so let's search for policy of high Dialogue: 0,0:53:17.53,0:53:20.62,Default,,0000,0000,0000,,value. The difficulty is that the simulator is random, Dialogue: 0,0:53:20.62,0:53:22.100,Default,,0000,0000,0000,,and so every time we evaluate a policy, Dialogue: 0,0:53:22.100,0:53:26.82,Default,,0000,0000,0000,,we get back a very slightly different answer. So Dialogue: 0,0:53:26.82,0:53:28.74,Default,,0000,0000,0000,,in the Dialogue: 0,0:53:28.74,0:53:33.22,Default,,0000,0000,0000,,cartoon below, I want you to imagine that the horizontal axis is the space of policies. Dialogue: 0,0:53:33.22,0:53:36.78,Default,,0000,0000,0000,,In other words, as I vary the Dialogue: 0,0:53:36.78,0:53:40.84,Default,,0000,0000,0000,,parameters in my policy, I get different points on the horizontal axis here. 
As I Dialogue: 0,0:53:40.84,0:53:43.79,Default,,0000,0000,0000,,vary the parameters theta, I get different policies, Dialogue: 0,0:53:43.79,0:53:46.48,Default,,0000,0000,0000,,and so I'm moving along the X axis, Dialogue: 0,0:53:46.48,0:53:50.08,Default,,0000,0000,0000,,and my total payoff I'm gonna plot on the vertical axis, Dialogue: 0,0:53:50.08,0:53:54.46,Default,,0000,0000,0000,,and the red line in this cartoon is the expected payoff of the different Dialogue: 0,0:53:54.46,0:53:56.13,Default,,0000,0000,0000,,policies. Dialogue: 0,0:53:56.13,0:54:01.93,Default,,0000,0000,0000,,My goal is to find the policy with the highest expected payoff. Dialogue: 0,0:54:01.93,0:54:06.02,Default,,0000,0000,0000,,You could search for a policy with high expected payoff, but every time you evaluate a policy Dialogue: 0,0:54:06.02,0:54:08.15,Default,,0000,0000,0000,,- say I evaluate some policy, Dialogue: 0,0:54:08.15,0:54:09.82,Default,,0000,0000,0000,,then I might get a point that Dialogue: 0,0:54:09.82,0:54:12.88,Default,,0000,0000,0000,,just by chance looked a little bit better than it should have. If Dialogue: 0,0:54:12.88,0:54:16.49,Default,,0000,0000,0000,,I evaluate a second policy and just by chance it looked a little bit worse. I evaluate a third Dialogue: 0,0:54:16.49,0:54:21.42,Default,,0000,0000,0000,,policy, fourth, Dialogue: 0,0:54:21.42,0:54:25.44,Default,,0000,0000,0000,,sometimes you look here - sometimes I might actually evaluate exactly the same policy Dialogue: 0,0:54:25.44,0:54:27.83,Default,,0000,0000,0000,,twice and get back slightly different Dialogue: 0,0:54:27.83,0:54:32.38,Default,,0000,0000,0000,,answers just because my simulator is random, so when I apply the same policy Dialogue: 0,0:54:32.38,0:54:38.67,Default,,0000,0000,0000,,twice in simulation, I might get back slightly different answers. Dialogue: 0,0:54:38.67,0:54:41.84,Default,,0000,0000,0000,,So as I evaluate more and more policies, these are the pictures I get. Dialogue: 0,0:54:41.84,0:54:45.80,Default,,0000,0000,0000,,My goal is to try to optimize the red line. I hope you appreciate this is a Dialogue: 0,0:54:45.80,0:54:50.57,Default,,0000,0000,0000,,hard problem, especially when all [inaudible] optimization algorithm gets to see are these Dialogue: 0,0:54:50.57,0:54:53.95,Default,,0000,0000,0000,,black dots, and they don't have direct access to the red line. Dialogue: 0,0:54:53.95,0:54:58.27,Default,,0000,0000,0000,,So when my input space is some fairly high dimensional space, if I have Dialogue: 0,0:54:58.27,0:55:02.32,Default,,0000,0000,0000,,ten parameters, the horizontal axis would actually be a 10-D space, Dialogue: 0,0:55:02.32,0:55:07.05,Default,,0000,0000,0000,,all I get are these noisy estimates of what the red line is. This is a very hard Dialogue: 0,0:55:07.05,0:55:10.21,Default,,0000,0000,0000,,stochastic optimization problem. Dialogue: 0,0:55:10.21,0:55:15.97,Default,,0000,0000,0000,,So it turns out there's one way to make this optimization problem much easier. Here's the idea. Dialogue: 0,0:55:15.97,0:55:20.95,Default,,0000,0000,0000,,And the method is called Pegasus, which is an acronym for something. Dialogue: 0,0:55:20.95,0:55:22.89,Default,,0000,0000,0000,,I'll tell you later. 
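As a concrete illustration of this noisy evaluation, here is a minimal sketch of estimating a policy's expected payoff by averaging rollouts in a random simulator; the interface (policy, simulator, reward) and all names below are assumptions made only for this sketch.

```python
import numpy as np

def estimate_payoff(theta, policy, simulator, reward, s0, horizon, n_rollouts, rng):
    """Noisy Monte Carlo estimate of a policy's expected payoff (sketch).

    simulator(s, a, rng) is assumed to return a random next state, and
    reward(s) the reward for being in state s. Because the simulator draws
    fresh random numbers on every call, calling this twice with the same
    theta generally returns slightly different estimates -- these are the
    scattered black dots around the true expected payoff (the red line).
    """
    payoffs = []
    for _ in range(n_rollouts):
        s, total = s0, 0.0
        for _ in range(horizon):
            a = policy(theta, s)
            total += reward(s)
            s = simulator(s, a, rng)
        payoffs.append(total)
    return float(np.mean(payoffs))
```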
Dialogue: 0,0:55:22.89,0:55:26.14,Default,,0000,0000,0000,,So the simulator repeatedly makes calls Dialogue: 0,0:55:26.14,0:55:27.96,Default,,0000,0000,0000,,to a random number generator Dialogue: 0,0:55:27.96,0:55:33.08,Default,,0000,0000,0000,,to generate random numbers RT, which are used to simulate the stochastic dynamics. Dialogue: 0,0:55:33.08,0:55:36.71,Default,,0000,0000,0000,,What I mean by that is that the simulator takes as input a state Dialogue: 0,0:55:36.71,0:55:39.75,Default,,0000,0000,0000,,and action, and it outputs the next state randomly, Dialogue: 0,0:55:39.75,0:55:43.100,Default,,0000,0000,0000,,and if you peer into the simulator, if you open the box of the simulator and ask Dialogue: 0,0:55:43.100,0:55:49.53,Default,,0000,0000,0000,,how is my simulator generating these random next states ST plus one, Dialogue: 0,0:55:49.53,0:55:53.81,Default,,0000,0000,0000,,pretty much the only way to do so - pretty much the only way Dialogue: 0,0:55:53.81,0:55:56.91,Default,,0000,0000,0000,,to write a simulator with random outputs is Dialogue: 0,0:55:56.91,0:56:00.21,Default,,0000,0000,0000,,we're gonna make calls to a random number generator, Dialogue: 0,0:56:00.21,0:56:03.44,Default,,0000,0000,0000,,and get random numbers, these RTs, Dialogue: 0,0:56:03.44,0:56:08.30,Default,,0000,0000,0000,,and then you have some function that takes as input S0, A0, and Dialogue: 0,0:56:08.30,0:56:10.98,Default,,0000,0000,0000,,the results of your random number generator, Dialogue: 0,0:56:10.98,0:56:14.43,Default,,0000,0000,0000,,and it computes some next state as a function of the inputs and of the random Dialogue: 0,0:56:14.43,0:56:16.84,Default,,0000,0000,0000,,number it got from the random number Dialogue: 0,0:56:16.84,0:56:17.70,Default,,0000,0000,0000,,generator. This is Dialogue: 0,0:56:17.70,0:56:22.36,Default,,0000,0000,0000,,pretty much the only way anyone implements any random code, any code Dialogue: 0,0:56:22.36,0:56:26.18,Default,,0000,0000,0000,,that generates random outputs. You make a call to a random number generator, Dialogue: 0,0:56:26.18,0:56:30.87,Default,,0000,0000,0000,,and you compute some function of the random number and of your other Dialogue: 0,0:56:30.87,0:56:35.30,Default,,0000,0000,0000,,inputs. The reason that when you evaluate different policies you get Dialogue: 0,0:56:35.30,0:56:36.44,Default,,0000,0000,0000,,different answers Dialogue: 0,0:56:36.44,0:56:37.75,Default,,0000,0000,0000,,is because Dialogue: 0,0:56:37.75,0:56:41.46,Default,,0000,0000,0000,,every time you rerun the simulator, you get a different sequence of random Dialogue: 0,0:56:41.46,0:56:43.57,Default,,0000,0000,0000,,numbers from the random number generator, Dialogue: 0,0:56:43.57,0:56:45.57,Default,,0000,0000,0000,,and so you get a different answer every time, Dialogue: 0,0:56:45.57,0:56:48.14,Default,,0000,0000,0000,,even if you evaluate the same policy twice. Dialogue: 0,0:56:48.14,0:56:50.91,Default,,0000,0000,0000,,So here's the idea. Dialogue: 0,0:56:50.91,0:56:56.15,Default,,0000,0000,0000,,Let's say we fix in advance the sequence of random numbers, Dialogue: 0,0:56:56.15,0:57:00.23,Default,,0000,0000,0000,,choose R1, R2, up to RT minus one. Dialogue: 0,0:57:00.23,0:57:03.36,Default,,0000,0000,0000,,Fix the sequence of random numbers in advance, Dialogue: 0,0:57:03.36,0:57:07.28,Default,,0000,0000,0000,,and we'll always use the same sequence of random numbers Dialogue: 0,0:57:07.28,0:57:10.19,Default,,0000,0000,0000,,to evaluate different policies.
Go Dialogue: 0,0:57:10.19,0:57:14.55,Default,,0000,0000,0000,,into your code and fix R1, R2, through RT minus one. Dialogue: 0,0:57:14.55,0:57:17.64,Default,,0000,0000,0000,,Choose them randomly once and then fix them forever. Dialogue: 0,0:57:17.64,0:57:20.98,Default,,0000,0000,0000,,If you always use the same sequence of random numbers, then Dialogue: 0,0:57:20.98,0:57:25.21,Default,,0000,0000,0000,,the system is no longer random, and if you evaluate the same policy twice, Dialogue: 0,0:57:25.21,0:57:28.57,Default,,0000,0000,0000,,you get back exactly the same answer. Dialogue: 0,0:57:28.57,0:57:33.33,Default,,0000,0000,0000,,And so these sequences of random numbers, [inaudible] call them scenarios, and Dialogue: 0,0:57:33.33,0:57:37.41,Default,,0000,0000,0000,,Pegasus is an acronym for policy evaluation of goodness and search using scenarios. Dialogue: 0,0:57:37.41,0:57:38.86,Default,,0000,0000,0000,,So Dialogue: 0,0:57:42.37,0:57:45.95,Default,,0000,0000,0000,,when you do that, this is the picture you get. Dialogue: 0,0:57:45.95,0:57:50.19,Default,,0000,0000,0000,,As before, the red line is your expected payoff, and by fixing the random numbers, Dialogue: 0,0:57:50.19,0:57:53.47,Default,,0000,0000,0000,,you've defined some estimate of the expected payoff. Dialogue: 0,0:57:53.47,0:57:56.74,Default,,0000,0000,0000,,And as you evaluate different policies, they're Dialogue: 0,0:57:56.74,0:58:00.61,Default,,0000,0000,0000,,still only approximations to the true expected payoff, but at least now you have Dialogue: 0,0:58:00.61,0:58:01.31,Default,,0000,0000,0000,,a Dialogue: 0,0:58:01.31,0:58:03.36,Default,,0000,0000,0000,,deterministic function to optimize, Dialogue: 0,0:58:03.36,0:58:07.11,Default,,0000,0000,0000,,and you can now apply your favorite algorithms, be it gradient ascent or Dialogue: 0,0:58:08.09,0:58:09.81,Default,,0000,0000,0000,,some sort Dialogue: 0,0:58:09.81,0:58:12.26,Default,,0000,0000,0000,,of local [inaudible] search Dialogue: 0,0:58:12.26,0:58:13.61,Default,,0000,0000,0000,,to try to optimize the black Dialogue: 0,0:58:13.61,0:58:16.33,Default,,0000,0000,0000,,curve. This gives you a much easier Dialogue: 0,0:58:16.33,0:58:18.39,Default,,0000,0000,0000,,optimization problem, and you Dialogue: 0,0:58:18.39,0:58:22.79,Default,,0000,0000,0000,,can optimize this to find hopefully a good policy. So this is Dialogue: 0,0:58:22.79,0:58:26.65,Default,,0000,0000,0000,,the Pegasus policy search method. Dialogue: 0,0:58:26.65,0:58:27.45,Default,,0000,0000,0000,,So when Dialogue: 0,0:58:27.45,0:58:31.39,Default,,0000,0000,0000,,I started to talk about reinforcement learning, I showed Dialogue: 0,0:58:31.39,0:58:34.98,Default,,0000,0000,0000,,that video of a helicopter flying upside down. That was actually done using Dialogue: 0,0:58:34.98,0:58:40.03,Default,,0000,0000,0000,,exactly this method, using exactly this policy search algorithm. This Dialogue: 0,0:58:40.03,0:58:41.85,Default,,0000,0000,0000,,seems to scale well Dialogue: 0,0:58:41.85,0:58:47.10,Default,,0000,0000,0000,,even to fairly large problems, even to fairly high dimensional state spaces.
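Here is a minimal sketch of the fixed-scenario idea, assuming the simulator has been rewritten as a deterministic function of the state, the action, and a supplied random number; all names below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def deterministic_payoff(theta, policy, simulator_det, reward, s0, scenario):
    """Evaluate a policy against one fixed scenario (sketch).

    A scenario is one pre-drawn sequence of random numbers R1, ..., RT-1.
    simulator_det(s, a, r) is assumed to compute the next state as a
    deterministic function of the state, the action, and the supplied random
    number r, so evaluating the same theta twice returns exactly the same
    number, and ordinary deterministic search methods can be applied.
    """
    s, total = s0, 0.0
    for r in scenario:
        a = policy(theta, s)
        total += reward(s)
        s = simulator_det(s, a, r)
    return total

# Hypothetical usage: draw the random numbers once, then fix them forever.
horizon = 50
scenario = np.random.default_rng(0).random(horizon)
```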
Dialogue: 0,0:58:47.10,0:58:50.44,Default,,0000,0000,0000,,Typically Pegasus policy search algorithms have been using - Dialogue: 0,0:58:50.44,0:58:52.72,Default,,0000,0000,0000,,the optimization problem Dialogue: 0,0:58:52.72,0:58:56.44,Default,,0000,0000,0000,,is still - is much easier than the stochastic version, but sometimes it's Dialogue: 0,0:58:56.44,0:58:58.28,Default,,0000,0000,0000,,not entirely trivial, and so Dialogue: 0,0:58:58.28,0:59:00.30,Default,,0000,0000,0000,,you have to apply this sort of method with Dialogue: 0,0:59:00.30,0:59:04.55,Default,,0000,0000,0000,,maybe on the order of ten parameters or tens of parameters, so 30, 40 Dialogue: 0,0:59:04.55,0:59:05.75,Default,,0000,0000,0000,,parameters, but Dialogue: 0,0:59:05.75,0:59:08.17,Default,,0000,0000,0000,,not thousands of parameters, Dialogue: 0,0:59:08.17,0:59:13.05,Default,,0000,0000,0000,,at least in these sorts of things with them. Dialogue: 0,0:59:13.05,0:59:19.47,Default,,0000,0000,0000,,So is that method different than just assuming that Dialogue: 0,0:59:19.47,0:59:26.95,Default,,0000,0000,0000,,you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that Dialogue: 0,0:59:26.95,0:59:28.81,Default,,0000,0000,0000,,we have a deterministic simulator? Dialogue: 0,0:59:28.81,0:59:31.25,Default,,0000,0000,0000,,The answer is no. In the way you do this, Dialogue: 0,0:59:31.25,0:59:34.95,Default,,0000,0000,0000,,for the sake of simplicity I talked about one sequence of random numbers. Dialogue: 0,0:59:34.95,0:59:36.89,Default,,0000,0000,0000,,What you do is - Dialogue: 0,0:59:36.89,0:59:41.97,Default,,0000,0000,0000,,so imagine that the random numbers are simulating different wind gusts against your helicopter. Dialogue: 0,0:59:41.97,0:59:46.43,Default,,0000,0000,0000,,So what you want to do isn't really evaluate just against one pattern of wind gusts. Dialogue: 0,0:59:46.43,0:59:48.00,Default,,0000,0000,0000,,What you want to do is Dialogue: 0,0:59:48.00,0:59:49.92,Default,,0000,0000,0000,,sample some set of Dialogue: 0,0:59:49.92,0:59:53.65,Default,,0000,0000,0000,,different patterns of wind gusts, and evaluate against all of them in average. Dialogue: 0,0:59:53.65,0:59:55.68,Default,,0000,0000,0000,,So what you do is you actually Dialogue: 0,0:59:55.68,0:59:58.04,Default,,0000,0000,0000,,sample say 100 - Dialogue: 0,0:59:58.04,1:00:01.36,Default,,0000,0000,0000,,some number I made up like 100 sequences of random numbers, Dialogue: 0,1:00:01.36,1:00:03.43,Default,,0000,0000,0000,,and every time you want to evaluate a policy, Dialogue: 0,1:00:03.43,1:00:08.68,Default,,0000,0000,0000,,you evaluate it against all 100 sequences of random numbers and then average. Dialogue: 0,1:00:08.68,1:00:13.25,Default,,0000,0000,0000,,This is in exactly the same way that on this earlier picture Dialogue: 0,1:00:13.25,1:00:16.88,Default,,0000,0000,0000,,you wouldn't necessarily evaluate the policy just once. You evaluate it maybe 100 Dialogue: 0,1:00:16.88,1:00:18.45,Default,,0000,0000,0000,,times in simulation, and then Dialogue: 0,1:00:18.45,1:00:21.71,Default,,0000,0000,0000,,average to get a better estimate of the expected reward. Dialogue: 0,1:00:21.71,1:00:23.01,Default,,0000,0000,0000,,In the same way, Dialogue: 0,1:00:23.01,1:00:24.08,Default,,0000,0000,0000,,you do that here Dialogue: 0,1:00:24.08,1:00:29.08,Default,,0000,0000,0000,,but with 100 fixed sequences of random numbers. Does that make sense? 
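Continuing the sketch above, averaging over many fixed scenarios might look like the following; it reuses the hypothetical deterministic_payoff function from the previous sketch, and the numbers are illustrative only.

```python
import numpy as np

def pegasus_objective(theta, policy, simulator_det, reward, s0, scenarios):
    """Average the fixed-scenario payoff over M scenarios (sketch).

    scenarios is a list of pre-drawn random-number sequences, each drawn once
    and then reused for every policy that is ever evaluated. The average is
    still a deterministic function of theta, but it approximates the expected
    payoff better than a single scenario would.
    """
    return sum(
        deterministic_payoff(theta, policy, simulator_det, reward, s0, sc)
        for sc in scenarios
    ) / len(scenarios)

# Hypothetical usage: 100 fixed scenarios, each of length 50.
scenarios = [np.random.default_rng(seed).random(50) for seed in range(100)]
```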
Any Dialogue: 0,1:00:29.08,1:00:36.08,Default,,0000,0000,0000,,other questions? If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]? Dialogue: 0,1:00:47.72,1:00:51.83,Default,,0000,0000,0000,,Yeah. I guess you're right. So the quality - for a fixed policy, Dialogue: 0,1:00:51.83,1:00:57.10,Default,,0000,0000,0000,,the quality of the approximation is equally good for both cases. The Dialogue: 0,1:00:57.10,1:01:00.99,Default,,0000,0000,0000,,advantage of fixing the random numbers is that you end up with Dialogue: 0,1:01:00.99,1:01:03.69,Default,,0000,0000,0000,,an optimization problem that's much easier. Dialogue: 0,1:01:04.61,1:01:06.10,Default,,0000,0000,0000,,I have some search problem, Dialogue: 0,1:01:06.10,1:01:10.05,Default,,0000,0000,0000,,and on the horizontal axis there's a space of control policies, Dialogue: 0,1:01:10.05,1:01:12.56,Default,,0000,0000,0000,,and my goal is to find a control policy that Dialogue: 0,1:01:12.56,1:01:15.32,Default,,0000,0000,0000,,maximizes the payoff. Dialogue: 0,1:01:15.32,1:01:18.66,Default,,0000,0000,0000,,The problem with this earlier setting was that Dialogue: 0,1:01:18.66,1:01:21.95,Default,,0000,0000,0000,,when I evaluate policies I get these noisy estimates, Dialogue: 0,1:01:21.95,1:01:22.95,Default,,0000,0000,0000,,and then it's Dialogue: 0,1:01:22.95,1:01:26.69,Default,,0000,0000,0000,,just very hard to optimize the red curve if I have these points that are all over the Dialogue: 0,1:01:26.69,1:01:29.42,Default,,0000,0000,0000,,place. And if I evaluate the same policy twice, I don't even get back the same Dialogue: 0,1:01:30.72,1:01:32.96,Default,,0000,0000,0000,,answer. By fixing the random numbers, Dialogue: 0,1:01:32.96,1:01:36.67,Default,,0000,0000,0000,,the algorithm still doesn't get to see the red curve, but at least it's Dialogue: 0,1:01:36.67,1:01:40.61,Default,,0000,0000,0000,,now optimizing a deterministic function. That makes the Dialogue: 0,1:01:40.61,1:01:42.35,Default,,0000,0000,0000,,optimization problem much easier. Does Dialogue: 0,1:01:42.35,1:01:49.35,Default,,0000,0000,0000,,that make sense? Student:So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves Dialogue: 0,1:01:50.54,1:01:58.01,Default,,0000,0000,0000,,that are easy to optimize. And then you smush them together? Dialogue: 0,1:01:58.01,1:01:59.31,Default,,0000,0000,0000,,Let's see. Dialogue: 0,1:01:59.31,1:02:04.56,Default,,0000,0000,0000,,I have just one nice black curve that I'm trying to optimize. For each scenario. Dialogue: 0,1:02:04.56,1:02:08.89,Default,,0000,0000,0000,,I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios. Dialogue: 0,1:02:08.89,1:02:12.34,Default,,0000,0000,0000,,So the black curve here is defined by averaging over Dialogue: 0,1:02:12.34,1:02:16.25,Default,,0000,0000,0000,,a large set of scenarios. Does that make sense? Dialogue: 0,1:02:16.25,1:02:20.10,Default,,0000,0000,0000,,So instead of only one - if the averaging thing doesn't make sense, imagine that there's Dialogue: 0,1:02:20.10,1:02:24.05,Default,,0000,0000,0000,,just one sequence of random numbers. That might be easier to think Dialogue: 0,1:02:24.05,1:02:27.42,Default,,0000,0000,0000,,about. 
Fix one sequence of random numbers, and every time I evaluate another policy, Dialogue: 0,1:02:27.42,1:02:31.04,Default,,0000,0000,0000,,I evaluate against the same sequence of random numbers, and that gives me Dialogue: 0,1:02:31.04,1:02:36.49,Default,,0000,0000,0000,,a nice deterministic function to optimize. Dialogue: 0,1:02:36.49,1:02:39.45,Default,,0000,0000,0000,,Any other questions? Dialogue: 0,1:02:39.45,1:02:41.27,Default,,0000,0000,0000,,The advantage is really Dialogue: 0,1:02:41.27,1:02:45.46,Default,,0000,0000,0000,,that - one way to think about it is when I evaluate the same policy twice, Dialogue: 0,1:02:45.46,1:02:50.45,Default,,0000,0000,0000,,at least I get back the same answer. This gives me a deterministic function Dialogue: 0,1:02:50.45,1:02:53.89,Default,,0000,0000,0000,,mapping from parameters in my policy Dialogue: 0,1:02:53.89,1:02:56.24,Default,,0000,0000,0000,,to my estimate of the expected payoff. Dialogue: 0,1:02:57.79,1:03:13.23,Default,,0000,0000,0000,,That's just a function that I can try to optimize using a search algorithm. Dialogue: 0,1:03:13.23,1:03:16.80,Default,,0000,0000,0000,,So we use this algorithm for inverted hovering, Dialogue: 0,1:03:16.80,1:03:20.33,Default,,0000,0000,0000,,and again policy search algorithms tend to work well when you can find Dialogue: 0,1:03:20.33,1:03:23.74,Default,,0000,0000,0000,,a reasonably simple policy mapping from Dialogue: 0,1:03:23.74,1:03:28.64,Default,,0000,0000,0000,,the states to the actions. This is sort of especially true for the low level control tasks, Dialogue: 0,1:03:28.64,1:03:32.70,Default,,0000,0000,0000,,which I think of as sort of reflexes almost. Dialogue: 0,1:03:32.70,1:03:33.80,Default,,0000,0000,0000,,Conversely, Dialogue: 0,1:03:33.80,1:03:37.27,Default,,0000,0000,0000,,if you want to solve a problem like Tetris where you might plan Dialogue: 0,1:03:37.27,1:03:40.50,Default,,0000,0000,0000,,ahead a few steps about what's a nice configuration of the board, or Dialogue: 0,1:03:40.50,1:03:42.29,Default,,0000,0000,0000,,something like a game of chess, Dialogue: 0,1:03:42.29,1:03:46.72,Default,,0000,0000,0000,,or problems of long path planning of go here, then go there, then go there, Dialogue: 0,1:03:46.72,1:03:51.07,Default,,0000,0000,0000,,then sometimes you might apply a value function method instead. Dialogue: 0,1:03:51.07,1:03:55.06,Default,,0000,0000,0000,,But for tasks like helicopter flight, for Dialogue: 0,1:03:55.06,1:03:58.57,Default,,0000,0000,0000,,low level control tasks, for the reflexes of driving or controlling Dialogue: 0,1:04:00.75,1:04:05.70,Default,,0000,0000,0000,,various robots, policy search algorithms - Dialogue: 0,1:04:05.70,1:04:10.63,Default,,0000,0000,0000,,where we can sometimes more easily approximate the policy directly - work very well. Dialogue: 0,1:04:11.84,1:04:15.60,Default,,0000,0000,0000,,Some [inaudible] the state of RL today. RL algorithms are applied Dialogue: 0,1:04:15.60,1:04:18.50,Default,,0000,0000,0000,,to a wide range of problems, Dialogue: 0,1:04:18.50,1:04:22.24,Default,,0000,0000,0000,,and the key is really sequential decision making. The place where you Dialogue: 0,1:04:22.24,1:04:25.57,Default,,0000,0000,0000,,think about applying a reinforcement learning algorithm is when you need to make a decision, Dialogue: 0,1:04:25.57,1:04:28.21,Default,,0000,0000,0000,,then another decision, then another decision, Dialogue: 0,1:04:28.21,1:04:31.89,Default,,0000,0000,0000,,and some of your actions may have long-term consequences.
I think that Dialogue: 0,1:04:31.89,1:04:37.91,Default,,0000,0000,0000,,is the heart of RL's sequential decision making, where you make multiple decisions, Dialogue: 0,1:04:37.91,1:04:41.60,Default,,0000,0000,0000,,and some of your actions may have long-term consequences. I've shown you Dialogue: 0,1:04:41.60,1:04:44.17,Default,,0000,0000,0000,,a bunch of robotics examples. RL Dialogue: 0,1:04:44.17,1:04:45.57,Default,,0000,0000,0000,,is also Dialogue: 0,1:04:45.57,1:04:48.76,Default,,0000,0000,0000,,applied to thinks like medical decision making, where you may have a patient and you Dialogue: 0,1:04:48.76,1:04:51.88,Default,,0000,0000,0000,,want to choose a sequence of treatments, and you do this now for the patient, and the Dialogue: 0,1:04:51.88,1:04:56.45,Default,,0000,0000,0000,,patient may be in some other state, and you choose to do that later, and so on. Dialogue: 0,1:04:56.45,1:05:00.84,Default,,0000,0000,0000,,It turns out there's a large community of people applying these sorts of tools to queues. So Dialogue: 0,1:05:00.84,1:05:04.54,Default,,0000,0000,0000,,imagine you have a bank where you have people lining up, and Dialogue: 0,1:05:04.54,1:05:06.01,Default,,0000,0000,0000,,after they go to one cashier, Dialogue: 0,1:05:06.01,1:05:09.74,Default,,0000,0000,0000,,some of them have to go to the manager to deal with something else. You have Dialogue: 0,1:05:09.74,1:05:13.41,Default,,0000,0000,0000,,a system of multiple people standing in line in multiple queues, and so how do you route Dialogue: 0,1:05:13.41,1:05:17.84,Default,,0000,0000,0000,,people optimally to minimize the waiting Dialogue: 0,1:05:17.84,1:05:20.24,Default,,0000,0000,0000,,time. And not just people, but Dialogue: 0,1:05:20.24,1:05:23.97,Default,,0000,0000,0000,,objects in an assembly line and so on. It turns out there's a surprisingly large community Dialogue: 0,1:05:23.97,1:05:26.35,Default,,0000,0000,0000,,working on optimizing queues. Dialogue: 0,1:05:26.35,1:05:29.72,Default,,0000,0000,0000,,I mentioned game playing a little bit already. Things Dialogue: 0,1:05:29.72,1:05:33.29,Default,,0000,0000,0000,,like financial decision making, if you have a large amount of stock, Dialogue: 0,1:05:33.29,1:05:37.57,Default,,0000,0000,0000,,how do you sell off a large amount - how do you time the selling off of your stock Dialogue: 0,1:05:37.57,1:05:41.33,Default,,0000,0000,0000,,so as to not affect market prices adversely too much? There Dialogue: 0,1:05:41.33,1:05:44.95,Default,,0000,0000,0000,,are many operations research problems, things like factory automation. Can you design Dialogue: 0,1:05:44.95,1:05:46.37,Default,,0000,0000,0000,,a factory Dialogue: 0,1:05:46.37,1:05:50.14,Default,,0000,0000,0000,,to optimize throughput, or minimize cost, or whatever. These are all Dialogue: 0,1:05:50.14,1:05:54.05,Default,,0000,0000,0000,,sorts of problems that people are applying reinforcement learning algorithms to. Dialogue: 0,1:05:54.05,1:05:57.53,Default,,0000,0000,0000,,Let me just close with a few robotics examples because they're always fun, and we just Dialogue: 0,1:05:57.53,1:05:59.19,Default,,0000,0000,0000,,have these videos. Dialogue: 0,1:06:00.19,1:06:07.26,Default,,0000,0000,0000,,This video was a work of Ziko Coulter and Peter Abiel, which is a PhD student Dialogue: 0,1:06:07.26,1:06:13.76,Default,,0000,0000,0000,,here. They were working getting a robot dog to climb over difficult rugged Dialogue: 0,1:06:13.76,1:06:16.96,Default,,0000,0000,0000,,terrain. 
Using a reinforcement learning algorithm, Dialogue: 0,1:06:16.96,1:06:19.98,Default,,0000,0000,0000,,they applied an approach that's Dialogue: 0,1:06:19.98,1:06:24.15,Default,,0000,0000,0000,,similar to a value function approximation approach, not quite but similar. Dialogue: 0,1:06:24.15,1:06:28.85,Default,,0000,0000,0000,,They allowed the robot dog to sort of plan ahead multiple steps, and Dialogue: 0,1:06:28.85,1:06:33.100,Default,,0000,0000,0000,,carefully choose his footsteps and traverse rugged terrain. Dialogue: 0,1:06:33.100,1:06:43.34,Default,,0000,0000,0000,,This is really state of the art in terms of what can you get a robotic dog to do. Dialogue: 0,1:06:43.34,1:06:45.57,Default,,0000,0000,0000,,Here's another fun one. Dialogue: 0,1:06:45.57,1:06:48.01,Default,,0000,0000,0000,,It turns out that Dialogue: 0,1:06:48.01,1:06:51.26,Default,,0000,0000,0000,,wheeled robots are very fuel-efficient. Cars and trucks are the Dialogue: 0,1:06:51.26,1:06:55.25,Default,,0000,0000,0000,,most fuel-efficient robots in the world almost. Dialogue: 0,1:06:55.25,1:06:57.26,Default,,0000,0000,0000,,Cars and trucks are very fuel-efficient, Dialogue: 0,1:06:57.26,1:07:01.31,Default,,0000,0000,0000,,but the bigger robots have the ability to traverse more rugged terrain. Dialogue: 0,1:07:01.31,1:07:04.60,Default,,0000,0000,0000,,So this is a robot - this is actually a small scale mockup of a larger Dialogue: 0,1:07:04.60,1:07:06.52,Default,,0000,0000,0000,,vehicle built by Lockheed Martin, Dialogue: 0,1:07:06.52,1:07:09.70,Default,,0000,0000,0000,,but can you come up with a vehicle that Dialogue: 0,1:07:09.70,1:07:13.69,Default,,0000,0000,0000,,has wheels and has the fuel efficiency of wheeled robots, but also Dialogue: 0,1:07:13.69,1:07:19.10,Default,,0000,0000,0000,,has legs so it can traverse obstacles. Dialogue: 0,1:07:19.10,1:07:27.35,Default,,0000,0000,0000,,Again, using a reinforcement algorithm to design a controller for this robot to make it traverse obstacles, and Dialogue: 0,1:07:27.35,1:07:31.32,Default,,0000,0000,0000,,somewhat complex gaits that would be very hard to design by hand, Dialogue: 0,1:07:31.32,1:07:33.28,Default,,0000,0000,0000,,but by choosing a reward function, tell the robot Dialogue: 0,1:07:33.28,1:07:36.79,Default,,0000,0000,0000,,this is a plus one reward that's top of the goal, Dialogue: 0,1:07:36.79,1:07:45.47,Default,,0000,0000,0000,,and a few other things, it learns these sorts of policies automatically. Dialogue: 0,1:07:46.52,1:07:55.23,Default,,0000,0000,0000,,Last couple fun ones - I'll show you a couple last helicopter videos. Dialogue: 0,1:07:57.03,1:08:08.70,Default,,0000,0000,0000,,This is the work of again PhD students here, Peter Abiel and Adam Coates Dialogue: 0,1:08:08.77,1:08:33.32,Default,,0000,0000,0000,,who have been working on autonomous flight. I'll just let you watch. I'll just Dialogue: 0,1:09:49.34,1:09:53.90,Default,,0000,0000,0000,,show you one more. [Inaudible] do this with a real helicopter [inaudible]? Dialogue: 0,1:09:53.90,1:09:55.20,Default,,0000,0000,0000,,Not a full-size helicopter. Dialogue: 0,1:09:55.20,1:10:00.86,Default,,0000,0000,0000,,Only small radio control helicopters. [Inaudible]. Full-size helicopters Dialogue: 0,1:10:00.86,1:10:05.78,Default,,0000,0000,0000,,are the wrong design for this. You shouldn't do this on a full-size helicopter. Dialogue: 0,1:10:05.78,1:10:12.62,Default,,0000,0000,0000,,Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more. 
Dialogue: 0,1:10:12.62,1:10:18.92,Default,,0000,0000,0000,,Are there any human [inaudible]? Yes, Dialogue: 0,1:10:18.92,1:10:25.92,Default,,0000,0000,0000,,there are very good human pilots that can. Dialogue: 0,1:10:28.42,1:10:35.42,Default,,0000,0000,0000,,This is just one more Dialogue: 0,1:10:43.74,1:10:45.51,Default,,0000,0000,0000,,maneuver. That was kind of fun. So this is the work of Dialogue: 0,1:10:45.51,1:10:55.53,Default,,0000,0000,0000,,Peter Abiel and Adam Coates. So that was it. That was all the technical material Dialogue: 0,1:10:55.53,1:11:02.53,Default,,0000,0000,0000,,I wanted to present in this class. I guess Dialogue: 0,1:11:03.11,1:11:10.20,Default,,0000,0000,0000,,you're all experts on machine learning now. Congratulations. Dialogue: 0,1:11:10.20,1:11:14.21,Default,,0000,0000,0000,,And as I hope you've got the sense through this class that this is one of the technologies Dialogue: 0,1:11:14.21,1:11:18.97,Default,,0000,0000,0000,,that's really having a huge impact on science in engineering and industry. Dialogue: 0,1:11:18.97,1:11:23.13,Default,,0000,0000,0000,,As I said in the first lecture, I think many people use machine Dialogue: 0,1:11:23.13,1:11:30.13,Default,,0000,0000,0000,,learning algorithms dozens of times a day without even knowing about it. Dialogue: 0,1:11:30.39,1:11:34.11,Default,,0000,0000,0000,,Based on the projects you've done, I hope that Dialogue: 0,1:11:34.11,1:11:37.90,Default,,0000,0000,0000,,most of you will be able to imagine yourself Dialogue: 0,1:11:37.90,1:11:43.28,Default,,0000,0000,0000,,going out after this class and applying these things to solve a variety of problems. Dialogue: 0,1:11:43.28,1:11:45.98,Default,,0000,0000,0000,,Hopefully, some of you will also imagine yourselves Dialogue: 0,1:11:45.98,1:11:48.99,Default,,0000,0000,0000,,writing research papers after this class, be it on Dialogue: 0,1:11:48.99,1:11:52.44,Default,,0000,0000,0000,,a novel way to do machine learning, or on some way of applying machine Dialogue: 0,1:11:52.44,1:11:55.29,Default,,0000,0000,0000,,learning to a problem that you care Dialogue: 0,1:11:55.29,1:11:58.75,Default,,0000,0000,0000,,about. In fact, looking at project milestones, I'm actually sure that a large fraction of Dialogue: 0,1:11:58.75,1:12:04.30,Default,,0000,0000,0000,,the projects in this class will be publishable, so I think that's great. Dialogue: 0,1:12:06.68,1:12:10.75,Default,,0000,0000,0000,,I guess many of you will also go industry, make new products, and make lots Dialogue: 0,1:12:10.75,1:12:12.66,Default,,0000,0000,0000,,of money using learning Dialogue: 0,1:12:12.66,1:12:15.58,Default,,0000,0000,0000,,algorithms. Remember me Dialogue: 0,1:12:15.58,1:12:19.53,Default,,0000,0000,0000,,if that happens. 
One of the things I'm excited about is through research or Dialogue: 0,1:12:19.53,1:12:22.57,Default,,0000,0000,0000,,industry, I'm actually completely sure that the people in this class in the Dialogue: 0,1:12:22.57,1:12:23.85,Default,,0000,0000,0000,,next few months Dialogue: 0,1:12:23.85,1:12:27.32,Default,,0000,0000,0000,,will apply machine learning algorithms to lots of problems in Dialogue: 0,1:12:27.32,1:12:31.77,Default,,0000,0000,0000,,industrial management, and computer science, things like optimizing computer architectures, Dialogue: 0,1:12:31.77,1:12:33.34,Default,,0000,0000,0000,,network security, Dialogue: 0,1:12:33.34,1:12:38.48,Default,,0000,0000,0000,,robotics, computer vision, to problems in computational biology, Dialogue: 0,1:12:39.14,1:12:42.43,Default,,0000,0000,0000,,to problems in aerospace, or understanding natural language, Dialogue: 0,1:12:42.43,1:12:44.31,Default,,0000,0000,0000,,and many more things like that. Dialogue: 0,1:12:44.31,1:12:46.03,Default,,0000,0000,0000,,So Dialogue: 0,1:12:46.03,1:12:49.92,Default,,0000,0000,0000,,right now I have no idea what all of you are going to do with the learning algorithms you learned about, Dialogue: 0,1:12:49.92,1:12:51.54,Default,,0000,0000,0000,,but Dialogue: 0,1:12:51.54,1:12:54.81,Default,,0000,0000,0000,,every time as I wrap up this class, I always Dialogue: 0,1:12:54.81,1:12:58.21,Default,,0000,0000,0000,,feel very excited, and optimistic, and hopeful about the sorts of amazing Dialogue: 0,1:12:58.21,1:13:03.13,Default,,0000,0000,0000,,things you'll be able to do. Dialogue: 0,1:13:03.13,1:13:10.49,Default,,0000,0000,0000,,One final thing, I'll just give out this handout. Dialogue: 0,1:13:30.75,1:13:35.17,Default,,0000,0000,0000,,One final thing is that machine learning has grown out of a larger literature on the AI Dialogue: 0,1:13:35.17,1:13:36.42,Default,,0000,0000,0000,,where Dialogue: 0,1:13:36.42,1:13:41.83,Default,,0000,0000,0000,,this desire to build systems that exhibit intelligent behavior and machine learning Dialogue: 0,1:13:41.83,1:13:44.89,Default,,0000,0000,0000,,is one of the tools of AI, maybe one that's had a Dialogue: 0,1:13:44.89,1:13:46.51,Default,,0000,0000,0000,,disproportionately large impact, Dialogue: 0,1:13:46.51,1:13:51.52,Default,,0000,0000,0000,,but there are many other ideas in AI that I hope you Dialogue: 0,1:13:51.52,1:13:55.29,Default,,0000,0000,0000,,go and continue to learn about. Fortunately, Stanford Dialogue: 0,1:13:55.29,1:13:59.28,Default,,0000,0000,0000,,has one of the best and broadest sets of AI classes, and I hope that Dialogue: 0,1:13:59.28,1:14:03.69,Default,,0000,0000,0000,,you take advantage of some of these classes, and go and learn more about AI, and more Dialogue: 0,1:14:03.69,1:14:07.69,Default,,0000,0000,0000,,about other fields which often apply learning algorithms to problems in vision, Dialogue: 0,1:14:07.69,1:14:10.48,Default,,0000,0000,0000,,problems in natural language processing in robotics, and so on. Dialogue: 0,1:14:10.48,1:14:14.96,Default,,0000,0000,0000,,So the handout I just gave out has a list of AI related courses. Just Dialogue: 0,1:14:14.96,1:14:18.49,Default,,0000,0000,0000,,running down very quickly, I guess, CS221 is an overview that I Dialogue: 0,1:14:18.49,1:14:19.34,Default,,0000,0000,0000,,teach. 
There Dialogue: 0,1:14:19.34,1:14:23.70,Default,,0000,0000,0000,,are a lot of robotics classes also: 223A, Dialogue: 0,1:14:23.70,1:14:25.14,Default,,0000,0000,0000,,225A, 225B Dialogue: 0,1:14:25.14,1:14:28.76,Default,,0000,0000,0000,,- many robotics class. There are so many applications Dialogue: 0,1:14:28.76,1:14:31.100,Default,,0000,0000,0000,,of learning algorithms to robotics today. 222 Dialogue: 0,1:14:31.100,1:14:36.37,Default,,0000,0000,0000,,and 227 are knowledge representation and reasoning classes. Dialogue: 0,1:14:36.37,1:14:39.92,Default,,0000,0000,0000,,228 - of all the classes on this list, 228, which Daphne Dialogue: 0,1:14:39.92,1:14:43.48,Default,,0000,0000,0000,,Koller teaches, is probably closest in spirit to 229. This is one of Dialogue: 0,1:14:43.48,1:14:45.30,Default,,0000,0000,0000,,the classes I highly recommend Dialogue: 0,1:14:45.30,1:14:48.60,Default,,0000,0000,0000,,to all of my PhD students as well. Dialogue: 0,1:14:48.60,1:14:53.30,Default,,0000,0000,0000,,Many other problems also touch on machine learning. On the next page, Dialogue: 0,1:14:54.04,1:14:57.76,Default,,0000,0000,0000,,courses on computer vision, speech recognition, natural language processing, Dialogue: 0,1:14:57.76,1:15:03.20,Default,,0000,0000,0000,,various tools that aren't just machine learning, but often involve machine learning in many ways. Dialogue: 0,1:15:03.20,1:15:07.34,Default,,0000,0000,0000,,Other aspects of AI, multi-agent systems taught by [inaudible]. Dialogue: 0,1:15:09.63,1:15:13.14,Default,,0000,0000,0000,,EE364A is convex optimization. It's a class taught by Dialogue: 0,1:15:13.14,1:15:16.63,Default,,0000,0000,0000,,Steve Boyd, and convex optimization came up many times in this class. If you Dialogue: 0,1:15:16.63,1:15:20.91,Default,,0000,0000,0000,,want to become really good at it, EE364 is a great class. If you're Dialogue: 0,1:15:20.91,1:15:24.58,Default,,0000,0000,0000,,interested in project courses, I also teach a project class Dialogue: 0,1:15:24.58,1:15:27.18,Default,,0000,0000,0000,,next quarter where we spend the whole quarter working Dialogue: 0,1:15:27.18,1:15:31.51,Default,,0000,0000,0000,,on research projects. Dialogue: 0,1:15:31.51,1:15:37.59,Default,,0000,0000,0000,,So I hope you go and take some more of those classes. Dialogue: 0,1:15:37.59,1:15:40.38,Default,,0000,0000,0000,,In closing, Dialogue: 0,1:15:40.38,1:15:42.85,Default,,0000,0000,0000,,let me just say Dialogue: 0,1:15:42.85,1:15:46.88,Default,,0000,0000,0000,,this class has been really fun to teach, and it's very satisfying to Dialogue: 0,1:15:46.88,1:15:52.01,Default,,0000,0000,0000,,me personally when we set these insanely difficult hallmarks, and Dialogue: 0,1:15:52.01,1:15:54.65,Default,,0000,0000,0000,,then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's Dialogue: 0,1:15:54.65,1:15:58.57,Default,,0000,0000,0000,,actually very satisfying when I see that. Dialogue: 0,1:15:58.57,1:16:04.74,Default,,0000,0000,0000,,Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable," Dialogue: 0,1:16:04.74,1:16:05.47,Default,,0000,0000,0000,,So Dialogue: 0,1:16:05.47,1:16:09.04,Default,,0000,0000,0000,,I hope you take what you've learned, go forth, and Dialogue: 0,1:16:09.04,1:16:11.57,Default,,0000,0000,0000,,do amazing things with learning algorithms. 
Dialogue: 0,1:16:11.57,1:16:16.22,Default,,0000,0000,0000,,I know this is a heavy workload class, so thank you all very much for the hard Dialogue: 0,1:16:16.22,1:16:20.28,Default,,0000,0000,0000,,work you've put into this class, and the hard work you've put into learning this material, Dialogue: 0,1:16:20.28,1:16:21.70,Default,,0000,0000,0000,,and Dialogue: 0,1:16:21.70,1:16:23.39,Default,,0000,0000,0000,,thank you very much for having been students in this class.