Lecture 20 | Machine Learning (Stanford)

  • 0:11 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development.
  • 0:24 - 0:28
    So welcome to the last lecture of this course.
  • 0:28 - 0:33
    What I want to do today is tell you about one final class of reinforcement
  • 0:33 - 0:35
    learning algorithms. I
  • 0:35 - 0:39
    just want to say a little bit about POMDPs, the partially observable MDPs, and then
  • 0:39 - 0:44
    the main technical topic for today will be policy search algorithms.
  • 0:44 - 0:49
    I'll talk about two specific algorithms, essentially one called reinforce and one called Pegasus,
  • 0:49 - 0:55
    and then we'll wrap up the class. So if you recall from the last lecture,
  • 0:55 - 1:00
    I actually started to talk about one specific example of a POMDP,
  • 1:00 - 1:07
    which was this sort of linear dynamical
  • 1:07 - 1:11
    system. This is sort of the LQR, linear quadratic regulation problem,
  • 1:11 - 1:24
    but I changed it and said what if we only have observations
  • 1:24 - 1:28
    YT. And what if we couldn't observe the state of the system directly,
  • 1:28 - 1:37
    but had to choose an action only based on some noisy observations that may be some function of the state?
  • 1:37 - 1:43
    So our strategy last time was that we said that in the fully observable case,
  • 1:43 - 1:55
    we could choose actions - AT equal to that matrix LT times ST. So LT was this
  • 1:55 - 1:59
    matrix of parameters that [inaudible] describe the dynamic programming
  • 1:59 - 2:02
    algorithm for finite horizon MDPs in the LQR problem.
  • 2:02 - 2:11
    And so we said if only we knew what the state was, we choose actions according to some matrix LT times the state.
  • 2:11 - 2:14
    And then I said in the partially observable case,
  • 2:14 - 2:20
    we would compute these estimates.
  • 2:20 - 2:27
    I wrote them as S of T given T,
  • 2:30 - 2:37
    which were our best estimate for what the state is given all the observations. And in particular,
  • 2:37 - 2:41
    I'm gonna talk about a Kalman filter
  • 2:41 - 2:49
    with which we worked out that our posterior distribution of what the state is
  • 2:49 - 2:53
    given all the observations up to a certain time
  • 2:53 - 2:59
    that was this. So this is from last time. So that
  • 2:59 - 3:03
    given the observations Y one through YT, our posterior distribution
  • 3:03 - 3:06
    of the current state ST was Gaussian
  • 3:06 - 3:10
    with mean ST given T and covariance sigma T given T.
  • 3:10 - 3:14
    So I said we use a Kalman filter to compute this thing, this ST given T,
  • 3:14 - 3:20
    which is going to be our best guess for what the state is currently.
  • 3:20 - 3:31
    And then we choose actions using
  • 3:31 - 3:34
    our estimate for what the state is, rather than using the true state because we
  • 3:34 - 3:38
    don't know the true state anymore in this POMDP.
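
A minimal sketch of the strategy just described, assuming the standard linear-Gaussian model s_{t+1} = A s_t + B a_t + w_t and y_t = C s_t + v_t; the matrices A, B, C, the noise covariances Q and R, and the gains L_t are placeholders rather than values from the lecture:

    import numpy as np

    def kalman_filter_step(s_hat, Sigma, a_prev, y, A, B, C, Q, R):
        # Predict: push the previous estimate s_{t-1|t-1} through the assumed linear dynamics.
        s_pred = A @ s_hat + B @ a_prev
        Sigma_pred = A @ Sigma @ A.T + Q
        # Correct: fold in the new noisy observation y_t to get s_{t|t} and Sigma_{t|t}.
        K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + R)
        s_new = s_pred + K @ (y - C @ s_pred)
        Sigma_new = (np.eye(len(s_new)) - K @ C) @ Sigma_pred
        return s_new, Sigma_new

    def choose_action(L_t, s_hat):
        # Act as if the estimate were the true state: a_t = L_t s_{t|t}.
        return L_t @ s_hat
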
  • 3:38 - 3:40
    So
  • 3:40 - 3:47
    it turns out that this specific strategy actually allows you to choose optimal actions,
  • 3:47 - 3:52
    allows you to choose actions as well as you possibly can given that this is a
  • 3:52 - 3:55
    POMDP, and given there are these noisy observations.
  • 3:55 - 3:59
    It turns out that in general finding optimal policies with POMDPs -
  • 3:59 - 4:06
    finding optimal policies for these sorts of partially observable MDPs is an NP-hard problem.
  • 4:06 - 4:16
    Just to be concrete about the formalism of the POMDP - I should just write it here -
  • 4:16 - 4:32
    a POMDP formally is a tuple like that
  • 4:32 - 4:43
    where the changes are that Y is the set of possible observations,
  • 4:45 - 4:54
    and this O subscript S are the observation distributions.
  • 4:58 - 5:11
    And so at each step, we observe
  • 5:18 - 5:20
    - at each step in the POMDP, if we're
  • 5:20 - 5:24
    in some state ST, we observe some observation YT
  • 5:24 - 5:28
    drawn from the observation distribution O subscript ST, so it's
  • 5:28 - 5:30
    indexed by what the current state is.
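
The tuple itself was written on the board rather than spoken; a plausible reconstruction, keeping the usual MDP elements and adding the two changes named above, is

    \[
    \text{POMDP} = (S, A, Y, \{P_{sa}\}, \{O_s\}, T, R), \qquad y_t \sim O_{s_t}.
    \]
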
  • 5:30 - 5:31
    And
  • 5:31 - 5:36
    it turns out that computing the optimal policy in a POMDP is an NP-hard problem.
  • 5:36 - 5:41
    For the specific case of linear dynamical systems with the Kalman filter
  • 5:41 - 5:46
    model, we have this strategy of computing the optimal policy assuming full observability
  • 5:46 - 5:49
    and then estimating the states from the observations,
  • 5:49 - 5:51
    and then plugging the two together.
  • 5:51 - 5:56
    That turns out to be optimal essentially for only that special case of a POMDP.
  • 5:56 - 5:58
    In the more general case,
  • 5:58 - 6:03
    that strategy of designing a controller assuming full observability
  • 6:03 - 6:06
    and then just estimating the state and plugging the two together,
  • 6:06 - 6:08
    for general POMDPs
  • 6:08 - 6:15
    that same strategy is often a very reasonable strategy but is not always guaranteed to be optimal.
  • 6:15 - 6:21
    Solving these problems in general is NP-hard.
  • 6:21 - 6:22
    So what I want to do today
  • 6:22 - 6:26
    is actually
  • 6:26 - 6:29
    talk about a different class of reinforcement learning algorithms. These are called
  • 6:29 - 6:38
    policy search algorithms. In particular, policy search algorithms
  • 6:38 - 6:44
    can be applied equally well to MDPs, to fully observed Markov decision processes,
  • 6:44 - 6:46
    or to these POMDPs,
  • 6:46 - 6:50
    or to these partially observable MDPs.
  • 6:50 - 6:58
    What I want to do now, I'll actually just describe policy search algorithms applied to MDPs, applied to the fully observable case.
  • 6:58 - 7:02
    And at the end, I'll just briefly describe how you can take policy search algorithms and apply them to POMDPs. In
  • 7:02 - 7:05
    the latter case, when
  • 7:05 - 7:10
    you apply a policy search algorithm to a POMDP, it's
  • 7:10 - 7:15
    going to be hard to guarantee that you get the globally optimal policy because
  • 7:15 - 7:21
    solving POMDPs in general is NP-hard, but nonetheless policy search algorithms - it turns out to be
  • 7:21 - 7:23
    I think one of the most
  • 7:23 - 7:31
    effective classes of reinforcement learning algorithms, both for MDPs and for POMDPs.
  • 7:31 - 7:34
    So
  • 7:34 - 7:37
    here's what we're going to do.
  • 7:37 - 7:41
    In policy search,
  • 7:41 - 7:48
    we're going to define some set
  • 7:48 - 7:55
    which I denote capital pi of policies,
  • 7:55 - 8:09
    and our strategy is to search for a good policy, lowercase pi, in the set capital Pi.
  • 8:11 - 8:15
    Just by analogy, I want to say
  • 8:15 - 8:19
    - in the same way, back when we were talking about supervised learning,
  • 8:19 - 8:24
    the way we define the set capital Pi of policies and then search for a policy in this
  • 8:24 - 8:31
    set capital pi is analogous to supervised learning
  • 8:31 - 8:46
    where we defined a set script H of hypotheses and search -
  • 8:50 - 8:58
    and would search for a good hypothesis in this hypothesis class script H.
  • 8:58 - 9:03
    Policy search is sometimes also called direct policy search.
  • 9:03 - 9:07
    To contrast this with the sort of algorithms we've been talking about so
  • 9:07 - 9:11
    far, in all the algorithms we've been talking about so far, we would try to find V star. We would try to
  • 9:11 - 9:14
    find the optimal value function.
  • 9:14 - 9:19
    And then we'd use V star - we'd use the optimal value function to then try to compute or
  • 9:19 - 9:20
    try to approximate pi star.
  • 9:20 - 9:24
    So all the approaches we talked about previously are
  • 9:24 - 9:30
    strategies for finding a good policy: once we compute the value function, we go from that to the policy.
  • 9:30 - 9:35
    In contrast, in policy search algorithms and something that's called direct policy search algorithms,
  • 9:35 - 9:40
    the idea is that we're going to quote "directly" try to approximate a good policy
  • 9:40 - 9:48
    without going through the intermediate stage of trying to find the value function.
  • 9:48 - 9:50
    Let's see.
  • 9:50 - 9:51
    And also
  • 9:51 - 9:58
    as I develop policy search - just one step that's sometimes slightly confusing.
  • 9:58 - 10:03
    Making an analogy to supervised learning again, when we talked about logistic regression,
  • 10:03 - 10:09
    I said we have input features X and some labels Y, and I sort of said
  • 10:09 - 10:13
    let's approximate Y using the logistic function of the inputs X. And
  • 10:13 - 10:18
    at least initially, the logistic function was sort of pulled out of the air.
  • 10:18 - 10:23
    In the same way, as I define policy search algorithms, there'll sort of be a step where I say,
  • 10:23 - 10:29
    "Well, let's try to compute the actions. Let's try to approximate what a good action is
  • 10:29 - 10:33
    using a logistic function of the state." So again, I'll sort of pull a
  • 10:33 - 10:37
    function out of the air. I'll say, "Let's just choose a function, and
  • 10:37 - 10:41
    that'll be our choice of the policy class," and I'll say,
  • 10:41 - 10:45
    "Let's take as input the state, and then we'll map it through a logistic function, and then hopefully, we'll approximate
  • 10:45 - 10:47
    what is a good function -
  • 10:47 - 10:48
    excuse me,
  • 10:48 - 10:53
    we'll approximate what is a good action using a logistic function of the state."
  • 10:53 - 10:54
    So there's that sort of -
  • 10:54 - 10:59
    the choice of the policy class that's again a little bit arbitrary,
  • 10:59 - 11:06
    but it's arbitrary as it was when we were talking about supervised learning.
  • 11:06 - 11:17
    So to develop our first policy search algorithm, I'm actually gonna need a new definition.
  • 11:56 - 12:03
    So
  • 12:07 - 12:13
    our first policy search algorithm, we'll actually need to work with stochastic policies.
  • 12:13 - 12:18
    What I mean by stochastic policy is there's going to be a function that maps from
  • 12:18 - 12:23
    the space of states cross actions to the real numbers,
  • 12:23 - 12:25
    where pi of S comma A will be
  • 12:25 - 12:31
    interpreted as the probability of taking this action A in some state S.
  • 12:31 - 12:46
    And so we have that the sum over A of pi of S comma A equals one. In other words, for every state
  • 12:46 - 12:54
    a stochastic policy specifies a probability distribution over the actions.
  • 12:55 - 12:59
    So concretely,
  • 12:59 - 13:07
    suppose you are executing some policy pi. Say I have some stochastic policy pi. I wanna execute the policy pi.
  • 13:07 - 13:12
    What that means is that - in this example let's say I have three actions.
  • 13:12 - 13:16
    What that means is that suppose I'm in some state S.
  • 13:16 - 13:25
    I would then compute pi of S comma A1, pi of S comma A2,
  • 13:25 - 13:29
    pi of S comma A3, if I have a three action MDP.
  • 13:29 - 13:34
    These will be three numbers that sum up to one,
  • 13:34 - 13:37
    and then my chance of taking action A1 will be
  • 13:37 - 13:41
    equal to this. My chance of taking action A2 will be equal to pi of S comma A2.
  • 13:41 - 13:44
    My chance of taking action A3 will be equal
  • 13:44 - 13:50
    to this number. So that's what it means to execute a stochastic policy.
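
As a minimal sketch of that sampling step, assuming pi(s, a) is available as a function and the actions are collected in a list (the names here are placeholders):

    import numpy as np

    def execute_stochastic_policy(pi, s, actions, rng=None):
        rng = rng or np.random.default_rng()
        probs = [pi(s, a) for a in actions]   # pi(s, a1), pi(s, a2), pi(s, a3); these sum to one
        return actions[rng.choice(len(actions), p=probs)]
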
  • 13:50 - 13:51
    So
  • 13:51 - 13:54
    as a concrete example, just let me make this
  • 14:03 - 14:07
    let me just go ahead and give one specific example of what a stochastic
  • 14:07 - 14:10
    policy may look like.
  • 14:10 - 14:14
    For this example, I'm gonna use the inverted pendulum
  • 14:14 - 14:20
    as my motivating example. It's that problem of balancing a pole. We have an
  • 14:20 - 14:21
    inverted
  • 14:21 - 14:26
    pendulum that swings freely, and you want to move the cart left and
  • 14:26 - 14:29
    right to keep the pole vertical.
  • 14:29 - 14:35
    Let's say my actions -
  • 14:35 - 14:44
    for today's example, I'm gonna use phi to denote the angle of the pole.
  • 14:44 - 14:50
    I have two actions where A1 is to accelerate left
  • 14:50 - 15:03
    and A2 is to accelerate right. Actually, let me just write that the other way around. A1 is to accelerate right. A2 is to accelerate left.
  • 15:03 - 15:05
    So
  • 15:05 - 15:06
    let's see.
  • 15:06 - 15:10
    Choose a reward function that penalizes the pole falling over, or whatever.
  • 15:10 - 15:14
    And now let's come up with a stochastic policy for this problem.
  • 15:15 - 15:18
    To come up with a class of stochastic policies
  • 15:18 - 15:20
  • 15:20 - 15:24
    really means coming up with some class of functions to approximate what action you
  • 15:24 - 15:27
    want to take as a function of the state.
  • 15:27 - 15:30
    So here's my somewhat arbitrary choice. I'm gonna say that
  • 15:30 - 15:35
    the probability of action A1, so pi of S comma A1,
  • 15:35 - 15:40
    I'm gonna write as one over one plus E to the minus theta transpose S - okay?
  • 15:40 - 15:46
    And I just chose the logistic function because it's a convenient function we've used a lot. So I'm gonna say that
  • 15:46 - 15:50
    my policy is parameterized by a set of parameters theta,
  • 15:50 - 15:53
    and for any given set of parameters theta,
  • 15:53 - 15:56
    that gives me a stochastic policy.
  • 15:56 - 15:59
    And if I'm executing that policy with parameters theta, that means
  • 15:59 - 16:06
    that the chance of my choosing to a set of [inaudible] is given by this number.
  • 16:18 - 16:26
    Because my chances of executing actions A1 or A2 must sum to one, this gives me pi of S A2. So just
  • 16:26 - 16:29
    [inaudible], this means that when I'm in some state S,
  • 16:29 - 16:31
    I'm going to compute this number,
  • 16:31 - 16:34
    compute one over one plus E to the minus theta transpose S.
  • 16:34 - 16:40
    And then with this probability, I will execute the accelerate right action,
  • 16:40 - 16:46
    and with one minus this probability, I'll execute the accelerate left action.
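
Here is a sketch of that logistic parameterization for the two-action pendulum problem; the state vector s is assumed to already include the extra intercept feature mentioned just below, and the string action labels are only placeholders:

    import numpy as np

    def pi_right(theta, s):
        # pi_theta(s, a1): probability of accelerating right = 1 / (1 + e^{-theta^T s}).
        return 1.0 / (1.0 + np.exp(-theta @ s))

    def sample_action(theta, s, rng=None):
        rng = rng or np.random.default_rng()
        # Accelerate right with probability pi_theta(s, a1), otherwise accelerate left.
        return "right" if rng.random() < pi_right(theta, s) else "left"
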
  • 16:46 - 16:51
    And again, just to give you a sense of why this might be a reasonable thing to do,
  • 16:52 - 16:59
    let's say my state vector is - this
  • 17:02 - 17:05
    is [inaudible] state, and I added an extra one as an intercept term, just to give my
  • 17:05 - 17:09
    logistic function an extra feature.
  • 17:09 - 17:20
    If I choose my parameters and my policy to be say this,
  • 17:20 - 17:23
    then that means that at any state,
  • 17:23 - 17:34
    the probability of my taking action A1 - the probability of my taking the accelerate
  • 17:34 - 17:41
    right action is this one over one plus E to the minus theta transpose S, which taking the inner product
  • 17:41 - 17:48
    of theta and S, this just gives you phi,
  • 17:48 - 17:51
    equals one over one plus E to the minus phi.
  • 17:51 - 17:55
    And so if I choose my parameters theta as follows,
  • 17:55 - 18:01
    what that means is that
  • 18:01 - 18:08
    just depending on the angle phi of my inverted pendulum,
  • 18:09 - 18:12
    the chance of my accelerating to the
  • 18:12 - 18:15
    right is just this function of
  • 18:15 - 18:18
    the angle of my inverted pendulum. And so this means for example
  • 18:18 - 18:22
    that if my inverted pendulum is leaning far over to the right,
  • 18:22 - 18:26
    then I'm very likely to accelerate to the right to try to catch it. I
  • 18:26 - 18:30
    hope the physics of this inverted pendulum thing make
  • 18:30 - 18:33
    sense. If my pole's leaning over to the right, then I wanna accelerate to the right to catch it. And
  • 18:33 - 18:37
    conversely if phi is negative, it's leaning over to the left, and I'll accelerate to the left to try
  • 18:37 - 18:39
    to catch it.
  • 18:39 - 18:44
    So this is one example for one specific choice of parameters theta.
  • 18:44 - 18:49
    Obviously, this isn't a great policy because it ignores the rest of the features. Maybe if
  • 18:49 - 18:53
    the cart is further to the right, you want it to be less likely to accelerate to the
  • 18:53 - 18:56
    right, and you can capture that by changing
  • 18:56 - 18:58
    one of these coefficients to take into account the
  • 18:58 - 19:00
    actual position of the cart. And then depending on
  • 19:00 - 19:03
    the velocity of the cart and the angular velocity,
  • 19:03 - 19:08
    you might want to change theta to take into account these other effects as well. Maybe
  • 19:08 - 19:13
    if the pole's leaning far to the right, but is actually on its way to swinging back, as
  • 19:13 - 19:16
    specified by the angular velocity, then you might be
  • 19:16 - 19:19
    less worried about having to accelerate hard to the right. And so
  • 19:19 - 19:23
    these are the sorts of behavior you can get by varying the parameters theta.
  • 19:24 - 19:29
    And so our goal is to tune the parameters theta - our goal in policy search
  • 19:29 - 19:37
    is to tune the parameters theta so that when we execute the policy pi subscript theta,
  • 19:37 - 19:40
    the pole stays up as long as possible.
  • 19:40 - 19:48
    In other words, our goal is to maximize as a function of theta -
  • 19:48 - 20:05
    our goal is to maximize the expected value of the payoff
  • 20:05 - 20:10
    for when we execute the policy pi theta. We
  • 20:10 - 20:11
    want to choose parameters theta
  • 20:11 - 20:21
    to maximize that. Are there questions about the problem set up,
  • 20:21 - 20:26
    and policy search and policy classes or anything? Yeah. Student: In a case where we have
  • 20:26 - 20:33
    more than two actions, would we use a different theta for each of the distributions, or still have
  • 20:33 - 20:36
    the same parameters? Oh, yeah.
  • 20:36 - 20:39
    Right. So what if we have more than two actions. It turns out you can choose almost anything you want for the policy class, but
  • 20:39 - 20:44
    if you have, say, a fixed number of discrete actions,
  • 20:44 - 20:48
    I would sometimes use like a softmax parameterization.
  • 20:48 - 20:54
    Similar to softmax regression that we saw earlier in the class,
  • 20:54 - 20:56
    you may
  • 20:56 - 21:03
    say
  • 21:06 - 21:07
    that - [inaudible] out of space. You may have a set of parameters theta 1
  • 21:07 - 21:09
    through theta D if
  • 21:09 - 21:25
    you have D actions, and pi of S comma A I equals E to the theta I transpose S over the sum over J of E to the theta J transpose S -
  • 21:25 - 21:29
    so that would be an example of a softmax parameterization for multiple actions.
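
A sketch of that softmax parameterization, with one parameter vector theta_i per discrete action; the denominator (the sum over all actions) was only written on the board, so it is filled in here by analogy with softmax regression:

    import numpy as np

    def softmax_policy(thetas, s):
        # pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s).
        logits = np.array([theta_i @ s for theta_i in thetas])
        logits -= logits.max()        # subtract the max for numerical stability only
        probs = np.exp(logits)
        return probs / probs.sum()    # one probability per action, summing to one
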
  • 21:29 - 21:33
    It turns out that if you have continuous actions, you can actually make this be a
  • 21:33 - 21:34
    density
  • 21:34 - 21:36
    over the actions A
  • 21:36 - 21:38
    and parameterized by other things as well.
  • 21:38 - 21:43
    But the choice of policy class is somewhat up to you, in the same way that
  • 21:44 - 21:48
    the choice of whether we chose to use a linear function
  • 21:48 - 21:49
    or linear function with
  • 21:49 - 21:56
    quadratic features or whatever in supervised learning that was sort of up to us. Anything
  • 22:02 - 22:09
    else? Yeah. [Inaudible] stochastic?
  • 22:12 - 22:16
    Yes. So is it possible to [inaudible] a stochastic policy using numbers [inaudible]? I see. Given that MDP has stochastic transition
  • 22:16 - 22:18
    probabilities, is it possible to
  • 22:18 - 22:23
    use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities.
  • 22:23 - 22:24
    The answer is yes,
  • 22:24 - 22:26
    but
  • 22:26 - 22:29
    for the purposes of what I want to show later, that won't be useful.
  • 22:29 - 22:32
    But formally, it is possible.
  • 22:32 - 22:35
    If you already have a fixed - if you have a
  • 22:35 - 22:41
    fixed policy, then you'd be able to do that. Anything else?
  • 22:41 - 22:44
    Yeah. No, I guess even a [inaudible] class of policy can do that,
  • 22:44 - 22:50
    but for the derivation later, I actually need to keep it separate.
  • 22:50 - 22:55
    Actually, could you just - I know the concept of policy search is sometimes a little confusing. Could you just raise your hand
  • 22:55 - 23:02
    if this makes sense?
  • 23:02 - 23:05
    Okay. Thanks. So let's talk about
  • 23:05 - 23:09
    an algorithm. What I'm gonna talk about - the first algorithm I'm going to present
  • 23:10 - 23:14
    is sometimes called the reinforce algorithm. What I'm going to present it turns out
  • 23:14 - 23:19
    isn't exactly the reinforce algorithm as it was originally presented by the
  • 23:19 - 23:31
    author Ron Williams, but it sort of captures its essence. Here's the idea.
  • 23:31 - 23:37
    In the sequel - in what I'm about to do, I'm going to assume that S0 is
  • 23:37 - 23:40
    some fixed initial state.
  • 23:42 - 23:45
    Or
  • 23:45 - 23:49
    it turns out if S0 is drawn from some fixed initial state
  • 23:49 - 23:52
    distribution then everything else [inaudible], but let's just say S0 is some
  • 23:52 - 23:54
    fixed initial state.
  • 23:54 - 24:05
    So my goal is to maximize this expected sum
  • 24:06 - 24:10
    [inaudible]. Given the policy and whatever else, drop that.
  • 24:10 - 24:15
    So the random variables in this expectation are the sequence of states and actions:
  • 24:15 - 24:20
    S0, A0, S1, A1, and so on, up to ST, AT are the random variables.
  • 24:20 - 24:26
    So let me write out this expectation explicitly as a sum
  • 24:26 - 24:33
    over all possible state and action sequences of that
  • 24:46 - 24:53
    - so that's what an expectation is. It's the probability of the random variables times that. Let
  • 24:53 - 25:00
    me just expand out this probability.
  • 25:01 - 25:06
    So the probability of seeing this exact sequence of states and actions is
  • 25:06 - 25:11
    the probability of the MDP starting in that state.
  • 25:11 - 25:14
    If this is a deterministic initial state, then all the probability
  • 25:14 - 25:18
    mass would be on one state. Otherwise, there's some distribution over initial states.
  • 25:18 - 25:21
    Then times the probability
  • 25:21 - 25:24
    that
  • 25:24 - 25:29
    you chose action A0 from that state S zero,
  • 25:29 - 25:34
    and then times the probability that
  • 25:34 - 25:38
    the MDP's transition probabilities happen to transition you to state
  • 25:38 - 25:45
    S1 when you chose action A0 in state S0,
  • 25:45 - 25:53
    times the probability that you chose that and so on.
  • 25:53 - 26:06
    The last term here is that, and then times that.
  • 26:12 - 26:14
    So what I did was just
  • 26:14 - 26:19
    take this probability of seeing this sequence of states and actions, and then just
  • 26:19 - 26:23
    [inaudible] explicitly or expanded explicitly like this.
  • 26:23 - 26:24
    It turns out later on
  • 26:24 - 26:30
    I'm going to need to write this sum of rewards a lot, so I'm just gonna call this the payoff
  • 26:32 - 26:36
    from now on. So whenever later in this lecture I write the word payoff, I
  • 26:36 - 26:40
    just mean this sum.
  • 26:40 - 26:47
    So
  • 26:47 - 26:52
    our goal is to maximize the expected payoff, so our goal is to maximize this sum.
  • 26:52 - 26:56
    Let me actually just skip ahead. I'm going to write down what the final answer is, and
  • 26:56 - 26:59
    then I'll come back and justify the
  • 26:59 - 27:06
    algorithm. So here's the algorithm. This is
  • 27:25 - 27:32
  • 28:13 - 28:16
    how we're going to
  • 28:16 - 28:19
    update the parameters of the algorithm. We're going to
  • 28:19 - 28:23
    sample a state action sequence. The way you do this is you just take your
  • 28:23 - 28:25
    current stochastic policy,
  • 28:25 - 28:29
    and you execute it in the MDP. So just go ahead and start from some initial state, take
  • 28:29 - 28:33
    a stochastic action according to your current stochastic policy,
  • 28:33 - 28:35
    see where the state transition probabilities take you, and so you just
  • 28:35 - 28:39
    do that for T time steps, and that's how you sample the state sequence. Then
  • 28:39 - 28:45
    you compute the payoff, and then you perform this update.
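
The update itself was written on the board; below is a sketch of what is essentially that update, consistent with the derivation that follows. The learning rate alpha is an assumption, and grad_log_pi is shown for the two-action logistic policy used earlier:

    import numpy as np

    def grad_log_pi(theta, s, a):
        # Gradient of log pi_theta(s, a) for the logistic policy; a = 1 means "accelerate right".
        p_right = 1.0 / (1.0 + np.exp(-theta @ s))
        return (a - p_right) * s

    def reinforce_update(theta, trajectory, payoff, alpha=0.01):
        # trajectory: the sampled (state, action) pairs from executing pi_theta for T steps.
        # payoff: the sum of rewards R(s0) + ... + R(sT) along that trajectory.
        # On expectation, this step points in the direction of grad_theta E[payoff].
        step = sum(grad_log_pi(theta, s, a) for s, a in trajectory)
        return theta + alpha * payoff * step
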
  • 28:45 - 28:49
    So let's go back and figure out what this algorithm is doing.
  • 28:49 - 28:53
    Notice that this algorithm performs stochastic updates because on
  • 28:53 - 28:57
    every step it updates theta according to this thing on the right hand side.
  • 28:57 - 29:00
    This thing on the right hand side depends very much on your payoff and on
  • 29:00 - 29:02
    the state action sequence you saw. Your
  • 29:02 - 29:05
    state action sequence is random,
  • 29:05 - 29:07
    so what I want to do is figure out
  • 29:07 - 29:12
    - so on every step, I'll sort of take a step that's chosen randomly
  • 29:13 - 29:16
    because it depends on this random state action sequence.
  • 29:16 - 29:22
    So what I want to do is figure out on average how does it change the parameters theta.
  • 29:22 - 29:23
    In particular,
  • 29:23 - 29:44
    I want to know what is the expected value of the change to the parameters.
  • 29:58 - 30:06
    So I want to know what is the expected value of this change to my parameters theta. Our
  • 30:06 - 30:11
    goal is to maximize the sum [inaudible] - our goal is to maximize the value of the payoff.
  • 30:11 - 30:14
    So long as
  • 30:14 - 30:18
    the updates on expectation are on average taking us uphill
  • 30:18 - 30:20
    on the expected payoff,
  • 30:20 - 30:22
    then we're happy.
  • 30:22 - 30:25
    It turns out that
  • 30:25 - 30:32
    this algorithm is a form of stochastic gradient ascent in which -
  • 30:32 - 30:37
    remember when I talked about stochastic gradient descent for least squares regression, I
  • 30:37 - 30:43
    said that you have some parameters - maybe you're trying to minimize a quadratic function.
  • 30:43 - 30:49
    Then you may have parameters that will wander around randomly
  • 30:49 - 30:54
    until they get close to the optimum of the [inaudible] quadratic surface.
  • 30:54 - 30:56
    It turns out that
  • 30:56 - 31:00
    the reinforce algorithm will be very much like that. It will be a
  • 31:00 - 31:03
    stochastic gradient ascent algorithm in which on every step -
  • 31:03 - 31:07
    the step we take is a little bit random. It's determined by the random state action sequence,
  • 31:07 - 31:12
    but on expectation this turns out to be essentially a gradient ascent algorithm.
  • 31:12 - 31:16
    And so we'll do something like this. It'll wander around randomly, but on average take you towards the
  • 31:16 - 31:25
    optimum. So let me go ahead and prove that now.
  • 31:29 - 31:36
    Let's see.
  • 31:38 - 31:46
    What I'm going to do is I'm going to derive a gradient ascent update rule
  • 31:46 - 31:50
    for maximizing the expected payoff. Then I'll hopefully show that
  • 31:50 - 31:59
    by deriving a gradient ascent update rule, I'll end up with this thing on expectation.
  • 31:59 - 32:02
    So before I do the derivation, let me just remind you of
  • 32:02 - 32:06
    the chain rule - the product rule for differentiation
  • 32:06 - 32:09
    in which
  • 32:09 - 32:14
    if I have a product of functions,
  • 32:14 - 32:17
    then
  • 32:33 - 32:38
    the derivative of the product is given by
  • 32:38 - 32:41
    taking the derivatives of these things one at a time. So first I differentiate
  • 32:41 - 32:45
    with respect to F, getting F prime and leaving the other two fixed. Then I differentiate with respect to G,
  • 32:45 - 32:47
    leaving the other two fixed.
  • 32:47 - 32:51
    Then I differentiate with respect to H, so I get H prime leaving the other two fixed. So that's the
  • 32:51 - 32:54
    product rule
  • 32:54 - 32:56
    for
  • 32:56 - 33:01
    derivatives.
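
In symbols, for three factors as used in the derivation below:

    \[
    \frac{d}{d\theta}\big[f(\theta)\,g(\theta)\,h(\theta)\big]
    = f'(\theta)\,g(\theta)\,h(\theta) + f(\theta)\,g'(\theta)\,h(\theta) + f(\theta)\,g(\theta)\,h'(\theta).
    \]
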
  • 33:01 - 33:05
    If you refer back to this equation where
  • 33:05 - 33:07
    earlier we wrote out that
  • 33:07 - 33:11
    the expected payoff is given by this equation, this
  • 33:11 - 33:14
    sum over all the states of the probability
  • 33:14 - 33:16
    times the payoff.
  • 33:16 - 33:20
    So what I'm going to do is take the derivative of this expression
  • 33:20 - 33:23
    with respect to the parameters theta
  • 33:23 - 33:26
    because I want to do gradient ascent on this function. So I'm going to take the
  • 33:26 - 33:31
    derivative of this function with respect to theta, and then try to go uphill on this function.
  • 33:31 - 33:36
    So using the product rule, when I take the derivative of this function with respect to theta
  • 33:36 - 33:39
    what I get is - we'll end up with a sum of terms right there. There are a lot
  • 33:39 - 33:42
    of terms here that depend on theta,
  • 33:42 - 33:47
    and so what I'll end up with is I'll end up having a sum - having one term
  • 33:47 - 33:51
    that corresponds to the derivative of this keeping everything else fixed,
  • 33:51 - 33:54
    to have one term from the derivative of this keeping everything else fixed, and I'll
  • 33:54 - 33:55
    have one term
  • 33:55 - 33:56
    from the derivative of
  • 33:56 - 34:02
    that last thing keeping everything else fixed. So just apply the product rule to this. Let's
  • 34:02 - 34:09
    write that down. So I have that - the derivative
  • 34:11 - 34:16
    with respect to theta of the expected value of the payoff is -
  • 34:16 - 34:22
    it turns out I can actually do this entire derivation in exactly four steps,
  • 34:22 - 34:27
    but each of the steps requires a huge amount of writing, so
  • 34:27 - 34:38
    I'll just start writing and see how that goes, but this is a four step derivation.
  • 34:38 - 34:45
    So there's the sum over the state action sequences as we saw before. Close the
  • 35:05 - 35:12
    bracket,
  • 36:00 - 36:03
    and
  • 36:03 - 36:09
    then times the payoff.
  • 36:09 - 36:13
    So that huge amount of writing, that was just taking my previous formula and
  • 36:13 - 36:14
  • 36:14 - 36:17
    differentiating these terms that depend on theta one at a time. This was the term with
  • 36:17 - 36:18
    the derivative
  • 36:18 - 36:19
    of the first pi of
  • 36:19 - 36:22
    theta S0 A0. So there's the first derivative
  • 36:22 - 36:26
    term. There's the second one. Then you have plus dot, dot, dot, like
  • 36:26 - 36:28
    in terms of [inaudible].
  • 36:28 - 36:31
    That's my last term.
  • 36:31 - 36:38
    So that was step one of four.
  • 36:51 - 37:07
    And so by algebra - let me just write this down
  • 38:04 - 38:07
    and convince us all that it's true.
  • 38:07 - 38:09
    This is the second of four steps
  • 38:09 - 38:10
    in which you
  • 38:10 - 38:14
    just convince yourself that if I expand out - take the sum and multiply it by
  • 38:14 - 38:15
    that big product in front,
  • 38:15 - 38:19
    then I get back that sum of terms I get. It's essentially -
  • 38:19 - 38:23
    for example, when I multiply out, this product on top of this ratio, of this
  • 38:23 - 38:25
    first fraction,
  • 38:25 - 38:29
    then pi subscript theta S0 A0, that would cancel out this pi subscript
  • 38:29 - 38:31
    theta S0 A0
  • 38:31 - 38:35
    and replace it with the derivative with respect to theta of pi theta S0 A0. So [inaudible] algebra was
  • 38:35 - 38:42
    the second step. But
  • 38:45 - 38:48
    that term on top is just
  • 38:48 - 38:54
    what I worked out previously
  • 38:54 - 38:59
    - was the joint probability of the state action sequence,
  • 38:59 - 39:32
    and now I have that times that times the payoff.
  • 39:32 - 40:07
    And so by the definition of expectation, this is just equal to that thing times the payoff.
  • 40:07 - 40:08
    So
  • 40:08 - 40:12
    this thing inside the expectation, this is exactly
  • 40:12 - 40:19
    the step that we were taking in the inner loop of our reinforce algorithm,
  • 40:19 - 40:21
    roughly the reinforce algorithm. This
  • 40:21 - 40:26
    proves that the expected value of our change to theta
  • 40:26 - 40:31
    is exactly in the direction of the gradient of our expected payoff. That's how
  • 40:31 - 40:33
    I started this whole derivation. I said
  • 40:33 - 40:36
    let's look at our expected payoff and take the derivative of that with respect to theta.
  • 40:37 - 40:41
    What we've proved is that on expectation,
  • 40:41 - 40:44
    the step direction I take in reinforce is exactly the gradient of
  • 40:44 - 40:46
    the thing I'm trying to
  • 40:46 - 40:49
    optimize. This shows that this algorithm is
  • 40:49 - 40:54
    a stochastic gradient ascent
  • 40:54 - 40:58
    algorithm. I wrote a lot. Why don't you take a minute to look at the equations and [inaudible]
  • 40:58 - 41:01
    check if everything makes sense. I'll erase a couple of boards and then check if you have
  • 41:01 - 41:08
    questions after that. Questions?
  • 41:43 - 41:50
    Could you
  • 41:54 - 41:56
    raise your hand if this makes sense?
  • 42:01 - 42:04
    Great.
  • 42:04 - 42:07
    Some of the comments - we talked about
  • 42:07 - 42:11
    those value function approximation approaches where you approximate
  • 42:11 - 42:14
    V star, then you go from V star
  • 42:14 - 42:18
    to pi star. And then there are also policy search approaches, where you try to approximate the policy directly.
  • 42:18 - 42:22
    So let's talk briefly about when either one may be preferable.
  • 42:22 - 42:24
    It turns out that
  • 42:24 - 42:27
    policy search algorithms are especially effective
  • 42:27 - 42:31
    when you can choose a simple policy class pi.
  • 42:32 - 42:36
    So the question really is for your problem
  • 42:36 - 42:40
    does there exist a simple function like a linear function or a logistic function
  • 42:40 - 42:45
    that maps from features of the state to the action that works pretty well.
  • 42:45 - 42:49
    So for the problem of the inverted pendulum - this is quite likely to be true.
  • 42:49 - 42:52
    Going through all the different choices of parameters, you can say
  • 42:52 - 42:55
    things like if the pole's leaning towards the right,
  • 42:55 - 42:58
    then accelerate towards the right to try to catch it. For the
  • 42:58 - 43:02
    inverted pendulum, this is probably true. For lots of what's called
  • 43:02 - 43:05
    low level control tasks, things like driving a car,
  • 43:05 - 43:09
    the low level reflexes of do you steer your car left to avoid another car, do
  • 43:09 - 43:13
    you steer your car left to follow the road, flying a helicopter,
  • 43:13 - 43:18
    again very short time scale types of decisions - I like to think of these as
  • 43:18 - 43:22
    decisions made by a trained operator, like a trained driver or a trained
  • 43:22 - 43:24
    pilot. It would almost
  • 43:24 - 43:27
    be a reflex, these sorts of very quick instinctive things where
  • 43:27 - 43:31
    you map very directly from the inputs to an action. These are
  • 43:31 - 43:32
    problems for which
  • 43:32 - 43:36
    you can probably choose a reasonable policy class like a logistic function or something,
  • 43:36 - 43:38
    and it will often work well.
  • 43:38 - 43:42
    In contrast, if you have problems that require
  • 43:42 - 43:46
    long multistep reasoning, so things like a game of chess where you have to
  • 43:46 - 43:47
    reason carefully about if
  • 43:47 - 43:50
    I do this, then they'll do that, then they'll do this, then they'll do that.
  • 43:50 - 43:56
    Those I think of as less instinctual, very high level decision making.
  • 43:56 - 44:00
    For problems like that,
  • 44:00 - 44:06
    I would sometimes use value function approximation approaches instead. Let
  • 44:06 - 44:09
    me say more about this later.
  • 44:09 - 44:14
    The last thing I want to do is actually tell you about -
  • 44:14 - 44:22
    I guess just as a side comment,
  • 44:22 - 44:25
    it turns out also that if you have a POMDP, if you have a partially observable
  • 44:25 - 44:28
    MDP - I don't want to say too much about this -
  • 44:28 - 44:37
    it turns out that if you only have an approximation
  • 44:37 - 44:42
    - let's call it S hat of the true
  • 44:42 - 44:47
    state, and so this could be S hat equals S of T given T from the Kalman filter -
  • 44:57 - 45:13
    then you can still use these sorts of policy search algorithms where you can say pi theta of S
  • 45:13 - 45:16
    hat comma A - There are various other ways you can use
  • 45:16 - 45:19
    policy search algorithms for POMDPs, but this is one of them
  • 45:19 - 45:21
    where if you only have estimates of the state, you can then
  • 45:21 - 45:26
    choose a policy class that only looks at your estimate of the state to choose the action.
  • 45:26 - 45:31
    By using the same way of estimating the states in both training and testing,
  • 45:31 - 45:33
    this will usually do some -
  • 45:33 - 45:37
    so these sorts of policy search algorithms can be applied
  • 45:37 - 45:47
    often reasonably effectively to
  • 45:47 - 45:50
    POMDPs as well. There's one more algorithm I wanna talk about, but some final
  • 45:50 - 45:55
    words on the reinforce algorithm. It turns out the reinforce algorithm
  • 45:55 - 46:00
    often works well but is often extremely slow. So it [inaudible]
  • 46:00 - 46:04
    works, but one thing to watch out for is that because you're taking these
  • 46:04 - 46:07
    gradient ascent steps that are very noisy, you're sampling a state action sequence, and then
  • 46:07 - 46:08
    you're sort of
  • 46:08 - 46:10
    taking a gradient ascent step in essentially a sort
  • 46:10 - 46:13
    of random direction that only on expectation is
  • 46:13 - 46:14
    correct. The
  • 46:14 - 46:18
    gradient ascent direction for reinforce can sometimes be a bit noisy,
  • 46:18 - 46:22
    and so it's not that uncommon to need like a million iterations of gradient ascent,
  • 46:22 - 46:24
    or ten million, or 100 million
  • 46:24 - 46:26
    iterations of gradient ascent
  • 46:26 - 46:29
    for reinforce [inaudible], so that's just something to watch out for.
  • 46:29 - 46:41
    One consequence of that is in the reinforce algorithm - I shouldn't
  • 46:41 - 46:45
    really call it reinforce. In what's essentially the reinforce algorithm, there's
  • 46:45 - 46:48
    this step where you need to sample a state action sequence.
  • 46:48 - 46:51
    So in
  • 46:51 - 46:54
    principle you could do this on your own robot. If there were a robot you were trying to
  • 46:54 - 46:55
    control, you can actually
  • 46:55 - 46:59
    physically initialize in some state, pick an action and so on, and go from there
  • 46:59 - 47:00
    to sample a state action sequence.
  • 47:00 - 47:02
    But
  • 47:02 - 47:05
    if you need to do this ten million times, you probably don't want to [inaudible]
  • 47:05 - 47:07
    your robot ten million times.
  • 47:07 - 47:13
    I personally have seen many more applications of reinforce in simulation.
  • 47:13 - 47:17
    You can easily run ten thousand simulations or ten million simulations of your robot in
  • 47:17 - 47:21
    simulation maybe, but you might not want to do that -
  • 47:21 - 47:25
    have your robot physically repeat some action ten million times.
  • 47:25 - 47:28
    So I personally have seen many more applications of reinforce
  • 47:28 - 47:30
    to learn using a simulator
  • 47:30 - 47:35
    than to actually do this on a physical device.
  • 47:35 - 47:38
    The last thing I wanted to do is tell you about one other algorithm,
  • 47:38 - 47:45
    one final policy search algorithm. [Inaudible] the laptop display please. It's
  • 47:52 - 47:56
    a policy search algorithm called Pegasus that's
  • 47:56 - 48:00
    actually what we used on our autonomous helicopter flight work
  • 48:00 - 48:05
    for many years. There are some other things we do now. So
  • 48:05 - 48:08
    here's the idea. There's a
  • 48:08 - 48:11
    reminder slide on RL formalism. There's nothing here that you don't know, but
  • 48:11 - 48:17
    I just want to pictorially describe the RL formalism because I'll use that later.
  • 48:17 - 48:21
    I'm gonna draw the reinforcement learning picture as follows. You
  • 48:21 - 48:22
    initialize the [inaudible]
  • 48:22 - 48:24
    system, say a
  • 48:24 - 48:26
    helicopter or whatever in some state S0,
  • 48:26 - 48:28
    you choose an action A0,
  • 48:28 - 48:32
    and then, say, the helicopter dynamics take you to some new state S1, you choose
  • 48:32 - 48:33
    some other action A1,
  • 48:33 - 48:35
    and so on.
  • 48:35 - 48:39
    And then you have some reward function, which you apply to the sequence of states you ended up in,
  • 48:39 - 48:41
    and that's your total payoff. So
  • 48:41 - 48:47
    this is just a picture I wanna use to summarize the RL problem.
  • 48:48 - 48:51
    Our goal is to maximize the expected payoff,
  • 48:51 - 48:53
    which is this, the expected sum of the rewards.
  • 48:53 - 48:57
    And our goal is to learn the policy, which I denote by a green box.
  • 48:57 - 49:02
    So our policy - and I'll switch back to deterministic policies for now.
  • 49:02 - 49:09
    So my deterministic policy will be some function mapping from the states to the actions.
  • 49:09 - 49:15
    As a concrete example, you imagine that in the policy search setting,
  • 49:15 - 49:17
    you may have a linear class of policies.
  • 49:17 - 49:23
    So you may imagine that the action A will be say a linear function of the states,
  • 49:23 - 49:28
    and your goal is to learn the parameters of the linear function.
  • 49:28 - 49:33
    So imagine trying to do linear regression on policies, except you're trying to optimize the reinforcement learning
  • 49:33 - 49:36
    objective. So just [inaudible] imagine that the action A
  • 49:36 - 49:40
    is theta transpose S, and you use policy search to come up with
  • 49:40 - 49:42
    good parameters theta so as
  • 49:42 - 49:49
    to maximize the expected payoff. That would be one setting in which this picture applies. There's the idea.
  • 49:49 - 49:50
    Quite often
  • 49:50 - 49:54
    we come up with a model or a simulator for the MDP, and as before
  • 49:54 - 49:57
    a model or a simulator is just a box
  • 49:57 - 49:59
    that takes as input
  • 49:59 - 50:00
    some state ST,
  • 50:00 - 50:03
    takes as input some action AT,
  • 50:03 - 50:07
    and then outputs some [inaudible] state ST plus one that you might transition to in the MDP.
  • 50:07 - 50:10
    This ST plus one will be a random state. It will be drawn from
  • 50:10 - 50:13
    the random state transition probabilities of the MDP.
  • 50:13 - 50:17
    This is important. Very important, ST plus one
  • 50:17 - 50:22
    will be a random function of ST and AT. In the simulator, this is [inaudible].
  • 50:22 - 50:26
    So for example, for autonomous helicopter flight, you [inaudible] build a simulator
  • 50:26 - 50:30
    using supervised learning, an algorithm like linear regression
  • 50:30 - 50:32
    [inaudible] to linear regression,
  • 50:32 - 50:35
    so we can get a nonlinear model of the dynamics of what ST
  • 50:35 - 50:41
    plus one is as a random function of ST and AT. Now
  • 50:41 - 50:45
    once you have a simulator,
  • 50:45 - 50:48
    given any fixed policy you can quite
  • 50:48 - 50:52
    straightforwardly evaluate any policy in a simulator.
  • 50:53 - 51:00
    Concretely, our goal is to find the policy pi mapping from states to actions, so the goal is to find the green box
  • 51:00 - 51:03
    like that, that works well.
  • 51:03 - 51:08
    So if you have any one fixed policy pi,
  • 51:08 - 51:13
    you can evaluate the policy pi just using the simulator
  • 51:13 - 51:16
    via the picture shown at the bottom of the slide.
  • 51:16 - 51:19
    So concretely, you can take your initial state S0,
  • 51:19 - 51:21
    feed it into the policy pi,
  • 51:21 - 51:25
    your policy pi will output some action A0, you plug it in
  • 51:25 - 51:28
    the simulator, the simulator outputs a random state S1, you feed
  • 51:28 - 51:32
    S1 into the policy and so on, and you get a sequence of states S0 through ST
  • 51:32 - 51:35
    that your helicopter flies through in simulation.
  • 51:35 - 51:38
    Then sum up the rewards,
  • 51:38 - 51:43
    and this gives you an estimate of the expected payoff of the policy.
  • 51:43 - 51:45
    This picture is just a fancy way of saying fly
  • 51:45 - 51:49
    your helicopter in simulation and see how well it does, and measure the sum of
  • 51:49 - 51:54
    rewards you get on average in the simulator. The
  • 51:54 - 51:57
    picture I've drawn here assumes that you run your policy through the
  • 51:57 - 51:59
    simulator just once. In general,
  • 51:59 - 52:01
    you would run the policy through the simulator
  • 52:01 - 52:05
    some number of times and then average to get an average over M
  • 52:05 - 52:11
    simulations to get a better estimate of the policy's expected payoff.
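
A sketch of that evaluation loop; the names simulator and reward, and the rollout count M, are placeholders, and the linear policy in the comment is the example from the slide:

    def estimate_payoff(policy, simulator, reward, s0, T, M=100):
        # Average the summed rewards over M rollouts of the policy in the simulator.
        total = 0.0
        for _ in range(M):
            s = s0
            payoff = reward(s)
            for _ in range(T):
                a = policy(s)          # e.g. a = theta^T s for a linear policy
                s = simulator(s, a)    # random next state from the learned model
                payoff += reward(s)
            total += payoff
        return total / M
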
  • 52:13 - 52:15
    Now that I have a way
  • 52:15 - 52:18
    - given any one fixed policy,
  • 52:18 - 52:23
    this gives me a way to evaluate the expected payoff of that policy.
  • 52:23 - 52:28
    So one reasonably obvious thing you might try to do is then
  • 52:28 - 52:30
    just search for a policy,
  • 52:30 - 52:34
    in other words search for parameters theta for your policy, that
  • 52:34 - 52:36
    gives you high estimated payoff.
  • 52:36 - 52:40
    Does that make sense? So my policy has some parameters theta, so
  • 52:40 - 52:41
    my policy is
  • 52:41 - 52:46
    my actions A are equal to theta transpose S, say, if it's a linear policy.
  • 52:46 - 52:50
    For any fixed value of the parameters theta,
  • 52:50 - 52:51
    I can evaluate -
  • 52:51 - 52:54
    I can get an estimate for how good the policy is using the simulator.
  • 52:54 - 52:59
    One thing I might try to do is search for parameters theta to try to
  • 52:59 - 53:03
    maximize my estimated payoff.
  • 53:03 - 53:06
    It turns out you can sort of do that,
  • 53:06 - 53:11
    but that idea as I've just stated is hard to get to work.
  • 53:11 - 53:12
    Here's the
  • 53:12 - 53:16
    reason. The simulator allows us to evaluate a
  • 53:16 - 53:18
    policy, so let's search for a policy of high
  • 53:18 - 53:21
    value. The difficulty is that the simulator is random,
  • 53:21 - 53:23
    and so every time we evaluate a policy,
  • 53:23 - 53:27
    we get back a very slightly different answer. So
  • 53:27 - 53:29
    in the
  • 53:29 - 53:33
    cartoon below, I want you to imagine that the horizontal axis is the space of policies.
  • 53:33 - 53:37
    In other words, as I vary the
  • 53:37 - 53:41
    parameters in my policy, I get different points on the horizontal axis here. As I
  • 53:41 - 53:44
    vary the parameters theta, I get different policies,
  • 53:44 - 53:46
    and so I'm moving along the X axis,
  • 53:46 - 53:50
    and my total payoff I'm gonna plot on the vertical axis,
  • 53:50 - 53:54
    and the red line in this cartoon is the expected payoff of the different
  • 53:54 - 53:56
    policies.
  • 53:56 - 54:02
    My goal is to find the policy with the highest expected payoff.
  • 54:02 - 54:06
    You could search for a policy with high expected payoff, but every time you evaluate a policy
  • 54:06 - 54:08
    - say I evaluate some policy,
  • 54:08 - 54:10
    then I might get a point that
  • 54:10 - 54:13
    just by chance looked a little bit better than it should have. If
  • 54:13 - 54:16
    I evaluate a second policy and just by chance it looked a little bit worse. I evaluate a third
  • 54:16 - 54:21
    policy, fourth,
  • 54:21 - 54:25
    sometimes you look here - sometimes I might actually evaluate exactly the same policy
  • 54:25 - 54:28
    twice and get back slightly different
  • 54:28 - 54:32
    answers just because my simulator is random, so when I apply the same policy
  • 54:32 - 54:39
    twice in simulation, I might get back slightly different answers.
  • 54:39 - 54:42
    So as I evaluate more and more policies, these are the pictures I get.
  • 54:42 - 54:46
    My goal is to try to optimize the red line. I hope you appreciate this is a
  • 54:46 - 54:51
    hard problem, especially when all [inaudible] optimization algorithm gets to see are these
  • 54:51 - 54:54
    black dots, and they don't have direct access to the red line.
  • 54:54 - 54:58
    So when my input space is some fairly high dimensional space, if I have
  • 54:58 - 55:02
    ten parameters, the horizontal axis would actually be a 10-D space,
  • 55:02 - 55:07
    all I get are these noisy estimates of what the red line is. This is a very hard
  • 55:07 - 55:10
    stochastic optimization problem.
  • 55:10 - 55:16
    So it turns out there's one way to make this optimization problem much easier. Here's the idea.
  • 55:16 - 55:21
    And the method is called Pegasus, which is an acronym for something.
  • 55:21 - 55:23
    I'll tell you later.
  • 55:23 - 55:26
    So the simulator repeatedly makes calls
  • 55:26 - 55:28
    to a random number generator
  • 55:28 - 55:33
    to generate random numbers RT, which are used to simulate the stochastic dynamics.
  • 55:33 - 55:37
    What I mean by that is that the simulator takes this input of state
  • 55:37 - 55:40
    and action, and it outputs the next state randomly,
  • 55:40 - 55:44
    and if you peer into the simulator, if you open the box of the simulator and ask
  • 55:44 - 55:50
    how is my simulator generating these random next states ST plus one,
  • 55:50 - 55:54
    pretty much the only way to do so - pretty much the only way
  • 55:54 - 55:57
    to write a simulator with random outputs is
  • 55:57 - 56:00
    we're gonna make calls to a random number generator,
  • 56:00 - 56:03
    and get random numbers, these RTs,
  • 56:03 - 56:08
    and then you have some function that takes as input S0, A0, and
  • 56:08 - 56:11
    the results of your random number generator,
  • 56:11 - 56:14
    and it computes some next state as a function of the inputs and of the random
  • 56:14 - 56:17
    number it got from the random number
  • 56:17 - 56:18
    generator. This is
  • 56:18 - 56:22
    pretty much the only way anyone implements any random code, any code
  • 56:22 - 56:26
    that generates random outputs. You make a call to a random number generator,
  • 56:26 - 56:31
    and you compute some function of the random number and of your other
  • 56:31 - 56:35
    inputs. The reason that when you evaluate different policies you get
  • 56:35 - 56:36
    different answers
  • 56:36 - 56:38
    is because
  • 56:38 - 56:41
    every time you rerun the simulator, you get a different sequence of random
  • 56:41 - 56:44
    numbers from the random number generator,
  • 56:44 - 56:46
    and so you get a different answer every time,
  • 56:46 - 56:48
    even if you evaluate the same policy twice.
  • 56:48 - 56:51
    So here's the idea.
  • 56:51 - 56:56
    Let's say we fix in advance the sequence of random numbers,
  • 56:56 - 57:00
    choose R1, R2, up to RT minus one.
  • 57:00 - 57:03
    Fix the sequence of random numbers in advance,
  • 57:03 - 57:07
    and we'll always use the same sequence of random numbers
  • 57:07 - 57:10
    to evaluate different policies. Go
  • 57:10 - 57:15
    into your code and fix R1, R2, through RT minus one.
  • 57:15 - 57:18
    Choose them randomly once and then fix them forever.
  • 57:18 - 57:21
    If you always use the same sequence of random numbers, then
  • 57:21 - 57:25
    the system is no longer random, and if you evaluate the same policy twice,
  • 57:25 - 57:29
    you get back exactly the same answer.
  • 57:29 - 57:33
    And so these sequences of random numbers, [inaudible] call them scenarios, and
  • 57:33 - 57:37
    Pegasus is an acronym for policy evaluation-of-goodness and search using scenarios.
  • 57:37 - 57:39
    So
  • 57:42 - 57:46
    when you do that, this is the picture you get.
  • 57:46 - 57:50
    As before, the red line is your expected payoff, and by fixing the random numbers,
  • 57:50 - 57:53
    you've defined some estimate of the expected payoff.
  • 57:53 - 57:57
    And as you evaluate different policies, they're
  • 57:57 - 58:01
    still only approximations to their true expected payoff, but at least now you have
  • 58:01 - 58:01
    a
  • 58:01 - 58:03
    deterministic function to optimize,
  • 58:03 - 58:07
    and you can now apply your favorite algorithms, be it gradient ascent or
  • 58:08 - 58:10
    some sort
  • 58:10 - 58:12
    of local [inaudible] search
  • 58:12 - 58:14
    to try to optimize the black
  • 58:14 - 58:16
    curve. This gives you a much easier
  • 58:16 - 58:18
    optimization problem, and you
  • 58:18 - 58:23
    can optimize this to find hopefully a good policy. So this is
  • 58:23 - 58:27
    the Pegasus policy search method.
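
A sketch of the Pegasus evaluation just described; simulator_det is assumed to be the simulator with its randomness exposed as an explicit argument, and the scenario count and noise dimension are placeholders:

    import numpy as np

    def make_scenarios(M, T, noise_dim, seed=0):
        # Draw the random numbers once, up front, and reuse them for every policy evaluated.
        rng = np.random.default_rng(seed)
        return rng.standard_normal((M, T, noise_dim))

    def pegasus_payoff(theta, policy, simulator_det, reward, s0, scenarios):
        # Deterministic estimate of the expected payoff: average rollouts over the fixed scenarios.
        total = 0.0
        for scenario in scenarios:
            s = s0
            payoff = reward(s)
            for r_t in scenario:
                a = policy(theta, s)
                s = simulator_det(s, a, r_t)   # same r_t every time, so the same theta gives the same answer
                payoff += reward(s)
            total += payoff
        return total / len(scenarios)
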
  • 58:27 - 58:27
    So when
  • 58:27 - 58:31
    I started to talk about reinforcement learning, I showed
  • 58:31 - 58:35
    that video of a helicopter flying upside down. That was actually done using
  • 58:35 - 58:40
    exactly this method, using exactly this policy search algorithm. This
  • 58:40 - 58:42
    seems to scale well
  • 58:42 - 58:47
    even to fairly large problems, even to fairly high dimensional state spaces.
  • 58:47 - 58:50
    Typically, with Pegasus policy search algorithms -
  • 58:50 - 58:53
    the optimization problem
  • 58:53 - 58:56
    is still - is much easier than the stochastic version, but sometimes it's
  • 58:56 - 58:58
    not entirely trivial, and so
  • 58:58 - 59:00
    you have to apply this sort of method with
  • 59:00 - 59:05
    maybe on the order of ten parameters or tens of parameters, so 30, 40
  • 59:05 - 59:06
    parameters, but
  • 59:06 - 59:08
    not thousands of parameters,
  • 59:08 - 59:13
    at least in these sorts of applications.
  • 59:13 - 59:19
    Student: So is that method different than just assuming that
  • 59:19 - 59:27
    you know your simulator exactly, just throwing away all the random numbers entirely? So is this different from assuming that
  • 59:27 - 59:29
    we have a deterministic simulator?
  • 59:29 - 59:31
    The answer is no. In the way you do this,
  • 59:31 - 59:35
    for the sake of simplicity I talked about one sequence of random numbers.
  • 59:35 - 59:37
    What you do is -
  • 59:37 - 59:42
    so imagine that the random numbers are simulating different wind gusts against your helicopter.
  • 59:42 - 59:46
    So what you want to do isn't really evaluate just against one pattern of wind gusts.
  • 59:46 - 59:48
    What you want to do is
  • 59:48 - 59:50
    sample some set of
  • 59:50 - 59:54
    different patterns of wind gusts, and evaluate against all of them and average.
  • 59:54 - 59:56
    So what you do is you actually
  • 59:56 - 59:58
    sample say 100 -
  • 59:58 - 60:01
    some number I made up like 100 sequences of random numbers,
  • 60:01 - 60:03
    and every time you want to evaluate a policy,
  • 60:03 - 60:09
    you evaluate it against all 100 sequences of random numbers and then average.
  • 60:09 - 60:13
    This is in exactly the same way that on this earlier picture
  • 60:13 - 60:17
    you wouldn't necessarily evaluate the policy just once. You evaluate it maybe 100
  • 60:17 - 60:18
    times in simulation, and then
  • 60:18 - 60:22
    average to get a better estimate of the expected reward.
  • 60:22 - 60:23
    In the same way,
  • 60:23 - 60:24
    you do that here
  • 60:24 - 60:29
    but with 100 fixed sequences of random numbers. Does that make sense? Any
  • 60:29 - 60:36
    other questions? Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible] random numbers [inaudible] won't you get similar ideas [inaudible]?
  • 60:48 - 60:52
    Yeah. I guess you're right. So the quality - for a fixed policy,
  • 60:52 - 60:57
    the quality of the approximation is equally good for both cases. The
  • 60:57 - 61:01
    advantage of fixing the random numbers is that you end up with
  • 61:01 - 61:04
    an optimization problem that's much easier.
  • 61:05 - 61:06
    I have some search problem,
  • 61:06 - 61:10
    and on the horizontal axis there's a space of control policies,
  • 61:10 - 61:13
    and my goal is to find a control policy that
  • 61:13 - 61:15
    maximizes the payoff.
  • 61:15 - 61:19
    The problem with this earlier setting was that
  • 61:19 - 61:22
    when I evaluate policies I get these noisy estimates,
  • 61:22 - 61:23
    and then it's
  • 61:23 - 61:27
    just very hard to optimize the red curve if I have these points that are all over the
  • 61:27 - 61:29
    place. And if I evaluate the same policy twice, I don't even get back the same
  • 61:31 - 61:33
    answer. By fixing the random numbers,
  • 61:33 - 61:37
    the algorithm still doesn't get to see the red curve, but at least it's
  • 61:37 - 61:41
    now optimizing a deterministic function. That makes the
  • 61:41 - 61:42
    optimization problem much easier. Does
  • 61:42 - 61:49
that make sense? Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves
  • 61:51 - 61:58
    that are easy to optimize. And then you smush them together?
  • 61:58 - 61:59
    Let's see.
  • 61:59 - 62:05
I have just one nice black curve that I'm trying to optimize. Student: For each scenario?
  • 62:05 - 62:09
    I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios.
  • 62:09 - 62:12
    So the black curve here is defined by averaging over
  • 62:12 - 62:16
    a large set of scenarios. Does that make sense?
  • 62:16 - 62:20
    So instead of only one - if the averaging thing doesn't make sense, imagine that there's
  • 62:20 - 62:24
    just one sequence of random numbers. That might be easier to think
  • 62:24 - 62:27
    about. Fix one sequence of random numbers, and every time I evaluate another policy,
  • 62:27 - 62:31
    I evaluate against the same sequence of random numbers, and that gives me
  • 62:31 - 62:36
    a nice deterministic function to optimize.
  • 62:36 - 62:39
    Any other questions?
  • 62:39 - 62:41
    The advantage is really
  • 62:41 - 62:45
    that - one way to think about it is when I evaluate the same policy twice,
  • 62:45 - 62:50
    at least I get back the same answer. This gives me a deterministic function
  • 62:50 - 62:54
    mapping from parameters in my policy
  • 62:54 - 62:56
    to my estimate of the expected payoff.
  • 62:58 - 63:13
    That's just a function that I can try to optimize using the search algorithm.
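As a rough illustration of that last step, the deterministic estimate can be handed to any off-the-shelf derivative-free search routine. The sketch below assumes the hypothetical estimated_payoff function and dimensions from the earlier snippet and uses SciPy's Nelder-Mead simplex search; hill climbing or any other local search over the policy parameters would serve the same purpose.

```python
import numpy as np
from scipy.optimize import minimize

# Negate the deterministic payoff estimate so that minimizing it maximizes payoff.
# estimated_payoff, ACTION_DIM and STATE_DIM come from the sketch above (hypothetical).
theta0 = np.zeros(ACTION_DIM * STATE_DIM)
result = minimize(lambda th: -estimated_payoff(th),
                  x0=theta0,
                  method="Nelder-Mead",
                  options={"maxiter": 2000, "xatol": 1e-4, "fatol": 1e-4})

theta_best = result.x                      # policy parameters found by the search
print("best estimated payoff:", -result.fun)
```

Nelder-Mead is only one choice here; the point is simply that once the objective is deterministic, standard optimization machinery can be applied to it.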
  • 63:13 - 63:17
So we used this algorithm for inverted hovering,
  • 63:17 - 63:20
    and again policy search algorithms tend to work well when you can find
  • 63:20 - 63:24
    a reasonably simple policy mapping from
  • 63:24 - 63:29
the states to the actions. This is especially true for low-level control tasks,
  • 63:29 - 63:33
    which I think of as sort of reflexes almost.
  • 63:33 - 63:34
Conversely,
  • 63:34 - 63:37
    if you want to solve a problem like Tetris where you might plan
  • 63:37 - 63:40
    ahead a few steps about what's a nice configuration of the board, or
  • 63:40 - 63:42
    something like a game of chess,
  • 63:42 - 63:47
or problems of long path planning (go here, then go there, then go there),
  • 63:47 - 63:51
    then sometimes you might apply a value function method instead.
  • 63:51 - 63:55
    But for tasks like helicopter flight, for
  • 63:55 - 63:59
    low level control tasks, for the reflexes of driving or controlling
  • 64:01 - 64:06
various robots, policy search algorithms - where
  • 64:06 - 64:11
we can sometimes more easily approximate the policy directly - work very well.
  • 64:12 - 64:16
    Some [inaudible] the state of RL today. RL algorithms are applied
  • 64:16 - 64:18
    to a wide range of problems,
  • 64:18 - 64:22
    and the key is really sequential decision making. The place where you
  • 64:22 - 64:26
think about applying a reinforcement learning algorithm is when you need to make a decision,
  • 64:26 - 64:28
    then another decision, then another decision,
  • 64:28 - 64:32
    and some of your actions may have long-term consequences. I think that
  • 64:32 - 64:38
is the heart of RL: sequential decision making, where you make multiple decisions,
  • 64:38 - 64:42
    and some of your actions may have long-term consequences. I've shown you
  • 64:42 - 64:44
    a bunch of robotics examples. RL
  • 64:44 - 64:46
    is also
  • 64:46 - 64:49
applied to things like medical decision making, where you may have a patient and you
  • 64:49 - 64:52
    want to choose a sequence of treatments, and you do this now for the patient, and the
  • 64:52 - 64:56
    patient may be in some other state, and you choose to do that later, and so on.
  • 64:56 - 65:01
    It turns out there's a large community of people applying these sorts of tools to queues. So
  • 65:01 - 65:05
    imagine you have a bank where you have people lining up, and
  • 65:05 - 65:06
    after they go to one cashier,
  • 65:06 - 65:10
    some of them have to go to the manager to deal with something else. You have
  • 65:10 - 65:13
    a system of multiple people standing in line in multiple queues, and so how do you route
  • 65:13 - 65:18
    people optimally to minimize the waiting
  • 65:18 - 65:20
time? And not just people, but
  • 65:20 - 65:24
    objects in an assembly line and so on. It turns out there's a surprisingly large community
  • 65:24 - 65:26
    working on optimizing queues.
  • 65:26 - 65:30
    I mentioned game playing a little bit already. Things
  • 65:30 - 65:33
    like financial decision making, if you have a large amount of stock,
  • 65:33 - 65:38
    how do you sell off a large amount - how do you time the selling off of your stock
  • 65:38 - 65:41
    so as to not affect market prices adversely too much? There
  • 65:41 - 65:45
    are many operations research problems, things like factory automation. Can you design
  • 65:45 - 65:46
    a factory
  • 65:46 - 65:50
    to optimize throughput, or minimize cost, or whatever. These are all
  • 65:50 - 65:54
    sorts of problems that people are applying reinforcement learning algorithms to.
  • 65:54 - 65:58
    Let me just close with a few robotics examples because they're always fun, and we just
  • 65:58 - 65:59
    have these videos.
  • 66:00 - 66:07
This video was the work of Zico Kolter and Pieter Abbeel, who are PhD students
  • 66:07 - 66:14
here. They were working on getting a robot dog to climb over difficult rugged
  • 66:14 - 66:17
    terrain. Using a reinforcement learning algorithm,
  • 66:17 - 66:20
    they applied an approach that's
  • 66:20 - 66:24
    similar to a value function approximation approach, not quite but similar.
  • 66:24 - 66:29
    They allowed the robot dog to sort of plan ahead multiple steps, and
  • 66:29 - 66:34
carefully choose its footsteps and traverse rugged terrain.
  • 66:34 - 66:43
This is really state of the art in terms of what you can get a robotic dog to do.
  • 66:43 - 66:46
    Here's another fun one.
  • 66:46 - 66:48
    It turns out that
  • 66:48 - 66:51
wheeled robots are very fuel-efficient. Cars and trucks are almost the
  • 66:51 - 66:55
most fuel-efficient robots in the world.
  • 66:55 - 66:57
    Cars and trucks are very fuel-efficient,
  • 66:57 - 67:01
but legged robots have the ability to traverse more rugged terrain.
  • 67:01 - 67:05
    So this is a robot - this is actually a small scale mockup of a larger
  • 67:05 - 67:07
    vehicle built by Lockheed Martin,
  • 67:07 - 67:10
    but can you come up with a vehicle that
  • 67:10 - 67:14
    has wheels and has the fuel efficiency of wheeled robots, but also
  • 67:14 - 67:19
has legs so it can traverse obstacles?
  • 67:19 - 67:27
Again, a reinforcement learning algorithm was used to design a controller for this robot to make it traverse obstacles, using
  • 67:27 - 67:31
    somewhat complex gaits that would be very hard to design by hand,
  • 67:31 - 67:33
but by choosing a reward function - telling the robot
  • 67:33 - 67:37
there's a plus one reward on top of the goal,
  • 67:37 - 67:45
and a few other things - it learns these sorts of policies automatically.
  • 67:47 - 67:55
    Last couple fun ones - I'll show you a couple last helicopter videos.
  • 67:57 - 68:09
This is the work of, again, PhD students here, Pieter Abbeel and Adam Coates
  • 68:09 - 68:33
    who have been working on autonomous flight. I'll just let you watch. I'll just
  • 69:49 - 69:54
    show you one more. [Inaudible] do this with a real helicopter [inaudible]?
  • 69:54 - 69:55
    Not a full-size helicopter.
  • 69:55 - 70:01
    Only small radio control helicopters. [Inaudible]. Full-size helicopters
  • 70:01 - 70:06
    are the wrong design for this. You shouldn't do this on a full-size helicopter.
  • 70:06 - 70:13
    Many full-size helicopters would fall apart if you tried to do this. Let's see. There's one more.
  • 70:13 - 70:19
    Are there any human [inaudible]? Yes,
  • 70:19 - 70:26
    there are very good human pilots that can.
  • 70:28 - 70:35
    This is just one more
  • 70:44 - 70:46
    maneuver. That was kind of fun. So this is the work of
  • 70:46 - 70:56
Pieter Abbeel and Adam Coates. So that was it. That was all the technical material
  • 70:56 - 71:03
    I wanted to present in this class. I guess
  • 71:03 - 71:10
    you're all experts on machine learning now. Congratulations.
  • 71:10 - 71:14
I hope you've got the sense through this class that this is one of the technologies
  • 71:14 - 71:19
that's really having a huge impact on science, engineering, and industry.
  • 71:19 - 71:23
    As I said in the first lecture, I think many people use machine
  • 71:23 - 71:30
    learning algorithms dozens of times a day without even knowing about it.
  • 71:30 - 71:34
    Based on the projects you've done, I hope that
  • 71:34 - 71:38
    most of you will be able to imagine yourself
  • 71:38 - 71:43
    going out after this class and applying these things to solve a variety of problems.
  • 71:43 - 71:46
    Hopefully, some of you will also imagine yourselves
  • 71:46 - 71:49
    writing research papers after this class, be it on
  • 71:49 - 71:52
    a novel way to do machine learning, or on some way of applying machine
  • 71:52 - 71:55
    learning to a problem that you care
  • 71:55 - 71:59
    about. In fact, looking at project milestones, I'm actually sure that a large fraction of
  • 71:59 - 72:04
    the projects in this class will be publishable, so I think that's great.
  • 72:07 - 72:11
I guess many of you will also go into industry, make new products, and make lots
  • 72:11 - 72:13
    of money using learning
  • 72:13 - 72:16
    algorithms. Remember me
  • 72:16 - 72:20
    if that happens. One of the things I'm excited about is through research or
  • 72:20 - 72:23
    industry, I'm actually completely sure that the people in this class in the
  • 72:23 - 72:24
    next few months
  • 72:24 - 72:27
    will apply machine learning algorithms to lots of problems in
  • 72:27 - 72:32
industrial management and computer science - things like optimizing computer architectures,
  • 72:32 - 72:33
    network security,
  • 72:33 - 72:38
    robotics, computer vision, to problems in computational biology,
  • 72:39 - 72:42
    to problems in aerospace, or understanding natural language,
  • 72:42 - 72:44
    and many more things like that.
  • 72:44 - 72:46
    So
  • 72:46 - 72:50
    right now I have no idea what all of you are going to do with the learning algorithms you learned about,
  • 72:50 - 72:52
    but
  • 72:52 - 72:55
every time I wrap up this class, I always
  • 72:55 - 72:58
    feel very excited, and optimistic, and hopeful about the sorts of amazing
  • 72:58 - 73:03
    things you'll be able to do.
  • 73:03 - 73:10
    One final thing, I'll just give out this handout.
  • 73:31 - 73:35
One final thing is that machine learning has grown out of a larger literature on AI,
  • 73:35 - 73:36
where there's
  • 73:36 - 73:42
this desire to build systems that exhibit intelligent behavior. Machine learning
  • 73:42 - 73:45
    is one of the tools of AI, maybe one that's had a
  • 73:45 - 73:47
    disproportionately large impact,
  • 73:47 - 73:52
    but there are many other ideas in AI that I hope you
  • 73:52 - 73:55
    go and continue to learn about. Fortunately, Stanford
  • 73:55 - 73:59
    has one of the best and broadest sets of AI classes, and I hope that
  • 73:59 - 74:04
    you take advantage of some of these classes, and go and learn more about AI, and more
  • 74:04 - 74:08
    about other fields which often apply learning algorithms to problems in vision,
  • 74:08 - 74:10
    problems in natural language processing in robotics, and so on.
  • 74:10 - 74:15
    So the handout I just gave out has a list of AI related courses. Just
  • 74:15 - 74:18
    running down very quickly, I guess, CS221 is an overview that I
  • 74:18 - 74:19
    teach. There
  • 74:19 - 74:24
    are a lot of robotics classes also: 223A,
  • 74:24 - 74:25
    225A, 225B
  • 74:25 - 74:29
- many robotics classes. There are so many applications
  • 74:29 - 74:32
    of learning algorithms to robotics today. 222
  • 74:32 - 74:36
    and 227 are knowledge representation and reasoning classes.
  • 74:36 - 74:40
    228 - of all the classes on this list, 228, which Daphne
  • 74:40 - 74:43
    Koller teaches, is probably closest in spirit to 229. This is one of
  • 74:43 - 74:45
    the classes I highly recommend
  • 74:45 - 74:49
    to all of my PhD students as well.
  • 74:49 - 74:53
    Many other problems also touch on machine learning. On the next page,
  • 74:54 - 74:58
    courses on computer vision, speech recognition, natural language processing,
  • 74:58 - 75:03
    various tools that aren't just machine learning, but often involve machine learning in many ways.
  • 75:03 - 75:07
    Other aspects of AI, multi-agent systems taught by [inaudible].
  • 75:10 - 75:13
    EE364A is convex optimization. It's a class taught by
  • 75:13 - 75:17
    Steve Boyd, and convex optimization came up many times in this class. If you
  • 75:17 - 75:21
    want to become really good at it, EE364 is a great class. If you're
  • 75:21 - 75:25
    interested in project courses, I also teach a project class
  • 75:25 - 75:27
    next quarter where we spend the whole quarter working
  • 75:27 - 75:32
    on research projects.
  • 75:32 - 75:38
    So I hope you go and take some more of those classes.
  • 75:38 - 75:40
    In closing,
  • 75:40 - 75:43
    let me just say
  • 75:43 - 75:47
    this class has been really fun to teach, and it's very satisfying to
  • 75:47 - 75:52
me personally when we set these insanely difficult homeworks, and
  • 75:52 - 75:55
    then we'd see a solution, and I'd be like, "Oh my god. They actually figured that one out." It's
  • 75:55 - 75:59
    actually very satisfying when I see that.
  • 75:59 - 76:05
Or looking at the milestones, I often go, "Wow, that's really cool. I bet this would be publishable."
  • 76:05 - 76:05
    So
  • 76:05 - 76:09
    I hope you take what you've learned, go forth, and
  • 76:09 - 76:12
    do amazing things with learning algorithms.
  • 76:12 - 76:16
    I know this is a heavy workload class, so thank you all very much for the hard
  • 76:16 - 76:20
    work you've put into this class, and the hard work you've put into learning this material,
  • 76:20 - 76:22
    and
  • 76:22 - 76:23
thank you very much for having been students in this class.
Title:
Lecture 20 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses POMDPs, policy search, and Pegasus in the context of reinforcement learning.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:16:40