Lecture 18 | Machine Learning (Stanford)

  • 0:13 - 0:16
    This presentation is delivered by the Stanford Center for Professional
  • 0:16 - 0:23
    Development.
  • 0:26 - 0:28
    Okay. Welcome back. What I
  • 0:28 - 0:31
    want to do today is talk about
  • 0:31 - 0:35
    one of my favorite algorithms for controlling MDPs that I think is
  • 0:36 - 0:39
    one of the more elegant and efficient and powerful algorithms that
  • 0:39 - 0:41
    I know of.
  • 0:41 - 0:43
    So
  • 0:43 - 0:47
    what I'll do is I'll first start by talking about a couple
  • 0:47 - 0:49
    variations of MDPs
  • 0:49 - 0:52
    that are slightly different from the MDP definition you've seen so far. These
  • 0:52 - 0:54
    are pretty
  • 0:54 - 0:59
    common variations. One is state action rewards,
  • 0:59 - 1:04
    and the other is finite horizon MDPs. Using this slightly modified definition of an
  • 1:04 - 1:07
    MDP, I'll talk about linear dynamical systems. I'll spend a little bit
  • 1:07 - 1:10
    of time talking about models for linear dynamical systems, and then talk about LQR, or
  • 1:10 - 1:13
    linear quadratic regulation control,
  • 1:13 - 1:19
    which will lead us to something called the Riccati equation, which is
  • 1:19 - 1:23
    something we will solve in order to do LQR
  • 1:23 - 1:24
    controls.
  • 1:24 - 1:26
  • 1:26 - 1:27
    So
  • 1:27 - 1:29
    just
  • 1:29 - 1:33
    to recap, and we've seen this definition many times now.
  • 1:33 - 1:37
    We've been defining an MDP
  • 1:37 - 1:42
    as
  • 1:42 - 1:47
    a tuple of states, actions, state transition probabilities, a discount factor gamma and a reward function,
  • 1:47 - 1:50
    where - gamma's
  • 1:50 - 1:53
    the
  • 1:53 - 1:57
    discount factor, a number between zero and one.
  • 1:57 - 2:00
    And R, the reward function,
  • 2:00 - 2:04
    was the function mapping from the states to the rewards -
  • 2:04 - 2:07
    was the function mapping from the states to the real numbers.
  • 2:07 - 2:08
    So we had
  • 2:08 - 2:10
    value
  • 2:10 - 2:14
    iteration, which would do this.
  • 2:22 - 2:26
    So after a while, value iteration will cause V to converge to V star.
  • 2:26 - 2:28
    Then
  • 2:28 - 2:31
    having found the optimal value function, if you
  • 2:31 - 2:33
    compute the optimal policy
  • 2:33 - 2:37
    by taking essentially the argmax of this equation above - the argmax over A
  • 2:37 - 2:44
    of that quantity. So
  • 2:46 - 2:48
    in
  • 2:48 - 2:51
    value iteration,
  • 2:51 - 2:55
    as you iterate - you know, perform this update -
  • 2:55 - 2:56
    the function V will
  • 2:56 - 3:00
    asymptotically converge to V star. So there won't be -
  • 3:00 - 3:04
    so within any finite number of iterations, you only get closer and closer to V star.
  • 3:04 - 3:08
    It actually converges exponentially quickly to V
  • 3:08 - 3:13
    star, but we will never exactly converge to V star in any finite number of iterations.
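
    For reference, here is the update being described, reconstructed from the standard MDP definitions rather than transcribed from the board: value iteration repeatedly applies

        V(s) := R(s) + \gamma \max_a \sum_{s'} P_{sa}(s') \, V(s')

    and once V has (approximately) converged to V star, the optimal policy is read off as

        \pi^*(s) = \arg\max_a \sum_{s'} P_{sa}(s') \, V^*(s')
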
  • 3:15 - 3:20
    So what I want to do now is describe a couple of common variations of
  • 3:20 - 3:20
    MDPs
  • 3:20 - 3:24
    that have slightly different definitions. First, the reward
  • 3:24 - 3:28
    function and then second, we'll do something slightly different
  • 3:28 - 3:29
    from just
  • 3:32 - 3:35
    discounting. Then remember in the last lecture, I said that
  • 3:35 - 3:37
    for infinite-state or continuous-state MDPs,
  • 3:37 - 3:41
    we couldn't apply the most straightforward version of value iteration
  • 3:41 - 3:44
    because if you have a continuous state
  • 3:44 - 3:46
    MDP, we need to use
  • 3:46 - 3:48
    some approximations of the optimal value function.
  • 3:48 - 3:52
    But later in this lecture,
  • 3:52 - 3:56
    I'll talk about a special case of MDPs, where you can actually represent the
  • 3:56 - 3:58
    value function exactly,
  • 3:58 - 4:01
    even if you have an infinite-state space or even if you have a continuous-state
  • 4:01 - 4:02
    space.
  • 4:02 - 4:06
    I'll actually do that - talk about this special class of infinite-state
  • 4:06 - 4:07
    MDPs,
  • 4:07 - 4:11
    using this new variation of the reward function and the alternative to
  • 4:11 - 4:15
    just discounting, as these make the formulation a little easier.
  • 4:15 - 4:17
    So the first
  • 4:17 - 4:21
    variation I want to talk about is
  • 4:21 - 4:25
    state action
  • 4:25 - 4:29
    rewards. So I'm going to change the definition of the reward function. It turns out this
  • 4:29 - 4:31
    won't be
  • 4:31 - 4:33
    a huge deal. In particular, I'll change the reward function
  • 4:33 - 4:36
    to be a function mapping from
  • 4:36 - 4:38
    a state action pair
  • 4:38 - 4:41
    to the real numbers.
  • 4:41 - 4:45
    What I mean by this is just the following.
  • 4:45 - 4:47
    You start off in
  • 4:47 - 4:49
    some state S0. You
  • 4:49 - 4:53
    take an action A0. As a result of your state and action choice, you transition
  • 4:53 - 4:55
    to some new state, S1.
  • 4:55 - 5:00
    You take some action, A1. You transition to some new state, S2. You take some
  • 5:00 - 5:05
    action, A2, and so on. So this is the state action sequence that you see.
  • 5:05 - 5:08
    So in an MDP where you have state action rewards,
  • 5:08 - 5:13
    your total payoff is now defined as
  • 5:13 - 5:14
    this,
  • 5:14 - 5:19
    where your reward function is now a function both of the current state and of the
  • 5:19 - 5:20
    action
  • 5:20 - 5:22
    you took in the current state.
  • 5:22 - 5:27
    So this is my
  • 5:27 - 5:29
    total payoff.
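
    Written out (a reconstruction of the board, not a transcription), the total payoff with state action rewards and discount factor gamma is

        R(s_0, a_0) + \gamma \, R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots
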
  • 5:29 - 5:32
    Then as usual, my goal will be to find a policy -
  • 5:32 - 5:35
    to find the function mapping from the states to the actions,
  • 5:35 - 5:38
    so that when I execute that policy,
  • 5:38 - 5:40
    I can maximize the expected value
  • 5:40 - 5:42
    of my total payoff.
  • 5:42 - 5:47
    So this definition, it actually turns out that given an MDP with
  • 5:47 - 5:48
    state action rewards,
  • 5:48 - 5:49
    you can actually -
  • 5:49 - 5:53
    so by playing with the definition of the states, you can actually reduce this back to an MDP
  • 5:53 - 5:57
    with rewards that are only a function of the states. That may or may
  • 5:57 - 5:58
    not be obvious. Don't worry if
  • 5:58 - 6:01
    it isn't. But
  • 6:01 - 6:05
    using state action rewards allows you to more directly model
  • 6:06 - 6:09
    problems in which different actions have different costs. So
  • 6:09 - 6:13
    a running example is a robot. So if you have a robot,
  • 6:13 - 6:17
    and it's more costly for the robot to move than for it to stay still.
  • 6:17 - 6:21
    If you have an action to stay still, the action to stay still may
  • 6:21 - 6:25
    have a lower cost because you're not using up battery power that you'd have to recharge for that
  • 6:25 - 6:26
    action.
  • 6:26 - 6:30
    Another example would be -
  • 6:30 - 6:33
    actually, another navigation example
  • 6:33 - 6:37
    would be if you have an outdoor vehicle. You need to drive
  • 6:37 - 6:39
    over
  • 6:39 - 6:42
    some sort of outdoor terrain, like very rough rocks
  • 6:42 - 6:46
    or driving over grass. It may be more costly, more difficult,
  • 6:46 - 6:49
    than driving over, say, a paved road. So you may assign
  • 6:49 - 6:51
    an action that requires
  • 6:51 - 6:56
    driving over grass or driving over rocks to be
  • 6:56 - 7:00
    more costly than driving over paved
  • 7:00 - 7:05
    road.
  • 7:05 - 7:08
    So this really isn't a huge change to the definition of
  • 7:08 - 7:15
    an MDP. I won't really bother to justify this a lot, but Bellman's
  • 7:15 - 7:17
    equations
  • 7:17 - 7:20
    generalize in
  • 7:20 - 7:24
    the way that you probably expect. V star of S is now equal to
  • 7:27 - 7:34
    that.
  • 7:37 - 7:41
    So previously, when the reward function was just a function of the state, S, we could
  • 7:41 - 7:44
    take the max and push it in here.
  • 7:44 - 7:45
    But now that
  • 7:46 - 7:50
    the reward is a function of the action you're taking as well, the max comes outside. So
  • 7:50 - 7:51
    this says that
  • 7:51 - 7:57
    your expected total payoff, starting from the state, is equal to
  • 7:57 - 8:02
    first your immediate reward, R of (S, A), for executing some action, A, in state S. Then
  • 8:02 - 8:07
    plus gamma times your future expected total payoff.
  • 8:07 - 8:09
    So
  • 8:09 - 8:11
    this is your expected total payoff if you
  • 8:11 - 8:15
    take the action, A, from the current state. So
  • 8:15 - 8:19
    Well, V star is the optimal value function. So your actual optimal expected total payoff
  • 8:19 - 8:22
    is the max of all actions of this
  • 8:22 - 8:26
    thing on
  • 8:26 - 8:30
    the right. Let's see. Value iteration, which I'm abbreviating VI,
  • 8:30 - 8:34
    is really the same algorithm. V of
  • 8:34 - 8:38
    S is updated as
  • 8:38 - 8:45
    max over A of R of (S, A) plus that same thing as on the right-hand side of the equation above. You'd
  • 8:46 - 8:50
    be updating V of S using Bellman's equations. Again, you get value iteration,
  • 8:50 - 8:53
    exactly the same way.
  • 8:53 - 8:56
    Then finally,
  • 8:57 - 9:00
    having found the optimal value function, V star,
  • 9:00 - 9:03
    using the value iteration algorithm,
  • 9:03 - 9:08
    you can then compute the optimal policy, pi star of S as same as
  • 9:08 - 9:12
    before. The best action to take in the state, S, is the
  • 9:12 - 9:19
    action, A, that maximizes the thing on the right-hand
  • 9:28 - 9:32
    side. So having used value iteration
  • 9:32 - 9:35
    to compute the optimal value function, you can then find
  • 9:35 - 9:40
    pi star using that.
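
    For reference, reconstructing the board contents from the definitions above, the Bellman equation, the value iteration update and the policy extraction with state action rewards are

        V^*(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') \, V^*(s') \Big]

        V(s) := \max_a \Big[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') \, V(s') \Big]

        \pi^*(s) = \arg\max_a \Big[ R(s, a) + \gamma \sum_{s'} P_{sa}(s') \, V^*(s') \Big]
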
  • 9:42 - 9:46
    So this was the easier of the two
  • 9:46 - 9:48
    variations of MDPs [inaudible]
  • 9:48 - 9:49
    so far.
  • 9:49 - 9:56
    Any questions? Actually, are there questions about this?
  • 10:01 - 10:03
    So
  • 10:03 - 10:07
    the other variation, the other alternative definition,
  • 10:08 - 10:11
    will be
  • 10:11 - 10:16
    finite horizon
  • 10:16 - 10:19
    MDPs. So
  • 10:19 - 10:22
    finite horizon MDP
  • 10:22 - 10:25
    comprises a tuple of the states S, the actions A, the state
  • 10:25 - 10:28
    transition probabilities
  • 10:28 - 10:31
    PSA, and the parameter T
  • 10:31 - 10:32
    and the reward function.
  • 10:32 - 10:36
    Here,
  • 10:36 - 10:37
    T
  • 10:37 - 10:41
    is a parameter called the horizon time.
  • 10:41 - 10:43
    Concretely,
  • 10:43 - 10:46
    what this really means is that we'll
  • 10:46 - 10:48
    be taking actions in the MDP
  • 10:48 - 10:55
    only for a total of capital T time steps. So we won't use discounting anymore. [Audio cuts out]
  • 11:00 - 11:04
    You start in some state S zero, take action A0. Get to some other state S1, take
  • 11:04 - 11:07
    action A1 and so on.
  • 11:07 - 11:10
    Eventually, you get to some state, ST,
  • 11:10 - 11:12
    after T time steps. So
  • 11:12 - 11:17
    my total payoff, now,
  • 11:17 - 11:18
    will be
  • 11:18 - 11:19
    given
  • 11:19 - 11:22
    by this
  • 11:22 - 11:26
    sum from time zero up to time T of my rewards.
  • 11:28 - 11:32
    Okay? My goal, as usual
  • 11:32 - 11:34
    - so this is my total payoff.
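
    Reconstructing the board, the finite horizon total payoff being maximized is the undiscounted sum

        R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T)
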
  • 11:34 - 11:38
    My goal, as usual, is to maximize the expected value of my total payoff. We want to come up
  • 11:38 - 11:40
    with a policy to maximize the
  • 11:40 - 11:44
    expected value of this total payoff. The key difference is that
  • 11:44 - 11:48
    the world only exists for capital T time steps, and after that,
  • 11:48 - 11:51
    there are no more rewards to be collected.
  • 11:51 - 11:55
    So this turns out to be somewhat of a difference because it turns out that the optimal
  • 11:55 - 12:00
    policy
  • 12:01 - 12:03
    may be non-stationary.
  • 12:06 - 12:08
    The
  • 12:08 - 12:10
    term, stationary, means that it doesn't depend
  • 12:10 - 12:13
    on time. Non-stationary
  • 12:13 - 12:14
    means that it may depend on time.
  • 12:14 - 12:18
    So non-stationary
  • 12:18 - 12:22
    roughly means that my optimal action to take will be different for different time
  • 12:22 - 12:22
    steps.
  • 12:22 - 12:24
    That's what nonstationary
  • 12:24 - 12:27
    means. Just as an example of that,
  • 12:27 - 12:31
    imagine that we have some robot. Let's say the
  • 12:31 - 12:34
    robot
  • 12:34 - 12:35
    is here.
  • 12:35 - 12:38
    Let's say that there's
  • 12:38 - 12:41
    a goal over there with a plus one reward.
  • 12:43 - 12:46
    Much further away, there's a plus ten reward.
  • 12:46 - 12:51
    So depending on how much time you have left on the clock,
  • 12:51 - 12:53
    it may be better to go after the plus one
  • 12:53 - 12:55
    or the plus ten reward. If it's still early
  • 12:55 - 12:59
    in the game, you still have a lot of time, it may be better to head toward
  • 12:59 - 13:03
    the plus-ten reward and get a much larger
  • 13:03 - 13:06
    reward. If you only have a couple of time steps left, if the clock has nearly reached
  • 13:06 - 13:08
    the time, capital T,
  • 13:08 - 13:12
    then you may not have enough time to get to the plus ten reward. You'd be better
  • 13:12 - 13:16
    off heading for the plus one reward that's much more close by.
  • 13:16 - 13:20
    So what this example illustrates is that when you're in that state,
  • 13:20 - 13:24
    the best action to take could be to go left or to go right, depending on what
  • 13:24 - 13:26
    time it is. So just as an example, illustrating
  • 13:26 - 13:32
    how the optimal policy can be non-stationary.
  • 13:35 - 13:36
    In fact,
  • 13:36 - 13:38
    since we have non-stationary
  • 13:38 - 13:43
    policies anyway in the sequence, what I'm going to do next, I'm going to
  • 13:43 - 13:44
    allow
  • 13:44 - 13:48
    non-stationary transition probabilities as well. So I'll just write that
  • 13:48 - 13:50
    up there.
  • 13:50 - 13:52
    What I mean is that
  • 13:52 - 13:57
    so far, I've been assuming that the state ST plus one is drawn from the state
  • 13:57 - 13:59
    transition probabilities PSA, indexed
  • 13:59 - 14:02
    by the previous state and the previous action.
  • 14:02 - 14:06
    I've been assuming that these state transition probabilities are the same
  • 14:06 - 14:10
    for all times. So if I'm in some state and take some action,
  • 14:10 - 14:13
    the distribution over the next state doesn't change.
  • 14:13 - 14:16
    It doesn't depend on time.
  • 14:16 - 14:17
    So
  • 14:17 - 14:22
    I'm going to allow a slightly more general definition as well, in which we have
  • 14:22 - 14:25
    non-stationary state transition probabilities
  • 14:25 - 14:25
    so that
  • 14:25 - 14:28
    the chance of where you end up next
  • 14:28 - 14:30
    may also
  • 14:30 - 14:33
    depend on what time it is.
  • 14:33 - 14:37
    So as examples of this non-stationary state
  • 14:37 - 14:39
    transition probabilities,
  • 14:39 - 14:44
    one example would be if you model flying an aircraft over a long distance.
  • 14:44 - 14:49
    Then as the aircraft flies, you burn fuel and become lighter. So the
  • 14:49 - 14:53
    dynamics of the aircraft actually change over time. The mass of the aircraft can
  • 14:53 - 14:55
    change significantly over time as you burn fuel.
  • 14:55 - 14:56
    So depending on
  • 14:56 - 15:01
    what time it is, your next state could actually
  • 15:01 - 15:03
    depend on not only your current state and your action,
  • 15:03 - 15:06
    but also on how much fuel you burn, therefore, what time it is.
  • 15:07 - 15:11
    Other examples, another aerospace one, is
  • 15:11 - 15:15
    if you have the weather forecast for the next 24 hours, say,
  • 15:15 - 15:18
    you know what the winds and precipitation are going to be like over the next 24 hours. Then again,
  • 15:18 - 15:22
    if you fly the aircraft from, say, here to New York,
  • 15:22 - 15:25
    it may cost different amounts to fly
  • 15:25 - 15:29
    different routes at different times. Maybe flying over
  • 15:29 - 15:33
    the Rockies may cost different amounts, depending on whether you do it now, when there's really great weather, or
  • 15:33 - 15:34
    if you do it
  • 15:34 - 15:38
    a few hours from now, when the weather may be forecast really bad.
  • 15:39 - 15:41
    For an
  • 15:41 - 15:46
    example you see everyday, same thing for traffic, right?
  • 15:46 - 15:48
    There's at least -
  • 15:48 - 15:53
    depending on where you live, certainly here in California, there are times of day where traffic is
  • 15:53 - 15:57
    really bad in lots of places. So the costs of driving certain roads may vary, depending on
  • 15:57 - 15:59
    what time of day it is.
  • 15:59 - 16:04
    Lots of other examples. Industrial automation, different machines
  • 16:04 - 16:07
    in the factory may be available to different degrees at different times of day. They
  • 16:07 - 16:10
    cost different amounts to hire different workers, depending on whether you pay
  • 16:10 - 16:13
    overtime or whatever.
  • 16:13 - 16:16
    So the cost of doing different things in the factory can also
  • 16:16 - 16:17
    be a function of time.
  • 16:17 - 16:18
  • 16:18 - 16:24
    The state transition probabilities can also be a function of time.
  • 16:26 - 16:28
    Lastly,
  • 16:28 - 16:32
    while we're doing this as well, to make this fully general, we might as
  • 16:32 - 16:38
    well have non-stationary reward functions as well,
  • 16:38 - 16:41
    where you might also index the reward
  • 16:41 - 16:43
    function with these time subscripts,
  • 16:43 - 16:45
    where the cost of doing different things may depend on
  • 16:45 - 16:48
    the time as well.
  • 16:51 - 16:56
    Actually, there's more examples of non-stationary MDPs, so let's - so now we
  • 16:56 - 16:59
    have a nonstationary policy. Let's talk about an algorithm to actually
  • 16:59 - 17:02
    try to find the optimal policy.
  • 17:04 - 17:10
    So let me
  • 17:10 - 17:14
    define the following.
  • 17:14 - 17:20
    This is now a slightly modified definition of the optimal value function. I'll
  • 17:20 - 17:27
    just
  • 17:31 - 17:38
    write this down, I guess.
  • 17:39 - 17:43
    So I'm going to define the optimal value function, and this going to be
  • 17:43 - 17:45
    indexed by T, with a subscript T. The
  • 17:45 - 17:46
    optimal value of a state
  • 17:48 - 17:50
    for time, T, we're going to define as your
  • 17:50 - 17:52
    optimal
  • 17:52 - 17:57
    sum of rewards for if you start the MDP at that state, S,
  • 17:57 - 17:59
    and if the clock starts off
  • 17:59 - 18:02
    at time,
  • 18:02 - 18:06
    lowercase T. So the optimal value of a state will depend on what time it is and how
  • 18:06 - 18:10
    much time you have left to run this
  • 18:10 - 18:11
    MDP. Therefore,
  • 18:11 - 18:16
    the sum on the right sums only for time T, time T plus one, time T plus two up
  • 18:16 - 18:18
    to time,
  • 18:18 - 18:20
    capital T.
  • 18:23 - 18:27
    I'll just state in English again, this is your expected optimal
  • 18:27 - 18:28
    total payoff
  • 18:28 - 18:31
    if you start your system in a state, S,
  • 18:31 - 18:35
    and if the clock is already at time,
  • 18:35 - 18:37
    lowercase T.
  • 18:38 - 18:43
    So it turns out there's a version of value iteration you can use for this.
  • 18:43 - 18:46
    Let me just write out the value iteration algorithm for this.
  • 18:46 - 18:48
    It turns out
  • 18:48 - 18:49
    you can -
  • 18:49 - 18:54
    well, let me just write this
  • 18:54 - 18:57
    out. I'll write this below.
  • 18:57 - 19:00
    It turns out you can compute the optimal value function for the MDP using the
  • 19:00 - 19:03
    following recursion, which is very similar to
  • 19:03 - 19:06
    what we have for value iteration. We're going
  • 19:06 - 19:07
    to set V
  • 19:07 - 19:12
    of S to be equal to the max over
  • 19:12 - 19:13
    A,
  • 19:13 - 19:20
    same as before, right? Okay?
  • 19:41 - 19:42
    So
  • 19:42 - 19:47
    if I start the clock at time T and from state S,
  • 19:47 - 19:49
    my expected total payoff is equal to
  • 19:49 - 19:52
    the maximum over the actions I may take
  • 19:52 - 19:54
    of my immediate reward. Taking that
  • 19:54 - 19:56
    action, A, in that state,
  • 19:56 - 20:00
    S. Then plus my expected future payoff. So if I take action, A,
  • 20:00 - 20:02
    I would transition with probability P subscript SA of S prime
  • 20:02 - 20:06
    to some new state, S
  • 20:06 - 20:09
    prime. If I get to the state, S prime,
  • 20:09 - 20:12
    my total expected payoff from the
  • 20:12 - 20:14
    state S prime would be V star -
  • 20:14 - 20:19
    now subscript T plus one - of S prime. Subscript T plus one
  • 20:19 - 20:23
    reflects that after I've taken one action, my clock will have advanced from
  • 20:23 - 20:26
    time T to time T plus one. So this is now V
  • 20:26 - 20:31
    star subscript T plus one.
  • 20:31 - 20:34
    So this expresses V star of T
  • 20:34 - 20:36
    in terms of V star T plus one.
  • 20:36 - 20:40
    Then lastly, to start off this recursion, you
  • 20:40 - 20:42
    would have V star,
  • 20:42 - 20:43
    capital T
  • 20:43 - 20:50
    is equal to - it's just equal
  • 20:51 - 20:53
    to that.
  • 20:53 - 20:57
    If you're already at time, capital T, then you just get to take one action, and then
  • 20:57 - 20:59
    the clock runs out. So this is
  • 20:59 - 21:04
    V star capital T. Your value of starting in some state, S, with no time -
  • 21:04 - 21:09
    with just one time step left on the clock.
  • 21:09 - 21:13
    So in the case of a finite horizon MDP, this actually gives us a very nice
  • 21:14 - 21:16
    dynamic programming algorithm
  • 21:16 - 21:17
    in which you can
  • 21:17 - 21:21
    start off by computing V star of T.
  • 21:21 - 21:24
    Then you use this backward recursion to compute
  • 21:24 - 21:25
    V star of
  • 21:25 - 21:29
    capital T minus one, capital T minus two and so on. We compute V star
  • 21:29 - 21:32
    of T and T minus one and so on. It recurses backwards
  • 21:32 - 21:35
    until you've computed V star
  • 21:35 - 21:38
    for all of your time steps. Can you see this
  • 21:38 - 21:40
    board?
  • 21:40 - 21:42
    Cool. Then
  • 21:45 - 21:48
    the final step is
  • 21:48 - 21:53
    - previously, we said that pi star of S - I'm going to come back and change this a bit - was the argmax over
  • 21:53 - 21:56
    A of R of S
  • 21:56 - 21:57
    plus gamma
  • 21:57 - 22:04
    times the sum over S prime of PSA of S prime, times V star of S prime - this
  • 22:04 - 22:07
    is sort of what we had. In
  • 22:07 - 22:10
    the finite horizon case, the
  • 22:10 - 22:13
    optimal action may depend on what time it is. So the optimal action to take at
  • 22:13 - 22:17
    time T, is the argmax over actions, A,
  • 22:17 - 22:19
    This
  • 22:19 - 22:22
    is basically the
  • 22:22 - 22:27
    argmax of exactly that same thing on the right-hand side as we had
  • 22:27 - 22:30
    in our dynamic programming
  • 22:30 - 22:33
    algorithm. So you do this for every time step, and now you compute it,
  • 22:33 - 22:37
    your optimal policy for different time
  • 22:37 - 22:40
    steps. Again, this is a non-stationary policy
  • 22:40 - 22:42
    because pi star
  • 22:42 - 22:47
    of S may depend on what time it is.
  • 22:47 - 22:50
    So to summarize, the difference between this
  • 22:50 - 22:57
    and the earlier version of value iteration is that - so what you do is you compute V star of T.
  • 22:57 - 23:01
    Then using the backwards recursion of that dynamic programming algorithm, you compute V star T
  • 23:01 - 23:04
    minus one. Then V star T minus two
  • 23:04 - 23:07
    and so on down to V star of zero.
  • 23:07 - 23:10
    Then from these, you compute pi star.
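
    As a concrete illustration of this backward recursion, here is a minimal sketch in Python for a small discrete-state, discrete-action finite horizon MDP. The arrays P and R and the horizon T are hypothetical placeholders rather than anything from the lecture, and stationary rewards and dynamics are assumed for simplicity; the recursion itself is the one just described.

        import numpy as np

        def finite_horizon_dp(P, R, T):
            # P[a] is an (n_states x n_states) matrix of transition probabilities P_sa(s').
            # R is an (n_states x n_actions) array of state-action rewards R(s, a).
            # T is the horizon time; actions are taken at times 0, 1, ..., T.
            n_states, n_actions = R.shape
            V = np.zeros((T + 1, n_states))               # V[t, s]  = V*_t(s)
            pi = np.zeros((T + 1, n_states), dtype=int)   # pi[t, s] = pi*_t(s)

            # Base case: at time T you take one last action and the clock runs out.
            V[T] = R.max(axis=1)
            pi[T] = R.argmax(axis=1)

            # Backward recursion:
            # V*_t(s) = max_a [ R(s, a) + sum_s' P_sa(s') V*_{t+1}(s') ]
            for t in range(T - 1, -1, -1):
                Q = R + np.stack([P[a] @ V[t + 1] for a in range(n_actions)], axis=1)
                V[t] = Q.max(axis=1)
                pi[t] = Q.argmax(axis=1)
            return V, pi

    One pass of this loop gives the exact value functions and the (generally non-stationary) optimal policy, which is the point made just below about not needing to iterate to convergence.
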
  • 23:13 - 23:17
    So one - there's not a huge difference, but one minor
  • 23:17 - 23:18
    difference compared to
  • 23:18 - 23:22
    the infinite horizon discounted case
  • 23:22 - 23:26
    is that by running this recursion once, you now have exactly
  • 23:26 - 23:30
    the right value function. So this just computes the value function, rather
  • 23:30 - 23:31
    than
  • 23:31 - 23:38
    merely converging asymptotically. This just gives you the right value function with one
  • 23:38 - 23:40
  • 23:40 - 23:43
    pass. Cool. Any questions there? Yeah.
  • 23:43 - 23:48
    [Inaudible].
  • 23:50 - 23:54
    This computation's much shorter than value iteration. So
  • 23:54 - 23:58
    sort of yes and no. It depends on how large capital T is. [Inaudible] the normal MDP, could
  • 23:58 - 24:05
    [inaudible] and then use this case for that case? I see.
  • 24:06 - 24:09
    So for the normal MDP, can you assume capital T and then assume this? So
  • 24:09 - 24:13
    it actually turns out that - that's a
  • 24:13 - 24:15
    great question. Let
  • 24:15 - 24:19
    me just answer this in a hand-wavy way.
  • 24:19 - 24:23
    So it actually turns out for a discounted infinite horizon MDP
  • 24:23 - 24:26
    where some discount factor gamma.
  • 24:26 - 24:29
    So what you can do is you can say,
  • 24:29 - 24:31
    after T time steps,
  • 24:31 - 24:35
    gamma to the T would be really small. It would be like [inaudible] something. I
  • 24:35 - 24:36
    don't
  • 24:36 - 24:40
    really care what happens after that many times because the rewards are
  • 24:40 - 24:43
    multiplied by gamma to the T. After that, I don't really care. So you can
  • 24:43 - 24:44
    ask,
  • 24:44 - 24:45
    can I take
  • 24:45 - 24:48
    my infinite horizon discounted MDP
  • 24:48 - 24:53
    and approximate that with a finite horizon MDP where the number of times, steps T,
  • 24:53 - 24:55
    is chosen so that this holds true. So it
  • 24:55 - 24:58
    turns out you can do that.
  • 24:58 - 25:02
    Then you end up with some value for T. You can solve for T so
  • 25:02 - 25:04
    that this holds true.
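
    As a worked version of that calculation - using my own notation, with R_max a bound on the rewards and epsilon a tolerance, neither of which is pinned down in the transcript - the payoff collected after time T is at most gamma^T R_max / (1 - gamma), so it suffices to choose T with

        \gamma^T \cdot \frac{R_{\max}}{1 - \gamma} \le \epsilon
        \quad\Longleftrightarrow\quad
        T \ge \frac{\log\big( R_{\max} / (\epsilon (1 - \gamma)) \big)}{\log(1 / \gamma)}
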
  • 25:04 - 25:10
    It turns out you can prove that if you took the original value iteration algorithm
  • 25:10 - 25:15
    and if you run that original value iteration algorithm - the version for discounted MDPs -
  • 25:15 - 25:17
    If you run that for
  • 25:17 - 25:20
    this same number of time steps,
  • 25:20 - 25:24
    you will end up with an approximation to the value function that is
  • 25:24 - 25:27
    about this close, up to some small constant factors.
  • 25:27 - 25:31
    So to do that, you end up with roughly the same amounts of computation anyway.
  • 25:31 - 25:33
    Then
  • 25:33 - 25:36
    you actually end up with a non-stationary policy, which is more expensive to keep
  • 25:37 - 25:41
    around. You need to keep around the different policy every time step, which is not
  • 25:41 - 25:47
    as nice as if you had the stationary policy, same policy for all times. So
  • 25:47 - 25:51
    there are other reasons, but sometimes you might take an
  • 25:51 - 25:55
    infinite horizon discounted problem and approximate it to a finite horizon problem.
  • 25:55 - 25:56
    But
  • 25:56 - 25:59
    this particular reason is
  • 25:59 - 26:01
    not the one. That makes
  • 26:01 - 26:04
    sense. More
  • 26:04 - 26:11
    questions? Interviewee: [Inaudible]?
  • 26:13 - 26:15
    Is
  • 26:15 - 26:17
    there a gamma in this?
  • 26:17 - 26:20
    So if you want, you can actually
  • 26:20 - 26:23
    change the definition of an MDP
  • 26:23 - 26:28
    and use a finite horizon discounted MDP. If you want, you can do that. You
  • 26:28 - 26:29
    can actually come
  • 26:29 - 26:32
    in and put a gamma there,
  • 26:32 - 26:35
    and use discounting with the finite horizon.
  • 26:35 - 26:38
    It turns out that usually, for most
  • 26:38 - 26:41
    problems that people deal with, you either use discounting
  • 26:41 - 26:43
    or you
  • 26:43 - 26:47
    use the finite horizon. It's less common to do both, but you can certainly do that as well.
  • 26:47 - 26:52
    One of the nice things about discounting is it makes sure your value function is finite.
  • 26:53 - 26:58
    Algorithmically and mathematically, one of the reasons to use discounting is
  • 26:58 - 27:01
    because you're multiplying your rewards by an exponentially decaying factor. It's a geometrically
  • 27:01 - 27:03
    decaying series.
  • 27:03 - 27:05
    That ensures that the value function is always finite.
  • 27:05 - 27:08
    This is a really nice mathematical property when you do discounting.
  • 27:08 - 27:11
    So when you have a finite horizon anyway,
  • 27:11 - 27:15
    then the value function's also guaranteed to be finite. So with that, you
  • 27:15 - 27:22
    don't have to use discounting. But if you want, you can actually discount
  • 27:24 - 27:28
    as well. Interviewee: [Inaudible]. Instructor (Andrew Ng):Yeah, yes, you're right. If you want, you can redefine the reward
  • 27:28 - 27:35
    function to fold the discounting into the reward function, since we have nonstationary rewards as well. So
  • 27:40 - 27:41
    that was
  • 27:41 - 27:44
    finite horizon
  • 27:44 - 27:45
    MDPs.
  • 27:45 - 27:49
    What I want to do now is actually use
  • 27:49 - 27:53
    both of these ideas, your state action rewards and your finite horizon MDPs
  • 27:53 - 27:58
    to describe a special case of MDPs that
  • 27:58 - 28:00
    makes very strong assumptions about the problem.
  • 28:00 - 28:01
  • 28:01 - 28:04
    But these assumptions are reasonable for many systems. With these
  • 28:04 - 28:06
    assumptions, what we come up with, I
  • 28:06 - 28:12
    think, are very nice and very elegant algorithms for solving even very large
  • 28:12 - 28:16
    MDPs.
  • 28:30 - 28:33
    So let's talk about linear
  • 28:33 - 28:36
    quadratic regulation.
  • 28:52 - 28:56
    We just talked about dynamic programming for finite horizon MDPs, so
  • 28:56 - 28:58
    just remember that algorithm.
  • 28:58 - 29:02
    When I come back to talk about an algorithm for solving LQR problems,
  • 29:02 - 29:05
    I'm actually going to use exactly that dynamic programming algorithm
  • 29:05 - 29:10
    that you just saw for finite horizon MDPs. I'll be using exactly
  • 29:10 - 29:14
    that algorithm again. So just remember that for now.
  • 29:14 - 29:16
    So let's talk about LQR.
  • 29:16 - 29:22
    So I want to take these ideas and apply them to MDPs with continuous
  • 29:22 - 29:25
    state spaces and maybe even continuous action
  • 29:25 - 29:26
    spaces. So
  • 29:26 - 29:29
    to specify an
  • 29:29 - 29:34
    MDP, I need to give you this five-tuple
  • 29:34 - 29:36
    of states, actions, state transition probabilities, the horizon time and the reward. I'm going to use
  • 29:36 - 29:41
    the finite horizon, capital T, rather than
  • 29:41 - 29:43
    discounting.
  • 29:43 - 29:48
    So in LQR problems, I'm going to assume the following. I'm
  • 29:48 - 29:51
    going to assume that the
  • 29:51 - 29:54
    state space is RN.
  • 29:54 - 29:58
    And I'm going to assume, also, a continuous set of
  • 29:58 - 30:04
    actions lie in RD.
  • 30:04 - 30:08
    To specify the state transition probabilities, PSA,
  • 30:08 - 30:12
    I need to tell you what the distribution of the next state is, given the current state and
  • 30:12 - 30:14
    the current action.
  • 30:14 - 30:17
    So we actually saw a little bit of this in the last lecture. I want to assume
  • 30:17 - 30:20
    the next state, ST plus one,
  • 30:20 - 30:23
    is going to be a linear function
  • 30:23 - 30:25
    of the previous
  • 30:25 - 30:27
    state, AST plus BAT
  • 30:27 - 30:29
    plus WT -
  • 30:29 - 30:35
    oh, excuse me. I meant to subscript that.
  • 30:55 - 30:59
    Where WT is Gaussian noise with mean zero and some covariance
  • 30:59 - 31:02
    given by sigma
  • 31:02 - 31:07
    W. I've subscripted A and B here with subscripts T to show
  • 31:07 - 31:09
    that
  • 31:09 - 31:12
    these matrixes could change over time. So this would be
  • 31:12 - 31:15
    non-stationary dynamics.
  • 31:15 - 31:20
    As a point of notation, unfortunately compiling
  • 31:20 - 31:22
    ideas from multiple literatures,
  • 31:22 - 31:26
    it's sort of unfortunate that capital A denotes both a set of actions
  • 31:26 - 31:28
    as well as
  • 31:28 - 31:30
    a matrix.
  • 31:30 - 31:34
    When you see A later on, A will usually be used to denote a matrix,
  • 31:34 - 31:39
    rather than a set of actions. So I'm overloading notation again, but
  • 31:39 - 31:42
    unfortunately the notational conventions when you have
  • 31:42 - 31:47
    research ideas in multiple research communities, often they share the same symbol.
  • 31:48 - 31:50
    So just to be concrete,
  • 31:50 - 31:53
    AT is a matrix
  • 31:53 - 31:56
    that's N
  • 31:56 - 32:01
    by N. BT are matrices that are N by D.
  • 32:02 - 32:06
    Just to be completely clear, right, the matrixes A and B,
  • 32:06 - 32:09
    I'm going to assume, are fixed and known in advance. So I'm going to
  • 32:09 - 32:12
    give you the matrices,
  • 32:12 - 32:16
    A and B, and I'm going to give you sigma W. Your job is to find a good policy
  • 32:16 - 32:17
    for
  • 32:17 - 32:19
    this MDP. So in other words,
  • 32:19 - 32:21
    this is my specification
  • 32:21 - 32:25
    of the state transition probabilities.
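
    Collecting that specification in one place (a reconstruction of the board, using the dimensions stated above):

        s_{t+1} = A_t s_t + B_t a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w),

        A_t \in \mathbb{R}^{n \times n}, \quad B_t \in \mathbb{R}^{n \times d}, \quad s_t \in \mathbb{R}^n, \quad a_t \in \mathbb{R}^d
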
  • 32:25 - 32:28
    Looking ahead,
  • 32:28 - 32:31
    we see this later, it turns out this
  • 32:31 - 32:38
    noise term is not very important. So
  • 32:40 - 32:41
    it turns out that
  • 32:41 - 32:44
    the treatment of the noise term is not very important.
  • 32:44 - 32:46
    We'll see this later. We can
  • 32:46 - 32:47
    pretty much
  • 32:47 - 32:49
    ignore the noise
  • 32:49 - 32:53
    term, and we'll still do fine. This is just a warning
  • 32:53 - 32:57
    in the sequel, what I do later, I might be slightly sloppy
  • 32:57 - 32:58
    in my treatment
  • 32:58 - 33:03
    of the noise term. In this very special case, it would be unimportant.
  • 33:03 - 33:04
    The
  • 33:04 - 33:06
    last
  • 33:06 - 33:09
    thing I have to specify is some horizon time,
  • 33:09 - 33:13
    and then I also have some
  • 33:13 - 33:14
    reward
  • 33:14 - 33:17
    function.
  • 33:17 - 33:20
    For LQR control, I'm going to assume that a reward function
  • 33:20 - 33:27
    can be written as
  • 33:30 - 33:32
    this, where
  • 33:32 - 33:35
    UT is a matrix that's N by N. VT is
  • 33:35 - 33:36
    a
  • 33:38 - 33:41
    matrix that's D by D. I'll
  • 33:41 - 33:44
    assume that UT
  • 33:44 - 33:47
    and VT are both positive semi-definite.
  • 33:47 - 33:51
    Are both PSD. So
  • 33:51 - 33:57
    the fact that UT and VT are both positive semi-definite matrices,
  • 33:57 - 34:00
    that implies that
  • 34:00 - 34:02
    ST transpose UT ST is greater than or equal to
  • 34:02 - 34:07
    zero. Similarly, AT transpose
  • 34:07 - 34:10
    VT AT is greater than or equal to zero.
  • 34:10 - 34:17
    So this implies that
  • 34:17 - 34:20
    your rewards are always negative. This is a
  • 34:20 - 34:24
    somewhat depressing MDP in which there are only costs and no positive rewards, right,
  • 34:24 - 34:24
    because of
  • 34:24 - 34:27
    the minus sign
  • 34:27 - 34:29
    there.
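
    Written out (reconstructing the board), the reward function being assumed is

        R^{(t)}(s_t, a_t) = -\big( s_t^\top U_t s_t + a_t^\top V_t a_t \big), \qquad U_t \succeq 0, \ V_t \succeq 0,

    which is why every reward is less than or equal to zero.
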
  • 34:29 - 34:36
    So
  • 34:43 - 34:47
    as
  • 34:47 - 34:50
    a complete example for how you might want to apply this,
  • 34:50 - 34:54
    you've seen my helicopter videos, right? So one thing is,
  • 34:54 - 34:56
    for example,
  • 34:56 - 34:58
    suppose you have a helicopter,
  • 34:58 - 35:01
    and you want the state ST to be
  • 35:01 - 35:04
    as close to zero as possible.
  • 35:05 - 35:08
    Then
  • 35:08 - 35:10
    you might choose UT
  • 35:10 - 35:13
    to be equal to the identity matrix.
  • 35:13 - 35:17
    This will make R of ST, AT be
  • 35:17 - 35:20
    equal to ST transpose
  • 35:20 - 35:24
    ST. But that's just
  • 35:27 - 35:31
    - I'll just
  • 35:31 - 35:34
    write that down. [Inaudible] oh, excuse me - minus.
  • 35:34 - 35:38
    That's the negative of the squared norm of the state vector. So this would be penalizing the
  • 35:38 - 35:40
    system
  • 35:40 - 35:41
    quadratically
  • 35:41 - 35:45
    for having a state that's far from zero, assuming that zero's the origin state. So if
  • 35:45 - 35:47
    your goal is to make a helicopter
  • 35:47 - 35:52
    hover around the state zero, then you might choose this sort of reward function. It
  • 35:52 - 35:54
    turns out
  • 35:55 - 35:59
    it's also very common to choose a cost
  • 35:59 - 36:03
    for the action. So suppose I choose VT to be equal to an identity matrix. I get minus
  • 36:03 - 36:05
    AT transpose
  • 36:05 - 36:10
    AT
  • 36:10 - 36:12
    here. Then I'm penalizing large actions as well.
  • 36:12 - 36:16
    Including a quadratic cost for actions, it's also a fairly common
  • 36:16 - 36:17
    thing to do.
  • 36:17 - 36:20
    In practice, this tends to be effective
  • 36:20 - 36:23
    at discouraging your system from
  • 36:23 - 36:26
    jerking the controls around. This discourages making very huge control commands. Having
  • 36:26 - 36:28
    a term like this
  • 36:28 - 36:32
    reward function often makes many systems behave better.
  • 36:32 - 36:36
    Of course, you can choose different values - you can have different values on the diagonal to
  • 36:36 - 36:37
    give different
  • 36:37 - 36:40
    state variables different weights and
  • 36:40 - 36:45
    so on. So lots of possible choices for U and V. This is one example.
  • 36:50 - 36:56
    So
  • 36:56 - 36:59
    for the next few steps,
  • 36:59 - 37:04
    I'm going to write out things, I'm going to derive things,
  • 37:04 - 37:09
    for the general case of non-stationary dynamics. I'm going - as I write
  • 37:09 - 37:12
    out more math and more equations for LQR,
  • 37:12 - 37:14
    I'm going to try write it out
  • 37:14 - 37:19
    for the fairly general case of time varied dynamics and time varied reward
  • 37:19 - 37:22
    functions. So I'm [inaudible] function.
  • 37:22 - 37:26
    But for purposes of understanding this material,
  • 37:26 - 37:29
    you might want to think of just ignoring many of the subscripts, in terms
  • 37:29 - 37:32
    of T. So
  • 37:34 - 37:36
    for the sake of understanding this material,
  • 37:36 - 37:40
    you might want to mentally assume that there are some fixed matrix, A, so
  • 37:40 - 37:46
    that A is equal to A1, equals A2, equals A3 and so on.
  • 37:46 - 37:51
    Similarly, there's some matrix B.
  • 37:51 - 37:52
    Okay?
  • 37:52 - 37:56
    So write it out for the fully general, non-stationary case, but
  • 37:56 - 37:59
    you might just want to ignore many of the time subscripts and imagine the
  • 37:59 - 38:02
    stationary case for now.
  • 38:07 - 38:10
    Quite a bit later,
  • 38:10 - 38:14
    we're going to talk about an extension of this called differential dynamic programming that will
  • 38:14 - 38:17
    actually use a non-stationary
  • 38:17 - 38:21
    [inaudible] to a very powerful effect for a specific algorithm. But
  • 38:21 - 38:26
    for most of what we're about to do, just pretend that MDP is stationary.
  • 38:26 - 38:33
    Okay.
  • 38:49 - 38:51
    So before I
  • 38:51 - 38:55
    talk about models, let me just say a couple of words about how you would go about
  • 38:55 - 38:57
    coming up with the linear models. The
  • 38:57 - 39:01
    key assumption in this model is that the dynamics are linear. There's also the assumption the
  • 39:01 - 39:03
    reward function is quadratic, but
  • 39:03 - 39:07
    let's talk about the assumption that the dynamics are linear.
  • 39:07 - 39:14
    ST plus one equals A ST plus B AT. Maybe time varying, maybe
  • 39:14 - 39:17
    stationary. I'm just writing stationary for now.
  • 39:17 - 39:20
    So how do you get models like this?
  • 39:20 - 39:23
    We actually saw one example of this already in the previous lecture. If
  • 39:23 - 39:27
    you have an inverted pendulum system, and you want to model the
  • 39:27 - 39:28
    inverted pendulum
  • 39:29 - 39:34
    using a linear model like this, maybe [inaudible]. I'm not going to write that down.
  • 39:34 - 39:38
    One thing you could do is run your inverted pendulum, start it off in some state as
  • 39:38 - 39:39
    zero,
  • 39:39 - 39:41
    take some action, A0, have it
  • 39:41 - 39:46
    get to some state, S1. Take action A1 and so on, get to some state ST.
  • 39:46 - 39:48
    Our index is one
  • 39:48 - 39:52
    to denote that this is my first trial. Then you can repeat this a bunch of times.
  • 39:52 - 39:55
    You can repeat this N times. I'm
  • 39:55 - 39:58
    just executing actions on
  • 39:58 - 40:00
    your physical robot.
  • 40:00 - 40:02
    It could be a robot, it could be a
  • 40:02 - 40:06
    chemical plant. It could be whatever. Trying out different actions in
  • 40:06 - 40:10
    your system and watch
  • 40:13 - 40:15
    what states it gets to. So
  • 40:15 - 40:21
    you fit the linear model to your data, and choose the parameters A
  • 40:21 - 40:26
    and B,
  • 40:26 - 40:28
    that minimize
  • 40:38 - 40:40
    the quadratic
  • 40:40 - 40:41
    error term.
  • 40:47 - 40:52
    So this says how well does AST plus BAT
  • 40:52 - 40:55
    predict ST plus one. So you minimize the quadratic penalty term. This would be
  • 40:55 - 40:58
    one reasonable way to estimate
  • 40:58 - 41:01
    the parameters of a linear dynamical system for a
  • 41:01 - 41:08
    physical robot, a physical chemical plant or whatever you may have.
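
    Here is a minimal sketch of that fit in Python. The trajectory arrays are hypothetical placeholders; the estimate is just the ordinary least-squares choice of A and B minimizing the sum of squared prediction errors between A s_t + B a_t and s_{t+1}, as described above. With several trajectories, you would stack all of their (s_t, a_t, s_{t+1}) triples before solving.

        import numpy as np

        def fit_linear_dynamics(states, actions):
            # states:  array of shape (m, n) holding s_0, ..., s_{m-1} from one trajectory
            # actions: array of shape (m - 1, d) holding a_0, ..., a_{m-2}
            # Returns A (n x n) and B (n x d) minimizing sum_t ||s_{t+1} - A s_t - B a_t||^2.
            X = np.hstack([states[:-1], actions])   # each row is [s_t, a_t]
            Y = states[1:]                          # each row is s_{t+1}
            W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
            n = states.shape[1]
            A = W[:n].T
            B = W[n:].T
            return A, B
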
  • 41:09 - 41:11
    Another way to
  • 41:11 - 41:13
    come up with a linear
  • 41:14 - 41:19
    model
  • 41:19 - 41:24
    consistently, if I want to control, is to take a nonlinear model and to
  • 41:24 - 41:30
    linearize it. Let me show you what I mean by that. So you can linearize a
  • 41:30 - 41:36
    nonlinear model.
  • 41:36 - 41:37
    So
  • 41:39 - 41:42
    let's say you have some nonlinear model
  • 41:42 - 41:45
    that expresses ST plus one as some function of
  • 41:45 - 41:46
    ST
  • 41:46 - 41:47
    and
  • 41:47 - 41:50
    AT.
  • 41:50 - 41:54
    In the example in the previous lecture, I said for the inverted pendulum
  • 41:57 - 41:57
    [inaudible].
  • 41:57 - 42:01
    By referring to the laws of physics. It was actually by downloading off the shelf
  • 42:01 - 42:04
    software for doing physics simulations. So if you haven't
  • 42:04 - 42:08
    seen such simulators before, you can go online. You can easily find
  • 42:08 - 42:11
    many open-source packages for simulating the
  • 42:11 - 42:14
    physics of simple devices like these.
  • 42:14 - 42:18
    Download the software, type in the specifications of your robot, and it will
  • 42:18 - 42:22
    simulate the physics for you. There are lots of open-source software packages like that. You can just
  • 42:22 - 42:24
    download them.
  • 42:24 - 42:27
    But something like that, you can now build a physics simulator
  • 42:27 - 42:32
    that predicts the state as a function of the previous state and the previous action. So
  • 42:32 - 42:34
    you actually come up with some function that
  • 42:34 - 42:40
    says that
  • 42:42 - 42:45
    - the state at the next time step. The state vector
  • 42:45 - 42:52
    will be some function of the current state
  • 42:53 - 42:55
    and the current
  • 42:55 - 42:58
    action, where the action in this case is just a real number that says how
  • 42:58 - 43:02
    hard you accelerated to the left or right.
  • 43:02 - 43:06
    Then you can take this nonlinear model. I actually wrote down a sample of a
  • 43:06 - 43:08
    model in the last lecture, but
  • 43:08 - 43:11
    in general, F would be some nonlinear function, instead of
  • 43:11 - 43:13
    a linear function.
  • 43:13 - 43:17
    So what I mean by linearize is the following. So here's just a cartoon. I'll
  • 43:17 - 43:20
    write down the math in a second. Let's say the horizontal axis is
  • 43:20 - 43:24
    the input state, ST,
  • 43:24 - 43:28
    and the output state, ST plus one, as I said.
  • 43:28 - 43:32
    Here's
  • 43:32 - 43:35
    the function at F.
  • 43:35 - 43:39
    So the next state, ST plus one, will be some function of the previous state,
  • 43:39 - 43:42
    ST and the action
  • 43:42 - 43:46
    AT. So to linearize this model, what you would do is you would choose a point.
  • 43:46 - 43:49
    We'll call this S bar
  • 43:49 - 43:54
    T. Then you would take the derivative of this
  • 43:54 - 43:56
    function and fit the
  • 43:56 - 44:00
    tangent straight line to that function.
  • 44:00 - 44:04
    So this allows you to express the next state, ST plus one. You
  • 44:04 - 44:07
    can approximate the next state, ST plus one, as
  • 44:07 - 44:10
    this linear function of the
  • 44:10 - 44:14
    previous state, ST. So
  • 44:14 - 44:18
    to make this cartoon really right, the horizontal axis here is really a state
  • 44:18 - 44:20
    action pair.
  • 44:20 - 44:25
    You're linearizing around. So this is just a cartoon. The horizontal
  • 44:25 - 44:30
    axis represents the input state and the input action.
  • 44:30 - 44:37
    So
  • 44:42 - 44:46
    just to write this out in
  • 44:46 - 44:50
    math, I'll write out the simple case first and the fully general one
  • 44:50 - 44:51
    in a second. Suppose
  • 44:51 - 44:55
    the horizontal axis was only the state. So let's pretend there are no actions for now.
  • 44:55 - 44:56
    ST plus
  • 44:56 - 45:00
    one is just some function of ST, then that linear function I drew would be
  • 45:00 - 45:02
    ST plus one.
  • 45:02 - 45:04
    We're approximating
  • 45:04 - 45:08
    as F prime evaluated at some point S bar T
  • 45:08 - 45:12
    times ST minus S bar T,
  • 45:12 - 45:16
    Plus
  • 45:16 - 45:18
    F of S bar
  • 45:18 - 45:22
    T. So with this, you'd express ST plus one as a linear
  • 45:22 - 45:23
    function
  • 45:23 - 45:29
    of ST. Just note that S bar T is a constant.
  • 45:31 - 45:33
    It's not
  • 45:33 - 45:37
    a variable. Does that make sense? S bar T is a constant. F prime of S bar T is the gradient of the function F at the point
  • 45:37 - 45:38
    S bar T. This is
  • 45:38 - 45:42
    really just the equation of that linear function. So you
  • 45:42 - 45:46
    can then convert this to
  • 45:52 - 45:55
    A and B matrixes. Jumping back one board, I'm going
  • 45:55 - 45:57
    to point out one other thing.
  • 45:57 - 45:58
    Let's
  • 45:58 - 46:00
    say I look at this straight line,
  • 46:00 - 46:01
    and I ask
  • 46:01 - 46:03
    how well does this straight line
  • 46:03 - 46:08
    approximate my function F, my original simulator, my original function F.
  • 46:08 - 46:11
    Then you sort of notice that
  • 46:11 - 46:12
    in this neighborhood,
  • 46:12 - 46:15
    in the neighborhood of S bar, there's a
  • 46:15 - 46:17
    pretty good approximation. It's fairly close.
  • 46:17 - 46:19
    But then as you move further away,
  • 46:19 - 46:24
    moving far off to the left here, it becomes a pretty terrible approximation.
  • 46:24 - 46:26
    So
  • 46:26 - 46:29
    when you linearize
  • 46:29 - 46:33
    a nonlinear model to apply LQR,
  • 46:33 - 46:37
    one of the parameters you have to choose would be the point around which to
  • 46:37 - 46:39
    linearize your nonlinear model.
  • 46:39 - 46:44
    So if you expect your inverted pendulum system
  • 46:44 - 46:49
    to spend most of its time in the vicinity of this state,
  • 46:49 - 46:49
    then it'd
  • 46:49 - 46:53
    be reasonable to linearize around this state
  • 46:53 - 46:55
    because that means that
  • 46:55 - 46:57
    the linear approximation would be a good approximation,
  • 46:57 - 47:02
    usually, for the states where you expect the system to spend most of its time.
  • 47:02 - 47:06
    If conversely, you expect the system to spend most of its time
  • 47:06 - 47:08
    at states far to the left, then this would be
  • 47:08 - 47:10
    a terrible location to linearize.
  • 47:10 - 47:14
    So one rule of thumb is to choose the position to linearize according to where
  • 47:14 - 47:17
    you expect the system to spend most of its time
  • 47:17 - 47:20
    so that the linear approximation will tend to be
  • 47:20 - 47:23
    an accurate approximation in the
  • 47:23 - 47:27
    vicinity of the states [inaudible].
  • 47:27 - 47:29
    Just to be clear, it is about
  • 47:29 - 47:30
    choosing the point, S bar,
  • 47:30 - 47:32
    A bar,
  • 47:32 - 47:34
    that we'll use
  • 47:34 - 47:37
    to come up with a linear function
  • 47:37 - 47:41
    that we'll pretend is a good approximation to my original
  • 47:41 - 47:47
    nonlinear function,
  • 47:55 - 48:01
    F. So for an example like the inverted pendulum problem, this problem, if
  • 48:01 - 48:04
    you expect to do pretty well in this problem,
  • 48:04 - 48:08
    then you would expect the state
  • 48:08 - 48:13
    to often be near the zero state.
  • 48:13 - 48:18
    If S equals zero corresponds to X being the center of the railway track that the inverted
  • 48:18 - 48:20
    pendulum lives on. You
  • 48:20 - 48:23
    expect to do fairly well. You expect the pole to
  • 48:23 - 48:25
    mostly be upright
  • 48:25 - 48:29
    [inaudible] upright at zero degrees or 90 degrees, I guess. So you choose
  • 48:29 - 48:33
    whatever state corresponds to having the pole
  • 48:33 - 48:34
    upright. The zero
  • 48:34 - 48:37
    velocity [inaudible], near zero velocity in the middle of the track.
  • 48:37 - 48:40
    So you usually choose that as a state
  • 48:40 - 48:43
    to linearize
  • 48:43 - 48:46
    your inverted pendulum dynamics around. That's
  • 48:46 - 48:53
    a region where you might want your approximation to be good.
  • 48:56 - 49:00
    So I wrote this down. To come back to this formula, I wrote this down for the special
  • 49:00 - 49:02
    case of a one-dimensional state variable, where there
  • 49:02 - 49:08
    are no actions. The general formula for the linearization approximation is ST
  • 49:09 - 49:12
    plus one is approximated as F
  • 49:12 - 49:15
    of S bar T.
  • 49:15 - 49:17
    A bar T
  • 49:17 - 49:24
    plus
  • 49:50 - 49:54
    - Okay?
  • 49:54 - 49:55
    Where
  • 49:55 - 50:00
    these upside down triangles are an unusual
  • 50:00 - 50:03
    symbol for taking the
  • 50:03 - 50:07
    derivative of F with respect to its first, vector-valued argument and with respect to its second argument. So
  • 50:07 - 50:12
    by choosing an appropriate point S bar T, A bar T to linearize
  • 50:12 - 50:13
    around,
  • 50:13 - 50:14
    you'd
  • 50:14 - 50:18
    now express ST plus one
  • 50:18 - 50:22
    as a linear function of the current state and
  • 50:22 - 50:23
    the current
  • 50:23 - 50:24
    action, AT. Again,
  • 50:24 - 50:29
    these things, S bar T, is a constant that you choose ahead of time.
  • 50:29 - 50:32
    Same for A bar
  • 50:35 - 50:40
    T. Lastly, having linearized this thing, you can then convert it
  • 50:40 - 50:43
    to matrices
  • 50:43 - 50:47
    like
  • 50:47 - 50:49
    that.
  • 50:49 - 50:55
    So the ST plus one is now a linear function of ST and
  • 50:55 - 50:57
    AT.
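
    Putting the general linearization in one place (a reconstruction of the board; for a vector-valued f, the gradient factors below are the Jacobians of f with respect to the state and the action, evaluated at the linearization point):

        s_{t+1} \approx f(\bar{s}_t, \bar{a}_t)
            + \nabla_s f(\bar{s}_t, \bar{a}_t)\,(s_t - \bar{s}_t)
            + \nabla_a f(\bar{s}_t, \bar{a}_t)\,(a_t - \bar{a}_t)

    Reading off terms, A is the Jacobian with respect to the state, B is the Jacobian with respect to the action, and the leftover constant is the affine offset discussed next.
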
  • 50:57 - 51:04
    Questions about this?
  • 51:11 - 51:16
    So just one tiny detail, and it's really not a huge deal, is that this thing
  • 51:16 - 51:20
    below is technically an affine function.
  • 51:20 - 51:23
    There might actually be an extra constant there, but
  • 51:23 - 51:27
    this is not a difficult generalization of a linear dynamical system definition.
  • 51:28 - 51:32
    One way to deal with that constant is actually to do something like take your
  • 51:32 - 51:36
    definition for this state, let's say XX dot theta theta dot.
  • 51:36 - 51:39
    You can then augment your state vector to
  • 51:39 - 51:41
    have an extra intercept term, one.
  • 51:41 - 51:46
    With the intercept term one, and working out the A matrix, you can then
  • 51:46 - 51:49
    take care of the extra constant, C, as well. So you can
  • 51:49 - 51:51
    deal with this thing being -
  • 51:51 - 51:55
    technically it's an affine function because of this extra offset, rather than a linear
  • 51:55 - 51:56
    function.
  • 51:56 - 52:02
    But this is just a little bit of bookkeeping you can work out for yourself and shouldn't
  • 52:02 - 52:03
    be
  • 52:03 - 52:05
    a huge
  • 52:06 - 52:13
    deal.
  • 52:18 - 52:19
    So to summarize, you see I
  • 52:19 - 52:21
    have this up,
  • 52:21 - 52:25
    you can learn a model, you can take a nonlinear model. Your nonlinear
  • 52:25 - 52:29
    model can be a physics model or a nonlinear model you learned and linearize it.
  • 52:30 - 52:32
    Now I'll pose an LQR problem
  • 52:32 - 52:35
    in which we have
  • 52:35 - 52:42
    specification of the MDP in which the states are in RN, the actions are in RD,
  • 52:42 - 52:47
    and the state transition probabilities are given by
  • 52:47 - 52:50
    the noisy linear equation: ST plus one equals
  • 52:50 - 52:54
    ATST plus BTAT. Our rewards are going to be
  • 52:54 - 52:57
    these quadratic functions.
  • 52:57 - 53:02
    So the specification of the MDP means that we know the A matrixes,
  • 53:02 - 53:06
    the B matrixes, the U matrixes
  • 53:06 - 53:09
    and the V matrixes. Our goal is to come up with a policy
  • 53:09 - 53:14
    to maximize our finite
  • 53:14 - 53:19
    horizon sum of rewards.
  • 53:20 - 53:23
    So our goal is to come up with a policy,
  • 53:23 - 53:26
    first, to maximize
  • 53:26 - 53:29
    the expected value of this
  • 53:29 - 53:34
    finite horizon sum of
  • 53:39 - 53:46
    rewards. Okay.
  • 53:48 - 53:49
    So our
  • 53:49 - 53:54
    approach to solving this problem
  • 53:54 - 53:57
    will be exactly that
  • 53:57 - 54:00
    finite horizon dynamic programming algorithm that we worked out a
  • 54:00 - 54:02
    little earlier in this
  • 54:02 - 54:04
    lecture. In particular,
  • 54:04 - 54:07
    my strategy for finding the optimal policy
  • 54:07 - 54:09
    will be to
  • 54:09 - 54:12
    first find V star of
  • 54:12 - 54:14
    T, the capital T,
  • 54:14 - 54:19
    and then I'll apply by a recursion to find V star of T minus one, V star of T minus
  • 54:19 - 54:23
    two and so on.
  • 54:23 - 54:28
    In the dynamic programming algorithm we worked out, V star subscript T of the
  • 54:28 - 54:29
    state ST, this is the maximum
  • 54:29 - 54:31
    [inaudible]
  • 54:31 - 54:35
    actions you might take at that time of R
  • 54:35 - 54:39
    of STAT.
  • 54:39 - 54:43
    Again, just for the sake of understanding this material, you can probably
  • 54:43 - 54:47
    pretend the rewards and the dynamics are actually stationary. I'll write out all
  • 54:47 - 54:53
    these superscripts all the time [inaudible] if you're reading this for the first time.
  • 54:53 - 54:57
    The reward is equal to max of
  • 54:57 - 55:04
    AT of minus -
  • 55:12 - 55:16
    right? I hope this isn't confusing. The superscript Ts denote transposes.
  • 55:16 - 55:20
    The subscript Ts denote the time index, capital T.
  • 55:20 - 55:24
    So that's just the definition of my negative quadratic rewards. So this
  • 55:24 - 55:28
    is clearly maximized as minus
  • 55:29 - 55:31
    ST transpose
  • 55:31 - 55:38
    UTST
  • 55:38 - 55:40
    because
  • 55:40 - 55:44
    that last term - AT transpose VT AT - is greater than or equal to zero. That follows from
  • 55:44 - 55:48
    my assumption that VT is positive semi-definite. So the best action to take in
  • 55:48 - 55:49
    the last time step
  • 55:49 - 55:53
    is just the action
  • 55:53 - 55:54
    zero. So
  • 55:54 - 56:01
    pi star subscript T of ST is equal to the
  • 56:02 - 56:04
    argmax over actions of
  • 56:04 - 56:11
    that same thing. It's just
  • 56:11 - 56:12
    zero. It's by
  • 56:12 - 56:16
    choosing the zero action,
  • 56:16 - 56:19
    AT transpose VTAT becomes zero, and that's
  • 56:19 - 56:39
    how this reward is maximized.
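    In equations, the base case we just worked out is:

        \[
        V_T^*(s_T) \;=\; \max_{a_T}\; -\left( s_T^\top U_T s_T + a_T^\top V_T a_T \right)
        \;=\; -\, s_T^\top U_T s_T,
        \qquad \pi_T^*(s_T) = 0,
        \]

    since AT transpose VT AT is always greater than or equal to zero and equals
    zero exactly when AT is zero.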
  • 56:50 - 56:57
    Any questions, or is something illegible? Okay.
  • 57:00 - 57:04
    So now let's do the dynamic programming step where
  • 57:04 - 57:10
    my goal is given
  • 57:10 - 57:12
    VT plus one,
  • 57:12 - 57:16
    I want to compute VT. Given V star T plus one, I want to compute
  • 57:16 - 57:20
    V star of T. So this is the dynamic programming step.
  • 57:22 - 57:25
    So the
  • 57:25 - 57:29
    DP step I wrote down previously was this. So for the finite state case, I
  • 57:29 - 57:36
    wrote down the following.
  • 58:32 - 58:36
    So this is exactly the equation I wrote down
  • 58:36 - 58:37
    previously,
  • 58:37 - 58:41
    and this is what I wrote down for finite states, where you have these discrete state transition
  • 58:41 - 58:44
    probabilities, and we can sum over
  • 58:44 - 58:48
    this discrete set of states. Now we're in a continuous, infinite state setting
  • 58:48 - 58:52
    again, so this sum over states should actually become an integral. I'm going to
  • 58:52 - 58:56
    actually skip the integral step. We'll just go ahead and write this last term here as an
  • 58:56 - 58:57
    expectation.
  • 58:57 - 59:04
    So this is going to be max over actions AT
  • 59:04 - 59:07
    plus - and then this becomes an expectation
  • 59:07 - 59:10
    over the random next state, ST plus one,
  • 59:10 - 59:14
    drawn from the state transition probabilities given by P of STAT, of V star T
  • 59:14 - 59:16
    plus one, ST
  • 59:16 - 59:19
    plus one. So this
  • 59:19 - 59:21
    is
  • 59:21 - 59:24
    the same equation written down
  • 59:24 - 59:26
    as an expectation.
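    In symbols, that expectation form of the dynamic programming step reads:

        \[
        V_t^*(s_t) \;=\; \max_{a_t} \left[ R^{(t)}(s_t, a_t) \;+\; \mathbb{E}_{s_{t+1} \sim P_{s_t a_t}}\!\left[ V_{t+1}^*(s_{t+1}) \right] \right].
        \]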
  • 59:27 - 59:32
    So what I need to do is given a representation of V star T plus one, I
  • 59:32 - 59:33
    need to find V
  • 59:33 - 59:36
    star of T.
  • 59:36 - 59:41
    So it turns out that LQR has the following useful property. It turns out that each of
  • 59:41 - 59:43
    these value functions
  • 59:43 - 59:46
    can be represented as a quadratic function.
  • 59:46 - 59:49
    So concretely,
  • 59:49 - 59:53
    let's suppose that V
  • 59:53 - 60:00
    star T plus one - suppose that this can be expressed as a
  • 60:00 - 60:07
    quadratic
  • 60:09 - 60:11
    function, written like so,
  • 60:11 - 60:12
    where
  • 60:12 - 60:15
    the matrix phi T plus one
  • 60:15 - 60:17
    is an N by N matrix, and
  • 60:17 - 60:21
    psi T plus one
  • 60:21 - 60:24
    is just a real number.
  • 60:24 - 60:29
    So in other words, suppose V star T plus one is just a quadratic function
  • 60:30 - 60:34
    of the state ST
  • 60:34 - 60:37
    plus one.
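    Concretely, the inductive hypothesis is:

        \[
        V_{t+1}^*(s_{t+1}) \;=\; s_{t+1}^\top \Phi_{t+1}\, s_{t+1} \;+\; \Psi_{t+1},
        \qquad \Phi_{t+1} \in \mathbb{R}^{n \times n},\quad \Psi_{t+1} \in \mathbb{R}.
        \]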
  • 60:37 - 60:40
    We can then show that
  • 60:40 - 60:44
    when you do one dynamic programming step - when you
  • 60:46 - 60:50
    plug this definition of V star T plus one into your dynamic
  • 60:50 - 60:52
    programming step in the equation I had just now,
  • 60:52 - 60:55
    you can show that you would get
  • 60:55 - 60:58
    that V star T as well, will
  • 60:58 - 61:02
    also be a quadratic function of the
  • 61:06 - 61:13
    same form - right? - for some appropriate matrix, phi T,
  • 61:13 - 61:20
    and some appropriate real number, psi
  • 61:20 - 61:22
    of
  • 61:22 - 61:28
    T. So what you can do is start off the recursion with - well, does that make
  • 61:28 - 61:32
    sense?
  • 61:32 - 61:35
    So what you can do is
  • 61:35 - 61:37
    start off the recursion
  • 61:37 - 61:41
    as follows. So previously, we worked out that V star capital T,
  • 61:41 - 61:44
    we said that this is minus
  • 61:44 - 61:48
    ST
  • 61:48 - 61:50
    transpose UTST. So we
  • 61:50 - 61:51
    have that phi
  • 61:51 - 61:53
    of capital T is
  • 61:53 - 61:56
    equal to minus UT,
  • 61:56 - 61:59
    and psi of capital T is equal to zero.
  • 61:59 - 62:01
    Now V star T of ST
  • 62:01 - 62:08
    is equal to ST transpose phi of T, ST plus psi of T.
  • 62:08 - 62:09
    So
  • 62:09 - 62:13
    you can start out the recursion this way with phi of T equals minus UT and psi of T
  • 62:13 - 62:15
    equals zero.
  • 62:15 - 62:17
    Then work out
  • 62:17 - 62:20
    what the recursion is. I won't
  • 62:20 - 62:23
  • 62:23 - 62:27
    actually do the full derivation. It's mainly algebra, and
  • 62:29 - 62:33
    you've actually done this sort of Gaussian expectation
  • 62:33 - 62:40
    math a lot in your homework by now.
  • 62:40 - 62:44
    So I won't do the full derivation. I'll just outline the
  • 62:45 - 62:47
    one key step. So
  • 62:47 - 62:51
    in the dynamic programming step, V star T of ST is
  • 62:51 - 62:53
    equal to
  • 62:53 - 62:56
    max over actions AT of
  • 62:56 - 63:03
    the immediate reward.
  • 63:04 - 63:06
    So this was R of
  • 63:06 - 63:09
    SA from my equation in the dynamic programming step.
  • 63:09 - 63:12
    Then plus an expected value
  • 63:12 - 63:18
    over the random next state, ST plus one, drawn from the
  • 63:18 - 63:22
    Gaussian distribution with mean
  • 63:22 - 63:24
    ATST plus
  • 63:24 - 63:28
    BTAT and covariance sigma
  • 63:28 - 63:32
    W. So what this is, this is really my
  • 63:32 - 63:35
    specification for P of STAT. This is my
  • 63:35 - 63:39
    state transition distribution in the LQR setting. This is my
  • 63:39 - 63:41
    state transition distribution
  • 63:41 - 63:43
    if I take action AT
  • 63:43 - 63:45
    in
  • 63:45 - 63:46
    the state
  • 63:46 - 63:50
    ST. Then my next state is distributed Gaussian with mean ATST plus BTAT and
  • 63:50 - 63:51
    covariance
  • 63:51 - 63:54
    sigma W. Then
  • 63:54 - 64:01
    of the value of this
  • 64:11 - 64:14
    state. This, of course, is just V star
  • 64:14 - 64:17
    T plus one
  • 64:17 - 64:24
    of ST plus one. I hope
  • 64:24 - 64:27
    this makes sense. This is just taking that equation I had previously in the
  • 64:27 - 64:30
    dynamic programming step. So the V star of T, ST
  • 64:30 - 64:33
    equals max over actions
  • 64:33 - 64:36
    of the immediate rewards
  • 64:36 - 64:39
    plus an expected value over the next state of
  • 64:39 - 64:41
    V star of the next state with
  • 64:41 - 64:43
    the clock advanced by one.
  • 64:43 - 64:46
    So I've just plugged in all the definitions of the reward, of the
  • 64:46 - 64:48
    state transition distribution
  • 64:48 - 64:55
    and of the value function.
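    The one piece of Gaussian expectation math you need when you expand this is the
    standard identity

        \[
        \mathbb{E}_{x \sim \mathcal{N}(\mu, \Sigma)}\!\left[ x^\top \Phi x \right]
        \;=\; \mu^\top \Phi \mu \;+\; \operatorname{tr}(\Sigma \Phi),
        \]

    applied with mu equal to ATST plus BTAT and Sigma equal to sigma W. Every term
    you then get is either quadratic in AT or does not involve AT at all.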
  • 64:56 - 65:02
    Actually, could you raise your hand if this makes sense? Cool. So if you write
  • 65:02 - 65:03
    this out
  • 65:03 - 65:08
    and you expand the expectation - I know you've done this many times, so I
  • 65:08 - 65:09
    won't do it -
  • 65:09 - 65:11
    this whole thing on the right-hand side
  • 65:11 - 65:15
    simplifies to a big quadratic function of the action, AT.
  • 65:15 - 65:18
    So this whole
  • 65:18 - 65:23
    thing simplifies to a big
  • 65:23 - 65:26
    quadratic function
  • 65:32 - 65:36
    of the action
  • 65:36 - 65:38
    AT.
  • 65:38 - 65:42
    We want to maximize this with respect to the actions AT. So to
  • 65:42 - 65:44
    maximize a big quadratic function, you
  • 65:44 - 65:48
    just take the derivatives of the functions with respect to the action AT, set the derivative equal
  • 65:48 - 65:52
    to zero, and then you've maximized the right-hand side, with
  • 65:52 - 65:56
    respect to the action,
  • 65:56 - 66:00
    AT. It turns out - I'm just going to write this expression down for completeness. You
  • 66:00 - 66:02
    can derive it yourself at any time. It turns
  • 66:02 - 66:05
    out if you actually maximize that thing on the right- hand side as a
  • 66:05 - 66:07
    function of the actions, AT,
  • 66:07 - 66:12
    you find that the optimal AT is going to be
  • 66:12 - 66:14
    that
  • 66:14 - 66:21
    times
  • 66:23 - 66:26
    ST.
  • 66:26 - 66:29
    Don't
  • 66:29 - 66:34
    worry about this expression. You can get it from [inaudible] and derive it yourself.
  • 66:34 - 66:38
    But the key thing to note is that the optimal action, AT, is going to be some big
  • 66:38 - 66:39
    matrix.
  • 66:39 - 66:40
    We're going to call this thing
  • 66:40 - 66:43
    LT
  • 66:43 - 66:47
    times
  • 66:49 - 66:51
    ST. In other words, the optimal action to take in this
  • 66:51 - 66:53
    given state is going to be
  • 66:53 - 66:55
    some linear function
  • 66:55 - 66:59
    of the state, ST. So
  • 66:59 - 67:03
    having done dynamic programming, you remember also when we worked out the
  • 67:03 - 67:05
    dynamic programming algorithm for finite
  • 67:05 - 67:06
    horizon MDPs,
  • 67:06 - 67:08
    we said that
  • 67:08 - 67:12
    the way you compute the optimal policy, pi star of T
  • 67:12 - 67:14
    of ST.
  • 67:14 - 67:18
    This is always the argmax of the same thing - the argmax
  • 67:18 - 67:23
    over actions AT of the same thing:
  • 67:23 - 67:25
    R of STAT plus
  • 67:25 - 67:32
    your expected value, over ST plus one drawn from P of STAT, of V star T
  • 67:36 - 67:39
    plus one, ST plus one. This thing on the right-hand side is always the same thing as
  • 67:39 - 67:42
    the thing we maximized
  • 67:42 - 67:43
  • 67:43 - 67:46
    above. So what this means is that - when I said this is the value of A that maximizes
  • 67:46 - 67:50
    this - the optimal action to take from the state ST
  • 67:50 - 67:53
    is actually equal to
  • 67:53 - 67:56
    LT
  • 67:56 - 68:02
    times ST.
  • 68:02 - 68:04
    What
  • 68:04 - 68:06
    we've shown is that
  • 68:06 - 68:08
    when you're in some state, ST, the
  • 68:08 - 68:14
    optimal action for that state is going to be some matrix, LT, which you can compute,
  • 68:14 - 68:14
    times
  • 68:14 - 68:17
    the state,
  • 68:17 - 68:18
    ST.
  • 68:18 - 68:24
    In other words, the optimal action is actually a linear function of the state.
  • 68:24 - 68:27
    I'm just going to point out, this is not function approximation here, right?
  • 68:27 - 68:32
    What we did not do, we did not say,
  • 68:32 - 68:35
    let's find the optimal linear policy. We didn't say, let's look at the optimal
  • 68:35 - 68:38
    policy, and then we'll fit this straight line to the optimal policy. This is not
  • 68:38 - 68:42
    about approximating the optimal policy with a straight line.
  • 68:42 - 68:47
    This derivation is saying that the optimal policy is a straight line. The
  • 68:47 - 68:54
    optimal action is a linear function of the current
  • 68:54 - 68:57
    state.
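    For reference, if you carry out that maximization yourself - take the derivative
    with respect to AT, set it to zero, and solve - the gain matrix you should get,
    at least under the sign conventions used above, is:

        \[
        a_t^* \;=\; L_t\, s_t,
        \qquad
        L_t \;=\; \left( V_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t.
        \]

    Don't memorize the expression; the important point is only that it is a fixed
    matrix multiplying the state.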
  • 68:57 - 69:01
    Moreover, we've now worked out this value for AT
  • 69:01 - 69:02
  • 69:02 - 69:05
    that maximizes this thing on
  • 69:05 - 69:10
    the right-hand side. So you take this and plug it back in to do the dynamic programming recursion.
  • 69:10 - 69:17
    What you find is that -
  • 69:20 - 69:24
    so you take AT and plug it back in to do the maximization. It will
  • 69:24 - 69:26
    actually get
  • 69:26 - 69:28
    you this formula,
  • 69:28 - 69:31
    so V star
  • 69:31 - 69:35
    TST. So you find that
  • 69:35 - 69:39
    it will indeed be a quadratic function like this
  • 69:39 - 69:42
    of the following form where
  • 69:42 - 69:46
    - and I just write out the equations for the sake of
  • 69:46 - 69:53
    completeness. Don't worry too much about their forms. You can
  • 69:57 - 70:04
    derive this yourself.
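    For completeness, here is the form the recursion works out to under the
    conventions above - again, you can and should verify the algebra yourself:

        \[
        \Phi_t \;=\; A_t^\top \left( \Phi_{t+1} \;-\; \Phi_{t+1} B_t
        \left( B_t^\top \Phi_{t+1} B_t - V_t \right)^{-1} B_t^\top \Phi_{t+1} \right) A_t \;-\; U_t,
        \]
        \[
        \Psi_t \;=\; \Psi_{t+1} \;+\; \operatorname{tr}\!\left( \Sigma_w \Phi_{t+1} \right).
        \]

    Since phi T plus one is negative semi-definite, that trace term is less than or
    equal to zero, which is the price you pay for the Gaussian noise.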
  • 70:48 - 70:52
    So just to summarize, don't worry too much about the forms of these
  • 70:52 - 70:54
    equations.
  • 70:54 - 70:58
    What we've done is written down the recursion that expresses
  • 70:58 - 71:00
    phi T and psi T
  • 71:00 - 71:02
    as a function of phi T plus one
  • 71:02 - 71:04
    and psi T plus one.
  • 71:04 - 71:05
    So
  • 71:05 - 71:08
    this allows you to compute the optimal value function
  • 71:08 - 71:14
    for when the clock is at time lowercase T, as a function of the optimal value
  • 71:14 - 71:18
    function for when the clock is at time T plus one.
  • 71:20 - 71:25
    So
  • 71:25 - 71:28
    to summarize,
  • 71:28 - 71:33
    here's the finite horizon LQR algorithm
  • 71:33 - 71:36
    - actually, just to give this equation a name as well. This
  • 71:36 - 71:40
    recursion, in terms of the phi Ts, this is called the
  • 71:40 - 71:45
    discrete
  • 71:45 - 71:52
    time Riccati equation. It's the
  • 71:54 - 71:58
    recursion that gives you phi
  • 71:58 - 72:02
    T in terms of phi T plus one.
  • 72:02 - 72:05
    So
  • 72:05 - 72:11
    to summarize,
  • 72:12 - 72:17
    our algorithm for finding the exact solution to finite horizon LQR
  • 72:17 - 72:23
    problems is as follows. We initialize phi T
  • 72:23 - 72:24
    to
  • 72:24 - 72:27
    be equal to minus
  • 72:30 - 72:33
    UT
  • 72:33 - 72:37
    and psi T to be equal to zero.
  • 72:37 - 72:39
    Then
  • 72:39 - 72:43
    recursively,
  • 72:43 - 72:45
    calculate
  • 72:45 - 72:47
    phi T
  • 72:47 - 72:49
    and psi
  • 72:49 - 72:50
    T
  • 72:50 - 72:54
    as a function of phi T plus one
  • 72:54 - 72:56
    and psi T plus
  • 72:56 - 73:01
    one with the discrete time -
  • 73:02 - 73:06
    actually, excuse me. So
  • 73:06 - 73:09
    recursively calculate phi T and psi
  • 73:09 - 73:11
    T as a function of phi T plus one and psi T plus one, as I showed,
  • 73:11 - 73:15
    using the discrete time Riccati equation.
  • 73:15 - 73:18
    So you do this for
  • 73:18 - 73:19
    T equals
  • 73:19 - 73:22
    T minus one, T minus two
  • 73:22 - 73:29
    and so on, down to time zero. Then you
  • 73:29 - 73:33
    compute
  • 73:33 - 73:37
    LT as a function of - actually, is it phi T or phi T
  • 73:37 - 73:40
    plus one? Phi T
  • 73:40 - 73:42
    plus one, I think.
  • 73:42 - 73:46
    As a function of phi T plus one and psi T plus one.
  • 73:46 - 73:50
    This is actually a function of only phi T plus one. You don't really need psi T
  • 73:50 - 73:53
    plus one.
  • 73:53 - 74:00
    Now you have your optimal policy.
  • 74:00 - 74:03
    So having computed the LTs,
  • 74:03 - 74:06
    you now have the
  • 74:06 - 74:08
    optimal action to take in the state
  • 74:08 - 74:15
    ST, just given by
  • 74:25 - 74:29
    this
  • 74:29 - 74:35
    linear equation.
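    If it helps to see the whole recipe in one place, here is a minimal numerical
    sketch of that backward pass. This is not code from the course; the function
    name, the list-of-matrices interface, and the exact sign conventions are my
    own, following the quadratic reward minus (ST transpose UT ST plus AT
    transpose VT AT) used above.

        import numpy as np

        def finite_horizon_lqr(A, B, U, V, Sigma_w):
            """Backward (discrete-time Riccati) recursion for finite horizon LQR.

            A, B, U, V are lists of length T+1, indexed by t = 0, ..., T:
            A[t] is n x n, B[t] is n x d, U[t] is n x n PSD, V[t] is d x d PSD.
            Rewards are R_t(s, a) = -(s^T U[t] s + a^T V[t] a) and the dynamics
            are s_{t+1} = A[t] s_t + B[t] a_t + w_t, with w_t ~ N(0, Sigma_w).
            Returns gains L (convention a_t = L[t] @ s_t) plus Phi and Psi.
            """
            T = len(A) - 1
            n, d = A[0].shape[0], B[0].shape[1]

            Phi = [None] * (T + 1)
            Psi = [0.0] * (T + 1)
            L = [None] * (T + 1)

            # Base case: V*_T(s) = -s^T U_T s, and the best final action is zero.
            Phi[T] = -U[T]
            Psi[T] = 0.0
            L[T] = np.zeros((d, n))

            # Backward pass: compute (Phi_t, Psi_t, L_t) from (Phi_{t+1}, Psi_{t+1}).
            for t in range(T - 1, -1, -1):
                At, Bt, P = A[t], B[t], Phi[t + 1]
                # Gain: L_t = (V_t - B_t^T Phi_{t+1} B_t)^{-1} B_t^T Phi_{t+1} A_t.
                L[t] = np.linalg.solve(V[t] - Bt.T @ P @ Bt, Bt.T @ P @ At)
                # Quadratic part of the value function (Riccati step).
                Phi[t] = At.T @ P @ At + At.T @ P @ Bt @ L[t] - U[t]
                Phi[t] = 0.5 * (Phi[t] + Phi[t].T)  # keep it symmetric numerically
                # Constant part: the expected penalty from the Gaussian noise.
                Psi[t] = Psi[t + 1] + float(np.trace(Sigma_w @ P))
            return L, Phi, Psi

    Given the gains, executing the policy is just one matrix-vector product per
    time step: a_t = L[t] @ s_t.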
  • 74:35 - 74:44
    How much time do I have left? Okay. Let me just say one last thing about this before I close.
  • 74:44 - 74:48
    Maybe I'll do it next week. I think I'll do it next session
  • 74:48 - 74:52
    instead. So it actually turns out there's one cool property about this that is
  • 74:52 - 74:55
    kind of subtle, but you'll find it out in the next lecture.
  • 74:55 - 75:02
    Are there questions about this before we close for today, then? So
  • 75:05 - 75:07
    the very cool thing about
  • 75:07 - 75:13
    the solution of discrete time LQR problems - finite horizon LQR
  • 75:13 - 75:17
    problems is that this is a problem with an infinite, continuous state space.
  • 75:17 - 75:21
    But nonetheless, under the assumptions we made, you can prove that the
  • 75:21 - 75:25
    value function is a quadratic function of the state.
  • 75:25 - 75:29
    Therefore, just by computing these matrices phi T and the real numbers
  • 75:29 - 75:33
    psi T, you can actually exactly represent the value function, even for
  • 75:33 - 75:37
    these infinitely large state spaces, even for continuous state spaces.
  • 75:37 - 75:43
    So the computation of this algorithm scales only like the cube -
  • 75:43 - 75:46
    scales only as a polynomial - in terms of the number of state variables,
  • 75:46 - 75:50
    whereas in high dimensionality problems, with [inaudible], we
  • 75:50 - 75:53
    had algorithms that scale exponentially with the dimension of the problem.
  • 75:53 - 75:55
    Whereas LQR scales only like the cube
  • 75:55 - 76:01
    of the dimension of the problem. So this easily
  • 76:01 - 76:04
    applies to problems with even very large state spaces. So we actually often apply
  • 76:04 - 76:06
    variations of this algorithm
  • 76:06 - 76:09
    to some subset - some particular subset of the things we do on our
  • 76:09 - 76:13
    helicopter, which has high dimensional state spaces, with twelve or higher
  • 76:13 - 76:14
    dimensions.
  • 76:14 - 76:16
    This has worked very well for that.
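    To make the "cube" claim slightly more concrete: each step of the Riccati
    recursion above is a handful of matrix products and one d by d linear solve,
    so the whole backward pass costs roughly

        \[
        O\!\left( T \cdot \left( n^3 + n^2 d + n d^2 + d^3 \right) \right)
        \]

    arithmetic operations - polynomial in the state dimension n, rather than
    exponential as with discretization.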
  • 76:16 - 76:17
    So
  • 76:17 - 76:21
    it turns out there are even more things you can do with this, and I'll continue with
  • 76:21 - 76:24
    that in the next lecture. So let's close for today, then.
Title:
Lecture 18 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses state action rewards, linear dynamical systems in the context of linear quadratic regulation, models, and the Riccati equation, and finite horizon MDPs.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford
