This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome back. So I hope all of you had a good Thanksgiving break. After the problem sets, I suspect many of us needed one. Just one quick announcement: as I announced by email a few days ago, this afternoon we'll be doing another tape-ahead of lecture, so I won't physically be here on Wednesday, and so we'll be taping this Wednesday's lecture ahead of time. If you're free this afternoon, please come to that; it'll be at 3:45 p.m. in Skilling Auditorium, in Skilling 193, at 3:45. But of course, you can also just show up in class as usual at the usual time, or just watch it online as usual also.

Okay, welcome back. What I want to do today is continue our discussion of reinforcement learning in MDPs. It's quite a lot for me to go over today, so most of today's lecture will be on continuous state MDPs, and in particular, algorithms for solving continuous state MDPs. I'll talk just very briefly about discretization. I'll spend a lot of time talking about models, or simulators, of MDPs, and then talk about one algorithm called fitted value iteration and ideas which build on that, and then hopefully I'll have time to get to a second algorithm called approximate policy iteration.

Just to recap, in the previous lecture I defined the reinforcement learning problem and I defined MDPs, so let me just recap the notation.
I said that an MDP, or Markov decision process, was a 5-tuple (S, A, {P_sa}, γ, R) comprising those things, and the running example of those I was using last time was this one, adapted from the Russell and Norvig AI textbook. So in this example MDP that I was using, it had 11 states, so that's what S was. The actions were the compass directions: north, south, east and west. The state transition probabilities P_sa capture the chance of your transitioning to every state when you take any action in any other given state, and so in our example those captured the stochastic dynamics of our robot wandering around [inaudible], and we said if you take the action north, and similarly for south, you have a 0.8 chance of actually going north, a 0.1 chance of veering off to the left, and a 0.1 chance of veering off to the right, so that modeled the robot's noisy dynamics with a [inaudible]. And the reward function was +1 or -1 at the absorbing states and -0.02 elsewhere. This is an example of an MDP, and that's what these five things were. Oh, and I used a discount factor γ of usually a number slightly less than one, so that's the 0.99.
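For concreteness, those five elements could be stored like this for the 11-state grid world; a minimal sketch, where the array names, the absorbing-state indices and the omitted transition entries are illustrative rather than from the lecture:

```python
import numpy as np

# The 5-tuple (S, A, {P_sa}, gamma, R) for the 11-state grid world.
n_states = 11
actions = ["north", "south", "east", "west"]

# P[a, s, s'] = probability of landing in s' after taking action a in state s.
# Each row would be filled in with 0.8 for the intended direction and 0.1 for
# each of the two perpendicular directions, per the noisy dynamics above.
P = np.zeros((len(actions), n_states, n_states))

# Rewards: -0.02 everywhere except the two absorbing states (+1 and -1).
R = np.full(n_states, -0.02)
R[9], R[10] = 1.0, -1.0   # hypothetical indices for the two absorbing states

gamma = 0.99              # discount factor
```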
And so our goal was to find the policy, the control policy, and that's π, which is a function mapping from the states to the actions that tells us what action to take in every state, and our goal was to find a policy that maximizes the expected value of our total payoff. So we want to find a policy. Well, let's see. We defined the value function V^π(s) to be equal to this. We said that the value of a policy π from a state s was given by the expected value of the sum of discounted rewards, conditioned on your executing the policy π and on your starting off the [inaudible], let's say, in the state s, and so our strategy for finding the policy was sort of comprised of two steps. So the goal is to find a good policy that maximizes the expected value of the sum of discounted rewards, and so I said last time that one strategy for finding the [inaudible] of a policy is to first compute the optimal value function, which I denoted V*(s) and is defined like that. It's the maximum value that any policy can obtain, and for example, the optimal value function for that MDP looks like this. So in other words, starting from any of these states, what's the expected value of the sum of discounted rewards you get; so this is V*. We also said that once you've found V*, you can compute the optimal policy using this.
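Written out, the definitions just recapped ("this", "like that" and "using this") are the standard ones:

```latex
V^{\pi}(s) = \mathrm{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \;\middle|\; s_0 = s,\ \pi \right]

V^{*}(s) = \max_{\pi} V^{\pi}(s)

\pi^{*}(s) = \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V^{*}(s')
```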
And so once you've found V*, the last piece of this algorithm was Bellman's equations, where we know that V*(s), the optimal sum of discounted rewards you can get from a state s, is equal to the immediate reward you get just for starting off in that state, plus γ times the max, over all the actions you could take, of your future sum of discounted rewards: your future payoff starting from the state s', which is where you might transition to after one step. And so this gave us a value iteration algorithm, which was essentially, V.I., I'm abbreviating value iteration as V.I., so in the value iteration algorithm, in V.I., you just take Bellman's equations and you repeatedly do this. So initialize some guess of the value function, initialize it to zero as the starting guess, and then repeatedly perform this update for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), and from V* you can compute the optimal policy π*. Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let's see: given a random initial policy, we'll solve for V^π; we'll solve for the value function for that specific policy. So this means, for every state, compute the expected sum of discounted rewards if you execute the policy π
from that state, and then the other step of policy iteration is, having found the value function for your policy, you then update the policy, pretending that you've already found the optimal value function V*, and then you repeatedly perform these two steps, where you solve for the value function for your current policy and then pretend that that's actually the optimal value function and solve for the policy given the value function, and you repeatedly update the value function or update the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V*, and this will cause π to converge to π*, the optimal policy. So those were based on our last lecture [inaudible] MDPs, which introduced a lot of new notation and symbols, and that just summarizes all that again. What I'm about to do now, what I'm about to do for the rest of today's lecture, is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now, since I've got to go on, please. Yeah.

Student: [Inaudible] how those two algorithms are very different?

Instructor (Andrew Ng): I see, right, so yeah, do you see that they're different? Okay, how it's different. Let's see. So, well, here's one difference. I didn't say this because we won't use it today. So value iteration and policy iteration are different algorithms.
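For reference, a minimal sketch of the value iteration algorithm just recapped, assuming the P, R and gamma arrays from the earlier sketch (the function name, fixed iteration count and the greedy-policy extraction at the end are illustrative choices):

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=1000):
    """Repeatedly apply the Bellman backup V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
    V = np.zeros(R.shape[0])                 # initialize V(s) = 0 as the starting guess
    for _ in range(n_iters):
        # P has shape (n_actions, n_states, n_states), so P @ V has shape (n_actions, n_states)
        V = R + gamma * (P @ V).max(axis=0)
    pi = (P @ V).argmax(axis=0)              # greedy policy with respect to the (near-)optimal V
    return V, pi
```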
In policy iteration, in this step, you're given a fixed policy, and you're going to solve for the value function for that policy, and so you're given some fixed policy π, meaning some function mapping from the states to the actions. So I give you some policy, and whatever; that's just some policy, it's not a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state you need to compute the expected sum of discounted rewards you get if you execute this specific policy, starting off the MDP in that state s. So I showed this last time. I won't go into details today, but I said last time that you can actually solve for V^π by solving a linear system of equations. There was a form of Bellman's equations for V^π, and it turned out to be, if you write this out, you end up with a linear system of 11 equations in 11 unknowns, and so you can actually solve for the value function for a fixed policy by solving a system of linear equations with 11 variables and 11 constraints; and so that's policy iteration. Whereas in value iteration, going back to the board, in value iteration you sort of repeatedly perform this update where you update the value of a state as the [inaudible]. So I hope that makes sense, that the algorithms are different.

Student: [Inaudible] on the absorbing states, so is the assumption that we can never get out of those states?
Instructor (Andrew Ng): Yes. There are always things that, where you solve for this [inaudible], for example, to make the numbers come up nicely, but I don't wanna spend too much time on them, but yeah, so the assumption is that once you enter the absorbing state, then the world ends, or there are no more rewards after that. And there is another way to think of the absorbing states, which is sort of mathematically equivalent. You can think of the absorbing states as transitioning with probability 1 to some twelfth state, and then once you're in that twelfth state, you always remain in that twelfth state, and there are no further rewards from there. If you want, you can think of this as actually an MDP with 12 states rather than 11 states, and the twelfth state is this zero-cost absorbing state that you get stuck in forever. Other questions? Yeah, please go.

Student: Where did the Bellman's equations [inaudible] to optimal value [inaudible]?

Instructor (Andrew Ng): Boy, yeah. Okay, this Bellman's equations, this equation that I'm pointing to, I sort of tried to give a justification for this last time. I'll say it in one sentence: the expected total payoff that I expect to get from the state s is equal to my immediate reward, which is the reward I get for starting in that state. Let's see. If I start in that state, I'm gonna get some first reward, and then I can transition to some other state, and then from that other state, I'll get some additional rewards from then on. So Bellman's equations break that sum into two pieces.
It says the value of a state is equal to the reward you get right away, which is really, well, V*(s) is really equal to R(s) + γ max_a Σ_{s'} P_sa(s') V*(s'). So this breaks V* into two terms and says that there's this first term, which is the immediate reward, that, and then plus γ times the rewards you get in the future, which turns out to be equal to that second term. I spent more time justifying this in the previous lecture, although, yeah, hopefully, for the purposes of this lecture, if you're not sure where this came from, if you don't remember the justification of that, why don't you just maybe take my word for it that this equation holds true, since I use it a little bit later as well, and then the lecture notes sort of explain a little further the justification for why this equation might hold true. But for now, yeah, just for now take my word for it that this holds true, 'cause we'll use it a little bit later today as well.

Student: [Inaudible] and would it then sort of turn back into [inaudible]?

Instructor (Andrew Ng): Actually, [inaudible], right, the question is: in policy iteration, if we represent π implicitly, using V(s), would it become equivalent to value iteration? And the answer is sort of no. Let's see. It's true that policy iteration and value iteration are closely related algorithms, and there's actually a continuum between them, but, yeah, it actually turns out that, oh, no, the algorithms are not equivalent. It's just that in policy iteration, there is a step where you're solving for the value function for the policy, that is, solve for V^π.
Usually, you can do this, for instance, by solving a linear system of equations. In value iteration, it is a different algorithm, yes. I hope it makes sense that, at least cosmetically, it's different.

Student: [Inaudible] if you have [inaudible] representing π implicitly, then you won't have to solve that [inaudible] equations.

Instructor (Andrew Ng): Yeah, the problem is, let's see. To solve for V^π, this works only if you have a fixed policy, so once you change the value function, if π changes as well, then it's sort of hard to solve this. Yeah, so later on we'll actually talk about some examples of when π is implicitly represented, but at least for now, I think there's, yeah. Maybe there's a way to redefine something to see a mapping onto value iteration, but that's not usually done. These are viewed as different algorithms. Okay, cool, so all good questions. Let me move on and talk about how to generalize these ideas to continuous states.

Everything we've done so far has been for discrete states or finite-state MDPs, where, for example, here we had an MDP with a finite set of 11 states, and so the value function, or V(s), our estimate for the value function V(s), could then be represented using an array of 11 numbers, 'cause if you have 11 states, the value function needs to assign a real number to each of the 11 states, and so you can represent V(s) using an array of 11 numbers.
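To round out the discrete-state recap in code, here is a minimal sketch of the policy-evaluation step described above, solving Bellman's equations for a fixed policy as an 11-by-11 linear system, together with the policy iteration loop around it; the function names are illustrative, and P, R and gamma are the arrays from the earlier sketch:

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Solve Bellman's equations for a fixed policy pi: V = R + gamma * P_pi V,
    i.e. the linear system (I - gamma * P_pi) V = R (11 equations in 11 unknowns here)."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]      # row s holds P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

def policy_iteration(P, R, gamma, n_iters=100):
    """Alternate policy evaluation with a greedy policy improvement step."""
    pi = np.zeros(R.shape[0], dtype=int)      # start from some arbitrary initial policy
    for _ in range(n_iters):
        V = evaluate_policy(P, R, gamma, pi)
        pi = (P @ V).argmax(axis=0)           # act greedily as if V were V*
    return V, pi
```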
What I want to do for most of today is talk about continuous states. So for example, if you want to control any of a number of real [inaudible], so for example, if you want to control a car, a car's position is given by x, y, θ: its position and orientation; and if you want to model the velocity as well, then also ẋ, ẏ, θ̇. So it depends on whether you want to model just the kinematics, so just position, or whether you want to model the dynamics, meaning the velocity as well. Earlier I showed you a video of a helicopter that was flying, using a reinforcement learning algorithm. For the helicopter, which can fly in three-dimensional space rather than just drive on the 2-D plane, the state will be given by the x, y, z position, and φ, θ, ψ, which are sometimes used to denote the roll, pitch and yaw of the helicopter, its orientation; and if you want to control a helicopter, you pretty much have to model velocity as well, which means both linear velocity as well as angular velocity, and so this would be a 12-dimensional state.
If you want an example that is kind of fun but unusual, and I'm just gonna use this as an example, and actually use this little example a bit in today's lecture, it's the inverted pendulum problem, which is sort of a long-running classic in reinforcement learning, in which you imagine that you have a little cart that's on a rail. The rail ends at some point, and imagine that you have a pole attached to the cart, and this is a free hinge, so the pole here can rotate freely, and your goal is to control the cart and to move it back and forth on this rail so as to keep the pole balanced. Yeah, there's no long pole in this class, but you know what I mean, so you can imagine. Oh, is there a long pole here? Back in the corner. Oh, thanks. Cool. So I did not practice this, but you can take a long pole and sort of hold it up and balance it, so imagine that you can do it better than I can. Imagine these are [inaudible] just moving back and forth to try to keep the pole balanced, so you can actually use a reinforcement learning algorithm to do that.
This is actually one of the longstanding classic problems that people [inaudible] implement and play with using reinforcement learning algorithms, and so for this, the states would be x and θ, so x would be the position of the cart, and θ would be the orientation of the pole, and also the linear velocity and the angular velocity of the pole; so I'll actually use this example a couple of times. So given a continuous state space, how can you apply an algorithm like value iteration or policy iteration to solve the MDP, to control something like the car or a helicopter or something like the inverted pendulum? So one thing you can do, and this is maybe the most straightforward thing, is, if you have, say, a two-dimensional continuous state space, say s1 and s2 are my state variables (and in all the examples these are, I guess, between 4-dimensional and 12-dimensional; I'll just draw 2-D here), the most straightforward thing to do would be to take the continuous state space and discretize it into a number of discrete cells. And I use s-bar to denote the discretized, or discrete, states, and so you can approximate this continuous state problem with a finite or discrete set of states and then use policy iteration or value iteration to solve for V(s-bar).
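A minimal sketch of that discretization step for a 2-D state, assuming you simply grid each dimension into a fixed number of buckets (the bounds, bucket count and names below are illustrative):

```python
import numpy as np

# Illustrative bounds and resolution for a 2-D continuous state (s1, s2).
S_MIN = np.array([-1.0, -1.0])
S_MAX = np.array([ 1.0,  1.0])
K = 10                               # buckets per dimension, so K**2 discrete cells in total

def discretize(s):
    """Map a continuous state s to the integer index s_bar of its grid cell."""
    idx = np.floor((s - S_MIN) / (S_MAX - S_MIN) * K).astype(int)
    idx = np.clip(idx, 0, K - 1)     # keep boundary states inside the grid
    return idx[0] * K + idx[1]       # flatten the 2-D cell coordinates into one index

# Value iteration or policy iteration is then run on the K**2 discrete states, and at
# execution time the robot looks up pi(discretize(s)) for its current continuous state s.
```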
And if your robot is then in some state, given by that dot, you would then figure out what discretized state it is in; in this case it's in this discretized grid cell, which is called s-bar, and then you execute, you choose, the action given by your policy applied to that discrete state. So discretization is maybe the most straightforward way to turn a continuous state problem into a discrete state problem. Sometimes you can sort of make this work, but there are a couple of reasons why this does not work very well. One reason is the following, and for this picture, let's even temporarily put aside reinforcement learning. Let's just think about doing regression for now, and so suppose you have some input variable x, and suppose I have some data, and I want to fit a function, y as a function of x; so discretization is saying that I'm going to take my horizontal axis and chop it up into a number of intervals. Sometimes I call these intervals buckets as well. We chop my horizontal axis up into a number of buckets, and then we approximate this function using something that's piecewise constant in each of these buckets. And just look at this. This is clearly not a very good representation, right? Whereas when we talk about regression, you'd just choose some features of x and run linear regression or something, and you get a much better fit to the function.
And so there's the sense that discretization just isn't a very good class of functions; this sort of piecewise-constant function just isn't a very good representation for many things, and there's also the sense that there's no smoothing, or there's no generalization, across the different buckets. And in fact, back in regression, I would never have chosen to do regression using this sort of discretization. It just really doesn't make sense. And so in the same way, instead of x and some hypothesis function of x, if you have the state here and you're trying to approximate the value function V(s), then you can get discretization to work for many problems, but maybe this isn't the best representation for a value function. The other problem with discretization, and maybe the more serious problem, is what's often somewhat fancifully called the curse of dimensionality. This is just the observation that if the state space is in R^n, and if you discretize each variable into K buckets, so if you discretize each variable into K discrete values, then you get on the order of K to the power of n discrete states.
In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with a 12-dimensional state space, this would be maybe something like 100 to the power of 12, which is just huge, and it's not feasible. And so discretization doesn't scale well at all to problems in high-dimensional state spaces, and this observation actually applies more generally than to just robotics and continuous state problems. For example, another fairly well-known application of reinforcement learning has been factory automation. If you imagine that you have 20 machines sitting in a factory, and the machines lie in an assembly line and they all do something to a part on the assembly line and then route the part onto a different machine, you may want to use reinforcement learning algorithms to [inaudible] the order in which the different machines operate on the different things that are flowing through your assembly line, and maybe different machines can do different things. So if you have N machines and each machine can be in K states, then if you do this sort of discretization, the total number of states would be K to the N as well: if you have N machines and each machine can be in K states, then again, you get this huge number of states. Another well-known example would be a board game, say if you want to use reinforcement learning to play chess.
Then if you have N pieces in your board game, you have N pieces on the chessboard, and if each piece can be in K positions, then this is again sort of the curse-of-dimensionality thing, where the number of discrete states you end up with goes exponentially with the number of pieces in your board game. So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces, or at least that discrete representations scale poorly to high-dimensional state spaces. In practice, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get it to work, but it can be challenging, and as you go to higher and higher dimensional state spaces, the odds are [inaudible] that you'll need to fiddle around with the discretization and do things like non-uniform grids. So for example, what I've drawn for you is an example of a non-uniform discretization, where I'm discretizing s2 much more finely than s1, if I think the value function is much more sensitive to the value of the state variable s2 than to s1; and so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these, with non-uniform discretizations and so on. But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it'll work fine.
With 4-dimensional problems, you can probably get it to work, but it'll be just slightly challenging, and sometimes, by fooling around and being clever, you can often push discretization up to, let's say, about 6-dimensional problems, but with some difficulty, and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that's just the rough folk wisdom on the order of magnitude of problems you'd think about using discretization for.

But what I want to spend most of today talking about is [inaudible] methods that often work much better than discretization and with which we will approximate V* directly, without resorting to these sorts of discretizations. Before I jump to the specific representation, let me just spend a few minutes talking about the problem setup, then. For today's lecture, I'm going to focus on the problem of continuous states, and just to keep things sort of very simple in this lecture, I don't want to deal with continuous actions, so I'm going to assume discrete actions A. And it turns out, actually, and this is a critical fact, that for many problems the state space is much larger than the space of actions. That just seems to have worked out that way for many problems. So for example, for driving a car the state space is 6-dimensional, so x, y, θ, ẋ, ẏ, θ̇; whereas your actions, you still have two actions.
You have forward and backward motion, and steering the car, so you have 6-D states and 2-D actions, and so you can discretize the actions much more easily than you can discretize the states. In the other example, for a helicopter, you have a 12-D state and a 4-dimensional action, it turns out, and it's also often much easier to just discretize continuous actions into a discrete set of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action: whether you accelerate your cart to the left or to the right is a one-D action. And so for the rest of today, I'm gonna assume a continuous state, but I'll assume that maybe you've already discretized your actions, just because in practice it turns out that, not for all problems, but for many problems, large action spaces are just less of a difficulty than large state spaces. So I'm going to assume that we have a model, or simulator, of the MDP, and so this is really an assumption on how the state transition probabilities are represented. I'm gonna assume, and I'm going to use the terms "model" and "simulator" pretty much synonymously, so specifically, what I'm going to assume is that we have a black box, a piece of code, so that I can input any state, input an action, and it will output s', a sample from the state transition distribution.
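In code, that black-box assumption is nothing more than a function with the following shape; the toy dynamics and noise level here are made up purely so the sketch runs:

```python
import numpy as np

def simulate(s, a, dt=0.1):
    """Black-box model/simulator: take a state s and an action a, and return a
    sample s' drawn from the state transition distribution P_sa."""
    drift = np.array([s[1], a])          # hypothetical toy dynamics for a (position, velocity) state
    noise = 0.01 * np.random.randn(2)    # noise makes the sampled next state stochastic
    return s + dt * drift + noise
```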
So Dialogue: 0,0:34:54.22,0:34:55.07,Default,,0000,0000,0000,,this is really Dialogue: 0,0:34:55.07,0:34:57.08,Default,,0000,0000,0000,,my assumption Dialogue: 0,0:34:57.08,0:35:01.21,Default,,0000,0000,0000,,on the representation I have for the state transition probabilities, Dialogue: 0,0:35:01.21,0:35:05.19,Default,,0000,0000,0000,,so I'll assume I have a black box that Dialogue: 0,0:35:05.19,0:35:06.100,Default,,0000,0000,0000,,takes as input the state and action Dialogue: 0,0:35:06.100,0:35:09.85,Default,,0000,0000,0000,,and outputs the next state. Dialogue: 0,0:35:09.85,0:35:16.85,Default,,0000,0000,0000,,And so Dialogue: 0,0:35:17.41,0:35:20.47,Default,,0000,0000,0000,,there are a few fairly common ways to get models of Dialogue: 0,0:35:20.47,0:35:22.36,Default,,0000,0000,0000,,different MDPs you may work with, Dialogue: 0,0:35:22.36,0:35:24.48,Default,,0000,0000,0000,,one is Dialogue: 0,0:35:24.48,0:35:30.60,Default,,0000,0000,0000,,you might get a model from a physics Dialogue: 0,0:35:30.60,0:35:32.96,Default,,0000,0000,0000,,simulator. So for example, Dialogue: 0,0:35:32.96,0:35:36.76,Default,,0000,0000,0000,,if you're interested in controlling that inverted pendulum, Dialogue: 0,0:35:36.76,0:35:39.45,Default,,0000,0000,0000,,so your action is Dialogue: 0,0:35:39.45,0:35:41.23,Default,,0000,0000,0000,,A which is Dialogue: 0,0:35:41.23,0:35:45.13,Default,,0000,0000,0000,,the magnitude of the force you exert on the cart to Dialogue: 0,0:35:45.13,0:35:46.14,Default,,0000,0000,0000,, Dialogue: 0,0:35:46.14,0:35:47.71,Default,,0000,0000,0000,,the left Dialogue: 0,0:35:47.71,0:35:49.38,Default,,0000,0000,0000,,or right, and your Dialogue: 0,0:35:49.38,0:35:56.38,Default,,0000,0000,0000,,state is X, Xdot, theta, theta-dot. I'm just gonna write that in that order. Dialogue: 0,0:35:58.11,0:36:01.47,Default,,0000,0000,0000,,And so I'm gonna Dialogue: 0,0:36:01.47,0:36:04.38,Default,,0000,0000,0000,,write down a bunch of equations just for completeness but Dialogue: 0,0:36:04.38,0:36:08.78,Default,,0000,0000,0000,,most of what I'm gonna write below here is a bit Dialogue: 0,0:36:08.78,0:36:12.60,Default,,0000,0000,0000,,gratuitous, but if you flip open a textbook on physics, a Dialogue: 0,0:36:12.60,0:36:13.82,Default,,0000,0000,0000,,textbook on mechanics, Dialogue: 0,0:36:13.82,0:36:17.07,Default,,0000,0000,0000,,you can work out the equations of motion of a Dialogue: 0,0:36:17.07,0:36:19.50,Default,,0000,0000,0000,,physical device like this, so you find that Dialogue: 0,0:36:22.75,0:36:26.40,Default,,0000,0000,0000,,Sdot - the dot denotes derivative, so the derivative of the state with respect to time - Dialogue: 0,0:36:26.40,0:36:28.08,Default,,0000,0000,0000,,is given by Dialogue: 0,0:36:28.74,0:36:52.86,Default,,0000,0000,0000,,the vector [Xdot; the acceleration Xdot-dot, a term involving A, L and cos(theta) divided by M; theta-dot; and a corresponding expression for the angular acceleration theta-dot-dot]. Dialogue: 0,0:36:52.86,0:36:58.20,Default,,0000,0000,0000,,And so on, where A, the force, is the action that you exert on the cart, L is Dialogue: 0,0:36:58.20,0:36:59.87,Default,,0000,0000,0000,,the length of the pole, Dialogue: 0,0:36:59.87,0:37:04.10,Default,,0000,0000,0000,,M is the total mass of the system and so on.
So all these equations Dialogue: 0,0:37:04.10,0:37:06.21,Default,,0000,0000,0000,,are good uses, just writing Dialogue: 0,0:37:06.21,0:37:08.55,Default,,0000,0000,0000,,them down for completeness, but by Dialogue: 0,0:37:08.55,0:37:12.18,Default,,0000,0000,0000,,flipping over, open like a physics textbook, you can work out these equations Dialogue: 0,0:37:12.18,0:37:13.84,Default,,0000,0000,0000,,and notions yourself Dialogue: 0,0:37:13.84,0:37:15.17,Default,,0000,0000,0000,,and this Dialogue: 0,0:37:15.17,0:37:17.41,Default,,0000,0000,0000,,then gives you a model which Dialogue: 0,0:37:17.41,0:37:18.26,Default,,0000,0000,0000,,can say that S-2+1. Dialogue: 0,0:37:18.26,0:37:21.55,Default,,0000,0000,0000,,You're still one time step later Dialogue: 0,0:37:21.55,0:37:27.49,Default,,0000,0000,0000,,will be equal to your previous state Dialogue: 0,0:37:27.49,0:37:32.22,Default,,0000,0000,0000,,plus [inaudible], so in your simulator or in my model Dialogue: 0,0:37:32.22,0:37:35.20,Default,,0000,0000,0000,,what happens to the cart every Dialogue: 0,0:37:35.20,0:37:39.23,Default,,0000,0000,0000,,10th of a second, so ? T Dialogue: 0,0:37:39.23,0:37:43.13,Default,,0000,0000,0000,,would be within one second Dialogue: 0,0:37:43.13,0:37:50.13,Default,,0000,0000,0000,,and then so plus ? T times Dialogue: 0,0:38:03.47,0:38:06.13,Default,,0000,0000,0000,,that. And so that'd be one way to come up with a model of Dialogue: 0,0:38:06.13,0:38:08.17,Default,,0000,0000,0000,,your MDP. Dialogue: 0,0:38:08.17,0:38:12.39,Default,,0000,0000,0000,,And in this specific example, I've actually written down deterministic model because Dialogue: 0,0:38:12.83,0:38:14.83,Default,,0000,0000,0000,,and by deterministic I mean that Dialogue: 0,0:38:14.83,0:38:18.32,Default,,0000,0000,0000,,given an action in a state, the next state Dialogue: 0,0:38:18.32,0:38:19.84,Default,,0000,0000,0000,,is not random, Dialogue: 0,0:38:19.84,0:38:23.18,Default,,0000,0000,0000,,so would be an example of a deterministic model Dialogue: 0,0:38:23.18,0:38:26.18,Default,,0000,0000,0000,,where I can compute the next state exactly Dialogue: 0,0:38:26.18,0:38:29.70,Default,,0000,0000,0000,,as a function of the previous state and the previous action Dialogue: 0,0:38:29.70,0:38:33.48,Default,,0000,0000,0000,,or it's a deterministic model because all the probability Dialogue: 0,0:38:33.48,0:38:34.100,Default,,0000,0000,0000,,mass is on a single state Dialogue: 0,0:38:34.100,0:38:36.94,Default,,0000,0000,0000,,given the previous stated action. Dialogue: 0,0:38:36.94,0:38:43.94,Default,,0000,0000,0000,,You can also make this a stochastic model. Dialogue: 0,0:38:58.59,0:39:05.59,Default,,0000,0000,0000,,A second way that is often used to attain a model is to learn one. Dialogue: 0,0:39:08.36,0:39:13.08,Default,,0000,0000,0000,,And so again, just concretely what you do is you would Dialogue: 0,0:39:13.08,0:39:15.60,Default,,0000,0000,0000,,imagine that you have a physical Dialogue: 0,0:39:15.60,0:39:17.11,Default,,0000,0000,0000,,inverted pendulum system Dialogue: 0,0:39:17.11,0:39:20.63,Default,,0000,0000,0000,,as you physically own an inverted pendulum robot. 
Dialogue: 0,0:39:20.63,0:39:24.11,Default,,0000,0000,0000,,What you would do is you would then Dialogue: 0,0:39:24.11,0:39:27.11,Default,,0000,0000,0000,,initialize your inverted pendulum robot to some state Dialogue: 0,0:39:27.11,0:39:31.91,Default,,0000,0000,0000,,and then execute some policy, could be some random policy or some policy that you think is pretty Dialogue: 0,0:39:31.91,0:39:36.22,Default,,0000,0000,0000,,good, or you could even try controlling yourself with a joystick or something. Dialogue: 0,0:39:36.22,0:39:37.40,Default,,0000,0000,0000,,But so you set Dialogue: 0,0:39:37.40,0:39:40.89,Default,,0000,0000,0000,,the system off in some state as zero. Dialogue: 0,0:39:40.89,0:39:43.62,Default,,0000,0000,0000,,Then you take some action. Dialogue: 0,0:39:43.62,0:39:48.04,Default,,0000,0000,0000,,Here's zero and the game could be chosen by some policy or chosen by you using a joystick tryina Dialogue: 0,0:39:48.04,0:39:51.18,Default,,0000,0000,0000,,control your inverted pendulum or whatever. Dialogue: 0,0:39:51.18,0:39:55.03,Default,,0000,0000,0000,,System would transition to some new state, S-1, and then you Dialogue: 0,0:39:55.03,0:39:57.27,Default,,0000,0000,0000,,take some new action, A-1 Dialogue: 0,0:39:57.27,0:39:59.80,Default,,0000,0000,0000,,and so on. Let's say you do Dialogue: 0,0:39:59.80,0:40:01.58,Default,,0000,0000,0000,,this Dialogue: 0,0:40:01.58,0:40:04.78,Default,,0000,0000,0000,,for two time steps Dialogue: 0,0:40:04.78,0:40:06.48,Default,,0000,0000,0000,,and Dialogue: 0,0:40:06.48,0:40:08.86,Default,,0000,0000,0000,,sometimes I call this one trajectory and Dialogue: 0,0:40:08.86,0:40:10.01,Default,,0000,0000,0000,,you repeat this Dialogue: 0,0:40:10.01,0:40:12.99,Default,,0000,0000,0000,,M times, so Dialogue: 0,0:40:12.99,0:40:16.99,Default,,0000,0000,0000,,this is the first trial of the first trajectory, Dialogue: 0,0:40:16.99,0:40:23.99,Default,,0000,0000,0000,,and then you do this again. Initialize it in some Dialogue: 0,0:40:35.42,0:40:36.88,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:40:36.88,0:40:39.98,Default,,0000,0000,0000,,So you do this a bunch of times Dialogue: 0,0:40:39.98,0:40:43.64,Default,,0000,0000,0000,,and then you would run the learning algorithm Dialogue: 0,0:40:43.64,0:40:49.19,Default,,0000,0000,0000,,to Dialogue: 0,0:40:49.19,0:40:53.20,Default,,0000,0000,0000,,estimate ST+1 Dialogue: 0,0:40:53.20,0:40:56.65,Default,,0000,0000,0000,,as a function Dialogue: 0,0:40:56.65,0:41:01.87,Default,,0000,0000,0000,,of Dialogue: 0,0:41:01.87,0:41:04.62,Default,,0000,0000,0000,,ST and Dialogue: 0,0:41:04.62,0:41:08.01,Default,,0000,0000,0000,,AT. And for sake of completeness, you should just think of this Dialogue: 0,0:41:08.01,0:41:10.13,Default,,0000,0000,0000,,as inverted pendulum problem, Dialogue: 0,0:41:10.13,0:41:11.19,Default,,0000,0000,0000,,so ST+1 is Dialogue: 0,0:41:11.19,0:41:13.29,Default,,0000,0000,0000,,a 4-dimensional vector. 
ST, AT Dialogue: 0,0:41:13.29,0:41:16.58,Default,,0000,0000,0000,,will be a 4-dimensional vector and that'll be a real number, Dialogue: 0,0:41:16.58,0:41:17.68,Default,,0000,0000,0000,,and so you might Dialogue: 0,0:41:17.68,0:41:21.19,Default,,0000,0000,0000,,run linear regression 4 times to predict each of these state variables as Dialogue: 0,0:41:21.19,0:41:22.75,Default,,0000,0000,0000,,a function Dialogue: 0,0:41:22.75,0:41:25.25,Default,,0000,0000,0000,,of each of these Dialogue: 0,0:41:25.25,0:41:26.98,Default,,0000,0000,0000,,5 real numbers Dialogue: 0,0:41:26.98,0:41:29.52,Default,,0000,0000,0000,,and so on. Dialogue: 0,0:41:29.52,0:41:32.58,Default,,0000,0000,0000,,Just for example, if you Dialogue: 0,0:41:32.58,0:41:36.90,Default,,0000,0000,0000,,say that if you want to estimate your next state ST+1 as a linear Dialogue: 0,0:41:36.90,0:41:40.64,Default,,0000,0000,0000,,function Dialogue: 0,0:41:40.64,0:41:45.89,Default,,0000,0000,0000,,of your previous state in your action and so Dialogue: 0,0:41:45.89,0:41:51.47,Default,,0000,0000,0000,,A here will be a 4 by 4 matrix, Dialogue: 0,0:41:51.47,0:41:54.45,Default,,0000,0000,0000,,and B would be a 4-dimensional vector, Dialogue: 0,0:41:54.45,0:42:00.20,Default,,0000,0000,0000,,then Dialogue: 0,0:42:00.20,0:42:07.20,Default,,0000,0000,0000,,you would choose the values of A and B that minimize this. Dialogue: 0,0:42:32.27,0:42:35.22,Default,,0000,0000,0000,,So if you want your model to be that Dialogue: 0,0:42:35.22,0:42:38.85,Default,,0000,0000,0000,,ST+1 is some linear function of the previous stated action, Dialogue: 0,0:42:38.85,0:42:39.96,Default,,0000,0000,0000,,then you might Dialogue: 0,0:42:39.96,0:42:41.95,Default,,0000,0000,0000,,pose this optimization objective Dialogue: 0,0:42:41.95,0:42:45.13,Default,,0000,0000,0000,,and choose A and B to minimize the sum of squares error Dialogue: 0,0:42:45.13,0:42:52.13,Default,,0000,0000,0000,,in your predictive value for ST+1 as the linear function of ST Dialogue: 0,0:42:52.49,0:42:56.89,Default,,0000,0000,0000,,and AT. I Dialogue: 0,0:42:56.89,0:42:58.12,Default,,0000,0000,0000,,should say that Dialogue: 0,0:42:58.12,0:43:02.84,Default,,0000,0000,0000,,this is one specific example where you're using a linear function of the previous Dialogue: 0,0:43:02.84,0:43:06.40,Default,,0000,0000,0000,,stated action to predict the next state. Of course, you can also use other algorithms Dialogue: 0,0:43:06.40,0:43:08.47,Default,,0000,0000,0000,,like low [inaudible] weight to linear regression Dialogue: 0,0:43:08.47,0:43:12.08,Default,,0000,0000,0000,,or linear regression with nonlinear features or kernel linear Dialogue: 0,0:43:12.08,0:43:13.12,Default,,0000,0000,0000,,regression or whatever Dialogue: 0,0:43:13.12,0:43:16.38,Default,,0000,0000,0000,,to predict the next state as a nonlinear function of the current state as well, so this Dialogue: 0,0:43:16.38,0:43:23.38,Default,,0000,0000,0000,,is just [inaudible] linear problems. Dialogue: 0,0:43:25.47,0:43:27.82,Default,,0000,0000,0000,,And it turns out that low [inaudible] weight to linear Dialogue: 0,0:43:27.82,0:43:28.77,Default,,0000,0000,0000,,regression is Dialogue: 0,0:43:28.77,0:43:31.99,Default,,0000,0000,0000,,for many robots turns out to be an effective method for this learning Dialogue: 0,0:43:31.99,0:43:38.99,Default,,0000,0000,0000,,problem as well. 
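As a rough sketch of what this model-learning step might look like in code - purely illustrative, assuming the collected data is stored as (states, actions) pairs of numpy arrays, with states of shape (T+1, 4) and actions of shape (T, 1) for the inverted pendulum; the function and variable names here are not from the lecture:

    # Fit s_{t+1} ~ A s_t + B a_t by least squares over all recorded transitions.
    import numpy as np

    def fit_linear_model(trajectories):
        X, Y = [], []
        for states, actions in trajectories:
            for t in range(len(actions)):
                X.append(np.concatenate([states[t], actions[t]]))  # [s_t; a_t], 5 numbers
                Y.append(states[t + 1])                             # s_{t+1}, 4 numbers
        X, Y = np.array(X), np.array(Y)
        # Solve min_W ||X W - Y||^2; A and B are then slices of W.
        W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
        A = W[:4].T   # 4 x 4 matrix
        B = W[4:].T   # 4 x 1 vector
        return A, B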
Dialogue: 0,0:43:47.28,0:43:51.92,Default,,0000,0000,0000,,And so having learned to model, having learned the parameters A Dialogue: 0,0:43:51.92,0:43:52.68,Default,,0000,0000,0000,,and B, Dialogue: 0,0:43:52.68,0:43:58.98,Default,,0000,0000,0000,,you then have a model where you say that ST+1 is AST Dialogue: 0,0:43:58.98,0:44:05.36,Default,,0000,0000,0000,,plus BAT, and so that would be an example of a deterministic Dialogue: 0,0:44:05.36,0:44:06.22,Default,,0000,0000,0000,,model Dialogue: 0,0:44:07.95,0:44:10.75,Default,,0000,0000,0000,,or having learned the parameters A and B, Dialogue: 0,0:44:10.75,0:44:13.26,Default,,0000,0000,0000,,you might say that ST+1 is Dialogue: 0,0:44:13.26,0:44:15.23,Default,,0000,0000,0000,,equal to AST + Dialogue: 0,0:44:15.23,0:44:17.44,Default,,0000,0000,0000,,BAT Dialogue: 0,0:44:24.04,0:44:25.98,Default,,0000,0000,0000,,+ Dialogue: 0,0:44:25.98,0:44:28.25,Default,,0000,0000,0000,,?T. Dialogue: 0,0:44:28.25,0:44:29.92,Default,,0000,0000,0000,,And so these would be Dialogue: 0,0:44:29.92,0:44:32.70,Default,,0000,0000,0000,,very reasonable ways to come up with either Dialogue: 0,0:44:32.70,0:44:37.07,Default,,0000,0000,0000,,a deterministic or a stochastic model for your Dialogue: 0,0:44:37.07,0:44:39.33,Default,,0000,0000,0000,,inverted pendulum MDP. Dialogue: 0,0:44:39.33,0:44:41.52,Default,,0000,0000,0000,,And so Dialogue: 0,0:44:41.52,0:44:43.55,Default,,0000,0000,0000,,just to summarize, Dialogue: 0,0:44:43.55,0:44:48.24,Default,,0000,0000,0000,,what we have now is a model, meaning a piece of code, Dialogue: 0,0:44:48.24,0:44:53.46,Default,,0000,0000,0000,,where you can input a state and an action Dialogue: 0,0:44:53.46,0:44:55.08,Default,,0000,0000,0000,,and get an ST+1. Dialogue: 0,0:44:55.08,0:44:57.97,Default,,0000,0000,0000,,And so if you have a stochastic model, Dialogue: 0,0:44:57.97,0:45:02.42,Default,,0000,0000,0000,,then to influence this model, you would actually sample ?T from this [inaudible] Dialogue: 0,0:45:02.42,0:45:03.04,Default,,0000,0000,0000,,distribution Dialogue: 0,0:45:03.04,0:45:10.04,Default,,0000,0000,0000,,in order to generate ST+1. So it Dialogue: 0,0:45:13.18,0:45:18.27,Default,,0000,0000,0000,,actually turns out that in a Dialogue: 0,0:45:18.27,0:45:20.17,Default,,0000,0000,0000,,preview, I guess, in the next lecture Dialogue: 0,0:45:20.17,0:45:24.68,Default,,0000,0000,0000,,it actually turns out that in the specific case of linear dynamical systems, in the specific Dialogue: 0,0:45:24.68,0:45:28.56,Default,,0000,0000,0000,,case where the next state is a linear function of the current stated action, it Dialogue: 0,0:45:28.56,0:45:31.58,Default,,0000,0000,0000,,actually turns out that there're very powerful algorithms you can use. Dialogue: 0,0:45:31.58,0:45:35.64,Default,,0000,0000,0000,,So I'm actually not gonna talk about that today. I'll talk about that in the next lecture rather than Dialogue: 0,0:45:35.64,0:45:36.63,Default,,0000,0000,0000,,right now Dialogue: 0,0:45:38.56,0:45:42.76,Default,,0000,0000,0000,,but turns out that for many problems of inverted pendulum go if you use low [inaudible] weights Dialogue: 0,0:45:42.76,0:45:45.60,Default,,0000,0000,0000,,and linear regressionals and long linear Dialogue: 0,0:45:45.60,0:45:48.34,Default,,0000,0000,0000,,algorithm 'cause many systems aren't really Dialogue: 0,0:45:48.34,0:45:54.31,Default,,0000,0000,0000,,linear. You can build a nonlinear model. 
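And, as one illustrative sketch of the resulting black-box simulator, the learned A and B could be wrapped up as follows - with noise_cov left as None this is the deterministic model ST+1 = A ST + B AT, and with a covariance supplied it is the stochastic model that adds a sampled Gaussian epsilon T; again the names are only for illustration, not the lecture's own code:

    import numpy as np

    def make_simulator(A, B, noise_cov=None, rng=np.random.default_rng(0)):
        def step(s, a):
            s_next = A @ s + B @ np.atleast_1d(a)       # deterministic part
            if noise_cov is not None:                    # optional Gaussian noise
                s_next = s_next + rng.multivariate_normal(np.zeros(len(s)), noise_cov)
            return s_next
        return step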
Dialogue: 0,0:45:54.31,0:45:56.95,Default,,0000,0000,0000,,So what I wanna do now is talk about Dialogue: 0,0:45:56.95,0:45:57.33,Default,,0000,0000,0000,,given Dialogue: 0,0:45:57.33,0:46:00.51,Default,,0000,0000,0000,,a model, given a simulator for your Dialogue: 0,0:46:00.51,0:46:03.35,Default,,0000,0000,0000,,MDP, how to come up with an algorithm Dialogue: 0,0:46:03.35,0:46:07.32,Default,,0000,0000,0000,,to approximate the optimal value function V*. Dialogue: 0,0:46:07.32,0:46:14.32,Default,,0000,0000,0000,,Before I move on, let me check if there're questions on this. Okay, Dialogue: 0,0:46:38.34,0:46:42.58,Default,,0000,0000,0000,,cool. So here's Dialogue: 0,0:46:42.58,0:46:43.50,Default,,0000,0000,0000,,the idea. Dialogue: 0,0:46:43.50,0:46:46.89,Default,,0000,0000,0000,,Back when we talked about linear regression, we said that Dialogue: 0,0:46:46.89,0:46:52.12,Default,,0000,0000,0000,,given some inputs X in supervised learning, given the input features X, we Dialogue: 0,0:46:52.12,0:46:54.59,Default,,0000,0000,0000,,may choose some features of X Dialogue: 0,0:46:54.59,0:46:56.53,Default,,0000,0000,0000,,and then Dialogue: 0,0:46:57.23,0:46:58.82,Default,,0000,0000,0000,,approximate the target variable Dialogue: 0,0:46:58.82,0:47:02.23,Default,,0000,0000,0000,,as a linear function of various features of X, Dialogue: 0,0:47:02.23,0:47:04.93,Default,,0000,0000,0000,,and so we'll do exactly the same thing to Dialogue: 0,0:47:04.93,0:47:06.33,Default,,0000,0000,0000,,approximate Dialogue: 0,0:47:06.33,0:47:08.54,Default,,0000,0000,0000,,the optimal value function, Dialogue: 0,0:47:08.54,0:47:10.87,Default,,0000,0000,0000,,and in particular, Dialogue: 0,0:47:10.87,0:47:15.82,Default,,0000,0000,0000,,we'll choose some features Dialogue: 0,0:47:15.82,0:47:19.25,Default,,0000,0000,0000,,phi(s) of a Dialogue: 0,0:47:19.25,0:47:20.98,Default,,0000,0000,0000,,state Dialogue: 0,0:47:20.98,0:47:25.55,Default,,0000,0000,0000,,S. And so you could actually choose Dialogue: 0,0:47:25.55,0:47:29.53,Default,,0000,0000,0000,,phi(s) equals S. That would be one reasonable choice, if you want to approximate the value Dialogue: 0,0:47:29.53,0:47:30.16,Default,,0000,0000,0000,,function Dialogue: 0,0:47:30.16,0:47:32.76,Default,,0000,0000,0000,,as a linear function of the states, but you can Dialogue: 0,0:47:32.76,0:47:33.66,Default,,0000,0000,0000,,also Dialogue: 0,0:47:33.66,0:47:37.07,Default,,0000,0000,0000,,choose other things, so for example, Dialogue: 0,0:47:37.07,0:47:41.03,Default,,0000,0000,0000,,for the inverted pendulum example, you may choose phi(s) Dialogue: 0,0:47:41.03,0:47:45.05,Default,,0000,0000,0000,,to be equal to a vector of features that may be, say, a constant 1, or you may have Dialogue: 0,0:47:47.87,0:47:59.37,Default,,0000,0000,0000,,Xdot squared, maybe some cross terms, maybe theta times Xdot, maybe theta-dot squared, Dialogue: 0,0:47:59.37,0:48:03.66,Default,,0000,0000,0000,,and so on.
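For concreteness, one possible feature map of the kind just described might look like the following sketch for the 4-dimensional inverted-pendulum state; the particular features chosen here are only an example, not a recommendation from the lecture:

    import numpy as np

    def phi(s):
        # s = (x, x_dot, theta, theta_dot); constant term, raw state, squares, one cross term
        x, x_dot, theta, theta_dot = s
        return np.array([1.0, x, x_dot, theta, theta_dot,
                         theta**2, theta_dot**2, theta * x_dot])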
So Dialogue: 0,0:48:03.66,0:48:07.21,Default,,0000,0000,0000,,you choose some vector of features Dialogue: 0,0:48:07.21,0:48:14.21,Default,,0000,0000,0000,,and then approximate Dialogue: 0,0:48:16.82,0:48:21.76,Default,,0000,0000,0000,,the value function, the value of a state, as being equal to theta Dialogue: 0,0:48:21.76,0:48:24.29,Default,,0000,0000,0000,,transpose Dialogue: 0,0:48:24.29,0:48:28.42,Default,,0000,0000,0000,,times the features. And I should apologize in advance; I'm overloading notation here. Dialogue: 0,0:48:28.42,0:48:29.84,Default,,0000,0000,0000,,It's unfortunate. I Dialogue: 0,0:48:29.84,0:48:34.35,Default,,0000,0000,0000,,use theta both to denote the angle of the pole of the inverted Dialogue: 0,0:48:34.35,0:48:37.55,Default,,0000,0000,0000,,pendulum, so that's known as Dialogue: 0,0:48:37.55,0:48:39.14,Default,,0000,0000,0000,,the angle theta, Dialogue: 0,0:48:39.14,0:48:43.28,Default,,0000,0000,0000,,but also using theta to denote the vector of parameters in my learning algorithm. Dialogue: 0,0:48:43.28,0:48:43.91,Default,,0000,0000,0000,,So sorry Dialogue: 0,0:48:43.91,0:48:49.94,Default,,0000,0000,0000,,about the overloaded notation. Dialogue: 0,0:48:49.94,0:48:51.60,Default,,0000,0000,0000,, Dialogue: 0,0:48:51.60,0:48:55.44,Default,,0000,0000,0000,,Just like we did in linear regression, my goal is to come up Dialogue: 0,0:48:55.44,0:48:56.60,Default,,0000,0000,0000,,with Dialogue: 0,0:48:56.60,0:49:00.34,Default,,0000,0000,0000,,a linear combination of the features that gives me a good approximation Dialogue: 0,0:49:00.34,0:49:01.93,Default,,0000,0000,0000,,to the value function Dialogue: 0,0:49:01.93,0:49:07.66,Default,,0000,0000,0000,,and this is completely analogous to when we said that in linear regression Dialogue: 0,0:49:07.66,0:49:09.50,Default,,0000,0000,0000,,our estimate Dialogue: 0,0:49:13.37,0:49:18.05,Default,,0000,0000,0000,,of my response variable Y was a linear function of the features of the Dialogue: 0,0:49:18.05,0:49:23.46,Default,,0000,0000,0000,,input. That's what we have Dialogue: 0,0:49:23.46,0:49:30.46,Default,,0000,0000,0000,,in Dialogue: 0,0:49:40.11,0:49:47.08,Default,,0000,0000,0000,,linear regression. Let Dialogue: 0,0:49:47.08,0:49:50.78,Default,,0000,0000,0000,,me just write Dialogue: 0,0:49:50.78,0:49:55.86,Default,,0000,0000,0000,,down value iteration again and then I'll write down an approximation to value Dialogue: 0,0:49:55.86,0:49:57.20,Default,,0000,0000,0000,,iteration, so Dialogue: 0,0:49:57.20,0:50:00.76,Default,,0000,0000,0000,,for discrete states, Dialogue: 0,0:50:00.76,0:50:03.93,Default,,0000,0000,0000,,this is the idea behind value iteration and we said that Dialogue: 0,0:50:03.93,0:50:08.85,Default,,0000,0000,0000,,V(s) will be Dialogue: 0,0:50:08.85,0:50:11.61,Default,,0000,0000,0000,,updated Dialogue: 0,0:50:11.61,0:50:17.40,Default,,0000,0000,0000,,as R(s) + G Dialogue: 0,0:50:17.40,0:50:19.53,Default,,0000,0000,0000,,times the max over actions a of the sum over states S prime of P_sa(S prime) V(S prime). That was value iteration Dialogue: 0,0:50:19.53,0:50:24.13,Default,,0000,0000,0000,,and in the case of continuous states, this sum would be replaced by an integral, an Dialogue: 0,0:50:24.13,0:50:31.13,Default,,0000,0000,0000,,integral over states rather than a sum over states. Dialogue: 0,0:50:31.39,0:50:34.34,Default,,0000,0000,0000,,Let me just write this Dialogue: 0,0:50:34.34,0:50:35.67,Default,,0000,0000,0000,,as Dialogue: 0,0:50:38.81,0:50:42.59,Default,,0000,0000,0000,,R(s) + G times the max over a of an expectation of V(S prime), rather than that sum over S prime.
Dialogue: 0,0:50:42.59,0:50:45.15,Default,,0000,0000,0000,,That's really an expectation Dialogue: 0,0:50:45.15,0:50:47.40,Default,,0000,0000,0000,,with respect to random state as Dialogue: 0,0:50:47.40,0:50:54.40,Default,,0000,0000,0000,,prime drawn from the state transition probabilities piece SA Dialogue: 0,0:50:55.05,0:50:56.36,Default,,0000,0000,0000,,of V(s) Dialogue: 0,0:50:56.36,0:51:00.34,Default,,0000,0000,0000,,prime. So this is a sum of all states S prime with the Dialogue: 0,0:51:00.34,0:51:01.60,Default,,0000,0000,0000,,probability of going to S prime (value), Dialogue: 0,0:51:01.60,0:51:05.25,Default,,0000,0000,0000,,so that's really an expectation over the random state S prime flowing from PSA of Dialogue: 0,0:51:05.25,0:51:09.37,Default,,0000,0000,0000,,that. Dialogue: 0,0:51:09.37,0:51:11.99,Default,,0000,0000,0000,,And so what I'll do now is Dialogue: 0,0:51:11.99,0:51:15.20,Default,,0000,0000,0000,,write down an algorithm called Dialogue: 0,0:51:15.20,0:51:22.20,Default,,0000,0000,0000,,fitted value iteration Dialogue: 0,0:51:25.29,0:51:27.95,Default,,0000,0000,0000,,that's in Dialogue: 0,0:51:27.95,0:51:30.60,Default,,0000,0000,0000,,approximation to this Dialogue: 0,0:51:30.60,0:51:33.81,Default,,0000,0000,0000,,but specifically for continuous states. Dialogue: 0,0:51:33.81,0:51:38.52,Default,,0000,0000,0000,,I just wrote down the first two steps, and then I'll continue on the next board, so the Dialogue: 0,0:51:38.52,0:51:43.50,Default,,0000,0000,0000,,first step of the algorithm is we'll sample. Dialogue: 0,0:51:43.50,0:51:45.27,Default,,0000,0000,0000,,Choose some set of states Dialogue: 0,0:51:45.27,0:51:52.27,Default,,0000,0000,0000,,at Dialogue: 0,0:51:53.25,0:51:55.12,Default,,0000,0000,0000,,random. Dialogue: 0,0:51:55.12,0:51:59.87,Default,,0000,0000,0000,,So sample S-1, S-2 through S-M randomly so choose a Dialogue: 0,0:51:59.87,0:52:02.47,Default,,0000,0000,0000,,set of states randomly Dialogue: 0,0:52:02.47,0:52:08.21,Default,,0000,0000,0000,,and Dialogue: 0,0:52:08.21,0:52:12.02,Default,,0000,0000,0000,,initialize my parameter vector to be equal to zero. This is Dialogue: 0,0:52:12.02,0:52:16.19,Default,,0000,0000,0000,,analogous to in value iteration where I Dialogue: 0,0:52:16.19,0:52:21.14,Default,,0000,0000,0000,,might initialize the value function to be the function of all zeros. Then here's the Dialogue: 0,0:52:21.14,0:52:28.14,Default,,0000,0000,0000,,end view for the algorithm. Dialogue: 0,0:52:42.36,0:52:44.12,Default,,0000,0000,0000,,Got Dialogue: 0,0:52:44.12,0:52:51.12,Default,,0000,0000,0000,,quite a lot to Dialogue: 0,0:52:59.19,0:53:06.19,Default,,0000,0000,0000,,write Dialogue: 0,0:55:05.41,0:55:12.41,Default,,0000,0000,0000,,actually. Dialogue: 0,0:55:28.60,0:55:29.64,Default,,0000,0000,0000,,Let's Dialogue: 0,0:55:29.64,0:55:36.64,Default,,0000,0000,0000,,see. Dialogue: 0,0:56:10.42,0:56:12.01,Default,,0000,0000,0000,,And so Dialogue: 0,0:56:12.01,0:56:17.84,Default,,0000,0000,0000,,that's the algorithm. Dialogue: 0,0:56:17.84,0:56:19.88,Default,,0000,0000,0000,,Let me just adjust the writing. Give me a second. Give me a minute Dialogue: 0,0:56:19.88,0:56:26.88,Default,,0000,0000,0000,,to finish and then I'll step through this. Dialogue: 0,0:56:26.99,0:56:28.49,Default,,0000,0000,0000,,Actually, if some of my Dialogue: 0,0:56:28.49,0:56:35.49,Default,,0000,0000,0000,,handwriting is eligible, let me know. So Dialogue: 0,0:56:57.56,0:57:02.08,Default,,0000,0000,0000,,let me step through this and so briefly explain the rationale. 
Dialogue: 0,0:57:02.08,0:57:08.14,Default,,0000,0000,0000,,So the heart of the algorithm is - let's see. Dialogue: 0,0:57:08.14,0:57:10.87,Default,,0000,0000,0000,,In the original value iteration algorithm, Dialogue: 0,0:57:10.87,0:57:13.24,Default,,0000,0000,0000,,we would take the value for each state, Dialogue: 0,0:57:13.24,0:57:14.80,Default,,0000,0000,0000,,V(s_i), Dialogue: 0,0:57:14.80,0:57:17.60,Default,,0000,0000,0000,,and we will overwrite it with this Dialogue: 0,0:57:17.60,0:57:21.66,Default,,0000,0000,0000,,expression here. In the original discrete value iteration algorithm Dialogue: 0,0:57:21.66,0:57:23.20,Default,,0000,0000,0000,,we would take V(s_i) Dialogue: 0,0:57:23.20,0:57:23.94,Default,,0000,0000,0000,,and we would Dialogue: 0,0:57:23.94,0:57:25.83,Default,,0000,0000,0000,,set V(s_i) Dialogue: 0,0:57:25.83,0:57:28.59,Default,,0000,0000,0000,,to be equal to that. Dialogue: 0,0:57:28.59,0:57:33.13,Default,,0000,0000,0000,,Now, in the continuous state case, we have an infinite continuous set of Dialogue: 0,0:57:33.13,0:57:34.20,Default,,0000,0000,0000,,states and so Dialogue: 0,0:57:34.20,0:57:37.86,Default,,0000,0000,0000,,you can't discretely set the value of each of these to that. Dialogue: 0,0:57:37.86,0:57:42.51,Default,,0000,0000,0000,,So what we'll do instead is choose the parameters theta Dialogue: 0,0:57:42.51,0:57:46.85,Default,,0000,0000,0000,,so that V(s_i) is as close as possible to Dialogue: 0,0:57:46.85,0:57:49.93,Default,,0000,0000,0000,,this thing on the right hand side Dialogue: 0,0:57:49.93,0:57:54.51,Default,,0000,0000,0000,,instead. And this is what y_i turns out to be. Dialogue: 0,0:57:54.51,0:57:58.48,Default,,0000,0000,0000,,So concretely, what I'm going to do is I'm going to construct Dialogue: 0,0:57:58.48,0:58:00.58,Default,,0000,0000,0000,,estimates of this term, Dialogue: 0,0:58:00.58,0:58:04.28,Default,,0000,0000,0000,,and then I'm going to choose the parameters of my function approximator. I'm Dialogue: 0,0:58:04.28,0:58:07.73,Default,,0000,0000,0000,,gonna choose my parameters theta so that V(s_i) is as close as possible to Dialogue: 0,0:58:07.73,0:58:10.62,Default,,0000,0000,0000,,these. That's what y_i is, Dialogue: 0,0:58:10.62,0:58:11.98,Default,,0000,0000,0000,,and specifically, Dialogue: 0,0:58:11.98,0:58:15.14,Default,,0000,0000,0000,,what I'm going to do is I'll choose parameters theta Dialogue: 0,0:58:15.14,0:58:18.49,Default,,0000,0000,0000,,to minimize the sum of squared differences between theta transpose Dialogue: 0,0:58:18.49,0:58:20.41,Default,,0000,0000,0000,,phi(s_i) and y_i. Dialogue: 0,0:58:20.41,0:58:21.55,Default,,0000,0000,0000,,This thing here Dialogue: 0,0:58:21.55,0:58:24.49,Default,,0000,0000,0000,,is just V(s_i) because I'm approximating V(s_i) Dialogue: 0,0:58:24.49,0:58:26.22,Default,,0000,0000,0000,,as a linear function of phi(s_i) Dialogue: 0,0:58:26.22,0:58:27.88,Default,,0000,0000,0000,,and so I choose Dialogue: 0,0:58:27.88,0:58:32.02,Default,,0000,0000,0000,,the parameters theta to minimize the sum of squared differences. Dialogue: 0,0:58:32.02,0:58:34.83,Default,,0000,0000,0000,,So this last step is basically Dialogue: 0,0:58:34.83,0:58:38.76,Default,,0000,0000,0000,,the approximation version of value Dialogue: 0,0:58:38.76,0:58:40.61,Default,,0000,0000,0000,,iteration.
What everything else above Dialogue: 0,0:58:40.61,0:58:42.42,Default,,0000,0000,0000,,was doing was just Dialogue: 0,0:58:42.42,0:58:44.75,Default,,0000,0000,0000,,coming up with an approximation Dialogue: 0,0:58:44.75,0:58:46.19,Default,,0000,0000,0000,,to this term, to Dialogue: 0,0:58:46.19,0:58:47.50,Default,,0000,0000,0000,,this thing here Dialogue: 0,0:58:47.50,0:58:50.98,Default,,0000,0000,0000,,and which I was calling YI. Dialogue: 0,0:58:50.98,0:58:54.20,Default,,0000,0000,0000,,And so confluently, for every state SI we want to Dialogue: 0,0:58:54.20,0:58:57.16,Default,,0000,0000,0000,,estimate what the thing on the right hand side is Dialogue: 0,0:58:57.16,0:58:57.98,Default,,0000,0000,0000,,and but Dialogue: 0,0:58:57.98,0:59:01.98,Default,,0000,0000,0000,,there's an expectation here. There's an expectation over a continuous set of states, Dialogue: 0,0:59:01.98,0:59:06.61,Default,,0000,0000,0000,,may be a very high dimensional state so I can't compute this expectation exactly. Dialogue: 0,0:59:06.61,0:59:09.81,Default,,0000,0000,0000,,What I'll do instead is I'll use my simulator to sample Dialogue: 0,0:59:09.81,0:59:13.97,Default,,0000,0000,0000,,a set of states from this distribution from this P substrip, Dialogue: 0,0:59:13.97,0:59:16.96,Default,,0000,0000,0000,,SIA, from the state transition distribution Dialogue: 0,0:59:16.96,0:59:20.67,Default,,0000,0000,0000,,of where I get to if I take the action A in the state as I, Dialogue: 0,0:59:20.67,0:59:23.92,Default,,0000,0000,0000,,and then I'll average over that sample of states Dialogue: 0,0:59:23.92,0:59:26.82,Default,,0000,0000,0000,,to compute this expectation. Dialogue: 0,0:59:26.82,0:59:30.61,Default,,0000,0000,0000,,And so stepping through the algorithm just says that for Dialogue: 0,0:59:30.61,0:59:31.50,Default,,0000,0000,0000,,each Dialogue: 0,0:59:31.50,0:59:36.33,Default,,0000,0000,0000,,state and for each action, I'm going to sample a set of states. This S prime 1 through S Dialogue: 0,0:59:36.33,0:59:40.66,Default,,0000,0000,0000,,prime K from that state transition distribution, still using the model, and Dialogue: 0,0:59:40.66,0:59:43.91,Default,,0000,0000,0000,,then I'll set Q(a) to be equal to that average Dialogue: 0,0:59:43.91,0:59:45.49,Default,,0000,0000,0000,,and so this is my Dialogue: 0,0:59:45.49,0:59:48.43,Default,,0000,0000,0000,,estimate for R(s)I + G(this Dialogue: 0,0:59:48.43,0:59:50.40,Default,,0000,0000,0000,,expected value Dialogue: 0,0:59:50.40,0:59:53.63,Default,,0000,0000,0000,,for that specific action A). Dialogue: 0,0:59:53.63,0:59:56.76,Default,,0000,0000,0000,,Then I'll take the maximum of actions A Dialogue: 0,0:59:56.76,1:00:00.89,Default,,0000,0000,0000,,and this gives me YI, and so YI is for S for that. Dialogue: 0,1:00:00.89,1:00:02.92,Default,,0000,0000,0000,,And finally, I'll run Dialogue: 0,1:00:02.92,1:00:06.78,Default,,0000,0000,0000,,really run linear regression which is that last of the set [inaudible] to get Dialogue: 0,1:00:06.78,1:00:07.79,Default,,0000,0000,0000,,V(s)I Dialogue: 0,1:00:07.79,1:00:11.48,Default,,0000,0000,0000,,to be close to the YIs. 
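Putting the steps just described together, a minimal sketch of fitted value iteration might look like the following - assuming a sampled list of states, a list of discrete actions, a stochastic simulator step(s, a), a reward function R(s), the feature map phi(s), and a discount gamma; the code and its names are illustrative, not the lecture's own implementation:

    import numpy as np

    def fitted_value_iteration(states, actions, step, R, phi, gamma,
                               k=10, num_iters=50):
        theta = np.zeros(len(phi(states[0])))          # initialize theta := 0
        Phi = np.array([phi(s) for s in states])       # m x d design matrix
        for _ in range(num_iters):
            y = []
            for s in states:
                q = []
                for a in actions:
                    # sample k next states s' ~ P_{sa} using the simulator
                    samples = [step(s, a) for _ in range(k)]
                    # estimate q(a) = R(s) + gamma * E[V(s')], with V(s') = theta^T phi(s')
                    q.append(R(s) + gamma * np.mean([theta @ phi(sp) for sp in samples]))
                y.append(max(q))                       # y_i = max_a q(a)
            # choose theta to minimize sum_i (theta^T phi(s_i) - y_i)^2
            theta, _, _, _ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
        return theta

With a deterministic simulator, k can be set to 1, as comes up a little later in the discussion.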
Dialogue: 0,1:00:11.48,1:00:16.18,Default,,0000,0000,0000,,And so this algorithm is called fitted value iteration and it actually often Dialogue: 0,1:00:16.18,1:00:19.39,Default,,0000,0000,0000,,works quite well for continuous, Dialogue: 0,1:00:19.39,1:00:22.66,Default,,0000,0000,0000,,for problems with anywhere from Dialogue: 0,1:00:22.66,1:00:26.01,Default,,0000,0000,0000,,6- to 10- to 20-dimensional state spaces Dialogue: 0,1:00:33.95,1:00:39.50,Default,,0000,0000,0000,,if you can choose appropriate features. Can you raise a hand please if this algorithm makes sense? Dialogue: 0,1:00:39.50,1:00:40.79,Default,,0000,0000,0000,,Some of you didn't have your hands up. Dialogue: 0,1:00:40.79,1:00:47.28,Default,,0000,0000,0000,,Are there questions for those, Dialogue: 0,1:00:48.41,1:00:51.34,Default,,0000,0000,0000,,yeah? Is there a recess [inaudible] function in this setup? Dialogue: 0,1:00:51.34,1:00:55.67,Default,,0000,0000,0000,,Oh, yes. An MDP comprises Dialogue: 0,1:00:55.67,1:00:57.19,Default,,0000,0000,0000,,SA, Dialogue: 0,1:00:57.19,1:00:59.87,Default,,0000,0000,0000,,the state transition probabilities G Dialogue: 0,1:00:59.87,1:01:01.23,Default,,0000,0000,0000,,and Dialogue: 0,1:01:01.23,1:01:02.38,Default,,0000,0000,0000,,R Dialogue: 0,1:01:02.38,1:01:03.36,Default,,0000,0000,0000,,and so Dialogue: 0,1:01:03.36,1:01:08.36,Default,,0000,0000,0000,,for continuous state spaces, S would be like R4 for the inverted pendulum Dialogue: 0,1:01:08.36,1:01:09.41,Default,,0000,0000,0000,,or something. Dialogue: 0,1:01:09.41,1:01:13.53,Default,,0000,0000,0000,,Actions with discretized state transitions probabilities with specifying with the model Dialogue: 0,1:01:13.53,1:01:14.78,Default,,0000,0000,0000,,or Dialogue: 0,1:01:14.78,1:01:17.47,Default,,0000,0000,0000,,the simulator, Dialogue: 0,1:01:17.47,1:01:19.55,Default,,0000,0000,0000,,G is just a real number like .99 Dialogue: 0,1:01:19.55,1:01:23.60,Default,,0000,0000,0000,,and the real function is usually a function that's given to you. And so Dialogue: 0,1:01:23.60,1:01:26.10,Default,,0000,0000,0000,,the reward function Dialogue: 0,1:01:26.10,1:01:30.69,Default,,0000,0000,0000,,is some function of your 4-dimensional state, Dialogue: 0,1:01:30.69,1:01:31.49,Default,,0000,0000,0000,,and Dialogue: 0,1:01:31.49,1:01:32.99,Default,,0000,0000,0000,,for example, you Dialogue: 0,1:01:32.99,1:01:39.99,Default,,0000,0000,0000,,might choose a reward function to be minus " I don't know. Dialogue: 0,1:01:40.03,1:01:47.03,Default,,0000,0000,0000,,Just for an example of simple reward function, if we want a -1 if the pole has fallen Dialogue: 0,1:01:47.82,1:01:51.64,Default,,0000,0000,0000,,and there it depends on you choose your reward function to be -1 if Dialogue: 0,1:01:51.64,1:01:52.29,Default,,0000,0000,0000,,the Dialogue: 0,1:01:52.29,1:01:54.62,Default,,0000,0000,0000,,inverted pendulum falls over Dialogue: 0,1:01:54.62,1:01:58.70,Default,,0000,0000,0000,,and to find that its angle is greater than 30° or something Dialogue: 0,1:01:58.70,1:02:02.18,Default,,0000,0000,0000,,and zero otherwise. Dialogue: 0,1:02:02.18,1:02:03.94,Default,,0000,0000,0000,,So that would be an example Dialogue: 0,1:02:03.94,1:02:06.64,Default,,0000,0000,0000,,of a reward function that you can choose for the inverted pendulum, but yes, assuming a Dialogue: 0,1:02:06.64,1:02:09.62,Default,,0000,0000,0000,,reward function is given to you Dialogue: 0,1:02:09.62,1:02:14.59,Default,,0000,0000,0000,,so that you can compute R(s)I for any state. 
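That example reward, written out as a small sketch using the 30-degree threshold just mentioned (purely illustrative):

    import numpy as np

    def R(s):
        # -1 if the pole has fallen past 30 degrees, 0 otherwise
        x, x_dot, theta, theta_dot = s
        return -1.0 if abs(theta) > np.deg2rad(30) else 0.0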
Dialogue: 0,1:02:14.59,1:02:21.59,Default,,0000,0000,0000,,Are there other questions? Dialogue: 0,1:02:39.69,1:02:40.62,Default,,0000,0000,0000,,Actually, let Dialogue: 0,1:02:40.62,1:02:47.62,Default,,0000,0000,0000,,me try Dialogue: 0,1:02:50.46,1:02:57.46,Default,,0000,0000,0000,, Dialogue: 0,1:03:23.98,1:03:28.48,Default,,0000,0000,0000,,asking a question, so everything I did here assume that we have a stochastic simulator. Dialogue: 0,1:03:28.48,1:03:31.88,Default,,0000,0000,0000,,So it Dialogue: 0,1:03:31.88,1:03:35.39,Default,,0000,0000,0000,,turns out I can simply this algorithm if I have a deterministic simulator, Dialogue: 0,1:03:35.39,1:03:36.51,Default,,0000,0000,0000,,but Dialogue: 0,1:03:36.51,1:03:38.77,Default,,0000,0000,0000,,deterministic simulator is given a Dialogue: 0,1:03:38.77,1:03:42.08,Default,,0000,0000,0000,,stated action, my next state is always exactly determined. Dialogue: 0,1:03:42.08,1:03:45.57,Default,,0000,0000,0000,,So let me ask you, if I have a deterministic simulator, Dialogue: 0,1:03:45.57,1:03:52.57,Default,,0000,0000,0000,,how would I change this algorithm? How would I simplify this algorithm? Lower your samples that you're drawing [inaudible]. Dialogue: 0,1:03:53.93,1:03:57.71,Default,,0000,0000,0000,, Dialogue: 0,1:03:57.71,1:03:58.72,Default,,0000,0000,0000,,Right, so Dialogue: 0,1:03:58.72,1:04:01.25,Default,,0000,0000,0000,,Justin's going right. If I have a deterministic simulator, Dialogue: 0,1:04:01.25,1:04:04.87,Default,,0000,0000,0000,,all my samples from those would be exactly the same, and so if I have a Dialogue: 0,1:04:04.87,1:04:06.15,Default,,0000,0000,0000,,deterministic simulator, Dialogue: 0,1:04:06.15,1:04:09.11,Default,,0000,0000,0000,,I can set K to be equal to 1, Dialogue: 0,1:04:09.11,1:04:10.93,Default,,0000,0000,0000,,so I don't need to draw Dialogue: 0,1:04:10.93,1:04:15.87,Default,,0000,0000,0000,,K different samples. I really only need one sample if I have a Dialogue: 0,1:04:15.87,1:04:17.30,Default,,0000,0000,0000,,deterministic simulator, Dialogue: 0,1:04:17.30,1:04:21.67,Default,,0000,0000,0000,,so you can simplify this by setting K=1 if you have a deterministic simulator. Yeah? I guess I'm really Dialogue: 0,1:04:21.67,1:04:28.16,Default,,0000,0000,0000,,confused about the, Dialogue: 0,1:04:28.16,1:04:29.03,Default,,0000,0000,0000,,yeah, Dialogue: 0,1:04:29.03,1:04:34.43,Default,,0000,0000,0000,,we sorta turned this [inaudible] into something that looks like linear Dialogue: 0,1:04:34.43,1:04:37.25,Default,,0000,0000,0000,,state Dialogue: 0,1:04:37.25,1:04:40.09,Default,,0000,0000,0000,,regression or some' you know the data transpose times something that Dialogue: 0,1:04:40.09,1:04:43.89,Default,,0000,0000,0000,,we're used to but I guess I'm a Dialogue: 0,1:04:43.89,1:04:45.37,Default,,0000,0000,0000,,little. I don't know really know what question to ask but like when we did this before we Dialogue: 0,1:04:45.37,1:04:48.33,Default,,0000,0000,0000,,had like discrete states and everything. We Dialogue: 0,1:04:48.33,1:04:50.02,Default,,0000,0000,0000,,were determined with Dialogue: 0,1:04:50.02,1:04:53.19,Default,,0000,0000,0000,,finding this optimal policy Dialogue: 0,1:04:53.19,1:04:56.71,Default,,0000,0000,0000,,and I Dialogue: 0,1:04:56.71,1:05:01.87,Default,,0000,0000,0000,,guess it doesn't look like we haven't said the word policy in a while so kinda Dialogue: 0,1:05:01.87,1:05:05.01,Default,,0000,0000,0000,,difficult. 
Okay, yeah, so [inaudible] matters back to policy but maybe I should Dialogue: 0,1:05:05.01,1:05:06.82,Default,,0000,0000,0000,,just say a couple words so Dialogue: 0,1:05:06.82,1:05:10.66,Default,,0000,0000,0000,,let me actually try to get at some of what maybe what you're saying. Dialogue: 0,1:05:10.66,1:05:14.25,Default,,0000,0000,0000,,Our strategy for finding optimal policy Dialogue: 0,1:05:14.25,1:05:15.08,Default,,0000,0000,0000,,has been to Dialogue: 0,1:05:15.08,1:05:19.27,Default,,0000,0000,0000,,find some way to find V Dialogue: 0,1:05:19.27,1:05:21.17,Default,,0000,0000,0000,,and Dialogue: 0,1:05:21.17,1:05:24.09,Default,,0000,0000,0000,,some of approximations Dialogue: 0,1:05:24.09,1:05:28.55,Default,,0000,0000,0000,,of ?*. So far everything I've been doing has been focused on how to find V*. I just Dialogue: 0,1:05:28.55,1:05:30.48,Default,,0000,0000,0000,,want Dialogue: 0,1:05:33.12,1:05:36.36,Default,,0000,0000,0000,,to say one more word. It actually turns out that Dialogue: 0,1:05:36.36,1:05:40.73,Default,,0000,0000,0000,,for linear regression it's often very easy. It's often not terribly difficult to Dialogue: 0,1:05:40.73,1:05:43.02,Default,,0000,0000,0000,,choose some resource of the features. Choosing Dialogue: 0,1:05:43.02,1:05:46.98,Default,,0000,0000,0000,,features for approximating the value function is often somewhat Dialogue: 0,1:05:46.98,1:05:48.40,Default,,0000,0000,0000,,harder, Dialogue: 0,1:05:48.40,1:05:50.97,Default,,0000,0000,0000,,so because the value of a state is Dialogue: 0,1:05:50.97,1:05:52.02,Default,,0000,0000,0000,,how good is Dialogue: 0,1:05:52.02,1:05:54.38,Default,,0000,0000,0000,,starting off in this state. Dialogue: 0,1:05:54.38,1:05:56.04,Default,,0000,0000,0000,,What is my Dialogue: 0,1:05:56.04,1:05:59.61,Default,,0000,0000,0000,,expected sum of discounted rewards? What if I start in a certain state? Dialogue: 0,1:05:59.61,1:06:00.41,Default,,0000,0000,0000,,And so Dialogue: 0,1:06:00.41,1:06:03.56,Default,,0000,0000,0000,,what the feature of the state have to measure is really Dialogue: 0,1:06:03.56,1:06:07.89,Default,,0000,0000,0000,,how good is it to start in a certain state? Dialogue: 0,1:06:07.89,1:06:12.21,Default,,0000,0000,0000,,And so for inverted pendulum you actually have that Dialogue: 0,1:06:12.21,1:06:15.90,Default,,0000,0000,0000,,states where the poles are vertical and when a cart that's centered on your Dialogue: 0,1:06:15.90,1:06:18.08,Default,,0000,0000,0000,,track or something, maybe better and so you can Dialogue: 0,1:06:18.08,1:06:21.30,Default,,0000,0000,0000,,come up with features that measure the orientation of the pole and how close Dialogue: 0,1:06:21.30,1:06:25.69,Default,,0000,0000,0000,,you are to the center of the track and so on and those will be reasonable features to use Dialogue: 0,1:06:25.69,1:06:28.33,Default,,0000,0000,0000,,to approximate V*. Dialogue: 0,1:06:28.33,1:06:30.25,Default,,0000,0000,0000,,Although in general it is true that Dialogue: 0,1:06:30.25,1:06:33.86,Default,,0000,0000,0000,,choosing features, the value function approximation, is often slightly trickier Dialogue: 0,1:06:33.86,1:06:39.72,Default,,0000,0000,0000,,than choosing good features for linear regression. Okay and then Justin's Dialogue: 0,1:06:39.72,1:06:40.71,Default,,0000,0000,0000,,questions of Dialogue: 0,1:06:40.71,1:06:42.53,Default,,0000,0000,0000,,so given Dialogue: 0,1:06:42.53,1:06:47.11,Default,,0000,0000,0000,,V*, how do you go back to actually find a policy? 
Dialogue: 0,1:06:47.11,1:06:49.80,Default,,0000,0000,0000,,In the discrete case, so we Dialogue: 0,1:06:49.80,1:06:54.11,Default,,0000,0000,0000,,have that ?*(s) is equal to all Dialogue: 0,1:06:54.11,1:07:01.11,Default,,0000,0000,0000,,[inaudible] over A of Dialogue: 0,1:07:04.32,1:07:07.47,Default,,0000,0000,0000,,that. So that's again, I used to write this as a sum over states [inaudible]. I'll just write this Dialogue: 0,1:07:07.47,1:07:12.87,Default,,0000,0000,0000,,as an expectation Dialogue: 0,1:07:12.87,1:07:15.61,Default,,0000,0000,0000,,and so then once you find the optimal value Dialogue: 0,1:07:15.61,1:07:16.74,Default,,0000,0000,0000,,function V*, Dialogue: 0,1:07:16.74,1:07:18.58,Default,,0000,0000,0000,,you can then Dialogue: 0,1:07:18.58,1:07:20.35,Default,,0000,0000,0000,,find the optimal policy ?* by computing the Dialogue: 0,1:07:20.35,1:07:22.58,Default,,0000,0000,0000,,[inaudible]. Dialogue: 0,1:07:22.58,1:07:25.22,Default,,0000,0000,0000,,So if you're in a continuous state MDP, Dialogue: 0,1:07:25.22,1:07:26.43,Default,,0000,0000,0000,,then Dialogue: 0,1:07:26.43,1:07:30.41,Default,,0000,0000,0000,,you can't actually do this in advance for every single state because there's an Dialogue: 0,1:07:30.41,1:07:31.90,Default,,0000,0000,0000,,infinite number of states Dialogue: 0,1:07:31.90,1:07:34.81,Default,,0000,0000,0000,,and so you can't actually perform this computation in advance to every single Dialogue: 0,1:07:34.81,1:07:36.43,Default,,0000,0000,0000,,state. Dialogue: 0,1:07:36.43,1:07:38.95,Default,,0000,0000,0000,,What you do instead is Dialogue: 0,1:07:38.95,1:07:43.21,Default,,0000,0000,0000,,whenever your robot is in some specific state S is only when your system Dialogue: 0,1:07:43.21,1:07:48.07,Default,,0000,0000,0000,,is in some specific state S like your car is at some position orientation or Dialogue: 0,1:07:48.07,1:07:49.83,Default,,0000,0000,0000,,your inverted pendulum Dialogue: 0,1:07:49.83,1:07:52.95,Default,,0000,0000,0000,,is in some specific position, posed in some Dialogue: 0,1:07:52.95,1:07:56.53,Default,,0000,0000,0000,,specific angle T. It's Dialogue: 0,1:07:56.53,1:07:58.49,Default,,0000,0000,0000,,only when Dialogue: 0,1:07:58.49,1:08:03.07,Default,,0000,0000,0000,,your system, be it a factor or a board game or a robot, Dialogue: 0,1:08:03.07,1:08:05.30,Default,,0000,0000,0000,,is in some specific state S Dialogue: 0,1:08:05.30,1:08:08.81,Default,,0000,0000,0000,,that you would then go ahead and compute this augmax, so Dialogue: 0,1:08:08.81,1:08:10.23,Default,,0000,0000,0000,,it's only when Dialogue: 0,1:08:10.23,1:08:15.78,Default,,0000,0000,0000,,you're in some state S that you then compute this augmax, and then Dialogue: 0,1:08:15.78,1:08:18.10,Default,,0000,0000,0000,,you Dialogue: 0,1:08:18.10,1:08:22.59,Default,,0000,0000,0000,, Dialogue: 0,1:08:22.59,1:08:24.44,Default,,0000,0000,0000,,execute that action A Dialogue: 0,1:08:24.44,1:08:25.16,Default,,0000,0000,0000,,and then Dialogue: 0,1:08:25.16,1:08:29.49,Default,,0000,0000,0000,,as a result of your action, your robot would transition to some new state and then Dialogue: 0,1:08:29.49,1:08:33.76,Default,,0000,0000,0000,,so it'll be given that specific new state that you compute as Dialogue: 0,1:08:33.76,1:08:40.76,Default,,0000,0000,0000,,augmax using that specific state S that Dialogue: 0,1:08:41.80,1:08:45.74,Default,,0000,0000,0000,,you're in. There're a few ways to do it. 
One way to do this is Dialogue: 0,1:08:45.74,1:08:50.10,Default,,0000,0000,0000,,actually the same as in the inner loop of the fitted value iteration algorithm so Dialogue: 0,1:08:50.10,1:08:52.15,Default,,0000,0000,0000,,because of an expectation of a large number Dialogue: 0,1:08:52.15,1:08:55.22,Default,,0000,0000,0000,,of states, you'd need to sample Dialogue: 0,1:08:55.22,1:08:57.100,Default,,0000,0000,0000,,some set of states from the simulator and then Dialogue: 0,1:08:57.100,1:09:03.12,Default,,0000,0000,0000,,approximate this expectation using an average over your samples, so it's actually as Dialogue: 0,1:09:03.12,1:09:05.82,Default,,0000,0000,0000,,inner loop of the value iteration algorithm. Dialogue: 0,1:09:05.82,1:09:09.48,Default,,0000,0000,0000,,So you could do that. That's sometimes done. Sometimes it can also be a pain to have to sample Dialogue: 0,1:09:09.77,1:09:13.03,Default,,0000,0000,0000,,a set of states Dialogue: 0,1:09:13.03,1:09:16.23,Default,,0000,0000,0000,,to approximate those expectations every time you want to take an action in your MDP. Dialogue: 0,1:09:16.23,1:09:18.36,Default,,0000,0000,0000,,Couple Dialogue: 0,1:09:18.36,1:09:22.71,Default,,0000,0000,0000,,of special cases where this can be done, one special case is if you Dialogue: 0,1:09:22.71,1:09:29.71,Default,,0000,0000,0000,,have Dialogue: 0,1:09:31.71,1:09:35.56,Default,,0000,0000,0000,,a deterministic simulator. If it's a deterministic simulator, so in other words, if your similar is Dialogue: 0,1:09:35.56,1:09:39.77,Default,,0000,0000,0000,,just some function, Dialogue: 0,1:09:39.77,1:09:43.69,Default,,0000,0000,0000,,could be a linear or a nonlinear function. If it's a deterministic simulator Dialogue: 0,1:09:43.69,1:09:47.57,Default,,0000,0000,0000,,then the next state, ST+1, is just some function of your Dialogue: 0,1:09:47.57,1:09:50.19,Default,,0000,0000,0000,,previous stated action. Dialogue: 0,1:09:50.19,1:09:52.06,Default,,0000,0000,0000,,If that's the case then Dialogue: 0,1:09:52.06,1:09:54.85,Default,,0000,0000,0000,,this expectation, well, then Dialogue: 0,1:09:54.85,1:09:59.26,Default,,0000,0000,0000,,this simplifies to augmax of A of V* Dialogue: 0,1:09:59.26,1:10:00.71,Default,,0000,0000,0000,,of F Dialogue: 0,1:10:00.71,1:10:03.41,Default,,0000,0000,0000,,of Dialogue: 0,1:10:03.41,1:10:07.28,Default,,0000,0000,0000,,I guess S,A Dialogue: 0,1:10:07.28,1:10:08.25,Default,,0000,0000,0000,,because Dialogue: 0,1:10:08.25,1:10:11.07,Default,,0000,0000,0000,,this is really saying Dialogue: 0,1:10:11.07,1:10:14.57,Default,,0000,0000,0000,,S Dialogue: 0,1:10:14.57,1:10:18.41,Default,,0000,0000,0000,,prime=F(s),A. I switched back and forth between notation; I hope that's okay. S Dialogue: 0,1:10:18.41,1:10:21.80,Default,,0000,0000,0000,,to denote the current state, and S prime to deterministic state versus ST and ST+1 through the Dialogue: 0,1:10:21.80,1:10:25.53,Default,,0000,0000,0000,,current state. Both of these are Dialogue: 0,1:10:25.53,1:10:27.70,Default,,0000,0000,0000,,sorta standing notation and don't Dialogue: 0,1:10:27.70,1:10:31.19,Default,,0000,0000,0000,,mind my switching back and forth between them. 
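A minimal sketch of that argmax for the deterministic-simulator case - assuming a deterministic next-state function f(s, a), the feature map phi, the learned parameters theta, and a list of discrete actions; the names are carried over from the earlier sketches, not from the lecture:

    import numpy as np

    def pi(s, f, phi, theta, actions):
        # pi(s) = argmax_a V(f(s, a)), with V(s') approximated by theta^T phi(s')
        values = [theta @ phi(f(s, a)) for a in actions]
        return actions[int(np.argmax(values))]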
But if it's a Dialogue: 0,1:10:31.19,1:10:35.31,Default,,0000,0000,0000,,deterministic simulator you can then just compute what the next state S prime would be Dialogue: 0,1:10:35.31,1:10:37.73,Default,,0000,0000,0000,,for each action you might take from the current state, Dialogue: 0,1:10:37.73,1:10:41.69,Default,,0000,0000,0000,,and then take the argmax over actions A, Dialogue: 0,1:10:41.69,1:10:47.23,Default,,0000,0000,0000,,basically choose the action that gets you to the highest value Dialogue: 0,1:10:47.23,1:10:54.23,Default,,0000,0000,0000,,state. Dialogue: 0,1:11:04.30,1:11:04.66,Default,,0000,0000,0000,,And Dialogue: 0,1:11:04.66,1:11:08.92,Default,,0000,0000,0000,,so that's one case where you can compute the argmax and we can compute that Dialogue: 0,1:11:08.92,1:11:13.47,Default,,0000,0000,0000,,expectation without needing to approximate it with an average over samples. Dialogue: 0,1:11:13.47,1:11:16.67,Default,,0000,0000,0000,,Another very common case actually it turns out is if you have a stochastic Dialogue: 0,1:11:24.01,1:11:25.97,Default,,0000,0000,0000,,simulator, but if your Dialogue: 0,1:11:25.97,1:11:30.46,Default,,0000,0000,0000,,simulator happens to take on the very specific form Dialogue: 0,1:11:33.82,1:11:36.12,Default,,0000,0000,0000,,ST+1 = F(ST, AT) + epsilon T Dialogue: 0,1:11:37.12,1:11:43.69,Default,,0000,0000,0000,,where this Dialogue: 0,1:11:43.69,1:11:47.90,Default,,0000,0000,0000,,is Gaussian noise. This is a very common way to build simulators where you model the next Dialogue: 0,1:11:47.90,1:11:50.23,Default,,0000,0000,0000,,state as a function of the current state and action Dialogue: 0,1:11:50.23,1:11:52.26,Default,,0000,0000,0000,,plus some noise and so Dialogue: 0,1:11:52.26,1:11:57.34,Default,,0000,0000,0000,,one specific example would be the sort of linear dynamical system that Dialogue: 0,1:11:57.34,1:12:01.40,Default,,0000,0000,0000,,we talked about, with a linear function of the current state and action plus Dialogue: 0,1:12:01.40,1:12:03.02,Default,,0000,0000,0000,,Gaussian Dialogue: 0,1:12:03.02,1:12:07.96,Default,,0000,0000,0000,,noise. In this case, you can approximate the argmax over Dialogue: 0,1:12:07.96,1:12:14.96,Default,,0000,0000,0000,,A, well. Dialogue: 0,1:12:21.81,1:12:27.16,Default,,0000,0000,0000,,In that case you take that Dialogue: 0,1:12:27.16,1:12:29.09,Default,,0000,0000,0000,,expectation that Dialogue: 0,1:12:29.09,1:12:33.18,Default,,0000,0000,0000,,you're trying to approximate. The Dialogue: 0,1:12:33.18,1:12:37.31,Default,,0000,0000,0000,,expected value of V of S prime you replace with V of the Dialogue: 0,1:12:37.31,1:12:41.48,Default,,0000,0000,0000,, Dialogue: 0,1:12:41.48,1:12:45.26,Default,,0000,0000,0000,,expected value of Dialogue: 0,1:12:45.26,1:12:49.18,Default,,0000,0000,0000,,S prime, and this is an approximation.
Dialogue: 0,1:12:49.18,1:12:52.51,Default,,0000,0000,0000,,The expected value of a function is usually not equal to the function of the Dialogue: 0,1:12:52.51,1:12:53.06,Default,,0000,0000,0000,,expectation but Dialogue: 0,1:12:53.06,1:12:54.16,Default,,0000,0000,0000,,it is often Dialogue: 0,1:12:54.16,1:12:56.12,Default,,0000,0000,0000,,a reasonable approximation Dialogue: 0,1:12:56.12,1:12:58.34,Default,,0000,0000,0000,,and so Dialogue: 0,1:13:01.92,1:13:04.67,Default,,0000,0000,0000,,that would be another way to Dialogue: 0,1:13:04.67,1:13:06.44,Default,,0000,0000,0000,,approximate that expectation Dialogue: 0,1:13:06.44,1:13:11.27,Default,,0000,0000,0000,,and so you choose the actions Dialogue: 0,1:13:11.27,1:13:14.34,Default,,0000,0000,0000,,according to Dialogue: 0,1:13:14.34,1:13:21.34,Default,,0000,0000,0000,,basically the same formula as I wrote just now. Dialogue: 0,1:13:22.01,1:13:23.62,Default,,0000,0000,0000,,And so this Dialogue: 0,1:13:23.62,1:13:24.92,Default,,0000,0000,0000,,would be a way of Dialogue: 0,1:13:24.92,1:13:27.11,Default,,0000,0000,0000,,approximating Dialogue: 0,1:13:28.16,1:13:30.53,Default,,0000,0000,0000,,this argmax, Dialogue: 0,1:13:30.53,1:13:34.49,Default,,0000,0000,0000,,ignoring the noise in the simulator essentially. And Dialogue: 0,1:13:34.49,1:13:37.61,Default,,0000,0000,0000,,this often works pretty well as well just because many simulators turn Dialogue: 0,1:13:37.61,1:13:38.54,Default,,0000,0000,0000,,out Dialogue: 0,1:13:38.54,1:13:42.70,Default,,0000,0000,0000,,to be of the form of some linear or some nonlinear function plus zero-mean Dialogue: 0,1:13:42.70,1:13:43.75,Default,,0000,0000,0000,,Gaussian noise, Dialogue: 0,1:13:43.75,1:13:46.13,Default,,0000,0000,0000,,so you can just ignore the zero-mean Dialogue: 0,1:13:46.13,1:13:47.37,Default,,0000,0000,0000,,Gaussian Dialogue: 0,1:13:47.37,1:13:50.07,Default,,0000,0000,0000,,noise, so that you can compute this quickly. Dialogue: 0,1:13:50.07,1:13:54.23,Default,,0000,0000,0000,,And just to be complete about this, what Dialogue: 0,1:13:54.23,1:13:55.90,Default,,0000,0000,0000,,that is, right, Dialogue: 0,1:13:55.90,1:13:58.46,Default,,0000,0000,0000,,that V* of F(S, A), Dialogue: 0,1:13:58.46,1:13:58.97,Default,,0000,0000,0000,,this Dialogue: 0,1:13:58.97,1:14:01.67,Default,,0000,0000,0000,,you Dialogue: 0,1:14:04.39,1:14:08.88,Default,,0000,0000,0000,,would write as theta transpose phi of S prime, where S prime = F(S, A). Great, Dialogue: 0,1:14:08.88,1:14:12.22,Default,,0000,0000,0000,,so this V* you would compute using Dialogue: 0,1:14:12.22,1:14:15.05,Default,,0000,0000,0000,,the parameters theta that you just learned using the Dialogue: 0,1:14:15.05,1:14:21.72,Default,,0000,0000,0000,,fitted value iteration Dialogue: 0,1:14:21.72,1:14:23.17,Default,,0000,0000,0000,,algorithm. Dialogue: 0,1:14:23.17,1:14:30.17,Default,,0000,0000,0000,,Questions about this? Student:[Inaudible] case, Dialogue: 0,1:14:33.85,1:14:40.85,Default,,0000,0000,0000,,for real-time application is it possible to use that [inaudible], for example for Dialogue: 0,1:14:41.81,1:14:47.74,Default,,0000,0000,0000,,[inaudible]. Instructor (Andrew Ng): Yes - in real-time applications, is it possible to sample k states and use that to estimate this Dialogue: 0,1:14:51.16,1:14:55.51,Default,,0000,0000,0000,,expectation. Computers today are actually amazingly fast.
I'm actually often Dialogue: 0,1:14:55.51,1:14:58.66,Default,,0000,0000,0000,,surprised by how much you can do in real time so Dialogue: 0,1:14:58.66,1:15:00.27,Default,,0000,0000,0000,,the helicopter we actually flying a helicopter Dialogue: 0,1:15:00.27,1:15:03.89,Default,,0000,0000,0000,,using an algorithm different than Dialogue: 0,1:15:03.89,1:15:04.74,Default,,0000,0000,0000,,this? I can't say. Dialogue: 0,1:15:06.28,1:15:10.54,Default,,0000,0000,0000,,But my intuition is that you could actually do this with a Dialogue: 0,1:15:10.54,1:15:14.01,Default,,0000,0000,0000,,helicopter. A helicopter would control at somewhere between 10hz and 50hz. You need to do Dialogue: 0,1:15:14.01,1:15:16.39,Default,,0000,0000,0000,,this 10 times a second to 50 times a second, Dialogue: 0,1:15:16.39,1:15:18.10,Default,,0000,0000,0000,,and that's actually plenty of time Dialogue: 0,1:15:18.10,1:15:19.40,Default,,0000,0000,0000,,to sample Dialogue: 0,1:15:19.40,1:15:23.50,Default,,0000,0000,0000,,1,000 states and compute this Dialogue: 0,1:15:23.50,1:15:26.91,Default,,0000,0000,0000,,expectation. They're real difficult, helicopters because helicopters are mission critical, and you Dialogue: 0,1:15:26.91,1:15:29.50,Default,,0000,0000,0000,,do something it's like fast. You Dialogue: 0,1:15:29.50,1:15:31.85,Default,,0000,0000,0000,,can do serious damage and so Dialogue: 0,1:15:31.85,1:15:35.75,Default,,0000,0000,0000,,maybe not for good reasons. We've actually tended to avoid Dialogue: 0,1:15:35.75,1:15:37.04,Default,,0000,0000,0000,,tossing coins when we're Dialogue: 0,1:15:37.04,1:15:40.18,Default,,0000,0000,0000,,in the air, so the ideal of letting our actions be some Dialogue: 0,1:15:40.18,1:15:45.10,Default,,0000,0000,0000,,up with some random process is slightly scary and just tend not to do that. Dialogue: 0,1:15:45.10,1:15:48.99,Default,,0000,0000,0000,,I should say that's prob'ly not a great reason because you average a Dialogue: 0,1:15:48.99,1:15:51.55,Default,,0000,0000,0000,,large number of things here very well fine but just Dialogue: 0,1:15:51.55,1:15:55.83,Default,,0000,0000,0000,,as a maybe overly conservative design choice, we actually Dialogue: 0,1:15:55.83,1:15:57.06,Default,,0000,0000,0000,,don't, tend not to find Dialogue: 0,1:15:57.06,1:15:59.50,Default,,0000,0000,0000,,anything randomized on Dialogue: 0,1:15:59.50,1:16:01.40,Default,,0000,0000,0000,,which is prob'ly being over conservative. Dialogue: 0,1:16:02.34,1:16:04.97,Default,,0000,0000,0000,,It's the choice we made cause other things are slightly safer. Dialogue: 0,1:16:04.97,1:16:08.15,Default,,0000,0000,0000,,I think you can actually often do this. Dialogue: 0,1:16:08.15,1:16:13.19,Default,,0000,0000,0000,,So long as I see a model can be evaluated fast enough where you can sample Dialogue: 0,1:16:13.19,1:16:15.47,Default,,0000,0000,0000,,100 state transitions or 1,000 state Dialogue: 0,1:16:15.47,1:16:18.12,Default,,0000,0000,0000,,transitions, and then do that at Dialogue: 0,1:16:18.12,1:16:21.55,Default,,0000,0000,0000,,10hz. They haven't said that. This is often attained which is why we often Dialogue: 0,1:16:21.55,1:16:28.55,Default,,0000,0000,0000,,use the other approximations that don't require your drawing a large sample. Anything else? No, okay, cool. So Dialogue: 0,1:16:31.80,1:16:35.80,Default,,0000,0000,0000,,now you know one algorithm [inaudible] reinforcement learning on continuous state Dialogue: 0,1:16:35.80,1:16:38.71,Default,,0000,0000,0000,,spaces. 
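For the stochastic-simulator case discussed in this exchange, the corresponding sketch would approximate the expectation by averaging over k sampled next states per action, on the order of the 100 or 1,000 samples mentioned above; again this is purely illustrative, reusing the names from the earlier sketches:

    import numpy as np

    def pi_sampled(s, step, phi, theta, actions, k=100):
        # approximate E[V(s')] for each action by an average over k simulator samples
        def q(a):
            return np.mean([theta @ phi(step(s, a)) for _ in range(k)])
        return actions[int(np.argmax([q(a) for a in actions]))]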
Then we'll pick up with some more ideas on some even Dialogue: 0,1:16:38.71,1:16:40.45,Default,,0000,0000,0000,,more powerful algorithms, Dialogue: 0,1:16:40.45,1:16:43.37,Default,,0000,0000,0000,,the solving MDPs of continuous state spaces. Dialogue: 0,1:16:43.37,1:16:44.17,Default,,0000,0000,0000,,Thanks. Let's close for today.