Welcome back. This is our third lecture in the series on mirror descent.

We're going to get down to the main definition of how we change the proximal term from the subgradient method. So again, recall that the subgradient method tells me that

    x_{t+1} = x_t − η g_t,

where g_t is an element of the subdifferential ∂f(x_t). I've said this in each of these lectures, but I'm leaving out the dependence on the set constraint; with a set constraint, this would take the form of an additional projection:

    x_{t+1} = Π_X(x_t − η g_t),

where the projection here means the Euclidean projection onto X. Because it doesn't change very much, for now I've just been sparing myself the extra writing, but we should really keep it in mind.

And we've seen repeatedly (this has been the theme of the first two lectures on mirror descent) that x_{t+1} is also, equivalently,

    x_{t+1} = argmin_x { η g_tᵀ x + (1/2)‖x − x_t‖₂² }.

You can think of the first term as the linear approximation to the function, and the second as something that keeps us from going too far; specifically, when we use the squared Euclidean norm there, we get exactly the subgradient method. As written, this is an argmin over all x, but if I have a set constraint, the only thing that changes is that it becomes an argmin over x in my constraint set. So again, not much has changed.
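To make the update concrete, here is a minimal NumPy sketch of one projected subgradient step. The helper names and the choice of a Euclidean ball as the constraint set are my own illustration, not from the lecture.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}: rescale if outside.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def subgradient_step(x, g, eta, project=None):
    # x_{t+1} = Pi_X(x_t - eta * g_t); with no constraint, just x_t - eta * g_t.
    x_next = x - eta * g
    return project(x_next) if project is not None else x_next

# One step on f(x) = ||x||_1, whose subdifferential at x contains sign(x).
x = np.array([0.5, -2.0, 1.0])
x = subgradient_step(x, np.sign(x), eta=0.1, project=project_ball)
```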
So, we're going to change this. And by changing it, we get mirror descent, and we'll see exactly when we can do this to our advantage.

In order to do this, and to explain what will replace it, we need to define the concept of a Bregman divergence. For a convex function φ, I can define its Bregman divergence as follows. The divergence with respect to φ, between two points x and y, is defined as

    D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)ᵀ(x − y).

So what does this mean? It actually has a very simple interpretation. Let me group the second and third terms, distributing the minus sign:

    D_φ(x, y) = φ(x) − [φ(y) + ∇φ(y)ᵀ(x − y)].

Now a simple picture tells us exactly what this is. If φ is a convex function, and I've got my point y and my point x, then what is the first term? It's just φ(x), sitting up on the graph. And the second term, φ(y) + ∇φ(y)ᵀ(x − y), we can interpret as the linear approximation to φ at y, evaluated at the point x. The Bregman divergence is therefore exactly the gap between the function at x and its linear approximation taken at y. The more curvature something has, the more Bregman divergence it has.
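The definition translates directly into code once we supply φ and its gradient. Here is a minimal sketch, assuming NumPy, with names of my own choosing.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>:
    # the gap at x between phi and its linearization around y.
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)
```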
Let's do a very simple calculation, in the simple case where φ(x) is just one half times the squared Euclidean norm, φ(x) = (1/2)‖x‖₂². Let's see what the Bregman divergence is in this case.

Well, plugging in the definition, I see that the Bregman divergence between x and y is φ(x), that's (1/2)‖x‖², minus φ(y), that's (1/2)‖y‖², minus ∇φ(y)ᵀ(x − y). And what's the gradient of one half times the squared Euclidean norm? It's just y itself. So

    D_φ(x, y) = (1/2)‖x‖² − (1/2)‖y‖² − yᵀ(x − y).

Grouping all these terms and rewriting, I can factor out the one half:

    D_φ(x, y) = (1/2)( ‖x‖² − 2yᵀx − ‖y‖² + 2‖y‖² ) = (1/2)( ‖x‖² − 2yᵀx + ‖y‖² ),

and you can see that this is exactly the quadratic: it equals (1/2)‖x − y‖². So, if we take φ to be one half the squared Euclidean norm, then the Bregman divergence is just one half the squared Euclidean norm of the difference. So we get something familiar. But we're going to see several examples where we take φ to be something different, and we therefore get a very different Bregman divergence.
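A quick numerical check of that calculation, a sketch of mine assuming NumPy: φ = (1/2)‖·‖² gives half the squared distance.

```python
import numpy as np

phi = lambda x: 0.5 * np.dot(x, x)   # phi(x) = (1/2)||x||_2^2
grad_phi = lambda x: x               # gradient of (1/2)||x||^2 is x itself

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])

d = phi(x) - phi(y) - grad_phi(y) @ (x - y)        # D_phi(x, y)
assert np.isclose(d, 0.5 * np.dot(x - y, x - y))   # equals (1/2)||x - y||^2
```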
Okay, and we're going to use this directly in our definition of mirror descent. Mirror descent now becomes

    x_{t+1} = argmin_x { η g_tᵀ x + D_φ(x, x_t) },

where the argmin is possibly over our constraint set. Previously we had one half times the squared Euclidean norm; now we have the Bregman divergence between x and x_t. So, if φ is equal to just one half the Euclidean norm squared, then we recover our usual subgradient method.

But if we have something else, if φ(x) is equal to, say, (1/2)xᵀQx for a different matrix Q, where Q is not the identity, then we get a different algorithm. And this was exactly the idea that started us off in mirror descent; this was the idea we had in lecture one, so see the first lecture in the mirror descent series.
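For that quadratic case, plugging φ(x) = (1/2)xᵀQx into the definition gives D_φ(x, y) = (1/2)(x − y)ᵀQ(x − y), a squared Q-weighted distance. The small check below, with an arbitrary symmetric positive definite Q, is my own sketch, not the lecture's.

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # a symmetric positive definite matrix

phi = lambda x: 0.5 * x @ Q @ x
grad_phi = lambda x: Q @ x           # gradient of (1/2) x^T Q x for symmetric Q

x, y = np.array([1.0, -1.0]), np.array([0.5, 2.0])
d = phi(x) - phi(y) - grad_phi(y) @ (x - y)
assert np.isclose(d, 0.5 * (x - y) @ Q @ (x - y))  # Q-weighted squared distance
```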
But we're going to choose φ to be different even from these. We'll see that we can take φ to be quite different, and thereby really change the geometry of the space. So, this is our first look at mirror descent. We're going to try to understand its properties through examples, and then, in the following lecture, actually derive a rate of convergence and its proof. But in the meantime, let's go back to the Bregman divergence, get used to it a little bit, and see what some of its properties are. In part I'm going to derive this because we'll need it; in part it's just to get us used to thinking about Bregman divergences. It is a little bit familiar, because we saw that if we choose φ to be half the Euclidean norm squared, we get back something very familiar, but if we plug in other things, it will be a little bit different. So I just want to get used to manipulating it, and to see that it's really not that difficult.

Okay, so, in part for exercise and in part because we'll need it, let's show that for any convex function φ we have the property

    (∇φ(x) − ∇φ(y))ᵀ(x − z) = D_φ(x, y) + D_φ(z, x) − D_φ(z, y).

This is just a matter of plugging in the definitions and evaluating, so let's do exactly that. From the definition, on the right-hand side the first term is

    D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)ᵀ(x − y),

the second term is

    D_φ(z, x) = φ(z) − φ(x) − ∇φ(x)ᵀ(z − x),

and the third term, which we subtract, becomes, distributing the minus sign,

    −D_φ(z, y) = −φ(z) + φ(y) + ∇φ(y)ᵀ(z − y).

Okay, so what do we see? Well, I see a positive φ(x) and a negative φ(x); I see a positive φ(z) and a negative φ(z); I see a negative φ(y) and a positive φ(y). So all of these cancel, and now I just have to collect the gradient terms. This is a little bit of accounting:

    −∇φ(y)ᵀ(x − y) − ∇φ(x)ᵀ(z − x) + ∇φ(y)ᵀ(z − y)
    = ∇φ(x)ᵀ(x − z) − ∇φ(y)ᵀ(x − y) + ∇φ(y)ᵀ(z − y)
    = ∇φ(x)ᵀ(x − z) − ∇φ(y)ᵀ(x − z).
Now I can group these together, and I get exactly what I wanted:

    (∇φ(x) − ∇φ(y))ᵀ(x − z).

We're going to use this property, but again, the point is for us not to worry about just plugging in the definitions. These are relatively simple things.
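Here is a quick numerical check of this three-point identity, a sketch of mine using the quadratic φ from before with a random positive definite Q.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

A = rng.normal(size=(3, 3))
Q = A @ A.T + np.eye(3)              # random symmetric positive definite matrix
phi = lambda w: 0.5 * w @ Q @ w
grad = lambda w: Q @ w

D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)   # Bregman divergence

lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
assert np.isclose(lhs, rhs)          # the three-point identity holds
```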
We're going to need something else in our development of mirror descent and when we analyze it, and this is the concept of a dual norm. So where have we been using dual norms without even really knowing it? We've been using them because everything we've done so far, with the exception of a few things like prox, where we used the L1 norm and so on, has been in the Euclidean norm.

So let me give the definition of a dual norm. I'm now going to use the double bar ‖·‖ as a generic norm, so don't look at it and think it's the two norm; this is just some norm. For any norm ‖·‖, its dual is defined as follows. I'm going to use the lower-star notation ‖·‖_*. The lower star is very frequently used to denote the dual, but in some other contexts it has a specific meaning; for example, with matrices it sometimes denotes the trace norm, or nuclear norm. I'll try to be crystal clear if we ever do need to talk about another norm that's denoted like this; for us, this star is always going to mean the dual norm unless I specify otherwise. So the dual norm is defined as

    ‖u‖_* = max { uᵀv : ‖v‖ ≤ 1 },

where the constraint is in the original norm, whatever that might have been. Okay, so this is just the definition.

So let's do some examples. Let's take ‖·‖ equal to the Euclidean norm; might as well warm up with something familiar. So what is the dual of the Euclidean norm? This is defined as

    ‖u‖_* = max { uᵀv : ‖v‖₂ ≤ 1 }.

We can just solve this geometrically. It's saying: your constraint set is the unit ball, so find the point on the sphere that has the largest inner product with a particular vector u. What is that going to be? Well, you can see it from the picture. You're certainly not going to pick something perpendicular to u; you'd get zero in the max. You're not going to pick something in the opposite direction; you'd get something negative. No, you're going to align exactly with u and go out to unit norm. In other words, the optimal solution, the maximizer, is just

    v* = u / ‖u‖₂,

and the max value is therefore uᵀv* = uᵀu / ‖u‖₂, which is the norm of u. In other words, the two norm is self-dual: what we just showed is that the Euclidean norm is its own dual. So this is the way we've actually been using the dual norm without even really knowing it, because we've just been working in the two norm.
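A one-line numerical confirmation of self-duality, again my own sketch: the maximizer over the unit ball is u/‖u‖₂, and the value it attains is ‖u‖₂.

```python
import numpy as np

u = np.array([3.0, -4.0])
v_star = u / np.linalg.norm(u)                    # feasible: ||v*||_2 = 1
assert np.isclose(u @ v_star, np.linalg.norm(u))  # value equals ||u||_2
```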
So let's take another example; that was example one. Let's take the norm equal to the one norm, and let's see what its dual is. So, what is the dual norm of u when the original norm is the one norm? Again, just copying the definition,

    ‖u‖_* = max { uᵀv : ‖v‖₁ ≤ 1 }.

So let's think about what this is going to give us. What does this one-norm constraint tell us? Geometrically, it's telling us that we have to be inside this diamond; algebraically, it's telling us that the absolute values of the entries of v have to sum to at most one. You can think about this as a resource allocation problem. You have to allocate your resources, but you only have 100%, and the pieces have to add up to one. Where are you going to put them? You're going to put them wherever the values are the biggest. The slight difference here is that the entries of u can be positive or negative, but then so can the entries of v. So what you're going to do is find the biggest element of u in absolute value, whether it's positive or negative, and put all of v, all of your energy, onto that coordinate. If the biggest element of u is negative, you also make v negative there; if it's positive, you make v positive. So in other words, our optimizing v* is going to look like the vector that's all zeros except for a single entry: a one in the i-th position, carrying the sign of u_i, where i is the coordinate at which |u_i| is biggest, i = argmax_i |u_i|. So what do we get? The dual norm is the infinity norm:

    ‖u‖_* = max_i |u_i| = ‖u‖_∞.
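Checking that numerically, as a sketch of mine: put the whole one-norm budget on the largest-magnitude coordinate and compare against ‖u‖_∞.

```python
import numpy as np

u = np.array([0.3, -2.5, 1.0])

i = np.argmax(np.abs(u))              # coordinate with the largest |u_i|
v = np.zeros_like(u)
v[i] = np.sign(u[i])                  # feasible: ||v||_1 = 1
assert np.isclose(u @ v, np.linalg.norm(u, np.inf))  # value is ||u||_inf
```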
And you can also see that the dual of the infinity norm is going to be the one norm. There are many examples like this that we can work out; let's do one more. So far we've seen that the dual of the Euclidean norm is again the Euclidean norm, and the dual of the one norm is the infinity norm. I mentioned that the dual of the infinity norm is the one norm, so let's try that.

So for the infinity norm, let's look at its dual. By definition, this is

    ‖u‖_* = max { uᵀv : ‖v‖_∞ ≤ 1 }.

The infinity norm being at most one means that every element of v can be anywhere between minus one and one, and there's no penalty coupling them; in other words, if v₁ is very big, it doesn't mean that v₂ has to be small. So what would you do? Again, you can see this is quite simple. You would look at u_i and ask: is it positive or negative? Then you let v_i have the same sign and be as big as you're allowed, which in this case is one. So the optimizing v is just the element-wise sign of u:

    v*_i = sign(u_i).

And so then uᵀv* = Σ_i u_i v*_i = Σ_i u_i sign(u_i), which is exactly Σ_i |u_i|, which is in fact the one norm. So now we've seen all three of these examples.
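And the matching check for this last direction, again my own sketch: with v = sign(u), the inner product comes out to ‖u‖₁.

```python
import numpy as np

u = np.array([0.3, -2.5, 1.0])

v = np.sign(u)                                   # feasible: ||v||_inf = 1
assert np.isclose(u @ v, np.linalg.norm(u, 1))   # value is ||u||_1
```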
Now, we're going to use a very important property of dual norms, which I'm going to write down now, and which we're going to see again in the next lecture. So, an important inequality says the following. Remember that we've needed to bound, several times, the dot product between two vectors u and v, and we've done this in several ways. We can bound it in terms of the two norm:

    uᵀv ≤ ‖u‖₂ ‖v‖₂,

and in fact we know that these are equal only when u and v are parallel and pointing in the same direction. This is the famous Cauchy-Schwarz inequality. In fact, it's also true that

    uᵀv ≤ ‖u‖ ‖v‖_*

for any norm and its dual, and this is sometimes called Hölder's inequality. If you go back to the proofs of gradient descent, subgradient descent, and so on, you'll see the first kind of inequality in many places.

One of the other properties that we've used, another way to bound uᵀv, is the following:

    uᵀv ≤ (1/2)‖u‖₂² + (1/2)‖v‖₂².

This just comes from completing the square. Again, we've used this over and over. But using the dual norm, we can refine it in a particular way. We will use the following upper bound:

    uᵀv ≤ (1/(2α)) ‖u‖² + (α/2) ‖v‖_*²,

and note that there's no 2 subscript here; this is a general norm, not necessarily the Euclidean norm, paired with its dual. And this is very similar: if we take α equal to one, with the Euclidean norm, which is its own dual, we recover what we get from completing the square.
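A quick sanity check of all of these bounds on random vectors, a sketch of mine assuming NumPy, using the one norm and its dual, the infinity norm, for the general-norm versions.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)

assert u @ v <= np.linalg.norm(u) * np.linalg.norm(v)             # Cauchy-Schwarz
assert u @ v <= np.linalg.norm(u, 1) * np.linalg.norm(v, np.inf)  # Holder
assert u @ v <= 0.5 * u @ u + 0.5 * v @ v                # completing the square

for alpha in (0.1, 1.0, 10.0):       # the refined bound, for several alphas
    bound = np.linalg.norm(u, 1)**2 / (2 * alpha) \
            + alpha / 2 * np.linalg.norm(v, np.inf)**2
    assert u @ v <= bound
```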
The reason this is so important for us is that these dot products are foundational. They're in the definition of convexity; they're how we build our linear approximations; and they're how we define all of our algorithms. So we're going to see next time how we use this inequality, how we use the Bregman divergence, and how we use the Bregman divergence as a proximal term, in order to show better convergence rates for mirror descent in certain settings where we don't want to use Euclidean geometry. We'll pick this up next time.