Welcome back. This is our third lecture in the series on mirror descent.

We're going to get down to the main definition of how we change the proximal term from the subgradient method. So again, recall that the subgradient method tells me that

    x_{t+1} = x_t − η g_t,

where g_t is an element of the subdifferential ∂f(x_t). I've said this in each of these lectures, but I'm leaving out the dependence on the set constraint; with a set constraint, this would take the form of an additional projection:

    x_{t+1} = Π_X(x_t − η g_t),

where the projection here means the Euclidean projection onto X. Because it doesn't change very much, for now I've just been sparing myself the extra writing, but we should really keep it in mind.

And we've seen repeatedly (this has been the theme of the first two lectures on mirror descent) that x_{t+1} is also, equivalently,

    x_{t+1} = argmin_x { η g_tᵀ x + (1/2)‖x − x_t‖₂² }.

You can think of the first term as the linear approximation to the function, and the second as something that keeps us from going too far; specifically, when we use the squared Euclidean norm there, we get exactly the subgradient method. As written, this is an argmin over all x, but if I have a set constraint, the only thing that changes is that it becomes an argmin over x in my constraint set. So again, not much has changed.
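To make the update concrete, here is a minimal NumPy sketch of one projected subgradient step. The helper names and the choice of a Euclidean ball as the constraint set are my own illustration, not from the lecture.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}: rescale if outside.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def subgradient_step(x, g, eta, project=None):
    # x_{t+1} = Pi_X(x_t - eta * g_t); with no constraint, just x_t - eta * g_t.
    x_next = x - eta * g
    return project(x_next) if project is not None else x_next

# One step on f(x) = ||x||_1, whose subdifferential at x contains sign(x).
x = np.array([0.5, -2.0, 1.0])
x = subgradient_step(x, np.sign(x), eta=0.1, project=project_ball)
```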
So, we're going to change this. And by changing it, we get mirror descent, and we'll see exactly when we can do this to our advantage.

In order to do this, and to explain what will replace it, we need to define the concept of a Bregman divergence. For a convex function φ, I can define its Bregman divergence as follows. The divergence with respect to φ, between two points x and y, is defined as

    D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)ᵀ(x − y).

So what does this mean? It actually has a very simple interpretation. Let me group the second and third terms, distributing the minus sign:

    D_φ(x, y) = φ(x) − [φ(y) + ∇φ(y)ᵀ(x − y)].

Now a simple picture tells us exactly what this is. If φ is a convex function, and I've got my point y and my point x, then what is the first term? It's just φ(x), sitting up on the graph. And the second term, φ(y) + ∇φ(y)ᵀ(x − y), we can interpret as the linear approximation to φ at y, evaluated at the point x. The Bregman divergence is therefore exactly the gap between the function at x and its linear approximation taken at y. The more curvature something has, the more Bregman divergence it has.
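The definition translates directly into code once we supply φ and its gradient. Here is a minimal sketch, assuming NumPy, with names of my own choosing.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>:
    # the gap at x between phi and its linearization around y.
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)
```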
Let's do a very simple calculation, in the simple case where φ(x) is just one half times the squared Euclidean norm, φ(x) = (1/2)‖x‖₂². Let's see what the Bregman divergence is in this case.

Well, plugging in the definition, I see that the Bregman divergence between x and y is φ(x), that's (1/2)‖x‖², minus φ(y), that's (1/2)‖y‖², minus ∇φ(y)ᵀ(x − y). And what's the gradient of one half times the squared Euclidean norm? It's just y itself. So

    D_φ(x, y) = (1/2)‖x‖² − (1/2)‖y‖² − yᵀ(x − y).

Grouping all these terms and rewriting, I can factor out the one half:

    D_φ(x, y) = (1/2)( ‖x‖² − 2yᵀx − ‖y‖² + 2‖y‖² ) = (1/2)( ‖x‖² − 2yᵀx + ‖y‖² ),

and you can see that this is exactly the quadratic: it equals (1/2)‖x − y‖². So, if we take φ to be one half the squared Euclidean norm, then the Bregman divergence is just one half the squared Euclidean norm of the difference. So we get something familiar. But we're going to see several examples where we take φ to be something different, and we therefore get a very different Bregman divergence.
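A quick numerical check of that calculation, a sketch of mine assuming NumPy: φ = (1/2)‖·‖² gives half the squared distance.

```python
import numpy as np

phi = lambda x: 0.5 * np.dot(x, x)   # phi(x) = (1/2)||x||_2^2
grad_phi = lambda x: x               # gradient of (1/2)||x||^2 is x itself

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])

d = phi(x) - phi(y) - grad_phi(y) @ (x - y)        # D_phi(x, y)
assert np.isclose(d, 0.5 * np.dot(x - y, x - y))   # equals (1/2)||x - y||^2
```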
Okay, and we're going to use this directly in our definition of mirror descent. Mirror descent now becomes

    x_{t+1} = argmin_x { η g_tᵀ x + D_φ(x, x_t) },

where the argmin is possibly over our constraint set. Previously we had one half times the squared Euclidean norm; now we have the Bregman divergence between x and x_t. So, if φ is equal to just one half the Euclidean norm squared, then we recover our usual subgradient method.

But if we have something else, if φ(x) is equal to, say, (1/2)xᵀQx for a different matrix Q, where Q is not the identity, then we get a different algorithm. And this was exactly the idea that started us off in mirror descent; this was the idea we had in lecture one, so see the first lecture in the mirror descent series.
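For that quadratic case, plugging φ(x) = (1/2)xᵀQx into the definition gives D_φ(x, y) = (1/2)(x − y)ᵀQ(x − y), a squared Q-weighted distance. The small check below, with an arbitrary symmetric positive definite Q, is my own sketch, not the lecture's.

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # a symmetric positive definite matrix

phi = lambda x: 0.5 * x @ Q @ x
grad_phi = lambda x: Q @ x           # gradient of (1/2) x^T Q x for symmetric Q

x, y = np.array([1.0, -1.0]), np.array([0.5, 2.0])
d = phi(x) - phi(y) - grad_phi(y) @ (x - y)
assert np.isclose(d, 0.5 * (x - y) @ Q @ (x - y))  # Q-weighted squared distance
```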
But we're going to choose φ to be different even from these. We'll see that we can take φ to be quite different, and thereby really change the geometry of the space. So, this is our first look at mirror descent. We're going to try to understand its properties through examples, and then, in the following lecture, actually derive a rate of convergence and its proof. But in the meantime, let's go back to the Bregman divergence, get used to it a little bit, and see what some of its properties are. In part I'm going to derive this because we'll need it; in part it's just to get us used to thinking about Bregman divergences. It is a little bit familiar, because we saw that if we choose φ to be half the Euclidean norm squared, we get back something very familiar, but if we plug in other things, it will be a little bit different. So I just want to get used to manipulating it, and to see that it's really not that difficult.

Okay, so, in part for exercise and in part because we'll need it, let's show that for any convex function φ we have the property

    (∇φ(x) − ∇φ(y))ᵀ(x − z) = D_φ(x, y) + D_φ(z, x) − D_φ(z, y).

This is just a matter of plugging in the definitions and evaluating, so let's do exactly that. From the definition, on the right-hand side the first term is

    D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)ᵀ(x − y),

the second term is

    D_φ(z, x) = φ(z) − φ(x) − ∇φ(x)ᵀ(z − x),

and the third term, which we subtract, becomes, distributing the minus sign,

    −D_φ(z, y) = −φ(z) + φ(y) + ∇φ(y)ᵀ(z − y).

Okay, so what do we see? Well, I see a positive φ(x) and a negative φ(x); I see a positive φ(z) and a negative φ(z); I see a negative φ(y) and a positive φ(y). So all of these cancel, and now I just have to collect the gradient terms. This is a little bit of accounting:

    −∇φ(y)ᵀ(x − y) − ∇φ(x)ᵀ(z − x) + ∇φ(y)ᵀ(z − y)
    = ∇φ(x)ᵀ(x − z) − ∇φ(y)ᵀ(x − y) + ∇φ(y)ᵀ(z − y)
    = ∇φ(x)ᵀ(x − z) − ∇φ(y)ᵀ(x − z).
Now I can group these together, and I get exactly what I wanted:

    (∇φ(x) − ∇φ(y))ᵀ(x − z).

We're going to use this property, but again, the point is for us not to worry about just plugging in the definitions. These are relatively simple things.
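Here is a quick numerical check of this three-point identity, a sketch of mine using the quadratic φ from before with a random positive definite Q.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

A = rng.normal(size=(3, 3))
Q = A @ A.T + np.eye(3)              # random symmetric positive definite matrix
phi = lambda w: 0.5 * w @ Q @ w
grad = lambda w: Q @ w

D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)   # Bregman divergence

lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
assert np.isclose(lhs, rhs)          # the three-point identity holds
```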
We're going to need something else in our development of mirror descent and when we analyze it, and this is the concept of a dual norm. So where have we been using dual norms without even really knowing it? We've been using them because everything we've done so far, with the exception of a few things like prox, where we used the L1 norm and so on, has been in the Euclidean norm.

So let me give the definition of a dual norm. I'm now going to use the double bar ‖·‖ as a generic norm, so don't look at it and think it's the two norm; this is just some norm. For any norm ‖·‖, its dual is defined as follows. I'm going to use the lower-star notation ‖·‖_*. The lower star is very frequently used to denote the dual, but in some other contexts it has a specific meaning; for example, with matrices it sometimes denotes the trace norm, or nuclear norm. I'll try to be crystal clear if we ever do need to talk about another norm that's denoted like this; for us, this star is always going to mean the dual norm unless I specify otherwise. So the dual norm is defined as

    ‖u‖_* = max { uᵀv : ‖v‖ ≤ 1 },

where the constraint is in the original norm, whatever that might have been. Okay, so this is just the definition.

So let's do some examples. Let's take ‖·‖ equal to the Euclidean norm; might as well warm up with something familiar. So what is the dual of the Euclidean norm? This is defined as

    ‖u‖_* = max { uᵀv : ‖v‖₂ ≤ 1 }.

We can just solve this geometrically. It's saying: your constraint set is the unit ball, so find the point on the sphere that has the largest inner product with a particular vector u. What is that going to be? Well, you can see it from the picture. You're certainly not going to pick something perpendicular to u; you'd get zero in the max. You're not going to pick something in the opposite direction; you'd get something negative. No, you're going to align exactly with u and go out to unit norm. In other words, the optimal solution, the maximizer, is just

    v* = u / ‖u‖₂,

and the max value is therefore uᵀv* = uᵀu / ‖u‖₂, which is the norm of u. In other words, the two norm is self-dual: what we just showed is that the Euclidean norm is its own dual. So this is the way we've actually been using the dual norm without even really knowing it, because we've just been working in the two norm.
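A one-line numerical confirmation of self-duality, again my own sketch: the maximizer over the unit ball is u/‖u‖₂, and the value it attains is ‖u‖₂.

```python
import numpy as np

u = np.array([3.0, -4.0])
v_star = u / np.linalg.norm(u)                    # feasible: ||v*||_2 = 1
assert np.isclose(u @ v_star, np.linalg.norm(u))  # value equals ||u||_2
```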
So let's take another example; that was example one. Let's take the norm equal to the one norm, and let's see what its dual is. So, what is the dual norm of u when the original norm is the one norm? Again, just copying the definition,

    ‖u‖_* = max { uᵀv : ‖v‖₁ ≤ 1 }.

So let's think about what this is going to give us. What does this one-norm constraint tell us? Geometrically, it's telling us that we have to be inside this diamond; algebraically, it's telling us that the absolute values of the entries of v have to sum to at most one. You can think about this as a resource allocation problem. You have to allocate your resources, but you only have 100%, and the pieces have to add up to one. Where are you going to put them? You're going to put them wherever the values are the biggest. The slight difference here is that the entries of u can be positive or negative, but then so can the entries of v. So what you're going to do is find the biggest element of u in absolute value, whether it's positive or negative, and put all of v, all of your energy, onto that coordinate. If the biggest element of u is negative, you also make v negative there; if it's positive, you make v positive. So in other words, our optimizing v* is going to look like the vector that's all zeros except for a single entry: a one in the i-th position, carrying the sign of u_i, where i is the coordinate at which |u_i| is biggest, i = argmax_i |u_i|. So what do we get? The dual norm is the infinity norm:

    ‖u‖_* = max_i |u_i| = ‖u‖_∞.
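Checking that numerically, as a sketch of mine: put the whole one-norm budget on the largest-magnitude coordinate and compare against ‖u‖_∞.

```python
import numpy as np

u = np.array([0.3, -2.5, 1.0])

i = np.argmax(np.abs(u))              # coordinate with the largest |u_i|
v = np.zeros_like(u)
v[i] = np.sign(u[i])                  # feasible: ||v||_1 = 1
assert np.isclose(u @ v, np.linalg.norm(u, np.inf))  # value is ||u||_inf
```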
And you can also see that the dual of the infinity norm is going to be the one norm. There are many examples like this that we can work out; let's do one more. So far we've seen that the dual of the Euclidean norm is again the Euclidean norm, and the dual of the one norm is the infinity norm. I mentioned that the dual of the infinity norm is the one norm, so let's try that.

So for the infinity norm, let's look at its dual. By definition, this is

    ‖u‖_* = max { uᵀv : ‖v‖_∞ ≤ 1 }.

The infinity norm being at most one means that every element of v can be anywhere between minus one and one, and there's no penalty coupling them; in other words, if v₁ is very big, it doesn't mean that v₂ has to be small. So what would you do? Again, you can see this is quite simple. You would look at u_i and ask: is it positive or negative? Then you let v_i have the same sign and be as big as you're allowed, which in this case is one. So the optimizing v is just the element-wise sign of u:

    v*_i = sign(u_i).

And so then uᵀv* = Σ_i u_i v*_i = Σ_i u_i sign(u_i), which is exactly Σ_i |u_i|, which is in fact the one norm. So now we've seen all three of these examples.
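And the matching check for this last direction, again my own sketch: with v = sign(u), the inner product comes out to ‖u‖₁.

```python
import numpy as np

u = np.array([0.3, -2.5, 1.0])

v = np.sign(u)                                   # feasible: ||v||_inf = 1
assert np.isclose(u @ v, np.linalg.norm(u, 1))   # value is ||u||_1
```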
Now, we're going to use a very important property of dual norms, which I'm going to write down now, and which we're going to see again in the next lecture. So, an important inequality says the following. Remember that we've needed to bound, several times, the dot product between two vectors u and v, and we've done this in several ways. We can bound it in terms of the two norm:

    uᵀv ≤ ‖u‖₂ ‖v‖₂,

and in fact we know that these are equal only when u and v are parallel and pointing in the same direction. This is the famous Cauchy-Schwarz inequality. In fact, it's also true that

    uᵀv ≤ ‖u‖ ‖v‖_*

for any norm and its dual, and this is sometimes called Hölder's inequality. If you go back to the proofs of gradient descent, subgradient descent, and so on, you'll see the first kind of inequality in many places.

One of the other properties that we've used, another way to bound uᵀv, is the following:

    uᵀv ≤ (1/2)‖u‖₂² + (1/2)‖v‖₂².

This just comes from completing the square. Again, we've used this over and over. But using the dual norm, we can refine it in a particular way. We will use the following upper bound:

    uᵀv ≤ (1/(2α)) ‖u‖² + (α/2) ‖v‖_*²,

and note that there's no 2 subscript here; this is a general norm, not necessarily the Euclidean norm, paired with its dual. And this is very similar: if we take α equal to one, with the Euclidean norm, which is its own dual, we recover what we get from completing the square.
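A quick sanity check of all of these bounds on random vectors, a sketch of mine assuming NumPy, using the one norm and its dual, the infinity norm, for the general-norm versions.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)

assert u @ v <= np.linalg.norm(u) * np.linalg.norm(v)             # Cauchy-Schwarz
assert u @ v <= np.linalg.norm(u, 1) * np.linalg.norm(v, np.inf)  # Holder
assert u @ v <= 0.5 * u @ u + 0.5 * v @ v                # completing the square

for alpha in (0.1, 1.0, 10.0):       # the refined bound, for several alphas
    bound = np.linalg.norm(u, 1)**2 / (2 * alpha) \
            + alpha / 2 * np.linalg.norm(v, np.inf)**2
    assert u @ v <= bound
```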
The reason this is so important for us is that these dot products are foundational. They're in the definition of convexity; they're how we build our linear approximations; and they're how we define all of our algorithms. So we're going to see next time how we use this inequality, how we use the Bregman divergence, and how we use the Bregman divergence as a proximal term, in order to show better convergence rates for mirror descent in certain settings where we don't want to use Euclidean geometry. We'll pick this up next time.