-
Welcome back. This is our third lecture
in the series, talking about
-
mirror descent.
-
And we're going to get down to
the main definition of how
-
we change the proximal term
from the subgradient method.
-
So again, recall that the subgradient
method tells me that xt plus one is
-
equal to xt minus eta gt, where gt is
an element of the subdifferential
-
of f at xt.
-
Again, I've said this in each
of these lectures, but I'm leaving
-
out the dependence on the
set constraint,
-
but with the set constraint,
this would take the form
-
of an additional projection.
-
So xt plus one is equal to the projection
onto x, the projection here means
-
the Euclidean projection, of xt minus
eta gt.
-
But because it doesn't change very
much, for now,
-
I've just been sparing myself the
extra writing,
-
but we should really have this in mind.
-
And, uh, we've seen repeatedly, this
has been the theme of the first two
-
lectures on mirror descent, that
xt plus one is also
-
equivalently, just the argmin
of eta, gt, transpose x, plus,
-
you can think of this
as the linear approximation
-
to the function, plus something that
keeps us from going too far, and this
-
is typically the Euclidean norm, or
specifically, when we use the
-
Euclidean norm, we get exactly the
subgradient method.
-
Without a set constraint, this is an
argmin over all x, but if I have a
-
set constraint, the only thing
that changes is
-
that this is an argmin over x in my
constraint set.
-
So again, not much has changed.
-
So, we're going to change this.
-
And by changing it, we get
mirror descent, and we'll see
-
exactly when we can do this to
our advantage.
-
In order to do this, and to
explain what will replace it,
-
we need to define this concept
of Bregman divergence.
-
For a convex function phi,
-
I can define its Bregman
divergence as follows.
-
So the divergence with respect
to phi, between two points x and y,
-
is defined as phi of x minus
phi of y, minus the gradient of phi
-
of y transpose times x minus y.
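This definition is easy to compute directly. Here is a minimal numerical sketch, assuming NumPy; the quartic phi below is just an illustrative convex function, not one used in the lecture:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Illustrative convex phi: phi(v) = (1/4) * sum(v_i^4), with gradient v^3.
phi      = lambda v: 0.25 * np.sum(v ** 4)
grad_phi = lambda v: v ** 3

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
D = bregman(phi, grad_phi, x, y)
assert D >= 0                               # convexity makes it nonnegative
assert bregman(phi, grad_phi, x, x) == 0    # zero when the points coincide
```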
-
So what does this mean?
-
It actually has a very simple
interpretation.
-
And let me group these terms
as follows.
-
Let me group the second and third
term, and then I'll have to distribute
-
that minus sign, in order to make
this plus.
-
So now, a simple picture tells
us exactly what this is.
-
If phi is a convex function, and
I've got my point y, and I've
-
got my point x, then what is
the first term?
-
The first term is just phi of
x, it's sitting up there.
-
What's the second term?
-
We can interpret the second
term as the linear approximation
-
to phi at y,
-
evaluated at the point x.
-
So this is the second term.
-
This is phi of y plus the gradient
of phi of y transpose x minus y.
-
And therefore, the Bregman
divergence is exactly this term.
-
Making things a little messy
here, so let's see.
-
One more try.
-
This is exactly the Bregman
divergence.
-
The more curvature something has,
the more Bregman divergence it has.
-
Let's do a very simple calculation.
-
In the simple case where phi of x
-
is just equal to one half times
the Euclidean norm squared.
-
So let's see what the Bregman
divergence is in this case.
-
Well, plugging in the definition,
I see that the Bregman divergence
-
between x and y is equal to
phi of x,
-
that's one half Euclidean norm
squared, of x,
-
minus phi of y, that's one half
Euclidean norm squared of
-
y, minus the gradient of phi of y.
-
Well, what's the gradient of one half
times the Euclidean norm squared?
-
It's just y itself.
-
So it's minus y transpose.
-
Times x minus y, and grouping
all these terms,
-
rewriting, I get one half, factor
that out, Euclidean norm of x
-
squared, plus...
-
minus twice y transpose x.
-
Then I've got minus y squared,
-
plus twice y squared.
-
That simplifies to just plus
y squared, and you can see that this is
-
exactly the quadratic.
-
This is equal to one half x minus
y squared.
-
So, if we take phi to be the function
that's one half Euclidean norm squared,
-
then the Bregman divergence is
-
just going to be one half the squared
Euclidean norm of
-
the difference.
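A quick numerical check of this calculation, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)

phi  = lambda v: 0.5 * v @ v   # phi(v) = (1/2) ||v||_2^2
grad = lambda v: v             # its gradient is v itself

D = phi(x) - phi(y) - grad(y) @ (x - y)
assert np.isclose(D, 0.5 * np.sum((x - y) ** 2))  # D = (1/2) ||x - y||_2^2
```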
-
So we get something familiar.
-
But, we're going to see several examples
um, where we're going to take phi
-
to be something different and therefore we
get a very different Bregman divergence.
-
Okay, and we're going to use this directly
in our definition of mirror descent.
-
Mirror descent now becomes xt plus one,
is going to be equal to the argmin,
-
possibly in our set constraint,
eta gt transpose x,
-
plus, previously we had
-
one half times the Euclidean norm
squared of x minus xt, but now we have the
-
Bregman divergence between x and xt.
-
So, if phi is equal to just one half
Euclidean norm squared, then
-
we recover our usual subgradient
method.
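As a sanity check that the Euclidean choice really recovers the subgradient step, the sketch below (assuming NumPy, with a made-up gt and step size) confirms that xt minus eta gt minimizes the update objective, by comparing it against nearby perturbations:

```python
import numpy as np

rng = np.random.default_rng(2)
xt  = rng.normal(size=4)
g   = rng.normal(size=4)   # stand-in for a subgradient g_t
eta = 0.1

# Objective of the update: eta * g^T x + (1/2) ||x - xt||^2
obj = lambda x: eta * g @ x + 0.5 * np.sum((x - xt) ** 2)

x_next = xt - eta * g      # the subgradient step
# x_next should do at least as well as any nearby perturbation.
for _ in range(100):
    assert obj(x_next) <= obj(x_next + 0.01 * rng.normal(size=4))
```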
-
But, if we have something else,
-
if phi of x is equal to say,
-
one half times x transpose, times
a matrix Q, times x,
-
where Q is not equal to the identity,
then we get a different algorithm.
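For this quadratic choice, the Bregman divergence works out to the Q-weighted quadratic, one half of (x minus y) transpose Q (x minus y). A sketch verifying that, assuming NumPy and a made-up positive definite Q:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Q = A @ A.T + np.eye(4)          # a symmetric positive definite Q != I

phi  = lambda v: 0.5 * v @ Q @ v
grad = lambda v: Q @ v           # gradient of the quadratic (Q symmetric)

x, y = rng.normal(size=4), rng.normal(size=4)
D = phi(x) - phi(y) - grad(y) @ (x - y)
assert np.isclose(D, 0.5 * (x - y) @ Q @ (x - y))
```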
-
And, this was exactly the idea that
started us off in mirror descent.
-
So this was the idea that we had
in lecture one on mirror descent.
-
So see the first lecture in the mirror
descent series.
-
But we're going to choose phi
to be different even from these.
-
So we'll see that we can take phi
to be quite different, and we would
-
therefore really change the
geometry of the space.
-
So, this is our first look
at mirror descent,
-
we're going to be trying to understand
its properties through examples,
-
and then also, in the following
lecture, actually deriving
-
rates of convergence and
their proofs.
-
But in the meantime, let's go
back to Bregman divergence, and get
-
used to it a little bit, and see what
some of its properties are.
-
So, in part, I'm going to derive this
because we'll need it.
-
In part, just to get us used to thinking
about Bregman divergence.
-
It is a little bit familiar,
-
because we saw that if we choose
phi to be half the Euclidean norm
-
squared, we get back something
very familiar, but if we plug in
-
other things, it will be a little bit
different.
-
So, just want to get used to
manipulating it, and seeing that
-
it's really not that difficult.
-
Okay, so let's, in part for exercise,
in part because we'll need it,
-
let's show that um, for any convex
function phi, we have this property
-
that the gradient of phi, evaluated
at x, minus the gradient of phi,
-
evaluated at y, transpose, times x minus
z, is in fact equal to the Bregman
-
divergence between x and y,
plus the Bregman divergence between
-
z and x, minus the Bregman
divergence between z and y.
-
And this is just a matter of
plugging in these definitions and
-
evaluating, so let's just do exactly
that.
-
So from the definition, on the
right hand side, the first
-
term is equal to phi of x,
minus phi of y,
-
minus the gradient of phi of y,
-
transpose, times x minus y. The second term
is equal to phi of z, minus phi
-
of x, minus the gradient of phi
of x, transpose, times z minus x.
-
And then the third term, which
we subtract, is equal to minus phi of z,
-
plus, I'm distributing the
minus sign here,
-
so it's plus phi of y, plus the
gradient of phi of y, transpose, times
-
z minus y.
-
Okay, so what do we see?
-
Well, first of all, there's -
-
I see a positive phi of x
here, and a negative phi of x,
-
I see a positive phi of z
and a negative phi of z.
-
I see a negative phi of y
and another positive phi of y,
-
so all of these cancel,
-
and now I just have to collect
all of these gradient terms.
-
And this is just a matter of
a little bit of accounting.
-
So this is equal to
-
the gradient of phi of x
times x minus z
-
minus the gradient of phi of y
times x minus y
-
plus the gradient of phi of y
times z minus y,
-
and this is equal to the gradient
of phi of x times x minus z
-
minus the gradient of phi of y
times x minus z.
-
Now I can group these together
and I get exactly what I wanted.
-
This is the gradient of phi of x
minus the gradient of phi of y
-
times x minus z.
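This three-point identity is also easy to confirm numerically; a sketch assuming NumPy and an arbitrary, made-up quadratic phi:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Q = A @ A.T + np.eye(3)          # symmetric positive definite

phi  = lambda v: 0.5 * v @ Q @ v
grad = lambda v: Q @ v
D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)

x, y, z = (rng.normal(size=3) for _ in range(3))
lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
assert np.isclose(lhs, rhs)      # the three-point identity
```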
-
We're going to use this property,
but again, the point is for us
-
not to worry about just
plugging in the definitions.
-
These are relatively simple things.
-
We're going to need something else
in our development of mirror descent
-
and when we analyze it.
-
And this is the concept of a dual norm.
-
So where have we been
using dual norms
-
without even really knowing it?
-
We've been using dual norms because
everything we've done so far,
-
with the exception of a few things
like prox, where we used the L1 norm,
-
has been in the Euclidean norm.
-
So let's - let me give the definition
here of a dual norm.
-
So now I'm going to use this
double bar just as a generic norm.
-
So don't look at it and think
it's the two norm.
-
This is just some norm.
-
So for this indicating any norm,
-
its dual
-
is defined as follows.
-
I'm going to use this lower star notation.
-
So lower star is very frequently
used to denote the dual.
-
But in some other contexts
it has a specific definition.
-
So for example with matrices,
sometimes it denotes the trace norm
-
or the nuclear norm.
-
But in this case I'll try to be crystal
clear if we do need to talk about
-
another norm that's
often denoted like this.
-
For us, this star is always going to mean
the dual norm unless I specify otherwise.
-
So this is defined as the
maximum of u transpose v,
-
subject to the norm of v being less
than or equal to one, where this is
-
the original norm,
whatever that might have been.
-
Okay, so this is just the definition.
-
So let's do some examples.
-
So let's take - let's take this
-
equal to the Euclidean norm.
-
Might as well warm up with
something familiar.
-
So what is the dual
of the Euclidean norm,
-
u - I'll put a star, I'll write it
this way so we remember.
-
This is defined as the
max of u transpose v
-
subject to the Euclidean norm of v
being less than or equal to one.
-
We can just solve this geometrically.
-
It's saying your constraint
set is just the unit ball.
-
So it's saying find the
point on the sphere
-
that has the largest inner product
with a particular vector.
-
What is that going to be?
-
Well, it's - you can see
from this picture.
-
You're not going to pick something
that is perpendicular, certainly.
-
If you picked something perpendicular,
you'd get zero in the max.
-
You're not going to pick something
in the opposite direction,
-
you'd get negative.
-
No, you're going to align exactly
with it and go out unit norm.
-
So in other words,
the optimal solution here
-
is the maximizer,
-
is just equal to u divided by its norm,
-
and the max value
-
is therefore going to be
u transpose times this v star.
-
This is equal to just u transpose u
divided by the norm of u,
-
which is the norm of u.
-
In other words,
the two norm is self dual.
-
So what we just showed is that
the Euclidean norm
-
is its own dual.
-
It's called self dual.
-
So this is the way that we've been
actually using the dual norm
-
without even really knowing it,
-
because we've just been
working in the two norm.
-
So let's take another example.
-
So now let's take - that's example one.
-
Let's take norm equal to the one norm.
-
Okay, and let's see what its dual is.
-
So we've got, what is the dual norm
of u in the one norm?
-
This is again, just copying the definition,
the max of u transpose v...
-
u transpose v,
-
subject to the one norm of v
being less than or equal to one.
-
So let's think about what
this is going to give us.
-
So what does v - what does
this one norm tell us?
-
So this constraint is telling us that
-
we have to be inside this diamond,
-
or algebraically, it's telling us that the
absolute values have to sum to at most one.
-
You can think about this as
a resource allocation problem.
-
You have to allocate your resources,
but you only have 100%
-
and you have to allocate the pieces
so that they add up to one.
-
Where are you going to put them?
-
You're going to put them
wherever the values are the biggest.
-
The slight difference here
is that the entries of u
-
can be positive or negative,
but then so can the entries of v.
-
So what you're going to do is
you're going to find the biggest
-
in absolute value element of u,
-
whether it's positive or negative,
-
and you're going to keep all of v,
all of your energy,
-
you're going to put onto that coordinate.
-
If the biggest element of u is negative,
you're also going to make v negative.
-
If the biggest element of u is positive,
you're also going to make v positive.
-
So in other words, v star in this case,
our optimizing v,
-
is going to be
-
is going to look like
-
the vector that's all zero except
with a single one.
-
And it's going to have a single one
in the ith position,
-
and its sign is going to be
the sign of ui.
-
And which i is it going to be?
-
It's going to be the i where the value
of the absolute value of ui is biggest.
-
So i is the argmax of the
absolute value of ui.
-
So in other words, what do we get?
-
The dual norm is the infinity norm.
-
So the dual norm of u is just
equal to the infinity norm.
-
And you can also see
that the infinity norm,
-
its dual is going to be
the one norm.
-
So there are many examples
like this that we can work out.
-
Let's do one more example.
-
So, so far
-
we've seen that the dual
of the Euclidean norm
-
is again the Euclidean norm.
-
The dual of the one norm
is the infinity norm.
-
I mentioned that the dual of
the infinity norm is the one norm.
-
So let's try that.
-
So the infinity norm,
let's look at the dual of that.
-
By definition this is
the max of u transpose v
-
subject to the infinity norm of v
being less than or equal to one.
-
So the infinity norm being at most one
means that every element of v
-
can be between minus one and one.
-
And there's no coupling
between the coordinates.
-
In other words, if v1 is at its bound,
it doesn't mean that v2 has to be small.
So what would you do?
-
Again you can see this is quite simple.
-
You would look at ui and say ui,
is that positive or negative?
-
I'm going to let vi be the same
sign and be as big as I'm allowed,
-
which is in this case one.
-
So the optimizing v is just
going to be the sign of u,
-
element-wise.
-
In other words,
-
the ith element of v star is just
going to be the sign of ui.
-
And so then u transpose v is equal
to the sum of ui vi.
-
And plugging in this v, this is
the same as the sum of ui
-
times the sign of ui,
-
which is exactly equal to the sum
of the absolute value of ui,
-
which is in fact the L1 norm.
-
So now we've seen
all three of these examples.
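All three dual pairs are easy to check numerically. The sketch below (assuming NumPy) verifies that random feasible v never beat the claimed dual value, and that the explicit maximizers described above attain it:

```python
import numpy as np

rng = np.random.default_rng(5)
u = rng.normal(size=5)

# (constraint norm p, claimed dual norm of u)
pairs = [(2, np.linalg.norm(u, 2)),           # dual of L2 is L2
         (1, np.linalg.norm(u, np.inf)),      # dual of L1 is Linf
         (np.inf, np.linalg.norm(u, 1))]      # dual of Linf is L1
for p, claimed in pairs:
    for _ in range(200):                      # random feasible v: ||v||_p <= 1
        v = rng.normal(size=5)
        v = v / max(np.linalg.norm(v, p), 1.0)
        assert u @ v <= claimed + 1e-9

# The explicit maximizers attain the claimed values.
assert np.isclose(u @ (u / np.linalg.norm(u)), np.linalg.norm(u, 2))
e = np.zeros(5); i = np.argmax(np.abs(u)); e[i] = np.sign(u[i])
assert np.isclose(u @ e, np.linalg.norm(u, np.inf))
assert np.isclose(u @ np.sign(u), np.linalg.norm(u, 1))
```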
-
And we're going to use a very
important property of dual norms,
-
which I'm going to write down now,
-
and we're going to see it again
in the next lecture.
-
So an important inequality
-
says the following.
-
Remember that we've needed to bound,
several times, the dot product
-
between two vectors u and v.
-
And we've done this in several ways.
-
So we can bound this
in terms of the two norm.
-
So this is upper bounded by the two
norm of u times the two norm of v.
-
In fact we know that these are equal
only when u and v are parallel
-
and pointing in the same direction.
-
So this is the Cauchy-Schwarz -
famous Cauchy-Schwarz inequality.
-
In fact, it's also true that u transpose v
-
is less than or equal to the norm of u
times the dual norm of v,
-
for any norm and its dual.
-
And this is sometimes called
Hölder's inequality.
-
And so far we've used - if you go
back to the proofs of gradient descent,
-
subgradient descent, and so on,
you'll see this kind of inequality,
-
the first one, in many places.
-
So another way to bound u transpose v,
one of the properties that we've used,
-
is as follows.
-
Also, u transpose v
is less than or equal to
-
One half times u squared - this is
the Euclidean norm squared -
-
plus one half v squared.
-
This just comes from
completing the square.
-
Again, we've used this
over and over again.
-
But using the dual norm,
-
we can refine this in a particular way.
-
So we will use the following upper bound:
-
u transpose v is less than or equal to
one over two alpha
-
times the norm of u squared -
-
note there's no two subscript,
this is not the Euclidean norm -
-
plus alpha over two,
times the dual norm of v, squared.
-
And this is very similar.
-
If we take alpha equal to one,
-
we recover the analogue of what we
get from completing the square.
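A quick numerical check of this refined bound, assuming NumPy and pairing the L1 norm for u with its dual, the infinity norm, for v:

```python
import numpy as np

rng = np.random.default_rng(6)
for _ in range(200):
    u, v = rng.normal(size=6), rng.normal(size=6)
    for alpha in (0.1, 1.0, 10.0):
        # u^T v <= (1 / (2 alpha)) ||u||_1^2 + (alpha / 2) ||v||_inf^2
        bound = (np.linalg.norm(u, 1) ** 2 / (2 * alpha)
                 + alpha * np.linalg.norm(v, np.inf) ** 2 / 2)
        assert u @ v <= bound + 1e-9
```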
-
The reason that this
is so important for us
-
is that these dot products
are foundational for us.
-
This is the definition of convexity.
-
We use these to make our
linear approximations.
-
This is how we define convex functions,
-
this is how we define
our linear approximations,
-
and this is how we define
all our algorithms.
-
So we're going to see next time
how we use this inequality,
-
how we use Bregman divergence,
-
and how we use the definition of
the Bregman divergence
-
as a proximal term
-
in order to show better
convergence rates for mirror descent
-
in certain settings where we don't
want to use Euclidean geometry.
-
We'll pick this up next time.