-
Welcome back. This is our third lecture
in the series, talking about
-
mirror descent.
-
And we're going to get down to
the main definition of how
-
we change the proximal term
from the subgradient method.
-
So again, recall that the subgradient
method tells me that xt plus one is
-
equal to xt minus eta gt, where gt is
an element of the subdifferential
-
of f at xt.
-
Again, I've said this in each
of these lectures, but I'm leaving
-
out the dependence on the
set constraint,
-
but with the set constraint,
this would take the form
-
of an additional projection.
-
So xt plus one is equal to the projection
onto x, the projection here means
-
the Euclidean projection, of xt minus
eta gt.
-
But because it doesn't change very
much, for now,
-
I've just been sparing myself the
extra writing,
-
but we should really have this in mind.
-
And, uh, we've seen repeatedly, this
has been the theme of the first two
-
lectures on mirror descent, that
xt plus one is also
-
equivalently, just the argmin
of eta, gt, transpose x, plus,
-
you can think of this
as the linear approximation
-
to the function, plus something that
keeps us from going too far, and this
-
is typically the Euclidean norm, or
specifically, when we use the
-
Euclidean norm, we get exactly the
subgradient method.
-
Without a set constraint, this is an
argmin over all x, but if I have a
-
set constraint, the only thing
that changes is
-
that this is an argmin over x in my
constraint set.
-
So again, not much has changed.
-
So, we're going to change this.
-
And by changing it, we get
mirror descent, and we'll see
-
exactly when we can do this to
our advantage.
-
In order to do this, and to
explain what will replace it,
-
we need to define this concept
of Bregman divergence.
-
For a convex function phi,
-
I can define its Bregman
divergence as follows.
-
So the divergence with respect
to phi, between two points x and y,
-
is defined as phi of x minus
phi of y, minus the gradient of phi
-
of y transpose times x minus y.
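This definition is easy to compute directly. Here is a minimal numerical sketch, assuming NumPy; the quartic phi below is just an illustrative convex function, not one used in the lecture:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Illustrative convex phi: phi(v) = (1/4) * sum(v_i^4), with gradient v^3.
phi      = lambda v: 0.25 * np.sum(v ** 4)
grad_phi = lambda v: v ** 3

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
D = bregman(phi, grad_phi, x, y)
assert D >= 0                               # convexity makes it nonnegative
assert bregman(phi, grad_phi, x, x) == 0    # zero when the points coincide
```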
-
So what does this mean?
-
It actually has a very simple
interpretation.
-
And let me group these terms
as follows.
-
Let me group the second and third
term, and then I'll have to distribute
-
that minus sign, in order to make
this plus.
-
So now, a simple picture tells
us exactly what this is.
-
If phi is a convex function, and
I've got my point y, and I've
-
got my point x, then what is
the first term?
-
The first term is just phi of
x, it's sitting up there.
-
What's the second term?
-
We can interpret the second
term as the linear approximation
-
to phi at y,
-
evaluated at the point x.
-
So this is the second term.
-
This is phi of y plus the gradient
of phi of y transpose x minus y.
-
And therefore, the Bregman
divergence is exactly this term.
-
Making things a little messy
here, so let's see.
-
One more try.
-
This is exactly the Bregman
divergence.
-
The more curvature something has,
the more Bregman divergence it has.
-
Let's do a very simple calculation.
-
In the simple case where phi of x
-
is just equal to one half times
the Euclidean norm squared.
-
So let's see what the Bregman
divergence is in this case.
-
Well, plugging in the definition,
I see that the Bregman divergence
-
between x and y is equal to
phi of x,
-
that's one half Euclidean norm
squared, of x,
-
minus phi of y, that's one half
Euclidean norm squared of
-
y, minus the gradient of phi of y.
-
Well, what's the gradient of one half
times the Euclidean norm squared?
-
It's just y itself.
-
So it's minus y transpose.
-
Times x minus y, and grouping
all these terms,
-
rewriting, I get one half, factor
that out, Euclidean norm of x
-
squared, plus...
-
minus twice y transpose x.
-
Then I've got minus y squared,
-
plus twice y squared.
-
That simplifies to just plus
y squared, and you can see that this is
-
exactly the quadratic.
-
This is equal to one half x minus
y squared.
-
So, if we take phi to be the function
that's one half Euclidean norm squared,
-
then the Bregman divergence is
-
just going to be one half the squared
Euclidean norm of
-
the difference.
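A quick numerical check of this calculation, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)

phi  = lambda v: 0.5 * v @ v   # phi(v) = (1/2) ||v||_2^2
grad = lambda v: v             # its gradient is v itself

D = phi(x) - phi(y) - grad(y) @ (x - y)
assert np.isclose(D, 0.5 * np.sum((x - y) ** 2))  # D = (1/2) ||x - y||_2^2
```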
-
So we get something familiar.
-
But, we're going to see several examples
um, where we're going to take phi
-
to be something different and therefore we
get a very different Bregman divergence.
-
Okay, and we're going to use this directly
in our definition of mirror descent.
-
Mirror descent now becomes xt plus one,
is going to be equal to the argmin,
-
possibly in our set constraint,
eta gt transpose x,
-
plus, previously we had
-
one half times the Euclidean norm
squared of x minus xt, but now we have the
-
Bregman divergence between x and xt.
-
So, if phi is equal to just one half
Euclidean norm squared, then
-
we recover our usual subgradient
method.
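As a sanity check that the Euclidean choice really recovers the subgradient step, the sketch below (assuming NumPy, with a made-up gt and step size) confirms that xt minus eta gt minimizes the update objective, by comparing it against nearby perturbations:

```python
import numpy as np

rng = np.random.default_rng(2)
xt  = rng.normal(size=4)
g   = rng.normal(size=4)   # stand-in for a subgradient g_t
eta = 0.1

# Objective of the update: eta * g^T x + (1/2) ||x - xt||^2
obj = lambda x: eta * g @ x + 0.5 * np.sum((x - xt) ** 2)

x_next = xt - eta * g      # the subgradient step
# x_next should do at least as well as any nearby perturbation.
for _ in range(100):
    assert obj(x_next) <= obj(x_next + 0.01 * rng.normal(size=4))
```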
-
But, if we have something else,
-
if phi of x is equal to say,
-
one half times x transpose, times
a matrix Q, times x,
-
where Q is not equal to the identity,
then we get a different algorithm.
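For this quadratic choice, the Bregman divergence works out to the Q-weighted quadratic, one half of (x minus y) transpose Q (x minus y). A sketch verifying that, assuming NumPy and a made-up positive definite Q:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Q = A @ A.T + np.eye(4)          # a symmetric positive definite Q != I

phi  = lambda v: 0.5 * v @ Q @ v
grad = lambda v: Q @ v           # gradient of the quadratic (Q symmetric)

x, y = rng.normal(size=4), rng.normal(size=4)
D = phi(x) - phi(y) - grad(y) @ (x - y)
assert np.isclose(D, 0.5 * (x - y) @ Q @ (x - y))
```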
-
And, this was exactly the idea that
started us off in mirror descent.
-
So this was the idea that we had
in lecture one on mirror descent.
-
So see the first lecture in the mirror
descent series.
-
But we're going to choose phi
to be different even from these.
-
So we'll see that we can take phi
to be quite different, and we would
-
therefore really change the
geometry of the space.
-
So, this is our first look
at mirror descent,
-
we're going to be trying to understand
its properties through examples,
-
and then also, in the following
lecture, actually deriving
-
rates of convergence and
their proofs.
-
But in the meantime, let's go
back to Bregman divergence, and get
-
used to it a little bit, and see what
some of its properties are.
-
So, in part, I'm going to derive this
because we'll need it.
-
In part, just to get us used to thinking
about Bregman divergence.
-
It is a little bit familiar,
-
because we saw that if we choose
phi to be half the Euclidean norm
-
squared, we get back something
very familiar, but if we plug in
-
other things, it will be a little bit
different.
-
So, just want to get used to
manipulating it, and seeing that
-
it's really not that difficult.
-
Okay, so let's, in part for exercise,
in part because we'll need it,
-
let's show that um, for any convex
function phi, we have this property
-
that the gradient of phi, evaluated
at x, minus the gradient of phi,
-
evaluated at y, transpose, times x minus
z, is in fact equal to the Bregman
-
divergence between x and y,
plus the Bregman divergence between
-
z and x, minus the Bregman
divergence between z and y.
-
And this is just a matter of
plugging in these definitions and
-
evaluating, so let's just do exactly
that.
-
So from the definition, on the
right hand side, the first
-
term is equal to phi of x,
minus phi of y,
-
minus the gradient of phi of y,
-
transpose, times x minus y. The second term
is equal to phi of z, minus phi
-
of x, minus the gradient of phi
of x, transpose, times z minus x.
-
And then the third term, which
we subtract, is equal to minus phi of z,
-
plus, I'm distributing the
minus sign here,
-
so it's plus phi of y, plus the
gradient of phi of y, transpose, times
-
z minus y.
-
Okay, so what do we see?
-
Well, first of all, there's -
-
I see a positive phi of x
here, and a negative phi of x,
-
I see a positive phi of z
and a negative phi of z.
-
I see a negative phi of y
and another positive phi of y,
-
so all of these cancel,
-
and now I just have to collect
all of these gradient terms.
-
And this is just a matter of
a little bit of accounting.
-
So this is equal to
-
the gradient of phi of x
times x minus z
-
minus the gradient of phi of y
times x minus y
-
plus the gradient of phi of y
times z minus y,
-
and this is equal to the gradient
of phi of x times x minus z
-
minus the gradient of phi of y
times x minus z.
-
Now I can group these together
and I get exactly what I wanted.
-
This is the gradient of phi of x
minus the gradient of phi of y
-
times x minus z.
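This three-point identity is also easy to confirm numerically; a sketch assuming NumPy and an arbitrary, made-up quadratic phi:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Q = A @ A.T + np.eye(3)          # symmetric positive definite

phi  = lambda v: 0.5 * v @ Q @ v
grad = lambda v: Q @ v
D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)

x, y, z = (rng.normal(size=3) for _ in range(3))
lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
assert np.isclose(lhs, rhs)      # the three-point identity
```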
-
We're going to use this property,
but again, the point is for us
-
not to worry about just
plugging in the definitions.
-
These are relatively simple things.
-
We're going to need something else
in our development of mirror descent
-
and when we analyze it.
-
And this is the concept of a dual norm.
-
So where have we been
using dual norms
-
without even really knowing it?
-
We've been using dual norms because
everything we've done so far,
-
with the exception of a few things
like prox, where we used the L1 norm,
-
has been in the Euclidean norm.
-
So let's - let me give the definition
here of a dual norm.
-
So now I'm going to use this
double bar just as a generic norm.
-
So don't look at it and think
it's the two norm.
-
This is just some norm.
-
So for this indicating any norm,
-
its dual
-
is defined as follows.
-
I'm going to use this lower star notation.
-
So lower star is very frequently
used to denote the dual.
-
But in some other contexts
it has a specific definition.
-
So for example with matrices,
sometimes it denotes the trace norm
-
or the nuclear norm.
-
But in this case I'll try to be crystal
clear if we do need to talk about
-
another norm that's
often denoted like this.
-
For us, this star is always going to mean
the dual norm unless I specify otherwise.
-
So this is defined as the
maximum of u transpose v,
-
subject to the norm of v being less
than or equal to one, where this is
-
the original norm,
whatever that might have been.
-
Okay, so this is just the definition.
-
So let's do some examples.
-
So let's take - let's take this
-
equal to the Euclidean norm.
-
Might as well warm up with
something familiar.
-
So what is the dual
of the Euclidean norm,
-
u - I'll put a star, I'll write it
this way so we remember.
-
This is defined as the
max of u transpose v
-
subject to the Euclidean norm of v
being less than or equal to one.
-
We can just solve this geometrically.
-
It's saying your constraint
set is just the unit ball.
-
So it's saying find the
point on the sphere
-
that has the largest inner product
with a particular vector.
-
What is that going to be?
-
Well, it's - you can see
from this picture.
-
You're not going to pick something
that is perpendicular, certainly.
-
If you picked something perpendicular,
you'd get zero in the max.
-
You're not going to pick something
in the opposite direction,
-
you'd get negative.
-
No, you're going to align exactly
with it and go out unit norm.
-
So in other words,
the optimal solution here
-
is the maximizer,
-
is just equal to u divided by its norm,
-
and the max value
-
is therefore going to be
u transpose times this v star.
-
This is equal to just u transpose u
divided by the norm of u,
-
which is the norm of u.
-
In other words,
the two norm is self dual.
-
So what we just showed is that
the Euclidean norm
-
is its own dual.
-
It's called self dual.
-
So this is the way that we've been
actually using the dual norm
-
without even really knowing it,
-
because we've just been
working in the two norm.
-
So let's take another example.
-
So now let's take - that's example one.
-
Let's take norm equal to the one norm.
-
Okay, and let's see what its dual is.
-
So we've got, what is the dual norm
of u in the one norm?
-
This is again, just copying the definition,
the max of u transpose v...
-
u transpose v,
-
subject to the one norm of v
being less than or equal to one.
-
So let's think about what
this is going to give us.
-
So what does v - what does
this one norm tell us?
-
So this constraint is telling us that
-
we have to be inside this diamond,
-
or algebraically, it's telling us that the
absolute values have to sum to at most one.
-
You can think about this as
a resource allocation problem.
-
You have to allocate your resources,
but you only have 100%
-
and you have to allocate the pieces
so that they add up to one.
-
Where are you going to put them?
-
You're going to put them
wherever the values are the biggest.
-
The slight difference here
is that the entries of u
-
can be positive or negative,
but then so can the entries of v.
-
So what you're going to do is
you're going to find the biggest
-
in absolute value element of u,
-
whether it's positive or negative,
-
and you're going to keep all of v,
all of your energy,
-
you're going to put onto that coordinate.
-
If the biggest element of u is negative,
you're also going to make v negative.
-
If the biggest element of u is positive,
you're also going to make v positive.
-
So in other words, v star in this case,
our optimizing v,
-
is going to be
-
is going to look like
-
the vector that's all zero except
with a single one.
-
And it's going to have a single one
in the ith position,
-
and its sign is going to be
the sign of ui.
-
And which i is it going to be?
-
It's going to be the i where the value
of the absolute value of ui is biggest.
-
So i is the argmax of the
absolute value of ui.
-
So in other words, what do we get?
-
The dual norm is the infinity norm.
-
So the dual norm of u is just
equal to the infinity norm.
-
And you can also see
that the infinity norm,
-
its dual is going to be
the one norm.
-
So there are many examples
like this that we can work out.
-
Let's do one more example.
-
So, so far
-
we've seen that the dual
of the Euclidean norm
-
is again the Euclidean norm.
-
The dual of the one norm
is the infinity norm.
-
I mentioned that the dual of
the infinity norm is the one norm.
-
So let's try that.
-
So the infinity norm,
let's look at the dual of that.
-
By definition this is
the max of u transpose v
-
subject to the infinity norm of v
being less than or equal to one.
-
So the infinity norm being at most one
means that every element of v
-
can be between minus one and one.
-
And there's no coupling
between the coordinates.
-
In other words, if v1 is at its bound,
it doesn't mean that v2 has to be small.
So what would you do?
-
Again you can see this is quite simple.
-
You would look at ui and say ui,
is that positive or negative?
-
I'm going to let vi be the same
sign and be as big as I'm allowed,
-
which is in this case one.
-
So the optimizing v is just
going to be the sign of u,
-
element-wise.
-
In other words,
-
the ith element of v star is just
going to be the sign of ui.
-
And so then u transpose v is equal
to the sum of ui vi.
-
And plugging in this v, this is
the same as the sum of ui
-
times the sign of ui,
-
which is exactly equal to the sum
of the absolute value of ui,
-
which is in fact the L1 norm.
-
So now we've seen
all three of these examples.
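All three dual pairs are easy to check numerically. The sketch below (assuming NumPy) verifies that random feasible v never beat the claimed dual value, and that the explicit maximizers described above attain it:

```python
import numpy as np

rng = np.random.default_rng(5)
u = rng.normal(size=5)

# (constraint norm p, claimed dual norm of u)
pairs = [(2, np.linalg.norm(u, 2)),           # dual of L2 is L2
         (1, np.linalg.norm(u, np.inf)),      # dual of L1 is Linf
         (np.inf, np.linalg.norm(u, 1))]      # dual of Linf is L1
for p, claimed in pairs:
    for _ in range(200):                      # random feasible v: ||v||_p <= 1
        v = rng.normal(size=5)
        v = v / max(np.linalg.norm(v, p), 1.0)
        assert u @ v <= claimed + 1e-9

# The explicit maximizers attain the claimed values.
assert np.isclose(u @ (u / np.linalg.norm(u)), np.linalg.norm(u, 2))
e = np.zeros(5); i = np.argmax(np.abs(u)); e[i] = np.sign(u[i])
assert np.isclose(u @ e, np.linalg.norm(u, np.inf))
assert np.isclose(u @ np.sign(u), np.linalg.norm(u, 1))
```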
-
And we're going to use a very
important property of dual norms,
-
which I'm going to write down now,
-
and we're going to see it again
in the next lecture.
-
So an important inequality
-
says the following.
-
Remember that we've needed to bound,
several times, the dot product
-
between two vectors u and v.
-
And we've done this in several ways.
-
So we can bound this
in terms of the two norm.
-
So this is upper bounded by the two
norm of u times the two norm of v.
-
In fact we know that these are equal
only when u and v are parallel
-
and pointing in the same direction.
-
So this is the Cauchy-Schwarz -
famous Cauchy-Schwarz inequality.
-
In fact, it's also true that u transpose v
-
is less than or equal to the norm of u
times the dual norm of v,
-
for any norm and its dual.
-
And this is sometimes called
Hölder's inequality.
-
And so far we've used - if you go
back to the proofs of gradient descent,
-
subgradient descent, and so on,
you'll see this kind of inequality,
-
the first one, in many places.
-
So another way to bound u transpose v,
one of the properties that we've used,
-
is as follows.
-
Also, u transpose v
is less than or equal to
-
One half times u squared - this is
the Euclidean norm squared -
-
plus one half v squared.
-
This just comes from
completing the square.
-
Again, we've used this
over and over again.
-
But using the dual norm,
-
we can refine this in a particular way.
-
So we will use the following upper bound:
-
u transpose v is less than or equal to
one over two alpha
-
times the norm of u squared -
-
note there's no two subscript,
this is not the Euclidean norm -
-
plus alpha over two,
times the dual norm of v, squared.
-
And this is very similar.
-
If we take alpha equal to one,
-
we recover the analogue of what we
get from completing the square.
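A quick numerical check of this refined bound, assuming NumPy and pairing the L1 norm for u with its dual, the infinity norm, for v:

```python
import numpy as np

rng = np.random.default_rng(6)
for _ in range(200):
    u, v = rng.normal(size=6), rng.normal(size=6)
    for alpha in (0.1, 1.0, 10.0):
        # u^T v <= (1 / (2 alpha)) ||u||_1^2 + (alpha / 2) ||v||_inf^2
        bound = (np.linalg.norm(u, 1) ** 2 / (2 * alpha)
                 + alpha * np.linalg.norm(v, np.inf) ** 2 / 2)
        assert u @ v <= bound + 1e-9
```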
-
The reason that this
is so important for us
-
is that these dot products
are foundational for us.
-
This is the definition of convexity.
-
We use these to make our
linear approximations.
-
This is how we define convex functions,
-
this is how we define
our linear approximations,
-
and this is how we define
all our algorithms.
-
So we're going to see next time
how we use this inequality,
-
how we use Bregman divergence,
-
and how we use the definition of
the Bregman divergence
-
as a proximal term
-
in order to show better
convergence rates for mirror descent
-
in certain settings where we don't
want to use Euclidean geometry.
-
We'll pick this up next time.