
https:/.../2019-05-13_opv1_caramanis_5.7_mirror_descent3.mp4

  • 0:01 - 0:04
    ♪ (Music) ♪
  • 0:09 - 0:14
    Welcome back. This is our third lecture
    in the series, talking about
  • 0:14 - 0:15
    mirror descent.
  • 0:17 - 0:22
And we're going to get down to
the main definition of how it is that
  • 0:22 - 0:28
we change the proximal term
from the subgradient method.
  • 0:28 - 0:35
    So again, recall that the subgradient
    method tells me that xt plus one is
  • 0:35 - 0:42
    equal to xt minus eta gt, where gt is
    an element of the subdifferential
  • 0:42 - 0:45
    of f at xt.
  • 0:45 - 0:48
    Again, I've said this in each
    of these lectures, but I'm leaving
  • 0:48 - 0:51
out the dependence on
  • 0:51 - 0:55
the set constraint,
  • 0:55 - 1:01
    but with the set constraint,
    this would take the form
  • 1:01 - 1:03
    of an additional projection.
  • 1:05 - 1:12
    So xt plus one is equal to the projection
    onto x, the projection here means
  • 1:12 - 1:17
    the Euclidean projection, of xt minus
    eta gt.
  • 1:17 - 1:21
    But because it doesn't change very
    much, for now,
  • 1:21 - 1:24
    I've just been sparing myself the
    extra writing,
  • 1:24 - 1:28
    but we should really have this in mind.
  • 1:28 - 1:34
    And, uh, we've seen repeatedly, this
    has been the theme of the first two
  • 1:34 - 1:38
    lectures on mirror descent, that
    xt plus one is also
  • 1:38 - 1:46
equivalently, just the argmin
of eta gt transpose x, plus,
  • 1:46 - 1:48
    this is the linear - you can think of this
    as the linear approximation
  • 1:48 - 1:54
    to the function, plus something that
    keeps us from going too far, and this
  • 1:54 - 1:58
    is typically the Euclidean norm, or
    specifically, when we use the
  • 1:58 - 2:01
    Euclidean norm, we get exactly the
    subgradient method.
  • 2:02 - 2:06
Without a set constraint, this is an
argmin over all x, but if I have a
  • 2:06 - 2:09
    set constraint, it just becomes -
    the only thing that changes is
  • 2:09 - 2:13
    this is an argmin over x in my
    set constraint.
  • 2:13 - 2:16
    So again, not much has changed.
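As a rough illustration of the projected subgradient update described above, here is a minimal Python sketch; the subgradient oracle subgrad_f, the projection proj_X, and the concrete objective in the example are hypothetical placeholders, not anything from the lecture.

```python
import numpy as np

def projected_subgradient(subgrad_f, proj_X, x0, eta, num_iters):
    """Sketch of x_{t+1} = Proj_X(x_t - eta * g_t), with g_t in the subdifferential of f at x_t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = subgrad_f(x)           # any element of the subdifferential of f at x
        x = proj_X(x - eta * g)    # Euclidean projection back onto the constraint set X
    return x

# Hypothetical example: minimize ||x - c||_1 over the Euclidean unit ball.
c = np.array([2.0, -1.0])
subgrad_f = lambda x: np.sign(x - c)                 # a valid subgradient of the L1 objective
proj_X = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto {x : ||x||_2 <= 1}
print(projected_subgradient(subgrad_f, proj_X, np.zeros(2), eta=0.05, num_iters=500))
```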
  • 2:16 - 2:20
    So, we're going to change this.
  • 2:24 - 2:26
    And by changing it, we get
    mirror descent, and we'll see
  • 2:26 - 2:29
    exactly when we can do this to
    our advantage.
  • 2:30 - 2:34
    In order to do this, and to
    explain what will replace it,
  • 2:34 - 2:37
    we need to define this concept
    of Bregman divergence.
  • 2:40 - 2:42
    For a convex function phi,
  • 2:46 - 2:49
    I can define its Bregman
    divergence as follows.
  • 2:49 - 2:55
    So the divergence with respect
    to phi, between two points x and y,
  • 2:55 - 3:04
    is defined as phi of x minus
    phi of y, minus the gradient of phi
  • 3:04 - 3:10
    of y transpose times x minus y.
  • 3:10 - 3:11
    So what does this mean?
  • 3:11 - 3:13
    It actually has a very simple
    interpretation.
  • 3:13 - 3:15
    And let me group these terms
    as follows.
  • 3:15 - 3:19
    Let me group the second and third
    term, and then I'll have to distribute
  • 3:19 - 3:22
    that minus sign, in order to make
    this plus.
  • 3:22 - 3:28
    So now, a simple picture tells
    us exactly what this is.
  • 3:28 - 3:34
    If phi is a convex function, and
    I've got my point y, and I've
  • 3:34 - 3:39
    got my point x, then what is
    the first term?
  • 3:39 - 3:43
    The first term is just phi of
    x, it's sitting up there.
  • 3:43 - 3:45
    What's the second term?
  • 3:45 - 3:49
    We can interpret the second
    term as the linear approximation
  • 3:49 - 3:52
    to phi at y,
  • 3:52 - 3:57
    evaluated at the point x.
  • 3:59 - 4:02
    So this is the second term.
  • 4:02 - 4:09
    This is phi of y plus the gradient
    of phi of y transpose x minus y.
  • 4:09 - 4:17
    And therefore, the Bregman
    divergence is exactly this term.
  • 4:17 - 4:20
    Making things a little messy
    here, so let's see.
  • 4:22 - 4:24
    One more try.
  • 4:25 - 4:28
    This is exactly the Bregman
    divergence.
  • 4:29 - 4:34
The more curvature phi has,
the larger the Bregman divergence.
  • 4:34 - 4:36
    Let's do a very simple calculation.
  • 4:37 - 4:44
    In the simple case where phi of x
  • 4:44 - 4:49
is just equal to one half times
the Euclidean norm squared.
  • 4:49 - 4:53
    So let's see what the Bregman
    divergence is in this case.
  • 4:53 - 4:58
    Well, plugging in the definition,
    I see that the Bregman divergence
  • 4:58 - 5:02
    between x and y is equal to
    phi of x,
  • 5:02 - 5:06
    that's one half Euclidean norm
    squared, of x,
  • 5:06 - 5:11
    Minus phi of y, that's one half
    Euclidean norm squared of
  • 5:11 - 5:15
y, minus the gradient of phi of y.
  • 5:15 - 5:18
Well, what's the gradient
of one half times the Euclidean norm squared?
  • 5:18 - 5:21
    It's just y itself.
  • 5:21 - 5:23
    Y transpose.
  • 5:23 - 5:27
    Times x minus y, and grouping
    all these terms,
  • 5:27 - 5:33
    rewriting, I get one half, factor
    that out, Euclidean norm of x
  • 5:33 - 5:37
squared,
  • 5:40 - 5:47
    minus twice y transpose x.
  • 5:47 - 5:54
    Then I've got minus y squared,
  • 5:54 - 5:59
    plus twice y squared.
  • 5:59 - 6:04
That simplifies to just plus y squared,
and you can see that this is
  • 6:04 - 6:06
    exactly the quadratic.
  • 6:06 - 6:11
This is equal to one half times the norm
of x minus y, squared.
  • 6:11 - 6:16
    So, if we take phi to be the function
    that's one half Euclidean norm squared,
  • 6:16 - 6:19
    then the Bregman divergence is
  • 6:19 - 6:25
just going to be one half the squared
Euclidean norm of
  • 6:25 - 6:26
    the difference.
  • 6:26 - 6:28
    So we get something familiar.
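A quick numerical check of the calculation above: with phi equal to half the squared Euclidean norm, the divergence matches half the squared norm of the difference. This is only a sketch with arbitrary random test vectors.

```python
import numpy as np

phi = lambda w: 0.5 * np.dot(w, w)      # phi(x) = 0.5 * ||x||_2^2
grad_phi = lambda w: w                  # its gradient is just the identity
rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
D = phi(x) - phi(y) - grad_phi(y) @ (x - y)
print(np.isclose(D, 0.5 * np.linalg.norm(x - y)**2))  # True
```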
  • 6:28 - 6:33
But, we're going to see several examples
where we're going to take phi
  • 6:33 - 6:38
    to be something different and therefore we
    get a very different Bregman divergence.
  • 6:40 - 6:45
    Okay, and we're going to use this directly
    in our definition of mirror descent.
  • 6:48 - 6:56
    Mirror descent now becomes xt plus one,
    is going to be equal to the argmin,
  • 6:59 - 7:07
    possibly in our set constraint,
    eta gt transpose x,
  • 7:07 - 7:10
    plus, previously we had
  • 7:10 - 7:16
one half times the Euclidean norm squared,
but now we have the
  • 7:16 - 7:21
    Bregman divergence between x and xt.
  • 7:23 - 7:33
    So, if phi is equal to just one half
    Euclidean norm squared, then
  • 7:33 - 7:40
    we recover our usual subgradient
    method.
  • 7:44 - 7:48
    But, if we have something else,
  • 7:48 - 7:55
    if phi of x is equal to say,
  • 7:55 - 8:03
one half times x transpose, times
a different matrix Q, times x,
  • 8:03 - 8:10
where Q is not equal to the identity,
    then, we get a different algorithm.
  • 8:13 - 8:19
    And, this was exactly the idea that
    started us off in mirror descent.
  • 8:19 - 8:24
    So this was the idea that we had
    in lecture one on mirror descent.
  • 8:24 - 8:31
    So see the first lecture in the mirror
    descent series.
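For the quadratic choice just mentioned, the unconstrained mirror descent step can be written in closed form. The sketch below assumes phi(x) = one half x transpose Q x with Q symmetric positive definite, so the Bregman divergence is one half (x - xt) transpose Q (x - xt) and the argmin gives xt+1 = xt - eta Q inverse gt; with Q equal to the identity it reduces to the plain subgradient step.

```python
import numpy as np

def mirror_step_quadratic(x_t, g_t, eta, Q):
    """Unconstrained step argmin_x eta * g_t^T x + D_phi(x, x_t) for phi(x) = 0.5 * x^T Q x.

    The optimality condition eta * g_t + Q (x - x_t) = 0 gives x_{t+1} = x_t - eta * Q^{-1} g_t.
    """
    return np.asarray(x_t, dtype=float) - eta * np.linalg.solve(Q, g_t)

# With Q = I this is exactly the subgradient step x_t - eta * g_t.
Q = np.array([[2.0, 0.0], [0.0, 0.5]])
print(mirror_step_quadratic([1.0, 1.0], np.array([1.0, 1.0]), 0.1, Q))  # [0.95 0.8]
```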
  • 8:34 - 8:37
But we're going to choose phi
to be different even from these.
  • 8:37 - 8:40
    So we'll see that we can take phi
    to be quite different, and we would
  • 8:40 - 8:44
    therefore really change the
    geometry of the space.
  • 8:45 - 8:48
    So, this is our first look
    at mirror descent,
  • 8:48 - 8:51
    we're going to be trying to understand
    its properties through examples,
  • 8:51 - 8:55
    and then also, in the following
    lecture, actually deriving
  • 8:55 - 8:59
a rate of convergence and a proof
of that rate of convergence.
  • 8:59 - 9:02
    But in the meantime, let's go
    back to Bregman divergence, and get
  • 9:02 - 9:06
    used to it a little bit, and see what
    some of its properties are.
  • 9:06 - 9:11
    So, in part, I'm going to derive this
    because we'll need it.
  • 9:11 - 9:14
    In part, just to get us used to thinking
    about Bregman divergence.
  • 9:14 - 9:17
It is a little bit familiar,
  • 9:17 - 9:21
    because we saw that if we choose
    phi to be half the Euclidean norm
  • 9:21 - 9:24
    squared, we get back something
    very familiar, but if we plug in
  • 9:24 - 9:26
    other things, it will be a little bit
    different.
  • 9:26 - 9:29
    So, just want to get used to
    manipulating it, and seeing that
  • 9:29 - 9:33
    it's really not that difficult.
  • 9:33 - 9:39
Okay, so let's, in part for exercise,
and in part because we'll need it,
  • 9:39 - 9:47
let's show that for any convex
function phi, we have this property
  • 9:47 - 9:53
    that the gradient of phi, evaluated
    at x, minus the gradient of phi,
  • 9:53 - 10:01
    evaluated at y, transpose, times x minus
    z, is in fact equal to the Bregman
  • 10:01 - 10:05
    divergence between x and y,
    plus the Bregman divergence between
  • 10:05 - 10:12
    z and x, minus the Bregman
    divergence between z and y.
  • 10:12 - 10:15
    And this is just a matter of
    plugging in these definitions and
  • 10:15 - 10:18
    evaluating, so let's just do exactly
    that.
  • 10:18 - 10:27
So from the definition, on the
right hand side, the first
  • 10:27 - 10:35
term is equal to phi of x,
minus phi of y,
  • 10:35 - 10:39
minus the gradient of phi of y,
  • 10:39 - 10:48
    times x minus y. The second term
    is equal to phi of z, minus phi
  • 10:48 - 10:55
of x, minus the gradient of phi
    of x, times z minus x.
  • 10:56 - 11:02
    And then the third term, which
we subtract, is equal to minus phi of z,
  • 11:02 - 11:06
    plus, I'm distributing the
    minus sign here,
  • 11:06 - 11:14
    So it's plus phi of y, plus the
    gradient of phi of y, times
  • 11:14 - 11:16
    z minus y.
  • 11:16 - 11:17
    Okay, so what do we see?
  • 11:17 - 11:20
    Well, first of all, there's -
  • 11:20 - 11:24
    I see a positive phi of x
    here, and a negative phi of x,
  • 11:24 - 11:28
    I see a positive phi of z
    and a negative phi of z.
  • 11:28 - 11:32
    I see a negative phi of y
    and another positive phi of y,
  • 11:32 - 11:34
    so all of these cancel,
    and now I - oops.
  • 11:34 - 11:38
    And now I just have to collect
    all of these terms here.
  • 11:38 - 11:41
    And this is just a matter of
    a little bit of accounting.
  • 11:41 - 11:44
    So this is equal to
  • 11:46 - 11:52
    the gradient of phi of x
    times x minus z
  • 11:52 - 11:58
    minus the gradient of phi of y
    times x minus y
  • 11:58 - 12:02
    plus the gradient of phi of y
    times z minus y,
  • 12:02 - 12:11
    and this is equal to the gradient
    of phi of x times x minus z
  • 12:11 - 12:17
    minus the gradient of phi of y
    times x minus z.
  • 12:19 - 12:22
    Now I can group these together
    and I get exactly what I wanted.
  • 12:22 - 12:29
    This is the gradient of phi of x
    minus the gradient of phi of y
  • 12:29 - 12:33
    times x minus z.
  • 12:33 - 12:36
    We're going to use this property,
    but again, the point is for us
  • 12:36 - 12:39
not to be afraid of just
plugging in the definitions.
  • 12:39 - 12:42
    These are relatively simple things.
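As a quick sanity check of the identity we just derived, here is a small numeric sketch using an arbitrary smooth convex phi, a random positive definite quadratic chosen purely for illustration; the two sides agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.eye(4)                                # a random symmetric positive definite matrix
phi = lambda w: 0.5 * w @ Q @ w                        # a smooth convex phi, for illustration
grad = lambda w: Q @ w
D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)   # Bregman divergence

x, y, z = (rng.standard_normal(4) for _ in range(3))
lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
print(np.isclose(lhs, rhs))  # True
```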
  • 12:44 - 12:49
    We're going to need something else
    in our development of mirror descent
  • 12:49 - 12:50
    and when we analyze it.
  • 12:50 - 12:53
    And this is the concept of a dual norm.
  • 12:53 - 12:55
    So where have we been
    using dual norms
  • 12:55 - 12:57
    without even really knowing it?
  • 12:59 - 13:03
We've been using dual norms because,
in everything we've done so far,
  • 13:03 - 13:09
with the exception of a few things
like prox, where we used the L1 norm,
  • 13:09 - 13:15
    we've been using the Euclidean norm.
  • 13:15 - 13:19
    So let's - let me give the definition
    here of a dual norm.
  • 13:23 - 13:27
    So now I'm going to use this
    double bar just as a generic norm.
  • 13:27 - 13:31
    So don't look at it and think
    it's the two norm.
  • 13:31 - 13:33
    This is just some norm.
  • 13:33 - 13:36
So, with this indicating any norm,
  • 13:38 - 13:40
    its dual
  • 13:42 - 13:46
    is defined as follows.
  • 13:48 - 13:51
    I'm going to use this lower star notation.
  • 13:51 - 13:55
    So lower star is very frequently
    used to denote the dual.
  • 13:55 - 13:59
    But in some other contexts
    it has a specific definition.
  • 13:59 - 14:04
    So for example with matrices,
    sometimes it denotes the trace norm
  • 14:04 - 14:05
    or the nuclear norm.
  • 14:05 - 14:09
    But in this case I'll try to be crystal
    clear if we do need to talk about
  • 14:09 - 14:12
    another norm that's
    often denoted like this.
  • 14:12 - 14:20
    For us, this star is always going to mean
    the dual norm unless I specify otherwise.
  • 14:20 - 14:28
    So this is defined as the
    maximum of u transpose v
  • 14:28 - 14:35
subject to the norm of v being at most one,
where this is the original norm,
  • 14:35 - 14:37
    whatever that might have been.
  • 14:37 - 14:39
    Okay, so this is just the definition.
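One rough way to get a feel for the definition is to approximate the maximum numerically by sampling directions on the unit sphere of the chosen norm; the sketch below is only a Monte Carlo lower bound on the true dual norm, not an exact computation.

```python
import numpy as np

def dual_norm_estimate(u, norm, n_samples=20_000, seed=0):
    """Monte Carlo lower bound on ||u||_* = max { u^T v : norm(v) <= 1 }.

    Random directions are rescaled so that norm(v) = 1 (the maximum is attained on
    the boundary), and the best sampled value approaches the dual norm from below.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_samples, np.size(u)))
    V /= np.array([norm(v) for v in V])[:, None]
    return float(np.max(V @ u))

u = np.array([3.0, -1.0, 2.0])
print(dual_norm_estimate(u, lambda v: np.linalg.norm(v, 2)))  # close to ||u||_2, about 3.74
print(dual_norm_estimate(u, lambda v: np.linalg.norm(v, 1)))  # approaches ||u||_inf = 3 from below
```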
  • 14:39 - 14:41
    So let's do some examples.
  • 14:48 - 14:53
    So let's take - let's take this
  • 14:53 - 14:55
    equal to the Euclidean norm.
  • 14:55 - 14:59
    Might as well warm up with
    something familiar.
  • 14:59 - 15:03
    So what is the dual
    of the Euclidean norm,
  • 15:03 - 15:08
    u - I'll put a star, I'll write it
    this way so we remember.
  • 15:08 - 15:14
    This is defined as the
    max of u transpose v
  • 15:14 - 15:19
subject to the Euclidean
norm of v being at most one.
  • 15:19 - 15:21
    We can just solve this geometrically.
  • 15:21 - 15:27
It's saying your constraint
set is just the unit ball.
  • 15:27 - 15:29
    So it's saying find the
    point on the sphere
  • 15:29 - 15:33
    that has the largest inner product
    with a particular vector.
  • 15:35 - 15:37
    What is that going to be?
  • 15:37 - 15:40
    Well, it's - you can see
    from this picture.
  • 15:40 - 15:43
    You're not going to pick something
    that is perpendicular, certainly.
  • 15:43 - 15:46
    If you picked something perpendicular,
    you'd get zero in the max.
  • 15:46 - 15:48
    You're not going to pick something
    in the opposite direction,
  • 15:48 - 15:49
    you'd get negative.
  • 15:49 - 15:54
    No, you're going to align exactly
    with it and go out unit norm.
  • 15:54 - 15:58
    So in other words,
    the optimal solution here
  • 15:58 - 16:01
    is the maximizer,
  • 16:03 - 16:09
    is just equal to u divided by its norm,
  • 16:09 - 16:11
    and the max value
  • 16:16 - 16:21
is therefore going to be
u transpose times this maximizer,
  • 16:23 - 16:24
v star.
  • 16:24 - 16:29
    This is equal to just u transpose u
    divided by the norm of u,
  • 16:29 - 16:31
    which is the norm of u.
  • 16:33 - 16:36
    In other words,
    the two norm is self dual.
  • 16:36 - 16:41
So the Euclidean norm - what we just
showed is that the Euclidean norm
  • 16:43 - 16:46
    is its own dual.
  • 16:46 - 16:48
    It's called self dual.
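A two-line numeric confirmation of the argument above, using the maximizer v equal to u divided by its Euclidean norm, with an arbitrary test vector chosen for illustration:

```python
import numpy as np

u = np.array([1.0, -2.0, 2.0])
v_star = u / np.linalg.norm(u)          # align with u, unit Euclidean norm
print(u @ v_star, np.linalg.norm(u))    # both 3.0: the two norm is self dual
```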
  • 16:50 - 16:53
    So this is the way that we've been
    actually using the dual norm
  • 16:53 - 16:54
    without even really knowing it,
  • 16:54 - 16:57
    because we've just been
    working in the two norm.
  • 16:57 - 16:59
    So let's take another example.
  • 17:01 - 17:04
    So now let's take - that's example one.
  • 17:04 - 17:09
Let's take the norm equal to the one norm.
  • 17:10 - 17:13
    Okay, and let's see what its dual is.
  • 17:14 - 17:21
    So we've got, what is the dual norm
    of u in the one norm?
  • 17:21 - 17:27
    This is again, just copying the definition,
    the max of u transpose v...
  • 17:28 - 17:31
    u transpose v,
  • 17:31 - 17:38
    subject to v in the one norm
    being less than one.
  • 17:38 - 17:41
    So let's think about what
    this is going to give us.
  • 17:41 - 17:45
    So what does v - what does
    this one norm tell us?
  • 17:45 - 17:49
    So this constraint is telling us that
  • 17:49 - 17:53
    we have to be inside this diamond,
  • 17:53 - 17:58
or algebraically it's telling us that the
absolute values have to sum to at most one.
  • 17:58 - 18:01
    You can think about this as
    a resource allocation problem.
  • 18:01 - 18:05
    You have to allocate your resources,
    but you only have 100%
  • 18:05 - 18:08
    and you have to allocate the pieces
    so that they add up to one.
  • 18:08 - 18:09
    Where are you going to put them?
  • 18:09 - 18:13
    You're going to put them
    wherever the values are the biggest.
  • 18:13 - 18:16
The slight difference here
is that u can have entries
  • 18:16 - 18:19
    that are positive and negative,
    but then so can v.
  • 18:19 - 18:22
    So what you're going to do is
    you're going to find the biggest
  • 18:22 - 18:26
    in absolute value element of u,
  • 18:26 - 18:28
    whether it's positive or negative,
  • 18:28 - 18:31
    and you're going to keep all of v,
    all of your energy,
  • 18:31 - 18:33
    you're going to put onto that coordinate.
  • 18:33 - 18:37
    If the biggest element of u is negative,
    you're also going to make v negative.
  • 18:37 - 18:41
    If the biggest element of u is positive,
    you're also going to make v positive.
  • 18:41 - 18:49
    So in other words, v star in this case,
    our optimizing v,
  • 18:50 - 18:51
    is going to be
  • 18:54 - 18:55
    is going to look like
  • 18:59 - 19:03
    the vector that's all zero except
    with a single one.
  • 19:04 - 19:08
    And it's going to have a single one
    in the ith position,
  • 19:08 - 19:15
    and its sign is going to be
    the sign of ui.
  • 19:15 - 19:17
    And which i is it going to be?
  • 19:17 - 19:21
It's going to be the i where the
absolute value of ui is biggest.
  • 19:21 - 19:29
    So i is the argmax of the
    absolute value of ui.
  • 19:29 - 19:31
    So in other words, what do we get?
  • 19:31 - 19:34
    The dual norm is the infinity norm.
  • 19:35 - 19:43
    So the dual norm of u is just
    equal to the infinity norm.
  • 19:43 - 19:46
    And you can also see
    that the infinity norm,
  • 19:46 - 19:48
    its dual is going to be
    the one norm.
  • 19:48 - 19:52
    So there are many examples
    like this that we can work out.
  • 19:53 - 19:56
    Let's do one more example.
  • 19:56 - 19:57
    So, so far
  • 20:00 - 20:07
    we've seen that the dual
    of the Euclidean norm
  • 20:07 - 20:11
    is again the Euclidean norm.
  • 20:11 - 20:18
    The dual of the one norm
    is the infinity norm.
  • 20:18 - 20:27
    I mentioned that the dual of
    the infinity norm is the one norm.
  • 20:29 - 20:30
    So let's try that.
  • 20:33 - 20:39
    So the infinity norm,
    let's look at the dual of that.
  • 20:39 - 20:45
    By definition this is
    the max of u transpose v
  • 20:45 - 20:51
    subject to v in the infinity norm
    being less than one.
  • 20:51 - 20:55
    So the infinity norm being less than one
    means that every element of v
  • 20:55 - 20:57
    can be between minus one and one.
  • 20:57 - 20:58
    And there's no penalty.
  • 20:58 - 21:03
    In other words, if v1 is very big,
    it doesn't mean that v2 has to be small.
  • 21:03 - 21:05
    So what would you do?
  • 21:05 - 21:06
    Again you can see this is quite simple.
  • 21:06 - 21:12
    You would look at ui and say ui,
    is that positive or negative?
  • 21:12 - 21:18
    I'm going to let vi be the same
    sign and be as big as I'm allowed,
  • 21:18 - 21:20
    which is in this case one.
  • 21:20 - 21:27
    So the optimizing v is just
    going to be the sign of u,
  • 21:27 - 21:28
    element-wise.
  • 21:28 - 21:31
    In other words,
  • 21:31 - 21:37
    the ith element of v star is just
    going to be the sign of ui.
  • 21:37 - 21:40
    And so then what is the sum of -
  • 21:40 - 21:45
    so u transpose v is equal
    to the sum of ui vi.
  • 21:45 - 21:49
    And plugging in this v, this is
    the same as the sum of ui
  • 21:49 - 21:52
    times the sign of ui,
  • 21:52 - 21:58
    which is exactly equal to the sum
    of the absolute value of ui,
  • 21:58 - 22:01
    which is in fact the L1 norm.
  • 22:04 - 22:08
    So now we've seen
    all three of these examples.
  • 22:08 - 22:16
    And we're going to use a very
    important property of dual norms,
  • 22:16 - 22:17
    which I'm going to write down now,
  • 22:17 - 22:21
    and we're going to see it again
    in the next lecture.
  • 22:26 - 22:28
    So an important inequality
  • 22:30 - 22:31
    says the following.
  • 22:34 - 22:39
    Remember that we've needed to bound,
    several times, the dot product
  • 22:39 - 22:42
    between two vectors u and v.
  • 22:42 - 22:44
    And we've done this in several ways.
  • 22:44 - 22:49
    So we can bound this
    in terms of the two norm.
  • 22:49 - 22:56
    So this is upper bounded by the two
    norm of u times the two norm of v.
  • 22:56 - 22:59
    In fact we know that these are equal
    only when u and v are parallel
  • 22:59 - 23:02
    and pointing in the same direction.
  • 23:03 - 23:07
    So this is the Cauchy-Schwarz -
    famous Cauchy-Schwarz inequality.
  • 23:12 - 23:16
    In fact, it's also true that u transpose v
  • 23:16 - 23:24
is less than or equal to the norm of u times the dual norm of v,
  • 23:24 - 23:27
for any norm paired with its dual.
  • 23:27 - 23:30
    And this is sometimes called
    Hölder's inequality.
  • 23:35 - 23:39
    And so far we've used - if you go
    back to the proofs of gradient descent,
  • 23:39 - 23:42
    subgradient descent, and so on,
    you'll see this kind of inequality,
  • 23:42 - 23:45
    the first one, in many places.
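A quick numeric illustration of both bounds on arbitrary random vectors, using the one norm paired with the infinity norm for the Hölder-type inequality:

```python
import numpy as np

rng = np.random.default_rng(2)
u, v = rng.standard_normal(10), rng.standard_normal(10)

print(u @ v <= np.linalg.norm(u) * np.linalg.norm(v))              # Cauchy-Schwarz
print(u @ v <= np.linalg.norm(u, 1) * np.linalg.norm(v, np.inf))   # Hölder with the (1, inf) pair
# Cauchy-Schwarz is tight when the vectors point in the same direction:
print(np.isclose(u @ (2 * u), np.linalg.norm(u) * np.linalg.norm(2 * u)))  # True
```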
  • 23:51 - 23:59
    So one of the properties that we've
used, another way to bound u transpose v
  • 23:59 - 24:03
    is as follows.
  • 24:03 - 24:10
    Also, u transpose v
    is less than or equal to
  • 24:10 - 24:15
    One half times u squared - this is
    the Euclidean norm squared -
  • 24:15 - 24:18
    plus one half v squared.
  • 24:18 - 24:20
    This just comes from
    completing the square.
  • 24:25 - 24:28
    Again, we've used this
    over and over again.
  • 24:28 - 24:31
    But using the dual norm,
  • 24:31 - 24:36
    we can refine this in a particular way.
  • 24:36 - 24:41
    So we will use the following upper bound:
  • 24:41 - 24:50
    u transpose v is less than or equal to
    one over two alpha
  • 24:50 - 24:53
    times the norm of u squared -
  • 24:53 - 24:57
    note there's no two subscript,
    this is not the Euclidean norm -
  • 24:57 - 25:05
    plus alpha over two,
    times the dual norm of v, squared.
  • 25:05 - 25:07
    And this is very similar.
  • 25:07 - 25:09
If we take alpha equal to one,
  • 25:09 - 25:11
we recover, with the Euclidean norm, what
we get from completing the square.
  • 25:13 - 25:15
    The reason that this
    is so important for us
  • 25:15 - 25:19
    is that these dot products
    are foundational for us.
  • 25:19 - 25:21
    This is the definition of convexity.
  • 25:21 - 25:23
    We use these to make our
    linear approximations.
  • 25:23 - 25:25
    This is how we define convex functions,
  • 25:25 - 25:28
    this is how we define
    our linear approximations,
  • 25:28 - 25:30
    and this is how we define
    all our algorithms.
  • 25:30 - 25:34
    So we're going to see next time
    how we use this inequality,
  • 25:34 - 25:35
    how we use Bregman divergence,
  • 25:35 - 25:41
    and how we use the definition of
    the Bregman divergence
  • 25:41 - 25:42
    as a proximal term
  • 25:42 - 25:47
    in order to show better
    convergence rates for mirror prox
  • 25:47 - 25:50
    in certain settings where -
    I'm sorry, for mirror descent
  • 25:50 - 25:54
    in certain settings where we don't
    want to use Euclidean geometry.
  • 25:54 - 25:56
    We'll pick this up next time.