
Rmsprop: Divide the gradient by a running average of its recent magnitude

  • 0:00 - 0:05
    In this video, I'm first going to
    introduce a method called rprop, which is
  • 0:05 - 0:10
    used for full-batch learning.
    It's like Robbie Jacobs' method, but not
  • 0:10 - 0:13
    quite the same.
    I'm then going to show how to extend rprop
  • 0:13 - 0:19
    so that it works for mini-batches. This
    gives you the advantages of rprop, and it
  • 0:19 - 0:25
    also gives you the advantage of mini-batch
    learning, which is essential for large,
  • 0:25 - 0:29
    redundant data sets.
    The method that we end up with, called
  • 0:29 - 0:34
    rmsprop, is currently my favorite method as a
    sort of basic method for learning the
  • 0:34 - 0:39
    weights in a large neural network with a
    large redundant data set.
  • 0:39 - 0:45
    I'm now going to describe rprop, which is
    an interesting way of trying to deal with
  • 0:45 - 0:49
    the fact that gradients vary widely in
    their magnitudes.
  • 0:51 - 0:54
    Some gradients can be tiny and others can
    be huge.
  • 0:54 - 0:58
    And that makes it hard to choose a single
    global learning rate.
  • 0:59 - 1:04
    If we're doing full batch learning, we can
    cope with these big variations in
  • 1:04 - 1:07
    gradients, by just using the sign of the
    gradient.
  • 1:08 - 1:12
    That makes all of the weight updates be
    the same size.
  • 1:13 - 1:19
    For issues like escaping from plateaus
    with very small gradients, this is a great
  • 1:19 - 1:23
    technique, because even with tiny gradients
    we'll take quite big steps.
  • 1:23 - 1:28
    We couldn't achieve that by just turning
    up the learning rate because then the
  • 1:28 - 1:33
    steps we took for weights that had big
    gradients would be much too big.
  • 1:33 - 1:38
    Rprop combines the idea of just using the
    sign of the gradient with the idea of
  • 1:38 - 1:42
    making the step size
    depend on which weight it is.
  • 1:42 - 1:47
    So to decide how much to change your
    weight, you don't look at the magnitude of
  • 1:47 - 1:50
    the gradient, you just look at the sign of
    the gradient.
  • 1:50 - 1:55
    But you do look at the step size you've
    decided on for that weight.
  • 1:55 - 2:00
    And that step size adapts over time,
    again without looking at the magnitude of
  • 2:00 - 2:05
    the gradient.
    So we increase the step size for a weight
  • 2:05 - 2:08
    multiplicatively, for example by a factor
    of 1.2,
  • 2:08 - 2:11
    if the signs of the last two gradients
    agree.
  • 2:11 - 2:16
    This is like Robbie Jacobs' adaptive
    weights method, except that here we're
  • 2:16 - 2:21
    going to do a multiplicative increase.
    If the signs of the last two gradients
  • 2:21 - 2:26
    disagree, we decrease the step size
    multiplicatively, and in this case, we'll
  • 2:26 - 2:31
    make that more powerful than the increase,
    so that step sizes can die down faster than they
  • 2:31 - 2:34
    grow.
    We need to limit the step sizes.
  • 2:34 - 2:38
    Mike Shuster's advice was to limit them
    between 50 and a millionth.
  • 2:38 - 2:42
    I think it depends a lot on what problem
    you're dealing with.
  • 2:42 - 2:47
    If for example you have a problem with
    some tiny inputs, you might need very big
  • 2:47 - 2:51
    weights on those inputs for them to have
    an effect.
  • 2:51 - 2:56
    I suspect that if you're not dealing with
    that kind of problem, having an upper
  • 2:56 - 3:00
    limit on the weight changes that's much
    less than 50 would be a good idea.
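
As a minimal sketch (not from the lecture itself), the rprop update just described might look like this in NumPy. The growth factor of 1.2 and the limits of 50 and a millionth come from the lecture; the shrink factor of 0.5 is an assumed value, since the lecture only says the decrease should be more powerful than the increase.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 grow=1.2, shrink=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch rprop update for a weight vector w.

    Only the sign of the gradient is used; each weight has its own
    step size, adapted multiplicatively. shrink=0.5 is an assumed
    value (the lecture only says the decrease is more powerful).
    """
    agree = grad * prev_grad                    # >0 where last two signs agree
    step = np.where(agree > 0, step * grow,     # signs agree: grow the step
           np.where(agree < 0, step * shrink,   # signs disagree: shrink faster
                    step))                      # first step or zero gradient
    step = np.clip(step, step_min, step_max)    # limit the step sizes
    w = w - np.sign(grad) * step                # move by the step, sign only
    return w, step, grad                        # returned grad is next prev_grad
```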
  • 3:00 - 3:04
    So one question is, why doesn't rprop work
    with mini-batches?
  • 3:04 - 3:07
    People have tried it and found it hard to
    get it to work.
  • 3:07 - 3:11
    You can get it to work with very big
    mini-batches, where you use much more
  • 3:11 - 3:16
    conservative changes to the step sizes.
    But it's difficult.
  • 3:16 - 3:21
    So the reason it doesn't work is it
    violates the central idea behind
  • 3:21 - 3:26
    stochastic gradient descent,
    which is that when we have a small
  • 3:26 - 3:32
    learning rate, the gradient gets
    effectively averaged over successive mini-
  • 3:32 - 3:37
    batches.
    So consider a weight that gets a gradient
  • 3:37 - 3:44
    of +0.01 on nine mini-batches, and then a
    gradient of -0.09 on the tenth mini-batch.
  • 3:44 - 3:49
    What we'd like is for those gradients to
    roughly average out, so the weight will
  • 3:49 - 3:52
    stay where it is.
    Rprop won't give us that.
  • 3:52 - 3:56
    Rprop would increment the weight nine
    times by whatever its current step size
  • 3:56 - 4:01
    is, and decrement it only once.
    And that would make the weight get much
  • 4:01 - 4:04
    bigger.
    We're assuming here that the step sizes
  • 4:04 - 4:09
    adapt much more slowly than the time scale
    of these mini-batches.
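
To make that arithmetic concrete, here is the example above as a small Python check; the learning rate and the fixed step size of 0.1 are illustrative values.

```python
# Gradients for one weight on ten successive mini-batches (example above).
grads = [0.01] * 9 + [-0.09]

# With a small learning rate, SGD effectively averages the gradients:
lr = 0.1
print(sum(-lr * g for g in grads))   # ~0.0 (up to float rounding):
                                     # the moves cancel, the weight stays put

# rprop uses only the sign, moving by a per-weight step size that we assume
# adapts far more slowly than ten mini-batches (held fixed at 0.1 here):
step = 0.1
print(sum(-step * (1 if g > 0 else -1) for g in grads))
                                     # -0.8: eight full steps in one direction
```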
  • 4:09 - 4:15
    So the question is: can we combine the
    robustness that you get from rprop by just
  • 4:15 - 4:20
    using the sign of the gradient,
    the efficiency that you get from mini-
  • 4:20 - 4:23
    batches,
    and the averaging of gradients over
  • 4:23 - 4:29
    mini-batches that allows gradients
    to be combined in the right way?
  • 4:29 - 4:32
    That leads to a method which I'm calling
    rmsprop.
  • 4:32 - 4:38
    And you can consider it to be a mini-batch
    version of rprop. rprop is equivalent to
  • 4:38 - 4:42
    using the gradient,
    but also dividing by the magnitude of the
  • 4:42 - 4:45
    gradient.
    And the reason it has problems with
  • 4:45 - 4:51
    mini-batches is that we divide the
    gradient by a different magnitude for each
  • 4:51 - 4:55
    mini-batch.
    So the idea is that we're going to force
  • 4:55 - 5:01
    the number we divide by to be pretty much
    the same for nearby mini-batches. We do
  • 5:01 - 5:06
    that by keeping a moving average of the
    squared gradient for each weight.
  • 5:06 - 5:11
    So MeanSquare(w, t) means this moving
    average for weight w at time t,
  • 5:11 - 5:14
    where time is an index over weight
    updates.
  • 5:14 - 5:21
    Time increments by one each time we update
    the weights. The numbers I put in of 0.9
  • 5:21 - 5:26
    and 0.1 for computing the moving average are
    just examples, but they're reasonably
  • 5:26 - 5:31
    sensible examples.
    So the mean square is the previous mean
  • 5:31 - 5:37
    square times 0.9,
    plus the value of the squared gradient for
  • 5:37 - 5:40
    that weight at time t,
    times 0.1.
  • 5:40 - 5:45
    We then take that mean square.
    We take its square root,
  • 5:45 - 5:52
    which is why it has the name RMS.
    And then we divide the gradient by that
  • 5:52 - 5:57
    RMS, and make an update proportional to
    that.
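
Putting those pieces together, here is a minimal sketch of the rmsprop update in NumPy, using the 0.9 and 0.1 averaging coefficients from the lecture; the learning rate of 0.001 and the small epsilon guarding against division by zero are assumed values.

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, eps=1e-8):
    """One rmsprop update: divide the gradient by a running average
    of its recent magnitude (lr and eps are assumed values)."""
    # MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad(t)^2
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    # Divide the gradient by the RMS and make an update proportional to that.
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```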
  • 5:58 - 6:02
    That makes the learning work much better.
    Notice that we're not adapting the
  • 6:02 - 6:05
    learning rate separately for each
    connection here.
  • 6:05 - 6:09
    This is a simpler method where we simply,
    for each connection, keep a running
  • 6:09 - 6:13
    average of the root mean square gradient
    and divide by that.
  • 6:13 - 6:18
    There are many further developments one
    could make to rmsprop. You could combine
  • 6:18 - 6:22
    it with standard momentum.
    My experiments so far suggest that doesn't
  • 6:22 - 6:26
    help as much as momentum normally does,
    and that needs more investigation.
  • 6:26 - 6:32
    You could combine rmsprop with
    Nesterov momentum where you first make the
  • 6:32 - 6:37
    jump and then make a correction.
    And Ilya Sutskever has tried that recently
  • 6:37 - 6:41
    and got good results.
    He's discovered that it works best if the
  • 6:41 - 6:46
    rms of the recent gradients is used to
    divide the correction term you make, rather
  • 6:46 - 6:51
    than the large jump you make in the
    direction of the accumulated corrections.
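
A rough sketch of that combination, assuming the Nesterov scheme from the earlier momentum video; the exact recipe Sutskever used isn't spelled out here, so the coefficients and the placement of the moving-average update below are guesses for illustration only.

```python
import numpy as np

def nesterov_rmsprop_update(w, velocity, mean_square, grad_fn,
                            lr=0.001, mu=0.9, eps=1e-8):
    """First make the jump in the direction of the accumulated velocity,
    then measure the gradient there and make a correction. Only the
    correction term is divided by the RMS of recent gradients, not the
    jump itself (per Sutskever's finding described above)."""
    grad = grad_fn(w + mu * velocity)            # gradient after the big jump
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    correction = lr * grad / (np.sqrt(mean_square) + eps)
    velocity = mu * velocity - correction        # jump plus RMS-scaled correction
    w = w + velocity
    return w, velocity, mean_square
```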
  • 6:51 - 6:56
    Obviously you could combine rmsprop with
    adaptive learning rates on each connection,
  • 6:56 - 7:01
    which would make it much more like rprop.
    That just needs a lot more investigation.
  • 7:01 - 7:04
    I just don't know at present how helpful
    that will be.
  • 7:04 - 7:08
    And then there are a bunch of other methods
    related to rmsprop that have a lot in
  • 7:08 - 7:12
    common with it.
    Yann LeCun's group has an interesting
  • 7:12 - 7:16
    paper called No More Pesky Learning Rates
    that came out this year.
  • 7:16 - 7:22
    And some of the terms in that looked like
    rmsprop, but it has many other terms.
  • 7:22 - 7:27
    I suspect, at present, that most of the
    advantage that comes from this complicated
  • 7:27 - 7:33
    method recommended by Yann LeCun's group
    comes from the fact that it's similar to
  • 7:33 - 7:36
    rmsprop.
    But I don't really know that.
  • 7:36 - 7:41
    So, a summary of the learning methods for
    neural networks goes like this.
  • 7:41 - 7:46
    If you've got a small data set, say 10,000
    cases or less,
  • 7:46 - 7:52
    or a big data set without much redundancy,
    you should consider using a full-batch
  • 7:52 - 7:56
    method.
    That is, full-batch methods adapted from the
  • 7:56 - 8:00
    optimization literature, like nonlinear
    conjugate gradient, L-BFGS, or
  • 8:00 - 8:04
    Levenberg-Marquardt.
    And one advantage of using those methods
  • 8:04 - 8:09
    is they typically come with a package.
    And when you report the results in your
  • 8:09 - 8:14
    paper you just have to say, I used this
    package and here's what it did.
  • 8:14 - 8:18
    You don't have to justify all sorts of
    little decisions.
  • 8:18 - 8:23
    Alternatively, you could use the adaptive
    learning rates I described in another
  • 8:23 - 8:28
    video, or rprop, which are both essentially
    full-batch methods, but they are methods
  • 8:28 - 8:33
    that were developed for neural networks.
    If you have a big, redundant data set, it's
  • 8:33 - 8:37
    essential to use mini-batches.
    It's a huge waste not to do that.
  • 8:37 - 8:41
    The first thing to try is just standard
    gradient descent with momentum.
  • 8:41 - 8:46
    You're going to have to choose a global
    learning rate, and you might want to write
  • 8:46 - 8:50
    a little loop to adapt that global
    learning rate based on whether the
  • 8:50 - 8:54
    gradient has changed sign.
    But to begin with, don't go for anything
  • 8:54 - 8:58
    as fancy as adapting individual learning
    rates for individual weights.
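
The "little loop" mentioned above might look something like this sketch; the adaptation factors of 1.02 and 0.5 and the sign-agreement test are assumed choices, not a prescription from the lecture.

```python
import numpy as np

def train_sgd_momentum(w, grad_fn, n_steps, lr=0.01, mu=0.9):
    """Mini-batch gradient descent with momentum and one global learning
    rate, nudged up while successive gradients keep agreeing in sign and
    cut sharply when the overall direction flips."""
    velocity = np.zeros_like(w)
    prev_grad = np.zeros_like(w)
    for t in range(n_steps):
        grad = grad_fn(w, t)                  # gradient on mini-batch t
        agreement = np.sum(grad * prev_grad)  # overall sign agreement
        if agreement > 0:
            lr *= 1.02                        # gently raise the global rate
        elif agreement < 0:
            lr *= 0.5                         # direction flipped: cut it
        velocity = mu * velocity - lr * grad  # standard momentum update
        w = w + velocity
        prev_grad = grad
    return w
```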
  • 8:58 - 9:03
    The next thing to try is rmsprop.
    That's very simple to implement if you do
  • 9:03 - 9:08
    it without momentum, and in my experiments
    so far, that seems to work as well as
  • 9:08 - 9:10
    gradient descent with momentum, or maybe
    better.
  • 9:12 - 9:17
    You can also consider all sorts of ways of
    improving rmsprop by adding momentum or
  • 9:17 - 9:22
    adaptive step sizes for each weight, but
    that's still basically uncharted
  • 9:22 - 9:26
    territory.
    Finally, you could find out whatever Yann
  • 9:26 - 9:30
    LeCun's latest recipe is and try that.
    He's probably the person who's tried the
  • 9:30 - 9:35
    most different ways of getting stochastic
    gradient descent to work well, and so it's
  • 9:35 - 9:41
    worth keeping up with whatever he's doing.
    One question you might ask is why there is
  • 9:41 - 9:45
    no simple recipe.
    We have been messing around with neural
  • 9:45 - 9:49
    nets, including deep neural nets, for more
    than 25 years now, and you would think
  • 9:49 - 9:53
    that we would have come up with an agreed way
    of doing the learning.
  • 9:53 - 9:57
    There are really two reasons I think why
    there isn't a simple recipe.
  • 9:58 - 10:02
    First, neural nets differ a lot.
    Very deep networks, especially ones that
  • 10:02 - 10:07
    have narrow bottlenecks in them, which
    I'll come to in later lectures, are very
  • 10:07 - 10:12
    hard things to optimize and they need
    methods that can be very sensitive to very
  • 10:12 - 10:15
    small gradients.
    Recurrent nets are another special case,
  • 10:15 - 10:20
    they're typically very hard to optimize,
    if you want them to notice things that
  • 10:20 - 10:24
    happened a long time in the past and
    change the weights based on these things
  • 10:24 - 10:29
    that happened a long time ago.
    Then there's wide shallow networks, which
  • 10:29 - 10:33
    are quite different in flavor and are used
    a lot in practice.
  • 10:33 - 10:38
    They often can be optimized with methods
    that are not very accurate,
  • 10:38 - 10:42
    because we stop the optimization early,
    before it starts overfitting.
  • 10:42 - 10:47
    So for these different kinds of networks,
    there are very different methods that are
  • 10:47 - 10:52
    probably appropriate.
    The other consideration is that tasks
  • 10:52 - 10:56
    differ a lot.
    Some tasks require very accurate weights.
  • 10:57 - 11:00
    Some tasks don't require weights to be
    very accurate at all.
  • 11:01 - 11:09
    Also, there are some tasks that have weird
    properties. For example, if your inputs are
  • 11:09 - 11:15
    words, rare words may occur in only one case
    in a hundred thousand.
  • 11:15 - 11:20
    That's a very, very different flavor from
    what happens if your inputs are pixels.
  • 11:20 - 11:25
    So, to summarize, we really don't have nice,
    clear-cut advice for how to train a neural
  • 11:25 - 11:28
    net.
    We have a bunch of rules of thumb. It's not
  • 11:28 - 11:34
    entirely satisfactory, but just think how
    much better neural nets will work once
  • 11:34 - 11:37
    we've got this sorted out, and they
    already work pretty well.