
Rmsprop: Divide the gradient by a running average of its recent magnitude

  • 0:00 - 0:05
    In this video, I'm first going to
    introduce a method called rprop, which is
  • 0:05 - 0:10
    used for full-batch learning.
    It's like Robbie Jacobs' method, but not
  • 0:10 - 0:13
    quite the same.
    I'm then going to show how to extend rprop
  • 0:13 - 0:19
    so that it works for mini-batches. This
    gives you the advantages of rprop, and it
  • 0:19 - 0:25
    also gives you the advantage of mini-batch
    learning, which is essential for large,
  • 0:25 - 0:29
    redundant data sets.
    The method that we end up with, called
  • 0:29 - 0:34
    rmsprop, is currently my favorite method as a
    sort of basic method for learning the
  • 0:34 - 0:39
    weights in a large neural network with a
    large redundant data set.
  • 0:39 - 0:45
    I'm now going to describe rprop, which is
    an interesting way of trying to deal with
  • 0:45 - 0:49
    the fact that gradients vary widely in
    their magnitudes.
  • 0:51 - 0:54
    Some gradients can be tiny and others can
    be huge.
  • 0:54 - 0:58
    And that makes it hard to choose a single
    global learning rate.
  • 0:59 - 1:04
    If we're doing full batch learning, we can
    cope with these big variations in
  • 1:04 - 1:07
    gradients, by just using the sign of the
    gradient.
  • 1:08 - 1:12
    That makes all of the weight updates be
    the same size.
  • 1:13 - 1:19
    For issues like escaping from plateaus
    with very small gradients, this is a great
  • 1:19 - 1:23
    technique, because even with tiny gradients
    we'll take quite big steps.
  • 1:23 - 1:28
    We couldn't achieve that by just turning
    up the learning rate because then the
  • 1:28 - 1:33
    steps we took for weights that had big
    gradients would be much too big.
  • 1:33 - 1:38
    Rprop combines the idea of just using the
    sign of the gradient with the idea of
  • 1:38 - 1:42
    making the step size
    depend on which weight it is.
  • 1:42 - 1:47
    So to decide how much to change your
    weight, you don't look at the magnitude of
  • 1:47 - 1:50
    the gradient, you just look at the sign of
    the gradient.
  • 1:50 - 1:55
    But you do look at the step size you've
    decided on for that weight.
  • 1:55 - 2:00
    And that step size adapts over time,
    again without looking at the magnitude of
  • 2:00 - 2:05
    the gradient.
    So we increase the step size for a weight
  • 2:05 - 2:08
    multiplicatively, for example by a factor
    of 1.2,
  • 2:08 - 2:11
    if the signs of the last two gradients
    agree.
  • 2:11 - 2:16
    This is like Robbie Jacobs' adaptive
    weights method, except that here we're
  • 2:16 - 2:21
    going to do a multiplicative increase.
    If the signs of the last two gradients
  • 2:21 - 2:26
    disagree, we decrease the step size
    multiplicatively, and in this case, we'll
  • 2:26 - 2:31
    make that more powerful than the increase,
    so that step sizes can die down faster than they
  • 2:31 - 2:34
    grow.
    We need to limit the step sizes.
  • 2:34 - 2:38
    Mike Shuster's advice was to limit them
    between 50 and a millionth.
  • 2:38 - 2:42
    I think it depends a lot on what problem
    you're dealing with.
  • 2:42 - 2:47
    If for example you have a problem with
    some tiny inputs, you might need very big
  • 2:47 - 2:51
    weights on those inputs for them to have
    an effect.
  • 2:51 - 2:56
    I suspect that if you're not dealing with
    that kind of problem, having an upper
  • 2:56 - 3:00
    limit on the weight changes that's much
    less than 50 would be a good idea.
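
As a minimal sketch (not from the lecture itself), the rprop update just described might look like this in NumPy. The growth factor of 1.2 and the limits of 50 and a millionth come from the lecture; the shrink factor of 0.5 is an assumed value, since the lecture only says the decrease should be more powerful than the increase.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 grow=1.2, shrink=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch rprop update for a weight vector w.

    Only the sign of the gradient is used; each weight has its own
    step size, adapted multiplicatively. shrink=0.5 is an assumed
    value (the lecture only says the decrease is more powerful).
    """
    agree = grad * prev_grad                    # >0 where last two signs agree
    step = np.where(agree > 0, step * grow,     # signs agree: grow the step
           np.where(agree < 0, step * shrink,   # signs disagree: shrink faster
                    step))                      # first step or zero gradient
    step = np.clip(step, step_min, step_max)    # limit the step sizes
    w = w - np.sign(grad) * step                # move by the step, sign only
    return w, step, grad                        # returned grad is next prev_grad
```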
  • 3:00 - 3:04
    So one question is, why doesn't rprop work
    with mini-batches?
  • 3:04 - 3:07
    People have tried it and found it hard to
    get it to work.
  • 3:07 - 3:11
    You can get it to work with very big
    mini-batches, where you use much more
  • 3:11 - 3:16
    conservative changes to the step sizes.
    But it's difficult.
  • 3:16 - 3:21
    So the reason it doesn't work is it
    violates the central idea behind
  • 3:21 - 3:26
    stochastic gradient descent,
    which is that when we have a small
  • 3:26 - 3:32
    learning rate, the gradient gets
    effectively averaged over successive mini-
  • 3:32 - 3:37
    batches.
    So consider a weight that gets a gradient
  • 3:37 - 3:44
    of +0.01 on nine mini-batches, and then a
    gradient of -0.09 on the tenth mini-batch.
  • 3:44 - 3:49
    What we'd like is for those gradients to
    roughly average out, so the weight will
  • 3:49 - 3:52
    stay where it is.
    Rprop won't give us that.
  • 3:52 - 3:56
    Rprop would increment the weight nine
    times by whatever its current step size
  • 3:56 - 4:01
    is, and decrement it only once.
    And that would make the weight get much
  • 4:01 - 4:04
    bigger.
    We're assuming here that the step sizes
  • 4:04 - 4:09
    adapt much more slowly than the time scale
    of these mini-batches.
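
To make that arithmetic concrete, here is the example above as a small Python check; the learning rate and the fixed step size of 0.1 are illustrative values.

```python
# Gradients for one weight on ten successive mini-batches (example above).
grads = [0.01] * 9 + [-0.09]

# With a small learning rate, SGD effectively averages the gradients:
lr = 0.1
print(sum(-lr * g for g in grads))   # ~0.0 (up to float rounding):
                                     # the moves cancel, the weight stays put

# rprop uses only the sign, moving by a per-weight step size that we assume
# adapts far more slowly than ten mini-batches (held fixed at 0.1 here):
step = 0.1
print(sum(-step * (1 if g > 0 else -1) for g in grads))
                                     # -0.8: eight full steps in one direction
```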
  • 4:09 - 4:15
    So the question is: can we combine the
    robustness that you get from rprop by just
  • 4:15 - 4:20
    using the sign of the gradient,
    the efficiency that you get from mini-
  • 4:20 - 4:23
    batches,
    and the averaging of gradients over
  • 4:23 - 4:29
    mini-batches that allows gradients
    to be combined in the right way?
  • 4:29 - 4:32
    That leads to a method which I'm calling
    rmsprop.
  • 4:32 - 4:38
    And you can consider it to be a mini-batch
    version of rprop. rprop is equivalent to
  • 4:38 - 4:42
    using the gradient,
    but also dividing by the magnitude of the
  • 4:42 - 4:45
    gradient.
    And the reason it has problems with
  • 4:45 - 4:51
    mini-batches is that we divide the
    gradient by a different magnitude for each
  • 4:51 - 4:55
    mini-batch.
    So the idea is that we're going to force
  • 4:55 - 5:01
    the number we divide by to be pretty much
    the same for nearby mini-batches. We do
  • 5:01 - 5:06
    that by keeping a moving average of the
    squared gradient for each weight.
  • 5:06 - 5:11
    So MeanSquare(w, t) means this moving
    average for weight w at time t,
  • 5:11 - 5:14
    where time is an index over weight
    updates.
  • 5:14 - 5:21
    Time increments by one each time we update
    the weights. The numbers I put in of 0.9
  • 5:21 - 5:26
    and 0.1 for computing the moving average are
    just examples, but they're reasonably
  • 5:26 - 5:31
    sensible examples.
    So the mean square is the previous mean
  • 5:31 - 5:37
    square times 0.9,
    plus the value of the squared gradient for
  • 5:37 - 5:40
    that weight at time t,
    times 0.1.
  • 5:40 - 5:45
    We then take that mean square.
    We take its square root,
  • 5:45 - 5:52
    which is why it has the name RMS.
    And then we divide the gradient by that
  • 5:52 - 5:57
    RMS, and make an update proportional to
    that.
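
Putting those pieces together, here is a minimal sketch of the rmsprop update in NumPy, using the 0.9 and 0.1 averaging coefficients from the lecture; the learning rate of 0.001 and the small epsilon guarding against division by zero are assumed values.

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, eps=1e-8):
    """One rmsprop update: divide the gradient by a running average
    of its recent magnitude (lr and eps are assumed values)."""
    # MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad(t)^2
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    # Divide the gradient by the RMS and make an update proportional to that.
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```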
  • 5:58 - 6:02
    That makes the learning work much better.
    Notice that we're not adapting the
  • 6:02 - 6:05
    learning rate separately for each
    connection here.
  • 6:05 - 6:09
    This is a simpler method where we simply,
    for each connection, keep a running
  • 6:09 - 6:13
    average of the root mean square gradient
    and divide by that.
  • 6:13 - 6:18
    There are many further developments one
    could make to rmsprop. You could combine
  • 6:18 - 6:22
    it with standard momentum.
    My experiments so far suggest that doesn't
  • 6:22 - 6:26
    help as much as momentum normally does,
    and that needs more investigation.
  • 6:26 - 6:32
    You could combine rmsprop with
    Nesterov momentum where you first make the
  • 6:32 - 6:37
    jump and then make a correction.
    And Ilya Sutskever has tried that recently
  • 6:37 - 6:41
    and got good results.
    He's discovered that it works best if the
  • 6:41 - 6:46
    rms of the recent gradients is used to
    divide the correction term you make, rather
  • 6:46 - 6:51
    than the large jump you make in the
    direction of the accumulated corrections.
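
A rough sketch of that combination, assuming the Nesterov scheme from the earlier momentum video; the exact recipe Sutskever used isn't spelled out here, so the coefficients and the placement of the moving-average update below are guesses for illustration only.

```python
import numpy as np

def nesterov_rmsprop_update(w, velocity, mean_square, grad_fn,
                            lr=0.001, mu=0.9, eps=1e-8):
    """First make the jump in the direction of the accumulated velocity,
    then measure the gradient there and make a correction. Only the
    correction term is divided by the RMS of recent gradients, not the
    jump itself (per Sutskever's finding described above)."""
    grad = grad_fn(w + mu * velocity)            # gradient after the big jump
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    correction = lr * grad / (np.sqrt(mean_square) + eps)
    velocity = mu * velocity - correction        # jump plus RMS-scaled correction
    w = w + velocity
    return w, velocity, mean_square
```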
  • 6:51 - 6:56
    Obviously you could combine rmsprop with
    adaptive learning rates on each connection,
  • 6:56 - 7:01
    which would make it much more like rprop.
    That just needs a lot more investigation.
  • 7:01 - 7:04
    I just don't know at present how helpful
    that will be.
  • 7:04 - 7:08
    And then there are a bunch of other methods
    related to rmsprop that have a lot in
  • 7:08 - 7:12
    common with it.
    Yann LeCun's group has an interesting
  • 7:12 - 7:16
    paper called No More Pesky Learning Rates
    that came out this year.
  • 7:16 - 7:22
    And some of the terms in that looked like
    rmsprop, but it has many other terms.
  • 7:22 - 7:27
    I suspect, at present, that most of the
    advantage that comes from this complicated
  • 7:27 - 7:33
    method recommended by Yann LeCun's group
    comes from the fact that it's similar to
  • 7:33 - 7:36
    rmsprop.
    But I don't really know that.
  • 7:36 - 7:41
    So, a summary of the learning methods for
    neural networks goes like this.
  • 7:41 - 7:46
    If you've got a small data set, say 10,000
    cases or less,
  • 7:46 - 7:52
    or a big data set without much redundancy,
    you should consider using a full-batch
  • 7:52 - 7:56
    method.
    That is, full-batch methods adapted from the
  • 7:56 - 8:00
    optimization literature, like nonlinear
    conjugate gradient, L-BFGS, or
  • 8:00 - 8:04
    Levenberg-Marquardt.
    And one advantage of using those methods
  • 8:04 - 8:09
    is they typically come with a package.
    And when you report the results in your
  • 8:09 - 8:14
    paper you just have to say, I used this
    package and here's what it did.
  • 8:14 - 8:18
    You don't have to justify all sorts of
    little decisions.
  • 8:18 - 8:23
    Alternatively, you could use the adaptive
    learning rates I described in another
  • 8:23 - 8:28
    video, or rprop, which are both essentially
    full-batch methods, but they are methods
  • 8:28 - 8:33
    that were developed for neural networks.
    If you have a big, redundant data set, it's
  • 8:33 - 8:37
    essential to use mini-batches.
    It's a huge waste not to do that.
  • 8:37 - 8:41
    The first thing to try is just standard
    gradient descent with momentum.
  • 8:41 - 8:46
    You're going to have to choose a global
    learning rate, and you might want to write
  • 8:46 - 8:50
    a little loop to adapt that global
    learning rate based on whether the
  • 8:50 - 8:54
    gradient has changed sign.
    But to begin with, don't go for anything
  • 8:54 - 8:58
    as fancy as adapting individual learning
    rates for individual weights.
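
The "little loop" mentioned above might look something like this sketch; the adaptation factors of 1.02 and 0.5 and the sign-agreement test are assumed choices, not a prescription from the lecture.

```python
import numpy as np

def train_sgd_momentum(w, grad_fn, n_steps, lr=0.01, mu=0.9):
    """Mini-batch gradient descent with momentum and one global learning
    rate, nudged up while successive gradients keep agreeing in sign and
    cut sharply when the overall direction flips."""
    velocity = np.zeros_like(w)
    prev_grad = np.zeros_like(w)
    for t in range(n_steps):
        grad = grad_fn(w, t)                  # gradient on mini-batch t
        agreement = np.sum(grad * prev_grad)  # overall sign agreement
        if agreement > 0:
            lr *= 1.02                        # gently raise the global rate
        elif agreement < 0:
            lr *= 0.5                         # direction flipped: cut it
        velocity = mu * velocity - lr * grad  # standard momentum update
        w = w + velocity
        prev_grad = grad
    return w
```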
  • 8:58 - 9:03
    The next thing to try is rmsprop.
    That's very simple to implement if you do
  • 9:03 - 9:08
    it without momentum, and in my experiments
    so far, that seems to work as well as
  • 9:08 - 9:10
    gradient descent with momentum, or maybe
    better.
  • 9:12 - 9:17
    You can also consider all sorts of ways of
    improving rmsprop by adding momentum or
  • 9:17 - 9:22
    adaptive step sizes for each weight, but
    that's still basically uncharted
  • 9:22 - 9:26
    territory.
    Finally, you could find out whatever Yann
  • 9:26 - 9:30
    LeCun's latest recipe is and try that.
    He's probably the person who's tried the
  • 9:30 - 9:35
    most different ways of getting stochastic
    gradient descent to work well, and so it's
  • 9:35 - 9:41
    worth keeping up with whatever he's doing.
    One question you might ask is why there is
  • 9:41 - 9:45
    no simple recipe.
    We have been messing around with neural
  • 9:45 - 9:49
    nets, including deep neural nets, for more
    than 25 years now, and you would think
  • 9:49 - 9:53
    that we would have come up with an agreed way
    of doing the learning.
  • 9:53 - 9:57
    There are really two reasons I think why
    there isn't a simple recipe.
  • 9:58 - 10:02
    First, neural nets differ a lot.
    Very deep networks, especially ones that
  • 10:02 - 10:07
    have narrow bottlenecks in them, which
    I'll come to in later lectures, are very
  • 10:07 - 10:12
    hard things to optimize and they need
    methods that can be very sensitive to very
  • 10:12 - 10:15
    small gradients.
    Recurrent nets are another special case,
  • 10:15 - 10:20
    they're typically very hard to optimize,
    if you want them to notice things that
  • 10:20 - 10:24
    happened a long time in the past and
    change the weights based on these things
  • 10:24 - 10:29
    that happened a long time ago.
    Then there's wide shallow networks, which
  • 10:29 - 10:33
    are quite different in flavor and are used
    a lot in practice.
  • 10:33 - 10:38
    They often can be optimized with methods
    that are not very accurate,
  • 10:38 - 10:42
    because we stop the optimization early,
    before it starts overfitting.
  • 10:42 - 10:47
    So for these different kinds of networks,
    there are very different methods that are
  • 10:47 - 10:52
    probably appropriate.
    The other consideration is that tasks
  • 10:52 - 10:56
    differ a lot.
    Some tasks require very accurate weights.
  • 10:57 - 11:00
    Some tasks don't require weights to be
    very accurate at all.
  • 11:01 - 11:09
    Also, there are some tasks that have weird
    properties. For example, if your inputs are
  • 11:09 - 11:15
    words, rare words may occur in only one case
    in a hundred thousand.
  • 11:15 - 11:20
    That's a very, very different flavor from
    what happens if your inputs are pixels.
  • 11:20 - 11:25
    So, to summarize, we really don't have nice,
    clear-cut advice for how to train a neural
  • 11:25 - 11:28
    net.
    We have a bunch of rules of thumb. It's not
  • 11:28 - 11:34
    entirely satisfactory, but just think how
    much better neural nets will work once
  • 11:34 - 11:37
    we've got this sorted out, and they
    already work pretty well.