In this video, I'm first going to
introduce a method called rprop, which is used for full batch learning. It's like Robbie Jacobs' method, but not quite the same.
I'm then going to show how to extend RPROP
so that it works for mini-batches. This
gives you the advantages of rprop and it
also gives you the advantage of mini-batch
learning, which is essential for large,
redundant data sets.
The method that we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.
I'm now going to describe rprop which is
an interesting way of trying to deal with
the fact that gradients vary widely in
their magnitudes.
Some gradients can be tiny and others can
be huge.
And that makes it hard to choose a single
global learning rate.
If we're doing full batch learning, we can cope with these big variations in gradients by just using the sign of the gradient.
That makes all of the weight updates be
the same size.
For issues like escaping from plateaus with very small gradients, this is a great technique because even with tiny gradients we'll take quite big steps.
We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big.
Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient, you just look at the sign of the gradient. But you do look at the step size you decided on for that weight.
And that step size adapts over time, again without looking at the magnitude of the gradient.
So we increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive learning rates method, except that here we make a multiplicative increase.
If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we'll make that more powerful than the increase, so that step sizes can die down faster than they grow.
We need to limit the step sizes.
Mike Shuster's advice was to limit them
between 50 and a millionth.
I think it depends a lot on what problem
you're dealing with.
If for example you have a problem with
some tiny inputs, you might need very big
weights on those inputs for them to have
an effect.
I suspect that if you're not dealing with
that kind of problem, having an upper
limit on the weight changes that's much
less than 50 would be a good idea.
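To make that concrete, here is a minimal Python sketch of one rprop update for a vector of weights. The increase and decrease factors and the step-size limits are just illustrative values of the kind discussed above, not prescriptions.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step, inc=1.2, dec=0.5,
                 step_min=1e-6, step_max=50.0):
    """One full-batch rprop update for a vector of weights.

    Only the sign of each gradient is used. Every weight has its own step
    size, which grows when the last two gradient signs agree and shrinks
    (faster) when they disagree, clipped to [step_min, step_max].
    """
    agree = np.sign(grad) * np.sign(prev_grad)
    step = np.where(agree > 0, step * inc, step)   # signs agree: grow the step
    step = np.where(agree < 0, step * dec, step)   # signs disagree: shrink faster
    step = np.clip(step, step_min, step_max)       # limit the step sizes
    w = w - np.sign(grad) * step                   # move by the step, not by the gradient
    return w, step
```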
So one question is, why doesn't rprop work with mini-batches?
People have tried it and found it hard to get it to work.
You can get it to work with very big
mini-batches, where you use much more
conservative changes to the step sizes.
But it's difficult.
So the reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches.
So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so that the weight stays where it is.
Rprop won't give us that.
Rprop would increment the weight nine
times by whatever its current step size
is, and decrement it only once.
And that would make the weight get much
bigger.
We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.
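Here is a quick numerical check of that example. The learning rate and step size are arbitrary illustrative values, and the step size is held fixed because it adapts much more slowly than the mini-batches arrive; which direction counts as an increment just depends on the sign convention, the point is the large net drift.

```python
# Nine mini-batches with gradient +0.01 and one with gradient -0.09.
grads = [0.01] * 9 + [-0.09]

# Plain mini-batch gradient descent with a small learning rate:
# the gradients effectively average out, so the weight barely moves.
lr = 0.1
sgd_change = -lr * sum(grads)                                 # approximately 0.0

# rprop uses only the sign of each mini-batch gradient, so the weight
# takes nine full steps one way and only one step back.
step = 0.1
rprop_change = -step * sum(1 if g > 0 else -1 for g in grads)  # -0.8

print(sgd_change, rprop_change)   # roughly 0.0 versus -0.8
```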
So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that combines gradients in the right way?
That leads to a method which I'm calling rmsprop, and which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient, but also dividing by the magnitude of the gradient. And the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch.
So the idea is that we're going to force
the number we divide by to be pretty much
the same for nearby mini-batches. We do
that by keeping a moving average of the
squared gradient for each weight.
So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates; time increments by one each time we update the weights. The numbers I put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the value of the squared gradient for that weight at time t, times 0.1.
We then take that mean square and take its square root, which is why it has the name RMS. And then we divide the gradient by that RMS and make an update proportional to that.
That makes the learning work much better.
Notice that we're not adapting the
learning rate separately for each
connection here.
This is a simpler method where we simply, for each connection, keep a running average of the root mean square gradient and divide by that.
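Here is a minimal Python sketch of that update rule. The 0.9 and 0.1 are the example decay values from above; the learning rate and the small epsilon in the denominator are my own illustrative additions for numerical safety, not part of the description.

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update for a vector of weights.

    mean_square holds the moving average of the squared gradient for each
    weight: MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad**2.
    The gradient is divided by the square root of that moving average.
    """
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```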
There are many further developments one could make to rmsprop. You could combine it with standard momentum. My experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation.
You could combine rmsprop with
Nesterov momentum where you first make the
jump and then make a correction.
And Ilya Sutskever has tried that recently
and got good results.
He's discovered that it works best if the
rms of the recent gradients is used to
divide the correction term we make rather
than the large jump you make in the
direction of the accumulated corrections.
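As a rough sketch of how that combination might look, under my reading of the description above: first make the big jump along the accumulated velocity, then compute the gradient there and divide only the correction term by the RMS of recent gradients. The gradient function grad_fn, the learning rate, and the momentum and decay values are all assumptions for illustration, not Sutskever's actual settings.

```python
import numpy as np

def nesterov_rmsprop_step(w, velocity, mean_square, grad_fn,
                          lr=0.001, momentum=0.9, decay=0.9, eps=1e-8):
    """Hypothetical sketch: Nesterov momentum combined with rmsprop.

    First make the big jump in the direction of the accumulated velocity,
    then measure the gradient there and make a correction, dividing only
    the correction term by the RMS of the recent gradients.
    """
    lookahead = w + momentum * velocity                      # the big jump
    grad = grad_fn(lookahead)                                # gradient where we landed
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    correction = -lr * grad / (np.sqrt(mean_square) + eps)   # only this term is divided by the RMS
    velocity = momentum * velocity + correction              # accumulate for the next jump
    w = w + velocity
    return w, velocity, mean_square
```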
Obviously you could combine rmsprop with
adaptive learning rates on each connection
which would make it much more like rprop.
That just needs a lot more investigation.
I just don't know at present how helpful
that will be.
And then there is a bunch of other methods
related to rmsprop that have a lot in
common with it.
Yann LeCun's group has an interesting
paper called No More Pesky Learning Rates
that came out this year.
And some of the terms in that looked like
rmsprop, but it has many other terms.
I suspect, at present, that most of the
advantage that comes from this complicated
method recommended by Yann LeCun's group
comes from the fact that it's similar to
rmsprop.
But I don't really know that.
So, a summary of the learning methods for neural networks goes like this.
If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full batch method.
These could be full batch methods adapted from the optimization literature, like nonlinear conjugate gradient, LBFGS, or Levenberg-Marquardt.
And one advantage of using those methods
is they typically come with a package.
And when you report the results in your
paper you just have to say, I used this
package and here's what it did.
You don't have to justify all sorts of
little decisions.
Alternatively you could use the adaptive
learning rates I described in another
video or rprop, which are both essentially
full batch methods but they are methods
that were developed for neural networks.
If you have a big redundant data set it's
essential to use mini batches.
It's a huge waste not to do that.
The first thing to try is just standard
gradient descent with momentum.
You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign.
But to begin with, don't go for anything
as fancy as adapting individual learning
rates for individual weights.
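As an illustration of the kind of little loop I mean, here is a toy Python sketch; the adjustment factors and the dot-product test for whether the gradient has changed sign are just placeholder choices.

```python
import numpy as np

def adapt_global_lr(lr, grad, prev_grad, up=1.05, down=0.7):
    """Toy adjustment of a single global learning rate (1.05 and 0.7 are placeholders).

    If the new gradient points mostly the same way as the previous one,
    nudge the global learning rate up; if it has broadly changed sign,
    cut it back more sharply.
    """
    agreement = float(np.dot(grad.ravel(), prev_grad.ravel()))
    return lr * up if agreement > 0 else lr * down
```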
The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, it seems to work about as well as gradient descent with momentum would, and maybe better.
You can also consider all sorts of ways of
improving rmsprop by adding momentum or
adaptive step sizes for each weight, but
that's still basically uncharted
territory.
Finally, you could find out whatever Yann LeCun's latest recipe is and try that.
He's probably the person who's tried the
most different ways of getting stochastic
gradient descent to work well, and so it's
worth keeping up with whatever he's doing.
One question you might ask is, why is there no simple recipe?
We have been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think that we would have come up with an agreed way of doing the learning.
There's really two reasons I think why
there isn't a simple recipe.
First, neural nets differ a lot.
Very deep networks, especially ones that
have narrow bottlenecks in them, which
I'll come to in later lectures, are very
hard things to optimize and they need
methods that can be very sensitive to very
small gradients.
Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things that happened a long time ago.
Then there's wide shallow networks, which
are quite different in flavor and are used
a lot in practice.
They can often be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting.
So for these different kinds of networks, there are very different methods that are probably appropriate.
The other consideration is that tasks
differ a lot.
Some tasks require very accurate weights.
Some tasks don't require weights to be
very accurate at all.
Also, there are some tasks that have weird properties: for example, if your inputs are words, rare words may only occur in one case in a hundred thousand.
That's a very, very different flavor from
what happens if your inputs are pixels.
So to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better neural nets will work once we've got this sorted out; and they already work pretty well.