-
In this video, I'm first going to
introduce a method called rprop, that is
-
used for full batch learning.
It's like Robbie Jacobs' method, but not
quite the same.
I'm then going to show how to extend rprop
-
so that it works for mini-batches. This
gives you the advantages of rprop and it
-
also gives you the advantage of mini-batch
learning, which is essential for large,
-
redundant data sets.
The method that we end up with, called
rmsprop, is currently my favorite method as a
sort of basic method for learning the
-
weights in a large neural network with a
large redundant data set.
-
I'm now going to describe rprop which is
an interesting way of trying to deal with
-
the fact that gradients vary widely in
their magnitudes.
-
Some gradients can be tiny and others can
be huge.
-
And that makes it hard to choose a single
global learning rate.
-
If we're doing full batch learning, we can
cope with these big variations in
gradients by just using the sign of the
gradient.
-
That makes all of the weight updates be
the same size.
-
For issues like escaping from plateaus
with very small gradients, this is a great
technique, because even with tiny gradients
we'll take quite big steps.
We couldn't achieve that by just turning
up the learning rate, because then the
steps we took for weights that had big
gradients would be much too big.
-
Rprop combines the idea of just using the
sign of the gradient with the idea of
making the step size depend on which
weight it is.
-
So to decide how much to change your
weight, you don't look at the magnitude of
-
the gradient, you just look at the sign of
the gradient.
-
But you do look at the step size you've
decided on for that weight.
And that step size adapts over time,
again without looking at the magnitude of
the gradient.
So we increase the step size for a weight
multiplicatively, for example by a factor
of 1.2, if the signs of the last two
gradients agree.
This is like Robbie Jacobs' adaptive
weights method, except that we're going to
do a multiplicative increase here.
If the signs of the last two gradients
-
disagree, we decrease the step size
multiplicatively, and in this case, we'll
-
make that more powerful than the increase,
so that we can die down faster than we
-
grow.
We need to limit the step sizes.
-
Mike Shuster's advice was to keep them
between a millionth and 50.
-
I think it depends a lot on what problem
you're dealing with.
-
If for example you have a problem with
some tiny inputs, you might need very big
-
weights on those inputs for them to have
an effect.
-
I suspect that if you're not dealing with
that kind of problem, having an upper
-
limit on the weight changes that's much
less than 50 would be a good idea.
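As a rough sketch of the rprop rule just described, the Python below updates each weight using only the sign of its gradient and a per-weight step size. The function name `rprop_step`, the list-based layout, and the decrease factor of 0.5 are assumptions made for illustration (the lecture only says the decrease should be stronger than the 1.2 increase). The limits of a millionth and 50 are the ones quoted above. Some published rprop variants also skip or undo the weight change when the sign flips; this sketch does not.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, step_sizes,
               inc=1.2, dec=0.5, min_step=1e-6, max_step=50.0):
    """One full-batch rprop update over parallel lists of floats.

    dec=0.5 is an assumed value; inc, min_step and max_step follow the lecture.
    """
    new_weights, new_steps = [], []
    for w, g, g_prev, step in zip(weights, grads, prev_grads, step_sizes):
        agreement = g * g_prev
        if agreement > 0:
            # Last two gradients agree in sign: grow the step size.
            step = min(step * inc, max_step)
        elif agreement < 0:
            # Signs disagree: shrink the step size more aggressively.
            step = max(step * dec, min_step)
        # The gradient's magnitude is ignored; only its sign sets the direction.
        new_weights.append(w - step * sign(g))
        new_steps.append(step)
    return new_weights, new_steps


# A tiny gradient and a huge gradient both move by their own step size.
new_weights, new_steps = rprop_step(
    weights=[0.0, 0.0],
    grads=[1e-4, -3.0],
    prev_grads=[2e-4, 5.0],
    step_sizes=[0.1, 0.1],
)
```

Here the first weight's step size grows to about 0.12 because its gradients agree in sign, and the second weight's shrinks to about 0.05 because they disagree, regardless of how large either gradient is.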
-
So one question is, why doesn't rprop work
with mini-batches?
People have tried it, and found it hard to
get it to work.
-
You can get it to work with very big
mini-batches, where you use much more
-
conservative changes to the step sizes.
But it's difficult.
-
So the reason it doesn't work is that it
violates the central idea behind
stochastic gradient descent,
which is that when we have a small
learning rate, the gradient gets
effectively averaged over successive
mini-batches.
So consider a weight that gets a gradient
of +0.01 on nine mini-batches, and then a
gradient of -0.09 on the tenth mini-batch.
-
What we'd like is for those gradients to
roughly average out so the weight will
stay where it is.
Rprop won't give us that.
-
Rprop would increment the weight nine
times by whatever its current step size
-
is, and decrement it only once.
And that would make the weight get much
-
bigger.
We're assuming here that the step sizes
-
adapt much slower than the time scale of
these mini batches.
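The arithmetic behind this example can be checked with a short Python sketch. The learning rate of 0.1 and the fixed rprop step size of 0.001 are assumed values chosen only for illustration. Under the usual descent convention the weight moves opposite to the sign of the gradient, so the drift below comes out negative; the point is its size (eight step sizes), not its direction.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Gradients from the example: +0.01 on nine mini-batches, then -0.09 on the tenth.
grads = [0.01] * 9 + [-0.09]

learning_rate = 0.1   # assumed small global learning rate for plain SGD
step_size = 0.001     # assumed rprop step size, held fixed over these ten updates

# Plain SGD: the updates are proportional to the gradients, so they cancel.
sgd_change = sum(-learning_rate * g for g in grads)

# Sign-only updates: nine steps one way, one step back, a net drift of eight steps.
rprop_change = sum(-step_size * sign(g) for g in grads)

print(f"SGD net change:   {sgd_change:+.6f}")
print(f"rprop net change: {rprop_change:+.6f}")
```

With these assumed values the SGD change is zero up to floating-point rounding, while the sign-only rule moves the weight by eight step sizes.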
-
So the question is, can we combine the
robustness that you get from rprop by just
using the sign of the gradient with the
efficiency that you get from mini-batches,
and with the averaging of gradients over
mini-batches that allows them to combine
gradients in the right way?
-
That leads to a method which I'm calling
rmsprop, and which you can consider to be
a mini-batch version of rprop.
Rprop is equivalent to using the gradient
but also dividing by the magnitude of the
gradient.
And the reason it has problems with
-
mini-batches is that we divide the
gradient by a different magnitude for each
-
mini batch.
So the idea is that we're going to force
-
the number we divide by to be pretty much
the same for nearby mini-batches. We do
-
that by keeping a moving average of the
squared gradient for each weight.
-
So MeanSquare(w, t) means this moving
average for weight w at time t,
where time is an index of weight
updates.
Time increments by one each time we update
the weights. The numbers I put in of 0.9
and 0.1 for computing the moving average
are just examples, but they're reasonably
sensible examples.
So the mean square is the previous mean
square times 0.9,
plus the value of the squared gradient for
that weight at time t,
times 0.1.
-
We then take that mean square and take
its square root,
which is why it has the name RMS.
And then we divide the gradient by that
RMS, and make an update proportional to
that.
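The update just described can be sketched in a few lines of Python. The 0.9 and 0.1 weights come from the lecture. The function name `rmsprop_step`, the learning rate of 0.001, the small constant `eps` that guards against division by zero, and starting the moving average at zero are all assumptions made for this illustration.

```python
import math


def rmsprop_step(weights, grads, mean_squares,
                 learning_rate=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update over parallel lists of floats.

    learning_rate and eps are assumed values, not taken from the lecture.
    """
    new_weights, new_mean_squares = [], []
    for w, g, ms in zip(weights, grads, mean_squares):
        # Moving average of the squared gradient: 0.9 * previous + 0.1 * g^2.
        ms = decay * ms + (1.0 - decay) * g * g
        # Divide the gradient by the root of the mean square (the RMS).
        new_weights.append(w - learning_rate * g / (math.sqrt(ms) + eps))
        new_mean_squares.append(ms)
    return new_weights, new_mean_squares


# A tiny gradient and a huge gradient, starting from a zero moving average.
new_weights, new_mean_squares = rmsprop_step(
    weights=[0.0, 0.0],
    grads=[0.001, 100.0],
    mean_squares=[0.0, 0.0],
)
```

In this usage example the two weights move by almost the same amount even though their gradients differ by a factor of 100,000, which is the rprop-like behavior. Because the divisor is a moving average, it changes only slowly from one mini-batch to the next.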
-
That makes the learning work much better.
Notice that we're not adapting the
-
learning rate separately for each
connection here.
-
This is a simpler method where we simply,
for each connection, keep a running
average of the root mean square gradient
and divide by that.
-
There are many further developments one
could make for rmsprop. You could combine
it with standard momentum.
My experiments so far suggest that doesn't
help as much as momentum normally does,
and that needs more investigation.
You could combine rmsprop with
Nesterov momentum where you first make the
-
jump and then make a correction.
And Ilya Sutskever has tried that recently
-
and got good results.
He's discovered that it works best if the
RMS of the recent gradients is used to
divide the correction term rather
than the large jump you make in the
direction of the accumulated corrections.
-
Obviously you could combine rmsprop with
adaptive learning rates on each connection
-
which would make it much more like rprop.
That just needs a lot more investigation.
-
I just don't know at present how helpful
that will be.
-
And then there are a bunch of other methods
related to rmsprop that have a lot in
common with it.
Yann LeCun's group has an interesting
paper called "No More Pesky Learning Rates"
that came out this year.
And some of the terms in that look like
rmsprop, but it has many other terms.
-
I suspect, at present, that most of the
advantage that comes from this complicated
-
method recommended by Yann LeCun's group
comes from the fact that it's similar to
-
rmsprop.
But I don't really know that.
-
So, a summary of the learning methods for
neural networks, goes like this.
-
If you've got a small data set, say 10,000
cases or less,
or a big data set without much redundancy,
you should consider using a full batch
method.
This includes full batch methods adapted
from the optimization literature, like
non-linear conjugate gradient, L-BFGS, or
Levenberg-Marquardt.
And one advantage of using those methods
-
is they typically come with a package.
And when you report the results in your
-
paper you just have to say, I used this
package and here's what it did.
-
You don't have to justify all sorts of
little decisions.
-
Alternatively you could use the adaptive
learning rates I described in another
-
video or rprop, which are both essentially
full batch methods but they are methods
-
that were developed for neural networks.
If you have a big redundant data set it's
-
essential to use mini batches.
It's a huge waste not to do that.
-
The first thing to try is just standard
gradient descent with momentum.
-
You're going to have to choose a global
learning rate, and you might want to write
-
a little loop to adapt that global
learning rate based on whether the
gradient has changed sign.
But to begin with, don't go for anything
-
as fancy as adapting individual learning
rates for individual weights.
-
The next thing to try is rmsprop.
That's very simple to implement if you do
it without momentum, and in my experiments
so far, that seems to work as well as
gradient descent with momentum, maybe
better.
-
You can also consider all sorts of ways of
improving rmsprop by adding momentum or
-
adaptive step sizes for each weight, but
that's still basically uncharted
-
territory.
Finally, you could find out whatever Yann
LeCun's latest recipe is and try that.
He's probably the person who's tried the
-
most different ways of getting stochastic
gradient descent to work well, and so it's
-
worth keeping up with whatever he's doing.
One question you might ask is, why is there
no simple recipe?
We have been messing around with neural
nets, including deep neural nets, for more
than 25 years now, and you would think
that we would have come up with an agreed
way of doing the learning.
There are really two reasons, I think, why
there isn't a simple recipe.
-
First, neural nets differ a lot.
Very deep networks, especially ones that
-
have narrow bottlenecks in them, which
I'll come to in later lectures, are very
-
hard things to optimize and they need
methods that can be very sensitive to very
-
small gradients.
Recurrent nets are another special case.
They're typically very hard to optimize
if you want them to notice things that
happened a long time in the past and
change the weights based on those things.
Then there are wide shallow networks, which
are quite different in flavor and are used
a lot in practice.
They can often be optimized with methods
that are not very accurate,
because we stop the optimization early,
before it starts overfitting.
So for these different kinds of networks,
there are very different methods that are
probably appropriate.
The other consideration is that tasks
-
differ a lot.
Some tasks require very accurate weights.
-
Some tasks don't require weights to be
very accurate at all.
-
Also there are some tasks that have weird
properties. For example, if your inputs
are words, rare words may only occur in
one case in a hundred thousand.
-
That's a very, very different flavor from
what happens if your inputs are pixels.
-
So to summarize, we really don't have nice
clear-cut advice for how to train a neural
net.
We have a bunch of rules of thumb. It's not
entirely satisfactory, but just think how
much better neural networks will work once
we've got this sorted out, and they
already work pretty well.