In this video, I'm first going to introduce a method called rprop, which is used for full-batch learning. It's like Robbie Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method we end up with, called rmsprop, is currently my favorite method as a sort of basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient. That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big.

Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change your weight, you don't look at the magnitude of the gradient; you just look at the sign of the gradient. But you do look at the step size you decided on for that weight, and that step size adapts over time, again without looking at the magnitude of the gradient. We increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive learning rate method, except that here we make a multiplicative increase. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we make the decrease more powerful than the increase, so that the step size can die down faster than it grows. We need to limit the step sizes. Mike Shuster's advice was to limit them between 50 and a millionth. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea.

So one question is, why doesn't rprop work with mini-batches? People have tried it and found it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so the weight stays where it is. Rprop won't give us that. Rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.
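Here is a minimal sketch of the full-batch rprop update just described. The increase factor of 1.2 and the step-size limits of 50 and a millionth come from the lecture; the decrease factor of 0.5 is an illustrative assumption (the lecture only says the decrease should be more powerful than the increase), and this simple variant skips the weight backtracking found in some rprop implementations.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch rprop update for an array of weights w.

    grad and prev_grad are the current and previous full-batch gradients;
    step holds the per-weight step sizes, which adapt over time.
    Note: dec=0.5 is an assumed value, not one given in the lecture.
    """
    agree = grad * prev_grad                      # positive where the last two gradient signs agree
    step = np.where(agree > 0, step * inc, step)  # signs agree: grow the step multiplicatively
    step = np.where(agree < 0, step * dec, step)  # signs disagree: shrink it, faster than we grow
    step = np.clip(step, step_min, step_max)      # keep step sizes between a millionth and 50
    w = w - np.sign(grad) * step                  # use only the sign of the gradient, not its magnitude
    return w, step
```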
So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that allows them to be combined in the right way? That leads to a method which I'm calling rmsprop, and which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient but also dividing by the magnitude of the gradient, and the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates: time increments by one each time we update the weights. The numbers I put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the value of the squared gradient for that weight at time t, times 0.1. We then take that mean square and take its square root, which is why it has the name RMS, and then we divide the gradient by that RMS and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method where we simply, for each connection, keep a running average of the root mean squared gradient and divide by that.

There are many further developments one could make to rmsprop. You could combine it with standard momentum. My experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction, and Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term rather than the large jump you make in the direction of the accumulated corrections. Obviously you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful it will be. And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called "No More Pesky Learning Rates" that came out this year. Some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage that comes from this complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that.
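Here is a minimal sketch of the rmsprop update described above. The 0.9 and 0.1 weights for the moving average are the example values from the lecture; the learning rate and the small epsilon added for numerical stability are my own assumptions, since the lecture doesn't specify them.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """One mini-batch rmsprop update for an array of weights w.

    mean_square is the per-weight moving average of the squared gradient:
    MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad(t)**2
    lr and eps are assumed values, not ones given in the lecture.
    """
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)  # divide the gradient by the RMS of recent gradients
    return w, mean_square
```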
So, a summary of the learning methods for neural networks goes like this. If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. You could use full-batch methods adapted from the optimization literature, like nonlinear conjugate gradient, L-BFGS, or Levenberg-Marquardt, and one advantage of using those methods is that they typically come with a package. When you report the results in your paper, you just have to say, "I used this package and here's what it did"; you don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods but were developed for neural networks.

If you have a big, redundant data set, it's essential to use mini-batches. It's a huge waste not to do that. The first thing to try is just standard gradient descent with momentum. You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign (there's a rough sketch of this at the end of the section). But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, it seems to work about as well as gradient descent with momentum, or possibly better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out whatever Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, and so it's worth keeping up with whatever he's doing.

One question you might ask is, why is there no simple recipe? We have been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe. First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard things to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They often can be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate.

The other consideration is that tasks differ a lot. Some tasks require very accurate weights; some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties: for example, if your inputs are words, rare words may only occur in one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels.

So, to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better neural nets will work once we've got this sorted out, and they already work pretty well.
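Here is the rough sketch referred to above: mini-batch gradient descent with momentum, plus a crude loop that nudges the single global learning rate depending on whether successive gradients mostly keep or flip their sign. The momentum value of 0.9, the 1.05 and 0.7 adjustment factors, and the summed-agreement test are illustrative assumptions, not values given in the lecture.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One mini-batch gradient descent step with momentum (momentum=0.9 is an assumed value)."""
    velocity = momentum * velocity - lr * grad   # accumulate a velocity for the weights
    return w + velocity, velocity

def adapt_global_lr(lr, grad, prev_grad, up=1.05, down=0.7):
    """Crudely adjust the single global learning rate.

    Grow it a little while successive mini-batch gradients mostly agree in
    direction, and shrink it harder when they mostly flip sign.
    The up/down factors are assumed values, not ones from the lecture.
    """
    agreement = float(np.sum(grad * prev_grad))
    return lr * up if agreement > 0 else lr * down
```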