In this video, I'm first going to introduce a method called rprop that is used for full-batch learning. It's like Robbie Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method that we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient. That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big.

rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient; you just look at the sign of the gradient. But you do look at the step size you decided on for that weight, and that step size adapts over time, again without looking at the magnitude of the gradient.

So we increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive weights method, except that we're going to do a multiplicative increase here. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we'll make that more powerful than the increase, so that we can die down faster than we grow. We also need to limit the step sizes. Mike Shuster's advice was to limit them to between a millionth and 50. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea.
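As a concrete illustration, here's a minimal sketch of the rprop step just described, in the simple variant without weight backtracking. The increase factor of 1.2 and the step-size limits of a millionth and 50 come from the description above; the decrease factor of 0.5 is my own assumption, since all that's said here is that the decrease should be more powerful than the increase.

```python
import numpy as np

def rprop_step(weights, grads, prev_grads, step_sizes,
               eta_plus=1.2,    # increase factor mentioned above
               eta_minus=0.5,   # decrease factor: assumed, only "more powerful" is specified
               step_min=1e-6,   # lower limit on step sizes
               step_max=50.0):  # upper limit on step sizes
    """One full-batch rprop step. All array arguments have the weight shape."""
    agreement = grads * prev_grads  # positive where the last two gradient signs agree
    # Grow the step size multiplicatively where the signs agree,
    # shrink it (faster) where they disagree, and keep it within the limits.
    step_sizes = np.where(agreement > 0, step_sizes * eta_plus, step_sizes)
    step_sizes = np.where(agreement < 0, step_sizes * eta_minus, step_sizes)
    step_sizes = np.clip(step_sizes, step_min, step_max)
    # Update each weight using only the sign of its gradient, not its magnitude.
    new_weights = weights - np.sign(grads) * step_sizes
    return new_weights, step_sizes
```

The caller would compute `grads` over the full training set, keep the previous full-batch gradient around as `prev_grads`, and carry `step_sizes` from one step to the next.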
So one question is, why doesn't rprop work with mini-batches? People have tried it, and find it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so that the weight stays where it is. rprop won't give us that. rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.

So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that allows them to be combined in the right way? That leads to a method which I'm calling rmsprop, and which you can consider to be a mini-batch version of rprop. rprop is equivalent to using the gradient, but also dividing by the magnitude of the gradient, and the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates; time increments by one each time we update the weights. The numbers I've put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the value of the squared gradient for that weight at time t, times 0.1.
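Written out, the moving average just described, with the example coefficients of 0.9 and 0.1, is

$$\mathrm{MeanSquare}(w,\,t) \;=\; 0.9\,\mathrm{MeanSquare}(w,\,t-1) \;+\; 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

where $\partial E/\partial w$ is the gradient of the error with respect to weight $w$.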
We then take that mean square and we take its square root, which is why it has the name RMS. And then we divide the gradient by that RMS, and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method, where for each connection we simply keep a running average of the root mean square gradient and divide by that.
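To make that concrete, here's a minimal per-weight sketch of the rmsprop step just described, without momentum: update the moving average of the squared gradient, then divide the gradient by its square root. The learning rate of 0.001 and the small epsilon added to avoid dividing by zero are my own assumptions, not values given in the lecture.

```python
import numpy as np

def rmsprop_step(weights, grads, mean_square,
                 learning_rate=0.001,  # assumed value, not from the lecture
                 decay=0.9,            # the 0.9 / 0.1 example coefficients from above
                 eps=1e-8):            # tiny constant to avoid division by zero (assumed)
    """One mini-batch rmsprop step without momentum."""
    # Moving average of the squared gradient for each weight.
    mean_square = decay * mean_square + (1.0 - decay) * grads ** 2
    # Divide the gradient by the root of that average (the RMS)
    # and make an update proportional to the result.
    new_weights = weights - learning_rate * grads / (np.sqrt(mean_square) + eps)
    return new_weights, mean_square
```

The caller carries `mean_square` (initialized, say, to ones or zeros) from one mini-batch to the next, which is what keeps the divisor roughly the same for nearby mini-batches.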
There are many further developments one could make to rmsprop. You could combine it with standard momentum; my experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction, and Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term rather than the large jump you make in the direction of the accumulated corrections. Obviously, you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful that will be. And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called No More Pesky Learning Rates that came out this year, and some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage that comes from the complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that.

So, a summary of the learning methods for neural networks goes like this. If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. There are full-batch methods adapted from the optimization literature, like non-linear conjugate gradient, L-BFGS, or Levenberg-Marquardt, and one advantage of using those methods is that they typically come with a package. When you report the results in your paper, you just have to say, I used this package and here's what it did; you don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods, but they are methods that were developed for neural networks.

If you have a big, redundant data set, it's essential to use mini-batches; it's a huge waste not to do that. The first thing to try is just standard gradient descent with momentum. You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign (there's a sketch of that kind of loop below). But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, that seems to work about as well as gradient descent with momentum, maybe better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out whatever Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, so it's worth keeping up with whatever he's doing.
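As for the little loop mentioned above, here's one possible shape for standard mini-batch gradient descent with momentum, plus a crude adaptation of the single global learning rate when the overall gradient direction flips. The adjustment factors and the dot-product test are my own illustrative choices, not a prescription from the lecture.

```python
import numpy as np

def sgd_momentum_step(weights, grads, prev_grads, velocity, learning_rate,
                      momentum=0.9,  # typical value, assumed
                      lr_up=1.05,    # gentle growth factor (illustrative)
                      lr_down=0.7):  # stronger shrink factor (illustrative)
    """One mini-batch step of gradient descent with momentum, adapting the single
    global learning rate. The caller keeps prev_grads from the previous mini-batch."""
    # If the overall gradient direction has flipped, we are probably overshooting:
    # shrink the global learning rate; otherwise grow it gently.
    if np.dot(grads.ravel(), prev_grads.ravel()) < 0:
        learning_rate *= lr_down
    else:
        learning_rate *= lr_up
    # Standard momentum update.
    velocity = momentum * velocity - learning_rate * grads
    new_weights = weights + velocity
    return new_weights, velocity, learning_rate
```

Note that this adapts only one global learning rate; it does not adapt individual learning rates for individual weights, in line with the advice above.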
One question you might ask is, why is there no simple recipe? We have been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe.

First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard things to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things that happened a long time ago. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They can often be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate.

The other consideration is that tasks differ a lot. Some tasks require very accurate weights, and some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties; for example, if your inputs are words, rare words may only occur in one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels.

So to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better neural nets will work once we've got this sorted out, and they already work pretty well.