In this video, I'm first going to introduce a method called rprop that is used for full-batch learning. It's like Robbie Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method that we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient. That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big.

rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient; you just look at the sign of the gradient. But you do look at the step size you decided on for that weight, and that step size adapts over time, again without looking at the magnitude of the gradient.

So we increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive weights method, except that we're going to do a multiplicative increase here. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we'll make that more powerful than the increase, so that we can die down faster than we grow. We also need to limit the step sizes. Mike Shuster's advice was to limit them to between a millionth and 50. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea.
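As a concrete illustration, here's a minimal sketch of the rprop step just described, in the simple variant without weight backtracking. The increase factor of 1.2 and the step-size limits of a millionth and 50 come from the description above; the decrease factor of 0.5 is my own assumption, since all that's said here is that the decrease should be more powerful than the increase.

```python
import numpy as np

def rprop_step(weights, grads, prev_grads, step_sizes,
               eta_plus=1.2,    # increase factor mentioned above
               eta_minus=0.5,   # decrease factor: assumed, only "more powerful" is specified
               step_min=1e-6,   # lower limit on step sizes
               step_max=50.0):  # upper limit on step sizes
    """One full-batch rprop step. All array arguments have the weight shape."""
    agreement = grads * prev_grads  # positive where the last two gradient signs agree
    # Grow the step size multiplicatively where the signs agree,
    # shrink it (faster) where they disagree, and keep it within the limits.
    step_sizes = np.where(agreement > 0, step_sizes * eta_plus, step_sizes)
    step_sizes = np.where(agreement < 0, step_sizes * eta_minus, step_sizes)
    step_sizes = np.clip(step_sizes, step_min, step_max)
    # Update each weight using only the sign of its gradient, not its magnitude.
    new_weights = weights - np.sign(grads) * step_sizes
    return new_weights, step_sizes
```

The caller would compute `grads` over the full training set, keep the previous full-batch gradient around as `prev_grads`, and carry `step_sizes` from one step to the next.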
So one question is, why doesn't rprop work with mini-batches? People have tried it, and find it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so that the weight stays where it is. rprop won't give us that. rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.

So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that allows them to be combined in the right way? That leads to a method which I'm calling rmsprop, and which you can consider to be a mini-batch version of rprop. rprop is equivalent to using the gradient, but also dividing by the magnitude of the gradient, and the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates; time increments by one each time we update the weights. The numbers I've put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the value of the squared gradient for that weight at time t, times 0.1.
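Written out, the moving average just described, with the example coefficients of 0.9 and 0.1, is

$$\mathrm{MeanSquare}(w,\,t) \;=\; 0.9\,\mathrm{MeanSquare}(w,\,t-1) \;+\; 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

where $\partial E/\partial w$ is the gradient of the error with respect to weight $w$.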
We then take that mean square and we take its square root, which is why it has the name RMS. And then we divide the gradient by that RMS, and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method, where for each connection we simply keep a running average of the root mean square gradient and divide by that.
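To make that concrete, here's a minimal per-weight sketch of the rmsprop step just described, without momentum: update the moving average of the squared gradient, then divide the gradient by its square root. The learning rate of 0.001 and the small epsilon added to avoid dividing by zero are my own assumptions, not values given in the lecture.

```python
import numpy as np

def rmsprop_step(weights, grads, mean_square,
                 learning_rate=0.001,  # assumed value, not from the lecture
                 decay=0.9,            # the 0.9 / 0.1 example coefficients from above
                 eps=1e-8):            # tiny constant to avoid division by zero (assumed)
    """One mini-batch rmsprop step without momentum."""
    # Moving average of the squared gradient for each weight.
    mean_square = decay * mean_square + (1.0 - decay) * grads ** 2
    # Divide the gradient by the root of that average (the RMS)
    # and make an update proportional to the result.
    new_weights = weights - learning_rate * grads / (np.sqrt(mean_square) + eps)
    return new_weights, mean_square
```

The caller carries `mean_square` (initialized, say, to ones or zeros) from one mini-batch to the next, which is what keeps the divisor roughly the same for nearby mini-batches.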
There are many further developments one could make to rmsprop. You could combine it with standard momentum; my experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction, and Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term rather than the large jump you make in the direction of the accumulated corrections. Obviously, you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful that will be. And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called No More Pesky Learning Rates that came out this year, and some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage that comes from the complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that.

So, a summary of the learning methods for neural networks goes like this. If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. There are full-batch methods adapted from the optimization literature, like non-linear conjugate gradient, L-BFGS, or Levenberg-Marquardt, and one advantage of using those methods is that they typically come with a package. When you report the results in your paper, you just have to say, I used this package and here's what it did; you don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods, but they are methods that were developed for neural networks.

If you have a big, redundant data set, it's essential to use mini-batches; it's a huge waste not to do that. The first thing to try is just standard gradient descent with momentum. You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign (there's a sketch of that kind of loop below). But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, that seems to work about as well as gradient descent with momentum, maybe better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out whatever Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, so it's worth keeping up with whatever he's doing.
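As for the little loop mentioned above, here's one possible shape for standard mini-batch gradient descent with momentum, plus a crude adaptation of the single global learning rate when the overall gradient direction flips. The adjustment factors and the dot-product test are my own illustrative choices, not a prescription from the lecture.

```python
import numpy as np

def sgd_momentum_step(weights, grads, prev_grads, velocity, learning_rate,
                      momentum=0.9,  # typical value, assumed
                      lr_up=1.05,    # gentle growth factor (illustrative)
                      lr_down=0.7):  # stronger shrink factor (illustrative)
    """One mini-batch step of gradient descent with momentum, adapting the single
    global learning rate. The caller keeps prev_grads from the previous mini-batch."""
    # If the overall gradient direction has flipped, we are probably overshooting:
    # shrink the global learning rate; otherwise grow it gently.
    if np.dot(grads.ravel(), prev_grads.ravel()) < 0:
        learning_rate *= lr_down
    else:
        learning_rate *= lr_up
    # Standard momentum update.
    velocity = momentum * velocity - learning_rate * grads
    new_weights = weights + velocity
    return new_weights, velocity, learning_rate
```

Note that this adapts only one global learning rate; it does not adapt individual learning rates for individual weights, in line with the advice above.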
One question you might ask is, why is there no simple recipe? We have been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe.

First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard things to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things that happened a long time ago. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They can often be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate.

The other consideration is that tasks differ a lot. Some tasks require very accurate weights, and some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties; for example, if your inputs are words, rare words may only occur in one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels.

So to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better neural nets will work once we've got this sorted out, and they already work pretty well.