In this video, I'm first going to introduce a method called rprop, which is used for full-batch learning. It's like Robbie Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method we end up with, called rmsprop, is currently my favorite method as a sort of basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient. That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big.

Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change your weight, you don't look at the magnitude of the gradient; you just look at the sign of the gradient. But you do look at the step size you decided on for that weight, and that step size adapts over time, again without looking at the magnitude of the gradient. We increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive learning rate method, except that here we make a multiplicative increase. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we make the decrease more powerful than the increase, so that the step size can die down faster than it grows. We need to limit the step sizes. Mike Shuster's advice was to limit them between 50 and a millionth. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea.

So one question is, why doesn't rprop work with mini-batches? People have tried it and found it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so the weight stays where it is. Rprop won't give us that. Rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.
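Here is a minimal sketch of the full-batch rprop update just described. The increase factor of 1.2 and the step-size limits of 50 and a millionth come from the lecture; the decrease factor of 0.5 is an illustrative assumption (the lecture only says the decrease should be more powerful than the increase), and this simple variant skips the weight backtracking found in some rprop implementations.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch rprop update for an array of weights w.

    grad and prev_grad are the current and previous full-batch gradients;
    step holds the per-weight step sizes, which adapt over time.
    Note: dec=0.5 is an assumed value, not one given in the lecture.
    """
    agree = grad * prev_grad                      # positive where the last two gradient signs agree
    step = np.where(agree > 0, step * inc, step)  # signs agree: grow the step multiplicatively
    step = np.where(agree < 0, step * dec, step)  # signs disagree: shrink it, faster than we grow
    step = np.clip(step, step_min, step_max)      # keep step sizes between a millionth and 50
    w = w - np.sign(grad) * step                  # use only the sign of the gradient, not its magnitude
    return w, step
```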
So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that allows them to be combined in the right way? That leads to a method which I'm calling rmsprop, and which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient but also dividing by the magnitude of the gradient, and the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates: time increments by one each time we update the weights. The numbers I put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the value of the squared gradient for that weight at time t, times 0.1. We then take that mean square and take its square root, which is why it has the name RMS, and then we divide the gradient by that RMS and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method where we simply, for each connection, keep a running average of the root mean squared gradient and divide by that.

There are many further developments one could make to rmsprop. You could combine it with standard momentum. My experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction, and Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term rather than the large jump you make in the direction of the accumulated corrections. Obviously you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful it will be. And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called "No More Pesky Learning Rates" that came out this year. Some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage that comes from this complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that.
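Here is a minimal sketch of the rmsprop update described above. The 0.9 and 0.1 weights for the moving average are the example values from the lecture; the learning rate and the small epsilon added for numerical stability are my own assumptions, since the lecture doesn't specify them.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """One mini-batch rmsprop update for an array of weights w.

    mean_square is the per-weight moving average of the squared gradient:
    MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad(t)**2
    lr and eps are assumed values, not ones given in the lecture.
    """
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)  # divide the gradient by the RMS of recent gradients
    return w, mean_square
```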
So, a summary of the learning methods for neural networks goes like this. If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. You could use full-batch methods adapted from the optimization literature, like nonlinear conjugate gradient, L-BFGS, or Levenberg-Marquardt, and one advantage of using those methods is that they typically come with a package. When you report the results in your paper, you just have to say, "I used this package and here's what it did"; you don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods but were developed for neural networks.

If you have a big, redundant data set, it's essential to use mini-batches. It's a huge waste not to do that. The first thing to try is just standard gradient descent with momentum. You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign (there's a rough sketch of this at the end of the section). But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, it seems to work about as well as gradient descent with momentum, or possibly better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out whatever Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, and so it's worth keeping up with whatever he's doing.

One question you might ask is, why is there no simple recipe? We have been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe. First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard things to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They often can be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate.

The other consideration is that tasks differ a lot. Some tasks require very accurate weights; some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties: for example, if your inputs are words, rare words may only occur in one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels.

So, to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better neural nets will work once we've got this sorted out, and they already work pretty well.
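Here is the rough sketch referred to above: mini-batch gradient descent with momentum, plus a crude loop that nudges the single global learning rate depending on whether successive gradients mostly keep or flip their sign. The momentum value of 0.9, the 1.05 and 0.7 adjustment factors, and the summed-agreement test are illustrative assumptions, not values given in the lecture.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One mini-batch gradient descent step with momentum (momentum=0.9 is an assumed value)."""
    velocity = momentum * velocity - lr * grad   # accumulate a velocity for the weights
    return w + velocity, velocity

def adapt_global_lr(lr, grad, prev_grad, up=1.05, down=0.7):
    """Crudely adjust the single global learning rate.

    Grow it a little while successive mini-batch gradients mostly agree in
    direction, and shrink it harder when they mostly flip sign.
    The up/down factors are assumed values, not ones from the lecture.
    """
    agreement = float(np.sum(grad * prev_grad))
    return lr * up if agreement > 0 else lr * down
```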