In this video, I'm first going to introduce a method called rprop, which is used for full-batch learning. It's like Robert Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient.
That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big. Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient; you just look at its sign. But you do look at the step size you decided on for that weight. And that step size adapts over time, again without looking at the magnitude of the gradient. So we increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree.
This is like Robert Jacobs' adaptive-weights method, except that here we do a multiplicative increase. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we make that more powerful than the increase, so that we can decay faster than we grow. We need to limit the step sizes. Mike Shuster's advice was to limit them to between 50 and a millionth. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea. So one question is, why doesn't rprop work with mini-batches? People have tried it, and found it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult.
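The rprop rule described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the lecture only specifies the 1.2 increase factor, a stronger multiplicative decrease, and step-size limits of 50 and one millionth, so the 0.5 decrease factor and all of the names here are assumptions.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5,
                 step_min=1e-6, step_max=50.0):
    """One full-batch rprop update (illustrative sketch)."""
    # Grow the step where the last two gradient signs agree,
    # shrink it (faster) where they disagree.
    agreement = grad * prev_grad
    step = np.where(agreement > 0, step * eta_plus,
           np.where(agreement < 0, step * eta_minus, step))
    step = np.clip(step, step_min, step_max)
    # Move each weight by its own step size, using only the gradient's sign.
    w = w - np.sign(grad) * step
    return w, step
```

Note that the gradient's magnitude never enters the weight update, only its sign; all of the magnitude information is discarded, which is exactly what makes the method robust to wildly varying gradients.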
So the reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so that the weight stays where it is. Rprop won't give us that. Rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the timescale of these mini-batches. So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that allows the gradients to be combined in the right way?
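The averaging argument can be checked with a tiny numerical example. Assuming a fixed step size for the sign-based update (the lecture's assumption that step sizes adapt slowly), the ten SGD updates nearly cancel, while the sign-based updates produce a large net drift:

```python
import numpy as np

# Ten mini-batch gradients for one weight: nine of +0.01, then one of -0.09.
grads = np.array([0.01] * 9 + [-0.09])

# With a small learning rate, plain SGD effectively averages the gradients,
# so the ten updates roughly cancel:
lr = 0.1
sgd_move = -lr * grads.sum()                 # ~0.0

# A sign-based (rprop-style) update ignores magnitudes, so the weight takes
# nine full steps in one direction and only one step back:
step = 0.1
rprop_move = -(np.sign(grads) * step).sum()  # net drift of 9 - 1 = 8 steps
```

Here the SGD displacement is essentially zero while the sign-based displacement is eight full steps, which is the failure the lecture describes.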
That leads to a method that I'm calling rmsprop, which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient, but also dividing by the magnitude of the gradient. And the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates; it increments by one each time we update the weights. The numbers I've put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the squared gradient for that weight at time t, times 0.1: MeanSquare(w, t) = 0.9 MeanSquare(w, t-1) + 0.1 (∂E/∂w(t))².
We then take that mean square and take its square root, which is why it has the name RMS. Then we divide the gradient by that RMS, and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method where, for each connection, we simply keep a running average of the root mean square gradient and divide by that. There are many further developments one could make to rmsprop. You could combine it with standard momentum; my experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction. Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term, rather than the large jump you make in the direction of the accumulated gradients.
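Putting the pieces together, a minimal rmsprop update might look like the sketch below. The 0.9/0.1 moving-average weights and the division by the RMS come from the lecture; the function name, the learning rate, and the small eps added to guard against division by zero are my assumptions.

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update for a vector of weights (sketch)."""
    # Moving average of the squared gradient, kept separately per weight.
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    # Divide the gradient by the root of that average (the "RMS") and step.
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```

Because the moving average changes slowly, nearby mini-batches get divided by nearly the same number, which is exactly the property that rprop lacked.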
Obviously, you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful it will be. And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called "No More Pesky Learning Rates" that came out this year. Some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage of this complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that. So, a summary of the learning methods for neural networks goes like this.
If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. These are full-batch methods adapted from the optimization literature, like non-linear conjugate gradient, L-BFGS, or Levenberg-Marquardt. One advantage of using those methods is that they typically come with a package, and when you report the results in your paper you just have to say, "I used this package, and here's what it did." You don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods, but ones that were developed for neural networks. If you have a big, redundant data set, it's essential to use mini-batches; it's a huge waste not to. The first thing to try is just standard gradient descent with momentum.
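That standard starting point, mini-batch gradient descent with momentum, can be sketched as follows; the learning-rate and momentum values here are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sgd_momentum_update(w, grad, velocity, lr=0.01, momentum=0.9):
    """One mini-batch step of gradient descent with momentum (sketch)."""
    velocity = momentum * velocity - lr * grad   # accumulate a velocity
    w = w + velocity                             # move along the velocity
    return w, velocity
```
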
You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign. But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far it seems to work as well as gradient descent with momentum, possibly better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out what Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, and so it's worth keeping up with whatever he's doing. One question you might ask is, why is there no simple recipe? We've been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think
that we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe. First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They can often be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate. The other consideration is that tasks differ a lot. Some tasks require very accurate weights.
Some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties: for example, if your inputs are words, rare words may occur in only one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels. So, to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better your neural nets will work once we've got this sorted out. And they already work pretty well.