In this video, I'm first going to introduce a method called rprop, which is used for full-batch learning. It's like Robert Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient.
That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big. Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient; you just look at its sign. But you do look at the step size you decided on for that weight. And that step size adapts over time, again without looking at the magnitude of the gradient. So we increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree.
This is like Robert Jacobs' adaptive-weights method, except that here we do a multiplicative increase. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and in this case we make that more powerful than the increase, so that we can decay faster than we grow. We need to limit the step sizes. Mike Shuster's advice was to limit them to between 50 and a millionth. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea. So one question is, why doesn't rprop work with mini-batches? People have tried it, and found it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult.
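The rprop rule described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the lecture only specifies the 1.2 increase factor, a stronger multiplicative decrease, and step-size limits of 50 and one millionth, so the 0.5 decrease factor and all of the names here are assumptions.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5,
                 step_min=1e-6, step_max=50.0):
    """One full-batch rprop update (illustrative sketch)."""
    # Grow the step where the last two gradient signs agree,
    # shrink it (faster) where they disagree.
    agreement = grad * prev_grad
    step = np.where(agreement > 0, step * eta_plus,
           np.where(agreement < 0, step * eta_minus, step))
    step = np.clip(step, step_min, step_max)
    # Move each weight by its own step size, using only the gradient's sign.
    w = w - np.sign(grad) * step
    return w, step
```

Note that the gradient's magnitude never enters the weight update, only its sign; all of the magnitude information is discarded, which is exactly what makes the method robust to wildly varying gradients.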
So the reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out, so that the weight stays where it is. Rprop won't give us that. Rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the timescale of these mini-batches. So the question is, can we combine the robustness that you get from rprop by just using the sign of the gradient, the efficiency that you get from mini-batches, and the averaging of gradients over mini-batches that allows the gradients to be combined in the right way?
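The averaging argument can be checked with a tiny numerical example. Assuming a fixed step size for the sign-based update (the lecture's assumption that step sizes adapt slowly), the ten SGD updates nearly cancel, while the sign-based updates produce a large net drift:

```python
import numpy as np

# Ten mini-batch gradients for one weight: nine of +0.01, then one of -0.09.
grads = np.array([0.01] * 9 + [-0.09])

# With a small learning rate, plain SGD effectively averages the gradients,
# so the ten updates roughly cancel:
lr = 0.1
sgd_move = -lr * grads.sum()                 # ~0.0

# A sign-based (rprop-style) update ignores magnitudes, so the weight takes
# nine full steps in one direction and only one step back:
step = 0.1
rprop_move = -(np.sign(grads) * step).sum()  # net drift of 9 - 1 = 8 steps
```

Here the SGD displacement is essentially zero while the sign-based displacement is eight full steps, which is the failure the lecture describes.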
That leads to a method that I'm calling rmsprop, which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient, but also dividing by the magnitude of the gradient. And the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates; it increments by one each time we update the weights. The numbers I've put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the squared gradient for that weight at time t, times 0.1: MeanSquare(w, t) = 0.9 MeanSquare(w, t-1) + 0.1 (∂E/∂w(t))².
We then take that mean square and take its square root, which is why it has the name RMS. Then we divide the gradient by that RMS, and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method where, for each connection, we simply keep a running average of the root mean square gradient and divide by that. There are many further developments one could make to rmsprop. You could combine it with standard momentum; my experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction. Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term, rather than the large jump you make in the direction of the accumulated gradients.
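Putting the pieces together, a minimal rmsprop update might look like the sketch below. The 0.9/0.1 moving-average weights and the division by the RMS come from the lecture; the function name, the learning rate, and the small eps added to guard against division by zero are my assumptions.

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update for a vector of weights (sketch)."""
    # Moving average of the squared gradient, kept separately per weight.
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    # Divide the gradient by the root of that average (the "RMS") and step.
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```

Because the moving average changes slowly, nearby mini-batches get divided by nearly the same number, which is exactly the property that rprop lacked.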
Obviously, you could combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful it will be. And then there's a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called "No More Pesky Learning Rates" that came out this year. Some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage of this complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that. So, a summary of the learning methods for neural networks goes like this.
If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. These are full-batch methods adapted from the optimization literature, like non-linear conjugate gradient, L-BFGS, or Levenberg-Marquardt. One advantage of using those methods is that they typically come with a package, and when you report the results in your paper you just have to say, "I used this package, and here's what it did." You don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods, but ones that were developed for neural networks. If you have a big, redundant data set, it's essential to use mini-batches; it's a huge waste not to. The first thing to try is just standard gradient descent with momentum.
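That standard starting point, mini-batch gradient descent with momentum, can be sketched as follows; the learning-rate and momentum values here are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sgd_momentum_update(w, grad, velocity, lr=0.01, momentum=0.9):
    """One mini-batch step of gradient descent with momentum (sketch)."""
    velocity = momentum * velocity - lr * grad   # accumulate a velocity
    w = w + velocity                             # move along the velocity
    return w, velocity
```
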
You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign. But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far it seems to work as well as gradient descent with momentum, possibly better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out what Yann LeCun's latest recipe is and try that. He's probably the person who's tried the most different ways of getting stochastic gradient descent to work well, and so it's worth keeping up with whatever he's doing. One question you might ask is, why is there no simple recipe? We've been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think
that we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe. First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They can often be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate. The other consideration is that tasks differ a lot. Some tasks require very accurate weights.
Some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties: for example, if your inputs are words, rare words may occur in only one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels. So, to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better your neural nets will work once we've got this sorted out. And they already work pretty well.