comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - What learning rate should be used for backprop?




In standard backprop, too low a learning rate makes the network learn very
slowly. Too high a learning rate makes the weights and objective function
diverge, so there is no learning at all. If the objective function is
quadratic, as in linear models, good learning rates can be computed from the
Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the objective function
has many local and global optima, as in typical feedforward NNs with hidden
units, the optimal learning rate often changes dramatically during the
training process, since the Hessian also changes dramatically. Trying to
train a NN using a constant learning rate is usually a tedious process
requiring much trial and error. For some examples of how the choice of
learning rate and momentum interacts with numerical conditioning in some
very simple networks, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html 
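
To make the quadratic case concrete: for plain gradient descent on a
quadratic objective, the iteration converges only if the learning rate is
below 2 divided by the largest eigenvalue of the Hessian. The following
sketch (Python with NumPy; the data and sizes are invented purely for
illustration) computes a safe rate this way for a linear least-squares
model:

   import numpy as np

   # Hypothetical linear least-squares problem:
   #   f(w) = 0.5 * ||X w - y||^2, with constant Hessian H = X^T X.
   rng = np.random.default_rng(0)
   X = rng.normal(size=(100, 5))
   y = rng.normal(size=100)

   H = X.T @ X                         # Hessian of the quadratic objective
   lam_max = np.linalg.eigvalsh(H).max()
   lr = 1.0 / lam_max                  # safe; divergence sets in above 2/lam_max

   w = np.zeros(5)
   for _ in range(1000):
       grad = X.T @ (X @ w - y)        # gradient of f at w
       w -= lr * grad

   print("residual norm:", np.linalg.norm(X @ w - y))

With a nonquadratic objective, the Hessian (and hence any such "safe"
rate) changes as the weights move, which is exactly why a constant
learning rate is troublesome.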

With batch training, there is no need to use a constant learning rate. In
fact, there is no reason to use standard backprop at all, since vastly more
efficient, reliable, and convenient batch training algorithms exist (see
Quickprop and RPROP under "What is backprop?" and the numerous training
algorithms mentioned under "What are conjugate gradients,
Levenberg-Marquardt, etc.?"). 

Many other variants of backprop have been invented. Most suffer from the
same theoretical flaw as standard backprop: the magnitude of the change in
the weights (the step size) should NOT be a function of the magnitude of the
gradient. In some regions of the weight space, the gradient is small and you
need a large step size; this happens when you initialize a network with
small random weights. In other regions of the weight space, the gradient is
small and you need a small step size; this happens when you are close to a
local minimum. Likewise, a large gradient may call for either a small step
or a large step. Many algorithms try to adapt the learning rate, but any
algorithm that multiplies the learning rate by the gradient to compute the
change in the weights is likely to produce erratic behavior when the
gradient changes abruptly. The great advantage of Quickprop and RPROP is
that they do not have this excessive dependence on the magnitude of the
gradient. Conventional optimization algorithms use not only the gradient but
also second-order derivatives or a line search (or some combination thereof)
to obtain a good step size. 
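
The sign-versus-magnitude point can be seen in the core RPROP update,
sketched below in a simplified form (Python with NumPy; the step-size
constants are the conventional RPROP defaults, and the quadratic test
function is invented for illustration). Each weight keeps its own step
size, which grows when the sign of the partial derivative repeats and
shrinks when it flips; the magnitude of the gradient never enters the
weight change:

   import numpy as np

   def rprop_step(w, grad, grad_prev, delta,
                  eta_plus=1.2, eta_minus=0.5,
                  delta_min=1e-6, delta_max=50.0):
       """One simplified RPROP update: only the SIGN of each partial
       derivative is used; delta is a per-weight step size."""
       agree = grad * grad_prev
       # Same sign as the previous step: accelerate.
       delta = np.where(agree > 0,
                        np.minimum(delta * eta_plus, delta_max), delta)
       # Sign flipped: the minimum was overshot, so shrink the step
       # and skip this weight's update for one iteration.
       delta = np.where(agree < 0,
                        np.maximum(delta * eta_minus, delta_min), delta)
       grad = np.where(agree < 0, 0.0, grad)
       w = w - np.sign(grad) * delta
       return w, grad, delta

   # Hypothetical usage: minimize f(w) = sum(w**2) from a poor start.
   w = np.array([5.0, -3.0])
   grad_prev = np.zeros_like(w)
   delta = np.full_like(w, 0.1)         # initial step sizes
   for _ in range(100):
       grad = 2.0 * w                   # gradient of f
       w, grad_prev, delta = rprop_step(w, grad, grad_prev, delta)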

With incremental training, it is much more difficult to concoct an algorithm
that automatically adjusts the learning rate during training. Various
proposals have appeared in the NN literature, but most of them don't work.
Problems with some of these proposals are illustrated by Darken and Moody
(1992), who unfortunately do not offer a solution. Some promising results
are provided by LeCun, Simard, and Pearlmutter (1993), and by Orr and
Leen (1997), who adapt the momentum rather than the learning rate. There is
also a variant of stochastic approximation called "iterate averaging" or
"Polyak averaging" (Kushner and Yin 1997), which theoretically provides
optimal convergence rates by keeping a running average of the weight values.
I have no personal experience with these methods; if you have any solid
evidence that these or other methods of automatically setting the learning
rate and/or momentum in incremental training actually work in a wide variety
of NN applications, please inform the FAQ maintainer (saswss@unx.sas.com). 

References: 

   Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
   Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. 

   Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient
   search," in Moody, J.E., Hanson, S.J., and Lippmann, R.P., eds. 
   Advances in Neural Information Processing Systems 4, San Mateo, CA:
   Morgan Kaufmann Publishers, pp. 1009-1016. 

   Kushner, H.J., and Yin, G. (1997), Stochastic Approximation Algorithms
   and Applications, NY: Springer-Verlag. 

   LeCun, Y., Simard, P.Y., and Pearlmutter, B. (1993), "Automatic learning
   rate maximization by on-line estimation of the Hessian's eigenvectors,"
   in Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.), Advances in Neural
   Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp.
   156-163. 

   Orr, G.B. and Leen, T.K. (1997), "Using curvature information for fast
   stochastic search," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.)
   Advances in Neural Information Processing Systems 9, Cambridge, MA: The
   MIT Press, pp. 606-612. 
