Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning Previous Document: What are batch, incremental, on-line, off-line, Next Document: What learning rate should be used for backprop? See reader questions & answers on this topic! - Help others by sharing your knowledge "Backprop" is short for "backpropagation of error". The term backpropagation causes much confusion. Strictly speaking, backpropagation refers to the method for computing the gradient of the case-wise error function with respect to the weights for a feedforward network, a straightforward but elegant application of the chain rule of elementary calculus (Werbos 1974/1994). By extension, backpropagation or backprop refers to a training method that uses backpropagation to compute the gradient. By further extension, a backprop network is a feedforward network trained by backpropagation. "Standard backprop" is a euphemism for the generalized delta rule, the training algorithm that was popularized by Rumelhart, Hinton, and Williams in chapter 8 of Rumelhart and McClelland (1986), which remains the most widely used supervised training method for neural nets. The generalized delta rule (including momentum) is called the "heavy ball method" in the numerical analysis literature (Polyak 1964, 1987; Bertsekas 1995, 78-79). Standard backprop can be used for both batch training (in which the weights are updated after processing the entire training set) and incremental training (in which the weights are updated after processing each case). For batch training, standard backprop usually converges (eventually) to a local minimum, if one exists. For incremental training, standard backprop does not converge to a stationary point of the error surface. To obtain convergence, the learning rate must be slowly reduced. This methodology is called "stochastic approximation" or "annealing". The convergence properties of standard backprop, stochastic approximation, and related methods, including both batch and incremental algorithms, are discussed clearly and thoroughly by Bertsekas and Tsitsiklis (1996). For a practical discussion of backprop training in MLPs, Reed and Marks (1999) is the best reference I've seen. For batch processing, there is no reason to suffer through the slow convergence and the tedious tuning of learning rates and momenta of standard backprop. Much of the NN research literature is devoted to attempts to speed up backprop. Most of these methods are inconsequential; two that are effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller and Braun 1993). Concise descriptions of these algorithms are given by Schiffmann, Joost, and Werner (1994) and Reed and Marks (1999). But conventional methods for nonlinear optimization are usually faster and more reliable than any of the "props". See "What are conjugate gradients, Levenberg-Marquardt, etc.?". Incremental backprop can be highly efficient for some large data sets if you select a good learning rate, but that can be difficult to do (see "What learning rate should be used for backprop?"). Also, incremental backprop is very sensitive to ill-conditioning (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html). For more on-line info on backprop, see Donald Tveter's Backpropagator's Review at http://www.dontveter.com/bpr/bpr.html or http://gannoo.uce.ac.uk/bpr/bpr.html. References on backprop: Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-14-0. Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. Polyak, B.T. (1964), "Some methods of speeding up the convergence of iteration methods," Z. Vycisl. Mat. i Mat. Fiz., 4, 1-17. Polyak, B.T. (1987), Introduction to Optimization, NY: Optimization Software, Inc. Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8. Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning internal representations by error propagation", in Rumelhart, D.E. and McClelland, J. L., eds. (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1, 318-362, Cambridge, MA: The MIT Press. Werbos, P.J. (1974/1994), The Roots of Backpropagation, NY: John Wiley & Sons. Includes Werbos's 1974 Harvard Ph.D. thesis, Beyond Regression. References on stochastic approximation: Robbins, H. & Monro, S. (1951), "A Stochastic Approximation Method", Annals of Mathematical Statistics, 22, 400-407. Kiefer, J. & Wolfowitz, J. (1952), "Stochastic Estimation of the Maximum of a Regression Function," Annals of Mathematical Statistics, 23, 462-466. Kushner, H.J., and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag. Kushner, H.J., and Clark, D. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag. White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013. References on better props: Fahlman, S.E. (1989), "Faster-Learning Variations on Back-Propagation: An Empirical Study", in Touretzky, D., Hinton, G, and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 38-51. Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8. Riedmiller, M. (199?), "Advanced supervised learning in multi-layer perceptrons--from backpropagtion to adaptive learning algorithms," ftp://i11s16.ira.uka.de/pub/neuro/papers/ Riedmiller, M. and Braun, H. (1993), "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm", Proceedings of the IEEE International Conference on Neural Networks 1993, San Francisco: IEEE. Schiffmann, W., Joost, M., and Werner, R. (1994), "Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons," ftp://archive.cis.ohio-state.edu/pub/neuroprose/schiff.bp_speedup.ps.Z User Contributions:Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning Previous Document: What are batch, incremental, on-line, off-line, Next Document: What learning rate should be used for backprop? Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page [ Usenet FAQs | Web FAQs | Documents | RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
|
Comment about this article, ask questions, or add new information about this topic: