Lecture Notes in Computer Science                                   7700

Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

Grégoire Montavon, Geneviève B. Orr, Klaus-Robert Müller (Eds.)

Neural Networks: Tricks of the Trade

Second Edition

Volume Editors

Grégoire Montavon
Technische Universität Berlin, Department of Computer Science
Franklinstr. 28/29, 10587 Berlin, Germany
E-mail: gregoire.montavon@tu-berlin.de

Geneviève B. Orr
Willamette University, Department of Computer Science
900 State Street, Salem, OR 97301, USA
E-mail: gorr@willamette.edu

Klaus-Robert Müller
Technische Universität Berlin, Department of Computer Science
Franklinstr. 28/29, 10587 Berlin, Germany
and
Korea University, Department of Brain and Cognitive Engineering
Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
E-mail: klaus-robert.mueller@tu-berlin.de

ISSN 0302-9743            e-ISSN 1611-3349
ISBN 978-3-642-35288-1    e-ISBN 978-3-642-35289-8
DOI 10.1007/978-3-642-35289-8
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2012952591

CR Subject Classification (1998): F.1, I.2.6, I.5.1, C.1.3, F.2, J.3

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

© Springer-Verlag Berlin Heidelberg 1998, 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Preface to the Second Edition

There have been substantial changes in the field of neural networks since the first edition of this book in 1998. Some of them have been driven by external factors such as the increase of available data and computing power. The Internet has made massive amounts of labeled and unlabeled data publicly available. The ever-increasing raw mass of user-generated and sensed data is made easily accessible by databases and Web crawlers. Nowadays, anyone with an Internet connection can parse the 4,000,000+ articles available on Wikipedia and construct a dataset out of them. Anyone can capture a Web TV stream and obtain days of video content to test their learning algorithm.

Another development is the amount of available computing power, which has continued to rise at a steady rate owing to progress in hardware design and engineering. While the number of cycles per second of processors has plateaued due to physical limitations, the slow-down has been offset by the emergence of processing parallelism, best exemplified by massively parallel graphics processing units (GPUs). Nowadays, everybody can buy a GPU board (usually already available in consumer-grade laptops), install free GPU software, and run computation-intensive simulations at low cost.

These developments have raised the following question: Can we make use of this large computing power to make sense of these increasingly complex datasets? Neural networks are a promising approach, as they have the intrinsic modeling capacity and flexibility to represent the solution. Their intrinsically distributed nature allows one to leverage massively parallel computing resources.

During the last two decades, the focus of neural network research and the practice of training neural networks underwent important changes. Learning in deep architectures (or "deep learning") has to a certain degree displaced the once more prevalent regularization issues, or more precisely, changed the practice of regularizing neural networks. Use of unlabeled data via unsupervised layer-wise pretraining or deep unsupervised embeddings is now often preferred over traditional regularization schemes such as weight decay or restricted connectivity. This new paradigm has started to spread over a large number of applications such as image recognition, speech recognition, natural language processing, complex systems, neuroscience, and computational physics.

The second edition of the book reloads the first edition with more tricks. These tricks arose from 14 years of theory and experimentation (from 1998 to 2012) by some of the world's most prominent neural networks researchers. These tricks can make a substantial difference (in terms of speed, ease of implementation, and accuracy) when it comes to putting algorithms to work on real problems. Tricks may not necessarily have solid theoretical foundations or formal validation. As Yoshua Bengio states in Chap. 19, "the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone" [1].

The second part of the new edition starts with tricks for optimizing neural networks faster and making more efficient use of the potentially infinite stream of data presented to them. Chapter 18 [2] shows that a simple stochastic gradient descent (learning one example at a time) is suited for training most neural networks. Chapter 19 [1] introduces a large number of tricks and recommendations for training feed-forward neural networks and choosing the multiple hyperparameters.

When the representation built by the neural network is highly sensitive to small parameter changes, for example, in recurrent neural networks, second-order methods based on mini-batches such as those presented in Chap. 20 [9] can be a better choice. The seemingly simple optimization procedures presented in these chapters require their fair share of tricks in order to work optimally. The software Torch7 presented in Chap. 21 [5] provides a fast and modular implementation of these neural networks.

This new second part of the volume continues with tricks for incorporating invariance into the model. In the context of image recognition, Chap. 22 [4] shows that translation invariance can be achieved by learning a k-means representation of image patches and spatially pooling the k-means activations. Chapter 23 [3] shows that invariance can be injected directly in the input space in the form of elastic distortions. Unlabeled data are ubiquitous and using them to capture regularities in data is an important component of many learning algorithms. For example, we can learn an unsupervised model of data as a first step, as discussed in Chaps. 24 [7] and 25 [10], and feed the unsupervised representation to a supervised classifier. Chapter 26 [12] shows that similar improvements can be obtained by learning an unsupervised embedding in the deep layers of a neural network, with added flexibility.

The book concludes with the application of neural networks to modeling time series and optimal control systems. Modeling time series can be done using a very simple technique discussed in Chap. 27 [8] that consists of fitting a linear model on top of a "reservoir" that implements a rich set of time series primitives. Chapter 28 [13] offers an alternative to the previous method by directly identifying the underlying dynamical system that generates the time series data. Chapter 29 [6] presents how these system identification techniques can be used to identify a Markov decision process from the observation of a control system (a sequence of states and actions in the reinforcement learning terminology). Chapter 30 [11] concludes by showing how the control system can be dynamically improved by fitting a neural network as the control system explores the space of states and actions.

The book intends to provide a timely snapshot of tricks, theory, and algorithms that are of use. Our hope is that some of the chapters of the new second edition will become our companions when doing experimental work, eventually becoming classics, as some of the papers of the first edition have become. Eventually, in some years, there may be an urge to reload again...

September 2012

Grégoire & Klaus

Acknowledgments. This work was supported by the World Class University Program through the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology, under Grant R31-10008. The editors also acknowledge partial support by DFG (MU 987/17-1).
References

[1] Bengio, Y.: Practical Recommendations for Gradient-Based Training of Deep Architectures. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 437–478. Springer, Heidelberg (2012)

[2] Bottou, L.: Stochastic Gradient Descent Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012)

[3] Cireşan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep Big Multilayer Perceptrons for Digit Recognition. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 581–598. Springer, Heidelberg (2012)

[4] Coates, A., Ng, A.Y.: Learning Feature Representations with K-Means. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 561–580. Springer, Heidelberg (2012)

[5] Collobert, R., Kavukcuoglu, K., Farabet, C.: Implementing Neural Networks Efficiently. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 537–557. Springer, Heidelberg (2012)

[6] Duell, S., Udluft, S., Sterzing, V.: Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 709–733. Springer, Heidelberg (2012)

[7] Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 599–619. Springer, Heidelberg (2012)

[8] Lukoševičius, M.: A Practical Guide to Applying Echo State Networks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 659–686. Springer, Heidelberg (2012)

[9] Martens, J., Sutskever, I.: Training Deep and Recurrent Networks with Hessian-Free Optimization. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 479–535. Springer, Heidelberg (2012)

[10] Montavon, G., Müller, K.-R.: Deep Boltzmann Machines and the Centering Trick. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 621–637. Springer, Heidelberg (2012)

[11] Riedmiller, M.: 10 Steps and Some Tricks to Set Up Neural Reinforcement Controllers. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 735–757. Springer, Heidelberg (2012)

[12] Weston, J., Ratle, F., Collobert, R.: Deep Learning via Semi-supervised Embedding. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 639–655. Springer, Heidelberg (2012)

[13] Zimmermann, H.-G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–707. Springer, Heidelberg (2012)
Table of Contents

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Speeding Learning

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1. Efficient BackProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
   Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Müller

Regularization Techniques to Improve Generalization

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2. Early Stopping — But When? . . . . . . . . . . . . . . . . . . . . . . . . . . 53
   Lutz Prechelt

3. A Simple Trick for Estimating the Weight Decay Parameter . . . . . . . . . 69
   Thorsteinn S. Rögnvaldsson

4. Controlling the Hyperparameter Search in MacKay's Bayesian Neural Network Framework . . . 91
   Tony Plate

5. Adaptive Regularization in Neural Network Modeling . . . . . . . . . . . . 111
   Jan Larsen, Claus Svarer, Lars Nonboe Andersen, and Lars Kai Hansen

6. Large Ensemble Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
   David Horn, Ury Naftaly, and Nathan Intrator

Improving Network Models and Algorithmic Tricks

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7. Square Unit Augmented, Radially Extended, Multilayer Perceptrons . . . . . 143
   Gary William Flake

8. A Dozen Tricks with Multitask Learning . . . . . . . . . . . . . . . . . . . . 163
   Rich Caruana

9. Solving the Ill-Conditioning in Neural Network Learning . . . . . . . . . . 191
   Patrick van der Smagt and Gerd Hirzinger

10. Centering Neural Network Gradient Factors . . . . . . . . . . . . . . . . . 205
    Nicol N. Schraudolph

11. Avoiding Roundoff Error in Backpropagating Derivatives . . . . . . . . . . 225
    Tony Plate

Representing and Incorporating Prior Knowledge in Neural Network Training

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

12. Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation . . . 235
    Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri

13. Combining Neural Networks and Context-Driven Search for On-line, Printed Handwriting Recognition in the Newton . . . 271
    Larry S. Yaeger, Brandyn J. Webb, and Richard F. Lyon

14. Neural Network Classification and Prior Class Probabilities . . . . . . . . 295
    Steve Lawrence, Ian Burns, Andrew Back, Ah Chung Tsoi, and C. Lee Giles

15. Applying Divide and Conquer to Large Scale Pattern Recognition Tasks . . . 311
    Jürgen Fritsch and Michael Finke

Tricks for Time Series

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

16. Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions . . . 343
    John Moody

17. How to Train Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 369
    Ralph Neuneier and Hans Georg Zimmermann

Big Learning in Deep Neural Networks

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

18. Stochastic Gradient Descent Tricks . . . . . . . . . . . . . . . . . . . . . . 421
    Léon Bottou

19. Practical Recommendations for Gradient-Based Training of Deep Architectures . . . 437
    Yoshua Bengio

20. Training Deep and Recurrent Networks with Hessian-Free Optimization . . . 479
    James Martens and Ilya Sutskever

21. Implementing Neural Networks Efficiently . . . . . . . . . . . . . . . . . . 537
    Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet

Better Representations: Invariant, Disentangled and Reusable

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559

22. Learning Feature Representations with K-Means . . . . . . . . . . . . . . . 561
    Adam Coates and Andrew Y. Ng

23. Deep Big Multilayer Perceptrons for Digit Recognition . . . . . . . . . . . 581
    Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber

24. A Practical Guide to Training Restricted Boltzmann Machines . . . . . . . . 599
    Geoffrey E. Hinton

25. Deep Boltzmann Machines and the Centering Trick . . . . . . . . . . . . . . 621
    Grégoire Montavon and Klaus-Robert Müller

26. Deep Learning via Semi-supervised Embedding . . . . . . . . . . . . . . . . 639
    Jason Weston, Frédéric Ratle, and Ronan Collobert

Identifying Dynamical Systems for Forecasting and Control

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657

27. A Practical Guide to Applying Echo State Networks . . . . . . . . . . . . . 659
    Mantas Lukoševičius

28. Forecasting with Recurrent Neural Networks: 12 Tricks . . . . . . . . . . . 687
    Hans-Georg Zimmermann, Christoph Tietz, and Ralph Grothmann

29. Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks . . . 709
    Siegmund Duell, Steffen Udluft, and Volkmar Sterzing

30. 10 Steps and Some Tricks to Set up Neural Reinforcement Controllers . . . 735
    Martin Riedmiller

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759

Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
Introduction

It is our belief that researchers and practitioners acquire, through experience and word-of-mouth, techniques and heuristics that help them successfully apply neural networks to difficult real world problems. Often these "tricks" are theoretically well motivated. Sometimes they are the result of trial and error. However, their most common link is that they are usually hidden in people's heads or in the back pages of space-constrained conference papers. As a result, newcomers to the field waste much time wondering why their networks train so slowly and perform so poorly.

This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade whose goal was to begin the process of gathering and documenting these tricks. The interest that the workshop generated motivated us to expand our collection and compile it into this book. Although we have no doubt that there are many tricks we have missed, we hope that what we have included will prove to be useful, particularly to those who are relatively new to the field. Each chapter contains one or more tricks presented by a given author (or authors). We have attempted to group related chapters into sections, though we recognize that the different sections are far from disjoint. Some of the chapters (e.g. 1, 13, 17) contain entire systems of tricks that are far more general than the category they have been placed in.

Before each section we provide the reader with a summary of the tricks contained within, to serve as a quick overview and reference. However, we do not recommend applying tricks before having read the accompanying chapter. Each trick may only work in a particular context that is not fully explained in the summary. This is particularly true for the chapters that present systems where combinations of tricks must be applied together for them to be effective.

Below we give a coarse roadmap of the contents of the individual chapters.

Speeding Learning

The book opens with a chapter based on Leon Bottou and Yann LeCun's popular workshop on efficient backpropagation, where they present a system of tricks for speeding the minimization process. Included are tricks that are very simple to implement as well as more complex ones, e.g. based on second-order methods. Though many readers may recognize some of these tricks, we believe that this chapter provides both a thorough explanation of their theoretical basis and an understanding of the subtle interactions among them.

This chapter provides an ideal introduction for the reader. It starts by discussing fundamental tricks addressing input representation, initialization, target values, choice of learning rates, choice of the nonlinearity, and so on. Subsequently, the authors introduce in great detail tricks for estimation and approximation of the Hessian in neural networks. This provides the basis for a discussion of second-order algorithms, fast training methods like the stochastic Levenberg-Marquardt algorithm, and tricks for learning rate adaptation.
Regularization Techniques to Improve Generalization

Fast minimization is important but only if we can also ensure good generalization. We therefore next include a collection of chapters containing a range of approaches for improving generalization. As one might expect, there are no tricks that work well in all situations. However, many examples and discussions are included to help the reader decide which will work best for their own problem.

Chapter 2 addresses what is one of the most commonly used techniques: early stopping. Here Lutz Prechelt discusses the pitfalls of this seemingly simple technique. He quantifies the tradeoff between generalization and training time for various stopping criteria, which leads to a trick for picking an appropriate criterion.

Using a weight decay penalty term in the cost function is another common method for improving generalization. The difficulty, however, is in finding a good estimate of the weight decay parameter. In chapter 3, Thorsteinn Rögnvaldsson presents a fast technique for finding a good estimate, surprisingly, by using information measured at the early stopping point. Experimental evidence for its usefulness is given in several applications.

Tony Plate in chapter 4 treats the penalty terms along the lines of MacKay, i.e. as hyperparameters to be found through iterative search. He presents and compares tricks for making the hyperparameter search in classification networks work in practice by speeding it up and simplifying it. Key to his success is a control of the frequency of the hyperparameter updates and a better strategy in cases where the Hessian becomes out-of-bounds.

In chapter 5, Jan Larsen et al. present a trick for adapting regularization parameters by using simple gradient descent (with respect to the regularization parameters) on the validation error. The trick is tested on both classification and regression problems.

Averaging over multiple predictors is a well known method for improving generalization. Two questions that arise are how many predictors are "enough" and how the number of predictors affects the stopping criteria for early stopping. In the final chapter of this section, David Horn et al. present solutions to these questions by providing a method for estimating the error of an infinite number of predictors. They then demonstrate this trick on a prediction task.

Improving Network Models and Algorithmic Tricks

In this section we examine tricks that help improve the network model. Even though standard multilayer perceptrons (MLPs) are, in theory, universal approximators, other architectures may provide a more natural fit to a problem. A better fit means that training is faster and that there is a greater likelihood of finding a good and stable solution. For example, radial basis functions (RBFs) are preferred for problems that exhibit local features in a finite region. Of course, which architecture to choose is not always obvious.

In chapter 7, Gary Flake presents a trick that gives MLPs the power of both an MLP and an RBF so that one does not need to choose between them. This trick is simply to add extra inputs whose values are the square of the regular inputs. Both a theoretical and an intuitive explanation are presented along with a number of simulation examples.

Rich Caruana in chapter 8 shows that performance can be improved on a main task by adding extra outputs to a network that predict related tasks. This technique, known as multi-task learning (MTL), trains these extra outputs in parallel with the main task. This chapter presents multiple examples of what one might use as these extra outputs as well as techniques for implementing MTL effectively. Empirical examples include mortality rankings for pneumonia and road-following in a network learning to steer a vehicle.

Patrick van der Smagt and Gerd Hirzinger consider in chapter 9 the ill-conditioning of the Hessian in neural network training and propose using what they call a linearly augmented feed-forward network, employing input/output short-cut connections that share the input/hidden weights. This gives rise to better conditioning of the learning problem and, thus, to faster learning, as shown in a simulation example with data from a robot arm.

In chapter 10, Nicol Schraudolph takes the idea of scaling and centering the inputs even further than chapter 1 by proposing to center all factors in the neural network gradient: inputs, activities, error signals and hidden unit slopes. He gives experimental evidence for the usefulness of this trick.

In chapter 11, Tony Plate's short note reports a numerical trick for computing derivatives more accurately with only a small memory overhead.
Representing and Incorporating Prior Knowledge in Neural Network Training

Previous chapters (e.g. Chapter 1) present very general tricks for transforming inputs to improve learning: prior knowledge of the problem is not taken into account explicitly (of course regularization, as discussed in Chapters 2-5, implicitly assumes a prior, but on the weight distribution). For complex, difficult problems, however, it is not enough to take a black box approach, no matter how good that black box might be. This section examines how prior knowledge about a problem can be used to greatly improve learning. The questions asked include how to best represent the data, how to make use of this representation for training, and how to take advantage of the invariances that are present. Such issues are key for proper neural network training. They are also at the heart of the tricks pointed out by Patrice Simard et al. in the first chapter of this section. Here, the authors present a particularly interesting perspective on how to incorporate prior knowledge into data. They also give the first review of the tangent distance classification method and related techniques evolving from it such as tangent prop. These methods are applied to the difficult task of optical character recognition (OCR).

In chapter 13, Larry Yaeger et al. give an overview of the tricks and techniques for on-line handwritten character recognition that were eventually used in Apple Computer's Newton MessagePad® and eMate®. Anyone who has used these systems knows that their handwriting recognition capability works exceedingly well. Although many of the issues that are discussed in this chapter overlap with those in OCR, including representation and prior knowledge, the solutions are complementary. This chapter also gives a very nice overview of which design choices proved to be efficient as well as how different tricks such as choice of learning rate, over-representation of more difficult patterns, negative training, error emphasis and so on work together.

Whether it be handwritten character recognition, speech recognition or medical applications, a particularly difficult problem encountered is the unbalanced class prior probabilities that occur, for example, when certain writing styles and subphoneme classes are uncommon or certain illnesses occur less frequently. Chapter 13 briefly discusses this problem in the context of handwriting recognition and presents a heuristic which controls the frequency with which samples are picked for training.

In chapter 14, Steve Lawrence et al. discuss the issue of unbalanced class prior probabilities in greater depth. They present and compare several different heuristics (prior scaling, probabilistic sampling, post scaling and class membership equalization), one of which is similar to the one in chapter 13. They demonstrate their tricks solving an ECG classification problem and provide some theoretical explanations.

Many training techniques work well for small to moderate size nets. However, when problems consist of thousands of classes and millions of examples, not uncommon in applications such as speech recognition, many of these techniques break down. Chapter 15, by Jürgen Fritsch and Michael Finke, is devoted to the issue of large scale classification problems and representation design in general. Here the problem of unbalanced class prior probabilities is also tackled.

Although Fritsch and Finke specifically exemplify their design approach for the problem of building a large vocabulary speech recognizer, it becomes clear that these techniques are also applicable to the general construction of an appropriate hierarchical decision tree. A particularly interesting result in this paper is that the structure designed by a human speech expert to incorporate prior knowledge about speech was outperformed by their machine learning technique, which uses an agglomerative clustering algorithm to choose the structure of the decision tree.

Tricks for Time Series

We close the book with two papers on the subject of time series and economic forecasting. In the first of these chapters, John Moody presents an excellent survey of both the challenges of macroeconomic forecasting and a number of neural network solutions. The survey is followed by a more detailed description of smoothing regularizers, model selection methods (e.g. AIC, effective number of parameters, nonlinear cross-validation), and input selection via sensitivity-based input pruning. Model interpretation and visualization are also discussed.

In the final chapter, Ralph Neuneier and Hans Georg Zimmermann present an impressive integrated system for neural network training of time series and economic forecasting. Every aspect of the system is discussed, including input preprocessing, cost functions, handling of outliers, architecture, regularization techniques, as well as solutions for dealing with the problem of bottom-heavy networks, i.e. networks whose input dimension is large while the output dimension is very small. There is also a thought-provoking discussion of the Observer-Observer dilemma: we want to create a model based on observed data and, at the same time, use this model to judge the correctness of new incoming data. Even those people not interested specifically in economic forecasting are encouraged to read this very useful example of how to incorporate prior (system) knowledge into training.

Final Remark

As a final remark, we note that some of the views taken in the chapters are contradictory, e.g. some authors favor one regularization method over another, while other authors make exactly the opposite statement. On the one hand, one can explain these discrepancies by stating that the field is still very active and therefore opposing viewpoints will inevitably exist until more is understood. On the other hand, it may be that both (contradicting) views are correct but on different data sets and in different applications, e.g. an approach that considers noisy time-series needs algorithms with a completely different robustness than in, say, an OCR setting. In this sense, the present book mirrors an active field and a variety of applications with its diversity of views.

August 1998

Jenny & Klaus

Acknowledgements. We would like to thank all authors for their collaboration. Special thanks to Steven Lemm for considerable help with the typesetting. K.-R.M. acknowledges partial financial support from DFG (grant JA 379/51 and JA 379/7) and EU ESPRIT (grant 25387-STORM).
Speeding Learning

Preface

There are those who argue that developing fast algorithms is no longer necessary because computers have become so fast. However, we believe that the complexity of our algorithms and the size of our problems will always expand to consume all cycles available, regardless of the speed of our machines. Thus, there will never come a time when computational efficiency can or should be ignored. Besides, in the quest to find solutions faster, we also often find better and more stable solutions. This section is devoted to techniques for making the learning process in backpropagation (BP) faster and more efficient. It contains a single chapter based on a workshop by Leon Bottou and Yann LeCun. While many alternative learning systems have emerged since the time BP was first introduced, BP is still the most widely used learning algorithm. The reason for this is its simplicity, efficiency, and its general effectiveness on a wide range of problems. Even so, there are many pitfalls in applying it, which is where all these tricks enter.

Chapter 1 begins gently by introducing us to a few practical tricks that are very simple to implement. Included are easy to understand qualitative explanations of each. There is a discussion of stochastic (on-line) vs. batch mode learning where the advantages and disadvantages of both are presented, while making it clear that stochastic learning is most often preferred (p. 13). There is a trick that aims at maximizing the per-iteration information presented to the network simply by knowing how best to shuffle the examples (p. 15). This is followed by an entire set of tricks that must be coordinated together for maximum effectiveness. These include:

– how to normalize, decorrelate, and scale the inputs (p. 16)
– how to choose the sigmoid (p. 17)
– how to set target values (classification) (p. 19)
– how to initialize the weights (p. 20)
– how to pick the learning rates (p. 20).
Additional issues discussed include the effectiveness of momentum and the choice between radial basis units and sigmoid units (p. 21).
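Coming back to the first item in the list above (normalizing, decorrelating, and scaling the inputs), the following minimal NumPy sketch shows one common way to realize it. It is our own illustration, not code from the chapter: the function name, the PCA-based decorrelation, and the small eps constant are assumptions made here; the chapter itself motivates each step on p. 16.

import numpy as np

def preprocess_inputs(X, eps=1e-8):
    """Center, decorrelate (via PCA), and rescale a matrix of training
    inputs X with one example per row. Returns the transformed inputs."""
    X = X - X.mean(axis=0)                 # shift each input variable to zero mean
    cov = np.cov(X, rowvar=False)          # covariance of the centered inputs
    eigval, eigvec = np.linalg.eigh(cov)   # eigen-decomposition of the symmetric covariance
    X = X @ eigvec                         # rotate: components are now decorrelated
    X = X / np.sqrt(eigval + eps)          # scale each component to unit variance
    return X

The same transformation (mean, rotation, scaling) learned on the training set would of course have to be stored and re-applied to test inputs.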
Chapter 1 then introduces us to a little of the theory, providing deeper understanding of some of the preceding tricks. Included are discussions of the effect of learning rates on the speed of learning and of the relationship between the Hessian matrix, the error surface, and the learning rates. Simple examples of linear and multilayer nets are provided to illustrate the theoretical results.

The chapter next enters more difficult territory by giving an overview of second-order methods (p. 31). Quickly summarized here, they are:
Newton method: generally impractical to use since it requires inverting the full Hessian and works only in batch mode.

Conjugate gradient: an O(N) algorithm that does not use the Hessian, but requires a line search and so works only in batch mode.

Quasi-Newton (Broyden-Fletcher-Goldfarb-Shanno, BFGS) method: an O(N^2) algorithm that computes an estimate of the inverse Hessian. It requires a line search and also works only in batch mode.

Gauss-Newton method: an O(N^3) algorithm that uses the square Jacobian approximation of the Hessian. Mainly used in batch mode and works only for mean squared error loss functions.

Levenberg-Marquardt method: extends the Gauss-Newton method to include a regularization parameter for stability.
Second-order methods can greatly speed learning at each iteration but often at an excessive computational cost. However, by replacing the exact Hessian with an approximation of either the full or partial Hessian, the benefits of second-order information can still be reaped without incurring as great a computational cost.

The first and most direct method for estimating the full Hessian is finite differences, which requires little more than two backpropagations to compute each row of the Hessian (p. 35). Another is to use the square Jacobian approximation, which guarantees a positive semi-definite matrix and may be beneficial for improving stability. If even more simplification is desired, one can just compute the diagonal elements of the Hessian. All of the methods mentioned here are easily implemented using BP.

Unfortunately, for very large networks, many of the classical second-order methods do not work well because storing the Hessian is far too expensive and because batch mode, required by most of the methods, is too slow. On-line second-order methods are needed instead. One such technique presented here is a stochastic diagonal Levenberg-Marquardt method (p. 40).

If all that is needed is the product of the Hessian with an arbitrary vector rather than the Hessian itself, then much time can be saved using a method that computes this entire product directly using only a single backpropagation step (p. 37). Such a technique can be used to compute the largest eigenvalue and associated eigenvector of the Hessian. The inverse of the largest eigenvalue can then be used to obtain a good estimate of the learning rate.
Finally, three useful tricks are presented for computing the principal eigenvalue and vector without having to compute the Hessian: the power method, Taylor expansion, and an on-line method (p. 42).
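To illustrate the first of these tricks, here is a minimal NumPy sketch of the power method for the principal eigenvalue of the Hessian. It is our own illustration, not code from the chapter: the chapter's trick obtains each Hessian-vector product with a single extra backpropagation, whereas for a self-contained example we approximate it by a finite difference of two gradients; the function names and the quadratic test cost are assumptions made here.

import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-5):
    """Approximate H v by a finite difference of gradients:
    (grad(w + eps*v) - grad(w)) / eps, at the cost of two gradient evaluations."""
    return (grad(w + eps * v) - grad(w)) / eps

def estimate_max_eigenvalue(grad, w, n_iter=20, seed=0):
    """Power method on the Hessian using only Hessian-vector products.
    Returns an estimate of the largest eigenvalue and its eigenvector."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        hv = hessian_vector_product(grad, w, v)
        lam = np.linalg.norm(hv)          # eigenvalue estimate
        v = hv / (lam + 1e-12)            # re-normalize the direction
    return lam, v

# Toy check with a quadratic cost 0.5 * w^T A w, whose Hessian is A:
A = np.diag([1.0, 4.0, 9.0])
grad = lambda w: A @ w
lam_max, _ = estimate_max_eigenvalue(grad, np.ones(3))
eta = 1.0 / lam_max        # learning rate suggested by the inverse of the largest eigenvalue
print(f"largest eigenvalue ~ {lam_max:.2f}, eta ~ {eta:.3f}")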
Jenny & Klaus
1 Efficient BackProp

Yann A. LeCun¹, Léon Bottou¹, Genevieve B. Orr², and Klaus-Robert Müller³

¹ Image Processing Research Department, AT&T Labs – Research, 100 Schulz Drive, Red Bank, NJ 07701-7033, USA
² Willamette University, 900 State Street, Salem, OR 97301, USA
³ GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
{yann,leonb}@research.att.com, gorr@willamette.edu, klaus@first.gmd.de

Abstract. The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.

Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.

1.1 Introduction

Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number and types of nodes, layers, learning rates, training and test sets, and so forth. These choices can be critical, yet there is no foolproof recipe for deciding them because they are largely problem and data dependent. However, there are heuristics and some underlying theory that can help guide a practitioner to make better choices.

In the first section below we introduce standard backpropagation and discuss a number of simple heuristics or tricks for improving its performance. We next discuss issues of convergence. We then describe a few "classical" second-order non-linear optimization techniques and show that their application to neural network training is very limited, despite many claims to the contrary in the literature. Finally, we present a few second-order methods that do accelerate learning in certain cases.
1.2 Learning and Generalization

There are several approaches to automatic machine learning, but most of the successful approaches can be categorized as gradient-based learning methods. The learning machine, as represented in Figure 1.1, computes a function M(Z^p, W) where Z^p is the p-th input pattern, and W represents the collection of adjustable parameters in the system. A cost function E^p = C(D^p, M(Z^p, W)) measures the discrepancy between D^p, the "correct" or desired output for pattern Z^p, and the output produced by the system. The average cost function E_train(W) is the average of the errors E^p over a set of input/output pairs called the training set {(Z^1, D^1), ..., (Z^P, D^P)}. In the simplest setting, the learning problem consists in finding the value of W that minimizes E_train(W). In practice, the performance of the system on the training set is of little interest. The more relevant measure is the error rate of the system in the field, where it would be used in practice. This performance is estimated by measuring the accuracy on a set of samples disjoint from the training set, called the test set. The most commonly used cost function is the Mean Squared Error:

    E^p = \frac{1}{2}\,(D^p - M(Z^p, W))^2, \qquad E_{\mathrm{train}} = \frac{1}{P} \sum_{p=1}^{P} E^p

[Figure: a learning machine M(Z, W) maps inputs Z^0, Z^1, ..., Z^p to outputs; a cost function compares each output with the desired output D^0, D^1, ..., D^p and produces the errors E^0, E^1, ..., E^p, which drive the adjustment of the parameters W.]
Fig. 1.1. Gradient-based learning machine
This chapter is focused on strategies for improving the process of minimizing the cost function. However, these strategies must be used in conjunction with methods for maximizing the network's ability to generalize, that is, to predict the correct targets for patterns the learning system has not previously seen (e.g. see chapters 2, 3, 4, 5 for more detail).

To understand generalization, let us consider how backpropagation works. We start with a set of samples each of which is an input/output pair of the function to be learned. Since the measurement process is often noisy, there may be errors in the samples. We can imagine that if we collected multiple sets of samples then each set would look a little different because of the noise and because of the different points sampled. Each of these data sets would also result in networks with minima that are slightly different from each other and from the true function. In this chapter, we concentrate on improving the process of finding the minimum for the particular set of examples that we are given. Generalization techniques try to correct for the errors introduced into the network as a result of our choice of dataset. Both are important.

Several theoretical efforts have analyzed the process of learning by minimizing the error on a training set (a process sometimes called Empirical Risk Minimization) [40, 41].

Some of those theoretical analyses are based on decomposing the generalization error into two terms: bias and variance (see e.g. [12]). The bias is a measure of how much the network output, averaged over all possible data sets, differs from the desired function. The variance is a measure of how much the network output varies between datasets. Early in training, the bias is large because the network output is far from the desired function. The variance is very small because the data has had little influence yet. Late in training, the bias is small because the network has learned the underlying function. However, if trained too long, the network will also have learned the noise specific to that dataset. This is referred to as overtraining. In such a case, the variance will be large because the noise varies between datasets. It can be shown that the minimum total error will occur when the sum of bias and variance is minimal.

There are a number of techniques (e.g. early stopping, regularization) for maximizing the generalization ability of a network when using backprop. Many of these techniques are described in later chapters 2, 3, 4, 5.

The idea of this chapter, therefore, is to present minimization strategies (given a cost function) and the tricks associated with increasing the speed and quality of the minimization. It is however clear that the choice of the model (model selection), the architecture and the cost function is crucial for obtaining a network that generalizes well. So keep in mind that if the wrong model class is used and no proper model selection is done, then even a superb minimization will clearly not help very much. In fact, the existence of overtraining has led several authors to suggest that inaccurate minimization algorithms can be better than good ones.
1.3 Standard Backpropagation

Although the tricks and analyses in this paper are primarily presented in the context of "classical" multi-layer feed-forward neural networks, many of them also apply to most other gradient-based learning methods.

The simplest form of multilayer learning machine trained with gradient-based learning is simply a stack of modules, each of which implements a function X_n = F_n(W_n, X_{n-1}), where X_n is a vector representing the output of the module, W_n is the vector of tunable parameters in the module (a subset of W), and X_{n-1} is the module's input vector (as well as the previous module's output vector). The input X_0 to the first module is the input pattern Z^p. If the partial derivative of E^p with respect to X_n is known, then the partial derivatives of E^p with respect to W_n and X_{n-1} can be computed using the backward recurrence

    \frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1}) \, \frac{\partial E^p}{\partial X_n}
    \frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1}) \, \frac{\partial E^p}{\partial X_n}        (1.1)

where \frac{\partial F}{\partial W}(W_n, X_{n-1}) is the Jacobian of F with respect to W evaluated at the point (W_n, X_{n-1}), and \frac{\partial F}{\partial X}(W_n, X_{n-1}) is the Jacobian of F with respect to X. The Jacobian of a vector function is a matrix containing the partial derivatives of all the outputs with respect to all the inputs. When the above equations are applied to the modules in reverse order, from layer N to layer 1, all the partial derivatives of the cost function with respect to all the parameters can be computed. This way of computing gradients is known as back-propagation.

Traditional multi-layer neural networks are a special case of the above system where the modules are alternated layers of matrix multiplications (the weights) and component-wise sigmoid functions (the units):

    Y_n = W_n X_{n-1}        (1.2)
    X_n = F(Y_n)             (1.3)

where W_n is a matrix whose number of columns is the dimension of X_{n-1}, and number of rows is the dimension of X_n. F is a vector function that applies a sigmoid function to each component of its input. Y_n is the vector of weighted sums, or total inputs, to layer n.

Applying the chain rule to the equations above, the classical backpropagation equations are obtained:

    \frac{\partial E^p}{\partial y_n^i} = f'(y_n^i) \, \frac{\partial E^p}{\partial x_n^i}                      (1.4)
    \frac{\partial E^p}{\partial w_n^{ij}} = x_{n-1}^j \, \frac{\partial E^p}{\partial y_n^i}                   (1.5)
    \frac{\partial E^p}{\partial x_{n-1}^k} = \sum_i w_n^{ik} \, \frac{\partial E^p}{\partial y_n^i}.           (1.6)

The above equations can also be written in matrix form:

    \frac{\partial E^p}{\partial Y_n} = F'(Y_n) \, \frac{\partial E^p}{\partial X_n}            (1.7)
    \frac{\partial E^p}{\partial W_n} = X_{n-1} \, \frac{\partial E^p}{\partial Y_n}            (1.8)
    \frac{\partial E^p}{\partial X_{n-1}} = W_n^T \, \frac{\partial E^p}{\partial Y_n}.         (1.9)

The simplest learning (minimization) procedure in such a setting is the gradient descent algorithm where W is iteratively adjusted as follows:

    W(t) = W(t-1) - \eta \, \frac{\partial E}{\partial W}.        (1.10)
In the simplest case, η is a scalar constant. More sophisticated procedures use variable η. In other methods η takes the form of a diagonal matrix, or is an estimate of the inverse Hessian matrix of the cost function (second derivative matrix) such as in the Newton and Quasi-Newton methods described later in the chapter. A proper choice of η is important and will be discussed at length later.
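As a concrete reading of equations (1.2)-(1.10), the following NumPy fragment performs one forward pass, one backward pass, and one gradient descent step for a two-layer sigmoid network trained with the mean squared error of Sect. 1.2. It is a minimal sketch of our own, not code from the chapter: the layer sizes, the random data, and the logistic nonlinearity are chosen only for illustration.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 3))    # first layer weights
W2 = rng.normal(scale=0.5, size=(1, 4))    # second layer weights
eta = 0.1                                  # scalar learning rate (eq. 1.10)

Z = rng.normal(size=3)                     # one input pattern Z^p
D = np.array([1.0])                        # desired output D^p

# Forward pass (eqs. 1.2-1.3): Y_n = W_n X_{n-1}, X_n = F(Y_n)
X0 = Z
Y1 = W1 @ X0; X1 = sigmoid(Y1)
Y2 = W2 @ X1; X2 = sigmoid(Y2)
E = 0.5 * np.sum((D - X2) ** 2)            # cost E^p = 1/2 (D^p - M(Z^p, W))^2

# Backward pass (eqs. 1.7-1.9), using f'(y) = f(y)(1 - f(y)) for the sigmoid
dE_dX2 = X2 - D                            # derivative of the squared-error cost
dE_dY2 = X2 * (1 - X2) * dE_dX2            # eq. 1.7: multiply by F'(Y_n) elementwise
dE_dW2 = np.outer(dE_dY2, X1)              # eq. 1.8: gradient w.r.t. the layer weights
dE_dX1 = W2.T @ dE_dY2                     # eq. 1.9: error pushed one layer down
dE_dY1 = X1 * (1 - X1) * dE_dX1
dE_dW1 = np.outer(dE_dY1, X0)

# Gradient descent update (eq. 1.10)
W2 -= eta * dE_dW2
W1 -= eta * dE_dW1

Each backward line mirrors the recurrence: the error signal with respect to a layer's total input is obtained from the error with respect to its output, turned into a weight gradient using the layer's input, and propagated to the layer below through the transposed weights.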
1.4 A Few Practical Tricks

Backpropagation can be very slow, particularly for multilayered networks where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions. There is no formula to guarantee that (1) the network will converge to a good solution, (2) convergence is swift, or (3) convergence even occurs at all. However, in this section we discuss a number of tricks that can greatly improve the chances of finding a good solution while also decreasing the convergence time, often by orders of magnitude. More detailed theoretical justifications will be given in later sections.

1.4.1 Stochastic versus Batch Learning

At each iteration, equation (1.10) requires a complete pass through the entire dataset in order to compute the average or true gradient. This is referred to as batch learning since an entire "batch" of data must be considered before weights are updated. Alternatively, one can use stochastic (online) learning where a single example {Z^t, D^t} is chosen (e.g. randomly) from the training set at each iteration t. An estimate of the true gradient is then computed based on the error E^t of that example, and then the weights are updated:

    W(t+1) = W(t) - \eta \, \frac{\partial E^t}{\partial W}.        (1.11)
Because this estimate of the gradient is noisy, the weights may not move precisely down the gradient at each iteration. As we shall see, this "noise" at each iteration can be advantageous. Stochastic learning is generally the preferred method for basic backpropagation for the following three reasons:

Advantages of Stochastic Learning
1. Stochastic learning is usually much faster than batch learning.
2. Stochastic learning also often results in better solutions.
3. Stochastic learning can be used for tracking changes.

Stochastic learning is most often much faster than batch learning, particularly on large redundant datasets. The reason for this is simple to show. Consider the simple case where a training set of size 1000 is inadvertently composed of 10 identical copies of a set with 100 samples. Averaging the gradient over all 1000 patterns gives the exact same result as computing the gradient based on just the first 100. Thus, batch gradient descent is wasteful because it recomputes the same quantity 10 times before one parameter update. On the other hand, stochastic gradient will see a full epoch as 10 iterations through a 100-long training set. In practice, examples rarely appear more than once in a dataset, but there are usually clusters of patterns that are very similar. For example, in phoneme classification, all of the patterns for the phoneme /æ/ will (hopefully) contain much of the same information. It is this redundancy that can make batch learning much slower than on-line learning.
Stochastic learning also often results in better solutions because of the noise in the updates. Nonlinear networks usually have multiple local minima of differing depths. The goal of training is to locate one of these minima. Batch learning will discover the minimum of whatever basin the weights are initially placed. In stochastic learning, the noise present in the updates can result in the weights jumping into the basin of another, possibly deeper, local minimum. This has been demonstrated in certain simplified cases [15, 30].
|
||
Stochastic learning is also useful when the function being modeled is changing over time, a quite common scenario in industrial applications where the data distribution changes gradually over time (e.g. due to wear and tear of the machines). If the learning machine does not detect and follow the change it is impossible to learn the data properly and large generalization errors will result. With batch learning, changes go undetected and we obtain rather bad results since we are likely to average over several rules, whereas on-line learning – if operated properly (see below in section 1.4.7) – will track the changes and yield good approximation results.
|
||
Despite the advantages of stochastic learning, there are still reasons why one might consider using batch learning:
|
||
|
||
Advantages of Batch Learning 1. Conditions of convergence are well understood. 2. Many acceleration techniques (e.g. conjugate gradient) only op-
|
||
erate in batch learning. 3. Theoretical analysis of the weight dynamics and convergence
|
||
rates are simpler.
|
||
|
||
These advantages stem from the same noise that makes stochastic learning advantageous. This noise, which is so critical for finding better local minima, also prevents full convergence to the minimum. Instead of converging to the exact minimum, the convergence stalls out due to the weight fluctuations. The size of the fluctuations depends on the degree of noise of the stochastic updates. The variance of the fluctuations around the local minimum is proportional to the learning rate η [28, 27, 6]. So in order to reduce the fluctuations we can either decrease (anneal) the learning rate or have an adaptive batch size. In theory [13, 30, 36, 35] it is shown that the optimal annealing schedule of the learning rate is of the form

η ∼ c/t,    (1.12)

where t is the number of patterns presented and c is a constant. In practice, this may be too fast (see chapter 13).
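As an illustration, the following minimal sketch wraps such a c/t annealing schedule around a stochastic update; the function and parameter names (grad_fn, eta0, c) are hypothetical placeholders and the constants would have to be tuned per problem.

def sgd_with_annealing(w, grad_fn, samples, c=1.0, eta0=0.1):
    # Stochastic gradient descent with the eta ~ c/t annealing schedule of Eq. (1.12),
    # capped by an initial rate eta0 so that the very first steps are not too large.
    for t, x in enumerate(samples, start=1):
        eta = min(eta0, c / t)
        w = w - eta * grad_fn(w, x)   # one update per presented pattern
    return w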
Another method to remove noise is to use “mini-batches”, that is, start with a small batch size and increase the size as training proceeds. Møller discusses one method for doing this [25] and Orr [31] discusses this for linear problems. However, deciding the rate at which to increase the batch size and which inputs to include in the small batches is as difficult as determining the proper learning rate. Effectively the size of the learning rate in stochastic learning corresponds to the respective size of the mini batch.
Note also that the problem of removing the noise in the data may be less critical than one thinks because of generalization. Overtraining may occur long before the noise regime is even reached.
Another advantage of batch training is that one is able to use second order methods to speed the learning process. Second order methods speed learning by estimating not just the gradient but also the curvature of the cost surface. Given the curvature, one can estimate the approximate location of the actual minimum.
Despite the advantages of batch updates, stochastic learning is still often the preferred method particularly when dealing with very large data sets because it is simply much faster.
1.4.2 Shuffling the Examples
Networks learn the fastest from the most unexpected sample. Therefore, it is advisable to choose a sample at each iteration that is the most unfamiliar to the system. Note, this applies only to stochastic learning since the order of input presentation is irrelevant for batch1. Of course, there is no simple way to know which inputs are information rich, however, a very simple trick that crudely implements this idea is to simply choose successive examples that are from different classes since training examples belonging to the same class will most likely contain similar information.
Another heuristic for judging how much new information a training example contains is to examine the error between the network output and the target value when this input is presented. A large error indicates that this input has not been learned by the network and so contains a lot of new information. Therefore, it makes sense to present this input more frequently. Of course, by “large” we mean relative to all of the other training examples. As the network trains, these relative errors will change and so should the frequency of presentation for a particular input pattern. A method that modifies the probability of appearance of each pattern is called an emphasizing scheme.
Choose Examples with Maximum Information Content
1. Shuffle the training set so that successive training examples never (rarely) belong to the same class.
2. Present input examples that produce a large error more frequently than examples that produce a small error.
1 The order in which gradients are summed in batch may be affected by roundoff error if there is a significant range of gradient values.
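As an illustration of the first trick, here is a minimal sketch that orders a training set so that successive examples (almost) never share a class; the array names and the use of NumPy are assumptions, and an emphasizing scheme would additionally have to reweight how often large-error patterns are drawn.

import numpy as np

def interleave_classes(inputs, labels, seed=0):
    # Build a presentation order that cycles over the classes, taking one
    # (randomly chosen) example of each class in turn.
    rng = np.random.default_rng(seed)
    pools = {c: list(rng.permutation(np.flatnonzero(labels == c))) for c in np.unique(labels)}
    order = []
    while any(pools.values()):
        for c in pools:
            if pools[c]:
                order.append(pools[c].pop())
    return inputs[order], labels[order]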
However, one must be careful when perturbing the normal frequencies of input examples because this changes the relative importance that the network places on different examples. This may or may not be desirable. For example, this technique applied to data containing outliers can be disastrous because outliers can produce large errors yet should not be presented frequently. On the other hand, this technique can be particularly beneficial for boosting the performance for infrequently occurring inputs, e.g. /z/ in phoneme recognition (see chapter 13, 14).
1.4.3 Normalizing the Inputs
Convergence is usually faster if the average of each input variable over the training set is close to zero. To see this, consider the extreme case where all the inputs are positive. Weights to a particular node in the first weight layer are updated by an amount proportional to δx where δ is the (scalar) error at that node and x is the input vector (see equations (1.5) and (1.10)). When all of the components of an input vector are positive, all of the updates of weights that feed into a node will be the same sign (i.e. sign(δ)). As a result, these weights can only all decrease or all increase together for a given input pattern. Thus, if a weight vector must change direction it can only do so by zigzagging which is inefficient and thus very slow.
In the above example, the inputs were all positive. However, in general, any shift of the average input away from zero will bias the updates in a particular direction and thus slow down learning. Therefore, it is good to shift the inputs so that the average over the training set is close to zero. This heuristic should be applied at all layers which means that we want the average of the outputs of a node to be close to zero because these outputs are the inputs to the next layer [19], chapter 10. This problem can be addressed by coordinating how the inputs are transformed with the choice of sigmoidal activation function. Here we discuss the input transformation. The discussion of the sigmoid follows.
Convergence is faster not only if the inputs are shifted as described above but also if they are scaled so that all have about the same covariance, C_i, where

C_i = (1/P) Σ_{p=1}^{P} (z_i^p)².    (1.13)
Here, P is the number of training examples, C_i is the covariance of the ith input variable and z_i^p is the ith component of the pth training example. Scaling speeds learning because it helps to balance out the rate at which the weights connected to the input nodes learn. The value of the covariance should be matched with that of the sigmoid used. For the sigmoid given below, a covariance of 1 is a good choice.
The exception to scaling all covariances to the same value occurs when it is known that some inputs are of less significance than others. In such a case, it can be beneficial to scale the less significant inputs down so that they are “less visible” to the learning process.
Transforming the Inputs
1. The average of each input variable over the training set should be close to zero.
2. Scale input variables so that their covariances are about the same.
3. Input variables should be uncorrelated if possible.
The above two tricks of shifting and scaling the inputs are quite simple to implement. Another trick that is quite effective but more difficult to implement is to decorrelate the inputs. Consider the simple network in Figure 1.2. If inputs are uncorrelated then it is possible to solve for the value of w1 that minimizes the error without any concern for w2, and vice versa. In other words, the two variables are independent (the system of equations is diagonal). With correlated inputs, one must solve for both simultaneously which is a much harder problem. Principal component analysis (also known as the Karhunen-Loeve expansion) can be used to remove linear correlations in inputs [10].
Inputs that are linearly dependent (the extreme case of correlation) may also produce degeneracies which may slow learning. Consider the case where one input is always twice the other input (z2 = 2z1). The network output is constant along lines W2 = v − (1/2)W1, where v is a constant. Thus, the gradient is zero along these directions (see Figure 1.2). Moving along these lines has absolutely no effect on learning. We are trying to solve in 2-D what is effectively only a 1-D problem. Ideally we want to remove one of the inputs which will decrease the size of the network.
Figure 1.3 shows the entire process of transforming inputs. The steps are (1) shift inputs so the mean is zero, (2) decorrelate inputs, and (3) equalize covariances.
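A minimal sketch of these three steps on a data matrix X whose rows are training examples is given below; the use of PCA (eigendecomposition of the covariance) for the decorrelation step and the small eps guard against zero eigenvalues are implementation choices, not prescriptions from the text.

import numpy as np

def transform_inputs(X, eps=1e-8):
    # (1) mean cancellation: shift each input variable to zero mean
    X = X - X.mean(axis=0)
    # (2) KL expansion / PCA: rotate onto the eigenvectors of the covariance (decorrelation)
    eigval, eigvec = np.linalg.eigh(X.T @ X / len(X))
    X = X @ eigvec
    # (3) covariance equalization: scale every decorrelated direction to unit covariance
    return X / np.sqrt(eigval + eps)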
Fig. 1.2. Linearly dependent inputs
Fig. 1.3. Transformation of inputs (mean cancellation, KL expansion, covariance equalization)

1.4.4 The Sigmoid

Nonlinear activation functions are what give neural networks their nonlinear capabilities. One of the most common forms of activation function is the sigmoid which is a monotonically increasing function that asymptotes at some finite value as ±∞ is approached. The most common examples are the standard logistic function f(x) = 1/(1 + e^{-x}) and hyperbolic tangent f(x) = tanh(x) shown in Figure 1.4. Sigmoids that are symmetric about the origin (e.g. see Figure 1.4b) are preferred for the same reason that inputs should be normalized, namely, because they are more likely to produce outputs (which are inputs to the next layer) that are on average close to zero. This is in contrast, say, to the logistic function whose outputs are always positive and so must have a mean that is positive.
Fig. 1.4. (a) Not recommended: the standard logistic function, f(x) = 1/(1 + e^{-x}). (b) Hyperbolic tangent, f(x) = 1.7159 tanh(2x/3).
Sigmoids
1. Symmetric sigmoids such as hyperbolic tangent often converge faster than the standard logistic function.
2. A recommended sigmoid [19] is: f(x) = 1.7159 tanh(2x/3). Since the tanh function is sometimes computationally expensive, an approximation of it by a ratio of polynomials can be used instead.
3. Sometimes it is helpful to add a small linear term, e.g. f(x) = tanh(x) + ax so as to avoid flat spots.
The constants in the recommended sigmoid given above have been chosen so that, when used with transformed inputs (see previous discussion), the variance of the outputs will also be close to 1 because the effective gain of the sigmoid is roughly 1 over its useful range. In particular, this sigmoid has the properties
(a) f (±1) = ±1, (b) the second derivative is a maximum at x = 1, and (c) the effective gain is close to 1.
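A minimal sketch of this recommended sigmoid, with the optional small linear term, and a numerical check of property (a):

import numpy as np

def sigmoid(x, a=0.0):
    # f(x) = 1.7159 tanh(2x/3), optionally plus a small linear ("twisting") term a*x
    return 1.7159 * np.tanh(2.0 * x / 3.0) + a * x

print(sigmoid(1.0), sigmoid(-1.0))   # approximately +1 and -1, i.e. f(+-1) = +-1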
One of the potential problems with using symmetric sigmoids is that the error surface can be very flat near the origin. For this reason it is good to avoid initializing with very small weights. Because of the saturation of the sigmoids, the error surface is also flat far from the origin. Adding a small linear term to the sigmoid can sometimes help avoid the flat regions (see chapter 9).
1.4.5 Choosing Target Values
In classification problems, target values are typically binary (e.g. {-1,+1}). Common wisdom might seem to suggest that the target values be set at the value of the sigmoid’s asymptotes. However, this has several drawbacks.
First, instabilities can result. The training process will try to drive the output as close as possible to the target values, which can only be achieved asymptotically. As a result, the weights (output and even hidden) are driven to larger and larger values where the sigmoid derivative is close to zero. The very large weights increase the gradients, however, these gradients are then multiplied by an exponentially small sigmoid derivative (except when a twisting term2 is added to the sigmoid) producing a weight update close to zero. As a result, the weights may become stuck.
Second, when the outputs saturate, the network gives no indication of confidence level. When an input pattern falls near a decision boundary the output class is uncertain. Ideally this should be reflected in the network by an output value that is in between the two possible target values, i.e. not near either asymptote. However, large weights tend to force all outputs to the tails of the sigmoid regardless of the uncertainty. Thus, the network may predict a wrong class without giving any indication of its low confidence in the result. Large weights that saturate the nodes make it impossible to differentiate between typical and nontypical examples.
A solution to these problems is to set the target values to be within the range of the sigmoid, rather than at the asymptotic values. Care must be taken, however, to insure that the node is not restricted to only the linear part of the sigmoid. Setting the target values to the point of the maximum second derivative on the sigmoid is the best way to take advantage of the nonlinearity without saturating the sigmoid. This is another reason the sigmoid in Figure 1.4b is a good choice. It has maximum second derivative at ±1 which correspond to the binary target values typical in classification problems.
Targets
Choose target values at the point of the maximum second derivative on the sigmoid so as to avoid saturating the output units.
2 A twisting term is a small linear term added to the node output, e.g. f (x) = tanh(x) + ax.
1.4.6 Initializing the Weights
The starting values of the weights can have a significant effect on the training process. Weights should be chosen randomly but in such a way that the sigmoid is primarily activated in its linear region. If weights are all very large then the sigmoid will saturate resulting in small gradients that make learning slow. If weights are very small then gradients will also be very small. Intermediate weights that range over the sigmoid's linear region have the advantage that (1) the gradients are large enough that learning can proceed and (2) the network will learn the linear part of the mapping before the more difficult nonlinear part.

Achieving this requires coordination between the training set normalization, the choice of sigmoid, and the choice of weight initialization. We start by requiring that the distribution of the outputs of each node have a standard deviation (σ) of approximately 1. This is achieved at the input layer by normalizing the training set as described earlier. To obtain a standard deviation close to 1 at the output of the first hidden layer we just need to use the above recommended sigmoid together with the requirement that the input to the sigmoid also have a standard deviation σ_y = 1. Assuming the inputs, y_i, to a unit are uncorrelated with variance 1, the standard deviation of the unit's weighted sum will be

σ_{y_i} = ( Σ_j w_ij² )^{1/2}.    (1.14)
Thus, to insure that the σyi are approximately 1 the weights should be randomly drawn from a distribution with mean zero and a standard deviation given by
σ_w = m^{-1/2}    (1.15)
where m is the number of inputs to the unit.
Initializing Weights
Assuming that:
1. the training set has been normalized, and
2. the sigmoid from Figure 1.4b has been used,
then weights should be randomly drawn from a distribution (e.g. uniform) with mean zero and standard deviation

σ_w = m^{-1/2}    (1.16)

where m is the fan-in (the number of connections feeding into the node).
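A minimal sketch of this initialization for one fully connected layer follows; the uniform distribution is one possible choice, and its bounds are set so that the standard deviation equals m^{-1/2}.

import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    # Draw weights with mean zero and standard deviation sigma_w = fan_in**(-1/2), Eq. (1.16).
    # A uniform distribution on (-limit, limit) has standard deviation limit/sqrt(3).
    sigma = fan_in ** -0.5
    limit = np.sqrt(3.0) * sigma
    return np.random.default_rng(seed).uniform(-limit, limit, size=(fan_out, fan_in))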
1.4.7 Choosing Learning Rates
There is at least one well-principled method (described in section 1.9.2) for estimating the ideal learning rate η. Many other schemes (most of them rather empirical) have been proposed in the literature to automatically adjust the learning
rate. Most of those schemes decrease the learning rate when the weight vector “oscillates”, and increase it when the weight vector follows a relatively steady direction. The main problem with these methods is that they are not appropriate for stochastic gradient or on-line learning because the weight vector fluctuates all the time.
Beyond choosing a single global learning rate, it is clear that picking a different learning rate ηi for each weight can improve the convergence. A well-principled way of doing this, based on computing second derivatives, is described in section 1.9.1. The main philosophy is to make sure that all the weights in the network converge roughly at the same speed.
Depending upon the curvature of the error surface, some weights may require a small learning rate in order to avoid divergence, while others may require a large learning rate in order to converge at a reasonable speed. Because of this, learning rates in the lower layers should generally be larger than in the higher layers (see Figure 1.21). This corrects for the fact that in most neural net architectures, the second derivative of the cost function with respect to weights in the lower layers is generally smaller than that of the higher layers. The rationale for the above heuristics will be discussed in more detail in later sections along with suggestions for how to choose the actual value of the learning rate for the different weights (see section 1.9.1).
If shared weights are used such as in time-delay neural networks (TDNN) [42] or convolutional networks [20], the learning rate should be proportional to the square root of the number of connections sharing that weight, because we know that the gradients are a sum of more-or-less independent terms.
Equalize the Learning Speeds
– give each weight its own learning rate
– learning rates should be proportional to the square root of the number of inputs to the unit
– learning rates in lower layers should typically be larger than in the higher layers
Other tricks for improving the convergence include:
Momentum. Momentum,

Δw(t+1) = η ∂E_{t+1}/∂w + μ Δw(t),

can increase speed when the cost surface is highly nonspherical because it damps the size of the steps along directions of high curvature thus yielding a larger effective learning rate along the directions of low curvature [43] (μ denotes the strength of the momentum term). It has been claimed that momentum generally helps more in batch mode than in stochastic mode, but no systematic study of this is known to the authors.
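A minimal sketch of one momentum update is given below, written in the usual descent convention (the step is taken opposite to the gradient); the variable names are placeholders.

def momentum_step(w, velocity, grad, eta=0.01, mu=0.9):
    # velocity accumulates mu * (previous step) plus the new gradient step;
    # mu is the strength of the momentum term, eta the learning rate
    velocity = mu * velocity - eta * grad
    return w + velocity, velocity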
Adaptive Learning Rates. Many authors, including Sompolinsky et al. [37], Darken & Moody [9], Sutton [38], Murata et al. [28] have proposed rules for automatically adapting the learning rates (see also [16]). These rules control the speed of convergence by increasing or decreasing the learning rate based on the error.
We assume the following facts for a learning rate adaptation algorithm: (1) the smallest eigenvalue of the Hessian (see Eq.(1.27)) is sufficiently smaller than the second smallest eigenvalue and (2) therefore after a large number of iterations, the parameter vector w(t) will approach the minimum from the direction of the minimum eigenvector of the Hessian (see Eq.(1.27), Figure 1.5). Under these conditions the evolution of the estimated parameter can be thought of as a one-dimensional process and the minimum eigenvector v can be approximated (for a large number of iterations: see Figure 1.5) by
v = (∂E/∂w) / ‖∂E/∂w‖,

where ‖·‖ denotes the L2 norm. Hence we can adopt a projection

ξ = v^T (∂E/∂w) = ‖∂E/∂w‖
to the approximated minimum Eigenvector v as a one dimensional measure of the distance to the minimum. This distance can be used to control the learning rate (for details see [28])
w(t + 1) = w(t) − η(t) ∂E_t/∂w,    (1.17)
r(t + 1) = (1 − δ) r(t) + δ ∂E_t/∂w,   (0 < δ < 1)    (1.18)
η(t + 1) = η(t) + α η(t) ( β ‖r(t + 1)‖ − η(t) ),    (1.19)
where δ controls the leak size of the average, α, β are constants and r is used as auxiliary variable to calculate the leaky average of the gradient ∂E/∂w.
Note that this set of rules is easy to compute and straightforward to implement. We simply have to keep track of an additional vector in Eq.(1.18): the averaged gradient r. The norm of this vector then controls the size of the learning rate (see Eq.(1.19)). The algorithm follows the simple intuition: far away from the minimum (large distance ξ) it proceeds in big steps and close to the minimum it anneals the learning rate (for theoretical details see [28]).
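A minimal sketch of one iteration of these rules (Eqs. (1.17)-(1.19)) follows; the constants delta, alpha and beta are placeholders that would need tuning.

import numpy as np

def adaptive_lr_step(w, r, eta, grad, delta=0.1, alpha=0.01, beta=1.0):
    w = w - eta * grad                                          # Eq. (1.17): parameter update
    r = (1.0 - delta) * r + delta * grad                        # Eq. (1.18): leaky average of the gradient
    eta = eta + alpha * eta * (beta * np.linalg.norm(r) - eta)  # Eq. (1.19): adapt the learning rate
    return w, r, eta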
1.4.8 Radial Basis Functions vs Sigmoid Units
Although most systems use nodes based on dot products and sigmoids, many other types of units (or layers) can be used. A common alternative is the radial basis function (RBF) network (see [7, 26, 5, 32]). In RBF networks, the dot product of the weight and input vector is replaced with a Euclidean distance
Fig. 1.5. Convergence of the flow. During the final stage of learning the average flow is approximately one dimensional towards the minimum w∗ and it is a good approximation of the minimum eigenvalue direction of the Hessian.
between the input and weight and the sigmoid is replaced by an exponential. The output activity is computed, e.g. for one output, as
g(x) = Σ_{i=1}^{N} w_i exp( −‖x − ν_i‖² / (2σ_i²) ),
where νi (σi) is the mean (standard deviation) of the i-th Gaussian. These units can replace or coexist with the standard units and they are usually trained by combination of gradient descent (for output units) and unsupervised clustering for determining the means and widths of the RBF units.
Unlike sigmoidal units which can cover the entire space, a single RBF unit covers only a small local region of the input space. This can be an advantage because learning can be faster. RBF units may also form a better set of basis functions to model the input space than sigmoid units, although this is highly problem dependent (see chapter 7). On the negative side, the locality property of RBFs may be a disadvantage particularly in high dimensional spaces because many units are needed to cover the space. RBFs are more appropriate in (low dimensional) upper layers and sigmoids in (high dimensional) lower layers.
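A minimal sketch of the output computation given above for a single output unit; parameter names follow the equation, and the clustering used to set the means and widths is not shown.

import numpy as np

def rbf_output(x, w, nu, sigma):
    # g(x) = sum_i w_i exp(-||x - nu_i||^2 / (2 sigma_i^2)), with nu_i the centers
    dist2 = np.sum((nu - x) ** 2, axis=1)
    return np.sum(w * np.exp(-dist2 / (2.0 * sigma ** 2)))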
1.5 Convergence of Gradient Descent
1.5.1 A Little Theory
In this section we examine some of the theory behind the tricks presented earlier. We begin in one dimension where the update equation for gradient descent can be written as

W(t + 1) = W(t) − η dE(W)/dW.    (1.20)

We would like to know how the value of η affects convergence and the learning speed. Figure 1.6 illustrates the learning behavior for several different sizes of η
when the weight W starts out in the vicinity of a local minimum. In one dimension, it is easy to define the optimal learning rate, ηopt, as being the learning rate that will move the weight to the minimum, Wmin, in precisely one step (see Figure 1.6(i)b). If η is smaller than ηopt then the stepsize will be smaller and convergence will take multiple timesteps. If η is between ηopt and 2ηopt then the weight will oscillate around Wmin but will eventually converge (Figure 1.6(i)c). If η is more than twice the size of ηopt (Figure 1.6(i)d) then the stepsize is so large that the weight ends up farther from Wmin than before. Divergence results.
Fig. 1.6. Gradient descent for different learning rates
What is the optimal value of the learning rate ηopt? Let us first consider the case in 1-dimension. Assuming that E can be approximated by a quadratic function, ηopt can be derived by first expanding E in a Taylor series about the current weight, Wc:
E(W) = E(W_c) + (W − W_c) dE(W_c)/dW + (1/2)(W − W_c)² d²E(W_c)/dW² + … ,    (1.21)
where we use the shorthand dE(W_c)/dW ≡ dE/dW |_{W=W_c}. If E is quadratic the second order derivative is constant and the higher order terms vanish. Differentiating both sides with respect to W then gives
dE(W)/dW = dE(W_c)/dW + (W − W_c) d²E(W_c)/dW².    (1.22)
Setting W = W_min and noting that dE(W_min)/dW = 0, we are left after rearranging with
W_min = W_c − [ d²E(W_c)/dW² ]^{-1} dE(W_c)/dW.    (1.23)
Comparing this with the update equation (1.20), we find that we can reach a minimum in one step if

η_opt = [ d²E(W_c)/dW² ]^{-1}.    (1.24)
Perhaps an easier way to obtain this same result is illustrated in Figure 1.6(ii). The bottom graph plots the gradient of E as a function of W. Since E is quadratic, the gradient is simply a straight line with value zero at the minimum and ∂E(W_c)/∂W at the current weight W_c. ∂²E/∂W² is simply the slope of this line and is computed using the standard slope formula

∂²E/∂W² = ( ∂E(W_c)/∂W − 0 ) / ( W_c − W_min ).    (1.25)
Solving for W_min then gives equation (1.23).

While the learning rate that gives fastest convergence is η_opt, the largest learning rate that can be used without causing divergence is (also see Figure 1.6(i)d)

η_max = 2η_opt.    (1.26)
If E is not exactly quadratic then the higher order terms in equation (1.21) are not precisely zero and (1.23) is only an approximation. In such a case, it may take multiple iterations to locate the minimum even when using ηopt, however, convergence can still be quite fast.
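A minimal numerical sketch of this one-dimensional picture on the quadratic E(W) = (h/2)(W − Wmin)², for which ηopt = 1/h and ηmax = 2/h (the value h = 2 is an arbitrary choice):

def descend(eta, h=2.0, w_min=0.0, w=1.0, steps=20):
    # gradient descent on E(W) = h/2 (W - w_min)^2, whose gradient is h (W - w_min)
    for _ in range(steps):
        w = w - eta * h * (w - w_min)
    return w

eta_opt = 1.0 / 2.0                 # 1 / (d^2E/dW^2)
print(descend(0.5 * eta_opt))       # converges, but slowly
print(descend(eta_opt))             # reaches the minimum in one step
print(descend(1.5 * eta_opt))       # oscillates around the minimum, still converges
print(descend(2.5 * eta_opt))       # eta > 2 eta_opt: diverges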
In multiple dimensions, determining ηopt is a bit more difficult because the right side of (1.24) is a matrix H^{-1} where H is called the Hessian whose components are given by
H_ij ≡ ∂²E / (∂W_i ∂W_j)    (1.27)
with 1 ≤ i, j ≤ N, and N equal to the total number of weights. H is a measure of the curvature of E. In two dimensions, the lines of constant E for a quadratic cost are oval in shape as shown in Figure 1.7. The eigenvectors of H point in the directions of the major and minor axes. The eigenvalues measure the steepness of E along the corresponding eigendirection.
Example. In the least mean square (LMS) algorithm, we have a single layer linear network with error function

E(W) = (1/(2P)) Σ_{p=1}^{P} | d^p − Σ_i w_i x_i^p |²    (1.28)
where P is the number of training vectors. The Hessian in this case turns out to be the same as the covariance matrix of the inputs,
H = (1/P) Σ_p x^p (x^p)^T.    (1.29)
Fig. 1.7. Lines of constant E

Fig. 1.8. For the LMS algorithm, the eigenvectors and eigenvalues of H measure the spread of the inputs in input space
Thus, each eigenvalue of H is also a measure of the covariance or spread of the inputs along the corresponding eigendirection as shown in Figure 1.8.
Using a scalar learning rate is problematic in multiple dimensions. We want η to be large so that convergence is fast along the shallow directions of E (small eigenvalues of H), however, if η is too large the weights will diverge along the steep directions (large eigenvalues of H). To see this more specifically, let us again expand E, but this time about a minimum
E(W) ≈ E(W_min) + (1/2) (W − W_min)^T H(W_min) (W − W_min).    (1.30)
Differentiating (1.30) and using the result in the update equation (1.20) gives
W(t + 1) = W(t) − η ∂E(t)/∂W    (1.31)
         = W(t) − η H(W_min) (W(t) − W_min).    (1.32)

Subtracting W_min from both sides gives

(W(t + 1) − W_min) = (I − η H(W_min)) (W(t) − W_min).    (1.33)
If the prefactor (I − ηH(Wmin)) is a matrix transformation that always shrinks a vector (i.e. its eigenvalues all have magnitude less than 1) then the update equation will converge.
How does this help with choosing the learning rates? Ideally we want different learning rates along the different eigendirections. This is simple if the eigendirections are lined up with the coordinate axes of the weights. In such a case, the weights are uncoupled and we can assign each weight its own learning rate based on the corresponding eigenvalue. However, if the weights are coupled then we must first rotate H such that H is diagonal, i.e. the coordinate axes line up with the eigendirections (see Figure 1.7b). This is the purpose of diagonalizing the Hessian discussed earlier.
Let Θ be the rotation matrix such that
Λ = Θ H Θ^T    (1.34)
where Λ is diagonal and Θ^T Θ = I. The cost function then can be written as

E(W) ≈ E(W_min) + (1/2) (W − W_min)^T Θ^T [ Θ H(W_min) Θ^T ] [ Θ (W − W_min) ].    (1.35)

Making a change of coordinates to ν = Θ(W − W_min) simplifies the above equation to

E(ν) ≈ E(0) + (1/2) ν^T Λ ν    (1.36)

and the transformed update equation becomes

ν(t + 1) = (I − ηΛ) ν(t).    (1.37)
Note that I − ηΛ is diagonal with diagonal components 1 − ηλ_i. This equation will converge if |1 − ηλ_i| < 1, i.e. η < 2/λ_i for all i. If constrained to have a single scalar learning rate for all weights then we must require
η < 2/λ_max    (1.38)
in order to avoid divergence, where λ_max is the largest eigenvalue of H. For fastest convergence we have

η_opt = 1/λ_max.    (1.39)
If λ_min is a lot smaller than λ_max then convergence will be very slow along the λ_min direction. In fact, convergence time is proportional to the condition number κ ≡ λ_max/λ_min so that it is desirable to have as small an eigenvalue spread as possible.
However, since we have rotated H to be aligned with the coordinate axes, (1.37) consists actually of N independent 1-dimensional equations. Therefore, we can choose a learning rate for each weight independent of the others. We see that the optimal rate for the ith weight ν_i is η_opt,i = 1/λ_i.
1.5.2 Examples
Linear Network. Figure 1.10 displays a set of 100 examples drawn from two Gaussian distributed classes centered at (-0.4,-0.8) and (0.4,0.8). The eigenvalues of the covariance matrix are 0.84 and 0.036. We train a single layer linear network with 2 inputs, 1 output, 2 weights, and 1 bias (see Figure 1.9) using the LMS algorithm in batch mode. Figure 1.11 displays the weight trajectory and error during learning when using learning rates of η = 1.5 and 2.5. Note that any learning rate larger than ηmax = 2/λmax = 2/0.84 = 2.38 (see Eq. 1.38) will cause divergence, as is evident for η = 2.5.
Fig. 1.9. Simple linear network

Fig. 1.10. Two classes drawn from Gaussian distributions centered at (-0.4,-0.8) and (0.4,0.8)
Figure 1.12 shows the same example using stochastic instead of batch mode learning. Here, a learning rate of η = 0.2 is used. One can see that the trajectory is much noisier than in batch mode since only an estimate of the gradient is used at each iteration. The cost is plotted as a function of epoch. An epoch here is simply defined as 100 input presentations which, for stochastic learning, corresponds to 100 weight updates. In batch, an epoch corresponds to one weight update.
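A minimal sketch reproducing the flavor of this linear-network experiment is given below; the class standard deviation, the omission of the bias and the exact sample sizes are simplifications, so the measured eigenvalues will only roughly match the quoted 0.84 and 0.036.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((-0.4, -0.8), 0.3, size=(50, 2)),
               rng.normal(( 0.4,  0.8), 0.3, size=(50, 2))])
d = np.hstack([-np.ones(50), np.ones(50)])

H = X.T @ X / len(X)                         # LMS Hessian = input covariance matrix, Eq. (1.29)
eta_max = 2.0 / np.linalg.eigvalsh(H).max()  # largest non-divergent batch rate, Eq. (1.38)

def lms_batch(eta, epochs=10):
    w = np.zeros(2)
    for _ in range(epochs):
        w -= eta * X.T @ (X @ w - d) / len(X)   # one batch LMS gradient step
    return np.mean((X @ w - d) ** 2)

print(eta_max, lms_batch(1.5), lms_batch(2.5))  # a rate above eta_max makes the error grow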
Multilayer Network. Figure 1.14 shows the architecture for a very simple multilayer network. It has 1 input, 1 hidden, and 1 output node. There are 2 weights and 2 biases. The activation function is f (x) = 1.71 tanh((2/3)x). The training set contains 10 examples from each of 2 classes. Both classes are Gaussian distributed with standard deviation 0.4. Class 1 has a mean of -1 and class 2 has a mean of +1. Target values are -1 for class 1 and +1 for class 2. Figure 1.13 shows the stochastic trajectory for the example.
Fig. 1.11. Weight trajectory and error curve during learning for (a) η = 1.5 and (b) η = 2.5
Fig. 1.12. Weight trajectory and error curve during stochastic learning for η = 0.2

Fig. 1.13. Weight trajectories and errors for 1-1-1 network trained using stochastic learning
Fig. 1.14. The minimal multilayer network
1.5.3 Input Transformations and Error Surface Transformations Revisited
We can use the results of the previous section to justify several of the tricks discussed earlier.
Subtract the means from the input variables
The reason for the above trick is that a nonzero mean in the input variables creates a very large eigenvalue. This means the condition number will be large, i.e. the cost surface will be steep in some directions and shallow in others so that convergence will be very slow. The solution is to simply preprocess the inputs by subtracting their means.
For a single linear neuron, the eigenvectors of the Hessian (with means subtracted) point along the principal axes of the cloud of training vectors (recall Figure 1.8). Inputs that have a large variation in spread along different directions of the input space will have a large condition number and slow learning. And so we recommend:
Normalize the variances of the input variables.
If the input variables are correlated, this will not make the error surface spherical, but it will possibly reduce its eccentricity.
Correlated input variables usually cause the eigenvectors of H to be rotated away from the coordinate axes (Figure 1.7a versus 1.7b) thus weight updates are not decoupled. Decoupled weights make the “one learning rate per weight” method optimal, thus, we have the following trick:
Decorrelate the input variables.
Now suppose that the input variables of a neuron have been decorrelated; the Hessian for this neuron is then diagonal and its eigenvectors point along the coordinate axes. In such a case the gradient is not the best descent direction, as can be seen in Fig 1.7b. At the point P, an arrow shows that the gradient does not point towards the minimum. However, if we instead assign each weight its own learning rate (equal to the inverse of the corresponding eigenvalue) then the
descent direction will be in the direction of the other arrow that points directly towards the minimum:
Use a separate learning rate for each weight.
1.6 Classical Second Order Optimization Methods
In the following we will briefly introduce the Newton, conjugate gradient, Gauss-Newton, Levenberg-Marquardt and the Quasi-Newton (BFGS) method (see also [11, 34, 3, 5]).
Fig. 1.15. Sketch of the whitening properties of the Newton algorithm
1.6.1 Newton Algorithm
To get an understanding of the Newton method let us recapitulate the results from section 1.5.1. Assuming a quadratic loss function E (see Eq.(1.21)) as depicted in Figure 1.6(ii), we can compute the weight update along the lines of Eq.(1.21)-(1.23):

Δw = η ( ∂²E/∂w² )^{-1} ∂E/∂w = η H(w)^{-1} ∂E/∂w,    (1.40)
where η must be chosen in the range 0 < η < 1 since E is in practice not perfectly quadratic. In this equation information about the Hessian H is taken into account. If the error function was quadratic one step would be sufficient to converge.
Usually the energy surface around the minimum is rather ellipsoid, or in the extreme like a taco shell, depending on the conditioning of the Hessian. A whitening transform, well known from signal processing literature [29], can change this ellipsoid shape to a spherical shape through u = Θ Λ^{1/2} w (see Figure 1.15 and Eq.(1.34)). So the inverse Hessian in Eq.(1.40) basically spheres out the error surface locally. The following two approaches can be shown to be equivalent: (a) use the Newton algorithm in an untransformed weight space and (b) do usual gradient descent in a whitened coordinate system (see Figure 1.15) [19].

Fig. 1.16. Sketch of conjugate gradient directions in a 2D error surface
Summarizing, the Newton algorithm converges in one step if the error function is quadratic and (unlike gradient descent) it is invariant with respect to linear transformations of the input vectors. This means that the convergence time is not affected by shifts, scaling and rotation of input vectors. However one of the main drawbacks is that an N × N Hessian matrix must be stored and inverted, which takes O(N 3) per iterations and is therefore impractical for more than a few variables. Since the error function is in general non-quadratic, there is no guarantee of convergence. If the Hessian is not positive definite (if it has some zero or even negative Eigenvalues where the error surface is flat or some directions are curved downward), then the Newton algorithm will diverge, so the Hessian must be positive definite. Of course the Hessian matrix of multilayer networks is in general not positive definite everywhere. For these reasons the Newton algorithm in its original form is not usable for general neural network learning. However it gives good insights for developing more sophisticated algorithms, as discussed in the following.
1.6.2 Conjugate Gradient
There are several important properties in conjugate gradient optimization: (1) it is an O(N) method, (2) it doesn't use the Hessian explicitly, (3) it attempts to find descent directions that try to minimally spoil the result achieved in the previous iterations, (4) it uses a line search, and most importantly, (5) it works only for batch learning.
1. Efficient BackProp
|
||
|
||
33
|
||
|
||
The third property is shown in Figure 1.16. Assume we pick a descent direction, e.g. the gradient, then we minimize along a line in this direction (line search). Subsequently we should try to find a direction along which the gradient does not change its direction, but merely its length (conjugate direction), because moving along this direction will not spoil the result of the previous iteration. The evolution of the descent directions ρk at iteration k is given as
ρ_k = −∇E(w_k) + β_k ρ_{k−1},    (1.41)
where the choice of β_k can be done either according to Fletcher and Reeves [34]

β_k = [ ∇E(w_k)^T ∇E(w_k) ] / [ ∇E(w_{k−1})^T ∇E(w_{k−1}) ]    (1.42)

or Polak and Ribiere

β_k = [ (∇E(w_k) − ∇E(w_{k−1}))^T ∇E(w_k) ] / [ ∇E(w_{k−1})^T ∇E(w_{k−1}) ].    (1.43)

Two directions ρ_k and ρ_{k−1} are defined as conjugate if

ρ_k^T H ρ_{k−1} = 0,
i.e. conjugate directions are orthogonal directions in the space of an identity Hessian matrix (see Figure 1.17). Very important for convergence in both choices is a good line search procedure. For a perfectly quadratic function with N variables a convergence within N steps can be proved. For non-quadratic functions Polak and Ribiere’s choice seems more robust. Conjugate gradient (1.41) can also be viewed as a smart choice for choosing the momentum term known in neural network training. It has been applied with large success in multi-layer network training on problems that are moderate sized with rather low redundancy in the data. Typical applications range from function approximation, robotic control [39], time-series prediction and other real valued problems where high accuracy is wanted. Clearly on large and redundant (classification) problems stochastic backpropagation is faster. Although attempts have been made to define minibatches [25], the main disadvantage of conjugate gradient methods remains that it is a batch method (partly due to the precision requirements in line search procedure).
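A minimal sketch of a batch conjugate gradient loop with the Polak-Ribiere choice (Eqs. (1.41) and (1.43)) is given below; the crude grid line search is only a stand-in for the accurate line search the text calls for, and E, gradE are placeholder callables for the batch cost and gradient.

import numpy as np

def conjugate_gradient(E, gradE, w, iters=50):
    g_prev = gradE(w)
    rho = -g_prev                                       # first descent direction: the gradient
    for _ in range(iters):
        etas = np.linspace(0.0, 1.0, 101)
        w = min((w + eta * rho for eta in etas), key=E) # crude line search along rho
        g = gradE(w)
        beta = g @ (g - g_prev) / (g_prev @ g_prev)     # Polak-Ribiere, Eq. (1.43)
        rho = -g + beta * rho                           # next (conjugate) direction, Eq. (1.41)
        g_prev = g
    return w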
Fig. 1.17. Sketch of conjugate gradient directions in a 2D error surface
1.6.3 Quasi-Newton (BFGS)
The Quasi-Newton (BFGS) method (1) iteratively computes an estimate of the inverse Hessian, (2) is an O(N 2) algorithm, (3) requires line search and (4) it works only for batch learning.
The positive definite estimate of the inverse Hessian is done directly without requiring matrix inversion and by only using gradient information. Algorithmically this can be described as follows: (1) first a positive definite matrix M is chosen, e.g. M = I, (2) then the search direction is set to
ρ(t) = M (t)∇E(w(t)),
(3) a line search is performed along ρ, which gives the update for the parameters at time t
w(t) = w(t − 1) − η(t)ρ(t).
Finally (4) the estimate of the inverse Hessian is updated. Compared to the Newton algorithm the Quasi-Newton approach only needs gradient information. The most successful Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The update rule for the estimate of the inverse Hessian is
M(t) = M(t − 1) + ( 1 + (φ^T M φ)/(δ^T φ) ) (δ δ^T)/(δ^T φ) − ( δ φ^T M + M φ δ^T )/(δ^T φ),    (1.44)
where some abbreviations have been used for the following N × 1 vectors
φ = ∇E(w(t)) − ∇E(w(t − 1)),
δ = w(t) − w(t − 1).    (1.45)
Although, as mentioned above, the complexity is only O(N²), we are still required to store an N × N matrix, so the algorithm is only practical for small networks with non-redundant training sets. Recently some variants have been proposed that aim to reduce storage requirements (see e.g. [3]).
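A minimal sketch of the update (1.44)-(1.45) of the inverse-Hessian estimate M is given below; the outer loop (search direction, line search, restart logic) is omitted, and the rearrangement of Eq. (1.44) into outer products is an implementation detail.

import numpy as np

def bfgs_update(M, w_new, w_old, g_new, g_old):
    delta = w_new - w_old                  # parameter change, Eq. (1.45)
    phi = g_new - g_old                    # gradient change, Eq. (1.45)
    dp = delta @ phi
    Mphi = M @ phi
    # BFGS update of the inverse Hessian estimate, Eq. (1.44)
    return (M
            + (1.0 + phi @ Mphi / dp) * np.outer(delta, delta) / dp
            - (np.outer(delta, Mphi) + np.outer(Mphi, delta)) / dp)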
1.6.4 Gauss-Newton and Levenberg Marquardt
The Gauss-Newton and Levenberg-Marquardt algorithms (1) use the square Jacobian approximation, (2) are mainly designed for batch learning, (3) have a complexity of O(N³) and (4) most importantly, they work only for mean squared error loss functions. The Gauss-Newton algorithm is like the Newton algorithm, however the Hessian is approximated by the square of the Jacobian (see also section 1.7.2 for a further discussion)
Δw = [ Σ_p ( ∂f(w, x^p)/∂w )^T ( ∂f(w, x^p)/∂w ) ]^{-1} ∇E(w).    (1.46)
The Levenberg-Marquardt method is like the Gauss-Newton above, but it has a regularization parameter μ that prevents it from blowing up, if some eigenvalues are small:

Δw = [ Σ_p ( ∂f(w, x^p)/∂w )^T ( ∂f(w, x^p)/∂w ) + μI ]^{-1} ∇E(w),    (1.47)
where I denotes the unity matrix. The Gauss-Newton method is valid for quadratic cost functions; however, a similar procedure also works with Kullback-Leibler cost and is called Natural Gradient (see e.g. [1, 44, 2]).
1.7 Tricks to Compute the Hessian Information in Multilayer Networks
We will now discuss several techniques aimed at computing full or partial Hessian information by (a) finite difference method, (b) square Jacobian approximation (for Gauss-Newton and Levenberg-Marquardt algorithm), (c) computation of the diagonal of the Hessian and (d) by obtaining a product of the Hessian and a vector without computing the Hessian. Other semi-analytical techniques that allow the computation of the full Hessian are omitted because they are rather complicated and also require many forward/backward propagation steps [5, 8].
1.7.1 Finite Difference
We can write the k-th line of the Hessian

H^{(k)} = ∂(∇E(w))/∂w_k ≈ ( ∇E(w + δφ_k) − ∇E(w) ) / δ,
where φk = (0, 0, 0, . . . , 1, . . . , 0) is a vector of zeros and only one 1 at the k-th position. This can be implemented with a simple recipe: (1) compute the total gradient by multiple forward and backward propagation steps. (2) Add δ to the k-th parameter and compute again the gradient, and finally (3) subtract both results and divide by δ. Due to numerical errors in this computation scheme the resulting Hessian might not be perfectly symmetric. In this case it should be symmetrized as described below.
1.7.2 Square Jacobian Approximation for the Gauss-Newton and Levenberg-Marquardt Algorithms
Assuming a mean squared cost function
E(w) = (1/2) Σ_p ( d^p − f(w, x^p) )^T ( d^p − f(w, x^p) )    (1.48)

then the gradient is
∂E(w)/∂w = − Σ_p ( d^p − f(w, x^p) )^T ∂f(w, x^p)/∂w    (1.49)
and the Hessian follows as
H(w) = Σ_p ( ∂f(w, x^p)/∂w )^T ( ∂f(w, x^p)/∂w ) + Σ_p ( d^p − f(w, x^p) )^T ∂²f(w, x^p)/∂w∂w.    (1.50)
A simplifying approximation of the Hessian is the square of the Jacobian, which is a positive semi-definite matrix of dimension N × N:
H(w) ≈ Σ_p ( ∂f(w, x^p)/∂w )^T ( ∂f(w, x^p)/∂w ),    (1.51)
where the second term from Eq.(1.50) was dropped. This is equivalent to assuming that the network is a linear function of the parameters w. Again this is readily implemented for the k-th column of the Jacobian: for all training patterns, (1) we forward propagate, then (2) set the activity of the output units to 0 and only the k-th output to 1, (3) a backpropagation step is taken and the gradient is accumulated.
1.7.3 Backpropagating Second Derivatives
Let us consider a multi-layer system with some functional blocks with N_i inputs, N_o outputs and N parameters of the form O = F(W, X). Now assume we knew ∂²E/∂O², which is an N_o × N_o matrix. Then it is straightforward to compute this matrix
∂²E/∂W² = (∂O/∂W)^T (∂²E/∂O²) (∂O/∂W) + (∂E/∂O) (∂²O/∂W²).    (1.52)
We can drop the second term in Eq.(1.52) and the resulting estimate of the Hessian is positive semi-definite. A further reduction is achieved, if we ignore all but the diagonal terms of ∂²E/∂O²:
∂²E/∂w_i² = Σ_k (∂²E/∂o_k²) (∂o_k/∂w_i)².    (1.53)
A similar derivation can be done to obtain the N_i × N_i matrix ∂²E/∂x².
1.7.4 Backpropagating the Diagonal Hessian in Neural Nets
Backpropagation procedures for computing the diagonal Hessian are well known [18, 4, 19]. It is assumed that each layer in the network has the functional form o_i = f(y_i) = f(Σ_j w_ij x_j) (see Figure 1.18 for the sigmoidal network). Using the Gauss-Newton approximation (dropping the term that contains f''(y)) we obtain:

∂²E/∂y_k² = (∂²E/∂o_k²) (f'(y_k))²,    (1.54)
∂²E/∂w_ki² = (∂²E/∂y_k²) x_i²    (1.55)
and
∂²E/∂x_i² = Σ_k (∂²E/∂y_k²) w_ki².    (1.56)
With f being a Gaussian nonlinearity as shown in Figure 1.18 for the RBF networks we obtain
∂²E/∂w_ki² = (∂²E/∂y_k²) (x_i − w_ki)²    (1.57)
and
∂²E/∂x_i² = Σ_k (∂²E/∂y_k²) (x_i − w_ki)².    (1.58)
The cost of computing the diagonal second derivatives by running these equations from the last layer to the first one is essentially the same as the regular backpropagation pass used for the gradient, except that the squares of the weights are used in the weighted sums. This technique is applied in the “optimal brain damage” pruning procedure (see [21]).
Fig. 1.18. Backpropagating the diagonal Hessian: sigmoids (left) and RBFs (right)
1.7.5 Computing the Product of the Hessian and a Vector
In many methods that make use of the Hessian, the Hessian is used exclusively in products with a vector. Interestingly, there is a way of computing such products without going through the trouble of computing the Hessian itself. The finite difference method can fulfill this task for an arbitrary vector Ψ
HΨ ≈ (1/α) [ ∂E/∂w (w + αΨ) − ∂E/∂w (w) ],    (1.59)
using only two gradient computations (at point w and w + αΨ respectively), which can be readily computed with backprop (α is a small constant).
This method can be applied to compute the principal eigenvector and eigenvalue of H by the power method. By iterating and setting
Ψ(t + 1) = H Ψ(t) / ‖Ψ(t)‖,    (1.60)
the vector Ψ(t) will converge to the largest eigenvector of H and ‖Ψ(t)‖ to the corresponding eigenvalue [23, 14, 10]. See also [33] for an even more accurate method that (1) does not use finite differences and (2) has similar complexity.
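A minimal sketch of Eqs. (1.59)-(1.60) follows; grad_fn stands for a full backprop gradient evaluation and is a placeholder, and alpha must be small but not so small that the finite difference drowns in round-off.

import numpy as np

def hessian_vector(grad_fn, w, psi, alpha=1e-4):
    # finite-difference product H*psi from two gradient evaluations, Eq. (1.59)
    return (grad_fn(w + alpha * psi) - grad_fn(w)) / alpha

def largest_eigenvalue(grad_fn, w, dim, iters=50, seed=0):
    # power method built on the H*psi product, Eq. (1.60)
    psi = np.random.default_rng(seed).standard_normal(dim)
    for _ in range(iters):
        psi = hessian_vector(grad_fn, w, psi) / np.linalg.norm(psi)
    return np.linalg.norm(psi)   # converges to the largest eigenvalue of H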
1.8 Analysis of the Hessian in Multi-layer Networks
It is interesting to understand how some of the tricks shown previously influence the Hessian, i.e. how the Hessian changes with the architecture and details of the implementation. Typically, the eigenvalue distribution of the Hessian looks like the one sketched in Figure 1.20: a few small eigenvalues, many medium ones and a few very large ones. We will now argue that it is the large eigenvalues that cause trouble in the training process; they typically arise from [23, 22]
– non-zero mean inputs or neuron states [22] (see also chapter 10)
– wide variations of the second derivatives from layer to layer
– correlation between state variables.
To exemplify this, we show the eigenvalue distribution of a network trained on OCR data in Figure 1.20. Clearly, there is a wide spread of eigenvalues (see Figure 1.19) and we observe that the ratio between e.g. the first and the eleventh eigenvalue is about 8. The long tail of the eigenvalue distribution (see Figure 1.20) is rather painful because the ratio between the largest and smallest eigenvalue gives the conditioning of the learning problem. A large ratio corresponds to a big difference in the axes of the ellipsoidal shaped error function: the larger the ratio, the more we find taco-shell shaped minima, which are extremely steep towards the small axis and very flat along the long axis.
Fig. 1.19. Eigenvalue spectrum in a 4 layer shared weights network (256×128×64×10) trained on 320 handwritten digits
Fig. 1.20. Eigenvalue spectrum in a 4 layer shared weights network (256×128×64×10) trained on 320 handwritten digits

Fig. 1.21. Multilayered architecture: the second derivative is often smaller in lower layers
Another general characteristic of the Hessian in multi-layer networks is the spread between layers. In Figure 1.21 we roughly sketch how the shape of the Hessian varies from being rather flat in the first layer to being quite steep in the last layer. This affects the learning speed and can provide an ingredient to explain the slow learning in lower layers and the fast (sometime oscillating) learning in the last layer. A trick to compensate this different scale of learning is to use the inverse diagonal Hessian to control the learning rate (see also section 1.6, chapter 17).
1.9 Applying Second Order Methods to Multilayer Networks
Before we concentrate in this section on how to tailor second order techniques to training large networks, let us first repeat some rather pessimistic facts about applying classical second order methods. Techniques using full Hessian information (Gauss-Newton, Levenberg-Marquardt and BFGS) are only practical for very small networks trained in batch mode; however, those small networks are not the ones that need speeding up the most. Most second order methods (conjugate gradient, BFGS, . . . ) require a line search and can therefore not be used in the stochastic mode. Many of the tricks discussed previously apply only to batch learning. From our experience we know that a carefully tuned stochastic gradient descent is hard to beat on large classification problems. For smaller problems that require accurate real-valued outputs, as in function approximation or control problems, we find that conjugate gradient (with the Polak-Ribière choice, Eq. (1.43)) offers the best combination of speed, reliability and simplicity. Several attempts at applying conjugate gradient with “mini batches” to large and redundant problems have been made recently [17, 25, 31]. A variant of conjugate gradient optimization (called scaled CG) is also interesting: here the line search procedure is replaced by a 1D Levenberg-Marquardt type algorithm [24].
1.9.1 A Stochastic Diagonal Levenberg-Marquardt Method
To obtain a stochastic version of the Levenberg-Marquardt algorithm, the idea is to track the diagonal Hessian through a running estimate of the second derivative with respect to each parameter. The instantaneous second derivatives can be obtained via backpropagation as shown in the formulas of section 1.7. As soon as we have those running estimates we can use them to compute individual learning rates for each parameter
ηki = ε / ( ⟨∂²E/∂wki²⟩ + μ ) ,    (1.61)

where ε denotes the global learning rate, and ⟨∂²E/∂wki²⟩ is a running estimate of the diagonal second derivative with respect to wki. μ is a parameter to prevent ηki
from blowing up in case the second derivative is small, i.e. when the optimization moves in flat parts of the error function. The running estimate is computed as
⟨∂²E/∂wki²⟩new = (1 − γ) ⟨∂²E/∂wki²⟩old + γ ∂²Ep/∂wki² ,    (1.62)
where γ is a small constant that determines the amount of memory being used. The second derivatives can be computed prior to training over, e.g., a subset of the training set. Since they change only very slowly, they only need to be re-estimated every few epochs. Note that the additional cost over regular backpropagation is negligible and convergence is, as a rule of thumb, about three times faster than a carefully tuned stochastic gradient algorithm.
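A minimal sketch of this update in Python (using numpy) is given below. It assumes that the instantaneous diagonal second derivatives for the current pattern are supplied by the backprop formulas of section 1.7; the values of the global learning rate eps, the regularizer mu and the leak gamma are illustrative, not prescriptions.

import numpy as np

def sdlm_step(w, grad, h_run, h_inst, eps=0.01, mu=0.1, gamma=0.01):
    # Eq. (1.62): leaky running average of the diagonal second derivatives.
    h_run = (1.0 - gamma) * h_run + gamma * h_inst
    # Eq. (1.61): individual learning rate for each parameter; mu keeps the
    # rates bounded where the error surface is flat.
    eta = eps / (h_run + mu)
    # Stochastic gradient step with per-parameter learning rates.
    w = w - eta * grad
    return w, h_run

Here w, grad, h_run and h_inst are arrays of the same shape (one entry per weight).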
In Figures 1.22 and 1.23 we see the convergence of the stochastic diagonal Levenberg-Marquardt method (1.61) for a toy example with two different sets of learning rates. As expected, the experiment shown in Figure 1.22 contains fewer fluctuations than the one in Figure 1.23 because of its smaller learning rates.
[Figure: learning rates η0 = 0.12, η1 = 0.03, η2 = 0.02; Hessian largest eigenvalue λmax = 0.84; maximum admissible learning rate (batch) ηmax = 2.38. Panels: trajectory in weight space and log MSE (dB) over 10 epochs.]
Fig. 1.22. Stochastic diagonal Levenberg-Marquardt algorithm. Data set from 2 Gaussians with 100 examples. The network has one linear unit, 2 inputs and 1 output, i.e. three parameters (2 weights, 1 bias).
[Figure: learning rates η0 = 0.76, η1 = 0.18, η2 = 0.12; Hessian largest eigenvalue λmax = 0.84; maximum admissible learning rate (batch) ηmax = 2.38. Panels: trajectory in weight space and log MSE (dB) over 10 epochs.]
Fig. 1.23. Stochastic diagonal Levenberg-Marquardt algorithm. Data set from 2 Gaussians with 100 examples. The network has one linear unit, 2 inputs and 1 output, i.e. three parameters (2 weights, 1 bias).
1.9.2 Computing the Principal Eigenvalue/Vector of the Hessian
In the following we give three tricks for computing the principal eigenvalue/vector of the Hessian without having to compute the Hessian itself. Remember that in section 1.4.7 we also introduced a method to approximate the smallest eigenvector of the Hessian (without having to compute the Hessian) through averaging (see also [28]).
Power Method. We repeat the result of our discussion in section 1.7.5: starting from a random initial vector Ψ , the iteration
Ψnew = H ( Ψold / ‖Ψold‖ ) ,
will eventually converge to the principal eigenvector (or a vector in the principal eigenspace), and ‖Ψ‖ will converge to the corresponding eigenvalue [14, 10].
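As an illustration, the iteration can be written down in a few lines. This is only a sketch: it assumes a routine hvp(v) that returns the Hessian-vector product Hv, obtained for instance with the two-gradient trick of section 1.7.5 or with the exact method of [33]; the number of iterations is an arbitrary choice.

import numpy as np

def principal_eigenpair(hvp, dim, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    psi = rng.standard_normal(dim)                # random initial vector
    for _ in range(n_iter):
        psi = hvp(psi / np.linalg.norm(psi))      # Psi <- H (Psi / ||Psi||)
    # at convergence, ||Psi|| approximates the largest eigenvalue and
    # Psi / ||Psi|| the corresponding eigenvector
    return np.linalg.norm(psi), psi / np.linalg.norm(psi)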
[Plot: estimated eigenvalue (0–80) vs. number of pattern presentations (0–400) for leak sizes γ = 0.1, 0.03, 0.01 and 0.003.]
Fig. 1.24. Evolution of the eigenvalue as a function of the number of pattern presentations for a shared weight network with 5 layers, 64638 connections and 1278 free parameters. The training set consists of 1000 handwritten digits.
Taylor Expansion. Another method makes use of the fact that small perturbations of the gradient also lead to the principal eigenvector of H
Ψnew = (1/α) [ ∂E/∂w (w + α Ψold/‖Ψold‖) − ∂E/∂w (w) ] ,    (1.63)
where α is a small constant. One iteration of this procedure requires two forward and two backward propagation steps for each pattern in the training set.
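A rough sketch of one such iteration, under the assumption that a routine grad_fn(w) returning the gradient ∂E/∂w (over the training set or a batch) is available; alpha is a small illustrative constant:

import numpy as np

def taylor_power_step(grad_fn, w, psi, alpha=1e-2):
    # Eq. (1.63): the difference of two gradients approximates H (psi / ||psi||),
    # so repeating this step runs the power method without ever forming H.
    unit = psi / np.linalg.norm(psi)
    return (grad_fn(w + alpha * unit) - grad_fn(w)) / alpha

Repeating psi = taylor_power_step(grad_fn, w, psi) until ‖psi‖ stabilizes yields the principal eigenvalue as ‖psi‖, just as in the power method above.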
Online Computation of Ψ. The following rule makes use of a running average to obtain the largest eigenvalue of the average Hessian very fast:
Ψnew = (1 − γ) Ψold + (1/α) [ ∂Ep/∂w (w + α Ψold/‖Ψold‖) − ∂E/∂w (w) ] .    (1.64)
[Plot: mean squared error (0–2.5) vs. the ratio learning rate / predicted optimal learning rate (0–4), after 1 to 5 training epochs.]
Fig. 1.25. Mean squared error as a function of the ratio between learning rate and predicted optimal learning rate for a fully connected network (784 × 30 × 10). The training set consists of 300 handwritten digits.
To summarize, the eigenvalue/vector computation proceeds as follows:

1. a random vector is chosen for initialization of Ψ,
2. an input pattern is presented with its desired output, a forward and backward propagation step is performed, and the gradients G(w) are stored,
3. α Ψold/‖Ψold‖ is added to the current weight vector w,
4. a forward and backward propagation step is performed with the perturbed weight vector and the gradients G(w′) are stored,
5. the difference (1/α)(G(w′) − G(w)) is computed and the running average of the eigenvector is updated,
6. we loop from (2)–(6) until a reasonably stable result is obtained for Ψ,
7. the optimal learning rate is then given as

ηopt = 1 / ‖Ψ‖ .
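The seven steps above translate almost line by line into code. The sketch below assumes a per-pattern gradient routine grad_fn(w, pattern); alpha (the perturbation size) and gamma (the leak size of the running average) are illustrative values, and in practice gamma is adapted as mentioned in the discussion of Figure 1.24.

import numpy as np

def estimate_optimal_rate(grad_fn, patterns, w, alpha=1e-2, gamma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    psi = rng.standard_normal(w.shape)                     # step 1: random initialization
    for p in patterns:                                     # steps 2-6: loop over patterns
        g = grad_fn(w, p)                                  # step 2: gradient at w
        w_pert = w + alpha * psi / np.linalg.norm(psi)     # step 3: perturb along Psi
        g_pert = grad_fn(w_pert, p)                        # step 4: gradient at perturbed point
        psi = (1.0 - gamma) * psi + (g_pert - g) / alpha   # step 5 / Eq. (1.64)
    return 1.0 / np.linalg.norm(psi)                       # step 7: eta_opt = 1 / ||Psi||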
[Plot: mean squared error (0–2.5) vs. the ratio learning rate / predicted optimal learning rate (0–3), after 1 to 5 training epochs.]
Fig. 1.26. Mean squared error as a function of the ratio between learning rate and predicted optimal learning rate for a shared weight network with 5 layers (1024 × 1568 × 392 × 400 × 100 × 10), 64638 (local) connections and 1278 free parameters (shared weights). The training set consists of 1000 handwritten digits.
In Figure 1.24 we see the evolution of the eigenvalue estimate as a function of the number of pattern presentations for a neural network in a handwritten character recognition task. In practice we adapt the leak size γ of the running average in order to get fewer fluctuations (as also indicated in the figure). We see that the correct order of magnitude for the eigenvalue, and hence for the learning rate, is reached after fewer than 100 pattern presentations. From the experiments we also observe that the fluctuations of the average Hessian over training are small.
In Figures 1.25 and 1.26 we start with the same initial conditions and perform a fixed number of epochs with learning rates computed by multiplying the predicted optimal learning rate by a predefined constant. Choosing the constant 1 (i.e., using the predicted optimal rate) always gives residual errors which are very close to the error achieved by the best choice of the constant. In other words, the “predicted optimal rate” is optimal enough.
1.10 Discussion and Conclusion
According to the recommendations mentioned above, a practitioner facing a multi-layer neural net training problem would go through the following steps (a sketch of the data preprocessing part is given after the list):

– shuffle the examples
– center the input variables by subtracting the mean
– normalize the input variables to a standard deviation of 1
– if possible, decorrelate the input variables
– pick a network with the sigmoid function shown in figure 1.4
– set the target values within the range of the sigmoid, typically +1 and -1
– initialize the weights to random values as prescribed by 1.16.
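The data-related steps of this checklist can be sketched in a few lines of numpy; the decorrelation used here (a plain PCA rotation of the inputs) is one possible choice, not the only one:

import numpy as np

def preprocess_inputs(X, decorrelate=True, seed=0):
    # shuffle the examples (apply the same permutation to the targets)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X = X[order]
    # center each input variable and normalize it to standard deviation 1
    X = X - X.mean(axis=0)
    X = X / (X.std(axis=0) + 1e-12)
    if decorrelate:
        # rotate onto the eigenvectors of the covariance matrix (PCA)
        _, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        X = X @ eigvecs
    return X, order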
The preferred method for training the network should be picked as follows:
– if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.
– if the training set is not too large, or if the task is regression, use conjugate gradient.
Classical second-order methods are impractical in almost all useful cases.
The non-linear dynamics of stochastic gradient descent in multi-layer neural networks, particularly as it pertains to generalization, is still far from being well understood. More theoretical and systematic experimental work is needed.
Acknowledgement. Y.L. & L.B. & K.-R. M. gratefully acknowledge mutual exchange grants from DAAD and NSF.
References
[1] Amari, S.: Neural learning in structured parameter spaces — natural riemannian gradient. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, p. 127. MIT Press (1997)
[2] Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
[3] Battiti, R.: First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation 4, 141–166 (1992)
[4] Becker, S., LeCun, Y.: Improving the convergence of backpropagation learning with second order methods. In: Touretzky, D., Hinton, G., Sejnowski, T. (eds.) Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37. Lawrence Erlbaum Associates (1989)
[5] Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
[6] Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.) Online Learning in Neural Networks (1997 Workshop at the Newton Institute). The Newton Institute Series. Cambridge University Press, Cambridge (1998)
[7] Broomhead, D.S., Lowe, D.: Multivariable function interpolation and adaptive networks. Complex Systems 2, 321–355 (1988)
[8] Buntine, W.L., Weigend, A.S.: Computing second order derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks (1993) (to appear)
[9] Darken, C., Moody, J.E.: Note on learning rate schedules for stochastic optimization. In: Lippmann, R.P., Moody, J.E., Touretzky, D.S. (eds.) Advances in Neural Information Processing Systems, vol. 3, pp. 832–838. Morgan Kaufmann, San Mateo (1991)
[10] Diamantaras, K.I., Kung, S.Y.: Principal Component Neural Networks. Wiley, New York (1996)
[11] Fletcher, R.: Practical Methods of Optimization, ch. 8.7: Polynomial time algorithms, 2nd edn., pp. 183–188. John Wiley & Sons, New York (1987)
[12] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Computation 4(1), 1–58 (1992)
[13] Goldstein, L.: Mean square optimality in the continuous time Robbins Monro procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA (1987)
[14] Golub, G.H., Van Loan, C.F.: Matrix Computations, 2nd edn. Johns Hopkins University Press, Baltimore (1989)
[15] Heskes, T.M., Kappen, B.: On-line learning processes in artificial neural networks. In: Tayler, J.G. (ed.) Mathematical Approaches to Neural Networks, vol. 51, pp. 199–233. Elsevier, Amsterdam (1993)
[16] Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295–307 (1988)
[17] Kramer, A.H., Sangiovanni-Vincentelli, A.: Efficient parallel learning algorithms for neural networks. In: Touretzky, D.S. (ed.) Proceedings of the 1988 Conference on Advances in Neural Information Processing Systems, pp. 40–48. Morgan Kaufmann, San Mateo (1989)
[18] LeCun, Y.: Modeles connexionnistes de l'apprentissage (connectionist learning models). PhD thesis, Université P. et M. Curie, Paris VI (1987)
[19] LeCun, Y.: Generalization and network design strategies. In: Pfeifer, R., Schreter, Z., Fogelman, F., Steels, L. (eds.) Proceedings of the International Conference Connectionism in Perspective, University of Zürich, October 10-13. Elsevier, Amsterdam (1988)
[20] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Handwritten digit recognition with a backpropagation network. In: Touretsky, D.S. (ed.) Advances in Neural Information Processing Systems, vol. 2. Morgan Kaufmann, San Mateo (1990)
[21] LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretsky, D.S. (ed.) Advances in Neural Information Processing Systems, vol. 2, pp. 598–605 (1990)
[22] LeCun, Y., Kanter, I., Solla, S.A.: Second order properties of error surfaces. In: Advances in Neural Information Processing Systems, vol. 3. Morgan Kaufmann, San Mateo (1991)
[23] LeCun, Y., Simard, P.Y., Pearlmutter, B.: Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In: Giles, Hanson, Cowan (eds.) Advances in Neural Information Processing Systems, vol. 5. Morgan Kaufmann, San Mateo (1993)
[24] Møller, M.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6, 525–533 (1993)
[25] Møller, M.: Supervised learning on large redundant training sets. International Journal of Neural Systems 4(1), 15–25 (1993)
[26] Moody, J.E., Darken, C.J.: Fast learning in networks of locally-tuned processing units. Neural Computation 1, 281–294 (1989)
[27] Murata, N.: PhD thesis, University of Tokyo (1992) (in Japanese)
[28] Murata, N., Müller, K.-R., Ziehe, A., Amari, S.: Adaptive on-line learning in changing environments. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, p. 599. MIT Press (1997)
[29] Oppenheim, A.V., Schafer, R.W.: Digital Signal Processing. Prentice-Hall, Englewood Cliffs (1975)
[30] Orr, G.B.: Dynamics and Algorithms for Stochastic learning. PhD thesis, Oregon Graduate Institute (1995)
[31] Orr, G.B.: Removing noise in on-line search using adaptive batch sizes. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, p. 232. MIT Press (1997)
[32] Orr, M.J.L.: Regularization in the selection of radial basis function centers. Neural Computation 7(3), 606–623 (1995)
[33] Pearlmutter, B.A.: Fast exact multiplication by the Hessian. Neural Computation 6, 147–160 (1994)
[34] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes in C: The Art of Scientific Programming. Cambridge University Press, Cambridge (1988)
[35] Saad, D. (ed.): Online Learning in Neural Networks (1997 Workshop at the Newton Institute). The Newton Institute Series. Cambridge University Press, Cambridge (1998)
[36] Saad, D., Solla, S.A.: Exact solution for on-line learning in multilayer neural networks. Physical Review Letters 74, 4337–4340 (1995)
[37] Sompolinsky, H., Barkai, N., Seung, H.S.: On-line learning of dichotomies: algorithms and learning curves. In: Oh, J.-H., Kwon, C., Cho, S. (eds.) Neural Networks: The Statistical Mechanics Perspective, pp. 105–130. World Scientific, Singapore (1995)
[38] Sutton, R.S.: Adapting bias by gradient descent: An incremental version of delta-bar-delta. In: Swartout, W. (ed.) Proceedings of the 10th National Conference on Artificial Intelligence, pp. 171–176. MIT Press, San Jose (July 1992)
[39] van der Smagt, P.: Minimisation methods for training feed-forward networks. Neural Networks 7(1), 1–11 (1994)
[40] Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
[41] Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
[42] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-37, 328–339 (1989)
[43] Wiegerinck, W., Komoda, A., Heskes, T.: Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A 27, 4425–4437 (1994)
[44] Yang, H.H., Amari, S.: The efficiency and the robustness of natural gradient descent learning rule. In: Jordan, M.I., Kearns, M.J., Solla, S.A. (eds.) Advances in Neural Information Processing Systems, vol. 10. MIT Press (1998)
Regularization Techniques to Improve Generalization

Preface
Good tricks for regularization are extremely important for improving the generalization ability of neural networks. The first and most commonly used trick is early stopping, which was originally described in [11]. In its simplest version, the trick is as follows:
Take an independent validation set, e.g. take out a part of the training set, and monitor the error on this set during training. The error on the training set will decrease, whereas the error on the validation set will first decrease and then increase. The early stopping point occurs where the error on the validation set is the lowest. It is here that the network weights provide the best generalization.
As Lutz Prechelt points out in chapter 2, the above picture is highly idealized. In practice, the shape of the error curve on the validation set is typically rather ragged, with multiple minima. Choosing the “best” early stopping point then involves a trade-off between (1) improvement of generalization and (2) speed of learning. If speed is not an issue then, clearly, the safest strategy is to train all the way until the minimum error on the training set is found, while monitoring the location of the lowest error rate on the validation set. Of course, this can take a prohibitive amount of computing time. This chapter presents less costly strategies employing a number of different stopping criteria, e.g. stopping when the ratio between the generalization loss and the progress exceeds a given threshold (see p. 57). A large simulation study using various benchmark problems is used in the discussion and analysis of the differences (with respect to, e.g., robustness, effectiveness, training time, . . . ) between these proposed stopping criteria (see p. 60ff.). Theoretical studies [12, 1, 6] have so far not addressed this trade-off.
Weight decay is also a commonly used technique for controlling capacity in neural networks. Early stopping is considered to be fast, but it is not well defined (keep in mind the pitfalls mentioned in chapter 2). On the other hand, weight decay regularizers [5, 2] are well understood, but finding a suitable parameter λ to control the strength of the weight decay term can be tediously time consuming. Thorsteinn Rögnvaldsson proposes a simple trick for estimating λ by making use of the best of both worlds (see p. 75): simply compute the gradient at the early stopping solution W es and divide it by the norm of W es,
λ̂ = ‖∇E(Wes)‖ / ( 2 ‖Wes‖ ) .
Other penalties are also possible. The trick is speedy, since we neither have to do a complete training nor a scan of the whole λ parameter space, and the accuracy of the determined λˆ is good, as seen from some interesting simulations.
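In code the estimate is a one-liner; the sketch below simply assumes that the training code supplies the early stopping weights W_es and the gradient of the unregularized training error at that point, both as flat numpy arrays:

import numpy as np

def estimate_weight_decay(grad_at_Wes, W_es):
    # lambda_hat = ||grad E(W_es)|| / (2 ||W_es||)
    return np.linalg.norm(grad_at_Wes) / (2.0 * np.linalg.norm(W_es))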
Tony Plate in chapter 4 treats the penalty factors for the weights (hyperparameters) along the Bayesian framework of MacKay [8] and Neal [9]. There are two levels in searching for the best network. The inner loop is a minimization of the training error keeping the hyperparameters fixed, whereas the outer loop searches the hyperparameter space with the goal of maximizing the evidence of having generated the data. This whole procedure is rather slow and computationally expensive, since, in theory, the inner search needs to converge (to a local minimum) at each outer loop search step. When applied to classification networks using the cross-entropy error function the outer-loop search can be unstable with the hyperparameter values oscillating wildly or going to inappropriate extremes. To make this Bayesian framework work better in practice, Tony Plate proposes a number of tricks that speed and simplify the hyperparameter search strategies (see p. 96). In particular, his search strategies center around the questions: (1) how often (when) should the hyperparameters be updated (see p. 96) and (2) what should be done if the Hessian is out-of-bounds (see p. 97ff.). To discuss the effects of the choices made in (1) and (2), Tony Plate uses simulations based on artificial examples and concludes with a concise set of rules for making the hyperparameter framework work better.
In chapter 5, Jan Larsen et al. formulate an iterative gradient descent scheme for adapting their regularization parameters (note that different regularizers can be used for input/hidden and hidden/output weights). The trick is simple: perform gradient descent on the validation set error with respect to the regularization parameters, and iteratively use the results for updating the estimate of the regularization parameters (see p. 116). This method holds for a variety of penalty terms (e.g. weight decay). The computational overhead for computing the gradients is negligible; however, an inverse Hessian has to be estimated. If second order methods are used for training, then the inverse Hessian may already be available, so there is little additional effort. Otherwise obtaining full Hessian information is rather tedious and limits the approach to smaller applications (see discussion in chapter 1). Nevertheless, approximations of the Hessian (e.g. diagonal) could also be used to limit the computation time. Jan Larsen et al. demonstrate the applicability of their trick on classification (vowel data) and regression (time-series prediction) problems.
Averaging over multiple predictors is a well known method for improving generalization (see e.g. [10, 3, 7, 13]). David Horn et al. raise two questions in ensemble training: (1) how many predictors are “enough” and (2) how does the number of predictors affect the stopping criteria for early stopping (see p. 134). They answer these questions by providing a method for estimating the error of an infinite number of predictors, and they demonstrate the usefulness of their trick on the sunspot prediction task. Additional theoretical reasoning is given to explain their success in terms of variance minimization within the ensemble.
Jenny & Klaus

References
[1] Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H.H.: Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks 8(5), 985–996 (1997)
[2] Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
[3] Breiman, L.: Bagging predictors. Machine Learning 26(2), 123–140 (1996)
[4] Cowan, J.D., Tesauro, G., Alspector, J. (eds.): Advances in Neural Information Processing Systems 6, San Mateo, CA. Morgan Kaufmann Publishers Inc. (1994)
[5] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7(2), 219–269 (1995)
[6] Kearns, M.: A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split. Neural Computation 9(5), 1143–1161 (1997)
[7] Lincoln, W.P., Skrzypek, J.: Synergy of clustering multiple back propagation networks. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems 2, San Mateo, CA, pp. 650–657. Morgan Kaufmann (1990)
[8] MacKay, D.J.C.: A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472 (1992)
[9] Neal, R.M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics, vol. 118. Springer, New York (1996)
[10] Perrone, M.P.: Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University (May 1993)
[11] Plaut, D.C., Nowlan, S.J., Hinton, G.E.: Experiments on learning by backpropagation. Technical Report, Computer Science Dept., Pittsburgh, PA (1986)
[12] Wang, C., Venkatesh, S.S., Judd, J.S.: Optimal stopping and effective machine complexity in learning. In: [4] (1994)
[13] Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
2 Early Stopping — But When?∗

Lutz Prechelt

Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany
prechelt@ira.uka.de
http://www.ipd.ira.uka.de/˜prechelt/
Abstract. Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”). The exact criterion used for validation-based early stopping, however, is usually chosen in an ad-hoc fashion, or training is stopped interactively. This trick describes how to select a stopping criterion in a systematic fashion; it is a trick for either speeding up learning procedures or improving generalization, whichever is more important in the particular situation. An empirical investigation on multi-layer perceptrons shows that there exists a tradeoff between training time and generalization: From the given mix of 1296 training runs using 12 different problems and 24 different network architectures I conclude that slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).
2.1 Early Stopping Is Not Quite as Simple

2.1.1 Why Early Stopping?
When training a neural network, one is usually interested in obtaining a network with optimal generalization performance. However, all standard neural network architectures such as the fully connected multi-layer perceptron are prone to overfitting [10]: While the network seems to get better and better, i.e., the error on the training set decreases, at some point during training it actually begins to get worse again, i.e., the error on unseen examples increases. The idealized expectation is that during training the generalization error of the network evolves as shown in Figure 2.1. Typically the generalization error is estimated by a validation error, i.e., the average error on a validation set, a fixed set of examples not from the training set.
There are basically two ways to fight overfitting: reducing the number of dimensions of the parameter space or reducing the effective size of each dimension.
Techniques for reducing the number of parameters are greedy constructive learning [7], pruning [5, 12, 14], or weight sharing [18]. Techniques for reducing the size of each parameter dimension are regularization, such as weight decay [13] and others [25], or early stopping [17]. See also [8, 20] for an overview and [9] for an experimental comparison.
Early stopping is widely used because it is simple to understand and implement and has been reported to be superior to regularization methods in many cases, e.g. in [9].
2.1.2 The Basic Early Stopping Technique
In most introductory papers on supervised neural network training one can find a diagram similar to the one shown in Figure 2.1. It is claimed to show the evolution over time of the per-example error on the training set and on a validation set not used for training (the training error curve and the validation error curve). Given this behavior, it is clear how to do early stopping using validation:
Fig. 2.1. Idealized training and validation error curves. Vertical: errors; horizontal: time.
1. Split the training data into a training set and a validation set, e.g. in a 2-to-1 proportion.
2. Train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of the training run.
This approach uses the validation set to anticipate the behavior in real use (or on a test set), assuming that the error on both will be similar: The validation error is used as an estimate of the generalization error.
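A minimal sketch of this basic procedure is shown below; train_epoch and validation_error stand for whatever training step and validation evaluation the surrounding code provides, and check_every and max_epochs are arbitrary illustrative settings:

def naive_early_stopping(w, train_epoch, validation_error, check_every=5, max_epochs=3000):
    prev_w, prev_err = w, float("inf")
    for epoch in range(1, max_epochs + 1):
        w = train_epoch(w)                       # one pass over the training set
        if epoch % check_every == 0:
            err = validation_error(w)            # per-example error on the validation set
            if err > prev_err:                   # validation error rose since the last check:
                return prev_w                    # return the weights from the previous check
            prev_w, prev_err = w, err
    return prev_w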
2.1.3 The Ugliness of Reality
However, for real neural network training the validation set error does not evolve as smoothly as shown in Figure 2.1, but looks more like the curve in Figure 2.2. See Section 2.4 for a rough explanation of this behavior. As we see, the validation error can still go further down after it has begun to increase — and in a realistic setting we never know the exact generalization error but estimate it by the validation set error instead. There is no obvious rule for deciding when the minimum of the generalization error has been reached. Real validation error curves almost always have more than one local minimum. This curve exhibits as many as 16 local minima before severe overfitting begins at about epoch 400. Of these local minima, 4 are the global minimum up to where they occur. The optimal stopping point in this example would be epoch 205. Note that stopping at epoch 400, compared to stopping shortly after the first “deep” local minimum at epoch 45, trades an about sevenfold increase of learning time for an improvement of the validation set error by 1.1% (by finding the minimum at epoch 205). If representative data is used, the validation error is an unbiased estimate of the actual network performance, so we expect a 1.1% decrease of the generalization error in this case. Nevertheless, overfitting might sometimes go undetected because the validation set is finite and thus not perfectly representative of the problem.
Unfortunately, the above or any other validation error curve is not typical in the sense that all curves share the same qualitative behavior. Other curves might never reach a better minimum than the first, or than, say, the third; the mountains and valleys in the curve can be of very different width, height, and shape. The only thing all curves seem to have in common is that the differences between the first and the following local minima are not huge.
As we see, choosing a stopping criterion predominantly involves a tradeoff between training time and generalization error. However, some stopping criteria may typically find better tradeoffs than others. This leads to the question of
which criterion to use with cross validation to decide when to stop training. This is why we need the present trick: to tell us how to really do early stopping.

[Plot: validation error (Glass dataset, 4+4 hidden), ranging from about 9.3 to 10, over training epochs 0–450.]
Fig. 2.2. A real validation error curve. Vertical: validation set error; horizontal: time (in training epochs).
2.2 How to Do Early Stopping Best
What we need is a predicate that tells us when to stop training. We call such a predicate a stopping criterion. Among all possible stopping criteria we are searching for those which yield the lowest generalization error and also for those with the best “price-performance ratio”, i.e., that require the least training for a given generalization error or that (on average) result in the lowest generalization error for a certain training time.
2.2.1 Some Classes of Stopping Criteria
There are a number of plausible stopping criteria and this work considers three classes of them. To formally describe the criteria, we need some definitions first. Let E be the objective function (error function) of the training algorithm, for example the squared error. Then Etr(t), the training set error (for short: training error), is the average error per example over the training set, measured after epoch t. Eva(t), the validation error, is the corresponding error on the validation set and is used by the stopping criterion. Ete(t), the test error, is the corresponding error on the test set; it is not known to the training algorithm but estimates the generalization error and thus benchmarks the quality of the network resulting from training. In real life, the generalization error is usually unknown and only the validation error can be used to estimate it.
The value Eopt(t) is defined to be the lowest validation set error obtained in epochs up to t:
Eopt(t) := min_{t′ ≤ t} Eva(t′)
Now we define the generalization loss at epoch t to be the relative increase of the validation error over the minimum-so-far (in percent):
GL(t) = 100 · ( Eva(t) / Eopt(t) − 1 )
High generalization loss is one obvious candidate reason to stop training, because it directly indicates overfitting. This leads us to the first class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold. We define the class GLα as
GLα : stop after first epoch t with GL(t) > α
However, we might want to suppress stopping if the training is still progressing very rapidly. The reasoning behind this approach is that when the training error still decreases quickly, generalization losses have higher chance to be “repaired”; we assume that often overfitting does not begin until the error decreases only
slowly. To formalize this notion we define a training strip of length k to be a sequence of k epochs numbered n + 1 . . . n + k where n is divisible by k. The training progress (in per thousand) measured after such a training strip is then
Pk(t) := 1000 · ( ( Σ_{t′=t−k+1}^{t} Etr(t′) ) / ( k · min_{t′=t−k+1}^{t} Etr(t′) ) − 1 )
that is, “how much was the average training error during the strip larger than the minimum training error during the strip?” Note that this progress measure is high for unstable phases of training, where the training set error goes up instead of down. This is intended, because many training algorithms sometimes produce such “jitter” by taking inappropriately large steps in weight space. The progress measure is, however, guaranteed to approach zero in the long run unless the training is globally unstable (e.g. oscillating).
Now we can define the second class of stopping criteria: use the quotient of generalization loss and progress.
PQα : stop after first end-of-strip epoch t with GL(t) / Pk(t) > α
In the following we will always assume strips of length 5 and measure the validation error only at the end of each strip.
A completely different kind of stopping criterion relies only on the sign of the changes in the generalization error. We define the third class of stopping criteria: stop when the generalization error increased in s successive strips.
UPs : stop after epoch t iff UPs−1 stops after epoch t − k and Eva(t) > Eva(t − k)
UP1 : stop after first end-of-strip epoch t with Eva(t) > Eva(t − k)
The idea behind this definition is that when the validation error has increased not only once but during s consecutive strips, we assume that such increases indicate the beginning of final overfitting, independent of how large the increases actually are. The U P criteria have the advantage of measuring change locally so that they can be used in the context of pruning algorithms, where errors must be allowed to remain much higher than previous minima over long training periods.
None of these criteria alone can guarantee termination. We thus complement them by the rule that training is stopped when the progress drops below 0.1 or after at most 3000 epochs.
All stopping criteria are used in the same way: They decide to stop at some time t during training and the result of the training is then the set of weights that exhibited the lowest validation error Eopt(t). Note that in order to implement this scheme, only one duplicate weight set is needed.
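For concreteness, the three criterion classes can be sketched as small predicates over the error histories that the training loop maintains anyway. In the sketch below, e_va is the list of validation errors measured at the ends of the strips so far, e_tr_strip holds the k training errors of the most recent strip (k = 5 in this chapter), and alpha and s are the criterion parameters; the complementary termination rule (progress below 0.1 or at most 3000 epochs) is assumed to be handled by the surrounding loop.

def gl(e_va):
    # generalization loss GL(t) in percent
    return 100.0 * (e_va[-1] / min(e_va) - 1.0)

def pk(e_tr_strip):
    # training progress P_k(t) in per thousand over the last strip
    k = len(e_tr_strip)
    return 1000.0 * (sum(e_tr_strip) / (k * min(e_tr_strip)) - 1.0)

def stop_GL(e_va, alpha):
    return gl(e_va) > alpha                      # criterion GL_alpha

def stop_PQ(e_va, e_tr_strip, alpha):
    return gl(e_va) / pk(e_tr_strip) > alpha     # criterion PQ_alpha

def stop_UP(e_va, s):
    # criterion UP_s: the validation error rose at the end of s successive strips
    if len(e_va) < s + 1:
        return False
    return all(e_va[-i] > e_va[-i - 1] for i in range(1, s + 1))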
2.2.2 The Trick: Criterion Selection Rules
These three classes of stopping criteria GL, UP, and PQ were evaluated on a variety of learning problems as described in Section 2.3 below. The results
indicate that “slower” criteria, which stop later than others, on the average lead to improved generalization compared to “faster” ones. However, the training time that has to be expended for such improvements is rather large on average and also varies dramatically when slow criteria are used. The systematic differences between the criteria classes are only small.
For training setups similar to the one used in this work, the following rules can be used for selecting a stopping criterion:
1. Use fast stopping criteria unless small improvements of network performance (e.g. 4%) are worth large increases of training time (e.g. factor 4).
2. To maximize the probability of finding a “good” solution (as opposed to maximizing the average quality of solutions), use a GL criterion.
3. To maximize the average quality of solutions, use a PQ criterion if the network overfits only very little, or a UP criterion otherwise.
2.3 Where and How Well Does This Trick Work?
As no mathematical analysis of the properties of stopping criteria is possible today (see Section 2.4 for the state of the art), we resort to an experimental evaluation.
We want to find out which criteria will achieve how much generalization using how much training time on which kinds of problems. To achieve broad coverage, we use 12 different network topologies, 12 different learning tasks, and 14 different stopping criteria. To keep the experiment feasible, only one training algorithm is used.
2.3.1 Concrete Questions
To derive and evaluate the stopping criteria selection rules presented above we need to answer the following questions:
1. Training time: How long will training take with each criterion, i.e., how fast or slow are they?
2. Efficiency: How much of this training time will be redundant, i.e., will occur after the to-be-chosen validation error minimum has been seen?
3. Effectiveness: How good will the resulting network performance be?
4. Robustness: How sensitive are the above qualities of a criterion to changes of the learning problem, network topology, or initial conditions?
5. Tradeoffs: Which criteria provide the best time-performance tradeoff?
6. Quantification: How can the tradeoff be quantified?
The answers will directly lead to the rules already presented above in Section 2.2.2. To find the answers to the questions we record for a large number of runs when each criterion would stop and what the associated network performance would be.
2.3.2 Experimental Setup
Approach. To measure network performance, we partition each dataset into two disjoint parts: Training data and test data. The training data is further subdivided into a training set of examples used to adjust the network weights and a validation set of examples used to estimate network performance during training as required by the stopping criteria. The validation set is never used for weight adjustment. This decision was made in order to obtain pure stopping criteria results. In contrast, in a real application after a reasonable stopping time has been computed, one would include the validation set examples in the training set and retrain from scratch.
Stopping Criteria. The stopping criteria examined were GL1, GL2, GL3, GL5, PQ0.5, PQ0.75, PQ1, PQ2, PQ3, UP2, UP3, UP4, UP6, and UP8. All criteria were evaluated simultaneously, i.e., each single training run returned one result for each of the criteria. This approach reduces the variance of the estimation.
Learning Tasks. Twelve different problems were used, all from the Proben1 NN benchmark set [19]. All problems are real datasets from realistic application domains; they form a sample of a broad class of domains, but none of them exhibits extreme nonlinearity. The problems have between 8 and 120 inputs, between 1 and 19 outputs, and between 214 and 7200 examples. All inputs and outputs are normalized to range 0. . . 1. Nine of the problems are classification tasks using 1-of-n output encoding (cancer, card, diabetes, gene, glass, heart, horse, soybean, and thyroid ), three are approximation tasks (building, flare, and hearta).
Datasets and Network Architectures. The examples of each problem were partitioned into training (50%), validation (25%), and test set (25% of examples) in three different random ways, resulting in 36 datasets. Each of these datasets was trained with 12 different feedforward network topologies: one hidden layer networks with 2, 4, 8, 16, 24, or 32 hidden nodes and two hidden layer networks with 2+2, 4+2, 4+4, 8+4, 8+8, or 16+8 hidden nodes in the first+second hidden layer, respectively; all these networks were fully connected including all possible shortcut connections. For each of the network topologies and each dataset, two runs were made with linear output units and one with sigmoidal output units using the activation function f (x) = x/(1 + |x|).
Training Algorithm. All runs were done using the RPROP training algorithm [21] with the squared error function and the parameters η+ = 1.1, η− = 0.5, Δ0 ∈ 0.05 . . . 0.2 randomly per weight, Δmax = 50, Δmin = 0, and initial weights drawn randomly from −0.5 . . . 0.5. RPROP is a fast backpropagation variant that is about as fast as quickprop [6] but more stable without adjustment of the parameters. RPROP requires epoch learning, i.e., the weights are updated only once per epoch. Therefore, the algorithm is fast without parameter tuning for small training sets but not recommendable for large training sets. Lack of parameter tuning helps to avoid the common methodological error of tuning parameters using the test error.
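For readers unfamiliar with RPROP, the sketch below shows one epoch of a simplified variant (without the weight backtracking of the original algorithm of [21]); grad_fn(w) is assumed to return the batch gradient over the full training set, and the parameter values are the ones quoted above:

import numpy as np

def rprop_epoch(w, grad_fn, delta, g_prev,
                eta_plus=1.1, eta_minus=0.5, delta_max=50.0, delta_min=0.0):
    g = grad_fn(w)
    same = g * g_prev > 0                     # gradient kept its sign: grow the step
    flip = g * g_prev < 0                     # gradient changed sign: shrink the step
    delta = np.where(same, np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(flip, np.maximum(delta * eta_minus, delta_min), delta)
    g_eff = np.where(flip, 0.0, g)            # skip the update right after a sign change
    w = w - np.sign(g_eff) * delta            # per-weight step, driven by the gradient sign only
    return w, delta, g_eff

# initialization, as in the text:
# delta = np.random.default_rng(0).uniform(0.05, 0.2, size=w.shape); g_prev = np.zeros_like(w)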
2.3.3 Experiment Results
Altogether, 1296 training runs were made for the comparison, giving 18144 stopping criteria performance records for the 14 criteria. 270 of these records (or 1.5%) from 125 different runs reached the 3000 epoch limit instead of using the stopping criterion itself.
The results for each stopping criterion averaged over all 1296 runs are shown in Table 2.1. Figure 2.3 describes the variance underlying the means given in the table. I will now explain and then interpret the entries in both the table and the figure. Note that the discussion is biased by the particular collection of criteria chosen for the study.
Definitions. For each run, we define Eva(C) as the minimum validation error found until criterion C indicates to stop; it is the error after epoch number tm(C) (read: “time of minimum”). Ete(C) is the corresponding test error and characterizes network performance. Stopping occurs after epoch ts(C) (read: “time of stop”). A best criterion Cˆ of a particular run is one with minimum ts of all those (among the examined) with minimum Eva, i.e., a criterion that found the best validation error fastest. There may be several best, because multiple criteria may stop at the same epoch. Note that there is no single criterion Cˆ because Cˆ changes from run to run. C is called good in a particular run if Eva(C) = Eva(Cˆ), i.e., if it is among those that found the lowest validation set error, no matter how fast or slow.
2.3.4 Discussion: Answers to the Questions
We now discuss the questions raised in Section 2.3.1.
Table 2.1. Behavior of stopping criteria. SGL2 is normalized training time, BGL2 is normalized test error (both relative to GL2). r is the training time redundancy, Pg is the probability of finding a good solution. For further description please refer to the text.

C        Scˆ(C)   SGL2(C)   r(C)    Bcˆ(C)   BGL2(C)   Pg(C)
UP2      0.792    0.766     0.277   1.055    1.024     0.587
GL1      0.956    0.823     0.308   1.044    1.010     0.680
UP3      1.010    1.264     0.419   1.026    1.003     0.631
GL2      1.237    1.000     0.514   1.034    1.000     0.723
UP4      1.243    1.566     0.599   1.020    0.997     0.666
PQ0.5    1.253    1.334     0.663   1.027    1.002     0.658
PQ0.75   1.466    1.614     0.863   1.021    0.998     0.682
GL3      1.550    1.450     0.712   1.025    0.994     0.748
PQ1      1.635    1.796     1.038   1.018    0.994     0.704
UP6      1.786    2.381     1.125   1.012    0.990     0.737
GL5      2.014    2.013     1.162   1.021    0.991     0.772
PQ2      2.184    2.510     1.636   1.012    0.990     0.768
UP8      2.485    3.259     1.823   1.010    0.988     0.759
PQ3      2.614    3.095     2.140   1.009    0.988     0.800
2. Early Stopping — But When?
|
||
|
||
61
|
||
|
||
Fig. 2.3. Variance of slowness SCˆ(C) (top), redundancy r(C) (middle), and badness BCˆ(C) (bottom) for each pair of learning problem and stopping criterion. In each of the 168 columns, the dot represents the mean computed from 108 runs: learning problem and stopping criterion are fixed, while three other parameters are varied (12 topologies × 3 runs × 3 dataset variants). The length of the line is twice the standard deviation within these 108 values. Within each block of dot-line plots, the plots represent (in order) the problems building, cancer, card, diabetes, flare, gene, glass, heart, hearta, horse, soybean, thyroid. The horizontal line marks the median of the means. Note: When comparing the criteria groups, remember that overall the PQ criteria chosen are slower than the others. It is unfair to compare, for example, PQ0.5 to GL1 and UP2.
1. Training time: The slowness of a criterion C in a run, relative to another criterion x is Sx(C) := ts(C)/ts(x), i.e., the relative total training time. As we see, the times relative to a fixed criterion as shown in column SGL2(C) vary by more than factor 4. Therefore, the decision for a particular stopping criterion influences training times dramatically, even if one considers only the range of criteria used here. In contrast, even the slowest criteria train only about 2.5 times as long as the fastest criterion of each run that finds the same result, as indicated in column SCˆ(C). This shows that the training times are not completely unreasonable even for the slower criteria, but do indeed pay off to some degree.
2. Efficiency: The redundancy of a criterion can be defined as r(C) := (ts(C)/tm(C)) − 1. It characterizes how long the training continues after the final solution has been seen. r(C) = 0 would be perfect, r(C) = 1 means that the criterion trains twice as long as necessary. Low values indicate efficient criteria. As we see, the slower a criterion is, the less efficient it tends to get. Even the fastest criteria “waste” about one fifth of their overall training time. The slower criteria train twice as long as necessary to find the same solution.
3. Effectiveness: We define the badness of a criterion C in a run relative to another criterion x as Bx(C) := Ete(C)/Ete(x), i.e., its relative error on the test set. Pg(C) is the fraction of the 1296 runs in which C was a good criterion. This is an estimate of the probability that C is good in a run. As we see from the Pg column, even the fastest criteria are fairly effective. They reach a result as good as the best (of the same run) in about 60% of the cases. On the other hand, even the slowest criteria are not at all infallible; they achieve about 80%. However, Pg says nothing about how far from the optimum the non-good runs are. Columns BCˆ(C) and BGL2(C) indicate that these differences are usually rather small: column BGL2(C) shows that even the criteria with the lowest error achieve only about 1% lower error on the average than the relatively fast criterion GL2. In column BCˆ(C) we see that several only modestly slow criteria have just about 2% higher error on the average than the best criteria of the same run. For obtaining the lowest possible generalization error, independent of training time, it appears that one has to use an extreme criterion such as GL50 or even use a conjunction of all three criteria classes with high parameter values.
4. Robustness: We call a criterion robust to the degree that its performance is independent of the learning problem and the learning environment (network topology, initial conditions etc.). Optimal robustness would mean that in Figure 2.3 all dots within a block are at the same height (problem independence) and all lines have length zero (environment independence). Note that slowness and badness are measured relative to the best criterion of the same program run. We observe the following:
– With respect to slowness and redundancy, slower criteria are much less robust than faster ones. In particular the PQ criteria are quite sensitive to the learning problem, with the card and horse problems being worst in this experimental setting.
– With respect to badness, the picture is completely different: slower criteria tend to be slightly more robust than faster ones. PQ criteria are a little more robust than the others, while GL criteria are significantly less robust. All criteria are more or less unstable for the building, cancer, and thyroid problems. In particular, all GL criteria have huge problems with the building problem, whose dataset 1 is the only one that is partitioned non-randomly; it uses the chronological order of the examples, see [19]. The slower variants of the other criteria types are nicely robust in this case.
– Similar statements apply when one analyzes the influence of only large or only small network topologies separately (not shown in any figure or table). One notable exception was the fact that for networks with very few hidden nodes the PQ criteria are more cost-effective than both the GL and the UP criteria for minimizing BCˆ(C). The explanation may be that such small networks do not overfit severely; in this case it is advantageous to take training progress into account as an additional factor to determine when to stop training.
Overall, fast criteria improve the predictability of the training time, while slow ones improve the predictability of the solution quality.
5. Best tradeoffs: Despite the common overall trend, some criteria may be more cost-effective than others, i.e., provide better tradeoffs between training time and resulting network performance. Column Bcˆ of the table suggests that the best tradeoffs between test error and training time are (in order of increasing willingness to spend lots of training time) UP3, UP4, and UP6, if one wants to minimize the expected test error from a single run. These criteria are also robust. If on the other hand one wants to make several runs and pick the network that seems to be best (based on its validation error), Pg is the relevant metric and the GL criteria are preferable. Figure 2.4 illustrates these results. The upper curve corresponds to column BCˆ of the table (plotted against column SCˆ); local minima indicate
criteria with the best tradeoffs. The lower curve corresponds to column Pg; local maxima indicate the criteria with the best tradeoffs. All measurements are scaled by 1000.

[Plot: badness BCˆ(C) and Pg (vertical axis, 500–1200) against slowness SCˆ(C) (horizontal axis, 1000–2500); all values scaled by 1000.]
Fig. 2.4. Badness BCˆ(C) and Pg against slowness SCˆ(C) of criteria
6. Quantification: From columns SGL2(C) and BGL2(C) we can quantify the tradeoff involved in the selection of a stopping criterion as follows: In the range of criteria examined we can roughly trade a 4% decrease in test error (from 1.024 to 0.988) for an about fourfold increase in training time (from 0.766 to 3.095). Within this range, some criteria are somewhat better than others, but there is no panacea.
|
||
|
||
2.3.5 Generalization of These Results
It is difficult to say whether or how these results apply to different contexts than those of the above evaluation. Speculating though, I would expect that the behavior of the stopping criteria
– is similar for other learning rules, unless they frequently make rather extreme steps in parameter space,
– is similar for other error functions, unless they are discontinuous,
– is similar for other learning tasks, as long as they are in the same ballpark with respect to their nonlinearity, number of inputs and outputs, and amount of available training data.
Note, however, that at least with respect to the learning task deviations do occur (see Figure 2.3). More research is needed in order to describe which properties of the learning tasks lead to which differences in stopping criteria behavior — or more generally: in order to understand which features of tasks influence learning methods, and how.
2.4 Why This Works
Detailed theoretical analyses of the error curves cannot yet be done for the most interesting cases such as sigmoidal multi-layer perceptrons trained on a modest number of examples; today they are possible for restricted scenarios only [1, 2, 3, 24] and usually do not aim at finding the optimal stopping criterion in a way comparable to the present work. However, a simplification of the analysis performed by Wang et al. [24] or the alternative view induced by the bias/variance decomposition of the error as described by Geman et al. [10] can give some insight into why early stopping behaves as it does.
At the beginning of training (phase I), the error is dominated by what Wang et al. call the approximation error — the network has hardly learned anything and is still very biased. During training this part of the error is further and further reduced. At the same time, however, another component of the error increases: the complexity error that is induced by the increasing variance of the network model as the possible magnitude and diversity of the weights grows. If we train long enough, the error will be dominated by the complexity error
(phase III). Therefore, there is a phase during training, when the approximation and complexity (or: bias and variance) components of the error compete but none of them dominates (phase II). See Amari et al. [1, 2] for yet another view of the training process, using a geometrical interpretation. The task of early stopping as described in the present work is to detect when phase II ends and the dominance of the variance part begins.
Published theoretical results on early stopping appear to provide some nice techniques for practical application: Wang et al. [24] offer a method for computing the stopping point based on complexity considerations — without using a separate validation set at all. This could save precious training examples. Amari et al. [1, 2] compute the optimal split proportion of training data into training and validation set.
On the other hand, unfortunately, the practical applicability of these theoretical analyses is severely restricted. Wang et al.’s analysis applies to networks where only output weights are being trained; no hidden layer training is captured. It is unclear to what degree the results apply to the multi-layer networks considered here. Amari et al.’s analysis applies to the asymptotic case of very many training examples. The analysis does not give advice on stopping criteria; it shows that early stopping is not useful when very many examples are available but does not cover the much more frequent case when training examples are scarce.
There are several other theoretical works on early stopping, but none of them answers our practical questions. Thus, given these theoretic results, one is still left with making a good stopping decision for practical cases of multilayer networks with only few training examples and faced with a complicated evolution of the validation set error as shown in Figure 2.2. This is why the present empirical investigation was necessary.
The jagged form of the validation error curve during phase II arises because neither bias nor variance change monotonically, let alone smoothly. The bias error component may change abruptly because training algorithms never perform gradient descent, but take finite steps in parameter space that sometimes have severe results. The observed variance error component may change abruptly because, first, the validation set error is only an estimate of the actual generalization error and, second, the effect of a parameter change may be very different in different parts of parameter space.
Quantitatively, the different error minima that occur during phase II are quite close together in terms of size, but may be rather far apart in terms of training epoch. The exact validation error behavior seems rather unpredictable when only a short left section of the error curve is given. The behavior is also very different for different training situations.
For these reasons no class of stopping criteria has any big advantage over another (on average, for the mix of situations considered here), but scaling the same criterion to be slower always tends to gain a little generalization.
References
[1] Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H.: Statistical theory of overtraining - is cross-validation effective. In: [23], pp. 176–182 (1996)
[2] Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H.: Asymptotic statistical theory of overtraining and cross-validation. IEEE Trans. on Neural Networks 8(5), 985–996 (1997)
[3] Baldi, P., Chauvin, Y.: Temporal evolution of generalization during learning in linear networks. Neural Computation 3, 589–603 (1991)
[4] Cowan, J.D., Tesauro, G., Alspector, J. (eds.): Advances in Neural Information Processing Systems 6. Morgan Kaufman Publishers Inc., San Mateo (1994)
[5] Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: [22], pp. 598–605 (1990)
[6] Fahlman, S.E.: An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (September 1988)
[7] Fahlman, S.E., Lebiere, C.: The Cascade-Correlation learning architecture. In: [22], pp. 524–532 (1990)
[8] Fiesler, E.: Comparative bibliography of ontogenic neural networks (1994) (submitted for publication)
[9] Finnoff, W., Hergert, F., Zimmermann, H.G.: Improving model selection by nonconvergent methods. Neural Networks 6, 771–783 (1993)
[10] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58 (1992)
[11] Hanson, S.J., Cowan, J.D., Giles, C.L. (eds.): Advances in Neural Information Processing Systems 5. Morgan Kaufman Publishers Inc., San Mateo (1993)
[12] Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal brain surgeon. In: [11], pp. 164–171 (1993)
[13] Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: [16], pp. 950–957 (1992)
[14] Levin, A.U., Leen, T.K., Moody, J.E.: Fast pruning using principal components. In: [4] (1994)
[15] Lippmann, R.P., Moody, J.E., Touretzky, D.S. (eds.): Advances in Neural Information Processing Systems 3. Morgan Kaufman Publishers Inc., San Mateo (1991)
[16] Moody, J.E., Hanson, S.J., Lippmann, R.P. (eds.): Advances in Neural Information Processing Systems 4. Morgan Kaufman Publishers Inc., San Mateo (1992)
[17] Morgan, N., Bourlard, H.: Generalization and parameter estimation in feedforward nets: Some experiments. In: [22], pp. 630–637 (1990)
[18] Nowlan, S.J., Hinton, G.E.: Simplifying neural networks by soft weight-sharing. Neural Computation 4(4), 473–493 (1992)
[19] Prechelt, L.: PROBEN1 — A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany. Available by anonymous FTP from ftp.ira.uka.de as /pub/papers/techreports/1994/1994-21.ps.gz (September 1994)
[20] Reed, R.: Pruning algorithms — a survey. IEEE Transactions on Neural Networks 4(5), 740–746 (1993)
[21] Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: Proc. of the IEEE Intl. Conf. on Neural Networks, San Francisco, CA, pp. 586–591 (April 1993)
[22] Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems 2. Morgan Kaufman Publishers Inc., San Mateo (1990)
[23] Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.): Advances in Neural Information Processing Systems 8. MIT Press, Cambridge (1996)
[24] Wang, C., Venkatesh, S.S., Judd, J.S.: Optimal stopping and effective machine complexity in learning. In: [4] (1994)
[25] Weigend, A.S., Rumelhart, D.E., Huberman, B.A.: Generalization by weight-elimination with application to forecasting. In: [15], pp. 875–882 (1991)
3 A Simple Trick for Estimating the Weight Decay Parameter

Thorsteinn S. Rögnvaldsson

Centre for Computer Architecture (CCA), Halmstad University,
P.O. Box 823, S-301 18 Halmstad, Sweden
denni@cca.hh.se
http://www.hh.se/staff/denni/
Abstract. We present a simple trick to get an approximate estimate of the weight decay parameter λ. The method combines early stopping and weight decay into the estimate

λˆ = ‖∇E(W_es)‖ / (2‖W_es‖),

where W_es is the set of weights at the early stopping point, and E(W) is the training data fit error.

The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real life data sets. The result is that λˆ is as good an estimator for the optimal weight decay parameter value as the standard search estimate, but orders of magnitude quicker to compute.

The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.
3.1 Introduction
A regression problem which does not put constraints on the model used is ill-posed [21], because there are infinitely many functions that can fit a finite set of training data perfectly. Furthermore, real life data sets tend to have noisy inputs and/or outputs, which is why models that fit the data perfectly tend to be poor in terms of out-of-sample performance. Since the modeler’s task is to find a model for the underlying function while not overfitting to the noise, models have to be based on criteria which include other qualities besides their fit to the training data.
In the neural network community the two most common methods to avoid overfitting are early stopping and weight decay [17]. Early stopping has the advantage of being quick, since it shortens the training time, but the disadvantage
of being poorly defined and not making full use of the available data. Weight decay, on the other hand, has the advantage of being well defined, but the disadvantage of being quite time consuming. This is because much time is spent with selecting a suitable value for the weight decay parameter (λ), by searching over several values of λ and estimating the out-of-sample performance using e.g. cross validation [25].
In this paper, we present a very simple method for estimating the weight decay parameter, for the standard weight decay case. This method combines early stopping with weight decay, thus merging the quickness of early stopping with the more well defined weight decay method, providing a weight decay parameter which is essentially as good as the standard search method estimate when tested empirically.
We also demonstrate in this paper that the arduous process of selecting λ can be rewarding compared to simpler methods, like e.g. combining networks into committees [16].
The paper is organized as follows: In section 2 we present the background of how and why weight decay or early stopping should be used. In section 3 we review the standard method for selecting λ and also introduce our new estimate. In section 4 we give empirical evidence on how well the method works, and in section 5 we summarize our conclusions.
3.2 Ill-Posed Problems, Regularization, and Such Things...
3.2.1 Ill-Posed Problems
In what follows, we denote the input data by x(n), the target data by y(n), and the model (neural network) output by f(W, x(n)), where W denotes the parameters (weights) for the model. We assume a target data generating process of the form

y(n) = φ[x(n)] + ε(n)    (3.1)

where φ is the underlying function and ε(n) are sampled from a stationary uncorrelated (IID) zero mean noise process with variance σ². We select models f from a model family F, e.g. multilayer perceptrons, to learn an approximation to the underlying function φ, based on the training data. That is, we are searching for

f* ≡ f(W*) ∈ F  such that  E(f*, φ) ≤ E(f, φ)  ∀ f ∈ F,    (3.2)

where E(f, φ) is a measure of the “distance” between the model f and the true model φ. Since we only have access to the target values y, and not the underlying function φ, E(f, φ) is often taken to be the mean square error

E(f, φ) → E(f, y) = E(W) = (1/(2N)) Σ_{n=1}^{N} [y(n) − f(W, x(n))]².    (3.3)

Unfortunately, minimizing (3.3) is, more often than not, an ill-posed problem. That is, it does not meet the following three requirements [21]:
– The model (e.g. neural network) can learn the function from the training data, i.e. there exists a solution f* ∈ F.
– The solution is unique.
– The solution is stable under small variations in the training data set. For instance, training with two slightly different training data sets sampled from the same process must result in similar solutions (similar when evaluated on e.g. test data).
The first and second of these requirements are often not considered serious problems. It is always possible to find a multilayer perceptron that learns the training data perfectly by using many internal units, since any continuous function can be constructed with a single hidden layer network with sigmoid units (see e.g. [6]), and we may be happy with any solution and ignore questions on uniqueness. However, a network that has learned the training data perfectly will be very sensitive to changes in the training data. Fulfilling the first requirement is thus usually in conflict with fulfilling the third requirement, which is a really important requirement. A solution which changes significantly with slightly different training sets will have very poor generalization properties.
3.2.2 Regularization
It is common to introduce so-called regularizers¹ in order to make the learning task well posed (or at least less ill-posed). That is, instead of only minimizing an error of fit measure like (3.3) we augment it with a regularization term λR(W) which expresses e.g. our prior beliefs about the solution.
The error functional then takes the form

E(W) = (1/(2N)) Σ_{n=1}^{N} [y(n) − f(W, x(n))]² + λR(W) = E0(W) + λR(W),    (3.4)

where λ is the regularization parameter which weighs the importance of R(W) relative to the error of fit E0(W).
The effect of the regularization term is to shrink the model family F , or make some models more likely than others. As a consequence, solutions become more stable to small perturbations in the training data.
The term “regularization” encompasses all techniques which make use of penalty terms added to the error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of “hints” [1] can also be called regularization.
3.2.3 Bias and Variance
The benefit of regularization is often described in the context of model bias and model variance. This originates from the separation of the expected generalization error Egen into three terms [8]

Egen = ∫ ⟨[y(x) − f(x)]²⟩ p(x) dx
     = ∫ [φ(x) − ⟨f(x)⟩]² p(x) dx + ∫ ⟨[f(x) − ⟨f(x)⟩]²⟩ p(x) dx + ∫ [y(x) − φ(x)]² p(x) dx
     = Bias² + Variance + σ²,    (3.5)

where ⟨·⟩ denotes taking the expectation over an ensemble of training sets. Here p(x) denotes the input data probability density.

¹ Called “stabilizers” by Tikhonov [21].
A high sensitivity to training data noise corresponds to a large model variance. A large bias term means either that φ ∉ F, or that φ is downweighted in favour of other models in F. We thus have a trade-off between model bias and model variance, which corresponds to the trade-off between the first and third requirements on well-posed problems.
Model bias is weighed versus model variance by selecting both a parametric form for R(W) and an optimal² value for the regularization parameter λ.
Many neural network practitioners ignore the first part and choose weight decay by default, which corresponds to a Gaussian parametric form for the prior on W. Weight decay is, however, not always the best choice (in fact, it is most certainly not the best choice for all problems). Weight decay does not, for instance, consider the function the network is producing; it only puts a constraint on the parameters. Another, perhaps more correct, choice would be to constrain the higher order derivatives of the network function (which is commonplace in statistics) like in e.g. [14].
3.2.4 Bayesian Framework
From a Bayesian and maximum likelihood perspective, prior information about the model (f ) is weighed against the likelihood of the training data (D) through Bayes theorem (see [4] for a discussion on this). Denote the probability for observing data set D by p(D), the prior distribution of models f by p(f ), and the likelihood for observing the data D, if f is the correct model, by p(D|f ). We then have for the posterior probability p(f |D) for the model f given the observed data D
|
||
|
||
p(f |D) = p(D|f )p(f ) ⇒ p(D)
|
||
|
||
− ln p(f |D) = − log p(D|f ) − ln p(f ) + ln p(D) ⇒
|
||
|
||
N
|
||
− ln p(f |D) = [y(n) − f (W , x(n))]2 − ln p(f ) + constant,
|
||
|
||
(3.6)
|
||
|
||
n=1
|
||
|
||
where Gaussian noise ε is assumed in the last step. If we identify 2N λR(W ) with the negative logarithm of the model prior, − ln p(f ), then maximizing p(f |D) is equivalent to minimizing expression (3.4).
|
||
|
||
2 Optimality is usually measured via cross-validation or some similar method.
From this perspective, choosing R(W ) is equivalent to choosing a parameterized form for the model prior p(f ), and selecting a value for λ corresponds to estimating the parameters for the prior.
3.2.5 Weight Decay
Weight decay [17] is the neural network equivalent to the Ridge Regression [11] method. In this case R(W) = ‖W‖² = Σ_k w_k² and the error functional is

E(W) = E0(W) + λR(W) = (1/(2N)) Σ_{n=1}^{N} [y(n) − f(W, x(n))]² + λ‖W‖²,    (3.7)

and λ is usually referred to as the weight decay parameter. In the Bayesian framework, weight decay means implicitly imposing the model prior

p[f(W)] = (λ / (2πσ²)) exp( −λ‖W‖² / (2σ²) )    (3.8)

where σ² is the variance of the noise in the data. Weight decay often improves the generalization properties of neural network models, for reasons outlined above.
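To make the role of λ concrete, the following minimal NumPy sketch (an illustration added for this edition, not code from the chapter; the linear model and all data are made up) shows how the penalty λ‖W‖² enters a plain gradient step on an error functional of the form (3.7): the fit gradient is simply augmented by 2λW.

import numpy as np

def weight_decay_step(W, X, y, lam, eta):
    # One gradient step on E(W) = (1/(2N)) * sum (y - X W)^2 + lam * ||W||^2,
    # using a linear model f(W, x) = x . W only to keep the sketch short.
    N = len(y)
    residual = X @ W - y
    grad_fit = X.T @ residual / N          # gradient of the fit term E0(W)
    grad_reg = 2.0 * lam * W               # gradient of the penalty lam * ||W||^2
    return W - eta * (grad_fit + grad_reg)

# toy usage with invented data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
W = np.zeros(3)
for _ in range(2000):
    W = weight_decay_step(W, X, y, lam=1e-3, eta=0.1)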
3.2.6 Early Stopping
Undoubtedly, the simplest and most widely used method to avoid overfitting is to stop training before the training set has been learned perfectly. This is done by setting aside a fraction of the training data for estimating the out-of-sample performance. This data set is called the validation data set. Training is then stopped when the error on the validation set starts to increase. Early stopping often shortens the training time significantly, but suffers from being ill-defined since there really is no well defined stopping point, and wasteful with data, since a part of the data is set aside.
There is a connection between early stopping and weight decay, if learning starts from small weights, since weight decay applies a potential which forces all weights towards zero. For instance, Sjöberg and Ljung [20] show that, if a constant learning rate η is used, the number of iterations n at which training is stopped is related to the weight decay parameter λ roughly as

λ ∼ 1 / (2ηn).    (3.9)
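As a rough numerical illustration (the numbers are mine, not from the chapter): with a constant learning rate η = 0.01 and a stopping point at n = 5,000 iterations, relation (3.9) would suggest λ ∼ 1/(2 · 0.01 · 5000) = 0.01; the caveats discussed next explain why such a value can only serve as an order-of-magnitude hint.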
This does not, however, mean that using early stopping is equivalent to using weight decay in practice. Expression (3.9) is based on a constant learning rate, a local expansion around the optimal stopping point, ignoring local minima, and assumes small input noise levels, which may not reflect the situation when overfitting is a serious problem. The choice of learning algorithm can also affect
the early stopping point, and one cannot expect (3.9) to hold exactly in the practical case.
Inspired by this connection between early stopping and weight decay, we use early stopping in the following section to estimate the weight decay parameter λ.
3.3 Estimating λ
From a pure Bayesian point of view, the prior is something we know/assume in advance and do not use the training data to select (see e.g. [5]). There is consequently no such thing as “λ selection” in the pure Bayesian model selection scheme. This is of course perfectly fine if the prior is correct. However, if we suspect that our choice of prior is less than perfect, then we are better off if we take an “empirical Bayes” approach and use the data to tune the prior, through λ.
Several options for selecting λ have been proposed. Weigend et al. [26] present, for a slightly different weight cost term, a set of heuristic rules for changing λ during the training. Although Weigend et al. demonstrate the use of these heuristics on a couple of time series problems, we cannot get these rules to work consistently to our satisfaction. A more principled approach is to try several values of λ and estimate the out-of-sample error, either by correcting the training error, with some factor or term, or by using cross-validation. The former is done in e.g. [10], [23], and [24] (see also references therein). The latter is done by e.g. [25].
The method of using validation data for estimating the out-of-sample error is robust but slow since it requires training several models. We use cross-validation here because of its reliability.
3.3.1 Search Estimates
Finding the optimal λ requires the use of a search algorithm, which must be robust because the validation error can be very noisy. A simple and straightforward way is to start at some large λ where the validation error is large, due to the large model bias, and step towards lower values until the out-of-sample error becomes large again, due to the large model variance. In our experience, it often makes sense to do the search in log λ (i.e. with equally spaced increments in log λ).
The result of such a search is a set of K values {λk} with corresponding average n-fold cross validation errors {log EnCV,k} and standard deviations {σnCV,k} for the validation errors. These are defined as

log EnCV,k = (1/n) Σ_{j=1}^{n} log Ej,k    (3.10)

σ²nCV,k = (1/(n−1)) Σ_{j=1}^{n} (log Ej,k − log EnCV,k)²    (3.11)

when λ = λk. The number of validation data sets is n and Ej,k denotes the validation error when λ = λk and we use validation set j. Taking logarithms is motivated by our observation that the validation error distribution looks approximately log-normal and we use this in our selection of the optimal λ value below.
Once the search is finished, the optimal λ is selected. This is not necessarily trivial since a large range of values may look equally good, or one value may have a small average cross-validation error with a large variation in this error, and another value may have a slightly higher average cross-validation error with a small variation in this error. The simplest approach is to look at a plot of the validation errors versus λ and make a judgement on where the optimal λ is, but this adds an undesired subjectiveness to the choice. Another is to take a weighted average over the different λ values, which is what we use here (see Ripley [19] for a discussion on variants of λ selection methods).
Our estimate for the optimal λ is the value

λˆopt = ( Σ_{k=1}^{K} nk λk ) / ( Σ_{k=1}^{K} nk )    (3.12)

where nk is the number of times λk corresponds to the minimum validation error when we sample validation errors from K log-normal distributions with means log EnCV,k and standard deviations σnCV,k, assuming that the validation errors are independent. This is illustrated on a hypothetical example in Figure 3.1.

The choice (3.12) was done after confirming that it often agrees well with our subjective choice for λ. We refer to this below as a “Monte Carlo estimate” of λ.
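A minimal NumPy sketch of this Monte Carlo selection, added for this edition as an illustration rather than taken from the original chapter, could look as follows; the arrays log_mean and log_std stand in for the search results log EnCV,k and σnCV,k, and the U-shaped placeholder values are invented:

import numpy as np

def monte_carlo_lambda(lambdas, log_mean, log_std, n_draws=10_000, seed=0):
    # Estimate lambda_opt as in (3.12): repeatedly sample one validation error
    # per lambda from its lognormal distribution (equivalently, its log-error
    # from a normal distribution) and count how often each lambda value wins.
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(lambdas), dtype=int)
    for _ in range(n_draws):
        sampled_log_errors = rng.normal(log_mean, log_std)
        wins[np.argmin(sampled_log_errors)] += 1
    return np.sum(wins * np.asarray(lambdas)) / np.sum(wins)

# hypothetical search results on the grid used in the experiments
lambdas = 10.0 ** np.arange(-6.5, 1.01, 0.5)
log_mean = 0.05 * (np.log10(lambdas) + 2.5) ** 2 + 0.5   # invented U-shaped CV curve
log_std = np.full(len(lambdas), 0.2)
print(monte_carlo_lambda(lambdas, log_mean, log_std))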
3.3.2 Two Early Stopping Estimates
If W* is the set of weights when E(W) in eq. (3.4) is minimized, then

∇E(W*) = ∇E0(W*) + λ∇R(W*) = 0,    (3.13)

which implies

λ = ‖∇E0(W*)‖ / ‖∇R(W*)‖    (3.14)

for the regularization parameter λ. Thus, if we have a reasonable estimate of W*, or of ‖∇E0(W*)‖ and ‖∇R(W*)‖, then we can use this to estimate λ. An appealingly simple way of estimating W* is to use early stopping, because of its connection with weight decay. Denoting the set of weights at the early stopping point by W_es, we have

λˆ1 = ‖∇E0(W_es)‖ / ‖∇R(W_es)‖,    (3.15)

as a simple estimate for λ. A second possibility is to consider the whole set of linear equations defined by (3.13) and minimize the squared error

‖∇E0(W_es) + λ∇R(W_es)‖² = ‖∇E0(W_es)‖² + 2λ ∇E0(W_es)·∇R(W_es) + λ²‖∇R(W_es)‖²    (3.16)
with respect to λ. That is, solving the equation

(∂/∂λ) ‖∇E0(W_es) + λ∇R(W_es)‖² = 0    (3.17)

which gives

λˆ2 = max[ 0, −∇E0(W_es)·∇R(W_es) / ‖∇R(W_es)‖² ].    (3.18)

Fig. 3.1. Illustration of the procedure for estimating λˆopt on a hypothetical example. From the search we have a set of K lognormal distributions with means log EnCV,k and variances σ²nCV,k, which is illustrated in the top plate. From these K distributions, we sample K error values and select the λ corresponding to the minimum error value as “winner”. This is repeated several times (100 in the figure but 10,000 times in the experiments in the text), collecting statistics on how often different λ values are winners, and the mean log λ is computed. This is illustrated in the bottom plate, which shows the histogram resulting from sampling 100 times. From this we get log λˆopt = −2.48 ± 1.15, which gives us λ = 10−2.48 = 0.003 for training the “best” network.
The estimate is bound from below since λ must be positive. The second estimate, λˆ2, corresponds to a linear regression without intercept term on the set of points {∂iE0(W_es), ∂iR(W_es)}, whereas the first estimate, λˆ1, is closer to the ratio max[|∂iE0(W_es)|] / max[|∂iR(W_es)|]. It follows from the Cauchy-Schwarz inequality that
λˆ1 ≥ λˆ2.    (3.19)
For the specific case of weight decay, where R(W) = ‖W‖², expressions (3.15) and (3.18) become

λˆ1 = ‖∇E0(W_es)‖ / (2‖W_es‖),    (3.20)

λˆ2 = max[ 0, −∇E0(W_es)·W_es / (2‖W_es‖²) ].    (3.21)
These estimates are sensitive to the particularities of the training and validation data sets used, and possibly also to the training algorithm. One must therefore average them over different validation and training sets. It is, however, still quicker to do this than to do a search since early stopping training often is several orders of magnitude faster to do than a full minimization of (3.7).
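In the weight decay case, both estimates are cheap to compute once an early-stopped network is available. The sketch below is an illustration added for this edition of (3.20) and (3.21); grad_e0 and w_es are assumed to be the fit-error gradient ∇E0(W_es) and the weights W_es, each flattened into a single vector, and the averaging helper follows the procedure of taking means of the log-estimates over several runs:

import numpy as np

def lambda_hat_1(grad_e0, w_es):
    # Estimate (3.20): ||grad E0(W_es)|| / (2 ||W_es||)
    return np.linalg.norm(grad_e0) / (2.0 * np.linalg.norm(w_es))

def lambda_hat_2(grad_e0, w_es):
    # Estimate (3.21): max(0, -grad E0(W_es) . W_es / (2 ||W_es||^2))
    return max(0.0, -np.dot(grad_e0, w_es) / (2.0 * np.dot(w_es, w_es)))

def average_log_lambda(estimates):
    # Combine estimates from several early stopping runs on the log scale
    # (runs that returned exactly zero would have to be excluded first).
    logs = np.log10(np.asarray(estimates))
    return logs.mean(), logs.std(ddof=1)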
One way to view the estimates (3.15) and (3.18) is as the weight decay parameters that correspond to the early stopping point. However, our aim here is not to imitate early stopping with weight decay, but to use early stopping to estimate the weight decay parameter λ. We hope that using weight decay with this λ value will actually result in better out-of-sample performance than what we get from doing early stopping (the whole exercise becomes rather meaningless if this is not the case).
As a sidenote, we imagine that (3.15) and (3.18) could be used also to estimate weight decay parameters in cases when different weight decays are used for weights in different layers. This would then be done by considering these estimates for different groups of weights.
3.4 Experiments
3.4.1 Data Sets
We here demonstrate the performance of our algorithm on a set of five regression problems. For each problem, we vary either the number of inputs, the number of hidden units, or the amount of training data to study the effects of the numbers of parameters relative to the number of training data points. The five problems are:
Synthetic Bilinear Problem. The task is to model a bilinear function of the form

φ(x1, x2) = x1 x2.    (3.22)

We use three different sizes of training data sets, M ∈ {20, 40, 100}, but a constant validation set size of 10 patterns. The validation patterns are in addition to the M training patterns. The test error, or generalization error, is computed by numerical integration over 201 × 201 data points on a two-dimensional lattice (x1, x2) ∈ [−1, 1]². The target values (but not the inputs) are contaminated with three different levels of Gaussian noise with standard deviation σ ∈ {0.1, 0.2, 0.5}. This gives a total of 3 × 3 = 9 different experiments on this particular problem, which we refer to as setup A1, A2, ..., and A9 below.
This allows controlled studies w.r.t. noise levels and training set sizes, while keeping the network architecture constant (2 inputs, 8 tanh hidden, and one linear output).
Predicting Puget Sound Power and Light Co. Power Load between 7 and 8 a.m. the Following Day. This data set is taken from the Puget Sound Power and Light Co’s power prediction competition [3]. The winner of this competition used a set of linear models, one for each hour of the day. We have selected the subproblem of predicting the load between 7 and 8 a.m. 24 hrs. in advance. This hour shows the largest variation in power load. The training set consists of 844 weekdays between January 1985 and September 1990. Of these, 150 days are randomly selected and used for validation. We use 115 winter weekdays, from between November 1990 and March 1992, for out-of-sample testing. The inputs are things like current load, average load during the last 24 hours, average load during the last week, time of the year, etc., giving a total of 15 inputs. Three different numbers of internal units are tried on this task: 15, 10, and 5, and we refer to these experiments as B1, B2, and B3 below.
Predicting Daily Riverflow in Two Icelandic Rivers. This problem is tabulated in [22], and the task is to model tomorrow’s average flow of water in one of two Icelandic rivers, knowing today’s and previous days’ waterflow, temperature, and precipitation. The training set consists of 731 data points, corresponding to the years 1972 and 1973, out of which we randomly sample 150 datapoints for validation. The test set has 365 data points (the year 1974). We use two different lengths of lags, 8 or 4 days back, which correspond to 24 or 12 inputs, while the number of internal units is kept constant at 12. These experiments are referred to as C1, C2, C3, and C4 below.
Predicting the Wolf Sunspots Time Series. This time series has been used several times in the context of demonstrating new regularization techniques, for instance by [15] and [26]. We try three different network architectures on this problem, always keeping 12 input units but using 4, 8, or 12 internal units in the network. These experiments are referred to as setup D1, D2, and D3 below. The training set size is kept constant at M = 221 (years 1700-1920), out of which we randomly pick 22 patterns for validation. We test our models under four different conditions: Single step prediction on “test set 1” with 35 data points (years 1921-1955), 4-step iterated prediction on “test set 1”, 8-step iterated prediction on all 74 available test years (1921-1994), and 11-step iterated prediction on all available test years. These test conditions are coded as s1, m4, m8, and m11.
Estimating the Peak Pressure Position in a Combustion Engine. This is a data set with 4 input variables (ignition time, engine load, engine speed, and air/fuel ratio) and only 49 training data points, out of which we randomly pick 9 patterns for validation. The test set consists of 35 data points, which have been measured under slightly different conditions than the training data. We try four different numbers of internal units on this task: 2, 4, 8, or 12, and refer to these experiments as E1, E2, E3, and E4.
3.4.2 Experimental Procedure
The experimental procedure is the same for all problems: We begin by estimating λ in the “traditional” way by searching over the region log λ ∈ [−6.5, 1.0] in steps of Δ log λ = 0.5. For each λ value, we train 10 networks using the Rprop training algorithm³ [18]. Each network is trained until the total error (3.7) is minimized, measured by

log[ (1/100) Σ_{i=1}^{100} |ΔEi| / ‖ΔWi‖ ] < −5,    (3.23)

where the sum runs over the most recent 100 epochs, or until 10⁵ epochs have passed, whichever occurs first. The convergence criterion (3.23) is usually fulfilled within 10⁵ epochs. New validation and training sets are sampled for each of the 10 networks, but the different validation sets are allowed to overlap. Means and standard deviations, log EnCV,k and σnCV,k, for the errors are estimated from these 10 network runs, assuming a lognormal distribution for the validation errors. Figure 3.2 shows an example of such a search for the Wolf sunspot problem, using a neural network with 12 inputs, 8 internal units, and 1 linear output.
|
||
Weight decay, Wolf sunspots (12-8-1) 0.5
|
||
Training (199) Validation (22) 0.0
|
||
-0.5
|
||
|
||
Weight decay, Wolf sunspots (12-8-1) 0.4
|
||
Test (35)
|
||
0.0
|
||
|
||
<log10(MSE/1535)> <log10(MSE/1535)>
|
||
|
||
-1.0
|
||
|
||
-0.4
|
||
|
||
-1.5 -0.8
|
||
-2.0
|
||
|
||
-2.5 -7 -6 -5 -4 -3 -2 -1 0 1 log10(lambda)
|
||
|
||
-1.2 -7 -6 -5 -4 -3 -2 -1 0 1 log10(lambda)
|
||
|
||
Fig. 3.2. Left panel: Training and validation errors on the Wolf sunspot time series, setup D2, plotted versus the weight decay parameter λ. Each point corresponds to an average over 10 runs with different validation and training sets. The error bars mark 95% confidence limits for the average validation and training errors, under the assumption that the errors are lognormally distributed. The objective Monte Carlo method gives log λˆopt = −2.00 ± 0.31. Right panel: The corresponding plot for the test error on the sunpots “test set 1”. The network architecture is 12 inputs, 8 tanh internal units, and 1 linear output.
Using the objective Monte Carlo method described above, we estimate an optimal λˆopt value from this search. This value is then used to train 10 new networks with all the training data (no validation set). The test errors for these networks are then computed using the held out test set.

³ Initial tests showed that the Rprop algorithm was considerably more efficient and robust than e.g. backprop or conjugate gradients in minimizing the error. We did not, however, try true second order algorithms like Levenberg-Marquardt or Quasi-Newton.

Fig. 3.3. Left panel: Histogram showing the estimated values λˆ1 for 100 different training runs, using different training and validation sets each time. Right panel: Similar histogram for λˆ2. The problem (D2) is the same as that depicted in Figure 3.2.
A total of 16 × 10 = 160 network runs are thus done to select the λˆopt for each experiment. This corresponds to a few days’ or a week’s work, depending on available hardware and the size of the problem. Although this is in excess of what is really needed in practice (one could get away with about half as many runs in a real application), the time spent doing this is aggravating. The times needed for doing the searches described in this paper ranged from 10 up to 400 cpu-hours, depending on the problem and the computer⁴. For comparison, the early stopping experiments described below took between 10 cpu-minutes and 14 cpu-hours. There was typically a ratio of 40 between the time needed for a search and the time needed for an early stopping estimate.
We then estimate λˆ1 and λˆ2 by training 100 networks with early stopping. One problem here is that the stopping point is ill-defined, i.e. the first observed minimum in the validation error is not necessarily the minimum where one should stop. The validation error quite often decreases again beyond this point. To avoid such problems, we keep a record of the weights corresponding to the latest minimum validation error and continue training beyond that point. The training is stopped when as many epochs have passed as it took to find the validation error minimum without encountering a new minimum. The weights corresponding to the last validation error minimum are then used as the early stopping weights. For example, if the validation error has a minimum at say 250 epochs, we then wait until a total of 500 epochs have passed before deciding on that particular stopping point. From the 100 networks, we get 100 estimates for λˆ1 and λˆ2. We take the logarithm of these and compute means ⟨log λˆ1⟩ and ⟨log λˆ2⟩, and corresponding standard deviations.
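This stopping rule amounts to a patience that grows with the epoch of the current best validation error: training continues until as many epochs have passed since the best epoch as it took to reach it. A minimal sketch of the rule (my own; train_one_epoch, validation_error and copy_weights are hypothetical helpers, not functions from the chapter):

def early_stop_training(model, max_epochs=100_000):
    # Stop when no new validation minimum has appeared for as many epochs
    # as it took to find the current minimum, i.e. at epoch >= 2 * best_epoch.
    best_err = float("inf")
    best_epoch = 0
    best_weights = None
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)                  # hypothetical helper
        err = validation_error(model)           # hypothetical helper
        if err < best_err:
            best_err, best_epoch = err, epoch
            best_weights = copy_weights(model)  # hypothetical helper
        elif epoch >= 2 * best_epoch:
            break
    return best_weights, best_epoch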
⁴ A variety of computers were used for the simulations, including NeXT, Sun Sparc, DEC Alpha, and Pentium computers running Solaris.
The resulting arithmetic mean values are taken as the estimates for λ and the standard deviations are used as measures of the estimation error. The arithmetic means are then used to train 10 networks which use all the training data. Figure 3.3 shows the histograms corresponding to the problem presented in Figure 3.2.
When comparing test errors achieved with different methods, we use the Wilcoxon rank test [13], also called the Mann-Whitney test, and report differences at 95% confidence level.
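For reference, this kind of comparison is straightforward to reproduce with SciPy; the two lists of test errors below are invented numbers purely for illustration:

from scipy.stats import mannwhitneyu

errors_a = [0.112, 0.118, 0.109, 0.121, 0.115, 0.117, 0.110, 0.119, 0.113, 0.116]
errors_b = [0.125, 0.131, 0.122, 0.128, 0.127, 0.133, 0.124, 0.129, 0.126, 0.130]
stat, p = mannwhitneyu(errors_a, errors_b, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")   # p < 0.05 counts as significant at the 95% level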
3.4.3 Quality of the λ Estimates
As a first test of the quality of the estimates λˆ1 and λˆ2, we check how well they agree with the λˆopt estimate, which can be considered a “truth”. The estimates for all the problem setups are tabulated in table 3.1 and plotted in Figure 3.4.
Table 3.1. Estimates of λ for the 23 different problem setups. Code A corresponds to the synthetic problem, code B to the Power prediction, code C to the riverflow prediction, code D to the Sunspots series, and code E to the maximum pressure position problem. For the log λˆopt column, errors are the standard deviations of the Monte Carlo estimate. For the early stopping estimates, errors are the standard deviations of the estimates.

Problem                      | log λˆopt    | log λˆ1      | log λˆ2
A1 (M = 20, σ = 0.1)         | -2.82 ± 0.04 | -2.71 ± 0.66 | -3.44 ± 1.14
A2 (M = 20, σ = 0.2)         | -2.67 ± 0.42 | -2.32 ± 0.58 | -3.20 ± 0.96
A3 (M = 20, σ = 0.5)         | -0.49 ± 1.01 | -1.93 ± 0.78 | -3.14 ± 1.15
A4 (M = 40, σ = 0.1)         | -2.93 ± 0.49 | -2.85 ± 0.73 | -3.56 ± 0.87
A5 (M = 40, σ = 0.2)         | -2.53 ± 0.34 | -2.41 ± 0.64 | -2.91 ± 0.68
A6 (M = 40, σ = 0.5)         | -2.43 ± 0.44 | -2.13 ± 0.74 | -2.85 ± 0.77
A7 (M = 100, σ = 0.1)        | -3.45 ± 0.78 | -3.01 ± 0.86 | -3.74 ± 0.93
A8 (M = 100, σ = 0.2)        | -3.34 ± 0.71 | -2.70 ± 0.73 | -3.33 ± 0.92
A9 (M = 100, σ = 0.5)        | -3.31 ± 0.82 | -2.34 ± 0.63 | -3.13 ± 1.06
B1 (Power, 15 hidden)        | -3.05 ± 0.21 | -3.82 ± 0.42 | -5.20 ± 0.70
B2 (Power, 10 hidden)        | -3.57 ± 0.35 | -3.75 ± 0.45 | -4.93 ± 0.50
B3 (Power, 5 hidden)         | -4.35 ± 0.66 | -3.78 ± 0.52 | -5.03 ± 0.74
C1 (Jökulsá Eystra, 8 lags)  | -2.50 ± 0.10 | -3.10 ± 0.33 | -4.57 ± 0.59
C2 (Jökulsá Eystra, 4 lags)  | -2.53 ± 0.12 | -3.15 ± 0.40 | -4.20 ± 0.59
C3 (Vatnsdalsá, 8 lags)      | -2.48 ± 0.11 | -2.65 ± 0.40 | -3.92 ± 0.56
C4 (Vatnsdalsá, 4 lags)      | -2.39 ± 0.55 | -2.67 ± 0.45 | -3.70 ± 0.62
D1 (Sunspots, 12 hidden)     | -2.48 ± 0.12 | -2.48 ± 0.50 | -3.70 ± 0.42
D2 (Sunspots, 8 hidden)      | -2.00 ± 0.31 | -2.43 ± 0.45 | -3.66 ± 0.60
D3 (Sunspots, 4 hidden)      | -2.51 ± 0.44 | -2.39 ± 0.48 | -3.54 ± 0.65
E1 (Pressure, 12 hidden)     | -3.13 ± 0.43 | -3.03 ± 0.70 | -4.69 ± 0.91
E2 (Pressure, 8 hidden)      | -3.01 ± 0.52 | -3.02 ± 0.64 | -4.72 ± 0.82
E3 (Pressure, 4 hidden)      | -3.83 ± 0.80 | -3.07 ± 0.71 | -4.50 ± 1.24
E4 (Pressure, 2 hidden)      | -4.65 ± 0.78 | -3.46 ± 1.34 | -4.21 ± 1.40
The linear correlation between log λˆ1 and log λˆopt is 0.71, which is more than three standard deviations larger than the expected correlation between 23 random points. Furthermore, a linear regression with intercept gives the result

λˆopt = 0.30 + 1.13 λˆ1.    (3.24)

Thus, λˆ1 is a fairly good estimator of λˆopt. The linear correlation between λˆ2 and λˆopt is 0.48, more than two standard deviations from the random correlation. A linear regression gives

λˆopt = −0.66 + 0.57 λˆ2,    (3.25)

and the second estimator λˆ2 is clearly a less good estimator of λˆopt.
Fig. 3.4. Plot of the results in Table 3.1. Left plate: The λˆ1 estimate plotted versus λˆopt. The linear correlation between log λˆ1 and log λˆopt is 0.71. Right plate: λˆ2 plotted versus λˆopt. The linear correlation between log λˆ2 and log λˆopt is 0.48. The sizes of the crosses correspond to the error bars in Table 3.1.
We next compare the out-of-sample performances of these different λ estimates, which is what really matters to the practitioner. Table 3.2 lists the differences in out-of-sample performance when using the early stopping estimates or the search estimate. A “+” means that using the early stop estimate results in significantly (95% significance level) lower test error than if λˆopt is used. Similarly, a “–” means that the search estimate gives significantly lower test error than the early stopping estimates. A “0” means there is no significant difference. The conclusion from Table 3.2 is that λˆ2 is significantly worse than λˆopt, but that there is no consistent difference between λˆ1 and λˆopt. The two estimates are essentially equal, in terms of test error. In some cases, like the power prediction problem, it would have been beneficial to do a small search around the early stop estimate to check for a possibly better value.
||
Table 3.2. Relative performance of single networks trained using the estimates λˆ1 and λˆ2, for the weight decay parameter, and the performance of single networks trained using the search estimate λˆopt. The relative performances are reported as: “+” means that using λˆi results in a test error which is significantly lower than what the search estimate λˆopt gives, “0” means that the performances are equivalent, and “–” means that using λˆopt results in a lower test error than when using λˆi. All results are reported for a 95% confidence level when using the Wilcoxon test. See the text on why the E results are left out.

Problem Setup                 | λˆ1 vs. λˆopt | λˆ2 vs. λˆopt
A1 (M = 20, σ = 0.1)          | 0 | 0
A2 (M = 20, σ = 0.2)          | 0 | –
A3 (M = 20, σ = 0.5)          | 0 | 0
A4 (M = 40, σ = 0.1)          | 0 | 0
A5 (M = 40, σ = 0.2)          | 0 | –
A6 (M = 40, σ = 0.5)          | 0 | 0
A7 (M = 100, σ = 0.1)         | – | 0
A8 (M = 100, σ = 0.2)         | 0 | 0
A9 (M = 100, σ = 0.5)         | + | 0
B1 (Power, 15 hidden)         | – | –
B2 (Power, 10 hidden)         | – | –
B3 (Power, 5 hidden)          | + | –
C1 (Jökulsá Eystra, 8 lags)   | – | –
C2 (Jökulsá Eystra, 4 lags)   | 0 | –
C3 (Vatnsdalsá, 8 lags)       | – | –
C4 (Vatnsdalsá, 4 lags)       | 0 | –
D1.s1 (Sunspots, 12 hidden)   | 0 | –
D2.s1 (Sunspots, 8 hidden)    | + | –
D3.s1 (Sunspots, 4 hidden)    | 0 | –
D1.m4 (Sunspots, 12 hidden)   | 0 | –
D2.m4 (Sunspots, 8 hidden)    | + | –
D3.m4 (Sunspots, 4 hidden)    | + | –
D1.m8 (Sunspots, 12 hidden)   | 0 | –
D2.m8 (Sunspots, 8 hidden)    | – | –
D3.m8 (Sunspots, 4 hidden)    | + | –
D1.m11 (Sunspots, 12 hidden)  | 0 | 0
D2.m11 (Sunspots, 8 hidden)   | + | –
D3.m11 (Sunspots, 4 hidden)   | 0 | –
3.4.4 Weight Decay versus Early Stopping Committees
Having trained all these early stopping networks, it is reasonable to ask whether using them to estimate λ for a weight decay network is really the best use of them. Another possible use is, for instance, to construct a committee [16] from them.
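A committee in the sense of [16] simply averages the outputs of the individually trained networks; a minimal sketch (my own illustration; members is assumed to be a list of trained models exposing a predict method):

import numpy as np

def committee_predict(members, X):
    # Average the predictions of the early-stopped member networks.
    return np.mean([m.predict(X) for m in members], axis=0)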
To test this, we compare the test errors for our regularized networks with those when using a committee of 10 networks trained with early stopping. The results are listed in Table 3.3.
Some observations from Table 3.3, bearing in mind that the set of problems is small, are: Early stopping committees seem like the better option when the problem is very noisy (setups A3, A6, and A9), and when the network does not have very many degrees of freedom (setups B3, C4, and D3). Weight decay networks, on the other hand, seem to work better than committees on problems with many degrees of freedom (setups B1 and C3), problems with low noise levels and much data (setup A7), and problems where the prediction is iterated through the network (m4, m8, and m11 setups). We emphasize, however, that these conclusions are drawn from a limited set of problems and that all problems tend to have their own set of weird characteristics.
We also check which model works best on each problem. On the power prediction, the best overall model is a large network (B1) which is trained with weight decay. On the river prediction problems, the best models are small (C2 and C4) and trained with either weight decay (Jökulsá Eystra) or early stopping and then combined into committees (Vatnsdalsá). On the sunspot problem, the best overall model is a large network (D1) trained with weight decay.
These networks are competitive with previous results on the same data sets. The performance of the power load B1 weight decay networks, using λˆopt, is significantly better than what a human expert produces, and also significantly better than the results by the winner of the Puget Sound Power and Light Co. Power Load Competition [7], although the difference is small. The test results are summarized in Figure 3.5. The performance of the sunspot D1 weight decay network is comparable with the network by Weigend et al., listed in [26]. Figure 3.6 shows the performance of the D1 network trained with weight decay, λ = λˆ1, and compares it to the results by Weigend et al. [26]. The weight decay network produces these results using a considerably simpler λ selection method and regularization cost than the one presented in [26].
From these anecdotal results, one could be bold and say that weight decay shows a slight edge over early stopping committees. However, it is fair to say that it is a good idea to try both committees and weight decay when constructing predictor models.
It is emphasized that these results are from a small set of problems, but that these problems (except perhaps for the synthetic data) are all realistic in the sense that the datasets are small and noisy.
Table 3.3. Relative performance of single networks trained using weight decay and early stopping committees with 10 members. The relative performance of weight decay (WD) and 10 member early stopping committees are reported as: “+” means that weight decay is significantly better than committees, “0” means that weight decay and committees are equivalent, and “–” means that committees are better than weight decay. All results are reported for a 95% confidence level when using the Wilcoxon test. See the text on why the E results are left out.

Problem Setup                 | WD(λˆopt) vs. Comm. | WD(λˆ1) vs. Comm.
A1 (M = 20, σ = 0.1)          | 0 | +
A2 (M = 20, σ = 0.2)          | 0 | 0
A3 (M = 20, σ = 0.5)          | – | –
A4 (M = 40, σ = 0.1)          | 0 | 0
A5 (M = 40, σ = 0.2)          | + | +
A6 (M = 40, σ = 0.5)          | – | –
A7 (M = 100, σ = 0.1)         | + | +
A8 (M = 100, σ = 0.2)         | 0 | 0
A9 (M = 100, σ = 0.5)         | – | 0
B1 (Power, 15 hidden)         | + | –
B2 (Power, 10 hidden)         | 0 | –
B3 (Power, 5 hidden)          | – | –
C1 (Jökulsá Eystra, 8 lags)   | + | 0
C2 (Jökulsá Eystra, 4 lags)   | + | 0
C3 (Vatnsdalsá, 8 lags)       | + | +
C4 (Vatnsdalsá, 4 lags)       | – | –
D1.s1 (Sunspots, 12 hidden)   | 0 | 0
D2.s1 (Sunspots, 8 hidden)    | – | 0
D3.s1 (Sunspots, 4 hidden)    | – | –
D1.m4 (Sunspots, 12 hidden)   | + | +
D2.m4 (Sunspots, 8 hidden)    | + | +
D3.m4 (Sunspots, 4 hidden)    | 0 | +
D1.m8 (Sunspots, 12 hidden)   | + | +
D2.m8 (Sunspots, 8 hidden)    | + | 0
D3.m8 (Sunspots, 4 hidden)    | 0 | 0
D1.m11 (Sunspots, 12 hidden)  | + | +
D2.m11 (Sunspots, 8 hidden)   | 0 | +
D3.m11 (Sunspots, 4 hidden)   | + | +
Fig. 3.5. The performance of the 10 neural networks with 15 inputs, 15 hidden units, and one output unit, trained with weight decay using λ = λˆopt, on the power prediction problem. “Human expert” denotes the prediction result by the human expert at Puget Sound Power and Light Co., and “Competition winner” denotes the result by the model that won the Puget Sound Power and Light Co.’s Power Prediction Competition.
Fig. 3.6. The performance of a neural network with 12 inputs, 12 hidden units, and one output unit, trained with weight decay using λ = λˆ1, on iterated predictions for the sunspot problem. The error bars denote one standard deviation for the 10 trained networks. The dashed line shows the results when using the network listed in [26]. Note that these results are achieved with a simple weight decay cost and a very simple method for selecting λ, whereas [26] use weight elimination and a complicated heuristic scheme for setting λ.
3.5 Conclusions
The established connection between early stopping and weight decay regularization naturally leads to the idea of using early stopping to estimate the weight decay parameter. In this paper we have shown how this can be done and that the resulting λ gives test errors as low as those achieved with the standard cross-validation method, although this varies between problems. In practical applications, this means replacing a search which may take days or weeks with a computation that usually does not require more than a few minutes or hours. This value can also be used as a starting point for a more extensive cross-validation search.
|
||
We have also shown that using several early stopping networks to estimate λ can be smarter than combining the networks into committees. The conclusion from this is that although there is a correspondence between early stopping and weight decay under asymptotic conditions, this does not mean that early stopping and weight decay give equivalent results in real-life situations.
The method unfortunately only works for regularization terms that have a connection with early stopping, like quadratic weight decay or “weight decay like” regularizers where the weights are constrained towards the origin in weight space (but using, e.g., a Laplacian prior instead of the usual Gaussian prior). The method does not carry over to regularizers that have no connection to early stopping (e.g., Tikhonov smoothing regularizers).
Acknowledgements. David B. Rosen is thanked for a very inspiring dinner conversation during the 1996 “Machines that Learn” Workshop in Snowbird, Utah. Milan Casey Brace of Puget Sound Power and Light Co. is thanked for supplying the power load data. Financial support is gratefully acknowledged from NSF (grant CDA-9503968), Olle and Edla Ericsson’s Foundation, the Swedish Institute, and the Swedish Research Council for Engineering Sciences (grant TFR-282-95-847).
References

[1] Abu-Mostafa, Y.S.: Hints. Neural Computation 7, 639–671 (1995)
[2] Bishop, C.M.: Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Transactions on Neural Networks 4(5), 882–884 (1993)
[3] Brace, M.C., Schmidt, J., Hadlin, M.: Comparison of the forecast accuracy of neural networks with other established techniques. In: Proceedings of the First International Forum on Applications of Neural Networks to Power Systems, Seattle, WA, pp. 31–35 (1991)
[4] Buntine, W.L., Weigend, A.S.: Bayesian back-propagation. Complex Systems 5, 603–643 (1991)
[5] Cheeseman, P.: On Bayesian model selection. In: The Mathematics of Generalization – The Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, pp. 315–330. Addison-Wesley, Reading (1995)
[6] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 304–314 (1989)
[7] Engle, R., Granger, C.W.J., Ramanathan, R., Vahid, F., Werner, M.: Construction of the Puget Sound forecasting model. EPRI Project # RP2919, Quantitative Economics Research Institute, San Diego, CA (1991)
[8] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Computation 4(1), 1–58 (1992)
[9] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7, 219–269 (1995)
[10] Hansen, L.K., Rasmussen, C.E., Svarer, C., Larsen, J.: Adaptive regularization. In: Vlontzos, J., Hwang, J.-N., Wilson, E. (eds.) Proceedings of the IEEE Workshop on Neural Networks for Signal Processing IV, pp. 78–87. IEEE Press, Piscataway (1994)
[11] Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation of nonorthogonal problems. Technometrics 12, 55–67 (1970)
[12] Ishikawa, M.: A structural learning algorithm with forgetting of link weights. Technical Report TR-90-7, Electrotechnical Laboratory, Information Science Division, 1-1-4 Umezono, Tsukuba, Ibaraki 305, Japan (1990)
[13] Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, 3rd edn. Hafner Publishing Co., New York (1972)
[14] Moody, J.E., Rögnvaldsson, T.S.: Smoothing regularizers for projective basis function networks. In: Advances in Neural Information Processing Systems 9. MIT Press, Cambridge (1997)
[15] Nowlan, S., Hinton, G.: Simplifying neural networks by soft weight-sharing. Neural Computation 4, 473–493 (1992)
[16] Perrone, M.P., Cooper, L.C.: When networks disagree: Ensemble methods for hybrid neural networks. In: Artificial Neural Networks for Speech and Vision, pp. 126–142. Chapman and Hall, London (1993)
[17] Plaut, D., Nowlan, S., Hinton, G.: Experiments on learning by backpropagation. Technical Report CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA (1986)
[18] Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: Ruspini, H. (ed.) Proc. of the IEEE Intl. Conference on Neural Networks, San Francisco, California, pp. 586–591 (1993)
[19] Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)
[20] Sjöberg, J., Ljung, L.: Overtraining, regularization, and searching for minimum with application to neural nets. Int. J. Control 62(6), 1391–1407 (1995)
[21] Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. V. H. Winston & Sons, Washington D.C. (1977)
[22] Tong, H.: Non-linear Time Series: A Dynamical System Approach. Clarendon Press, Oxford (1990)
[23] Utans, J., Moody, J.E.: Selecting neural network architectures via the prediction risk: Application to corporate bond rating prediction. In: Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street. IEEE Computer Society Press, Los Alamitos (1991)
[24] Wahba, G., Gu, C., Wang, Y., Chappell, R.: Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. In: The Mathematics of Generalization – The Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, pp. 331–359. Addison-Wesley, Reading (1995)
[25] Wahba, G., Wold, S.: A completely automatic french curve. Communications in Statistical Theory & Methods 4, 1–17 (1975)
[26] Weigend, A., Rumelhart, D., Huberman, B.: Back-propagation, weight-elimination and time series prediction. In: Sejnowski, T., Hinton, G., Touretzky, D. (eds.) Proc. of the Connectionist Models Summer School. Morgan Kaufmann Publishers, San Mateo (1990)
[27] Williams, P.M.: Bayesian regularization and pruning using a Laplace prior. Neural Computation 7, 117–143 (1995)
4 Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework
Tony Plate
School of Mathematical and Computing Sciences, Victoria University, Wellington, New Zealand
tap@mcs.vuw.ac.nz, http://www.mcs.vuw.ac.nz/~tap/
Abstract. In order to achieve good generalization with neural networks overfitting must be controlled. Weight penalty factors are one common method of providing this control. However, using weight penalties creates the additional search problem of finding the optimal penalty factors. MacKay [5] proposed an approximate Bayesian framework for training neural networks, in which penalty factors are treated as hyperparameters and found in an iterative search. However, for classification networks trained with cross-entropy error, this search is slow and unstable, and it is not obvious how to improve it. This paper describes and compares several strategies for controlling this search. Some of these strategies greatly improve the speed and stability of the search. Test runs on a range of tasks are described.
4.1 Introduction
Neural networks can provide useful flexible statistical models for non-linear regression and classification. However, as with all such models, the flexibility must be controlled to avoid overfitting. One way of doing this in neural networks is to use weight penalty factors ( regularization parameters). This creates the problem of finding the values of the penalty factors which will maximize performance on new data. As various researchers have pointed out, including MacKay [5], Neal [10] and Bishop [1], it is generally advantageous to use more than one penalty factor, in order to differentially penalize weights between different layers of the network. However, doing this makes it computationally infeasible to choose optimal penalty factors by k-fold cross validation.
MacKay [5] describes a Bayesian framework for training neural networks and choosing optimal penalty factors (which are hyperparameters in his framework). In this framework, we choose point estimates of hyperparameters to maximize the “evidence” of the network. Parameters (i.e., weights) can be assigned into different
groups, and each controlled by a separate hyperparameter. This allows weights between different layers to be penalized differently. MacKay [6, 8] and Neal [10] have shown that it also provides a way of implementing “Automatic Relevance Detection” (ARD), in which connections emerging from different units in the input layer are assigned to different regularization groups. The idea is that hyperparameters controlling weights for irrelevant inputs should become large, driving those weights to zero, while hyperparameters for relevant inputs stabilize at small to moderate values. This can help generalization by causing the network to ignore irrelevant inputs and also makes it possible to see at a glance which inputs are important.
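As a hypothetical illustration of such a grouping (the function name and the flat weight-vector layout below are my own conventions, not the chapter's), consider a network with d inputs, h hidden units, and one output: ARD gives the outgoing weights of each input their own group, with separate groups for the remaining weights.

import numpy as np

def ard_groups(d, h):
    # Assumed flat weight layout: input-to-hidden weights (d*h, ordered by input unit),
    # then hidden biases (h), hidden-to-output weights (h), and the output bias (1).
    groups = [np.arange(i * h, (i + 1) * h) for i in range(d)]  # one group per input unit
    base = d * h
    groups.append(np.arange(base, base + h))          # hidden-unit biases
    groups.append(np.arange(base + h, base + 2 * h))  # hidden-to-output weights
    groups.append(np.array([base + 2 * h]))           # output bias
    return groups

After training, a large hyperparameter for one of the first d groups marks the corresponding input as irrelevant.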
In this framework the search for an optimal network has two levels. The inner level is a standard search for weights which minimize error on the training data, with fixed hyperparameters. The outer level is a search for hyperparameters which maximize the evidence. For the Bayesian theory to apply, the inner level search should be allowed to converge to a local minimum at each step of the outer level search. However, this can be expensive and slow. Problems with speed and stability of the search seem especially severe with classification networks trained with cross-entropy error.
This paper describes experiments with different control strategies for updating the hyperparameters in the outer level search. These experiments show that the simple “let it run to convergence and then update” strategy often does not work well, and that other strategies can generally work better. In previous work, the current author successfully employed one of these strategies in an application of neural networks to epidemiological data analysis [11]. The experiments reported here confirm the necessity for update strategies and also demonstrate that although the strategy used in this previous work is reasonably effective in some situations, there are simpler and better strategies which work in a wider range of situations. These experiments also furnish data on the relationship between the evidence and the generalization error. This data confirms theoretical expectations about when the evidence should and should not be a good indication of generalization error.
In the second section of this chapter, the update formulas for hyperparameters are given. Network propagation and weight update formulas are not given, as they are well known and available elsewhere, e.g., in Bishop [1]. Different control strategies for the outer level hyperparameter search are described in the third section. In the fourth section, the simulation experiments are described, and the results are reported in the fifth section. The experimental relationships between evidence and generalization error are reported in the sixth section.
4.2 Hyperparameter Updates
The update formulas for the hyperparameters (weight penalty factors) in the outer level search are quite simple. Before describing them we need some terminology. For derivations and background theory see Bishop [1], MacKay [5, 7], or Thodberg [13].
• n is the total number of weights in the network.
• wi is the value of the ith weight.
• K is the number of hyperparameters.
• Ic is the set of indices of the weights in the cth hyperparameter group.
• αc is the value of the hyperparameter controlling the cth hyperparameter group; it specifies the prior distribution on the weights in that group. α[i] denotes the value of the hyperparameter controlling the group to which weight i belongs.
• nc is the number of weights in the cth hyperparameter group.
• C is the weight cost (penalty term) for the network: C = (1/2) Σ_{i=1}^{n} α[i] wi².
• m is the total number of training examples.
• yj and tj are the network outputs and target values, respectively, for the jth training example.
• E is the error term of the network. For the classification networks described here, the modified cross-entropy (Bishop [1], p. 232) is used:

      E = − Σ_{j=1}^{m} { tj log(yj/tj) + (1 − tj) log((1 − yj)/(1 − tj)) }.

  Note that all graphs and tables of test set performance use the “deviance”, which is twice the error.
• H is the Hessian of the network (the second partial derivatives of the sum of the error and weight cost). hij denotes the ijth element of this matrix, and h⁻¹ij denotes the ijth element of H−1:

      hij = ∂²(E + C) / ∂wi ∂wj.

  HE is the matrix of second partial derivatives of just the error, and HC is the matrix of second partial derivatives of just the weight cost.
• Tr(H−1) is the trace of the inverse of H: Tr(H−1) = Σ_{i=1}^{n} h⁻¹ii.
• Trc(H−1) is the trace of the inverse Hessian over just those elements of the cth regularization group: Trc(H−1) = Σ_{i∈Ic} h⁻¹ii.
• γc is a derived parameter which can be seen as an estimate of the number of well-determined parameters in the cth regularization group, i.e., the number of parameters determined by the data rather than by the prior.
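To make these two cost terms concrete, here is a minimal NumPy sketch of C and of the modified cross-entropy (a sketch only; the function names and the clipping constant eps are my own, not the chapter's):

import numpy as np

def weight_cost(w, alpha_per_weight):
    # C = (1/2) * sum_i alpha[i] * w_i^2, with alpha_per_weight[i] = alpha of weight i's group
    return 0.5 * np.sum(alpha_per_weight * w ** 2)

def modified_cross_entropy(y, t, eps=1e-12):
    # E = -sum_j { t_j log(y_j/t_j) + (1 - t_j) log((1 - y_j)/(1 - t_j)) },
    # taking 0 log 0 = 0 for targets of exactly 0 or 1.
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)
    t = np.asarray(t, dtype=float)
    term1 = np.where(t > 0, t * np.log(y / np.maximum(t, eps)), 0.0)
    term2 = np.where(t < 1, (1.0 - t) * np.log((1.0 - y) / np.maximum(1.0 - t, eps)), 0.0)
    return -np.sum(term1 + term2)

The deviance reported in the tables is then 2 * modified_cross_entropy(y, t).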
The overall training procedure is shown in Figure 4.1.
The updates for the hyperparameters αc depend on the estimate γc (the number of well-determined parameters in group c), which is calculated as follows (Eqn 27 in [7]; derivable from Eqn 10.140 in [1]):

      γc = nc − αc Trc(H−1).                                            (4.1)

If a Gaussian distribution is a reasonable approximation to the posterior weight distribution, γc should be between 0 and nc. Furthermore, we expect each parameter in group c to contribute between 0 and 1 to γc. Hence, we expect h⁻¹ii to always be in the range [0, 1/α[i]].
set the αc to initial values
set wi to initial random values
repeat
    repeat
        make an optimization step for weights to minimize E + C
    until finished weight optimization
    re-estimate the αc
until finished max number of passes through training data

Fig. 4.1. The training procedure
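A rough Python skeleton of this procedure (my own sketch, not the author's code; minimize_step and inverse_hessian_diag are placeholder callbacks, the inner convergence test is replaced by a fixed number of steps, and reestimate_alphas is sketched after Eqn 4.2 below):

def train(w, alphas, groups, n_outer, n_inner,
          minimize_step, inverse_hessian_diag, reestimate_alphas):
    # Outer loop: hyperparameter search.  Inner loop: weight optimization for fixed alphas.
    for _ in range(n_outer):
        for _ in range(n_inner):
            w = minimize_step(w, alphas, groups)               # one step minimizing E + C
        h_inv_diag = inverse_hessian_diag(w, alphas, groups)   # diagonal elements of H^-1
        alphas = reestimate_alphas(w, h_inv_diag, groups, alphas)
    return w, alphas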
The updates for the αc are as follows (Eqn 22 in [7]; Eqn 10.74 in [1]):

      αc = γc / Σ_{i∈Ic} wi²                                            (4.2)
MacKay [7] remarks that this formula can be seen as matching the prior to the data: 1/αc is an estimate of the variance for the weights in group c, taking into account the effective number of well determined parameters (effective degrees of freedom) in that group.
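A corresponding sketch of the re-estimation step, i.e., Eqns 4.1 and 4.2 (again my own illustration, not the author's code; obtaining the diagonal of H−1 requires computing and inverting, or approximating, the Hessian of E + C, which is outside this sketch):

import numpy as np

def reestimate_alphas(w, h_inv_diag, groups, alphas):
    # w          : flat array of all network weights
    # h_inv_diag : diagonal elements h^-1_ii of the inverse Hessian of E + C
    # groups     : list of index arrays, one per regularization group (the I_c)
    # alphas     : current hyperparameter values, one per group
    new_alphas = np.empty(len(groups))
    for c, idx in enumerate(groups):
        gamma_c = len(idx) - alphas[c] * h_inv_diag[idx].sum()   # Eqn 4.1
        new_alphas[c] = gamma_c / np.sum(w[idx] ** 2)            # Eqn 4.2
    return new_alphas

Used naively, these raw updates can misbehave in exactly the ways discussed in Section 4.2.1 (γc falling outside [0, nc]), which is part of why control strategies for the outer-level search are needed.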
4.2.1 Difficulties with Using the Update Formulas
The difficulties with using these update formulas arise when the assumption that the error plus cost surface is a quadratic bowl is false. This assumption can fail in two ways: the error plus cost surface may not be quadratic, or it may not be a bowl (i.e., the Hessian is not positive definite). In either of these situations, it is possible for γc to be out of the range [0, nc]. To illustrate, consider a single diagonal element of the Hessian in the situation where the off-diagonal elements are zero:
        ⎡ . . .                  ⎤     ⎡ . . .                        ⎤
    H = ⎢        hii           0 ⎥  =  ⎢        hEii + α[i]         0 ⎥
        ⎣ 0              . . .   ⎦     ⎣ 0                    . . .   ⎦

Since the off-diagonal elements are zero, the inverse Hessian is simple to write down:

          ⎡ . . .                        ⎤
    H−1 = ⎢        1/(hEii + α[i])     0 ⎥
          ⎣ 0                    . . .   ⎦
Suppose the ith parameter is in the cth regularization group, by itself. Then the number of well-determined parameters in this group is given by:

      γ[i] = 1 − α[i] h⁻¹ii = 1 − α[i]/(hEii + α[i]) = hEii/(hEii + α[i])        (4.3)
If hEii is positive, γ[i] will be between 0 and 1. γ[i] will be large if hEii is large relative to α[i], which means that wi is well determined by the data, i.e., small moves of wi will make a large difference to E. γ[i] will be small if hEii is small relative to α[i], which means that wi is poorly determined by the data.
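For a rough sense of scale (illustrative numbers of my own, not from the chapter):

# gamma_i = hE_ii / (hE_ii + alpha_i), from Eqn 4.3
for hE_ii, alpha_i in [(10.0, 0.1), (1.0, 1.0), (0.1, 10.0)]:
    print(hE_ii, alpha_i, hE_ii / (hE_ii + alpha_i))
# -> about 0.99 (well determined), 0.5, and 0.01 (poorly determined)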
The expectation that h⁻¹ii is in the range [0, 1/αc] (and hence contributes between 0 and 1 well-determined parameter to γ[i]) can fail even if the model is at a local minimum of E + C. Being at a local minimum of E + C does not guarantee that hEii will be positive: it is possible for the hyperparameter to “pin” the weight value to a convex portion of a non-quadratic E surface. Consider the case where the Hessian is diagonal and positive definite, but hEii is negative. From Eqn 4.3, we can see that h⁻¹ii can make a negative contribution¹ to γc, which makes little sense in terms of “numbers of well-determined parameters”. This situation² is illustrated in Figure 4.2: at the minimum of the E + C function the E function is convex (d²E/dw² is negative). Here, negative degrees of freedom would be calculated under the (incorrect) assumption that error plus cost is quadratic. This is important for neural networks, because even if sum-squared error is used, non-linearities in the sigmoids can cause the error plus cost function to not be a quadratic function of the weights.
[Plots for Fig. 4.2: left, the surfaces E, C, and E + C versus the weight w, with the minimum of E + C marked; right, the second derivatives ∂²E/∂w², ∂²C/∂w², and ∂²(E + C)/∂w² in the region of that minimum.]
Fig. 4.2. In minimizing E + C, a weight cost function C can pin a weight value to a convex portion of the error surface E. The plot on the left shows the surfaces, the plot on the right shows the derivatives in the region of the minimum.
If the model is not at a local minimum of E + C, all bets are off. H may not even be positive definite (i.e., the Hessian of a quadratic bowl), and if this is the case it is almost certain that some h⁻¹ii will be out of the range [0, 1/α[i]]. Even
¹ With general matrices it is possible that h⁻¹ii < −α[i], in which case the contribution will be an unbounded positive number.
² In Figure 4.2, E = w(w − 1)(w + 1)² + 1, C = 4w², d(E + C)/dw ≈ 0 at w = 0.152645, and d²E/dw² ≈ −0.8036 there.
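A quick numeric check of the values in footnote 2 (my own snippet, not part of the chapter):

# E = w(w - 1)(w + 1)^2 + 1 and C = 4w^2, as in footnote 2 / Figure 4.2
w = 0.152645
dEC_dw = 4 * w**3 + 3 * w**2 - 2 * w - 1 + 8 * w   # d(E + C)/dw, with E' = 4w^3 + 3w^2 - 2w - 1 and C' = 8w
d2E_dw2 = 12 * w**2 + 6 * w - 2                    # d^2E/dw^2
print(dEC_dw, d2E_dw2)                             # approx 0.0 and -0.80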