A FIELD GUIDE TO DYNAMICAL RECURRENT NETWORKS
IEEE Press, 445 Hoes Lane, P.O. Box 1331
Piscataway, NJ 08855-1331
IEEE Press Editorial Board Stamatios V. Kartalopoulos, Editor in Chief
M. Akay J. B. Anderson R. J. Baker J. E. Brewer
M. Eden M. E. El-Hawary R. J. Herrick R. F. Hoyt D. Kirk
M. S. Newman
M. Padgett
W. D. Reeve
G. Zobrist
Kenneth Moore, Director of IEEE Press Catherine Faduska, Senior Acquisitions Editor
John Griffin, Acquisitions Editor Robert Bedford, Assistant Acquisitions Editor
Anthony VenGraitis, Project Editor Marilyn G. Catis, Marketing Manager
Cover Design: Sharon Klein, Sharon Klein Graphic Design
Technical Reviewers
Bilal M. Ayyub, University of Maryland, College Park, MD
Simon Haykin, McMaster University, Communications Research Laboratory, Hamilton, Ontario, Canada
Maria Petrou, University of Surrey, U.K.
Kwang Yoon, Inha University, Inchon, Korea
Books of Related Interest from the IEEE Press
EVOLUTIONARY COMPUTATION: The Fossil Record
A Selected Reprint Volume
David B. Fogel
1998
Hardcover
640 pp
IEEE Order No. PC5737
ISBN 0-7803-3481-7
UNDERSTANDING NEURAL NETWORKS AND FUZZY LOGIC: Basic Concepts & Applications
Stamatios V. Kartalopoulos
1996
Softcover
232 pp
IEEE Order No. PP5591
ISBN 0-7803-1128-0
TIME FREQUENCY AND WAVELETS IN BIOMEDICAL SIGNAL PROCESSING
Edited by Metin Akay
1998
Hardcover
768 pp
IEEE Order No. PC5619
ISBN 0-7803-1147-7
NEURAL NETWORKS: A Comprehensive Foundation, Second Edition
Simon Haykin
A Prentice Hall book published in cooperation with IEEE Press
1999
Hardcover
600 pp
IEEE Order No. PC5746
ISBN 0-7803-3494-9
A FIELD GUIDE TO DYNAMICAL RECURRENT NETWORKS
Edited by John F. Kolen
Institute for Human and Machine Cognition University of West Florida Pensacola, FL, USA
Stefan C. Kremer
Guelph Natural Computation Group Department of Computing and Information Science University of Guelph Guelph, Ontario, CANADA
IEEE PRESS
The Institute of Electrical and Electronics Engineers, Inc., New York
This book and other books may be purchased at a discount from the publisher when ordered in bulk quantities. Contact: IEEE Press Marketing Attn: Special Sales 445 Hoes Lane P.O. Box 1331 Piscataway, NJ 08855-1331
Fax: +1-732-981-9334
For more information about IEEE Press products, visit the IEEE Online Catalog & Store: http://www.ieee.org/store.
© 2001 by the Institute of Electrical and Electronics Engineers, Inc. 3 Park Avenue, 17th Floor, New York, NY 10016-5997. All rights reserved. No part of this book may be reproduced in any form, nor may it be stored in a retrieval system or transmitted in any form, without written permission from the publisher. Printed in the United States of America. 10 9 8 7 6 5 4 3 2
ISBN 0-7803-5369-2 IEEE Order No. PC5809
Library of Congress Cataloging-in-Publication Data A field guide to dynamical recurrent networks / edited by John F. Kolen, Stefan C. Kremer.
p. cm. Includes bibliographical references and index. ISBN 0-7803-5369-2
1. Neural networks (Computer science) I. Kolen, John F., 1965- II. Kremer, Stefan C., 1968-
QA76.87.F54 2001
006.3'2--dc21
00-066458
to Heidi -John
to Janis, Lauren, and Shannon -Stefan
CONTENTS
Preface xvii
Acknowledgments xix
List of Figures xxi
List of Tables xxvii
List of Contributors xxix
PART I INTRODUCTION 1

Chapter 1 Dynamical Recurrent Networks 3
John F. Kolen and Stefan C. Kremer
1.1 Introduction 3 1.2 Dynamical Recurrent Networks 4 1.3 Overview 6
1.3.1 Architectures 6 1.3.2 Capabilities 7 1.3.3 Algorithms 8 1.3.4 Limitations 9 1.3.5 Applications 10
1.4 Conclusion 11
PART II ARCHITECTURES 13

Chapter 2 Networks with Adaptive State Transitions 15
David Calvert and Stefan C. Kremer
2.1 Introduction 15 2.2 The Search for Context 15
2.2.1 Context as Input History 16
2.3 Recurrent Approaches to Context 17 2.4 Representing Context 18 2.5 Training 19 2.6 Architectures 19
2.6.1 Jordan 19 2.6.2 Elman 21
2.6.3 Williams and Zipser 23 2.6.4 Giles et al. 23 2.6.5 Locally Recurrent Connections 24 2.7 Conclusion 25
Chapter 3
Delay Networks: Buffers to the Rescue 27 Tsung-Nan Lin and C. Lee Giles 3.1 Introduction to Delay Networks 27 3.2 Back-Propagation Through Time Learning Algorithm 28 3.3 Delay Networks with Feedback: NARX Networks 31 3.4 Long-Term Dependencies in NARX Networks 33
3.4.1 Vanishing Gradients and Long-Term Dependencies 33 3.4.2 The Effect of Delay Buffer on Vanishing Gradients 35 3.5 Experimental Results: The Latching Problem 36 3.6 Conclusion 38
Chapter 4
Memory Kernels 39 Ah Chung Tsoi, Andrew Back, Jose Principe, and Mike Mozer 4.1 Introduction 39 4.2 Different Types of Memory Kernels 40
4.2.1 The Parameters in Each Modular Component Are Different 41 4.2.2 Identical Parameters for Each Modular Component 43 4.3 Generic Representation of a Memory Kernel 44 4.3.1 State Observable Situation 45 4.3.2 Output Observable Situation 45 4.4 Basis Issues 45 4.4.1 Completeness 46 4.4.2 Gamma Filter 46 4.4.3 Laguerre Filters 46 4.4.4 General Conditions 46 4.5 Universal Approximation Theorem 47 4.6 Training Algorithms 48 4.6.1 Training Algorithm for the Generic Predictor Weights 48 4.6.2 State Observable Form 49 4.6.3 Output Observable Form 50 4.7 Illustrative Example 51 4.8 Conclusion 54
PART III CAPABILITIES 55
Chapter 5
Dynamical Systems and Iterated Function Systems 57 John F. Kolen 5.1 Introduction 57 5.2 Dynamical Systems 57
5.2.1 Definition 58 5.2.2 Trajectories, Transients, and Attractors 61
5.2.3 Attractor Dimensionality 67 5.2.4 Bifurcations and Phase Transitions 68 5.2.5 Summary 71 5.3 Iterated Function Systems 72 5.3.1 Basic Iterated Function Systems Theory 72 5.3.2 Random Iteration 75 5.3.3 Summary 76 5.4 Symbolic Dynamics 78 5.5 The DRN Connection 80 5.6 Conclusion 81
Chapter 6
Representation of Discrete States 83 C. Lee Giles and Christian Omlin
6.1 Introduction 83 6.2 Finite-State Automata 83 6.3 Neural Network Representations of DFA 85
6.3.1 Preliminaries 85 6.3.2 Feedforward Neural Networks 85 6.3.3 First- Versus Second-Order Networks 87 6.3.4 Locally Recurrent Neural Networks 90 6.3.5 Recurrent Cascade Correlation Networks 92 6.3.6 Elman Networks 94 6.3.7 NARX Recurrent Neural Networks 97 6.4 Pushdown Automata 99
6.5 Turing Machines 101
6.6 Conclusion 102
Chapter 7
Simple Stable Encodings of Finite-State Machines in Dynamic Recurrent Networks 103 Mikel L. Forcada and Raphael C. Carrasco 7.1 Introduction 103
7.1.1 A Survey of Previous Work 103 7.1.2 Organization of the Chapter 106 7.2 Definitions 106
7.2.1 Mealy and Moore Machines and Deterministic Finite-State Automaton 106
7.2.2 Dynamic Recurrent Networks 107 7.3 Encoding 109
7.3.1 Conditions for a DRN to Behave as a FSM 109 7.3.2 A General Encoding Scheme 111 7.4 Encoding of Mealy Machines in DRN 114 7.4.1 Mealy Machines in Second-Order DRN 114 7.4.2 Mealy Machines in First-Order DRN 120 7.5 Encoding of Moore Machines in DRN 123 7.5.1 Moore Machines in First-Order DRN 123 7.5.2 Moore Machines in Second-Order DRN 124
7.6 Encoding of Deterministic Finite-State Automata in DRN 125 7.6.1 Encoding DFA in Elman Nets 125 7.6.2 Encoding DFA in Second-Order DRN 125
7.7 Conclusion 126 7.8 Acknowledgments 127
Chapter 8
Representation Beyond Finite States: Alternatives to Pushdown Automata 129 Janet Wiles, Alan D. Blair, and Mikael Boden
8.1 Introduction 129
8.1.1 Finding Structure in Continuous-State Space 129
8.2 Hierarchies of Languages and Machines 130
8.2.1 The Chomsky Hierarchy 130 8.2.2 Regular and Context-Free Languages 131 8.2.3 Context-Sensitive and Recursively Enumerable 132 8.2.4 Learning and Induction 132 8.2.5 Alternative Hierarchies 133 8.2.6 Performance Limitations 133
8.3 DRNs and Nonregular Languages 134
8.3.1 Augmenting DRNs with an External Stack 134
8.3.2 DRNs without an External Stack 135
8.3.3 Representing Counters in Recurrent Hidden Units 136
8.3.4 Learning, Stability, and Generalization for a^n b^n 138
8.3.5 Learning the Context-Sensitive Language a^n b^n c^n 139
8.4 Generalization and Inductive Bias 141
8.5 Conclusion 142
Chapter 9
Universal Computation and Super-Turing Capabilities 143 Hava T. Siegelmann 9.1 Introduction 143 9.2 The Model 144 9.3 Preliminary: Computational Complexity 145 9.4 Summary of Results 146
9.4.1 Networks That Are Turing Equivalent 147 9.4.2 Networks That Are Beyond Turing 148 9.5 Pondering Real Weights 149 9.6 Analog Computation 149 9.7 Conclusion 150 9.7 Acknowledgments 151
PART IV ALGORITHMS 153
Chapter 10 Insertion of Prior Knowledge 155 Paolo Frasconi, C. Lee Giles, Marco Gori, and Christian Omlin 10.1 Introduction 155 10.2 Constrained Nondeterministic Insertion in First-Order Networks 156
10.2.1 Motivations 156
10.2.2 Time-Warping Automata 156 10.2.3 Mapping Network Dynamics into Symbolic Dynamics 157 10.2.4 State Memorization 157 10.2.5 Neural Logic Operators 158 10.2.6 The Injection Algorithm 159
10.3 Second-Order Networks 160
10.3.1 DFA Encoding Algorithm 160 10.3.2 Empirical Results 162 10.3.3 Stability of the DFA Representation 164 10.3.4 Scaling Issues 166 10.3.5 DFA States with Large Indegree 166 10.3.6 Extension to Fuzzy Domains 167 10.3.7 Fuzzy Representation 170
10.4 Other Related Techniques 175
10.4.1 Recurrent Radial Basis Function Networks 175 10.4.2 Injecting "Automata Knowledge" in R2BF Networks 176
10.5 Conclusion 177
Chapter 11
Gradient Calculations for Dynamic Recurrent Neural Networks 179 Barak A. Pearlmutter 11.1 Introduction 179
11.1.1 Why Recurrent Networks? 179 11.1.2 Why Hidden Units? 180 11.1.3 Continuous Versus Discrete Time 181
11.2 Learning in Networks with Fixed Points 182
11.2.1 Will a Fixed Point Exist? 182 11.2.2 Problems with Fixed Points 183 11.2.3 Recurrent Back Propagation 183 11.2.4 Deterministic Boltzmann Machines 186
11.3 Computing the Gradient Without Assuming a Fixed Point 188
11.3.1 Back Propagation Through Time 188 11.3.2 Real-Time Recurrent Learning 191 11.3.3 More Efficient Online Techniques 192 11.3.4 Time Constants 193 11.3.5 Time Delays 193 11.3.6 Extended RTRL to Time Constants and Time Delays 194 11.3.7 Long Short-Term Memory 195
11.4 Some Simulations 196
11.4.1 A Rotated Figure Eight 197 11.4.2 Computational Neuroscience: A Simulated Leech 197
11.5 Stability and Perturbation Experiments 198 11.6 Other Non-Fixed Point-Techniques 199
11.6.1 "Elman Nets" 199 11.6.2 The Moving Targets Method 199 11.6.3 Feedforward Networks with State 200 11.6.4 Teacher Forcing in Continuous Time 200 11.6.5 Jordan Nets 201 11.6.6 Teacher Forcing, RTRL, and the Kalman Filter 202
11.7 Learning with Scale Parameters 203 11.8 Conclusion 203
11.8.1 Complexity Comparison 203 11.8.2 Speeding the Optimization 204 11.8.3 Prospects and Future Work 205 11.8.4 Conclusion 205
Chapter 12
Understanding and Explaining DRN Behavior 207 Christian Omlin 12.1 Introduction 207 12.2 Performance Deterioration 208 12.3 Dynamic Space Exploration 209
12.3.1 Extraction Algorithm 209 12.3.2 Example 211 12.3.3 Model Selection 213 12.4 DFA Extraction: Fool's Gold? 215 12.5 Theoretical Foundations 216 12.6 How Can DFA Outperform Networks? 218 12.6.1 Definitions 218 12.6.2 Example 218 12.7 Alternative Extraction Methods 220 12.7.1 Self-Clustering Recurrent Networks 221 12.7.2 Explicit Cluster Modeling 221 12.7.3 Learning Discrete Representations 222 12.7.4 DFA Extraction as a Learning Problem 223 12.8 Extension to Fuzzy Automata 225 12.9 Application to Financial Forecasting 226 12.9.1 A Hybrid Neural Network Architecture 226 12.10 Conclusion 227
PART V LIMITATIONS 229
Chapter 13 Evaluating Benchmark Problems by Random Guessing 231 Jürgen Schmidhuber, Sepp Hochreiter, and Yoshua Bengio 13.1 Introduction 231 13.2 Random Guessing (RG) 231 13.3 Experiments 232 13.3.1 Latch and 2-Sequence Problems 232 13.3.2 Parity Problem 233 13.3.3 Tomita Grammars 234 13.4 Final Remarks 234 13.5 Conclusion 235 13.6 Acknowledgments 235
Chapter 14
Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies 237 Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jiirgen Schmidhuber
14.1 Introduction 237 14.2 Exponential Error Decay 237
14.3 Dilemma: Avoiding Gradient Decay Prevents Long-Term Latching 240
14.4 Remedies 241 14.5 Conclusion 243
Chapter 15
Limiting the Computational Power of Recurrent Neural Networks: VC Dimension and Noise 245 Christopher Moore 15.1 Introduction 245
15.2 Time-Bounded Networks and VC Dimension 246
15.2.1 Real-Time Dynamical Recognizers 246 15.2.2 VC Dimension 247 15.2.3 A Memory Task That Needs a Large VC Dimension 248 15.2.4 Sinusoidal Transfer Functions and Others 249
15.3 Robustness to Noise 250
15.3.1 Robustness with Perfect Reliability 250 15.3.2 Performance as a Function of Noise 251 15.3.3 Robustness with Limited Reliability 253
15.4 Conclusion 254
15.5 Acknowledgments 254
PART VI APPLICATIONS 255
Chapter 16
Dynamical Recurrent Networks in Control 257
Danil V. Prokhorov, Gintaras V. Puskorius, and Lee A. Feldkamp
16.1 Introduction 257 16.2 Description and Execution of TLRNN 258
16.3 Elements of Training 260
16.3.1 Derivative Calculations 260 16.3.2 Weight Update Methods 262
16.4 Basic Approach to Controller Synthesis 266
16.4.1 Specifics of Controller Training 266 16.4.2 Modular Approach 268 16.4.3 Multi-Stream Training 270
16.5 Example 1 272
16.5.1 MIMO Control Problem 272 16.5.2 State Variables Accessible and Plant Equations Known 273 16.5.3 State Variables Inaccessible and Plant Equations Unknown 275 16.5.4 Further Testing, Improvements, and Summary of the Results 277
16.6 Example 2 282
16.6.1 Financial Portfolio Optimization 282 16.6.2 General Training Considerations 284 16.6.3 Empirical Simulations 284 16.6.4 Possible Extensions 288
16.7 Conclusion 288
Chapter 17 Sentence Processing and Linguistic Structure 291 Whitney Tabor
17.1 Introduction 291
17.1.1 The Role of Recurrent Connectionist Networks 292 17.1.2 The Role of Sentence Processing Studies 294 17.1.3 Overview 295 17.2 Case Studies: Dynamical Networks for Sentence Processing 295 17.2.1 Case 1: Frequency Sensitivity 297 17.2.2 Case 2: Phrase Structure 301 17.3 Conclusion 308
Chapter 18
Neural Network Architectures for the Modeling of Dynamic Systems 311 Hans-Georg Zimmermann and Ralph Neuneier
18.1 Introduction and Overview 311
18.2 Modeling Dynamic Systems by Feedforward Neural Networks 312
18.2.1 Preprocessing the Data 312 18.2.2 Suppressing Outliers by Net Internal Preprocessing 313 18.2.3 Merging MLP and RBF Networks 314 18.2.4 Modeling Dynamics by an Interaction Layer 316 18.2.5 Statistical Averaging 317 18.2.6 An Integrated Feedforward Architecture for Forecasting 318
18.3 Modeling Dynamic Systems by Recurrent Neural Networks 321
18.3.1 Representing Dynamic Systems by Recurrent Networks 322 18.3.2 Finite Unfolding in Time 323 18.3.3 Overshooting 326 18.3.4 Analysis of Partially Externally Driven Systems 327 18.3.5 Principle of Uniform Causality 328 18.3.6 Undershooting 330 18.3.7 Special Dynamics 332
18.4 Combining State-Space Reconstruction and Forecasting 334
18.4.1 Finite Unfolding in Space and Time 335 18.4.2 Smoothness 338 18.4.3 Separation of Variants and Invariants 341 18.4.4 Nonlinearity Versus Noise 342 18.4.5 Intrinsic Time 344 18.4.6 Measuring the Observed Dynamics 344 18.4.7 Extended State-Space Transformations and Periodic Orbits 346 18.4.8 Overshooting in Space and Time 347 18.4.9 Undershooting in Space and Time 348
18.5 Conclusion 350
Chapter 19
From Sequences to Data Structures: Theory and Applications 351 Paolo Frasconi, Marco Gori, Andreas Küchler, and Alessandro Sperduti
19.1 Introduction 351
19.2 Historical Remarks 352
19.3 Adaptive Processing of Structured Information 354 19.3.1 Structured Domains 354 19.3.2 The Computational Model 355 19.3.3 Graphical Transduction 357 19.3.4 On the Representational Power of Recursive Models 359 19.3.5 Learning Algorithms 359
19.4 Applications 366 19.4.1 Pattern Recognition 367 19.4.2 Computational Chemistry 368 19.4.3 Adaptive Search Control for Symbolic Deduction Systems 370
19.5 Conclusion 374
PART VII CONCLUSION 375
Chapter 20
Dynamical Recurrent Networks: Looking Back and Looking Forward 377 Stefan C. Kremer and John F. Kolen 20.1 Introduction 377 20.2 The Challenges 377 20.3 The Potential 378 20.4 The Approaches 378 20.5 The Successes 378 20.6 Conclusion 378
Bibliography 379
Glossary 409
Index 415
About the Editors 423
PREFACE
The idea for this book emerged as the two of us were organizing the 1996 Neural Information Processing Systems (NIPS) workshop on recurrent networks. Many of the telephone calls between us turned to students and research. We both bemoaned the fact that we were handing large stacks of journal papers to our potential graduate students and, in some cases, scaring them off to other advisors. Since we both were dealing with terminal master's students who, if we were lucky, would stick around for a whole year, we needed introductory material that was self-contained, accessible, and devoid of distractions: a book that a first-year graduate student, or upper-division undergraduate, could read and become productive with in a short period of time.
Unfortunately, no such book was available at that time. Hertz, Krogh, and Palmer's textbook served well in getting students up to speed on back propagation, Hopfield networks, and Boltzmann machines. However, it did little justice to our research area, dynamical recurrent networks. Sometime before the workshop, we decided to fill this void.
From the conception of this project, we both agreed that this book was not going to be yet another edited collection of papers. We wanted a coherent presentation of the field. One difficulty for students has been the arbitrary nomenclature and mathematical notation used to describe recurrent networks. Though not insurmountable, it does act as a speed bump, reducing the valuable time that they could be spending elsewhere.
Like any other young field, ours has several researchers with competing viewpoints. One problem with edited volumes is that the author of a chapter will have a biased view of the world. We decided to reduce this bias by assigning multiple authors to chapters. The goal of the assignments was to ensure a balanced presentation of the material, one coming directly from the experts in the field. We also encouraged the authors to tie their chapters to the other chapters in the book. Though difficult at times, this transforms the text from a collection of papers into a coherent book.
The book is organized into five sections. The first section presents the range of dynamical recurrent network (DRN) architectures that will be used in the book. With these architectures in hand, we turn to examine their capabilities as computational devices. The third section presents several training algorithms for solving the network loading problem. Next, we look back and temper our enthusiasm by acknowledging the limitations of DRNs. The final section deals with applications of recurrent networks.
The material in this book assumes a basic understanding of neural networks. The reader, we assume, has read an introductory textbook covering multilayered perceptrons and backpropagation. As such, this book would be ideal for a second course in neural networks.
John F. Kolen Institute for Human and Machine Cognition
University of West Florida Pensacola, FL, USA
Stefan C. Kremer Guelph Natural Computation Group Department of Computing and Information Science
University of Guelph Guelph, Ontario, CANADA
ACKNOWLEDGMENTS
THE AUTHORS'
It is impossible to successfully complete an undertaking such as this one alone. We, therefore, wish to acknowledge the efforts of those who have made this book possible. First and foremost, we thank the contributing authors, who have spent considerable effort in crafting their chapters. Without their cooperation, this project would never have taken shape. As mentioned in the preface, this book grew out of the 1996 NIPS workshop, and we thank the workshop organizers for giving us the opportunity to gather and share our work. We humbly thank the editors at IEEE Press for their patience. The Field Guide has also benefited from the comments of its reviewers, both anonymous and not-so-anonymous. Those in the latter category include Tim Hutcheson, Heidi Larrison, and David Medler.
JOHN'S
My thanks go first to the Institute for Human and Machine Cognition (University of West Florida) for providing a haven during these last four years, making it possible to work on this project. I have benefited greatly from the support and advice of its director, Dr. Ken Ford, and many of its members, especially Drs. Pat Hayes and Bruce Dunn. Dr. Jordan Pollack, Brandeis University, deserves a few words of gratitude for introducing me to recurrent networks while I was his student in the Laboratory for AI Research at The Ohio State University.
During the course of this project, I have received research funding from the Air Force Office of Scientific Research, the Office of Naval Research, the National Aeronautics and Space Administration, and the University of West Florida.
Finally, I could not have completed this work without the support of family and friends. I am grateful to my parents for providing love and support over the years that has helped me to where I am now. I thank my daughter, Amanda, for her limitless supply of bright smiles and big hugs. Finally, my deepest gratitude goes to Heidi for her love, encouragement, and sacrifices over the last two years.
STEFAN'S
I will start by thanking Dr. Deborah Stacey, University of Guelph, for introducing me to artificial neural networks during my undergrad days. My appreciation also goes to Dr. Renee Elio, University of Alberta, for her challenging and constructive criticism of my Ph.D. work. Dr. Mike Dawson, University of Alberta, my other thesis supervisor, deserves thanks for encouraging me to submit my first paper for publication, for teaching me to take a critical look at the works of others, and generally providing a research environment that was both enjoyable and productive. I also thank Dr. C. Lee Giles for inviting me to visit his lab in Princeton, providing a fantastic bibliography to get me started in the field, and his advice throughout the years. I am further indebted to Valek Szwarc, Communications Research Centre, Ottawa, for supporting my participation in the 1996 Neural Information Processing Systems Workshop that I organized with John Kolen and that precipitated this book. I am
also grateful to Dr. Simon Haykin, University Professor, McMaster University, for his support, advice, and encouragement.
I would like to thank the Natural Sciences and Engineering Research Council of Canada and the Canadian Foundation for Innovation for two grants that have supported my work.
Finally, none of this would have been possible without the love and support of my family. I thank my parents for their guidance, love, and continuing help throughout my life. I thank my wife, Janis, for her love, patience, and support. And I thank my daughters Lauren and Shannon for reminding me of what's really important.
LIST OF FIGURES
2.1 General recurrent architecture.
16
2.2 TDNN architecture.
17
2.3 Equivalent view of context units.
18
2.4 A simple recurrent network, (a), is unfolded in time, (b).
20
2.5 Jordan's architecture.
21
2.6 Elman's architecture.
22
2.7 Williams and Zipser's architecture.
23
2.8 Giles' architecture.
24
2.9 Mozer's architecture.
24
3.1 A two-neuron recurrent network architecture.
29
3.2 The unfolded network of the two-neuron recurrent network architecture.
29
3.3 A schematic representation of back-propagation through time learning
algorithm.
30
3.4 A NARX network with input memory of order 2 and output memory of
order 2.
32
3.5 A variation of the NARX network.
32
3.6 A TDNN network with input memory of order 3.
33
3.7 Plots of J(t, n) as a function of n for different numbers of output delays.
36
3.8 Plots of the ratio Σ_{τ=1}^{n} J(t,τ) as a function of n for different numbers of output delays.
36
3.9 The network used for the latching problem.
37
3.10 Plots of percentage of successful simulations as a function of T, the
length of the input strings, for different number of output delays (D = 1,
D = 3, and D = 6).
37
4.1 A diagram showing the class of recurrent neural network under study in this
chapter.
40
4.2 A generic form of the short-term memory.
40
4.3 The neural network architecture considered in this chapter.
41
4.4 A tapped delay line memory model.
41
4.5 A Laguerre filter module.
42
4.6 A general modular form which encompasses the filter form as introduced in
the previous section.
44
4.7 Illustration of how a Laguerre filter may be expressed in the general modular
form.
44
4.8 The TINSTMA used in the worked example.
51
5.1 Attracting and repelling fixed points.
62
5.2 A demonstration comparing periodic and quasiperiodic trajectories.
63
5.3 Stretching and folding of state space in the logistic mapping.
64
5.4 From logistic map to tent map to the Bakers' map.
66
5.5 Bifurcation diagrams of the logistic function and a neural network unimodal
function.
69
5.6 A pitchfork bifurcation.
70
5.7 Several fractal sets generated by iterative function systems.
73
5.8 The difference between the limit of a single transformation (a point) and the limit of a collection of transformations (a Sierpinski triangle).
74
5.9 The individual transformations make three reduced copies of the attractor.
75
5.10 Examples of attractors generated with different transformation probabilities.
77
5.11 A superstable period-four orbit (CRLR) in the logistic mapping.
79
6.1 DFA for odd parity.
84
6.2 Network architectures for definite memory machines.
86
6.3 Example directed DeBruijn graph.
86
6.4 Definite memory machine.
87
6.5 First-order single-layer recurrent neural network.
88
6.6 Locally recurrent networks.
90
6.7 Recurrent cascade correlation networks.
93
6.8 Elman recurrent neural network.
95
6.9 NARX recurrent neural network.
97
6.10 Neural network pushdown automata.
100
7.1 The function 11 (see text).
117
7.2 The function e (see text).
118
8.1 A trajectory in the state space of a first-order SRN with two recurrent hidden units, processing a^8b^8.
137
8.2 The oscillation performance for a self-recurrent unit with a bias and the
instability of learning.
139
8.3 The trajectory in the state space of an SCN with three recurrent hidden units, processing a^8b^8c^8.
140
8.4 (a) The trajectory in the state space of an SCN with two recurrent hidden units processing a^8b^8c^8 and (b) its linearizations around the fixed points.
140
10.1 Architecture of a first-order network.
160
10.2 Second-order recurrent neural network.
161
10.3 Performance of 10-state DFA.
162
10.4 Performance of 100-state DFA.
163
10.5 Performance of 1000-state DFA.
163
10.6 Fixed points of the sigmoidal discriminant function.
164
10.7 Existence of fixed points.
165
10.8 Scaling weight strength.
167
10.9 Transformation of a FFA into its corresponding DFA.
168
10.10 Recurrent network architecture for crisp representation of fuzzy finite-state automata.
169
10.11 Network performance.
170
10.12 Fuzzy discriminant function for state representation.
171
10.13 Example of FFA transformation.
173
10.14 Stability of FFA state encoding.
174
10.15 Architecture of an R2BF network.
175
11.1 Energy landscape.
184
11.2 The architecture of a network to solve an associative version of the four-bit
rotation problem.
185
11.3 A Hinton diagram of weights learned by the network of Figure 11.2.
186
11.4 Network state for all the cases in the four-bit rotation problem.
187
11.5 A recurrent network is shown on the left, and a representation of that network
unfolded in time through four time steps is shown on the right.
189
11.6 The infinitesimal changes to y considered in et(t).
190
11.7 The infinitesimal changes to y considered in Zt(t).
190
11.8 The output of the rotated figure eight network at all the trained angles (left) and some untrained angles (right).
197
11.9 The output states Y_1 and Y_2 plotted against each other for a 1000 time unit run.
198
12.1 State clustering.
210
12.2 Extraction example.
211
12.3 DFA extraction algorithm.
212
12.4 Minimization of DFA.
213
12.5 Extracted DFA.
214
12.6 A simple DFA.
219
12.7 Dynamics of a recurrent network.
219
12.8 Prefix tree.
224
12.9 DFA extracted.
227
14.1 Robust latching.
241
15.1 Five ellipses can shatter the plane into 25 components.
247
15.2 The family of sets f_v^{-1}(H_yes) over all finite words v is independent.
248
15.3 The family of sets defined by sinusoidal inequalities has infinite VC dimension.
250
15.4 The definition of the sets U_i in Lemma 15.3.1.
252
16.1 Schematic illustration of an RMLP denoted as 1-2R-3-1R.
259
16.2 Block diagram of model reference control.
267
16.3 Distribution of the RMS error values for the feedforward controller (white)
and the recurrent controller (black) during testing on 10,000 plants.
274
16.4 Performance of the recurrent controller during testing in the full state feedback
case.
277
16.5 Performance of the feedforward controller during testing in the full state
feedback case.
278
16.6 Test performance of the recurrent controller trained using the ID network.
279
16.7 Test performance of the recurrent controller trained with random reference
signals using the ID network.
280
16.8 Test performance of the 5-20-10-2L feedforward controller trained with
skylines on the nominal plant.
281
16.9 A block diagram representation of a simple, two-asset portfolio optimi-
zation system.
283
16.10 Typical results of training on the simulated price series.
286
16.11 A section of 1000 points from the simulated price series shown in Figure 16.10.
287
17.1 The bramble network (BRN).
296
17.2 Reduced-dimension plot of the manifolds associated with distinct lexical
classes.
300
17.3 (a) Sierpinski triangle with stack states labeled. (b) A sample trajectory.
302
17.4 The performance of the network described above degrades in a realistic
way with growth in the number of center embeddings.
304
17.5 Comparison of Gibson and Ko's human reading-time data with the
predictions of Gibson's integration cost model.
306
17.6 Comparison of Gibson and Ko's human reading-time data with the
predictions of DA 2.
308
18.1 Net internal preprocessing to limit the influence of outliers and to
eliminate unimportant inputs.
314
18.2 Net internal preprocessing cluster and the square cluster, which
produces the input signals for following clusters.
315
18.3 Point predictions followed by the interaction layer.
316
18.4 Geometric interpretation of the effect of the interaction layer on the cost
function.
317
18.5 Averaging of several point predictions for the same forecast horizon.
318
18.6 The integrating eleven layer architecture.
319
18.7 The identification of a dynamic system using a discrete-time description with input u_t, hidden states s_t, and output y_t.
321
18.8 A time delay recurrent neural network.
323
18.9 Finite unfolding realized by shared weights A, B, C.
324
18.10 Optimal finite unfolding assuming error-level saturation within three time steps of Fig. 18.9 and after disconnecting the error flows from intermediate outputs y_{t-3}, y_{t-2}, y_{t-1}.
325
18.11 Overshooting is the extension of the autonomous part of the dynamics.
326
18.12 The dynamics for recurrent neural networks at the beginning of learning
when weights are still small.
327
18.13 Search for a continuous embedding of a discretely measured dynamic
system.
328
18.14 Refinement of the time grid by recurrent neural networks.
330
18.15 Since the intermediate targets are not available, the upper network
cannot be trained.
331
18.16 Input design for undershooting networks.
332
18.17 A network architecture for volume-conserving transformations.
333
18.18 The time series of the observed state description x_t may follow a very complex trajectory.
335
18.19 The function g maps the observed states x_t to an inner state s_t; h is a generally nonlinear coordinate transformation back to the observation x_t.
336
18.20 An architecture for phase 1.
336
18.21 First approach for training phase 2.
337
18.22 Second approach for training phase 2.
338
18.23 Training phase 3 for fine tuning of the weights found by previous phases.
339
18.24 Triangle inequality smoothness (compare Eq. 18.58).
340
18.25 The dynamics of a pendulum can be separated into only one variant, the angle φ, and an infinite number of invariants.
341
18.26 Variant-invariant separation of a dynamics.
342
18.27 Variant-invariant separation by neural networks.
342
18.28 An inappropriate description of the state of a dynamic system can be
misinterpreted as noise.
343
18.29 The unfolding of singularities.
343
18.30 The intrinsic time concept tries to construct a dynamics with a constant
velocity in a phase space.
344
18.31 Following the state description x_{t+1} of Eq. 18.62, the 2m forecasts are combined to m predictions, which are subsequently averaged to suppress noise.
345
18.32 By this architecture, we are able to describe a partially autonomous system
which is also driven by external inputs.
346
18.33 The one-dimensional state space of an oscillator is a folded slope that can
be unfolded as a circle in two dimensions.
347
18.34 Comparison of the unfolding in space for two different auto associators used
as subnetworks in the recurrent network of Fig. 18.31.
348
18.35 Combining overshooting and unfolding in space and time.
349
18.36 Integration of undershooting into the unfolding in space and time concept.
350
19.1 A sequence seen as a data structure built on I-SDOAGs.
355
19.2 A drawing represented by a tree with local features forming node labels.
356
19.3 Graphical model of the computation carried out by a dynamic neural network
on sequences.
357
19.4 A two-layered MLP implementing the state transition function of a recursive
neural network for k-DOAG data structures.
357
19.5 Graphical model of the computation carried out by a recursive neural network
on an example of 2-DOAG.
361
19.6 Illustration of the "never-ending movie" input skeleton.
366
19.7 Example of how a logo can be represented as a structure.
368
19.8 The tree representation is particularly convenient when considering global
transformations of the logo, like scaling and rotations.
369
19.9 A finite initial segment of a search-tree generated by an automated deduction
system for one specific proof problem.
372
LIST OF TABLES
5.1 A taxonomy of dynamical systems.
59
5.2 Attractor symbol sequences for the logistic map.
79
6.1 Network architectures and computational models.
102
7.1 Minimum values of weights and limiting values for encoding a Mealy machine. 117
7.2 Values of weights and limiting values for encoding a DFA on a second-order
DRN without biases.
120
7.3 A summary of the encodings proposed in the chapter.
126
8.1 Chomsky hierarchy.
130
9.1 The computational power of recurrent neural networks.
147
10.1 Comparison of different DFA encoding methods.
177
11.1 A summary of the complexity of some learning procedures for recurrent networks. 204
12.1 Network generalization performance.
208
12.2 DFA generalization performance.
215
12.3 Extracted DFA for sparse training sets.
225
16.1 Nominal values of parameters a for the MIMO plant of Narendra and
Mukhopadhyay (1994).
273
17.1 Grammar 2. (implemented in dynamical automaton 1).
303
17.2 Dynamical automaton 1.
303
17.3 Dynamical automaton 2: transitions.
307
17.4 Dynamical automaton 2: compartment definitions.
307
LIST OF CONTRIBUTORS
Andrew Back Laboratory for Open Information Systems RlKEN Brain Science Institute The Institute of Physical and Chemical Research 2-1 Hirosawa, Wako-shi, Saitama, 351-0198, JAPAN
Yoshua Bengio Dept. Informatique et Recherche Operationnelle Universite de Montreal, CP 6128 Succ. Centre-Ville Montreal, Quebec, H3C 3J7 CANADA bengioy@iro.umontreal.ca http://www.iro.umontreal.ca/~bengioy
Alan D. Blair Department of Computer Science & Software Engineering University of Melbourne, 3010 AUSTRALIA blair@cs.mu.oz.au http://www.cs.mu.oz.au/~blair
Mikael Boden Department of Computer Science & Electrical Engineering University of Queensland, 4072 AUSTRALIA mikael@csee.uq.edu.au http://www.csee.uq.edu.au/~mikael
David Calvert Guelph Natural Computation Group Department of Computing and Information Science University of Guelph Guelph, Ontario, N1G 2W1 CANADA dave@snowhite.cis.uoguelph.ca http://hebb.cis.uoguelph.ca/~dave
Raphael C. Carrasco Departament de Llenguatges i Sistemes Informatics Universitat d' Alacant E-03071 Alacant, SPAIN carrasco@dlsi.ua.es
Lee A. Feldkamp Ford Research Laboratory Dearborn, MI 48121, U.S.A.
Mikel L. Forcada Departament de Llenguatges i Sistemes Informatics Universitat d' Alacant E-03071 Alacant, SPAIN mlf@dlsi.ua.es
Paolo Frasconi Dept. of Systems and Computer Science University of Florence via di Santa Marta 3, 50139 Firenze, ITALY paolo@dsi.unifi.it
C. Lee Giles NEC Research Institute 4 Independence Way Princeton, NJ 08540, U.S.A. giles@research.nj.nec.com
Marco Gori Dept. of Information Engineering University of Siena via Roma 56, I-53100 Siena, ITALY marco@ing.unisi.it
Sepp Hochreiter Fakultat fur Informatik Technische Universitat Munchen 80290 Munchen, GERMANY hochreit@informatik.tu-muenchen.de http://www7.informatik.tu-muenchen.de/~hochreit
John F. Kolen Institute for Human and Machine Cognition University of West Florida Pensacola, FL 32514, U.S.A. jkolen@ai.uwf.edu
Stefan C. Kremer Guelph Natural Computation Group Dept. of Computing and Information Science University of Guelph Guelph, Ontario, N1G 2W1 CANADA skremer@uoguelph.ca http://hebb.cis.uoguelph.ca/~skremer
Andreas Küchler Dept. of Neural Information Processing University of Ulm D-89069 Ulm, GERMANY kuechler@neuro.informatik.uni-ulm.de
Tsung-Nan Lin EPSON Research & Development Inc. 3145 Porter Drive, Suite 104 Palo Alto, CA 94304, U.S.A. tsungnan@erd.epson.com
Christopher Moore Department of Computer Science University of New Mexico Albuquerque, New Mexico 87131, U.S.A. moore@cs.unm.edu http://www.santafe.edu/~moore
Mike Mozer Department of Computer Science University of Colorado Boulder, CO 80309-0430, U.S.A.
Ralph Neuneier Siemens AG, Corporate Research, Neural Computation D-81730 Muenchen, GERMANY Ralph.Neuneier@mchp.siemens.de
Christian Omlin Dept. of Computer Science University of Stellenbosch 7600 Stellenbosch, SOUTH AFRICA omlin@cs.sun.ac.za
Barak A. Pearlmutter Computer Science Dept., FEC 313 University of New Mexico Albuquerque, NM 87131, U.S.A.
Jose Principe Director of Computational NeuroEngineering Laboratory EB 451 Bldg. #33 University of Florida Gainesville, FL 32611, U.S.A.
Danil V. Prokhorov Ford Research Laboratory Dearborn, MI 48121, U.S.A.
Gintaras V. Puskorius Ford Research Laboratory Dearborn, MI 48121, U.S.A.
Jürgen Schmidhuber IDSIA Corso Elvezia 36 6900 Lugano, SWITZERLAND juergen@idsia.ch http://www.idsia.ch/~juergen
Hava T. Siegelmann Head, Information Systems Engineering Faculty of Industrial Engineering Technion 32000 Haifa, ISRAEL iehava@ie.technion.ac.il http://ie.technion.ac.il/iehava.phtml
Alessandro Sperduti Dept. of Computer Science University of Pisa Corso Italia, I-Pisa, ITALY perso@di.unipi.it
Whitney Tabor Department of Psychology University of Connecticut Storrs, CT 06269, U.S.A. tabor@uconnvm.uconn.edu
Ah Chung Tsoi Faculty of Informatics University of Wollongong NSW 2522, AUSTRALIA
Janet Wiles Department of Computer Science & Electrical Engineering University of Queensland, 4072 AUSTRALIA janetw@csee.uq.edu.au http://www.csee.uq.edu.au/~janetw
Hans-Georg Zimmermann Siemens AG, Corporate Research, Neural Computation D-81730 Muenchen, GERMANY Georg.Zimmermann@mchp.siemens.de
IPart
INTRODUCTION
Chapter
1
DYNAMICAL RECURRENT NETWORKS
John F. Kolen and Stefan C. Kremer
1.1 INTRODUCTION
How do you handle a stream of input patterns when the interpretation of each pattern may depend on the patterns that preceded it? Many researchers have been interested in patterns that extend over time and in connectionist mechanisms, also known as neural networks, that store salient information between input presentations. Problems in control, language processing, and time-series prediction begged for connectionist solutions.
One solution to the problem of temporal processing is an approach utilizing dynamical recurrent networks (DRNs). DRNs are connectionist networks whose computational units assume activations based on the activation history of the network. The distinguishing characteristic of DRNs is their ability to map sequences of input vectors distributed across time into output vector sequences. In this sense, DRNs can be viewed as vector-sequence transducers. Although feedforward networks can also generate such sequences, DRNs differ in that they can generate different output vectors for the same input pattern, with the choice depending on the existing input-history-dependent context.
Just as there are many ways for connectionist techniques to classify, restore, or cluster patterns, there are many DRN approaches for solving temporal problems. One may process moving windows in time, as Sejnowski and Rosenberg did with NETtalk (Sejnowski and Rosenberg, 1987). The units in the network might simply have self-connections to provide decay of activation (Jordan, 1986a). This process could be extended to larger units of recurrent information passing through multiple layers of processing (Elman, 1990). The search space for DRN solutions is enormous because of the many choices of how temporal information is stored and how that information is processed.
This search space contains enough varieties of DRNs to require some sort of catalog to guide the novice through all the combinations of mechanisms and applications. This book, or more appropriately, this field guide, fills this need. Why a field guide? Field guides, such as those employed by bird watchers and other admirers of nature, generally catalog various species of animals and their habitats. They also help their readers gain a better insight into the relationships among the various species and their environments by drawing attention to the relevant features that have been adopted to thrive in various niches.
The DRN literature has recently witnessed a dramatic proliferation of species and behaviors, with every conference and journal spawning new architectures and algorithms. At the same time, we have seen a migration of DRNs across an ever-increasing range of problem domains. While there have been many specialized papers published on architectures, issues, and applications, novice DRN watchers (and even those who have been working in the field for some time) have dreamed of a more general and comprehensive reference to overview the field. For this reason, we were compelled to document our forays into artificial intelligence, control theory, and connectionism in search of DRNs.
Like other field guides, this book begins by teaching the reader to be a better observer. This lesson is supplemented by examining the challenges that DRNs face. In addition to the
architectures and approaches, we emphasize the issues driving the development of this class of network structures. Knowing these issues outfits the reader with tools to understand and evaluate the merits of the architectures described in the book. These tools will help the reader analyze and exploit other architectures, both current and future, as well.
This field guide also describes the most common species a researcher or developer is likely to encounter during typical DRN-watching trips to the forest of scientific and engineering literature. We will identify salient features of the DRNs and their various instantiations that help one to better understand these networks and the relations between the various species. Toward this end, we have attempted to present all relevant details in consistent notation and terminology.
Although it is important to know the architectural and algorithmic details of the different species of DRNs, it is probably more important to know their natural habitats. This information is especially critical to those trying to implement real systems based on these networks. We will, therefore, address the critical issue of how to apply these systems to real problems. Understanding the demands of various applications and how they can be met will permit the user to judiciously select and tune DRNs for new applications.
We have enlisted a diverse collection of researchers and practitioners in the area of DRNs to write various chapters of the guide. Their years of experience and differing viewpoints provide a more rounded presentation of the material than would be possible in a book with only one or two authors. Since there are almost as many reasons for researching DRNs as there are researchers, it is essential that a guide to DRNs reflect these differing objectives. Although we could try to paraphrase the reasons for their interest in DRNs, we believe that only the practitioners themselves can do justice to such an explanation.
1.2 DYNAMICAL RECURRENT NETWORKS
Our world is full of time-dependent processes. The daily cycle of light and dark, the changing of seasons, the periodic beats of our hearts, are all examples of systems that change over time. Almost any real system you can think of has some notion of temporal change, although we often do not think of it in that way. Consider a light switch. Although there is a minuscule delay between closing the switch and the light turning on, we can ignore the signal propagation delay owing to the speed of light and consider the system to be time independent. These simplifications allow us to reason about complex systems and devices without getting bogged down in detail.
In some systems, however, the time-dependent behavior is clearly important. A vending machine must release both your beverage and your change after you deposit a sufficient number of coins. Since you cannot deposit more than one coin at a time, the machine must remember how much money has been deposited since the last sale. The behavior is context dependent, since the machine will react differently to a coin depending on the previous coinage. A simple finite-state automaton can be used in this situation to store the current total coinage. Such a machine will have several distinct states, one for each possible increment of coinage it may receive. The representation of the coinage is the state of the finite-state machine. It will accept nickels, dimes, and quarters as "input" until the total amount is reached. If more than this amount has been fed to the machine, it will automatically release enough coinage to maintain the correct amount of money for the beverage. Once this state has been achieved, the user selects an item (via the Select input), the beverage is released, and the machine returns to its initial state.
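To make the example concrete, the following sketch (not taken from the book) implements such a vending-machine automaton in Python; the 50-cent price, the coin denominations, and the integer state representation are illustrative assumptions.

# A minimal finite-state automaton for the vending machine described above.
# The price (50 cents) and the accepted coins are assumptions chosen only to
# illustrate the state/transition structure; the state is the coinage so far.
PRICE = 50
COINS = {"nickel": 5, "dime": 10, "quarter": 25}

class VendingMachine:
    def __init__(self):
        self.state = 0  # total cents deposited since the last sale

    def insert(self, coin):
        """Transition on a coin input; excess coinage is released immediately."""
        self.state += COINS[coin]
        change = max(0, self.state - PRICE)
        self.state -= change
        return change

    def select(self):
        """The Select input releases the beverage only once the price is reached."""
        if self.state == PRICE:
            self.state = 0  # return to the initial state
            return "beverage"
        return None

machine = VendingMachine()
for coin in ["quarter", "dime", "dime", "nickel"]:
    machine.insert(coin)
print(machine.select())  # -> beverage

Because only finitely many totals are reachable, the state set is finite, which is exactly what makes this a finite-state automaton rather than a more powerful device.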
The light switch and vending machine represent two disjoint classes of machines defined by the time dependence of their behavior. In digital circuit theory, this distinction also underlies
Section 1.2 • Dynamical Recurrent Networks
5
the division between combinatorial and sequential circuits. In combinatorial circuits, the behavior of the mechanism is totally determined by the current input. A NAND gate, for instance, should respond only to the voltage levels present at response time.
Sequential circuits, on the other hand, are sensitive to the temporal history of their inputs. The prime example of a sequential circuit is the fundamental memory unit, the flip-flop. In order to perform to its specification, the flip-flop contains an internal state that changes according to the inputs it receives. Internal state allows input signals to affect future output behavior.
Unfortunately, this definition is not broad enough; one could argue that the physical implementation of a NAND gate has an internal state. The electric signals cannot travel faster than the speed of light, so any output behavior must be in the future of the input event. The same can be said for multilayered feedforward neural networks. The activations of the hidden units constitute the internal state of the network. If one looks at the behavior in the limit (i.e., the action of the device if the inputs were to remain constant for a very long period of time), we do see a difference. The NAND gate will experience short-lived transients and will settle on a stable output. The network will also produce its own transients due to the pipelined nature of its processing, but once the input exits the pipeline, the network will generate the same behavior over time. Usually, the transients are considered a side effect of the implementation and are not considered part of the functional specification. Those systems with very short transients (relative to the time course of interest) can be considered to lack a temporal component, like the NAND gate.
Like digital circuits, connectionist models consist of a number of processing units, each connected with one or more other processing units. A processing unit displays an activity, o. This activity depends on the activity of the processing units connected to it. The connections can be either excitatory or inhibitory. That is, if a processing unit has an excitatory connection with another highly activated unit, it too will raise its activity level. Inhibitory connections, on the other hand, decrease the activity level of the processing unit when the other unit is active. Connections are weighted; that is, they have a numerical value associated with them that scales the input arriving on that connection. The signs of these weights indicate excitatory or inhibitory connections. The quantity w_ij refers to the connection strength from unit j to unit i. The collective weighted input to a unit is the net input, Σ_j w_ij o_j. The activation of a unit is computed from the net input using an activation function. Common activation functions include the sigmoid, g(x) = sigmoid(x) = 1/(1 + e^(-x)), and the hyperbolic tangent, g(x) = tanh(x).
The most common connectionist architecture is the feedforward network. These networks consist of one or more layers of processing units. These layers are arranged linearly with weighted connections between each layer. All connections point in the same direction-a previous layer feeds into the current layer and that layer feeds into the next layer. Information flows from the first, or input, layer to the last, or output, layer. The input layer is distinguished by its lack of incoming connections. Similarly, the output layer is deprived of outgoing connections. The remaining layers possess both types of connectivity and are referred to as hidden layers.
Feedforward networks share many properties with combinatorial digital circuits. First, the fundamental discrete logic gates, such as AND, OR, NOT, and NAND, are implementable as linear combinations subjected to a high-gain transfer function (e.g., g(x) = sigmoid(50x)). More importantly, both display directed acyclic connectivity graphs. Cyclic connectivity is the magic behind the digital memory device, the flip-flop. As such, neither mechanism can store state information from one input presentation to the next.
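For instance, a single unit with a high-gain sigmoid can realize a NAND gate, as suggested above; the particular weights and bias below are one workable choice among many, not values taken from the book.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nand_unit(a, b, gain=50.0):
    # The linear combination -a - b + 1.5 is positive unless both inputs are 1;
    # the high gain pushes the sigmoid output to (nearly) 0 or 1.
    return sigmoid(gain * (-a - b + 1.5))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(nand_unit(a, b)))  # 1, 1, 1, 0: the NAND truth table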
Thus, feedforward networks merely transform representations. The real power of such networks comes from selecting vector representations that embody the desired topological relationships. The success of an application based on feedforward neural networks rests on the designer knowing these constraints ahead of time. The designer, for instance, should select
representation vectors which ensure that all "noun" vectors occupy a small region of word vector space. This constraint, and others like it, have guided researchers in constructing representations; choosing such constraints is one of the many programming aspects of neural networks. The problems to which feedforward networks have been applied have one constraint in common. These tasks are temporally independent: the "what" of the current input unambiguously determines the current output, independent of "when" it occurs. As demonstrated above, many problems are context dependent and thus demand connectionist architectures capable of encoding, storing, and processing context for later use.
A class of connectionist models known as dynamical recurrent networks (DRNs) is often brought to bear in these situations. In DRNs, the current activation of the network can depend on the input history of the system and not just the current input. Unlike their feedforward brethren, DRNs offer diverse patterns of processing-unit connectivity. Units can connect to themselves. Layers of units can project connections back to previous layers of the feedforward network. DRNs can even exhibit arbitrary connectivity. All of these patterns share one thing in common: somewhere in the connectivity graph lies a cycle.
Some DRNs, however, have acyclic graphs. These networks utilize time delays to spread the effective input across time. Their output is determined by a moving window of previous inputs. Hence, the current output of the network depends on the network's input history. These networks are considered DRNs as well.
The networks with cycles have the potential to dynamically encode, store, and retrieve information, much like sequential circuits. As such, DRNs can be thought of as neural state machines. The recurrent connectivity of the network produces cycles of activation within the network. This connectivity allows the network to have a short-term memory of previous experiences, as these experiences may have some effect on the cycling activation and may affect the processing of the network well after the stimuli have passed. This idea, known as reverberation, can be traced to the work of McCulloch and Pitts (1943) (which also prompted the development of sequential circuits and finite automata). The memory component of DRNs suggests that they would serve as excellent candidates for solving problems involving temporal processing.
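A minimal sketch of this neural state machine view (illustrative only: the layer sizes are arbitrary and the weights are random rather than trained) shows how recurrent connections let the same input produce different outputs as the internal state evolves.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2

# Randomly initialized weights of a simple Elman-style recurrent network.
W_xh = rng.normal(size=(n_hid, n_in))
W_hh = rng.normal(size=(n_hid, n_hid))   # recurrent (cyclic) connections
W_hy = rng.normal(size=(n_out, n_hid))

def step(x, h_prev):
    """One time step: the new state depends on the input and the previous state."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = 1.0 / (1.0 + np.exp(-(W_hy @ h)))
    return h, y

h = np.zeros(n_hid)
x = np.array([1.0, 0.0, 0.0])
for t in range(3):
    h, y = step(x, h)
    print(t, y)  # identical inputs, different outputs: the state retains history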
The connectionist techniques used to solve these problems are diverse. The DRN implementor faces several difficult questions. How will the processing units be connected? Is the connectivity pattern sufficient to solve my problem? How do I find the connection weights? By addressing the architectures, capabilities, algorithms, limitations, and applications of DRNs, we hope the reader will be able to find answers to these, and other, questions regarding this fascinating mechanism.
1.3 OVERVIEW
The book is divided into five parts. The first part presents an overview of DRN architectures. Once the reader has become familiar with the range of speciation in the DRN phylum, we turn to exploring their computational capabilities in Part II. The third part introduces algorithms for inserting and extracting knowledge from DRNs. Part IV then discusses the computational and representational limitations of these sequence transducers. We conclude the book with several field studies of DRNs in their natural habitats: applications.
1.3.1 Architectures
DRNs are the connectionist approach to the processing of spatio-temporal data. For instance, we may need to predict the next element of a time series, decide on an action, or clean a noisy signal, given historical information. To address such problems, DRNs must be sensitive to the sequential properties of their inputs. Internal memory, capable of storing salient
information about the previous input vectors, is a necessary feature of DRNs. As shown in Part I, this memory can take on many forms. From shift registers to recurrent connections, the memory form plays an important part in separating the variety of DRN species.
In the first chapter of our architectural explorations, we introduce mechanisms that can change how their internal states are computed during the training process. Examples of these architectures include simple recurrent neural networks (Elman, 1990), Jordan networks (Jordan, 1986a), and second-order recurrent networks (e.g., Giles et al., 1990).
Another approach to processing time-varying signals is to store a fixed-length input history, that is, a buffer, and then apply a conventional technique (e.g., a linear transformation or a multilayered perceptron) to the contents of the buffer. The famous NETtalk experiment (Sejnowski and Rosenberg, 1987) employed this technique for mapping sequences of text to phoneme sequences. Since that time, a number of enhancements have improved on the approach. For instance, the buffers can be moved into the network, as in the time delay neural network (TDNN) approach (Lang et al., 1990). Likewise, the recurrent output can be buffered, as in NARX (Nonlinear AutoRegressive with eXogenous inputs) networks (Lin et al., 1995). Our first exposure to architectures will observe members from the delay buffer genus.
In Chapter 4, the delay buffer notion is generalized to the memory kernel (de Vries and Principe, 1992). A memory kernel consists of a short-term memory subsystem and a generic predictor. The latter is assumed to be a static feedforward mechanism. The short-term memory, however, is a sequence of one or more memory modules that are governed by some transfer function between modules. For instance, the standard shift register is realized by a sequence of modules, with the identity function as their transfer function. Other mechanisms can be constructed as well: moving averages, gamma filters, Laguerre filters, and the standard IIR/FIR filter. Given this diversity, we will spend some time identifying and cataloging the memory kernel species.
1.3.2 Capabilities
After presenting the various DRN species, we turn to exploring their computational capabilities. Understanding how DRNs process information is very important when deciding which architecture to use. For instance, certain DRNs have difficulties learning context-free languages owing to the nature of their state transition functions.
From our description above, DRNs are computational systems with time-varying responses. Mechanisms such as these are most generally described as dynamical systems. Chapter 5 will introduce definitions crucial to understanding DRNs as dynamical systems. Three key notions will be elaborated: time/state-space taxonomy, attractor behavior, and iterated function systems. The first notion classifies dynamical systems by their treatment of time and space. While dynamical systems can exhibit a wide variety of behaviors, one can classify these behaviors qualitatively-fixed point, periodic, quasiperiodic, and chaotic. Finally, the similarity between iterated function systems and DRNs is identified.
The next two chapters address the issue of how dynamical recurrent networks can represent discrete states, implement state transitions, and thus act as finite-state automata or finite-state machines. The authors of Chapter 6 describe how to represent finite-state automata in networks with hard-limiting discriminant functions. The concept of, and necessity for, state splitting in single-layer first-order architectures is explained, and second-order alternatives are discussed. The representational capabilities of several other architectures, such as the recurrent cascade correlation network (Fahlman, 1991), are examined as well.
Although the previous chapter dealt with networks that used step functions to calculate activations, the next chapter concerns stable state encodings in DRNs with continuous activation functions. Such encodings are crucial to the implementation of finite automata in DRNs.
Without stability, the network could enter an infinite set of computational states (a situation exploited in Chapter 8). The authors of Chapter 7 use a bounding argument to define the stable encodings of different automata in a variety of network architectures (Carrasco et al., 2000). It is then possible to directly encode finite-state machines into a variety of commonly used DRNs. These encodings rely entirely on small weights. Large weights saturate the processing units and effectively produce processing units that behave like digital logic gates.
Chapter 8 in our exploration of the capabilities of DRNs steps to the next level in the Chomsky hierarchy: pushdown automata and context-free languages. The DRNs capable of finite-state behavior rely on state clustering, that is, neatly partitioned neighborhoods of activation vectors representing the state of the automaton. The DRNs of this chapter, however, do not partition nicely. In fact, these networks take advantage of the sensitivity to initial conditions of the state transfer function to encode an infinite set of states capable of supporting a pushdown automaton (Rodriguez et al., 1999). Several experiments that demonstrate the DRN's ability to generalize in the direction of the context-free interpretation of a finite set of training examples are described as well (Wiles and Elman, 1995). The importance of these capabilities becomes evident when DRNs are applied to natural language processing problems (see Chapter 17).
In the final chapter on the representational power of DRNs, we see how analog DRNs can implement computational mechanisms equivalent to Turing machines (Siegelmann and Sontag, 1991). Although analog DRNs cannot branch on values (e.g., if x is greater than 0, go to another part of the program), do not have a classical memory structure (e.g., tapes or registers), and only have a fixed number of processing units, they still can compute the range of functions that a digital computer can. The trick is to fractally encode the Turing machine "tape" in the continuous activation of a processing unit. If the DRN has real valued weights, the network can actually compute functions beyond those realized by Turing machines and recursive functions (Siegelmann, 1999).
1.3.3 Algorithms
The previous parts of the field guide provide structural and functional descriptions of many DRNs. In order to use these mechanisms to solve problems, we must employ one or more algorithms to assist in the insertion and/or extraction of knowledge stored in the networks.
We begin this part with a survey of learning algorithms for DRNs with hidden units, placing the various techniques into a common framework. We discuss fixed-point learning algorithms, such as recurrent back propagation (Williams and Zipser, 1989), and non-fixed-point algorithms, namely, back propagation through time (Rumelhart et al., 1986), Elman's history cutoff (Elman, 1990), and Jordan's output feedback architecture (Jordan, 1986a). Forward propagation, an online technique that uses adjoining equations, is also discussed. We discuss the advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones, and we continue with some "tricks of the trade"1 for training, using, and simulating continuous time and recurrent neural networks. After examining some simulation results, we address the issues of computational complexity and learning speed.
In Chapter 10, we focus on methods for injecting prior knowledge into DRNs (Giles and Omlin, 1993). Typically, this knowledge is in the form of a nondeterministic finite-state machine. Using the encodings described in Chapters 6 and 7, one can seed the training of a DRN with a good guess of the target machine. This insertion will result in improved generalization and reduced training times. Insertion algorithms for several architectures are described, including first-order, second-order, and Radial Basis Function (RBF)-based recurrent neural
1Additional connectionist model tricks can be found in Orr and Muller, 1998.
nets. In addition, the authors consider the implementation of fuzzy finite automata in DRNs (Omlin et al., 1998).
While Chapter 11 focuses on the issue of representing and inserting a desired behavior pattern into a DRN, Chapter 12 tackles the inverse problem: extracting the behavior from a DRN. Since DRNs can represent finite-state automata, we begin by describing a technique for extracting automata from DRNs (Giles et al., 1992b) that approximates the network's functional (input/output) behavior. This technique is based on the tendency for the state vectors of DRNs to cluster. The clusters are identified by quantizing the state space. Transitions between clusters are generated by exploring the state dynamics of the DRN through feeding the network input sequences and recording the state transitions. We go on to examine the limitations of this approach and the degree to which extraction is noise dependent. Finally, we examine an application of automata extraction in the context of financial forecasting (Lawrence et al., 1997). This application suggests a novel approach to trading strategies in financial markets.
1.3.4 Limitations
In this part, we pause to reflect on the limitations of the DRN architecture and algorithms for manipulating those architectures. First, we learn to evaluate potential benchmark problems. Difficulties with gradient-based learning algorithms are then explored. Finally, we address processing limitations due to noisy computation.
Evaluating DRN training algorithms is a difficult task. Although there are several benchmark sets (e.g., the Tomita languages-see Tomita, 1982), one difficulty lies in interpreting the results. Did the network find the solution because of a good training algorithm, or do satisfactory solutions litter weight space? Chapter 13 addresses this question by looking at the naive learning algorithm called random weight guessing (RG) (Hochreiter and Schmidhuber, 1997c). Although RG cannot be viewed as a reasonable learning algorithm, it often outperforms more complex methods on widely used benchmark problems. One reason for RG's success is that the solutions to many of these benchmarks are dense in weight space. Thus, a potential benchmark set should contain problems in which RG fails to work; otherwise we could not distinguish a good algorithm from one that thrashes around in weight space. This test should serve to improve the quality of benchmarks for evaluating novel DRN training methods.
Numerous papers have focused on the difficulty of training standard DRNs to learn behaviors involving long-term dependencies. Chapter 14 focuses on the dynamics of error signals during gradient-based training of DRNs. In feedforward networks, the backward flow of error information is either amplified or reduced, depending on the eigenvalues of the weight matrices. This situation is not much of a problem when the error flows through two or three weight matrices. DRN error signals, however, repeatedly pass through the weights determining the next-state calculation. As such, the path integral over all error signals either decays or explodes exponentially as they are repeatedly propagated through the state weights. This condition can prevent a network from learning behaviors with significant time lags (five to ten time steps). For instance, DRNs have difficulties learning to latch information for later use. This chapter provides an analysis of this important problem and discusses possible remedies.
The final chapter looks at two important factors that limit the utility of DRNs: the output function and computational stability. First, the output function of the DRN determines the shape of the state space used to classify input vectors. A single perceptron cuts the space by a hyperplane, while an RBF unit utilizes circular decision boundaries. By examining the VC dimension of the output function, one can establish a computational hierarchy for DRNs similar to those found in other limited computational systems. A DRN with quadratic decision
boundary, for instance, will be able to perform language recognition tasks that those with linear decision boundaries cannot. Second, this chapter examines two types of computational stability in the presence of noisy state transitions. The first assumes that the DRN must always perform perfectly in the presence of noise (Casey, 1996). Under this assumption, states are no longer points but clouds of points. If the dimensionality of this cloud is the same as the state space itself, the network can only perform finite-state calculations. The second stability assumption places a bound on the error rate of the DRN (Maass and Orponen, 1998). Again, the DRN is limited to finite-state behavior. The number of states in DRNs under this limitation, however, grows much faster than under the first condition as the amplitude of the noise is decreased. Given that DRN implementations, both hardware and software, are affected by noise, it is important to understand these limitations before applying DRN solutions to a problem.
1.3.5 Applications
We have devoted the final part of this field guide to studying the DRN in its natural habitat. Observing DRNs applied to real problems can help the reader understand the benefits and limitations of selecting DRNs as a solution tool. Our application areas range from financial portfolio management to modeling linguistic behaviors. Throughout these habitats, we will encounter a wide range of DRN architectures and training methods. They will be controlling physical systems, performing cognitive tasks, predicting time series, and storing data structures.
The first problem domain we visit is the application of DRNs to control. In this application, the DRN examines the current state of a system, or plant, and the plant's input to adjust the behavior of the plant to achieve some goal. For instance, a DRN controller could maintain the plant's output at a constant level despite variations in the input. Chapter 16 describes an approach to training two classes of DRNs, time-lagged recurrent neural networks and recurrent multilayered perceptrons, in the control field. The methods described in this chapter are applied to two realistic examples. The first application involves controlling a multiple-input multiple-output plant so that its outputs track two independent reference signals. The second example addresses the problem of financial portfolio optimization from the control perspective.
The next habitat we find for DRNs is the area of cognitive science. This field has been an incubator for many connectionist ideas, most notably parallel distributed processing (Rumelhart and McClelland, 1986). One would be hard pressed to find an area of cognitive science devoid of connectionist influences. One especially bountiful area has been natural language processing. Many DRN advances have been driven by the attempts to build connectionist models in the cognitive domain, despite vocal opposition (e.g., Fodor and Pylyshyn, 1988). In this chapter, we examine the role DRNs can play in modeling several phenomena of sentence processing-frequency sensitivity, garden-path effects, and phrase structure and memory. These models are then compared with more traditional linguistic models of the same phenomena.
From linguistics we turn to the general problem of modeling dynamical systems. The next two chapters present the modeling of time series, such as those encountered in economic forecasting, by finite unfolding in time. The first chapter addresses such topics as domain-specific data preprocessing (e.g., Refenes, 1994), outlier suppression, square augmented processing units (Flake, 1998), estimating time-series derivatives (Weigend and Zimmermann, 1998), and combining estimations (Perrone, 1993). These methods are combined into both a feedforward network and a simple recurrent network. While feedforward networks have been used extensively in forecasting, DRNs are appropriate when the dynamical system in question possesses a significant autonomous component.
Chapter 18 continues the dynamical system modeling discussion. Specifically, it addresses the question of how to combine state-space reconstruction (Packard et al., 1980) and
forecasting in a neural network framework. Starting from Takens' Theorem (Takens, 1981), networks search for an optimal state-space reconstruction such that the forecasts are more accurate. The network solves this problem by unfolding the data stream in both space and time. The theory provides a novel view on the detection of invariants, on the nonlinearity-noise dilemma, and on the concept of intrinsic time.
While most of this book deals with the problem of transducing or classifying inputs consisting of ordered sequences of vectors, more sophisticated, higher order structures can also be used. Chapter 19 examines the application of DRNs to the processing of labeled graphs (Frasconi et al., 1997). This capability is important because the natural encodings of some problems are graph-like. For instance, the network could encode the parse tree of a sentence. The resulting vector representation could be stored, classified, or transformed. This chapter finishes with the presentation of applications in structural pattern recognition, computational chemistry, and search control for deduction systems.
We conclude the book with a synopsis of its major themes and highlights. The final chapter discusses open problems that remain to be addressed and identifies some of the emerging directions that the field will follow in the future.
1.4 CONCLUSION
This chapter provides an outline of the contents of the DRN field guide. It sets the stage for the remainder of the book by contrasting feedforward networks with DRNs.
Chapter 2

NETWORKS WITH ADAPTIVE STATE TRANSITIONS
David Calvert and Stefan C. Kremer
2.1 INTRODUCTION
We begin our exploration of dynamical recurrent networks by defining a set of standard recurrent architectures for temporal sequence processing. This will set the stage for future chapters focusing on representational power, learning algorithms, limitations, and applications. This chapter also serves as a frame of reference for two specialized recurrent networks that are described in Chapters 3 and 4. While these chapters focus on networks with restricted recurrent connections, in this chapter, we explore a very general network that permits a wide range of spatio-temporal computations, including finite-state machines (Chapters 6 and 7), pushdown automata (Chapter 8), and both Turing machines and real-valued computational devices capable of recognizing noncomputable languages (Chapter 9).
The versatility of the networks described in this chapter derives from the minimal constraints placed on their internal state manipulation mechanisms. This freedom allows training algorithms to discover state transition functions that store information salient to the task at hand. Since these functions are implemented as feedforward networks, the range and complexity of the state transitions and output functions are only limited by the resources provided by the developer (Hornik et al., 1989). Our aim in this chapter is to identify some popular alternative architectures that employ these adaptive state computations and to describe their processing and adaptation algorithms. In later chapters, we will explore the computational powers of these networks and their variations, as well as the limitations of their adaptation algorithms.
2.2 THE SEARCH FOR CONTEXT
Connectionist networks are information processing systems. As such, they operate on representations of information, much like a computer manipulates bits that can represent names and ages. Because of their structure, feedforward networks can be viewed as continuous transformations or continuous representations. This is not surprising, given that these networks are comprised of multiple layers of processing units performing continuous transforms. These transforms are also many-to-one functions. Thus, any single input pattern will always generate the same output pattern, assuming the network is unchanged. Although this feature is desirable for many pattern recognition tasks, it prevents these networks from responding to an input predicated on the presentation context.
In order to react appropriately, a context-dependent network must have some way to internally store the current context and update this context to reflect the current input event. The difficult aspect of this process is to ensure that information from the input history is available at the time it is needed. As such, the DRN must maintain short-term memory for periods of time longer than one input presentation. The longevity of this activation provides the desired temporal context. This encoding of input information across many input pattern presentations allows the network to perform spatio-temporal processing. Many connectionist models, such as multilayered perceptrons, perform spatial processing since the arrival order of the input patterns has no effect on output generation.
The architecture of the context-dependent network must support the short-term storage of information. The content of the short-term memory must depend on the previous content and the current input pattern. In the simplest case, the output of the network can depend on the newly computed next state. Figure 2.1 depicts a connectionist implementation of this architecture. This system, originally used by Elman (Elman, 1991), is essentially a continuous state-space version of a Moore machine. The network consists of two subnetworks, the next-state subnetwork and the output subnetwork. These subnetworks can take many forms, but here we have used two-layer perceptrons. The output subnetwork is a standard spatial processor, transforming the state pattern to an output pattern. The next-state subnetwork, however, is configured recurrently. That is, it reuses the pattern appearing on its outputs as part of the input pattern during the next timestep. The subnetwork performs three operations: input, compute, and copy. First, an input pattern is loaded into the input units. Second, the next-state subnetwork uses both the current input and current state to calculate the next state of the system. Finally, the output of this subsystem is copied to the current state buffer.
Figure 2.1 General recurrent architecture.
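As a concrete illustration of the input-compute-copy cycle just described, the following sketch (not taken from the chapter) implements the general architecture of Figure 2.1 with two-layer perceptrons for the next-state and output subnetworks; the layer sizes and random weights are illustrative assumptions.

# A minimal sketch of the general recurrent architecture of Figure 2.1: a two-layer
# next-state subnetwork and a two-layer output subnetwork, with the new state copied
# back into the context buffer after every input.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_state, n_hid, n_out = 3, 4, 5, 2          # sizes are illustrative assumptions

# next-state subnetwork: (input + context) -> hidden -> state
W1 = rng.normal(scale=0.5, size=(n_hid, n_in + n_state))
W2 = rng.normal(scale=0.5, size=(n_state, n_hid))
# output subnetwork: state -> hidden -> output
V1 = rng.normal(scale=0.5, size=(n_hid, n_state))
V2 = rng.normal(scale=0.5, size=(n_out, n_hid))

def step(x, context):
    """One input-compute-copy cycle: returns (output, new context)."""
    state = sigmoid(W2 @ sigmoid(W1 @ np.concatenate([x, context])))  # compute next state
    y = sigmoid(V2 @ sigmoid(V1 @ state))                             # output from new state
    return y, state                                                   # copy state to context

context = np.zeros(n_state)
for t, x in enumerate(rng.normal(size=(6, n_in))):                    # a short random sequence
    y, context = step(x, context)
    print(t, np.round(y, 3))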
2.2.1 Context as Input History
One of the oldest techniques for representing a temporal context in a DRN involves the buffering of several consecutive input patterns (Figure 2.2). A fixed size first-in first-out buffer acts as a short-term memory storing the most recent input history. Input patterns enter one
end, are pushed down the length of the buffer, and then are finally discarded after leaving the last buffer module. This buffering results in the creation of a spatial pattern from the time slice of the input pattern stream. Given this new, context-independent task of mapping the buffer contents to an output pattern, any nontemporal architecture and training algorithm can be used. This approach, often referred to as time delay neural networks (TDNNs) (Waibel, 1989), will be extensively reviewed in Chapter 3 and generalized in Chapter 4. The most famous example of a TDNN is NETtalk (Sejnowski and Rosenberg, 1987), a text-to-speech converter.

Figure 2.2 TDNN architecture.
The TDNN approach offers two benefits. First, the architecture is familiar. It resembles a finite-duration impulse response (FIR) filter, a common tool in time-series analysis. The mechanism itself is the standard multilayered perceptron. Second, the conventional backpropagation training algorithm can train these networks without modification.
The benefits derived from this approach are offset by the limited size of the input history window. By definition, the response of a FIR to an impulse will disappear after some finite time limit has expired. In the case of TDNNs, this time limit is the length of the shift register storing the input history. The buffer size limits the length of the longest sequences successfully differentiable by these networks. Sequences with long, contextually relevant groupings require impractical numbers of input windows or delay lines. Given a problem with unbounded sequence length, it is impossible to guarantee correct classification regardless of the input window size. In addition, the number of input windows is fixed and must be chosen in advance. The network developer must have a good idea of the maximum dependency length in order to accommodate the longest possible sequences that the network is required to differentiate. Compounding this problem is an increase in computation time associated with the large number of additional weights between input and hidden units required to support the buffers.
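The following sketch illustrates the buffering idea behind the TDNN approach: a fixed-length first-in first-out window of recent inputs is flattened into a spatial pattern and fed to an ordinary static network. The window length, layer sizes, and random weights are illustrative assumptions, not values from the text.

# A minimal sketch of the TDNN buffering idea: a fixed window of the most recent
# inputs is presented to an ordinary static (feedforward) network.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

window = 3                                   # buffer length (t, t-1, t-2)
n_in, n_hid, n_out = 2, 6, 1
rng = np.random.default_rng(1)
W_hid = rng.normal(scale=0.5, size=(n_hid, window * n_in))
W_out = rng.normal(scale=0.5, size=(n_out, n_hid))

sequence = rng.normal(size=(10, n_in))       # an input pattern stream
buffer = np.zeros((window, n_in))            # FIFO short-term memory

for t, x in enumerate(sequence):
    buffer = np.vstack([x, buffer[:-1]])     # push new pattern in, discard the oldest
    y = sigmoid(W_out @ sigmoid(W_hid @ buffer.ravel()))
    print(t, np.round(y, 3))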
2.3 RECURRENT APPROACHES TO CONTEXT
As an alternative to a spatial representation of context, we now turn to the technique of adding recurrent connections to the network. The intention of recurrent networks is to maintain state information in the system by feeding state information back into the processing units through delayed, looped connections. The strength of dynamical recurrent networks in this regard is
their ability to arbitrarily adapt these connections to compute context and output appropriate to the problem at hand.
The windowed methods for maintaining state use a static calculation to generate outputs. These networks always perform their calculations using the same functional operations. With the addition of recurrent connections, the system has the ability to dynamically change how contextual information and output are computed within the network by means of weight adaptation.
This form of feedback also allows the system to maintain input information for arbitrarily long periods of time. (This is termed "latching" in Chapter 14.) This is done by maintaining the relevant input information in the context units.
Given an appropriate learning rule, these architectures are therefore capable of maintaining state information which can be used to represent context. The recurrent connections found in these networks eliminate the broad input window configuration required by the static buffer systems. Instead, they allow the network to generate their own context through training.
2.4 REPRESENTING CONTEXT
All dynamical recurrent networks use some delayed previous activation values to compute current activation values. Some authors represent this delayed copy implicitly in their descriptions of their architectures, while others use an explicit approach. The latter employ a special set of processing elements called context units (Elman, 1990) to represent the delayed previous activation values. As an example, we consider Figure 2.3(a), which illustrates a processing unit j with a time-delayed recurrent connection with weight w_{jj}. In this example, the connection serves two distinct purposes: it delays the transmission of the activation of node j, and it weights that transmitted activation by the connection weight w_{jj}. In Figure 2.3(b) these two functions are explicitly separated by introducing a context unit i. This unit splits the connection from j to j into a component that delays the activation value (downward arrow) and a component that weights the delayed activation (upward arrow).
Figure 2.3 Equivalent view of context units.
For the purposes of this chapter, we will illustrate all of our networks using context units to clarify exactly where delayed activation values are transmitted. A bold arrow represents fully connected layers of n x m units, while a narrow arrow represents a set of one-to-one connections (in which every processing element in one layer is connected to exactly one corresponding unit in another layer). Downward arrows represent an implicit delay of one time-step.
2.5 TRAINING
A recurrent network can be transformed into a feedforward network with identical behavior over a finite amount of time. The method to do this involves unfolding the recurrent network and adding a new layer to the feedforward system for each time step it will process. Figure 2.4(a) shows a simple recurrent network. In Figure 2.4(b) the network has been unfolded in time. Note that the same weight values are shared by various different connections in the unfolded network.
Once a network has been unfolded in this manner, it is possible to train it as if it were a many-layered network. Where connection weights are duplicated, the appropriate weight change can be computed to be the sum of all the weight changes for all the connections that share a particular weight value. Thus, the traditional weight update equation,

\Delta w_{ji} = \eta \delta_j a_i,

becomes

\Delta w_{ji} = \eta \sum_t \delta_j(t) a_i(t).
This technique is known as back propagation through time (Rumelhart et al., 1986).
Back propagation through time is a simple weight update algorithm that implements
gradient descent in dynamical recurrent networks. In Chapter 11 we will see a number of
alternative algorithms that compute the same thing or approximations to it, as well as some
algorithms that adapt weights in a manner that does not directly correspond to the gradient.
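The following sketch (under the stated assumptions of a single recurrent unit and a squared-error target at the final time step) illustrates back propagation through time: the gradient of each shared weight is accumulated over its copies in the unfolded network, and the result is checked against a finite-difference estimate. All of the numbers are illustrative assumptions.

# A minimal sketch of back propagation through time for one recurrent unit.
import numpy as np

def forward(w, v, xs):
    states = [0.0]                               # s_0 = 0
    for x in xs:
        states.append(np.tanh(w * states[-1] + v * x))
    return states

def loss(w, v, xs, d):
    return 0.5 * (forward(w, v, xs)[-1] - d) ** 2

def bptt(w, v, xs, d):
    s = forward(w, v, xs)
    grad_w = grad_v = 0.0
    delta = (s[-1] - d) * (1.0 - s[-1] ** 2)     # error signal at the last time step
    for t in range(len(xs), 0, -1):              # walk backward through the unfolding
        grad_w += delta * s[t - 1]               # sum contributions of the shared weight
        grad_v += delta * xs[t - 1]
        delta = delta * w * (1.0 - s[t - 1] ** 2)
    return grad_w, grad_v

xs, d, w, v = [0.5, -1.0, 0.3, 0.8], 0.2, 0.7, -0.4
gw, gv = bptt(w, v, xs, d)
eps = 1e-6                                       # finite-difference check
print(gw, (loss(w + eps, v, xs, d) - loss(w - eps, v, xs, d)) / (2 * eps))
print(gv, (loss(w, v + eps, xs, d) - loss(w, v - eps, xs, d)) / (2 * eps))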
2.6 ARCHITECTURES
2.6.1 Jordan
The network proposed by Michael Jordan (Figure 2.5) is a traditional three-layer feedforward architecture with a set of context units that mirror the output layer activation and feed it back into the hidden layer (Elman, 1988, 1990). This acts as a parallel input layer that stores the context information. Jordan refers to the context as the state of the system. Connections from the output processing units to the state units are set equal to 1.0, and a single connection exists for each output, leading to one corresponding state unit. This results in the output activations being copied verbatim to the state units and requires that the number of context units is equal to the number of outputs.
State units have recurrent connections dk local to each node. These connections have a predetermined value of less than 1.0, which causes the current activation to decay and be fed back into the original state node. This acts as a memory for previous state information but allows the activations to decay over time, thereby allowing the system to gradually forget past events. The result is a context that remembers events local to a given time but that can forget those events when they are sufficiently far in the past to no longer be relevant.
Activation from the state units is derived from the outputs and state values at time t. It is processed through a weight matrix and passed to the hidden layer in parallel with the next input
pattern at time t + 1. An interesting feature of this network is that only the output values are
transmitted recurrently. This allows the back-propagation algorithm to be simplified to teacher forcing (which is described in Chapter 11). In this case, gradient descent can be reduced to the familiar back-propagation algorithm for nonrecurrent networks.
Figure 2.4 A simple recurrent network, (a), is unfolded in time, (b).

Figure 2.5 Jordan's architecture.

The activation calculations for the Jordan-style network use a minor variation of those used in back propagation. Input activation a_i^I(t) is created when an input pattern is presented to the input layer at time t, where i ranges over the n input units. Hidden layer activation is
calculated from the sum of the input a_i^I(t) and context a_k^C(t) activations, which are passed through fully connected weight matrices:

a_j^H(t) = f( \sum_{i=1}^{n} a_i^I(t) w_{ji} + \sum_{k=1}^{m} a_k^C(t) w_{jk} ),

where m is the number of context units, and the function f is a nonlinear sigmoid function. Output layer activation is calculated using the standard method:

a_i^O(t) = f( \sum_{l} a_l^H(t) w_{il} ),   i = 1, ..., p,

where p is the number of output units. An additional step must then be performed to transfer the output activations to the context units. A new context activation is calculated by copying each output activation to a context unit and summing it with the activation already present in that unit,

a_k^C(t + 1) = a_k^O(t) + a_k^C(t) d_k,

where d_k is the decay value for existing activations. Note that the weight layer between the output and context layer is not fully connected. It consists of a single connection between one node in each layer, which serves to copy the activation from output to context.
Once this is complete, the weight layers may be trained using the back-propagation algorithm. The weights between the hidden and context layer are trained normally with the hidden error values being used to modify the connection weights. The weight layer between the output and context layer has fixed weight values of 1.0, and the recurrent connections in the context layer are given a user-defined decay value, so that neither of these is affected by the training algorithm.
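A minimal sketch of a forward pass through a Jordan-style network follows; the layer sizes, random weights, and decay value are illustrative assumptions. It shows the output being copied to the state units, which decay and feed back into the hidden layer with the next input.

# A minimal sketch of the Jordan network's forward dynamics.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n_in, n_hid, n_out, decay = 3, 5, 2, 0.5
W_in = rng.normal(scale=0.5, size=(n_hid, n_in))    # input -> hidden
W_ctx = rng.normal(scale=0.5, size=(n_hid, n_out))  # state (context) -> hidden
W_out = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output

state = np.zeros(n_out)                              # one state unit per output unit
for t, x in enumerate(rng.normal(size=(5, n_in))):
    h = sigmoid(W_in @ x + W_ctx @ state)            # hidden sums input and context
    y = sigmoid(W_out @ h)
    state = y + decay * state                        # copy output plus decayed old state
    print(t, np.round(y, 3))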
2.6.2 Elman
Elman's (Elman, 1988, 1990, 1991) architecture also uses a context layer in parallel to the input layer (Figure 2.6). The context in this system is derived from the activation in the hidden units. The hidden unit activation pattern is copied verbatim through weight connections set equal to 1.0 and stored in the context units. This is similar to Jordan's method of transferring
Figure 2.6 Elman's architecture.
activations in that the number of context units must be equal to the number of units from which the context is copied. The context activations are derived from hidden units at time t and processed through a weight matrix and returned to the hidden layer in parallel with the next
input pattern at time t + 1.
Elman uses a simple training algorithm called truncated gradient descent, described in Chapter 11. This algorithm makes the simplifying but technically inaccurate assumption that the gradient of the activations of the context units with respect to the network weights is zero. This is not strictly true since the context unit activations are exactly equal to the hidden unit activation values, which in turn do depend on the weights of the network. The truncation of the gradient calculation at the hidden units means that the familiar back-propagation algorithm can then be applied to these networks.
This network is known as the simple recurrent network (SRN). A later version of this network used additional hidden layers between the input and hidden, and between the hidden and output layers.
Whereas Jordan's system feeds the output back into the network to provide a context, Elman's architecture generates context using an internally defined network state. This provides a further degree of freedom by allowing the network to adapt the computation of hidden units and hence the feedback values used to calculate context instead of forcing context to be derived from externally predefined output values.
The activation algorithm for the Elman network is similar to that found in the Jordan architecture. The input activation a_i^I(t) is taken directly from the external environment. The hidden layer activations a_j^H(t) are calculated in the same manner as in the Jordan architecture, with activation being the sum of input and context values that are passed through weighted connections. Similarly, output activations are the sum of the hidden activations passed through a weight matrix. Activation calculations in the context units a_k^C(t + 1) are where the Jordan and Elman networks differ. The Elman network copies the activation values directly from the hidden layer a_k^H(t), as opposed to the output layer, without use of locally recurrent connections:

a_k^C(t + 1) = a_k^H(t).
Training the weights is done through normal back propagation. A recurrent connection joins each hidden node to a single context unit. These connections have weight values fixed to 1.0. Weights leading from the context layer to the hidden layer are trained normally.
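For comparison, here is a minimal sketch of the forward pass of the simple recurrent network, in which the context layer is a verbatim copy of the previous hidden activations; sizes and weights are again illustrative assumptions.

# A minimal sketch of the Elman (SRN) forward dynamics.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n_in, n_hid, n_out = 3, 4, 2
W_in = rng.normal(scale=0.5, size=(n_hid, n_in))    # input -> hidden
W_ctx = rng.normal(scale=0.5, size=(n_hid, n_hid))  # context -> hidden (trainable)
W_out = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output

context = np.zeros(n_hid)
for t, x in enumerate(rng.normal(size=(5, n_in))):
    hidden = sigmoid(W_in @ x + W_ctx @ context)
    y = sigmoid(W_out @ hidden)
    context = hidden.copy()                          # fixed 1.0 connections: copy hidden
    print(t, np.round(y, 3))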
2.6.3 Williams and Zipser
Where the previous networks were largely feedforward with some added recurrences, the following systems are quite different. The network described by Williams and Zipser (Williams and Zipser, 1988) consists of a single layer of fully recurrently connected processing units (Figure 2.7).
Figure 2.7 Williams and Zipser's architecture.
Given a network with n processing units and m external inputs, the weights for this
system can be described using a single n x (m + n) matrix. In such a matrix, each row would
represent one of the processing units and each column the weights leading to (or from) that unit. A unique weight exists between every pair of processing units, and every unit is also connected to itself.
The activation values generated by the layer of nodes are treated as the outputs of the network. Activation is calculated for each node simultaneously using external inputs and current activation values. External input to the system at time t does not affect the activation values of the processing units until time t + 1. Activation a_j for processing unit j is calculated for time t + 1 using

a_j(t + 1) = f( \sum_i w_{ji} a_i(t) ),

where w_{ji} is the set of weights leading from the inputs and processing units to unit j, a_i(t) are the activations currently present in the network, including input values, which are treated similarly to processing unit activations during state calculations, i ranges over all connections leading to processing unit j, and f is the transfer function.
As this architecture is essentially a one-layer network, it is clear that it can only compute linearly separable functions in a single transition from the input and the existing state to the new state. To remedy this limitation, it is possible to cycle the network for an additional iteration at the end of each input string. This allows the network to compute nonlinearly separable decisions on its input as well.
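A minimal sketch of the Williams and Zipser update follows, with the n x (m + n) weight matrix applied to the concatenation of the external inputs and the units' own activations; the sizes and random weights are illustrative assumptions.

# A minimal sketch of a single fully recurrent layer (Williams and Zipser style).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
m, n = 2, 3                                          # m external inputs, n units
W = rng.normal(scale=0.5, size=(n, m + n))           # row j: weights into unit j

a = np.zeros(n)                                      # unit activations, also the outputs
for t, u in enumerate(rng.normal(size=(5, m))):
    a = sigmoid(W @ np.concatenate([u, a]))          # input at t affects activations at t+1
    print(t + 1, np.round(a, 3))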
2.6.4 Giles et al.
Giles et al. developed a second-order recurrent network (Figure 2.8) with a single-layer architecture (Giles et al., 1991). The processing units in the single layer are connected to each other and to the input nodes through a series of multipliers.
Activation in each node is calculated using the sum of the products found when all other activations and weight values are combined. The activation a_i for node i at time t + 1 is calculated using

a_i(t + 1) = f( \sum_{j,k} W_{ijk} a_j(t) x_k(t) ),
where weight W_{ijk} is applied in conjunction with state unit activation a_j(t) and input x_k(t) to update node i, and f is a sigmoidal transfer function. This differs from the system developed by Williams and Zipser in that it calculates activations as a product of weights and both the input and the current system activation.

This architecture can be transformed into a finite-state automaton that models the concepts learned by the network. This is explored in Chapters 6, 10, and 12.

Figure 2.8 Giles' architecture.
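The second-order update can be written compactly as a tensor contraction, as in the following sketch; the network size, random weights, and one-hot input encoding are illustrative assumptions.

# A minimal sketch of a second-order recurrent state update: each weight W[i, j, k]
# multiplies a state activation with an input value.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
n_state, n_in = 3, 2
W = rng.normal(scale=0.5, size=(n_state, n_state, n_in))

state = np.full(n_state, 0.5)                        # an arbitrary start state
for t, symbol in enumerate([0, 1, 1, 0]):            # a short input string
    x = np.eye(n_in)[symbol]                         # one-hot input vector
    state = sigmoid(np.einsum('ijk,j,k->i', W, state, x))
    print(t, np.round(state, 3))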
2.6.5 Locally Recurrent Connections
Generally, two types of recurrent connections are used in these architectures, distinguished by their connection scheme. Recurrent connections can either loop back to the node that instigated the signal (locally recurrent connections), or they can be fully connected to all other nodes in the same layer (fully recurrent connections).
Fully connected recurrent layers provide a more complex processing of activation values than do the self-recurrent nodes and are known to have greater representational power (see
Kremer, 1999). Such systems provide for the simultaneous integration of all state information within a single layer. In addition, they require learning rules that can manipulate such information.

Figure 2.9 Mozer's architecture.
Locally recurrent neurons are normally connected to external inputs and are referred to as dynamic neurons. These allow the activations associated with the inputs to the system to create a temporal context. These units are also known as decay units or integrating units. Neurons without recurrent connections are termed static neurons.
Locally recurrent connections generally have weight values of less than 1.0. This creates a decay that allows the system to gradually forget its previous values. Larger weight values allow for a longer memory in these units. These units require that the recurrent connections have a delay associated with them to align the arrival time of external and recurrent inputs to the node. This approach is often referred to as a memory kernel approach and is illustrated in Figure 2.9. In Chapter 4 we examine these types of networks in greater detail.
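The following sketch shows the behavior of a single locally recurrent (decay) unit under an impulse input; the decay value and input are illustrative assumptions.

# A minimal sketch of a locally recurrent (decay/integrating) unit: its own delayed
# activation is fed back through a self-weight below 1.0, so past inputs fade gradually.
decay = 0.8                                          # self-recurrent weight < 1.0
a = 0.0
inputs = [1.0, 0.0, 0.0, 0.0, 0.0]                   # an impulse followed by silence
for t, x in enumerate(inputs):
    a = decay * a + x                                # delayed feedback plus external input
    print(t, round(a, 3))                            # memory of the impulse decays over time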
2.7 CONCLUSION
We have now set the stage for an exploration of recurrent networks. Armed with a basic understanding of some of the prototypical architectures, we can explore refinements on these basic architectures in the chapters that follow. We will also proceed to examine the computational powers and limitations of these approaches and, finally, their applications to real-world problems.
Chapter 3

DELAY NETWORKS: BUFFERS TO THE RESCUE
Tsung-Nan Lin and C. Lee Giles
3.1 INTRODUCTION TO DELAY NETWORKS
Dynamical networks are different from static networks in the sense that dynamical models have memory elements. Memory elements are the key components that preserve the states of dynamical networks and give dynamical networks the capability of handling time-dependent problems. Different architectures may be theoretically equivalent in terms of representation power, yet there might be some practical advantages to using one architecture over another.
In this chapter we discuss Nonlinear AutoRegressive with eXogenous inputs (NARX) recurrent networks, which have powerful representational capability. It has been reported that the gradient-descent learning algorithm is more effective in NARX networks than in other recurrent networks that have "hidden states." We present an analysis of how the gradient information is propagated through the delay buffers in the NARX network. Because the delay buffers can propagate gradient information more efficiently, NARX networks perform better than other recurrent networks on problems involving long-term dependencies. We show that although NARX networks do not totally circumvent the problem of long-term dependencies, they can greatly improve performance on such problems. We present some experimental results showing that NARX networks can often retain information for two to three times as long as conventional recurrent neural networks.
Temporal tasks and static tasks have fundamentally different characteristics. The coordinates of a static vector in an N-dimensional feature space can be randomly permuted without affecting the classification. In contrast, the temporal sequence is order sensitive. The results will strongly depend on the order of each feature. In practice, time is important in many cognitive tasks such as vision, speech, signal processing, and motor control.
Memoryless networks are inadequate for temporal tasks. For a neural model to deal with temporal tasks, the network should have dynamic properties that make the network responsive to time-varying signals.
For a neural network to be dynamic, it must be given memory. The simplest memory element is the unit time delay, which has the transfer function
H(z) = z^{-1}.   (3.1)
The simplest memory architecture is the tapped delay line, which consists of a series of unit time delays. Tapped delay lines are the basis of traditional linear dynamics models such as finite impulse response (FIR) models or infinite impulse response (IIR) models. A multilayer perceptron network may be made dynamic by introducing time delay loops to the input, hidden, and/or output layers. Another way is to route delay connections from one layer to another. The memory elements can be either fully or sparsely interconnected. Chapter 4 presents a detailed discussion of different memory kernels.
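As a small illustration of the tapped delay line, the following sketch implements a finite impulse response filter whose output is a weighted sum of the current and delayed inputs; the coefficients and input signal are illustrative assumptions.

# A minimal sketch of an FIR filter built from a tapped delay line: the output is a
# weighted sum of the current input and q delayed inputs.
import numpy as np

b = np.array([0.5, 0.3, 0.2])                        # filter coefficients b_0 ... b_q
taps = np.zeros(len(b))                              # delay line holding u(t), u(t-1), ...
for t, u in enumerate([1.0, 0.0, 0.0, 0.0, 2.0]):
    taps = np.concatenate([[u], taps[:-1]])          # each unit delay implements z^{-1}
    print(t, round(float(b @ taps), 3))              # y(t) = sum_k b_k u(t - k)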
Finite impulse response networks are one kind of delay network without feedback. The FIR network can be easily related to a standard multilayer network as a temporal extension of substituting static synaptic weights with FIR filters (Wan, 1994). A similar network architecture incorporating embedding time delays is the time-delay neural network (TDNN) (Lang et al.,
1990; Waibel et al., 1989). In fact, TDNN and the FIR network can be shown to be functionally equivalent (Clouse et al., 1997; Wan, 1994).
NARX neural networks (Chen et al., 1990; Leontaritis and Billings, 1985; Ljung, 1987; Su and McAvoy, 1991; Su et al., 1992) are an important class of dynamic neural networks. This type of neural network associates its memory elements with the input and output layers. NARX networks are a good representative of delay networks with feedback. Although the feedback of NARX networks is restricted in the sense that it comes only from the output neurons, it has been shown that in theory one can use NARX networks, rather than other dynamical networks with fully connected memory elements, without any computational loss. Furthermore, the computational power of NARX networks is at least as great as that of Turing machines (Siegelmann et al., 1997). Horne and Giles (1995) also showed that NARX networks perform better on some problems than many conventional recurrent networks with "hidden states" when training with a gradient-descent learning algorithm. NARX networks can converge much faster and generalize better than can other networks. Part of the reason gradient-descent learning is more effective can be attributed to the embedded memory of NARX networks (Horne and Giles, 1995; Lin et al., 1996). The embedded memory of the NARX network will appear as jump-ahead connections that provide shorter paths to propagate gradient information more efficiently when networks are unfolded in time to backpropagate the error signal, thus reducing the network's sensitivity to long-term dependencies (Lin et al., 1996). Similar improvements of learning ability can be achieved in other types of recurrent networks by simply increasing the order of embedded memory (Lin et al., 1998).
This chapter is organized as follows. Section 3.2 shows the back-propagation through time learning algorithm of delay networks. NARX networks and the temporal property of learning long-term dependencies in NARX networks are presented in Section 3.3 and Section 3.4, respectively.
3.2 BACK-PROPAGATION THROUGH TIME LEARNING ALGORITHM
The back-propagation through time (BPTT) algorithm is an extension of the standard back-propagation algorithm and is the most popular training algorithm for recurrent networks. Assume that the task performed by the network is a sequential supervised learning task: certain output values of the network are to match specified target values at specified times. Let U denote the set of output indices, such that y_k, k \in U, is the kth output neuron, and let T(t) denote the set of time indices for which there exists a specified target value d_k(t), k \in U, that the output of the kth unit should match at time t. Figure 3.1 shows a simple recurrent network of two neurons. The dynamics of this network can be described by
y_k(t) = f(a_k(t)),   (3.2)

a_k(t) = \sum_i w_{ki} y_i(t - 1) + u_k(t),   (3.3)

where a_k(t) and y_k(t) are the activation and output value of neuron k at time t, f is the nonlinear function, and u_k(t) is the input value to neuron k at time t. Then the instantaneous error can be defined as

e_k(t) = d_k(t) - y_k(t),   (3.4)

taken to be zero whenever no target value is specified for unit k at time t.
Figure 3.1 A two-neuron recurrent network architecture.

This allows target values to be specified for different units at different times. The cost function at time t can be defined as
E(t) = \frac{1}{2} \sum_{k \in U} [e_k(t)]^2.   (3.5)

The objective of learning might be minimizing the total error

E_{total}(t_0, t_1) = \sum_{\tau = t_0}^{t_1} E(\tau)   (3.6)
over some appropriate time period [t_0, t_1]. The algorithm consists of two steps. First, the recurrent network is unfolded in time through the interval [t_0, t_1] to obtain a feedforward network. Figure 3.2 displays the unfolded network of Figure 3.1 from time t_0 through time t_1. Unfolding the network in time allows one to regard the problem of training a recurrent neural network as a corresponding problem of training a feedforward network with duplicated weights. To compute the gradient of each weight in the recurrent network, one simply computes the gradient with respect to the corresponding duplicated weights in the unrolled network and adds them up. Thus, the problem of computing the gradient information in the recurrent network reduces to the problem of computing the gradient in the feedforward network, to which we may apply the standard back-propagation algorithm.

Figure 3.2 The unfolded network of the two-neuron recurrent network architecture from time t_0 through time t_1.
The algorithm begins at the last time step. The error signal for the output node k at time t = t_1 can be calculated as

\delta_k(t_1) = f'(a_k(t_1)) e_k(t_1).   (3.7)

The error signal of the hidden nodes can be calculated according to the standard back-propagation learning algorithm. Proceeding to the earlier time steps, there will be the injected error from the desired target d_k(t) into the output node k. Thus, not only are the error signals determined by a backward pass through the unrolled network, but the injected errors are also taken into account in the reverse order. The error signal of output node k at time \tau becomes

\delta_k(\tau) = f'(a_k(\tau)) ( e_k(\tau) + \sum_{j \in U} w_{jk} \delta_j(\tau + 1) ),   (3.8)

where t_0 < \tau < t_1. Figure 3.3 shows a schematic representation of this process. Once the back-propagation computation has been performed down to time t_0 + 1, the gradient is computed by

\partial E_{total}(t_0, t_1) / \partial w_{ki} = - \sum_{\tau = t_0 + 1}^{t_1} \delta_k(\tau) y_i(\tau - 1).   (3.9)
If the algorithm is implemented properly, then only \Theta(n_A) calculations are required to calculate the necessary partial derivatives for each time step. \Theta is used to describe the exact order of magnitude, and n_A is the number of adjustable weights. Note that for the case of fully connected networks the number of weights is the number of nodes squared. If the number of nodes is denoted by n, then the above complexity reduces to the familiar equation

\Theta(n^2 L),   (3.10)

where L denotes the length of the training sequence.
Figure 3.3 A schematic representation of the back-propagation through time learning algorithm.
Several variations on BPTT have been proposed. Those algorithms try to approximate the true gradient information based on the argument that back-propagation information to layers that are far from the output layer is not useful. A natural approximation to the real gradient
information is obtained by truncating the backward propagation of information to a fixed
number of prior time steps. We call those algorithms Truncated BPTT. With h representing the number of propagation steps, the complexity can be reduced to \Theta(n^2 h) (Cleeremans et al.,
1989a; Elman, 1990; Williams and Peng, 1990; Williams and Zipser, 1990). A detailed
discussion of learning algorithms and different variations is presented in Chapter 11.
3.3 DELAY NETWORKS WITH FEEDBACK: NARX NETWORKS
In a linear discrete system, one model that describes the single-input, single-output, shift-
invariant system has the following form:
y(t) = \sum_{k=1}^{p} \alpha_k y(t - k) + \sum_{k=0}^{q} \beta_k u(t - k),   (3.11)

where y(t) and u(t) are the output and input of the model, respectively, and \alpha_k and \beta_k are
constant coefficients. The linear model consists of one tapped delay memory of order q
associated with the input and one tapped delay memory of order p associated with the output.
In signal processing, this model is commonly called an infiniteimpulse response(IIR) filter. In
time-series modeling, the model is referred to as an autoregressive moving-average (ARMA)
model if u(t) represents the white noise input. Another common term is the pole-zero model
or finite parameter rational model.
The linear model can offer only very limited capability. Most real-world applications
involve nonlinear temporal dynamic models. In order to model nonlinear behavior, the linear
model of Eq. 3.11 can be extended to the nonlinear recursive form:
y(t) = f( u(t - D_u), ..., u(t - 1), u(t), y(t - D_y), ..., y(t - 1) ).   (3.12)

The function f is a nonlinear function. Equation 3.12 shows an important class of the discrete
time nonlinear system, which is referred to as the Nonlinear AutoRegressive with eXogenous inputs (NARX) model1 (Chen et al., 1990; Leontaritis and Billings, 1985; Ljung, 1987; Su and McAvoy, 1991; Su et al., 1992). The input-output relationship of Eq. 3.12 depends on
the nonlinear function f. In reality, f is generally very complex, and the knowledge of the
function is often not available. Multilayer perceptrons (MLPs) can uniformly approximate any continuous nonlinear
function to an arbitrary degree of accuracy (Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989; Irie and Miyake, 1988). Therefore, an obvious approach to implement systems described by Eq. 3.12 is to approximate the nonlinear function f by using a multilayer perceptron. Specifically, the network will be described as
y(t) = \Psi( u(t - D_u), ..., u(t - 1), u(t), y(t - D_y), ..., y(t - 1) ),   (3.13)

where \Psi is the nonlinear mapping function performed by an MLP as shown in Figure 3.4. D_u and D_y are the input and output orders. These neural network architectures are called NARX recurrent neural networks (Chen et al., 1990; Narendra and Parthasarathy, 1990).
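The following sketch shows one way a NARX network of this form can be simulated: an MLP reads the current input, the delayed inputs, and the delayed outputs, and each new output is pushed back into the output delay line. The memory orders, layer sizes, random weights, and input signal are illustrative assumptions.

# A minimal sketch of NARX recurrent dynamics with small input and output delay lines.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(6)
D_u, D_y, n_hid = 2, 2, 6
W_hid = rng.normal(scale=0.5, size=(n_hid, (D_u + 1) + D_y))
W_out = rng.normal(scale=0.5, size=(1, n_hid))

u_taps = np.zeros(D_u + 1)                           # u(t), u(t-1), ..., u(t-D_u)
y_taps = np.zeros(D_y)                               # y(t-1), ..., y(t-D_y)
for t, u in enumerate(np.sin(0.5 * np.arange(8))):   # an example input signal
    u_taps = np.concatenate([[u], u_taps[:-1]])
    y = sigmoid(W_out @ sigmoid(W_hid @ np.concatenate([u_taps, y_taps])))[0]
    y_taps = np.concatenate([[y], y_taps[:-1]])      # feed the output back
    print(t, round(float(y), 3))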
It has been demonstrated that NARX recurrent neural architectures are well suited for modeling various nonlinear systems such as heat exchangers (Chen et al., 1990), waste water treatment plants (Su and McAvoy, 1991; Su et al., 1992), catalytic reforming systems in a petroleum refinery (Su et al., 1992), nonlinear oscillations associated with multi-legged locomotion in biological systems (Venkataraman, 1994), time series (Connor et al., 1992), and various artificial nonlinear systems (Chen et al., 1990; Narendra and Parthasarathy, 1990;
Qin et al., 1992). Furthermore, it has recently been reported that gradient-descent learning is more effective in NARX networks than in other recurrent neural network architectures with "hidden states" when applied to problems including grammatical inference and nonlinear system identification. Typically, NARX networks converge much faster and generalize better than do other networks (Horne and Giles, 1995).

1The terminology in the literature is conflicting. We chose the term NARX based on the references.

Figure 3.4 A NARX network with input memory of order 2 and output memory of order 2.
Siegelmann et al. (1997) have proven that NARX recurrent networks are at least as computationally strong as fully connected networks; with a finite number of nodes and memories, they are at least as powerful as Turing machines and thus universal computation devices. Although the feedback is limited, one can use NARX networks, in theory, rather than conventional recurrent networks without any computational loss.
Narendra and Parthasarathy (1990) have proposed several variations on the NARX networks:
y(t) = \sum_{i=1}^{D_u} \beta_i u(t - i) + f( y(t - 1), ..., y(t - D_y) )   (3.14)

y(t) = g( u(t - 1), ..., u(t - D_u) ) + \sum_{i=1}^{D_y} \alpha_i y(t - i)   (3.15)

y(t) = g( u(t - 1), ..., u(t - D_u) ) + f( y(t - 1), ..., y(t - D_y) )   (3.16)
The nonlinear functions f and g can be implemented by MLPs as shown in Figure 3.5.
Figure 3.5 A variation of the NARX network. TDL denotes the tapped delay line. The functions f and g can be simple linear functions or nonlinear functions implemented by MLPs.
When the output-memory order of NARX networks is zero, another special form of NARX networks is the time delay neural network (TDNN) as displayed in Figure 3.6, which
is simply a tapped delay line followed by the MLP. TDNN can be seen as a generalization of the classic autoregressive (AR) model used in time-series analysis. TDNN implements the following function:

y(t) = \Psi( u(t - D_u), ..., u(t - 1), u(t) ).   (3.17)

Figure 3.6 A TDNN network with input memory of order 3.
TDNN is not a recurrent neural network since there is no feedback and it preserves its dynamic properties by unfolding the input sequence over time. By simply converting the temporal sequence into a static pattern, TDNN is able to treat a temporal problem as a high-dimensional problem. In embedding theory (Ott et al., 1994), the input memory order is called the embedding dimension. Another variation of the TDNN is the gamma network (de Vries and Principe, 1992), in which the tapped delay line is simply replaced with a series of gamma memory elements.
3.4 LONG-TERM DEPENDENCIES IN NARX NETWORKS
Recurrent Neural Networks (RNNs) are capable of representing arbitrary nonlinear dynamical systems (Siegelmann and Sontag, 1995; Sontag, 1992). However, learning simple behaviors can be quite difficult when using gradient descent. Recently, it has been demonstrated that at least part of this difficulty can be attributed to long-term dependencies, that is, when the desired
output at time T depends on inputs presented at times t ≪ T (see also Chapter 14). This was
noted by Mozer (1992), who reports that recurrent networks are able to learn short-term musical structure when using gradient-based methods but had difficulty in capturing global behavior. These ideas were recently formalized by Bengio et al. (1994), who show that if a system is to robustly latch information, then the fraction of the gradient due to information n time steps in the past approaches zero as n becomes large. Similar results were independently discovered by Hochreiter (1991a).
In this section, we discuss an architectural approach to dealing with long-term dependencies. We focus on NARX networks. The results in this section show why gradient-descent learning is more effective in NARX networks than in conventional recurrent networks.
3.4.1 Vanishing Gradients and Long-Term Dependencies
Bengio et al. (1994) have explained the difficulty of learning long-term dependencies. They argue that for many practical applications the goal is for the network to robustly latch
information; that is, the network must be able to store information for a long period of time in
the presence of noise. More specifically, they argue that latching information is accomplished
when the states of the network stay within the vicinity of a hyperbolic attractor, and robustness
to noise is accomplished if the states of the network are contained in the reduced attracting
set of that attractor (i.e., if the eigenvalues of the Jacobian are contained within the unit circle).
When a system robustly latches information, the fraction of the gradient due to information
n time steps in the past approaches zero as n becomes large. This effect is called the problem
of the vanishing gradient.
A recurrent neural network can be described in the form

x(t + 1) = f(x(t), u(t); w)        (3.18)
y(t) = g(x(t)),        (3.19)
where x, u, y, and w are column vectors representing the states, inputs, outputs, and weights of the network, respectively. Almost any recurrent neural network architecture can be expressed in this form (Nerrand et al., 1993), where f and g depend on the specific architecture. For example, in simple first-order recurrent neural networks, f would be a sigmoid of a weighted sum of the values x(t) and u(t), and g would simply select one of the states as output.
We define u_p(t), t = 1 ... T to be an input sequence of length T for the network (for simplicity we shall assume that all sequences are of the same length) and y_p(T) to be the output of the network for that input sequence.
In what follows, we derive the gradient-descent learning algorithm in a matrix-vector
format that is slightly more compact than deriving it expressly in terms of partial derivatives,
and highlights the role of the Jacobian in the derivation.
Gradient-descent learning is typically based on minimizing the sum-of-squared error
cost function
C = \frac{1}{2} \sum_p (y_p(T) - d_p)' (y_p(T) - d_p),        (3.20)
where d_p is the desired (or target) output for the pth pattern,² and y' denotes transposition of a vector y. Gradient descent is an algorithm that iteratively updates the weights in proportion to the gradient

\Delta w = -\eta \, (\nabla_w C)',        (3.21)
where η is a learning rate and ∇_w is the row vector operator

\nabla_w = \left[ \frac{\partial}{\partial w_1} \;\; \frac{\partial}{\partial w_2} \;\; \cdots \;\; \frac{\partial}{\partial w_n} \right].        (3.22)
By using the Chain Rule, the gradient can be expanded as

\nabla_w C = \sum_p (y_p(T) - d_p)' \, \nabla_{x(T)} y_p(T) \, \nabla_w x(T).        (3.23)
We can expand this further by assuming that the weights at different time indices are independent and computing the partial gradient with respect to these weights, which is the methodology used to derive algorithms such as back propagation through time (BPTT) (Rumelhart et al., 1986; Williams and Zipser, 1995). The total gradient is then equal to the sum of these partial
gradients. Specifically,
\nabla_w C = \sum_p (y_p(T) - d_p)' \, \nabla_{x(T)} y_p(T) \left[ \sum_{\tau=1}^{T} \nabla_{w(\tau)} x(T) \right].        (3.24)
2We deal only with problems in which the target output is presented at the end of the sequence.
Section3.4 • Long-Term Dependencies in NARXNetworks
35
Another application of the Chain Rule to Eq. 3.24 gives

\nabla_w C = \sum_p (y_p(T) - d_p)' \, \nabla_{x(T)} y_p(T) \left[ \sum_{\tau=1}^{T} J_x(T, T - \tau) \, \nabla_{w(\tau)} x(\tau) \right],        (3.25)

where J_x(T, T − τ) = ∇_{x(τ)} x(T) denotes the Jacobian of (3.18) expanded over T − τ time steps.
Bengio et al. (1994) showed that if the network satisfies their definition of robustly latching information, that is, if the Jacobian at each time step has all of its eigenvalues inside the unit circle, then J_x(T, n) is an exponentially decreasing function of n, so that lim_{n→∞} J_x(T, n) = 0.
This implies that the portion of ∇_w C due to information at times τ ≪ T is insignificant when compared to the portion at times near T. This effect is called the problem of vanishing gradient, or forgetting behavior (Frasconi et al., 1993). Bengio et al. claimed that the problem of vanishing gradients is the essential reason why gradient-descent methods are not sufficiently powerful to discover a relationship between target outputs and inputs that occur at a much earlier time, which they termed the problem of long-term dependencies.
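The claim is easy to check numerically. The Python sketch below is a toy example with an arbitrary constant per-step Jacobian rescaled to spectral radius 0.9; it shows the norm of J_x(T, n) collapsing as n grows. The matrix size and radius are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))  # force all eigenvalues inside the unit circle
J = np.eye(4)
for n in range(1, 61):
    J = A @ J                              # J_x(T, n) for a constant per-step Jacobian
    if n in (1, 10, 30, 60):
        print(n, np.linalg.norm(J, 2))     # decays roughly like 0.9**n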
3.4.2 The Effect of Delay Buffer on Vanishing Gradients
An intuitive reason why output delays can help long-term dependencies can be under-
stood by considering how gradients are calculated using BPTT. BPTT involves two phases:
unfolding the network in time and back propagating the error through the unfolded network.
When a NARX network is unfolded in time, the output delays will appear as jump-ahead con-
nections in the unfolded network. Intuitively, these jump-ahead connections provide a shorter
path for propagating gradient information, thus reducing the sensitivity of the network to long-
term dependencies. However, this intuitive reason is only valid if the total gradient through
these jump-ahead pathways is greater than the gradient through the layer-to-layer pathways.
It is possible to derive analytical results for some simple problems to show that NARX
networks are indeed less sensitive to long-term dependencies. Here we give one such example,
which is based on the latching problem described in Bengio et al. (1994).
Consider the simple one-node autonomous recurrent network described by
x(t) = \tanh(w x(t - 1))        (3.26)
where w = 1.25, which has two stable fixed points at ±0.710 and one unstable fixed point at
zero. The following one-node, autonomous NARX network (no internal inputs)
x(t) = \tanh(w_1 x(t - 1) + w_2 x(t - 2) + \cdots + w_D x(t - D))        (3.27)
with D output delays has the same fixed points as long as \sum_{i=1}^{D} w_i = w. Assume the state of the network has reached equilibrium at the positive stable fixed point.
In this case we can derive the exact gradient. For simplicity, we only consider the Jacobian J(t, n) = ∂x(t)/∂x(t − n), which will be a component of the gradient ∇_w C. Figure 3.7 shows plots of J(t, n) with respect to n for D = 1, D = 3, and D = 6 with w_i = w/D. These plots show
that the effect of output delays is to flatten out the curves and place more emphasis on the
gradient due to terms farther in the past. Note that the gradient contribution due to short-term
dependencies is deemphasized. In Figure 3.8 we show plots of the ratio
\frac{J(t, n)}{\sum_{\tau=1}^{t} J(t, \tau)},        (3.28)
which illustrates the percentage of the total gradient that can be attributed to information n time steps in the past. These plots show that this percentage is larger for the network with output delays, and thus one would expect that these networks would be able to more effectively deal with long-term dependencies.
Figure 3.7 Plots of J(t, n) (the Jacobian expanded over n time steps) as a function of n for different numbers of output delays (D = 1, D = 3, and D = 6). Although all of these curves can be bounded above by a function that decays exponentially, the values of J(t, n) decay at a slower rate as D becomes larger.
Figure 3.8 Plots of the ratio J(t, n) / \sum_{\tau=1}^{t} J(t, \tau) as a function of n for different numbers of output delays (D = 1, D = 3, and D = 6). These curves show that the portion of the gradient due to information n time steps in the past is a greater fraction of the overall gradient as D gets larger.
3.5 EXPERIMENTAL RESULTS: THE LATCHING PROBLEM
We explored a slight modification of the latching problem described in Bengio et al. (1994). This problem is a minimal task designed as a test that must necessarily be passed in order for a network to latch information robustly. Bengio et al. described the task as one in which "the input values are to be learned." Here we give an alternative description of the problem, which
allows us to re-express the problem as one in which only weights are to be learned.
In this task there are three inputs: u_1(t), u_2(t), and one noise input e(t), and a single output y(t). Both u_1(t) and u_2(t) are zero for all times t > 1. At time t = 1, u_1(1) = 1 and u_2(1) = 0 for samples from class one, and u_1(1) = 0 and u_2(1) = 1 for samples from class two. The noise input e(t) is given by
e(t) = \begin{cases} 0 & \text{if } t \le L \\ U(-b, b) & \text{if } L < t \le T, \end{cases}
where U(-b, b) are samples drawn uniformly from [-b, b]. The network used to solve this problem is a NARX network consisting of a single neuron,
x(t) = \tanh\left( w x(t - 1) + h_1^1 u_1(t) + \cdots + h_L^1 u_1(t - L + 1) + h_1^2 u_2(t) + \cdots + h_L^2 u_2(t - L + 1) + e(t) \right)        (3.29)

where the parameters h_i^j are adjustable and the recurrent weight w is fixed. In our simulations,
we used L = 3. The network is shown in Figure 3.9. Note that the problem as stated is identical to the problem stated by Bengio et al. except that here we are using uniform instead of Gaussian
Figure 3.9 The network used for the latching problem.
random noise. In our formulation, the h_i^j values are weights that are connected to tapped delay lines on the input of the network, while Bengio et al. described them as learnable input values. In our simulation, we fixed the recurrent feedback weight to w = 1.25, which gives the
autonomous network two stable fixed points at ±0.710 and one unstable fixed point at zero. It can be shown (Frasconi et al., 1995a) that the network is robust to perturbations in the range [−0.155, 0.155]. Thus, the uniform noise in e(t) was restricted to this range. Note that if Gaussian random noise is used, then there is some nonzero probability that the error will be outside of this range regardless of the variance. Thus, it is possible for the network to fail to correctly classify all training values due to Gaussian noise. We felt that such effects should be avoided in order to exclusively test the sensitivity of the network to long-term dependencies; that is why we chose to use uniform noise instead.
For each simulation, we generated 30 strings from each class, each with a different e(t). The initial values of h_i^j for each simulation were also chosen from the same distribution that defines e(t). For strings from class one, a target value of 0.8 was chosen, and −0.8 was chosen for class two.
The network was run using a simple BPTT algorithm with a learning rate of 0.1 for a maximum of 100 epochs. (We found that the network converged to some solution consistently within a few dozen epochs.) If the absolute error between the output of the network and the target value was less than 0.6 on all strings, the simulation was terminated and determined to be successful. If the simulation exceeded 100 epochs and did not correctly classify all strings, then the simulation was ruled a failure.
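A forward pass of this setup can be sketched in Python as follows. The seed, horizon, and initial h values are illustrative, and the BPTT training loop is omitted; before training, the final state does not yet reliably encode the class.

import numpy as np

rng = np.random.default_rng(4)
L, T, b, w = 3, 50, 0.155, 1.25
h1 = rng.uniform(-b, b, L)                       # initial weights on the u1 delay line
h2 = rng.uniform(-b, b, L)                       # initial weights on the u2 delay line

def run(cls):
    u1, u2 = np.zeros(T + 1), np.zeros(T + 1)
    (u1 if cls == 1 else u2)[1] = 1.0            # class marker at t = 1
    x = 0.0
    for t in range(1, T + 1):
        e = rng.uniform(-b, b) if t > L else 0.0 # noise only after time L
        drive = sum(h1[i] * u1[t - i] for i in range(L) if t - i >= 0)
        drive += sum(h2[i] * u2[t - i] for i in range(L) if t - i >= 0)
        x = np.tanh(w * x + drive + e)
    return x                                     # training drives this toward +/-0.8

print(run(1), run(2))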
We varied T from 20 to 200 in increments of 2. For each value of T, we ran 50 simula-
tions. We then modified the architecture to include output delays of order D = 3 and D = 6
with all of the recurrent weights w_i = w/D. Figure 3.10 shows a plot of the percentage of those runs that were successful for each case. It is clear from these plots that the NARX networks become increasingly less sensitive to long-term dependencies as the output order is increased.
Figure 3.10 Plots of percentage of successful simulations as a function of T, the length of the input strings, for different numbers of output delays (D = 1, D = 3, and D = 6).
3.6 CONCLUSION
Dynamical neural networks are useful because they are capable of modeling nonlinear system dynamics from input/output behavior. Their memory elements are the key components that preserve the states of dynamical networks and give dynamical networks the capability of handling time-dependent problems. A rigorous understanding of how the memory elements affect the performance is very important.
In this chapter we discuss delay networks with feedback: NARX recurrent networks. NARX networks are powerful dynamical models. Although the feedback of NARX networks is restricted in the sense that it comes only from the output neurons, in theory one can use NARX networks, rather than other dynamical networks with fully connected memory elements, without any computational loss. Furthermore, the computational power of NARX networks is at least as powerful as that of Turing machines (Siegelmann et al., 1997).
Not only are NARX networks computationally powerful in theory, but they are able to "learn" rather large finite-state machines in practice (Giles et al., 1995b). Typically, NARX networks converge much faster and generalize better than do other networks with "hidden states" when training with the gradient-descent learning algorithm (Horne and Giles, 1995).
The efficiency of gradient-descent learning in NARX networks can be attributed to the embedded memory. The intuitive explanation for this behavior is that the output delays are manifested as jump-ahead connections in the unfolded network often used to describe algorithms like back propagation through time. Therefore, these jump-ahead connections can propagate gradient information more effectively.
Due to this property, we consider an architectural approach to the problem of learning long-term dependencies, that is, when the desired outputs depend on inputs presented at times far in the past, which has been shown to be difficult for gradient-based algorithms. Although NARX networks do not solve the problem completely, they at least make it easier to discover long-term dependencies.
We present an analytical example showing that gradient information does not vanish as quickly in NARX networks as it does in networks without output memory buffers when the state of the network rests at a fixed point. We also present an experimental result showing that NARX networks can outperform networks with single-delay buffers on problems involving long-term dependencies.
Similar results could be obtained for other networks. In particular, we hypothesize that any network that uses tapped delay feedback (Back and Tsoi, 1991; Leighton and Conrath, 1991) will demonstrate improved performance on long-term dependency problems.
Chapter 4
MEMORY KERNELS
Ah Chung Tsoi, Andrew Back, Jose Principe, and Mike Mozer
4.1 INTRODUCTION
In this chapter, we consider a class of recurrent neural networks which, unlike the Elman network considered in Chapter 2, have no global feedback paths. Instead, they have feedback paths within the individual modules. To be precise, this class of recurrent neural networks consists of two subsystems:
1. A short-term memory. For simplicity, we will assume the memory content to be linear, time invariant, and causal.
2. A generic predictor-in our case, we use a feedforward neural network predictor. This consists of only nonlinear elements, for example, sigmoidal function, hyperbolic tangent function, together with associated weights. This may consist of no hidden layers or multiple hidden layers. Note that in this case, the generic predictor is time invariant and consists of only constant parameters (weights) and nonlinear elements. Here, we could have assumed the generic predictor to be linear. However, in this case, it is a special case of the linear systems.
Throughout this chapter, we refer to this structure as a time-invariant nonlinear short-term memory architecture (TINSTMA). This structure is also known as a memory kernel (Mozer, 1994). This architecture is illustrated in Figure 4.1.
In designing a neural network architecture, for example, a TINSTMA, we need to consider the following issues (Mozer, 1994):
1. Architecture: What is the internal structure of the short-term memory and the predictor? This involves a discussion of the number of layers, the number of units, the pattern of connectivity among the units, and the activation function of the units.
2. Training: Given a set of training patterns, how should the internal parameters of the model, in this case the weights, be adjusted? Each training example consists of {x(t), y*(t)}, where x(t) is a scalar input and y*(t) is a scalar target; t = 1, 2, ..., N. Training implies the adjustment of the weights such that the outputs of the neural network y(t) come as close as possible to the target function y*(t), usually in the least squares sense.¹
3. Representation: How should the input sequence be represented in the short-term memory? The nature and quantity of information that must go into the memory are domain dependent.
The three issues are intricately related and can be viewed as different perspectives on the same underlying problems. A given choice of representation may demand a certain neural network architecture or a particular type of learning algorithm to compute the representation.
¹Note that here we assume a single-input single-output system. In general, it is possible to consider the multiple-input multiple-output system. However, the notations would be far more complex. In view of simplicity, we have opted to consider a single-input single-output system in this chapter.
Figure 4.1 A diagram showing the class of recurrent neural network under study in this chapter. The short-term memory is assumed to be linear, time invariant, and causal. The generic predictor is assumed to consist of only nonlinear elements, together with constant weights.
Conversely, a given architecture or a learning algorithm may restrict the class of representations that can be accommodated. In this chapter, we address all three issues.
The arrangement of this chapter is as follows: in Section 4.2 we describe the architectures of various possible memory kernels. The structure of the generic predictor is the classic multilayer feedforward neural network. In Section 4.3, we consider a generic representation of the TINSTMA. In Section 4.4, we discuss the capability of the short-term memory structure to represent a given linear impulse response, while in Section 4.5 we consider the capability of the TINSTMA structure to represent a nonlinear impulse response. In Section 4.6 we describe the training algorithms for these types of networks. In Section 4.7, we give a detailed worked example to illustrate the concepts considered in this chapter. In Section 4.8, we conclude the chapter with some comments.
4.2 DIFFERENT TYPES OF MEMORY KERNELS
In this chapter, a memory kernel (Mozer, 1989) is defined as consisting of two parts: a short-term memory subsystem and a generic predictor, as shown in Figure 4.2. The short-term memory is assumed to have a limited memory of the past. (This will be made more specific in Section 4.5.) The generic predictor subsystem is assumed to be static; that is, it contains only constant weights and possibly nonlinear memoryless elements.
Figure 4.2 A generic form of short-term memory. Each module has the same structural form but not necessarily the same parameters (weights).
There are a number of different types of memory kernels. One way to classify them is to consider the short-term memory subsystem of a memory kernel to be made up of a number of modules. Each module is identical in structural form but not necessarily identical in terms of weights. Consequently, it is logical to classify memory kernels as follows:
1. Each modular component is of the same structural form but could have different weights (parameters).
2. Each modular component is of the same structural form as well as having the same weights (parameters).
The generic structural form as shown in Figure 4.2 takes in an input x(t) and passes it through ℓ modules. Each module is of the same structural form, but the parameters in each
module may be different. The variables x_i(t), i = 1, 2, ..., ℓ are all available to be fed into
the generic predictor as shown in Figure 4.3.
Figure 4.3 The neural network architecture considered in this chapter.
In Figure 4.3, we have only shown a single-layer feedforward neural network predictor, for ease of illustration purposes. In general, the modules can be a multilayer feedforward neural network.
In the following subsections, we consider some of the possible structural forms in turn.
4.2.1 The Parameters in Each Modular Component Are Different
In this subsection, we only consider the cases when each modular component in the memory kernel has the same structural form but allows different parameters in each module.
Tapped Delay Lines. This is probably the simplest of all the memory kernels. It consists of a series of modules with no adjustable parameters. Hence, these types of memory kernels can be classified either under this section or in Section 4.2.2. A generic form of this type of memory kernel is shown in Figure 4.4.
Figure 4.4 A tapped delay line memory model. q^{-1} denotes a unit time delay.
The variables x(t − i), i = 0, 1, 2, ..., ℓ are available to be fed into the generic predictor. Define x_i(t) ≜ x(t − i). Furthermore, define q^{-1} x_i(t) ≜ x_i(t − 1); q is commonly called the shift operator.
Moving-Average Filter. The moving-average memory kernel can be defined as follows:

x_i(t) = \frac{1 - \mu_i}{1 - \mu_i q^{-1}} x_{i-1}(t)        (4.1)

where μ_i is a real constant value in the range [0, 1]. Also we have: x_0(t) = x(t). This is a simple moving-average filter, alternatively called an exponential filter. Dependent on the value
of μ_i, the output of the module x_i(t) is a linear combination of the input to the module x_{i-1}(t) and the past output of the module x_i(t − 1). If μ_i is close to 1, the output of the module x_i(t) will consist mainly of the past value of the output of the module x_i(t − 1). On the other hand, if μ_i is close to 0, then the output x_i(t) will consist mainly of the input to the module x_{i-1}(t).
Gamma Filters. A gamma filter (de Vries and Principe, 1992) module can be defined as follows:

x_i(t) = \frac{\gamma_i q^{-1}}{1 - (1 - \gamma_i) q^{-1}} x_{i-1}(t)        (4.2)

where γ_i is a real constant value in the range [0, 2], for a stable module.
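In the time domain, Eq. 4.1 unrolls to x_i(t) = μ_i x_i(t − 1) + (1 − μ_i) x_{i−1}(t), and Eq. 4.2 to x_i(t) = (1 − γ_i) x_i(t − 1) + γ_i x_{i−1}(t − 1). The Python sketch below runs a small gamma cascade on an impulse; the tap count and γ values are arbitrary, and the moving-average case follows by swapping in the first recursion.

import numpy as np

def gamma_memory(x, gammas):
    # Run a cascade of gamma modules (Eq. 4.2); returns one row per tap x_i.
    taps = [np.asarray(x, dtype=float)]
    for g in gammas:
        prev, out = taps[-1], np.zeros(len(x))
        for t in range(1, len(x)):
            # x_i(t) = (1 - gamma_i) x_i(t - 1) + gamma_i x_{i-1}(t - 1)
            out[t] = (1 - g) * out[t - 1] + g * prev[t - 1]
        taps.append(out)
    return np.stack(taps)

x = np.zeros(12); x[0] = 1.0                      # impulse input
print(gamma_memory(x, [0.5, 0.5, 0.5]).round(3))  # dispersed, delayed responses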
Laguerre Filters. The Laguerre filter (Wahlberg, 1991) module is defined as follows:

x_i(t) = \frac{q^{-1} - \zeta_i}{1 - \zeta_i q^{-1}} x_{i-1}(t)        (4.3)

for i = 0, 1, 2, ..., ℓ − 1, with x_0(t) = x(t). The output is given in the following form, for i = 0, 1, 2, ..., ℓ:

z_i(t) = \frac{\sqrt{1 - \zeta_i^2}}{1 - \zeta_i q^{-1}} x_i(t).        (4.4)

In this case, ζ_i, i = 0, 1, 2, ..., ℓ are real constant values. The signal z_i(t) is an intermediate output signal in the chain from the input x(t) to the output y(t). The output y(t) is a linear combination of the intermediate output signals z_i(t), i = 1, 2, ..., ℓ. A Laguerre filter module is shown in Figure 4.5.
Figure 4.5 A Laguerre filter module.
General Laguerre Filters. In the subsections of 4.2.1, the parameters are all restricted to real constant values. In other words, they can represent decaying modes in the given impulse response. However, often the impulse response may consist of complex modes, that is, oscillatory behavior. In this case, it is more appropriate to allow the parameters to take complex values. For realizability purposes, they must occur in complex conjugate pairs. It is quite simple to generalize the gamma filter to second-order forms. It is also possible to generalize the Laguerre filter to a second-order Laguerre filter, commonly known as the Kautz filter (Wahlberg, 1994). It is more common to consider a direct generalization of the first-order Laguerre filter, as this gives explicit links to the Laguerre filter. This generalization will be given in this section.
In general, we can write the general Laguerre filter (Ninness and Gustafsson, 1997) as follows:

x_i(t) = \frac{q^{-1} - \bar{\zeta}_i}{1 - \zeta_i q^{-1}} x_{i-1}(t).        (4.5)

The complex conjugate of ζ_i is denoted by ζ̄_i. Where complex values occur, they must occur in complex conjugate pairs. The output of each individual module is obtained using the following equation:

z_i(t) = \frac{\sqrt{1 - |\zeta_i|^2}}{q^d (1 - \zeta_i q^{-1})} x_i(t).        (4.6)

The value of d can be either 0 or 1. It is noted that if d = 0, and if the coefficients ζ_i are all real, then the generalized Laguerre filter module is the same as the Laguerre filter module. The formulation shown here allows us to consider only a first-order form without needing to consider a second-order form, for example, the Kautz filter module.

The generalized Laguerre filter module is different from the Laguerre filter module in that it accepts complex values (occurring in conjugate pairs) for the coefficients ζ_i.
IIR/FIR Filters. In general, it is possible to decompose the infinite impulse response (IIR) filter into a series of modules by factorizing the IIR filter transfer function G(q) as follows:

y(t) = G(q) x(t) = \prod_{i=1}^{\ell} k_i \frac{q^{-1} - a_i}{1 - b_i q^{-1}} x(t)        (4.7)

where a_i or b_i can have real or complex values. If they have complex values, then they must occur in complex conjugate pairs. k_i is a real constant.

It is simple to observe that the first-order form given in Eq. 4.7 encompasses the moving-average filter, the gamma filter, and the Laguerre filter shown in the subsections of 4.2.1.

It is also noted that the filters considered in this subsection are all linear filters. From
linear control theory, it is well known that all these filters are related to one another by a similarity transformation (Kailath, 1980). This implies that they are the same from an input-output point of view. In other words, the various filters are different ways to express the same input-output relationship. Which representation is "good" depends on the application at hand, as judged by the user.
4.2.2 Identical Parameters for Each Modular Component
In this subsection, we consider the situation in which all the modules have the same weights. In Section 4.2.1 it is possible to set the weights in each module to be the same. For example, in the gamma filter model, we can set γ_i in each module to be the same, that is, γ_i = γ, i = 1, 2, ..., ℓ, where γ is a constant. This has the advantage of reducing the number of parameters to be estimated to a minimum. Second, it also allows for easy implementation, as each module can be fabricated, for example, in Very Large Scale Integration (VLSI) technology or in gate array technology.
Modular Recurrent Form. This is a general model for modular recurrent form. It is based on the intuition that all the previously considered modular forms can be generalized to be of a form shown in Figure 4.6.
Figure 4.6 A general modular form that encompasses the filter forms introduced in the previous section. The symbol I in the feedback path denotes the identity matrix. In general, it is possible to have more than one delayed feedback, viz., x_i(t − 1), x_i(t − 2), ..., x_i(t − k), where k is an integer, of the output x_i(t) to the input of the module. The symbol q^{-1} I is a convenient device to denote this possibility.
For example, Figure 4.7 shows the details of a Laguerre filter module when we express it in the generalized framework as shown in Figure 4.6.
Figure 4.7 Illustration of how a Laguerre filter may be expressed in the general modular form. The connections inside the submodule of "constant weights" as indicated in Figure 4.6 contain weights (refer to the equation for the Laguerre filter). The notation Σ is interpreted as summing the inputs together and passing the results through a nonlinear element.
4.3 GENERIC REPRESENTATION OF A MEMORY KERNEL
In this section, we consider a generic representation for this class of recurrent neural networks, viz., the TINSTMA. In order to progress forward, we have chosen the generic predictor as a feedforward neural network with a single hidden layer. For simplicity, we assume that there is only one input x(t) and one output y(t) to the system.²
The generic predictor can be represented as follows:
y(t) = \sum_{j=0}^{\ell} c_j y_j(t)        (4.8)

and

y_i(t) = f\left( \sum_{j=0}^{\ell} b_{ij} \xi_j(t) \right)        (4.9)

where c_j, j = 0, 1, ..., ℓ are constants. In addition, b_{ij}, j = 0, 1, 2, ..., ℓ; i = 0, 1, 2, ..., ℓ are constants. The nonlinear function f(·) can be a sigmoidal function or a hyperbolic tangent function. The signals ξ_j(t) are the outputs of the linear section, which can be any one of the
2The general case of multiple inputs and multiple outputs and more than one single hidden layer can be written quite easily from the descriptions given in this section.
four short-term memory kernels, viz., moving-average filter, gamma filter, Laguerre filter, or the generalized Laguerre filter.
There are two different situations:
1. When the outputs of the individual intermediate signals x_i(t) of the short-term memory kernel are directly observable at the inputs of the generic predictor. We will refer to this situation as the state observable situation. Both the moving-average filter and the gamma filter fall into this category.
2. When the outputs of the individual intermediate signals x_i(t) of the short-term memory kernel are not directly observable at the inputs of the generic predictor, but instead a transformation of these intermediate signals z_i(t) is observable. We refer to this situation as the output observable situation. Both the Laguerre filter and the generalized Laguerre filter fall into this category.
In the following subsections, we present these two situations in more detail.
4.3.1 State Observable Situation
In this case, the intermediate variables x_i(t) are directly observable at the inputs of the generic predictor. The whole architecture can be described by Eqs. 4.8 and 4.9. The signals ξ_i(t) are described by

\xi_i = \begin{cases} \frac{1 - \mu_i}{1 - \mu_i q^{-1}} x_{i-1} & \text{moving-average filter} \\ \frac{\gamma_i q^{-1}}{1 - (1 - \gamma_i) q^{-1}} x_{i-1} & \text{gamma filter} \end{cases}        (4.10)
4.3.2 Output Observable Situation
In the output observable situation, the outputs x_i(t) are not directly observable at the inputs to the generic predictor. The signals ξ_i(t) in Eq. 4.9 are formed by the linear combination of the observable output of each module. For the Laguerre filter, we have:

\xi_i = \frac{\sqrt{1 - \zeta_i^2}}{1 - \zeta_i q^{-1}} x_i        (4.11)

and

x_i = \frac{q^{-1} - \zeta_i}{1 - \zeta_i q^{-1}} x_{i-1}.        (4.12)

For the generalized Laguerre filter, we have:

\xi_i = \frac{\sqrt{1 - |\zeta_i|^2}}{q^d (1 - \zeta_i q^{-1})} x_i        (4.13)

where d = 0 or 1. The complex conjugate of ζ_i is denoted by ζ̄_i. The signal x_i is defined as follows:

x_i = \frac{q^{-1} - \bar{\zeta}_i}{1 - \zeta_i q^{-1}} x_{i-1}.        (4.14)
4.4 BASIS ISSUES
All the modules discussed previously are linear modules. Hence, a question arises: Given an impulse response, how well can such an impulse response be approximated by these modules? This question can be studied using the concept of completeness in functional analysis (de Vries and Principe, 1992).
4.4.1 Completeness
Completeness is a functional theoretic concept. In our context, it is connected with approximation by continuous (or discrete) functions. Moreover, we use the symbol C^∞ to denote the collection of functions that have partial derivatives of all orders. Furthermore, the space C_c consists of all functions f : R^n → R such that f is continuous and f vanishes outside some compact set; that is, each such function has compact support. Then, we have the following theorem (Ninness and Gustafsson, 1997).
Theorem 1 C_c is dense in L^1; that is, for any f ∈ L^1 and any ε > 0, there exists g ∈ C_c such that ‖g − f‖_1 < ε.
Hence, the way to prove that the collection of functions is complete is to prove that it is dense in that space.
4.4.2 Gamma Filter
De Vries and Principe (1992) cited the following theorem:

Theorem 2 The system g_k(t) for k = 1, 2, ..., ℓ is dense in L^2[0, ∞).

Proof The proof is based on the completeness of the Laguerre polynomials and can be found in Szego, Theorem 5.7.1 (Szego, 1939, p. 108).

This theorem is equivalent to the statement that the approximation of arbitrary w(t) by a linear combination of gamma kernels can be made as close as desired; that is, for all w(t) in L^2[0, ∞), for every ε > 0, there exists a set of parameters ℓ and w_k, k = 1, 2, ..., ℓ, such that

\int_0^{\infty} \left( w(t) - \sum_{k=1}^{\ell} w_k g_k(t) \right)^2 dt < \epsilon.
4.4.3 Laguerre Filters
As indicated previously, this is proved in Szego.

4.4.4 General Conditions
In this case, we need to draw on the concept of Hardy space (Ninness and Gustafsson, 1997). We need to associate the causal impulse response with a Hardy space of frequency responses, for this allows us to draw on the results on Blaschke products. By definition, f is in L^2([−π, π]) if

\int_{-\pi}^{\pi} |f(\omega)|^2 d\omega < \infty.

This can be written equivalently as a Fourier series f(ω) = \sum_{k=-\infty}^{\infty} c_k e^{j\omega k}. Since f can be written in terms of e^{jω}, f can be considered as defined on the circle T so that f ∈ L^2(T). Now, if the coefficients c_k happen to be the impulse response of a discrete time system, then f(−ω) is the frequency response of the system. Furthermore, the Fourier coefficients {c_k} can be calculated from f(ω) as

c_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(\omega) e^{-j\omega k} d\omega.

The Hardy space H^2(T) is formally defined as the closed subspace of L^2(T) on which the elements have Fourier coefficients that are zero for negative k.
Then we have the following theorem (Ninness and Gustafsson, 1997):

Theorem 3 Consider the basis functions B_k(e^{jω}) defined by

B_n(q) = q^d \, \frac{\sqrt{1 - |\zeta_n|^2}}{q - \zeta_n} \prod_{k=0}^{n-1} \frac{1 - \bar{\zeta}_k q}{q - \zeta_k}

with d = 0 or 1. Then span{B_k(e^{-jω})} is dense in H^2(T) if and only if

\sum_{k=0}^{\infty} (1 - |\zeta_k|) = \infty.
In other words, any sequence in the signal space can be made up as a possibly complex linear combination of the impulse responses of the {B_n}. If any of these are complex valued, then the complex conjugate pairs are used instead. Thus, any impulse response can be approximated by the impulse responses of the functions {B_n}.
When the parameters in the modules are identical, it is not known if the basis formed will be complete.
4.5 UNIVERSAL APPROXIMATION THEOREM
In Section 4.4, we considered the capability of the memory structure, without the generic predictor, to approximate linear impulse responses. However, since we are dealing with nonlinear maps, the question arises: what type of responses does the TINSTMA approximate? In other words, if we are given a particular type of response, possibly generated by nonlinear systems, can we approximate this response arbitrarily well using the TINSTMA?

It came as no surprise that, indeed, the TINSTMA structure can approximate arbitrarily well a particular class of nonlinear maps, known as "myopic" maps (Sandberg and Xu, 1997). Myopic maps are a special class of nonlinear input-output maps which, roughly speaking, can be considered to be of the fading memory type. In other words, if we have an input u, the response of the system at points remote from the initial time instance is always relatively independent of the value of u.
Instead of presenting the theorem formally, we will give a paraphrase of the main theorem presented in Sandberg and Xu (1997), with some specialization to our current discussions.
Consider a myopic map, which is assumed to be shift invariant and causal. The input to the myopic map is assumed to be on a compact subset. Then, this map can be approximated
arbitrarily closely by the TINSTMA, where there are n₀ branches of the linear time-invariant causal dynamical system in the memory structure and the nonlinear elements are assumed to be continuous. A linear causal dynamical system is one in which future outputs of the system depend on past history. This is in contrast to systems, commonly known as noncausal dynamical systems, with current outputs dependent on future inputs of the system. This assumption is important in that it allows us to consider only systems whose outputs are dependent on past history. It is a common assumption to be placed on definitions in order to exclude possible situations when the current output depends on future inputs of the system.
Simple examples of the myopic map include the memory kernel architecture as depicted in Figure 4.1, when the short-term memory is one of the following types: a gamma filter, a Laguerre filter, or an IIR/FIR filter; and the generic predictor is a single-layer perceptron.
This theorem was anticipated in Back and Tsoi (1992), where an informal argument was offered for the validity of the theorem. However, as it turns out, the proof is far more difficult, as shown in Sandberg and Xu (1997).
Note that this result is not constructive; that is, there is no information concerning how to find the number of branches n₀, or on how to find the parameters of the linear dynamical
system. What it says is that such a structure exists; that is, it is an existence theorem. It is not known if the universal approximation theorem still holds if the parameters in the modules are all identical.
In the following section, we indicate training algorithms that can estimate the parameters of the system, if we assume that n₀ is determined a priori.
4.6 TRAINING ALGORITHMS
We assume that we are given the following training data set: {x(t), y*(t); t = 1,2, ... , N}. The question is how to adjust the weights in the recurrent neural network so that the following error function is minimized.
E = \frac{1}{2} \sum_{t=1}^{N} (y^*(t) - y(t))^2.        (4.15)
Now, we can expand the cost function E using a Taylor series as follows:

E \approx E + \frac{\partial E}{\partial w} \Delta w + O(\Delta w^2)        (4.16)

where Δw is a small perturbation of the weight w, and O(Δw²) denotes terms of higher order than Δw. If we let Δw = −η ∂E/∂w, where η > 0, then the cost function is guaranteed to decrease in value with each iteration. Hence, we can update the weight as follows:

w_{k+1} = w_k - \eta \left. \frac{\partial E}{\partial w} \right|_{w = w_k}        (4.17)
This is called the simple gradient-descent method.
Furthermore, there are two classes of training algorithms: batch processing mode and
instantaneous processing mode. The instantaneous processing mode operates on an individual
training pattern and updates the weights on each presentation of the training pattern. The
batch processing mode operates on the whole training set before updating the weights. In other words, the batch processing mode waits for the presentation of the entire training set
before it updates the weights. In the following section, we present only the batch processing
mode. The instantaneous processing mode can be obtained by a simple modification of the
batch processing mode.
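The two schedules are easiest to compare on a toy scalar model. The Python sketch below fits y = w x by both modes; the data, learning rate, and the averaging in the batch update are illustrative choices, not part of this chapter's derivation.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y_star = 3.0 * x                              # targets generated by a known weight
eta = 0.05

w = 0.0                                       # batch mode: one update per full sweep
for epoch in range(100):
    grad = -np.sum((y_star - w * x) * x)      # dE/dw for E = 1/2 sum (y* - y)^2
    w -= eta * grad / len(x)                  # averaged over patterns for stability
print("batch:", w)

w = 0.0                                       # instantaneous mode: update per pattern
for epoch in range(100):
    for t in range(len(x)):
        w += eta * (y_star[t] - w * x[t]) * x[t]
print("instantaneous:", w)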
From this very simple derivation, we see that to adjust the parameters in the recurrent
neural network, we need to compute the derivative of the cost E with respect to the weights in
the network.
We will consider each subclass of architectures-the state observable form and the output
observable form individually in the derivation of the training algorithm.
In both the state observable form and the output observable form, the training algorithms
for the weights c_i or b_{ij} are the same. These are the weights associated with the multilayer
perceptron (generic predictor) part of the memory kernel. We consider the training algorithm
of these generic predictor weights next.
4.6.1 Training Algorithm for the Generic Predictor Weights
Consider the weight c_i, i = 0, 1, ..., ℓ. The derivative of E with respect to c_i can be obtained as follows:

\frac{\partial E}{\partial c_i} = -\sum_{t=1}^{N} e(t) \, y_i(t)        (4.18)
where e(t) = y*(t) − y(t), the error signal. The training algorithm for the instantaneous processing mode³ can be obtained by modifying Eq. 4.18 as follows:

\frac{\partial E}{\partial c_i} = -e(t) \, y_i(t).        (4.19)

The weights c_i can be updated using the simple gradient algorithm:
c_i(m + 1) = c_i(m) - \eta \frac{\partial E}{\partial c_i}        (4.20)

where η is a learning constant. Often, η is chosen to be a decreasing constant, η < η₀/m, for the mth iterate. The symbol c_i(m) is the mth iterate of the variable c_i.
The derivative of E with respect to bj k can be obtained as follows:
8E
~ N
~l 8Yi(t)
- - = - .!....Je(t) .!....JCi--
8bj k
1=1
i=O 8bj k
(4.21)
and

\frac{\partial y_i(t)}{\partial b_{jk}} = \begin{cases} f'(a_i) \, \xi_k & i = j \\ 0 & i \ne j \end{cases}        (4.22)
where a_i = \sum_{j=0}^{\ell} b_{ij} \xi_j(t); the ξ_j(t) are different for the state observable form and the output observable form, respectively. The symbol f'(·) denotes the derivative of f(·). The weights b_{jk} can be updated using the following algorithm:
b_{jk}(m + 1) = b_{jk}(m) - \eta \frac{\partial E}{\partial b_{jk}}.        (4.23)
4.6.2 State Observable Form
We will give a derivation of the training algorithm for the state observable form, which includes the moving-average filter and the gamma filter modules as special cases. For the moving-average filter module, the unknown weights are μ_i, i = 1, 2, ..., ℓ. For the gamma filter module, the unknown weights are γ_i, i = 1, 2, ..., ℓ.

We will consider the moving-average case first. The derivative of E with respect to the weight coefficients μ_k, k = 1, 2, ..., ℓ can be obtained as follows:
\frac{\partial E}{\partial \mu_k} = -\sum_{t=1}^{N} e(t) \sum_{i=0}^{\ell} c_i \frac{\partial y_i(t)}{\partial \mu_k}.        (4.24)
We have:

\frac{\partial y_i(t)}{\partial \mu_k} = f'(a_i) \sum_{j=0}^{\ell} b_{ij} \frac{\partial x_j}{\partial \mu_k}.        (4.25)
Moreover, we can differentiate x_j with respect to μ_k using simple rules of differentiation. This results in the following:

\frac{\partial x_j}{\partial \mu_k} = \begin{cases} 0 & j < k \\ \frac{q^{-1} - 1}{(1 - \mu_k q^{-1})^2} \, x_{j-1} & j = k \\ \left[ \prod_{m=k+1}^{j} \frac{1 - \mu_m}{1 - \mu_m q^{-1}} \right] \frac{\partial x_k}{\partial \mu_k} & j > k. \end{cases}        (4.26)
3We will only show this modification here to indicate how such modification can be made. We will not show this for the other formulas in the following sections.
For the gamma filter module, we can derive the training algorithm for the weights γ_i, i = 1, 2, ..., ℓ:

\frac{\partial E}{\partial \gamma_k} = -\sum_{t=1}^{N} e(t) \sum_{i=0}^{\ell} c_i \frac{\partial y_i(t)}{\partial \gamma_k}.        (4.27)

We have:

\frac{\partial y_i}{\partial \gamma_k} = f'(a_i) \sum_{j=0}^{\ell} b_{ij} \frac{\partial x_j(t)}{\partial \gamma_k}.        (4.28)

The derivative of x_j with respect to γ_k can be obtained as follows:

\frac{\partial x_j}{\partial \gamma_k} = \begin{cases} 0 & j < k \\ \frac{q^{-1}(1 - q^{-1})}{(1 - (1 - \gamma_k) q^{-1})^2} \, x_{j-1} & j = k \\ \left[ \prod_{m=k+1}^{j} \frac{\gamma_m q^{-1}}{1 - (1 - \gamma_m) q^{-1}} \right] \frac{\partial x_k}{\partial \gamma_k} & j > k. \end{cases}        (4.29)
4.6.3 Output Observable Form
We present a derivation of the training algorithm for the output observable form, which includes the Laguerre filter and the generalized Laguerre filter module as special cases. We consider the Laguerre filter module and the generalized Laguerre filter module individually.
For the Laguerre filter module, we wish to find the derivative of E with respect to the weights ζ_k. This can be performed as follows:

\frac{\partial E}{\partial \zeta_k} = -\sum_{t=1}^{N} e(t) \sum_{i=0}^{\ell} c_i \frac{\partial y_i}{\partial \zeta_k}.        (4.30)

We also have:

\frac{\partial y_i}{\partial \zeta_k} = f'(a_i) \sum_{j=0}^{\ell} b_{ij} \frac{\partial \xi_j}{\partial \zeta_k}.        (4.31)

Moreover, we have:

(4.32)

Furthermore, we have

\frac{\partial x_j}{\partial \zeta_k} = \begin{cases} 0 & j < k \\ \frac{q^{-2} - 1}{(1 - \zeta_k q^{-1})^2} \, x_{j-1} & j = k \\ \left[ \prod_{m=k+1}^{j} \frac{q^{-1} - \zeta_m}{1 - \zeta_m q^{-1}} \right] \frac{\partial x_k}{\partial \zeta_k} & j > k. \end{cases}        (4.33)
The derivation of the training algorithm for the general Laguerre filter module would be more complicated due to the complex weights ζ_k. Hence, we have decided to omit the derivation here.
The training algorithms for the weights of the filter modules (see Eqs. 4.26, 4.29, and 4.33, respectively) all involve the operator q^{-1}. For computational purposes, we need to convert
Eqs. 4.26, 4.29, and 4.33 into general formulas that can be readily programmed. This would
be rather difficult to do from these equations directly. A more convenient way is to use what is
commonly known as state-space representation to represent the memory kernels. While this
can be carried out, it will require substantial modifications of what is being considered in this
chapter. It will also require the development of a theory using a slightly more complicated set of symbols. Hence, we will not show it here.⁴
4.7 ILLUSTRATIVE EXAMPLE
In this section, we illustrate the concepts conveyed in this chapter using a detailed worked example.
Assume that we have three branches, that is, ℓ = 3 in the short-term memory subsystem. Furthermore, we assume that there is a single hidden layer with three hidden neurons for the generic predictor subsystem.⁵ Then, the TINSTMA, as shown in Figure 4.8, can be represented by the following equations:
Figure 4.8 The TINSTMA used in the worked example.
Generic Predictor.

y(t) = \sum_{i=1}^{3} c_i y_i(t)        (4.34)

where y_i, i = 1, 2, 3 are the outputs of the respective hidden layer neurons:

y_1(t) = f(a_1(t))        (4.35)
y_2(t) = f(a_2(t))        (4.36)
y_3(t) = f(a_3(t))        (4.37)
⁴In the Illustrative Example section (see Section 4.7), we give a hint of how this can be done in the conversion of Eq. 4.49 to Eq. 4.51.
5Note that we have assumed that we have only one input x(t) and one output y(t).
where f (.) is assumed to be a sigmoidal function for simplicity.
a_i(t) = \sum_{j=1}^{3} b_{ij} \xi_j(t) + b_{i0}        (4.38)

where ξ_j(t), j = 1, 2, 3 are the outputs from the short-term memory subsection.
Short-Term Memory. Here for simplicity we consider the gamma module. Since ℓ = 3, we have:

\xi_j(t) = \frac{\gamma_j q^{-1}}{1 - (1 - \gamma_j) q^{-1}} \xi_{j-1}(t)        (4.39)

for j = 2, 3, and ξ_1 = x(t). The unknown parameters are c_i, i = 1, 2, 3; b_{ij}, i, j = 1, 2, 3; b_{i0}, i = 1, 2, 3; and γ_j, j = 2, 3. These unknown parameters can be trained using a set of input output examples {y*(t), x(t) : t = 1, 2, ..., N}. The error function is given as follows:
J = \frac{1}{2} \sum_{t=1}^{N} (y^*(t) - y(t))^2        (4.40)
The training algorithms for the generic predictor subsystem can be found quite easily. For completeness' sake, they are given as follows:
c_i(m + 1) = c_i(m) - \eta \frac{\partial J}{\partial c_i}        (4.41)
where
\frac{\partial J}{\partial c_i} = -\sum_{t=1}^{N} e(t) \, y_i(t)        (4.42)
where e(t) ≜ y*(t) − y(t). Similarly, we have

b_{ij}(m + 1) = b_{ij}(m) - \eta \frac{\partial J}{\partial b_{ij}}        (4.43)
where

\frac{\partial J}{\partial b_{ij}} = -\sum_{t=1}^{N} e(t) \, c_i f'(a_i) \, \xi_j(t)        (4.44)

where f'(·) is the derivative of f(·).
\frac{\partial J}{\partial b_{j0}} = -\sum_{t=1}^{N} e(t) \, c_j f'(a_j).        (4.45)

The training algorithm for γ_j can be obtained as follows:

\gamma_j(m + 1) = \gamma_j(m) - \eta \frac{\partial J}{\partial \gamma_j}        (4.46)
where

\frac{\partial J}{\partial \gamma_j} = -\sum_{t=1}^{N} e(t) \sum_{l=1}^{3} c_l \frac{\partial y_l(t)}{\partial \gamma_j}        (4.47)

= -\sum_{t=1}^{N} e(t) \sum_{l=1}^{3} c_l f'(a_l) \frac{\partial a_l}{\partial \gamma_j}        (4.48)

= -\sum_{t=1}^{N} e(t) \sum_{l=1}^{3} c_l f'(a_l) \sum_{k=1}^{3} b_{lk} \frac{\partial \xi_k}{\partial \gamma_j}.        (4.49)
Because of the structure of the short-term memory, we have the following situation:

\frac{\partial \xi_1}{\partial \gamma_2} = 0

\frac{\partial \xi_1}{\partial \gamma_3} = 0

\frac{\partial \xi_2}{\partial \gamma_2} = \frac{q^{-1}(1 - q^{-1})}{(1 - (1 - \gamma_2) q^{-1})^2} \, \xi_1

\frac{\partial \xi_2}{\partial \gamma_3} = 0

\frac{\partial \xi_3}{\partial \gamma_2} = \frac{\gamma_3 q^{-1}}{1 - (1 - \gamma_3) q^{-1}} \frac{\partial \xi_2}{\partial \gamma_2}

\frac{\partial \xi_3}{\partial \gamma_3} = \frac{q^{-1}(1 - q^{-1})}{(1 - (1 - \gamma_3) q^{-1})^2} \, \xi_2.
Note that for practical implementations, Eq. 4.49 can be rewritten as follows:

\frac{\partial J}{\partial \gamma_j} = -\sum_{t=1}^{N} e(t) \sum_{l=1}^{3} c_l f'(a_l) \sum_{k=1}^{3} b_{lk} \xi_{kj}(t)        (4.50)

where ξ_{kj}(t) is defined as
\xi_{kj}(t) = \frac{\partial \xi_k(t)}{\partial \gamma_j}        (4.51)

for j = 2, 3 and k = 1, 2, 3. Equation 4.51 can be implemented by noting that it is a linear subsystem, and we can express, for example, ξ_22(t), as follows:
\xi_{22}(t) = 2(1 - \gamma_2) \, \xi_{22}(t - 1) - (1 - \gamma_2)^2 \, \xi_{22}(t - 2) + x(t) - x(t - 1).        (4.52)
If we assume that ξ_22(0) is a random value between 0 and 1, since x(t) is the given input, the values of ξ_22 can be computed. In a similar manner, we can obtain values for the ξ_{kj}(t). Hence, Eq. 4.49 can be implemented.
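A direct Python transcription of this procedure might look as follows; the input signal and γ_2 are example values, and a fixed ξ_22(0) = 0.5 replaces the random initial value for repeatability.

import numpy as np

def xi22(x, gamma2, xi0=0.5):
    # Iterate Eq. 4.52:
    # xi_22(t) = 2(1-g) xi_22(t-1) - (1-g)^2 xi_22(t-2) + x(t) - x(t-1)
    a = 1.0 - gamma2
    out = np.zeros(len(x))
    out[0] = xi0                                   # initial value in [0, 1]
    for t in range(1, len(x)):
        two_back = out[t - 2] if t >= 2 else 0.0   # assume xi_22(-1) = 0
        out[t] = 2 * a * out[t - 1] - a**2 * two_back + x[t] - x[t - 1]
    return out

x = np.sin(0.2 * np.arange(50))                    # an arbitrary input signal
print(xi22(x, gamma2=0.3)[:10].round(4))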
From this derivation, it is quite simple to see that the training algorithm for the generic predictor subsystem is the same as that obtained in the classic multilayer perceptron architecture, and the training algorithm for the short-term memory is the same as that obtained for linear subsystems. It is also quite simple to see that the above analogy can be extended to consider both the state observable form and the output observable form.
It is also noted that the training algorithm derived is a batch mode update training algorithm in that we need to run through the entire training data set before adjusting the parameters. We can also obtain an instantaneous update mode version (stochastic update) by dropping the \sum_{t=1}^{N} term in the training algorithm. For example, in the instantaneous update
mode, Eq. 4.49 becomes:

\frac{\partial J}{\partial \gamma_j} = -e(t) \sum_{l=1}^{3} c_l f'(a_l) \sum_{k=1}^{3} b_{lk} \xi_{kj}(t).        (4.53)
The choice between instantaneous update or batch update algorithms depends on the nature of the underlying system. For example, if the underlying system is varying slowly, then the batch update algorithm tends to work better than the instantaneous update algorithm. On the other hand, if the underlying system is varying rapidly, then the instantaneous update algorithm is the preferred one.
4.8 CONCLUSION
In this chapter, we have considered various memory kernels. We first examined the concept of classifying memory kernels by classifying the short-term memory subsystem. We then briefly described various types of short-term memory kernels, for example, moving-average filter modules, gamma filter modules, Laguerre filter modules, and generalized Laguerre filter modules. Moreover, we considered two theoretical issues-the capability of the short-term memory subsystem to represent linear systems using the concept of completeness, and the capability of the memory kernel to represent nonlinear systems using universal approximation theorems. We then derived a first-order training algorithm that is very much in the spirit of the error back-propagation methods used in the classic multilayer perceptron architecture. We also described briefly how second-order training algorithms can be derived, using only the first-order derivatives.
One may ask the question: Given the various short-term memory modules, which one would be the best in practice in modeling nonlinear system responses?
To answer this question, we must recognize that there must be some underlying assumptions about the nonlinear system response. If we can assume that it is generated by a nonlinear system with a myopic memory structure, that is, the memory is limited and decaying, then by the theorem of Sandberg and Xu (1997), we can be confident that the system response can be modeled using the memory kernel structure described in this chapter. It is also known that the generalized Laguerre filter modules will perform the best, as they are the most general (Ninness and Gustafsson, 1997). However, it is also known that the error surface is full of local minima. Hence, the training process may be quite time consuming. On the other hand, it is known that the gamma filter modules are quite easy to train. Hence, in practice it is a compromise between the efficiency in training the architecture and the modeling accuracy. If we are only interested in fast training, then the gamma filter modules can be deployed. On the other hand, if we are interested in modeling the system response, then it is preferable to deploy the Laguerre filter module.
It is further noted that the memory kernel is a special case of the IIR MLP as advocated in Back and Tsoi (1991) in that the IIR MLP permits dynamics in the generic predictor subsystem as well. In this case, we may allow filter modules in the generic predictor subsystem. But in general it is quite difficult to train an IIR MLP, for the error surface contains many local minima.
Chapter 5
DYNAMICAL SYSTEMS AND ITERATED FUNCTION SYSTEMS
John F. Kolen
5.1 INTRODUCTION
In previous chapters, we have identified problems amenable to dynamical recurrent network (DRN) solutions. These problems require a processing mechanism capable of responding to its current input under the existing context as specified by the input history. A simple case is the parity calculator that sees a sequence of bits and must output a value of one if the history of inputs contains an odd number of bits whose value is one. The DRN capable of implementing the parity calculator must be able to store some representation of its input history. We often refer to this information as the internal state of the mechanism. The recurrent network uses this state information to modulate its response to the current input.
One part of this book examines the routes, strategies, and schemes that DRNs use to manipulate state information. Before we can address the nature of DRNs, we must first build a solid mathematical vocabulary from which we can describe the dynamic behaviors of our subjects. To this end, this chapter provides an introduction to dynamical systems, iterated function systems, and symbolic dynamics. The concepts described here will provide the foundations necessary for understanding the dynamical properties of DRNs.
This chapter is divided into three sections. The first section addresses the concept of dynamical systems, specifically, the taxonomy of systems and the categorization of their behaviors. The next section focuses on a specific form of dynamical system known as the iterated function system (IFS). The IFS framework will prove invaluable for understanding the state spaces employed by DRNs. Finally, we introduce the notion of symbolic dynamics. The remainder of the chapters in this part heavily exploit this alternative method of describing the behavior of a given dynamical system.
5.2 DYNAMICAL SYSTEMS
Computation is often taken for granted. We can point to a variety of artifacts that perform computation: personal computers, calculators, coin sorters, and so on. Yet it is unclear how one goes about recognizing computation that occurs in natural systems such as cells, neural circuits, and whole organisms. For instance, an integrated circuit with transistors, capacitors, and resistors can be configured to implement the logical NAND behavior. Another organization of similar parts, however, could yield an operational amplifier. Specifically, how can we identify computation in artifacts where the goal of computing played no guiding role in their development? What logical functions are performed in vivo in a cortical neuron, a flowing stream, or a glass prism? Each has a particular causal organization, but it is unclear that any teleological raison d'être exists for any of these examples.
A similar line of questioning could be applied to DRNs. In order to understand their source of information processing, it is necessary to understand the underlying mechanisms on which they are based. Before directly answering this question, it will be necessary to review
the explanatory framework known as dynamical systems theory. Dynamical systems theory attempts to describe systems in motion. By now, it should be clear that the difference between feedforward networks and DRNs is one of static versus dynamic behavior, behavior that changes over time. When studying real systems, such as swinging pendulums, dripping faucets, or circadian rhythms, motion can depend on both observable and unobservable variables. According to Casti (1988) "the study of natural systems begins and ends with the specification of observables belonging to such a system, and a characterization of the manner in which they are linked."¹ In short, dynamical systems theory is the mathematical characterization of those observables that allows us to describe, classify, and predict behaviors of natural and artificial systems. Computation, as we will see shortly, is but one aspect of dynamical systems theory and is best understood in conjunction with other, equally viable, ways of describing dynamic behavior.
5.2.1 Definition
The first order of business is to define the topic of study. A dynamical system consists of two parts, a state and a dynamic. A state describes the current condition of the system as a vector of observable quantities, like position, mass, and temperature. The collection of possible states is known as the state space (or phase space) of the system. State captures those features, visible or hidden, that a researcher considers relevant to determining the motion of the dynamical system. In one sense, the state delimits the system, separating the interesting from the noninteresting. The dynamic describes how the system's state evolves over time. The dynamical systems described in this chapter are deterministic in that the underlying dynamic is deterministic; for each state in state space, the dynamic uniquely specifies a single next state or direction of motion. While many nondeterministic systems exist (e.g., nondeterministic finite automata), this overview specifically avoids the issues surrounding dynamical systems whose behavior is not uniquely determined by the current state.
A simple example of a deterministic dynamical system is a freely moving pendulum. The state of the pendulum describes the current angular position, θ, and current velocity, ω. The dynamic is in the form of a differential equation describing the relationship between the rates of change of these two quantities, θ̇ and ω̇, as shown in Eq. 5.1. The pendulum also exemplifies the class of continuous-space dynamical systems. Notice that our description of the pendulum makes two assumptions. First, the state space is continuous. Second, the equations of motion are continuous in time:
ω = dθ/dt.    (5.1)
Continuity of time or state, however, is not a necessary feature of a dynamical system. Ecological population models, for instance, are a family of discrete time dynamical systems. The logistic model in Eq. 5.2 is often used for this purpose (May, 1976). The value x_t is the proportion of animals alive at time t with respect to some maximal population. The iterated mapping characterizes the growth and demise of populations, such as all the rabbits that inhabit a particular stretch of forest. When the population is small, growth is fed by available resources. When the population is large, growth is hampered by lack of resources and overcrowding. The parabolic bump of Eq. 5.2 captures these two extremes. The parameter η controls the growth rate of the population. Environmental factors such as birth rate and availability of food for reproduction determine the value of this parameter:
x_{t+1} = ηx_t(1 - x_t).    (5.2)
¹Page 2.
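The logistic model is easy to experiment with directly. The following Python sketch (not part of the original text; the growth rates and starting proportion are arbitrary illustrative choices) iterates Eq. 5.2 and prints the last few states for several values of η:

```python
# Iterate the logistic model of Eq. 5.2 for a few illustrative growth rates.
def logistic_step(x, eta):
    """One generation of the logistic model: x_{t+1} = eta * x_t * (1 - x_t)."""
    return eta * x * (1.0 - x)

def trajectory(x0, eta, steps):
    """Return the sequence of population proportions x_0, x_1, ..., x_steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_step(xs[-1], eta))
    return xs

if __name__ == "__main__":
    for eta in (0.5, 2.8, 3.2, 4.0):        # illustrative growth rates
        xs = trajectory(0.2, eta, 50)
        print(f"eta={eta}: last few states {[round(x, 4) for x in xs[-4:]]}")
```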
Taken together, the examples of the pendulum and the rabbit population model illustrate that dynamical systems can be characterized by two independent features: the granularity of the state space and the time course of the dynamic. The result is a taxonomy of dynamical systems, presented in Table 5.1. Both time and space can be described either in terms of continuous or discrete quantities.
TABLE 5.1 A Taxonomy of Dynamical Systems

                              State Space
Time              Continuous                  Discrete
Continuous        Differential equations      Spin glasses
Discrete          Iterated mappings           Automata
The Nature of Time. In discrete time models, the time course of the system has been abstracted from the model. Consider the sequence of locations obtained by measuring the pendulum with a strobe light. The pendulum's bob will appear to jump from position to position. If the frequency of the strobe matches the natural frequency of the pendulum, the bob will eventually appear to stand still. In this case, the strobing is periodic; that is, a constant amount of time separates each strobe. The state-space sampling need not be periodic, however; the strobe could be driven by an arbitrary time sampling. There is therefore no way to objectively compare durations in these models because they preserve ordering but not duration. It is not surprising, then, to find that many attempts to produce rate invariance in discrete time systems have suffered difficulties (Large and Kolen, 1994).
Continuous time models preserve event orderings as well. Like its discrete counterpart, the continuous time reference frame allows one to assign strict event orderings. This ordering, however, is augmented with duration information in the case of discrete events. A musician may play a single note for 0.574 second, for example. With continuous time models, when events are not discrete, the volume of the previous note may increase over time.
One may think that objective comparisons of durations can be made in continuous time models. Time is an abstraction in these dynamical systems, too. In many cases, system time may be identical to wall clock time. Time in other models, however, may be a nonuniform scaling of the true time course of events, whatever that may be. Consider a spaceship under acceleration traveling at relativistic speeds where an astronaut observes a pendulum swinging on another ship moving at a constant velocity. As the observer's velocity increases, the pendulum appears to swing faster and faster.
A more down-to-earth example can be constructed with two pendulums. If we measure the state of the first pendulum relative to the phase of the second pendulum, then one of two situations can occur. If the second pendulum is oscillating at a fixed rate, then the time course of the second will flow at the same rate as the ubiquitous wall clock. The second pendulum, however, could be experiencing a variety of forces (e.g., friction) that will eventually drain the system's energy and stop its motion. Thus, the oscillation period of the second pendulum will decrease over time and change the time scale of our observations of the first pendulum. The first pendulum will appear to accelerate in this reference frame.
The Nature of Space. A similar distinction can be made with continuous versus discrete space models with regard to the granularity of state space. Regardless of the nature of the state space, the topology of the space is the most important feature. A topology, simply speaking, is the collection of neighborhood relationships that describe a space. Consider the integer
number line. The point specified by the number five is adjacent to four and six. Thus, the set {4, 5, 6} is a neighborhood of five.
Discrete space systems have arbitrary topologies induced by the dynamic. That is, the dynamic determines which elements of state space are next to each other and which are far apart. Consider a discrete space described by a k-bit binary number. In one case, a discrete time dynamic might change only one bit per time step. The state 110010 is then next to 110011 and far from 001101. The dynamic, on the other hand, may be able to arbitrarily flip some, all, or none of the bits. In this situation, all states are potentially neighbors of each other.
Continuous-state space systems, however, take advantage of the underlying continuity of the state space. Their dynamics are described in terms of continuous functions that maintain the neighborhood relations of the state spaces.
Taxonomy. These differences combine to produce four unique system classifications. Before delving into definitions and examples of dynamical system properties, we will visit a few examples of models in the taxonomy.
Physical systems, like the pendulum, are often modeled continuously in both time and state. A rock thrown in the air will follow a fixed trajectory given an initial velocity vector. At any time during its flight, it is possible to determine the location of the rock because there is a closed-form solution that maps time to location. This function is the result of integrating the underlying differential equations that determine the ballistic behavior of the rock. The motion of most physical systems can be described by similar equations. These equations of motion are integrated over time in order to predict or observe their behavior. Integration can be computationally expensive when simulations are performed on a computer.
Thus, many researchers may try to simplify this calculation by ignoring the continuous nature of time. We often don't need to know every location of the rock between one and two seconds after launch. The result of this simplification is a system with discrete time dynamics over a continuous-state space, that is, an iterated map such as the logistic map in Eq. 5.2 or the strobed pendulum described above.
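The trade-off between a continuous-time description and its discretization can be made concrete with a small sketch. In the example below (not from the original text; the initial speed, time step, and the simple Euler scheme are all illustrative choices), the closed-form height of a thrown rock is compared with a fixed-step discretization of the same equations of motion:

```python
G = 9.81  # gravitational acceleration in m/s^2

def height_closed_form(v0, t):
    """Closed-form height of a rock thrown straight up with speed v0 (continuous-time solution)."""
    return v0 * t - 0.5 * G * t * t

def heights_euler(v0, dt, steps):
    """Discrete-time view: sample the equations of motion every dt seconds with Euler steps."""
    h, v, samples = 0.0, v0, []
    for _ in range(steps):
        h += v * dt        # position updated from the current velocity
        v -= G * dt        # velocity updated from gravity
        samples.append(h)
    return samples

if __name__ == "__main__":
    v0, dt = 20.0, 0.5     # illustrative initial speed and sampling interval
    for k, h in enumerate(heights_euler(v0, dt, 8), start=1):
        print(f"t={k * dt:4.1f} s   discrete={h:8.3f}   closed form={height_closed_form(v0, k * dt):8.3f}")
```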
Continuous time and discrete states occur in the modeling of certain physical processes such as spin glasses and percolation systems. These systems are characterized by large numbers of objects with a discrete property, such as magnetic orientation (i.e., north/south poles), that can change over time (Kirkpatrick et al., 1983). Another example of this system class is a forest fire from the trees' standpoint (Schroeder, 1991). The state space can be described as the condition of the individual trees in the forest: normal, burning, extinguished. The dynamic captures realities such as that a burning tree can ignite a neighboring normal tree but not an extinguished one. Temporal continuity enters the problem in that ignition can occur at arbitrary times.
This leaves us with only one classification, discrete time and space. The traditional automata of computer science, finite automata, pushdown automata, and Turing machines, are examples of systems in which both time and states are discrete. A finite automaton is defined as a finite collection of configurations (the state space) and a function (the dynamic) that specifies the next configuration given the current configuration and input. A finite state space must be discrete. Since the configuration is updated synchronously, the time course of the automaton is discrete as well. Automata can also function with infinite sets of states, as in pushdown automata and Turing machines.
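Viewed this way, a finite automaton is just a lookup-table dynamic over a finite state space. A minimal sketch follows (the two-state parity machine is an illustrative choice, not an example from the text):

```python
# A finite automaton viewed as a discrete time, discrete space dynamical system.
# State space: {"even", "odd"}; dynamic: a table mapping (state, input) -> next state.
DELTA = {
    ("even", 0): "even", ("even", 1): "odd",
    ("odd", 0): "odd",   ("odd", 1): "even",
}

def run(automaton, start, inputs):
    """Synchronously apply the dynamic once per input symbol and return the state trajectory."""
    state, states = start, [start]
    for symbol in inputs:
        state = automaton[(state, symbol)]
        states.append(state)
    return states

if __name__ == "__main__":
    print(run(DELTA, "even", [1, 1, 0, 1]))   # ['even', 'odd', 'even', 'even', 'odd']
```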
In the preceding paragraphs, we have identified four classes of dynamical systems. Our next step is to understand the classes of behavior that these systems can exhibit. The definitions and descriptions presented next will work with any of the classes from our taxonomy, although some classes are more amenable to certain definitions than others. The best strategy for understanding the properties of dynamical systems is to project these properties onto different models from the taxonomy.
5.2.2 Trajectories, Transients, and Attractors
The dynamic of a system encapsulates the process of change over time. The sequence of states exhibited by a dynamical system during this evolution is called a trajectory. Even though a system can experience radically different initial trajectories, these trajectories often approach a characteristic behavior in the limit known as an attractor. An attractor is a region in state space where all trajectories converge from a larger subset of state space. The region of state space from where the system will evolve to the specified attractor is called a basin of attraction. As the system evolves toward the attractor, the initial part of the trajectory is considered a transient.
One way of finding attractor approximations is to arbitrarily select an initial condition and let the dynamic propel the state of the system through state space. After stripping away the initial transients of the trajectory, the remainder is a fairly good approximation to the underlying attractor. This approximation is perfect if the goal is to plot the attractor on some display device (or printer) since any discrepancy between the real attractor and its approximation is often lost in the finite resolution of the display device. This construction assumes, of course, that there is a single attractor. Since many multiple-attractor dynamical systems exist, the iterative method only approximates a single attractor for the basin from which the initial condition was selected.
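The recipe just described, iterate, discard the transient, and keep the rest, is only a few lines of code. The sketch below (not from the original text) applies it to the logistic map of Eq. 5.2; the burn-in length, sample size, and parameter value are arbitrary illustrative choices.

```python
def attractor_sample(dynamic, x0, burn_in=500, keep=200):
    """Approximate an attractor: iterate from x0, drop the transient, keep the tail."""
    x = x0
    for _ in range(burn_in):          # let the initial transient die away
        x = dynamic(x)
    tail = []
    for _ in range(keep):             # these points approximate the attractor
        x = dynamic(x)
        tail.append(x)
    return tail

if __name__ == "__main__":
    logistic = lambda x: 3.2 * x * (1.0 - x)          # Eq. 5.2 at an illustrative parameter
    points = attractor_sample(logistic, 0.123)
    print(sorted(set(round(p, 6) for p in points)))   # the distinct long-run states
```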
Although there are fundamental differences between discrete and continuous systems, both exhibit the same types of dynamic behavior. In the temporal limit, only four types of qualitatively different attractors, or more correctly, regimes, can be identified: fixed points, limit cycles, quasiperiodicity, and chaos. A chaotic attractor is also known as a strange attractor for historical reasons. These behaviors can be found across all combinations of discrete or continuous time and space.
Fixed Points. Many systems appear to be stuck in a particular state, such as a pendulum
bob at rest. We characterize this lack of motion as a fixed point, the simplest of the four dynamic regimes. This resting state can be the final state of a system, as in the case of the pendulum when the bob has dissipated its potential energy. Thus, a fixed point is thought of as the limit of a sequence of states resulting in a changeless state. Fixed points occur in both continuous and discrete state spaces when the dynamic fails to move the state away from its current value. Fixed points appear in two varieties: stable and unstable. Small perturbations to a stable, or attracting, fixed point result in a trajectory that returns to the original attractor after the initial transients die away. Consider the iterated map x_{t+1} = x_t/2, which has a fixed point at zero, the solution to the equation x = x/2. The fixed point is attracting since for every value ε > 0, there exists a time t₀ such that for all possible states x_t at time t > t₀, |x_t| < ε. In other words,
the distance between the system's state at time t and the fixed point will become arbitrarily small, given enough time.
Not all fixed points attract their surrounding states; some actually repel their neighbors (Figure 5.1). If the system starts in one of these states, it will remain forever in that state, hence the label "fixed." Any perturbation, however, will dislodge the system from its perch and send it rolling away. Consider the iterated map x_{t+1} = 2x_t. It has a fixed point at zero. Even though a system started at zero will stay there forever, small deviations from this point will rapidly accelerate the system state away from the fixed point. Compare this behavior with the previous mapping, x_{t+1} = x_t/2. This mapping clearly has an attracting fixed point at zero. Yet the state space trajectories of these two mappings near the origin are radically different. The difference in behavior lies in the slope of the mapping at the fixed point. If the absolute value of the slope of the mapping at the fixed point is greater than one, then the iteration will escape from the fixed point. Otherwise, when the slope is between -1 and 1, the fixed point is attracting.
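The slope criterion can be checked numerically by perturbing the fixed point at zero of the two maps just discussed and watching whether the perturbation shrinks or grows. A minimal sketch (the perturbation size and number of steps are arbitrary):

```python
def iterate(f, x0, steps):
    """Return the states visited by repeatedly applying f, starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(f(xs[-1]))
    return xs

if __name__ == "__main__":
    halving = lambda x: x / 2.0     # slope 1/2 at the fixed point: attracting
    doubling = lambda x: 2.0 * x    # slope 2 at the fixed point: repelling
    eps = 1e-6                      # small perturbation away from the fixed point at zero
    print("attracting:", iterate(halving, eps, 6))
    print("repelling: ", iterate(doubling, eps, 6))
```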
Figure 5.1 Attracting and repelling fixed points, shown at rest, under perturbation, and after perturbation.
The attractive and repelling fixed points can combine in multidimensional systems to form saddle points. These points in state space exhibit both attractive and repelling dynamics depending on the direction of approach. Saddle points occur in systems with more than one dimension. Imagine a hyperbolic energy surface in three-space. The dynamics attempt to minimize this energy in the direction of the surface gradient. Like the other fixed points, the saddle point can be found where the derivative of the surface is zero. This point is special in that along one axis a marble will roll toward the saddle point and in another axis of motion it will roll away from it, even though the marble could conceivably be balanced at the saddle point and not roll away.
Limit Cycles. While fixed points epitomize the denial of motion, limit cycles indicate a system in constant motion. The term limit cycle refers to the limit of a trajectory that periodically revisits a set of states during an oscillation. For instance, a frictionless pendulum has a periodic trajectory once it begins swinging. A computer program caught in an infinite loop is trapped in a limit cycle. The ideal pendulum endlessly cycles through a set of angle and velocity pairs determined by the initial conditions of the system, even though a real pendulum undergoes frictional forces that will eventually damp its motion until it reaches a fixed point. Similarly, the program counter endlessly cycles through a fixed sequence of instruction addresses. The process of repeating a set of state values over time can be described by the temporal relationship x_{t+p} = x_t, where p is the period of oscillation. To demonstrate some general properties of limit cycles, consider the iterated mapping f(x) = -x³. This mapping has a limit cycle of period-two that can be found by solving the equation f(f(x)) = x. This equation has three solutions: -1, 0, and 1. The solutions -1 and 1 are a cycle: the system will alternate between the two values, f(-1) = 1 and f(1) = -1. The zero solution is a subharmonic of the period-two oscillation; it has a period that is an integer divisor of the target period. In general, a cycle of period-n, and its subharmonics, for a given mapping can be found as a solution to Eq. 5.3.²
f°n(x) = x.    (5.3)
²The degree-n notation indicates nesting, or composition, of f. Thus, f°3(x) = f(f(f(x))).
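For a concrete check of Eq. 5.3, the following sketch (not from the original text) composes the example mapping f(x) = -x³ with itself and confirms that -1, 0, and 1 all satisfy f°2(x) = x, with only -1 and 1 forming a genuine period-two cycle:

```python
def f(x):
    """The example mapping f(x) = -x**3."""
    return -x ** 3

def compose(g, n):
    """Return the degree-n composition of g, as used in Eq. 5.3."""
    def nested(x):
        for _ in range(n):
            x = g(x)
        return x
    return nested

if __name__ == "__main__":
    f2 = compose(f, 2)
    for x in (-1, 0, 1):
        print(f"x={x}: f(x)={f(x)}, f(f(x))={f2(x)}, satisfies Eq. 5.3 for n=2: {f2(x) == x}")
```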
Limit cycles, like fixed points, can be either attracting or repelling, or a combination of the two. Attracting limit cycles produce oscillations that are stable to minor perturbations. Repelling limit cycles can be found where perturbations send the system away from the limit cycle in order to approach another attractor. As in the fixed-point case, the slope of the dynamic near the oscillation will determine the attracting or repelling nature of the limit cycle.
Some systems, such as the van der Pol oscillator (Abraham and Shaw, 1982), exhibit saddle effects with oscillations. The van der Pol oscillator has two attractors, a fixed point at the origin and a cycle surrounding the origin. If the oscillator is started outside of the cycle, the dynamics will eventually send the system toward the oscillation. If started on the interior of the cycle, the state of the system will evolve toward the fixed point at the origin.
Quasiperiodic Behavior. Limit cycle behavior is not restricted to a single periodicity.
Systems can possess attractors and repellors with multiple constituent periodicities. Sometimes
the behavior of such compositions can be slightly unexpected. Consider the case of the system
defined by two oscillators. In Eq. 5.4, φ measures the phase of the oscillators, and Ω is the period of oscillation.
φ₁ = sin Ω₁t
φ₂ = sin Ω₂t    (5.4)
The winding number, ω = Ω₁/Ω₂, is important for determining the behavior of the system. If ω is rational, then the system will be strictly periodic with a period of
Ω₁Ω₂.    (5.5)
Thus, two relatively prime periods combine to form a limit cycle with period equal to the product of the original periods. In Figure 5.2, the top two graphs plot Eq. 5.4 for a pair of oscillators with winding ratios of 2:3 and 2:4. Notice that the rational winding numbers produced one-dimensional periodic trajectories. Compare the trajectories of these rational winding numbers with the irrational ratio of 2:π at the bottom of Figure 5.2. When ω is irrational, the behavior of the system as a whole is quasiperiodic. Even though frequency analysis of the trajectory will reveal the underlying constituent periodicities, the system will never revisit any point in phase space. In the limit, the trajectory of the quasiperiodic system will become a dense subset of a two-dimensional area.

Figure 5.2 A demonstration comparing periodic and quasiperiodic trajectories. Axes of all graphs are φ₁ and φ₂. The winding ratios are ω = 2:3, ω = 2:4, and ω = 2:π.
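The contrast plotted in Figure 5.2 can also be seen numerically: a periodic trajectory traces a one-dimensional curve, while a quasiperiodic one gradually fills an area. The minimal sketch below (not from the original text) counts how many cells of a coarse grid the trajectory of Eq. 5.4 visits; the grid size, time step, and duration are arbitrary choices.

```python
import math

def visited_cells(omega1, omega2, dt=0.02, steps=100000, bins=50):
    """Count coarse grid cells visited by the trajectory (sin(omega1*t), sin(omega2*t)) of Eq. 5.4."""
    cells = set()
    for k in range(steps):
        t = k * dt
        x, y = math.sin(omega1 * t), math.sin(omega2 * t)
        i = min(int((x + 1.0) / 2.0 * bins), bins - 1)   # map [-1, 1] onto a bin index
        j = min(int((y + 1.0) / 2.0 * bins), bins - 1)
        cells.add((i, j))
    return len(cells)

if __name__ == "__main__":
    print("rational winding ratio 2:3,  cells visited:", visited_cells(2.0, 3.0))
    print("irrational ratio 2:pi,       cells visited:", visited_cells(2.0, math.pi))
```

A periodic trajectory revisits the same thin set of cells, so its count stays small; the quasiperiodic trajectory keeps finding new cells as it fills the square.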
Chaos and Strange Attractors. The third type of dynamic regime, and arguably the most interesting, is chaos. The trajectory of a system following a chaotic regime is aperiodic since no point in state space is ever visited twice. If the trajectory did cross itself, the deterministic nature of the evolution functions would cause the system to behave exactly as it did after the previous visit to that state. This crossing would result in an oscillation. When the chaotic motion is attracting, it is also called a strange attractor.
Chaotic attractors can only arise when a system's dynamic has nonlinear components. If the dynamic is linear, then only fixed-point and periodic behavior is possible. Linear dynamics can only stretch, compress, and rotate state space. Imagine a piece of pizza dough into which someone maliciously added a drop of blue food coloring. Stretching will only elongate the blotch of blue dough, and no amount of twirling will affect the spot. More importantly, neither of these operations will spread the food coloring to the rest of the dough. These operations, in a formal sense, map, or transform, lines in state space into lines. These operations generate state-space trajectories and produce the dynamic phenomena discussed in the previous sections. Fixed points result from the repeated application of stretching or compressing operators. In the limit, these operations reduce subsets of the state space to a single point. Limit cycles, on the other hand, originate from rotations of state space.
When the dynamic repeatedly stretches, compresses, and folds the state space, strange attractors emerge. Figure 5.3 demonstrates the effect of these operations. Stretching occurs when a dynamic scales state space in a particular direction with a scale factor greater than one. In the figure, the state space (the leftmost line) is stretched by a factor of two. The state space is then folded by way of a nonlinear transformation, like the quadratic shown in Figure 5.3. Folding is one method of reorganizing the topology of the state space, as it can create new neighborhoods. The combination of stretching and folding can have devastating effects on the original topology of the state space. Neighborhoods are stretched and eventually straddle the bend in the horseshoe. The result is that the neighborhood spreads across the entire attractor. This condition makes long-term prediction in such systems impossible: Since any measurement of position is guaranteed to be incorrect, the difference between the measured state and the true state will exponentially diverge as the system evolves. Even though detailed long-term predictions are impossible, this fact does not prevent the possibility of making detailed short-term or gross long-term predictions. The former is true because the dynamic of the system is deterministic. The latter holds because the trajectory is bounded by the extent of the attractor.

Figure 5.3 Stretching and folding of state space in the logistic mapping: the original state space x is stretched to 2x and then folded through 1 - (1 - 2x)², yielding 4x(1 - x).
One curious property of chaotic systems involves the dimensionality of state space. For continuous systems, three dimensions are necessary for chaotic behavior to emerge. One would not expect to see chaos in the trajectory of a single frictionless pendulum since its phase space is two-dimensional. The dynamic folds and stretches the state space manifold. When these manipulations occur in a two-dimensional space, they cause trajectories to intersect, which leads to limit cycle behavior when the dynamics are deterministic. In three dimensions, however, the attractor manifold can be compactly rolled into a space-filling curve that never terminates (in a fixed point) or intersects itself (a limit cycle). If the pendulum is externally driven or is actually a double pendulum, then chaos can arise in the system's attractor due to the increase in the system's degrees of freedom.
Deterministic chaos is characteristically different from random noise. Chaotic systems can be described by a set of deterministic equations, while a truly random process can only be characterized in terms of statistical properties. Random noise is assumed to originate from a stochastic process with an inherent probabilistic element (e.g., Markov processes). For many years, the deterministic equations of motion were assumed to leave no room for explicit stochastic generators. Yet, the apparently random variations exhibited by deterministic systems must come from some place. The source of variation for deterministic systems is the initial conditions.
As we alluded to earlier, long-term prediction of the motion of these systems is impossible due to sensitivity to initial conditions. This sensitivity is defined as follows: Given two arbitrarily close starting points, after some amount of time the system rapidly evolves the two points away from each other. In many systems, this expansion cannot go on indefinitely; the extent of the attractor bounds the maximal distance between two separating points. Upon reaching the attractor, however, the trajectories will no longer correlate.
An alternative way to understand the apparently random behavior of chaotic dynamical systems is to view them as a shift mechanism on the numeric representations of their state variables. Sensitivity to initial conditions is a direct effect of the stretching and folding of state space by the dynamic. Stretching, or scaling by a factor greater than one, shifts the digits of the state representation to the left. Folding, which occurs when the state space is mapped through a many-to-one function, like x², throws away the most significant digits. Since a real number is an infinite source of digits, the shift-and-truncate operation can create an endless aperiodic sequence totally specified by the initial conditions. If two initial states are identical for the first k digits of their decimal expansions and the dynamics are shifting at the rate of one digit per unit time, the two trajectories will rapidly diverge after k + 1 time units. How long a prediction will hold is related to the accuracy of the state description at the beginning of the prediction and to the size of the attractor. Thus, long-term quantitative prediction is impossible. Qualitative predictions could be made, however, if the structure of the underlying attractor were known. Even chaotic systems possess certain structural regularities, such as the attractor manifold in state space, which could be used for general long-term predictions. For instance, knowing whether or not the solar system will eventually fly apart differs from predicting the location of Jupiter 50,000 years from now.
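The digit-shifting account of sensitivity can be checked directly: start the logistic map at η = 4 from two states that agree to many decimal places and watch the gap grow until it saturates at the size of the attractor. A minimal sketch (the initial separation is an arbitrary choice):

```python
def logistic(x):
    """The logistic map of Eq. 5.2 with eta = 4."""
    return 4.0 * x * (1.0 - x)

if __name__ == "__main__":
    a, b = 0.3, 0.3 + 1e-12    # two states identical to roughly twelve decimal places
    for t in range(0, 45, 5):
        print(f"t={t:2d}  |a - b| = {abs(a - b):.3e}")
        for _ in range(5):     # advance both trajectories five more steps
            a, b = logistic(a), logistic(b)
```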
The metaphors of stretching and folding are tools for a dynamical perspective on behavioral explanation. Alternative perspectives exist for describing the behavior of chaotic systems. For instance, some consider the state a reservoir of information. From this information-theoretic perspective, chaotic systems are producers of information, or at least unpackers of the information contained in the low-order digits of their state. They trade space for time. The only way to observe the very low-order digits of a continuous-state variable is to wait around long enough for them to be shifted into the digits accessible by our measuring devices.
The following example will illustrate how we can shift from one perspective to another with minimal effort.
Consider the Bakers' mapping: z_{t+1} = 2z_t mod 1. The trajectory of this system is entirely dependent on the selection of the initial state due to the continual left shift of the binary encoding
of the first value of z. This focus on the binary encoding of state is typical of the information
perspective. Since there are many more irrational numbers in the range (0, 1) than rational numbers, the dynamics will most likely generate an aperiodic trajectory that appears random (due to the sensitivity to initial conditions) but is nonetheless deterministic. The logistic map, Eq. 5.2 with η = 4, also exhibits aperiodic trajectories indicative of chaos. In fact, the logistic map at this parameter setting shares the same shift behavior as the Bakers' map. The proof of this statement hinges on an isomorphism between the state spaces of the Bakers' map and this logistic function. The isomorphism involves a simple coordinate transformation, 1/2 - y = (1/π) arcsin(1 - 2x), and flipping the right side of the resulting tent map, as shown in
Figure 5.4. While the Bakers' map interpretation emphasizes bit shifting, the original logistic mapping supports folding and stretching explanations.
Figure 5.4 From logistic map to tent map to the Bakers' map: x_{t+1} = 4x_t(1 - x_t), y_{t+1} = 1 - |2y_t - 1|, z_{t+1} = 2z_t mod 1.
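The coordinate change in Figure 5.4 can be verified numerically: advance x with the logistic map and y with the tent map from corresponding initial conditions, and the transformed logistic states track the tent states (until floating-point error is eventually amplified by the chaotic dynamics). A minimal sketch, not from the original text:

```python
import math

def logistic(x):
    return 4.0 * x * (1.0 - x)            # Eq. 5.2 with eta = 4

def tent(y):
    return 1.0 - abs(2.0 * y - 1.0)       # the tent map of Figure 5.4

def to_tent_coords(x):
    """Coordinate change from Figure 5.4: 1/2 - y = (1/pi) * arcsin(1 - 2x)."""
    return 0.5 - math.asin(1.0 - 2.0 * x) / math.pi

if __name__ == "__main__":
    x = 0.31                               # arbitrary starting state in (0, 1)
    y = to_tent_coords(x)
    for t in range(10):
        x, y = logistic(x), tent(y)
        print(f"t={t + 1:2d}  transformed logistic state = {to_tent_coords(x):.10f}  tent state = {y:.10f}")
```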
In the preceding paragraphs, we have provided a general account of chaotic behavior. Yet, the mathematical requirements for proving a system to be chaotic are concise and threefold. The system must demonstrate:
• sensitivity to initial conditions,
• mixing, and
• a set of repelling periodic points densely covering the attractor.
First, the system must exhibit sensitivity to initial conditions, which indicates stretching and folding of state space. Second, the dynamic must continually mix points within the attractor. This implies that for any two subsets, A and B, of the strange attractor, there exists
a t such that for dynamic f after t iterations (or integrating for time t) using set A as initial
conditions, the resulting image intersects B. In other words, the dynamic eventually spreads any subset of the attractor through the entire attractor, and thus the dynamic produces topological transitivity. The last condition requires that periodic points in the state space densely cover the attractor. The existence of a chaotic attractor does not preclude the existence of repelling periodic and fixed points in the same dynamical system. In fact, the chaos would not exist without an infinite collection of periodic repellors. The repellors must be infinitesimally close to the chaotic attractor: if these periodic points were the centers of balls with small radii, the entire attractor would be covered by the balls, even as the radii go to zero.
Although a system may exhibit a chaotic attractor, it simultaneously possesses an infinite set of unstable periodic orbits. This fact has led several researchers to develop chaotic controllers that select periodic regimes from an otherwise chaotic system. They accomplish this feat by deforming the state space near one of the repellors so as to balance the system near the repellor. The deformation is dynamic and is accomplished solely by manipulating the control parameter (i.e., η). The controllers have worked well in both computer simulation and laboratory experiments. Grebogi et al. (1987) have used this scheme to select periodic regimes from the Hénon map. Ditto et al. (1990) subjected an amorphous magnetoelastic ribbon to a magnetic field. Under certain field strengths, the strip would vibrate chaotically. Yet they were able to select low-period regimes from the strip through minute manipulations of the magnetic field strength. Singer et al. (1991) performed similar experiments involving a thermal convection loop. Both research groups experimentally demonstrated that periodic behavior regimes could be selected by minute variation in the system control parameters.
Although chaotic and quasiperiodic trajectories both appear never to revisit points in state space, there are important differences that separate these two behaviors. First, quasiperiodic systems are not sensitive to initial conditions. State perturbations will not cause exponential divergence. Rather, the system will continue to carry a perturbation until it is wiped out by another perturbation. Second, the frequency spectra are different. Quasiperiodic trajectories will exhibit frequency peaks at the constituent frequencies. Chaotic trajectories, on the other hand, display wide-band spectra with the power decreasing with the inverse of frequency (1/f). This spectrum differentiates chaos from noise: white noise displays a uniform distribution of frequencies across the spectrum.
5.2.3 Attractor Dimensionality
Many chaotic attractors with a variety of shapes and sizes can satisfy the definition of chaos above. Given that periodic attractors can be differentiated by their period, how can we differentiate between two different chaotic attractors? Unfortunately, we cannot resort to a property such as periodicity or volume to classify these infinite sets. We can, however, resort to another property of point sets, namely, dimensionality. Attractors, like any other subset of vector spaces, have a characteristic dimension. A fixed point, since it is a point, has a dimension of zero. Limit cycles in discrete time systems also have dimensions of zero. The limit cycle in a continuous time system, such as the one exhibited by a frictionless pendulum, is a one-dimensional object, a line segment joined at the ends. Other limit cycle attractors, like tori, have integer dimensions. But what is the dimensionality of a strange attractor? In order to answer this question, the whole notion of dimensionality had to be rethought in order to account for the space-filling curves associated with chaos.
Before describing the dimensionality of strange attractors in general, first consider the dimensionality of a seemingly innocuous collection of points known as the Cantor set. This set
is defined recursively over a closed interval of the reals, such as X = [0, 1]. Now, remove the open middle-third of this set, that is, (1/3, 2/3). This fractionating leaves two pieces, [0, 1/3] and [2/3, 1]. Recursively repeat this operation of middle-third exclusion on the resulting intervals. The limit of this operation is an uncountable set of points known as the Cantor set. What is the dimensionality of this set? Intuitively, the dimension is clearly less than one, as the limit of a particular sequence of subdivisions is always a disconnected point. Similarly, one would not want to classify the set as dimension zero since it appears to have more structure than a single point. To satisfy both intuitions, one must introduce the notion of noninteger, or fractal, dimension that somehow captures the recursive structure underlying the generative production
of the set (Mandelbrot, 1983). One way of describing this generative production is as recursive similarity: The local shape of an object is a scaled copy of the global shape of the object. The similarity (or fractal) dimension is given by Eq. 5.6 where N is the number of parts scaled and r is the scaling factor used to scale those parts:
D = log N / log(1/r).    (5.6)
Let us apply the similarity dimension to the Cantor set. In the construction described
above, each region is recursively decomposed into two parts, the left and right regions. Each
region is scaled to one-third the length of the original line. Thus, the fractal dimensionality of
the Cantor set is
log 2 / log 3 = 0.63092975....    (5.7)
Other methods of characterizing the fractional dimension of a point set exist. Another technique for calculating the Hausdorff dimensionality is box counting (Grassberger and Procaccia, 1983). This calculation and others are related to the similarity dimensionality of fractals defined above. For instance, slices through the Lorenz attractor show bands isomorphic to the Cantor set (Mandelbrot, 1983). Each method for approximating the Hausdorff dimension can be applied inappropriately. For instance, a grid-based method can yield an inaccurate estimate due to a poor choice of grids. As such, it is important to evaluate grid-based dimensions with several different grid origins. The original similarity dimension formula also has its own caveats. Specifically, it only applies to similitudes, transformations that scale all directions by the same factor.
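As a rough illustration of box counting, the sketch below (not from the original text) builds a finite approximation of the Cantor set and estimates the dimension from the occupied-box counts at two scales. The recursion depth and box widths are arbitrary illustrative choices; because they are aligned with the middle-third construction, the estimate comes out close to the similarity dimension.

```python
import math
from itertools import product

def cantor_left_endpoints(depth):
    """Numerators k (with x = k / 3**depth) of the left endpoints of the intervals that
    survive `depth` rounds of middle-third removal: their ternary digits are all 0 or 2."""
    endpoints = []
    for digits in product((0, 2), repeat=depth):
        k = 0
        for d in digits:
            k = 3 * k + d
        endpoints.append(k)
    return endpoints

def box_count(endpoints, depth, level):
    """Number of boxes of width 3**-level that contain at least one endpoint."""
    scale = 3 ** (depth - level)
    return len({k // scale for k in endpoints})

if __name__ == "__main__":
    depth = 10
    pts = cantor_left_endpoints(depth)              # 2**10 = 1024 points
    n_coarse, n_fine = box_count(pts, depth, 4), box_count(pts, depth, 6)
    estimate = math.log(n_fine / n_coarse) / math.log(3.0 ** -4 / 3.0 ** -6)
    print(f"boxes of width 3^-4: {n_coarse},  boxes of width 3^-6: {n_fine}")
    print(f"box-counting estimate: {estimate:.4f}   similarity dimension log2/log3: {math.log(2) / math.log(3):.4f}")
```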
5.2.4 Bifurcations and Phase Transitions
Some systems, like the logistic function described above (Eq. 5.2), are sensitive to parameter adjustment. Many parameterized systems exhibit changes in their attractors as the control parameter is varied. The logistic mapping is interesting because there exist parameter values that will produce attractors of all periods, including many aperiodic attractors. Based on the value of the parameter η, the logistic function can produce the three different attractor classes: fixed points, limit cycles, and aperiodic sequences. Figure 5.5 shows how the attractor of the logistic function changes as the control parameter increases. The diagram was created by iterating the logistic function with η set to a particular value and collecting the sequence of states generated by the system. The initial transients were dropped, and the union of the remaining points was plotted above that value of η on the abscissa. For values of η less than 3.0, the system evolves to a fixed point. As η increases, it reaches a point where the system evolves to a period-two orbit. The sudden doubling of the number of periods is called a pitchfork bifurcation (Figure 5.6). Although it may appear that the fixed point splits, that is not the case. The fixed point still exists, but it is no longer stable after the bifurcation. The act of bifurcation transfers stability from the fixed point to a pair of newly created fixed points for the second iterate (f°2) of the mapping. These fixed points for the new system appear as period-two oscillations in state space. As the parameter increases, the attractor bifurcates again to produce a period-four attractor. Period doubling theoretically can continue indefinitely; a period-n limit cycle bifurcates into a period-2n cycle.
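The construction of such a diagram, sweep η, iterate, drop the transient, and record what remains, is straightforward to reproduce. A minimal sketch (not from the original text) that reports the number of distinct long-run states at a few illustrative parameter values; the burn-in and sample lengths are arbitrary choices:

```python
def long_run_states(eta, x0=0.5, burn_in=1000, keep=64, digits=6):
    """Iterate the logistic map, drop the transient, and return the distinct states kept."""
    x = x0
    for _ in range(burn_in):
        x = eta * x * (1.0 - x)
    seen = set()
    for _ in range(keep):
        x = eta * x * (1.0 - x)
        seen.add(round(x, digits))
    return sorted(seen)

if __name__ == "__main__":
    for eta in (2.8, 3.2, 3.5, 3.55, 3.9):      # illustrative slices through the bifurcation diagram
        states = long_run_states(eta)
        if len(states) <= 8:
            print(f"eta={eta}: period {len(states)}")
        else:
            print(f"eta={eta}: {len(states)} distinct states (long period or aperiodic)")
```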
Does the period doubling cascade ever end? If the distance between each bifurcation were constant, the answer would be no. Fortunately, this is not the case. Feigenbaum (1983) discovered a self-similarity in the cascade that he used to prove that the distance (in η space) between two orbits decreases geometrically with the size of the period. The Δη necessary to go from period-2^(n+1) to period-2^(n+2) is less than what was needed to go from period-2^n to period-2^(n+1). At first, the system bifurcates slowly. After many bifurcations, very small changes in η will cause period doublings. For the logistic function, the limit of these small changes occurs at η_{2^∞} = 3.5699.... In fact, for all odd fixed m, the sequence of parameters {η_{m·2^n}} has a limit. At these parameters, the mapping no longer exhibits a periodic oscillation. The system is at criticality, a condition marked by the onset of chaotic behavior from periodic behavior. The critical values for this system are the set {η_{m·2^∞}}.

Figure 5.5 Bifurcation diagrams of the logistic function and a neural network unimodal function, x_{t+1} = tanh(a tanh(ax_t + β) - a tanh(ax_t - β) - a).
Although the critical values are important because they are the limits of bifurcation cascades, they also suggest other interesting mechanisms at work. As mentioned earlier, the dynamic can stretch and fold state space. Sometimes these operations will attain some measure of alignment. In fact, periodicity occurs when iterates of the dynamic map onto themselves. Criticality is another such accident. Period doubling occurs when the dynamic of a period-n attractor is similar to the dynamic of the period-2n mapping. Similarity in this context indicates that one mapping is a scaled and translated copy of the other mapping. This similarity does not extend to the 4n iterate, except in the case of criticality. At criticality the similarity holds for all iterates of the form m·2^n. Thus, the trajectories for logistic mappings and other unimodal maps at a criticality parameter never repeat, except for a countable number of repelling exceptions with periods of m·2^n.

Figure 5.6 A pitchfork bifurcation: as the control parameter (η) increases, the stable period-one branch becomes unstable and a stable period-two branch appears.
If a system initially exhibits fixed-point behavior and then goes through a cascade of period-doubling pitchfork bifurcations, are period-2^n limit cycles and chaos the only behaviors that this system can exhibit? The answer is no. Sometimes the dynamics of a system will produce a saddle-node bifurcation. In the logistic mapping, a saddle-node can create a period-n attractor when the peak of a hill or the bottom of a valley of the nth iterate crosses the diagonal. This crossing creates two orbits, one stable and the other unstable. These bifurcations are responsible for the appearance of odd-period attractors.
Recall that chaotic trajectories are surrounded by a dense cloud of unstable periodic repellors. The pitchfork bifurcation cascade offers an explanation for this property. The unstable fixed points and limit cycles are the instabilities created by the bifurcation cascade; every period-n bifurcation leaves behind n - 1 unstable trajectories.
An entire cascade of pitchfork bifurcations yields a collection of attractors with periods of the form 2^n. These long-period regions eventually give way to saddle-node bifurcations, creating much shorter periods such as seven, five, and three. From the bifurcation diagram, one can see that the periods do not appear in increasing order of period size as the control parameter increases; that is, it is not generally true that η_{n-1} < η_n < η_{n+1} for all n, where η_n is a parameter producing a period-n attractor. Sharkovski (1964) discovered that a nonintuitive ordering of the parameters underlies the bifurcation tree. He focused on the first occurrence of a period in the tree and found that the minimal period parameter values are ordered as shown in Eq. 5.8:
3 > 5 > 7 > 9 > ...
  > 2·3 > 2·5 > 2·7 > 2·9 > ...
  > 2^n·3 > 2^n·5 > 2^n·7 > 2^n·9 > ...
  > ... > 2^n > ... > 4 > 2 > 1.    (5.8)
The ordering starts with the powers of two. This captures the initial period doubling cascade that follows a fixed point. The ordering ends with period three, preceded by the remainder of the odd periods. In between these two extremes, one finds a sequence of periods of the form m·2^n. For a fixed exponent n, m ranges over the increasing odd numbers. This peculiar ordering implies the existence of a period-six attractor before and after the appearance of a period-three attractor. Before the first appearance of a period-three attractor, a saddle-node bifurcation will create a period-six attractor. The second period-six attractor will occur after the period-three as the result of a period doubling pitchfork bifurcation. One useful application of the Sharkovski theorem is searching for specific periods in one-dimensional mappings. If a period-four oscillation is isolated (we know the parameter for it by way of
analysis (solving x = f°4(x)) or experiment (the bifurcation diagram)), the ordering tells us that there exist parameters for period-two and fixed-point attractors. Likewise, if a cycle of period-three is ever encountered in a one-dimensional system, then for any n there exists a parameter value that gives rise to a period-n orbit. From this one can conclude that any period is possible. Li and Yorke (1975) originally described this condition as chaos. However, chaos actually occurs much earlier, after the limit of the 2^n bifurcation cascade.
One of the most interesting features of the bifurcation diagram is its universality. Many other iterated functions produce similar diagrams. Feigenbaum (1993), in addition to the contributions described earlier, proved that this behavior is universal for all unimodal, that is, single bump, functions. Period doubling is independent of the specific function as long as certain conditions regarding the gross shape of the mapping are met. The results of this theoretical work suggest that other universals exist for other families of mappings.
As a demonstration of the universality of period doubling in the context of neural networks, it is possible to construct a recurrent neural network with the properties specified by Feigenbaum. The bifurcation diagram in Figure 5.5 illustrates the period-doubling cascade of a one-input, two-hidden-unit, one-output recurrent neural network that provides the single scalable bump through a modulation of the gain parameter of the hyperbolic tangent transfer function. This iterated mapping exhibits the same period doubling cascade as the logistic function and displays the full range of dynamical behaviors.
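A unimodal map built from hyperbolic tangent units in the spirit of the network just described can be iterated exactly like the logistic map. In the sketch below (not from the original text), the bias β = 1 and the range of gains a are illustrative guesses, not the weights used to produce Figure 5.5:

```python
import math

def bump_map(x, a, beta=1.0):
    """One step of a unimodal 'network bump' map built from tanh units:
    x_{t+1} = tanh(a*tanh(a*x + beta) - a*tanh(a*x - beta) - a)."""
    return math.tanh(a * math.tanh(a * x + beta) - a * math.tanh(a * x - beta) - a)

def distinct_long_run_states(a, x0=0.1, burn_in=2000, keep=128, digits=5):
    """Drop the transient, then count distinct states visited (a crude period estimate)."""
    x = x0
    for _ in range(burn_in):
        x = bump_map(x, a)
    states = set()
    for _ in range(keep):
        x = bump_map(x, a)
        states.add(round(x, digits))
    return len(states)

if __name__ == "__main__":
    for a in (1.4, 1.6, 1.8, 2.0, 2.2):      # gain values spanning an illustrative range
        print(f"a={a}: {distinct_long_run_states(a)} distinct long-run states")
```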
Bifurcations come in several varieties according to the effects of parameter changes on fixed points. A pitchfork bifurcation occurs when a stable fixed point splits into a repelling fixed point and a limit cycle. This event is crucial to Feigenbaum's description of what happens during a period-doubling bifurcation. The same process occurring in a continuous time system is known as a Hopf bifurcation. When a stable fixed point meets an unstable one, the two attractors annihilate each other in a saddle-node bifurcation. Intermittency, a short turbulent phase followed by a longer period of stability, suggests a saddle-node bifurcation. The wealth of dynamical behavior encapsulated in a system as simple as the logistic function can be traced to the interplay of the period-doubling pitchfork bifurcations and the destroying saddle-node bifurcations.
5.2.5 Summary
Dynamical systems theory provides a framework for describing systems whose state changes over time. The internal dynamic, a differential equation or iterated mapping, governs the evolution of the system. Although there are many dynamical systems and many ways to encode their internal dynamic, they can produce only four general types of behavior. Fixed points, limit cycles, quasiperiodicity, and chaos exhaust the behavioral repertoire of dynamical systems. These regimes may either attract or repel state-space trajectories. The process of bifurcation allows systems to shift from one regime to another or from a stable regime to an unstable one. While bifurcations can either affect existing periodicities or create new ones, chaos occurs at the limit of an infinite cascade of bifurcations.
So far, one important aspect of dynamical systems has been avoided. If we are interested in modeling natural processes, it is clear that no system stands alone. Systems must interact with other systems. Hence, the problem of coupling, or input/output, must be acknowledged. Since one goal of this book is to understand the computational abilities of DRNs, we will now focus on a class of systems with discrete couplings.
5.3 ITERATED FUNCTION SYSTEMS
The preceding section outlined several basic concepts from the dynamical systems theory.
These concepts are broad in scope and application. It is now time to focus on particular pieces
of mathematical knowledge that will directly help us understand the behavior of DRNs.
This section provides a brief overview of a relatively new branch of mathematics known
as iterated function theory. The foundational work was originally developed by Barnsley
(1993) as a method of describing the limit behavior of systems of transformations. The limit
behavior of linear or affine transformations can be determined by examining eigenvalues of the
transformations. The trajectories can either be fixed points or limit cycles. While the effects of iterating linear systems have been fully mapped out by this time, only recently did anyone consider the case of multiple linear transformations in parallel, that is, f(x) = ⋃_{i=1}^{n} f_i(x).
Such systems have now been shown to be a generalization of Cantor's middle-third sets. This
section also describes an alternative method of approaching the attractors defined by iterated
function systems (IFSs). Known as the random iteration algorithm, or the chaos game, it
can produce sequences of points dense in the attractor by creating a random sequence of
transformations. Finally, this section addresses the effects of noncontractive transformations,
like those performed by DRNs, on the IFS theory.
5.3.1 Basic Iterated Function Systems Theory
DRNs share many similarities with the iterated function systems of Barnsley, but they also possess several radical differences. The iterated function systems responsible for the fractal structures seen in the Sierpinski triangle and other fractals (Figure 5.7) are defined as
a finite set of contractive mappings over a metric space. What makes IFSs so fascinating is
that, although the limit behavior of a single transformation is just a point, the limit set over the
union of the transformations can be an extremely complex set with recursive structure. A metric space (X, d) is the combination of a set X and a distance metric d : X × X → ℝ.
Consider, for illustrative purposes, the unit square and Euclidean distance function as our
metric space. The three functions in Eq. 5.11, w₁, w₂, and w₃, map the unit square into three of the four quadrants of the unit square:
w₁((x, y)) = (0.5x, 0.5y + 0.5)    (5.9)
w₂((x, y)) = (0.5x, 0.5y)    (5.10)
w₃((x, y)) = (0.5x + 0.5, 0.5y).    (5.11)
If we take any one of them, w₁, for instance, we find the limit of iterating this function on any point in the unit square exists and is the point (0, 1). You can verify this by applying w₁ to (0, 1) and showing that it returns (0, 1). This is a fixed point. It is an attracting fixed
point since iteration draws other points closer to it over time. The other transforms have limits
of (0,0) and (1,0).
Given that each transform, on its own, will go to a fixed point, what does the sys-
tem of transforms, as a whole, do over time? In other words, we are interested in the
limit behavior of iterating the function Ω(x) = ⋃_{i=1}^{3} w_i(x). A system of three fixed-point transformations does not necessarily collapse into a single point. For instance, assume that the fixed-point attractor A is {(x, y)}. Since A is a fixed point, Ω(A) = A. But Ω(A) = {w₁((x, y)), w₂((x, y)), w₃((x, y))}, which only equals A in degenerate cases of the wᵢ. In general, it can have an attracting set with infinite recursive structure. A single transformation shrinks the entire limit set into a one-fourth-sized copy of the original, while the union operation pastes the different copies together to form the original image again. This process is illustrated in Figure 5.8.

Figure 5.7 Several fractal sets generated by iterated function systems.
The attractor structure depends on the transients of the transformation. In other words,
the limit behavior of the individual transformations plays only a minor role in determining the
emergent shape of the attracting set. A simple example will demonstrate this fact. Each of the
three transformations that make up the IFS for the Sierpinski triangle has a limit point. These
points, (0, 1), (0, 0), and (1, 0), are also the limits of the transformations in Eq. 5.12:
w₁((x, y)) = (0, 1)
w₂((x, y)) = (0, 0)
w₃((x, y)) = (1, 0).    (5.12)
The limit set resulting from iterating this set of transformations is the finite set {(0, 1), (0, 0), (1, 0)}. These three points fall short of the uncountable number of points comprising the Sierpinski triangle, Figure 5.9.
Figure 5.8 The difference between the limit of a single transformation (a point) and the limit of a collection of transformations (a Sierpinski triangle). The actual affine transformations are defined in Eq. 5.11. The infinity superscript denotes the limit of an infinite sequence of function compositions.
Since an IFS is a dynamical system, one could ask questions about its stable and unstable behaviors. It turns out that IFSs have a single basin of attraction. An important concept for understanding this particular fact of iterated function systems is the Hausdorff space. The Hausdorff space of a complete metric space (X, d) is the set of all compact subsets of X, excluding the empty set. A compact set is one that is both closed and totally bounded. The distance between two points in Hausdorff space is given by Eq. 5.13:

h_d(A, B) = sup{inf{d(x, y) | x ∈ A} | y ∈ B}.    (5.13)
This metric measures the distance between two sets as the maximal distance between a point in A and the nearest point in B with respect to the original metric space. This function supports all the properties of a metric, such as if A = B, then h_d(A, B) is zero, and the triangle inequality, h_d(A, B) ≤ h_d(A, C) + h_d(C, B). Hausdorff space allows us to consider subsets of (X, d) as single points, a shift that lets us view the set of iterated transformations over the metric space as a single transformation in Hausdorff space. The complicated mapping and pasting can then be analyzed like any other dynamical system.

Figure 5.9 The individual transformations make three reduced copies of the attractor. Taking the union of the individual copies pastes together the original attractor.
Barnsley's theory shows that there exist collections of functions whose iteration produces simple, fixed-point dynamics in Hausdorff space. For instance, it can be shown that the Sierpinski triangle is a point in H((X, d)) and that each of the three Sierpinski transformations reduces the distance between any starting subset of X and the Sierpinski triangle. This fixed point is also unique; there are no other attractors in H((X, d)). These conclusions can be proved without considering the individual transformations as long as certain properties hold (e.g., contraction).
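The fixed-point picture in Hausdorff space suggests a direct algorithm: start from any finite point set, apply every transformation to every point, take the union, and repeat. A minimal sketch (not from the original text) using the Sierpinski maps of Eq. 5.11; the starting point and number of iterations are arbitrary choices:

```python
def sierpinski_maps():
    """The three contractive maps of Eq. 5.11."""
    return (
        lambda p: (0.5 * p[0], 0.5 * p[1] + 0.5),    # w1
        lambda p: (0.5 * p[0], 0.5 * p[1]),          # w2
        lambda p: (0.5 * p[0] + 0.5, 0.5 * p[1]),    # w3
    )

def hutchinson_step(points, maps):
    """Apply every map to every point and take the union: one step of the set-valued dynamic."""
    return {w(p) for p in points for w in maps}

if __name__ == "__main__":
    maps = sierpinski_maps()
    points = {(0.7, 0.3)}            # any starting subset of the unit square will do
    for _ in range(6):
        points = hutchinson_step(points, maps)
    print(len(points), "points approximate the Sierpinski triangle")   # 3**6 = 729 points
```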
5.3.2 Random Iteration
An interesting side effect of the unique attractor for Barnsley's IFSs is that approximations to the attractor are easy to construct. Random iteration, also known as the chaos game, produces attractor approximations very rapidly. Before describing the process of random iteration, we will look at the chaos game. For this demonstration, you need a piece of paper, a pencil, a six-sided die, and a ruler. First, mark three points on the piece of paper and label them one, two, and three. Now, place a fourth point inside the triangle formed by the three points. This will be your current point. The rest of the chaos game is simple: roll the die and
make a new point one-half the distance between the current point and the triangle vertex with the same label as the die. (For die rolls greater than three, interpret four as one, etc.) Repeat these steps for many iterations. The result of this iterative process should be a figure with the same topological structure as the Sierpinski triangle described above. That is, the center triangle will be clear, the centers of the smaller triangles will be clear, and so on.
The chaos game can be formulated in terms of iterated function systems. Moving halfway between the current point and a vertex is a contractive affine transformation. In fact, the transformations used in the Sierpinski triangle in Eq. 5.11 will transform a given point to a new point midway between it and one of the three vertices at (0, 0), (1, 0), and (0, 1). The chaos game, in other words, takes an initial point in the metric space and calculates a sequence of points by randomly selecting one IFS transformation at each iteration. It is easy to show that after each iteration the distance between the current point and the attractor geometrically decreases and approaches zero. This decrease is due to the contractivity of the affine transformations, as the scaling factor for each successive iteration is less than one. Some points in this sequence may not lie anywhere near the attractor, but the sequence of points quickly settles down to a stable emergent pattern. Thus, the initial portion of the sequence is ignored as a transient because these points exponentially converge on the attractor. The resulting union of the sequence remainder is a finite approximation to the infinite point set of the attractor.
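The chaos game itself is only a few lines of code. The sketch below (not from the original text) plays it with the Sierpinski transformations of Eq. 5.11, discards a short transient, and checks that none of the remaining points fall in the quadrant the attractor leaves empty; the iteration count and random seed are arbitrary choices.

```python
import random

TRANSFORMS = (
    lambda x, y: (0.5 * x, 0.5 * y + 0.5),      # w1 of Eq. 5.11
    lambda x, y: (0.5 * x, 0.5 * y),            # w2
    lambda x, y: (0.5 * x + 0.5, 0.5 * y),      # w3
)

def chaos_game(n_points, transient=20, seed=0):
    """Random iteration: repeatedly apply a randomly chosen transformation to the current point."""
    rng = random.Random(seed)
    x, y = rng.random(), rng.random()            # arbitrary starting point in the unit square
    points = []
    for i in range(n_points + transient):
        x, y = rng.choice(TRANSFORMS)(x, y)
        if i >= transient:                       # drop the transient approach to the attractor
            points.append((x, y))
    return points

if __name__ == "__main__":
    pts = chaos_game(5000)
    outside = [p for p in pts if p[0] > 0.5 and p[1] > 0.5]
    print(len(pts), "points generated;", len(outside), "fall in the empty upper-right quadrant")
```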
The description above is enough to generate useful approximations to many IFS attractors. Unfortunately, random iteration will fail to generate a good approximation for many IFSs. The problem lies not in the algorithm of random iteration, but in the method for randomly selecting the transformations. Random selection of transformations results in coverage of the attractor; that is, an infinite sequence of points generated this way will be dense in the attractor. Although the system eventually reaches an attractor cover for any fixed ε, it may take a long time. The problem is that one part may be covered more than another. One way to ensure uniform coverage, as suggested by Barnsley, is to set the probability of selecting a transformation proportional to the relative volume of the transformation, that is, the volume that results from mapping the entire metric space through the transformation. In the case of the Sierpinski triangle, each transformation compresses the original area of the unit square to one-fourth. Thus, each transformation has a one-third probability of being selected. The famous black spleenwort fern IFS (Figure 5.7) has four transformations, including a transform that maps the unit square into a line forming part of the stem. This measure-zero transformation is assigned a small but nonzero volume solely for the determination of its probability. Otherwise, it would have a zero probability of being used.
These probabilities are important because they affect the relative weighting of the resulting attractor. Figure 5.10 shows the effects of changing the probabilities of the transformations. The top pair of diagrams illustrates this effect on the Sierpinski triangle IFS. On the left are the results of mapping the unit square with each of the three transformations. The right diagram of the pair shows how an uneven distribution of transformations can weight some regions of the attractor more than others. Note that the more frequently visited points do not occupy a connected region; they are distributed in a self-similar manner throughout the attractor. In the case of the black spleenwort fern IFS, the proper selection of transformation probabilities is crucial to the emergence of the final form. If the large transformation is not weighted more than the others, the infinite regress of the subsystem is much more difficult to construct.
5.3.3 Summary
This section provided a brief overview of a branch of mathematics known as iterated function theory. This theory attempts to describe the limit behavior of systems of transformations. The limit behavior of linear or affine transformations has been known for many years.
Figure 5.10 Examples of attractors generated with different transformation probabilities.
The trajectories can either be fixed points or limit cycles, as determined by an examination of the transformation's eigenvalues. Only recently has anyone considered the case of multiple transformations. Specifically, the theory of iterated function systems applies to iterated mappings composed of multiple linear transformations, for example, W(x) = ∪_{i=1}^{n} w_i(x). Such systems generalize the process that yields Cantor's middle-third set, a set generated by recursively removing the middle third of an interval and all of its subintervals. They are also responsible for the fractal structures seen in the Sierpinski triangle and other fractals (Figure 5.7). Although each transformation of an IFS has very simple limit behavior, the limit set over the union of the transformations displays highly complex recursive structure. The key to understanding the limit behavior of IFSs is the Hausdorff space. Rather than concentrating on the state space, IFS theory tells us to look at the set of all closed sets of the original space. In this new space, IFSs display fixed-point attractors whose basin of attraction is the entire state space. The proof of this theorem assumes that the transformations are contractive.
By directly implementing the definition of IFSs, one can construct images of the attractor. Repeated application and union of the contractive mappings will result in the attractor image,
independent of the initial set. Because this method can consume an overwhelming amount of computational resources, an alternative method of approximating the attractor was discussed. Known as the random iteration algorithm, or the chaos game, it can produce sequences of points dense in the attractor by applying a random sequence of transformations. The random sequence is generated by assigning each transformation a fixed selection probability. The chaos game is not foolproof: the selection probabilities can influence the weighting of the resulting image, and so can the method of selection itself.
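For comparison, a sketch of the deterministic method (the grid rounding is our addition to keep the point set finite; it is not part of the definition):

def deterministic_ifs(maps, n_iterations=8, resolution=256):
    points = {(0.5, 0.5)}                  # any nonempty initial set works
    for _ in range(n_iterations):
        # Union of all transformed copies of the current set.
        points = {w(x, y) for (x, y) in points for w in maps}
        # Snap to a grid so the approximation stays finite.
        points = {(round(x * resolution) / resolution,
                   round(y * resolution) / resolution)
                  for (x, y) in points}
    return points

sierpinski = [lambda x, y: (x / 2, y / 2),
              lambda x, y: (x / 2 + 0.5, y / 2),
              lambda x, y: (x / 2, y / 2 + 0.5)]
attractor = deterministic_ifs(sierpinski)

Without the rounding step, n maps iterated k times produce n^k points, which is precisely the resource consumption that motivates the chaos game.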
5.4 SYMBOLIC DYNAMICS
Although it is helpful to characterize systems in terms of their dynamics, other methods exist
for describing the structure and regularities of dynamical systems. One way to examine an
attractor of a system is through the state values it generates. This method, however, may
obscure underlying regularities, especially for high-dimensional state spaces. An alternative
approach is through symbol dynamics. By reducing a trajectory from a sequence of state
values to a sequence of discrete symbols, many regularities can be found in symbol sequences
of different dynamical systems (Bai-lin, 1989). These symbol sequences correspond to labels of state-space regions. This labeling is a very useful tool in that it allows one to abstract away from the details of the trajectories and focus on the gross dynamic behavior of the system.
The best way to introduce symbol dynamics is an example. Consider the one-dimensional logistic map x(t+1) = r x(t)(1 - x(t)) from population dynamics (May, 1976). This mapping has been studied extensively and will serve as an example throughout this section. Selecting values of r from [0, 4] ensures that a trajectory started within the state space [0, 1] will remain there. Values of r outside of this region will produce systems whose attractors are at infinity. Note that the maximum next value of the logistic mapping occurs at x = 1/2. This value of x is very important. To the left of this value, the logistic mapping has a positive slope. Likewise, the slope is negative to the right of x = 1/2.
A trajectory evolved by this system can be replaced by a symbol sequence in the following manner. If the state value is less than 1/2, replace it with an L for left of center. If the state value is greater than 1/2, replace it with an R for right of center. Otherwise, the state value is equal to 1/2, so it is replaced with a C. This scheme for the conversion of states to symbols is easily extended to other mappings. Simply partition the state space into regions with boundaries defined by the zeros of the first derivative of the mapping. In the logistic function, the first derivative of f(x) is zero when x = 1/2.
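A sketch of this conversion for the logistic map, using the split at x = 1/2 (function and argument names are ours):

def itinerary(r, x0, n_symbols, n_transient=0):
    x = x0
    for _ in range(n_transient):             # let the trajectory settle
        x = r * x * (1 - x)
    symbols = []
    for _ in range(n_symbols):
        if x < 0.5:
            symbols.append('L')              # left of center
        elif x > 0.5:
            symbols.append('R')              # right of center
        else:
            symbols.append('C')              # exactly at center
        x = r * x * (1 - x)
    return ''.join(symbols)

print(itinerary(3.3, 0.3, 10, n_transient=500))   # RLRLRLRLRL or LRLRLRLRLR

Seeding with x0 = 0.5 reproduces the superstable sequences discussed below; since any printed parameter value is only a truncation of the true superstable parameter, later returns to the center are approximate and are labeled L or R rather than C.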
The partitioning and symbolic dynamics of a system are best understood in terms of the following examples involving the logistic function. Consider a parameter setting with 2 < r < 3, which causes the system to eventually settle on the fixed-point attractor at x = 1 - 1/r. This fixed point is to the right of the maximum. Thus, the symbol sequence for this attractor is RRRRR..., or more compactly³ R∞. At the higher value r = 3.3, the trajectory oscillates between 0.823603... and 0.479427..., generating a symbol sequence of (RL)∞ (or equivalently (LR)∞).
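These oscillation values can be checked analytically: the period-two points of the logistic map are the roots of f(f(x)) = x that are not fixed points of f, and they have the standard closed form (quoted here, not derived in the text)

x± = (r + 1 ± sqrt((r - 3)(r + 1))) / (2r),

which at r = 3.3 evaluates to 0.823603... and 0.479427..., in agreement with the trajectory above.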
Sometimes the boundary points, those points at which the slope is zero, are part of the underlying attractor. When this occurs, the periodic attractor is superstable.⁴ The sequence (CRLR)∞ indicates a superstable orbit of period-four that starts at x = 1/2 and returns to this point after four iterations.
four iterations. In Figure 5.11, the periodic points are plotted on the logistic parabola for
³The symbol ∞ is used, rather than the Kleene star (*), to emphasize that the attractor generates an infinite sequence.
⁴Since the maximum of the logistic function is at x = 1/2, the superstable orbits of the logistic function must include 1/2.
Figure 5.11 A superstable period-four orbit (CRLR) in the logistic mapping (r = 3.49856). The trajectory starts at 0.5 (labeled C). The arrows indicate the resulting sequence of states and their symbolic labels.
In Figure 5.11, the periodic points are plotted on the logistic parabola for r = 3.49856. The winding arrow at the top of the graph indicates the trajectory. During this cycle, the trajectory will visit one point to the left of center and two points to the right of center. The convention of splitting the state space at the maximum often aids analysis but is by no means a requirement. The symbol mapping simply must partition the state space into disjoint regions. Feigenbaum (1983) used this particular decomposition to calculate the ratio of successive period-doubling parameters. The superstable orbit was a convenient landmark (just solve f^n(1/2) = 1/2) found in each period region.
Now consider the Bakers' mapping x(t+1) = 2x(t) mod 1. The trajectory of this system is entirely dependent on the selection of the initial state. Multiplying by two repeatedly left-shifts the binary encoding of the initial value of x, while the modulus throws away any bits to the left of the decimal point. Since there are many more irrational numbers in the range (0, 1) than rational numbers, the trajectory is most likely aperiodic and appears random (due to the sensitivity to initial conditions) but is nonetheless deterministic. Hence, any symbol subsequence is possible with the right choice of initial conditions, a property compactly described as (R|L)∞. This particular symbol dynamic is fairly common, for it is at the root of most chaotic systems: the Bakers' map is isomorphic to the logistic function 4x(1 - x) and shares the same symbolic dynamics (see Figure 5.4).
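The bit-shifting account can be verified exactly with rational arithmetic. The sketch below is ours; the starting value is an arbitrary choice whose binary expansion is 0.0101100111:

from fractions import Fraction

x = Fraction(359, 1024)                # 359 = 0b101100111, so x = 0.0101100111 in binary
for _ in range(10):
    print('R' if x >= Fraction(1, 2) else 'L', end='')
    x = (2 * x) % 1                    # left-shift; drop bits left of the point
print()                                # prints LRLRRLLRRR: the digits of x(0), 0 -> L, 1 -> R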
A summary of the superstable attractor sequences for low power-of-two periods is given in Table 5.2. The underlying pattern is simple: double the sequence of the previous stage, A, and replace the C at the beginning of the second half, A', with an R if the number of Rs in A is even; otherwise transform it into an L.

TABLE 5.2 Attractor Symbol Sequences for the Logistic Map

Period      Regular Expression
1           C∞
2           (CR)∞
4           (CRLR)∞
8           (CRLRRRLR)∞
n           (A)∞
2n          (AA')∞
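The doubling rule just described is easy to mechanize; this short sketch (function name ours) regenerates the sequences in Table 5.2:

def superstable_sequences(n_doublings):
    seq = 'C'                                  # the period-one sequence
    out = [seq]
    for _ in range(n_doublings):
        # A' is A with its leading C replaced by R when A holds an even
        # number of Rs, and by L otherwise; the new sequence is A + A'.
        head = 'R' if seq.count('R') % 2 == 0 else 'L'
        seq = seq + head + seq[1:]
        out.append(seq)
    return out

print(superstable_sequences(3))                # ['C', 'CR', 'CRLR', 'CRLRRRLR']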
The first part of this chapter discussed the merits of characterizing systems in terms of their dynamic flows or iterated mappings over continuous spaces. Symbol dynamics, however, provides another method for describing the structure and regularities of dynamical regimes. This method reduces the trajectory through continuous space into a sequence of discrete symbols. Although one may believe that such discretizations would obscure trajectory complexities, this approach allows an observer to abstract away from the details of the trajectories and focus on the gross dynamic behavior of the system. Fixed points show up as infinite sequences of single symbols. Limit cycles produce repeated patterns. Chaotic systems generate apparently random sequences that unpack the symbols composing their initial conditions. Period-doubling bifurcations affect superstable periodic symbol sequences by appending a modified copy of the periodic subsequence: the C in the copy is replaced according to the parity of Rs in the original subsequence. In later chapters, these properties will help us understand the computational capabilities of DRNs. The symbol dynamics of DRN state-space transformations, for instance, will be interpreted as automata states (see Chapters 6 and 7). This interpretation allows for injection of prior knowledge into DRNs (see Chapter 10) and the extraction of finite automata from DRNs (see Chapter 12).
5.5 THE DRN CONNECTION
Feedforward neural networks process information by performing fixed transformations from one representation space to another (Chapter 1). Recurrent networks, on the other hand, process information quite differently. While thinking of the recurrent network as an instantaneous feedforward network may help in developing learning schemes (see Chapter 11), this conceptualization fails to help us understand the information processing performed by the network. To understand recurrent networks, one must confront the notion of state since recurrent networks perform iterated transformations on state representations.
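The iterated-transformation view can be made concrete with a toy network; the weights below are invented for illustration and carry no significance. Each input symbol selects one transformation of the state vector, so processing a string composes state-space maps, much as an IFS applies one map per step:

import math

W = [[0.5, -0.3], [0.2, 0.4]]                  # recurrent weights (hypothetical)
U = {'a': [1.0, -1.0], 'b': [-1.0, 1.0]}       # per-symbol input weights (hypothetical)

def step(state, symbol):
    # One update: tanh(W . state + U[symbol]); the symbol picks the map.
    return [math.tanh(sum(w * s for w, s in zip(row, state)) + u)
            for row, u in zip(W, U[symbol])]

state = [0.0, 0.0]
for symbol in 'abba':
    state = step(state, symbol)                # iterated transformation of state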
The first step in ascertaining the computational power of a system is to identify its information processing (IP) states. The attribution of computational behavior to a system traditionally rests on the discovery of a mapping from the states of the physical (or dynamical) system to the IP states of the computational model (Ashby, 1956).
Many researchers studying the computational power of recurrent networks have followed this tradition. Some researchers have suggested parallels between recurrent networks and various deterministic automata (Giles et al., 1990, 1992b; Servan-Schreiber et al., 1988; Watrous and Kuhn, 1992b) (see Chapters 6 and 7). Servan-Schreiber et al. trained an SRN to predict the next symbol from a Reber grammar generator. After training the network and collecting state vectors, they examined cluster diagrams extracted from the recurrent activations. The cluster divisions appeared to distinguish the five states of the Reber generator. Giles et al. (1990) introduced dynamic state partitioning. This process generates partitions by arbitrarily quantizing the state space (e.g., into hypercubes with an edge length of r). A transition table is then constructed by observing alphabetic transitions as the recurrent network dynamics propel the current state vector from one discrete region to another. Then the clusters and transition table are combined to form a state graph that is minimized. Watrous and Kuhn, on the other hand, adapted their partitions during the extraction process by constructing histograms of each hidden unit's response to a collection of strings. State labels are created by concatenating histogram interval numbers for each hidden unit. They, too, applied traditional state minimization techniques to the resulting state transition table.
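A rough sketch of extraction in the style of dynamic state partitioning, under stated assumptions: rnn_step stands for some given network update (it is not defined in the text, and could be the toy step above), q is the hypercube edge length, and minimization of the resulting graph is omitted:

from collections import defaultdict

def extract_transitions(rnn_step, initial_state, strings, q=0.25):
    def cell(state):
        # Index of the hypercube (edge length q) containing a state vector.
        return tuple(int(v // q) for v in state)
    transitions = defaultdict(set)
    for string in strings:
        state = initial_state
        for symbol in string:
            next_state = rnn_step(state, symbol)
            transitions[(cell(state), symbol)].add(cell(next_state))
            state = next_state
    return transitions

The table maps (region, symbol) pairs to successor regions; combining it with the regions themselves yields the state graph to which standard minimization is then applied.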
These extraction techniques work well for some networks. Unfortunately, they can break down when the internal dynamics are at the edge of chaos. Kolen (1994a) demonstrated that a simple recurrent network can be constructed (Eq. 5.4) whose symbolic dynamics cannot be described by finite automata. This demonstration relies on the work of Crutchfield and Young