INTELLIGENT SYSTEMS REFERENCE LIBRARY Volume 49
Monica Bianchini Marco Maggini Lakhmi C. Jain (Eds.)
Handbook on Neural Information Processing
Intelligent Systems Reference Library
49
Editors-in-Chief
Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska 6 01-447 Warsaw Poland E-mail: kacprzyk@ibspan.waw.pl
Dr. Lakhmi C. Jain Adjunct Professor University of Canberra ACT 2601 Australia and University of South Australia Adelaide South Australia SA 5095 Australia E-mail: Lakhmi.jain@unisa.edu.au
For further volumes: http://www.springer.com/series/8578
Monica Bianchini, Marco Maggini, and Lakhmi C. Jain (Eds.)
Handbook on Neural Information Processing
Editors Professor Monica Bianchini Dipartimento di Ingegneria dell'Informazione Università degli Studi di Siena Siena Italy
Dr. Lakhmi C. Jain Adjunct Professor University of Canberra ACT 2601 Australia
Professor Marco Maggini Dipartimento di Ingegneria dell'Informazione Università degli Studi di Siena Siena Italy
ISSN 1868-4394 ISBN 978-3-642-36656-7 DOI 10.1007/978-3-642-36657-4 Springer Heidelberg New York Dordrecht London
ISSN 1868-4408 (electronic) ISBN 978-3-642-36657-4 (eBook)
Library of Congress Control Number: 2013932221
© Springer-Verlag Berlin Heidelberg 2013 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Foreword
Neural information processing is an interdisciplinary field of research with contributions from statistics, computer science, neuroscience, electrical engineering, cognitive science, mathematics, and physics. Researchers in the field share a common desire to develop and understand computational and statistical strategies for information processing, in the brain or in artifacts thereof.
The field of neural information processing systems has seen a tremendous growth in the last couple of decades, both in volume and, more importantly, in reputation. It helps that there is now a quite clear separation between computational neuroscience (modeling the brain) and machine learning (developing artifacts that can “learn”, possibly but not necessarily inspired by the way the brain works). This handbook focuses on the latter, and does so with great success.
Although the editors do not claim to be fully comprehensive, the handbook has an amazing coverage of the important paradigms in machine learning, by some of the best experts available. It contains chapters going all the way from relevant mathematical theory, to state-of-the-art architectures and learning algorithms, to actual applications.
The nice thing about the chapters is that the authors can take a bit more space to introduce and explain the relevant concepts than they are normally allowed or expected to do in journal and conference papers. You can tell that they enjoyed writing their contributions, which makes the chapters a pleasure to read.
In short, the editors and authors are to be congratulated and thanked for a very interesting and readable contemporary perspective on the state of the art in the machine learning part of neural information processing systems. This handbook will be a valuable asset for graduate students, researchers, and practitioners for years to come, until there is, hopefully, another edition. I can hardly wait!
Prof.dr. Tom Heskes Institute for Computing and Information Sciences
Radboud University Nijmegen
Preface
This handbook is inspired by two fundamental questions: "Can intelligent learning machines be built?" and "Can they be applied to face problems otherwise unsolvable?". There is certainly no simple, unique answer to the first question. Instead, over the last three decades, the great amount of research in machine learning has succeeded in answering many related, but far more specific, questions. In other words, many automatic tools able to learn in particular environments have been proposed. They do not show an "intelligent" behavior, in the human sense of the term, but they can certainly help in addressing problems that involve a deep perceptual understanding of such environments. Therefore, the answer to the second question is also partial and not fully satisfactory, even if a lot of challenging problems (computationally too hard to be faced in the classic algorithmic framework) can actually be tackled with machine learning techniques.
In this view, the handbook collects both well-established and new models in connectionism, together with their learning paradigms, and proposes a deep inspection of theoretical properties and advanced applications, using plain language particularly tailored to non-experts. Without pretending to be exhaustive, this chapter and the whole book delineate an evolving picture of connectionism, in which neural information systems are moving towards approaches that try to keep most of the information unaltered and to specialize themselves, sometimes based on biological inspiration, to cope expertly with difficult real-world applications.
Chapters Included in the Book
Chapters composing this book can be logically grouped into three sections: • Neural Network Architectures • Learning Paradigms • Reasoning and Applications
Neural Network Architectures
Chapter 1, by Yoshua Bengio and Aaron Courville, gives an overview of deep learning algorithms, which are able to better capture invariances and to discover non-local structure in data distributions. Deep architectures are obtained by composing several levels of distributed representations, as in neural networks with many layers and in the mammalian cortex. Unsupervised learning of representations, applied in deep architectures, has been found useful in many applications and benefits from several advantages, e.g. when there are many unlabeled examples and few labeled ones (semi-supervised learning), or when the examples come from a distribution different from, but related to, the one of interest (self-taught learning, multi-task learning, and domain adaptation). Learning algorithms for deep architectures have recently been proven to be important in modeling rich data, with the kind of complexity exhibited in images, video, natural language and other AI tasks.
In Chapter 2, by Sajid A. Marhon, Christopher J. F. Cameron, and Stefan C. Kremer, the readers are introduced to the basic principles of recurrent networks, a variant of feedforward networks which additionally includes recurrent connections (loops) between processing elements. Recurrence forces a cyclic temporal dependency between the activation values of the processing elements, which injects a temporal dynamics into their evaluation. The ability to evolve the activation values over time makes recurrent networks capable of processing temporal input patterns (as opposed to the spatial patterns that limit conventional models), generating temporal output patterns, and performing internal, iterative computations. Computational capabilities and learning abilities of recurrent networks are assessed, and real-world, state-of-the-art applications are explored.
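To make the notion of a cyclic temporal dependency concrete, the following minimal Python/NumPy sketch unrolls a simple recurrent network over an input sequence; all names, dimensions and weight values are illustrative assumptions, not taken from the chapter.

import numpy as np

def rnn_forward(inputs, W_in, W_rec, W_out, h0):
    """Unroll a simple recurrent network over a sequence of input vectors.

    The hidden state at time t depends on the input at time t and on the
    hidden state at time t-1, which is what gives the network its memory.
    """
    h = h0
    outputs = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_rec @ h)   # recurrent update
        outputs.append(W_out @ h)           # temporal output pattern
    return np.array(outputs), h

# Toy usage with random weights (illustrative only)
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 7
seq = [rng.standard_normal(n_in) for _ in range(T)]
W_in = 0.1 * rng.standard_normal((n_hid, n_in))
W_rec = 0.1 * rng.standard_normal((n_hid, n_hid))
W_out = 0.1 * rng.standard_normal((n_out, n_hid))
ys, h_T = rnn_forward(seq, W_in, W_rec, W_out, np.zeros(n_hid))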
In the supervised framework, Graph Neural Networks (GNNs) are a powerful tool for processing structured data in the form of general graphs. In Chapter 3, by Monica Bianchini and Marco Maggini, a unified framework is proposed for learning in neural networks, based on the BackPropagation algorithm, which starts from GNNs and graphs, and views every other type of architecture/learning environment (e.g., recursive networks/trees, recurrent networks/sequences, static networks/arrays) as a particular and simpler subcase of the BackPropagation Through Graph algorithm. The derivation of the learning algorithm is described in detail, showing how the adaptation is driven by an error diffusion process on the graph structure.
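As a rough illustration of the kind of computation a GNN performs, the Python sketch below iterates a state-transition function over a graph until the node states stabilize and then reads per-node outputs off the converged states. The specific transition used here is a toy contraction with made-up weight matrices, not the chapter's actual parametrization.

import numpy as np

def gnn_relaxation(adj, labels, W_state, W_label, W_out, n_iter=50, tol=1e-6):
    """Relax node states to a fixed point on a graph (illustrative sketch).

    adj     : (n, n) adjacency matrix of the input graph
    labels  : (n, d) node label vectors
    W_state : (s, s), W_label : (s, d), W_out : (o, s) assumed weight matrices
    """
    n = adj.shape[0]
    x = np.zeros((n, W_state.shape[0]))
    for _ in range(n_iter):
        # each node aggregates the states of its neighbours plus its own label
        x_new = np.tanh(adj @ x @ W_state.T + labels @ W_label.T)
        if np.max(np.abs(x_new - x)) < tol:
            x = x_new
            break
        x = x_new
    return x @ W_out.T   # per-node outputs computed from the converged states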
Chapter 4, by Liviu Goraş, Ion Vornicu, and Paul Ungureanu, introduces the Cellular Neural Network (CNN) paradigm, and several analog parallel architectures inspired by CNNs. CNNs are massively parallel computing architectures defined in discrete N-dimensional spaces, and represented by regular arrays of elements (cells). Cells are multiple-input single-output processors, described by one or a few parametric functionals. The spatio-temporal dynamics of CNNs, possibly associated with image sensors, can be used for high-speed 1D and 2D signal processing, including linear and nonlinear filtering, and feature extraction. The VLSI implementability of CNNs in standard CMOS technology is also explored.
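A rough feel for this spatio-temporal dynamics can be obtained by a forward-Euler simulation of the standard Chua-Yang cell equation. This is only a sketch: real CNNs are analog circuits rather than discretized loops, and the 3x3 templates, step size and iteration count below are assumed illustrative values.

import numpy as np
from scipy.ndimage import convolve

def cnn_simulate(u, A, B, z, dt=0.05, steps=400):
    """Euler integration of  x' = -x + A*y + B*u + z,  y = 0.5*(|x+1| - |x-1|).

    u    : 2D input image (held constant during the transient)
    A, B : 3x3 feedback and control templates
    z    : bias
    """
    x = np.zeros_like(u, dtype=float)
    for _ in range(steps):
        y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))   # piecewise-linear cell output
        dx = -x + convolve(y, A, mode="nearest") + convolve(u, B, mode="nearest") + z
        x = x + dt * dx
    return 0.5 * (np.abs(x + 1) - np.abs(x - 1))

# Illustrative templates with an edge-detection flavour (assumed values)
A = np.array([[0, 0, 0], [0, 2, 0], [0, 0, 0]], dtype=float)
B = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)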
In Chapter 5, by Paul C. Kainen, Věra Kůrková, and Marcello Sanguineti, tools from nonlinear approximation theory are used to derive useful estimates of network
complexity, which depends on the type of computational units and the input dimension. Major tools are described for estimating the rates of decrease of approximation errors with increasing model complexity. Properties of best approximation are also discussed. Finally, recent results concerning the dependence of model complexity on the input dimension are described, together with examples of multivariable functions that can be tractably approximated.
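As one example of the kind of rate estimate discussed in Chapter 5, a classical Maurey-Jones-Barron-type bound (quoted here informally as a representative result, not as the chapter's own statement) says that for a function f with finite G-variation \(\|f\|_G\) with respect to a dictionary G of computational units whose norms are bounded by \(s_G\),

\[ \bigl\| f - \operatorname{span}_n G \bigr\| \;\le\; \frac{s_G \,\|f\|_G}{\sqrt{n}} , \]

so the approximation error achievable with n units decreases at least as fast as \(n^{-1/2}\); the input dimension enters only through the quantities \(s_G\) and \(\|f\|_G\).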
A neural network utilizes data to find a function consistent with the data and with further conceptual desiderata, such as desired smoothness, boundedness or integrability. The weights and the functions embodied in the hidden units can be thought of as determining a finite sum that approximates some function. This finite sum is a kind of quadrature for an integral formula that would represent the function exactly. In the last chapter of this section (Chapter 6), authored by Paul C. Kainen and Andrew Vogt, Bochner integration is applied to neural networks, in order to understand their approximation capabilities and to make effective choices of the network parameters. Shifting the traditional focus from the data themselves to the construction of a family of functions able to approximate those data not only gives deep theoretical insight, but also helps in constructing networks capable of performing more sophisticated tasks.
Learning Paradigms
In traditional supervised learning, labeled data are used to build a model. However, labeling training data for real-world applications is difficult, expensive and time-consuming, as it requires the effort of humans with specific domain experience. This is especially true for applications that involve learning with a large number of classes, sometimes with similarities among them. Semi-supervised learning addresses this inherent bottleneck by allowing the model to combine labeled and unlabeled data to boost performance. In Chapter 7, by Mohamed Farouk Abdel Hady and Friedhelm Schwenker, an overview of recent research advances in semi-supervised learning is provided, and its combination with active learning techniques is explored.
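One of the simplest semi-supervised schemes, self-training, can be sketched in a few lines of Python. This is illustrative only: any scikit-learn-style classifier exposing fit, predict_proba and classes_ would do, and the confidence threshold and round count are assumed values.

import numpy as np

def self_training(clf, X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Iteratively move confidently pseudo-labeled examples into the labeled pool."""
    X_l, y_l = X_lab.copy(), y_lab.copy()
    X_u = X_unlab.copy()
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        picked = proba.max(axis=1) >= threshold
        if not picked.any():
            break
        # pseudo-label the confident unlabeled points and retrain on the enlarged set
        pseudo = clf.classes_[proba[picked].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[picked]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~picked]
    return clf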
Relational learning refers to learning from data that have a complex structure. This structure may be either internal (a data instance may be complex) or external (relationships among data). Statistical relational learning refers to the use of statistical learning methods in a relational learning context. In Chapter 8, by Hendrik Blockeel, statistical relational learning is presented in its concrete forms (learning from graphs, learning from logical interpretations, learning from relational databases) and state-of-the-art methods in this framework — such as inductive logic programming, relational neural networks, and probabilistic logical models — are extensively reviewed.
Kernel methods are a class of nonparametric learning techniques relying on kernels. A kernel generalizes dot products to arbitrary domains and can thus be seen as a similarity measure between data points with complex structures. Key to the success of any kernel method is the definition of an appropriate kernel for the data at hand. A well-designed kernel should capture the aspects characterizing similar
instances, while being computationally efficient. Kernels have been designed for sequences, trees and graphs, as well as arbitrary relational data represented in first- or higher-order logic. In Chapter 9, by Andrea Passerini, the basic principles underlying kernel machines are reviewed, together with some of the most popular approaches that have recently been developed. Finally, kernel methods for predicting structures are also presented. These algorithms deal with structured-output prediction, a learning framework in which the output is itself a structure that has to be predicted from the input one.
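To make the idea of a kernel as a generalized dot product concrete, here is a minimal Python sketch (illustrative, not drawn from the chapter) of a Gaussian RBF kernel and the Gram matrix it induces on a set of points; a kernel machine then works only through this matrix, never through explicit feature vectors.

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2), a similarity between two points."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def gram_matrix(X, kernel):
    """Pairwise kernel evaluations: all a kernel method ever needs from the data."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
K = gram_matrix(X, rbf_kernel)   # symmetric, positive semi-definite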
In Chapter 10, by Francesco Gargiulo, Claudio Mazzariello, and Carlo Sansone, a survey on multiple classifier systems is presented. In the area of Pattern Recognition, this idea has been proposed based on the rationale that the consensus of a set of classifiers may compensate for the weakness of a single one, while each classifier preserves its own strength. Classifiers can be constructed following a variety of classification methodologies, and they can achieve different rates of correctly classified samples. Moreover, different combining rules and strategies, independent of the adopted classification model, have been proposed in the literature. Some of the currently available tools (following different approaches) for implementing multiple classifier systems are reviewed in this chapter, together with a taxonomy of real-world applications.
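One of the simplest combining rules for such ensembles, majority voting, can be sketched as follows (a toy Python illustration with assumed inputs, not one of the chapter's tools):

from collections import Counter

def majority_vote(predictions):
    """Combine the crisp outputs of several classifiers for one sample.

    predictions : list of class labels, one per base classifier.
    Ties are broken in favour of the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]

# Three base classifiers disagree; the ensemble output is the consensus label
print(majority_vote(["cat", "dog", "cat"]))   # -> "cat"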
Modal learning in neural computing refers to the strategic combination of modes of adaptation and learning within a single artificial neural network architecture. Modes, in this context, are learning methods, and two or more modes may proceed in parallel in different parts of the neural computing structure (layers and neurons) or, alternatively, can be applied to the same part of the structure with a mechanism that allows the network to switch between modes. Chapter 11, by Dominic Palmer-Brown and Chrisina Draganova, presents an introduction to modal learning techniques, with a particular focus on modal self-organization methods, applied to grouping learners' responses to multiple choice questions, natural language phrase recognition and pattern classification. Algorithms, dataset descriptions, pseudocode and Matlab code are included.
Reasoning and Applications
Bayesian networks represent a widely accepted model for reasoning with uncertainty. In Chapter 12, by Wim Wiegerinck, Willem Burgers, and Bert Kappen, Bayesian networks are approached from a practical point of view, putting emphasis on modelling and practical applications. A short theoretical introduction to Bayesian networks is provided and some of their typical usages are reported, e.g. for reasoning and diagnostics. Furthermore, peculiar network behaviors are described, such as the explaining-away phenomenon. Finally, practical applications to victim identification by kinship analysis based on DNA profiles, and to the construction of a petrophysical decision support system, are provided in order to illustrate the data modelling process in detail.
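The explaining-away phenomenon mentioned above can be illustrated with a tiny hand-built network: two independent causes of a common effect become negatively coupled once the effect is observed. All the probabilities in this Python sketch are made-up numbers chosen for illustration, not values from the chapter.

from itertools import product

# Two independent binary causes (c1, c2) of one binary effect e.
p_c1, p_c2 = 0.1, 0.1
def p_e_given(c1, c2):          # made-up conditional probability table
    return {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}[(c1, c2)]

def posterior_c1(evidence_e, observed_c2=None):
    """P(c1=1 | e, [c2]) by brute-force enumeration of the joint distribution."""
    num = den = 0.0
    for c1, c2 in product([0, 1], repeat=2):
        if observed_c2 is not None and c2 != observed_c2:
            continue
        pe = p_e_given(c1, c2) if evidence_e else 1 - p_e_given(c1, c2)
        joint = (p_c1 if c1 else 1 - p_c1) * (p_c2 if c2 else 1 - p_c2) * pe
        den += joint
        if c1:
            num += joint
    return num / den

print(posterior_c1(evidence_e=True))                  # ~0.51: c1 plausible given e
print(posterior_c1(evidence_e=True, observed_c2=1))   # ~0.11: drops once c2 explains e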
In content-based image retrieval (CBIR), relevance feedback is an interactive process that builds a bridge between users and a search engine. In fact, it leads to a
much improved retrieval performance by updating queries and similarity measures according to a user's preference. Chapter 13, by Jing Li and Nigel M. Allinson, introduces the basic elements of CBIR (from low-level feature extraction to classification methods) and reviews recently proposed CBIR techniques, with a particular focus on long-term learning, which allows the system to record and collect feedback knowledge from different users over a variety of query sessions and is particularly tailored to multimedia information searching. Some representative short-term learning techniques are also presented, in order to provide the reader with a comprehensive reference source for CBIR.
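A very simple form of the query-updating step described above is a Rocchio-style rule, sketched below in Python; the weights are assumed values, and real CBIR relevance-feedback schemes are usually considerably more elaborate.

import numpy as np

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query feature vector towards images the user marked relevant
    and away from those marked non-relevant."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q = q - gamma * np.mean(non_relevant, axis=0)
    return q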
In Chapter 14, authored by Ah Chung Tsoi, Markus Hagenbuchner, Milly Kc, and ShuJia Zhang, an application of learning in structured domains is presented, namely classification/regression problems on text documents from large document collections. In fact, while text documents are often processed as unstructured data, the performance and problem-solving capability of machine learning methods can be enhanced through the use of suitable structured representations. In particular, for classification problems, the incorporation of relatedness information, as expressed by Concept Link Graphs, allows the development of tighter clusters with respect to the bag-of-words representation, while such graphs can also be useful in the regression framework, in order to rank the items in a large text corpus.
Finally, Chapter 15, by Masood Zamani and Stefan C. Kremer, constitutes a survey on the application of connectionist models to bioinformatics. Actually, many problems in bioinformatics involve predicting later stages in the information flow from earlier ones. Bioinformatics methods capable of such predictions can often eliminate costly, difficult or time-consuming tasks in important biological research. For example, predicting protein structure and function based on the amino acid sequence is an essential component of modern drug design, and can replace expensive wet-lab work. This chapter is intended to provide an introduction to the predominant research areas and some neural network approaches used within bioinformatics.
Conclusions
This chapter has presented an introduction to recent research in neural information processing paradigms: from connectionist models recently developed to learn graphs (shown to be a powerful tool for all those problems in which the information is naturally organized into entities and relationships among entities), to classical paradigms, theoretically inspected in order to establish significant new properties and revisited to guarantee better performance, and to advanced applications, derived from real-world problems and often strongly biologically inspired. Without presuming to exhaust such an extensive and evolving subject, if anything can be concluded about the universe of connectionism, it is that neural information paradigms are moving towards a vast variety of problems drawn from everyday life and, in doing so, they try both to keep most of the information unaltered and to specialize themselves, attacking each problem with the weapons of an expert in that particular field. This book
collects just some indications of the directions connectionism is following, and will follow, in view of the future challenges to be faced.
Acknowledgements
The Editors hope that this handbook will prove an interesting and valuable tool to researchers and practitioners, as well as to graduate students, in the vast and partially unexplored area of learning in connectionism. Conscious of its incompleteness, we are nevertheless confident in its utility as a sourcebook, in that it outlines the trends of current research.
We have been very fortunate in attracting contributions from top-class scientists and wish to offer our deep gratitude for their support of this project. We also acknowledge the reviewers for their valuable expertise and time, and Prof. Franco Scarselli for his fundamental help in realizing this book. Finally, we thank Springer-Verlag for their support.
Contents
1 Deep Learning of Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Yoshua Bengio, Aaron Courville 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Deep Learning of Representations: A Review and Recent Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Greedy Layerwise Pre-training . . . . . . . . . . . . . . . . . . . . . 6 2.2 Undirected Graphical Models and Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 The Restricted Boltzmann Machine . . . . . . . . . . . . . . . . . 8 2.4 The Zoo: Auto-Encoders, Sparse Coding, Predictive Sparse Decomposition, Denoising Auto-Encoders, Score Matching, and More . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Convolutional Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Local Receptive Fields and Weight Sharing . . . . . . . . . . . 11 3.2 Feature Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 Learning Invariant Feature Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.1 Dealing with Factors of Variation: Invariant Features . . . 13 4.2 Invariance via Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.3 Teasing Apart Explanatory Factors via Slow Features Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.4 Learning to Pool Features . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.5 Beyond Learning Invariant Features . . . . . . . . . . . . . . . . . 18 5 Disentangling Factors of Variation . . . . . . . . . . . . . . . . . . . . . . . . . . 18 6 On the Importance of Top-Down Connections . . . . . . . . . . . . . . . . 21 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Sajid A. Marhon, Christopher J.F. Cameron, Stefan C. Kremer 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.1 Connectionist Network Topologies . . . . . . . . . . . . . . . . . . 35 2.2 Specific Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.1 Delayed Activations as Memory . . . . . . . . . . . . . . . . . . . . 46 3.2 Short-Term Memory and Generic Predictor . . . . . . . . . . . 47 3.3 Types of Memory Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.1 Recurrent Back-Propagation: Learning with Fixed Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Back-Propagation through Time: Learning with Non-fixed Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3 Long-Term Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . 54
5 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.1 Finite State Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.2 Beyond Finite State Automata . . . . . . . . . . . . . . . . . . . . . . 57
6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . 58 6.2 Identification and Control of Dynamical Systems . . . . . . 60
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3 Supervised Neural Network Models for Processing Graphs . . . . . . . . . 67 Monica Bianchini, Marco Maggini 1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 2 Neural Models for Graph Processing . . . . . . . . . . . . . . . . . . . . . . . . 73 2.1 The Graph Neural Network Model . . . . . . . . . . . . . . . . . . 73 2.2 Processing DAGs with Recursive Neural Networks . . . . 79 3 Supervised Learning for Graph Neural Networks . . . . . . . . . . . . . . 83 3.1 Learning Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.2 Learning Procedure for GNNs . . . . . . . . . . . . . . . . . . . . . . 85 3.3 Learning Procedure for Recursive Neural Networks . . . . 90 4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4 Topics on Cellular Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Liviu Goras¸, Ion Vornicu, Paul Ungureanu 1 The CNN Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 1.1 The Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 1.2 Mathematical Description . . . . . . . . . . . . . . . . . . . . . . . . . 99 1.3 Other Tasks CNNs Can Accomplish The CNN Universal Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 2 A Particular Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 2.1 The Architecture and the Equations . . . . . . . . . . . . . . . . . 102 2.2 The Decoupling Technique . . . . . . . . . . . . . . . . . . . . . . . . 103 2.3 Particular Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.4 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 2.5 A “Toy” Application: 1D “Edge” Detection . . . . . . . . . . . 113 3 Two-Grid Coupled CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 3.1 The Architecture and the Equations . . . . . . . . . . . . . . . . . 122 3.2 The Decoupling Technique . . . . . . . . . . . . . . . . . . . . . . . . 124 3.3 Boundary Conditions (BCs) and Their Influence on
Pattern Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 3.4 Dispersion Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 3.5 Turing Pattern Formation Mechanism . . . . . . . . . . . . . . . . 129 3.6 Boundary Conditions in 2D CNNs . . . . . . . . . . . . . . . . . . 130 3.7 An Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5 Approximating Multivariable Functions by Feedforward Neural Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Paul C. Kainen, Veˇra Ku˚rkova´, Marcello Sanguineti 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 2 Dictionaries and Variable-Basis Approximation . . . . . . . . . . . . . . . 145 3 The Universal Approximation Property . . . . . . . . . . . . . . . . . . . . . . 148 4 Quadratic Rates of Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 152 5 Geometric Rates of Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 156 6 Approximation of Balls in Variational Norms . . . . . . . . . . . . . . . . . 160 7 Best Approximation and Non-continuity of Approximation . . . . . 165 8 Tractability of Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 8.1 A Shift in Point-of-View: Complexity and Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 8.2 Measuring Worst-Case Error in Approximation . . . . . . . 169 8.3 Gaussian RBF Network Tractability . . . . . . . . . . . . . . . . . 170 8.4 Perceptron Network Tractability . . . . . . . . . . . . . . . . . . . . 172 9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 10 Summary of Main Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6 Bochner Integrals and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 183 Paul C. Kainen, Andrew Vogt 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 2 Variational Norms and Completeness . . . . . . . . . . . . . . . . . . . . . . . . 185 3 Bochner Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 4 Spaces of Bochner Integrable Functions . . . . . . . . . . . . . . . . . . . . . 190 5 Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 6 An Example Involving the Bessel Potential . . . . . . . . . . . . . . . . . . . 195 7 Application: A Gamma Function Inequality . . . . . . . . . . . . . . . . . . 197 8 Tensor-Product Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 9 An Example Involving Bounded Variation on an Interval . . . . . . . 202 10 Pointwise-Integrals vs. Bochner Integrals . . . . . . . . . . . . . . . . . . . . 205 10.1 Evaluation of Bochner Integrals . . . . . . . . . . . . . . . . . . . . 205
10.2 Essential Boundedness Is Needed for the Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
10.3 Connection with Sup Norm . . . . . . . . . . . . . . . . . . . . . . . . 207 11 Some Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 12 Appendix I: Some Banach Space Background . . . . . . . . . . . . . . . . 209 13 Appendix II: Some Key Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 210 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 Mohamed Farouk Abdel Hady, Friedhelm Schwenker 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 2 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 3 Self-Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 4 SSL with Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 5 Semi-supervised SVMs (S3VMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 6 Semi-supervised Learning with Graphs . . . . . . . . . . . . . . . . . . . . . . 221 7 Semi-supervised Learning with Committees (SSLC) . . . . . . . . . . . 222 7.1 SSLC with Multiple Views . . . . . . . . . . . . . . . . . . . . . . . . . 222 7.2 SSLC with Single View . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 8 Combination with Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 233 8.1 SSL with Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 8.2 SSL with Generative Models . . . . . . . . . . . . . . . . . . . . . . . 234 8.3 SSL with Committees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8 Statistical Relational Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 Hendrik Blockeel 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 2 Relational Learning versus Attribute-Value Learning . . . . . . . . . . . 242 2.1 Attribute-Value Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 242 2.2 Relational Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 2.3 Mapping Relational Data to Attribute-Value Data . . . . . . 245 2.4 Summary of This Section . . . . . . . . . . . . . . . . . . . . . . . . . . 248 3 Relational Learning: Tasks and Formalisms . . . . . . . . . . . . . . . . . . 248 3.1 Inductive Logic Programming . . . . . . . . . . . . . . . . . . . . . . 248 3.2 Learning from Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 3.3 Multi-relational Data Mining . . . . . . . . . . . . . . . . . . . . . . . 251 4 Neural Network Based Approaches to Relational Learning . . . . . . 252 4.1 CIL2P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 4.2 Relational Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 253 4.3 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 256 5 Statistical Relational Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256 5.1 Structuring Graphical Models . . . . . . . . . . . . . . . . . . . . . . 257 5.2 Approaches in the Relational Database Setting . . . . . . . . 262 5.3 Approaches in the Logical Setting . . . . . . . . . . . . . . . . . . . 263
5.4 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 6 General Remarks and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
6.1 Understanding Commonalities and Differences . . . . . . . . 274 6.2 Parameter Learning and Structure Learning . . . . . . . . . . . 275 6.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 7 Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
9 Kernel Methods for Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 Andrea Passerini 1 A Gentle Introduction to Kernel Methods . . . . . . . . . . . . . . . . . . . . 284 2 Mathematical Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285 2.1 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286 2.2 Supervised Learning with Kernels . . . . . . . . . . . . . . . . . . . 288 3 Kernel Machines for Structured Input . . . . . . . . . . . . . . . . . . . . . . . 289 3.1 SVM for Binary Classification . . . . . . . . . . . . . . . . . . . . . . 290 3.2 SVM for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 3.3 Smallest Enclosing Hypersphere . . . . . . . . . . . . . . . . . . . . 294 3.4 Kernel Principal Component Analysis . . . . . . . . . . . . . . . 297 4 Kernels on Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 4.1 Basic Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 4.2 Kernel Combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300 4.3 Kernels on Discrete Structures . . . . . . . . . . . . . . . . . . . . . . 302 4.4 Kernels from Generative Models . . . . . . . . . . . . . . . . . . . . 312 4.5 Kernels on Logical Representations . . . . . . . . . . . . . . . . . 316 5 Learning Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 5.1 Learning Kernel Combinations . . . . . . . . . . . . . . . . . . . . . 321 5.2 Learning Logical Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 322 6 Supervised Kernel Machines for Structured Output . . . . . . . . . . . . 325 7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
10 Multiple Classifier Systems: Theory, Applications and Tools . . . . . . . 335 Francesco Gargiulo, Claudio Mazzariello, Carlo Sansone 1 MCS Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 1.1 MCS Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 1.2 Combining Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 1.3 Strategies for Constructing a Classifier Ensemble . . . . . . 345 2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 2.1 Remote-Sensing Data Analysis . . . . . . . . . . . . . . . . . . . . . 350 2.2 Document Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 2.3 Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 2.4 Figure and Ground . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356 2.5 Medical Diagnosis Support . . . . . . . . . . . . . . . . . . . . . . . . 356 2.6 Chemistry and Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 2.7 Time Series Prediction/Analysis . . . . . . . . . . . . . . . . . . . . 359
2.8 Image and Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 359 2.9 Computer and Network Security . . . . . . . . . . . . . . . . . . . . 360 2.10 Miscellanea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 3 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 3.1 Tool Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 3.2 Weka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 3.3 KNIME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369 3.4 PRTools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
11 Self Organisation and Modal Learning: Algorithms and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 Dominic Palmer-Brown, Chrisina Jayne 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380 2 Snap-Drift Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 2.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 2.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 2.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 3 Snap-Drift Self-Organising Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386 3.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386 3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387 3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388 4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388 4.1 Applications of SDNN and SDSOM to Publicly Available Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388 5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
12 Bayesian Networks, Introduction and Practical Applications . . . . . . . 401 Wim Wiegerinck, Willem Burgers, Bert Kappen 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401 2 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 2.1 Bayesian Network Theory . . . . . . . . . . . . . . . . . . . . . . . . . 404 2.2 Bayesian Network Modeling . . . . . . . . . . . . . . . . . . . . . . . 405 3 An Example Application: Medical Diagnosis . . . . . . . . . . . . . . . . . 406 3.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407 3.2 Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408 3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 4 Bonaparte: A Bayesian Network for Disaster Victim Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 4.1 Likelihood Ratio of Two Hypotheses . . . . . . . . . . . . . . . . 415 4.2 DNA Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416 4.3 A Bayesian Network for Kinship Analysis . . . . . . . . . . . . 417 4.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
4.5 The Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 5 A Petrophysical Decision Support System . . . . . . . . . . . . . . . . . . . 422 5.1 Probabilistic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 5.2 The Prior and the Observation Model . . . . . . . . . . . . . . . . 425 5.3 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425 5.4 Decision Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427 5.5 The Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
13 Relevance Feedback in Content-Based Image Retrieval: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433 Jing Li, Nigel M. Allinson 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433 2 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435 2.1 Low-Level Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 436 2.2 Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439 2.3 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440 2.4 Current Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448 3 Short-Term Learning RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449 3.1 One-Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 3.2 Two-Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 3.3 Multi-class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 4 Long-Term Learning RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452 4.1 Latent Semantic Indexing-Based Techniques . . . . . . . . . . 452 4.2 Correlation-Based Approaches . . . . . . . . . . . . . . . . . . . . . 455 4.3 Clustering-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . . 459 4.4 Feature Representation-Based Methods . . . . . . . . . . . . . . 460 4.5 Similarity Measure Modification-Based Approaches . . . 461 4.6 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462 5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
14 Learning Structural Representations of Text Documents in Large Document Collections . . . . . . . . . . . . . . . . . . . . . . 471 Ah Chung Tsoi, Markus Hagenbuchner, Milly Kc, ShuJia Zhang 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472 2 Representation of Unstructured or Semi-structured Text Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 3 General Framework for Processing Graph Structured Data . . . . . . 478 4 Self Organizing Maps for Structures . . . . . . . . . . . . . . . . . . . . . . . . 479 5 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481 6 Clustering of the Wikipedia Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 482 6.1 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
7 Ranking of Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491 8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498 9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
15 Neural Networks in Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505 Masood Zamani, Stefan C. Kremer 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505 2 Analyzing DNA Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506 2.1 Example Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511 2.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512 3 Peptide Sequence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512 3.1 Example Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517 3.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519 4 Diagnostic Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519 4.1 Example Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520 4.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521 5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
Editors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
Chapter 1
Deep Learning of Representations
Yoshua Bengio and Aaron Courville
Dept. IRO, Université de Montréal
Abstract. Unsupervised learning of representations has been found useful in many applications and benefits from several advantages, e.g., where there are many unlabeled examples and few labeled ones (semi-supervised learning), or where the unlabeled or labeled examples are from a distribution different but related to the one of interest (self-taught learning, multi-task learning, and domain adaptation). Some of these algorithms have successfully been used to learn a hierarchy of features, i.e., to build a deep architecture, either as initialization for a supervised predictor, or as a generative model. Deep learning algorithms can yield representations that are more abstract and better disentangle the hidden factors of variation underlying the unknown generating distribution, i.e., to capture invariances and discover non-local structure in that distribution. This chapter reviews the main motivations and ideas behind deep learning algorithms and their representation-learning components, as well as recent results in this area, and proposes a vision of challenges and hopes on the road ahead, focusing on the questions of invariance and disentangling.
1 Introduction
This chapter is a review of recent work in the area of Deep Learning, which is mostly based on unsupervised learning of representations. It follows up on Bengio (2009) and presents our vision of the main challenges ahead for this type of learning algorithms.
Why Learn Representations?
Machine learning is about capturing dependencies between random variables and discovering the salient but hidden structure in an unknown distribution (conditional
or not), from examples. Current machine learning algorithms tend to be very dependent on the choice of data representation on which they are applied: with the right representation, almost every learning problem becomes very easy. This is particularly true of non-parametric and kernel-based algorithms, which have been very successful in recent years (Schölkopf and Smola, 2002), but it is also true of the numerous and equally successful systems based on linear predictors. It is therefore very tempting to ask the questions: Can we learn better representations? What makes a good representation? The probabilistic and geometric points of view also help us to see the importance of a good representation. The data density may concentrate in a complicated region in raw input space, but a linear or non-linear change in coordinates could make the learning task much easier, e.g., extracting uncorrelated (PCA) or independent (ICA) factors may yield much more compact descriptions of the raw input space density, and this is even more true if we consider the kind of non-linear transformations captured by manifold learning algorithms (Saul and Roweis, 2002; Bengio et al., 2006; Lee and Verleysen, 2007).
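As a concrete instance of such a change of coordinates, here is a minimal PCA sketch in Python (illustrative only; it uses a plain NumPy SVD rather than any particular library routine):

import numpy as np

def pca(X, k):
    """Project data onto its k leading principal directions.

    Rows of X are examples; the returned coordinates are uncorrelated,
    which is exactly the kind of re-representation discussed above.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # new, more compact coordinates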
Imagine a probabilistic graphical model (Jordan, 1998) in which we introduce latent variables which correspond to the true explanatory factors of the observed data. It is very likely that answering questions and learning dependencies in that space is going to be much easier. To make things concrete, imagine a graphical model for images that, given an image of a car braking at a red light, would have latent variables turning "ON", whose values would directly correspond to these events (e.g., a variable could encode the fact that a car is located at a particular position in a particular pose, others that there is a traffic light at a particular position and orientation, that it is red, and a higher-level one that detects that the traffic light's and car's orientation mean that the former applies to the latter, etc.). Starting from this representation, even if the semantics of these latent variables is not determined a priori (because they have been automatically discovered), learning to answer questions about the scene would be very easy indeed. A simple linear classifier trained from one or just a few examples would probably do the job. Indeed, this is the kind of surprisingly fast supervised learning that humans perform in settings that are familiar to them, i.e., in which they had the chance to collect many examples (without a label associated with the task on which they are finally tested, i.e., this is really the transfer learning setting in machine learning). In fact, with linguistic associations of these latent variables to words and sentences, humans can answer questions about a new task with zero labeled training examples for the new task, i.e., they are simply doing inference: the new task is specified in a language that is already well connected to these latent variables, and one can thus generalize to new tasks with zero examples (Larochelle et al., 2008), if one has set or learned a good representation for the tasks themselves.
So that is the objective we are setting: learning representations that capture the explanatory factors of variation, and help to disentangle them.
Why Distributed Representations?
A distributed representation is one which has many components or attributes, and such that many of them can be independently active or varied simultaneously. Hence the number of possible objects that can be represented can grow up to exponentially with the number of attributes that can be simultaneously active. In a dense distributed representation, all of the components can be active simultaneously, whereas in a sparse representation, only a few can be (while the others are 0). The other end of the spectrum is a local representation, such as a hard clustering, in which each input object is represented by the activation of a single cluster (the one to which the object belongs), i.e., an integer (ranging from 1 to the number of clusters). It is clear that even with a discrete representation, if there are many factors of variation, representing the input object by a set of N attributes (each taking one out of several values, at least 2) provides for a much richer representation, with which up to 2^N different objects can be distinguished. With a sparse representation with k active attributes, the number of objects that can be distinguished is on the order of N choose k, which also grows faster than exponentially in k (when k = αN and α is a fixed small fraction). An advantage of sparse representations (besides the inspiration from the brain) is that they make it possible to represent a variable-dimension (generally small-dimensional) object within a (large) fixed-dimension vector, by having only a few of the dimensions active for any particular input. We also hypothesize that sparsity may help to disentangle factors of variation, as observed in Goodfellow et al. (2009).
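The counting argument above can be made concrete with a few lines of arithmetic (N and k below are arbitrary illustrative values):

from math import comb

N, k = 100, 5
local  = N            # a hard clustering with N clusters distinguishes N objects
dense  = 2 ** N       # N binary attributes, all free to vary independently
sparse = comb(N, k)   # N binary attributes with exactly k active
print(local, sparse, dense)   # 100, 75287520, ~1.27e30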
Why Deep?
How should our learned representation be parametrized? A deep architecture is one in which there are multiple levels of representation, with higher levels built on top of lower levels (the lowest being the original raw input), and higher levels representing more abstract concepts defined in terms of less abstract ones from lower levels. There are several motivations for deep architectures:
• Brain inspiration: several areas of the brain, especially those better understood such as visual cortex and auditory cortex, are organized as a deep architecture, with each brain area associated with a level of representation (Serre et al., 2007).
• Computational complexity: as discussed in (Bengio and LeCun, 2007), some computational complexity results suggest that some functions which can be represented compactly with a deep architecture would require an exponential number of components if represented with a shallow (e.g. 2-level) architecture. For example, a family of positive-weight formal-neuron deep networks can be exponentially more expensive to represent with depth k − 1 than with depth k (Håstad and Goldmann, 1991), and similarly for logic gates (Håstad, 1986). More recently Braverman (2011) showed that if all marginals of an input distribution involving at most k variables are uniform, more depth makes it exponentially easier to distinguish the joint from the uniform. In addition, Bengio and Delalleau (2011) showed two families of sum-product networks
(i.e. to represent polynomials) in which a 2-layer architecture required exponentially more units (i.e., computations and parameters) than a sufficiently deep architecture. All these theoretical results illustrate that depending on the task and the types of computation performed at each layer, the sufficient depth will vary. It is therefore important to have algorithms that can accommodate different depths and choose depth empirically.
• Statistical efficiency and sharing of statistical strength. First, if deeper architectures can be more efficient in terms of number of computational units (to represent the same function), that in principle means that the number of parameters to estimate is smaller, which gives rise to greater statistical efficiency. Another way to see this is to consider the sharing of statistical strength that occurs when different components of an architecture are re-used for different purposes (e.g., in the computation for different outputs, or different tasks, or in the computation for different intermediate features). In the deep architecture, the lower level features can be shared in the definition of the higher-level features, which thus provides more sharing of statistical strength than in a shallow architecture. In addition the higher-level shared features can be more complicated functions (whereas for example the hidden units of a single hidden-layer network are very restricted in their expressive power, e.g. hidden=sigmoid(affine(input))). Since the parameters of a component are used for different purposes, they share statistical strength among the different examples (or parts of examples) that rely on these parameters. This is similar and related to the sharing of statistical strength that occurs in distributed representations. For example, if the parameters of one hidden unit of an RBM are “used” for many examples (because that unit turns on for many examples), then there is more information available to estimate those parameters. When a new configuration of the input is presented, it may not correspond to any of those seen in the training set, but its “components” (possibly represented at a higher level of abstraction in intermediate representations) may have been seen previously. See Bengio (2009) for a longer discussion of the statistical advantages of deep and distributed architectures to fight the so-called “curse of dimensionality”. First of all, it is not so much dimensionality as much as the amount of variations to be captured that matters (even a one-dimensional function can be difficult to learn if it is not sufficiently smooth). Previous theoretical work on the use of shallow neural networks to bypass the curse of dimensionality (Barron, 1993; Kurkova and Sanguineti, 2008) relies on the assumption of smoothness of the target function in order to avoid dependence on dimensionality. As argued in Bengio (2009) the smoothness assumption (also exploited in most non-parametric statistical models) is insufficient because the kinds of functions we want to learn for AI are not smooth enough, having too much complexity and variation. Deep architectures, with their ability to generalize non-locally, can generalize even to variations never seen in the training set. They can represent a function that has lots of variation (e.g. a very high-order polynomial with sharp non-linearities in many places) with a comparatively small number of parameters, compared to the number of variations one can capture.
• Cognitive arguments and engineering arguments for compositionality. Humans very often organize ideas and concepts in a modular way, and at multiple levels. Concepts at one level of abstraction are defined in terms of lower-level concepts. For example it is possible to write a dictionary whose definitions depend only on a very small number of core words. Note that this gives rise to a graph which is generally not a tree (like in ontologies) because each concept can be re-used in defining many other concepts. Similarly, that is also how problems are solved and systems built by engineers: they typically construct a chain or a graph of processing modules, with the output of one feeding the inputs of another. The inputs and outputs of these modules are intermediate representations of the raw signals that enter the system. Software engineering is also about decomposing computation into re-usable modules, and the word factorization is often used to describe that process which enables more powerful compositionality. Contrast this with a computer program written as a single main function without any calls to sub-routines. This would be a very shallow program and is not the way software engineers like to design software. Such hand-crafted modules are designed thanks to human ingenuity. We would like to add to human ingenuity the option to learn such decompositions and intermediate representations. In both the dictionary example and the engineering example, the power stems from compositionality.
• Sharing of statistical strength for multi-task learning, semi-supervised learning, self-taught learning, and out-of-domain generalization. Sharing of statistical strength is a core idea behind many advances in machine learning. Components and parameters are shared across tasks in the case of multi-task learning, and deep architectures are particularly well suited for multi-task learning (Collobert and Weston, 2008). Similarly semi-supervised learning exploits statistical sharing between the tasks of learning the input distribution P(X) and learning the conditional distribution P(Y|X). Because deep learning algorithms often rely heavily on unsupervised learning, they are well suited to exploit this particular form of statistical sharing. A very related form of sharing occurs in self-taught learning (Raina et al., 2007), whereby we consider unlabeled training data from P(X|Y) for a set of classes Y's but really care about generalizing to tasks P(Y|X) for a different set of Y's. Recent work showed that deep learners benefit more from the self-taught learning and multi-task learning frameworks than shallow learners (Bengio et al., 2010). This is also a form of out-of-domain generalization, for which deep learners are well suited, as shown in (Bengio et al., 2010) for pattern recognition.
Why Semi-supervised or Unsupervised Learning?
An important prior exploited in many deep learning algorithms (such as those based on greedy layer-wise pre-training, detailed below, sec. 2.1) is the following: representations that are useful for capturing P (X) can be useful (at least in part, or as initialization) for capturing P (Y |X). As discussed above, this is beneficial
as a statistical sharing strategy, and especially so because X is usually very high-dimensional and rich, compared to Y, i.e., it can contain very detailed structure that can be relevant to predicting Y given X. See (Erhan et al., 2010b) for a discussion of the advantages brought by this prior, and a comprehensive set of experiments showing how it helps not only as a regularizer, but also to find better training error when the training set is large, i.e., to find better local minima of the generalization error (as a function of the parameters). This is an important consideration because deep learners have a highly non-convex training criterion (whether it be supervised or unsupervised) and the greedy layer-wise initialization strategy, based on unsupervised pre-training, can make a huge difference.
More generally, the very task of learning representations with the objective of sharing statistical strengths across tasks, domains, etc. begs for unsupervised learning, modeling all the variables observed (including the Y's) and good representations for their joint distribution. Consider an agent immersed in a stream of observations of X's and Y's, where the set of values that Y can take is non-stationary, i.e., new classes can appear on which we will later want to make predictions. Unsupervised learning is a way to collect as much information as possible ahead of time about all those observations, so as to later be in the best possible position in order to respond to new requests, possibly from very few labels associated with a new class Y. Unsupervised learning is what we ultimately need if we consider multi-task / semi-supervised / self-taught learning and the number of possible tasks or classes becomes very large or unbounded.
2 Deep Learning of Representations: A Review and Recent Trends
2.1 Greedy Layerwise Pre-training
The following basic recipe was introduced in 2006 (Hinton and Salakhutdinov, 2006; Hinton et al., 2006; Ranzato et al., 2007a; Bengio et al., 2007):
1. Let h0(x) = x be the lowest-level representation of the data, given by the observed raw input x.
2. For ℓ = 1 to L: train an unsupervised learning model taking as observed data the training examples hℓ−1(x) represented at level ℓ − 1, and producing after training representations hℓ(x) = Rℓ(hℓ−1(x)) at the next level.
From this point on, several variants have been explored in the literature. For supervised learning with fine-tuning, which is the most common variant (Hinton et al., 2006; Ranzato et al., 2007b; Bengio et al., 2007):
3. Initialize a supervised predictor whose first stage is the parametrized representation function hL(x), followed by a linear or non-linear predictor as the second stage (i.e., taking hL(x) as input).
4. Fine-tune the supervised predictor with respect to a supervised training criterion, based on a labeled training set of (x, y) pairs, and optimizing the parameters in both the representation stage and the predictor stage.
Another supervised variant involves using all the levels of representation as input to the predictor, keeping the representation stage fixed, and optimizing only the predictor parameters (Lee et al., 2009a,b):
3. Train a supervised learner taking as input (hk(x), hk+1(x), . . . , hL(x)) for some choice of 0 ≤ k ≤ L, using a labeled training set of (x, y) pairs.
Finally, there is a common unsupervised variant, e.g. for training deep auto-encoders (Hinton and Salakhutdinov, 2006) or a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009):
3. Initialize an unsupervised model of x based on the parameters of all the stages.
4. Fine-tune the unsupervised model with respect to a global (all-levels) training criterion, based on the training set of examples x.
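To make the recipe concrete, here is a minimal Python/NumPy sketch of steps 1-2, using tied-weight sigmoid auto-encoders as the per-level unsupervised learner (RBMs or the other models of sections 2.2-2.4 could be substituted); the layer sizes, learning rate and number of epochs are illustrative assumptions, and the fine-tuning steps 3-4 are omitted, keeping only the computation of the top-level representation hL(x) that a supervised predictor would take as input.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(H, n_hidden, lr=0.1, epochs=20):
    """One greedy stage: a tied-weight sigmoid auto-encoder trained by
    stochastic gradient descent on squared reconstruction error."""
    n_in = H.shape[1]
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in H:
            h = sigmoid(x @ W + b)            # encode
            r = sigmoid(h @ W.T + c)          # decode (tied weights)
            dr = (r - x) * r * (1 - r)        # grad of 0.5*||r - x||^2 at decoder pre-activation
            dh = (dr @ W) * h * (1 - h)       # backpropagated to encoder pre-activation
            W -= lr * (np.outer(x, dh) + np.outer(dr, h))
            b -= lr * dh
            c -= lr * dr
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Step 2 of the recipe: train one unsupervised model per level, each taking
    the previous level's representation as its observed data."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)                # h_l(x) = R_l(h_{l-1}(x))
    return params, H

X = rng.random((200, 20))                     # toy data, illustrative only
params, hL = greedy_pretrain(X, layer_sizes=[16, 8])
print(hL.shape)                               # top-level representation, ready for a supervised stage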
2.2 Undirected Graphical Models and Boltzmann Machines
The first unsupervised learning algorithm (Hinton and Salakhutdinov, 2006; Hinton et al., 2006) that has been proposed for training each level of the above algorithm (step 2) is based on a Restricted Boltzmann Machine (Smolensky, 1986), which is an undirected graphical model that is a particular form of Boltzmann Machine (Hinton et al., 1984). An undirected graphical model for observed variable x based on latent variable h is specified by an energy function Energy(x, h):
P(x, h) = e^{−Energy(x,h)} / Z
where Z is a normalization constant called the partition function. A Boltzmann machine is one where Energy(x, h) is a second-order polynomial in (x, h), e.g.,
Energy(x, h) = h′W x + h′U h + x′V x + b′h + c′x
and in general both x and h are considered to be binary vectors, which makes Z intractable except when both x and h have very few components. The coefficients θ = (W, U, V, b, c) of that second-order polynomial are the parameters of the model. Given an observed x, the inference P(h|x) is generally intractable but can be estimated by sampling from a Markov chain Monte Carlo (MCMC) procedure, e.g. by Gibbs sampling, or using loopy belief propagation, variational or mean-field approximations. Even though computing the energy is easy, marginalizing over h in order to compute the likelihood P(x) is generally intractable, so that the exact log-likelihood gradient is also intractable. However, several algorithms have been proposed in recent years to estimate the gradient, most of them based on the following decomposition into the so-called “positive phase part” (x is fixed to the observed value, the gradient term
tends to decrease the associated energies) and “negative phase part” (both x and h are sampled according to P , and the gradient term tends to increase their energy):
∂(−log P(x))/∂θ = Eh[ ∂Energy(x, h)/∂θ | x ] − Ex,h[ ∂Energy(x, h)/∂θ ].
Even though a Boltzmann Machine is a parametric model when we consider the dimensionality nh of h to be fixed, in practice one allows nh to vary, making it a non-parametric model. With nh large enough, one can model any discrete distribution: Le Roux and Bengio (2008) show that RBMs are universal approximators, and since RBMs are special cases of Boltzmann Machines, Boltzmann Machines also are universal approximators. On the other hand, with nh > 0 the log-likelihood is no longer convex in the parameters, and training can potentially get stuck in one of many local minima.
2.3 The Restricted Boltzmann Machine
The Restricted Boltzmann Machine (RBM) is one without lateral interactions, i.e., U = 0 and V = 0. It turns out that the positive phase part of the gradient can be computed exactly and tractably in the easier special case of the RBM, because P(h|x) factorizes into ∏i P(hi|x). Similarly P(x|h) factorizes into ∏j P(xj|h), which makes it possible to apply blocked Gibbs sampling (sampling h given x, then x given h, again h given x, etc.). For a trained RBM, the learned representation R(x) of its input x is usually taken to be E[h|x], as a heuristic.
RBMs are typically trained by stochastic gradient descent, using a noisy (and generally biased) estimator of the above log-likelihood gradient. The first gradient estimator that was proposed for RBMs is the Contrastive Divergence estimator (Hinton, 1999; Hinton et al., 2006), and it has a particularly simple form: the negative phase gradient is obtained by starting a very short chain (usually just one step) at the observed x and replacing the above expectations by the corresponding samples. In practice, it has worked very well for unsupervised pre-training meant to initialize each layer of a deep supervised (Hinton et al., 2006; Bengio et al., 2007; Erhan et al., 2010b) or unsupervised (Hinton and Salakhutdinov, 2006) neural network.
Another common way to train RBMs is based on the Stochastic Maximum Likelihood (SML) estimator (Younes, 1999) of the gradient, also called Persistent Contrastive Divergence (PCD) (Tieleman, 2008) when it was introduced for RBMs. The idea is simply to keep sampling negative phase x's (e.g. by blocked Gibbs sampling) even as the parameters are updated every so often, i.e., without restarting a new chain each time an update is done. It turned out that SML yields RBMs with much better likelihood, whereas CD updates sometimes give rise to worsening likelihood and suffer from other issues (Desjardins et al., 2010). Theory suggests (Younes, 1999) this is a good estimator if the parameter changes are small, but practice revealed (Tieleman, 2008) that it worked even for large updates, in fact giving rise to faster mixing (Tieleman and Hinton, 2009; Breuleux et al., 2011). This is
happening because learning actually interacts with sampling in a useful way, pushing the MCMC out of the states it just visited. This principle may also explain some of the fast mixing observed in a related approach called Herding (Welling, 2009; Breuleux et al., 2011).
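The sketch below (Python/NumPy, with illustrative sizes and learning rate) shows one stochastic gradient step for a binary-binary RBM under the positive/negative phase decomposition given earlier; setting persistent=False restarts the negative chain at the observed example (CD-1), while persistent=True keeps the chain alive across updates, as in SML/PCD.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_visible, n_hidden, lr = 20, 10, 0.05
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
b = np.zeros(n_hidden)                                   # hidden biases
c = np.zeros(n_visible)                                  # visible biases
neg_v = rng.integers(0, 2, n_visible).astype(float)      # persistent chain state

def gibbs_step(v):
    """Blocked Gibbs: sample h given v, then v given h."""
    h = (rng.random(n_hidden) < sigmoid(v @ W + b)).astype(float)
    return (rng.random(n_visible) < sigmoid(h @ W.T + c)).astype(float)

def update(x, persistent=True):
    """One SGD step on -log P(x): lower the energy of the data configuration,
    raise the energy of the negative-phase sample."""
    global W, b, c, neg_v
    pos_h = sigmoid(x @ W + b)                           # E[h | x] is tractable for an RBM
    start = neg_v if persistent else x                   # PCD vs CD-1
    neg_v = gibbs_step(start)
    neg_h = sigmoid(neg_v @ W + b)
    W += lr * (np.outer(x, pos_h) - np.outer(neg_v, neg_h))
    b += lr * (pos_h - neg_h)
    c += lr * (x - neg_v)

X = (rng.random((500, n_visible)) < 0.3).astype(float)   # toy binary data
for x in X:
    update(x, persistent=True)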
RBMs can be stacked to form a Deep Belief Network (DBN), a hybrid of directed and undirected graphical model components, which has an RBM to characterize the interactions between its top two layers, and then generates the input through a directed belief network. See Bengio (2009) for a deeper treatment of Boltzmann Machines, RBMs, and Deep Belief Networks.
2.4 The Zoo: Auto-Encoders, Sparse Coding, Predictive Sparse Decomposition, Denoising Auto-Encoders, Score Matching, and More
Auto-encoders are neural networks which are trained to reconstruct their input (Rumelhart et al., 1986; Bourlard and Kamp, 1988; Hinton and Zemel, 1994). A one-hidden layer auto-encoder is very similar to an RBM and its reconstruction error gradient can be seen as an approximation of the RBM log-likelihood gradient (Bengio and Delalleau, 2009). Both RBMs and auto-encoders can be used as one-layer unsupervised learning algorithms that give rise to a new representation of the input or of the previous layer. In the same year that RBMs were successfully proposed for unsupervised pre-training of deep neural networks, auto-encoders were also shown to help initialize deep neural networks much better than random initialization (Bengio et al., 2007). However, ordinary auto-encoders generally performed worse than RBMs, and were unsatisfying because they could potentially learn a useless identity transformation when the representation size was larger than the input (the so-called “overcomplete” case).
Sparse coding was introduced in computational neuroscience (Olshausen and Field, 1997) and produced filters very similar to those observed in visual cortex area V1 (before similar filters were achieved with RBMs, predictive sparse decomposition, and denoising auto-encoders, below). They correspond to a linear directed graphical model with a continuous-valued latent variable associated with a sparsity prior (Student or Laplace, the latter corresponding to an L1 penalty on the value of the latent variable). This is like an auto-encoder, but without a parametric encoder, only a parametric decoder. The “encoding” corresponds to inference (finding the most likely hidden code associated with observed visible input) and involves solving a lengthy but convex optimization problem, and much work has been devoted to speeding it up. A very interesting way to do so is with Predictive Sparse Decomposition (Kavukcuoglu et al., 2008), in which one learns a parametric encoder that approximates the result of the sparse coding inference (and in fact changes the solution so that both approximate encoding and decoding work well). Such models based on approximate inference were the first successful examples of stacking a sparse encoding (Ranzato et al., 2007a; Jarrett et al., 2009)
into a deep architecture (fine-tuned for supervised classification afterwards, as per the above greedy-layerwise recipe).
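As a rough sketch of these two ingredients, the Python/NumPy fragment below runs ISTA-style iterative inference of the sparse code for a fixed linear dictionary and then trains a fast parametric encoder to predict that code, in the spirit of Predictive Sparse Decomposition; dictionary learning is omitted, and the dimensions, step sizes and tanh encoder are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_input, n_code, lam = 16, 32, 0.1
D = rng.normal(size=(n_input, n_code))               # decoder (dictionary), columns = atoms
D /= np.linalg.norm(D, axis=0)

def infer_sparse_code(x, n_steps=100, step=0.1):
    """Sparse coding "encoding" is inference: minimize 0.5*||x - D z||^2 + lam*||z||_1
    by iterative shrinkage-thresholding (ISTA)."""
    z = np.zeros(n_code)
    for _ in range(n_steps):
        z = z - step * (D.T @ (D @ z - x))                        # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return z

# Predictive Sparse Decomposition idea: learn a fast feed-forward encoder
# that approximates the result of the (slow) iterative inference above.
We = rng.normal(0.0, 0.01, (n_input, n_code))
be = np.zeros(n_code)

def train_encoder(X, lr=0.05, epochs=10):
    global We, be
    for _ in range(epochs):
        for x in X:
            z_star = infer_sparse_code(x)                # target code from inference
            z_pred = np.tanh(x @ We + be)                # fast feed-forward approximation
            g = (z_pred - z_star) * (1 - z_pred ** 2)    # grad of 0.5*||z_pred - z*||^2
            We -= lr * np.outer(x, g)
            be -= lr * g

X = rng.normal(size=(50, n_input))                       # toy data, illustrative only
train_encoder(X)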
Score Matching is an alternative statistical estimation principle (Hyvärinen, 2005) when the maximum likelihood framework is not tractable. It can be applied to models of continuous-valued data when the probability function can be computed tractably up to its normalization constant (which is the case for RBMs), i.e., it has a tractable energy function. The score of the model is the partial derivative of the log-likelihood with respect to the input, and indicates in which direction the likelihood would increase the most, from a particular input x. Score matching is based on minimizing the squared difference between the score of the model and a target score. The latter is in general unknown but the score matching criterion can nonetheless be rewritten in terms of the expectation (under the data generating process) of first and (diagonal) second derivatives of the energy with respect to the input, which correspond to a tractable computation.
Denoising Auto-Encoders were first introduced (Vincent et al., 2008) to bypass the frustrating limitations of auto-encoders mentioned above. Auto-encoders are only meant to learn a “bottleneck”, a reduced-dimension representation. The idea of Denoising Auto-Encoders (DAE) is simple: feed the encoder/decoder system with a stochastically corrupted input, but ask it to reconstruct the clean input (as one would typically do to train any denoising system). This small change turned out to systematically yield better results than those obtained with ordinary auto-encoders, and similar to or better than those obtained with RBMs on a benchmark of several image classification tasks (Vincent et al., 2010). Interestingly, the denoising error can be linked in several ways to the likelihood of a generative model of the distribution of the uncorrupted examples (Vincent et al., 2008; Vincent, 2011), and in particular through the Score Matching proxy for log-likelihood (Vincent, 2011): the denoising error corresponds to a form of regularized score matching criterion (Kingma and LeCun, 2010). The link also sheds light on why a denoising auto-encoder captures the input distribution. The difference vector between the reconstruction and the corrupted input is the model's guess as to the direction of greatest increase in the likelihood (starting from a corrupted example), whereas the difference vector between the corrupted input and the clean original is nature's hint of a direction of greatest increase in likelihood (since a noisy version of a training example is very likely to have a much lower probability under the data generating distribution than the original). The difference of these two differences is just the denoising reconstruction error residue.
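A minimal sketch of the idea in Python/NumPy: the encoder sees a stochastically corrupted input (masking noise here), while the reconstruction error is measured against the clean input; the overcomplete hidden size, corruption level and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, lr, corruption = 20, 30, 0.1, 0.3    # overcomplete hidden layer on purpose

W = rng.normal(0.0, 0.01, (n_in, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x):
    """One SGD step on the denoising criterion: encode a corrupted input,
    penalize reconstruction error against the clean input."""
    global W, b, c
    x_tilde = x * (rng.random(n_in) > corruption)    # masking corruption
    h = sigmoid(x_tilde @ W + b)                     # encode the corrupted input
    r = sigmoid(h @ W.T + c)                         # decode (tied weights)
    dr = (r - x) * r * (1 - r)                       # target is the *clean* x
    dh = (dr @ W) * h * (1 - h)
    W -= lr * (np.outer(x_tilde, dh) + np.outer(dr, h))
    b -= lr * dh
    c -= lr * dr

X = (rng.random((200, n_in)) < 0.5).astype(float)    # toy binary data
for x in X:
    dae_step(x)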
Noise-Contrastive Estimation is another estimation principle which can be applied when the energy function can be computed but not the partition function (Gutmann and Hyvarinen, 2010). It is based on training not only from samples of the target distribution but also from samples of an auxiliary “background” distribution (e.g. a flat Gaussian). The partition function is considered like a free parameter (along with the other parameters) in a kind of logistic regression trained to predict the probability that a sample belongs to the target distribution vs the background distribution.
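A toy illustration of this principle in Python/NumPy for a one-dimensional unnormalized model: the quantity playing the role of −log Z is treated as a free parameter c, and all parameters are fit by logistic regression discriminating data samples from samples of a known noise distribution; the quadratic model family, Gaussian noise and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized model: log p_model(u) = -0.5 * w * u^2 + m * u + c,
# where c absorbs -log Z and is learned like any other parameter.
w, m, c = 1.0, 0.0, 0.0
lr = 0.05

def log_p_model(u):
    return -0.5 * w * u ** 2 + m * u + c

def log_p_noise(u):
    return -0.5 * u ** 2 - 0.5 * np.log(2 * np.pi)    # standard normal noise

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

data = rng.normal(2.0, 1.0, 2000)      # samples from the (unknown) target distribution
noise = rng.normal(0.0, 1.0, 2000)     # samples from the known noise distribution

for _ in range(500):
    for u, y in ((data, 1.0), (noise, 0.0)):
        p = sigmoid(log_p_model(u) - log_p_noise(u))   # P(sample is data | u)
        err = p - y                                    # gradient of the cross-entropy loss
        w -= lr * np.mean(err * (-0.5 * u ** 2))
        m -= lr * np.mean(err * u)
        c -= lr * np.mean(err)

# m/w and 1/w should move toward the target's mean (2.0) and variance (1.0).
print(m / w, 1.0 / w)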
Semi-supervised Embedding is an interesting and different way to use unlabeled data to learn a representation (e.g., in the hidden layers of a deep neural network), based on a hint about pairs of examples (Weston et al., 2008). If some pairs are expected to have similar semantics, then their representation should be encouraged to be similar, whereas otherwise their representation should be at least some distance away. This idea was used in unsupervised and semi-supervised contexts (Chopra et al., 2005; Hadsell et al., 2006; Weston et al., 2008), and originates in the much older idea of siamese networks (Bromley et al., 1993).
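The pairwise hint can be turned into a very simple training criterion: pull the representations of pairs that should be semantically similar together, and push dissimilar pairs at least a margin apart. The margin-based hinge form sketched below is one common choice in this family (in the spirit of Hadsell et al., 2006); the fixed random linear "encoder" and the margin value are purely illustrative assumptions.

import numpy as np

def contrastive_loss(f_a, f_b, similar, margin=1.0):
    """Pairwise loss on two already-computed representations: penalize squared
    distance for similar pairs, and penalize dissimilar pairs that fall
    closer than the margin."""
    d = np.linalg.norm(f_a - f_b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                      # stand-in for a learned encoder
x1, x2 = rng.normal(size=5), rng.normal(size=5)
print(contrastive_loss(x1 @ W, x2 @ W, similar=1))
print(contrastive_loss(x1 @ W, x2 @ W, similar=0))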
3 Convolutional Architectures
When there are directions of variation around input examples that are known to be irrelevant to the task of interest (e.g. translating an image), it is often very useful to exploit such invariances in a learning machine. We would like features that characterize the presence of objects in sequences and images to have some form of translation equivariance: if the object is translated (temporally or spatially), we would like the associated feature detectors to also be translated. This is the basic insight behind the convolutional network (LeCun et al., 1989, 1998), which has been very influential and regained importance along with new algorithms for deep architectures.
Recently, convolutional architectures have grown in popularity (Jain and Seung, 2008; Lee et al., 2009a,b; Kavukcuoglu et al., 2010; Turaga et al., 2010; Taylor et al., 2010; Krizhevsky, 2010; Le et al., 2010). As researchers attempt to move toward ever larger, more realistic and more challenging datasets, the convolutional architecture has been widely recognized for computational scalability and ease of learning. These properties are due to the basic architectural features that distinguish convolutional networks from other network models: weight sharing and feature pooling.
3.1 Local Receptive Fields and Weight Sharing
The classical way of obtaining some form of translation equivariance is the use of weight sharing in a network whose units have a local receptive field, i.e., only receiving as input a subset (e.g. subimage) of the overall input. The same feature detector is applied at different positions or time steps1, thus yielding a sequence or a “map” containing the detector output for different positions or time steps. The idea of local receptive fields goes back to the Perceptron and to Hubel and Wiesel's discoveries (Hubel and Wiesel, 1959) in the cat's visual cortex.
1 Or, equivalently, one can say that many feature detectors, each associated with a different receptive field, share the same weights, but applied to a different portion of the input.
3.2 Feature Pooling
Aside from weight sharing, the other signature architectural trait of the convolutional network is feature pooling. The standard implementation of feature pooling groups together locally translated copies of a single feature detector (single weight vector) as input into a single pooling layer unit. The outputs of the pooled feature detectors are either summed, or more commonly their maximal value is taken as the output of the pooling feature. The use of local pooling adds robustness to the representation by giving the pooling layer units a degree of translational invariance: the unit activates irrespective of where the image feature is located inside its pooling region. Empirically the use of pooling seems to contribute significantly to improved classification accuracy in object classification tasks (LeCun et al., 1998).
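A minimal 1-D illustration of the two traits just described: the same filter (shared weights) is applied at every position to produce a feature map, and non-overlapping max-pooling then makes the pooled response insensitive to a small shift of the pattern; the filter values, signal and pool width are arbitrary illustrative choices.

import numpy as np

def feature_map(x, w):
    """Apply one shared filter at every valid position (weight sharing)."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def max_pool(m, width):
    """Non-overlapping max pooling over the feature map."""
    n = len(m) // width
    return m[:n * width].reshape(n, width).max(axis=1)

x = np.zeros(12)
x[4:7] = [1.0, 2.0, 1.0]                 # a small "pattern" somewhere in the signal
w = np.array([1.0, 2.0, 1.0])            # one shared feature detector

shifted = np.roll(x, 1)                  # translate the pattern by one step
print(max_pool(feature_map(x, w), 4))
print(max_pool(feature_map(shifted, w), 4))
# The pooling unit covering the pattern keeps the same maximal response (6.0).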
4 Learning Invariant Feature Sets
For many AI tasks, the data is derived from a complex interaction of factors that act as sources of variability. When combined together, these factors give rise to the rich structure characteristic of AI-related domains. For instance, in the case of natural images, the factors can include the identity of objects in a scene, the orientation and position of each object as well as the ambient illumination conditions. In the case of speech recognition, the semantic content of the speech, the speaker identity and acoustical effects due to the environment are all sources of variability that give rise to speech data. In each case, factors that are relevant to a particular task combine with irrelevant factors to render the task much more challenging.
As a concrete example consider the task of face recognition. Two images of the same individual with different poses (e.g., one image is in a full frontal orientation, while the other image is of the individual in profile) may result in images that are well separated in pixel space. On the other hand, images of two distinct individuals with identical poses may well be positioned very close together in pixel space. In this example, there are two factors of variation at play: (1) the identity of the individual in the image, and (2) the person's pose with respect to the image plane. One of these factors (the pose) is irrelevant to the face recognition task and yet of the two factors it could well dominate the representation of the image in pixel space. As a result, pixel space-based face recognition systems are destined to suffer from poor performance due to sensitivity to pose, but things really get interesting when many factors are involved.
The key to understanding why the combination of factors makes the task so difficult is that these factors typically do not combine as simple superpositions that can be easily separated by, for example, choosing the correct lower-dimensional projection of the data. Rather, as our face recognition example illustrates, these factors often appear tightly entangled in the raw data. The challenge for deep learning methods is to construct representations of the data that somehow attempt to cope with the reality of entangled factors that account for the wide variability and complexity of data in AI domains.
4.1 Dealing with Factors of Variation: Invariant Features
In an effort to alleviate the problems that arise when dealing with this sort of richly structured data, there has recently been a very broad-based movement in machine learning toward building feature sets that are invariant to common perturbations of the datasets. The recent trend in computer vision toward representations based on large-scale histogramming of low-level features is one particularly effective example (Wang et al., 2009).
To a certain degree, simply training a deep model, whether by stacking a series of RBMs as in the DBN (in section 2.3) or by jointly training the layers of a Deep Boltzmann Machine (discussed in section 6), should engender an increasing amount of invariance in increasingly higher-level representations. However, with our current set of models and algorithms, it appears as though depth alone is insufficient to foster a sufficient degree of invariance at all levels of the representation and that an explicit modification of the inductive bias of the models is warranted.
Within the context of deep learning, the problem of learning invariant feature sets has long been considered an important goal. As discussed in some detail in section 3, one of the key innovations of the convolutional network architecture is the inclusion of max-pooling layers. These layers pool together locally shifted versions of the filters represented in the layer below. The result is a set of features that are invariant to local translations of objects and object parts within the image.
More generally, invariant features are designed to be insensitive to variations in the data that are uninformative to the target task while remaining selective to relevant aspects of the data. The result is a more stable representation that is well suited to be used as an input to a classifier. With irrelevant sources of variance removed, the resulting feature space has the property that distances between data points represented in this space are a more meaningful indicator of their true similarity. In classification tasks, this property naturally simplifies the discrimination of examples associated with different class labels.
Thus far, we have considered only invariant features, such as those found in the convolutional network, whose invariant properties were hand-engineered by specifying the filter outputs to be pooled. This approach, while clearly effective in constructing features invariant to factors of variation such as translation, is fundamentally limited to expressing types of invariance that can be imposed upon them by human intervention. Ideally we would like our learning algorithms to automatically discover appropriate sets of invariant features. Feature sets that learn to be invariant to certain factors of variation have the potential advantage of discovering patterns of invariance that are either difficult to hand-engineer (e.g. in-plane object rotations) or simply a priori not known to be useful. In the remainder of this section we review some of the recent progress in techniques for learning invariant features and invariant feature hierarchies.
4.2 Invariance via Sparsity
Learning sparse feature sets has long been popular both in the context of learning feature hierarchies as a deep learning strategy and as a means of learning effective shallow representations of the data. In the context of an object recognition task, Raina et al. (2007) established that using sparse representations learned from image data as input to an SVM classifier led to better classification performance than using either the raw pixel data or a non-sparse PCA representation of the data.
Recently, Goodfellow et al. (2009) showed that sparsity can also lead to more invariant feature representations. In the context of auto-encoder networks trained on natural images, they showed that when adding a sparsity penalty to the hidden unit activations, the resulting features are more invariant to specific transformations such as translations, rotations normal to the image plane as well as in-plane rotations of the objects.
At this point it is not clear by what mechanism sparsity promotes the learning of invariant features. It is certainly true that sparsity tends to cause the learned feature detectors to be more localized: in training on natural images, the learned features form Gabor-like edge detectors (Olshausen and Field, 1997); in training on natural sounds, the learned features are wavelet-like in that they tend to be localized in both time and frequency (Smith and Lewicki, 2006). It is also true that localized filters are naturally invariant to variations outside their local region of interest or receptive field. It is not clear that feature locality is sufficient to entirely account for the invariance observed by Goodfellow et al. (2009). Nor is it clear that the often observed superior classification performance of sparse representations (Raina et al., 2007; Manzagol et al., 2008; Bagnell and Bradley, 2009) may be attributed to this property of generating more invariant features. What is clear is that these issues merit further study.
4.3 Teasing Apart Explanatory Factors via Slow Features Analysis
We perceive the world around us through a temporally structured stream of perceptions (e.g. a video). As one moves one's eyes and head, the identity of the surrounding objects generally does not change. More generally, a plausible hypothesis is that many of the most interesting high-level explanatory factors for individual perceptions have some form of temporal stability. The principle of identifying slowly moving/changing factors in temporal/spatial data has been investigated by many (Becker and Hinton, 1993; Wiskott and Sejnowski, 2002; Hurri and Hyvärinen, 2003; Körding et al., 2004; Cadieu and Olshausen, 2009) as a principle for finding useful representations of images, and as an explanation for why V1 simple and complex cells behave the way they do. This kind of analysis is often called slow feature analysis. A good overview can be found in Berkes and Wiskott (2005). Note that it is easy to obtain features that change slowly when they are computed through averaging (e.g. a moving average of current and past
observations). Instead, with slow feature analysis, one learns features of an instantaneous perception (e.g. a single image) such that consecutive feature values are similar to each other. For this to be of any use, it is also required that these features capture as much as possible of the input variations (e.g. constant features would be very stable but quite useless). A very interesting recent theoretical contribution (Klampfl and Maass, 2009) shows that if there exist categories and these categories are temporally stable, then slow feature analysis can discover them even in the complete absence of labeled examples.
Temporal coherence is therefore a prior that could be used by learning systems to discover categories. Going further in that direction, we hypothesize that more structure about the underlying explanatory factors could be extracted from temporal coherence, still without using any labeled examples. First, different explanatory factors will tend to operate at different time scales. With such a prior, we can not only separate the stable features from the instantaneous variations, but also disentangle different concepts that belong to different time scales. Second, instead of the usual squared error penalty (over the time change of each feature), one could use priors that favor the kind of time course we actually find around us. Typically, a factor is either present or not (there is a notion of sparsity there), and when it is present, it would tend to change slowly. This corresponds to a form of sparsity not only of the feature, but also of its change (either no change, or small change). Third, an explanatory factor is rarely represented by a single scalar. For example, camera geometry in 3 dimensions is characterized by 6 degrees of freedom. Typically, these factors would either not change, or change together. This could be characterized by a group sparsity prior (Jenatton et al., 2009).
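The kind of training signal implied by temporal coherence is easy to write down: given features computed on consecutive frames, penalize how fast each feature changes, either with the classic squared-change (slow feature analysis) flavor or with an absolute-change penalty in the spirit of the sparse-change prior suggested above; a separate term (not shown) must keep the features informative to rule out the useless constant-feature solution. The toy feature sequences below are illustrative assumptions.

import numpy as np

def slowness_penalty(F, l1=False):
    """F has shape (T, d): one d-dimensional feature vector per time step.
    Penalize the change between consecutive feature vectors: squared change
    (slow feature analysis flavor) or absolute change (sparse-change prior)."""
    diff = F[1:] - F[:-1]
    return np.abs(diff).mean() if l1 else (diff ** 2).mean()

rng = np.random.default_rng(0)
slow = np.cumsum(rng.normal(0, 0.01, (100, 4)), axis=0)   # slowly drifting features
fast = rng.normal(0, 1.0, (100, 4))                       # features that jump around
print(slowness_penalty(slow), slowness_penalty(fast))     # slow << fast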
4.4 Learning to Pool Features
Most unsupervised learning methods that are used as building blocks for Deep Learning are based on a distributed or factorial representation. Each unit is associated with a filter that is compared to the observation via a simple linear operation, most commonly the dot product, and activates the unit accordingly, typically through a non-linearity. This is the case for Autoencoder Networks, Denoising Autoencoder Networks and Restricted Boltzmann Machines. With this activation mechanism, the comparison between the filter and the observed vector is maximal when the two vectors are co-linear and decreases smoothly as the observation vector is perturbed in any direction. With regard to the activation of the feature, the only relevant aspects of the observation vector are the angle it makes with the filter and its norm. Features built this way are not differentially sensitive to variations in the observed vector. In other words, they do not, by themselves, possess directions of invariance. Even when sparsity or a temporal coherence penalty is used, the units' activations may be more robust to small perturbations of the observations, particularly along the length of an edge, but there is no mechanism to allow them to exhibit robustness to significant perturbations in arbitrary directions while remaining sensitive to perturbations in other directions.
In section 3, we encountered feature pooling as a way to ensure that the representation in the pooling layers of the convolutional network was invariant to small translations of the image. Feature pooling works by combining filters together to form a single feature that activates with the activation of any feature in the pool. With this simple mechanism, the pooling feature has the capacity to represent directions of invariance. Consider the way pooling is used in the convolutional network. The pooling features take as their input units possessing filters that are locally shifted copies of each other. When an input pattern is presented to the network that is “close” (in the sense of cosine distance) to one of the pooled filters, the pooling feature is activated. The result is a pooling feature that is invariant to translation of the target input pattern but remains sensitive to other transformations or distortions of the pattern.
The principle that invariance to various factors of the data can be induced by pooling together a set of simple filter responses is a powerful one, exploited to good effect in the convolutional network. However, the convolutional network's use of pooling is restricted to pooling of features that are shifted copies of each other, limiting the kind of invariance captured by the convolutional network to translational invariance. It is possible to conceive of generating other forms of invariant features by pooling together variously-transformed copies of filters. For instance, in the case of natural images, we might like some features to be invariant to small amounts of rotation of a particular image pattern while being highly sensitive to changes in the morphology of the image pattern. Ultimately this approach suffers from the disadvantage of limiting the set of invariances represented in the representation to hand-crafted transformations that are well-studied and understood to be useful in the domain of application. Consider the case of audio sequences, where we want a set of features to be invariant to changes in environmental properties such as reverberation while remaining highly sensitive to the source pitch and timbre. The set of transformations that render a feature set invariant to environmental conditions is not straightforward to encode. In this section we explore methods that seek to exploit the basic feature pooling architecture to learn the set of filters to be pooled together and thereby learn the invariance reflected in the pooling feature.
The ASSOM Model. The principle that invariant features can actually emerge from the organization of features into pools was first established by Kohonen (1996) in the ASSOM model. Synthesizing the self-organizing map (SOM) (Kohonen, 1990) with the learning subspace method (Kohonen et al., 1979), Kohonen (1996) generalized the notion of a filter to a filter subspace. Filter subspaces are defined by the space spanned by the set of filters in the pool. The principle states that one may consider an invariant feature as a linear subspace in the feature space. According to the ASSOM model, the value of the higher-order pooling feature is given by the square of the norm of the projection of the observation on the feature subspace.
Using a competitive learning strategy with an image sequence of colored noise, Kohonen (1996) showed that learning feature subspaces can lead to features that are
invariant to the geometric transformations (including translation, rotation and scale) reflected in the structure of the colored noise sequence.
Subspace and Topographic ICA. Hyvärinen and Hoyer (2000) later integrated the principle of feature subspaces (Kohonen, 1996) within the independent component analysis (ICA) paradigm via Multidimensional ICA (Cardoso, 1998) to form the Subspace ICA model. Multidimensional independent component analysis loosens the standard ICA requirement of strictly independent components to allow dependencies between groups or pools of components and only enforces the independence constraint across the component pools. By allowing dependencies within the feature subspaces defined by the pools and enforcing the ICA independence constraint across feature pools, Hyvärinen and Hoyer (2000) were able to demonstrate the emergence of higher-order edge-like feature detectors that display some amount of invariance to edge orientation, phase, or location.
In their Topographic ICA (TICA) model, Hyvärinen et al. (2001) generalized the Subspace-ICA model by allowing for overlapping pools, arranged topographically such that neighbouring pools shared the linear filters along their shared border. Training on a natural image dataset, Hyvärinen et al. (2001) demonstrated that the topographic arrangement of the feature pools led to the emergence of pinwheel patterns in the topographically arranged filter set. The pinwheel pattern of filters is highly reminiscent of the pinwheel-like structure of orientation preference observed in primary visual cortex (Bartfeld and Grinvald, 1992).
Tiled-Convolutional Models. More recently, Le et al. (2010) used a training algorithm based on Topographic ICA to learn the filters in a tiled-convolutional arrangement. Unlike a standard convolution model, where the features pooled together share translated copies of the weights, the tiled convolution model pools together features that do not have the weights tied. However, at a distance away from that pool, another non-overlapping pool may share the same weights, allowing such models to handle variable-size images, like convolutional nets. In essence, this model can be seen as developing a convolutional version of the topographic ICA model that combines the major advantages of both approaches. On the one hand, the model preserves TICAs capability to learn potentially complex invariant feature subspaces. On the other hand, the use of convolutional weight sharing means that the model possesses a relatively small number of parameters which can dramatically improve scalability and ease of learning.
Invariant Predictive Sparse Decomposition. Another recent approach to learn invariant feature subspaces via a topographic pooling of low-level features is the Invariant Predictive Sparse Decomposition (IPSD) (Kavukcuoglu et al., 2009). The IPSD is a variant of Predictive Sparse Decomposition (discussed in section 2.4) that replaces the L1 penalty on the coefficients zj with a group sparsity term λ Σi=1..K vi, where K is the number of pooling features and vi = √( Σj∈Pi wj zj^2 ) is the value of the pooling feature. Here, Pi is the set of low-level features associated with pooling unit i, wj reflects the relative weight of coefficient zj in the pool, and λ is a
parameter that adjusts the relative strength of the sparsity penalty in the loss function. This modification to the PSD loss function has the effect of encouraging units within a pool to be nonzero together, forming the feature subspace that defines the invariance properties of the pooling feature vi.
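The group sparsity term itself is a one-liner; the small Python/NumPy sketch below computes it for a code vector z under a set of (hypothetical, non-overlapping) pools, with pool membership, weights and λ chosen purely for illustration.

import numpy as np

def group_sparsity(z, pools, w, lam=0.1):
    """Sum over pools of the weighted L2 norm of the coefficients in each pool:
    lam * sum_i sqrt( sum_{j in P_i} w_j * z_j^2 )."""
    return lam * sum(np.sqrt(np.sum(w[p] * z[p] ** 2)) for p in pools)

z = np.array([0.5, -0.2, 0.0, 0.0, 1.0, 0.3])
pools = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # two pooling units
w = np.ones_like(z)                                  # equal within-pool weights
print(group_sparsity(z, pools, w))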
Mean and Covariance Restricted Boltzmann Machine. Within the RBM context, Ranzato and Hinton (2010) follow the same basic strategy of learning invariant feature subspaces with their mean and covariance RBM (mcRBM). Using an energy function with elements that are very similar to elements of the objective functions of both Topographic ICA and IPSD, Ranzato and Hinton (2010) demonstrated the model's capacity to learn invariant feature subspaces by pooling filters. A very similar type of RBM-based pooling arrangement was presented by Courville et al. (2011).
4.5 Beyond Learning Invariant Features
As is evident from the above paragraphs, the feature subspace approach has been very popular recently as a means of learning features that are invariant to learned transformations of the data (those spanned by the feature subspace), while maintaining selectivity in all other directions in feature space. The ability to learn invariant features is obviously a crucially important step toward developing effective and robust representations of data. However, we feel it is doubtful that these strategies for creating invariant features are, by themselves, sufficient to encode the rich interactions extant in the kind of data we would like to model and represent. We would like to learn features capable of disentangling the web of interacting factors that make up practically all data associated with AI-related tasks. In the next section, we explore what we mean by disentangling and how one can think about beginning to approach this ambitious goal.
5 Disentangling Factors of Variation
Complex data arise from the rich interaction of many sources. These factors interact in a complex web that can complicate AI-related tasks such as object classification. For example, an image is composed of the interaction between one or more light sources, the object shapes and the material properties of the various surfaces present in the image. Shadows from objects in the scene can fall on each other in complex patterns, creating the illusion of object boundaries where there are none and dramatically affecting the perceived object shape. How can we cope with these complex interactions? How can we disentangle the objects and their shadows? Ultimately, we believe the approach we adopt for overcoming these challenges must leverage the data itself, using vast quantities of unlabeled examples, to learn representations that separate the various explanatory sources. Doing so should give rise to a representation significantly more robust to the complex and richly structured variations extant in natural data sources for AI-related tasks.
It is important to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors. The central difference is the preservation of information. Invariant features, by definition, have reduced sensitivity in the direction of invariance. This is the goal of building invariant features and is fully desirable if the directions of invariance all reflect sources of variance in the data that are uninformative to the task at hand. However it is often the case that the goal of feature extraction is the disentangling or separation of many distinct but informative factors in the data. In this situation, the methods of generating invariant features, namely the feature subspace method, may be inadequate.
Roughly speaking, the feature subspace method can be seen as consisting of two steps (often performed together). First, a set of low-level features is recovered that accounts for the data. Second, subsets of these low-level features are pooled together to form higher-level invariant features. With this arrangement, the invariant representation formed by the pooling features offers a somewhat incomplete window on the data, as the detailed representation of the lower-level features is abstracted away in the pooling procedure. While we would like higher-level features to be more abstract and exhibit greater invariance, we have little control over what information is lost through feature subspace pooling. For example, consider a higher-level feature made invariant to the color of its target stimulus by forming a subspace of low-level features that represent the target stimulus in various colors (forming a basis for the subspace). If this is the only higher-level feature that is associated with these low-level colored features, then the color information of the stimulus is lost to the higher-level pooling feature and every layer above. This loss of information becomes a problem when the information that is lost is necessary to successfully complete the task at hand, such as object classification. In the above example, color is often a very discriminative feature in object classification tasks. Losing color information through feature pooling would result in significantly poorer classification performance.
Obviously, what we really would like is for a particular feature set to be invariant to the irrelevant factors and to disentangle the relevant ones. Unfortunately, it is often difficult to determine a priori which set of features will ultimately be relevant to the task at hand. Further, as is often the case in the context of deep learning methods (Collobert and Weston, 2008), the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features. Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible, discarding as little information about the data as is practical. If some form of dimensionality reduction is desirable, then we hypothesize that the local directions of variation least represented in the training data should be the first to be pruned out (as in PCA, for example, which does it globally instead of around each example).
One solution to the problem of information loss that would fit within the feature subspace paradigm is to consider many overlapping pools of features based on the same low-level feature set. Such a structure would have the potential to learn a redundant set of invariant features that may not cause significant loss of information. However it is not obvious what learning principle could be applied that can ensure
that the features are invariant while maintaining as much information as possible. While a Deep Belief Network or a Deep Boltzmann Machine (as discussed in sections 2.3 and 6 respectively) with 2 hidden layers would, in principle, be able to preserve information into the “pooling” second hidden layer, there is no guarantee that the second layer features are more invariant than the “low-level” first layer features. However, there is some empirical evidence that the second layer of the DBN tends to display more invariance than the first layer (Erhan et al., 2010a). A second issue with this approach is that it could nullify one of the major motivations for pooling features: to reduce the size of the representation. A pooling arrangement with a large number of overlapping pools could lead to as many pooling features as low-level features, a situation that is both computationally and statistically undesirable.
A more principled approach, from the perspective of ensuring a more robust compact feature representation, can be conceived by reconsidering the disentangling of features through the lens of its generative equivalent: feature composition. Since most (if not all) of our unsupervised learning algorithms have a generative interpretation, the generative perspective can provide insight into how to think about disentangling factors. The majority of the models currently used to construct invariant features have the interpretation that their low-level features linearly combine to construct the data.2 This is a fairly rudimentary form of feature composition with significant limitations. For example, it is not possible to linearly combine a feature with a generic transformation (such as translation) to generate a transformed version of the feature. Nor can we even consider a generic color feature being linearly combined with a gray-scale stimulus pattern to generate a colored pattern. It would seem that if we are to take the notion of disentangling seriously we require a richer interaction of features than that offered by simple linear combinations.
Disentangling via Multilinear Models. While there are presently few examples of work dedicated to the task of learning features that disentangle the factors of variation in the data, one promising direction follows the development of bilinear (Tenenbaum and Freeman, 2000; Grimes and Rao, 2005) and multilinear (Vasilescu and Terzopoulos, 2005) models. Bilinear models are essentially linear models where the latent state is factored into the product of two variables. Formally, the elements of the observation x are given by xk = Σi Σj Wijk yi zj for all k, where yi and zj are elements of the two latent factors (y and z) representing the observation and Wijk is an element of the tensor of model parameters (Tenenbaum and Freeman, 2000). The tensor W can be thought of as a generalization of the typical weight matrix found in most unsupervised models we have considered above.
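The bilinear form is easy to state in code: every output element is a sum over products of one element from each latent factor, weighted by a third-order parameter tensor; the dimensions below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n_y, n_z, n_x = 4, 3, 8                  # sizes of the two factors and of the observation
W = rng.normal(size=(n_y, n_z, n_x))     # tensor of model parameters W_ijk

y = rng.normal(size=n_y)                 # e.g., a "style" factor
z = rng.normal(size=n_z)                 # e.g., a "content" factor

# x_k = sum_i sum_j W_ijk * y_i * z_j
x = np.einsum('ijk,i,j->k', W, y, z)
print(x.shape)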
Tenenbaum and Freeman (2000) developed an EM-based algorithm to learn the model parameters and demonstrated, using images of letters from a set of distinct fonts, that the model could disentangle the style (font characteristics) from content
2 As an aside, if we are given only the values of the higher-level pooling features, we cannot accurately recover the data because we do not know how to apportion credit for the pooling feature values to the lower-level features. This is simply the generative version of the consequences of the loss of information caused by pooling.
(letter identity). Grimes and Rao (2005) later developed a bilinear sparse coding model of a similar form as described above but included additional terms to the objective function to render the elements of both y and z sparse. They used the model to develop transformation invariant features of natural images.
Multilinear models are simply a generalization of the bilinear model where the number of factors that can be composed together is 2 or more. Vasilescu and Terzopoulos (2005) develop a multilinear ICA model, which they use to model images of faces, to disentangle factors of variation such as illumination, views (orientation of the image plane relative to the face) and identities of the people.
Neuroscientific Basis for Disentangling. Interestingly there is even some evidence from neuroscience that bilinear feature composition may play a role in the brain (Olshausen and Field, 2005). Olshausen et al. (1993) proposed a model of visual attention for forming position- and scale-invariant representations of objects. The model relies on a set of control neurons that dynamically and selectively routes information from lower level cortical areas (V1) to higher cortical areas. If we interpret the bilinear model as a standard linear model, by folding one of the latent state factors (say, y) into the parametrization of a dynamic weight matrix, then we can interpret these control neurons as playing the role of y, routing information from x to z.
6 On the Importance of Top-Down Connections
Whereas RBMs and other layer-wise representation learning algorithms are interesting as a means of initializing deeper models, we believe that it is important to develop algorithms that can coordinate the representations at multiple levels of a deep architecture. For example, to properly interpret a particular patch as an eye and estimate its characteristics (e.g. its pose and color) in the features computed at a particular level of a hierarchy, it helps to consider the top-down influence of higher-level face features (for the face to which this eye hypothetically belongs), to make sure that both eyes are generally in agreement about pose and color (this agreement being mediated by pose and eye-color features at the level of the face). If we imagine that the value of a particular intermediate-level feature (maybe with a receptive field that only covers a part of the whole input) is associated with a belief about some factor of variation (within that receptive field), consider how this belief could be improved by taking into account the higher-level abstractions that are captured at higher levels of the hierarchy. Observations of the mechanisms and role of attention in brains clearly suggest that top-down connections are used to bias and “clean up” the activations of lower-level detectors. In the case of images, where feature detectors are associated with a receptive field, the top-down connections allow one to take context into account when interpreting what is going on in a particular patch of the image. Top-down connections allow the “conclusions” reached in the higher levels to influence those obtained in the lower levels, but since the latter also influence the
former, some kind of inference mechanism is needed (to get a handle on appropriate values for the features associated with all the layers, given the network input).
In the Deep Learning literature, this approach is currently solely seen in the work on Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009; Salakhutdinov, 2010; Salakhutdinov and Larochelle, 2010). A Deep Boltzmann Machine is just a Boltzmann machine whose hidden units are organized in layers, and with connections mostly between consecutive layers. A Deep Boltzmann Machine differs from a Deep Belief Network in that proper inference (that involves top-down connections) can be done, albeit at the price of running an MCMC or a variational approximation each time an input is seen. Interestingly, whereas a mean-field variational approximation (which gives a bound on the quantity of interest) can be used in the positive phase of Boltzmann machine gradient computation, it would hurt to do it in the negative phase, hence Salakhutdinov and Hinton (2009); Salakhutdinov (2010) recommend mean-field approximations for the positive phase and Gibbs sampling for the negative phase. Because of the overhead involved in both phases, an interesting thread of research is the exploration of heuristics aiming to speed up inference (Salakhutdinov and Larochelle, 2010), for example by training a feedforward approximation of the posterior which is a good initialization for the iterative inference procedure. This idea is already present in some form in Predictive Sparse Decomposition (Kavukcuoglu et al., 2008) (see section 2.4).
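As a rough, self-contained illustration of the kind of iterative inference referred to above, the following Python sketch runs mean-field updates for a Boltzmann machine with two hidden layers; the shapes, random parameters and number of iterations are assumptions made only for this example, and a full training procedure would combine such a positive-phase estimate with a sampling-based negative phase as described in the cited work.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dbm_mean_field(v, W1, W2, b1, b2, n_iter=20):
    # Mean-field estimates of the hidden-unit marginals of a 2-hidden-layer DBM.
    # v: visible vector (n_v,); W1: (n_v, n_h1); W2: (n_h1, n_h2).
    mu1 = sigmoid(v @ W1 + b1)            # bottom-up initialization of layer 1
    mu2 = sigmoid(mu1 @ W2 + b2)          # bottom-up initialization of layer 2
    for _ in range(n_iter):
        # layer 1 is informed both by the input (bottom-up) and by layer 2 (top-down)
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T + b1)
        mu2 = sigmoid(mu1 @ W2 + b2)
    return mu1, mu2

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=6).astype(float)
mu1, mu2 = dbm_mean_field(v, rng.normal(size=(6, 4)), rng.normal(size=(4, 3)),
                          np.zeros(4), np.zeros(3))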
We hypothesize that deep architectures with top-down connections used to coordinate representations at all levels can give rise to “better representations”, that can be more invariant to many of the changes in input. Indeed, invariance is achieved by composing non-linearities in the appropriate way, and the outcome of inference involves many iterations back and forth through the deep architecture. When unfolded in the time course of these iterations, this corresponds to an even deeper network, but with shared parameters (across time), like in a recurrent neural network. If an intermediate-level feature is informed by higher-level interpretations of the input, then it can be “cleaned-up” to better reflect the prior captured by higher levels, making it more robust to changes in the input associated with “noise” (i.e. variations in the directions not very present in the input distribution, and hence not well represented in the higher levels).
7 Conclusion
Unsupervised learning of representations promises to be a key ingredient in learning hierarchies of features and abstractions from data. This chapter asked: “why learn representations?”, “why deep ones?”, and especially “what makes a good representation?”. We have provided some partial answers, but since many algorithms and training criteria have already been proposed for unsupervised learning of representations, the right answer is not clear, and we would like to see algorithms more directly motivated by this last question. In particular, the chapter focuses on the question of invariance, and what architectures and criteria could help discover features that are invariant to major factors of variation in the data. We argue that a more
appropriate objective is to discover representations that disentangle the factors of variation: keeping all of them in the learned representation but making each feature invariant to most of the factors of variation.
References
Bagnell, J.A., Bradley, D.M.: Differentiable sparse coding. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21 (NIPS 2008), pp. 113120 (2009)
Barron, A.E.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory 39, 930945 (1993)
Bartfeld, E., Grinvald, A.: Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular-dominance columns in primate striate cortex. Proc. Nati. Acad. Sci. USA 89, 1190511909 (1992)
Becker, S., Hinton, G.E.: Learning mixture models of spatial coherence. Neural Computation 5, 267277 (1993)
Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), 1127 (2009); also published as a book. Now Publishers (2009)
Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Computation 21(6), 16011621 (2009)
Bengio, Y., Delalleau, O.: Shallow versus deep sum-product networks. In: The Learning Workshop, Fort Lauderdale, Florida (2011)
Bengio, Y., LeCun, Y.: Scaling learning algorithms towards AI. In: Bottou, L., Chapelle, O., DeCoste, D., Weston, J. (eds.) Large Scale Kernel Machines. MIT Press (2007)
Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., Ouimet, M.: Spectral Dimensionality Reduction. In: Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) Feature Extraction, Foundations and Applications, vol. 207, pp. 519550. Springer, Heidelberg (2006)
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 153–160. MIT Press (2007)
Bengio, Y., Bastien, F., Bergeron, A., Boulanger-Lewandowski, N., Chherawala, Y., Cisse, M., Côté, M., Erhan, D., Eustache, J., Glorot, X., Muller, X., Pannetier-Lebeuf, S., Pascanu, R., Savard, F., Sicard, G.: Deep self-taught learning for handwritten character recognition. In: NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop (2010)
Berkes, P., Wiskott, L.: Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision 5(6), 579602 (2005)
Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59, 291294 (1988)
Braverman, M.: Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM 54(4), 108115 (2011)
Breuleux, O., Bengio, Y., Vincent, P.: Quickly generating representative samples from an RBM-derived process. Neural Computation 23(8), 20582073 (2011)
Bromley, J., Benz, J., Bottou, L., Guyon, I., Jackel, L., LeCun, Y., Moore, C., Sackinger, E., Shah, R.: Signature verification using a siamese time delay neural network. In: Advances in Pattern Recognition Systems using Neural Network Technologies, pp. 669687. World Scientific, Singapore (1993)
Cadieu, C., Olshausen, B.: Learning transformational invariants from natural movies. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21, pp. 209216. MIT Press (2009)
Cardoso, J.-F.: Multidimensional independent component analysis. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. 19411944 (1998)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR 2005). IEEE Press (2005)
Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), pp. 160167. ACM (2008)
Courville, A., Bergstra, J., Bengio, Y.: A spike and slab restricted Boltzmann machine. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011) (2011)
Desjardins, G., Courville, A., Bengio, Y., Vincent, P., Delalleau, O.: Tempered Markov chain Monte-Carlo for training of restricted Boltzmann machine. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pp. 145152 (2010)
Erhan, D., Courville, A., Bengio, Y.: Understanding representations learned in deep architectures. Technical Report 1355, Université de Montréal/DIRO (2010a)
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., Bengio, S.: Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11, 625660 (2010b)
Goodfellow, I., Le, Q., Saxe, A., Ng, A.: Measuring invariances in deep networks. In: Bengio, Y., Schuurmans, D., Williams, C., Lafferty, J., Culotta, A. (eds.) Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 646654 (2009)
Grimes, D.B., Rao, R.P.: Bilinear sparse coding for invariant vision. Neural Computation 17(1), 4773 (2005)
Gutmann, M., Hyvarinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010 (2010)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR 2006), pp. 17351742. IEEE Press (2006)
Håstad, J.: Almost optimal lower bounds for small depth circuits. In: Proceedings of the 18th Annual ACM Symposium on Theory of Computing, Berkeley, California, pp. 6–20. ACM Press (1986)
Håstad, J., Goldmann, M.: On the power of small-depth threshold circuits. Computational Complexity 1, 113–129 (1991)
Hinton, G.E.: Products of experts. In: Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN), Edinburgh, Scotland, vol. 1, pp. 16. IEE (1999)
Hinton, G.E., Salakhutdinov, R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504507 (2006)
Hinton, G.E., Zemel, R.S.: Autoencoders, minimum description length, and Helmholtz free energy. In: Cowan, D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems 6 (NIPS 1993), pp. 310. Morgan Kaufmann Publishers, Inc. (1994)
Hinton, G.E., Sejnowski, T.J., Ackley, D.H.: Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science (1984)
Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets. Neural Computation 18, 15271554 (2006)
Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology 148, 574–591 (1959)
Hurri, J., Hyvärinen, A.: Temporal coherence, natural image sequences, and the visual cortex. In: Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 141–148 (2003)
Hyvärinen, A.: Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research 6, 695–709 (2005)
Hyvärinen, A., Hoyer, P.: Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation 12(7), 1705–1720 (2000)
Hyvärinen, A., Hoyer, P.O., Inki, M.O.: Topographic independent component analysis. Neural Computation 13(7), 1527–1558 (2001)
Jain, V., Seung, S.H.: Natural image denoising with convolutional networks. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21 (NIPS 2008), pp. 769776 (2008)
Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: Proc. International Conference on Computer Vision (ICCV 2009), pp. 21462153. IEEE (2009)
Jenatton, R., Audibert, J.-Y., Bach, F.: Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523 (2009)
Jordan, M.I.: Learning in Graphical Models. Kluwer, Dordrecht (1998)
Kavukcuoglu, K., Ranzato, M., LeCun, Y.: Fast inference in sparse coding algorithms with applications to object recognition. Technical report, Computational and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01 (2008)
Kavukcuoglu, K., Ranzato, M., Fergus, R., LeCun, Y.: Learning invariant features through topographic filter maps. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR 2009), pp. 1605–1612. IEEE (2009)
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., LeCun, Y.: Learning convolutional feature hierarchies for visual recognition. In: Advances in Neural Information Processing Systems 23 (NIPS 2010), pp. 1090–1098 (2010)
Kingma, D., LeCun, Y.: Regularized estimation of image statistics by score matching. In: Lafferty, J., Williams, C.K.I., Shawe-Taylor, J., Zemel, R., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 1126–1134 (2010)
Klampfl, S., Maass, W.: Replacing supervised classification learning by slow feature analysis in spiking neural networks. In: Bengio, Y., Schuurmans, D., Williams, C., Lafferty, J., Culotta, A. (eds.) Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 988–996 (2009)
Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480 (1990)
Kohonen, T.: Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics 75, 281–291 (1996), doi:10.1007/s004220050295
Kohonen, T., Nemeth, G., Bry, K.-J., Jalanko, M., Riittinen, H.: Spectral classification of phonemes by learning subspaces. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1979, vol. 4, pp. 97–100 (1979)
Körding, K.P., Kayser, C., Einhäuser, W., König, P.: How are complex cell properties adapted to the statistics of natural stimuli? Journal of Neurophysiology 91, 206–212 (2004)
Krizhevsky, A.: Convolutional deep belief networks on cifar-10 (2010) (unpublished manuscript) http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf
Kurkova, V., Sanguineti, M.: Geometric upper bounds on rates of variable-basis approximation. IEEE Trans. on Information Theory 54, 56815688 (2008)
Larochelle, H., Erhan, D., Bengio, Y.: Zero-data learning of new tasks. In: Proceedings of the 23rd National Conference on Artificial Intelligence, vol. 2, pp. 646651. AAAI Press (2008)
Le, Q., Ngiam, J., Chen, Z., Hao Chia, D.J., Koh, P.W., Ng, A.: Tiled convolutional neural networks. In: Lafferty, J., Williams, C.K.I., Shawe-Taylor, J., Zemel, R., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23 (NIPS 2010), pp. 1279 1287 (2010)
Le Roux, N., Bengio, Y.: Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation 20(6), 16311649 (2008)
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541551 (1989)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 22782324 (1998)
Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Bottou, L., Littman, M. (eds.) Proceedings of the Twenty-Sixth International Conference on Machine Learning (ICML 2009). ACM, Montreal (2009a)
Lee, H., Pham, P., Largman, Y., Ng, A.: Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Bengio, Y., Schuurmans, D., Williams, C., Lafferty, J., Culotta, A. (eds.) Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 10961104 (2009b)
Lee, J.A., Verleysen, M.: Nonlinear dimensionality reduction. Springer (2007)
Manzagol, P.-A., Bertin-Mahieux, T., Eck, D.: On the use of sparse time-relative auditory codes for music. In: Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), pp. 603–608 (2008)
Olshausen, B., Field, D.J.: How close are we to understanding V1? Neural Computation 17, 1665–1699 (2005)
Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37, 3311–3325 (1997)
Olshausen, B.A., Anderson, C.H., Van Essen, D.C.: A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci. 13(11), 4700–4719 (1993)
Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y.: Self-taught learning: transfer learning from unlabeled data. In: Ghahramani, Z. (ed.) Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), pp. 759–766. ACM (2007)
Ranzato, M., Hinton, G.H.: Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR 2010), pp. 2551–2558. IEEE Press (2010)
Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 1137–1144. MIT Press (2007a)
Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: NIPS 2006 (2007b)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533536 (1986)
Salakhutdinov, R.: Learning deep Boltzmann machines using adaptive MCMC. In: Bottou, L., Littman, M. (eds.) Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML 2010), vol. 1, pp. 943950. ACM (2010)
Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), vol. 5, pp. 448455 (2009)
Salakhutdinov, R., Larochelle, H.: Efficient learning of deep Boltzmann machines. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP, vol. 9, pp. 693700 (2010)
Saul, L., Roweis, S.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119155 (2002)
Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge (2002)
Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., Poggio, T.: A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function 165, 3356 (2007)
Smith, E.C., Lewicki, M.S.: Efficient auditory coding. Nature 439(7079), 978–982 (2006)
Smolensky, P.: Information processing in dynamical systems: Foundations of harmony theory. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing, ch. 6, vol. 1, pp. 194–281. MIT Press, Cambridge (1986)
Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional Learning of Spatio-temporal Features. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI. LNCS, vol. 6316, pp. 140–153. Springer, Heidelberg (2010)
Tenenbaum, J.B., Freeman, W.T.: Separating Style and Content with Bilinear Models. Neural Computation 12(6), 1247–1283 (2000)
Tieleman, T.: Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), pp. 1064–1071. ACM (2008)
Tieleman, T., Hinton, G.: Using fast weights to improve persistent contrastive divergence. In: Bottou, L., Littman, M. (eds.) Proceedings of the Twenty-Sixth International Conference on Machine Learning (ICML 2009), pp. 1033–1040. ACM (2009)
Turaga, S.C., Murray, J.F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk, W., Seung, H.S.: Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation 22, 511–538 (2010)
Vasilescu, M.A.O., Terzopoulos, D.: Multilinear independent components analysis. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 547–553 (2005)
Vincent, P.: A connection between score matching and denoising autoencoders. Neural Computation 23(7), 1661–1674 (2011)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), pp. 10961103. ACM (2008)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, 33713408 (2010)
Wang, H., Ullah, M.M., Kla¨ser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: British Machine Vision Conference (BMVC), London, UK, p. 127 (2009)
Welling, M.: Herding dynamic weights for partially observed random field models. In: Proceedings of the 25th Conference in Uncertainty in Artificial Intelligence (UAI 2009). Morgan Kaufmann (2009)
Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), pp. 11681175. ACM, New York (2008)
Wiskott, L., Sejnowski, T.: Slow feature analysis: Unsupervised learning of invariances. Neural Computation 14(4), 715770 (2002)
Younes, L.: On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports 65(3), 177228 (1999)
Chapter 2
Recurrent Neural Networks
Sajid A. Marhon, Christopher J.F. Cameron, and Stefan C. Kremer
1 Introduction
This chapter presents an introduction to recurrent neural networks for readers familiar with artificial neural networks in general, and multi-layer perceptrons trained with gradient descent algorithms (back-propagation) in particular. A recurrent neural network (RNN) is an artificial neural network with internal loops. These internal loops induce recursive dynamics in the networks and thus introduce delayed activation dependencies across the processing elements (PEs) in the network.
While most neural networks use distributed representations, where information is encoded across the activation values of multiple PEs, in recurrent networks a second kind of distributed representation is possible. In RNNs, it is also possible to represent information in the time varying activations of one or more PEs. Since information can be encoded spatially, across different PEs, and also temporally, these networks are sometimes also called Spatio-Temporal Networks [29].
In a RNN, because time is continuous, the PE activations form a dynamical system which can be described by a system of integral equations, and PE activations can be determined by integrating the contributions of earlier activations over a temporal kernel. If the activation of PE j at time t is denoted by $y_j(t)$, the temporal kernel by $k(\cdot)$, and f is a transformation function (typically a sigmoidal non-linearity), then

$y_j(t) = f\Big( \sum_i \int_{t'=0}^{t} k_{ji}(t - t')\, y_i(t')\, \partial t' \Big).$   (1)
This formulation defines a system of integral equations in which the variables y j(t), for different values of j, depend temporally on each other. If we place no restrictions on the indices, i and j, in the kernels, k, then these temporal dependencies can
Sajid A. Marhon · Christopher J.F. Cameron · Stefan C. Kremer The School of Computer Science at the University of Guelph, Guelph, Ontario e-mail: {smarhon,ccameron,skremer}@uoguelph.ca
contain loops. If we do place restrictions on the indices, then we can restrict the graphical topologies of the dependencies to have no loops. In this case, the only temporal behaviour is that encoded directly in the kernel, k, that filters the inputs to the network. In both cases, but particularly in the latter, the kernel, k, is defined by a design choice to model a specific type of temporal dependency.
For the purposes of simulating these systems on digital computers, it is convenient to describe the evolution of the activation values in the models at fixed, regular time intervals. The frequency of the simulation required to maintain fidelity with the original model is governed by Nyquist's Theorem [45]. In this case, the dynamics of the model can be expressed without the necessity of an explicit integration over a temporal kernel. In fact, many models are described only in terms of activations that are defined on the activations at a previous time-step. If $w_{ji}$ is used to represent the impact of the previous time-step's value of PE i on PE j, then
$y_j(t) = f\big(p_j(t)\big),$   (2)

where

$p_j(t) = \sum_i w_{ji} \cdot y_i(t-1).$   (3)
There are a number of operational paradigms in which RNNs can be applied:
• Vector to Vector mapping — Conventional multi-layer perceptrons (MLPs) map vectors of inputs into vectors of outputs (possibly using some intermediate hidden layer activations in the process). MLPs can be viewed as special cases of RNNs in which there either are no recurrent connections, or all recurrent connections have weights of zero. Additionally, winner-take-all networks, Hopfield networks and other similar networks which receive a static input and are left to converge before an output is withdrawn, fall into this category.
• Sequence to Vector mapping — RNNs are capable of processing a sequence of input activation vectors over time, and finally rendering an output vector as a result. If the output vector is then post-processed, to map it to one of a finite number of categories, these networks can be used to classify input sequences. In a degenerate case there may be a single time-varying input, resulting in a device that maps a single input signal to a category; but in general, a number of input values that vary in time can be used.
• Vector to Sequence mapping — Less common are RNNs used to generate an output sequence in response to a single input pattern. These are generative models for sequences, in which a dynamical system is allowed to evolve under a constant input signal.
• Sequence to Sequence mapping — The final case is the one where both input and output are vector sequences that vary over time. This is the case of sequence transduction and is the most general of the four cases. In general, this approach assumes a synchronization between the sequences in the sense that there is a one-to-one mapping between activation values at all points in time among the input and output sequences (for an asynchronous exception see [11]).
The most obvious application of these networks is, of course, to problems where input signals arrive over time, or outputs are required to be generated over time. But, this is not the only way in which these systems are used. Sometimes it is desirable to convert a problem that is not temporal in nature into one that is. A good example of such a problem is one in which input sequences of varying lengths must be processed.
If input sequences vary in length, the use of MLPs may impose an unnatural encoding of the input. One can either: 1) use a network with an input vector large enough to accommodate the longest sequence and pad shorter inputs with zeros, or 2) compress the input sequence into a smaller, fixed-length vector. The first option is impractical for long sequences or when the maximum sequence length is unbounded, and generally leads to non-parsimonious solutions. The second option is very effective when prior knowledge about the problem exists to formulate a compressed representation that does not lose any information that is germane to the problem at hand. Some approaches to reducing input sequences into vectors use a temporal window which only reveals some parts (a finite vector) of the input sequence to the network, or use a signal processing technique and feature reduction to compress the information. But, in many cases, such a priori knowledge is not available. Clearly, reducing the amount of information available to the network can be disastrous if information relevant to the intended solution is lost in the process.
Another scenario is one in which input sequences are very long. A long input sequence necessitates a large number of inputs which in turn implies a large number of parameters, corresponding to the connection weights from these inputs to subsequent PEs. If a problem is very simple, then having a large number of parameters typically leads to overfitting. Moreover, if such a network is trained using sequences with some given maximum length, and then evaluated on longer sequences, the network will not only be incapable of correctly generalizing reasonable outputs for these longer patterns, it will not even be able to process a pattern that is longer than its maximum input size at all. By contrast a RNN with only a single input and a moderate number of other PEs could very well solve the problem. The reduced number of parameters in the latter system not only reduces the chance of overfitting, but also simplifies the training process.
Thus, RNNs represent a useful and sometimes essential alternative to conventional networks that every neural network practitioner should be aware of.
This chapter is organized as follows. We begin with a section on "Architecture" in which we present a very general RNN architecture which subsumes all other RNN and MLP models. Next we explore some topologies and some very specific models. We then present a section on "Memory" in which we describe different ways in which RNNs incorporate information about the past. Again we begin with a general model and then present specific examples. The third section of the chapter is about "Learning". In it, we describe the methods for updating the parameters of RNNs to improve performance on training datasets (and hopefully unseen test-data). We also describe an important limitation of all gradient based approaches to adapting the parameters of RNNs. In the section titled "Modeling", we compare these systems to more traditional computational models. We focus on a comparison between
RNNs and finite state automata, but also mention other computational models. The "Applications" section describes some example applications to give the reader some insight into how these networks can be applied to real-world problems. Finally, the "Conclusion" presents a summary, and some remarks on the future of the field.
2 Architecture
In this section, we discuss RNN architectures. As mentioned before, RNNs are a generalization of MLPs, so it is appropriate to begin our discussion with this more familiar architecture. A typical MLP architecture is shown in Figure 1. In this figure, the mid-sized, white circles represent the input units of the network, solid arrows are connections between input and/or PEs, and large circles are PEs. The network is organized into a number of node layers; each layer is fully connected with its preceding and subsequent layers (except of course the original input and terminal output layers). Note that there are no recurrent connections in this network (e.g. amongst PEs in the same layer, or from PEs in subsequent layers to PEs in previous layers). The activation value of each PE y j in the first PE-layer of this network can be computed as
$y_j = f(p_j),$   (4)

where $f(p_j)$ is generally a non-linear squashing function, such as the sigmoid function:

$f(p_j) = \frac{1}{1 + e^{-p_j}},$   (5)

$p_j = \sum_i w_{ji} \cdot x_i + b_j,$   (6)

where $w_{ji}$ represents the weight of the connection from input i to PE j, $p_j$ is the weighted sum of the inputs to PE j, $b_j$ is a bias term associated with PE j, and $x_i$ is the i-th component of the input vector. For subsequent layers, the formula for the weighted summation (Equation 6) is changed to:

$p_j = \sum_i w_{ji} \cdot y_i + b_j,$   (7)

where, this time, $p_j$ is the weighted sum of the activation values of the PEs in the previous layer.
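The following short Python sketch spells out Equations 4-7 for a small fully-connected MLP; the layer sizes and the random weights are arbitrary assumptions used only for illustration.

import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))          # Equation 5

def mlp_forward(x, layers):
    # layers: list of (W, b) pairs, with W of shape (n_out, n_in)
    y = x
    for W, b in layers:
        p = W @ y + b                         # Equation 6 for the first layer, Equation 7 afterwards
        y = sigmoid(p)                        # Equation 4
    return y

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),   # input -> hidden
          (rng.normal(size=(3, 8)), np.zeros(3))]   # hidden -> output
print(mlp_forward(rng.normal(size=4), layers))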
In a MLP network, this formulation assumes that the weights w ji for any units j and i not in immediately subsequent layers are zero. We can relax this restriction somewhat and still maintain a feedforward neural network (FNN) by requiring only that the weights w ji for j ≤ i be zero. This results in a cascade [10] architecture in which every input and every lower indexed unit can connect to every higher index unit. This generalization also allows us to identify an extra input i = 0 whose value is
Fig. 1 A multi-layer perceptron neural network
Fig. 2 A cascade architecture of FNN
permanently fixed at y0 = 1 to act as a biasing influence via the connections w j0 = b j to all other PEs j. An example of a cascade architecture is shown in Figure 2.
We now go one step further by removing all restrictions on weights to form a fully connected recurrent network (FCRN) in which every PE is connected to every other PE (including itself). At this stage, it becomes necessary to introduce a temporal index into our notation in order to disambiguate the activation values and avoid otherwise unsatisfiable equalities. So, we can now formulate the weighted summations of the PEs as
$p_j(t) = \sum_h w_{jh} \cdot y_h(t-1) + \sum_i w_{ji} \cdot x_i(t),$   (8)
where the index h is used to sum over the PEs, and the index i sums over the inputs. We assume that the weight values do not change over time (at least not at the same
Fig. 3 A fully-connected recurrent network
scale as the activation values). A FCRN is shown in Figure 3. Note that these figures become very difficult to draw as the number of PEs is increased, since there are no topological restrictions to organize elements into layers. However, it is possible to produce a simplified illustration that matches Equation 8, as shown in Figure 4. Note that in this figure time is being represented spatially by the appearance of an additional, virtual layer of PEs that are in fact the same PEs as already shown, but in a previous time-step. This virtual layer has been called a "context" layer [8]. We add a series of additional connections labeled $z^{-1}$ from the PEs to the context units in order to indicate the temporally delayed copy of activation values implied in the system. Figure 5 shows the RNN architecture in Figure 4 as a block diagram. In this figure and all the following figures that present block diagrams in this chapter, the thick arrows indicate full connectivity (many-to-many) between the units in the linked sets. However, the thin bold arrows indicate unit-to-unit connectivity (one-to-one).
There are a few important things to note about the FCRN architecture presented. First, it is the most general. The MLP and FNN architectures can be implemented within the FCRN paradigm by restricting some of the weights to zero values as indicated in the first two paragraphs of this section. Second, any discrete time RNN can be represented as a FNN. This includes the specific layered recurrent networks discussed in the following sub-sections. Third, the networks introduce an additional parameter set whose values need to be determined. Namely, the initial values of the PE activations y j(0). Finally, the illustration of activation value calculation in Figure 4 can be solved recursively all the way to the base-case at t = 0, as shown in Figure 6.
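A minimal simulation of Equation 8 may help fix the idea of the context layer: at every time step the network combines its previous activations with the current input. The dimensions, the random weights and the choice y(0) = 0 below are illustrative assumptions only.

import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

def fcrn_run(x_seq, W_rec, W_in, y0):
    # x_seq: (T, n_inputs); W_rec: (n_pe, n_pe), the w_jh of Equation 8;
    # W_in: (n_pe, n_inputs), the w_ji of Equation 8; y0: initial activations y(0).
    y = y0
    history = []
    for x_t in x_seq:
        y = sigmoid(W_rec @ y + W_in @ x_t)   # Equation 8 followed by the squashing function
        history.append(y)
    return np.array(history)

rng = np.random.default_rng(1)
ys = fcrn_run(rng.normal(size=(6, 2)),        # a sequence of 6 two-dimensional inputs
              0.1 * rng.normal(size=(4, 4)),
              rng.normal(size=(4, 2)),
              np.zeros(4))
print(ys.shape)                               # (6, 4): one activation vector per time step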
Fig. 4 A RNN including a context layer
2.1 Connectionist Network Topologies
While the general FCRN described in the previous subsection is often used, many other RNNs are structured in layers. A RNN includes an input layer, output layer and typically one or more hidden layers. Each layer consists of a set of PEs. The feedback connections, which are specific to RNNs, can exist within or between any of the network layers. Typically, the inputs to the PE, in a RNN, are from other PEs in a preceding layer and delayed feedback from the PE itself or from other PEs in the same layer or in a successive layer. The sum of the inputs is presented as an activation to a nonlinear function to produce the activation value of the PE.
In RNNs, the topology of the feedforward connections is similar to MLPs. However, the topology of feedback connections, which is limited to RNNs, can be classified into locally recurrent, non-local recurrent and globally recurrent connections. In locally recurrent connections, a feedback connection originates from the output of a PE and feeds back the PE itself. In non-local recurrent connections, a feedback connection links the output of a PE to the input of another PE in the same layer.
Fig. 5 The block diagram of the RNN architecture in Figure 4
Fig. 6 The unrolled architecture
In globally recurrent connections, the feedback connection is between two PEs in different layers. If we extend this terminology to feedforward connections, all MLPs are considered as global feedforward networks. The non-local recurrent connection class is a special case of the globally recurrent connection class. Based on the feedback topologies, the architecture of RNNs can take different forms as follows:
2.1.1 Locally Recurrent Globally Feedforward (LRGF) Networks
In this class of recurrent networks, recurrent connections can occur in a hidden layer or the output layer. All feedback connections are at the intra-PE level; there are no feedback connections among different PEs [51]. When the feedback connection is in the first PE layer of the network, the activation value of PE j is computed as follows:
$y_j(t) = f\Big( w_{jj} \cdot y_j(t-1) + \sum_i w_{ji} \cdot x_i(t) \Big)$   (9)

where $w_{jj}$ is the intensity factor at the local feedback connection of PE j, the index i sums over the inputs, and $f(\cdot)$ is a nonlinear function, usually a sigmoid function as denoted in Equation 5. For subsequent layers, Equation 9 is changed to:

$y_j(t) = f\Big( w_{jj} \cdot y_j(t-1) + \sum_i w_{ji} \cdot y_i(t) \Big)$   (10)
where yi(t) are the activation values of the PEs in the preceding PE layer. There are three different models of LRGF networks depending on the localization
of the feedback.
Local Activation Feedback — In this model, the feedback can be a delayed version of the activation of the PE. The local activation feedback model was studied by [12]. This model can be described by the following equations:
$y_j(t) = f\big(p_j(t)\big),$   (11)

$p_j(t) = \sum_{t'=1}^{m} w_{jj}^{t'} \cdot p_j(t-t') + \sum_i w_{ji} \cdot x_i(t),$   (12)

where $p_j(t)$ is the activation at time step t, $t'$ is a summation index over the number of delays in the system, the index i sums over the system inputs, and $w_{jj}^{t'}$ is the weight of the activation feedback of $p_j(t-t')$. Figure 7 illustrates the architecture of this model.
Fig. 7 A PE with local activation feedback
Local Output Feedback — In this model, the feedback is a delayed version of the activation output of the PE. This model was introduced by Gori et al. [19]. This feedback model can be illustrated as in Figure 8a, and its mathematical formulation can be given as follows:
$y_j(t) = f\Big( w_{jj} \cdot y_j(t-1) + \sum_i w_{ji} \cdot x_i(t) \Big).$   (13)

Their model can be generalized by taking the feedback through a series of delay units and feeding it back to the input of the PE, as illustrated in Figure 8b. The mathematical formulation can be given as follows:

$y_j(t) = f\Big( \sum_{t'=1}^{m} w_{jj}^{t'} \cdot y_j(t-t') + \sum_i w_{ji} \cdot x_i(t) \Big),$   (14)

where $w_{jj}^{t'}$ is the intensity factor of the output feedback at time delay $z^{-t'}$, and the index i sums over the input units. From Equation 14, it can be noticed that the output of the PE is filtered by a finite impulse response (FIR) filter.
Fig. 8 A PE with local output feedback. a) With one delay unit. b) With a series of delay units.
Local Synapse Feedback — In this model, each synapse may include a feedback structure, and all feedback synapses are summed to produce the activation of the PE. The local activation feedback model is a special case of the local synapse feedback model, since each synapse represents an individual local activation feedback structure. The local synapse feedback model represents a FIR filter or an infinite impulse response (IIR) filter [3]. A network of this model is called a FIR MLP or an IIR MLP when the network incorporates FIR synapses or IIR synapses respectively, since the globally feedforward nature of this class of networks makes it identical to MLP networks [3, 32]. Complex structures can be designed to incorporate a combination of both FIR synapses and IIR synapses [3].
In this model, a linear transfer function with poles and zeros is introduced with each synapse instead of a constant synapse weight. Figure 9 illustrates a PE architecture of this model. The mathematical description of the PE can be formulated as follows:
$y_j(t) = f\Big( \sum_i G_i(z^{-1}) \cdot x_i(t) \Big),$   (15)

$G_i(z^{-1}) = \frac{\sum_{l=0}^{q} b_l z^{-l}}{\sum_{l=0}^{r} a_l z^{-l}},$   (16)

where $G_i(z^{-1})$ is a linear transfer function, and $b_l$ (l = 0, 1, 2, ..., q) and $a_l$ (l = 0, 1, 2, ..., r) are its zero and pole coefficients respectively.
Fig. 9 A PE with local synapse feedback
2.1.2 Non-local Recurrent Globally Feedforward (NLRGF) Networks
In this class of RNNs, the feedback connections to a particular PE are allowed to originate from the PE itself (as in LRGF networks) and from other PEs in the same layer. Some researchers classify this type of feedback connection (non-local) as a global feedback connection [37]. Based on the description of NLRGF networks, the Elman [8] and Williams-Zipser [54] architectures are considered examples of this class of networks [8, 26, 54]. Non-local feedback connections can appear in the hidden or output layers. The mathematical description of PE j that has non-local connections in the first PE layer can be given as follows:
$y_j(t) = f\Big( \sum_h w_{hj} \cdot y_h(t-1) + \sum_i w_{ji} \cdot x_i(t) \Big),$   (17)

where $w_{hj}$ is the weight of the feedback connections, including the non-local connections from PE h to PE j (j ≠ h) and the local connection from the PE itself (j = h), the index h sums over the PEs, and the index i sums over the inputs. For subsequent layers, Equation 17 is changed to:

$y_j(t) = f\Big( \sum_h w_{hj} \cdot y_h(t-1) + \sum_i w_{ji} \cdot y_i(t) \Big),$   (18)

where the index i of the summation sums over the PEs in the preceding PE layer.
2.1.3 Globally Recurrent Globally Feedforward (GFGR) Networks
In this class of recurrent networks, the feedback connections are in the inter layer level. The feedback connections originate from PEs in a certain layer and feed back PEs in a preceding layer. It can be from the output layer to a hidden layer, or it can be from a hidden layer to a preceding hidden layer. Like the MLP, feedforward connections are global connections from a particular layer to the successive layer. The difference between this class and the non-local recurrent network class is that in the latter the feedback connections are in the intra layer level, while in the former the feedback connections are in the inter layer level. When the first PE layer in the network receives global recurrent connections from a successive PE layer, the activation value of PE j in the layer that receives this feedback can be computed as follows:
$y_j(t) = f\Big( \sum_h w_{hj} \cdot y_h(t-1) + \sum_i w_{ji} \cdot x_i(t) \Big),$   (19)

where $y_h(t-1)$ is the output of PE h in a successive layer from which the feedback connections originate, and the index i sums over the inputs of PE j. For subsequent layers, Equation 19 is changed to:

$y_j(t) = f\Big( \sum_h w_{hj} \cdot y_h(t-1) + \sum_i w_{ji} \cdot y_i(t) \Big).$   (20)
Jordan's second architecture is one of the models of this class of recurrent networks [25]. In this architecture, the feedback connections from the output layer are fed back to the hidden layer through a context layer. A unit in the context layer serves as an intermediate state in the model. Figure 10 illustrates the block diagram of Jordan's second architecture.
There are other classes of RNN topologies which incorporate two of the previously mentioned topologies. For example, the locally recurrent globally recurrent (LRGR) class represents models that include globally recurrent connections as well
Fig. 10 A block diagram of Jordan's second architecture
as locally recurrent connections. Another class of RNNs includes networks incorporating non-local recurrent connections and globally recurrent connections at the same time [41].
2.2 Specific Architectures
In this section, we present some common RNN architectures which have been proposed in the literature. These architectures can be related to the classes mentioned in Subsection 2.1.
2.2.1 Time Delay Neural Networks (TDNN)
This architecture was proposed by Sejnowski and Rosenberg [44] and applied by Waibel et al. to phoneme recognition [52], and it is a variation of the MLP. It incorporates delay units at the inputs of the PEs. Each input to the PE is delayed with a series of time delays $z^{-i}$ (i = 0, 1, 2, ..., N). The delayed signals are multiplied by weight factors. The sum of the weighted signals is presented to a nonlinear function which is usually a sigmoid function [52]. The mathematical description of the TDNN PE can be formulated as in Equation 21. The architecture of a basic TDNN PE is illustrated in Figure 11.
$y_j(t) = f\Big( \sum_{l=1}^{M} \sum_{i=0}^{N} w_{jli} \cdot x_l(t-i) \Big)$   (21)
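A direct transcription of Equation 21 into Python is shown below; the number of input signals M, the delay depth N, the weights and the input history are made-up values for the sketch.

import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

def tdnn_unit(x_hist, W):
    # Output of one TDNN PE at the current time step (Equation 21).
    # x_hist[l, i] holds x_l(t - i); W[l, i] is the weight w_{jli}; both have shape (M, N + 1).
    return sigmoid(np.sum(W * x_hist))

rng = np.random.default_rng(2)
M, N = 3, 4                                   # 3 input signals, delays 0..4
print(tdnn_unit(rng.normal(size=(M, N + 1)), rng.normal(size=(M, N + 1))))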
The TDNN network can relate and compare the current input to the history of that input. In the mentioned phoneme recognition application, this architecture was shown to have the ability to learn the dynamic structure of the acoustic signal. Moreover, it has the property of translation invariance which means the features learned by this model are not affected by time shifts.
2.2.2 Williams-Zipser Recurrent Networks
This model has already been introduced in this section as the most general form of RNNs. The architecture was proposed by Williams and Zipser [54]. It is called a real-time recurrent network since it was proposed for real-time applications. The network consists of a single layer of PEs. Each PE receives feedback from all other PEs as well as from itself via time delay units. Thus, a fully-connected network is obtained. In addition, the PEs receive external inputs. Each recurrent connection has a unique, adjustable weight to control the intensity of the delayed signal. The diagram of this model is illustrated in Figure 12. The activation of a PE in this network is the same as that in Equation 19. This model can be classified as incorporating both local and non-local recurrent connections.
Fig. 11 A typical TDNN PE architecture. The PE has M input signals. Each input signal is delayed by $z^{0}, \cdots, z^{-N}$. The delayed versions of the input as well as the current input (without delay) are multiplied by weights. The weighted sum of all the input signals is computed and presented to a nonlinear function $f(\cdot)$.
Fig. 12 The architecture of the Williams-Zipser model. Every PE gets feedback from other PEs as well as from itself.
2.2.3 Partially-Connected Recurrent Networks
Unlike the FCRNs, the partially-connected recurrent networks (PCRNs) are based on the MLP. The most obvious examples of this architecture are the Elman [8] and Jordan [25] models. The Elman model consists of three layers, which are the input layer, hidden layer and output layer, in addition to a context layer. The context layer receives feedback from the hidden layer, so its units memorize the output of the hidden layer. At a particular time step, the output of the PEs in the hidden layer
depends on the current input and the output of the hidden PEs in the previous time step. The mathematical formulation of the output of a hidden PE in this model is given in Equation 17, considering that $y_j(t)$ is the output of hidden PE j. This model can be classified as a non-local recurrent model since each hidden PE receives feedback from itself and from other PEs in the hidden layer. The block scheme of Elman's model is illustrated in Figure 13.
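A compact Python sketch of one Elman update follows; the layer sizes, the random weights and the use of a sigmoid at the output are assumptions made for this example.

import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

def elman_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # One Elman step: the context layer supplies h_prev, the hidden state at t-1.
    h_t = sigmoid(W_xh @ x_t + W_hh @ h_prev)   # Equation 17 applied to the hidden layer
    y_t = sigmoid(W_hy @ h_t)                   # nonlinear output layer
    return h_t, y_t

rng = np.random.default_rng(3)
W_xh, W_hh, W_hy = (rng.normal(size=(5, 2)),
                    0.1 * rng.normal(size=(5, 5)),
                    rng.normal(size=(1, 5)))
h = np.zeros(5)
for x_t in rng.normal(size=(4, 2)):             # a short input sequence
    h, y = elman_step(x_t, h, W_xh, W_hh, W_hy)
print(y)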
Fig. 13 The architecture of Elman's model. y(t) is the network output vector. x(t) is the input vector. v(t) is the output vector of the hidden PEs (states).
In contrast to Elman's model, Jordan's second model incorporates feedback connections from the output layer to the PEs in the hidden layer via memory unit delays. A context layer receives feedback from the output layer and feeds it to the PEs in the hidden layer. This model is classified as a globally recurrent network since the feedback connections are global between the output and the hidden layers [25]. Figure 10 illustrates the block diagram of Jordan's second architecture. The mathematical formulation of a hidden PE in Jordan's second architecture is similar to what has been given in Equation 19.
2.2.4 State-Space Recurrent Networks
This model can include one or more hidden layers. The hidden PEs in a specific layer determine the states of the model. The output of this hidden layer is fed back to that layer via a bank of delay units. The feedback topology of this model belongs to the non-local recurrent connection class. The number of hidden PEs (states) that feed back into the layer can vary and determines the order of the model [57]. Figure 14 describes the block diagram of the state-space model. The mathematical description of the model is given by the following two equations:
$v(t) = f\big(v(t-1), x(t-1)\big),$   (22)

$y(t) = C\,v(t),$   (23)
where f(·) is a vector of nonlinear functions, C is the matrix of the weights between the hidden and output layers, v(t) is the vector of the hidden PE activations, x(t) is a vector of source inputs, and y(t) is the vector of output PE activations.
By comparing the diagrams in Figures 13 and 14, it can be noticed that the architecture of the state-space model is similar to Elman's (PCRN) architecture, except that Elman's model uses a nonlinear function in the output layer and there are no delay units in the output of the network. One of the most important features of the state-space model is that it can approximate many nonlinear dynamic functions [57]. There are two other advantages of the state-space model. First, the number of states (the model order) can be selected independently by the user. The other advantage of the model is that the states are accessible from the outside environment, which makes the measurement of the states possible at specific time instances [37].
Fig. 14 The block diagram of the state-space model
2.2.5 Second-Order Recurrent Networks
This model was proposed by Giles et al. [16]. It incorporates a single layer of PEs. It was developed to learn grammars. The PEs in this model are referred to as second-order PEs since the activation of the next state is computed as the multiplication of the previous state with the input signal. The output of each PE is fed back via a time delay unit and multiplied by each input signal. If the network has N feedback states and M input signals, N × M multipliers are used to multiply every single feedback state by every single input signal [16]. Thus, the activation value $y_j(t)$ of PE j can be computed as follows:
$y_j(t) = f\Big( \sum_i \sum_l w_{jil}\, y_i(t-1)\, x_l(t-1) \Big),$   (24)

where the weight $w_{jil}$ is applied to the multiplication of the activation value $y_i(t-1)$ and the input $x_l(t-1)$. Figure 15 shows the diagram of a second-order recurrent network.
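Equation 24 can be written very compactly as a tensor contraction; the state size N, input size M and random weights below are arbitrary assumptions for the sketch.

import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

def second_order_step(y_prev, x_prev, W):
    # W has shape (N, N, M): W[j, i, l] weights the product y_i(t-1) * x_l(t-1) (Equation 24).
    return sigmoid(np.einsum('jil,i,l->j', W, y_prev, x_prev))

rng = np.random.default_rng(4)
N, M = 4, 3
y = second_order_step(rng.normal(size=N), rng.normal(size=M), rng.normal(size=(N, N, M)))
print(y)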
Fig. 15 The block diagram of the second-order recurrent network model
2.2.6 Nonlinear Autoregressive Model with Exogenous Inputs (NARX) Recurrent Networks
In this class of neural networks, the memory elements are incorporated in the input and output layers. The topology of NARX networks is similar to that of finite memory machines, and this has made them good representatives of finite state machines. The model can include input, hidden and output layers. The input to the network is fed via a series of delay units. The output is also fed back to the hidden layer via delay units [6]. The model has been successful in time series and control applications. The architecture of a NARX network with three hidden units is shown in Figure 16. The mathematical description of the model can be given as follows:
$y(t) = f\Big( \sum_{i=1}^{N} a_i\, y(t-i) + \sum_{i=1}^{M} b_i\, x(t-i) \Big),$   (25)

where x(t) is the source input; y(t) is the output of the network; N and M are constants; and $a_i$ and $b_i$ are constant coefficients.
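The delay-line bookkeeping implied by Equation 25 is easy to get wrong, so the sketch below keeps explicit buffers of the last M inputs and N outputs. The orders, coefficients and single tanh nonlinearity are illustrative assumptions; a full NARX network would replace the single nonlinearity by a trained hidden layer.

import numpy as np
from collections import deque

def narx_run(x_seq, a, b, f=np.tanh):
    # Scalar NARX-style recursion following Equation 25.
    # a: coefficients a_1..a_N applied to past outputs; b: coefficients b_1..b_M applied to past inputs.
    N, M = len(a), len(b)
    y_hist = deque([0.0] * N, maxlen=N)       # y(t-1), ..., y(t-N), newest first
    x_hist = deque([0.0] * M, maxlen=M)       # x(t-1), ..., x(t-M), newest first
    outputs = []
    for x_t in x_seq:
        y_t = f(np.dot(a, np.asarray(y_hist)) + np.dot(b, np.asarray(x_hist)))
        outputs.append(y_t)
        y_hist.appendleft(y_t)                # the oldest delayed value drops off the end
        x_hist.appendleft(x_t)
    return np.array(outputs)

print(narx_run(np.sin(np.linspace(0, 6, 20)), a=[0.3, 0.1], b=[0.8, 0.2]))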
In this section, we reviewed the possible topologies of RNN architectures and the most common proposed architectures. The RNN architectures mentioned above have been proposed to tackle different applications: some models target grammatical inference, while others target identification and control of dynamic systems. In addition, there is a coordination between the network architecture and the learning algorithm used for training.
Fig. 16 The architecture of a NARX network
3 Memory
This section of the chapter provides a more detailed account of the importance of memory for RNNs, how the memory works, and the different types of memory.
3.1 Delayed Activations as Memory
Every recurrent network in which activation values from the past are used to compute future activation values incorporates an implicit memory. This implicit memory can stretch back in time over the entire past history of processing. Consider Figure 5, which depicts a network with recurrent connections within the PE Layer. In this illustration, the block labeled Context Layer represents a virtual set of elements that contain a copy of the PE Layer's elements at the previous time step. Figure 6 shows the same network over a number of time intervals. In this figure, the PE Layer's elements can be seen to depend on the entire history of inputs all the way back to t = 0.
3.2 Short-Term Memory and Generic Predictor
In this subsection, neural networks without global feedback paths, but with feedback at the intra-PE level, will be considered. These RNNs consist of two subsystems: a short-term memory and a generic predictor. The short-term memory will be assumed to be of a linear, time-invariant, and causal nature. The generic predictor is considered to be a feedforward neural network predictor; it consists of nonlinear elements (e.g. a sigmoid function) with associated weights and zero or more hidden layers. In the case of this discussion, the generic predictor consists of constant parameters (i.e. weights) and nonlinear elements, and it is time invariant.
This structure consisting of the short-term memory and the generic predictor will be considered and referenced as the time invariant nonlinear short-term memory architecture (TINSTMA); the TINSTMA is alternatively known as a memory kernel [34]. The simple structure of the memory kernel is shown in Figure 17.
Fig. 17 Example of a memory kernel, the class of RNNs being observed in this section
When developing a TINSTMA, there are three issues that must be taken into consideration: architecture, training, and representation. The architecture refers to the internal structural form of a memory kernel, which involves consideration of the number of layers and PEs within the network, the connections formed between these PEs/layers, and the activation function associated with these PEs. Training (as discussed in the next section of the chapter) involves taking into consideration how the kernel (in this case, the internal parameters will be weights) will adapt to the introduction of a set of input patterns to match the targets associated with those patterns. Representation refers to the retention of an input pattern within the short-term memory (i.e. how should it be stored?); the nature and quantity of information to be retained are domain dependent. These three issues are views on the same problem, and thus related. The desired representation for a TINSTMA may or may not alter the architecture of a kernel and, consequently, may affect the design of network training. Alternatively, a given training design may influence the structure and possible representation for a TINSTMA. In the following subsection, possible types of kernels will be discussed [29].
3.3 Types of Memory Kernels
Memory kernels can typically be considered to take one of two forms: either each modular component has the same structural form but with different parameters (i.e. weights), or each modular component has the same structural form with identical weights.
3.3.1 Modular Components with Different Parameters
In this subsection, cases where each modular component in a memory kernel has the same structural form, but allows different parameters in each node, will be discussed.
Tapped Delay Lines. Considered to be the simplest of memory kernels, a tapped delay line (TDL) is a delay line (a series of nodes) with at least one tap. A delay-line tap extracts a signal output from somewhere within the delay line, optionally scales it, and usually sums it with other taps (if any) to form an output signal (shown in Figure 18). A tap may be either interpolating or non-interpolating. A non-interpolating tap extracts the signal at some fixed integer delay relative to the input. A tap thus implements a shorter delay line within a larger one; as a result, tapped delay lines efficiently simulate multiple echoes from the same source signal and are extensively used in the field of artificial reverberation [49].
Fig. 18 Example of a TDL network. The TDL consists of an internal tap located at M1 and a total delay length of M2 samples. Each node may be considered a layer of a multilayer neural network. The output signal of the TDL is a linear combination of the input signal x(t), the delay-line output x(t-M2), and the tap signal x(t-M1).
Therefore, the filter output corresponding to Figure 18:
$$y(t) = b_0\, x(t) + b_{M_1}\, x(t-M_1) + b_{M_2}\, x(t-M_2) \qquad (26)$$
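As a concrete illustration, the following minimal Python sketch implements the tap computation of Equation 26; the tap positions M1, M2, the gains, and the function name are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of the tapped delay line of Equation 26; samples before t = 0
# are taken to be zero. Tap positions and gains are illustrative.
import numpy as np

def tdl_output(x, b0, bM1, bM2, M1, M2):
    """y(t) = b0*x(t) + bM1*x(t-M1) + bM2*x(t-M2) for every sample t of x."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        y[t] = b0 * x[t]
        if t - M1 >= 0:
            y[t] += bM1 * x[t - M1]
        if t - M2 >= 0:
            y[t] += bM2 * x[t - M2]
    return y

# An impulse produces three scaled echoes, at delays 0, M1 and M2.
print(tdl_output(np.array([1.0, 0, 0, 0, 0, 0]), b0=1.0, bM1=0.5, bM2=0.25, M1=2, M2=4))
```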
Laguerre Filter. Another popular memory kernel applies a filter based on Laguerre polynomials [55], formulated based on Figure 19 as follows:
$$L_i(z^{-1}, u) = \sqrt{1-u^2}\,\frac{(z^{-1}-u)^i}{(1-u z^{-1})^{i+1}}, \qquad i \ge 0, \qquad (27)$$
$$y_k(t, u) = \sum_{i=0}^{k} w_{k,i}(u)\, x_i(t, u), \qquad (28)$$
where
$$x_i(t, u) = L_i(z^{-1}, u)\, x(t). \qquad (29)$$
Fig. 19 Example of a Laguerre filter of size k. This filter is stable only if |u| < 1. When u = 0 the filter degenerates into the familiar transversal filter [48].
FIR/IIR Filters. FIR filters are considered to be nonrecursive since they provide no feedback (a_i = 0 for i > 0), in contrast to IIR or recursive filters, which do provide feedback (a_i ≠ 0 for some i > 0). An example of the format of the filter is shown in the following equations based on Figure 20:
Fig. 20 Example of a second-order IIR filter
$$y(t) = b_0 x(t) + b_1 x(t-1) + \cdots + b_M x(t-M) - a_1 y(t-1) - \cdots - a_N y(t-N) \qquad (30)$$
$$y(t) = \sum_{i=0}^{M} b_i\, x(t-i) - \sum_{j=1}^{N} a_j\, y(t-j) \qquad (31)$$
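The recursion of Equation 31 is easy to implement directly. The following minimal Python sketch assumes zero initial conditions and illustrative coefficient values; passing an empty feedback list gives the nonrecursive (FIR) special case.

```python
# A minimal sketch of Equation 31 with zero initial conditions; coefficients are illustrative.
import numpy as np

def iir_filter(x, b, a):
    """y(t) = sum_i b[i]*x(t-i) - sum_j a[j]*y(t-j), with a = [a1, ..., aN].

    Passing a = [] gives the nonrecursive (FIR) special case."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i, bi in enumerate(b):           # feedforward part
            if t - i >= 0:
                y[t] += bi * x[t - i]
        for j, aj in enumerate(a, start=1):  # feedback part
            if t - j >= 0:
                y[t] -= aj * y[t - j]
    return y

# A second-order example as in Figure 20: the feedback makes the impulse response infinite.
print(iir_filter(np.array([1.0, 0, 0, 0, 0]), b=[1.0, 0.5, 0.25], a=[0.3, 0.1]))
```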
Gamma Filter. The Gamma filter is a special class of IIR filters where the recursion is kept locally, proving effective in identification of systems with long impulse responses [30]. The structure of the Gamma filter is similar to a TDL (refer above). The format of the Gamma filter is shown in the following equations [39]:
$$y(t) = \sum_{k=0}^{K} w_k\, x_k(t) \qquad (32)$$
$$x_k(t) = G(z^{-1})\, x_{k-1}(t), \qquad k = 1, \ldots, K, \qquad (33)$$
where y(t) is the output signal, x(t) is the input signal, and G(z^{-1}) is the generalized delay operator (the tap-to-tap transfer function, either recursive or non-recursive).
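As an illustration, the following minimal Python sketch assumes that x_0(t) is the raw input x(t) and uses the common first-order gamma delay operator x_k(t) = (1-mu) x_k(t-1) + mu x_{k-1}(t-1); the parameter mu and the weights are illustrative values, not taken from the text.

```python
# A minimal sketch of the Gamma filter of Equations 32-33, assuming the standard
# first-order gamma delay operator; mu and w are illustrative values.
import numpy as np

def gamma_filter(x, w, mu):
    """K+1 taps (K = len(w) - 1); taps[k] holds x_k(t), with taps[0] the raw input."""
    K = len(w) - 1
    taps = np.zeros(K + 1)
    y = np.zeros(len(x))
    for t in range(len(x)):
        prev = taps.copy()
        taps[0] = x[t]
        for k in range(1, K + 1):   # locally recursive delay stages
            taps[k] = (1.0 - mu) * prev[k] + mu * prev[k - 1]
        y[t] = np.dot(w, taps)      # Equation 32
    return y

print(gamma_filter(np.array([1.0, 0, 0, 0, 0, 0]), w=[0.5, 0.3, 0.2], mu=0.6))
```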
Moving Average Filter. A Moving Average filter (MAF) is a simplified FIR filter, alternatively called the exponential filter; as the name suggests, it works by averaging a number of points from the input signal to produce each point in the output signal. The filter is called a moving average since, at any moment, a moving window of M values of the data sequence is averaged [50]. Since the MAF places equal emphasis on all input values, two areas of concern for the MAF are the need for M measurements to be made before a reliable output can be computed, and the need to store M values for this computation [50]. The simplified equation of the MAF is:
$$y(t) = \frac{1}{M} \sum_{j=0}^{M-1} x(t-j), \qquad (34)$$
where y(t) is the output signal, x(t) is the input signal, and M is the number of points used in the moving average.
3.3.2 Modular Components with Identical Parameters
In this subsection, cases where each modular component in a memory kernel has the same structural form and the same node parameter(s) will be discussed. With every node's weights being the same, parameter estimation is kept to a minimum and each node may be fabricated identically (e.g. in gate array technology). Note that TDLs may be considered within this category as well.
Modular Recurrent Form. A modular recurrent form is a series of independent nodes organized by some intermediary. Each node provides inputs to the intermediary, which the intermediary processes to produce an output for the network as a whole. The intermediary only accepts the nodes' outputs; no response (i.e. signal) is reported back to the nodes. Nodes do not typically interact with each other.
4 Learning
The most desirable aspect of artificial neural networks is their learning capability. When provided with input and output values, a function mapping inputs to outputs can be approximated by the network. In this section, an understanding of how a RNN can learn this relationship through various learning methods will be provided.
4.1 Recurrent Back-Propagation: Learning with Fixed Points
Fixed point learning algorithms assume that the network will converge to a stable fixed point. This type of learning is useful for computation tasks such as
constraint satisfaction and associative memory tasks. Such problems are provided to the network through an initial input signal or through a continuous external signal, and the solution is provided as the state of the network when a fixed point has been reached.
A problem with fixed point learning is whether or not a fixed point will be reached, since RNNs do not always reach one [29]. There are several ways to guarantee that a fixed point will be reached in certain special cases:
Weight Symmetry. Linear conditions on weights, such as zero-diagonal symmetry (w_{ij} = w_{ji}, w_{ii} = 0), guarantee that the Lyapunov function (Equation 35) will decrease until a fixed point has been reached [7]. If weights are considered to be Bayesian constraints, as in Boltzmann Machines, the weight symmetry condition will arise [21].
$$L = -\sum_{j,i} w_{ji}\, y_i\, y_j + \sum_{i} \left[\, y_i \log(y_i) + (1-y_i)\log(1-y_i) \,\right] \qquad (35)$$
Setting Weight Boundaries. If $\sum_{j,i} w_{j,i}^2 < \left(\max_x\{f'(x)\}\right)^{-2}$, where $\max_x\{f'(x)\}$ is the maximal value of $f'(x)$ for any $x$ [2] and $f'(\cdot)$ is the derivative of $f(\cdot)$, convergence to a fixed point will occur. In practice it has been shown that much weaker bounds on the weights are effective [40].
Asymptotic Fixed-Point Behavior. Some studies have shown that applying fixed point learning to a network causes the network to exhibit asymptotic fixed point behavior [1, 13]. There is, as of yet, no theoretical explanation of this behavior, nor has it been replicated on larger networks.
Even with the guarantee of a network reaching a fixed point, the fixed point learning algorithm can still have problems reaching that state. As the learning algorithm moves the fixed point location by changing the weights, there is the possibility of the error jumping suddenly due to discontinuity. This occurs no matter how gradually the weights are manipulated, as the underlying mechanism is a dynamical system subject to bifurcations, and even chaos [29].
4.1.1 Traditional Back-Propagation Algorithm
The traditional back-propagation algorithm [42, 53] involves the computation of an error gradient with respect to network weights by means of a three-phase process: (1) activation forward propagation, (2) error gradient backward propagation, and (3) weight update. The concept is based on the premise that, by making changes in weights proportional to the negation of the error gradient, the error will be reduced until a locally minimal error is found. The same premise can be applied to recurrent networks. However, the process of back-propagation becomes complex as soon as recurrent connections are introduced. Also, as we will discover at the end of this section, some complications arise when applying the gradient descent method in recurrent networks.
4.2 Back-Propagation through Time: Learning with Non-fixed Points
The traditional back-propagation algorithm cannot be directly applied to RNNs since the error backpropagation pass assumes that the connections between PEs induce a cycle-free ordering. The solution adopted by back-propagation through time (BPTT) is to “unroll” the RNN in time. This “unrolling” involves the stacking of identical copies of the RNN (displayed in Figure 6) and redirecting connections within the network to obtain connections between subsequent copies. The end result of this process is a feedforward network, to which the backpropagation algorithm can be applied by back-propagating the error gradient through previous time-steps to t = 0. Note that in Figure 6 the arrows from each of the “Input” rectangles to each of the “PE Layer” rectangles represent different “incarnations” of the exact same set of weights during different time-steps. Similarly, the arrows from each of the “PE Layer” rectangles to their subsequent “PE Layer” rectangles also represent the exact same set of weights in different temporal incarnations. This implies that all the changes to be made to all of the incarnations of a particular weight can be cumulatively applied to that weight. Hence, the traditional equation for updating the weights in a network,
$$\Delta w_{ji} = -\eta\, a_i\, \delta_j = -\eta\, a_i\, \frac{\partial E}{\partial p_j}, \qquad (36)$$
is replaced by a temporal accumulation:
$$\Delta w_{ji} = -\eta \sum_{t} a_i(t)\, \delta_j(t+1) = -\eta \sum_{t} a_i(t)\, \frac{\partial E}{\partial p_j(t+1)}. \qquad (37)$$
With the “unrolled” network, a forward propagation begins from the initial copy and proceeds through the stack, updating each connection. For each copy, or time t, the input u(t) is read in, the internal state is computed from the input and the previous internal state, and then the output y(t) is calculated. The error E to be minimized is:
$$E = \sum_{t=1,\ldots,T} \left\| d(t) - y(t) \right\|^2 = \sum_{t=1,\ldots,T} E(t), \qquad (38)$$
where T represents the total length of time, d(t) is the target output vector, and y(t) is the output vector.
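The following minimal numpy sketch illustrates the BPTT idea for a one-layer recurrent network with state y(t) = tanh(W_in u(t) + W_rec y(t-1)); the architecture, the names, and the use of tanh are illustrative assumptions, not a prescription from the text. Note how the changes from all temporal "incarnations" of a weight are accumulated, as in Equation 37.

```python
# A minimal sketch of BPTT for a simple recurrent layer; sizes and the tanh activation
# are illustrative. The error of Equation 38 is summed over all time steps.
import numpy as np

def bptt_gradients(U, D, W_in, W_rec):
    """U: (T, n_in) inputs, D: (T, n_h) targets.
    Returns the gradients of E = sum_t ||d(t) - y(t)||^2 w.r.t. W_in and W_rec."""
    T, n_h = D.shape
    Y = np.zeros((T + 1, n_h))                  # Y[t+1] is the state after step t; Y[0] = 0
    for t in range(T):                          # forward pass through the unrolled stack
        Y[t + 1] = np.tanh(U[t] @ W_in + Y[t] @ W_rec)
    gW_in, gW_rec = np.zeros_like(W_in), np.zeros_like(W_rec)
    delta_next = np.zeros(n_h)                  # gradient flowing back from the next copy
    for t in reversed(range(T)):                # backward pass through the copies
        err = 2.0 * (Y[t + 1] - D[t]) + delta_next @ W_rec.T
        delta = err * (1.0 - Y[t + 1] ** 2)     # derivative of tanh
        gW_in += np.outer(U[t], delta)          # accumulate over all weight "incarnations"
        gW_rec += np.outer(Y[t], delta)
        delta_next = delta
    return gW_in, gW_rec

rng = np.random.default_rng(0)
gi, gr = bptt_gradients(rng.normal(size=(6, 3)), rng.normal(size=(6, 4)),
                        0.1 * rng.normal(size=(3, 4)), 0.1 * rng.normal(size=(4, 4)))
print(gi.shape, gr.shape)
```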
The slow convergence of traditional backpropagation carries over to BPTT. The computational complexity of one epoch is O(TN^2), where N is the number of internal/hidden PEs. As with traditional backpropagation, multiple epochs (on the scale of thousands) tend to be needed before convergence is reached. This requires experimentation with the network (and
processing time) before a desired result can be achieved. This hindrance of the BPTT algorithm tends to lead to smaller networks (3-20 PEs) being used with this design, whereas larger networks tend to run for many hours before convergence.
4.2.1 Real-Time Recurrent Learning
Real-time recurrent learning (RTRL) [54] allows the partial error gradients to be computed at every time step of the BPTT algorithm in a forward-propagated fashion, eliminating the need for a temporal back-propagation step (thus making it very useful for online learning). Rather than computing and recording only the partial derivative of the total error with respect to each net value,
$$\delta_j = \frac{\partial E}{\partial p_j}, \qquad (39)$$
it is noted that the total error gradient in a recurrent network is given by:
$$\frac{\partial E}{\partial w_{ji}} = \sum_{k} \left( d_k(t) - y_k(t) \right) \frac{\partial y_k(t)}{\partial w_{ji}}, \qquad (40)$$
and that
$$\frac{\partial y_k(t+1)}{\partial w_{ji}} = f'(p_k(t)) \cdot \left( \sum_{l} w_{lk}\, \frac{\partial y_l(t)}{\partial w_{ji}} + y_j(t) \right). \qquad (41)$$
Note that in this equation, the partial derivatives
$$\frac{\partial y_k(t+1)}{\partial w_{ji}} \qquad (42)$$
depend on the previous (not future) values of the same partial derivatives
$$\frac{\partial y_l(t)}{\partial w_{ji}}. \qquad (43)$$
This implies that by storing the values
$$\frac{\partial y_k(t+1)}{\partial w_{ji}} \qquad (44)$$
at each time-step, the next step's values can be computed, until at the final time-step a complete error derivative can be computed. Since w_{kl} has been assumed to be constant, the learning rate η must be kept small. RTRL has a computational cost of O((N+L)^4) for each update step in time (an (N+L)-dimensional system is solved at each step), which means that networks employing RTRL must be kept small.
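A minimal numpy sketch of the forward sensitivity recursion of Equation 41 is given below for a fully recurrent layer y(t+1) = f(Wy(t)), with f = tanh; input units and output weights are omitted for brevity, the index convention takes W[k, l] as the weight from unit l to unit k, and the direct term follows the standard RTRL formulation in which it contributes only to the unit the weight feeds into. All names, sizes, and values are illustrative assumptions.

```python
# A minimal sketch of one RTRL step for y(t+1) = tanh(W y(t)); P[k, j, i] stores the
# sensitivity dy_k/dw_ji carried forward in time. Sizes and the target d are illustrative.
import numpy as np

def rtrl_step(y, P, W, d, lr=0.01):
    """Return the new state, sensitivities, and weights after one time step."""
    N = len(y)
    y_new = np.tanh(W @ y)
    fprime = 1.0 - y_new ** 2
    P_new = np.zeros_like(P)
    for j in range(N):
        for i in range(N):
            direct = np.zeros(N)
            direct[j] = y[i]                      # direct term for the weight w_ji
            # recursion of Equation 41: previous sensitivities propagated through W
            P_new[:, j, i] = fprime * (W @ P[:, j, i] + direct)
    err = d - y_new                               # instantaneous error d(t) - y(t)
    grad = -2.0 * np.einsum('k,kji->ji', err, P_new)
    return y_new, P_new, W - lr * grad            # online update with a small learning rate

N = 3
rng = np.random.default_rng(0)
y, P, W = np.array([0.1, -0.3, 0.2]), np.zeros((N, N, N)), 0.1 * rng.normal(size=(N, N))
y, P, W = rtrl_step(y, P, W, d=np.array([0.5, -0.2, 0.1]))
print(y, W.shape)
```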
4.2.2 Schmidhubers Algorithm
Schmidhuber [43] has developed a hybrid algorithm which applies the two previous approaches in a clever, alternating fashion and is able to manage the superior time performance of BPTT while not requiring unbounded memory as sequence length increases.
4.2.3 Time Constants and Time Delays
One advantage of temporally continuous networks is the addition of parameters to control the temporal behavior of the network in ways known to be associated with natural tasks. Time constants are an example of such additional parameters, as shown in [23, 33, 34, 35]. For time delays, consider that a network's connections take a finite period of time to transmit a signal between PEs, such that:
$$y_j(t) = \sum_{i} w_{ji}\, y_i(t - \tau_{ji}), \qquad (45)$$
where τ_{ji} represents the time delay from PE i to PE j. Using a modification of RTRL, parameters like τ can be learned by a gradient-descent approach.
4.3 Long-Term Dependencies
RNNs provide an explicit modeling of time and memory, allowing, in principle, the modeling of any type of open nonlinear dynamical system. The dynamical system is described by the RNN with a set of real numbers represented by a point in an appropriate state space. RNNs are, however, often viewed negatively due to their inability to identify and learn long-term dependencies over time; authors have reported that no more than about ten steps can be spanned [4]. Long-term dependencies, in this case, can be defined as a desired output at time T which depends on inputs from times t much earlier than T. This difficulty has been noted by Mozer, who found that RNNs were able to learn short-term musical structure with gradient-based learning methods, but had difficulty with global behavior [34]. Therefore, the goal of a network should be to robustly latch information (i.e. a network should be able to store previous inputs in the presence of noise for a long period of time).
As a system robustly latches information, the fraction of the gradient due to information t steps in the past approaches zero as t becomes large. This is known as the problem of vanishing gradients. Bengio et al. [4] and Hochreiter [22] have both noted that the problem of vanishing gradients is the reason why a network trained by gradient-descent methods is unable to discover a relationship between target outputs and inputs that occur at a much earlier time. This is termed the problem of long-term dependencies.
One way to counteract the problem of long-term dependencies is the use of TDLs, as in BPTT with a NARX network. BPTT proceeds by unrolling the network in time and then back-propagating the error through the unrolled network (Figure 6). The output delays then appear as jump-ahead connections in the unrolled network. These jump-ahead connections provide a shorter path for the propagation of gradient information, and this reduces the network's sensitivity to the problem of long-term dependencies.
Another solution is presented in [23].
5 Modeling
RNNs, which have feedback in their architectures, have the potential to represent and learn discrete state processes. The existence of feedback makes it possible to apply RNN models to solve problems in control, speech processing, time series prediction, natural language processing, and so on [15]. In addition, a priori knowledge can be encoded into these networks, which enhances the performance of the applied network models. This section presents how RNNs can represent and model theoretical models such as finite state automata (FSA) and Turing machines (TM). Understanding the theoretical relationship of these networks to formal models such as FSA is important in selecting the most appropriate model to solve a given problem. In this section, we will review RNNs for representing FSA and Turing machines.
5.1 Finite State Automata
FSA have a finite number of input and output symbols, and a finite number of internal states. A large portion of discrete processes can be modeled by deterministic finite-state automata (DFA). The mathematical formulation of a FSA M is a 6-tuple defined by:
$$M = \{Q, q_0, \Sigma, \Delta, \delta, \varphi\} \qquad (46)$$
where Q is a finite set of state symbols {q1, q2, . . . , qn}, n is the number of states, q0 ∈ Q is the initial state, Σ is the set of input symbols {u1, u2, . . . , um}, m is the number of input symbols, Δ is the set of output symbols {o1, o2, . . . , or}, r is the number of output symbols, δ : Q × Σ → Q is the state transition function, and ϕ is the output function. The class of FSA is basically divided into Mealy and Moore models. Both models are denoted by the 6-tuple formulation defined
in Equation 46. However, the only difference between the two models is the formu-
lation of the output function ϕ. In the case of the Mealy model, the output function is ϕ:Q × Σ → Δ , while, in the Moore model, it is: ϕ:Q → Δ [17, 24]. In general, the FSA M is described by the following dynamic model:
$$M: \quad \begin{cases} q(t+1) = \delta(q(t), u(t)), & q(0) = q_0 \\ o(t) = \varphi(q(t), u(t)) \end{cases} \qquad (47)$$
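To make the notation concrete, the following minimal Python sketch simulates the dynamic model of Equation 47 for a small Mealy machine; the two-state automaton used here (its state remembers the last symbol seen, and its output is 1 when the current input repeats that symbol) is purely illustrative.

```python
# A minimal sketch of Equation 47: iterate q(t+1) = delta(q(t), u(t)), o(t) = phi(q(t), u(t)).
def run_fsa(delta, phi, q0, inputs):
    q, outputs = q0, []
    for u in inputs:
        outputs.append(phi[(q, u)])   # output function
        q = delta[(q, u)]             # state transition function
    return outputs

# Illustrative Mealy machine: state 'qa' is reached after an 'a', 'qb' after a 'b';
# the output is 1 when the input symbol matches the one remembered in the state.
delta = {('qa', 'a'): 'qa', ('qa', 'b'): 'qb', ('qb', 'a'): 'qa', ('qb', 'b'): 'qb'}
phi   = {('qa', 'a'): 1, ('qa', 'b'): 0, ('qb', 'a'): 0, ('qb', 'b'): 1}
print(run_fsa(delta, phi, q0='qa', inputs=['a', 'b', 'b', 'a']))  # [1, 0, 1, 0]
```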
Many RNN models have been proven to model FSA. The existence of such an equivalence between RNNs and FSA has given the confidence to apply RNN models in solving problems with dynamical systems. Omlin and Giles [36] proposed an algorithm to construct DFA in second-order networks. The architecture of the network is the same as the architecture illustrated in Figure 15. The outputs of the PEs in the output layer are the states of the DFA. The network accepts a temporal sequence of inputs and evolves with the dynamics determined by the activation function of the network defined by Equation 24, so the state PEs are computed according to that equation. One of the PEs, o0, is a special PE: it represents the output of the network after a string of inputs is presented to the network. The possible value of this state PE is accept/reject. Therefore, if the modeled DFA has n states and m input symbols, the construction of the network includes n + 1 state PEs and m inputs. The proposed algorithm is composed of two parts. The first part is to adjust the weights to determine the DFA state transitions. The second part is to adjust the output of the PE for each DFA state.
Kuroe [31] introduced a recurrent network called a hybrid RNN to represent the FSA. In this model, the state symbols qi (i = 1, 2, · · · , n), input symbols ui (i = 1, 2, · · · , m) and the output symbols oi (i = 1, 2, · · · , r) are encoded as binary numbers. q(t), u(t) and o(t) will be expressed as follows:
$$\begin{aligned} q(t) &= (s_1(t), s_2(t), \ldots, s_A(t)), & s_i(t) &\in \{0, 1\} \\ u(t) &= (x_1(t), x_2(t), \ldots, x_B(t)), & x_i(t) &\in \{0, 1\} \\ o(t) &= (y_1(t), y_2(t), \ldots, y_D(t)), & y_i(t) &\in \{0, 1\} \end{aligned} \qquad (48)$$
where A, B and D are integers whose values are selected such that n ≤ 2^A, m ≤ 2^B and r ≤ 2^D, respectively. In this way, there is no need to increase the number of state PEs in the network on a one-to-one basis as the number of states in the FSA increases. The architecture of the hybrid network consists of two types of PEs, static and dynamic PEs. The difference between them is that the dynamic PE has feedback originating from its output, while the static PE does not [31]. Figure 21 illustrates the diagram of an arbitrary hybrid neural network representing a FSA. The dynamic PEs are represented by darkly-shaded circles, while the static PEs are represented by lightly-shaded circles.
Won et al. [55] proposed a new recurrent neural architecture to identify discrete-time dynamical systems (DTDS). The model is composed of two MLPs arranged in five layers. The first MLP is composed of layers 0, 1 and 2. The second MLP is composed of layers 3 and 4. The first set of layers {0, 1, 2} approximates the states of the FSA, while the second set of layers {3, 4} approximates the output of the FSA. Figure 22 shows the architecture of the network proposed by [55].
Fig. 21 A hybrid neural network representing a FSA
Fig. 22 A RNN composed of two MLPs to identify and extract FSA (layers 0-2: input, PE and delay layers computing the state q(t); layers 3-4: PE layers computing the output y(t) from q(t-1) and x(t))
5.2 Beyond Finite State Automata
The ability of RNNs to model FSA should inspire great confidence in these systems, as FSA represent the computational limits of digital computers with finite memory resources. That is, anything that a digital computer with finite memory can compute can also be computed by a FSA, and hence by a RNN.
Sometimes it is interesting to study even greater levels of computational power by asking questions about what a device could do without memory limitations. This has led to the study, within theoretical computer science, of devices such as pushdown automata and Turing machines. It is interesting to note that RNNs measure up to these devices as well. The reader is referred to [28, 38, 47].
6 Applications
RNNs have proved themselves in a wide variety of fields. Various architectures and learning algorithms have been developed to solve problems ranging from natural language processing and financial forecasting to plant modeling, robot control, and dynamic system identification and control. In this section, we review two case studies: one on grammatical inference and the other on control.
6.1 Natural Language Processing
In recent years, there has been a lot of effort and progress in developing RNN architectures for natural language processing. Harigopal and Chen [20] proposed a network to recognize strings much longer than the ones on which the network was trained, using a second-order recurrent network model for the problem of grammatical inference. Zeng et al. [59] proposed a discrete recurrent network for learning deterministic context-free grammars. Elman [9] addressed three challenges of natural language processing. The first challenge is the nature of the linguistic representations. The second is the representation of complex structural relationships. The third is the ability of a fixed-resource system to accommodate the open-ended nature of language.
Grammatical inference is the problem of extracting a grammar from the strings of a language; there exists a FSA that generates and recognizes that grammar. In order to give the reader a clear idea about the application of RNNs to grammatical inference, we will review a method proposed by Chen et al. [5] to design a RNN for grammatical inference. Chen et al. [5] proposed an adaptive RNN to learn a regular grammar and extract the underlying grammatical rules. They called their model the adaptive discrete recurrent neural network finite state automata (ADNNFSA). The model is based on two recurrent network models: the neural network finite state automata (NNFSA) proposed by Giles et al. [18] and the discrete neural network finite state automata (DNNFSA) proposed by Zeng et al. [58].
Figure 23 shows the network architecture for both NNFSA and DNNFSA, which was also used in the ADNNFSA model. The network consists of two layers. The first layer consists of N units (context units) that receive feedback from the next layer (state layer) and M input units that receive input signals. The outputs of this layer are connected to the inputs of the second layer via second-order weight connections. The second layer consists of N PEs. The state PEs are denoted by the vector s(t-1) and the input units are denoted by the vector x(t-1). In the second layer, s(t) is the current-state output vector and h(t) is the current-state activation vector. The activation of the second layer can be computed as follows:
$$h_j(t) = f\left( \sum_{i} \sum_{n} w_{jin}\, s_i(t-1)\, x_n(t-1) \right), \qquad (49)$$
Fig. 23 The general architecture for NNFSA, DNNFSA, and ADNNFSA models (layer 1: context units and input units; layer 2: state PEs computed through second-order weight connections, fed back through a bank of delay units)
$$s_j(t) = g\left( h_j(t) \right). \qquad (50)$$
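The second-order update of Equations 49-50 can be written compactly with a three-dimensional weight tensor, as in the following minimal numpy sketch (NNFSA flavour: f is a sigmoid and g the identity); the sizes, the random weights, and the one-hot input encoding are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of Equations 49-50: h_j(t) = f(sum_{i,n} w_jin s_i(t-1) x_n(t-1)),
# s_j(t) = g(h_j(t)) with g = identity. Sizes and weights are illustrative.
import numpy as np

def second_order_step(s_prev, x_prev, W):
    """W has shape (N, N, M): W[j, i, n] multiplies s_i(t-1) * x_n(t-1)."""
    net = np.einsum('jin,i,n->j', W, s_prev, x_prev)
    return 1.0 / (1.0 + np.exp(-net))   # sigmoid f; g is the identity

rng = np.random.default_rng(0)
N, M = 3, 2                              # 3 state PEs, 2 input units (one-hot symbols)
W = rng.normal(size=(N, N, M))
s = np.array([1.0, 0.0, 0.0])            # initial state vector
for x in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):   # a short input string
    s = second_order_step(s, np.array(x), W)
print(s)
```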
In the implementation of the NNFSA model, f (·) is a sigmoid function and g(·) is an identity function, while in the implementation of DNNFSA, g(·) is a discrete hard-limiter as follows:
$$g(a) = \begin{cases} 0.8 & \text{if } a > 0.5 \\ 0.2 & \text{if } a < 0.5 \end{cases} \qquad (51)$$
The NNFSA model applies the true-gradient real-time recurrent learning (RTRL) algorithm, which performs well. However, the NNFSA model of Giles et al. [18] exploits the analog nature of the network, which does not match the discrete behavior of a FSA. Therefore, Zeng et al. [58], in DNNFSA, discretized the analog NNFSA by using the function in Equation 51. Since all the states are discretized, the RTRL algorithm is no longer applicable; a pseudo-gradient algorithm was used instead, which hinders training because it is only an approximation of the true gradient. Therefore, Chen et al. [5] used analog internal states at the beginning of the training and, as the training progresses, the model changes gradually to the discrete mode of the internal states. Thus, the current-state activation output h_j(t) is computed as in Equation 49, and the current-state output s_j(t) is computed as follows:
$$s_j(t) = \begin{cases} h_j(t) & \text{if } j \in \text{analog mode} \\ g(h_j(t)) & \text{if } j \in \text{discrete mode} \end{cases} \qquad (52)$$
To decide whether the mode of a state PE has to be switched to the discrete mode, a quantization threshold parameter β is used. If the output of the state PE j satisfies s_j(t) < β or s_j(t) > 1.0 - β for all the training strings, the mode of this state PE is switched to the discrete phase. This recurrent network model thus adapts the training from the initial analog phase, which has good training performance, to the discrete phase, which fits the nature of the FSA, as training progresses, allowing automatic rule extraction.
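This switching rule can be summarized in a couple of lines of Python; the helper name and the sample activation values below are purely illustrative.

```python
# A minimal sketch of the ADNNFSA analog-to-discrete switching rule: a state PE is moved
# to the discrete phase once its output is saturated (below beta or above 1 - beta)
# for every training string. Values are illustrative.
def switch_to_discrete(outputs_j, beta):
    return all(s < beta or s > 1.0 - beta for s in outputs_j)

print(switch_to_discrete([0.05, 0.93, 0.97, 0.02], beta=0.1))  # True: always saturated
print(switch_to_discrete([0.05, 0.45, 0.97], beta=0.1))        # False: 0.45 is ambiguous
```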
6.2 Identification and Control of Dynamical Systems
In dynamic systems, different time-variant variables interact to produce outputs. To control such systems, a dynamic model is required to tackle their unpredictable variations. Adaptive control systems can be used to control dynamic systems, since these control systems use a control scheme that has the ability to modify its behavior in response to variations in the dynamic system. However, most adaptive control techniques require the availability of an explicit dynamic structure of the system, which is impossible for most nonlinear systems with poorly known dynamics. In addition, these conventional adaptive control techniques lack the ability to learn. This means that such adaptive control systems cannot use knowledge available from the past and apply it in similar situations in the present or future. Therefore, a class of intelligent control has been applied which is based on neural modeling and learning [27, 46].
Neural networks can deal with nonlinearity, perform parallel computation, and process noisy data. These features have made neural networks good tools for controlling nonlinear and time-variant systems. Since dynamic systems involve model states at different time steps, it is important for the neural network model to memorize the previous states of the system and deal with the feedback signals. This is impossible with conventional feedforward neural networks. To solve the problem, it has been important to give neural networks a dynamic behavior by incorporating delay units and feedback links. In other words, RNNs with time delay units and feedback connections can tackle the dynamic behavior of nonlinear time-variant systems. RNNs have proved successful in the identification and control of systems which are dynamic in nature. We will review a model that was proposed by Ge et al. [14] for the identification and control of nonlinear systems.
Yan and Zhang [56] presented two main characteristics to measure the dynamic memory performance of neural networks, which they called "depth" and "resolution ratio". Depth refers to how far back information can be memorized by the model, while resolution ratio refers to the amount of information about the input that can be retained by the model. Time-delay recurrent neural networks (TDRNN) can retain much information about the input but memorize information over only a short time period; thus, they have a good resolution ratio and poor depth. On the other hand, most recurrent networks, such as Elman networks, have good depth and a poor resolution ratio [14]. Ge et al. [14] proposed a model that has a memory of better depth and
resolution ratio to identify and control dynamic systems. Figure 24 shows the architecture of the TDRNN proposed by Ge et al. [14].
Fig. 24 The architecture of the TDRNN for identification and control of dynamic systems (input PEs z(t) with unit delays and feedback gains β and γ, hidden PEs h(t), output PEs y(t), and context PEs v(t) with local feedback gain α)
The model incorporates memory units in the input layer with local feedback gain γ (0 ≤ γ ≤ 1) to enhance the resolution ratio. The architecture includes an input layer, a hidden layer, an output layer, and a context layer. In this model, the input and context units are different from those of traditional recurrent networks, since here the units in the input and context layers are PEs with linear transfer functions. Therefore, in this instance only, we will call them input PEs and context PEs for consistency. The context PEs memorize the activations of the output PEs; in addition, there are local feedback links with constant coefficient α in the context PEs. The output of a context PE is given as follows:
$$v_j(t) = \alpha\, v_j(t-1) + y_j(t-1), \qquad j = 1, 2, \ldots, N \qquad (53)$$
where v_j(t) and y_j(t) are the outputs of the jth context PE and the jth output PE, respectively, and N is the number of context PEs and output PEs.
The output PEs, hidden PEs, and input PEs are described, respectively, by the following three equations:
$$y_j(t) = g\left( \sum_{i=1}^{M} w^2_{ji}\, h_i(t) + \sum_{i=1}^{N} w^3_{ji}\, v_i(t) \right), \qquad (54)$$
$$h_j(t) = f\left( \sum_{i=1}^{K} w^1_{ji}\, z_i(t) \right), \qquad (55)$$
$$z_j(t) = x_j(t) + \beta \sum_{i=1}^{r} x_j(t-i) + \gamma\, z_j(t-1), \qquad (56)$$
where w^2 is the weight between the hidden and the output PEs, w^3 is the weight between the context and the output PEs, h_j(t) is the output of the jth hidden PE, M is the number of hidden PEs, w^1 is the weight between the input and hidden PEs, z_j(t) is the output of the jth input PE, K is the number of input PEs, x_j(t) is the jth external input, r is the number of unit time delays, and 0 ≤ α, β, γ ≤ 1 with β + γ = 1. The activation function f(·) is a sigmoid function, and the function g(·) is a linear function.
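One forward step of this TDRNN can be sketched as follows; this is a minimal numpy version assuming f is a sigmoid and g the identity, with the caller responsible for carrying z, v and the input history between calls. All names, sizes and parameter values are illustrative assumptions.

```python
# A minimal sketch of one forward step of the TDRNN of Equations 53-56.
# x_t: external input (K,); x_hist: the last r inputs stacked as an (r, K) array.
import numpy as np

def tdrnn_step(x_t, x_hist, z_prev, v_prev, y_prev, W1, W2, W3, alpha, beta, gamma):
    v = alpha * v_prev + y_prev                            # Eq. 53: context PEs
    z = x_t + beta * x_hist.sum(axis=0) + gamma * z_prev   # Eq. 56: input PEs
    h = 1.0 / (1.0 + np.exp(-(W1 @ z)))                    # Eq. 55: hidden PEs (sigmoid f)
    y = W2 @ h + W3 @ v                                    # Eq. 54: output PEs (linear g)
    return y, z, v, h

K, M, N, r = 2, 4, 1, 3
rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(M, K)), rng.normal(size=(N, M)), rng.normal(size=(N, N))
y, z, v = np.zeros(N), np.zeros(K), np.zeros(N)
y, z, v, h = tdrnn_step(np.array([1.0, 0.5]), np.zeros((r, K)), z, v, y,
                        W1, W2, W3, alpha=0.5, beta=0.4, gamma=0.6)
print(y)
```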
The network was trained with a dynamic recurrent backpropagation learning algorithm developed on the basis of the gradient descent method. The model shows good effectiveness in the identification and control of nonlinear systems.
7 Conclusion
In this chapter, we presented an introduction to RNNs. We began by describing the basic paradigm of this extension of MLPs, and the need for networks that can process sequences of varying lengths. Then we classified the architectures used and identified the prevailing models and topologies. Subsequently, we tackled the implementation of memory in these systems by describing different approaches for maintaining state in such devices. Then we described the prevailing learning methods and an important limitation that all gradient-based approaches to learning in RNNs face. In the next section, we described the relationship between the ability of RNNs to process symbolic input sequences and more familiar computational models that can handle the same types of data. This provided confidence in the model as a general-purpose computational tool. The final section presented two sample applications to elucidate the applicability of these networks to real-world problems.
We hope that this chapter has motivated the use of RNNs and whetted the reader's appetite to explore this rich paradigm further. A good place to begin a deeper exploration is [29], which presents the most important results in the field in a collection of mutually consistent papers by the primary researchers in these areas.
References
1. Allen, R.B., Alspector, J.: Learning of stable states in stochastic asymmetric networks. Technical Report TM-ARH-015240, Bell Communications Research, Morristown, NJ (1989)
2. Atiya, A.F.: Learning on a general network. In: Neural Information Processing Systems, New York, pp. 2230 (1988)
3. Back, A.D., Tsoi, A.C.: FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation 3, 375385 (1991)
4. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 157–166 (1994)
5. Chen, L., Chua, H., Tan, P.: Grammatical inference using an adaptive recurrent neural network. Neural Processing Letters 8, 211219 (1998)
6. Chen, S., Billings, S., Grant, P.: Nonlinear system identification using neural networks. International Journal of Control 51(6), 11911214 (1990)
7. Cohen, M.A., Grossberg, S.: Stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics 13, 815826 (1983)
8. Elman, J.L.: Finding structure in time. Cognitive Science 14, 179–211 (1990)
9. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991)
10. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (February 1990)
11. Forcada, M.L., Ñeco, R.P.: Recursive Hetero-Associative Memories for Translation. In: Mira, J., Moreno-Díaz, R., Cabestany, J. (eds.) IWANN 1997. LNCS, vol. 1240, pp. 453–462. Springer, Heidelberg (1997)
12. Frasconi, P., Gori, M., Soda, G.: Local feedback multilayered networks. Neural Computation 4, 120–130 (1992)
13. Galland, C.C., Hinton, G.E.: Deterministic Boltzmann learning in networks with asymmetric connectivity. Technical Report CRG-TR-89-6, University of Toronto, Department of Computer Science (1989)
14. Ge, H., Du, W., Qian, F., Liang, Y.: Identification and control of nonlinear systems by a time-delay recurrent neural network. Neurocomputing 72, 2857–2864 (2009)
15. Giles, C., Kuhn, G., Williams, R.: Dynamic recurrent neural networks: theory and applications. IEEE Trans. Neural Netw. 5(2), 153–156 (1994)
16. Giles, C.L., Chen, D., Miller, C.B., Chen, H.H., Sun, G.Z., Lee, Y.C.: Second-order recurrent neural networks for grammatical inference. In: 1991 IEEE INNS International Joint Conference on Neural Networks, Seattle, Piscataway, NJ, vol. 2, pp. 271–281. IEEE Press (1991)
17. Giles, C.L., Horne, B.G., Lin, T.: Learning a class of large finite state machines with a recurrent neural network. Neural Networks 8, 1359–1365 (1995)
18. Giles, C.L., Miller, C.B., Chen, D., Chen, H.H., Sun, G.Z., Lee, Y.C.: Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation 4, 395–405 (1992)
19. Gori, M., Bengio, Y., Mori, R.D.: BPS: A learning algorithm for capturing the dynamic nature of speech. In: International Joint Conference on Neural Networks, vol. II, pp. 417–423 (1989)
20. Harigopal, U., Chen, H.C.: Grammatical inference using higher order recurrent neural networks. In: Proceedings of the Twenty-Fifth Southeastern Symposium on System Theory, SSST 1993, pp. 338–342 (1993)
21. Hinton, G.E., Sejnowski, T.J.: Optimal perceptual inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 448–453. IEEE Computer Society (1983)
22. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis (1991)
23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9, 1735 1780 (1997)
24. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley (1979)
25. Jordan, M.I.: Supervised learning and systems with excess degrees of freedom. Technical Report COINS Technical Report 8827, Massachusetts Institute of Technology (1988)
26. Karakasoglu, A., Sudharsanan, S., Sundareshan, M.K.: Identification and decentralized adaptive control using dynamic neural networks with application to robotic manipulators. IEEE Trans. Neural Networks 4, 919930 (1993)
27. Karray, F.O., Silva, C.: Soft Computing and Intelligent Systems Design. Addison Wesley (2004)
28. Kilian, J., Siegelmann, H.T.: On the power of sigmoid neural networks. In: Proceedings of the Sixth ACM Workshop on Computational Learning Theory, pp. 137143. ACM Press (1993)
29. Kolen, J.F., Kremer, S.C. (eds.): A Field Guide to Dynamical Recurrent Networks. Wiley-IEEE Press (2001)
30. Kuo, J., Celebi, S.: Adaptation of memory depth in the gamma filter. In: Acoustics, Speech and Signal Processing IEEE Conference, pp. 14 (1994)
31. Kuroe, Y.: Representation and Identification of Finite State Automata by Recurrent Neural Networks. In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.) ICONIP 2004. LNCS, vol. 3316, pp. 261268. Springer, Heidelberg (2004)
32. Lippmann, R.P.: An introduction to computing with neural nets. IEEE ASSP Magazine 4, 422 (1987)
33. Mozer, M.: A focused background algorithm for temporal pattern recognition. Complex Systems 3 (1989)
34. Mozer, M.C.: Induction of multiscale temporal structure. In: Advances in Neural Information Processing Systems 4, pp. 275282. Morgan Kaufmann (1992)
35. Nguyen, M., Cottrell, G.: A technique for adapting to speech rate. In: Kamm, C., Kuhn, G., Yoon, B., Chellapa, R., Kung, S. (eds.) Neural Networks for Signal Processing 3. IEEE Press (1993)
36. Omlin, C.W., Giles, C.L.: Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM 43(6), 937972 (1996)
37. Patan, K.: Locally Recurrent Neural Networks. In: Patan, K. (ed.) Artificial. Neural Net. for the Model. & Fault Diagnosis. LNCIS, vol. 377, pp. 2963. Springer, Heidelberg (2008)
38. Pollack, J.B.: On Connectionist Models of Natural Language Processing. PhD thesis, Computer Science Department of the University of Illinois at Urbana-Champaign, Urbana, Illinois, Available as TR MCCS-87-100, Computing Research Laboratory, New Mexico State University, Las Cruces, NM (1987)
39. Principe, J.C., de Vries, B., de Oliveira, P.G.: The gamma filter - a new class of adaptive IIR filter with restricted feedback. IEEE Transactions on Signal Processing 41, 649656 (1993)
40. Renals, S., Rohwer, R.: A study of network dynamics. Journal of Statistical Physics 58, 825848 (1990)
41. Robinson, A.J.: Dynamic Error Propagation Networks. Ph.d., Cambridge University Engineering Department (1989)
42. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Parallel Distributed Processing. MIT Press, Cambridge (1986)
43. Schmidhuber, J.H.: A fixed size storage o(n3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation 4(2), 243248 (1992)
44. Sejnowski, T.J., Rosenberg, C.R.: Parallel networks that learn to pronounce English text. Complex Systems 1, 145–168 (1987)
45. Shannon, C.E.: Communication in the presence of noise. Proc. Institute of Radio Engineers 37(1), 1021 (1949); reprinted as classic paper in: Proc. IEEE 86(2) (February 1998)
46. Shearer, J.L., Murphy, A.T., Richardson, H.H.: Introduction to System Dynamics. Addison-Wesley, Reading (1971)
47. Siegelmann, H.T., Sontag, E.D.: Turing computability with neural nets. Applied Mathematics Letters 4(6), 7780 (1991)
48. Silva, T.O.: Laguerre filters - an introduction. Revista do Detua 1(3) (1995)
49. Smith, J.O.: Delay lines. Physical Audio Signal Processing (2010), http://ccrma.stanford.edu/~jos/pasp/Tapped_Delay_Line_TDL.htm (cited November 28, 2010)
50. Smith, S.W.: The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing (2006), http://www.dspguide.com/ch15.htm (cited November 29, 2010)
51. Tsoi, A.C., Back, A.D.: Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks 5, 229–239 (1994)
52. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time delay neural networks. IEEE Trans. Acoustic Speech and Signal Processing 37(3), 328–339 (1989)
53. Werbos, P.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences. PhD thesis, Harvard University (1974)
54. Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 270–289 (1989)
55. Won, S.H., Song, I., Lee, S.Y., Park, C.H.: Identification of finite state automata with a class of recurrent neural networks. IEEE Transactions on Neural Networks 21(9), 1408–1421 (2010)
56. Yan, P.F., Zhang, C.S.: Artificial Neural Network and Simulated Evolutionary Computation. Tsinghua University Press, Beijing (2000)
57. Zamarreno, J.M., Vega, P.: State space neural network. Properties and application. Neural Networks 11, 1099–1112 (1998)
58. Zeng, Z., Goodman, R.M., Smyth, P.: Learning finite state machines with self-clustering recurrent networks. Neural Computation 5(6), 977–990 (1993)
59. Zeng, Z., Goodman, R.M., Smyth, P.: Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks 5(2), 320–330 (1994)
Chapter 3
Supervised Neural Network Models for Processing Graphs
Monica Bianchini and Marco Maggini
An intelligent agent interacts with the environment where it lives, taking its decisions on the basis of sensory data that describe the specific context in which the agent is currently operating. These measurements compose the environment representation that is processed by the agent's decision algorithms and, hence, they should provide sufficient information to yield the correct actions to support the agent's life. In general, the developed environment description is redundant, to provide robustness with respect to noise and possibly missing data. On the other hand, a proper organization of the input data can ease the development of successful processing and decisional schemes.
Therefore, input data encoding is a crucial step in the design of an artificial intelligent agent. This is particularly true for agents that also feature learning capabilities, since an appropriate data representation, which faithfully preserves the environment properties, can make the extraction of general rules from the available examples simpler. The most natural organization of the input data depends on the specific task. In some cases, the decision is taken on a single event that is described by a certain number of measurements, encoded as a vector of real numbers. For instance, if the agent should decide if a piece of food is good, a vector collecting measurements related to its color, temperature, and chemical properties as detected by olfactory and/or taste sensors is likely to provide the relevant information for the decision. When the event is a sound and the agent should interpret if it can be the hint of a possible threat, or the presence of a prey, or a message from another agent, the most appropriate input data model is a temporal sequence of measurements, like, for instance, the signal spectrogram. When considering simple visual tasks, such as recognizing the shape of a potential predator, a two dimensional spatial sensory input can be processed as encoded by a bidimensional matrix, whose entries measure the luminance or color of a given point in the space projected by the agent's visual system
M. Bianchini · M. Maggini, Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche, Università degli Studi di Siena, Italy, e-mail: {monica,maggini}@dii.unisi.it
onto the retina. However, when more complex tasks, such as scene interpretation, are considered, a successful approach may require exploiting structured data, which provide a more balanced split of the intrinsic complexity between the task of data representation and that of data processing. In these cases, a graph can describe the environment more precisely, by modeling it through a set of entities and relationships among them, which are encoded by the graph nodes and arcs, respectively. For instance, in a vision task, the nodes may represent parts of objects (as regions having homogeneous visual features), whereas the arcs may encode the spatial relationships among pairs of adjacent regions.
Hence, the design of an intelligent learning agent requires to jointly devise an optimal feature representation of the task inputs and the related processing/learning algorithm. In fact, even if it is possible to obtain simpler representations from complex data structures, this process always implies the loss of information, an inaccurate modeling of the problem at hand or a higher complexity in the input data distribution. For instance, given an input representation encoded as a tree, it is possible to map it to a vector by performing a visit of its nodes. However, such a representation would not be suitable for trees with a varying number of nodes that would yield vectors with different dimensions, unless a maximum number of nodes is assumed and a padding technique is used to encode the missing nodes. Moreover, the relationships among the nodes in the tree would be encoded by their position in the vector, making it more difficult to exploit the structural regularities by an algorithm designed to process generic vectors.
In this chapter, we will show how an agent based on artificial neural networks (ANNs) can be designed in order to naturally process structured input data encoded as graphs. Graph Neural Networks (GNNs) [23] are an extension of classical MultiLayer Perceptrons (MLPs) that accept input data encoded as general undirected/directed labeled graphs. GNNs are provided with a supervised learning algorithm that, besides the classical input-output data fitting measure, incorporates a criterion aimed at the development of a contractive dynamics, in order to properly process the cycles in the input graph. A GNN processes a graph in input and it can be naturally employed to compute an output for each node in the graph (node-focused computation). The training examples are provided as graphs for which supervisions are given as output target values for a subset of their nodes. This processing scheme can be adapted to perform a graph-based computation in which only one output is computed for the whole graph.
For instance, an object detection problem may be faced as a node-focused task (see figure 1-a). The input image can be represented by a Region Adjacency Graph (RAG) [5, 13], where each homogeneous region, extracted by an image segmentation algorithm, is associated with a node and the edge between two nodes encodes the adjacency relationship among them. The label assigned to the node represents visual features of the associated region (color, shape, etc.), whereas the edge labels may encode parameters related to the mutual spatial position of the two adjacent regions (distance of the barycenters, rotation angles, etc.). The GNN computes a label for each node in the graph, stating if the corresponding region is a part of the object to be detected. A post-processing can be applied to the GNN output to find the node
sets that match the target object profile, providing also the localization of the object in the input image.
As an example of a graph-based problem, we can consider the prediction of some chemical property of a molecule (see figure 1-b). In this case, the graph nodes model the atoms or the atomic groups in the molecule, whereas the edges represent the chemical bonds between them. Each node (edge) can store a label that describes the physicochemical properties of the atom (bond). The GNN processes an input graph, modeling a specific molecule, yielding a prediction for the presence of the target chemical property. The training is performed by providing a learning set containing graphs (molecules) for which the value of the considered chemical property is known a priori.
[Figure panels: (a) Input Image, (b) Segmented Image and RAG, (c) Directed RAG (with supersource), (d) Guanine, (e) Molecule graph]
Fig. 1 Examples of structured data encodings. (a-c) An object detection and localization problem. The input image is encoded by a Region Adjacency Graph. (d-e) A chemical compound and its graph encoding (nodes are described by the features of the corresponding atom, edges are enriched with features representing the bond type — thicker lines correspond to double bonds).
The chapter describes the GNN model in detail, showing how simpler models can be derived from this more general computational scheme. In particular, when considering acyclic graphs, such as Directed Acyclic Graphs with Labeled Edges (DAGs-LE), Directed Positional Acyclic Graphs (DPAGs), or generic Positional Trees (PTs), the model reduces to that of Recursive Neural Networks (RNNs) [6, 11]. Also Recurrent Neural Networks, which are designed to process temporal input sequences [1, 18, 19, 29], can be viewed as a specific case of GNNs, since
sequences can be represented as linked lists of nodes, where arcs encode the followed-by relationship. Finally, MLPs processing vectors can be viewed as a GNN model processing graphs containing a single node.
The chapter is organized as follows. The next section describes the input representation as graphs and introduces the required notation. Section 2 details the neural network architectures for processing structured information. In particular, the Graph Neural Network model is introduced for processing generic labelled graphs, together with simpler models that can be used with less general cases, such as acyclic graphs. Then, Section 3 presents the supervised learning algorithm for GNNs, showing how it can be viewed as a generalization of the classical BackPropagation algorithm for MLPs. Finally, Section 4 summarizes the main concepts presented in this chapter, reviewing some applications reported in the literature.
1 Graphs
A graph is a data model that allows the representation of complex structures as a set of elements and binary relationships among them. Nodes represent the basic entities that compose the atomic parts of the information, whereas an edge encodes a property that links a pair of entities. In general, different types of relations can be defined on the pairs of nodes, such that the same two nodes may be connected by a set of edges each representing a different relationship. For instance, if two nodes stand for two regions in an image, they can be connected by an edge encoding the adjacent-to relationship and an arc for the on-top-of property. As shown in this example, the relationships encoded by the graph edges may be symmetrical (adjacent-to) or asymmetrical (on-top-of). The two cases would correspond to an undirected edge and a directed arc between the two nodes, respectively. In the following, we formally introduce the graph data model, considering specific choices to provide a simple and uniform representation able to deal with multiple types of relationships, which can be either symmetrical or asymmetrical.
A directed unlabeled graph is defined by the pair GU = (V, E), where V is the finite set of nodes and E ⊆ V ×V collects the arcs. An arc from node u to node v is a directed link represented by the ordered pair (u, v) ∈ E, u, v ∈ V . It is clear that an arc can encode an asymmetrical relationship between the two nodes, since each node can be associated to a specific role by its position in the pair. For instance, if the arc encodes the on-top-of property, the first node u will represent the region that is on the top of the region corresponding to node v. An undirected graph can be conveniently represented as a directed graph by substituting each undirected edge with a pair of directed arcs: an edge between nodes u and v will correspond to the two directed arcs (u, v) and (v, u). Hence, if we need to model symmetrical relationships, we will need to explicitly encode the symmetry by adding the two directed arcs. For instance, the adjacent-to relationship will be represented by two arcs stating that the region of node u is adjacent to that of node v and viceversa. Even if this choice does not provide a natural representation of symmetric relationships, the resulting
model allows a uniform and simple encoding of both symmetric and asymmetric relationships.
The pair (V, E) specifies the underlying structure of the data organization through the topology of the connections among the nodes. This model can be enriched to incorporate attributes or measures that may be available for the entities represented by the graph nodes and/or for the relationships encoded by the arcs. In particular, attributes on the arcs may allow us to encode the presence of different types of relationships between two nodes, as it is detailed in the following.
First, a label can be attached to each node in the graph, encoding the values of a set of properties that are measured for the entity represented by the node. For instance, if a node is a model of a region in an image, features describing perceptual and geometrical properties can be stored in the node to describe some visual characteristics of the region (e.g. average color, shape descriptors, area of the region, perimeter of the region contour, etc). In general, a different set of attributes can be attached to each node, but in the following we will assume that the labels are chosen from the same label space L. Since we will deal with connectionist models that are devised to process information encoded by real numbers, in the following we will consider labels modeled as vectors of reals, i.e. L ⊂ Rm. Given a node label space, a directed labeled graph is a triple GL = (V, E, L ), where V and E are the set of nodes and arcs respectively, and L : V → L is a node labeling function which defines the label L (v) ∈ L for each node v in the graph. The label of node v will be compactly denoted by lv ∈ Rm.
Moreover, a label can also be associated to each arc (u, v) in the graph, to enrich its semantics. A graph with labeled edges can encode also attributes attached to the relationships between pairs of nodes. For instance, we can associate a label to the arc encoding the adjacent-to relationship, in order to specify a set of features that describe the mutual position of the two regions (e.g. distance of the region barycenters, angles between the region principal axes, etc.). In particular, the label can also be exploited to encode the types of relationships existing between two nodes, thus providing a uniform method to model the presence of different kinds of relationships in the data. In the previous example concerning image representation, if we would like to encode both the relationships adjacent-to and on-top-of between pairs of regions, we need just to add two boolean attributes to the arc stating if the associated type of relationship is applicable. In general, we will assume that the labels for the arcs belong to a given edge label space Le. A directed labeled graph with labeled edges is defined by a quadruple GLE = (V, E, L , E ), where the edge labeling function E : E → Le attaches a label E ((u, v)) ∈ Le to the arc (u, v) ∈ E. More compactly, the label attached to the arc (u, v) will be denoted as l(u,v). Notice that, in general, the two directed arcs (u, v) and (v, u) can have different labels. When dealing with Artificial Neural Networks the attributes must be encoded with real valued vectors, such that Le ⊂ Rd. In this case, if T different types of relationships can be present between a pair of nodes, T entries of the vector may be allocated to encode the actual types that are applicable for the pair (u, v). Each entry will be associated to a specific type of relationship, such that its value is 1 if it is present and 0 otherwise. For instance, given the two relationships adjacent-to and on-top-of,
the configuration of the label entries [1, 0] will encode that the region u is adjacent to region v but not on its top, whereas [1, 1] will represent the fact that the two regions are adjacent and region u is on the top of region v. The configuration [0, 0] would make no sense if adjacent-to and on-top-of are the only modeled relationships between pairs of regions, as, in this case, a correct modeling would require to omit the arc between the two nodes. Finally, the entries [0, 1] may be used if we suppose that the relation on-top-of does not need the regions to be adjacent.
Properties deriving from the graph topology are defined as follows. Given a node v ∈ V , the set of its parents is defined as pa[v] = {w ∈ V |(w, v) ∈ E}, whereas the set of its children is ch[v] = {w ∈ V |(v, w) ∈ E}. The outdegree of v is the cardinality of ch[v], od[v] = |ch[v]|, and o = maxv od[v] is the maximum outdegree in the graph. The indegree of node v is the cardinality of pa[v] (|pa[v]|). Nodes having no parents (i.e. |pa[v]| = 0) are called sources, whereas nodes having no children (i.e. |ch[v]| = 0) are referred to as leaves. The class of graphs with maximum indegree i and maximum outdegree o is denoted as #(i,o). Moreover, the class of graphs with bounded (but unspecified) indegree and outdegree is indicated as #. The set of the neighbors of node v is defined as ne[v] = ch[v] ∪ pa[v], i.e. the neighborhood of v contains all the nodes connected to it by an arc independently of its direction. Given a labeled graph GL, the structure obtained by ignoring the node and/or edge labels will be referred to as the skeleton of GL, denoted as ske(GL). Finally, the class of all the data structures defined over the domain of the labeling function L and skeleton in #(i,o) will be denoted as L #(i,o) and will be referred to as a structured space.
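As a concrete illustration of these definitions, the following minimal Python sketch (not part of the original formulation; the class and helper names are illustrative) shows one possible in-memory representation of a directed graph with node and edge labels, together with helpers computing pa[v], ch[v], ne[v] and the node degrees.

import numpy as np

class LabeledGraph:
    """Directed graph with real-valued node labels (in R^m) and edge labels (in R^d)."""

    def __init__(self, node_labels, edge_labels):
        # node_labels: dict mapping node -> label vector l_v (np.ndarray of shape (m,))
        # edge_labels: dict mapping directed arc (u, v) -> label vector l_(u,v) of shape (d,)
        self.node_labels = node_labels
        self.edge_labels = edge_labels

    def children(self, v):                       # ch[v] = {w | (v, w) in E}
        return [w for (u, w) in self.edge_labels if u == v]

    def parents(self, v):                        # pa[v] = {w | (w, v) in E}
        return [u for (u, w) in self.edge_labels if w == v]

    def neighbors(self, v):                      # ne[v] = ch[v] ∪ pa[v]
        return sorted(set(self.children(v)) | set(self.parents(v)))

    def outdegree(self, v):
        return len(self.children(v))

    def indegree(self, v):
        return len(self.parents(v))

# Two image regions, with the "adjacent-to" relationship encoded in both directed arcs.
g = LabeledGraph(
    node_labels={1: np.array([0.2, 0.5]), 2: np.array([0.7, 0.1])},
    edge_labels={(1, 2): np.array([1.0, 0.0]), (2, 1): np.array([1.0, 0.0])},
)
print(g.neighbors(1), g.indegree(1), g.outdegree(1))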
A directed path from node u to node v in a graph G is a sequence of nodes (w1, w2, . . . , wp) such that w1 = u, wp = v and the arcs (wi, wi+1) ∈ E, i = 1, . . . , p − 1. The path length is p − 1. If there exists at least one path such that w1 = wp, the graph is cyclic. If the graph is acyclic, we can define a partial ordering, referred to as topological order, on the set of nodes V , such that u ≺ v if u is connected to v by a directed path. The set of the descendants of a node u, desc(u) = {v ∈ V |u ≺ v}, contains all the nodes that follow u in the partial ordering. The set of the ancestors of a node u, anc(u) = {v ∈ V |v ≺ u}, contains all the nodes that precede u in the partial ordering.
The class of Directed Acyclic Graphs (DAGs) is particularly interesting since it can be processed by employing simpler computational schemes. In particular, Directed Positional Acyclic Graphs (DPAGs) form a subclass of DAGs for which an injective function ov : ch[v] → [1, o] assigns a position ov(c) to each child c of a node v. Therefore, a DPAG is represented by the tuple (V, E, L , O), where O = {o1, . . . , o|V|} is the set of functions defining the position of the children for each node. Since the codomain for each function ov(c) is the interval [1, o], if the outdegree of v is less than the graph maximum outdegree o, i.e. |ch[v]| < o, some positions will be empty. The missing links will be encoded by using null pointers (NIL). Hence, in a DPAG the children of each node v can be organized in a fixed-size vector ch[v] = [ch1[v], . . . , cho[v]], where chk[v] ∈ V ∪ {NIL}, k = 1, . . . , o. Finally, we denote with PTREEs the subset of DPAGs that contains graphs which are trees, i.e. there is only one node with no parents, the root node r (|pa[r]| = 0), and any other node v has just one parent (|pa[v]| = 1). Instead, Directed Acyclic Graphs with Labeled Edges (DAGs-LE) represent the subclass of DAGs for which an edge
labeling function E is defined. In this case it is not required to define an ordering among the children of a given node, since an ordering, when needed, can be encoded in the arc labels. Finally, we denote with TREEs-LE the subset of DAGs-LE that contains graphs which are trees.
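To illustrate the positional convention of DPAGs, the short Python sketch below (illustrative names only) pads the positional child map of a node to the maximum outdegree o with a NIL marker, yielding the fixed-size child vector ch[v] = [ch1[v], . . . , cho[v]] described above.

NIL = None   # stands for the null pointer used to pad empty child positions

def padded_children(position_to_child, max_outdegree):
    """Build the fixed-size child vector of a DPAG node.

    position_to_child: dict mapping position k (1..o) -> child node, for one node v.
    """
    return [position_to_child.get(k, NIL) for k in range(1, max_outdegree + 1)]

# Node v has children only in positions 1 and 3; the remaining positions are empty.
print(padded_children({1: 'a', 3: 'b'}, max_outdegree=4))   # ['a', None, 'b', None]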
For graphfocused processing tasks, the graph G may be required to possess a supersource, that is a node s ∈ V such that any other node in G can be reached by a directed path starting from s. Note that, if the graph does not have a supersource, it is still possible to define a convention for adding an extra node s with a minimal number of outgoing edges, such that s is a supersource for the expanded graph [26].
2 Neural Models for Graph Processing
The Graph Neural Network (GNN) model defines a computational scheme that allows the processing of an input graph GLE in order to implement a nodefocused function. Formally, a GNN realizes a function τ(GLE , v) that maps the input graph GLE and one of its nodes v to a real valued vector in Rr. The computational scheme at the basis of the GNN model is that of information diffusion. A computational unit is associated to each node in the input graph, and the units are interconnected to each other following the link topology of the input graph. In order to define the computation of the GNN model we introduce the concepts of state and state transition functions.
2.1 The Graph Neural Network Model
The information needed to perform the overall computation is locally stored in a
state variable xv for each node v. The set of the state variables, available for all the computational units attached to the graph nodes, represents the current state
of the computation. The state of each node xv is an sdimensional vector that is supposed to encode the information relevant for node v and its context. Hence, the state xv ∈ Rs should depend on the information contained in the neighborhood of v, defined by the nodes that link or are linked by v.
Following the idea that the global computation emerges from the diffusion of the
information embedded into the states of the graph nodes, the model assumes that
the state variables can be computed locally at each node depending on the states
of its neighbors and on the nodes local information and connectivity (see Fig. 2).
Basically, this framework requires the definition of a state transition function f that
models the dependence of xv with respect to the context of node v. The function will depend on the label lv of node v, the labels of the incoming arcs l(u,v), those of the outgoing arcs l(v,u), and on the states of the nodes neighbors xne[v]. In a learning framework, the transition function will also depend on a set of learnable parameters θ f ∈ Rp. Hence, in general, the state computation will exploit a state transition function defined as
xv = f (xne[v], l(v,ch[v]), l(pa[v],v), lv|θf ) .    (1)
Fig. 2 The neighborhood of a node exploited for the state computation when processing a directed graph in input. In the depicted example, ch[1] = {4, 7, 8}, pa[1] = {2, 8, 10}, ne[1] = {2, 4, 7, 8, 10}, and the state of node 1 is computed as x1 = f (x2, x4, x7, x8, x10, l(1,4), l(1,7), l(1,8), l(2,1), l(8,1), l(10,1), l1|θf ).
This model can be made more general by exploiting other definitions of neighborhood that imply the dependence of the function f on additional variables. For instance, the proposed formulation assumes that the labels of the nodes in ne[v] are properly encoded by the corresponding states xne[v], but we can rewrite f such that this dependence is made explicit. Moreover, the concept of neighborhood exploited in eq. (1) just includes the nodes one link away from the current unit v. In general, we can redefine the context by including the nodes two or more links away from the current node. However, such a model would not be minimal since we can assume that the information of larger contexts is properly encoded by the states in ne[v]. Finally, the model of eq. (1) assumes that the same computational unit is applied to any node in the graph, since both the function model and the parameters θf are the same for all the nodes in V. We can relax this requirement provided that there is an explicit rule that allows us to choose a specific function and the associated parameters given some node properties (e.g. the node indegree and/or outdegree). In general, the function f can be implemented by a neural network and the set of trainable parameters θf will represent the neural network connection weights. Apart from the constraints on the number and type of inputs and outputs of the neural network, there are no other assumptions on its architecture (type of neurons, number of layers, etc.). We will give more details on the possible implementations of f in the following.
The result of the state computation for a given input graph G is not completely defined until we devise an appropriate global scheme for applying the local state transition function f. In fact, due to the presence of cycles in the input graph there is a circular dependence of the variables xv on their own values. This is evident just by the definition of neighborhood, since the value of xv depends on the values of the neighbors' states xne[v] whose computation depends on xv itself. The state
variables for the nodes in a given input graph can be stacked into a vector x ∈ Rs|V|. Similarly, the labels defined on the input graph nodes and arcs can be collected into the vector l ∈ Rm|V |+d|E|. With this notation, the collective state computation can be
written as the solution of the following vectorial equation
x = F(x, l|θf ) ,    (2)
where F is the global transition function, whose components are obtained by concatenating the single functions f. The actual parameters for each function f in F are easily obtained by projecting the needed components of x and l, as required by eq. (1). This equation shows that the state resulting from the computation is the solution of a system of nonlinear equations in the variables x. In fact, the state x is both on the left and the right side of eq. (2). This is due to the fact that the state for each node v depends on the states of its neighbors in ne[v] that, in turn, depend on the state of v. This cyclic dependence is independent of the actual presence of explicit cycles in the graph, since the considered model for the transition function exploits the states of the neighbor nodes regardless of the direction of the arcs.
In general, eq. (2) may have multiple solutions that can be single points or a manifold in the state space. However, a useful computation requires that the final state is uniquely defined. This requirement can be met by satisfying the conditions of the Banach fixed point theorem [14]. The theorem states that, if F is a contraction map with respect to the state variables x, then the system in eq. (2) has a unique solution. The global transition function is a contraction map if there exists μ ∈ [0, 1) such that ‖F(x, l|θf ) − F(y, l|θf )‖ ≤ μ ‖x − y‖, for any x and y in the state space.
The Banach fixed point theorem also provides a method to compute the solution of the system in eq. (2). Given that F is a contraction map, the unique solution can be found by employing an iterative state update scheme, defined by the equation
x(t + 1) = F(x(t), l|θf ) ,    (3)
where x(t) is the state at iteration t. The theorem guarantees that the sequence x(t) converges exponentially to the solution of the non-linear system (2). This computational scheme motivates the term state transition function for the function f ; in fact, given an input graph G, the states at its nodes are updated by the attached computational units as
xv(t + 1) = f (xne[v](t), l(v,ch[v]), l(pa[v],v), lv|θf ) ,    (4)
until all the state vectors xv(t) converge to the fixed point. It should be noted that the final result of the computation depends on the input graph, since the function F depends both on the graph topology and on the node and arc labels. Whereas the dependence on the labels is clear, since they are explicit parameters of the transition function, the contribution of the graph topology is not directly evident, being due to the computation scheme based on the information diffusion among the neighboring nodes. This scheme gives rise to a computational structure made up of
computational units that are interconnected following the link topology of the input graph. When processing a different graph, the units will be the same but their computation will be arranged in a different way. The resulting computational structure is a network that is referred to as encoding network (see Fig. 3), since it encodes relevant information on the input graph into the state vectors. The encoding network is built by placing a computational unit f in each node in the graph, and by interconnecting them following its arcs. By applying the same transition function to different input graphs, we obtain different encoding networks, featuring the same building block but assembled with a different structure. The current states xv(t), v ∈ V , are stored in each computational unit that calculates the new state xv(t + 1) using the information provided by both the node label and its neighborhood.
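A possible transcription of the relaxation scheme of eqs. (3)-(4) is sketched below in Python; it assumes a graph object with the access helpers used in the earlier sketch (node_labels, edge_labels, neighbors, children, parents), a generic local transition f given as a callable, and an illustrative stopping threshold eps. It is only meant to show how the encoding network iterates the local updates until the states stop changing.

import numpy as np

def relax_states(graph, f, s, eps=1e-6, max_iter=200):
    """Iterate x_v(t+1) = f(x_ne[v](t), l_(v,ch[v]), l_(pa[v],v), l_v) until a fixed point.

    graph : object exposing node_labels, edge_labels, neighbors(v), children(v), parents(v)
    f     : local state transition, returning the new state vector of node v in R^s
    s     : dimension of the state space
    """
    x = {v: np.zeros(s) for v in graph.node_labels}              # initial state x(0)
    for _ in range(max_iter):
        x_new = {}
        for v in graph.node_labels:
            neigh_states = [x[u] for u in graph.neighbors(v)]
            out_labels = [graph.edge_labels[(v, u)] for u in graph.children(v)]
            in_labels = [graph.edge_labels[(u, v)] for u in graph.parents(v)]
            x_new[v] = f(neigh_states, out_labels, in_labels, graph.node_labels[v])
        delta = max(np.linalg.norm(x_new[v] - x[v]) for v in x)  # largest state change
        x = x_new
        if delta < eps:                                          # numerical fixed point
            break
    return x

When the global transition map is a contraction, the loop converges exponentially fast, so max_iter acts only as a safeguard.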
Fig. 3 The encoding and the output networks (b) associated to an input graph (a). The graph neural network, implementing the functions f (. . . , θf ) and g(·, θg), is replicated following the structure of the graph, and its weights, θf and θg, are shared among all the replicas.
Hence, the output of the state propagation is a graph having the same skeleton as the input graph G. The states xv are essentially new labels attached to the nodes of G. As a final remark, it should be noted that all the computational units in the encoding network share the same weights θf.
The GNN model also considers a local output function g, which maps the states to
the actual output of the computation for each node. This function will be applied at each node after the states have converged to the fixed point and will, in general, depend on a set of parameters θg, that can be tuned by the learning algorithm. Hence, the output ov, for each node v ∈ V , will be computed as
ov = g(xv|θg) ,    (5)
where ov belongs to the output space. In the following we will consider ov ∈ Rr. The function g can be computed for each node in the input graph G, thus yielding an output graph with the same skeleton as G and the nodes labeled with the values ov. In this case the GNN network realizes a transduction from a graph G to a graph G′, such that ske(G) = ske(G′). Otherwise, the output can be computed only for a given node in the input graph realizing the node-focused function τ(G, v), from the space of labeled graphs to Rr, defined as τ(G, v) = g(xv|θg). Figure 3 shows how the output function is combined with the state encoding process to yield the final result of the computation. A graph-focused computation can be obtained by properly combining the outputs ov for all the nodes v ∈ V (for instance, by using the average operator), or by choosing a predefined node with specific properties, usually the supersource s, such that the final output is os.
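The Python sketch below illustrates how the local output function g of eq. (5) can be applied node-wise and, for a graph-focused task, aggregated with the average operator mentioned above; the callable g and all variable names are illustrative placeholders.

import numpy as np

def node_outputs(states, g):
    """Apply o_v = g(x_v) to every node state (eq. (5))."""
    return {v: g(x_v) for v, x_v in states.items()}

def graph_output(states, g):
    """Graph-focused readout: here, the average of the node outputs."""
    outs = node_outputs(states, g)
    return np.mean(np.stack(list(outs.values())), axis=0)

# Toy example: a fixed linear g mapping states in R^3 to outputs in R^2.
W = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
g = lambda x: W @ x
states = {1: np.array([1.0, 2.0, 3.0]), 2: np.array([0.0, 0.0, 0.0])}
print(node_outputs(states, g)[1], graph_output(states, g))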
As shown in Figure 3, the functions f and g define the Graph Neural Network. In particular, the connections among the units computing the function f define the dependences among the variables in the connected nodes. In fact, the recursive connections define the topology of the encoding network, stating the modality of combination of the states of the neighbors of each node. The parameters θ f and θg are the trainable connection weights of the graph neural network, being θ f and θg independent of node v. The parametric representations of f and g can be implemented by a variety of neural network models as detailed in the following.
The function g can be implemented by any MultiLayer Perceptron (MLP) with s inputs and r outputs. The function g does not need to meet any other particular requirement, and, hence, there will be no restrictions on the MLP implementing it. The parameter vector θg will collect the MLP connection weights. Instead, the function f should be selected to guarantee that the global transition function F is a contraction map with respect to the state x. Moreover, the implementation of f should take into account that the number of input arguments may vary with the processed node v. In fact, the size of the neighborhood ne[v] and the number of incoming and outgoing arcs can be different among the nodes in the graph.
Let us first consider the case of positional graphs with bounded indegree and outdegree, i.e. the set #(i,o) as defined in section 1. In this case the set ne[v] will contain at most i + o nodes, that can be ordered on the basis of a position assigned to each node in ne[v]. Basically, the state transition function will process at most i states of nodes in pa[v], and the related arc labels l(pa[v],v), and a maximum of o states for the nodes in ch[v], with the attached labels l(v,ch[v]). Hence, the local transition function can be rewritten as
xv(t + 1) = f ( xch1[v](t), . . . , xcho[v](t), xpa1[v](t), . . . , xpai[v](t), l(v,ch1[v]), . . . , l(v,cho[v]), l(pa1[v],v), . . . , l(pai[v],v), lv|θf ) ,    (6)
where the function arguments are mapped to the associated graph variables, taking
into account the position map defined for the node v. When processing a node having an indegree (outdegree) less than the maximum value i (o), the missing positions will be padded using specific constants for the state (xnil ∈ Rs) and for the arc labels (lnil ∈ Rd). The transition function of eq. (6) has a predefined number of arguments,
and can be implemented by an MLP with (s + d) · (i + o) + m inputs and s outputs. When processing undirected graphs, there is no distinction between parent and child nodes. In this case, it is preferable to avoid transforming the undirected graph into a directed one, which would just lead to a duplication of the function arguments and make the implementation more complex, without adding any useful information to the computation.
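The following sketch shows one way to assemble the (s + d) · (i + o) + m inputs of the positional transition function of eq. (6), padding the missing child/parent positions with xnil and the missing arc labels with lnil; the helper name and the toy dimensions are illustrative.

import numpy as np

def positional_input(child_states, parent_states, out_labels, in_labels,
                     node_label, i, o, s, d):
    """Build the (s + d)*(i + o) + m input vector of eq. (6), padding missing positions."""
    x_nil, l_nil = np.zeros(s), np.zeros(d)            # null state and null arc label
    pad = lambda seq, n, filler: list(seq) + [filler] * (n - len(seq))
    parts = (pad(child_states, o, x_nil) + pad(parent_states, i, x_nil)
             + pad(out_labels, o, l_nil) + pad(in_labels, i, l_nil) + [node_label])
    return np.concatenate(parts)

# Example with i = o = 2, s = 3, d = 2, m = 4: a node with one child and no parents.
v_in = positional_input([np.ones(3)], [], [np.ones(2)], [], np.zeros(4),
                        i=2, o=2, s=3, d=2)
print(v_in.shape)        # (s + d) * (i + o) + m = 5 * 4 + 4 = 24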
The positional implementation is meaningful when a significant positional map is actually available for any graph to be processed by the GNN, and the node indegrees/outdegrees do not vary too much among the nodes to be processed. In fact, when the number of null pointers in the input padding is highly variable, the learning process may be hindered if we exploit only one positional state transition function. In this specific case, we can use a different function f (i.e. a different MLP) for processing nodes featuring the indegree and the outdegree in a given range. The number of functions needed for this implementation depends on the actual distribution of the indegrees/outdegrees. For instance, if only two configurations are possible, i.e. (i1, o1) and (i2, o2), then we may exploit two different implementations f(i1,o1) with (s + d) · (i1 + o1) + m inputs and f(i2,o2) with (s + d) · (i2 + o2) + m inputs, respectively. The two implementations will depend on two different sets of parameters θf(i1,o1) and θf(i2,o2). Clearly, when building the encoding network, the proper implementation will be selected for each node based on its indegree/outdegree configuration. This simple case can be extended to more general configurations involving ranges for the indegree/outdegree.
On the other hand, when the graph is not positional, the state transition function can be realized by a scheme that is independent of the neighborhood size. A simple solution is to use a state transition function implemented as
xv(t + 1) = ∑u∈ne[v] h(xu(t), l(v,u), l(u,v), lv|θh) ,    (7)
where the arc label is set to lnil if the ingoing or the outgoing link is missing. In the case of undirected graphs, we can avoid duplicating the arc label and simplify the function by removing one argument.
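A direct transcription of eq. (7) is sketched below: the new state of node v is obtained by summing the contributions h(·) over its neighbors, with lnil used whenever one of the two arcs between v and u is missing. The callable h and the graph helpers are the same illustrative ones used in the previous sketches.

import numpy as np

def nonpositional_update(graph, x, v, h, d):
    """x_v(t+1) = sum over u in ne[v] of h(x_u(t), l_(v,u), l_(u,v), l_v)   (eq. (7))."""
    l_nil = np.zeros(d)                                 # label used for missing arcs
    total = np.zeros_like(x[v])                         # empty sum for isolated nodes
    for u in graph.neighbors(v):
        l_out = graph.edge_labels.get((v, u), l_nil)    # arc v -> u, if present
        l_in = graph.edge_labels.get((u, v), l_nil)     # arc u -> v, if present
        total = total + h(x[u], l_out, l_in, graph.node_labels[v])
    return total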
In the case of a linear (nonpositional) GNN, the nonpositional transition function h is implemented as
h(xu(t), l(v,u), l(u,v), lv|θh) = Av,u xu + bv    (8)
where the matrix Av,u ∈ Rs×s and the vector bv ∈ Rs are the output of two MLPs, whose connection weights correspond to the parameters of the GNN θh. In partic-
ular, the transition neural network computes the elements of the matrix Av,u given
the labels of the arcs between the nodes v and u, i.e. l(v,u) and l(u,v), and the label of the current node lv. The transition MLP will have 2d + m inputs and s² outputs, implementing a function φ : R2d+m → Rs² that depends on a set of connection weights collected into the parameter vector θφ. The s² values computed by the function φ (l(v,u), l(u,v), lv|θφ ) can be arranged into an s × s square matrix Φv,u such that the
(i, j) entry of Φv,u is the component i + j · s of the vector computed by the function φ . Finally, the transition matrix is computed as
Av,u = ( μ / (s |ne[u]|) ) Φv,u ,    (9)
where μ ∈ (0, 1). This particular expression to compute the transition matrix guar-
antees that the resulting global transition map is a contraction, provided that the
MLP implementing the function φ has a bounded activation function in the output
neurons (e.g. a sigmoid or a hyperbolic tangent) [23]. The forcing term bv can be
computed by a forcing neural network, that is an MLP that implements a function
ρ : Rm → Rs, such that
bv = ρ(lv|θρ ) ,    (10)
where θρ is the parameter vector collecting the connection weights of the MLP implementing the function ρ.
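The sketch below puts eqs. (8)-(10) together for the linear (non-positional) GNN: a transition MLP φ produces the s² entries of Φv,u, which are rescaled by μ/(s|ne[u]|) to obtain Av,u, and a forcing MLP ρ produces bv from the node label. The two one-hidden-layer networks stand in for any MLP implementation, and their weights are random placeholders used only to check shapes.

import numpy as np

def mlp(W1, b1, W2, b2, z):
    """A generic one-hidden-layer MLP with tanh hidden units (illustrative)."""
    return W2 @ np.tanh(W1 @ z + b1) + b2

def linear_gnn_contribution(l_vu, l_uv, l_v, x_u, n_neigh_u, mu, s, phi, rho):
    """h(x_u, l_(v,u), l_(u,v), l_v) = A_(v,u) x_u + b_v   (eqs. (8)-(10))."""
    Phi = phi(np.concatenate([l_vu, l_uv, l_v])).reshape(s, s)   # s^2 outputs of phi
    A = (mu / (s * n_neigh_u)) * Phi                             # eq. (9)
    b = rho(l_v)                                                 # eq. (10), forcing term
    return A @ x_u + b

# Random placeholder networks for s = 3, d = 2, m = 4, q = 8 hidden units.
rng = np.random.default_rng(1)
s, d, m, q = 3, 2, 4, 8
W1p, W2p = rng.normal(size=(q, 2 * d + m)), rng.normal(size=(s * s, q))
W1r, W2r = rng.normal(size=(q, m)), rng.normal(size=(s, q))
# A tanh on the output of phi keeps it bounded, as required for the contraction property.
phi = lambda z: np.tanh(mlp(W1p, np.zeros(q), W2p, np.zeros(s * s), z))
rho = lambda z: mlp(W1r, np.zeros(q), W2r, np.zeros(s), z)
out = linear_gnn_contribution(np.ones(d), np.zeros(d), np.ones(m), x_u=np.ones(s),
                              n_neigh_u=2, mu=0.9, s=s, phi=phi, rho=rho)
print(out.shape)    # (3,)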
In a nonlinear (non-positional) GNN the function h is implemented by an MLP having s + 2d + m inputs and s outputs. In particular, three-layer MLPs are universal approximators [24] and, hence, a three-layer MLP can be employed to approximate any function h provided that a sufficient number of neurons is used. However, this solution does not necessarily implement a contraction map for any value of the MLP connection weights θh. Hence, when adopting this solution, the learning objective needs to include a cost term that penalizes the development of non-contractive mappings (see section 3 for the details).
2.2 Processing DAGs with Recursive Neural Networks
When dealing with directed acyclic graphs, we can devise a simpler processing scheme that does not need a relaxation procedure to compute the states xv for the nodes in the input graph. The setting is a particular case of Graph Neural Networks for which the local state is updated using the transition function
xv = f (xch[v], l(v,ch[v]), lv|θf ) .    (11)
In this model, the state information flows from the children to their parents, following the links between the nodes backwards. Since there are no cycles, we can order the graph nodes using their inverse topological order. In fact, in the topological order, any node precedes all its descendants, i.e. all the nodes that can be reached from it following a directed path (see section 1). Hence, if the node u is an ancestor of node v, u ≺ v in the topological order. In particular, the parent always precedes its children. If, for a given input DAG G, we apply the transition function of eq. (11) recursively following the inverse topological order of the nodes, it is guaranteed that when we process node v the states of its children have already been computed. In particular, the first nodes to be processed are the leaves, whose state only depends on the local label lv (the order in which the leaves are processed is irrelevant).
Then, the computation is propagated to the upper levels of the graph until the source
nodes are reached (or the supersource if there is only one source node).
Hence, in the case of DAGs the simplified version of the local state transition
function allows us to define a processing scheme that does not require to iterate the
state update equations to converge to a fixed point. Following the inverse topological
order, the fixed point of the state computation is reached just after applying the local
update equations only once. Given this general procedure for calculating the states
for a given input graph, specific models are devised assuming some hypotheses on
the structure of the function f and on the type of DAGs in input.
For graphfocused tasks, the output is computed at the supersource. Hence, the
recursive neural network implements a function from the set of DAGs to Rr, ψ : DAGs → Rr, where ψ(G) = os. Formally, ψ = g ◦ f˜, where f˜, recursively defined
as
f˜(G) = xnil if G is empty, and f˜(G) = f ( f˜(G1), . . . , f˜(Go), lv) otherwise,
denotes the process that maps a graph to the state at its supersource, f˜(G) = xs. The encoding function f˜ yields a result that depends both on the topology and on the labels of the input DAG.
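A memoized Python transcription of the recursive encoding f˜ is sketched below: each node is visited only after all of its children, which is equivalent to following the inverse topological order, and xnil plays the role of the null state at the leaves. The callable f stands for any implementation of eq. (11) and all names are illustrative.

import numpy as np

def encode_dag(children, labels, f, x_nil):
    """Compute the state x_v of every node of a DAG via eq. (11), children first.

    children : dict node -> list of child nodes (empty list for leaves)
    labels   : dict node -> node label l_v
    f        : callable f(child_states, l_v) -> new state vector
    """
    states = {}

    def state(v):                 # memoized recursion = inverse topological order
        if v not in states:
            child_states = [state(c) for c in children[v]] or [x_nil]   # leaves see x_nil
            states[v] = f(child_states, labels[v])
        return states[v]

    for v in children:
        state(v)
    return states

# Toy example on a 3-node tree: the state is the node label plus the sum of child states.
children = {'root': ['a', 'b'], 'a': [], 'b': []}
labels = {'root': np.array([1.0]), 'a': np.array([2.0]), 'b': np.array([3.0])}
f = lambda child_states, l_v: l_v + sum(child_states)
states = encode_dag(children, labels, f, x_nil=np.zeros(1))
print(states['root'])   # [6.]; for a graph-focused task, psi(G) = g(states[supersource])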
2.2.1 DPAGs
In DPAGs the children of a given node, ch[v], are ordered and can be organized in a vector using their position as index. Hence, the first argument xch[v] of the state transition function in eq. (11) is a vector in Ro·s, such that
xch[v] = [xch1[v], . . . , xcho[v]] ,    o = maxv∈V od[v] ,
where xchi[v] is equal to the null state xnil, if node v has no child in position i. In particular, when processing a leaf node, the state only depends on its label lv and on the null state xnil, since xleaf = f (xnil, . . . , xnil, lleaf|θf ).
The function f may be implemented by a threelayer perceptron, with sigmoidal
activation functions in the q hidden units and linear activation functions in the output
units, such that the state for each node v is calculated according to
xv = V · σ ( ∑k=1,...,o Ak · xchk[v] + B · lv + C ) + D ,    (12)
where σ is a vectorial sigmoid and the parameters θ f of the local transition function (11) collect the pointer matrices Ak ∈ Rq×s, k = 1, . . . , o, the local label connections B ∈ Rq×m, the hidden layer biases C ∈ Rq, the output layer biases D ∈ Rs, and hidden to output connection weights V ∈ Rs×q.
The state transition function (12) exploits a different pointer matrix Ak for each child position k. This choice makes it possible to implement a dependence of the output value
on the child position, but has some limitations when the node outdegree has a high
variability and the empty positions are just added to pad the child vectors to have the same size. In this case, the pointer matrices associated to the last positions in the child vector are often used to propagate the null pointer value xnil, which is introduced by the padding rule. Since padding is not a natural property of the environment and it is introduced just to remove a limitation of the model, this choice may cause artificial results and a suboptimal use of the network parameters. In these cases, different transition functions can be employed to process nodes having very different values for the outdegree. In practice, all the exploited functions will have the same structure as eq. (12) but with different sets of parameters. By using this approach, we avoid introducing noisy information due to the padding of empty positions with the null state, otherwise needed for nodes having outdegrees lower than the maximum.
Similarly, the output function g can be implemented by a three-layer MLP with q′ hidden neurons with a sigmoidal activation function, such that
ov = W · σ (E · xv + F) + G ,    (13)
where the parameters of the output function θg are E ∈ Rq′×s, F ∈ Rq′, G ∈ Rr, W ∈ Rr×q′.
Fig. 4 Transition function implemented by a three-layer MLP (panels: Input DPAG, Recursive Neural Network for DPAGs, Unfolding network). The resulting recursive neural network can process graphs with a maximum outdegree o = 3.
Figure 4 shows an example of a neural network architecture implementing the transition function of eq. (12). The recursive connections link the output of the state neurons to the network inputs for each position in the child vector. This notation describes the template that is exploited to assemble the encoding network. Apart from the recursive connections, the transition function is implemented by a classical
feedforward MLP with a layer of hidden units. The network has s · o + m inputs corresponding to the o states of the children (s components each) and to the node label (m components).
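The numpy sketch below instantiates the three-layer transition of eq. (12) and the output network of eq. (13); the matrices Ak, B, C, D, V and E, F, G, W are random placeholders with the shapes stated in the text (a single hidden size q is used for both networks, for brevity), and missing children are padded with xnil as described above.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))                  # element-wise logistic sigmoid

def dpag_transition(child_states, l_v, A, B, C, D, V):
    """x_v = V sigma( sum_k A_k x_chk[v] + B l_v + C ) + D   (eq. (12))."""
    pre = B @ l_v + C
    for A_k, x_k in zip(A, child_states):
        pre = pre + A_k @ x_k
    return V @ sigma(pre) + D

def dpag_output(x_v, E, F, G, W):
    """o_v = W sigma(E x_v + F) + G   (eq. (13))."""
    return W @ sigma(E @ x_v + F) + G

# Shapes: s = 3 (state), m = 2 (label), q = 5 (hidden units), o = 2 (max outdegree), r = 1.
rng = np.random.default_rng(2)
s, m, q, o, r = 3, 2, 5, 2, 1
A = [rng.normal(size=(q, s)) for _ in range(o)]      # one pointer matrix per child position
B, C, D, V = rng.normal(size=(q, m)), np.zeros(q), np.zeros(s), rng.normal(size=(s, q))
E, F, G, W = rng.normal(size=(q, s)), np.zeros(q), np.zeros(r), rng.normal(size=(r, q))
x_nil = np.zeros(s)
x_leaf = dpag_transition([x_nil, x_nil], np.ones(m), A, B, C, D, V)   # a leaf node
print(dpag_output(x_leaf, E, F, G, W))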
2.2.2 DAGsLE
In many applications, the DPAGbased model adds unnecessary constraints in the
data encoding. First, the assignment of a position to each child of a node may not be
naturally defined, and the chosen criterion may be arbitrary. Second, as also noted
previously, the need to have a uniform outdegree or to bound its maximum value in
the graphs can cause the loss of important information. For example, when consid-
ering the image representation based on the Region Adjacency Graph, the order of
the adjacent regions may be significant, since they can be listed following the region
contour in a given direction, but to map them to a specific position we need to define
a criterion to choose the starting point on the contour and the contour direction. Un-
less there is a natural and uniform criterion we may end up with arbitrary encodings
that fail to preserve natural properties of the modeled environment. Moreover, the
maximum graph outdegree may be bounded by defining a criterion to prune some
arcs when the number of the node children exceeds a predefined value. Anyway, the
pruning process may lead to a loss of useful information and may not completely
eliminate the need to arbitrarily pad the last positions in the child vector with the
null state xnil, for each node v having |ch[v]| < o. In fact, the need to have exactly o children is a limitation of the model and not a feature of the problem.
These limitations may be overcome by modeling the data with DAGsLE [12].
In fact, the arc label l(u,v) can encode attributes of the relationship represented by the arcs, allowing us to devise a model able to distinguish the role of each child without
the need to encode it by its position. For DAGsLE, we can define a transition
function f that does not have a predefined number of arguments and that does not depend
on their order, following eq. (7). The general form for the state update function for
DAGsLE is
xv = ∑u∈ch[v] h(xu, l(v,u), lv|θh) ,    (14)
that can be applied for any internal node, i.e. the nodes with at least one child. For the
leaf nodes the set ch[leaf] is empty and the previous update equation is undefined. In this case the state will depend on the node label lleaf, such that xleaf = h′(lleaf|θh′).
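A minimal Python transcription of eq. (14) is sketched below, with a separate callable used for the leaves as discussed above; both callables and the toy dimensions are illustrative.

import numpy as np

def dag_le_state(child_states, out_labels, l_v, h, h_leaf):
    """x_v = sum over u in ch[v] of h(x_u, l_(v,u), l_v)  (eq. (14)); leaves use h_leaf(l_v)."""
    if not child_states:                              # ch[v] is empty: leaf node
        return h_leaf(l_v)
    contribs = [h(x_u, l_vu, l_v) for x_u, l_vu in zip(child_states, out_labels)]
    return np.sum(contribs, axis=0)                   # order-independent aggregation

# Toy example with states in R^2: h weights the child state by a scalar arc attribute.
h = lambda x_u, l_vu, l_v: np.tanh(l_vu[0] * x_u + l_v)
h_leaf = lambda l_v: np.tanh(l_v)
x_a, x_b = h_leaf(np.array([2.0, 0.0])), h_leaf(np.array([0.0, 3.0]))
x_root = dag_le_state([x_a, x_b], [np.array([0.5]), np.array([2.0])],
                      np.array([0.1, 0.1]), h, h_leaf)
print(x_root)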
However, there are other solutions to implement a nonpositional transition func-
tion, such as that proposed in [12]. In this model, first the contributions of the children are summed to obtain an intermediate node state yv ∈ Rp, then this state is nonlinearly combined with the node label to yield the node state xv. The state update function is then defined as
yv = ∑u∈ch[v] φ (xu, l(v,u)|θφ ) ,
xv = f˜(yv, lv|θf˜) ,    (15)