Learning with Kernels

Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola

Learning with Kernels
Support Vector Machines, Regularization, Optimization, and Beyond

Bernhard Schölkopf
Alexander J. Smola

The MIT Press
Cambridge, Massachusetts
London, England
© 2002 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Typeset by the authors using LaTeX 2e. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Learning with Kernels — Support Vector Machines, Regularization, Optimization and Beyond / by Bernhard Schölkopf, Alexander J. Smola.
p. cm.
Includes bibliographical references and index.
ISBN 0-262-19475-9 (alk. paper)
1. Machine learning. 2. Algorithms. 3. Kernel functions. I. Schölkopf, Bernhard. II. Smola, Alexander J.

To our parents

Contents

Series Foreword   xiii
Preface   xv

1  A Tutorial Introduction   1
   1.1  Data Representation and Similarity   1
   1.2  A Simple Pattern Recognition Algorithm   4
   1.3  Some Insights From Statistical Learning Theory   6
   1.4  Hyperplane Classifiers   11
   1.5  Support Vector Classification   15
   1.6  Support Vector Regression   17
   1.7  Kernel Principal Component Analysis   19
   1.8  Empirical Results and Implementations   21

I  CONCEPTS AND TOOLS   23

2  Kernels   25
   2.1  Product Features   26
   2.2  The Representation of Similarities in Linear Spaces   29
   2.3  Examples and Properties of Kernels   45
   2.4  The Representation of Dissimilarities in Linear Spaces   48
   2.5  Summary   55
   2.6  Problems   55

3  Risk and Loss Functions   61
   3.1  Loss Functions   62
   3.2  Test Error and Expected Risk   65
   3.3  A Statistical Perspective   68
   3.4  Robust Estimators   75
   3.5  Summary   83
   3.6  Problems   84

4  Regularization   87
   4.1  The Regularized Risk Functional   88
   4.2  The Representer Theorem   89
   4.3  Regularization Operators   92
   4.4  Translation Invariant Kernels   96
   4.5  Translation Invariant Kernels in Higher Dimensions   105
   4.6  Dot Product Kernels   110
   4.7  Multi-Output Regularization   113
   4.8  Semiparametric Regularization   115
   4.9  Coefficient Based Regularization   118
   4.10  Summary   121
   4.11  Problems   122

5  Elements of Statistical Learning Theory   125
   5.1  Introduction   125
   5.2  The Law of Large Numbers   128
   5.3  When Does Learning Work: the Question of Consistency   131
   5.4  Uniform Convergence and Consistency   131
   5.5  How to Derive a VC Bound   134
   5.6  A Model Selection Example   144
   5.7  Summary   146
   5.8  Problems   146

6  Optimization   149
   6.1  Convex Optimization   150
   6.2  Unconstrained Problems   154
   6.3  Constrained Problems   165
   6.4  Interior Point Methods   175
   6.5  Maximum Search Problems   179
   6.6  Summary   183
   6.7  Problems   184

II  SUPPORT VECTOR MACHINES   187

7  Pattern Recognition   189
   7.1  Separating Hyperplanes   189
   7.2  The Role of the Margin   192
   7.3  Optimal Margin Hyperplanes   196
   7.4  Nonlinear Support Vector Classifiers   200
   7.5  Soft Margin Hyperplanes   204
   7.6  Multi-Class Classification   211
   7.7  Variations on a Theme   214
   7.8  Experiments   215
   7.9  Summary   222
   7.10  Problems   222

8  Single-Class Problems: Quantile Estimation and Novelty Detection   227
   8.1  Introduction   228
   8.2  A Distribution's Support and Quantiles   229
   8.3  Algorithms   230
   8.4  Optimization   234
   8.5  Theory   236
   8.6  Discussion   241
   8.7  Experiments   243
   8.8  Summary   247
   8.9  Problems   248

9  Regression Estimation   251
   9.1  Linear Regression with Insensitive Loss Function   251
   9.2  Dual Problems   254
   9.3  ν-SV Regression   260
   9.4  Convex Combinations and ℓ1-Norms   266
   9.5  Parametric Insensitivity Models   269
   9.6  Applications   272
   9.7  Summary   273
   9.8  Problems   274

10  Implementation   279
   10.1  Tricks of the Trade   281
   10.2  Sparse Greedy Matrix Approximation   288
   10.3  Interior Point Algorithms   295
   10.4  Subset Selection Methods   300
   10.5  Sequential Minimal Optimization   305
   10.6  Iterative Methods   312
   10.7  Summary   327
   10.8  Problems   329

11  Incorporating Invariances   333
   11.1  Prior Knowledge   333
   11.2  Transformation Invariance   335
   11.3  The Virtual SV Method   337
   11.4  Constructing Invariance Kernels   343
   11.5  The Jittered SV Method   354
   11.6  Summary   356
   11.7  Problems   357

12  Learning Theory Revisited   359
   12.1  Concentration of Measure Inequalities   360
   12.2  Leave-One-Out Estimates   366
   12.3  PAC-Bayesian Bounds   381
   12.4  Operator-Theoretic Methods in Learning Theory   391
   12.5  Summary   403
   12.6  Problems   404

III  KERNEL METHODS   405

13  Designing Kernels   407
   13.1  Tricks for Constructing Kernels   408
   13.2  String Kernels   412
   13.3  Locality-Improved Kernels   414
   13.4  Natural Kernels   418
   13.5  Summary   423
   13.6  Problems   423

14  Kernel Feature Extraction   427
   14.1  Introduction   427
   14.2  Kernel PCA   429
   14.3  Kernel PCA Experiments   437
   14.4  A Framework for Feature Extraction   442
   14.5  Algorithms for Sparse KFA   447
   14.6  KFA Experiments   450
   14.7  Summary   451
   14.8  Problems   452

15  Kernel Fisher Discriminant   457
   15.1  Introduction   457
   15.2  Fisher's Discriminant in Feature Space   458
   15.3  Efficient Training of Kernel Fisher Discriminants   460
   15.4  Probabilistic Outputs   464
   15.5  Experiments   466
   15.6  Summary   467
   15.7  Problems   468

16  Bayesian Kernel Methods   469
   16.1  Bayesics   470
   16.2  Inference Methods   475
   16.3  Gaussian Processes   480
   16.4  Implementation of Gaussian Processes   488
   16.5  Laplacian Processes   499
   16.6  Relevance Vector Machines   506
   16.7  Summary   511
   16.8  Problems   513

17  Regularized Principal Manifolds   517
   17.1  A Coding Framework   518
   17.2  A Regularized Quantization Functional   522
   17.3  An Algorithm for Minimizing R_reg[f]   526
   17.4  Connections to Other Algorithms   529
   17.5  Uniform Convergence Bounds   533
   17.6  Experiments   537
   17.7  Summary   539
   17.8  Problems   540

18  Pre-Images and Reduced Set Methods   543
   18.1  The Pre-Image Problem   544
   18.2  Finding Approximate Pre-Images   547
   18.3  Reduced Set Methods   552
   18.4  Reduced Set Selection Methods   554
   18.5  Reduced Set Construction Methods   561
   18.6  Sequential Evaluation of Reduced Set Expansions   564
   18.7  Summary   566
   18.8  Problems   567

A  Addenda   569
   A.1  Data Sets   569
   A.2  Proofs   572

B  Mathematical Prerequisites   575
   B.1  Probability   575
   B.2  Linear Algebra   580
   B.3  Functional Analysis   586

References   591
Index   617
Notation and Symbols   625

Series Foreword
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many scientific and industrial fields. Recently, several research communities have converged on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high quality research and innovative applications.

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond is an excellent illustration of this convergence of ideas from many fields. The development of kernel-based learning methods has resulted from a combination of machine learning theory, optimization algorithms from operations research, and kernel techniques from mathematical analysis. These three ideas have spread far beyond the original support-vector machine algorithm: Virtually every learning algorithm has been redesigned to exploit the power of kernel methods. Bernhard Schölkopf and Alexander Smola have written a comprehensive, yet accessible, account of these developments. This volume includes all of the mathematical and algorithmic background needed not only to obtain a basic understanding of the material but to master it. Students and researchers who study this book will be able to apply kernel methods in creative ways to solve a wide range of problems in science and engineering.

Thomas Dietterich

Preface
One of the most fortunate situations a scientist can encounter is to enter a field in its infancy. There is a large choice of topics to work on, and many of the issues are conceptual rather than merely technical. Over the last seven years, we have had the privilege to be in this position with regard to the field of Support Vector Machines (SVMs). We began working on our respective doctoral dissertations in 1994 and 1996. Upon completion, we decided to combine our efforts and write a book about SVMs. Since then, the field has developed impressively, and has to an extent been transformed. We set up a website that quickly became the central repository for the new community, and a number of workshops were organized by various researchers. The scope of the field has now widened significantly, both in terms of new algorithms, such as kernel methods different to SVMs, and in terms of a deeper theoretical understanding being gained. It has become clear that kernel methods provide a framework for tackling some rather profound issues in machine learning theory. At the same time, successful applications have demonstrated that SVMs not only have a more solid foundation than artificial neural networks, but are able to serve as a replacement for neural networks that perform as well or better, in a wide variety of fields. Standard neural network and pattern recognition textbooks have now started including chapters on SVMs and kernel PCA (for instance, [235, 153]).
While these developments took place, we were trying to strike a balance between pursuing exciting new research, and making progress with the slowly growing manuscript of this book. In the two and a half years that we worked on the book, we faced a number of lessons that we suspect everyone writing a scientific monograph — or any other book — will encounter. First, writing a book is more work than you think, even with two authors sharing the work in equal parts. Second, our book got longer than planned. Once we exceeded the initially planned length of 500 pages, we got worried. In fact, the manuscript kept growing even after we stopped writing new chapters, and began polishing things and incorporating corrections suggested by colleagues. This was mainly due to the fact that the book deals with a fascinating new area, and researchers keep adding fresh material to the body of knowledge. We learned that there is no asymptotic regime in writing such a book — if one does not stop, it will grow beyond any bound — unless one starts cutting. We therefore had to take painful decisions to leave out material that we originally thought should be in the book. Sadly, and this is the third point, the book thus contains less material than originally planned, especially on the sub-
ject of theoretical developments. We sincerely apologize to all researchers who feel that their contributions should have been included — the book is certainly biased towards our own work, and does not provide a fully comprehensive overview of the field. We did, however, aim to provide all the necessary concepts and ideas to enable a reader equipped with some basic mathematical knowledge to enter the engaging world of machine learning, using theoretically well-founded kernel algorithms, and to understand and apply the powerful algorithms that have been developed over the last few years.
The book is divided into three logical parts. Each part consists of a brief introduction and a number of technical chapters. In addition, we include two appendices containing addenda, technical details, and mathematical prerequisites. Each chapter begins with a short discussion outlining the contents and prerequisites; for some of the longer chapters, we include a graph that sketches the logical structure and dependencies between the sections. At the end of most chapters, we include a set of problems, ranging from simple exercises to hard ones; in addition, we describe open problems and questions for future research.¹ The latter often represent worthwhile projects for a research publication, or even a thesis. References are also included in some of the problems. These references contain the solutions to the associated problems, or at least significant parts thereof.
The overall structure of the book is perhaps somewhat unusual. Rather than presenting a logical progression of chapters building upon each other, we occasionally touch on a subject briefly, only to revisit it later in more detail. For readers who are used to reading scientific monographs and textbooks from cover to cover, this will amount to some redundancy. We hope, however, that some readers, who are more selective in their reading habits (or less generous with their time), and only look at those chapters that they are interested in, will benefit. Indeed, nobody is expected to read every chapter. Some chapters are fairly technical, and cover material included for reasons of completeness. Other chapters, which are more relevant to the central subjects of the book, are kept simpler, and should be accessible to undergraduate students.
In a way, this book thus contains several books in one. For instance, the first chapter can be read as a standalone "executive summary" of Support Vector and kernel methods. This chapter should also provide a fast entry point for practitioners. Someone interested in applying SVMs to a pattern recognition problem might want to read Chapters 1 and 7 only. A reader thinking of building their own SVM implementation could additionally read Chapter 10, and parts of Chapter 6. Those who would like to get actively involved in research aspects of kernel methods, for example by "kernelizing" a new algorithm, should probably read at least Chapters 1 and 2. A one-semester undergraduate course on learning with kernels could include the material of Chapters 1, 2.1–2.3, 3.1–3.2, 5.1–5.2, 6.1–6.3, 7. If there is more time, one of the Chapters 14, 16, or 17 can be added, or 4.1–4.2. A graduate course could additionally deal with the more advanced parts of Chapters 3, 4, and 5. The remaining chapters provide ample material for specialized courses and seminars.

1. We suggest that authors post their solutions on the book website www.learning-with-kernels.org.
As a general time-saving rule, we recommend reading the first chapter and then jumping directly to the chapter of particular interest to the reader. Chances are that this will lead to a chapter that contains references to the earlier ones, which can then be followed as desired. We hope that this way, readers will inadvertently be tempted to venture into some of the less frequented chapters and research areas. Explore this book; there is a lot to find, and much more is yet to be discovered in the field of learning with kernels.
We conclude the preface by thanking those who assisted us in the preparation of the book. Our first thanks go to our first readers. Chris Burges, Arthur Gretton, and Bob Williamson have read through various versions of the book, and made numerous suggestions that corrected or improved the material. A number of other researchers have proofread various chapters. We would like to thank Matt Beal, Daniel Berger, Olivier Bousquet, Ben Bradshaw, Nicolò Cesa-Bianchi, Olivier Chapelle, Dennis DeCoste, Andre Elisseeff, Anita Faul, Arnulf Graf, Isabelle Guyon, Ralf Herbrich, Simon Hill, Dominik Janzing, Michael Jordan, Sathiya Keerthi, Neil Lawrence, Ben O'Loghlin, Ulrike von Luxburg, Davide Mattera, Sebastian Mika, Natasa Milic-Frayling, Marta Milo, Klaus Müller, Dave Musicant, Fernando Pérez Cruz, Ingo Steinwart, Mike Tipping, and Chris Williams.
In addition, a large number of people have contributed to this book in one way or another, be it by sharing their insights with us in discussions, or by collaborating with us on some of the topics covered in the book. In many places, this strongly influenced the presentation of the material. We would like to thank Dimitris Achlioptas, Luís Almeida, Shun-Ichi Amari, Peter Bartlett, Jonathan Baxter, Tony Bell, Shai Ben-David, Kristin Bennett, Matthias Bethge, Chris Bishop, Andrew Blake, Volker Blanz, Léon Bottou, Paul Bradley, Chris Burges, Heinrich Bülthoff, Olivier Chapelle, Nello Cristianini, Corinna Cortes, Cameron Dawson, Tom Dietterich, André Elisseeff, Oscar de Feo, Federico Girosi, Thore Graepel, Isabelle Guyon, Patrick Haffner, Stefan Harmeling, Paul Hayton, Markus Hegland, Ralf Herbrich, Tommi Jaakkola, Michael Jordan, Jyrki Kivinen, Yann LeCun, Chih-Jen Lin, Gabor Lugosi, Olvi Mangasarian, Laurent Massoulie, Sebastian Mika, Sayan Mukherjee, Klaus Müller, Noboru Murata, Nuria Oliver, John Platt, Tomaso Poggio, Gunnar Rätsch, Sami Romdhani, Rainer von Sachs, Christoph Schnörr, Matthias Seeger, John Shawe-Taylor, Kristy Sim, Patrice Simard, Stephen Smale, Sara Solla, Lionel Tarassenko, Lily Tian, Mike Tipping, Alexander Tsybakov, Lou van den Dries, Santosh Venkatesh, Thomas Vetter, Chris Watkins, Jason Weston, Chris Williams, Bob Williamson, Andreas Ziehe, Alex Zien, and Tong Zhang.
Next, we would like to extend our thanks to the research institutes that allowed us to pursue our research interests and to dedicate the time necessary for writing the present book; these are AT&T / Bell Laboratories (Holmdel), the Australian National University (Canberra), Biowulf Technologies (New York), GMD FIRST (Berlin), the Max-Planck-Institute for Biological Cybernetics (Tübingen), and Microsoft Research (Cambridge). We are grateful to Doug Sery from MIT Press for continuing support and encouragement during the writing of this book. We are, moreover, indebted to funding from various sources; specifically, from the Studienstiftung des deutschen Volkes, the Deutsche Forschungsgemeinschaft, the Australian Research Council, and the European Union.
Finally, special thanks go to Vladimir Vapnik, who introduced us to the fascinating world of statistical learning theory.
. . . the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classification and Sheep Vectors. . .
Illustration by Ana Martín Larrañaga

1  A Tutorial Introduction

Overview   Prerequisites

This chapter describes the central ideas of Support Vector (SV) learning in a nutshell. Its goal is to provide an overview of the basic concepts.
One such concept is that of a kernel. Rather than going immediately into mathematical detail, we introduce kernels informally as similarity measures that arise from a particular representation of patterns (Section 1.1), and describe a simple kernel algorithm for pattern recognition (Section 1.2). Following this, we report some basic insights from statistical learning theory, the mathematical theory that underlies SV learning (Section 1.3). Finally, we briefly review some of the main kernel algorithms, namely Support Vector Machines (SVMs) (Sections 1.4 to 1.6) and kernel principal component analysis (Section 1.7).
We have aimed to keep this introductory chapter as basic as possible, whilst giving a fairly comprehensive overview of the main ideas that will be discussed in the present book. After reading it, readers should be able to place all the remaining material in the book in context and judge which of the following chapters is of particular interest to them.
As a consequence of this aim, most of the claims in the chapter are not proven. Abundant references to later chapters will enable the interested reader to fill in the gaps at a later stage, without losing sight of the main ideas described presently.
1.1  Data Representation and Similarity

Training Data

One of the fundamental problems of learning theory is the following: suppose we are given two classes of objects. We are then faced with a new object, and we have to assign it to one of the two classes. This problem can be formalized as follows: we are given empirical data

    (x_1, y_1), \dots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}.    (1.1)

Here, \mathcal{X} is some nonempty set from which the patterns x_i (sometimes called cases, inputs, instances, or observations) are taken, usually referred to as the domain; the y_i are called labels, targets, outputs or sometimes also observations.¹ Note that there are only two classes of patterns. For the sake of mathematical convenience, they are labelled by +1 and -1, respectively. This is a particularly simple situation, referred to as (binary) pattern recognition or (binary) classification.

1. Note that we use the term pattern to refer to individual observations. A (smaller) part of the existing literature reserves the term for a generic prototype which underlies the data. The latter is probably closer to the original meaning of the term, however we decided to stick with the present usage, which is more common in the field of machine learning.

It should be emphasized that the patterns could be just about anything, and we have made no assumptions on \mathcal{X} other than it being a set. For instance, the task might be to categorize sheep into two classes, in which case the patterns x_i would simply be sheep.

In order to study the problem of learning, however, we need an additional type of structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern x \in \mathcal{X}, we want to predict the corresponding y \in \{\pm 1\}.² By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples (1.1). To this end, we need notions of similarity in \mathcal{X} and in \{\pm 1\}.

2. Doing this for every x \in \mathcal{X} amounts to estimating a function f : \mathcal{X} \to \{\pm 1\}.

Characterizing the similarity of the outputs \{\pm 1\} is easy: in binary classification, only two situations can occur: two labels can either be identical or different. The choice of the similarity measure for the inputs, on the other hand, is a deep question that lies at the core of the field of machine learning.

Let us consider a similarity measure of the form

    k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad (x, x') \mapsto k(x, x'),    (1.2)

that is, a function that, given two patterns x and x', returns a real number characterizing their similarity. Unless stated otherwise, we will assume that k is symmetric, that is, k(x, x') = k(x', x) for all x, x' \in \mathcal{X}. For reasons that will become clear later (cf. Remark 2.16), the function k is called a kernel [359, 4, 42, 62, 223].

Dot Product

General similarity measures of this form are rather difficult to study. Let us therefore start from a particularly simple case, and generalize it subsequently. A simple type of similarity measure that is of particular mathematical appeal is a dot product. For instance, given two vectors x, x' \in \mathbb{R}^N, the canonical dot product is defined as

    \langle x, x' \rangle := \sum_{i=1}^{N} [x]_i [x']_i.    (1.3)

Here, [x]_i denotes the ith entry of x.

Note that the dot product is also referred to as inner product or scalar product, and sometimes denoted with round brackets and a dot, as (x \cdot x') — this is where the "dot" in the name comes from. In Section B.2, we give a general definition of dot products. Usually, however, it is sufficient to think of dot products as (1.3).

Length   Feature Space

The geometric interpretation of the canonical dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized to length 1. Moreover, it allows computation of the length (or norm) of a vector x as

    \|x\| = \sqrt{\langle x, x \rangle}.    (1.4)

Likewise, the distance between two vectors is computed as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometric constructions that can be formulated in terms of angles, lengths and distances.

Note, however, that the dot product approach is not really sufficiently general to deal with many interesting problems.

First, we have deliberately not made the assumption that the patterns actually exist in a dot product space. So far, they could be any kind of object. In order to be able to use a dot product as a similarity measure, we therefore first need to represent the patterns as vectors in some dot product space \mathcal{H} (which need not coincide with \mathbb{R}^N). To this end, we use a map

    \Phi : \mathcal{X} \to \mathcal{H}, \quad x \mapsto \mathbf{x} := \Phi(x).    (1.5)

Second, even if the original patterns exist in a dot product space, we may still want to consider more general similarity measures obtained by applying a map (1.5). In that case, \Phi will typically be a nonlinear map. An example that we will consider in Chapter 2 is a map which computes products of entries of the input patterns.

In both the above cases, the space \mathcal{H} is called a feature space. Note that we have used a bold face x to denote the vectorial representation of x in the feature space. We will follow this convention throughout the book.

To summarize, embedding the data into \mathcal{H} via \Phi has three benefits:

1. It lets us define a similarity measure from the dot product in \mathcal{H},

    k(x, x') := \langle \mathbf{x}, \mathbf{x}' \rangle = \langle \Phi(x), \Phi(x') \rangle.    (1.6)

2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.

3. The freedom to choose the mapping \Phi will enable us to design a large variety of similarity measures and learning algorithms. This also applies to the situation where the inputs x_i already exist in a dot product space. In that case, we might directly use the dot product as a similarity measure. However, nothing prevents us from first applying a possibly nonlinear map \Phi to change the representation into one that is more suitable for a given problem. This will be elaborated in Chapter 2, where the theory of kernels is developed in more detail; a small numerical sketch follows below.
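The following is a minimal numerical sketch (our own, not taken from the book) of benefit 1: a nonlinear map that computes products of entries of the input — the Chapter 2 example mentioned above — turns the dot product in feature space (1.6) into a kernel that can also be evaluated directly from the input dot product. The helper names phi and k are arbitrary.

    import numpy as np

    def phi(x):
        # Product-feature map for x in R^2: all ordered products of two entries.
        return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

    def k(x, xp):
        # Kernel (1.6): the canonical dot product of the feature-space images.
        return float(np.dot(phi(x), phi(xp)))

    x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(k(x, xp))                    # 1.0
    print(float(np.dot(x, xp)) ** 2)   # 1.0: same value, computed without phi

The second print shows why such feature maps are attractive: the kernel value equals the squared input dot product, so the (possibly high-dimensional) feature space never has to be visited explicitly.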
1.2  A Simple Pattern Recognition Algorithm

Decision Function

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. We make use of the structure introduced in the previous section; that is, we assume that our data are embedded into a dot product space \mathcal{H}.³ Using the dot product, we can measure distances in this space. The basic idea of the algorithm is to assign a previously unseen pattern to the class with closer mean.

3. For the definition of a dot product space, see Section B.2.

We thus begin by computing the means of the two classes in feature space,

    \mathbf{c}_+ = \frac{1}{m_+} \sum_{\{i : y_i = +1\}} \mathbf{x}_i,    (1.7)

    \mathbf{c}_- = \frac{1}{m_-} \sum_{\{i : y_i = -1\}} \mathbf{x}_i,    (1.8)

where m_+ and m_- are the number of examples with positive and negative labels, respectively. We assume that both classes are non-empty, thus m_+, m_- > 0. We assign a new point \mathbf{x} to the class whose mean is closest (Figure 1.1). This geometric construction can be formulated in terms of the dot product \langle \cdot , \cdot \rangle. Half way between \mathbf{c}_+ and \mathbf{c}_- lies the point \mathbf{c} := (\mathbf{c}_+ + \mathbf{c}_-)/2. We compute the class of \mathbf{x} by checking whether the vector \mathbf{x} - \mathbf{c} connecting \mathbf{c} to \mathbf{x} encloses an angle smaller than \pi/2 with the vector \mathbf{w} := \mathbf{c}_+ - \mathbf{c}_- connecting the class means. This leads to

    y = \operatorname{sgn} \langle (\mathbf{x} - \mathbf{c}), \mathbf{w} \rangle
      = \operatorname{sgn} \langle (\mathbf{x} - (\mathbf{c}_+ + \mathbf{c}_-)/2), (\mathbf{c}_+ - \mathbf{c}_-) \rangle
      = \operatorname{sgn} ( \langle \mathbf{x}, \mathbf{c}_+ \rangle - \langle \mathbf{x}, \mathbf{c}_- \rangle + b ).    (1.9)

Here, we have defined the offset

    b := \frac{1}{2} \left( \|\mathbf{c}_-\|^2 - \|\mathbf{c}_+\|^2 \right),    (1.10)

with the norm \|\mathbf{x}\| := \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}. If the class means have the same distance to the origin, then b will vanish.

Figure 1.1  A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means \mathbf{c}_+, \mathbf{c}_- and assign a test pattern \mathbf{x} to the one whose mean is closer. This can be done by looking at the dot product between \mathbf{x} - \mathbf{c} (where \mathbf{c} = (\mathbf{c}_+ + \mathbf{c}_-)/2) and \mathbf{w} := \mathbf{c}_+ - \mathbf{c}_-, which changes sign as the enclosed angle passes through \pi/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to \mathbf{w}.

Note that (1.9) induces a decision boundary which has the form of a hyperplane (Figure 1.1); that is, a set of points that satisfy a constraint expressible as a linear equation.

It is instructive to rewrite (1.9) in terms of the input patterns x_i, using the kernel k to compute the dot products. Note, however, that (1.6) only tells us how to compute the dot products between vectorial representations \mathbf{x}_i of inputs x_i. We therefore need to express the vectors \mathbf{c}_+, \mathbf{c}_-, and \mathbf{w} in terms of \mathbf{x}_1, \dots, \mathbf{x}_m.

To this end, substitute (1.7) and (1.8) into (1.9) to get the decision function

    y = \operatorname{sgn}\left( \frac{1}{m_+} \sum_{\{i : y_i = +1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle + b \right)
      = \operatorname{sgn}\left( \frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i) + b \right).    (1.11)

Similarly, the offset becomes

    b := \frac{1}{2} \left( \frac{1}{m_-^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j) \right).    (1.12)

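To make the construction concrete, here is a small NumPy sketch (our own illustration, not the book's code) of the decision function (1.11) with the offset (1.12). With the plain dot product as the kernel it reduces exactly to the "closer class mean" rule above; all names are arbitrary.

    import numpy as np

    def simple_classifier(X, y, k):
        """Mean-based classifier of Section 1.2. X has shape (m, d), y contains +/-1,
        and k(x, x') is a kernel. Returns a decision function implementing (1.11)/(1.12)."""
        pos, neg = X[y == +1], X[y == -1]

        # Offset (1.12): half the difference of the average within-class kernel values.
        b = 0.5 * (np.mean([k(a, c) for a in neg for c in neg])
                   - np.mean([k(a, c) for a in pos for c in pos]))

        def decide(x):
            # Decision function (1.11): compare the average similarity to each class.
            return np.sign(np.mean([k(x, xi) for xi in pos])
                           - np.mean([k(x, xi) for xi in neg]) + b)

        return decide

    # Toy data: two point clouds; with the linear kernel this is exactly the
    # "assign to the class with the closer mean" rule.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
    y = np.array([+1] * 20 + [-1] * 20)
    f = simple_classifier(X, y, k=np.dot)
    print(f(np.array([1.5, 2.0])), f(np.array([-2.5, -1.0])))  # typically +1.0, -1.0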
Surprisingly, it turns out that this rather simple-minded approach contains a well-known statistical classification method as a special case. Assume that the class means have the same distance to the origin (hence b = 0, cf. (1.10)), and that k can be viewed as a probability density when one of its arguments is fixed. By this we mean that it is positive and has unit integral,⁴

    \int_{\mathcal{X}} k(x, x') \, dx = 1 \quad \text{for all } x' \in \mathcal{X}.    (1.13)

4. In order to state this assumption, we have to require that we can define an integral on \mathcal{X}.

Parzen Windows

In this case, (1.11) takes the form of the so-called Bayes classifier separating the two classes, subject to the assumption that the two classes of patterns were generated by sampling from two probability distributions that are correctly estimated by the Parzen windows estimators of the two class densities,

    p_+(x) := \frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i) \quad \text{and} \quad p_-(x) := \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i),    (1.14)

where x \in \mathcal{X}. Given some point x, the label is then simply computed by checking which of the two values p_+(x) or p_-(x) is larger, which leads directly to (1.11). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes.

The classifier (1.11) is quite close to the type of classifier that this book deals with in detail. Both take the form of kernel expansions on the input domain,

    y = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i k(x, x_i) + b \right).    (1.15)

In both cases, the expansions correspond to a separating hyperplane in a feature space. In this sense, the \alpha_i can be considered a dual representation of the hyperplane's normal vector [223]. Both classifiers are example-based in the sense that the kernels are centered on the training patterns; that is, one of the two arguments of the kernel is always a training pattern. A test point is classified by comparing it to all the training points that appear in (1.15) with a nonzero weight.

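The kernel expansion (1.15) is the central object of the book. The few lines below (our own illustration, with made-up weights) show how such a decision function is evaluated once weights alpha_i and an offset b have been obtained by some training procedure, and how training points with alpha_i = 0 drop out of the computation.

    import numpy as np

    def kernel_expansion(alpha, b, X_train, k):
        """Decision function (1.15): y = sgn(sum_i alpha_i * k(x, x_i) + b).
        Training points with alpha_i == 0 do not contribute; only the remaining
        points need to be kept."""
        keep = alpha != 0
        alpha_kept, X_kept = alpha[keep], X_train[keep]
        return lambda x: np.sign(sum(a * k(x, xi) for a, xi in zip(alpha_kept, X_kept)) + b)

    # Example with a Gaussian (RBF) kernel and arbitrary weights.
    rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
    alpha = np.array([0.5, 0.0, -0.5])   # the middle point plays no role
    f = kernel_expansion(alpha, b=0.0, X_train=X_train, k=rbf)
    print(f(np.array([0.2, -0.1])))      # +1.0: closest to the positively weighted point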
More sophisticated classification techniques, to be discussed in the remainder of the book, deviate from (1.11) mainly in the selection of the patterns on which the kernels are centered and in the choice of weights \alpha_i that are placed on the individual kernels in the decision function. It will no longer be the case that all training patterns appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform within the classes — recall that in the current example, cf. (1.11), the weights are either (1/m_+) or (-1/m_-), depending on the class to which the pattern belongs.

In the feature space representation, this statement corresponds to saying that we will study normal vectors w of decision hyperplanes that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (1.11)). The hyperplane will then only depend on a subset of training patterns called Support Vectors.
1.3  Some Insights From Statistical Learning Theory

With the above example in mind, let us now consider the problem of pattern recognition in a slightly more formal setting [559, 152, 186]. This will allow us to indicate the factors affecting the design of "better" algorithms. Rather than just providing tools to come up with new algorithms, we also want to provide some insight in how to do it in a promising way.

In two-class pattern recognition, we seek to infer a function

    f : \mathcal{X} \to \{\pm 1\}    (1.16)

from input-output training data (1.1). The training data are sometimes also called the sample.

Figure 1.2  2D toy example of binary classification, solved using three models (the decision boundaries are shown). The models vary in complexity, ranging from a simple one (left), which misclassifies a large number of points, to a complex one (right), which "trusts" each point and comes up with a solution that is consistent with all training points (but may not work well on new points). As an aside: the plots were generated using the so-called soft-margin SVM to be explained in Chapter 7; cf. also Figure 7.10.

Figure 1.2 shows a simple 2D toy example of a pattern recognition problem. The task is to separate the solid dots from the circles by finding a function which takes the value +1 on the dots and -1 on the circles. Note that instead of plotting this function, we may plot the boundaries where it switches between +1 and -1. In the rightmost plot, we see a classification function which correctly separates all training points. From this picture, however, it is unclear whether the same would hold true for test points which stem from the same underlying regularity. For instance, what should happen to a test point which lies close to one of the two "outliers," sitting amidst points of the opposite class? Maybe the outliers should not be allowed to claim their own custom-made regions of the decision function. To avoid this, we could try to go for a simpler model which disregards these points. The leftmost picture shows an almost linear separation of the classes. This separation, however, not only misclassifies the above two outliers, but also a number of "easy" points which are so close to the decision boundary that the classifier really should be able to get them right. Finally, the central picture represents a compromise, by using a model with an intermediate complexity, which gets most points right, without putting too much trust in any individual point.

The goal of statistical learning theory is to place these intuitive arguments in a mathematical framework. To this end, it studies mathematical properties of learning machines. These properties are usually properties of the function class that the learning machine can implement.

IID Data   Loss Function   Test Data   Empirical Risk   Risk

Figure 1.3  A 1D classification problem, with a training set of three points (marked by circles), and three test inputs (marked on the x-axis). Classification is performed by thresholding real-valued functions g(x) according to sgn(g(x)). Note that both functions (dotted line, and solid line) perfectly explain the training data, but they give opposite predictions on the test inputs. Lacking any further information, the training data alone give us no means to tell which of the two functions is to be preferred.

We assume that the data are generated independently from some unknown (but fixed) probability distribution P(x, y).⁵ This is a standard assumption in learning theory; data generated this way is commonly referred to as iid (independent and identically distributed). Our goal is to find a function f that will correctly classify unseen examples (x, y), so that f(x) = y for examples (x, y) that are also generated from P(x, y).⁶ Correctness of the classification is measured by means of the zero-one loss function c(x, y, f(x)) := \frac{1}{2} |f(x) - y|. Note that the loss is 0 if (x, y) is classified correctly, and 1 otherwise.

5. For a definition of a probability distribution, see Section B.1.1.
6. We mostly use the term example to denote a pair consisting of a training pattern x and the corresponding target y.

If we put no restriction on the set of functions from which we choose our estimated f, however, then even a function that does very well on the training data, e.g., by satisfying f(x_i) = y_i for all i = 1, \dots, m, might not generalize well to unseen examples. To see this, note that for each function f and any test set (\bar{x}_1, \bar{y}_1), \dots, (\bar{x}_{\bar{m}}, \bar{y}_{\bar{m}}) \in \mathcal{X} \times \{\pm 1\} satisfying \{\bar{x}_1, \dots, \bar{x}_{\bar{m}}\} \cap \{x_1, \dots, x_m\} = \emptyset, there exists another function f^* such that f^*(x_i) = f(x_i) for all i = 1, \dots, m, yet f^*(\bar{x}_i) \neq f(\bar{x}_i) for all i = 1, \dots, \bar{m} (cf. Figure 1.3). As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the two different sets of test label predictions) is preferable. We conclude that minimizing only the (average) training error (or empirical risk),

    R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} |f(x_i) - y_i|,    (1.17)

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

    R[f] = \int \frac{1}{2} \, |f(x) - y| \, dP(x, y).    (1.18)

The risk can be defined for any loss function, provided the integral exists. For the present zero-one loss function, the risk equals the probability of misclassification.⁷

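As a quick concrete check (our own illustration, not from the book), the empirical risk (1.17) under the zero-one loss is simply the fraction of misclassified training examples; the true risk (1.18) cannot be computed this way, since P(x, y) is unknown.

    import numpy as np

    def empirical_risk(f, X, y):
        """Remp[f] of (1.17): average zero-one loss 0.5*|f(x_i) - y_i| over the sample,
        i.e. the fraction of misclassified training points for labels in {+1, -1}."""
        predictions = np.array([f(x) for x in X])
        return np.mean(0.5 * np.abs(predictions - y))

    # A constant classifier on a toy sample: three of four labels are +1,
    # so predicting +1 everywhere has empirical risk 0.25.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([+1, +1, +1, -1])
    print(empirical_risk(lambda x: +1, X, y))   # 0.25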
Capacity   VC dimension   Shattering   VC Bound

Statistical learning theory (Chapter 5, [570, 559, 561, 136, 562, 14]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the set of functions from which f is chosen to one that has a capacity suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization [559].

The best-known capacity concept of VC theory is the VC dimension, defined as follows: each function of the class separates the patterns in a certain way and thus induces a certain labelling of the patterns. Since the labels are in \{\pm 1\}, there are at most 2^m different labellings for m patterns. A very rich function class might be able to realize all 2^m separations, in which case it is said to shatter the m points. However, a given class of functions might not be sufficiently rich to shatter the m points. The VC dimension is defined as the largest m such that there exists a set of m points which the class can shatter, and \infty if no such m exists. It can be thought of as a one-number summary of a learning machine's capacity (for an example, see Figure 1.4). As such, it is necessarily somewhat crude. More accurate capacity measures are the annealed VC entropy or the growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory. Another interesting capacity measure, which can be thought of as a scale-sensitive version of the VC dimension, is the fat shattering dimension [286, 6]. For further details, cf. Chapters 5 and 12.

7. The risk-based approach to machine learning has its roots in statistical decision theory [582, 166, 43]. In that context, f(x) is thought of as an action, and the loss function measures the loss incurred by taking action f(x) upon observing x when the true output (state of nature) is y.
Like many fields of statistics, decision theory comes in two flavors. The present approach is a frequentist one. It considers the risk as a function of the distribution P and the decision function f. The Bayesian approach considers parametrized families P_\Theta to model the distribution. Given a prior over \Theta (which need not in general be a finite-dimensional vector), the Bayes risk of a decision function f is the expected frequentist risk, where the expectation is taken over the prior. Minimizing the Bayes risk (over decision functions) then leads to a Bayes decision function. Bayesians thus act as if the parameter \Theta were actually a random variable whose distribution is known. Frequentists, who do not make this (somewhat bold) assumption, have to resort to other strategies for picking a decision function. Examples thereof are considerations like invariance and unbiasedness, both used to restrict the class of decision rules, and the minimax principle. A decision function is said to be minimax if it minimizes (over all decision functions) the maximal (over all distributions) risk. For a discussion of the relationship of these issues to VC theory, see Problem 5.9.

Figure 1.4  A simple VC dimension example. There are 2^3 = 8 ways of assigning 3 points to two classes. For the displayed points in \mathbb{R}^2, all 8 possibilities can be realized using separating hyperplanes, in other words, the function class can shatter 3 points. This would not work if we were given 4 points, no matter how we placed them. Therefore, the VC dimension of the class of separating hyperplanes in \mathbb{R}^2 is 3.

Whilst it will be difficult for the non-expert to appreciate the results of VC theory in this chapter, we will nevertheless briefly describe an example of a VC bound: if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, independent of the underlying distribution P generating the data, with a probability of at least 1 - \delta over the drawing of the training sample,⁸ the bound

    R[f] \leq R_{\mathrm{emp}}[f] + \phi(h, m, \delta)    (1.19)

holds, where the confidence term (or capacity term) \phi is defined as

    \phi(h, m, \delta) = \sqrt{ \frac{1}{m} \left( h \left( \ln \frac{2m}{h} + 1 \right) + \ln \frac{4}{\delta} \right) }.    (1.20)

8. Recall that each training example is generated from P(x, y), and thus the training data are subject to randomness.

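To get a feeling for the bound, here is a small sketch (ours, not the book's) that evaluates the confidence term (1.20) for a few sample sizes; it only becomes small once m is large compared to the VC dimension h.

    import numpy as np

    def confidence_term(h, m, delta):
        """Capacity term phi(h, m, delta) of (1.20):
        sqrt( (1/m) * ( h * (ln(2m/h) + 1) + ln(4/delta) ) )."""
        return np.sqrt((h * (np.log(2 * m / h) + 1) + np.log(4 / delta)) / m)

    # With delta = 0.05 and h = 10, the term shrinks as the number of training
    # examples m grows; the bound (1.19) is only nontrivial when this value is
    # well below 1.
    for m in (100, 1_000, 10_000):
        print(m, round(confidence_term(h=10, m=m, delta=0.05), 3))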
The bound (1.19) merits further explanation. Suppose we wanted to learn a "dependency" where patterns and labels are statistically independent, P(x, y) = P(x)P(y). In that case, the pattern x contains no information about the label y. If, moreover, the two classes +1 and -1 are equally likely, there is no way of making a good guess about the label of a test pattern.

Nevertheless, given a training set of finite size, we can always come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other, i.e., whenever two patterns are identical, then they must come with the same label). To reproduce the random labellings by correctly separating all training examples, however, this machine will necessarily require a large VC dimension h. Therefore, the confidence term (1.20), which increases monotonically with h, will be large, and the bound (1.19) will show that the small training error does not guarantee a small test error. This illustrates how the bound can apply independent of assumptions about the underlying distribution P(x, y): it always holds (provided that h < m), but it does not always make a nontrivial prediction. In order to get nontrivial predictions from (1.19), the function class must be restricted such that its capacity (e.g., VC dimension) is small enough (in relation to the available amount of data). At the same time, the class should be large enough to provide functions that are able to model the dependencies hidden in P(x, y). The choice of the set of functions is thus crucial for learning from data. In the next section, we take a closer look at a class of functions which is particularly interesting for pattern recognition problems.

1.4 Hyperplane Classifiers
|
||
|
||
Optimal Hyperplane

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced earlier). As described in the previous section, to design learning algorithms whose statistical effectiveness can be controlled, one needs to come up with a class of functions whose capacity can be computed. Vapnik et al. [573, 566, 570] considered the class of hyperplanes in some dot product space H,

<w, x> + b = 0,  where w ∈ H, b ∈ R,                                 (1.21)

corresponding to decision functions

f(x) = sgn(<w, x> + b),                                              (1.22)

and proposed a learning algorithm for problems which are separable by hyperplanes (sometimes said to be linearly separable), termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts. First (see Chapter 7), among all hyperplanes separating the data, there exists a unique optimal hyperplane, distinguished by the maximum margin of separation between any training point and the hyperplane. It is the solution of

maximize_{w,b}  min { ||x − x_i|| : x ∈ H, <w, x> + b = 0, i = 1, ..., m }.   (1.23)

Second (see Chapter 5), the capacity (as discussed in Section 1.3) of the class of separating hyperplanes decreases with increasing margin. Hence there are theoretical arguments supporting the good generalization performance of the optimal hyperplane, cf. Chapters 5, 7, 12. In addition, it is computationally attractive, since we will show below that it can be constructed by solving a quadratic programming problem for which efficient algorithms exist (see Chapters 6 and 10).

Note that the form of the decision function (1.22) is quite similar to our earlier example (1.9). The ways in which the classifiers are trained, however, are different. In the earlier example, the normal vector of the hyperplane was trivially computed from the class means as w = c_+ − c_−.
Figure 1.5 A binary classification toy problem: separate balls from diamonds. The optimal hyperplane (1.23) is shown as a solid line. The problem being separable, there exists a weight vector w and a threshold b such that y_i(<w, x_i> + b) > 0 (i = 1, ..., m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |<w, x_i> + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i(<w, x_i> + b) ≥ 1. Note that in this case, the margin (the distance of the closest point to the hyperplane) equals 1/||w||. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, that is, <w, x_1> + b = 1 and <w, x_2> + b = −1, and projecting them onto the hyperplane normal vector w/||w||. (In the figure, the two class boundaries are {x | <w, x> + b = +1} and {x | <w, x> + b = −1}, with the hyperplane {x | <w, x> + b = 0} in between; subtracting the two equations gives <w, x_1 − x_2> = 2, hence <w/||w||, x_1 − x_2> = 2/||w||.)
In the present case, we need to do some additional work to find the normal vector that leads to the largest margin. To construct the optimal hyperplane, we have to solve

minimize_{w,b}  τ(w) = (1/2) ||w||^2                                 (1.24)
subject to      y_i(<w, x_i> + b) ≥ 1  for all i = 1, ..., m.        (1.25)

Note that the constraints (1.25) ensure that f(x_i) will be +1 for y_i = +1, and −1 for y_i = −1. Now one might argue that for this to be the case, we don't actually need the "≥ 1" on the right hand side of (1.25). However, without it, it would not be meaningful to minimize the length of w: to see this, imagine we wrote "≥ 0" instead of "≥ 1." Now assume that the solution is (w, b). Let us rescale this solution by multiplication with some 0 < λ < 1. Since λ > 0, the constraints are still satisfied. Since λ < 1, however, the length of w has decreased. Hence (w, b) cannot be the minimizer of τ(w). The "≥ 1" on the right hand side of the constraints effectively fixes the scaling of w. In fact, any other positive number would do.

Let us now try to get an intuition for why we should be minimizing the length of w, as in (1.24). If ||w|| were 1, then the left hand side of (1.25) would equal the distance from x_i to the hyperplane (cf. (1.23)). In general, we have to divide y_i(<w, x_i> + b) by ||w|| to transform it into this distance. Hence, if we can satisfy (1.25) for all i = 1, ..., m with a w of minimal length, then the overall margin will be maximized.

A more detailed explanation of why this leads to the maximum margin hyperplane will be given in Chapter 7. A short summary of the argument is also given in Figure 1.5.

Lagrangian
KKT Conditions
The function τ in (1.24) is called the objective function, while (1.25) are called inequality constraints. Together, they form a so-called constrained optimization problem. Problems of this kind are dealt with by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian⁹

L(w, b, α) = (1/2) ||w||^2 − Σ_{i=1}^m α_i ( y_i(<x_i, w> + b) − 1 ).   (1.26)

The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i (in other words, a saddle point has to be found). Note that the constraint has been incorporated into the second term of the Lagrangian; it is not necessary to enforce it explicitly.

Let us try to get some intuition for this way of dealing with constrained optimization problems. If a constraint (1.25) is violated, then y_i(<w, x_i> + b) − 1 < 0, in which case L can be increased by increasing the corresponding α_i. At the same time, w and b will have to change such that L decreases. To prevent α_i ( y_i(<w, x_i> + b) − 1 ) from becoming an arbitrarily large negative number, the change in w and b will ensure that, provided the problem is separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities (that is, for which y_i(<w, x_i> + b) − 1 > 0), the corresponding α_i must be 0: this is the value of α_i that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker (KKT) complementarity conditions of optimization theory (Chapter 6).

The statement that at the saddle point, the derivatives of L with respect to the primal variables must vanish,

∂L(w, b, α)/∂b = 0  and  ∂L(w, b, α)/∂w = 0,                         (1.27)

leads to

Σ_{i=1}^m α_i y_i = 0                                                (1.28)

and

w = Σ_{i=1}^m α_i y_i x_i.                                           (1.29)

9. Henceforth, we use boldface Greek letters as a shorthand for corresponding vectors, e.g., α = (α_1, ..., α_m).
Support Vector
Dual Problem
Decision Function
Mechanical Analogy
The solution vector thus has an expansion (1.29) in terms of a subset of the training patterns, namely those patterns with non-zero α_i, called Support Vectors (SVs) (cf. (1.15) in the initial example). By the KKT conditions,

α_i ( y_i(<x_i, w> + b) − 1 ) = 0  for all i = 1, ..., m,            (1.30)

the SVs lie on the margin (cf. Figure 1.5). All remaining training examples (x_j, y_j) are irrelevant: their constraint y_j(<w, x_j> + b) ≥ 1 (cf. (1.25)) could just as well be left out, and they do not appear in the expansion (1.29). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 1.5) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

By substituting (1.28) and (1.29) into the Lagrangian (1.26), one eliminates the primal variables w and b, arriving at the so-called dual optimization problem, which is the problem that one usually solves in practice:

maximize_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j <x_i, x_j>   (1.31)
subject to  α_i ≥ 0 for all i = 1, ..., m,  and  Σ_{i=1}^m α_i y_i = 0.           (1.32)

Using (1.29), the hyperplane decision function (1.22) can thus be written as

f(x) = sgn ( Σ_{i=1}^m y_i α_i <x, x_i> + b ),                       (1.33)

where b is computed by exploiting (1.30) (for details, cf. Chapter 7).

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics (e.g., [206]). In the latter class of problem, it is also often the case that only a subset of constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant, and those walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [87]: if we assume that each SV x_i exerts a perpendicular force of size α_i and direction y_i · w/||w|| on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements for mechanical stability. The constraint (1.28) states that the forces on the sheet sum to zero, and (1.29) implies that the torques also sum to zero, via Σ_i x_i × y_i α_i w/||w|| = w × w/||w|| = 0.¹⁰ This mechanical analogy illustrates the physical meaning of the term Support Vector.

10. Here, the × denotes the vector (or cross) product, satisfying v × v = 0 for all v.
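The following Python sketch illustrates the dual problem (1.31)-(1.32) and the decision function (1.33) numerically. It is only a toy illustration under stated assumptions (a hand-made separable data set, SciPy's general-purpose SLSQP solver rather than a dedicated QP code), not the training procedure discussed in Chapter 10.

```python
# A minimal numerical sketch (not the book's code) of the hard-margin dual
# (1.31)-(1.32): maximize W(alpha) subject to alpha_i >= 0, sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy patterns
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}

G = (y[:, None] * y[None, :]) * (X @ X.T)        # G_ij = y_i y_j <x_i, x_j>

def neg_W(a):                                    # minimizing -W(alpha) maximizes W(alpha)
    return -(a.sum() - 0.5 * a @ G @ a)

m = len(y)
res = minimize(neg_W, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i, cf. (1.29)
sv = np.argmax(alpha)                            # index of a support vector (alpha_i > 0)
b = y[sv] - X[sv] @ w                            # from the KKT condition (1.30)

f = lambda x: np.sign(x @ w + b)                 # decision function (1.33)
print(alpha.round(3), w.round(3), round(b, 3), f(np.array([1.0, 1.0])))
```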
Figure 1.6 The idea of SVMs: map the training data into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (1.2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
1.5 Support Vector Classification
Decision Function

We now have all the tools to describe SVMs (Figure 1.6). Everything in the last section was formulated in a dot product space. We think of this space as the feature space H of Section 1.1. To express the formulas in terms of the input patterns in X, we thus need to employ (1.6), which expresses the dot product of bold face feature vectors x, x' in terms of the kernel k evaluated on input patterns x, x',

k(x, x') = <x, x'>.                                                  (1.34)

This substitution, which is sometimes referred to as the kernel trick, was used by Boser, Guyon, and Vapnik [62] to extend the Generalized Portrait hyperplane classifier to nonlinear Support Vector Machines. Aizerman, Braverman, and Rozonoér [4] called H the linearization space, and used it in the context of the potential function classification method to express the dot product between elements of H in terms of elements of the input space.

The kernel trick can be applied since all feature vectors only occurred in dot products (see (1.31) and (1.33)). The weight vector (cf. (1.29)) then becomes an expansion in feature space, and therefore will typically no longer correspond to the Φ-image of a single input space vector (cf. Chapter 18). We obtain decision functions of the form (cf. (1.33))

f(x) = sgn ( Σ_{i=1}^m y_i α_i <Φ(x), Φ(x_i)> + b ) = sgn ( Σ_{i=1}^m y_i α_i k(x, x_i) + b ),   (1.35)

and the following quadratic program (cf. (1.31)):

maximize_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)   (1.36)
subject to  α_i ≥ 0 for all i = 1, ..., m,  and  Σ_{i=1}^m α_i y_i = 0.            (1.37)
Soft Margin Hyperplane
Figure 1.7 Example of an SV classifier found using a radial basis function kernel k(x, x') = exp(−||x − x'||^2) (here, the input space is [−1, 1]^2). Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (1.25). Note that the SVs found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Gray values code |Σ_{i=1}^m y_i α_i k(x, x_i) + b|, the modulus of the argument of the decision function (1.35). The top and the bottom lines indicate places where it takes the value 1 (from [471]).
Figure 1.7 shows an example of this approach, using a Gaussian radial basis function kernel. We will later study the different possibilities for the kernel function in detail (Chapters 2 and 13).

In practice, a separating hyperplane may not exist, e.g., if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (1.25), one introduces slack variables [111, 561, 481]

ξ_i ≥ 0  for all i = 1, ..., m,                                      (1.38)

in order to relax the constraints (1.25) to

y_i(<w, x_i> + b) ≥ 1 − ξ_i  for all i = 1, ..., m.                  (1.39)

A classifier that generalizes well is then found by controlling both the classifier capacity (via ||w||) and the sum of the slacks Σ_i ξ_i. The latter can be shown to provide an upper bound on the number of training errors.

One possible realization of such a soft margin classifier is obtained by minimizing the objective function

τ(w, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^m ξ_i                            (1.40)

subject to the constraints (1.38) and (1.39), where the constant C > 0 determines the trade-off between margin maximization and training error minimization.¹¹ Incorporating a kernel, and rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (1.36), subject to the constraints

0 ≤ α_i ≤ C  for all i = 1, ..., m,  and  Σ_{i=1}^m α_i y_i = 0.     (1.41)

The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (1.35). The threshold b can be computed by exploiting the fact that for all SVs x_i with α_i < C, the slack variable ξ_i is zero (this again follows from the KKT conditions), and hence

Σ_{j=1}^m α_j y_j k(x_i, x_j) + b = y_i.                             (1.42)

Geometrically speaking, choosing b amounts to shifting the hyperplane, and (1.42) states that we have to shift the hyperplane such that the SVs with zero slack variables lie on the ±1 lines of Figure 1.5.

Another possible realization of a soft margin variant of the optimal hyperplane uses the more natural ν-parametrization. In it, the parameter C is replaced by a parameter ν ∈ (0, 1] which can be shown to provide lower and upper bounds for the fraction of examples that will be SVs and those that will have non-zero slack variables, respectively. It uses a primal objective function with the error term (1/(νm)) Σ_i ξ_i − ρ instead of C Σ_i ξ_i (cf. (1.40)), and separation constraints that involve a margin parameter ρ,

y_i(<w, x_i> + b) ≥ ρ − ξ_i  for all i = 1, ..., m,                  (1.43)

which itself is a variable of the optimization problem. The dual can be shown to consist in maximizing the quadratic part of (1.36), subject to 0 ≤ α_i ≤ 1/(νm), Σ_i α_i y_i = 0, and the additional constraint Σ_i α_i = 1. We shall return to these methods in more detail in Section 7.5.
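As a hedged illustration of the two soft margin parametrizations just discussed (the C of (1.40)-(1.41) versus the ν of (1.43)), the following sketch assumes scikit-learn is available; the data set and parameter values are arbitrary toy choices, not recommendations.

```python
# Illustration of the two soft margin parametrizations: SVC uses the constant C
# of (1.40), NuSVC uses the parameter nu bounding the fractions of SVs and
# margin errors. Data and parameters are toy assumptions.
import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

c_svm = SVC(kernel="rbf", C=1.0).fit(X, y)       # trade-off via C, cf. (1.40)-(1.41)
nu_svm = NuSVC(kernel="rbf", nu=0.2).fit(X, y)   # at most ~20% margin errors, at least ~20% SVs

print("C-SVM support vectors: ", len(c_svm.support_))
print("nu-SVM support vectors:", len(nu_svm.support_))   # roughly >= nu * m = 20
```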
1.6 Support Vector Regression
Let us turn to a problem slightly more general than pattern recognition. Rather than dealing with outputs y ∈ {±1}, regression estimation is concerned with estimating real-valued functions.

ε-Insensitive Loss

Figure 1.8 In SV regression, a tube with radius ε is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables ξ) is determined by minimizing (1.47).

To generalize the SV algorithm to the regression case, an analog of the soft margin is constructed in the space of the target values y (note that we now have y ∈ R) by using Vapnik's ε-insensitive loss function [561] (Figure 1.8, see Chapters 3 and 9). This quantifies the loss incurred by predicting f(x) instead of y as

c(x, y, f(x)) := |y − f(x)|_ε := max{0, |y − f(x)| − ε}.             (1.44)

11. It is sometimes convenient to scale the sum in (1.40) by C/m rather than C, as done in Chapter 7 below.
To estimate a linear regression

f(x) = <w, x> + b,                                                   (1.45)

one minimizes

(1/2) ||w||^2 + C Σ_{i=1}^m |y_i − f(x_i)|_ε.                        (1.46)

Note that the term ||w||^2 is the same as in pattern recognition (cf. (1.40)); for further details, cf. Chapter 9.
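The ε-insensitive loss (1.44) and the objective (1.46) are simple to state in code. The following minimal numpy sketch is an added illustration; the data, w, b, C, and ε below are arbitrary assumptions.

```python
# A minimal numpy sketch (an illustration, not the book's code) of the
# epsilon-insensitive loss (1.44) and the regularized objective (1.46) for a
# linear model f(x) = <w, x> + b.
import numpy as np

def eps_insensitive(y, f_x, eps):
    return np.maximum(0.0, np.abs(y - f_x) - eps)     # |y - f(x)|_eps, cf. (1.44)

def objective(w, b, X, y, C, eps):
    f_x = X @ w + b                                   # linear regression (1.45)
    return 0.5 * w @ w + C * eps_insensitive(y, f_x, eps).sum()   # cf. (1.46)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.0, 1.9, 3.2])
print(objective(np.array([1.0]), 0.0, X, y, C=10.0, eps=0.1))
```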
We can transform this into a constrained optimization problem by introducing slack variables, akin to the soft margin case. In the present case, we need two types of slack variable for the two cases f(x_i) − y_i > ε and y_i − f(x_i) > ε. We denote them by ξ and ξ*, respectively, and collectively refer to them as ξ(*).

The optimization problem is given by

minimize_{w ∈ H, ξ(*) ∈ R^m, b ∈ R}  τ(w, ξ, ξ*) = (1/2) ||w||^2 + C Σ_{i=1}^m (ξ_i + ξ_i*)   (1.47)
subject to  f(x_i) − y_i ≤ ε + ξ_i,                                  (1.48)
            y_i − f(x_i) ≤ ε + ξ_i*,                                 (1.49)
            ξ_i, ξ_i* ≥ 0  for all i = 1, ..., m.                    (1.50)

Note that according to (1.48) and (1.49), any error smaller than ε does not require a nonzero ξ_i or ξ_i*, and hence does not enter the objective function (1.47).

Regression Function
ν-SV Regression

Generalization to kernel-based regression estimation is carried out in an analogous manner to the case of pattern recognition. Introducing Lagrange multipliers, one arrives at the following optimization problem (for C > 0 chosen a priori):
maximize_{α, α* ∈ R^m}  W(α, α*) = −ε Σ_{i=1}^m (α_i* + α_i) + Σ_{i=1}^m (α_i* − α_i) y_i
                                   − (1/2) Σ_{i,j=1}^m (α_i* − α_i)(α_j* − α_j) k(x_i, x_j)   (1.51)
subject to  0 ≤ α_i, α_i* ≤ C for all i = 1, ..., m,  and  Σ_{i=1}^m (α_i − α_i*) = 0.        (1.52)

The regression estimate takes the form

f(x) = Σ_{i=1}^m (α_i* − α_i) k(x_i, x) + b,                         (1.53)

where b is computed using the fact that (1.48) becomes an equality with ξ_i = 0 if 0 < α_i < C, and (1.49) becomes an equality with ξ_i* = 0 if 0 < α_i* < C (for details, see Chapter 9). The solution thus looks quite similar to the pattern recognition case (cf. (1.35) and Figure 1.9).
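As a practical illustration of ε-SV regression as summarized in (1.51)-(1.53), the following sketch assumes scikit-learn; the RBF kernel, the values of C and ε, and the noisy sine data are toy assumptions.

```python
# A hedged sketch (assuming scikit-learn) of kernel SV regression: an RBF
# kernel, a tube of width epsilon, and upper bound C on the coefficients.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(60)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
# Points strictly inside the epsilon-tube get alpha_i = alpha_i* = 0 and are
# therefore not support vectors; only the remaining ones enter (1.53).
print("support vectors:", len(svr.support_), "of", len(X))
print("prediction at 0:", float(svr.predict([[0.0]])[0]))
```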
A number of extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on (w, ξ) (cf. (1.47)). There are multiple degrees of freedom for constructing it, including some freedom in how to penalize, or regularize. For instance, more general loss functions can be used for ξ, leading to problems that can still be solved efficiently ([512, 515], cf. Chapter 9). Moreover, norms other than the 2-norm can be used to regularize the solution (see Sections 4.9 and 9.4).
Finally, the algorithm can be modified such that ε need not be specified a priori. Instead, one specifies an upper bound 0 ≤ ν ≤ 1 on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs), and the corresponding ε is computed automatically. This is achieved by using as primal objective function

(1/2) ||w||^2 + C ( ν m ε + Σ_{i=1}^m |y_i − f(x_i)|_ε )             (1.54)

instead of (1.46), and treating ε ≥ 0 as a parameter over which we minimize. For more detail, cf. Section 9.3.
1.7 Kernel Principal Component Analysis
|
||
The kernel method for computing dot products in feature spaces is not restricted to SVMs. Indeed, it has been pointed out that it can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as principal component analysis (PCA) [480].
Kernel PCA
Eigenvalue Problem
Feature Extraction

Principal component analysis is perhaps the most common feature extraction algorithm; for details, see Chapter 14. The term feature extraction commonly refers to procedures for extracting (real) numbers from patterns which in some sense represent the crucial information contained in these patterns.

PCA in feature space leads to an algorithm called kernel PCA. By solving an eigenvalue problem, the algorithm computes nonlinear feature extraction functions

f_n(x) = Σ_{i=1}^m α_i^n k(x_i, x),                                  (1.55)

where, up to a normalizing constant, the α_i^n are the components of the nth eigenvector of the kernel matrix K_ij := (k(x_i, x_j)).
In a nutshell, this can be understood as follows. To do PCA in H, we wish to find eigenvectors v and eigenvalues λ of the so-called covariance matrix C in the feature space, where

C := (1/m) Σ_{i=1}^m Φ(x_i) Φ(x_i)^T.                                (1.56)

Here, Φ(x_i)^T denotes the transpose of Φ(x_i) (see Section B.2.1). In the case when H is very high dimensional, the computational costs of doing this directly are prohibitive. Fortunately, one can show that all solutions to

Cv = λv                                                              (1.57)
with λ ≠ 0 must lie in the span of Φ-images of the training data. Thus, we may expand the solution v as

v = Σ_{i=1}^m α_i Φ(x_i),                                            (1.58)

thereby reducing the problem to that of finding the α_i. It turns out that this leads to a dual eigenvalue problem for the expansion coefficients,

mλα = Kα,                                                            (1.59)

where α = (α_1, ..., α_m)^T.

To extract nonlinear features from a test point x, we compute the dot product between Φ(x) and the nth normalized eigenvector in feature space,

<v^n, Φ(x)> = Σ_{i=1}^m α_i^n k(x_i, x).                             (1.60)

Usually, this will be computationally far less expensive than taking the dot product in the feature space explicitly.

A toy example is given in Chapter 14 (Figure 14.4). As in the case of SVMs, the architecture can be visualized by Figure 1.9.
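The steps (1.55)-(1.60) translate almost directly into code. The following numpy sketch is an illustration only: the Gaussian kernel, the centering of the kernel matrix (treated properly in Chapter 14), and the normalization convention for the eigenvectors are assumptions made here, not prescriptions of the text.

```python
# A compact numpy sketch of kernel PCA, cf. (1.55)-(1.60). Centering and the
# exact normalization are simplifying assumptions of this illustration.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(40, 2)                       # toy training patterns
m = len(X)

K = rbf_kernel(X, X)                       # K_ij = k(x_i, x_j)
H = np.eye(m) - np.ones((m, m)) / m
K_c = H @ K @ H                            # centered kernel matrix

lam, A = np.linalg.eigh(K_c)               # solves K alpha = (m lambda) alpha, cf. (1.59)
lam, A = lam[::-1] / m, A[:, ::-1]         # sort eigenvalues in decreasing order
A = A[:, lam > 1e-12] / np.sqrt(m * lam[lam > 1e-12])   # normalize the eigenvectors v^n

def features(x_test, n_components=2):      # feature extraction as in (1.60)
    return rbf_kernel(x_test, X) @ A[:, :n_components]

print(features(rng.randn(3, 2)))
```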
Figure 1.9 Architecture of SVMs and related kernel methods. The input x and the expansion patterns (SVs) x_1, ..., x_n (we assume that we are dealing with handwritten digits) are nonlinearly mapped (by Φ) into a feature space where dot products <Φ(x), Φ(x_i)> = k(x, x_i) are computed. Through the use of the kernel k, these two layers are in practice computed in one step. The results are linearly combined using weights υ_i, found by solving a quadratic program (in pattern recognition, υ_i = y_i α_i; in regression estimation, υ_i = α_i* − α_i) or an eigenvalue problem (Kernel PCA). The linear combination is fed into the function σ (in pattern recognition, σ(x) = sgn(x + b); in regression estimation, σ(x) = x + b; in Kernel PCA, σ(x) = x), so that the overall output is σ(Σ_i υ_i k(x, x_i)).
1.8 Empirical Results and Implementations
Examples of Kernels

Having described the basics of SVMs, we now summarize some empirical findings. By the use of kernels, the optimal margin classifier was turned into a high-performance classifier. Surprisingly, it was observed that the polynomial kernel

k(x, x') = <x, x'>^d,                                                (1.61)

the Gaussian

k(x, x') = exp( −||x − x'||^2 / (2 σ^2) ),                           (1.62)

and the sigmoid

k(x, x') = tanh( κ <x, x'> + Θ ),                                    (1.63)

with suitable choices of d ∈ N and κ, Θ ∈ R (here, X ⊂ R^N), empirically led to SV classifiers with very similar accuracies and SV sets (Section 7.8.2). In this sense, the SV set seems to characterize (or compress) the given task in a manner which to some extent is independent of the type of kernel (that is, the type of classifier) used, provided the kernel parameters are well adjusted.

Applications
Implementation
Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SVMs on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs (Chapter 7), and the Reduced Set method (Chapter 18) for speeding up classification. Using these procedures, SVMs soon became competitive with the best available classifiers on OCR and other object recognition tasks [87, 57, 419, 438, 134], and later even achieved the world record on the main handwritten digit benchmark dataset [134].
An initial weakness of SVMs, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem (Chapter 10) scaled with the number of support vectors. This was due to the fact that in (1.36), the quadratic part contained at least all SVs — the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that patterns initially not identified as SVs become SVs at a later stage. This procedure is referred to as chunking; note that without chunking, the size of the matrix in the quadratic part of the objective function would be m × m, where m is the number of all training examples.

What happens if we have a high-noise problem? In this case, many of the slack variables ξ_i become nonzero, and all the corresponding examples become SVs. For this case, decomposition algorithms were proposed [398, 409], based on the observation that not only can we leave out the non-SV examples (the x_i with α_i = 0) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (α_i = C). The chunks are usually dealt with using quadratic optimizers. Among the optimizers used for SVMs are LOQO [555], MINOS [380], and variants of conjugate gradient descent, such as the optimizers of Bottou [459] and Burges [85]. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on implementations, see Chapter 10.

Once the SV algorithm had been generalized to regression, researchers started applying it to various problems of estimating real-valued functions. Very good results were obtained on the Boston housing benchmark [529], and on problems of time series prediction (see [376, 371, 351]). Moreover, the SV method was applied to the solution of inverse function estimation problems ([572]; cf. [563, 589]). For overviews, the interested reader is referred to [85, 472, 504, 125].
I CONCEPTS AND TOOLS
The generic can be more intense than the concrete.
J. L. Borges¹

We now embark on a more systematic presentation of the concepts and tools underlying Support Vector Machines and other kernel methods.

In machine learning problems, we try to discover structure in data. For instance, in pattern recognition and regression estimation, we are given a training set (x_1, y_1), ..., (x_m, y_m) ∈ X × Y, and attempt to predict the outputs y for previously unseen inputs x. This is only possible if we have some measure that tells us how (x, y) is related to the training set. Informally, we want similar inputs to lead to similar outputs.² To formalize this, we have to state what we mean by similar.

A particularly simple yet surprisingly useful notion of similarity of inputs — the one we will use throughout this book — derives from embedding the data into a Euclidean feature space and utilizing geometrical concepts. Chapter 2 describes how certain classes of kernels induce feature spaces, and how one can compute dot products, and thus angles and distances, without having to explicitly work in these potentially infinite-dimensional spaces. This leads to a rather general class of similarity measures to be used on the inputs.

1. From A History of Eternity, in The Total Library, Penguin, London, 2001.
2. This procedure can be traced back to an old maxim of law: de similibus ad similia eadem ratione procedendum est — from things similar to things similar we are to proceed by the same rule.
On the outputs, similarity is usually measured in terms of a loss function stating how “bad” it is if the predicted y does not match the true one. The training of a learning machine commonly involves a risk functional that contains a term measuring the loss incurred for the training patterns. The concepts of loss and risk are introduced in depth in Chapter 3.
This is not the full story, however. In order to generalize well to the test data, it is not sufficient to “explain” the training data. It is also necessary to control the complexity of the model used for explaining the training data, a task that is often accomplished with the help of regularization terms, as explained in Chapter 4. Specifically, one utilizes objective functions that involve both the empirical loss term and a regularization term. From a statistical point of view, we can expect the function minimizing a properly chosen objective function to work well on test data, as explained by statistical learning theory (Chapter 5). From a practical point of view, however, it is not at all straightforward to find this minimizer. Indeed, the quality of a loss function or a regularizer should be assessed not only on a statistical basis but also in terms of the feasibility of the objective function minimization problem. In order to be able to assess this, and in order to obtain a thorough understanding of practical algorithms for this task, we conclude this part of the book with an in-depth review of optimization theory (Chapter 6).
The chapters in this part of the book assume familiarity with basic concepts of linear algebra and probability theory. Readers who would like to refresh their knowledge of these topics may want to consult Appendix B beforehand.
2  Kernels

Overview
Prerequisites
In Chapter 1, we described how a kernel arises as a similarity measure that can be thought of as a dot product in a so-called feature space. We tried to provide an intuitive understanding of kernels by introducing them as similarity measures, rather than immediately delving into the functional analytic theory of the classes of kernels that actually admit a dot product representation in a feature space.
In the present chapter, we will be both more formal and more precise. We will study the class of kernels k that correspond to dot products in feature spaces H via a map Φ,

Φ: X → H,  x ↦ x := Φ(x),                                            (2.1)

that is,

k(x, x') = <Φ(x), Φ(x')>.                                            (2.2)

Regarding the input domain X, we need not make assumptions other than it being a set. For instance, we could consider a set of discrete objects, such as strings.

A natural question to ask at this point is what kind of functions k(x, x') admit a representation of the form (2.2); that is, whether we can always construct a dot product space H and a map Φ mapping X into it such that (2.2) holds true. We shall begin, however, by trying to give some motivation as to why kernels are at all useful, considering kernels that compute dot products in spaces of monomial features (Section 2.1). Following this, we move on to the question of how, given a kernel, an associated feature space can be constructed (Section 2.2). This leads to the notion of a Reproducing Kernel Hilbert Space, crucial for the theory of kernel machines. In Section 2.3, we give some examples and properties of kernels, and in Section 2.4, we discuss a class of kernels that can be used as dissimilarity measures rather than as similarity measures.

The chapter builds on knowledge of linear algebra, as briefly summarized in Appendix B. Apart from that, it can be read on its own; however, readers new to the field will profit from first reading Sections 1.1 and 1.2.
[Chapter 2 roadmap: 2.1 Polynomial Kernels; 2.2.1 Positive Definite Kernels; 2.2.2, 2.2.3 RKHS Representation; 2.2.4, 2.2.5 Mercer Representation; 2.2.6, 2.2.7 Data Dependent Representation; 2.3 Examples and Properties of Kernels; 2.4 Conditionally Positive Definite Kernels]
2.1 Product Features
Monomial Features

In this section, we think of X as a subset of the vector space R^N (N ∈ N), endowed with the canonical dot product (1.3).

Suppose we are given patterns x where most information is contained in the dth order products (so-called monomials) of entries [x]_j of x,

[x]_{j1} · [x]_{j2} · ... · [x]_{jd},                                (2.3)

where j_1, ..., j_d ∈ {1, ..., N}. Often, these monomials are referred to as product features. These features form the basis of many practical algorithms; indeed, there is a whole field of pattern recognition research studying polynomial classifiers [484], which is based on first extracting product features and then applying learning algorithms to these features. In other words, the patterns are preprocessed by mapping into the feature space H of all products of d entries. This has proven quite effective in visual pattern recognition tasks, for instance. To understand the rationale for doing this, note that visual patterns are usually represented as vectors whose entries are the pixel intensities. Taking products of entries of these vectors then corresponds to taking products of pixel intensities, and is thus akin to taking logical "and" operations on the pixels. Roughly speaking, this corresponds to the intuition that, for instance, a handwritten "8" constitutes an eight if there is a top circle and a bottom circle. With just one of the two circles, it is not half an "8," but rather a "0." Nonlinearities of this type are crucial for achieving high accuracies in pattern recognition tasks.

Let us take a look at this feature map in the simple example of two-dimensional patterns, for which X = R^2. In this case, we can collect all monomial feature extractors of degree 2 in the nonlinear map

Φ: R^2 → H = R^3,                                                    (2.4)
   ([x]_1, [x]_2) ↦ ([x]_1^2, [x]_2^2, [x]_1 [x]_2).                 (2.5)
This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

N_H = ( N + d − 1  choose  d ) = (d + N − 1)! / ( d! (N − 1)! )      (2.6)

different monomials (2.3) of degree d, comprising a feature space H of dimension N_H. For instance, 16 × 16 pixel input images and a monomial degree d = 5 thus yield a dimension of almost 10^10.

Kernel
In certain cases described below, however, there exists a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into the spaces, by means of kernels nonlinear in the input space R^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimension.
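A two-line computation (an added illustration, not part of the original text) makes the growth of (2.6) concrete; the 16 × 16, d = 5 example is the one quoted above.

```python
# A quick check of the combinatorial growth (2.6): the number of monomials of
# degree d in N variables.
from math import comb

def n_monomials(N, d):
    # (N + d - 1 choose d) = (d + N - 1)! / (d! (N - 1)!), cf. (2.6)
    return comb(N + d - 1, d)

assert n_monomials(2, 2) == 3            # the three features of the map (2.5)
print(n_monomials(16 * 16, 5))           # 16x16 images, d = 5: about 10^10
```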
We now describe how dot products in polynomial feature spaces can be computed efficiently, followed by a section in which we discuss more general feature spaces. In order to compute dot products of the form <Φ(x), Φ(x')>, we employ kernel representations of the form

k(x, x') = <Φ(x), Φ(x')>,                                            (2.7)

which allow us to compute the value of the dot product in H without having to explicitly compute the map Φ.

What does k look like in the case of polynomial features? We start by giving an example for N = d = 2, as considered above [561]. For the map

Φ: ([x]_1, [x]_2) ↦ ([x]_1^2, [x]_2^2, [x]_1 [x]_2, [x]_2 [x]_1)     (2.8)

(note that for now, we have considered [x]_1[x]_2 and [x]_2[x]_1 as separate features; thus we are looking at ordered monomials), dot products in H take the form

<Φ(x), Φ(x')> = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = <x, x'>^2.   (2.9)

In other words, the desired kernel k is simply the square of the dot product in input space. The same works for arbitrary N, d ∈ N [62]: as a straightforward generalization of a result proved in the context of polynomial approximation [412, Lemma 2.1], we have:
Polynomial Kernel

Proposition 2.1 Define C_d to map x ∈ R^N to the vector C_d(x) whose entries are all possible dth degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, x') = <C_d(x), C_d(x')> = <x, x'>^d.                            (2.10)

Proof  We directly compute

<C_d(x), C_d(x')> = Σ_{j_1=1}^N ... Σ_{j_d=1}^N [x]_{j_1} ... [x]_{j_d} [x']_{j_1} ... [x']_{j_d}   (2.11)
                  = Σ_{j_1=1}^N [x]_{j_1}[x']_{j_1} ... Σ_{j_d=1}^N [x]_{j_d}[x']_{j_d}
                  = ( Σ_{j=1}^N [x]_j [x']_j )^d = <x, x'>^d.
Note that we used the symbol C_d for the feature map. The reason for this is that we would like to reserve Φ_d for the corresponding map computing unordered product features. Let us construct such a map Φ_d, yielding the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective entries of Φ_d with the square roots of their numbers of occurrence. Then, by this construction of Φ_d, and (2.10),

<Φ_d(x), Φ_d(x')> = <C_d(x), C_d(x')> = <x, x'>^d.                   (2.12)

For instance, if n of the j_i in (2.3) are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ_d is sqrt((d − n + 1)!). For the general case, see Problem 2.2. For Φ_2, this simply means that [561]

Φ_2(x) = ([x]_1^2, [x]_2^2, sqrt(2) [x]_1 [x]_2).                    (2.13)
The above reasoning illustrates an important point pertaining to the construction of feature spaces associated with kernel functions. Although they map into different feature spaces, Φ_d and C_d are both valid instantiations of feature maps for k(x, x') = <x, x'>^d.

To illustrate how monomial feature kernels can significantly simplify pattern recognition tasks, let us consider a simple toy example.

Toy Example

Example 2.2 (Monomial Features in 2-D Pattern Recognition) In the example of Figure 2.1, a non-separable problem is reduced to the construction of a separating hyperplane by preprocessing the input data with Φ_2. As we shall see in later chapters, this has advantages both from the computational point of view (there exist efficient algorithms for computing the hyperplane) and from the statistical point of view (there exist guarantees for how well the hyperplane will generalize to unseen test points).

In more realistic cases, e.g., if x represents an image with the entries being pixel values, polynomial kernels <x, x'>^d enable us to work in the space spanned by products of any d pixel values — provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern Φ_d(x). Using kernels of the form (2.10), we can take higher-order statistics into account, without the combinatorial explosion (2.6) of time and memory complexity which accompanies even moderately high N and d.

To conclude this section, note that it is possible to modify (2.10) such that it maps into the space of all monomials up to degree d, by defining k(x, x') = (<x, x'> + 1)^d (Problem 2.17). Moreover, in practice, it is often useful to multiply the kernel by a scaling factor c to ensure that its numeric range is within some bounded interval, say [−1, 1]. The value of c will depend on the dimension and range of the data.
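The following numpy check (an illustration, not the book's code) confirms numerically that the explicit map Φ_2 of (2.13) and the kernel (2.10) agree, and shows the variant (<x, x'> + 1)^d mentioned above; the random test points are arbitrary.

```python
# Verify that the polynomial kernel computes the dot product of the explicit
# monomial features: for d = 2 and N = 2, <Phi_2(x), Phi_2(x')> from (2.13)
# equals <x, x'>^2 from (2.10).
import numpy as np

def phi2(x):                                # unordered degree-2 features, cf. (2.13)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, xp, d=2, c=0.0):         # (<x, x'> + c)^d; c = 1 gives all monomials up to degree d
    return (x @ xp + c) ** d

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)
assert np.allclose(phi2(x) @ phi2(xp), poly_kernel(x, xp))
print(poly_kernel(x, xp), poly_kernel(x, xp, d=5, c=1.0))
```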
Figure 2.1 Toy example of a binary classification problem mapped into feature space. We assume that the true decision boundary is an ellipse in input space (left panel). The task of the learning process is to estimate this boundary based on empirical data consisting of training points in both classes (crosses and circles, respectively). When mapped into feature space via the nonlinear map Φ_2(x) = (z_1, z_2, z_3) = ([x]_1^2, [x]_2^2, sqrt(2) [x]_1 [x]_2) (right panel), the ellipse becomes a hyperplane (in the present simple case, it is parallel to the z_3 axis, hence all points are plotted in the (z_1, z_2) plane). This is due to the fact that ellipses can be written as linear equations in the entries of (z_1, z_2, z_3). Therefore, in feature space, the problem reduces to that of estimating a hyperplane from the mapped data points. Note that via the polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space can be computed without computing Φ_2. Later in the book, we shall describe algorithms for constructing hyperplanes which are based on dot products (Chapter 7).
2.2 The Representation of Similarities in Linear Spaces
In what follows, we will look at things the other way round, and start with the kernel rather than with the feature map. Given some kernel, can we construct a feature space such that the kernel computes the dot product in that feature space; that is, such that (2.2) holds? This question has been brought to the attention of the machine learning community in a variety of contexts, especially during recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the theory of kernels is the book of Berg, Christensen, and Ressel [42]; indeed, a large part of the material in the present chapter is based on this work. We do not aim to be fully rigorous; instead, we try to provide insight into the basic ideas. As a rule, all the results that we state without proof can be found in [42]. Other standard references include [16, 455].

There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data, and the domain X was assumed to be a subset of R^N. By contrast, the results in the current section hold for data drawn from domains which need no structure, other than their being nonempty sets. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available, and where one directly works with pairwise distances or similarities between non-vectorial objects [246, 467, 154, 210, 234, 585]. This theme will recur in several places throughout the book, for instance in Chapter 13.
2.2.1 Positive Definite Kernels
We start with some basic definitions and results. As in the previous chapter, indices i and j are understood to run over 1, ..., m.

Gram Matrix

Definition 2.3 (Gram Matrix) Given a function k : X^2 → K (where K = C or K = R) and patterns x_1, ..., x_m ∈ X, the m × m matrix K with elements

K_ij := k(x_i, x_j)                                                  (2.14)

is called the Gram matrix (or kernel matrix) of k with respect to x_1, ..., x_m.

PD Matrix

Definition 2.4 (Positive Definite Matrix) A complex m × m matrix K satisfying

Σ_{i,j} c_i c̄_j K_ij ≥ 0                                             (2.15)

for all c_i ∈ C is called positive definite.¹ Similarly, a real symmetric m × m matrix K satisfying (2.15) for all c_i ∈ R is called positive definite.

Note that a symmetric matrix is positive definite if and only if all its eigenvalues are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as the quadratic form induced by K.
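As a numerical illustration of these definitions (under the assumption of a Gaussian kernel and random points; not part of the original text), one can build a Gram matrix and inspect its eigenvalues.

```python
# The Gram matrix of the Gaussian kernel on random points is symmetric and,
# up to numerical precision, has only nonnegative eigenvalues.
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix (2.14)

assert np.allclose(K, K.T)                       # symmetry, cf. (2.18)
print(np.linalg.eigvalsh(K).min())               # >= 0 up to round-off, cf. (2.15)
```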
PD Kernel

Definition 2.5 ((Positive Definite) Kernel) Let X be a nonempty set. A function k on X × X which for all m ∈ N and all x_1, ..., x_m ∈ X gives rise to a positive definite Gram matrix is called a positive definite (pd) kernel. Often, we shall refer to it simply as a kernel.
Remark 2.6 (Terminology) The term kernel stems from the first use of this type of function in the field of integral operators as studied by Hilbert and others [243, 359, 112]. A function k which gives rise to an operator T_k via

(T_k f)(x) = ∫_X k(x, x') f(x') dx'                                  (2.16)

is called the kernel of T_k.

In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, admissible kernel, Support Vector kernel, nonnegative definite kernel, and covariance function. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is sometimes reserved for the case where equality in (2.15) only occurs if c_1 = ... = c_m = 0.

1. The bar in c̄_j denotes complex conjugation; for real numbers, it has no effect.
Simply using the term positive kernel, on the other hand, could be mistaken as referring to a kernel whose values are positive. Finally, the term positive semidefinite kernel becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow the convention used for instance in [42], and employ the term positive definite both for kernels and matrices in the way introduced above. The case where the value 0 is only attained if all coefficients are 0 will be referred to as strictly positive definite.
We shall mostly use the term kernel. Whenever we want to refer to a kernel k(x, x') which is not positive definite in the sense stated above, it will be clear from the context.

The definitions for positive definite kernels and positive definite matrices differ in the fact that in the former case, we are free to choose the points on which the kernel is evaluated — for every choice, the kernel induces a positive definite matrix.

Positive definiteness implies positivity on the diagonal (Problem 2.12),

k(x, x) ≥ 0  for all x ∈ X,                                          (2.17)

and symmetry (Problem 2.13),

k(x_i, x_j) = k̄(x_j, x_i).                                           (2.18)

To also cover the complex-valued case, our definition of symmetry includes complex conjugation. The definition of symmetry of matrices is analogous; that is, K_ij = K̄_ji.

Real-Valued Kernels

For real-valued kernels, it is not sufficient to stipulate that (2.15) hold for real coefficients c_i. To get away with real coefficients only, we must additionally require that the kernel be symmetric (Problem 2.14), k(x_i, x_j) = k(x_j, x_i) (cf. Problem 2.13).

It can be shown that whenever k is a (complex-valued) positive definite kernel, its real part is a (real-valued) positive definite kernel. Below, we shall largely be dealing with real-valued kernels. Most of the results, however, also apply for complex-valued kernels.

Kernels can be regarded as generalized dot products. Indeed, any dot product is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard property of dot products, does not carry over to general kernels. However, another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:
Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If k is a positive definite kernel, and x_1, x_2 ∈ X, then

|k(x_1, x_2)|^2 ≤ k(x_1, x_1) · k(x_2, x_2).                         (2.19)

Proof  For the sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The 2 × 2 Gram matrix with entries K_ij = k(x_i, x_j) (i, j ∈ {1, 2}) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, the determinant of K. Therefore

0 ≤ K_11 K_22 − K_12 K_21 = K_11 K_22 − K_12 K̄_12 = K_11 K_22 − |K_12|^2.   (2.20)

Substituting k(x_i, x_j) for K_ij, we get the desired inequality.
Feature Map
Vector Space

Figure 2.2 One instantiation of the feature map associated with a kernel is the map (2.21), which represents each pattern (in the picture, x or x') by a kernel-shaped function sitting on the pattern. In this sense, each pattern is represented by its similarity to all other patterns. In the picture, the kernel is assumed to be bell-shaped, e.g., a Gaussian k(x, x') = exp(−||x − x'||^2 / (2 σ^2)). In the text, we describe the construction of a dot product <·,·> on the function space such that k(x, x') = <Φ(x), Φ(x')>.

We now show how the feature spaces in question are defined by the choice of kernel function.
2.2.2 The Reproducing Kernel Map
Assume that k is a real-valued positive definite kernel, and X a nonempty set. We define a map from X into the space of functions mapping X into R, denoted as R^X := {f : X → R}, via

Φ: X → R^X,  x ↦ k(·, x).                                            (2.21)

Here, Φ(x) denotes the function that assigns the value k(x', x) to x' ∈ X, i.e., Φ(x)(·) = k(·, x) (as shown in Figure 2.2).

We have thus turned each pattern into a function on the domain X. In this sense, a pattern is now represented by its similarity to all other points in the input domain X. This seems a very rich representation; nevertheless, it will turn out that the kernel allows the computation of the dot product in this representation. Below, we show how to construct a feature space associated with Φ, proceeding in the following steps:

1. Turn the image of Φ into a vector space,
2. define a dot product, that is, a strictly positive definite bilinear form, and
3. show that the dot product satisfies k(x, x') = <Φ(x), Φ(x')>.

We begin by constructing a dot product space containing the images of the input patterns under Φ. To this end, we first need to define a vector space. This is done by taking linear combinations of the form

f(·) = Σ_{i=1}^m α_i k(·, x_i).                                      (2.22)

Here, m ∈ N, α_i ∈ R, and x_1, ..., x_m ∈ X are arbitrary.
Dot Product
Reproducing Kernel
Next, we define a dot product between f and another function

g(·) = Σ_{j=1}^{m'} β_j k(·, x'_j),                                  (2.23)

where m' ∈ N, β_j ∈ R and x'_1, ..., x'_{m'} ∈ X, as

<f, g> := Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j).               (2.24)
This expression explicitly contains the expansion coefficients, which need not be unique. To see that it is nevertheless well-defined, note that

<f, g> = Σ_{j=1}^{m'} β_j f(x'_j),                                   (2.25)

using k(x'_j, x_i) = k(x_i, x'_j). The sum in (2.25), however, does not depend on the particular expansion of f. Similarly, for g, note that

<f, g> = Σ_{i=1}^m α_i g(x_i).                                       (2.26)

The last two equations also show that <·,·> is bilinear. It is symmetric, as <f, g> = <g, f>. Moreover, it is positive definite, since positive definiteness of k implies that for any function f, written as (2.22), we have

<f, f> = Σ_{i,j=1}^m α_i α_j k(x_i, x_j) ≥ 0.                        (2.27)
The latter implies that <·,·> is actually itself a positive definite kernel, defined on our space of functions. To see this, note that given functions f_1, ..., f_n, and coefficients γ_1, ..., γ_n ∈ R, we have

Σ_{i,j=1}^n γ_i γ_j <f_i, f_j> = < Σ_{i=1}^n γ_i f_i, Σ_{j=1}^n γ_j f_j > ≥ 0.   (2.28)
Here, the left hand equality follows from the bilinearity of <·,·>, and the right hand inequality from (2.27). For the last step in proving that it qualifies as a dot product, we will use the following interesting property of Φ, which follows directly from the definition: for all functions (2.22), we have

<k(·, x), f> = f(x)                                                  (2.29)

— k is the representer of evaluation. In particular,

<k(·, x), k(·, x')> = k(x, x').                                      (2.30)

By virtue of these properties, positive definite kernels k are also called reproducing kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have

|f(x)|^2 = |<k(·, x), f>|^2 ≤ k(x, x) · <f, f>.                      (2.31)
Therefore, <f, f> = 0 directly implies f = 0, which is the last property that required proof in order to establish that <·,·> is a dot product (cf. Section B.2).

The case of complex-valued kernels can be dealt with using the same construction; in that case, we will end up with a complex dot product space [42].

The above reasoning has shown that any positive definite kernel can be thought of as a dot product in another space: in view of (2.21), the reproducing kernel property (2.30) amounts to

<Φ(x), Φ(x')> = k(x, x').                                            (2.32)

Therefore, the dot product space constructed in this way is one possible instantiation of the feature space associated with a kernel.

Kernels from Feature Maps
Equivalent Definition of PD Kernels
Kernel Trick
Above, we have started with the kernel, and constructed a feature map. Let us now consider the opposite direction. Whenever we have a mapping Φ from X into a dot product space, we obtain a positive definite kernel via k(x, x') := <Φ(x), Φ(x')>. This can be seen by noting that for all c_i ∈ R, x_i ∈ X, i = 1, ..., m, we have

Σ_{i,j} c_i c_j k(x_i, x_j) = < Σ_i c_i Φ(x_i), Σ_j c_j Φ(x_j) > = || Σ_i c_i Φ(x_i) ||^2 ≥ 0,   (2.33)
due to the nonnegativity of the norm. This has two consequences. First, it allows us to give an equivalent definition of positive definite kernels as functions with the property that there exists a map Φ into a dot product space such that (2.32) holds true. Second, it allows us to construct kernels from feature maps. For instance, it is in this way that powerful linear representations of 3D heads proposed in computer graphics [575, 59] give rise to kernels. The identity (2.32) forms the basis for the kernel trick:
Remark 2.8 ("Kernel Trick") Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k by another positive definite kernel k̃.

In view of the material in the present section, the justification for this procedure is the following: effectively, the original algorithm can be thought of as a dot product based algorithm operating on vectorial data Φ(x_1), ..., Φ(x_m). The algorithm obtained by replacing k by k̃ then is exactly the same dot product based algorithm, only that it operates on Φ̃(x_1), ..., Φ̃(x_m).
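As a small illustration of Remark 2.8 (with two assumed kernels; not the book's code), consider an "algorithm" that only needs feature-space distances: since ||Φ(x) − Φ(x')||^2 = k(x, x) − 2k(x, x') + k(x', x'), any positive definite kernel can be substituted without changing the algorithm itself.

```python
# Kernel trick in miniature: an algorithm written purely in terms of k -- here,
# distances in feature space -- runs unchanged with any pd kernel plugged in.
import numpy as np

def feature_distance(x, xp, k):
    # ||Phi(x) - Phi(x')||^2 = k(x,x) - 2 k(x,x') + k(x',x'), using only (2.32)
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)

linear = lambda x, xp: float(np.dot(x, xp))                       # dot product kernel
gauss  = lambda x, xp: float(np.exp(-np.sum((x - xp) ** 2) / 2))  # Gaussian kernel

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_distance(x, xp, linear))   # 2.0: squared distance in input space
print(feature_distance(x, xp, gauss))    # distance between the mapped patterns
```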
The best known application of the kernel trick is in the case where k is the dot product in the input domain (cf. Problem 2.5). The trick is not limited to that case, however: k and k˜ can both be nonlinear kernels. In general, care must be exercised in determining whether the resulting algorithm will be useful: sometimes, an algorithm will only work subject to additional conditions on the input data, e.g., the data set might have to lie in the positive orthant. We shall later see that certain kernels induce feature maps which enforce such properties for the mapped data (cf. (2.73)), and that there are algorithms which take advantage of these aspects (e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel
|
||
|
||
2.2 The Representation of Similarities in Linear Spaces
|
||
|
||
35
|
||
|
||
Historical Remarks
|
||
|
||
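To make the kernel trick and the identity (2.33) concrete, here is a small Python sketch. It is an illustration only: the two kernel functions, the toy data, and the parameter sigma are arbitrary assumptions, not part of the text. Any algorithm touching the data only through Gram matrices can have one positive definite kernel swapped for another.

import numpy as np

# Two positive definite kernels; swapping one for the other "kernelizes"
# any algorithm that touches the data only through dot products.
def linear_kernel(x, y):
    return float(np.dot(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def gram_matrix(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for an array of inputs X."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # six toy points in R^3 (illustrative)
c = rng.normal(size=6)               # arbitrary coefficients c_i

for kernel in (linear_kernel, gaussian_kernel):
    K = gram_matrix(kernel, X)
    # (2.33): sum_ij c_i c_j k(x_i, x_j) = ||sum_i c_i Phi(x_i)||^2 >= 0.
    # Zero eigenvalues may show up as tiny negative numbers due to round-off.
    quad_form = c @ K @ c
    print(kernel.__name__, "quadratic form:", round(quad_form, 6),
          "min eigenvalue:", round(float(np.linalg.eigvalsh(K).min()), 6))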
Historical Remarks

Even though the kernel trick had been used in the literature for a fair amount of time [4, 62], it took until the mid 1990s before it was explicitly stated that any algorithm that only depends on dot products, i.e., any algorithm that is rotationally invariant, can be kernelized [479, 480]. Since then, a number of algorithms have benefitted from the kernel trick, such as the ones described in the present book, as well as methods for clustering in feature spaces [479, 215, 199].

Moreover, the machine learning community took time to comprehend that the definition of kernels on general sets (rather than dot product spaces) greatly extends the applicability of kernel methods [467], to data types such as texts and other sequences [234, 585, 23]. Indeed, this is now recognized as a crucial feature of kernels: they lead to an embedding of general data types in linear spaces.

Not surprisingly, the history of methods for representing kernels in linear spaces (in other words, the mathematical counterpart of the kernel trick) dates back significantly further than their use in machine learning. The methods appear to have first been studied in the 1940s by Kolmogorov [304] for countable 𝒳 and by Aronszajn [16] in the general case. Pioneering work on linear representations of a related class of kernels, to be described in Section 2.4, was done by Schoenberg [465]. Further bibliographical comments can be found in [42].

We thus see that the mathematical basis for kernel algorithms has been around for a long time. As is often the case, however, the practical importance of mathematical results was initially underestimated.²
2.2.3 Reproducing Kernel Hilbert Spaces

In the last section, we described how to define a space of functions which is a valid realization of the feature spaces associated with a given kernel. To do this, we had to make sure that the space is a vector space, and that it is endowed with a dot product. Such spaces are referred to as dot product spaces (cf. Appendix B), or equivalently as pre-Hilbert spaces. The reason for the latter is that one can turn them into Hilbert spaces (cf. Section B.3) by a fairly simple mathematical trick. This additional structure has some mathematical advantages. For instance, in Hilbert spaces it is always possible to define projections. Indeed, Hilbert spaces are one of the favorite concepts of functional analysis.

So let us again consider the pre-Hilbert space of functions (2.22), endowed with the dot product (2.24). To turn it into a Hilbert space (over ℝ), one completes it in the norm corresponding to the dot product, ‖f‖ := √⟨f, f⟩. This is done by adding the limit points of sequences that are convergent in that norm (see Appendix B).

2. This is illustrated by the following quotation from an excellent machine learning textbook published in the seventies (p. 174 in [152]): “The familiar functions of mathematical physics are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of potential functions. However, these suggestions are more appealing for their mathematical beauty than their practical usefulness.”

RKHS

In view of the properties (2.29) and (2.30), this space is usually called a reproducing kernel Hilbert space (RKHS).

In general, an RKHS can be defined as follows.
Definition 2.9 (Reproducing Kernel Hilbert Space) Let 𝒳 be a nonempty set (often called the index set) and let ℋ be a Hilbert space of functions f : 𝒳 → ℝ. Then ℋ is called a reproducing kernel Hilbert space endowed with the dot product ⟨·, ·⟩ (and the norm ‖f‖ := √⟨f, f⟩) if there exists a function k : 𝒳 × 𝒳 → ℝ with the following properties.

Reproducing Property

1. k has the reproducing property³

⟨f, k(x, ·)⟩ = f(x) for all f ∈ ℋ;    (2.34)

in particular,

⟨k(x, ·), k(x′, ·)⟩ = k(x, x′).    (2.35)

Closed Space

2. k spans ℋ; i.e., ℋ = span{k(x, ·) | x ∈ 𝒳}, where X̄ denotes the completion of the set X (cf. Appendix B).

On a more abstract level, an RKHS can be defined as a Hilbert space of functions f on 𝒳 such that all evaluation functionals (the maps f ↦ f(x′), where x′ ∈ 𝒳) are continuous. In that case, by the Riesz representation theorem (e.g., [429]), for each x′ ∈ 𝒳 there exists a unique function of x, called k(x, x′), such that

f(x′) = ⟨f, k(·, x′)⟩.    (2.36)

It follows directly from (2.35) that k(x, x′) is symmetric in its arguments (see Problem 2.28) and satisfies the conditions for positive definiteness.

Uniqueness of k

Note that the RKHS uniquely determines k. This can be shown by contradiction: assume that there exist two kernels, say k and k′, spanning the same RKHS ℋ. From Problem 2.28 we know that both k and k′ must be symmetric. Moreover, from (2.34) we conclude that

⟨k(x, ·), k′(x′, ·)⟩ = k(x, x′) = k′(x′, x).    (2.37)

In the second equality we used the symmetry of the dot product. Finally, symmetry in the arguments of k′ yields k(x, x′) = k′(x, x′), which proves our claim.

3. Note that this implies that each f ∈ ℋ is actually a single function whose values at any x ∈ 𝒳 are well-defined. In contrast, L₂ Hilbert spaces usually do not have this property. The elements of these spaces are equivalence classes of functions that disagree only on sets of measure 0; cf. footnote 15 in Section B.3.
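As a small numerical aside, the following sketch represents a function f as a kernel expansion, computes its RKHS norm from the Gram matrix, and checks the bound (2.31). The Gaussian kernel, the toy points, and the coefficients are arbitrary illustrative choices.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))      # expansion points x_1, ..., x_5 (toy data)
alpha = rng.normal(size=5)       # coefficients of f = sum_i alpha_i k(., x_i)
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

def f(x):
    """Evaluate f via the reproducing property: f(x) = <f, k(., x)>."""
    return float(sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X)))

# RKHS dot product of the kernel expansion with itself: <f, f> = alpha^T K alpha.
f_norm_sq = float(alpha @ K @ alpha)

# Check the bound (2.31): |f(x)|^2 <= k(x, x) * <f, f>.
x_test = np.array([0.3, -0.7])
print(f(x_test) ** 2 <= gaussian_kernel(x_test, x_test) * f_norm_sq)   # True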
2.2.4 The Mercer Kernel Map

Section 2.2.2 has shown that any positive definite kernel can be represented as a dot product in a linear space. This was done by explicitly constructing a (Hilbert) space that does the job. The present section will construct another Hilbert space.

Mercer's Theorem

One could argue that this is superfluous, given that any two separable Hilbert spaces are isometrically isomorphic; in other words, it is possible to define a one-to-one linear map between the spaces which preserves the dot product. However, the tool that we shall presently use, Mercer's theorem, has played a crucial role in the understanding of SVMs, and it provides valuable insight into the geometry of feature spaces, which more than justifies its detailed discussion. In the SVM literature, the kernel trick is usually introduced via Mercer's theorem.

We start by stating the version of Mercer's theorem given in [606]. We assume (𝒳, μ) to be a finite measure space.⁴ The term almost all (cf. Appendix B) means except for sets of measure zero. For the commonly used Lebesgue-Borel measure, countable sets of individual points are examples of zero measure sets. Note that the integral with respect to a measure is explained in Appendix B. Readers who do not want to go into mathematical detail may simply want to think of the dμ(x′) as a dx′, and of 𝒳 as a compact subset of ℝ^N. For further explanations of the terms involved in this theorem, cf. Appendix B, especially Section B.3.

4. A finite measure space is a set 𝒳 with a σ-algebra (Definition B.1) defined on it, and a measure μ (Definition B.2) defined on the latter, satisfying μ(𝒳) < ∞ (so that, up to a scaling factor, μ is a probability measure).
Theorem 2.10 (Mercer [359, 307]) Suppose k ∈ L_∞(𝒳²) is a symmetric real-valued function such that the integral operator (cf. (2.16))

T_k : L₂(𝒳) → L₂(𝒳), (T_k f)(x) := ∫_𝒳 k(x, x′) f(x′) dμ(x′),    (2.38)

is positive definite; that is, for all f ∈ L₂(𝒳), we have

∫_{𝒳²} k(x, x′) f(x) f(x′) dμ(x) dμ(x′) ≥ 0.    (2.39)

Let ψ_j ∈ L₂(𝒳) be the normalized orthogonal eigenfunctions of T_k associated with the eigenvalues λ_j > 0, sorted in non-increasing order. Then

1. (λ_j)_j ∈ ℓ₁,

2. k(x, x′) = Σ_{j=1}^{N_ℋ} λ_j ψ_j(x) ψ_j(x′) holds for almost all (x, x′). Either N_ℋ ∈ ℕ, or N_ℋ = ∞; in the latter case, the series converges absolutely and uniformly for almost all (x, x′).
For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approximation and its relationship to kernel PCA (Section 1.7), see Problem 2.26.

From statement 2 it follows that k(x, x′) corresponds to a dot product in ℓ₂^{N_ℋ}, since k(x, x′) = ⟨Φ(x), Φ(x′)⟩ with

Φ : 𝒳 → ℓ₂^{N_ℋ}, x ↦ (√λ_j ψ_j(x))_{j=1,…,N_ℋ},    (2.40)

for almost all x ∈ 𝒳. Note that we use the same Φ as in (2.21) to denote the feature map, although the target spaces are different. However, this distinction is not important for the present purposes — we are interested in the existence of some Hilbert space in which the kernel corresponds to the dot product, and not in what particular representation of it we are using.

In fact, it has been noted [467] that the uniform convergence of the series implies that given any ε > 0, there exists an n ∈ ℕ such that even if N_ℋ = ∞, k can be approximated within accuracy ε as a dot product in ℝⁿ: for almost all x, x′ ∈ 𝒳, |k(x, x′) − ⟨Φ_n(x), Φ_n(x′)⟩| ≤ ε, where Φ_n : x ↦ (√λ_1 ψ_1(x), …, √λ_n ψ_n(x)). The feature space can thus always be thought of as finite-dimensional within some accuracy ε. We summarize our findings in the following proposition.

Mercer Feature Map
Proposition 2.11 (Mercer Kernel Map) If k is a kernel satisfying the conditions of Theorem 2.10, we can construct a mapping Φ into a space where k acts as a dot product,

⟨Φ(x), Φ(x′)⟩ = k(x, x′),    (2.41)

for almost all x, x′ ∈ 𝒳. Moreover, given any ε > 0, there exists a map Φ_n into an n-dimensional dot product space (where n ∈ ℕ depends on ε) such that

|k(x, x′) − ⟨Φ_n(x), Φ_n(x′)⟩| ≤ ε    (2.42)

for almost all x, x′ ∈ 𝒳.
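The following sketch illustrates the flavor of the truncated Mercer map on a finite sample, in the spirit of the data-dependent approximation mentioned above in connection with Problem 2.26 and kernel PCA. It is an illustration under assumptions: the Gaussian kernel, the sample, and the truncation level n are arbitrary choices, and the Gram matrix spectrum merely stands in for the eigenvalues and eigenfunctions of T_k.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))          # sample points (toy data)
m = len(X)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                    # Gaussian kernel, sigma = 1

# Eigendecomposition of K/m mimics the eigenvalues/eigenfunctions of T_k under the
# empirical measure; the eigenvector columns evaluated at the sample approximate psi_j.
lam, U = np.linalg.eigh(K / m)
lam, U = lam[::-1], U[:, ::-1]                 # sort in non-increasing order

n = 10                                         # truncation level (assumed)
Phi_n = np.sqrt(m) * U[:, :n] * np.sqrt(np.clip(lam[:n], 0, None))

# Cf. (2.42): the truncated feature map reproduces the kernel up to a small error.
print("max |K - Phi_n Phi_n^T| =", float(np.abs(K - Phi_n @ Phi_n.T).max()))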
Both Mercer kernels and positive definite kernels can thus be represented as dot products in Hilbert spaces. The following proposition, showing a case where the two types of kernels coincide, thus comes as no surprise.

Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let 𝒳 = [a, b] be a compact interval and let k : [a, b] × [a, b] → ℝ be continuous. Then k is a positive definite kernel if and only if

∫_a^b ∫_a^b k(x, x′) f(x) f(x′) dx dx′ ≥ 0    (2.43)

for each continuous function f : 𝒳 → ℝ.

Note that the conditions in this proposition are actually more restrictive than those of Theorem 2.10. Using the feature space representation (Proposition 2.11), however, it is easy to see that Mercer kernels are also positive definite (for almost all x, x′) in the more general case of Theorem 2.10: given any c ∈ ℝ^m, we have

Σ_{i,j} c_i c_j k(x_i, x_j) = Σ_{i,j} c_i c_j ⟨Φ(x_i), Φ(x_j)⟩ = ‖Σ_i c_i Φ(x_i)‖² ≥ 0.    (2.44)
Being positive definite, Mercer kernels are thus also reproducing kernels.

We next show how the reproducing kernel map is related to the Mercer kernel map constructed from the eigenfunction decomposition [202, 467]. To this end, let us consider a kernel which satisfies the condition of Theorem 2.10, and construct a dot product ⟨·, ·⟩ such that k becomes a reproducing kernel for the Hilbert space ℋ containing the functions

f(x) = Σ_i α_i k(x, x_i) = Σ_i Σ_{j=1}^{N_ℋ} α_i λ_j ψ_j(x) ψ_j(x_i).    (2.45)

By linearity, which holds for any dot product, we have

⟨f, k(·, x′)⟩ = Σ_i Σ_{j,n=1}^{N_ℋ} α_i λ_j ψ_j(x_i) ⟨ψ_j, ψ_n⟩ λ_n ψ_n(x′).    (2.46)

Since k is a Mercer kernel, the ψ_j (j = 1, …, N_ℋ) can be chosen to be orthogonal with respect to the dot product in L₂(𝒳). Hence it is straightforward to choose ⟨·, ·⟩ such that

⟨ψ_j, ψ_n⟩ = δ_{jn}/λ_j    (2.47)

(using the Kronecker symbol δ_{jn}, see (B.30)), in which case (2.46) reduces to the reproducing kernel property (2.36) (using (2.45)). For a coordinate representation in the RKHS, see Problem 2.29.

Equivalence of Feature Spaces
The above connection between the Mercer kernel map and the RKHS map is instructive, but we shall rarely make use of it. In fact, we will usually identify the different feature spaces. Thus, to avoid confusion in subsequent chapters, the following comments are necessary. As described above, there are different ways of constructing feature spaces for any given kernel. In fact, they can even differ in terms of their dimensionality (cf. Problem 2.22). The two feature spaces that we will mostly use in this book are the RKHS associated with k (Section 2.2.2) and the Mercer ℓ₂ feature space. We will mostly use the same symbol ℋ for all feature spaces that are associated with a given kernel. This makes sense provided that everything we do, at the end of the day, reduces to dot products. For instance, let us assume that Φ₁, Φ₂ are maps into the feature spaces ℋ₁, ℋ₂ respectively, both associated with the kernel k; in other words,

k(x, x′) = ⟨Φ_i(x), Φ_i(x′)⟩_{ℋ_i} for i = 1, 2.    (2.48)

Then it will usually not be the case that Φ₁(x) = Φ₂(x); due to (2.48), however, we always have ⟨Φ₁(x), Φ₁(x′)⟩_{ℋ₁} = ⟨Φ₂(x), Φ₂(x′)⟩_{ℋ₂}. Therefore, as long as we are only interested in dot products, the two spaces can be considered identical.

An example of this identity is the so-called large margin regularizer that is usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters 4 and 7),

⟨w, w⟩, where w = Σ_{i=1}^m α_i Φ(x_i).    (2.49)

No matter whether Φ is the RKHS map Φ(x_i) = k(·, x_i) (2.21) or the Mercer map Φ(x_i) = (√λ_j ψ_j(x_i))_{j=1,…,N_ℋ} (2.40), the value of ‖w‖² will not change.

This point is of great importance, and we hope that all readers are still with us. It is fair to say, however, that Section 2.2.5 can be skipped at first reading.
2.2.5 The Shape of the Mapped Data in Feature Space

Using Mercer's theorem, we have shown that one can think of the feature map as a map into a high- or infinite-dimensional Hilbert space. The argument in the remainder of the section shows that this typically entails that the mapped data Φ(𝒳) lie in some box with rapidly decaying side lengths [606]. By this we mean that the range of the data decreases as the dimension index j increases, with a rate that depends on the size of the eigenvalues.

Let us assume that for all j ∈ ℕ, we have sup_{x∈𝒳} λ_j ψ_j(x)² < ∞. Define the sequence

ℓ_j := sup_{x∈𝒳} λ_j ψ_j(x)².    (2.50)

Note that if

C_k := sup_j sup_{x∈𝒳} |ψ_j(x)|    (2.51)

exists (see Problem 2.24), then we have ℓ_j ≤ λ_j C_k². However, if the λ_j decay rapidly, then (2.50) can be finite even if (2.51) is not.

By construction, Φ(𝒳) is contained in an axis-parallel parallelepiped in ℓ₂^{N_ℋ} with side lengths 2√ℓ_j (cf. (2.40)).⁵

Consider an example of a common kernel, the Gaussian, and let μ (see Theorem 2.10) be the Lebesgue measure. In this case, the eigenvectors are sine and cosine functions (with supremum one), and thus the sequence of the ℓ_j coincides with the sequence of the eigenvalues λ_j. Generally, whenever sup_{x∈𝒳} ψ_j(x)² is finite, the ℓ_j decay as fast as the λ_j. We shall see in Sections 4.4, 4.5 and Chapter 12 that for many common kernels, this decay is very rapid.

It will be useful to consider operators that map Φ(𝒳) into balls of some radius R centered at the origin. The following proposition characterizes a class of such operators, determined by the sequence (ℓ_j)_j. Recall that ℝ^∞ denotes the space of all real sequences.
Proposition 2.13 (Mapping Φ(𝒳) into ℓ₂) Let S be the diagonal map

S : ℝ^∞ → ℝ^∞, (x_j)_j ↦ S(x_j)_j = (s_j x_j)_j,    (2.52)

where (s_j)_j ∈ ℝ^∞. If (s_j √ℓ_j)_j ∈ ℓ₂, then S maps Φ(𝒳) into a ball centered at the origin whose radius is R = ‖(s_j √ℓ_j)_j‖_{ℓ₂}.

5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent statements also only hold true almost everywhere.

Proof Suppose (s_j √ℓ_j)_j ∈ ℓ₂. Using the Mercer map (2.40), we have

‖SΦ(x)‖² = Σ_j s_j² λ_j ψ_j(x)² ≤ Σ_j s_j² ℓ_j = R²    (2.53)

for any x ∈ 𝒳. Hence SΦ(𝒳) ⊂ ℓ₂.

The converse is not necessarily the case. To see this, note that if (s_j √ℓ_j)_j ∉ ℓ₂, amounting to saying that

Σ_j s_j² sup_{x∈𝒳} λ_j ψ_j(x)²    (2.54)

is not finite, then there need not always exist an x ∈ 𝒳 such that (s_j √λ_j ψ_j(x))_j ∉ ℓ₂, i.e., such that

Σ_j s_j² λ_j ψ_j(x)²    (2.55)

is not finite.
To see how the freedom to rescale Φ(𝒳) effectively restricts the class of functions we are using, we first note that everything in the feature space ℓ₂^{N_ℋ} is done in terms of dot products. Therefore, we can compensate any invertible symmetric linear transformation of the data in ℋ by the inverse transformation on the set of admissible weight vectors in ℋ. In other words, for any invertible symmetric operator S on ℋ, we have ⟨S⁻¹w, SΦ(x)⟩ = ⟨w, Φ(x)⟩ for all x ∈ 𝒳.

As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there exists a class of generalization error bounds that depend on the radius R of the smallest sphere containing the data. If the (ℓ_i)_i decay rapidly, we are not actually “making use” of the whole sphere. In this case, we may construct a diagonal scaling operator S which inflates the sides of the above parallelepiped as much as possible, while ensuring that it is still contained within a sphere of the original radius R in ℋ (Figure 2.3). By effectively reducing the size of the function class, this will provide a way of strengthening the bounds. A similar idea, using kernel PCA (Section 14.2) to determine empirical scaling coefficients, has been successfully applied by [101].
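For a rough numerical feeling of how rapidly the relevant scales decay, the following sketch looks at the spectrum of a Gaussian Gram matrix as a finite-sample stand-in for the λ_j. The kernel width, sample size, and data distribution are arbitrary illustrative assumptions.

import numpy as np

# Spectrum of a Gaussian Gram matrix as a finite-sample stand-in for the lambda_j:
# the eigenvalues decay rapidly, so the mapped data essentially fill only a few
# directions of the (in principle infinite-dimensional) feature-space "box".
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * 0.5**2))          # Gaussian kernel, sigma = 0.5
lam = np.sort(np.linalg.eigvalsh(K))[::-1] / len(X)

print("five largest eigenvalues:", np.round(lam[:5], 4))
print("spectrum mass in top 10 of 300 directions:",
      round(float(lam[:10].sum() / lam.sum()), 4))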
We conclude this section with another useful insight that characterizes a property of the feature map Φ. Note that most of what was said so far applies to the case where the input domain 𝒳 is a general set. In this case, it is not possible to make nontrivial statements about continuity properties of Φ. This changes if we assume 𝒳 to be endowed with a notion of closeness, by turning it into a so-called topological space. Readers not familiar with this concept will be reassured to hear that Euclidean vector spaces are particular cases of topological spaces.

Continuity of Φ

Proposition 2.14 (Continuity of the Feature Map [402]) If 𝒳 is a topological space and k is a continuous positive definite kernel on 𝒳 × 𝒳, then there exists a Hilbert space ℋ and a continuous map Φ : 𝒳 → ℋ such that for all x, x′ ∈ 𝒳, we have k(x, x′) = ⟨Φ(x), Φ(x′)⟩.
[Figure 2.3: two panels, each showing the feature space with mapped points Φ(x) contained in a ball of radius R, together with the corresponding weight vector w, before and after the rescaling described in the caption.]

Figure 2.3 Since everything is done in terms of dot products, scaling up the data by an operator S can be compensated by scaling the weight vectors with S⁻¹ (cf. text). By choosing S such that the data are still contained in a ball of the same radius R, we effectively reduce our function class (parametrized by the weight vector), which can lead to better generalization bounds, depending on the kernel inducing the map Φ.
2.2.6 The Empirical Kernel Map

The map Φ, defined in (2.21), transforms each input pattern into a function on 𝒳, that is, into a potentially infinite-dimensional object. For any given set of points, however, it is possible to approximate Φ by only evaluating it on these points (cf. [232, 350, 361, 547, 474]):

Empirical Kernel Map

Definition 2.15 (Empirical Kernel Map) For a given set {z_1, …, z_n} ⊂ 𝒳, n ∈ ℕ, we call

Φ_n : 𝒳 → ℝⁿ, x ↦ k(·, x)|_{z_1,…,z_n} = (k(z_1, x), …, k(z_n, x))    (2.56)

the empirical kernel map w.r.t. {z_1, …, z_n}.

As an example, consider first the case where k is a positive definite kernel, and {z_1, …, z_n} = {x_1, …, x_m}; we thus evaluate k(·, x) on the training patterns. If we carry out a linear algorithm in feature space, then everything will take place in the linear span of the mapped training patterns. Therefore, we can represent the k(·, x) of (2.21) as Φ_m(x) without losing information. The dot product to use in that representation, however, is not simply the canonical dot product in ℝ^m, since the Φ(x_i) will usually not form an orthonormal system. To turn Φ_m into a feature map associated with k, we need to endow ℝ^m with a dot product ⟨·, ·⟩_m such that

k(x, x′) = ⟨Φ_m(x), Φ_m(x′)⟩_m.    (2.57)

To this end, we use the ansatz ⟨·, ·⟩_m = ⟨·, M·⟩, with M being a positive definite matrix.⁶ Enforcing (2.57) on the training patterns, this yields the self-consistency condition [478, 512]

K = KMK,    (2.58)

where K is the Gram matrix. The condition (2.58) can be satisfied for instance by the (pseudo-)inverse M = K⁻¹.

6. Every dot product in ℝ^m can be written in this form. We do not require strict definiteness of M, as the null space can be projected out, leading to a lower-dimensional feature space.

Kernel PCA Map

Equivalently, we could have incorporated this rescaling operation, which corresponds to a kernel PCA “whitening” ([478, 547, 474], cf. Section 11.4), directly into the map, by whitening (2.56) to get

Φ_m^w : x ↦ K^{-1/2} (k(x_1, x), …, k(x_m, x)).    (2.59)

This simply amounts to dividing the eigenvector basis vectors of K by √λ_i, where the λ_i are the eigenvalues of K.⁷ This parallels the rescaling of the eigenfunctions of the integral operator belonging to the kernel, given by (2.47). It turns out that this map can equivalently be performed using kernel PCA feature extraction (see Problem 14.8), which is why we refer to this map as the kernel PCA map.

7. It is understood that if K is singular, we use the pseudo-inverse of K^{1/2}, in which case we get an even lower dimensional subspace.

Note that we have thus constructed a data-dependent feature map into an m-dimensional space which satisfies ⟨Φ_m^w(x), Φ_m^w(x′)⟩ = k(x, x′); i.e., we have found an m-dimensional feature space associated with the given kernel. In the case where K is invertible, Φ_m^w(x) computes the coordinates of Φ(x) when represented in a basis of the m-dimensional subspace spanned by Φ(x_1), …, Φ(x_m).

For data sets where the number of examples is smaller than their dimension, it can actually be computationally attractive to carry out Φ_m^w explicitly, rather than using kernels in subsequent algorithms. Moreover, algorithms which are not readily “kernelized” may benefit from explicitly carrying out the kernel PCA map.
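Here is a minimal sketch of the kernel PCA map (2.59). It assumes a Gaussian kernel and toy data; the eigenvalue cutoff 1e-12 is an arbitrary numerical tolerance standing in for the pseudo-inverse mentioned in footnote 7.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def kernel_pca_map(X, kernel):
    """Whitened empirical kernel map (2.59): x -> K^{-1/2} (k(x_1,x), ..., k(x_m,x)).
    Uses the pseudo-inverse square root of K, dropping (near-)zero eigenvalues."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-12
    K_inv_sqrt = U[:, keep] @ np.diag(1.0 / np.sqrt(lam[keep])) @ U[:, keep].T
    def phi(x):
        return K_inv_sqrt @ np.array([kernel(xi, x) for xi in X])
    return phi

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
phi = kernel_pca_map(X, gaussian_kernel)

# Sanity check of (2.57)/(2.59) on two training points: <phi(x), phi(x')> ~ k(x, x').
x, xp = X[3], X[17]
print(np.isclose(phi(x) @ phi(xp), gaussian_kernel(x, xp)))   # True (up to round-off)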
We end this section with two notes which illustrate why the use of (2.56) need not be restricted to the special case we just discussed.

More general kernels. When using non-symmetric kernels k in (2.56), together with the canonical dot product, we effectively work with the positive definite matrix K⊤K. Note that each positive definite matrix can be written as K⊤K. Therefore, working with positive definite kernels leads to an equally rich set of nonlinearities as working with an empirical kernel map using general non-symmetric kernels. If we wanted to carry out the whitening step, we would have to use (K⊤K)^{-1/4} (cf. footnote 7 concerning potential singularities).

Different evaluation sets. Things can be sped up by using expansion sets of the form {z_1, …, z_n} ⊂ 𝒳, mapping into an n-dimensional space, with n < m, as done in [100, 228]. In that case, one modifies (2.59) to

Φ_n^w : x ↦ K_n^{-1/2} (k(z_1, x), …, k(z_n, x)),    (2.60)

where (K_n)_{ij} := k(z_i, z_j). The expansion set can either be a subset of the training set,⁸ or some other set of points. We will later return to the issue of how to choose the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of Kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry about the whitening step in (2.59) and (2.60): using the canonical dot product in ℝ^m (rather than ⟨·, ·⟩_m) will simply lead to diagonalizing K² instead of K, which yields the same eigenvectors with squared eigenvalues. This was pointed out by [350, 361]. The study [361] reports experiments where (2.56) was employed to speed up Kernel PCA by choosing {z_1, …, z_n} as a subset of {x_1, …, x_m}.

8. In [228] it is recommended that the size n of the expansion set is chosen large enough to ensure that the smallest eigenvalue of K_n is larger than some predetermined ε > 0. Alternatively, one can start off with a larger set, and use kernel PCA to select the most important components for the map, see Problem 14.8. In the kernel PCA case, the map (2.60) is computed as D_n^{-1/2} U_n^⊤ (k(z_1, x), …, k(z_n, x)), where U_n D_n U_n^⊤ is the eigenvalue decomposition of K_n. Note that the columns of U_n are the eigenvectors of K_n. We discard all columns that correspond to zero eigenvalues, as well as the corresponding dimensions of D_n. To approximate the map, we may actually discard all eigenvalues smaller than some ε > 0.
2.2.7 A Kernel Map Defined from Pairwise Similarities

In practice, we are given a finite amount of data x_1, …, x_m. The following simple observation shows that even if we do not want to (or are unable to) analyze a given kernel k analytically, we can still compute a map Φ such that k corresponds to a dot product in the linear span of the Φ(x_i):

Proposition 2.16 (Data-Dependent Kernel Map [467]) Suppose the data x_1, …, x_m and the kernel k are such that the kernel Gram matrix K_{ij} = k(x_i, x_j) is positive definite. Then it is possible to construct a map Φ into an m-dimensional feature space such that

k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩.    (2.61)

Conversely, given an arbitrary map Φ into some feature space ℋ, the matrix K_{ij} = ⟨Φ(x_i), Φ(x_j)⟩ is positive definite.

Proof First assume that K is positive definite. In this case, it can be diagonalized as K = SDS⊤, with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then

k(x_i, x_j) = (SDS⊤)_{ij} = ⟨S_i, D S_j⟩ = ⟨√D S_i, √D S_j⟩,    (2.62)

where we have defined the S_i as the rows of S (note that the columns of S would be K's eigenvectors). Therefore, K is the Gram matrix of the vectors √D S_i.⁹ Hence the following map Φ, defined on x_1, …, x_m, will satisfy (2.61):

Φ : x_i ↦ √D S_i.    (2.63)

Thus far, Φ is only defined on a set of points, rather than on a vector space. Therefore, it makes no sense to ask whether it is linear. We can, however, ask whether it can be extended to a linear map, provided the x_i are elements of a vector space. The answer is that if the x_i are linearly dependent (which is often the case), then this will not be possible, since a linear map would then typically be overdetermined by the m conditions (2.63). For the converse, assume an arbitrary α ∈ ℝ^m, and compute

Σ_{i,j=1}^m α_i α_j K_{ij} = ⟨Σ_{i=1}^m α_i Φ(x_i), Σ_{j=1}^m α_j Φ(x_j)⟩ = ‖Σ_{i=1}^m α_i Φ(x_i)‖² ≥ 0.    (2.64)

9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46].
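The construction in the proof of Proposition 2.16 is short enough to write down directly. The following sketch is illustrative only: the Gaussian toy Gram matrix is an arbitrary choice, and the clipping of tiny negative eigenvalues is a numerical safeguard rather than part of the proposition.

import numpy as np

def data_dependent_feature_map(K):
    """Construct Phi(x_i) = sqrt(D) S_i from K = S D S^T (cf. (2.62)-(2.63)).
    Returns an (m x m) array whose i-th row is the feature vector of x_i."""
    D, S = np.linalg.eigh(K)           # columns of S are K's eigenvectors
    D = np.clip(D, 0.0, None)          # guard against tiny negative round-off
    return S * np.sqrt(D)              # row i is sqrt(D) componentwise times S_i

# Toy Gram matrix from a Gaussian kernel on random points (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

Phi = data_dependent_feature_map(K)
print(np.allclose(Phi @ Phi.T, K))     # True: <Phi(x_i), Phi(x_j)> = K_ij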
In particular, this result implies that given data x_1, …, x_m, and a kernel k which gives rise to a positive definite matrix K, it is always possible to construct a feature space ℋ of dimension at most m that we are implicitly working in when using kernels (cf. Problem 2.32 and Section 2.2.6).

If we perform an algorithm which requires k to correspond to a dot product in some other space (as for instance the SV algorithms described in this book), it is possible that even though k is not positive definite in general, it still gives rise to a positive definite Gram matrix K with respect to the training data at hand. In this case, Proposition 2.16 tells us that nothing will go wrong during training when we work with these data. Moreover, if k leads to a matrix with some small negative eigenvalues, we can add a small multiple of some strictly positive definite kernel k′ (such as the identity, k′(x_i, x_j) = δ_{ij}) to obtain a positive definite matrix. To see this, suppose that λ_min < 0 is the minimal eigenvalue of k's Gram matrix. Note that, being strictly positive definite, the Gram matrix K′ of k′ satisfies

α⊤K′α ≥ λ′_min ‖α‖², with λ′_min > 0,    (2.65)

where λ′_min denotes its minimal eigenvalue, and the first inequality follows from Rayleigh's principle (B.57). Therefore, provided that τ ≥ −λ_min/λ′_min ≥ 0, we have

α⊤(K + τK′)α = α⊤Kα + τ α⊤K′α ≥ (λ_min + τ λ′_min) ‖α‖² ≥ 0    (2.66)

for all α ∈ ℝ^m, rendering (K + τK′) positive definite.
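A minimal sketch of this eigenvalue-shifting argument follows. The sigmoid parameters and the data are made-up illustrative assumptions, and whether a particular sigmoid Gram matrix is indefinite depends on them; the helper name shift_to_psd is ours.

import numpy as np

def shift_to_psd(K, K_prime=None):
    """If K has negative eigenvalues, add tau * K_prime with tau = -lambda_min /
    lambda'_min, following the argument around (2.65)-(2.66). By default K_prime
    is the identity, i.e. the kernel k'(x_i, x_j) = delta_ij."""
    m = K.shape[0]
    if K_prime is None:
        K_prime = np.eye(m)
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min >= 0:
        return K
    lam_prime_min = np.linalg.eigvalsh(K_prime).min()
    tau = -lam_min / lam_prime_min
    return K + tau * K_prime

# A sigmoid kernel (2.69) can produce indefinite Gram matrices; repair the one at hand.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.tanh(0.5 * X @ X.T - 1.0)
print("min eigenvalue before:", round(float(np.linalg.eigvalsh(K).min()), 4),
      "after:", round(float(np.linalg.eigvalsh(shift_to_psd(K)).min()), 4))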
2.3 Examples and Properties of Kernels

For the following examples, let us assume that 𝒳 ⊂ ℝ^N.

Polynomial

Besides homogeneous polynomial kernels (cf. Proposition 2.1),

k(x, x′) = ⟨x, x′⟩^d,    (2.67)

Gaussian

Boser, Guyon, and Vapnik [62, 223, 561] suggest the usage of Gaussian radial basis function kernels [26, 4],

k(x, x′) = exp(−‖x − x′‖² / (2σ²)),    (2.68)

Sigmoid

where σ > 0, and sigmoid kernels,

k(x, x′) = tanh(κ⟨x, x′⟩ + ϑ),    (2.69)
where κ > 0 and ϑ < 0. By applying Theorem 13.4 below, one can check that the latter kernel is not actually positive definite (see Section 4.6 and [85, 511] and the discussion in Example 4.25). Curiously, it has nevertheless successfully been used in practice. The reasons for this are discussed in [467].

Inhomogeneous Polynomial

Other useful kernels include the inhomogeneous polynomial,

k(x, x′) = (⟨x, x′⟩ + c)^d,    (2.70)

B_n-Spline of Odd Order

(d ∈ ℕ, c ≥ 0) and the B_n-spline kernel [501, 572] (I_X denoting the indicator (or characteristic) function on the set X, and ⊗ the convolution operation, (f ⊗ g)(x) := ∫ f(x′) g(x′ − x) dx′),

k(x, x′) = B_{2p+1}(‖x − x′‖) with B_n := ⊗_{i=1}^{n+1} I_{[−1/2, 1/2]}.    (2.71)

The kernel computes B-splines of order 2p + 1 (p ∈ ℕ), defined by the (2p + 1)-fold convolution of the unit interval [−1/2, 1/2]. See Section 4.4.1 for further details and a regularization theoretic analysis of this kernel.
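For readers who prefer to see the formulas as code, here is a plain transcription of the kernels (2.67)-(2.70). It is a sketch only; the default parameter values are arbitrary illustrative choices.

import numpy as np

def polynomial(x, y, d=3):
    return np.dot(x, y) ** d                       # homogeneous polynomial (2.67)

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))   # Gaussian RBF (2.68)

def sigmoid(x, y, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * np.dot(x, y) + theta)   # sigmoid (2.69), not pd in general

def inhomogeneous_polynomial(x, y, d=3, c=1.0):
    return (np.dot(x, y) + c) ** d                 # inhomogeneous polynomial (2.70)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (polynomial, gaussian, sigmoid, inhomogeneous_polynomial):
    print(k.__name__, round(float(k(x, y)), 4))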
Invariance of Kernels

Note that all these kernels have the convenient property of unitary invariance, k(x, x′) = k(Ux, Ux′) if U⊤ = U⁻¹, for instance if U is a rotation. If we consider complex numbers, then we have to use the adjoint U* := Ū⊤ instead of the transpose.

RBF Kernels

Radial basis function (RBF) kernels are kernels that can be written in the form

k(x, x′) = f(d(x, x′)),    (2.72)

where d is a metric on 𝒳, and f is a function on [0, ∞). Examples thereof are the Gaussians and B-splines mentioned above. Usually, the metric arises from the dot product, d(x, x′) = ‖x − x′‖ = √⟨x − x′, x − x′⟩. In this case, RBF kernels are unitary invariant, too. In addition, they are translation invariant; in other words, k(x, x′) = k(x − x₀, x′ − x₀) for all x₀ ∈ 𝒳.

In some cases, invariance properties alone can distinguish particular kernels: in Section 2.1, we explained how using polynomial kernels ⟨x, x′⟩^d corresponds to mapping into a feature space whose dimensions are spanned by all possible d-th order monomials in input coordinates. The different dimensions are scaled with the square root of the number of ordered products of the respective d entries (e.g., √2 in (2.13)). These scaling factors precisely ensure invariance under the group of all orthogonal transformations (rotations and mirroring operations). In many cases, this is a desirable property: it ensures that the results of a learning procedure do not depend on which orthonormal coordinate system (with fixed origin) we use for representing our input data.

Proposition 2.17 (Invariance of Polynomial Kernels [480]) Up to a scaling factor, the kernel k(x, x′) = ⟨x, x′⟩^d is the only kernel inducing a map into a space of all monomials of degree d which is invariant under orthogonal transformations of ℝ^N.
Properties of RBF Kernels

Some interesting additional structure exists in the case of a Gaussian RBF kernel k (2.68). As k(x, x) = 1 for all x ∈ 𝒳, each mapped example has unit length, ‖Φ(x)‖ = 1 (Problem 2.18 shows how to achieve this for general kernels). Moreover, as k(x, x′) > 0 for all x, x′, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (1.3) equals the cosine of the enclosed angle. We obtain

cos(∠(Φ(x), Φ(x′))) = ⟨Φ(x), Φ(x′)⟩ = k(x, x′) > 0,    (2.73)

which amounts to saying that the enclosed angle between any two mapped examples is smaller than π/2.

Infinite-Dimensional Feature Space

The above seems to indicate that in the Gaussian case, the mapped data lie in a fairly restricted area of feature space. However, in another sense, they occupy a space which is as large as possible:

Theorem 2.18 (Full Rank of Gaussian RBF Gram Matrices [360]) Suppose that x_1, …, x_m ∈ 𝒳 are distinct points, and σ ≠ 0. The matrix K given by

K_{ij} := exp(−‖x_i − x_j‖² / (2σ²))    (2.74)

has full rank.

In other words, the points Φ(x_1), …, Φ(x_m) are linearly independent (provided no two x_i are the same). They span an m-dimensional subspace of ℋ. Therefore a Gaussian kernel defined on a domain of infinite cardinality, with no a priori restriction on the number of training examples, produces a feature space of infinite dimension. Nevertheless, an analysis of the shape of the mapped data in feature space shows that capacity is distributed in a way that ensures smooth and simple estimates whenever possible (see Section 12.4).
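These properties are easy to check numerically; the following sketch does so on toy data with an arbitrary choice of σ (an illustration, not a proof of Theorem 2.18).

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 2))               # 20 distinct toy points
sigma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Unit length of mapped points and positive pairwise dot products, cf. (2.73).
print("diagonal all ones:", np.allclose(np.diag(K), 1.0))
print("all angles < pi/2:", bool((K > 0).all()))

# Theorem 2.18: the Gaussian Gram matrix of distinct points has full rank.
print("rank:", np.linalg.matrix_rank(K), "of", len(X))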
The examples given above all apply to the case of vectorial data. Let us next give an example where 𝒳 is not a vector space [42].

Proposition 2.19 (Similarity of Probabilistic Events) If (𝒳, Σ, P) is a probability space with σ-algebra Σ and probability measure P, then

k(A, B) = P(A ∩ B) − P(A)P(B)    (2.75)

is a positive definite kernel on Σ × Σ.

Proof To see this, we define a feature map

Φ : A ↦ I_A − P(A),    (2.76)

where I_A is the characteristic function on A. On the feature space, which consists of functions on 𝒳 taking values in [−1, 1], we use the dot product

⟨f, g⟩ := ∫ f g dP.    (2.77)

The result follows by noticing ⟨I_A, I_B⟩ = P(A ∩ B) and ⟨I_A, P(B)⟩ = P(A)P(B).

Further examples include kernels for string matching, as proposed by [585, 234, 23]. We shall describe these, and address the general problem of designing kernel functions, in Chapter 13.

The next section will return to the connection between kernels and feature spaces. Readers who are eager to move on to SV algorithms may want to skip this section, which is somewhat more technical.
2.4 The Representation of Dissimilarities in Linear Spaces

2.4.1 Conditionally Positive Definite Kernels

We now proceed to a larger class of kernels than that of the positive definite ones. This larger class is interesting in several regards. First, it will turn out that some kernel algorithms work with this class, rather than only with positive definite kernels. Second, its relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces.

The present section thus attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. The underlying mathematical results have been known for quite a while [465]; some of them have already attracted interest in the kernel methods community in various contexts [515, 234].

Clearly, the squared distance ‖Φ(x) − Φ(x′)‖² in the feature space associated with a pd kernel k can be computed, using k(x, x′) = ⟨Φ(x), Φ(x′)⟩, as

‖Φ(x) − Φ(x′)‖² = k(x, x) + k(x′, x′) − 2k(x, x′).    (2.78)

Positive definite kernels are, however, not the full story: there exists a larger class of kernels that can be used as generalized distances, and the present section will describe why and how [468].

Let us start by considering how a dot product and the corresponding distance measure are affected by a translation of the data, x ↦ x − x₀. Clearly, ‖x − x′‖² is translation invariant while ⟨x, x′⟩ is not. A short calculation shows that the effect of the translation can be expressed in terms of ‖· − ·‖² as

⟨(x − x₀), (x′ − x₀)⟩ = ½ (−‖x − x′‖² + ‖x − x₀‖² + ‖x₀ − x′‖²).    (2.79)

Note that this, just like ⟨x, x′⟩, is still a pd kernel: Σ_{i,j} c_i c_j ⟨(x_i − x₀), (x_j − x₀)⟩ = ‖Σ_i c_i (x_i − x₀)‖² ≥ 0 holds true for any c_i. For any choice of x₀ ∈ 𝒳, we thus get a similarity measure (2.79) associated with the dissimilarity measure ‖x − x′‖.
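As a two-line illustration of (2.78), feature-space distances can be computed from kernel evaluations alone; the kernels and points below are arbitrary.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def feature_distance_sq(k, x, y):
    """Squared feature-space distance (2.78), computed from kernel values only."""
    return k(x, x) + k(y, y) - 2 * k(x, y)

x, y = np.array([0.0, 1.0]), np.array([2.0, -1.0])
print(feature_distance_sq(gaussian_kernel, x, y))            # nonlinear "distance"
print(feature_distance_sq(lambda a, b: float(a @ b), x, y))  # equals ||x - y||^2
print(float(np.sum((x - y) ** 2)))                           # check for the linear case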
This naturally leads to the question of whether (2.79) might suggest a connection that also holds true in more general cases: what kind of nonlinear dissimilarity measure do we have to substitute for ‖· − ·‖² on the right hand side of (2.79), to ensure that the left hand side becomes positive definite? To state the answer, we first need to define the appropriate class of kernels.
The following definition differs from Definition 2.4 only in the additional constraint on the sum of the c_i. Below, 𝕂 is a shorthand for ℂ or ℝ; the definitions are the same in both cases.

Definition 2.20 (Conditionally Positive Definite Matrix) A symmetric m × m matrix K (m ≥ 2) taking values in 𝕂 and satisfying

Σ_{i,j=1}^m c_i c̄_j K_{ij} ≥ 0 for all c_i ∈ 𝕂 with Σ_{i=1}^m c_i = 0    (2.80)

is called conditionally positive definite (cpd).

Definition 2.21 (Conditionally Positive Definite Kernel) Let 𝒳 be a nonempty set. A function k : 𝒳 × 𝒳 → 𝕂 which for all m ≥ 2, x_1, …, x_m ∈ 𝒳 gives rise to a conditionally positive definite Gram matrix is called a conditionally positive definite (cpd) kernel.

Note that symmetry is also required in the complex case. Due to the additional constraint on the coefficients c_i, it does not follow automatically anymore, as it did in the case of complex positive definite matrices and kernels. In Chapter 4, we will revisit cpd kernels. There, we will actually introduce cpd kernels of different orders. The definition given in the current chapter covers the case of kernels which are cpd of order 1.
Connection PD — CPD

Proposition 2.22 (Constructing PD Kernels from CPD Kernels [42]) Let x₀ ∈ 𝒳, and let k be a symmetric kernel on 𝒳 × 𝒳. Then

k̃(x, x′) := ½ (k(x, x′) − k(x, x₀) − k(x₀, x′) + k(x₀, x₀))

is positive definite if and only if k is conditionally positive definite.

The proof follows directly from the definitions and can be found in [42]. This result does generalize (2.79): the negative squared distance kernel is indeed cpd, since Σ_i c_i = 0 implies

−Σ_{i,j} c_i c_j ‖x_i − x_j‖² = −Σ_i c_i Σ_j c_j ‖x_j‖² − Σ_j c_j Σ_i c_i ‖x_i‖² + 2 Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = 2 Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = 2 ‖Σ_i c_i x_i‖² ≥ 0.

In fact, this implies that all kernels of the form

k(x, x′) = −‖x − x′‖^β, 0 < β ≤ 2,    (2.81)

are cpd (they are not pd),¹⁰ by application of the following result (note that the case β = 0 is trivial):

10. Moreover, they are not cpd if β > 2 [42].
Proposition 2.23 (Fractional Powers and Logs of CPD Kernels [42]) If k : 𝒳 × 𝒳 → (−∞, 0] is cpd, then so are −(−k)^α (0 < α < 1) and −ln(1 − k).

To state another class of cpd kernels that are not pd, note first that as a trivial consequence of Definition 2.20, we know that (i) sums of cpd kernels are cpd, and (ii) any constant b ∈ ℝ is a cpd kernel. Therefore, any kernel of the form k + b, where k is cpd and b ∈ ℝ, is also cpd. In particular, since pd kernels are cpd, we can take any pd kernel and offset it by b, and it will still be at least cpd. For further examples of cpd kernels, cf. [42, 578, 205, 515].
2.4.2 Hilbert Space Representation of CPD Kernels

We now return to the main flow of the argument. Proposition 2.22 allows us to construct the feature map for k from that of the pd kernel k̃. To this end, fix x₀ ∈ 𝒳 and define k̃ according to Proposition 2.22. Due to Proposition 2.22, k̃ is positive definite. Therefore, we may employ the Hilbert space representation Φ : 𝒳 → ℋ of k̃ (cf. (2.32)), satisfying ⟨Φ(x), Φ(x′)⟩ = k̃(x, x′); hence,

‖Φ(x) − Φ(x′)‖² = k̃(x, x) + k̃(x′, x′) − 2k̃(x, x′).    (2.82)

Substituting Proposition 2.22 yields

‖Φ(x) − Φ(x′)‖² = −k(x, x′) + ½ (k(x, x) + k(x′, x′)).    (2.83)

This implies the following result [465, 42].
Feature Map for CPD Kernels

Proposition 2.24 (Hilbert Space Representation of CPD Kernels) Let k be a real-valued CPD kernel on 𝒳, satisfying k(x, x) = 0 for all x ∈ 𝒳. Then there exists a Hilbert space ℋ of real-valued functions on 𝒳, and a mapping Φ : 𝒳 → ℋ, such that

‖Φ(x) − Φ(x′)‖² = −k(x, x′).    (2.84)

If we drop the assumption k(x, x) = 0, the Hilbert space representation reads

‖Φ(x) − Φ(x′)‖² = −k(x, x′) + ½ (k(x, x) + k(x′, x′)).    (2.85)

It can be shown that if k(x, x) = 0 for all x ∈ 𝒳, then

d(x, x′) := √(−k(x, x′)) = ‖Φ(x) − Φ(x′)‖    (2.86)

is a semi-metric: clearly, it is nonnegative and symmetric; additionally, it satisfies the triangle inequality, as can be seen by computing d(x, x′) + d(x′, x″) = ‖Φ(x) − Φ(x′)‖ + ‖Φ(x′) − Φ(x″)‖ ≥ ‖Φ(x) − Φ(x″)‖ = d(x, x″) [42].

It is a metric if k(x, x′) ≠ 0 for x ≠ x′. We thus see that we can rightly think of k as the negative of a distance measure.
We next show how to represent general symmetric kernels (thus in particular cpd kernels) as symmetric bilinear forms Q in feature spaces. This generalization of the previously known feature space representation for pd kernels comes at a cost: Q will no longer be a dot product. For our purposes, we can get away with this. The result will give us an intuitive understanding of Proposition 2.22: we can then write k̃ as k̃(x, x′) := Q(Φ(x) − Φ(x₀), Φ(x′) − Φ(x₀)). Proposition 2.22 thus essentially adds an origin in feature space which corresponds to the image Φ(x₀) of one point x₀ under the feature map.

Feature Map for General Symmetric Kernels

Proposition 2.25 (Vector Space Representation of Symmetric Kernels) Let k be a real-valued symmetric kernel on 𝒳. Then there exists a linear space ℋ of real-valued functions on 𝒳, endowed with a symmetric bilinear form Q(·, ·), and a mapping Φ : 𝒳 → ℋ, such that k(x, x′) = Q(Φ(x), Φ(x′)).

Proof The proof is a direct modification of the pd case. We use the map (2.21) and linearly complete the image as in (2.22). Define Q(f, g) := Σ_i Σ_j α_i β_j k(x_i, x′_j) for f = Σ_i α_i k(·, x_i) and g = Σ_j β_j k(·, x′_j). To see that it is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that Q(f, g) = Σ_j β_j f(x′_j), independent of the α_i. Similarly, for g, note that Q(f, g) = Σ_i α_i g(x_i); hence it is independent of the β_j. The last two equations also show that Q is bilinear; clearly, it is symmetric.

Note, moreover, that by definition of Q, k is a reproducing kernel for the feature space (which is not a Hilbert space): for all functions f (2.22), we have Q(k(·, x), f) = f(x); in particular, Q(k(·, x), k(·, x′)) = k(x, x′).
Rewriting k̃ as k̃(x, x′) := Q(Φ(x) − Φ(x₀), Φ(x′) − Φ(x₀)) suggests an immediate generalization of Proposition 2.22: in practice, we might want to choose other points as origins in feature space — points that do not have a pre-image x₀ in the input domain, such as the mean of a set of points (cf. [543]). This will be useful when considering kernel PCA. It is only crucial that the behavior of our reference point under translation is identical to that of individual points. This is taken care of by the constraint on the sum of the c_i in the following proposition.

Matrix Centering

Proposition 2.26 (Exercise 2.23 in [42]) Let K be a symmetric m × m matrix, e ∈ ℂ^m the vector of all ones, 1 the m × m identity matrix, and let c ∈ ℂ^m satisfy e*c = 1. Then

K̃ := (1 − ec*)K(1 − ce*)    (2.87)

is positive definite if and only if K is conditionally positive definite.¹¹

11. c* is the vector obtained by transposing and taking the complex conjugate of c.

Proof “⇒”: suppose K̃ is positive definite. Thus for any a ∈ ℂ^m which satisfies a*e = e*a = 0, we have 0 ≤ a*K̃a = a*Ka + a*ec*Kce*a − a*Kce*a − a*ec*Ka = a*Ka. This means that 0 ≤ a*Ka, proving that K is conditionally positive definite.

“⇐”: suppose K is conditionally positive definite. This means that we have to show that a*K̃a ≥ 0 for all a ∈ ℂ^m. We have

a*K̃a = a*(1 − ec*)K(1 − ce*)a = s*Ks for s := (1 − ce*)a.    (2.88)

All we need to show is e*s = 0, since then we can use the fact that K is cpd to obtain s*Ks ≥ 0. This can be seen as follows: e*s = e*(1 − ce*)a = (e* − (e*c)e*)a = (e* − e*)a = 0.
||
|
||
Kernel Centering Proposition 2.27 (Adding a General Origin) Let k be a symmetric kernel, x 1 xm
|
||
|
||
, and let ci
|
||
|
||
satisfy ∑mi 1 ci 1. Then
|
||
|
||
∑ ∑ ∑ k˜(x x ) :
|
||
|
||
1 2
|
||
|
||
k(x x )
|
||
|
||
m
|
||
cik(x xi)
|
||
i1
|
||
|
||
m
|
||
cik(xi x )
|
||
i1
|
||
|
||
m
|
||
cicjk(xi x j)
|
||
ij 1
|
||
|
||
is positive definite if and only if k is conditionally positive definite.
|
||
|
||
Application to SVMs
|
||
Application to Kernel PCA
|
||
Application to Parzen Windows Classifiers
|
||
|
||
Proof Consider a set of m
|
||
|
||
points x1
|
||
|
||
xm
|
||
|
||
(m m ) Gram matrix based on x1 xm x1
|
||
|
||
using cm 1
|
||
|
||
cm m 0.
|
||
|
||
, and let K be the (m m ) xm . Apply Proposition 2.26
|
||
|
||
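Here is a minimal sketch of the centering operation (2.87), applied with c = e/m (the "mean" origin of Proposition 2.27) to the cpd kernel −‖x − x′‖² of (2.81). The data and dimensions are arbitrary; note that (2.87) omits the factor ½ appearing in Proposition 2.27, which only rescales the matrix and does not affect positive definiteness.

import numpy as np

def center_gram(K, c=None):
    """K_tilde = (1 - e c^T) K (1 - c e^T), cf. (2.87); default c = e/m."""
    m = K.shape[0]
    e = np.ones((m, 1))
    c = e / m if c is None else np.asarray(c, dtype=float).reshape(m, 1)
    P = np.eye(m) - e @ c.T
    return P @ K @ P.T

# The cpd kernel -||x - x'||^2 becomes pd after centering.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
neg_sq_dists = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K_tilde = center_gram(neg_sq_dists)
# >= 0 up to numerical round-off (may print as -0.0).
print("min eigenvalue after centering:", round(float(np.linalg.eigvalsh(K_tilde).min()), 8))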
Application to SVMs

The above results show that conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the SVM: maximization of the margin of separation between two classes of data is independent of the position of the origin. Seen in this light, it is not surprising that the structure of the dual optimization problem (cf. [561]) allows cpd kernels: as noted in [515, 507], the constraint Σ_{i=1}^m α_i y_i = 0 projects out the same subspace as (2.80) in the definition of conditionally positive definite matrices.

Application to Kernel PCA

Another example of a kernel algorithm that works with conditionally positive definite kernels is Kernel PCA (Chapter 14), where the data are centered, thus removing the dependence on the origin in feature space. Formally, this follows from Proposition 2.26 for c_i = 1/m.

Application to Parzen Windows Classifiers

Let us consider another example. One of the simplest distance-based classification algorithms proceeds as follows. Given m₊ points labelled with +1, m₋ points labelled with −1, and a mapped test point Φ(x), we compute the mean squared distances between the latter and the two classes, and assign it to the one for which this mean is smaller;

y = sgn( (1/m₋) Σ_{y_i=−1} ‖Φ(x) − Φ(x_i)‖² − (1/m₊) Σ_{y_i=+1} ‖Φ(x) − Φ(x_i)‖² ).    (2.89)
We use the distance kernel trick (Proposition 2.24) to express the decision function as a kernel expansion in the input domain: a short calculation shows that

y = sgn( (1/m₊) Σ_{y_i=+1} k(x, x_i) − (1/m₋) Σ_{y_i=−1} k(x, x_i) + b ),    (2.90)

with the constant offset

b = (1/(2m₋)) Σ_{y_i=−1} k(x_i, x_i) − (1/(2m₊)) Σ_{y_i=+1} k(x_i, x_i).    (2.91)

Note that for some cpd kernels, such as (2.81), k(x_i, x_i) is always 0, and thus b = 0. For others, such as the commonly used Gaussian kernel, k(x_i, x_i) is a nonzero constant, in which case b vanishes provided that m₊ = m₋. For normalized Gaussians, the resulting decision boundary can be interpreted as the Bayes decision based on two Parzen window density estimates of the classes; for general cpd kernels, the analogy is merely a formal one; that is, the decision functions take the same form.
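A compact sketch of the classifier (2.89)-(2.91) follows; the Gaussian kernel, the two-blob data, and all parameter values are illustrative assumptions rather than anything prescribed by the text.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def mean_distance_classifier(X, y_train, kernel):
    """Decision function (2.90)-(2.91): compare mean kernel similarity to each class."""
    pos, neg = X[y_train == 1], X[y_train == -1]
    b = (np.mean([kernel(x, x) for x in neg]) - np.mean([kernel(x, x) for x in pos])) / 2.0
    def predict(x):
        s_pos = np.mean([kernel(xi, x) for xi in pos])
        s_neg = np.mean([kernel(xi, x) for xi in neg])
        return int(np.sign(s_pos - s_neg + b))
    return predict

# Toy problem: two Gaussian blobs (entirely made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(30, 2)), rng.normal(-1.5, 1.0, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)
predict = mean_distance_classifier(X, y, gaussian_kernel)
print("training accuracy:", np.mean([predict(x) == t for x, t in zip(X, y)]))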
Properties of CPD Kernels

Many properties of positive definite kernels carry over to the more general case of conditionally positive definite kernels, such as Proposition 13.1.

Using Proposition 2.22, one can prove an interesting connection between the two classes of kernels:

Proposition 2.28 (Connection PD — CPD [465]) A kernel k is conditionally positive definite if and only if exp(tk) is positive definite for all t > 0.

Positive definite kernels of the form exp(tk) (t > 0) have the interesting property that their n-th root (n ∈ ℕ) is again a positive definite kernel. Such kernels are called infinitely divisible. One can show that, disregarding some technicalities, the logarithm of an infinitely divisible positive definite kernel mapping into ℝ₊ \ {0} is a conditionally positive definite kernel.
2.4.3 Higher Order CPD Kernels

For the sake of completeness, we now present some material which is of interest to one section later in the book (Section 4.8), but not central for the present chapter. We follow [341, 204].

Definition 2.29 (Conditionally Positive Definite Functions of Order q) A continuous function h, defined on [0, ∞), is called conditionally positive definite (cpd) of order q on ℝ^N if for any distinct points x_1, …, x_m ∈ ℝ^N, the quadratic form

Σ_{i,j=1}^m α_i α_j h(‖x_i − x_j‖²)    (2.92)

is nonnegative, provided that the scalars α_1, …, α_m satisfy Σ_{i=1}^m α_i p(x_i) = 0 for all polynomials p(·) on ℝ^N of degree lower than q.

Let Π_q^N denote the space of polynomials of degree lower than q on ℝ^N. By definition, every cpd function h of order q generates a positive definite kernel for SV expansions in the space of functions orthogonal to Π_q^N, by setting k(x, x′) := h(‖x − x′‖²).

There exists also an analogue to the positive definiteness of the integral operator in the conditions of Mercer's theorem. In [157, 341] it is shown that for cpd functions h of order q, we have

∫∫ h(‖x − x′‖²) f(x) f(x′) dx dx′ ≥ 0,    (2.93)

provided that the projection of f onto Π_q^N is zero.
[Figure 2.4: four panels plotting k(‖x − x′‖) against ‖x − x′‖ for the Gaussian, Inverse Multiquadric, Multiquadric, and Thin Plate Spline (n = 1) kernels.]

Figure 2.4 Conditionally positive definite functions, as described in Table 2.1. Where applicable, we set the free parameter c to 1; β is set to 2. Note that cpd kernels need not be positive anywhere (e.g., the Multiquadric kernel).
Table 2.1 Examples of Conditionally Positive Definite Kernels. The fact that the exponential kernel is pd (i.e., cpd of order 0) follows from (2.81) and Proposition 2.28.

Kernel                                Name                     Order
exp(−c ‖x − x′‖^β), 0 < β ≤ 2         Exponential              0
1 / √(‖x − x′‖² + c²)                 Inverse Multiquadric     0
√(‖x − x′‖² + c²)                     Multiquadric             1
‖x − x′‖^{2n} ln ‖x − x′‖             Thin Plate Spline        n
Definition 2.30 (Completely Monotonic Functions) A function h(x) is called completely monotonic of order q if

(−1)^n (dⁿ/dxⁿ) h(x) ≥ 0 for all x ∈ [0, ∞) and n ≥ q.    (2.94)

It can be shown [464, 465, 360] that a function h(x²) is conditionally positive definite if and only if h(x) is completely monotonic of the same order. This gives a (sometimes simpler) criterion for checking whether a function is cpd or not.

If we use cpd kernels in learning algorithms, we must ensure orthogonality of the estimate with respect to Π_q^N. This is usually done via constraints Σ_{i=1}^m α_i p(x_i) = 0 for all polynomials p(·) on ℝ^N of degree lower than q (see Section 4.8).
2.5 Summary
The crucial ingredient of SVMs and other kernel methods is the so-called kernel trick (see (2.7) and Remark 2.8), which permits the computation of dot products in high-dimensional feature spaces, using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example. The mathematical result underlying the kernel trick is almost a century old [359]. Nevertheless, it was only much later that it was exploited by the machine learning community for the analysis [4] and construction of algorithms [62], and that it was described as a general method for constructing nonlinear generalizations of dot product algorithms [480].

The present chapter has reviewed the mathematical theory of kernels. We started with the class of polynomial kernels, which can be motivated as computing a combinatorially large number of monomial features rather efficiently. This led to the general question of which kernels can be used; or: which kernels can be represented as dot products in a linear feature space. We defined this class and discussed some of its properties. We described several ways in which, given such a kernel, one can construct a representation in a feature space. The best-known representation employs Mercer's theorem, and represents the feature space as an ℓ₂ space defined in terms of the eigenfunctions of an integral operator associated with the kernel. An alternative representation uses elements of the theory of reproducing kernel Hilbert spaces, and yields additional insights, representing the linear space as a space of functions written as kernel expansions. We gave an in-depth discussion of the kernel trick in its general form, including the case where we are interested in dissimilarities rather than similarities; that is, when we want to come up with nonlinear generalizations of distance-based algorithms rather than dot-product-based algorithms.

In both cases, the underlying philosophy is the same: we are trying to express a complex nonlinear algorithm in terms of simple geometrical concepts, and we are then dealing with it in a linear space. This linear space may not always be readily available; in some cases, it may even be hard to construct explicitly. Nevertheless, for the sake of design and analysis of the algorithms, it is sufficient to know that the linear space exists, empowering us to use the full potential of geometry, linear algebra and functional analysis.
2.6 Problems

2.1 (Monomial Features in ℝ^2) Verify the second equality in (2.9).
|
||
2.2 (Multiplicity of Monomial Features in ℝ^N [515]) Consider the monomial kernel k(x, x') = ⟨x, x'⟩^d (where x, x' ∈ ℝ^N), generating monomial features of order d. Prove that a valid feature map for this kernel can be defined coordinate-wise as

\Phi_m(x) = \sqrt{\frac{d!}{\prod_{i=1}^{N} [m]_i!}} \; \prod_{i=1}^{N} [x]_i^{[m]_i}    (2.95)

for every m ∈ ℕ^N with ∑_{i=1}^N [m]_i = d (i.e., every such m corresponds to one dimension of the feature space).
|
||
|
||
2.3 (Inhomogeneous Polynomial Kernel ) Prove that the kernel (2.70) induces a feature map into the space of all monomials up to degree d. Discuss the role of c.
|
||
|
||
2.4 (Eigenvalue Criterion of Positive Definiteness ) Prove that a symmetric matrix is positive definite if and only if all its eigenvalues are nonnegative (see Appendix B).
|
||
|
||
2.5 (Dot Products are Kernels ) Prove that dot products (Definition B.7) are positive definite kernels.
|
||
|
||
2.6 (Kernels on Finite Domains) Prove that for finite 𝒳, say 𝒳 = {x_1, …, x_m}, k is a kernel if and only if the m × m matrix (k(x_i, x_j))_{ij} is positive definite.
|
||
|
||
2.7 (Positivity on the Diagonal ) From Definition 2.5, prove that a kernel satisfies k(x x) 0 for all x .
|
||
|
||
2.8 (Cauchy-Schwarz for Kernels ) Give an elementary proof of Proposition 2.7. Hint: start with the general form of a symmetric 2 2 matrix, and derive conditions for
|
||
its coefficients that ensure that it is positive definite.
|
||
|
||
2.9 (PD Kernels Vanishing on the Diagonal) Use Proposition 2.7 to prove that a kernel satisfying k(x, x) = 0 for all x ∈ 𝒳 is identically zero.
|
||
How does the RKHS look in this case? Hint: use (2.31).
|
||
|
||
2.10 (Two Kinds of Positivity ) Give an example of a kernel which is positive definite according to Definition 2.5, but not positive in the sense that k(x x ) 0 for all x x .
|
||
Give an example of a kernel where the contrary is the case.
|
||
|
||
2.11 (General Coordinate Transformations ) Prove that if : and k(x x ) is a kernel, then k( (x) (x )) is a kernel, too.
|
||
|
||
is a bijection,
|
||
|
||
2.12 (Positivity on the Diagonal ) Prove that positive definite kernels are positive on the diagonal, k(x x) 0 for all x . Hint: use m 1 in (2.15).
|
||
|
||
2.13 (Symmetry of Complex Kernels ) Prove that complex-valued positive definite kernels are symmetric (2.18).
|
||
|
||
2.14 (Real Kernels vs. Complex Kernels ) Prove that a real matrix satisfies (2.15) for all ci if and only if it is symmetric and it satisfies (2.15) for real coefficients ci.
|
||
Hint: decompose each ci in (2.15) into real and imaginary parts.
|
||
|
||
|
||
|
||
2.15 (Rank-One Kernels ) Prove that if f is a real-valued function on , then k(x x ) : f (x) f (x ) is a positive definite kernel.
|
||
|
||
2.16 (Bayes Kernel ) Consider a binary pattern recognition problem. Specialize the
|
||
|
||
last problem to the case where f :
|
||
|
||
1 equals the Bayes decision function y(x),
|
||
|
||
i.e., the classification with minimal risk subject to an underlying distribution P(x y)
|
||
|
||
generating the data.
|
||
|
||
Argue that this kernel is particularly suitable since it renders the problem linearly
|
||
|
||
separable in a 1D feature space: State a decision function (cf. (1.35)) that solves the problem
|
||
|
||
(hint: you just need one parameter , and you may set it to 1; moreover, use b 0) [124].
|
||
|
||
The final part of the problem requires knowledge of Chapter 16: Consider now the
|
||
|
||
situation where some prior P( f ) over the target function class is given. What would the
|
||
|
||
optimal kernel be in this case? Discuss the connection to Gaussian processes.
|
||
|
||
2.17 (Inhomogeneous Polynomials ) Prove that the inhomogeneous polynomial (2.70) is a positive definite kernel, e.g., by showing that it is a linear combination of homogeneous polynomial kernels with positive coefficients. What kind of features does this kernel compute [561]?
|
||
|
||
2.18 (Normalization in Feature Space) Given a kernel k, construct a corresponding normalized kernel k̃ by normalizing the feature map Φ̃ such that for all x ∈ 𝒳, ||Φ̃(x)|| = 1 (cf. also Definition 12.35). Discuss the relationship between normalization in input space and normalization in feature space for Gaussian kernels and homogeneous polynomial kernels.
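A possible starting point is the normalization k̃(x, x') = k(x, x') / sqrt(k(x, x) k(x', x')), which rescales Φ(x) to unit length. The short sketch below (our own illustration, assuming NumPy) shows that for homogeneous polynomial kernels this coincides with normalizing the inputs first:

```python
import numpy as np

def normalized_kernel(k, x, xp):
    """Normalize k so that the feature map has unit norm: k~(x, x) = 1."""
    return k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

# For a homogeneous polynomial kernel, normalization in feature space
# coincides with normalizing the input vectors to unit length beforehand.
poly = lambda x, xp, d=3: (x @ xp) ** d
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(normalized_kernel(poly, x, xp))
print(poly(x / np.linalg.norm(x), xp / np.linalg.norm(xp)))  # same value
```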
|
||
|
||
2.19 (Cosine Kernel ) Suppose is a dot product space, and x x . Prove that k(x x ) cos( (x x)) is a positive definite kernel. Hint: use Problem 2.18.
|
||
|
||
2.20 (Alignment Kernel ) Let K K F : ∑ij KijKij be the Frobenius dot product of two matrices. Prove that the empirical alignment of two Gram matrices [124],
|
||
A(K K ) : K K F K K F K K F, is a positive definite kernel. Note that the alignment can be used for model selection, putting Kij : yiy j (cf.
|
||
Problem 2.16) and Ki j : sgn (k(xi x j)) or Ki j : sgn (k(xi x j)) b (cf. [124]).
|
||
|
||
2.21 (Equivalence Relations as Kernels) Consider a similarity measure k : 𝒳 × 𝒳 → {0, 1} with

k(x, x) = 1 for all x ∈ 𝒳.    (2.96)

Prove that k is a positive definite kernel if and only if, for all x, x', x'' ∈ 𝒳,

k(x, x') = 1 \implies k(x', x) = 1, and    (2.97)

k(x, x') = 1 \text{ and } k(x', x'') = 1 \implies k(x, x'') = 1.    (2.98)

Equations (2.96) to (2.98) amount to saying that k = I_T, where T ⊆ 𝒳 × 𝒳 is an equivalence relation.
|
||
|
||
|
||
|
||
As a simple example, consider an undirected graph, and let (x x ) T whenever x and x are in the same connected component of the graph. Show that T is an equivalence relation.
|
||
Find examples of equivalence relations that lend themselves to an interpretation as similarity measures. Discuss whether there are other relations that one might want to use as similarity measures.
|
||
|
||
2.22 (Different Feature Spaces for the Same Kernel ) Give an example of a kernel with two valid feature maps Φ1 Φ2, mapping into spaces 1 2 of different dimensions.
|
||
|
||
2.23 (Converse of Mercer's Theorem) Prove that if an integral operator kernel k admits a uniformly convergent dot product representation on some compact set 𝒳 × 𝒳,

k(x, x') = \sum_{i=1}^{\infty} \psi_i(x) \psi_i(x'),    (2.99)

then it is positive definite. Hint: show that

\int\!\!\int \sum_{i=1}^{\infty} \psi_i(x) \psi_i(x') f(x) f(x') \, dx \, dx' = \sum_{i=1}^{\infty} \left( \int \psi_i(x) f(x) \, dx \right)^2 \ge 0.

Argue that in particular, polynomial kernels (2.67) satisfy Mercer's conditions.
|
||
|
||
2.24 ( -Norm of Mercer Eigenfunctions ) Prove that under the conditions of Theorem 2.10, we have, up to sets of measure zero,
|
||
|
||
sup
|
||
|
||
jj
|
||
|
||
k
|
||
|
||
j
|
||
|
||
(2.100)
|
||
|
||
Hint: note that k
|
||
|
||
k(x x) up to sets of measures zero, and use the series expansion
|
||
|
||
given in Theorem 2.10. Show, moreover, that it is not generally the case that
|
||
|
||
sup j
|
||
j
|
||
|
||
(2.101)
|
||
|
||
Hint: consider the case where
|
||
|
||
, ( n ) : 2 n, and k(i j) : ij. Show that
|
||
|
||
1. Tk((a j)) (a j2 j) for (a j) L2( ),
|
||
|
||
2. Tk satisfies (a j) Tk(a j) ∑j(a j2 j)2 0 and is thus positive definite,
|
||
|
||
3. j 2 j and j 2j 2ej form an orthonormal eigenvector decomposition of Tk (here,
|
||
|
||
ej is the jth canonical unit vector in 2), and
|
||
|
||
4. j
|
||
|
||
2j 2
|
||
|
||
j 1 2.
|
||
|
||
Argue that the last statement shows that (2.101) is wrong and (2.100) is tight. 12
|
||
|
||
2.25 (Generalized Feature Maps) Via (2.38), Mercer kernels induce compact (integral) operators. Can you generalize the idea of defining a feature map associated with an operator to more general bounded positive definite operators T? Hint: use the multiplication operator representation of T [467].

12. Thanks to S. Smale and I. Steinwart for this exercise.
|
||
|
||
2.26 (Nyström Approximation (cf. [603])) Consider the integral operator obtained by substituting the distribution P underlying the data into (2.38), i.e.,

(T_k f)(x') = \int k(x, x') f(x) \, dP(x).    (2.102)

If the conditions of Mercer's theorem are satisfied, then k can be diagonalized as

k(x, x') = \sum_{j=1}^{N} \lambda_j \psi_j(x) \psi_j(x'),    (2.103)

where the \lambda_j and \psi_j satisfy the eigenvalue equation

\int k(x, x') \psi_j(x) \, dP(x) = \lambda_j \psi_j(x')    (2.104)

and the orthonormality conditions

\int \psi_i(x) \psi_j(x) \, dP(x) = \delta_{ij}.    (2.105)

Show that by replacing the integral by a summation over an iid sample X = {x_1, …, x_m} from P(x), one can recover the kernel PCA eigenvalue problem (Section 1.7). Hint: Start by evaluating (2.104) for x' ∈ X, to obtain m equations. Next, approximate the integral by a sum over the points in X, replacing \int k(x, x') \psi_j(x) \, dP(x) by \frac{1}{m} \sum_{n=1}^{m} k(x_n, x') \psi_j(x_n). Derive the orthogonality condition for the eigenvectors (\psi_j(x_n))_{n=1,…,m} from (2.105).
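For orientation, the empirical counterpart of (2.104)–(2.105) can be written down directly: replacing dP(x) by the sample average turns the integral equation into an eigenvalue problem for the Gram matrix. The sketch below is our own illustration (Gaussian kernel and NumPy assumed), not the book's code:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                       # iid sample from P
gamma = 0.5
sq = np.sum((X[:, None] - X[None, :]) ** 2, -1)
K = np.exp(-gamma * sq)                             # Gram matrix k(x_n, x_i)

# Empirical version of (2.104): (1/m) K v_j = lambda_j v_j, where v_j
# collects the sample values of the j-th eigenfunction (up to scaling).
m = len(X)
eigvals, eigvecs = np.linalg.eigh(K / m)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Empirical counterpart of (2.105): (1/m) sum_n psi_i(x_n) psi_j(x_n) = delta_ij,
# obtained by rescaling the unit-norm eigenvectors by sqrt(m).
psi = np.sqrt(m) * eigvecs
print(np.allclose(psi.T @ psi / m, np.eye(m), atol=1e-8))
# This is the eigenvalue problem underlying kernel PCA, up to the centering
# of the data in feature space discussed in Section 1.7.
```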
|
||
|
||
2.27 (Lorentzian Feature Spaces ) If a finite number of eigenvalues is negative, the expansion in Theorem 2.10 is still valid. Show that in this case, k corresponds to a Lorentzian symmetric bilinear form in a space with indefinite signature [467].
|
||
Discuss whether this causes problems for learning algorithms utilizing these kernels. In particular, consider the cases of SV machines (Chapter 7) and Kernel PCA (Chapter 14).
|
||
|
||
2.28 (Symmetry of Reproducing Kernels ) Show that reproducing kernels (Definition 2.9) are symmetric. Hint: use (2.35) and exploit the symmetry of the dot product.
|
||
|
||
2.29 (Coordinate Representation in the RKHS ) Write
|
||
|
||
as a dot product of
|
||
|
||
coordinate vectors by expressing the functions of the RKHS in the basis ( n n)n 1 N ,
|
||
|
||
which is orthonormal with respect to , i.e.,
|
||
|
||
N
|
||
f (x) ∑ n n1
|
||
|
||
n n(x)
|
||
|
||
(2.106)
|
||
|
||
Obtain an expression for the coordinates n, using (2.47) and n f n n Show that has the structure of a RKHS in the sense that for f and g given by (2.106), and
|
||
|
||
N
|
||
g(x) ∑ j j1
|
||
|
||
j j(x)
|
||
|
||
(2.107)
|
||
|
||
60
|
||
|
||
Kernels
|
||
|
||
we have
|
||
|
||
f g Show, moreover, that f (x)
|
||
|
||
Φ(x) in . In other words,
|
||
|
||
Φ(x) is the coordinate representation of the kernel as a function of one argument.
|
||
|
||
2.30 (Equivalence of Regularization Terms ) Using (2.36) and (2.41), prove that w 2, where w ∑mi 1 iΦ(xi), is the same no matter whether Φ denotes the RKHS fea-
|
||
ture map (2.21) or the Mercer feature map (2.40).
|
||
|
||
2.31 (Approximate Inversion of Gram Matrices ) Use the kernel PCA map (2.59) to derive a method for approximately inverting a large Gram matrix.
|
||
|
||
2.32 (Effective Dimension of Feature Space ) Building on Section 2.2.7, argue that for a finite data set, we are always effectively working in a finite-dimensional feature space.
|
||
|
||
2.33 (Translation of a Dot Product ) Prove (2.79).
|
||
|
||
2.34 (Example of a CPD Kernel ) Argue that the hyperbolic tangent kernel (2.69) is effectively conditionally positive definite, if the input values are suitably restricted, since it can be approximated by k b, where k is a polynomial kernel (2.67) and b . Discuss how this explains that hyperbolic tangent kernels can be used for SVMs although, as pointed out in number of works (e.g., [86], cf. the remark following (2.69)), they are not positive definite.
|
||
|
||
2.35 (Polarization Identity) Prove the polarization identity, stating that for any symmetric bilinear form ⟨·, ·⟩ : 𝒳 × 𝒳 → ℝ, we have, for all x, x' ∈ 𝒳,

⟨x, x'⟩ = \frac{1}{4} \big( ⟨x + x', x + x'⟩ − ⟨x − x', x − x'⟩ \big).    (2.108)

Now consider the special case where ⟨·, ·⟩ is a Euclidean dot product and ⟨x − x', x − x'⟩ is the squared Euclidean distance between x and x'. Discuss why the polarization identity does not imply that the value of the dot product can be recovered from the distances alone. What else does one need?
|
||
|
||
2.36 (Vector Space Representation of CPD Kernels ) Specialize the vector space representation of symmetric kernels (Proposition 2.25) to the case of cpd kernels. Can you identify a subspace on which a cpd kernel is actually pd?
|
||
|
||
2.37 (Parzen Windows Classifiers in Feature Space ) Assume that k is a positive definite kernel. Compare the algorithm described in Section 1.2 with the one of (2.89). Construct situations where the two algorithms give different results. Hint: consider datasets where the class means coincide.
|
||
|
||
2.38 (Canonical Distortion Kernel ) Can you define a kernel based on Baxter’s canonical distortion metric [28]?
|
||
|
||
3
|
||
|
||
Risk and Loss Functions
|
||
|
||
Overview
|
||
|
||
One of the most immediate requirements in any learning problem is to specify what exactly we would like to achieve, minimize, bound, or approximate. In other words, we need to determine a criterion according to which we will assess the quality of an estimate f : 𝒳 → 𝒴 obtained from data.
This question is far from trivial. Even in binary classification there exist ample choices. The selection criterion may be the fraction of patterns classified correctly, it could involve the confidence with which the classification is carried out, or it might take into account the fact that losses are not symmetric for the two classes, such as in health diagnosis problems. Furthermore, the loss for an error may be input-dependent (for instance, meteorological predictions may require a higher accuracy in urban regions), and finally, we might want to obtain probabilities rather than a binary prediction of the class labels 1 and −1. Multi-class discrimination and regression add even further levels of complexity to the problem. Thus we need a means of encoding these criteria.
The chapter is structured as follows: in Section 3.1, we begin with a brief overview of common loss functions used in classification and regression algorithms. This is done without much mathematical rigor or statistical justification, in order to provide basic working knowledge for readers who want to get a quick idea of the default design choices in the area of kernel machines. Following this, Section 3.2 formalizes the idea of risk. The risk approach is the predominant technique used in this book, and most of the algorithms presented subsequently minimize some form of a risk functional. Section 3.3 treats the concept of loss functions from a statistical perspective, points out the connection to the estimation of densities, and introduces the notion of efficiency. Readers interested in more detail should also consider Chapter 16, which discusses the problem of estimation from a Bayesian perspective. The later parts of this section are intended for readers interested in the more theoretical details of estimation. The concept of robustness is introduced in Section 3.4. Several commonly used loss functions, such as Huber's loss and the ε-insensitive loss, enjoy robustness properties with respect to rather general classes of distributions. Beyond the basic relations, we will show how to adjust the ε-insensitive loss in such a way as to accommodate different amounts of variance automatically. This will later lead to the construction of so-called Support Vector Algorithms (see Chapters 7, 8, and 9).
While technical details and proofs can be omitted for most of the present chapter, we encourage the reader to review the practical implications of this section.
Prerequisites (roadmap): 3.1 Loss Functions (3.1.1 Classification, 3.1.2 Regression); 3.2 Empirical Risk Functional; 3.3.1 Maximum Likelihood; 3.3.2 Efficiency; 3.4 Robustness; 3.4.2 ε-insensitive loss; 3.4.3, 3.4.4 Adaptive Loss Functions & ν.
|
||
|
||
As usual, exercises for all sections can be found at the end. The chapter requires knowledge of probability theory, as introduced in Section B.1.
|
||
|
||
3.1 Loss Functions
|
||
|
||
Minimized Loss Incurred Loss
|
||
|
||
Let us begin with a formal definition of what we mean by the loss incurred by a function f at location x, given an observation y.
|
||
|
||
Definition 3.1 (Loss Function) Denote by (x, y, f(x)) ∈ 𝒳 × 𝒴 × 𝒴 the triplet consisting of a pattern x, an observation y and a prediction f(x). Then the map c : 𝒳 × 𝒴 × 𝒴 → [0, ∞) with the property c(x, y, y) = 0 for all x ∈ 𝒳 and y ∈ 𝒴 will be called a loss function.
|
||
|
||
Note that we require c to be a nonnegative function. This means that we will never get a payoff from an extra good prediction. If the latter was the case, we could always recover non-negativity (provided the loss is bounded from below), by using a simple shift operation (possibly depending on x). Likewise we can always satisfy the condition that exact predictions ( f (x) y) never cause any loss. The advantage of these extra conditions on c is that we know that the minimum of the loss is 0 and that it is obtainable, at least for a given x y.
|
||
Next we will formalize different kinds of loss, as described informally in the introduction of the chapter. Note that the incurred loss is not always the quantity that we will attempt to minimize. For instance, for algorithmic reasons, some loss functions will prove to be infeasible (the binary loss, for instance, can lead to NPhard optimization problems [367]). Furthermore, statistical considerations such as the desire to obtain confidence levels on the prediction (Section 3.3.1) will also influence our choice.
|
||
|
||
3.1.1 Binary Classification
|
||
|
||
Misclassification Error
|
||
|
||
The simplest case to consider involves counting the misclassification error: if pattern x is classified wrongly we incur loss 1, otherwise there is no penalty:

c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise.} \end{cases}    (3.1)
|
||
|
||
|
||
|
||
Asymmetric and Input-Dependent Loss
|
||
Confidence Level Soft Margin Loss
|
||
|
||
This definition of c does not distinguish between different classes and types of errors (false positive or negative).1
|
||
A slight extension takes the latter into account. For the sake of simplicity let us assume, as in (3.1), that we have a binary classification problem. This time, however, the loss may depend on a function c̃(x) which accounts for input-dependence, i.e.

c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ \tilde{c}(x) & \text{otherwise.} \end{cases}    (3.2)
|
||
|
||
A simple (albeit slightly contrived) example is the classification of objects into rocks and diamonds. Clearly, the incurred loss will depend largely on the weight of the object under consideration.
|
||
Analogously, we might distinguish between errors for y 1 and y 1 (see, e.g., [331] for details). For instance, in a fraud detection application, we would like to be really sure about the situation before taking any measures, rather than losing potential customers. On the other hand, a blood bank should consider even the slightest suspicion of disease before accepting a donor.
|
||
Rather than predicting only whether a given object x belongs to a certain class y, we may also want to take a certain confidence level into account. In this case, f (x) becomes a real-valued function, even though y 1 1 .
|
||
In this case, sgn(f(x)) denotes the class label, and the absolute value |f(x)| the confidence of the prediction. Corresponding loss functions will depend on the product y f(x) to assess the quality of the estimate. The soft margin loss function, as introduced by Bennett and Mangasarian [40, 111], is defined as

c(x, y, f(x)) = \max(0, 1 − y f(x)) = \begin{cases} 0 & \text{if } y f(x) \ge 1 \\ 1 − y f(x) & \text{otherwise.} \end{cases}    (3.3)
|
||
|
||
In some cases [348, 125] (see also Section 10.6.2) the squared version of (3.3) provides an expression that can be minimized more easily:

c(x, y, f(x)) = \max(0, 1 − y f(x))^2.    (3.4)
|
||
|
||
Logistic Loss
|
||
|
||
The soft margin loss closely resembles the so-called logistic loss function (cf. [251], as well as Problem 3.1 and Section 16.1.1):

c(x, y, f(x)) = \ln\big(1 + \exp(−y f(x))\big).    (3.5)
|
||
|
||
We will derive this loss function in Section 3.3.1. It is used in order to associate a probabilistic meaning with f (x).
|
||
Note that in both (3.3) and (3.5) (nearly) no penalty occurs if y f (x) is sufficiently large, i.e. if the patterns are classified correctly with large confidence. In particular, in (3.3) a minimum confidence of 1 is required for zero loss. These loss functions
|
||
|
||
1. A false positive is a point which the classifier erroneously assigns to class 1; a false negative is a point erroneously assigned to class −1.
|
||
|
||
|
||
|
||
Multi Class Discrimination
|
||
|
||
Figure 3.1 From left to right: 0-1 loss, linear soft margin loss, logistic regression, and quadratic soft margin loss, each plotted as a function of y f(x). Note that both soft margin loss functions are upper bounds on the 0-1 loss.
|
||
|
||
led to the development of large margin classifiers (see [491, 460, 504] and Chapter 5 for further details). Figure 3.1 depicts various popular loss functions.2
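To make Figure 3.1 concrete, the margin losses can be written as functions of t = y f(x); the following sketch (our own, NumPy assumed) also verifies the upper-bound property noted in the caption:

```python
import numpy as np

def zero_one(t):     return (t <= 0).astype(float)       # 0-1 loss (3.1) via the margin
def soft_margin(t):  return np.maximum(0.0, 1.0 - t)     # linear soft margin (3.3)
def soft_margin2(t): return np.maximum(0.0, 1.0 - t)**2  # quadratic soft margin (3.4)
def logistic(t):     return np.log(1.0 + np.exp(-t))     # logistic loss (3.5)

t = np.linspace(-5, 5, 201)
assert np.all(soft_margin(t) >= zero_one(t))   # upper bound on the 0-1 loss
assert np.all(soft_margin2(t) >= zero_one(t))  # likewise for the squared version
# The logistic loss is not an upper bound (e.g. its value at t = 0 is ln 2).
```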
|
||
Matters are more complex when dealing with more than two classes. Each type of misclassification could potentially incur a different loss, leading to an M M matrix (M being the number of classes) with positive off-diagonal and zero diagonal entries. It is still a matter of ongoing research in which way a confidence level should be included in such cases (cf. [41, 311, 593, 161, 119]).
|
||
|
||
3.1.2 Regression
|
||
|
||
When estimating real-valued quantities, it is usually the size of the difference y − f(x), i.e. the amount of misprediction, rather than the product y f(x), which is used to determine the quality of the estimate. For instance, this can be the actual loss incurred by mispredictions (e.g., the loss incurred by mispredicting the value of a financial instrument at the stock exchange), provided the latter is known and computationally tractable.3 Assuming location independence, in most cases the loss function will be of the type

c(x, y, f(x)) = \tilde{c}(f(x) − y).    (3.7)
|
||
|
||
See Figure 3.2 below for several regression loss functions. Below we list the ones most common in kernel methods.
|
||
|
||
2. Other popular loss functions from the generalized linear model context include the inverse complementary log-log function. It is given by

c(x, y, f(x)) = 1 − \exp(−\exp(y f(x))).    (3.6)
|
||
|
||
This function, unfortunately, is not convex and therefore it will not lead to a convex optimization problem. However, it has nice robustness properties and therefore we think that it should be investigated in the present context. 3. As with classification, computational tractability is one of the primary concerns. This is not always satisfying from a statistician’s point of view, yet it is crucial for any practical implementation of an estimation algorithm.
|
||
|
||
|
||
|
||
Squared Loss
|
||
-insensitive Loss and 1 Loss
|
||
Practical Considerations
|
||
|
||
The popular choice is to minimize the sum of squares of the residuals f(x) − y. As we shall see in Section 3.3.1, this corresponds to the assumption that we have additive normal noise corrupting the observations y_i. Consequently we minimize

c(x, y, f(x)) = (f(x) − y)^2, \quad \text{or equivalently} \quad \tilde{c}(\xi) = \xi^2.    (3.8)

For convenience of subsequent notation, \frac{1}{2}\xi^2 rather than \xi^2 is often used.
|
||
|
||
An extension of the soft margin loss (3.3) to regression is the ε-insensitive loss function [561, 572, 562]. It is obtained by symmetrization of the “hinge” of (3.3),

\tilde{c}(\xi) = \max(|\xi| − ε, 0) =: |\xi|_ε.    (3.9)

The idea behind (3.9) is that deviations up to ε should not be penalized, and all further deviations should incur only a linear penalty. Setting ε = 0 leads to an ℓ_1 loss, i.e., to minimization of the sum of absolute deviations. This is written

\tilde{c}(\xi) = |\xi|.    (3.10)
|
||
|
||
We will study these functions in more detail in Section 3.4.2.
For efficient implementations of learning procedures, it is crucial that loss functions satisfy certain properties. In particular, they should be cheap to compute, have a small number of discontinuities (if any) in the first derivative, and be convex in order to ensure the uniqueness of the solution (see Chapter 6 and also Problem 3.6 for details). Moreover, we may want to obtain solutions that are computationally efficient, which may disregard a certain number of training points. This leads to conditions such as vanishing derivatives for a range of function values f(x). Finally, requirements such as outlier resistance are also important for the construction of estimators.
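A minimal sketch of the regression losses (3.8)–(3.10) (our own illustration, assuming NumPy); note the flat region of the ε-insensitive loss, which is what produces the vanishing derivatives mentioned above:

```python
import numpy as np

def squared(xi):              return xi ** 2                           # (3.8)
def eps_insensitive(xi, eps): return np.maximum(np.abs(xi) - eps, 0.0) # (3.9)
def absolute(xi):             return np.abs(xi)                        # (3.10), eps = 0

xi = np.linspace(-3, 3, 7)
print(eps_insensitive(xi, eps=1.0))                          # zero inside the tube |xi| <= 1
print(np.allclose(eps_insensitive(xi, 0.0), absolute(xi)))   # True: eps = 0 gives the l1 loss
```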
|
||
|
||
3.2 Test Error and Expected Risk
|
||
|
||
Now that we have determined how errors should be penalized on specific in-
|
||
|
||
stances (x y f (x)), we have to find a method to combine these (local) penalties.
|
||
|
||
This will help us to assess a particular estimate f .
|
||
|
||
In the following, we will assume that there exists a probability distribution
|
||
|
||
P(x y) on
|
||
|
||
which governs the data generation and underlying functional
|
||
|
||
dependency. Moreover, we denote by P(y x) the conditional distribution of y given
|
||
|
||
x, and by dP(x y) and dP(y x) the integrals with respect to the distributions P(x y)
|
||
|
||
and P(y x) respectively (cf. Section B.1.3).
|
||
|
||
3.2.1 Exact Quantities
|
||
|
||
Unless stated otherwise, we assume that the data (x y) are drawn iid (independent and identically distributed, see Section B.1) from P(x y). Whether or not we have
|
||
|
||
Transduction Problem
Empirical Density
|
||
|
||
knowledge of the test patterns at training time4 makes a significant difference in the design of learning algorithms. In the latter case, we will want to minimize the test error on that specific test set; in the former case, the expected error over all possible test sets.
|
||
|
||
Definition 3.2 (Test Error) Assume that we are not only given the training data x_1, …, x_m along with target values y_1, …, y_m, but also the test patterns x'_1, …, x'_{m'} on which we would like to predict y'_i (i = 1, …, m'). Since we already know the x'_i, all we should care about is to minimize the expected error on the test set. We formalize this in the following definition:

R_{test}[f] := \frac{1}{m'} \sum_{i=1}^{m'} \int c(x'_i, y, f(x'_i)) \, dP(y|x'_i).    (3.11)
|
||
|
||
Unfortunately, this problem, referred to as transduction, is quite difficult to address, both computationally and conceptually, see [562, 267, 37, 211]. Instead, one typically considers the case where no knowledge about test patterns is available, as described in the following definition.
|
||
|
||
Definition 3.3 (Expected Risk) If we have no knowledge about the test patterns (or decide to ignore them) we should minimize the expected error over all possible training patterns. Hence we have to minimize the expected loss with respect to P and c
|
||
|
||
R[f] := E\big[R_{test}[f]\big] = E\big[c(x, y, f(x))\big] = \int_{𝒳×𝒴} c(x, y, f(x)) \, dP(x, y).    (3.12)
|
||
|
||
Here the integration is carried out with respect to the distribution P(x y). Again,
|
||
|
||
just as (3.11), this problem is intractable, since we do not know P(x y) explicitly.
|
||
|
||
Instead, we are only given the training patterns (xi yi). The latter, however, allow us to replace the unknown distribution P(x y) by its empirical estimate.
|
||
|
||
To study connections between loss functions and density models, it will be
|
||
|
||
convenient to assume that there exists a density p(x y) corresponding to P(x y).
|
||
|
||
This means that we may replace dP(x y) by p(x y)dxdy and the appropriate
|
||
|
||
measure on
|
||
|
||
. Such a density p(x y) need not always exist (see Section B.1 for
|
||
|
||
more details) but we will not give further heed to these concerns at present.
|
||
|
||
3.2.2 Approximations
|
||
|
||
Unfortunately, this change in notation did not solve the problem. All we have at our disposal is the actual training data. What one usually does is replace p(x y) by the empirical density
|
||
|
||
p_{emp}(x, y) := \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x) \, \delta_{y_i}(y).    (3.13)
|
||
|
||
4. The test outputs, however, are not available during training.
|
||
|
||
|
||
|
||
Here \delta_{x'}(x) denotes the δ-distribution, satisfying \int \delta_{x'}(x) f(x) \, dx = f(x'). The hope is that replacing p by p_{emp} will lead to a quantity that is “reasonably close” to the expected risk. This will be the case if the class of possible solutions f is sufficiently limited [568, 571]. The issue of closeness with regard to different estimators will be discussed in further detail in Chapters 5 and 12. Substituting p_{emp}(x, y) into (3.12) leads to the empirical risk:
|
||
|
||
M-Estimator
|
||
Ill-Posed Problems
|
||
Example of an Ill-Posed Problem
|
||
|
||
Definition 3.4 (Empirical Risk) The empirical risk is defined as

R_{emp}[f] := \int c(x, y, f(x)) \, p_{emp}(x, y) \, dx \, dy = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)).    (3.14)
|
||
|
||
This quantity has the advantage that, given the training data, we can readily compute and also minimize it. This constitutes a particular case of what is called an M-estimator in statistics. Estimators of this type are studied in detail in the field of empirical processes [554]. As pointed out in Section 3.1, it is crucial to understand that although our particular M-estimator is built from minimizing a loss, this need not always be the case. From a decision-theoretic point of view, the question of which loss to choose is a separate issue, which is dictated by the problem at hand as well as the goal of trying to evaluate the performance of estimation methods, rather than by the problem of trying to define a particular estimation method [582, 166, 43].
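Computationally, (3.14) is just an average of pointwise losses over the training sample. A minimal sketch (ours; the arrays X, y and the predictor f are hypothetical placeholders):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp[f] = (1/m) sum_i loss(x_i, y_i, f(x_i)), cf. (3.14)."""
    preds = np.array([f(x) for x in X])
    return np.mean([loss(x, yi, p) for x, yi, p in zip(X, y, preds)])

# usage with the squared loss (3.8) and a toy linear predictor
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.1, 0.9, 2.2])
f = lambda x: 1.05 * x[0]
print(empirical_risk(f, X, y, lambda x, yi, p: (p - yi) ** 2))
```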
|
||
These considerations aside, it may appear as if (3.14) is the answer to our problems, and all that remains to be done is to find a suitable class of functions ℱ such that we can minimize R_{emp}[f] over f ∈ ℱ. Unfortunately, determining a suitable ℱ is quite difficult (see Chapters 5 and 12 for details). Moreover, the minimization of R_{emp}[f] can lead to an ill-posed problem [538, 370]. We will show this with a simple example.
|
||
Assume that we want to solve a regression problem using the quadratic loss function (3.8), given by c(x, y, f(x)) = (y − f(x))^2. Moreover, assume that we are dealing with a linear class of functions,5 say
|
||
|
||
ℱ := \left\{ f \;\middle|\; f(x) = \sum_{i=1}^{n} α_i f_i(x), \; α_i ∈ ℝ \right\},    (3.15)

where the f_i are functions mapping 𝒳 to ℝ. We want to find the minimizer of R_{emp}, i.e.,

\underset{f ∈ ℱ}{\text{minimize}} \; R_{emp}[f] = \underset{α ∈ ℝ^n}{\text{minimize}} \; \frac{1}{m} \sum_{i=1}^{m} \Big( y_i − \sum_{j=1}^{n} α_j f_j(x_i) \Big)^2.    (3.16)
|
||
|
||
5. In the simplest case, assuming is contained in a vector space, these could be functions that extract coordinates of x; in other words, would be the class of linear functions on .
|
||
|
||
Condition of a Matrix
|
||
|
||
Computing the derivative of R_{emp}[f] with respect to α, and defining F_{ij} := f_i(x_j), we can see that the minimum of (3.16) is achieved if

F y = F F^{\top} α.    (3.17)

A sufficient condition for (3.17) is α = (F F^{\top})^{-1} F y, where (F F^{\top})^{-1} denotes the (pseudo-)inverse of the matrix. If F F^{\top} has a bad condition number (i.e. the quotient between the largest and the smallest eigenvalue of F F^{\top} is large), it is numerically difficult [423, 530] to solve (3.17) for α. Furthermore, if n > m, i.e. if we have more basis functions f_i than
|
||
|
||
training patterns xi, there will exist a subspace of solutions with dimension at least
|
||
|
||
n m, satisfying (3.17). This is undesirable both practically (speed of computation)
|
||
|
||
and theoretically (we would have to deal with a whole class of solutions rather
|
||
|
||
than a single one).
|
||
|
||
One might also expect that if is too rich, the discrepancy between Remp[ f ] and
|
||
|
||
R[ f ] could be large. For instance, if F is an m m matrix of full rank, contains
|
||
|
||
an f that predicts all target values yi correctly on the training data. Nevertheless,
|
||
|
||
we cannot expect that we will also obtain zero prediction error on unseen points.
|
||
|
||
Chapter 4 will show how these problems can be overcome by adding a so-called regularization term to R_{emp}[f].
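The ill-conditioning is easy to reproduce numerically: with more basis functions than training points, F F^T is singular, the training data are interpolated exactly, and there is an (n − m)-dimensional space of equivalent solutions. A sketch (our own, NumPy assumed):

```python
import numpy as np

m, n = 5, 20                               # more basis functions than samples
x = np.linspace(0.0, 1.0, m)
F = np.vander(x, n, increasing=True).T     # F_ij = f_i(x_j), monomial basis
y = np.sin(2 * np.pi * x)

print(np.linalg.cond(F @ F.T))             # (numerically) infinite condition number
alpha = np.linalg.pinv(F.T) @ y            # minimum-norm solution of F^T alpha = y
print(np.allclose(F.T @ alpha, y))         # zero training error, cf. (3.17)
# Any vector in the (n - m)-dimensional null space of F^T can be added to alpha
# without changing the fit on the training points.
```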
|
||
|
||
3.3 A Statistical Perspective
|
||
Given a particular pattern x˜, we may want to ask what risk we can expect for it, and with which probability the corresponding loss is going to occur. In other words, instead of (or in addition to) E c(x˜ y f (x˜) for a fixed x˜, we may want to know the distribution of y given x˜, i.e., P(y x˜).
|
||
(Bayesian) statistics (see [338, 432, 49, 43] and also Chapter 16) often attempt to estimate the density corresponding to the random variables (x y), and in some cases, we may really need information about p(x y) to arrive at the desired conclusions given the training data (e.g., medical diagnosis). However, we always have to keep in mind that if we model the density p first, and subsequently, based on this approximation, compute a minimizer of the expected risk, we will have to make two approximations. This could lead to inferior or at least not easily predictable results. Therefore, wherever possible, we should avoid solving a more general problem, since additional approximation steps might only make the estimates worse [561].
|
||
3.3.1 Maximum Likelihood Estimation
|
||
All this said, we still may want to compute the conditional density p(y x). For this purpose we need to model how y is generated, based on some underlying dependency f (x); thus, we specify the functional form of p(y x f (x)) and maximize
|
||
|
||
|
||
|
||
Log-Likelihood Regression
|
||
Classification
|
||
|
||
the expression with respect to f . This will provide us with the function f that is most likely to have generated the data.
|
||
|
||
Definition 3.5 (Likelihood) The likelihood of a sample (x_1, y_1), …, (x_m, y_m) given an underlying functional dependency f is given by

p(\{x_1, …, x_m, y_1, …, y_m\} \mid f) = \prod_{i=1}^{m} p(x_i, y_i \mid f) = \prod_{i=1}^{m} p(y_i \mid x_i, f) \, p(x_i).    (3.18)
|
||
|
||
Strictly speaking the likelihood only depends on the values f (x1) f (xm) rather than being a functional of f itself. To keep the notation simple, however, we
|
||
|
||
write p( x1 xm y1 ym f ) instead of the more heavyweight expression
|
||
|
||
p( x1
|
||
|
||
xm y1
|
||
|
||
ym f (x1)
|
||
|
||
f (xm) ).
|
||
|
||
For practical reasons, we convert products into sums by taking the negative logarithm of p(\{x_1, …, x_m, y_1, …, y_m\} \mid f), an expression which is then conveniently minimized. Furthermore, we may drop the p(x_i) from (3.18), since they do not depend on f. Thus maximization of (3.18) is equivalent to minimization of the negative Log-Likelihood

−\sum_{i=1}^{m} \ln p(y_i \mid x_i, f).    (3.19)
|
||
|
||
Remark 3.6 (Regression Loss Functions) Minimization of (3.19) and of R_{emp}[f] coincide if the loss function c is chosen according to

c(x, y, f(x)) = −\ln p(y \mid x, f).    (3.20)
|
||
|
||
Assuming that the target values y were generated by an underlying functional dependency f_{true} plus additive noise ξ with density p_ξ, i.e. y_i = f_{true}(x_i) + ξ_i, we obtain

c(x, y, f(x)) = −\ln p_ξ(y − f(x)).    (3.21)
|
||
|
||
Things are slightly different in classification. Since all we are interested in is the probability that pattern x has label 1 or 1 (assuming binary classification), we can transform the problem into one of estimating the logarithm of the probability that a pattern assumes its correct label.
|
||
|
||
Remark 3.7 (Classification Loss Functions) We have a finite set of labels, which allows us to model P(y \mid f(x)) directly, instead of modelling a density. In the binary classification case (classes 1 and −1) this problem becomes particularly easy, since all we have to do is assume a functional dependency underlying P(1 \mid f(x)): this immediately gives us P(−1 \mid f(x)) = 1 − P(1 \mid f(x)). The link to loss functions is established via

c(x, y, f(x)) = −\ln P(y \mid f(x)).    (3.22)
|
||
|
||
The same result can be obtained by minimizing the cross entropy⁶ between the classification labels y_i and the probabilities p(y \mid f(x)), as is typically done in a generalized linear models context (see, e.g., [355, 232, 163]). For binary classification (with y ∈ {−1, 1}) we obtain

6. In the case of discrete variables the cross entropy between two distributions P and Q is defined as −∑_i P(i) ln Q(i).
|
||
|
||
Table 3.1 Common loss functions and corresponding density models according to Remark 3.6. As a shorthand we use c̃(f(x) − y) := c(x, y, f(x)).

                        loss function c̃(ξ)                                        density model p(ξ)
ε-insensitive           |ξ|_ε                                                      (1 / (2(1 + ε))) exp(−|ξ|_ε)
Laplacian               |ξ|                                                        (1/2) exp(−|ξ|)
Gaussian                (1/2) ξ^2                                                  (1/√(2π)) exp(−ξ^2/2)
Huber's robust loss     ξ^2/(2σ) if |ξ| ≤ σ;  |ξ| − σ/2 otherwise                  ∝ exp(−ξ^2/(2σ)) if |ξ| ≤ σ;  ∝ exp(σ/2 − |ξ|) otherwise
Polynomial              (1/d) |ξ|^d                                                ∝ exp(−|ξ|^d / d)
Piecewise polynomial    ξ^d/(d σ^{d−1}) if |ξ| ≤ σ;  |ξ| − σ(d−1)/d otherwise      ∝ exp(−ξ^d/(d σ^{d−1})) if |ξ| ≤ σ;  ∝ exp(σ(d−1)/d − |ξ|) otherwise
|
||
|
||
c(x, y, f(x)) = −\frac{1 + y}{2} \ln P(y = 1 \mid f(x)) − \frac{1 − y}{2} \ln P(y = −1 \mid f(x)).    (3.23)
|
||
|
||
When substituting the actual values for y into (3.23), this reduces to (3.22).
|
||
|
||
At this point we have a choice in modelling P(y = 1 \mid f(x)) to suit our needs. Possible models include the logistic transfer function, the probit model, and the inverse complementary log-log model. See Section 16.3.5 for a more detailed discussion of the choice of such link functions. Below we explain the connections in some more detail for the logistic link function. For a logistic model, where P(y = 1 \mid x, f) ∝ \exp(\frac{1}{2} f(x)), we obtain after normalization

P(y = 1 \mid x, f) = \frac{\exp(f(x))}{1 + \exp(f(x))},    (3.24)

and consequently −\ln P(y = 1 \mid x, f) = \ln(1 + \exp(−f(x))). We thus recover (3.5) as the loss function for classification. Choices other than (3.24) for a map into [0, 1] will lead to further loss functions for classification. See [579, 179, 596] and Section 16.1.1 for more details on this subject.
|
||
|
||
It is important to note that not every loss function used in classification corresponds to such a density model (recall that in this case, the probabilities have to add up to 1 for any value of f(x)). In fact, one of the most popular loss functions, the soft margin loss (3.3), does not enjoy this property. A discussion of these issues can be found in [521].
Table 3.1 summarizes common loss functions and the corresponding density models as defined by (3.21), some of which were already presented in Section 3.1. It is an exhaustive list of the loss functions that will be used in this book for regression. Figure 3.2 contains graphs of the functions.
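The correspondence in Table 3.1 is simply p(ξ) ∝ exp(−c̃(ξ)), cf. (3.21). The following spot-check (our own code, NumPy assumed) compares negative log-densities with the losses, up to additive constants, for the Gaussian and Laplacian entries:

```python
import numpy as np

xi = np.linspace(-3, 3, 13)

# Gaussian: p(xi) = exp(-xi^2/2)/sqrt(2*pi)  <->  loss xi^2/2
gauss_nll = -np.log(np.exp(-xi**2 / 2) / np.sqrt(2 * np.pi))
assert np.allclose(gauss_nll - gauss_nll.min(), xi**2 / 2 - (xi**2 / 2).min())

# Laplacian: p(xi) = exp(-|xi|)/2  <->  loss |xi|
laplace_nll = -np.log(np.exp(-np.abs(xi)) / 2)
assert np.allclose(laplace_nll - laplace_nll.min(), np.abs(xi) - np.abs(xi).min())
print("negative log-densities match the losses up to additive constants")
```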
|
||
|
||
Figure 3.2 Graphs of loss functions and corresponding density models, plotted as functions of ξ. Upper left: Gaussian (squared loss), upper right: Laplacian (absolute loss), lower left: Huber's robust loss, lower right: ε-insensitive loss.
|
||
|
||
Practical Considerations
|
||
|
||
We conclude with a few cautionary remarks. The loss function resulting from a maximum likelihood reasoning might be non-convex. This might spell trouble when we try to find an efficient solution of the corresponding minimization problem. Moreover, we made a very strong assumption by claiming to know P(y x f ) explicitly, which was necessary in order to evaluate (3.20).
|
||
Finally, the solution we obtain by minimizing the log-likelihood depends on the class of functions . So we are in no better situation than by minimizing Remp[ f ], albeit with the additional constraint, that the loss functions c(x y f (x)) must correspond to a probability density.
|
||
3.3.2 Efficiency
|
||
The above reasoning could mislead us into thinking that the choice of loss function is rather arbitrary, and that there exists no good means of assessing the performance of an estimator. In the present section we will develop tools which can be used to compare estimators that are derived from different loss functions. For this purpose we need to introduce additional statistical concepts which deal with the efficiency of an estimator. Roughly speaking, these give an indication of how
|
||
|
||
Estimator
|
||
|
||
“noisy” an estimator is with respect to a reference estimator.
|
||
We begin by formalizing the concept of an estimator. Denote by P(y ) a dis-
|
||
tribution of y depending (amongst other variables) on the parameters , and by
|
||
Y y1 ym an m-sample drawn iid from P(y ). Note that the use of the symbol y bears no relation to the yi that are outputs of some functional dependency (cf. Chapter 1). We employ this symbol because some of the results to be derived
|
||
will later be applied to the outputs of SV regression. Next, we introduce the estimator ˆ (Y) of the parameters , based on Y. For
|
||
instance, P(y ) could be a Gaussian with fixed variance and mean , and ˆ (Y) could be the estimator (1 m) ∑mi 1 yi.
|
||
To avoid cumbersome notation, we use the shorthand

E_θ[ξ(y)] := E_{P(y|θ)}[ξ(y)] = \int ξ(y) \, dP(y|θ)    (3.25)

to express expectations of a random variable ξ(y) with respect to P(y|θ). One criterion that we might impose on an estimator is that it be unbiased, i.e., that on average, it tells us the correct value of the parameter it attempts to estimate.
|
||
|
||
Definition 3.8 (Unbiased Estimator) An unbiased estimator θ̂(Y) of the parameters θ in P(y|θ) satisfies

E_θ[θ̂(Y)] = θ.    (3.26)
|
||
|
||
In this section, we will focus on unbiased estimators. In general, however, the estimators we are dealing with in this book will not be unbiased. In fact, they will have a bias towards ‘simple’, low-complexity functions. Properties of such estimators are more difficult to deal with, which is why, for the sake of simplicity, we restrict ourselves to the unbiased case in this section. Note, however, that “biasedness” is not a bad property by itself. On the contrary, there exist cases as the one described by James and Stein [262] where biased estimators consistently outperform unbiased estimators in the finite sample size setting, both in terms of variance and prediction error.
|
||
A possible way to compare unbiased estimators is to compute their variance. Other quantities such as moments of higher order or maximum deviation properties would be valid criteria as well, yet for historical and practical reasons the variance has become a standard tool to benchmark estimators. The Fisher information matrix is crucial for this purpose since it will tell us via the Crame´r-Rao bound (Theorem 3.11) the minimal possible variance for an unbiased estimator. The idea is that the smaller the variance, the lower (typically) the probability that ˆ (Y) will deviate from by a large amount. Therefore, we can use the variance as a possible one number summary to compare different estimators.
|
||
|
||
Definition 3.9 (Score Function, Fisher Information, Covariance) Assume there exists a density p(y ) for the distribution P(y ) such that ln p(y ) is differentiable with
|
||
|
||
3.3 A Statistical Perspective
|
||
|
||
73
|
||
|
||
Score Function
|
||
Fisher Information Covariance
|
||
|
||
respect to . The score V (Y) of P(y ) is a random variable defined by7
|
||
|
||
V (Y) : ln p(Y )
|
||
|
||
∑ ∑ m
|
||
ln p(yi )
|
||
i1
|
||
|
||
m p(yi ) i 1 p(yi )
|
||
|
||
(3.27)
|
||
|
||
This score tells us how much the likelihood of the data depends on the different components of , and thus, in the maximum likelihood procedure, how much the data affect the choice of . The covariance of V (Y) is called the Fisher information matrix I. It is given by
|
||
|
||
Ii j : E i ln p(Y ) j ln p(Y ) and the covariance matrix B of the estimator ˆ (Y) is defined by Bi j : E ˆ i E ˆ i ˆ j E ˆ j
|
||
|
||
(3.28) (3.29)
|
||
|
||
The covariance matrix B tells us the amount of variation of the estimator. It can
|
||
therefore be used (e.g., by Chebychev’s inequality) to bound the probability that ˆ (Y) deviates from by more than a certain amount.
|
||
|
||
Average Fisher Score Vanishes
|
||
|
||
Remark 3.10 (Expected Value of Fisher Score) One can check that the expected value of V (Y) is 0 since
|
||
|
||
E [V (Y)] p(Y ) ln p(Y )dY
|
||
|
||
p(Y )dY 1 0
|
||
|
||
(3.30)
|
||
|
||
In other words, the contribution of Y to the adjustment of averages to 0 over all possible Y, drawn according to P(Y ). Equivalently we could say that the average likelihood for Y drawn according to P(Y ) is extremal, provided we choose : the derivative of the expected likelihood of the data E ln P(Y ) with respect to vanishes. This is also what we expect, namely that the “proper” distribution is on average the one with the highest likelihood.
|
||
|
||
The following theorem gives a lower bound on the variance of an estimator, i.e. B is found in terms of the Fisher information I. This is useful to determine how well a given estimator performs with respect to the one with the lowest possible variance.
|
||
|
||
Theorem 3.11 (Cramér and Rao [425]) Any unbiased estimator θ̂(Y) satisfies

\det(IB) \ge 1.    (3.31)
|
||
|
||
Proof We prove (3.31) for the scalar case. The extension to matrices is left as an exercise (see Problem 3.10). Using the Cauchy-Schwarz inequality, we obtain
|
||
|
||
E (V (Y) E [V (Y)]) ˆ (Y) E ˆ (Y) 2
|
||
|
||
(3.32)
|
||
|
||
E (V (Y) E [V (Y)])2 E ˆ (Y) E ˆ (Y) 2 IB
|
||
|
||
(3.33)
|
||
|
||
7. Recall that p(Y ) is the gradient of p(Y ) with respect to the parameters 1
|
||
|
||
n.
|
||
|
||
74
|
||
Asymptotic Variance
|
||
|
||
Risk and Loss Functions
|
||
|
||
At the same time, E [V (Y)] 0 implies that
|
||
E (V (Y) E [V (Y)]) ˆ (Y) E ˆ (Y) 2
|
||
E V (Y)ˆ(Y) 2
|
||
2
|
||
p(Y )V (Y)ˆ(Y)dY
|
||
|
||
2
|
||
|
||
p(Y )ˆ(Y)dY
|
||
|
||
( )2 1
|
||
|
||
(3.34) (3.35)
|
||
(3.36)
|
||
|
||
since we may interchange integration by Y and .
|
||
|
||
Eq. (3.31) lends itself to the definition of a one-number summary of the properties of an estimator, namely how closely the inequality is met.
|
||
|
||
Definition 3.12 (Efficiency) The statistical efficiency e of an estimator θ̂(Y) is defined as

e := \frac{1}{\det(IB)}.    (3.37)
|
||
|
||
The closer e is to 1, the lower the variance of the corresponding estimator ˆ (Y). For a special class of estimators minimizing loss functions, the following theorem allows us to compute B and e efficiently.
|
||
|
||
Theorem 3.13 (Murata, Yoshizawa, Amari [379, Lemma 3]) Assume that ˆ is defined by ˆ (Y) : argmin d(Y ) and that d is a twice differentiable function in .
|
||
|
||
Then asymptotically, for increasing sample size m
|
||
|
||
, the variance B is given by
|
||
|
||
B Q 1GQ 1. Here
|
||
|
||
Gi j : cov i d (Y ) j d (Y ) and
|
||
|
||
Qij : E
|
||
|
||
2 d(Y ) ij
|
||
|
||
and therefore e (det Q)2 (det IG).
|
||
|
||
(3.38) (3.39)
|
||
|
||
This means that for the class of estimators defined via d, the evaluation of their asymptotic efficiency can be conveniently achieved via (3.38) and (3.39). For scalar valued estimators (Y) , these expressions can be greatly simplified to
|
||
|
||
I
|
||
|
||
ln p(Y ) 2 dP(Y )
|
||
|
||
(3.40)
|
||
|
||
G
|
||
|
||
( d(Y ))2 dP(Y )
|
||
|
||
(3.41)
|
||
|
||
Q
|
||
|
||
2d(Y )dP(Y )
|
||
|
||
(3.42)
|
||
|
||
Finally, in the case of continuous densities, Theorem 3.13 may be extended to piecewise twice differentiable continuous functions d, by convolving the latter with a twice differentiable smoothing kernel, and letting the width of the smoothing kernel converge to zero. We will make use of this observation in the next section when studying the efficiency of some estimators.
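As a concrete illustration of efficiency (our own example, not from the text): for Gaussian data, the sample mean attains e = 1, whereas the sample median — the minimizer of the absolute loss — has asymptotic efficiency 2/π ≈ 0.64. A small Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
m, trials, theta = 200, 5000, 0.0
samples = rng.normal(theta, 1.0, size=(trials, m))

var_mean = np.var(samples.mean(axis=1))      # variance of the mean estimator
var_median = np.var(np.median(samples, axis=1))  # variance of the median estimator
print(var_mean / var_median)                 # close to 2/pi for Gaussian data
```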
|
||
|
||
3.4 Robust Estimators
|
||
|
||
75
|
||
|
||
The current section concludes with the proof that the maximum likelihood estimator meets the Crame´r-Rao bound.
|
||
|
||
Theorem 3.14 (Efficiency of Maximum Likelihood [118, 218, 43]) The maximum likelihood estimator (cf. (3.18) and (3.19)) given by

θ̂(Y) := \underset{θ}{\operatorname{argmax}} \; \ln p(Y|θ) = \underset{θ}{\operatorname{argmin}} \; \big(−\ln p(Y|θ)\big)    (3.43)

is asymptotically efficient (e = 1).
|
||
|
||
To keep things simple we will prove (3.43) only for the class of twice differentiable continuous densities by applying Theorem 3.13. For a more general proof see [118, 218, 43].
|
||
|
||
Proof By construction, G is equal to the Fisher information matrix, if we choose d according to (3.43). Hence a sufficient condition is that Q I, which is what we show below. To this end we expand the integrand of (3.42),
|
||
|
||
2d(Y )
|
||
|
||
2 ln p(Y )
|
||
|
||
2 p(Y ) p(Y )
|
||
|
||
p(Y ) 2 p(Y )
|
||
|
||
2 p(Y ) V2(Y) p(Y )
|
||
|
||
(3.44)
|
||
|
||
The expectation of the second term in (3.44) equals I. We now show that the expectation of the first term vanishes;
|
||
|
||
2p(Y ) p(Y ) p(Y ) dY
|
||
|
||
2 p(Y )dY
|
||
|
||
21 0
|
||
|
||
(3.45)
|
||
|
||
Hence Q I and thus e Q2 (IG) 1. This proves that the maximum likelihood estimator is asymptotically efficient.
|
||
|
||
It appears as if the best thing we could do is to use the maximum likelihood (ML) estimator. Unfortunately, reality is not quite as simple as that. First, the above statement holds only asymptotically. This leads to the (justified) suspicion that for finite sample sizes we may be able to do better than ML estimation. Second, practical considerations such as the additional goal of sparse decomposition may lead to the choice of a non-optimal loss function.
|
||
Finally, we may not know the true density model, which is required for the definition of the maximum likelihood estimator. We can try to make an educated guess; bad guesses of the class of densities, however, can lead to large errors in the estimation (see, e.g., [251]). This prompted the development of robust estimators.
|
||
|
||
3.4 Robust Estimators
|
||
So far, in order to make any practical predictions, we had to assume a certain class of distributions from which P(Y) was chosen. Likewise, in the case of risk functionals, we also assumed that training and test data are identically distributed. This section provides tools to safeguard ourselves against cases where the above
|
||
|
||
76 Outliers
|
||
Mixture Densities
|
||
|
||
Risk and Loss Functions
|
||
|
||
assumptions are not satisfied. More specifically, we would like to avoid a certain fraction of ‘bad’ obser-
|
||
vations (often also referred to as ‘outliers’) seriously affecting the quality of the estimate. This implies that the influence of individual patterns should be bounded from above. Huber [250] gives a detailed list of desirable properties of a robust estimator. We refrain from reproducing this list at present, or committing to a particular definition of robustness.
|
||
As usual for the estimation of location parameter context (i.e. estimation of the expected value of a random variable) we assume a specific parametric form of p(Y ), namely
|
||
|
||
m
|
||
|
||
m
|
||
|
||
p(Y ) ∏ p(yi ) ∏ p(yi )
|
||
|
||
i1
|
||
|
||
i1
|
||
|
||
(3.46)
|
||
|
||
Unless stated otherwise, this is the formulation we will use throughout this section.
|
||
|
||
3.4.1 Robustness via Loss Functions
|
||
|
||
Huber’s idea [250] in constructing a robust estimator was to take a loss function as provided by the maximum likelihood framework, and modify it in such a way as to limit the influence of each individual pattern. This is done by providing an upper bound on the slope of ln p(Y ). We shall see that methods such as the trimmed mean or the median are special cases thereof. The -insensitive loss function can also be viewed as a trimmed estimator. This will lead to the development of adaptive loss functions in the subsequent sections. We begin with the main theorem of this section.
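Huber's loss from Table 3.1 realizes this bounded-influence idea directly: it is quadratic for small residuals and linear beyond a threshold σ, so the derivative — and hence the pull of any single observation on the estimate — is clipped. A minimal sketch (ours, NumPy assumed):

```python
import numpy as np

def huber(xi, sigma=1.0):
    """Huber's robust loss from Table 3.1: quadratic for |xi| <= sigma, linear beyond."""
    quad = xi ** 2 / (2 * sigma)
    lin = np.abs(xi) - sigma / 2
    return np.where(np.abs(xi) <= sigma, quad, lin)

xi = np.linspace(-5, 5, 11)
print(huber(xi))
# The derivative is clip(xi / sigma, -1, 1), so individual residuals can pull
# the estimate only by a bounded amount, unlike with the squared loss.
print(np.gradient(huber(xi), xi))   # finite-difference approximation of the slope
```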
|
||
|
||
Theorem 3.15 (Robust Loss Functions (Huber [250])) Let be a class of densities formed by
|
||
|
||
: p p (1 )p0 p1 where (0 1) and p0 are known
|
||
|
||
(3.47)
|
||
|
||
Moreover assume that both p0 and p1 are symmetric with respect to the origin, their logarithms are twice continuously differentiable, ln p0 is convex and known, and p1 is unknown. Then the density
|
||
|
||
p¯( ) : (1
|
||
|
||
)
|
||
|
||
p0( ) p0( 0)e k(
|
||
|
||
if
|
||
|
||
0
|
||
|
||
0) otherwise
|
||
|
||
(3.48)
|
||
|
||
is robust in the sense that the maximum likelihood estimator corresponding to (3.48) has minimum variance with respect to the “worst” possible density pworst (1 )p0 p1: it is a saddle point (located at pworst) in terms of variance with respect to the true density p and the density p¯ used in estimating the location parameter. This means that no density p has larger variance than pworst and that for p pworst no estimator is better than the one where p¯ pworst, as used in the robust estimator.
|
||
The constants k 0 and 0 are obtained by the normalization condition, that p¯ be a
|
||
|
||
3.4 Robust Estimators
|
||
|
||
77
|
||
|
||
proper density and that the first derivative in ln p¯ be continuous.
|
||
|
||
Proof To show that p¯ is a saddle point in we have to prove that (a) no estimation procedure other than the one using ln p¯ as the loss function has lower variance for the density p¯, and that (b) no density has higher variance than p¯ if ln p¯ is used as loss function. Part (a) follows immediately from the Crame´r-Rao theorem (Th. 3.11); part (b) can be proved as follows.
|
||
We use Theorem 3.13, and a proof technique pointed out in [559], to compute the variance of an estimator using −ln p̄ as the loss function:

B = \frac{\int \left(\partial_\theta \ln \bar{p}(y-\theta)\right)^2 \left[(1-\varepsilon) p_0(y-\theta) + \varepsilon p(y-\theta)\right] dy}{\left[\int \partial_\theta^2 \ln \bar{p}(y-\theta) \left[(1-\varepsilon) p_0(y-\theta) + \varepsilon p(y-\theta)\right] dy\right]^2}    (3.49)
Here p is an arbitrary density which we will choose such that B is maximized. By construction,
\left(\partial_\theta \ln \bar{p}(y-\theta)\right)^2 = \begin{cases} \left(\partial_\theta \ln p_0(y-\theta)\right)^2 \le k^2 & \text{if } |y-\theta| \le \xi_0 \\ k^2 & \text{otherwise,} \end{cases}    (3.50)

\partial_\theta^2 \ln \bar{p}(y-\theta) = \begin{cases} \partial_\theta^2 \ln p_0(y-\theta) \le 0 & \text{if } |y-\theta| \le \xi_0 \\ 0 & \text{otherwise.} \end{cases}    (3.51)
Thus any density p which is 0 on [−ξ0, ξ0] will minimize the denominator (the term depending on p will be 0, which is the lowest obtainable value due to (3.51)), and maximize the numerator, since in the latter the contribution of p is always limited to k^2. Now ε^{-1}(p̄ − (1 − ε)p0) is exactly such a density. Hence the saddle point property holds.
Remark 3.16 (Robustness Classes) If we have more knowledge about the class of densities, a different loss function will have the saddle point property. For instance, using a similar argument as above, one can show that the normal distribution is robust in the class of all distributions with bounded variance. This implies that among all possible distributions with bounded variance, the estimator of the mean of a normal distribution has the highest variance.
Likewise, the Laplacian distribution is robust in the class of all symmetric distributions with density p(0) ≥ c for some fixed c > 0 (see [559, 251] for more details).
Hence, even though a loss function defined according to Theorem 3.15 is generally desirable, we may be less cautious, and use a different loss function for improved performance, when we have additional knowledge of the distribution.
Remark 3.17 (Mean and Median) Assume we are dealing with a mixture of a normal distribution with variance σ^2 and an additional unknown distribution with weight at most ε. It is easy to check that the application of Theorem 3.15 to normal distributions yields Huber's robust loss function from Table 3.1.
The maximizer of the likelihood (see also Problem 3.17) is a trimmed mean estimator which discards a fraction ε of the data: effectively, all yi deviating from the mean by more than the threshold ξ0 are ignored and the mean is computed from the remaining data. Hence Theorem 3.15 gives a formal justification for this popular type of estimator.
If we let ε → 1 we recover the median estimator, which stems from a Laplacian distribution. Here, all patterns but the median one are discarded.
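To make this concrete, here is a minimal numerical sketch (our own, not code from the book) of a Huber-type location estimate in Python. It assumes a Gaussian p0, a clipping threshold ξ0 = 1, and k = ξ0 (so that the first derivative of ln p̄ is continuous), and it contaminates the sample with a small fraction of gross outliers; the result is compared with the sample mean and median.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# 95% standard normal "inliers" (p0), 5% gross outliers playing the role of the unknown p1
y = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(8.0, 1.0, 50)])

def huber_loss(r, xi0=1.0):
    # -ln pbar(r) up to additive constants, for Gaussian p0 and k = xi0:
    # quadratic inside [-xi0, xi0], linear (bounded slope) outside
    r = np.abs(r)
    return np.where(r <= xi0, 0.5 * r ** 2, xi0 * r - 0.5 * xi0 ** 2)

# maximum likelihood location estimate under the robust density (3.48)
res = minimize_scalar(lambda theta: huber_loss(y - theta).sum(),
                      bounds=(-10.0, 10.0), method="bounded")

print("Huber estimate:", res.x)         # close to 0, barely affected by the outliers
print("sample mean   :", y.mean())      # pulled towards the outliers (roughly 0.4)
print("sample median :", np.median(y))  # also robust

Note that the threshold is fixed in advance here; Section 3.4.3 discusses how to make such parameters adaptive.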
Besides the classical examples of loss functions and density models, we might also consider a slightly unconventional estimation procedure: use the average of the k-th smallest and the k-th largest of all observations as the estimated mean of the underlying distribution (for sorted observations yi with yi ≤ yj for 1 ≤ i ≤ j ≤ m, the estimator computes (y_k + y_{m−k+1})/2). This procedure makes sense, for instance, when we are trying to infer the mean of a random variable generated by roundoff noise (i.e., noise whose density is constant within some bounded interval) plus an additional unknown amount of noise.
Note that the patterns strictly inside or strictly outside an interval [−ε, ε] around the estimate have no direct influence on the outcome; only patterns on the boundary matter. This is very similar to the behavior of Support Vector Machines in regression, and one can show that this estimator corresponds to the minimizer of the ε-insensitive loss function (3.9). We will study the properties of the latter in more detail in the following section and thereafter show how it can be transformed into an adaptive risk functional.
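As a small illustration (again our own sketch, not the book's code; the sample, k = 10, and the search grid are arbitrary choices), the following lines compute the average of the k-th smallest and the k-th largest observation, and then verify by brute force that the same value minimizes the ε-insensitive loss once ε is set to half the distance between these two order statistics.

import numpy as np

rng = np.random.default_rng(1)
# roundoff-style noise (uniform on an interval) plus a small amount of additional jitter
y = np.sort(rng.uniform(-0.5, 0.5, 99) + rng.normal(0.0, 0.05, 99))
m, k = len(y), 10

# average of the k-th smallest and the k-th largest observation
theta_interval = 0.5 * (y[k - 1] + y[m - k])

# tube width such that [theta - eps, theta + eps] spans exactly these two order statistics
eps = 0.5 * (y[m - k] - y[k - 1])

def eps_insensitive_loss(theta):
    return np.maximum(np.abs(theta - y) - eps, 0.0).sum()

grid = np.linspace(y[0], y[-1], 2001)
theta_sv = grid[np.argmin([eps_insensitive_loss(t) for t in grid])]

print("trimmed interval estimate :", theta_interval)
print("eps-insensitive minimizer :", theta_sv)  # agrees up to the grid resolution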
3.4.2 Efficiency and the ε-Insensitive Loss Function
The tools of Section 3.3.2 allow us to analyze the ε-insensitive loss function in more detail. Even though the asymptotic estimation of a location parameter is a gross oversimplification of what is happening in a SV regression estimator (where we estimate a nonparametric function, and moreover have only a limited number of observations at our disposal), it will provide us with useful insights into this more complex case [510, 481].
In a first step, we compute the efficiency of an estimator, for several noise models and amounts of variance, using a density corresponding to the ε-insensitive loss function (cf. Table 3.1):

p_\varepsilon(y-\theta) = \frac{1}{2(1+\varepsilon)} \exp\left(-|y-\theta|_\varepsilon\right) = \frac{1}{2(1+\varepsilon)} \begin{cases} 1 & \text{if } |y-\theta| \le \varepsilon \\ \exp\left(\varepsilon - |y-\theta|\right) & \text{otherwise.} \end{cases}    (3.52)
For this purpose we have to evaluate the quantities G (3.41) and Q (3.42) of Theorem 3.13. We obtain
G = m \int \left(\partial_\theta \ln p_\varepsilon(y-\theta)\right)^2 dP(y) = m \left[ 1 - \int_{-\varepsilon}^{\varepsilon} p(y)\, dy \right]    (3.53)

Q = m \int \partial_\theta^2 \ln p_\varepsilon(y-\theta)\, dP(y) = m \left[ p(-\varepsilon) + p(\varepsilon) \right]    (3.54)
The Fisher information I of m iid random variables distributed according to p is m times that of a single random variable. Thus all dependencies on m in e cancel out, and we can limit ourselves to the case m = 1 for the analysis of the efficiency of estimators. Now we may check what happens if we use the ε-insensitive loss function for different types of noise model. For the sake of simplicity we begin with Gaussian noise.
Example 3.18 (Gaussian Noise) Assume that y is normally distributed with zero mean (i.e., θ = 0) and variance σ^2. By construction, the minimum obtainable variance is I^{-1} = σ^2 (recall that m = 1). Moreover, (3.53) and (3.54) yield
\frac{G}{Q^2} = \frac{\pi\sigma^2}{2}\, \exp\left(\frac{\varepsilon^2}{\sigma^2}\right) \left[ 1 - \operatorname{erf}\left(\frac{\varepsilon}{\sqrt{2}\,\sigma}\right) \right]    (3.55)

The efficiency e = Q^2/(GI) is maximized for ε = 0.6120 σ. This means that if the underlying noise model is Gaussian with variance σ^2 and we have to use an ε-insensitive loss function to estimate a location parameter, the most efficient estimator from this family is given by ε = 0.6120 σ.
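The value 0.6120 σ is easy to check numerically. The sketch below (our own code, not the book's) maximizes the efficiency e = Q^2/(GI) assembled from (3.53) and (3.54) for Gaussian noise, and also illustrates that the optimal ε scales linearly with σ, a point taken up by the next lemma.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import erf

def neg_efficiency(eps, sigma):
    # G and Q from (3.53), (3.54) for zero-mean Gaussian noise with standard deviation sigma (m = 1)
    G = 1.0 - erf(eps / (np.sqrt(2.0) * sigma))
    Q = 2.0 / (sigma * np.sqrt(2.0 * np.pi)) * np.exp(-eps ** 2 / (2.0 * sigma ** 2))
    I = 1.0 / sigma ** 2            # Fisher information of N(0, sigma^2) for m = 1
    return -(Q ** 2) / (G * I)      # negated efficiency, so that a minimizer can be used

for sigma in (1.0, 2.0):
    res = minimize_scalar(lambda e: neg_efficiency(e, sigma),
                          bounds=(1e-6, 5.0 * sigma), method="bounded")
    print(f"sigma = {sigma}: eps_opt = {res.x:.4f}, eps_opt / sigma = {res.x / sigma:.4f}")
# both runs give eps_opt / sigma close to 0.612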
The consequence of (3.55) is that the optimal value of ε scales linearly with σ. Of course, we could just use squared loss in such a situation, but in general, we will not know the exact noise model, and squared loss does not lead to robust estimators. The following lemma (which will come in handy in the next section) shows that this is a general property of the ε-insensitive loss.
Lemma 3.19 (Linear Dependency between ε-Tube Width and Variance) Denote by p a symmetric density with variance σ^2 > 0. Then the optimal value of ε (i.e., the value that achieves maximum asymptotic efficiency) for an estimator using the ε-insensitive loss is given by

\varepsilon_{\mathrm{opt}} = \operatorname*{argmin}_{\varepsilon} \left[ p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon) \right]^{-2} \left[ 1 - \int_{-\sigma^{-1}\varepsilon}^{\sigma^{-1}\varepsilon} p_{\mathrm{std}}(\tau)\, d\tau \right]    (3.56)

where pstd(y) := σ p(σy) is the standardized version of p(y − θ), i.e., it is obtained by rescaling p(y − θ) to zero mean and unit variance.

Since pstd is independent of σ, we have a linear dependency between ε_opt and σ. The scaling factor depends on the noise model.
Proof  We prove (3.56) by rewriting the efficiency e(ε) in terms of pstd via p(y − θ) = σ^{-1} pstd(σ^{-1}(y − θ)). This yields

e(\varepsilon) = \frac{Q^2}{IG} = \frac{\sigma^{-2}\left[ p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon) \right]^2}{I \left[ 1 - \int_{-\varepsilon}^{\varepsilon} \sigma^{-1} p_{\mathrm{std}}(\sigma^{-1} y)\, dy \right]} = \frac{\left[ p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon) \right]^2}{I \sigma^{2} \left[ 1 - \int_{-\sigma^{-1}\varepsilon}^{\sigma^{-1}\varepsilon} p_{\mathrm{std}}(\tau)\, d\tau \right]}

The maximum of e(ε) does not depend directly on ε, but only on σ^{-1}ε (which is independent of σ). Hence we can find argmax_ε e(ε) by solving (3.56).
Lemma 3.19 made it apparent that in order to adjust ε we have to know σ beforehand. Unfortunately, the latter is usually unknown at the beginning of the estimation procedure.8 The solution to this dilemma is to make ε adaptive.

8. The obvious question is why one would ever like to choose an ε-insensitive loss in the presence of Gaussian noise in the first place. If the complexity of the function expansion is of no concern and the highest accuracy is required, squared loss is to be preferred. In most cases, however, it is not quite clear what exactly the type of the additive noise model is. This is when we would like to have a more conservative estimator. In practice, the ε-insensitive loss has been shown to work rather well on a variety of tasks (Chapter 9).
3.4.3 Adaptive Loss Functions
We again consider the trimmed mean estimator, which discards a predefined fraction of the largest and smallest samples. This method belongs to the more general class of quantile estimators, which base their estimates on the value of samples in a certain quantile. The latter methods do not require prior knowledge of the variance, and adapt to whatever scale is required. What we need is a technique which connects σ (in Huber's robust loss function) or ε (in the ε-insensitive loss case) with the deviations between the estimate θ̂ and the random variables yi.
Let us analyze what happens to the negative log-likelihood if, in the ε-insensitive case, we change ε to ε + δ (with |δ| small) while keeping θ̂ fixed. In particular, we assume that δ is chosen sufficiently small such that for all i = 1, …, m,

|\hat\theta - y_i| > \varepsilon + \delta \;\text{ if } |\hat\theta - y_i| > \varepsilon, \quad\text{and}\quad |\hat\theta - y_i| \le \varepsilon + \delta \;\text{ if } |\hat\theta - y_i| \le \varepsilon.    (3.57)
Moreover, denote by m_<, m_=, and m_> the number of samples for which |θ̂ − yi| is less than, equal to, or greater than ε, respectively. Then
\sum_{i=1}^{m} |\hat\theta - y_i|_{\varepsilon+\delta} = \sum_{i=1}^{m} |\hat\theta - y_i|_{\varepsilon} - \begin{cases} \delta\, m_> & \text{if } \delta \ge 0 \\ \delta\, (m_> + m_=) & \text{otherwise.} \end{cases}    (3.58)
In other words, the amount by which the loss changes depends only on the quantiles at ε. What happens if we make ε itself a variable of the optimization problem? By the scaling properties of (3.58) one can see that for ν ∈ [0, 1],
\operatorname*{minimize}_{\hat\theta,\, \varepsilon}\;\; \frac{1}{m} \sum_{i=1}^{m} |\hat\theta - y_i|_{\varepsilon} + \nu\varepsilon    (3.59)
is minimized if ε is chosen such that

\frac{m_>}{m} \le \nu \le \frac{m_> + m_=}{m}    (3.60)
This relation holds since at the solution (θ̂, ε), the solution also has to be optimal wrt. ε alone while keeping θ̂ fixed. In the latter case, however, the derivatives of the log-likelihood (i.e., error) term wrt. ε at the solution are given by −(m_> + m_=)/m and −m_>/m on the left and right hand side respectively.9 These have to cancel with ν, which proves the claim. Furthermore, computing the derivative of (3.59) with respect to θ̂ shows that the number of samples outside the interval [θ̂ − ε, θ̂ + ε] has to be equal on both halves (−∞, θ̂ − ε) and (θ̂ + ε, ∞). We have the following theorem:
Theorem 3.20 (Quantile Estimation as Optimization Problem [481]) A quantile procedure to estimate the mean of a distribution by taking the average of the samples at the (ν/2)-th and (1 − ν/2)-th quantile is equivalent to minimizing (3.59). In particular,

1. ν is an upper bound on the fraction of samples outside the interval [θ̂ − ε, θ̂ + ε].

2. ν is a lower bound on the fraction of samples outside the interval ]θ̂ − ε, θ̂ + ε[.

3. If the distribution p(y) is continuous, then for all ν ∈ [0, 1],

\lim_{m \to \infty} P\left( \left| \frac{m_>}{m} - \nu \right| < \gamma \right) = 1 \quad \text{for all } \gamma > 0    (3.61)
One might question the practical advantage of this method over direct trimming of the sample Y. In fact, the use of (3.59) is not recommended if all we want is to estimate θ. That said, (3.59) does allow us to employ trimmed estimation in the nonparametric case, cf. Chapter 9.
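For the simple location setting, the ν-property can be checked directly. The sketch below (our own code; the Student-t sample and ν = 0.3 are arbitrary choices) minimizes (3.59) jointly over θ̂ and ε with a generic optimizer and counts the samples outside the resulting tube.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.standard_t(df=3, size=1000)   # heavy-tailed noise sample
nu = 0.3

def objective(params):
    theta, eps = params[0], abs(params[1])   # keep the tube width nonnegative
    return nu * eps + np.mean(np.maximum(np.abs(theta - y) - eps, 0.0))

res = minimize(objective, x0=np.array([0.0, 1.0]), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-10})
theta_hat, eps_hat = res.x[0], abs(res.x[1])

outside = np.mean(np.abs(theta_hat - y) > eps_hat)
print(f"theta_hat = {theta_hat:.3f}, eps_hat = {eps_hat:.3f}")
print(f"fraction of samples outside the tube = {outside:.3f}, nu = {nu}")
# up to O(1/m) effects and optimizer tolerance the fraction matches nu,
# in agreement with (3.60) and the asymptotic statement (3.61)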
Unfortunately, we were unable to find a similar method for Huber's robust loss function, since in this case the change in the negative log-likelihood incurred by changing σ not only involves the (statistical) rank of the yi, but also the exact location of samples yi with |θ̂ − yi| ≤ σ.
One way to overcome this problem is to re-estimate σ adaptively while minimizing a term similar to (3.59) (see [180] for details in the context of boosting, Section 10.6.3 for a discussion of online estimation techniques, or [251] for a general overview).
3.4.4 Optimal Choice of ν
Let us return to the ε-insensitive loss. A combination of Theorems 3.20 and 3.13 and Lemma 3.19 allows us to compute optimal values of ν for various distributions, provided that an ε-insensitive loss function is to be used in the estimation procedure.10

The idea is to determine the optimal value of ε for a fixed density p(y − θ) via (3.56), and compute the corresponding fraction ν of patterns outside the interval [θ̂ − ε, θ̂ + ε].
9. Strictly speaking, the derivative is not defined at this point; the lhs and rhs values are defined, however, which is sufficient for our purpose.
10. This is not optimal in the sense of Theorem 3.15, which suggests the use of a more adapted loss function. However (as already stated in the introduction of this chapter), algorithmic or technical reasons such as computationally efficient solutions or limited memory may provide sufficient motivation to use such a loss function.
Table 3.2  Optimal ν and ε for various degrees of polynomial additive noise.

Polynomial degree d    Optimal ν    Optimal ε for unit variance
1                      1            0
2                      0.5405       0.6120
3                      0.2909       1.1180
4                      0.1898       1.3583
5                      0.1384       1.4844
6                      0.1080       1.5576
7                      0.0881       1.6035
8                      0.0743       1.6339
9                      0.0641       1.6551
10                     0.0563       1.6704
Theorem 3.21 (Optimal Choice of ν) Denote by p a symmetric density with variance σ^2 > 0 and by pstd the corresponding rescaled density with zero mean and unit variance. Then the optimal value of ν (i.e., the value that achieves maximum asymptotic efficiency) for an estimator using the ε-insensitive loss is given by

\nu = 1 - \int_{-\sigma^{-1}\varepsilon_{\mathrm{opt}}}^{\sigma^{-1}\varepsilon_{\mathrm{opt}}} p_{\mathrm{std}}(y)\, dy    (3.62)

where ε_opt is chosen according to (3.56). This expression is independent of σ.
Proof  The independence of σ follows from the fact that ν depends only on pstd. Next we show (3.62). For a given density p, the asymptotically optimal value of ε is given by Lemma 3.19. The average fraction of patterns outside the interval [θ̂ − ε_opt, θ̂ + ε_opt] is

\nu = 1 - \int_{-\varepsilon_{\mathrm{opt}}}^{\varepsilon_{\mathrm{opt}}} p(y)\, dy = 1 - \int_{-\sigma^{-1}\varepsilon_{\mathrm{opt}}}^{\sigma^{-1}\varepsilon_{\mathrm{opt}}} p_{\mathrm{std}}(y)\, dy    (3.63)

which depends only on σ^{-1}ε_opt and is thus independent of σ. Combining (3.63) with (3.56) yields the theorem.
This means that given the type of additive noise, we can determine the value of ν such that it yields the asymptotically most efficient estimator, independent of the level of the noise. These theoretical predictions have since been confirmed rather accurately in a set of regression experiments [95].

Let us now look at some special cases.
Example 3.22 (Optimal ν for Polynomial Noise) Arbitrary polynomial noise models (∝ e^{-|ξ|^d}) with unit variance can be written as

p(y) = c_p \exp\left( - \tilde{c}_p\, |y|^{d} \right) \quad\text{where}\quad c_p = \frac{d}{2} \sqrt{\frac{\Gamma(3/d)}{\Gamma(1/d)}}\; \frac{1}{\Gamma(1/d)} \quad\text{and}\quad \tilde{c}_p = \left( \frac{\Gamma(3/d)}{\Gamma(1/d)} \right)^{d/2}

where Γ(x) is the gamma function. Figure 3.3 shows ν_opt for polynomial degrees d in the interval [1, 10]. For convenience, the explicit numerical values are repeated in Table 3.2.
Observe that as the distribution becomes "lighter-tailed," the optimal ν decreases; in other words, we may then use a larger amount of the data for the purpose of estimation. This is reasonable, since it is only for very long tails of the distribution (data with many outliers) that we have to be conservative and discard a large fraction of observations.
Figure 3.3  Optimal ν and ε (for unit variance) for various degrees of polynomial additive noise.
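The entries of Table 3.2 can be reproduced from (3.56) and (3.62). The sketch below (our own code, using the unit-variance polynomial noise density of Example 3.22, so that σ = 1) does this for a few degrees d; it recovers, for instance, ν ≈ 0.54 and ε ≈ 0.61 for the Gaussian case d = 2.

import numpy as np
from math import gamma
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def p_std(y, d):
    # unit-variance polynomial noise density of Example 3.22
    c_tilde = (gamma(3.0 / d) / gamma(1.0 / d)) ** (d / 2.0)
    c = 0.5 * d * np.sqrt(gamma(3.0 / d)) * gamma(1.0 / d) ** (-1.5)
    return c * np.exp(-c_tilde * np.abs(y) ** d)

def optimal_nu_eps(d):
    # (3.56) with sigma = 1: minimize [p(-eps) + p(eps)]^(-2) * [1 - integral of p over [-eps, eps]]
    def objective(eps):
        mass = quad(lambda y: p_std(y, d), -eps, eps)[0]
        return (2.0 * p_std(eps, d)) ** (-2) * (1.0 - mass)
    eps_opt = minimize_scalar(objective, bounds=(1e-3, 4.0), method="bounded").x
    # (3.62): nu is the probability mass outside the optimal tube
    nu_opt = 1.0 - quad(lambda y: p_std(y, d), -eps_opt, eps_opt)[0]
    return nu_opt, eps_opt

for d in (2, 3, 5, 10):
    nu_opt, eps_opt = optimal_nu_eps(d)
    print(f"d = {d:2d}: nu_opt = {nu_opt:.4f}, eps_opt = {eps_opt:.4f}")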
Even though we derived these relations solely for the case where a single number (θ) has to be estimated, experiments show that the same scaling properties hold for the nonparametric case. It is still an open research problem to establish this connection exactly.
As we shall see, in the nonparametric case, the effect of ν will be that it both determines the number of Support Vectors (i.e., the number of basis functions needed to expand the solution) and also the fraction of function values f(xi) with deviation larger than ε from the corresponding observations. Further information on this topic, both from the statistical and the algorithmic point of view, can be found in Section 9.3.
3.5 Summary
We saw in this chapter that there exist two complementary concepts as to how risk and loss functions should be designed. The first one is data driven and uses the incurred loss as its principal guideline, possibly modified in order to suit the need of numerical efficiency. This leads to loss functions and the definitions of empirical and expected risk.
A second method is based on the idea of estimating (or at least approximating) the distribution which may be responsible for generating the data. We showed that in a Maximum Likelihood setting this concept is rather similar to the notions of risk and loss, with c(x, y, f(x)) = −ln p(y|x, f(x)) as the link between both quantities.
This point of view allowed us to analyze the properties of estimators in more detail and to provide lower bounds on the performance of unbiased estimators, i.e., the Cramér-Rao theorem. The latter was then used as a benchmarking tool for various loss functions and density models, such as the ε-insensitive loss. The consequence of this analysis is a corroboration of experimental findings that there exists a linear correlation between the amount of noise in the observations and the optimal width of the ε-tube.
This, in turn, allowed us to construct adaptive loss functions which adjust themselves to the amount of noise, much like trimmed mean estimators. These formulations can be used directly in mathematical programs, leading to the ν-SV algorithms of subsequent chapters. The question of which choices are optimal in a finite sample size setting remains an open research problem.
3.6 Problems
3.1 (Soft Margin and Logistic Regression) The soft margin loss function c_soft and the logistic loss c_logist are asymptotically almost the same; show that

\lim_{f \to -\infty} \left[ c_{\mathrm{soft}}(x, 1, f) - c_{\mathrm{logist}}(x, 1, f) \right] = 1    (3.64)

\lim_{f \to \infty} \left[ c_{\mathrm{soft}}(x, 1, f) - c_{\mathrm{logist}}(x, 1, f) \right] = 0    (3.65)
3.2 (Multi-class Discrimination) Assume you have to solve a classification problem with M different classes. Discuss how the number of functions used to solve this task affects the quality of the solution.
How would the loss function look if you were to use only one real-valued function f : 𝒳 → ℝ? Which symmetries are violated in this case (hint: what happens if you permute the classes)?

How many functions do you need if each of them makes a binary decision f : 𝒳 → {0, 1}?
How many functions do you need in order to make the solution permutation symmetric with respect to the class labels?
How should you assess the classification error? Is it a good idea to use the misclassification rate of one individual function as a performance criterion (hint: correlation of errors)? By how much can this error differ from the total misclassification error?
3.3 (Mean and Median) Assume 8 people want to gather for a meeting; 5 of them live in Stuttgart and 3 in Munich. Where should they meet if (a) they want the total distance traveled by all people to be minimal, (b) they want the average distance traveled per person to be minimal, or (c) they want the average squared distance to be minimal? What happens to the meeting points if one of the 3 people moves from Munich to Sydney?
3.4 (Locally Adaptive Loss Functions) Assume that the loss function c(x, y, f(x)) varies with x. What does this mean for the expected loss? Can you give a bound on the latter even if you know p(y|x) and f at every point, but know c only on a finite sample (hint: construct a counterexample)? How will things change if c cannot vary much with x?
3.5 (Transduction Error) Assume that we want to minimize the test error of misclassification R_test[f], given a training sample (x1, y1), …, (xm, ym), a test sample x'1, …, x'm, and a loss function c(x, y, f(x)). Show that any loss function c'(x', f(x')) on the test sample has to be symmetric in f, i.e., c'(x', f(x')) = c'(x', −f(x')). Prove that no non-constant convex function can satisfy this property. What does this mean for the practical solution of the optimization problem? See [267, 37, 211, 103] for details.
3.6 (Convexity and Uniqueness) Show that the problem of estimating a location parameter (a single scalar) has an interval [a, b] of equivalent global minima if the loss functions are convex. For non-convex loss functions, construct an example where this is not the case.
3.7 (Linearly Dependent Parameters) Show that in a linear model f = ∑_i α_i f_i on 𝒳 it is impossible to find a unique set of optimal parameters α_i if the functions f_i are not linearly independent. Does this have any effect on f itself?
3.8 (Ill-posed Problems) Assume you want to solve the problem Ax = y, where A is a symmetric positive definite matrix, i.e., a matrix with nonnegative eigenvalues. If you change y to y′, how much will the solution x′ of Ax′ = y′ differ from x? Give lower and upper bounds on this quantity. Hint: decompose y into the eigensystem of A.
3.9 (Fisher Map [258]) Show that the map

U_\theta(x) := I^{-1/2}\, \partial_\theta \ln p(x|\theta)    (3.66)

maps x into vectors with zero mean and unit variance. Chapter 13 will use this map to design kernels.
3.10 (Cramér-Rao Inequality for Multivariate Estimators) Prove equation (3.31). Hint: start by applying the Cauchy-Schwarz inequality to

\det \bar{E}\left[ \left( \hat\theta - \bar{E}\,\hat\theta \right) \left( T - \bar{E}\,T \right)^{\top} \right]    (3.67)

to obtain I and B, and compute the expected value coefficient-wise.
3.11 (Soft Margin Loss and Conditional Probabilities [521]) What is the conditional probability p(y|x) corresponding to the soft margin loss function c(x, y, f(x)) = max(0, 1 − y f(x))?