Learning with Kernels

Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola

Learning with Kernels
Support Vector Machines, Regularization, Optimization, and Beyond

Bernhard Schölkopf
Alexander J. Smola

The MIT Press
Cambridge, Massachusetts
London, England
© 2002 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Typeset by the authors using LaTeX 2e. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Learning with Kernels — Support Vector Machines, Regularization, Optimization and Beyond / by Bernhard Schölkopf, Alexander J. Smola.
p. cm.
Includes bibliographical references and index.
ISBN 0-262-19475-9 (alk. paper)
1. Machine learning. 2. Algorithms. 3. Kernel functions. I. Schölkopf, Bernhard. II. Smola, Alexander J.

To our parents

Contents

Series Foreword   xiii
Preface   xv

1  A Tutorial Introduction   1
   1.1  Data Representation and Similarity   1
   1.2  A Simple Pattern Recognition Algorithm   4
   1.3  Some Insights From Statistical Learning Theory   6
   1.4  Hyperplane Classifiers   11
   1.5  Support Vector Classification   15
   1.6  Support Vector Regression   17
   1.7  Kernel Principal Component Analysis   19
   1.8  Empirical Results and Implementations   21

I  CONCEPTS AND TOOLS   23

2  Kernels   25
   2.1  Product Features   26
   2.2  The Representation of Similarities in Linear Spaces   29
   2.3  Examples and Properties of Kernels   45
   2.4  The Representation of Dissimilarities in Linear Spaces   48
   2.5  Summary   55
   2.6  Problems   55

3  Risk and Loss Functions   61
   3.1  Loss Functions   62
   3.2  Test Error and Expected Risk   65
   3.3  A Statistical Perspective   68
   3.4  Robust Estimators   75
   3.5  Summary   83
   3.6  Problems   84

4  Regularization   87
   4.1  The Regularized Risk Functional   88
   4.2  The Representer Theorem   89
   4.3  Regularization Operators   92
   4.4  Translation Invariant Kernels   96
   4.5  Translation Invariant Kernels in Higher Dimensions   105
   4.6  Dot Product Kernels   110
   4.7  Multi-Output Regularization   113
   4.8  Semiparametric Regularization   115
   4.9  Coefficient Based Regularization   118
   4.10  Summary   121
   4.11  Problems   122

5  Elements of Statistical Learning Theory   125
   5.1  Introduction   125
   5.2  The Law of Large Numbers   128
   5.3  When Does Learning Work: the Question of Consistency   131
   5.4  Uniform Convergence and Consistency   131
   5.5  How to Derive a VC Bound   134
   5.6  A Model Selection Example   144
   5.7  Summary   146
   5.8  Problems   146

6  Optimization   149
   6.1  Convex Optimization   150
   6.2  Unconstrained Problems   154
   6.3  Constrained Problems   165
   6.4  Interior Point Methods   175
   6.5  Maximum Search Problems   179
   6.6  Summary   183
   6.7  Problems   184

II  SUPPORT VECTOR MACHINES   187

7  Pattern Recognition   189
   7.1  Separating Hyperplanes   189
   7.2  The Role of the Margin   192
   7.3  Optimal Margin Hyperplanes   196
   7.4  Nonlinear Support Vector Classifiers   200
   7.5  Soft Margin Hyperplanes   204
   7.6  Multi-Class Classification   211
   7.7  Variations on a Theme   214
   7.8  Experiments   215
   7.9  Summary   222
   7.10  Problems   222

8  Single-Class Problems: Quantile Estimation and Novelty Detection   227
   8.1  Introduction   228
   8.2  A Distribution's Support and Quantiles   229
   8.3  Algorithms   230
   8.4  Optimization   234
   8.5  Theory   236
   8.6  Discussion   241
   8.7  Experiments   243
   8.8  Summary   247
   8.9  Problems   248

9  Regression Estimation   251
   9.1  Linear Regression with Insensitive Loss Function   251
   9.2  Dual Problems   254
   9.3  ν-SV Regression   260
   9.4  Convex Combinations and ℓ1-Norms   266
   9.5  Parametric Insensitivity Models   269
   9.6  Applications   272
   9.7  Summary   273
   9.8  Problems   274

10  Implementation   279
   10.1  Tricks of the Trade   281
   10.2  Sparse Greedy Matrix Approximation   288
   10.3  Interior Point Algorithms   295
   10.4  Subset Selection Methods   300
   10.5  Sequential Minimal Optimization   305
   10.6  Iterative Methods   312
   10.7  Summary   327
   10.8  Problems   329

11  Incorporating Invariances   333
   11.1  Prior Knowledge   333
   11.2  Transformation Invariance   335
   11.3  The Virtual SV Method   337
   11.4  Constructing Invariance Kernels   343
   11.5  The Jittered SV Method   354
   11.6  Summary   356
   11.7  Problems   357

12  Learning Theory Revisited   359
   12.1  Concentration of Measure Inequalities   360
   12.2  Leave-One-Out Estimates   366
   12.3  PAC-Bayesian Bounds   381
   12.4  Operator-Theoretic Methods in Learning Theory   391
   12.5  Summary   403
   12.6  Problems   404

III  KERNEL METHODS   405

13  Designing Kernels   407
   13.1  Tricks for Constructing Kernels   408
   13.2  String Kernels   412
   13.3  Locality-Improved Kernels   414
   13.4  Natural Kernels   418
   13.5  Summary   423
   13.6  Problems   423

14  Kernel Feature Extraction   427
   14.1  Introduction   427
   14.2  Kernel PCA   429
   14.3  Kernel PCA Experiments   437
   14.4  A Framework for Feature Extraction   442
   14.5  Algorithms for Sparse KFA   447
   14.6  KFA Experiments   450
   14.7  Summary   451
   14.8  Problems   452

15  Kernel Fisher Discriminant   457
   15.1  Introduction   457
   15.2  Fisher's Discriminant in Feature Space   458
   15.3  Efficient Training of Kernel Fisher Discriminants   460
   15.4  Probabilistic Outputs   464
   15.5  Experiments   466
   15.6  Summary   467
   15.7  Problems   468

16  Bayesian Kernel Methods   469
   16.1  Bayesics   470
   16.2  Inference Methods   475
   16.3  Gaussian Processes   480
   16.4  Implementation of Gaussian Processes   488
   16.5  Laplacian Processes   499
   16.6  Relevance Vector Machines   506
   16.7  Summary   511
   16.8  Problems   513

17  Regularized Principal Manifolds   517
   17.1  A Coding Framework   518
   17.2  A Regularized Quantization Functional   522
   17.3  An Algorithm for Minimizing R_reg[f]   526
   17.4  Connections to Other Algorithms   529
   17.5  Uniform Convergence Bounds   533
   17.6  Experiments   537
   17.7  Summary   539
   17.8  Problems   540

18  Pre-Images and Reduced Set Methods   543
   18.1  The Pre-Image Problem   544
   18.2  Finding Approximate Pre-Images   547
   18.3  Reduced Set Methods   552
   18.4  Reduced Set Selection Methods   554
   18.5  Reduced Set Construction Methods   561
   18.6  Sequential Evaluation of Reduced Set Expansions   564
   18.7  Summary   566
   18.8  Problems   567

A  Addenda   569
   A.1  Data Sets   569
   A.2  Proofs   572

B  Mathematical Prerequisites   575
   B.1  Probability   575
   B.2  Linear Algebra   580
   B.3  Functional Analysis   586

References   591
Index   617
Notation and Symbols   625

Series Foreword
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many scientific and industrial fields. Recently, several research communities have converged on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high quality research and innovative applications.

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond is an excellent illustration of this convergence of ideas from many fields. The development of kernel-based learning methods has resulted from a combination of machine learning theory, optimization algorithms from operations research, and kernel techniques from mathematical analysis. These three ideas have spread far beyond the original support-vector machine algorithm: Virtually every learning algorithm has been redesigned to exploit the power of kernel methods. Bernhard Schölkopf and Alexander Smola have written a comprehensive, yet accessible, account of these developments. This volume includes all of the mathematical and algorithmic background needed not only to obtain a basic understanding of the material but to master it. Students and researchers who study this book will be able to apply kernel methods in creative ways to solve a wide range of problems in science and engineering.

Thomas Dietterich

Preface
One of the most fortunate situations a scientist can encounter is to enter a field in its infancy. There is a large choice of topics to work on, and many of the issues are conceptual rather than merely technical. Over the last seven years, we have had the privilege to be in this position with regard to the field of Support Vector Machines (SVMs). We began working on our respective doctoral dissertations in 1994 and 1996. Upon completion, we decided to combine our efforts and write a book about SVMs. Since then, the field has developed impressively, and has to an extent been transformed. We set up a website that quickly became the central repository for the new community, and a number of workshops were organized by various researchers. The scope of the field has now widened significantly, both in terms of new algorithms, such as kernel methods different to SVMs, and in terms of a deeper theoretical understanding being gained. It has become clear that kernel methods provide a framework for tackling some rather profound issues in machine learning theory. At the same time, successful applications have demonstrated that SVMs not only have a more solid foundation than artificial neural networks, but are able to serve as a replacement for neural networks that perform as well or better, in a wide variety of fields. Standard neural network and pattern recognition textbooks have now started including chapters on SVMs and kernel PCA (for instance, [235, 153]).
While these developments took place, we were trying to strike a balance between pursuing exciting new research, and making progress with the slowly growing manuscript of this book. In the two and a half years that we worked on the book, we faced a number of lessons that we suspect everyone writing a scientific monograph — or any other book — will encounter. First, writing a book is more work than you think, even with two authors sharing the work in equal parts. Second, our book got longer than planned. Once we exceeded the initially planned length of 500 pages, we got worried. In fact, the manuscript kept growing even after we stopped writing new chapters, and began polishing things and incorporating corrections suggested by colleagues. This was mainly due to the fact that the book deals with a fascinating new area, and researchers keep adding fresh material to the body of knowledge. We learned that there is no asymptotic regime in writing such a book — if one does not stop, it will grow beyond any bound — unless one starts cutting. We therefore had to take painful decisions to leave out material that we originally thought should be in the book. Sadly, and this is the third point, the book thus contains less material than originally planned, especially on the sub-
ject of theoretical developments. We sincerely apologize to all researchers who feel that their contributions should have been included — the book is certainly biased towards our own work, and does not provide a fully comprehensive overview of the field. We did, however, aim to provide all the necessary concepts and ideas to enable a reader equipped with some basic mathematical knowledge to enter the engaging world of machine learning, using theoretically well-founded kernel algorithms, and to understand and apply the powerful algorithms that have been developed over the last few years.
The book is divided into three logical parts. Each part consists of a brief introduction and a number of technical chapters. In addition, we include two appendices containing addenda, technical details, and mathematical prerequisites. Each chapter begins with a short discussion outlining the contents and prerequisites; for some of the longer chapters, we include a graph that sketches the logical structure and dependencies between the sections. At the end of most chapters, we include a set of problems, ranging from simple exercises to hard ones; in addition, we describe open problems and questions for future research.¹ The latter often represent worthwhile projects for a research publication, or even a thesis. References are also included in some of the problems. These references contain the solutions to the associated problems, or at least significant parts thereof.
The overall structure of the book is perhaps somewhat unusual. Rather than presenting a logical progression of chapters building upon each other, we occasionally touch on a subject briefly, only to revisit it later in more detail. For readers who are used to reading scientific monographs and textbooks from cover to cover, this will amount to some redundancy. We hope, however, that some readers, who are more selective in their reading habits (or less generous with their time), and only look at those chapters that they are interested in, will benefit. Indeed, nobody is expected to read every chapter. Some chapters are fairly technical, and cover material included for reasons of completeness. Other chapters, which are more relevant to the central subjects of the book, are kept simpler, and should be accessible to undergraduate students.
In a way, this book thus contains several books in one. For instance, the first chapter can be read as a standalone "executive summary" of Support Vector and kernel methods. This chapter should also provide a fast entry point for practitioners. Someone interested in applying SVMs to a pattern recognition problem might want to read Chapters 1 and 7 only. A reader thinking of building their own SVM implementation could additionally read Chapter 10, and parts of Chapter 6. Those who would like to get actively involved in research aspects of kernel methods, for example by "kernelizing" a new algorithm, should probably read at least Chapters 1 and 2. A one-semester undergraduate course on learning with kernels could include the material of Chapters 1, 2.1–2.3, 3.1–3.2, 5.1–5.2, 6.1–6.3, 7. If there is more time, one of the Chapters 14, 16, or 17 can be added, or 4.1–4.2. A graduate course could additionally deal with the more advanced parts of Chapters 3, 4, and 5. The remaining chapters provide ample material for specialized courses and seminars.

1. We suggest that authors post their solutions on the book website www.learning-with-kernels.org.
As a general time-saving rule, we recommend reading the first chapter and then jumping directly to the chapter of particular interest to the reader. Chances are that this will lead to a chapter that contains references to the earlier ones, which can then be followed as desired. We hope that this way, readers will inadvertently be tempted to venture into some of the less frequented chapters and research areas. Explore this book; there is a lot to find, and much more is yet to be discovered in the field of learning with kernels.
We conclude the preface by thanking those who assisted us in the preparation of the book. Our first thanks go to our first readers. Chris Burges, Arthur Gretton, and Bob Williamson have read through various versions of the book, and made numerous suggestions that corrected or improved the material. A number of other researchers have proofread various chapters. We would like to thank Matt Beal, Daniel Berger, Olivier Bousquet, Ben Bradshaw, Nicolò Cesa-Bianchi, Olivier Chapelle, Dennis DeCoste, Andre Elisseeff, Anita Faul, Arnulf Graf, Isabelle Guyon, Ralf Herbrich, Simon Hill, Dominik Janzing, Michael Jordan, Sathiya Keerthi, Neil Lawrence, Ben O'Loghlin, Ulrike von Luxburg, Davide Mattera, Sebastian Mika, Natasa Milic-Frayling, Marta Milo, Klaus Müller, Dave Musicant, Fernando Pérez Cruz, Ingo Steinwart, Mike Tipping, and Chris Williams.
In addition, a large number of people have contributed to this book in one way or another, be it by sharing their insights with us in discussions, or by collaborating with us on some of the topics covered in the book. In many places, this strongly influenced the presentation of the material. We would like to thank Dimitris Achlioptas, Luís Almeida, Shun-Ichi Amari, Peter Bartlett, Jonathan Baxter, Tony Bell, Shai Ben-David, Kristin Bennett, Matthias Bethge, Chris Bishop, Andrew Blake, Volker Blanz, Léon Bottou, Paul Bradley, Chris Burges, Heinrich Bülthoff, Olivier Chapelle, Nello Cristianini, Corinna Cortes, Cameron Dawson, Tom Dietterich, André Elisseeff, Oscar de Feo, Federico Girosi, Thore Graepel, Isabelle Guyon, Patrick Haffner, Stefan Harmeling, Paul Hayton, Markus Hegland, Ralf Herbrich, Tommi Jaakkola, Michael Jordan, Jyrki Kivinen, Yann LeCun, Chih-Jen Lin, Gabor Lugosi, Olvi Mangasarian, Laurent Massoulie, Sebastian Mika, Sayan Mukherjee, Klaus Müller, Noboru Murata, Nuria Oliver, John Platt, Tomaso Poggio, Gunnar Rätsch, Sami Romdhani, Rainer von Sachs, Christoph Schnörr, Matthias Seeger, John Shawe-Taylor, Kristy Sim, Patrice Simard, Stephen Smale, Sara Solla, Lionel Tarassenko, Lily Tian, Mike Tipping, Alexander Tsybakov, Lou van den Dries, Santosh Venkatesh, Thomas Vetter, Chris Watkins, Jason Weston, Chris Williams, Bob Williamson, Andreas Ziehe, Alex Zien, and Tong Zhang.
Next, we would like to extend our thanks to the research institutes that allowed us to pursue our research interests and to dedicate the time necessary for writing the present book; these are AT&T / Bell Laboratories (Holmdel), the Australian National University (Canberra), Biowulf Technologies (New York), GMD FIRST (Berlin), the Max-Planck-Institute for Biological Cybernetics (Tübingen), and Microsoft Research (Cambridge). We are grateful to Doug Sery from MIT Press for continuing support and encouragement during the writing of this book. We are, moreover, indebted to funding from various sources; specifically, from the Studienstiftung des deutschen Volkes, the Deutsche Forschungsgemeinschaft, the Australian Research Council, and the European Union.
Finally, special thanks go to Vladimir Vapnik, who introduced us to the fascinating world of statistical learning theory.
. . . the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classification and Sheep Vectors. . .
Illustration by Ana Martín Larrañaga

1  A Tutorial Introduction

Overview   Prerequisites

This chapter describes the central ideas of Support Vector (SV) learning in a nutshell. Its goal is to provide an overview of the basic concepts.
One such concept is that of a kernel. Rather than going immediately into mathematical detail, we introduce kernels informally as similarity measures that arise from a particular representation of patterns (Section 1.1), and describe a simple kernel algorithm for pattern recognition (Section 1.2). Following this, we report some basic insights from statistical learning theory, the mathematical theory that underlies SV learning (Section 1.3). Finally, we briefly review some of the main kernel algorithms, namely Support Vector Machines (SVMs) (Sections 1.4 to 1.6) and kernel principal component analysis (Section 1.7).
We have aimed to keep this introductory chapter as basic as possible, whilst giving a fairly comprehensive overview of the main ideas that will be discussed in the present book. After reading it, readers should be able to place all the remaining material in the book in context and judge which of the following chapters is of particular interest to them.
As a consequence of this aim, most of the claims in the chapter are not proven. Abundant references to later chapters will enable the interested reader to fill in the gaps at a later stage, without losing sight of the main ideas described presently.
1.1  Data Representation and Similarity

Training Data

One of the fundamental problems of learning theory is the following: suppose we are given two classes of objects. We are then faced with a new object, and we have to assign it to one of the two classes. This problem can be formalized as follows: we are given empirical data

    (x_1, y_1), \dots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}.    (1.1)

Here, \mathcal{X} is some nonempty set from which the patterns x_i (sometimes called cases, inputs, instances, or observations) are taken, usually referred to as the domain; the y_i are called labels, targets, outputs or sometimes also observations.¹ Note that there are only two classes of patterns. For the sake of mathematical convenience, they are labelled by +1 and -1, respectively. This is a particularly simple situation, referred to as (binary) pattern recognition or (binary) classification.

1. Note that we use the term pattern to refer to individual observations. A (smaller) part of the existing literature reserves the term for a generic prototype which underlies the data. The latter is probably closer to the original meaning of the term, however we decided to stick with the present usage, which is more common in the field of machine learning.

It should be emphasized that the patterns could be just about anything, and we have made no assumptions on \mathcal{X} other than it being a set. For instance, the task might be to categorize sheep into two classes, in which case the patterns x_i would simply be sheep.

In order to study the problem of learning, however, we need an additional type of structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern x \in \mathcal{X}, we want to predict the corresponding y \in \{\pm 1\}.² By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples (1.1). To this end, we need notions of similarity in \mathcal{X} and in \{\pm 1\}.

2. Doing this for every x \in \mathcal{X} amounts to estimating a function f : \mathcal{X} \to \{\pm 1\}.

Characterizing the similarity of the outputs \{\pm 1\} is easy: in binary classification, only two situations can occur: two labels can either be identical or different. The choice of the similarity measure for the inputs, on the other hand, is a deep question that lies at the core of the field of machine learning.

Let us consider a similarity measure of the form

    k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad (x, x') \mapsto k(x, x'),    (1.2)

that is, a function that, given two patterns x and x', returns a real number characterizing their similarity. Unless stated otherwise, we will assume that k is symmetric, that is, k(x, x') = k(x', x) for all x, x' \in \mathcal{X}. For reasons that will become clear later (cf. Remark 2.16), the function k is called a kernel [359, 4, 42, 62, 223].

Dot Product

General similarity measures of this form are rather difficult to study. Let us therefore start from a particularly simple case, and generalize it subsequently. A simple type of similarity measure that is of particular mathematical appeal is a dot product. For instance, given two vectors x, x' \in \mathbb{R}^N, the canonical dot product is defined as

    \langle x, x' \rangle := \sum_{i=1}^{N} [x]_i [x']_i.    (1.3)

Here, [x]_i denotes the ith entry of x.

Note that the dot product is also referred to as inner product or scalar product, and sometimes denoted with round brackets and a dot, as (x \cdot x') — this is where the "dot" in the name comes from. In Section B.2, we give a general definition of dot products. Usually, however, it is sufficient to think of dot products as (1.3).

Length   Feature Space

The geometric interpretation of the canonical dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized to length 1. Moreover, it allows computation of the length (or norm) of a vector x as

    \|x\| = \sqrt{\langle x, x \rangle}.    (1.4)

Likewise, the distance between two vectors is computed as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometric constructions that can be formulated in terms of angles, lengths and distances.

Note, however, that the dot product approach is not really sufficiently general to deal with many interesting problems.

First, we have deliberately not made the assumption that the patterns actually exist in a dot product space. So far, they could be any kind of object. In order to be able to use a dot product as a similarity measure, we therefore first need to represent the patterns as vectors in some dot product space \mathcal{H} (which need not coincide with \mathbb{R}^N). To this end, we use a map

    \Phi : \mathcal{X} \to \mathcal{H}, \quad x \mapsto \mathbf{x} := \Phi(x).    (1.5)

Second, even if the original patterns exist in a dot product space, we may still want to consider more general similarity measures obtained by applying a map (1.5). In that case, \Phi will typically be a nonlinear map. An example that we will consider in Chapter 2 is a map which computes products of entries of the input patterns.

In both the above cases, the space \mathcal{H} is called a feature space. Note that we have used a bold face x to denote the vectorial representation of x in the feature space. We will follow this convention throughout the book.

To summarize, embedding the data into \mathcal{H} via \Phi has three benefits:

1. It lets us define a similarity measure from the dot product in \mathcal{H},

    k(x, x') := \langle \mathbf{x}, \mathbf{x}' \rangle = \langle \Phi(x), \Phi(x') \rangle.    (1.6)

2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.

3. The freedom to choose the mapping \Phi will enable us to design a large variety of similarity measures and learning algorithms. This also applies to the situation where the inputs x_i already exist in a dot product space. In that case, we might directly use the dot product as a similarity measure. However, nothing prevents us from first applying a possibly nonlinear map \Phi to change the representation into one that is more suitable for a given problem. This will be elaborated in Chapter 2, where the theory of kernels is developed in more detail; a small numerical sketch follows below.
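The following is a minimal numerical sketch (our own, not taken from the book) of benefit 1: a nonlinear map that computes products of entries of the input — the Chapter 2 example mentioned above — turns the dot product in feature space (1.6) into a kernel that can also be evaluated directly from the input dot product. The helper names phi and k are arbitrary.

    import numpy as np

    def phi(x):
        # Product-feature map for x in R^2: all ordered products of two entries.
        return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

    def k(x, xp):
        # Kernel (1.6): the canonical dot product of the feature-space images.
        return float(np.dot(phi(x), phi(xp)))

    x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(k(x, xp))                    # 1.0
    print(float(np.dot(x, xp)) ** 2)   # 1.0: same value, computed without phi

The second print shows why such feature maps are attractive: the kernel value equals the squared input dot product, so the (possibly high-dimensional) feature space never has to be visited explicitly.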
1.2  A Simple Pattern Recognition Algorithm

Decision Function

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. We make use of the structure introduced in the previous section; that is, we assume that our data are embedded into a dot product space \mathcal{H}.³ Using the dot product, we can measure distances in this space. The basic idea of the algorithm is to assign a previously unseen pattern to the class with closer mean.

3. For the definition of a dot product space, see Section B.2.

We thus begin by computing the means of the two classes in feature space,

    \mathbf{c}_+ = \frac{1}{m_+} \sum_{\{i : y_i = +1\}} \mathbf{x}_i,    (1.7)

    \mathbf{c}_- = \frac{1}{m_-} \sum_{\{i : y_i = -1\}} \mathbf{x}_i,    (1.8)

where m_+ and m_- are the number of examples with positive and negative labels, respectively. We assume that both classes are non-empty, thus m_+, m_- > 0. We assign a new point \mathbf{x} to the class whose mean is closest (Figure 1.1). This geometric construction can be formulated in terms of the dot product \langle \cdot , \cdot \rangle. Half way between \mathbf{c}_+ and \mathbf{c}_- lies the point \mathbf{c} := (\mathbf{c}_+ + \mathbf{c}_-)/2. We compute the class of \mathbf{x} by checking whether the vector \mathbf{x} - \mathbf{c} connecting \mathbf{c} to \mathbf{x} encloses an angle smaller than \pi/2 with the vector \mathbf{w} := \mathbf{c}_+ - \mathbf{c}_- connecting the class means. This leads to

    y = \operatorname{sgn} \langle (\mathbf{x} - \mathbf{c}), \mathbf{w} \rangle
      = \operatorname{sgn} \langle (\mathbf{x} - (\mathbf{c}_+ + \mathbf{c}_-)/2), (\mathbf{c}_+ - \mathbf{c}_-) \rangle
      = \operatorname{sgn} ( \langle \mathbf{x}, \mathbf{c}_+ \rangle - \langle \mathbf{x}, \mathbf{c}_- \rangle + b ).    (1.9)

Here, we have defined the offset

    b := \frac{1}{2} \left( \|\mathbf{c}_-\|^2 - \|\mathbf{c}_+\|^2 \right),    (1.10)

with the norm \|\mathbf{x}\| := \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}. If the class means have the same distance to the origin, then b will vanish.

Figure 1.1  A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means \mathbf{c}_+, \mathbf{c}_- and assign a test pattern \mathbf{x} to the one whose mean is closer. This can be done by looking at the dot product between \mathbf{x} - \mathbf{c} (where \mathbf{c} = (\mathbf{c}_+ + \mathbf{c}_-)/2) and \mathbf{w} := \mathbf{c}_+ - \mathbf{c}_-, which changes sign as the enclosed angle passes through \pi/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to \mathbf{w}.

Note that (1.9) induces a decision boundary which has the form of a hyperplane (Figure 1.1); that is, a set of points that satisfy a constraint expressible as a linear equation.

It is instructive to rewrite (1.9) in terms of the input patterns x_i, using the kernel k to compute the dot products. Note, however, that (1.6) only tells us how to compute the dot products between vectorial representations \mathbf{x}_i of inputs x_i. We therefore need to express the vectors \mathbf{c}_+, \mathbf{c}_-, and \mathbf{w} in terms of \mathbf{x}_1, \dots, \mathbf{x}_m.

To this end, substitute (1.7) and (1.8) into (1.9) to get the decision function

    y = \operatorname{sgn}\left( \frac{1}{m_+} \sum_{\{i : y_i = +1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} \langle \mathbf{x}, \mathbf{x}_i \rangle + b \right)
      = \operatorname{sgn}\left( \frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i) + b \right).    (1.11)

Similarly, the offset becomes

    b := \frac{1}{2} \left( \frac{1}{m_-^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j) \right).    (1.12)

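To make the construction concrete, here is a small NumPy sketch (our own illustration, not the book's code) of the decision function (1.11) with the offset (1.12). With the plain dot product as the kernel it reduces exactly to the "closer class mean" rule above; all names are arbitrary.

    import numpy as np

    def simple_classifier(X, y, k):
        """Mean-based classifier of Section 1.2. X has shape (m, d), y contains +/-1,
        and k(x, x') is a kernel. Returns a decision function implementing (1.11)/(1.12)."""
        pos, neg = X[y == +1], X[y == -1]

        # Offset (1.12): half the difference of the average within-class kernel values.
        b = 0.5 * (np.mean([k(a, c) for a in neg for c in neg])
                   - np.mean([k(a, c) for a in pos for c in pos]))

        def decide(x):
            # Decision function (1.11): compare the average similarity to each class.
            return np.sign(np.mean([k(x, xi) for xi in pos])
                           - np.mean([k(x, xi) for xi in neg]) + b)

        return decide

    # Toy data: two point clouds; with the linear kernel this is exactly the
    # "assign to the class with the closer mean" rule.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
    y = np.array([+1] * 20 + [-1] * 20)
    f = simple_classifier(X, y, k=np.dot)
    print(f(np.array([1.5, 2.0])), f(np.array([-2.5, -1.0])))  # typically +1.0, -1.0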
Surprisingly, it turns out that this rather simple-minded approach contains a well-known statistical classification method as a special case. Assume that the class means have the same distance to the origin (hence b = 0, cf. (1.10)), and that k can be viewed as a probability density when one of its arguments is fixed. By this we mean that it is positive and has unit integral,⁴

    \int_{\mathcal{X}} k(x, x') \, dx = 1 \quad \text{for all } x' \in \mathcal{X}.    (1.13)

4. In order to state this assumption, we have to require that we can define an integral on \mathcal{X}.

Parzen Windows

In this case, (1.11) takes the form of the so-called Bayes classifier separating the two classes, subject to the assumption that the two classes of patterns were generated by sampling from two probability distributions that are correctly estimated by the Parzen windows estimators of the two class densities,

    p_+(x) := \frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i) \quad \text{and} \quad p_-(x) := \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i),    (1.14)

where x \in \mathcal{X}. Given some point x, the label is then simply computed by checking which of the two values p_+(x) or p_-(x) is larger, which leads directly to (1.11). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes.

The classifier (1.11) is quite close to the type of classifier that this book deals with in detail. Both take the form of kernel expansions on the input domain,

    y = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i k(x, x_i) + b \right).    (1.15)

In both cases, the expansions correspond to a separating hyperplane in a feature space. In this sense, the \alpha_i can be considered a dual representation of the hyperplane's normal vector [223]. Both classifiers are example-based in the sense that the kernels are centered on the training patterns; that is, one of the two arguments of the kernel is always a training pattern. A test point is classified by comparing it to all the training points that appear in (1.15) with a nonzero weight.

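The kernel expansion (1.15) is the central object of the book. The few lines below (our own illustration, with made-up weights) show how such a decision function is evaluated once weights alpha_i and an offset b have been obtained by some training procedure, and how training points with alpha_i = 0 drop out of the computation.

    import numpy as np

    def kernel_expansion(alpha, b, X_train, k):
        """Decision function (1.15): y = sgn(sum_i alpha_i * k(x, x_i) + b).
        Training points with alpha_i == 0 do not contribute; only the remaining
        points need to be kept."""
        keep = alpha != 0
        alpha_kept, X_kept = alpha[keep], X_train[keep]
        return lambda x: np.sign(sum(a * k(x, xi) for a, xi in zip(alpha_kept, X_kept)) + b)

    # Example with a Gaussian (RBF) kernel and arbitrary weights.
    rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
    alpha = np.array([0.5, 0.0, -0.5])   # the middle point plays no role
    f = kernel_expansion(alpha, b=0.0, X_train=X_train, k=rbf)
    print(f(np.array([0.2, -0.1])))      # +1.0: closest to the positively weighted point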
More sophisticated classification techniques, to be discussed in the remainder of the book, deviate from (1.11) mainly in the selection of the patterns on which the kernels are centered and in the choice of weights \alpha_i that are placed on the individual kernels in the decision function. It will no longer be the case that all training patterns appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform within the classes — recall that in the current example, cf. (1.11), the weights are either (1/m_+) or (-1/m_-), depending on the class to which the pattern belongs.

In the feature space representation, this statement corresponds to saying that we will study normal vectors w of decision hyperplanes that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (1.11)). The hyperplane will then only depend on a subset of training patterns called Support Vectors.
1.3  Some Insights From Statistical Learning Theory

With the above example in mind, let us now consider the problem of pattern recognition in a slightly more formal setting [559, 152, 186]. This will allow us to indicate the factors affecting the design of "better" algorithms. Rather than just providing tools to come up with new algorithms, we also want to provide some insight in how to do it in a promising way.

In two-class pattern recognition, we seek to infer a function

    f : \mathcal{X} \to \{\pm 1\}    (1.16)

from input-output training data (1.1). The training data are sometimes also called the sample.

Figure 1.2  2D toy example of binary classification, solved using three models (the decision boundaries are shown). The models vary in complexity, ranging from a simple one (left), which misclassifies a large number of points, to a complex one (right), which "trusts" each point and comes up with a solution that is consistent with all training points (but may not work well on new points). As an aside: the plots were generated using the so-called soft-margin SVM to be explained in Chapter 7; cf. also Figure 7.10.

Figure 1.2 shows a simple 2D toy example of a pattern recognition problem. The task is to separate the solid dots from the circles by finding a function which takes the value +1 on the dots and -1 on the circles. Note that instead of plotting this function, we may plot the boundaries where it switches between +1 and -1. In the rightmost plot, we see a classification function which correctly separates all training points. From this picture, however, it is unclear whether the same would hold true for test points which stem from the same underlying regularity. For instance, what should happen to a test point which lies close to one of the two "outliers," sitting amidst points of the opposite class? Maybe the outliers should not be allowed to claim their own custom-made regions of the decision function. To avoid this, we could try to go for a simpler model which disregards these points. The leftmost picture shows an almost linear separation of the classes. This separation, however, not only misclassifies the above two outliers, but also a number of "easy" points which are so close to the decision boundary that the classifier really should be able to get them right. Finally, the central picture represents a compromise, by using a model with an intermediate complexity, which gets most points right, without putting too much trust in any individual point.

The goal of statistical learning theory is to place these intuitive arguments in a mathematical framework. To this end, it studies mathematical properties of learning machines. These properties are usually properties of the function class that the learning machine can implement.

IID Data   Loss Function   Test Data   Empirical Risk   Risk

Figure 1.3  A 1D classification problem, with a training set of three points (marked by circles), and three test inputs (marked on the x-axis). Classification is performed by thresholding real-valued functions g(x) according to sgn(g(x)). Note that both functions (dotted line, and solid line) perfectly explain the training data, but they give opposite predictions on the test inputs. Lacking any further information, the training data alone give us no means to tell which of the two functions is to be preferred.

We assume that the data are generated independently from some unknown (but fixed) probability distribution P(x, y).⁵ This is a standard assumption in learning theory; data generated this way is commonly referred to as iid (independent and identically distributed). Our goal is to find a function f that will correctly classify unseen examples (x, y), so that f(x) = y for examples (x, y) that are also generated from P(x, y).⁶ Correctness of the classification is measured by means of the zero-one loss function c(x, y, f(x)) := \frac{1}{2} |f(x) - y|. Note that the loss is 0 if (x, y) is classified correctly, and 1 otherwise.

5. For a definition of a probability distribution, see Section B.1.1.
6. We mostly use the term example to denote a pair consisting of a training pattern x and the corresponding target y.

If we put no restriction on the set of functions from which we choose our estimated f, however, then even a function that does very well on the training data, e.g., by satisfying f(x_i) = y_i for all i = 1, \dots, m, might not generalize well to unseen examples. To see this, note that for each function f and any test set (\bar{x}_1, \bar{y}_1), \dots, (\bar{x}_{\bar{m}}, \bar{y}_{\bar{m}}) \in \mathcal{X} \times \{\pm 1\} satisfying \{\bar{x}_1, \dots, \bar{x}_{\bar{m}}\} \cap \{x_1, \dots, x_m\} = \emptyset, there exists another function f^* such that f^*(x_i) = f(x_i) for all i = 1, \dots, m, yet f^*(\bar{x}_i) \neq f(\bar{x}_i) for all i = 1, \dots, \bar{m} (cf. Figure 1.3). As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the two different sets of test label predictions) is preferable. We conclude that minimizing only the (average) training error (or empirical risk),

    R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} |f(x_i) - y_i|,    (1.17)

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

    R[f] = \int \frac{1}{2} \, |f(x) - y| \, dP(x, y).    (1.18)

The risk can be defined for any loss function, provided the integral exists. For the present zero-one loss function, the risk equals the probability of misclassification.⁷

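As a quick concrete check (our own illustration, not from the book), the empirical risk (1.17) under the zero-one loss is simply the fraction of misclassified training examples; the true risk (1.18) cannot be computed this way, since P(x, y) is unknown.

    import numpy as np

    def empirical_risk(f, X, y):
        """Remp[f] of (1.17): average zero-one loss 0.5*|f(x_i) - y_i| over the sample,
        i.e. the fraction of misclassified training points for labels in {+1, -1}."""
        predictions = np.array([f(x) for x in X])
        return np.mean(0.5 * np.abs(predictions - y))

    # A constant classifier on a toy sample: three of four labels are +1,
    # so predicting +1 everywhere has empirical risk 0.25.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([+1, +1, +1, -1])
    print(empirical_risk(lambda x: +1, X, y))   # 0.25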
Capacity   VC dimension   Shattering   VC Bound

Statistical learning theory (Chapter 5, [570, 559, 561, 136, 562, 14]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the set of functions from which f is chosen to one that has a capacity suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization [559].

The best-known capacity concept of VC theory is the VC dimension, defined as follows: each function of the class separates the patterns in a certain way and thus induces a certain labelling of the patterns. Since the labels are in \{\pm 1\}, there are at most 2^m different labellings for m patterns. A very rich function class might be able to realize all 2^m separations, in which case it is said to shatter the m points. However, a given class of functions might not be sufficiently rich to shatter the m points. The VC dimension is defined as the largest m such that there exists a set of m points which the class can shatter, and \infty if no such m exists. It can be thought of as a one-number summary of a learning machine's capacity (for an example, see Figure 1.4). As such, it is necessarily somewhat crude. More accurate capacity measures are the annealed VC entropy or the growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory. Another interesting capacity measure, which can be thought of as a scale-sensitive version of the VC dimension, is the fat shattering dimension [286, 6]. For further details, cf. Chapters 5 and 12.

7. The risk-based approach to machine learning has its roots in statistical decision theory [582, 166, 43]. In that context, f(x) is thought of as an action, and the loss function measures the loss incurred by taking action f(x) upon observing x when the true output (state of nature) is y.
Like many fields of statistics, decision theory comes in two flavors. The present approach is a frequentist one. It considers the risk as a function of the distribution P and the decision function f. The Bayesian approach considers parametrized families P_\Theta to model the distribution. Given a prior over \Theta (which need not in general be a finite-dimensional vector), the Bayes risk of a decision function f is the expected frequentist risk, where the expectation is taken over the prior. Minimizing the Bayes risk (over decision functions) then leads to a Bayes decision function. Bayesians thus act as if the parameter \Theta were actually a random variable whose distribution is known. Frequentists, who do not make this (somewhat bold) assumption, have to resort to other strategies for picking a decision function. Examples thereof are considerations like invariance and unbiasedness, both used to restrict the class of decision rules, and the minimax principle. A decision function is said to be minimax if it minimizes (over all decision functions) the maximal (over all distributions) risk. For a discussion of the relationship of these issues to VC theory, see Problem 5.9.

Figure 1.4  A simple VC dimension example. There are 2^3 = 8 ways of assigning 3 points to two classes. For the displayed points in \mathbb{R}^2, all 8 possibilities can be realized using separating hyperplanes, in other words, the function class can shatter 3 points. This would not work if we were given 4 points, no matter how we placed them. Therefore, the VC dimension of the class of separating hyperplanes in \mathbb{R}^2 is 3.

Whilst it will be difficult for the non-expert to appreciate the results of VC theory in this chapter, we will nevertheless briefly describe an example of a VC bound: if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, independent of the underlying distribution P generating the data, with a probability of at least 1 - \delta over the drawing of the training sample,⁸ the bound

    R[f] \leq R_{\mathrm{emp}}[f] + \phi(h, m, \delta)    (1.19)

holds, where the confidence term (or capacity term) \phi is defined as

    \phi(h, m, \delta) = \sqrt{ \frac{1}{m} \left( h \left( \ln \frac{2m}{h} + 1 \right) + \ln \frac{4}{\delta} \right) }.    (1.20)

8. Recall that each training example is generated from P(x, y), and thus the training data are subject to randomness.

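To get a feeling for the bound, here is a small sketch (ours, not the book's) that evaluates the confidence term (1.20) for a few sample sizes; it only becomes small once m is large compared to the VC dimension h.

    import numpy as np

    def confidence_term(h, m, delta):
        """Capacity term phi(h, m, delta) of (1.20):
        sqrt( (1/m) * ( h * (ln(2m/h) + 1) + ln(4/delta) ) )."""
        return np.sqrt((h * (np.log(2 * m / h) + 1) + np.log(4 / delta)) / m)

    # With delta = 0.05 and h = 10, the term shrinks as the number of training
    # examples m grows; the bound (1.19) is only nontrivial when this value is
    # well below 1.
    for m in (100, 1_000, 10_000):
        print(m, round(confidence_term(h=10, m=m, delta=0.05), 3))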
The bound (1.19) merits further explanation. Suppose we wanted to learn a "dependency" where patterns and labels are statistically independent, P(x, y) = P(x)P(y). In that case, the pattern x contains no information about the label y. If, moreover, the two classes +1 and -1 are equally likely, there is no way of making a good guess about the label of a test pattern.

Nevertheless, given a training set of finite size, we can always come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other, i.e., whenever two patterns are identical, then they must come with the same label). To reproduce the random labellings by correctly separating all training examples, however, this machine will necessarily require a large VC dimension h. Therefore, the confidence term (1.20), which increases monotonically with h, will be large, and the bound (1.19) will show that the small training error does not guarantee a small test error. This illustrates how the bound can apply independent of assumptions about the underlying distribution P(x, y): it always holds (provided that h < m), but it does not always make a nontrivial prediction. In order to get nontrivial predictions from (1.19), the function class must be restricted such that its capacity (e.g., VC dimension) is small enough (in relation to the available amount of data). At the same time, the class should be large enough to provide functions that are able to model the dependencies hidden in P(x, y). The choice of the set of functions is thus crucial for learning from data. In the next section, we take a closer look at a class of functions which is particularly interesting for pattern recognition problems.

1.4 Hyperplane Classifiers
|
||
|
||
Optimal Hyperplane

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced earlier). As described in the previous section, to design learning algorithms whose statistical effectiveness can be controlled, one needs to come up with a class of functions whose capacity can be computed. Vapnik et al. [573, 566, 570] considered the class of hyperplanes in some dot product space H,

<w, x> + b = 0,  where w ∈ H, b ∈ R,                                 (1.21)

corresponding to decision functions

f(x) = sgn(<w, x> + b),                                              (1.22)

and proposed a learning algorithm for problems which are separable by hyperplanes (sometimes said to be linearly separable), termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts. First (see Chapter 7), among all hyperplanes separating the data, there exists a unique optimal hyperplane, distinguished by the maximum margin of separation between any training point and the hyperplane. It is the solution of

maximize_{w,b}  min { ||x − x_i|| : x ∈ H, <w, x> + b = 0, i = 1, ..., m }.   (1.23)

Second (see Chapter 5), the capacity (as discussed in Section 1.3) of the class of separating hyperplanes decreases with increasing margin. Hence there are theoretical arguments supporting the good generalization performance of the optimal hyperplane, cf. Chapters 5, 7, 12. In addition, it is computationally attractive, since we will show below that it can be constructed by solving a quadratic programming problem for which efficient algorithms exist (see Chapters 6 and 10).

Note that the form of the decision function (1.22) is quite similar to our earlier example (1.9). The ways in which the classifiers are trained, however, are different. In the earlier example, the normal vector of the hyperplane was trivially computed from the class means as w = c_+ − c_−.
Figure 1.5 A binary classification toy problem: separate balls from diamonds. The optimal hyperplane (1.23) is shown as a solid line. The problem being separable, there exists a weight vector w and a threshold b such that y_i(<w, x_i> + b) > 0 (i = 1, ..., m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |<w, x_i> + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i(<w, x_i> + b) ≥ 1. Note that in this case, the margin (the distance of the closest point to the hyperplane) equals 1/||w||. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, that is, <w, x_1> + b = 1 and <w, x_2> + b = −1, and projecting them onto the hyperplane normal vector w/||w||. (In the figure, the two class boundaries are {x | <w, x> + b = +1} and {x | <w, x> + b = −1}, with the hyperplane {x | <w, x> + b = 0} in between; subtracting the two equations gives <w, x_1 − x_2> = 2, hence <w/||w||, x_1 − x_2> = 2/||w||.)
In the present case, we need to do some additional work to find the normal vector that leads to the largest margin. To construct the optimal hyperplane, we have to solve

minimize_{w,b}  τ(w) = (1/2) ||w||^2                                 (1.24)
subject to      y_i(<w, x_i> + b) ≥ 1  for all i = 1, ..., m.        (1.25)

Note that the constraints (1.25) ensure that f(x_i) will be +1 for y_i = +1, and −1 for y_i = −1. Now one might argue that for this to be the case, we don't actually need the "≥ 1" on the right hand side of (1.25). However, without it, it would not be meaningful to minimize the length of w: to see this, imagine we wrote "≥ 0" instead of "≥ 1." Now assume that the solution is (w, b). Let us rescale this solution by multiplication with some 0 < λ < 1. Since λ > 0, the constraints are still satisfied. Since λ < 1, however, the length of w has decreased. Hence (w, b) cannot be the minimizer of τ(w). The "≥ 1" on the right hand side of the constraints effectively fixes the scaling of w. In fact, any other positive number would do.

Let us now try to get an intuition for why we should be minimizing the length of w, as in (1.24). If ||w|| were 1, then the left hand side of (1.25) would equal the distance from x_i to the hyperplane (cf. (1.23)). In general, we have to divide y_i(<w, x_i> + b) by ||w|| to transform it into this distance. Hence, if we can satisfy (1.25) for all i = 1, ..., m with a w of minimal length, then the overall margin will be maximized.

A more detailed explanation of why this leads to the maximum margin hyperplane will be given in Chapter 7. A short summary of the argument is also given in Figure 1.5.

Lagrangian
KKT Conditions
The function τ in (1.24) is called the objective function, while (1.25) are called inequality constraints. Together, they form a so-called constrained optimization problem. Problems of this kind are dealt with by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian⁹

L(w, b, α) = (1/2) ||w||^2 − Σ_{i=1}^m α_i ( y_i(<x_i, w> + b) − 1 ).   (1.26)

The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i (in other words, a saddle point has to be found). Note that the constraint has been incorporated into the second term of the Lagrangian; it is not necessary to enforce it explicitly.

Let us try to get some intuition for this way of dealing with constrained optimization problems. If a constraint (1.25) is violated, then y_i(<w, x_i> + b) − 1 < 0, in which case L can be increased by increasing the corresponding α_i. At the same time, w and b will have to change such that L decreases. To prevent α_i ( y_i(<w, x_i> + b) − 1 ) from becoming an arbitrarily large negative number, the change in w and b will ensure that, provided the problem is separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities (that is, for which y_i(<w, x_i> + b) − 1 > 0), the corresponding α_i must be 0: this is the value of α_i that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker (KKT) complementarity conditions of optimization theory (Chapter 6).

The statement that at the saddle point, the derivatives of L with respect to the primal variables must vanish,

∂L(w, b, α)/∂b = 0  and  ∂L(w, b, α)/∂w = 0,                         (1.27)

leads to

Σ_{i=1}^m α_i y_i = 0                                                (1.28)

and

w = Σ_{i=1}^m α_i y_i x_i.                                           (1.29)

9. Henceforth, we use boldface Greek letters as a shorthand for corresponding vectors, e.g., α = (α_1, ..., α_m).
Support Vector
Dual Problem
Decision Function
Mechanical Analogy
The solution vector thus has an expansion (1.29) in terms of a subset of the training patterns, namely those patterns with non-zero α_i, called Support Vectors (SVs) (cf. (1.15) in the initial example). By the KKT conditions,

α_i ( y_i(<x_i, w> + b) − 1 ) = 0  for all i = 1, ..., m,            (1.30)

the SVs lie on the margin (cf. Figure 1.5). All remaining training examples (x_j, y_j) are irrelevant: their constraint y_j(<w, x_j> + b) ≥ 1 (cf. (1.25)) could just as well be left out, and they do not appear in the expansion (1.29). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 1.5) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

By substituting (1.28) and (1.29) into the Lagrangian (1.26), one eliminates the primal variables w and b, arriving at the so-called dual optimization problem, which is the problem that one usually solves in practice:

maximize_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j <x_i, x_j>   (1.31)
subject to  α_i ≥ 0 for all i = 1, ..., m,  and  Σ_{i=1}^m α_i y_i = 0.           (1.32)

Using (1.29), the hyperplane decision function (1.22) can thus be written as

f(x) = sgn ( Σ_{i=1}^m y_i α_i <x, x_i> + b ),                       (1.33)

where b is computed by exploiting (1.30) (for details, cf. Chapter 7).

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics (e.g., [206]). In the latter class of problem, it is also often the case that only a subset of constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant, and those walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [87]: if we assume that each SV x_i exerts a perpendicular force of size α_i and direction y_i · w/||w|| on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements for mechanical stability. The constraint (1.28) states that the forces on the sheet sum to zero, and (1.29) implies that the torques also sum to zero, via Σ_i x_i × y_i α_i w/||w|| = w × w/||w|| = 0.¹⁰ This mechanical analogy illustrates the physical meaning of the term Support Vector.

10. Here, the × denotes the vector (or cross) product, satisfying v × v = 0 for all v.
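The following Python sketch illustrates the dual problem (1.31)-(1.32) and the decision function (1.33) numerically. It is only a toy illustration under stated assumptions (a hand-made separable data set, SciPy's general-purpose SLSQP solver rather than a dedicated QP code), not the training procedure discussed in Chapter 10.

```python
# A minimal numerical sketch (not the book's code) of the hard-margin dual
# (1.31)-(1.32): maximize W(alpha) subject to alpha_i >= 0, sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy patterns
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}

G = (y[:, None] * y[None, :]) * (X @ X.T)        # G_ij = y_i y_j <x_i, x_j>

def neg_W(a):                                    # minimizing -W(alpha) maximizes W(alpha)
    return -(a.sum() - 0.5 * a @ G @ a)

m = len(y)
res = minimize(neg_W, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i, cf. (1.29)
sv = np.argmax(alpha)                            # index of a support vector (alpha_i > 0)
b = y[sv] - X[sv] @ w                            # from the KKT condition (1.30)

f = lambda x: np.sign(x @ w + b)                 # decision function (1.33)
print(alpha.round(3), w.round(3), round(b, 3), f(np.array([1.0, 1.0])))
```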
Figure 1.6 The idea of SVMs: map the training data into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (1.2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
1.5 Support Vector Classification
Decision Function

We now have all the tools to describe SVMs (Figure 1.6). Everything in the last section was formulated in a dot product space. We think of this space as the feature space H of Section 1.1. To express the formulas in terms of the input patterns in X, we thus need to employ (1.6), which expresses the dot product of bold face feature vectors x, x' in terms of the kernel k evaluated on input patterns x, x',

k(x, x') = <x, x'>.                                                  (1.34)

This substitution, which is sometimes referred to as the kernel trick, was used by Boser, Guyon, and Vapnik [62] to extend the Generalized Portrait hyperplane classifier to nonlinear Support Vector Machines. Aizerman, Braverman, and Rozonoér [4] called H the linearization space, and used it in the context of the potential function classification method to express the dot product between elements of H in terms of elements of the input space.

The kernel trick can be applied since all feature vectors only occurred in dot products (see (1.31) and (1.33)). The weight vector (cf. (1.29)) then becomes an expansion in feature space, and therefore will typically no longer correspond to the Φ-image of a single input space vector (cf. Chapter 18). We obtain decision functions of the form (cf. (1.33))

f(x) = sgn ( Σ_{i=1}^m y_i α_i <Φ(x), Φ(x_i)> + b ) = sgn ( Σ_{i=1}^m y_i α_i k(x, x_i) + b ),   (1.35)

and the following quadratic program (cf. (1.31)):

maximize_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)   (1.36)
subject to  α_i ≥ 0 for all i = 1, ..., m,  and  Σ_{i=1}^m α_i y_i = 0.            (1.37)
Soft Margin Hyperplane
Figure 1.7 Example of an SV classifier found using a radial basis function kernel k(x, x') = exp(−||x − x'||^2) (here, the input space is [−1, 1]^2). Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (1.25). Note that the SVs found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Gray values code |Σ_{i=1}^m y_i α_i k(x, x_i) + b|, the modulus of the argument of the decision function (1.35). The top and the bottom lines indicate places where it takes the value 1 (from [471]).
Figure 1.7 shows an example of this approach, using a Gaussian radial basis function kernel. We will later study the different possibilities for the kernel function in detail (Chapters 2 and 13).

In practice, a separating hyperplane may not exist, e.g., if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (1.25), one introduces slack variables [111, 561, 481]

ξ_i ≥ 0  for all i = 1, ..., m,                                      (1.38)

in order to relax the constraints (1.25) to

y_i(<w, x_i> + b) ≥ 1 − ξ_i  for all i = 1, ..., m.                  (1.39)

A classifier that generalizes well is then found by controlling both the classifier capacity (via ||w||) and the sum of the slacks Σ_i ξ_i. The latter can be shown to provide an upper bound on the number of training errors.

One possible realization of such a soft margin classifier is obtained by minimizing the objective function

τ(w, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^m ξ_i                            (1.40)

subject to the constraints (1.38) and (1.39), where the constant C > 0 determines the trade-off between margin maximization and training error minimization.¹¹ Incorporating a kernel, and rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (1.36), subject to the constraints

0 ≤ α_i ≤ C  for all i = 1, ..., m,  and  Σ_{i=1}^m α_i y_i = 0.     (1.41)

The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (1.35). The threshold b can be computed by exploiting the fact that for all SVs x_i with α_i < C, the slack variable ξ_i is zero (this again follows from the KKT conditions), and hence

Σ_{j=1}^m α_j y_j k(x_i, x_j) + b = y_i.                             (1.42)

Geometrically speaking, choosing b amounts to shifting the hyperplane, and (1.42) states that we have to shift the hyperplane such that the SVs with zero slack variables lie on the ±1 lines of Figure 1.5.

Another possible realization of a soft margin variant of the optimal hyperplane uses the more natural ν-parametrization. In it, the parameter C is replaced by a parameter ν ∈ (0, 1] which can be shown to provide lower and upper bounds for the fraction of examples that will be SVs and those that will have non-zero slack variables, respectively. It uses a primal objective function with the error term (1/(νm)) Σ_i ξ_i − ρ instead of C Σ_i ξ_i (cf. (1.40)), and separation constraints that involve a margin parameter ρ,

y_i(<w, x_i> + b) ≥ ρ − ξ_i  for all i = 1, ..., m,                  (1.43)

which itself is a variable of the optimization problem. The dual can be shown to consist in maximizing the quadratic part of (1.36), subject to 0 ≤ α_i ≤ 1/(νm), Σ_i α_i y_i = 0, and the additional constraint Σ_i α_i = 1. We shall return to these methods in more detail in Section 7.5.
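As a hedged illustration of the two soft margin parametrizations just discussed (the C of (1.40)-(1.41) versus the ν of (1.43)), the following sketch assumes scikit-learn is available; the data set and parameter values are arbitrary toy choices, not recommendations.

```python
# Illustration of the two soft margin parametrizations: SVC uses the constant C
# of (1.40), NuSVC uses the parameter nu bounding the fractions of SVs and
# margin errors. Data and parameters are toy assumptions.
import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

c_svm = SVC(kernel="rbf", C=1.0).fit(X, y)       # trade-off via C, cf. (1.40)-(1.41)
nu_svm = NuSVC(kernel="rbf", nu=0.2).fit(X, y)   # at most ~20% margin errors, at least ~20% SVs

print("C-SVM support vectors: ", len(c_svm.support_))
print("nu-SVM support vectors:", len(nu_svm.support_))   # roughly >= nu * m = 20
```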
1.6 Support Vector Regression
Let us turn to a problem slightly more general than pattern recognition. Rather than dealing with outputs y ∈ {±1}, regression estimation is concerned with estimating real-valued functions.

ε-Insensitive Loss

Figure 1.8 In SV regression, a tube with radius ε is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables ξ) is determined by minimizing (1.47).

To generalize the SV algorithm to the regression case, an analog of the soft margin is constructed in the space of the target values y (note that we now have y ∈ R) by using Vapnik's ε-insensitive loss function [561] (Figure 1.8, see Chapters 3 and 9). This quantifies the loss incurred by predicting f(x) instead of y as

c(x, y, f(x)) := |y − f(x)|_ε := max{0, |y − f(x)| − ε}.             (1.44)

11. It is sometimes convenient to scale the sum in (1.40) by C/m rather than C, as done in Chapter 7 below.
To estimate a linear regression

f(x) = <w, x> + b,                                                   (1.45)

one minimizes

(1/2) ||w||^2 + C Σ_{i=1}^m |y_i − f(x_i)|_ε.                        (1.46)

Note that the term ||w||^2 is the same as in pattern recognition (cf. (1.40)); for further details, cf. Chapter 9.
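The ε-insensitive loss (1.44) and the objective (1.46) are simple to state in code. The following minimal numpy sketch is an added illustration; the data, w, b, C, and ε below are arbitrary assumptions.

```python
# A minimal numpy sketch (an illustration, not the book's code) of the
# epsilon-insensitive loss (1.44) and the regularized objective (1.46) for a
# linear model f(x) = <w, x> + b.
import numpy as np

def eps_insensitive(y, f_x, eps):
    return np.maximum(0.0, np.abs(y - f_x) - eps)     # |y - f(x)|_eps, cf. (1.44)

def objective(w, b, X, y, C, eps):
    f_x = X @ w + b                                   # linear regression (1.45)
    return 0.5 * w @ w + C * eps_insensitive(y, f_x, eps).sum()   # cf. (1.46)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.0, 1.9, 3.2])
print(objective(np.array([1.0]), 0.0, X, y, C=10.0, eps=0.1))
```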
We can transform this into a constrained optimization problem by introducing slack variables, akin to the soft margin case. In the present case, we need two types of slack variable for the two cases f(x_i) − y_i > ε and y_i − f(x_i) > ε. We denote them by ξ and ξ*, respectively, and collectively refer to them as ξ(*).

The optimization problem is given by

minimize_{w ∈ H, ξ(*) ∈ R^m, b ∈ R}  τ(w, ξ, ξ*) = (1/2) ||w||^2 + C Σ_{i=1}^m (ξ_i + ξ_i*)   (1.47)
subject to  f(x_i) − y_i ≤ ε + ξ_i,                                  (1.48)
            y_i − f(x_i) ≤ ε + ξ_i*,                                 (1.49)
            ξ_i, ξ_i* ≥ 0  for all i = 1, ..., m.                    (1.50)

Note that according to (1.48) and (1.49), any error smaller than ε does not require a nonzero ξ_i or ξ_i*, and hence does not enter the objective function (1.47).

Regression Function
ν-SV Regression

Generalization to kernel-based regression estimation is carried out in an analogous manner to the case of pattern recognition. Introducing Lagrange multipliers, one arrives at the following optimization problem (for C > 0 chosen a priori):
maximize_{α, α* ∈ R^m}  W(α, α*) = −ε Σ_{i=1}^m (α_i* + α_i) + Σ_{i=1}^m (α_i* − α_i) y_i
                                   − (1/2) Σ_{i,j=1}^m (α_i* − α_i)(α_j* − α_j) k(x_i, x_j)   (1.51)
subject to  0 ≤ α_i, α_i* ≤ C for all i = 1, ..., m,  and  Σ_{i=1}^m (α_i − α_i*) = 0.        (1.52)

The regression estimate takes the form

f(x) = Σ_{i=1}^m (α_i* − α_i) k(x_i, x) + b,                         (1.53)

where b is computed using the fact that (1.48) becomes an equality with ξ_i = 0 if 0 < α_i < C, and (1.49) becomes an equality with ξ_i* = 0 if 0 < α_i* < C (for details, see Chapter 9). The solution thus looks quite similar to the pattern recognition case (cf. (1.35) and Figure 1.9).
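As a practical illustration of ε-SV regression as summarized in (1.51)-(1.53), the following sketch assumes scikit-learn; the RBF kernel, the values of C and ε, and the noisy sine data are toy assumptions.

```python
# A hedged sketch (assuming scikit-learn) of kernel SV regression: an RBF
# kernel, a tube of width epsilon, and upper bound C on the coefficients.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(60)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
# Points strictly inside the epsilon-tube get alpha_i = alpha_i* = 0 and are
# therefore not support vectors; only the remaining ones enter (1.53).
print("support vectors:", len(svr.support_), "of", len(X))
print("prediction at 0:", float(svr.predict([[0.0]])[0]))
```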
A number of extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on (w, ξ) (cf. (1.47)). There are multiple degrees of freedom for constructing it, including some freedom in how to penalize, or regularize. For instance, more general loss functions can be used for ξ, leading to problems that can still be solved efficiently ([512, 515], cf. Chapter 9). Moreover, norms other than the 2-norm can be used to regularize the solution (see Sections 4.9 and 9.4).
Finally, the algorithm can be modified such that ε need not be specified a priori. Instead, one specifies an upper bound 0 ≤ ν ≤ 1 on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs), and the corresponding ε is computed automatically. This is achieved by using as primal objective function

(1/2) ||w||^2 + C ( ν m ε + Σ_{i=1}^m |y_i − f(x_i)|_ε )             (1.54)

instead of (1.46), and treating ε ≥ 0 as a parameter over which we minimize. For more detail, cf. Section 9.3.
1.7 Kernel Principal Component Analysis
|
||
The kernel method for computing dot products in feature spaces is not restricted to SVMs. Indeed, it has been pointed out that it can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as principal component analysis (PCA) [480].
Kernel PCA
Eigenvalue Problem
Feature Extraction

Principal component analysis is perhaps the most common feature extraction algorithm; for details, see Chapter 14. The term feature extraction commonly refers to procedures for extracting (real) numbers from patterns which in some sense represent the crucial information contained in these patterns.

PCA in feature space leads to an algorithm called kernel PCA. By solving an eigenvalue problem, the algorithm computes nonlinear feature extraction functions

f_n(x) = Σ_{i=1}^m α_i^n k(x_i, x),                                  (1.55)

where, up to a normalizing constant, the α_i^n are the components of the nth eigenvector of the kernel matrix K_ij := (k(x_i, x_j)).
In a nutshell, this can be understood as follows. To do PCA in H, we wish to find eigenvectors v and eigenvalues λ of the so-called covariance matrix C in the feature space, where

C := (1/m) Σ_{i=1}^m Φ(x_i) Φ(x_i)^T.                                (1.56)

Here, Φ(x_i)^T denotes the transpose of Φ(x_i) (see Section B.2.1). In the case when H is very high dimensional, the computational costs of doing this directly are prohibitive. Fortunately, one can show that all solutions to

Cv = λv                                                              (1.57)
with λ ≠ 0 must lie in the span of Φ-images of the training data. Thus, we may expand the solution v as

v = Σ_{i=1}^m α_i Φ(x_i),                                            (1.58)

thereby reducing the problem to that of finding the α_i. It turns out that this leads to a dual eigenvalue problem for the expansion coefficients,

mλα = Kα,                                                            (1.59)

where α = (α_1, ..., α_m)^T.

To extract nonlinear features from a test point x, we compute the dot product between Φ(x) and the nth normalized eigenvector in feature space,

<v^n, Φ(x)> = Σ_{i=1}^m α_i^n k(x_i, x).                             (1.60)

Usually, this will be computationally far less expensive than taking the dot product in the feature space explicitly.

A toy example is given in Chapter 14 (Figure 14.4). As in the case of SVMs, the architecture can be visualized by Figure 1.9.
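The steps (1.55)-(1.60) translate almost directly into code. The following numpy sketch is an illustration only: the Gaussian kernel, the centering of the kernel matrix (treated properly in Chapter 14), and the normalization convention for the eigenvectors are assumptions made here, not prescriptions of the text.

```python
# A compact numpy sketch of kernel PCA, cf. (1.55)-(1.60). Centering and the
# exact normalization are simplifying assumptions of this illustration.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.randn(40, 2)                       # toy training patterns
m = len(X)

K = rbf_kernel(X, X)                       # K_ij = k(x_i, x_j)
H = np.eye(m) - np.ones((m, m)) / m
K_c = H @ K @ H                            # centered kernel matrix

lam, A = np.linalg.eigh(K_c)               # solves K alpha = (m lambda) alpha, cf. (1.59)
lam, A = lam[::-1] / m, A[:, ::-1]         # sort eigenvalues in decreasing order
A = A[:, lam > 1e-12] / np.sqrt(m * lam[lam > 1e-12])   # normalize the eigenvectors v^n

def features(x_test, n_components=2):      # feature extraction as in (1.60)
    return rbf_kernel(x_test, X) @ A[:, :n_components]

print(features(rng.randn(3, 2)))
```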
Figure 1.9 Architecture of SVMs and related kernel methods. The input x and the expansion patterns (SVs) x_1, ..., x_n (we assume that we are dealing with handwritten digits) are nonlinearly mapped (by Φ) into a feature space where dot products <Φ(x), Φ(x_i)> = k(x, x_i) are computed. Through the use of the kernel k, these two layers are in practice computed in one step. The results are linearly combined using weights υ_i, found by solving a quadratic program (in pattern recognition, υ_i = y_i α_i; in regression estimation, υ_i = α_i* − α_i) or an eigenvalue problem (Kernel PCA). The linear combination is fed into the function σ (in pattern recognition, σ(x) = sgn(x + b); in regression estimation, σ(x) = x + b; in Kernel PCA, σ(x) = x), so that the overall output is σ(Σ_i υ_i k(x, x_i)).
1.8 Empirical Results and Implementations
Examples of Kernels

Having described the basics of SVMs, we now summarize some empirical findings. By the use of kernels, the optimal margin classifier was turned into a high-performance classifier. Surprisingly, it was observed that the polynomial kernel

k(x, x') = <x, x'>^d,                                                (1.61)

the Gaussian

k(x, x') = exp( −||x − x'||^2 / (2 σ^2) ),                           (1.62)

and the sigmoid

k(x, x') = tanh( κ <x, x'> + Θ ),                                    (1.63)

with suitable choices of d ∈ N and κ, Θ ∈ R (here, X ⊂ R^N), empirically led to SV classifiers with very similar accuracies and SV sets (Section 7.8.2). In this sense, the SV set seems to characterize (or compress) the given task in a manner which to some extent is independent of the type of kernel (that is, the type of classifier) used, provided the kernel parameters are well adjusted.

Applications
Implementation
Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SVMs on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs (Chapter 7), and the Reduced Set method (Chapter 18) for speeding up classification. Using these procedures, SVMs soon became competitive with the best available classifiers on OCR and other object recognition tasks [87, 57, 419, 438, 134], and later even achieved the world record on the main handwritten digit benchmark dataset [134].
An initial weakness of SVMs, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem (Chapter 10) scaled with the number of support vectors. This was due to the fact that in (1.36), the quadratic part contained at least all SVs — the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that patterns initially not identified as SVs become SVs at a later stage. This procedure is referred to as chunking; note that without chunking, the size of the matrix in the quadratic part of the objective function would be m × m, where m is the number of all training examples.

What happens if we have a high-noise problem? In this case, many of the slack variables ξ_i become nonzero, and all the corresponding examples become SVs. For this case, decomposition algorithms were proposed [398, 409], based on the observation that not only can we leave out the non-SV examples (the x_i with α_i = 0) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (α_i = C). The chunks are usually dealt with using quadratic optimizers. Among the optimizers used for SVMs are LOQO [555], MINOS [380], and variants of conjugate gradient descent, such as the optimizers of Bottou [459] and Burges [85]. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on implementations, see Chapter 10.

Once the SV algorithm had been generalized to regression, researchers started applying it to various problems of estimating real-valued functions. Very good results were obtained on the Boston housing benchmark [529], and on problems of time series prediction (see [376, 371, 351]). Moreover, the SV method was applied to the solution of inverse function estimation problems ([572]; cf. [563, 589]). For overviews, the interested reader is referred to [85, 472, 504, 125].
I CONCEPTS AND TOOLS
The generic can be more intense than the concrete.
J. L. Borges¹

We now embark on a more systematic presentation of the concepts and tools underlying Support Vector Machines and other kernel methods.

In machine learning problems, we try to discover structure in data. For instance, in pattern recognition and regression estimation, we are given a training set (x_1, y_1), ..., (x_m, y_m) ∈ X × Y, and attempt to predict the outputs y for previously unseen inputs x. This is only possible if we have some measure that tells us how (x, y) is related to the training set. Informally, we want similar inputs to lead to similar outputs.² To formalize this, we have to state what we mean by similar.

A particularly simple yet surprisingly useful notion of similarity of inputs — the one we will use throughout this book — derives from embedding the data into a Euclidean feature space and utilizing geometrical concepts. Chapter 2 describes how certain classes of kernels induce feature spaces, and how one can compute dot products, and thus angles and distances, without having to explicitly work in these potentially infinite-dimensional spaces. This leads to a rather general class of similarity measures to be used on the inputs.

1. From A History of Eternity, in The Total Library, Penguin, London, 2001.
2. This procedure can be traced back to an old maxim of law: de similibus ad similia eadem ratione procedendum est — from things similar to things similar we are to proceed by the same rule.
On the outputs, similarity is usually measured in terms of a loss function stating how “bad” it is if the predicted y does not match the true one. The training of a learning machine commonly involves a risk functional that contains a term measuring the loss incurred for the training patterns. The concepts of loss and risk are introduced in depth in Chapter 3.
This is not the full story, however. In order to generalize well to the test data, it is not sufficient to “explain” the training data. It is also necessary to control the complexity of the model used for explaining the training data, a task that is often accomplished with the help of regularization terms, as explained in Chapter 4. Specifically, one utilizes objective functions that involve both the empirical loss term and a regularization term. From a statistical point of view, we can expect the function minimizing a properly chosen objective function to work well on test data, as explained by statistical learning theory (Chapter 5). From a practical point of view, however, it is not at all straightforward to find this minimizer. Indeed, the quality of a loss function or a regularizer should be assessed not only on a statistical basis but also in terms of the feasibility of the objective function minimization problem. In order to be able to assess this, and in order to obtain a thorough understanding of practical algorithms for this task, we conclude this part of the book with an in-depth review of optimization theory (Chapter 6).
The chapters in this part of the book assume familiarity with basic concepts of linear algebra and probability theory. Readers who would like to refresh their knowledge of these topics may want to consult Appendix B beforehand.
2  Kernels

Overview
Prerequisites
In Chapter 1, we described how a kernel arises as a similarity measure that can be thought of as a dot product in a so-called feature space. We tried to provide an intuitive understanding of kernels by introducing them as similarity measures, rather than immediately delving into the functional analytic theory of the classes of kernels that actually admit a dot product representation in a feature space.
In the present chapter, we will be both more formal and more precise. We will study the class of kernels k that correspond to dot products in feature spaces H via a map Φ,

Φ: X → H,  x ↦ x := Φ(x),                                            (2.1)

that is,

k(x, x') = <Φ(x), Φ(x')>.                                            (2.2)

Regarding the input domain X, we need not make assumptions other than it being a set. For instance, we could consider a set of discrete objects, such as strings.

A natural question to ask at this point is what kind of functions k(x, x') admit a representation of the form (2.2); that is, whether we can always construct a dot product space H and a map Φ mapping X into it such that (2.2) holds true. We shall begin, however, by trying to give some motivation as to why kernels are at all useful, considering kernels that compute dot products in spaces of monomial features (Section 2.1). Following this, we move on to the question of how, given a kernel, an associated feature space can be constructed (Section 2.2). This leads to the notion of a Reproducing Kernel Hilbert Space, crucial for the theory of kernel machines. In Section 2.3, we give some examples and properties of kernels, and in Section 2.4, we discuss a class of kernels that can be used as dissimilarity measures rather than as similarity measures.

The chapter builds on knowledge of linear algebra, as briefly summarized in Appendix B. Apart from that, it can be read on its own; however, readers new to the field will profit from first reading Sections 1.1 and 1.2.
[Chapter 2 roadmap: 2.1 Polynomial Kernels; 2.2.1 Positive Definite Kernels; 2.2.2, 2.2.3 RKHS Representation; 2.2.4, 2.2.5 Mercer Representation; 2.2.6, 2.2.7 Data Dependent Representation; 2.3 Examples and Properties of Kernels; 2.4 Conditionally Positive Definite Kernels]
2.1 Product Features
Monomial Features

In this section, we think of X as a subset of the vector space R^N (N ∈ N), endowed with the canonical dot product (1.3).

Suppose we are given patterns x where most information is contained in the dth order products (so-called monomials) of entries [x]_j of x,

[x]_{j1} · [x]_{j2} · ... · [x]_{jd},                                (2.3)

where j_1, ..., j_d ∈ {1, ..., N}. Often, these monomials are referred to as product features. These features form the basis of many practical algorithms; indeed, there is a whole field of pattern recognition research studying polynomial classifiers [484], which is based on first extracting product features and then applying learning algorithms to these features. In other words, the patterns are preprocessed by mapping into the feature space H of all products of d entries. This has proven quite effective in visual pattern recognition tasks, for instance. To understand the rationale for doing this, note that visual patterns are usually represented as vectors whose entries are the pixel intensities. Taking products of entries of these vectors then corresponds to taking products of pixel intensities, and is thus akin to taking logical "and" operations on the pixels. Roughly speaking, this corresponds to the intuition that, for instance, a handwritten "8" constitutes an eight if there is a top circle and a bottom circle. With just one of the two circles, it is not half an "8," but rather a "0." Nonlinearities of this type are crucial for achieving high accuracies in pattern recognition tasks.

Let us take a look at this feature map in the simple example of two-dimensional patterns, for which X = R^2. In this case, we can collect all monomial feature extractors of degree 2 in the nonlinear map

Φ: R^2 → H = R^3,                                                    (2.4)
   ([x]_1, [x]_2) ↦ ([x]_1^2, [x]_2^2, [x]_1 [x]_2).                 (2.5)
This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

N_H = ( N + d − 1  choose  d ) = (d + N − 1)! / ( d! (N − 1)! )      (2.6)

different monomials (2.3) of degree d, comprising a feature space H of dimension N_H. For instance, 16 × 16 pixel input images and a monomial degree d = 5 thus yield a dimension of almost 10^10.

Kernel
In certain cases described below, however, there exists a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into the spaces, by means of kernels nonlinear in the input space R^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimension.
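A two-line computation (an added illustration, not part of the original text) makes the growth of (2.6) concrete; the 16 × 16, d = 5 example is the one quoted above.

```python
# A quick check of the combinatorial growth (2.6): the number of monomials of
# degree d in N variables.
from math import comb

def n_monomials(N, d):
    # (N + d - 1 choose d) = (d + N - 1)! / (d! (N - 1)!), cf. (2.6)
    return comb(N + d - 1, d)

assert n_monomials(2, 2) == 3            # the three features of the map (2.5)
print(n_monomials(16 * 16, 5))           # 16x16 images, d = 5: about 10^10
```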
We now describe how dot products in polynomial feature spaces can be computed efficiently, followed by a section in which we discuss more general feature spaces. In order to compute dot products of the form <Φ(x), Φ(x')>, we employ kernel representations of the form

k(x, x') = <Φ(x), Φ(x')>,                                            (2.7)

which allow us to compute the value of the dot product in H without having to explicitly compute the map Φ.

What does k look like in the case of polynomial features? We start by giving an example for N = d = 2, as considered above [561]. For the map

Φ: ([x]_1, [x]_2) ↦ ([x]_1^2, [x]_2^2, [x]_1 [x]_2, [x]_2 [x]_1)     (2.8)

(note that for now, we have considered [x]_1[x]_2 and [x]_2[x]_1 as separate features; thus we are looking at ordered monomials), dot products in H take the form

<Φ(x), Φ(x')> = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = <x, x'>^2.   (2.9)

In other words, the desired kernel k is simply the square of the dot product in input space. The same works for arbitrary N, d ∈ N [62]: as a straightforward generalization of a result proved in the context of polynomial approximation [412, Lemma 2.1], we have:
Polynomial Kernel

Proposition 2.1 Define C_d to map x ∈ R^N to the vector C_d(x) whose entries are all possible dth degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, x') = <C_d(x), C_d(x')> = <x, x'>^d.                            (2.10)

Proof  We directly compute

<C_d(x), C_d(x')> = Σ_{j_1=1}^N ... Σ_{j_d=1}^N [x]_{j_1} ... [x]_{j_d} [x']_{j_1} ... [x']_{j_d}   (2.11)
                  = Σ_{j_1=1}^N [x]_{j_1}[x']_{j_1} ... Σ_{j_d=1}^N [x]_{j_d}[x']_{j_d}
                  = ( Σ_{j=1}^N [x]_j [x']_j )^d = <x, x'>^d.
Note that we used the symbol C_d for the feature map. The reason for this is that we would like to reserve Φ_d for the corresponding map computing unordered product features. Let us construct such a map Φ_d, yielding the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective entries of Φ_d with the square roots of their numbers of occurrence. Then, by this construction of Φ_d, and (2.10),

<Φ_d(x), Φ_d(x')> = <C_d(x), C_d(x')> = <x, x'>^d.                   (2.12)

For instance, if n of the j_i in (2.3) are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ_d is sqrt((d − n + 1)!). For the general case, see Problem 2.2. For Φ_2, this simply means that [561]

Φ_2(x) = ([x]_1^2, [x]_2^2, sqrt(2) [x]_1 [x]_2).                    (2.13)
The above reasoning illustrates an important point pertaining to the construction of feature spaces associated with kernel functions. Although they map into different feature spaces, Φ_d and C_d are both valid instantiations of feature maps for k(x, x') = <x, x'>^d.

To illustrate how monomial feature kernels can significantly simplify pattern recognition tasks, let us consider a simple toy example.

Toy Example

Example 2.2 (Monomial Features in 2-D Pattern Recognition) In the example of Figure 2.1, a non-separable problem is reduced to the construction of a separating hyperplane by preprocessing the input data with Φ_2. As we shall see in later chapters, this has advantages both from the computational point of view (there exist efficient algorithms for computing the hyperplane) and from the statistical point of view (there exist guarantees for how well the hyperplane will generalize to unseen test points).

In more realistic cases, e.g., if x represents an image with the entries being pixel values, polynomial kernels <x, x'>^d enable us to work in the space spanned by products of any d pixel values — provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern Φ_d(x). Using kernels of the form (2.10), we can take higher-order statistics into account, without the combinatorial explosion (2.6) of time and memory complexity which accompanies even moderately high N and d.

To conclude this section, note that it is possible to modify (2.10) such that it maps into the space of all monomials up to degree d, by defining k(x, x') = (<x, x'> + 1)^d (Problem 2.17). Moreover, in practice, it is often useful to multiply the kernel by a scaling factor c to ensure that its numeric range is within some bounded interval, say [−1, 1]. The value of c will depend on the dimension and range of the data.
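The following numpy check (an illustration, not the book's code) confirms numerically that the explicit map Φ_2 of (2.13) and the kernel (2.10) agree, and shows the variant (<x, x'> + 1)^d mentioned above; the random test points are arbitrary.

```python
# Verify that the polynomial kernel computes the dot product of the explicit
# monomial features: for d = 2 and N = 2, <Phi_2(x), Phi_2(x')> from (2.13)
# equals <x, x'>^2 from (2.10).
import numpy as np

def phi2(x):                                # unordered degree-2 features, cf. (2.13)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, xp, d=2, c=0.0):         # (<x, x'> + c)^d; c = 1 gives all monomials up to degree d
    return (x @ xp + c) ** d

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)
assert np.allclose(phi2(x) @ phi2(xp), poly_kernel(x, xp))
print(poly_kernel(x, xp), poly_kernel(x, xp, d=5, c=1.0))
```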
Figure 2.1 Toy example of a binary classification problem mapped into feature space. We assume that the true decision boundary is an ellipse in input space (left panel). The task of the learning process is to estimate this boundary based on empirical data consisting of training points in both classes (crosses and circles, respectively). When mapped into feature space via the nonlinear map Φ_2(x) = (z_1, z_2, z_3) = ([x]_1^2, [x]_2^2, sqrt(2) [x]_1 [x]_2) (right panel), the ellipse becomes a hyperplane (in the present simple case, it is parallel to the z_3 axis, hence all points are plotted in the (z_1, z_2) plane). This is due to the fact that ellipses can be written as linear equations in the entries of (z_1, z_2, z_3). Therefore, in feature space, the problem reduces to that of estimating a hyperplane from the mapped data points. Note that via the polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space can be computed without computing Φ_2. Later in the book, we shall describe algorithms for constructing hyperplanes which are based on dot products (Chapter 7).
2.2 The Representation of Similarities in Linear Spaces
In what follows, we will look at things the other way round, and start with the kernel rather than with the feature map. Given some kernel, can we construct a feature space such that the kernel computes the dot product in that feature space; that is, such that (2.2) holds? This question has been brought to the attention of the machine learning community in a variety of contexts, especially during recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the theory of kernels is the book of Berg, Christensen, and Ressel [42]; indeed, a large part of the material in the present chapter is based on this work. We do not aim to be fully rigorous; instead, we try to provide insight into the basic ideas. As a rule, all the results that we state without proof can be found in [42]. Other standard references include [16, 455].

There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data, and the domain X was assumed to be a subset of R^N. By contrast, the results in the current section hold for data drawn from domains which need no structure, other than their being nonempty sets. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available, and where one directly works with pairwise distances or similarities between non-vectorial objects [246, 467, 154, 210, 234, 585]. This theme will recur in several places throughout the book, for instance in Chapter 13.
2.2.1 Positive Definite Kernels
We start with some basic definitions and results. As in the previous chapter, indices i and j are understood to run over 1, ..., m.

Gram Matrix

Definition 2.3 (Gram Matrix) Given a function k : X^2 → K (where K = C or K = R) and patterns x_1, ..., x_m ∈ X, the m × m matrix K with elements

K_ij := k(x_i, x_j)                                                  (2.14)

is called the Gram matrix (or kernel matrix) of k with respect to x_1, ..., x_m.

PD Matrix

Definition 2.4 (Positive Definite Matrix) A complex m × m matrix K satisfying

Σ_{i,j} c_i c̄_j K_ij ≥ 0                                             (2.15)

for all c_i ∈ C is called positive definite.¹ Similarly, a real symmetric m × m matrix K satisfying (2.15) for all c_i ∈ R is called positive definite.

Note that a symmetric matrix is positive definite if and only if all its eigenvalues are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as the quadratic form induced by K.
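As a numerical illustration of these definitions (under the assumption of a Gaussian kernel and random points; not part of the original text), one can build a Gram matrix and inspect its eigenvalues.

```python
# The Gram matrix of the Gaussian kernel on random points is symmetric and,
# up to numerical precision, has only nonnegative eigenvalues.
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix (2.14)

assert np.allclose(K, K.T)                       # symmetry, cf. (2.18)
print(np.linalg.eigvalsh(K).min())               # >= 0 up to round-off, cf. (2.15)
```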
PD Kernel

Definition 2.5 ((Positive Definite) Kernel) Let X be a nonempty set. A function k on X × X which for all m ∈ N and all x_1, ..., x_m ∈ X gives rise to a positive definite Gram matrix is called a positive definite (pd) kernel. Often, we shall refer to it simply as a kernel.
Remark 2.6 (Terminology) The term kernel stems from the first use of this type of function in the field of integral operators as studied by Hilbert and others [243, 359, 112]. A function k which gives rise to an operator T_k via

(T_k f)(x) = ∫_X k(x, x') f(x') dx'                                  (2.16)

is called the kernel of T_k.

In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, admissible kernel, Support Vector kernel, nonnegative definite kernel, and covariance function. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is sometimes reserved for the case where equality in (2.15) only occurs if c_1 = ... = c_m = 0.

1. The bar in c̄_j denotes complex conjugation; for real numbers, it has no effect.
Simply using the term positive kernel, on the other hand, could be mistaken as referring to a kernel whose values are positive. Finally, the term positive semidefinite kernel becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow the convention used for instance in [42], and employ the term positive definite both for kernels and matrices in the way introduced above. The case where the value 0 is only attained if all coefficients are 0 will be referred to as strictly positive definite.
We shall mostly use the term kernel. Whenever we want to refer to a kernel k(x, x') which is not positive definite in the sense stated above, it will be clear from the context.

The definitions for positive definite kernels and positive definite matrices differ in the fact that in the former case, we are free to choose the points on which the kernel is evaluated — for every choice, the kernel induces a positive definite matrix.

Positive definiteness implies positivity on the diagonal (Problem 2.12),

k(x, x) ≥ 0  for all x ∈ X,                                          (2.17)

and symmetry (Problem 2.13),

k(x_i, x_j) = k̄(x_j, x_i).                                           (2.18)

To also cover the complex-valued case, our definition of symmetry includes complex conjugation. The definition of symmetry of matrices is analogous; that is, K_ij = K̄_ji.

Real-Valued Kernels

For real-valued kernels, it is not sufficient to stipulate that (2.15) hold for real coefficients c_i. To get away with real coefficients only, we must additionally require that the kernel be symmetric (Problem 2.14), k(x_i, x_j) = k(x_j, x_i) (cf. Problem 2.13).

It can be shown that whenever k is a (complex-valued) positive definite kernel, its real part is a (real-valued) positive definite kernel. Below, we shall largely be dealing with real-valued kernels. Most of the results, however, also apply for complex-valued kernels.

Kernels can be regarded as generalized dot products. Indeed, any dot product is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard property of dot products, does not carry over to general kernels. However, another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:
Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If k is a positive definite kernel, and x_1, x_2 ∈ X, then

|k(x_1, x_2)|^2 ≤ k(x_1, x_1) · k(x_2, x_2).                         (2.19)

Proof  For the sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The 2 × 2 Gram matrix with entries K_ij = k(x_i, x_j) (i, j ∈ {1, 2}) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, the determinant of K. Therefore

0 ≤ K_11 K_22 − K_12 K_21 = K_11 K_22 − K_12 K̄_12 = K_11 K_22 − |K_12|^2.   (2.20)

Substituting k(x_i, x_j) for K_ij, we get the desired inequality.
Feature Map
Vector Space

Figure 2.2 One instantiation of the feature map associated with a kernel is the map (2.21), which represents each pattern (in the picture, x or x') by a kernel-shaped function sitting on the pattern. In this sense, each pattern is represented by its similarity to all other patterns. In the picture, the kernel is assumed to be bell-shaped, e.g., a Gaussian k(x, x') = exp(−||x − x'||^2 / (2 σ^2)). In the text, we describe the construction of a dot product <·,·> on the function space such that k(x, x') = <Φ(x), Φ(x')>.

We now show how the feature spaces in question are defined by the choice of kernel function.
2.2.2 The Reproducing Kernel Map
Assume that k is a real-valued positive definite kernel, and X a nonempty set. We define a map from X into the space of functions mapping X into R, denoted as R^X := {f : X → R}, via

Φ: X → R^X,  x ↦ k(·, x).                                            (2.21)

Here, Φ(x) denotes the function that assigns the value k(x', x) to x' ∈ X, i.e., Φ(x)(·) = k(·, x) (as shown in Figure 2.2).

We have thus turned each pattern into a function on the domain X. In this sense, a pattern is now represented by its similarity to all other points in the input domain X. This seems a very rich representation; nevertheless, it will turn out that the kernel allows the computation of the dot product in this representation. Below, we show how to construct a feature space associated with Φ, proceeding in the following steps:

1. Turn the image of Φ into a vector space,
2. define a dot product, that is, a strictly positive definite bilinear form, and
3. show that the dot product satisfies k(x, x') = <Φ(x), Φ(x')>.

We begin by constructing a dot product space containing the images of the input patterns under Φ. To this end, we first need to define a vector space. This is done by taking linear combinations of the form

f(·) = Σ_{i=1}^m α_i k(·, x_i).                                      (2.22)

Here, m ∈ N, α_i ∈ R, and x_1, ..., x_m ∈ X are arbitrary.
Dot Product
Reproducing Kernel
Next, we define a dot product between f and another function

g(·) = Σ_{j=1}^{m'} β_j k(·, x'_j),                                  (2.23)

where m' ∈ N, β_j ∈ R and x'_1, ..., x'_{m'} ∈ X, as

<f, g> := Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j).               (2.24)
This expression explicitly contains the expansion coefficients, which need not be unique. To see that it is nevertheless well-defined, note that

<f, g> = Σ_{j=1}^{m'} β_j f(x'_j),                                   (2.25)

using k(x'_j, x_i) = k(x_i, x'_j). The sum in (2.25), however, does not depend on the particular expansion of f. Similarly, for g, note that

<f, g> = Σ_{i=1}^m α_i g(x_i).                                       (2.26)

The last two equations also show that <·,·> is bilinear. It is symmetric, as <f, g> = <g, f>. Moreover, it is positive definite, since positive definiteness of k implies that for any function f, written as (2.22), we have

<f, f> = Σ_{i,j=1}^m α_i α_j k(x_i, x_j) ≥ 0.                        (2.27)
The latter implies that <·,·> is actually itself a positive definite kernel, defined on our space of functions. To see this, note that given functions f_1, ..., f_n, and coefficients γ_1, ..., γ_n ∈ R, we have

Σ_{i,j=1}^n γ_i γ_j <f_i, f_j> = < Σ_{i=1}^n γ_i f_i, Σ_{j=1}^n γ_j f_j > ≥ 0.   (2.28)
Here, the left hand equality follows from the bilinearity of <·,·>, and the right hand inequality from (2.27). For the last step in proving that it qualifies as a dot product, we will use the following interesting property of Φ, which follows directly from the definition: for all functions (2.22), we have

<k(·, x), f> = f(x)                                                  (2.29)

— k is the representer of evaluation. In particular,

<k(·, x), k(·, x')> = k(x, x').                                      (2.30)

By virtue of these properties, positive definite kernels k are also called reproducing kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have

|f(x)|^2 = |<k(·, x), f>|^2 ≤ k(x, x) · <f, f>.                      (2.31)
Therefore, <f, f> = 0 directly implies f = 0, which is the last property that required proof in order to establish that <·,·> is a dot product (cf. Section B.2).

The case of complex-valued kernels can be dealt with using the same construction; in that case, we will end up with a complex dot product space [42].

The above reasoning has shown that any positive definite kernel can be thought of as a dot product in another space: in view of (2.21), the reproducing kernel property (2.30) amounts to

<Φ(x), Φ(x')> = k(x, x').                                            (2.32)

Therefore, the dot product space constructed in this way is one possible instantiation of the feature space associated with a kernel.

Kernels from Feature Maps
Equivalent Definition of PD Kernels
Kernel Trick
Above, we have started with the kernel, and constructed a feature map. Let us now consider the opposite direction. Whenever we have a mapping Φ from X into a dot product space, we obtain a positive definite kernel via k(x, x') := <Φ(x), Φ(x')>. This can be seen by noting that for all c_i ∈ R, x_i ∈ X, i = 1, ..., m, we have

Σ_{i,j} c_i c_j k(x_i, x_j) = < Σ_i c_i Φ(x_i), Σ_j c_j Φ(x_j) > = || Σ_i c_i Φ(x_i) ||^2 ≥ 0,   (2.33)
due to the nonnegativity of the norm. This has two consequences. First, it allows us to give an equivalent definition of positive definite kernels as functions with the property that there exists a map Φ into a dot product space such that (2.32) holds true. Second, it allows us to construct kernels from feature maps. For instance, it is in this way that powerful linear representations of 3D heads proposed in computer graphics [575, 59] give rise to kernels. The identity (2.32) forms the basis for the kernel trick:
Remark 2.8 ("Kernel Trick") Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k by another positive definite kernel k̃.

In view of the material in the present section, the justification for this procedure is the following: effectively, the original algorithm can be thought of as a dot product based algorithm operating on vectorial data Φ(x_1), ..., Φ(x_m). The algorithm obtained by replacing k by k̃ then is exactly the same dot product based algorithm, only that it operates on Φ̃(x_1), ..., Φ̃(x_m).
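As a small illustration of Remark 2.8 (with two assumed kernels; not the book's code), consider an "algorithm" that only needs feature-space distances: since ||Φ(x) − Φ(x')||^2 = k(x, x) − 2k(x, x') + k(x', x'), any positive definite kernel can be substituted without changing the algorithm itself.

```python
# Kernel trick in miniature: an algorithm written purely in terms of k -- here,
# distances in feature space -- runs unchanged with any pd kernel plugged in.
import numpy as np

def feature_distance(x, xp, k):
    # ||Phi(x) - Phi(x')||^2 = k(x,x) - 2 k(x,x') + k(x',x'), using only (2.32)
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)

linear = lambda x, xp: float(np.dot(x, xp))                       # dot product kernel
gauss  = lambda x, xp: float(np.exp(-np.sum((x - xp) ** 2) / 2))  # Gaussian kernel

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_distance(x, xp, linear))   # 2.0: squared distance in input space
print(feature_distance(x, xp, gauss))    # distance between the mapped patterns
```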
The best known application of the kernel trick is in the case where k is the dot product in the input domain (cf. Problem 2.5). The trick is not limited to that case, however: k and k˜ can both be nonlinear kernels. In general, care must be exercised in determining whether the resulting algorithm will be useful: sometimes, an algorithm will only work subject to additional conditions on the input data, e.g., the data set might have to lie in the positive orthant. We shall later see that certain kernels induce feature maps which enforce such properties for the mapped data (cf. (2.73)), and that there are algorithms which take advantage of these aspects (e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel
|
||
|
||
2.2 The Representation of Similarities in Linear Spaces
|
||
|
||
35
|
||
|
||
Historical Remarks
|
||
|
||
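To make the kernel trick and the identity (2.33) concrete, here is a small Python sketch. It is an illustration only: the two kernel functions, the toy data, and the parameter sigma are arbitrary assumptions, not part of the text. Any algorithm touching the data only through Gram matrices can have one positive definite kernel swapped for another.

import numpy as np

# Two positive definite kernels; swapping one for the other "kernelizes"
# any algorithm that touches the data only through dot products.
def linear_kernel(x, y):
    return float(np.dot(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def gram_matrix(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for an array of inputs X."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # six toy points in R^3 (illustrative)
c = rng.normal(size=6)               # arbitrary coefficients c_i

for kernel in (linear_kernel, gaussian_kernel):
    K = gram_matrix(kernel, X)
    # (2.33): sum_ij c_i c_j k(x_i, x_j) = ||sum_i c_i Phi(x_i)||^2 >= 0.
    # Zero eigenvalues may show up as tiny negative numbers due to round-off.
    quad_form = c @ K @ c
    print(kernel.__name__, "quadratic form:", round(quad_form, 6),
          "min eigenvalue:", round(float(np.linalg.eigvalsh(K).min()), 6))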
Historical Remarks

Even though the kernel trick had been used in the literature for a fair amount of time [4, 62], it took until the mid 1990s before it was explicitly stated that any algorithm that only depends on dot products, i.e., any algorithm that is rotationally invariant, can be kernelized [479, 480]. Since then, a number of algorithms have benefitted from the kernel trick, such as the ones described in the present book, as well as methods for clustering in feature spaces [479, 215, 199].

Moreover, the machine learning community took time to comprehend that the definition of kernels on general sets (rather than dot product spaces) greatly extends the applicability of kernel methods [467], to data types such as texts and other sequences [234, 585, 23]. Indeed, this is now recognized as a crucial feature of kernels: they lead to an embedding of general data types in linear spaces.

Not surprisingly, the history of methods for representing kernels in linear spaces (in other words, the mathematical counterpart of the kernel trick) dates back significantly further than their use in machine learning. The methods appear to have first been studied in the 1940s by Kolmogorov [304] for countable 𝒳 and by Aronszajn [16] in the general case. Pioneering work on linear representations of a related class of kernels, to be described in Section 2.4, was done by Schoenberg [465]. Further bibliographical comments can be found in [42].

We thus see that the mathematical basis for kernel algorithms has been around for a long time. As is often the case, however, the practical importance of mathematical results was initially underestimated.²
2.2.3 Reproducing Kernel Hilbert Spaces

In the last section, we described how to define a space of functions which is a valid realization of the feature spaces associated with a given kernel. To do this, we had to make sure that the space is a vector space, and that it is endowed with a dot product. Such spaces are referred to as dot product spaces (cf. Appendix B), or equivalently as pre-Hilbert spaces. The reason for the latter is that one can turn them into Hilbert spaces (cf. Section B.3) by a fairly simple mathematical trick. This additional structure has some mathematical advantages. For instance, in Hilbert spaces it is always possible to define projections. Indeed, Hilbert spaces are one of the favorite concepts of functional analysis.

So let us again consider the pre-Hilbert space of functions (2.22), endowed with the dot product (2.24). To turn it into a Hilbert space (over ℝ), one completes it in the norm corresponding to the dot product, ‖f‖ := √⟨f, f⟩. This is done by adding the limit points of sequences that are convergent in that norm (see Appendix B).

2. This is illustrated by the following quotation from an excellent machine learning textbook published in the seventies (p. 174 in [152]): “The familiar functions of mathematical physics are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of potential functions. However, these suggestions are more appealing for their mathematical beauty than their practical usefulness.”

RKHS

In view of the properties (2.29) and (2.30), this space is usually called a reproducing kernel Hilbert space (RKHS).

In general, an RKHS can be defined as follows.
Definition 2.9 (Reproducing Kernel Hilbert Space) Let 𝒳 be a nonempty set (often called the index set) and let ℋ be a Hilbert space of functions f : 𝒳 → ℝ. Then ℋ is called a reproducing kernel Hilbert space endowed with the dot product ⟨·, ·⟩ (and the norm ‖f‖ := √⟨f, f⟩) if there exists a function k : 𝒳 × 𝒳 → ℝ with the following properties.

Reproducing Property

1. k has the reproducing property³

⟨f, k(x, ·)⟩ = f(x) for all f ∈ ℋ;    (2.34)

in particular,

⟨k(x, ·), k(x′, ·)⟩ = k(x, x′).    (2.35)

Closed Space

2. k spans ℋ; i.e., ℋ = span{k(x, ·) | x ∈ 𝒳}, where X̄ denotes the completion of the set X (cf. Appendix B).

On a more abstract level, an RKHS can be defined as a Hilbert space of functions f on 𝒳 such that all evaluation functionals (the maps f ↦ f(x′), where x′ ∈ 𝒳) are continuous. In that case, by the Riesz representation theorem (e.g., [429]), for each x′ ∈ 𝒳 there exists a unique function of x, called k(x, x′), such that

f(x′) = ⟨f, k(·, x′)⟩.    (2.36)

It follows directly from (2.35) that k(x, x′) is symmetric in its arguments (see Problem 2.28) and satisfies the conditions for positive definiteness.

Uniqueness of k

Note that the RKHS uniquely determines k. This can be shown by contradiction: assume that there exist two kernels, say k and k′, spanning the same RKHS ℋ. From Problem 2.28 we know that both k and k′ must be symmetric. Moreover, from (2.34) we conclude that

⟨k(x, ·), k′(x′, ·)⟩ = k(x, x′) = k′(x′, x).    (2.37)

In the second equality we used the symmetry of the dot product. Finally, symmetry in the arguments of k′ yields k(x, x′) = k′(x, x′), which proves our claim.

3. Note that this implies that each f ∈ ℋ is actually a single function whose values at any x ∈ 𝒳 are well-defined. In contrast, L₂ Hilbert spaces usually do not have this property. The elements of these spaces are equivalence classes of functions that disagree only on sets of measure 0; cf. footnote 15 in Section B.3.
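As a small numerical aside, the following sketch represents a function f as a kernel expansion, computes its RKHS norm from the Gram matrix, and checks the bound (2.31). The Gaussian kernel, the toy points, and the coefficients are arbitrary illustrative choices.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))      # expansion points x_1, ..., x_5 (toy data)
alpha = rng.normal(size=5)       # coefficients of f = sum_i alpha_i k(., x_i)
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

def f(x):
    """Evaluate f via the reproducing property: f(x) = <f, k(., x)>."""
    return float(sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X)))

# RKHS dot product of the kernel expansion with itself: <f, f> = alpha^T K alpha.
f_norm_sq = float(alpha @ K @ alpha)

# Check the bound (2.31): |f(x)|^2 <= k(x, x) * <f, f>.
x_test = np.array([0.3, -0.7])
print(f(x_test) ** 2 <= gaussian_kernel(x_test, x_test) * f_norm_sq)   # True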
2.2.4 The Mercer Kernel Map

Section 2.2.2 has shown that any positive definite kernel can be represented as a dot product in a linear space. This was done by explicitly constructing a (Hilbert) space that does the job. The present section will construct another Hilbert space.

Mercer's Theorem

One could argue that this is superfluous, given that any two separable Hilbert spaces are isometrically isomorphic; in other words, it is possible to define a one-to-one linear map between the spaces which preserves the dot product. However, the tool that we shall presently use, Mercer's theorem, has played a crucial role in the understanding of SVMs, and it provides valuable insight into the geometry of feature spaces, which more than justifies its detailed discussion. In the SVM literature, the kernel trick is usually introduced via Mercer's theorem.

We start by stating the version of Mercer's theorem given in [606]. We assume (𝒳, μ) to be a finite measure space.⁴ The term almost all (cf. Appendix B) means except for sets of measure zero. For the commonly used Lebesgue-Borel measure, countable sets of individual points are examples of zero measure sets. Note that the integral with respect to a measure is explained in Appendix B. Readers who do not want to go into mathematical detail may simply want to think of the dμ(x′) as a dx′, and of 𝒳 as a compact subset of ℝ^N. For further explanations of the terms involved in this theorem, cf. Appendix B, especially Section B.3.

4. A finite measure space is a set 𝒳 with a σ-algebra (Definition B.1) defined on it, and a measure μ (Definition B.2) defined on the latter, satisfying μ(𝒳) < ∞ (so that, up to a scaling factor, μ is a probability measure).
Theorem 2.10 (Mercer [359, 307]) Suppose k ∈ L_∞(𝒳²) is a symmetric real-valued function such that the integral operator (cf. (2.16))

T_k : L₂(𝒳) → L₂(𝒳), (T_k f)(x) := ∫_𝒳 k(x, x′) f(x′) dμ(x′),    (2.38)

is positive definite; that is, for all f ∈ L₂(𝒳), we have

∫_{𝒳²} k(x, x′) f(x) f(x′) dμ(x) dμ(x′) ≥ 0.    (2.39)

Let ψ_j ∈ L₂(𝒳) be the normalized orthogonal eigenfunctions of T_k associated with the eigenvalues λ_j > 0, sorted in non-increasing order. Then

1. (λ_j)_j ∈ ℓ₁,

2. k(x, x′) = Σ_{j=1}^{N_ℋ} λ_j ψ_j(x) ψ_j(x′) holds for almost all (x, x′). Either N_ℋ ∈ ℕ, or N_ℋ = ∞; in the latter case, the series converges absolutely and uniformly for almost all (x, x′).
For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approximation and its relationship to kernel PCA (Section 1.7), see Problem 2.26.

From statement 2 it follows that k(x, x′) corresponds to a dot product in ℓ₂^{N_ℋ}, since k(x, x′) = ⟨Φ(x), Φ(x′)⟩ with

Φ : 𝒳 → ℓ₂^{N_ℋ}, x ↦ (√λ_j ψ_j(x))_{j=1,…,N_ℋ},    (2.40)

for almost all x ∈ 𝒳. Note that we use the same Φ as in (2.21) to denote the feature map, although the target spaces are different. However, this distinction is not important for the present purposes — we are interested in the existence of some Hilbert space in which the kernel corresponds to the dot product, and not in what particular representation of it we are using.

In fact, it has been noted [467] that the uniform convergence of the series implies that given any ε > 0, there exists an n ∈ ℕ such that even if N_ℋ = ∞, k can be approximated within accuracy ε as a dot product in ℝⁿ: for almost all x, x′ ∈ 𝒳, |k(x, x′) − ⟨Φ_n(x), Φ_n(x′)⟩| ≤ ε, where Φ_n : x ↦ (√λ_1 ψ_1(x), …, √λ_n ψ_n(x)). The feature space can thus always be thought of as finite-dimensional within some accuracy ε. We summarize our findings in the following proposition.

Mercer Feature Map
Proposition 2.11 (Mercer Kernel Map) If k is a kernel satisfying the conditions of Theorem 2.10, we can construct a mapping Φ into a space where k acts as a dot product,

⟨Φ(x), Φ(x′)⟩ = k(x, x′),    (2.41)

for almost all x, x′ ∈ 𝒳. Moreover, given any ε > 0, there exists a map Φ_n into an n-dimensional dot product space (where n ∈ ℕ depends on ε) such that

|k(x, x′) − ⟨Φ_n(x), Φ_n(x′)⟩| ≤ ε    (2.42)

for almost all x, x′ ∈ 𝒳.
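The following sketch illustrates the flavor of the truncated Mercer map on a finite sample, in the spirit of the data-dependent approximation mentioned above in connection with Problem 2.26 and kernel PCA. It is an illustration under assumptions: the Gaussian kernel, the sample, and the truncation level n are arbitrary choices, and the Gram matrix spectrum merely stands in for the eigenvalues and eigenfunctions of T_k.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))          # sample points (toy data)
m = len(X)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                    # Gaussian kernel, sigma = 1

# Eigendecomposition of K/m mimics the eigenvalues/eigenfunctions of T_k under the
# empirical measure; the eigenvector columns evaluated at the sample approximate psi_j.
lam, U = np.linalg.eigh(K / m)
lam, U = lam[::-1], U[:, ::-1]                 # sort in non-increasing order

n = 10                                         # truncation level (assumed)
Phi_n = np.sqrt(m) * U[:, :n] * np.sqrt(np.clip(lam[:n], 0, None))

# Cf. (2.42): the truncated feature map reproduces the kernel up to a small error.
print("max |K - Phi_n Phi_n^T| =", float(np.abs(K - Phi_n @ Phi_n.T).max()))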
Both Mercer kernels and positive definite kernels can thus be represented as dot products in Hilbert spaces. The following proposition, showing a case where the two types of kernels coincide, thus comes as no surprise.

Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let 𝒳 = [a, b] be a compact interval and let k : [a, b] × [a, b] → ℝ be continuous. Then k is a positive definite kernel if and only if

∫_a^b ∫_a^b k(x, x′) f(x) f(x′) dx dx′ ≥ 0    (2.43)

for each continuous function f : 𝒳 → ℝ.

Note that the conditions in this proposition are actually more restrictive than those of Theorem 2.10. Using the feature space representation (Proposition 2.11), however, it is easy to see that Mercer kernels are also positive definite (for almost all x, x′) in the more general case of Theorem 2.10: given any c ∈ ℝ^m, we have

Σ_{i,j} c_i c_j k(x_i, x_j) = Σ_{i,j} c_i c_j ⟨Φ(x_i), Φ(x_j)⟩ = ‖Σ_i c_i Φ(x_i)‖² ≥ 0.    (2.44)
Being positive definite, Mercer kernels are thus also reproducing kernels.

We next show how the reproducing kernel map is related to the Mercer kernel map constructed from the eigenfunction decomposition [202, 467]. To this end, let us consider a kernel which satisfies the condition of Theorem 2.10, and construct a dot product ⟨·, ·⟩ such that k becomes a reproducing kernel for the Hilbert space ℋ containing the functions

f(x) = Σ_i α_i k(x, x_i) = Σ_i Σ_{j=1}^{N_ℋ} α_i λ_j ψ_j(x) ψ_j(x_i).    (2.45)

By linearity, which holds for any dot product, we have

⟨f, k(·, x′)⟩ = Σ_i Σ_{j,n=1}^{N_ℋ} α_i λ_j ψ_j(x_i) ⟨ψ_j, ψ_n⟩ λ_n ψ_n(x′).    (2.46)

Since k is a Mercer kernel, the ψ_j (j = 1, …, N_ℋ) can be chosen to be orthogonal with respect to the dot product in L₂(𝒳). Hence it is straightforward to choose ⟨·, ·⟩ such that

⟨ψ_j, ψ_n⟩ = δ_{jn}/λ_j    (2.47)

(using the Kronecker symbol δ_{jn}, see (B.30)), in which case (2.46) reduces to the reproducing kernel property (2.36) (using (2.45)). For a coordinate representation in the RKHS, see Problem 2.29.

Equivalence of Feature Spaces
The above connection between the Mercer kernel map and the RKHS map is instructive, but we shall rarely make use of it. In fact, we will usually identify the different feature spaces. Thus, to avoid confusion in subsequent chapters, the following comments are necessary. As described above, there are different ways of constructing feature spaces for any given kernel. In fact, they can even differ in terms of their dimensionality (cf. Problem 2.22). The two feature spaces that we will mostly use in this book are the RKHS associated with k (Section 2.2.2) and the Mercer ℓ₂ feature space. We will mostly use the same symbol ℋ for all feature spaces that are associated with a given kernel. This makes sense provided that everything we do, at the end of the day, reduces to dot products. For instance, let us assume that Φ₁, Φ₂ are maps into the feature spaces ℋ₁, ℋ₂ respectively, both associated with the kernel k; in other words,

k(x, x′) = ⟨Φ_i(x), Φ_i(x′)⟩_{ℋ_i} for i = 1, 2.    (2.48)

Then it will usually not be the case that Φ₁(x) = Φ₂(x); due to (2.48), however, we always have ⟨Φ₁(x), Φ₁(x′)⟩_{ℋ₁} = ⟨Φ₂(x), Φ₂(x′)⟩_{ℋ₂}. Therefore, as long as we are only interested in dot products, the two spaces can be considered identical.

An example of this identity is the so-called large margin regularizer that is usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters 4 and 7),

⟨w, w⟩, where w = Σ_{i=1}^m α_i Φ(x_i).    (2.49)

No matter whether Φ is the RKHS map Φ(x_i) = k(·, x_i) (2.21) or the Mercer map Φ(x_i) = (√λ_j ψ_j(x_i))_{j=1,…,N_ℋ} (2.40), the value of ‖w‖² will not change.

This point is of great importance, and we hope that all readers are still with us. It is fair to say, however, that Section 2.2.5 can be skipped at first reading.
2.2.5 The Shape of the Mapped Data in Feature Space

Using Mercer's theorem, we have shown that one can think of the feature map as a map into a high- or infinite-dimensional Hilbert space. The argument in the remainder of the section shows that this typically entails that the mapped data Φ(𝒳) lie in some box with rapidly decaying side lengths [606]. By this we mean that the range of the data decreases as the dimension index j increases, with a rate that depends on the size of the eigenvalues.

Let us assume that for all j ∈ ℕ, we have sup_{x∈𝒳} λ_j ψ_j(x)² < ∞. Define the sequence

ℓ_j := sup_{x∈𝒳} λ_j ψ_j(x)².    (2.50)

Note that if

C_k := sup_j sup_{x∈𝒳} |ψ_j(x)|    (2.51)

exists (see Problem 2.24), then we have ℓ_j ≤ λ_j C_k². However, if the λ_j decay rapidly, then (2.50) can be finite even if (2.51) is not.

By construction, Φ(𝒳) is contained in an axis-parallel parallelepiped in ℓ₂^{N_ℋ} with side lengths 2√ℓ_j (cf. (2.40)).⁵

Consider an example of a common kernel, the Gaussian, and let μ (see Theorem 2.10) be the Lebesgue measure. In this case, the eigenvectors are sine and cosine functions (with supremum one), and thus the sequence of the ℓ_j coincides with the sequence of the eigenvalues λ_j. Generally, whenever sup_{x∈𝒳} ψ_j(x)² is finite, the ℓ_j decay as fast as the λ_j. We shall see in Sections 4.4, 4.5 and Chapter 12 that for many common kernels, this decay is very rapid.

It will be useful to consider operators that map Φ(𝒳) into balls of some radius R centered at the origin. The following proposition characterizes a class of such operators, determined by the sequence (ℓ_j)_j. Recall that ℝ^∞ denotes the space of all real sequences.
Proposition 2.13 (Mapping Φ(𝒳) into ℓ₂) Let S be the diagonal map

S : ℝ^∞ → ℝ^∞, (x_j)_j ↦ S(x_j)_j = (s_j x_j)_j,    (2.52)

where (s_j)_j ∈ ℝ^∞. If (s_j √ℓ_j)_j ∈ ℓ₂, then S maps Φ(𝒳) into a ball centered at the origin whose radius is R = ‖(s_j √ℓ_j)_j‖_{ℓ₂}.

5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent statements also only hold true almost everywhere.

Proof Suppose (s_j √ℓ_j)_j ∈ ℓ₂. Using the Mercer map (2.40), we have

‖SΦ(x)‖² = Σ_j s_j² λ_j ψ_j(x)² ≤ Σ_j s_j² ℓ_j = R²    (2.53)

for any x ∈ 𝒳. Hence SΦ(𝒳) ⊂ ℓ₂.

The converse is not necessarily the case. To see this, note that if (s_j √ℓ_j)_j ∉ ℓ₂, amounting to saying that

Σ_j s_j² sup_{x∈𝒳} λ_j ψ_j(x)²    (2.54)

is not finite, then there need not always exist an x ∈ 𝒳 such that (s_j √λ_j ψ_j(x))_j ∉ ℓ₂, i.e., such that

Σ_j s_j² λ_j ψ_j(x)²    (2.55)

is not finite.
To see how the freedom to rescale Φ(𝒳) effectively restricts the class of functions we are using, we first note that everything in the feature space ℓ₂^{N_ℋ} is done in terms of dot products. Therefore, we can compensate any invertible symmetric linear transformation of the data in ℋ by the inverse transformation on the set of admissible weight vectors in ℋ. In other words, for any invertible symmetric operator S on ℋ, we have ⟨S⁻¹w, SΦ(x)⟩ = ⟨w, Φ(x)⟩ for all x ∈ 𝒳.

As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there exists a class of generalization error bounds that depend on the radius R of the smallest sphere containing the data. If the (ℓ_i)_i decay rapidly, we are not actually “making use” of the whole sphere. In this case, we may construct a diagonal scaling operator S which inflates the sides of the above parallelepiped as much as possible, while ensuring that it is still contained within a sphere of the original radius R in ℋ (Figure 2.3). By effectively reducing the size of the function class, this will provide a way of strengthening the bounds. A similar idea, using kernel PCA (Section 14.2) to determine empirical scaling coefficients, has been successfully applied by [101].
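For a rough numerical feeling of how rapidly the relevant scales decay, the following sketch looks at the spectrum of a Gaussian Gram matrix as a finite-sample stand-in for the λ_j. The kernel width, sample size, and data distribution are arbitrary illustrative assumptions.

import numpy as np

# Spectrum of a Gaussian Gram matrix as a finite-sample stand-in for the lambda_j:
# the eigenvalues decay rapidly, so the mapped data essentially fill only a few
# directions of the (in principle infinite-dimensional) feature-space "box".
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * 0.5**2))          # Gaussian kernel, sigma = 0.5
lam = np.sort(np.linalg.eigvalsh(K))[::-1] / len(X)

print("five largest eigenvalues:", np.round(lam[:5], 4))
print("spectrum mass in top 10 of 300 directions:",
      round(float(lam[:10].sum() / lam.sum()), 4))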
We conclude this section with another useful insight that characterizes a property of the feature map Φ. Note that most of what was said so far applies to the case where the input domain 𝒳 is a general set. In this case, it is not possible to make nontrivial statements about continuity properties of Φ. This changes if we assume 𝒳 to be endowed with a notion of closeness, by turning it into a so-called topological space. Readers not familiar with this concept will be reassured to hear that Euclidean vector spaces are particular cases of topological spaces.

Continuity of Φ

Proposition 2.14 (Continuity of the Feature Map [402]) If 𝒳 is a topological space and k is a continuous positive definite kernel on 𝒳 × 𝒳, then there exists a Hilbert space ℋ and a continuous map Φ : 𝒳 → ℋ such that for all x, x′ ∈ 𝒳, we have k(x, x′) = ⟨Φ(x), Φ(x′)⟩.
[Figure 2.3: two panels, each showing the feature space with mapped points Φ(x) contained in a ball of radius R, together with the corresponding weight vector w, before and after the rescaling described in the caption.]

Figure 2.3 Since everything is done in terms of dot products, scaling up the data by an operator S can be compensated by scaling the weight vectors with S⁻¹ (cf. text). By choosing S such that the data are still contained in a ball of the same radius R, we effectively reduce our function class (parametrized by the weight vector), which can lead to better generalization bounds, depending on the kernel inducing the map Φ.
2.2.6 The Empirical Kernel Map

The map Φ, defined in (2.21), transforms each input pattern into a function on 𝒳, that is, into a potentially infinite-dimensional object. For any given set of points, however, it is possible to approximate Φ by only evaluating it on these points (cf. [232, 350, 361, 547, 474]):

Empirical Kernel Map

Definition 2.15 (Empirical Kernel Map) For a given set {z_1, …, z_n} ⊂ 𝒳, n ∈ ℕ, we call

Φ_n : 𝒳 → ℝⁿ, x ↦ k(·, x)|_{z_1,…,z_n} = (k(z_1, x), …, k(z_n, x))    (2.56)

the empirical kernel map w.r.t. {z_1, …, z_n}.

As an example, consider first the case where k is a positive definite kernel, and {z_1, …, z_n} = {x_1, …, x_m}; we thus evaluate k(·, x) on the training patterns. If we carry out a linear algorithm in feature space, then everything will take place in the linear span of the mapped training patterns. Therefore, we can represent the k(·, x) of (2.21) as Φ_m(x) without losing information. The dot product to use in that representation, however, is not simply the canonical dot product in ℝ^m, since the Φ(x_i) will usually not form an orthonormal system. To turn Φ_m into a feature map associated with k, we need to endow ℝ^m with a dot product ⟨·, ·⟩_m such that

k(x, x′) = ⟨Φ_m(x), Φ_m(x′)⟩_m.    (2.57)

To this end, we use the ansatz ⟨·, ·⟩_m = ⟨·, M·⟩, with M being a positive definite matrix.⁶ Enforcing (2.57) on the training patterns, this yields the self-consistency condition [478, 512]

K = KMK,    (2.58)

where K is the Gram matrix. The condition (2.58) can be satisfied for instance by the (pseudo-)inverse M = K⁻¹.

6. Every dot product in ℝ^m can be written in this form. We do not require strict definiteness of M, as the null space can be projected out, leading to a lower-dimensional feature space.

Kernel PCA Map

Equivalently, we could have incorporated this rescaling operation, which corresponds to a kernel PCA “whitening” ([478, 547, 474], cf. Section 11.4), directly into the map, by whitening (2.56) to get

Φ_m^w : x ↦ K^{-1/2} (k(x_1, x), …, k(x_m, x)).    (2.59)

This simply amounts to dividing the eigenvector basis vectors of K by √λ_i, where the λ_i are the eigenvalues of K.⁷ This parallels the rescaling of the eigenfunctions of the integral operator belonging to the kernel, given by (2.47). It turns out that this map can equivalently be performed using kernel PCA feature extraction (see Problem 14.8), which is why we refer to this map as the kernel PCA map.

7. It is understood that if K is singular, we use the pseudo-inverse of K^{1/2}, in which case we get an even lower dimensional subspace.

Note that we have thus constructed a data-dependent feature map into an m-dimensional space which satisfies ⟨Φ_m^w(x), Φ_m^w(x′)⟩ = k(x, x′); i.e., we have found an m-dimensional feature space associated with the given kernel. In the case where K is invertible, Φ_m^w(x) computes the coordinates of Φ(x) when represented in a basis of the m-dimensional subspace spanned by Φ(x_1), …, Φ(x_m).

For data sets where the number of examples is smaller than their dimension, it can actually be computationally attractive to carry out Φ_m^w explicitly, rather than using kernels in subsequent algorithms. Moreover, algorithms which are not readily “kernelized” may benefit from explicitly carrying out the kernel PCA map.
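Here is a minimal sketch of the kernel PCA map (2.59). It assumes a Gaussian kernel and toy data; the eigenvalue cutoff 1e-12 is an arbitrary numerical tolerance standing in for the pseudo-inverse mentioned in footnote 7.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def kernel_pca_map(X, kernel):
    """Whitened empirical kernel map (2.59): x -> K^{-1/2} (k(x_1,x), ..., k(x_m,x)).
    Uses the pseudo-inverse square root of K, dropping (near-)zero eigenvalues."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-12
    K_inv_sqrt = U[:, keep] @ np.diag(1.0 / np.sqrt(lam[keep])) @ U[:, keep].T
    def phi(x):
        return K_inv_sqrt @ np.array([kernel(xi, x) for xi in X])
    return phi

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
phi = kernel_pca_map(X, gaussian_kernel)

# Sanity check of (2.57)/(2.59) on two training points: <phi(x), phi(x')> ~ k(x, x').
x, xp = X[3], X[17]
print(np.isclose(phi(x) @ phi(xp), gaussian_kernel(x, xp)))   # True (up to round-off)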
We end this section with two notes which illustrate why the use of (2.56) need not be restricted to the special case we just discussed.

More general kernels. When using non-symmetric kernels k in (2.56), together with the canonical dot product, we effectively work with the positive definite matrix K⊤K. Note that each positive definite matrix can be written as K⊤K. Therefore, working with positive definite kernels leads to an equally rich set of nonlinearities as working with an empirical kernel map using general non-symmetric kernels. If we wanted to carry out the whitening step, we would have to use (K⊤K)^{-1/4} (cf. footnote 7 concerning potential singularities).

Different evaluation sets. Things can be sped up by using expansion sets of the form {z_1, …, z_n} ⊂ 𝒳, mapping into an n-dimensional space, with n < m, as done in [100, 228]. In that case, one modifies (2.59) to

Φ_n^w : x ↦ K_n^{-1/2} (k(z_1, x), …, k(z_n, x)),    (2.60)

where (K_n)_{ij} := k(z_i, z_j). The expansion set can either be a subset of the training set,⁸ or some other set of points. We will later return to the issue of how to choose the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of Kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry about the whitening step in (2.59) and (2.60): using the canonical dot product in ℝ^m (rather than ⟨·, ·⟩_m) will simply lead to diagonalizing K² instead of K, which yields the same eigenvectors with squared eigenvalues. This was pointed out by [350, 361]. The study [361] reports experiments where (2.56) was employed to speed up Kernel PCA by choosing {z_1, …, z_n} as a subset of {x_1, …, x_m}.

8. In [228] it is recommended that the size n of the expansion set is chosen large enough to ensure that the smallest eigenvalue of K_n is larger than some predetermined ε > 0. Alternatively, one can start off with a larger set, and use kernel PCA to select the most important components for the map, see Problem 14.8. In the kernel PCA case, the map (2.60) is computed as D_n^{-1/2} U_n^⊤ (k(z_1, x), …, k(z_n, x)), where U_n D_n U_n^⊤ is the eigenvalue decomposition of K_n. Note that the columns of U_n are the eigenvectors of K_n. We discard all columns that correspond to zero eigenvalues, as well as the corresponding dimensions of D_n. To approximate the map, we may actually discard all eigenvalues smaller than some ε > 0.
2.2.7 A Kernel Map Defined from Pairwise Similarities

In practice, we are given a finite amount of data x_1, …, x_m. The following simple observation shows that even if we do not want to (or are unable to) analyze a given kernel k analytically, we can still compute a map Φ such that k corresponds to a dot product in the linear span of the Φ(x_i):

Proposition 2.16 (Data-Dependent Kernel Map [467]) Suppose the data x_1, …, x_m and the kernel k are such that the kernel Gram matrix K_{ij} = k(x_i, x_j) is positive definite. Then it is possible to construct a map Φ into an m-dimensional feature space such that

k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩.    (2.61)

Conversely, given an arbitrary map Φ into some feature space ℋ, the matrix K_{ij} = ⟨Φ(x_i), Φ(x_j)⟩ is positive definite.

Proof First assume that K is positive definite. In this case, it can be diagonalized as K = SDS⊤, with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then

k(x_i, x_j) = (SDS⊤)_{ij} = ⟨S_i, D S_j⟩ = ⟨√D S_i, √D S_j⟩,    (2.62)

where we have defined the S_i as the rows of S (note that the columns of S would be K's eigenvectors). Therefore, K is the Gram matrix of the vectors √D S_i.⁹ Hence the following map Φ, defined on x_1, …, x_m, will satisfy (2.61):

Φ : x_i ↦ √D S_i.    (2.63)

Thus far, Φ is only defined on a set of points, rather than on a vector space. Therefore, it makes no sense to ask whether it is linear. We can, however, ask whether it can be extended to a linear map, provided the x_i are elements of a vector space. The answer is that if the x_i are linearly dependent (which is often the case), then this will not be possible, since a linear map would then typically be overdetermined by the m conditions (2.63). For the converse, assume an arbitrary α ∈ ℝ^m, and compute

Σ_{i,j=1}^m α_i α_j K_{ij} = ⟨Σ_{i=1}^m α_i Φ(x_i), Σ_{j=1}^m α_j Φ(x_j)⟩ = ‖Σ_{i=1}^m α_i Φ(x_i)‖² ≥ 0.    (2.64)

9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46].
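The construction in the proof of Proposition 2.16 is short enough to write down directly. The following sketch is illustrative only: the Gaussian toy Gram matrix is an arbitrary choice, and the clipping of tiny negative eigenvalues is a numerical safeguard rather than part of the proposition.

import numpy as np

def data_dependent_feature_map(K):
    """Construct Phi(x_i) = sqrt(D) S_i from K = S D S^T (cf. (2.62)-(2.63)).
    Returns an (m x m) array whose i-th row is the feature vector of x_i."""
    D, S = np.linalg.eigh(K)           # columns of S are K's eigenvectors
    D = np.clip(D, 0.0, None)          # guard against tiny negative round-off
    return S * np.sqrt(D)              # row i is sqrt(D) componentwise times S_i

# Toy Gram matrix from a Gaussian kernel on random points (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

Phi = data_dependent_feature_map(K)
print(np.allclose(Phi @ Phi.T, K))     # True: <Phi(x_i), Phi(x_j)> = K_ij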
In particular, this result implies that given data x_1, …, x_m, and a kernel k which gives rise to a positive definite matrix K, it is always possible to construct a feature space ℋ of dimension at most m that we are implicitly working in when using kernels (cf. Problem 2.32 and Section 2.2.6).

If we perform an algorithm which requires k to correspond to a dot product in some other space (as for instance the SV algorithms described in this book), it is possible that even though k is not positive definite in general, it still gives rise to a positive definite Gram matrix K with respect to the training data at hand. In this case, Proposition 2.16 tells us that nothing will go wrong during training when we work with these data. Moreover, if k leads to a matrix with some small negative eigenvalues, we can add a small multiple of some strictly positive definite kernel k′ (such as the identity, k′(x_i, x_j) = δ_{ij}) to obtain a positive definite matrix. To see this, suppose that λ_min < 0 is the minimal eigenvalue of k's Gram matrix. Note that, being strictly positive definite, the Gram matrix K′ of k′ satisfies

α⊤K′α ≥ λ′_min ‖α‖², with λ′_min > 0,    (2.65)

where λ′_min denotes its minimal eigenvalue, and the first inequality follows from Rayleigh's principle (B.57). Therefore, provided that τ ≥ −λ_min/λ′_min ≥ 0, we have

α⊤(K + τK′)α = α⊤Kα + τ α⊤K′α ≥ (λ_min + τ λ′_min) ‖α‖² ≥ 0    (2.66)

for all α ∈ ℝ^m, rendering (K + τK′) positive definite.
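A minimal sketch of this eigenvalue-shifting argument follows. The sigmoid parameters and the data are made-up illustrative assumptions, and whether a particular sigmoid Gram matrix is indefinite depends on them; the helper name shift_to_psd is ours.

import numpy as np

def shift_to_psd(K, K_prime=None):
    """If K has negative eigenvalues, add tau * K_prime with tau = -lambda_min /
    lambda'_min, following the argument around (2.65)-(2.66). By default K_prime
    is the identity, i.e. the kernel k'(x_i, x_j) = delta_ij."""
    m = K.shape[0]
    if K_prime is None:
        K_prime = np.eye(m)
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min >= 0:
        return K
    lam_prime_min = np.linalg.eigvalsh(K_prime).min()
    tau = -lam_min / lam_prime_min
    return K + tau * K_prime

# A sigmoid kernel (2.69) can produce indefinite Gram matrices; repair the one at hand.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.tanh(0.5 * X @ X.T - 1.0)
print("min eigenvalue before:", round(float(np.linalg.eigvalsh(K).min()), 4),
      "after:", round(float(np.linalg.eigvalsh(shift_to_psd(K)).min()), 4))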
2.3 Examples and Properties of Kernels

For the following examples, let us assume that 𝒳 ⊂ ℝ^N.

Polynomial

Besides homogeneous polynomial kernels (cf. Proposition 2.1),

k(x, x′) = ⟨x, x′⟩^d,    (2.67)

Gaussian

Boser, Guyon, and Vapnik [62, 223, 561] suggest the usage of Gaussian radial basis function kernels [26, 4],

k(x, x′) = exp(−‖x − x′‖² / (2σ²)),    (2.68)

Sigmoid

where σ > 0, and sigmoid kernels,

k(x, x′) = tanh(κ⟨x, x′⟩ + ϑ),    (2.69)
where κ > 0 and ϑ < 0. By applying Theorem 13.4 below, one can check that the latter kernel is not actually positive definite (see Section 4.6 and [85, 511] and the discussion in Example 4.25). Curiously, it has nevertheless successfully been used in practice. The reasons for this are discussed in [467].

Inhomogeneous Polynomial

Other useful kernels include the inhomogeneous polynomial,

k(x, x′) = (⟨x, x′⟩ + c)^d,    (2.70)

B_n-Spline of Odd Order

(d ∈ ℕ, c ≥ 0) and the B_n-spline kernel [501, 572] (I_X denoting the indicator (or characteristic) function on the set X, and ⊗ the convolution operation, (f ⊗ g)(x) := ∫ f(x′) g(x′ − x) dx′),

k(x, x′) = B_{2p+1}(‖x − x′‖) with B_n := ⊗_{i=1}^{n+1} I_{[−1/2, 1/2]}.    (2.71)

The kernel computes B-splines of order 2p + 1 (p ∈ ℕ), defined by the (2p + 1)-fold convolution of the unit interval [−1/2, 1/2]. See Section 4.4.1 for further details and a regularization theoretic analysis of this kernel.
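For readers who prefer to see the formulas as code, here is a plain transcription of the kernels (2.67)-(2.70). It is a sketch only; the default parameter values are arbitrary illustrative choices.

import numpy as np

def polynomial(x, y, d=3):
    return np.dot(x, y) ** d                       # homogeneous polynomial (2.67)

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))   # Gaussian RBF (2.68)

def sigmoid(x, y, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * np.dot(x, y) + theta)   # sigmoid (2.69), not pd in general

def inhomogeneous_polynomial(x, y, d=3, c=1.0):
    return (np.dot(x, y) + c) ** d                 # inhomogeneous polynomial (2.70)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (polynomial, gaussian, sigmoid, inhomogeneous_polynomial):
    print(k.__name__, round(float(k(x, y)), 4))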
Invariance of Kernels

Note that all these kernels have the convenient property of unitary invariance, k(x, x′) = k(Ux, Ux′) if U⊤ = U⁻¹, for instance if U is a rotation. If we consider complex numbers, then we have to use the adjoint U* := Ū⊤ instead of the transpose.

RBF Kernels

Radial basis function (RBF) kernels are kernels that can be written in the form

k(x, x′) = f(d(x, x′)),    (2.72)

where d is a metric on 𝒳, and f is a function on [0, ∞). Examples thereof are the Gaussians and B-splines mentioned above. Usually, the metric arises from the dot product, d(x, x′) = ‖x − x′‖ = √⟨x − x′, x − x′⟩. In this case, RBF kernels are unitary invariant, too. In addition, they are translation invariant; in other words, k(x, x′) = k(x − x₀, x′ − x₀) for all x₀ ∈ 𝒳.

In some cases, invariance properties alone can distinguish particular kernels: in Section 2.1, we explained how using polynomial kernels ⟨x, x′⟩^d corresponds to mapping into a feature space whose dimensions are spanned by all possible d-th order monomials in input coordinates. The different dimensions are scaled with the square root of the number of ordered products of the respective d entries (e.g., √2 in (2.13)). These scaling factors precisely ensure invariance under the group of all orthogonal transformations (rotations and mirroring operations). In many cases, this is a desirable property: it ensures that the results of a learning procedure do not depend on which orthonormal coordinate system (with fixed origin) we use for representing our input data.

Proposition 2.17 (Invariance of Polynomial Kernels [480]) Up to a scaling factor, the kernel k(x, x′) = ⟨x, x′⟩^d is the only kernel inducing a map into a space of all monomials of degree d which is invariant under orthogonal transformations of ℝ^N.
Properties of RBF Kernels

Some interesting additional structure exists in the case of a Gaussian RBF kernel k (2.68). As k(x, x) = 1 for all x ∈ 𝒳, each mapped example has unit length, ‖Φ(x)‖ = 1 (Problem 2.18 shows how to achieve this for general kernels). Moreover, as k(x, x′) > 0 for all x, x′, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (1.3) equals the cosine of the enclosed angle. We obtain

cos(∠(Φ(x), Φ(x′))) = ⟨Φ(x), Φ(x′)⟩ = k(x, x′) > 0,    (2.73)

which amounts to saying that the enclosed angle between any two mapped examples is smaller than π/2.

Infinite-Dimensional Feature Space

The above seems to indicate that in the Gaussian case, the mapped data lie in a fairly restricted area of feature space. However, in another sense, they occupy a space which is as large as possible:

Theorem 2.18 (Full Rank of Gaussian RBF Gram Matrices [360]) Suppose that x_1, …, x_m ∈ 𝒳 are distinct points, and σ ≠ 0. The matrix K given by

K_{ij} := exp(−‖x_i − x_j‖² / (2σ²))    (2.74)

has full rank.

In other words, the points Φ(x_1), …, Φ(x_m) are linearly independent (provided no two x_i are the same). They span an m-dimensional subspace of ℋ. Therefore a Gaussian kernel defined on a domain of infinite cardinality, with no a priori restriction on the number of training examples, produces a feature space of infinite dimension. Nevertheless, an analysis of the shape of the mapped data in feature space shows that capacity is distributed in a way that ensures smooth and simple estimates whenever possible (see Section 12.4).
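These properties are easy to check numerically; the following sketch does so on toy data with an arbitrary choice of σ (an illustration, not a proof of Theorem 2.18).

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 2))               # 20 distinct toy points
sigma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Unit length of mapped points and positive pairwise dot products, cf. (2.73).
print("diagonal all ones:", np.allclose(np.diag(K), 1.0))
print("all angles < pi/2:", bool((K > 0).all()))

# Theorem 2.18: the Gaussian Gram matrix of distinct points has full rank.
print("rank:", np.linalg.matrix_rank(K), "of", len(X))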
The examples given above all apply to the case of vectorial data. Let us next give an example where 𝒳 is not a vector space [42].

Proposition 2.19 (Similarity of Probabilistic Events) If (𝒳, Σ, P) is a probability space with σ-algebra Σ and probability measure P, then

k(A, B) = P(A ∩ B) − P(A)P(B)    (2.75)

is a positive definite kernel on Σ × Σ.

Proof To see this, we define a feature map

Φ : A ↦ I_A − P(A),    (2.76)

where I_A is the characteristic function on A. On the feature space, which consists of functions on 𝒳 taking values in [−1, 1], we use the dot product

⟨f, g⟩ := ∫ f g dP.    (2.77)

The result follows by noticing ⟨I_A, I_B⟩ = P(A ∩ B) and ⟨I_A, P(B)⟩ = P(A)P(B).

Further examples include kernels for string matching, as proposed by [585, 234, 23]. We shall describe these, and address the general problem of designing kernel functions, in Chapter 13.

The next section will return to the connection between kernels and feature spaces. Readers who are eager to move on to SV algorithms may want to skip this section, which is somewhat more technical.
2.4 The Representation of Dissimilarities in Linear Spaces

2.4.1 Conditionally Positive Definite Kernels

We now proceed to a larger class of kernels than that of the positive definite ones. This larger class is interesting in several regards. First, it will turn out that some kernel algorithms work with this class, rather than only with positive definite kernels. Second, its relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces.

The present section thus attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. The underlying mathematical results have been known for quite a while [465]; some of them have already attracted interest in the kernel methods community in various contexts [515, 234].

Clearly, the squared distance ‖Φ(x) − Φ(x′)‖² in the feature space associated with a pd kernel k can be computed, using k(x, x′) = ⟨Φ(x), Φ(x′)⟩, as

‖Φ(x) − Φ(x′)‖² = k(x, x) + k(x′, x′) − 2k(x, x′).    (2.78)

Positive definite kernels are, however, not the full story: there exists a larger class of kernels that can be used as generalized distances, and the present section will describe why and how [468].

Let us start by considering how a dot product and the corresponding distance measure are affected by a translation of the data, x ↦ x − x₀. Clearly, ‖x − x′‖² is translation invariant while ⟨x, x′⟩ is not. A short calculation shows that the effect of the translation can be expressed in terms of ‖· − ·‖² as

⟨(x − x₀), (x′ − x₀)⟩ = ½ (−‖x − x′‖² + ‖x − x₀‖² + ‖x₀ − x′‖²).    (2.79)

Note that this, just like ⟨x, x′⟩, is still a pd kernel: Σ_{i,j} c_i c_j ⟨(x_i − x₀), (x_j − x₀)⟩ = ‖Σ_i c_i (x_i − x₀)‖² ≥ 0 holds true for any c_i. For any choice of x₀ ∈ 𝒳, we thus get a similarity measure (2.79) associated with the dissimilarity measure ‖x − x′‖.
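As a two-line illustration of (2.78), feature-space distances can be computed from kernel evaluations alone; the kernels and points below are arbitrary.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def feature_distance_sq(k, x, y):
    """Squared feature-space distance (2.78), computed from kernel values only."""
    return k(x, x) + k(y, y) - 2 * k(x, y)

x, y = np.array([0.0, 1.0]), np.array([2.0, -1.0])
print(feature_distance_sq(gaussian_kernel, x, y))            # nonlinear "distance"
print(feature_distance_sq(lambda a, b: float(a @ b), x, y))  # equals ||x - y||^2
print(float(np.sum((x - y) ** 2)))                           # check for the linear case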
This naturally leads to the question of whether (2.79) might suggest a connection that also holds true in more general cases: what kind of nonlinear dissimilarity measure do we have to substitute for ‖· − ·‖² on the right hand side of (2.79), to ensure that the left hand side becomes positive definite? To state the answer, we first need to define the appropriate class of kernels.
The following definition differs from Definition 2.4 only in the additional constraint on the sum of the c_i. Below, 𝕂 is a shorthand for ℂ or ℝ; the definitions are the same in both cases.

Definition 2.20 (Conditionally Positive Definite Matrix) A symmetric m × m matrix K (m ≥ 2) taking values in 𝕂 and satisfying

Σ_{i,j=1}^m c_i c̄_j K_{ij} ≥ 0 for all c_i ∈ 𝕂 with Σ_{i=1}^m c_i = 0    (2.80)

is called conditionally positive definite (cpd).

Definition 2.21 (Conditionally Positive Definite Kernel) Let 𝒳 be a nonempty set. A function k : 𝒳 × 𝒳 → 𝕂 which for all m ≥ 2, x_1, …, x_m ∈ 𝒳 gives rise to a conditionally positive definite Gram matrix is called a conditionally positive definite (cpd) kernel.

Note that symmetry is also required in the complex case. Due to the additional constraint on the coefficients c_i, it does not follow automatically anymore, as it did in the case of complex positive definite matrices and kernels. In Chapter 4, we will revisit cpd kernels. There, we will actually introduce cpd kernels of different orders. The definition given in the current chapter covers the case of kernels which are cpd of order 1.
Connection PD — CPD

Proposition 2.22 (Constructing PD Kernels from CPD Kernels [42]) Let x₀ ∈ 𝒳, and let k be a symmetric kernel on 𝒳 × 𝒳. Then

k̃(x, x′) := ½ (k(x, x′) − k(x, x₀) − k(x₀, x′) + k(x₀, x₀))

is positive definite if and only if k is conditionally positive definite.

The proof follows directly from the definitions and can be found in [42]. This result does generalize (2.79): the negative squared distance kernel is indeed cpd, since Σ_i c_i = 0 implies

−Σ_{i,j} c_i c_j ‖x_i − x_j‖² = −Σ_i c_i Σ_j c_j ‖x_j‖² − Σ_j c_j Σ_i c_i ‖x_i‖² + 2 Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = 2 Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = 2 ‖Σ_i c_i x_i‖² ≥ 0.

In fact, this implies that all kernels of the form

k(x, x′) = −‖x − x′‖^β, 0 < β ≤ 2,    (2.81)

are cpd (they are not pd),¹⁰ by application of the following result (note that the case β = 0 is trivial):

10. Moreover, they are not cpd if β > 2 [42].
Proposition 2.23 (Fractional Powers and Logs of CPD Kernels [42]) If k : 𝒳 × 𝒳 → (−∞, 0] is cpd, then so are −(−k)^α (0 < α < 1) and −ln(1 − k).

To state another class of cpd kernels that are not pd, note first that as a trivial consequence of Definition 2.20, we know that (i) sums of cpd kernels are cpd, and (ii) any constant b ∈ ℝ is a cpd kernel. Therefore, any kernel of the form k + b, where k is cpd and b ∈ ℝ, is also cpd. In particular, since pd kernels are cpd, we can take any pd kernel and offset it by b, and it will still be at least cpd. For further examples of cpd kernels, cf. [42, 578, 205, 515].
2.4.2 Hilbert Space Representation of CPD Kernels

We now return to the main flow of the argument. Proposition 2.22 allows us to construct the feature map for k from that of the pd kernel k̃. To this end, fix x₀ ∈ 𝒳 and define k̃ according to Proposition 2.22. Due to Proposition 2.22, k̃ is positive definite. Therefore, we may employ the Hilbert space representation Φ : 𝒳 → ℋ of k̃ (cf. (2.32)), satisfying ⟨Φ(x), Φ(x′)⟩ = k̃(x, x′); hence,

‖Φ(x) − Φ(x′)‖² = k̃(x, x) + k̃(x′, x′) − 2k̃(x, x′).    (2.82)

Substituting Proposition 2.22 yields

‖Φ(x) − Φ(x′)‖² = −k(x, x′) + ½ (k(x, x) + k(x′, x′)).    (2.83)

This implies the following result [465, 42].
Feature Map for CPD Kernels

Proposition 2.24 (Hilbert Space Representation of CPD Kernels) Let k be a real-valued CPD kernel on 𝒳, satisfying k(x, x) = 0 for all x ∈ 𝒳. Then there exists a Hilbert space ℋ of real-valued functions on 𝒳, and a mapping Φ : 𝒳 → ℋ, such that

‖Φ(x) − Φ(x′)‖² = −k(x, x′).    (2.84)

If we drop the assumption k(x, x) = 0, the Hilbert space representation reads

‖Φ(x) − Φ(x′)‖² = −k(x, x′) + ½ (k(x, x) + k(x′, x′)).    (2.85)

It can be shown that if k(x, x) = 0 for all x ∈ 𝒳, then

d(x, x′) := √(−k(x, x′)) = ‖Φ(x) − Φ(x′)‖    (2.86)

is a semi-metric: clearly, it is nonnegative and symmetric; additionally, it satisfies the triangle inequality, as can be seen by computing d(x, x′) + d(x′, x″) = ‖Φ(x) − Φ(x′)‖ + ‖Φ(x′) − Φ(x″)‖ ≥ ‖Φ(x) − Φ(x″)‖ = d(x, x″) [42].

It is a metric if k(x, x′) ≠ 0 for x ≠ x′. We thus see that we can rightly think of k as the negative of a distance measure.
We next show how to represent general symmetric kernels (thus in particular cpd kernels) as symmetric bilinear forms Q in feature spaces. This generalization of the previously known feature space representation for pd kernels comes at a cost: Q will no longer be a dot product. For our purposes, we can get away with this. The result will give us an intuitive understanding of Proposition 2.22: we can then write k̃ as k̃(x, x′) := Q(Φ(x) − Φ(x₀), Φ(x′) − Φ(x₀)). Proposition 2.22 thus essentially adds an origin in feature space which corresponds to the image Φ(x₀) of one point x₀ under the feature map.

Feature Map for General Symmetric Kernels

Proposition 2.25 (Vector Space Representation of Symmetric Kernels) Let k be a real-valued symmetric kernel on 𝒳. Then there exists a linear space ℋ of real-valued functions on 𝒳, endowed with a symmetric bilinear form Q(·, ·), and a mapping Φ : 𝒳 → ℋ, such that k(x, x′) = Q(Φ(x), Φ(x′)).

Proof The proof is a direct modification of the pd case. We use the map (2.21) and linearly complete the image as in (2.22). Define Q(f, g) := Σ_i Σ_j α_i β_j k(x_i, x′_j) for f = Σ_i α_i k(·, x_i) and g = Σ_j β_j k(·, x′_j). To see that it is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that Q(f, g) = Σ_j β_j f(x′_j), independent of the α_i. Similarly, for g, note that Q(f, g) = Σ_i α_i g(x_i); hence it is independent of the β_j. The last two equations also show that Q is bilinear; clearly, it is symmetric.

Note, moreover, that by definition of Q, k is a reproducing kernel for the feature space (which is not a Hilbert space): for all functions f (2.22), we have Q(k(·, x), f) = f(x); in particular, Q(k(·, x), k(·, x′)) = k(x, x′).
Rewriting k̃ as k̃(x, x′) := Q(Φ(x) − Φ(x₀), Φ(x′) − Φ(x₀)) suggests an immediate generalization of Proposition 2.22: in practice, we might want to choose other points as origins in feature space — points that do not have a pre-image x₀ in the input domain, such as the mean of a set of points (cf. [543]). This will be useful when considering kernel PCA. It is only crucial that the behavior of our reference point under translation is identical to that of individual points. This is taken care of by the constraint on the sum of the c_i in the following proposition.

Matrix Centering

Proposition 2.26 (Exercise 2.23 in [42]) Let K be a symmetric m × m matrix, e ∈ ℂ^m the vector of all ones, 1 the m × m identity matrix, and let c ∈ ℂ^m satisfy e*c = 1. Then

K̃ := (1 − ec*)K(1 − ce*)    (2.87)

is positive definite if and only if K is conditionally positive definite.¹¹

11. c* is the vector obtained by transposing and taking the complex conjugate of c.

Proof “⇒”: suppose K̃ is positive definite. Thus for any a ∈ ℂ^m which satisfies a*e = e*a = 0, we have 0 ≤ a*K̃a = a*Ka + a*ec*Kce*a − a*Kce*a − a*ec*Ka = a*Ka. This means that 0 ≤ a*Ka, proving that K is conditionally positive definite.

“⇐”: suppose K is conditionally positive definite. This means that we have to show that a*K̃a ≥ 0 for all a ∈ ℂ^m. We have

a*K̃a = a*(1 − ec*)K(1 − ce*)a = s*Ks for s := (1 − ce*)a.    (2.88)

All we need to show is e*s = 0, since then we can use the fact that K is cpd to obtain s*Ks ≥ 0. This can be seen as follows: e*s = e*(1 − ce*)a = (e* − (e*c)e*)a = (e* − e*)a = 0.
||
|
||
Kernel Centering Proposition 2.27 (Adding a General Origin) Let k be a symmetric kernel, x 1 xm
|
||
|
||
, and let ci
|
||
|
||
satisfy ∑mi 1 ci 1. Then
|
||
|
||
∑ ∑ ∑ k˜(x x ) :
|
||
|
||
1 2
|
||
|
||
k(x x )
|
||
|
||
m
|
||
cik(x xi)
|
||
i1
|
||
|
||
m
|
||
cik(xi x )
|
||
i1
|
||
|
||
m
|
||
cicjk(xi x j)
|
||
ij 1
|
||
|
||
is positive definite if and only if k is conditionally positive definite.
|
||
|
||
Application to SVMs
|
||
Application to Kernel PCA
|
||
Application to Parzen Windows Classifiers
|
||
|
||
Proof Consider a set of m
|
||
|
||
points x1
|
||
|
||
xm
|
||
|
||
(m m ) Gram matrix based on x1 xm x1
|
||
|
||
using cm 1
|
||
|
||
cm m 0.
|
||
|
||
, and let K be the (m m ) xm . Apply Proposition 2.26
|
||
|
||
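Here is a minimal sketch of the centering operation (2.87), applied with c = e/m (the "mean" origin of Proposition 2.27) to the cpd kernel −‖x − x′‖² of (2.81). The data and dimensions are arbitrary; note that (2.87) omits the factor ½ appearing in Proposition 2.27, which only rescales the matrix and does not affect positive definiteness.

import numpy as np

def center_gram(K, c=None):
    """K_tilde = (1 - e c^T) K (1 - c e^T), cf. (2.87); default c = e/m."""
    m = K.shape[0]
    e = np.ones((m, 1))
    c = e / m if c is None else np.asarray(c, dtype=float).reshape(m, 1)
    P = np.eye(m) - e @ c.T
    return P @ K @ P.T

# The cpd kernel -||x - x'||^2 becomes pd after centering.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
neg_sq_dists = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K_tilde = center_gram(neg_sq_dists)
# >= 0 up to numerical round-off (may print as -0.0).
print("min eigenvalue after centering:", round(float(np.linalg.eigvalsh(K_tilde).min()), 8))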
Application to SVMs

The above results show that conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the SVM: maximization of the margin of separation between two classes of data is independent of the position of the origin. Seen in this light, it is not surprising that the structure of the dual optimization problem (cf. [561]) allows cpd kernels: as noted in [515, 507], the constraint Σ_{i=1}^m α_i y_i = 0 projects out the same subspace as (2.80) in the definition of conditionally positive definite matrices.

Application to Kernel PCA

Another example of a kernel algorithm that works with conditionally positive definite kernels is Kernel PCA (Chapter 14), where the data are centered, thus removing the dependence on the origin in feature space. Formally, this follows from Proposition 2.26 for c_i = 1/m.

Application to Parzen Windows Classifiers

Let us consider another example. One of the simplest distance-based classification algorithms proceeds as follows. Given m₊ points labelled with +1, m₋ points labelled with −1, and a mapped test point Φ(x), we compute the mean squared distances between the latter and the two classes, and assign it to the one for which this mean is smaller;

y = sgn( (1/m₋) Σ_{y_i=−1} ‖Φ(x) − Φ(x_i)‖² − (1/m₊) Σ_{y_i=+1} ‖Φ(x) − Φ(x_i)‖² ).    (2.89)
We use the distance kernel trick (Proposition 2.24) to express the decision function as a kernel expansion in the input domain: a short calculation shows that

y = sgn( (1/m₊) Σ_{y_i=+1} k(x, x_i) − (1/m₋) Σ_{y_i=−1} k(x, x_i) + b ),    (2.90)

with the constant offset

b = (1/(2m₋)) Σ_{y_i=−1} k(x_i, x_i) − (1/(2m₊)) Σ_{y_i=+1} k(x_i, x_i).    (2.91)

Note that for some cpd kernels, such as (2.81), k(x_i, x_i) is always 0, and thus b = 0. For others, such as the commonly used Gaussian kernel, k(x_i, x_i) is a nonzero constant, in which case b vanishes provided that m₊ = m₋. For normalized Gaussians, the resulting decision boundary can be interpreted as the Bayes decision based on two Parzen window density estimates of the classes; for general cpd kernels, the analogy is merely a formal one; that is, the decision functions take the same form.
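A compact sketch of the classifier (2.89)-(2.91) follows; the Gaussian kernel, the two-blob data, and all parameter values are illustrative assumptions rather than anything prescribed by the text.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def mean_distance_classifier(X, y_train, kernel):
    """Decision function (2.90)-(2.91): compare mean kernel similarity to each class."""
    pos, neg = X[y_train == 1], X[y_train == -1]
    b = (np.mean([kernel(x, x) for x in neg]) - np.mean([kernel(x, x) for x in pos])) / 2.0
    def predict(x):
        s_pos = np.mean([kernel(xi, x) for xi in pos])
        s_neg = np.mean([kernel(xi, x) for xi in neg])
        return int(np.sign(s_pos - s_neg + b))
    return predict

# Toy problem: two Gaussian blobs (entirely made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(30, 2)), rng.normal(-1.5, 1.0, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)
predict = mean_distance_classifier(X, y, gaussian_kernel)
print("training accuracy:", np.mean([predict(x) == t for x, t in zip(X, y)]))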
Properties of CPD Kernels

Many properties of positive definite kernels carry over to the more general case of conditionally positive definite kernels, such as Proposition 13.1.

Using Proposition 2.22, one can prove an interesting connection between the two classes of kernels:

Proposition 2.28 (Connection PD — CPD [465]) A kernel k is conditionally positive definite if and only if exp(tk) is positive definite for all t > 0.

Positive definite kernels of the form exp(tk) (t > 0) have the interesting property that their n-th root (n ∈ ℕ) is again a positive definite kernel. Such kernels are called infinitely divisible. One can show that, disregarding some technicalities, the logarithm of an infinitely divisible positive definite kernel mapping into ℝ₊ \ {0} is a conditionally positive definite kernel.
2.4.3 Higher Order CPD Kernels

For the sake of completeness, we now present some material which is of interest to one section later in the book (Section 4.8), but not central for the present chapter. We follow [341, 204].

Definition 2.29 (Conditionally Positive Definite Functions of Order q) A continuous function h, defined on [0, ∞), is called conditionally positive definite (cpd) of order q on ℝ^N if for any distinct points x_1, …, x_m ∈ ℝ^N, the quadratic form

Σ_{i,j=1}^m α_i α_j h(‖x_i − x_j‖²)    (2.92)

is nonnegative, provided that the scalars α_1, …, α_m satisfy Σ_{i=1}^m α_i p(x_i) = 0 for all polynomials p(·) on ℝ^N of degree lower than q.

Let Π_q^N denote the space of polynomials of degree lower than q on ℝ^N. By definition, every cpd function h of order q generates a positive definite kernel for SV expansions in the space of functions orthogonal to Π_q^N, by setting k(x, x′) := h(‖x − x′‖²).

There exists also an analogue to the positive definiteness of the integral operator in the conditions of Mercer's theorem. In [157, 341] it is shown that for cpd functions h of order q, we have

∫∫ h(‖x − x′‖²) f(x) f(x′) dx dx′ ≥ 0,    (2.93)

provided that the projection of f onto Π_q^N is zero.
[Figure 2.4: four panels plotting k(‖x − x′‖) against ‖x − x′‖ for the Gaussian, Inverse Multiquadric, Multiquadric, and Thin Plate Spline (n = 1) kernels.]

Figure 2.4 Conditionally positive definite functions, as described in Table 2.1. Where applicable, we set the free parameter c to 1; β is set to 2. Note that cpd kernels need not be positive anywhere (e.g., the Multiquadric kernel).
Table 2.1 Examples of Conditionally Positive Definite Kernels. The fact that the exponential kernel is pd (i.e., cpd of order 0) follows from (2.81) and Proposition 2.28.

Kernel                                Name                     Order
exp(−c ‖x − x′‖^β), 0 < β ≤ 2         Exponential              0
1 / √(‖x − x′‖² + c²)                 Inverse Multiquadric     0
√(‖x − x′‖² + c²)                     Multiquadric             1
‖x − x′‖^{2n} ln ‖x − x′‖             Thin Plate Spline        n
Definition 2.30 (Completely Monotonic Functions) A function h(x) is called completely monotonic of order q if

(−1)^n (dⁿ/dxⁿ) h(x) ≥ 0 for all x ∈ [0, ∞) and n ≥ q.    (2.94)

It can be shown [464, 465, 360] that a function h(x²) is conditionally positive definite if and only if h(x) is completely monotonic of the same order. This gives a (sometimes simpler) criterion for checking whether a function is cpd or not.

If we use cpd kernels in learning algorithms, we must ensure orthogonality of the estimate with respect to Π_q^N. This is usually done via constraints Σ_{i=1}^m α_i p(x_i) = 0 for all polynomials p(·) on ℝ^N of degree lower than q (see Section 4.8).
2.5 Summary
The crucial ingredient of SVMs and other kernel methods is the so-called kernel trick (see (2.7) and Remark 2.8), which permits the computation of dot products in high-dimensional feature spaces, using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example. The mathematical result underlying the kernel trick is almost a century old [359]. Nevertheless, it was only much later that it was exploited by the machine learning community for the analysis [4] and construction of algorithms [62], and that it was described as a general method for constructing nonlinear generalizations of dot product algorithms [480].

The present chapter has reviewed the mathematical theory of kernels. We started with the class of polynomial kernels, which can be motivated as computing a combinatorially large number of monomial features rather efficiently. This led to the general question of which kernels can be used; or: which kernels can be represented as dot products in a linear feature space. We defined this class and discussed some of its properties. We described several ways in which, given such a kernel, one can construct a representation in a feature space. The best-known representation employs Mercer's theorem, and represents the feature space as an ℓ₂ space defined in terms of the eigenfunctions of an integral operator associated with the kernel. An alternative representation uses elements of the theory of reproducing kernel Hilbert spaces, and yields additional insights, representing the linear space as a space of functions written as kernel expansions. We gave an in-depth discussion of the kernel trick in its general form, including the case where we are interested in dissimilarities rather than similarities; that is, when we want to come up with nonlinear generalizations of distance-based algorithms rather than dot-product-based algorithms.

In both cases, the underlying philosophy is the same: we are trying to express a complex nonlinear algorithm in terms of simple geometrical concepts, and we are then dealing with it in a linear space. This linear space may not always be readily available; in some cases, it may even be hard to construct explicitly. Nevertheless, for the sake of design and analysis of the algorithms, it is sufficient to know that the linear space exists, empowering us to use the full potential of geometry, linear algebra and functional analysis.
2.6 Problems

2.1 (Monomial Features in ℝ^2) Verify the second equality in (2.9).
|
||
2.2 (Multiplicity of Monomial Features in ℝ^N [515]) Consider the monomial kernel k(x, x') = ⟨x, x'⟩^d (where x, x' ∈ ℝ^N), generating monomial features of order d. Prove that a valid feature map for this kernel can be defined coordinate-wise as

\Phi_m(x) = \sqrt{\frac{d!}{\prod_{i=1}^{N} [m]_i!}} \; \prod_{i=1}^{N} [x]_i^{[m]_i}    (2.95)

for every m ∈ ℕ^N with ∑_{i=1}^N [m]_i = d (i.e., every such m corresponds to one dimension of the feature space).
|
||
|
||
2.3 (Inhomogeneous Polynomial Kernel ) Prove that the kernel (2.70) induces a feature map into the space of all monomials up to degree d. Discuss the role of c.
|
||
|
||
2.4 (Eigenvalue Criterion of Positive Definiteness ) Prove that a symmetric matrix is positive definite if and only if all its eigenvalues are nonnegative (see Appendix B).
|
||
|
||
2.5 (Dot Products are Kernels ) Prove that dot products (Definition B.7) are positive definite kernels.
|
||
|
||
2.6 (Kernels on Finite Domains) Prove that for finite 𝒳, say 𝒳 = {x_1, …, x_m}, k is a kernel if and only if the m × m matrix (k(x_i, x_j))_{ij} is positive definite.
|
||
|
||
2.7 (Positivity on the Diagonal ) From Definition 2.5, prove that a kernel satisfies k(x x) 0 for all x .
|
||
|
||
2.8 (Cauchy-Schwarz for Kernels ) Give an elementary proof of Proposition 2.7. Hint: start with the general form of a symmetric 2 2 matrix, and derive conditions for
|
||
its coefficients that ensure that it is positive definite.
|
||
|
||
2.9 (PD Kernels Vanishing on the Diagonal) Use Proposition 2.7 to prove that a kernel satisfying k(x, x) = 0 for all x ∈ 𝒳 is identically zero.
|
||
How does the RKHS look in this case? Hint: use (2.31).
|
||
|
||
2.10 (Two Kinds of Positivity ) Give an example of a kernel which is positive definite according to Definition 2.5, but not positive in the sense that k(x x ) 0 for all x x .
|
||
Give an example of a kernel where the contrary is the case.
|
||
|
||
2.11 (General Coordinate Transformations ) Prove that if : and k(x x ) is a kernel, then k( (x) (x )) is a kernel, too.
|
||
|
||
is a bijection,
|
||
|
||
2.12 (Positivity on the Diagonal ) Prove that positive definite kernels are positive on the diagonal, k(x x) 0 for all x . Hint: use m 1 in (2.15).
|
||
|
||
2.13 (Symmetry of Complex Kernels ) Prove that complex-valued positive definite kernels are symmetric (2.18).
|
||
|
||
2.14 (Real Kernels vs. Complex Kernels ) Prove that a real matrix satisfies (2.15) for all ci if and only if it is symmetric and it satisfies (2.15) for real coefficients ci.
|
||
Hint: decompose each ci in (2.15) into real and imaginary parts.
|
||
|
||
|
||
|
||
2.15 (Rank-One Kernels ) Prove that if f is a real-valued function on , then k(x x ) : f (x) f (x ) is a positive definite kernel.
|
||
|
||
2.16 (Bayes Kernel ) Consider a binary pattern recognition problem. Specialize the
|
||
|
||
last problem to the case where f :
|
||
|
||
1 equals the Bayes decision function y(x),
|
||
|
||
i.e., the classification with minimal risk subject to an underlying distribution P(x y)
|
||
|
||
generating the data.
|
||
|
||
Argue that this kernel is particularly suitable since it renders the problem linearly
|
||
|
||
separable in a 1D feature space: State a decision function (cf. (1.35)) that solves the problem
|
||
|
||
(hint: you just need one parameter , and you may set it to 1; moreover, use b 0) [124].
|
||
|
||
The final part of the problem requires knowledge of Chapter 16: Consider now the
|
||
|
||
situation where some prior P( f ) over the target function class is given. What would the
|
||
|
||
optimal kernel be in this case? Discuss the connection to Gaussian processes.
|
||
|
||
2.17 (Inhomogeneous Polynomials ) Prove that the inhomogeneous polynomial (2.70) is a positive definite kernel, e.g., by showing that it is a linear combination of homogeneous polynomial kernels with positive coefficients. What kind of features does this kernel compute [561]?
|
||
|
||
2.18 (Normalization in Feature Space) Given a kernel k, construct a corresponding normalized kernel k̃ by normalizing the feature map Φ̃ such that for all x ∈ 𝒳, ||Φ̃(x)|| = 1 (cf. also Definition 12.35). Discuss the relationship between normalization in input space and normalization in feature space for Gaussian kernels and homogeneous polynomial kernels.
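A possible starting point is the normalization k̃(x, x') = k(x, x') / sqrt(k(x, x) k(x', x')), which rescales Φ(x) to unit length. The short sketch below (our own illustration, assuming NumPy) shows that for homogeneous polynomial kernels this coincides with normalizing the inputs first:

```python
import numpy as np

def normalized_kernel(k, x, xp):
    """Normalize k so that the feature map has unit norm: k~(x, x) = 1."""
    return k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

# For a homogeneous polynomial kernel, normalization in feature space
# coincides with normalizing the input vectors to unit length beforehand.
poly = lambda x, xp, d=3: (x @ xp) ** d
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(normalized_kernel(poly, x, xp))
print(poly(x / np.linalg.norm(x), xp / np.linalg.norm(xp)))  # same value
```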
|
||
|
||
2.19 (Cosine Kernel ) Suppose is a dot product space, and x x . Prove that k(x x ) cos( (x x)) is a positive definite kernel. Hint: use Problem 2.18.
|
||
|
||
2.20 (Alignment Kernel ) Let K K F : ∑ij KijKij be the Frobenius dot product of two matrices. Prove that the empirical alignment of two Gram matrices [124],
|
||
A(K K ) : K K F K K F K K F, is a positive definite kernel. Note that the alignment can be used for model selection, putting Kij : yiy j (cf.
|
||
Problem 2.16) and Ki j : sgn (k(xi x j)) or Ki j : sgn (k(xi x j)) b (cf. [124]).
|
||
|
||
2.21 (Equivalence Relations as Kernels) Consider a similarity measure k : 𝒳 × 𝒳 → {0, 1} with

k(x, x) = 1 for all x ∈ 𝒳.    (2.96)

Prove that k is a positive definite kernel if and only if, for all x, x', x'' ∈ 𝒳,

k(x, x') = 1 \implies k(x', x) = 1, and    (2.97)

k(x, x') = 1 \text{ and } k(x', x'') = 1 \implies k(x, x'') = 1.    (2.98)

Equations (2.96) to (2.98) amount to saying that k = I_T, where T ⊆ 𝒳 × 𝒳 is an equivalence relation.
|
||
|
||
|
||
|
||
As a simple example, consider an undirected graph, and let (x x ) T whenever x and x are in the same connected component of the graph. Show that T is an equivalence relation.
|
||
Find examples of equivalence relations that lend themselves to an interpretation as similarity measures. Discuss whether there are other relations that one might want to use as similarity measures.
|
||
|
||
2.22 (Different Feature Spaces for the Same Kernel ) Give an example of a kernel with two valid feature maps Φ1 Φ2, mapping into spaces 1 2 of different dimensions.
|
||
|
||
2.23 (Converse of Mercer's Theorem) Prove that if an integral operator kernel k admits a uniformly convergent dot product representation on some compact set 𝒳 × 𝒳,

k(x, x') = \sum_{i=1}^{\infty} \psi_i(x) \psi_i(x'),    (2.99)

then it is positive definite. Hint: show that

\int\!\!\int \sum_{i=1}^{\infty} \psi_i(x) \psi_i(x') f(x) f(x') \, dx \, dx' = \sum_{i=1}^{\infty} \left( \int \psi_i(x) f(x) \, dx \right)^2 \ge 0.

Argue that in particular, polynomial kernels (2.67) satisfy Mercer's conditions.
|
||
|
||
2.24 ( -Norm of Mercer Eigenfunctions ) Prove that under the conditions of Theorem 2.10, we have, up to sets of measure zero,
|
||
|
||
sup
|
||
|
||
jj
|
||
|
||
k
|
||
|
||
j
|
||
|
||
(2.100)
|
||
|
||
Hint: note that k
|
||
|
||
k(x x) up to sets of measures zero, and use the series expansion
|
||
|
||
given in Theorem 2.10. Show, moreover, that it is not generally the case that
|
||
|
||
sup j
|
||
j
|
||
|
||
(2.101)
|
||
|
||
Hint: consider the case where
|
||
|
||
, ( n ) : 2 n, and k(i j) : ij. Show that
|
||
|
||
1. Tk((a j)) (a j2 j) for (a j) L2( ),
|
||
|
||
2. Tk satisfies (a j) Tk(a j) ∑j(a j2 j)2 0 and is thus positive definite,
|
||
|
||
3. j 2 j and j 2j 2ej form an orthonormal eigenvector decomposition of Tk (here,
|
||
|
||
ej is the jth canonical unit vector in 2), and
|
||
|
||
4. j
|
||
|
||
2j 2
|
||
|
||
j 1 2.
|
||
|
||
Argue that the last statement shows that (2.101) is wrong and (2.100) is tight. 12
|
||
|
||
2.25 (Generalized Feature Maps) Via (2.38), Mercer kernels induce compact (integral) operators. Can you generalize the idea of defining a feature map associated with an operator to more general bounded positive definite operators T? Hint: use the multiplication operator representation of T [467].

12. Thanks to S. Smale and I. Steinwart for this exercise.
|
||
|
||
2.26 (Nyström Approximation (cf. [603])) Consider the integral operator obtained by substituting the distribution P underlying the data into (2.38), i.e.,

(T_k f)(x') = \int k(x, x') f(x) \, dP(x).    (2.102)

If the conditions of Mercer's theorem are satisfied, then k can be diagonalized as

k(x, x') = \sum_{j=1}^{N} \lambda_j \psi_j(x) \psi_j(x'),    (2.103)

where the \lambda_j and \psi_j satisfy the eigenvalue equation

\int k(x, x') \psi_j(x) \, dP(x) = \lambda_j \psi_j(x')    (2.104)

and the orthonormality conditions

\int \psi_i(x) \psi_j(x) \, dP(x) = \delta_{ij}.    (2.105)

Show that by replacing the integral by a summation over an iid sample X = {x_1, …, x_m} from P(x), one can recover the kernel PCA eigenvalue problem (Section 1.7). Hint: Start by evaluating (2.104) for x' ∈ X, to obtain m equations. Next, approximate the integral by a sum over the points in X, replacing \int k(x, x') \psi_j(x) \, dP(x) by \frac{1}{m} \sum_{n=1}^{m} k(x_n, x') \psi_j(x_n). Derive the orthogonality condition for the eigenvectors (\psi_j(x_n))_{n=1,…,m} from (2.105).
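For orientation, the empirical counterpart of (2.104)–(2.105) can be written down directly: replacing dP(x) by the sample average turns the integral equation into an eigenvalue problem for the Gram matrix. The sketch below is our own illustration (Gaussian kernel and NumPy assumed), not the book's code:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                       # iid sample from P
gamma = 0.5
sq = np.sum((X[:, None] - X[None, :]) ** 2, -1)
K = np.exp(-gamma * sq)                             # Gram matrix k(x_n, x_i)

# Empirical version of (2.104): (1/m) K v_j = lambda_j v_j, where v_j
# collects the sample values of the j-th eigenfunction (up to scaling).
m = len(X)
eigvals, eigvecs = np.linalg.eigh(K / m)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Empirical counterpart of (2.105): (1/m) sum_n psi_i(x_n) psi_j(x_n) = delta_ij,
# obtained by rescaling the unit-norm eigenvectors by sqrt(m).
psi = np.sqrt(m) * eigvecs
print(np.allclose(psi.T @ psi / m, np.eye(m), atol=1e-8))
# This is the eigenvalue problem underlying kernel PCA, up to the centering
# of the data in feature space discussed in Section 1.7.
```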
|
||
|
||
2.27 (Lorentzian Feature Spaces ) If a finite number of eigenvalues is negative, the expansion in Theorem 2.10 is still valid. Show that in this case, k corresponds to a Lorentzian symmetric bilinear form in a space with indefinite signature [467].
|
||
Discuss whether this causes problems for learning algorithms utilizing these kernels. In particular, consider the cases of SV machines (Chapter 7) and Kernel PCA (Chapter 14).
|
||
|
||
2.28 (Symmetry of Reproducing Kernels ) Show that reproducing kernels (Definition 2.9) are symmetric. Hint: use (2.35) and exploit the symmetry of the dot product.
|
||
|
||
2.29 (Coordinate Representation in the RKHS ) Write
|
||
|
||
as a dot product of
|
||
|
||
coordinate vectors by expressing the functions of the RKHS in the basis ( n n)n 1 N ,
|
||
|
||
which is orthonormal with respect to , i.e.,
|
||
|
||
N
|
||
f (x) ∑ n n1
|
||
|
||
n n(x)
|
||
|
||
(2.106)
|
||
|
||
Obtain an expression for the coordinates n, using (2.47) and n f n n Show that has the structure of a RKHS in the sense that for f and g given by (2.106), and
|
||
|
||
N
|
||
g(x) ∑ j j1
|
||
|
||
j j(x)
|
||
|
||
(2.107)
|
||
|
||
60
|
||
|
||
Kernels
|
||
|
||
we have
|
||
|
||
f g Show, moreover, that f (x)
|
||
|
||
Φ(x) in . In other words,
|
||
|
||
Φ(x) is the coordinate representation of the kernel as a function of one argument.
|
||
|
||
2.30 (Equivalence of Regularization Terms ) Using (2.36) and (2.41), prove that w 2, where w ∑mi 1 iΦ(xi), is the same no matter whether Φ denotes the RKHS fea-
|
||
ture map (2.21) or the Mercer feature map (2.40).
|
||
|
||
2.31 (Approximate Inversion of Gram Matrices ) Use the kernel PCA map (2.59) to derive a method for approximately inverting a large Gram matrix.
|
||
|
||
2.32 (Effective Dimension of Feature Space ) Building on Section 2.2.7, argue that for a finite data set, we are always effectively working in a finite-dimensional feature space.
|
||
|
||
2.33 (Translation of a Dot Product ) Prove (2.79).
|
||
|
||
2.34 (Example of a CPD Kernel ) Argue that the hyperbolic tangent kernel (2.69) is effectively conditionally positive definite, if the input values are suitably restricted, since it can be approximated by k b, where k is a polynomial kernel (2.67) and b . Discuss how this explains that hyperbolic tangent kernels can be used for SVMs although, as pointed out in number of works (e.g., [86], cf. the remark following (2.69)), they are not positive definite.
|
||
|
||
2.35 (Polarization Identity) Prove the polarization identity, stating that for any symmetric bilinear form ⟨·, ·⟩ : 𝒳 × 𝒳 → ℝ, we have, for all x, x' ∈ 𝒳,

⟨x, x'⟩ = \frac{1}{4} \big( ⟨x + x', x + x'⟩ − ⟨x − x', x − x'⟩ \big).    (2.108)

Now consider the special case where ⟨·, ·⟩ is a Euclidean dot product and ⟨x − x', x − x'⟩ is the squared Euclidean distance between x and x'. Discuss why the polarization identity does not imply that the value of the dot product can be recovered from the distances alone. What else does one need?
|
||
|
||
2.36 (Vector Space Representation of CPD Kernels ) Specialize the vector space representation of symmetric kernels (Proposition 2.25) to the case of cpd kernels. Can you identify a subspace on which a cpd kernel is actually pd?
|
||
|
||
2.37 (Parzen Windows Classifiers in Feature Space ) Assume that k is a positive definite kernel. Compare the algorithm described in Section 1.2 with the one of (2.89). Construct situations where the two algorithms give different results. Hint: consider datasets where the class means coincide.
|
||
|
||
2.38 (Canonical Distortion Kernel ) Can you define a kernel based on Baxter’s canonical distortion metric [28]?
|
||
|
||
3
|
||
|
||
Risk and Loss Functions
|
||
|
||
Overview
|
||
|
||
One of the most immediate requirements in any learning problem is to specify what exactly we would like to achieve, minimize, bound, or approximate. In other words, we need to determine a criterion according to which we will assess the quality of an estimate f : 𝒳 → 𝒴 obtained from data.
This question is far from trivial. Even in binary classification there exist ample choices. The selection criterion may be the fraction of patterns classified correctly, it could involve the confidence with which the classification is carried out, or it might take into account the fact that losses are not symmetric for the two classes, such as in health diagnosis problems. Furthermore, the loss for an error may be input-dependent (for instance, meteorological predictions may require a higher accuracy in urban regions), and finally, we might want to obtain probabilities rather than a binary prediction of the class labels 1 and −1. Multi-class discrimination and regression add even further levels of complexity to the problem. Thus we need a means of encoding these criteria.
The chapter is structured as follows: in Section 3.1, we begin with a brief overview of common loss functions used in classification and regression algorithms. This is done without much mathematical rigor or statistical justification, in order to provide basic working knowledge for readers who want to get a quick idea of the default design choices in the area of kernel machines. Following this, Section 3.2 formalizes the idea of risk. The risk approach is the predominant technique used in this book, and most of the algorithms presented subsequently minimize some form of a risk functional. Section 3.3 treats the concept of loss functions from a statistical perspective, points out the connection to the estimation of densities, and introduces the notion of efficiency. Readers interested in more detail should also consider Chapter 16, which discusses the problem of estimation from a Bayesian perspective. The later parts of this section are intended for readers interested in the more theoretical details of estimation. The concept of robustness is introduced in Section 3.4. Several commonly used loss functions, such as Huber's loss and the ε-insensitive loss, enjoy robustness properties with respect to rather general classes of distributions. Beyond the basic relations, we will show how to adjust the ε-insensitive loss in such a way as to accommodate different amounts of variance automatically. This will later lead to the construction of so-called Support Vector Algorithms (see Chapters 7, 8, and 9).
While technical details and proofs can be omitted for most of the present chapter, we encourage the reader to review the practical implications of this section.
Prerequisites (roadmap): 3.1 Loss Functions (3.1.1 Classification, 3.1.2 Regression); 3.2 Empirical Risk Functional; 3.3.1 Maximum Likelihood; 3.3.2 Efficiency; 3.4 Robustness; 3.4.2 ε-insensitive loss; 3.4.3, 3.4.4 Adaptive Loss Functions & ν.
|
||
|
||
As usual, exercises for all sections can be found at the end. The chapter requires knowledge of probability theory, as introduced in Section B.1.
|
||
|
||
3.1 Loss Functions
|
||
|
||
Minimized Loss Incurred Loss
|
||
|
||
Let us begin with a formal definition of what we mean by the loss incurred by a function f at location x, given an observation y.
|
||
|
||
Definition 3.1 (Loss Function) Denote by (x, y, f(x)) ∈ 𝒳 × 𝒴 × 𝒴 the triplet consisting of a pattern x, an observation y and a prediction f(x). Then the map c : 𝒳 × 𝒴 × 𝒴 → [0, ∞) with the property c(x, y, y) = 0 for all x ∈ 𝒳 and y ∈ 𝒴 will be called a loss function.
|
||
|
||
Note that we require c to be a nonnegative function. This means that we will never get a payoff from an extra good prediction. If the latter was the case, we could always recover non-negativity (provided the loss is bounded from below), by using a simple shift operation (possibly depending on x). Likewise we can always satisfy the condition that exact predictions ( f (x) y) never cause any loss. The advantage of these extra conditions on c is that we know that the minimum of the loss is 0 and that it is obtainable, at least for a given x y.
|
||
Next we will formalize different kinds of loss, as described informally in the introduction of the chapter. Note that the incurred loss is not always the quantity that we will attempt to minimize. For instance, for algorithmic reasons, some loss functions will prove to be infeasible (the binary loss, for instance, can lead to NPhard optimization problems [367]). Furthermore, statistical considerations such as the desire to obtain confidence levels on the prediction (Section 3.3.1) will also influence our choice.
|
||
|
||
3.1.1 Binary Classification
|
||
|
||
Misclassification Error
|
||
|
||
The simplest case to consider involves counting the misclassification error: if pattern x is classified wrongly we incur loss 1, otherwise there is no penalty:

c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{otherwise.} \end{cases}    (3.1)
|
||
|
||
|
||
|
||
Asymmetric and Input-Dependent Loss
|
||
Confidence Level Soft Margin Loss
|
||
|
||
This definition of c does not distinguish between different classes and types of errors (false positive or negative).1
|
||
A slight extension takes the latter into account. For the sake of simplicity let us assume, as in (3.1), that we have a binary classification problem. This time, however, the loss may depend on a function c̃(x) which accounts for input-dependence, i.e.

c(x, y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ \tilde{c}(x) & \text{otherwise.} \end{cases}    (3.2)
|
||
|
||
A simple (albeit slightly contrived) example is the classification of objects into rocks and diamonds. Clearly, the incurred loss will depend largely on the weight of the object under consideration.
|
||
Analogously, we might distinguish between errors for y 1 and y 1 (see, e.g., [331] for details). For instance, in a fraud detection application, we would like to be really sure about the situation before taking any measures, rather than losing potential customers. On the other hand, a blood bank should consider even the slightest suspicion of disease before accepting a donor.
|
||
Rather than predicting only whether a given object x belongs to a certain class y, we may also want to take a certain confidence level into account. In this case, f (x) becomes a real-valued function, even though y 1 1 .
|
||
In this case, sgn(f(x)) denotes the class label, and the absolute value |f(x)| the confidence of the prediction. Corresponding loss functions will depend on the product y f(x) to assess the quality of the estimate. The soft margin loss function, as introduced by Bennett and Mangasarian [40, 111], is defined as

c(x, y, f(x)) = \max(0, 1 − y f(x)) = \begin{cases} 0 & \text{if } y f(x) \ge 1 \\ 1 − y f(x) & \text{otherwise.} \end{cases}    (3.3)
|
||
|
||
In some cases [348, 125] (see also Section 10.6.2) the squared version of (3.3) provides an expression that can be minimized more easily:

c(x, y, f(x)) = \max(0, 1 − y f(x))^2.    (3.4)
|
||
|
||
Logistic Loss
|
||
|
||
The soft margin loss closely resembles the so-called logistic loss function (cf. [251], as well as Problem 3.1 and Section 16.1.1):

c(x, y, f(x)) = \ln\big(1 + \exp(−y f(x))\big).    (3.5)
|
||
|
||
We will derive this loss function in Section 3.3.1. It is used in order to associate a probabilistic meaning with f (x).
|
||
Note that in both (3.3) and (3.5) (nearly) no penalty occurs if y f (x) is sufficiently large, i.e. if the patterns are classified correctly with large confidence. In particular, in (3.3) a minimum confidence of 1 is required for zero loss. These loss functions
|
||
|
||
1. A false positive is a point which the classifier erroneously assigns to class 1; a false negative is a point erroneously assigned to class −1.
|
||
|
||
|
||
|
||
Multi Class Discrimination
|
||
|
||
Figure 3.1 From left to right: 0-1 loss, linear soft margin loss, logistic regression, and quadratic soft margin loss, each plotted as a function of y f(x). Note that both soft margin loss functions are upper bounds on the 0-1 loss.
|
||
|
||
led to the development of large margin classifiers (see [491, 460, 504] and Chapter 5 for further details). Figure 3.1 depicts various popular loss functions.2
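To make Figure 3.1 concrete, the margin losses can be written as functions of t = y f(x); the following sketch (our own, NumPy assumed) also verifies the upper-bound property noted in the caption:

```python
import numpy as np

def zero_one(t):     return (t <= 0).astype(float)       # 0-1 loss (3.1) via the margin
def soft_margin(t):  return np.maximum(0.0, 1.0 - t)     # linear soft margin (3.3)
def soft_margin2(t): return np.maximum(0.0, 1.0 - t)**2  # quadratic soft margin (3.4)
def logistic(t):     return np.log(1.0 + np.exp(-t))     # logistic loss (3.5)

t = np.linspace(-5, 5, 201)
assert np.all(soft_margin(t) >= zero_one(t))   # upper bound on the 0-1 loss
assert np.all(soft_margin2(t) >= zero_one(t))  # likewise for the squared version
# The logistic loss is not an upper bound (e.g. its value at t = 0 is ln 2).
```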
|
||
Matters are more complex when dealing with more than two classes. Each type of misclassification could potentially incur a different loss, leading to an M M matrix (M being the number of classes) with positive off-diagonal and zero diagonal entries. It is still a matter of ongoing research in which way a confidence level should be included in such cases (cf. [41, 311, 593, 161, 119]).
|
||
|
||
3.1.2 Regression
|
||
|
||
When estimating real-valued quantities, it is usually the size of the difference y − f(x), i.e. the amount of misprediction, rather than the product y f(x), which is used to determine the quality of the estimate. For instance, this can be the actual loss incurred by mispredictions (e.g., the loss incurred by mispredicting the value of a financial instrument at the stock exchange), provided the latter is known and computationally tractable.3 Assuming location independence, in most cases the loss function will be of the type

c(x, y, f(x)) = \tilde{c}(f(x) − y).    (3.7)
|
||
|
||
See Figure 3.2 below for several regression loss functions. Below we list the ones most common in kernel methods.
|
||
|
||
2. Other popular loss functions from the generalized linear model context include the inverse complementary log-log function. It is given by

c(x, y, f(x)) = 1 − \exp(−\exp(y f(x))).    (3.6)
|
||
|
||
This function, unfortunately, is not convex and therefore it will not lead to a convex optimization problem. However, it has nice robustness properties and therefore we think that it should be investigated in the present context. 3. As with classification, computational tractability is one of the primary concerns. This is not always satisfying from a statistician’s point of view, yet it is crucial for any practical implementation of an estimation algorithm.
|
||
|
||
|
||
|
||
Squared Loss
|
||
-insensitive Loss and 1 Loss
|
||
Practical Considerations
|
||
|
||
The popular choice is to minimize the sum of squares of the residuals f(x) − y. As we shall see in Section 3.3.1, this corresponds to the assumption that we have additive normal noise corrupting the observations y_i. Consequently we minimize

c(x, y, f(x)) = (f(x) − y)^2, \quad \text{or equivalently} \quad \tilde{c}(\xi) = \xi^2.    (3.8)

For convenience of subsequent notation, \frac{1}{2}\xi^2 rather than \xi^2 is often used.
|
||
|
||
An extension of the soft margin loss (3.3) to regression is the ε-insensitive loss function [561, 572, 562]. It is obtained by symmetrization of the “hinge” of (3.3),

\tilde{c}(\xi) = \max(|\xi| − ε, 0) =: |\xi|_ε.    (3.9)

The idea behind (3.9) is that deviations up to ε should not be penalized, and all further deviations should incur only a linear penalty. Setting ε = 0 leads to an ℓ_1 loss, i.e., to minimization of the sum of absolute deviations. This is written

\tilde{c}(\xi) = |\xi|.    (3.10)
|
||
|
||
We will study these functions in more detail in Section 3.4.2.
For efficient implementations of learning procedures, it is crucial that loss functions satisfy certain properties. In particular, they should be cheap to compute, have a small number of discontinuities (if any) in the first derivative, and be convex in order to ensure the uniqueness of the solution (see Chapter 6 and also Problem 3.6 for details). Moreover, we may want to obtain solutions that are computationally efficient, which may disregard a certain number of training points. This leads to conditions such as vanishing derivatives for a range of function values f(x). Finally, requirements such as outlier resistance are also important for the construction of estimators.
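A minimal sketch of the regression losses (3.8)–(3.10) (our own illustration, assuming NumPy); note the flat region of the ε-insensitive loss, which is what produces the vanishing derivatives mentioned above:

```python
import numpy as np

def squared(xi):              return xi ** 2                           # (3.8)
def eps_insensitive(xi, eps): return np.maximum(np.abs(xi) - eps, 0.0) # (3.9)
def absolute(xi):             return np.abs(xi)                        # (3.10), eps = 0

xi = np.linspace(-3, 3, 7)
print(eps_insensitive(xi, eps=1.0))                          # zero inside the tube |xi| <= 1
print(np.allclose(eps_insensitive(xi, 0.0), absolute(xi)))   # True: eps = 0 gives the l1 loss
```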
|
||
|
||
3.2 Test Error and Expected Risk
|
||
|
||
Now that we have determined how errors should be penalized on specific in-
|
||
|
||
stances (x y f (x)), we have to find a method to combine these (local) penalties.
|
||
|
||
This will help us to assess a particular estimate f .
|
||
|
||
In the following, we will assume that there exists a probability distribution
|
||
|
||
P(x y) on
|
||
|
||
which governs the data generation and underlying functional
|
||
|
||
dependency. Moreover, we denote by P(y x) the conditional distribution of y given
|
||
|
||
x, and by dP(x y) and dP(y x) the integrals with respect to the distributions P(x y)
|
||
|
||
and P(y x) respectively (cf. Section B.1.3).
|
||
|
||
3.2.1 Exact Quantities
|
||
|
||
Unless stated otherwise, we assume that the data (x y) are drawn iid (independent and identically distributed, see Section B.1) from P(x y). Whether or not we have
|
||
|
||
Transduction Problem
Empirical Density
|
||
|
||
knowledge of the test patterns at training time4 makes a significant difference in the design of learning algorithms. In the latter case, we will want to minimize the test error on that specific test set; in the former case, the expected error over all possible test sets.
|
||
|
||
Definition 3.2 (Test Error) Assume that we are not only given the training data x_1, …, x_m along with target values y_1, …, y_m, but also the test patterns x'_1, …, x'_{m'} on which we would like to predict y'_i (i = 1, …, m'). Since we already know the x'_i, all we should care about is to minimize the expected error on the test set. We formalize this in the following definition:

R_{test}[f] := \frac{1}{m'} \sum_{i=1}^{m'} \int c(x'_i, y, f(x'_i)) \, dP(y|x'_i).    (3.11)
|
||
|
||
Unfortunately, this problem, referred to as transduction, is quite difficult to address, both computationally and conceptually, see [562, 267, 37, 211]. Instead, one typically considers the case where no knowledge about test patterns is available, as described in the following definition.
|
||
|
||
Definition 3.3 (Expected Risk) If we have no knowledge about the test patterns (or decide to ignore them) we should minimize the expected error over all possible training patterns. Hence we have to minimize the expected loss with respect to P and c
|
||
|
||
R[f] := E\big[R_{test}[f]\big] = E\big[c(x, y, f(x))\big] = \int_{𝒳×𝒴} c(x, y, f(x)) \, dP(x, y).    (3.12)
|
||
|
||
Here the integration is carried out with respect to the distribution P(x y). Again,
|
||
|
||
just as (3.11), this problem is intractable, since we do not know P(x y) explicitly.
|
||
|
||
Instead, we are only given the training patterns (xi yi). The latter, however, allow us to replace the unknown distribution P(x y) by its empirical estimate.
|
||
|
||
To study connections between loss functions and density models, it will be
|
||
|
||
convenient to assume that there exists a density p(x y) corresponding to P(x y).
|
||
|
||
This means that we may replace dP(x y) by p(x y)dxdy and the appropriate
|
||
|
||
measure on
|
||
|
||
. Such a density p(x y) need not always exist (see Section B.1 for
|
||
|
||
more details) but we will not give further heed to these concerns at present.
|
||
|
||
3.2.2 Approximations
|
||
|
||
Unfortunately, this change in notation did not solve the problem. All we have at our disposal is the actual training data. What one usually does is replace p(x y) by the empirical density
|
||
|
||
p_{emp}(x, y) := \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x) \, \delta_{y_i}(y).    (3.13)
|
||
|
||
4. The test outputs, however, are not available during training.
|
||
|
||
|
||
|
||
Here \delta_{x'}(x) denotes the δ-distribution, satisfying \int \delta_{x'}(x) f(x) \, dx = f(x'). The hope is that replacing p by p_{emp} will lead to a quantity that is “reasonably close” to the expected risk. This will be the case if the class of possible solutions f is sufficiently limited [568, 571]. The issue of closeness with regard to different estimators will be discussed in further detail in Chapters 5 and 12. Substituting p_{emp}(x, y) into (3.12) leads to the empirical risk:
|
||
|
||
M-Estimator
|
||
Ill-Posed Problems
|
||
Example of an Ill-Posed Problem
|
||
|
||
Definition 3.4 (Empirical Risk) The empirical risk is defined as

R_{emp}[f] := \int c(x, y, f(x)) \, p_{emp}(x, y) \, dx \, dy = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)).    (3.14)
|
||
|
||
This quantity has the advantage that, given the training data, we can readily compute and also minimize it. This constitutes a particular case of what is called an M-estimator in statistics. Estimators of this type are studied in detail in the field of empirical processes [554]. As pointed out in Section 3.1, it is crucial to understand that although our particular M-estimator is built from minimizing a loss, this need not always be the case. From a decision-theoretic point of view, the question of which loss to choose is a separate issue, which is dictated by the problem at hand as well as the goal of trying to evaluate the performance of estimation methods, rather than by the problem of trying to define a particular estimation method [582, 166, 43].
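Computationally, (3.14) is just an average of pointwise losses over the training sample. A minimal sketch (ours; the arrays X, y and the predictor f are hypothetical placeholders):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp[f] = (1/m) sum_i loss(x_i, y_i, f(x_i)), cf. (3.14)."""
    preds = np.array([f(x) for x in X])
    return np.mean([loss(x, yi, p) for x, yi, p in zip(X, y, preds)])

# usage with the squared loss (3.8) and a toy linear predictor
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.1, 0.9, 2.2])
f = lambda x: 1.05 * x[0]
print(empirical_risk(f, X, y, lambda x, yi, p: (p - yi) ** 2))
```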
|
||
These considerations aside, it may appear as if (3.14) is the answer to our problems, and all that remains to be done is to find a suitable class of functions ℱ such that we can minimize R_{emp}[f] over f ∈ ℱ. Unfortunately, determining a suitable ℱ is quite difficult (see Chapters 5 and 12 for details). Moreover, the minimization of R_{emp}[f] can lead to an ill-posed problem [538, 370]. We will show this with a simple example.
|
||
Assume that we want to solve a regression problem using the quadratic loss function (3.8), given by c(x, y, f(x)) = (y − f(x))^2. Moreover, assume that we are dealing with a linear class of functions,5 say
|
||
|
||
ℱ := \left\{ f \;\middle|\; f(x) = \sum_{i=1}^{n} α_i f_i(x), \; α_i ∈ ℝ \right\},    (3.15)

where the f_i are functions mapping 𝒳 to ℝ. We want to find the minimizer of R_{emp}, i.e.,

\underset{f ∈ ℱ}{\text{minimize}} \; R_{emp}[f] = \underset{α ∈ ℝ^n}{\text{minimize}} \; \frac{1}{m} \sum_{i=1}^{m} \Big( y_i − \sum_{j=1}^{n} α_j f_j(x_i) \Big)^2.    (3.16)
|
||
|
||
5. In the simplest case, assuming is contained in a vector space, these could be functions that extract coordinates of x; in other words, would be the class of linear functions on .
|
||
|
||
Condition of a Matrix
|
||
|
||
Computing the derivative of R_{emp}[f] with respect to α, and defining F_{ij} := f_i(x_j), we can see that the minimum of (3.16) is achieved if

F y = F F^{\top} α.    (3.17)

A sufficient condition for (3.17) is α = (F F^{\top})^{-1} F y, where (F F^{\top})^{-1} denotes the (pseudo-)inverse of the matrix. If F F^{\top} has a bad condition number (i.e. the quotient between the largest and the smallest eigenvalue of F F^{\top} is large), it is numerically difficult [423, 530] to solve (3.17) for α. Furthermore, if n > m, i.e. if we have more basis functions f_i than
|
||
|
||
training patterns xi, there will exist a subspace of solutions with dimension at least
|
||
|
||
n m, satisfying (3.17). This is undesirable both practically (speed of computation)
|
||
|
||
and theoretically (we would have to deal with a whole class of solutions rather
|
||
|
||
than a single one).
|
||
|
||
One might also expect that if is too rich, the discrepancy between Remp[ f ] and
|
||
|
||
R[ f ] could be large. For instance, if F is an m m matrix of full rank, contains
|
||
|
||
an f that predicts all target values yi correctly on the training data. Nevertheless,
|
||
|
||
we cannot expect that we will also obtain zero prediction error on unseen points.
|
||
|
||
Chapter 4 will show how these problems can be overcome by adding a so-called regularization term to R_{emp}[f].
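The ill-conditioning is easy to reproduce numerically: with more basis functions than training points, F F^T is singular, the training data are interpolated exactly, and there is an (n − m)-dimensional space of equivalent solutions. A sketch (our own, NumPy assumed):

```python
import numpy as np

m, n = 5, 20                               # more basis functions than samples
x = np.linspace(0.0, 1.0, m)
F = np.vander(x, n, increasing=True).T     # F_ij = f_i(x_j), monomial basis
y = np.sin(2 * np.pi * x)

print(np.linalg.cond(F @ F.T))             # (numerically) infinite condition number
alpha = np.linalg.pinv(F.T) @ y            # minimum-norm solution of F^T alpha = y
print(np.allclose(F.T @ alpha, y))         # zero training error, cf. (3.17)
# Any vector in the (n - m)-dimensional null space of F^T can be added to alpha
# without changing the fit on the training points.
```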
|
||
|
||
3.3 A Statistical Perspective
|
||
Given a particular pattern x˜, we may want to ask what risk we can expect for it, and with which probability the corresponding loss is going to occur. In other words, instead of (or in addition to) E c(x˜ y f (x˜) for a fixed x˜, we may want to know the distribution of y given x˜, i.e., P(y x˜).
|
||
(Bayesian) statistics (see [338, 432, 49, 43] and also Chapter 16) often attempt to estimate the density corresponding to the random variables (x y), and in some cases, we may really need information about p(x y) to arrive at the desired conclusions given the training data (e.g., medical diagnosis). However, we always have to keep in mind that if we model the density p first, and subsequently, based on this approximation, compute a minimizer of the expected risk, we will have to make two approximations. This could lead to inferior or at least not easily predictable results. Therefore, wherever possible, we should avoid solving a more general problem, since additional approximation steps might only make the estimates worse [561].
|
||
3.3.1 Maximum Likelihood Estimation
|
||
All this said, we still may want to compute the conditional density p(y x). For this purpose we need to model how y is generated, based on some underlying dependency f (x); thus, we specify the functional form of p(y x f (x)) and maximize
|
||
|
||
|
||
|
||
Log-Likelihood Regression
|
||
Classification
|
||
|
||
the expression with respect to f . This will provide us with the function f that is most likely to have generated the data.
|
||
|
||
Definition 3.5 (Likelihood) The likelihood of a sample (x_1, y_1), …, (x_m, y_m) given an underlying functional dependency f is given by

p(\{x_1, …, x_m, y_1, …, y_m\} \mid f) = \prod_{i=1}^{m} p(x_i, y_i \mid f) = \prod_{i=1}^{m} p(y_i \mid x_i, f) \, p(x_i).    (3.18)
|
||
|
||
Strictly speaking the likelihood only depends on the values f (x1) f (xm) rather than being a functional of f itself. To keep the notation simple, however, we
|
||
|
||
write p( x1 xm y1 ym f ) instead of the more heavyweight expression
|
||
|
||
p( x1
|
||
|
||
xm y1
|
||
|
||
ym f (x1)
|
||
|
||
f (xm) ).
|
||
|
||
For practical reasons, we convert products into sums by taking the negative logarithm of p(\{x_1, …, x_m, y_1, …, y_m\} \mid f), an expression which is then conveniently minimized. Furthermore, we may drop the p(x_i) from (3.18), since they do not depend on f. Thus maximization of (3.18) is equivalent to minimization of the negative Log-Likelihood

−\sum_{i=1}^{m} \ln p(y_i \mid x_i, f).    (3.19)
|
||
|
||
Remark 3.6 (Regression Loss Functions) Minimization of (3.19) and of R_{emp}[f] coincide if the loss function c is chosen according to

c(x, y, f(x)) = −\ln p(y \mid x, f).    (3.20)
|
||
|
||
Assuming that the target values y were generated by an underlying functional dependency f_{true} plus additive noise ξ with density p_ξ, i.e. y_i = f_{true}(x_i) + ξ_i, we obtain

c(x, y, f(x)) = −\ln p_ξ(y − f(x)).    (3.21)
|
||
|
||
Things are slightly different in classification. Since all we are interested in is the probability that pattern x has label 1 or 1 (assuming binary classification), we can transform the problem into one of estimating the logarithm of the probability that a pattern assumes its correct label.
|
||
|
||
Remark 3.7 (Classification Loss Functions) We have a finite set of labels, which allows us to model P(y \mid f(x)) directly, instead of modelling a density. In the binary classification case (classes 1 and −1) this problem becomes particularly easy, since all we have to do is assume a functional dependency underlying P(1 \mid f(x)): this immediately gives us P(−1 \mid f(x)) = 1 − P(1 \mid f(x)). The link to loss functions is established via

c(x, y, f(x)) = −\ln P(y \mid f(x)).    (3.22)
|
||
|
||
The same result can be obtained by minimizing the cross entropy⁶ between the classification labels y_i and the probabilities p(y \mid f(x)), as is typically done in a generalized linear models context (see, e.g., [355, 232, 163]). For binary classification (with y ∈ {−1, 1}) we obtain

6. In the case of discrete variables the cross entropy between two distributions P and Q is defined as −∑_i P(i) ln Q(i).
|
||
|
||
Table 3.1 Common loss functions and corresponding density models according to Remark 3.6. As a shorthand we use c̃(f(x) − y) := c(x, y, f(x)).

                        loss function c̃(ξ)                                        density model p(ξ)
ε-insensitive           |ξ|_ε                                                      (1 / (2(1 + ε))) exp(−|ξ|_ε)
Laplacian               |ξ|                                                        (1/2) exp(−|ξ|)
Gaussian                (1/2) ξ^2                                                  (1/√(2π)) exp(−ξ^2/2)
Huber's robust loss     ξ^2/(2σ) if |ξ| ≤ σ;  |ξ| − σ/2 otherwise                  ∝ exp(−ξ^2/(2σ)) if |ξ| ≤ σ;  ∝ exp(σ/2 − |ξ|) otherwise
Polynomial              (1/d) |ξ|^d                                                ∝ exp(−|ξ|^d / d)
Piecewise polynomial    ξ^d/(d σ^{d−1}) if |ξ| ≤ σ;  |ξ| − σ(d−1)/d otherwise      ∝ exp(−ξ^d/(d σ^{d−1})) if |ξ| ≤ σ;  ∝ exp(σ(d−1)/d − |ξ|) otherwise
|
||
|
||
c(x, y, f(x)) = −\frac{1 + y}{2} \ln P(y = 1 \mid f(x)) − \frac{1 − y}{2} \ln P(y = −1 \mid f(x)).    (3.23)
|
||
|
||
When substituting the actual values for y into (3.23), this reduces to (3.22).
|
||
|
||
At this point we have a choice in modelling P(y = 1 \mid f(x)) to suit our needs. Possible models include the logistic transfer function, the probit model, and the inverse complementary log-log model. See Section 16.3.5 for a more detailed discussion of the choice of such link functions. Below we explain the connections in some more detail for the logistic link function. For a logistic model, where P(y = 1 \mid x, f) ∝ \exp(\frac{1}{2} f(x)), we obtain after normalization

P(y = 1 \mid x, f) = \frac{\exp(f(x))}{1 + \exp(f(x))},    (3.24)

and consequently −\ln P(y = 1 \mid x, f) = \ln(1 + \exp(−f(x))). We thus recover (3.5) as the loss function for classification. Choices other than (3.24) for a map into [0, 1] will lead to further loss functions for classification. See [579, 179, 596] and Section 16.1.1 for more details on this subject.
|
||
|
||
It is important to note that not every loss function used in classification corresponds to such a density model (recall that in this case, the probabilities have to add up to 1 for any value of f(x)). In fact, one of the most popular loss functions, the soft margin loss (3.3), does not enjoy this property. A discussion of these issues can be found in [521].
Table 3.1 summarizes common loss functions and the corresponding density models as defined by (3.21), some of which were already presented in Section 3.1. It is an exhaustive list of the loss functions that will be used in this book for regression. Figure 3.2 contains graphs of the functions.
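The correspondence in Table 3.1 is simply p(ξ) ∝ exp(−c̃(ξ)), cf. (3.21). The following spot-check (our own code, NumPy assumed) compares negative log-densities with the losses, up to additive constants, for the Gaussian and Laplacian entries:

```python
import numpy as np

xi = np.linspace(-3, 3, 13)

# Gaussian: p(xi) = exp(-xi^2/2)/sqrt(2*pi)  <->  loss xi^2/2
gauss_nll = -np.log(np.exp(-xi**2 / 2) / np.sqrt(2 * np.pi))
assert np.allclose(gauss_nll - gauss_nll.min(), xi**2 / 2 - (xi**2 / 2).min())

# Laplacian: p(xi) = exp(-|xi|)/2  <->  loss |xi|
laplace_nll = -np.log(np.exp(-np.abs(xi)) / 2)
assert np.allclose(laplace_nll - laplace_nll.min(), np.abs(xi) - np.abs(xi).min())
print("negative log-densities match the losses up to additive constants")
```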
|
||
|
||
Figure 3.2 Graphs of loss functions and corresponding density models, plotted as functions of ξ. Upper left: Gaussian (squared loss), upper right: Laplacian (absolute loss), lower left: Huber's robust loss, lower right: ε-insensitive loss.
|
||
|
||
Practical Considerations
|
||
|
||
We conclude with a few cautionary remarks. The loss function resulting from a maximum likelihood reasoning might be non-convex. This might spell trouble when we try to find an efficient solution of the corresponding minimization problem. Moreover, we made a very strong assumption by claiming to know P(y x f ) explicitly, which was necessary in order to evaluate (3.20).
|
||
Finally, the solution we obtain by minimizing the log-likelihood depends on the class of functions . So we are in no better situation than by minimizing Remp[ f ], albeit with the additional constraint, that the loss functions c(x y f (x)) must correspond to a probability density.
|
||
3.3.2 Efficiency
|
||
The above reasoning could mislead us into thinking that the choice of loss function is rather arbitrary, and that there exists no good means of assessing the performance of an estimator. In the present section we will develop tools which can be used to compare estimators that are derived from different loss functions. For this purpose we need to introduce additional statistical concepts which deal with the efficiency of an estimator. Roughly speaking, these give an indication of how
|
||
|
||
Estimator
|
||
|
||
“noisy” an estimator is with respect to a reference estimator.
|
||
We begin by formalizing the concept of an estimator. Denote by P(y ) a dis-
|
||
tribution of y depending (amongst other variables) on the parameters , and by
|
||
Y y1 ym an m-sample drawn iid from P(y ). Note that the use of the symbol y bears no relation to the yi that are outputs of some functional dependency (cf. Chapter 1). We employ this symbol because some of the results to be derived
|
||
will later be applied to the outputs of SV regression. Next, we introduce the estimator ˆ (Y) of the parameters , based on Y. For
|
||
instance, P(y ) could be a Gaussian with fixed variance and mean , and ˆ (Y) could be the estimator (1 m) ∑mi 1 yi.
|
||
To avoid cumbersome notation, we use the shorthand

E_θ[ξ(y)] := E_{P(y|θ)}[ξ(y)] = \int ξ(y) \, dP(y|θ)    (3.25)

to express expectations of a random variable ξ(y) with respect to P(y|θ). One criterion that we might impose on an estimator is that it be unbiased, i.e., that on average, it tells us the correct value of the parameter it attempts to estimate.
|
||
|
||
Definition 3.8 (Unbiased Estimator) An unbiased estimator θ̂(Y) of the parameters θ in P(y|θ) satisfies

E_θ[θ̂(Y)] = θ.    (3.26)
|
||
|
||
In this section, we will focus on unbiased estimators. In general, however, the estimators we are dealing with in this book will not be unbiased. In fact, they will have a bias towards ‘simple’, low-complexity functions. Properties of such estimators are more difficult to deal with, which is why, for the sake of simplicity, we restrict ourselves to the unbiased case in this section. Note, however, that “biasedness” is not a bad property by itself. On the contrary, there exist cases as the one described by James and Stein [262] where biased estimators consistently outperform unbiased estimators in the finite sample size setting, both in terms of variance and prediction error.
|
||
A possible way to compare unbiased estimators is to compute their variance. Other quantities such as moments of higher order or maximum deviation properties would be valid criteria as well, yet for historical and practical reasons the variance has become a standard tool to benchmark estimators. The Fisher information matrix is crucial for this purpose since it will tell us via the Crame´r-Rao bound (Theorem 3.11) the minimal possible variance for an unbiased estimator. The idea is that the smaller the variance, the lower (typically) the probability that ˆ (Y) will deviate from by a large amount. Therefore, we can use the variance as a possible one number summary to compare different estimators.
|
||
|
||
Definition 3.9 (Score Function, Fisher Information, Covariance) Assume there exists a density p(y ) for the distribution P(y ) such that ln p(y ) is differentiable with
|
||
|
||
3.3 A Statistical Perspective
|
||
|
||
73
|
||
|
||
Score Function
|
||
Fisher Information Covariance
|
||
|
||
respect to . The score V (Y) of P(y ) is a random variable defined by7
|
||
|
||
V (Y) : ln p(Y )
|
||
|
||
∑ ∑ m
|
||
ln p(yi )
|
||
i1
|
||
|
||
m p(yi ) i 1 p(yi )
|
||
|
||
(3.27)
|
||
|
||
This score tells us how much the likelihood of the data depends on the different components of , and thus, in the maximum likelihood procedure, how much the data affect the choice of . The covariance of V (Y) is called the Fisher information matrix I. It is given by
|
||
|
||
Ii j : E i ln p(Y ) j ln p(Y ) and the covariance matrix B of the estimator ˆ (Y) is defined by Bi j : E ˆ i E ˆ i ˆ j E ˆ j
|
||
|
||
(3.28) (3.29)
|
||
|
||
The covariance matrix B tells us the amount of variation of the estimator. It can
|
||
therefore be used (e.g., by Chebychev’s inequality) to bound the probability that ˆ (Y) deviates from by more than a certain amount.
|
||
|
||
Average Fisher Score Vanishes
|
||
|
||
Remark 3.10 (Expected Value of Fisher Score) One can check that the expected value of V (Y) is 0 since
|
||
|
||
E [V (Y)] p(Y ) ln p(Y )dY
|
||
|
||
p(Y )dY 1 0
|
||
|
||
(3.30)
|
||
|
||
In other words, the contribution of Y to the adjustment of averages to 0 over all possible Y, drawn according to P(Y ). Equivalently we could say that the average likelihood for Y drawn according to P(Y ) is extremal, provided we choose : the derivative of the expected likelihood of the data E ln P(Y ) with respect to vanishes. This is also what we expect, namely that the “proper” distribution is on average the one with the highest likelihood.
|
||
|
||
The following theorem gives a lower bound on the variance of an estimator, i.e. B is found in terms of the Fisher information I. This is useful to determine how well a given estimator performs with respect to the one with the lowest possible variance.
|
||
|
||
Theorem 3.11 (Cramér and Rao [425]) Any unbiased estimator θ̂(Y) satisfies

\det(IB) \ge 1.    (3.31)
|
||
|
||
Proof We prove (3.31) for the scalar case. The extension to matrices is left as an exercise (see Problem 3.10). Using the Cauchy-Schwarz inequality, we obtain
|
||
|
||
E (V (Y) E [V (Y)]) ˆ (Y) E ˆ (Y) 2
|
||
|
||
(3.32)
|
||
|
||
E (V (Y) E [V (Y)])2 E ˆ (Y) E ˆ (Y) 2 IB
|
||
|
||
(3.33)
|
||
|
||
7. Recall that p(Y ) is the gradient of p(Y ) with respect to the parameters 1
|
||
|
||
n.
|
||
|
||
74
|
||
Asymptotic Variance
|
||
|
||
Risk and Loss Functions
|
||
|
||
At the same time, E [V (Y)] 0 implies that
|
||
E (V (Y) E [V (Y)]) ˆ (Y) E ˆ (Y) 2
|
||
E V (Y)ˆ(Y) 2
|
||
2
|
||
p(Y )V (Y)ˆ(Y)dY
|
||
|
||
2
|
||
|
||
p(Y )ˆ(Y)dY
|
||
|
||
( )2 1
|
||
|
||
(3.34) (3.35)
|
||
(3.36)
|
||
|
||
since we may interchange integration by Y and .
|
||
|
||
Eq. (3.31) lends itself to the definition of a one-number summary of the properties of an estimator, namely how closely the inequality is met.
|
||
|
||
Definition 3.12 (Efficiency) The statistical efficiency e of an estimator θ̂(Y) is defined as

e := \frac{1}{\det(IB)}.    (3.37)
|
||
|
||
The closer e is to 1, the lower the variance of the corresponding estimator ˆ (Y). For a special class of estimators minimizing loss functions, the following theorem allows us to compute B and e efficiently.
|
||
|
||
Theorem 3.13 (Murata, Yoshizawa, Amari [379, Lemma 3]) Assume that ˆ is defined by ˆ (Y) : argmin d(Y ) and that d is a twice differentiable function in .
|
||
|
||
Then asymptotically, for increasing sample size m
|
||
|
||
, the variance B is given by
|
||
|
||
B Q 1GQ 1. Here
|
||
|
||
Gi j : cov i d (Y ) j d (Y ) and
|
||
|
||
Qij : E
|
||
|
||
2 d(Y ) ij
|
||
|
||
and therefore e (det Q)2 (det IG).
|
||
|
||
(3.38) (3.39)
|
||
|
||
This means that for the class of estimators defined via d, the evaluation of their asymptotic efficiency can be conveniently achieved via (3.38) and (3.39). For scalar valued estimators (Y) , these expressions can be greatly simplified to
|
||
|
||
I
|
||
|
||
ln p(Y ) 2 dP(Y )
|
||
|
||
(3.40)
|
||
|
||
G
|
||
|
||
( d(Y ))2 dP(Y )
|
||
|
||
(3.41)
|
||
|
||
Q
|
||
|
||
2d(Y )dP(Y )
|
||
|
||
(3.42)
|
||
|
||
Finally, in the case of continuous densities, Theorem 3.13 may be extended to piecewise twice differentiable continuous functions d, by convolving the latter with a twice differentiable smoothing kernel, and letting the width of the smoothing kernel converge to zero. We will make use of this observation in the next section when studying the efficiency of some estimators.
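As a concrete illustration of efficiency (our own example, not from the text): for Gaussian data, the sample mean attains e = 1, whereas the sample median — the minimizer of the absolute loss — has asymptotic efficiency 2/π ≈ 0.64. A small Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
m, trials, theta = 200, 5000, 0.0
samples = rng.normal(theta, 1.0, size=(trials, m))

var_mean = np.var(samples.mean(axis=1))      # variance of the mean estimator
var_median = np.var(np.median(samples, axis=1))  # variance of the median estimator
print(var_mean / var_median)                 # close to 2/pi for Gaussian data
```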
|
||
|
||
3.4 Robust Estimators
|
||
|
||
75
|
||
|
||
The current section concludes with the proof that the maximum likelihood estimator meets the Crame´r-Rao bound.
|
||
|
||
Theorem 3.14 (Efficiency of Maximum Likelihood [118, 218, 43]) The maximum likelihood estimator (cf. (3.18) and (3.19)) given by

θ̂(Y) := \underset{θ}{\operatorname{argmax}} \; \ln p(Y|θ) = \underset{θ}{\operatorname{argmin}} \; \big(−\ln p(Y|θ)\big)    (3.43)

is asymptotically efficient (e = 1).
|
||
|
||
To keep things simple we will prove (3.43) only for the class of twice differentiable continuous densities by applying Theorem 3.13. For a more general proof see [118, 218, 43].
|
||
|
||
Proof By construction, G is equal to the Fisher information matrix, if we choose d according to (3.43). Hence a sufficient condition is that Q I, which is what we show below. To this end we expand the integrand of (3.42),
|
||
|
||
2d(Y )
|
||
|
||
2 ln p(Y )
|
||
|
||
2 p(Y ) p(Y )
|
||
|
||
p(Y ) 2 p(Y )
|
||
|
||
2 p(Y ) V2(Y) p(Y )
|
||
|
||
(3.44)
|
||
|
||
The expectation of the second term in (3.44) equals I. We now show that the expectation of the first term vanishes;
|
||
|
||
2p(Y ) p(Y ) p(Y ) dY
|
||
|
||
2 p(Y )dY
|
||
|
||
21 0
|
||
|
||
(3.45)
|
||
|
||
Hence Q I and thus e Q2 (IG) 1. This proves that the maximum likelihood estimator is asymptotically efficient.
|
||
|
||
It appears as if the best thing we could do is to use the maximum likelihood (ML) estimator. Unfortunately, reality is not quite as simple as that. First, the above statement holds only asymptotically. This leads to the (justified) suspicion that for finite sample sizes we may be able to do better than ML estimation. Second, practical considerations such as the additional goal of sparse decomposition may lead to the choice of a non-optimal loss function.
|
||
Finally, we may not know the true density model, which is required for the definition of the maximum likelihood estimator. We can try to make an educated guess; bad guesses of the class of densities, however, can lead to large errors in the estimation (see, e.g., [251]). This prompted the development of robust estimators.
|
||
|
||
3.4 Robust Estimators
|
||
So far, in order to make any practical predictions, we had to assume a certain class of distributions from which P(Y) was chosen. Likewise, in the case of risk functionals, we also assumed that training and test data are identically distributed. This section provides tools to safeguard ourselves against cases where the above
|
||
|
||
76 Outliers
|
||
Mixture Densities
|
||
|
||
Risk and Loss Functions
|
||
|
||
assumptions are not satisfied. More specifically, we would like to avoid a certain fraction of ‘bad’ obser-
|
||
vations (often also referred to as ‘outliers’) seriously affecting the quality of the estimate. This implies that the influence of individual patterns should be bounded from above. Huber [250] gives a detailed list of desirable properties of a robust estimator. We refrain from reproducing this list at present, or committing to a particular definition of robustness.
|
||
As usual for the estimation of location parameter context (i.e. estimation of the expected value of a random variable) we assume a specific parametric form of p(Y ), namely
|
||
|
||
m
|
||
|
||
m
|
||
|
||
p(Y ) ∏ p(yi ) ∏ p(yi )
|
||
|
||
i1
|
||
|
||
i1
|
||
|
||
(3.46)
|
||
|
||
Unless stated otherwise, this is the formulation we will use throughout this section.
|
||
|
||
3.4.1 Robustness via Loss Functions
|
||
|
||
Huber’s idea [250] in constructing a robust estimator was to take a loss function as provided by the maximum likelihood framework, and modify it in such a way as to limit the influence of each individual pattern. This is done by providing an upper bound on the slope of ln p(Y ). We shall see that methods such as the trimmed mean or the median are special cases thereof. The -insensitive loss function can also be viewed as a trimmed estimator. This will lead to the development of adaptive loss functions in the subsequent sections. We begin with the main theorem of this section.
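Huber's loss from Table 3.1 realizes this bounded-influence idea directly: it is quadratic for small residuals and linear beyond a threshold σ, so the derivative — and hence the pull of any single observation on the estimate — is clipped. A minimal sketch (ours, NumPy assumed):

```python
import numpy as np

def huber(xi, sigma=1.0):
    """Huber's robust loss from Table 3.1: quadratic for |xi| <= sigma, linear beyond."""
    quad = xi ** 2 / (2 * sigma)
    lin = np.abs(xi) - sigma / 2
    return np.where(np.abs(xi) <= sigma, quad, lin)

xi = np.linspace(-5, 5, 11)
print(huber(xi))
# The derivative is clip(xi / sigma, -1, 1), so individual residuals can pull
# the estimate only by a bounded amount, unlike with the squared loss.
print(np.gradient(huber(xi), xi))   # finite-difference approximation of the slope
```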
|
||
|
||
Theorem 3.15 (Robust Loss Functions (Huber [250])) Let be a class of densities formed by
|
||
|
||
: p p (1 )p0 p1 where (0 1) and p0 are known
|
||
|
||
(3.47)
|
||
|
||
Moreover assume that both p0 and p1 are symmetric with respect to the origin, their logarithms are twice continuously differentiable, ln p0 is convex and known, and p1 is unknown. Then the density
|
||
|
||
p¯( ) : (1
|
||
|
||
)
|
||
|
||
p0( ) p0( 0)e k(
|
||
|
||
if
|
||
|
||
0
|
||
|
||
0) otherwise
|
||
|
||
(3.48)
|
||
|
||
is robust in the sense that the maximum likelihood estimator corresponding to (3.48) has minimum variance with respect to the “worst” possible density pworst (1 )p0 p1: it is a saddle point (located at pworst) in terms of variance with respect to the true density p and the density p¯ used in estimating the location parameter. This means that no density p has larger variance than pworst and that for p pworst no estimator is better than the one where p¯ pworst, as used in the robust estimator.
|
||
The constants k 0 and 0 are obtained by the normalization condition, that p¯ be a
|
||
|
||
3.4 Robust Estimators
|
||
|
||
77
|
||
|
||
proper density and that the first derivative in ln p¯ be continuous.
|
||
|
||
Proof To show that p¯ is a saddle point in we have to prove that (a) no estimation procedure other than the one using ln p¯ as the loss function has lower variance for the density p¯, and that (b) no density has higher variance than p¯ if ln p¯ is used as loss function. Part (a) follows immediately from the Crame´r-Rao theorem (Th. 3.11); part (b) can be proved as follows.
|
||
We use Theorem 3.13, and a proof technique pointed out in [559], to compute the variance of an estimator using −ln p̄ as the loss function:

B = \frac{\int \left(\partial_\theta \ln \bar{p}(y-\theta)\right)^2 \left[(1-\varepsilon) p_0(y-\theta) + \varepsilon p(y-\theta)\right] dy}{\left[\int \partial_\theta^2 \ln \bar{p}(y-\theta) \left[(1-\varepsilon) p_0(y-\theta) + \varepsilon p(y-\theta)\right] dy\right]^2}    (3.49)
Here p is an arbitrary density which we will choose such that B is maximized. By construction,
\left(\partial_\theta \ln \bar{p}(y-\theta)\right)^2 = \begin{cases} \left(\partial_\theta \ln p_0(y-\theta)\right)^2 \le k^2 & \text{if } |y-\theta| \le \xi_0 \\ k^2 & \text{otherwise,} \end{cases}    (3.50)

\partial_\theta^2 \ln \bar{p}(y-\theta) = \begin{cases} \partial_\theta^2 \ln p_0(y-\theta) \le 0 & \text{if } |y-\theta| \le \xi_0 \\ 0 & \text{otherwise.} \end{cases}    (3.51)
Thus any density p which is 0 on [−ξ0, ξ0] will minimize the denominator (the term depending on p will be 0, which is the lowest obtainable value due to (3.51)), and maximize the numerator, since in the latter the contribution of p is always limited to k^2. Now ε^{-1}(p̄ − (1 − ε)p0) is exactly such a density. Hence the saddle point property holds.
Remark 3.16 (Robustness Classes) If we have more knowledge about the class of densities, a different loss function will have the saddle point property. For instance, using a similar argument as above, one can show that the normal distribution is robust in the class of all distributions with bounded variance. This implies that among all possible distributions with bounded variance, the estimator of the mean of a normal distribution has the highest variance.
Likewise, the Laplacian distribution is robust in the class of all symmetric distributions with density p(0) ≥ c for some fixed c > 0 (see [559, 251] for more details).
Hence, even though a loss function defined according to Theorem 3.15 is generally desirable, we may be less cautious, and use a different loss function for improved performance, when we have additional knowledge of the distribution.
Remark 3.17 (Mean and Median) Assume we are dealing with a mixture of a normal distribution with variance σ^2 and an additional unknown distribution with weight at most ε. It is easy to check that the application of Theorem 3.15 to normal distributions yields Huber's robust loss function from Table 3.1.
The maximizer of the likelihood (see also Problem 3.17) is a trimmed mean estimator which discards a fraction ε of the data: effectively, all yi deviating from the mean by more than the threshold ξ0 are ignored and the mean is computed from the remaining data. Hence Theorem 3.15 gives a formal justification for this popular type of estimator.
If we let ε → 1 we recover the median estimator, which stems from a Laplacian distribution. Here, all patterns but the median one are discarded.
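To make this concrete, here is a minimal numerical sketch (our own, not code from the book) of a Huber-type location estimate in Python. It assumes a Gaussian p0, a clipping threshold ξ0 = 1, and k = ξ0 (so that the first derivative of ln p̄ is continuous), and it contaminates the sample with a small fraction of gross outliers; the result is compared with the sample mean and median.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# 95% standard normal "inliers" (p0), 5% gross outliers playing the role of the unknown p1
y = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(8.0, 1.0, 50)])

def huber_loss(r, xi0=1.0):
    # -ln pbar(r) up to additive constants, for Gaussian p0 and k = xi0:
    # quadratic inside [-xi0, xi0], linear (bounded slope) outside
    r = np.abs(r)
    return np.where(r <= xi0, 0.5 * r ** 2, xi0 * r - 0.5 * xi0 ** 2)

# maximum likelihood location estimate under the robust density (3.48)
res = minimize_scalar(lambda theta: huber_loss(y - theta).sum(),
                      bounds=(-10.0, 10.0), method="bounded")

print("Huber estimate:", res.x)         # close to 0, barely affected by the outliers
print("sample mean   :", y.mean())      # pulled towards the outliers (roughly 0.4)
print("sample median :", np.median(y))  # also robust

Note that the threshold is fixed in advance here; Section 3.4.3 discusses how to make such parameters adaptive.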
Besides the classical examples of loss functions and density models, we might also consider a slightly unconventional estimation procedure: use the average of the k-th smallest and the k-th largest of all observations as the estimated mean of the underlying distribution (for sorted observations yi with yi ≤ yj for 1 ≤ i ≤ j ≤ m, the estimator computes (y_k + y_{m−k+1})/2). This procedure makes sense, for instance, when we are trying to infer the mean of a random variable generated by roundoff noise (i.e., noise whose density is constant within some bounded interval) plus an additional unknown amount of noise.
Note that the patterns strictly inside or strictly outside an interval [−ε, ε] around the estimate have no direct influence on the outcome; only patterns on the boundary matter. This is very similar to the behavior of Support Vector Machines in regression, and one can show that this estimator corresponds to the minimizer of the ε-insensitive loss function (3.9). We will study the properties of the latter in more detail in the following section and thereafter show how it can be transformed into an adaptive risk functional.
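As a small illustration (again our own sketch, not the book's code; the sample, k = 10, and the search grid are arbitrary choices), the following lines compute the average of the k-th smallest and the k-th largest observation, and then verify by brute force that the same value minimizes the ε-insensitive loss once ε is set to half the distance between these two order statistics.

import numpy as np

rng = np.random.default_rng(1)
# roundoff-style noise (uniform on an interval) plus a small amount of additional jitter
y = np.sort(rng.uniform(-0.5, 0.5, 99) + rng.normal(0.0, 0.05, 99))
m, k = len(y), 10

# average of the k-th smallest and the k-th largest observation
theta_interval = 0.5 * (y[k - 1] + y[m - k])

# tube width such that [theta - eps, theta + eps] spans exactly these two order statistics
eps = 0.5 * (y[m - k] - y[k - 1])

def eps_insensitive_loss(theta):
    return np.maximum(np.abs(theta - y) - eps, 0.0).sum()

grid = np.linspace(y[0], y[-1], 2001)
theta_sv = grid[np.argmin([eps_insensitive_loss(t) for t in grid])]

print("trimmed interval estimate :", theta_interval)
print("eps-insensitive minimizer :", theta_sv)  # agrees up to the grid resolution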
3.4.2 Efficiency and the ε-Insensitive Loss Function
The tools of Section 3.3.2 allow us to analyze the ε-insensitive loss function in more detail. Even though the asymptotic estimation of a location parameter is a gross oversimplification of what is happening in a SV regression estimator (where we estimate a nonparametric function, and moreover have only a limited number of observations at our disposal), it will provide us with useful insights into this more complex case [510, 481].
In a first step, we compute the efficiency of an estimator, for several noise models and amounts of variance, using a density corresponding to the ε-insensitive loss function (cf. Table 3.1):

p_\varepsilon(y-\theta) = \frac{1}{2(1+\varepsilon)} \exp\left(-|y-\theta|_\varepsilon\right) = \frac{1}{2(1+\varepsilon)} \begin{cases} 1 & \text{if } |y-\theta| \le \varepsilon \\ \exp\left(\varepsilon - |y-\theta|\right) & \text{otherwise.} \end{cases}    (3.52)
For this purpose we have to evaluate the quantities G (3.41) and Q (3.42) of Theorem 3.13. We obtain
G = m \int \left(\partial_\theta \ln p_\varepsilon(y-\theta)\right)^2 dP(y) = m \left[ 1 - \int_{-\varepsilon}^{\varepsilon} p(y)\, dy \right]    (3.53)

Q = m \int \partial_\theta^2 \ln p_\varepsilon(y-\theta)\, dP(y) = m \left[ p(-\varepsilon) + p(\varepsilon) \right]    (3.54)
The Fisher information I of m iid random variables distributed according to p is m times that of a single random variable. Thus all dependencies on m in e cancel out, and we can limit ourselves to the case m = 1 for the analysis of the efficiency of estimators. Now we may check what happens if we use the ε-insensitive loss function for different types of noise model. For the sake of simplicity we begin with Gaussian noise.
Example 3.18 (Gaussian Noise) Assume that y is normally distributed with zero mean (i.e., θ = 0) and variance σ^2. By construction, the minimum obtainable variance is I^{-1} = σ^2 (recall that m = 1). Moreover, (3.53) and (3.54) yield
\frac{G}{Q^2} = \frac{\pi\sigma^2}{2}\, \exp\left(\frac{\varepsilon^2}{\sigma^2}\right) \left[ 1 - \operatorname{erf}\left(\frac{\varepsilon}{\sqrt{2}\,\sigma}\right) \right]    (3.55)

The efficiency e = Q^2/(GI) is maximized for ε = 0.6120 σ. This means that if the underlying noise model is Gaussian with variance σ^2 and we have to use an ε-insensitive loss function to estimate a location parameter, the most efficient estimator from this family is given by ε = 0.6120 σ.
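The value 0.6120 σ is easy to check numerically. The sketch below (our own code, not the book's) maximizes the efficiency e = Q^2/(GI) assembled from (3.53) and (3.54) for Gaussian noise, and also illustrates that the optimal ε scales linearly with σ, a point taken up by the next lemma.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import erf

def neg_efficiency(eps, sigma):
    # G and Q from (3.53), (3.54) for zero-mean Gaussian noise with standard deviation sigma (m = 1)
    G = 1.0 - erf(eps / (np.sqrt(2.0) * sigma))
    Q = 2.0 / (sigma * np.sqrt(2.0 * np.pi)) * np.exp(-eps ** 2 / (2.0 * sigma ** 2))
    I = 1.0 / sigma ** 2            # Fisher information of N(0, sigma^2) for m = 1
    return -(Q ** 2) / (G * I)      # negated efficiency, so that a minimizer can be used

for sigma in (1.0, 2.0):
    res = minimize_scalar(lambda e: neg_efficiency(e, sigma),
                          bounds=(1e-6, 5.0 * sigma), method="bounded")
    print(f"sigma = {sigma}: eps_opt = {res.x:.4f}, eps_opt / sigma = {res.x / sigma:.4f}")
# both runs give eps_opt / sigma close to 0.612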
The consequence of (3.55) is that the optimal value of ε scales linearly with σ. Of course, we could just use squared loss in such a situation, but in general, we will not know the exact noise model, and squared loss does not lead to robust estimators. The following lemma (which will come in handy in the next section) shows that this is a general property of the ε-insensitive loss.
Lemma 3.19 (Linear Dependency between ε-Tube Width and Variance) Denote by p a symmetric density with variance σ^2 > 0. Then the optimal value of ε (i.e., the value that achieves maximum asymptotic efficiency) for an estimator using the ε-insensitive loss is given by

\varepsilon_{\mathrm{opt}} = \operatorname*{argmin}_{\varepsilon} \left[ p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon) \right]^{-2} \left[ 1 - \int_{-\sigma^{-1}\varepsilon}^{\sigma^{-1}\varepsilon} p_{\mathrm{std}}(\tau)\, d\tau \right]    (3.56)

where pstd(y) := σ p(σy) is the standardized version of p(y − θ), i.e., it is obtained by rescaling p(y − θ) to zero mean and unit variance.

Since pstd is independent of σ, we have a linear dependency between ε_opt and σ. The scaling factor depends on the noise model.
Proof  We prove (3.56) by rewriting the efficiency e(ε) in terms of pstd via p(y − θ) = σ^{-1} pstd(σ^{-1}(y − θ)). This yields

e(\varepsilon) = \frac{Q^2}{IG} = \frac{\sigma^{-2}\left[ p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon) \right]^2}{I \left[ 1 - \int_{-\varepsilon}^{\varepsilon} \sigma^{-1} p_{\mathrm{std}}(\sigma^{-1} y)\, dy \right]} = \frac{\left[ p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon) \right]^2}{I \sigma^{2} \left[ 1 - \int_{-\sigma^{-1}\varepsilon}^{\sigma^{-1}\varepsilon} p_{\mathrm{std}}(\tau)\, d\tau \right]}

The maximum of e(ε) does not depend directly on ε, but only on σ^{-1}ε (which is independent of σ). Hence we can find argmax_ε e(ε) by solving (3.56).
Lemma 3.19 made it apparent that in order to adjust ε we have to know σ beforehand. Unfortunately, the latter is usually unknown at the beginning of the estimation procedure.8 The solution to this dilemma is to make ε adaptive.

8. The obvious question is why one would ever like to choose an ε-insensitive loss in the presence of Gaussian noise in the first place. If the complexity of the function expansion is of no concern and the highest accuracy is required, squared loss is to be preferred. In most cases, however, it is not quite clear what exactly the type of the additive noise model is. This is when we would like to have a more conservative estimator. In practice, the ε-insensitive loss has been shown to work rather well on a variety of tasks (Chapter 9).
3.4.3 Adaptive Loss Functions
We again consider the trimmed mean estimator, which discards a predefined fraction of the largest and smallest samples. This method belongs to the more general class of quantile estimators, which base their estimates on the value of samples in a certain quantile. The latter methods do not require prior knowledge of the variance, and adapt to whatever scale is required. What we need is a technique which connects σ (in Huber's robust loss function) or ε (in the ε-insensitive loss case) with the deviations between the estimate θ̂ and the random variables yi.
Let us analyze what happens to the negative log-likelihood if, in the ε-insensitive case, we change ε to ε + δ (with |δ| small) while keeping θ̂ fixed. In particular, we assume that δ is chosen sufficiently small such that for all i = 1, …, m,

|\hat\theta - y_i| > \varepsilon + \delta \;\text{ if } |\hat\theta - y_i| > \varepsilon, \quad\text{and}\quad |\hat\theta - y_i| \le \varepsilon + \delta \;\text{ if } |\hat\theta - y_i| \le \varepsilon.    (3.57)
Moreover, denote by m_<, m_=, and m_> the number of samples for which |θ̂ − yi| is less than, equal to, or greater than ε, respectively. Then
\sum_{i=1}^{m} |\hat\theta - y_i|_{\varepsilon+\delta} = \sum_{i=1}^{m} |\hat\theta - y_i|_{\varepsilon} - \begin{cases} \delta\, m_> & \text{if } \delta \ge 0 \\ \delta\, (m_> + m_=) & \text{otherwise.} \end{cases}    (3.58)
In other words, the amount by which the loss changes depends only on the quantiles at ε. What happens if we make ε itself a variable of the optimization problem? By the scaling properties of (3.58) one can see that for ν ∈ [0, 1],
\operatorname*{minimize}_{\hat\theta,\, \varepsilon}\;\; \frac{1}{m} \sum_{i=1}^{m} |\hat\theta - y_i|_{\varepsilon} + \nu\varepsilon    (3.59)
is minimized if ε is chosen such that

\frac{m_>}{m} \le \nu \le \frac{m_> + m_=}{m}    (3.60)
This relation holds since at the solution (θ̂, ε), the solution also has to be optimal wrt. ε alone while keeping θ̂ fixed. In the latter case, however, the derivatives of the log-likelihood (i.e., error) term wrt. ε at the solution are given by −(m_> + m_=)/m and −m_>/m on the left and right hand side respectively.9 These have to cancel with ν, which proves the claim. Furthermore, computing the derivative of (3.59) with respect to θ̂ shows that the number of samples outside the interval [θ̂ − ε, θ̂ + ε] has to be equal on both halves (−∞, θ̂ − ε) and (θ̂ + ε, ∞). We have the following theorem:
Theorem 3.20 (Quantile Estimation as Optimization Problem [481]) A quantile procedure to estimate the mean of a distribution by taking the average of the samples at the (ν/2)-th and (1 − ν/2)-th quantile is equivalent to minimizing (3.59). In particular,

1. ν is an upper bound on the fraction of samples outside the interval [θ̂ − ε, θ̂ + ε].

2. ν is a lower bound on the fraction of samples outside the interval ]θ̂ − ε, θ̂ + ε[.

3. If the distribution p(y) is continuous, then for all ν ∈ [0, 1],

\lim_{m \to \infty} P\left( \left| \frac{m_>}{m} - \nu \right| < \gamma \right) = 1 \quad \text{for all } \gamma > 0    (3.61)
One might question the practical advantage of this method over direct trimming of the sample Y. In fact, the use of (3.59) is not recommended if all we want is to estimate θ. That said, (3.59) does allow us to employ trimmed estimation in the nonparametric case, cf. Chapter 9.
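For the simple location setting, the ν-property can be checked directly. The sketch below (our own code; the Student-t sample and ν = 0.3 are arbitrary choices) minimizes (3.59) jointly over θ̂ and ε with a generic optimizer and counts the samples outside the resulting tube.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.standard_t(df=3, size=1000)   # heavy-tailed noise sample
nu = 0.3

def objective(params):
    theta, eps = params[0], abs(params[1])   # keep the tube width nonnegative
    return nu * eps + np.mean(np.maximum(np.abs(theta - y) - eps, 0.0))

res = minimize(objective, x0=np.array([0.0, 1.0]), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-10})
theta_hat, eps_hat = res.x[0], abs(res.x[1])

outside = np.mean(np.abs(theta_hat - y) > eps_hat)
print(f"theta_hat = {theta_hat:.3f}, eps_hat = {eps_hat:.3f}")
print(f"fraction of samples outside the tube = {outside:.3f}, nu = {nu}")
# up to O(1/m) effects and optimizer tolerance the fraction matches nu,
# in agreement with (3.60) and the asymptotic statement (3.61)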
Unfortunately, we were unable to find a similar method for Huber's robust loss function, since in this case the change in the negative log-likelihood incurred by changing σ not only involves the (statistical) rank of the yi, but also the exact location of samples yi with |θ̂ − yi| ≤ σ.
One way to overcome this problem is to re-estimate σ adaptively while minimizing a term similar to (3.59) (see [180] for details in the context of boosting, Section 10.6.3 for a discussion of online estimation techniques, or [251] for a general overview).
3.4.4 Optimal Choice of ν
Let us return to the ε-insensitive loss. A combination of Theorems 3.20 and 3.13 and Lemma 3.19 allows us to compute optimal values of ν for various distributions, provided that an ε-insensitive loss function is to be used in the estimation procedure.10

The idea is to determine the optimal value of ε for a fixed density p(y − θ) via (3.56), and compute the corresponding fraction ν of patterns outside the interval [θ̂ − ε, θ̂ + ε].
9. Strictly speaking, the derivative is not defined at this point; the lhs and rhs values are defined, however, which is sufficient for our purpose.
10. This is not optimal in the sense of Theorem 3.15, which suggests the use of a more adapted loss function. However (as already stated in the introduction of this chapter), algorithmic or technical reasons such as computationally efficient solutions or limited memory may provide sufficient motivation to use such a loss function.
Table 3.2  Optimal ν and ε for various degrees of polynomial additive noise.

Polynomial degree d    Optimal ν    Optimal ε for unit variance
1                      1            0
2                      0.5405       0.6120
3                      0.2909       1.1180
4                      0.1898       1.3583
5                      0.1384       1.4844
6                      0.1080       1.5576
7                      0.0881       1.6035
8                      0.0743       1.6339
9                      0.0641       1.6551
10                     0.0563       1.6704
Theorem 3.21 (Optimal Choice of ν) Denote by p a symmetric density with variance σ^2 > 0 and by pstd the corresponding rescaled density with zero mean and unit variance. Then the optimal value of ν (i.e., the value that achieves maximum asymptotic efficiency) for an estimator using the ε-insensitive loss is given by

\nu = 1 - \int_{-\sigma^{-1}\varepsilon_{\mathrm{opt}}}^{\sigma^{-1}\varepsilon_{\mathrm{opt}}} p_{\mathrm{std}}(y)\, dy    (3.62)

where ε_opt is chosen according to (3.56). This expression is independent of σ.
Proof  The independence of σ follows from the fact that ν depends only on pstd. Next we show (3.62). For a given density p, the asymptotically optimal value of ε is given by Lemma 3.19. The average fraction of patterns outside the interval [θ̂ − ε_opt, θ̂ + ε_opt] is

\nu = 1 - \int_{-\varepsilon_{\mathrm{opt}}}^{\varepsilon_{\mathrm{opt}}} p(y)\, dy = 1 - \int_{-\sigma^{-1}\varepsilon_{\mathrm{opt}}}^{\sigma^{-1}\varepsilon_{\mathrm{opt}}} p_{\mathrm{std}}(y)\, dy    (3.63)

which depends only on σ^{-1}ε_opt and is thus independent of σ. Combining (3.63) with (3.56) yields the theorem.
This means that given the type of additive noise, we can determine the value of ν such that it yields the asymptotically most efficient estimator, independent of the level of the noise. These theoretical predictions have since been confirmed rather accurately in a set of regression experiments [95].

Let us now look at some special cases.
Example 3.22 (Optimal ν for Polynomial Noise) Arbitrary polynomial noise models (∝ e^{-|ξ|^d}) with unit variance can be written as

p(y) = c_p \exp\left( - \tilde{c}_p\, |y|^{d} \right) \quad\text{where}\quad c_p = \frac{d}{2} \sqrt{\frac{\Gamma(3/d)}{\Gamma(1/d)}}\; \frac{1}{\Gamma(1/d)} \quad\text{and}\quad \tilde{c}_p = \left( \frac{\Gamma(3/d)}{\Gamma(1/d)} \right)^{d/2}

where Γ(x) is the gamma function. Figure 3.3 shows ν_opt for polynomial degrees d in the interval [1, 10]. For convenience, the explicit numerical values are repeated in Table 3.2.
Observe that as the distribution becomes "lighter-tailed," the optimal ν decreases; in other words, we may then use a larger amount of the data for the purpose of estimation. This is reasonable, since it is only for very long tails of the distribution (data with many outliers) that we have to be conservative and discard a large fraction of observations.
Figure 3.3  Optimal ν and ε (for unit variance) for various degrees of polynomial additive noise.
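The entries of Table 3.2 can be reproduced from (3.56) and (3.62). The sketch below (our own code, using the unit-variance polynomial noise density of Example 3.22, so that σ = 1) does this for a few degrees d; it recovers, for instance, ν ≈ 0.54 and ε ≈ 0.61 for the Gaussian case d = 2.

import numpy as np
from math import gamma
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def p_std(y, d):
    # unit-variance polynomial noise density of Example 3.22
    c_tilde = (gamma(3.0 / d) / gamma(1.0 / d)) ** (d / 2.0)
    c = 0.5 * d * np.sqrt(gamma(3.0 / d)) * gamma(1.0 / d) ** (-1.5)
    return c * np.exp(-c_tilde * np.abs(y) ** d)

def optimal_nu_eps(d):
    # (3.56) with sigma = 1: minimize [p(-eps) + p(eps)]^(-2) * [1 - integral of p over [-eps, eps]]
    def objective(eps):
        mass = quad(lambda y: p_std(y, d), -eps, eps)[0]
        return (2.0 * p_std(eps, d)) ** (-2) * (1.0 - mass)
    eps_opt = minimize_scalar(objective, bounds=(1e-3, 4.0), method="bounded").x
    # (3.62): nu is the probability mass outside the optimal tube
    nu_opt = 1.0 - quad(lambda y: p_std(y, d), -eps_opt, eps_opt)[0]
    return nu_opt, eps_opt

for d in (2, 3, 5, 10):
    nu_opt, eps_opt = optimal_nu_eps(d)
    print(f"d = {d:2d}: nu_opt = {nu_opt:.4f}, eps_opt = {eps_opt:.4f}")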
Even though we derived these relations solely for the case where a single number (θ) has to be estimated, experiments show that the same scaling properties hold for the nonparametric case. It is still an open research problem to establish this connection exactly.
As we shall see, in the nonparametric case, the effect of ν will be that it both determines the number of Support Vectors (i.e., the number of basis functions needed to expand the solution) and also the fraction of function values f(xi) with deviation larger than ε from the corresponding observations. Further information on this topic, both from the statistical and the algorithmic point of view, can be found in Section 9.3.
3.5 Summary
We saw in this chapter that there exist two complementary concepts as to how risk and loss functions should be designed. The first one is data driven and uses the incurred loss as its principal guideline, possibly modified in order to suit the need of numerical efficiency. This leads to loss functions and the definitions of empirical and expected risk.
A second method is based on the idea of estimating (or at least approximating) the distribution which may be responsible for generating the data. We showed that in a Maximum Likelihood setting this concept is rather similar to the notions of risk and loss, with c(x, y, f(x)) = −ln p(y|x, f(x)) as the link between both quantities.
This point of view allowed us to analyze the properties of estimators in more detail and to provide lower bounds on the performance of unbiased estimators, i.e., the Cramér-Rao theorem. The latter was then used as a benchmarking tool for various loss functions and density models, such as the ε-insensitive loss. The consequence of this analysis is a corroboration of experimental findings that there exists a linear correlation between the amount of noise in the observations and the optimal width of the ε-tube.
This, in turn, allowed us to construct adaptive loss functions which adjust themselves to the amount of noise, much like trimmed mean estimators. These formulations can be used directly in mathematical programs, leading to the ν-SV algorithms of subsequent chapters. The question of which choices are optimal in a finite sample size setting remains an open research problem.
3.6 Problems
3.1 (Soft Margin and Logistic Regression) The soft margin loss function c_soft and the logistic loss c_logist are asymptotically almost the same; show that

\lim_{f \to -\infty} \left[ c_{\mathrm{soft}}(x, 1, f) - c_{\mathrm{logist}}(x, 1, f) \right] = 1    (3.64)

\lim_{f \to \infty} \left[ c_{\mathrm{soft}}(x, 1, f) - c_{\mathrm{logist}}(x, 1, f) \right] = 0    (3.65)
3.2 (Multi-class Discrimination) Assume you have to solve a classification problem with M different classes. Discuss how the number of functions used to solve this task affects the quality of the solution.
How would the loss function look if you were to use only one real-valued function f : 𝒳 → ℝ? Which symmetries are violated in this case (hint: what happens if you permute the classes)?

How many functions do you need if each of them makes a binary decision f : 𝒳 → {0, 1}?
How many functions do you need in order to make the solution permutation symmetric with respect to the class labels?
How should you assess the classification error? Is it a good idea to use the misclassification rate of one individual function as a performance criterion (hint: correlation of errors)? By how much can this error differ from the total misclassification error?
3.3 (Mean and Median) Assume 8 people want to gather for a meeting; 5 of them live in Stuttgart and 3 in Munich. Where should they meet if (a) they want the total distance traveled by all people to be minimal, (b) they want the average distance traveled per person to be minimal, or (c) they want the average squared distance to be minimal? What happens to the meeting points if one of the 3 people moves from Munich to Sydney?
3.4 (Locally Adaptive Loss Functions) Assume that the loss function c(x, y, f(x)) varies with x. What does this mean for the expected loss? Can you give a bound on the latter even if you know p(y|x) and f at every point, but know c only on a finite sample (hint: construct a counterexample)? How will things change if c cannot vary much with x?
3.5 (Transduction Error) Assume that we want to minimize the test error of misclassification R_test[f], given a training sample (x1, y1), …, (xm, ym), a test sample x'1, …, x'm, and a loss function c(x, y, f(x)). Show that any loss function c'(x', f(x')) on the test sample has to be symmetric in f, i.e., c'(x', f(x')) = c'(x', −f(x')). Prove that no non-constant convex function can satisfy this property. What does this mean for the practical solution of the optimization problem? See [267, 37, 211, 103] for details.
3.6 (Convexity and Uniqueness) Show that the problem of estimating a location parameter (a single scalar) has an interval [a, b] of equivalent global minima if the loss functions are convex. For non-convex loss functions, construct an example where this is not the case.
3.7 (Linearly Dependent Parameters) Show that in a linear model f = ∑_i α_i f_i on 𝒳 it is impossible to find a unique set of optimal parameters α_i if the functions f_i are not linearly independent. Does this have any effect on f itself?
3.8 (Ill-posed Problems) Assume you want to solve the problem Ax = y, where A is a symmetric positive definite matrix, i.e., a matrix with nonnegative eigenvalues. If you change y to y′, how much will the solution x′ of Ax′ = y′ differ from x? Give lower and upper bounds on this quantity. Hint: decompose y into the eigensystem of A.
3.9 (Fisher Map [258]) Show that the map

U_\theta(x) := I^{-1/2}\, \partial_\theta \ln p(x|\theta)    (3.66)

maps x into vectors with zero mean and unit variance. Chapter 13 will use this map to design kernels.
3.10 (Cramér-Rao Inequality for Multivariate Estimators) Prove equation (3.31). Hint: start by applying the Cauchy-Schwarz inequality to

\det \bar{E}\left[ \left( \hat\theta - \bar{E}\,\hat\theta \right) \left( T - \bar{E}\,T \right)^{\top} \right]    (3.67)

to obtain I and B, and compute the expected value coefficient-wise.
3.11 (Soft Margin Loss and Conditional Probabilities [521]) What is the conditional probability p(y|x) corresponding to the soft margin loss function c(x, y, f(x)) = max(0, 1 − y f(x))?